Files that do not end with a newline?
Earlier I posted a program as a solution to a K&R exercise; it reads a file/text stream and prints a histogram of how often each word length occurs. I think I've worked out nearly all the bugs, and this is what I now have:
Code:
/* Corresponding K&R section: 1.6 */
/* Prints vertical histogram of lengths of words in input */
#include <stdio.h>

/* NOTE: given enough horizontal space, can scale to any two-digit maximum by
   modifying this alone... gets messy with three digits, but I could
   make the numbers on the x-axis display vertically down to make it infinitely
   scalable. */
#define MIN_WORD_LENGTH 1
#define MAX_WORD_LENGTH 20

int main(void)
{
    //Holds characters
    int c = 0;
    int lastchar = 0;
    //current number of contiguous non-whitespace chars
    int currentLength = 0;
    //holds number of occurrences of each length
    int wordLengthFrequencies[MAX_WORD_LENGTH] = {0};
    //highest number of occurrences encountered
    int maxFrequency = 0;

    /* ---------------- Collect length data --------------- */
    while((c = getchar()) != EOF)
    {
        //Are we currently inside a word?
        if(currentLength >= MIN_WORD_LENGTH)
        {
            //Are we encountering whitespace?
            if(c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\v')
            {
                /* We've reached the end of a word. Update array */
                wordLengthFrequencies[currentLength - 1]++;
                currentLength = 0; //Reset
            }
            /* No whitespace, so we're still inside a
               word. Update currentLength if it hasn't
               maxed out */
            else if(currentLength < MAX_WORD_LENGTH)
                currentLength++;
        }
        /* Have we just encountered the start of a word? */
        else if(c != ' ' && c != '\t' && c != '\n' && c != '\r' && c != '\v')
            currentLength = 1;
        lastchar = c;
    }

    /* Handle case where file does not end on newline */
    if(lastchar != '\n' && currentLength != 0)
        wordLengthFrequencies[currentLength - 1]++;

    /* --------------- Print Histogram ------------- */
    printf("\n\n");
    int i = 0;

    /* Determine maximum frequency */
    for(i = MIN_WORD_LENGTH; i <= MAX_WORD_LENGTH; i++)
    {
        if(wordLengthFrequencies[i - 1] > maxFrequency)
            maxFrequency = wordLengthFrequencies[i - 1];
    }

    /* Start printing the graph starting from the maximum frequency */
    for(c = maxFrequency; c >= 1; c--)
    {
        /* Make sure graph will still be aligned even for 7-digit frequencies
           i.e. if we are operating on a large file */
        printf("%7d | ", c);
        for(i = 0; i < MAX_WORD_LENGTH; i++)
        {
            /* Each cell is 3 characters wide to match the "%3d" labels below */
            if(wordLengthFrequencies[i] >= c)
                printf("*  "); //fill in where appropriate
            else
                printf("   ");
        }
        printf("\n");
    }

    /* print the x-axis and legend */
    putchar('\t');
    for(i = MIN_WORD_LENGTH; i <= MAX_WORD_LENGTH; i++)
        printf("---");
    printf("\n\t");
    for(i = MIN_WORD_LENGTH; i <= MAX_WORD_LENGTH; i++)
        printf("%3d", i);
    putchar('+'); //last bucket also holds words longer than MAX_WORD_LENGTH
    printf("\n\nx-axis: word length\ny-axis: # of occurrences\n\n");
    return 0;
}
I noticed that my program failed to count the last word in the file if the file did not end on a newline and had no trailing whitespace after the last word. So I added this particular snippet to take care of that:
Code:
/* Handle case where file does not end on newline */
if(lastchar != '\n' && currentLength != 0)
    wordLengthFrequencies[currentLength - 1]++;
By my logic, if the last character read before EOF was not '\n', the file did not end on a newline. But that alone does not mean I should go ahead and increment the array element corresponding to whatever is left in currentLength. It may very well be 0, because trailing whitespace after the last word resets currentLength when it is processed. What's worse, I would then be incrementing index -1, which is out of bounds. Whatever the trailing whitespace is, currentLength will be 0 in that case, so I just make sure it isn't. With that in place, I'm fairly certain my program is solid.
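One way to check this reasoning rather than trust it is to brute-force it. The sketch below replays the same state machine over every short string built from a small alphabet and compares the fix-up condition with and without the lastchar test. The names is_ws, eof_fixup_fires, and count_disagreements are made up for illustration and are not part of the program above; the alphabet and the length limit are arbitrary assumptions too.

```c
#include <assert.h>

#define MAX_WORD_LENGTH 20

/* Same whitespace set as the program in the post. */
static int is_ws(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\v';
}

/* Hypothetical helper: replays the counting loop over a string and reports
   whether the end-of-input fix-up would fire. use_lastchar selects between
   the condition from the post (1) and the currentLength test alone (0). */
int eof_fixup_fires(const char *text, int use_lastchar)
{
    int currentLength = 0;
    int lastchar = 0;

    for (; *text != '\0'; text++) {
        int c = (unsigned char)*text;
        if (currentLength >= 1) {
            if (is_ws(c))
                currentLength = 0;
            else if (currentLength < MAX_WORD_LENGTH)
                currentLength++;
        } else if (!is_ws(c)) {
            currentLength = 1;
        }
        lastchar = c;
    }

    if (use_lastchar)
        return lastchar != '\n' && currentLength != 0;
    return currentLength != 0;
}

/* Hypothetical helper: enumerates every string of length 0..max_len over a
   four-character alphabet and counts inputs where the two conditions differ. */
int count_disagreements(int max_len)
{
    static const char alphabet[] = "a \n\t";
    const int n = (int)(sizeof alphabet - 1);
    char buf[16];
    int mismatches = 0;

    assert(max_len >= 0 && max_len < (int)sizeof buf);

    for (int len = 0; len <= max_len; len++) {
        long total = 1;
        for (int k = 0; k < len; k++)
            total *= n;
        for (long code = 0; code < total; code++) {
            long x = code;
            for (int k = 0; k < len; k++) {   /* decode code in base n */
                buf[k] = alphabet[x % n];
                x /= n;
            }
            buf[len] = '\0';
            if (eof_fixup_fires(buf, 1) != eof_fixup_fires(buf, 0))
                mismatches++;
        }
    }
    return mismatches;
}
```

If the reasoning holds, count_disagreements returns 0 for any length it is given. That would also suggest the lastchar test is harmless but not strictly needed: a '\n' as the last character always leaves currentLength at 0, so checking currentLength alone appears to cover every case.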
I might be being anal about this for an exercise out of a programming book, but I guess I see little point in trying to learn C with exercises if I don't make damned sure my solutions are airtight, given how easy it is to proverbially "shoot myself in the foot." I was wondering if anyone else could see any flaws in my logic above or if there's something I overlooked.
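For making a solution like this checkable, one option is to pull the counting loop into a function that takes a string instead of reading stdin, so the awkward endings (no newline, trailing blanks, empty input) can each be tried directly. This is only a sketch under assumptions: count_word_lengths is a hypothetical name, and it uses isspace from <ctype.h>, which also treats '\f' as whitespace, unlike the explicit list in the program above.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define MAX_WORD_LENGTH 20

/* Hypothetical helper (not in the original program): counts word lengths in
   a NUL-terminated string. freq must have MAX_WORD_LENGTH elements; words
   longer than MAX_WORD_LENGTH are counted in the last bucket. */
void count_word_lengths(const char *text, int freq[])
{
    int currentLength = 0;
    size_t i;

    assert(text != NULL && freq != NULL);

    for (i = 0; text[i] != '\0'; i++) {
        if (isspace((unsigned char)text[i])) {
            if (currentLength > 0)          /* a word just ended */
                freq[currentLength - 1]++;
            currentLength = 0;
        } else if (currentLength < MAX_WORD_LENGTH) {
            currentLength++;                /* still inside a word */
        }
    }

    /* End of input: count a final word that had no whitespace after it.
       The guard keeps the index from ever being -1. */
    if (currentLength > 0)
        freq[currentLength - 1]++;
}
```

Calling it with "one two", "one two\n", and "one two  \t" should put both words in the length-3 bucket each time, which is the behavior the end-of-file fix-up is meant to guarantee.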