Originally Posted by whiteflags
If I adopted this approach I would worry about becoming complacent with working code, and using it all the time
I'm personally way too paranoid for that
Originally Posted by phantomotap
Yes, in C, there are certainly cases where `goto' is the better solution because it isolates otherwise duplicated code without wonky cleanup functions having several references to local functions. (Don't quote me on it though; if the situation comes up where a newbie is using `goto' poorly I will simply tell them "You should not be using `goto' as it is evil." rather than take the chance on explaining rare circumstances when it can be used properly. The newbie set already writes bad code on average and `goto' will only make that worse.)
Hm. I fully concur.
And, you have an even stronger point than I do. Most of the readers of this thread are newbies, especially if they're having difficulty parsing CSV in C. (I don't mean any of the posters are newbies. I mean that the thread title is such that this thread is likely to be read by many newbies later on.)
Therefore, the suggestions made here should be directed more towards newbies.
After taking a step back, and re-reading this thread, I'm convinced the non-goto version is better in this case. In particular, there will be less risk of new programmers misunderstanding and learning an unintended bad habit. Therefore, please allow me to replace my suggestion with this version of the csv_field() function:
Code:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>

/* RFC4180-compatible CSV field reader. Does not consume the separator.
 * Hash function is DJB2 XOR-variant:
 *     hash(0) = 5318
 *     hash(i) = (hash(i-1) * 33U) ^ character(i)
 * If you are not interested in the hash, just supply a NULL pointer.
 *
 * Returns the length of the field read.
 * If the function returns zero, also check errno for errors. (0 is OK; empty field.)
 */
size_t csv_field(char **const dataptr, size_t *const sizeptr, unsigned int *const hashptr, FILE *const input)
{
    char *data, *temp;
    size_t size;
    size_t used = 0;
    unsigned int hash = 5318U;
    int quoted = 0;
    int c;

    /* Invalid parameters? */
    if (!dataptr || !sizeptr || !input) {
        errno = EINVAL;
        return 0;
    }

    /* Initialize field content buffer. Same logic as POSIX.1-2008 getline(). */
    if (*dataptr) {
        data = *dataptr;
        size = *sizeptr;
    } else {
        data = NULL;
        size = 0;
    }

    c = getc(input);

    /* Skip leading whitespace. This is not strictly RFC4180-compliant,
     * but it allows the use of both \n and \r\n newline convention.
     * Quoted values will retain their leading whitespace, of course. */
    while (c == '\t' || c == '\v' || c == '\f' || c == '\r' || c == ' ')
        c = getc(input);

    /* Non-empty field? */
    if (c != EOF && c != '\n' && c != ',') {

        /* Is the field quoted? */
        if (c == '"') {
            quoted = 1;
            c = getc(input);
        }

        while (c != EOF) {

            /* If the field is not quoted, newline or comma ends the field. */
            if (!quoted && (c == '\n' || c == ','))
                break;

            if (quoted && c == '"') {
                /* " in a quoted value is special. */
                c = getc(input);

                /* Did the " end the quoted field? */
                if (c == EOF || c == '\n' || c == ',')
                    break;

                /* It really should be ", then. */
                if (c != '"') {
                    /* Un-escaped " within field text; this is really an error.
                     * However, we're robust, and treat as if it was escaped.
                     */
                    ungetc(c, input);
                    c = '"';
                }
            }

            /* Enough room for the new character? */
            if (used >= size) {
                if (used < 4096)
                    size = 4096; /* Minimum 4096 */
                else if (used < 1048576)
                    size = (used * 5) / 4; /* Add 25%, up to one megabyte */
                else
                    size = (used | 131071) + 130944; /* Pad to next (128k-128). */

                temp = realloc(data, size);
                if (!temp) {
                    errno = ENOMEM;
                    return 0;
                }
                data = temp;
                *dataptr = temp;
                *sizeptr = size;
            }

            hash = (33U * hash) ^ (unsigned int)c;
            data[used++] = c;

            c = getc(input);
        }
    }

    /* Do not consume the delimiter, if there was a delimiter. */
    if (c != EOF)
        ungetc(c, input);

    /* Enough room for the end-of-string mark? */
    if (used >= size) {
        size = (used | 7) + 1; /* Next multiple of 8. */

        /* realloc(), not malloc(): the field contents read so far must be kept. */
        temp = realloc(data, size);
        if (!temp) {
            errno = ENOMEM;
            return 0;
        }
        data = temp;
        *dataptr = temp;
        *sizeptr = size;
    }

    /* Terminate field value, */
    data[used] = '\0';

    /* save hash, if asked, */
    if (hashptr)
        *hashptr = hash;

    /* and return the length of the field. */
    errno = 0;
    return used;
}
The only difference to the previous version is that the goto has been replaced with an if clause.
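Since csv_field() does not consume the separator, the caller has to read it before asking for the next field. Here is a minimal sketch of one way to do that. The helper name csv_separator() and its return values are my own assumptions for illustration; they are not part of the function above.

```c
#include <stdio.h>

/* Hypothetical helper (not part of csv_field() itself).
 * Consumes the separator that csv_field() left in the stream.
 *
 * Returns  1 if a comma was read (another field follows in this record),
 *          0 if a newline was read (the record has ended),
 *         -1 at end of input, on a read error, or on any other character.
 *
 * Typical use, assuming csv_field() as posted is compiled in:
 *
 *     do {
 *         len = csv_field(&data, &size, NULL, input);
 *         if (!len && errno)
 *             break;                      // real error, not an empty field
 *         ... use data[0 .. len-1] ...
 *     } while (csv_separator(input) == 1);
 */
static int csv_separator(FILE *const input)
{
    int c = getc(input);

    if (c == ',')
        return 1;

    if (c == '\n')
        return 0;

    /* csv_field() as posted stops only at a comma, a newline, or end of
     * input, so any other character here is unexpected.
     * Push it back so that nothing is silently lost. */
    if (c != EOF)
        ungetc(c, input);

    return -1;
}
```

A return value of 0 means the caller can start the next record. A return value of -1 means there is nothing more to read, so check feof() and ferror() to tell a clean end of input from a read error.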