Understanding fseek

**127.0.0.1** · 05-20-2011

I am making a library to help parse some files. In a function I wrote called 'Next_Float', if the token found doesn't completely match a number (say the token is '123.123TEST'), it null-terminates the token in the buffer and resets the file pointer (in this case it should move back 4 so that the file pointer is at 'T'). Everything seemed to be working. However, when a float with trailing characters sits exactly at the end of a line, fseek moves the file pointer back one more than it should. Am I doing something incorrectly? If not, is there any way to get around this, or another method I could use?

fseek Call:

Code:

fseek(Input, - k + (i - 1), SEEK_CUR);

Example Input:

Code:

1123.123TESTa 2123.123TESTb
3123.123TESTc 4123.123TESTd
5123.123TESTe 6123.123TESTf 7123.123TESTg
8123.123TESTh
9123.123TESTi

Example (Debug) Output -- Notice how any token before a newline is fseek-ed too far back:

Code:

 The fseek value: -6. Token to be processed:  1123.123TESTa. Next Token:  TESTa
 The fseek value: -6. Token to be processed:  2123.123TESTb. Next Token:  3TESTb
-Next_Line-
 The fseek value: -6. Token to be processed:  3123.123TESTc. Next Token:  TESTc
 The fseek value: -6. Token to be processed:  4123.123TESTd. Next Token:  3TESTd
-Next_Line-
 The fseek value: -6. Token to be processed:  5123.123TESTe. Next Token:  TESTe
 The fseek value: -6. Token to be processed:  6123.123TESTf. Next Token:  TESTf
 The fseek value: -6. Token to be processed:  7123.123TESTg. Next Token:  3TESTg
-Next_Line-
 The fseek value: -6. Token to be processed:  8123.123TESTh. Next Token:  3TESTh
-Next_Line-
 The fseek value: -6. Token to be processed:  9123.123TESTi. Next Token:  3TESTi
-Next_Line-

**Salem** · 05-20-2011

There are restrictions on using fseek() on a text file.

Basically, you can
- seek to the beginning or end
- seek to the result of an ftell()

That is, if you want to go back to some point inside a text file, you need to make a note of what ftell() returns.
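A minimal sketch of that bookmark idea, assuming a hypothetical helper named `peek_word` (not part of the poster's library): it records `ftell()` before reading anything and later returns to that exact value with `SEEK_SET`, so no relative byte arithmetic is needed.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical helper: copies the next whitespace-delimited word into buf,
   then puts the stream back where it was before the call.
   Returns 1 if a word was found and the position was restored, else 0. */
int peek_word(FILE *in, char *buf, size_t size)
{
    long mark;
    size_t n = 0;
    int c;

    if (size == 0 || (mark = ftell(in)) == -1L)
        return 0;

    /* Skip leading whitespace, then copy up to the next whitespace or EOF. */
    while ((c = fgetc(in)) != EOF && isspace(c))
        ;
    while (c != EOF && !isspace(c)) {
        if (n + 1 < size)
            buf[n++] = (char)c;
        c = fgetc(in);
    }
    buf[n] = '\0';

    /* Go back to the bookmark instead of computing a relative offset. */
    if (fseek(in, mark, SEEK_SET) != 0)
        return 0;
    return n > 0;
}
```

Calling `peek_word` twice in a row yields the same word, because the second call starts from the restored position.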

**127.0.0.1** · 05-20-2011

The problem is all I know is the number of bytes I need to move back. I could put a for loop to use the c99 function ungetc, but that seems a bit silly instead of making one function call.

EDIT:
I found a bit of a fix, just something quick I thought of. If anyone has a better idea, feel free to suggest it:

Code:

        /* Parenthesized so Temp receives the character itself rather than the result of && */
        if((Temp = fgetc(Input)) == '\n')
          fseek(Input, - k + (i - 1), SEEK_CUR);
        else
          fseek(Input, - k + (i - 2), SEEK_CUR);

**~~CommonTater~~** · 05-20-2011

The error probably lies in your tokenizing strings... It would help if you could post the code for that as well...

Don't forget that Windows uses 2 characters for end of line \n and \r so if you are looking for EOL you need "\r\n" in your tokenizer strings.
*nix systems use only the \n character...

**Salem** · 05-20-2011

I did - but you decided to implement some non-portable hack instead.

**phantomotap** · 05-20-2011

O_o

You may just want to post your parser. Ideas abound, but it really depends on the design of your parser.

For example, if your "floating point value" element parser isn't supposed to read values that can't be interpreted as a valid floating point value, why does it read into "Test" string in the first place?

If there are some valid tokens that constitute ""floating point value" -> "string token"" but this example doesn't represent such a valid example, why aren't you memoizing the state of the file at the point the branch possibility occurs? (Literally, why aren't you calling `ftell' in your parser when you start reading characters that may necessitate a change in the type of the parsed token?)

[Edit]
This is exactly what Salem has told you to do.
[/Edit]

Alternatively, why aren't you opening the stream in binary mode and using a manually coded "whitespace" cruncher for newlines setting the state of a variable to track consecutive space? (Then using `fseek' adjusting by that known value.)

I bring these up because the newline character is very system specific so the fix you've proposed isn't really portable.
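A rough sketch of the binary-mode alternative, under the assumption that the file is opened with `fopen(path, "rb")` so every byte (including `'\r'`) is seen as-is and relative offsets are plain byte counts. The names `read_token_binary` and `seek_into_token` are made up for illustration.

```c
#include <stdio.h>

/* Whitespace is matched byte by byte, so "\r\n" and "\n" both work. */
static int is_space_byte(int c)
{
    return c == ' ' || c == '\t' || c == '\r' || c == '\n';
}

/* Hypothetical helper for a stream opened in binary mode ("rb").
   Skips whitespace, copies one token into buf, and stores in *consumed the
   number of bytes read from the token's first byte onward (the token itself
   plus the single delimiter byte, if any). Returns the stored length. */
size_t read_token_binary(FILE *in, char *buf, size_t size, long *consumed)
{
    size_t len = 0;
    int c;

    *consumed = 0;
    if (size == 0)
        return 0;
    while ((c = fgetc(in)) != EOF && is_space_byte(c))
        ;
    while (c != EOF) {
        ++*consumed;
        if (is_space_byte(c))
            break;
        if (len + 1 < size)
            buf[len++] = (char)c;
        c = fgetc(in);
    }
    buf[len] = '\0';
    return len;
}

/* Repositions the stream at byte `keep` of the token just read. */
int seek_into_token(FILE *in, long consumed, long keep)
{
    return fseek(in, keep - consumed, SEEK_CUR);
}
```

Because the count of consumed bytes is tracked explicitly, the seek distance is the same whether the token was followed by a space, a newline, or the end of the file.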

Originally Posted by 127.0.0.1

I could put a for loop to use the c99 function ungetc but that seems a bit silly instead of making one function call.

This will very likely not work. You are only guaranteed one character of buffer space.
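For contrast, a single character of lookahead stays inside that guarantee. A small sketch, with `peek_char` as a hypothetical name:

```c
#include <stdio.h>

/* Returns the next character without consuming it, or EOF.
   Only one character is ever pushed back between reads. */
int peek_char(FILE *in)
{
    int c = fgetc(in);

    if (c != EOF)
        ungetc(c, in);
    return c;
}
```

Anything that needs to back up further than one character needs a different mechanism, such as a saved position or a buffer of your own.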

Soma

**phantomotap** · 05-20-2011

Originally Posted by CommonTater

Don't forget that Windows uses 2 characters for end of line \n and \r so if you are looking for EOL you need "\r\n" in your tokenizer strings.

That actually depends on how he is opening the file and subsequently reading it.

The standard behavior guarantees translation between `\n' in source and platform native newline endings in a lot of functions.

Soma

**127.0.0.1** · 05-20-2011

Originally Posted by phantomotap

You may just want to post your parser. Ideas abound, but it really depends on the design of your parser.

Well, I have a function that grabs the next token (to skip comments). Next float uses this function to get a token to process.

Originally Posted by phantomotap

Alternatively, why aren't you opening the stream in binary mode and using a manually coded "whitespace" cruncher for newlines setting the state of a variable to track consecutive space? (Then using `fseek' adjusting by that known value.)

I am very interested to know more about this. Do you have some links to resources that deal with this kind of thing?

Originally Posted by phantomotap

For example, if your "floating point value" element parser isn't supposed to read values that can't be interpreted as a valid floating point value, why does it read into "Test" string in the first place?

Because Next_Token finds the next spaced value and if I use Next_Float when the next token is something like '123.123*', Next_Float would return FAILURE. This is the reason I thought a simple solution would be to move back the characters that are invalid.

Here is the code for Next_Float:

Code:

  int Next_Float
  (
          FILE  *Input,
    const char  *Comment_Prefix,
          char  *Buffer,
    const int   Buffer_Size,
          float *Target,
          int   *Skipped_Lines
  ){
    /* Get the next token */
    if(Target == NULL || !Next_Token(Input, 1, Comment_Prefix, Buffer, Buffer_Size, Skipped_Lines))
      return FAILURE;
      
    /* Temporary character (used during fseek); int so it can hold any fgetc result, including EOF */
    int Temp;
    
    /* Loop through the token stored in Buffer */
    for(int k = strlen(Buffer), i = 0, j = 0;i < k;i++)
    
      /* Check if the characters in the token are digits with up to
         one occurrence of a decimal character and one dash or plus character at the start */
      if(Buffer[i] < 45 || Buffer[i] > 57 || Buffer[i] == 47 || j > 1 || ((Buffer[i] == '-' || Buffer[i] == '+') && i != 0)){
      
        /* Test for the following cases (where X is any non digit) */
        if(i == 0 /* 'X' */
        ||(i == 1 && (Buffer[0] == '-' || Buffer[0] == '+' || Buffer[0] == '.')) /* '-X', '+X', '.X' */
        ||(i == 2 && ((Buffer[0] == '-' || Buffer[0] == '+') && Buffer[1] == '.'))){ /* '-.X', '+.X' */
        
          /* Move the file pointer back, the whole token is useless as number */
          if((Temp = fgetc(Input)) == '\n')
            fseek(Input, -k + 1, SEEK_CUR);
          else
            fseek(Input, -k, SEEK_CUR);
          
          /* Make Buffer null */
          Buffer[0] = '\0';
          
          /* Return FAILURE because the 0 through i tokens in Buffer are not numbers */
          return FAILURE;
        }
          
        /* Move the file pointer back because a portion of the read token is not part of the number */
        if((Temp = fgetc(Input)) == '\n')
          fseek(Input, -k + (i - 1), SEEK_CUR);
        else
          fseek(Input, -k + (i - 2), SEEK_CUR);
        
        /* Debug output */
        if(*Skipped_Lines > 0)
          printf("-Next_Line-\n");
        printf(" The fseek value: %d. Token to be processed: ", - k + (i - 1));
        printf(" %s. Next Token: ", Buffer);

        /* Terminate the string at the end of the number */
        Buffer[i] = '\0';
        
        /* Exit the loop (we are done testing the token in the buffer) */
        break;
      }else if(Buffer[i] == '.')
        j++;
    
    /* Convert the valid float token and return */
    *Target = atof(Buffer);
    return SUCCESS;
  }

Here is the code for Next_Token:

Code:

  int Next_Token
  (
          FILE *Input,
    const int  Number_Of_Tokens,
    const char *Comment_Prefix,
          char *Buffer,
    const int  Buffer_Size,
          int  *Skipped_Lines
  ){
    /* Temporary value for a possible token character (int so that EOF can be told apart from a valid character) */
    int Possible_Token;
    
    /* Pointer used by strstr to point at a possible comment */
    char *Comment_Start;
    
    /* Preprocessed length of Comment_Prefix */
    int Comment_Length = strlen(Comment_Prefix);
    
    /* Index of the next free position in Buffer */
    int i = 0;
    
    /* Set Skipped_Lines to 0 */
    if(Skipped_Lines != NULL)
      *Skipped_Lines = 0;
      
    for(int j = Number_Of_Tokens;j > 0;j--){
    
      /* Skip whitespace */
      do{
        if((Possible_Token = fgetc(Input)) == EOF)
          return FAILURE;
        
        /* Increment Skipped_Lines if a new line character is found */
        if(Skipped_Lines != NULL && (Possible_Token == 10 || Possible_Token == '\n'))
          *Skipped_Lines = *Skipped_Lines + 1;
      }while(Possible_Token < 33);
      
      /* Add characters to token until end of file or a white space */
      do{
        
        /* Assign the current possible and terminate the string */
        if(i < Buffer_Size)
          Buffer[i++] = Possible_Token;
        else{
        
          /* If the buffer is not large enough, skip to next token and return failure */
          SKIP_TOKEN(Input);
          return FAILURE;
        }
        
        /* Test for a comment */
        if(i >= Comment_Length && i < Buffer_Size && strncmp(&Buffer[i - Comment_Length], Comment_Prefix, Comment_Length) == 0){
        
          /* Skip the line and increment Skipped_Lines if it is not null */
          SKIP_TO_END_OF_LINE(Input);
          
          /* Set index of the pointer to i */
          i -= Comment_Length;
          
          /* If no token was found before the comment, correct j */
          if(i < 1 || Buffer[i - 1] == ' ')
            j++;
            
          /* Comment means the end of a token, so exit the loop */
          break;
        }
        
        /* Get the next character. If it is the end of file and the token goal met, return success */
        if((Possible_Token = fgetc(Input)) == EOF && j == 1)
          return SUCCESS;
      }while(Possible_Token > 32);
      
      /* Move the file pointer back in case the last character was a new line */
      if(Possible_Token < 32)
        ungetc(Possible_Token, Input);
      
      /* If no words were added to Buffer (meaning a comment was found) and the number of tokens to find is greater than one, add a space */
      if(j != 1 && i > 0 && i < Buffer_Size && Buffer[i - 1] != ' ')
        Buffer[i++] = ' ';
    }
    
    /* Terminate the string and return */
    Buffer[i] = '\0';
    return SUCCESS;
  }

EDIT:
Here are the macros I used:

Code:

  #define SKIP_LINE(in)           fscanf(in, "%*[^\n]%*c")
  #define SKIP_TOKEN(in)          fscanf(in, "%*[^\n\t ]")
  #define SKIP_TO_END_OF_LINE(in) fscanf(in, "%*[^\n]")

**phantomotap** · 05-20-2011

O_o

Yea, you've gone about parsing this file the wrong way.

Don't consume input until you have processed everything that is necessary to yield a valid token. Instead, parse only what is a valid token.

Consider, if you have a function named `IsValidFloatingPointCharacter'. Within the element parser that consumes floating point values, you call this function storing the characters in a temporary buffer for later conversion until `IsValidFloatingPointCharacter' returns `false'. This is much the same as what you've done only with more functional decomposition.
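A hedged sketch of that idea, with `is_float_char` and `read_float` as hypothetical stand-ins for `IsValidFloatingPointCharacter' and its caller. It assumes a simple character set (digits, sign, decimal point, no exponents) and uses `strtod()` for the final conversion, which is a choice made here rather than something from the posted code.

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumed character set: digits, sign and decimal point (no exponents). */
int is_float_char(int c)
{
    return isdigit(c) || c == '.' || c == '-' || c == '+';
}

/* Reads the longest run of float characters, pushes back the first
   character that does not belong, and converts the run with strtod().
   Returns 1 and stores the value on success, 0 otherwise. */
int read_float(FILE *in, double *out)
{
    char buf[64];
    char *end;
    size_t n = 0;
    double value;
    int c;

    while ((c = fgetc(in)) != EOF && is_float_char(c)) {
        if (n + 1 < sizeof buf)
            buf[n++] = (char)c;
    }
    if (c != EOF)
        ungetc(c, in);          /* one pushback, within ungetc's guarantee */
    buf[n] = '\0';

    /* Reject empty runs and runs such as "1.2.3" or "+-". */
    value = strtod(buf, &end);
    if (n == 0 || *end != '\0')
        return 0;
    *out = value;
    return 1;
}
```

With the input '123.123TEST', the loop stops at 'T' and pushes only that one character back, so no seek is required. Note that a rejected run such as "1.2.3" has already been consumed; recovering from that case needs a saved position.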

Now consider the `GetToken' function as you've designed it. It consumes whitespace and tokens that constitute comments while producing a string that may represent any other token type. Instead, let us assume you've created both of the functions `IsSpacingCharacter' and `IsCommentCharacter'. You then create the function `TrimInvalidTokens'. This function composes `IsSpacingCharacter' into a block to consume whitespace until a non-whitespace character is found. The same function then composes `IsCommentCharacter' into a block that consumes input until the end of a comment (whatever that is) is found.
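One possible shape for such a trimming function, sketched as `trim_invalid` (a hypothetical name). To keep it within a single pushed-back character it assumes a comment starts with one marker character and runs to the end of the line; a multi-character prefix such as "//" would need extra lookahead.

```c
#include <ctype.h>
#include <stdio.h>

/* Assumption: a comment starts with one marker character (for example '#')
   and runs to the end of the line.
   Consumes whitespace and comments, leaving the stream at the first
   character that could begin a real token (or at EOF). */
void trim_invalid(FILE *in, int comment_marker)
{
    int c;

    for (;;) {
        c = fgetc(in);
        if (c == EOF)
            return;
        if (isspace(c))
            continue;                       /* spaces, tabs, newlines */
        if (c == comment_marker) {
            while ((c = fgetc(in)) != EOF && c != '\n')
                ;                           /* discard the rest of the line */
            if (c == EOF)
                return;
            continue;
        }
        ungetc(c, in);                      /* first character of a token */
        return;
    }
}
```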

With these primitives in place, you invoke `TrimInvalidTokens' at the top of every elemental processor. So within `ProcessFloatingPointToken' you call `TrimInvalidTokens'. You know, by definition, that the character at the cursor may begin a valid floating point value, but you do not know that it does until you attempt to process it. So, you save the current state of the parser (including the input position) and attempt to process the floating point value. If the floating point value is processed, you continue parsing with whatever token may be next by following the same process you used for this floating point value. If the floating point value is not processed, you know that you do not have a floating point value, and you must either try to process an alternative token that may be valid at that point in the token stream or issue an error and terminate.
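The save-then-attempt step could look roughly like this. `try_float` is a hypothetical name, and using `fscanf` with `%lf` as the attempt is an assumption made for brevity (it also accepts exponents, "inf" and "nan", which the posted parser does not). The position is saved with `fgetpos` and restored with `fsetpos`, the standard counterparts of `ftell`/`fseek`.

```c
#include <stdio.h>

/* Attempts to read a floating point value at the current position.
   On failure the stream is returned to where it was before the attempt,
   so the caller can try a different token type.
   Returns 1 on success, 0 on failure. */
int try_float(FILE *in, double *out)
{
    fpos_t mark;

    if (fgetpos(in, &mark) != 0)
        return 0;
    if (fscanf(in, "%lf", out) == 1)
        return 1;

    /* The failed attempt may have consumed characters (a lone sign,
       for instance); undo all of it in one step. */
    fsetpos(in, &mark);
    return 0;
}
```

A driver can then chain attempts: if `try_float` fails, the stream is untouched and the next candidate parser starts from the same position.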

Because of this decomposed elemental process, you only need to backtrack in chunks. Because you have memoized the state, you know precisely how far to backtrack. Because you have developed a parser by inclusion and exclusion of nested tokens and token types, you always know that you may have a valid token of the relevant type at the current input position, and you know how to attempt to process any given token type or fail and backtrack. You never need to process a nested partial token or a composed token of invalid composition. Of course, the fun part comes in building a driver, but that is a chain of conditional composition, not necessarily simple nesting, of these processing primitives, built by just checking the success or failure of the given primitive.

Soma

**127.0.0.1** · 05-20-2011

I should divide up the Next_Token function and have a function that just skips to the beginning of the next valid token (call it say Seek_Token) and make Next_String (renamed from Next_Token) use Seek_Token and read from input until it has reached the end of a valid string (doing a similar thing for Next_Float).

What if I have an input file ' TEST//This is a test string' and Seek_Token brings me to 'TEST//...'? If I continue reading a string, I will get to 'TEST//' before I realize I've reached a comment. I still have to tell Seek_Token to start by reading a comment next time. Should I create another function, Seek_Token_Comment, or should I use fseek to move back the file pointer?

**phantomotap** · 05-20-2011

O_o

Read my post again. In fact, read it several more times.

Considering the example you've proposed, which part is the comment: "TEST//" or "//This is a test string"? If the second, and "//" always marks the beginning of a comment, your `IsValid$(TokenType)Character' primitives will, by definition, stop parsing when they find that sequence, so the elemental parsers can safely do their job while ignoring whatever comments come after they complete it. (You should also remember that each elemental parser, by definition, calls the `TrimInvalidTokens' function first in order to find the first possibly valid token character.) If the first, and "//" marks the end of a comment, possibly beginning at the start of a line, the elemental primitives' defined order of operation of always calling `TrimInvalidTokens' first also takes care of that case, because it will consume the comment.
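For the case where "//" begins a comment, a string reader can stop at the comment without any seeking. The sketch below uses a hypothetical `read_word` and makes one simplifying choice that is not from the thread: when it sees "//" it discards the rest of the line itself, since the next trimming pass would skip it anyway.

```c
#include <ctype.h>
#include <stdio.h>

/* Reads one word into buf, ending at whitespace, EOF, or a "//" comment.
   A comment found while reading is consumed through the end of its line.
   Returns 1 if at least one character was stored, 0 otherwise. */
int read_word(FILE *in, char *buf, size_t size)
{
    size_t n = 0;
    int c;

    if (size == 0)
        return 0;
    c = fgetc(in);
    while (c != EOF && !isspace(c)) {
        if (c == '/') {
            int next = fgetc(in);
            if (next == '/') {              /* comment: discard rest of line */
                while ((c = fgetc(in)) != EOF && c != '\n')
                    ;
                break;
            }
            if (n + 1 < size)
                buf[n++] = '/';             /* a lone slash is ordinary text */
            c = next;
            continue;
        }
        if (n + 1 < size)
            buf[n++] = (char)c;
        c = fgetc(in);
    }
    buf[n] = '\0';
    return n > 0;
}
```

For the input 'TEST//This is a test string', the word returned is 'TEST' and the stream is left at the start of the following line.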

Lastly, the third option: "//" marks the end of a comment sequence depending on the state of the parser, in an infix grammar with split portion tokens being valid. In that case you will (in addition to or instead of always calling `TrimInvalidTokens' first, depending on the grammar) provide a `ProcessCommentSequenceToken' similar to any other elemental parser, with the primary difference being that this function only consumes input without producing a token. This new `ProcessCommentSequenceToken' will then be called by the same driver that calls the others, in any position where a comment may appear. This kind of grammar is exceedingly rare. The above situation almost certainly represents the grammar you intend.

Soma

**quzah** · 05-20-2011

Originally Posted by 127.0.0.1

The problem is all I know is the number of bytes I need to move back. I could put a for loop to use the c99 function ungetc, but that seems a bit silly instead of making one function call.

The standard only guarantees 1 ungetc call to work.

Quzah.

**~~CommonTater~~** · 05-20-2011

A question... how big are these files?

I ask because it might save you a ton of trouble if you were to load the whole thing into memory and work on it from there with simple pointers.

Finding explicit text is easy with strstr() ... finding line ends etc. no longer matters... there's no seek, no rewind and no ftell to mess with... it's just a whole lot simpler.
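A sketch of what the in-memory approach might look like, with `slurp` and `next_float_mem` as hypothetical names. It leans on `strtod()`, whose end pointer reports exactly where the number stopped; be aware that `strtod()` also accepts exponents, hex floats, "inf" and "nan", which may be more permissive than the posted parser intends.

```c
#include <stdio.h>
#include <stdlib.h>

/* Reads the rest of an open stream into a NUL-terminated heap buffer.
   Returns NULL if memory runs out; the caller frees the result. */
char *slurp(FILE *in)
{
    size_t cap = 4096, len = 0, got;
    char *data = malloc(cap);

    if (data == NULL)
        return NULL;
    while ((got = fread(data + len, 1, cap - len - 1, in)) > 0) {
        len += got;
        if (cap - len - 1 == 0) {           /* buffer full: double it */
            char *bigger = realloc(data, cap * 2);
            if (bigger == NULL) {
                free(data);
                return NULL;
            }
            data = bigger;
            cap *= 2;
        }
    }
    data[len] = '\0';
    return data;
}

/* Parses a float at *cursor (skipping leading whitespace) and advances
   *cursor to the first character after it. Returns 1 on success, 0 if
   no number starts there, in which case *cursor is left unchanged. */
int next_float_mem(const char **cursor, double *out)
{
    char *end;
    double value = strtod(*cursor, &end);

    if (end == *cursor)
        return 0;
    *out = value;
    *cursor = end;
    return 1;
}
```

On '1123.123TESTa', the cursor ends up pointing at 'TESTa', and "moving back" is just a matter of not advancing the pointer.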

**127.0.0.1** · 05-21-2011

Thanks for the advice everyone, I took everything into account and redesigned the library (keep in mind these are just the prototypes, I haven't finished implementing them).

Code:

#ifndef PARSING_H
#define PARSING_H
   
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include "../Settings/Preprocessors.h"
  
  #define IS_CHARACTER(c)  ((c) > 31)
  #define IS_ESCAPE(c)     ((c) < 32)
  #define IS_WHITESPACE(c) ((c) == ' ' || (c) == '\t')
  
  typedef struct{
    FILE *Input;
    char *Comment_Prefix;
    char *Buffer;
    int  Buffer_Size;
    int  Buffer_Index;
    int  Skipped_Lines;
  }Struct_Parser;
  
  Struct_Parser *Initialize_Parser(
    char *Name,
    char *Comment_Prefix,
    int  Buffer_Size
  );
  
  int Finalize_Parser(
    Struct_Parser *Parser
  );
    
  inline int Fetch_Clear_Line( /*Takes the next line from input and removes any comments */
    Struct_Parser *Parser
  );
 
  inline void Skip_Whitespace(
    Struct_Parser *Parser
  );
  
  int Next_Float(
    Struct_Parser *Parser,
    float         *Target
  );
  
  int Next_String(
          Struct_Parser *Parser,
          char          *Target,
    const int           Target_Size
  );
  
  int Next_Set(
          Struct_Parser *Parser,
          char          *Target,
    const int           Target_Size,
    const char          *Delimiter_Left,
    const char          *Delimiter_Right
  );
  
  int Is_Next_Match(
          Struct_Parser *Parser,
    const char          Reference,
    const int           Do_Ignore_Spaces
  );
  
  void Ignore_Next(
          Struct_Parser *Parser,
    const char          Reference,
    const int           Do_Ignore_Spaces
  );
  
#endif /* PARSING_H */
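Since the implementations were not posted, the following is only a guess at what a line-based helper in the spirit of `Fetch_Clear_Line` might do: read one line with `fgets()` and cut it at the comment prefix with `strstr()`. The lowercase name `fetch_clear_line` and its parameters are hypothetical, and the sketch assumes every line fits in the caller's buffer.

```c
#include <stdio.h>
#include <string.h>

/* Reads the next line into buf, strips the line terminator, and removes
   everything from comment_prefix onward.
   Assumption: no line is longer than size - 1 characters.
   Returns 1 if a line was read, 0 at end of file or on error. */
int fetch_clear_line(FILE *in, const char *comment_prefix, char *buf, int size)
{
    char *cut;

    if (fgets(buf, size, in) == NULL)
        return 0;
    buf[strcspn(buf, "\r\n")] = '\0';       /* drop "\n" or "\r\n" */
    if (*comment_prefix != '\0' && (cut = strstr(buf, comment_prefix)) != NULL)
        *cut = '\0';                        /* truncate at the comment */
    return 1;
}
```

Once a line is in the buffer, the remaining functions can parse it with an index or pointer, and backing up never touches the file position.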

Originally Posted by quzah

The standard only guarantees 1 ungetc call to work.

Well, I am not going to use ungetc or fseek in my redesigned library, but I'll keep that in mind for future reference.

Originally Posted by CommonTater

A question... how big are these files?
I ask because it might save you a ton of trouble if you were to load the whole thing into memory and work on it from there with simple pointers.

The files can be quite large, so I think this would be a poor choice.

Originally Posted by phantomotap

Read my post again. In fact, read it several more times.

Well, I am still having trouble understanding a lot of the language and pseudocode you used (probably just due to my lack of experience), but I think the new parser takes into account at least some of your criticisms.

**~~CommonTater~~** · 05-21-2011

Originally Posted by 127.0.0.1

The files can be quite large, so I think this would be a poor choice.

Not to be argumentative, but...
Define "quite large"... especially in a machine with 8 or 16 GB of memory...

I commonly load 10 and 15 megabyte files into memory in one pop without problems.
