Thread: fscanf() issue...

  1. #1
    Registered User
    Join Date
    Oct 2007
    Posts
    100

    fscanf() issue...

    [I think the issue is more connected with fseek than fscanf..]

    Hi!
    Could anybody please explain to me why this simple program, which prints the content of a file in reverse order, doesn't work properly with characters like 'è' or 'à', etc.?

    Code:
    #include <stdio.h>
    
    int main ()
    {
    	FILE *fPtr=fopen ("reverse1.txt","rb");
    	if (fPtr==NULL)
    	{
    		perror("reverse1.txt");
    		return 1;
    	}
    
    	/* first pass: count the bytes in the file */
    	int ch,numBytes=0;
    	while ((ch=fgetc(fPtr))!=EOF)
    		numBytes++;
    	int i;
    	char c;
    	printf("\n****************\n");
    	/* second pass: read one byte at a time, from the last to the first */
    	for (i=0;i<numBytes;i++)
    	{
    		fseek (fPtr,-(long)(i+1),SEEK_END);
    		fscanf(fPtr,"%c",&c);//fread (&c,sizeof(char),1,fPtr);
    		printf("%c",c);
    	}
    	printf("\n****************\n");
    	fclose (fPtr);
    	return 0;
    }
    Both fread and fscanf give the same problem: special characters are output like this: ��
    Aren't they 1 byte long, like normal chars?

    thanks!
    Last edited by smoking81; 09-08-2008 at 03:09 AM.

  2. #2
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    There are cases where a single character consists of multiple bytes. Why not print the values you actually read as numbers (decimal or hex) and compare them with the actual characters? You would then probably see more than one byte value for a single character [do this in the forward direction], in which case a reverse function like yours would not work at all.

    I'm sure it doesn't matter at all whether you use fgetc(), fscanf() or fread() in this case: they all read one char (one byte) on its own. In a multi-byte character set, the byte sequence is interpreted as a single character only when it is displayed, not when it is read from or written to a file.
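
    A minimal sketch of the check suggested above: it formats raw bytes as hex so they can be compared with the characters shown on screen. The helper name bytes_to_hex and its parameters are made up for illustration.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes each byte of buf as two hex digits, separated
   by spaces, into out. Stops early if out is too small. Returns the number
   of characters written, not counting the terminating '\0'. */
size_t bytes_to_hex(const unsigned char *buf, size_t n, char *out, size_t outsz)
{
    size_t used = 0;

    if (outsz == 0)
        return 0;
    out[0] = '\0';

    /* each byte needs at most 3 characters plus the terminator */
    for (size_t i = 0; i < n && used + 4 <= outsz; i++) {
        used += (size_t)snprintf(out + used, outsz - used,
                                 i ? " %02X" : "%02X", buf[i]);
    }
    return used;
}
```

    For 'è' stored as UTF-8 this gives "C3 A8" (two bytes), while the same letter stored as ISO-8859-1 gives "E8" (one byte).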

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #3
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    Only in extended ASCII (ie unsigned), whereas you're reading them in as signed (and printing as signed).

    [edit]
    I was talking about something else... but it's still related.
    [/edit]

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by zacs7 View Post
    Only in extended ASCII (ie unsigned), whereas you're reading them in as signed (and printing as signed).
    Shouldn't make a difference: signed or unsigned, the value has the same bit pattern as a char, and that bit pattern is what printf (or putc) will output.

    --
    Mats

  5. #5
    Registered User
    Join Date
    Oct 2007
    Posts
    100
    Quote Originally Posted by matsp View Post
    - in which case your reverse function like that would not work at all.
    Then what could be an alternative solution? It sounds a bit complicated, doesn't it?

    By the way, I'll use this post to ask another question related to dealing with files.
    The other day I solved an exercise that required leaving only 1 blank line wherever more than 1 consecutive empty lines were found.
    I solved it straight away like this:

    Code:
    ........
    int ch;
    while ((ch=fgetc(fPtr))!=EOF)
    	{
    		if (ch!='\n') //IT's A CHAR-->copy it
    		{
    			fprintf(f2Ptr,"%c",ch);
    			prevIsChar=1;
    			prevIsEmptyLine=0;
    		}
    		else
    		{
    			if (prevIsChar)
    			{
    				fprintf(f2Ptr,"%c",ch);
    				prevIsChar=0;
    				prevIsEmptyLine=0;
    			}
    			else
    			{
    				if (!prevIsEmptyLine)
    				{
    					fprintf(f2Ptr,"%c",ch);
    					prevIsChar=0;
    					prevIsEmptyLine=1;
    				}
    			}
    		}
    	}
    ...
    In Linux this works fine. Now, after reading the text of the exercise again, I see it says that in UNIX the <CR> corresponds to 10, which I hadn't considered.
    To make it work in Unix, should I look for sequences of
    Code:
    <CR><LF> --> 10 '\n'
    ?

    again, thanks for your kind help!
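
    For reference, in ASCII <LF> is 10 and <CR> is 13. Unix text files end each line with <LF> alone ('\n' in C), while DOS/Windows files use <CR><LF>. A minimal sketch of the same blank-line squeezing that also tolerates <CR><LF> input is shown here; it works on an in-memory string to stay self-contained, and the name squeeze_blank_lines and the choice to drop the <CR> are assumptions, not part of the exercise.

```c
#include <stddef.h>

/* Hypothetical helper: copies src to dst, squeezing every run of blank
   lines down to a single blank line. A '\r' directly before '\n' is
   dropped, so <CR><LF> input is handled like <LF> input.
   dst must have room for strlen(src) + 1 bytes.
   Returns the number of bytes written, not counting the '\0'. */
size_t squeeze_blank_lines(const char *src, char *dst)
{
    size_t out = 0;
    int run = 1;   /* consecutive '\n' seen; 1 means "at the start of a line" */

    for (size_t i = 0; src[i] != '\0'; i++) {
        char ch = src[i];

        if (ch == '\r' && src[i + 1] == '\n')
            continue;               /* drop the CR of a CR LF pair */

        if (ch == '\n') {
            if (++run <= 2)         /* the line end, plus at most one blank line */
                dst[out++] = ch;
        } else {
            run = 0;
            dst[out++] = ch;
        }
    }
    dst[out] = '\0';
    return out;
}
```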

  6. #6
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    It appears that your source file is not encoded in a single-byte character set (like one of the ISO-8859 LATIN formats), and you are expecting it to be. It might be encoded in UTF-8.

    When dealing with unicode encoded files, you need to process characters, not bytes, if you want to deal with multi-byte characters.
    Mainframe assembler programmer by trade. C coder when I can.

  7. #7
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by Dino View Post
    It appears that your source file is not encoded in a single-byte character set (like one of the ISO-8859 LATIN formats), and you are expecting it to be. It might be encoded in UTF-8.

    When dealing with unicode encoded files, you need to process characters, not bytes, if you want to deal with multi-byte characters.
    Yes, and char, despite its name, holds bytes, not characters [the two are only the same thing when every character fits in one byte, and UTF-8 is one of the encodings where this is not true].

    Going "backwards" through UTF-8 files is far from trivial. I'm not sure how you'd go about it really, because UTF-8 is designed to be read "in order", so you would have to scan backwards to find the unique markers for the start of a multi-byte character [I think there is more than one, but not a huge number], then go forward according to the length of the character.

    Given this complexity, it may make more sense to treat the file as a list of backwards lines - so read back from the very end until you find a newline, then reverse all characters [not bytes] in the resulting string. At least then it's relatively easy to deal with multi-byte characters, because you can move back and forth within the string itself.
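
    A minimal sketch of that idea for a string already held in memory, assuming the input is valid UTF-8. It relies on the fact that every byte after the first one in a multi-byte UTF-8 character has the form 10xxxxxx, so walking back over such bytes finds the start of the character. The names is_continuation and utf8_reverse are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* True for UTF-8 continuation bytes, which always look like 10xxxxxx. */
static int is_continuation(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

/* Hypothetical helper: copies src into dst with the characters (not the
   bytes) in reverse order. dst must have room for strlen(src) + 1 bytes.
   Assumes src is valid UTF-8. */
void utf8_reverse(const char *src, char *dst)
{
    size_t end = strlen(src);   /* one past the last byte not yet copied */
    size_t out = 0;

    while (end > 0) {
        size_t start = end - 1;

        /* walk back over continuation bytes to the first byte of the character */
        while (start > 0 && is_continuation((unsigned char)src[start]))
            start--;

        /* copy the whole character with its bytes in their original order */
        memcpy(dst + out, src + start, end - start);
        out += end - start;
        end = start;
    }
    dst[out] = '\0';
}
```

    Because whole characters are copied with their bytes in the original order, this also covers 3-byte and 4-byte characters, not only the 2-byte ones used for Latin letters with diacritics.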

    --
    Mats

  8. #8
    Registered User
    Join Date
    Oct 2007
    Posts
    100
    Quote Originally Posted by Dino View Post
    It appears that your source file is not encoded in a single-byte character set (like one of the ISO-8859 LATIN formats), and you are expecting it to be. It might be encoded in UTF-8.

    When dealing with unicode encoded files, you need to process characters, not bytes, if you want to deal with multi-byte characters.
    so which function should I use instead than fseek(which works with bytes) to solve this problem?
    thanks!

  9. #9
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by smoking81 View Post
    so which function should I use instead than fseek(which works with bytes) to solve this problem?
    thanks!
    It is not fseek as much as the "encoding of each character" that is the problem. You would have to know where the boundaries of a character are, and that is far from trivial.

    If it's not strictly a requirement for your assignment that it works with multi-byte UTF-8 characters, I'd suggest you just mention as a restriction of your implementation that it doesn't support UTF-8 encoded files properly.

    --
    Mats

  10. #10
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    You can try this if you want. Without knowing the exact encoding of your source file, this is a guess.

    If you read this description in the following link, you'll learn that in UTF-8 (my assumption of your source file format), any ISO 8859 character code that is less than X'80' can be represented in a single byte in UTF-8. Latin characters with diacritics need 2 bytes. See the description:

    http://en.wikipedia.org/wiki/UTF-8#Description

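    In the description, you see that each UTF-8 "character" is self-defining: you can tell its length by inspecting the high-order bit(s) of its first byte. If the high-order bit is off, the character takes up only one byte. If the top three bits are 110 (the byte ANDed with X'E0' gives X'C0'), it is the first byte of a two-byte character, and the byte that follows has 10 as its top two bits.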

    See the examples for how the high end of the LATIN character set are represented in 2-byte UTF-8 character codes.

    http://en.wikipedia.org/wiki/UTF-8#Examples
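
    The bit tests described above can be sketched as a small function. The name utf8_seq_len is made up for illustration, and the function only looks at the bit pattern of the first byte; it does not validate the bytes that follow.

```c
/* Hypothetical helper: returns how many bytes the UTF-8 sequence starting
   with byte b occupies, or 0 if b is a continuation byte (10xxxxxx) or
   matches none of the lead-byte patterns. */
int utf8_seq_len(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1;   /* 0xxxxxxx: plain ASCII */
    if ((b & 0xE0) == 0xC0) return 2;   /* 110xxxxx: 2-byte character */
    if ((b & 0xF0) == 0xE0) return 3;   /* 1110xxxx: 3-byte character */
    if ((b & 0xF8) == 0xF0) return 4;   /* 11110xxx: 4-byte character */
    return 0;
}
```

    For example, 'é' is stored as the bytes C3 A9: the first byte reports a length of 2 and the second is recognised as a continuation byte.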

    Todd

  11. #11
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Ah, ok, so you can actually identify a multi-byte UTF-8 character just by looking at the byte itself.

    --
    Mats

  12. #12
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Yes. I just coded up a small test to see it. To reproduce, just create a UTF-8 encoded file with your favorite editor and insert some diacritics.

    Todd

  13. #13
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Here is a routine that works to reverse a string in a UTF-8 encoded file. (It doesn't take CRs or LFs into account.)

    Todd

    Code:
    #include <stdio.h>
    #include <string.h>
    
    int main (int argc, const char * argv[]) {
    	char line[81], newline[81] = {0 } ;
    	char * linep, *newlinep ;
    	FILE * file ;
    
    	if ( (file = fopen("/Users/toddburch/Desktop/myutf8.txt","r+")) == NULL) {
    		printf("Cannot open FILE\n" ) ;
    		return -1 ;
    	}
    
    	// Read the first line; it must hold at least the 3 utf-8 flag bytes
    	if ( fgets( line, sizeof(line), file ) == NULL || strlen(line) < 3 ) {
    		fclose(file) ;
    		return -1 ;
    	}
    
    	// Reverse the string
    	linep = line ;
    	newlinep = newline + strlen(line)-3 ;   // point to end of string
    	*newlinep-- = 0 ;                       // null term char
    	linep += 3 ;   // skip the utf-8 encoding flag bytes
    	while (*linep) {   // Do until we hit the null term byte
    
    		// 110xxxxx starts a 2-byte char: move both bytes in order
    		if ( (*linep & 0xE0) == 0xC0 && linep[1] != 0 ) {
    			*(newlinep-1) = *linep++ ;
    			*newlinep-- = *linep++ ;
    			newlinep-- ;
    		}
    		else {
    			*newlinep-- = *linep++ ;
    		}
    	}
    	fseek(file, 0, SEEK_CUR) ;   // a seek is required between a read and a write
    	fputs(newline, file) ;   // write reversed string
    	fclose(file) ;
        return 0;
    }
    The output of which looks like this:

    Code:
    Here is my résumé.
    
    .émusér ym si ereH
