Thread: Reading UTF

  1. #1
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733

    Reading UTF

    Another thread meant to be found by others who hit a problem I've just fixed in my own code. It turns out that using literals like 0x80 did not detect the UTF-8 state for me, while literals like 0x80u did. Here's some example code to play with (next.h is my own header, so just replace or rip out whatever won't compile, or write your own equivalent, your choice):
    Code:
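    #include "next.h"
    #include <errno.h>
    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    /* Read one UTF-8 encoded character (1 to 4 bytes) from file into c */
    int fnextchr( FILE *file, char8_t *c, size_t leng ) {
    	size_t i, max;
    	long int p;
    	int b;
    	if ( leng < 1 ) return ERANGE;
    	p = ftell(file);
    	/* Keep fgetc()'s int result so EOF is not confused with byte 0xFF */
    	b = fgetc(file);
    	if ( b == EOF ) return ENODATA;
    	c[0] = (char8_t)b;
    	/* Compare the masked lead byte against the whole bit pattern;
    	 * testing only for non-zero matches every byte above 0x7F */
    	if ( c[0] < 0x80u ) return 0;                  /* 0xxxxxxx: ASCII */
    	else if ( (c[0] & 0xE0u) == 0xC0u ) max = 2;   /* 110xxxxx */
    	else if ( (c[0] & 0xF0u) == 0xE0u ) max = 3;   /* 1110xxxx */
    	else if ( (c[0] & 0xF8u) == 0xF0u ) max = 4;   /* 11110xxx */
    	else return EILSEQ; /* stray continuation byte or invalid lead byte */
    	if ( max > leng ) {
    		/* Not enough room: put the lead byte back (needs a seekable stream) */
    		fseek( file, p, SEEK_SET );
    		return ERANGE;
    	}
    	for ( i = 1; i < max; ++i ) {
    		b = fgetc(file);
    		/* Every following byte must look like 10xxxxxx */
    		if ( b == EOF || (b & 0xC0) != 0x80 ) return EILSEQ;
    		c[i] = (char8_t)b;
    	}
    	return 0;
    }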
    int main() {
    	long int p = 0;
    	NEXTC _nextc = {0};
    	_nextc.src = stdin;
    	_nextc.nextchr = (func_nextchr)fnextchr;
    	puts("Enter mizu character and others:\n");
    	while ( nextc(&_nextc) ) {
    		if ( _nextc.c[0] == U'\r' || _nextc.c[0] == U'\n' ) break;
    		printf("Character at stdin position %ld: '%s'\n", p, _nextc.c );
    		p += strlen( (char*)_nextc.c );
    	}
    	return 0;
    }
    I used '⺢abcd' as my test set to identify where I was going wrong. I can now go back to my original projects and check that they work correctly (I fixed them straight after identifying the problem).
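
    For anyone wanting to check the test character by hand, here is a minimal sketch of decoding a 3-byte sequence back into a code point. It assumes '⺢' is U+2EA2, which would be the bytes E2 BA A2 in UTF-8; that is worth verifying against your own input. The name utf8_decode3 is made up for illustration and is not part of next.h.

```c
#include <stdint.h>

/* Hypothetical helper (not part of next.h): decode one 3-byte UTF-8
 * sequence (1110xxxx 10xxxxxx 10xxxxxx) into a code point.
 * Returns 0xFFFFFFFF if the bytes do not match that pattern. */
uint32_t utf8_decode3( const unsigned char *s ) {
	if ( (s[0] & 0xF0u) != 0xE0u ) return 0xFFFFFFFFu;
	if ( (s[1] & 0xC0u) != 0x80u ) return 0xFFFFFFFFu;
	if ( (s[2] & 0xC0u) != 0x80u ) return 0xFFFFFFFFu;
	/* 4 payload bits from the lead byte, 6 from each continuation byte */
	return ((uint32_t)(s[0] & 0x0Fu) << 12)
	     | ((uint32_t)(s[1] & 0x3Fu) << 6)
	     |  (uint32_t)(s[2] & 0x3Fu);
}
```

    Calling it on the bytes "\xE2\xBA\xA2" should give back 0x2EA2, while passing it plain ASCII such as "abc" gives the error value because 'a' is not a 1110xxxx lead byte.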

  2. #2
    Registered User
    Join Date
    Feb 2019
    Posts
    1,078
    Maybe because 0x80 is never a lead byte in UTF-8 (it can only occur as a continuation byte, 10xxxxxx):

    UTF-8 encoding (U+N -> encoding):

    Code:
    0x00000000 - 0x0000007F:  0xxxxxxx
    0x00000080 - 0x000007FF:  110xxxxx 10xxxxxx
    0x00000800 - 0x0000FFFF:  1110xxxx 10xxxxxx 10xxxxxx
    0x00010000 - 0x001FFFFF:  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x00200000 - 0x03FFFFFF:  111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    0x04000000 - 0x7FFFFFFF:  1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    See utf8 manpage:

    Code:
    $ man 7 utf8
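
    The table above can be turned into a lead-byte check along these lines. This is only a sketch: utf8_seq_len is a hypothetical name, and it stops at 4 bytes because RFC 3629 restricts UTF-8 to U+10FFFF, so the 5 and 6 byte forms in the table are treated as invalid here.

```c
#include <stddef.h>

/* Hypothetical helper: number of bytes in the sequence started by lead
 * byte b, limited to 4 bytes as in RFC 3629.
 * Returns 0 for a continuation byte (10xxxxxx) or an invalid lead byte. */
size_t utf8_seq_len( unsigned char b ) {
	if ( b < 0x80u ) return 1;               /* 0xxxxxxx */
	if ( (b & 0xE0u) == 0xC0u ) return 2;    /* 110xxxxx */
	if ( (b & 0xF0u) == 0xE0u ) return 3;    /* 1110xxxx */
	if ( (b & 0xF8u) == 0xF0u ) return 4;    /* 11110xxx */
	return 0;
}
```

    Each mask keeps the marker bits plus the first zero bit after them, so the comparison matches exactly one row of the table. A plain non-zero test such as (b & 0xE0u) is true for every byte above 0x7F and cannot tell the rows apart.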

  3. #3
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,733
    Nah, it was the fact that the compiler treated 0x80 as signed (-128) instead of leaving it as unsigned (128). That prevented the functions relying on it from ever seeing the full character, which in turn corrupted the output. After fixing that I finally got my mizu character displayed correctly and was able to start identifying problems with my character literal reading code. I fixed my escape character code so that it correctly produces UTF-8 when reading a UTF-16/UTF-32 character (the final result is converted back to the local encoding, which in my case is UTF-8). I'm nearly done with the character literal code and will be able to move on to expressions with them soon.
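
    One way a signedness problem like this can show up is sketched below. It assumes the byte ends up in a signed char somewhere along the way, which may or may not be what happened in the original code. In C the literal 0x80 on its own has type int and value 128, so the negative value would come from the variable being promoted rather than from the literal. Both function names are made up for illustration.

```c
#include <stdbool.h>

/* A byte such as 0xE2 held in a signed char is promoted to a negative
 * int (-30 on the usual two's complement platforms), so it compares as
 * less than 0x80 (128) and is mistaken for ASCII. */
bool is_ascii_signed( signed char c ) {
	return c < 0x80;
}

/* Converting to unsigned char first restores the 0..255 range, so the
 * same byte is correctly seen as 226 and rejected. */
bool is_ascii_unsigned( signed char c ) {
	return (unsigned char)c < 0x80u;
}
```

    With the lead byte of the mizu character (0xE2), the first function reports ASCII and the second does not. Reading into an int or an unsigned char from the start avoids the problem entirely.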
