Thread: UTF-8 character counter in C

  1. #1
    Registered User
    Join Date
    Nov 2020
    Posts
    4

    UTF-8 character counter in C

    I'm taking an introductory course in computer science, and one of our tasks is to edit a program that counts the number of characters in a file so that it also gives the right count for files containing UTF-8 encoded characters.
    The code we were supposed to edit is:

    Code:
    #include <stdio.h>
    
    
    typedef unsigned char BYTE;
    
    
    int main(int argc, char const *argv[])
    {
        if (argc != 2)
        {
            printf("Usage: ./count INPUT\n");
            return 1;
        }
        FILE *file = fopen(argv[1], "r");
        if (!file)
        {
            printf("Could not open file.\n");
            return 1;
        }
        int count = 0;
        while(1)
        {
            BYTE b;
    
            fread(&b, 1, 1, file);
            if (feof(file))
            {
                break;
            }
            count++;
        }
        printf("Number of characters: %i\n", count);
    
    
        fclose(file);
    
    
        return 0;
    }
    I thought that simply changing

    fread(&b, 1, 1, file);

    into

    fread(&b, 4, 1, file);

    would do it since I know that UTF-8 uses 4 bytes, but apparently not, because the program crashes. I've tried looking for other ways, but the methods they use are quite confusing as well (or perhaps it's my lack of experience). I've also tried other approaches, such as reading the file's contents into a char variable instead of a BYTE, but the code ends up looking more complicated and that's probably wrong anyway.
    Also, I've tried changing

    typedef unsigned char BYTE;

    into

    unsigned char BYTE[4] = {0};

    so that it would be able to hold 4 bytes, but it still doesn't work. Can you give me tips or at least point me in the right direction? Thank you.
    Last edited by meowmeow004; 11-22-2020 at 09:53 PM.

  2. #2
    Registered User
    Join Date
    Sep 2020
    Posts
    425
    Wikipedia has a graphic with all the info you need at UTF-8 - Wikipedia. A UTF-8 character can be up to 4 bytes long, but it doesn't have to be.

    Also, think about using getc() rather than fread(). It will make your life a lot simpler for byte-by-byte I/O.

    Oh, and remember that you are only counting characters; I assume you don't care at all what those characters are.
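
    If only the count matters, one way to do it (a minimal sketch, assuming the input is valid UTF-8) is to read byte by byte with getc() and count every byte that is not a continuation byte, i.e. whose top two bits are not 10. The name count_utf8_chars is made up for illustration.

```c
#include <stdio.h>

/* Count UTF-8 characters (code points) in a stream by counting every byte
 * that is NOT a continuation byte (continuation bytes look like 10xxxxxx).
 * Assumes valid UTF-8 input; count_utf8_chars is a hypothetical name. */
long count_utf8_chars(FILE *file)
{
    long count = 0;
    int c; /* int, not char, so EOF can be told apart from real bytes */

    while ((c = getc(file)) != EOF)
    {
        if ((c & 0xC0) != 0x80) /* not 10xxxxxx: starts a new character */
        {
            count++;
        }
    }
    return count;
}
```

    Pass it the FILE * returned by fopen() and print the result.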
    Last edited by hamster_nz; 11-22-2020 at 09:57 PM.

  3. #3
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    When you pass "r" to fopen() it opens the file in text mode. That could be problematic, as it allows the C library to translate the data as it is read (for example, converting line endings on Windows). So for this task, opening the file in binary read mode ("rb") is the safer option.

    And you really just need to look at the first byte of each character in order to determine the number of bytes remaining in its sequence.
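
    As a rough sketch of that idea, the helper below (utf8_sequence_length is a hypothetical name, not a library function) maps a first byte to the total length of its sequence by checking the leading bit pattern. It only looks at the pattern, so it assumes well-formed input and does not reject overlong encodings.

```c
/* Return the total length in bytes (1 to 4) of the UTF-8 sequence that
 * starts with lead byte b, or 0 if b does not match any lead-byte pattern
 * (a continuation byte 10xxxxxx, or a byte in the range 0xF8..0xFF). */
int utf8_sequence_length(unsigned char b)
{
    if ((b & 0x80) == 0x00) return 1; /* 0xxxxxxx: plain ASCII */
    if ((b & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0;
}
```

    Subtracting 1 from a non-zero result gives the number of bytes left to read after the first one.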

  4. #4
    Registered User
    Join Date
    Nov 2020
    Posts
    4
    Quote Originally Posted by hamster_nz View Post
    Wikipedia has a graphic with all the info you need at UTF-8 - Wikipedia. A UTF-8 character can be up to 4 bytes long, but it doesn't have to be.

    Also, think about using getc() rather than fread(). It will make your life a lot simpler for byte-by-byte I/O.

    Oh, and remember that you are only counting characters; I assume you don't care at all what those characters are.
    I'd also like to use getc(), but this code was provided to us and I just need to figure out which part of the code I need to change. I've seen the wiki page and I think that I should change this part of the code
    Code:
    	int count = 0;
    	while(1)
    	{
    		BYTE b;
    		fread(&b, 1, 1, file);
    		if (feof(file))
    		{
    			break;
    		}
    		count++;
    	}
    into maybe something like this:
    Code:
    	int count = 0;
    	while(1)
    	{
    		BYTE b[16];
    		fread(&b, 8, 1, file);
    		if (feof(file))
    		{
    			break;
    		}
    		count++;
    	}
    but no luck I guess. I've noticed though that the number displayed changes depending on the size inside fread, but nothing worked out. Now, I'm really not sure what's wrong anymore.

  5. #5
    Registered User
    Join Date
    Nov 2020
    Posts
    4
    Quote Originally Posted by Sir Galahad View Post
    When you pass "r" to fopen() it opens the file in text mode. That could be problematic, as it allows the C library to translate the data as it is read (for example, converting line endings on Windows). So for this task, opening the file in binary read mode ("rb") is the safer option.

    And you really just need to look at the first byte of each character in order to determine the number of bytes remaining in its sequence.
    Hello, does this mean that I only have to edit

    FILE *file = fopen(argv[1], "r");

    into

    FILE *file = fopen(argv[1], "rb");??

    I really thought I'd have to modify somewhere in this part

    Code:
        int count = 0;
        while(1)
        {
            BYTE b;
            fread(&b, 1, 1, file);
            if (feof(file))
            {
                break;
            }
            count++;
        }

  6. #6
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by meowmeow004 View Post
    I'd also like to use getc(), but this code was provided to us and I just need to figure out which part of the code I need to change. I've seen the wiki page and I think that I should change this part of the code
    Code:
        int count = 0;
        while(1)
        {
            BYTE b;
            fread(&b, 1, 1, file);
            if (feof(file))
            {
                break;
            }
            count++;
        }
    into maybe something like this:
    Code:
        int count = 0;
        while(1)
        {
            BYTE b[16];
            fread(&b, 8, 1, file);
            if (feof(file))
            {
                break;
            }
            count++;
        }
    but no luck I guess. I've noticed though that the number displayed changes depending on the size inside fread, but nothing worked out. Now, I'm really not sure what's wrong anymore.
    Well you definitely don't want to read it in 8 bytes at a time. Read a byte, determine the length, then read the rest (all in one go if you wish).

  7. #7
    Registered User
    Join Date
    Nov 2020
    Posts
    4
    Oh, I thought I'd have to read it in 8 bytes or 4 at a time since when I tried it 1 byte at a time, the number of characters returned was wrong. I'm not sure how to proceed since the characters I'll be counting are Chinese, Japanese, and Korean.

  8. #8
    Registered User Sir Galahad's Avatar
    Join Date
    Nov 2016
    Location
    The Round Table
    Posts
    277
    Quote Originally Posted by meowmeow004 View Post
    Oh, I thought I'd have to read it in 8 bytes or 4 at a time since when I tried it 1 byte at a time, the number of characters returned was wrong.
    I don't see how you could possibly know how many characters were read without checking the return value of fread().
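
    A minimal sketch of what checking the return value looks like (count_bytes is an illustrative name): fread() returns the number of items it actually read, so the loop can stop as soon as that is less than the number requested. Note that this counts bytes, not characters.

```c
#include <stdio.h>

/* Count the bytes in a stream, using the return value of fread() to end
 * the loop instead of calling feof() after the fact. */
long count_bytes(FILE *file)
{
    unsigned char b;
    long count = 0;

    /* With size 1 and count 1, fread() returns 1 on success and 0 at end
     * of file or on a read error. */
    while (fread(&b, 1, 1, file) == 1)
    {
        count++;
    }
    return count;
}
```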

    Quote Originally Posted by meowmeow004 View Post
    I'm not sure how to proceed since the characters I'll be counting are Chinese, Japanese, and Korean.
    It doesn't matter what language it's in. Just parse it according to the standard and you won't have any problems.

    (1) Read a byte.
    (2) Determine the number of bytes left by examining the bit pattern in the first byte.
    (3) Read the remaining 1 to 3 bytes, if necessary.
    (4) Rinse and repeat.
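
    The four steps above could be sketched roughly like this with fread(). Both function names are made up, and the sketch assumes reasonably well-formed input: a stray continuation byte is simply counted as one character, and a sequence cut off by end of file is not counted.

```c
#include <stdio.h>

/* Hypothetical helper: how many bytes follow the lead byte b. */
static int utf8_trailing_bytes(unsigned char b)
{
    if ((b & 0xE0) == 0xC0) return 1; /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 2; /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 3; /* 11110xxx */
    return 0; /* ASCII, or a byte that should not start a sequence */
}

long count_utf8_by_lead_byte(FILE *file)
{
    unsigned char lead;
    unsigned char rest[3];
    long count = 0;

    /* (1) Read a byte; (4) repeat until fread() reports nothing was read. */
    while (fread(&lead, 1, 1, file) == 1)
    {
        /* (2) Work out how many bytes are left from the bit pattern. */
        size_t remaining = (size_t)utf8_trailing_bytes(lead);

        /* (3) Read the remaining bytes; stop if the sequence is truncated. */
        if (remaining > 0 && fread(rest, 1, remaining, file) != remaining)
        {
            break;
        }
        count++;
    }
    return count;
}
```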

  9. #9
    Registered User
    Join Date
    Sep 2020
    Posts
    425
    Can you reword that so you are clear about bytes, characters, C 'chars' and Unicode characters?

    They are all slightly different things.

    Using a mix of somewhat vague terms is hiding the solution from you.

  10. #10
    Registered User
    Join Date
    Feb 2019
    Posts
    1,078
    A Unicode code point is a number that identifies a single character. The original ISO 10646 design allowed up to 31 bits, but Unicode now limits code points to the range U+0000 to U+10FFFF (21 bits). The low 16 bits select a character within a "plane" and the upper bits select the plane. Currently planes 0, 1, 2, 3, 14, 15 and 16 are in use; the others are unassigned. A code point can be 'transformed' into a sequence of 8-bit units. This is called a Unicode Transformation Format (UTF). UTF-8 (8-bit units) is very common, as is UTF-16. Each transformation format encodes the code point in a different way.

    In UTF-8 the upper bits of the first byte determine how many bytes are used to encode a character. From 'man utf8':
    Code:
    0x00000000 - 0x0000007F:
        0xxxxxxx
    
    0x00000080 - 0x000007FF:
        110xxxxx 10xxxxxx
    
    0x00000800 - 0x0000FFFF:
        1110xxxx 10xxxxxx 10xxxxxx
    
    0x00010000 - 0x001FFFFF:
        11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    
    0x00200000 - 0x03FFFFFF:
        111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    
    0x04000000 - 0x7FFFFFFF:
        1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    The 'x' bits hold the Unicode code point of the character.

    Notice that if the most significant bit of the first byte is 1, the number of '1' bits that follow it (before the first 0) tells you how many more bytes there are in the encoding. Take '€' as an example (U+20AC, encoded in UTF-8 as 0xE2 0x82 0xAC: 0b11100010 0b10000010 0b10101100, because 0x20AC in binary is 0b0010_000010_101100). The first byte has 1 in the msb and two more 1s after it (followed by a 0), so the first byte is followed by 2 more bytes.
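
    To see where those three bytes come from, here is a small encoder sketch. The name utf8_encode is invented for illustration, and it only covers the 1 to 4 byte forms (code points up to U+10FFFF) without rejecting surrogates.

```c
#include <stddef.h>

/* Encode code point cp into out[] and return the number of bytes written
 * (1 to 4), or 0 if cp is above U+10FFFF. Hypothetical helper; surrogates
 * (U+D800..U+DFFF) are not rejected in this sketch. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp <= 0x7F)
    {
        out[0] = (unsigned char)cp;                          /* 0xxxxxxx */
        return 1;
    }
    if (cp <= 0x7FF)
    {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));          /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx */
        return 2;
    }
    if (cp <= 0xFFFF)
    {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F)); /* 10xxxxxx */
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx */
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));         /* 11110xxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

    Calling utf8_encode(0x20AC, out) takes the third branch and fills out with 0xE2, 0x82 and 0xAC.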

    If the msb is 0 the char is encoded exactly as in ASCII encoding.

    The original UTF-8 design allowed encodings of 1 to 6 bytes (because the xxx part can hold 31 bits at most, as shown above), but the current standard (RFC 3629) limits a character to 4 bytes, because there are no code points beyond plane 16 (U+10FFFF). The bit pattern above could in principle be extended to 7 bytes (a first byte of 0b11111110 followed by 6 more), but that form was never part of UTF-8: the bytes 0xFE and 0xFF never appear in valid UTF-8.
    Last edited by flp1969; 11-23-2020 at 06:40 AM.
