UTF-8 character counter in C

**meowmeow004** · 11-22-2020

I'm taking an introductory computer science course, and one of our tasks is to edit a program that counts the characters in a file so that it also gives the right count when the file contains UTF-8 encoded characters.
The code we were supposed to edit is:

Code:

#include <stdio.h>


typedef unsigned char BYTE;


int main(int argc, char const *argv[])
{
    if (argc != 2)
    {
        printf("Usage: ./count INPUT\n");
        return 1;
    }
    FILE *file = fopen(argv[1], "r");
    if (!file)
    {
        printf("Could not open file.\n");
        return 1;
    }
    int count = 0;
    while(1)
    {
        BYTE b;

        fread(&b, 1, 1, file);
        if (feof(file))
        {
            break;
        }
        count++;
    }
    printf("Number of characters: %i\n", count);


    fclose(file);


    return 0;
}

I thought that simply changing

fread(&b, 1, 1, file);

into

fread(&b, 4, 1, file);

would do it since I understood that UTF-8 uses 4 bytes, but apparently not, because the program crashes. I've tried looking for other ways, but the methods they use are quite confusing as well (or perhaps it's my lack of experience). I've also tried other approaches, such as reading the file's contents into a char variable instead of a BYTE, but the code ends up looking more complicated and that's probably wrong anyway.
Also, I've tried changing

typedef unsigned char BYTE;

into

unsigned char BYTE[4] = {0};

So that it would be able to hold 4 bytes, but it still doesn't work. Can you give me some tips, or at least point me in the right direction? Thank you.

**hamster_nz** · 11-22-2020

Wikipedia's UTF-8 article has a graphic with all the info you need. A UTF-8 character can be up to 4 bytes long, but it doesn't have to be.

Also, think about using getc() rather than fread(). It will make your life a lot simpler for byte-by-byte I/O.

Oh, and remember that you are only counting characters, I assume you have no cares at all about what those characters are.
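One way to act on that last hint, as a minimal sketch rather than the assignment's intended answer: every UTF-8 character contains exactly one byte that is not a continuation byte (continuation bytes have the pattern 10xxxxxx), so counting those bytes counts characters. The name count_utf8_chars is made up for illustration, and the sketch assumes the input is valid UTF-8.

```c
#include <stdio.h>

/* Hypothetical helper (not part of the assignment code): counts UTF-8
   characters by counting every byte that is NOT a continuation byte.
   Assumes the stream holds valid UTF-8. */
long count_utf8_chars(FILE *file)
{
    long count = 0;
    int c; /* int, not char, so EOF can be told apart from byte 0xFF */

    while ((c = getc(file)) != EOF)
    {
        /* Continuation bytes look like 10xxxxxx; skip them. */
        if ((c & 0xC0) != 0x80)
        {
            count++;
        }
    }
    return count;
}
```

With the file opened as in the original program, a call such as count_utf8_chars(file) could stand in for the whole while loop.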

**Sir Galahad** · 11-22-2020

When you pass "r" to fopen(), it opens the file in text mode. That could be problematic, as it basically gives the C library a license to preprocess the data (translating line endings, for example). So for this task, opening the file in binary read mode ("rb") is the safer option.

And you really just need to read the first byte of each character in order to determine the number of bytes remaining in its sequence.

**meowmeow004** · 11-22-2020

Originally Posted by hamster_nz

Wikipedia's UTF-8 article has a graphic with all the info you need. A UTF-8 character can be up to 4 bytes long, but it doesn't have to be.

Also, think about using getc() rather than fread(). It will make your life a lot simpler for byte-by-byte I/O.

Oh, and remember that you are only counting characters, I assume you have no cares at all about what those characters are.

I'd also like to use getc(), but this code was provided to us and I just need to figure out which part of the code I need to change. I've seen the wiki page and I think that I should change this part of the code

Code:

	int count = 0;
	while(1)
	{
		BYTE b;
		fread(&b, 1, 1, file);
		if (feof(file))
		{
			break;
		}
		count++;
	}

into maybe something like this:

Code:

	int count = 0;
	while(1)
	{
		BYTE b[16];
		fread(&b, 8, 1, file);
		if (feof(file))
		{
			break;
		}
		count++;
	}

but no luck, I guess. I did notice that the number displayed changes depending on the size passed to fread(), but nothing gave the right count. Now I'm really not sure what's wrong anymore.

**meowmeow004** · 11-22-2020

Originally Posted by Sir Galahad

When you pass "r" to fopen(), it opens the file in text mode. That could be problematic, as it basically gives the C library a license to preprocess the data (translating line endings, for example). So for this task, opening the file in binary read mode ("rb") is the safer option.

And you really just need to read the first byte of each character in order to determine the number of bytes remaining in its sequence.

Hello, does this mean that I only have to edit

FILE *file = fopen(argv[1], "r");

into

FILE *file = fopen(argv[1], "rb");

and nothing else? I really thought I'd have to modify something in this part:

Code:

    int count = 0;
    while(1)
    {
        BYTE b;
        fread(&b, 1, 1, file);
        if (feof(file))
        {
            break;
        }
        count++;
    }

**Sir Galahad** · 11-22-2020

Well you definitely don't want to read it in 8 bytes at a time. Read a byte, determine the length, then read the rest (all in one go if you wish).

**meowmeow004** · 11-23-2020

Oh, I thought I'd have to read it in 8 bytes or 4 at a time since when I tried it 1 byte at a time, the number of characters returned was wrong. I'm not sure how to proceed, since the characters I'll be counting are Chinese, Japanese, and Korean.

**Sir Galahad** · 11-23-2020

Originally Posted by meowmeow004

Oh, I thought I'd have to read it in 8 bytes or 4 at a time since when I tried it 1 byte at a time, the number of characters returned was wrong.

I don't see how you could possibly know how many characters were read without checking the return value of fread().

Originally Posted by meowmeow004

I'm not sure how to proceed, since the characters I'll be counting are Chinese, Japanese, and Korean.

It doesn't matter what language it's in. Just parse it according to the standard and you won't have any problems.

(1) Read a byte.
(2) Determine the number of bytes left by examining the bit pattern in the first byte.
(3) Read the remaining 1 to 3 bytes, if necessary.
(4) Rinse and repeat.
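Steps (1) to (4) could be sketched roughly like this, keeping fread() as in the provided code and checking its return value each time. The function names are invented for illustration. The sketch assumes the modern 1 to 4 byte range and does not verify that the trailing bytes really are continuation bytes.

```c
#include <stdio.h>

/* Hypothetical helper: number of bytes in the sequence that starts with
   `lead`, or 0 if `lead` cannot start a 1-4 byte sequence. */
int utf8_sequence_length(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 0; /* stray continuation byte or unsupported lead byte */
}

/* Hypothetical helper: counts characters by following steps (1)-(4).
   Returns -1 on a bad lead byte or a truncated sequence. */
long count_utf8_sequences(FILE *file)
{
    long count = 0;
    unsigned char lead;

    /* (1) Read one byte; fread() returns the number of items read. */
    while (fread(&lead, 1, 1, file) == 1)
    {
        unsigned char rest[3];
        size_t remaining;

        /* (2) Work out the sequence length from the first byte. */
        int length = utf8_sequence_length(lead);
        if (length == 0)
        {
            return -1;
        }

        /* (3) Read the remaining 0 to 3 bytes in one go. */
        remaining = (size_t)(length - 1);
        if (remaining > 0 && fread(rest, 1, remaining, file) != remaining)
        {
            return -1;
        }

        count++; /* (4) One whole character consumed; repeat. */
    }
    return count;
}
```

Because the loop condition is the return value of fread(), there is no need for a separate feof() test to detect the end of the file.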

**hamster_nz** · 11-23-2020

Can you reword that so you are clear about bytes, characters, C 'chars' and Unicode characters?

They are all slightly different things.

Mixing these somewhat vague words is obscuring the solution from you.

**flp1969** · 11-23-2020

A Unicode code point is a number that identifies a single character. The original UTF-8 design allowed values up to 31 bits wide, but Unicode itself stops at U+10FFFF, which fits in 21 bits. The low 16 bits select a character within a "plane", and the bits above them select the plane. Currently planes 0, 1, 2, 3, 14, 15 and 16 have assignments; the others are unassigned. A code point can be 'transformed' into a sequence of smaller units; this is called a Unicode Transformation Format (UTF). UTF-8 (8-bit units) is very common, as is UTF-16 (16-bit units). Each transformation format encodes the code point value in a different way.

In UTF-8, the upper bits of the first byte determine how many bytes are used to encode a character. From 'man utf8':

Code:

0x00000000 - 0x0000007F:
    0xxxxxxx

0x00000080 - 0x000007FF:
    110xxxxx 10xxxxxx

0x00000800 - 0x0000FFFF:
    1110xxxx 10xxxxxx 10xxxxxx

0x00010000 - 0x001FFFFF:
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

0x00200000 - 0x03FFFFFF:
    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

0x04000000 - 0x7FFFFFFF:
    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The 'x' bits hold the Unicode value (the code point) of the character.
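To make the role of the 'x' bits concrete, here is a rough sketch of a decoder that strips the marker bits and glues the 'x' bits back together. The name utf8_decode is hypothetical. The sketch only handles the 1 to 4 byte forms, and it does not reject overlong encodings or surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one UTF-8 sequence starting at s (with
   len bytes available) into *out. Returns the number of bytes consumed,
   or 0 if the lead byte is invalid, the input is truncated, or a
   trailing byte is not of the form 10xxxxxx. */
size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *out)
{
    size_t n, i;
    uint32_t cp;

    if (len == 0) return 0;

    /* Keep only the 'x' bits of the first byte. */
    if      ((s[0] & 0x80) == 0x00) { n = 1; cp = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; cp = s[0] & 0x07; }
    else return 0;

    if (n > len) return 0;

    /* Each continuation byte contributes its low 6 bits. */
    for (i = 1; i < n; i++)
    {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *out = cp;
    return n;
}
```

For counting characters the decoded value is not needed, only the returned length, but seeing the bits reassembled may make the table easier to read.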

Notice that if the most significant bit of the first byte is 1, the number of '1' bits that follow it tells you how many more bytes there are in the encoding. Take '€' as an example (U+20AC, encoded in UTF-8 as 0xE2 0x82 0xAC: 0b11100010 0b10000010 0b10101100, because 0x20AC in binary is 0b0010_000010_101100). The first byte has 1 in the msb and two more 1s after it (followed by a 0), so the first byte is followed by 2 more bytes.
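The same bit shuffling can be run in the other direction. The following is a sketch of an encoder, with the hypothetical name utf8_encode, that splits a code point into 6-bit groups and adds the marker bits. It covers only the 1 to 4 byte forms used up to U+10FFFF and refuses surrogate values.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-8 encoding of cp into out (which
   must have room for 4 bytes). Returns the number of bytes written, or
   0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp <= 0x7F)
    {
        out[0] = (unsigned char)cp;                          /* 0xxxxxxx */
        return 1;
    }
    if (cp <= 0x7FF)
    {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));          /* 110xxxxx */
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));        /* 10xxxxxx */
        return 2;
    }
    if (cp <= 0xFFFF)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;          /* surrogates */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));         /* 1110xxxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF)
    {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));         /* 11110xxx */
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Passing 0x20AC should take the three-byte branch and produce the 0xE2 0x82 0xAC bytes worked out by hand above.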

If the msb is 0, the character is encoded exactly as in ASCII.

UTF-8 was originally designed to use 1 to 6 bytes (because the x part can hold at most 31 bits, as shown above), but in practice no character needs more than 4 bytes (because nothing is defined beyond plane 16, that is, beyond U+10FFFF). The rule above might suggest a 7-byte form (a first byte of 0b11111110 followed by 6 bytes), but no such form is defined: the bytes 0xFE and 0xFF never appear in UTF-8.
