Arrays - accented characters

**FernandoBasso** · 11-02-2011

Why can't I have an array like this one:

Code:

  
 char nochars[] = {'ç', 'Ç', '~', '^', '´', '`', 'ã', 'é'};

It returns errors like this:

$ gcc cripto.c -o cripto.bin
cripto.c: In function ‘main’:
cripto.c:10:23: warning: multi-character character constant [-Wmultichar]
cripto.c:10:5: warning: overflow in implicit constant conversion [-Woverflow]

**iceaway** · 11-02-2011

It is because you are trying to put characters that aren't in the ASCII table into a char. Encoding is a headache; look into things like UTF-8 and Unicode if you want to know more about it. The problem is that, in UTF-8, Ç is a multibyte character and can't be represented by a single byte.

This is actually a very interesting blog post to read about text encoding: http://www.joelonsoftware.com/articles/Unicode.html

**FernandoBasso** · 11-02-2011

All right. Thanks for the tip and for the link. I'm reading it already.

Perhaps I should rephrase my question.

I need to write a program that reads a string from user input, and if characters are, say, ç, Ç, á, à, etc, they should be replaced with c, C, a (just as an example).

I don't know what path to follow.

Thanks for the reply.

**iceaway** · 11-02-2011

I rarely do anything with strings, so I'm not the person to answer that question. Hopefully CommonTater will come around soon; he should have some advice on this matter.

**~~CommonTater~~** · 11-02-2011

Try it with unsigned char nochars[]...

Also you may need to add the last two characters by their numerical equivalents, as in...

Code:

unsigned char nochars[] = {'a', 'b', 214, 132, 'e'};

**Codeplug** · 11-02-2011

>> Why can't I have an array like this one:
It's dangerous to have character literals in your source code that don't belong to the "basic source character set". How the compiler maps "Physical Source File Characters" to the "Execution Character Set" is implementation defined. It may depend on how the source file itself is encoded. Or it may depend on command line parameters passed to the compiler. Or it may depend on the current locale settings when the compiler is invoked.

Most *nix OS's do things as UTF8, so I'll assume that your source file is encoded as UTF8 and gcc is mapping it directly to the "Execution Character Set" as UTF8 (which is common). If we look at "LATIN SMALL LETTER C WITH CEDILLA" (ç), its Unicode value is U+00E7. When encoded as UTF8, it becomes "0xC3,0xA7". That's 2 bytes, which gcc can't store in a single 'char'.

>> I need to write a program that reads a string from user input, and if characters are, say, ç, Ç, á, à, etc, they should be replaced with c, C, a (just as an example).
Why do you need to do this? This is typically the wrong approach to take, unless you have special circumstances that require it.

gg

**FernandoBasso** · 11-02-2011

Originally Posted by Codeplug

Most *nix OS's do things as UTF8, so I'll assume that your source file is encoded as UTF8 and gcc is mapping it directly to the "Execution Character Set" as UTF8 (which is common).

Yes, the source file is encoded as UTF8. I use vim, and it is set to write files in utf8.

Originally Posted by Codeplug

>> I need to write a program that reads a string from user input, and if characters are, say, ç, Ç, á, à, etc, they should be replaced with c, C, a (just as an example).
Why do you need to do this? This is typically the wrong approach to take, unless you have special circumstances that require it.
gg

The teacher asked the students to do it. No one has been able to achieve it so far. It has no special purpose other than being able to do it, I guess. It is an exercise. I didn't mention that before because I didn't want the code. I just wanted some tips, but I spent all morning reading about this, and I'm feeling stymied by now.

Thanks for the help so far.

**cas** · 11-02-2011

Locale issues with C are a nightmare (locale issues in general are a nightmare).

You can specify wide character constants with an L, e.g.: L'ã'; this has type wchar_t. You can read into an array of wchar_t with functions such as fgetws(), and then compare each element of that to your wide character constants.

You'll probably want to do:

Code:

#include <locale.h>
...
setlocale(LC_ALL, "");

first.

If the user isn't entering UTF-8 values, though, who knows what will happen? If he's using a Latin-2 terminal and enters the character ă (which has value 227 there), I suspect your program will think it's a ã (which has value 227 in Latin-1 and in Unicode).

**Codeplug** · 11-02-2011

>> The teacher asked the students to do it. ... It is an exercise.
I highly doubt that your "teacher" has any idea what they are asking you to do. What class is this for? Have they covered what the C standard says about character sets? Have they covered how gcc handles all the "implementation defined" areas? Have they covered locales and character encodings used by your OS and how gcc interacts with them?

You don't have to answer any of that.

This would be the most straightforward approach for the "exercise":

Code:

#include <stdio.h>
#include <wchar.h>
#include <locale.h>

#ifdef _WIN32
#include <windows.h>
#endif

const wchar_t SMALL_C_WITH_CEDILLA = L'\u00E7';

int main()
{
#ifdef _WIN32
    // this approach will only work on Windows if all the characters encountered
    // are represented in the ACP
    SetConsoleOutputCP(GetACP());
    SetConsoleCP(GetACP());
#endif
    setlocale(LC_ALL, "");

    fputws(L"Enter a string: ", stdout);

    wchar_t buff[128];
    if (fgetws(buff, 128, stdin) == NULL)
        return 1; // EOF, or input that could not be converted

    size_t len = wcslen(buff);
    for (size_t n = 0; n < len; ++n)
    {
        if (buff[n] == SMALL_C_WITH_CEDILLA)
            buff[n] = L'c';
    }//for

    wprintf(L"Unaccented: %ls\n", buff);

    return 0;
}//main

On most *nix OS's, wchar_t's are encoded as UTF32-BE/LE, which is good since it gives you a common encoding to work with: Unicode.

By calling setlocale(LC_ALL,""), you are telling the standard library, among other things, to use the narrow (char) character encoding specified by the user - typically UTF8 for *nix.

By calling fgetws, or any other wide I/O function, the standard library will automatically convert from the user's narrow encoding to the wide encoding - or UTF8 -> UTF32 for most *nix OS's - which allows us to "compare apples to apples".

For those who care about Windows...
Things are more difficult in Windows. Not only does it have the narrow character encoding specified by the locale, it also has a separate console-input-encoding and console-output-encoding. Narrow character encodings in Windows are called "codepages". When performing wide input, the standard library performs the following conversions: console-input-codepage -> locale-codepage -> wchar_t-encoding (which is UTF16-LE). GetACP() returns the system's default codepage, which is also the codepage used by the locale after calling setlocale(LC_ALL, ""). So calling "SetConsoleCP(GetACP())" will make the console-input-codepage the same as the locale-codepage, which eliminates that initial conversion.

>> I didn't want the code
How about a framework to begin with?

From there you can create a table of accented characters and their unaccented counterparts. Here are some links where you can look up the Unicode values for particular characters:
http://en.wikipedia.org/wiki/List_of_Unicode_characters
http://www.tachyonsoft.com/cpindex.htm
http://msdn.microsoft.com/en-us/goglobal/bb964653.aspx

For completeness, here is what was found the last time this subject came up:
http://savannah.nongnu.org/projects/unac/

gg

**FernandoBasso** · 11-02-2011

I tried using setlocale(LC_ALL, " "); but didn't have any luck. Thanks anyway.

Codeplug,

Thanks for the links (which I'm already reading and taking notes).

About the code, I run Arch Linux and gvim/xterm most of the time. The code didn't compile here.

I'll keep trying.

Thanks.

**Codeplug** · 11-02-2011

>> The code didn't compile here.
What was the error?

I compiled with: "gcc -Wall -std=c99 main.c"

gg

**FernandoBasso** · 11-02-2011

Oh, I was compiling with gcc -std=gnu99 utf8.c -o utf8.

It compiled now.

Thanks.
