Thread: Arrays - accented characters

  1. #1
    Registered User FernandoBasso's Avatar
    Join Date
    Oct 2011
    Location
    Brazil
    Posts
    45

    Arrays - accented characters

    Why can't I have an array like this one:
    Code:
      
     char nochars[] = {'ç', 'Ç', '~', '^', '´', '`', 'ã', 'é'};
    It returns errors like this:


    $ gcc cripto.c -o cripto.bin
    cripto.c: In function ‘main’:
    cripto.c:10:23: warning: multi-character character constant [-Wmultichar]
    cripto.c:10:5: warning: overflow in implicit constant conversion [-Woverflow]

  2. #2
    Registered User
    Join Date
    Sep 2011
    Location
    Stockholm, Sweden
    Posts
    131
    That happens because you are trying to store characters that aren't in the ASCII table in a char. Encoding is a headache; look into things like UTF-8 and Unicode if you want to know more about it. The problem is that Ç is a multibyte character in UTF-8, so it can't be represented by a single byte.

    This is actually a very interesting blog post to read about text encoding: http://www.joelonsoftware.com/articles/Unicode.html
    Last edited by iceaway; 11-02-2011 at 08:00 AM.

  3. #3
    Registered User FernandoBasso's Avatar
    Join Date
    Oct 2011
    Location
    Brazil
    Posts
    45
    All right. Thanks for the tip and for the link. I'm reading it already.

    Perhaps I should rephrase my question.

    I need to write a program that reads a string from user input, and if characters are, say, ç, Ç, á, à, etc, they should be replaced with c, C, a (just as an example).

    I don't know what path to follow.

    Thanks for the reply.

  4. #4
    Registered User
    Join Date
    Sep 2011
    Location
    Stockholm, Sweden
    Posts
    131
    I rarely do anything with strings, so I'm not the person to answer that question. Hopefully CommonTater will come around soon, he should have some advice on this matter.

  5. #5
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by FernandoBasso View Post
    Why can't I have an array like this one:
    Code:
      
     char nochars[] = {'ç', 'Ç', '~', '^', '´', '`', 'ã', 'é'};
    It returns errors like this:


    $ gcc cripto.c -o cripto.bin
    cripto.c: In function ‘main’:
    cripto.c:10:23: warning: multi-character character constant [-Wmultichar]
    cripto.c:10:5: warning: overflow in implicit constant conversion [-Woverflow]

    Try it with unsigned char nochars[]...

    Also you may need to add the last two characters by their numerical equivalents, as in...

    Code:
    unsigned char nochars[] = {'a', 'b', 214, 132, 'e' };

  6. #6
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> Why can't I have an array like this one:
    It's dangerous to have character literals in your source code that don't belong to the "basic source character set". How the compiler maps "Physical Source File Characters" to the "Execution Character Set" is implementation-defined. It may depend on how the source file itself is encoded. Or it may depend on command line parameters passed to the compiler. Or it may depend on the current locale settings when the compiler is invoked.

    Most *nix OS's do things as UTF-8, so I'll assume that your source file is encoded as UTF-8 and that gcc is mapping it directly to the "Execution Character Set" as UTF-8 (which is common). If we look at "LATIN SMALL LETTER C WITH CEDILLA" (ç), its Unicode code point is U+00E7. When encoded as UTF-8, it becomes the two bytes 0xC3 0xA7, which cannot fit in a single 'char'. That is why gcc first warns about a multi-character constant and then about an overflow.
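
    To make the byte arithmetic above concrete, here is a minimal sketch of a UTF-8 encoder. The name utf8_encode is made up for illustration (it is not a standard library function), and the code only shows how a code point such as U+00E7 ends up as the bytes 0xC3 0xA7:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a standard function): encode one Unicode
   code point as UTF-8. Writes up to 4 bytes into out and returns the
   number of bytes written, or 0 if cp is a surrogate or is above
   U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);

    if (cp < 0x80) {                      /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not encodable */
        return 0;
    if (cp < 0x10000) {                   /* 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

    Calling utf8_encode(0xE7, buf) returns 2 and leaves 0xC3 and 0xA7 in buf, while any ASCII letter such as 'c' comes back as a single byte.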

    >> I need to write a program that reads a string from user input, and if characters are, say, ç, Ç, á, à, etc, they should be replaced with c, C, a (just as an example).
    Why do you need to do this? This is typically the wrong approach to take, unless you have special circumstances that require it.

    gg

  7. #7
    Registered User FernandoBasso's Avatar
    Join Date
    Oct 2011
    Location
    Brazil
    Posts
    45
    Quote Originally Posted by Codeplug View Post
    >> Why can't I have an array like this one:
    It's dangerous to have character literals in your source code that don't belong to the "basic source character set". How the compiler maps "Physical Source File Characters" to the "Execution Character Set" is implementation defined. It may depend on how the source file itself is encoded. Or it may depend on command line parameters passed to the compiler. Or it may depend on the current locale settings when the compiler is invoked.

    Most *nix OS's do things as UTF8, so I'll assume that your source file is encoded as UTF8 and gcc is mapping it directly to the "Execution Character Set" as UTF8 (which is common). If we look at "LATIN SMALL LETTER C WITH CEDILLA" (ç), its Unicode value is U+000000E7. When encoded as UTF8, it becomes "0xC3,0xA7". That's 2 bytes which gcc can't store in 'char'.

    Yes, the source file is encoded as UTF8. I use vim, and it is set to write files in utf8.

    Quote Originally Posted by Codeplug View Post
    >> I need to write a program that reads a string from user input, and if characters are, say, ç, Ç, á, à, etc, they should be replaced with c, C, a (just as an example).
    Why do you need to do this? This is typically the wrong approach to take, unless you have special circumstances that require it.
    gg
    The teacher asked the students to do it. No one has been able to achieve it so far. It has no special purpose other than being able to do it, I guess. It is an exercise. I didn't mention that before because I didn't want the code. I just wanted some tips, but I spent all morning reading about this, and I'm feeling stymied by now.

    Thanks for the help so far.

  8. #8
    Registered User
    Join Date
    Sep 2007
    Posts
    1,012
    Locale issues with C are a nightmare (locale issues in general are a nightmare).

    You can specify wide character constants with an L, e.g.: L'ã'; this has type wchar_t. You can read into an array of wchar_t with functions such as fgetws(), and then compare each element of that to your wide character constants.

    You'll probably want to do:
    Code:
    #include <locale.h>
    ...
    setlocale(LC_ALL, "");
    first.

    If the user isn't entering UTF-8 values, though, who knows what will happen? If he's using a Latin-2 terminal and enters the character ă (which has value 227 in Latin-2), I suspect your program will think it's an ã (which has value 227 in Latin-1 and is code point U+00E3 in Unicode).

  9. #9
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> The teacher asked the students to do it. ... It is an exercise.
    I highly doubt that your "teacher" has any idea what they are asking you to do. What class is this for? Have they covered what the C standard says about character-sets? Have they covered how gcc handles all the "implementation defined" areas? Have they covered locales and character encodings used by your OS and how gcc interacts with them?

    You don't have to answer any of that.

    This would be the most straightforward approach for the "exercise":
    Code:
    #include <stdio.h>
    #include <wchar.h>
    #include <locale.h>
    
    #ifdef _WIN32
    #include <windows.h>
    #endif
    
    const wchar_t SMALL_C_WITH_CEDILLA = L'\u00E7';
    
    int main()
    {
    #ifdef _WIN32
        // this approach will only work on Windows if all the characters encountered
        // are represented in the ACP
        SetConsoleOutputCP(GetACP());
        SetConsoleCP(GetACP());
    #endif
        setlocale(LC_ALL, "");
    
        fputws(L"Enter a string: ", stdout);
    
        wchar_t buff[128];
        if (fgetws(buff, 128, stdin) == NULL)
            return 1; // EOF, read error, or input not valid in the current locale
    
        size_t len = wcslen(buff);
        for (size_t n = 0; n < len; ++n)
        {
            if (buff[n] == SMALL_C_WITH_CEDILLA)
                buff[n] = L'c';
        }//for
    
        wprintf(L"Unaccented: %ls\n", buff);
    
        return 0;
    }//main
    On most *nix OS's, wchar_t's are encoded as UTF32-BE/LE, which is good since it gives you a common encoding to work with - Unicode.

    By calling setlocale(LC_ALL,""), you are telling the standard library, among other things, to use the narrow (char) character encoding specified by the user - typically UTF8 for *nix.

    By calling fgetws, or any other wide I/O function, the standard library will automatically convert from the user's narrow encoding to the wide encoding - or UTF8 -> UTF32 for most *nix OS's - which allows us to "compare apples to apples".
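
    The same narrow-to-wide conversion can also be done by hand on a string you already have in memory. The following is only a sketch: narrow_to_wide is a hypothetical wrapper name, and it assumes the caller has already called setlocale(LC_ALL, "") so that the standard mbstowcs() function knows which narrow encoding to expect:

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper around the standard mbstowcs(): convert a
   narrow (multibyte) string to wide characters using the encoding of
   the current locale. dst must have room for cap wide characters.
   Returns the number of wide characters written, not counting the
   terminator, or (size_t)-1 if src is not valid in that encoding. */
size_t narrow_to_wide(const char *src, wchar_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL && cap > 0);

    /* Leave one slot free so the result can always be terminated. */
    size_t n = mbstowcs(dst, src, cap - 1);
    if (n == (size_t)-1)
        return n;

    /* mbstowcs() does not write a terminator when the buffer fills up. */
    dst[n] = L'\0';
    return n;
}
```

    Under a UTF-8 locale on a typical Linux system, the two-byte input "\xC3\xA7" should come out as one wchar_t holding 0xE7, which is the value the comparison in the loop above relies on. Plain ASCII input converts one-to-one in any locale.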

    For those who care about Windows...
    Things are more difficult in Windows. Not only does it have the narrow character encoding specified by the locale, it also has a separate console-input-encoding and console-output-encoding. Narrow character encodings in Windows are called "codepages". When performing wide input, the standard library performs the following conversions: console-input-codepage -> locale-codepage -> wchar_t-encoding (which is UTF16-LE). GetACP() returns the system's default codepage, which is also the codepage used by the locale after calling setlocale(LC_ALL, ""). So calling "SetConsoleCP(GetACP())" will make the console-input-codepage the same as the locale-codepage - which eliminates that initial conversion.

    >> I didn't want the code
    How about a framework to begin with? From there you can create a table of accented characters and their unaccented counterparts. Here are some links where you can look up the Unicode values for particular characters:
    http://en.wikipedia.org/wiki/List_of_Unicode_characters
    http://www.tachyonsoft.com/cpindex.htm
    http://msdn.microsoft.com/en-us/goglobal/bb964653.aspx
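
    The table idea mentioned above could look roughly like this. It is a sketch, not a complete solution: the struct and function names are made up for illustration, and the table holds only the handful of characters mentioned in this thread. The \u escapes keep the source file pure ASCII, so the result does not depend on how the editor saved the file:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* One table row: an accented character and its plain replacement. */
struct accent_map {
    wchar_t accented;
    wchar_t plain;
};

/* Illustrative entries only; extend with whatever the exercise needs. */
static const struct accent_map ACCENTS[] = {
    { L'\u00E7', L'c' },  /* c with cedilla */
    { L'\u00C7', L'C' },  /* C with cedilla */
    { L'\u00E1', L'a' },  /* a with acute   */
    { L'\u00E0', L'a' },  /* a with grave   */
    { L'\u00E3', L'a' },  /* a with tilde   */
    { L'\u00E9', L'e' },  /* e with acute   */
};

/* Return the plain counterpart of c, or c itself if it is not listed. */
wchar_t unaccent_char(wchar_t c)
{
    size_t count = sizeof ACCENTS / sizeof ACCENTS[0];
    for (size_t i = 0; i < count; ++i) {
        if (ACCENTS[i].accented == c)
            return ACCENTS[i].plain;
    }
    return c;
}

/* Replace every listed accented character in s, in place. */
void unaccent_string(wchar_t *s)
{
    assert(s != NULL);
    for (; *s != L'\0'; ++s)
        *s = unaccent_char(*s);
}
```

    In the program above, the loop over buff would then shrink to a single call to unaccent_string(buff). A linear search is fine for a table this small.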

    For completeness, here is what was found the last time this subject came up:
    http://savannah.nongnu.org/projects/unac/

    gg

  10. #10
    Registered User FernandoBasso's Avatar
    Join Date
    Oct 2011
    Location
    Brazil
    Posts
    45
    Quote Originally Posted by cas View Post
    Locale issues with C are a nightmare (locale issues in general are a nightmare).

    You can specify wide character constants with an L, e.g.: L'ã'; this has type wchar_t. You can read into an array of wchar_t with functions such as fgetws(), and then compare each element of that to your wide character constants.

    You'll probably want to do:
    Code:
    #include <locale.h>
    ...
    setlocale(LC_ALL, "");
    first.

    If the user isn't entering UTF-8 values, though, who knows what will happen? If he's using a Latin-2 terminal and enters the character ă (which has value 227 in Latin-2), I suspect your program will think it's an ã (which has value 227 in Latin-1 and is code point U+00E3 in Unicode).
    I tried using setlocale(LC_ALL, " "); but didn't have any luck. Thanks anyway.


    Codeplug,

    Thanks for the links (which I'm already reading and taking notes).

    About the code, I run Arch Linux and use gvim/xterm most of the time. The code didn't compile here.

    I'll keep trying.

    Thanks.
    Last edited by FernandoBasso; 11-02-2011 at 01:49 PM.

  11. #11
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> The code didn't compile here.
    What was the error?

    I compiled with: "gcc -Wall -std=c99 main.c"

    gg

  12. #12
    Registered User FernandoBasso's Avatar
    Join Date
    Oct 2011
    Location
    Brazil
    Posts
    45
    Quote Originally Posted by Codeplug View Post
    >> The code didn't compile here.
    What was the error?

    I compiled with: "gcc -Wall -std=c99 main.c"

    gg
    Oh, I was compiling with gcc -std=gnu99 utf8.c -o utf8.

    It compiled now.

    Thanks.
