Thread: character set translation

  1. #1
    Registered User
    Join Date
    Mar 2009
    Posts
    48

    character set translation

    Hi

    I was trying to find out how escape sequences are processed in the different translation phases, mainly with respect to character sets. According to gcc, the whole compilation involves four stages: preprocessing, compilation proper, assembly and linking.

    K&R II says in A.12 Preprocessing:
    Code:
    Preprocessing itself takes place in several logically successive phases that may, in a particular implementation, be condensed.
    ......
    4. Escape sequences in character constants and string literals (Pars. A.2.5.2, A.2.6) 
    are replaced by their equivalents; then adjacent string literals are concatenated.
    ......
    What does "equivalents" mean here? Are they the members of the execution character set? With "gcc -E" I can't see that happening. For example, a "\n" in the source file comes out as 5c 6e, which is the UTF-8 encoding of '\' and 'n', not 0a.
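
    One observable consequence of the ordering in item 4 (escape sequences first, concatenation second) can be sketched in C. This is only an illustration; the names joined and check_phase_order are made up for the example. If concatenation happened first, the two literals below would form the single hex escape "\x41B" instead of two characters.

```c
#include <assert.h>
#include <stddef.h>

/* "\x41" is turned into one character before it is joined with "B",
   so the array holds 0x41, 'B' and the terminating NUL: 3 bytes. */
static const char joined[] = "\x41" "B";

/* Hypothetical helper: checks what the compiler actually stored. */
static void check_phase_order(void)
{
    assert(sizeof joined == 3);
    assert(joined[0] == 0x41);   /* value given by the hex escape */
    assert(joined[1] == 'B');    /* second literal kept separate  */
    assert(joined[2] == '\0');
}
```

    Calling check_phase_order() from main should pass silently on a conforming compiler.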

    Then I moved on to wide strings.

    The GCC 3.4.3 manual lists "-fexec-charset=charset" and "-fwide-exec-charset=charset" under Options Controlling the Preprocessor. But neither has any effect when I specify it with "gcc -E". The output of "gcc -E" is always UTF-8, which is the source character set used by CPP. For example, there is a wide string literal
    Code:
    L"中文"
    in my source file, and I preprocess it with
    Code:
    gcc -E -finput-charset=gb2312 -fwide-exec-charset=utf-32
    The encoding of "中文" in the output is still UTF-8. If I let gcc compile to the end, that is,
    Code:
    gcc  -finput-charset=gb2312 -fwide-exec-charset=utf-32
    or
    Code:
    gcc  -finput-charset=gb2312 -fwide-exec-charset=utf-16
    I do find the corresponding UTF-32 or UTF-16 encoding in the executable file.
    Then why does the gcc manual say those two options are "Options Controlling the Preprocessor"?

    So it seems that escape sequences are converted to UTF-8 at the very first step of CPP, just like the other characters in character constants and string literals, and are not replaced by their equivalents in the execution character set during preprocessing. Only after preprocessing is complete are the characters in character constants and string literals, escape sequences included, translated into members of the execution character set. I'm confused by what K&R II and the gcc manual say. Can anyone please explain the character set translation over the whole compilation process, especially for escape sequences? Thanks a lot!

  2. #2
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> But with "gcc -E", I can't get it.
    You'll never see characters encoded in the execution character set by looking at -E output. You have to compile the code, then inspect the bytes within the string literal.
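
    A minimal sketch of that inspection, assuming a C99 compiler. The helper names (dump_bytes, show_literals) are invented for this example, and the wide literal uses universal character names for the two characters from the question so the result doesn't depend on -finput-charset. The bytes printed for the wide literal depend on the size and byte order of wchar_t and on -fwide-exec-charset, so no output is shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Print n bytes starting at p as two-digit hex values. */
static void dump_bytes(const char *label, const void *p, size_t n)
{
    const unsigned char *b = p;
    assert(p != NULL);
    printf("%-8s", label);
    for (size_t i = 0; i < n; i++)
        printf(" %02x", b[i]);
    putchar('\n');
}

/* After compilation "\n" is one character plus the terminating NUL. */
static const char newline_literal[] = "\n";

/* U+4E2D U+6587 written as universal character names. */
static const wchar_t wide_literal[] = L"\u4e2d\u6587";

/* Dump both literals as they are stored in the compiled program. */
static void show_literals(void)
{
    dump_bytes("narrow", newline_literal, sizeof newline_literal);
    dump_bytes("wide", wide_literal, sizeof wide_literal);
}
```

    Call show_literals() from main, build once with and once without -fwide-exec-charset, and compare the "wide" lines.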

    >> Then why gcc manual says those two options are "Options Controlling the Preprocessor"?
    Well, just because an option doesn't alter -E output doesn't necessarily mean it doesn't belong in that section of the manual.

    To maximize the portability of your source code, you shouldn't use "extended characters" within string literals. (The upcoming C++ standard will support universal character names.)

    gg
