character set translation
I am trying to find out how escape sequences are processed in the different translation phases, mainly with respect to character sets. According to the gcc manual, the whole compilation involves four stages: preprocessing, compilation proper, assembly, and linking.
K&R II says in A.12 Preprocessing:
Preprocessing itself takes place in several logically successive phases that may, in a particular implementation, be condensed.
4. Escape sequences in character constants and string literals (Pars. A.2.5.2, A.2.6) are replaced by their equivalents; then adjacent string literals are concatenated.
What does "equivalents" mean here? Are they the members of the execution character set? With "gcc -E" I can't see that happen. For example, a "\n" in the source file comes out as 5c 6e, which is the UTF-8 encoding of '\' and 'n', not 0a.
Then I moved on to wide strings.
The GCC 3.4.3 manual lists "-fexec-charset=charset" and "-fwide-exec-charset=charset" under Options Controlling the Preprocessor. But neither has any effect when I specify it with "gcc -E". The output from "gcc -E" is always UTF-8, which is the source character set used by CPP. For example, there is a wide string literal containing "中文" in my source file, and I use
gcc -E -finput-charset=gb2312 -fwide-exec-charset=utf-32
to preprocess it. The encoding of "中文" is still UTF-8. If I let gcc compile to the end, that is,
gcc -finput-charset=gb2312 -fwide-exec-charset=utf-32
gcc -finput-charset=gb2312 -fwide-exec-charset=utf-16
I can get the corresponding UTF-32 or UTF-16 encoding from the executable file.
Then why does the gcc manual say those two options are "Options Controlling the Preprocessor"?
So it seems that escape sequences are converted to UTF-8 in the very first step of CPP, just like the other characters in character constants and string literals, and are not replaced by their equivalents in the execution character set during preprocessing. Only after preprocessing is complete are the characters in character constants and string literals, escape sequences included, translated into members of the execution character set. I'm confused by what K&R II and the gcc manual say. Can anyone please explain how character set translation works across the whole compilation process, especially for escape sequences? Thanks a lot!