Originally Posted by
phantomotap
The "GCC" documentation for whatever reason uses the term source character set instead of intermediate representation or similar.
No, GCC deviates from the standard, but only by that very definition.
Look, this is not at all complicated.
Your input files use character set A. The C standards say that this is your source character set, but GCC does not follow that definition. GCC calls A the input character set, and allows you to define it using -finput-charset.
The C standards call the character set used during execution the execution character set. Let's call that B. GCC allows you to split B into two: one for normal/narrow characters and strings (defined using -fexec-charset), and one for wide characters and strings (defined using -fwide-exec-charset).
Now, the C99 standard section 5.2 defines all sorts of requirements for the source character set. These do not apply to the input character set for GCC, because the preprocessor always converts to UTF-8 encoding of Unicode, which GCC uses for all internal processing. That is why, for GCC, the source character set is always UTF-8.
What does this matter in practice?
- With GCC, you can use any character set for your input files, as long as your C library iconv() supports it and can convert it to UTF-8. It does not need to fulfill the C99 requirements for the source character set.
If your input is in a character set that fulfills the C99 requirements for a source character set, but your iconv() cannot convert it to UTF-8, GCC will not be able to process it.
- The very first thing the preprocessor does, even before splitting the input into lines or tokens, is convert it to UTF-8. This means that all intermediate output you obtain from GCC -- for example, with -E (which stops after the preprocessor stage, outputting the preprocessed input) or -S (which provides the assembly output of the compiled code) -- is always in UTF-8.
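The conversion step described in the two points above can be tried outside the compiler with iconv() directly. The following is only a sketch of the idea, not GCC's own code; to_utf8 is a hypothetical helper name, and ISO-8859-15 in the usage note is just an example that assumes your iconv() knows that character set.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of GCC): convert inlen bytes of in,
 * encoded in charset, to UTF-8 in out (capacity outcap).
 * Returns the number of bytes written, or (size_t)-1 on failure.
 */
size_t to_utf8(const char *charset, const char *in, size_t inlen,
               char *out, size_t outcap)
{
    assert(charset != NULL && in != NULL && out != NULL);

    /* Fails if iconv() cannot convert from charset to UTF-8. */
    iconv_t cd = iconv_open("UTF-8", charset);
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    /* iconv() takes char ** for the input, but does not modify the bytes. */
    char  *src     = (char *)in;
    char  *dst     = out;
    size_t srcleft = inlen;
    size_t dstleft = outcap;

    size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
    iconv_close(cd);

    /* Invalid input or an output buffer that is too small. */
    if (rc == (size_t)-1)
        return (size_t)-1;

    return outcap - dstleft;
}
```

Calling to_utf8("ISO-8859-15", "\xA4", 1, out, sizeof out) turns the single byte 0xA4 (the euro sign in ISO-8859-15) into the three UTF-8 bytes E2 82 AC. If iconv_open() fails for a given character set, that is the same situation in which GCC cannot process an input file in that set.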
The UTF-8 processing is integral to GCC. The name "source character set" is more correct than "intermediate character set" or something equivalent, because the requirements set in the C standards apply to stuff that is always in UTF-8 in GCC. They do not apply to the input character set, the character set used for your input files, because GCC converts to UTF-8.
The deviation from the standard is that GCC does not use the source files as is, but converts them to UTF-8. In other words, it only supports UTF-8 as the source character set in the sense that the C standards define it; but it allows the programmer to transparently convert from any other character set (the input character set) to UTF-8.
This is trivial to check for yourself. Save
const char message[] = "€˝ś";
as a source file, using something other than UTF-8. Then preprocess it, or compile it to assembly, using
gcc -E -finput-charset=charset source.c
gcc -S -finput-charset=charset source.c
where charset is the character set you used for the file.
You can even modify your locale settings (LANG and LC_ALL) to see that it does not depend on your current locale. Even -fexec-charset= only affects the encoding of the strings and character constants in the assembly source (where the strings and character constants are already converted to numeric format).
The point: The preprocessed output and the assembly code itself will always use UTF-8. This is because UTF-8 is the source character set in the way the C standards define it; GCC just provides a facility to auto-convert any input file transparently to this set.
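To see the execution character set side of this from inside a program, you can dump the bytes of a string literal. A minimal sketch, assuming GCC's default narrow execution character set (UTF-8); hex_bytes is a hypothetical helper, and the literal uses a universal character name so the file itself is plain ASCII and does not depend on -finput-charset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* U+20AC EURO SIGN, written as a universal character name (C99). */
const char message[] = "\u20AC";

/*
 * Hypothetical helper: write the n bytes at s into buf as space-separated
 * uppercase hex. Returns the number of characters written (excluding the
 * terminating NUL), or 0 if n is 0 or buf is too small.
 */
size_t hex_bytes(const char *s, size_t n, char *buf, size_t cap)
{
    assert(s != NULL && buf != NULL);

    if (cap == 0)
        return 0;
    buf[0] = '\0';

    size_t used = 0;
    for (size_t i = 0; i < n; i++) {
        int w = snprintf(buf + used, cap - used, "%s%02X",
                         i ? " " : "", (unsigned)(unsigned char)s[i]);
        if (w < 0 || (size_t)w >= cap - used)
            return 0;   /* output would not fit */
        used += (size_t)w;
    }
    return used;
}
```

With the default execution character set, hex_bytes(message, sizeof message - 1, buf, sizeof buf) fills buf with "E2 82 AC". Recompiling with a different -fexec-charset (for example ISO-8859-15, if your iconv() supports it) should change those bytes, while the -E output still shows the source in UTF-8.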
Originally Posted by
Codeplug
Hard to imagine where my confusion arrived ... or how someone could get so upset about it.
I'm wondering how you confuse the two, too. You obviously care little about whether you are correct, as long as you are perceived as being correct.
I get upset, because I don't like people who fail to admit to their errors when in a technical discussion. It reduces the information to noise ratio, and may lead others astray. It is a typical social play by those who cannot stand honestly behind their words -- weasels and rats -- and I find it wasteful. There's enough garbage in the world already; don't add to it.