Thread: Couple of questions about ASCII

  1. #16
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by phantomotap View Post
    The "GCC" documentation for whatever reason uses the term source character set instead of intermediate representation or similar.
    No, GCC deviates from the standard, but only by that very definition.

    Look, this is not at all complicated.

    Your input files use character set A. The C standards say that this is your source character set, but GCC does not follow that definition. GCC calls A the input character set, and allows you to define it using -finput-charset.

    The C standards call the character set used during execution the execution character set. Let's call that B. GCC allows you to split B into two: one for normal/narrow characters and strings (defined using -fexec-charset), and one for wide characters and strings (defined using -fwide-exec-charset).

    Now, the C99 standard section 5.2 defines all sorts of requirements for the source character set. These do not apply to the input character set for GCC, because the preprocessor always converts to UTF-8 encoding of Unicode, which GCC uses for all internal processing. That is why, for GCC, the source character set is always UTF-8.

    What does this matter in practice?

    1. With GCC, your input files can use any character set that your C library's iconv() supports and can convert to UTF-8. It does not need to fulfill the C99 requirements for the source character set. As long as it is convertible, you can use it with GCC.

      If your input is in a character set that fulfills the C99 requirements for a source character set, but your iconv() cannot convert it to UTF-8, GCC will not be able to process it.
    2. The very first thing the preprocessor does, even before splitting the input into lines or tokens, is convert it to UTF-8. This means that all intermediate output you obtain from GCC -- for example, from -E (which stops after the preprocessing stage and outputs the preprocessed source) or -S (which produces the assembly output of the compiled code) -- is always in UTF-8.

    The UTF-8 processing is integral to GCC. The name "source character set" is more correct than "intermediate character set" or something equivalent, because the requirements set in the C standards apply to stuff that is always in UTF-8 in GCC. They do not apply to the input character set, the character set used for your input files, because GCC converts to UTF-8.

    The deviation from the standard is that GCC does not use the source files as is, but converts them to UTF-8. In other words, GCC only supports UTF-8 as the source character set in the sense that the C standards define it; but it allows the programmer to transparently convert from any other character set (the input character set) to UTF-8.

    This is trivial to check for yourself. Save
    const char message[] = "€˝ś";
    as a source file, using something other than UTF-8. Then, run it through the preprocessor and the compiler using
    gcc -E -finput-charset=charset source.c
    gcc -S -finput-charset=charset source.c
    where charset is the character set you used for the file.

    You can even modify your locale settings (LANG and LC_ALL) to see that it does not depend on your current locale. Even -fexec-charset= only affects the encoding of the strings and character constants in the assembly source (where the strings and character constants are already converted to numeric format).

    The point: The preprocessed output and the assembly code itself will always use UTF-8. This is because UTF-8 is the source character set in the sense the C standards define it; GCC just provides a facility to transparently auto-convert any input file to that set.
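    Another quick way to see the execution character set side of this is to dump the bytes of a narrow string literal at run time. The sketch below is a minimal example (the file name dump.c is just a placeholder); built with the defaults it should print E2 82 AC for the Euro sign, while adding e.g. -fexec-charset=ISO-8859-15 (assuming your iconv() knows that charset name) should make it print the single byte A4.

    /* dump.c -- print the bytes GCC stored for a narrow string literal.
       Minimal sketch; save this file itself as UTF-8. */
    #include <stdio.h>

    static const char message[] = "€";

    int main(void)
    {
        /* sizeof includes the terminating '\0', so stop one byte early. */
        for (size_t i = 0; i + 1 < sizeof message; i++)
            printf("%02X ", (unsigned char)message[i]);
        printf("\n");
        return 0;
    }

    Compile and run it with, for example, gcc dump.c and then gcc -fexec-charset=ISO-8859-15 dump.c to compare the output.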

    Quote Originally Posted by Codeplug View Post
    Hard to imagine where my confusion arrived ... or how someone could get so upset about it.
    I'm wondering how you confuse the two, too. You obviously care little about whether you are correct, as long as you are perceived as being correct.

    I get upset, because I don't like people who fail to admit to their errors when in a technical discussion. It reduces the information to noise ratio, and may lead others astray. It is a typical social play by those who cannot stand honestly behind their words -- weasels and rats -- and I find it wasteful. There's enough garbage in the world already; don't add to it.

  2. #17
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    No, GCC deviates from the standard, but only by that very definition.
    O_o

    No. The "GCC" documentation deviates from the standard by that very definition.

    Your use of "but" is a silly attempt to downplay the misuse of terminology.

    I'm actually rather shocked that you are giving a pass to a misuse of a technical term considering your recent thread.

    Look, this is not at all complicated.
    Indeed.

    You are glossing over the misuse of terminology, defined in the standard, by the "GCC" documentation for no better reason than to blame Codeplug for making a mistake with a potential learning impact.

    As I suggested, your issue should be with the maintainers of the documentation.

    Now, the C99 standard section 5.2 defines all sorts of requirements for the source character set. These do not apply to the input character set for GCC, because the preprocessor always converts to UTF-8 encoding of Unicode, which GCC uses for all internal processing.
    Exactly!

    How are you complaining about Codeplug making a mistake when the documentation is at fault for explaining an extension with terminology defined in the standard?

    The various C and C++ standards require certain characteristics of basic source and basic execution character sets.

    The source character set is the character set in which the source file is written.

    A file stored with a character set which does not meet the requirements set by the standard is not a conforming source file.

    The "GCC" compilers relaxing the source character set requirements, by way of internal UTF8 conversion, is an extension.

    A file stored with a character set which does not meet the requirements set by the standard yet is supported by `iconv` is a file which depends on the "GCC" extension or compilers with a similar extension.

    The name "source character set" is more correct than "intermediate character set" or something equivalent, because the requirements set in the C standards apply to stuff that is always in UTF-8 in GCC.
    You referenced the relevant part of the standard yourself.

    You know the term "source character set" as used in the "GCC" documentation isn't in any way conforming.

    The "GCC" documentation could be written to correctly use the standard term while referencing the extension.

    You obviously care little about whether you are correct, as long as you are perceived as being correct.
    The posts by Codeplug use the standard conforming definition of the "source character set" term.

    You chose to use the "source character set" term as defined in the "GCC" documentation, a potentially intermediate UTF8 representation as converted with `iconv` from the character set in which the source file is written, which is limited to "GCC" and compilers which implement a similar extension.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  3. #18
    Registered User
    Join Date
    May 2013
    Posts
    228
    Quote Originally Posted by Nominal Animal View Post
    So, in short: character set is the agreement on how each character maps to a numeric value (code point), and character encoding is how those numeric values are mapped to bytes. If a character set has only a single encoding, then the same name is used for both the set and the encoding, and it means both things.
    But why not use the regular way of converting (decimal) numbers to bytes?
    For example, Unicode:
    Why not map €, which is 8364 as you stated, to 0010 0000 1010 1100, map א (again, with code point 1448, as you stated) to 0000 0101 1010 1000, and so on...

  4. #19
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    Quote Originally Posted by Absurd View Post
    But why not use the regular way of converting (decimal) numbers to bytes?
    For example, Unicode:
    Why not map €, which is 8364 as you stated, to 0010 0000 1010 1100, map א (again, with code point 1448, as you stated) to 0000 0101 1010 1000, and so on...
    They are? The bit patterns match the hex numbers that Nominal Animal wrote. It's just that in UTF-8 there might/should be a byte order mark, which is what the first byte is, and that should explain why they are all different.

    [edit]
    How much do you know about endianness?
    [/edit]
    Last edited by whiteflags; 08-28-2015 at 12:06 PM.

  5. #20
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> You can even modify your locale settings (LANG and LC_ALL) to see that it does not depend on your current locale.
    I assume you mean when using -finput-charset - then your locale doesn't matter.

    If you don't use -finput-charset, and your source file does not have a BOM, then GCC will assume the file is encoded according to the current locale. If that fails (like on Windows), then it assumes the file is UTF-8 encoded.
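    For reference, "has a BOM" here means the file starts with the three bytes EF BB BF, the UTF-8 encoding of U+FEFF. A minimal sketch for checking that yourself (the file name source.c is just an example):

    /* bomcheck.c -- report whether a file starts with a UTF-8 BOM (EF BB BF). */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("source.c", "rb");
        if (!f) { perror("fopen"); return 1; }

        unsigned char head[3] = {0, 0, 0};
        size_t n = fread(head, 1, sizeof head, f);
        fclose(f);

        if (n == 3 && head[0] == 0xEF && head[1] == 0xBB && head[2] == 0xBF)
            printf("UTF-8 BOM present\n");
        else
            printf("no UTF-8 BOM\n");
        return 0;
    }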

    gg

  6. #21
    Registered User
    Join Date
    May 2013
    Posts
    228
    Quote Originally Posted by whiteflags View Post
    They are? The bit patterns match the hex numbers that Nominal Animal wrote. It's just that in UTF-8 there might/should be a byte order mark, which is what the first byte is, and that should explain why they are all different.

    [edit]
    How much do you know about endianness?
    [/edit]
    I know about endianness, but I don't see the relevance here.
    Maybe I'm missing something crucial here, so bear with me.
    Say you have N characters (code points) in your character set. Let M be the smallest multiple of 8 that is bigger than log2(N).
    Now encode every character as an M/8-byte number, using the ordinary decimal-to-binary conversion.
    For example, say N=1500, then M=16, and so you can use a 2-byte integer ("unsigned short") for encoding.

  7. #22
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    For example, say N=1500, then M=16, and so you can use a 2-byte integer ("unsigned short") for encoding.
    So again, say your bytes are 0xAC and 0x20. There's the order I wrote them in, and the swapped order (0x20 0xAC). Either order could be meant to represent the same number, so it is part of the encoding's job to say how the bytes are stored in terms of endianness.

    Details on why endianness matters in Unicode are available here: https://en.wikipedia.org/wiki/Byte_order_mark.
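    If you want to see which order your own machine uses, you can inspect the bytes of a 16-bit value directly. A minimal sketch (0x20AC, the Euro sign's code point, is just an example value):

    /* endian.c -- show how a 16-bit value is laid out in this machine's memory. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint16_t value = 0x20AC;
        const unsigned char *bytes = (const unsigned char *)&value;

        /* Little-endian machines print "AC 20", big-endian machines "20 AC". */
        printf("%02X %02X\n", bytes[0], bytes[1]);
        return 0;
    }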

  8. #23
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    And just as a correction, since I realize how badly I messed it up... while everything I said about BOM should be true -

    0xE2 in the example UTF-8 character, the euro, is not a BOM. It is a byte which "indicates the number of bytes to follow in a multibyte sequence," and so multibyte characters in UTF-8 carry one more byte than may be strictly necessary. The actual reason is that the code unit in UTF-8 is 8 bits: you need some way to know how long each code point's encoding is, to prevent the code points from overlapping or being ambiguous. UTF-8 specifically chose this way because it allows for faster processing (source: Section 2.5 of the Unicode standard).

    While you could easily come up with the correct bit pattern for a character yourself, if you knew the code point, you also need to know that every byte of UTF-8 has some bits defined. I thought about dumping a table here, but you are better off looking at an existing one.

    So when we look at the euro encoded in UTF-8, we can see which bits carry the actual information:
    0xE2 1110 0010
    0x82 10 000010
    0xAC 10 101100

    Strip the 1110 prefix from the lead byte and the 10 prefix from each continuation byte, then concatenate what is left: 0010 000010 101100. That is exactly 0x20AC, the code point of U+20AC.
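    Following that bit layout, you can assemble the three bytes yourself. The sketch below hand-encodes a code point in the three-byte range (0x0800 to 0xFFFF, surrogates excluded) and should print E2 82 AC for U+20AC; it is an illustration of that one case, not a full encoder.

    /* utf8_3byte.c -- hand-encode a code point in the three-byte UTF-8 range. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t cp = 0x20AC;                      /* EURO SIGN */
        unsigned char b[3];

        b[0] = 0xE0 | ((cp >> 12) & 0x0F);         /* 1110xxxx: top 4 bits    */
        b[1] = 0x80 | ((cp >> 6)  & 0x3F);         /* 10xxxxxx: middle 6 bits */
        b[2] = 0x80 | ( cp        & 0x3F);         /* 10xxxxxx: low 6 bits    */

        printf("%02X %02X %02X\n", b[0], b[1], b[2]);   /* prints: E2 82 AC */
        return 0;
    }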

  9. #24
    Registered User
    Join Date
    May 2013
    Posts
    228
    OK, so correct me if I'm wrong:
    From the table above, I understand that the first byte implies "how many bytes further to expect".
    For example:
    If the first byte is: 110xxxxx, then I should expect a two-byte-length character (one more to follow),
    If the first byte is: 11110xxx, then I should expect a four-byte-length character (three more to follow), and so on...

    Again, correct me if the following is wrong: The reason we encode it with a variable-length byte encoding, rather than with a fixed length byte encoding, is merely for saving memory space.
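    If that reading is right, the "how many bytes" rule could be sketched like this (the function name is made up for illustration):

    /* utf8_len.c -- total length of a UTF-8 sequence, from its first byte. */
    #include <stdio.h>

    static int utf8_sequence_length(unsigned char first)
    {
        if (first < 0x80) return 1;   /* 0xxxxxxx: ASCII, stands alone    */
        if (first < 0xC0) return 0;   /* 10xxxxxx: continuation byte      */
        if (first < 0xE0) return 2;   /* 110xxxxx: one more byte follows  */
        if (first < 0xF0) return 3;   /* 1110xxxx: two more bytes follow  */
        if (first < 0xF8) return 4;   /* 11110xxx: three more follow      */
        return 0;                     /* not a valid lead byte            */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               utf8_sequence_length(0x41),    /* 'A' -> 1                       */
               utf8_sequence_length(0xE2),    /* lead byte of the euro -> 3     */
               utf8_sequence_length(0xF0));   /* lead byte of a 4-byte seq -> 4 */
        return 0;
    }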

  10. #25
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Quote Originally Posted by Absurd View Post
    I know about endianness, but I don't see the relevance here.
    Maybe I'm missing something crucial here, so bear with me.
    Say you have N characters (code points) in your character set. Let M be the smallest multiple of 8 that is bigger than log2(N).
    Now encode every character as an M/8-byte number, using the ordinary decimal-to-binary conversion.
    For example, say N=1500, then M=16, and so you can use a 2-byte integer ("unsigned short") for encoding.
    That is one possible encoding system, but you still need to be careful with endianness; a 2-byte integer on your machine and the same 2-byte integer on my machine might have their bytes stored in reverse order. It's important that my computer be able to read files you created and vice versa.

    However, this is not the only possible encoding system.
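    If you did adopt that fixed two-byte scheme, a portable file format would have to pin down the byte order explicitly rather than just dumping unsigned shorts from memory. A minimal sketch of serializing code points in a chosen (big-endian) order:

    /* fixed16.c -- serialize 16-bit code points with an explicit byte order. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        const uint16_t text[] = { 0x05D0, 0x20AC };   /* alef, euro sign */
        const size_t count = sizeof text / sizeof text[0];

        for (size_t i = 0; i < count; i++) {
            /* Explicit shifts, so the byte order is the same on every machine. */
            unsigned char hi = (text[i] >> 8) & 0xFF;
            unsigned char lo =  text[i]       & 0xFF;
            printf("%02X %02X ", hi, lo);
        }
        printf("\n");   /* prints: 05 D0 20 AC */
        return 0;
    }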
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.

  11. #26
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Quote Originally Posted by Absurd View Post
    Again, correct me if the following is wrong: The reason we encode it with a variable-length byte encoding, rather than with a fixed length byte encoding, is merely for saving memory space.
    Not "merely". There's another important benefit of UTF-8 - if you were working with data that is purely encodable in ASCII, then the UTF-8 and the ASCII representations are identical, and no conversion is necessary to deal with them.
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.

  12. #27
    Registered User
    Join Date
    May 2013
    Posts
    228
    Quote Originally Posted by Cat View Post
    That is one possible encoding system, but you still need to be careful with endianness; a 2-byte integer on your machine and the same 2-byte integer on my machine might have their bytes stored in reverse order. It's important that my computer be able to read files you created and vice versa.

    However, this is not the only possible encoding system.
    OK, maybe I should be clear about what I meant by 'relevance' in this context:
    Shouldn't endianness be a concern regardless of the data you save (given that the data might be composed of objects that are bigger than 1 byte)?
    I mean, what's so special about character encoding in particular that requires our attention?
    For example, endianness is also an issue when you handle shorts, signed/unsigned ints, longs, floats, doubles and so on...

  13. #28
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    Consider... Are you talking about money, or just writing in Korean?

    갠 U+AC20 HANGUL SYLLABLE GAEN,
    € U+20AC EURO SIGN,

    UTF-8 is rather special: its code unit is a single byte and its single-byte range is exactly the ASCII character set, so it is free of endianness issues. It is easy to detect when you are dealing with UTF-8, but I have heard programmers say that to detect Unicode, they only look for a BOM. No BOM - no Unicode. Or they just assume that a random text stream without a BOM is actually UTF-8 (which is true on Linuxes).
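    The 갠/€ pair above is easy to reproduce: the same two bytes decode to different characters depending on whether you read them as UTF-16LE or UTF-16BE. A minimal sketch that just prints the two interpretations as code points (both characters are in the BMP, so the 16-bit code unit is the code point):

    /* utf16_order.c -- the same two bytes read with the two UTF-16 byte orders. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned char bytes[2] = { 0xAC, 0x20 };

        unsigned le = bytes[0] | (bytes[1] << 8);   /* little-endian: U+20AC EURO SIGN            */
        unsigned be = (bytes[0] << 8) | bytes[1];   /* big-endian:    U+AC20 HANGUL SYLLABLE GAEN */

        printf("as UTF-16LE: U+%04X\n", le);
        printf("as UTF-16BE: U+%04X\n", be);
        return 0;
    }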
    Last edited by whiteflags; 08-30-2015 at 01:20 PM.

  14. #29
    Registered User
    Join Date
    May 2013
    Posts
    228
    Quote Originally Posted by whiteflags View Post
    Or they just assume that a random text stream without a BOM is actually UTF-8 (which is true on Linuxes).
    Is that why, when I save a Hebrew text file, the BOM is missing?

    [Attached image: hebtext.png]

  15. #30
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Can you post the text which that file represents? That will help to determine the actual encoding of that file.

    gg

