I tried to printf("®") to the console but I get the « character instead. Anyone know how to do it?
I don't see that character listed here: Ascii Table - ASCII character codes and html, octal, hex and decimal chart conversion
So I'm guessing you're SOL.
"I am probably the laziest programmer on the planet, a fact with which anyone who has ever seen my code will agree." - esbo, 11/15/2008
"the internet is a scary place to be thats why i dont use it much." - billet, 03/17/2010
It's not an ASCII (i.e. 8-bit) character, it's Unicode. See here: Unicode Character 'REGISTERED SIGN' (U+00AE). There's even a C example there (among others).
Well, the problem is that you can't really display Unicode characters to the console. There are ways, but they are not portable. Linux's console uses UTF-8; Windows's uses UTF-16. Other systems may use other encodings. And even if you do output the bytes for the proper encoding of the character, you can only hope the console has a font that can display it.
That said, this works *for me* in Linux:
Code:
#include <stdio.h>

int main(void)
{
    printf("\xC2\xAE\n");  /* the two UTF-8 bytes of U+00AE */
    return 0;
}
Yes, of course it relies on what you're running it in (Unix shell, Windows command prompt), as well as the font being used. Most Unixes support Unicode very well out of the box, so the method in the example in the link I provided above should work without any modifications.
Windows Unicode command-line support is very bad, I think, from my experience having worked with it for a number of months on a project. I don't think the Windows command prompt uses UTF-16 (BE or LE) by default. I think it uses ASCII and "code pages", which is not Unicode (and therefore not any form of UTF). In addition, the default Windows command-line font does not really support Unicode, as you mentioned.
Isn't there support for wide characters in the standard with the wchar_t data type?
Never mind, I just saw that it's up to individual implementations.
I'm not sure how those code-page .......... work; I never code in Windows. But the one time I tried to figure it out, I managed to output the characters I wanted by outputting UTF-16. But maybe I was just lucky that it was in the right code page.
Damn, what idiot thinks of that crap?
It's possible you were able to print it "incorrectly", even though it appeared to work. That is, maybe the code page you were using, the encoding, the font, the decimal/hexadecimal value you used, etc., all lined up to print the exact character you wanted.
Code pages are basically different encoded subsets of the Unicode character set. So if the program prints the value 0x123 (say) and it should print the (Unicode) character "X" (say), it might work, depending on the variables mentioned above. In code page "A" it might print the correct character, if the encoded value 0x123 maps to the character X. In other code pages some other character, say "Y", might be mapped to that value.
There's also certain code pages in Windows that represent, say, UTF-8. This means that, when you're on that code page, you know exactly what value it must print. However, this kind of defeats the purpose of Unicode--you just want to print some character and not worry about "code pages". So code pages are slowly being phased out, so that when you want to print a (R), you just give its Unicode value, according to whatever UTF encoding you're using.
Last edited by nadroj; 10-31-2009 at 09:34 AM. Reason: grammar
Unicode (UTF16LE) output to the console is possible via WriteConsoleW. If you are using a recent MS CRT (at least VS 2008 I think) then you can call "_setmode(_fileno(stdout), _O_U16TEXT)", which will cause "wide" output to stdout to be written directly as UTF16LE (via WriteConsoleW).
None of this will help if the console isn't using a Unicode font like Lucida Console. But even that font doesn't support all Unicode characters. You can use charmap.exe to see which Unicode characters it does support.
gg
Last edited by Codeplug; 10-31-2009 at 11:04 AM.
Code:
#include <wchar.h>

int main(void)
{
    putwchar(0xAE);  /* wide character U+00AE */
    return 0;
}
It is too clear and so it is hard to see.
A dunce once searched for fire with a lighted lantern.
Had he known what fire was,
He could have cooked his rice much sooner.
That doesn't really work in general, but Linux will do clever things for you...
The locale determines what the multibyte representation is - that's the "C" locale by default. On Linux (glibc), the wide character representation is UTF-32[LE/BE]. Passing 0xAE to wcrtomb with a C locale results in "(R)" - which is fairly clever. If I call setlocale(LC_ALL, "") first, then the default user locale is used, which is typically UTF-8. Then the wcrtomb conversion (UTF-32 -> UTF-8) results in a "®".
On Windows, the MSCRT treats "C locale" characters as whatever the default console codepage is. In other words, multibyte character values are just indexes into the default codepage. Changing the locale (LC_CTYPE) under Windows is just changing what codepage to use. Wide characters in Windows are UTF16LE. Calling wcrtomb under the C locale will simply "assign" each wchar_t to a char. Any wchar_t greater than 0xFF results in an error. So in the end, we get a codepage index of 0xAE. My default codepage is 437 - http://msdn.microsoft.com/en-us/goglobal/cc305156.aspx As you can see, index 0xAE is "«", or U+00AB.
gg
So now the real question is: how do we portably write programs that support Unicode input/output to a console window?
Guess JAVA does have its uses... But that's MS's fault.
>> how do we portably write programs that support Unicode input/output
Portably speaking, you don't. Encoding is based on the locale which is typically abstracted away from the programmer. The only characters you can count on in any environment are the "basic character set" characters. This gives you A-Z, a-z, 0-9, "! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~", space, and the standard escape sequences (alert, tab, etc...). Those characters are available regardless of the current LC_CTYPE.
I believe the standard even allows the wide character representation to change when the LC_CTYPE is changed. In reality though, Windows uses UTF16LE and *nix (glibc) uses UTF32[LE/BE]. But as 7.19.3 - 12 describes, wide character output is converted (I think of it as "normalized") to its multibyte representation first.
The first assumption you could make is that the user's default locale can handle whatever wchar_t's you throw at it:
On Linux with a UTF8 locale and compatible terminal, U+00AE gets converted and written as bytes: "\xC2\xAE".
Code:
#include <wchar.h>
#include <locale.h>

int main()
{
    setlocale(LC_ALL, "");
    wprintf(L"\u00ae\n");
    return 0;
}
Windows does not support UTF8 as a locale's multibyte encoding. It only supports codepages. On Windows (with 2008 CRT), the above program gives me an "r". Another thing to understand under Windows is that there are two codepages being considered for output to the console. First, the UTF16 character is converted to the codepage character associated with the current locale. The default user locale uses the ansi-codepage (as returned by GetACP). For me that's 1252, which supports "®". However, the second codepage you must consider is the console codepage (as returned by GetConsole[Output]CP). The MSCRT will do one last conversion to this codepage before calling WriteFile on the standard output handle. For me, that's a conversion from 1252 to 437, or "®" to "r".
Based on this knowledge, your next attempt on Windows might be:
Now we've set both the input and output console CP to the ACP. (For some reason, the 2008 CRT uses the input CP for conversion before output...) This basically eliminates the secondary conversion before output - which results in a "®" on my system.
Code:
#include <wchar.h>
#include <locale.h>
#include <windows.h>

int main()
{
    setlocale(LC_ALL, "");
    SetConsoleOutputCP(GetACP());
    SetConsoleCP(GetACP());
    wprintf(L"\u00ae\n");
    return 0;
}
There is still a problem with this approach - you can only use characters supported by the ACP (and you can't change the ACP). For true Unicode support, you want to use the WriteConsoleW API and bypass any codepage conversions. This can be accomplished with the 2008 CRT with the following:
Here, U+00AE is sent directly to WriteConsoleW without conversion. Then you just need a console font that has a glyph for U+00AE.
Code:
#include <stdio.h>
#include <wchar.h>
#include <fcntl.h>
#include <io.h>

int main()
{
    _setmode(_fileno(stdout), _O_U16TEXT);
    wprintf(L"\u00ae\n");
    return 0;
}
gg