Unicode 2 byte wide characters

This is a discussion on Unicode 2 byte wide characters within the C Programming forums, part of the General Programming Boards category.

  1. #1
    Registered User
    Join Date
    Jan 2009
    Posts
    61

    Unicode 2 byte wide characters

    Hello,

    I was wondering about Unicode, specifically 2 byte characters. If i have an array of 2 byte unicode characters (each byte is an element), how would i be able to print these characters?

    for instance, the array looks mostly like this:

    00 a3 00 8c 00 b9 etc

    i.e. whenever i try to print it it thinks there is a null terminator and quits printing the string

    is there a 16 bit unsigned char type?

    i am really stuck, any light you could shed would be greatly appreciated

    thanks

  2. #2
    and the hat of wrongness Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    32,494
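    Use wchar_t
    As in
    wchar_t mybuffer[100];

    Use printf("%ls",mybuffer);
    %ls is the standard C99 spelling. Some libraries also accept %S (note the upper case) as a synonym.

    Though you'll need a modern compiler for this, and wchar_t is not 2 bytes everywhere (with GCC on Linux it is 4 bytes).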
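    A minimal sketch of the wchar_t route, assuming the text is already in a wchar_t array. format_wide is a made-up helper name, not a library function; it only wraps snprintf with the standard %ls conversion so the result can be checked.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Format a wide string into a narrow buffer with the standard %ls
   conversion. Returns the number of bytes the full result needs (not
   counting the terminator), or a negative value if a character cannot
   be converted in the current locale. */
int format_wide(char *out, size_t out_size, const wchar_t *text)
{
    assert(out != NULL && text != NULL);
    return snprintf(out, out_size, "%ls", text);
}
```

    In the default "C" locale only plain ASCII characters are guaranteed to convert, so a program that expects other characters would normally call setlocale(LC_ALL, "") first.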
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.
    I support http://www.ukip.org/ as the first necessary step to a free Europe.

  3. #3
    Registered User
    Join Date
    Jan 2009
    Posts
    61
    thanks for the reply, say if the bytes are locked up in a char array, is there a way to copy the bytes into this wchar_t array 1 byte at a time, or do you need to copy it two bytes at a time?

    thanks
    ps -- i wont use void main...

  4. #4
    Registered User
    Join Date
    Jan 2009
    Posts
    61
    reading at unicode.org it says it is O.K. to truncate 2 byte unicode (16bit big endian) code sequences to 1 byte because it will preserve characters c <= 127, like ascii.

    this would only stuff up glyphs right?

    also, i will be sending these strings to objective-c which as far as i know can only support UTF8 c strings.

    i think i will go for the lossy conversion rather than mess around with wide characters...

    thanks for your help
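    If the receiving side wants UTF-8, the 2 byte big-endian data can be converted without losing anything above 127. The following is only a sketch under that assumption; utf16be_to_utf8 is a hypothetical helper name, and input is taken to end at a 00 00 pair or at in_len, whichever comes first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert big-endian UTF-16 bytes into a NUL-terminated UTF-8 string.
   Returns the number of UTF-8 bytes written, not counting the
   terminator. Characters that no longer fit in out are dropped. */
size_t utf16be_to_utf8(const uint8_t *in, size_t in_len,
                       char *out, size_t out_cap)
{
    size_t i = 0, o = 0;
    assert(in != NULL && out != NULL && out_cap > 0);

    while (i + 1 < in_len) {
        uint32_t cp = ((uint32_t)in[i] << 8) | in[i + 1];
        i += 2;
        if (cp == 0)
            break;

        /* Join a surrogate pair into one code point above U+FFFF. */
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in_len) {
            uint32_t lo = ((uint32_t)in[i] << 8) | in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 2;
            }
        }
        /* A surrogate left unpaired is not a valid character. */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            cp = 0xFFFD;

        size_t need = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (o + need >= out_cap)
            break; /* keep room for the terminator */

        if (need == 1) {
            out[o++] = (char)cp;
        } else if (need == 2) {
            out[o++] = (char)(0xC0 | (cp >> 6));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else if (need == 3) {
            out[o++] = (char)(0xE0 | (cp >> 12));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (char)(0xF0 | (cp >> 18));
            out[o++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[o] = '\0';
    return o;
}
```

    Characters up to 127 come out as the same single byte they would have in ASCII, so a pure ASCII file name looks identical to the truncated version, while something like 00 FC becomes the two bytes C3 BC instead of being damaged.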

  5. #5
    Registered User
    Join Date
    Jan 2009
    Posts
    61
    when i print byte for byte this sequence:

    0x00 0x6C 0x00 0x6F 0x00

    the letters 'l' and 'o' are printed but when i sprintf the 0x6c then 0x6f i get 'l' and nothing.

    i then looked up 6f in ascii and its just nothing...

    i am totally stuck now

    but still, thank you for your replies thus far
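    For reference, 0x6F is the ASCII code for 'o', and any 0x00 byte handed to a %s style function ends the string there. A rough sketch of the lossy narrowing discussed above, under the assumption that only ASCII matters; utf16be_to_ascii is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Narrow big-endian 2 byte characters to plain ASCII. Code units above
   0x7F have no ASCII equivalent and are replaced with '?'. Stops at the
   00 00 terminator. Returns the number of characters written. */
size_t utf16be_to_ascii(const uint8_t *in, size_t in_len,
                        char *out, size_t out_cap)
{
    size_t i, o = 0;
    assert(in != NULL && out != NULL && out_cap > 0);

    for (i = 0; i + 1 < in_len && o + 1 < out_cap; i += 2) {
        unsigned unit = ((unsigned)in[i] << 8) | in[i + 1];
        if (unit == 0)
            break;
        out[o++] = unit <= 0x7F ? (char)unit : '?';
    }
    out[o] = '\0';
    return o;
}
```

    The point of the loop is that it consumes the bytes in pairs, so the 0x00 high bytes never reach the output string.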

  6. #6
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Can I ask why you are doing this and why you think the first element of a unicode character is 0x00 (because I'm pretty sure it isn't)?
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  7. #7
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,300
    Quote Originally Posted by davo666 View Post
    thanks for the reply, say if the bytes are locked up in a char array, is there a way to copy the bytes into this wchar_t array 1 byte at a time, or do you need to copy it two bytes at a time?
    You can drop the "locked up" from that sentence.
    If they are in a char array, then simply cast it to a wchar_t array (this assumes wchar_t is 2 bytes wide, the host is big-endian, and the array is suitably aligned):
    Code:
    char ch[] = {0x00, 0x6C, 0x00, 0x6F, 0x00, 0x00, 0x00};
    printf("%ls", (wchar_t*)ch);
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  8. #8
    Registered User
    Join Date
    Jan 2009
    Posts
    61
    ok, i will extrapolate:

    i am getting an OBEX packet delivered over Bluetooth. one of the packet fields is 'file name' - in 2 byte unicode characters.

    the problem is the array received is a void * (such that casting it to a uint16_t gets me the wrong first element of the array) so i have to cast it to uint8_t.

    as you can see now, i have 2 byte unicode characters in a single byte array such that

    00 6b 00 fc 00 73 00 00 <-- 2 byte null terminator

    is in the middle of an array but represents characters... i need to get them into a 2 byte char array on their own so i can write the filename and contents to disk.

    its a really hard task though...

    thanks

  9. #9
    Registered User
    Join Date
    Jan 2009
    Posts
    61
    i thought about the casting, but wouldnt that read 4 bytes at a time? also the bytes i need are 9 bytes in (i.e. the 8th element of the array)

  10. #10
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by davo666 View Post
    (such that casting it to a uint16_t gets me the wrong first element of the array) so i have to cast it to uint8_t.
    Dude, the first character cannot really be 0x00. Maybe you should look at this.

    I don't think you should use an unsigned datatype in the conversion. It may seem that in hex it's all the same, but I would observe that normal unicode is signed and mostly negative.

  11. #11
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,300
    Quote Originally Posted by davo666 View Post
    ok, i will extrapolate:

    i am getting an OBEX packet delivered over Bluetooth. one of the packet fields is 'file name' - in 2 byte unicode characters.

    the problem is the array received is a void * (such that casting it to a uint16_t gets me the wrong first element of the array) so i have to cast it to uint8_t.

    as you can see now, i have 2 byte unicode characters in a single byte array such that

    00 6b 00 fc 00 73 00 00 <-- 2 byte null terminator

    is in the middle of an array but represents characters... i need to get them into a 2 byte char array on their own so i can write the filename and contents to disk.
    You mean elaborate, not extrapolate.
    Are you working on a big-endian platform? If not, I believe you are casting an address that is off by one. There's nothing hard about this, it's easy, you just have to cast the right address to wchar_t*. You may have to cast it from void* to uint8_t* first, so that you can then add your 8 or 10 bytes etc, but then cast it to wchar_t*.
    If this won't give you the correct output, then please post the code you're trying to use.
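    To make the endianness question concrete, here is a small sketch that compares what a 16 bit cast would read at a byte offset with an explicit big-endian read. The three function names are hypothetical; memcpy stands in for the cast so the example also works at odd, unaligned offsets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 on a big-endian host, 0 on a little-endian one. */
int host_is_big_endian(void)
{
    const uint16_t probe = 0x0102;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 0x01;
}

/* What a uint16_t* cast effectively reads: two bytes in host order.
   memcpy avoids the alignment trap of a real cast. */
uint16_t load_u16_native(const void *base, size_t offset)
{
    uint16_t v;
    assert(base != NULL);
    memcpy(&v, (const uint8_t *)base + offset, sizeof v);
    return v;
}

/* Endian-independent read of a big-endian pair at the same offset. */
uint16_t load_u16_be(const void *base, size_t offset)
{
    const uint8_t *p = (const uint8_t *)base + offset;
    assert(base != NULL);
    return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}
```

    On a little-endian host the pair 00 6C read natively gives 0x6C00, which is one way to end up with a "wrong first element" even when the address is right.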

  12. #12
    Registered User
    Join Date
    Jan 2009
    Posts
    61
    the double casting worked, you are great!

    i still couldnt apply that to swprintf though :s but this is what i have so far:

    name_len - 3 is the number of bytes, including the null terminator, that the string goes for.

    here is a byte printout of the array (starting at payload[0]):

    0x82 0x00 0xD6 0x01 0x00 0x13 0x00 0x6C 0x00 0x6F 0x00 0x6C 0x00 0x2E 0x00 0x74 0x00 0x78 0x00 0x74 0x00 0x00

    the filename field starts at payload[6] (the 0x00 0x6C pair) and ends with the 2 byte null (the final 0x00 0x00).
    Code:
    wchar_t * file_name[200];
    
    	for(i = 0 ; i < name_len-3 ; i++){
    		wprintf(L"%C",(wchar_t *)((u8_t *)p->payload)[i+6]); //works
    
    		swprintf(&file_name[i], name_len-3 ,L"%C",(wchar_t *)((u8_t *)p->payload)[i+6]); //doesn't work...
    	}
    
    	wprintf(L"%S",file_name); //doesn't work
    thanks!

  13. #13
    CSharpener vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,484
    and why casting to pointer? isn't it wchar_t?
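    One possible way to fill the buffer, sketched under the assumptions stated in the post above (name starts at payload[6], name_len - 3 raw bytes including the terminator). copy_name is a hypothetical helper; the destination is an array of wchar_t values rather than an array of pointers, and each character is built from a byte pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Copy the 2 byte big-endian characters that start at payload[offset]
   into a wchar_t string. name_bytes counts the raw bytes including the
   00 00 terminator. Returns the number of characters stored, not
   counting the terminator. */
size_t copy_name(const void *payload, size_t offset, size_t name_bytes,
                 wchar_t *out, size_t out_cap)
{
    const uint8_t *src = (const uint8_t *)payload + offset;
    size_t i, n = 0;
    assert(payload != NULL && out != NULL && out_cap > 0);

    for (i = 0; i + 1 < name_bytes && n + 1 < out_cap; i += 2) {
        /* One character per byte pair, stored as a value, not a pointer. */
        wchar_t ch = (wchar_t)(((unsigned)src[i] << 8) | src[i + 1]);
        if (ch == L'\0')
            break;
        out[n++] = ch;
    }
    out[n] = L'\0';
    return n;
}
```

    With a declaration like wchar_t file_name[200]; the call would be copy_name(p->payload, 6, name_len - 3, file_name, 200), after which wprintf(L"%ls", file_name) should have a properly terminated string to print. This works whatever the size of wchar_t or the byte order of the host.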
    The first 90% of a project takes 90% of the time,
    the last 10% takes the other 90% of the time.

  14. #14
    CSharpener vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,484

  15. #15
    Registered User
    Join Date
    Jan 2009
    Posts
    61
    sorry, i have got it cast to (wchar_t) now but that still didnt help... im using GCC on an ARM platform, so msdn is the last thing i would read

    thanks
