Thread: Reversing a string

  1. #31
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    945
    Quote Originally Posted by phantomotap View Post
    A compiler is free to use UTF-8 for wide strings if the target environment's native character set and locale can be so represented. Several real-world compilers do use UTF-16 for wide strings. (The problems referenced for UTF-8 apply to UTF-16.)

    As far as "never going to be in UTF8" goes, the input encoding doesn't really matter if the vendor has supplied no transformation from the relevant encoding to the compiler's wide character set. The crucial bits of the standard only require such a transformation for the largest natively supported extended character set.

    In other words, the sequence L"Hello, World!", when converted by a conforming compiler supporting the "en_US" locale to a wide character sequence, should never be a problem. However, if a compiler uses a 16-bit `wchar_t', there is simply no way to map the entire UNICODE character set to a single wide character representation, so a multibyte sequence must remain a multibyte sequence.
    Are you sure wchar_t can be UTF-8? That seems to be in direct violation of the standard:
    wchar_t

    which is an integral type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales
    (from 4.1.5 Common definitions <stddef.h>)

    The standard clearly says that a single wchar_t can represent all character codes in a character set. It doesn't say that multiple wchar_t's can be used to represent a single character code. UTF-8 is a variable-length encoding that requires between 1 and 4 octets to represent a single character code, so if wchar_t were in UTF-8 it would likely be defined as an 8-bit type (any wider would be wasteful), and multiple wchar_t's would be required to contain a single character in some cases. That is a direct violation of the standard.
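    The 1-to-4-octet variability follows directly from UTF-8's range boundaries. A minimal C sketch, purely for illustration (the boundary constants come from the UTF-8 definition, not from any particular implementation):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Number of octets UTF-8 needs to encode a given Unicode code point. */
    static int utf8_len(uint32_t cp)
    {
        if (cp < 0x80)    return 1;  /* U+0000..U+007F: the ASCII range */
        if (cp < 0x800)   return 2;  /* U+0080..U+07FF                  */
        if (cp < 0x10000) return 3;  /* U+0800..U+FFFF: rest of the BMP */
        return 4;                    /* U+10000..U+10FFFF               */
    }

    int main(void)
    {
        assert(utf8_len(0x41)    == 1);  /* 'A'                  */
        assert(utf8_len(0xE9)    == 2);  /* U+00E9, 'e' acute    */
        assert(utf8_len(0x20AC)  == 3);  /* U+20AC, euro sign    */
        assert(utf8_len(0x10000) == 4);  /* first code point above the BMP */
        return 0;
    }
    ```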

    UTF-16 is also a multibyte character encoding. Systems that use UTF-16 values in their wchar_t are also violating the standard in spirit, if not in letter, if they support surrogate pairs to contain a single character code. If a system uses UCS-2 or UTF-16 for wchar_t but does not support surrogate pairs (essentially a subset of UCS-2 that is defined only for codes 0x0000-0xD7FF and 0xE000-0xFFFF), then it can support only a subset of Unicode, but that is fine as far as the C standard is concerned. Likewise, a system that uses UTF-8 for wchar_t but supports only single-byte encoding sequences has a limited character set (essentially the ASCII encoding), but that too is fine by the standard.
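    For reference, the surrogate-pair mechanics mentioned above look like this. A small C sketch (the constants are from the UTF-16 encoding form itself, not from any particular wchar_t implementation):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Split a code point above the BMP into a UTF-16 surrogate pair:
       one character code, TWO 16-bit units. */
    static void to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
    {
        cp -= 0x10000;                /* 20 bits remain                 */
        *hi = 0xD800 | (cp >> 10);    /* high (lead) surrogate: top 10  */
        *lo = 0xDC00 | (cp & 0x3FF);  /* low (trail) surrogate: low 10  */
    }

    int main(void)
    {
        uint16_t hi, lo;
        to_surrogates(0x10000, &hi, &lo);  /* first supplementary code point */
        assert(hi == 0xD800 && lo == 0xDC00);
        return 0;
    }
    ```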


    Any conforming implementation that supports the full Unicode character set must have a wchar_t type of at least 21 bits wide. Full stop.
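    As a quick sanity check on the "at least 21 bits" figure (an illustrative sketch, nothing more):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Count the bits needed to hold a value. 0x10FFFF is the highest
       Unicode code point. */
    static int bits_needed(uint32_t v)
    {
        int bits = 0;
        while (v >> bits)
            bits++;
        return bits;
    }

    int main(void)
    {
        assert(bits_needed(0x10FFFF) == 21);  /* hence "at least 21 bits" */
        assert(bits_needed(0xFFFF)   == 16);  /* the BMP fits in 16 bits  */
        return 0;
    }
    ```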

  2. #32
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Are you sure wchar_t can be UTF-8?
    Yes.

    What you aren't getting, and you are correct in your interpretation otherwise, is that the phrase "largest extended character set specified among the supported locales" trumps the rest.

    A compiler and library implementation can conform to the C standard and support UNICODE without supporting all locales.

    A compiler only has to meet the requirements you are talking about for the locales it claims to support, which is often only the "C" locale.

    That is a direct violation of the standard.
    No. It is not.

    Reaching beyond the supported locales leads to implementation defined results.

    The standard specifically allows implementation defined results; such things marked implementation defined are not violations of the standard.

    Any conforming implementation that supports the full Unicode character set must have a wchar_t type of at least 21 bits wide.
    Nope.

    You are confusing the notions of supporting UNICODE with supporting all locales specified by the UNICODE standard annexes.

    A compiler that only supports "en_US" must only provide a range suitable for that locale to conform to the standard.

    If you reach beyond that locale, the results are implementation defined.

    Soma

  3. #33
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    945
    Quote Originally Posted by phantomotap View Post
    What you aren't getting, and you are correct in your interpretation otherwise, is that the phrase "largest extended character set specified among the supported locales" trumps the rest.

    A compiler and library implementation can conform to the C standard and support UNICODE without supporting all locales.
    I understand this. What you don't seem to understand is that UTF-8 is not even a character set. It is a character encoding. There is a big difference. So you cannot meaningfully use UTF-8 as an extended character set anyway. (Many people confuse character sets and character encodings, including Microsoft developers, which may explain why they try to use a character encoding (UTF-16) as a character set for their wchar_t type).


    An implementation can have an 8-bit or 16-bit wchar_t, but the "largest extended character set" of such an implementation will be limited to 256 or 65536 characters. Otherwise how could a single wchar_t represent each character in the supported set with a distinct code? Anything else will be in violation of the requirement that wchar_t be capable of representing "distinct codes for all members of the largest extended character set specified among the supported locales".

    Reaching beyond the supported locales leads to implementation defined results.

    The standard specifically allows implementation defined results; such things marked implementation defined are not violations of the standard.
    I understand that implementation-defined results are valid where they are marked as implementation-defined. However, the only relevant implementation-defined aspect of wchar_t that I found in the standard (besides the "implementation-defined current locale" part) is this part:
    The value of a wide character constant containing [...] a multibyte character or escape sequence not represented in the extended execution character set, is implementation-defined.

    How does a wchar_t object get set to a value that is not represented in the implementation's extended character set? The programmer can surely set a wchar_t object to an invalid value (such as 0x7FFFFFFF, which is not represented in Unicode, or 0xD800, which is not valid in UTF-16 except as part of a surrogate pair), but this will not happen when converting a multibyte character to a character in the extended character set. The functions mbtowc, mbrtowc, fgetwc, and others will not convert a multibyte character into a wide character that is not represented in the extended execution character set (mbrtowc, for instance, returns (size_t)-1 and sets errno to EILSEQ in that case), and none of those functions can even convert a multibyte character into more than a single wide character.
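    The "many bytes in, exactly one wide value out" shape of that conversion is easy to see from the core arithmetic an mbtowc-style routine performs. A sketch for a 3-byte UTF-8 sequence (illustrative only; no validation, and not any real implementation's code):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Decode a 3-byte UTF-8 sequence (lead byte 1110xxxx, two 10xxxxxx
       continuation bytes) into a single code point: three input bytes
       collapse into ONE wide value. */
    static uint32_t decode_utf8_3(const unsigned char *s)
    {
        return ((uint32_t)(s[0] & 0x0F) << 12)
             | ((uint32_t)(s[1] & 0x3F) << 6)
             |  (uint32_t)(s[2] & 0x3F);
    }

    int main(void)
    {
        const unsigned char euro[] = { 0xE2, 0x82, 0xAC };  /* U+20AC */
        assert(decode_utf8_3(euro) == 0x20AC);
        return 0;
    }
    ```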

    Of course, you could argue that an implementation that uses a 16-bit wchar_t type doesn't really support an extended character set larger than 65536 characters, and that a multibyte character with a code value outside this range (say, U+10000) might be translated to multiple wide characters (I couldn't find anything in the standard to support this behavior; perhaps you can point out the relevant standard text?). But then those characters are not really in the extended character set that is supported by the implementation in the first place, and we cannot expect meaningful or portable results if we try to do anything with them. So any character beyond 0xFFFF (or 0xD7FF in a system with surrogates) should be considered to be unsupported, or at the very least, implementation-defined.

    You are confusing the notions of supporting UNICODE with supporting all locales specified by the UNICODE standard annexes.
    Note that I said the full Unicode character set, which contains up to 0x110000 code points, including the null character. An implementation may support a limited subset of the Unicode character set--perhaps only the BMP, which fits in 16 bits, or only the first 256 characters, which fit in 8 bits (converting between wc and mb is straightforward if the mb is encoded in ISO-8859-1), or even a completely non-Unicode character set like EBCDIC, which also fits in 8 bits.
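    The ISO-8859-1 case really is trivial, since its 256 byte values coincide with the first 256 Unicode code points. A sketch (assuming a hypothetical implementation whose multibyte encoding is Latin-1):

    ```c
    #include <assert.h>
    #include <wchar.h>

    /* For ISO-8859-1 the mb -> wc conversion is a plain zero-extension,
       because byte value N is Unicode code point U+00NN. */
    static wchar_t latin1_to_wc(unsigned char b)
    {
        return (wchar_t)b;
    }

    int main(void)
    {
        assert(latin1_to_wc(0x41) == L'A');
        assert(latin1_to_wc(0xE9) == 0xE9);  /* 'e' acute, same code in both */
        return 0;
    }
    ```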

  4. #34
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    What you don't seem to understand is that UTF-8 is not even a character set. It is a character encoding.
    O_o

    You can't point to a single comment I've made that would even imply such confusion on my part.

    You on the other hand keep saying "largest extended character set" when that is clearly and obviously omitting context.

    Considering that you yourself posted the relevant part of the standard, I'd say it is a fair bet that you are just trolling, so I'm out.

    Soma

  5. #35
    TEIAM - problem solved
    Join Date
    Apr 2012
    Location
    Melbourne Australia
    Posts
    1,907
    I'd say it is a fair bet that you are just trolling so I'm out
    Don't let them get your goats :P
    Fact - Beethoven wrote his first symphony in C

  6. #36
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Don't let them get your goats
    Epic!

    I was just watching the "Boogie Frights" episode of "Powerpuff Girls".

    [Edit]
    I found a link.

    powerpuff girls - boogie frights - YouTube

    You must watch a few seconds of it.
    [/Edit]

    Soma

  7. #37
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    945
    Quote Originally Posted by phantomotap View Post
    You can't point to a single comment I've made that would even imply such confusion on my part.
    Sure I can. You claimed that a wchar_t can be in UTF-8.

    That combined with the requirement that wchar_t "represent distinct codes for all members of the largest extended character set" suggested that you were saying that UTF-8 is one such character set.

    Using UTF-8 for wchar_t also makes little sense for these reasons:

    a. UTF-8 is defined as a variable-width encoding (of the Unicode character set) using a stream of octets (between 1 and 4 octets).
    b. A single wchar_t must be able to represent all codes in a character set (whichever character set in whichever locale the implementation supports).
    c. Even if you managed to pack a variable-width encoding like UTF-8 into a 32-bit wchar_t (perhaps by zero-padding the UTF-8 sequence), this packed UTF-8 sequence is still a single wchar_t (see previous point) and the program can still deal with it as a single unit, without regard to the actual significance of each code (a conforming program should use wide-character-classification functions/macros such as iswdigit if it needs to determine character types).
    d. Using the UTF-8 character encoding in this manner uses the same amount of memory and is not particularly simpler to implement than just storing the Unicode code point as a plain numeric value. The character-code space is also very sparse, which is another likely disadvantage of this character-encoding scheme.

    Of course, you could alternatively have an 8-bit wchar_t that can contain values for the characters that can be encoded by a one-octet UTF-8 sequence (which is the same as saying the first 128 characters of Unicode), but that character set is normally called ASCII.
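    To make point (c) concrete, here is what such a hypothetical packing scheme might look like in C (purely illustrative; no real implementation is claimed to do this). Note that the result occupies the same 32 bits as the code point itself would, which is exactly the wastefulness point (d) raises:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Hypothetical scheme: cram the UTF-8 octets for one code point into
       a single 32-bit unit, lead byte in the most significant position. */
    static uint32_t pack_utf8(uint32_t cp)
    {
        if (cp < 0x80)  /* 1 octet: plain ASCII */
            return cp;
        if (cp < 0x800) /* 2 octets: 110xxxxx 10xxxxxx */
            return (uint32_t)(0xC0 | (cp >> 6)) << 8
                 | (uint32_t)(0x80 | (cp & 0x3F));
        if (cp < 0x10000) /* 3 octets: 1110xxxx 10xxxxxx 10xxxxxx */
            return (uint32_t)(0xE0 | (cp >> 12)) << 16
                 | (uint32_t)(0x80 | ((cp >> 6) & 0x3F)) << 8
                 | (uint32_t)(0x80 | (cp & 0x3F));
        /* 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        return (uint32_t)(0xF0 | (cp >> 18)) << 24
             | (uint32_t)(0x80 | ((cp >> 12) & 0x3F)) << 16
             | (uint32_t)(0x80 | ((cp >> 6) & 0x3F)) << 8
             | (uint32_t)(0x80 | (cp & 0x3F));
    }

    int main(void)
    {
        assert(pack_utf8(0x41)   == 0x41);      /* 'A': one octet        */
        assert(pack_utf8(0xE9)   == 0xC3A9);    /* U+00E9: octets C3 A9  */
        assert(pack_utf8(0x20AC) == 0xE282AC);  /* euro: octets E2 82 AC */
        return 0;
    }
    ```

    Either way, the program still hands around one 32-bit unit per character, so nothing is gained over storing the code point directly.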


    You on the other hand keep saying "largest extended character set" when that is clearly and obviously omitting context.
    What context am I omitting? The part about locales? How does that affect anything that I've said so far? The locale determines which characters are included in the implementation's extended character set, as well as things like digit separators and currency symbols.

    Considering that you yourself posted the relevant part of the standard I'd say it is a fair bet that you are just trolling so I'm out.
    The part of the standard I quoted was "relevant" only in the sense that it marked implementation-defined behavior as it relates to the wchar_t type. Perhaps you could show how it is any more relevant than that, or perhaps you could address the other issues I brought up regarding converting a multibyte character to a wide character that is outside the extended character set of an implementation. I'd like to know how those are resolved with respect to the standard.

  8. #38
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    Nope.

    Click_here told me all your secrets. You are after my goat!

    That is not going to happen. I love my goat.

    [Edit]
    Really, when I said "I'm out" that was your cue that I had no intention to respond to trolling via willful ignorance and intentional misunderstanding with anything more than trolling of my own.
    [/Edit]

    [Edit]
    But I do love my goat.
    [/Edit]

    Soma
    Last edited by phantomotap; 11-01-2012 at 09:19 PM.

  9. #39
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    945
    I am not trolling. I have a genuine interest in this subject, but you seem to be unwilling to point out the parts of the actual standard in question that contradict what I am claiming. Your responses seem to me to be speculation or wishful thinking or something (or derived from so-called Microsoft "standards", where confusion and obfuscation are the name of the game).

    I like goats too. They taste great in a curry sauce. Yum. :P

  10. #40
    Registered User
    Join Date
    Oct 2012
    Posts
    99
    Taking a step back, I just think it's kind of funny that this standard comp-sci 101, chapter 4 question triggered this lengthy debate. I wish I knew what y'all were talking about, but at least I do understand the OP's homework question.

  11. #41
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    945
    Quote Originally Posted by phantomotap View Post
    Really, when I said "I'm out" that was your cue that I had no intention to respond to trolling via willful ignorance and intentional misunderstanding with anything more than trolling of my own.
    Ok, phantomotap, let me take another approach. I'm going to ask you a few simple questions. Assume for the purpose of these questions that you are designing an implementation that uses UTF-8 for wchar_t.

    1. What character set will you support? If it's a subset of Unicode, what subset will you support? You can make the set as small (like ASCII, 128 characters) or as large as you'd like.
    2. How wide (in bits) is your wchar_t type? What is the range of values that it supports?
    3. Will you be able to represent a wide character value using a sequence of two or more consecutive wchar_t objects?
    4. If you answer yes to 3, how will your versions of mbtowc and fgetwc convert a multibyte character into a sequence of two or more wchar_t objects? Or in other words, how will these functions return more than only the first wchar_t in a sequence?

    Bonus questions:
    5. If you answer yes to 3, show how you would encode a wide character as a sequence of two or more wchar_t objects. (If your wchar_t is 8 bits wide and you use the UTF-8 encoding, you can say UTF-8).
    6. Show how your design decisions conform with C99.
