Thread: cast unsigned char* to (the default) signed char*

  1. #1
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446

    cast unsigned char* to (the default) signed char*

    Hello folks, long time no see.

    I have a quicky...

    I need some advice on the way I am dealing with SQLite's data extraction functions. The culprit is one of the two functions (I don't use the other) that return a UTF-8 text representation of a single column of the current result row. The prototype is:

    Code:
    const unsigned char* sqlite3_column_text(sqlite3_stmt*, int);
    My implementation defines "char" as being of type signed char by default. So, in order to put the result of this function into a std::string, I first need to cast it to the appropriate type:
    Code:
    char* result = reinterpret_cast<char*>( sqlite3_column_text(/*...*/) );
    I'm only partially worried about the resulting data loss, since the database in question only uses characters in the acceptable range (English characters only). However, two questions arise:

    1. Can I do better than a reinterpret_cast? Again data loss is no concern.

    2. What if data loss becomes a concern? How can I handle this? The problem is that the character 'ó', for instance, is well within the capabilities of a signed char. However, the cast alters that and renders Latin characters unusable.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > The problem is that the character 'ó', for instance, is well within the capabilities of a signed char.
    > However the cast alters that and renders latin characters unusable.
    How would the cast affect it?
    AFAIK, all the unsigned types have the same number of bits as their signed counterparts. The only difference is one of interpretation. If you then go back to unsigned, you should be back to where you started.
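    A quick sketch of that round trip, with an example octet written in by hand (the in-between signed value is implementation-defined, but on any common implementation the octet survives intact):
    Code:
    #include <cassert>

    int main()
    {
        unsigned char original = 0xF3;                          // some octet >= 128
        char as_plain = static_cast<char>(original);            // implementation-defined value; typically -13 with a signed, two's complement char
        unsigned char back = static_cast<unsigned char>(as_plain);

        assert(back == original);                               // back to where you started
        return 0;
    }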

    > function that returns a UTF-8 text representation of a single column of the current result row.
    http://en.wikipedia.org/wiki/Utf-8
    It seems to me that 'ó' would be encoded in two octets. Are you sure that you're not mistaking this encoding for data corruption?
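    A quick way to check, with the two octets written out by hand from the UTF-8 table rather than pulled out of SQLite:
    Code:
    #include <cstdio>

    int main()
    {
        // U+00F3 ('ó') encoded as UTF-8: two octets, both with the top bit set
        const unsigned char o_acute[] = { 0xC3, 0xB3, 0x00 };

        for (const unsigned char* p = o_acute; *p != 0; ++p)
            std::printf("0x%02X ", *p);                         // prints 0xC3 0xB3
        std::printf("\n");

        return 0;
    }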
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    It is possible. I can't check it right now, but I will get back to it later today.
    My understanding was that any characters above ASCII 128 were not being converted correctly, but I didn't think to look in the database.

    Thanks Salem.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  4. #4
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    UTF-8 and other encodings have little to nothing to do with the signedness of the underlying char type.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  5. #5
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    Also note that you remove the const modifier (which you shouldn't do just to convert unsigned char to signed char).
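    Something like this keeps the const, as a sketch (the helper name and the stmt/col parameters are placeholders; it assumes only the sqlite3_column_text prototype quoted above):
    Code:
    #include <string>
    #include <sqlite3.h>

    // placeholder helper: read one text column as a std::string
    std::string column_as_string(sqlite3_stmt* stmt, int col)
    {
        // keep the const; reinterpret_cast alone cannot cast it away anyway
        const char* text =
            reinterpret_cast<const char*>(sqlite3_column_text(stmt, col));

        // sqlite3_column_text may return NULL, e.g. for an SQL NULL
        return text ? std::string(text) : std::string();
    }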
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  6. #6
    Hurry Slowly vart's Avatar
    Join Date
    Oct 2006
    Location
    Rishon LeZion, Israel
    Posts
    6,788
    BTW, what about using basic_string with the correct CharType as its template parameter, to avoid casting at all?
    All problems in computer science can be solved by another level of indirection,
    except for the problem of too many layers of indirection.
    – David J. Wheeler

  7. #7
    Tropical Coder Darryl's Avatar
    Join Date
    Mar 2005
    Location
    Cayman Islands
    Posts
    503
    Another option may be to use a compiler switch to treat char as unsigned by default.
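    For example, GCC has -funsigned-char and MSVC has /J (check your compiler's documentation). A small sketch to verify what your build actually uses:
    Code:
    #include <iostream>
    #include <limits>

    int main()
    {
        // prints "unsigned" when the compiler treats plain char as unsigned by default
        std::cout << (std::numeric_limits<char>::is_signed ? "signed" : "unsigned")
                  << '\n';
        return 0;
    }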

  8. #8
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    This problem kind of disappeared on its own and I don't know why. Since things don't just solve themselves, my guess is that the data in the database was corrupt and was fixed by someone in the meantime, since I didn't touch my code.

    Thanks everyone for the replies. Meanwhile, basic_string seems a reasonable approach, no doubt, as opposed to casting. Thanks.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  9. #9
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Quote Originally Posted by vart View Post
    BTW, what about using basic_string with the correct CharType as its template parameter, to avoid casting at all?
    This problem resurfaced. Sorry for bringing back the thread. Not too old, I reckon.

    I can't seem to figure out how to follow this advice from Vart. How should I go about doing it?
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  10. #10
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    You could try it.

    Code:
    typedef std::basic_string<unsigned char> ucstring;
    I don't think it will help. The WinAPI wants plain chars, and cin and cout are based on plain chars. At some point you have to convert, and the different strings only delay the moment.
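    For instance, at the cout/WinAPI boundary you end up writing something like this anyway (a sketch, and only if the ucstring typedef compiles at all, which needs the char_traits<unsigned char> discussed further down):
    Code:
    #include <string>

    typedef std::basic_string<unsigned char> ucstring;     // as above

    // convert back to plain char at the boundary; each octet keeps its bit pattern on common implementations
    std::string narrow(const ucstring& u)
    {
        return std::string(u.begin(), u.end());
    }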

    I still think it's an encoding issue and has nothing to do with signedness.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  11. #11
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Yup. It doesn't work. Just as you pointed out...

    Ok... hmm... let me collect my thoughts... Please bear with me. I have trouble understanding this world of encodings, different text formats and whatnot.

    The database text columns hold text in UTF-8 format. The function responsible for returning text from a text field is sqlite3_column_text, which returns a null-terminated const unsigned char*.

    I expect the text in these columns to always contain plain English characters. So one could say data loss is not a concern, and under these conditions reinterpret_cast is a solution. However, it is screaming undefined behavior, is it not? If, as I think can happen, some UTF-8 character falls outside the signed char range, reinterpret_cast will not do.

    How can I handle this conversion safely?

    I gave this some thought and came up with a possible solution involving the excellent boost::numeric_cast. However, this means iterating through the null-terminated unsigned char array. Probably too expensive a solution.
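    For what it's worth, the per-octet idea would look roughly like this (a sketch only; checked_copy is a made-up name, and with a signed char any octet >= 128 would simply make it throw):
    Code:
    #include <string>
    #include <boost/numeric/conversion/cast.hpp>

    // sketch of the boost::numeric_cast idea: copy octet by octet until the terminator
    std::string checked_copy(const unsigned char* p)
    {
        std::string out;
        for (; *p != 0; ++p)
            out += boost::numeric_cast<char>(*p);    // throws if the octet does not fit in char
        return out;
    }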
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  12. #12
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > If, as I think it can happen, some UTF-8 character falls out of the signed char range,
    Or indeed, all of them.
    http://en.wikipedia.org/wiki/UTF-8

    Any octet sequence encoding a character with ordinal >= 128 has the top bit set in every octet of that sequence.

    Nor can you just take a sequence of encoding octets and simply "assign" them to some wide character. There is a degree of bit-fiddling to extract the data bits from the encoding bits.
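    A rough sketch of that bit-fiddling, for the two-octet case only (no validation, longer sequences ignored):
    Code:
    #include <cstdio>

    // decode one two-octet UTF-8 sequence (110xxxxx 10xxxxxx) into its code point
    unsigned decode_two_octets(unsigned char lead, unsigned char trail)
    {
        return ((lead & 0x1Fu) << 6) | (trail & 0x3Fu);
    }

    int main()
    {
        std::printf("U+%04X\n", decode_two_octets(0xC3, 0xB3));   // prints U+00F3, i.e. 'ó'
        return 0;
    }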
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  13. #13
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by Mario F. View Post
    My implementation defines "char" as being of type signed char by default. So, in order to put the result of this function into a std::string, I first need to cast it to the appropriate type:
    Are you sure? You can create std::strings which hold unsigned chars instead of signed ones.

    Code:
    std::basic_string<unsigned char> utf8_string;
    In order to do this, you'll have to create a specialization of std::char_traits<unsigned char>, if your environment doesn't already provide one.

  14. #14
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    It does not. And that's where I would need some help. Vart suggested it also. I'm not sure how to do it.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  15. #15
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by Mario F. View Post
    It does not. And that's where I would need some help. Vart suggested it also. I'm not sure how to do it.
    Does the following link help:

    http://www.sgi.com/tech/stl/character_traits.html

    You need to define a specialization:

    Code:
    namespace std
    {
        template <>
        class char_traits<unsigned char>
        {
            ...
        };
    }
    Look at the sections on that page called "Associated Types" and "Valid Expressions." They define the things you have to provide. The types "char" and "unsigned char" are so similar to each other that you might be able to get away with just copying the char_traits<char> out of the standard library and using it. Hell, even deriving from it might work:

    Code:
    namespace std
    {
        template <>
        class char_traits<unsigned char> : public std::char_traits<char>
        {
        public:
            typedef unsigned char char_type;
        };
    }
