Thread: adjusting character counts for utf8

  1. #1
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300

    adjusting character counts for utf8

    I have been doing this by using unsigned char buffers and a function like this:
    Code:
    int utfadj (unsigned char *buffer, int pos) {  
    	int i, count=0;
    	for (i=0;i<pos;i++) {
    		if ((buffer[i]>=0x80) && (buffer[i]<=0xBF)) continue;		
    		else count++;
    	}
    	return count;
    }
    However, I am starting to realize this may not be the best idea, because some functions (e.g., strcasestr as opposed to strstr) return funky pointer values if the destination pointer is actually unsigned (the value works, but is useless for pointer arithmetic). So mostly I'm just asking for the correct values to filter for a signed char (0x80 - 0xBF don't work), but I thought I'd check with a crowd to see if there are any more surprises in store.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  2. #2
    Woof, woof! zacs7's Avatar
    Join Date
    Mar 2007
    Location
    Australia
    Posts
    3,459
    What about:
    Code:
    if((signed char) buffer[i] < 0)
    ?

    Not sure if that's what you're after.

  3. #3
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by zacs7 View Post
    What about:
    Code:
    if((signed char) buffer[i] < 0)
    ?

    Not sure if that's what you're after.
    I don't think so; you don't want to eliminate all the values which would be less than zero, just some of them, because one multi-byte character is still one character. According to this, "A hard-wired technique to count the number of characters in a UTF-8 string is to count all bytes except those in the range 0x80 – 0xBF, because these are just continuation bytes and not characters of their own." However, this is only valid for unsigned bytes. I suppose all I have to do is convert 0x80 to its signed equivalent, which I guess would be -1 (the real -1, not EOF), but since this is something I don't ordinarily do I'm suddenly realizing "two's complement" is a fuzzy concept to me.
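The counting rule quoted above can be written so that it does not depend on whether char is signed. The following is only a sketch: `utf8_count` is a hypothetical name, and it assumes the buffer holds valid UTF-8. A continuation byte always has the bit pattern 10xxxxxx, so masking the top two bits identifies it without comparing against 0x80 and 0xBF directly.

```c
#include <stddef.h>

/* Count characters (code points) in the first `len` bytes of a UTF-8 buffer.
 * Every byte is counted except continuation bytes (10xxxxxx, i.e. 0x80-0xBF).
 * The cast to unsigned char makes the mask test independent of whether plain
 * char is signed on this platform. Assumes the input is valid UTF-8. */
size_t utf8_count(const char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

Because the test is a mask rather than a range comparison, the same function works for `char`, `signed char`, and `unsigned char` buffers (the latter two with a pointer cast at the call site).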

  4. #4
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> the value works, but is useless for pointer arithmetic
    That doesn't really make any sense...but using strcasestr() on a UTF8 string doesn't make sense either since each byte isn't necessarily a character.

    gg

  5. #5
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Quote Originally Posted by MK27 View Post
    I suppose all I have to do is convert 0x80 to its signed equivalent, which I guess would be -1 (the real -1, not EOF), but since this is something I don't ordinarily do I'm suddenly realizing "two's complement" is a fuzzy concept to me.
    0xFF is -1 (in two's complement), since 0xFF + 1 = 0x00 with the carry bit blowing in the wind -- remember what two's complement means to do: flip the bits and add one.
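To make the arithmetic concrete, here is a small sketch of how an 8-bit pattern maps to its two's complement value, and of the "flip the bits and add one" rule. The helper names `as_signed8` and `negate8` are made up for illustration. It works on unsigned values only, so it does not rely on how the platform treats char. Note that 0x80 comes out as -128, not -1, and the continuation range 0x80 to 0xBF becomes -128 to -65.

```c
/* Interpret a byte value (0..255) as an 8-bit two's complement number.
 * Values with the top bit set represent byte - 256. */
int as_signed8(unsigned char byte)
{
    return byte < 0x80 ? (int)byte : (int)byte - 256;
}

/* Negate within 8 bits: flip the bits and add one, discarding the carry. */
unsigned char negate8(unsigned char byte)
{
    return (unsigned char)(~byte + 1);
}
```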

  6. #6
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Cast 0x80 and 0xBF to unsigned char or char when using them. Then it should work.

    The problem is that those values are of type int by default, so a negative char compared with them will always be less. If you cast the numbers to char, then they are converted to their two's complement values correctly.
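A sketch of the effect described above, using two hypothetical helpers. It assumes a two's complement platform where converting 0x80 to signed char yields -128; the C standard leaves that conversion implementation-defined, though GCC behaves this way.

```c
/* Compares a signed byte against int bounds. The byte is promoted to int,
 * keeping its negative value, so it can never be >= 128: this returns 0
 * for every possible input. */
int is_cont_naive(signed char c)
{
    const int lo = 0x80, hi = 0xBF;   /* plain int values 128 and 191 */
    return c >= lo && c <= hi;
}

/* Converting the bounds to signed char first turns them into -128 and -65
 * (assumption: two's complement conversion), so the range test works. */
int is_cont_cast(signed char c)
{
    const signed char lo = (signed char)0x80, hi = (signed char)0xBF;
    return c >= lo && c <= hi;
}
```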
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  7. #7
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Here's a post I made some time ago for dealing with 2-byte UTF-8 chars. You can add code to handle the 3 and 4 byte cases pretty easily I presume.

    http://cboard.cprogramming.com/showp...5&postcount=13

    Here's what I used as a guide for how to inspect the characters: http://en.wikipedia.org/wiki/UTF-8#Description
    Mainframe assembler programmer by trade. C coder when I can.

  8. #8
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by King Mir View Post
    Cast 0x80 and 0xBF to unsigned char or char when using them. Then it should work.

    The problem is that those values are of type int by default, so a negative char compared with them will always be less. If you cast the numbers to char, then they are converted to their two's complement values correctly.
    This worked perfect! Thanks King!
    Code:
    if ((buffer[i]>=(char)0x80) && (buffer[i]<=(char)0xBF)) continue;
    Now I can just ignore the ideas of Dino, etc.

  9. #9
    Jack of many languages Dino's Avatar
    Join Date
    Nov 2007
    Location
    Chappell Hill, Texas
    Posts
    2,332
    Wow. You sure know how to make a guy feel loved. As much as I hated to do it, I had to update my signature on that one.
    Mainframe assembler programmer by trade. C coder when I can.

  10. #10
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Just use unsigned char pointers, then you don't have to cast the constants.

    You still haven't made sense of your statements (post #4) or explained how using strcasestr() on UTF8 strings makes any sense.

    gg

  11. #11
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Codeplug View Post
    Just use unsigned char pointers, then you don't have to cast the constants.

    You still haven't made sense of your statements (post #4) or explained how using strcasestr() on UTF8 strings makes any sense.

    gg
    strstr and strcasestr work fine on UTF8 strings. I don't know why you think they would not, since they return a pointer to the first byte of the match. The reason I am having to make an adjustment is that I'm applying the result of a search on a char buffer to a separate (GUI display) buffer using the offset, so "byte count does not equal character count" becomes a reality.

    My original problem, I thought, was because I was using an unsigned pointer and that this was a problem for strcasestr. However, after I switched everything to signed it persisted. My solution to this was to write my own str(case)str that returns an int offset rather than a pointer.

    Since I don't have the old code to give you an example of the problematic output, all I can do is summarize thusly: The pointer returned by strcasestr did "contain" correct data, but the actual address of the pointer was mysteriously OUTSIDE the buffer, which meant that when I used it arithmetically to proceed through the buffer (start from *ptr-*buffer+length of word, ptr being the last return from str(case)str), it did not work. The problem was not my code, since substituting strstr worked fine. This only happened when the buffer was fairly large (several MB).

    I imagine the pointer was not really outside the buffer (since it did "contain" the correct data), but the address looked incredibly high (if the buffer started at say 0x7eXXXX, the strcasestr pointer started with 0xfffff. The strstr pointer was what I would predict, 0x7eXXXX+offset). I'm guessing this is some two's complement related issue again, although it occurred using both signed and unsigned buffers and pointers.

    If anyone has an explanation I would of course appreciate it (for future reference).
    Last edited by MK27; 02-03-2009 at 01:16 PM.

  12. #12
    Registered User
    Join Date
    Dec 2008
    Location
    Black River
    Posts
    128
    I'm fairly sure that most strcasestr implementations do not follow UTF-8 rules and simply use tolower / toupper on each single byte to compare, which makes the function unsuitable for UTF-8 strings unless the string is composed entirely of single-byte code points (on the other hand, strstr should work just fine on UTF-8 strings).

    I'm not sure I fully understand the problem you were having with strcasestr, but your use of pointer arithmetic looks wrong to me. If you want to get the byte offset of the first caseless match, then you would use ptr - string, where ptr represents the return value returned by a caseless search function and string is the beginning of the UTF-8 string.
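The `ptr - string` subtraction above gives a byte offset. For the use case described earlier in the thread (applying a match position to a display buffer indexed by character), that byte offset still has to be converted to a character offset. A minimal sketch, with hypothetical names `chars_before` and `utf8_find_char_offset`, assuming valid UTF-8 input and a case-sensitive search with strstr:

```c
#include <stddef.h>
#include <string.h>

/* Count UTF-8 characters in the first `len` bytes by skipping continuation
 * bytes (bit pattern 10xxxxxx). */
static size_t chars_before(const char *buf, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Return the character offset of the first occurrence of `needle` in
 * `haystack`, or -1 if it does not occur. */
long utf8_find_char_offset(const char *haystack, const char *needle)
{
    const char *match = strstr(haystack, needle);
    if (match == NULL)
        return -1;
    size_t byte_offset = (size_t)(match - haystack);  /* ptr - string */
    return (long)chars_before(haystack, byte_offset);
}
```

For example, in the string "it’s here" encoded as UTF-8, "here" starts at byte offset 7 but character offset 5, because the typographic apostrophe occupies three bytes.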

  13. #13
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    According to the GNU LibC manual, strcasestr() is locale dependent. So if you haven't loaded a UTF8 locale, then it makes no sense to pass a UTF8 string to strcasestr(). As you already mentioned, a "character" in UTF8 can be encoded with multiple code units (bytes in this case). The "case" in strcasestr() applies to "character" case. So the function needs to know how the characters are encoded.

    >> I'm guessing this is some two's complement related issue again
    No, the value of a pointer is the same regardless of the type to which it points (signed or unsigned).

    gg

  14. #14
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Ronix View Post
    I'm fairly sure that most strcasestr implementations do not follow UTF-8 rules and simply use tolower / toupper on each single byte to compare, which makes the function unsuitable for UTF-8 strings unless the string is composed entirely of single-byte code points (on the other hand, strstr should work just fine on UTF-8 strings).
    Thanks for that, since the same logic will apply to my own str(case)str! I imagine there is no easy "tolower" function for the utf8 characters, although this must be a common problem dealing with non-english. Anyone have any pointers?

    Quote Originally Posted by Ronix View Post
    I'm not sure I fully understand the problem you were having with strcasestr, but your use of pointer arithmetic looks wrong to me. If you want to get the byte offset of the first caseless match, then you would use ptr - string, where ptr represents the return value returned by a caseless search function and string is the beginning of the UTF-8 string.
    No, you understand it just fine (but so is my arithmetic):
    Quote Originally Posted by me
    start from *ptr-*buffer+length of word, ptr being the last return from str(case)str)
    or maybe I should have explained that buffer is the utf-8 string, and "word" would be the searched for term? I just used asterisks here to indicate that they are pointers, so I'm doing math on the addresses.
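One conservative way to write a home-grown str(case)str that returns an int offset, as described earlier in the thread, is to fold case only for ASCII letters and compare every other byte exactly. This is a sketch under that stated limitation, not a full Unicode case-insensitive search: `ascii_casefind` and `fold_ascii` are hypothetical names, and non-ASCII letters such as É and é will not match each other. The advantage is that multi-byte sequences are never altered and the result does not depend on the locale.

```c
#include <stddef.h>

/* Lower-case ASCII letters only; every other byte, including all bytes of
 * multi-byte UTF-8 sequences, is returned unchanged. */
static unsigned char fold_ascii(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Return the byte offset of the first match of `needle` in `haystack`,
 * ignoring case for ASCII letters only, or -1 if there is no match. */
long ascii_casefind(const char *haystack, const char *needle)
{
    for (size_t i = 0; ; i++) {
        size_t j = 0;
        /* The comparison stops at the haystack terminator because '\0'
         * never equals a non-terminator needle byte. */
        while (needle[j] != '\0' &&
               fold_ascii((unsigned char)haystack[i + j]) ==
               fold_ascii((unsigned char)needle[j]))
            j++;
        if (needle[j] == '\0')
            return (long)i;      /* whole needle matched at offset i */
        if (haystack[i] == '\0')
            return -1;           /* reached the end without a match */
    }
}
```

Returning an offset rather than a pointer sidesteps pointer arithmetic at the call site; the offset is still in bytes and needs the continuation-byte adjustment before it is used as a character position.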

  15. #15
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by Codeplug View Post
    According to the GNU LibC manual, strcasestr() is locale dependent. So if you haven't loaded a UTF8 locale, then it makes no sense to pass a UTF8 string to strcasestr(). As you already mentioned, a "character" in UTF8 can be encoded with multiple code units (bytes in this case). The "case" in strcasestr() applies to "character" case. So the function needs to know how the characters are encoded.
    Well, in the test cases the UTF-8 characters were not members of a non-ASCII alphabet; they were things like the typographic apostrophe, which does not have an ASCII value (only the single quote ' and the backtick ` do) and so is multi-byte. But since they could be, I'm going to have to do some research.

    >> I'm guessing this is some two's complement related issue again
    No, the value of a pointer is the same regardless of the type to which it points (signed or unsigned).
    gg
    Hopefully someone will explain it tho -- consider this:
    Code:
    #include <string.h>
    #include <stdio.h>
    
    int main() {
    	const char haystack[]="this and that", needle[]="and";
    	char	*ptr=strstr(haystack,needle);
    	printf("%p %s\n",ptr,ptr);
    	ptr=strcasestr(haystack,needle);
    	printf("%p %s\n",ptr,ptr);
    	return 0;
    }
    You would expect the results to be the same. However, yesterday on large files the pointer address was different, but the string it pointed to was the same!

    Today, the code above is actually seg faulting for me at the second assignment of ptr. So much for strcasestr!

    I imagine this relates to the warning from gcc:

    test.c: In function ‘main’:
    test.c:8: warning: assignment makes pointer from integer without a cast


    However, in the GNU manual it says of strcasestr:

    This is like strstr, except that it ignores case in searching for the substring.

    So why a warning for one but not the other?
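A likely explanation for the warning, offered here as a sketch and not as a confirmed diagnosis of the original program: strcasestr is a GNU extension, and the Linux man page documents that `_GNU_SOURCE` must be defined before `<string.h>` is included for it to be declared. Without a declaration, a C compiler of that era assumes the function returns int, which is exactly what "assignment makes pointer from integer without a cast" reports. On a 64-bit system the returned address is then truncated to 32 bits and sign-extended, which would be consistent with addresses starting 0xfffff and with the segfault. The wrapper name `casefind_offset` below is hypothetical.

```c
#define _GNU_SOURCE   /* must appear before any #include to expose strcasestr */
#include <string.h>
#include <stddef.h>

/* Return the byte offset of the first case-insensitive match of `needle`
 * in `haystack`, or -1 if there is none. With the declaration visible,
 * strcasestr returns a real char * and the subtraction is well defined. */
long casefind_offset(const char *haystack, const char *needle)
{
    const char *match = strcasestr(haystack, needle);
    return match ? (long)(match - haystack) : -1;
}
```

strstr produces no warning because it is part of standard C and is always declared by `<string.h>`. Compiling with warnings enabled (for example `-Wall`) makes an implicit declaration like this easier to spot.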
