Thread: trouble isolating special utf-8 character in string

  1. #1
    Registered User
    Join Date
    Dec 2007
    Posts
    2

    trouble isolating special utf-8 character in string

    I'm working on a program that will conjugate words in different languages. My host language is currently Quenya. One of the features that I am trying to tackle is handling special characters like accented vowels and vowels with diaeresis, changing the vowels to vowels without accents/diaeresis (or the other way around).
    Below is a sample of code that I am using as an initial test of this concept. I am attempting to isolate the letter "ë" in "quentë" for the purpose of printing at this time. Once I succeed I will attempt to evaluate the existence of "ë" in a different word, replacing it with an "e".

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>
    #include <string.h>
    #include <wchar.h>
    
    int main(void)
    {
    	if (!setlocale(LC_CTYPE, "")) {
    		fprintf(stderr, "Can't set the specified locale! "
    				"Check LANG, LC_CTYPE, LC_ALL.\n");
    		return 1;
    	}
    
    	wchar_t wstring[10];	/* real storage; an uninitialized pointer cannot be written to */
    	const char *string = "quentë";
    	char newstring[10];
    
    	printf("the word is %s\n", string);
    	int len = mbstowcs(NULL,string,0);
    	if (len < 0 || len >= 10)	/* invalid sequence, or too long for wstring */
    		return 1;
    	printf("mbs string size is %d\n", len); //this shows the correct # of characters.
    	mbstowcs(wstring,string,len+1);	/* len+1 so the terminator is converted too */
    	wcstombs(newstring,wstring,sizeof newstring);	/* limit is in bytes, not characters */
    	printf("the word is now %s\n", newstring);
    	printf("the last character is: %c\n", newstring[5]);
    
    	return 0;
    }
    Now the output is:

    Code:
    the word is quentë
    mbs string size is 6
    the word is now quentë
    the last character is:
    For some reason the final character will not print. I've tried the other characters with the same code, and they print fine. I've tried finding any information on the subject of UTF-8 strings in C programming, Linux, GCC, GLIBC, you name it. I just can't seem to nail this concept down. Any ideas?

  2. #2
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > int len = mbstowcs(NULL,string,0);
    I doubt your "ë" is encoded as a multi-byte sequence.

    > printf("the last character is: %c\n", newstring[5]);
    All this will do is try to print the first byte of a multi-byte sequence as a char. What isn't being passed to printf is all the other bytes which make up that char.

    You might try something like this, as it seems %s understands such things.
    Code:
    printf("the last character is: %.1s\n", &newstring[5]);
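    One caveat on the line above: the precision in %.1s counts bytes, not characters, so it still emits only the first byte of "ë". A rough sketch of one way around that, assuming the string is valid UTF-8 (this decodes the byte pattern by hand instead of using the locale functions from the original code, and the helper names utf8_char_bytes and utf8_char_at are made up for illustration):

```c
#include <stddef.h>

/* Number of bytes in the UTF-8 character starting at s.
 * Assumes s is valid UTF-8 and points at the start of a character.
 * Continuation bytes always look like 10xxxxxx, so we count those. */
size_t utf8_char_bytes(const char *s)
{
	size_t n;

	if (s[0] == '\0')
		return 0;
	n = 1;
	while (((unsigned char)s[n] & 0xC0) == 0x80)
		n++;
	return n;
}

/* Pointer to the character at zero-based position `index`, counting
 * characters rather than bytes. Returns a pointer to the terminator
 * if the string is shorter than that. */
const char *utf8_char_at(const char *s, size_t index)
{
	while (index-- > 0 && *s != '\0')
		s += utf8_char_bytes(s);
	return s;
}
```

    With those in place, the print could look like this: take const char *p = utf8_char_at(newstring, 5); and then call printf("the last character is: %.*s\n", (int)utf8_char_bytes(p), p); so that the precision covers every byte of the character.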
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Dec 2007
    Posts
    2

    getting somewhere

    Okay! Thank you very much! While the code you posted didn't work, what you pointed out is that I need to reference the address of the multibyte character. This is what I have so far:

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <locale.h>
    #include <string.h>
    
    int main(void)
    {
    	if (!setlocale(LC_CTYPE, "")) {
    		fprintf(stderr, "Can't set the specified locale! "
    				"Check LANG, LC_CTYPE, LC_ALL.\n");
    		return 1;
    	}
    
    	const char *string = "quentë";
    	const char *letter = "ë";
    
    	printf("the word is %s\n", string);
    	printf("the last letter is %s\n", string+5);
    	int len = mbstowcs(NULL,string,0);
    	printf("mbs string size is %d\n", len); //this shows the correct # of characters.
    	/* note: this compares only the first byte of each multibyte sequence */
    	if (*(string+5) == *letter)
    		printf("true\n");
    	else
    		printf("false\n");
    
    	return 0;
    }
    Annnnd . . . voila! the output:

    the word is quentë
    the last letter is ë
    mbs string size is 6
    true

    Now I'm getting somewhere. Thank you so much!
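    A word of caution about the test `*(string+5) == *letter` in the code above: it compares a single byte. In UTF-8, "ë" is the two bytes 0xC3 0xAB and "é" is 0xC3 0xA9, so the check would also report true for "quenté". A minimal sketch of a comparison that looks at the whole sequence, assuming both strings are UTF-8 (ends_with and contains_letter are hypothetical helper names, and the letters are spelled with hex escapes so the example does not depend on the source file's encoding):

```c
#include <stdbool.h>
#include <string.h>

/* True if `word` ends with the complete byte sequence `suffix`. */
bool ends_with(const char *word, const char *suffix)
{
	size_t wlen = strlen(word);
	size_t slen = strlen(suffix);

	return slen <= wlen && strcmp(word + (wlen - slen), suffix) == 0;
}

/* True if the UTF-8 encoded `letter` occurs anywhere in `word`.
 * A plain byte search is enough here: a complete UTF-8 character
 * cannot match in the middle of another character's bytes. */
bool contains_letter(const char *word, const char *letter)
{
	return strstr(word, letter) != NULL;
}
```

    For example, ends_with("quent\xc3\xab", "\xc3\xab") is true, while ends_with("quent\xc3\xa9", "\xc3\xab") is false even though both last letters start with the byte 0xC3.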

  4. #4
    Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Don't you need to start with this?
    Code:
    wchar_t *string = L"quentë";
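    To illustrate why a wide string helps with the stated goal of swapping "ë" for "e": once every character is one wchar_t, the replacement is a plain loop. This is only a sketch, assuming wchar_t holds Unicode code points (true with glibc on Linux, not guaranteed everywhere); strip_diaeresis is a made-up name and the list of vowels is just a guess at what the conjugator would need.

```c
#include <wchar.h>

/* Replace vowels carrying a diaeresis with their plain forms, in place.
 * Assumption: wchar_t values are Unicode code points. */
void strip_diaeresis(wchar_t *ws)
{
	for (; *ws != L'\0'; ws++) {
		switch (*ws) {
		case 0x00E4: *ws = L'a'; break;	/* ä */
		case 0x00EB: *ws = L'e'; break;	/* ë */
		case 0x00EF: *ws = L'i'; break;	/* ï */
		case 0x00F6: *ws = L'o'; break;	/* ö */
		case 0x00FC: *ws = L'u'; break;	/* ü */
		default: break;
		}
	}
}
```

    If the word arrives as a multibyte char string, it would first go through mbstowcs into a wchar_t buffer (after setlocale), then through this function, then back through wcstombs for printing.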
