trouble isolating special utf-8 character in string

**myrddinemrys** · 12-29-2007

I'm working on a program that will conjugate words in different languages. My host language is currently Quenya. One of the features that I am trying to tackle is handling special characters like accented vowels and vowels with diaeresis, changing the vowels to vowels without accents/diaeresis (or the other way around).
Below is a sample of code that I am using as an initial test of this concept. I am attempting to isolate the letter "ë" in "quentë" for the purpose of printing at this time. Once I succeed I will attempt to evaluate the existence of "ë" in a different word, replacing it with an "e".

Code:

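#include <stdio.h>
#include <stdlib.h>	/* mbstowcs, wcstombs */
#include <locale.h>
#include <string.h>
#include <wchar.h>	/* wchar_t */

int main(void)
{
	if (!setlocale(LC_CTYPE, "")) {
		fprintf(stderr, "Can't set the specified locale! "
				"Check LANG, LC_CTYPE, LC_ALL.\n");
		return 1;
	}

	wchar_t wstring[10];	/* real storage; an uninitialized pointer is undefined behavior */
	char *string = "quentë";
	char newstring[10];

	printf("the word is %s\n", string);
	size_t len = mbstowcs(NULL, string, 0);
	if (len == (size_t)-1 || len >= 10) {
		fprintf(stderr, "Can't convert the string in this locale.\n");
		return 1;
	}
	printf("mbs string size is %zu\n", len); //this shows the correct # of characters.
	mbstowcs(wstring, string, len + 1);	/* + 1 so the terminator is converted too */
	wcstombs(newstring, wstring, sizeof newstring);	/* the limit is in bytes, not characters */
	printf("the word is now %s\n", newstring);
	printf("the last character is: %c\n", newstring[5]);

	return 0;
}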

Now the output is:

Code:

the word is quentë
mbs string size is 6
the word is now quentë
the last character is:

For some reason the final character will not print. I've tried the other characters with the same code, and they print fine. I've tried finding any information on the subject of UTF-8 strings in C programming, Linux, GCC, GLIBC, you name it. I just can't seem to nail this concept down. Any ideas?

**Salem** · 12-29-2007

> int len = mbstowcs(NULL,string,0);
I doubt your "ë" is encoded as a multi-byte sequence.

> printf("the last character is: %c\n", newstring[5]);
All this will do is try to print the first byte of a multi-byte sequence as a char. What isn't being passed to printf is all the other bytes which make up that char.

You might try something like this, as it seems %s understands such things.

Code:

printf("the last character is: %.1s\n", &newstring[5]);

**myrddinemrys** · 12-29-2007

Okay! Thank you very much! While the code you posted didn't work, what you pointed out is that I need to reference the address of the multibyte character. This is what I have so far:

Code:

#include <stdio.h>
#include <stdlib.h>	/* mbstowcs */
#include <locale.h>
#include <string.h>	/* strcmp */

int main(void)
{
	if (!setlocale(LC_CTYPE, "")) {
		fprintf(stderr, "Can't set the specified locale! "
				"Check LANG, LC_CTYPE, LC_ALL.\n");
		return 1;
	}

	char *string = "quentë";
	char *letter = "ë";

	printf("the word is %s\n", string);
	/* string+5 is a byte offset; it works here only because the first
	   five letters are one byte each */
	printf("the last letter is %s\n", string+5);
	size_t len = mbstowcs(NULL,string,0);
	printf("mbs string size is %zu\n", len); //this shows the correct # of characters.
	/* compare the whole sequence; *(string+5) == *letter would only
	   compare the first byte, which many accented letters share */
	if (strcmp(string+5, letter) == 0)
		printf("true\n");
	else
		printf("false\n");

	return 0;
}

Annnnd . . . voila! the output:

the word is quentë
the last letter is ë
mbs string size is 6
true

Now I'm getting somewhere. Thank you so much!

**Salem** · 12-30-2007

Don't you need to start with this?

Code:

wchar_t *string = L"quentë";
