Problem with ctype and locales

**cunnus88** · 10-29-2007

I'm trying to filter out a text so that I only have words with alphabetic characters left (no punctuation, no numbers, no words mixed with numbers, etc.). Now, here's the twist. I'm trying to make it take OS locale settings to filter the characters since I'm going to have to process some text in German and Portuguese.

So here's how I set it up:

Code:

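  std::locale loc = std::locale("de_DE.utf8"); // the German locale on Ubuntu
  const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);

  std::vector<std::string> words; // container for the accepted words
  std::string line;
  while (input_file_stream >> line) // reads one whitespace-delimited word at a time
    {
      const char* const c = line.c_str();
      std::string::size_type st = line.size();
      // scan_not returns the first char that is NOT alpha;
      // reaching the end means every char was classified as alpha
      if (ct.scan_not(std::ctype_base::alpha, c, c + st) == c + st)
        words.push_back(line);
    }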

Now when I hit a German word with an upper ascii character such as "erkläre", I would still expect that word to be added to my container. However, this word fails the conditional and the word is not added. So is there a way to take into account international settings while still using the STL? I could define my own filter table, but why do that when you have a standard?

**prog-bman** · 10-29-2007

You'd probably want to use std::wstring instead of std::string, and wchar_t instead of char.

**cunnus88** · 10-29-2007

Would it make a difference? One thing I do know is that all my data is coming from single byte character text. Straightforward non-unicode. The only reason I'm using that utf8 locale is because all international settings on the (shared) machine I'm using were defined as unicode.

**prog-bman** · 10-29-2007

Ah I forgot about this one.

http://www.cplusplus.com/reference/i...ase/imbue.html

I don't know a whole heck of a lot about C++ locales. These are some suggestions based on my very limited experience with them.

**nvoigt** · 10-30-2007

As far as I remember, an A-umlaut is represented by two bytes in UTF-8. If you have an ANSI/extended-ASCII file with one byte per character, you need to use a non-UTF character set. Even if you do use UTF-8 (where any character outside ASCII is encoded as more than one byte), passing a single byte to a function and expecting a meaningful answer about a single character is only valid for characters that would have fit in the ASCII table anyway.
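
To make the byte counts concrete, here is a small sketch. The names `kLatin1Word`, `kUtf8Word` and `count_non_ascii_bytes` are made up for illustration; the word is spelled out with hex escapes so the example does not depend on how the source file itself is encoded.

```cpp
#include <cstddef>
#include <string>

// "erkläre" in two encodings. The literals are split after the hex escape
// so the escape cannot swallow any following characters.
const std::string kLatin1Word = "erkl\xE4" "re";      // ISO-8859-1: ä is the single byte 0xE4
const std::string kUtf8Word   = "erkl\xC3\xA4" "re";  // UTF-8: ä is the two bytes 0xC3 0xA4

// Counts the bytes that lie outside 7-bit ASCII (high bit set).
std::size_t count_non_ascii_bytes(const std::string& s)
{
    std::size_t n = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
        if (static_cast<unsigned char>(s[i]) >= 0x80)
            ++n;
    return n;
}
```

The Latin-1 string is 7 bytes long with one non-ASCII byte, while the UTF-8 string is 8 bytes long with two. A `std::ctype<char>` facet looks at one `char` at a time, so in the UTF-8 case it never sees the "ä" as a single unit.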

**cunnus88** · 10-30-2007

Well, then, I'll have to create my own facet. Or if anyone can point me to a user-defined facet floating around for ISO Latin I, then please do.

**CornedBee** · 10-30-2007

Don't you have a de_DE.ISO-8859-1 locale?

I'd still say go with wchar_t. But then, I'd also say that C++'s standard library character handling is broken beyond repair, no matter what you do.

**cunnus88** · 10-30-2007

Nope, tried that locale setting, but because it's not installed on the Ubuntu machine I use (and over which I do not have superuser rights), my program throws a seg fault when it hits the line for allocating the facet. And I really don't want to add the complication of wchar_t when my original documents are in single byte characters.
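
For what it's worth, the `std::locale` constructor that takes a name throws `std::runtime_error` when the name is not valid on the system, and an uncaught exception terminates the program, which may or may not be the crash described here. A minimal sketch of probing for a locale and falling back to the classic "C" locale; `locale_or_classic` is a made-up helper name:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Tries to construct the named locale. If the name is not available on
// this system, returns the classic "C" locale instead of letting the
// exception escape.
std::locale locale_or_classic(const std::string& name)
{
    try {
        return std::locale(name.c_str());
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}
```

Calling `locale_or_classic("de_DE.ISO-8859-1")` would then yield either the German locale or the "C" locale, and `name()` on the result tells you which one you got.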

**CornedBee** · 10-31-2007

Well, then get an ISO-8859-1 code table and get coding.

Shouldn't be all that hard.
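
A rough sketch of what that could look like, assuming the input really is ISO-8859-1. `latin1_ctype` and `is_all_alpha` are made-up names. The facet starts from the classic "C" table and only adds the `alpha` bit for 0xC0-0xFF (minus the multiplication and division signs at 0xD7 and 0xF7); it does not set upper/lower or print bits, and it leaves out ª, µ and º, so treat it as a starting point rather than a complete Latin-1 facet.

```cpp
#include <cstddef>
#include <locale>
#include <string>
#include <vector>

// Hypothetical facet: std::ctype<char> whose table also marks the
// ISO-8859-1 accented letters as alphabetic.
class latin1_ctype : public std::ctype<char>
{
public:
    explicit latin1_ctype(std::size_t refs = 0)
        : std::ctype<char>(make_table(), false, refs) {}

private:
    static std::vector<mask> build()
    {
        // Copy the classic "C" table, which already covers ASCII.
        std::vector<mask> t(classic_table(), classic_table() + table_size);
        for (int c = 0xC0; c <= 0xFF; ++c) {
            if (c == 0xD7 || c == 0xF7)
                continue;  // multiplication and division signs are not letters
            t[c] |= alpha;
        }
        return t;
    }

    static const mask* make_table()
    {
        static const std::vector<mask> table = build();
        return &table[0];
    }
};

// Returns true if `word` is non-empty and every char in it is alphabetic
// according to the ctype<char> facet of `loc`.
bool is_all_alpha(const std::string& word, const std::locale& loc)
{
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    const char* const first = word.data();
    const char* const last = first + word.size();
    return !word.empty()
        && ct.scan_not(std::ctype_base::alpha, first, last) == last;
}
```

The facet is installed with `std::locale loc(std::locale::classic(), new latin1_ctype);` and the locale takes ownership of it. No system locale has to be installed for this to work, which sidesteps the missing de_DE.ISO-8859-1 problem.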
