Thread: Problem with ctype and locales

  1. #1
    Registered User
    Join Date
    Oct 2005
    Posts
    271

    Problem with ctype and locales

    I'm trying to filter a text so that only words made up entirely of alphabetic characters are left (no punctuation, no numbers, no words mixed with numbers, etc.). Now, here's the twist: I want the filter to follow the OS locale settings, since I'm going to have to process some text in German and Portuguese.

    So here's how I set it up:
    Code:
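      // needs <fstream>, <locale>, <string> and <vector>
      std::locale loc("de_DE.utf8"); // the German locale on Ubuntu
      const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);

      std::vector<std::string> words; // container for the words that pass the filter
      std::string line;
      while (input_file_stream >> line) // input_file_stream is an already opened std::ifstream
        {
          const char* const c = line.c_str();
          const std::string::size_type st = line.size();
          // scan_not returns the first char that is NOT alpha;
          // getting back the end pointer means every char was alpha
          if (ct.scan_not(std::ctype_base::alpha, c, c + st) == c + st)
            words.push_back(line);
        }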
    Now when I hit a German word with a non-ASCII character such as "erkläre", I would still expect that word to be added to my container. However, the word fails the conditional and is not added. So is there a way to take international settings into account while still using the standard library? I could define my own filter table, but why do that when you have a standard?

  2. #2
    Sweet
    Join Date
    Aug 2002
    Location
    Tucson, Arizona
    Posts
    1,820
    You'd probably want to use std::wstring instead of std::string, and wchar_t instead of char.

  3. #3
    Registered User
    Join Date
    Oct 2005
    Posts
    271
    Would it make a difference? One thing I do know is that all my data is coming from single byte character text. Straightforward non-unicode. The only reason I'm using that utf8 locale is because all international settings on the (shared) machine I'm using were defined as unicode.

  4. #4
    Sweet
    Join Date
    Aug 2002
    Location
    Tucson, Arizona
    Posts
    1,820
    Ah I forgot about this one.

    http://www.cplusplus.com/reference/i...ase/imbue.html

    I don't know a whole heck of a lot about C++ locales. These are just some suggestions based on my very limited experience with them.

  5. #5
    the hat of redundancy hat nvoigt's Avatar
    Join Date
    Aug 2001
    Location
    Hannover, Germany
    Posts
    3,130
    As far as I remember, an A-Umlaut is represented by two bytes in UTF-8. If you have an ANSI/Extended ASCII file with one byte per character, you need to use a non-UTF-8 character set. Even if you do use UTF-8 (which means a single character is represented by more than one byte if it is outside ASCII), passing a single byte to a function and expecting a meaningful answer about a single character is only valid for characters that would have fit the ASCII table anyway.
    hth
    -nv
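
    To make the byte counts concrete, here is a minimal sketch. The helper name count_non_ascii_bytes and the two word constants are made up for illustration; the bytes are written as hex escapes so the encoding of the source file doesn't matter.

```cpp
#include <cstddef>
#include <string>

// Count the bytes that have the high bit set, i.e. bytes outside 7-bit ASCII.
// (Hypothetical helper, only here to show the difference between encodings.)
std::size_t count_non_ascii_bytes(const std::string& s)
{
    std::size_t n = 0;
    for (unsigned char ch : s)
        if (ch >= 0x80)
            ++n;
    return n;
}

// "erkläre" in both encodings, spelled out byte by byte.
const std::string utf8_word   = "erkl\xC3\xA4re"; // a-umlaut as two bytes: 0xC3 0xA4
const std::string latin1_word = "erkl\xE4re";     // a-umlaut as one byte: 0xE4
```

    The UTF-8 spelling is one byte longer, and a char-based facet gets asked about 0xC3 and 0xA4 separately, so it never sees the a-umlaut as one character.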

    She was so Blonde, she spent 20 minutes looking at the orange juice can because it said "Concentrate."

    When in doubt, read the FAQ.
    Then ask a smart question.

  6. #6
    Registered User
    Join Date
    Oct 2005
    Posts
    271
    Well, then, I'll have to create my own facet. Or if anyone can point me to a user-defined facet floating around for ISO Latin-1, then please do.

  7. #7
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Don't you have a de_DE.ISO-8859-1 locale?

    I'd still say go with wchar_t. But then, I'd also say that C++'s standard library character handling is broken beyond repair, no matter what you do.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  8. #8
    Registered User
    Join Date
    Oct 2005
    Posts
    271
    Nope, tried that locale setting, but because it's not installed on the Ubuntu machine I use (and over which I do not have superuser rights), my program throws a seg fault when it hits the line for allocating the facet. And I really don't want to add the complication of wchar_t when my original documents are in single byte characters.
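
    For what it's worth, std::locale's constructor throws std::runtime_error when the named locale isn't available, and an uncaught exception terminates the program, which might be what shows up as the crash here (that part is a guess). A minimal sketch of guarding against it; make_locale_or_classic is a hypothetical helper name:

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Try to construct the named locale. If the system does not provide it,
// fall back to the classic "C" locale and report that through `found`.
// (Hypothetical helper, not part of the standard library.)
std::locale make_locale_or_classic(const std::string& name, bool& found)
{
    try {
        std::locale loc(name.c_str());
        found = true;
        return loc;
    } catch (const std::runtime_error&) {
        found = false;
        return std::locale::classic();
    }
}
```

    With something like this, the program can at least print a useful message or switch to a hand-made table instead of dying when "de_DE.ISO-8859-1" is missing.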

  9. #9
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Well, then get an ISO-8859-1 code table and get coding. Shouldn't be all that hard.
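
    A rough sketch of what that could look like, assuming the input really is ISO-8859-1. The names latin1_ctype and is_all_alpha_latin1 are made up; the table starts from the classic "C" table and only marks the ranges 0xC0-0xD6, 0xD8-0xF6 and 0xF8-0xFF as letters, so treat it as a starting point rather than a complete Latin-1 classification.

```cpp
#include <cstddef>
#include <locale>
#include <string>
#include <vector>

// Hypothetical facet: a std::ctype<char> driven by a hand-built ISO-8859-1 table.
class latin1_ctype : public std::ctype<char>
{
public:
    explicit latin1_ctype(std::size_t refs = 0)
        : std::ctype<char>(make_table(), false, refs) {}

private:
    static const mask* make_table()
    {
        // Copy the classic "C" table once, then mark the accented letters.
        static const std::vector<mask> table = [] {
            std::vector<mask> t(classic_table(), classic_table() + table_size);
            for (unsigned c = 0xC0; c <= 0xFF; ++c) {
                if (c == 0xD7 || c == 0xF7)
                    continue; // multiplication and division signs are not letters
                // 0xC0-0xDE are capitals, 0xDF (sharp s) and above are lowercase
                t[c] = static_cast<mask>(alpha | (c < 0xDF ? upper : lower));
            }
            return t;
        }();
        return table.data();
    }
};

// True if the word is non-empty and every byte is a letter under the table above.
bool is_all_alpha_latin1(const std::string& word)
{
    // The locale takes ownership of the facet (refs == 0).
    static const std::locale loc(std::locale::classic(), new latin1_ctype);
    const std::ctype<char>& ct = std::use_facet<std::ctype<char> >(loc);
    const char* const first = word.data();
    const char* const last = first + word.size();
    return !word.empty()
        && ct.scan_not(std::ctype_base::alpha, first, last) == last;
}
```

    This doesn't depend on any locale being installed on the machine, and the scan_not test from the first post can stay as it is.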
