Thread: UTF8, palindromes and characters

  1. #1
    Registered User Kinxil's Avatar
    Join Date
    Aug 2016
    Posts
    5

    UTF8, palindromes and characters

    Hello ladies and gentlemen,

    I'm trying to achieve palindrome detection on a UTF-8 encoded file. A palindrome is a word or sentence that mirrors itself, for example "kayak" or "radar".

    While it has been a piece of cake in C#, I'm struggling with its C version because of UTF8 encoding.

    I'm reading and storing each line of the file using:
    Code:
    while (fgets(line, sizeof(line), file)) {
      if(line != NULL)
      {
        lines[nb_lines] = (char*)malloc(MAX_LINE_LENGTH*sizeof(char));
        strcpy(lines[nb_lines],line);
        if(++nb_lines == lines_alloc) { //pre increment needed!
          lines_alloc *= 2;
          lines = (char**)realloc(lines,lines_alloc*sizeof(char*));
        }
      }
    }
    So far, not sure if properly done, but I can display each line just fine.

    The problematic function : I'm trying to remove diacritics using :
    Code:
    char toNonDiacritic(char c){
      switch(c)
      {
        case 'à':
        case 'â':
        case 'ä':
        case 'ã':
          return 'a';
          break;
        //etc for e,u,i,o
      }
      return c;
    }
    This method does NOT work: I see weird characters in the terminal output when displaying characters whose diacritics were supposedly removed.

    Important information: both the file and my terminal are in UTF-8.

    What I think I have understood so far: UTF-8 encodes characters in 1 to 4 bytes, while a "char" is 1 byte. So my problematic characters are probably encoded in more than 1 byte, which makes toNonDiacritic ineffective.
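    To make the byte-length point concrete, here is a minimal sketch that classifies a UTF-8 lead byte by the length of the sequence it starts. The helper name utf8_seq_len is hypothetical (it is not from the thread), and the check is simplified: it does not reject the invalid lead bytes 0xC0, 0xC1 or anything above 0xF4.

```c
/* Length in bytes of the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` is a continuation byte or not a lead byte at all.
   Simplified sketch: 0xC0, 0xC1 and 0xF5..0xF7 are not rejected. */
static int utf8_seq_len(unsigned char lead) {
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: plain ASCII        */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2-byte sequence    */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3-byte sequence    */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4-byte sequence    */
    return 0;                             /* 10xxxxxx or invalid          */
}
```

    For instance, "é" is the two bytes 0xC3 0xA9 in UTF-8 and "à" is 0xC3 0xA0. Both start with the same lead byte, so a switch over a single char cannot tell them apart.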

    What I can't figure out: a simple way to turn a multibyte character into something that can be compared in a switch. So far I have tried the wchar_t type with mbstowcs, without success. Displaying the result with either printf or wprintf, with or without the L prefix, gives me things like:
    No├½l a trop par rapport a L├®on.
    (Original : Noël a trop par rapport a Léon)

    Any ideas?
    Last edited by Kinxil; 08-25-2016 at 07:04 AM.

  2. #2
    Registered User Kinxil's Avatar
    Join Date
    Aug 2016
    Posts
    5
    I was able to modify my own post an hour ago, not sure why I can't anymore. Oh well. I attempted to fully convert the implementation to wchar_t (starting from the file reading), but it didn't solve the issue. While reading the file, fgetws loads characters the same way fgets does, in the sense that multibyte characters are still split across two wchar_t elements, even though the type is capable of holding both bytes (sizeof(wchar_t) prints 2).

  3. #3
    Registered User
    Join Date
    Jun 2015
    Posts
    1,640
    Does this work for you? Try it on one of your utf8 files.
    Code:
    #include <stdio.h>
    #include <locale.h>
    #include <wchar.h>
    #include <wctype.h>
    
    wchar_t toNonDiacritic(wchar_t c){
        switch (c) {
        case L'à': case L'â': case L'ä': case L'ã':
            return L'a';
        case L'é': case L'è': case L'ê': case L'ë': 
            return L'e'; 
        case L'ì': case L'î': case L'ï': 
            return L'i'; 
        case L'ò': case L'ô': case L'ö': case L'õ': 
            return L'o'; 
        case L'ù': case L'û': case L'ü': 
            return L'u'; 
        }
        return c;
    }
    
    int main(void) {
        setlocale(LC_ALL, "en_US.UTF-8");
        FILE *f = fopen("in.txt", "r");
        if (f == NULL) {  // bail out if the input file cannot be opened
            perror("in.txt");
            return 1;
        }
        wchar_t s[1000];
    
        while (fgetws(s, 1000, f) != NULL) {
            for (wchar_t *p = s; *p != L'\0'; p++) {
                *p = towlower(*p);
                *p = toNonDiacritic(*p);
            }
            fputws(s, stdout);
        }
    
        fclose(f);
        return 0;
    }

  4. #4
    Registered User Kinxil's Avatar
    Join Date
    Aug 2016
    Posts
    5
    I tried to do that but it won't compile. I'm getting the following error several times:
    Code:
    error C2196: case value '195' already used
    What I do not understand is the switch behaviour. Before that, I tried writing a wchar_t version, toNonDiacritic2 (similar to yours), while converting everything to wchar_t, but with only one case (L'é'). For some unknown reason, every single diacritic in the text was treated as L'é'. And now that I add more cases, the compiler tells me they overlap...

  5. #5
    Registered User
    Join Date
    May 2012
    Posts
    505
    Don't use wchar_t.
    UTF-8 works, until it doesn't. Pass everything about as UTF-8, until you need
    code points instead of strings. Then just use int for the code point.

    Palindrome isn't necessarily a simple concept in every script. For example
    Hebrew has vowel points, which you probably wouldn't include in a palindrome.
    But it also has special "ending forms". So "M" looks a bit like a twisted Latin "n"
    at the start or inside of a word, and like a square box at the end. Then "Libya"
    is thought of as a palindrome in Arabic, but, as in English, it starts with L (lam)
    and ends in an A (alif). However those letters are both straight lines in those
    positions.

    However, for a simple palindrome detector, just convert the UTF-8 to a list of
    code points, then run the normal palindrome detection algorithm.
    (You can step backwards if you insist on working in place, but it's likely to
    be more fiddly than it's worth.)
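    The approach described above (decode to code points, then run the usual two-index check) could be sketched as follows. All function names here are hypothetical, code points are held in plain int as suggested, and the decoder is simplified: it rejects stray or missing continuation bytes but does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s into *cp.
   Returns the number of bytes consumed, or 0 on malformed input.
   Simplified: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, int *cp) {
    size_t len;
    if (s[0] < 0x80)                { *cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* also stops at a premature '\0' */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Fill out[] with at most max code points from a NUL-terminated UTF-8 string.
   Returns the number of code points, or -1 on malformed input or overflow. */
static int utf8_to_codepoints(const char *str, int *out, int max) {
    const unsigned char *s = (const unsigned char *)str;
    int n = 0;
    while (*s) {
        size_t len;
        if (n == max) return -1;
        len = utf8_decode(s, &out[n]);
        if (len == 0) return -1;
        s += len;
        n++;
    }
    return n;
}

/* Plain two-index palindrome test over an array of code points. */
static int is_palindrome(const int *cp, int n) {
    for (int i = 0, j = n - 1; i < j; i++, j--)
        if (cp[i] != cp[j]) return 0;
    return 1;
}
```

    With this, a string such as "été" (bytes C3 A9 74 C3 A9) decodes to three code points and passes the check, even though its raw bytes are not symmetric. Case folding, diacritic removal and skipping spaces would be applied to the code point array before calling is_palindrome.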
    I'm the author of MiniBasic: How to write a script interpreter and Basic Algorithms
    Visit my website for lots of associated C programming resources.
    https://github.com/MalcolmMcLean


  6. #6
    Registered User Kinxil's Avatar
    Join Date
    Aug 2016
    Posts
    5
    I don't need to step backward: when a palindrome is detected, I just have to display the original line, which displays fine.

    Gotta try that.

  7. #7
    Registered User Kinxil's Avatar
    Join Date
    Aug 2016
    Posts
    5
    Okay, I got it working on both Windows and Ubuntu Bash for Windows (not sure whether it would work on pure Linux) with two methods. In both cases I first detect a diacritic using:
    Code:
    int diacriticDetected(unsigned char c){
      if(c == 0xC3)
        return 1;
      //etc
      return 0;
    }
    This way my function toNonDiacritic can be applied to the two bytes that make up the 2-byte character; in other cases I just test regular letters. I converted everything to unsigned char for easier hex comparison.

    The first method is the one Malcolm suggested: I check the hex codes directly. The second method is string comparison after recomposing the 2-byte character. It needs far fewer bare hex codes but is probably less efficient in compute time, so Malcolm's version is likely the way to go.

    Code:
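    unsigned char toNonDiacritic(unsigned char c, unsigned char c_){
      switch(c){ //First method: test the byte that follows the 0xC3 lead byte
        case 0xC3:
          if(c_<=0xA5 && c_>=0xA0) //U+00E0..U+00E5: à á â ã ä å
            return 'a';
          break;
      }


      //Second method: rebuild the 2-byte character and compare strings (needs <string.h>)
      char dia[3] = {(char)c,(char)c_,'\0'}; //plain char so strcmp accepts it
      if(!strcmp(dia,"é")||!strcmp(dia,"ê")||!strcmp(dia,"ë")||!strcmp(dia,"è"))
        return 'e';
      return c;
    }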
    Many thanks for the help.
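    For readers who want the byte-level method applied to a whole line, here is a minimal sketch. The names fold_c3 and strip_diacritics are hypothetical, and the table is an assumption of this sketch rather than the code used above: it covers only lowercase letters in U+00E0..U+00FF, all of which are encoded with the lead byte 0xC3.

```c
#include <stddef.h>

/* Map the continuation byte that follows a 0xC3 lead byte to an ASCII letter,
   or return 0 when this table has no mapping for the character.
   Only lowercase letters in U+00E0..U+00FF are covered. */
static char fold_c3(unsigned char cont) {
    if (cont >= 0xA0 && cont <= 0xA5) return 'a';  /* U+00E0..U+00E5 */
    if (cont == 0xA7)                 return 'c';  /* U+00E7 */
    if (cont >= 0xA8 && cont <= 0xAB) return 'e';  /* U+00E8..U+00EB */
    if (cont >= 0xAC && cont <= 0xAF) return 'i';  /* U+00EC..U+00EF */
    if (cont == 0xB1)                 return 'n';  /* U+00F1 */
    if (cont >= 0xB2 && cont <= 0xB6) return 'o';  /* U+00F2..U+00F6 */
    if (cont >= 0xB9 && cont <= 0xBC) return 'u';  /* U+00F9..U+00FC */
    if (cont == 0xBD || cont == 0xBF) return 'y';  /* U+00FD, U+00FF */
    return 0;
}

/* Copy in to out, replacing each mapped 2-byte character with one ASCII
   letter and copying every other byte unchanged.
   out must hold at least strlen(in) + 1 bytes; returns the output length. */
static size_t strip_diacritics(const char *in, char *out) {
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;
    while (*s) {
        char folded = (s[0] == 0xC3 && s[1] != '\0') ? fold_c3(s[1]) : 0;
        if (folded) { out[n++] = folded; s += 2; }
        else        { out[n++] = (char)*s++; }
    }
    out[n] = '\0';
    return n;
}
```

    Because the output is pure ASCII for the mapped letters, an ordinary byte-wise palindrome check can then run on the result, while the original line is kept for display.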
    Last edited by Kinxil; 08-26-2016 at 04:32 AM.

  8. #8
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    Some things to know when you have extended character literals in your source: Non-English characters with cout
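    As an illustration of why source literals matter: the error in post #4 ("case value '195' already used") is consistent with the compiler reading the UTF-8 source file as a single-byte code page, in which case every accented literal starts with the byte 0xC3 (decimal 195). That diagnosis is an assumption, since the thread does not show the build settings. One way to sidestep the problem is to write the case labels as numeric code points, so they no longer depend on how the source file is decoded. The function name below is hypothetical:

```c
#include <wchar.h>

/* Same mapping idea as a switch over wide characters, but every case label
   is a numeric code point, so the result does not depend on the encoding
   the compiler assumes for the source file. */
static wchar_t to_non_diacritic(wchar_t c) {
    switch (c) {
    case 0x00E0: case 0x00E2: case 0x00E4: case 0x00E3:  /* a with grave, circumflex, diaeresis, tilde */
        return L'a';
    case 0x00E9: case 0x00E8: case 0x00EA: case 0x00EB:  /* e with acute, grave, circumflex, diaeresis */
        return L'e';
    case 0x00EC: case 0x00EE: case 0x00EF:               /* i with grave, circumflex, diaeresis */
        return L'i';
    case 0x00F2: case 0x00F4: case 0x00F6: case 0x00F5:  /* o with grave, circumflex, diaeresis, tilde */
        return L'o';
    case 0x00F9: case 0x00FB: case 0x00FC:               /* u with grave, circumflex, diaeresis */
        return L'u';
    }
    return c;
}
```

    The input still has to be decoded correctly (for example by setting a UTF-8 locale before calling fgetws) for the wide characters to carry these code point values.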

    gg
