UTF8, palindromes and characters

**Kinxil** · 08-25-2016

Hello ladies and gentlemen,

I've been trying to implement palindrome detection on a UTF-8 encoded file. A palindrome is a word or sentence that reads the same in both directions, for example "kayak" or "radar".

While it was a piece of cake in C#, I'm struggling with the C version because of the UTF-8 encoding.

I'm reading and storing each line of the file using:

Code:

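while (fgets(line, sizeof(line), file)) {
  /* fgets() returns non-NULL only on success, so no extra NULL check is needed */
  lines[nb_lines] = malloc(strlen(line) + 1); /* +1 for the terminating '\0' */
  if (lines[nb_lines] == NULL)
    break; /* out of memory */
  strcpy(lines[nb_lines], line);
  if (++nb_lines == lines_alloc) { /* array is full: double its capacity */
    char **tmp = realloc(lines, 2 * lines_alloc * sizeof *tmp);
    if (tmp == NULL)
      break; /* the old block is still valid on failure */
    lines = tmp;
    lines_alloc *= 2;
  }
}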

I'm not sure this is done properly, but so far I can display each line just fine.

The problematic function: I'm trying to remove diacritics using:

Code:

char toNonDiacritic(char c){
  switch(c)
  {
    case 'à':
    case 'â':
    case 'ä':
    case 'ã':
      return 'a';
      break;
    //etc for e,u,i,o
  }
  return c;
}

This method does NOT work: I see weird characters in the terminal output when displaying characters whose diacritics were supposedly removed.

Important information: both the file and my terminal are in UTF-8.

What I think I have understood so far: UTF-8 encodes each character in 1 to 4 bytes, while a "char" is 1 byte. So my problematic characters are probably encoded in more than 1 byte, which makes toNonDiacritic ineffective.
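To make the byte-count point concrete, here is a minimal sketch of how the first byte of a UTF-8 sequence tells you how many bytes the character occupies. The function name `utf8_seq_len` is made up for this illustration and is not part of the original code.

```c
/* Number of bytes in the UTF-8 sequence that starts with `lead`,
   or 0 if `lead` is a continuation byte or not a valid lead byte.
   (Hypothetical helper, for illustration only.) */
int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: plain ASCII       */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2-byte sequence   */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3-byte sequence   */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4-byte sequence   */
    return 0;                             /* 10xxxxxx or invalid byte    */
}
```

For example, "é" is stored as the two bytes 0xC3 0xA9, so `utf8_seq_len(0xC3)` returns 2. A `switch` on a single `char` only ever sees one of those two bytes, never the whole letter.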

What I can't figure out: a simple way to turn a multibyte character into something that can be compared in a switch. So far I have tried the wchar_t type with mbstowcs, without success. Displaying the result with either printf or wprintf, with or without the L prefix, gives me things like:
No├½l a trop par rapport a L├®on.
(Original : Noël a trop par rapport a Léon)

Any ideas?

**Kinxil** · 08-25-2016

I was able to edit my own post an hour ago, not sure why I can't anymore. Oh well. I tried converting the whole implementation to wchar_t (starting from the file reading), but it didn't solve the issue. When reading the file, fgetws loads characters much like fgets does, in the sense that multibyte characters are still split across two wchar_t, even though the type is capable of holding both bytes (sizeof(wchar_t) prints 2).

**algorism** · 08-25-2016

Does this work for you? Try it on one of your utf8 files.

Code:

#include <stdio.h>
#include <locale.h>
#include <wchar.h>
#include <wctype.h>

wchar_t toNonDiacritic(wchar_t c){
    switch (c) {
    case L'à': case L'â': case L'ä': case L'ã':
        return L'a';
    case L'é': case L'è': case L'ê': case L'ë': 
        return L'e'; 
    case L'ì': case L'î': case L'ï': 
        return L'i'; 
    case L'ò': case L'ô': case L'ö': case L'õ': 
        return L'o'; 
    case L'ù': case L'û': case L'ü': 
        return L'u'; 
    }
    return c;
}

int main(void) {
    /* fgetws() decodes input according to the current locale */
    setlocale(LC_ALL, "en_US.UTF-8");
    FILE *f = fopen("in.txt", "r");
    if (f == NULL) {
        perror("in.txt");
        return 1;
    }
    wchar_t s[1000];

    while (fgetws(s, 1000, f) != NULL) {
        for (wchar_t *p = s; *p != L'\0'; p++) {
            *p = towlower(*p);
            *p = toNonDiacritic(*p);
        }
        fputws(s, stdout);
    }

    fclose(f);
    return 0;
}

**Kinxil** · 08-26-2016

I tried that but it won't compile. I'm getting the following error several times:

Code:

error C2196: case value '195' already used

What I don't understand is the switch behaviour. Before that I tried a wchar_t version, toNonDiacritic2 (similar to yours), after converting everything to wchar_t, but with only one case (L'é'). For some unknown reason every single diacritic in the text was treated as L'é'. And now that I add more cases, the compiler tells me they overlap...

**Malcolm McLean** · 08-26-2016

Don't use wchar_t.
UTF-8 works, until it doesn't. Pass everything about as UTF-8, until you need
code points instead of strings. Then just use int for the code point.

Palindrome isn't necessarily a simple concept in every script. For example
Hebrew has vowel points, which you probably wouldn't include in a palindrome.
But it also has special "ending forms". So "M" looks a bit like a twisted Latin "n"
at the start or inside of a word, and like a square box at the end. Then "Libya"
is thought of as a palindrome in Arabic, but, as in English, it starts with L (lam)
and ends in an A (alif). However those letters are both straight lines in those
positions.

However for a simple palindrome detector, just convert the UTF-8 to a list of
code points, then run the normal palindrome detection algorithm.
(You can step backwards if you insist on being in place, but it's likely to
be more fiddly than it's worth).
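The "convert to code points, then compare" idea can be sketched as follows. This is only an illustration under stated assumptions: the names `utf8_decode` and `is_palindrome_utf8` and the fixed `MAX_CP` limit are invented here, the decoder is simplified (it does not reject overlong forms or surrogates), and no case or diacritic folding is done.

```c
#include <stddef.h>

/* Decode one UTF-8 sequence starting at s into *cp.
   Returns the number of bytes consumed, or 0 on malformed input.
   Simplified: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, int *cp)
{
    size_t len;
    if (s[0] < 0x80)                { *cp = s[0];        return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid byte */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* also catches a premature '\0' */
            return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return len;
}

/* Returns 1 if the UTF-8 string reads the same in both directions,
   0 if it does not, and -1 if it is malformed or has more than
   MAX_CP code points (an arbitrary limit chosen for this sketch). */
int is_palindrome_utf8(const char *str)
{
    enum { MAX_CP = 1024 };
    int cps[MAX_CP];
    size_t n = 0;
    const unsigned char *s = (const unsigned char *)str;

    while (*s != '\0') {
        size_t used;
        if (n == MAX_CP) return -1;
        used = utf8_decode(s, &cps[n]);
        if (used == 0) return -1;
        s += used;
        n++;
    }
    /* Ordinary palindrome test, but on code points instead of bytes. */
    for (size_t i = 0; i < n / 2; i++)
        if (cps[i] != cps[n - 1 - i])
            return 0;
    return 1;
}
```

With this, "été" (bytes C3 A9 74 C3 A9) is reported as a palindrome, even though its raw bytes are not symmetric. A line read with fgets would need its trailing newline stripped first, and spaces, case and diacritics would still have to be normalised before the comparison.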

**Kinxil** · 08-26-2016

I don't need to step backwards: when a palindrome is detected, I just have to display the original line, which displays fine.

Gotta try that.

**Kinxil** · 08-26-2016

Okay, got it working on both Windows and Ubuntu Bash for Windows (not sure if it would work on pure Linux) with two methods. In both cases I first detect a diacritic using:

Code:

int diacriticDetected(unsigned char c){
  if(c == 0xC3)
    return 1;
  //etc
  return 0;
}

This way my function toNonDiacritic can be applied to the two chars that make up the 2-byte character; in other cases I just test regular letters. I converted everything to unsigned char for easier hex comparison.

The first method is the one Malcolm suggested: I check the hex codes directly. The second method is a string comparison after rebuilding the 2-byte character. It needs fewer raw hex codes but is probably slower, so Malcolm's version is most likely the way to go.

Code:

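unsigned char toNonDiacritic(unsigned char c, unsigned char c_){
  switch(c){ //First method: compare the raw UTF-8 bytes
    case 0xC3:
      if(c_>=0xA0 && c_<=0xA5) //0xC3 0xA0..0xA5 = à á â ã ä å
        return 'a';
      break;
  }

  //Second method: rebuild the 2-byte character and compare strings
  //(needs <string.h>, and this source file must itself be saved as UTF-8)
  unsigned char dia[3] = {c,c_,'\0'};
  const char *s = (const char*)dia; //strcmp expects char*, not unsigned char*
  if(!strcmp(s,"é")||!strcmp(s,"ê")||!strcmp(s,"ë")||!strcmp(s,"è"))
    return 'e';
  return c;
}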

Many thanks for the help.
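For anyone reading later, the byte-level approach described above could be applied to a whole line roughly like this. It is a sketch, not the code actually used in this thread: `fold_c3` and `fold_diacritics` are hypothetical names, and only lowercase letters whose UTF-8 encoding starts with 0xC3 (U+00E0 to U+00FF) are handled.

```c
/* Map the byte that follows a 0xC3 lead byte to an unaccented
   lowercase ASCII letter, or return 0 if it is not handled.
   0xC3 0xA0..0xBF encodes U+00E0..U+00FF. */
static char fold_c3(unsigned char second)
{
    if (second >= 0xA0 && second <= 0xA5) return 'a'; /* à á â ã ä å */
    if (second == 0xA7)                   return 'c'; /* ç           */
    if (second >= 0xA8 && second <= 0xAB) return 'e'; /* è é ê ë     */
    if (second >= 0xAC && second <= 0xAF) return 'i'; /* ì í î ï     */
    if (second >= 0xB2 && second <= 0xB6) return 'o'; /* ò ó ô õ ö   */
    if (second >= 0xB9 && second <= 0xBC) return 'u'; /* ù ú û ü     */
    return 0;
}

/* Copy src to dst, replacing each handled 0xC3 xx pair with one ASCII
   letter. dst needs room for strlen(src) + 1 bytes; the result is
   never longer than the input. */
void fold_diacritics(const char *src, char *dst)
{
    const unsigned char *s = (const unsigned char *)src;
    while (*s != '\0') {
        char folded = 0;
        if (s[0] == 0xC3 && s[1] != '\0')
            folded = fold_c3(s[1]);
        if (folded != 0) {
            *dst++ = folded;
            s += 2;              /* consumed lead byte + continuation byte */
        } else {
            *dst++ = (char)*s++; /* copy everything else unchanged */
        }
    }
    *dst = '\0';
}
```

Because the comparison uses byte values rather than character literals, it does not depend on how the compiler reads the source file. Uppercase accented letters (0xC3 0x80 to 0x9F) and characters outside this range are copied through untouched.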

**Codeplug** · 08-27-2016

Some things to know when you have extended character literals in your source: Non-English characters with cout

gg
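On the extended-literal point: one plausible reading of the C2196 error quoted earlier (an assumption, the thread does not confirm it) is that the compiler read the UTF-8 source in a single-byte code page, so every L'é'-style literal started with the same byte 0xC3, which is 195. A way to sidestep source-encoding questions altogether is to write the code points as numbers. The sketch below uses the hypothetical name `to_non_diacritic_cp` and the same set of letters as the wchar_t version above.

```c
/* Map the code point of an accented lowercase letter to its base
   letter; every other code point is returned unchanged.
   Cases are numeric, so the encoding of this source file is irrelevant. */
int to_non_diacritic_cp(int cp)
{
    switch (cp) {
    case 0x00E0: case 0x00E2: case 0x00E3: case 0x00E4: /* à â ã ä */
        return 'a';
    case 0x00E8: case 0x00E9: case 0x00EA: case 0x00EB: /* è é ê ë */
        return 'e';
    case 0x00EC: case 0x00EE: case 0x00EF:              /* ì î ï   */
        return 'i';
    case 0x00F2: case 0x00F4: case 0x00F5: case 0x00F6: /* ò ô õ ö */
        return 'o';
    case 0x00F9: case 0x00FB: case 0x00FC:              /* ù û ü   */
        return 'u';
    default:
        return cp;
    }
}
```

This fits the "just use int for the code point" advice: decode the UTF-8 first, then pass each code point through a function like this one.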
