Hello all,
I'm looking for a good algorithm to recognize the language of a written text. So far I have a huge database with all words in every language and their frequency in that language. Of course I also have the frequencies of the words in the written document.
Now, what's a good matching algorithm? So far, for every word in the written text, I multiply its "document frequency" by its "overall frequency" in a given language. The sum of those products is that language's score, and the language with the highest score is the match.
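To make the scoring above concrete, here is a minimal sketch in Python. It is not my actual code: the names (`LANGUAGE_FREQS`, `document_frequencies`, `score_languages`, `guess_language`) and the tiny frequency table are made up for illustration, and it assumes words can simply be split on whitespace.

```python
from collections import Counter

# Hypothetical toy table: relative frequency of each word per language.
# The real database would hold far more words and languages.
LANGUAGE_FREQS = {
    "english": {"the": 0.05, "and": 0.03, "is": 0.02},
    "german": {"der": 0.04, "und": 0.03, "ist": 0.02},
}


def document_frequencies(text):
    """Return the relative frequency of each word in the text."""
    words = text.lower().split()  # assumption: whitespace tokenization
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def score_languages(text, language_freqs=LANGUAGE_FREQS):
    """Sum document frequency * language frequency over the document's words."""
    doc = document_frequencies(text)
    return {
        lang: sum(freq * freqs.get(word, 0.0) for word, freq in doc.items())
        for lang, freqs in language_freqs.items()
    }


def guess_language(text, language_freqs=LANGUAGE_FREQS):
    """Return the language with the highest score."""
    scores = score_languages(text, language_freqs)
    return max(scores, key=scores.get)
```

Calling `score_languages(text)` instead of `guess_language(text)` returns the raw scores for every language, which is how I notice when two of them end up close together.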
So far, it actually works quite well. I haven't found a single instance where it returned the wrong language. Still, I wonder whether there are better algorithms (I bet there are) to identify a language, whether they use word frequencies or other techniques, especially since the scores are sometimes quite close, so it might misidentify a language if two languages are too similar.
I'd also like it to be able to tell that the language is probably none of those in the database, though I doubt it can do that reliably.
Do you guys know any good algorithms for this?
Thanks