1. Language matching algorithms

Hello all,

I'm looking for a good algorithm to recognize the language of a written text. So far I have a huge database with all words in every language and their frequency in that language. Of course I also have the frequencies of the words in the written document.

Now what's a good matching algorithm? So far I take all the words that are in the written text, multiply the "document frequency" with the "overall frequency". The sum of that is a score and the language that scores the highest is the matching language.
So far, it actually works quite well; I haven't found a single instance where it returned the wrong language. Still, I wonder whether there are better algorithms (I bet there are) to identify the language, whether based on word frequencies or on other techniques - especially since the scores are sometimes quite close, and the method might misidentify the language if two languages are too similar.
I'd also like it to be able to tell that the language is probably none of those in the database, though I doubt that can be done reliably.

Do you guys know any good algorithms for this?

Thanks

2. A common and decently reliable technique is this:

N = number of words in the dictionary (for all languages)
X[N] = vector of word counts for example document D
Y[N] = vector of words counts for language L

Treating X[N] and Y[N] as vectors, compute the angle between them. This angle is a measure of similarity, with lower angles being more similar.

To compute the angle, use the definition of the dot product:

X . Y = |X| * |Y| * cos A.

Therefore, angle A = arccos( X . Y / ( |X| * |Y| ) )

Where the dot product is the usual, of course.

You might experiment with using individual letter counts instead of word counts, as well.
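The angle comparison above can be sketched in a few lines of Python (the vocabulary, word lists, and helper names here are all made up for illustration):

```python
import math
from collections import Counter

def count_vector(words, vocabulary):
    """Word counts for `words`, ordered by a shared vocabulary."""
    counts = Counter(words)
    return [counts[w] for w in vocabulary]

def angle(x, y):
    """Angle between two count vectors, in radians; smaller means more similar."""
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    return math.acos(max(-1.0, min(1.0, dot / norms)))

# Toy vocabulary and documents, purely for illustration.
vocab = ["the", "cat", "de", "kat"]
doc = count_vector("the cat the".split(), vocab)
english = count_vector("the the cat".split(), vocab)
dutch = count_vector("de kat de".split(), vocab)

# The document sits at angle 0 to the English profile, pi/2 to the Dutch one.
print(angle(doc, english), angle(doc, dutch))
```

In a real system the vectors would be built from the frequency database, but the comparison step stays exactly this simple.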

3. I assume you're taking in only ASCII text rather than Unicode; otherwise, the first natural step would be to eliminate languages based on special characters found in the text. For instance, only a handful of languages use an 'a' with a breve over it. ... but I'll disregard this, as I assume you're only talking about text in the standard Latin alphabet used in English.

First, I'd disregard doing any calculations on words below a certain length. Most three-letter words are used in almost every language that uses the Latin alphabet, so they would do far less to eliminate choices than they would cost in performance.

Read a set amount of text into a buffer, parse it, sort it by length starting with the largest, and then perform your comparison from the largest word first. Chances are your algorithm will lead you overwhelmingly to a single language after the fifth word or so.
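A rough sketch of that preprocessing step in Python (the three-letter cutoff comes from the suggestion above; the function name and sample text are made up):

```python
def longest_first(text, limit=5):
    """Distinct words longer than three letters, longest first, up to `limit`."""
    words = {w.lower().strip(".,;:!?") for w in text.split()}
    long_words = [w for w in words if len(w) > 3]
    return sorted(long_words, key=len, reverse=True)[:limit]

sample = "The extraordinarily comprehensive dictionary contains many words."
print(longest_first(sample))
```

The scoring comparison would then run over just these few words instead of the whole document.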

4. Thanks guys!

@brewbuck:
Perfect! At least I had the numerator right, but I had a different denominator. And of course I didn't have the arccos. I haven't tried this yet, but it sounds like exactly what I need!

@Sly:
Actually, it does support any language right now. However, if a language uses a different character set, its words simply won't match any words of the Latin-alphabet languages, so the algorithm should automatically take care of those cases.

5. Originally Posted by EVOEx
Thanks guys!

@brewbuck:
Perfect! At least I had the numerator right, but I had a different denominator. And of course I didn't have the arccos. I haven't tried this yet, but it sounds like exactly what I need!
Well, it's pretty close to what you are currently doing. The problem with just taking the dot product, though, is that the result is scaled by the product of the vector magnitudes -- thus, a language vector that simply has "more words" in it than another will receive an inappropriate boost. So that's why dividing by the product of the magnitudes is necessary.

Strictly, the arccos isn't necessary either if all you are doing is comparing. The angle will only ever range from 0..Pi/2 because all of the vector elements are non-negative. And the cosine function is monotonic over that range, so you don't really need to take the arccos().

So, comparing the dot products is ALMOST correct, you just need that normalization step.
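A small illustration of why that normalization matters, with made-up count vectors - the raw dot product favors the "bigger" language profile, while the normalized cosine does not (and, per the point above, no arccos is needed for comparison):

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Normalized dot product; monotonic in the angle, so no arccos needed."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

doc = [3, 1, 0, 0]
lang_a = [3, 1, 0, 0]       # same distribution as the document
lang_b = [10, 10, 10, 10]   # simply has "more words" overall

print(dot(doc, lang_a), dot(doc, lang_b))        # raw: lang_b wins, wrongly
print(cosine(doc, lang_a), cosine(doc, lang_b))  # normalized: lang_a wins
```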

6. Originally Posted by brewbuck
So, comparing the dot products is ALMOST correct, you just need that normalization step.
I did normalize it, just with the total number of words rather than the length of the vector: I first calculated the frequencies as words/total_words, so effectively I normalized by sum(X)*sum(Y) rather than |X|*|Y|.

7. Actually, I had another thought on this issue. If the input document contains a lot of different words, it may take quite a while to query the database for every word that exists in this input document. Now I've thought of two solutions for this:
1. Take only part of the text and analyze the language of that part, and possibly do the same for other parts.
2. Take the top X words from the input document and the top Y words of each language, and only use those X+Y words in the language detection.
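Option 2 could be sketched roughly like this (the frequency tables, sample text, and cutoffs here are invented just to exercise the idea):

```python
from collections import Counter

def top_words_score(doc_words, lang_freq, x=3, y=3):
    """Score a language using only the document's top-x words
    and the language's top-y words."""
    doc_top = dict(Counter(doc_words).most_common(x))
    lang_top = dict(Counter(lang_freq).most_common(y))
    shared = doc_top.keys() & lang_top.keys()
    return sum(doc_top[w] * lang_top[w] for w in shared)

# Made-up frequency tables and sample text.
dutch = {"de": 0.05, "het": 0.04, "een": 0.03, "kat": 0.001}
english = {"the": 0.06, "of": 0.03, "and": 0.028}
words = "de kat zat op de mat de".split()

print(top_words_score(words, dutch), top_words_score(words, english))
```

Since the most frequent words in a language are usually its function words, even a small x and y tend to separate languages quickly.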

Which is better? Or is there a third, and better, option?

Thanks

8. Ummm... well, I still stand by my first conclusion that you should focus on the longest words in the document. They are almost certainly going to be more telling of the language you are reading than the shorter words.

9. Read in the first 100 words, do a determination, and if there isn't a clear winner, read another 100 words, and so on. This avoids having to scan long documents completely.
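A sketch of that incremental loop (the scoring function, stop-word sets, and the 1.5x "clear winner" margin are all assumptions for illustration):

```python
def detect_incremental(words, score_fn, languages, chunk=100, margin=1.5):
    """Score growing prefixes of the text, stopping once the leader
    beats the runner-up by `margin` (assumes at least two languages)."""
    seen = []
    winner = None
    for start in range(0, len(words), chunk):
        seen.extend(words[start:start + chunk])
        ranked = sorted(languages, key=lambda l: score_fn(seen, l))
        winner, runner_up = ranked[-1], ranked[-2]
        if score_fn(seen, winner) >= margin * score_fn(seen, runner_up):
            break  # clear winner: no need to read further
    return winner

# Toy scoring: count words that appear in a language's stop-word set.
profiles = {"en": {"the", "and"}, "nl": {"de", "het"}}
text = ("the cat and the dog " * 30).split()  # 150 words
lang = detect_incremental(text, lambda ws, l: sum(w in profiles[l] for w in ws),
                          list(profiles))
print(lang)
```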

10. On a similar note, I'd probably prefer to approach this task statistically. So, for an alternative idea:

Read a statistically relevant portion of the text and do a determination for all languages you can detect (that your database supports), attributing a confidence level to each determination... a real number from 0 to 1 for instance. The text language will probably be the one with the top confidence.

I propose this option for a few reasons:

- Your algorithm will not be able to detect a language without a margin of error. With this, you give room for your application to list other options.

- High and similar confidence levels are to be expected if your database includes dialects (UK and US English, Portugal and Brazil Portuguese, etc.). And if your database doesn't include dialects yet, this approach means your algorithm is already prepared to support them.

- If you can afford to read the whole text, you can even detect multi-language texts. A technical document I was involved with recently at work was a 200-page Word document written in Portuguese, German and English.
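One simple way to get such confidence levels (just a sketch - the similarity scores below are invented) is to normalize the per-language scores so they sum to 1:

```python
def confidences(scores):
    """Turn per-language similarity scores into confidence levels in [0, 1]."""
    total = sum(scores.values())
    if total == 0:
        return {lang: 0.0 for lang in scores}  # no evidence for any language
    return {lang: s / total for lang, s in scores.items()}

conf = confidences({"en-US": 0.81, "en-GB": 0.78, "pt": 0.05})
ranked = sorted(conf, key=conf.get, reverse=True)
print(ranked, conf)
```

Note how the two English dialects end up with close confidence values, which is exactly the behavior described above.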

11. You might get better (and faster) results if you consider only a subset of words: go word-by-word through your frequency lists and look for words that have high specificity -- that is, words whose presence strongly suggests only one of a small number of languages. Words that are found in many languages are probably useless, and you can ignore them in source text.

Ex. "man" is probably useless because it exists in a large number of languages, but "itterashai" exists in only one.
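That specificity idea could be computed like this (the frequency tables are made up; a word scoring 1.0 appears in only one language, while a word spread evenly over N languages scores 1/N):

```python
def specificity(word, lang_freqs):
    """Fraction of a word's total frequency owned by its best-matching language."""
    freqs = [table.get(word, 0.0) for table in lang_freqs.values()]
    total = sum(freqs)
    return max(freqs) / total if total else 0.0

# Invented per-language frequency tables.
lang_freqs = {
    "en": {"man": 0.002, "hello": 0.001},
    "de": {"man": 0.003},
    "ja": {"itterashai": 0.0005},
}

print(specificity("itterashai", lang_freqs))  # appears in one language only
print(specificity("man", lang_freqs))         # shared across languages
```

Words below some specificity threshold could then simply be dropped from both the database queries and the source text.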