Thread: Language matching algorithms

  1. #1
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262

    Language matching algorithms

    Hello all,

    I'm looking for a good algorithm to recognize the language of a written text. So far I have a huge database with all words in every language and their frequency in that language. Of course I also have the frequencies of the words in the written document.

    Now what's a good matching algorithm? So far I take every word that appears in the written text and multiply its "document frequency" by its "overall frequency" in each language. The sum of those products is that language's score, and the language with the highest score is the match.
    So far it actually works quite well; I haven't found a single instance where it returned the wrong language. Still, I wonder if there are better algorithms (I bet there are) to identify the language, whether based on word frequencies or on other techniques, especially since the scores are sometimes quite close - and two very similar languages might get confused.
    I'd also like it to be able to tell, just as reliably, that the language is probably none of those in the database, though I doubt that can be done dependably.
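    In rough C++ terms, the current scoring looks something like this (just a sketch; the word lists and frequency numbers are made-up stand-ins for the real database):
    Code:
        #include <iostream>
        #include <map>
        #include <string>

        int main()
        {
            // Tiny stand-in for the real database: word -> frequency, per language.
            std::map<std::string, std::map<std::string, double>> langFreq = {
                {"english", {{"the", 0.060}, {"house", 0.0010}, {"water", 0.0008}}},
                {"dutch",   {{"de",  0.060}, {"huis",  0.0010}, {"water", 0.0009}}}
            };

            // Word frequencies observed in the input document.
            std::map<std::string, double> docFreq = {
                {"the", 0.05}, {"water", 0.002}, {"house", 0.001}
            };

            // Score each language: sum of documentFrequency * overallFrequency.
            std::string best;
            double bestScore = -1.0;
            for (const auto& [language, freqs] : langFreq) {
                double score = 0.0;
                for (const auto& [word, df] : docFreq) {
                    auto it = freqs.find(word);
                    if (it != freqs.end())
                        score += df * it->second;
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = language;
                }
            }

            std::cout << "best match: " << best << " (score " << bestScore << ")\n";
        }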

    Do you guys know any good algorithms for this?


    Thanks

  2. #2
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    A common and decently reliable technique is this:

    N = number of words in the dictionary (for all languages)
    X[N] = vector of word counts for example document D
    Y[N] = vector of word counts for language L

    Treating X[N] and Y[N] as vectors, compute the angle between them. This angle is a measure of similarity, with lower angles being more similar.

    To compute the angle, use the definition of the dot product:

    X . Y = |X| * |Y| * cos A.

    Therefore, angle A = arccos( X . Y / ( |X| * |Y| ) )

    Where the dot product is the usual one, of course.

    You might experiment with using individual letter counts instead of word counts, as well.
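    The angle computation itself is only a few lines, something like this (a sketch with made-up counts; the helper name angleBetween is just for illustration, and the vectors would really come from your database and the document):
    Code:
        #include <cmath>
        #include <iostream>
        #include <vector>

        // Angle between two count vectors, from X . Y = |X| * |Y| * cos A:
        //   A = arccos( X . Y / ( |X| * |Y| ) )
        double angleBetween(const std::vector<double>& x, const std::vector<double>& y)
        {
            double dot = 0.0, xx = 0.0, yy = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i) {
                dot += x[i] * y[i];
                xx  += x[i] * x[i];
                yy  += y[i] * y[i];
            }
            return std::acos(dot / (std::sqrt(xx) * std::sqrt(yy)));
        }

        int main()
        {
            // Made-up word counts over the same N-word dictionary.
            std::vector<double> document = {12, 0, 3, 7, 0};
            std::vector<double> english  = {10, 1, 2, 8, 0};
            std::vector<double> dutch    = { 1, 9, 0, 2, 6};

            std::cout << "angle to english: " << angleBetween(document, english) << "\n";
            std::cout << "angle to dutch:   " << angleBetween(document, dutch)   << "\n";
        }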
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  3. #3
    Devil's Advocate SlyMaelstrom's Avatar
    Join Date
    May 2004
    Location
    Out of scope
    Posts
    4,079
    I assume you're taking in only ASCII text rather than Unicode; otherwise the first natural step would be to eliminate languages based on special characters found in the text. For instance, only a handful of languages use an a with a breve over it. ... but I'll disregard this, as I assume you're only talking about the standard Latin alphabet used in English.

    First, I'd skip doing any calculations on words below a certain length. Most three-letter words are used in almost every language that uses the Latin alphabet, so they would do far less to eliminate choices than they would cost in performance.

    Read a set amount of text into a buffer, parse it, sort the words by length starting with the longest, and then perform your comparison from the longest word down. Chances are your algorithm will lead you overwhelmingly to a single language after the fifth word or so.
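    Something along these lines for the preprocessing (just a sketch; the three-letter cut-off and the sample text are arbitrary):
    Code:
        #include <algorithm>
        #include <iostream>
        #include <sstream>
        #include <string>
        #include <vector>

        int main()
        {
            std::string buffer = "the extraordinarily long words are usually the most language specific ones";

            // Tokenise the buffer, skipping short words that occur in too many languages.
            std::vector<std::string> words;
            std::istringstream in(buffer);
            for (std::string w; in >> w; )
                if (w.size() > 3)
                    words.push_back(w);

            // Longest words first, so the most language-specific ones are compared first.
            std::sort(words.begin(), words.end(),
                      [](const std::string& a, const std::string& b) { return a.size() > b.size(); });

            for (const auto& w : words)
                std::cout << w << "\n";   // feed these to the matcher in this order
        }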
    Sent from my iPad®

  4. #4
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    Thanks guys!

    @brewbuck:
    Perfect! At least I had the numerator right, but I had a different denominator. And of course I didn't have the arccos. I haven't tried this yet, but it sounds like exactly what I need!

    @Sly:
    Actually, it does support any language right now. However, if a language uses a different character set, then it definitely won't match any words in the Latin-alphabet languages. Thus, the algorithm should automatically take care of those cases.

  5. #5
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by EVOEx View Post
    Thanks guys!

    @brewbuck:
    Perfect! At least I had the numerator right, but I had a different denominator. And of course I didn't have the arccos. I haven't tried this yet, but it sounds like exactly what I need!
    Well, it's pretty close to what you are currently doing. The problem with just taking the dot product, though, is that the result is scaled by the product of the vector magnitudes -- thus, a language vector that simply has "more words" in it than another will receive an inappropriate boost. So that's why dividing by the product of the magnitudes is necessary.

    Strictly, the arccos isn't necessary either if all you are doing is comparing. The angle will only ever range from 0..Pi/2 because all of the vector elements are non-negative. And the cosine function is monotonic over that range, so you don't really need to take the arccos().

    So, comparing the dot products is ALMOST correct, you just need that normalization step.
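    In other words, for ranking languages you can compare the normalised dot products themselves, something like this (a sketch; the counts and the helper name cosineSimilarity are made up):
    Code:
        #include <cmath>
        #include <iostream>
        #include <vector>

        // Normalised dot product (the cosine of the angle). A larger value means a
        // smaller angle, i.e. a better match, so these can be compared directly.
        double cosineSimilarity(const std::vector<double>& x, const std::vector<double>& y)
        {
            double dot = 0.0, xx = 0.0, yy = 0.0;
            for (std::size_t i = 0; i < x.size(); ++i) {
                dot += x[i] * y[i];
                xx  += x[i] * x[i];
                yy  += y[i] * y[i];
            }
            return dot / (std::sqrt(xx) * std::sqrt(yy));
        }

        int main()
        {
            std::vector<double> document = {12, 0, 3, 7, 0};
            std::vector<double> english  = {10, 1, 2, 8, 0};
            std::vector<double> dutch    = { 1, 9, 0, 2, 6};

            // No arccos needed: just pick the language with the larger cosine.
            double ce = cosineSimilarity(document, english);
            double cd = cosineSimilarity(document, dutch);
            std::cout << (ce > cd ? "english" : "dutch") << " wins\n";
        }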
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  6. #6
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    Quote Originally Posted by brewbuck View Post
    So, comparing the dot products is ALMOST correct, you just need that normalization step.
    I did normalize it, but with the total number of words rather than the length of the vector. I first calculated the frequencies in the form of words/total_words. So I normalized by sum(X)*sum(Y) rather than |X|*|Y|.
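    (To see that the two differ: for Y = (3, 4), sum(Y) = 3 + 4 = 7 but |Y| = sqrt(3*3 + 4*4) = 5, and the ratio between the two depends on how evenly the counts are spread, so dividing by sum(X)*sum(Y) and dividing by |X|*|Y| can rank the languages differently.)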

  7. #7
    Registered User
    Join Date
    Oct 2008
    Posts
    1,262
    Actually, I had another thought on this issue. If the input document contains a lot of different words, it may take quite a while to query the database for every word in it. Now I've thought of two solutions for this:
    1. Take only part of the text and analyze the language of that part, and possibly do the same for other parts.
    2. Take the top X words from the input document and the top Y words of each language, and only use those X+Y words in the language detection (roughly as in the sketch below).
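    For the document side of option 2, something like this (a rough sketch; the counts, the cut-off of three words and the helper name topWords are made up):
    Code:
        #include <algorithm>
        #include <iostream>
        #include <map>
        #include <string>
        #include <utility>
        #include <vector>

        // Return the topN most frequent words of the document.
        std::vector<std::string> topWords(const std::map<std::string, int>& counts, std::size_t topN)
        {
            std::vector<std::pair<std::string, int>> sorted(counts.begin(), counts.end());
            std::sort(sorted.begin(), sorted.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });

            std::vector<std::string> result;
            for (std::size_t i = 0; i < sorted.size() && i < topN; ++i)
                result.push_back(sorted[i].first);
            return result;
        }

        int main()
        {
            std::map<std::string, int> docCounts = {
                {"water", 4}, {"huis", 7}, {"de", 21}, {"een", 15}, {"fiets", 2}
            };

            // Only these words (plus each language's own top Y list) would hit the database.
            for (const auto& w : topWords(docCounts, 3))
                std::cout << w << "\n";
        }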

    Which is better? Or is there a third, and better, option?


    Thanks

  8. #8
    Devil's Advocate SlyMaelstrom's Avatar
    Join Date
    May 2004
    Location
    Out of scope
    Posts
    4,079
    Ummm... well, I still stand by my first suggestion that you should be focusing on the longest words in the document. They are almost certainly going to be more telling about what language you are reading than the shorter words.
    Sent from my iPad®

  9. #9
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Read in the first 100 words, do a determination, and if there isn't a clear winner, read another 100 words, and so on. This avoids having to scan long documents completely.
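    Roughly like this (a sketch only; scoreLanguages is a dummy stand-in for the real scorer, and the "clear winner" margin is an arbitrary choice):
    Code:
        #include <algorithm>
        #include <iostream>
        #include <sstream>
        #include <string>
        #include <vector>

        // Dummy stand-in for the real scorer: one score per language for the words so far.
        std::vector<double> scoreLanguages(const std::vector<std::string>& words)
        {
            double english = 0, dutch = 0;
            for (const auto& w : words) {
                if (w == "the") english += 1;
                if (w == "de")  dutch   += 1;
            }
            return {english, dutch};   // index 0 = English, 1 = Dutch
        }

        // Read 100 words at a time and stop as soon as one language clearly leads.
        std::size_t detectLanguage(std::istream& in)
        {
            std::vector<std::string> words;
            std::string w;
            while (in >> w) {
                words.push_back(w);
                if (words.size() % 100 != 0)
                    continue;                              // only re-score every 100 words

                std::vector<double> s = scoreLanguages(words);
                std::vector<double> ranked = s;
                std::sort(ranked.rbegin(), ranked.rend()); // descending
                if (ranked[0] > 2.0 * ranked[1])           // "clear winner": arbitrary margin
                    return std::max_element(s.begin(), s.end()) - s.begin();
            }
            // Ran out of text without a clear winner: take the current leader.
            std::vector<double> s = scoreLanguages(words);
            return std::max_element(s.begin(), s.end()) - s.begin();
        }

        int main()
        {
            std::istringstream text("de kat zat op de mat en de hond keek toe");
            std::cout << "language index: " << detectLanguage(text) << "\n";
        }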
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  10. #10
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    On a similar note, I'd probably prefer to approach this task statistically. So, for an alternative idea:

    Read a statistically relevant portion of the text and do a determination for all languages you can detect (that your database supports), attributing a confidence level to each determination... a real number from 0 to 1 for instance. The text language will probably be the one with the top confidence.
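    One very simple way to turn the raw scores into such confidences (just a sketch; the scores are made up, and normalising them to sum to 1 is only one possible choice):
    Code:
        #include <iostream>
        #include <map>
        #include <string>

        int main()
        {
            // Raw similarity scores per language (made up; e.g. the cosine values).
            std::map<std::string, double> score = {
                {"English (UK)", 0.81}, {"English (US)", 0.79}, {"German", 0.22}
            };

            // Normalise so the confidences add up to 1, then report every candidate.
            double total = 0.0;
            for (const auto& s : score)
                total += s.second;

            for (const auto& s : score)
                std::cout << s.first << ": confidence " << s.second / total << "\n";
        }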

    I propose this option for a few reasons:

    - Your algorithm will not be able to detect a language without a margin of error. With this, you give room for your application to list other options.

    - High and similar confidence levels are possible if your current database includes dialects (UK and US English, the Portuguese of Portugal and of Brazil, etc.). And even if your database doesn't include dialects yet, this way your algorithm is already capable of supporting them.

    - If you can afford to read the whole text, you can even detect multi-language texts. A technical document I was involved with recently at work was a 200-page Word document written in Portuguese, German and English.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  11. #11
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    You might get better (and faster) results if you consider only a subset of words: go word by word through your frequency lists and look for words that have high specificity -- that is, words whose presence strongly suggests one of only a small number of languages. Words that are found in many languages are probably useless, and you can ignore them in the source text.

    Ex. "man" is probably useless because it exists in a large number of languages, but "itterashai" exists in only one.
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.
