Thread: Wickedly F-A-S-T

  1. #1
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868

    Wickedly F-A-S-T

    I wrote this to help out a guy with a PHP program that was taking 15 seconds to identify (unjumble and match) three words against a given list of words. That was too slow. (Turns out he didn't need it after all.)

    This is an absolute screamer - I limited it to just one word, but it looks like tens of thousands of words could be un-jumbled and matched in under 15 seconds, no problem.

    It shouldn't be a surprise maybe, but this is really FAST!

    Note: I used 18.2 in the timer line of code, which is the classic PC timer tick rate (18.2 ticks per second). That lets the program run under any compiler, no matter what value that compiler's CLK_TCK macro uses.

    Code:
    /* finds jumbled words and matches them to a list of words, FAST! 
       REALLY fast!! :)
    
    Adak, Jan 7, 2011
    
    status: OK
    
    */
    
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>  /* for strcpy() */
    #include <time.h>
    
    int main() {
      int i,j, k, n; 
      clock_t start, stop;     //clock() returns clock_t, not time_t
      int wrdNum[26]={0};
      int libNum[26]={0};
      char *lib[] = {"ability", "absence", "actions", "amazing", "believe", "browser", 
      "caption", "captive", "ceiling", "degrees", "denizen", "develop", "examine", 
      "example", "exploit", "fifteen", "focused", "fortune", "forward", "garbage", 
      "gasping", "graphic", "handgun", "hastily", "helpful", "iceberg", "impeach", 
      "inspect", "jawbone", "justify", "keyword", "kickoff", "kneepad", "lactose", 
      "latency", "legible", "madness", "magical", "manhunt", "mission", "nametag", 
      "nitrate", "nowhere", "officer", "optical", "outlook", "oxidize", "paradox", 
      "pathway", "patient", "payment", "qualify", "quality", "quarrel", "radical", 
      "railing", "reduced", "resolve", "savings", "sayings", "scissor", "shadows", 
      "tactics", "teacher", "terrify", "tractor", "unarmed", "unmasks", "updates", 
      "vaccine", "valleys", "walnuts", "wealthy", "whacked", "willing", "wizards", 
      "xysters", "yielded", "yoghurt", "younger", "zippers", "zombies"
      };
      char jumbled[8], c;
      time_t t;
      printf("\n\n\n");
      //choose a word at random
      srand((unsigned) time(&t));
      n=(rand() % 82);
      //random word goes into "jumbled[]"
      strcpy(jumbled, lib[n]);
      for(i=0;i<7;i++) {    //swap random letters around to jumble the word
         j=rand()%7;
         c=jumbled[i];
         jumbled[i]=jumbled[j];
         jumbled[j]=c;
      }
      jumbled[7]='\0';
      printf("\nJumbled word is: %s   Library word is: %s", jumbled, lib[n]);
      start=clock();              //start the clock!
      for(i=0;i<7;i++) {          //build the distribution of letters array
        wrdNum[jumbled[i]-'a']++; //wrdNum[0]=number of a's in the word, etc
      }                           //for the jumbled word
      for(i=0;i<82;i++) {         //build an equal array, with each word
        for(j=0;j<7;j++) {        //in the library
          libNum[lib[i][j]-'a']++;
        }
        k=0;
      while(k<26 && libNum[k]==wrdNum[k]) //while the letter distributions
        ++k;                              //match, increment k (test k first to stay in bounds)
    
        if(k<26) {                //not a match
        for(n=0;n<26;n++)       //reset the libNum[] array to all zeros
            libNum[n]=0;
        }else                     //we have a match
          break;
      }
      stop=clock();
      printf("\n\nElapsed seconds: %f", (stop-start)/18.2);  //same as CLK_TCK, etc
      printf("\nJumbled word matches %s", lib[i]);
      printf("\n\n\t\t\t     press enter when ready");
    
      (void) getchar(); 
      return 0;
    }

  2. #2
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    It would be even faster to simply take a copy of the input table, sort the letters in each word, and sort the table itself by the sorted word, reordering the original table of words to match. Then you sort the letters of the word and perform a simple binary search over the sorted table to find a match. It would then take O(log n) time to perform a match rather than O(n), where n is the number of words in the table.
    It can also be done in constant time with respect to the number of table entries, if you build and use a trie data structure.

    I'm curious as to how someone wrote something so much slower than what you have here.
    Also, you're naughtily using a constant 82 twice in the program. Should be sizeof(lib)/sizeof(lib[0]) and lib should be const char * too.
    I still can't figure out why you cast the result of getchar().
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  3. #3
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by iMalc View Post
    It would be even faster to simply take a copy of the input table, sort the letters in each word, and sort the table itself by the sorted word, reordering the original table of words to match. Then you sort the letters of the word and perform a simple binary search over the sorted table to find a match. It would then take O(log n) time to perform a match rather than O(n), where n is the number of words in the table.
    Thanks for the reply, but I have to disagree that it would be faster if... etc.

    The reason is that this uses no sorting, just a simple assignment into the libNum or wrdNum array. It's using a "counting sort" style tally instead of anything else, and nothing is faster than counting sort, if you can use it. Your big O figure doesn't take both the sorting and the selection via binary search into account.

    I can't say what the right big O expression is, but in stepping through the loop a few times, I found it makes very few int comparisons (libNum[i] to wrdNum[i]) before it breaks out if it's the wrong word. Yes, it could be optimized more, but it's not worth doing given its speed, imo. Speeding up programs that run in 0.000 seconds is against my religion, I'm sure.

    I just modified the program to loop through (stupidly, but you'll see what I'm getting at) selecting a new random word, jumbling it, then re-assigning each word in the lib[] array anew, a million times, and of course finding the un-jumbled matching word each time.

    It matched a million jumbled words, with the correct un-jumbled word, in 7.4 seconds.

    No printing was done for this big test, of course. It would be even faster if libNum[] kept the assigned values (in an expanded array) for each word, instead of re-computing them a million times. Even so, it's quick!

    I'm sure a huge part of that is because the lib[] array of words is small, allowing a good number of words into level 1 or 2 cache memory. A bigger word list would slow that down a lot, and then a binary search might be a time saver.

    The slow program used this kind of logic:
    Code:
    for(a=lib[0];a<7;a++) {
      for(b=lib[1];b<7;b++) {
        for(c=lib[2];c<7;c++) {
          for(d=lib[3];d<7;d++) {
             ... continuing for all seven letters
             .. searching each letter in each library word for a matching letter
             .. if all the letters matched in a word, then the word
             .. was the one it was searching for.
          }
        }
      }
    }
    //repeating this for all three words that had to be found.
    It was coded up in PHP, which I know nothing about, but it's close to C in some respects. No idea why he chose that kind of logic.

    The 82 was just my frustration at decoding the crazy logic in PHP and running out of patience.

    Const has no real value in TC, that I have ever seen. It doesn't make for faster programs, or compilations. It's like having a sticker on your steering wheel, saying "don't hit anything with the car". In a few complex programs, I use it, since it's a nice safeguard. My general feeling in TC with small programs is, "you either know what you're doing, or you don't".

    In TC, getchar() brings up a nagging warning from the compiler. (void) getchar() removes the warning. The getchar() is needed to hold the terminal window open so you can read it before it closes.

  4. #4
    Registered User
    Join Date
    May 2010
    Location
    Naypyidaw
    Posts
    1,314
    In TC, getchar() brings up a nagging warning from the compiler. (void) getchar() removes the warning
    I would like to know what's the warning? Just curious...
    Is it "return value not used"? Then how about printf(), strcpy()?

  5. #5
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    "Warning: Code has no effect in function main"

    Which is just annoying because it moves the focus over to the compiler message window, from where you can't run the program, or do anything else except move the focus back to the main code or watch window. (these are DOS type windows, not Windows type).

    All the other functions, like printf(), etc., are fine.

  6. #6
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by Adak View Post
    Thanks for the reply, but I have to disagree, that it would be faster if... etc.

    The reason is that this uses no sorting, just a simple assignment into the libNum or wrdNum array. It's using a "counting sort" style tally instead of anything else, and nothing is faster than counting sort, if you can use it. Your big O figure doesn't take both the sorting and the selection via binary search into account.

    I can't say what the right big O expression is, but in stepping through the loop a few times, I found it makes very few int comparisons (libNum[i] to wrdNum[i]) before it breaks out if it's the wrong word. Yes, it could be optimized more, but it's not worth doing given its speed, imo. Speeding up programs that run in 0.000 seconds is against my religion, I'm sure.
    Let me qualify my statement properly...
    The binary search approach is asymptotically faster, O(log n), with respect to the number of words to match against. Since the word length is fixed, the sorting of it is only a constant factor (e.g. a heapsort of 6 items takes a fixed amount of time), and thus has no bearing on the Big-Oh notation of its running time, with respect to the number of words to match against.
    FYI, your code is O(n) with respect to the number of words to match against.
    Put another way - No matter how fast your inner loop is, your outer loop would be the killer.

    In practical terms, the time taken to sort the jumbled word plus performing the binary search would, for a sufficient table size, be less than the time taken to try matching against every item in a table using your code. It may be that there would have to be 5000 words in the table before my suggestion wins, of course, and with only 82 items it is certainly likely that your code can run faster.

    I imagine that after ingesting the above, you're not disputing that using a trie would similarly be asymptotically faster, given a sufficient number of words to match against.

    So have you considered installing VS2008 Express alongside TC? I definitely wouldn't go as far as suggesting the behemoth that is VS2010, but VS2008 Express really is very good. Personally I find plenty of benefit in having at least two different compilers installed at a time anyway.
    Last edited by iMalc; 01-08-2011 at 10:13 PM.

  7. #7
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    I think an interesting data structure to use for such a problem, especially one with a big dictionary, might be a trie, or prefix tree. It's basically what is used for things like the T9 texting "word guessing". It should, if implemented correctly, provide very efficient solutions to this problem and could give you all possible anagrams of your jumbled word. It would work great for words of different length and is very efficient storage-wise, since it takes advantage of overlapping sections of words.

    As I see it, it would require reading through the dictionary, sorting each word, and inserting it into the trie. Any node that is a terminus for a word could contain a linked list of all anagrams that lead to that node. Just sort the jumbled word and do a lookup letter-by-letter.

    EDIT: We seem to be on the same page today algorithmically, iMalc! You beat me to the trie.
    Last edited by anduril462; 01-08-2011 at 10:10 PM.

  8. #8
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Tries would kick ass if you wanted to have more words, and words of different lengths, etc. Same with binary search, but I can beat a binary search with an indexed search.

    I can't beat tries with an indexed search, though. I believe the trie is very close (if not ideal) for word lookup.

    With just a bit of a touch up, I can cut the million word time of this program, down to half that much, which is blazing, imo.

    I have Visual C/C++ 2006, but I only use it for a few big programs. It's an oaf, but it has a profiler, and allows bigger programs than TC, which is great.

    Pelles C sounded quite good, but that will go on the 64 bit Windows7/Linux rig. I tried VC Express 2008, and it seemed like a toy with the battery half run down. Kind of like lukewarm coffee - didn't hit the spot. I removed it.
    Last edited by Adak; 01-08-2011 at 11:21 PM.

  9. #9
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Hmm, I didn't think of storing all permutations of the words within the trie. This would cut out the sorting step, but would increase the memory usage by somewhere on the order of 500 times (not quite 6! = 720 since some letters sometimes appear twice), and by a lot more if the words were even longer. It seems likely to me that memory access would become the bottleneck in that case.

    I won't be considering Pelles C myself because they've stated that there will never be a Pelles C++, and frankly I find C pathetic compared to C++.
    I really can't comprehend your viewpoint on VS2008 Express, but I suppose if you only use it for C then maybe it's not as good.
    Last edited by iMalc; 01-09-2011 at 12:13 AM.
