Thread: Binary Search - Strange Output

  1. #1
    Dr Dipshi++ mike_g's Avatar
    Join Date
    Oct 2006
    Location
    On me hyperplane
    Posts
    1,218

    Binary Search - Strange Output

    I'm putting together a few little progs I made over the last couple of days to make an anagram finder, which I will hopefully be able to work into some kind of engine for word games.

    Anyway the user enters a word and the prog cycles through each permutation of it. For each cycle it does a binary search on a dictionary to check for a match. Strangely it seems to work fine sometimes, but other times its completely off.

    Heres the head/main section:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #define MAXLEN 20
    #define MAXIT 250000
    
    struct dir 
    {
        char word[MAXLEN];
    };  
    typedef struct dir dir;
    
    static FILE *OpenFile(char *file, char *mode);
    void LoadDictionary(dir *ptr);
    dir*  BinarySearch(dir *ptr, char *match, int size);
    
    int main()
    {
        struct dir *p;
        p = (struct dir *) malloc(MAXIT * MAXLEN);
        LoadDictionary(p);
        
        char word[10];
        int len, x;
        int permutations = 1;
        char temp;
        int matches=0;
         
        printf("Enter a word up to 8 characters in length: ");
        fgets(word, 10, stdin); len=strlen(word)-1; word[len]='\0';
        
        for(x=2; x<=len; x++)
            permutations*=x;
        
        for(x=0; x<permutations; x++)
        {
            temp = word[x%len];
            word[x%len] = word[(x+1)%len];
            word[(x+1)%len] = temp;
            if(BinarySearch(p, word, MAXIT) != NULL)
            {
                printf("%s\n", word);
                matches++;
            }
        }                       
        printf("PERMUTATIONS: %i\tMATCHES: %i", permutations, matches);
        while((temp=getchar()) != '\n'); 
    }
    And heres the search function:
    Code:
    dir* BinarySearch(dir *ptr, char *match, int size)
    {    
        dir *lo=ptr;
        dir *hi=ptr+MAXIT;    
        ptr+=(size >> 1); //Jump to middle of dictionary
        int shift=(size >> 1);
        int quit=0;
        
        while(quit < 50)
        {
            if(shift > 1) //Halve the shift size if greater than 1
                shift=shift >> 1; 
            else          
                quit++; 
                
            if(ptr < lo || ptr >= hi)
                return NULL;    
            else if(strcmp(match, ptr->word) < 0)
                ptr-=shift;
            else if(strcmp(match, ptr->word) > 0)
                ptr+=shift;
            else
                return ptr;
            //Debug output
            printf("SHIFT: %i\tA: %s\tB: %s\n", shift, match, ptr->word);
        }
        return NULL;  
    }
    For example if I enter the word 'dog' it should come up with 2 matches.It finds the word dog after 17 iterations, but when it searches for 'god' it goes way off to words begining with H. Eventually its keeps switching between 'Handelian' and 'hander' until the loop is killed as if 'god' lies somewhere between the two. I dont get why this is happening

  2. #2
    Registered User
    Join Date
    Oct 2001
    Posts
    2,934
    >between 'Handelian' and 'hander'
    If it's actually storing Handelian versus handelian, then that's one problem, as uppercase H does not equate to lowercase h.

    Another problem is you shouldn't be casting the return value of malloc().

  3. #3
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Look at the algorithm for binary search. You're not following it. If you miss a match by being too high in the dictionary, the high should drop to the current index of the dictionary, minus one.

    Same for too low a guess, in the dictionary. Then a new midpoint is found.

    In your search, that isn't happening. You're finding "dog", is just luck.
    In pseudo-code:

    Code:
    int low, high, mid;
    low = 0;
    high = n - 1;
    
    while (low <= high)  {
       mid = low + high/2;
       if ("god" = dictionary[mid])
           high = mid - 1;
       else if("god" > dictionary[mid])
          low = mid + 1;
       else
          return mid;
    }
    
    /* no match */
    
    return -1;
    I don't see the low and the high being handled well, *inside* your while loop.

  4. #4
    Dr Dipshi++ mike_g's Avatar
    Join Date
    Oct 2006
    Location
    On me hyperplane
    Posts
    1,218
    If it's actually storing Handelian versus handelian, then that's one problem, as uppercase H does not equate to lowercase h.
    Man I'm stupid the thought crossed my mind, but I presumed it wouldn't create output like that so didn't even bother testing it. It did fix it, with a single line of code. Thanks.

    Another dumb question: what do you mean by casting the return value of malloc()?

  5. #5
    Dr Dipshi++ mike_g's Avatar
    Join Date
    Oct 2006
    Location
    On me hyperplane
    Posts
    1,218
    Quote Originally Posted by Adak View Post
    Look at the algorithm for binary search. You're not following it. If you miss a match by being too high in the dictionary, the high should drop to the current index of the dictionary, minus one.

    Same for too low a guess, in the dictionary. Then a new midpoint is found.

    In your search, that isn't happening. You're finding "dog", is just luck.
    In pseudo-code:

    Code:
    int low, high, mid;
    low = 0;
    high = n - 1;
    
    while (low <= high)  {
       mid = low + high/2;
       if ("god" = dictionary[mid])
           high = mid - 1;
       else if("god" > dictionary[mid])
          low = mid + 1;
       else
          return mid;
    }
    
    /* no match */
    
    return -1;
    I don't see the low and the high being handled well, *inside* your while loop.
    Hmm.. my low and high don't change. The high and low is just there to test if the word goes out of scope of the dictionary. EG:
    Code:
     "4567something"
    Would have a lower strcmp() than all words in the dictionary which would move the pointer to other areas of memory.

    The shift does change. it divides by 2 each iteration as long as it is above 1. The shift then gets added or subtracted to/from the index. So yeah I'm pretty sure its still a binary sort, of some sort.

  6. #6
    Registered User
    Join Date
    Oct 2001
    Posts
    2,934
    >Another dumb question: what do you mean by casting the return value of malloc()?
    This line:
    > p = (struct dir *) malloc(MAXIT * MAXLEN);
    If you're actually compiling as C, you don't need the cast, just:
    Code:
        p = malloc(MAXIT * MAXLEN);
    If you get an error, then you're compiling in C++ mode, as C++ requires a cast.

    Adak points out another problem though. I would convert what you have to what he posted, that is, use array indices, and recalculate low and high each time. Otherwise you going to miss part of the dictionary. Let's say the size is 23, so using your method you would divide the array by 11, 5, 2, 1, and if the word was located at entry 22, it would miss. In this simple case it would probably work, but for a large dictionary like you have, it's a problem. The idea is to divide the dictionary into two partitions each time through the loop.

  7. #7
    Dr Dipshi++ mike_g's Avatar
    Join Date
    Oct 2006
    Location
    On me hyperplane
    Posts
    1,218
    Ok cool, I droped the cast from the malloc.

    I get what you and Adak are saying, about when positions don't fall into the binary search path. What I did to get around that was put a variable 'quit' in. When the shift size is reduced to one, my function cycles through the next 50 entries (one at a time) in the direction the search was going. It might make the search less efficient, but it should still work.

    That said I'm still having problems, but I think that this is different.

    If I enter 'love' the program returns 4 matches, which is correct. But they are:
    Code:
    velo, love, velo, love
    Alternatively if I enter 'vole' I get:
    Code:
    levo, vole, levo, vole
    I dont know, I think I'm going to have to get back to this tomorrow. I'm getting tired. Thanks for the help guys

  8. #8
    Registered User
    Join Date
    Jun 2007
    Location
    Michigan
    Posts
    12
    Not sure why you don't just use a regular search algorithm? Like a Boyer-Moore? See my post in this thread: Binary Searching

    You should be able to use the code with a little tweaking.


    M@

  9. #9
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Matthew, he's having trouble with a binary search, and you think he's ready for Boyer-Moore??

    What I can't figure out is why he doesn't just use a standard binary search algo for strings? What he's using now, even if it were tweaked enough to work, is inefficient.

    Mike, if you want to see some binary search code with strings in an array, let me know.

  10. #10
    Registered User
    Join Date
    Jun 2007
    Location
    Michigan
    Posts
    12
    Well, I didn't expect him to *write* a Boyer-Moore search, but there are plenty of existing search functions, so I really don't understand why he *is* writing his own. Even if you are an experienced C programmer, you will usually use something that exists, unless you have a really special case.

    The code I posted implements a Boyer-Moore algo taken from the Wikipedia. I give an example of searching memory or a file. The key and data are bytes, so it can be binary or text based data, it really does not matter. Anyway, it's there if he wants to use it.

    M@

  11. #11
    Dr Dipshi++ mike_g's Avatar
    Join Date
    Oct 2006
    Location
    On me hyperplane
    Posts
    1,218
    I write my own code because its fun! I also get to understand better how things work, as opposed to if I were to just rip and paste someone elses code.

    At the moment I couldent care less about speed as long as it works, which it does. Although a time to live of 50 was a bit excessive.

    Anyway I found where my problem was. My permutations arent right. I'm cycling the characters incorrectly. That part of the prog was one of the challenges on this site, I obviously didn't test my prog thoroughly enough. I could just grab the answer from there, but since it was a beginners challenge it would be a bit sad if I could not work this out for myself.

    Anyway, as and when I need more speed for from a binary search I'll ask you Adak

  12. #12
    Registered User
    Join Date
    Jun 2007
    Location
    Michigan
    Posts
    12
    Quote Originally Posted by mike_g View Post
    I write my own code because its fun! I also get to understand better how things work, as opposed to if I were to just rip and paste someone elses code.
    Well, you can't argue with that. If you're coding to learn or for fun then by all means try to work it out yourself. My days of coding for fun are few and far between. These days I want code that is fast and well written, and I know I can't solve every problem myself so I don't mind using (and modifying) code I didn't write or totally understand.

    Glad you got it working.

    M@

  13. #13
    Dr Dipshi++ mike_g's Avatar
    Join Date
    Oct 2006
    Location
    On me hyperplane
    Posts
    1,218
    Hey, thanks for the Boyer-Moore code. I had a brief look through it. Couldn't make much sense of it tho. I'll keep hold of it. If thats the about fastest algorithm for searching text, then I may find a need for it sometime, when I understand C better.

    Cheers

  14. #14
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    This has got to be about one of the worst places one could use Boyer-Moore. Dictionary strings are extremely short on average, and you're not looking for a sub-string match, you're looking for an exact or relative match.

    If you want to speed things up, I suggest either implementing a much more efficient binary search: see here: http://homepages.ihug.co.nz/~aurora7...y&#37;20search
    Or, use a radix search. I.e. make 26 buckets, one for each first letter of the string, and go straight to that bucket. Even that should speed things up. However you can then make another 26 buckets within each of those buckets etc.

    Are you sure that code is correctly generating all permutations, and with no duplicates? I suggest taking a look at the SC++L code for std::next_permutation - see http://www.sgi.com/tech/stl/next_permutation.html

    This would be laughably easy to implement in C++!
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Performance issue!
    By maven in forum C Programming
    Replies: 42
    Last Post: 03-23-2009, 11:57 AM
  2. finding the largest number in a binary search
    By Emeighty in forum C++ Programming
    Replies: 20
    Last Post: 07-31-2008, 03:19 AM
  3. strange virtual function output
    By George2 in forum C++ Programming
    Replies: 4
    Last Post: 02-08-2008, 08:08 AM
  4. Ornery binary search algorithm
    By Unregistered in forum C Programming
    Replies: 3
    Last Post: 11-17-2001, 03:32 PM