Thread: Binary search on a HUGE file.

  1. #1
    Registered User
    Join Date
    Dec 2002
    Posts
    4

    Binary search on a HUGE file.

    I understand the concept of a binary search, have been writing them for a while learned in high school, but they have always been searching with in arrays or linked lists in memory.
    At the moment I’m trying to make a dictionary esk program, simplistically it will look in a text file to see if a word that has been typed in exists. The only problem is I don’t know how to seek in a text file the correct way.
    The file is a plain text file, on word per line, about 585,000 lines, no that’s not a typo :) there are over 585 thousand lines in the file... is there any way to do this, do I need to change the dictionary file in any way? Any help would be greatly appreciated, thanks.

  2. #2
    Evil Member
    Join Date
    Jan 2002
    Posts
    638
    Have you considered reading the lot of it into a map<string, string>? That would certainly simplify lookups.

  3. #3
    Registered User
    Join Date
    Dec 2002
    Posts
    4
    I'm sorry, i don't know what a map<string, string> is.

  4. #4
    Registered User
    Join Date
    Oct 2002
    Posts
    160
    For some reason I havn't hear about "binary search" before. What is it? Who knows maybe I've done something that is quite the same but the phrase is just not familar to me.
    Well english isn't my first language, (it's instead a useless language called danish which only 5 milion people speak!!) so if you think my grammar SUCKS (it does by the way) than you're more then welcome to correct me.
    Hell I might even learn something

  5. #5
    Registered User
    Join Date
    Dec 2002
    Posts
    4
    This is just a quick explanation, there are may tutorials and help about how to implement this algorithm on the net.

    Its a form of searching an already sorted list, also you need to know how many elements are in the list. In the first pass you compare what you are looking for, to the element in the middle of the list, if what you are looking for is before the mid point, you throw out the end half of the list. Then with the half shortened list, you find the new mid point compare that new element to the none your looking for, you keep comparing and throwing out either the first half or second halves of the list until you run out of things to check.

  6. #6
    Registered User
    Join Date
    Oct 2002
    Posts
    160
    Yep I have heard about it before actually it was my grand dad who once told me about it *big laugh* It's not as weird as it my sound since he was a "coder" (if it was called that at the time). In those days they used those cards with holes in 'em.

  7. #7
    Code Monkey Davros's Avatar
    Join Date
    Jun 2002
    Posts
    812
    Appoloinus

    I have spent quite some time doing exactly what you're doing. (see www.alpha-india5.com)

    Either you read the file into memory & store in a searchable structure (this will be memory intensive) or you search on the file itself. If you want to search on the file, I suggest a [non-text] binary file. If you want to jump from word to word in a random access manner, you will either have to:

    work with fixed block sizes(i.e. big enough to accomodate any lexical item)

    or

    you will need a look up table which gives the position in file of every lexical item. You will need to maintain this when adding new items.

    Also, I developed a search algorithm which is faster than a binary search when dealing with files using read-ahead optimisation. I don't expect anyone to believe me, but you can find details here:

    http://www.generation5.org/ubb/Forum1/HTML/000134.html

  8. #8
    Registered User
    Join Date
    Dec 2002
    Posts
    4
    Davros- Havent checked you links yet, but i plan to... ive been mulling over the problem and cam to seekg and tellg to locate the end, middle, and begining based on a fixed word length maintained by adding white space to the smaller words... so that was the answer if any one else is following this thread. binary search with seekg and tellg.

  9. #9
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    Use variable-length structures and binary IO. That will simplify things and also give you the option to use either the "in-memory" or "from file" lookup aproach:

    something like:

    Code:
    class Data { 
     public:
    int size;
    int data_length;
    char * data;
    int extra_length;
    char * extra;
    };
    
    int index, size, total;
    int sin = sizeof(int);
    
    index = size = total = 0;
    
      while(fread(&size, sin, 1, fp)){
       ++total;
       fseek(fp, size-sin, SEEK_CUR);
     }
    
    rewind(fp);
    
    Data data[total];
    
      while(fseek(fp, sin, SEEK_CURR)==0){
      fread(&size, sin, 1, fp);
      data[index].data = new char[size];
      fread(data[index].data, 1, size, fp);
      fread(&size, sin, 1, fp);
      data[index].extra = new char[size];
      fread(data[index].extra, 1, size, fp);
      ++index; 
    }
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  10. #10
    Registered User
    Join Date
    Oct 2002
    Posts
    160
    Anyone who whants to explain me what functions fseek, fread and rewind is all about - how they work and stuf.
    Well english isn't my first language, (it's instead a useless language called danish which only 5 milion people speak!!) so if you think my grammar SUCKS (it does by the way) than you're more then welcome to correct me.
    Hell I might even learn something

  11. #11
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    fseek(FILE*, offset, relative_start_point);

    The three start points are SEEK_SET (beginning of file), SEEK_CURR (from current position), SEEK_END (end of file). You use this to move the file position indicator (FPI) to a specific point to begin a read/write operation. If the offset is negative, the FPI moves backward; positive, it moves forward. The FPI is precisely what is returned by ftell().

    fread/fwrite(void * data, size_of_data, length_of data, FILE*);

    These work identically. After the read/write the FPI moves forward by the amount of bytes indicated. For instance:

    char c[100];
    int i[100];

    fseek(fp, 0, SEEK_END);
    fwrite(c, sizeof(char), 100, fp);
    fwrite(i, sizeof(int), 100, fp);

    After the two arrays are written to the file, the FPI will have moved 500 bytes forward ( 1 * 100 + 4 * 100 ).

    rewind(fp)
    ...is identical to :
    fseek(fp, 0, SEEK_SET);
    ..and is probably written that way internally.



    The C++ streams are easier to use, but the C style functions are reliable and effective too.

    Hope that helps.
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  12. #12
    Registered User
    Join Date
    Oct 2002
    Posts
    160
    Isn't the FPI then the same as an iterator?
    Other then that: Yeah you deffently ansvered ít. Actually I know those functions but only for the C++ streams.
    Well english isn't my first language, (it's instead a useless language called danish which only 5 milion people speak!!) so if you think my grammar SUCKS (it does by the way) than you're more then welcome to correct me.
    Hell I might even learn something

  13. #13
    Registered User zahid's Avatar
    Join Date
    Aug 2001
    Posts
    531
    From the point of view of C :
    As your project requires, you may go for index file, where you can store the address of each string in dictionary.file. This will give better manageability.

    Then the steps are as same as explain earlier. Ohh.. I guess it is also known as "divide and conquer" method.
    [ Never code before desk work ]
    -------------------------------------:-->
    A man who fears Nothing is the man who Loves Nothing
    If you Love Nothing, what joy is there in your life.
    =------------------------------------------------------= - I may be wrong.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Can we have vector of vector?
    By ketu1 in forum C++ Programming
    Replies: 24
    Last Post: 01-03-2008, 05:02 AM
  2. Searching Binary Files for a Pattern
    By CaptainMorgan in forum C Programming
    Replies: 16
    Last Post: 06-17-2007, 06:04 PM
  3. help with text input
    By Alphawaves in forum C Programming
    Replies: 8
    Last Post: 04-08-2007, 04:54 PM
  4. C++ std routines
    By siavoshkc in forum C++ Programming
    Replies: 33
    Last Post: 07-28-2006, 12:13 AM
  5. Print file in binary mode
    By vikernes in forum C++ Programming
    Replies: 6
    Last Post: 02-25-2006, 12:43 AM