Thread: Sorting strings to speed up search?

  1. #1
    Registered User
    Join Date
    Aug 2001
    Posts
    41

    Question Sorting strings to speed up search?

    Hi everybody!

    I have a collection of strings (char *) and want to search and remove in the collection. The number of strings will be near 300 or more, so I thought I would sort the strings to speed up search and removal. But what will be the best way to do this?
    I'm trying to do it with a AVL tree (Binary tree with the one subtree not being (much) deeper than the other) but it is slow in creation with a lot of data (100000 items or more ). I also read about a RedBlack Tree, but I don't yet know how to do that one.
    Are there any other ways to do this in a fast way?

    Joren

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    You could consider using qsort to sort the array of pointers to strings, then using bsearch to search the array.

    Use the same compare function for both calls. It's pretty efficient if you do more searching than sorting.

    Both functions are in stdlib.h

    An example using ints as the data type
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    #define A_SIZE  50
    
    int compare ( const void *a, const void *b ) {
        const int *pa = a;
        const int *pb = b;
        if ( *pa < *pb ) return -1;
        if ( *pa > *pb ) return +1;
        return 0;
    }
    void print_arr ( int *arr ) {
        int i;
        for ( i = 0 ; i < A_SIZE ; i++ ) {
            printf( "%2d ", arr[i] );
            if ( (i+1) % 10 == 0 ) printf( "\n" );
        }
        printf( "\n" );
    }
    
    int main ( ) {
        int test[A_SIZE];
        int i;
        int key, *result;
    
        // get some test data
        srand( time(NULL) );
        for ( i = 0 ; i < A_SIZE ; i++ ) {
            test[i] = rand() % 100;
        }
        print_arr( test );
    
        // sort the array using some compare function
        qsort( test, A_SIZE, sizeof(test[0]), compare );
        print_arr( test );
    
        // search the array,
        //   using a key of the same type,
        //   and the same compare function
        key = rand() % 100;
        result = bsearch( &key, test, A_SIZE, sizeof(test[0]), compare );
        if ( result ) {
            printf( "Found %d\n", *result );
        } else {
            printf( "Not found %d\n", key );
        }
        return 0;
    }
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    Hint: here's a compare function too:

    Code:
    int compare(const void *i, const void *j) 
      { 
         
        return( strcmp(*(char **)i, *(char **)j) ); 
    
      }

    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  4. #4
    Registered User
    Join Date
    Aug 2001
    Posts
    41
    I tried using qsort and bsearch next to my own search and it was really fast. But the problem is that those functions only work with fixed length arrays. And the thing is that I will not know the number of strings I will use. I will try to explain so you could help me figure out a solution:

    I'm trying to find out which files are added at some time. I have a list of the files that were already there before a file was added and I can get a list of the files that are there after it was added. But I will not know how many files are in both list until I have them all. I use a function (WIN32 API) that gives me all the filenames and NULL at the end. So I have to add the filenames one at a time and can't use a fixed length array. After I have the lists, I check which files are in the new list and not in the old. So I have to search a lot and add far less than that. So bsearch and qsort would be great if I could find a way to use it with my lists.

    As I see it I have 2 options:
    1. First find out how much filenames I have to store by going through the process of getting the filenames once. Then creating a array (with new []) and going through the process again, but now storing the filenames in the array. That way I have an array with an "fixed" length and I can use qsort and bsearch.
    2. I use a linked list or another sort of array without a fixed length and store the filenames in there. After all the filenames are stored I get the number of files out of the linked list and create a fixed length array. And again I can use qsort an bsearch.

    One other question, maybe a stupid one, but why use char ** instead of char * in the compare function (and also in the bsearch and qsort)? Can't you just pass a char * to it, that's also a pointer...

  5. #5
    of Zen Hall zen's Avatar
    Join Date
    Aug 2001
    Posts
    1,007
    You could use STL vector or list for the container and then STL sort() and binary_search() - in algorithm header. If you need general searching and sorting algorithms you'll be hard pushed to beat these for speed (due to the inlining of the comparison function they'll be faster than qsort).
    Last edited by zen; 09-10-2001 at 09:58 AM.

  6. #6
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > But the problem is that those functions only work with fixed length arrays.
    They also work with variable length arrays

    If you replace
    &nbsp; int test[A_SIZE];
    with
    &nbsp; int *test = malloc( sizeof(int) * A_SIZE );

    Then you're on your way to creating a variable sized array.

    Or do as zen says

    > One other question, maybe a stupid one, but why use char ** instead of char * in the compare function (and also in the bsearch and qsort)? Can't you just pass a char * to it, that's also a pointer...
    Because the compare function is passed pointers to two elements of the array being sorted/searched. Since the array already contains pointers (char*), you get a pointer to a pointer (char**) inside the compare function.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Binary search on strings
    By ckathy in forum C++ Programming
    Replies: 1
    Last Post: 10-02-2007, 05:25 PM
  2. Binary Search Trees Part III
    By Prelude in forum A Brief History of Cprogramming.com
    Replies: 16
    Last Post: 10-02-2004, 03:00 PM
  3. Big help in Astar search code...
    By alvifarooq in forum C++ Programming
    Replies: 6
    Last Post: 09-24-2004, 11:38 AM
  4. Reading strings input by the user...
    By Cmuppet in forum C Programming
    Replies: 13
    Last Post: 07-21-2004, 06:37 AM
  5. linear search for structure (record) array
    By jereland in forum C Programming
    Replies: 3
    Last Post: 04-21-2004, 07:31 AM