Thread: BIG Sorting Problem!

  1. #16
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by comocomocomo View Post
    Then mmap() won't help, unless you want to use a raw binary file in your computer as a big cache of the database.

    mmap() would be the way to go if the data were in a huge binary file instead of a database.

    Since you have a database, the ideal would be to read/write large blocks of data and work on them. If there is some transparent way to do this buffering, use it.

    Since you are using quicksort, you might want to have two buffers (or sliding windows): one for the lower extreme and one for the upper extreme.

    One stupid question: Can't your database sort the data for you?
    The "database" is what was inferred - but it's output FROM a database - which isn't the same thing. The actual raw data format, has not been shared with me yet. (and I doubt it will be, since I'm a very unofficial helper type).

    As I understand it, the project won't be going forward without showing the ability to sort the data and obtain useful info from it. Whether this is a commercial enterprise or simply an educational project to see if it's feasible is up in the air.

  2. #17
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by comocomocomo View Post
    First, you should use the standard qsort(). Your code searches for a pair of elements and swaps them, taking three assignments per swap. The most widespread implementation of quicksort manages to reduce the number of assignments. The trick is similar to the one used in insertion sort: instead of doing swaps, you copy one item into an auxiliary variable, creating a 'hole' in the array; then you copy another value of the array into this hole, creating a new hole, and so on; in the end you copy the initial value into the final location of the hole. On average, you save up to 33% of the assignments.
    It's been a few years since I tested qsort(), but the Quicksort I posted regularly walked all over it in integer sorting. It has never crashed, even sorting 10 million integers, with input in random, ascending, or descending order.

    Code:
           #include <stdlib.h>

           void qsort(void *base, size_t nmemb, size_t size,
                      int (*compar)(const void *, const void *));
    In addition, the standard qsort() might be more careful with the stack by using tail recursion. Your code uses too much stack in the 'evil' worst case of quicksort.
    I've read about the worst-case data for it, but I've never gone looking for it. There's a simple way to avoid all that if you suspect you might have some of it -- mix up your data, just a little, before you feed it to Quicksort.
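    Something along these lines is all I mean -- a minimal sketch for a plain int array (the helper name is made up here, and it assumes srand() was seeded once elsewhere):
    Code:
    #include <stdlib.h>

    /* Fisher-Yates shuffle: a light "mix up the data" pass before Quicksort. */
    static void shuffle_ints(int *a, size_t n)
    {
        size_t i;

        for (i = n; i > 1; i--)
        {
            size_t j = (size_t)rand() % i;   /* 0 <= j < i (slight modulo bias, fine here) */
            int tmp  = a[i - 1];

            a[i - 1] = a[j];
            a[j]     = tmp;
        }
    }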

    Second, the structures are quite large, so don't sort them. Sort an array of indices (or pointers) instead. This will speed up the assignments a lot (maybe up to x50).
    That's on my to-do list. Very familiar with that.

    Third, quicksort is not a stable sort. You can't use the trick of sorting by one column, then sorting by another column... You have to sort only once, but with a slightly more complicated comparison. Don't worry: it will be way faster, since most of the time the comparison will be decided after comparing the first element. Again, an additional x50 speed up.
    Oh, I've got that idea, for sure, at this time!


    The function provided by Salem looks fine to me. I would just add one level of indirection in order to sort an array of pointers to structs, and not the structs themselves.
    I'm looking forward to waving good-bye to "Pile Driver". < LOL >

    Ooooow. Then I have bad news for you :-(

    Suppose two individuals with nearly the same value in every field except the first one. They will be far from each other.
    Yes. I explained that to the contact already. That's why the integers have to stay in their respective columns: they have different "weights", with the leftmost column carrying more weight than the right-hand ones.

    My idea was to have shallow (quick) searches that match up members whose interests line up in the left-hand columns, and then "deep" searches, which would match a member with all the others who have N% matching interests. The deep results would probably be built in the background or off-line, since they would take more time to create.
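    Just to sketch the "deep" match idea (the name and the per-column equality test are only for illustration):
    Code:
    #define NUM_FIELDS 52

    /* Percentage of the interest columns on which two members agree.
       A "deep" search would keep every member whose score is >= N. */
    static int match_percent(const int *a, const int *b)
    {
        int i, hits = 0;

        for (i = 0; i < NUM_FIELDS; i++)
            if (a[i] == b[i])
                hits++;

        return (hits * 100) / NUM_FIELDS;
    }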
    You need some classification algorithm: Statistical classification - Wikipedia, the free encyclopedia

    The choice of what to use is completely dependent on the nature of your data and what you want to achieve.

    This is a BIG problem. Bigger than sorting ;-)
    Oh yes! I'm just helping out a bit, on the sorting, for now.

    TBH, there is a lot of existing software that could be used here instead of creating your own. It's odd that people want to "roll their own", but they do sometimes.
    Last edited by Adak; 01-03-2013 at 05:50 AM.

  3. #18
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    Quote Originally Posted by Adak View Post
    A program I'm working on has records (structs), each containing an array of 52 integers.
    ...
    The BIG part is the number of records going into an array of these - at least 100 million have to be processed, in several large "chunks".
    The first 50 records of Data look like this, when sorted properly:
    Code:
     0:  0   1 474 753 478 945 456 255   1 198 847 945 551  24 396 986 242  85 154 
     1:  0   3 299 465 382 737 460 803 301 771 179  50 801 262 483 330 815 905 349 
     2:  0   3 345  44 796  23 873  36 883 551 419 744 744 659 336 470 891 699  16 
     ...
    Records sorted: 1,000,000  Columns: 52  -- ints are limited to 999 here, but will range up to at least 300,000 in the actual data.

    ...

    I can't use an algorithm that requires a large amount of auxiliary memory. "Large" is not defined, however.

    ...

    Any algorithmic tips? I'm not looking for code.
    You could consider loading your data into a table using a database engine such as SQLite. That way you can stop concentrating on algorithms and move the focus of your work to your data and the operations you want to perform on it; then set up proper indexes on the table to get the performance characteristics your use case needs.

    The internal algorithms built into SQLite are probably sufficient, and the API also provides a way to control its usage of dynamic memory, but you'll have to test for yourself to see whether they meet your requirements.
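    A minimal sketch of that approach, assuming the SQLite C API is available (the file, table, and column names are made up, only a few of the 52 columns are shown, and you'd build with something like cc prog.c -lsqlite3):
    Code:
    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        if (sqlite3_open("members.db", &db) != SQLITE_OK)
        {
            fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return 1;
        }

        /* One row per record; an index over the leading columns lets
           "SELECT * FROM members ORDER BY c0, c1, c2;" walk the index
           instead of sorting in memory. */
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS members (name TEXT, c0 INT, c1 INT, c2 INT);"
            "CREATE INDEX IF NOT EXISTS idx_cols ON members (c0, c1, c2);",
            NULL, NULL, &err);

        if (err)
        {
            fprintf(stderr, "%s\n", err);
            sqlite3_free(err);
        }

        sqlite3_close(db);
        return 0;
    }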

  4. #19
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Adak
    I've read about the worst-case data for it, but I've never gone looking for it. There's a simple way to avoid all that if you suspect you might have some of it -- mix up your data, just a little, before you feed it to Quicksort.
    Even if you do that, a worst case scenario will still exist, and could theoretically be exploited if the mixing process is deterministic. That said, you could use introsort instead...
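    For the curious, here is a rough sketch of the introsort idea on plain ints -- quicksort with a depth limit that bails out to heapsort. It's only an illustration, not any particular library's implementation, and it skips the usual insertion-sort pass for tiny ranges; the whole point is the depth counter, which caps the worst case at O(n log n) no matter how evil the input is:
    Code:
    #include <stddef.h>

    static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    /* Sift the element at 'root' down through the max-heap a[0..n-1]. */
    static void sift_down(int *a, size_t root, size_t n)
    {
        for (;;)
        {
            size_t child = 2 * root + 1;

            if (child >= n) return;
            if (child + 1 < n && a[child] < a[child + 1]) child++;
            if (a[root] >= a[child]) return;
            swap_int(&a[root], &a[child]);
            root = child;
        }
    }

    static void heapsort_ints(int *a, size_t n)
    {
        size_t i;

        for (i = n / 2; i-- > 0; ) sift_down(a, i, n);  /* build the max-heap */
        for (i = n; i-- > 1; )                          /* repeatedly pop the max */
        {
            swap_int(&a[0], &a[i]);
            sift_down(a, 0, i);
        }
    }

    static void intro_rec(int *a, size_t n, int depth)
    {
        while (n > 1)
        {
            size_t i, store, mid;

            if (depth-- == 0) { heapsort_ints(a, n); return; }

            /* median-of-three pivot moved to the end, then a Lomuto partition */
            mid = n / 2;
            if (a[mid] < a[0])     swap_int(&a[mid], &a[0]);
            if (a[n - 1] < a[0])   swap_int(&a[n - 1], &a[0]);
            if (a[n - 1] < a[mid]) swap_int(&a[n - 1], &a[mid]);
            swap_int(&a[mid], &a[n - 1]);

            for (store = 0, i = 0; i < n - 1; i++)
                if (a[i] < a[n - 1]) swap_int(&a[i], &a[store++]);
            swap_int(&a[store], &a[n - 1]);

            /* recurse on the smaller part, loop on the larger (bounds the stack) */
            if (store < n - store - 1)
            {
                intro_rec(a, store, depth);
                a += store + 1;
                n -= store + 1;
            }
            else
            {
                intro_rec(a + store + 1, n - store - 1, depth);
                n = store;
            }
        }
    }

    void introsort_ints(int *a, size_t n)
    {
        int depth = 0;
        size_t m;

        for (m = n; m > 1; m >>= 1) depth += 2;   /* roughly 2*log2(n) */
        intro_rec(a, n, depth);
    }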
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  5. #20
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Exactly, C99Tutorial!

    The whole concept of "rolling your own", on a project this large - well, you better have people and tools that REALLY know their rolling! I haven't seen that, so far.

  6. #21
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by laserlight View Post
    Even if you do that, a worst case scenario will still exist, and could theoretically be exploited if the mixing process is deterministic. That said, you could use introsort instead...
    Some other sort like Introsort might be fine for a business, but until I see a problem (which I never have, and I've run lots of tests with this version of Quicksort), I won't be budging. I used to have a version with a partition function that ran before Quicksort. If the file was already sorted, it ran like Insertion sort -- very fast -- and returned, so Quicksort was never called.

    But on an array with a million sorted integers, it only saved about 3/4 of a second or so, and I finally got rid of it. For businesses that rely on their website, I'd say something like Introsort or Heapsort would be essential. That academic who wrote the bedevilment program for Quicksort did an outstanding job of it, from the description!

  7. #22
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    The whole concept of "rolling your own", on a project this large - well, you better have people and tools that REALLY know their rolling! I haven't seen that, so far.
    O_o

    Well, I'm leaning towards "don't know their rolling".

    The "database" is what was inferred - but it's output FROM a database - which isn't the same thing.
    WTF!?

    How did you find association with a DBA who is that bad at their job?

    If the engine in question uses textbook database algorithms (literally, the algorithms from the transactional-database textbooks are far more than equal to this task), it could wrangle this data for you far faster than you could possibly consume it, if you are really doing collision analysis on such a large set.

    Taking data out of an existing database and inserting it into a dumb array to sort it? That's special.

    [Edit]
    Yeah, I know; at no point did Adak say that it was a database whose data structures use a native ordering. Yes, I know that other means are available. However, I'm still annoyed by the logic at play here, because if they need the data sorted over such complex fields, they shouldn't be using a database like that.
    [/Edit]

    Whether this is a commercial enterprise or simply an educational project to see if it's feasible is up in the air.
    This isn't a question of "feasible". This is a solved problem.

    The only question is, "Why are they trying to solve it so poorly?".

    Heck, if the engine in question is "OracleDB" it is literally already done; you only need to throw the right query at it.

    Soma
    Last edited by phantomotap; 01-03-2013 at 06:36 AM.

  8. #23
    Registered User
    Join Date
    Dec 2012
    Posts
    45
    This is what I described:

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    
    #define DEFAULT_NUM_RECORDS 100
    #define SAMPLE_SIZE 20
    #define SAMPLE_NUM_FIELDS 10
    
    
    #define NAME_LENGTH   30
    #define NUM_FIELDS    52
    
    
    typedef struct
    {
        char name[NAME_LENGTH];
        int data[NUM_FIELDS];
    }
    person;
    
    
    void make_random_people (person * p, int n);
    
    
    int compare_persons (const void * p,
                         const void * q);
    
    
    int main (int argc, char * argv[])
    {
        int i, j, num;
        person * people;
        person ** ptrs;
    
    
        srand ((unsigned)time(NULL));
        
        if (argc<2 || sscanf(argv[1],"%d",&num)!=1 || num<1)
            num = DEFAULT_NUM_RECORDS;
    
    
        puts ("Allocating memory...");
    
    
        people = (person*) malloc (num*sizeof(person));
        ptrs = (person**) malloc (num*sizeof(person*));
        
        if (!people || !ptrs)
        {
            fprintf (stderr, "Not enough memory.\n");
            free (people);
            free (ptrs);
            return -1;
        }
    
    
        puts ("Making up some random data...");
    
    
        make_random_people (people, num);
    
    
        puts ("Preparing array of pointers...");
    
    
        for (i=0; i<num; i++)
            ptrs[i] = people + i;
    
    
        printf ("Sorting %d elements...\n", num);
    
    
        qsort (ptrs, num, sizeof(person*), compare_persons);
    
    
        printf ("Data of first %d persons:\n", SAMPLE_SIZE);
    
    
        for (i=0; i<SAMPLE_SIZE; i++)
        {
            printf ("%-10s ", ptrs[i]->name);
            
            for (j=0; j<SAMPLE_NUM_FIELDS; j++)
                printf ("%4d", ptrs[i]->data[j]);
                
            putchar ('\n');
        }
        
        puts ("Freeing memory...");
    
    
        free (people);
        free (ptrs);
        return 0;
    }
    
    
    int compare_persons (const void * p, const void * q)
    {
        person * a = *(person**)p;
        person * b = *(person**)q;
        int i;
        
        for (i=0; i<NUM_FIELDS; i++)
        {
            if (a->data[i] > b->data[i]) return 1;
            if (a->data[i] < b->data[i]) return -1;
        }
        
        return 0;
    }
    
    
    void make_random_people (person * p, int n)
    {
        int i, j;
        
        for (i=0; i<n; i++)
        {
            sprintf (p[i].name, "Jim_%d", i);
            
            for (j=0; j<NUM_FIELDS; j++)
                p[i].data[j] = (int)(rand()*(999.0/RAND_MAX));
        }
    }

  9. #24
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    @Adak ^ I can see that this is a discussion among ^big^ people, but I would like to say that maybe changing to heapsort, rather than quicksort, could help a bit.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  10. #25
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    discussion of ^big^ people
    ^_^

    You actually made me groan which I found extremely funny.

    Soma

  11. #26
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    I hope I also helped a bit, Adak, and didn't only make you smile.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  12. #27
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by std10093 View Post
    I hope I also helped a bit, Adak, and didn't only make you smile.
    Heapsort got a little spring added to its step with the i7's, in my tests. The only reason I wouldn't choose it is that it's more difficult for me to work with. Partitioning, and moving in from the right and left with the indices - I get that right away.

    Heap building? Not nearly as easy to work with, for me.

    Thanks for the time and effort, Comocomocomo. I knew what you meant, but you put a nice polish on it.
    Last edited by Adak; 01-03-2013 at 07:14 AM.

  13. #28
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    Quote Originally Posted by Adak View Post
    Heapsort got a little spring added
    ...
    Heap building? Not nearly as easy to work with, for me.
    As with almost everything in life and programming, this is a trade-off. You are far more experienced than I am, so you're the one to decide.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  14. #29
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Actually, I wasn't aware of the ternary-based style of heapsort. If only it were stable...

  15. #30
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    If you are going to sort a separate list of indices rather than moving the data itself around, then you can sort first by the data and then split ties by the index itself.
    That turns any non-stable sort into a stable sort.
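    A quick sketch of that trick with made-up names (the record layout just mirrors the example posted earlier); the global pointer is only there to keep the qsort() comparator simple:
    Code:
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_FIELDS 52

    typedef struct { int data[NUM_FIELDS]; } record;

    static const record *records;   /* set before calling qsort() */

    /* Compare by the data first; on a complete tie, fall back to the index
       itself, which is what makes the overall ordering stable. */
    static int compare_indices(const void *p, const void *q)
    {
        size_t i = *(const size_t *)p;
        size_t j = *(const size_t *)q;
        int k;

        for (k = 0; k < NUM_FIELDS; k++)
        {
            if (records[i].data[k] < records[j].data[k]) return -1;
            if (records[i].data[k] > records[j].data[k]) return  1;
        }
        return (i < j) ? -1 : (i > j) ? 1 : 0;
    }

    int main(void)
    {
        static const record recs[4] = { {{2, 1}}, {{1, 9}}, {{1, 9}}, {{0, 5}} };
        size_t idx[4] = { 0, 1, 2, 3 };
        size_t i;

        records = recs;
        qsort(idx, 4, sizeof idx[0], compare_indices);

        for (i = 0; i < 4; i++)
            printf("%zu ", idx[i]);   /* prints 3 1 2 0: the tied records 1 and 2 keep their order */
        putchar('\n');
        return 0;
    }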

    This whole thread looks like the continuation of some other thread and it's a little unclear what the actual problem is. What are we doing here exactly?
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

