Thread: BIG Sorting Problem!

  1. #16
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by comocomocomo View Post
    Then mmap() won't help, unless you want to use a raw binary file in your computer as a big cache of the database.

    mmap() would be the way to go if the data were in a huge binary file instead of a database.

    Since you have a database, the ideal would be to read/write large blocks of data and work on them. If there is some transparent way to do this buffering, use it.

    Since you are using quicksort, you might want to have two buffers (or sliding windows): one for the lower extreme and one for the upper extreme.

    One stupid question: Can't your database sort the data for you?
    The "database" is what was inferred - but it's output FROM a database - which isn't the same thing. The actual raw data format, has not been shared with me yet. (and I doubt it will be, since I'm a very unofficial helper type).

    As I understand it, the project won't be going forward without showing the ability to sort the data and obtain useful info from it. Whether this is a commercial enterprise or simply an educational project to see if it's feasible is up in the air.

  2. #17
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by comocomocomo View Post
    First, you should use the standard qsort(). Your code searches for a pair of elements and swaps them, taking three assignments per swap. The most widespread implementation of quicksort manages to reduce the number of assignments. The trick is similar to the one used in insertion sort: instead of doing swaps, you copy one item into an auxiliary variable, creating a 'hole' in the array; then you copy another value of the array into this hole, creating a new hole, and so on; in the end you copy the initial value into the final location of the hole. On average, you save up to 33% of the assignments.
    It's been a few years since I tested qsort(), but the Quicksort I posted regularly walked all over it in integer sorting. It has never crashed, even sorting 10 million integers, with input in random, ascending, or descending order.

    Code:
           #include <stdlib.h>

           void qsort(void *base, size_t nmemb, size_t size,
                      int (*compar)(const void *, const void *));
    In addition, the standard qsort() might be more careful with the stack by using tail recursion. Your code uses too much stack in the 'evil' worst case of quicksort.
    I've read about the worst-case data for it, but I've never gone looking for it. There's a simple way to avoid all that if you suspect you might have some of it -- mix up your data, just a little, before you feed it to Quicksort.
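    Something along these lines is all I mean -- a minimal sketch for a plain int array (the helper name is made up here, and it assumes srand() was seeded once elsewhere):
    Code:
    #include <stdlib.h>

    /* Fisher-Yates shuffle: a light "mix up the data" pass before Quicksort. */
    static void shuffle_ints(int *a, size_t n)
    {
        size_t i;

        for (i = n; i > 1; i--)
        {
            size_t j = (size_t)rand() % i;   /* 0 <= j < i (slight modulo bias, fine here) */
            int tmp  = a[i - 1];

            a[i - 1] = a[j];
            a[j]     = tmp;
        }
    }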

    Second, the structures are quite large, so don't sort them. Sort an array of indices (or pointers) instead. This will speed up the assignments a lot (maybe up to x50).
    That's on my to-do list. Very familiar with that.

    Third, quicksort is not a stable sort. You can't use the trick of sorting by one column, then sorting by another column... You have to sort only once, but with a slightly more complicated comparison. Don't worry: it will be way faster, since most of the time the comparison will be decided after comparing the first element. Again, an additional x50 speed up.
    Oh, I've got that idea, for sure, at this time!


    The function provided by Salem looks fine to me. I would just add one level of indirection in order to sort an array of pointers to structs, and not the structs themselves.
    I'm looking forward to waving good-bye to "Pile Driver". < LOL >

    Ooooow. Then I have bad news for you :-(

    Suppose two individuals with nearly the same value in every field except the first one. They will be far from each other.
    Yes. I explained that to the contact already. That's why the integers have to stay in their respective columns: they have different "weights", with the leftmost column carrying more weight than the right-hand ones.

    My idea was to have shallow (quick) searches that match up members whose interests line up in the left-hand columns, and then "deep" searches, which would match a member with all the others who have N% matching interests. The deep results would probably be built in the background or off-line, since they would take more time to create.
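    Just to sketch the "deep" match idea (the name and the per-column equality test are only for illustration):
    Code:
    #define NUM_FIELDS 52

    /* Percentage of the interest columns on which two members agree.
       A "deep" search would keep every member whose score is >= N. */
    static int match_percent(const int *a, const int *b)
    {
        int i, hits = 0;

        for (i = 0; i < NUM_FIELDS; i++)
            if (a[i] == b[i])
                hits++;

        return (hits * 100) / NUM_FIELDS;
    }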
    You need some classification algorithm: Statistical classification - Wikipedia, the free encyclopedia

    The choice of what to use is completely dependent on the nature of your data and what you want to achieve.

    This is a BIG problem. Bigger than sorting ;-)
    Oh yes! I'm just helping out a bit, on the sorting, for now.

    TBH, there is a lot of existing software that could be used here instead of creating your own. It's odd that people want to "roll their own", but they do sometimes.
    Last edited by Adak; 01-03-2013 at 05:50 AM.

  3. #18
    Registered User
    Join Date
    Nov 2012
    Posts
    1,393
    Quote Originally Posted by Adak View Post
    A program I'm working on has records (structs), each containing an array of 52 integers.
    ...
    The BIG part is the number of records going into an array of these - at least 100 million have to be processed, in several large "chunks".
    The first 50 records of Data look like this, when sorted properly:
    Code:
     0:  0   1 474 753 478 945 456 255   1 198 847 945 551  24 396 986 242  85 154 
     1:  0   3 299 465 382 737 460 803 301 771 179  50 801 262 483 330 815 905 349 
     2:  0   3 345  44 796  23 873  36 883 551 419 744 744 659 336 470 891 699  16 
     ...
    Records sorted: 1,000,000  Columns: 52  -- ints are limited to 999 here, but will range up to at least 300,000 in the actual data.

    ...

    I can't use an algorithm that requires a large amount of auxiliary memory. "Large" is not defined, however.

    ...

    Any algorithmic tips? I'm not looking for code.
    You could consider loading your data into a table using a database engine such as SQLite. That way you can stop concentrating on algorithms and move the focus of your work to your data and the operations you want to perform on it; then set up proper indexes on the table to get the performance characteristics your use case needs.

    The internal algorithms built into SQLite are probably sufficient, and the API also provides a way to control its usage of dynamic memory, but you'll have to test for yourself to see whether they meet your requirements.
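    A minimal sketch of that approach, assuming the SQLite C API is available (the file, table, and column names are made up, only a few of the 52 columns are shown, and you'd build with something like cc prog.c -lsqlite3):
    Code:
    #include <stdio.h>
    #include <sqlite3.h>

    int main(void)
    {
        sqlite3 *db;
        char *err = NULL;

        if (sqlite3_open("members.db", &db) != SQLITE_OK)
        {
            fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
            return 1;
        }

        /* One row per record; an index over the leading columns lets
           "SELECT * FROM members ORDER BY c0, c1, c2;" walk the index
           instead of sorting in memory. */
        sqlite3_exec(db,
            "CREATE TABLE IF NOT EXISTS members (name TEXT, c0 INT, c1 INT, c2 INT);"
            "CREATE INDEX IF NOT EXISTS idx_cols ON members (c0, c1, c2);",
            NULL, NULL, &err);

        if (err)
        {
            fprintf(stderr, "%s\n", err);
            sqlite3_free(err);
        }

        sqlite3_close(db);
        return 0;
    }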

  4. #19
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Adak
    I've read about the worst-case data for it, but I've never gone looking for it. There's a simple way to avoid all that if you suspect you might have some of it -- mix up your data, just a little, before you feed it to Quicksort.
    Even if you do that, a worst case scenario will still exist, and could theoretically be exploited if the mixing process is deterministic. That said, you could use introsort instead...
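    For the curious, here is a rough sketch of the introsort idea on plain ints -- quicksort with a depth limit that bails out to heapsort. It's only an illustration, not any particular library's implementation, and it skips the usual insertion-sort pass for tiny ranges; the whole point is the depth counter, which caps the worst case at O(n log n) no matter how evil the input is:
    Code:
    #include <stddef.h>

    static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

    /* Sift the element at 'root' down through the max-heap a[0..n-1]. */
    static void sift_down(int *a, size_t root, size_t n)
    {
        for (;;)
        {
            size_t child = 2 * root + 1;

            if (child >= n) return;
            if (child + 1 < n && a[child] < a[child + 1]) child++;
            if (a[root] >= a[child]) return;
            swap_int(&a[root], &a[child]);
            root = child;
        }
    }

    static void heapsort_ints(int *a, size_t n)
    {
        size_t i;

        for (i = n / 2; i-- > 0; ) sift_down(a, i, n);  /* build the max-heap */
        for (i = n; i-- > 1; )                          /* repeatedly pop the max */
        {
            swap_int(&a[0], &a[i]);
            sift_down(a, 0, i);
        }
    }

    static void intro_rec(int *a, size_t n, int depth)
    {
        while (n > 1)
        {
            size_t i, store, mid;

            if (depth-- == 0) { heapsort_ints(a, n); return; }

            /* median-of-three pivot moved to the end, then a Lomuto partition */
            mid = n / 2;
            if (a[mid] < a[0])     swap_int(&a[mid], &a[0]);
            if (a[n - 1] < a[0])   swap_int(&a[n - 1], &a[0]);
            if (a[n - 1] < a[mid]) swap_int(&a[n - 1], &a[mid]);
            swap_int(&a[mid], &a[n - 1]);

            for (store = 0, i = 0; i < n - 1; i++)
                if (a[i] < a[n - 1]) swap_int(&a[i], &a[store++]);
            swap_int(&a[store], &a[n - 1]);

            /* recurse on the smaller part, loop on the larger (bounds the stack) */
            if (store < n - store - 1)
            {
                intro_rec(a, store, depth);
                a += store + 1;
                n -= store + 1;
            }
            else
            {
                intro_rec(a + store + 1, n - store - 1, depth);
                n = store;
            }
        }
    }

    void introsort_ints(int *a, size_t n)
    {
        int depth = 0;
        size_t m;

        for (m = n; m > 1; m >>= 1) depth += 2;   /* roughly 2*log2(n) */
        intro_rec(a, n, depth);
    }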
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  5. #20
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Exactly, C99Tutorial!

    The whole concept of "rolling your own", on a project this large - well, you better have people and tools that REALLY know their rolling! I haven't seen that, so far.

  6. #21
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by laserlight View Post
    Even if you do that, a worst case scenario will still exist, and could theoretically be exploited if the mixing process is deterministic. That said, you could use introsort instead...
    Some other sort like Introsort might be fine for a business, but until I see a problem (which I never have, and I've run lots of tests with this version of Quicksort), I won't be budging. I used to have a version with a partition function that ran before Quicksort. If the file was already sorted, it ran like Insertion sort -- very fast -- and returned, so Quicksort was never called.

    But on an array with a million sorted integers, it only saved about 3/4 of a second or so, and I finally got rid of it. For businesses that rely on their website, I'd say something like Introsort or Heapsort would be essential. That academic who wrote the bedevilment program for Quicksort did an outstanding job of it, from the description!

  7. #22
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    The whole concept of "rolling your own", on a project this large - well, you better have people and tools that REALLY know their rolling! I haven't seen that, so far.
    O_o

    Well, I'm leaning towards "don't know their rolling".

    The "database" is what was inferred - but it's output FROM a database - which isn't the same thing.
    WTF!?

    How did you find association with a DBA who is that bad at their job?

    If the engine in question uses textbook database algorithms (literally, the algorithms from the transactional-database textbooks are far more than equal to this task), it could wrangle this data for you far faster than you could possibly consume it, if you are really doing collision analysis on such a large set.

    Taking data out of an existing database and inserting it into a dumb array to sort it? That's special.

    [Edit]
    Yeah, I know; at no point did Adak say that it was a database whose data structures use a native ordering. Yes, I know that other means are available. However, I'm still annoyed by the logic at play here, because if they need the data sorted over such complex fields, they shouldn't be using a database like that.
    [/Edit]

    Whether this is a commercial enterprise or simply an educational project to see if it's feasible is up in the air.
    This isn't a question of "feasible". This is a solved problem.

    The only question is, "Why are they trying to solve it so poorly?".

    Heck, if the engine in question is "OracleDB" it is literally already done; you only need to throw the right query at it.

    Soma
    Last edited by phantomotap; 01-03-2013 at 06:36 AM.

  8. #23
    Registered User
    Join Date
    Dec 2012
    Posts
    45
    This is what I described:

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    
    #define DEFAULT_NUM_RECORDS 100
    #define SAMPLE_SIZE 20
    #define SAMPLE_NUM_FIELDS 10
    
    
    #define NAME_LENGTH   30
    #define NUM_FIELDS    52
    
    
    typedef struct
    {
        char name[NAME_LENGTH];
        int data[NUM_FIELDS];
    }
    person;
    
    
    void make_random_people (person * p, int n);
    
    
    int compare_persons (const void * p,
                         const void * q);
    
    
    int main (int argc, char * argv[])
    {
        int i, j, num;
        person * people;
        person ** ptrs;
    
    
        srand ((unsigned)time(NULL));
        
        if (argc<2 || sscanf(argv[1],"%d",&num)!=1 || num<1)
            num = DEFAULT_NUM_RECORDS;
    
    
        puts ("Allocating memory...");
    
    
        people = (person*) malloc (num*sizeof(person));
        ptrs = (person**) malloc (num*sizeof(person*));
        
        if (!people || !ptrs)
        {
            fprintf (stderr, "Not enough memory.\n");
            free (people);
            free (ptrs);
            return -1;
        }
    
    
        puts ("Making up some random data...");
    
    
        make_random_people (people, num);
    
    
        puts ("Preparing array of pointers...");
    
    
        for (i=0; i<num; i++)
            ptrs[i] = people + i;
    
    
        printf ("Sorting %d elements...\n", num);
    
    
        qsort (ptrs, num, sizeof(person*), compare_persons);
    
    
        printf ("Data of first %d persons:\n", SAMPLE_SIZE);
    
    
        for (i=0; i<SAMPLE_SIZE; i++)
        {
            printf ("%-10s ", ptrs[i]->name);
            
            for (j=0; j<SAMPLE_NUM_FIELDS; j++)
                printf ("%4d", ptrs[i]->data[j]);
                
            putchar ('\n');
        }
        
        puts ("Freeing memory...");
    
    
        free (people);
        free (ptrs);
        return 0;
    }
    
    
    int compare_persons (const void * p, const void * q)
    {
        person * a = *(person**)p;
        person * b = *(person**)q;
        int i;
        
        for (i=0; i<NUM_FIELDS; i++)
        {
            if (a->data[i] > b->data[i]) return 1;
            if (a->data[i] < b->data[i]) return -1;
        }
        
        return 0;
    }
    
    
    void make_random_people (person * p, int n)
    {
        int i, j;
        
        for (i=0; i<n; i++)
        {
            sprintf (p[i].name, "Jim_%d", i);
            
            for (j=0; j<NUM_FIELDS; j++)
                p[i].data[j] = (int)(rand()*(999.0/RAND_MAX));
        }
    }

  9. #24
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    @Adak ^ I can see that this is a discussion among ^big^ people, but I would like to say that maybe changing to heapsort, rather than quicksort, could help a bit.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  10. #25
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    discussion of ^big^ people
    ^_^

    You actually made me groan which I found extremely funny.

    Soma

  11. #26
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    I hope I also helped a bit, Adak, and didn't only make you smile.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  12. #27
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Quote Originally Posted by std10093 View Post
    I hope I also helped a bit, Adak, and didn't only make you smile.
    Heapsort got a little spring added to its step with the i7's, in my tests. The only reason I wouldn't choose it is that it's more difficult for me to work with. Partitioning, and moving in from the right and left with the indices - I get that right away.

    Heap building? Not nearly as easy to work with, for me.

    Thanks for the time and effort, Comocomocomo. I knew what you meant, but you put a nice polish on it.
    Last edited by Adak; 01-03-2013 at 07:14 AM.

  13. #28
    SAMARAS std10093's Avatar
    Join Date
    Jan 2011
    Location
    Nice, France
    Posts
    2,694
    Quote Originally Posted by Adak View Post
    Heapsort got a little spring added
    ...
    Heap building? Not nearly as easy to work with, for me.
    As with almost everything in life and programming, this is a trade-off. You are far more experienced than I am, so you're the one to decide.
    Code - functions and small libraries I use


    It’s 2014 and I still use printf() for debugging.


    "Programs must be written for people to read, and only incidentally for machines to execute. " —Harold Abelson

  14. #29
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    Actually, I wasn't aware of the ternary-based style of heapsort. If only it were stable...

  15. #30
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    If you are going to sort a separate list of indices rather than moving the data itself around, then you can sort first by the data and then split ties by the index itself.
    That turns any non-stable sort into a stable sort.
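    A quick sketch of that trick with made-up names (the record layout just mirrors the example posted earlier); the global pointer is only there to keep the qsort() comparator simple:
    Code:
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_FIELDS 52

    typedef struct { int data[NUM_FIELDS]; } record;

    static const record *records;   /* set before calling qsort() */

    /* Compare by the data first; on a complete tie, fall back to the index
       itself, which is what makes the overall ordering stable. */
    static int compare_indices(const void *p, const void *q)
    {
        size_t i = *(const size_t *)p;
        size_t j = *(const size_t *)q;
        int k;

        for (k = 0; k < NUM_FIELDS; k++)
        {
            if (records[i].data[k] < records[j].data[k]) return -1;
            if (records[i].data[k] > records[j].data[k]) return  1;
        }
        return (i < j) ? -1 : (i > j) ? 1 : 0;
    }

    int main(void)
    {
        static const record recs[4] = { {{2, 1}}, {{1, 9}}, {{1, 9}}, {{0, 5}} };
        size_t idx[4] = { 0, 1, 2, 3 };
        size_t i;

        records = recs;
        qsort(idx, 4, sizeof idx[0], compare_indices);

        for (i = 0; i < 4; i++)
            printf("%zu ", idx[i]);   /* prints 3 1 2 0: the tied records 1 and 2 keep their order */
        putchar('\n');
        return 0;
    }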

    This whole thread looks like the continuation of some other thread and it's a little unclear what the actual problem is. What are we doing here exactly?
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

