# Thread: which sorting algorithm for GBs of data?

1. ## which sorting algorithm for GBs of data?

I have a really big problem. My new algorithm searches for patterns in ~26,000 protein sequences. When a certain pattern is found, the sequence fragment is stored as a string in a sorted linked list. If the current fragment already exists in the list, its counter is incremented by 1.
The problem is that the amount of data produced is ~25 GB.
I guess I have to find a less space-consuming data representation, but what if I have to do it the way described above? My PC does not have 25 GB of RAM. Should I consider writing the unsorted list to a file and sorting it afterwards? But which sorting algorithm can handle GBs of data on a PC with just 256 MB of RAM without using virtual memory?

2. > If a certain pattern has been found, the sequence fragment is stored as a string in a sorted linked list
How many fragments are you talking about storing?
How long is each string?

If the strings are long, then calculate and store a hash of that string. Not only is that much shorter (typically, just 32 bits), it is an awful lot easier to compare.

I would also consider a binary tree for storing your sorted data. A sorted insert in a binary tree is much quicker than in a linked list (O(log2(n)) vs. O(n/2) on average).

You seem to be suggesting that the memory fills up at some point, where you then write the list to a file and start again - is this so?

> But which sorting algorithm can handle GBs of data on a PC with just 256 MB of RAM without using virtual memory?
If file1, file2 etc are already sorted, then it's just an n-way merge sort which just requires you to store the current record from each file.

3. >How many fragments are you talking about storing?
Well, that depends on the sequence of each protein. It can be anywhere from just a few to several hundred, or even a few thousand, per protein.

>How long is each string?
Length varies from 5 to I-don't-know-how-long. The longest fragment I've seen so far was about 3000 characters.

>If the strings are long, then calculate and store a hash of that string.
That sounds good! But I've never done that before. How do you do something like that?

>I would also consider a binary tree for storing your sorted data.
That is indeed something I should do. But then there is the problem of maintaining a balanced tree.

>You seem to be suggesting that the memory fills up at some point,
>where you then write the list to a file and start again - is this so?
Yes, memory fills up quite fast. If I had a way of detecting how much free physical memory is left, I would do it that way.

> If file1, file2 etc are already sorted, then it's just an n-way merge sort
>which just requires you to store the current record from each file.
Hmmm ... but what about the memory requirements for merge sort? Is it just the size of the two files that I want to merge?

4. > That sounds good! But I've never done that before. How do you do something like that?
Well, a really simple one is something like
unsigned int hash = 0;
char msg[] = "hello";
for ( size_t i = 0 ; i < strlen(msg) ; i++ ) hash += msg[i];

But that is insensitive to transpositions in the message: "heoll" has the same hash as "hello", which isn't good.
Better algorithms would be things like CRC32 or MD5.

Basically, any function which reduces a string to a single integer (trivial example - strlen() ) could be used as a hash function.

The downside with a hash function is that there is always a small probability that two different messages will have the same hash value.

> But then there is the problem of maintaining a balanced tree
I don't think you need to do that. If things are being inserted randomly, the tree will be on average fairly well balanced to start with. The search may not be exactly O(log2(n)), but it will be pretty close most of the time.
Certainly before implementing balancing, I would measure the amount of un-balance.

> If I had a way of detecting how much free physical memory is left, I would do it that way.
You have your own malloc wrapper which records the total amount of memory allocated. Then it (or whoever) can make a decision to output the current data and start again.
Or dig around in the documentation for your compiler. Some have such a counter/function available already.

> but what about the memory requirements for merge sort?
Your files are already sorted (internally), so there's no need to store the whole of each file, only the current record from each file.
Here's one example
http://cboard.cprogramming.com/showt...ighlight=merge

>But that is insensitive to transpositions in the message:
>"heoll" has the same hash as "hello", which isn't good.
Ah, I see. But as you suggested, I should use CRC32 or MD5 instead.

>Certainly before implementing balancing, I would measure the amount of un-balance.
And again something I haven't done before. I will "google" for it. Thanks for the hint.

>You have your own malloc wrapper which records
>the total amount of memory allocated
Easier than I thought. I can even implement it in a heterogeneous PVM Cluster - great!

> Your files are already sorted (internally), so there's no need to store
> the whole of each file, only the current record from each file.
AHA! I am starting to love merge sort!

And again, thank you very much, Salem!