Thread: which sorting algorithm for GBs of data?

  1. #1
    Registered User Sargnagel's Avatar
    Join Date
    Aug 2002
    Posts
    166

    Question which sorting algorithm for GBs of data?

    I have a really big problem. My new algorithm searches for patterns in ~26,000 protein sequences. If a certain pattern has been found, the sequence fragment is stored as a string in a sorted linked list. If the current fragment already exists in the list, its counter is incremented by 1.
    The problem is that the amount of data produced is ~25 GB.
    I guess I have to find a less space-consuming data representation, but what if I have to do it the way described above? My PC does not have 25 GB of RAM. Should I consider writing the unsorted list to a file and sorting it afterwards? But which sorting algorithm can handle GBs of data on a PC with just 256 MB of RAM without using virtual memory?

    Thank you for your help.

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > If a certain pattern has been found, the sequence fragment is stored as a string in a sorted linked list
    How many fragments are you talking about storing?
    How long is each string?

    If the strings are long, then calculate and store a hash of that string. Not only is that much shorter (typically just 32 bits), it is also an awful lot easier to compare.

    I would also consider a binary tree for storing your sorted data. A sorted insert into a binary tree is so much quicker than into a linked list (O(log2(n)) vs. O(n/2)).
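    A minimal sketch of the kind of counting tree I mean (the names are mine, error checks omitted), assuming the fragments arrive as plain C strings:

    #include <stdlib.h>
    #include <string.h>

    struct node {
        char *frag;               /* the sequence fragment */
        unsigned long count;      /* how many times it has been seen */
        struct node *left, *right;
    };

    /* Insert frag, or bump its counter if it is already in the tree. */
    struct node *insert(struct node *root, const char *frag)
    {
        if (root == NULL) {
            struct node *n = malloc(sizeof *n);
            n->frag = malloc(strlen(frag) + 1);
            strcpy(n->frag, frag);
            n->count = 1;
            n->left = n->right = NULL;
            return n;
        } else {
            int cmp = strcmp(frag, root->frag);
            if (cmp == 0)
                root->count++;
            else if (cmp < 0)
                root->left = insert(root->left, frag);
            else
                root->right = insert(root->right, frag);
            return root;
        }
    }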

    You seem to be suggesting that the memory fills up at some point, where you then write the list to a file and start again - is this so?

    > But which sorting algorithm can handle GBs of data on a PC with just 256 MB of RAM without using virtual memory?
    If file1, file2, etc. are already sorted, then it's just an n-way merge, which only requires you to keep the current record from each file in memory.
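    In outline, the merge loop looks like this (just a sketch, assuming one record per line and a small fixed number of input files; the names are mine):

    #include <stdio.h>
    #include <string.h>

    #define NFILES 4        /* however many sorted files you produced */
    #define MAXREC 4096     /* longest record, plus newline and NUL */

    /* Merge NFILES already-sorted text files into out, holding only
       one record per file in memory at any time. */
    void merge(FILE *in[NFILES], FILE *out)
    {
        char rec[NFILES][MAXREC];
        int live[NFILES], i;

        for (i = 0; i < NFILES; i++)
            live[i] = fgets(rec[i], MAXREC, in[i]) != NULL;

        for (;;) {
            int min = -1;
            for (i = 0; i < NFILES; i++)
                if (live[i] && (min < 0 || strcmp(rec[i], rec[min]) < 0))
                    min = i;
            if (min < 0)
                break;              /* every input is exhausted */
            fputs(rec[min], out);   /* emit the smallest current record */
            live[min] = fgets(rec[min], MAXREC, in[min]) != NULL;
        }
    }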
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User Sargnagel's Avatar
    Join Date
    Aug 2002
    Posts
    166
    >How many fragments are you talking about storing?
    Well, that depends on the sequence of each protein. It can be anywhere from a few to several hundred, or even a few thousand, per protein.

    >How long is each string?
    Length varies from 5 characters to I-don't-know-how-long. The longest fragment I've seen so far was about 3000 characters.

    >If the strings are long, then calculate and store a hash of that string.
    That sounds good! But I've never done that before. How do you do something like that?

    >I would also consider a binary tree for storing your sorted data.
    That is indeed something I should do. But then there is the problem of maintaining a balanced tree.

    >You seem to be suggesting that the memory fills up at some point,
    >where you then write the list to a file and start again - is this so?
    Yes, memory fills up quite fast. If I had a way of detecting how much free physical memory is left, I would do it that way.

    > If file1, file2, etc. are already sorted, then it's just an n-way merge,
    > which only requires you to keep the current record from each file in memory.
    Hmmm ... but what about the memory requirements for merge sort? Is it just the size of the two files that I want to merge?

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > That sounds good! But I've never done that before. How do you do something like that?
    Well, a really simple one is something like
    char msg[] = "hello";
    unsigned hash = 0, i;
    for ( i = 0 ; i < strlen(msg) ; i++ ) hash += msg[i];

    But that is way too sensitive to transpositions in the message: "heoll" has the same hash as "hello", which isn't good.
    Better algorithms would be things like CRC32 and MD5.

    Basically, any function which reduces a string to a single integer (trivial example - strlen() ) could be used as a hash function.
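    For illustration, one well-known step up from summing the characters is Dan Bernstein's djb2 hash - still tiny, but the multiplication makes the result depend on character order, so "hello" and "heoll" hash differently:

    unsigned long djb2(const char *s)
    {
        unsigned long hash = 5381;
        int c;
        while ((c = (unsigned char)*s++) != 0)
            hash = hash * 33 + c;   /* order-sensitive, unlike a plain sum */
        return hash;
    }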

    The downside with a hash function is that there is always a small probability that two different messages will have the same hash value.

    > But then there is the problem of maintaining a balanced tree
    I don't think you need to do that. If things are being inserted randomly, the tree will be on average fairly well balanced to start with. The search may not be exactly O(log2(n)), but it will be pretty close most of the time.
    Certainly, before implementing balancing, I would measure the amount of imbalance.
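    Measuring it can be as simple as comparing the tree's actual height against the ideal of about log2(n). A sketch (the struct here is just a stand-in for whatever node type you use):

    struct node { struct node *left, *right; };   /* payload omitted */

    /* Height: the longest root-to-leaf path. A tree of n nodes is
       reasonably balanced if this stays close to log2(n) + 1. */
    int height(const struct node *t)
    {
        int lh, rh;
        if (t == NULL)
            return 0;
        lh = height(t->left);
        rh = height(t->right);
        return 1 + (lh > rh ? lh : rh);
    }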

    > If I had a way of detecting how much free physical memory is left, I would do it that way.
    You have your own malloc wrapper which records the total amount of memory allocated. It (or whoever calls it) can then decide when to output the current data and start again.
    Or dig around in the documentation for your compiler. Some have such a counter/function available already.
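    A minimal version of such a wrapper (a sketch; the names are mine, and the counter ignores the allocator's own overhead):

    #include <stdlib.h>

    static size_t total_allocated = 0;

    /* malloc wrapper: counts the bytes handed out, so the caller can
       decide when to flush the current data to disk and start again. */
    void *xmalloc(size_t size)
    {
        void *p = malloc(size);
        if (p != NULL)
            total_allocated += size;
        return p;
    }

    size_t bytes_allocated(void)
    {
        return total_allocated;
    }

    When bytes_allocated() crosses some threshold (say 200 MB on your 256 MB machine), write the current data out in sorted order and start afresh.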

    > but what about the memory requirements for merge sort?
    Your files are already sorted (internally), so there's no need to store the whole of each file, only the current record from each file.
    Here's one example
    http://cboard.cprogramming.com/showt...ighlight=merge
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Registered User Sargnagel's Avatar
    Join Date
    Aug 2002
    Posts
    166
    >But that is way too sensitive to transpositions in the message:
    >"heoll" has the same hash as "hello", which isn't good.
    Ah, I see. So, as you suggested, I'll use something like CRC32 or MD5.

    >Certainly, before implementing balancing, I would measure the amount of imbalance.
    And again something I haven't done before. I will "google" for it. Thanks for the hint.

    >You have your own malloc wrapper which records
    >the total amount of memory allocated
    Easier than I thought. I can even implement it in a heterogeneous PVM cluster - great!

    > Your files are already sorted (internally), so there's no need to store
    > the whole of each file, only the current record from each file.
    AHA! I am starting to love merge sort!

    And again, thank you very much, Salem!
