Thread: find the largest 1000 values

  1. #1
    Registered User
    Join Date
    May 2006
    Posts
    1,579

    find the largest 1000 values

    Hello everyone,


    I have a large data set of integers (about several Million), all of them are positive integers from experiment. They are stored on a file on local machine. I want to select the largest 1000 values then make analysis based on the 1000 values.

    Could anyone suggest some efficient solutions? I think sorting the several M data in memory is not feasible.


    thanks in advance,
    George

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    sort -k1rn file.txt | head -1000 > top1000.txt
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Nov 2007
    Posts
    4
    after taking a course in data bases i can tell you that as far as i know,
    and im pretty sure about it, there is no better way to find the 1000 largest without sorting, the only other alternative i can think of is running on all your list 1000 times and on each round take the largest number and delete it from the list for the next round, or you can try your luck with bucket sort, that
    O(n) if you know what that is (good efficient on certain scenarios), check it out in http://en.wikipedia.org/wiki/Bucket_sort

  4. #4
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Put together an array of 1000 elements. Read in the first 1000 elements [we don't care what tehy are], sort them [should be quick since it's not that many numbers]. Remember the smallest element. Read in another number, compare to the smallest. If it's bigger, then insert it in the array where it belongs. Continue this until there are no more numbers in the list.

    This is probably no more efficient than loading the entire array into memory and doing a quicksort of it.

    I see no reason why you can't load millions of numbers into memory - 4 bytes per integer is all you need, and a machine with 256MB of RAM can easily hold 32M numbers with plenty of memory left for other stuff.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  5. #5
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by matsp View Post
    Put together an array of 1000 elements. Read in the first 1000 elements [we don't care what tehy are], sort them [should be quick since it's not that many numbers]. Remember the smallest element. Read in another number, compare to the smallest. If it's bigger, then insert it in the array where it belongs. Continue this until there are no more numbers in the list.

    This is probably no more efficient than loading the entire array into memory and doing a quicksort of it.

    I see no reason why you can't load millions of numbers into memory - 4 bytes per integer is all you need, and a machine with 256MB of RAM can easily hold 32M numbers with plenty of memory left for other stuff.

    --
    Mats
    This is basically what I was thinking. However, don't simply load them into an array, load them into a heap.

    This would definitely be way faster than trying to sort the whole lot using quicksort. Why...?
    1. Quicksort is O(nlogn) at best. The logn factor is going to be much larger. log(1000000) is quite a bit bigger than the log(1000) you would have if you only keep 1000 items in memory at once. n is of course 1M in both cases though since you always have to process all items.
    2. Quicksort can become O(n*n), whereas heap operations are always exactly O(logn).
    3. Keeping all items in order in an array takes O(n) time to move items over to insert each item that was greater.

    Heaps are basically designed exactly for what you want. Not to mention, it will still work perfectly fine if you had a billion numbers instead, as you still only have to keep 1000 in memory at any one time.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  6. #6
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Since I wasn't familiar with "heap" data structures, I had a look at Wikipedia:
    http://en.wikipedia.org/wiki/Heap_(data_structure)

    Seems like a decent solution [like I expect when I see that iMalc has commented].

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. To Find Largest Consecutive Sum Of Numbers In 1D Array
    By chottachatri in forum C Programming
    Replies: 22
    Last Post: 07-10-2011, 01:43 PM
  2. putting values into an array
    By zdream8 in forum C Programming
    Replies: 15
    Last Post: 05-21-2008, 11:18 PM
  3. Replies: 1
    Last Post: 12-30-2007, 10:08 AM
  4. largest and smallest number
    By wise_ron in forum C Programming
    Replies: 11
    Last Post: 10-05-2006, 03:25 PM
  5. Need help with project
    By chrisa777 in forum C++ Programming
    Replies: 5
    Last Post: 06-19-2006, 05:01 PM