find the largest 1000 values

• 11-07-2007
George2
find the largest 1000 values
Hello everyone,

I have a large data set of positive integers from an experiment (several million values), stored in a file on my local machine. I want to select the largest 1000 values and then do analysis based on those 1000 values.

Could anyone suggest some efficient solutions? I think sorting several million values in memory is not feasible.

George
• 11-07-2007
Salem
sort -k1rn file.txt | head -1000 > top1000.txt
• 11-07-2007
oded_re
After taking a course in databases I can tell you that, as far as I know (and I'm pretty sure about it), there is no better general way to find the 1000 largest values than sorting. The only alternative I can think of is scanning the whole list 1000 times, taking the largest remaining number each round and removing it before the next round. Or you can try your luck with bucket sort, which is O(n) (good efficiency in certain scenarios) - check it out at http://en.wikipedia.org/wiki/Bucket_sort
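Since the values are positive integers, the bucket idea can be reduced to a counting pass: histogram the values, then walk down from the largest bucket until 1000 values are accounted for. A minimal sketch, assuming the maximum possible value (the maxValue parameter, not given in the question) is small enough to allocate one counter per value:

```cpp
#include <cstddef>
#include <vector>

// Counting approach: histogram the values, then walk down from the
// largest bucket until k values are accounted for. Returns the cutoff:
// every value >= the returned threshold is among the top k (including ties).
int topKThreshold(const std::vector<int>& data, int maxValue, std::size_t k) {
    std::vector<std::size_t> count(maxValue + 1, 0);
    for (int v : data) ++count[v];        // one counter per possible value
    std::size_t seen = 0;
    for (int v = maxValue; v >= 0; --v) { // scan buckets from the top down
        seen += count[v];
        if (seen >= k) return v;
    }
    return 0;                             // fewer than k values in total
}
```

This is O(n + maxValue) time and O(maxValue) memory, so it only pays off when the value range is modest.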
• 11-07-2007
matsp
Put together an array of 1000 elements. Read in the first 1000 numbers [we don't care what they are] and sort them [should be quick since it's not that many numbers]. Remember the smallest element. Read in another number and compare it to the smallest; if it's bigger, insert it into the array where it belongs. Continue this until there are no more numbers in the list.
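A rough sketch of that idea (the function and variable names are mine, and I've used a std::vector rather than a plain array for brevity):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Maintain a sorted buffer of the k largest values seen so far.
// The buffer stays sorted in ascending order, so the current minimum
// is always buf.front().
std::vector<int> topKSorted(const std::vector<int>& data, std::size_t k) {
    std::vector<int> buf(data.begin(), data.begin() + std::min(k, data.size()));
    std::sort(buf.begin(), buf.end());           // sort the first k values once
    for (std::size_t i = buf.size(); i < data.size(); ++i) {
        int v = data[i];
        if (v > buf.front()) {                   // bigger than the current minimum?
            buf.erase(buf.begin());              // drop the smallest
            buf.insert(std::upper_bound(buf.begin(), buf.end(), v), v);
        }
    }
    return buf;                                  // top k, ascending order
}
```

In practice you would read the numbers from the file one at a time instead of holding them all in a vector, which is the whole point of only keeping 1000 in memory.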

This is probably no more efficient than loading the entire array into memory and doing a quicksort of it.

I see no reason why you can't load millions of numbers into memory - 4 bytes per integer is all you need, and a machine with 256MB of RAM can easily hold 32M numbers with plenty of memory left for other stuff.

--
Mats
• 11-08-2007
iMalc
Quote:

Originally Posted by matsp
Put together an array of 1000 elements. Read in the first 1000 numbers [we don't care what they are] and sort them [should be quick since it's not that many numbers]. Remember the smallest element. Read in another number and compare it to the smallest; if it's bigger, insert it into the array where it belongs. Continue this until there are no more numbers in the list.

This is probably no more efficient than loading the entire array into memory and doing a quicksort of it.

I see no reason why you can't load millions of numbers into memory - 4 bytes per integer is all you need, and a machine with 256MB of RAM can easily hold 32M numbers with plenty of memory left for other stuff.

--
Mats

This is basically what I was thinking. However, don't simply load them into an array, load them into a heap.

This would definitely be way faster than trying to sort the whole lot using quicksort. Why...?
1. Quicksort is O(n log n) at best, and the log n factor is much larger here: log(1,000,000) is roughly twice log(1,000), which is all you pay if you only keep 1000 items in memory at once. n is of course 1M in both cases, since you always have to process every item.
2. Quicksort can degrade to O(n²), whereas each heap operation is O(log k) even in the worst case, where k is the heap size.
3. Keeping the items in sorted order in an array costs O(k) per insertion just to shift elements over; a heap does the same job in O(log k).

Heaps are basically designed exactly for what you want. Not to mention, it will still work perfectly fine if you had a billion numbers instead, as you still only have to keep 1000 in memory at any one time.
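For reference, here's a minimal sketch of the min-heap approach using std::priority_queue (the function name and input vector are illustrative, not from the thread):

```cpp
#include <functional>
#include <queue>
#include <vector>

// Keep the k largest values seen so far in a min-heap of size k.
// The heap's top is the smallest of the current top-k, so any new
// value larger than it displaces it in O(log k) time.
std::vector<int> topK(const std::vector<int>& data, std::size_t k) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (int v : data) {
        if (heap.size() < k) {
            heap.push(v);                // heap not full yet: just take it
        } else if (v > heap.top()) {
            heap.pop();                  // evict the smallest of the top k
            heap.push(v);
        }
    }
    std::vector<int> result;
    while (!heap.empty()) {              // extract in ascending order
        result.push_back(heap.top());
        heap.pop();
    }
    return result;
}
```

Called as topK(values, 1000), this returns the 1000 largest values in ascending order, and as noted above it works just as well streaming from a file, since only 1000 values are ever held in memory.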
• 11-08-2007
matsp
Since I wasn't familiar with "heap" data structures, I had a look at Wikipedia:
http://en.wikipedia.org/wiki/Heap_(data_structure)

Seems like a decent solution [like I expect when I see that iMalc has commented].

--
Mats