Thread: double ended priority queue

  1. #1
    Registered User
    Join Date
    Jun 2013
    Posts
    2

    double ended priority queue

    Hi,

    Can anyone tell me how a double-ended priority queue can be used to implement an external quicksort?

  2. #2
    young grasshopper jwroblewski44's Avatar
    Join Date
    May 2012
    Location
    Where the sidewalk ends
    Posts
    294
    I'm not sure a priority queue would aid in performing an external quicksort.

    What makes you ask this question?

    Here is a sample of code I found while searching "external quick sort". It was from a link found on the NIST website.
    http://users.dcc.uchile.cl/~rbaeza/h...s/4/446.sort.c [I'm not too sure how well it works, or if it even does.]
    Last edited by jwroblewski44; 06-21-2013 at 10:26 AM.
    "Simplicity is the ultimate sophistication." - Leonardo da Vinci

  3. #3
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by jwroblewski44 View Post
    I'm not sure a priority queue would aid in performing an external quicksort.
    I was thinking exactly the same thing.

    I guess you could use quicksort on largeish chunks of data first, and then keep non-overlapping chunks in a priority queue of some sort. To output sorted data, you could do a variable-N-way merge, merging chunks when they overlap.

    The bad thing about that is the variable-N-way merge itself, since for very random (or suitably arranged) data, N may be the total number of chunks. And for big data, that may be too many to merge efficiently.

    Another option is to keep the quicksorted chunks in order, so that no two chunks overlap. When a new quicksorted chunk is added, it is merged with any chunks it overlaps, and the non-overlapping parts become new chunks. If the resulting chunks grow larger than a full chunk, you need to split them (to keep the data storage sane -- otherwise you may end up with arbitrary-length chunks). Existing chunks never overlap, but you might wish to merge very small neighboring chunks to keep storage usage efficient -- although excessive merging will slow things down. A configurable fill factor and/or limits should handle that. (Note: you only ever need to hold three or four chunks at the same time in this scheme.)

    The data needed to describe a chunk is rather compact: a chunk identifier, the number of elements in it, the minimum element, and the maximum element. In a priority queue the priority would have to describe the element range -- so really, I wouldn't use a priority queue. There are several data structures -- like range trees, which for one-dimensional data reduce to a balanced binary search tree -- that could support the operations above, especially finding overlapping existing chunks, more efficiently than a priority queue.
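
    For illustration, here is a rough sketch of that chunk bookkeeping, using an ordered map as the one-dimensional "range tree". The struct layout and names are invented for the example, not anything standard:

    Code:
    #include <cstddef>
    #include <cstdint>
    #include <iterator>
    #include <map>
    #include <vector>

    struct ChunkInfo {
        std::uint64_t id;     // which file / offset the chunk lives at
        std::size_t   count;  // number of elements in the chunk
        int           min;    // smallest element in the chunk
        int           max;    // largest element in the chunk
    };

    // Keyed by each chunk's minimum element; because existing chunks never
    // overlap, an ordered map behaves like the 1-D "range tree" mentioned above.
    using ChunkIndex = std::map<int, ChunkInfo>;

    // Collect every existing chunk whose [min, max] range overlaps [lo, hi].
    std::vector<ChunkInfo> overlapping(const ChunkIndex& idx, int lo, int hi)
    {
        std::vector<ChunkInfo> out;
        auto it = idx.upper_bound(lo);
        // The chunk just before may still reach past lo.
        if (it != idx.begin() && std::prev(it)->second.max >= lo)
            --it;
        for (; it != idx.end() && it->first <= hi; ++it)
            out.push_back(it->second);
        return out;
    }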

    Maybe I have a poor imagination, but I can't imagine a case where using a priority queue in externalizing quicksort would be better than the alternatives.

  4. #4
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    The priority queue is just treated as a sorted cache -- a sorted fraction of the entire data set -- maintained throughout the sorting process; elements are removed from the start of the queue (feeding the extreme left recursion) or from the end of the queue (feeding the extreme right recursion).

    With a normal "Quicksort" you partition the data into two chunks around a pivot; these chunks are then sorted, often recursively, by repeating the same process over each chunk. For a form of "Quicksort" working from a slow storage medium, partitioning around a single element would be extremely expensive. The solution, somewhat (*), to the cost of partitioning on a slow medium is to keep a "running pivot" in cache -- a faster storage medium -- which lets you quickly sort incoming data into three groups: "less than everything so far", "more than everything so far", and the "sorted cache" itself. The two unsorted groups become the new partitions for recursively applying the sort algorithm.

    You don't have to use a "Double Ended Priority Queue". The requirement to use a slow medium for storage is exorbitant in any event, and using a poor strategy to implement the fractional sort within the fast medium -- the fractional sort's cache -- would reduce the already slow performance of the process even further. A properly implemented "Double Ended Priority Queue" simply has the characteristics -- fast sorted insertion and fast removal from both ends -- that make it a great fit for how the cache is used during the fractional sort.
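
    To make the "sorted cache" idea concrete, here is a rough single-pass sketch that uses std::multiset as a stand-in double-ended priority queue; the stream-based I/O and the int record type are simplifications purely for illustration, not a reference implementation:

    Code:
    #include <cstddef>
    #include <istream>
    #include <ostream>
    #include <set>

    // One partition pass: route every element either to the "left" file, the
    // "right" file, or through the in-memory cache (the double-ended priority
    // queue), which ends up holding a fully sorted middle slice.
    void partition_pass(std::istream& in, std::ostream& left,
                        std::ostream& right, std::ostream& middle,
                        std::size_t cache_size)
    {
        if (cache_size == 0)
            cache_size = 1;       // need at least one cached element

        std::multiset<int> depq;  // cheap stand-in for a double-ended priority queue
        int x;

        // Prime the cache with the first cache_size elements.
        while (depq.size() < cache_size && in >> x)
            depq.insert(x);

        // Everything else is compared against the two ends of the cache.
        while (in >> x) {
            if (x <= *depq.begin()) {
                left << x << '\n';                 // below everything cached
            } else if (x >= *depq.rbegin()) {
                right << x << '\n';                // above everything cached
            } else {
                depq.insert(x);                    // belongs in the cache...
                left << *depq.begin() << '\n';     // ...so evict the minimum
                depq.erase(depq.begin());
            }
        }

        // The cache is now a sorted fraction of the whole data set.
        for (int v : depq)
            middle << v << '\n';
        // Recurse on the left and right files until each fits in memory.
    }

    Each pass peels cache_size elements off into a sorted middle slice; the left and right files are then processed the same way until they fit in memory.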

    [Edit]
    Just so you know, using "memory mapped" flat files could almost be considered a minimum requirement for getting any performance out of the process. If you try to manage this process yourself, you'll probably get terrible performance.

    You'll get much better performance -- for the overall process, not any individual step -- if you understand how to serialize more complex data structures in large chunks.

    I only reference this for a simple reason: if you need to ask this question, you probably aren't ready to implement such a process. I strongly advise getting cozy with the normal algorithms before you start trying to optimize sorting of large data sets.
    [/Edit]

    Soma

    (*) Using a sorted cache as a partition point only solves part of the problem. In practice, using a "Radix Sort" flavor of "Quicksort" when a slow medium is required will often significantly improve recursive application of the algorithm proper, but of course, this approach uses far more memory so is often not appropriate without multiple layers of caching.
    Last edited by phantomotap; 06-21-2013 at 04:29 PM.

  5. #5
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    As far as I know, most external sorts use k-way merge sorts (with k = 8 or more) combined with large I/O's (a large number of elements read or written per access) to reduce the overhead of the extra random access that higher k-way merges cause on hard drives.
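
    As a toy illustration of just the merge phase (a real external sort streams large buffered blocks from disk rather than holding whole runs in memory), a k-way merge driven by a min-heap looks roughly like this:

    Code:
    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <tuple>
    #include <vector>

    // Merge k sorted runs with a min-heap.  Runs are in-memory vectors here;
    // an external sort would refill each run from disk in large blocks instead.
    std::vector<int> kway_merge(const std::vector<std::vector<int>>& runs)
    {
        // (value, run index, position within that run)
        using Entry = std::tuple<int, std::size_t, std::size_t>;
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

        // Seed the heap with the first element of every non-empty run.
        for (std::size_t r = 0; r < runs.size(); ++r)
            if (!runs[r].empty())
                heap.emplace(runs[r][0], r, 0);

        std::vector<int> out;
        while (!heap.empty()) {
            auto [value, r, i] = heap.top();
            heap.pop();
            out.push_back(value);
            if (i + 1 < runs[r].size())               // refill from the same run
                heap.emplace(runs[r][i + 1], r, i + 1);
        }
        return out;
    }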

    I doubt anyone uses tape drives for sorting anymore, but an alternative to merge sort was the polyphase sort, a variation of merge sort that deliberately distributes data unevenly in a special pattern across the tape drives. Polyphase sorts could also be used on hard drives back when memory was limited, but on modern computers there's enough memory to use large I/O's, which reduces the overhead of the increased random access in a higher k-way merge sort, and higher k-way merge sorts are faster than polyphase merge sorts (when k is 6 or greater, sometimes when k is 5).

    On some mainframes, the I/O controllers can be programmed to write elements via pointers to the elements in a single I/O, speeding up a sort. This type of controller is called vectored I/O, or scatter/gather. PCs have this type of controller for hard drives, but there scatter/gather is used for the scattered 4 KB blocks of physical memory that make up what appears to be a contiguous virtual memory address space.
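
    As a minimal user-space illustration of the scatter/gather idea (this uses the POSIX writev() call, not the controller-level programming described above), gathering two separate buffers into a single write looks like this:

    Code:
    #include <cstring>
    #include <sys/uio.h>   // writev(), struct iovec (POSIX)
    #include <unistd.h>    // STDOUT_FILENO

    int main()
    {
        const char* header  = "sorted run 0001\n";
        const char* payload = "1 2 3 5 8 13\n";

        // Each iovec points at a separate buffer; the kernel gathers them
        // into one sequential write.
        iovec iov[2];
        iov[0].iov_base = const_cast<char*>(header);
        iov[0].iov_len  = std::strlen(header);
        iov[1].iov_base = const_cast<char*>(payload);
        iov[1].iov_len  = std::strlen(payload);

        // One system call, two buffers, written back to back.
        writev(STDOUT_FILENO, iov, 2);
        return 0;
    }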


    Wiki articles:

    External sorting - Wikipedia, the free encyclopedia

    Polyphase merge sort - Wikipedia, the free encyclopedia

    Vectored I/O - Wikipedia, the free encyclopedia
    Last edited by rcgldr; 06-21-2013 at 05:41 PM.

  6. #6
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Quote Originally Posted by rcgldr View Post
    As far as I know, all commercial products that do external sorts use k-way merge sorts combined with large I/O's to reduce the overhead of the random access that higher k-way merges cause on hard drives.
    O_o

    [Edit]
    The point of this post isn't to argue with you. Most all processes do use an "N-Way Merge Sort" at one level or another.

    The point is, the processes employed by databases and such are enormously complex using several varieties of algorithms and data structures just to do that one part of the job.

    This, sorting big data, isn't something anyone should jump into with an "Algorithm $(X) will solve the problem." mindset.
    [/Edit]

    [Edit]
    Actually, one should prefer using an existing solution when dealing with big data.

    "SQLite", "BerkelyDB", "CouchDB", and similar are all awesome.
    [/Edit]

    As I've said before, that's way too simple a view.

    In the real world, using a single sorting algorithm across all storage characteristics would be too expensive.

    The best general purpose processes in the wild designed for use with massive data sets over slow storage mediums generally use a variation of the same clustered and layered sorting process built on several algorithms and several data structures:

    1): Fastest Storage - "N-Way Hypersort" (N:=Threads) [Implementations are "Quicksort" variants designed for specific cache "lengths" and sizes.]
    2): Secondary Storage - "N-Way Merge Sort" (N:=Buckets) [This sorts runs of data from clusters of the fastest storage.]
    3a): Tertiary Storage - "Scattering External Quicksort" (Priority Queue) [This stripes data writes across a pool of available storage mediums.]
    3b): Tertiary Storage - "Radix Sort" (Hash Table) [Data stripes are moved into specific buckets across all available storage mediums according to the key schedule.]
    3c): Tertiary Storage - "Serialized Structured Data" [Within each "bin", from the "Radix Sort", we harvest meta-data writing data in such a way that we can arbitrarily append yet still read sequentially, after a fashion, during the second pass according to a schedule allowing us to place data where it needs to be using simple "Asynchronous I/O".]
    4): Tertiary Storage - "N-Way Replacement Sort" (N:=Storage "Bins"/Cacheable Clusters) [Process each bucket in strictly ordered fashion according to key schedule.]

    In the general case, we have three reads and two writes of every piece of data on the slowest storage, and it also costs us double the storage needed for the original data. That's your scaling factor, but I would not wish to figure the "Big O", and of course, if you don't have massive amounts of RAM and plenty of real threads you are going to spend a silly lot of time waiting on "I/O" operations.
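
    Reading stage 4 as classic replacement selection -- that reading is an assumption on my part about what the "N-Way Replacement Sort" covers -- the run-forming control flow looks roughly like this, kept entirely in memory purely for illustration:

    Code:
    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // Classic replacement selection: a priority queue emits ascending runs,
    // holding back any element smaller than the last one written so that it
    // joins the next run instead.
    std::vector<std::vector<int>> replacement_selection(const std::vector<int>& input,
                                                        std::size_t cache_size)
    {
        std::priority_queue<int, std::vector<int>, std::greater<int>> current, next;
        std::vector<std::vector<int>> runs(1);
        std::size_t pos = 0;

        // Prime the cache.
        while (current.size() < cache_size && pos < input.size())
            current.push(input[pos++]);

        while (!current.empty()) {
            int smallest = current.top();
            current.pop();
            runs.back().push_back(smallest);      // emit to the current run

            if (pos < input.size()) {
                int incoming = input[pos++];
                if (incoming >= smallest)
                    current.push(incoming);       // still fits in this run
                else
                    next.push(incoming);          // has to wait for the next run
            }
            if (current.empty() && !next.empty()) {
                runs.emplace_back();              // start a new run
                std::swap(current, next);
            }
        }
        return runs;
    }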

    Soma
    Last edited by phantomotap; 06-21-2013 at 06:06 PM.

  7. #7
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by phantomotap View Post
    Most all processes do use an "N-Way Merge Sort" at one level or another. ... databases
    Or at least have a "k-way" merge sort as part of a product package.

    In the case of databases, it depends on what is being done. Take, for example, an indexed sequential database (do they still have these anymore?) being updated by a transaction file that contains all the updates since the last time the main part of the database was updated (the main part is always kept sorted by a primary key). In that case only a copy of the transaction file needs to be sorted, followed by a single merge with the main database.

    Quote Originally Posted by phantomotap View Post
    "SQLite", "BerkelyDB", "CouchDB", and similar are all awesome.
    I was thinking more along the line of products like "CA Sort".

  8. #8
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Quote Originally Posted by rcgldr View Post
    I was thinking more along the line of products like "CA Sort".
    ^_^

    I realize. (I didn't know the exact software.) The recommendation relates to what people do with big data: if you have enough data that you need something like that to sort it all, you are probably going to need something more than just "find+replace" to do anything useful with it.

    Soma

  9. #9
    young grasshopper jwroblewski44's Avatar
    Join Date
    May 2012
    Location
    Where the sidewalk ends
    Posts
    294
    I don't know how I could have missed this section on external sorting in the Wikipedia page on DEPQs: Double-ended priority queue - Wikipedia, the free encyclopedia

    It seems to match up with phantomotap's methodology of reading data into the queue, using the read data as a "pivot point", and then reading the rest of the data and placing it into an appropriate group for further sorting. Correct me if I am wrong, though, because most of this conversation has been [ whoosh! ] over my head!
    "Simplicity is the ultimate sophistication." - Leonardo da Vinci
