# Thread: Scalability problems with Quicksort

1. ## Scalability problems with Quicksort

Hi, I'm writing various sorting algorithms in C++ and comparing their performance to each other (as well as trying to optimize them where I can). I'm running into a problem, but I'm not sure if this is a mistake on my part or if it is due to some physical limitation of the computer that I am unaware of. I'm hoping that someone here can shed some light on it.

I'm comparing Heapsort, Mergesort, and Quicksort, which operate on an array of randomly generated 16-bit shorts in the range -32768 to 32767, using rand() seeded with srand(time(0)) (srand() lives in <cstdlib>; time() in <ctime>). I chose short instead of int so that I could test the algorithms on large arrays while consuming half the memory an array of ints would. I've compiled these with g++.
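For reference, the generation can be sketched roughly like this (a simplified version, not my exact code; note that rand() is only guaranteed to produce values up to 32767, i.e. 15 random bits per call, so two calls are combined):

Code:
```cpp
#include <cstdlib>
#include <ctime>

// Simplified sketch of the array generation (not my exact code).
// Two rand() calls are combined into 16 random bits (0..65535),
// then shifted into the short range -32768..32767.
void fillRandomShorts(short * nums, int size)
{
    for (int i = 0; i < size; i++)
    {
        int bits = ((rand() & 0xFF) << 8) | (rand() & 0xFF); // 0..65535
        nums[i] = static_cast<short>(bits - 32768);          // -32768..32767
    }
}
```
This is called once, after srand(time(0)), before the timer starts.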

For array sizes up to 10 million, the times that I get are pretty much what one would expect, i.e. quicksort outperforms mergesort, which outperforms heapsort. But for more than 20 million elements, quicksort performs worse than mergesort, and for more than 50 million elements, quicksort performs worse than heapsort. This is contrary to my understanding that on average, quicksort is the best performing algorithm of the three. And since these numbers are randomly generated, we expect that quicksort's worst case is not encountered.

Here is my Quicksort:

Code:
```
int partition(short * nums, int lbound, int rbound)
{
    int pivotIndex = rbound - 1;
    while (lbound != pivotIndex)
    {
        if (nums[lbound] > nums[rbound])
        {
            short tmp = nums[lbound];
            nums[lbound] = nums[pivotIndex];
            nums[pivotIndex] = tmp;
            pivotIndex--;
        }
        else
            lbound++;
    }
    if (nums[pivotIndex] < nums[rbound])
        pivotIndex++;

    short tmp = nums[pivotIndex];
    nums[pivotIndex] = nums[rbound];
    nums[rbound] = tmp;

    return pivotIndex;
}

void quickSort(short * nums, int lbound, int rbound)
{
    int size = rbound - lbound + 1;
    if (size <= 1)
        return;

    int pivotIndex = partition(nums, lbound, rbound);
    quickSort(nums, lbound, pivotIndex - 1);
    quickSort(nums, pivotIndex + 1, rbound);
}

void quickSort(short * nums, int size)
{
    quickSort(nums, 0, size - 1);
}
```
Note that here I have chosen the last element as the pivot for simplicity. Since the input is randomly generated, it should not have an impact on performance (for what it's worth I get the same problem if I randomly choose the pivot).

For reference, here is what I have for Mergesort:

Code:
```
#include <malloc.h> // alloca (on this MinGW/Windows setup; <alloca.h> elsewhere)

void mergeSort(short * nums, int size)
{
    if (size == 0 || size == 1)
        return;
    short * lnums;
    short * rnums;
    if (size <= STACKLIMIT_MERGE)
    {
        lnums = static_cast<short *>(alloca((size/2) * sizeof(short)));
        rnums = static_cast<short *>(alloca((size - size/2) * sizeof(short)));
    }
    else
    {
        lnums = new short[size/2];
        rnums = new short[size - size/2];
    }
    for (int i = 0; i < size/2; i++)
    {
        lnums[i] = nums[i];
        rnums[i] = nums[size/2 + i];
    }
    if (size % 2 != 0)
        rnums[size - size/2 - 1] = nums[size - 1];

    mergeSort(lnums, size/2);
    mergeSort(rnums, size - size/2);

    int lindex = 0, rindex = 0;
    for (int i = 0; i < size; i++)
    {
        if ((rindex != size - size/2) && (lindex == size/2 || rnums[rindex] < lnums[lindex]))
        {
            nums[i] = rnums[rindex];
            rindex++;
        }
        else
        {
            nums[i] = lnums[lindex];
            lindex++;
        }
    }
    if (size > STACKLIMIT_MERGE)
    {
        delete [] lnums;
        delete [] rnums;
    }
}
```
In order to speed up this sort, I try to avoid using heap memory as much as possible and use stack allocations instead. STACKLIMIT_MERGE is an upper bound (determined experimentally) on the number of shorts for which I can use dynamic stack allocation (via alloca) and complete the mergesort without incurring a stack overflow exception.

Finally, my heapsort is as follows:

Code:
```
#include <cmath> // ceil, log, pow

void heapify(short * nums, int size) //max heap
{
    int last = (int)(ceil((log((double)(size + 1)) / log((double)2))));
    last = (int)(pow((float)2, last - 1) - 2);
    for (int i = last; i >= 0; i--)
    {
        int j = i;
        int k = ((2 * j + 2) < size) ?
                ((nums[2 * j + 1] > nums[2 * j + 2]) ? (2 * j + 1) : (2 * j + 2)) :
                (((2 * j + 1) < size) ? (2 * j + 1) : -1);
        if (k == -1)
            continue;
        while (nums[k] > nums[j])
        {
            short swp = nums[k];
            nums[k] = nums[j];
            nums[j] = swp;
            j = k;
            k = ((2 * k + 2) < size) ?
                ((nums[2 * k + 1] > nums[2 * k + 2]) ? (2 * k + 1) : (2 * k + 2)) :
                (((2 * k + 1) < size) ? (2 * k + 1) : -1);
            if (k == -1)
                break;
        }
    }
}

void heapSort(short * nums, int size)
{
    heapify(nums, size);
    int index = size - 1;
    while (index >= 0)
    {
        short root = nums[0];
        nums[0] = nums[index--];
        nums[index + 1] = root; //Critical step
        if (index <= 0)
            continue;
        int j = 0;
        int k = (index == 1) || (nums[1] > nums[2]) ? 1 : 2;
        while (nums[k] > nums[j]) //rebuild the heap
        {
            short swp = nums[k];
            nums[k] = nums[j];
            nums[j] = swp;
            j = k;
            k = ((2 * k + 2) <= index) ?
                ((nums[2 * k + 1] > nums[2 * k + 2]) ? (2 * k + 1) : (2 * k + 2)) :
                (((2 * k + 1) <= index) ? (2 * k + 1) : -1);
            if (k == -1)
                break;
        }
    }
}
```
My overall question is why Quicksort scales worse than Mergesort and Heapsort here.
Here are some numbers for reference:
1 million elements - H: 0.641s, M: 0.625s, Q: 0.360s
10 million elements - H: 10.610s, M: 7.172s, Q: 5.922s
20 million elements - H: 24.188s, M: 14.765s, Q: 16.875s
50 million elements - H: 72.812s, M: 38.250s, Q: 78.313s
I'm running on Windows XP SP2, 2GB RAM, Intel x86 Centrino Duo (2 x 1.60 GHz), 120GB disk space (20GB free). I realize much of this info might be irrelevant.

Any insight into my issue would be much appreciated.

2. Originally Posted by hbejkosa
For array sizes up to 10 million, the times that I get are pretty much what one would expect, i.e. quicksort outperforms mergesort, which outperforms heapsort.
I certainly wouldn't expect mergesort to outperform heapsort.

Anyhow, I think you oversimplified the quicksort implementation a bit too much. Generally, once a partition gets below some threshold, the typical thing to do is fall back to insertion sort or another simple O(n^2) algorithm. Using a median-of-3 pivot selection will also help quite a bit.
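A rough sketch of both tweaks (the cutoff value and helper names here are illustrative, not a definitive implementation):

Code:
```cpp
#include <algorithm> // std::swap

const int CUTOFF = 16; // illustrative threshold; tune experimentally

// Small partitions are handed to insertion sort instead of recursing.
void insertionSort(short * nums, int lbound, int rbound)
{
    for (int i = lbound + 1; i <= rbound; i++)
    {
        short key = nums[i];
        int j = i - 1;
        while (j >= lbound && nums[j] > key)
        {
            nums[j + 1] = nums[j];
            j--;
        }
        nums[j + 1] = key;
    }
}

// Pick the median of the first, middle, and last elements and move it
// to nums[rbound], so "last element is the pivot" code keeps working.
void medianOfThree(short * nums, int lbound, int rbound)
{
    int mid = lbound + (rbound - lbound) / 2;
    if (nums[mid] < nums[lbound])    std::swap(nums[mid], nums[lbound]);
    if (nums[rbound] < nums[lbound]) std::swap(nums[rbound], nums[lbound]);
    if (nums[rbound] < nums[mid])    std::swap(nums[rbound], nums[mid]);
    std::swap(nums[mid], nums[rbound]); // median now sits at rbound
}
```
In quickSort you'd then call insertionSort when rbound - lbound + 1 <= CUTOFF, and medianOfThree before partitioning otherwise.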

I also think your heapsort implementation could use some work, especially where you impose the heap property on the array. As far as I know, most implementations just sift each node down separately (i.e. you don't need to compute any logarithms).

Here is a nice article covering the 3 methods you are analyzing, with source code, which will most likely give more insight than I can. Also note that most quicksort implementations switch to heapsort when the recursion depth gets too high (introsort), which is faster than any of these 3 methods.

3. Originally Posted by Ronix
Also note that most quicksort implementations switch to heapsort when the recursion depth gets too high (introsort), which is faster than any of these 3 methods.
It would be more accurate to say that most SC++L implementations use IntroSort for its big-O runtime guarantees. It's not actually faster than plain QuickSort, except in the worst of cases. Switching to HeapSort guards against those worst cases, but detecting the worst case takes a smidgen more processing time.

The problem is that the partition step posted is broken.
When you find that your lbound item is greater than your rbound item, you're swapping the lower-bound one into the pivot slot. That's not a good method to use, because the pivot value must not change during the partitioning process.
You want to seek upwards from the lbound end until you find an item that is greater than the pivot, then seek downwards from the rbound end until you find one that is less than the pivot, swap the two items, move inwards one more on each side, and continue until those indexes cross over. You need to copy the pivot item somewhere and keep it there during the entire partitioning process, such as a separate variable or the first position in the array. At the end, the pivot value is put back in place.
I have some rather well optimised implementations on my site (see below), but I don't have a particularly good description of how Quicksort partitioning works. Try here for that.
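To illustrate, a minimal sketch of that scheme (simplified, not the optimised code from my site; the recursion is included unchanged just so it's self-contained):

Code:
```cpp
// The pivot is parked in nums[rbound] and only moved at the very end.
int partition(short * nums, int lbound, int rbound)
{
    short pivot = nums[rbound]; // pivot value stays put during the scan
    int i = lbound;
    int j = rbound - 1;
    while (true)
    {
        while (i <= j && nums[i] <= pivot) i++; // seek up: first item > pivot
        while (j >= i && nums[j] >= pivot) j--; // seek down: first item < pivot
        if (i >= j)                             // indexes crossed
            break;
        short tmp = nums[i]; nums[i] = nums[j]; nums[j] = tmp;
        i++; j--;                               // move inwards one more
    }
    nums[rbound] = nums[i]; // put the pivot back in its final place
    nums[i] = pivot;
    return i;
}

void quickSort(short * nums, int lbound, int rbound)
{
    if (rbound - lbound + 1 <= 1)
        return;
    int p = partition(nums, lbound, rbound);
    quickSort(nums, lbound, p - 1);
    quickSort(nums, p + 1, rbound);
}
```
Your quickSort driver can stay exactly as it is; only partition changes.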

Another good idea: with your MergeSort, you may notice that the largest amount of extra memory in use at any point during the merge process is equal to the size of the original array. You can improve things a lot by simply allocating one extra array of the same size as the original, up front, and using it as your one and only temp buffer the whole way through. Many allocations and deallocations equal slow, no matter how you do it.
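A rough sketch of that single-buffer idea (helper names are mine, not your code):

Code:
```cpp
// One scratch array is allocated up front and reused by every merge.
void mergeSortRec(short * nums, short * buf, int lbound, int rbound)
{
    if (rbound - lbound < 1)
        return;
    int mid = lbound + (rbound - lbound) / 2;
    mergeSortRec(nums, buf, lbound, mid);
    mergeSortRec(nums, buf, mid + 1, rbound);

    // merge the two sorted halves into buf, then copy back
    int l = lbound, r = mid + 1, out = lbound;
    while (l <= mid && r <= rbound)
        buf[out++] = (nums[r] < nums[l]) ? nums[r++] : nums[l++];
    while (l <= mid)    buf[out++] = nums[l++];
    while (r <= rbound) buf[out++] = nums[r++];
    for (int i = lbound; i <= rbound; i++)
        nums[i] = buf[i];
}

void mergeSort(short * nums, int size)
{
    if (size < 2)
        return;
    short * buf = new short[size]; // the one and only temp buffer
    mergeSortRec(nums, buf, 0, size - 1);
    delete [] buf;
}
```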

Your HeapSort looks a bit nasty. HeapSort should definitely never need floating-point math. Also, your calculations for k are extremely hard to read. The inner while loops are essentially the same in both functions; that's a lot of duplicated code that should be factored into a function. If you move the outer loop of heapify back into the main function, then you can simply call that same sift-down routine again - much shorter!
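For instance, a combined integer-only sift-down helper might look like this (a sketch of the refactor, not your code):

Code:
```cpp
// Shared by heapify and the extraction loop: no logs, no pow.
// 'end' is one past the last index that is still part of the heap.
void siftDown(short * nums, int start, int end)
{
    int j = start;
    while (2 * j + 1 < end)
    {
        int k = 2 * j + 1;                   // left child
        if (k + 1 < end && nums[k + 1] > nums[k])
            k++;                             // right child is larger
        if (nums[j] >= nums[k])
            break;                           // heap property restored
        short tmp = nums[j]; nums[j] = nums[k]; nums[k] = tmp;
        j = k;
    }
}

void heapSort(short * nums, int size)
{
    // heapify: sift down every internal node, deepest first
    for (int i = size / 2 - 1; i >= 0; i--)
        siftDown(nums, i, size);
    // repeatedly move the max to the end and repair the heap
    for (int end = size - 1; end > 0; end--)
    {
        short root = nums[0];
        nums[0] = nums[end];
        nums[end] = root;
        siftDown(nums, 0, end);
    }
}
```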

4. Thanks man! The partition algorithm in your link solved my scalability problems.
Yes, my apologies - my heapsort code is poorly commented. The idea behind the k calculations is that (in a heap context) I want to find the index of the larger child of node j, if it exists (hence the 2j + 1 and 2j + 2). The idea behind the floating-point math is to find the index at the end of the last full level of the heap, so that we don't waste time sifting down the leaves. I imagine this is still not optimal. The reason I'm writing these, by the way, is for a school project, so all I really need is solid implementations across the board. I hope you don't mind if I consult your code.