Originally Posted by
laserlight
You should see them confirmed. If you don't, then your input sizes are not large enough to matter (since "big O" notation refers to asymptotic growth, not actual runtime performance in all cases) and/or you used pathological worst case input for the faster algorithms (e.g., certain canonical variants of quick sort degrade to O(n^2) complexity when sorting an array already sorted in reverse), and/or best case input for the slower algorithms.
For the most part, yes. I think I've written the coolest linked-list comparative sort demo program ever! It does everything I mentioned earlier plus you can use two text files (lists of a single word on each line) to create random lists for the node struct (a first name, a last name, and an assigned "number").
Code:
typedef struct _node { /* singly linked list */
char first[64];
char second[64];
int ID;
struct _node *next;
}node;
You can also choose which criteria to use in the sort and a secondary criteria used to break ties. So using a short dictionary*, you can create lists of hundreds of thousands of nodes with very few possible ties, or using secondary criteria millions. However, you probably don't want to use the slower sorts there -- on my dual core pentium it took about about half an hour to bubblesort 65500 records (over 5 billion iterations). The mergesort will do 1,000,000 records in 6 seconds (a little less than 20 million iterations).
*there are lots of these kinds of (usually alphabetized) text lists of names and/or words around.
But the real purpose of my post now is to shed some further light on the significance of the big O notation and the issues it conceals.
Of course, deciding how to count the iterations was not as clear cut for all the sort styles. But here's some sample output:
Code:
[root~/C] ./a.out -h
Usage: ./a.out [-h] [-f -l] [-1] [-2] [-d]
-d either 1 or 2: copious output to stderr
-h print this message
-f filename
-l number of records in list, needed with -f
-1 primary sort criteria (1=first name, 2=last name[default], 3=number)
-2 secondary sort criteria (0=none[default], 1=first name, etc.)
Eg. "./a.out -f data.txt -l 27 -1 2 -2 1"
Records should have the structure "Some Name:0101".
[root~/C] ./a.out
Using "", 0 records
Primary Criteria offset=64
Command (h for help, Q to quit): h
d display list
h show this message
b (modified) bubblesort
i insertion sort
m mergesort
q quicksort
s selection sort
S shuffle
N new random list
R reload from file
W write to file
Command (h for help, Q to quit): N
How many records? 10
New list is 1.41 kb
Command (h for help, Q to quit): d
0x1a90010 nonglare,James (1)
0x1a900b0 stroys,James (420)
0x1a90150 enslave,Jim (1024)
0x1a901f0 mirex,Tammy (666)
0x1a90290 peartly,Lorenzo (420)
0x1a90330 soymilks,Bob (1024)
0x1a903d0 weeks,Robert (0)
0x1a90470 addrest,Tammy (69)
0x1a90510 starnose,Leslie (27)
0x1a905b0 acerous,Paul (1024)
Command (h for help, Q to quit): N
How many records? 10000
New list is 1406.25 kb
Command (h for help, Q to quit): b
Bubblesorting.......
97760000 Iterations Comparison calls: 97750224
Elapsed time: 0 min 31 sec
Command (h for help, Q to quit): S
Shuffled.
Command (h for help, Q to quit): s
Selectionsorting....
99980002 Iterations Comparison calls: 49995000
Elapsed time: 0 min 19 sec
Command (h for help, Q to quit): S
Shuffled.
Command (h for help, Q to quit): i
Insertionsorting.......
24883640 Iterations Comparison calls: 24883640
Elapsed time: 0 min 6 sec
(The "shuffle" is a Fisher-Yates algorithm.)
So the bubble sort and the selection sort do about what is expected vis. "iterations", (10,000 squared), the insertion sort comes out quite a bit better. Iterations/comparisons remain at a constant ratio for the selection sort, whereas because the bubblesort is "modified" (using a flag as per my discussion with tabstop above), the bubblesort and the insertion sort have a best case of n or n-1 for both iterations and comparisons.
The default is a string comparison, which is much more processor intensive than a numerical comparison (eg, selection sort will do 10,000 records by ID number in 8 seconds). The difference between the number of comparisons is what accounts for the speed difference above; while both bubble and selection sort did almost n^2 iterations, selection sort did half as many comparisons. It should be noted that while the iteration count may be subject to criticism, the comparison count is absolute as it is incremented every time a discrete comparison function is called. This discrepancy (that an iteration need not be a comparison) is not referred to in many discussions of sorting efficiency and is not accounted for with big O.
A third measure is available with the number of recursive function calls in merge sort and quick sort. Because the sorts are similarly recursive, these counts are always the same:
Code:
Command (h for help, Q to quit): N
How many records? 2097152
New list is 294912.00 kb
Command (h for help, Q to quit): m
Mergesorting..............2097151 function calls
61865985 Iterations Comparison calls: 41388898
Elapsed time: 0 min 15 sec
Command (h for help, Q to quit): S
Shuffled.
Command (h for help, Q to quit): q
Quicksorting..............2097151 function calls
148857828 Iterations Comparison calls: 75517673
Elapsed time: 0 min 32 sec
2^21 is 2097152, so we would expect 43830192 here as the the big O (n*log(n)) best/average score here. Initially, I could not write a quick sort faster than a merge sort and am not sure why anyone would think it should be faster, except that it does hold out some possibilities for optimization with the pivot value. I couldn't be bothered to go further than selecting the lowest of the two first values as the pivot. It does much, much worse using just the head of each list as it's pivot, probably since with very long alphabetic matches lists will end up composed of smaller subsections of the alphabet, easily creating a lopsided division with a totally random pivot value. Its open ended "worst case" means it's prone to certain weaknesses, such as when a secondary criteria is added to resolve ties. This becomes increasingly obvious as the list size increases, as all these differences do (at ten items, a bubblesort only does about twice as many iterations as a mergesort, at a hundred items it does ten times as many):
Code:
/* set primary criteria to "3" (ID number, of which there are only ten possible) and
secondary to first name (of which there are 59 possible). So there will be lots and
lots of ties to "resolve". Other cases in this demo use a single criteria (the last name)
of which there are 80638 unique possibilities: */
[root~/C] ./a.out -1 3 -2 1
Using "", 0 records
Primary Criteria offset=128
Secondary Criteria offset=0
Command (h for help, Q to quit): N
How many records? 100000
New list is 14062.50 kb
Command (h for help, Q to quit): m
Mergesorting..............99999 function calls
2315025 Iterations Comparison calls: 2748000
Elapsed time: 0 min 0 sec
Command (h for help, Q to quit): S
Shuffled.
Command (h for help, Q to quit): q
Quicksorting..............99999 function calls
20542647 Iterations Comparison calls: 20123133
Elapsed time: 0 min 2 sec
Command (h for help, Q to quit): N
How many records? 1000000
New list is 140625.00 kb
Command (h for help, Q to quit): m
Mergesorting..............999999 function calls
27884993 Iterations Comparison calls: 34081612
Elapsed time: 0 min 6 sec
Command (h for help, Q to quit): S
Shuffled.
Command (h for help, Q to quit): q
Quicksorting..............999999 function calls
1836821111 Iterations Comparison calls: 1832407830
Elapsed time: 4 min 4 sec
Quicksort suffers this way with ties for the same reason it suffers sorting an already ordered list, which is it's worst case scenario. Because mergesort does not spend time putting a list in order after it's been split off, and instead does it when merging them together, it will benefit if the original list was already in some kind of order, whereas this will only hurt quicksort depending on how the pivot value is choosen. To establish a mean or median value, we must iterate through the list before splitting it -- although in discussions of quicksort I have seen, this process apparently "does not count" as part of it's big O notation. Additional comparisons at each of these iterations would be required as well. Taking the pivot from an already ordered tied list will usually end up on the low side, creating an imbalanced list. Putting pivot ties on the right (opposite the pivot) may improve performance slightly.
Notice the number of recursions is still the same -- an imbalanced list does not increase the number of recursions, since the short half simply goes through fewer. But the number of iterations and comparisons with quicksort can change radically:
Code:
Command (h for help, Q to quit): N
How many records? 100
New list is 14.06 kb
Command (h for help, Q to quit): q
Quicksorting..............99 function calls
1256 Iterations Comparison calls: 694
Elapsed time: 0 min 0 sec
Command (h for help, Q to quit): q
Quicksorting..............99 function calls
9800 Iterations Comparison calls: 4950
Elapsed time: 0 min 0 sec
This is the "worst case" (n^2) on an ordered list.
Now, apparently merge sort will perform the same in either case. This is true if we just measure iterations, but if we take more reality into account, the number of comparisons can, will, and must change. This is because when two split lists come back to be merged, if one list already contains all smaller values, all of the values on the left will be compared one at a time to the first element of the right, and then (since the returned lists have an internal order) no further comparisons need be done and the right is simply iterated onto the end. That is the normative and perhaps only way to implement a merge sort:
Code:
/* the end (merge) of the sort function */
while ((left) && (right) && (lenL>LC) && (lenR>RC)) {
Iterations++;
result=compfunc(offset,left,right);
if ((result==0) && (altfunc)) result=altfunc(altset,left,right);
if (result==2) { ptr->next=right; ptr=right; right=right->next; RC++; }
else { ptr->next=left; ptr=left; left=left->next; LC++; }
} /* if one list is longer or contains more high values, leaving a final remainder */
while ((left) && (lenL>LC)) { ptr->next=left; ptr=left; left=left->next; LC++; Iterations++; }
while ((right) && (lenR>RC)) { ptr->next=right; ptr=right; right=right->next; RC++; Iterations++; }
ptr->next=NULL; /* the new tail */
The second last two lines represent these presorted/tied (hence already unequal) cases -- which in quicksort would be a handicapping imbalance. If the values are well mixed, these two lines are irrelevant and a comparison is made for every iteration. Observation bears this out:
Code:
Command (h for help, Q to quit): N
How many records? 100
New list is 14.06 kb
Command (h for help, Q to quit): m
Mergesorting..............99 function calls
817 Iterations Comparison calls: 545
Elapsed time: 0 min 0 sec
Command (h for help, Q to quit): m
Mergesorting..............99 function calls
817 Iterations Comparison calls: 316
Elapsed time: 0 min 0 sec
As predicted, the number of function calls and iterations (roughly n*log(n)) remains constant, but in the case of the ordered list, the number of comparisons is almost cut in half. This factor is completely hidden when considering big O notation.
To give quicksort one last chance, I decided to add a function which could select a better pivot, despite my misgivings about the additional iterations and comparisons required. So now, the quicksort compares the head with an element about half way thru, and if these are identical, goes thru the whole list until it finds one that isn't. Then we take a mean of the two different values. This helps to ensure that that we will actually have two lists.
This clearly helped to divert quicksort (which it should really be contrasted with mergesort as "splitsort", but anyway...) away from it's worst case, the already sorted list:
Code:
Command (h for help, Q to quit): N
How many records? 100
New list is 14.06 kb
Command (h for help, Q to quit): q
Quicksorting..............99 function calls
1443 Iterations Comparison calls: 882
Elapsed time: 0 min 0 sec
Command (h for help, Q to quit): q
Quicksorting..............99 function calls
1127 Iterations Comparison calls: 754
Elapsed time: 0 min 0 sec
In the second instance the list is already sorted. Contrast this with the cruder quicksort, above, which made the n*n on an already sorted list. While it still hasn't quite caught up to the mergesort, it very nearly has -- the time for the much tied sort of one million records went from over 4 minutes to 7 seconds.
Then I modified the newlist() function to assign random ID's just from rand(), rather than a fixed set of 10. That means billions of possibilities on a 64-bit so I could now compare the sorts on scrambled sets of millions of unique numbers:
Code:
[root~/C] ./a.out -1 3
Using "", 0 records
Primary Criteria offset=128
Command (h for help, Q to quit): N
How many records? 4000000
New list is 562500.00 kb
Command (h for help, Q to quit): m
Mergesorting..............3999999 function calls
123539969 Iterations Comparison calls: 82696100
Elapsed time: 0 min 9 sec
Command (h for help, Q to quit): S
Shuffled.
Command (h for help, Q to quit): q
Quicksorting..............4000000 function calls
190179315 Iterations Comparison calls: 100817020
Elapsed time: 0 min 23 sec
There is still obviously room to optimize the selection of the pivot value further, particularly if we were only dealing with unique numbers, since then taking a true mean of a sublist would be more feasible. But if quicksort and mergesort still both have a best case of n*log(n) (88 million here) anyway, I think I will just stick with merge sort.
A final interesting point: both the quicksort and the insertion sort demonstrated the variation inherent in their design; with insertion sort this variation is toward a "best case" (improving on n^2), whereas with quicksort this is toward a "worst case" (away from n*log(n)). A quick sort cannot really do better than a merge sort although the implementation of a quicksort may be easier to tweak closer to the best case, but in any case would be better known as the "split sort" or "quack sort".
I put up a nice syntax-colored version of the code which produced this nonsense, in case someone wants to argue some silly point:
MK27's linked list comparative sort demo program!
I believe this is fairly portable. It could almost be used as a crude API, since the sort functions use external comparison functions with offsets to criteria (meaning you could plug any kind of struct into it without changes).