Thread: merge sort - top down versus bottom up

  1. #16
    Algorithm Dissector iMalc
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by rcgldr View Post
    I was referring to the comment on your list merge sort algorithms web page that a natural merge sort is not stable. The instability only occurs when compares are used to determine the end of a run: if the first record of one run is greater than the last record of the previous run, the two runs get treated as one run. Using run counts or some other method to determine the end of runs in a natural merge sort preserves stability.
    For every sorting algorithm there is some way to make it stable. From my point of view, if you have to include any code that is not purely for sorting purposes in order to make it stable, then it is inherently not stable.

    Thanks for posting the example code snippets. It made me realise that although I thought we were talking about linked-list sorting the whole time, as per the original poster in the previous thread, it turns out that you were talking about array sorting for at least this entire thread. I could go through everything said and re-examine it in that light, but I don't think I'll take the time to do so today.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  2. #17
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    I was referring to your comment in your list merge sort algorithms web page that a natural merge sort was not stable. ...
    Quote Originally Posted by iMalc View Post
    For every sorting algorithm there is some way to make it stable. From my point of view, if you have to include any code that is not purely for sorting purposes in order to make it stable, then it is inherently not stable.
    I understand your point of view. I was wondering, though: if a natural merge sort uses some method to track the run boundaries created on the initial pass, is it no longer considered to be a natural merge sort?

    Quote Originally Posted by iMalc View Post
    Thanks for posting the example code snippets. It made me realize that although I thought we were talking about linked-list sorting the whole time as per the original poster in the previous thread, it turns out that you were talking about array sorting for at least this entire thread.
    Actually a bit of both. I posted a link to a zip of an example linked list sort earlier in this thread.

    For a top down linked list merge sort, each level of recursion ends up creating at least 2 pointers (left and right), each pointing to a list of nodes. For 1 million records (up to 2^20), that's up to 19 levels and 38 pointers live during the recursion, with a total of (2n - 4) instances of pointers used during a top down merge sort: (n - 2) instances to split and (n - 2) instances to merge. The array based bottom up linked list merge sort example I created uses an array of 32 pointers plus 2 temp pointers, but the array size can be reduced if desired. A "classic" bottom up linked-list merge sort can get by with just 4 pointers (2 input pointers, 2 output pointers), a run size variable, and two counters. (The counters can be eliminated if there's space in the nodes for an end of run flag bit, such as the least significant bit of a node's next pointer if nodes are always on an even boundary; see the sketch below.)
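
    As a sketch of that flag-bit idea (hypothetical node type and macro names; it assumes nodes are at least 2-byte aligned, so the low bit of a next pointer is otherwise always zero, and it uses C99's uintptr_t):
    Code:
    #include <stdint.h>
    
    typedef struct node_st node_t;
    struct node_st {
        node_t *next;   /* low bit doubles as an end-of-run flag */
        int     key;
    };
    
    /* Valid only while nodes are at least 2-byte aligned. */
    #define NEXT(p)     ((node_t *)((uintptr_t)(p)->next & ~(uintptr_t)1))
    #define IS_END(p)   (((uintptr_t)(p)->next & 1) != 0)
    #define SET_END(p)  ((p)->next = (node_t *)((uintptr_t)(p)->next | 1))
    #define CLR_END(p)  ((p)->next = (node_t *)((uintptr_t)(p)->next & ~(uintptr_t)1))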
    Last edited by rcgldr; 08-16-2013 at 09:44 AM.

  3. #18
    Algorithm Dissector iMalc
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by rcgldr View Post
    I understand your point of view. I was wondering, though: if a natural merge sort uses some method to track the run boundaries created on the initial pass, is it no longer considered to be a natural merge sort?
    It doesn't change what it is, but in my view it would be a supplemented form of what would otherwise be just the core algorithm. I'm not dead set on that opinion, though; if I came across several sources which disagreed, I would happily change my view.

    For a top down linked list merge sort, each level of recursion ends up creating at least 2 pointers (left and right), each pointing to a list of nodes. For 1 million records (up to 2^20), that's up to 19 levels and 38 pointers during the recursion, with a total of (2 n - 4) instances of pointers used during a top down merge sort ( (n - 2) instances to split, (n - 2) instances to merge).
    You don't seem to be talking about memory used. As I have repeatedly said, it only uses log(n) memory, which is in the form of stack usage. There is nothing else allocated during the execution of the sort. The fact that you keep coming back to this suggests to me that you think there is more to the memory usage than I am saying, but there is not. Again, these comments are about list-based Merge Sort.

    I should also note that the same claim about the existence of an inherent tree during the course of the algorithm applies equally to bottom-up. It's just a reversed breadth-first visiting order rather than depth-first, so it really doesn't deserve mention as a point of difference IMHO.
    Last edited by iMalc; 08-16-2013 at 03:07 PM.

  4. #19
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    I understand your point of view. I was wondering, though: if a natural merge sort uses some method to track the run boundaries created on the initial pass, is it no longer considered to be a natural merge sort?
    Quote Originally Posted by iMalc View Post
    It doesn't change what it is, but in my view it would be a supplemented form of what would otherwise be just the core algorithm.
    Based on my web search for natural merge sort, most of the articles don't mention stability, and in the ones that do, stability could be considered an optional enhancement. Note that a top down natural merge sort would have to keep track of run boundaries to avoid issues with what I would call "dangling" pointers to runs that got treated as part of adjacent runs, so it would be stable.
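
    As an illustration, counting the natural runs on the initial pass might look like this (a sketch with a hypothetical node type; recording the run count on this pass, rather than re-detecting boundaries by compares on later passes, is the kind of bookkeeping that preserves stability):
    Code:
    typedef struct node_st node_t;
    struct node_st {
        node_t *next;
        int     key;
    };
    
    /* One pass over the list, counting natural (ascending) runs.
     * Using >= keeps equal keys inside a single run. */
    static unsigned long count_runs(const node_t *p)
    {
        unsigned long runs = 0;
    
        while (p) {
            runs++;
            while (p->next && p->next->key >= p->key)
                p = p->next;
            p = p->next;    /* first node of the next run (or NULL) */
        }
        return runs;
    }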

    Quote Originally Posted by iMalc View Post
    You don't seem to be talking about memory used. As I have repeatedly said, it only uses log(n) memory in the stack.
    My issue is not with the memory space used; it's with the number of memory operations to push and load (2n - 4) instances of pointers, of which (n - 2) are used just to recursively split the lists. Then there's the overhead of (n - 2) calls and returns to implement the top down sort.

    Bottom up merge sort doesn't have a split phase, so that already eliminates (n - 2) instances of indices or pointers. Perhaps you're thinking that a bottom up merge sort for a list would create n pointers to lists of size 1, which would consume a significant amount of memory, but this isn't done. In my example, I used 2 temp pointers plus an array of 32 pointers. For a "classic" bottom up merge sort, I could use 4 pointers and 2 counters (to determine the end of runs in otherwise continuous lists).

    Quote Originally Posted by iMalc View Post
    I should also note that the same claims about the existence of an inherent tree during the course of the algorithm applies equally to bottom-up.
    The main difference is that a top down method is more of a true tree traversal, since it starts at the root, follows the tree downwards during splits and upwards during merges, ending back up at the root. A bottom up method effectively starts at the bottom with a pointer (calculated during iteration) to every leaf and works its way up to the root during merges. I'm not sure that qualifies as a true tree traversal, since a non-tree-like method was used to get to the initial state of a pointer to every leaf. Bottom up could be considered half of a full tree traversal, since it's a one-way trip from the leaves to the root (as opposed to starting at the root, visiting every leaf, and returning to the root, which I would consider a full tree traversal).
    Last edited by rcgldr; 08-16-2013 at 05:44 PM.

  5. #20
    Registered User
    Join Date
    Sep 2006
    Posts
    8,868
    I believe that you're overemphasizing the cost of those stack operations, rcgldr. If you profile them, I believe you'll find them to be faster than you expect.
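
    For example, a minimal clock()-based comparison might look like this (hypothetical node type and sort functions; clock() is coarse, so use a large list and rebuild it between runs):
    Code:
    #include <stdio.h>
    #include <time.h>
    
    typedef struct node_st node_t;      /* hypothetical node type */
    struct node_st { node_t *next; int key; };
    
    extern node_t *sort_top_down(node_t *head);     /* hypothetical */
    extern node_t *sort_bottom_up(node_t *head);    /* hypothetical */
    
    /* Time one sort; the sorted list is handed back through headp. */
    static double time_sort(node_t *(*sort)(node_t *), node_t **headp)
    {
        clock_t t0 = clock();
    
        *headp = sort(*headp);
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }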

  6. #21
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    Top down merge sort ... overhead of (n - 2) calls and returns to implement top down sort.
    It's (n - 2) recursive calls if the splitting stops at list size == 2, or (2n - 2) calls if the splitting stops at list size == 1. (The recursion tree for n elements has n leaves and n - 1 internal nodes, so splitting all the way down to size 1 makes 2n - 1 calls in total, 2n - 2 of them recursive; stopping at size 2 removes the leaf level, leaving n - 1 calls, n - 2 of them recursive.)

    Quote Originally Posted by Adak View Post
    I believe that you're overemphasizing the cost of those stack operations, rcgldr. If you profile them, I believe you'll find them to be faster than you expect.
    It's the call overhead in addition to the pushing and loading of pointers that will impact performance; each call or return is going to clear out the instruction pipeline (but so will a jump in a loop).
    Last edited by rcgldr; 08-16-2013 at 07:57 PM.

  7. #22
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Adak View Post
    I believe that you're overemphasizing the cost of those stack operations, rcgldr. If you profile them, I believe you'll find them to be faster than you expect.
    In the case of arrays, there are extra copy operations after each merge, or else a check before each merge to see whether a copy is needed first. Take a look at this cprogramming.com top down merge sort example for an array, which implements the recursive splits using indices:

    Merge Sort in C++ (Source Code) - Cprogramming.com

    In this version, after every merge the data is copied back to the original array, because there's no easy way to otherwise keep track of which of the two buffers holds a given sub-array, due to the depth first order, when the number of elements to be sorted is not a power of 2.
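
    In outline, that style of implementation looks something like this (a condensed sketch of the approach described, not the actual code from the link):
    Code:
    #include <stddef.h>
    
    /* Top down merge sort of a[lo..hi-1] using indices, merging into
     * tmp[] and copying back after every merge so that a[] always
     * holds the current data. */
    static void merge_sort(int a[], int tmp[], size_t lo, size_t hi)
    {
        size_t mid, i, l, r;
    
        if (hi - lo < 2)                /* size 0 or 1: already sorted */
            return;
    
        mid = lo + (hi - lo) / 2;
        merge_sort(a, tmp, lo, mid);    /* recursively split ... */
        merge_sort(a, tmp, mid, hi);
    
        l = lo; r = mid;                /* ... then merge into tmp[] */
        for (i = lo; i < hi; i++)
            if (l < mid && (r >= hi || a[l] <= a[r]))
                tmp[i] = a[l++];
            else
                tmp[i] = a[r++];
    
        for (i = lo; i < hi; i++)       /* the extra copy-back pass */
            a[i] = tmp[i];
    }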

    An alternative to this is to split arrays by allocating two new arrays of half size, copying the current array into the two new arrays, then freeing the original array. Merging would involve allocating an array of size equal to the sum of the sizes of the two arrays to be merged, doing the merge, then freeing the two input arrays. This is a lot of overhead, and it is how the top down example in post #15 was implemented.

    In the case of lists, the lists have to be traversed in order to split them. This is an n log2(n) overhead to recursively split the lists until list size is 1 (or 2), in addition to the actual merging process.
    Last edited by rcgldr; 08-16-2013 at 09:21 PM.

  8. #23
    Algorithm Dissector iMalc
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by rcgldr View Post
    It's (n - 2) recursive calls if the splitting stops at list size == 2, or (2n - 2) calls if the splitting stops at list size == 1.

    It's the call overhead in addition to the pushing and loading of pointers that will impact performance; each call or return is going to clear out the instruction pipeline (but so will a jump in a loop).
    Don't just guess that. Profile two alternate implementations.

    Can you please stop confusing things by jumping back to array-based merge sort every now and then. They are different things with different performance and memory usage characteristics.
    I'm having a hard time trying to figure out when you're arguing about performance characteristics, when you're arguing about memory usage characteristics, and when you're going off topic.

    What I'm trying to get to the bottom of is your apparent claim that a top-down merge sort for a linked list is less efficient than a bottom-up merge sort for a linked list, when the bottom-up merge sort is done in O(1) extra space.
    The only way to do it in O(1) extra space is to avoid chopping the list up, and instead keep all runs in one list. That then necessitates stepping through each sub-list in order to get to the next one, which is the same amount of work as chopping the list in two in the top-down version.
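
    For reference, such a bottom-up merge sort for a linked list in O(1) extra space looks something like this (a sketch in the spirit of Simon Tatham's well-known version, with a hypothetical node type; it is not code from either poster's zip). The run boundaries are found by stepping insize nodes, which is the extra traversal work just mentioned.
    Code:
    typedef struct node_st node_t;
    struct node_st {
        node_t *next;
        int     key;
    };
    
    node_t *sort_list(node_t *head)
    {
        unsigned long insize = 1, nmerges, lsize, rsize, i;
        node_t *left, *right, *tail, *next;
    
        if (!head)
            return head;
    
        for (;;) {
            left = head;
            head = tail = (node_t *)0;
            nmerges = 0;
    
            while (left) {
                nmerges++;
                right = left;           /* step insize nodes ahead to */
                lsize = 0;              /* find the second run        */
                for (i = 0; i < insize && right; i++) {
                    lsize++;
                    right = right->next;
                }
                rsize = insize;
    
                /* merge the two runs, preferring left on equal keys */
                while (lsize > 0 || (rsize > 0 && right)) {
                    if (lsize == 0) {
                        next = right; right = right->next; rsize--;
                    } else if (rsize == 0 || !right) {
                        next = left; left = left->next; lsize--;
                    } else if (left->key <= right->key) {
                        next = left; left = left->next; lsize--;
                    } else {
                        next = right; right = right->next; rsize--;
                    }
                    if (tail)
                        tail->next = next;
                    else
                        head = next;
                    tail = next;
                }
                left = right;           /* next pair of runs starts here */
            }
            tail->next = (node_t *)0;
    
            if (nmerges <= 1)           /* only one run left: sorted */
                return head;
            insize *= 2;                /* double the run size, repeat */
        }
    }
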
    So perhaps you have simply never been talking about a bottom-up merge-sort for a linked list, and may not even have seen such a thing. After all, the only implementations you've shown in either thread were for sorting an array. Thus you may have been comparing apples to oranges all along.

  9. #24
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    ... bottom-up merge-sort for a linked list ...
    From post #1:

    Quote Originally Posted by rcgldr View Post
    Link to an example program of a bottom up merge sort for a list using an array of 32 list pointers.
    http://rcgldr.net/misc/sortl.zip
    sortl.c will get renamed to msortla.c (merge sort list using array) the next time I update it. It starts off with the premise that the initial list can be considered to be n groups of size 1. Nodes are removed from the original list one at a time and merged into the array. After all the nodes are merged into the array, the lists in the array are merged to create one sorted list.

    From post #17:

    Quote Originally Posted by rcgldr View Post
    A "classic" bottom up linked-list merge sort can get by with just 4 pointers (2 input pointers, 2 output pointers), a run size variable, and two counters.
    I've written one of these as well. There are more than just those 7 variables, but the 4 list pointers, the two counters for sub-list merging, and the run size variable are the key ones.

    Link to zip of "classic" bottom up merge sort for linked list. This is still a work in progress, but it appears to be working:

    http://rcgldr.net/misc/msortl.zip

    The initial "split list" sequence puts even nodes on one list, odd nodes one the other list. Then it merges the two lists, based on run size (1, 2, 4, 8, ... until run size exceeds the original list size), alternating output between two other lists. After a pass, the input and output list pointers are swapped, the run size doubled and the merge process repeated until run size >= original list size.
    Last edited by rcgldr; 08-17-2013 at 04:13 AM.

  10. #25
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    rcgldr, your (nonrecursive) bottom-up merge sort does use O(log N) extra space. Just because you know B, and that N ≤ 2^B, does not turn O(log N) into O(1).

  11. #26
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    rcgldr, your (nonrecursive) bottom-up merge sort does use O(log N) extra space. Just because you know B, and that N ≤ 2^B, does not turn O(log N) into O(1).
    I never made any claims about O(1) space. However, both example programs I created use a fixed amount of extra space regardless of N. There is no allocation (except for the space for the nodes themselves), and since there is no recursion, the amount of stack space is also fixed. For the example in sortl.zip, I chose to use 32 array members, but that's an arbitrary choice; Microsoft's STL list.sort() uses 23 array members. The last member of the array doesn't have a limit on the number of nodes in the list it points to. For the example in msortl.zip, the key variables are 4 list pointers, two counters to keep track of runs within each of the 2 current input lists, and a current run size count, again regardless of N.

    Perhaps it's not clear that no space for nodes is allocated (other than the original list); only the linkages are changed. The process of getting nodes from a list empties that list.

    However, my main issue with top down versus bottom up merging is performance. In the case of a top down merge sort for a list, during the split phase the list and any sub-lists have to be repeatedly traversed in order to recursively split them until list size is reduced to 2 or 1, before any merging takes place.

    For the bottom up merge sort using an array of pointers to lists, merging into the array is done during the one and only traversal of the original list, and then the lists in the array are merged.

    For the bottom up "classic" merge using the 4 pointers to list, the first traversal of the list is used to create two lists, even nodes and odd nodes (run size of 1), but this could be enhanced to merge pairs of nodes during the one and only traversal of the original list to produce the initial two lists, each with a run size of 2 nodes per group. A further performance enhancement would use an array of pointers to nodes to sort nodes into groups of 2^m (such as m = 5 or 2^m = 32) nodes during the initial traversal of the original list and creation of the two initial lists for the merge process. This is how tape sorts were implemented, and this "classic" merge is using lists in the same manner as data on a tape.
    Last edited by rcgldr; 08-17-2013 at 08:53 AM.

  12. #27
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    I never made any claims about O(1).
    Oh, I just thought that was the point of contention here.

    Is this similar to the implementation you're thinking about for linked lists? (This is the C board, not the C++ one, so please don't use C++ here.)
    Code:
    typedef struct list_st  list_t;
    struct list_st {
        list_t *next;
        double key;
    };
    
    /* How many parallel lists to maintain.
    */
    #define RANKS 30
    
    /* Merge two lists, return root of new list.
    */
    static list_t *merge(list_t *left, list_t *right)
    {
        list_t *root = (list_t *)0, *curr;
    
        if (!left)
            return right;
        if (!right)
            return left;
    
        if (left->key <= right->key) {
            curr = root = left;
            left = left->next;
        } else {
            curr = root = right;
            right = right->next;
        }
    
        while (left && right)
            if (left->key <= right->key) {
                curr->next = left;
                curr = left;
                left = left->next;
            } else {
                curr->next = right;
                curr = right;
                right = right->next;
            }
    
        if (left)
            curr->next = left;
        else
        if (right)
            curr->next = right;
        else
            curr->next = (list_t *)0;
    
        return root;
    }
    
    /* Nonrecursive fixed-memory merge sort.
     * Returns the root of the sorted list.
    */
    list_t *sort(list_t *list)
    {
        list_t *rank[RANKS] = { (list_t *)0 }; /* All zeros per C rules. */
        list_t *curr;
        int     i;
    
        if (!list)
            return (list_t *)0;
    
        while (list) {
            curr = list;
            list = list->next;
            curr->next = (list_t *)0;
    
            if (rank[0]) {
                curr = merge(curr, rank[0]);
    
                for (i = 1; i < (int)RANKS - 1 && rank[i]; i++) {
                    curr = merge(curr, rank[i]);
                    rank[i-1] = (list_t *)0;
                }
    
                rank[i-1] = (list_t *)0;
            /* If the carry reached the last rank and it is already
               occupied, merge into it rather than overwrite it, so a
               small RANKS only slows the sort down instead of dropping
               nodes. */
            if (rank[i])
                rank[i] = merge(curr, rank[i]);
            else
                rank[i] = curr;
    
            } else
                rank[0] = curr;
        }
    
        /* Find longest existing list. */
        i = RANKS - 1;
        while (i > 0 && !rank[i])
            i--;
    
        /* Merge down. */
        for (; i > 0; i--)
            rank[i-1] = merge(rank[i-1], rank[i]);
    
        /* Done. */
        return rank[0];
    }
    Having a smaller RANKS will not limit the input list length; it only causes a slowdown when the input list length exceeds 2^RANKS elements.

    Since each node must have a pointer at least, it makes no sense to use RANKS greater than 30 for 32-bit architectures, or 61 for 64-bit architectures.

    Technically, the rank array contains sublists of doubling lengths. The length of the original list is neither needed nor useful. Recursion is replaced by the use of the rank array.

    The list is chopped up into one-element-long sublists, and merged into exponentially longer lists as needed.

    I can see the performance benefit compared to a typical top-down approach: this avoids the traversal of the original list to find the node at which to split, because every node is split from the original list. The work done to merge the sublists is the same in both cases.

    One can also split the original list on input into fixed-length sorted sublists, and stick those into rank[0] instead. That seems to help with shorter lists.
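
    For example (a sketch building on the list_t and merge() above; the insertion position stops at the first strictly greater key, so equal keys keep their arrival order):
    Code:
    /* Detach up to k nodes from the front of *listp and return them
     * as one sorted sublist, ready to merge into rank[0]. */
    static list_t *take_sorted_chunk(list_t **listp, int k)
    {
        list_t *sorted = (list_t *)0, *curr, **pos;
        int i;
    
        for (i = 0; i < k && *listp; i++) {
            curr = *listp;
            *listp = curr->next;
    
            pos = &sorted;
            while (*pos && (*pos)->key <= curr->key)
                pos = &(*pos)->next;
            curr->next = *pos;
            *pos = curr;
        }
        return sorted;
    }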

    Longer lists seem to be memory access limited (poor cache locality); sorting a temporary array (containing the key and the pointer to the actual node) and rechaining the nodes using the temporary array should yield much better cache locality, and therefore faster run times -- but I have not verified this.
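
    That idea might look something like the following (a sketch using the list_t above plus the standard qsort(); note that qsort() is not guaranteed to be stable, and the list length must be known or counted first):
    Code:
    #include <stdlib.h>
    
    typedef struct {
        double  key;    /* copy of the node's key, kept compact */
        list_t *node;
    } ref_t;
    
    static int ref_cmp(const void *a, const void *b)
    {
        const ref_t *x = (const ref_t *)a, *y = (const ref_t *)b;
        return (x->key > y->key) - (x->key < y->key);
    }
    
    /* Sort by copying keys and node pointers into a compact temporary
     * array, sorting that, then rechaining the nodes in sorted order. */
    list_t *sort_via_array(list_t *list, size_t n)
    {
        ref_t *ref;
        size_t i;
    
        if (!list || n < 2)
            return list;
    
        ref = malloc(n * sizeof *ref);
        if (!ref)
            return list;    /* out of memory: leave the list unsorted */
    
        for (i = 0; i < n && list; i++, list = list->next) {
            ref[i].key  = list->key;
            ref[i].node = list;
        }
        n = i;
    
        qsort(ref, n, sizeof *ref, ref_cmp);
    
        for (i = 0; i + 1 < n; i++)     /* rechain in sorted order */
            ref[i].node->next = ref[i + 1].node;
        ref[n - 1].node->next = (list_t *)0;
    
        list = ref[0].node;
        free(ref);
        return list;
    }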

  13. #28
    Registered User
    Join Date
    Aug 2013
    Posts
    1
    You can get a complete program at Data Structure Programs

  14. #29
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    Is this similar to the implementation you're thinking about for linked lists? ... code ...
    Yes, it's similar, and it uses the same concept as sortl.zip. There are minor coding differences, but it's the same core algorithm. The final loop that merges rank[i] goes from largest to smallest, which affects stability, since rank[i+1] has "older" nodes than rank[i]. Going from smallest to largest (incrementing i) should preserve stability, but I haven't confirmed this.
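
    Concretely, that reordering (again, unconfirmed) would replace the final merge loop with something like this; higher ranks hold the older nodes, and merge() prefers its left argument on equal keys, so the older list is passed on the left:
    Code:
        /* Merge upward from the smallest rank, oldest sublist on the
           left, so equal keys keep their original order. */
        curr = rank[0];
        for (i = 1; i < (int)RANKS; i++)
            if (rank[i])
                curr = merge(rank[i], curr);    /* rank[i] is older */
        return curr;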

    Quote Originally Posted by Nominal Animal View Post
    Having a smaller RANKS will not limit the input list length; it only causes a slowdown when the input list length exceeds 2^RANKS elements.
    And it's not that much of a slowdown if RANKS is sufficiently large. As mentioned, Microsoft's STL list::sort() effectively sets RANKS to 23.

    Quote Originally Posted by Nominal Animal View Post
    Since each node must have a pointer at least, it makes no sense to use RANKS greater than 30 for 32-bit architectures, or 61 for 64-bit architectures.
    I had done some testing in 64-bit mode, but yes, a RANKS of 32 is not needed in 32-bit mode.

    Quote Originally Posted by Nominal Animal View Post
    Technically, the rank array contains sublists of doubling lengths.
    Except for rank[RANKS-1], which can have a list of any size.

    Quote Originally Posted by Nominal Animal View Post
    The length of the original list is not needed nor useful.
    It's not used in sortl (to be renamed to msortla) and it's not needed for msortl, but both are works in progress. They work, but they need cleaning up.

    Quote Originally Posted by Nominal Animal View Post
    Recursion is replaced by the usage of the rank array. The list is chopped up into one-element-long sublists, and merged into exponentially longer lists as needed. I can see the performance benefit compared to a typical top-down approach: this avoids the traversal of the original list to find the node at which to split, because every node is split from the original list. The work done to merge the sublists is the same in both cases.
    The repeated splitting of lists could be done without recursion, using iteration to create a stack-like structure. It's the splitting that I consider to be an unneeded step for a merge sort.

    Quote Originally Posted by Nominal Animal View Post
    One can also split the original list on input into fixed-length sorted sublists, and stick those into rank[0] instead. That seems to help with shorter lists.
    One way to implement that is with an array of RANK0 pointers to nodes (sublists of size 1), where RANK0 is some power of 2, like 32. The program would get and sort RANK0 nodes at a time from the original list, sorting the nodes via the array of pointers, linking the nodes to form a sorted sublist based on the sorted pointers, then merging that sublist into rank[0]. If the nodes are reasonably small, use a temp array of RANK0 nodes instead, sort the nodes directly by moving them, then link them and merge into rank[0].

    Quote Originally Posted by Nominal Animal View Post
    Longer lists seem to be memory access limited (poor cache locality); sorting a temporary array (containing the key and the pointer to the actual node) and rechaining the nodes using the temporary array should yield much better cache locality, and therefore faster run times -- but I have not verified this.
    It's an optimization that I'm not sure of. Some cache implementations aren't affected by the distance between cached lines of memory. I compared merge sort on an array, sorting via pointers versus sorting the data directly, and sorting via pointers wasn't faster until element size was somewhere between 128 and 256 bytes. Part of this is because the compare overhead is the same either way, and the compare ends up at least partially caching the two elements being compared; the remaining difference is moving elements versus moving pointers, and having to cache both pointers and elements versus caching just the elements.
    Last edited by rcgldr; 08-17-2013 at 03:13 PM.

  15. #30
    Algorithm Dissector iMalc
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Okay, so it's not O(1); it's O(log n), but in an intentionally limited implementation.
    It's still not right to state that it "uses a fixed amount of extra space" though. For one, it won't actually "use" all of that space, e.g. for small lists. And two, the implied ending to such a sentence is "regardless of the size of the input", which is untrue because it does not actually work with input larger than a certain size. I'm interested in theory here, not so much practicality.

    Note that you should also not be using a global or static for your buffer. Use a local variable as Nominal has.

