Thread: merge sort - top down versus bottom up

  1. #31
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    The final loop that merges rank[i] goes from largest to smallest, which affects stability, since rank[i+1] has "older" nodes than rank[i]. Going from smallest to largest (incrementing i) should preserve stability, but I haven't confirmed this.
    Or just change the merge statement in that final loop (so that if a rank[i] member is equal to a rank[i-1] member, the rank[i] member is used first) to:

    rank[i-1] = merge(rank[i], rank[i-1]);
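    For context, a sketch (untested) of that final loop with the change; NUMRANKS is just a placeholder for the number of rank slots, and merge() is assumed to take from its first argument on equal keys, so the older rank[i] nodes stay ahead of equal rank[i-1] nodes:

    Code:
    for (i = NUMRANKS - 1; i > 0; i--)
        rank[i-1] = merge(rank[i], rank[i-1]);
    /* rank[0] now holds the complete, stably sorted list */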

    Quote Originally Posted by iMalc View Post
    Okay so it's not O(1), it's O(log n) but in an intentionally limited implementation. It's still not right to state that it "uses a fixed amount of extra space" though.
    The "classic" version in msortl.zip uses a fixed amount of extra space, 7 key variables (the 4 pointers to lists, 2 counters, and a run size).

    Quote Originally Posted by iMalc View Post
    Note that you should also not be using a global or static for your buffer. Use a local variable like Nominal has.
    I stated that these were works in progress. I'll clean them up later.

    Last edited by rcgldr; 08-17-2013 at 07:12 PM.

  2. #32
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    It's an optimization that I'm not sure of. Some cache implementations aren't affected by distance between cached lines of memory.
    No, but prefetching works much better with predictable access patterns.

    This means that if you access the data in a predictable pattern, the CPU (and perhaps more importantly, your compiler) can better manage which parts of RAM are kept in which cache level, reducing the number of loads from relatively slow RAM.

    The key to the fastest possible implementation of an algorithm is to make sure that data flows continuously at all levels. Any interruption or stall wastes the user's time.

    Quote Originally Posted by rcgldr View Post
    I compared merge sort on an array, sorting via pointers versus sorting the data directly, and sorting via pointers wasn't faster until element size was somewhere between 128 and 256 bytes.
    Sure; like I said, I haven't verified that at all.

    The datasets I work with tend to be quite large, usually millions of items (atoms in an MD simulation, for example).

    If the sort key is a single number, an int, long, float, or double, and the array size is large enough (a few million, typically), radix sort will be faster than any of the alternatives, and scale linearly. (For IEEE-754 floats (Binary32) and doubles (Binary64), two extra passes are needed so that the actual radix sort can always treat the key as unsigned integer values of the same size.)

    The optimum variant (bin sizes) depends on both the key size (32 or 64 bits) and the cache topology (cacheline size and second-closest cache size). This, and the fact that the crossover point with other sorts is so hard to pin down, makes it much less attractive than, say, quicksort -- unless the array size is known to be large enough, tens of millions or more.

    As radix sort moves the elements on each pass, it works much better for smaller elements; furthermore, the (separate) sort key can easily be derived from multiple scalars into the temporary key array (for example, the dot product between the coordinates and the sort direction, or the squared distance from some point). This scenario is valid in my work, and that's the reason I wondered if it was valid here, too.
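    For illustration only: a sketch (untested; struct particle and make_keys are just hypothetical names) of deriving a separate key array and pointer payload from structured data, so that a key-plus-pointer sort like the ones discussed below can order the particles without moving them:
    Code:
    #include <stddef.h> /* for size_t */

    struct particle {
        double x, y, z;
        /* ... other per-particle fields ... */
    };

    /* Fill key[i] with the projection of particle i onto the sort direction
     * (dx, dy, dz), and ptr[i] with a pointer to the particle itself. */
    static void make_keys(struct particle *p, size_t n,
                          double dx, double dy, double dz,
                          double *key, void **ptr)
    {
        size_t i;

        for (i = 0; i < n; i++) {
            key[i] = p[i].x * dx + p[i].y * dy + p[i].z * dz;
            ptr[i] = &p[i];
        }
    }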

  3. #33
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    radix sort
    Different topic (instead of top down versus bottom up merge), but I included an example in the zip file for array (vector) sorts, sortv.zip; it's the one named hsortv.cpp, and there's a text file with the run times. It's not much of a radix sort: it uses the upper 5 bits of 64-bit integers to put the data into 32 bins, then does a merge sort on each bin. It uses 1 GB+ of RAM to sort 32 MB of data, so it's piggish on RAM.

  4. #34
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Not to hijack the thread, but:
    Quote Originally Posted by rcgldr View Post
    It's not much of a radix sort
    Here is a real radix sort, which sorts about ten million unsigned 64-bit pseudorandom integers (tested using 64-bit xorshift) with an associated pointer payload in less than two seconds on my machine:
    Code:
    #include <stdint.h> /* for uint64_t */
    #include <string.h> /* for memset() */
    
    /* 64-bit unsigned integer radix sort, with a pointer payload, using 8 8-bit bins.
     * Pointers may alias each other, as long as
     *      src_key != tmp1key
     *      src_ptr != tmp1ptr
     *      tmp1key != tmp2key
     *      tmp1ptr != tmp2ptr
     *      tmp2key != tmp3key
     *      tmp2ptr != tmp3ptr
     *      dst_key != tmp3key
     *      dst_ptr != tmp3ptr
     * In particular,
     *      src_key == tmp2key == dst_key
     *      tmp1key == tmp3key
     *      src_ptr == tmp2ptr == dst_ptr
     *      tmp1ptr == tmp3ptr
     * is perfectly okay.
    */
    static void radix_sort_u64_ptr(uint64_t       *const dst_key, void *      *const dst_ptr,
                                   uint64_t       *const tmp3key, void *      *const tmp3ptr,
                                   uint64_t       *const tmp2key, void *      *const tmp2ptr,
                                   uint64_t       *const tmp1key, void *      *const tmp1ptr,
                                   const uint64_t *const src_key, void *const *const src_ptr,
                                   const size_t          len)
    {
        size_t  count[256];
        size_t  i, sum;
    
        /* Bits 0..7: tmp1 <- src. */
        memset(count, 0, 256 * sizeof count[0]);
    
        for (i = 0; i < len; i++)
            count[src_key[i] & 255]++;
    
        for (i = 0, sum = 0; i < 256; i++) {
            const size_t start = sum;
            sum += count[i];
            count[i] = start;
        }
    
        for (i = 0; i < len; i++) {
            const size_t offset = count[src_key[i] & 255]++;
            tmp1key[offset] = src_key[i];
            tmp1ptr[offset] = src_ptr[i];
        }
    
        /* Bits 8..15: tmp2 <- tmp1. */
        memset(count, 0, 256 * sizeof count[0]);
    
        for (i = 0; i < len; i++)
            count[(tmp1key[i] >> 8) & 255]++;
    
        for (i = 0, sum = 0; i < 256; i++) {
            const size_t start = sum;
            sum += count[i];
            count[i] = start;
        }
    
        for (i = 0; i < len; i++) {
            const size_t offset = count[(tmp1key[i] >> 8) & 255]++;
            tmp2key[offset] = tmp1key[i];
            tmp2ptr[offset] = tmp1ptr[i];
        }
    
        /* Bits 16..23: tmp1 <- tmp2. */
        memset(count, 0, 256 * sizeof count[0]);
    
        for (i = 0; i < len; i++)
            count[(tmp2key[i] >> 16) & 255]++;
    
        for (i = 0, sum = 0; i < 256; i++) {
            const size_t start = sum;
            sum += count[i];
            count[i] = start;
        }
    
        for (i = 0; i < len; i++) {
            const size_t offset = count[(tmp2key[i] >> 16) & 255]++;
            tmp1key[offset] = tmp2key[i];
            tmp1ptr[offset] = tmp2ptr[i];
        }
    
        /* Bits 24..31: tmp2 <- tmp1. */
        memset(count, 0, 256 * sizeof count[0]);
    
        for (i = 0; i < len; i++)
            count[(tmp1key[i] >> 24) & 255]++;
    
        for (i = 0, sum = 0; i < 256; i++) {
            const size_t start = sum;
            sum += count[i];
            count[i] = start;
        }
    
        for (i = 0; i < len; i++) {
            const size_t offset = count[(tmp1key[i] >> 24) & 255]++;
            tmp2key[offset] = tmp1key[i];
            tmp2ptr[offset] = tmp1ptr[i];
        }
    
        /* Bits 32..39: tmp1 <- tmp2. */
        memset(count, 0, 256 * sizeof count[0]);
    
        for (i = 0; i < len; i++)
            count[(tmp2key[i] >> 32) & 255]++;
    
        for (i = 0, sum = 0; i < 256; i++) {
            const size_t start = sum;
            sum += count[i];
            count[i] = start;
        }
    
        for (i = 0; i < len; i++) {
            const size_t offset = count[(tmp2key[i] >> 32) & 255]++;
            tmp1key[offset] = tmp2key[i];
            tmp1ptr[offset] = tmp2ptr[i];
        }
    
        /* Bits 40..47: tmp2 <- tmp1. */
        memset(count, 0, 256 * sizeof count[0]);
    
        for (i = 0; i < len; i++)
            count[(tmp1key[i] >> 40) & 255]++;
    
        for (i = 0, sum = 0; i < 256; i++) {
            const size_t start = sum;
            sum += count[i];
            count[i] = start;
        }
    
        for (i = 0; i < len; i++) {
            const size_t offset = count[(tmp1key[i] >> 40) & 255]++;
            tmp2key[offset] = tmp1key[i];
            tmp2ptr[offset] = tmp1ptr[i];
        }
    
        /* Bits 48..55: tmp3 <- tmp2. */
        memset(count, 0, 256 * sizeof count[0]);
    
        for (i = 0; i < len; i++)
            count[(tmp2key[i] >> 48) & 255]++;
    
        for (i = 0, sum = 0; i < 256; i++) {
            const size_t start = sum;
            sum += count[i];
            count[i] = start;
        }
    
        for (i = 0; i < len; i++) {
            const size_t offset = count[(tmp2key[i] >> 48) & 255]++;
            tmp3key[offset] = tmp2key[i];
            tmp3ptr[offset] = tmp2ptr[i];
        }
    
        /* Bits 56..63: dst <- tmp3. */
        memset(count, 0, 256 * sizeof count[0]);
    
        for (i = 0; i < len; i++)
            count[(tmp3key[i] >> 56) & 255]++;
    
        for (i = 0, sum = 0; i < 256; i++) {
            const size_t start = sum;
            sum += count[i];
            count[i] = start;
        }
    
        for (i = 0; i < len; i++) {
            const size_t offset = count[(tmp3key[i] >> 56) & 255]++;
            dst_key[offset] = tmp3key[i];
            dst_ptr[offset] = tmp3ptr[i];
        }
    }
    To sort a dataset of N entries in place, you also need temporary arrays of N entries (one for the keys, one for the pointer payload), plus a constant kilobyte or two of stack (the 256-entry count array) when calling the function.

    (If you wish to keep the original data intact, you need two arrays of N entries; one for the results, and one for the temporaries.)

    The reason the above has so many pointer arguments is that the same function can be used to sort e.g. doubles (if the platform uses IEEE-754 Binary64 representation with the same byte order as integer types), with two extra passes over the data. Signed integers (assuming two's complement format) need a similar adjustment:
    Code:
    #include <stdint.h> /* for uint64_t, int64_t */
    #include <stdlib.h> /* for malloc(), free() */
    #include <errno.h>  /* for errno, ENOMEM, ENOTSUP */

    #define SIGN64 ((uint64_t)9223372036854775808.0)
    
    int radix_sort_double_ptr(double *const keys, void **const data, const size_t len)
    {
        uint64_t *const dst_key = (uint64_t *)keys;
        void    **const dst_ptr = data;
        uint64_t       *tmp_key;
        void          **tmp_ptr;
        size_t          i;
    
        if (len < 2)
            return 0;
    
        if (sizeof (double) != sizeof (uint64_t))
            return errno = ENOTSUP;
    
        tmp_key = malloc(len * sizeof (uint64_t));
        tmp_ptr = malloc(len * sizeof (void *));
        if (!tmp_key || !tmp_ptr) {
            free(tmp_ptr);
            free(tmp_key);
            return errno = ENOMEM;
        }
    
        for (i = 0; i < len; i++)
            tmp_key[i] = (dst_key[i] & SIGN64) ? ~dst_key[i] : dst_key[i] ^ SIGN64;
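        /* The loop above maps each double to a uint64_t with the same ordering
           (flip the sign bit for non-negative values, flip all bits for
           negative ones).  In the call below, tmp_key is both the source and
           the final key destination, dst_ptr carries the payload throughout,
           and dst_key / tmp_ptr serve only as scratch (see the aliasing note). */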
    
        radix_sort_u64_ptr(tmp_key, dst_ptr, 
                           dst_key, tmp_ptr, 
                           tmp_key, dst_ptr, 
                           dst_key, tmp_ptr, 
                           tmp_key, dst_ptr, len);
     
        for (i = 0; i < len; i++)
            dst_key[i] = (tmp_key[i] & SIGN64) ? tmp_key[i] ^ SIGN64 : ~tmp_key[i];
    
        free(tmp_ptr);
        free(tmp_key);
        return 0;
    }
    
    int radix_sort_i64_ptr(int64_t *const keys, void **const data, const size_t len)
    {
        uint64_t *const dst_key = (uint64_t *)keys;
        void    **const dst_ptr = data;
        uint64_t       *tmp_key;
        void          **tmp_ptr;
        size_t          i;
    
        if (len < 2)
            return 0;
    
        tmp_key = malloc(len * sizeof (uint64_t));
        tmp_ptr = malloc(len * sizeof (void *));
        if (!tmp_key || !tmp_ptr) {
            free(tmp_ptr);
            free(tmp_key);
            return errno = ENOMEM;
        }
    
        for (i = 0; i < len; i++)
            tmp_key[i] = dst_key[i] + SIGN64;
    
        radix_sort_u64_ptr(tmp_key, dst_ptr, 
                           dst_key, tmp_ptr, 
                           tmp_key, dst_ptr, 
                           dst_key, tmp_ptr, 
                           tmp_key, dst_ptr, len);
     
        for (i = 0; i < len; i++)
            dst_key[i] = tmp_key[i] - SIGN64;
    
        free(tmp_ptr);
        free(tmp_key);
        return 0;
    }

  5. #35
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    Here is a real radix sort, which sorts about ten million unsigned 64-bit pseudorandom integers (tested using 64-bit xorshift) with an associated pointer payload in less than two seconds on my machine.
    I changed the merge sort example, msortv.cpp in sortv.zip, to sort 10x1024x1024 64-bit unsigned integers (instead of 1024x4096), and it took about 785 milliseconds, less than 1 second on my system (Intel 2600K at 3.4 GHz, Windows XP x64, in 64-bit mode). It's not clear to me what the xxx_key and xxx_ptr parameters are being used for. Assuming the goal is to sort an array of 64-bit integers, why are there both keys and pointers?

    Note that msortv.cpp uses vectors in main() instead of allocating arrays, but the sort functions are really C functions, so main() treats the vectors as C arrays by passing pointers to their first elements.
    Last edited by rcgldr; 08-18-2013 at 09:14 PM.

  6. #36
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    why are there keys and pointers?
    The pointers are the payload.

    Sorting just numbers is rarely useful. The sorted unit is (uint64_t, void *) where the first element is the key, and the second element is the payload associated with the key.

    The two are in separate arrays to make it easier to use in case the data to be sorted is already referenced by an array of pointers (which is a common case). Then, the programmer only needs to allocate and construct the array of keys. Furthermore, the keys can be derived, not copied, from the data.

    If you have a CPU with larger caches, the crossover from other sort algorithms (O(N log N)) to radix sort (O(N)) may occur at even higher N. If you can do some comparisons, I would be very interested to know the results on your hardware.

    Also, comparing sort algorithms on doubles is very interesting. Because the radix sort actually treats them as 64-bit unsigned integers (doing the transformations necessary to do so), it does not use any floating-point hardware at all. If you compare floating-point values in C or C++ instead, the differences between CPUs' floating-point units, especially in loads, become apparent.

    If you plot the wall-clock duration of each sort algorithm as a function of N, you can clearly see the effects the CPU caches have at various sizes (as jumps in the time taken). Only after the dataset is large enough to require a lot of RAM access does the time taken by each algorithm stabilize to the expected asymptotic function of N.

    I'm sure you can find the point where radix sort beats the other sorts for similar data structures; after all, for a large enough N,
    C1 + C2·N < C3 + C4·N·log(C5·N)
    regardless of what the positive constants C1, C2, C3, C4, and C5 are.

    (Thus far N has been in the millions to tens of millions range on all CPUs I've tried it on, but I have not tested the latest CPUs from Intel or AMD yet. That's why I'm obviously interested in your results.)

    The radix sort algorithm itself can be easily optimized further. In particular, the 32-bit version sorts much faster, obviously. For signed integer and floating-point types, it is worth it to keep track of the minimum and maximum values (after the conversion), so that if the value range is sufficiently small, a secondary pass over the keys allows using a version that sorts fewer bits.
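    For example, a sketch (untested; radix_passes_needed is a hypothetical helper) of the minimum/maximum idea: subtract the common minimum when bucketing, which preserves the order, and skip every 8-bit pass above the highest byte of (max - min):
    Code:
    #include <stdint.h> /* for uint64_t */
    #include <stddef.h> /* for size_t */

    /* Return the number of 8-bit radix passes needed for converted keys
     * key[0..len-1] when each pass buckets on (key[i] - min) rather than
     * key[i].  Subtracting the common minimum preserves the ordering.
     * Assumes len >= 1. */
    static unsigned radix_passes_needed(const uint64_t *key, size_t len,
                                        uint64_t *min_out)
    {
        uint64_t min = key[0], max = key[0], range;
        size_t   i;
        unsigned passes = 0;

        for (i = 1; i < len; i++) {
            if (key[i] < min) min = key[i];
            if (key[i] > max) max = key[i];
        }

        *min_out = min;
        for (range = max - min; range > 0; range >>= 8)
            passes++;

        return passes;  /* pass k buckets on ((key[i] - min) >> (8*k)) & 255 */
    }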

  7. #37
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    The pointers are the payload. ... If you have a CPU with larger caches, the crossover from other sort algorithms (O(N log N)) to radix sort (O(N)) may occur at even higher N.
    I created a new thread for radix sort with an example program.

  8. #38
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    I created a new thread for radix sort with an example program.
    Right. Back to the original topic.

    Quote Originally Posted by rcgldr View Post
    In my opinion, the main purpose of top down merge sort is as a learning tool used to teach recursion, similar to implementing factorial or fibonacci via recursion. It's not efficient, but it's useful for teaching recursion.
    This is just a bit too general to fully agree with.

    Merge sort is often used for linked lists. In cases where the lists are known to be almost sorted, the top down approach can easily and efficiently handle continuous runs, and only sort the parts that need sorting. This is also very straightforward to understand and implement.

    I don't see how a bottom-up approach would efficiently handle the continuous runs. A large part of its efficiency relies on the sublists having doubling lengths (except for the largest bin, which can be of any size); if any of the sublists in the bottom-up approach is longer than expected, the efficiency of the implementation suffers.

    If it means anything:

    In my opinion, top down merge sort has two main purposes: one, as a learning tool (linked lists, recursion); two, as a basis for linked list sorting variants where the list is known to be mostly sorted already. I believe the bottom-up merge sort is more efficient for randomly-ordered lists than the top down merge sort (or other linked-list sort methods), but I have not verified this.

  9. #39
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    In cases where the lists are known to be almost sorted, the top down approach can easily and efficiently handle continuous runs.
    Bottom up merge sort can also handle continuous runs, but there needs to be some method to keep track of run sizes. One suggestion I mentioned before was to use the least significant bit of the link pointer as an end-of-run indicator, since the nodes will be on 4-byte (or 8-byte) boundaries. Otherwise you'll need a pointer or a run count for every continuous run.
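    A quick sketch (untested) of the pointer-tag idea; the node layout and helper names are just placeholders:
    Code:
    #include <stdint.h> /* for uintptr_t */

    struct node {
        struct node *next;
        long         key;
    };

    /* Nodes are at least 4-byte aligned, so bit 0 of a next pointer is
     * normally 0 and can be borrowed as an "end of this sorted run" flag. */
    #define RUN_END ((uintptr_t)1)

    static struct node *next_node(const struct node *n)
    {   /* strip the tag bit to get the real successor */
        return (struct node *)((uintptr_t)n->next & ~RUN_END);
    }

    static int run_ends_here(const struct node *n)
    {   /* non-zero if this node is the last node of its run */
        return (int)((uintptr_t)n->next & RUN_END);
    }

    static void mark_run_end(struct node *n)
    {   /* set the tag bit without losing the successor pointer */
        n->next = (struct node *)((uintptr_t)n->next | RUN_END);
    }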

    For the top down merge, when would you check for continuous runs, during the split list phase? If so, then there's the potential of a lot of recursion if the run size is small compared to the list size.

  10. #40
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    Bottom up merge sort can also handle continuous runs, but there needs to some method to keep track of run sizes.
    Have you tested if it is worth the extra complexity?

    Quote Originally Posted by rcgldr View Post
    For the top down merge, when would you check for continuous runs, during the split list phase?
    Yes, usually by separating the initial segment of the list to be sorted, at each recursion level. Only the part of the list that follows the initial sorted part (either of which may be empty) is recursively sorted. An extra merge is done to merge the initial sorted part and the recursively sorted part.

    This can only reduce the recursion depth.

    You can avoid (most of) the extra merge(s) by using three-way or four-way merges when necessary, although this does make the recursive function more complicated.

    My description may not be good enough; would you like to see a simplified (non-optimized) example implementation?

  11. #41
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    Bottom up merge sort can also handle continuous runs, but there needs to some method to keep track of run sizes.
    Quote Originally Posted by Nominal Animal View Post
    Have you tested if it is worth the extra complexity?
    The code for this is included in sortp.zip (note that it includes a 10 MB random text test file, so it will take a few seconds to download), in msortp.cpp; change the define

    #define VARSZGRP 0

    to

    #define VARSZGRP 1

    to enable the natural merge sort method. The test file is random data, so it's not a good test for the natural sort. It should work with any text file, if you have a large text file to try; if not, I'll create one later. Also, msortp.cpp is based on old code, so it expects DOS / Windows style text files with '\r' '\n' line terminators. It only needs the '\n', but compares of records that are identical except that one is longer rely on the '\r' to create a difference (I could clean up this part of the compare to use the lengths). There's also a limit of 255 characters per line, since it replaces the '\n' with the record size in bytes.

    Quote Originally Posted by Nominal Animal View Post
    top down natural merge sort example ..
    It seems that the natural part of this sort isn't top down, so it becomes a hybrid sort.

    I don't see a benefit to using top down versus bottom up for any type of data if it's based on merge sort. If you want a top down algorithm, then use something like quick sort.
    Last edited by rcgldr; 08-19-2013 at 08:01 PM.

  12. #42
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    For the top down merge, when would you check for continuous runs, during the split list phase? If so, then there's the potential of a lot of recursion if the run size is small compared to the list size.
    Quote Originally Posted by Nominal Animal View Post
    Yes, usually by separating the initial segment of the list to be sorted, at each recursion level. Only the part of the list that follows the initial sorted part (either of which may be empty) is recursively sorted. An extra merge is done to merge the initial sorted part and the recursively sorted part.
    And where would this program store all the pointers, indexes, counts, ... to all those already-in-order sequences of elements? The idea of a top down merge sort is to create, update, and store that information on the stack, which gets back to my issue: either use recursion to store the parameter information for relatively short lists on the stack, creating a lot of recursion, or resort to what is essentially a bottom up method to store that information, at which point, in my opinion, it ceases to be a top down algorithm.

  13. #43
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    Have you tested if it is worth the extra complexity?
    I created a test text file, 1,048,576 lines, 20 characters per line (18 digits, cr, lf) with sequences of 8 to 39 ascending or descending numbers. It wasn't much faster than a standard merge sort. More pre-ordering of the file would help, but I only wanted partial ordering. Results on my system (note this is the time to sort pointers, not the data):

    standard merge sort - 156 ms, 17,809,794 compares
    natural merge sort  - 125 ms, 16,449,694 compares

    link to test source file, zip is 8MB, file is 20MB.

    http://rcgldr.net/misc/srcp20.zip
    Last edited by rcgldr; 08-20-2013 at 01:07 AM.

  14. #44
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    and where would this program store all the pointers, indexes, counts, ... to all those already in order sequences of elements?
    Remember, this is for linked lists.

    When the recursive function gets a list a[0], a[1], a[2], a[3], ..., a[N-2], a[N-1] to sort, the traditional top-down merge sort needs to first find the pointer to a[N/2] (and to a[N/2-1], to terminate the initial list).

    Instead, you can split the list into three parts: initial sorted prefix, left, and right.

    If a[0] > a[1], then the initial prefix is a single element long. Otherwise, it is multiple elements long. This costs essentially a single local pointer variable (one pointer on the stack in the call chain), because we must step through the list anyway to find the point where left ends and right starts.

    Let's assume that the first three items in the list are in ascending order, and that a[2] > a[3]. Then,
    prefix = a[0] a[1] a[2]
    left = a[3] a[4] .. a[H-1]
    right = a[H] a[H+1] .. a[N-1]
    H = 3 + ⌊(N-3)/2⌋

    The recursive call is done on both left and right, but all three lists are merged.

    The cost in recursion is one extra pointer on the stack per recursive call. The only other extra cost compared to the traditional top-down merge sort is the cost of using a three-way merge instead of a two-way merge. On most architectures, however, the real cost of a three-way merge is of the same order as a two-way merge.

    Because the recursion splits both left and right to less than half of the original length, the maximum depth is at most log2(N). Maximum depth also defines the stack usage. Compared to the traditional top-down version, this approach uses at most log2(N) extra pointers on the stack.

    This version does fewer recursive calls than the traditional version (or the same number of recursive calls, for the worst case).
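    In case the description above is not clear enough, here is a simplified (non-optimized, untested) sketch of the idea; for brevity it uses two ordinary two-way merges instead of a single three-way merge:
    Code:
    #include <stddef.h> /* for NULL */

    struct node {
        struct node *next;
        long         key;
    };

    /* Stable two-way merge; on equal keys the node from 'a' goes first.
     * Either list may be empty (NULL). */
    static struct node *merge(struct node *a, struct node *b)
    {
        struct node  head, *tail = &head;

        while (a && b) {
            if (a->key <= b->key) {
                tail->next = a;
                a = a->next;
            } else {
                tail->next = b;
                b = b->next;
            }
            tail = tail->next;
        }
        tail->next = a ? a : b;
        return head.next;
    }

    static struct node *natural_msort(struct node *list)
    {
        struct node *prefix, *prefix_tail, *left, *right, *slow, *fast;

        if (!list || !list->next)
            return list;

        /* 1. Peel off the already-sorted prefix (always at least one node). */
        prefix = prefix_tail = list;
        while (prefix_tail->next && prefix_tail->key <= prefix_tail->next->key)
            prefix_tail = prefix_tail->next;
        left = prefix_tail->next;
        prefix_tail->next = NULL;

        if (!left)
            return prefix;      /* the whole list was already in order */

        /* 2. Split the rest roughly in half with a slow/fast walk. */
        slow = left;
        fast = left->next;
        while (fast && fast->next) {
            slow = slow->next;
            fast = fast->next->next;
        }
        right = slow->next;
        slow->next = NULL;

        /* 3. Sort both halves recursively, then merge all three parts. */
        return merge(prefix, merge(natural_msort(left), natural_msort(right)));
    }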

    Are you sure you don't have preconceived notions about the limitations of the top-down approach?

    Quote Originally Posted by rcgldr View Post
    Results on my system
    So, roughly an 8% reduction in the number of compares, and a 20% decrease in the run time, for that particular data set. What kind of results do you get for fully random data?

    If there is no penalty for fully random data -- which obviously still contains at least some short sorted sequences -- I'd consider it worth the added code complexity.

  15. #45
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    Remember, this is for linked lists.
    My issue is not the overhead of 2 (or in this case 3) pointers per recursive call, it's having to repeatedly traverse at least half of an entire sub-list to find the mid-point.

    A standard top down merge sort ends up doing n-1 splits, and the total traversal work for the split operations is the number of levels of recursion times the number of elements, just to get to the merge operations. That's the equivalent of 20 full traversals of a list with 1 million elements just to do the split operations, in addition to the 20 traversals for the merge operations. A top down merge sort doubles the number of traversals of the data because of the unneeded split operations. Why not just start with the merge operations (bottom up)?

    Quote Originally Posted by Nominal Animal View Post
    So, roughly a 8% reduction in the number of compares, and 20% decrease in the run time, for that particular data set. What kind of results do you get for fully random data?
    The program I used to test this with (msortp.cpp) has the same overhead for standard merge and natural merge. In standard mode, it creates an array of n/2 run (group) counts, and in natural mode n/2 or fewer run counts, depending on the data (it looks for both ascending and descending runs). I need to create a basic merge sort for pointers that uses a run length variable rather than that run count array, which is a significant overhead in the early passes when the run size is small. So for the test program I have, the times for random data were about the same, but there were fewer compares even for random data, since even random data contains enough ascending or descending sequences longer than 2 to affect the compare count. Then again, both methods sort 1 million pointers to 10 byte records in less than 1/4 of a second, so there isn't much practical difference until the file size gets significantly larger than the test files I'm using.
    Last edited by rcgldr; 08-20-2013 at 09:21 AM.

