Thread: merge sort - top down versus bottom up

  1. #46
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    My issue is not the overhead of 2 or in this case 3 pointers per recursion call, it's having to repeatedly traverse at least 1/2 of an entire sub-list to find the mid-point.
    I understand.

    My point was that for some datasets, known to have long already-sorted sequences, it is simply worth it, compared to plain top-down or bottom-up, or to the code complexity of detecting sorted sequences efficiently in a bottom-up merge sort.

    Consider a list that is basically a concatenation of three sorted lists. Let's also assume that the second list is at least as long as the third list, for simplicity.

    The first list is detected as sorted in the first pass. The first recursive call detects the rest of the second list as sorted, and recurses into the third list. The two recursive calls at the next depth each detect their entire sublist as sorted. After the merges, all the data is sorted.

    This approach works extremely well when the input is mostly sorted. When the input has random sequences, the performance "degrades" gracefully to the top-down merge-sort level, while the code itself is very simple.
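    For concreteness, the sorted-prefix detection described above can be sketched like this (a minimal node type; the names are mine, not from any posted code):

```c
#include <stddef.h>

typedef struct node_st  node_t;
struct node_st {
    node_t  *next;
    long     key;
};

/* Return the length of the leading ascending run of list (0 if empty). */
size_t sorted_prefix_length(const node_t *list)
{
    size_t len = 0;

    while (list) {
        len++;
        if (!list->next || list->next->key < list->key)
            break;
        list = list->next;
    }
    return len;
}
```

    A single pass like this costs one comparison per node in the prefix, which is the overhead being weighed against the mid-point traversal above.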

    Overall, given a mix of almost-sorted data, with a few random sequences (and rarely a purely random list), I just believe this approach is valid. Sure, there are known faster methods, but the simplicity of the code makes this very attractive.

    Remember: complex code can have complex bugs, too. In many real world cases, maintainability is more important than a few percent of run time difference.

    Quote Originally Posted by rcgldr View Post
    Why not just start with the merge operations (bottom up)?
    Simply put, I am not convinced yet that there are good bottom-up variants for almost-sorted data.

    In particular, consider datasets that are (for whatever reason) usually a few sorted lists concatenated, maybe with a few random nodes sprinkled in, and very rarely completely random lists. (So, no hard limits, just usually almost sorted, very rarely random.)

    One approach is a mix, using an initial exploratory pass over the data to see if and where the input contains continuous sorted runs. If there are too many of them (as one can consider a single node a continuous run of length 1), you fall back to a bottom-up merge sort. (Is there an efficient method to detect unsorted runs? Then one could use bottom-up merge sort for the unsorted runs, and merge them with the sorted runs. Linked-list traversals are expensive, though, so I don't think so.)
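    A sketch of such an exploratory pass, under the same assumptions (illustrative code; a single node counts as a run of length 1):

```c
#include <stddef.h>

typedef struct node_st  node_t;
struct node_st {
    node_t  *next;
    long     key;
};

/* One traversal: count the ascending runs in the list. A caller could
 * fall back to a plain bottom-up merge sort when the count is close to
 * the node count, i.e. when the data is essentially unsorted. */
size_t count_sorted_runs(const node_t *list)
{
    size_t runs = 0;

    while (list) {
        runs++;                                /* a new run starts here */
        while (list->next && list->next->key >= list->key)
            list = list->next;                 /* extend the current run */
        list = list->next;                     /* step past the run end */
    }
    return runs;
}
```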

    What I'd like best to see is a theoretically sound method of merging a continuous run of sorted nodes (encountered during the bottom-up scan) into the bottom-up doubling-length list array.

    (I'm pretty sure there is one. You always know the length of the continuous run; perhaps splitting it into power of two lengths? Or even detecting continuous sorted nodes of power-of-two lengths, and merging them directly to the corresponding arrays? What about order stability?)
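    As for the power-of-two idea: the binary representation of the run length already gives one such split. A purely hypothetical helper (name and interface are mine) could decompose a known run length into power-of-two chunk lengths, largest first, which could then be merged into the bottom-up slots of matching rank:

```c
#include <stddef.h>

/* Decompose length into power-of-two chunk lengths, largest first;
 * e.g. 13 becomes 8, 4, 1. Returns the number of chunks written. */
size_t run_chunks(size_t length, size_t chunk[], size_t max_chunks)
{
    size_t count = 0;
    size_t bit   = ~(size_t)0 - (~(size_t)0 >> 1);  /* highest bit of size_t */

    for (; bit; bit >>= 1)
        if ((length & bit) && count < max_chunks)
            chunk[count++] = bit;
    return count;
}
```

    Whether merging the chunks largest-first or smallest-first preserves stability is exactly the open question above.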

    Please do note that I'm not claiming you are incorrect; I am just saying I am not convinced yet -- and I suspect that some other members reading this thread share my sentiment. You may very well be right, but I just need to see proof -- for me that means an example.

    I must also say that I much prefer seeing the salient code inline in this post; downloading, extracting, and opening a ZIP file from who knows where (okay, your personal site) is quite tedious, considering that we're just discussing interesting stuff, not doing in-depth research or actual work. At least I'm here for fun.

    A full ZIP-file test case would be a very nice addition, provided the salient part, the important function, were inline in the message here. Just my opinion, though.

  2. #47
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    Simply put, I am not convinced yet that there are good bottom-up variants for almost-sorted data.
    I assume you mean almost sorted data in a linked list. I've already done this for almost sorted data using arrays.

    If you don't care about a stable sort (maintaining the original order of "equal" records), then you don't need any run counts; the end of a run in a list or array is determined by a compare (when list->next->value < list->value, or array[i+1].value < array[i].value).

    For a bottom up merge sort for linked lists, you'd only need 2 input pointers and 2 output pointers. The initial pass traverses the original list for "runs" (sequences of ascending values), sets the next pointers at the end of runs to NULL, then attaches the runs to the 2 output list pointers, alternating between them. Then merge passes are performed. The input and output pointers are swapped, and runs from the 2 input list pointers are merged and alternately attached to the 2 output list pointers.

    For a stable sort with lists, the end of a run could be indicated by the least significant bit of the next pointer, since list nodes would be on a 4 byte or 8 byte boundary. For a stable sort with arrays, you need an array of run counts or some other method to determine the end of runs.
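    For illustration only (this is exactly the kind of pointer bit trick under discussion; it relies on node alignment, so it is not strictly portable, and the names are mine), low-bit tagging of the next pointer could look like this:

```c
#include <stdint.h>
#include <stddef.h>

typedef struct node_st  node_t;
struct node_st {
    node_t  *next;
    long     key;
};

/* Nodes are at least 4-byte aligned, so the low bit of a real node
 * address is always 0 and can carry an end-of-run flag. */
static node_t *mark_run_end(node_t *p)
{
    return (node_t *)((uintptr_t)p | (uintptr_t)1);
}

static node_t *strip_mark(node_t *p)
{
    return (node_t *)((uintptr_t)p & ~(uintptr_t)1);
}

static int is_run_end(const node_t *p)
{
    return (int)((uintptr_t)p & (uintptr_t)1);
}
```

    Every traversal then has to strip the mark before dereferencing, which is the maintainability cost raised later in this thread.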
    Last edited by rcgldr; 08-20-2013 at 03:26 PM.

  3. #48
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    I assume you mean almost sorted data in a linked list.
    Yes, exactly.

    Quote Originally Posted by rcgldr View Post
    For a bottom up merge sort for linked lists, you'd only need 2 input pointers and 2 output pointers. The initial pass traverses the original list for "runs" (sequences of ascending values), sets the next pointers at the end of runs to NULL, then attaches the runs to the 2 output list pointers, alternating between them. Then merge passes are performed.
    I think you are forgetting how important it is to overall efficiency that the merged lists are roughly the same size. (Or, to be more precise, that the "tree" is balanced. Just because no tree is explicitly created does not mean you can ignore it.)

    Theoretical arguments are fun, but reality always trumps theory.

    If you wish to sway minds, you need to show a practical comparison.

    The true test, in my opinion, is very simple: generate some linked lists (integer keys are fine) with tunable randomness, then compare the time taken to sort them using the three functions. I'm mostly interested in longish lists myself, say a nice million nodes; so I'd perhaps compose the list as N sorted lists concatenated, then test the different linked-list sorting approaches, measuring the run times for various values of N, from say two to a hundred, with emphasis on the low side. Then maybe see what happens if you add K random nodes at random places in the list.

    If you can show that your bottom-up merge sort -- implemented without pointer bit tricks; we want the code to be maintainable and portable -- performs better than a straightforward top-down merge sort with the continuous-run scan I outlined above, then I'm satisfied that you have a valid point. Otherwise, I won't buy it.

    Because you're the OP, it's your responsibility to prove your claims (assuming you want to). I can of course provide you a simple implementation of the continuous-run-scanning top-down merge sort, but I'm sure you can too, if you want.

    Please do remember that I'm not questioning you or your skills or capabilities; I'm only trying to poke holes in your argument to see if it holds true or not.

  4. #49
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    bottom up ... top down merge sort
    The operations done via recursion for a linked list involve the unneeded overhead of traversing the linked lists in order to split them in half. One example I saw traversed a list 1 + 1/2 times for each split operation, advancing one pointer two nodes at a time and the other pointer one node at a time, using the one-node-at-a-time pointer as the split point. There's no need to split lists for a merge sort. The basic idea of a merge sort is to merge shorter runs into longer runs, and there's no reason a merge sort can't start with the premise that run size is one, or that run size is variable, based on ascending (or descending) sequences.

    Again, I'm not aware of any library (one included with a compiler) that uses top down merge sort. Most of them use quick sort or bottom up merge sort (for stable sort). If there was an advantage to using a top down merge sort, then why don't any libraries use it? Microsoft / HP STL list::sort is a bottom up merge sort using an array of 23 pointers to lists. Microsoft / HP STL std::sort() is quick sort, and std::stable_sort() is a hybrid sort, using insertion sort to create runs of 32 elements (maybe a cache strategy), then switching to bottom up merge sort.
    Last edited by rcgldr; 08-20-2013 at 07:57 PM.

  5. #50
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    I think you are forgetting how important it is to the overall efficiency, that the merged lists are roughly the same size.
    Pre-ordered sub-lists will not be the same size. Also by using compare to detect the end of runs, multiple runs can end up treated as a single run if they happen to be in order. This creates an issue with stability.

    Quote Originally Posted by Nominal Animal View Post
    compare the time taken to sort them
    I wrote a "classic" style bottom up merge sort for list, using compare to detect the end of ascending sequences and treating those as runs. There are 2 source pointers, 2 destination start pointers, and 2 destination end pointers. The initial pass moves 1 run at time, alternating between the two destination pointers. Then for each merge pass, the source pointers are set to what were the destination pointers, the destination pointers set to NULL, and then 2 runs at a time, one from each source list, are merged, alternating between the two destination pointers. After reaching the end of one of the source lists, any remaining runs on the other source list are moved 1 run at time, continuing the alternating between the two destination pointers, which balances the run count on the two lists. At this point, the source lists are emptied and the next merge pass is started. The sort is complete when all the runs end up on one list.
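    To make the description concrete, here is a minimal sketch along those lines -- not rcgldr's actual code, and an unstable variant, since run ends are detected purely by compare as described:

```c
#include <stddef.h>

typedef struct node_st  node_t;
struct node_st {
    node_t  *next;
    long     key;
};

/* Detach the leading ascending run from *src; returns its NULL-terminated head. */
static node_t *take_run(node_t **src)
{
    node_t *head = *src, *end = head;

    if (!head)
        return NULL;
    while (end->next && end->next->key >= end->key)
        end = end->next;
    *src = end->next;
    end->next = NULL;
    return head;
}

/* Merge two NULL-terminated ascending runs into one. */
static node_t *merge_two(node_t *a, node_t *b)
{
    node_t  head, *tail = &head;

    while (a && b)
        if (a->key <= b->key) { tail->next = a; tail = a; a = a->next; }
        else                  { tail->next = b; tail = b; b = b->next; }
    tail->next = a ? a : b;
    return head.next;
}

/* Append a NULL-terminated run to a list (head = dst[0], tail = dst[1]). */
static void append_run(node_t *dst[2], node_t *run)
{
    if (!run)
        return;
    if (dst[0])
        dst[1]->next = run;
    else
        dst[0] = run;
    while (run->next)
        run = run->next;
    dst[1] = run;
}

/* Bottom-up merge sort: distribute runs over two lists (the first pass,
 * where the second input is empty), then merge one run from each input
 * per step, until a pass produces a single run. */
node_t *bottomup_runs_sort(node_t *list)
{
    node_t *in0 = list, *in1 = NULL;

    if (!list || !list->next)
        return list;

    for (;;) {
        node_t *out0[2] = { NULL, NULL };
        node_t *out1[2] = { NULL, NULL };
        size_t  runs = 0;

        while (in0 || in1) {
            node_t *merged = merge_two(take_run(&in0), take_run(&in1));
            append_run((runs & 1) ? out1 : out0, merged);
            runs++;
        }
        if (runs == 1)
            return out0[0];   /* everything ended up in one ascending run */
        in0 = out0[0];
        in1 = out1[0];
    }
}
```

    Note that adjacent runs that happen to be in order fuse when concatenated on an output list, which is exactly the stability issue mentioned above.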

    With the random data that I have been using, it takes about 1.35 seconds to sort (4*1024*1024) 64-bit unsigned integers, and about 4.7 seconds to sort (10*1024*1024) 64-bit unsigned integers, on my system, Intel 2600K 3.4 GHz, Windows XP (32-bit).
    Last edited by rcgldr; 08-21-2013 at 05:26 AM.

  6. #51
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    rcgldr, you make more arguments, but don't back them up with data or facts.

    I don't like that, so I decided to show you exactly how and why you are wrong.

    (Since I use Linux, you will need to adjust the following to compile for Windows, but please do, and do post the modified code; it should be easy to fix. Since I don't have Windows, I can't do that for you.)

    Here are the three sort functions I wanted to test. sort.c:
    Code:
    #include <stdlib.h>
    #include "sort.h"
    
    /*
     * Common helper functions.
    */
    
    static node_t *merge2(node_t *list1, node_t *list2)
    {
        node_t *root, *tail;
    
        if (!list2)
            return list1;
        if (!list1)
            return list2;
    
        if (list1->key <= list2->key) {
            root = tail = list1;
            list1 = list1->next;
        } else {
            root = tail = list2;
            list2 = list2->next;
        }
    
        while (list1 && list2)
            if (list1->key <= list2->key) {
                tail->next = list1;
                tail = list1;
                list1 = list1->next;
            } else {
                tail->next = list2;
                tail = list2;
                list2 = list2->next;
            }
    
        if (list1)
            tail->next = list1;
        else
        if (list2)
            tail->next = list2;
        else
            tail->next = (node_t *)0;
    
        return root;
    }
    
    static node_t *merge3(node_t *list1, node_t *list2, node_t *list3)
    {
        node_t *root, *tail;
    
    if (!list3)
        return merge2(list1, list2);
    if (!list2)
        return merge2(list1, list3);
    if (!list1)
        return merge2(list2, list3);
    
        if (list1->key <= list2->key) {
            if (list1->key <= list3->key) {
                root = tail = list1;
                list1 = list1->next;
            } else {
                root = tail = list3;
                list3 = list3->next;
            }
        } else {
            if (list2->key <= list3->key) {
                root = tail = list2;
                list2 = list2->next;
            } else {
                root = tail = list3;
                list3 = list3->next;
            }
        }
    
        while (list1 && list2 && list3)
            if (list1->key <= list2->key) {
                if (list1->key <= list3->key) {
                    tail->next = list1;
                    tail = list1;
                    list1 = list1->next;
                } else {
                    tail->next = list3;
                    tail = list3;
                    list3 = list3->next;
                }
            } else {
                if (list2->key <= list3->key) {
                    tail->next = list2;
                    tail = list2;
                    list2 = list2->next;
                } else {
                    tail->next = list3;
                    tail = list3;
                    list3 = list3->next;
                }
            }
    
        if (!list2) {
            list2 = list3;
            list3 = (node_t *)0;
        }
        if (!list1) {
            list1 = list2;
            list2 = list3;
            list3 = (node_t *)0;
        }
    
        while (list1 && list2)
            if (list1->key <= list2->key) {
                tail->next = list1;
                tail = list1;
                list1 = list1->next;
            } else {
                tail->next = list2;
                tail = list2;
                list2 = list2->next;
            }
    
        if (list1)
            tail->next = list1;
        else
        if (list2)
            tail->next = list2;
        else
            tail->next = (node_t *)0;
    
        return root;
    }
    
    /*
     * Traditional top-down merge sort.
    */
    
    node_t *do_topdown(node_t *list, const size_t size)
    {
        if (size > 1) {
            const size_t half = size / 2;
            size_t       n = half - 1;
            node_t      *temp = list;
            node_t      *tail;
    
            /* Find split point (node before second half). */
            while (n-->0)
                temp = temp->next;
    
            tail = temp->next;
            temp->next = (node_t *)0;
    
            if (half > 1)
                list = do_topdown(list, half);
    
            if (size - half > 1)
                tail = do_topdown(tail, size - half);
    
            return merge2(list, tail);
        } else
            return list;
    }
    
    node_t *topdown_sort(node_t *list)
    {
        node_t *head, *tail;
        size_t  size,  half = 0;
    
        /* Trivial list? */
        if (!list || !list->next)
            return list;
    
        /* Find list length and split point. */
        head = tail = list;
        while (tail && tail->next) {
            tail = tail->next->next;
            list = list->next;
            half++;
        }
        size = 2 * (half++);
        if (tail)
            size++;
    
        tail = list->next;
        list->next = (node_t *)0;
    
        return merge2(do_topdown(head, half), 
                      do_topdown(tail, size - half));
    }
    
    /*
     * Top-down sequence-scanning merge sort
    */
    
    node_t *do_nominal(node_t *list, size_t size)
    {
        if (size > 1) {
            node_t *prefix = list;
            node_t *left, *tail;
            size_t  prefixlen = 1;
            size_t  half;
    
            /* Find sorted prefix. */
            while (list->next && list->key <= list->next->key) {
                list = list->next;
                prefixlen++;
            }
    
            /* Already sorted? */
            if (!list->next)
                return prefix;
    
            left = tail = list->next;
            list->next = (node_t *)0;
            list = tail;
            size -= prefixlen;
    
            /* Single-node unsorted part? */
            if (size < 2)
                return merge2(prefix, left);
    
            /* Find split point. */
            half = size/2 - 1;
            while (half-->0)
                list = list->next;
    
            tail = list->next;
            half = size / 2;
            list->next = (node_t *)0;        
    
            /* Recurse and three-way merge. */
            return merge3(prefix, do_nominal(left, half), do_nominal(tail, size - half));
    
        } else
            return list;
    }
    
    node_t *nominal_sort(node_t *list)
    {
        node_t *prefix, *left, *tail;
        size_t  size, half;
    
        /* Trivial list? */
        if (!list || !list->next)
            return list;
    
        /* Find sorted prefix. */
        prefix = list;
        while (list->next && list->key <= list->next->key)
            list = list->next;
    
        /* Already sorted? */
        if (!list->next)
            return prefix;
    
        /* Detach unsorted part from list. */
        left = list->next;
        list->next = (node_t *)0;
        list = tail = left;
    
        /* Single-node unsorted part? */
        if (!list->next)
            return merge2(prefix, left);
    
        /* Find split point in unsorted part. */
        half = 0;
        while (tail && tail->next) {
            tail = tail->next->next;
            list = list->next;
            half++;
        }
        size = 2 * (half++);
        if (tail)
            size++;
        tail = list->next;
        list->next = (node_t *)0;
    
        /* Recurse and three-way merge. */
        return merge3(prefix, do_nominal(left, half), do_nominal(tail, size - half));
    }
    
    /*
     * Nonrecursive bottom-up merge sort using 32 pointers.
    */
    #ifndef PARTS
    #define PARTS 32
    #endif
    
    node_t *bottomup_sort(node_t *list)
    {
        node_t *part[PARTS] = { 0 };
        node_t *curr;
        size_t  i;
    
        while (list) {
    
            curr = list;
            list = list->next;
            curr->next = (node_t *)0;
    
            if (part[0]) {
                curr = merge2(part[0], curr);
    
                i = 1;
                while (i < PARTS - 1 && part[i]) {
                    curr = merge2(part[i], curr);
                    part[i-1] = (node_t *)0;
                    i++;
                }
                part[i-1] = (node_t *)0;
    
                if (!part[i])
                    part[i] = curr;
                else
                    part[i] = merge2(part[i], curr);
                
            } else
                part[0] = curr;
        }
    
        i = 0;
        while (i < PARTS - 1 && !part[i])
            i++;
    
        while (i < PARTS - 1) {
            if (part[i+1])
                part[i+1] = merge2(part[i+1], part[i]);
            else
                part[i+1] = part[i];
            i++;
        }
    
        return part[PARTS-1];
    }
    Above, topdown_sort() is a traditional top-down merge sort, nominal_sort() is the top-down merge sort that detects continuous runs and uses three-way merging, and bottomup_sort() is my bottom-up merge sort implementation.

    sort.h:
    Code:
    #ifndef   SORT_H
    #define   SORT_H
    
    #ifdef __cplusplus
    extern "C" {
    #endif
    
    typedef struct node_st  node_t;
    struct node_st {
        node_t  *next;
        long     key;
    };
    
    node_t  *topdown_sort(node_t *);
    node_t  *nominal_sort(node_t *);
    node_t *bottomup_sort(node_t *);
    
    #ifdef __cplusplus
    }
    #endif
    
    #endif /* SORT_H */
    To measure the time taken by each, I use timing.h:
    Code:
    #ifndef   TIMING_H
    #define   TIMING_H
    #ifdef __cplusplus
    extern "C" {
    #endif
    
    void timing_start(void);
    
    void timing_stop(double *const wall_seconds,
                     double *const cpu_seconds,
                     double *const cpu_cycles);
    
    #ifdef __cplusplus
    }
    #endif
    #endif /* TIMING_H */
    which on Linux and most POSIX systems can be implemented as timing.c:
    Code:
    #define _POSIX_C_SOURCE 199309L
    #include <time.h>
    
    #if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    
    static unsigned int  cycles_start[2];
    static struct timespec wall_start;
    static struct timespec  cpu_start;
    
    void timing_start(void)
    {
        clock_gettime(CLOCK_REALTIME, &wall_start);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_start);
        asm volatile ("rdtsc" : "=a" (cycles_start[0]), "=d" (cycles_start[1]));
    } 
    
    void timing_stop(double *const wall_seconds,
                     double *const cpu_seconds,
                     double *const cpu_cycles)
    {
        unsigned int  cycles_stop[2];
        struct timespec wall_stop;
        struct timespec  cpu_stop;
    
        asm volatile ("rdtsc" : "=a" (cycles_stop[0]), "=d" (cycles_stop[1]));
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_stop);
        clock_gettime(CLOCK_REALTIME, &wall_stop);
    
        if (cpu_cycles)
            *cpu_cycles = ((((long long)cycles_stop[1])  << 32) + (long long)cycles_stop[0])
                        - ((((long long)cycles_start[1]) << 32) + (long long)cycles_start[0]);
    
        if (wall_seconds)
            *wall_seconds = difftime(wall_stop.tv_sec, wall_start.tv_sec)
                          + (double)(wall_stop.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    
        if (cpu_seconds)
            *cpu_seconds = difftime(cpu_stop.tv_sec, cpu_start.tv_sec)
                         + (double)(cpu_stop.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
    }
    
    #else
    
    static struct timespec wall_start;
    static struct timespec  cpu_start;
    
    void timing_start(void)
    {
        clock_gettime(CLOCK_REALTIME, &wall_start);
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_start);
    }
    
    void timing_stop(double *const wall_seconds,
                     double *const cpu_seconds,
                     double *const cpu_cycles)
    {
        struct timespec wall_stop;
        struct timespec  cpu_stop;
    
        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &cpu_stop);
        clock_gettime(CLOCK_REALTIME, &wall_stop);
    
        if (cpu_cycles)
            *cpu_cycles = 0.0;
    
        if (wall_seconds)
            *wall_seconds = difftime(wall_stop.tv_sec, wall_start.tv_sec)
                          + (double)(wall_stop.tv_nsec - wall_start.tv_nsec) / 1000000000.0;
    
        if (cpu_seconds)
            *cpu_seconds = difftime(cpu_stop.tv_sec, cpu_start.tv_sec)
                         + (double)(cpu_stop.tv_nsec - cpu_start.tv_nsec) / 1000000000.0;
    }
    
    #endif
    The above uses the x86/x86-64 RDTSC to measure the CPU cycles if possible; otherwise it uses zero for the cycle counts.

    The benchmark.c:
    Code:
    #include <stdlib.h>
    #include <stdint.h>
    #include <string.h>
    #include <errno.h>
    #include <stdio.h>
    #include "timing.h"
    #include "sort.h"
    
    /* Xorshift random number generator.
    */
    static uint32_t xorshift_state[4] = {
        123456789U,
        362436069U,
        521288629U,
        88675123U
    };
    
    static int xorshift_setseed(const void *const data, const size_t length)
    {
        uint32_t state[4];
    
        if (length < 1)
            return ENOENT;
    
        if (length < sizeof state) {
            memset(state, 0, sizeof state);
            memcpy(state, data, length);
        } else
            memcpy(state, data, sizeof state);
    
        /* State cannot be all zeros, as that'd produce only zeros. */
        if (state[0] || state[1] || state[2] || state[3]) {
            memcpy(xorshift_state, state, sizeof state);
            return 0;
        }
    
        return EINVAL;
    }
    
    static uint32_t xorshift_u32(void)
    {
        const uint32_t temp = xorshift_state[0] ^ (xorshift_state[0] << 11U);
        xorshift_state[0] = xorshift_state[1];
        xorshift_state[1] = xorshift_state[2];
        xorshift_state[2] = xorshift_state[3];
        return xorshift_state[3] ^= (temp >> 8U) ^ temp ^ (xorshift_state[3] >> 19U);
    }
    
    static uint64_t xorshift_u64(void)
    {
        return ((uint64_t)xorshift_u32() << 32)
             |  (uint64_t)xorshift_u32();
    }
    
    static size_t xorshift_range(const size_t min, const size_t max)
    {
        if (max > min) {
            const size_t size = max - min;
            size_t       mask = ~0;
            size_t       temp;
    
            while (mask / (size_t)2 > size)
                mask /= (size_t)2;
    
            if (size < (size_t)4294967295U) {
                do {
                    temp = xorshift_u32() & mask;
                } while (temp > size);
                return temp + min;
            } else {
                do {
                    temp = xorshift_u64() & mask;
                } while (temp > size);
                return temp + min;
            }
        } else
            return min;
    }
    
    /* Given an array of keys,
     * dynamically allocate a single block, and fill with nodes as
     * the corresponding linked list.
    */
    node_t *linked_list(const int *const key, const size_t len)
    {
        node_t *array;
        size_t  i;
    
        if (len < 1) {
            errno = 0;
            return (node_t *)0;
        }
    
        array = malloc(len * sizeof array[0]);
        if (!array) {
            errno = ENOMEM;
            return (node_t *)0;
        }
    
        for (i = 0; i < len; i++) {
            array[i].next = &array[i+1];
            array[i].key  = key[i];
        }
    
        array[len-1].next = (node_t *)0;
    
        return array;
    }
    
    size_t sorted_length(const node_t *node)
    {
        size_t len = 1;
    
        if (!node)
            return 0;
    
        while (node->next) {
            if (node->key > node->next->key)
                return len;
            node = node->next;
            len++;
        }
    
        return len;
    }
    
    int parse_range(const char *const arg, size_t *const min, size_t *const max)
    {
        long val1, val2;
        char dummy;
    
        if (!arg || !*arg)
            return EINVAL;
    
        if (sscanf(arg, " %ld - %ld %c", &val1, &val2, &dummy) == 2 ||
            sscanf(arg, " %ld : %ld %c", &val1, &val2, &dummy) == 2 ||
            sscanf(arg, " %ld .. %ld %c", &val1, &val2, &dummy) == 2) {
            if (val1 <= val2) {
                if (min) *min = val1;
                if (max) *max = val2;
            } else {
                if (min) *min = val2;
                if (max) *max = val1;
            }
            return 0;
        }
    
        if (sscanf(arg, " %ld + %ld %c", &val1, &val2, &dummy) == 2) {
            if (val2 >= 0L) {
                if (min) *min = val1;
                if (max) *max = val1 + val2;
            } else {
                if (min) *min = val1 + val2;
                if (max) *max = val1;
            }
            return 0;
        }
    
        if (sscanf(arg, " %ld %c", &val1, &dummy) == 1) {
            if (min) *min = val1;
            if (max) *max = val1;
            return 0;
        }
    
        return EINVAL;
    }
    
    int compare_doubles(const void *ptr1, const void *ptr2)
    {
        const double val1 = *(const double *)ptr1;
        const double val2 = *(const double *)ptr2;
    
        if (val1 < val2)
            return -1;
        else
        if (val1 > val2)
            return +1;
        else
            return  0;
    }
    
    int main(int argc, char *argv[])
    {
        size_t  sublists, sublists_min, sublists_max;
        size_t  length, length_min, length_max;
        size_t  sprinkles, sprinkles_min, sprinkles_max;
        size_t  repeats;
    
        size_t  len, max_len;
        int    *key;
    
        double *topdown_wall,   *nominal_wall,   *bottomup_wall;
        double *topdown_cpu,    *nominal_cpu,    *bottomup_cpu;
        double *topdown_cycles, *nominal_cycles, *bottomup_cycles;
        size_t  topdown, nominal, bottomup;
    
        node_t *list, *sorted;
    
        size_t  i, n;
        int     one;
    
        if (argc < 5 || argc > 6 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
            fprintf(stderr, "\n");
            fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
            fprintf(stderr, "       %s REPEATS SUBLISTS LENGTH SPRINKLES [ SEED ]\n", argv[0]);
            fprintf(stderr, "Where:\n");
            fprintf(stderr, "       REPEATS     Number of operations measured\n");
            fprintf(stderr, "       SUBLISTS    Number of sorted sublists\n");
            fprintf(stderr, "       LENGTH      Length of each sublist\n");
            fprintf(stderr, "       SPRINKLES   Number of added random elements\n");
            fprintf(stderr, "       SEED        String used to seed the Xorshift PRNG\n");
            fprintf(stderr, "\n");
            fprintf(stderr, "This program generates a linked list with the above\n");
            fprintf(stderr, "randomness properties, then measures the time taken\n");
            fprintf(stderr, "to sort it using different sorting approaches.\n");
            fprintf(stderr, "\n");
            return 1;
        }
    
    if (argc >= 6)
        if (xorshift_setseed(argv[5], strlen(argv[5]))) {
            fprintf(stderr, "%s: Cannot use string as a random number seed.\n", argv[5]);
            return 1;
        }
    
        {
            long value;
            char dummy;
            if (sscanf(argv[1], " %ld %c", &value, &dummy) == 1 && value > 0L) 
                repeats = value;
            else {
                fprintf(stderr, "%s: Invalid number of sort operations to measure.\n", argv[1]);
                return 1;
            }
        }
        topdown_wall    = malloc(repeats * sizeof topdown_wall[0]);
        topdown_cpu     = malloc(repeats * sizeof topdown_cpu[0]);
        topdown_cycles  = malloc(repeats * sizeof topdown_cycles[0]);
        nominal_wall    = malloc(repeats * sizeof nominal_wall[0]);
        nominal_cpu     = malloc(repeats * sizeof nominal_cpu[0]);
        nominal_cycles  = malloc(repeats * sizeof nominal_cycles[0]);
        bottomup_wall   = malloc(repeats * sizeof bottomup_wall[0]);
        bottomup_cpu    = malloc(repeats * sizeof bottomup_cpu[0]);
        bottomup_cycles = malloc(repeats * sizeof bottomup_cycles[0]);
        if (!topdown_wall  || !topdown_cpu  || !topdown_cycles ||
            !nominal_wall  || !nominal_cpu  || !nominal_cycles ||
            !bottomup_wall || !bottomup_cpu || !bottomup_cycles) {
            fprintf(stderr, "Not enough memory: too many repeats (%lu).\n", (unsigned long)repeats);
            return 1;
        }
    
        if (parse_range(argv[2], &sublists_min, &sublists_max) || sublists_max < sublists_min) {
            fprintf(stderr, "%s: Invalid number of sublists.\n", argv[2]);
            return 1;
        }
    
        if (parse_range(argv[3], &length_min, &length_max) || length_max < length_min) {
            fprintf(stderr, "%s: Invalid sublist length.\n", argv[3]);
            return 1;
        }
    
        if (parse_range(argv[4], &sprinkles_min, &sprinkles_max) || sprinkles_max < sprinkles_min) {
            fprintf(stderr, "%s: Invalid number of added random elements.\n", argv[4]);
            return 1;
        }
    
        sublists = xorshift_range(sublists_min, sublists_max);
        sprinkles = xorshift_range(sprinkles_min, sprinkles_max);
    
        max_len = (size_t)sublists * (size_t)length_max + (size_t)sprinkles;
        len = 0;
    
        key = malloc(max_len * sizeof key[0]);
        if (!key) {
            fprintf(stderr, "Not enough memory (for %lu sublists and %lu sprinkles).\n",
                            (unsigned long)sublists, (unsigned long)sprinkles);
            return 1;
        }
    
        for (n = 0; n < sublists; n++) {
            length = xorshift_range(length_min, length_max);
            for (i = 0; i < length; i++)
                key[len++] = 256*i + (xorshift_u32() & 255);
        }
    
        for (n = 0; n < sprinkles; n++) {
            i = xorshift_range(0, len);
            if (i < len)
                memmove(key + i + 1, key + i, (len - i) * sizeof key[0]);
            key[i] = xorshift_u32();
            len++;
        }
    
        if (len < 2) {
            fprintf(stderr, "Nothing to sort.\n");
            return 1;
        }
    
        topdown  = 0;
        nominal  = 0;
        bottomup = 0;
    
        list = NULL;
    
        while (topdown < repeats || nominal < repeats || bottomup < repeats) {
            one = (xorshift_u32() >> 16) & 3;
            if (!one)
                continue;
    
            if (!list) {
                list = linked_list(key, len);
                if (!list) {
                    fprintf(stderr, "Out of memory!\n");
                    return 1;
                }
            }
    
            if (topdown < repeats && one == 1) {
    
                timing_start();
                sorted = topdown_sort(list);
                timing_stop(&topdown_wall[topdown], &topdown_cpu[topdown], &topdown_cycles[topdown]);
                if (sorted_length(sorted) != len) {
                    fprintf(stderr, "topdown_sort() failed.\n");
                    return 1;
                }
    
                topdown++;
                free(list);
                list = NULL;
    
            } else
            if (nominal < repeats && one == 2) {
    
                timing_start();
                sorted = nominal_sort(list);
                timing_stop(&nominal_wall[nominal], &nominal_cpu[nominal], &nominal_cycles[nominal]);
                if (sorted_length(sorted) != len) {
                    fprintf(stderr, "nominal_sort() failed.\n");
                    return 1;
                }
    
                nominal++;
                free(list);
                list = NULL;
    
            } else
            if (bottomup < repeats && one == 3) {
    
                timing_start();
                sorted = bottomup_sort(list);
                timing_stop(&bottomup_wall[bottomup], &bottomup_cpu[bottomup], &bottomup_cycles[bottomup]);
                if (sorted_length(sorted) != len) {
                    fprintf(stderr, "bottomup_sort() failed.\n");
                    return 1;
                }
    
                bottomup++;
                free(list);
                list = NULL;
            }
        }
    
        qsort(topdown_wall,    repeats, sizeof topdown_wall[0],    compare_doubles);
        qsort(topdown_cpu,     repeats, sizeof topdown_cpu[0],     compare_doubles);
        qsort(topdown_cycles,  repeats, sizeof topdown_cycles[0],  compare_doubles);
        qsort(nominal_wall,    repeats, sizeof nominal_wall[0],    compare_doubles);
        qsort(nominal_cpu,     repeats, sizeof nominal_cpu[0],     compare_doubles);
        qsort(nominal_cycles,  repeats, sizeof nominal_cycles[0],  compare_doubles);
        qsort(bottomup_wall,   repeats, sizeof bottomup_wall[0],   compare_doubles);
        qsort(bottomup_cpu,    repeats, sizeof bottomup_cpu[0],    compare_doubles);
        qsort(bottomup_cycles, repeats, sizeof bottomup_cycles[0], compare_doubles);
    
        fprintf(stderr, "List length:  %lu nodes\n", (unsigned long)len);
        fprintf(stderr, "Sublists:     %lu\n", (unsigned long)sublists);
        fprintf(stderr, "Random nodes: %lu\n", (unsigned long)sprinkles);
        fprintf(stderr, "\n");
    
        fprintf(stderr, "Traditional top-down merge sort:\n");
        fprintf(stderr, "\t%15.9f seconds wall time minimum\n", topdown_wall[0]);
        fprintf(stderr, "\t%15.9f seconds CPU time minimum\n", topdown_cpu[0]);
        if (topdown_cycles[0])
            fprintf(stderr, "\t%15.0f CPU cycles minimum\n", topdown_cycles[0]);
        fprintf(stderr, "\t%15.9f seconds wall time median\n", topdown_wall[repeats/2]);
        fprintf(stderr, "\t%15.9f seconds CPU time median\n", topdown_cpu[repeats/2]);
        if (topdown_cycles[repeats/2])
            fprintf(stderr, "\t%15.0f CPU cycles median\n", topdown_cycles[repeats/2]);
        fprintf(stderr, "\n");
    
        fprintf(stderr, "Top-down sorted-scanning merge sort:\n");
        fprintf(stderr, "\t%15.9f seconds wall time minimum\n", nominal_wall[0]);
        fprintf(stderr, "\t%15.9f seconds CPU time minimum\n", nominal_cpu[0]);
        if (nominal_cycles[0])
            fprintf(stderr, "\t%15.0f CPU cycles minimum\n", nominal_cycles[0]);
        fprintf(stderr, "\t%15.9f seconds wall time median\n", nominal_wall[repeats/2]);
        fprintf(stderr, "\t%15.9f seconds CPU time median\n", nominal_cpu[repeats/2]);
        if (nominal_cycles[repeats/2])
            fprintf(stderr, "\t%15.0f CPU cycles median\n", nominal_cycles[repeats/2]);
        fprintf(stderr, "\n");
    
        fprintf(stderr, "Bottom up merge sort:\n");
        fprintf(stderr, "\t%15.9f seconds wall time minimum\n", bottomup_wall[0]);
        fprintf(stderr, "\t%15.9f seconds CPU time minimum\n", bottomup_cpu[0]);
        if (bottomup_cycles[0])
            fprintf(stderr, "\t%15.0f CPU cycles minimum\n", bottomup_cycles[0]);
        fprintf(stderr, "\t%15.9f seconds wall time median\n", bottomup_wall[repeats/2]);
        fprintf(stderr, "\t%15.9f seconds CPU time median\n", bottomup_cpu[repeats/2]);
        if (bottomup_cycles[repeats/2])
            fprintf(stderr, "\t%15.0f CPU cycles median\n", bottomup_cycles[repeats/2]);
        fprintf(stderr, "\n");
    
        fflush(stderr);
    
        printf("%lu %lu %lu  ", (unsigned long)len, (unsigned long)sublists, (unsigned long)sprinkles);
        printf("%.9f %.9f %.0f ", topdown_wall[0], topdown_cpu[0], topdown_cycles[0]);
        printf("%.9f %.9f %.0f  ", topdown_wall[repeats/2], topdown_cpu[repeats/2], topdown_cycles[repeats/2]);
        printf("%.9f %.9f %.0f ", nominal_wall[0], nominal_cpu[0], nominal_cycles[0]);
        printf("%.9f %.9f %.0f  ", nominal_wall[repeats/2], nominal_cpu[repeats/2], nominal_cycles[repeats/2]);
        printf("%.9f %.9f %.0f ", bottomup_wall[0], bottomup_cpu[0], bottomup_cycles[0]);
        printf("%.9f %.9f %.0f\n", bottomup_wall[repeats/2], bottomup_cpu[repeats/2], bottomup_cycles[repeats/2]);
    
        return 0;
    }
    Finally, here is the Makefile I use to compile and run some test cases:
    Code:
    CC	:= gcc
    CFLAGS	:= -W -Wall -O3 -fomit-frame-pointer
    LD	:= $(CC)
    LDFLAGS := -lrt
    PROGS	:= benchmark
    
    .PHONY: all clean pure run
    
    all: clean benchmark pure run
    
    clean:
    	rm -f *.o $(PROGS)
    
    pure:
    	rm -f *.o
    
    %.o: %.c
    	$(CC) $(CFLAGS) -c $^
    
    benchmark: sort.o timing.o benchmark.o
    	$(LD) $^ $(LDFLAGS) -o $@
    
    run: benchmark
    	@rm -f results
    	./benchmark 100  5 2001 0
    	./benchmark 100 10 10000 10
    If you copy-paste the makefile, note that the indentation on the left must be a TAB, not spaces.

    The above runs two test cases, with 10005 and 100010 nodes. The benchmark regenerates the same list before each sort, runs the three sort functions in random order, and measures each sort a hundred times; only the minimum and median times are reported. (I trust the median times, myself.)

    The first test case is a 10005-element list composed of five 2001-element long sorted sublists. Median timings:
    Code:
    Traditional top-down merge sort:
            339 µs, 1015095 cycles
    
    Nominal Animal's top-down sorted-scanning merge sort:
            112 µs,  333738 cycles
    
    Bottom-up merge sort:
            228 µs,  680681 cycles
    The second test case is a 100010-element list composed of ten 10000-element long sorted sublists, with 10 random nodes thrown in somewhere. Median timings:
    Code:
    Traditional top-down merge sort:
           5360 µs, 16040680 cycles
    
    Nominal Animal's top-down sorted-scanning merge sort:
           3030 µs,  9061101 cycles
    
    Bottom-up merge sort:
           3330 µs,  9951478 cycles
    I can think of real-world use cases where such lists are typical. In the first case, the top-down variant with sorted-sequence scanning is twice as fast as the bottom-up variant (which lacks scanning). In the second case, the top-down variant is still faster, but only by about 9%. Still, if such datasets were typical, it would clearly make sense to use the top-down sorted-sequence scanning variant -- just like I originally stated.

    Furthermore, even where the top-down variant is slower than the bottom-up variant, it is not catastrophically so; it takes at most about twice the time the bottom-up variant does. If you consider the difficulty of fully grasping the scheme in the bottom-up variant, the slower run time may be offset by the much easier (and thus cheaper) maintenance.

    I didn't care enough to check whether these are stable, but it should be obvious that the top-down ones can be easily made stable, even with the sorted-sequence scanning.

    The above results fully back my assertion that there are cases where top-down merge sort approaches make sense, and are preferable over bottom-up merge sort variants.

    Your initial statement,
    Quote Originally Posted by rcgldr View Post
    [top down merge sort is] not efficient
    is therefore incorrect. The results show that there are sensible, real world datasets where a simple variant of the top down merge sort clearly beats a bottom-up approach.

    You could still prove your point by modifying or rewriting the above bottomup_sort() without pointer tricks or architecture-specific hacks (we want portable and maintainable code), in such a way that your variant consistently beats my top-down variant. It does not necessarily have to be stable, as long as it is clearly possible to make it so if someone cares enough; after all, I didn't bother to make sure my variant is stable, either.

    Note that your point (that there is no need for top-down merge sort variants at all) must hold for all datasets, whereas showing even one class of datasets where a top-down variant wins is enough for us to disprove it. So, be prepared to test your variant with multiple input parameters.

    If you don't bother, you're just insisting others listen to your unbacked opinions. That's worthless, as opinions are as free as the wind is. Opinions are only interesting if the context and basis can also be discussed. So, it's your turn to put your code where your argument is.

    If you want to, that is.

  7. #52
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    ... top down merge sort is ... not efficient
    Quote Originally Posted by Nominal Animal View Post
    example code, 3 programs
    To make this a fair comparison, you need a 4th program or you need to modify the bottom up merge. Change the lines in the bottom up merge to merge in one run at a time instead of one node at a time, and then compare the times.

    Code:
        while (list) {
    
            curr = list;                   /* change these lines to get a run */
            list = list->next;
            curr->next = (node_t *)0;

  8. #53
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    I made this change to bottom up sort to make it a scanning bottom up merge sort:
    Code:
    /* ... */
        node_t *next;
    /* ... */
        while (list) {
    
            curr = list;
            while((NULL != (next = list->next)) &&
                (list->key <= next->key))
                list = next;
            list->next = (node_t *)0;
            list = next;
    Results -- note that on Windows, the high performance counter is the CPU counter. 64-bit mode is a bit slower due to 64-bit versus 32-bit pointers, in spite of having more registers. The bottom up scanning merge sort was the fastest. I also did a run with a larger number of nodes in 32-bit mode.

    32 bit mode:
    Code:
    List length:  10005 nodes
    Sublists:     5
    Random nodes: 0
    
    Traditional top-down merge sort:
            0.000256861 seconds wall time minimum
            0.000323350 seconds wall time median
    
    Top-down sorted-scanning merge sort:
            0.000100361 seconds wall time minimum
            0.000166507 seconds wall time median
    
    Bottom up scanning merge sort:
            0.000089928 seconds wall time minimum
            0.000154779 seconds wall time median
    
    List length:  100010 nodes
    Sublists:     10
    Random nodes: 10
    
    Traditional top-down merge sort:
            0.003316332 seconds wall time minimum
            0.003404675 seconds wall time median
    
    Top-down sorted-scanning merge sort:
            0.001669866 seconds wall time minimum
            0.001754645 seconds wall time median
    
    Bottom up scanning merge sort:
            0.001358650 seconds wall time minimum
            0.001428575 seconds wall time median
    64 bit mode
    Code:
    List length:  10005 nodes
    Sublists:     5
    Random nodes: 0
    
    Traditional top-down merge sort:
            0.000245304 seconds wall time minimum
            0.000293986 seconds wall time median
    
    Top-down sorted-scanning merge sort:
            0.000106004 seconds wall time minimum
            0.000154309 seconds wall time median
    
    Bottom up scanning merge sort:
            0.000098030 seconds wall time minimum
            0.000148743 seconds wall time median
    
    List length:  100010 nodes
    Sublists:     10
    Random nodes: 10
    
    Traditional top-down merge sort:
            0.003085288 seconds wall time minimum
            0.003153425 seconds wall time median
    
    Top-down sorted-scanning merge sort:
            0.001671922 seconds wall time minimum
            0.001733476 seconds wall time median
    
    Bottom up scanning merge sort:
            0.001470464 seconds wall time minimum
            0.001486737 seconds wall time median
    32 bit mode larger list:
    Code:
    List length:  10000010 nodes
    Sublists:     100
    Random nodes: 10
    
    Traditional top-down merge sort:
            0.634542662 seconds wall time minimum
            0.634778174 seconds wall time median
    
    Top-down sorted-scanning merge sort:
            0.416188170 seconds wall time minimum
            0.417621510 seconds wall time median
    
    Bottom up scanning merge sort:
            0.361863343 seconds wall time minimum
            0.362328064 seconds wall time median
    I got some warnings, probably from code that deals with 32-bit or 64-bit modes:

    In 32 bit mode I get warning for size_t conversion in benchmark.c line 75
    Code:
    /* ... */
                    temp = xorshift_u64() & mask;
    In 32 or 64 bit mode I get warning for size_t conversion in benchmark.c line 295
    Code:
    /* ... */
                key[len++] = 256*i + (xorshift_u32() & 255);
    I changed timing.c for VS2005 + windows:
    Code:
    #include <math.h>
    #include <windows.h>                    /* for timer stuff */
    
    typedef LARGE_INTEGER LI64;
    
    static LI64     liQPFrequency;
    static LI64     liStartTime;
    static LI64     liStopTime;
    static double   dQPFrequency;
    static double   dStartTime;
    static double   dStopTime;
    static double   dElapsedTime;
    
    void timing_start(void)
    {
        QueryPerformanceFrequency(&liQPFrequency);
        dQPFrequency = ((double)liQPFrequency.HighPart)*pow(2.,32.) +
                       ((double)liQPFrequency.LowPart );
    
        Sleep(1);
        QueryPerformanceCounter(&liStartTime);
    } 
    
    void timing_stop(double *const wall_seconds,
                     double *const cpu_seconds,
                     double *const cpu_cycles)
    {
        QueryPerformanceCounter(&liStopTime);
    
        dStartTime = ((double)liStartTime.HighPart)*pow(2.,32.) +
                     ((double)liStartTime.LowPart );
        dStopTime  = ((double)liStopTime.HighPart)*pow(2.,32.) +
                     ((double)liStopTime.LowPart );
        dElapsedTime = (dStopTime - dStartTime) / dQPFrequency;
    
        if (cpu_cycles)
            *cpu_cycles = 0.;
    
        if (wall_seconds)
            *wall_seconds = dElapsedTime;
    
        if (cpu_seconds)
            *cpu_seconds = 0.;
    }
    Last edited by rcgldr; 08-21-2013 at 01:24 PM.

  9. #54
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    The warnings are harmless, and can be silenced with explicit casts to the target variable types.

    Quote Originally Posted by rcgldr View Post
    I made this change
    Using very similar changes (adding temp and four lines of code at the start of the while loop),
    Code:
    node_t *bottomup_sort(node_t *list)
    {
        node_t *part[PARTS] = { 0 };
        node_t *curr, *temp;
        size_t  i;
    
        while (list) {
    
            curr = list;
            while (list->next && list->key <= list->next->key)
                list = list->next;
    
            temp = list->next;
            list->next = (node_t *)0;
            list = temp;
    
            if (part[0]) {
                curr = merge2(part[0], curr);
    
                i = 1;
                while (i < PARTS - 1 && part[i]) {
                    curr = merge2(part[i], curr);
                    part[i-1] = (node_t *)0;
                    i++;
                }
                part[i-1] = (node_t *)0;
    
                if (!part[i])
                    part[i] = curr;
                else
                    part[i] = merge2(part[i], curr);
                
            } else
                part[0] = curr;
        }
    
        i = 0;
        while (i < PARTS - 1 && !part[i])
            i++;
    
        while (i < PARTS - 1) {
            if (part[i+1])
                part[i+1] = merge2(part[i+1], part[i]);
            else
                part[i+1] = part[i];
            i++;
        }
    
        return part[PARTS-1];
    }
    all three sorts seem to be stable (according to tests I added to my copy of benchmark.c -- the list pointers are assigned in increasing addresses initially, so they can be checked when keys are equal).

    Yet there is still a class of inputs for which the top-down scanning sort is faster.

    If there are about 500 sublists, each with 50 to 100 sorted nodes, and about 1% of random nodes sprinkled in, I get
    Code:
    ./benchmark 100 500 50-100 28
    Constructed a list with 37582 nodes, 500 sublists, 28 random nodes.
    List length:  37582 nodes
    Sublists:     500
    Random nodes: 28
    
    Traditional top-down stable merge sort:
    	    0.003297334 seconds wall time minimum
    	    0.003294635 seconds CPU time minimum
    	        9859830 CPU cycles minimum
    	    0.003338460 seconds wall time median
    	    0.003335976 seconds CPU time median
    	        9982739 CPU cycles median
    
    Top-down stable sorted-scanning merge sort:
    	    0.003141621 seconds wall time minimum
    	    0.003139022 seconds CPU time minimum
    	        9394353 CPU cycles minimum
    	    0.003179516 seconds wall time median
    	    0.003177121 seconds CPU time median
    	        9507317 CPU cycles median
    
    Bottom up stable merge sort:
    	    0.003297222 seconds wall time minimum
    	    0.003294462 seconds CPU time minimum
    	        9859129 CPU cycles minimum
    	    0.003334290 seconds wall time median
    	    0.003331613 seconds CPU time median
    	        9970504 CPU cycles median
    
    37582 500 28  0.003335976  0.003177121  0.003331613
    The difference in median times is about five percent in that exact case.
    (This inversion point is heavily CPU-dependent, so it'll occur with different parameters on your CPU.)

    I am fully aware that for the vast majority of all possible random lists, the bottom-up approach is faster. It just does not matter: I've already agreed that if there is no information on the sortedness or other properties of a linked list, bottom-up merge sort is my choice.

    My point is: because there are linked list inputs where the top-down variant is faster, your assertion that a bottom-up merge sort is always better does not hold.

    If you want to prove your assertion that "top-down merge sort is inefficient", then you need to prove it really is. In other words, show a bottom-up merge sort implementation that is always (aside from some singular specific inputs) at least as efficient as the top-down approach.



    To me, the interesting bit is why the bottom-up approach behaves poorly when compared to the top-down approach, for specific input properties.

    For many changes in input randomness, the efficiency of the top-down and bottom-up approaches changes in similar ways. For some -- the above is not the only one, and the effect probably depends on CPU features, too -- we see these surprising inversions. They are not specific to one exact set of parameters, but occur in a small domain; so, it does not look like just a random occurrence.

    To see this, you'll need to run a large number of tests with a wide range of input parameters, and look for interesting cases. When you see one, run another batch of tests around the interesting parameters, to see if it is just a random effect, and not a tendency. (Just having one input parameter set does not prove anything, because there are so many factors that affect the run time; but when you can vary the parameters and still get a similar result, it's not just a random occurrence.)

    I wonder, is there a way to counteract that poor behaviour? I suspect it is an effect of having a longish sorted run merged with a nearly full "virtual tree" (list array), causing a cascade of inefficient merges to occur.

    Since the bottom-up version is nonrecursive, adding a bit of state (to count the length of the sorted input run) is virtually free. If the sorted input runs are long enough, some sort of procedure on how to merge it to the "virtual tree" (list array) might work around this wart.

    Furthermore, it might be possible to augment the bottom-up sort to not only detect sorted runs, but reversed runs also. One needs to prepend instead of append, and use < instead of <= for the comparisons, to keep stability. But, it might make efficient merging even more complicated...

    Interesting stuff.

  10. #55
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    Using very similar changes (adding temp and four lines of code at the start of the while loop),
    Code:
    /* ... */
        node_t *temp;
        while (list) {
            curr = list;
            while (list->next && list->key <= list->next->key)
                list = list->next;
    I'm never sure when it's better to let the compiler handle the three instances of list->next, or to set a temp variable to ->next like I did. It probably doesn't make much difference.

    Quote Originally Posted by Nominal Animal View Post
    all three sorts seem to be stable.
    Instability only occurs if compares are used to determine the end of runs after the initial pass that creates the runs. Even then, it only occurs when 2 or more separate runs get treated as one run during the merge phase.

    Quote Originally Posted by Nominal Animal View Post
    Yet, there is still a class of inputs for which the top-down scanning sort is still faster.
    First note that your top down scanning sort is using a 3 way merge, which can be faster than a 2 way merge; if the list is relatively small and/or much of it consists of runs, that can happen. The overhead of a standard top down sort is 1 1/2 traversals of the list (plus the overhead of recursion). By adding a scan for runs before the list splitting, you've created a hybrid sort where the 1 1/2 traversals and the recursion apply only to the non-run portions of the list, so only part of the list goes through the top down list splitting process that I consider unneeded.

    Again, note that my issue is with a standard top down merge sort, which recurses to repeatedly split a list or array down to a size of 2 or 1 before any merging takes place, even though the list splitting process isn't needed. A bottom up merge sort skips the splitting process and immediately starts with the merging.

    Quote Originally Posted by Nominal Animal View Post
    it might be possible to augment the bottom-up sort to not only detect sorted runs, but reversed runs also
    That msortp.cpp I wrote does this on the initial pass, but it's sorting an array of pointers, and there's not a lot of overhead in reversing a run of pointers.
    Last edited by rcgldr; 08-21-2013 at 04:26 PM.

  11. #56
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Getting back to my original topic, my issue with top down merge sort is the more common case of sorting arrays (arrays of pointers, vectors, ... ). From an earlier post:

    Quote Originally Posted by rcgldr View Post
    In the case of arrays, there are extra copy operations after each merge, or a check to see if a copy is needed before a merge, which may require a copy before a merge. Take a look at this cprogramming.com top down merge sort example for an array that implements the recursive splits using indices:

    Merge Sort in C++ (Source Code) - Cprogramming.com

    In this version, the data has to be copied back to the original array after every merge, because with a depth-first process there is no easy way to determine which of the two buffers holds a given sub-array when the number of elements to be sorted is not a power of 2.
    The issue is that the recursive splitting process is unneeded, since a merge sort can skip it and merge using iteration to advance indices or pointers. In addition, an iterator / pointer top down approach like the one above either has to copy back after every merge, doubling the transfer overhead, or has to copy data to produce sub-arrays during the split phase.

    - - -

    Back to iMalc's issues about merge sort with linked lists:

    Quote Originally Posted by iMalc View Post
    What I'm trying to get to the bottom of is that you seem to claim is that top-down merge sort for a list linked, is less efficient than a bottom-up merge sort for a linked list, when the bottom-up merge sort is done in O(1) extra space ... only one list. ... So perhaps you have simply never been talking about a bottom-up merge-sort for a linked list.
    No one else mentioned O(1) extra space.

    natural merge sort with linked lists
    An example using a fixed size array of list pointers has already been posted. Here's an example of a "classic" natural merge sort for linked lists, using compares to determine run boundaries; compared to the array method it is slower, not stable, and more complex. During each merge pass, runs from two input lists are merged onto two alternating output lists, then the direction is reversed and the process repeated until there is just one list. When the end of an input list is reached, any remaining runs on the other list are alternately appended to the two output lists to keep the run count balanced.

    Code:
    /*----------------------------------------------------------------------*/
    /*      nsortl.c        natural list sort                               */
    /*----------------------------------------------------------------------*/
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    #define NUMNODES ( 4*1024*1024) /* number of nodes */
    
    typedef unsigned long long UI64;
    
    typedef struct _NODE{
    struct _NODE * next;            /* must be first member of NODE */
    UI64 data;
    }NODE;
    
    /*----------------------------------------------------------------------*/
    /*      data                                                            */
    /*----------------------------------------------------------------------*/
    static clock_t dwTimeStart;     /* clock values */
    static clock_t dwTimeStop;
    
    static FILE * fpDst;            /* dst (output) file */
    
    /*----------------------------------------------------------------------*/
    /*      code                                                            */
    /*----------------------------------------------------------------------*/
    NODE * SortList(NODE *pList);
    int    Move1Run(NODE **ppDst, NODE **ppEnd, NODE **ppSrc);
    int    Merge2Runs(NODE **ppDst, NODE **ppEnd, NODE **ppSrc0, NODE **ppSrc1);
    NODE * Adv2EndRun(NODE *pSrc);
    NODE * CreateList(int count);
    
    /*----------------------------------------------------------------------*/
    /*      main                                                            */
    /*----------------------------------------------------------------------*/
    int main(void)
    {
    NODE *pList;
    void *pMem;
    
        if(NULL == (pList = CreateList(NUMNODES)))  /* create list */
            return(0);
        pMem = pList;
    
        dwTimeStart = clock();
    
        pList = SortList(pList);                    /* sort list */
    
        dwTimeStop = clock();
        printf("Number of ticks %ld\n", (long)(dwTimeStop - dwTimeStart));
    
        fpDst = fopen("dstl.bin", "wb");            /* write data */
        if(fpDst == NULL){
            goto exit0;}
        while(1){
            fwrite(&(pList->data), 1, sizeof(UI64), fpDst);
            if(NULL == (pList = pList->next))
                break;
        }
        fclose(fpDst);
    
    exit0:
        free(pMem);                                 /* free memory */
        return(0);
    }
     
    /*----------------------------------------------------------------------*/
    /*      SortList                                                        */
    /*----------------------------------------------------------------------*/
    NODE * SortList(NODE *pList)
    {
    NODE *pSrc0;                            /* src ptrs */
    NODE *pSrc1;
    NODE *pDst0 = NULL;                     /* dst ptrs */
    NODE *pDst1 = NULL;
    NODE *pEnd0;                            /* dst end ptrs */
    NODE *pEnd1;
    
        if(pList == NULL)                   /* check for null ptr */
            return(NULL);
    
        pSrc0 = pList;                      /* split up runs into 2 lists */
        while(1){
            if(Move1Run(&pDst0, &pEnd0, &pSrc0))
                break;
            if(Move1Run(&pDst1, &pEnd1, &pSrc0))
                break;
        }
        while(1){                           /* merge pass */
            pSrc0 = pDst0;                  /* switch merge direction */
            pSrc1 = pDst1;
            pDst0 = NULL;
            pDst1 = NULL;
            while(1){                       /* merge runs */
                if(Merge2Runs(&pDst0, &pEnd0, &pSrc0, &pSrc1)){
                    if(pSrc0 == NULL)       /* split up any remaining runs */
                        pSrc0 = pSrc1;
                    while(1){
                        if(Move1Run(&pDst1, &pEnd1, &pSrc0))
                            break;
                        if(Move1Run(&pDst0, &pEnd0, &pSrc0))
                            break;
                    }
                    break;
                }
                if(Merge2Runs(&pDst1, &pEnd1, &pSrc0, &pSrc1)){
                    if(pSrc0 == NULL)       /* split up any remaining runs */
                        pSrc0 = pSrc1;
                    while(1){
                        if(Move1Run(&pDst0, &pEnd0, &pSrc0))
                            break;
                        if(Move1Run(&pDst1, &pEnd1, &pSrc0))
                            break;
                    }
                    break;
                }
            }
            if(pDst1 == NULL)               /* if done break */
                break;
        }
        return(pDst0);                      /* return ptr to sorted list */
    }       
    
    /*----------------------------------------------------------------------*/
    /*      Move1Run                                                        */
    /*      return 0 if end run, 1 if end data                              */
    /*----------------------------------------------------------------------*/
    int Move1Run(NODE **ppDst, NODE **ppEnd, NODE **ppSrc)
    {
    NODE *pSrc;
    NODE *pEnd;
    
        pSrc = *ppSrc;                      /* get src ptr */
        if(NULL == pSrc)                    /* return if empty list */
            return(1);
        if(NULL == *ppDst)                  /* init dst/end ptr */
            pEnd = (NODE *)ppDst;           /* treat dst ptr as a node; requires next to be 1st member of NODE */
        else
            pEnd = *ppEnd;
        pEnd->next = pSrc;
        pEnd = Adv2EndRun(pSrc);            /* advance to end run */
        pSrc = pEnd->next;
        pEnd->next = NULL;
        *ppEnd = pEnd;                      /* update ptrs */
        *ppSrc = pSrc;
        if(pSrc == NULL)
            return(1);
        return(0);
    }
    
    /*----------------------------------------------------------------------*/
    /*      Merge2Runs                                                      */
    /*      return 0 if end runs, 1 if end data                             */
    /*----------------------------------------------------------------------*/
    int Merge2Runs(NODE **ppDst, NODE **ppEnd, NODE **ppSrc0, NODE **ppSrc1)
    {
    NODE *pEnd;
    NODE *pSrc0;
    NODE *pSrc1;
    
        pSrc0 = *ppSrc0;
        pSrc1 = *ppSrc1;
        if(NULL == *ppDst)                  /* init dst/end ptr */
            pEnd = (NODE *)ppDst;           /* treat dst ptr as a node; requires next to be 1st member of NODE */
        else
            pEnd = *ppEnd;
        if(pSrc0 == NULL || pSrc1 == NULL)  /* return if end of data */
            return(1);
        while(1){
            if(pSrc1->data < pSrc0->data){  /* if src1 < src0 */
                pEnd->next = pSrc1;         /*   append src1 */
                pEnd = pSrc1;               /*   advance end */
                pSrc1 = pSrc1->next;        /*   advance src1 */
                if((pSrc1 == NULL) ||       /*   if end src1 run */
                   (pEnd->data > pSrc1->data)){
                    pEnd->next = pSrc0;     /*     copy src0 run */
                    pEnd = Adv2EndRun(pSrc0);
                    pSrc0 = pEnd->next;     /*     update ptrs */
                    pEnd->next = NULL;
                    *ppEnd = pEnd;
                    *ppSrc0 = pSrc0;
                    *ppSrc1 = pSrc1;
                    if(pSrc1 == NULL || pSrc0 == NULL)
                        return(1);
                    return(0);
                }
            } else {                        /* src0 <= src1 */
                pEnd->next = pSrc0;         /*   append src0 */
                pEnd = pSrc0;               /*   advance end */
                pSrc0 = pSrc0->next;        /*   advance src0 */
                if((pSrc0 == NULL) ||       /*   if end src0 run */
                   (pEnd->data > pSrc0->data)){
                    pEnd->next = pSrc1;     /*     copy src1 run */
                    pEnd = Adv2EndRun(pSrc1);
                    pSrc1 = pEnd->next;     /*     update ptrs */
                    pEnd->next = NULL;
                    *ppEnd = pEnd;
                    *ppSrc1 = pSrc1;
                    *ppSrc0 = pSrc0;
                    if(pSrc0 == NULL || pSrc1 == NULL)
                        return(1);
                    return(0);
                }
            }
        }
    }
    
    /*----------------------------------------------------------------------*/
    /*      Adv2EndRun  advance to end run                                  */
    /*----------------------------------------------------------------------*/
    NODE * Adv2EndRun(NODE *pSrc)
    {
    NODE *pNext;
        if(pSrc == NULL)                    /* if null ptr return */
            return(NULL);
        while(1){                           /* advance to end run */
            pNext = pSrc->next;
            if((pNext == NULL) || (pSrc->data > pNext->data))
                break;
            pSrc = pNext;
        }
        return(pSrc);
    }
    
    /*----------------------------------------------------------------------*/
    /*      create list     used for test                                   */
    /*----------------------------------------------------------------------*/
    NODE * CreateList(int count)
    {
    void *pMem;
    NODE *pNode;
    int i;
    UI64 r;
    
        /* allocate space for nodes */
        pMem = malloc(count * sizeof(NODE));
        if(pMem == NULL)
            return NULL;
        pNode = (NODE *)pMem;
        for(i = 0; i < count; i++){
            r  = (((UI64)((rand()>>4) & 0xff))<< 0);
            r += (((UI64)((rand()>>4) & 0xff))<< 8);
            r += (((UI64)((rand()>>4) & 0xff))<<16);
            r += (((UI64)((rand()>>4) & 0xff))<<24);
            r += (((UI64)((rand()>>4) & 0xff))<<32);
            r += (((UI64)((rand()>>4) & 0xff))<<40);
            r += (((UI64)((rand()>>4) & 0xff))<<48);
            r += (((UI64)((rand()>>4) & 0xff))<<56);
            pNode->data = r;
            pNode->next = pNode+1;
            pNode++;
        }
        (--pNode)->next = NULL;
        return(pMem);
    }
    Last edited by rcgldr; 08-21-2013 at 06:20 PM.

  12. #57
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    example of "classic" natural merge sort for linked lists, using compares to determine run boundaries ...
    I added the "classic" bottom-up merge sort to the benchmark. Its times were about the same as those of the bottom-up scanning merge sort.

    Code:
    List length:  10005 nodes
    Sublists:     5
    Random nodes: 0
    
    Traditional top-down merge sort:
                0.000256722 seconds wall time minimum
                0.000324038 seconds wall time median
    
    Top-down sorted-scanning merge sort:
                0.000100165 seconds wall time minimum
                0.000168969 seconds wall time median
    
    Bottom up scanning merge sort:
                0.000090025 seconds wall time minimum
                0.000154366 seconds wall time median
    
    Classic bottom up scanning merge sort:
                0.000094073 seconds wall time minimum
                0.000151362 seconds wall time median
    
    List length:  100010 nodes
    Sublists:     10
    Random nodes: 10
    
    Traditional top-down merge sort:
                0.003311198 seconds wall time minimum
                0.003401286 seconds wall time median
    
    Top-down sorted-scanning merge sort:
                0.001685378 seconds wall time minimum
                0.001751138 seconds wall time median
    
    Bottom up scanning merge sort:
                0.001382265 seconds wall time minimum
                0.001429571 seconds wall time median
    
    Classic bottom up scanning merge sort:
                0.001395749 seconds wall time minimum
                0.001460691 seconds wall time median
    
    List length:  10000010 nodes
    Sublists:     100
    Random nodes: 10
    
    Traditional top-down merge sort:
                0.631818718 seconds wall time minimum
                0.633395153 seconds wall time median
    
    Top-down sorted-scanning merge sort:
                0.416824896 seconds wall time minimum
                0.417633608 seconds wall time median
    
    Bottom up scanning merge sort:
                0.362000177 seconds wall time minimum
                0.362691557 seconds wall time median
    
    Classic bottom up scanning merge sort:
                0.351972112 seconds wall time minimum
                0.352416840 seconds wall time median
    Last edited by rcgldr; 08-22-2013 at 01:39 AM.

  13. #58
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by rcgldr View Post
    Getting back to my original topic, my issue with top down merge sort is the more common case of sorting arrays (arrays of pointers, vectors, ... ). From an earlier post:

    The issue is that the recursive splitting process is unneeded, since a merge sort can skip the splitting process and merge using iteration to advance indices or pointers. In addition, an iterator / pointer top down approach like the one above has to copy back after every merge, doubling the transfer overhead, or it copies data to produce sub-arrays during the split phase.
    Ah yes, let's move on to the array version. I believe you haven't seen the "merge across then back" top-down technique.
    Take a look at the Merge Sort I have here: Array Sorting
    There is no copy-back step in that. It's essentially a 4-way merge; two-way in one direction then two-way back again.
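    The "merge across then back" idea can be sketched as follows. This is a hedged sketch with hypothetical helper names and `int` elements for brevity, not iMalc's actual implementation: each recursion level swaps the roles of the two buffers, so one merge goes "across" into the scratch buffer and the next comes "back" into the original, leaving a single up-front copy in place of a copy-back after every merge.

```c
#include <string.h>

/* merge sorted runs src[lo..mid) and src[mid..hi) into dst[lo..hi) */
static void merge(const int *src, size_t lo, size_t mid, size_t hi, int *dst)
{
    size_t i = lo, j = mid, k;
    for (k = lo; k < hi; k++) {
        if (i < mid && (j >= hi || src[i] <= src[j]))
            dst[k] = src[i++];
        else
            dst[k] = src[j++];
    }
}

/* sort src[lo..hi) so the result lands in dst[lo..hi); the two
   buffers start out identical, and each level swaps their roles */
static void split_merge(int *src, size_t lo, size_t hi, int *dst)
{
    if (hi - lo < 2)                  /* length-1 run: dst already holds it */
        return;
    size_t mid = lo + (hi - lo) / 2;
    split_merge(dst, lo, mid, src);   /* sort the halves into src ... */
    split_merge(dst, mid, hi, src);
    merge(src, lo, mid, hi, dst);     /* ... then merge them into dst */
}

/* one initial copy replaces the per-merge copy-back */
void merge_sort(int *a, int *tmp, size_t n)
{
    memcpy(tmp, a, n * sizeof *a);
    split_merge(tmp, 0, n, a);
}
```

    Note the buffer-swapping trick only works because both buffers hold the same data before each call; that is what makes the length-1 base case free.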

    I should also note that the std::stable_sort implementation I have seen for Visual Studio uses the top-down approach. Also, it does even better than the "merge across then back" technique in that it does not use a temp buffer at all! It uses a complex in-place merging technique that among other things uses the GCD algorithm. Not necessarily faster, but much lower memory usage. This technique can be used for top-down or bottom-up.
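    The rotation primitive at the heart of such in-place merges is worth seeing on its own. Below is a hedged sketch of the classic gcd-cycle ("juggling") rotation, which moves each element exactly once in gcd(n, k) cycles; this is the general technique, not the actual Visual Studio code:

```c
#include <stddef.h>

static size_t gcd(size_t a, size_t b)
{
    while (b) { size_t t = a % b; a = b; b = t; }
    return a;
}

/* rotate a[0..n) left by k positions; each element is written once */
void rotate_left(int *a, size_t n, size_t k)
{
    if (n == 0 || (k %= n) == 0)
        return;
    size_t cycles = gcd(n, k);        /* number of independent cycles */
    for (size_t start = 0; start < cycles; start++) {
        int tmp = a[start];
        size_t i = start;
        for (;;) {                    /* follow one cycle, shifting elements */
            size_t j = i + k;
            if (j >= n)
                j -= n;
            if (j == start)
                break;
            a[i] = a[j];
            i = j;
        }
        a[i] = tmp;
    }
}
```

    An in-place merge uses rotations like this to exchange blocks between the two runs without any temp buffer, trading extra element moves for the memory.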

    Bottom-up, a.k.a. breadth-first, merge sort for an array tends to be inferior from what I can tell, for two reasons:
    1. Accessing the items in breadth first manner has poorer cache locality than depth-first manner.
    2. When the array size is not near a power of two, the final merge (which is the most important) can be merging arrays of significantly different sizes.

  14. #59
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    I should also note that the std::stable_sort implementation I have seen for Visual Studio uses the top-down approach.
    The code is set up for recursion, but the recursion never occurs and only one array split is performed. It's probably legacy code that was never fully revised. The code allocates a temp buffer half the size of the source array, then calls a function that splits the source array in half and checks whether the temp buffer is large enough, which it will be on the first try. No recursion takes place; after the split, a bottom-up merge is done on each half, and then the two halves are merged. A snippet of the high-level part of the code follows. The calling sequence is:

    User calls stable_sort(_BidIt _First, _BidIt _Last) which calls

    void _Stable_sort(_BidIt _First, _BidIt _Last, _Diff *, _Ty *) which allocates a temp buffer 1/2 the size of the array and then calls

    void _Stable_sort(_BidIt _First, _BidIt _Last, _Diff _Count, _Temp_iterator<_Ty>& _Tempbuf), which splits the source array into two halves, determines that the temp buffer is large enough to sort each half (so no recursion occurs), calls _Buffered_merge_sort(,,,) twice to do the bottom-up merges, and calls _Buffered_merge(,,,,,) to do the final merge.

    Code:
    template<class _BidIt,
        class _Diff,
        class _Ty> inline
        void _Stable_sort(_BidIt _First, _BidIt _Last, _Diff _Count,
            _Temp_iterator<_Ty>& _Tempbuf)
        {   //  sort preserving order of equivalents, using operator<
        if (_Count <= _ISORT_MAX)
            _Insertion_sort(_First, _Last); // small, insertion sort
        else
            {   // sort halves and merge
            _Diff _Count2 = (_Count + 1) / 2;
            _BidIt _Mid = _First;
            std::advance(_Mid, _Count2);
    
            if (_Count2 <= _Tempbuf._Maxlen())
                {   // temp buffer big enough, sort each half using buffer
                _Buffered_merge_sort(_First, _Mid, _Count2, _Tempbuf);
                _Buffered_merge_sort(_Mid, _Last, _Count - _Count2, _Tempbuf);
                }
            else
                {   // temp buffer not big enough, divide and conquer
                _Stable_sort(_First, _Mid, _Count2, _Tempbuf);
                _Stable_sort(_Mid, _Last, _Count - _Count2, _Tempbuf);
                }
    
            _Buffered_merge(_First, _Mid, _Last,
                _Count2, _Count - _Count2, _Tempbuf);   // merge sorted halves
            }
        }
    
    template<class _BidIt,
        class _Diff,
        class _Ty> inline
        void _Stable_sort(_BidIt _First, _BidIt _Last, _Diff *, _Ty *)
        {   // sort preserving order of equivalents, using operator<
        _Diff _Count = 0;
        _Distance(_First, _Last, _Count);
        _Temp_iterator<_Ty> _Tempbuf((_Count + 1) / 2);
        _Stable_sort(_First, _Last, _Count, _Tempbuf);
        }
    
    template<class _BidIt> inline
        void stable_sort(_BidIt _First, _BidIt _Last)
        {   // sort preserving order of equivalents, using operator<
        _DEBUG_RANGE(_First, _Last);
        if (_First != _Last)
            {
            _Stable_sort(_CHECKED_BASE(_First), _CHECKED_BASE(_Last), _Dist_type(_First), _Val_type(_First));
            }
        }
    Last edited by rcgldr; 08-22-2013 at 04:26 AM.

  15. #60
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    Let's move on to the array version. I believe you haven't seen the "merge across then back" top-down technique.
    Take a look at the Merge Sort I have here: Array Sorting
    There is no copy-back step in that. It's essentially a 4-way merge; two-way in one direction then two-way back again.
    Then why does Merge Sort - Cprogramming.com have a copy-back step? I'll have to investigate this later.

    Quote Originally Posted by iMalc View Post
    Accessing the items in breadth first manner has poorer cache locality than depth-first manner.
    Breadth-first deals with two sequential input streams and two sequential output streams of data. Why is this worse than jumping back and forth as a top-down merge sort traverses down and up the call chain?

    Quote Originally Posted by iMalc View Post
    When the array size is not near a power of two, the final merge (which is the most important) can be merging arrays of significantly different sizes.
    Why is the final merge the most important? It's the one with the least overhead: the iterators only iterate and never have to be advanced to the next run boundary.
    Last edited by rcgldr; 08-22-2013 at 04:21 AM.

