Thread: merge sort - top down versus bottom up

  1. #61
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    Instability only occurs if compares are used to determine the end of runs after the initial pass that creates the runs.
    Okay, but do you agree with me that the three sorts are stable?

    I didn't care enough to analyze the implementations for sorting stability. I just verified that for all input sequences I generate, the results are stable. So, if you have done the analysis, I'll take your word for it.

    Quote Originally Posted by rcgldr View Post
    Again, note my issue is with a standard top down merge sort
    That does not mean the entire top-down merge sort approach is inefficient.

    I've shown that when combined with logic to detect sorted runs, a top-down sorted-scanning merge sort performs at least as well as the sorted-scanning bottom-up merge sort does, for some input linked list types. (It's a specific level of randomness/sortedness, and probably quite CPU dependent.)
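    The run detection behind a sorted-scanning ("natural") merge sort can be sketched like this. This is an illustrative sketch only; the Node type and RunLength name are made up for the example, not my actual implementation:

```cpp
#include <cstddef>

struct Node {            // hypothetical singly linked list node
    int   value;
    Node *next;
};

// Return the length of the non-descending run starting at head, and set
// *rest to the first node of the next run (or NULL at the end of the list).
// Using !(next < cur) keeps equal elements in the run, preserving stability.
int RunLength(Node *head, Node **rest) {
    int len = 0;
    while (head && head->next && !(head->next->value < head->value)) {
        ++len;
        head = head->next;
    }
    if (head) {          // count the last node of the run
        ++len;
        head = head->next;
    }
    *rest = head;
    return len;
}
```

    A scanning merge sort calls something like this repeatedly to find natural runs, instead of splitting all the way down to runs of size 1.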

    This means that even in theory, your initial statement, "top down merge sort is inefficient", is clearly incorrect.

    My own original opinion, "In my opinion, top down merge sort has two main purposes: one, as a learning tool (linked lists, recursion); two, as a basis for linked list sorting variants where the list is known to be mostly sorted already." has changed very little based on these revelations, to "In my opinion, top down merge sort has two main purposes: one, as a learning tool (linked lists, recursion); two, as a basis for linked list sort variants that exploit known randomness/sortedness properties of the list."

    Quote Originally Posted by rcgldr View Post
    Getting back to my original topic, my issue with top down merge sort is the more common case of sorting arrays
    As I only use merge sort for linked lists, I don't really have any input on that. So, I'll bow out here, if you don't mind.

  2. #62
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    I believe you haven't seen the "merge across then back" top-down technique.
    Take a look at the Merge Sort I have here: Array Sorting
    Note, I had to change the "while" that prefixed the template functions to "void" in order to compile with VS2005.

    Quote Originally Posted by iMalc View Post
    There is no copy-back step in that. It's essentially a 4-way merge; two-way in one direction then two-way back again.
    So the trick is to alternate the recursion between two functions, one that merges from the original buffer to the temp buffer (MergeSortAtoB) and one that merges from the temp buffer back to the original buffer (MergeSortAtoA), which keeps the recursion levels in sync with the direction of the merge.

    Your method solves the issue of copy back. There's still the overhead of recursively generating all those indices versus iteration. There's also the overhead of copying single values to the "other" array in order to "merge" them back, which could be optimized with a compare and optional swap, but that could make the code messy, and currently it's very clean.

  3. #63
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    Okay, but do you agree with me that the three sorts are stable?
    Yes.

    Quote Originally Posted by Nominal Animal View Post
    That does not mean the entire top-down merge sort approach is inefficient.
    The top down scanning merge sort uses a bottom up like method during the scanning-for-runs phase, and it does a 3 way merge, so it's significantly different from a conventional top down merge sort, which recursively splits the list until the size is 2 or 1. Also, my guess is that in 90% or more of the typical data patterns, it's still slower than the array or "classic" bottom up merge sort.
    Last edited by rcgldr; 08-22-2013 at 11:34 AM.

  4. #64
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    Take a look at the Merge Sort I have here: Array Sorting
    There is no copy-back step in that. It's essentially a 4-way merge; two-way in one direction then two-way back again.
    I sit corrected. I created a bottom up template similar in style to the top down template and the times are almost identical. Even with 10 million unsigned 64 bit integers to sort, bottom up was only 2% faster. Here is the source:

    Code:
    //      tsorttd.h - top down merge sort
    
    // forward declarations, so MergeSort can call these before they are defined
    template <class T> void MergeSortAtoA(T a[], T b[], const int l, const int r);
    template <class T> void MergeSortAtoB(T a[], T b[], const int l, const int r);
    template <class T> void MergeAtoB(const T a[], T b[], const int l, const int m, const int r);
    
    template <class T>
    void MergeSort(T a[], int n) {
        T *const b = new T[n];
        MergeSortAtoA(a, b, 0, n);
        delete[] b;
    }
    
    template <class T>
    void MergeSortAtoA(T a[], T b[], const int l, const int r) {
        if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoB(a, b, l, m);
            MergeSortAtoB(a, b, m, r);
            MergeAtoB(b, a, l, m, r); //Note a & b reversed (this is B->A)
        }
    }
    
    template <class T>
    void MergeSortAtoB(T a[], T b[], const int l, const int r) {
        if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoA(a, b, l, m);
            MergeSortAtoA(a, b, m, r);
            MergeAtoB(a, b, l, m, r);
        } else if (r - l == 1) {
            b[l] = a[l];
        }
    }
    
    template <class T>
    void MergeAtoB(const T a[], T b[], const int l, const int m, const int r) {
        int i = l, j = m, c = l;
        //merge items while both lists are not empty
        while (i < m && j < r)
            b[c++] = (a[j] < a[i]) ? a[j++] : a[i++];
        //copy remaining items into place
        while (i < m) b[c++] = a[i++];
        while (j < r) b[c++] = a[j++];
    }
    Code:
    //      tsortbu.h - bottom up merge sort
    
    #ifndef min
    #define min(a,b)  (((a) < (b)) ? (a) : (b))
    #endif
    
    size_t GetPassCount(size_t n);          // forward declaration (defined below)
    
    template <class T>
    void MergeSort(T a[], const size_t n) {
        size_t i = GetPassCount(n);
        if(!i)
            return;
        T *b = new T[n];                    // allocate temp array
        T *t;
        size_t width = 1;                   // init width
        if(i & 1){                          // if odd pass count
            T x;                            //   do swaps in a[]
            size_t m = n & ((size_t)0-(size_t)2);
            for(i = 0; i < m; i += 2){
                if(a[i] > a[i+1]){
                    x = a[i];
                    a[i] = a[i+1];
                    a[i+1] = x;
                }
            }
            width = 2;}
    //  merge sort array
        for(; width < n; width <<= 1){      // do a merge pass
            for(i = 0; i < n; i += 2*width){
                size_t o  = i;              //  merge runs
                size_t ll = i;
                size_t l  = ll;
                size_t mm = min(i+width, n);
                size_t r  = mm;
                size_t ee = min(i+2*width, n);
            while(l < mm && r < ee)     // take left value on equal (stable)
                b[o++] = (a[r] < a[l]) ? a[r++] : a[l++];
                // copy remaining values
                while (l < mm) b[o++] = a[l++];
                while (r < ee) b[o++] = a[r++];
            }
            t = a;                          // swap array ptrs
            a = b;
            b = t;
        }
        delete[] b;                         // free temp array
    }
    
    //  get pass count: returns ceil(log2(n)), the number of merge passes needed
    size_t GetPassCount(size_t n)
    {
        size_t i = 0;
        size_t m = n;
        while(m >>= 1)                      // i = floor(log2(n))
            i += 1;
        if((((size_t)1<<i)-1)&n)            // if n is not a power of 2
            i += 1;
        return(i);
    }
    Code:
    //      tsortxx.cpp - test program for tsortxx.h
    
    #include "tsortxx.h" // (tsorttd.h or tsortbu.h)
    #include <fstream>
    #include <iostream>
    #include <stdlib.h>
    #include <time.h>
    typedef unsigned long long uint64_t;
    
    #define COUNT (10*1024*1024)    // number of values to sort
    
    clock_t ctTimeStart;            // clock values
    clock_t ctTimeStop;
    
    int main(int argc, char**argv)
    {
        uint64_t   r;               // random number
        size_t     i;
    
        std::ofstream ofs("dst.bin", std::ofstream::binary | std::ofstream::trunc);
        if(!ofs.is_open())
            return(0);
    
        uint64_t * a = new uint64_t [COUNT];
        if(a == NULL)
            return(0);
        for(i = 0; i < COUNT; i++){
            r  = (((uint64_t)((rand()>>4) & 0xff))<< 0);
            r += (((uint64_t)((rand()>>4) & 0xff))<< 8);
            r += (((uint64_t)((rand()>>4) & 0xff))<<16);
            r += (((uint64_t)((rand()>>4) & 0xff))<<24);
            r += (((uint64_t)((rand()>>4) & 0xff))<<32);
            r += (((uint64_t)((rand()>>4) & 0xff))<<40);
            r += (((uint64_t)((rand()>>4) & 0xff))<<48);
            r += (((uint64_t)((rand()>>4) & 0xff))<<56);
            a[i] = r;
        }
    
        ctTimeStart = clock();
        MergeSort(a, COUNT);
        ctTimeStop = clock();
        std::cout << "# of ticks " << ctTimeStop - ctTimeStart << std::endl;
    
        ofs.write((const char *)a, COUNT*sizeof(a[0]));
        ofs.close();
    
        delete [] a;
    
        return(0);
    }
    Last edited by rcgldr; 08-22-2013 at 10:00 PM.

  5. #65
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    I sit corrected. I created a bottom up template similar in style to the top down template and the times are almost identical.
    If I use a more optimized pointer based bottom up merge, it's about 18% faster for 10 million values (1.1 seconds versus 1.3 seconds), but it's possible that a similar optimization could be done to the top down merge sort.

    Code:
    //      tsortbu.h - bottom up merge sort
    
    size_t GetPassCount(size_t n);          // forward declaration (defined below)
    
    template <class T>
    void MergeSort(T a[], const size_t n) {
        size_t i = GetPassCount(n);
        if(!i)
            return;
        T *b = new T[n];                    // allocate temp array
        size_t width = 1;                   // init width
        if(i & 1){                          // if odd pass count
            T x;                            //   do swaps in a[]
            size_t m = n & ((size_t)0-(size_t)2);
            for(i = 0; i < m; i += 2){
                if(a[i] > a[i+1]){
                    x = a[i];
                    a[i] = a[i+1];
                    a[i+1] = x;
                }
            }
            width = 2;
        }
        T *  pGet;                          // ptrs to a and b
        T *  pPut;
        T *  pGet0;                         // merge run ptrs
        T *  pGet1;
        T *  pEnd;
        T *  pEnd0;
        T *  pEnd1;
        T *  pPutx;
        T    data0;                         // data values
        T    data1;
        pGet = a;                           // init array ptrs
        pPut = b;
        while(width < n){                   // merge pass
            pGet1 = pGet;                   // init run ptrs
            pPutx = pPut;
            pEnd  = pGet+n;
            while(1){                       // merge pairs of runs
                pGet0 = pGet1;              // update run ptrs
                pEnd0 = pGet0 + width;
                if(pEnd0 >= pEnd)
                    pEnd0 = pEnd;
                pGet1 = pEnd0;
                pEnd1 = pGet1 + width;
                if(pEnd1 >= pEnd)
                    pEnd1 = pEnd;
                if(pGet0 == pEnd0)          // if no data, end pass
                    break;
                if(pGet1 == pEnd1){         // if only run0
                    while(pGet0 != pEnd0)   //   copy it, end pass
                        *pPutx++ = *pGet0++;
                    break;}
                data0 = *pGet0++;           // get 1st 2 values
                data1 = *pGet1++;
                while(1){                   // merge 2 runs
                    if(data0 <= data1){     // if 0 <= 1, put data0 */
                        *pPutx++ = data0;
                        if(pGet0 < pEnd0){  //   if not end run0, get next data0
                            data0 = *pGet0++;
                            continue;
                        } else {            //   else copy rest of run1
                            *pPutx++ = data1;
                            while(pGet1 != pEnd1)
                                *pPutx++ = *pGet1++;
                            break;          //   end 2 run merge
                        }
                    } else {                // else 1 < 0, put data1
                        *pPutx++ = data1;
                        if(pGet1 < pEnd1){  //   if not end run1, get next data1
                            data1 = *pGet1++;
                            continue;
                        } else {            //   else copy rest of run0
                            *pPutx++ = data0;
                            while(pGet0 != pEnd0)
                                *pPutx++ = *pGet0++;
                            break;          //   end 2 run merge
                        }
                    }
                }
            }
            pEnd = pPut;                    // swap ptrs
            pPut = pGet;
            pGet = pEnd;
            width <<= 1;                    // bump run size
        }
        delete[] b;                         // free temp array
    }
    
    //  get pass count
    size_t GetPassCount(size_t n)
    {
        size_t i = 0;
        size_t m = n;
        while(m >>= 1)
            i += 1;
        if((((size_t)1<<i)-1)&n)
            i += 1;
        return(i);
    }
    Last edited by rcgldr; 08-22-2013 at 10:58 PM.

  6. #66
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    I created a bottom up template similar in style to the top down template and the times are almost identical.
    It turns out the index based bottom up code in post #64 was redundantly calculating indices, something not needed in a bottom up merge. I fixed that issue, and now the times for sorting 10 million 64 bit pseudo random integers are about 1.14 seconds for the bottom up merge versus 1.33 seconds for the top down merge, so the index based bottom up merge is about 16% faster. Updated code for the bottom up merge:

    Code:
    //      tsortbu.h - bottom up merge sort
    
    #ifndef min
    #define min(a,b)  (((a) < (b)) ? (a) : (b))
    #endif
    
    size_t GetPassCount(size_t n);          // forward declaration (defined below)
    
    template <class T>
    void MergeSort(T a[], const size_t n) {
        size_t i = GetPassCount(n);
        if(!i)
            return;
        T *b = new T[n];                    // allocate temp array
        T *t;
        size_t width = 1;                   // init width
        if(i & 1){                          // if odd pass count
            T x;                            //   do swaps in a[]
            size_t m = n & ((size_t)0-(size_t)2);
            for(i = 0; i < m; i += 2){
                if(a[i] > a[i+1]){
                    x = a[i];
                    a[i] = a[i+1];
                    a[i+1] = x;
                }
            }
            width = 2;}
    //  merge sort array
        for(; width < n; width <<= 1){      // do a merge pass
            size_t o = 0;                   // init output index
            while(o < n){                   // merge pairs of runs
                size_t l  = o;              // update run indices
                size_t le = min(l+width, n);
                size_t r  = le;
                size_t re = min(r+width, n);
            while(l < le && r < re)     // take left value on equal (stable)
                b[o++] = (a[r] < a[l]) ? a[r++] : a[l++];
                // copy remaining values
                while (l < le) b[o++] = a[l++];
                while (r < re) b[o++] = a[r++];
            }
            t = a;                          // swap array ptrs
            a = b;
            b = t;
        }
        delete[] b;                         // free temp array
    }
    
    //  get pass count
    size_t GetPassCount(size_t n)
    {
        size_t i = 0;
        size_t m = n;
        while(m >>= 1)
            i += 1;
        if((((size_t)1<<i)-1)&n)
            i += 1;
        return(i);
    }
    Last edited by rcgldr; 08-23-2013 at 01:31 AM.

  7. #67
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by rcgldr View Post
    Then why does Merge Sort - Cprogramming.com have a copy back step? I'll have to investigate this later.
    Why would you expect it to be written with maximum efficiency in mind?

    Breadth first deals with 2 sequential input and 2 sequential output streams of data. Why is this worse than jumping back and forth as a top down merge sort traverses down and up the call chain?
    Breadth first makes many passes over the data, where each pass processes the data once and then leaves it to be likely pushed out of the cache. Depth first deals with many values close together multiple times before eventually the entire run it is building cannot fit in the cache at once etc.

    Why is the final merge the most important? That's the one with the least overhead; the iterators only iterate and never have to be advanced to the next run boundary.
    The larger the merge the more time it takes and thus the more it affects the overall total.
    The more the sizes of the runs being merged differ, the less work is being done in the pass for the same effort. Merging a run of n-1 items with a run of 1 item achieves the same as an insertion sort pass. Or put another way, for 256 items it may take 8 passes, but for 257 items it will take 9 passes; the 9th pass, although necessary, adds very little to the overall progress of the algorithm.
    Okay that one is kind of a weak argument and in explaining it I'm not quite as convinced as I was that it makes that much difference. I shall have to investigate this again some time.
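    The 256-versus-257 pass count above is just ceil(log2(n)); a tiny helper (PassCount is a made-up name, mirroring rcgldr's GetPassCount from post #64) makes it easy to check:

```cpp
#include <cstddef>

// number of merge passes a bottom-up merge sort needs: ceil(log2(n))
std::size_t PassCount(std::size_t n) {
    std::size_t i = 0;
    while ((std::size_t)1 << i < n)
        ++i;
    return i;
}
```

    For example, PassCount(256) is 8 while PassCount(257) is 9: one extra element adds a whole extra pass.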
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  8. #68
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Code such as post #65 makes it very clear why bottom-up is not as common for array-based merge sort.

    There will be a corner case where one could avoid copying over the remainder of a run after merging in the top-down version as well, for some minor gains at great expense to the readability. But how much one can optimise the heck out of either of these isn't quite the point.

    So the trick is to alternate the recursion between two functions, one that merges from the original buffer to the temp buffer (MergeSortAtoB) and one that merges from the temp buffer back to the original buffer (MergeSortAtoA), which keeps the recursion levels in sync with the direction of the merge.
    You almost understood it, but not quite.
    MergeSortAtoA starts with the data in one buffer and ends up with the data in the same buffer. It calls MergeSortAtoB twice which performs the usual merges from the first buffer to the second, then with the two resulting runs of data in the second buffer, it performs a final merge back to the first buffer.
    I guess you could say that there is mutual recursion between them.
    It has the added advantage that, if you wanted to produce a sorted copy, both that and sorting in place are equally efficient. They differ only in where the recursion starts, i.e. which function is called first.

    Sorry for the piecemeal reply.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  9. #69
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Results - using the top down code in post #64, and the bottom up code in post #66, the bottom up merge sort is about 12% faster with arrays of 1 million, 4 million, or 10 million pseudo random 64 bit unsigned integers.

    The previous results I posted were based on the Windows 64 Hz tick (15.625 ms), which was affecting comparisons, so I added conditional code to use the Windows QueryPerformanceCounter function, which uses the CPU clock (about 3.4 GHz on my system).

    Quote Originally Posted by iMalc View Post
    Code such as post #65 makes it very clear why bottom-up is not as common for array-based merge sort.
    The code in post #65 was based on a program over 10 years old. Most of the extra code comes from replacing min(...) with inline code; the rest is mostly unfolded ?: statements.

    So use the code in post #66 as an example of bottom up merge sort. Note that the part of the function that does the actual back and forth merge process between two arrays is only 15 lines of code.

    What was fixed in the bottom up merge in post #66 was removal of two unneeded indexes that were in the post #64 example.

    Quote Originally Posted by iMalc View Post
    There will be a corner case where one could avoid copying over the remainder of a run after merging in the top-down version as well.
    I'm not sure what you're getting at, since there isn't any such optimization in the bottom up merge examples I posted. The code for this is there in post #65, but it's in the middle of the merge run code.

    Quote Originally Posted by iMalc View Post
    You almost understood it, but not quite. ... mutual recursion
    I just worded my response poorly. The main idea is that AtoA calls AtoB, which calls AtoA, and so on, instead of a single function calling itself, which is how your example avoided the unneeded copy back.

    My guess is that the 12% difference in performance between top down and bottom up comes from the overhead of recursion for the index generation / usage, and the fact that the bottom up method uses iteration instead of recursive "splits" to generate the n/2 pairs of indices for the initial run size of 1.

    I didn't mention this before, but the bottom up merge has an optimization: if the number of merge passes will be odd, it swaps pairs in a[] (only if needed), replacing the first merge pass, so that the back and forth merge process ends up merging back into a[].
    Last edited by rcgldr; 08-23-2013 at 05:10 AM.

  10. #70
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    MergeSortAtoA starts with the data in one buffer and ends up with the data in the same buffer. It calls MergeSortAtoB twice which performs the usual merges from the first buffer to the second, then with the two resulting runs of data in the second buffer, it performs a final merge back to the first buffer. I guess you could say that there is mutual recursion between them.
    Quote Originally Posted by rcgldr View Post
    I didn't mention this before, but the bottom up merge has an optimization: if the number of merge passes will be odd, it swaps pairs in a[] (only if needed), replacing the first merge pass, so that the back and forth merge process ends up merging back into a[].
    The top down merge handles an odd number of passes by copying single elements from a[] to b[]:

    Code:
    template <class T>
    void MergeSortAtoA(T a[], T b[], const int l, const int r) {
        if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoB(a, b, l, m);
            MergeSortAtoB(a, b, m, r);
            MergeAtoB(b, a, l, m, r); //Note a & b reversed (this is B->A)
        }
    }
    
    template <class T>
    void MergeSortAtoB(T a[], T b[], const int l, const int r) {
        if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoA(a, b, l, m);
            MergeSortAtoA(a, b, m, r);
            MergeAtoB(a, b, l, m, r);
        } else if (r - l == 1) {     // if run size == 1
            b[l] = a[l];             // copy single element from a[] to b[]
        }
    }
    Changing these functions to duplicate the bottom up optimization (swapping in place in a[] when run size == 2) increased performance by 9.8%, reducing the difference between top down and bottom up to 2% (bottom up only slightly faster).

    Code:
    template <class T>
    void MergeSortAtoA(T a[], T b[], const int l, const int r) {
        if (r - l == 2) {             // if run size == 2
            if( a[l]   > a[r-1]){     //   swap in a[]
                T x    = a[l];
                a[l]   = a[r-1];
                a[r-1] = x;
            }
        } else if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoB(a, b, l, m);
            MergeSortAtoB(a, b, m, r);
            MergeAtoB(b, a, l, m, r); //Note a & b reversed (this is B->A)
        }
    }
    
    template <class T>
    void MergeSortAtoB(T a[], T b[], const int l, const int r) {
        if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoA(a, b, l, m);
            MergeSortAtoA(a, b, m, r);
            MergeAtoB(a, b, l, m, r);
        }
    }
    Last edited by rcgldr; 08-23-2013 at 02:40 PM.

  11. #71
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Oh I see what you've done there.
    Yeah, in practice one would detect when the number of items to merge is below a certain threshold and just perform an insertion sort.
    E.g.
    Code:
        if (r - l > 10) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoB(a, b, l, m);
            MergeSortAtoB(a, b, m, r);
            MergeAtoB(b, a, l, m, r); //Note a & b reversed (this is B->A)
        } else
            InsertionSort(&a[l], r-l);
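    InsertionSort isn't defined anywhere in this thread; a minimal sketch of what it might look like, matching the call above (illustrative only):

```cpp
template <class T>
void InsertionSort(T a[], int n) {
    for (int i = 1; i < n; ++i) {
        T x = a[i];                        // value to insert
        int j = i;
        while (j > 0 && x < a[j-1]) {      // shift larger values right
            a[j] = a[j-1];
            --j;
        }
        a[j] = x;                          // drop it into place
    }
}
```

    Using strict < for the shift keeps equal elements in their original order, so the merge sort stays stable.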
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  12. #72
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    By changing these functions to duplicate the bottom up optimization to swap in a[] for an odd number of passes, the performance increased by 9.8%, reducing the difference in performance between top down and bottom up to 2% (bottom up only slightly faster).
    Note that change only helps if there is an odd number of passes. It decreases performance by about 1% for an even number of passes, apparently because the check is done on every call to MergeSortAtoA. The bottom up sort only makes that check one time, so it's not affected.

    Quote Originally Posted by iMalc View Post
    Oh I see what you've done there. Yeah in practicality, one would detect when the number of items to merge is below a certain threshold and just perform an insertion sort.
    Yes; MergeSortAtoA would need to end up with the insertion sorted run in the same buffer a[], and MergeSortAtoB would need the insertion sorted run to end up in the other buffer b[], to handle the even / odd number of passes issue.
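    One way to sketch the AtoB variant (InsertionSortAtoB is a hypothetical helper, not code from this thread): insert each element of a[l..r) directly into its sorted position in b[], so the sorted run lands in the other buffer without a separate copy step.

```cpp
template <class T>
void InsertionSortAtoB(const T a[], T b[], const int l, const int r) {
    for (int i = l; i < r; ++i) {          // insert a[i] into sorted b[l..i)
        int j = i;
        while (j > l && a[i] < b[j-1]) {   // shift larger values right
            b[j] = b[j-1];
            --j;
        }
        b[j] = a[i];
    }
}
```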

  13. #73
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    I updated MergeSortAtoB with a similar change for run size == 2, copying 2 values from a[] to b[] when the pass count is even. It's only a 1% gain in performance for an even pass count. The change to MergeSortAtoA for an odd pass count swaps in place, only about half the time, and the swaps should always be in the cache; it's a 9.8% gain for odd pass counts. The bottom up merge sort (from post #66) is still faster: by 10.9% for an even pass count, and by 2% for an odd pass count.

    The 1% performance increase from the change to MergeSortAtoB is an indicator of the top down overhead: it replaces two calls to MergeSortAtoA that just return (since the split run size == 1) and a call to MergeAtoB with inline code that copies / swaps the two values from a[] to b[].

    Here is the updated code for MergeSortAtoA and MergeSortAtoB:

    Code:
    template <class T>
    void MergeSortAtoA(T a[], T b[], const int l, const int r) {
        if (r - l > 2) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoB(a, b, l, m);
            MergeSortAtoB(a, b, m, r);
            MergeAtoB(b, a, l, m, r); //Note a & b reversed (this is B->A)
        } else if (r - l == 2) {      // if run size == 2
            if( a[l]   > a[r-1]){     //   swap in a[]
                T x    = a[l];
                a[l]   = a[r-1];
                a[r-1] = x;
            }
        }
    }
    
    template <class T>
    void MergeSortAtoB(T a[], T b[], const int l, const int r) {
        if (r - l > 2) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoA(a, b, l, m);
            MergeSortAtoA(a, b, m, r);
            MergeAtoB(a, b, l, m, r);
        } else if (r - l == 2) {      // if run size == 2
            if( a[l]   > a[r-1]){     //   copy or swap to b[]
                b[r-1] = a[l];
                b[l]   = a[r-1];
            } else {
                b[l]   = a[l];
                b[r-1] = a[r-1];
            }
        }
    }
    Last edited by rcgldr; 08-23-2013 at 08:21 PM.

  14. #74
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by rcgldr View Post
    Yes; MergeSortAtoA would need to end up with the insertion sorted run in the same buffer a[], and MergeSortAtoB would need the insertion sorted run to end up in the other buffer b[], to handle the even / odd number of passes issue.
    I wouldn't bother adding it to MergeSortAtoB because it'll call MergeSortAtoA soon enough and the Insertion Sort can just be done there, albeit with half the number of items.
    E.g. with a cutoff of 10 it might have 16 items the first time and not perform an Insertion Sort, next time it will have 4 items which is below the cutoff.
    Not doing the Insertion Sort when there were 8 items isn't really a loss, because it would have had to copy them across as well, so you win some, you lose some.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  15. #75
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    optimizing ... MergeSortAtoA ... MergeSortAtoB ...
    I like the simplicity of the original versions. The changes I made were more of an experiment than a suggestion to change them.

    - - - benchmarking - - -

    Quote Originally Posted by iMalc View Post
    Breadth first makes many passes over the data, where each pass processes the data once and then leaves it to be likely pushed out of the cache. Depth first deals with many values close together multiple times before eventually the entire run it is building cannot fit in the cache at once etc.
    It's not clear to me where the time is being consumed. I haven't tried using VS's profiler to see if that would help.

    I made another version of the bottom up merge sort that doesn't include the swap on odd pass count; instead it returns a pointer to the sorted array, deleting the original array and returning the temp array when there's an odd number of passes.

    Testing with 10 million 64 bit integers, the non-swap version took 1.188 seconds, while the swap version took 1.166 seconds, less than a 2% gain. This makes sense: 10 million integers take 24 passes, so each pass moves 4.16% of the total data, and the 2% gain corresponds to the swaps only occurring about half of the time on the first pass.

    - - - cache - - -

    A bottom up merge is sequentially accessing 4 streams of memory, 2 streams are being read, and 2 streams are being written. In the early passes, run size is small and the offset between the two input streams and two output streams is small. Even in the later passes, the sequential nature of a bottom up merge sort may result in the 4 streams of data often being cached, depending on the cache architecture.

    On the other hand, the top down merge sort got a bigger improvement in performance (9.8%) from doing the swaps in a[] for odd pass count merges when run size == 2. That would seem to imply that more time is spent dealing with the smaller runs than with the larger runs. After the first 3 or 4 passes of a bottom up merge, each pass probably takes about the same amount of time (for random data).

    - - - importance of the final few passes - - -

    Quote Originally Posted by iMalc View Post
    The larger the merge the more time it takes and thus the more it affects the overall total.
    For the bottom up merge sort, there isn't much overhead to advance from one pair of runs to the next once the first 3 or 4 passes have increased the run size to 8 or 16.

    Quote Originally Posted by iMalc View Post
    The more the sizes of the runs being merged differ, the less work is being done in the pass.
    The main issue is an "odd" run that just gets copied. The worst case is a count of 2^n + 1, where the last element just gets copied on all but the final pass, and then the final pass merges 2^n elements with that 1 element.
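    That 2^n + 1 behavior can be counted directly: on each pass, check whether the final partial block has a second run to merge with. A small sketch (CopyOnlyPasses is a made-up name for illustration):

```cpp
#include <cstddef>

// For a bottom-up merge sort of n elements with run doubling, count the
// passes in which the last partial block has no partner run, so its
// elements are only copied across rather than merged.
std::size_t CopyOnlyPasses(std::size_t n) {
    std::size_t count = 0;
    for (std::size_t width = 1; width < n; width <<= 1) {
        std::size_t tail = (n / (2 * width)) * (2 * width); // start of last block
        if (tail < n && n - tail <= width)  // second run of the block is empty
            ++count;
    }
    return count;
}
```

    For n = 1025 (2^10 + 1) this gives 10 copy-only passes out of 11, matching the "just gets copied on all but the final pass" description.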
    Last edited by rcgldr; 08-23-2013 at 11:48 PM.

