Thread: merge sort - top down versus bottom up

  1. #61
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by rcgldr View Post
    Instability only occurs if compares are used to determine the end of runs after the initial pass that creates the runs.
    Okay, but do you agree with me that the three sorts are stable?

    I didn't care enough to analyze the implementations for sorting stability. I just verified that for all input sequences I generate, the results are stable. So, if you have done the analysis, I'll take your word for it.

    Quote Originally Posted by rcgldr View Post
    Again, note my issue is with a standard top down merge sort
    That does not mean the entire top-down merge sort approach is inefficient.

    I've shown that when combined with logic to detect sorted runs, a top-down sorted-scanning merge sort performs at least as well as the sorted-scanning bottom-up merge sort does, for some input linked list types. (It's a specific level of randomness/sortedness, and probably quite CPU dependent.)
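    The run detection behind a sorted-scanning ("natural") merge sort can be sketched like this. This is an illustrative sketch only; the Node type and RunLength name are made up for the example, not my actual implementation:

```cpp
#include <cstddef>

struct Node {            // hypothetical singly linked list node
    int   value;
    Node *next;
};

// Return the length of the non-descending run starting at head, and set
// *rest to the first node of the next run (or NULL at the end of the list).
// Using !(next < cur) keeps equal elements in the run, preserving stability.
int RunLength(Node *head, Node **rest) {
    int len = 0;
    while (head && head->next && !(head->next->value < head->value)) {
        ++len;
        head = head->next;
    }
    if (head) {          // count the last node of the run
        ++len;
        head = head->next;
    }
    *rest = head;
    return len;
}
```

    A scanning merge sort calls something like this repeatedly to find natural runs, instead of splitting all the way down to runs of size 1.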

    This means that even in theory, your initial statement, "top down merge sort is inefficient", is clearly incorrect.

    My own original opinion, "In my opinion, top down merge sort has two main purposes: one, as a learning tool (linked lists, recursion); two, as a basis for linked list sorting variants where the list is known to be mostly sorted already." has changed very little based on these revelations, to "In my opinion, top down merge sort has two main purposes: one, as a learning tool (linked lists, recursion); two, as a basis for linked list sort variants that exploit known randomness/sortedness properties of the list."

    Quote Originally Posted by rcgldr View Post
    Getting back to my original topic, my issue with top down merge sort is the more common case of sorting arrays
    As I only use merge sort for linked lists, I don't really have any input on that. So, I'll bow out here, if you don't mind.

  2. #62
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    I believe you haven't seen the "merge across then back" top-down technique.
    Take a look at the Merge Sort I have here: Array Sorting
    Note, I had to change the "while" that prefixed the template functions to "void" in order to compile with VS2005.

    Quote Originally Posted by iMalc View Post
    There is no copy-back step in that. It's essentially a 4-way merge; two-way in one direction then two-way back again.
    So the trick is to alternate the recursion between two functions, one that merges from the original buffer to the temp buffer (MergeSortAtoB) and one that merges from the temp buffer back to the original buffer (MergeSortAtoA), which keeps the recursion levels in sync with the direction of the merge.

    Your method solves the issue of copy back. There's still the overhead of recursively generating all those indices versus iteration. There's also the overhead of copying single values to the "other" array in order to "merge" them back, which could be optimized with a compare and optional swap, but that could make the code messy, and currently it's very clean.

  3. #63
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by Nominal Animal View Post
    Okay, but do you agree with me that the three sorts are stable?
    Yes.

    Quote Originally Posted by Nominal Animal View Post
    That does not mean the entire top-down merge sort approach is inefficient.
    The top down scanning merge sort uses a bottom up like method during the scanning-for-runs phase, and it does a 3 way merge, so it's significantly different from a conventional top down merge sort, which recursively splits the list until the size is 2 or 1. Also, my guess is that in 90% or more of the typical data patterns, it's still slower than the array or "classic" bottom up merge sort.
    Last edited by rcgldr; 08-22-2013 at 11:34 AM.

  4. #64
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    Take a look at the Merge Sort I have here: Array Sorting
    There is no copy-back step in that. It's essentially a 4-way merge; two-way in one direction then two-way back again.
    I sit corrected. I created a bottom up template similar in style to the top down template and the times are almost identical. Even with 10 million unsigned 64 bit integers to sort, bottom up was only 2% faster. Here is the source:

    Code:
    //      tsorttd.h - top down merge sort
    
    // forward declarations, so MergeSort can call these before they are defined
    template <class T> void MergeSortAtoA(T a[], T b[], const int l, const int r);
    template <class T> void MergeSortAtoB(T a[], T b[], const int l, const int r);
    template <class T> void MergeAtoB(const T a[], T b[], const int l, const int m, const int r);
    
    template <class T>
    void MergeSort(T a[], int n) {
        T *const b = new T[n];
        MergeSortAtoA(a, b, 0, n);
        delete[] b;
    }
    
    template <class T>
    void MergeSortAtoA(T a[], T b[], const int l, const int r) {
        if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoB(a, b, l, m);
            MergeSortAtoB(a, b, m, r);
            MergeAtoB(b, a, l, m, r); //Note a & b reversed (this is B->A)
        }
    }
    
    template <class T>
    void MergeSortAtoB(T a[], T b[], const int l, const int r) {
        if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoA(a, b, l, m);
            MergeSortAtoA(a, b, m, r);
            MergeAtoB(a, b, l, m, r);
        } else if (r - l == 1) {
            b[l] = a[l];
        }
    }
    
    template <class T>
    void MergeAtoB(const T a[], T b[], const int l, const int m, const int r) {
        int i = l, j = m, c = l;
        //merge items while both lists are not empty
        while (i < m && j < r)
            b[c++] = (a[j] < a[i]) ? a[j++] : a[i++];
        //copy remaining items into place
        while (i < m) b[c++] = a[i++];
        while (j < r) b[c++] = a[j++];
    }
    Code:
    //      tsortbu.h - bottom up merge sort
    
    #ifndef min
    #define min(a,b)  (((a) < (b)) ? (a) : (b))
    #endif
    
    size_t GetPassCount(size_t n);          // forward declaration (defined below)
    
    template <class T>
    void MergeSort(T a[], const size_t n) {
        size_t i = GetPassCount(n);
        if(!i)
            return;
        T *b = new T[n];                    // allocate temp array
        T *t;
        size_t width = 1;                   // init width
        if(i & 1){                          // if odd pass count
            T x;                            //   do swaps in a[]
            size_t m = n & ((size_t)0-(size_t)2);
            for(i = 0; i < m; i += 2){
                if(a[i] > a[i+1]){
                    x = a[i];
                    a[i] = a[i+1];
                    a[i+1] = x;
                }
            }
            width = 2;}
    //  merge sort array
        for(; width < n; width <<= 1){      // do a merge pass
            for(i = 0; i < n; i += 2*width){
                size_t o  = i;              //  merge runs
                size_t ll = i;
                size_t l  = ll;
                size_t mm = min(i+width, n);
                size_t r  = mm;
                size_t ee = min(i+2*width, n);
            while(l < mm && r < ee)     // take left value on equal (stable)
                b[o++] = (a[r] < a[l]) ? a[r++] : a[l++];
                // copy remaining values
                while (l < mm) b[o++] = a[l++];
                while (r < ee) b[o++] = a[r++];
            }
            t = a;                          // swap array ptrs
            a = b;
            b = t;
        }
        delete[] b;                         // free temp array
    }
    
    //  get pass count: returns ceil(log2(n)), the number of merge passes needed
    size_t GetPassCount(size_t n)
    {
        size_t i = 0;
        size_t m = n;
        while(m >>= 1)                      // i = floor(log2(n))
            i += 1;
        if((((size_t)1<<i)-1)&n)            // if n is not a power of 2
            i += 1;
        return(i);
    }
    Code:
    //      tsortxx.cpp - test program for tsortxx.h
    
    #include "tsortxx.h" // (tsorttd.h or tsortbu.h)
    #include <fstream>
    #include <iostream>
    #include <stdlib.h>
    #include <time.h>
    typedef unsigned long long uint64_t;
    
    #define COUNT (10*1024*1024)    // number of values to sort
    
    clock_t ctTimeStart;            // clock values
    clock_t ctTimeStop;
    
    int main(int argc, char**argv)
    {
        uint64_t   r;               // random number
        size_t     i;
    
        std::ofstream ofs("dst.bin", std::ofstream::binary | std::ofstream::trunc);
        if(!ofs.is_open())
            return(0);
    
        uint64_t * a = new uint64_t [COUNT];
        if(a == NULL)
            return(0);
        for(i = 0; i < COUNT; i++){
            r  = (((uint64_t)((rand()>>4) & 0xff))<< 0);
            r += (((uint64_t)((rand()>>4) & 0xff))<< 8);
            r += (((uint64_t)((rand()>>4) & 0xff))<<16);
            r += (((uint64_t)((rand()>>4) & 0xff))<<24);
            r += (((uint64_t)((rand()>>4) & 0xff))<<32);
            r += (((uint64_t)((rand()>>4) & 0xff))<<40);
            r += (((uint64_t)((rand()>>4) & 0xff))<<48);
            r += (((uint64_t)((rand()>>4) & 0xff))<<56);
            a[i] = r;
        }
    
        ctTimeStart = clock();
        MergeSort(a, COUNT);
        ctTimeStop = clock();
        std::cout << "# of ticks " << ctTimeStop - ctTimeStart << std::endl;
    
        ofs.write((const char *)a, COUNT*sizeof(a[0]));
        ofs.close();
    
        delete [] a;
    
        return(0);
    }
    Last edited by rcgldr; 08-22-2013 at 10:00 PM.

  5. #65
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    I sit corrected. I created a bottom up template similar in style to the top down template and the times are almost identical.
    If I use a more optimized pointer based bottom up merge, it's about 18% faster for 10 million values (1.1 seconds versus 1.3 seconds), but it's possible that a similar optimization could be done to the top down merge sort.

    Code:
    //      tsortbu.h - bottom up merge sort
    
    size_t GetPassCount(size_t n);          // forward declaration (defined below)
    
    template <class T>
    void MergeSort(T a[], const size_t n) {
        size_t i = GetPassCount(n);
        if(!i)
            return;
        T *b = new T[n];                    // allocate temp array
        size_t width = 1;                   // init width
        if(i & 1){                          // if odd pass count
            T x;                            //   do swaps in a[]
            size_t m = n & ((size_t)0-(size_t)2);
            for(i = 0; i < m; i += 2){
                if(a[i] > a[i+1]){
                    x = a[i];
                    a[i] = a[i+1];
                    a[i+1] = x;
                }
            }
            width = 2;
        }
        T *  pGet;                          // ptrs to a and b
        T *  pPut;
        T *  pGet0;                         // merge run ptrs
        T *  pGet1;
        T *  pEnd;
        T *  pEnd0;
        T *  pEnd1;
        T *  pPutx;
        T    data0;                         // data values
        T    data1;
        pGet = a;                           // init array ptrs
        pPut = b;
        while(width < n){                   // merge pass
            pGet1 = pGet;                   // init run ptrs
            pPutx = pPut;
            pEnd  = pGet+n;
            while(1){                       // merge pairs of runs
                pGet0 = pGet1;              // update run ptrs
                pEnd0 = pGet0 + width;
                if(pEnd0 >= pEnd)
                    pEnd0 = pEnd;
                pGet1 = pEnd0;
                pEnd1 = pGet1 + width;
                if(pEnd1 >= pEnd)
                    pEnd1 = pEnd;
                if(pGet0 == pEnd0)          // if no data, end pass
                    break;
                if(pGet1 == pEnd1){         // if only run0
                    while(pGet0 != pEnd0)   //   copy it, end pass
                        *pPutx++ = *pGet0++;
                    break;}
                data0 = *pGet0++;           // get 1st 2 values
                data1 = *pGet1++;
                while(1){                   // merge 2 runs
                    if(data0 <= data1){     // if 0 <= 1, put data0 */
                        *pPutx++ = data0;
                        if(pGet0 < pEnd0){  //   if not end run0, get next data0
                            data0 = *pGet0++;
                            continue;
                        } else {            //   else copy rest of run1
                            *pPutx++ = data1;
                            while(pGet1 != pEnd1)
                                *pPutx++ = *pGet1++;
                            break;          //   end 2 run merge
                        }
                    } else {                // else 1 < 0, put data1
                        *pPutx++ = data1;
                        if(pGet1 < pEnd1){  //   if not end run1, get next data1
                            data1 = *pGet1++;
                            continue;
                        } else {            //   else copy rest of run0
                            *pPutx++ = data0;
                            while(pGet0 != pEnd0)
                                *pPutx++ = *pGet0++;
                            break;          //   end 2 run merge
                        }
                    }
                }
            }
            pEnd = pPut;                    // swap ptrs
            pPut = pGet;
            pGet = pEnd;
            width <<= 1;                    // bump run size
        }
        delete[] b;                         // free temp array
    }
    
    //  get pass count
    size_t GetPassCount(size_t n)
    {
        size_t i = 0;
        size_t m = n;
        while(m >>= 1)
            i += 1;
        if((((size_t)1<<i)-1)&n)
            i += 1;
        return(i);
    }
    Last edited by rcgldr; 08-22-2013 at 10:58 PM.

  6. #66
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    I created a bottom up template similar in style to the top down template and the times are almost identical.
    It turns out the index based bottom up code in post #64 was redundantly calculating indices, something not needed in a bottom up merge. I fixed that issue, and now the times for sorting 10 million 64 bit pseudo random integers are about 1.14 seconds for the bottom up merge versus 1.33 seconds for the top down merge, so the index based bottom up merge is about 16% faster. Updated code for the bottom up merge:

    Code:
    //      tsortbu.h - bottom up merge sort
    
    #ifndef min
    #define min(a,b)  (((a) < (b)) ? (a) : (b))
    #endif
    
    size_t GetPassCount(size_t n);          // forward declaration (defined below)
    
    template <class T>
    void MergeSort(T a[], const size_t n) {
        size_t i = GetPassCount(n);
        if(!i)
            return;
        T *b = new T[n];                    // allocate temp array
        T *t;
        size_t width = 1;                   // init width
        if(i & 1){                          // if odd pass count
            T x;                            //   do swaps in a[]
            size_t m = n & ((size_t)0-(size_t)2);
            for(i = 0; i < m; i += 2){
                if(a[i] > a[i+1]){
                    x = a[i];
                    a[i] = a[i+1];
                    a[i+1] = x;
                }
            }
            width = 2;}
    //  merge sort array
        for(; width < n; width <<= 1){      // do a merge pass
            size_t o = 0;                   // init output index
            while(o < n){                   // merge pairs of runs
                size_t l  = o;              // update run indices
                size_t le = min(l+width, n);
                size_t r  = le;
                size_t re = min(r+width, n);
            while(l < le && r < re)     // take left value on equal (stable)
                b[o++] = (a[r] < a[l]) ? a[r++] : a[l++];
                // copy remaining values
                while (l < le) b[o++] = a[l++];
                while (r < re) b[o++] = a[r++];
            }
            t = a;                          // swap array ptrs
            a = b;
            b = t;
        }
        delete[] b;                         // free temp array
    }
    
    //  get pass count
    size_t GetPassCount(size_t n)
    {
        size_t i = 0;
        size_t m = n;
        while(m >>= 1)
            i += 1;
        if((((size_t)1<<i)-1)&n)
            i += 1;
        return(i);
    }
    Last edited by rcgldr; 08-23-2013 at 01:31 AM.

  7. #67
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by rcgldr View Post
    Then why does Merge Sort - Cprogramming.com have a copy back step? I'll have to investigate this later.
    Why would you expect it to be written with maximum efficiency in mind?

    Breadth first deals with 2 sequential input and 2 sequential output streams of data. Why is this worse than jumping back and forth as a top down merge sort traverses down and up the call chain?
    Breadth first makes many passes over the data, where each pass processes the data once and then leaves it to be likely pushed out of the cache. Depth first deals with many values close together multiple times before eventually the entire run it is building cannot fit in the cache at once etc.

    Why is the final merge the most important? That's the one with the least overhead; the iterators only iterate and never have to be advanced to the next run boundary.
    The larger the merge the more time it takes and thus the more it affects the overall total.
    The more the sizes of the runs being merged differ, the less work is being done in the pass for the same effort. Merging a run of n-1 items with a run of 1 item achieves the same as an insertion sort pass. Or put another way, for 256 items it may take 8 passes, but for 257 items it will take 9 passes; the 9th pass, although necessary, adds very little to the overall progress of the algorithm.
    Okay that one is kind of a weak argument and in explaining it I'm not quite as convinced as I was that it makes that much difference. I shall have to investigate this again some time.
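    The 256-versus-257 pass count above is just ceil(log2(n)); a tiny helper (PassCount is a made-up name, mirroring rcgldr's GetPassCount from post #64) makes it easy to check:

```cpp
#include <cstddef>

// number of merge passes a bottom-up merge sort needs: ceil(log2(n))
std::size_t PassCount(std::size_t n) {
    std::size_t i = 0;
    while ((std::size_t)1 << i < n)
        ++i;
    return i;
}
```

    For example, PassCount(256) is 8 while PassCount(257) is 9: one extra element adds a whole extra pass.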
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  8. #68
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Code such as post #65 makes it very clear why bottom-up is not as common for array-based merge sort.

    There will be a corner case where one could avoid copying over the remainder of a run after merging in the top-down version as well, for some minor gains at great expense to the readability. But how much one can optimise the heck out of either of these isn't quite the point.

    So the trick is to alternate the recursion between two functions, one that merges from the original buffer to the temp buffer (MergeSortAtoB) and one that merges from the temp buffer back to the original buffer (MergeSortAtoA), which keeps the recursion levels in sync with the direction of the merge.
    You almost understood it, but not quite.
    MergeSortAtoA starts with the data in one buffer and ends up with the data in the same buffer. It calls MergeSortAtoB twice which performs the usual merges from the first buffer to the second, then with the two resulting runs of data in the second buffer, it performs a final merge back to the first buffer.
    I guess you could say that there is mutual recursion between them.
    It has the added advantage that, if you wanted to produce a sorted copy, both that and sorting in place are equally efficient. They differ only in where the recursion starts, i.e. which function is called first.

    Sorry for the piecemeal reply.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  9. #69
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Results - using the top down code in post #64, and the bottom up code in post #66, the bottom up merge sort is about 12% faster with arrays of 1 million, 4 million, or 10 million pseudo random 64 bit unsigned integers.

    The previous results I posted were based on the Windows 64 Hz tick (15.625 ms), which was affecting comparisons, so I added conditional code to use the Windows QueryPerformanceCounter function, which uses the CPU clock (about 3.4 GHz on my system).

    Quote Originally Posted by iMalc View Post
    Code such as post #65 makes it very clear why bottom-up is not as common for array-based merge sort.
    The code in post #65 was based on a program over 10 years old. Most of the extra code comes from replacing min(...) with inline code; the rest is mostly unfolded ?: statements.

    So use the code in post #66 as an example of bottom up merge sort. Note that the part of the function that does the actual back and forth merge process between two arrays is only 15 lines of code.

    What was fixed in the bottom up merge in post #66 was removal of two unneeded indexes that were in the post #64 example.

    Quote Originally Posted by iMalc View Post
    There will be a corner case where one could avoid copying over the remainder of a run after merging in the top-down version as well.
    I'm not sure what you're getting at, since there isn't any such optimization in the bottom up merge examples I posted. The code for this is there in post #65, but it's in the middle of the merge run code.

    Quote Originally Posted by iMalc View Post
    You almost understood it, but not quite. ... mutual recursion
    I just worded my response poorly. The main idea is that AtoA calls AtoB, which calls AtoA, and so on, instead of a single function calling itself, which is how your example avoided the unneeded copy back.

    My guess is that the 12% difference in performance between top down and bottom up comes from the overhead of recursion for the index generation / usage, and the fact that the bottom up method uses iteration instead of recursive "splits" to generate the n/2 pairs of indices for the initial run size of 1.

    I didn't mention this before, but the bottom up merge has an optimization: if the number of merge passes will be odd, it swaps pairs in a[] (only if needed), replacing the first merge pass, so that the back and forth merge process ends up merging back into a[].
    Last edited by rcgldr; 08-23-2013 at 05:10 AM.

  10. #70
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    MergeSortAtoA starts with the data in one buffer and ends up with the data in the same buffer. It calls MergeSortAtoB twice which performs the usual merges from the first buffer to the second, then with the two resulting runs of data in the second buffer, it performs a final merge back to the first buffer. I guess you could say that there is mutual recursion between them.
    Quote Originally Posted by rcgldr View Post
    I didn't mention this before, but the bottom up merge has an optimization: if the number of merge passes will be odd, it swaps pairs in a[] (only if needed), replacing the first merge pass, so that the back and forth merge process ends up merging back into a[].
    The top down merge handles an odd number of passes by copying single elements from a[] to b[]:

    Code:
    template <class T>
    void MergeSortAtoA(T a[], T b[], const int l, const int r) {
        if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoB(a, b, l, m);
            MergeSortAtoB(a, b, m, r);
            MergeAtoB(b, a, l, m, r); //Note a & b reversed (this is B->A)
        }
    }
    
    template <class T>
    void MergeSortAtoB(T a[], T b[], const int l, const int r) {
        if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoA(a, b, l, m);
            MergeSortAtoA(a, b, m, r);
            MergeAtoB(a, b, l, m, r);
        } else if (r - l == 1) {     // if run size == 1
            b[l] = a[l];             // copy single element from a[] to b[]
        }
    }
    Changing these functions to duplicate the bottom up optimization (swapping in place in a[] when run size == 2) increased performance by 9.8%, reducing the difference between top down and bottom up to 2% (bottom up only slightly faster).

    Code:
    template <class T>
    void MergeSortAtoA(T a[], T b[], const int l, const int r) {
        if (r - l == 2) {             // if run size == 2
            if( a[l]   > a[r-1]){     //   swap in a[]
                T x    = a[l];
                a[l]   = a[r-1];
                a[r-1] = x;
            }
        } else if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoB(a, b, l, m);
            MergeSortAtoB(a, b, m, r);
            MergeAtoB(b, a, l, m, r); //Note a & b reversed (this is B->A)
        }
    }
    
    template <class T>
    void MergeSortAtoB(T a[], T b[], const int l, const int r) {
        if (r - l > 1) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoA(a, b, l, m);
            MergeSortAtoA(a, b, m, r);
            MergeAtoB(a, b, l, m, r);
        }
    }
    Last edited by rcgldr; 08-23-2013 at 02:40 PM.

  11. #71
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Oh I see what you've done there.
    Yeah, in practice one would detect when the number of items to merge is below a certain threshold and just perform an insertion sort.
    E.g.
    Code:
        if (r - l > 10) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoB(a, b, l, m);
            MergeSortAtoB(a, b, m, r);
            MergeAtoB(b, a, l, m, r); //Note a & b reversed (this is B->A)
        } else
            InsertionSort(&a[l], r-l);
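    InsertionSort isn't defined anywhere in this thread; a minimal sketch of what it might look like, matching the call above (illustrative only):

```cpp
template <class T>
void InsertionSort(T a[], int n) {
    for (int i = 1; i < n; ++i) {
        T x = a[i];                        // value to insert
        int j = i;
        while (j > 0 && x < a[j-1]) {      // shift larger values right
            a[j] = a[j-1];
            --j;
        }
        a[j] = x;                          // drop it into place
    }
}
```

    Using strict < for the shift keeps equal elements in their original order, so the merge sort stays stable.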
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  12. #72
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by rcgldr View Post
    By changing these functions to duplicate the bottom up optimization to swap in a[] for an odd number of passes, the performance increased by 9.8%, reducing the difference in performance between top down and bottom up to 2% (bottom up only slightly faster).
    Note that change only helps if there is an odd number of passes. It decreases performance by about 1% for an even number of passes, apparently because the check is done on every call to MergeSortAtoA. The bottom up sort only makes that check one time, so it's not affected.

    Quote Originally Posted by iMalc View Post
    Oh I see what you've done there. Yeah in practicality, one would detect when the number of items to merge is below a certain threshold and just perform an insertion sort.
    Yes; MergeSortAtoA would need to end up with the insertion sorted run in the same buffer a[], and MergeSortAtoB would need the insertion sorted run to end up in the other buffer b[], to handle the even / odd number of passes issue.
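    One way to sketch the AtoB variant (InsertionSortAtoB is a hypothetical helper, not code from this thread): insert each element of a[l..r) directly into its sorted position in b[], so the sorted run lands in the other buffer without a separate copy step.

```cpp
template <class T>
void InsertionSortAtoB(const T a[], T b[], const int l, const int r) {
    for (int i = l; i < r; ++i) {          // insert a[i] into sorted b[l..i)
        int j = i;
        while (j > l && a[i] < b[j-1]) {   // shift larger values right
            b[j] = b[j-1];
            --j;
        }
        b[j] = a[i];
    }
}
```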

  13. #73
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    I updated MergeSortAtoB with a similar change for run size == 2, copying 2 values from a[] to b[] when the pass count is even. It's only a 1% gain in performance for an even pass count. The change to MergeSortAtoA for an odd pass count swaps in place, only about half the time, and the swaps should always be in the cache; it's a 9.8% gain for odd pass counts. The bottom up merge sort (from post #66) is still faster: by 10.9% for an even pass count, and by 2% for an odd pass count.

    The 1% performance increase from the change to MergeSortAtoB is an indicator of the top down overhead: it replaces two calls to MergeSortAtoA that just return (since the split run size == 1) and a call to MergeAtoB with inline code that copies / swaps the two values from a[] to b[].

    Here is the updated code for MergeSortAtoA and MergeSortAtoB:

    Code:
    template <class T>
    void MergeSortAtoA(T a[], T b[], const int l, const int r) {
        if (r - l > 2) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoB(a, b, l, m);
            MergeSortAtoB(a, b, m, r);
            MergeAtoB(b, a, l, m, r); //Note a & b reversed (this is B->A)
        } else if (r - l == 2) {      // if run size == 2
            if( a[l]   > a[r-1]){     //   swap in a[]
                T x    = a[l];
                a[l]   = a[r-1];
                a[r-1] = x;
            }
        }
    }
    
    template <class T>
    void MergeSortAtoB(T a[], T b[], const int l, const int r) {
        if (r - l > 2) {
            const int m = (l + r)>>1; //midpoint
            MergeSortAtoA(a, b, l, m);
            MergeSortAtoA(a, b, m, r);
            MergeAtoB(a, b, l, m, r);
        } else if (r - l == 2) {      // if run size == 2
            if( a[l]   > a[r-1]){     //   copy or swap to b[]
                b[r-1] = a[l];
                b[l]   = a[r-1];
            } else {
                b[l]   = a[l];
                b[r-1] = a[r-1];
            }
        }
    }
    Last edited by rcgldr; 08-23-2013 at 08:21 PM.

  14. #74
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by rcgldr View Post
    Yes; MergeSortAtoA would need to end up with the insertion sorted run in the same buffer a[], and MergeSortAtoB would need the insertion sorted run to end up in the other buffer b[], to handle the even / odd number of passes issue.
    I wouldn't bother adding it to MergeSortAtoB because it'll call MergeSortAtoA soon enough and the Insertion Sort can just be done there, albeit with half the number of items.
    E.g. with a cutoff of 10 it might have 16 items the first time and not perform an Insertion Sort, next time it will have 4 items which is below the cutoff.
    Not doing the Insertion Sort when there were 8 items isn't really a loss, because it would have had to copy them across as well, so you win some, you lose some.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  15. #75
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Quote Originally Posted by iMalc View Post
    optimizing ... MergeSortAtoA ... MergeSortAtoB ...
    I like the simplicity of the original versions. The changes I made were more of an experiment than a suggestion to change them.

    - - - benchmarking - - -

    Quote Originally Posted by iMalc View Post
    Breadth first makes many passes over the data, where each pass processes the data once and then leaves it to be likely pushed out of the cache. Depth first deals with many values close together multiple times before eventually the entire run it is building cannot fit in the cache at once etc.
    It's not clear to me where the time is being consumed. I haven't tried using VS's profiler to see if that would help.

    I made another version of the bottom up merge sort that doesn't include the swap on odd pass count; instead it returns a pointer to the sorted array, deleting the original array and returning the temp array when there's an odd number of passes.

    Testing with 10 million 64 bit integers, the non-swap version took 1.188 seconds, while the swap version took 1.166 seconds, less than a 2% gain. This makes sense: 10 million integers take 24 passes, so each pass moves 4.16% of the total data, and the 2% gain corresponds to the swaps only occurring about half of the time on the first pass.

    - - - cache - - -

    A bottom up merge is sequentially accessing 4 streams of memory, 2 streams are being read, and 2 streams are being written. In the early passes, run size is small and the offset between the two input streams and two output streams is small. Even in the later passes, the sequential nature of a bottom up merge sort may result in the 4 streams of data often being cached, depending on the cache architecture.

    On the other hand, the top down merge sort got a bigger improvement in performance (9.8%) from doing the swaps in a[] for odd pass count merges when run size == 2. That would seem to imply that more time is spent dealing with the smaller runs than with the larger runs. After the first 3 or 4 passes of a bottom up merge, each pass probably takes about the same amount of time (for random data).

    - - - importance of the final few passes - - -

    Quote Originally Posted by iMalc View Post
    The larger the merge the more time it takes and thus the more it affects the overall total.
    For the bottom up merge sort, there isn't much overhead to advance from one pair of runs to the next once the first 3 or 4 passes have increased the run size to 8 or 16.

    Quote Originally Posted by iMalc View Post
    The more the sizes of the runs being merged differ, the less work is being done in the pass.
    The main issue is an "odd" run that just gets copied. The worst case is a count of 2^n + 1, where the last element just gets copied on all but the final pass, and then the final pass merges 2^n elements with that 1 element.
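    That 2^n + 1 behavior can be counted directly: on each pass, check whether the final partial block has a second run to merge with. A small sketch (CopyOnlyPasses is a made-up name for illustration):

```cpp
#include <cstddef>

// For a bottom-up merge sort of n elements with run doubling, count the
// passes in which the last partial block has no partner run, so its
// elements are only copied across rather than merged.
std::size_t CopyOnlyPasses(std::size_t n) {
    std::size_t count = 0;
    for (std::size_t width = 1; width < n; width <<= 1) {
        std::size_t tail = (n / (2 * width)) * (2 * width); // start of last block
        if (tail < n && n - tail <= width)  // second run of the block is empty
            ++count;
    }
    return count;
}
```

    For n = 1025 (2^10 + 1) this gives 10 copy-only passes out of 11, matching the "just gets copied on all but the final pass" description.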
    Last edited by rcgldr; 08-23-2013 at 11:48 PM.

