
Threaded Merge Sort: No performance gain. :(

This is a discussion on Threaded Merge Sort: No performance gain. :( within the C++ Programming forums, part of the General Programming Boards category.

  1. #1
    Registered User manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    Kolkata@India
    Posts
    2,498

    Threaded Merge Sort: No performance gain. :(

    The problem is that the threading has too much overhead to be useful here (as is evident from the "sys" time taken).
    Yet, I've read in many places that merge sort is a good candidate for concurrency.

    (Also, I'm not testing this on a very good computer right now. When I can do so (tomorrow), the results may improve. If you can test this on a juicy machine now, please do.)

    I understand the following implementation could be a little naive (any tips to improve it are welcome).
    Here is the code:
    Code:
    #include<algorithm>
    #include<thread>
    #include<cstdlib>
    #include<ctime>   // for time(), used to seed rand() in main()
    void merge_mix(int *array1,int count1,int* array2,int count2) //Merges and distributes the result over array1 and array2
    {
        int ptr_t(0),ptr_1(0),ptr_2(0); //offsets to temp, array1, array2 
        int * temp = new int[count1+count2];
    
    
        while((ptr_1<count1) && (ptr_2<count2))
        {
            while((ptr_1<count1) && (*(array1+ptr_1)<=*(array2+ptr_2)))
            {
                *(temp+ptr_t++) = *(array1+ptr_1);
                ptr_1++;
            }
            if (ptr_1<count1)
            {
                while((ptr_2<count2) && (*(array2+ptr_2)<=*(array1+ptr_1)))
                {
                    *(temp+ptr_t++) = *(array2+ptr_2);
                    ptr_2++;
                }
            }
        }
        std::copy(array1+ptr_1,array1+ptr_1 + (count1-ptr_1),temp+ptr_t);
        std::copy(temp,temp+count1,array1);
        std::copy(array2+ptr_2,array2+ptr_2+(count2-ptr_2),temp+ptr_t);
        std::copy(temp+count1,temp+count1+count2,array2);
    
    
        delete [] temp;
    }
    void merge_sort(int data[],int size)
    {
        if(size==1)return;
        if(size<10)
        {
            std::__insertion_sort(data,data+size);
            return;
        }
        merge_sort(data,size/2);
        merge_sort(data+size/2,(size+1)/2);
        merge_mix(data,size/2,data+size/2,(size+1)/2);
        return;
    }
    void threaded_merge_sort(int data[],int size)
    {
        if(size==1)return;
        if(size<100)
        {
            std::__insertion_sort(data,data+size);
            return;
        }
        std::thread foo(threaded_merge_sort,data,size/2);
        std::thread bar(threaded_merge_sort,data+size/2,(size+1)/2);
        
        foo.join();
        bar.join();
        merge_mix(data,size/2,data+size/2,(size+1)/2);
        return;
    }
    int main(int argc,char** argv)
    //arg1: Size of random int array.
    //arg2: -t or -n ..threaded or normal
    {
        srand(time(NULL));
        int* dat;
        int n = atoi(argv[1]);
        dat = new int[n];
        for(int i=0;i<n;i++)
            *(dat+i)=rand();
        
        if(argv[2][1]=='t')threaded_merge_sort(dat,n);
        else if(argv[2][1]=='n')merge_sort(dat,n);
    }
    Here is the testing script:
    Code:
    #!/bin/bash
    #$1 has the number of ints to be sorted.
    g++ a.cpp -std=c++0x -pthread -o test
    echo "Time Taken for normal version ($1 nos):"
    time ./test $1 -normal
    echo "Time Taken for threaded version ($1 nos):"
    time ./test $1 -threaded
    Result:
    [manasij7479@manasijn ~]$ ./test.sh 10000
    Time Taken for normal version (10000 nos):


    real 0m0.013s
    user 0m0.011s
    sys 0m0.001s
    Time Taken for threaded version (10000 nos):


    real 0m0.315s
    user 0m0.018s
    sys 0m0.470s


    (I also have another question about the merging part. Is it possible to skip the memory allocation and copy() ... and do all operations on the two arrays themselves?)
    Last edited by manasij7479; 02-02-2012 at 05:24 AM.
    Manasij Mukherjee | gcc-4.8.2 @Arch Linux
    Slow and Steady wins the race... if and only if :
    1.None of the other participants are fast and steady.
    2.The fast and unsteady suddenly falls asleep while running !



  2. #2
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,538
    You can do merge sort in place. Just swap elements around.
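    For reference, the standard library already ships an in-place merge. A minimal sketch of a merge sort built on std::inplace_merge (just one reading of the suggestion above; note the library may still allocate an internal buffer, so this hides the temporary at the call site rather than truly eliminating it):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sketch: recursive merge sort whose merge step is std::inplace_merge,
// so no explicit temp array or copy-back appears at this level.
void inplace_merge_sort(int* data, int size)
{
    if (size < 2)
        return;
    int half = size / 2;
    inplace_merge_sort(data, half);
    inplace_merge_sort(data + half, size - half);
    std::inplace_merge(data, data + half, data + size);
}
```

    Called exactly like the original: inplace_merge_sort(dat, n).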
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  3. #3
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,660
    >> ... threading has too much overhead to be useful here.
    Yes, the amount of work that the thread performs needs to outweigh the cost of thread creation/destruction.

    >> if(size<100)
    That's not enough work per thread. I would instead split the work up based on the number of available cores in the system. If there are 4 cores, split it up 4 ways.
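    A sketch of that core-based split, assuming C++11's std::thread::hardware_concurrency() is available (it is allowed to return 0, so guard it). The 2048 cutoff below is an arbitrary illustration, not a tuned value:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Sketch: spawn a thread per half only while thread budget remains,
// instead of spawning at every recursion level.
void core_limited_sort(int* data, int size, unsigned threads_left)
{
    if (threads_left <= 1 || size < 2048)   // leaf: too little work or no cores left
    {
        std::sort(data, data + size);       // serial stand-in for the leaf sort
        return;
    }
    int half = size / 2;
    std::thread t([=] { core_limited_sort(data, half, threads_left / 2); });
    core_limited_sort(data + half, size - half, threads_left - threads_left / 2);
    t.join();
    std::inplace_merge(data, data + half, data + size);
}
```

    A caller would pass the core count, e.g. core_limited_sort(dat, n, std::thread::hardware_concurrency() ? std::thread::hardware_concurrency() : 2);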

    gg

  4. #4
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,299
    You shouldn't create two new threads and then leave the one that created them doing nothing. Change these lines:
    Code:
        std::thread foo(threaded_merge_sort,data,size/2);
        std::thread bar(threaded_merge_sort,data+size/2,(size+1)/2);
    
        foo.join();
        bar.join();
    to this:
    Code:
        std::thread foo(threaded_merge_sort, data, size/2); // one extra thread
        threaded_merge_sort(data+size/2, (size+1)/2); // and one direct call to do the rest in this thread
    
        foo.join();
    Secondly, having a cutoff of 100 instead of 10 is going to cause insertion sort to hurt performance. Change it back to 10. If you additionally want a cut-off before bothering to create threads, then do:
    Code:
        if (size < 10)
        {
            std::__insertion_sort(data,data+size);
            return;
        }
        if (size < 100)
        {
            threaded_merge_sort(data, size/2);
            threaded_merge_sort(data+size/2, (size+1)/2);
        }
        else
        {
            std::thread foo(threaded_merge_sort, data, size/2); // one extra thread
            threaded_merge_sort(data+size/2, (size+1)/2); // and one direct call to do the rest in this thread
            foo.join();
        }
    In fact you don't even need the if (size==1) test. Just let that case fall into the insertion sort, which will immediately realise that there is no work to be done. It will practically never be true anyway, because you never subdivide down to that level; it's only true if there was only one item to begin with. And the test doesn't help when you try to sort zero items; in that case it's up to the insertion sort to save you from a bug.

    A better implementation would also cap the number of threads. E.g. at 16, or perhaps the number of cores.

    Lastly, because different threads may sometimes be reading and writing data that is adjacent in memory, the CPU will be stalled from time to time while the cores synchronise their data. Splitting the data into separately allocated chunks for each thread, before the merging step could help, or dividing the work up into powers of two (as is done in breadth-first merge sort) could help there.

    Now make sure you try it on a much larger number of items so you can really see the difference.
    Last edited by iMalc; 02-02-2012 at 11:24 AM.
    manasij7479 likes this.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  5. #5
    Captain Crash brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,239
    First, your workload seems incredibly small. A hundredth of a second of processing? The overhead will eat you alive.

    You're spawning threads at each level. The threads then terminate. Then you spawn more...

    That'll ruin you.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  6. #6
    Registered User manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    Kolkata@India
    Posts
    2,498
    Okay.. here is the second version... Does it look better ?
    Code:
    void threaded_merge_sort_v2(int data[],int size)
    {
    	static int thread_count(0) ; // <-- Does this need a lock ?
    	std::vector<std::thread> tbase;
    	std::vector<std::future<void>> fbase;
    	if(size<10)
    	{
    		std::__insertion_sort(data,data+size);
    		return;
    	}
    	if(thread_count<MAX_THREADS)
    	{
    		tbase.push_back(std::thread(threaded_merge_sort_v2,data,size/2));
    		thread_count++;
    	}
    	else	fbase.push_back(std::move(std::async(threaded_merge_sort_v2,data,size/2)));
    	threaded_merge_sort_v2(data+size/2,(size+1)/2);
    
    
    	if(!tbase.empty())
    	{
    		tbase[0].join();
    		tbase.pop_back();
    		thread_count--;
    	}
    	if(!fbase.empty())
    	{
    		fbase[0].wait();
    		fbase.pop_back();
    	}
    	merge_mix(data,size/2,data+size/2,(size+1)/2);
    	return;
    }
    Quote Originally Posted by iMalc
    In fact you don't even need the if (size==1) test.
    Surely no function call is better than one which immediately returns ?

    Quote Originally Posted by iMalc
    Lastly, because different threads may sometimes be reading and writing data that is adjacent in memory, the CPU will be stalled from time to time while the cores synchronise their data. Splitting the data into separately allocated chunks for each thread, before the merging step could help, or dividing the work up into powers of two (as is done in breadth-first merge sort) could help there.
    Cores synchronizing registers seems better than copying and allocating potentially huge chunks of memory to me.

    Quote Originally Posted by brewbuck
    You're spawing threads at each level. The threads then terminate. Then you spawn more...
    How much does this version 2 remedy that ?
    What else can I do ? .... (Without moving down to a lower level of abstraction... which I'm very scared of.. right now)



    Also, I was thinking about dynamically setting the cutoff to insertion sort to a number calculated from the size of the input and the number of cores in the CPU.
    Is it viable ?
    Last edited by manasij7479; 02-03-2012 at 01:23 AM.



  7. #7
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,299
    Quote Originally Posted by manasij7479 View Post
    Okay.. here is the second version... Does it look better ?
    In some ways certainly, but it's now using some concepts I'm unfamiliar with. e.g. std::async. It seems to be just another way of doing threading. So I don't understand why that's there now.

    Surely no function call is better than one which immediately returns ?
    The way you should think of it is that having an if-statement there which is never true, unless you were only sorting one item from the beginning, is just useless excess baggage.

    Cores synchronizing registers seems better that copying and allocating potentially huge chunks of memory to me.
    It's not registers that synchronise, it's memory, or more specifically, cache. This can hurt performance a lot because if you have four cores working on the same cache line and one of them performs a write then the other three need to stop and get the update to that cacheline.

    How much does this version 2 remedy that ?
    What else can I do ? .... (Without moving down to a lower level of abstraction... which I'm very scared of.. right now)
    It doesn't do much for that at the moment. Seeing as how this clearly isn't some kind of assignment, I'll show you what I had in mind. Basically instead of creating one large temporary array and merging the subarrays into that, I'd create two smaller sub-arrays, copy the data out, and then merge back into where it came from.

    Also, I was thinking about dynamically setting the cutoff to insertion sort to a number calculated from the size of the input and the number of cores in the CPU.
    Is it viable ?
    The thing about insertion sort is that it's O(n*n), so doing something dynamic where the cutoff goes much higher than about 16 is going to be worse for performance, and dropping that cutoff low is only going to negate the benefit of having it. So I'd just stick with around 10.

    Stay tuned, I'll post my version shortly.

  8. #8
    Registered User manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    Kolkata@India
    Posts
    2,498
    In some ways certainly, but it's now using some concepts I'm unfamiliar with. e.g. std::async. It seems to be just another way of doing threading. So I don't understand why that's there now.
    It is some sort of preliminary task-based concurrency (AFAIK).
    So, after allocating a particular number of threads, I'm allowing the OS (?) to decide whether it can handle more. I still set MAX_THREADS manually because the implementations are known to be somewhat buggy about the automatic launch policy.

    Also, I was confused by the results of the second version.
    (On a dual core, it almost lives up to expectations and reduces the time by about 30%... but even on my single-core netbook it happens to reduce the time taken by 10%. The processor (Atom) clearly does something not advertised!)

    Seeing as how this clearly isn't some kind of assignment,
    ..I'm still half a year away from beginning college, anyway..
    Last edited by manasij7479; 02-03-2012 at 12:15 PM.



  9. #9
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,299
    I believe this takes care of all that I spoke of:
    Code:
    // Merge the smaller arrays into the larger one (which we know is big enough)
    void my_merge_mix(int *array1, int count1, int* array2, int count2, int *arrayOut)
    {
    	int i = 0, j = 0, c = 0;
    	// merge
    	while (i < count1 && j < count2)
    		arrayOut[c++] = (array2[j] < array1[i]) ? array2[j++] : array1[i++];
    	// copy remainder
    	while (i < count1)
    		arrayOut[c++] = array1[i++];
    	while (j < count2)
    		arrayOut[c++] = array2[j++];
    }
    
    // Sorry, this is the lock type I use most often, being a Windows programmer.
    // Ironically std::thread is not implemented on VS2010, so yes this is all untested code.
    CComAutoCriticalSection lock;
    // Note that this still creates and destroys threads rather than keeping them around.
    void threaded_merge_sort(int data[], int size)
    {
    	static int num_threads;
    	if (size < 10)
    	{
    		std::__insertion_sort(data, data+size);
    	}
    	else
    	{
    		// separate allocations for each core to work on without spoiling each other's cache
    		std::vector<int> half1(data, size/2);
    		std::vector<int> half2(data+size/2, (size+1)/2);
    
    		// Lock to test if we should make another thread
    		CComAutoCriticalSectionHolder holder(lock);
    		if (size > 128 && num_threads < MAX_THREADS)
    		{
    			++num_threads;
    			holder.Unlock();
    
    			std::thread foo(threaded_merge_sort, &half1[0], half1.size());
    			threaded_merge_sort(&half2[0], half2.size());
    			foo.join();
    
    			// Lock to signal that we're done with the extra thread
    			holder.Lock();
    			--num_threads;
    			holder.Unlock();
    		}
    		else
    		{
    			holder.Unlock();
    			threaded_merge_sort(&half1[0], half1.size());
    			threaded_merge_sort(&half2[0], half2.size());
    		}
    		my_merge_mix(&half1[0], half1.size(), &half2[0], half2.size(), data);
    	}
    }

  10. #10
    Registered User manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    Kolkata@India
    Posts
    2,498
    Thanks.
    I'll take this for a spin tomorrow.
    Last edited by manasij7479; 02-03-2012 at 01:02 PM.



  11. #11
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,538
    mana, keep in mind that while it may seem more efficient to put an if statement rather than make a function call, the real world isn't always so black and white.
    A function call is deterministic, and as such, the processor will be able to keep the pipeline full. An if statement can cause a branch misprediction, which causes the pipeline to be flushed, costing a lot of time, and all the wrongly prefetched data and performed instructions must also be flushed. Then new data has to be fetched from memory.
    All in all, it can be pretty expensive.

    Golden rule: always test, and probably don't bother optimizing it unless you are sure it will really matter.

  12. #12
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,299
    Sorry, I got my vector initialisation wrong above. It should be:
    Code:
            std::vector<int> half1(data, data + size/2);
            std::vector<int> half2(data + size/2, data + size);

  13. #13
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,660
    >> static int thread_count(0) ; // <-- Does this need a lock ?
    [s]In C++11, no. However, I don't know of a compiler that has implemented thread-safe local-statics yet.[/s]

    [Edit]
    Sorry, I confused myself. It's the first-time initialization of local statics that is thread-safe. Accesses to thread_count require a lock.
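    One lock-free alternative to a mutex around that counter, assuming <atomic> is available: make the check-and-increment a single atomic step with compare_exchange, so two threads can't both read the same value and then both increment. (MAX_THREADS here is a made-up cap, and try_reserve_thread/release_thread are hypothetical helper names.)

```cpp
#include <atomic>
#include <cassert>

const int MAX_THREADS = 8;          // illustrative cap, not a tuned value
std::atomic<int> thread_count(0);

// Atomically test-and-increment: the check against the cap and the ++
// happen as one step, so there is no read-then-write race.
bool try_reserve_thread()
{
    int cur = thread_count.load();
    while (cur < MAX_THREADS)
        if (thread_count.compare_exchange_weak(cur, cur + 1))
            return true;            // on failure, cur is reloaded and we retry
    return false;
}

void release_thread()
{
    thread_count.fetch_sub(1);
}
```

    The sorting code would then spawn a thread only when try_reserve_thread() returns true, and call release_thread() after joining it.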

    gg

  14. #14
    Registered User manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    Kolkata@India
    Posts
    2,498
    Here are the results:
    1)v2 ( from post #6)

    Time Taken for normal version (10000000 nos):

    real 0m4.216s
    user 0m4.170s
    sys 0m0.043s
    Time Taken for threaded version (10000000 nos):

    real 0m5.888s
    user 0m10.583s
    sys 0m0.423s

    2) v2 (after getting rid of the async and replacing it with a normal call)

    Time Taken for normal version (10000000 nos):

    real 0m4.181s
    user 0m4.116s
    sys 0m0.063s
    Time Taken for threaded version (10000000 nos):

    real 0m2.817s
    user 0m4.733s
    sys 0m0.133s

    3) iMalc's Version:

    Time Taken for normal version (10000000 nos):

    real 0m4.447s
    user 0m4.093s
    sys 0m0.350s
    Time Taken for threaded version (10000000 nos):

    real 0m2.898s
    user 0m4.443s
    sys 0m0.330s
    I think the similar results for 2 and 3 are due to my poor 2-core processor.

    (If anyone wants to try out iMalc's version with standard locks, change the following: )
    Code:
    CComAutoCriticalSection lock;
    to
    std::mutex lock;
    
    CComAutoCriticalSectionHolder holder(lock);
    to
    std::unique_lock<std::mutex> holder(lock);
    
    //and change all the uppercase Lock()s and Unlock()s to lowercase
    I'm very curious to see how this performs on a better machine.
    Last edited by manasij7479; 02-04-2012 at 08:55 AM.



  15. #15
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,299
    What was MAX_THREADS?
    So, my threaded version was only a little over 50% quicker with two cores. That is probably about what should be expected, as you're limited by RAM to some degree.
    You are, after all, including in the timing the code that sets up the unsorted array. You might want to use a different method of timing, i.e. put the timing code inside the application.
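    For instance, assuming C++11's <chrono>, a small helper like this (a sketch; seconds_taken is a made-up name) keeps the array setup out of the measurement:

```cpp
#include <cassert>
#include <chrono>

// Times just the callable itself, so setup such as filling the
// random array stays out of the reported number.
template <typename F>
double seconds_taken(F f)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

    Used as, e.g., double t = seconds_taken([&]{ merge_sort(dat, n); });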

    FYI: I have a version of merge sort that merges both to and from the temporary array, meaning that there are no copy-back or copy-out-first steps. This reduces the number of copies significantly, providing easily the same amount of speedup that threading did here, but without the threading. Or there's the standard library version, which doesn't use a temporary array; it uses some tricky maths to merge in O(n+m) time and constant space.
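    A sketch of that to-and-from idea (my reading of the description above, not iMalc's actual code): the recursion alternates which buffer is the source and which is the destination, so each level merges directly into where the data has to end up and there is no copy-back step:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Precondition: src[0..n) and dst[0..n) hold the same data.
// Sorts the halves so the sorted halves land in src, then merges
// them into dst -- the roles swap at every level of recursion.
static void split_merge(int* src, int* dst, int n)
{
    if (n < 2)
        return;                                 // 0/1 elements: dst already equals src
    int half = n / 2;
    split_merge(dst, src, half);                // sorted first half ends up in src
    split_merge(dst + half, src + half, n - half);
    std::merge(src, src + half, src + half, src + n, dst);
}

void pingpong_merge_sort(int* data, int n)
{
    if (n < 2)
        return;
    std::vector<int> aux(data, data + n);       // the single up-front copy
    split_merge(aux.data(), data, n);           // result lands back in data
}
```

    The one initial copy into aux is the only bulk copy performed; every merge after that writes straight into its final position.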
