Threaded Merge Sort: No performance gain. :(

**manasij7479** · 02-02-2012

The problem is that the threading has too much overhead to be useful here.(..as evident from the "sys" time taken.)
Yet, I've read in many places that merge sort is a good candidate for concurrency.

(Also, I'm not testing this on a very good computer now... When I can do so..(tomorrow), the results may improve. If you can test this on a juicy machine now, please do.)

I understand the following implementation could be little naive (any tips to improve is welcome).
Here is the code:

Code:

#include<algorithm>
#include<thread>
#include<cstdlib>
void merge_mix(int *array1,int count1,int* array2,int count2) //Merges and distributes the result over array1 and array2
{
    int ptr_t(0),ptr_1(0),ptr_2(0); //offsets to temp, array1, array2 
    int * temp = new int[count1+count2];


    while((ptr_1<count1) && (ptr_2<count2))
    {
        while((ptr_1<count1) && (*(array1+ptr_1)<=*(array2+ptr_2)))
        {
            *(temp+ptr_t++) = *(array1+ptr_1);
            ptr_1++;
        }
        if (ptr_1<count1)
        {
            while((ptr_2<count2) && (*(array2+ptr_2)<=*(array1+ptr_1)))
            {
                *(temp+ptr_t++) = *(array2+ptr_2);
                ptr_2++;
            }
        }
    }
    std::copy(array1+ptr_1,array1+ptr_1 + (count1-ptr_1),temp+ptr_t);
    std::copy(temp,temp+count1,array1);
    std::copy(array2+ptr_2,array2+ptr_2+(count2-ptr_2),temp+ptr_t);
    std::copy(temp+count1,temp+count1+count2,array2);


    delete [] temp;
}
void merge_sort(int data[],int size)
{
    if(size==1)return;
    if(size<10)
    {
        std::__insertion_sort(data,data+size);
        return;
    }
    merge_sort(data,size/2);
    merge_sort(data+size/2,(size+1)/2);
    merge_mix(data,size/2,data+size/2,(size+1)/2);
    return;
}
void threaded_merge_sort(int data[],int size)
{
    if(size==1)return;
    if(size<100)
    {
        std::__insertion_sort(data,data+size);
        return;
    }
    std::thread foo(threaded_merge_sort,data,size/2);
    std::thread bar(threaded_merge_sort,data+size/2,(size+1)/2);
    
    foo.join();
    bar.join();
    merge_mix(data,size/2,data+size/2,(size+1)/2);
    return;
}
int main(int argc,char** argv)
//arg1: Size of random int array.
//arg2: -t or -n ..threaded or normal
{
    srand(time(NULL));
    int* dat;
    int n = atoi(argv[1]);
    dat = new int[n];
    for(int i=0;i<n;i++)
        *(dat+i)=rand();
    
    if(argv[2][1]=='t')threaded_merge_sort(dat,n);
    else if(argv[2][1]=='n')merge_sort(dat,n);
}

Here is the testing script:

Code:

#!/bin/bash
#$1 has the number of ints to be sorted.
g++ a.cpp -std=c++0x -lpthread -o test
echo "Time Taken for normal version ($1 nos):"
time ./test $1 -normal
echo "Time Taken for threaded version ($1 nos):"
time ./test $1 -threaded

Result:

[manasij7479@manasijn ~]$ ./test.sh 10000
Time Taken for normal version (10000 nos):

real 0m0.013s
user 0m0.011s
sys 0m0.001s
Time Taken for threaded version (10000 nos):

real 0m0.315s
user 0m0.018s
sys 0m0.470s

(I also have another question about the merging part. Is it possible to skip the memory allocation and copy() ....and do all operations on the two arrays themselves ?)

**Elysia** · 02-02-2012

You can do merge sort inline. Just swap elements around.

**Codeplug** · 02-02-2012

>> ... threading has too much overhead to be useful here.
Yes, the amount of work that the thread performs needs to outweigh the cost of thread creation/destruction.

>> if(size<100)
That's not enough work per thread. I would instead split the work up based on the number of available cores in the system. If there are 4 cores, split it up 4 ways.

gg

**iMalc** · 02-02-2012

You shouldn't create two new threads and then leave the one that created them doing nothing. Change these lines:

Code:

    std::thread foo(threaded_merge_sort,data,size/2);
    std::thread bar(threaded_merge_sort,data+size/2,(size+1)/2);

    foo.join();
    bar.join();

to this:

Code:

    std::thread foo(threaded_merge_sort, data, size/2); // one extra thread
    threaded_merge_sort(data+size/2, (size+1)/2); // and one direct call to do the rest in this thread

    foo.join();

Secondly, having a cutoff of 100 instead of 10 is going to cause insertion sort to hurt performance. Change it back to 10. If you additionally want a cut-off before bothering to create threads, then do:

Code:

    if (size < 10)
    {
        std::__insertion_sort(data,data+size);
        return;
    }
    if (size < 100)
    {
        threaded_merge_sort(data, size/2);
        threaded_merge_sort(data+size/2, (size+1)/2);
    }
    else
    {
        std::thread foo(threaded_merge_sort, data, size/2); // one extra thread
        threaded_merge_sort(data+size/2, (size+1)/2); // and one direct call to do the rest in this thread
        foo.join();
    }

In fact you don't even need the if (size==1) test. Just let that case go into insertion sort and immediately realise that there is no work to be done. It'll practically never be true anyway because you never subdivide down to that level. It's only true if there was only 1 item to begin with. In fact the test doesn't help when you try and sort zero items, it's up to the insertion sort to save you from a bug here in that case.

A better implementation would also cap the number of threads. E.g. at 16, or perhaps the number of cores.

Lastly, because different threads may sometimes be reading and writing data that is adjacent in memory, the CPU will be stalled from time to time while the cores synchronise their data. Splitting the data into separately allocated chunks for each thread, before the merging step could help, or dividing the work up into powers of two (as is done in breadth-first merge sort) could help there.

Now make sure you try it on a much larger number of items so you can really see the difference.

**brewbuck** · 02-02-2012

First, your workload seems incredibly small. A hundredth of a second of processing? The overhead will eat you alive.

You're spawing threads at each level. The threads then terminate. Then you spawn more...

That'll ruin you.

**manasij7479** · 02-03-2012

Okay.. here is the second version... Does it look better ?

Code:

void threaded_merge_sort_v2(int data[],int size)
{
	static int thread_count(0) ; // <-- Does this need a lock ?
	std::vector<std::thread> tbase;
	std::vector<std::future<void>> fbase;
	if(size<10)
	{
		std::__insertion_sort(data,data+size);
		return;
	}
	if(thread_count<MAX_THREADS)
	{
		tbase.push_back(std::thread(threaded_merge_sort_v2,data,size/2));
		thread_count++;
	}
	else	fbase.push_back(std::move(std::async(threaded_merge_sort_v2,data,size/2)));
	threaded_merge_sort_v2(data+size/2,(size+1)/2);


	if(!tbase.empty())
	{
		tbase[0].join();
		tbase.pop_back();
		thread_count--;
	}
	if(!fbase.empty())
	{
		fbase[0].wait();
		fbase.pop_back();
	}
	merge_mix(data,size/2,data+size/2,(size+1)/2);
	return;
}

Originally Posted by iMalc

In fact you don't even need the if (size==1) test.

Surely no function call is better than one which immediately returns ?

Originally Posted by iMalc

Lastly, because different threads may sometimes be reading and writing data that is adjacent in memory, the CPU will be stalled from time to time while the cores synchronise their data. Splitting the data into separately allocated chunks for each thread, before the merging step could help, or dividing the work up into powers of two (as is done in breadth-first merge sort) could help there.

Cores synchronizing registers seems better that copying and allocating potentially huge chunks of memory to me.

Originally Posted by brewbuck

You're spawing threads at each level. The threads then terminate. Then you spawn more...

How much does this version 2 remedy that ?
What else can I do ? .... (Without moving down to a lower level of abstraction... which I'm very scared of.. right now)

Also, I was thinking about dynamically setting the cutoff to insertion sort to a number calculated from the size of the input and the number of cores in the CPU.
Is it viable ?

**iMalc** · 02-03-2012

Originally Posted by manasij7479

Okay.. here is the second version... Does it look better ?

In some ways certainly, but it's now using some concepts I'm unfamiliar with. e.g. std::async. It seems to be just another way of doing threading. So I don't understand why that's there now.

Surely no function call is better than one which immediately returns ?

The way you should think of it is, that surely having an if-statement there which is never true unless you were only sorting one item from the beginning is just useless excess baggage.

Cores synchronizing registers seems better that copying and allocating potentially huge chunks of memory to me.

It's not registers that synchronise, it's memory, or more specifically, cache. This can hurt performance a lot because if you have four cores working on the same cache line and one of them performs a write then the other three need to stop and get the update to that cacheline.

How much does this version 2 remedy that ?
What else can I do ? .... (Without moving down to a lower level of abstraction... which I'm very scared of.. right now)

It doesn't do much for that at the moment. Seeing as how this clearly isn't some kind of assignment, I'll show you what I had in mind. Basically instead of creating one large temporary array and merging the subarrays into that, I'd create two smaller sub-arrays, copy the data out, and then merge back into where it came from.

Also, I was thinking about dynamically setting the cutoff to insertion sort to a number calculated from the size of the input and the number of cores in the CPU.
Is it viable ?

The thing about insertion sort is that it's O(n*n) so doing something dynaimc where the cutoff goes much higher than about 16 is going to be worse for performance, and dropping that cutoff low is only going to negate the benefit of having it. So I'd just stick with around 10.

Stay tuned, I'll post my version shortly.

**manasij7479** · 02-03-2012

In some ways certainly, but it's now using some concepts I'm unfamiliar with. e.g. std::async. It seems to be just another way of doing threading. So I don't understand why that's there now.

It is some sort of preliminary task based concurrency.(..AFAIK..)
So.. After allocating a particular number of threads, I'm allowing the OS (?) to decide whether it can handle more. I still did the MAX_THREADS manually because the implementations are known to be somewhat buggy about the automatic launch policy.

Also, I was confused with the results the second version.
(On a Dual Core, it almost lives up to the expectations(reduces time by about 30%.. ...but even on my single core netbook... happens to reduce the time taken by 10%. The processor (Atom) clearly does something not advertized !)

Seeing as how this clearly isn't some kind of assignment,

..I'm still half an year away from beginning college, anyway..

**iMalc** · 02-03-2012

I believe this takes care of all that I spoke of:

Code:

// Merge the smaller arrays into the larger one (which we know is big enough)
void my_merge_mix(int *array1, int count1, int* array2, int count2, int *arrayOut)
{
	int i = 0, j = 0, c = 0;
	// merge
	while (i < count1 && j < count2)
		arrayOut[c++] = (array2[j] < array1[i]) ? array2[j++] : array1[i++];
	// copy remainder
	while (i < count1)
		arrayOut[c++] = array1[i++];
	while (j < count2)
		arrayOut[c++] = array2[j++];
}

// Sorry, this is the lock type I use most often, being a Windows programmer.
// Ironically std::thread is not implemented on VS2010, so yes this is all untested code.
CComAutoCriticalSection lock;
// Note that this still creates and destroys threads rather than keeping them around.
void threaded_merge_sort(int data[], int size)
{
	static int num_threads;
	if (size < 10)
	{
		std::__insertion_sort(data, data+size);
	}
	else
	{
		// spearate allocations for each core to work on without spoiling each other's cache
		std::vector<int> half1(data, size/2);
		std::vector<int> half2(data+size/2, (size+1)/2);

		// Lock to test if we should make another thread
		CComAutoCriticalSectionHolder holder(lock);
		if (size > 128 && num_threads < MAX_THREADS)
		{
			++num_threads;
			holder.Unlock();

			std::thread foo(threaded_merge_sort, &half1[0], half1.size());
			threaded_merge_sort(&half2[0], half2.size());
			foo.join();

			// Lock to signal that we're done with the extra thread
			holder.Lock();
			--num_threads;
			holder.Unlock();
		}
		else
		{
			holder.Unlock();
			threaded_merge_sort(&half1[0], half1.size());
			threaded_merge_sort(&half2[0], half2.size());
		}
		my_merge_mix(&half1[0], half1.size(), &half2[0], half2.size(), data);
	}
}

**manasij7479** · 02-03-2012

Thanks.
I'll take this for a spin tomorrow.

**Elysia** · 02-03-2012

mana, keep in mind that while it may seem more efficient to put an if statement rather than make a function call, the real world isn't always so white and black.
A function call is deterministic, and as thus, the processor will be able to keep the pipeline full. An if statement can cause a branch misprediction, which causes the pipeline to be flushed, costing a lot of time, and all the wrongly prefetched data and performed instructions must also be flushed. Then new data has to be fetched from memory.
All in all, it can be pretty expensive.

Golden rule: always test, and probably don't bother optimizing it unless you are sure it will really matter.

**iMalc** · 02-03-2012

Sorry, I got my vector initialisation wrong above. It should be:

Code:

        std::vector<int> half1(data, data + size/2);
        std::vector<int> half2(data + size/2, data + size);

**Codeplug** · 02-03-2012

>> static int thread_count(0) ; // <-- Does this need a lock ?
[s]In C++11, no. However, I don't know of a compiler that has implemented thread-safe local-statics yet.[/s]

[Edit]
Sorry, I confused myself

It's the first-time initialization of local-statics that is thread-safe. Accesses to thread_count require a lock.

gg

**manasij7479** · 02-04-2012

Here are the results:

1)v2 ( from post #6)

Time Taken for normal version (10000000 nos):

real 0m4.216s
user 0m4.170s
sys 0m0.043s
Time Taken for threaded version (10000000 nos):

real 0m5.888s
user 0m10.583s
sys 0m0.423s

2) v2 (after getting rid of the async and replacing it with a normal call)

Time Taken for normal version (10000000 nos):

real 0m4.181s
user 0m4.116s
sys 0m0.063s
Time Taken for threaded version (10000000 nos):

real 0m2.817s
user 0m4.733s
sys 0m0.133s

3) iMalc's Version:

Time Taken for normal version (10000000 nos):

real 0m4.447s
user 0m4.093s
sys 0m0.350s
Time Taken for threaded version (10000000 nos):

real 0m2.898s
user 0m4.443s
sys 0m0.330s

I think the similar results for 2 and 3 are due to my poor 2 core processor.

(If anyone wants to try out iMalc's version with standard locks, change the following: )

Code:

CComAutoCriticalSection lock;
to
std::mutex lock;

CComAutoCriticalSectionHolder holder(lock);
to
std::unique_lock<std::mutex> holder(lock);

//and all the upper case Lock `s and Unlock `s  to lower.

I'm very curious to see how this performs on a better machine.

**iMalc** · 02-04-2012

What was MAX_THREADS?
So, my threaded version was only a little over 50% quicker with two cores. That is probably about what should be expected as you're limited by RAM to some degree.
You are afterall, including in the timing the code that sets up the unsorted array. You might want to use a different method of timing, i.e. put the timing code inside the application.

FYI: I have a version of merge sort that merges both to and from the temporary array, meaning that there are no copying-back or copy-out-first steps. This reduces the number of copies significantly, providing easily the same amount of speedup that threading did here, but without the threading. Or there's the standard library version doesn't use a temporary array. It uses some tricky maths to merge in O(n+m) time and constant space.

Thread: Threaded Merge Sort: No performance gain. :(

Thread Tools

Search Thread

Display

Threaded Merge Sort: No performance gain. :(

Similar Threads

Multithreading no performance gain?

Quick Sort or Merge Sort???

threaded merge sort

Quick sort VS Merge Sort

merge sort and selection sort and time spent on both