Optimizing for parallell execution... on linux

Printable View

Show 80 post(s) from this thread on one page

02-04-2012
Elysia

2 Attachment(s)

Optimizing for parallell execution... on linux

Hello all,

Today, I am trying to optimize a N-queen puzzle (finding all solutions, not just one) board by parallellizing it. Easy, some might think.
...Except, the problem is that I am getting very different results on different platforms!

Running the code on Windows (with N=11), gives roughly 1.2 seconds for 1 thread. By adding another, I cut it roughly in half, and adding another decreases it to around 450 ms.
That would make sense, since I have a dual core on this windows machine.

However, the goal is to make this run on a linux machine (I get so tired of why I always have to use these damn linux machines -_-), which has 26 or 24 cores, if the info returned by the OS is to be believed.

The problem is that whenever I add more than one thread, it gets slower. I cannot fathom why. So I am seeking advice, to see if someone knows something I don't.

So I attached the source to the post. I used Boost for multithreading, since I don't know of any better for C++ (I can't use the standard's multi-threading facilities either).
The code must compile with GCC 4.4. This means I can't use lambdas (hence why I use boost::bind and boost::ref), nor nullptr (but I define that as NULL).

The command list I use for compiling is:

g++44 "Vecka 7-8.cpp" "stdafx.cpp" -std=c++0x --pedantic -O3 -Wall -Wextra -o foo -Dnullptr=NULL -Iboost_1_49_0_beta1 -L/afs/isk.kth.se/home/*snip*/boost_1_49_0_beta1/lib/lib -lboost_thread -lboost_date_time -DBOARD_SIZE=11 -DNUM_THREADS=2

(If you see anything wrong, please let me know. I am not too familiar with gcc's command line.)

I've built boost using shared libraries (couldn't get static libraries to work).

Any ideas/suggestions?
02-04-2012
brewbuck

You are queuing all work in a single FIFO which is queried by the threads. This is *the* classic reason why naive parallel implementations don't scale. When you lock you lose.
02-04-2012
Elysia

The time is takes to grab some work is miniscule compared to the time it takes to find a solution to the puzzle.
Besides, this is a so called "bag of tasks" implementation. It scales very well on Windows.
Results indicate 162% increase with two threads and 212% increase with 3 or more threads.
02-04-2012
phantomotap

Quote:

Besides, this is a so called "bag of tasks" implementation. It scales very well on Windows.

O_o

And yet, brewbuck is exactly right. The total cost of code versus locking may be vast, but the simple truth is you are creating a full mutex to do very little actual work. Have you considered cheaper locking primitives?

That said, I don't see "pthread" listed as part of the compiler/link command. I imagine, because you've clearly stated that you got an executable despite the missing "pthread" library, that "Boost" was built without native thread support. If "Boost" is having to use a more expensive thread type or faking the threads you'll see abysmal performance.

Soma
02-04-2012
Elysia

Quote:

Originally Posted by phantomotap

And yet, brewbuck is exactly right. The total cost of code versus locking may be vast, but the simple truth is you are creating a full mutex to do very little actual work. Have you considered cheaper locking primitives?

I am not aware of the cost of different lock primitives. What would you recommend?

Quote:

That said, I don't see "pthread" listed as part of the compiler/link command. I imagine, because you've clearly stated that you got an executable despite the missing "pthread" library, that "Boost" was built without native thread support. If "Boost" is having to use a more expensive thread type or faking the threads you'll see abysmal performance.

I have no idea, but if so... dang. Any ideas on how to do that? Just getting boost to build on Linux was a pain.
02-04-2012
phantomotap

Quote:

What would you recommend?

*shrug*

Unfortunately, different systems have different notions of what really constitutes a "binary semaphore", "general semaphore", "full mutual exclusive lock", "spinlock", "critical sections", and so on. It may have changed, but as I recall the "Boost" notion of a mutex used to be that a system wide mutually exclusive lock. That's an expensive beast to create and destroy as it needs to deal with operating system primitives that provide scheduling and events.

Because you are doing so little work on an object only shared within process you could get by with a simple, usually dirt cheap, "spinlock".

Quote:

Any ideas on how to do that? Just getting boost to build on Linux was a pain.

Nope. I rarely use "Boost". I suggest look around, and as a last resort post on the "Boost" mailing list, for any method to confirm or deny the use of "pthreads".

Soma
02-04-2012
iMalc

It you're on Windows, a simple critical section will suffice.

brewbuck, the "When you lock you lose." comment is over the top. As long as the lock only encompasses the tinyest of operations (e.g. just poping off the next task) then it does scale tremendously well. Sure, a single queue wont scale across a huge cluster of supercomputers, but we're only taking about within the confines of a single process here.
Often times the only alternative is a lockless queue. That has it's share of issues, and I've never seen a case where it was either warranted, or would be much better.
You need to examine the contention count of a critical section for example, to see whether the locking is a slight bottleneck.
02-04-2012
Elysia

The lock is not a bottleneck. I have tried removing it, but it does not help.
So it may be a numa system or something else weird. But the point is, I can't get it to execute faster. Only slower.
I'm going to have to ask the teacher on this one. I'll probably end up just making it run on Windows, too. Hopefully.
02-04-2012
phantomotap

[Edit]Please understand that I'm not trying to be condescending. A lot of people are under the impression that building highly scalable systems simply means throwing the right kind of threading primitives at the code from simple, strait forward non-concurrent algorithms. A lot of people are wrong.[/Edit]

Quote:

As long as the lock only encompasses the tinyest of operations (e.g. just poping off the next task) then it does scale tremendously well.

Using a simple "locking work queue" doesn't scale well at all. How much work being done by a specific thread while holding the lock becomes less important as the number of threads grows. Wait! Before you hasten to respond with "You should hold a lock for as little time as possible.", take a second to understand that I'm still talking about a few dozens of ticks. That small fraction of time becomes meaningless as the number of threads grows because every thread that has finished its allotted work is waiting on the lock. That is extremely important in understand scalability of services. A service built from many threads may use locking primitives that cost only a few ticks, but a simple application of the "locking work queue" principle means that threads have to wait in a single file queue for more work. Crucially, while a thread is waiting, it isn't doing anything useful; that simple bit of truth is why naive parallel algorithms don't scale.

Take an example for your time: a service built from 4 threads may have 3 threads waiting on that lock for a few ticks. It doesn't sound like a big deal. The thread holding the lock is already done. The queue is free to provide work for another thread. A different thread grabs a lock; now two threads are waiting on that same lock for a few ticks. It is important to realize that at this moment only one thread is doing real work. Again, the thread holding the lock has new work to do and is finished with the lock. We have to repeat the process again; another thread grabs the lock leaving one thread doing nothing other than waiting. Finally, the thread waiting all this time (less than second) has access to the queue. It can finally get new work to do.

It really doesn't seem like much of an issue does it? Unfortunately, reality likes to ........ on simple ideas. At some point, and in the real world this happens readily as the number of real threads providing a service increases, the threads don't have enough work to keep them busy until the work queue is available. In other words, the cost of a serializing access to the work queue grows unreasonable because it is a queue.

Contrast the simple "locking work queue" type of concurrency with systems of atomic transactions or observer based systems using multiple queues (generally per thread). The locking is still required but less of an issue because only one thread is ever waiting while every other thread involved in the service is busy doing something useful.

Soma
02-04-2012
iMalc

That all sounds great in theory, but in practice that's not what happens.

I've worked on a service that has 500-1000 threads on some sites. (We're planning on a redesign some time to use thread pools and cut this down) and this thing needs to run for as long as possible, ideally uninterrupted for a year or so, but at least typically several months. These are running on top of the line PCs too with sometimes 16 cores etc.
Dumps of such processes might say have a lock contention count of 3 for those that lock for very little duration. That's 3 times out of who knows how many billions of times, that there was ever any waiting that happened at all, and even then it was only for a few clock cycles. There was probably never more than one thread waiting at once. Heck it takes more time to switch threads than time taken during the time that the lock is held.

The bottom line is that if the operations during the lock amount to just a few clock cycles and the operations outside of the lock consititute durations on the order of 100's - 10000's of milliseconds say, then in reality it does scale well. Everything I have seen supports this, and I have seen plenty.

As Elysia too experiences here, the locking is not the problem.

Maybe it wont scale to 1 million threads, in fact probably not, but for the kind of numbers it does need to scale to in reality, I know what I've used in the past is perfectly fine.
02-04-2012
Codeplug

My 2¢: Uncontended locks can be fast. Lock contention does not scale.

gg
02-05-2012
iMalc

If I had to guess, I'd say that Linux is forcing any extra threads you create to run on the same core, like there's some affinity mask set on the application. Any way you can check that?
02-05-2012
grumpy

Traditionally, under most flavours of unix, process creation has been fairly lightweight, and thread creation is only a little less costly. Whereas, under windows, process creation is quite heavyweight and thread creation is (relatively speaking) very lightweight.

Although most modern versions of unix are multicore and multi-processor aware, there is a long way to go in enhancing the performance of multithreaded applications. That includes things like all threads in a process often being deployed on the same processor/core by default.

As Elysia and iMalc have said, locking is not - in itself - a particular performance problem on most systems, although there are benefits to using mechanisms that are local to a process rather than system wide (eg a critical section versus a system-wide mutex). It is, generically speaking, important to ensure that applications minimise their usage of locks, minimise contention between threads that use a particular lock, and to minimise the amount of work that any thread does while holding a lock. Beyond that, the limiting factor on performance is architecture of the operating system (for example, if it distributes load between cores for a multithreaded program, what it does to resolve contention when multiple threads try to grab a lock that is already held) and the only way to evaluate that, from an application perspective, is by detailed measurements. Some unix variants do that really well, some don't - with some distributions it is not even a priority and, with others, there are still substantial evolutions happening between versions, whether commercial product or not. Similarly, some RTOS's do it blindingly well, albeit with a need to get into the details and tune the system, some don't.
02-05-2012
Elysia

Quote:

Originally Posted by iMalc

If I had to guess, I'd say that Linux is forcing any extra threads you create to run on the same core, like there's some affinity mask set on the application. Any way you can check that?

I am trying to use linux system call to determine that, but so far, all I've got is a compile error:

Code:

pid_t pid = getpid(); cpu_set_t affinity; sched_getaffinity(pid, sizeof(affinity), &affinity); unsigned long long tmp; for (int i = 0; i < 64; i++) if (CPU_ISSET(i+1, affinity)) tmp |= (i+1); std::cout << "Hello from thread scheduled on " << std::hex << tmp << "!\n";

error: base operand of '->' has non-pointer type 'cpu_set_t'

EDIT: Found another way:

[peklof@sky4 ~]$ ./foo &
[1] 23142
[peklof@sky4 ~]$ taskset -p 23142
pid 23142's current affinity mask: ffffff

So no, it is set to execute on all cpus. Unless this changes from thread to thread.
02-05-2012
CornedBee

Some observations:
- Boost without native thread support doesn't have Boost.Threads. I don't think this is a problem. Also, I'm pretty sure Boost.Thread has a mutex optimized for single-process access, since there is no way to get access from another process.
- I find your names confusing. The BoardCoordinates class doesn't make sense to me. Neither does GetBoard (it gets a field).
- Why is GetBoard even a bound function? Why not make Board a proper class and give it an indexing call operator, so that you can write Board(x, y)? Boost.Bind can act as an optimization barrier (it holds a function pointer), and you definitely want the field access to be inlined.
- Given that your work queue is pretty much known at compile time, static work division by index would be far more efficient than the work queue: no locking, no fetching elements from a stack.
- Whether you use a work queue or computed work division, a better choice as the basis of the implementation would be the Intel Threading Building Blocks (TBB) library. It has higher-level operations, such as a parallel for loop. You could model your problem very well with a parallel for loop, I think.
- I think what's killing you is contention on the memory allocator. You have a lot of allocations (lots of vectors local to functions that are called a lot, for example).
- But that's just a guess. Use oprofile or a similar tool to profile your program and find out what's really going on. gprof is the easiest to get working in my experience, but also the one that most distorts the result, because it is based on compile-time program instrumentation.

Show 80 post(s) from this thread on one page