Thread: Pthreads performance

  1. #31 - C_ntua (Registered User; joined Jun 2008; 1,853 posts)
    So I changed foxman's code: I put gettimeofday() in place of clock() and did it without mutexes, for the reasons I posted above.

    I get on an 8-core CPU:
    FOR 2 threads:
    It took 8.917935 sec
    FOR 8 threads:
    It took 2.158017 sec

    which makes perfect sense.
    But,
    FOR 16 threads:
    It took 2.158018 sec
    FOR 100 threads:
    It took 1.731838 sec
    FOR 200 threads:
    It took 0.958165 sec


    So you see what I mean.

  2. #32 - foxman (Chinese pâté; joined Jul 2007; Canada; 404 posts)
    Quote Originally Posted by matsp View Post
    Of course, the first thread to finish doesn't really measure all the work being done - much more time may well be consumed than that, since some threads may not even have STARTED the work yet.
    Are you sure? Maybe you missed it (or maybe I'm wrong, we'll find out sooner or later), but in
    Code:
        pthread_barrier_wait(&barrier);
        for (i = 0; i < NB_ITER_THREAD; i++);
        pthread_barrier_wait(&barrier);
    there are two calls to pthread_barrier_wait(). So, after the second call to pthread_barrier_wait() (by "after" I mean when the "first" thread returns from the call), all the "work" will have been done. And by "work", to be sure we are talking about the same thing, I'm talking about the NB_ITER_TOTAL iterations of the loop (that is, every thread will have done its NB_ITER_THREAD iterations).

    Quote Originally Posted by C_ntua
    ... So each thread would have an ID. Instead of mutexes, the thread with ID == 0 could calculate the time.
    But why should we care about IDs? I'm using mutexes so I can know which is the last thread to reach the first barrier, and which is the first thread to leave the second barrier. And I need to put the mutexes outside the two pthread_barrier_wait() calls, or they will "serialize" the threads and produce incorrect results.

    I really don't know why you would remove the mutexes; it wouldn't make sense (to me).
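    Here's a minimal sketch of that pattern (a reconstruction for illustration, not the actual code from this thread; NB_THREADS, NB_ITER_THREAD and the variable names are assumptions): a mutex-protected counter identifies the last thread to reach the first barrier (start the clock) and the first thread to leave the second barrier (stop the clock).
    Code:
    #include <pthread.h>
    #include <sys/time.h>

    #define NB_THREADS     8
    #define NB_ITER_THREAD 100000000L

    /* main() is assumed to call pthread_barrier_init(&barrier, NULL, NB_THREADS) */
    static pthread_barrier_t barrier;
    static pthread_mutex_t   lock = PTHREAD_MUTEX_INITIALIZER;
    static int arrived = 0, left = 0;
    static struct timeval startTime, endTime;

    static void *worker(void *arg)
    {
        volatile long i;

        pthread_mutex_lock(&lock);
        if (++arrived == NB_THREADS)        /* last thread to reach the first barrier */
            gettimeofday(&startTime, NULL); /* ...starts the clock */
        pthread_mutex_unlock(&lock);

        pthread_barrier_wait(&barrier);
        for (i = 0; i < NB_ITER_THREAD; i++)
            ;                               /* the "work" */
        pthread_barrier_wait(&barrier);

        pthread_mutex_lock(&lock);
        if (++left == 1)                    /* first thread to leave the second barrier */
            gettimeofday(&endTime, NULL);   /* ...stops the clock */
        pthread_mutex_unlock(&lock);
        return NULL;
    }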
    I hate real numbers.

  3. #33 - matsp (Kernel hacker; joined Jul 2007; Farncombe, Surrey, England; 15,677 posts)
    I don't think foxman's code is correct - it measures the time from when the last thread starts until the first one finishes, which is not when ALL threads finish. As a sanity check, add a measure of the time from before the first thread is created to after the last one ends in main (like the "tt" calculation in my code), and you should see that the timing is not correct. Of course, the first thread may finish before the last one has even started if you create LOTS of threads - it's slightly less likely with fewer threads, since there is a smaller window of opportunity and each thread runs for longer.

    Also, there is NO WAY that running more threads than the number of CPU cores you have will ever be quicker (in overall time) than running X threads where X == number of processors, because there will be SOME extra work involved in creating and switching between threads - the minimum work for a context switch on an x86 processor is something like:
    Code:
    __naked void switchthread(context *newContext)
    {
        __asm
        {
            // eax, ecx and edx do not need to be preserved, as the C calling convention assumes
            // those are destroyed on a C function call anyways. 
            push ebx
            push esi
            push edi
            push ebp
            mov  eax, dword ptr currContext
        mov  dword ptr [eax+SPOFFSET], esp   // save the old context's stack pointer
        mov  eax, dword ptr newContext
        mov  dword ptr currContext, eax
        mov  esp, dword ptr [eax+SPOFFSET]   // switch to the new context's stack
            pop   ebp
            pop   edi
            pop   esi
            pop   ebx
            ret
        }
    }
    This is the very least you can expect a context switch to do - in reality, there is most likely a whole heap more work done in the context switch. The above probably corresponds to 10-15 iterations of the loop in the a() function. A more realistic, full-fledged thread-switching implementation would most likely cost more than 100 iterations of an empty(ish) loop.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  4. #34 - C_ntua (Registered User; joined Jun 2008; 1,853 posts)
    matsp:
    Well, you do get a better result for 2 and 5 threads. A small difference, though...
    But, generally, if you may, try again without the z=z+1;
    z=z+1 means that all threads try to write/read the same memory location, so there would be a delay. Make z a local variable of function a() if you want.
    Also, see if you indeed get better results for 2 and 5 threads, even slightly better.
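    Here's a minimal sketch of that change (an illustration; NB_ITER_THREAD and the function name a() are taken from earlier snippets, the rest is assumed): instead of every thread incrementing one shared global z, each thread keeps its own counter on its stack, so no cache line has to bounce between cores.
    Code:
    #define NB_ITER_THREAD 100000000L

    long z;                             /* shared: every thread hits this location */

    void *a(void *arg)
    {
        long i, local_z = 0;            /* private: no cross-core memory traffic */
        for (i = 0; i < NB_ITER_THREAD; i++)
            local_z = local_z + 1;      /* note: a release build may optimize this
                                           away entirely; see the next posts */
        return NULL;
    }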

    Thanks a lot!

  5. #35 - matsp (Kernel hacker; joined Jul 2007; Farncombe, Surrey, England; 15,677 posts)
    Quote Originally Posted by foxman View Post
    Are you sure? Maybe you missed it (or maybe I'm wrong, we'll find out sooner or later), but in
    Code:
        pthread_barrier_wait(&barrier);
        for (i = 0; i < NB_ITER_THREAD; i++);
        pthread_barrier_wait(&barrier);
    there are two calls to pthread_barrier_wait(). So, after the second call to pthread_barrier_wait() (by "after" I mean when the "first" thread returns from the call), all the "work" will have been done. And by "work", to be sure we are talking about the same thing, I'm talking about the NB_ITER_TOTAL iterations of the loop (that is, every thread will have done its NB_ITER_THREAD iterations).
    You are right. Your code measures the right thing.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  6. #36 - matsp (Kernel hacker; joined Jul 2007; Farncombe, Surrey, England; 15,677 posts)
    Quote Originally Posted by C_ntua View Post
    matsp:
    Well, you do get a better result for 2 and 5 threads. A small difference, though...
    But, generally, if you may, try again without the z=z+1;
    z=z+1 means that all threads try to write/read the same memory location, so there would be a delay. Make z a local variable of function a() if you want.
    Also, see if you indeed get better results for 2 and 5 threads, even slightly better.

    Thanks a lot!
    Yes, and the compiler (at least in release mode) will optimize the entire loop into nothing (empty loop -> not needed).
    Reading and writing the same memory location from the same thread is not a problem - yes, it takes one extra clock cycle or so. Of course, it doubles the time of the loop, since i++ and z++ (or z = z + 1) will each take the time it takes to increment a variable, and the rest of the loop is a predictable branch.
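    (One way to stop release builds from deleting the loop - a suggested fix, not code from this thread - is to make the counter volatile, which forces the compiler to perform every increment:)
    Code:
    void *a(void *arg)
    {
        volatile long z = 0;
        long i;
        for (i = 0; i < 100000000L; i++)
            z = z + 1;                  /* must actually execute, even optimized */
        return NULL;
    }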

    For laughs, here's the same run without it:
    Code:
    E:\proj\threads\Debug>threads.exe 1
    total time: 2.03270, sum=2.03257, avg=2.03257, max=2.03257, min=2.03257
    
    E:\proj\threads\Debug>threads.exe 2
    total time: 2.03495, sum=3.90264, avg=1.95132, max=1.95143, min=1.95121
    
    E:\proj\threads\Debug>threads.exe 5
    total time: 2.03316, sum=8.77349, avg=1.75470, max=1.93551, min=1.52791
    
    E:\proj\threads\Debug>threads.exe 10
    total time: 2.03305, sum=10.67696, avg=1.06770, max=1.62267, min=0.20301
    
    E:\proj\threads\Debug>threads.exe 20
    total time: 2.03670, sum=11.28223, avg=0.56411, max=1.22387, min=0.10130
    // and release - note these values are not valid, as the loop is not iterating:
    E:\proj\threads\Release>threads.exe 1
    total time: 0.00011, sum=0.00000, avg=0.00000, max=0.00000, min=0.00000
    
    E:\proj\threads\Release>threads.exe 2
    total time: 0.00018, sum=0.00000, avg=0.00000, max=0.00000, min=0.00000
    
    E:\proj\threads\Release>threads.exe 5
    total time: 0.00041, sum=0.00000, avg=0.00000, max=0.00000, min=0.00000
    
    E:\proj\threads\Release>threads.exe 10
    total time: 0.00078, sum=0.00000, avg=0.00000, max=0.00000, min=0.00000
    
    E:\proj\threads\Release>threads.exe 20
    total time: 0.00158, sum=0.00000, avg=0.00000, max=0.00000, min=0.00000
    I can say that the values vary a bit up and down, so I'm fairly sure that this variance accounts for the "2 and 5 are faster than 1" result. My processor is NOT capable of running multiple threads in parallel, and any attempt to run MT code on my processor will just slow things down. Evidently, not by a huge amount, since there is very little difference in the overall timing of the code.

    Code:
    E:\proj\threads\Debug>threads.exe 1000
    total time: 3.06047, sum=2.98273, avg=0.00298, max=0.00393, min=0.00000
    
    E:\proj\threads\Debug>threads.exe 1000
    total time: 3.08970, sum=3.01170, avg=0.00301, max=0.00394, min=0.00000
    
    E:\proj\threads\Debug>threads.exe 1000
    total time: 3.08314, sum=3.00482, avg=0.00300, max=0.00394, min=0.00000
    
    E:\proj\threads\Debug>threads.exe 1000
    total time: 3.09533, sum=3.01738, avg=0.00302, max=0.00395, min=0.00288
    Here's an example of four consecutive runs [which also shows that once you get LOTS of threads running, it takes longer] - the run-to-run variation is a few hundredths of a second, just to show that it varies a fair bit.

    I think the conclusion here is that the "work" in each thread far outweighs the actual time of thread creation, switching, and destruction, so the test isn't particularly meaningful.


    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  7. #37 - matsp (Kernel hacker; joined Jul 2007; Farncombe, Surrey, England; 15,677 posts)
    Sorry for "posting comments to myself":
    I just thought I'd mention that the timer granularity will also play a role if your threads run for a shorter amount of time - I'm not sure how precise "gettimeofday" actually is - rdtsc is accurate to one or two clock-cycles on AMD and Intel processors respectively (as long as you are running on the same core that is - which of course is another problem with many threads - the code switches to another processor, stack is in a different location, and the variable i may be elsewhere (although it should be in a register anyways).
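    (For reference, a minimal sketch of reading the TSC with GCC inline assembly on x86 - an illustration; the actual timing code isn't shown in this thread:)
    Code:
    #include <stdint.h>

    static inline uint64_t read_tsc(void)
    {
        uint32_t lo, hi;
        /* rdtsc returns the low 32 bits in EAX and the high 32 bits in EDX */
        __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((uint64_t)hi << 32) | lo;
    }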

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #38 - Codeplug (Registered User; joined Mar 2003; 4,981 posts)
    >> Do you believe it's a better or a worse measurement than what the OP is getting with his code?
    Well, you can't really compare since what you're measuring and what the OP is measuring are different.
    In terms of getting numbers for use in calculating speedup, your code is "better", but not "good". Better in the sense that your code includes some of the thread overhead; not good in that it doesn't measure total program execution - which is what you need to calculate speedup.

    So here's a review of what foxman's code is measuring:

    >> startTime = clock();
    This is essentially the point in time at which all threads are "about to" break through the first barrier and start doing work.

    >> endTime = clock();
    This is essentially the point in time at which the first thread has broken through the second barrier - implying that all threads have finished their work. So it's the point in time at which all threads have "just finished" their work.

    foxman's version at least includes all thread context switching overhead while the work is being done - which, on his machine, is about 2.2 seconds for the 998 additional threads. foxman's timing could be improved slightly by not including the constant overhead of the first pthread_mutex_unlock. This could be done by setting an "I'm last" flag while the mutex is locked, then checking the flag after the unlock.
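    (A minimal sketch of that flag idea - a reconstruction for illustration; count, NB_THREADS, lock, barrier and startTime are assumed names matching the code being discussed:)
    Code:
    /* inside the worker, before the first barrier */
    int im_last = 0;

    pthread_mutex_lock(&lock);
    if (++count == NB_THREADS)
        im_last = 1;                    /* decide while the mutex is held */
    pthread_mutex_unlock(&lock);
    if (im_last)
        gettimeofday(&startTime, NULL); /* timestamp taken after the unlock */
    pthread_barrier_wait(&barrier);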

    foxman's code doesn't include thread creation overhead, or the context-switching overhead incurred outside the work between the barriers. And since the actual work being done is a trivial increment, there's none of the overhead associated with shared (synchronization) and non-shared (CPU cache-coherency) data access that a "real" multi-threaded app usually has.

    Now let's review what C_ntua's code is measuring (from post #8):

    Code:
    pthread_barrier_wait(&barrier);
    if (p->no == threads - 1)
       gettime(&s);
    So s is the time at which the last thread has broken through the barrier. Who knows how many threads have actually already started, or even completed their work! Not a very useful timestamp...

    Code:
    pthread_barrier_wait(&barrier);
    if (p->no == threads - 1) {
       gettime(&f);
       savetime(&f, &s, &p->time);
    }
    So f is the time at which the last thread was scheduled to run once the barrier condition was satisfied the second time. It could be the first thread out of the barrier - it could be the last. So the timing may include context-switching time after the work was completed (if it was last) or may not (if it was first). The difference between the two doesn't really tell you much.

    >> So, I don't care about overheads, joining and stuff like that. ... In any case I care ONLY about the specific algorithm execution time.
    Once again, the math is real easy and there's no need to write any code - assuming the algorithm is deterministic:

    Easy math when excluding overhead with multiple cores:
    You have an algorithm that takes C amount of time for a single thread to execute. Split that work up between 8 cores. Each core finishes its portion of the work in C/8 time. With each core working concurrently, the total algorithm execution time is now down to C/8. This is the definition of linear speedup. If you exclude all overhead, then every deterministic algorithm will achieve linear speedup when you throw another core at it.
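    (A hypothetical worked example, with made-up numbers: speedup is defined as T_single / T_parallel. If C = 16 seconds on one core, then the ideal 8-core run takes 16/8 = 2 seconds, and the speedup is 16 / 2 = 8 - linear in the number of cores.)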

    Once again, excluding overhead is useless when measuring execution time of a deterministic algorithm. It's all theory at that point.

    >> I care about the total time needed to execute the algorithm.
    The answer is "C" for a single core. The answer is "C/8" for 8 cores. In practice, you have to include overhead like context switching. For that, see foxman's code, or Salem's code from post #24.

    >> so gettimeofday() seems better at calculating actual time passed.
    Why would actual time passed be better? If you really want the "time needed to execute the algorithm", then "CPU time" is better than "wall time".
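    (To illustrate the distinction, a minimal self-contained sketch - an illustration, not code from this thread - where clock() reports CPU time and gettimeofday() reports wall-clock time:)
    Code:
    #include <stdio.h>
    #include <time.h>
    #include <sys/time.h>

    int main(void)
    {
        struct timeval w0, w1;
        clock_t c0, c1;
        volatile long i, z = 0;

        c0 = clock();
        gettimeofday(&w0, NULL);
        for (i = 0; i < 100000000L; i++)
            z++;                        /* the workload */
        c1 = clock();
        gettimeofday(&w1, NULL);

        printf("CPU time:  %f sec\n", (double)(c1 - c0) / CLOCKS_PER_SEC);
        printf("wall time: %f sec\n", (w1.tv_sec - w0.tv_sec)
                                      + (w1.tv_usec - w0.tv_usec) / 1e6);
        return 0;
    }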

    gg

  9. #39 - foxman (Chinese pâté; joined Jul 2007; Canada; 404 posts)
    I'll just add one or two comments, but I think Codeplug summed it up pretty well. I agree with him (at least when he's talking about my stuff).
    Quote Originally Posted by codeplug
    >> Do you believe it's a better or a worse measurement than what the OP is getting with his code?
    Well, you can't really compare since what you're measuring and what the OP is measuring are different.
    Haha, indeed, I think I forgot about this. Or maybe it's because I never really understood it.

    foxman's timing could be improved slightly by not including the constant overhead of the first pthread_mutex_unlock. This could be done by setting an "I'm last" flag while the mutex is locked, then checking the flag after the unlock.
    I didn't think about this. I'm taking notes.

    (talking about C_ntua's code) So s is the time at which the last thread has broken through the barrier.
    By "last thread", we mean the last "created" thread (that's it, the thread with the highest ID), not necessarily the last one to execute this instruction. I think that was worth the precision.


    On a last note, I changed the way I was calculating time in my program (replaced it with gettimeofday, since I wanted the wall time and not the CPU time), ran the program 7 times on my old P3 (I get a lot of variance in execution time with my new one), and took the average of the runs.

    Code:
    1 threads:   7.06815
    10 threads:  7.07258
    100 threads: 7.07687
    Edit: there was a graph, but it was ugly, a bit insignificant, and took too much space... so I removed it.
    Last edited by foxman; 06-16-2008 at 06:57 PM.
    I hate real numbers.

  10. #40 - C_ntua (Registered User; joined Jun 2008; 1,853 posts)
    Well, I think we have discussed the subject a lot. Everybody, including me, agrees that it would be highly unlikely for 100 threads to execute faster than 10 threads. But the reason "why" is still not answered.
    The only explanation left is the time measurement. So, two things:
    1) What is the best time measurement method?
    2) Why is the way I calculate time wrong?

    Codeplug points out correctly that you don't know what happens until the first gettimeofday is called. Btw, I now use the first thread for time measurements; it gives the same results. Using the last thread was kind of "stupid". But how much time could a gettimeofday call possibly need? We are talking about a few seconds of measured time, so each of the 100 threads would need something like 20 msec minimum to execute. I don't think a gettimeofday call needs more than a few nanoseconds, so I don't see a big problem.
    Still, I put gettimeofday() BEFORE the first barrier and got more or less the same results.

    I tested it many, many times - the same results. If it were something random, I would get random results.

    Now, gettimeofday and barriers are what was suggested, and they are accurate, so I don't think we can consider them unreliable. Note that we call gettimeofday twice and a few seconds pass between the calls.

    So what could be wrong with the time calculation? Can somebody think of another way to prove that 100 threads run slower - in the case, of course, where you split the data among threads and have no shared data?

    Btw, the problem with parallelism is not so much the overhead of threads as shared data, which creates the need for synchronization. In my case I only have the "problem" of overhead, which, if you measure it, is fairly low. Calculate, for example, the time with barriers versus the time from main and see the difference.


    Here is a version of my actual algorithm, just to make it clearer what I am doing:

    Code:
    for (i = start; i < fin; ++i)
        for (j = row_ptr[i]; j < row_ptr[i+1]; ++j)
            r[i] += val[j] * v[col_ind[j]];
    start and fin define a range (in my algorithm, a number of rows) divided among the threads.
    So if you want, you can just create the above arrays (being careful not to exceed the array bounds) and measure the time again; a sketch follows below. The values in the arrays don't matter. I generally use double for val, r and v, and unsigned int for col_ind and row_ptr.

    I post the above to make clear that more time is spent fetching values from memory than on the calculations, so maybe (since I don't know exactly) CPU time won't be a good fit.
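    (Here's a minimal sketch of setting up such arrays for a test - an illustrative construction, not the actual code; ROWS, NNZ_PER_ROW and the fixed row length are assumptions chosen so the array bounds are never exceeded:)
    Code:
    #include <stdlib.h>

    #define ROWS        100000
    #define NNZ_PER_ROW 8

    double       *val, *r, *v;
    unsigned int *col_ind, *row_ptr;

    void setup(void)
    {
        unsigned int i, j, nnz = ROWS * NNZ_PER_ROW;

        val     = malloc(nnz * sizeof *val);
        col_ind = malloc(nnz * sizeof *col_ind);
        row_ptr = malloc((ROWS + 1) * sizeof *row_ptr);
        r       = calloc(ROWS, sizeof *r);
        v       = malloc(ROWS * sizeof *v);

        for (i = 0; i <= ROWS; i++)
            row_ptr[i] = i * NNZ_PER_ROW;   /* every row gets NNZ_PER_ROW entries */
        for (j = 0; j < nnz; j++) {
            val[j]     = 1.0;               /* the values don't matter */
            col_ind[j] = j % ROWS;          /* keep indices inside v[] */
        }
        for (i = 0; i < ROWS; i++)
            v[i] = 1.0;
    }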

    I forgot to mention that I am not familiar with exactly what CPU time is. Is it the time needed for the CPU to perform the instructions given?

    EDIT: posted at the same time as foxman. At least on a uni-core system the results are normal. I still need to figure out what is wrong in my case. Well, there is no harm in using your method of calculating time; I will post again to see if there was any difference.
    Last edited by C_ntua; 06-16-2008 at 07:33 PM.

  11. #41 - C_ntua (Registered User; joined Jun 2008; 1,853 posts)
    Quote Originally Posted by C_ntua View Post
    So I changed foxman's code: I put gettimeofday() in place of clock() and did it without mutexes, for the reasons I posted above.

    I get on an 8-core CPU:
    FOR 2 threads:
    It took 8.917935 sec
    FOR 8 threads:
    It took 2.158017 sec

    which makes perfect sense.
    But,
    FOR 16 threads:
    It took 2.158018 sec
    FOR 100 threads:
    It took 1.731838 sec
    FOR 200 threads:
    It took 0.958165 sec


    So you see what I mean.
    Sorry, forgot I had already done this.

    Tested it on a dual-core system:
    FOR 2:
    It took 5.626660 sec
    FOR 200:
    It took 4.175423 sec

    Using foxman's code again, with gettimeofday instead of clock.

  12. #42 - matsp (Kernel hacker; joined Jul 2007; Farncombe, Surrey, England; 15,677 posts)
    CPU time is the time a particular process is using the processor - that is, it counts all the time the process holds the CPU, whether the processor is actually doing calculations or, for example, waiting for memory to deliver data, but not time when the system is working on other processes, e.g. some background task (firewall, e-mail client, clock-in-task-bar, etc.), or waiting for hardware to finish (e.g. you are reading/writing a file). Wall-clock time (which you get from "gettimeofday" or similar) will show the actual time taken, including any other processes that may run on the system.

    I still think the best measurement is to measure the TOTAL time taken by the application (like my "tt" in the code I posted). That's the thing that REALLY matters, since that's how long it takes from starting the calculation until the calculation is finished. [My code, in an attempt to get a fairly precise measurement, uses wall-clock time via the TimeStampCounter in the processor - which counts up at a fixed frequency at the same rate as the processor clock [1], so you can measure very short intervals without worrying about precision, and it's also much more lightweight than gettimeofday and similar OS calls - the drawback is that it's only valid as long as you stay on one CPU for both the start and end time.]
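    (For reference, a minimal sketch of this kind of total-time measurement - the names are assumptions and it uses gettimeofday rather than the TSC for simplicity; the original "tt" code isn't reproduced in this thread:)
    Code:
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define NB_THREADS 10

    extern void *a(void *);             /* the worker from the earlier posts */

    int main(void)
    {
        struct timeval t0, t1;
        pthread_t tid[NB_THREADS];
        int n;

        gettimeofday(&t0, NULL);        /* before the first thread is created */
        for (n = 0; n < NB_THREADS; n++)
            pthread_create(&tid[n], NULL, a, NULL);
        for (n = 0; n < NB_THREADS; n++)
            pthread_join(tid[n], NULL);
        gettimeofday(&t1, NULL);        /* after the last thread has ended */

        printf("total time: %f sec\n", (t1.tv_sec - t0.tv_sec)
                                       + (t1.tv_usec - t0.tv_usec) / 1e6);
        return 0;
    }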

    100 threads do not run, in overall time, faster than 10 threads. But since the OS spreads the work between threads, and each thread runs for a shorter (CPU) time [fewer iterations], you may well see a shorter time for an individual thread to finish. As we can see here:
    Code:
    E:\proj\threads\Debug>threads.exe 1
    total time: 2.03270, sum=2.03257, avg=2.03257, max=2.03257, min=2.03257
    
    E:\proj\threads\Debug>threads.exe 2
    total time: 2.03495, sum=3.90264, avg=1.95132, max=1.95143, min=1.95121
    
    E:\proj\threads\Debug>threads.exe 5
    total time: 2.03316, sum=8.77349, avg=1.75470, max=1.93551, min=1.52791
    
    E:\proj\threads\Debug>threads.exe 10
    total time: 2.03305, sum=10.67696, avg=1.06770, max=1.62267, min=0.20301
    Running a single thread takes 2.03 seconds.
    Running two threads takes 2.03 seconds, but each thread is started and finished within 1.95 seconds.
    With 5 threads it's still 2.03 seconds; the time in each thread averages 1.75 seconds and varies between 1.9 and 1.5 seconds.
    In the example with 10 threads, it still takes 2.03 seconds to perform ALL the work; one thread manages to finish in 0.2 seconds, the slowest takes 1.6 seconds, and the average is 1 second. You might expect ALL threads to take 0.2 seconds - the reason they don't is that multiple threads run simultaneously, and the OS switches from one to the next before any particular one is finished.


    [1] Some more modern processors have a "fixed-rate TSC" [I think Intel Core 2 Duos do, and I know AMD's quad-core processors do], which is independent of both the maximum CPU frequency and the current CPU frequency - this helps on systems that change frequency depending on load to save power.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  13. #43 - C_ntua (Registered User; joined Jun 2008; 1,853 posts)
    Tested my code with clock(). Same results.
    So in the end I don't know why I get these strange results. Let me sum up the positives and negatives of using more threads than there are CPU cores:

    Negatives:
    -Memory will be used for the per-thread bookkeeping, especially if a struct is passed to each pthread.
    -Overhead for creating/joining threads, even though I don't measure it (it is a small amount of time anyway).
    -Delay because of switching among threads. This shouldn't be big either if the algorithm is rather simple.
    -The biggest delay: synchronization. Especially if you have shared data, which might force you to use locks - or, in my case, to use barriers to measure time. In my case the barriers don't cause a great delay.
    -You don't have control over the order of instructions, so the arbitrary order may reduce performance.

    Positive(s):
    -For some sets of instructions the CPU might execute instructions in parallel. Modern CPUs use pipelines, so two instructions may execute at the same time - not always, it depends on the instructions.
    For example, see this scenario:
    Code:
    A[0] = A[0] + 1;
    regA = fun();
    In a serialized program the instructions would be executed in order. A compiler probably wouldn't reorder them, since the function fun() might do whatever it wants - for example, it could change the value of A[0].

    In a Threaded mode:
    Code:
    Thread1: A[0] = A[0] + 1;
    Thread2: regA = fun();
    maybe, while the CPU was waiting for A[0] to be loaded into the cache, it would execute the instruction from Thread2, since memory loads/writes are much slower. So, provided fun() doesn't actually change A[0] (so we don't need synchronization), we get a speedup, even on a uni-core system.
    In any case, you tell the CPU "execute whatever comes first", which may be useful.
    -Similarly, we would get a speedup if, for example, fun() also read B[0] and B[1] were used by another thread: when B[0] is read, B[1] is usually loaded into the cache as well.

    My algorithm, as noted before, is the following:
    Code:
    for (i = start; i < fin; ++i)
        for (j = row_ptr[i]; j < row_ptr[i+1]; ++j)
            r[i] += val[j] * v[col_ind[j]];
    All the above negatives apply, but the delay should be small.
    So the only way I could gain speedup is if one of the positives also applied.
    Personally, I don't think any of the above positives apply to my algorithm. It is a rather straightforward set of instructions.
    Let's say r[i] is being read from memory. The CPU would still execute the multiplication while it waits for r[i]. Then the rest of the r[i+...] values would be loaded into the cache.
    So the only way threads would speed up the process is if, say, 2 threads each wait for their r[i] to be loaded from memory; in the meantime both multiplications could be executed, since they don't depend on r[i], and when the r[i]'s become available, their values are stored back to the cache or main memory. But there is no way to verify this. And v[col_ind[j]] and val[j] would need more time to be loaded, so the multiplications would have to wait anyway.
    Anyway, you get my way of thinking.

    CONCLUSION: In any case, 100 threads should give a slightly longer time. But they don't. So the question is: do I trust my results, or do I insist that my logic is correct and there is an undefined error in all the time calculations I tested?

    In any case, thanks for the help!
