Pthreads performance

**C_ntua** · 06-14-2008

Here is an in-depth question about pthreads:

I m currently writing a paper about how an algorithm performs. The main algorithm is simple. It is like:

for (i = start; i < finish; i++)
do_somethings;

I noticed that when using very many pthreads, like 100-200, it performs faster than using 8-10 pthreads, even though I m running it on a 8-core system (I 've tested in in many systems multi-core and not).

Even if I test this simple/useless algorithm:

for (int i=0; i < a; ++i) ;

with a = 1000000 / number_pthreads;

it runs faster with very many pthreads. If somebody has any idea why it would perform faster with 100 threads rather than 10 please enlighten me . Note that the total number of loops is the same.

The way I calculate time is:
Barrier
gettimeofday()
algorithm
Barrier
gettimeofday()

**Salem** · 06-14-2008

Don't cross-post.

**C_ntua** · 06-14-2008

{really sorry for cross-posting, i just didn't know (and still dont) hot to properly move the post. I have requested it to be deleted}

Well, here is the code:

Code:

int threads; int z=0;
pthread_barrier_t barrier;

void * a (void *args)
{
	
	timeval f,s;
	int a = 1000000000 / threads;	

	pthread_barrier_wait(&barrier);
	gettimeofday(&s, NULL)
		
	for (int i=0; i < a; ++i)
		; //or use this z=z*z;

	pthread_barrier_wait(&barrier);
	gettimeofday(&f, NULL);
	savetime(&f, &s, (double *)args);

	return NULL;
}

int main(int argc, char ** argv)
{	
	double time;
	threads = atoi(argv[1]);
	pthread_t *thread = malloc(threads * sizeof(pthread_t));
	pthread_barrier_init(&barrier, NULL, threads);
	for (int i=0; i < threads; ++i)
		pthread_create(&thread[i], NULL, a, &time);
	for (int i=0; i < threads; ++i)
		pthread_join(thread[i], NULL);
	free(thread);		
	printf("&#37;.6f\n", time);
}

(savetime just substracts the timevals and save the results at a double variable)

Results are like 0.4439 for 10threads and 0.3932 for 100threads. This is a test cost. My original code is more than 3000lines to post. But the problem is similar. For the same number of loops 100 threads perform faster. I might make some guesses but wanted to hear more ideas...

**Salem** · 06-14-2008

My guess is that having many short-duration threads means they've already finished by the time you come to pthread_join() each one, and that completes immediately.

If the thread is running when you try to join, there is likely to be more context switching.

**C_ntua** · 06-14-2008

Well, I have a pthread_barrier() which means all threads will reach that point. Then all threads will run simultaneously until the next ptread_barrier. THEN the threads will terminate. So all threads will be "joined". Right?

**Codeplug** · 06-14-2008

So every thread stores it's time-of-day difference in "double time". Assuming you used proper synchronization to store a value into "double time", then the last thread's time-of-day difference will be stored in "double time" after joining with all threads.

So the "work" is divided by the number of threads: "int a = 1000000000 / threads;" Which means that the more threads you create, the less work there is per thread.

>> Results are like 0.4439 for 10 threads and 0.3932 for 100 threads.
Smaller time for less work. Makes sense to me.

gg

**Salem** · 06-14-2008

Good catch Codeplug, they're all updating the same variable.

Try something like
double *times = malloc ( threads * sizeof *times );

And
pthread_create(&thread[i], NULL, a, &times[i]);

Then write something to sum all the times.

**C_ntua** · 06-14-2008

But why should it matter if they update the same variable?
The are all synchronized with a pthread_barrier(). All threads should get the same time. But nevertheless I changed the code. Only the last one calculates time and saves it.

Code:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <sys/time.h>
#include <pthread.h>
#include "util.h"

typedef struct pthread {
	int no;
	double time;
} pthread;

int threads; int z=0;
pthread_barrier_t barrier;

void * a (void *args)
{
	pthread *p = (pthread *)args;
	timeval f,s; int tmp = 0;	
	int a = 1000000000 / threads;	
	pthread_barrier_wait(&barrier);
	if (p->no == threads - 1)
		gettime(&s);		
	for (int i=0; i < a; ++i)
		;
	pthread_barrier_wait(&barrier);
	if (p->no == threads - 1) {
		gettime(&f);
		savetime(&f, &s, &p->time);
	}
	return NULL;
}

int main(int argc, char ** argv)
{	
	double time;
	threads = atoi(argv[1]);
	pthread p[threads];
	for (int i=0; i < threads; i++) {
		p[i].no = i;
		p[i].time = 0;
	}
	pthread_t *thread = malloc(threads * sizeof(pthread_t));
	pthread_barrier_init(&barrier, NULL, threads);
	for (int i=0; i < threads; ++i)
		pthread_create(&thread[i], NULL, a, p+i);
	for (int i=0; i < threads; ++i)
		pthread_join(thread[i], NULL);
	free(thread);		
	printf("%.6f\n", p[threads-1].time);
}

The same result though.
Each thread has indeed smaller work to do, but overall the same number of loops are executed! Since you don't have 100cores the instructions are serialized somehow on the processors.
Plus you have more overheads due to loop-initialization with more pthreads. Plus you have (though a small) delay to synchronize on the barrier.

**foxman** · 06-14-2008

Indeed, you should get a result similar to the one you got with your first version since they are, at the end, doing quite the same thing.

But why should it matter if they update the same variable?

Well, if they update the same variable, the result stored in that variable will be the time it took to one thread to complete his "1000000 / threads" iterations. If you want the total time it tooks for your "threads" threads to do the "1000000" iteration, multiply that number by "threads".

Code:

	gettimeofday(&s, NULL)
		
	for (int i=0; i < a; ++i)
		; //or use this z=z*z;

	pthread_barrier_wait(&barrier);
	gettimeofday(&f, NULL);
	savetime(&f, &s, (double *)args);

In your code, each thread does "1000000 / nb_of_threads" iterations. The more you'll have threads, the less iterations there will have and the less time it will takes for a thread to leave the loop. No magic involved here.

In the end, it would be extremely surprising if creating more threads would speed up a process after a certain point (that's it, the number of "logical core"). It's just more overhead from context switching, creating and destroying thread, even if thread are "ligth" (especially if you compare them against process which have a bigger amount of overhead). CPU can't do magic. If your code goes faster when you have 1000 thread then when you have 4 threads and you have a quad-core CPU, then it's mostly because there's a bug/error in your code. Mostly. I guess.

**C_ntua** · 06-15-2008

First of all, what you are saying about calculating the time by multiplying the time with the number of threads, or summing each thread's time would make sense without the use of barriers. We use barriers to calculate time to avoid the above, since even if one thread finishes it will first wait for all others to finish. Also it waits for every thread to start. So the time calculation method (it's a common method) should be OK.

Well, you have the code. What bug/error could it exist?
The first thing I thought was a bug/error. Since my code is very big I wrote the above, simple code in order to test it. Here I post gettime() and savetime() so somebody can compile the above code and tell me his results.
Theoritacally, I agree. It should get optimal results with a number close to the number of cores the processor has. But it doesn't...

Code:

void savetime (timeval *f, timeval *s, double *res)
{
	if (f->tv_usec < s->tv_usec) {
		int nsec = (s->tv_usec - f->tv_usec) / 1000000 + 1;
		s->tv_usec -= 1000000 * nsec;
		s->tv_sec += nsec;
	}
	if (f->tv_usec - s->tv_usec >1000000) {
		int nsec = (s->tv_usec - f->tv_usec) / 1000000;
		s->tv_usec += 1000000 * nsec;
		s->tv_sec -= nsec;
	}
	*res = (double)((f->tv_sec - s->tv_sec) * 1000000 + f->tv_usec - s->tv_usec) / 1000000; 
}

Code:

inline void gettime (timeval *s)
{
	if (gettimeofday(s, NULL) == -1) {
		printf("Error calculating time");
		exit(1);
	}
}

(also remove #include "util.h" from the above code. I compiled using -phtread and -std=gnu99 flags)

The program gets as parameters the number of threads.

**foxman** · 06-15-2008

I did not even see the second call to "pthread_barrier_wait" in your code. You could work a little bit on your style, you know...

(ok, that's also my fault)

Well, if you get lower time when you have more threads, it might has to do with how the threads are scheduled. Let's say the last thread to "leave" the first barrier is leaving the second barrier first (which could happen -- but we can't tell since we'll say we don't know how threads are scheduled). Since you have only 4 or 8 processing units, if you have more threads, it's certain some will "get started" later than some. So, you might end up that the thread number "pthread - 1" is the last one to make his first call to gettimeofday but is the first one to make his second call to gettimeofday (since it's the last to reach the second barrier, who knows, he might be in the first one to leave it -- but I couldn't tell). Basically, you should pick a random thread to calculate the time instead of a "fixed" thread.

Or, change the way you are calculating the time. I suggest you add the time it takes for each thread to execute the loop (no need to use barriers). I believe this will yield to more interesting results.

Or you could just say "Bah" and believe that having 1000 threads over 10 doesn't have any real advantages if you have less than 10 processing units.

Still, threads are light, and the overhead from creating more threads than the "optimal number" looks quite small.

**Codeplug** · 06-15-2008

You are still looking at the timing of the last thread - which is most likely the last thread to hit the barrier - so it's as-if the barrier weren't even there.

So, "less work, smaller time" still applies.

What exactly are you trying to time or prove? Your current approach probably isn't what you want.

gg

**C_ntua** · 06-15-2008

Summing the times is wrong. Lets say the time needed to execute the loop is TIME. Each loop needs approximately TIME / THREADS to execute its loop. Adding up times will give us TIME again. I tested it and of course this is what happens!
The problem with adding time is that threads run SIMULTANEOUSLY. So you care about the time of the slowest thread. If you find that time it means every other thread has already finished. I changed the code to find the slowest thread and it's execution time. I even get more speedup this way.
But I believe barriers is a better way. Whenever the last thread leaves or enters the barriers wouldn't matter. It would only cause a delay. Not a speedup. The delay is caused due to signals used for the barrier. Logically every thread waits for every other so it doesn't matter which one enters first or last the barriers.
As I said, it is for a paper and as I know barriers are the way to calculate time (a common way). I didn't come up with the idea myself. It is the most logically correct way, since you synchronize the threads on a specific part of the code, eliminating all other factors. The downside is the delay you get due to threads. But here stands the rule that you NEVER can have an absolutely accurate calculation using the same CPU to run the program and calculate time.
My paper is about performance of a sparse matrix multiplied with a vector by the way. For some reason a lot of people find it a useful topic :P Doesn't really matter. I just came across this issue on my testings.
My results are that in some cases the optimal number of threads is like 100 threads instead of 8 (I test on an 8-core system). If you put 1000 instead of 100 you ll still get a speedown. Also, the speedup is quite higher from 2 to 8 threads than from 8 to 100 threads. As expected.
My "problem" is why it speeds up with 100 threads.

First of all, each thread calculates a different part of the data array. So technically 100 threads should give the same results with 8 threads with a small delay. They both use the same cache memory and CPU cores.
Gaining a speedup means that either the OS for whatever reason executes small threads faster than big threads, which I believe is the case. Either it has to do with the CPU/cache, propably the way data are read/write in the cache memory. Though, that would be kind of strange.

It would be nice if somebody tested something similar with pthreads just to compare results. Or give me part of codes to see what kind of code is executed faster with a lot of threads

EDIT: I used the "time" command to have a better view. Indeed with 100 threads I get a less user time. User time is the CPU time spent for executing instructions.

**Salem** · 06-15-2008

Code:

	if (p->no == threads - 1) {
		gettime(&f);
		savetime(&f, &s, &p->time);
	}

Imagine for a moment that your system only runs 1 thread a second after all the threads reach the barrier.
If you're lucky, and the first thread out the gate is also the thread which calculates the time, then you're going to get a nice low answer.
But if your thread is the last one (and if you have more threads than cores, this could be a while), then you're going to get a very different answer.

I think removing all this barrier stuff, and generating an array of delta-t's which you then analyse will give you the information you want.

**C_ntua** · 06-16-2008

Well, I said I did that. I got all times and found the bigger one and printed it. The same results.

-And lets get this straight. All threads run and reach the first barrier. The system waits for signals to be sent. When all signals are sent then it sends signals to each thread to start. OK?

-So all start at about the same time in a random matter (a small delay for the signals).

-The execute all together the algorithm.

-When a thread finishes it exectutes the barrier_wait and waits there. When ALL threads finish then signals are sent for them to start again.

-Then, and only then, they continue the execution. At a random fashion again they get signals. When the last thread gets its signals it will calculate time and game over.

Scenarios:
-You are lucky and the last thread gets first the signal. You get the time you expect to get. The time after every thread has finished its job
-You are unlucky and the last thread gets last the singal. You get a bigger time, due to the delay from signals.

You NEVER get a smaller time than you would normally get without barriers. Isn't that clear? Again, barriers give delay, not speedup.

--But in any case I tested them both. Again more threads have a slower time.

Thread: Pthreads performance

Thread Tools

Search Thread

Display

Pthreads performance

Similar Threads

using pthreads with an API that uses pthreads

Performance and footprint of virtual function

File map performance

Difference between win32 and linux pthreads

inheritance and performance