Thread: Pthreads performance

  1. #16
    Chinese pâté foxman's Avatar
    Join Date
    Jul 2007
    Location
    Canada
    Posts
    404
    What do you want to time, exactly?
    I hate real numbers.

  2. #17
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    My personal approach would be to time the code inside main, from the place just before you start the first thread, to just after the final thread has finished. That's how long it takes to perform all the work, and that is what matters, don't you think? It also makes the whole task very easy, because you are just measuring the one thread (the one that runs main).

    I would also use two measurements:
    the result from clock() and the result from gettimeofday() - since "real" time and "CPU time" are two different things (there shouldn't be much difference unless someone else is using a lot of your CPU time, but just in case there is, it's good to KNOW that there is a difference).
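
    Roughly like this (just a sketch - run_threads() is a placeholder for whatever code creates and joins all of your threads):
    Code:
    #include <stdio.h>
    #include <time.h>
    #include <sys/time.h>
    
    /* placeholder: create and join all the worker threads here */
    static void run_threads(void)
    {
    }
    
    int main(void)
    {
        struct timeval wall_start, wall_end;
        clock_t cpu_start, cpu_end;
    
        gettimeofday(&wall_start, NULL);   /* "real" (wall-clock) time */
        cpu_start = clock();               /* CPU time used by the whole process */
    
        run_threads();                     /* from just before the first thread starts
                                              to just after the last one has finished */
    
        cpu_end = clock();
        gettimeofday(&wall_end, NULL);
    
        double wall = (wall_end.tv_sec - wall_start.tv_sec)
                    + (wall_end.tv_usec - wall_start.tv_usec) / 1e6;
        double cpu  = (double)(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    
        printf("wall-clock: %.3f s, CPU: %.3f s\n", wall, cpu);
        return 0;
    }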

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #18
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Well, I care about the real time.
    I use barriers because I need to time exactly that code. I need the time spent on the loop only.

  4. #19
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by C_ntua View Post
    Well, I care about the real time.
    I use barriers because I need to time exactly that code. I need the time spent on the loop only.
    Ok, so calculate the time taken in each of the threads, then post-process all the threads' times. The result should be displayed as max, min, average and sum (and perhaps also standard deviation, if you want to see how well your average matches your actual data - if you have a big variance, then the average is not really a good measure).
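
    Something like this for the post-processing (just a sketch - it assumes you have already stored each thread's elapsed time in seconds in an array t[]):
    Code:
    #include <stdio.h>
    #include <math.h>
    
    /* t[] holds each thread's elapsed time in seconds, n is the thread count */
    void report(const double *t, int n)
    {
        double sum = 0.0, min = t[0], max = t[0];
        for (int i = 0; i < n; i++) {
            sum += t[i];
            if (t[i] < min) min = t[i];
            if (t[i] > max) max = t[i];
        }
        double avg = sum / n;
    
        double var = 0.0;
        for (int i = 0; i < n; i++)
            var += (t[i] - avg) * (t[i] - avg);
        double stddev = sqrt(var / n);
    
        printf("sum=%.3f min=%.3f max=%.3f avg=%.3f stddev=%.3f\n",
               sum, min, max, avg, stddev);
    }
    
    int main(void)
    {
        double t[4] = { 0.41, 0.39, 0.44, 0.40 };   /* illustrative values only */
        report(t, 4);
        return 0;
    }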

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  5. #20
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> My "problem" is why it speeds up with 100 threads.
    None of the code you posted measures "speedup".

    >> I need the time spent for only the loop.
    That won't give you a measurement of speedup. I still don't understand why you want that measurement.

    >> I used the "time" command to have a better view. Indeed with 100 threads I get a less user time.
    You can't cut out time spent everywhere else when calculating speedup. You are introducing overhead in time and resources as you run more threads. You can't cut out that overhead (kernel time) when measuring speedup. Speedup is measured using total program execution time.

    >> My personal approach would be to time the code inside main...
    That will give you speedup - using the *total* CPU time (user + kernel). You'll want to remove any unneeded synchronization points (barriers) before measuring total program execution time.

    You should be able to use the "time" utility to measure total program execution time.
    Here's a few more options if you want to do your own timing: getrusage() or times().
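
    For example, getrusage() reports the user and kernel CPU time consumed by the whole process so far - a minimal sketch:
    Code:
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    
    int main(void)
    {
        /* ... run all the threads to completion here ... */
    
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0) {
            double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
            double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
            printf("user=%.3f s, kernel=%.3f s, total CPU=%.3f s\n",
                   user, sys, user + sys);
        }
        return 0;
    }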

    gg

  6. #21
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Well, I appreciate the help. But lets focus on the problem.
    I need the time for the specific lines of code.
    As I said, I have an algorithm that uses a for-loop. I posted the simplified code above with only the for loop, without the rest of the algorithm. I did that because, even though the rest of the algorithm is one line, in order to execute it you need the rest of the code, which is spread across many files, so it is very big.

    And AGAIN: I also calculated the time WITHOUT barriers. I found the max time, which is what I need. AGAIN it is smaller for 100 threads than for 10 threads.

    Can anybody actually compile and test the above code to see for himself?

  7. #22
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by C_ntua View Post
    And AGAIN: I also calculated the time WITHOUT barriers. I found the max time, which is what I need. AGAIN it is smaller for 100 threads than for 10 threads.

    Can anybody actually compile and test the above code to see for himself?
    Yes, 1/x tends to be smaller for larger x, so the time each thread's loop takes, measured from its start to its end (for short per-thread run-times), will be smaller for a larger number of threads.


    I can't compile the code and run it on my machine (either at work or at home) at present, since none of my currently usable machines have Linux (and thus gettimeofday, pthreads, etc).

    I could rewrite your code to use Windows threads, but I doubt it will change the mathematical rule that you are just proving (constant divided by x approaches zero as x grows).

    But as stated by both me and Codeplug, it's the overall runtime that matters, not the amount of time it takes to run one thread to completion. It's only ever sensible to run a number of threads that is roughly equal to the number of (currently available) CPU cores. Noticeably more than that will definitely slow things down, as there is more overhead in creating, switching between, and destroying threads. [That is for CPU-bound tasks - if the task is IO-bound, then doubling or even tripling the number of threads can make the system run faster, since some other thread is available to do work while the other threads are waiting for IO to complete - so depending on the task, a different number of threads is the right solution.]

    It is also worth noting that not all multithreaded applications use threads to get the best performance out of the system; some use them because, for example, they simplify the task at hand and save the programmer some "trouble".

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #23
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    I agree about the overhead and the thread switching. That is why I consider my results strange. Well, you never know exactly what the CPU does; that is why we are testing.
    But...

    We are saying the same things again and again.

    Let's say you have X threads. Each thread executes A / X loops. The TOTAL number of loops executed by your CPU is A. If you have a large X, the number of loops is still A! That is the main idea, to keep the same number of loops. Right?

    So the processor executes A loops. I measure the time needed to execute only the loop - either with barriers, or with the max thread time.

    Having the above code, or a similar one, executed on Windows would be very nice! If you don't get the same results - that is, 100 threads performing faster than 10 - then there is for sure a problem with the whole thing.

    Or there is indeed a strange bug that I fail to see...

    And again, keep in mind that the barriers are used for the time calculation. The method is used by professionals. I don't know if it is the best method, but it is a correct one. I didn't come up with it, I just read it in other papers...

  9. #24
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    Sure, any single thread is going to have less work to do if you split the task into 100 parts as opposed to 10 parts.

    But you're not counting an awful lot of overhead in terms of creating and joining threads, and all that synchronisation with the barrier.

    Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <time.h>
    #include <sys/time.h>
    #include <pthread.h>
    
    /* per-thread timestamps */
    typedef struct thrtime {
      struct timeval created;
      struct timeval alive;
      struct timeval dead;
      struct timeval reaped;
    } thrtime;
    
    int threads;
    
    void * a (void *args)
    {
      thrtime *p = (thrtime *)args;
      int n = 1000000000 / threads;
    
      gettimeofday( &p->alive, NULL );
    
      for (int i=0; i < n; ++i)
        ;   /* note: an optimizing compiler may remove this empty loop entirely */
    
      gettimeofday( &p->dead, NULL );
    
      return NULL;
    }
    
    int main(int argc, char ** argv)
    {
      threads = atoi(argv[1]);
      pthread_t tid[threads];
      thrtime p[threads];
    
      for (int i=0; i < threads; ++i) {
        gettimeofday( &p[i].created, NULL );
        pthread_create(&tid[i], NULL, a, p+i);
      }
    
      for (int i=0; i < threads; ++i) {
        pthread_join(tid[i], NULL);
        gettimeofday( &p[i].reaped, NULL );
      }
    
      // now do some calcs on all those timings
      return 0;
    }
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  10. #25
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    I just reproduced [I think] your testing, and I got these rather random results:
    Code:
    E:\proj\threads\Debug>threads.exe 2
    sum=0.39, avg=0.20, max=0.20, min=0.20
    
    E:\proj\threads\Debug>threads.exe 10
    sum=1.06, avg=0.11, max=0.16, min=0.02
    I'm not going to post my code, because it's not very nice and I can't be bothered to fix it up, to be honest. So, yes, one thread does less work, but even then, the overhead of running more threads actually adds up to more total time - because each thread gets interrupted by the other threads.

    And you may not understand what the processor (and, more importantly here, the OS) does, but I'm pretty sure I understand fairly well what happens inside a modern processor (and of course older ones - that's easier, as there's much less parallelism).

    So, are you convinced yet that measuring what each thread does is meaningless?

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  11. #26
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> I don't know if it is the best method but it is a correct one.
    Method for what?! What are you trying to calculate or show?

    If you are trying to calculate speedup (as your previous comments indicate), the only valid measurement for calculating speedup is total program execution.
    *Please indicate whether or not you agree with this statement. If not, please explain why.*

    >> Lets say you have X threads. Each thread executes a A / X number of loops. The TOTAL number of loops executed by your CPU is A.
    Sure, assuming A divides evenly by X - but you are only looking at how long it takes for a single thread to execute its small slice of A, excluding all overhead. In my mind this is a meaningless measurement.
    *Please indicate whether or not you agree with this statement. If not, please explain why.*

    As we've all mentioned, the key point here is that you can't exclude all the overhead associated with threads. Since you've chosen to exclude the overhead, then the math is actually quite simple - you don't even have to write any code:

    Easy math when excluding overhead:
    If it takes C amount of time for a single thread to execute A loops, then it will take C/2 amount of time for a thread to execute A/2 loops. So if you have two threads, each thread will take C/2 amount of time. The total amount of time is therefore C/2 + C/2, which equals C. You can carry this out for any number of threads and the total will always be C.

    What you are measuring is C/X, which isn't useful.
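
    For example, say C is 12 seconds: with 10 threads each thread measures about 1.2 seconds, and with 100 threads each measures about 0.12 seconds - the per-thread number shrinks, but the total work is still 12 seconds of CPU time, plus the growing overhead you've excluded. A smaller per-thread time with 100 threads does not mean the program got faster.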

    If you do not agree with any of these statements, please point out what you don't agree with and why, so we can all "get on the same page".

    gg

  12. #27
    Chinese pâté foxman's Avatar
    Join Date
    Jul 2007
    Location
    Canada
    Posts
    404
    I wrote this stuff, which tries to measure the time (and by time I mean time as we humans feel it, so my choice of time function might be wrong, because I'm not sure which sort of "time" clock() measures) it takes for some work to get done. Do you believe it's a better or a worse measurement than what the OP is getting with his code?

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <pthread.h>
    
    #ifndef NB_THREADS
    #define NB_THREADS 1000
    #endif
    #define NB_ITER_TOTAL 5000000000l    // 5 000 000 000
    #define NB_ITER_THREAD (NB_ITER_TOTAL / NB_THREADS)
    
    #define CHCK(X) { if (X != 0) { fprintf(stderr, "Error at line %d: " #X "\n", __LINE__); exit(1); } }
    
    pthread_barrier_t barrier;
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    
    // These variables count the number of threads going through the lock
    int startThreadCount = 0;
    int endThreadCount = 0;
    
    // Define your favorite variable for holding time-related stuff
    clock_t startTime;
    clock_t endTime;
    
    void* loopFunc(void *arg)
    {
        long i;
    
        // The last thread to get here will take the first time measurement
        pthread_mutex_lock(&mutex);
            startThreadCount++;
            if (startThreadCount == NB_THREADS)
            {
                // Insert your favorite timing function here
                startTime = clock();
            }
        pthread_mutex_unlock(&mutex);
    
        // Then all threads are going to stop here, do the stuff, and stop again after
        pthread_barrier_wait(&barrier);
        for (i = 0; i < NB_ITER_THREAD; i++);
        pthread_barrier_wait(&barrier);
    
        // The first thread to get here will take the other time measurement
        pthread_mutex_lock(&mutex);
            if (endThreadCount == 0)
            {
                // Insert your favorite timing function here and print
                // the time difference
                endTime = clock();
                printf("It took %f sec\n", (double) (endTime - startTime) / CLOCKS_PER_SEC);
            }
            endThreadCount++;
        pthread_mutex_unlock(&mutex);
    
        pthread_exit(NULL);
    }
    
    int main(int argc, char **argv)
    {
        pthread_t threads[NB_THREADS];
        pthread_attr_t threadAttr;
        int i;
    
        // Initializing stuff and creating NB_THREADS threads
        CHCK(pthread_attr_init(&threadAttr));
        CHCK(pthread_attr_setdetachstate(&threadAttr, PTHREAD_CREATE_DETACHED));
        CHCK(pthread_barrier_init(&barrier, NULL, NB_THREADS));
    
        for (i = 0; i < NB_THREADS; i++)
        {
            CHCK(pthread_create(threads + i, &threadAttr, loopFunc, NULL));
        }
    
        // Free stuff and exit the main thread
        CHCK(pthread_attr_destroy(&threadAttr));
        pthread_exit(NULL);
    }
    Code:
    $ gcc -DNB_THREADS=2 -lpthread -o a2 main.c
    $ gcc -DNB_THREADS=1000 -lpthread -o a1000 main.c
    $ ./a1000
    It took 14.210000 sec
    $ ./a2
    It took 12.020000 sec
    I hate real numbers.

  13. #28
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    clock() measures CPU time (officially, at least - occasionally, the C runtime will replace it with wall-clock time, since there's no good way to determine CPU time, e.g. in DOS), and it's not a bad measure - under low load and with no IO operations it would be almost identical to the wall-clock time anyway.

    Of course, the first thread to finish doesn't really measure all the work being done - much more time may well have been consumed than that, since some threads may not even have STARTED the work yet.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  14. #29
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Again thanx for the posts.

    First of all, let me make it more clear. I need to time an algorithm. That means

    Code:
    get_time1
    execute_algorithm
    get_time2
    time2 - time1 is the time needed for the algorithm execution
    My algorithm is just the for-loop
    Code:
    for (int i=0; i < a; ++i)
    ;
    In order to execute it faster we use threads, because we have an 8-core CPU. The way to do that is:
    Code:
    for (int i=0; i <part_of_a; ++i)
    ;
    Which means we divide "a" into a number of pieces. I DIVIDE THE DATA among the threads. Each thread executes a piece. Well, since a = 1000000 and I test 10, 100 or 1000 threads, "a" is evenly divisible. So part_of_a for each thread would be a / threads, where "threads" is the number of threads.
    To make it clear, think of the following. You want to increase "number" a times.
    You would have:
    Code:
    for (int i=0; i < a; ++i)
       number++;
    with threads:
    Code:
    for (int i=0; i < a / threads; ++i)
       number++;
    number will be increased by "a" in both cases. With threads it will be done faster.
    NOTE!! The above code isn't correct, as anybody who knows about thread synchronization will notice. Again, it is wrong, but you get the idea. Right?
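
    One common way to actually make it correct is to give each thread its own partial counter and sum them after the join - a rough sketch (the names here are made up, not from my real code):
    Code:
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    typedef struct {
        long iterations;   /* this thread's share of "a" */
        long partial;      /* this thread's private count */
    } work_t;
    
    static void *worker(void *arg)
    {
        work_t *w = arg;
        for (long i = 0; i < w->iterations; ++i)
            w->partial++;
        return NULL;
    }
    
    int main(int argc, char **argv)
    {
        int threads = (argc > 1) ? atoi(argv[1]) : 10;
        long a = 1000000;
    
        pthread_t tid[threads];
        work_t w[threads];
        long number = 0;
    
        for (int i = 0; i < threads; ++i) {
            w[i].iterations = a / threads;
            w[i].partial = 0;
            pthread_create(&tid[i], NULL, worker, &w[i]);
        }
        for (int i = 0; i < threads; ++i) {
            pthread_join(tid[i], NULL);
            number += w[i].partial;   /* no race: summed after the thread is done */
        }
    
        printf("number = %ld\n", number);
        return 0;
    }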

    So, I don't care about overheads, joining and stuff like that. Why don't I care? Because my "job" is to time a specific algorithm. There are reasons why you want that specific algorithm, but that is another discussion. In any case I care ONLY about the specific algorithm's execution time.

    ABOUT SPEEDUP:
    To make it clear: speedup can mean a lot of things. Your statement is correct if I wanted to calculate the speedup of the program. I want to calculate the speedup of an algorithm. More specifically, I want to calculate the speedup gained from parallel programming. The term is used in Parallel Programming books as follows:
    SPEEDUP = time_for_serial_program / time_for_parallel_program
    A serial program is a program without threads, so every instruction is serialized. Alternatively you can count the execution time with 1 thread.

    matsp:
    Ok, so you have times for each thread SEPARATELY. I care about the total time needed to execute the algorithm. Since some threads execute simultaneously and some don't, as I see it you cannot calculate that time without barriers.
    --You COULD count the time from main. Main acts, in a way, as a barrier. But with this method you also count joining, initializing and other thread overhead, and I don't care about that stuff.
    --So I AGREE. Calculating the time for each thread separately is indeed meaningless. That is why I used two pthread_barriers. That way you measure the time it takes for every thread to finish its part of the algorithm.

    foxman:
    I believe it is a correct way. It is also nice code, btw. I don't know exactly what clock() measures either, so I don't know exactly what is being measured. I'll search and post again.
    Personally I believe it is a worse way. Let me clarify: the way you thought of it is indeed the best way to do it if you pass NULL to loopFunc. But you could pass i, for example, so each thread would have an ID. Then, instead of using mutexes, the thread with ID == 0 could take the time, so no synchronization with mutexes would be needed.
    When getting the endTime you wouldn't need synchronization at all, for obvious reasons.
    There is NO WAY to get a perfect starting time.
    You calculate startTime, but you include a
    pthread_mutex_unlock(&mutex)
    plus a
    pthread_barrier_wait(&barrier);
    after the startTime. First of all, you should take startTime after the barrier. The difference from the method I posted is that my startTime is off by the delay it takes gettimeofday() to return and store its value, while your startTime is off by the delay it takes clock() to return, store its value AND execute pthread_mutex_unlock().
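
    What I mean, roughly (a sketch only - it reuses your globals barrier, startTime, endTime and NB_ITER_THREAD, and passes the loop index through the void* argument as the thread's ID):
    Code:
    /* loopFunc rewritten so the thread with id 0 does the timing, no mutex needed */
    void* loopFunc(void *arg)
    {
        long id = (long) arg;        /* the loop index, passed in at pthread_create() */
        long i;
    
        pthread_barrier_wait(&barrier);
        if (id == 0)
            startTime = clock();     /* taken just after the barrier releases, so it
                                        can lag the other threads by a small delay */
    
        for (i = 0; i < NB_ITER_THREAD; i++)
            ;
    
        pthread_barrier_wait(&barrier);
        if (id == 0)                 /* the barrier already guarantees everyone finished */
        {
            endTime = clock();
            printf("It took %f sec\n", (double) (endTime - startTime) / CLOCKS_PER_SEC);
        }
    
        pthread_exit(NULL);
    }
    
    /* and in main:
       CHCK(pthread_create(threads + i, &threadAttr, loopFunc, (void *) (long) i)); */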


    Finally:
    If you calculate the following:
    Code:
    barrier
    gettime
    nothing!
    barrier
    gettime
    you get something like 50 usec for 10 threads and 600 usec for 100 threads. My times are around 4000000 usec, so the delay from the barriers doesn't really matter. You'll find that creating and joining pthreads doesn't take a lot of time either. And in any case the main algorithm is the most time-consuming part; the rest is less than 10% of the time needed for the algorithm.
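
    A self-contained version of that barrier-overhead measurement could look like this (just a sketch - each thread does nothing between the two barriers and thread 0 reports the elapsed time):
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <pthread.h>
    
    static pthread_barrier_t barrier;
    static struct timeval t1, t2;
    
    static void *empty(void *arg)
    {
        long id = (long) arg;
    
        pthread_barrier_wait(&barrier);
        if (id == 0) gettimeofday(&t1, NULL);
    
        /* nothing! */
    
        pthread_barrier_wait(&barrier);
        if (id == 0) {
            gettimeofday(&t2, NULL);
            long usec = (t2.tv_sec - t1.tv_sec) * 1000000L + (t2.tv_usec - t1.tv_usec);
            printf("barrier-to-barrier overhead: %ld usec\n", usec);
        }
        return NULL;
    }
    
    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 10;
        pthread_t tid[n];
    
        pthread_barrier_init(&barrier, NULL, n);
        for (long i = 0; i < n; ++i)
            pthread_create(&tid[i], NULL, empty, (void *) i);
        for (int i = 0; i < n; ++i)
            pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }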

    EDIT: matsp posted before me - so gettimeofday() seems better at measuring actual elapsed time.
    Last edited by C_ntua; 06-16-2008 at 04:24 PM.

  15. #30
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    50 usec to call a barrier is A LOT!

    I copied/pasted/rewrote this (apologies for the ugly code - it's not meant to be "good" code, so none of you pedants should point out the bad thread methodology, short variable names, or lack of error checking):
    Code:
    #include <stdio.h>
    #include <stdlib.h>   /* for atoi() */
    #include <windows.h>
    
    int threads; 
    volatile int z=0;
    
    typedef struct pthread {
    	int no;
    	double time;
    } pthread;
    
    #define USE_TSC 1
    
    #if USE_TSC
    typedef unsigned __int64 timeval;
    __declspec(naked)
    unsigned __int64 __cdecl rdtsc(void)
    {
       __asm
       {
          rdtsc
          ret       ; return value at EDX:EAX
       }
    }
    
    void savetime (timeval *f, timeval *s, double *res)
    {
    	*res = (*f - *s) / (2.0 * 1000.0 * 1000.0 * 1000.0);  // my processor runs at 2GHz
    }
    
    void gettime(timeval *s)
    {
    	*s = rdtsc();
    }
    
    #else
    typedef SYSTEMTIME timeval;
    
    void savetime (timeval *f, timeval *s, double *res)
    {
    	FILETIME start, end;
    	LARGE_INTEGER delta, startl, endl;
    	SystemTimeToFileTime(s, &start);
    	SystemTimeToFileTime(f, &end);
    	startl.LowPart  = start.dwLowDateTime;
    	startl.HighPart = start.dwHighDateTime;
    	endl.LowPart    = end.dwLowDateTime;
    	endl.HighPart   = end.dwHighDateTime;
    	delta.QuadPart = endl.QuadPart - startl.QuadPart; 
        *res = (double)delta.QuadPart / 100000000;
    }
    
    void gettime(timeval *s)
    {
    	GetSystemTime(s);
    }
    #endif
    
    
    DWORD WINAPI a (void *args)
    {
    	int i;
    	pthread *p = (pthread *)args;
    	timeval f,s; 
    	int tmp = 0;	
    	int n = 1000000000 / threads;	
    
    	gettime(&s);
    	for (i=0; i < n; ++i)
    		z = z + 1;
    	gettime(&f);
    	savetime(&f, &s, &p->time);
    
    	return 0;
    }
    
    #define MAXTHREADS 1000
    pthread thrinfo[MAXTHREADS];
    HANDLE handle[MAXTHREADS];
    
    int main(int argc, char ** argv)
    {	
    	int i;
    	timeval t1,t2;
    	double tt;
    	double tsum = 0, tmax = 0, tmin = 10000;
    	threads = atoi(argv[1]);
    	if (threads > 1000) {
    		printf("Too many threads\n");
    		return 1;
    	}
    	gettime(&t1);
    	for (i=0; i < threads; ++i)
    		handle[i] = CreateThread(NULL, 0, a, &thrinfo[i], 0, NULL);
    	WaitForMultipleObjects(threads, handle, TRUE, INFINITE);
    	gettime(&t2);
    	savetime(&t2, &t1, &tt);
    	for (i = 0; i < threads; i++)
    	{
    		tsum += thrinfo[i].time;
    		if (thrinfo[i].time > tmax) tmax = thrinfo[i].time;
    		if (thrinfo[i].time < tmin) tmin = thrinfo[i].time;
    	}
    
    	printf("total time: %.5f, sum=%.5f, avg=%.5f, max=%.5f, min=%.5f\n", tt, tsum, tsum / threads, tmax, tmin);
    	return 0;
    }


    And the results are (on a single processor machine, so no better result than 1 thread):
    Code:
    E:\proj\threads\Debug>threads.exe 1
    thrinfo.time = 3.022202
    total time: 3.02234, sum=3.02220, avg=3.02220, max=3.02220, min=3.02220
    
    E:\proj\threads\Debug>threads.exe 1
    total time: 3.06047, sum=3.06033, avg=3.06033, max=3.06033, min=3.06033
    
    E:\proj\threads\Debug>threads.exe 2
    total time: 3.02187, sum=5.94632, avg=2.97316, max=3.00914, min=2.93719
    
    E:\proj\threads\Debug>threads.exe 5
    total time: 3.01997, sum=13.76931, avg=2.75386, max=2.84780, min=2.65381
    
    E:\proj\threads\Debug>threads.exe 10
    total time: 3.02203, sum=19.66025, avg=1.96603, max=2.56212, min=0.86317
    
    E:\proj\threads\Debug>threads.exe 20
    total time: 3.02361, sum=17.67053, avg=0.88353, max=1.25765, min=0.15066
    Release code is no better than debug - because most of the time is spent updating a global variable (unless we remove the volatile, at which point the compiler optimizes away the entire loop, which is not what we wanted).

    Code:
    E:\proj\threads\Release>threads.exe 1
    total time: 3.28467, sum=3.28457, avg=3.28457, max=3.28457, min=3.28457
    
    E:\proj\threads\Release>threads.exe 2
    total time: 3.28459, sum=6.42355, avg=3.21178, max=3.28444, min=3.13911
    
    E:\proj\threads\Release>threads.exe 10
    total time: 3.28421, sum=21.23973, avg=2.12397, max=2.69832, min=0.88906
    
    E:\proj\threads\Release>threads.exe 20
    total time: 3.28929, sum=21.64023, avg=1.08201, max=2.34513, min=0.16381
    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.
