Thread: Pthreads performance

  1. #16
    Chinese pâté foxman's Avatar
    Join Date
    Jul 2007
    Location
    Canada
    Posts
    404
    What do you want to time, exactly?
    I hate real numbers.

  2. #17
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    My personal approach would be to time the code inside main, from the place just before you start the first thread, to just after the final thread has finished. That's how long it takes to perform all the work, and that is what matters, don't you think? It also makes the whole task very easy, because you are just measuring the one thread (the one that runs main).

    I would also use two measurements:
    the result from clock() and the result from gettimeofday() - since "real" time and "CPU time" are two different things (there shouldn't be much difference unless someone else is using a lot of your CPU time, but just in case there is, it's good to KNOW that there is a difference).
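
    Roughly like this (just a sketch - run_threads() is a placeholder for whatever code creates and joins all of your threads):
    Code:
    #include <stdio.h>
    #include <time.h>
    #include <sys/time.h>
    
    /* placeholder: create and join all the worker threads here */
    static void run_threads(void)
    {
    }
    
    int main(void)
    {
        struct timeval wall_start, wall_end;
        clock_t cpu_start, cpu_end;
    
        gettimeofday(&wall_start, NULL);   /* "real" (wall-clock) time */
        cpu_start = clock();               /* CPU time used by the whole process */
    
        run_threads();                     /* from just before the first thread starts
                                              to just after the last one has finished */
    
        cpu_end = clock();
        gettimeofday(&wall_end, NULL);
    
        double wall = (wall_end.tv_sec - wall_start.tv_sec)
                    + (wall_end.tv_usec - wall_start.tv_usec) / 1e6;
        double cpu  = (double)(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    
        printf("wall-clock: %.3f s, CPU: %.3f s\n", wall, cpu);
        return 0;
    }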

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #18
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Well, I care about the real time.
    I use barriers because I need to time exactly that code. I need the time spent on the loop only.

  4. #19
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by C_ntua View Post
    Well, I care about the real time.
    I use barriers because I need to time exactly that code. I need the time spent on the loop only.
    Ok, so calculate the time taken in each of the threads, then post-process all the threads' times. The result should be displayed as max, min, average and sum (and perhaps also standard deviation, if you want to see how well your average matches your actual data - if you have a big variance, then the average is not really a good measure).
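
    Something like this for the post-processing (just a sketch - it assumes you have already stored each thread's elapsed time in seconds in an array t[]):
    Code:
    #include <stdio.h>
    #include <math.h>
    
    /* t[] holds each thread's elapsed time in seconds, n is the thread count */
    void report(const double *t, int n)
    {
        double sum = 0.0, min = t[0], max = t[0];
        for (int i = 0; i < n; i++) {
            sum += t[i];
            if (t[i] < min) min = t[i];
            if (t[i] > max) max = t[i];
        }
        double avg = sum / n;
    
        double var = 0.0;
        for (int i = 0; i < n; i++)
            var += (t[i] - avg) * (t[i] - avg);
        double stddev = sqrt(var / n);
    
        printf("sum=%.3f min=%.3f max=%.3f avg=%.3f stddev=%.3f\n",
               sum, min, max, avg, stddev);
    }
    
    int main(void)
    {
        double t[4] = { 0.41, 0.39, 0.44, 0.40 };   /* illustrative values only */
        report(t, 4);
        return 0;
    }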

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  5. #20
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> My "problem" is why it speeds up with 100 threads.
    None of the code you posted measures "speedup".

    >> I need the time spent for only the loop.
    That won't give you a measurement of speedup. I still don't understand why you want that measurement.

    >> I used the "time" command to have a better view. Indeed with 100 threads I get a less user time.
    You can't cut out time spent everywhere else when calculating speedup. You are introducing overhead in time and resources as you run more threads. You can't cut out that overhead (kernel time) when measuring speedup. Speedup is measured using total program execution time.

    >> My personal approach would be to time the code inside main...
    That will give you speedup - using the *total* CPU time (user + kernel). You'll want to remove any unneeded synchronization points (barriers) before measuring total program execution time.

    You should be able to use the "time" utility to measure total program execution time.
    Here's a few more options if you want to do your own timing: getrusage() or times().
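
    For example, getrusage() reports the user and kernel CPU time consumed by the whole process so far - a minimal sketch:
    Code:
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    
    int main(void)
    {
        /* ... run all the threads to completion here ... */
    
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0) {
            double user = ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6;
            double sys  = ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
            printf("user=%.3f s, kernel=%.3f s, total CPU=%.3f s\n",
                   user, sys, user + sys);
        }
        return 0;
    }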

    gg

  6. #21
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Well, I appreciate the help. But lets focus on the problem.
    I need the time for the specific lines of code.
    As I said, I have an algorithm that uses a for-loop. I posted the simplified code above with only the for loop, without the rest of the algorithm. I did that because, even though the rest of the algorithm is one line, in order to execute it you need the rest of the code, which is spread across many files, so it is very big.

    And AGAIN: I also calculated the time WITHOUT barriers. I found the max time, which is what I need. AGAIN it is smaller for 100 threads than for 10 threads.

    Can anybody actually compile and test the above code to see for himself?

  7. #22
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Quote Originally Posted by C_ntua View Post
    And AGAIN: I also calculated the time WITHOUT barriers. I found the max time, which is what I need. AGAIN it is smaller for 100 threads than for 10 threads.

    Can anybody actually compile and test the above code to see for himself?
    Yes, 1/x tends to be smaller for larger x, so the time each thread's loop takes, measured from its start to its end (for short per-thread run-times), will be smaller for a larger number of threads.


    I can't compile the code and run it on my machine (either at work or at home) at present, since none of my currently usable machines have Linux (and thus gettimeofday, pthreads, etc).

    I could rewrite your code to use Windows threads, but I doubt it will change the mathematical rule that you are just proving (constant divided by x approaches zero as x grows).

    But as stated by both me and Codeplug, it's the overall runtime that matters, not the amount of time it takes to run one thread to completion. It's only ever sensible to run a number of threads that is roughly equal to the number of (currently available) CPU cores. Noticeably more than that will definitely slow things down, as there is more overhead in creating, switching between, and destroying threads. [That is for CPU-bound tasks - if the task is IO-bound, then doubling or even tripling the number of threads can make the system run faster, since some other thread is available to do work while the other threads are waiting for IO to complete - so depending on the task, a different number of threads is the right solution.]

    It is also worth noting that not all multithreaded applications use threads to get the best performance out of the system; some use them because, for example, they simplify the task at hand and save the programmer some "trouble".

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  8. #23
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    I agree about the overhead and the thread switching. That is why I consider my results strange. Well, you never know exactly what the CPU does; that is why we are testing.
    But...

    We are saying the same things again and again.

    Let's say you have X threads. Each thread executes A / X loops. The TOTAL number of loops executed by your CPU is A. If you have a large X, the number of loops is still A! That is the main idea, to keep the same number of loops. Right?

    So the processor executes A loops. I measure the time needed to execute only the loop - either with barriers, or with the max thread time.

    Having the above code, or a similar one, executed on Windows would be very nice! If you don't get the same results - that is, 100 threads performing faster than 10 - then there is for sure a problem with the whole thing.

    Or there is indeed a strange bug that I fail to see...

    And again, keep in mind that the barriers are used for the time calculation. The method is used by professionals. I don't know if it is the best method, but it is a correct one. I didn't come up with it, I just read it in other papers...

  9. #24
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    Sure, any single thread is going to have less work to do if you split the task into 100 parts as opposed to 10 parts.

    But you're not counting an awful lot of overhead in terms of creating and joining threads, and all that synchronisation with the barrier.

    Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <time.h>
    #include <sys/time.h>
    #include <pthread.h>
    
    /* per-thread timestamps */
    typedef struct thrtime {
      struct timeval created;
      struct timeval alive;
      struct timeval dead;
      struct timeval reaped;
    } thrtime;
    
    int threads;
    
    void * a (void *args)
    {
      thrtime *p = (thrtime *)args;
      int n = 1000000000 / threads;
    
      gettimeofday( &p->alive, NULL );
    
      for (int i=0; i < n; ++i)
        ;   /* note: an optimizing compiler may remove this empty loop entirely */
    
      gettimeofday( &p->dead, NULL );
    
      return NULL;
    }
    
    int main(int argc, char ** argv)
    {
      threads = atoi(argv[1]);
      pthread_t tid[threads];
      thrtime p[threads];
    
      for (int i=0; i < threads; ++i) {
        gettimeofday( &p[i].created, NULL );
        pthread_create(&tid[i], NULL, a, p+i);
      }
    
      for (int i=0; i < threads; ++i) {
        pthread_join(tid[i], NULL);
        gettimeofday( &p[i].reaped, NULL );
      }
    
      // now do some calcs on all those timings
      return 0;
    }
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  10. #25
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    I just reproduced [I think] your testing, and I got these rather random results:
    Code:
    E:\proj\threads\Debug>threads.exe 2
    sum=0.39, avg=0.20, max=0.20, min=0.20
    
    E:\proj\threads\Debug>threads.exe 10
    sum=1.06, avg=0.11, max=0.16, min=0.02
    I'm not going to post my code, because it's not very nice and I can't be bothered to fix it up, to be honest. So, yes, one thread does less work, but even then, the overhead of running more threads actually adds up to more total time - because each thread gets interrupted by the other threads.

    And you may not understand what the processor (and, more importantly here, the OS) does, but I'm pretty sure I understand fairly well what happens inside a modern processor (and of course older ones - that's easier, as there's much less parallelism).

    So, are you convinced yet that measuring what each thread does is meaningless?

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  11. #26
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> I don't know if it is the best method but it is a correct one.
    Method for what?! What are you trying to calculate or show?

    If you are trying to calculate speedup (as your previous comments indicate), the only valid measurement for calculating speedup is total program execution.
    *Please indicate whether or not you agree with this statement. If not, please explain why.*

    >> Lets say you have X threads. Each thread executes a A / X number of loops. The TOTAL number of loops executed by your CPU is A.
    Sure, assuming A divides evenly by X - but you are only looking at how long it takes for a single thread to execute its small slice of A, excluding all overhead. In my mind this is a meaningless measurement.
    *Please indicate whether or not you agree with this statement. If not, please explain why.*

    As we've all mentioned, the key point here is that you can't exclude all the overhead associated with threads. Since you've chosen to exclude the overhead, then the math is actually quite simple - you don't even have to write any code:

    Easy math when excluding overhead:
    If it takes C amount of time for a single thread to execute A loops, then it will take C/2 amount of time for a thread to execute A/2 loops. So if you have two threads, each thread will take C/2 amount of time. The total amount of time is therefore C/2 + C/2, which equals C. You can carry this out for any number of threads and the total will always be C.

    What you are measuring is C/X, which isn't useful.
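
    For example, say C is 12 seconds: with 10 threads each thread measures about 1.2 seconds, and with 100 threads each measures about 0.12 seconds - the per-thread number shrinks, but the total work is still 12 seconds of CPU time, plus the growing overhead you've excluded. A smaller per-thread time with 100 threads does not mean the program got faster.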

    If you do not agree with any of these statements, please point out what you don't agree with and why, so we can all "get on the same page".

    gg

  12. #27
    Chinese pâté foxman's Avatar
    Join Date
    Jul 2007
    Location
    Canada
    Posts
    404
    I wrote this stuff, which tries to measure the time (and by time I mean time as we humans feel it, so my choice of time function might be wrong, because I'm not sure which sort of "time" clock() measures) it takes for some work to get done. Do you believe it's a better or a worse measurement than what the OP is getting with his code?

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <pthread.h>
    
    #ifndef NB_THREADS
    #define NB_THREADS 1000
    #endif
    #define NB_ITER_TOTAL 5000000000l    // 5 000 000 000
    #define NB_ITER_THREAD (NB_ITER_TOTAL / NB_THREADS)
    
    #define CHCK(X) { if (X != 0) { fprintf(stderr, "Error at line %d: " #X "\n", __LINE__); exit(1); } }
    
    pthread_barrier_t barrier;
    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    
    // These variables count the number of threads going through the lock
    int startThreadCount = 0;
    int endThreadCount = 0;
    
    // Define your favorite variable for holding time-related stuff
    clock_t startTime;
    clock_t endTime;
    
    void* loopFunc(void *arg)
    {
        long i;
    
        // The last thread to get here will take the first time measurement
        pthread_mutex_lock(&mutex);
            startThreadCount++;
            if (startThreadCount == NB_THREADS)
            {
                // Insert your favorite timing function here
                startTime = clock();
            }
        pthread_mutex_unlock(&mutex);
    
        // Then all threads are going to stop here, do the stuff, and stop again after
        pthread_barrier_wait(&barrier);
        for (i = 0; i < NB_ITER_THREAD; i++);
        pthread_barrier_wait(&barrier);
    
        // The first thread to get here will take the other time measurement
        pthread_mutex_lock(&mutex);
            if (endThreadCount == 0)
            {
                // Insert your favorite timing function here and print
                // the time difference
                endTime = clock();
                printf("It took %f sec\n", (double) (endTime - startTime) / CLOCKS_PER_SEC);
            }
            endThreadCount++;
        pthread_mutex_unlock(&mutex);
    
        pthread_exit(NULL);
    }
    
    int main(int argc, char **argv)
    {
        pthread_t threads[NB_THREADS];
        pthread_attr_t threadAttr;
        int i;
    
        // Initializing stuff and creating NB_THREADS threads
        CHCK(pthread_attr_init(&threadAttr));
        CHCK(pthread_attr_setdetachstate(&threadAttr, PTHREAD_CREATE_DETACHED));
        CHCK(pthread_barrier_init(&barrier, NULL, NB_THREADS));
    
        for (i = 0; i < NB_THREADS; i++)
        {
            CHCK(pthread_create(threads + i, &threadAttr, loopFunc, NULL));
        }
    
        // Free stuff and exit the main thread
        CHCK(pthread_attr_destroy(&threadAttr));
        pthread_exit(NULL);
    }
    Code:
    $ gcc -DNB_THREADS=2 -lpthread -o a2 main.c
    $ gcc -DNB_THREADS=1000 -lpthread -o a1000 main.c
    $ ./a1000
    It took 14.210000 sec
    $ ./a2
    It took 12.020000 sec
    I hate real numbers.

  13. #28
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    clock() measures CPU time (officially, at least - occasionally, the C runtime will replace it with wall-clock time, since there's no good way to determine CPU time, e.g. in DOS), and it's not a bad measure - under low load and with no IO operations it would be almost identical to the wall-clock time anyway.

    Of course, the first thread to finish doesn't really measure all the work being done - much more time may well have been consumed than that, since some threads may not even have STARTED the work yet.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  14. #29
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Again thanx for the posts.

    First of all, let me make it more clear. I need to time an algorithm. That means

    Code:
    get_time1
    execute_algorithm
    get_time2
    time2 - time1 is the time needed for the algorithm execution
    My algorithm is just the for-loop
    Code:
    for (int i=0; i < a; ++i)
    ;
    In order to execute it faster we use threads, because we have an 8-core CPU. The way to do that is:
    Code:
    for (int i=0; i <part_of_a; ++i)
    ;
    Which means we divide "a" into a number of pieces. I DIVIDE THE DATA among the threads. Each thread executes a piece. Well, since a = 1000000 and I test 10, 100 or 1000 threads, "a" is evenly divisible. So part_of_a for each thread would be a / threads, where "threads" is the number of threads.
    To make it clear, think of the following. You want to increase "number" a times.
    You would have:
    Code:
    for (int i=0; i < a; ++i)
       number++;
    with threads:
    Code:
    for (int i=0; i < a / threads; ++i)
       number++;
    number will be increased by "a" in both cases. With threads it will be done faster.
    NOTE!! The above code isn't correct, as anybody who knows about thread synchronization will notice. Again, it is wrong, but you get the idea. Right?
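
    One common way to actually make it correct is to give each thread its own partial counter and sum them after the join - a rough sketch (the names here are made up, not from my real code):
    Code:
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    typedef struct {
        long iterations;   /* this thread's share of "a" */
        long partial;      /* this thread's private count */
    } work_t;
    
    static void *worker(void *arg)
    {
        work_t *w = arg;
        for (long i = 0; i < w->iterations; ++i)
            w->partial++;
        return NULL;
    }
    
    int main(int argc, char **argv)
    {
        int threads = (argc > 1) ? atoi(argv[1]) : 10;
        long a = 1000000;
    
        pthread_t tid[threads];
        work_t w[threads];
        long number = 0;
    
        for (int i = 0; i < threads; ++i) {
            w[i].iterations = a / threads;
            w[i].partial = 0;
            pthread_create(&tid[i], NULL, worker, &w[i]);
        }
        for (int i = 0; i < threads; ++i) {
            pthread_join(tid[i], NULL);
            number += w[i].partial;   /* no race: summed after the thread is done */
        }
    
        printf("number = %ld\n", number);
        return 0;
    }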

    So, I don't care about overheads, joining and stuff like that. Why don't I care? Because my "job" is to time a specific algorithm. There are reasons why you want that specific algorithm, but that is another discussion. In any case I care ONLY about the specific algorithm's execution time.

    ABOUT SPEEDUP:
    To make it clear: speedup can mean a lot of things. Your statement is correct if I wanted to calculate the speedup of the program. I want to calculate the speedup of an algorithm. More specifically, I want to calculate the speedup gained from parallel programming. The term is used in Parallel Programming books as follows:
    SPEEDUP = time_for_serial_program / time_for_parallel_program
    A serial program is a program without threads, so every instruction is serialized. Alternatively you can count the execution time with 1 thread.

    matsp:
    Ok, so you have times for each thread SEPARATELY. I care about the total time needed to execute the algorithm. Since some threads execute simultaneously and some don't, as I see it you cannot calculate that time without barriers.
    --You COULD count the time from main. Main acts, in a way, as a barrier. But with this method you also count joining, initializing and other thread overhead, and I don't care about that stuff.
    --So I AGREE. Calculating the time for each thread separately is indeed meaningless. That is why I used two pthread_barriers. That way you measure the time it takes for every thread to finish its part of the algorithm.

    foxman:
    I believe it is a correct way. It is also nice code, btw. I don't know exactly what clock() measures either, so I don't know exactly what is being measured. I'll search and post again.
    Personally I believe it is a worse way. Let me clarify: the way you thought of it is indeed the best way to do it if you pass NULL to loopFunc. But you could pass i, for example, so each thread would have an ID. Then, instead of using mutexes, the thread with ID == 0 could take the time, so no synchronization with mutexes would be needed.
    When getting the endTime you wouldn't need synchronization at all, for obvious reasons.
    There is NO WAY to get a perfect starting time.
    You calculate startTime, but you include a
    pthread_mutex_unlock(&mutex)
    plus a
    pthread_barrier_wait(&barrier);
    after the startTime. First of all, you should take startTime after the barrier. The difference from the method I posted is that my startTime is off by the delay it takes gettimeofday() to return and store its value, while your startTime is off by the delay it takes clock() to return, store its value AND execute pthread_mutex_unlock().
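
    What I mean, roughly (a sketch only - it reuses your globals barrier, startTime, endTime and NB_ITER_THREAD, and passes the loop index through the void* argument as the thread's ID):
    Code:
    /* loopFunc rewritten so the thread with id 0 does the timing, no mutex needed */
    void* loopFunc(void *arg)
    {
        long id = (long) arg;        /* the loop index, passed in at pthread_create() */
        long i;
    
        pthread_barrier_wait(&barrier);
        if (id == 0)
            startTime = clock();     /* taken just after the barrier releases, so it
                                        can lag the other threads by a small delay */
    
        for (i = 0; i < NB_ITER_THREAD; i++)
            ;
    
        pthread_barrier_wait(&barrier);
        if (id == 0)                 /* the barrier already guarantees everyone finished */
        {
            endTime = clock();
            printf("It took %f sec\n", (double) (endTime - startTime) / CLOCKS_PER_SEC);
        }
    
        pthread_exit(NULL);
    }
    
    /* and in main:
       CHCK(pthread_create(threads + i, &threadAttr, loopFunc, (void *) (long) i)); */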


    Finally:
    If you calculate the following:
    Code:
    barrier
    gettime
    nothing!
    barrier
    gettime
    you get something like 50 usec for 10 threads and 600 usec for 100 threads. My times are around 4000000 usec, so the delay from the barriers doesn't really matter. You'll find that creating and joining pthreads doesn't take a lot of time either. And in any case the main algorithm is the most time-consuming part; the rest is less than 10% of the time needed for the algorithm.
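
    A self-contained version of that barrier-overhead measurement could look like this (just a sketch - each thread does nothing between the two barriers and thread 0 reports the elapsed time):
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <pthread.h>
    
    static pthread_barrier_t barrier;
    static struct timeval t1, t2;
    
    static void *empty(void *arg)
    {
        long id = (long) arg;
    
        pthread_barrier_wait(&barrier);
        if (id == 0) gettimeofday(&t1, NULL);
    
        /* nothing! */
    
        pthread_barrier_wait(&barrier);
        if (id == 0) {
            gettimeofday(&t2, NULL);
            long usec = (t2.tv_sec - t1.tv_sec) * 1000000L + (t2.tv_usec - t1.tv_usec);
            printf("barrier-to-barrier overhead: %ld usec\n", usec);
        }
        return NULL;
    }
    
    int main(int argc, char **argv)
    {
        int n = (argc > 1) ? atoi(argv[1]) : 10;
        pthread_t tid[n];
    
        pthread_barrier_init(&barrier, NULL, n);
        for (long i = 0; i < n; ++i)
            pthread_create(&tid[i], NULL, empty, (void *) i);
        for (int i = 0; i < n; ++i)
            pthread_join(tid[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }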

    EDIT: matsp posted before me - so gettimeofday() seems better at measuring actual elapsed time.
    Last edited by C_ntua; 06-16-2008 at 04:24 PM.

  15. #30
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    50 usec to call a barrier is A LOT!

    I copied/pasted/rewrote this (apologies for the ugly code - it's not meant to be "good" code, so none of you pedants should point out the bad thread methodology, short variable names, or lack of error checking):
    Code:
    #include <stdio.h>
    #include <stdlib.h>   /* for atoi() */
    #include <windows.h>
    
    int threads; 
    volatile int z=0;
    
    typedef struct pthread {
    	int no;
    	double time;
    } pthread;
    
    #define USE_TSC 1
    
    #if USE_TSC
    typedef unsigned __int64 timeval;
    __declspec(naked)
    unsigned __int64 __cdecl rdtsc(void)
    {
       __asm
       {
          rdtsc
          ret       ; return value at EDX:EAX
       }
    }
    
    void savetime (timeval *f, timeval *s, double *res)
    {
    	*res = (*f - *s) / (2.0 * 1000.0 * 1000.0 * 1000.0);  // my processor runs at 2GHz
    }
    
    void gettime(timeval *s)
    {
    	*s = rdtsc();
    }
    
    #else
    typedef SYSTEMTIME timeval;
    
    void savetime (timeval *f, timeval *s, double *res)
    {
    	FILETIME start, end;
    	LARGE_INTEGER delta, startl, endl;
    	SystemTimeToFileTime(s, &start);
    	SystemTimeToFileTime(f, &end);
    	startl.LowPart  = start.dwLowDateTime;
    	startl.HighPart = start.dwHighDateTime;
    	endl.LowPart    = end.dwLowDateTime;
    	endl.HighPart   = end.dwHighDateTime;
    	delta.QuadPart = endl.QuadPart - startl.QuadPart; 
        *res = (double)delta.QuadPart / 100000000;
    }
    
    void gettime(timeval *s)
    {
    	GetSystemTime(s);
    }
    #endif
    
    
    DWORD WINAPI a (void *args)
    {
    	int i;
    	pthread *p = (pthread *)args;
    	timeval f,s; 
    	int tmp = 0;	
    	int n = 1000000000 / threads;	
    
    	gettime(&s);
    	for (i=0; i < n; ++i)
    		z = z + 1;
    	gettime(&f);
    	savetime(&f, &s, &p->time);
    
    	return 0;
    }
    
    #define MAXTHREADS 1000
    pthread thrinfo[MAXTHREADS];
    HANDLE handle[MAXTHREADS];
    
    int main(int argc, char ** argv)
    {	
    	int i;
    	timeval t1,t2;
    	double tt;
    	double tsum = 0, tmax = 0, tmin = 10000;
    	threads = atoi(argv[1]);
    	if (threads > 1000) {
    		printf("Too many threads\n");
    		return 1;
    	}
    	gettime(&t1);
    	for (i=0; i < threads; ++i)
    		handle[i] = CreateThread(NULL, 0, a, &thrinfo[i], 0, NULL);
    	WaitForMultipleObjects(threads, handle, TRUE, INFINITE);
    	gettime(&t2);
    	savetime(&t2, &t1, &tt);
    	for (i = 0; i < threads; i++)
    	{
    		tsum += thrinfo[i].time;
    		if (thrinfo[i].time > tmax) tmax = thrinfo[i].time;
    		if (thrinfo[i].time < tmin) tmin = thrinfo[i].time;
    	}
    
    	printf("total time: %.5f, sum=%.5f, avg=%.5f, max=%.5f, min=%.5f\n", tt, tsum, tsum / threads, tmax, tmin);
    	return 0;
    }


    And the results are (on a single processor machine, so no better result than 1 thread):
    Code:
    E:\proj\threads\Debug>threads.exe 1
    thrinfo.time = 3.022202
    total time: 3.02234, sum=3.02220, avg=3.02220, max=3.02220, min=3.02220
    
    E:\proj\threads\Debug>threads.exe 1
    total time: 3.06047, sum=3.06033, avg=3.06033, max=3.06033, min=3.06033
    
    E:\proj\threads\Debug>threads.exe 2
    total time: 3.02187, sum=5.94632, avg=2.97316, max=3.00914, min=2.93719
    
    E:\proj\threads\Debug>threads.exe 5
    total time: 3.01997, sum=13.76931, avg=2.75386, max=2.84780, min=2.65381
    
    E:\proj\threads\Debug>threads.exe 10
    total time: 3.02203, sum=19.66025, avg=1.96603, max=2.56212, min=0.86317
    
    E:\proj\threads\Debug>threads.exe 20
    total time: 3.02361, sum=17.67053, avg=0.88353, max=1.25765, min=0.15066
    Release code is no better than debug - because most of the time is spent updating a global variable (unless we remove the volatile, at which point the compiler optimizes away the entire loop, which is not what we wanted).

    Code:
    E:\proj\threads\Release>threads.exe 1
    total time: 3.28467, sum=3.28457, avg=3.28457, max=3.28457, min=3.28457
    
    E:\proj\threads\Release>threads.exe 2
    total time: 3.28459, sum=6.42355, avg=3.21178, max=3.28444, min=3.13911
    
    E:\proj\threads\Release>threads.exe 10
    total time: 3.28421, sum=21.23973, avg=2.12397, max=2.69832, min=0.88906
    
    E:\proj\threads\Release>threads.exe 20
    total time: 3.28929, sum=21.64023, avg=1.08201, max=2.34513, min=0.16381
    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.
