Thread: Multithreading with pthreads -- no performance increase

  1. #1
    Registered User
    Join Date
    Jul 2006
    Posts
    111

    Multithreading with pthreads -- no performance increase

    Hi all, first allow me to apologize if there is an answer to my question already in the forums... I did some searching, but didnt find anything that related directly to my question.

    I am currently playing around with the pthread library and attempted a very simple example. The test consists of adding a single value to a large array of pixels/an image (large being 10 000x5 000 indecies).

    My current implementation (below) consists of two methods: one for performing the addition operation and one for managing the threads. The addition function takes a struct containing the arguments: a pointer to an image/array-containing-struct and two vertical boundaries. The addition function then operates on the paramaterized image within the given boudaries.

    The second function creates two threads: one for each half of the image/array.

    While the task is completed without error, there is NO speed increase in comparison to a non-threaded addition function.

    Here is the code I have for the two functions (I can post the other source and header files if someone would like to compile the example).

    Code:
    Image * add5Threaded(Image * image)
    {
    
    
    	addArgs arg1;
    	arg1.height0 = 0;
    	arg1.height1 = image->height/2;
    	arg1.image = image;
    	
    	addArgs arg2;
    	arg2.height0 = image->height/2;
    	arg2.height1 = image->height;
    	arg2.image = image;
    	
    	pthread_t tid1;
    	pthread_t tid2;
    	int ret;
    	
    	ret = pthread_create(&tid1, NULL, add5_algorithm, &arg1);
    	ret = pthread_create(&tid2, NULL, add5_algorithm, &arg2);
    	
    	pthread_join(tid1, NULL);
    	pthread_join(tid2, NULL);
    	
    	return image;
    }
    void * add5_algorithm(addArgs * args)
    {
    	Pixel pixel;
    	
    	int x, y;
    	
    	for(y = args->height0; y < args->height1; y++) {
    		for(x = 0; x < args->image->width; x++) {
    			
    			pixel = getPixel(args->image, x, y);
    			pixel.r += 100;
    			pixel.b += 100;
    			pixel.g += 100;
    			
    			//pthread_mutex_lock(&image_mutex);
    				setPixel(args->image, x, y, pixel);
    				fixPixel(args->image, x, y);
    			//pthread_mutex_unlock(&image_mutex);	
    		}
    	}
    	
    	pthread_exit(NULL);
    }
    for reference, the arguments:
    Code:
    typedef struct struct_addArgs {
    	int height0;
    	int height1;
    	Image * image;
    }addArgs;
    Also, when I use mutexs, the code takes FOREVER. Im guessing this is the deadlock problem I read about?

    Anyway, any help is greatly appreciated!!

    thanks

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Well to see a performance increase, you would need to be running on a machine with at least two real cores.

    If you've only got one core, then threads are just a "smoke and mirrors" illusion, which takes away a small amount of time which could otherwise be used for doing work.
    This probably explains why you see no gain, the threads are run sequentially.

    > Also, when I use mutexs, the code takes FOREVER. Im guessing this is the deadlock problem I read about?
    No, deadlock is when nothing happens at all.
    This would be an example of thrashing.

    You only need the mutex if you're working with overlapping data. Assuming your split is good, then the two image halves are independent.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    I don't normally use threading with the expectation of a performance increase, I just use it for the threads, but in theory IF you have a multi-processor system, you should be able to get such an increase,

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <pthread.h>
    
    void *testfuncA(void *n) {
    	int i;
    	for (i=0;i<3;i++) {
    		printf("%d %d\n",(int)n,i); fflush(stdout);
    		sleep(1);
    	}
    	pthread_exit(0);
    }
    
    
    int main (int argc, char *argv[]) {
    	int i;
    	pthread_t one, two;
    
    	pthread_create(&one,NULL,testfuncA,(void*)1);
    	pthread_create(&two,NULL,testfuncA,(void*)2);
    
            sleep(3);
    	puts("DONE");
    
    	return 0;
    }
    This takes 3 seconds, not six, but of course there is no load.

    If you do have multiple processors, loop your process enough times to make it take 10 seconds or so, and use a dual CPU meter -- both of them should go up to 100% for the duration. If not, there is implementation problem (I guess).
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  4. #4
    Registered User
    Join Date
    Jul 2006
    Posts
    111
    Ah, I did leave out some obvious information... I am running the application on a machine with a dual core processor (core 2 duo), so it isnt a hardware limitation.

    Looking at my system monitor, the first core maintains 100% usage while the second core fluctuates at around 80% or 90%. So both cores are being used, though for what purpose I cant figure out...

    EDIT: I just added a print statement which prints out 0 in the first thread and 1 in the second thread and the result: with smaller images (10x10 or 400x400) the threads are not running completely concurrently: it prints all of the 0's and then all of the 1's... With a larger image (10 000x5 000), it prints large alternating blocks of all 0's and all 1's. (0000000000011111111111100000000001111111110000000 00etc).
    Last edited by lilrayray; 02-06-2010 at 09:02 AM.

  5. #5
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by lilrayray View Post
    EDIT: I just added a print statement which prints out 0 in the first thread and 1 in the second thread
    Did you use fflush()?
    Code:
    printf(...you debugging stuff...);  fflush(stdout);
    Otherwise the apparent order of operations on the display will not match the real order. The number of 1's or 0's just represents the size of your stdout buffer.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  6. #6
    Registered User
    Join Date
    Jul 2006
    Posts
    111
    Quote Originally Posted by MK27 View Post
    Did you use fflush()?
    Code:
    printf(...you debugging stuff...);  fflush(stdout);
    Otherwise the apparent order of operations on the display will not match the real order. The number of 1's or 0's just represents the size of your stdout buffer.
    Yup, I have an fflush(stdout) in there... a better representation of the output would be 000000111111101010101010100000000111111100000000111111100000
    Seems to work a little bit...

  7. #7
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Hmmm. How are you timing this?

    Like I said, I've never bothered to do a test like this because I haven't used pthreads purely for performance, but this is a little disappointing.

    Hopefully one of the truly seasoned pros on the board will come along with an explanation, be patient A really crude guess would be that because you are reading from and writing to the same contiguous blocks of memory, the kernel is simply serving up first one thread then the other, or that in fact the calculations in add5 (which look simple) are outpacing the FSB transfer (meaning it is a bottleneck). IE, maybe these are complications you need to consider when using threads this way.
    Last edited by MK27; 02-06-2010 at 10:34 AM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  8. #8
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    If the image is large enough, the vast majority of your time (single- or multi-threaded) will be spent waiting for DRAM accesses (cache misses), and memory has a constant bandwidth no matter how many threads you have.

  9. #9
    Registered User
    Join Date
    Jul 2006
    Posts
    111
    If this is the case, how does one optimize large matrix operations? Looking at an application like the gimp, the operations are certainly threaded. Should I be physically dividing the image into separate parts and tossing each part to its own thread? If I do that, I then have to resew each part which adds additional time.

    btw, I am using the clock() function for timing... it isnt super scientific, but it does give an indication as to runtime of my different algorithms.

    Thanks,
    Lilrayray

  10. #10
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by lilrayray View Post
    If this is the case, how does one optimize large matrix operations?
    I would guess by "unrolling" the loop, qv.
    Loop unwinding - Wikipedia, the free encyclopedia
    and/or (probably AND) working with larger chunks. Right now, your getPixel/setPixel are only getting/setting 3 bytes at a time. You might want to consider fetching an entire row of pixels instead, so that thread 1 works row by row thru the top half, and then thread 2 row by row thru the bottom.

    This may make better use of the processor caches, since they will have room for more than 3 bytes at a time.
    Last edited by MK27; 02-06-2010 at 12:44 PM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  11. #11
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    On x86, no matter how little data you request, a memory read will always be 64 bytes (a cache line), so your subsequent reads will be serviced by cache.

    The problem is with total memory bandwidth. There is really no way around that.

    GIMP can do it because their operations are much more complex, so the memory access time is a smaller fraction. In your case, you are doing very simple things to large chunk of data, so most of the time is in memory access.

  12. #12
    Registered User Maz's Avatar
    Join Date
    Nov 2005
    Location
    Finland
    Posts
    194
    you might also want to try pthread librarys set affinity to enrure both cores are workin with threaded version.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Multithreading no performance gain?
    By cyberfish in forum C++ Programming
    Replies: 22
    Last Post: 12-30-2009, 01:47 PM
  2. Multithreading (flag stopping a thread, ring buffer) volatile
    By ShwangShwing in forum C Programming
    Replies: 3
    Last Post: 05-19-2009, 07:27 AM
  3. Pthreads performance
    By C_ntua in forum C Programming
    Replies: 42
    Last Post: 06-17-2008, 11:29 AM
  4. Performance and footprint of virtual function
    By George2 in forum C++ Programming
    Replies: 8
    Last Post: 01-31-2008, 07:34 PM
  5. File map performance
    By George2 in forum C++ Programming
    Replies: 8
    Last Post: 01-04-2008, 04:18 AM