C Board  

Go Back   C Board > General Programming Boards > C Programming

Reply
 
LinkBack Thread Tools Display Modes
Old 02-06-2010, 06:55 AM   #1
Registered User
 
Join Date: Jul 2006
Posts: 111
Multithreading with pthreads -- no performance increase

Hi all, first allow me to apologize if there is an answer to my question already in the forums... I did some searching, but didnt find anything that related directly to my question.

I am currently playing around with the pthread library and attempted a very simple example. The test consists of adding a single value to a large array of pixels/an image (large being 10 000x5 000 indecies).

My current implementation (below) consists of two methods: one for performing the addition operation and one for managing the threads. The addition function takes a struct containing the arguments: a pointer to an image/array-containing-struct and two vertical boundaries. The addition function then operates on the paramaterized image within the given boudaries.

The second function creates two threads: one for each half of the image/array.

While the task is completed without error, there is NO speed increase in comparison to a non-threaded addition function.

Here is the code I have for the two functions (I can post the other source and header files if someone would like to compile the example).

Code:
Image * add5Threaded(Image * image)
{


	addArgs arg1;
	arg1.height0 = 0;
	arg1.height1 = image->height/2;
	arg1.image = image;
	
	addArgs arg2;
	arg2.height0 = image->height/2;
	arg2.height1 = image->height;
	arg2.image = image;
	
	pthread_t tid1;
	pthread_t tid2;
	int ret;
	
	ret = pthread_create(&tid1, NULL, add5_algorithm, &arg1);
	ret = pthread_create(&tid2, NULL, add5_algorithm, &arg2);
	
	pthread_join(tid1, NULL);
	pthread_join(tid2, NULL);
	
	return image;
}
void * add5_algorithm(addArgs * args)
{
	Pixel pixel;
	
	int x, y;
	
	for(y = args->height0; y < args->height1; y++) {
		for(x = 0; x < args->image->width; x++) {
			
			pixel = getPixel(args->image, x, y);
			pixel.r += 100;
			pixel.b += 100;
			pixel.g += 100;
			
			//pthread_mutex_lock(&image_mutex);
				setPixel(args->image, x, y, pixel);
				fixPixel(args->image, x, y);
			//pthread_mutex_unlock(&image_mutex);	
		}
	}
	
	pthread_exit(NULL);
}
for reference, the arguments:
Code:
typedef struct struct_addArgs {
	int height0;
	int height1;
	Image * image;
}addArgs;
Also, when I use mutexs, the code takes FOREVER. Im guessing this is the deadlock problem I read about?

Anyway, any help is greatly appreciated!!

thanks
lilrayray is offline   Reply With Quote
Old 02-06-2010, 08:06 AM   #2
and the hat of Destiny
 
Salem's Avatar
 
Join Date: Aug 2001
Location: The edge of the known universe
Posts: 22,495
Well to see a performance increase, you would need to be running on a machine with at least two real cores.

If you've only got one core, then threads are just a "smoke and mirrors" illusion, which takes away a small amount of time which could otherwise be used for doing work.
This probably explains why you see no gain, the threads are run sequentially.

> Also, when I use mutexs, the code takes FOREVER. Im guessing this is the deadlock problem I read about?
No, deadlock is when nothing happens at all.
This would be an example of thrashing.

You only need the mutex if you're working with overlapping data. Assuming your split is good, then the two image halves are independent.
__________________
If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.

Salem is offline   Reply With Quote
Old 02-06-2010, 08:28 AM   #3
dat is, vast staat
 
MK27's Avatar
 
Join Date: Jul 2008
Location: SE Queens
Posts: 6,612
I don't normally use threading with the expectation of a performance increase, I just use it for the threads, but in theory IF you have a multi-processor system, you should be able to get such an increase,

Code:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *testfuncA(void *n) {
	int i;
	for (i=0;i<3;i++) {
		printf("%d %d\n",(int)n,i); fflush(stdout);
		sleep(1);
	}
	pthread_exit(0);
}


int main (int argc, char *argv[]) {
	int i;
	pthread_t one, two;

	pthread_create(&one,NULL,testfuncA,(void*)1);
	pthread_create(&two,NULL,testfuncA,(void*)2);

        sleep(3);
	puts("DONE");

	return 0;
}
This takes 3 seconds, not six, but of course there is no load.

If you do have multiple processors, loop your process enough times to make it take 10 seconds or so, and use a dual CPU meter -- both of them should go up to 100% for the duration. If not, there is implementation problem (I guess).
__________________
C programming resources:
GNU C Function and Macro Index -- glibc reference manual
The C Book -- nice online learner guide
Current ISO draft standard
CCAN -- new CPAN like open source library repository
GDB tutorial #1 -- gnu debugger tutorials -- GDB tutorial #2
cpwiki -- our wiki on sourceforge
MK27 is offline   Reply With Quote
Old 02-06-2010, 08:53 AM   #4
Registered User
 
Join Date: Jul 2006
Posts: 111
Ah, I did leave out some obvious information... I am running the application on a machine with a dual core processor (core 2 duo), so it isnt a hardware limitation.

Looking at my system monitor, the first core maintains 100% usage while the second core fluctuates at around 80% or 90%. So both cores are being used, though for what purpose I cant figure out...

EDIT: I just added a print statement which prints out 0 in the first thread and 1 in the second thread and the result: with smaller images (10x10 or 400x400) the threads are not running completely concurrently: it prints all of the 0's and then all of the 1's... With a larger image (10 000x5 000), it prints large alternating blocks of all 0's and all 1's. (0000000000011111111111100000000001111111110000000 00etc).

Last edited by lilrayray; 02-06-2010 at 09:02 AM.
lilrayray is offline   Reply With Quote
Old 02-06-2010, 09:10 AM   #5
dat is, vast staat
 
MK27's Avatar
 
Join Date: Jul 2008
Location: SE Queens
Posts: 6,612
Quote:
Originally Posted by lilrayray View Post
EDIT: I just added a print statement which prints out 0 in the first thread and 1 in the second thread
Did you use fflush()?
Code:
printf(...you debugging stuff...);  fflush(stdout); 
Otherwise the apparent order of operations on the display will not match the real order. The number of 1's or 0's just represents the size of your stdout buffer.
__________________
C programming resources:
GNU C Function and Macro Index -- glibc reference manual
The C Book -- nice online learner guide
Current ISO draft standard
CCAN -- new CPAN like open source library repository
GDB tutorial #1 -- gnu debugger tutorials -- GDB tutorial #2
cpwiki -- our wiki on sourceforge
MK27 is offline   Reply With Quote
Old 02-06-2010, 09:30 AM   #6
Registered User
 
Join Date: Jul 2006
Posts: 111
Quote:
Originally Posted by MK27 View Post
Did you use fflush()?
Code:
printf(...you debugging stuff...);  fflush(stdout); 
Otherwise the apparent order of operations on the display will not match the real order. The number of 1's or 0's just represents the size of your stdout buffer.
Yup, I have an fflush(stdout) in there... a better representation of the output would be 000000111111101010101010100000000111111100000000111111100000
Seems to work a little bit...
lilrayray is offline   Reply With Quote
Old 02-06-2010, 10:26 AM   #7
dat is, vast staat
 
MK27's Avatar
 
Join Date: Jul 2008
Location: SE Queens
Posts: 6,612
Hmmm. How are you timing this?

Like I said, I've never bothered to do a test like this because I haven't used pthreads purely for performance, but this is a little disappointing.

Hopefully one of the truly seasoned pros on the board will come along with an explanation, be patient A really crude guess would be that because you are reading from and writing to the same contiguous blocks of memory, the kernel is simply serving up first one thread then the other, or that in fact the calculations in add5 (which look simple) are outpacing the FSB transfer (meaning it is a bottleneck). IE, maybe these are complications you need to consider when using threads this way.
__________________
C programming resources:
GNU C Function and Macro Index -- glibc reference manual
The C Book -- nice online learner guide
Current ISO draft standard
CCAN -- new CPAN like open source library repository
GDB tutorial #1 -- gnu debugger tutorials -- GDB tutorial #2
cpwiki -- our wiki on sourceforge

Last edited by MK27; 02-06-2010 at 10:34 AM.
MK27 is offline   Reply With Quote
Old 02-06-2010, 11:48 AM   #8
Registered User
 
Join Date: Dec 2006
Location: Canada
Posts: 2,322
If the image is large enough, the vast majority of your time (single- or multi-threaded) will be spent waiting for DRAM accesses (cache misses), and memory has a constant bandwidth no matter how many threads you have.
cyberfish is offline   Reply With Quote
Old 02-06-2010, 12:24 PM   #9
Registered User
 
Join Date: Jul 2006
Posts: 111
If this is the case, how does one optimize large matrix operations? Looking at an application like the gimp, the operations are certainly threaded. Should I be physically dividing the image into separate parts and tossing each part to its own thread? If I do that, I then have to resew each part which adds additional time.

btw, I am using the clock() function for timing... it isnt super scientific, but it does give an indication as to runtime of my different algorithms.

Thanks,
Lilrayray
lilrayray is offline   Reply With Quote
Old 02-06-2010, 12:39 PM   #10
dat is, vast staat
 
MK27's Avatar
 
Join Date: Jul 2008
Location: SE Queens
Posts: 6,612
Quote:
Originally Posted by lilrayray View Post
If this is the case, how does one optimize large matrix operations?
I would guess by "unrolling" the loop, qv.
Loop unwinding - Wikipedia, the free encyclopedia
and/or (probably AND) working with larger chunks. Right now, your getPixel/setPixel are only getting/setting 3 bytes at a time. You might want to consider fetching an entire row of pixels instead, so that thread 1 works row by row thru the top half, and then thread 2 row by row thru the bottom.

This may make better use of the processor caches, since they will have room for more than 3 bytes at a time.
__________________
C programming resources:
GNU C Function and Macro Index -- glibc reference manual
The C Book -- nice online learner guide
Current ISO draft standard
CCAN -- new CPAN like open source library repository
GDB tutorial #1 -- gnu debugger tutorials -- GDB tutorial #2
cpwiki -- our wiki on sourceforge

Last edited by MK27; 02-06-2010 at 12:44 PM.
MK27 is offline   Reply With Quote
Old 02-06-2010, 12:56 PM   #11
Registered User
 
Join Date: Dec 2006
Location: Canada
Posts: 2,322
On x86, no matter how little data you request, a memory read will always be 64 bytes (a cache line), so your subsequent reads will be serviced by cache.

The problem is with total memory bandwidth. There is really no way around that.

GIMP can do it because their operations are much more complex, so the memory access time is a smaller fraction. In your case, you are doing very simple things to large chunk of data, so most of the time is in memory access.
cyberfish is offline   Reply With Quote
Old 02-07-2010, 06:27 AM   #12
Maz
Registered User
 
Join Date: Nov 2005
Posts: 150
you might also want to try pthread librarys set affinity to enrure both cores are workin with threaded version.
Maz is offline   Reply With Quote
Reply

Thread Tools
Display Modes

Forum Jump

Similar Threads
Thread Thread Starter Forum Replies Last Post
Multithreading no performance gain? cyberfish C++ Programming 22 12-30-2009 01:47 PM
Multithreading (flag stopping a thread, ring buffer) volatile ShwangShwing C Programming 3 05-19-2009 07:27 AM
Pthreads performance C_ntua C Programming 42 06-17-2008 11:29 AM
Performance and footprint of virtual function George2 C++ Programming 8 01-31-2008 07:34 PM
File map performance George2 C++ Programming 8 01-04-2008 04:18 AM


All times are GMT -6. The time now is 12:19 AM.


Powered by vBulletin® Version 3.8.1
Copyright ©2000 - 2010, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.3.2

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22