![]() |
| | #1 |
| Registered User Join Date: Jul 2006
Posts: 111
| Multithreading with pthreads -- no performance increase I am currently playing around with the pthread library and attempted a very simple example. The test consists of adding a single value to a large array of pixels/an image (large being 10 000x5 000 indecies). My current implementation (below) consists of two methods: one for performing the addition operation and one for managing the threads. The addition function takes a struct containing the arguments: a pointer to an image/array-containing-struct and two vertical boundaries. The addition function then operates on the paramaterized image within the given boudaries. The second function creates two threads: one for each half of the image/array. While the task is completed without error, there is NO speed increase in comparison to a non-threaded addition function. Here is the code I have for the two functions (I can post the other source and header files if someone would like to compile the example). Code: Image * add5Threaded(Image * image)
{
addArgs arg1;
arg1.height0 = 0;
arg1.height1 = image->height/2;
arg1.image = image;
addArgs arg2;
arg2.height0 = image->height/2;
arg2.height1 = image->height;
arg2.image = image;
pthread_t tid1;
pthread_t tid2;
int ret;
ret = pthread_create(&tid1, NULL, add5_algorithm, &arg1);
ret = pthread_create(&tid2, NULL, add5_algorithm, &arg2);
pthread_join(tid1, NULL);
pthread_join(tid2, NULL);
return image;
}
void * add5_algorithm(addArgs * args)
{
Pixel pixel;
int x, y;
for(y = args->height0; y < args->height1; y++) {
for(x = 0; x < args->image->width; x++) {
pixel = getPixel(args->image, x, y);
pixel.r += 100;
pixel.b += 100;
pixel.g += 100;
//pthread_mutex_lock(&image_mutex);
setPixel(args->image, x, y, pixel);
fixPixel(args->image, x, y);
//pthread_mutex_unlock(&image_mutex);
}
}
pthread_exit(NULL);
}
Code: typedef struct struct_addArgs {
int height0;
int height1;
Image * image;
}addArgs;
Anyway, any help is greatly appreciated!! thanks |
| lilrayray is offline | |
| | #2 |
| and the hat of Destiny Join Date: Aug 2001 Location: The edge of the known universe
Posts: 22,495
| Well to see a performance increase, you would need to be running on a machine with at least two real cores. If you've only got one core, then threads are just a "smoke and mirrors" illusion, which takes away a small amount of time which could otherwise be used for doing work. This probably explains why you see no gain, the threads are run sequentially. > Also, when I use mutexs, the code takes FOREVER. Im guessing this is the deadlock problem I read about? No, deadlock is when nothing happens at all. This would be an example of thrashing. You only need the mutex if you're working with overlapping data. Assuming your split is good, then the two image halves are independent. |
| Salem is offline | |
| | #3 |
| dat is, vast staat Join Date: Jul 2008 Location: SE Queens
Posts: 6,612
| I don't normally use threading with the expectation of a performance increase, I just use it for the threads, but in theory IF you have a multi-processor system, you should be able to get such an increase, Code: #include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
void *testfuncA(void *n) {
int i;
for (i=0;i<3;i++) {
printf("%d %d\n",(int)n,i); fflush(stdout);
sleep(1);
}
pthread_exit(0);
}
int main (int argc, char *argv[]) {
int i;
pthread_t one, two;
pthread_create(&one,NULL,testfuncA,(void*)1);
pthread_create(&two,NULL,testfuncA,(void*)2);
sleep(3);
puts("DONE");
return 0;
}
If you do have multiple processors, loop your process enough times to make it take 10 seconds or so, and use a dual CPU meter -- both of them should go up to 100% for the duration. If not, there is implementation problem (I guess).
__________________ C programming resources: GNU C Function and Macro Index -- glibc reference manual The C Book -- nice online learner guide Current ISO draft standard CCAN -- new CPAN like open source library repository GDB tutorial #1 -- gnu debugger tutorials -- GDB tutorial #2 cpwiki -- our wiki on sourceforge |
| MK27 is offline | |
| | #4 |
| Registered User Join Date: Jul 2006
Posts: 111
| Ah, I did leave out some obvious information... I am running the application on a machine with a dual core processor (core 2 duo), so it isnt a hardware limitation. Looking at my system monitor, the first core maintains 100% usage while the second core fluctuates at around 80% or 90%. So both cores are being used, though for what purpose I cant figure out... EDIT: I just added a print statement which prints out 0 in the first thread and 1 in the second thread and the result: with smaller images (10x10 or 400x400) the threads are not running completely concurrently: it prints all of the 0's and then all of the 1's... With a larger image (10 000x5 000), it prints large alternating blocks of all 0's and all 1's. (0000000000011111111111100000000001111111110000000 00etc). Last edited by lilrayray; 02-06-2010 at 09:02 AM. |
| lilrayray is offline | |
| | #5 | |
| dat is, vast staat Join Date: Jul 2008 Location: SE Queens
Posts: 6,612
| Quote:
Code: printf(...you debugging stuff...); fflush(stdout);
__________________ C programming resources: GNU C Function and Macro Index -- glibc reference manual The C Book -- nice online learner guide Current ISO draft standard CCAN -- new CPAN like open source library repository GDB tutorial #1 -- gnu debugger tutorials -- GDB tutorial #2 cpwiki -- our wiki on sourceforge | |
| MK27 is offline | |
| | #6 | |
| Registered User Join Date: Jul 2006
Posts: 111
| Quote:
Seems to work a little bit... | |
| lilrayray is offline | |
| | #7 |
| dat is, vast staat Join Date: Jul 2008 Location: SE Queens
Posts: 6,612
| Hmmm. How are you timing this? Like I said, I've never bothered to do a test like this because I haven't used pthreads purely for performance, but this is a little disappointing. Hopefully one of the truly seasoned pros on the board will come along with an explanation, be patient A really crude guess would be that because you are reading from and writing to the same contiguous blocks of memory, the kernel is simply serving up first one thread then the other, or that in fact the calculations in add5 (which look simple) are outpacing the FSB transfer (meaning it is a bottleneck). IE, maybe these are complications you need to consider when using threads this way.
__________________ C programming resources: GNU C Function and Macro Index -- glibc reference manual The C Book -- nice online learner guide Current ISO draft standard CCAN -- new CPAN like open source library repository GDB tutorial #1 -- gnu debugger tutorials -- GDB tutorial #2 cpwiki -- our wiki on sourceforge Last edited by MK27; 02-06-2010 at 10:34 AM. |
| MK27 is offline | |
| | #8 |
| Registered User Join Date: Dec 2006 Location: Canada
Posts: 2,322
| If the image is large enough, the vast majority of your time (single- or multi-threaded) will be spent waiting for DRAM accesses (cache misses), and memory has a constant bandwidth no matter how many threads you have. |
| cyberfish is offline | |
| | #9 |
| Registered User Join Date: Jul 2006
Posts: 111
| If this is the case, how does one optimize large matrix operations? Looking at an application like the gimp, the operations are certainly threaded. Should I be physically dividing the image into separate parts and tossing each part to its own thread? If I do that, I then have to resew each part which adds additional time. btw, I am using the clock() function for timing... it isnt super scientific, but it does give an indication as to runtime of my different algorithms. Thanks, Lilrayray |
| lilrayray is offline | |
| | #10 | |
| dat is, vast staat Join Date: Jul 2008 Location: SE Queens
Posts: 6,612
| Quote:
Loop unwinding - Wikipedia, the free encyclopedia and/or (probably AND) working with larger chunks. Right now, your getPixel/setPixel are only getting/setting 3 bytes at a time. You might want to consider fetching an entire row of pixels instead, so that thread 1 works row by row thru the top half, and then thread 2 row by row thru the bottom. This may make better use of the processor caches, since they will have room for more than 3 bytes at a time.
__________________ C programming resources: GNU C Function and Macro Index -- glibc reference manual The C Book -- nice online learner guide Current ISO draft standard CCAN -- new CPAN like open source library repository GDB tutorial #1 -- gnu debugger tutorials -- GDB tutorial #2 cpwiki -- our wiki on sourceforge Last edited by MK27; 02-06-2010 at 12:44 PM. | |
| MK27 is offline | |
| | #11 |
| Registered User Join Date: Dec 2006 Location: Canada
Posts: 2,322
| On x86, no matter how little data you request, a memory read will always be 64 bytes (a cache line), so your subsequent reads will be serviced by cache. The problem is with total memory bandwidth. There is really no way around that. GIMP can do it because their operations are much more complex, so the memory access time is a smaller fraction. In your case, you are doing very simple things to large chunk of data, so most of the time is in memory access. |
| cyberfish is offline | |
| | #12 |
| Registered User Join Date: Nov 2005
Posts: 150
| you might also want to try pthread librarys set affinity to enrure both cores are workin with threaded version. |
| Maz is offline | |
![]() |
| Thread Tools | |
| Display Modes | |
|
Similar Threads | ||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Multithreading no performance gain? | cyberfish | C++ Programming | 22 | 12-30-2009 01:47 PM |
| Multithreading (flag stopping a thread, ring buffer) volatile | ShwangShwing | C Programming | 3 | 05-19-2009 07:27 AM |
| Pthreads performance | C_ntua | C Programming | 42 | 06-17-2008 11:29 AM |
| Performance and footprint of virtual function | George2 | C++ Programming | 8 | 01-31-2008 07:34 PM |
| File map performance | George2 | C++ Programming | 8 | 01-04-2008 04:18 AM |