Hi all, first allow me to apologize if there is already an answer to my question in the forums... I did some searching, but didn't find anything that related directly to my question.

I am currently playing around with the pthread library and attempted a very simple example. The test consists of adding a single value to every pixel of a large array/image (large being 10,000x5,000 indices).

My current implementation (below) consists of two functions: one that performs the addition operation and one that manages the threads. The addition function takes a struct containing its arguments: a pointer to an image/array-containing struct and two vertical boundaries. The addition function then operates on the given image within those boundaries.

The second function creates two threads: one for each half of the image/array.

While the task is completed without error, there is NO speed increase in comparison to a non-threaded addition function.

Here is the code I have for the two functions (I can post the other source and header files if someone would like to compile the example).

Code:
```
#include <pthread.h>

void * add5(void * void_args);   /* forward declaration of the worker */

Image * add5Threaded(Image * image)
{
    AddArgs arg1, arg2;
    pthread_t thread1, thread2;

    /* first thread handles the top half of the image */
    arg1.height0 = 0;
    arg1.height1 = image->height / 2;
    arg1.image = image;

    /* second thread handles the bottom half */
    arg2.height0 = image->height / 2;
    arg2.height1 = image->height;
    arg2.image = image;

    pthread_create(&thread1, NULL, add5, &arg1);
    pthread_create(&thread2, NULL, add5, &arg2);

    /* wait for both halves to finish before returning */
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    return image;
}

void * add5(void * void_args)
{
    AddArgs * args = (AddArgs *) void_args;
    Pixel pixel;
    int x, y;

    for (y = args->height0; y < args->height1; y++) {
        for (x = 0; x < args->image->width; x++) {
            pixel = getPixel(args->image, x, y);
            pixel.r += 100;
            pixel.b += 100;
            pixel.g += 100;

            setPixel(args->image, x, y, pixel);
            fixPixel(args->image, x, y);
        }
    }

    return NULL;
}
```
For reference, the argument struct:
Code:
```
typedef struct struct_addArgs {
    int height0;
    int height1;
    Image * image;
} AddArgs;   /* typedef name was cut off in the original post; assumed here */
```
Also, when I use mutexes, the code takes FOREVER. I'm guessing this is the deadlock problem I read about?

Anyway, any help is greatly appreciated!!

thanks

2. Well, to see a performance increase, you would need to be running on a machine with at least two real cores.

If you've only got one core, then threads are just a "smoke and mirrors" illusion, one which actually takes away a small amount of time that could otherwise be used for doing work.
This probably explains why you see no gain: the threads are run sequentially.

> Also, when I use mutexes, the code takes FOREVER. I'm guessing this is the deadlock problem I read about?
No, deadlock is when nothing happens at all.
This would be an example of thrashing.

You only need the mutex if you're working with overlapping data. Assuming your split is good, then the two image halves are independent.

3. I don't normally use threading with the expectation of a performance increase; I just use it for the concurrency. But in theory, IF you have a multi-processor system, you should be able to get such an increase.

Code:
```
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <pthread.h>

void *testfuncA(void *n) {
    int i;
    for (i = 0; i < 3; i++) {
        printf("%ld %d\n", (long)n, i); fflush(stdout);
        sleep(1);
    }
    return NULL;
}

int main (int argc, char *argv[]) {
    pthread_t th[2];
    long i;

    /* launch two threads, each sleeping 3x1 seconds */
    for (i = 0; i < 2; i++)
        pthread_create(&th[i], NULL, testfuncA, (void *)i);

    sleep(3);   /* give both threads time to finish */
    puts("DONE");

    return 0;
}
```
This takes 3 seconds, not six, but of course there is no load.

If you do have multiple processors, loop your process enough times to make it take 10 seconds or so, and use a dual-CPU meter -- both cores should go up to 100% for the duration. If not, there is an implementation problem (I guess).

4. Ah, I did leave out some obvious information... I am running the application on a machine with a dual-core processor (a Core 2 Duo), so it isn't a hardware limitation.

Looking at my system monitor, the first core maintains 100% usage while the second core fluctuates at around 80% or 90%. So both cores are being used, though for what purpose I can't figure out...

EDIT: I just added a print statement which prints 0 in the first thread and 1 in the second thread, and the result: with smaller images (10x10 or 400x400) the threads are not running concurrently at all: it prints all of the 0's and then all of the 1's... With a larger image (10,000x5,000), it prints large alternating blocks of 0's and 1's (000000000001111111111110000000000111111111000000000 etc.).

5. Originally Posted by lilrayray
EDIT: I just added a print statement which prints out 0 in the first thread and 1 in the second thread
Did you use fflush()?
Code:
`printf(...your debugging stuff...);  fflush(stdout);`
Otherwise the apparent order of operations on the display will not match the real order. The number of 1's or 0's just represents the size of your stdout buffer.

6. Originally Posted by MK27
Did you use fflush()?
Code:
`printf(...your debugging stuff...);  fflush(stdout);`
Otherwise the apparent order of operations on the display will not match the real order. The number of 1's or 0's just represents the size of your stdout buffer.
Yup, I have an fflush(stdout) in there... a better representation of the output would be 000000111111101010101010100000000111111100000000111111100000
Seems to work a little bit...

7. Hmmm. How are you timing this?

Like I said, I've never bothered to do a test like this because I haven't used pthreads purely for performance, but this is a little disappointing.

Hopefully one of the truly seasoned pros on the board will come along with an explanation; be patient. A really crude guess would be that because you are reading from and writing to the same contiguous blocks of memory, the kernel is simply serving up first one thread and then the other, or that in fact the calculations in add5 (which look simple) are outpacing the FSB transfer (meaning the bus is a bottleneck). I.e., maybe these are complications you need to consider when using threads this way.

8. If the image is large enough, the vast majority of your time (single- or multi-threaded) will be spent waiting for DRAM accesses (cache misses), and memory has a constant bandwidth no matter how many threads you have.

9. If this is the case, how does one optimize large matrix operations? Looking at an application like the GIMP, the operations are certainly threaded. Should I be physically dividing the image into separate parts and handing each part to its own thread? If I do that, I then have to stitch the parts back together, which adds additional time.

btw, I am using the clock() function for timing... it isn't super scientific, but it does give an indication as to the runtime of my different algorithms.

Thanks,
Lilrayray

10. Originally Posted by lilrayray
If this is the case, how does one optimize large matrix operations?
I would guess by "unrolling" the loop, q.v.
Loop unwinding - Wikipedia, the free encyclopedia
and/or (probably AND) working with larger chunks. Right now, your getPixel/setPixel are only getting/setting 3 bytes at a time. You might want to consider fetching an entire row of pixels instead, so that thread 1 works row by row through the top half and thread 2 row by row through the bottom.

This may make better use of the processor caches, since they will have room for more than 3 bytes at a time.

11. On x86, no matter how little data you request, a memory read will always be 64 bytes (a cache line), so your subsequent reads will be serviced by cache.

The problem is with total memory bandwidth. There is really no way around that.

GIMP can do it because its operations are much more complex, so the memory access time is a smaller fraction of the total. In your case, you are doing very simple things to a large chunk of data, so most of the time is in memory access.

12. You might also want to try the pthread library's set-affinity facilities to ensure both cores are working in the threaded version.