I am writing a program to process some very large images (1 GB+) and I am trying to use threads to speed things up. I have to process the images pixel by pixel because of the output that is required. My setup is as follows:

1 read thread, which reads a large chunk of the image into an unsigned char **, a two-dimensional array that I allocate dynamically based on the amount of free memory on the machine. Each element in the array is a pixel. Once I have read in a portion of the image, I start n processing threads, each of which claims its own unique pixel, performs various calculations on it, and dumps the output into a separate array (also sized according to the free memory on the machine).

When I start only one processing thread, the program makes it all the way through and processes every pixel. When I start 2 threads, it also seems to make it all the way through, but at the very end only one of the processing threads signals the read thread that it is done (the read thread performs the necessary cleanup when it notices that everything has been read in, but since only one thread signals it, the program never finishes). With more than 2 threads, the program gets part of the way into the image (perhaps 7 or 8 million pixels, after a few separate reads have already completed) and one processing thread always fails to signal that it has finished, so the program freezes.

I am using critical sections to access the global variables (to make sure each thread gets its own pixel) and Events to communicate between the threads. A basic outline of my program looks like this:

main()
    start read thread

readThread()
{
    start write thread
    start n processing threads
    while (endPixel < imageSize)
    {
        read()   // set endPixel to the last pixel that fits in the char ** buffer
        set event telling the processing threads to start
        WaitForMultipleObjects(PROCESSNUM, processingThreadEvent)
    }
    cleanup()
}

processingThreads()
{
    while (true)
    {
        wait for signal from read thread
        while (currentPix < endPixel)
            process pixel

        SetEvent(processingThreadEvent[threadID]);
    }
}
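
To make the synchronization a little more concrete, here is a stripped-down sketch of roughly how the events and the critical section are wired up. It is not my actual code: the real program reads from disk into the unsigned char ** buffer, the write thread is omitted, read(), processPixel() and cleanup() are just stand-ins for the real work, and to keep the sketch self-contained I have given each processing thread its own auto-reset start event, whereas my real program uses one event that all of the processing threads wait on.

    #include <windows.h>
    #include <process.h>

    #define PROCESSNUM 4                                // number of processing threads

    HANDLE startEvent[PROCESSNUM];                      // read thread -> each processing thread
    HANDLE processingThreadEvent[PROCESSNUM];           // each processing thread -> read thread
    CRITICAL_SECTION pixelLock;                         // guards currentPix
    volatile long currentPix = 0;                       // next pixel to hand out
    volatile long endPixel = 0;                         // one past the last pixel read so far
    long imageSize = 0;                                 // total pixels, set when the image is opened
    // the unsigned char ** pixel buffer (allocated based on free memory) is omitted here

    unsigned __stdcall processingThread(void *arg)
    {
        int threadID = (int)(INT_PTR)arg;

        for (;;)
        {
            // wait until the read thread says a new chunk is ready
            WaitForSingleObject(startEvent[threadID], INFINITE);

            for (;;)
            {
                long pix;

                EnterCriticalSection(&pixelLock);       // claim a unique pixel
                pix = currentPix;
                if (pix < endPixel)
                    ++currentPix;
                LeaveCriticalSection(&pixelLock);

                if (pix >= endPixel)
                    break;                              // nothing left in this chunk

                // processPixel(pix);                   // the real per-pixel work goes here
            }

            SetEvent(processingThreadEvent[threadID]);  // tell the read thread this thread is done
        }
    }

    unsigned __stdcall readThread(void *arg)
    {
        int i;

        InitializeCriticalSection(&pixelLock);

        for (i = 0; i < PROCESSNUM; ++i)
        {
            startEvent[i]            = CreateEvent(NULL, FALSE, FALSE, NULL);  // auto-reset
            processingThreadEvent[i] = CreateEvent(NULL, FALSE, FALSE, NULL);  // auto-reset
            _beginthreadex(NULL, 0, processingThread, (void *)(INT_PTR)i, 0, NULL);
        }

        while (endPixel < imageSize)
        {
            // read();  // fill the buffer and advance endPixel

            for (i = 0; i < PROCESSNUM; ++i)            // release the processing threads
                SetEvent(startEvent[i]);

            // wait until every processing thread has signalled that it is done
            WaitForMultipleObjects(PROCESSNUM, processingThreadEvent, TRUE, INFINITE);
        }

        // cleanup();  // tell the threads to exit, close the handles, free the buffer
        return 0;
    }

(In the sketch WaitForMultipleObjects is called with bWaitAll = TRUE, so the read thread does not continue until every processing thread has signalled. Apart from the per-thread start events, my real program has essentially this structure.)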

Can anyone see any problems with this design? All of the processing threads set their events except one, which fails to. The more processing threads I have, the sooner the failure occurs. With only 3 processing threads, two or three reads may complete successfully, but then the read somewhere between 10 and 11 million pixels fails to start because of the one thread that never sets its event. Which thread it is (0, 1, 2, ...) seems to be random, and it is always exactly one thread that fails to set its event.

I am writing this in C/C++ in Microsoft Visual Studio .NET on Windows XP.

Also, if anyone has ideas on how to process a very large image pixel by pixel as quickly as possible, I am open to suggestions for changing my program to pick up speed. I am planning to look into clustering next, but I would appreciate any other ideas.

Thanks.

Paul Marshall