Thread: problem with win32 threads

  1. #1
    Registered User


    I am writing a program to process some very large images (1 GB+), and I am trying to use threads to speed things up. I currently have to process the images pixel by pixel because of the output that is needed. My setup goes like this:

    One read thread reads a large chunk of the image into an unsigned char **, a two-dimensional array that I allocate dynamically based on the amount of free memory on the machine. Each element in the array is a pixel. Once a portion of the image has been read in, I start n processing threads. Each one takes its own unique pixel, does various calculations on it, and writes the output into a separate array (whose size is also determined by the free memory on the machine).

    When I start only one processing thread, the program makes it all the way through and processes all of the pixels. When I start two threads, it seems to make it all the way through, but at the very end only one of the processing threads signals the read thread that it is done. (The read thread performs the necessary cleanup when it notices everything has been read in, but since only one thread signals it, it never finishes the program off.) When I have more than two threads, the program gets some way into the image (perhaps 7 or 8 million pixels, after a few separate reads have already completed), and then one processing thread always fails to signal that it has completed, so the program freezes.

    I am using critical sections to access global variables (to make sure each thread gets its own pixel) and events to communicate between the threads. A basic outline of my program looks like this:

    Code:
    main()
    {
        start read thread
    }

    readThread()
    {
        start write thread
        start n processing threads
        while (endPixel < imageSize)
        {
            read()   // sets endPixel to the last pixel that fits in the char ** buffer
            set event telling processing threads to start
            WaitForMultipleObjects(PROCESSNUM, processingThreadEvent, TRUE, INFINITE)
        }
        cleanup()
    }

    processingThreads()
    {
        while (true)
        {
            wait for signal from read thread
            while (currentPix < endPixel)
                process pixel

            SetEvent(processingThreadEvent[threadID]);
        }
    }

    Can anyone see any problems with this design? All the processing threads set their events except one, and the more processing threads I have, the sooner that happens. With three processing threads, two or three reads may complete successfully, but the read covering pixels 10 to 11 million may never start because one thread doesn't set its event. Which thread fails (0, 1, 2, ...) seems random, and it is always exactly one.
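    For comparison, here is a minimal sketch of a read/process handshake that cannot lose a wake-up, whatever caused the hang in this particular program. It uses portable C++11 (std::thread, std::mutex, std::condition_variable) rather than the Win32 events and critical sections described above, and the BatchCoordinator name and its members are hypothetical. The reader bumps a generation counter for each batch, so a worker that arrives late still sees that a new batch exists, and the reader waits on one shared pending count instead of one event per worker. Each worker takes a fixed stride of pixels (id, id + n, id + 2n, ...), so no per-pixel lock is needed.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Coordinates one reader and a fixed number of workers, one batch at a time.
class BatchCoordinator {
public:
    explicit BatchCoordinator(std::size_t workers) : workers_(workers) {
        assert(workers > 0);
    }

    // Reader side: publish pixels [begin, end) and block until every worker is done.
    void runBatch(std::size_t begin, std::size_t end) {
        std::unique_lock<std::mutex> lock(m_);
        begin_ = begin;
        end_ = end;
        pending_ = workers_;
        ++generation_;               // a late worker still sees the new generation
        start_.notify_all();
        done_.wait(lock, [this] { return pending_ == 0; });
    }

    // Reader side: ask workers to exit (call only after the last runBatch returns).
    void stop() {
        std::lock_guard<std::mutex> lock(m_);
        stopping_ = true;
        start_.notify_all();
    }

    // Worker side: handle pixels begin+id, begin+id+workers, ... of each batch.
    template <typename ProcessPixel>
    void workerLoop(std::size_t id, ProcessPixel process) {
        std::size_t seen = 0;
        for (;;) {
            std::size_t begin, end;
            {
                std::unique_lock<std::mutex> lock(m_);
                start_.wait(lock, [&] { return stopping_ || generation_ != seen; });
                if (generation_ == seen) return;   // stopping, no new batch
                seen = generation_;
                begin = begin_;
                end = end_;
            }
            for (std::size_t p = begin + id; p < end; p += workers_) process(p);
            std::lock_guard<std::mutex> lock(m_);
            if (--pending_ == 0) done_.notify_one();   // last worker wakes the reader
        }
    }

private:
    std::mutex m_;
    std::condition_variable start_, done_;
    std::size_t workers_;
    std::size_t begin_ = 0, end_ = 0, pending_ = 0, generation_ = 0;
    bool stopping_ = false;
};
```

    Usage: start n threads that each call workerLoop(i, process), call runBatch(begin, end) once per chunk read from disk, then call stop() and join the threads.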

    I am writing this in C/C++ in Microsoft Visual Studio .NET on Windows XP.

    Also, if anyone has ideas on how to process a very large image pixel by pixel as fast as possible, I am open to suggestions for restructuring my program to pick up speed. I am going to look into implementing clustering next, but I would appreciate any other ideas.

    Thanks.

    Paul Marshall

  2. #2
    JaWiB
    I'm not an expert on threads, but it doesn't seem like using threads will speed anything up. From what I understand, you won't actually have the threads running at the same time. Instead, one will run a little portion, then the next will run a little portion and they will switch back and forth.

    Maybe someone can correct me on this, but I don't think threads will speed anything up in this case.

  3. #3
    Registered User
    I think you are mostly correct. On a regular single-processor system it isn't going to speed things up much (it may even slow things down). However, I believe it will help on an SMP system, because Windows 2K/NT/XP can run different threads simultaneously on different processors. I may also see a performance increase on the newer P4s with Hyper-Threading. I also think threading will come in handy if I implement clustering.
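    Whether threads actually run at the same time depends on how many logical processors the machine has, so a natural refinement is to size the processing pool from the count the system reports. A small sketch, assuming portable C++11: std::thread::hardware_concurrency() typically reports logical processors (so Hyper-Threading siblings count) and may return 0 when the count is unknown. pickWorkerCount is a hypothetical helper; on Win32 the corresponding count is the dwNumberOfProcessors field filled in by GetSystemInfo().

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

// Hypothetical helper: number of processing threads to start, perCpu per
// logical processor. Falls back to one processor if the count is unknown.
std::size_t pickWorkerCount(std::size_t perCpu = 1) {
    assert(perCpu > 0);
    unsigned hw = std::thread::hardware_concurrency();   // 0 means "unknown"
    std::size_t cpus = (hw == 0) ? 1 : hw;
    return cpus * perCpu;
}
```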

  4. #4
    adrianxw
    On a single processor system, the context switching would slow this kind of app down - correct.

    Is the thread that is not signalling completion actually doing anything, or is it sitting there waiting for a go signal from the other thread? Either way the end result would be the same: it would appear not to be signalling completion.

    I assume you are checking the return values of all your API calls and are not simply running out of some resource.
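    One way to answer the "working or waiting?" question when the program hangs is to have each worker record where it is. A minimal sketch in portable C++11; the WorkerStateBoard class and its three states are hypothetical. A worker would call set(id, WaitingForGo) just before waiting for the read thread, set(id, Processing) after it wakes, and set(id, SignalledDone) right after its SetEvent() call; at the hang, notDone() lists the stuck workers and get() shows whether each is waiting or still processing.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-worker states for diagnosing a hang.
enum class WorkerState { WaitingForGo, Processing, SignalledDone };

class WorkerStateBoard {
public:
    explicit WorkerStateBoard(std::size_t workers) : states_(workers) {
        for (auto& s : states_) s.store(WorkerState::WaitingForGo);
    }

    // Called by worker 'id' at each stage of its loop.
    void set(std::size_t id, WorkerState s) {
        assert(id < states_.size());
        states_[id].store(s);
    }

    WorkerState get(std::size_t id) const { return states_[id].load(); }

    // IDs of workers that have not signalled completion (e.g. for a watchdog).
    std::vector<std::size_t> notDone() const {
        std::vector<std::size_t> ids;
        for (std::size_t i = 0; i < states_.size(); ++i)
            if (states_[i].load() != WorkerState::SignalledDone) ids.push_back(i);
        return ids;
    }

private:
    std::vector<std::atomic<WorkerState>> states_;
};
```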

  5. #5
    anonytmouse
    Hi, are you a programmer for N(A)SA or something? That is a huge image you are processing. You must be a pretty advanced programmer, so I doubt I can be much help, but here goes.

    --
    [edit]Rereading your post, it seems possible that the thread is getting stuck somewhere before the SetEvent() when it gets near the end of the input array. Is there any loop condition that might not work near the end of the input array?[/edit]

    Can you check the return value of SetEvent() and, if it is indeed failing, get the error code with GetLastError()?
    Code:
    if (!SetEvent(processingThreadEvent[threadID]))
    {
        DWORD dwErr = GetLastError();   // inspect dwErr in the debugger
        DebugBreak();
    }
    I would also be suspicious of the threadID value. Is there any way it could get corrupted by some kind of race condition? How is it passed to each thread?
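    A common way for a thread ID to get corrupted (not confirmed for this program) is passing the address of the loop counter, as in CreateThread(NULL, 0, Proc, &i, 0, NULL): by the time the new thread reads *lpParam, the loop may already have incremented i, so two threads can end up with the same ID. On Win32 the usual fix is to pass the value itself, for example (LPVOID)(INT_PTR)i. The sketch below shows the same idea in portable C++11, where std::thread copies its arguments so each worker gets a stable ID; runWorkersWithIds is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Starts 'count' workers, each receiving its own copy of its ID, and returns
// the ID each worker actually observed (slot i should hold i).
std::vector<std::size_t> runWorkersWithIds(std::size_t count) {
    assert(count > 0);
    std::vector<std::size_t> seenId(count, count);   // 'count' means "not set"
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < count; ++i)
        // 'i' is copied into the thread here, unlike passing &i to CreateThread.
        threads.emplace_back([&seenId](std::size_t id) { seenId[id] = id; }, i);
    for (auto& t : threads) t.join();
    return seenId;
}
```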

    --
    As for the algorithm, can we assume that you are reading from the file while the threads are doing their work? Your pseudo-code seems to hold up the worker threads while data is read, but I'm assuming that is not the case in the real code.

    Making that assumption, your design seems pretty good. The question is whether you can get rid of any of the critical sections or events. Instead of using a critical section to get each pixel, can you get a block of pixels?
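    A minimal sketch of handing out blocks instead of single pixels, assuming portable C++11: each worker claims a whole range with one atomic fetch_add, so there is no critical section per pixel, and the last range is clipped at endPixel so no worker can run past the data read so far. BlockDispenser and the block size are hypothetical; on Win32 the same shared counter could be advanced with InterlockedExchangeAdd.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Hands out pixel ranges [begin, end) of at most blockSize pixels.
class BlockDispenser {
public:
    BlockDispenser(std::size_t endPixel, std::size_t blockSize)
        : next_(0), end_(endPixel), block_(blockSize) {
        assert(blockSize > 0);
    }

    // Claims the next range; returns false once every pixel has been handed out.
    bool claim(std::size_t& begin, std::size_t& end) {
        std::size_t start = next_.fetch_add(block_);   // one atomic op per block
        if (start >= end_) return false;
        begin = start;
        end = (start + block_ < end_) ? start + block_ : end_;   // clip last block
        return true;
    }

private:
    std::atomic<std::size_t> next_;
    std::size_t end_, block_;
};
```

    A worker loop would then be: while (d.claim(b, e)) process pixels b through e - 1.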

    --
    Finally, here is how I think I would consider doing it.

    I would use IO completion ports to create a thread pool. Then I would split the input buffer into about 12 blocks (say, 1.5 times the number of processors). The read thread would look something like this:
    Code:
    while still got some data
       for (each input block)
         if not marked as being used
            mark as being used
            use ReadFile to read into block
            Issue work item to thread pool with PostQueuedCompletionStatus
               - parameter indicates which block to work on
         End If
       End For Each
    End While
    A block is marked as unused again when a worker thread has finished processing it.

    This removes any need for a critical section on the input side, and only one event is needed. Every time a block is read in, a work item is dispatched.
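    Completion ports are not available on Windows 95/98, so as a portable point of comparison, here is a minimal sketch of the same post/wait pattern built from a mutex, a condition variable and a FIFO queue in C++11. This is a swapped-in technique, not IOCP: post() plays the role of PostQueuedCompletionStatus and pop() the role of GetQueuedCompletionStatus. The WorkQueue and workerThread names and the kQuit sentinel are hypothetical; the sentinel gives workers a clean way to exit once all blocks have been posted.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical "quit" item, like a special completion key with IOCP.
const long kQuit = -1;

// Workers block in pop() until the read thread posts a block index.
class WorkQueue {
public:
    void post(long block) {
        assert(block >= kQuit);   // a block index, or the quit item
        {
            std::lock_guard<std::mutex> lock(m_);
            items_.push_back(block);
        }
        cv_.notify_one();
    }

    long pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !items_.empty(); });
        long block = items_.front();   // first in, first out
        items_.pop_front();
        return block;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::deque<long> items_;
};

// Worker loop: process posted blocks until the quit item arrives.
template <typename ProcessBlock>
void workerThread(WorkQueue& queue, ProcessBlock process) {
    for (long block = queue.pop(); block != kQuit; block = queue.pop())
        process(static_cast<std::size_t>(block));
}
```

    To shut down, post one kQuit per worker after the last block; because the queue is FIFO, every real block is taken before any worker sees its quit item.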

    [edit]
    Code:
    while (currentPix < endPixel)
        process pixel
    Is this actually more like this?
    Code:
    while (currentPix < endPixel)
    {
        EnterCS
        PIXEL px = Pixels[x++];
        LeaveCS

        process pixel
    }
    [/edit]
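    If the loop is shaped like the second version, the test currentPix < endPixel happens outside the critical section while the increment happens inside it, so two threads can both pass the test for the last pixel and one of them reads past endPixel. That is one plausible way for a worker to misbehave near the end of a batch (not confirmed for this program). A minimal sketch, assuming portable C++11, that performs the test and the increment under the same lock; PixelCursor is a hypothetical name and std::mutex stands in for the critical section.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical shared cursor over the pixels read in so far.
class PixelCursor {
public:
    explicit PixelCursor(std::size_t endPixel) : current_(0), end_(endPixel) {}

    // Hands out the next pixel index, or returns false when none are left.
    // Bounds test and increment share one lock (EnterCS ... LeaveCS).
    bool next(std::size_t& pixel) {
        std::lock_guard<std::mutex> lock(m_);
        if (current_ >= end_) return false;
        pixel = current_++;
        assert(pixel < end_);   // never hands out a pixel past endPixel
        return true;
    }

private:
    std::mutex m_;
    std::size_t current_, end_;
};
```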

    Unlikely to be your problem, but read this:
    http://support.microsoft.com/default...NoWebContent=1
    Last edited by anonytmouse; 07-26-2004 at 02:02 PM.

  6. #6
    anonytmouse
    The problem interested me, so here is what I came up with:
    Code:
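    #include <windows.h>
    
    // Placeholder definitions (assumptions): substitute your own.
    typedef unsigned char PIXEL;             // assumption: one byte per pixel
    const DWORD BLOCK_SIZE = 1024 * 1024;    // assumption: pixels per block
    
    // One or two for each processor...
    const UINT NUM_WORKER_THREADS = 8;
    
    // One and a half times the number of worker threads
    // so that blocks are always queued...
    const UINT NUM_BLOCKS = 12;
    
    
    PIXEL          g_PixelArray[BLOCK_SIZE * NUM_BLOCKS];
    volatile BOOL  g_BlockStates[NUM_BLOCKS];
    HANDLE         g_hWorkItemFinished;
    HANDLE         g_hCompletionPort;
    HANDLE         g_hInputFile;             // assumption: opened elsewhere with CreateFile
    
    // Supplied by you: processes dwCount pixels starting at pPixels.
    void YourFunc_ProcessPixels(PIXEL *pPixels, DWORD dwCount);
    
    // Forward declarations so SetupOperation() can refer to the thread procedures.
    static DWORD CALLBACK ReadThread(LPVOID lpParam);
    static DWORD CALLBACK WorkerThread(LPVOID lpParam);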
    
    void SetupOperation(void)
    {
    	// Create completion port...
    	g_hCompletionPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, NUM_WORKER_THREADS);
    
    	// Create an event to be signalled when a work item is complete...
    	g_hWorkItemFinished = CreateEvent(NULL, FALSE, FALSE, NULL);
    
    	// Create worker threads...
    	for (int i = 0;i < NUM_WORKER_THREADS;i++)
    	{
    		HANDLE hThread = CreateThread(NULL, 0, WorkerThread, NULL, 0, NULL);
    		CloseHandle(hThread);
    	}
    
    	// Create IO thread...
    	HANDLE hThread = CreateThread(NULL, 0, ReadThread, NULL, 0, NULL);
    	CloseHandle(hThread);
    }
    
    
    
    static DWORD CALLBACK ReadThread(LPVOID lpParam)
    {
    	DWORD dwRead;
    
    	while (TRUE)
    	{
    		BOOL bReadFromFile = FALSE;
    
    		for (int i = 0;i < NUM_BLOCKS;i++)
    		{
    			// Look for an available block in the input buffer...
    
    			if (g_BlockStates[i] == FALSE)
    			{
    				// We found an unused block, mark it as being used...
    				g_BlockStates[i] = TRUE;
    
    				// Read from file into our selected block
    				ReadFile(g_hInputFile, &g_PixelArray[BLOCK_SIZE * i], BLOCK_SIZE * sizeof(PIXEL), &dwRead, NULL);
    
    				// Record that we have read from file in this iteration...
    				bReadFromFile = TRUE;
    
    				// Issue work item to thread pool with PostQueuedCompletionStatus.
    				// Tell it which block to process and the size...
    				PostQueuedCompletionStatus(g_hCompletionPort, dwRead, i, NULL);
    
    				// Check if we are out of input data...
    				if (dwRead < BLOCK_SIZE * sizeof(PIXEL)) goto end;
    			}
    		}
    
    		if (!bReadFromFile)
    		{
    			// Only take the time to Wait if we haven't been in a timely ReadFile() call.
    			// If we have, there is probably already a block free.
    
    			// Possibly the event is not needed at all and we just replace it with a Sleep(1)...
    
    			// Wait for a work item to complete before we check for free input blocks again.
    			WaitForSingleObject(g_hWorkItemFinished, INFINITE);
    		}
    	}
    
    
    end:
    	return 0;
    }
    
    
    
    
    
    static DWORD CALLBACK WorkerThread(LPVOID lpParam)
    {
    	DWORD        dwSize;
    	ULONG_PTR    dwBlock;
    	LPOVERLAPPED lpOverlapped;
    
    	while (TRUE)
    	{
    		// Get a work item posted from the IO thread...
    		GetQueuedCompletionStatus(g_hCompletionPort, &dwSize, &dwBlock, &lpOverlapped, INFINITE);
    
    		// Process the block of pixels we have been allocated. dwBlock is a
    		// block index and dwSize is a byte count, so convert both...
    		YourFunc_ProcessPixels(&g_PixelArray[BLOCK_SIZE * dwBlock], (DWORD)(dwSize / sizeof(PIXEL)));
    
    		// Mark the block as being free again (no CS needed here, I think)...
    		g_BlockStates[dwBlock] = FALSE;
    
    		// Signal the read thread that a work item is finished so it
    		// can reuse the block...
    		SetEvent(g_hWorkItemFinished);
    	}
    
    	return 0;
    }
    No error checking, no guarantees...

  7. #7
    Registered User
    Join Date
    May 2004
    Posts
    7
    First off, thank you so much, anonytmouse. Sorry it has been a few days, but there have been some problems with the program that smooths the images, and getting it working was a priority. Secondly, Microsoft is going to be the death of me (if not all of us). The link you posted described my problem. I usually like to run my program in the debugging environment, because if something goes wrong that is a little more useful than a straight-up Windows crash, and since the program takes so long to run, I don't like running it both inside and outside the environment and spending twice the time. However, I suppose that is the first rule of programming under Windows: first assume Microsoft's code is causing the problem, then assume it is your own.

    Thank you for your input on using IO completion ports. I have a similar setup, just using events and critical sections, but I am not currently processing blocks of pixels, and that may be a better way to do it. I believe IO completion ports work only under 2K/XP, and my program needs to run on everything from 95 to XP. Still, I think it might be a good idea to use IO completion ports where they are available (and keep using events elsewhere). As you said, it might be best to try to eliminate some critical sections and events.

    Thanks again.


Similar Threads

  1. string::operand+ usage problem (with WIN32 data types)
    By flashbaz-pi in forum C++ Programming
    Replies: 17
    Last Post: 11-02-2008, 02:41 PM
  2. POSIX threads problem, probably simple fix
    By parad0x13 in forum C++ Programming
    Replies: 3
    Last Post: 07-23-2008, 08:48 PM
  3. Yet another n00b in pthreads ...
    By dimis in forum C++ Programming
    Replies: 14
    Last Post: 04-07-2008, 12:43 AM
  4. WIN32 Problem!
    By jdude in forum Windows Programming
    Replies: 2
    Last Post: 03-09-2005, 01:03 AM
  5. qt problem on win32
    By epoch in forum C++ Programming
    Replies: 1
    Last Post: 09-01-2002, 10:06 PM