Thread: No time improvement with multiple threads

  1. #1
    Registered User
    Join Date
    Jan 2016
    Posts
    52

    No time improvement with multiple threads

    I am working on a program that uses 1 or more threads to process a NxN matrix. when I run the program for a 50x50 matrix with 1 thread I get an execution time of about 13000ms. When I process the same matrix with 2 threads I am getting an execution time of about 20000ms.

    Obviously there is something wrong because with 2 thread it should take about half the time, i.e. 6500ms.

    I was able to narrow it down to the loop that iterate through each element of the array:

    Code:
    for (uint32_t i = args->low; i < args->high+1; i++) {
        for (uint32_t j = 1; j < args->size+1; j++) {
            args->write[i][j] = (args->read[i-1][j]+args->read[i+1][j]+args->read[i][j-1]+args->read[i][j+1])*.25;
            dif = fabs(args->write[i][j]-args->read[i][j]);
            if (dif > max_dif) max_dif = dif;
        }
    }
    The way the threads work is that each thread gets assigned a set of rows to work on each thread passes over its section of the array several hundred times. For example, with two threads on a 50x50 matrix, thread 1 would get rows 0-25 and thread 2 would get 26-50. I have confirmed that each thread is being assigned the correct values for args->low and args->high so they are not doing duplicate work. So it makes sense that the program should execute in half the time... but it doesn't.

    Looking at some debug output I was able to see that with two threads each iteration of the loop above was taking 30-70ms. After one thread completes each iteration for the remaining thread drops to about 30ms. And with 1 thread each iteration takes 40-80ms.

    Does anyone know what is happening to cause this?

    The processor on my system is an Intel i5 and the machine is almost completely idle without running my program. So there should be no hardware constraints on getting the expected improvement with 2 threads.

  2. #2
    Registered User
    Join Date
    Jun 2015
    Posts
    1,640
    How large are the elements of the matrix?

    How many cores does your processor have (I assume 4, but just to be sure)?
    Do you know it's cache sizes? (or it's model number)

    What are the timings with a 10x10 matrix?

  3. #3
    Registered User
    Join Date
    Jan 2016
    Posts
    52
    These are float matrices, so 32 bits per index.

    I think it is 4 cores but not completely sure. It is an iMac, i don't know the year though. I am logged in remotely so Im not exactly sure how to get that information.


    The timings with a 10x10 matrix are:
    per iteration: 12-14ms with 1 thread and 40-70ms with 2 threads.
    total execution time: 500ms with 1 thread and 1000ms with 2 threads

    Though with a 10x10 matrix I would expect more threads to increase execution time. So maybe you meant a 100x100 matrix in which case here are the times.
    per iteration: 239-242ms with 1 thread and 600-800ms with 2 threads
    total execution time 80000 ms with 1 thread and 180000 with 2 threads
    Last edited by jdodle; 03-09-2016 at 04:36 PM.

  4. #4
    Registered User
    Join Date
    Jan 2016
    Posts
    52
    After doing some research I was able to get the hardware stats with sysctl:

    hw.cachelinesize = 64
    hw.l1icachesize = 32768
    hw.l1dcachesize = 32768
    hw.l2settings = 1
    hw.l2cachesize = 262144
    hw.l3settings = 1
    hw.l3cachesize = 6291456
    hw.ncpu = 4

  5. #5
    Registered User
    Join Date
    Jun 2015
    Posts
    1,640
    Can you post a link to a zip of the source?

  6. #6
    Registered User
    Join Date
    Jan 2016
    Posts
    52
    I don't want to make the source publicly available (academic integrity), however I can send you the source directly if you would like?

  7. #7
    Registered User
    Join Date
    Jun 2015
    Posts
    1,640
    You could put it up on tinyupload (no need to sign up) and pm me the download link
    and then, say 20 minutes later, use the delete link (make sure you save it).

  8. #8
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    The study of speed-up with multiple threads/processes can be a bit tricky. For example, in published papers, the overhead of thread startup time is typically omitted. "How" you perform you're timing is very important. So how?

    13 to 20 seconds is fairly low. Eg. for what 1 thread requires 2 minutes, 2 threads may only need 1.8 minutes. "Perfect speedup" is very rarely achieved - ie. 1 thread taking 2 minutes, and 2 threads taking 1 minute.

    gg

  9. #9
    Registered User
    Join Date
    Jan 2016
    Posts
    52
    I am fine with that, I am new to this forum, how do pm you?

  10. #10
    Registered User
    Join Date
    Jun 2015
    Posts
    1,640
    Quote Originally Posted by jdodle View Post
    I am fine with that, I am new to this forum, how do pm you?
    You can click "Settings" at the top right, then "Send new message" over on the left somewhere. Send it to my username.

  11. #11
    Registered User
    Join Date
    Jan 2016
    Posts
    52
    There is no "send new message" option for me. I was able to find "PM user" at "quick links" -> "open contacts popup" but it gives me a permission denied error.

  12. #12
    Registered User
    Join Date
    Jan 2016
    Posts
    52
    I am starting the timer just before starting the last thread (just before the last call to pthread_create) and stops just after the last thread ends (just after the last pthread_join). For timing each iteration of the loop I posted I start the timer just before the loop starts and stop it just after the loop ends.

    I am aware that I can not always expect "Perfect speedup" however each thread is given as equal amount of work as possible and each thread can run completely independent of each other thread for each iteration. My code does align the threads after each iteration using semaphores. However this just mean that the total execution time will be a sum of each iterations longest runtime. So as long as each thread is given equal work (or at least very close to equal) then you should see almost "Perfect speedup" when the thread count is small.

  13. #13
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    Add code to time the main loop in each thread, saving the results to be displayed after all threads have completed. Usually when multiple threads are slower, it is due to cache / memory conflicts, but a 10x10 matrix seems too small for this to be an issue.

  14. #14
    Registered User
    Join Date
    Jan 2016
    Posts
    52
    Quote Originally Posted by rcgldr View Post
    Add code to time the main loop in each thread, saving the results to be displayed after all threads have completed. Usually when multiple threads are slower, it is due to cache / memory conflicts, but a 10x10 matrix seems too small for this to be an issue.
    I do have code for this, however I split it up into two sections. One timer is for how long each thread waits to start its next iteration and the second timer times the for loops I posted. the remaining parts of the loop are insignificant in comparison so they are not timed. As I posted in my original question, the timer for the for loops is showing that with 2 threads those loops take just as much time as with 1 thread. with two thread the outer loop loops half as many times thus it is taking twice as long to do the work.

  15. #15
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    >> My code does align the threads after each iteration using semaphores.
    Using a barrier to ensure all threads start "at the same time" is good.

    >> One timer is for how long each thread waits to start its next iteration
    After the start barrier, there shouldn't be any other synchronization if the threads really are completely independent of each other.

    The timer should start when the barrier releases the threads - and stop when all work has been completed. You should also set thread affinity to ensure that they are never scheduled on the same core. You can also make a "yield" call after the barrier (before the timer) to ensure that the thread/timer starts at the beginning of a new quantum.

    gg

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 4
    Last Post: 06-09-2015, 03:21 PM
  2. Need help with multiple threads
    By Cogman in forum Windows Programming
    Replies: 2
    Last Post: 07-05-2009, 09:40 AM
  3. Multiple Threads
    By NuNn in forum C Programming
    Replies: 3
    Last Post: 03-14-2009, 11:29 PM
  4. Replies: 6
    Last Post: 01-09-2007, 04:12 PM
  5. a point of threads to create multiple threads
    By v3dant in forum C Programming
    Replies: 3
    Last Post: 10-06-2004, 09:48 AM

Tags for this Thread