Thread: Multiple thread for loops

  1. #1
    Registered User
    Join Date
    Jan 2009
    Posts
    159

    Multiple thread for loops

    Hi,
    It is the first time I apply multi-threading to my program. However the time performance is not that much impressive: 57 minutes before VS 42 minutes after applying multi-threading.

    I am using pthread library, dividing evenly the range of the outest loop of my embeded multiple loops and assigning each sub-range for each thread.

    Within the embeded loops, the operations that originally took up to 70% of time before applying multi-threading are not those modifying shared memory between threads and therefore no need to lock. So the operations that I actually lock probably don't change the time much.

    I create 8 threads in hope that this could get best performance on my 8-CPU server. I was hoping the time could drop to 1/8, but it turns out not even achieving half percent. Is this normal or am I not making right use of multi-threading to its fullest power?

    Thanks!
    Last edited by lehe; 03-28-2009 at 07:40 PM.

  2. #2
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    I'm not going to go way into this, but let me just say that threads do not make programs run any faster (on the contrary, they can actually add significant overhead). They simply allow you to run more tasks at once, but less of each. Factor in the complexity of scheduling, saving state information during task switches, etc, and it becomes clear how this can lead to perfomance degradation. I recommend using the fewest number of threads possible, and just make sure resource synchronization is being done efficiently (ie: no sooner than necessary).
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  3. #3
    Registered User
    Join Date
    Jan 2009
    Posts
    159
    Thank you, Sebastiani!
    Specifically, how do you "make sure resource synchronization is being done efficiently (ie: no sooner than necessary)"? What do you mean by " resource synchronization"? Thanks!

  4. #4
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Lock resources as late as possible, and unlock them as early as possible, because when they are locked, other threads may have to wait for it to be unlocked.

  5. #5
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Quote Originally Posted by lehe View Post
    Hi,
    It is the first time I apply multi-threading to my program. However the time performance is not that much impressive: 57 minutes before VS 42 minutes after applying multi-threading.

    I am using pthread library, dividing evenly the range of the outest loop of my embeded multiple loops and assigning each sub-range for each thread.

    Within the embeded loops, the operations that originally took up to 70% of time before applying multi-threading are not those modifying shared memory between threads and therefore no need to lock. So the operations that I actually lock probably don't change the time much.

    I create 8 threads in hope that this could get best performance on my 8-CPU server. I was hoping the time could drop to 1/8, but it turns out not even achieving half percent. Is this normal or am I not making right use of multi-threading to its fullest power?

    Thanks!
    Multithreading can change the performance bottlenecks. Make sure that the parallelization leads to more time doing necessary computation, instead of managing what each thread is supposed to do, or waiting on locks.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  6. #6
    Registered User
    Join Date
    Jan 2009
    Posts
    159
    Thanks!

    1. My intensive computation is in the embeded loops which I target for multi-threading. Since I simply divide the range of the outest loop evenly and assigned each subrange to each thread, there should not be that much time on managing what each thread is supposed to do. Right?
    Plus although the computation in each thread is not exactly the same, the actual time each thread take does not vary much (in seconds)
    Total time for thread 5: 2203
    Total time for thread 2: 2221
    Total time for thread 6: 2235
    Total time for thread 4: 2250
    Total time for thread 1: 2264
    Total time for thread 3: 2264
    Total time for thread 0: 2364
    Total time for thread 7: 2366
    But I agree the planning on creating threads is very important. If the time could be more similar between threads, it would be faster. So is there possibly some tip or tool can help in this aspect?

    2. As to waiting on locks, which one is faster, locking each variable individually or all together? For example, in my case I lock all shared variables together :
    Code:
               for ... // range of this outest loop is divided evenly and each subrange is assigned to each thread
                    ...
                    pthread_mutex_lock( &count_mutex );
                    *sum_1 += exp(log_prob_t_1); // *sum_1 and *sum_0 are shared between threads
                    *sum_0 += exp(log_prob_t_0);
                    for(int i=0; i<nb_vignettes; i++){
                        prob_img_1[i] += exp(log_prob_img_t[i] + log_prob_t_1); // array prob_img_1 and prob_img_0 are shared between threads
                        prob_img_0[i] += exp(log_prob_img_t[i] + log_prob_t_0);
                    }
                    pthread_mutex_unlock( &count_mutex );
    Will it be more efficient, if I lock each variable individually?
    Code:
               for ...
                    ...
                    pthread_mutex_lock( &count_mutex );
                    *sum_1 += exp(log_prob_t_1);
                    pthread_mutex_unlock( &count_mutex );
    
                    pthread_mutex_lock( &count_mutex );
                    *sum_0 += exp(log_prob_t_0);
                    pthread_mutex_unlock( &count_mutex );
    
                     for(int i=0; i<nb_vignettes; i++){
    
                        pthread_mutex_lock( &count_mutex );                    
                        prob_img_1[i] += exp(log_prob_img_t[i] + log_prob_t_1);
                        pthread_mutex_unlock( &count_mutex );
    
                        pthread_mutex_lock( &count_mutex );
                        prob_img_0[i] += exp(log_prob_img_t[i] + log_prob_t_0);
                        pthread_mutex_unlock( &count_mutex );
                    }
    Last edited by lehe; 03-28-2009 at 10:40 PM.

  7. #7
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Code:
               for ... // range of this outest loop is divided evenly and each subrange is assigned to each thread
                    ...
                    pthread_mutex_lock( &count_mutex );
                    *sum_1 += exp(log_prob_t_1); // *sum_1 and *sum_0 are shared between threads
                    *sum_0 += exp(log_prob_t_0);
                    for(int i=0; i<nb_vignettes; i++){
                        prob_img_1[i] += exp(log_prob_img_t[i] + log_prob_t_1); // array prob_img_1 and prob_img_0 are shared between threads
                        prob_img_0[i] += exp(log_prob_img_t[i] + log_prob_t_0);
                    }
                    pthread_mutex_unlock( &count_mutex );
    Are you sure the majority of the time is NOT spent in the locked sections?

    Will it be more efficient, if I lock each variable individually?
    That would depend on the relative expensiveness of lock/unlock and exp, and I don't know much about that...

    I see one thing to try, though.
    Code:
               for ...
                    ...
                    double temp_1 = exp(log_prob_t_1);
                    double temp_2 = exp(log_prob_t_0);
    
                    pthread_mutex_lock( &count_mutex );
                    *sum_1 += temp_1;
                    *sum_0 += temp_2;
                    pthread_mutex_unlock( &count_mutex );
    
                     for(int i=0; i<nb_vignettes; i++){
    
                        double temp3 = exp(log_prob_img_t[i] + log_prob_t_1);
                        double temp4 = exp(log_prob_img_t[i] + log_prob_t_0);
    
                        pthread_mutex_lock( &count_mutex );                    
                        prob_img_1[i] += temp3;
                        prob_img_0[i] += temp4;
                        pthread_mutex_unlock( &count_mutex );
                    }
    Taking the exp calls out of the locked region.

    I'm not sure how expensive exp is, though, so I don't know if it will help noticeably. It should at least not make it slower, though.

  8. #8
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Also, how big are those arrays? Is it possible that your program is memory-bound (threads waiting for cache-misses to be fulfilled)?

    Multithreading won't magically increase your memory bandwidth. Having more threads starving for memory doesn't really help...

    Do your CPUs have shared cache?

  9. #9
    Registered User
    Join Date
    Mar 2009
    Posts
    3
    You should try to remove the variables that are shared between the threads. Is it possible to give each thread their own sums and combine them after the threads are done? Locking and unlocking mutexs in a tight loop will eat all the performance gains. Also, the compiler can't make some optimizations in the loop because of the mutexs.

  10. #10
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by Sebastiani View Post
    I'm not going to go way into this, but let me just say that threads do not make programs run any faster (on the contrary, they can actually add significant overhead). They simply allow you to run more tasks at once, but less of each.
    You are confused. Multithreading on a single processor system will not increase throughput, but on a multi-core system it most definately will.

    As to the OP, it may have to do with what you are doing in the threaded section. and your process may not be compute bound, it could be I/O bound. If your memory isn't fast enough for the system tha will bottleneck your application. Generally you want your memory to be 1/8 as fast as the sum of the clock rates of all the procesors in the system. Something I doubt very much you have on an 8 core system. i.e. if you have 8 2.3 GHz processors, you need in theory 2.3GHz memory to reach maximum throughput. (2.3GHz x 8 cpus / 8). You can try using fewer threads, if you get most fo the benefit from 2 threads, then this is most likeyl the case. There can also be resource contention issues, If thread 1 is accessing say Data[0] and thread 2 is accessign Data[1] there willbe cache thrashing if Data[0] and [1] are in teh same cache line, which is almost certainly the case. It is much better to break teh processing up into large blocks, so that say you have to process Data[0] - [255], Thread 1 processes [0] - [127] while thread 2 processes [128]-[255]. This will improve performance greatly. Especially with 8 threads, you are engaging in enourmous amounts of cache thrashng using the former method, and nearly none in the one I presented.

  11. #11
    Malum in se abachler's Avatar
    Join Date
    Apr 2007
    Posts
    3,195
    Quote Originally Posted by cyberfish View Post
    I see one thing to try, though.
    Code:
               for ...
                    ...
                    double temp_1 = exp(log_prob_t_1);
                    double temp_2 = exp(log_prob_t_0);
     
                    pthread_mutex_lock( &count_mutex );
                    *sum_1 += temp_1;
                    *sum_0 += temp_2;
                    pthread_mutex_unlock( &count_mutex );
     
                     for(int i=0; i<nb_vignettes; i++){
     
                        double temp3 = exp(log_prob_img_t[i] + log_prob_t_1);
                        double temp4 = exp(log_prob_img_t[i] + log_prob_t_0);
     
                        pthread_mutex_lock( &count_mutex );                    
                        prob_img_1[i] += temp3;
                        prob_img_0[i] += temp4;
                        pthread_mutex_unlock( &count_mutex );
                    }
    Really innefficient using the same lock for two different arrays, each array shoudl have its own lock, so that seperate threads cn access the other when ti isnt in use. Use something like count_mutex_sum and count_mutex_prob_image instead of count_mutex for both sections. You also don't gain anything by locking inside the inner loop, since the lock/unlock will most certainly take longer than the 2 addition operations. You are better off locking for the entire inner loop and unlocking after the inner loop. And while we are at it get rid of the superfluous local variable allocation.

    Code:
               for ...
                    ...
     
                    pthread_mutex_lock( &count_mutex );
                    *sum_1 += exp(log_prob_t_1);
                    *sum_0 += exp(log_prob_t_0);
                    pthread_mutex_unlock( &count_mutex );
                
                     pthread_mutex_lock( &count_mutex_prob_image );                    
                     for(int i=0; i<nb_vignettes; i++){
                        prob_img_1[i] += exp(log_prob_img_t[i] + log_prob_t_1);
                        prob_img_0[i] += exp(log_prob_img_t[i] + log_prob_t_0);
                        }
                     pthread_mutex_unlock( &count_mutex_prob_image );
    And we now see the issue. That your data is not truly parallelized, since this routinwe will at MOST allow two threads to be executing at once, the others will be waiting on locks. So the goal is you need to break up that inner loop.
    Last edited by abachler; 03-29-2009 at 09:57 AM.

  12. #12
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    You are confused. Multithreading on a single processor system will not increase throughput, but on a multi-core system it most definately will.
    My point was simply that it does not *inherently* increase throughput. If you design a program that creates 10,000 threads, it will certainly run faster on a multicore system, but the fact remains that will almost definitely run slower than if the program had been designed to use less threads. Does that make sense?
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  13. #13
    Registered User
    Join Date
    Jan 2009
    Posts
    159
    cyberfish, abachler and soupbone, thank you so much for these suggestions!

    I finally made it to profile my multithreading program (with gprof and gprof-helper just learned). The results show that the operations modifying common variables, ie lying between lock and unlock, only take up to 11% of time. There is another code segment outside the lock part within the same loop that spends 78% of time, which confirms my guess in the original post. I think it is wiser to avoid locking those most time-consuming operations making other threads waiting for long time, is it? This is exactly what I did when applying multithreading. Here is what the loops are like (sorry, I should've posted this earlier.):
    Code:
    for(every possible tmplt){// range of this outest loop is divided evenly and each subrange is assigned to each thread. 
    
          // tmplt is of type Vignette*, where Vignette is a class
          generate the tmplt // pseduo-code, since the real code is too long. 
    
          /* segment that takes 78% of time */
          for(int k = 0; k < Vignette_size ; k++) { //8.29% and 6.18% of time. Vignette_size =   24 * 24. 
                if(tmplt->content[k] == 0){ //8.37% of time. tmplt->content is a 1D int array of 24*24
                    for(int i = 0; i < nb_vignettes; i++){ //  5.14% and 4.28% of time
                          //_xi is a precomputed 1D double array of size 256. 
                          double tmp = _xi[vignettes[i].content[k]]; // 24.41% of time.  
                          log_prob_img_t[i] += tmp;} // 20.58% of time
                     }
                }
           }
    
            ...
    
          /* lock part for multithreading taking up to 11% of time */
          pthread_mutex_lock( &count_mutex );
          *sum_1 += exp(log_prob_t_1); //5.05% of time *sum_1 and *sum_0 are shared between threads
          *sum_0 += exp(log_prob_t_0);
          for(int i=0; i<nb_vignettes; i++){ // nb_vignettes is set to be 10, size of two shared arrays prob_img_1 and prob_img_0
               //array prob_img_1 and prob_img_0 are shared between threads
               prob_img_1[i] += exp(log_prob_img_t[i] + log_prob_t_1); //3.20% of time. 
               prob_img_0[i] += exp(log_prob_img_t[i] + log_prob_t_0);//2.09 of time
          }
          pthread_mutex_unlock( &count_mutex );
          ...
    }


    Those operations taking 78% of time only read some common memory and have no need to write it. If multiple threads can read the same memory simultaneously without contension and does not involve waiting for each other, then why the total time only drops from 57min under single-thread to 42min under 8 threads? My code has no writing and reading from hard disk, all the data is generated by the code.

    As far as the locking part is concerned, I would try your suggestions. The two common double arrays prob_img_1 and prob_img_0 are just of size 10, relatively small (but I will have to try much larger number like 100000, but it seems like forever now). How to tell if a program is "memory-bound (threads waiting for cache-misses to be fulfilled)" and if my server's CPUs have shared cache (if they do, will that alleviate the contention for memory problem)?

    abachler:
    do you mean I need "in theory 2.3GHz memory" for each of the 8 CPUs and also my server may not actually have 8 CPUs? Here is the info of my server from top:
    top - 14:59:06 up 104 days, 1:24, 147 users, load average: 9.21, 7.98, 5.18
    Tasks: 339 total, 3 running, 335 sleeping, 1 stopped, 0 zombie
    Cpu0 : 64.9% us, 23.1% sy, 0.0% ni, 11.7% id, 0.0% wa, 0.0% hi, 0.3% si
    Cpu1 : 100.0% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu2 : 99.7% us, 0.3% sy, 0.0% ni, 0.0% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu3 : 66.8% us, 24.3% sy, 0.0% ni, 9.0% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu4 : 65.3% us, 24.7% sy, 0.0% ni, 10.0% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu5 : 65.7% us, 25.4% sy, 0.0% ni, 8.9% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu6 : 64.7% us, 23.1% sy, 0.0% ni, 12.2% id, 0.0% wa, 0.0% hi, 0.0% si
    Cpu7 : 65.2% us, 27.5% sy, 0.0% ni, 7.3% id, 0.0% wa, 0.0% hi, 0.0% si
    Mem: 32958132k total, 16955880k used, 16002252k free, 390320k buffers
    Swap: 32764556k total, 11453788k used, 21310768k free, 12155396k cached
    Last edited by lehe; 03-29-2009 at 01:04 PM.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. user thread library
    By Eran in forum C Programming
    Replies: 4
    Last Post: 06-17-2008, 01:44 AM
  2. Question about Multiple threads and ring buffer.
    By qingxing2005 in forum C Programming
    Replies: 2
    Last Post: 01-15-2007, 12:30 AM
  3. Calling a Thread with a Function Pointer.
    By ScrollMaster in forum Windows Programming
    Replies: 6
    Last Post: 06-10-2006, 08:56 AM
  4. Problem : Threads WILL NOT DIE!!
    By hanhao in forum C++ Programming
    Replies: 2
    Last Post: 04-16-2004, 01:37 PM
  5. MFC Controls and Thread Safety :: MFC
    By kuphryn in forum Windows Programming
    Replies: 0
    Last Post: 12-06-2002, 11:36 AM