Thread: multithread bug I cannot reproduce

  1. #1
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937

    multithread bug I cannot reproduce

    Hello everyone,
    I recently wrote a small function class that allows a user to perform oft-needed operations between threads using minimal code. In the actual application for which I'm using this class, I have a bug. My efforts to reproduce this bug in other, simpler codes have failed. The essence of the thing, however, seems to be simple; but apparently not.
    You create a instance of this class, sync_func, with the main thread. Then n other threads may call the class operator(), each time feeding it some information. Once the n'th thread has come through (as the former threads wait), the function returns the result to each thread. The last thread cleans up so that it works again the next time.
    I'm using this class to get the sum of a bunch of numbers across threads. Each thread is the operator() of an instance of class task. main() creates a class method, which owns a class sync_func. Each thread has a reference to the method, which allows them to call the sync_func.
    In the below code, it all works. After that is my real code, which does not work. The problem is that the sync_func is returning 'nan' to all host threads, even though no calculation leading to it yields nan. Somewhere between the return statement and the assignment at the calling line, things go awry.

    Any help that can be offered is much appreciated. If you need any more information, let me know.

    The header, sync_func.hpp
    Code:
    #include <boost/thread/mutex.hpp>
    #include <boost/thread/barrier.hpp>
    
    template<class T>
    struct do_nothing { void operator()(T & t) {} }; 
    
    template<class T>           //static variable, thread-specific variable
    struct thread_max {  void operator()(T & stat,const T & thrd) { if(thrd > stat) stat = thrd; }  };
    
    template<class T>
    struct thread_min {  void operator()(T & stat,const T & thrd) { if(thrd < stat) stat = thrd; }  };
    
    template<class T>
    struct thread_sum {  void operator()(T & stat,const T & thrd) { stat += thrd; }  };
    
    template<class T, class T_binary_ref_func, class T_unary_once_func = do_nothing<T> >
    class sync_func
    {
    		T working;
    		const T starting_value;
    		T returnable;
    		T_binary_ref_func func;  //func( working, thread_piece );
    		T_unary_once_func at_end;  //at_end( working );
    		boost::mutex moo;
    		boost::barrier bar;
    		unsigned int n_threads;
    		unsigned int counter;
    	public:
    		sync_func(unsigned int _n_threads, const T & _starting_value,
    				  const T_unary_once_func & _at_end = T_unary_once_func(), const T_binary_ref_func & _func = T_binary_ref_func()) 
    			: bar(_n_threads), starting_value(_starting_value), n_threads(_n_threads), counter(0),
    			   working(_starting_value), at_end(_at_end), func(_func) {}
    		T operator()(const T & thread_piece)
    		{
    			{
    				boost::mutex::scoped_lock lock(moo);
    				func( working, thread_piece );
    				if(++counter == n_threads)
    				{
    					at_end( working );
    					returnable = working;
    					working = starting_value;
    					counter = 0;
    				}
    			}
    			bar.wait();
    			return returnable;
        	}
    };
    My toy program, to demonstrate how I'm using it:
    Code:
    #include <boost/thread.hpp>
    #include <iostream>
    boost::mutex cout_mutex;
    
    struct method
    {
    		typedef sync_func< double, thread_sum<double> > sync_sum;
    		method(unsigned nthreads, boost::barrier & b) : tfun(nthreads, 0), bar(b) {}
    		void do_stuff(unsigned thread_id)
    		{
    			for(int i = 0; i < 2; ++i)
    			{
    				double sum = tfun( 1.0 / (1+thread_id) );
    		  		{
    		  			boost::mutex::scoped_lock lock(cout_mutex);
    		  			std::cout << "Thread " << thread_id << " got a sum of " << sum 
    		  					  << " on iteration " << i << std::endl;
    		  		}
    		  		bar.wait();
    	  		}
    		}
    	private:
    		method(method & m);
    		method(const method & m);
    		sync_sum tfun;
    		boost::barrier & bar;
    };
    
    class task
    {
    		unsigned int thread_id;
    		unsigned int n;
    		method & m;
      	public:
      		task( unsigned int id, unsigned int _n, boost::barrier & b, method & _m )
      			: thread_id(id), n(_n), m(_m) {}
      		void operator()()
      		{
      			for(int i = 0; i < 2; ++i)
      				m.do_stuff(thread_id);
      		}
    };
    
    int main()
    {
    	boost::thread_group tg;
    	unsigned int n;
    	std::cout << "How many threads? ";
    	std::cin >> n;
    	
    	boost::barrier bar(n);
    	typedef sync_func< double, thread_sum<double> > sync_sum;
    	sync_sum tsum( n, 0.0 );
    	
    	method meth(n, bar);
    	
    	for(unsigned int i = 0; i < n; ++i)
    		tg.create_thread( task(i, n, bar, meth) );
    	tg.join_all();
    }
    Example output:
    Code:
    How many threads? 2
    Thread 0 got a sum of 1.5 on iteration 0
    Thread 1 got a sum of 1.5 on iteration 0
    Thread 0 got a sum of 1.5 on iteration 1
    Thread 1 got a sum of 1.5 on iteration 1
    Thread 0 got a sum of 1.5 on iteration 0
    Thread 1 got a sum of 1.5 on iteration 0
    Thread 0 got a sum of 1.5 on iteration 1
    Thread 1 got a sum of 1.5 on iteration 1
    And now, the actual code that fails
    Code:
    int main()
    {
       // . . .
       mode_method* mmeth = new . . .  
       boost::thread_group tg;
       for(unsigned i = 0; i < nthreads; ++i)
           tg.create_thread(simulator(*tmeth,*mmeth,g,lap,tmeth->region(i),tmeth->boundary(i),settings::time_steps,settings::dt) );
       tg.join_all();
       // . . .
    }
    
    // . . . . 
    
    void simulator::operator()() //Here is the simulation
    {
        for(uword i = 0; i < n; ++i) // n not related to nthreads
        {
            // . . . . 
                mmeth.subtract(reg);
             // . . . . .
        }
    }
    
    // . . . . 
    
    class high_mode : public mode_method
    {
          thread_method & tmeth; //actual object is instantiated by main(), just like *this
       public:
          void subtract(grid::region & reg)
          {
               functions::inner_product ip( oldm ); //thread_piece
               newm.iterate_region(reg, ip);
               grid::data_type inner_prod = tmeth.get_sum(ip.value());
               //ip.value() always well-defined but inner_prod always nan
               {
                    boost::mutex::scoped_lock lock(debug_mutex);
                    std::cout << "thread " << reg.front().x << " has ip " << ip.value()
                              << " and total ip " << inner_prod << std::endl;
               }
          }
    };
    
    //. . . .
    
    class multi_thread : public thread_method
    {
       // . . .
           sync_func< grid::data_type, thread_sum<grid::data_type> > sync_sum;
        public:
           grid::data_type get_sum(const grid::data_type & piece) { sync_sum(piece); }
        // . . .
    };
    If you've come this far, you truly are a wild goose-chaser.
    Last edited by CodeMonkey; 06-10-2009 at 01:52 PM.
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  2. #2
    and the Hat of Guessing tabstop's Avatar
    Join Date
    Nov 2007
    Posts
    14,336
    Division by zero is the easy way to get NaN. Your sample code has a division in it, but I don't know if your real code does.

  3. #3
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937
    Good point, evoking a second looking-over. There are no divisions. Let me illustrate the strangeness of the issue:
    If we add a couple of calls to std::cout, we see that two values that should be absolutely equal are in fact not.
    Code:
    T operator()(const T & thread_piece)
    	{
    			{
    				boost::mutex::scoped_lock lock(moo);
    				func( working, thread_piece );
    				if(++counter == n_threads)  //if this is the last thread
    				{
    					at_end( working );
    					returnable = working;
                                            std::cout << "value to return: " << returnable << std::endl;
    					working = starting_value;
    					counter = 0;
    				}
    			}
    			bar.wait(); //bar is a barrier -- once all threads have gotten here, execution proceeds.
    			return returnable;
        	}
    and
    Code:
     void subtract(grid::region & reg)
          {
               functions::inner_product ip( oldm ); //thread_piece
               newm.iterate_region(reg, ip);
               grid::data_type inner_prod = tmeth.get_sum(ip.value()); //leads to invocation of the above function.
                     //grid::data_type is long double. The toy works with long double as well as double.
               {
                    boost::mutex::scoped_lock lock(debug_mutex);
                    std::cout << "What was returned: " << inner_prod << std::endl;
               }
          }
    The first std::cout prints some number, the second prints 'nan' . The fact that, for all I can see, my toy program works the same way, but works, indicates that I either do not understand fully what is going on, or I am looking at the wrong code. Perhaps little help can be offered without more code. Any input is appreciated.
    Last edited by CodeMonkey; 06-10-2009 at 12:32 AM. Reason: comments
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  4. #4

  5. #5
    3735928559
    Join Date
    Mar 2008
    Location
    RTP
    Posts
    838
    Quote Originally Posted by tabstop View Post
    Division by zero is the easy way to get NaN. Your sample code has a division in it, but I don't know if your real code does.
    specifically - dividing 0 by 0

  6. #6
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937
    If, in my real code, I use my old way of doing things, i.e. a plain old function get_sum() -- it works fine. Using sync_func, which works well on it's own, in my program causes the problem.
    That is, if we replace this code:
    Code:
    class multi_thread : public thread_method
    {
       // . . .
           sync_func< grid::data_type, thread_sum<grid::data_type> > sync_sum;
        public:
           grid::data_type get_sum(const grid::data_type & piece) { sync_sum(piece); }
        // . . .
    };
    and instead use this:
    Code:
    class multi_thread : public thread_method
    {
       // . . .
           boost::mutex moo;
           grid::data_type sum;
           grid::data_type sum_ret;
        public:
           grid::data_type get_sum(const grid::data_type & piece)
           {
                  {
                       boost::mutex::scoped_lock lock(moo);
                       sum += piece;
                   }
                   bar.wait();
                   {
                       boost::mutex::scoped_lock lock(moo);
                       if(sum) //not the best logic (suppose sum IS zero), but this isn't the point.
                       {
                           sum_ret = sum;
                           sum = 0;
                       }
                   }
                   return sum_ret;
           }
        // . . .
    };
    Then all of the nan turn into numbers. I'd like to use sync_func, though, because it saves me some code, and in the near future it'll be saving a ton of code.

    In the mean time, I'm screwing around with the toy program in order to see if I can squeeze some nan out of it.

    Thanks for all of your comments so far.

    *EDIT* Thank you for bringing that to memory, Codeplug, but simulator is an object of only references, and the function operator().
    Last edited by CodeMonkey; 06-10-2009 at 02:07 PM. Reason: codeplug
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  7. #7
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937
    I feel so foolish. The problem was that I was not returning the result of sync_func. If you look above you'll see the code:
    Code:
    class multi_thread : public thread_method
    {
       // . . .
           sync_func< grid::data_type, thread_sum<grid::data_type> > sync_sum;
        public:
           grid::data_type get_sum(const grid::data_type & piece) { sync_sum(piece); }
        // . . .
    };
    Which should be
    Code:
    class multi_thread : public thread_method
    {
       // . . .
           sync_func< grid::data_type, thread_sum<grid::data_type> > sync_sum;
        public:
           grid::data_type get_sum(const grid::data_type & piece) { return sync_sum(piece); }
        // . . .
    };
    Why g++ didn't generate a warning (or an error) is beyond me. But now it is solved!
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  8. #8
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Did you actually have -Wall in your compiler command line.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  9. #9
    Kiss the monkey. CodeMonkey's Avatar
    Join Date
    Sep 2001
    Posts
    937
    Doubly foolish. Your signature makes me feel like a statistic.
    Last edited by CodeMonkey; 06-10-2009 at 04:09 PM.
    "If you tell the truth, you don't have to remember anything"
    -Mark Twain

  10. #10
    Registered User
    Join Date
    Aug 2008
    Location
    Belgrade, Serbia
    Posts
    163
    Everyone make mistakes, but the problem is solved now.
    Vanity of vanities, saith the Preacher, vanity of vanities; all is vanity.
    What profit hath a man of all his labour which he taketh under the sun?
    All the rivers run into the sea; yet the sea is not full; unto the place from whence the rivers come, thither they return again.
    For in much wisdom is much grief: and he that increaseth knowledge increaseth sorrow.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. gaks bug?
    By Yarin in forum A Brief History of Cprogramming.com
    Replies: 5
    Last Post: 08-31-2008, 02:47 PM
  2. Debugging a rare / unreproducible bug..
    By g4j31a5 in forum A Brief History of Cprogramming.com
    Replies: 18
    Last Post: 08-05-2008, 12:56 PM
  3. Another link from Microsoft about bug in fread
    By vart in forum A Brief History of Cprogramming.com
    Replies: 2
    Last Post: 05-06-2008, 11:56 AM
  4. ATL bug of CComPtr?
    By George2 in forum Windows Programming
    Replies: 6
    Last Post: 04-07-2008, 07:52 AM
  5. Major Multithread Problem :: MFC
    By kuphryn in forum Windows Programming
    Replies: 1
    Last Post: 05-07-2002, 09:58 PM