Thread: Eliminating false sharing with TBB

  1. #1
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654

    Eliminating false sharing with TBB

    So, as the title says, I'm trying to eliminate false sharing, or, eliminate sharing writes between threads with TBB. The question is how.

    Normally I'd make an array whose size is equal to the number of threads, then locally write to a local variable and update the array only at the end of the thread.

    But, of course, I cannot seem to get either thread id or total number of threads TBB uses. I found a reference to tbb::enumerable_thread_specific, which as I understand, is supposed to work for exactly this. But as soon as I added it, it hurt performance by ~60% instead of making it better.

    Code is below. Anyone more knowledgeable about TBB knows how to do this properly? You don't really need to look so hard on how the algorithm works (I don't know either). I know it's not quite right right now due to race conditions, but I'll fix that later. I used a reference implementation that I copied™, and my task is to parallelize it.

    Oh, and the parts where the problem is right now is in red (of course, it's not all problems; it's only a subset of them).

    Code:
    #include <iostream>
    #include <vector>
    #include <fstream>
    #include <ctime>
    #include <algorithm>
    #include <ppl.h>
    #include <tbb/tbb.h>
    #undef max
    
    void usage(const char* app_name)
    {
    	std::cout << "Argument error! Usage: C++ AMP.exe <input file>\n";
    	exit(0);
    }
    
    void clear(std::vector<int>& vec)
    {
    	std::fill(vec.begin(), vec.end(), 0);
    }
    
    int main(int argc, char** argv)
    {
    	if (argc != 2)
    		usage(argv[0]);
    
    	// Open files
    	std::ifstream InputFile(argv[1]);
    	if (!InputFile)
    		usage(argv[0]);
    
    	// Read the matrix
    	int dim = 0;
    	InputFile >> dim;
    	std::vector<std::vector<int>> mat(dim, std::vector<int>(dim));
    
    	for (decltype(dim) i = 0; i < dim; i++)
    	for (decltype(dim) j = 0; j < dim; j++)
    		InputFile >> mat[i][j];
    
    	auto TimeStart = std::clock();
    
    	std::vector<std::vector<int>> ps(dim, std::vector<int>(dim));
    
    	// Compute vertical prefix sum
    	//for (int j = 0; j < dim; j++)
    	tbb::parallel_for(0, dim, [&mat, &ps, dim](int j)
    	{
    		ps[0][j] = mat[0][j];
    		tbb::parallel_for(1, dim, [&mat, &ps, dim, j](int i)
    		{
    			ps[i][j] = ps[i - 1][j] + mat[i][j];
    		});
    		//for (int i = 1; i < dim; i++)
    	});
    
    	//Auxilliary variables 
    	int max_sum = mat[0][0];
    	int top = 0, left = 0, bottom = 0, right = 0;
    
    	//tbb::parallel_for(0, dim, [&, dim](int i)
    	for (int i = 0; i < dim; i++)
    	{
    		using thread_array_t = tbb::enumerable_thread_specific<int>;
    
    		//thread_array_t local_max_sum = mat[0][0];
    		//thread_array_t local_top = 0, local_left = 0, local_bottom = 0, local_right = 0;
    
    		tbb::parallel_for(i, dim, [&, dim](int k)
    		//for (int k = i; k < dim; k++)
    		{
    			// Kandane over all columns with the i..k rows
    			//clear(sum);
    			//clear(pos);
    			std::vector<int> sum(dim);
    			std::vector<int> pos(dim);
    			int local_max = 0;
    
    			// We keep track of the position of the max value over each Kandane's execution
    			// Notice that we do not keep track of the max value, but only its position
    			sum[0] = ps[k][0] - (i == 0 ? 0 : ps[i - 1][0]);
    
    			const decltype(sum)& const_sum(sum);
    			const decltype(pos)& const_pos(pos);
    			const decltype(ps)& const_ps(ps);
    			thread_array_t local_max_array = 0;
    
    			tbb::parallel_for(1, dim, [&, dim](int j)
    			//for (int j = 1; j<dim; j++)
    			{
    				int local_max_tmp = 0;
    				if (sum[j - 1] > 0)
    				{
    					sum[j] = const_sum[j - 1] + const_ps[k][j] - (i == 0 ? 0 : const_ps[i - 1][j]);
    					pos[j] = const_pos[j - 1];
    				}
    				else
    				{
    					sum[j] = const_ps[k][j] - (i == 0 ? 0 : const_ps[i - 1][j]);
    					pos[j] = j;
    				}
    				if (const_sum[j] > const_sum[local_max])
    					local_max_array.local() = j;
    			}); //Kandane ends here
    
    			for (const auto local_max_tmp : local_max_array)
    				local_max = std::max(local_max, local_max_tmp);
    
    			if (sum[local_max] > max_sum)
    			{
    				// sum[local_max] is the new max value
    				// the corresponding submatrix goes from rows i..k.
    				// and from columns pos[local_max]..local_max
    
    				max_sum = sum[local_max];
    				top = i;
    				left = pos[local_max];
    				bottom = k;
    				right = local_max;
    			}
    		});
    	}
    
    	// Compose the output matrix
    	int outmat_row_dim = bottom - top + 1;
    	int outmat_col_dim = right - left + 1;
    	std::vector<std::vector<int>> outmat(outmat_row_dim, std::vector<int>(outmat_col_dim));
    	for (int i = top, k = 0; i <= bottom; i++, k++)
    	for (int j = left, l = 0; j <= right; j++, l++)
    		outmat[k][l] = mat[i][j];
    
    	auto TimeEnd = std::clock();
    	std::cout << "Took " << (TimeEnd - TimeStart) / (CLOCKS_PER_SEC / 1000) << " ms.\n";
    }
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  2. #2
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    I thought TBB was supposed to be high level so that you never messed with logical threads.

    There is also this article, which suggests that the easiest way to do it is to pass the same instance of data to all the threads, and then make local copies of that data in function that does the work. Then you would update the shared instance just before the function returns.

  3. #3
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by whiteflags View Post
    I thought TBB was supposed to be high level so that you never messed with logical threads.
    Which is precisely the problem.

    There is also this article, which suggests that the easiest way to do it is to pass the same instance of data to all the threads, and then make local copies of that data in function that does the work. Then you would update the shared instance just before the function returns.
    Yes, but how do I do that? I don't know how many threads it will spawn, so there is no way for me to know how much memory to allocate before. To make matters worse, I don't know the id of each thread, so I can't index into the array. It doesn't matter if I use a local variable--I don't know where to save it once finished.

    Take a look at the loop:
    Code:
    tbb::parallel_for(1, dim, [&, dim](int j)
    {
    	int local_max_tmp = 0;
    	if (sum[j - 1] > 0)
    	{
    		sum[j] = const_sum[j - 1] + const_ps[k][j] - (i == 0 ? 0 : const_ps[i - 1][j]);
    		pos[j] = const_pos[j - 1];
    	}
    	else
    	{
    		sum[j] = const_ps[k][j] - (i == 0 ? 0 : const_ps[i - 1][j]);
    		pos[j] = j;
    	}
    	if (const_sum[j] > const_sum[local_max])
    		local_max = j;
    }); //Kandane ends here
    There are two problems:
    One is that there is a race conditions between threads to write to local_max.
    Second is that it's shared between all threads, so we will get false sharing in a tight loop.

    So how does on fix it? Usually, one would declare some array (properly padded):

    std::vector<int> local_max_array(num_threads);

    Then update that in the loop, like so:

    local_max[thread_id] = j;

    But what are num_threads and thread_id? Without them, especially thread_id, this does not work!
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  4. #4
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    I think you want to use parallel_reduce instead of parallel_for. That's the function for doing reduction type operations over a loop, such as finding the maximum value in an array.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  5. #5
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    No, I don't think it will work at all. sum, pos and local_max are all connected. I need the maximum sum, and I can get that from indexing into sum with local_max. But until sum is complete, I cannot make any decisions about what local_max to keep.
    In other words, I must fill in sum, pos and then select the local_max whose sum is the greatest. I can't reduce this, or at least, I cannot see a way.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  6. #6
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    Yes, but how do I do that? I don't know how many threads it will spawn, so there is no way for me to know how much memory to allocate before. To make matters worse, I don't know the id of each thread, so I can't index into the array. It doesn't matter if I use a local variable--I don't know where to save it once finished.
    I guess I'm just confused. So basically, there is nothing that you can pass from main() to the lambda function that would work?

    Code:
    struct ThreadParams
    {
     // For the following 4 variables: 4*4 = 16 bytes
     unsigned long thread_id;
     unsigned long v; //Frequent read/write access variable
     unsigned long start;
     unsigned long end;
    };
    
    void threadFunc(void *parameter) 
    {
     ThreadParams *p = (ThreadParams*) parameter;
     // local copy for read/write access variable
     unsigned long local_v = p->v;
    
     for(local_v = p->start; local_v < p->end; local_v++)
     {
     // Functional computation
     }
    
     p->v = local_v; // Update shared data structure only once
    }
    The way I read this example was -
    a) threadFunc is your lamda
    b) threadParams are the data from the caller that you wanted any and all threads to change
    c) thread_id appears to be inconsequential to threadFunc

    So you would create locals and update the parameter, which must relate to the particular thread.

  7. #7
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    That kind of defeats the whole purpose of using parallel_for in the first place. TBB will split up the work and dynamically schedule it. I don't know how it schedules or divides the work. I don't know the start, the end or the id. I don't pass those parameters because TBB decides them.
    I could do it manually, and I probably will do it in the future, but that would be without TBB. Right now, the focus is on TBB.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  8. #8
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    But for that inner most loop, local_max is a reduction variable. You can still modify sum and pos arrays just like with a parallel for loop in the reduction function, so that part stays the same.

    Unfortunately, looking at the docs, it looks like parallel_reduce doesn't play well with lambdas, so you have to do things the C++03 way.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  9. #9
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Well, let's skip local_max for the time being. It seems I have loop interdependencies, which is bad.
    How about pos and sum? They aren't reduction variables and I still need to eliminate false sharing for those.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  10. #10
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Yeah, pos and sum have false sharing. You could fix that by making each iteration of parallel_for or parallel_reduce be 16 iterations on j.

    But I'd just leave it, and hope the scheduler gives you chunks of indexes that are large enough that it's not a problem.

    You could also leave that inner loop serial.
    Last edited by King Mir; 11-09-2013 at 09:08 PM.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  11. #11
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    You know, despite all its "greatness," working with TBB is a pain in the ass. The inner loop will currently execute 799 times. I don't know if that's enough to make it worth parallelizing, though.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  12. #12
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Yeah, but it's nested in two loops that will also each execute as many times. It's also a relatively trivial operation for each iteration. So maybe not paralleling is the way to go.

    On the other hand, you should learn to use parallel_reduce and be comfortable with reduction variables, because it's a very common pattern. But maybe this isn't the project for it.

    I haven't used TBB itself, but I've used similar tools. IMO, having a simple way to make a simple loop into a parallel algorithm is worthwhile.
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. false sharing confusion!!
    By mynickmynick in forum C Programming
    Replies: 4
    Last Post: 07-21-2008, 02:10 AM
  2. Eliminating dynamic_cast<>
    By skewray in forum C++ Programming
    Replies: 22
    Last Post: 09-14-2007, 04:58 PM
  3. eliminating dos box
    By happy dude 82 in forum Windows Programming
    Replies: 1
    Last Post: 04-16-2003, 04:15 AM
  4. Eliminating duplicates
    By C-Struggler in forum C Programming
    Replies: 3
    Last Post: 03-23-2003, 11:12 PM
  5. eliminating dup values in array
    By Unregistered in forum C Programming
    Replies: 2
    Last Post: 04-19-2002, 06:05 PM