
How to write actually good multithreaded code?

This is a discussion on How to write actually good multithreaded code? within the Tech Board forums, part of the Community Boards category.

  1. #1
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    1,056

    How to write actually good multithreaded code?

    I just got owned by parallelization actually slowing down my code because of stupid memory buses.

    I have a global tree structure. Right now I'm using multiple threads to search the tree but I think because the tree is written in the RAM, it makes multithreading pointless because the memory bus is always saturated by one thread at a time.

    So is it possible to forcefully make a subsection of the tree exist in a processor's cache so as to avoid that dread bus?

    And if threading isn't useful for searching, what do I do with it?

    Like, what are some actual good uses of multithreading?

  2. #2
    Registered User manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    Kolkata@India
    Posts
    2,498
    Quote Originally Posted by MutantJohn View Post
    Right now I'm using multiple threads to search the tree but I think because the tree is written in the RAM, it makes multithreading pointless because the memory bus is always saturated by one thread at a time.
    How did you reach that conclusion?
    Manasij Mukherjee | gcc-4.8.2 @Arch Linux
    Slow and Steady wins the race... if and only if :
    1.None of the other participants are fast and steady.
    2.The fast and unsteady suddenly falls asleep while running !



  3. #3
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    1,056
    Lol a bunch of time benchmarks. Consistently, 1 thread is faster than 4 or 10 or 16.

    I think maybe I was dumb to assume that multithreading my search procedure was a possible place for a speed-up, because in all reality, it isn't at all. In fact, it's a huge hindrance. Or maybe it's my multithreaded allocation routine. Either way, something multithreaded stinks.

    I want to use multiple threads to speed up my code but at the same time, I need to use the cache like never before, as that'll actually speed up my code.

    One thing that threading could do in my case is simplify calculations. If each thread were instead used to do calculations using cache memory, I should see a speed-up.

    I wanna write cache-optimized tree construction code. The tree is on the heap, of course, and we should assume searching the tree simultaneously is a pointless endeavor unless each thread has its own copy which is, hopefully, in turn written to the cache. Otherwise, it's pointless. Even if I did use memcpy(), I'd need it to be a contiguous block as well, or at least that's what I'd want it to be.

    I don't know, I'm tired and disappointed and will think about this in the morning.

  4. #4
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,583
    Quote Originally Posted by MutantJohn View Post
    So is it possible to forcefully make a subsection of the tree exist in a processor's cache so as to avoid that dread bus?
    Yes, it is possible, but you must ensure that this happens. Trees aren't exactly known for being cache friendly, which makes this difficult. You have to somehow limit the part of the tree it searches by imposing cutoff points. I will confess, however, that searching something isn't a good way of utilizing multithreading because it's not cache friendly.
    You can make a copy of the tree and distribute it among threads. Make the whole tree distributed and have the threads handle their own part of the tree (searching, insertion, deletion, etc). This makes it more likely you will avoid false sharing and that the tree will stay in the cache. You can try with a small test project and scale up if it works.

    Quote Originally Posted by MutantJohn View Post
    Like, what are some actual good uses of multithreading?
    It's perfect when you have independent tasks which are cache friendly. Examples involve calculations. For example, parsing and calculating an equation might be a good example as all threads can work with a chunk of the equation and simplify it. Then you finally reduce the result.
    I managed to parallelize an application that generated a random data input file (2-3x speedup on a quad core machine). The threads would all calculate random numbers and add them into a string which is later written to file. There is a practical example.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  5. #5
    Registered User Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    650
    Quote Originally Posted by Elysia View Post
    The threads would all calculate random numbers and add them into a string which is later written to file. There is a practical example.
    How were you calculating the random numbers?

  6. #6
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,583
    std::mt19937 engine;
    std::uniform_int_distribution<int> dist(RangeMin, RangeMax);
    auto Val = dist(engine);

  7. #7
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    1,056
    I have one important question. Are things only on the stack put into the cache? Or are allocations on the heap also put in there as well?

  8. #8
    Registered User manasij7479's Avatar
    Join Date
    Feb 2011
    Location
    Kolkata@India
    Posts
    2,498
    Quote Originally Posted by MutantJohn View Post
    I have one important question. Are things only on the stack put into the cache? Or are allocations on the heap also put in there as well?
    The cache is managed by hardware.
    Heap and stack are more or less fabrications by the OS, assisted a little by the hardware.
    So, for every memory reference your code makes, the CPU tries to cache it, either on demand or by prediction.

  9. #9
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    1,056
    Elysia, I'm bad at multithreading. Would you be willing to post the code you mentioned? Or perhaps you have an example of code where multithreading is actually a significant speed-up?

  10. #10
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    1,056
    Alright, I have this code which takes an array and uses threads to compute its total sum. I break it up into regions based on the number of threads, but it's not exactly that well done. Is there any way I could optimize this code? It's C++, btw.
    Code:
    #include <iostream>
    #include <thread>
    #include <mutex>
    #include <vector>

    using namespace std;

    mutex lk;

    void add(float *data, int start, int end, float *sum) {
       float local_sum = 0;
       for (int i = start; i <= end; ++i) {
          local_sum += (data[i]/3)*.0012+3/1.1237;
       }
       // merge the per-thread result under the lock
       lk.lock();
       *sum += local_sum;
       lk.unlock();
    }

    int main(void) {
       int size = 640000;
       float *data = new float[size];
       for (int i=0; i < size; ++i) {
          data[i] = i;
       }

    // define number of threads, nt
       int nt = 4;

       vector<thread> pool;
       pool.reserve(nt);

       float sum = 0;
       for (int i=0; i < nt; ++i) {  // loop to nt, not pool.capacity(), which may be larger
          int start = i*(size/nt);
          int end = start + (size/nt) - 1;
          pool.push_back(thread(add, data, start, end, &sum));
       }

       for (int i=0; i < nt; ++i)
          pool[i].join();

       cout << "Final sum is : " << sum << endl;

       delete[] data;
       return 0;
    }

  11. #11
    Registered User
    Join Date
    Apr 2013
    Posts
    1,283
    Quote Originally Posted by MutantJohn View Post
    It makes multithreading pointless because the memory bus is always saturated by one thread at a time.
    There is only one memory bus, and on many multi-core processors, there is only one outer level of cache. It's difficult to optimize for cache unless you know the specifics of how the layers of cache are implemented (n-way, fully associative, ...), plus the translation lookaside buffer / cache used to speed up the mapping of virtual addresses into physical addresses. Most memory-bandwidth-limited algorithms work best if the accesses are somewhat sequential, and multithreading will typically create more random-like accesses, resulting in slower overall performance in such cases.

    Quote Originally Posted by MutantJohn View Post
    Like, what are some actual good uses of multithreading?
    When an algorithm is CPU bound and/or I/O bound (usually with multiple I/O devices as opposed to a single shared device). I created an example Windows program that copies a file, but it's unlikely to actually improve performance unless the destination file is on a different disk drive than the source file. Note this example is Windows specific, using Windows mutexes and semaphores to support a FIFO linked-list messaging system used to communicate and synchronize between threads. It creates one additional thread, using one thread to read the file and the other thread to write it. WaitForMultipleObjects() eliminates priority issues between threads.

    http://rcgldr.net/misc/mtcopy.zip

    The multi-threaded embedded code on computer peripheral devices that I've worked with use a similar messaging scheme.
    Last edited by rcgldr; 11-25-2013 at 02:57 PM.

  12. #12
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    1,056
    I have one more very important question. Are graphics cards limited to only having one memory bus? Or are they different?

  13. #13
    Registered User
    Join Date
    Apr 2013
    Posts
    1,283
    Quote Originally Posted by MutantJohn View Post
    I have one more very important question. Are graphics cards limited to only having one memory bus? Or are they different?
    Only one external memory bus, but similar to the SSE instructions, they have the ability to perform some operations on an array of data in parallel. Some older video cards optimized reading video card memory for display updates, while also writing video card memory from the computer, but I don't know if or how this is implemented on current video cards.

    Assuming a video card on the PCI bus, that would be independent of the main memory bus and also other PCI slots, depending on the motherboard.
    Last edited by rcgldr; 11-25-2013 at 06:06 PM.

  14. #14
    C++まいる!Cをこわせ! Elysia's Avatar
    Join Date
    Oct 2007
    Posts
    22,583
    Quote Originally Posted by MutantJohn View Post
    I have one important question. Are things only on the stack put into the cache? Or are allocations on the heap also put in there as well?
    Everything in your memory ends up in the cache. That usually means (in PC systems) your stack, heap, BSS, data segment, whatever. Everything. The cache is managed by the hardware. It is weird you do not know this since you are familiar with C. Oh well, guess one doesn't know everything, huh?
    Two other things to consider too:
    - You don't have control over cache (the hardware manages it, period), and
    - Your program is not the only one whose contents will reside in the cache, so don't be greedy.

    Quote Originally Posted by MutantJohn View Post
    I have one more very important question. Are graphics cards limited to only having one memory bus? Or are they different?
    They are sort of different. All graphics cards have onboard high-speed memory, built specifically for the graphics card, which requires very high bandwidth.
    But graphics cards also need to access system memory, and do so via the (nowadays) PCI Express bus, which they (obviously) have to compete for. Main memory is nowhere near as fast as graphics memory, of course.

    Quote Originally Posted by MutantJohn View Post
    Elysia, I'm bad at multithreading. Would you be willing to post the code you mentioned? Or perhaps you have an example of code where multithreading is actually a significant speed-up?
    Here is my code, if you still want it:
    Code:
    #include <cmath>
    #include <ctime>
    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <thread>
    #include <vector>
    #include <type_traits>
    #include <assert.h>
    #include <memory.h>
    #include <random>
    #include <iterator>
    
    const int CacheLineSize = 64;
    
    struct Buffer
    {
    	Buffer(unsigned int Threads, unsigned int MatrixSize):
    		BufferPos(0)
    	{
    		StreamBuffer.resize(MatrixSize * MatrixSize * 10 / Threads); 
    	}
    
    	std::vector<char> StreamBuffer;
    	size_t BufferPos;
    };
    	
    template<typename T>
    void ConvertStrInPlace(Buffer& _Buffer, T Number)
    {
    	static_assert(sizeof(Number) <= 8, "Integer size must be at most 8 bytes. Algorithm not designed to handle bigger integers.");
    	static_assert(std::is_integral<T>::value, "Number must be an integer.");
    
    	bool IsNegative = Number < 0;
    	if (IsNegative)
    		Number = -Number;
    	
    	assert(_Buffer.StreamBuffer.size() - _Buffer.BufferPos >= 30); // Make sure there is room for at least 30 characters
    	char* Buffer = &_Buffer.StreamBuffer[_Buffer.BufferPos];
    	int CurrentLength = 0;
    
    	if (Number == 0)
    	{
    		Buffer[0] = '0';
    		CurrentLength++;
    	}
    	
    	while (Number != 0)
    	{
    		char c = (Number % 10) + '0';
    		memmove(&Buffer[1], Buffer, CurrentLength);
    		Buffer[0] = c;
    		Number /= 10;
    		CurrentLength++;
    	}
    
    	if (IsNegative)
    	{
    		memmove(&Buffer[1], Buffer, CurrentLength);
    		Buffer[0] = '-';
    		CurrentLength++;
    	}
    	
    	_Buffer.BufferPos += CurrentLength;
    }
    
    void CopyStr(Buffer& _Buffer, const char* String)
    {
    	while (*String)
    		_Buffer.StreamBuffer[_Buffer.BufferPos++] = *String++;
    }
    
    int main(int argc, char** argv)
    {
    	if (argc < 2)
    	{
    		std::cout << "Syntax: <Program> <Integer>\nWhere <Integer> is an integer whose value is the NxN size of the matrix to write to the output file.\n";
    		return 1;
    	}
    
    	std::srand((unsigned int)std::time(nullptr));
    	std::ofstream OutFile("Output.txt");
    	if (!OutFile)
    	{
    		std::cout << "Unable to open outfile Output.txt. Make sure you have the appropriate permissions to write to the current directory.\n";
    		return 1;
    	}
    
    	int Tmp2;
    	std::stringstream Tmp(argv[1]);
    	if (!(Tmp >> Tmp2) || Tmp2 <= 0)
    	{
    		std::cout << "Invalid matrix size. The matrix size must be an integer and greater than 0.\n";
    		return 1;
    	}
    
    	unsigned int MatrixSize = static_cast<unsigned int>(Tmp2);
    	const auto NumThreads = std::thread::hardware_concurrency();
    	const auto ThreadSlice = MatrixSize / NumThreads;
    	const int RangeMax = 32000;
    	const int RangeMin = -32000;
        std::mt19937 engine;
        std::uniform_int_distribution<int> dist(RangeMin, RangeMax);
    
    	std::vector<std::thread> Threads;
    	std::vector<Buffer> StringBuffer;
    	for (unsigned int i = 0; i < NumThreads; i++)
    		StringBuffer.emplace_back(NumThreads, MatrixSize);
    
    	auto Time1 = std::clock();
    
    	ConvertStrInPlace(StringBuffer[0], MatrixSize);
    	CopyStr(StringBuffer[0], "\n");
    	for (unsigned int Thread = 0; Thread < NumThreads; Thread++)
    	{
    		Threads.emplace_back([Thread, MatrixSize, ThreadSlice, &StringBuffer, RangeMin, RangeMax]
    		{
    			// Give each thread its own engine: sharing one mt19937 across
    			// threads is a data race. Seeding with the thread index keeps
    			// the sequences distinct.
    			std::mt19937 engine(Thread);
    			std::uniform_int_distribution<int> dist(RangeMin, RangeMax);
    			auto Start = ThreadSlice * Thread;
    			auto End = (ThreadSlice * (Thread + 1) <= MatrixSize ? ThreadSlice * (Thread + 1) : MatrixSize);
    			for (decltype(Start) i = Start; i < End; i++)
    			{
    				ConvertStrInPlace(StringBuffer[Thread], i);
    				CopyStr(StringBuffer[Thread], ": ");
    				for (decltype(MatrixSize) j = 0; j < MatrixSize; j++)
    				{
    					auto Val = dist(engine);
    					auto & Buffer = StringBuffer[Thread];
    					ConvertStrInPlace(Buffer, Val);
    					CopyStr(StringBuffer[Thread], " ");
    				}
    				CopyStr(StringBuffer[Thread], "\n");
    			}
    		});
    	}
    
    	for (auto & Thread : Threads)
    		Thread.join();
    
    	auto Time2 = std::clock();
    	std::cout << "Took " << (Time2 - Time1) / (CLOCKS_PER_SEC / 1000) << " ms.\n";
    
    	Threads.clear();
    	Time1 = std::clock();
    	for (const auto & Buffer : StringBuffer)
    		OutFile.write(&Buffer.StreamBuffer[0], Buffer.BufferPos);
    
    	Time2 = std::clock();
    	std::cout << "Took " << (Time2 - Time1) / (CLOCKS_PER_SEC / 1000) << " ms.\n";
    }
    I had to make my own string functions, since the standard library's strings slowed things down considerably.

  15. #15
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    1,056
    Elysia, in your code, if I change NumThreads to 1, should I expect a speed-up or just fewer elements explaining a speed-up? I'm too tired to look in depth at the code right now.
