to thread or not to thread -- performance and design issues
I wrote a hash table of Bayer trees (B-trees) to use for fast key/value pair look-ups. I've been testing it on data sets of various sizes against C++ std::map and against a plain B-tree, and I seem to have done a decent job -- the plain B-tree is slightly faster than map on small sets (there's a sweet spot where it's about 30% faster) but ~20% slower on the biggest sets (12 million items, which eats almost all my memory -- map uses the most). The double hashed table of B-trees is faster still than the plain B-tree on small sets and about matches map on the biggest sets.
AFAIK, there is no way to thread single tree operations -- you cannot be rearranging the same tree from two threads simultaneously. But with the hash table, all the trees are independent, so in my original implementation I figured I'd fire off a thread for every operation. This turned out to be real stupid -- the overhead of starting a new thread thousands or millions of times is just too much -- so I moved to a genuine producer-consumer* setup: when the table is initialized, a small fixed number of threads are started, and they then wait on conditions to pick up data to add. Each tree has a lock, but generally two "consecutive" adds will not end up with the same hash value, so they can mostly be done concurrently.
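To make the per-tree locking concrete, here is a minimal sketch, not my actual code: std::map stands in for the B-tree, and the names ShardedTable, add, size and fill_with_threads are made up for illustration. It only shows the idea that an add locks the one bucket its key hashes to, with a small fixed set of threads started once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for the hash table of trees: each bucket holds
// one tree (std::map here, not a real B-tree) plus its own lock.
class ShardedTable {
public:
    explicit ShardedTable(std::size_t buckets) : buckets_(buckets) {
        assert(buckets > 0);
    }

    // Lock only the bucket the key hashes to, so adds that land in
    // different buckets do not block each other.
    void add(const std::string& key, int value) {
        Bucket& b = buckets_[std::hash<std::string>{}(key) % buckets_.size()];
        std::lock_guard<std::mutex> guard(b.lock);
        b.tree[key] = value;
    }

    // Total number of stored items (takes each bucket lock in turn).
    std::size_t size() {
        std::size_t n = 0;
        for (Bucket& b : buckets_) {
            std::lock_guard<std::mutex> guard(b.lock);
            n += b.tree.size();
        }
        return n;
    }

private:
    struct Bucket {
        std::mutex lock;
        std::map<std::string, int> tree;
    };
    std::vector<Bucket> buckets_;
};

// Start a small fixed number of threads once; each adds its own range
// of distinct keys. Returns the number of items stored afterwards.
std::size_t fill_with_threads(ShardedTable& table, int threads, int per_thread) {
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&table, t, per_thread] {
            for (int i = 0; i < per_thread; ++i)
                table.add("key" + std::to_string(t * per_thread + i), i);
        });
    }
    for (std::thread& th : pool) th.join();
    return table.size();
}
```

Something like `ShardedTable t(64); fill_with_threads(t, 4, 1000);` should leave 4000 items in the table. Note that in this sketch each thread generates its own keys, so there is no handoff from a main thread at all; that is a simplification and not how my version works.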
But the threaded producer-consumer version is still twice as slow as the unthreaded version.
Now here's the crux: my assumption was that the heftiest work would be in the actual tree ops.** All the other functions are quite short and sweet -- no mallocs or math, no string ops, just a few assignments and some passing of pointers. The only difference between the threaded and unthreaded versions is the use of a few locks and conditions. So I have to assume this is where the time is being spent: the actual "add" function sets a shared pointer to the data, then calls pthread_cond_signal to wake a thread to add it to the table, then calls pthread_cond_wait. The thread then assigns its own pointer to the data and calls pthread_cond_signal to release the add() function (initially I used a mem-pooled linked-list queue for that, then realized it was pointless -- if the threads, collectively, cannot keep up with the main process, the main process might as well wait, and the list/mempool handling is still more function calls). This means that for each add there are several pthread function calls just to pass the data (no way out of that). AFAICT, these are taking as much (or more) time as the actual tree ops, which negates any advantage concurrency would bestow there.
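For reference, the handoff looks roughly like the sketch below. It is a simplified reconstruction, assuming a single producer, and it uses std::condition_variable in place of the raw pthread calls (notify_one and wait play the roles of pthread_cond_signal and pthread_cond_wait). The names Handoff, put, take, close and run_handoff are hypothetical, and the "tree add" is replaced by summing integers.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical single-slot handoff: the producer publishes one item,
// wakes a worker, then blocks until that worker has taken the item.
class Handoff {
public:
    // Called by the (single) main thread for every add.
    void put(int item) {
        std::unique_lock<std::mutex> lk(m_);
        slot_ = item;
        full_ = true;
        item_ready_.notify_one();                        // like pthread_cond_signal
        slot_free_.wait(lk, [this] { return !full_; });  // like pthread_cond_wait
    }

    // Called by workers; returns false once close() has been called
    // and no item is pending.
    bool take(int& out) {
        std::unique_lock<std::mutex> lk(m_);
        item_ready_.wait(lk, [this] { return full_ || closed_; });
        if (!full_) return false;
        out = slot_;
        full_ = false;
        slot_free_.notify_one();                         // release the producer
        return true;
    }

    // Tell all waiting workers that no more items will arrive.
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        item_ready_.notify_all();
    }

private:
    std::mutex m_;
    std::condition_variable item_ready_, slot_free_;
    int slot_ = 0;
    bool full_ = false, closed_ = false;
};

// Push the items 1..n through the handoff to `workers` threads and
// return the sum of everything the workers received.
long run_handoff(int workers, int n) {
    assert(workers > 0);
    Handoff h;
    std::mutex sum_lock;
    long sum = 0;
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            int item = 0;
            while (h.take(item)) {
                // The real tree add would happen here.
                std::lock_guard<std::mutex> g(sum_lock);
                sum += item;
            }
        });
    }
    for (int i = 1; i <= n; ++i) h.put(i);
    h.close();
    for (std::thread& t : pool) t.join();
    return sum;
}
```

Written out like this, every single add involves a lock, a signal, a blocking wait in the producer, and a wake-up plus another signal in a worker. That is the per-item cost I suspect is eating the gain.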
I'm a little shagged at this point, and don't really mind giving up on the threaded version -- it was interesting, a learning experience, and I did get it to work. But I'd like to get some second opinions on my diagnosis first. It seems to me this is an issue with which people must have experience: obviously, a "dining philosophers" type scenario where multiple threads/processes just increment a counter is for sure going to be slower than a single process incrementing a counter, because incrementing a counter all by itself takes no time at all, whereas all the condition and lock handling necessary for the threaded model will take time, relatively speaking. But I was hoping the tree adds -- which can involve rebalancing, etc. -- would not fall into that category. I guess I'm wrong? At what point would a task be significant enough to make threading this way a performance bonus? There are a couple of other assumptions I've made here that could be incorrect (e.g. "if the threads, collectively, cannot keep up with the main process, the main process might as well wait").
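The counter comparison can be tried directly with something like the sketch below. It is only an illustration with made-up names (count_single, count_locked, time_ms); I am not quoting any timings from it, since they will depend on the machine and compiler flags.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Plain loop: one thread, no synchronization.
long count_single(long n) {
    volatile long counter = 0;  // volatile so the loop is not optimized away
    for (long i = 0; i < n; ++i) counter = counter + 1;
    return counter;
}

// Same total number of increments, split across threads that all
// share one lock around the counter.
long count_locked(int threads, long n) {
    assert(threads > 0);
    long counter = 0;
    std::mutex lock;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        // Spread any remainder over the first few threads.
        long share = n / threads + (t < n % threads ? 1 : 0);
        pool.emplace_back([&counter, &lock, share] {
            for (long i = 0; i < share; ++i) {
                std::lock_guard<std::mutex> g(lock);
                ++counter;
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return counter;
}

// Wall-clock milliseconds taken by f().
template <class F>
double time_ms(F f) {
    auto start = std::chrono::steady_clock::now();
    f();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Comparing `time_ms([] { count_single(10000000); })` with `time_ms([] { count_locked(4, 10000000); })` gives a rough feel for how much the locking alone costs when the work per item is tiny. The same harness could wrap a tree add instead of an increment to see where the balance tips.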
Worst case is I'm totally wrong and my thread design sucks (but of course it doesn't seem that way to me :p).
* NB: up to now, other than little exercises, the only thing I've used threads for is a few concurrent processes, e.g., a GUI controlling something in real time, and not to speed up any kind of processing.
** I've run gprof on it, and it does appear that 90% of the time is spent there, but to be honest I am very dubious of gprof's evaluation, since with the prior threaded model it completely failed to spot the problem -- I presume I'd have to compile pthreads itself appropriately in order to get accurate profiling here.