I've written a bayer tree implementation in C for storing key-value pairs (etc). It works pretty well -- eg. it's faster than std::map.
My intention was to then take this and use it in a hash table, such that the hash table would be an array of bayer trees. This allows you to tweak a couple of elements to suit the size of the data set (the array size and the m-level of the trees). When I was writing that, it also occurred to me that since the trees are independent of one another, operations on them could be threaded. So I sort of doubled hashed the table into "realms", where each realm would represent a portion of the whole, and operations in that realm would be done with a single thread. In other words, if you initialized the total table size to 200 and with 200 realms, you'd have one thread for each table (but, nb, the "double hashing" would be irrelevant). Alternately, you could set the total size to 200 with only 10 realms -- as long as the total can be evenly divided into the number of realms, it's all good. While obviously each thread would not get it's own processor, the fewer trees per thread, the better the chance that "consecutive" operations could occur simultaneously, the next one starting before the last one finishes -- whereas with fewer threads, you have more trees per thread and a higher chance that two consecutive operations will take place in the same realm (with no chance of concurrency).
That seemed pretty clever and I got it all done up to play around and see what settings worked best. Varying the number of realms seems to have little effect (unlike the m-level of the trees). Even worse, the whole table scheme, despite seeming fairly simple, is orders of magnitude slower than just the single bayer tree. So I applied gprof, and here's where I'm stuck:
Gprof says 95% of the time is spent in the bayer tree functions, not the hash table. For example, building a structure of 80,000 items requires 80,000 bayer tree "add" calls, whether that's one tree or a whole table of them. But using a single tree, those 80,000 adds take < 0.2 seconds. Using a hash table of trees, there are still only 80,000 "add" calls, but they take 1.5 seconds. The hash table "add" function encapsulates the bayer tree add, so I had figured there was a bottleneck there, but apparently not -- it only takes 5% of the time. I had assumed the slight overhead of the encapsulation would be compensated for because the tree adds would be on much smaller simpler trees, thus faster, but instead it seems they are taking longer???
Something doesn't make sense there. Anyone got any gprof tips or other ideas about how to get to the bottom of this? Like alternative profiling tools to confirm or deny this strange conclusion?
Looks like this was my somewhat absurd threading scheme, altho I'm still upset that the grof output doesn't reflect the problem -- the threads are all created in the hash table code, removing them solved the speed problem.