Hmm, okay, well to clarify, the testing worked like this:
- add N items to the structure (these are read in from a randomized file)
- delete N/2 items from a separate randomized list (which is also a look-up, obviously)
- produce a sorted (in order traversal) list of all remaining items
The last test was mostly to ensure that the structure was working properly (that the things that were supposed to be deleted were deleted, that the things that weren't weren't, and that what remains can be properly traversed to produce a sorted list), so I have not actually timed this against std::map yet.
Also, I haven't written a delete for the hash table version yet (so the sort test was just a sort test of all items -- each tree maintains a first/next pointer kept sorted in a master linked list to facilitate this), but that should be very simple since the delete for the b-tree has been tested. So the only timing test I've done with the threaded hash table is the add test.
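The sorted-list check at the end is just a linear walk of that master list. A minimal sketch of the idea -- the node layout here is a hypothetical stand-in, not the actual bhasht structures:

```c
#include <string.h>

/* Hypothetical singly linked node, standing in for the master
 * first/next list the trees maintain. */
typedef struct node {
    const char *val;
    struct node *next;
} node;

/* Returns 1 if the list is in non-decreasing strcmp order, else 0. */
static int list_is_sorted(const node *n) {
    for (; n && n->next; n = n->next)
        if (strcmp(n->val, n->next->val) > 0)
            return 0;
    return 1;
}
```

If any item that should have been deleted is still present, or the threading scrambled an insert, the scan (or a membership check alongside it) catches it.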
Originally Posted by iMalc
I imagine you have a loop that fills up the tree and then a loop that performs lookups etc right?
Yeah, exactly, except the look-up was also a delete. WRT a plain look-up, I was probably not going to thread that, because it would mean passing the result back from the thread, meaning the main process would have to wait anyway, whereas with add/delete nothing need be returned. Also, add and delete* can lead to rearrangements of the nodes, which is what I figured was the most expensive part; gprof bore that out (30% of the total time for all adds was spent rearranging; presumably the later adds, when the trees are bigger, weigh more heavily here).
Originally Posted by Codeplug
It seems like find/insert/remove operations would be "too fast" to warrant an asynchronous interface. So if I were to encounter this in the wild, my first thoughts would be: "What problems were they trying to solve with this design?"
That's how it seems to me; I am just surprised that the thread operations are that expensive -- this is the first time I've done something like this (so that is mostly the "problem" I'm solving). Like I said, I would not think that threading count++ would be anything but a slowdown. But traversing and rearranging a b-tree of, say, several hundred items (if there are 2 000 000 in total and 10 000 double-hashed trees -- I did distribution tests on the b-tree and it was pretty well balanced, incl. depth**, but I haven't done anything on the hash table yet), possibly having to split and promote a whole series of nodes, would be much more complicated than a couple of pthread_cond_signal calls and some pointer assignments. Interestingly, setting the table up so the trees were much bigger (2 000 000 items in 1000 trees) did not make that much difference. This is true in the unthreaded version too. B-trees are pretty efficient things, I guess. Anyway, for reference, this is all that's involved with the thread handling:
Code:
int bhasht_add(bhasht *bht, btree_key *key) {
    if (!key->val) return EINVAL;
    pthread_mutex_lock(&bht->tasklock);
    bht->todo.data = key;
    bht->todo.type = BHTADD;
    /* If no worker is parked yet, wait for one to announce itself. */
    if (!bht->threadup) pthread_cond_wait(&bht->needcnd, &bht->tasklock);
    /* Hand the task off, then wait until a worker confirms it has it. */
    pthread_cond_signal(&bht->readycnd);
    pthread_cond_wait(&bht->gotcnd, &bht->tasklock);
    pthread_mutex_unlock(&bht->tasklock);
    return 0;
}
The thread itself:
Code:
void *bhthread (void *p) {
    bhasht *bht = (bhasht*)p;
    btree_key *kp;
    int rlm, tr;
    while (1) {
        pthread_mutex_lock(&bht->tasklock);
        bht->threadup++;
        pthread_cond_signal(&bht->needcnd);   /* announce availability */
        pthread_cond_wait(&bht->readycnd, &bht->tasklock);
        bht->threadup--;
        if (bht->todo.type == BHTADD) {
            /* Claim the task and mark it done so a second woken
             * thread doesn't repeat it; ack before doing the work. */
            kp = bht->todo.data;
            bht->todo.type = BHTDONE;
            pthread_mutex_unlock(&bht->tasklock);
            pthread_cond_signal(&bht->gotcnd);
            rlm = bhtIndex(kp->val, bht, &tr);
            pthread_mutex_lock(&bht->table[rlm].trlocks[tr]);
            BTadd(bht->table[rlm].trees[tr], kp);
            pthread_mutex_unlock(&bht->table[rlm].trlocks[tr]);
        } else if (bht->todo.type == BHTDONE) {
            /* Task already claimed by another thread; just release. */
            pthread_mutex_unlock(&bht->tasklock);
            pthread_cond_signal(&bht->gotcnd);
        }
    }
    return NULL;
}
The reason for the BHTDONE catch was to deal with pthread_cond_signal potentially releasing more than one thread (an irritating pthread feature) -- one thread gets the tasklock, does the task, and releases the lock, after which another could get the lock and perform the same task again. "If (!threadup)" rarely applies, so I guess the sequence of pthread calls per add would normally be:
- lock down in add()
- cond_signal in add()
- cond_wait in add() releases lock
- thread acquires lock, copies pointer
- thread releases lock
- thread calls cond_signal
- add resumes, releases lock and finishes while thread is free to work (which requires the locking and unlocking of a tree).
So I dunno: how expensive could a mutex lock/unlock and a signal call be? I would have thought maybe a simple assignment or two -- is that completely naive?
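For what it's worth, the textbook way to cope with cond_signal waking more than one thread (or spurious wakeups generally) is to re-check a predicate in a loop around pthread_cond_wait, rather than using a DONE state to catch the duplicate. A minimal self-contained hand-off sketch -- all the names here are made up for illustration, not the bhasht ones:

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             has_task;   /* predicate, guarded by lock */
    int             task;
    int             result;
} handoff;

static void *worker(void *p) {
    handoff *h = p;
    pthread_mutex_lock(&h->lock);
    while (!h->has_task)                  /* loop: extra wakeups are harmless */
        pthread_cond_wait(&h->cond, &h->lock);
    h->result = h->task * 2;              /* stand-in for the tree work */
    h->has_task = 0;
    pthread_mutex_unlock(&h->lock);
    return NULL;
}

static int run_one_task(int task) {
    handoff h = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0, 0 };
    pthread_t t;
    pthread_create(&t, NULL, worker, &h);
    pthread_mutex_lock(&h.lock);
    h.task = task;
    h.has_task = 1;                       /* set the predicate, then signal */
    pthread_cond_signal(&h.cond);
    pthread_mutex_unlock(&h.lock);
    pthread_join(t, NULL);
    return h.result;
}
```

Because the wait loops on has_task, a thread woken by accident just goes back to sleep; no second state is needed to stop it re-doing the task.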
Another observation: I have a dual-core processor, and using two threads was optimal. Using even just 4 threads led to a noticeable (25%) slowdown, and 8 threads 30%+. I would not have thought this would make so much difference either... the baseline here is still 200% of the single-process unthreaded version.
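That matches the usual advice of sizing the pool to the number of online cores, which POSIX systems can report via sysconf. A small sketch of that (the helper name is mine):

```c
#include <unistd.h>

/* Returns a reasonable worker-thread count: one per online core,
 * falling back to 1 if the system can't report the number. */
static long pick_thread_count(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}
```

Beyond that count, CPU-bound threads mostly just add context-switch and cache-contention overhead, which would explain the 4- and 8-thread slowdowns.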
I suppose a big factor here would be the size of the key, because it affects the strcmps. With the small tests I used stuff like a dictionary, but with the larger tests I've just been using 5-char permutations. I should probably try 10-, 20-, and 100-char strings.
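The reason key length matters is that strcmp only walks until the first differing character, so keys with long shared prefixes (as permutation-style data tends to have) cost more per compare. A counting variant makes that visible -- this is a hypothetical helper for illustration, not part of the code above:

```c
/* Like strcmp, but also reports how many character positions were
 * examined before the comparison was decided. */
static int counting_strcmp(const char *a, const char *b, int *examined) {
    int i = 0;
    while (a[i] && a[i] == b[i]) i++;
    *examined = i + 1;   /* positions looked at, including the decider */
    return (unsigned char)a[i] - (unsigned char)b[i];
}
```

Keys differing in the first character are decided after one position; keys sharing a long prefix have to walk the whole prefix every compare, at every level of the tree.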
* the look-up for the delete would use the b-tree functions directly, whereas the look-up for the hash table proper would be a separate wrapper for the b-tree look-up.
** something that surprised me a lot there was how low the "optimal" m values turned out to be -- eg, even with 12 million items in a single tree, an m value over 20 proved not worthwhile, and for a million items anything beyond 6-8 was a slowdown.
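A rough way to see why large m buys so little: height shrinks only logarithmically in the fanout. This sketch just multiplies capacity by m per level, so it's an optimistic bound (worst-case b-tree fanout is the minimum, about m/2):

```c
/* Optimistic height estimate for a b-tree with fanout m holding n keys:
 * the smallest h such that m^h >= n. Real height lies between this and
 * the same figure computed with fanout m/2. */
static int btree_height_estimate(long n, long m) {
    int h = 0;
    long cap = 1;
    while (cap < n) {
        cap *= m;
        h++;
    }
    return h;
}
```

By this estimate a million items fit in about 7 levels at m = 8, and 12 million in about 6 levels at m = 20; pushing m higher barely removes a level while making every in-node scan and split more expensive, which fits the observed optimum.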