# Help with Data structures: Hash tables

• 07-02-2008
Shaun32887
Help with Data structures: Hash tables
My boss has recommended a hash table, but I'd like some second opinions as to what you think the best way to accomplish this task is.

My problem is that I don't know C++, so I've been learning things as I needed them to finish this project. I'm kind of stuck right now though, so a push into the direction of which data structures would be best suited for me would be great.

I have a list of strings. The strings are five or fewer characters long. They're obtained by breaking up a larger string based on some code I've already written. For the sake of this post, I'll refer to the following list as my sample data.

The original string is:

CC.Cl

When processed by my program, the following strings (let's call these pieces "grams" from now on to avoid confusion) are returned.

C
C
.
C
l
CC
C.
.C
Cl
CC.
C.C
.Cl
CC.C
C.Cl
CC.Cl

What I need to do now is take this list of grams and eliminate all the duplicates, leaving behind a list of unique grams. Then I need to process another string and compare it to these grams, eliminating duplicates and adding the unique ones. The end result will be a large collection of unique grams.

If you have an understanding of what I need and can give me a few pointers, you can stop reading here. The rest is what I've tried thus far.

I've already accomplished this in MatLab, but my method is painfully slow. As I understand it, O(1) is constant time, O(log(n)) is similar to what you would expect from a binary tree, and O(n) is a linear increase in processing time. Going on that understanding, my method would be something like O(n^2); it takes longer and longer as the search goes on.

My method was very straightforward: it compared the first gram to the next one. If they matched, it deleted the second gram (which in Matlab causes the entire array to shift up) and then compared against the same index again. If they did not match, the index incremented by one and the process repeated.

Then I would process the next string. This time, instead of looking for unique grams within itself, I'd look for duplicates with these grams and the grams from the first string. Duplicates deleted, unique grams added to the list of unique grams from the first string. Then the next string would be processed and so on.

That is why it was so slow. As the number of strings passed grew and grew, the list it had to search through got bigger and bigger. On a small scale, this didn't matter. However, I have 250,000 strings and some of the strings can reach a length of up to 500 characters. Some extrapolation based on data from 12000 strings shows that I can expect about 90,000 unique grams when I'm done.

A hash table was proposed, but I think my lack of data structure knowledge is being problematic. I've been doing extensive reading for the past few days, but I feel that there might be a better way to go about this.

So before I commit myself to the hash table, I'd like to get some input from this board to see if there might be a better way of going about this.

Thanks in advance everyone. If any clarification is needed, please tell me and I'll do the best I can.
• 07-02-2008
Daved
Here's why a hash table would work: all you are doing is taking a set of unique data items and searching them for a new item, then adding that new item to the set if it is unique. The key is searching. You want to make the search quick.

If you do a linear search like you did in Matlab, then you are potentially searching every element for each new element you are trying to add; that's O(n) for each new element. If you used a sorted binary tree, then you would only have to do O(log n) comparisons to find whether the element exists. A hash table has O(1) lookup if set up properly. So a hash table would be best (if you can set it up with a good hash function and proper initial size).

In C++, you can use the std::tr1::unordered_set to hold these values (or std::tr1::unordered_map if you want key/value pairs). However, that hash table is not widely available yet. You could use the std::set container that comes with the standard library. That generally uses a binary tree and will give you O(log n) lookups, which is a good start. In fact, I might even use that first and only look for a hash table if it is too slow.

If the std::set is too slow and you can't find an std::tr1::unordered_set implementation, then you'll have to use a non-portable hash table. If you have an MFC app then CMap is a good choice, otherwise maybe boost or your standard library implementation will have a hash table you can use.

BTW, another possible solution is to use a vector, then add all grams to it and run a unique algorithm against the vector to weed out all the duplicates. Depending on the size of your input that may be faster than keeping a sorted container.
• 07-02-2008
Prelude
I guess since you plan on getting your answer here, I can delete your thread on Daniweb. The shotgun approach is terribly insulting.
• 07-02-2008
Shaun32887
Quote:

Originally Posted by Prelude
I guess since you plan on getting your answer here, I can delete your thread on Daniweb. The shotgun approach is terribly insulting.

Seriously?

There's smart people here, there's smart people there. That board has helped me just as much as this one, why is it so insulting???

Also, I don't think that TWO boards really constitutes a "shotgun." They're both boards that I try to be as active on as I can with my limited knowledge. I contribute to both of them as best as I can, I shouldn't have to pick and choose which I get help from.
• 07-02-2008
laserlight
Quote:

There's smart people here, there's smart people there. That board has helped me just as much as this one, why is it so insulting???
The implication is that you do not trust the helpers in both forums.

Quote:

I contribute to both of them as best as I can, I shouldn't have to pick and choose which I get help from.
Pick one community to pose your question to. If after some time you do not get a good response, then re-phrase your question and try again. If you still do not get a satisfactory response, then pose your question to the other community, since it could be the case that there is no one in the first community able to properly answer your question.
• 07-02-2008
Shaun32887
Daved,

Thanks for the tips, that's basically exactly what I was looking for.

I remember, when I read about vectors, I got somewhat excited because they resemble the arrays of Matlab a little bit more. I thought I remembered hearing that they were significantly slower than just using an array, and so I didn't look into them much further. Is there any kind of guideline that you know of as to when a vector would be the better choice?

Thanks,
Shaun
• 07-02-2008
dwks

The conclusion was that, basically, vectors aren't really that much slower than arrays. You might as well use vectors wherever you want them. Use arrays when you need to interface with C, or when there's always a fixed number of elements, if you want to. Or forget that arrays exist and use vectors. :)
• 07-02-2008
medievalelks
I don't get the uproar over cross posting to different forums. It's not like he posted to multiple boards on this site.
• 07-02-2008
Shaun32887
Quote:

Originally Posted by laserlight
The implication is that you do not trust the helpers in both forums.

Then let it be known that I trust everyone, that's why I'm asking. However, I don't think anyone here will argue that there are many ways to approach any given problem. The methods given in one place may very well differ from the methods of another. And as a student, I think that the idea of purposely limiting your search pool is absurd at best. Time is of the essence, and my knowledge, as it stands, is limited. I took these internet "nationalistic" tendencies into account when I made the decision to join only two boards, but I honestly thought ego-bruising like this died after high school.

I've complied with everything else you've asked for, formatted my posts as best to your specifications as I can, tried to show that I'm putting in the effort, I've done extensive work on my own to solve my own problems before posting (I'm doing hash tables after a few weeks of experience with C++, do you think that there was MAYBE some time where I would have loved to just post my code and ask "What's wrong with it?"), and I contribute equally at both boards. I shouldn't have to abandon a method that is proven to be effective EVERY TIME just so people who are insecure about these things can feel better about themselves.
• 07-02-2008
laserlight
Quote:

Originally Posted by Shaun32887
I remember, when I read about vectors I got somewhat excited because the resemble the arrays of Matlab a little bit more. I thought I remembered hearing that they were significantly slower than just using an array, and so I didn't look into them much further. Is there any kind of guideline that you know of as to when a vector would be the better choice?

What's wrong with arrays?
Why are the standard containers so slow?

Quote:

Originally Posted by medievalelks
I don't get the uproar over cross posting to different forums.

Eric Raymond's perspective is that "that's like yelling and irritates people".
• 07-02-2008
Shaun32887
Quote:

Originally Posted by dwks

The conclusion was that, basically, vectors aren't really that much slower than arrays. You might as well use vectors wherever you want them. Use arrays when you need to interface with C, or when there's always a fixed number of elements, if you want to. Or forget that arrays exist and use vectors. :)

I'll read through that thread. This is promising though, vectors definitely feel more comfortable, especially due to the large uncertainty in the final number of unique grams.

Also, that number isn't going to be fixed once I finish this program. I have to write it so that any number of strings can be processed and built into the unique grams collection. So while I'm predicting about 90,000 unique grams when I'm done with this current list, it is theoretically possible that the number of grams can eventually approach the number of total permutations, something like 250 million grams if I remember correctly. That's also assuming I first change everything to lowercase, which I'm considering.
• 07-02-2008
medievalelks
"cross-post to too many different newsgroups"

I wouldn't consider two unrelated forums "too many". I also never recall electing Eric Raymond as Netiquette Czar, but that's another issue entirely.
• 07-02-2008
Shaun32887
Quote:

Originally Posted by laserlight
What's wrong with arrays?
Why are the standard containers so slow?

I looked at them before, but I don't think I knew what a vector was last week, and I'm still not too sure about containers, so it was understandably over my head.

Quote:

Originally Posted by laserlight
Eric Raymond's perspective is that "that's like yelling and irritates people".

>>Don't shotgun-blast all the available help channels at once, that's like yelling and irritates people. Step through them softly.

I agree completely. I just don't think a well formed question to TWO boards that I'm an active member of constitutes this.
• 07-02-2008
laserlight
Quote:

I just don't think a well formed question to TWO boards that I'm an active member of constitutes this.
I agree, but you must understand that we deal with a fellow who goes by the handle George2. I am sure that you have come across him before. It is because of George2 that I can understand Prelude's harsh reaction.
• 07-02-2008
Daved
I've never had a problem with people posting to more than one board, as long as it's within reason (as opposed to what George2 does). I've found that different boards do give different answers, and some are better at some problems than others.

As far as vectors versus arrays, don't be scared off by "vectors are slower than arrays". That's not relevant here even if it was true. The question is which data structure/algorithm fits your needs.

Using a vector would likely mean filling it with all possible grams (including duplicates) and then removing those duplicates (or copying all unique grams to a new vector). This would involve lots of memory usage, and so it might not be your best choice. Also, it appears that you want to handle one string at a time, rather than all at once. If that's the case, the constant re-sorting or making unique of the vector's contents makes it a less-than-ideal solution.

I would use set or unordered_set (or a non-standard hash table implementation). Starting with set is probably a good idea since it is quite simple to use and get help with (since by being a standard container it is quite common). If that proves to be too slow then one of the hash table solutions can be attempted.

One problem with the hash table is getting the right number of buckets. Since you don't know how many unique grams there will be, it is tough to come up with a suitable number. You want there to be more buckets than unique grams so that you reduce the number of collisions. But you don't want too many buckets or you waste memory. I'm not sure how well current hash table implementations resize, but that could affect the speed of your app if you went that direction.