Thread: Help with Data structures: Hash tables

  1. #1
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43

    Help with Data structures: Hash tables

    My boss has recommended a hash table, but I'd like some second opinions as to what you think the best way to accomplish this task is.

    My problem is that I don't know C++, so I've been learning things as I needed them to finish this project. I'm kind of stuck right now though, so a push into the direction of which data structures would be best suited for me would be great.


    My task is as follows:

    I have a list of strings. The strings are are five or less characters long. They're obtained by breaking up a larger string based on some code I've already written. For the sake of this post, I'll refer to the following list as my sample data.

    The original string is:

    CC.Cl


    When processed by my program, the following strings (lets call these pieces "grams" from now on to avoid confusion) are returned.

    C
    C
    .
    C
    l
    CC
    C.
    .C
    Cl
    CC.
    C.C
    .Cl
    CC.C
    C.Cl
    CC.Cl


    What I need to do now, is take this list of grams and eliminate all the duplicates, leaving behind a list of unique grams. Then I need to process another string and compare it to these grams eliminating duplicates and adding the uniques. the end result will be a large collection of unique grams.


    If you have an understanding of what I need and can give me a few pointers, you can stop reading here. The rest is what I've tried thus far.


    I've already accomplished this in MatLab, but my method is painstakingly slow. As I understand it, O(1) is constant time, O(log(n)) is similar to what you would expect from a binary tree, and O(n) is a linear increase in processing time. Going on that understanding, my method would be something like O(n^2), it takes longer and longer as the search goes on.

    My method was very straightforward, it compared the first gram to the next one. If they matched, it deleted the second gram (which in Matlab causes the entire array to shift up) and then compared to the same index again. If they did not match, the index incremented by one and repeat.

    Then I would process the next string. This time, instead of looking for unique grams within itself, I'd look for duplicates with these grams and the grams from the first string. Duplicates deleted, unique grams added to the list of unique grams from the first string. Then the next string would be processed and so on.

    That is why it was so slow. As the number of strings passed grew and grew, the list it had to search through got bigger and bigger. On a small scale, this didn't matter. However, I have 250,000 strings and some of the strings can reach a length of up to 500 characters. Some extrapolation based on data from 12000 strings shows that I can expect about 90,000 unique grams when I'm done.

    A hash table was proposed, but I think my lack of data structure knowledge is being problematic. I've been doing extensive reading for the past few days, but I feel that there might be a better way to go about this.

    So before I commit myself to the hash table, I'd like to get some input from this board to see if there might be a better way of going about this.


    Thanks in advance everyone. If any clarification is needed, please tell me and I'll do the best I can.
    Last edited by Shaun32887; 07-02-2008 at 11:19 AM.

  2. #2
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    Here's why a hash table would work: all you are doing is taking a set of unique data items and searching them for a new item, then adding that new item to the set if it is unique. The key is searching. You want to make the search quick.

    If you do a linear search like you did in Matlab, then you are potentially searching every element for each new element you are trying to add, that's O(n) for each new element. If you used a sorted binary tree, then you would only have to do O(log n) comparisons to find whether the element exists. A hash table has O(1) lookup if setup properly. So a hash table would be best (if you can set it up with a good hash function and proper initial size).

    In C++, you can use the std::tr1::unordered_set to hold these values (or std::tr1::unordered_map if you want key/value pairs). However, that hash table is not widely available yet. You could use the std::set container that comes with the standard library. That generally uses a binary tree and will give you O(log n) lookups, which is a good start. In fact, I might even use that first and only look for a hash table if it is too slow.

    If the std::set is too slow and you can't find an std::tr1::unordered_set implementation, then you'll have to use a non-portable hash table. If you have an MFC app then CMap is a good choice, otherwise maybe boost or your standard library implementation will have a hash table you can use.

    BTW, another possible solution is to use a vector, then add all grams to it and run a unique algorithm against the vector to weed out all the duplicates. Depending on the size of your input that may be faster than keeping a sorted container.

  3. #3
    Code Goddess Prelude's Avatar
    Join Date
    Sep 2001
    Posts
    9,897
    I guess since you plan on getting your answer here, I can delete your thread on Daniweb. The shotgun approach is terribly insulting.
    My best code is written with the delete key.

  4. #4
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    Quote Originally Posted by Prelude View Post
    I guess since you plan on getting your answer here, I can delete your thread on Daniweb. The shotgun approach is terribly insulting.
    Seriously?

    There's smart people here, there's smart people there. That board has helped me just as much as this one, why is it so insulting???


    Also, I don't think that TWO boards really constitutes a "shotgun." They're both boards that I try to be as active on as I can with my limited knowledge. I contribute to both of them as best as I can, I shouldn't have to pick and choose which I get help from.
    Last edited by Shaun32887; 07-02-2008 at 12:54 PM.

  5. #5
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    There's smart people here, there's smart people there. That board has helped me just as much as this one, why is it so insulting???
    The implication is that you do not trust the helpers in both forums.

    I contribute to both of them as best as I can, I shouldn't have to pick and choose which I get help from.
    Pick one community to pose your question to. If after some time you do not get a good response, then re-phrase your question and try again. If you still do not get a satisfactory response, then pose your question to the other community, since it could be the case that there is no one in the first community able to properly answer your question.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  6. #6
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    Daved,

    Thats for the tips, that's basically exactly what I was looking for.

    I remember, when I read about vectors I got somewhat excited because the resemble the arrays of Matlab a little bit more. I thought I remembered hearing that they were significantly slower than just using an array, and so I didn't look into them much further. Is there any kind of guideline that you know of as to when a vector would be the better choice?

    Thanks,
    Shaun

  7. #7
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    There have been a few threads about this in the past. This one comes to mind: http://cboard.cprogramming.com/showthread.php?t=104334

    The conclusion was that, basically, vectors aren't really that much slower than arrays. You might as well use vectors wherever you want them. Use arrays when you need to interface with C, or there's always a fixed number of elements, if you want to. Or forget that arrays exist and use vectors.
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  8. #8
    Registered User
    Join Date
    Apr 2008
    Posts
    890
    I don't get the uproar over cross posting to different forums. It's not like he posted to multiple boards on this site.

  9. #9
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    Quote Originally Posted by laserlight View Post
    The implication is that you do not trust the helpers in both forums.
    Then let it be known that I trust everyone, that's why I'm asking. However, I don't think anyone here will argue that there are many ways to approach any given problem. The methods given in one place may very well differ from the methods of another. And as a student, I think that the idea of purposely limiting your search pool is absurd at best. Time is of the essence, and my knowledge as it is is limited. I took these internet "nationalistic" tendencies into account when I made the decision to join only two boards, but I honestly thought ego-bruising like this died after high school.

    I've complied with everything else you've asked for, formatted my posts as best to your specifications as I can, tried to show that I'm putting in the effort, I've done extensive work on my own to solve my own problems before posting (I'm doing hash tables after a few weeks of experience with C++, do you think that there was MAYBE some time where I would have loved to just post my code and ask "What's wrong with it?"), and I contribute equally at both boards. I shouldn't have to abandon a method that is proven to be effective EVERY TIME just so people who are insecure about these things can feel better about themselves.

  10. #10
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Shaun32887
    I remember, when I read about vectors I got somewhat excited because the resemble the arrays of Matlab a little bit more. I thought I remembered hearing that they were significantly slower than just using an array, and so I didn't look into them much further. Is there any kind of guideline that you know of as to when a vector would be the better choice?
    You might want to read Stroustrup's answers to the FAQs:
    What's wrong with arrays?
    Why are the standard containers so slow?

    Quote Originally Posted by medievalelks
    I don't get the uproar over cross posting to different forums.
    Eric Raymond's perspective is that "that's like yelling and irritates people".
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  11. #11
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    Quote Originally Posted by dwks View Post
    There have been a few threads about this in the past. This one comes to mind: http://cboard.cprogramming.com/showthread.php?t=104334

    The conclusion was that, basically, vectors aren't really that much slower than arrays. You might as well use vectors wherever you want them. Use arrays when you need to interface with C, or there's always a fixed number of elements, if you want to. Or forget that arrays exist and use vectors.
    Thanks for the quick reply.

    I'll read through that thread. This is promising though, vectors definitely feel more comfortable, especially due to the large uncertainty in the final number of unique grams.

    Also, that number isn't going to be fixed once I finish this program. I have to write it so that any number of strings can be processed and built into the unique grams collection. So while I'm predicting about 90,000 unique grams when I'm done with this current list, it is technically theoretically possible that the number of grams can eventually approach the number of total permutations, something like 250 million grams if I remember correctly. That's also assuming I first change everything to lowercase, which I'm considering.

  12. #12
    Registered User
    Join Date
    Apr 2008
    Posts
    890
    "cross-post to too many different newsgroups"

    I wouldn't consider two unrelated forums "too many". I also never recall electing Eric Raymond as Netiquette Czar, but that's another issue entirely.

  13. #13
    Registered User
    Join Date
    Jun 2008
    Location
    Houston, Texas
    Posts
    43
    Quote Originally Posted by laserlight View Post
    You might want to read Stroustrup's answers to the FAQs:
    What's wrong with arrays?
    Why are the standard containers so slow?
    Thanks, I'll read through these.

    I looked at them before, but I don't think I knew what a vector was last week, and I'm still not too sure about containers, so it was understandably over my head.


    Quote Originally Posted by laserlight View Post
    Eric Raymond's perspective is that "that's like yelling and irritates people".
    >>Don't shotgun-blast all the available help channels at once, that's like yelling and irritates people. Step through them softly.


    I agree completely. I just don't think a well formed question to TWO boards that I'm an active member of constitutes this.
    Last edited by Shaun32887; 07-02-2008 at 03:12 PM.

  14. #14
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    I just don't think a well formed question to TWO boards that I'm an active member of constitutes this.
    I agree, but you must understand that we deal with a fellow who goes by the handle George2. I am sure that you have come across him before. It is because of George2 that I can understand Prelude's harsh reaction.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  15. #15
    Registered User
    Join Date
    Jan 2005
    Posts
    7,366
    I've never had a problem with people posting to more than one board, as long as it's within reason (as opposed to what George2 does). I've found that different boards do give different answers, and some are better at some problems than others.

    As far as vectors versus arrays, don't be scared off by "vectors are slower than arrays". That's not relevant here even if it was true. The question is which data structure/algorithm fits your needs.

    Using a vector would likely mean filling it with all possible grams (including duplicates) and then removing those duplicates (or copying all unique grams to a new vector). This would involve lots of memory usage, and so it might not be your best choice. Also, it appears that you want to handle one string at a time, rather than all at once. If that's the case, the constant re-sorting or making unique of the vector's contents make it a less than ideal solution.

    I would use set or unordered_set (or a non-standard hash table implementation). Starting with set is probably a good idea since it is quite simple to use and get help with (since by being a standard container it is quite common). If that proves to be too slow then one of the hash table solutions can be attempted.

    One problem with the hash table is getting the right number of buckets. Since you don't know how many unique grams there will be, it is tough to come up with a suitable number. You want there to be more buckets than unique grams so that you reduce the number of collisions. But you don't want too many buckets or you waste memory. I'm not sure how well current hash table implementations resize, but that could affect the speed of your app if you went that direction.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. need help with hash tables with direct chaining
    By agentsmith in forum C Programming
    Replies: 4
    Last Post: 01-05-2008, 04:19 AM
  2. What's a good generalized data structures book?
    By indigo0086 in forum A Brief History of Cprogramming.com
    Replies: 12
    Last Post: 11-16-2006, 01:01 PM
  3. Replies: 2
    Last Post: 02-28-2006, 02:01 PM
  4. Any Good Data Structures Books?
    By YevGenius in forum C++ Programming
    Replies: 3
    Last Post: 05-26-2004, 07:49 AM
  5. Problems with Hash Tables
    By Zildjian in forum C Programming
    Replies: 6
    Last Post: 11-06-2003, 08:53 PM