Thread: hash table in c++

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    222

    hash table in c++

    Hi,

    I have a question about hash tables in C++. As far as I can see, I can create the hash table via hash_map or by using a vector. Now I don't understand what the difference is, and which would be the better (== faster (construction time, retrieval time) and more memory-efficient) way for the following case:
    Code:
    Key  Value
    12   32
    13   56
    3    54
    1    54
    76   54
    ...
    all values are integers!

    The table is expected to be large, containing between ten million and several hundred million records.

    Thank you

    baxy
    Last edited by baxy; 12-06-2012 at 05:11 AM.

  2. #2
    The superhaterodyne twomers's Avatar
    Join Date
    Dec 2005
    Location
    Ireland
    Posts
    2,273
    Back-of-the-brain calculation, but a couple of hundred million records on a 64-bit machine will require about 12GB of memory. Can you support this? Will your table change?
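    A rough way to sanity-check that figure (the per-entry overheads below are assumptions about a typical node-based std::unordered_map on a 64-bit system, not measurements):
    Code:
    #include <cstdio>

    int main() {
        // Assumed per-entry costs for std::unordered_map<int, int> on a
        // typical 64-bit, node-based implementation:
        const double payload    = 8;  // pair<const int, int>
        const double next_ptr   = 8;  // chain pointer in the node
        const double hash_cache = 8;  // cached hash (common implementation detail)
        const double alloc_ovh  = 16; // heap allocator bookkeeping per node
        const double bucket     = 8;  // one bucket pointer per entry at load factor 1

        const double per_entry = payload + next_ptr + hash_cache + alloc_ovh + bucket;
        const double entries   = 200e6; // "a couple of hundred million"

        std::printf("~%.0f bytes/entry -> ~%.1f GiB\n", per_entry,
                    per_entry * entries / (1024.0 * 1024.0 * 1024.0));
        // ~48 bytes/entry -> ~8.9 GiB; with empty-bucket slack and heap
        // fragmentation, something on the order of 12GB is plausible.
    }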

  3. #3
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    OK, the machine that I am working on has 256GB of RAM, so there is no doubt that it will support this. What I am interested in is which approach is better (my primary concern is speed (retrieval and insertion), but the less memory I have to waste, the better). And yes, the table will change; I have no idea how many records it will contain. The record count can span (theoretically) from 1 to 1,000,000,000, but from experience it is usually somewhere between 10,000,000 and 500,000,000.
    Last edited by baxy; 12-06-2012 at 09:09 AM.

  4. #4
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    hash_map is a non-standard extension that implements a hash table. vector is an array, so if you want to use it for the hash table, you'll essentially be implementing a hash table yourself. That should be reason enough to go with hash_map (or better yet, std::unordered_map if your compiler supports it, as it's standard) until profiling shows that it is a bottleneck and you need a custom implementation to be fast enough.
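    For instance, a minimal sketch of the standard container in use (C++11):
    Code:
    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::uint32_t, std::uint32_t> table;
        table.reserve(1000); // pre-size to the expected record count if known

        table[12] = 32; // insertion: O(1) on average
        table[13] = 56;

        auto it = table.find(12); // lookup: O(1) on average
        if (it != table.end())
            std::cout << it->second << '\n'; // prints 32
    }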
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  5. #5
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by baxy
    As far as I can see, I can create the hash table via hash_map or by using a vector.
    std::unordered_map is the C++11 (and TR1) version of the non-standard hash_map.

    Quote Originally Posted by baxy
    What I am interested in is which approach is better
    Measure by testing your typical operations with typical data and find out. What I can tell you is that insertion of a single element into a std::map and finding a single element given a key has logarithmic time complexity, whereas insertion of a single element into a std::unordered_map and finding a single element given a key has constant time complexity in the average case but linear time complexity in the worst case.
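    For example, a minimal timing harness along those lines (a sketch; the random keys here merely stand in for your typical data):
    Code:
    #include <chrono>
    #include <cstdio>
    #include <map>
    #include <random>
    #include <unordered_map>

    // Time n insertions followed by n lookups on any map-like container.
    template <typename Map>
    double run(std::size_t n) {
        std::mt19937 gen(42);
        std::uniform_int_distribution<int> dist(1, 1000000000);
        Map m;
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i) m[dist(gen)] = 54;
        gen.seed(42); // replay the same key sequence for the lookup pass
        std::size_t found = 0;
        for (std::size_t i = 0; i < n; ++i) found += m.count(dist(gen));
        const auto t1 = std::chrono::steady_clock::now();
        std::printf("found %zu; ", found); // keeps the loop from being optimized out
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        const std::size_t n = 1000000; // scale up toward your real record counts
        std::printf("std::map:           %.2fs\n", run<std::map<int, int>>(n));
        std::printf("std::unordered_map: %.2fs\n",
                    run<std::unordered_map<int, int>>(n));
    }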
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  6. #6
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by baxy View Post
    several hundred million records.
    When it comes to those sorts of numbers, every bit of information (more than you could possibly provide, I expect) is useful. E.g. what is the exact range of values for the key and value? Tell us more about the origins of the data. Are there any reasons that collisions may be more likely than usual, perhaps?

    For example, if those are typical values, then you'll notice that they can all fit within a byte. Or if your values can fit into a short, then you can use that instead.
    In fact, even if your data requires a 32-bit int in the worst case, if there are enough of the smaller values, then having, say, two hash tables, one from short to short and one from int to int, might be worthwhile. The smaller values go in one and the larger values in the other. Depending on the distribution of the values, that might well save you 40% on your RAM usage.
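    A sketch of that split (assuming "small" means both key and value fit in 16 bits; the cutoff is yours to choose):
    Code:
    #include <cstdint>
    #include <unordered_map>

    // Route each pair to the cheaper table when both halves fit in 16 bits.
    // (A sketch: re-inserting a key whose value changed size class would
    // need an erase from the other table first.)
    struct SplitTable {
        std::unordered_map<std::uint16_t, std::uint16_t> small;
        std::unordered_map<std::uint32_t, std::uint32_t> large;

        void insert(std::uint32_t key, std::uint32_t value) {
            if (key <= 0xFFFF && value <= 0xFFFF)
                small[static_cast<std::uint16_t>(key)] =
                    static_cast<std::uint16_t>(value);
            else
                large[key] = value;
        }

        bool find(std::uint32_t key, std::uint32_t& out) const {
            if (key <= 0xFFFF) { // a small key may still live in either table
                const auto it = small.find(static_cast<std::uint16_t>(key));
                if (it != small.end()) { out = it->second; return true; }
            }
            const auto it = large.find(key);
            if (it == large.end()) return false;
            out = it->second;
            return true;
        }
    };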
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  7. #7
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    Yes, I know that. My initial problem is that I have a huge array of integers. 50% of them are small enough to fit into one byte, and those I am encoding with a single byte each, but the rest are too big and need to be encoded as 32- or sometimes 64-bit integers. There is no regularity with respect to when a certain number will appear; you can think of it as a case where such numbers are equally likely to appear at the first, middle, or last position.

    So what I did is scan through the array and write the one-byters into a byte array using the numbers 1-250, and if a number is greater than 250, I encode it as 251 and store its position and its true value in the hash. Therefore, when the number is required, I check the position, and if the position holds 251, I go into the hash and look up its value.

    This saves me a large chunk of memory, but I still don't have an idea how to compress the other 50% of the data. I could probably split it into 32- and 64-bit-ers, but the bigger portion of them are 64-bit-ers, so I don't save that much space. If anyone has a clue how to shrink a 64-bit integer into a 32- or 8-bit integer, please share the knowledge.
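    If it helps, the scheme looks roughly like this (a reconstruction for illustration, not my actual code; the names are made up):
    Code:
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Escape-byte encoding as described: values 1-250 are stored inline in
    // the byte array; 251 marks an overflow whose true value lives in a hash.
    struct PackedArray {
        enum : std::uint8_t { kOverflow = 251 };
        std::vector<std::uint8_t> bytes;                    // inline small values
        std::unordered_map<std::size_t, std::uint64_t> big; // position -> value

        void push(std::uint64_t v) {
            if (v >= 1 && v <= 250) {
                bytes.push_back(static_cast<std::uint8_t>(v));
            } else {
                big[bytes.size()] = v; // remember the true value by position
                bytes.push_back(kOverflow);
            }
        }

        std::uint64_t get(std::size_t pos) const {
            const std::uint8_t b = bytes[pos];
            return b == kOverflow ? big.at(pos) : b;
        }
    };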

    baxy

  8. #8
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    If you often have a lot of pairs with the same key hash but with different values, then I would consider storing all the corresponding values in a vector<unsigned char> using a universal coding scheme to compress them. Then you could throw all values, no matter what the range, into the same thing.
    I have all the common universal encoding schemes implemented on my website. Some of the C code for them on Wikipedia was written by me.
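    For illustration, here is one simple byte-oriented variable-length code (a LEB128-style varint; not taken from my site, just the same idea in miniature):
    Code:
    #include <cstdint>
    #include <vector>

    // Append v to out, 7 bits per byte; the high bit flags "more bytes follow".
    // Small values cost 1 byte; even a full 64-bit value costs at most 10.
    void varint_encode(std::uint64_t v, std::vector<std::uint8_t>& out) {
        while (v >= 0x80) {
            out.push_back(static_cast<std::uint8_t>(v) | 0x80);
            v >>= 7;
        }
        out.push_back(static_cast<std::uint8_t>(v));
    }

    // Decode one value starting at pos; advances pos past it.
    std::uint64_t varint_decode(const std::vector<std::uint8_t>& in,
                                std::size_t& pos) {
        std::uint64_t v = 0;
        int shift = 0;
        while (in[pos] & 0x80) {
            v |= static_cast<std::uint64_t>(in[pos++] & 0x7F) << shift;
            shift += 7;
        }
        return v | (static_cast<std::uint64_t>(in[pos++]) << shift);
    }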

    Or are your keys much more diverse than your values perhaps? In that case I would consider storing an index as the value inside the hash table, and putting the actual value somewhere else. Basically the "palette" approach. It all depends on knowing as much as possible about your data.
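    A sketch of that palette idea (hypothetical names; it assumes the number of distinct values is small enough for a 32-bit index):
    Code:
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Store each distinct 64-bit value once; the per-key table holds only a
    // small index into that "palette".
    struct PaletteMap {
        std::vector<std::uint64_t> palette;                       // distinct values
        std::unordered_map<std::uint64_t, std::uint32_t> value_to_index;
        std::unordered_map<std::uint64_t, std::uint32_t> key_to_index;

        void insert(std::uint64_t key, std::uint64_t value) {
            std::uint32_t idx;
            const auto it = value_to_index.find(value);
            if (it == value_to_index.end()) {
                idx = static_cast<std::uint32_t>(palette.size());
                palette.push_back(value);
                value_to_index.emplace(value, idx);
            } else {
                idx = it->second;
            }
            key_to_index[key] = idx;
        }

        std::uint64_t get(std::uint64_t key) const {
            return palette[key_to_index.at(key)];
        }
    };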
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"
