Thread: hash table in c++

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    222

    hash table in c++

    Hi,

    I have a question about hash tables in C++. As far as I can see, I can create the hash table via hash_map or by using a vector. Now I don't understand what the difference is, and which would be the better (== faster (construction time, retrieval time) and more memory-efficient) way for the following case:
    Code:
    Key  Value
    12   32
    13   56
    3    54
    1    54
    76   54
    ...
    all values are integers!

    The table is expected to be large, containing between ten million and several hundred million records.

    Thank you

    baxy
    Last edited by baxy; 12-06-2012 at 05:11 AM.

  2. #2
    The superhaterodyne twomers's Avatar
    Join Date
    Dec 2005
    Location
    Ireland
    Posts
    2,273
    Back-of-the-brain calculation, but a couple of hundred million records on a 64-bit machine will require about 12GB of memory. Can you support this? Will your table change?
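    A rough way to sanity-check that figure (the per-entry overheads below are assumptions about a typical node-based std::unordered_map on a 64-bit system, not measurements):
    Code:
    #include <cstdio>

    int main() {
        // Assumed per-entry costs for std::unordered_map<int, int> on a
        // typical 64-bit, node-based implementation:
        const double payload    = 8;  // pair<const int, int>
        const double next_ptr   = 8;  // chain pointer in the node
        const double hash_cache = 8;  // cached hash (common implementation detail)
        const double alloc_ovh  = 16; // heap allocator bookkeeping per node
        const double bucket     = 8;  // one bucket pointer per entry at load factor 1

        const double per_entry = payload + next_ptr + hash_cache + alloc_ovh + bucket;
        const double entries   = 200e6; // "a couple of hundred million"

        std::printf("~%.0f bytes/entry -> ~%.1f GiB\n", per_entry,
                    per_entry * entries / (1024.0 * 1024.0 * 1024.0));
        // ~48 bytes/entry -> ~8.9 GiB; with empty-bucket slack and heap
        // fragmentation, something on the order of 12GB is plausible.
    }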

  3. #3
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    OK, the machine that I am working on has 256GB of RAM, so there is no doubt that it will support this. What I am interested in is which approach is better (my primary concern is speed (retrieval and insertion), but the less memory I have to waste, the better). And yes, the table will change; I have no idea how many records it will contain. The record count can span (theoretically) from 1 to 1,000,000,000, but from experience it is usually somewhere between 10,000,000 and 500,000,000.
    Last edited by baxy; 12-06-2012 at 09:09 AM.

  4. #4
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    hash_map is a non-standard extension that implements a hash table. vector is an array, so if you want to use it for the hash table, you'll essentially be implementing a hash table yourself. That should be reason enough to go with hash_map (or better yet, std::unordered_map if your compiler supports it, as it's standard) until profiling shows that it is a bottleneck and you need a custom implementation to be fast enough.
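    For instance, a minimal sketch of the standard container in use (C++11):
    Code:
    #include <cstdint>
    #include <iostream>
    #include <unordered_map>

    int main() {
        std::unordered_map<std::uint32_t, std::uint32_t> table;
        table.reserve(1000); // pre-size to the expected record count if known

        table[12] = 32; // insertion: O(1) on average
        table[13] = 56;

        auto it = table.find(12); // lookup: O(1) on average
        if (it != table.end())
            std::cout << it->second << '\n'; // prints 32
    }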
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  5. #5
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by baxy
    As far as I can see, I can create the hash table via hash_map or by using a vector.
    std::unordered_map is the C++11 (and TR1) version of the non-standard hash_map.

    Quote Originally Posted by baxy
    What I am interested in is which approach is better
    Measure by testing your typical operations with typical data and find out. What I can tell you is that insertion of a single element into a std::map and finding a single element given a key has logarithmic time complexity, whereas insertion of a single element into a std::unordered_map and finding a single element given a key has constant time complexity in the average case but linear time complexity in the worst case.
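    For example, a minimal timing harness along those lines (a sketch; the random keys here merely stand in for your typical data):
    Code:
    #include <chrono>
    #include <cstdio>
    #include <map>
    #include <random>
    #include <unordered_map>

    // Time n insertions followed by n lookups on any map-like container.
    template <typename Map>
    double run(std::size_t n) {
        std::mt19937 gen(42);
        std::uniform_int_distribution<int> dist(1, 1000000000);
        Map m;
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i) m[dist(gen)] = 54;
        gen.seed(42); // replay the same key sequence for the lookup pass
        std::size_t found = 0;
        for (std::size_t i = 0; i < n; ++i) found += m.count(dist(gen));
        const auto t1 = std::chrono::steady_clock::now();
        std::printf("found %zu; ", found); // keeps the loop from being optimized out
        return std::chrono::duration<double>(t1 - t0).count();
    }

    int main() {
        const std::size_t n = 1000000; // scale up toward your real record counts
        std::printf("std::map:           %.2fs\n", run<std::map<int, int>>(n));
        std::printf("std::unordered_map: %.2fs\n",
                    run<std::unordered_map<int, int>>(n));
    }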
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  6. #6
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by baxy View Post
    several hundred million records.
    When it comes to those sorts of numbers, every bit of information (more than you could possibly provide, I expect) is useful. E.g. what is the exact range of values for the key and value? Tell us more about the origins of the data. Are there any reasons that collisions may be more likely than usual, perhaps?

    For example, if those are typical values, then you'll notice that they can all fit within a byte. Or if your values can fit into a short, then you can use that instead.
    In fact, even if your data requires a 32-bit int in the worst case, if there are enough of the smaller values, then having, say, two hash tables, one from short to short and one from int to int, might be worthwhile. The smaller values go in one and the larger values in the other. Depending on the distribution of the values, that might well save you 40% on your RAM usage.
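    A sketch of that split (assuming "small" means both key and value fit in 16 bits; the cutoff is yours to choose):
    Code:
    #include <cstdint>
    #include <unordered_map>

    // Route each pair to the cheaper table when both halves fit in 16 bits.
    // (A sketch: re-inserting a key whose value changed size class would
    // need an erase from the other table first.)
    struct SplitTable {
        std::unordered_map<std::uint16_t, std::uint16_t> small;
        std::unordered_map<std::uint32_t, std::uint32_t> large;

        void insert(std::uint32_t key, std::uint32_t value) {
            if (key <= 0xFFFF && value <= 0xFFFF)
                small[static_cast<std::uint16_t>(key)] =
                    static_cast<std::uint16_t>(value);
            else
                large[key] = value;
        }

        bool find(std::uint32_t key, std::uint32_t& out) const {
            if (key <= 0xFFFF) { // a small key may still live in either table
                const auto it = small.find(static_cast<std::uint16_t>(key));
                if (it != small.end()) { out = it->second; return true; }
            }
            const auto it = large.find(key);
            if (it == large.end()) return false;
            out = it->second;
            return true;
        }
    };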
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  7. #7
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    Yes, I know that. My initial problem is that I have a huge array of integers. 50% of them are small enough to fit into one byte, and those I am encoding with a single byte each, but the rest are too big and need to be encoded as 32- or sometimes 64-bit integers. There is no regularity with respect to when a certain number will appear; you can think of it as a case where such numbers are equally likely to appear at the first, middle, or last position.

    So what I did is scan through the array and write the one-byters into a byte array using the numbers 1-250, and if a number is greater than 250, I encode it as 251 and store its position and its true value in the hash. Therefore, when the number is required, I check the position, and if the position holds 251, I go into the hash and look up its value.

    This saves me a large chunk of memory, but I still don't have an idea how to compress the other 50% of the data. I could probably split it into 32- and 64-bit-ers, but the bigger portion of them are 64-bit-ers, so I don't save that much space. If anyone has a clue how to shrink a 64-bit integer into a 32- or 8-bit integer, please share the knowledge.
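    If it helps, the scheme looks roughly like this (a reconstruction for illustration, not my actual code; the names are made up):
    Code:
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Escape-byte encoding as described: values 1-250 are stored inline in
    // the byte array; 251 marks an overflow whose true value lives in a hash.
    struct PackedArray {
        enum : std::uint8_t { kOverflow = 251 };
        std::vector<std::uint8_t> bytes;                    // inline small values
        std::unordered_map<std::size_t, std::uint64_t> big; // position -> value

        void push(std::uint64_t v) {
            if (v >= 1 && v <= 250) {
                bytes.push_back(static_cast<std::uint8_t>(v));
            } else {
                big[bytes.size()] = v; // remember the true value by position
                bytes.push_back(kOverflow);
            }
        }

        std::uint64_t get(std::size_t pos) const {
            const std::uint8_t b = bytes[pos];
            return b == kOverflow ? big.at(pos) : b;
        }
    };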

    baxy

  8. #8
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    If you often have a lot of pairs with the same key hash but with different values, then I would consider storing all the corresponding values in a vector<unsigned char> using a universal coding scheme to compress them. Then you could throw all values, no matter what the range, into the same thing.
    I have all the common universal encoding schemes implemented on my website. Some of the C code for them on Wikipedia was written by me.
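    For illustration, here is one simple byte-oriented variable-length code (a LEB128-style varint; not taken from my site, just the same idea in miniature):
    Code:
    #include <cstdint>
    #include <vector>

    // Append v to out, 7 bits per byte; the high bit flags "more bytes follow".
    // Small values cost 1 byte; even a full 64-bit value costs at most 10.
    void varint_encode(std::uint64_t v, std::vector<std::uint8_t>& out) {
        while (v >= 0x80) {
            out.push_back(static_cast<std::uint8_t>(v) | 0x80);
            v >>= 7;
        }
        out.push_back(static_cast<std::uint8_t>(v));
    }

    // Decode one value starting at pos; advances pos past it.
    std::uint64_t varint_decode(const std::vector<std::uint8_t>& in,
                                std::size_t& pos) {
        std::uint64_t v = 0;
        int shift = 0;
        while (in[pos] & 0x80) {
            v |= static_cast<std::uint64_t>(in[pos++] & 0x7F) << shift;
            shift += 7;
        }
        return v | (static_cast<std::uint64_t>(in[pos++]) << shift);
    }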

    Or are your keys much more diverse than your values perhaps? In that case I would consider storing an index as the value inside the hash table, and putting the actual value somewhere else. Basically the "palette" approach. It all depends on knowing as much as possible about your data.
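    A sketch of that palette idea (hypothetical names; it assumes the number of distinct values is small enough for a 32-bit index):
    Code:
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Store each distinct 64-bit value once; the per-key table holds only a
    // small index into that "palette".
    struct PaletteMap {
        std::vector<std::uint64_t> palette;                       // distinct values
        std::unordered_map<std::uint64_t, std::uint32_t> value_to_index;
        std::unordered_map<std::uint64_t, std::uint32_t> key_to_index;

        void insert(std::uint64_t key, std::uint64_t value) {
            std::uint32_t idx;
            const auto it = value_to_index.find(value);
            if (it == value_to_index.end()) {
                idx = static_cast<std::uint32_t>(palette.size());
                palette.push_back(value);
                value_to_index.emplace(value, idx);
            } else {
                idx = it->second;
            }
            key_to_index[key] = idx;
        }

        std::uint64_t get(std::uint64_t key) const {
            return palette[key_to_index.at(key)];
        }
    };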
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"
