Thread: Unordered set (const char*) much slower than unordered set (string)

  1. #1
    Registered User
    Join Date
    Jun 2011
    Posts
    2

    Question Unordered set (const char*) much slower than unordered set (string)

    I'm loading a very long list from disk into an unordered_set. If I use a set of strings, it is very fast. A test list of about 7 MB loads in about 1 second. However, using a set of char pointers takes about 2.1 minutes! For a 59.6 MB list, the string version loads in 10 seconds. The char* version takes 525 minutes!

    Here is the code for the string version:

    Code:
    unordered_set<string> Set;
    string key;
    while (getline(fin, key))
    {
        Set.insert(key);
    }
    Here is the code for the char* version:

    Code:
    struct unordered_eqstr
    {
        bool operator()(const char* s1, const char* s2) const
        {
            return strcmp(s1, s2) == 0;
        }
    };
    
    struct unordered_deref
    {
        template <typename T>
        size_t operator()(const T* p) const
        {
            return hash<T>()(*p);
        }
    };
    
    unordered_set<const char*, unordered_deref, unordered_eqstr> Set;
    string key;
    
    while (getline(fin, key))
    {
        char* str = new(mem) char[key.size()+1];
        strcpy(str, key.c_str());
        Set.insert(str);
    }
    The "new(mem)" is because I'm using a custom memory manager so I can allocate big blocks of memory and give them out to tiny objects like c strings. However, I've tested this with regular "new" and the results are identical. I've also used my memory manager in other tools with no problems.

    The two structs are necessary to make insert and find hash and compare based on the actual C string rather than its address. I found unordered_deref on Stack Overflow.

    Eventually I need to load multi-gigabyte files. This is why I'm using a custom memory manager, but it's also why this horrible slow down is unacceptable. Any ideas?

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    I suspect that the problem is with the hash function: it looks like you are just hashing based on the first character, leading to many hash collisions.
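
    With a const char* argument, T is deduced as char, so your functor effectively does this:

    Code:
    // unordered_deref instantiated for const char*: T = char, so *p
    // is only the first character, e.g. "apple" and "axe" collide.
    size_t operator()(const char* p) const
    {
        return hash<char>()(*p);
    }
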
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Eventually I need to load multi-gigabyte files.
    This is probably going to be a problem with such a simple strategy.

    What are you trying to do?

    Soma

  4. #4
    Registered User
    Join Date
    Jun 2011
    Posts
    2
    Quote Originally Posted by laserlight View Post
    I suspect that the problem is with the hash function: it looks like you are just hashing based on the first character, leading to many hash collisions.
    That was it! I changed the code to this and it sped right up:

    Code:
    struct unordered_deref
    {
        size_t operator()(const char* p) const
        {
            // hash the entire string, not just the first character
            return hash<string>()(p);
        }
    };
    Thank you!
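
    One thing I noticed: hash<string>()(p) builds a temporary std::string on every insert and lookup. If that ever shows up in profiling, I could hash the C string in place instead. A minimal sketch (FNV-1a; the constants assume a 64-bit size_t):

    Code:
    // Hash the C string in place, avoiding the temporary std::string.
    struct cstr_hash
    {
        size_t operator()(const char* p) const
        {
            size_t h = 14695981039346656037ULL;       // FNV offset basis
            while (*p)
            {
                h ^= static_cast<unsigned char>(*p++);
                h *= 1099511628211ULL;                // FNV prime
            }
            return h;
        }
    };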

    Quote Originally Posted by phantomotap View Post
    This is probably going to be a problem with such a simple strategy.

    What are you trying to do?

    Soma
    I'm simply trying to find a fast, memory-efficient way to load a large map or set into memory when access time is most important. Once loaded, the data may be used in various ways.

    For example, the list may be a filter. Once it's in memory, another file is traversed and written out to a new file; while traversing, each line is checked against the filter, and if it's found, it isn't written to the new (filtered) file. I need to do it this way because the files being filtered are too big to fit in memory, even on our 144 GB machine. But as long as the filter list fits in memory, it should be OK.
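
    Roughly, the filtering pass will look like this (simplified sketch; paths and names are placeholders):

    Code:
    #include <fstream>
    #include <string>
    #include <unordered_set>
    using namespace std;
    
    // Stream the big file through and drop any line found in the filter.
    void filter_file(const unordered_set<string>& filter,
                     const char* in_path, const char* out_path)
    {
        ifstream fin(in_path);
        ofstream fout(out_path);
        string line;
        while (getline(fin, line))
        {
            if (filter.find(line) == filter.end())  // keep unfiltered lines
                fout << line << '\n';
        }
    }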

    My custom memory manager reduces memory usage by about 30-40%. However, it seems the unordered set and map have more overhead than I thought, as the data still ends up taking about 2.8 times as much memory as it does disk space.

    And while this new method uses less memory, it also takes about 50-70% longer to load. That's probably due to the hashing, and if so, it would also slow down whatever follows the load, such as the filtering itself.

    At the moment, our largest list takes up 14.9 GB on disk and 58 GB of memory when loaded using the simpler unordered_set<string> method, or 37 GB using the custom memory manager. However, the load times are 13.2 and 19.4 minutes respectively.

    Even though 58 GB is already less than half of the computer's 144 GB of memory, we're often running multiple memory-intensive tasks at a time, so it's always nice to reduce memory usage when possible. But in this case, 50-70% longer access times might not be worth a 30-40% memory reduction.

    Any thoughts?

  5. #5
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Any thoughts?
    Several. They would all require some manual coding.

    A custom string class that optionally generates a hash as it appends characters to its buffer could improve search speed.
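
    For example, something along these lines (rough sketch, djb2 as the running hash):

    Code:
    #include <cstddef>
    #include <string>
    
    // A string that keeps a running djb2 hash up to date as characters
    // are appended, so hashing a key at lookup time costs nothing extra.
    class hashed_string
    {
        std::string data;
        size_t h;
    
    public:
        hashed_string() : h(5381) {}
    
        void push_back(char c)
        {
            data += c;
            h = h * 33 + static_cast<unsigned char>(c);  // djb2 step
        }
    
        size_t hash() const { return h; }
        const std::string& str() const { return data; }
    };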

    A hash system built upon a mathematical construct susceptible to partial memoization could vastly improve the speed of building the list.

    A straight hash table built with a shared-prefix trie as each bucket could improve memory use without sacrificing much speed on collisions.

    Simply caching the hash of the filters could improve performance by a great deal.
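
    For instance (sketch only), pair each stored string with a hash computed once, so the table never rehashes the bytes:

    Code:
    #include <cstddef>
    #include <cstring>
    #include <unordered_set>
    
    // Each entry carries its hash, computed once when it is built.
    struct entry
    {
        const char* str;
        size_t hash;
    };
    
    struct entry_hash
    {
        size_t operator()(const entry& e) const { return e.hash; }
    };
    
    struct entry_eq
    {
        bool operator()(const entry& a, const entry& b) const
        {
            return std::strcmp(a.str, b.str) == 0;
        }
    };
    
    std::unordered_set<entry, entry_hash, entry_eq> Set;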

    *shrug*

    It really depends on what the data and filters look like because none of this may be applicable.

    Soma

  6. #6
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    O_o

    I forgot to say, as a matter of interest, that you can easily use a custom memory allocation mechanism with STL objects by creating a `std::allocator'-compatible wrapper.

    If the `std::string' form still proves faster, you can use your custom allocator with it and see what happens.
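
    A minimal sketch of such a wrapper, assuming the manager is a global `mem' object exposing an `allocate(bytes)' method; adapt it to the real interface:

    Code:
    #include <cstddef>
    #include <new>
    #include <string>
    
    // Assumed interface of the custom manager (adjust to the real one):
    struct MemoryManager { void* allocate(size_t n); };
    extern MemoryManager mem;
    
    template <typename T>
    class arena_allocator
    {
    public:
        typedef T                 value_type;
        typedef T*                pointer;
        typedef const T*          const_pointer;
        typedef T&                reference;
        typedef const T&          const_reference;
        typedef size_t            size_type;
        typedef ptrdiff_t         difference_type;
    
        template <typename U>
        struct rebind { typedef arena_allocator<U> other; };
    
        arena_allocator() {}
        template <typename U> arena_allocator(const arena_allocator<U>&) {}
    
        pointer allocate(size_type n, const void* = 0)
        {
            return static_cast<pointer>(mem.allocate(n * sizeof(T)));
        }
        void deallocate(pointer, size_type) {}  // arena frees in bulk
    
        void construct(pointer p, const T& v) { new(p) T(v); }
        void destroy(pointer p) { p->~T(); }
    
        size_type max_size() const { return size_type(-1) / sizeof(T); }
    };
    
    template <typename T, typename U>
    bool operator==(const arena_allocator<T>&, const arena_allocator<U>&)
    { return true; }
    template <typename T, typename U>
    bool operator!=(const arena_allocator<T>&, const arena_allocator<U>&)
    { return false; }
    
    // A string whose character storage comes from the arena:
    typedef std::basic_string<char, std::char_traits<char>,
                              arena_allocator<char> > arena_string;
    Note that `std::hash' is only specialized for plain `std::string', so an `unordered_set' of `arena_string' still needs a small custom hash functor.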

    Soma
