std::map with many elements

**Daved** · 12-01-2007

Ok, that's a good start. Now put timing data around that loop to see if the slow down is in the loop. In fact, just a cout before and after the outer while loop might do it. Remember, I'm thinking one potential problem is in the destruction of a large map, so commenting out the code that fills the large map would get rid of that slowdown as well.

**Scarvenger** · 12-01-2007

Oh, i am sorry, i wasn´t clear:
Total time execution with insertion: 2.08s
Loop time with insertion: 2.06s

Total time execution without insertion: 0.41s
Loop time without insertion: 0.41s

**iMalc** · 12-01-2007

Originally Posted by Scarvenger

okay, with this piece of code:

Code:

		while (begin != end)
		{
			dictionary[(*begin++)]++;
		}

The routine took 2.05 seconds, and without it (commenting it) i get 0.42 seconds...

What timing do you get if you just change that loop to this?:

Code:

		while (begin != end)
		{
			*begin++;
		}

Then tell us how long it takes if you instead do this:

Code:

		std::map<std::string, int> iter = dictionary.begin();
		while (iter != dictionary.end())
		{
			dictionary[iter++]++;
		}

That should tell us whether the slowness is mainly related to incrementing the sregex_token_iterator, or whether it is in fact the actual insertions.

**Scarvenger** · 12-01-2007

Code:

                while (begin != end)
		{
			dictionary[(*begin++)]++;
		}

0.951s

Code:

		while (begin != end)
		{
			*begin++;
		}

0.781s

So, yes, now i suppose that the problem is in the sregex_token_iterator... Maybe i will think anything to replace it...

**Scarvenger** · 12-01-2007

Okay, i have changed the whole code that used the boost::regex library to a pure std approach, the code is the following:

Code:

	ifstream file(filename);
	char data[3000];

	while (!file.eof())
	{
		file.getline(data, 3000);
		string line(data);

		stringToLower(line);

		unsigned int position = 0;
		while ((position = line.find_first_of(delimiters, position)) != -1)
		{
			int endPosition = line.find_first_not_of(delimiters, position);
			string result      = line.substr(position, endPosition - position);
			position             = endPosition;

			dictionary[result]++;
		}
	}

	return true;

This seems to be a much faster aproach than the boost::regex since it gives me total execution time of ~1.39s while the boost::regex was giving me ~4.01s.

Now i would like to squeeze this program to get the last drop of performance from it... Suggestions?
[edit]
Readability and good practices sugestion are welcome too

[/edit]

**Daved** · 12-02-2007

>> Now i would like to squeeze this program to get the last drop of performance from it... Suggestions?

Just a guess: put line.substr(position, endPosition - position) into the dictionary[] call instead of result and remove result entirely. Also you'd have to move the assignment of position to after that call as well. A good optimizing compiler probably would already have done that itself, but it's worth a try.

By the way, are the timings on a release build with optimizations on? That makes a big difference. All timing tests should be done with release builds.

>> Readability and good practices sugestion are welcome too

I'd use string::npos instead of -1 and string::size_type instead of unsigned int.

Why are you using the C-string version of getline? Are you purposefully trying to cap the input at 3000 characters? If so, that's fine. If you made it 3000 because you want to make sure you get the entire line, then use the C++ string version of getline: getline(file, line).

Whichever version of getline you use, you should not use eof() to control the loop. You need to check to see if the reading function (getline in this case) succeeds before using it. The most common and easy way to do this is to move the reading function into the while control:

Code:

string line;
while (getline(file, line))
{
    // ...

**CornedBee** · 12-02-2007

To get more performance you'd need a hash table instead of the balanced tree. See if your compiler provides std::tr1::unordered_map, or perhaps stdext::hash_map.

**Daved** · 12-02-2007

I would prefer boost's unordered_map implementation before using stdext::hash_map. The standard hash_map has pretty much been set now, so it only makes sense to start using that instead of the older compiler specific variations.

Also remember what I mentioned earlier about knowing a good size estimate before you start reading. If you don't have one, then you might either waste a ton of memory or lose all the performance benefits of the hash map (or worse have significantly worse performance). A good size to use is 50% more than how many elements will be in the map. Making that number prime or with few factors might help as well.

**CornedBee** · 12-02-2007

Except that Boost doesn't have an unordered_map implementation in its TR1 emulation. The only hash map in Boost is a multi_index container with a hash index.

**Daved** · 12-02-2007

Really? I could have sworn I'd used or at least read about their implementation before. I guess I confused that with something else.

That's too bad. It took long enough to get a standardized interface that it would be nice if there was a cross-platform implementation available that implements it.

**CornedBee** · 12-02-2007

Or there is talk about their implementation. In fact, there is an implementation - but it's in the Sandbox, not yet part of Boost proper. 1.34 doesn't have it, 1.35 won't have it either.

Thread: std::map with many elements

Thread Tools

Search Thread

Display

Similar Threads

using realloc for a dynamically growing array

Randomly rearranging the elements in a vector

delete[]'ing std::map elements... sort of.

way to check 3 elements of array and set 4th

Is there a way to find the number of elements in an array?