smart use of std:algorithm

**EOP** · 09-17-2008

Simplified:

I have one vector <known_names> and a second one <new_names>.
Both vectors are sorted and unique.

I want to find all occurrences of any *new_names where
1. if *new_names is already present in <known_names>, it should be added to a third vector <already_known_names>
2. if *new_names is not yet present, it should be added to a fourth vector <newly_added>

It has to be accomplished without nested loops, as I want to use just std:algorithm in a smart way.

I didn't find a smart combination of std:algorithms that satisfies my needs up to now.
Neither any of the remove... nor the for_each functions seem to support my needs.

Any suggestions would be highly appreciated.

btw. the lists have a lot of names. At least 100.000+.

**~~master5001~~** · 09-17-2008

I do not know what sort of sorting you may want to perform... so check this out.

**tabstop** · 09-17-2008

I don't know what's wrong with nested loops. Off the top of my head:
1. Get an iterator to point to the start of <known_names>.
2. Run a for_each through <new_names>
3. While (current new name > current known name) move the iterator forward.
4. If match, add to already_known; if not, add newly_added.
You should only end up going through each list once, and I don't see how to do much better than that.

**laserlight** · 09-17-2008

This is one possible solution:
1. Sort <known_names>.
2. Sort <new_names>.
3. Use std::set_difference with <known_names> and <new_names> to create <already_known_names>.
4. Use std::set_difference with <new_names> and <known_names> to create <newly_added>.

EDIT:
Oh wait, the input vectors are sorted, so you can ignore #1 and #2.

**tabstop** · 09-17-2008

And yet again, I am defeated by C++. (Although that does look much like what I suggested, but why not use the built-in version?)

**EOP** · 09-17-2008

Originally Posted by laserlight

This is one possible solution:
1. Sort <known_names>.
2. Sort <new_names>.
3. Use std::set_difference with <known_names> and <new_names> to create <already_known_names>.
4. Use std::set_difference with <new_names> and <known_names> to create <newly_added>.

EDIT:
Oh wait, the input vectors are sorted, so you can ignore #1 and #2.

I hoped to get a reply by you!

TY and regards from Germany. (I love you secretly for a long time

).

btw. I didn't find any idea to answer master5001's ingenious reply without being insulting.

Cheers and TY for the quick (and partially useful) replies.

**laserlight** · 09-17-2008

Originally Posted by tabstop

And yet again, I am defeated by C++. (Although that does look much like what I suggested, but why not use the built-in version?)

Actually, your suggested algorithm just makes one pass, whereas using set_difference means two passes. So this is the trade-off between a hand crafted algorithm and implementation that should be almost twice as fast versus a generic algorithm whose implementation is well tested and probably well optimised in its own right.

**EOP** · 09-17-2008

Originally Posted by tabstop

I don't know what's wrong with nested loops. Off the top of my head:
1. Get an iterator to point to the start of <known_names>.
2. Run a for_each through <new_names>
3. While (current new name > current known name) move the iterator forward.
4. If match, add to already_known; if not, add newly_added.
You should only end up going through each list once, and I don't see how to do much better than that.

Thanks, but I already tested the "loop approach" and it takes up to twice the time (and even more) than std:algorithms.

Maybe I could speed it up by compiling it to .asm and optimizing the .asm file, but I did this the last time in the late 80s and early 90s when puters were quite slow.

**~~master5001~~** · 09-17-2008

Erf? Your compiler probably optimizes in ways your granny only used to dream about. I, as many of you already well know, do enjoy writing code in assembler and doing hand optimizations. But I highly recommend against this approach outside of specific times when you are damn sure you are smarter than your compiler or for educational purposes.

As for the speed issue, theoretically the loops should be faster. So your code was probably designed poorly.

**tabstop** · 09-17-2008

Just 'cause I was feeling protective of my idea, I compared the two on 200000 random four-character strings (old and new each). Neither one took very long, but mine took about 2/3 as long on this particular machine with this particular compiler, with fake data, consult doctor before use, your mileage may vary, etc.

One little gotcha that I found with the std::algorithm solution to take note of if you go that route is that the "answer" vectors must be sized ahead of time -- an iterator to the end of the data is returned by set_difference, but the vector is not resized or anything, so you're going to have to store that iterator, or use that to chop off, or whatever.

**King Mir** · 09-17-2008

Originally Posted by tabstop

One little gotcha that I found with the std::algorithm solution to take note of if you go that route is that the "answer" vectors must be sized ahead of time -- an iterator to the end of the data is returned by set_difference, but the vector is not resized or anything, so you're going to have to store that iterator, or use that to chop off, or whatever.

Just use an insertion iterator like back_inserter.

**Daved** · 09-17-2008

>> Just use an insertion iterator like back_inserter.
Although if you're concerned about efficiency a reserve may still be useful.

**EOP** · 09-17-2008

Originally Posted by master5001

Erf? Your compiler probably optimizes in ways your granny only used to dream about... blah

What about e.g. using loopnz instead of compiler generated loops?

**tabstop** · 09-17-2008

Can we tell I'm up to about page 20 in Josuttis? Alright, back_inserter works too. Thanks.

And to make a legitimate contribution:
I don't think <known_names> - <new_names> is going to give <already_known_names>. We need to do the second part first to get <newly_added>, and then do <new_names> - <newly_added> to get <duplicates>. (Or at least that's how I did it in my test.)

**~~master5001~~** · 09-17-2008

What makes you think your compiler is not aware of every opcode for your CPU and knows how to rearrange data to be optimal for loops. To be honest, in assembler writing an optimal loop is probably the hardest thing you can do. A compiler is more apt to do these things than you or me.

Thread: smart use of std:algorithm

Thread Tools

Search Thread

Display

smart use of std:algorithm

Similar Threads

Which smart pointer to use?

smart pointer release

can not create Visual C++ Smart Device project

Intelligent vs Smart

Using a smart pointer for all objects