# How to scan multiple vectors for common elements efficiently?

This is a discussion on How to scan multiple vectors for common elements efficiently? within the C++ Programming forums, part of the General Programming Boards category.

1. It's not clear why counter is a lambda. Either make it a free function (possibly static, or in an unnamed namespace), or put the body directly in tp.schedule(...). Do whichever is easier to read.

2. Good question about why counter is a lambda. It's because I don't feel like it deserves to be its own free function (it would just obfuscate my main.cpp, and it doesn't fit into any other .cpp file except maybe my tetra.cpp one), and I honestly use it once and only once, for one purpose. So I figured, heck, why not just write it down as a lambda, so people can easily read that it's a small sub-function within my main loop.

Nevertheless, after much experimentation, I finally got something that would actually compile and work. Also Elysia, I need those mutex locks; otherwise my counting is subject to race conditions. I did some trial and error without mutexes (but using else-ifs, because they're stylistically superior, imo). For 22^3 particles, one run gave me 8039 inside and 1378 on the surface, and another gave 7515 inside and 1365 on the surface. Neither of those adds up to the proper sum, but with mutexes I get a consistent 9260 inside + 1388 on the surface for a grand total of 10648, which is exactly 22^3, so mutexes ftw!

My revised code looks like this:
Code:
```/* Create and use a lambda to count the number of particles contained by the root */

int *inside = new int(0), *surface = new int(0), *outside = new int(0);

// ins, out, surf are mutexes declared elsewhere
auto counter = [](int *inside, int *surface, int *outside, vector<particle> parts, tetra *root) {

    int x = 0;

    for (auto it = parts.begin(); it < parts.end(); ++it) {

        x = inside_tetra(root, &(*it));

        if (x == 0) {
            ins.lock();
            ++(*inside);
            ins.unlock();
        } else if (x == -1) {
            out.lock();
            ++(*outside);
            out.unlock();
        } else if (x > 0) {
            surf.lock();
            ++(*surface);
            surf.unlock();
        }
    }
};

for (unsigned i = 0; i < num_threads; ++i)
    tp.schedule([=]() { counter(inside, surface, outside, particle_sets[i], root_tetra); });

tp.wait();

cout << "\nThere are " << *inside << " points inside the root" << endl;
cout << "There are " << *surface << " points on the surface of the root" << endl;
cout << "There are " << *outside << " points outside the root" << endl;
cout << *inside + *surface << " total contained points" << endl;

delete inside; delete surface; delete outside;```
I suppose though I should be using auto_ptr instead of manual new() and delete(), and you're right about the unique lock. Those are small revisions, though, which I'll implement eventually. Heck, I should just do it now.

Thank you for all the advice though, I really do appreciate it

3. Originally Posted by MutantJohn
Nevertheless, after much experimentation, I finally got something that would actually compile and work. Also Elysia, I need those mutex locks; otherwise my counting is subject to race conditions. I did some trial and error without mutexes (but using else-ifs, because they're stylistically superior, imo). For 22^3 particles, one run gave me 8039 inside and 1378 on the surface, and another gave 7515 inside and 1365 on the surface. Neither of those adds up to the proper sum, but with mutexes I get a consistent 9260 inside + 1388 on the surface for a grand total of 10648, which is exactly 22^3, so mutexes ftw!
I didn't say you should get rid of the mutexes. I simply said you shouldn't lock them manually. That is what we have lock guards for.

I suppose though I should be using auto_ptr instead of manual new() and delete() and you're right about the unique lock.
unique_ptr
Though I really fail to see why these ints must be allocated on the heap.

4. Alright, maybe you can help me with this then. The code I have here is suffering badly from race conditions.
Code:
```array<vector<vector<tree*>>, 2> scan_leaves(vector<tree*> *leaves, pool tp) {

    cout << "Pool size : " << tp.size() << endl; // this checks out fine

    array<vector<vector<tree*>>, 2> safe_and_unsafe; // we're sorting out safe vs unsafe for multithreading

    //array<vector<pair<vector<tree*>, particle*>>, 2> unsafe_safe; // prototype for later optimization

    vector<unsigned> safety_check; // used as a numeric array
    safety_check.assign(num_threads, 0); // initialize to all zeroes

    // leaves is a pointer to a number of vectors equal to num_threads
    for (unsigned i = 0; i < num_threads; ++i) {
        tp.schedule([=]() { sort(leaves[i].begin(), leaves[i].end()); }); // this works fine as each set of data is independent
    }

    tp.wait();

    mutex lck; // create a mutex

    for (unsigned i = 0; i < num_threads - 1; ++i) { // this is where the trouble begins...

        tp.schedule([&]() {

            for (unsigned j = i + 1; j < num_threads; ++j) {
                printf("(%u, %u)\n", i, j); // supposed to print (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)
                for (auto it = leaves[j].begin(); it < leaves[j].end(); ++it) {
                    if (binary_search(leaves[i].begin(), leaves[i].end(), *it)) {
                        lck.lock();
                        safety_check.at(i) += 1;
                        safety_check.at(j) += 1;
                        lck.unlock();
                    }
                }
            }
        });
    }

    tp.wait();

    for (unsigned i = 0; i < safety_check.size(); ++i)
        if (safety_check.at(i) > 0)
            safe_and_unsafe.at(1).push_back(leaves[i]);
        else
            safe_and_unsafe.at(0).push_back(leaves[i]);

    cout << "array<0>.size = " << safe_and_unsafe.at(0).size() << endl;
    cout << "array<1>.size = " << safe_and_unsafe.at(1).size() << endl;

    for (auto & s : safety_check)
        cout << s << endl;

    return safe_and_unsafe;
}```
Basically, it's supposed to work like this: I have 4 vectors as part of leaves, namely leaves[0], leaves[1], leaves[2] and leaves[3]. As of now, each vector holds exactly one element, and it's the same one in all four. These are pointers, so for example the value of that single element in all 4 vectors is 0x209a9e0, which matches well with previous data.

My algorithm is supposed to go like this: do vector 0 and vector 1 have any elements in common? If so, then add 1's to safety_check at indices 0 and 1. Do vector 0 and vector 2 have any elements in common? If so, add 1's to safety_check at indices 0 and 2. Do vector 0 and vector 3 have any elements in common? If so...

That's why I said we have a read spread like this:
(0, 1) (1, 2) (2, 3)
(0, 2) (1, 3)
(0, 3)

When I do this with just one thread, I get the proper spread for safety_check which is 3, 3, 3, 3.

How do I get that? Simple!
(0, 1) share so
safety_check = 1, 1, 0, 0

(0, 2) share so
safety_check = 2, 1, 1, 0

(0, 3) share so
safety_check = 3, 1, 1, 1

(1, 2) share so
safety_check = 3, 2, 2, 1

(1, 3) share so
safety_check = 3, 3, 2, 2

(2, 3) share so
safety_check = 3, 3, 3, 3

and that's basically the algorithm so why am I getting slapped in the face by race conditions so hard?

5. Originally Posted by MutantJohn
Good question about why counter is a lambda. It's because I don't feel like it deserves to be its own free function (it would just obfuscate my main.cpp, and it doesn't fit into any other .cpp file except maybe my tetra.cpp one), and I honestly use it once and only once, for one purpose. So I figured, heck, why not just write it down as a lambda, so people can easily read that it's a small sub-function within my main loop.
That doesn't seem like a good enough reason. Splitting big functions into smaller ones does not clutter cpp files. You've basically got a non-trivial function nested inside another function, which is not what C++ programmers are used to. I'd just pull it out and make it static (or put it in an unnamed namespace). You can also give it a longer name, to make its narrow use clear.

6. Alright, that's fair. I do want to keep consistent with how a C++ programmer would code.

Also, I seem to have fixed my race conditions that I posted earlier. I am slowly becoming a master of the lambdas!!!!

Code:
```#include "structures.h"

void scan_leaves(vector<pair<vector<tree*>, particle*>> & pile_pair, pool & tp) {

    /* Assert that our references are valid */
    /*
    tree t1(0x0), t2(0x0), t3(0x0);

    pile_pair.at(0).first.push_back(&t2);
    pile_pair.at(0).first.push_back(&t3);
    pile_pair.at(0).first.push_back(&t1);

    cout << "This is the original unsorted vector of pile.at(0).first :\n";
    for (auto it = pile_pair.at(0).first.begin(); it < pile_pair.at(0).first.end(); ++it)
        cout << (size_t)*it << endl;
    */

    /* Sort our collected leaves */

    for (auto it = pile_pair.begin(); it < pile_pair.end(); ++it) {
        tp.schedule([=]() { sort((*it).first.begin(), (*it).first.end()); });
    }

    tp.wait();
    /*
    cout << "\nThis is the new sorted vector of pile.at(0).first :\n";
    for (auto it = pile_pair.at(0).first.begin(); it < pile_pair.at(0).first.end(); ++it)
        cout << (size_t)*it << endl;
    */

    vector<unsigned> safety_check;
    safety_check.assign(pile_pair.size(), 0); // size it first; .at() on an empty vector throws

    for (vector<pair<vector<tree*>, particle*>>::iterator it = pile_pair.begin(); it < pile_pair.end(); ++it) {

        tp.schedule([=, &safety_check, &pile_pair]() {

            for (vector<pair<vector<tree*>, particle*>>::iterator it2 = it + 1; it2 < pile_pair.end(); ++it2) {
                for (vector<tree*>::iterator it3 = (*it2).first.begin(); it3 < (*it2).first.end(); ++it3) {
                    if (binary_search((*it).first.begin(), (*it).first.end(), *it3)) {
                        safety_check.at(distance(pile_pair.begin(), it)) += 1;
                        safety_check.at(distance(pile_pair.begin(), it2)) += 1;
                    }
                }
            }
        });
    }

    tp.wait();

    for (auto & s : safety_check)
        cout << s << endl;
}```
And yes, using vector<pair<vector<tree*>, particle*>> is very confusing lol.

7. Yeah, that does look confusing. Can't you make some part of that a class? Like turn the pair into a class (or POD struct) that has better names than first and second. And add more convenience functions to the interface.

8. I'm wondering that since there's just one memory bus, and usually just one common outer cache, will multiple threads really help much? Also, instead of going through a vector and doing a binary search, once the vectors are sorted, perhaps it would be faster to perform an operation similar to a k-way merge when scanning for duplicates.
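A minimal sketch of the two-way version of that merge idea, assuming both vectors are already sorted (count_common is a hypothetical name, not from the thread): it counts the common elements in O(n + m) with a single simultaneous pass, instead of one binary search per element.

```cpp
#include <cstddef>
#include <vector>

// walk both sorted vectors once, advancing whichever side holds the
// smaller element; equal elements are common, so count them and
// advance both sides
size_t count_common(const std::vector<int>& a, const std::vector<int>& b) {
    size_t count = 0, i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])
            ++i;
        else if (b[j] < a[i])
            ++j;
        else {
            ++count;
            ++i;
            ++j;
        }
    }
    return count;
}
```

This is the same comparison structure std::set_intersection uses; generalizing it to k vectors at once is the k-way merge rcgldr is describing.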

9. Why would multiple threads not help (if done right)? The whole reason we have multicore processors is that they're faster.
But I can imagine there will be a lot of false sharing here.

10. Originally Posted by Elysia
Why would multiple threads not help (if done right)? The reason we have multicores is because it's faster.
But I can imagine there will be a lot of false sharing on here.
Because memory bound applications don't benefit from multicore processors. They are bound by memory latency on the memory bus, which is not affected by the number of cores. As rcgldr says, there's only one memory bus. In those cases improving performance means buying faster RAM, not more CPU cores.

That said, my instinct is that this code is not bound in that way.

11. Originally Posted by King Mir
Because memory bound applications don't benefit from multicore processors. They are bound by memory latency on the memory bus, which is not affected by the number of cores. As rcgldr says, there's only one memory bus. In those cases improving performance means buying faster RAM, not more CPU cores.

That said, my instinct is that this code is not bound in that way.
Yes, you are right, but the way rcgldr put it, it sounded more like there is no way to get parallelism unless you have more than one memory bus and more than one outer cache, neither of which changes. There is always one memory bus and two levels of outer cache (at least for now).

12. Originally Posted by rcgldr
Also, instead of going through a vector and doing a binary search, once the vectors are sorted, perhaps it would be faster to perform an operation similar to a k-way merge when scanning for duplicates.
O_o

I thought about this for a bit, and I like the idea. Assuming you limit "k" reasonably, making cache efficient runs, you'd probably need multiple passes, but that still dramatically reduces the number of comparisons, and likely reduces seeking, and you can easily thread runs if desired by merging the runs after the threaded pass is finished.

Originally Posted by King Mir
That said, my instinct is that this code is not bound in that way.
It seems to me as if every "pair" of threads added to the scheduler shares at least two containers. Consider:

Code:
```// attempting to simplify
{
    for (auto & c : pile_pair[1].first)
    {
        if (binary_search(pile_pair[0].first.begin(), pile_pair[0].first.end(), c))
        {
            safety_check[0] += 1;
        }
    }
}

{
    for (auto & c : pile_pair[2].first)
    {
        if (binary_search(pile_pair[1].first.begin(), pile_pair[1].first.end(), c))
        {
            safety_check[1] += 1;
        }
    }
}```
The situation may not be enough to make threads compete for the bus, but then I don't have enough information to speculate about the size and complexity of the data.

Of course, giving each thread its own storage (e.g. `unsigned * safety_check(new unsigned[pile_pair.size()]());`) might eliminate enough of the potential competition in any event.

Soma

13. Seeing these responses makes me happy
