Thread: How to scan multiple vectors for common elements efficiently?

  1. #1
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665

    How to scan multiple vectors for common elements efficiently?

    Hey y'all, your favorite and most prolific poster here just thinkin' out loud again. Feel free to chime in if y'all want.

    Let's say say we conform to the ISO C standard and this snippet of code :
    Code:
    vector<tree*> *leaves = new vector<tree*>[num_threads];
    where num_threads is specified from command line arguments so not dynamically allocating it violates the standard.

    Let's also assume num_threads is greater than one.

    What I want to do is scan each vector in leaves for duplicates. If any two vectors in the set have matching addresses, they both immediately go onto the "unsafe" pile and will no longer be subject for testing.

    If a vector clears one vector, we test it against the others in the set.

    So if we have 3 vectors, A, B and C we test A against B then A against C. For efficiency, we then then just test B against C.

    Like I said, I want a "safe" and "unsafe" pile. Every vector in "safe" is fully unique while every vector in "unsafe" is not unique.

    I thought about just using a for-loop to loop through leaves and then iterate through each element but I'm not sure if that'll work just right out of the box.

  2. #2
    Internet Superhero
    Join Date
    Sep 2006
    Location
    Denmark
    Posts
    964
    Quote Originally Posted by MutantJohn View Post
    I thought about just using a for-loop to loop through leaves and then iterate through each element but I'm not sure if that'll work just right out of the box.
    Sure why wouldn't that work? For each element in vec_one, compare with each element in vec_two. This is the naiive O(n*n) solution.

    You can do better than this by sorting both vectors and then using binary search to find matching elements, this would be O(n log n) i believe.

    How large will these vectors be? If it were me, they'd have to be pretty large for me to bother with the optimal solution.

    Edit:

    Don't try to implement binary search yourself by the way, you will almost certainly get it wrong, it's very tricky to get right.
    How I need a drink, alcoholic in nature, after the heavy lectures involving quantum mechanics.

  3. #3
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Ooh, sort then binary search might work. Idk what that kind of search is, but I dig the O(n log n) feel of it.

    I did have one idea though.

    What if I created a set for each vector (yeah, the inserting might be slow) and then use set.insert.second to check for duplicates. I'm not sure if that's faster than sorting then using binary_search().

  4. #4
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    binary_search works as follows:
    First, it requires the array to be sorted.
    Then, it defines two positions, start and end which are the beginning and end of the array at first.
    It then selects the element at position (end - start) / 2. If the element you are searching for is equal to the element at that position, then it's finished.
    Otherwise, if the element you are are seeking is less than the found element, then set end = (end - start) / 2 and repeat.
    Otherwise, if the element you are are seeking is greater than the found element, then set begin = (end - start) / 2 and repeat.
    By doing this, it halves the search space each iteration, which leads to log(n) time. This property obviously only holds if the array is sorted!

    Well, if you are going to have the sort the arrays (and delete duplicates), then it might just make sense to have them sorted in the first place. Then you will know directly when you insert a duplicate. The only real way to know which one is faster is to time it because either way, it's going to end up O(N*log(N)).
    And I know I've told you in your other thread... why the heck are you allocating the vector on the heap? And why the heck are you allocating an array of vectors? Allocating a vector on the heap rarely makes sense, and if you need a dynamic array of something, use a vector, even if the element of the dynamic array is a vector!
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  5. #5
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Oh I see now. Have a vector to hold your vectors. That's cool. Ty, manger.

  6. #6
    Registered User
    Join Date
    Apr 2013
    Posts
    1,658
    In the case of Microsoft compilers, a vector has two parts, a small header with a pointer, count, ..., and the actual content of the vector, which is allocated from the heap. I don't know if this is a standard implementation for vectors that would apply to other compilers as well.

  7. #7
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Okay, I got some code that works. Basically, it looks like this :
    Code:
    array<vector<vector<tree*>>, 2> scan_leaves(vector<tree*> *leaves) {
    
    
       array<vector<vector<tree*>>, 2> safe_and_unsafe;
    
    
       vector<unsigned> safety_check;
    
    
       for (unsigned i=0; i < num_threads; ++i) {
    
    
          sort(leaves[i].begin(), leaves[i].end());
          safety_check.push_back(0);
       }
    
    
       for (unsigned i=0; i< num_threads; ++i) {
    
    
          for (unsigned j= i+1; j < num_threads; ++j) {
    
    
             for (auto it = leaves[j].begin(); it < leaves[j].end(); ++it) {
    
    
                if (binary_search(leaves[i].begin(), leaves[i].end(), *it)) {
    
    
                   safety_check.at(i) += 1;
                   safety_check.at(j) += 1;
                }
             }
          } 
       }
    
    
       for (unsigned i=0; i < safety_check.size(); ++i)
          if (safety_check.at(i) > 0)
             safe_and_unsafe.at(1).push_back(leaves[i]);
          else
             safe_and_unsafe.at(0).push_back(leaves[i]);
    
    
       cout << "array<0>.size = " << safe_and_unsafe.at(0).size() << endl;
       cout << "array<1>.size = " << safe_and_unsafe.at(1).size() << endl;
    
    
       return safe_and_unsafe;
    }
    Yay, so super readable. My metaphor for it is, I have several piles of leaves (ha!) where each pile is a collection of leaves. I scan for duplicates and am basically trying to separate isolated piles from piles that share leaves.

    Also, why does boost::bind not support STL funcions? For example, for like an hour or two last night I was trying to do
    Code:
    boost::bind(std::sort, leaves[i].begin(), leaves[i].end());
    because it'd be simple for me to multithread sorting each independent leaves vector but alas, I could not. What's the deal with that? Bad syntax? Also, boost::threadpool is the easiest way to multithread code, I think. Or at least for my needs it is. I am so in love with it. Just as a test to see if it really does work, I scheduled 4 infinite while-loops and lo and behold, according to htop, each core was at 100% usage. That's amazing! boost::threadpool really, really works!!! And I downloaded the fix to make it no longer leak memory! I'm in love, omg.

  8. #8
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by MutantJohn View Post
    Also, why does boost::bind not support STL funcions? For example, for like an hour or two last night I was trying to do
    Code:
    boost::bind(std::sort, leaves[i].begin(), leaves[i].end());
    Why aren't you using lambas instead?
    Code:
    [&leaves, i]() { std::sort(leaves[i].begin(), leaves[i].end()); }
    That is assuming that leaves lives long enough for the lambda to execute.
    Regardless, you avoid the template hell std::bind/boost::bind can cause when used incorrectly.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  9. #9
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    What I mean at the end is, I have this code :
    Code:
    pool tp(num_threads);
    /* And the only way I've seen that I scan schedule a task with arguments is to use ..*/
    tp.schedule(boost::bind(search_tree, arg1, arg2, arg3, arg4));
    
    tp.wait();
    so I'm trying to do this with sorting but I keep getting an error that the template type cannot be deduced. Any way to make this explicitly known?

    Edit : But Elysia, the threadpooling!!! I need more of teh threadz!!! Can I use lambdas to schedule things with a boost::threadpool?

  10. #10
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by MutantJohn View Post
    so I'm trying to do this with sorting but I keep getting an error that the template type cannot be deduced. Any way to make this explicitly known?
    I don't know where the error is coming from, but yes, you can explicitly specify a template type using, say, search_tree<my_template_parameters>. That would work if it's complaining about search_tree. You can do the same with boost::bind, tp.schedule, whatever. Of course, doing this is not recommended unless necessary.

    Edit : But Elysia, the threadpooling!!! I need more of teh threadz!!! Can I use lambdas to schedule things with a boost::threadpool?
    Anything that accepts a functor, or a boost::bind object will accept a lambda. Just try it.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  11. #11
    Registered User
    Join Date
    Oct 2006
    Posts
    3,445
    Quote Originally Posted by MutantJohn View Post
    Can I use lambdas to schedule things with a boost::threadpool?
    you can use lambdas to do anything that regular functions can do, and more.
    What can this strange device be?
    When I touch it, it gives forth a sound
    It's got wires that vibrate and give music
    What can this thing be that I found?

  12. #12
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Elysia, you are so sexy, holy crap.

    Check it, this compiles :
    Code:
    boost::bind(sort<vector<tree*>::iterator>, leaves[0].begin(), leaves[0].end());
    Ehrmagerd, the threading!!! The threading!!!!!

    Oh my God, the efficiency of my code just went by num_threads!!!! Oh my God, am I excited. And yes, boost library is totally weird but it feels super advanced and it has a really sexy threadpool which is literally exactly how I'm planning on multithreading my code because wait conditions? Yeah, who needs those in this case. Not me.

    Thank you so so so so so much.

    P.S.

    Can you tell I'm excited? :P

  13. #13
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Good, now let's switch to lambdas. They're superior to boost::bind/std::bind and will become even more powerful in the upcoming C++14 standard. They will far surpass std::bind/boost::bind, so get used to them now!
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  14. #14
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Okay, so I'm trying to switch over to use lambdas in lieu of binds but I'm having some technical difficultah. Namely, can I condense this code?
    Code:
    /* Create and use a lambda function to count the number of particles contained by the root */
    
    
       int inside = 0, surface = 0, outside = 0;
    
    
       auto counter = [](int *inside, int *surface, int *outside, vector<particle> parts, tetra *root) {
    
    
          int x = 0;
    
    
          for (auto it = parts.begin(); it < parts.end(); ++it) {
    
    
             x = inside_tetra(root, &(*it));
    
    
             if (x == 0) {
    
    
                ins.lock();
                ++(*inside);
                ins.unlock();
             }
    
    
             if (x == -1) {
    
    
                out.lock();
                ++(*outside);
                out.unlock();
             }
    
    
             if (x > 0) {
    
    
                surf.lock();
                ++(*surface);
                surf.unlock();
             }
          }
    
    
          return;
       };
    
    
       auto f = [&](unsigned i) { return counter(&inside, &surface, &outside, particle_sets[i], root_tetra); };
    
    
       for (unsigned i =0; i < num_threads; ++i)
          tp.schedule([=]() { return f(i); } );
         //tp.schedule(boost::bind(*counter, &inside, &surface, &outside, particle_sets[i], root_tetra));
    
    
       tp.wait();
    
    
       cout << "\nThere are " << inside << " points inside the root" << endl;
       cout << "There are " << surface << " points on the surface of the root" << endl;
       cout << "There are " << outside << " points outside the root" << endl;
       cout << inside + surface << " total contained points" << endl;
    I think I wind up having to use like 3 lambdas so...

    Idk, I kept getting a bunch of weird compilation errors but this works so I'm happy with something that at least works but a more elegant solution would be much appreciated

  15. #15
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    >>ins.lock();
    Don't lock your mutexes manually. Use a lock class, such as std::unique_lock.

    Your if statements are mutually exclusive. You could use if, else if, else.

    >>for (auto it = parts.begin(); it < parts.end(); ++it)
    You could reduce that to
    for (auto & elem : parts)

    if it is sufficient to just pass a reference to the element or a pointer instead of an iterator to inside_tetra.

    Code:
    auto f = [&](unsigned i) { return counter(&inside, &surface, &outside, particle_sets[i], root_tetra); };
    tp.schedule([=]() { return f(i); } );
    You can merge these. You probably won't notice any difference anyway.
    Code:
    tp.schedule([=]() { return counter(&inside, &surface, &outside, particle_sets[i], root_tetra); };
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. How to scan in multiple values from a text file
    By GenghisAhn in forum C Programming
    Replies: 12
    Last Post: 10-30-2012, 02:52 AM
  2. Using/ Searching through Multiple Vectors
    By bengreenwood in forum C++ Programming
    Replies: 10
    Last Post: 08-03-2009, 08:35 AM
  3. Multiple vectors
    By dpp in forum C++ Programming
    Replies: 5
    Last Post: 01-16-2009, 06:20 PM
  4. How to handle multiple cases which use common code?
    By tmaxx in forum C Programming
    Replies: 3
    Last Post: 10-03-2008, 07:42 AM
  5. scan multiple items in multidimensional array
    By requiem in forum C Programming
    Replies: 1
    Last Post: 04-17-2003, 03:02 PM