Thread: Distinct doubles

  1. #1
    Registered User
    Join Date
    Mar 2007
    Posts
    25

    Distinct doubles

    Hello everyone.

    I found this block of code on the web. It reads text from standard input and returns the total number of distinct words found in the text. It performs much faster than if I were to use the SLT set<std::string>.

    Code:
    #include <unistd.h>
    #include <vector>
    #include <cctype>
    #include <stdint.h>
    #include <iostream>
    
    const size_t pagesize = 4096;
    
    struct knot_t {
      knot_t() : exists(false), childcount(0), children(0) {  }
      knot_t* child(unsigned c) {
        // Look for a matching child
        if (children) {
          unsigned i = childcount / 2;
          unsigned leftpos = 0;
          unsigned rightpos = childcount;
          while (children[i].letter != uint8_t(c)) {
            if (uint8_t(c) < children[i].letter) {
              rightpos = i;
              i = (leftpos + i) / 2;
            } else {
              leftpos = i+1;
              i = (rightpos + i) / 2;
            }
            if (rightpos == leftpos)
              goto nomatch;
          }
          return children + i;
        }
      nomatch:
        // Create matching child - insert in order
        knot_t *nary = new knot_t[childcount + 1];
        knot_t *destp = nary;
        knot_t *sourcep = children;
        knot_t *endsource = children + childcount;
        while (sourcep != endsource && sourcep->letter < c)
          *destp++ = *sourcep++;
        knot_t *outp = destp++;
        while (sourcep != endsource)
          *destp++ = *sourcep++;
    
        delete[] children;
        children = nary;
        outp->letter = c;
        childcount++;
        return outp;
      }
      unsigned distinctcount() const {
        unsigned total = exists ? 1 : 0;
        for (unsigned i = 0; i != childcount; ++i)
          total += children[i].distinctcount();
        return total;
      }
      bool exists;
    private:
      uint8_t letter;
      unsigned short childcount;
      knot_t *children;
    };
    
    
    int main(int argc, char **argv)
    {
      knot_t base;
      char buffer[pagesize];
      knot_t *curnot = 0;
    
      while (ssize_t r = read(0, buffer, pagesize)) {
        if (r <= 0)
          break;
        for (unsigned bufpos = 0; bufpos != unsigned(r); ++bufpos) {
          uint8_t c = buffer[bufpos];
          if (isspace(c)) {
            if (!curnot)
              continue;
            curnot->exists = true;
            curnot = 0;
            continue;
          }
          if (!curnot) {
            curnot = base.child(c);
            continue;
          }
          curnot = curnot->child(c);
        }
      }
      if (curnot)
        curnot->exists = true;
    
      std::cout << "Distinct: " << base.distinctcount() << std::endl;
      return 0;
    }
    Now, what I need is to read real numbers (doubles) and return the total number of distinct numbers in the input. I modified the above code and I'm getting the correct answer as I should. Can someone tell me what else my code needs? I'm not sure what algorithm is used in the above, but does anyone know? Is there a faster approach to my problem?

    Here is my modified version of the above code so that it works on doubles instead of strings.
    Code:
    #include <unistd.h>
    #include <vector>
    #include <cctype>
    #include <stdint.h>
    #include <iostream>
    using namespace std;
    
    //const size_t pagesize = 4096;
    
    struct knot_t {
      knot_t() : exists(false), childcount(0), children(0) {  }
      knot_t* child(double c) {
        // Look for a matching child
        if (children) {
          unsigned i = childcount / 2;
          unsigned leftpos = 0;
          unsigned rightpos = childcount;
          while( children[i].letter != c ) {
            if (c < children[i].letter) {
              rightpos = i;
              i = (leftpos + i) / 2;
            } else {
              leftpos = i+1;
              i = (rightpos + i) / 2;
            }
            if (rightpos == leftpos)
              goto nomatch;
          }
          return children + i;
        }
      nomatch:
        // Create matching child - insert in order
        knot_t *nary = new knot_t[childcount + 1];
        knot_t *destp = nary;
        knot_t *sourcep = children;
        knot_t *endsource = children + childcount;
        while (sourcep != endsource && sourcep->letter < c)
          *destp++ = *sourcep++;
        knot_t *outp = destp++;
        while (sourcep != endsource)
          *destp++ = *sourcep++;
    
        delete[] children;
        children = nary;
        outp->letter = c;
        childcount++;
        return outp;
      }
      unsigned distinctcount() const {
        unsigned total = exists ? 1 : 0;
        for (unsigned i = 0; i != childcount; ++i)
          total += children[i].distinctcount();
        return total;
      }
      bool exists;
    private:
      double letter;
      unsigned short childcount;
      knot_t *children;
    };
    
    
    int main(int argc, char **argv)
    {
      knot_t base;
      //char buffer[pagesize];
      knot_t *curnot = 0;
    
    //generate large number of random doubles for input 
      for(int i = 0; i < 100000000; i++){
        double c = (random() & 0xf) + ((random() & 0xfff) / (double)0xff);
        if( !curnot ){
          curnot = base.child(c);
          continue;
        }
        curnot->exists = true;
        curnot = 0;
        continue;
      }
      cout << base.distinctcount() << endl;
    
    
    
      return 0;
    }

  2. #2
    Registered User
    Join Date
    Mar 2007
    Posts
    25
    ok, I was wrong. The answer I get isn't correct. It's off by exactly 50%. Anyone know why that is?

  3. #3
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    You're using some random code that you don't understand, and you've made some changes you don't really understand in an attempt to get it to do what you want. There's plenty of scope for something to go wrong in that. Given all that, and the fact the code isn't completely trivial, you've probably passed the threshold where people are willing to help.

    Particularly as your problem (finding number of unique values in a set of input values) can be solved with only a few lines of code - by simply using capability of the standard library.

  4. #4
    Registered User
    Join Date
    Mar 2007
    Posts
    25
    No, it's you who has passed the threshold where you aren't willing to help, and that's simply because you don't understand the code and you rather crap on other peoples post than be helpful.

    I don't come on these forums very often, unless I really have no other choice, and that's simply because there are too many people like you with too big of an ego whose expertise is to state the obvious.

    I know my problem can be solved with a few lines of code using the STL, and that's what I said above. But it's too slow for what I want to do.

    Here it is:
    Code:
    double n;
    std::set<double> nCount;
    while (std::cin >> n) nCount.insert(n);
    std::cout << nCount.size() << std::endl;
    I decided to look around and find something that does something similar to what I need but faster, and try to learn from it as well.

    You are right, I don't understand the code I posted, and I never claimed to know how it does what it does either. I wouldn't be posting here if I did.

    When someone is asking for help or having a problem do your best to help them. If you can't help them because of your own shortcomings then there is no reason to bite their head off.

  5. #5
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by ArlexBee-871RBO
    I know my problem can be solved with a few lines of code using the STL, and that's what I said above. But it's too slow for what I want to do.
    Before rejecting the use of the standard library, I would urge you to consider where the bottle neck might lie. For example, could it be that the difference that you observe is due to slower formatted input with operator>> versus the low level call to read()? Have you checked that the difference is still significant with doubles instead of strings?

    If you have determined that difference in I/O is not significant, then maybe consider another approach: read everything into a std::vector<double>, then apply std::unique() (of course, you do not need to use the erase() member function). Or if you have access to an implementation of TR1, consider using an unordered container (i.e., a hash table).

    If it is the formatted I/O that is the bottle neck, perhaps merely switching to low level or at least C standard I/O would be enough, so you can just use std::set<double>.

    Other things to check: did you enable optimisations? Are you using a Microsoft compiler? If you are, then define _SECURE_SCL to be 0 so as to avoid checked iterators. It might not make a difference when using std::set, but it is worth a try, and it will probably make a difference should you use std::vector<double> with std::unique().

    Quote Originally Posted by ArlexBee-871RBO
    ok, I was wrong. The answer I get isn't correct. It's off by exactly 50%. Anyone know why that is?
    Just a guess, but it might be due to floating point inaccuracy.
    Last edited by laserlight; 03-01-2009 at 05:39 AM.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  6. #6
    Registered User
    Join Date
    Mar 2007
    Posts
    25
    Quote Originally Posted by laserlight View Post
    Before rejecting the use of the standard library, I would urge you to consider where the bottle neck might lie. For example, could it be that the difference that you observe is due to slower formatted input with operator>> versus the low level call to read()? Have you checked that the difference is still significant with doubles instead of strings?
    The bottle neck is the STL set container. You are correct as far as operator>> being slow, and that functions like read() and scanf() are much faster, but in my case that won't matter because I won't be using them once I'm ready to implement the algorithm into my project.

    Quote Originally Posted by laserlight View Post
    If you have determined that difference in I/O is not significant, then maybe consider another approach: read everything into a std::vector<double>, then apply std::unique() (of course, you do not need to use the erase() member function). Or if you have access to an implementation of TR1, consider using an unordered container (i.e., a hash table).

    If it is the formatted I/O that is the bottle neck, perhaps merely switching to low level or at least C standard I/O would be enough, so you can just use std::set<double>.
    std::unique() won't work for me in this case. I'm using GCC and there is no problem with optimizations or anything like that.

    All I need right now is to understand that block of code that I posted in the beginning. Speed is top priority and, besides calculating total number of distinct doubles, later on I'll need to add the capability of finding the number of copies of each unique double and the value of that double. std::set<> will only give me the number of distinct doubles.

    I have a hash version that is twice as fast as the std::set<>. The above code is twice as fast as my hash version, which means it's about four times faster than std::set.

    If I figure out what that block of code is doing I'll post it here. I need some sleep now.

    Peace!

  7. #7
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by ArlexBee-871RBO
    std::unique() won't work for me in this case.
    That is true, but with your original information, it will, since the aim is only to find the number of distinct elements. After applying std::sort() and std::unique(), the number of distinct elements must be equal to the size of the unique range. Josuttis has this very example in The C++ Standard Library, though he prints the unique elements instead of counting them. In his test, he observed that the vector version was 10% faster (for his system, since it is implementation dependent), which unfortunately is not good enough an improvement.

    That said...
    Quote Originally Posted by ArlexBee-871RBO
    Speed is top priority and, besides calculating total number of distinct doubles, later on I'll need to add the capability of finding the number of copies of each unique double and the value of that double. std::set<> will only give me the number of distinct doubles.
    Then std::map<double, std::size_t> would be more appropriate as a base bench mark.

    Looking at the code, I have my doubts if you will be able to adapt it for double. It appears to be using the very property of strings being strings of characters to organise the strings into a tree structure. On the other hand, you could treat the doubles as numeric strings.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  8. #8
    The larch
    Join Date
    May 2006
    Posts
    3,573
    There would seem to be a problem using vector, sort and unique: the data set appears to be immense, but there seems to be only so many distinct doubles. It might be hard to beat a specialized algorithm for such particular conditions, although it might be possible to use STL to implement parts of the algorithm.

    Perhaps it might be cheaper to actually count (or compute) how many distinct doubles can be generated this way, since if there's significantly less than 100000000, surely they would all turn up at least once?
    I might be wrong.

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

  9. #9
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    I do think somebody's ego is getting in the way, and it's not grumpy's. The only thing worse than a whiner is a whiner who think's he's smarter than everyone else and "just this once" has to stoop down and ask somebody else a question.

    (An algorithm to do this quickly is rather obvious to me -- good luck finding it yourself)
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  10. #10
    Registered User
    Join Date
    Mar 2007
    Posts
    25
    Quote Originally Posted by brewbuck View Post
    I do think somebody's ego is getting in the way, and it's not grumpy's. The only thing worse than a whiner is a whiner who think's he's smarter than everyone else and "just this once" has to stoop down and ask somebody else a question.

    (An algorithm to do this quickly is rather obvious to me -- good luck finding it yourself)
    I never claimed to be smarter than anybody else. In fact, I love to be proven wrong because that's when I learn, and that's when I'm elevated to a new level of understanding. You, too, seem like one of those people who has an attitude problem. So keep your algorithm to yourself because it defines who you are and makes you more elite.

    What's amazing to me is that no one has been able to tell me anything about the original algorithm that I posted. That was my main concern. Anyway, I did some digging and googling, and finally I have some rough ideas as to what the algorithm is doing.

    It uses an approximation of a digital trie. It's extremely memory hungry, but allows you to find a word with k characters in a text of arbitrary length in O(k) time. That means the complexity is constant over the size of the input and linear over the length of the word being searched for.

    I will need to study it more closely to figure out how to approximate the behavior to cut down on the memory requirements. Since I'll be working with doubles I have two choices: work with double-sized bit-patterns, or work with somewhat distinct doubles. I'm going to go with the first one just for now because that's simpler, then see where things go.

    @ laserlight
    Thanks for all the suggestions.

    @ anon
    You are correct, there are significantly fewer distinct doubles in a pool of 100,000,000 real numbers, but that depends on the range. And I can't just compute how many distinct doubles can be generated because I've already generated a list of doubles and I need to draw them on screen. I need the distinct count to calculate the depth at each point.

  11. #11
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    I can't help but agree with grumpy and brewbuck.

    To me this just seems a case of "I know what I'm doing.".

    I can't imagine that even a simple approach would compare to the parsing and IO parts.

    Soma

  12. #12
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Quote Originally Posted by ArlexBee-871RBO View Post
    No, it's you who has passed the threshold where you aren't willing to help, and that's simply because you don't understand the code and you rather crap on other peoples post than be helpful.
    Taking code you don't understand, doing modifications that you don't understand, posting that code with few comments or explanation, and then whinging when you don't get help is the act of a spoilt child.

    I did a quick google search for "knot_t" and turned up (as the third hit) this page that explains what your original code is: a red-black tree and that it is optimised around an 8-bit values (which you have changed to 8-byte values, assuming a double is 8 bytes). I'm guessing, on a quick look, that little optimisation is probably a design assumption in the code (which you have flatly violated by changing the data to be of type double) and might explain why your results are "off by exactly 50%".
    Quote Originally Posted by ArlexBee-871RBO View Post
    I know my problem can be solved with a few lines of code using the STL, and that's what I said above. But it's too slow for what I want to do.
    No. You said it is too slow using a std::set<> ..... a container which is designed with a particular set of trade-offs, with speed rarely among them. And, on that basis, you disregard the whole STL.

    Other containers are designed with specific performance trade-offs in mind, and are potentially suitable for what you are trying to do. And it would take little time to test them: after all the code to insert elements into a vector, sort the vector, and then use std::unique is not exactly complicated.

    But, no. We are the ones who are being unhelpful because we point out simple and practical alternatives.

    Quote Originally Posted by ArlexBee-871RBO View Post
    When someone is asking for help or having a problem do your best to help them. If you can't help them because of your own shortcomings then there is no reason to bite their head off.
    I would suggest you address your own shortcomings first. If you want help, it is your repsonsibility to seek help in a manner that doesn't demand more effort from others than you are willing to expend yourself. Posting code with no comments or descriptions of the algorithm, and asking for an explanation, demands a lot of others but little of yourself.
    Last edited by grumpy; 03-02-2009 at 03:20 AM.
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  13. #13
    Registered User
    Join Date
    Mar 2007
    Posts
    25
    Quote Originally Posted by grumpy View Post
    Taking code you don't understand, doing modifications that you don't understand, posting that code with few comments or explanation, and then whinging when you don't get help is the act of a spoilt child.

    I did a quick google search for "knot_t" and turned up (as the third hit) this page that explains what your original code is: a red-black tree and that it is optimised around an 8-bit values (which you have changed to 8-byte values, assuming a double is 8 bytes). I'm guessing, on a quick look, that little optimisation is probably a design assumption in the code (which you have flatly violated by changing the data to be of type double) and might explain why your results are "off by exactly 50%".
    No, it's not a Red-Black tree; a Red-Black tree is basically a binary tree. Rather, my original code is a digital search tries. Again, that proves you didn't understand the code even after finding that page. I guess it's easier to be a troll.

    And no, it's not "optimized around an 8-bit value". That's by default because if you are going to use a DST for strings you will use a byte to store each character. Changing it to something larger to store a different type of data would only increase memory usage, and my results being off by 50% had nothing to do with it. Nice try though.

    I now have a much better understanding of how it works. One way to use the algorithm for doubles would be to consider a double to be an 8 character string and use the algorithm directly. Since doubles are of equal length the code actually runs in linear time over the size of the input. Does it get any better than that?
    Last edited by ArlexBee-871RBO; 03-02-2009 at 06:21 AM.

  14. #14
    Registered User
    Join Date
    Mar 2007
    Posts
    25
    Quote Originally Posted by grumpy View Post
    No. You said it is too slow using a std::set<> ..... a container which is designed with a particular set of trade-offs, with speed rarely among them. And, on that basis, you disregard the whole STL.

    Other containers are designed with specific performance trade-offs in mind, and are potentially suitable for what you are trying to do. And it would take little time to test them: after all the code to insert elements into a vector, sort the vector, and then use std::unique is not exactly complicated.
    std::set<> is slow. It's a binary tree, and there are faster ways to do what it does. It even shows how that is on that page that you found, but I guess you didn't even bother reading it.

    Besides being slow, I also gave other reasons why I won't be able to use std::set<>. Besides finding distinct doubles, I need to have a count of each distinct double that was fed into the algorithm and the value of each double. std::set<> won't give me that. Using std::vector<>, sort() and unique() won't work either. I might get a distinct count relatively fast this way, but then going back trying to calculate the number of copies of each distinct double would slow everything down.

    I can do all these in one step using DST. Heck, even a hash version will do it. That's because I can modify the code to add those capabilities. I don't know how I would add those capabilities using the STL without slowing everything down. I have a working hash version, but I want to try to get DST to work because of its speed.

    Peace!

  15. #15
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by ArlexBee-871RBO
    What's amazing to me is that no one has been able to tell me anything about the original algorithm that I posted.
    I did, by stating that "it appears to be using the very property of strings being strings of characters to organise the strings into a tree structure". A trie is "a tree for storing strings in which there is one node for every common prefix". I even suggested what you now conclude as "one way to use the algorithm for doubles would be to consider a double to be an 8 character string and use the algorithm directly".

    Quote Originally Posted by ArlexBee-871RBO
    Besides being slow, I also gave other reasons why I won't be able to use std::set<>. Besides finding distinct doubles, I need to have a count of each distinct double that was fed into the algorithm and the value of each double. std::set<> won't give me that.
    As I noted earlier, the correct standard container for your purposes would be std::map, not std::set.

    Quote Originally Posted by ArlexBee-871RBO
    Using std::vector<>, sort() and unique() won't work either. I might get a distinct count relatively fast this way, but then going back trying to calculate the number of copies of each distinct double would slow everything down.
    If you use std::vector, then you should only use std::sort() and not std::unique(). I gave you the std::unique() suggestion because you only mentioned counting the number of distinct elements. I do not think that there is a standard algorithm to perform the count of the distinct groups, but it does only take a single pass after the range is sorted. Ultimately, the time complexity will be the same as for std::map, so if you need something asymptotically faster, neither will suffice.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Outputting a list of squared Doubles
    By dnguyen1022 in forum C++ Programming
    Replies: 13
    Last Post: 01-19-2009, 01:24 PM
  2. parse doubles from socket read?
    By willy in forum C Programming
    Replies: 4
    Last Post: 05-28-2008, 05:32 AM
  3. How many distinct values?
    By George2 in forum Tech Board
    Replies: 2
    Last Post: 05-03-2008, 09:08 AM
  4. Passing array of doubles to ATL COM dll
    By wakeup in forum C++ Programming
    Replies: 3
    Last Post: 04-11-2006, 02:53 AM
  5. ints and doubles problem
    By iLLiCiT in forum C Programming
    Replies: 1
    Last Post: 12-04-2004, 03:42 PM