Thread: Enlightening lessons in programming

  1. #1
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413

    Enlightening lessons in programming

    I think I've mentioned before on here that I did some bandwidth monitoring scripts in Python for work. Well, I've been learning C++ lately and I rewrote a critical part of the Python code in C++ to see how much faster it would be.

    I ran both codes and have the following results:
    Python: 59 seconds
    C++: 34 seconds

    These programs process very large files (tables of data from Wireshark dumps), and this is just a huge reminder of how something can be more IO bound than anything else. Honestly I thought the C++ version would be incredibly faster, but alas...

    Here are the codes...nothing critical here of concern to my employer:

    Code:
    bandwidth = {}
    fin = open('20110803.txt', 'r')
    line = fin.readline()
    while line != '':
    	row = line.split('\t')
    	bytes = int(row[6])
    	if row[1] in bandwidth:
    		bandwidth[row[1]] += bytes
    	else:
    		bandwidth[row[1]] = bytes
    	if row[3] in bandwidth:
    		bandwidth[row[3]] += bytes
    	else:
    		bandwidth[row[3]] = bytes
    	line = fin.readline()
    fin.close()
    Code:
    #include <fstream>
    #include <string>
    #include <unordered_map>
    #include <cstdlib>
    
    int main(void)
    {
    	std::ifstream infile;
    	std::unordered_map<std::string, int> bandwidth;
    	std::string line, ip1, ip2;
    	size_t pos1, pos2;
    	int bytes;
    
    	infile.open("20110803.txt");
    	while (std::getline(infile, line))
    	{
    		pos1 = line.find('\t');
    		pos2 = line.find('\t', pos1 + 1);
    		ip1 = line.substr(pos1 + 1, pos2 - pos1 - 1);
    		pos1 = line.find('\t', pos2 + 1);
    		pos2 = line.find('\t', pos1 + 1);
    		ip2 = line.substr(pos1 + 1, pos2 - pos1 - 1);
    		pos1 = line.find('\t', pos1 + 1);
    		pos1 = line.find('\t', pos1 + 1);
    		pos1 = line.find('\t', pos1 + 1);
    		pos2 = line.find('\t', pos1 + 1);
    		bytes = atoi(line.substr(pos1 + 1, pos2 - pos1 - 1).c_str());
    
    		if (bandwidth.find(ip1) != bandwidth.end())
    			bandwidth[ip1] += bytes;
    		else
    			bandwidth[ip1] = bytes;
    		if (bandwidth.find(ip2) != bandwidth.end())
    			bandwidth[ip2] += bytes;
    		else
    			bandwidth[ip2] = bytes;
    	}
    	infile.close();
    
    	return 0;
    }
    Suggestions/improvements are welcome, especially for the C++ part.

  2. #2
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    I would have just made a list out of the split string like you did in python instead of searching. And I would use the iterator that bandwidth.find returns instead of the lookup operator. The lookup operator just calls find anyway.
    Last edited by whiteflags; 01-09-2012 at 04:57 PM.

  3. #3
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413
    So basically I'm doing find twice. Lovely. Help me out with the syntax a bit, is this right? This is all so new to me.

    Code:
    std::unordered_map<std::string, int>::iterator i;
    ...
    i = bandwidth.find(ip1);
    if (i != bandwidth.end())
         i->second += bytes;
    else
         bandwidth[ip1] = bytes;
    Not really finding a straight answer on that one.

    As far as splitting the line into a list, I'm seeing a lot of different ways to do that, which one were you thinking of specifically, if you don't mind? Thanks for the help.

  4. #4
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,613
    I don't think it makes a difference exactly how you want to split, but I'm aware that string::find is a naive algorithm. Since I prefer to solve problems without searching if I can, I would actually probably do something like this:

    Code:
    while ( getline( infile, line ) )
    {
       vector<string> rows;
       string row;
       istringstream instr ( line );
       while ( getline(instr, row, '\t') ) {
          rows.push_back( row );
       }
       // and so forth...
    }
    Now for the case where bandwidth.find fails, you would just call insert:
    Code:
     typedef unordered_map<string, int> myMap;
    bandwisth.insert( myMap::value_type( ip1, bytes ));
    which returns an iterator and a bool in a std::pair.
    Last edited by whiteflags; 01-09-2012 at 06:52 PM.

  5. #5
    'Allo, 'Allo, Allo
    Join Date
    Apr 2008
    Posts
    639
    Quote Originally Posted by whiteflags View Post
    I don't think it makes a difference exactly how you want to split, but I'm aware that string::find is a naive algorithm. Since I prefer to solve problems without searching if I can, I would actually probably do something like this:
    I don't think that'll be any better, the tab character check just moves from your code to the stringstreams. All you're potentially going to save vs. using find is:
    an increment (to move to the next character in the string argument to find)
    a jump (to the start of a while loop)
    a check against zero (to see if the terminating null has been hit)
    another jump (out of the loop)

    On the stringstream/stringbuf side you have an allocation to copy the string. I'd wager the cost of that outshines the 9 calls to find. That's not to mention the additional allocation for the vector's buffer.

    The entire bit with bandwith's inserts can be reduced to
    Code:
    bandwidth[ip1] += bytes;
    bandwidth[ip2] += bytes;
    Since operator[] inserts a new entry if one doesn't exist

    I'd also declare the variables closer to where they're first used, but that's a stylistic choice rather than a technical one.

  6. #6
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Concerning the Python code: from what I have read (but not tested myself ), if you expect that the key will be found in the dictionary more often than not, then consider changing:
    Code:
    if row[1] in bandwidth:
        bandwidth[row[1]] += bytes
    else:
        bandwidth[row[1]] = bytes
    to:
    Code:
    try:
        bandwidth[row[1]] += bytes
    except KeyError:
        bandwidth[row[1]] = bytes
    The idea being that most of the cost of the exception handling is only paid when the exception is thrown. Of course, if it turns out that you are inserting more than updating, or the cost of catching the exception and then inserting is too high, then you'll end up with worse performance, heh.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  7. #7
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413
    Quote Originally Posted by adeyblue View Post
    The entire bit with bandwith's inserts can be reduced to
    Code:
    bandwidth[ip1] += bytes;
    bandwidth[ip2] += bytes;
    Since operator[] inserts a new entry if one doesn't exist

    I'd also declare the variables closer to where they're first used, but that's a stylistic choice rather than a technical one.
    Well hell, that makes it all much simpler. Porting from Python obviously, and it would throw a KeyError as laserlight shows if the key doesn't already exist.

  8. #8
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413
    Thanks everyone for the criticism/suggestions!

    Edit: Yep, dict[key] += value surely works! Makes all too much sense...
    Last edited by Epy; 01-09-2012 at 08:26 PM.

  9. #9
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Well, in the interest of making the original Python example more "Pythonic", what happens if you change the while loop to a for loop and ask for forgiveness instead of permission?
    Code:
    bandwidth = {}
    with open('20110803.txt', 'r') as fin:
        for line in fin:
            row = line.split('\t')
            bytes = int(row[6])
            try:
                bandwidth[row[1]] += bytes
            except KeyError:
                bandwidth[row[1]] = bytes
            try:
                bandwidth[row[3]] += bytes
            except KeyError:
                bandwidth[row[3]] = bytes
    How does the performance compare?
    Last edited by laserlight; 01-09-2012 at 08:51 PM. Reason: for -> to
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  10. #10
    Registered User
    Join Date
    Jan 2009
    Posts
    1,485
    How large is the file? You may be able to speed it up significantly by reading in the entire file in memory (or in large chunks if it's huge), then work on the memory buffer instead of reading one line at a time from the disk.

  11. #11
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413
    That file in particular is 1.5 GB. The files are anywhere from 0.8-2 GB. Going to try all these suggestions tomorrow

    Edit: The chunks idea is probably a good one, it's pretty wasteful to just get one line at a time. Maybe chunks of 50 MB or so at a time.
    Last edited by Epy; 01-09-2012 at 09:15 PM.

  12. #12
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    As a matter of curiosity, can you post the current code as well as a sample file?

    Soma

  13. #13
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Quote Originally Posted by Epy
    The chunks idea is probably a good one, it's pretty wasteful to just get one line at a time.
    Yeah, besides it being more "Pythonic", another reason why I suggested the use of the for loop over the while loop is that the Python docs state that there would be a hidden read-ahead buffer.

    Incidentally, a matter of style on the C++ side: instead of writing:
    Code:
    std::ifstream infile;
    infile.open("20110803.txt");
    prefer:
    Code:
    std::ifstream infile("20110803.txt");
    based on the rule of thumb that variables should be declared near first use.

    Also, you don't need infile.close() since control returns from main soon after that, though it does not hurt to have it there. Oh, and there's no need to specify void in the parameter list to denote a function that has no parameters.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  14. #14
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,660
    > I ran both codes and have the following results:
    > Python: 59 seconds
    > C++: 34 seconds
    How many days have you spent so far on the problem, to save 25 seconds?

    I mean, 25 seconds every day is only 2.5 hours per year.
    If you've already spend a day of your time on it, then it's 4 years before your employer gets a return on that investment.

    Who runs the program, and how often?
    - batch file overnight - who cares how long it takes, so long as it's done by morning.
    - 1000's of interactive users running it every 5 minutes - now you might need to do something.

    Having realised that the problem is I/O bound, there isn't anything you can do in s/w that it going to turn 1-minute Python into 1-second C++.

    You've already done the biggest improvement you're going to make. Anything else will take a longer and longer to do, before you reach the point of spending weeks of effort to save microseconds of time.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  15. #15
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413
    This is just a learning exercise for me. I wanted to see how to properly implement that section in C++ and was very (personally) curious to see how much faster it would be. I'm fully aware of the trade-offs between execution time and development time, and that's why most of my scripts for work are written in Python, which is apparently plenty fast. This is also to prove to my boss, who seems to think that these things run so slow to begin with, that they are not that slow at all.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. C Lessons
    By vurdlak in forum C Programming
    Replies: 8
    Last Post: 02-18-2006, 05:49 PM
  2. enlightening discussion...
    By no-one in forum A Brief History of Cprogramming.com
    Replies: 48
    Last Post: 04-24-2004, 03:27 PM
  3. c lessons on the computer?
    By Unregistered in forum C Programming
    Replies: 2
    Last Post: 07-03-2002, 12:22 PM
  4. english lessons
    By Driveway in forum A Brief History of Cprogramming.com
    Replies: 14
    Last Post: 06-30-2002, 09:38 PM