Thread: Enlightening lessons in programming

  1. #16
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413
    Simply amazing!

    New times:
    C++: 24 seconds
    Python (original): 60.7 seconds
    Python (1st improv.): 48.4 seconds
    Python (2nd improv.): 45.5 seconds

    C++ code:
    Code:
    #include <fstream>
    #include <string>
    #include <unordered_map>
    #include <cstdlib>
    
    int main()
    {
    	std::ifstream infile("20110803.txt");
    	std::unordered_map<std::string, unsigned int> bandwidth;
    	std::string line, ip1, ip2;
    	size_t pos1, pos2;
    	unsigned int bytes;
    
    	while (std::getline(infile, line))
    	{
    		pos1 = line.find('\t');
    		pos2 = line.find('\t', pos1 + 1);
    		ip1 = line.substr(pos1 + 1, pos2 - pos1 - 1);
    		pos1 = line.find('\t', pos2 + 1);
    		pos2 = line.find('\t', pos1 + 1);
    		ip2 = line.substr(pos1 + 1, pos2 - pos1 - 1);
    		pos1 = line.find('\t', pos1 + 1);
    		pos1 = line.find('\t', pos1 + 1);
    		pos1 = line.find('\t', pos1 + 1);
    		pos2 = line.find('\t', pos1 + 1);
    		bytes = atoi(line.substr(pos1 + 1, pos2 - pos1 - 1).c_str());
    
    		bandwidth[ip1] += bytes;
    		bandwidth[ip2] += bytes;
    	}
    	infile.close();
    
    	return 0;
    }
    Python codes:
    Code:
    import time
    
    t1 = time.time()
    bandwidth = {}
    fin = open('20110803.txt', 'r')
    line = fin.readline()
    while line != '':
    	row = line.split('\t')
    	bytes = int(row[6])
    	if row[1] in bandwidth:
    		bandwidth[row[1]] += bytes
    	else:
    		bandwidth[row[1]] = bytes
    	if row[3] in bandwidth:
    		bandwidth[row[3]] += bytes
    	else:
    		bandwidth[row[3]] = bytes
    	line = fin.readline()
    fin.close()
    dt = time.time() - t1
    print(('%.1f' % dt) + ' seconds\n')
    
    t1 = time.time()
    bandwidth = {}
    with open('20110803.txt', 'r') as fin:
    	for line in fin:
    		row = line.split('\t')
    		bytes = int(row[6])
    		if row[1] in bandwidth:
    			bandwidth[row[1]] += bytes
    		else:
    			bandwidth[row[1]] = bytes
    		if row[3] in bandwidth:
    			bandwidth[row[3]] += bytes
    		else:
    			bandwidth[row[3]] = bytes
    fin.close()
    dt = time.time() - t1
    print(('%.1f' % dt) + ' seconds\n')
    
    t1 = time.time()
    bandwidth = {}
    with open('20110803.txt', 'r') as fin:
    	for line in fin:
    		row = line.split('\t')
    		bytes = int(row[6])
    		try:
    			bandwidth[row[1]] += bytes
    		except KeyError:
    			bandwidth[row[1]] = bytes
    		try:
    			bandwidth[row[3]] += bytes
    		except KeyError:
    			bandwidth[row[3]] = bytes
    fin.close()
    dt = time.time() - t1
    print(('%.1f' % dt) + ' seconds\n')
    Example text:
    Code:
    25201	72.247.133.186	443	192.168.3.62	52314	SSL	1514	Continuation Data
    25201	72.247.133.186	443	192.168.3.62	52314	SSL	1514	Continuation Data
    25201	72.247.133.186	443	192.168.3.62	52314	SSL	1514	Continuation Data
    25201	192.168.5.7	54164	192.168.3.181	445	SMB	124	Write AndX Request, FID: 0xc010, 2 bytes at offset 1425610
    25201	192.168.3.7	445	192.168.4.7	49495	SMB	105	Write AndX Response, 2 bytes
    25201	72.247.133.186	443	192.168.3.62	52314	SSL	1514	Continuation Data
    25201	72.247.133.186	443	192.168.3.62	52314	SSL	507	Continuation Data
    25201	192.168.3.7	445	192.168.4.7	49495	SMB	105	Write AndX Response, 86 bytes
    25201	192.168.62.2	65200	192.168.3.124	4089	DCERPC	106	Response: call_id: 3153182 Fragment: Single ctx_id: 1
    25201	192.168.3.7	445	192.168.4.7	49495	SMB	105	Write AndX Response, 40 bytes
    25202	192.168.9.7	2322	192.168.3.7	445	SMB	124	Write AndX Request, FID: 0x000f, 2 bytes at offset 906509

  2. #17
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413
    So the next question, how would one implement reading ahead in C++? Would that be a multi-threaded application where one thread is processing and the other thread is reading from the file? If so, I'll have to wait a bit. I still haven't really done a multi-threaded application in C yet.

    I don't suppose changing the size of the file buffer would help, would it? I tried the following:
    Code:
    std::ifstream infile("20110803.txt");
    char *buf = new char[50 * 1024 * 1024];
    infile.rdbuf()->pubsetbuf(buf, 50 * 1024 * 1024);
    but this did not change the execution time, i.e. I think g++ threw out these changes probably when optimizing. Either that or it's wrong. In any case, the executable was only using 3 MB of RAM, so it's definitely not using this 50 MB buffer.
    Last edited by Epy; 01-10-2012 at 07:17 AM.

  3. #18
    Lurking whiteflags's Avatar
    Join Date
    Apr 2006
    Location
    United States
    Posts
    9,612
    Actually, call istream::readsome
    Last edited by whiteflags; 01-10-2012 at 07:27 AM.

  4. #19
    Registered User
    Join Date
    May 2010
    Posts
    4,632
    Please note the ifstream.get() also stops processing at the new line character. If you want to read a large amount of data you may have to use the read() function.

    Jim

  5. #20
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    OK, so first, stop using C conventions in c++. There is absolutely no need to declare your variables at the top of the function. Declare them as they are used!
    Don't use atoi. If you insist on C, use strtol; if you want to go the C++ route, use string streams, or better yet, boost::lexical_cast.
    Get rid of infile.close(). The destructor takes care of that.
    Now, get rid of new and use std::vector instead.

    Instead of:
    char *buf = new char[50 * 1024 * 1024];

    do:
    std::vector<char> buf(50 * 1024 * 1024);

    Now you won't have to clean it up later. The destructor will take care of that.

    Just be aware of that buffering line-oriented files can be a pain since the last row probably won't be a whole row (missing some part of it). You'll have to take that into account when searching. In order to make it fast, you'll probably have to end up manually parsing the line.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  6. #21
    Unregistered User Yarin's Avatar
    Join Date
    Jul 2007
    Posts
    2,158
    Have you tried caching the whole file in memory (or at least large chunks at a time), and then parsing it? It may be faster than the disk->data->disk->data cycle you have now.

  7. #22
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    How well does C-style I/O perform?

    Code:
    #include <cstdio>
    #include <string>
    #include <unordered_map>
    #include <cstdlib>
     
    int main()
    {
    	char ip1[16];
    	char ip2[16];
    	char bogus[128];
    	int dummy;
    	int bytes;
    	std::unordered_map<std::string, unsigned int> bandwidth;
    	
    	FILE *infile = fopen("20110803.txt", "r");
    	
    	while (std::fscanf(infile, "%d %s %d %s %d %s %u", &dummy, ip1, &dummy, ip2, &dummy, bogus, &bytes) == 7)
    	{
    		bandwidth[ip1] += bytes;
    		bandwidth[ip2] += bytes;
    	}
    	
    	std::fclose(infile);
     
        return 0;
    }
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  8. #23
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413
    I'll try that tomorrow. Care to make a wager on the time? It's surely much more efficient, but I don't think it'll be much faster due to the disk I/O. 20 seconds at best is my guess.

  9. #24
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by Epy View Post
    I'll try that tomorrow. Care to make a wager on the time? It's surely much more efficient, but I don't think it'll be much faster due to the disk I/O. 20 seconds at best is my guess.
    Depends how fast your disk is. A typical platter rate on a 7200 RPM disk is about 120 megabytes per second, so I'd put the theoretical max at about 13 seconds. At 20 seconds, I'd say you're still looking at some software inefficiencies somewhere.

    The log entries themselves look highly structured and easily compressible, so another possibility to boost performance is to write the data to disk in compressed form and uncompress it as you read it in. Or just write the log entries in a more compact format.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  10. #25
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413
    Being a dork, I decided to just VPN in and grab the data so I could test things tonight. I got some interesting results. I ran everything on my two main computers here at home. For reference, the original (work) computer is a 1.73GHz Core i7 w/ a SATA3 HD. Here are the times, in seconds:

    1.6 GHz Core Duo (x86) w/SATA2 HD (SATA1 supported by mobo)
    Python: 114
    C++ style I/O: 102
    C style I/O (1): 87
    C style I/O (2): 96

    1.86 GHz Core2 Duo (x64) w/SATA3 HD
    Python: 77
    C++ style I/O: 44
    C style I/O (1): 56
    C style I/O (2): 57

    These were compiled with:
    g++ -std=c++0x -Wall -Wextra -O3 -march=native file

    I have two C versions because I was having trouble parsing the file (missing chunks or something). The first variation uses fgets() to scan in the whole line and then sscanf() to try and extract values. The 2nd variation uses fscanf() to grab what it can and then uses fgets() to basically get to the end of the line.

    I've uploaded everything to Dropbox so you guys can play with the data if you want. You'll need 7-zip.

    http://dl.dropbox.com/u/57289645/20110803.7z
    http://dl.dropbox.com/u/57289645/speed_test.7z

  11. #26
    Make Fortran great again
    Join Date
    Sep 2009
    Posts
    1,413
    Work computer:
    Python: 47
    C++ style I/O: 32
    C style I/O (1): 35
    C style I/O (2): 38

  12. #27
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Formatted I/O is tremendously slow, whether C or C++. I ran the C solution through a profiler and half the time is spent inside isspace() and isdigit(), which apparently are used by scanf(). I think you're not even close to being I/O bound yet.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  13. #28
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by brewbuck View Post
    Formatted I/O is tremendously slow, whether C or C++. I ran the C solution through a profiler and half the time is spent inside isspace() and isdigit(), which apparently are used by scanf(). I think you're not even close to being I/O bound yet.
    Try for a single pass state machine. I don't have time to profile this but I bet it would be a lot faster than the scanf:

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    enum {
    	STRING,
    	NUMBER
    };
    
    int main(void) {
    	char *s[3],
    		eg[] = "34 okay! 45 again -678 bogus 123",
    		*p = eg, 
    		*p2;
    	int n[4], 
    		state = NUMBER,
    		item = 0;
    
    	while (1) {
    		while (*p <= ' ') p++;
    		if (state == STRING) {
    			s[item++] = p;
    			while (*p > ' ') p++;
    			*p = '\0';
    			state = NUMBER;
    			continue;
    		}
    		if (state == NUMBER) {
    			p2 = p;
    			while (*p > ' ') p++;
    			*p = '\0';
    			n[item] = strtol(p2, NULL, 0);
    			if (item == 3) break;
    			state = STRING;
    		}
    	}
    
    	printf("%s %s %s\n%d %d %d %d\n", s[0], s[1], s[2], n[0], n[1], n[2], n[3]);
    
    	return 0;
    }
    You'd probably want to insert some checks into, eg, the "while (*p > ' ')" unless you are sure the input is flawless. Doing the strings in place may or may not be viable, but if it is, that is an advantage over copying.

    I've had a lot of great results using stuff like this to eg, parse HTTP and HTML.
    Last edited by MK27; 01-19-2012 at 12:24 PM. Reason: forgot n[3]
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. C Lessons
    By vurdlak in forum C Programming
    Replies: 8
    Last Post: 02-18-2006, 05:49 PM
  2. enlightening discussion...
    By no-one in forum A Brief History of Cprogramming.com
    Replies: 48
    Last Post: 04-24-2004, 03:27 PM
  3. c lessons on the computer?
    By Unregistered in forum C Programming
    Replies: 2
    Last Post: 07-03-2002, 12:22 PM
  4. english lessons
    By Driveway in forum A Brief History of Cprogramming.com
    Replies: 14
    Last Post: 06-30-2002, 09:38 PM