Enlightening lessons in programming

**Epy** · 01-10-2012

Simply amazing!

New times:
C++: 24 seconds
Python (original): 60.7 seconds
Python (1st improv.): 48.4 seconds
Python (2nd improv.): 45.5 seconds

C++ code:

Code:

#include <fstream>
#include <string>
#include <unordered_map>
#include <cstdlib>

int main()
{
	std::ifstream infile("20110803.txt");
	std::unordered_map<std::string, unsigned int> bandwidth;
	std::string line, ip1, ip2;
	size_t pos1, pos2;
	unsigned int bytes;

	while (std::getline(infile, line))
	{
		pos1 = line.find('\t');
		pos2 = line.find('\t', pos1 + 1);
		ip1 = line.substr(pos1 + 1, pos2 - pos1 - 1);
		pos1 = line.find('\t', pos2 + 1);
		pos2 = line.find('\t', pos1 + 1);
		ip2 = line.substr(pos1 + 1, pos2 - pos1 - 1);
		pos1 = line.find('\t', pos1 + 1);
		pos1 = line.find('\t', pos1 + 1);
		pos1 = line.find('\t', pos1 + 1);
		pos2 = line.find('\t', pos1 + 1);
		bytes = atoi(line.substr(pos1 + 1, pos2 - pos1 - 1).c_str());

		bandwidth[ip1] += bytes;
		bandwidth[ip2] += bytes;
	}
	infile.close();

	return 0;
}

Python codes:

Code:

import time

t1 = time.time()
bandwidth = {}
fin = open('20110803.txt', 'r')
line = fin.readline()
while line != '':
	row = line.split('\t')
	bytes = int(row[6])
	if row[1] in bandwidth:
		bandwidth[row[1]] += bytes
	else:
		bandwidth[row[1]] = bytes
	if row[3] in bandwidth:
		bandwidth[row[3]] += bytes
	else:
		bandwidth[row[3]] = bytes
	line = fin.readline()
fin.close()
dt = time.time() - t1
print(('%.1f' % dt) + ' seconds\n')

t1 = time.time()
bandwidth = {}
with open('20110803.txt', 'r') as fin:
	for line in fin:
		row = line.split('\t')
		bytes = int(row[6])
		if row[1] in bandwidth:
			bandwidth[row[1]] += bytes
		else:
			bandwidth[row[1]] = bytes
		if row[3] in bandwidth:
			bandwidth[row[3]] += bytes
		else:
			bandwidth[row[3]] = bytes
fin.close()
dt = time.time() - t1
print(('%.1f' % dt) + ' seconds\n')

t1 = time.time()
bandwidth = {}
with open('20110803.txt', 'r') as fin:
	for line in fin:
		row = line.split('\t')
		bytes = int(row[6])
		try:
			bandwidth[row[1]] += bytes
		except KeyError:
			bandwidth[row[1]] = bytes
		try:
			bandwidth[row[3]] += bytes
		except KeyError:
			bandwidth[row[3]] = bytes
fin.close()
dt = time.time() - t1
print(('%.1f' % dt) + ' seconds\n')

Example text:

Code:

25201	72.247.133.186	443	192.168.3.62	52314	SSL	1514	Continuation Data
25201	72.247.133.186	443	192.168.3.62	52314	SSL	1514	Continuation Data
25201	72.247.133.186	443	192.168.3.62	52314	SSL	1514	Continuation Data
25201	192.168.5.7	54164	192.168.3.181	445	SMB	124	Write AndX Request, FID: 0xc010, 2 bytes at offset 1425610
25201	192.168.3.7	445	192.168.4.7	49495	SMB	105	Write AndX Response, 2 bytes
25201	72.247.133.186	443	192.168.3.62	52314	SSL	1514	Continuation Data
25201	72.247.133.186	443	192.168.3.62	52314	SSL	507	Continuation Data
25201	192.168.3.7	445	192.168.4.7	49495	SMB	105	Write AndX Response, 86 bytes
25201	192.168.62.2	65200	192.168.3.124	4089	DCERPC	106	Response: call_id: 3153182 Fragment: Single ctx_id: 1
25201	192.168.3.7	445	192.168.4.7	49495	SMB	105	Write AndX Response, 40 bytes
25202	192.168.9.7	2322	192.168.3.7	445	SMB	124	Write AndX Request, FID: 0x000f, 2 bytes at offset 906509

**Epy** · 01-10-2012

So the next question, how would one implement reading ahead in C++? Would that be a multi-threaded application where one thread is processing and the other thread is reading from the file? If so, I'll have to wait a bit. I still haven't really done a multi-threaded application in C yet.

I don't suppose changing the size of the file buffer would help, would it? I tried the following:

Code:

std::ifstream infile("20110803.txt");
char *buf = new char[50 * 1024 * 1024];
infile.rdbuf()->pubsetbuf(buf, 50 * 1024 * 1024);

but this did not change the execution time, i.e. I think g++ threw out these changes probably when optimizing. Either that or it's wrong. In any case, the executable was only using 3 MB of RAM, so it's definitely not using this 50 MB buffer.

**whiteflags** · 01-10-2012

Actually, call istream::readsome

**jimblumberg** · 01-10-2012

Please note the ifstream.get() also stops processing at the new line character. If you want to read a large amount of data you may have to use the read() function.

Jim

**Elysia** · 01-13-2012

OK, so first, stop using C conventions in c++. There is absolutely no need to declare your variables at the top of the function. Declare them as they are used!
Don't use atoi. If you insist on C, use strtol; if you want to go the C++ route, use string streams, or better yet, boost::lexical_cast.
Get rid of infile.close(). The destructor takes care of that.
Now, get rid of new and use std::vector instead.

Instead of:
char *buf = new char[50 * 1024 * 1024];

do:
std::vector<char> buf(50 * 1024 * 1024);

Now you won't have to clean it up later. The destructor will take care of that.

Just be aware of that buffering line-oriented files can be a pain since the last row probably won't be a whole row (missing some part of it). You'll have to take that into account when searching. In order to make it fast, you'll probably have to end up manually parsing the line.

**Yarin** · 01-13-2012

Have you tried caching the whole file in memory (or at least large chunks at a time), and then parsing it? It may be faster than the disk->data->disk->data cycle you have now.

**brewbuck** · 01-18-2012

How well does C-style I/O perform?

Code:

#include <cstdio>
#include <string>
#include <unordered_map>
#include <cstdlib>
 
int main()
{
	char ip1[16];
	char ip2[16];
	char bogus[128];
	int dummy;
	int bytes;
	std::unordered_map<std::string, unsigned int> bandwidth;
	
	FILE *infile = fopen("20110803.txt", "r");
	
	while (std::fscanf(infile, "%d %s %d %s %d %s %u", &dummy, ip1, &dummy, ip2, &dummy, bogus, &bytes) == 7)
	{
		bandwidth[ip1] += bytes;
		bandwidth[ip2] += bytes;
	}
	
	std::fclose(infile);
 
    return 0;
}

**Epy** · 01-18-2012

I'll try that tomorrow. Care to make a wager on the time? It's surely much more efficient, but I don't think it'll be much faster due to the disk I/O. 20 seconds at best is my guess.

**brewbuck** · 01-18-2012

Originally Posted by Epy

I'll try that tomorrow. Care to make a wager on the time? It's surely much more efficient, but I don't think it'll be much faster due to the disk I/O. 20 seconds at best is my guess.

Depends how fast your disk is. A typical platter rate on a 7200 RPM disk is about 120 megabytes per second, so I'd put the theoretical max at about 13 seconds. At 20 seconds, I'd say you're still looking at some software inefficiencies somewhere.

The log entries themselves look highly structured and easily compressible, so another possibility to boost performance is to write the data to disk in compressed form and uncompress it as you read it in. Or just write the log entries in a more compact format.

**Epy** · 01-18-2012

Being a dork, I decided to just VPN in and grab the data so I could test things tonight. I got some interesting results. I ran everything on my two main computers here at home. For reference, the original (work) computer is a 1.73GHz Core i7 w/ a SATA3 HD. Here are the times, in seconds:

1.6 GHz Core Duo (x86) w/SATA2 HD (SATA1 supported by mobo)
Python: 114
C++ style I/O: 102
C style I/O (1): 87
C style I/O (2): 96

1.86 GHz Core2 Duo (x64) w/SATA3 HD
Python: 77
C++ style I/O: 44
C style I/O (1): 56
C style I/O (2): 57

These were compiled with:
g++ -std=c++0x -Wall -Wextra -O3 -march=native file

I have two C versions because I was having trouble parsing the file (missing chunks or something). The first variation uses fgets() to scan in the whole line and then sscanf() to try and extract values. The 2nd variation uses fscanf() to grab what it can and then uses fgets() to basically get to the end of the line.

I've uploaded everything to Dropbox so you guys can play with the data if you want. You'll need 7-zip.

http://dl.dropbox.com/u/57289645/20110803.7z
http://dl.dropbox.com/u/57289645/speed_test.7z

**Epy** · 01-19-2012

Work computer:
Python: 47
C++ style I/O: 32
C style I/O (1): 35
C style I/O (2): 38

**brewbuck** · 01-19-2012

Formatted I/O is tremendously slow, whether C or C++. I ran the C solution through a profiler and half the time is spent inside isspace() and isdigit(), which apparently are used by scanf(). I think you're not even close to being I/O bound yet.

**MK27** · 01-19-2012

Originally Posted by brewbuck

Formatted I/O is tremendously slow, whether C or C++. I ran the C solution through a profiler and half the time is spent inside isspace() and isdigit(), which apparently are used by scanf(). I think you're not even close to being I/O bound yet.

Try for a single pass state machine. I don't have time to profile this but I bet it would be a lot faster than the scanf:

Code:

#include <stdio.h>
#include <stdlib.h>

enum {
	STRING,
	NUMBER
};

int main(void) {
	char *s[3],
		eg[] = "34 okay! 45 again -678 bogus 123",
		*p = eg, 
		*p2;
	int n[4], 
		state = NUMBER,
		item = 0;

	while (1) {
		while (*p <= ' ') p++;
		if (state == STRING) {
			s[item++] = p;
			while (*p > ' ') p++;
			*p = '\0';
			state = NUMBER;
			continue;
		}
		if (state == NUMBER) {
			p2 = p;
			while (*p > ' ') p++;
			*p = '\0';
			n[item] = strtol(p2, NULL, 0);
			if (item == 3) break;
			state = STRING;
		}
	}

	printf("%s %s %s\n%d %d %d %d\n", s[0], s[1], s[2], n[0], n[1], n[2], n[3]);

	return 0;
}

You'd probably want to insert some checks into, eg, the "while (*p > ' ')" unless you are sure the input is flawless. Doing the strings in place may or may not be viable, but if it is, that is an advantage over copying.

I've had a lot of great results using stuff like this to eg, parse HTTP and HTML.

Thread: Enlightening lessons in programming

Thread Tools

Search Thread

Display

Similar Threads

C Lessons

enlightening discussion...

c lessons on the computer?

english lessons