Software optimization

**Scarvenger** · 12-13-2007

Ok, i have done some modifications, here is the result:

Code:

#ifndef _SPELLCORRECTOR_H_
#define _SPELLCORRECTOR_H_

#include <map>
#include <vector>

class SpellCorrector
{
private:

	typedef std::map<std::string, int> Dictionary;
	typedef std::vector<std::string> Vector;

	static const char alphabet[];
	Dictionary dictionary;

	void edits(std::string& word, Vector& result);
	void known(Vector& results, Dictionary& candidates);

public:
	void load(std::string filename);
	std::string correct(std::string word);
};

#endif

Code:

#include <fstream>
#include <string>
#include <algorithm>

#include "SpellCorrector.h"

using namespace std;

const char SpellCorrector::alphabet[] = "abcdefghijklmnopqrstuvwxyz";

bool sortBySecond(const pair<std::string, int>& left, const pair<std::string, int>& right)
{
	return left.second < right.second;
}

void SpellCorrector::load(string filename)
{
	ifstream file(filename.c_str());
	char* data;
	
	file.seekg(0, ios_base::end);
	int length = file.tellg();
	file.seekg(0, ios_base::beg);

	data = new char[length+1];

	file.read(data, length);

	string line(data);

	delete [] data;
	
	transform(line.begin(), line.end(), line.begin(), tolower);

	string::size_type position = 0;
	while ((position = line.find_first_of(alphabet, position)) != string::npos)
	{
		string::size_type endPosition = line.find_first_not_of(alphabet, position);

		dictionary[line.substr(position, endPosition - position)]++;

		position = endPosition;
	}
}

string SpellCorrector::correct(string word)
{
	Vector result;
	Dictionary candidates;

	if (dictionary.find(word) != dictionary.end()) { return word; }

	edits(word, result);
	known(result, candidates);

	if (candidates.size() > 0) { return max_element(candidates.begin(), candidates.end(), sortBySecond)->first; }

	for (unsigned int i = 0;i < result.size();i++)
	{
		Vector subResult;
		edits(result[i], subResult);
		known(subResult, candidates);
	}

	if (candidates.size() > 0) { return max_element(candidates.begin(), candidates.end(), sortBySecond)->first; }

	return "";
}

void SpellCorrector::known(Vector& results, Dictionary& candidates)
{
	Dictionary::iterator end = dictionary.end();
	for (unsigned int i = 0;i < results.size();i++)
	{
		Dictionary::iterator value = dictionary.find(results[i]);
		
		if (value != end) candidates[value->first] = value->second;
	}
}

void SpellCorrector::edits(string& word, Vector& result)
{
	for (string::size_type i = 0;i < word.size();     i++) result.push_back(word.substr(0, i)                  + word.substr(i + 1)); //deletions
	for (string::size_type i = 0;i < word.size() - 1;i++) result.push_back(word.substr(0, i) + word[i+1] + word.substr(i + 2)); //transposition

	for (char j = 'a';j <= 'z';++j)
	{
		for (string::size_type i = 0;i < word.size();     i++) result.push_back(word.substr(0, i) + j + word.substr(i + 1)); //alterations
		for (string::size_type i = 0;i < word.size() + 1;i++) result.push_back(word.substr(0, i) + j + word.substr(i)     ); //insertion
	}
}

I have some especific questions:
1) In the load functions isn´t there any easier way to read all data from a file without affecting too much performance?

2) I have tried

Code:

string data;
getline(file, data, file.widen(EOF));

but it seems that performance gets affected by it, i tought it is because it checks each character of the file until EOF is reached, am i right?

3) It seems that the most intensive loop in my application, besides I/O, is the one that i break the lines and insert them into the map... is there any way i could do this faster?

**matsp** · 12-13-2007

I think your current load function is about as efficient as you can get. Is it actually measurably slow at the moment, and if so, what part of it is slow? [You can perhaps output a timestamp after each small section of your load process - I suspect you spend most of the time reading the file, followed by your while-loop processing the input data].

--
Mats

**Scarvenger** · 12-13-2007

Code:

void SpellCorrector::load(string filename)
{
	ifstream file(filename.c_str());
	char* data;

	boost::timer t1;

	file.seekg(0, ios_base::end);
	int length = file.tellg();
	file.seekg(0, ios_base::beg);

	data = new char[length+1];

	file.read(data, length);

	string line(data);

	delete [] data;

	cout << "t1:" << t1.elapsed() << "s \n";

	boost::timer t2;

	transform(line.begin(), line.end(), line.begin(), tolower);

	cout << "t2:" << t2.elapsed() << "s \n";

	boost::timer t3;

	string::size_type position = 0;
	while ((position = line.find_first_of(alphabet, position)) != string::npos)
	{
		string::size_type endPosition = line.find_first_not_of(alphabet, position);

		dictionary[line.substr(position, endPosition - position)]++;

		position = endPosition;
	}
	cout << "t3:" << t3.elapsed() << "s \n";
}

This gives me the following output:
t1: 0.218s
t2: 0.125s
t3: 2.985s

So, the most CPU intensive loop is the string "splitting" part, how can i do it more efficiently? By the way, the file is about 6.3MB.

**matsp** · 12-13-2007

I don't know. Do you know which is the more cpu consuming, the finding in the string, or the update of your dictionary? How much time do you save if you remove the dictionary[]++ line [this is of course not a valid change ultimately, but it makes sense to understand which part of this function that is the more time consuming].

--
Mats

**Scarvenger** · 12-13-2007

using the functions in this way

Code:

	string::size_type position = 0;
	while ((position = line.find_first_of(alphabet, position)) != string::npos)
	{
		string::size_type endPosition = line.find_first_not_of(alphabet, position);
                string s (line.substr(position, endPosition - position));
		//dictionary[line.substr(position, endPosition - position)]++;

		position = endPosition;
	}

t3 gives me an output of 0.969s. Then dictionary insertion is tooking a hell long time there uh?

**anon** · 12-13-2007

You might probably gain something if you used a sorted vector of pair<string, int>. For that you might need to preprocess the dictionary file: read it as you do now and save the contents of the map.

**Scarvenger** · 12-13-2007

Ah, sure, im planning to do that, but i wanted to improve the algorithm to read it in the first time, as we are doing now, no preprocessing.

**cpjust** · 12-13-2007

I don't know if this will improve performance much, but in Item 29 of "Effective STL" Scott Meyers suggests using istreambuf_iterators for character-by-character input:

Code:

// Open filename.txt.
ifstream inputFile( "filename.txt" );

// Load entire file into fileData string.
string fileData( (istreambuf_iterator<char>( inputFile )),
                         istreambuf_iterator<char>() );

**Cat** · 12-13-2007

Originally Posted by cpjust

Yeah, that's one of the reasons I like C++ over Java. But then there's so many C++ things that I wish Java had...

You might like C# then.

And by the way, using good programming practices you don't usually end up with that problem in Java or C#, because in order to keep classes loosely coupled (which should always be a major goal, especially if this is a class designed for other people to use), it should be implementing an interface. All you need to provide to the user of your class is your interface; how your class or classes implement it is irrelevant. This allows bare minimal coupling between your classes and theirs.

**Scarvenger** · 12-14-2007

After some time spended i could achieve the following code:

Code:

void SpellCorrector::load(const std::string& filename)
{
	ifstream file(filename.c_str(), ios_base::binary | ios_base::in);

	boost::timer t1;

	file.seekg(0, ios_base::end);
	int length = file.tellg();
	file.seekg(0, ios_base::beg);

	string line(length, '0');

	file.read(&line[0], length);

	cout << "t1:" << t1.elapsed() << "s \n";

	boost::timer t2;

	transform(line.begin(), line.end(), line.begin(), tolower);

	cout << "t2:" << t2.elapsed() << "s \n";

	boost::timer t3;

	string::size_type position = 0;
	while ((position = line.find_first_of(alphabet, position)) != string::npos)
	{
		string::size_type endPosition = line.find_first_not_of(alphabet, position);

		dictionary[line.substr(position, endPosition - position)]++;

		position = endPosition;
	}
	cout << "t3:" << t3.elapsed() << "s \n";
}

with these marks:
t1:0.078s
t2:0.109s
t3:2.109s

I just wanted to optimize the sring parsing now and i would be happy... I will keep searching...

**Elysia** · 12-14-2007

Erm, that's a big line.
Break it into lines and use code tags.

**Scarvenger** · 12-14-2007

Sorry ;P, fixed.

**matsp** · 12-14-2007

You may find that it's quicker to not use the find_first[_not]_of function to find a (non-)alphabet character. As far as I can tell, there's no direct way to determine if a letter is NOT in the list, but by comparing it throughout the entire 26 characters. This PARTICULARLY applies to the "not of" case.

A better method would be to make a comparison with the start/end of the range you want it to be in/out of. If you create your own function for that, I predict it will be a little bit quicker - but of course, you still spend a significant time to insert the item in the dictionary, from previous profile analysis.

--
Mats

**Scarvenger** · 12-14-2007

Sweet, thank you matsp you gave me a hell nice idea:

Code:

char filterNonAlphabetic(char& letter)
{
	if (isalpha(letter))
		return tolower(letter);
	return '-';
}


void SpellCorrector::load(const std::string& filename)
{
	ifstream file(filename.c_str(), ios_base::binary | ios_base::in);

	boost::timer t1;

	file.seekg(0, ios_base::end);
	int length = file.tellg();
	file.seekg(0, ios_base::beg);

	string line(length, '0');

	file.read(&line[0], length);

	cout << "t1:" << t1.elapsed() << "s \n";

	boost::timer t2;

	transform(line.begin(), line.end(), line.begin(), filterNonAlphabetic);

	cout << "t2:" << t2.elapsed() << "s \n";

	boost::timer t3;

	string::size_type position = 0;
	while ((position = line.find_first_not_of('-', position)) != string::npos)
	{
		string::size_type endPosition = line.find_first_of('-', position);

		dictionary[line.substr(position, endPosition - position)]++;

		position = endPosition;
	}

	cout << "t3:" << t3.elapsed() << "s \n";
}

Now i get these marks:
t1:0.063s
t2:0.188s
t3:1.5s

Nice increase huh?

**anon** · 12-14-2007

How about cleaning up the file, so it always contains exactly one special character between words and you won't have to transform anything and won't have to look for first_of/first_not_of? - may not be a good choice because it can also appear in some words.

Thread: Software optimization

Thread Tools

Search Thread

Display

Similar Threads

Software Design/Test - Redmond, WA

Why C Matters

Adding trial period to software

Optimization settings