Thread: Software optimization

  1. #16
    Registered User
    Join Date
    Oct 2005
    Location
    Brasil
    Posts
    220
    Ok, i have done some modifications, here is the result:
    Code:
    #ifndef _SPELLCORRECTOR_H_
    #define _SPELLCORRECTOR_H_
    
    #include <map>
    #include <vector>
    
    class SpellCorrector
    {
    private:
    
    	typedef std::map<std::string, int> Dictionary;
    	typedef std::vector<std::string> Vector;
    
    	static const char alphabet[];
    	Dictionary dictionary;
    
    	void edits(std::string& word, Vector& result);
    	void known(Vector& results, Dictionary& candidates);
    
    public:
    	void load(std::string filename);
    	std::string correct(std::string word);
    };
    
    #endif
    Code:
    #include <fstream>
    #include <string>
    #include <algorithm>
    
    #include "SpellCorrector.h"
    
    using namespace std;
    
    const char SpellCorrector::alphabet[] = "abcdefghijklmnopqrstuvwxyz";
    
    bool sortBySecond(const pair<std::string, int>& left, const pair<std::string, int>& right)
    {
    	return left.second < right.second;
    }
    
    void SpellCorrector::load(string filename)
    {
    	ifstream file(filename.c_str());
    	char* data;
    	
    	file.seekg(0, ios_base::end);
    	int length = file.tellg();
    	file.seekg(0, ios_base::beg);
    
    	data = new char[length+1];
    
    	file.read(data, length);
    
    	string line(data);
    
    	delete [] data;
    	
    	transform(line.begin(), line.end(), line.begin(), tolower);
    
    	string::size_type position = 0;
    	while ((position = line.find_first_of(alphabet, position)) != string::npos)
    	{
    		string::size_type endPosition = line.find_first_not_of(alphabet, position);
    
    		dictionary[line.substr(position, endPosition - position)]++;
    
    		position = endPosition;
    	}
    }
    
    string SpellCorrector::correct(string word)
    {
    	Vector result;
    	Dictionary candidates;
    
    	if (dictionary.find(word) != dictionary.end()) { return word; }
    
    	edits(word, result);
    	known(result, candidates);
    
    	if (candidates.size() > 0) { return max_element(candidates.begin(), candidates.end(), sortBySecond)->first; }
    
    	for (unsigned int i = 0;i < result.size();i++)
    	{
    		Vector subResult;
    		edits(result[i], subResult);
    		known(subResult, candidates);
    	}
    
    	if (candidates.size() > 0) { return max_element(candidates.begin(), candidates.end(), sortBySecond)->first; }
    
    	return "";
    }
    
    void SpellCorrector::known(Vector& results, Dictionary& candidates)
    {
    	Dictionary::iterator end = dictionary.end();
    	for (unsigned int i = 0;i < results.size();i++)
    	{
    		Dictionary::iterator value = dictionary.find(results[i]);
    		
    		if (value != end) candidates[value->first] = value->second;
    	}
    }
    
    void SpellCorrector::edits(string& word, Vector& result)
    {
    	for (string::size_type i = 0;i < word.size();     i++) result.push_back(word.substr(0, i)                  + word.substr(i + 1)); //deletions
    	for (string::size_type i = 0;i < word.size() - 1;i++) result.push_back(word.substr(0, i) + word[i+1] + word.substr(i + 2)); //transposition
    
    	for (char j = 'a';j <= 'z';++j)
    	{
    		for (string::size_type i = 0;i < word.size();     i++) result.push_back(word.substr(0, i) + j + word.substr(i + 1)); //alterations
    		for (string::size_type i = 0;i < word.size() + 1;i++) result.push_back(word.substr(0, i) + j + word.substr(i)     ); //insertion
    	}
    }
    I have some especific questions:
    1) In the load functions isn&#180;t there any easier way to read all data from a file without affecting too much performance?

    2) I have tried
    Code:
    string data;
    getline(file, data, file.widen(EOF));
    but it seems that performance gets affected by it, i tought it is because it checks each character of the file until EOF is reached, am i right?

    3) It seems that the most intensive loop in my application, besides I/O, is the one that i break the lines and insert them into the map... is there any way i could do this faster?

  2. #17
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    I think your current load function is about as efficient as you can get. Is it actually measurably slow at the moment, and if so, what part of it is slow? [You can perhaps output a timestamp after each small section of your load process - I suspect you spend most of the time reading the file, followed by your while-loop processing the input data].

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  3. #18
    Registered User
    Join Date
    Oct 2005
    Location
    Brasil
    Posts
    220
    Code:
    void SpellCorrector::load(string filename)
    {
    	ifstream file(filename.c_str());
    	char* data;
    
    	boost::timer t1;
    
    	file.seekg(0, ios_base::end);
    	int length = file.tellg();
    	file.seekg(0, ios_base::beg);
    
    	data = new char[length+1];
    
    	file.read(data, length);
    
    	string line(data);
    
    	delete [] data;
    
    	cout << "t1:" << t1.elapsed() << "s \n";
    
    	boost::timer t2;
    
    	transform(line.begin(), line.end(), line.begin(), tolower);
    
    	cout << "t2:" << t2.elapsed() << "s \n";
    
    	boost::timer t3;
    
    	string::size_type position = 0;
    	while ((position = line.find_first_of(alphabet, position)) != string::npos)
    	{
    		string::size_type endPosition = line.find_first_not_of(alphabet, position);
    
    		dictionary[line.substr(position, endPosition - position)]++;
    
    		position = endPosition;
    	}
    	cout << "t3:" << t3.elapsed() << "s \n";
    }
    This gives me the following output:
    t1: 0.218s
    t2: 0.125s
    t3: 2.985s

    So, the most CPU intensive loop is the string "splitting" part, how can i do it more efficiently? By the way, the file is about 6.3MB.

  4. #19
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    I don't know. Do you know which is the more cpu consuming, the finding in the string, or the update of your dictionary? How much time do you save if you remove the dictionary[]++ line [this is of course not a valid change ultimately, but it makes sense to understand which part of this function that is the more time consuming].

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  5. #20
    Registered User
    Join Date
    Oct 2005
    Location
    Brasil
    Posts
    220
    using the functions in this way
    Code:
    	string::size_type position = 0;
    	while ((position = line.find_first_of(alphabet, position)) != string::npos)
    	{
    		string::size_type endPosition = line.find_first_not_of(alphabet, position);
                    string s (line.substr(position, endPosition - position));
    		//dictionary[line.substr(position, endPosition - position)]++;
    
    		position = endPosition;
    	}
    t3 gives me an output of 0.969s. Then dictionary insertion is tooking a hell long time there uh?

  6. #21
    The larch
    Join Date
    May 2006
    Posts
    3,573
    You might probably gain something if you used a sorted vector of pair<string, int>. For that you might need to preprocess the dictionary file: read it as you do now and save the contents of the map.
    I might be wrong.

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

  7. #22
    Registered User
    Join Date
    Oct 2005
    Location
    Brasil
    Posts
    220
    Ah, sure, im planning to do that, but i wanted to improve the algorithm to read it in the first time, as we are doing now, no preprocessing.

  8. #23
    and the hat of sweating
    Join Date
    Aug 2007
    Location
    Toronto, ON
    Posts
    3,545
    I don't know if this will improve performance much, but in Item 29 of "Effective STL" Scott Meyers suggests using istreambuf_iterators for character-by-character input:
    Code:
    // Open filename.txt.
    ifstream inputFile( "filename.txt" );
    
    // Load entire file into fileData string.
    string fileData( (istreambuf_iterator<char>( inputFile )),
                             istreambuf_iterator<char>() );

  9. #24
    Registered User
    Join Date
    May 2003
    Posts
    1,619
    Quote Originally Posted by cpjust View Post
    Yeah, that's one of the reasons I like C++ over Java. But then there's so many C++ things that I wish Java had...
    You might like C# then.

    And by the way, using good programming practices you don't usually end up with that problem in Java or C#, because in order to keep classes loosely coupled (which should always be a major goal, especially if this is a class designed for other people to use), it should be implementing an interface. All you need to provide to the user of your class is your interface; how your class or classes implement it is irrelevant. This allows bare minimal coupling between your classes and theirs.
    You ever try a pink golf ball, Wally? Why, the wind shear on a pink ball alone can take the head clean off a 90 pound midget at 300 yards.

  10. #25
    Registered User
    Join Date
    Oct 2005
    Location
    Brasil
    Posts
    220
    After some time spended i could achieve the following code:
    Code:
    void SpellCorrector::load(const std::string& filename)
    {
    	ifstream file(filename.c_str(), ios_base::binary | ios_base::in);
    
    	boost::timer t1;
    
    	file.seekg(0, ios_base::end);
    	int length = file.tellg();
    	file.seekg(0, ios_base::beg);
    
    	string line(length, '0');
    
    	file.read(&line[0], length);
    
    	cout << "t1:" << t1.elapsed() << "s \n";
    
    	boost::timer t2;
    
    	transform(line.begin(), line.end(), line.begin(), tolower);
    
    	cout << "t2:" << t2.elapsed() << "s \n";
    
    	boost::timer t3;
    
    	string::size_type position = 0;
    	while ((position = line.find_first_of(alphabet, position)) != string::npos)
    	{
    		string::size_type endPosition = line.find_first_not_of(alphabet, position);
    
    		dictionary[line.substr(position, endPosition - position)]++;
    
    		position = endPosition;
    	}
    	cout << "t3:" << t3.elapsed() << "s \n";
    }
    with these marks:
    t1:0.078s
    t2:0.109s
    t3:2.109s

    I just wanted to optimize the sring parsing now and i would be happy... I will keep searching...
    Last edited by Scarvenger; 12-14-2007 at 06:12 AM.

  11. #26
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Erm, that's a big line.
    Break it into lines and use code tags.
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  12. #27
    Registered User
    Join Date
    Oct 2005
    Location
    Brasil
    Posts
    220
    Sorry ;P, fixed.

  13. #28
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    You may find that it's quicker to not use the find_first[_not]_of function to find a (non-)alphabet character. As far as I can tell, there's no direct way to determine if a letter is NOT in the list, but by comparing it throughout the entire 26 characters. This PARTICULARLY applies to the "not of" case.

    A better method would be to make a comparison with the start/end of the range you want it to be in/out of. If you create your own function for that, I predict it will be a little bit quicker - but of course, you still spend a significant time to insert the item in the dictionary, from previous profile analysis.

    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  14. #29
    Registered User
    Join Date
    Oct 2005
    Location
    Brasil
    Posts
    220
    Sweet, thank you matsp you gave me a hell nice idea:
    Code:
    char filterNonAlphabetic(char& letter)
    {
    	if (isalpha(letter))
    		return tolower(letter);
    	return '-';
    }
    
    
    void SpellCorrector::load(const std::string& filename)
    {
    	ifstream file(filename.c_str(), ios_base::binary | ios_base::in);
    
    	boost::timer t1;
    
    	file.seekg(0, ios_base::end);
    	int length = file.tellg();
    	file.seekg(0, ios_base::beg);
    
    	string line(length, '0');
    
    	file.read(&line[0], length);
    
    	cout << "t1:" << t1.elapsed() << "s \n";
    
    	boost::timer t2;
    
    	transform(line.begin(), line.end(), line.begin(), filterNonAlphabetic);
    
    	cout << "t2:" << t2.elapsed() << "s \n";
    
    	boost::timer t3;
    
    	string::size_type position = 0;
    	while ((position = line.find_first_not_of('-', position)) != string::npos)
    	{
    		string::size_type endPosition = line.find_first_of('-', position);
    
    		dictionary[line.substr(position, endPosition - position)]++;
    
    		position = endPosition;
    	}
    
    	cout << "t3:" << t3.elapsed() << "s \n";
    }
    Now i get these marks:
    t1:0.063s
    t2:0.188s
    t3:1.5s

    Nice increase huh?

  15. #30
    The larch
    Join Date
    May 2006
    Posts
    3,573
    How about cleaning up the file, so it always contains exactly one special character between words and you won't have to transform anything and won't have to look for first_of/first_not_of? - may not be a good choice because it can also appear in some words.
    I might be wrong.

    Thank you, anon. You sure know how to recognize different types of trees from quite a long way away.
    Quoted more than 1000 times (I hope).

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Software Design/Test - Redmond, WA
    By IRVolt in forum Projects and Job Recruitment
    Replies: 2
    Last Post: 06-11-2008, 10:26 AM
  2. Why C Matters
    By DavidP in forum A Brief History of Cprogramming.com
    Replies: 136
    Last Post: 01-16-2008, 09:09 AM
  3. Adding trial period to software
    By BobS0327 in forum C Programming
    Replies: 17
    Last Post: 01-03-2006, 02:13 PM
  4. Optimization settings
    By Roaring_Tiger in forum C Programming
    Replies: 4
    Last Post: 02-23-2005, 02:53 AM