I am trying to read a file that contains about 100 million URLs. What I need to do is find out which websites they come from, so I read a chunk of data into memory and process it line by line. I also need to find out how many URLs each website has in the file and what those URLs are. The way I figured to do it is to have a class domain, which is:
Code:
    class domain
    {
    public:
        string domainname;
        int nolinks;
        string URLs;
        domain() { nolinks=1; }
    };
and then declare a hash_set into which I insert domain objects. This is the declaration of the hash_set:
Code:
    class hash_fnc:public stdext::hash_compare<std::string>
    {
    public:
        /*enum{ bucket_size=1024, min_buckets=8 };*/
        size_t operator()(const domain& d)const
        {
            size_t h = 0;
            std::string::const_iterator p, p_end;
            for(p = d.domainname.begin(), p_end = d.domainname.end(); p != p_end; ++p)
            {
                h = 31 * h + (*p);
            }
            return h;
        }
        bool operator()(const domain& x,const domain& y) const
        {
            return x.domainname.compare(y.domainname)<0;
        }
    };
    hash_set<domain,hash_fnc>_domain;
    pair<hash_set<domain,hash_fnc>::iterator,bool>ret;
While reading the file, for each URL (e.g. Daggie - Viewing Profile), I take the website (Cleaned.be V3.0), create a domain object with domain.domainname="www.cleaned.be" and insert it into my hash_set. Whenever I encounter another URL from the same website, I check whether it already exists, increase the count domain.nolinks by 1 and append the URL to domain.URLs. This is the code block for that:
Code:
    domain X;
    X.domainname="www.somedomain.com";
    //X.URLs.assign("www.somedomain.com/index/____/x.html");
    ret=_domain.insert(X);
    if(ret.second==false) //it already exists
    {
        (ret.first)->nolinks++;
        (ret.first)->URLs.append("\n ");
        (ret.first)->URLs.append("www.somedomain.com/index/____/x.html");
    }
I do this for every line (i.e. every URL) in the file. Now the problem:
Although this worked really well with an ordinary set<>, it does not work with hash_set<>. The URLs are not getting added to the correct domain, and my computer sometimes turns off while running this program. Also, the output is now missing almost half the domains and is otherwise messed up. Obviously I am making huge mistakes, so please try to help me. I'll really be thankful.
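
For comparison, the bookkeeping I am after (one entry per website, with a link count and the list of its URLs) could be sketched roughly as below. This is only a minimal sketch and not the hash_set code above: it assumes a C++11 compiler and swaps in std::unordered_map keyed by the domain name, so the count and URLs live in the mapped value rather than in the set element itself. The names DomainInfo, DomainTable, host_of and add_url are hypothetical, and host_of is a deliberately naive way of cutting the website out of a URL.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record: what is tracked for each website.
struct DomainInfo {
    std::size_t nolinks = 0;          // number of URLs seen for this website
    std::vector<std::string> urls;    // the URLs themselves
};

// Hypothetical table type: domain name -> per-domain record.
using DomainTable = std::unordered_map<std::string, DomainInfo>;

// Naive host extraction (an assumption for illustration): skip an optional
// "scheme://" prefix and take everything up to the next '/'.
std::string host_of(const std::string& url) {
    std::size_t start = url.find("://");
    start = (start == std::string::npos) ? 0 : start + 3;
    const std::size_t end = url.find('/', start);
    if (end == std::string::npos) {
        return url.substr(start);
    }
    return url.substr(start, end - start);
}

// Record one URL: operator[] creates the entry on first sight of a domain
// and returns the existing one afterwards, so no separate lookup is needed.
void add_url(DomainTable& table, const std::string& url) {
    DomainInfo& info = table[host_of(url)];
    ++info.nolinks;
    info.urls.push_back(url);
}
```

Calling add_url once per line read from the file would then leave table.size() as the number of distinct websites and table["www.somedomain.com"].nolinks as the count for that site. Keeping every URL in memory for 100 million lines may still be too much for one machine, so the urls vector might have to be dropped or written out to disk in batches.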





