Thread: Alternative to an array?

  1. #1
    Registered Abuser Loic's Avatar
    Join Date
    Mar 2007
    Location
    Sydney
    Posts
    115

    Alternative to an array?

    Hi all, I am trying to write a program to help me manage word lists: basically, merge two word lists together and filter out the duplicate words...

    One problem I can see coming is that some word lists can be quite large, some up to several gigabytes. Normally when I have written programs to manipulate data I have been working with databases that are tiny in comparison, so I really have no idea how to handle such large amounts of data.

    Below I have posted what I have done so far, where the words from the word list are stored in the string array called temp. As you can see, it isn't very practical at all... any suggestions on how I could get around this problem?

    Code:
    #include <cstdlib>
    #include <iostream>
    #include <fstream>
    #include <string>
    
    using namespace std;
    
    int main(int argc, char *argv[])
    {
        string temp[100];   // fixed size: anything past 100 lines won't fit
        ifstream wordlist_in;
    
        if (argc < 2)
        {
            return EXIT_FAILURE;    // no file name given
        }
    
        wordlist_in.open (argv[1]);
    
        if (wordlist_in.is_open())
        {
            // read at most 100 lines; getline() fails at end of file,
            // which ends the loop without a separate eof() check
            int count = 0;
            while (count < 100 && getline (wordlist_in, temp[count]))
            {
                count++;
            }
        }
    
        return EXIT_SUCCESS;
    }

  2. #2
    Deathray Engineer MacGyver's Avatar
    Join Date
    Mar 2007
    Posts
    3,210
    std::vector is probably what you want; however, it's probably not going to store GBs of information on most of today's computers.

  3. #3
    Frequently Quite Prolix dwks's Avatar
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    Use vectors!

    Code:
    #include <iostream>
    #include <string>
    #include <vector>
    
    using namespace std;
    
    int main()
    {
        vector<string> data;
    
        data.push_back("text");   // add an element to the vector
        data.push_back("more");
    
        cout << data[0] << endl;  // access elements in vectors
        cout << data[1] << endl;
        cout << "data has " << data.size() << " elements.\n";
    }
    http://cppreference.com/cppvector/index.html
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell


    Other boards: DaniWeb, TPS
    Unofficial Wiki FAQ: cpwiki.sf.net

    My website: http://dwks.theprogrammingsite.com/
    Projects: codeform, xuni, atlantis, nort, etc.

  4. #4
    Registered Abuser Loic's Avatar
    Join Date
    Mar 2007
    Location
    Sydney
    Posts
    115
    Quote Originally Posted by MacGyver View Post
    std::vector is probably what you want; however, it's probably not going to store GBs of information on most of today's computers.
    If it's not going to store GBs of info, then the program most likely will not work in a real-world situation... Is there a way to write the data to the hard drive?

    Correct me if I am wrong, but if a computer has 1GB of RAM, and let's say half of that is being used by the system, then you are left with 512MB, which a program such as this will eat up very quickly?

  5. #5
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Hi all, I am trying to write a program to help me manage word lists: basically, merge two word lists together and filter out the duplicate words...

    One problem I can see coming is that some word lists can be quite large, some up to several gigabytes. Normally when I have written programs to manipulate data I have been working with databases that are tiny in comparison, so I really have no idea how to handle such large amounts of data.
    You can use merge sort for this. Take a look at the (current) Wikipedia article on merge sort, in particular the section on "merge sorting tape drives". Instead of tape drives, use files. Once all the data is sorted, removing duplicates is easy (sorted input is also exactly what std::unique() requires).
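    Just to sketch the first phase (this is illustrative only; the chunk size and run file names are arbitrary, not anything from your program): read the input in chunks that fit in memory, sort each chunk, and write each one out as a sorted "run" file. Something like:

    Code:
    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <sstream>
    #include <string>
    #include <vector>
    
    using namespace std;
    
    // Sort one chunk in memory and write it out as a numbered "run" file.
    void write_run(vector<string> &chunk, int run)
    {
        sort(chunk.begin(), chunk.end());
        ostringstream name;
        name << "run" << run << ".txt";
        ofstream out(name.str().c_str());
        for (size_t i = 0; i < chunk.size(); i++)
            out << chunk[i] << '\n';
        chunk.clear();
    }
    
    int main(int argc, char *argv[])
    {
        if (argc < 2)
            return 1;
    
        const size_t CHUNK_SIZE = 1000000;  // words held in memory at once
        ifstream in(argv[1]);
        vector<string> chunk;
        string word;
        int runs = 0;
    
        while (getline(in, word))
        {
            chunk.push_back(word);
            if (chunk.size() == CHUNK_SIZE)
                write_run(chunk, runs++);
        }
        if (!chunk.empty())
            write_run(chunk, runs++);
    
        // Each run file is now sorted; merging the runs (see the posts
        // below) produces one fully sorted file without ever loading
        // the whole list at once.
        return 0;
    }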
    Last edited by laserlight; 07-05-2008 at 04:10 PM.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  6. #6
    Deathray Engineer MacGyver's Avatar
    Join Date
    Mar 2007
    Posts
    3,210
    Quote Originally Posted by Loic View Post
    If it's not going to store GBs of info, then the program most likely will not work in a real-world situation...
    Um, we still have many 32-bit machines in existence, and that means you're limited to about 4GB for everything. If you really think a modern OS is going to give you half of that just for your program, you will probably be in for a real surprise. So you're right, this won't work in a real-world situation.

    Now, there is virtual memory, so the system can make it look as if you have more than that amount; however, Windows, for example, will only let your program think it has access up to a certain limit. On 32-bit systems, 4 GB is the theoretical limit (due to pointer size and such), but I don't think you can practically allocate such an amount. While I'm not sure about Vista's memory usage, I may have heard it's still capped at some arbitrary amount, well below the 64-bit limit, although I might be (and I hope I am) mistaken on this; I might be thinking of certain 64-bit hardware, actually.

    Overall, the idea of holding a dictionary worth GBs of data entirely in memory is just bad these days. You could, for example, store this giant list of data across several files and keep in active memory only the information that tells you which files to open and access, as in the sketch below.
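    As a toy sketch of that idea (the file names here are made up), you could keep a tiny map in memory that tells you which file holds the words for a given first letter:

    Code:
    #include <fstream>
    #include <map>
    #include <string>
    
    using namespace std;
    
    int main()
    {
        // Tiny in-memory index: which file on disk holds the words
        // starting with a given letter.
        map<char, string> where;
        for (char c = 'a'; c <= 'z'; c++)
            where[c] = string("words_") + c + ".txt";
    
        string word = "example";
        ifstream in(where[word[0]].c_str());  // open only the file needed
        // ... search just that one file for the word
        return 0;
    }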

    Quote Originally Posted by Loic View Post
    Is there a way to write the data to the hard drive?
    Of course.

    Quote Originally Posted by Loic View Post
    Correct me if I am wrong, but if a computer has 1GB of RAM, and let's say half of that is being used by the system, then you are left with 512MB, which a program such as this will eat up very quickly?
    On most modern machines, that is only true if the OS lets you do it.
    Last edited by MacGyver; 07-05-2008 at 04:18 PM.

  7. #7
    Registered Abuser Loic's Avatar
    Join Date
    Mar 2007
    Location
    Sydney
    Posts
    115
    Quote Originally Posted by laserlight View Post
    You can use merge sort for this. Take a look at the (current) Wikipedia article on merge sort, in particular the section on "merge sorting tape drives". Instead of tape drives, use files. Once all the data is sorted, removing duplicates is easy (sorted input is also exactly what std::unique() requires).
    I just read through that Wikipedia article, and I think I get the theory of it, but I am a bit unsure how I would go about implementing it... wouldn't the data at some point end up in virtual memory? Or would you read, then write, the data one word at a time? But then I am still unclear on how it would sort the data.

  8. #8
    Registered Abuser Loic's Avatar
    Join Date
    Mar 2007
    Location
    Sydney
    Posts
    115
    Quote Originally Posted by MacGyver View Post
    On most modern machines, that is only true if the OS lets you do it.
    Well, I don't ever intend for this program to be run on Windows, as all of the tools I will use the word lists with run under Linux... Ubuntu or BackTrack...

    But I think laserlight's suggestion of using merge sort seems to be the most practical option... and again, I am still a little confused about how to implement it.

  9. #9
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Are the two current files already sorted?

    I'm assuming this is a one-time operation, or at least something that is only going to be run rarely, so you don't need to worry much about performance beyond what is reasonable. That means you can get creative and devise your own way if you can't figure out merge sort.

    The idea behind merge sort is that if you have two sorted lists, you can merge them into one sorted list. You do this basically by picking the first element from each list, comparing them, and moving the smaller one to a third, new list. The index of the list that had its element moved is incremented, while the other list's index stays the same. You do this until you reach the end of one list; any remaining elements of the other list are then moved to the output list as they are.

    Since you also need to remove duplicates, you can then just use std::unique on the newly created list. In your case, the advantage of the merge is that it consumes almost no memory: at most, the size of a pair of elements. Conceptually, what you want to do is:

    1. Extract one element from each file
    2. Compare these two elements
    3. Add the smallest to the new file
    4. Go back to 1 and repeat

    If the two lists are unsorted, you can still do this, either by sorting them beforehand or by applying the merge as usual but placing each element into the output list at the position found with std::lower_bound. It will be a little slower because there's a binary search involved, but I think performance is not a concern in this case.
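    To make those steps concrete, here's a small in-memory demo using the standard library's merge and unique; the word lists are obviously stand-ins for your real files:

    Code:
    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <string>
    #include <vector>
    
    using namespace std;
    
    int main()
    {
        // Two small, already-sorted lists standing in for the real files.
        string a_arr[] = { "apple", "cat", "dog" };
        string b_arr[] = { "bird", "cat", "zebra" };
        vector<string> a(a_arr, a_arr + 3), b(b_arr, b_arr + 3), out;
    
        // Steps 1-4 above, done for us by std::merge.
        merge(a.begin(), a.end(), b.begin(), b.end(), back_inserter(out));
    
        // Drop adjacent duplicates ("cat" appears in both lists).
        out.erase(unique(out.begin(), out.end()), out.end());
    
        for (size_t i = 0; i < out.size(); i++)
            cout << out[i] << '\n';   // apple bird cat dog zebra
        return 0;
    }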
    Last edited by Mario F.; 07-05-2008 at 07:20 PM.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  10. #10
    Deathray Engineer MacGyver's Avatar
    Join Date
    Mar 2007
    Posts
    3,210
    Quote Originally Posted by Loic View Post
    well i don't ever intend for this program to be run on windows... as all of the tools i will use the word list for run under linux... ubuntu, or backtrack...
    It still applies to Linux.

  11. #11
    Registered User VirtualAce's Avatar
    Join Date
    Aug 2001
    Posts
    9,607
    Um, we still have many 32-bit machines in existence, and that means you're limited to about 4GB for everything. If you really think a modern OS is going to give you half of that just for your program, you will probably be in for a real surprise. So you're right, this won't work in a real-world situation.
    The OS will give you up to 3GB of address space on WinXP SP2 if you specify that you need the additional memory (the /3GB boot switch). The default split of the 4GB address space is 2GB for the user and 2GB for the OS.

    Even if you specify that you want the additional gig, most people have many other apps running in the background, and even the 3GB allowance gets eaten quickly. But theoretically you could use 3GB on a 4GB machine and XP wouldn't care.

    Without that switch, the limit on a 32-bit machine is 2GB per program, since the other half of the address space is reserved for XP.

    Vectors should do just fine for what you want to do.
    Last edited by VirtualAce; 07-05-2008 at 08:09 PM.

  12. #12
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Presumably you will be writing the merged data out to a file, rather than keeping it in memory for some other purpose. If so, simply do it all file to file: open the two input files, read one word at a time from each, and write to the "merged" file as required. The only resources your program needs are what's necessary to maintain two input streams and one output stream, memory to hold two words and compare them, and a few looping constructs. If you do this, the amount of memory available on your machine is irrelevant.
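    A sketch of that approach, assuming the two inputs are already sorted (the file names are placeholders):

    Code:
    #include <fstream>
    #include <string>
    
    using namespace std;
    
    int main()
    {
        ifstream a("list1.txt"), b("list2.txt");
        ofstream out("merged.txt");
    
        string wa, wb, last;
        bool haveA = getline(a, wa);
        bool haveB = getline(b, wb);
        bool first = true;
    
        while (haveA || haveB)
        {
            // Take the smaller of the two current words, and advance
            // only the stream it came from.
            string next;
            if (haveA && (!haveB || wa <= wb))
            {
                next = wa;
                haveA = getline(a, wa);
            }
            else
            {
                next = wb;
                haveB = getline(b, wb);
            }
            if (first || next != last)   // skip duplicates as we go
            {
                out << next << '\n';
                last = next;
                first = false;
            }
        }
        return 0;
    }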

  13. #13
    int x = *((int *) NULL); Cactus_Hugger's Avatar
    Join Date
    Jul 2003
    Location
    Banks of the River Styx
    Posts
    902
    Word lists are GBs in size? I've written a program to work with a Scrabble game. The entire SOWPODS dictionary I had was less than 3MB, and contained over 250,000 words. Extrapolating outward, that looks to be about 80 million words from a gig of memory. Just a thought.
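    For what it's worth, the arithmetic behind that guess:

    Code:
    #include <iostream>
    
    using namespace std;
    
    int main()
    {
        // Back-of-the-envelope version of the extrapolation above.
        double bytesPerWord = 3.0 * 1024 * 1024 / 250000;          // ~12.6
        double wordsPerGig  = 1024.0 * 1024 * 1024 / bytesPerWord; // ~85 million
        cout << bytesPerWord << " bytes per word, "
             << wordsPerGig << " words per GB\n";
        return 0;
    }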
    long time; /* know C? */
    Unprecedented performance: Nothing ever ran this slow before.
    Any sufficiently advanced bug is indistinguishable from a feature.
    Real Programmers confuse Halloween and Christmas, because dec 25 == oct 31.
    The best way to accelerate an IBM is at 9.8 m/s/s.
    recursion (re - cur' - zhun) n. 1. (see recursion)

  14. #14
    Registered User
    Join Date
    Apr 2006
    Posts
    2,149
    Don't use vector; use deque. It doesn't require one contiguous block of memory, so the OS will be able to give you more.
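    The interface used here is the same as vector's, so it's nearly a drop-in change:

    Code:
    #include <deque>
    #include <string>
    
    using namespace std;
    
    int main()
    {
        deque<string> words;          // same interface as vector here...
        words.push_back("example");   // ...but stored as many small blocks,
        words.push_back("another");   // not one huge contiguous buffer
        return 0;
    }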
    It is too clear and so it is hard to see.
    A dunce once searched for fire with a lighted lantern.
    Had he known what fire was,
    He could have cooked his rice much sooner.

  15. #15
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    If you merge files and read the data unit by unit, you can kiss your performance goodbye. Read it in blocks of, say, a million units.
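    A rough sketch of what I mean (the batch size and file name are arbitrary):

    Code:
    #include <fstream>
    #include <string>
    #include <vector>
    
    using namespace std;
    
    int main()
    {
        const size_t BLOCK = 1000000;      // units per batch; pick to taste
        ifstream in("list1.txt");          // placeholder file name
        vector<string> block;
        string word;
    
        while (getline(in, word))
        {
            block.push_back(word);
            if (block.size() == BLOCK)
            {
                // ... process the whole batch here ...
                block.clear();
            }
        }
        // ... process the final partial batch ...
        return 0;
    }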

    But I'm more inclined to agree with Cactus_Hugger. Gigabytes? It sounds like you should store the data better; specifically, use a trie. It gives you very fast lookup, and common prefixes among the words take up memory only once.
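    A minimal sketch of a trie node (illustrative only; real code would also free the nodes):

    Code:
    #include <map>
    #include <string>
    
    using namespace std;
    
    // One node per character; words sharing a prefix share nodes, so
    // "program", "programs" and "programmer" mostly overlap in storage.
    // (Nodes are never freed in this sketch.)
    struct TrieNode
    {
        bool isWord;
        map<char, TrieNode*> children;
        TrieNode() : isWord(false) {}
    };
    
    void insert(TrieNode *node, const string &word)
    {
        for (string::size_type i = 0; i < word.size(); i++)
        {
            if (node->children.find(word[i]) == node->children.end())
                node->children[word[i]] = new TrieNode;
            node = node->children[word[i]];
        }
        node->isWord = true;
    }
    
    int main()
    {
        TrieNode root;
        insert(&root, "program");
        insert(&root, "programmer");  // reuses the "program" nodes
        return 0;
    }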

    Edit: By the way, the Oxford English Dictionary contains about 180,000 entries. Where do you get gigabytes from that?
    Last edited by CornedBee; 07-06-2008 at 03:22 AM.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

