Thread: Randomly shuffle lines of huge text file

  1. #1
    Registered User
    Join Date
    May 2006
    Posts
    34

    Question Randomly shuffle lines of huge text file

    I'm looking for an algorithm that can randomly shuffle every line of an extremely large text file (aprox. 2 to 3 gigabytes). The obvious way is to load every line into a vector of strings and use something like std::random_shuffle(), but the problem is that it would take up an enormous amount of memory. So, what kind of algorithm can I use that's more memory friendly? Speed isn't a real big issue, but obviously I don't want to be waiting hours for it to finish =) Any ideas would be appreciated. Thanks.

    Edit: forgot to mention, the file is pretty much in sorted order already. I'm trying to go from sorted, to randomly shuffled.
    Last edited by veecee; 06-12-2006 at 04:19 PM.

  2. #2
    Registered User Dante Shamest's Avatar
    Join Date
    Apr 2003
    Posts
    970
    I would memory map the file then shuffle the bytes.
    On Unix, mmap is used to memory map files, while on Windows use CreateFileMapping and MapViewOfFile. I think boost also provides this for memory mapped files.

  3. #3
    Registered User
    Join Date
    Jun 2006
    Posts
    5
    im only a second year computer science student but mayb this can help:

    1) create dynamic array of type string and assign it a pointer
    2) read some of the file into that dynamic array
    3) randomly shuffle the dynamic array
    4) send the results to output file
    5) delete dynamic array to save memory
    6) repeat steps 1-6 as much as needed

  4. #4
    Registered User Dante Shamest's Avatar
    Join Date
    Apr 2003
    Posts
    970
    @henrychan22: The dynamic array approach probably won't be able to handle gigabytes of data. Using memory mapped files will just treat the file as memory, without you having to allocate much.

  5. #5
    Registered User
    Join Date
    May 2006
    Posts
    34
    Hmm, I've heard about memory mapped files, but I'm not too sure exactly what they are. I thought they just copied the whole file into memory so you can access it faster than on disk. I'll have to read up on memory mapped files I guess to see how it can help.

    I had an idea similar to henrychan22's. Basically reading, say, a 10 meg chunk of the file into memory, and then shuffling it and outputting it to a new file. Once that's done, randomly append them back to eachother. The only problem with that, is the file is pretty much sorted alphabetically (forgot to mention that), so I'd end up with lines that start with like A through C near eachother, instead of more randomly spread out across the whole file.

  6. #6
    int x = *((int *) NULL); Cactus_Hugger's Avatar
    Join Date
    Jul 2003
    Location
    Banks of the River Styx
    Posts
    902
    Are your lines the same length? If they are, then you can swap them in place... if they aren't, (the likely case...) then I can't see any (easy) way around creating a new file.

    My approach, perhaps not the best:
    Read through the entire file, and record the position of each newline you encounter. Then, choose a newline at random, and write that line to the output file.
    This won't work well if you have short lines for a file of 3GB... lines would have to average 50 characters or more for this to work well. (25 character lines = 500MB of memory needed by my estimates... tad extreme.) (Perhaps create an intermediate file to hold these positions while you use them?)

    A twist on that, perhaps then, would be to record the postition of every fifth newline, and a count of the total #. Again, select a random newline, but keep track of ranges of finished newlines... a tad more complicated, but needs less memory. (An obvious constraint.)
    long time; /* know C? */
    Unprecedented performance: Nothing ever ran this slow before.
    Any sufficiently advanced bug is indistinguishable from a feature.
    Real Programmers confuse Halloween and Christmas, because dec 25 == oct 31.
    The best way to accelerate an IBM is at 9.8 m/s/s.
    recursion (re - cur' - zhun) n. 1. (see recursion)

  7. #7
    Cat without Hat CornedBee's Avatar
    Join Date
    Apr 2003
    Posts
    8,895
    Memory mapping is limited by the address space. On Win32, a program has only 2 GBs of address space available, and various areas of that used by something. Typically, you'll have a hard time mmapping anything larger than 1.5 GBs.
    The same with 32-bit Linux or similar.

    Of course, the problem is effectively gone in 64-bit OSs.

    Hmm ... I think I'd scan the large file and create an index of newline offsets. It should be a manageable size.
    Shuffle that index, then use it to write a new file.

    Edit: just realized that that's the same method as the poster above me.
    All the buzzt!
    CornedBee

    "There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code."
    - Flon's Law

  8. #8
    Registered User Dante Shamest's Avatar
    Join Date
    Apr 2003
    Posts
    970
    After reading the original question more carefully, I see that the OP wanted to re-arrange lines, not bytes. In that case ignore my idea of shuffling all the bytes. You can still use memory-mapped files to find the newlines quickly. If it's too big for memory mapping, map parts of it then, if your OS supports it.

  9. #9
    Registered User
    Join Date
    May 2006
    Posts
    34
    Nope, not the same length. For this particular file, the lines range from about 70 to 100 characters long.

    Hmm, an index of where each line ends sounds like a good idea. That would most likely be a smaller amount of data to deal with, but like Cactus_Hugger said, that could end up being a big file as well. I'll just have to try it and see since that seems like the best method to use so far.

    Thank you all for your ideas.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Formatting the contents of a text file
    By dagorsul in forum C++ Programming
    Replies: 2
    Last Post: 04-29-2008, 12:36 PM
  2. Line Counting
    By 00Sven in forum C Programming
    Replies: 26
    Last Post: 04-02-2006, 08:59 PM
  3. Removing text between /* */ in a file
    By 0rion in forum C Programming
    Replies: 2
    Last Post: 04-05-2004, 08:54 AM
  4. Read word from text file (It is an essay)
    By forfor in forum C Programming
    Replies: 7
    Last Post: 05-08-2003, 11:45 AM
  5. How to erase duplicate lines in a text file?
    By miketv in forum C Programming
    Replies: 0
    Last Post: 02-25-2002, 11:18 PM