Thread: searching huge files

  1. #1
    Slave MadCow257's Avatar
    Join Date
    Jan 2005
    Posts
    735

    searching huge files

    I've downloaded the lastest wikipedia dump and I'm wanting to do some analysis on it. The problem is it is nearly 5 gigs so I'm not sure how to do stuff with it. (It's an XML file) The most basic feature I'm interested in would be a function that creates a list with each page's character count. From there I would be able to implement the more advanced stuff.

  2. #2
    semi-colon generator ChaosEngine's Avatar
    Join Date
    Sep 2005
    Location
    Chch, NZ
    Posts
    597
    definetly not a good candidate for DOM

    A good SAX parser will parse the xml withough loading it all into memory, I think. (Haven't used SAX much)

    have a look at the SAX interface in XERCES
    "I saw a sign that said 'Drink Canada Dry', so I started"
    -- Brendan Behan

    Free Compiler: Visual C++ 2005 Express
    If you program in C++, you need Boost. You should also know how to use the Standard Library (STL). Want to make games? After reading this, I don't like WxWidgets anymore. Want to add some scripting to your App?

  3. #3
    Confused Magos's Avatar
    Join Date
    Sep 2001
    Location
    Sweden
    Posts
    3,145
    Not sure about the format, but if it's something like:
    Code:
    <???>
      <page>
        ...
      </page>
      <page>
        ...
      </page>
      ...
    </???>
    (since you mentioned pages)

    Building the whole xml tree is not possible due to the size, but why not find each page-tags (the initial and its corresponding ending tag) and build a subtree from its data which you perform your operations on. Deallocate the memory and repeat until all pages have been traversed.
    MagosX.com

    Give a man a fish and you feed him for a day.
    Teach a man to fish and you feed him for a lifetime.

  4. #4
    Slave MadCow257's Avatar
    Join Date
    Jan 2005
    Posts
    735
    Ah, I thought that fstream could only handle 4 gig files. Not that I know it can, how do I deal with finding the page tags? This is my thought

    have these vars
    char tagbuffer1[6]
    char tagbuffer2[7]
    string page

    Then the steps are
    *slide tagbuffer through the file until it hits a <page>
    *read everything into page while sliding tagbuffer2 until a </page> is hit
    *perform whatever analysis is needed
    *clear variables and repeat

    Is this a good method?

  5. #5
    Confused Magos's Avatar
    Join Date
    Sep 2001
    Location
    Sweden
    Posts
    3,145
    I'm not sure about limits in fstream, they *might* cause troubles, yes...
    If this is the case I bet it's not impossible to split the file into multiple smaller ones and handle this while parsing it.

    As for your steps you're missing one critical thing. What if the structure is this (assuming it's possible):
    Code:
    <page>
      <page>
      </page>
    </page>
    You'd match the outer starting page tag with the inner ending page tag. You'll need some stacking scheme so for each page-tag you encounter increase a counter by 1 and for each endingpage-tag you decrease it. When you reach an ending page-tag AND the counter is 0 (or 1 or however you implement it) then you have found the correct tag.

    Otherwise it looks fine.
    MagosX.com

    Give a man a fish and you feed him for a day.
    Teach a man to fish and you feed him for a lifetime.

  6. #6
    semi-colon generator ChaosEngine's Avatar
    Join Date
    Sep 2005
    Location
    Chch, NZ
    Posts
    597
    if you're going to parse xml, please don't write your own parser. It totally defeats the purpose of xml. use an existing parser, like the xerces one I posted a link to. Use the sax api and you won't need to worry about fstream limits.
    "I saw a sign that said 'Drink Canada Dry', so I started"
    -- Brendan Behan

    Free Compiler: Visual C++ 2005 Express
    If you program in C++, you need Boost. You should also know how to use the Standard Library (STL). Want to make games? After reading this, I don't like WxWidgets anymore. Want to add some scripting to your App?

  7. #7
    Registered User
    Join Date
    Aug 2003
    Posts
    470
    I've downloaded the lastest wikipedia dump and I'm wanting to do some analysis on it. The problem is it is nearly 5 gigs so I'm not sure how to do stuff with it.
    Create some type of index file(and maybe postings file) and do analysis on the index. You'll want to tailor the index according to the type of statistics your doing. For example, creating a list with each page's character count is easy if from a given page you can find the corresponding character count in the index file.

  8. #8
    Call me AirBronto
    Join Date
    Sep 2004
    Location
    Indianapolis, Indiana
    Posts
    195
    when dealing with some thing that big, and looking for reaults that have to do with all the information and not just a page or some thing, it is acually very usefull to just dump the entire thing into one huge character array and then get information from it.

  9. #9
    Call me AirBronto
    Join Date
    Sep 2004
    Location
    Indianapolis, Indiana
    Posts
    195
    rethinking this out, i would do what i said in my last post but i would take in one page at a time, index that page with what ever i extracted, then deallocate and get the next page. this would allow you to do both a macro analsis and micro ones as well, and you wont kill your hard drive doing it.

  10. #10
    Registered User Mortissus's Avatar
    Join Date
    Dec 2004
    Location
    Brazil, Porto Alegre
    Posts
    152
    Sorry, I don't know if I am saying a silly thing, but you could use multithread programming to achieve better performance. However, if you will run only once the software it is not worth the effort. Sorry the english too

    edit: Perhaps my ideia is not clear. On thread would be reading the file while other would be processing it. They would be switching these states, so you will need to use a synchronization mechanism to prevent reading wrong parts (duplicated ones) of the file. Am I clear??? If not, I can try to explain again! ^^
    Last edited by Mortissus; 03-08-2006 at 05:09 AM.

  11. #11
    Registered User
    Join Date
    Aug 2005
    Posts
    1,267
    >>Ah, I thought that fstream could only handle 4 gig files

    I don't know what operating system you are using, but MS-Windows win32 api function ReadFile() can read files much larger than the one you have.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Doxygen failing
    By Elysia in forum A Brief History of Cprogramming.com
    Replies: 19
    Last Post: 04-16-2008, 01:24 PM
  2. Folding@Home Cboard team?
    By jverkoey in forum A Brief History of Cprogramming.com
    Replies: 398
    Last Post: 10-11-2005, 08:44 AM
  3. Batch file programming
    By year2038bug in forum Tech Board
    Replies: 10
    Last Post: 09-05-2005, 03:30 PM
  4. Searching txt files (another file i/o qn)
    By Catif in forum C++ Programming
    Replies: 9
    Last Post: 05-13-2002, 04:14 PM
  5. displaying text files, wierd thing :(
    By Gades in forum C Programming
    Replies: 2
    Last Post: 11-20-2001, 05:18 PM