I am trying to split the English Wikipedia XML dump (~20GB uncompressed, ~12GB with references/citations removed (!), and ~8GB once converted to plain text, which is the version I am actually working with) into a few thousand files of roughly 1MB each.

However, the challenge is that I want to put "related" articles in the same file as much as possible, where "relatedness" is measured by the minimum number of link "hops" needed to get from one article to another (can someone suggest a better criterion?). That way my viewer program won't have to open many different files when the user browses around by clicking links.
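
To make the hop criterion concrete, here is a rough sketch (in the C++ I will be using for preprocessing) of the kind of greedy, BFS-based bucketing I have in mind. The names links and sizeOf are placeholders for data my preprocessing step would produce, not real code:

```cpp
#include <cstdint>
#include <cstddef>
#include <queue>
#include <vector>

// Placeholder data assumed to come out of the preprocessing step:
// the article link graph and each article's plain-text size in bytes.
extern std::vector<std::vector<uint32_t>> links;   // links[id] -> ids of linked articles
extern std::vector<uint32_t> sizeOf;               // sizeOf[id] -> article size in bytes

const std::size_t BUCKET_LIMIT = 1 << 20;          // ~1MB per output file

// Greedily fill one bucket: start from a seed article and BFS outward, so
// everything placed in the bucket is only a few hops away from the seed.
std::vector<uint32_t> fillBucket(uint32_t seed, std::vector<bool>& assigned) {
    std::vector<uint32_t> bucket;
    std::size_t used = 0;
    std::queue<uint32_t> frontier;
    frontier.push(seed);
    while (!frontier.empty() && used < BUCKET_LIMIT) {
        uint32_t id = frontier.front();
        frontier.pop();
        if (assigned[id] || used + sizeOf[id] > BUCKET_LIMIT) continue;
        assigned[id] = true;
        bucket.push_back(id);
        used += sizeOf[id];
        for (uint32_t next : links[id]) {
            if (!assigned[next]) frontier.push(next);
        }
    }
    return bucket;
}
```

This is only meant to illustrate the idea; whether a greedy approach like this (or something smarter, such as a proper graph-partitioning algorithm) is the right direction is exactly what I am asking.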

What kind of algorithm(s) should I be looking at? Preferably something incremental, since I obviously can't fit the whole thing into memory; the intuitive brute-force approach probably won't work at this data size.
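
To clarify what I mean by "incremental", below is the rough two-pass structure I am assuming: one streaming pass that keeps only the link graph and article sizes (which should be far smaller than the 8GB of text), the grouping step run on that, then a second streaming pass that appends each article to its assigned block file. nextArticle and assignGroups are hypothetical placeholders, and assignGroups is the part I don't know how to do well:

```cpp
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical pieces, assumed to exist in my C++ preprocessing code:
struct Article {
    uint32_t id;
    std::string text;
    std::vector<uint32_t> links;
};
bool nextArticle(std::istream& dump, Article& a);   // streaming parser over the plain-text dump
std::vector<uint32_t> assignGroups(                 // the grouping algorithm I am asking about
    const std::vector<std::vector<uint32_t>>& graph,
    const std::vector<uint32_t>& sizes);

int main() {
    std::vector<std::vector<uint32_t>> graph;
    std::vector<uint32_t> sizes;

    // Pass 1: stream the dump, keep only the link graph and article sizes
    // (much smaller than the text itself), discard everything else.
    {
        std::ifstream dump("enwiki-plain.dat", std::ios::binary);
        Article a;
        while (nextArticle(dump, a)) {
            if (a.id >= graph.size()) { graph.resize(a.id + 1); sizes.resize(a.id + 1); }
            graph[a.id] = a.links;
            sizes[a.id] = static_cast<uint32_t>(a.text.size());
        }
    }

    // Grouping works only on the compact graph, never on the full text.
    std::vector<uint32_t> group = assignGroups(graph, sizes);

    // Pass 2: stream the dump again, appending each article to its group's ~1MB file.
    std::ifstream dump("enwiki-plain.dat", std::ios::binary);
    Article a;
    while (nextArticle(dump, a)) {
        std::ofstream out("block" + std::to_string(group[a.id]) + ".dat",
                          std::ios::binary | std::ios::app);
        out.write(a.text.data(), static_cast<std::streamsize>(a.text.size()));
    }
    return 0;
}
```

If even the link graph turns out to be too big for memory, that is where I would need a genuinely external/incremental algorithm.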

BACKGROUND (probably irrelevant) info on why I need to do this -
I am going to write a Java viewer for cellphones (old ones, such as the one I use, running Nokia's OS). J2ME's file access API (JSR-75, FileConnection) is very primitive and has no seek function (no idea what they were thinking...), so I have to split the data into small blocks. I could read and discard bytes to fake a seek, but reports on the Internet put that at around 0.3MB/s, so I figured I should keep each file at about 1MB and build an index that tells the program which file and offset each article is in, which would also support searching article titles.

On top of that, due to some more brain damage or a dirty business decision on Sun's part, unless the application is signed (a few thousand dollars a year, not happening) the user has to confirm EVERY file access; the "allow all file accesses in this session" option only exists for signed apps. Pure nonsense, but I guess I have to live with it. That's why I want to group "linked" articles together: 1MB is already a few thousand articles, so good grouping keeps the number of file opens (and prompts) down. Of course, all the preprocessing will be done on the PC in C++.
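
For what it's worth, the index I have in mind is just a flat, title-sorted file of simple records written by the C++ preprocessor; the sketch below is a strawman (names, field widths, and the lack of any endianness handling are all placeholders):

```cpp
#include <cstdint>
#include <ostream>
#include <string>
#include <vector>

// Strawman index record: which ~1MB block file an article lives in and its
// byte offset inside that block. All names and field sizes are placeholders.
struct IndexEntry {
    std::string title;   // article title (used for title search)
    uint16_t    block;   // block file number, e.g. 1234 -> "block1234.dat"
    uint32_t    offset;  // byte offset of the article within that block
};

// Write the entries sorted by title as length-prefixed records. Note this
// dumps integers in host byte order; a real format would pin the endianness.
void writeIndex(const std::vector<IndexEntry>& sortedByTitle, std::ostream& out) {
    for (const IndexEntry& e : sortedByTitle) {
        uint16_t len = static_cast<uint16_t>(e.title.size());
        out.write(reinterpret_cast<const char*>(&len),      sizeof(len));
        out.write(e.title.data(),                           len);
        out.write(reinterpret_cast<const char*>(&e.block),  sizeof(e.block));
        out.write(reinterpret_cast<const char*>(&e.offset), sizeof(e.offset));
    }
}
```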

Thanks