Thread: Compress and index a text file

  1. #1
    Registered User
    Join Date
    Aug 2004
    Posts
    4

    Compress and index a text file

    Greetings,

    I currently have a program that reads a text file and creates a full-text index of it. It works great, but now I would like to compress the text file and create a full-text index of the compressed file.

    I have read quite a bit on different compression algorithms, but which is the best for this type of application? And once I have the best (most appropriate) algorithm to use, where do I start with creating the index? After two days of searching Google, I have run out of keywords on the subject, and I have found nothing that specifically addresses this topic. I would assume that there are certain types of compression algorithms that cannot be indexed (???).

    Any information (or even just pointing me in the right direction) would be greatly appreciated.

    Thanks!

  2. #2
    Gawking at stupidity
    Join Date
    Jul 2004
    Location
    Oregon, USA
    Posts
    3,218
    Why not do it in this order?

    1) index the text file and store the index somewhere
    2) compress the text file
    ...and then to read it...
    3) decompress the text file
    4) read in the index and seek() to the appropriate spot in the file

    So you're still indexing the uncompressed file instead of the compressed one. A rough sketch of the idea is below.
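
    Something along these lines, for illustration -- the function names are made up and error handling is trimmed:

    Code:
    /* Sketch of steps 1 and 4: build a byte-offset index of the
     * uncompressed text file, then use it to jump straight to a line.
     * Error handling is trimmed; the function names are made up. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Step 1: record the starting offset of every line. */
    long *build_line_index(FILE *text, size_t *count)
    {
        long *offsets = NULL;
        size_t n = 0, cap = 0;
        long pos = 0;
        int ch;

        while ((ch = fgetc(text)) != EOF) {
            if (n == cap) {
                cap = cap ? cap * 2 : 256;
                offsets = realloc(offsets, cap * sizeof *offsets);
            }
            offsets[n++] = pos;          /* offset where this line begins */

            while (ch != '\n' && (ch = fgetc(text)) != EOF)
                ;                        /* consume the rest of the line */
            pos = ftell(text);
        }
        *count = n;
        return offsets;
    }

    /* Step 4: after decompressing back to plain text, seek to line i. */
    void print_line(FILE *text, const long *offsets, size_t i)
    {
        char buf[1024];

        fseek(text, offsets[i], SEEK_SET);
        if (fgets(buf, sizeof buf, text))
            fputs(buf, stdout);
    }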
    If you understand what you're doing, you're not learning anything.

  3. #3
    Im a Capricorn vsriharsha's Avatar
    Join Date
    Feb 2002
    Posts
    192
    I believe no one will use the text file in its compressed form, so compressing it may not be necessary. The solution suggested by itsme86 is a nice one.....

    -Harsha.
    Help everyone you can

  4. #4
    Registered User
    Join Date
    Aug 2004
    Posts
    4
    That is a good suggestion, but it will not suit the needs of my app. There are a few reasons for this.

    1) The text, in many cases, is copyrighted and cannot be in a plain text format. Compression will at least serve as an encryption of some sort, especially when I store the index in the same file. There is no need or desire for the user to use the text file directly. It should only be used as data for the application.

    2) This app will soon be targeted at mobile devices, so the files need to be small and stay small.

    3) Uncompressing the entire file at run-time impacts performance, especially when the file is very large.

    I have downloaded some text readers of various sorts (bible programs, proprietary book readers, etc.) and a number of them maintain a compressed file.

    I finally did come across a book late last night called "Managing Gigabytes," and it covers this exact topic. I downloaded the source code from their website, and though it's way overkill, I think I can glean from it what I need. I'm sure the book will be on its way soon, too.

    If you have any other suggestions or insight, please let me know.

    Thanks!

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,665
    If, for example, you needed to access the file on a line-by-line basis, your unit of compression would be a line.

    - read in a line
    - compress it
    - save the current index position in the output file
    - write the compressed line to the output file.

    Though you might want to read the whole file in advance to work out what the best compression is going to be (say computing a fixed Huffman table).
    You really don't want to store any adaptive Huffman information on a per-line basis, otherwise you'll defeat the objective of making each line decompressible without reference to other lines.
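
    Roughly like this, with zlib's compress() standing in for the fixed Huffman coder (link with -lz). Note zlib adds a few bytes of overhead per call, so a shared fixed table would be tighter for short lines, as noted above; the names and file layout here are illustrative only:

    Code:
    /* Salem's loop, roughly: compress each line on its own and note where
     * its record starts, so a single line can later be located and
     * decompressed without touching the rest of the file.  zlib's
     * compress() is only a stand-in here.  Error handling is trimmed. */
    #include <stdio.h>
    #include <string.h>
    #include <zlib.h>

    void compress_by_line(FILE *in, FILE *out, FILE *idx)
    {
        char line[4096];

        while (fgets(line, sizeof line, in)) {
            uLong srclen = (uLong)strlen(line);
            uLongf dstlen = compressBound(srclen);
            Bytef dst[8192];
            long offset;

            if (compress(dst, &dstlen, (const Bytef *)line, srclen) != Z_OK)
                break;

            offset = ftell(out);                       /* record position */
            fwrite(&offset, sizeof offset, 1, idx);    /* index entry     */
            fwrite(&dstlen, sizeof dstlen, 1, out);    /* record length   */
            fwrite(dst, 1, dstlen, out);               /* compressed line */
        }
    }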
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    Registered User
    Join Date
    Aug 2004
    Posts
    4
    Quoting Salem: "(say computing a fixed Huffman table)"
    I think that definitely starts me in the right direction. I will start looking at this tonight and post my results here.

    Thanks.

  7. #7
    Registered User
    Join Date
    Aug 2003
    Posts
    470
    Are you doing something like an IR engine? There are a number of coding schemes for compressing the index. You should be able to find information about using Golomb encoding, variable-byte encoding, etc. in the Managing Gigabytes book, but there are also research papers out there. Compression using these forms of encoding works because only the deltas, which are usually much less than 32 bits, are stored.
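
    For what it's worth, here's a minimal sketch of variable-byte coding the gaps of a posting list; the function names are made up, and a real index would of course write the bytes to disk rather than just count them:

    Code:
    /* Minimal sketch of variable-byte (v-byte) coding for posting-list
     * gaps: each gap is written 7 bits at a time and the high bit of a
     * byte marks the last byte of that number, so small gaps cost one
     * byte instead of four. */
    #include <stdio.h>

    /* Encode one gap; returns the number of bytes written to buf. */
    int vbyte_encode(unsigned gap, unsigned char *buf)
    {
        int n = 0;

        while (gap >= 128) {
            buf[n++] = gap & 0x7F;   /* low 7 bits, continuation byte */
            gap >>= 7;
        }
        buf[n++] = gap | 0x80;       /* final byte has the high bit set */
        return n;
    }

    /* Decode one gap from buf; *used receives the bytes consumed. */
    unsigned vbyte_decode(const unsigned char *buf, int *used)
    {
        unsigned gap = 0;
        int shift = 0, n = 0;

        while (!(buf[n] & 0x80)) {
            gap |= (unsigned)buf[n++] << shift;
            shift += 7;
        }
        gap |= (unsigned)(buf[n++] & 0x7F) << shift;
        *used = n;
        return gap;
    }

    int main(void)
    {
        /* Word positions 10, 15, 17 become gaps 10, 5, 2. */
        unsigned positions[] = { 10, 15, 17 };
        unsigned char buf[16];
        unsigned prev = 0;
        int i, len = 0;

        for (i = 0; i < 3; i++) {
            len += vbyte_encode(positions[i] - prev, buf + len);
            prev = positions[i];
        }
        printf("3 positions stored in %d bytes\n", len);   /* 3 bytes */
        return 0;
    }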

  8. #8
    Registered User
    Join Date
    Aug 2004
    Posts
    4
    Hi okinrus,

    Not quite a full-blown IR engine, just a small app where you can give it the documents you want (bible texts, books, etc.) and then search through them, scroll through them, annotate, highlight, and so on.

    I did some research on the algorithms you listed and those are turning up quite a bit of information (thanks very much for the info). It looks as though the two you listed (Golomb and variable-byte encoding) are the best to use for this scenario, especially Golomb coding.

    I am trying to hunt down some example code to get more of a handle on it.

  9. #9
    Registered User
    Join Date
    Aug 2003
    Posts
    470
    Yes, I was a little bit confused by what you meant by index. I was thinking of something like an inverted index, where you have the postings, which store positions into the index, and then the index itself, which stores the actual word counts/offsets into the text data. The postings may then be kept in RAM (using a hash, an alphabetical mapping, or even tries) while the index is stored on disk.

    In other words, you have postings that look like this. Say that while parsing the data you found that Alphabet occurred at word positions 10, 15, 17; Foobar at 11, 12; and Zoo at 13, 16:

    Code:
    Postings
    --------------------------------
    Alphabet -> 0x3434
    .
    .
    Foobar -> 0x4434
    .
    .
    Zoo -> 0x0655
    And then your index is a binary file, whereby the offsets into it correspond to the posting entries.
    Code:
    Index
    -------------------------------------
    0x3434:  10, 15, 17
    0x4434:  11, 12
    0x0655:  13, 16
    Instead of storing the data like this, however, all that is necessary is to store the deltas. For instance:
    Code:
    Index
    -------------------------------------
    0x3434: +10, +5, +2
    0x4434: +11, +1
    0x0655: +13, +3

    The amount of compression that you will be able to get depends on how small the deltas are.
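
    To make that concrete, here is a toy sketch of Golomb coding the gaps from the example above; the divisor is a power of two here (Rice coding) to keep it simple, and the bits are written as characters purely so the structure is visible:

    Code:
    /* Toy sketch of Golomb coding a single gap, writing the bits out as
     * '0'/'1' characters so the structure is visible.  With divisor m,
     * the quotient gap/m goes out in unary and the remainder in binary,
     * so small gaps take very few bits.  m is a power of two here (Rice
     * coding); a real coder would pack the bits into bytes. */
    #include <stdio.h>

    void golomb_encode(unsigned gap, unsigned m_bits, char *out)
    {
        unsigned m = 1u << m_bits;          /* divisor, a power of two */
        unsigned q = gap / m, r = gap % m;
        int i;

        for (i = 0; i < (int)q; i++)        /* quotient in unary: q ones... */
            *out++ = '1';
        *out++ = '0';                       /* ...terminated by a zero */
        for (i = (int)m_bits - 1; i >= 0; i--)
            *out++ = (r >> i & 1) ? '1' : '0';  /* remainder in m_bits bits */
        *out = '\0';
    }

    int main(void)
    {
        /* The gaps +10, +5, +2 from the example above, with m = 4. */
        unsigned gaps[] = { 10, 5, 2 };
        char bits[64];
        int i;

        for (i = 0; i < 3; i++) {
            golomb_encode(gaps[i], 2, bits);
            printf("gap %2u -> %s\n", gaps[i], bits);
        }
        return 0;
    }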

    The book "modern information retrieval" has information on the inverted index, but the information is geared towards a full-blown IR engine. Attempting to compress the text looks difficult because you would have to store an offset into the compressed data and then somehow decompress from that point.

    You can probably use a technique similar to this to allow for fast searching of individual words, but you're not actually compressing the text files themselves.

    The book "Modern Information Retrieval" nevertheless describes a away to use huffman coding and searching on the data on pg.223. Instead of taking bits by symbols, you take the actual words as symbols, counting their frequency, and then building the huffman tree. Then you compress the text by using the huffman tree. But rather than use bits(like normal huffman coding) use an alphabet of bytes.

    I think what you want to do is use something like Huffman coding, and then decompress only the part of the text you display, plus a bit extra for speed. You could then implement the searches by just doing a sequential search on the encoded data, decoding with the Huffman tree.
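
    As a rough illustration of that search trick (the tiny hard-coded word table below stands in for a real word-based Huffman table, and the tagging of codeword start bytes that the book uses to rule out false matches is left out):

    Code:
    /* Sketch of searching word-based Huffman compressed text: encode the
     * query word with the same code and do a plain byte search over the
     * compressed data, with no decompression.  The hard-coded table
     * stands in for a real word-based Huffman table. */
    #include <stdio.h>
    #include <string.h>

    static const struct {
        const char *word;
        unsigned char code[2];
        size_t len;
    } table[] = {
        { "Alphabet", { 0x01, 0x00 }, 1 },
        { "Foobar",   { 0x02, 0x00 }, 1 },
        { "Zoo",      { 0x03, 0x01 }, 2 },
    };

    static size_t lookup_codeword(const char *word, unsigned char *code)
    {
        size_t i;

        for (i = 0; i < sizeof table / sizeof table[0]; i++)
            if (strcmp(table[i].word, word) == 0) {
                memcpy(code, table[i].code, table[i].len);
                return table[i].len;
            }
        return 0;   /* word not seen when the table was built */
    }

    /* Return the offset of the query's codeword in the compressed text,
     * or -1 if it does not occur. */
    static long find_word(const unsigned char *ctext, size_t clen,
                          const char *query)
    {
        unsigned char code[2];
        size_t codelen = lookup_codeword(query, code);
        size_t i;

        if (codelen == 0)
            return -1;
        for (i = 0; i + codelen <= clen; i++)
            if (memcmp(ctext + i, code, codelen) == 0)
                return (long)i;
        return -1;
    }

    int main(void)
    {
        /* "Foobar Zoo Alphabet" compressed with the toy code above. */
        unsigned char ctext[] = { 0x02, 0x03, 0x01, 0x01 };

        printf("Zoo found at compressed offset %ld\n",
               find_word(ctext, sizeof ctext, "Zoo"));
        return 0;
    }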
