Yes, I was a little confused by what you meant by index. I was thinking of something like an inverted index: you have the postings, which store a position into the index, and the index, which stores the actual word counts/offsets into the text data. The postings can then be kept in RAM (using a hash table, an alphabetical mapping, or even a trie), while the index is stored on disk.
In other words, you have postings that look like this. Say that while parsing the data you found that Alphabet was at words 10, 15, and 17; Foobar at words 11 and 12; and Zoo at words 13 and 16:
Code:
Postings
--------------------------------
Alphabet -> 0x3434
.
.
Foobar -> 0x4434
.
.
Zoo -> 0x0655
And then your index is a binary file, where the offsets into it correspond to the posting entries.
Code:
Index
-------------------------------------
0x3434: 10, 15, 17
0x4434: 11, 12
0x0655: 13, 16
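A minimal sketch of this in Python (`build_index` and the sample text are hypothetical, chosen only so the positions match the Alphabet/Foobar/Zoo example above; a real system would keep the word -> offset postings in RAM and write the position lists to a binary file):

```python
from collections import defaultdict

def build_index(text):
    """Toy inverted index: map each word to the list of 1-based word
    positions where it occurs. Postings and index live in one dict here
    for illustration instead of being split between RAM and disk."""
    index = defaultdict(list)
    for position, word in enumerate(text.split(), start=1):
        index[word].append(position)
    return dict(index)

# Filler words occupy positions 1-9 and 14 so the example positions line up.
text = "w1 w2 w3 w4 w5 w6 w7 w8 w9 Alphabet Foobar Foobar Zoo w14 Alphabet Zoo Alphabet"
index = build_index(text)
# index["Alphabet"] -> [10, 15, 17], index["Foobar"] -> [11, 12], index["Zoo"] -> [13, 16]
```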
Instead of storing the data like this, however, all that is necessary is to store deltas. For instance:
Code:
Index
-------------------------------------
0x3434: +10, +5, +2
0x4434: +11, +1
0x0655: +13, +3
How much compression you can achieve depends on how small the deltas are.
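A sketch of the delta coding in Python (the function names are mine; `vbyte_encode` is one common variable-byte scheme, included to show why small gaps cost fewer bytes than raw positions):

```python
def delta_encode(positions):
    """Replace each position with its gap from the previous position."""
    deltas, prev = [], 0
    for p in positions:
        deltas.append(p - prev)
        prev = p
    return deltas

def delta_decode(deltas):
    """Recover the original positions by summing the gaps."""
    positions, total = [], 0
    for d in deltas:
        total += d
        positions.append(total)
    return positions

def vbyte_encode(n):
    """Variable-byte encode one non-negative int: 7 payload bits per
    byte, high bit set on the final byte. Small numbers need fewer
    bytes, so small deltas compress better than absolute positions."""
    out = []
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)

# Matches the example: positions [10, 15, 17] become gaps [10, 5, 2].
gaps = delta_encode([10, 15, 17])
```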
The book "Modern Information Retrieval" covers the inverted index, but the material is geared towards a full-blown IR engine. Compressing the text itself looks difficult, because you would have to store an offset into the compressed data and then somehow decompress from that point.
You can probably use a technique like this to allow fast searching for individual words, but you're not actually compressing the text files themselves.
"Modern Information Retrieval" nevertheless describes a way to combine Huffman coding with searching, on pg. 223. Instead of taking bytes as symbols, you take the actual words as symbols, count their frequencies, and build the Huffman tree from those. Then you compress the text using the tree, but rather than emitting bits (like normal Huffman coding) you use an alphabet of bytes.
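Here's a rough sketch of the word-as-symbol idea in Python. Note this is a simplification: it emits ordinary bit-string codes, whereas the byte-oriented variant the book describes would build a 256-ary tree and emit whole bytes; the function name and sample text are mine.

```python
import heapq
from collections import Counter
from itertools import count

def word_huffman_codes(words):
    """Build Huffman codes where the symbols are whole words rather than
    characters. Frequent words end up with the shortest codes."""
    freq = Counter(words)
    tick = count()  # unique tie-breaker so heapq never compares tree nodes
    heap = [(f, next(tick), word) for word, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # degenerate case: only one distinct word
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tick), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: recurse into children
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a word
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes

text = "the cat sat on the mat and the dog sat too"
codes = word_huffman_codes(text.split())
# "the" occurs most often, so its code is at least as short as any other.
```

Compressing is then just concatenating the code for each word of the text in order.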
I think what you want to do is use something like Huffman coding, and then decompress only the part of the text you display, plus a little extra for speed. You could then implement the searches by doing a sequential search over the encoded data, decoding with the Huffman tree as you go.