I do make the "all-important" distinction. I have identified the data as data that is "encoded" in the English language, and there are many millions, even billions, of documents to which this could apply. There are surely far more documents written in English than computer programs ever written, and the ratio will be drastic. I am not saying I will just make my algorithm == the data. That would be like saying the dictionary == Shakespeare, which is ridiculous.
Keyword: modified. As I said, you assume the status quo ("conventional size"), and you are right to do so! By the same reasoning, I am assuming English is, in fact, a standardized, conventional language. SAME reasoning.

Programmers assume 8-bit bytes simply because that's the conventional size used on most machines. Even so, internally they may analyze and process them in various bit-widths. At any rate, a true compression algorithm (including gzip) can easily be modified to work with, say, 10-bit bytes (in fact, the Huffman compressor I posted the other day allows you to do just that without even altering the source code; it's a template parameter).
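To make the bit-width point concrete, here is a minimal sketch of reading the same bytes as symbols of an arbitrary width. It is not the Huffman compressor mentioned above; `split_symbols` and its `width` parameter are hypothetical names invented for this illustration, and zero-padding the final symbol is an assumption.

```python
def split_symbols(data: bytes, width: int) -> list[int]:
    """Split data into width-bit symbols, most significant bit first.

    Hypothetical helper: the last symbol is padded with zero bits
    when the input length is not a multiple of the symbol width.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    symbols = []
    buffer = 0  # bits read but not yet emitted
    bits = 0    # number of valid bits held in buffer
    for byte in data:
        buffer = (buffer << 8) | byte
        bits += 8
        while bits >= width:
            bits -= width
            symbols.append((buffer >> bits) & ((1 << width) - 1))
            buffer &= (1 << bits) - 1  # keep only the leftover bits
    if bits:
        symbols.append(buffer << (width - bits))  # zero-pad the tail
    return symbols
```

With `width=8` this returns the ordinary byte values; with `width=10`, five input bytes become four 10-bit symbols, which a symbol-based coder could then treat as its alphabet.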
Anyway, it would be compressed. The average word is 5 characters, and each word would be replaced by a short int, which is two bytes. Then I think one byte per word for state information -- but that could be optimized, so the compression will be 50-75%. COMPRESSION.
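A rough sketch of the scheme described above, under stated assumptions: every word maps to a 16-bit index in a shared word list, followed by one state byte. The function names, the `STATE_*` flags, and building the word list from the sample text itself are all hypothetical; a real version would use a fixed English dictionary and would have to handle punctuation, unknown words, and mixed case.

```python
import struct

STATE_PLAIN = 0        # hypothetical state flag: word stored as-is (lowercase)
STATE_CAPITALIZED = 1  # hypothetical state flag: first letter uppercase

def build_dictionary(words):
    """Assign each distinct lowercase word a 16-bit index."""
    unique = sorted({w.lower() for w in words})
    if len(unique) > 0x10000:
        raise ValueError("more words than a 16-bit index can address")
    return {word: i for i, word in enumerate(unique)}

def encode(text, index):
    """Emit 3 bytes per word: a 2-byte index plus a 1-byte state."""
    out = bytearray()
    for word in text.split():
        state = STATE_CAPITALIZED if word[0].isupper() else STATE_PLAIN
        out += struct.pack("<HB", index[word.lower()], state)
    return bytes(out)

def decode(data, index):
    """Rebuild the text, assuming single spaces between words."""
    words_by_id = {i: word for word, i in index.items()}
    words = []
    for offset in range(0, len(data), 3):
        word_id, state = struct.unpack_from("<HB", data, offset)
        word = words_by_id[word_id]
        words.append(word.capitalize() if state == STATE_CAPITALIZED else word)
    return " ".join(words)

sample = "The quick brown fox jumps over the lazy dog"
index = build_dictionary(sample.split())
packed = encode(sample, index)
saving = 1 - len(packed) / len(sample.encode("ascii"))
```

For this sample the packed form is 27 bytes against 43 bytes of ASCII. As a back-of-envelope check on the estimate: if a word averages 5 characters plus a separating space (6 bytes), then 3 bytes per word saves about 50% and 2 bytes per word saves about 67%; the sample falls short of that because its words are shorter than average.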