# Data Compression

• 02-11-2006
Ezerhorden
Data Compression
I'm planning to write data compression software for my thesis, which is 2 years from now. Can you tell me what problems or difficulties I could face in making a compressor that is almost as good as, if not better than, the best available today? And are there any good sites for learning data compression? I want to make a kick-ass algorithm :D
• 02-11-2006
Salem
http://www.faqs.org/faqs/compression-faq/

First you need to decide whether to go for lossless (like zip) or lossy (like mpeg) compression.

> I want to make a kick-ass algorithm
Then you need to be aware of the theoretical limits of compressibility.
http://en.wikipedia.org/wiki/Information_theory
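
To get a feel for those limits, here is a rough sketch that estimates the Shannon entropy of a file's byte frequencies. The function name `entropy_bits_per_byte` is made up for this example. It only models each byte independently (an "order-0" model), so a compressor that exploits context between bytes can still do better than this number suggests; treat it as a first estimate, not a hard bound for every method.

```python
import math
from collections import Counter

def entropy_bits_per_byte(data: bytes) -> float:
    """Order-0 Shannon entropy of the byte frequencies, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Sum of p * log2(1/p) over every byte value that occurs in the data.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# One repeated byte carries no information per byte under this model.
print(entropy_bits_per_byte(b"a" * 53))
# Every byte value used exactly once is the worst case for this model.
print(entropy_bits_per_byte(bytes(range(256))))
```

Multiplying the result by the file length and dividing by 8 gives a ballpark for the smallest size (in bytes) a coder that looks at one byte at a time could reach.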
• 02-11-2006
Shakti
Problems you will face:
How will you do the actual compression? Compression is not an easy task, and by the sound of it you don't have enough experience to do it (yet).

What algorithms will you use and how will you implement them?
Some algorithms you may want to look into are Huffman coding and RLE (run-length encoding).
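
As a starting point for Huffman coding, here is a minimal sketch that only computes the code length (in bits) assigned to each byte value, which is enough to see how much space the coding would save. The names `huffman_code_lengths` and `encoded_bits` are invented for this example, and a real compressor would also have to build the actual bit codes and store the code table in the output.

```python
import heapq
from collections import Counter

def huffman_code_lengths(data: bytes) -> dict:
    """Return {byte value: Huffman code length in bits} for the given data."""
    counts = Counter(data)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them gets one level deeper.
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encoded_bits(data: bytes) -> int:
    """Total payload size in bits if data were Huffman coded (code table not counted)."""
    lengths = huffman_code_lengths(data)
    return sum(lengths[b] for b in data)

sample = b"aaaabbc"
print(huffman_code_lengths(sample), encoded_bits(sample), "bits vs", 8 * len(sample))
```

Frequent bytes end up with short codes and rare bytes with long ones, which is the whole idea behind the method.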

How will you store the files?
Many programs today use archives; how will you do it?

Remember, today's programs have almost pushed the limits. There are only so many ways you can store the data, and in the end it all comes down to reducing the average number of bits it takes to represent each byte of the original data.
• 02-11-2006
Ezerhorden
I was also curious which file types are the hardest or easiest to compress. I didn't find it in the FAQ link above (or maybe I didn't look hard enough :D )
• 02-11-2006
Shakti
Easiest to compress:
Either an empty file :p or a file which consists of only one byte value repeated, for example this little text:
'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
With RLE this would be compressed to '5a', or possibly something like '5*a'. This assumes the run length is stored as a single byte: the ASCII code for '5' is 53, so the byte '5' means "repeat 53 times". With that you just squeezed 53 bytes into 2.
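
The 53-byte example can be reproduced with a very small RLE sketch. This assumes one particular layout that is not the only option: each run is written as a count byte followed by the value byte, with runs capped at 255. The function names are made up for illustration.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (count, value) byte pairs; each run is capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte matches and the count still fits in one byte.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

def rle_decode(encoded: bytes) -> bytes:
    """Reverse rle_encode: expand each (count, value) pair."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)

print(rle_encode(b"a" * 53))   # count byte 53 is ASCII '5', followed by 'a'
print(len(rle_encode(b"abc")))  # no repeats: the output is twice as long as the input
```

Note the second line: with this layout, data without runs actually grows, which is why RLE on its own only suits very repetitive input.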

Hardest to compress are files which use almost all characters in the ASCII set (yes I know, I'm making an assumption), where the characters are in a pseudorandom order and each character occurs roughly the same number of times in the file.
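
A quick way to see both extremes is to feed them to an existing compressor. The sketch below uses Python's standard `zlib` module and `os.urandom` as a stand-in for a "pseudorandom" file; the `ratio` helper and the 100,000-byte size are arbitrary choices for this example.

```python
import os
import zlib

def ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, 9)) / len(data)

repetitive = b"a" * 100_000          # one byte value repeated
random_bytes = os.urandom(100_000)   # all byte values, no usable pattern

print("repetitive:", ratio(repetitive))
print("random:    ", ratio(random_bytes))
```

The repetitive input should shrink to a tiny fraction of its size, while the random input should come out about the same size or even slightly larger because of format overhead.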
• 02-11-2006