# PKWare compression/decompression

• 03-04-2012
TAZIN
PKWare compression/decompression
I'm hoping someone here can shed some further insight on the PKWare compression/decompression stuff. Back in 2001 Ben Rudiak-Gould posted an outline of how the PKWare decompression works on the Google group comp.compression BBS. In 2003 Mark Adler wrote a simple decompressor based off of Ben's post which included a correction to Ben's outline.
First off let me say that this is way out of my league of comprehension but I hope any replies may help me further inch along in understanding this stuff. Below is a small corrected portion of Ben's original post:

Here's a sample compressed stream and how it would be decoded.
The stream is 00 04 82 24 25 8f 80 7f.
The first byte of the header is 0, so the fixed-width representation
is used for literal bytes.
The second byte is 4, so the dictionary is 1K in size.
The bitstream portion breaks down as follows:
0 10000010 literal byte 41 (ASCII 'A')
0 10010010 literal byte 49 (ASCII 'I')
1 001001 111000 copy 11 bytes starting at dictionary byte 1
(counting from the end starting with 0)
1 000000011111111 end of stream
0 padding to multiple of 8 bits (ignored)
This stream would decompress to "AIAIAIAIAIAIA" (without the quotes).

What I don't understand is the correlation between the hex digits in the bit stream and ASCII characters? How does hex 82 become ASCII 'A' for the data part in the example stream? Just of of curiousity, would anyone know what the bit stream would look like for "ABCD" (without the quotes) given that there all literal bytes with no length/distance pair?

Thanks.
• 03-04-2012
Salem
> 10000010
Which is 0x41, if you read right to left.
• 03-04-2012
TAZIN
Ok, that I see. So, hex 24 would be 100100 which is 0x49 but instead of padding it with zeros to make it eight-bit there tacking on 10 to make 10010010. So where does the 10 come from?
• 03-04-2012
MK27
Quote:

Originally Posted by TAZIN
Ok, that I see. So, hex 24 would be 100100 which is 0x49 but instead of padding it with zeros to make it eight-bit there tacking on 10 to make 10010010. So where does the 10 come from?

0x is hex notation. I think you mean, 0x24 (hex)= 49 (decimal), except that is not true. 0x24 = 36. Which 100100 is 36. But 10010010 = 73 = 0x49.

Quote:

Just of of curiousity, would anyone know what the bit stream would look like for "ABCD"
Code:

```#include <bitset> #include <iostream> using namespace std; int main (void) {         bitset<8> ABCD[4] = { 'A', 'B', 'C', 'D' };         for (bitset<8> bs : ABCD) {  // C++11 range based for                 cout << bs << endl;         } }```
01000001
01000010
01000011
01000100
• 03-04-2012
TAZIN
Yeah, I screwed up on that. ASCII 'I' is decimal: 73, hex: 49. So when I put 73 in Windows calculator I get 1001001, and the extra 0 must be the padding to make it eight-bit little endian.
• 03-04-2012
MK27
Quote:

Originally Posted by TAZIN
Yeah, I screwed up on that. ASCII 'I' is decimal: 73, hex: 49. So when I put 73 in Windows calculator I get 1001001, and the extra 0 must be the padding to make it eight-bit little endian.

No, since it's a stream, if it's MSB first (the "extra 0"), that's the bit level equivalent of big endian. You could, I guess, have big or little endian byte ordering (if applicable) with that.

Bit Endianness

Beware that the windows calculator is probably outputting LSB first, as with the C++ bitset output (should have mentioned that, whoops). 1001001 is a 7-bit binary palindrome ;) But if 10010010 = 73, then it's an MSB first ordered octet. M=most, L=least Significant Bit.
• 03-04-2012
TAZIN
Thanks MK27 for all the insight. Like I mentioned early on most of this stuff is way over my head. I'm just trying to piece together the bits of info that both Ben and Mark stated on this subject. At one point Mark mentions that bits are stored in bytes from the least significant bit to the most significant bit so that's were I got the little endian byte order stuff from.
• 03-04-2012
MK27
Quote:

Originally Posted by TAZIN
Thanks MK27 for all the insight. Like I mentioned early on most of this stuff is way over my head. I'm just trying to piece together the bits of info that both Ben and Mark stated on this subject. At one point Mark mentions that bits are stored in bytes from the least significant bit to the most significant bit so that's were I got the little endian byte order stuff from.

I get confused about bit endianess because it is generally irrelevant, but in any case, if there is an issue then it shouldn't be hard to sort out. Just don't test with binary palindromes ;)
• 03-04-2012
TAZIN
Still having a tuff time grasping this stuff. I see some of the connecting dots but the big picture is still a blur. I know I'm not the brightest bulb on the planet, but geez I never thought I was this stupid. Still trying to wrap my head around how hex 24 (used in there example) becomes ASCII 'I'. I see the similarities between hex 24 (100100) and hex 49 (1001001) which is ASCII 'I'. I guess what it simply boils down to is if they wanted to produce the ASCII character 'I' then why didn't they use hex 49 instead of hex 24? Is it the eight-bit thing that forces them to use hex 24? The other thing that makes it difficult to sort out is that it's not like a "plug-n-play" type of thing....So, you cannot change the bit stream to 00 04 82 82 25 8f 80 7f and expect an output of "AAAAAAAAAAAAA". But at the same time you can do 00 04 88 24 25 8f 80 7f and get "DIDIDIDIDIDID".... or 00 04 84 24 25 8f 80 7f and get "BIBIBIBIBIBIB".
• 03-05-2012
MK27
Well, you are trying to do something slightly obscure in that PKzip is not open source, and you are trying to decipher its data on the basis of some sketchy looking notes.

However, there are open source implementations of zip/unzip around. Have you thought of looking at one of those? That will also make it easier for you to find and communicate with people who already understand this stuff.

7-Zip