Thread: PKWare compression/decompression

  1. #1
    Registered User
    Join Date
    Jun 2009
    Posts
    93

    PKWare compression/decompression

    I'm hoping someone here can shed some further insight on the PKWare compression/decompression stuff. Back in 2001 Ben Rudiak-Gould posted an outline of how the PKWare decompression works on the Google group comp.compression BBS. In 2003 Mark Adler wrote a simple decompressor based off of Ben's post which included a correction to Ben's outline.
    First off let me say that this is way out of my league of comprehension but I hope any replies may help me further inch along in understanding this stuff. Below is a small corrected portion of Ben's original post:

    Here's a sample compressed stream and how it would be decoded.
    The stream is 00 04 82 24 25 8f 80 7f.
    The first byte of the header is 0, so the fixed-width representation
    is used for literal bytes.
    The second byte is 4, so the dictionary is 1K in size.
    The bitstream portion breaks down as follows:
    0 10000010 literal byte 41 (ASCII 'A')
    0 10010010 literal byte 49 (ASCII 'I')
    1 001001 111000 copy 11 bytes starting at dictionary byte 1
    (counting from the end starting with 0)
    1 000000011111111 end of stream
    0 padding to multiple of 8 bits (ignored)
    This stream would decompress to "AIAIAIAIAIAIA" (without the quotes).


    What I don't understand is the correlation between the hex digits in the bit stream and ASCII characters? How does hex 82 become ASCII 'A' for the data part in the example stream? Just of of curiousity, would anyone know what the bit stream would look like for "ABCD" (without the quotes) given that there all literal bytes with no length/distance pair?

    Thanks.

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,661
    > 10000010
    Which is 0x41, if you read right to left.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    Jun 2009
    Posts
    93
    Ok, that I see. So, hex 24 would be 100100 which is 0x49 but instead of padding it with zeros to make it eight-bit there tacking on 10 to make 10010010. So where does the 10 come from?

  4. #4
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by TAZIN View Post
    Ok, that I see. So, hex 24 would be 100100 which is 0x49 but instead of padding it with zeros to make it eight-bit there tacking on 10 to make 10010010. So where does the 10 come from?
    0x is hex notation. I think you mean, 0x24 (hex)= 49 (decimal), except that is not true. 0x24 = 36. Which 100100 is 36. But 10010010 = 73 = 0x49.

    Just of of curiousity, would anyone know what the bit stream would look like for "ABCD"
    Code:
    #include <bitset>
    #include <iostream>
    
    using namespace std;
    
    int main (void) {
    	bitset<8> ABCD[4] = { 'A', 'B', 'C', 'D' };
    	for (bitset<8> bs : ABCD) {  // C++11 range based for
    		cout << bs << endl;
    	}
    }
    01000001
    01000010
    01000011
    01000100
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  5. #5
    Registered User
    Join Date
    Jun 2009
    Posts
    93
    Yeah, I screwed up on that. ASCII 'I' is decimal: 73, hex: 49. So when I put 73 in Windows calculator I get 1001001, and the extra 0 must be the padding to make it eight-bit little endian.

  6. #6
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by TAZIN View Post
    Yeah, I screwed up on that. ASCII 'I' is decimal: 73, hex: 49. So when I put 73 in Windows calculator I get 1001001, and the extra 0 must be the padding to make it eight-bit little endian.
    No, since it's a stream, if it's MSB first (the "extra 0"), that's the bit level equivalent of big endian. You could, I guess, have big or little endian byte ordering (if applicable) with that.

    Bit Endianness

    Beware that the windows calculator is probably outputting LSB first, as with the C++ bitset output (should have mentioned that, whoops). 1001001 is a 7-bit binary palindrome But if 10010010 = 73, then it's an MSB first ordered octet. M=most, L=least Significant Bit.
    Last edited by MK27; 03-04-2012 at 02:59 PM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  7. #7
    Registered User
    Join Date
    Jun 2009
    Posts
    93
    Thanks MK27 for all the insight. Like I mentioned early on most of this stuff is way over my head. I'm just trying to piece together the bits of info that both Ben and Mark stated on this subject. At one point Mark mentions that bits are stored in bytes from the least significant bit to the most significant bit so that's were I got the little endian byte order stuff from.

  8. #8
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by TAZIN View Post
    Thanks MK27 for all the insight. Like I mentioned early on most of this stuff is way over my head. I'm just trying to piece together the bits of info that both Ben and Mark stated on this subject. At one point Mark mentions that bits are stored in bytes from the least significant bit to the most significant bit so that's were I got the little endian byte order stuff from.
    I get confused about bit endianess because it is generally irrelevant, but in any case, if there is an issue then it shouldn't be hard to sort out. Just don't test with binary palindromes
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  9. #9
    Registered User
    Join Date
    Jun 2009
    Posts
    93
    Still having a tuff time grasping this stuff. I see some of the connecting dots but the big picture is still a blur. I know I'm not the brightest bulb on the planet, but geez I never thought I was this stupid. Still trying to wrap my head around how hex 24 (used in there example) becomes ASCII 'I'. I see the similarities between hex 24 (100100) and hex 49 (1001001) which is ASCII 'I'. I guess what it simply boils down to is if they wanted to produce the ASCII character 'I' then why didn't they use hex 49 instead of hex 24? Is it the eight-bit thing that forces them to use hex 24? The other thing that makes it difficult to sort out is that it's not like a "plug-n-play" type of thing....So, you cannot change the bit stream to 00 04 82 82 25 8f 80 7f and expect an output of "AAAAAAAAAAAAA". But at the same time you can do 00 04 88 24 25 8f 80 7f and get "DIDIDIDIDIDID".... or 00 04 84 24 25 8f 80 7f and get "BIBIBIBIBIBIB".

  10. #10
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Well, you are trying to do something slightly obscure in that PKzip is not open source, and you are trying to decipher its data on the basis of some sketchy looking notes.

    However, there are open source implementations of zip/unzip around. Have you thought of looking at one of those? That will also make it easier for you to find and communicate with people who already understand this stuff.

    7-Zip
    Info-ZIP Home Page
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  11. #11
    Registered User
    Join Date
    Jun 2009
    Posts
    93
    Thanks MK27 for all your help and time. The final objective to all this is to try and gain enough knowledge of this particular type of compression so I can repair a corrupt file which is compressed using a very similar type of compression. I've looked at several documents and web sites regarding LZH compression and the thoery part I understand but the programming aspect is a bit out of my league. Thanks for the links...I check them out.

  12. #12
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by MK27 View Post
    I get confused about bit endianess
    As a footnote, lol, I think I had it backward in post #6 (if 10010010 = 73, it's LSB first, just like w is the first letter in whoops).
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Targa Decompression
    By carrotcake1029 in forum Tech Board
    Replies: 3
    Last Post: 07-17-2008, 12:48 AM
  2. Compression/Decompression Wave File and MP3
    By cindy_16051988 in forum Projects and Job Recruitment
    Replies: 51
    Last Post: 04-29-2006, 06:25 AM
  3. Reverse Decompression.
    By Coder87C in forum C++ Programming
    Replies: 2
    Last Post: 04-11-2005, 04:50 PM
  4. SIGN-UP: Text Compression/Decompression Contest
    By ygfperson in forum Contests Board
    Replies: 35
    Last Post: 10-19-2002, 12:19 PM
  5. Video Decompression
    By *Michelle* in forum A Brief History of Cprogramming.com
    Replies: 11
    Last Post: 06-27-2002, 05:47 PM