Re-reading the spec, now I understand it and I see I haven't honored that "all code lengths form a single sequence"
Ah well, better fix it up...
Re-reading the spec, now I understand it and I see I haven't honored that "all code lengths form a single sequence"
Ah well, better fix it up...
Dunno if it's of value to you but this is where I've got with just the deflation of the huffman data, currently got a bug with UIDs, specifically with getting the right length for length/literal symbols. As soon as that's fixed the only thing left to do is implement the look back through deflated data part of the code.
Also since you're working on PNGs anyways this resource should be helpful (don't remember if I mentioned it in a previous post but here it is anyways):
PngSuite - the official set of PNG test images
I ran you code against a test file, and got this output:
Code:main.c:281: symbols[ 0] (0x5560170045f0): src = 0, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 4, uid = 0010 main.c:281: symbols[ 4] (0x556017004690): src = 4, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 4, uid = 0011 main.c:281: symbols[ 5] (0x5560170046b8): src = 5, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 4, uid = 0100 main.c:281: symbols[ 6] (0x5560170046e0): src = 6, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 4, uid = 0101 main.c:281: symbols[ 7] (0x556017004708): src = 7, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 3, uid = 110 main.c:281: symbols[ 8] (0x556017004730): src = 8, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 2, uid = 10 main.c:281: symbols[ 9] (0x556017004758): src = 9, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 3, uid = 111 main.c:281: symbols[ 10] (0x556017004780): src = 10, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 4, uid = 0110 main.c:281: symbols[ 11] (0x5560170047a8): src = 11, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 3, uid = 000
The first shortest Hufman code should be always be all zeros.
So assuming your length table is correct, here's how they should be generated:
We start with "length 2" codes - "00" -
Assign to the length 2 codes, in order they appear in the table:
At this point next symbol that would be assigned is "01". Add a '0' to it and do the threes:Code:main.c:281: symbols[ 0] len = 4, main.c:281: symbols[ 4] len = 4, main.c:281: symbols[ 5] len = 4, main.c:281: symbols[ 6] len = 4, main.c:281: symbols[ 7] len = 3, main.c:281: symbols[ 8] len = 2, 00 main.c:281: symbols[ 9] len = 3, main.c:281: symbols[ 10] len = 4, main.c:281: symbols[ 11] len = 3,
At this point next symbol that would be assigned is "101". Add a '0' to it and do the fours:Code:main.c:281: symbols[ 0] main.c:281: symbols[ 4] main.c:281: symbols[ 5] main.c:281: symbols[ 6] main.c:281: symbols[ 7] 010 main.c:281: symbols[ 8] 00 main.c:281: symbols[ 9] 011 main.c:281: symbols[ 10] main.c:281: symbols[ 11] 100
Then do the same again (Lengths 4):
All done (but we seem to be one code short (1111)!Code:main.c:281: symbols[ 0] 1010 main.c:281: symbols[ 4] 1011 main.c:281: symbols[ 5] 1100 main.c:281: symbols[ 6] 1101 main.c:281: symbols[ 7] 010 main.c:281: symbols[ 8] 00 main.c:281: symbols[ 9] 011 main.c:281: symbols[ 10] 1110 main.c:281: symbols[ 11] 100
Last edited by hamster_nz; 07-28-2021 at 05:16 PM.
Here's my code for that, with error checking removed for clarity:
'reverse(x,y)' just flips the last 'y' bits around in 'x' - i.e. reverses the bit ordering in the lowest 'y' bits.Code:void codetable_generate(struct Codetable_entry *table, size_t len) { uint32_t next_code = 0; for(uint32_t pass_len = 1; pass_len < 16; pass_len++) { for(size_t i = 0; i < len; i++) { if(table[i].length == pass_len) { table[i].code = reverse(next_code,pass_len); next_code++; } } next_code <<= 1; } }
I *think* this is what you want in BuildCodes():
Seems to be better:Code:void BuildCodes( BUFF *Symbols, uint longest ) { uint uid = 0; HUFFMAN_SYMBOL *symbols = Symbols->addr; for ( uint len = 1; len <= longest; ++len ) { for ( uint i = 0; i < Symbols->used; ++i ) { if ( symbols[i].len == len ) symbols[i].uid = uid++; } uid <<= 1; } }
Bit ordering might need to be reversed - I didn't look too far into it..Code:main.c:281: Hidden Symbols, used = 13 main.c:281: symbols[ 0] (0x55d7b4b7a5f0): src = 0, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 4, uid = 1010 main.c:281: symbols[ 4] (0x55d7b4b7a690): src = 4, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 4, uid = 1011 main.c:281: symbols[ 5] (0x55d7b4b7a6b8): src = 5, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 4, uid = 1100 main.c:281: symbols[ 6] (0x55d7b4b7a6e0): src = 6, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 4, uid = 1101 main.c:281: symbols[ 7] (0x55d7b4b7a708): src = 7, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 3, uid = 010 main.c:281: symbols[ 8] (0x55d7b4b7a730): src = 8, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 2, uid = 00 main.c:281: symbols[ 9] (0x55d7b4b7a758): src = 9, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 3, uid = 011 main.c:281: symbols[ 10] (0x55d7b4b7a780): src = 10, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 4, uid = 1110 main.c:281: symbols[ 11] (0x55d7b4b7a7a8): src = 11, use = true, sym = '' ( 0), lit = '' ( 0), get = 0, len = 3, uid = 100 main.c:285: Hidden Symbols, used = 13
Hit a crash with the final section of my code, since I need to do some shopping I figured I dump it here and see if I get lucky by the time I get back and someone both felt like looking through it and found the bug. If you do look I recommend just downloading the attached file, compile and search for the "Deflating data..." string, at least you'll have the syntax highlighting then.
A reminder of the document I was refering to when making this code:
Understanding gzip
The Code:
The Output:Code:i = 0; j = 0; while ( stream.num < stream.max ) { leng_symbol = IdentifyNextSymbol( &stream, &Tree, leng_longest, true ); if ( !leng_symbol ) { PrintStreamDetails( &stream ); return Return( ret, EINVAL ); } //puts("Found:\n"); PrintSymbol( leng_symbol, __LINE__, leng_symbol->uid ); if ( leng_symbol->sym < 256 ) { j = i++; lazy[j] = leng_symbol->sym; } else { uint lookup, stop; dist_symbol = IdentifyNextSymbol( &stream, &Dist, dist_longest, true ); if ( !dist_symbol ) { PrintStreamDetails( &stream ); return Return( ret, EINVAL ); } lookup = CopyStreamBits( &stream, dist_symbol->get, true ); lookup += hufd_extra_bits_array[dist_symbol->src]; lookup = j - lookup; for ( j = 0; j < dist_symbol->len; ++j ) lazy[i++] = lazy[lookup++]; j = i - 1; PrintSymbol( dist_symbol, __LINE__, dist_symbol->uid ); } }
Code:./a.out aba.gz path = 'aba.gz' PrintStreamDetails( 0x7fffcc29b530 ): ptr = 0x56257145d480, pos = 0, num = 0, max = 328, fed = 0, got = PrintBytes( 0x56257145d480, 41, 16 ) 1F 8B 8 0 0 0 0 0 0 3 1D C6 49 1 0 0 40 C0 AC A3 7F 88 3D 3C 20 2A 97 9D 37 5E 1D C 29 34 94 23 0 0 0 PrintStreamDetails( 0x7fffcc29b530 ): ptr = 0x56257145d480, pos = 136, num = 80, max = 328, fed = 56, got = 00010000000000000000000000000001010010011100011000011101 gzip.magic = 1F8B, gzip.format = 8, gzip.flags = 0, gzip.mtime = 0, zlib.xflags = 0, zlib.system = 3 gzip.flag_TEXT = false, gzip.flag_HCRC = false, gzip.flag_MORE = false, gzip.flag_NAME = false, gzip.flag_NOTE = false, gzip.flag_RESERVED = 0 PrintStreamDetails( 0x7fffcc29b530 ): ptr = 0x56257145d480, pos = 136, num = 80, max = 296, fed = 56, got = 00010000000000000000000000000001010010011100011000011101 last = true, type = 2 lengc = 260, distc = 7, codec = 18, left = 60 pos = 212, max = 296, count = 267 left = 61, byte = 18, bit = 7 main.c:757: Code Symbols, used = 16 main.c:757: symbols[ 0] (0x5625714604d0): src = 2, use = true, sym = 'ÿ' ( 0), lit = 'ÿ' ( 0), get = 0, cpy = 0, len = 1, uid = 0 main.c:757: symbols[ 2] (0x562571460520): src = 18, use = true, sym = 'ÿ' ( 0), lit = 'ÿ' ( 0), get = 7, cpy = 11, len = 2, uid = 10 main.c:757: symbols[ 12] (0x5625714606b0): src = 1, use = true, sym = 'ÿ' ( 0), lit = 'ÿ' ( 0), get = 0, cpy = 0, len = 4, uid = 1100 main.c:757: symbols[ 13] (0x5625714606d8): src = 4, use = true, sym = 'ÿ' ( 0), lit = 'ÿ' ( 0), get = 0, cpy = 0, len = 4, uid = 1101 main.c:757: symbols[ 14] (0x562571460700): src = 16, use = true, sym = 'ÿ' ( 0), lit = 'ÿ' ( 0), get = 2, cpy = 3, len = 4, uid = 1110 main.c:757: symbols[ 15] (0x562571460728): src = 17, use = true, sym = 'ÿ' ( 0), lit = 'ÿ' ( 0), get = 3, cpy = 3, len = 4, uid = 1111 main.c:878: RdHuffmanLengDistCodes( 0x7fffcc29b530, 0x56256f60e420, 0x56256f60e440, 260, 0 ) main.c:771: Length Symbols, used = 16 main.c:771: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:771: symbols[ 2] (0x562571460820): src = 98, use = true, sym = 'b' ( 98), lit = 'b' ( 98), get = 0, cpy = 0, len = 2, uid = 10 main.c:771: symbols[ 12] (0x5625714609b0): src = 256, use = true, sym = 'ÿ' (256), lit = 'ÿ' (256), get = 0, cpy = 0, len = 4, uid = 1100 main.c:771: symbols[ 13] (0x5625714609d8): src = 257, use = true, sym = 'ÿ' (257), lit = 'ÿ' (257), get = 2, cpy = 0, len = 4, uid = 1101 main.c:771: symbols[ 14] (0x562571460a00): src = 258, use = true, sym = 'ÿ' (257), lit = 'ÿ' (258), get = 2, cpy = 0, len = 4, uid = 1110 main.c:771: symbols[ 15] (0x562571460a28): src = 259, use = true, sym = 'ÿ' (257), lit = 'ÿ' (259), get = 2, cpy = 0, len = 4, uid = 1111 main.c:878: RdHuffmanLengDistCodes( 0x7fffcc29b530, 0x56256f60e420, 0x56256f60e460, 7, 260 ) main.c:785: Distance Symbols, used = 4 main.c:785: symbols[ 0] (0x562571460a60): src = 0, use = true, sym = 'ÿ' (260), lit = 'ÿ' (260), get = 0, cpy = 0, len = 2, uid = 00 main.c:785: symbols[ 1] (0x562571460a88): src = 4, use = true, sym = 'ÿ' (264), lit = 'ÿ' (264), get = 0, cpy = 0, len = 2, uid = 01 main.c:785: symbols[ 2] (0x562571460ab0): src = 5, use = true, sym = 'ÿ' (265), lit = 'ÿ' (265), get = 0, cpy = 0, len = 2, uid = 10 main.c:785: symbols[ 3] (0x562571460ad8): src = 6, use = true, sym = 'ÿ' (266), lit = 'ÿ' (266), get = 0, cpy = 0, len = 2, uid = 11 main.c:806: main() Deflating data... main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 2] (0x562571460820): src = 98, use = true, sym = 'b' ( 98), lit = 'b' ( 98), get = 0, cpy = 0, len = 2, uid = 10 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 double free or corruption (out) main.c:820: symbols[ 2] (0x562571460820): src = 98, use = true, sym = 'b' ( 98), lit = 'b' ( 98), get = 0, cpy = 0, len = 2, uid = 10 main.c:820: symbols[ 2] (0x562571460820): src = 98, use = true, sym = 'b' ( 98), lit = 'b' ( 98), get = 0, cpy = 0, len = 2, uid = 10 main.c:820: symbols[ 2] (0x562571460820): src = 98, use = true, sym = 'b' ( 98), lit = 'b' ( 98), get = 0, cpy = 0, len = 2, uid = 10 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 14] (0x562571460a00): src = 258, use = true, sym = 'ÿ' (257), lit = 'ÿ' (258), get = 2, cpy = 0, len = 4, uid = 1110 main.c:847: symbols[ 1] (0x562571460a88): src = 4, use = true, sym = 'ÿ' (264), lit = 'ÿ' (264), get = 0, cpy = 0, len = 2, uid = 01 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 13] (0x5625714609d8): src = 257, use = true, sym = 'ÿ' (257), lit = 'ÿ' (257), get = 2, cpy = 0, len = 4, uid = 1101 main.c:847: symbols[ 1] (0x562571460a88): src = 4, use = true, sym = 'ÿ' (264), lit = 'ÿ' (264), get = 0, cpy = 0, len = 2, uid = 01 main.c:820: symbols[ 2] (0x562571460820): src = 98, use = true, sym = 'b' ( 98), lit = 'b' ( 98), get = 0, cpy = 0, len = 2, uid = 10 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 15] (0x562571460a28): src = 259, use = true, sym = 'ÿ' (257), lit = 'ÿ' (259), get = 2, cpy = 0, len = 4, uid = 1111 main.c:847: symbols[ 0] (0x562571460a60): src = 0, use = true, sym = 'ÿ' (260), lit = 'ÿ' (260), get = 0, cpy = 0, len = 2, uid = 00 main.c:820: symbols[ 12] (0x5625714609b0): src = 256, use = true, sym = 'ÿ' (256), lit = 'ÿ' (256), get = 0, cpy = 0, len = 4, uid = 1100 main.c:847: symbols[ 0] (0x562571460a60): src = 0, use = true, sym = 'ÿ' (260), lit = 'ÿ' (260), get = 0, cpy = 0, len = 2, uid = 00 main.c:820: symbols[ 15] (0x562571460a28): src = 259, use = true, sym = 'ÿ' (257), lit = 'ÿ' (259), get = 2, cpy = 0, len = 4, uid = 1111 main.c:847: symbols[ 0] (0x562571460a60): src = 0, use = true, sym = 'ÿ' (260), lit = 'ÿ' (260), get = 0, cpy = 0, len = 2, uid = 00 main.c:820: symbols[ 2] (0x562571460820): src = 98, use = true, sym = 'b' ( 98), lit = 'b' ( 98), get = 0, cpy = 0, len = 2, uid = 10 main.c:820: symbols[ 2] (0x562571460820): src = 98, use = true, sym = 'b' ( 98), lit = 'b' ( 98), get = 0, cpy = 0, len = 2, uid = 10 main.c:820: symbols[ 14] (0x562571460a00): src = 258, use = true, sym = 'ÿ' (257), lit = 'ÿ' (258), get = 2, cpy = 0, len = 4, uid = 1110 main.c:847: symbols[ 0] (0x562571460a60): src = 0, use = true, sym = 'ÿ' (260), lit = 'ÿ' (260), get = 0, cpy = 0, len = 2, uid = 00 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 12] (0x5625714609b0): src = 256, use = true, sym = 'ÿ' (256), lit = 'ÿ' (256), get = 0, cpy = 0, len = 4, uid = 1100 main.c:847: symbols[ 0] (0x562571460a60): src = 0, use = true, sym = 'ÿ' (260), lit = 'ÿ' (260), get = 0, cpy = 0, len = 2, uid = 00 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 0] (0x5625714607d0): src = 97, use = true, sym = 'a' ( 97), lit = 'a' ( 97), get = 0, cpy = 0, len = 1, uid = 0 main.c:820: symbols[ 14] (0x562571460a00): src = 258, use = true, sym = 'ÿ' (257), lit = 'ÿ' (258), get = 2, cpy = 0, len = 4, uid = 1110 main.c:847: symbols[ 1] (0x56257 make: *** [GNUmakefile:2: run] Aborted (core dumped) Compilation failed.
Judging by the lack of replies I take it no one felt like it, no biggie, eventually "fixed" it, I say fixed because I got passed the section only to encounter a crash during the attempt to free memory on the way out, here's my current code (with mini makefile and the test archives)
deflate.zip - Google Drive
I just tested it with a gz file I made with "echo test | gzip > a.gz" and it crashed out.
Cause was around line 680 - with a buffer overflow.
lengi is unsigned, so if CopyStreamBits returns 0, then lengi gets set to 4294967295, and of course leng_extra_bits_array[4294967295] is a crash.Code:lengi = RevBits( CopyStreamBits( &stream, 7, true ), 7 ) - 1; bits = leng_extra_bits_array[lengi];
Strange, that should only be reached under a type 1 scenario, the one used for the "hello hello hello hello" test (hello.gz in the zip I provided), well either way I should probably add a check that it's in bounds anyways
Edit:
For now, to get the output, add this after the assignment:
Edit 2: btw I'm going to bed in case anyone responds after I've shut the computer down (need to reset after an update anyways)Code:if ( lengi >= 32 ) return Return( ret, err );
Last edited by awsdert; 07-29-2021 at 05:15 PM.
Oh, thanks for that link to the test images... It is proving invaluable.
One thing I find really fascinating about zlib compression is that nowhere in the file can you point to a set of bits and say "those bits encode that letter" - so if the compressed file contains a capital A there never is 8 bits that represent the capital A symbol.
It's all indirectly implied.
What are you talking about? 0 fits in an unsigned int.engi is unsigned, so if CopyStreamBits returns 0, then lengi gets set to 4294967295, and of course leng_extra_bits_array[4294967295] is a crash.
Output:Code:#include <stdio.h> int f() { return 0; } int main() { unsigned int lengi = f(); printf("lengi: %u", lengi); }
lengi: 0
No I didn't. I just saw the comment.