Increment gone wrong

**awsdert** · 07-30-2021

Yeah, I prefer a different compression I thought up that uses some similar concepts but is a LOT easier to deal with, basically the 1st 64 bits should read ".HUFFMAN" to make it clear that it's huffman type compression, the nibble that follows should have 1 added and be multiplied against 4 to get the size of each element, the next 12 bit are reserved for flags I haven't decided yet, the next bits (size decided by aforementioned (nibble + 1) * 4) are for the expected number of elements in the resulting data, I think I'll probably just make it a flag holder, next follows a list of indices (their size is as previously explained with the (nibble + 1) * 4) with a prefix bit that says to whether or not to use everything from the index itself to what comes before the next index provided so the shortest list could be 1 0000 1 1111, those 2 are the only required indices, these indices have a 1:1 relationship with the values to add into the buffer later, the huffman codes are to be generated in the same order, until the last index is reached, whatever your final next code is is gonna be the code of the last index anyways, it will be considered uncompressed data if more than a qtr of the available indices are used before getting the huffman codes in which case every value from here on until will be for the buffer or in the case of compressed data huffman tree to pull the value to put into the buffer, the will be only one prefix bit to these values, that tells you to either use it as uncompressed data/huffman code or to read 2 counts (again size is determine by (nibble + 1) * 4) to 1. minus from current value count and 2. begin copying existing values upto the 2nd count read

So "hello hello hello hello" could result in something like this (I've given interpretations, not bit representations in most cases):
".HUFFMAN" 10 00 24 [1 '\0', 0 1, 1 ' ', 0 33, 1 'e', 0 'f', 1 'h', 0 'i', 1 'l', 0 'm', 1 'o', 0 'p', 0 255] 0 'h' 0 'e' 0 'l' 0 'l' 0 'o' 0 ' ' 1 6 17 0 '\0'

I still need to do a bit more thinking on it but I like it better than zlib, it's a lot clearer what is what after all

**hamster_nz** · 07-30-2021

Originally Posted by awsdert

Yeah, I prefer a different compression I thought up that uses some similar concepts but is a LOT easier to deal with, basically the 1st 64 bits should read ".HUFFMAN" to make it clear that it's huffman type compression, the nibble that follows should have 1 added and be multiplied against 4 to get the size of each element, the next 12 bit are reserved for flags I haven't decided yet, the next bits (size decided by aforementioned (nibble + 1) * 4) are for the expected number of elements in the resulting data, I think I'll probably just make it a flag holder, next follows a list of indices (their size is as previously explained with the (nibble + 1) * 4) with a prefix bit that says to whether or not to use everything from the index itself to what comes before the next index provided so the shortest list could be 1 0000 1 1111, those 2 are the only required indices, these indices have a 1:1 relationship with the values to add into the buffer later, the huffman codes are to be generated in the same order, until the last index is reached, whatever your final next code is is gonna be the code of the last index anyways, it will be considered uncompressed data if more than a qtr of the available indices are used before getting the huffman codes in which case every value from here on until will be for the buffer or in the case of compressed data huffman tree to pull the value to put into the buffer, the will be only one prefix bit to these values, that tells you to either use it as uncompressed data/huffman code or to read 2 counts (again size is determine by (nibble + 1) * 4) to 1. minus from current value count and 2. begin copying existing values upto the 2nd count read

So "hello hello hello hello" could result in something like this (I've given interpretations, not bit representations in most cases):
".HUFFMAN" 10 00 24 [1 '\0', 0 1, 1 ' ', 0 33, 1 'e', 0 'f', 1 'h', 0 'i', 1 'l', 0 'm', 1 'o', 0 'p', 0 255] 0 'h' 0 'e' 0 'l' 0 'l' 0 'o' 0 ' ' 1 6 17 0 '\0'

I still need to do a bit more thinking on it but I like it better than zlib, it's a lot clearer what is what after all

I feel there has been a lot of research put into zlib's scheme that shouldn't be discounted so quickly. For example, the way the code tables are bootstrapped is pretty neat and effective.

Or for another perspective, it's always good to implement following an existing standard, as it means you don't have to write the documentation

**awsdert** · 07-30-2021

Why'd you think I copied some of the concepts? For example the look back feature, or the huffman code usage, there are places for storing the huffman codes sure, and that's part of the reason I left room for flags but in cases like this it is better not to store them at all, for the example of "hello hello hello hello" it is is also possible with my algorithm to go so small that it's almost just the string itself, for example:

".HUFFMAN" 10 01 24 0 'h' 0 'e' 0 'l' 0 'l' 0 'o' 0 ' ' 1 6 17 0 '\0'

64 + 4 + 12 + 8 + (8 * 1) + (7 * 8) + (2 * 8) results in 168 bits or 21 bytes, literally smaller than the source string which is something not even the zlib compression can do, there's even room for me to add another 3 bytes for extra flags if I deem necessary, and thinking about it I just decided on a method for psuedo infinite flags, just use a flag in that 1st nibble to say more flags expected, if I move the psuedo NO_LIST flag I cooked up in the example I just provided to the first nibble of flags then I can further reduce the size to 20 bytes, then if I introduce a pseudo NO_SIZE flag to that 1st nibble also then I can further reduce the bytes to 19 bytes, whereas gzip needed a whole 29 bytes which is contrary the purpose of compression

Edit: For streams that can't get an EOF error a lookup at the end can be introduced that is a simple 1 0000 0000, which basically says "because we are copying 0 values stop reading the stream here".

Edit 2: As another space saving if ".HUF" is available as a "magic number" then I can further reduce that 19 to 15 bytes, nearly half the original 24 bytes

**awsdert** · 07-30-2021

Fixed the crash, now my problem is missing data, this is what I get versus what I should get:

Code:

abaabbbaba       babaaaaba    bbbaa
abaabbbabaababbaababaaaabaaabbbbbaa

**hamster_nz** · 07-30-2021

zlib encoding for "hello hello hello hello" would be

3 bits - list block, uses default codeblook.
8 bits - h
8 bits - e
8 bits - l
8 bits - l
8 bits - o
8 bits - space
8 bits - Symbol 268 - repeat of length 17, including the extra bit
6 bits - distance code 4, distance of 6 bytes, including one extra bit
7 bits - End of block code

That makes 72 bits (9 bytes). to encode 184 bits (23 bytes).

Whatever metadata you want to wrap around it is pretty much a moot point - if it is 10 or 15 bytes won't make a difference when you are encoding things longer than a tens of bytes.

**hamster_nz** · 07-30-2021

PNG decoding progress going quite well - all the Basic images in PngSuite convert correctly to PPM.

Might look at interlacing tonight..

**awsdert** · 07-31-2021

Originally Posted by hamster_nz

zlib encoding for "hello hello hello hello" would be

3 bits - list block, uses default codeblook.
8 bits - h
8 bits - e
8 bits - l
8 bits - l
8 bits - o
8 bits - space
8 bits - Symbol 268 - repeat of length 17, including the extra bit
6 bits - distance code 4, distance of 6 bytes, including one extra bit
7 bits - End of block code

That makes 72 bits (9 bytes). to encode 184 bits (23 bytes).

Whatever metadata you want to wrap around it is pretty much a moot point - if it is 10 or 15 bytes won't make a difference when you are encoding things longer than a tens of bytes.

I was more thinking about the resulting file as a whole, sure there are cases where zlib is more appropriate, I was simply saying in this case picking zlib inside gzip results in wasted space whereas the encoding I thought up saves space as an archive (which is obviously the point of an archive, if I were to convert it to just an encoding gzip could make use of the scrapping the remain 4 bytes of the "magic number" it would bring the size down to 11 bytes, then if ( disinclude the null character as you did then it goes further to just barely under 11 bytes I think

4 + 4 + (7 * 1) + (6 * 8) + (2 * 8) = 4 + 4 + 7 + 48 + 16 = 79 / 8 = 9 r7 = 10 bytes

If I also introduce a further flag that tells of a multiplier for look back / copy bit lengths and reduce their default size to 4 bits also then I save another byte giving the same 9 bytes in a simpler form at 71 bits, if you restrain yourself by saying that further efficiency is impossible, you'll never find improvements.

Edit: BTW I did the math after stating barely under 11 bytes

**awsdert** · 07-31-2021

Since you've managed to implement the deflate algorithm correctly (unless I've misunderstood you somewhere) I'll ask you, can you see where I'm going wrong here?

Output of length/distance construction:

Code:

main.c:782: Code Symbols, foresee =  18, longest =  4, have =  32, used = 16
main.c:782: symbols[  0] (0x5598b41d14d0): src =   2, use = true, sym =   0, get = 0, cpy = 0, len = 1, name = '?', lit = 'ÿ' (  0), uid = 0
main.c:782: symbols[  2] (0x5598b41d1528): src =  18, use = true, sym =   0, get = 7, cpy = 11, len = 2, name = '?', lit = 'ÿ' (  0), uid = 10
main.c:782: symbols[ 12] (0x5598b41d16e0): src =   1, use = true, sym =   0, get = 0, cpy = 0, len = 4, name = '?', lit = 'ÿ' (  0), uid = 1100
main.c:782: symbols[ 13] (0x5598b41d170c): src =   4, use = true, sym =   0, get = 0, cpy = 0, len = 4, name = '?', lit = 'ÿ' (  0), uid = 1101
main.c:782: symbols[ 14] (0x5598b41d1738): src =  16, use = true, sym =   0, get = 2, cpy = 3, len = 4, name = '?', lit = 'ÿ' (  0), uid = 1110
main.c:782: symbols[ 15] (0x5598b41d1764): src =  17, use = true, sym =   0, get = 3, cpy = 3, len = 4, name = '?', lit = 'ÿ' (  0), uid = 1111
main.c:909: RdHuffmanLengDistCodes( 0x7fff047ee6a0, 0x5598b397a4c0, 0x5598b397a500, 0 )
main.c:958: symbols[ 12] (0x5598b41d16e0): src =   1, use = true, sym =   0, get = 0, cpy = 0, len = 4, name = 'Code Symbol', lit = 'ÿ' (  0), uid = 1100
main.c:970: symbols[ 97] (0x5598b41d2b0c): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = 'Leng Symbol', lit = 'a' ( 97), uid = 0
main.c:958: symbols[  0] (0x5598b41d14d0): src =   2, use = true, sym =   0, get = 0, cpy = 0, len = 1, name = 'Code Symbol', lit = 'ÿ' (  0), uid = 0
main.c:970: symbols[ 98] (0x5598b41d2b38): src =  98, use = true, sym =   0, get = 0, cpy = 2, len = 2, name = 'Leng Symbol', lit = 'b' ( 98), uid = 00
main.c:958: symbols[ 13] (0x5598b41d170c): src =   4, use = true, sym =   0, get = 0, cpy = 0, len = 4, name = 'Code Symbol', lit = 'ÿ' (  0), uid = 1101
main.c:970: symbols[256] (0x5598b41d4660): src = 256, use = true, sym =   0, get = 0, cpy = 4, len = 4, name = 'Leng Symbol', lit = 'ÿ' (256), uid = 0000
main.c:958: symbols[ 14] (0x5598b41d1738): src =  16, use = true, sym =   0, get = 2, cpy = 3, len = 4, name = 'Code Symbol', lit = 'ÿ' (  0), uid = 1110
main.c:970: symbols[257] (0x5598b41d468c): src = 257, use = true, sym =   3, get = 0, cpy = 0, len = 4, name = 'Leng Symbol', lit = 'ÿ' (257), uid = 0000
main.c:958: symbols[ 14] (0x5598b41d1738): src =  16, use = true, sym =   0, get = 2, cpy = 3, len = 4, name = 'Code Symbol', lit = 'ÿ' (  0), uid = 1110
main.c:970: symbols[258] (0x5598b41d46b8): src = 258, use = true, sym =   4, get = 0, cpy = 0, len = 4, name = 'Leng Symbol', lit = 'ÿ' (258), uid = 0000
main.c:958: symbols[ 14] (0x5598b41d1738): src =  16, use = true, sym =   0, get = 2, cpy = 3, len = 4, name = 'Code Symbol', lit = 'ÿ' (  0), uid = 1110
main.c:970: symbols[259] (0x5598b41d46e4): src = 259, use = true, sym =   5, get = 0, cpy = 0, len = 4, name = 'Leng Symbol', lit = 'ÿ' (259), uid = 0000
main.c:909: RdHuffmanLengDistCodes( 0x7fff047ee6a0, 0x5598b397a4c0, 0x5598b397a540, 260 )
main.c:958: symbols[  0] (0x5598b41d14d0): src =   2, use = true, sym =   0, get = 0, cpy = 0, len = 1, name = 'Code Symbol', lit = 'ÿ' (  0), uid = 0
main.c:963: symbols[  0] (0x5598b41d4750): src =   0, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = 'Dist Symbol', lit = 'ÿ' (260), uid = 00
main.c:958: symbols[  0] (0x5598b41d14d0): src =   2, use = true, sym =   0, get = 0, cpy = 0, len = 1, name = 'Code Symbol', lit = 'ÿ' (  0), uid = 0
main.c:963: symbols[  4] (0x5598b41d4800): src =   4, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = 'Dist Symbol', lit = 'ÿ' (264), uid = 00
main.c:958: symbols[  0] (0x5598b41d14d0): src =   2, use = true, sym =   0, get = 0, cpy = 0, len = 1, name = 'Code Symbol', lit = 'ÿ' (  0), uid = 0
main.c:963: symbols[  5] (0x5598b41d482c): src =   5, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = 'Dist Symbol', lit = 'ÿ' (265), uid = 00
main.c:958: symbols[  0] (0x5598b41d14d0): src =   2, use = true, sym =   0, get = 0, cpy = 0, len = 1, name = 'Code Symbol', lit = 'ÿ' (  0), uid = 0
main.c:963: symbols[  6] (0x5598b41d4858): src =   6, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = 'Dist Symbol', lit = 'ÿ' (266), uid = 00

Related code:

Code:

	while ( pos < Symbols->foresee && stream->num < stream->max )
	{
		uint i = 0, copy = 0;
		uintmax_t val;

		code = IdentifyNextSymbol( stream, Codes, true, false );

		if ( !code )
		{
#if 0
			Symbols->entries.used = pos;
			printf( PFX_FORMAT " pos = %u\n", PFX_VALUES, pos );
			PrintSymbols( Codes, __LINE__, "Code Symbols" );
			PrintSymbols( Symbols, __LINE__, name );
			PrintStreamDetails( stream, __LINE__ );
#endif
			return EINVAL;
		}

		//PrintStreamDetails( stream );
		val = CopyStreamBits( stream, code->get, true );

		//PrintSymbol( code, __LINE__, j );

		copy = (uint)(code->cpy + val);

		//if ( copy ) printf( "copy = %3u\n", copy );

		i = 0;

		do
		{
			if ( pos >= Symbols->entries.have )
			{
				printf( PFX_FORMAT " pos = %u\n", PFX_VALUES, pos );
				PrintSymbols( Symbols, __LINE__, name );
				return ERANGE;
			}

			symbol = symbols + pos;
			symbol->src = pos;
			symbol->lit = lit;

			if ( !copy || (!start_with_literal && lit >= 256) )
			{
				symbol->use = true;

				PrintSymbol( code, __LINE__, code->uid, "Code Symbol" );
				if ( start_with_literal )
				{
					symbol->len = code->len + 1;
					symbol->get = code->get;
					PrintSymbol( symbol, __LINE__, pos, "Dist Symbol" );
				}
				else
				{
					symbol->sym = (lit > 256) ? used : 0;
					symbol->len += (lit < 256) ? code->src : 4;
					symbol->cpy = (code->src < 16) ? code->src : 0;
					PrintSymbol( symbol, __LINE__, pos, "Leng Symbol" );
				}

				if ( symbol->len > Symbols->longest )
					Symbols->longest = symbol->len;

				++used;
			}

			++lit;
			++pos;
			++i;
		}
		while ( i < copy );
	}

Defaltation Output:

Code:

main.c:812: main() Deflating data...
main.c:826: symbols[  0] (0x5598b41d1a60): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = 'leng_symbol', lit = 'a' ( 97), uid = 0
main.c:826: symbols[  2] (0x5598b41d1ab8): src =  98, use = true, sym =   0, get = 0, cpy = 2, len = 2, name = 'leng_symbol', lit = 'b' ( 98), uid = 10
main.c:826: symbols[  0] (0x5598b41d1a60): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = 'leng_symbol', lit = 'a' ( 97), uid = 0
main.c:826: symbols[  0] (0x5598b41d1a60): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = 'leng_symbol', lit = 'a' ( 97), uid = 0
main.c:826: symbols[  2] (0x5598b41d1ab8): src =  98, use = true, sym =   0, get = 0, cpy = 2, len = 2, name = 'leng_symbol', lit = 'b' ( 98), uid = 10
main.c:826: symbols[  2] (0x5598b41d1ab8): src =  98, use = true, sym =   0, get = 0, cpy = 2, len = 2, name = 'leng_symbol', lit = 'b' ( 98), uid = 10
main.c:826: symbols[  2] (0x5598b41d1ab8): src =  98, use = true, sym =   0, get = 0, cpy = 2, len = 2, name = 'leng_symbol', lit = 'b' ( 98), uid = 10
main.c:826: symbols[  0] (0x5598b41d1a60): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = 'leng_symbol', lit = 'a' ( 97), uid = 0
main.c:826: symbols[ 14] (0x5598b41d1cc8): src = 258, use = true, sym =   4, get = 0, cpy = 0, len = 4, name = 'leng_symbol', lit = 'ÿ' (258), uid = 1110
main.c:856: symbols[  2] (0x5598b41d47a8): src =   5, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = 'dist_symbol', lit = 'ÿ' (265), uid = 10
main.c:861: lookup =   4
main.c:826: symbols[ 13] (0x5598b41d1c9c): src = 257, use = true, sym =   3, get = 0, cpy = 0, len = 4, name = 'leng_symbol', lit = 'ÿ' (257), uid = 1101
main.c:856: symbols[  3] (0x5598b41d47d4): src =   6, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = 'dist_symbol', lit = 'ÿ' (266), uid = 11
main.c:861: lookup =   3
main.c:826: symbols[ 15] (0x5598b41d1cf4): src = 259, use = true, sym =   5, get = 0, cpy = 0, len = 4, name = 'leng_symbol', lit = 'ÿ' (259), uid = 1111
main.c:856: symbols[  1] (0x5598b41d477c): src =   4, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = 'dist_symbol', lit = 'ÿ' (264), uid = 01
main.c:861: lookup =   6
main.c:826: symbols[  0] (0x5598b41d1a60): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = 'leng_symbol', lit = 'a' ( 97), uid = 0
main.c:826: symbols[  0] (0x5598b41d1a60): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = 'leng_symbol', lit = 'a' ( 97), uid = 0
main.c:826: symbols[  0] (0x5598b41d1a60): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = 'leng_symbol', lit = 'a' ( 97), uid = 0
main.c:826: symbols[ 15] (0x5598b41d1cf4): src = 259, use = true, sym =   5, get = 0, cpy = 0, len = 4, name = 'leng_symbol', lit = 'ÿ' (259), uid = 1111
main.c:856: symbols[  1] (0x5598b41d477c): src =   4, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = 'dist_symbol', lit = 'ÿ' (264), uid = 01
main.c:861: lookup =   5
main.c:826: symbols[  2] (0x5598b41d1ab8): src =  98, use = true, sym =   0, get = 0, cpy = 2, len = 2, name = 'leng_symbol', lit = 'b' ( 98), uid = 10
main.c:826: symbols[ 14] (0x5598b41d1cc8): src = 258, use = true, sym =   4, get = 0, cpy = 0, len = 4, name = 'leng_symbol', lit = 'ÿ' (258), uid = 1110
main.c:856: symbols[  0] (0x5598b41d4750): src =   0, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = 'dist_symbol', lit = 'ÿ' (260), uid = 00
main.c:861: lookup =   4
main.c:826: symbols[  0] (0x5598b41d1a60): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = 'leng_symbol', lit = 'a' ( 97), uid = 0
main.c:826: symbols[  0] (0x5598b41d1a60): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = 'leng_symbol', lit = 'a' ( 97), uid = 0
main.c:826: symbols[ 12] (0x5598b41d1c70): src = 256, use = true, sym =   0, get = 0, cpy = 4, len = 4, name = 'leng_symbol', lit = 'ÿ' (256), uid = 1100
PrintBytes( 0x5598b41d14a0, 0, 16 )
abaabbbaaaabaa

Related Code:

Code:

		else if ( type == 2 )
		{
			uint count = 0;
			uint i = 0, j = 0;
			HUFFMAN_SYMBOL *codes, *code, *leng_symbol, *dist_symbol;
			int err;

			LengSymbols.foresee = CopyStreamBits( &stream, 5, true ) + 257;
			DistSymbols.foresee = CopyStreamBits( &stream, 5, true ) + 1;
			CodeSymbols.foresee = CopyStreamBits( &stream, 4, true ) + 4;

			PrintSymbols( &DistSymbols, __LINE__, "Distance Symbols" );
			PrintSymbols( &LengSymbols, __LINE__, "Length Symbols" );
			PrintSymbols( &CodeSymbols, __LINE__, "Code Symbols" );
			PrintStreamDetails( &stream, __LINE__ );

			i = CodeSymbols.foresee;

			codes = ReAllocSymbols( &CodeSymbols, i < 32 ? 32 : i + 1 );

			if ( !codes )
				return Return( ret, ENOMEM );

			for ( i = 0; i < CodeSymbols.foresee; ++i )
			{
				static uint adjust[] =
				{
					16, 17, 18, 0,  8, 7,
					 9,  6, 10, 5, 11, 4,
					12,  3, 13, 2, 14, 1,
					15, 19
				};

				code = codes + adjust[i];

				code->src = adjust[i];
				code->get = hufd_extra_bits_array[code->src];
				code->cpy = hufd_repeat_bits_array[code->src];
				code->len = CopyStreamBits( &stream, 3, true );
				code->use = !!(code->len);

				if ( code->len > CodeSymbols.longest )
					CodeSymbols.longest = code->len;
			}

			CodeSymbols.entries.used = i + 1;

			count = LengSymbols.foresee + DistSymbols.foresee;
			printf( "pos = %u, max = %u, count = %u\n", stream.pos, stream.max, count );
			size = stream.num;
			printf( "left = %u, byte = %zu, bit = %zu\n", stream.fed , size / CHAR_BIT, size % CHAR_BIT );

			BuildCodes( &CodeSymbols );

			SortSymbolsByUID( &CodeSymbols );

			PrintSymbols( &CodeSymbols, __LINE__, "Code Symbols" );

			err = RdHuffmanLengDistCodes
			(
				&stream,
				&CodeSymbols,
				&LengSymbols,
				0,
				"Length Symbols"
			);

			if ( err )
				return Return( ret, err );

			//PrintSymbols( &LengSymbols, __LINE__, "Length Symbols" );

			err = RdHuffmanLengDistCodes
			(
				&stream,
				&CodeSymbols,
				&DistSymbols,
				LengSymbols.foresee,
				"Distance Symbols"
			);

			if ( err )
				return Return( ret, err );

			//PrintSymbols( &DistSymbols, __LINE__, "Distance Symbols" );

			printf("%s:%u: %s() Deflating data...\n", __FILE__, __LINE__, __func__ );

			i = 0; j = 0;
			while ( stream.num < stream.max )
			{
				leng_symbol = IdentifyNextSymbol( &stream, &LengSymbols, true, false );

				if ( !leng_symbol )
				{
					PrintStreamDetails( &stream, __LINE__ );
					return Return( ret, EINVAL );
				}

				//puts("Found:\n");
				PrintSymbol( leng_symbol, __LINE__, leng_symbol->uid, "leng_symbol" );

				if ( leng_symbol->lit == 256 )
					break;
				else if ( leng_symbol->lit < 256 )
				{
					if ( i == Lazy.size )
					{
						printf("%s:%u: Hit buffer limit!\n", __FILE__, __LINE__ );
						return Return( ret , ERANGE );
					}
					j = i++;
					lazy[j] = leng_symbol->lit;
				}
				else
				{
					uint dref;
					uint lookup;
					static uint extends[8] = { 0, 0, 0, 0, 1, 1, 2, 0 };

					dist_symbol = IdentifyNextSymbol( &stream, &DistSymbols, true, false );

					if ( !dist_symbol )
					{
						PrintStreamDetails( &stream, __LINE__ );
						return Return( ret, EINVAL );
					}

					dref = dist_symbol->src;

					PrintSymbol( dist_symbol, __LINE__, dist_symbol->uid, "dist_symbol" );

					lookup = leng_symbol->sym;
					lookup += CopyStreamBits( &stream, extends[dref], true );

					printf( "%s:%u: lookup = %3u\n", __FILE__, __LINE__, lookup );

					lookup = i - lookup;

					for ( j = 0; j < leng_symbol->cpy; ++j )
					{
						if ( i == Lazy.size )
						{
							printf("%s:%u: Hit buffer limit!\n", __FILE__, __LINE__ );
							return Return( ret , ERANGE );
						}
						lazy[i++] = lazy[lookup++];
					}

					j = i - 1;

					//PrintSymbol( dist_symbol, __LINE__, dist_symbol->uid );
				}
			}
		}

There's only 1 instance of "else if ( type == 2 )" in the attached file so if you want to get more information you can just search for that to get your starting point.

**hamster_nz** · 07-31-2021

The second table will never work - there are two entries of length 1, so that uses both codes '0' & '1', leaving no other possible codes for longer length codes.

A few posts ago supplied an edited bit of your code that assigned code symbols correctly, as well as a walk through of the assignment algorithm.

It would pay to revisit that, after working out why your table has too many entries with short lengths.

**hamster_nz** · 07-31-2021

If you can post a ASCII hex dump of *just* the zlib data I will turn up the debug level on my code and print out the code tables.

**hamster_nz** · 07-31-2021

I got a few spare minutes.

Attached is a program that should decode a .gz file less than 8kb, and if you add "-v" on the command line it will print out things like code tables/dictionaries along the way.

Code:

$ echo "Hi there" | gzip > a.gz

$ ./main2 a.gz
Have loaded 29 bytes from file
Last block = true
Compressed (default dictionary)
data: 48 'H'
data: 69 'i'
data: 20 ' '
data: 74 't'
data: 68 'h'
data: 65 'e'
data: 72 'r'
data: 65 'e'
data: 0a
  End of block 256
  This is the last block

It assumes that the zlib data start at byte 10.

Don't judge the code too harshly - it's just a hack to test my understanding, not production ready.

So here I'm gziping up the source file "inflate.c", then decoding it with 'verbose' enabled:

Code:

$ cat inflate.c | gzip > abc.gz
$ ./main2 -v abc.gz | more

And here is the output showing the code tables that were used:

Code:

Compressed (dynamic dictionary)
  HLIT 282, HDIST 29, HCLEN 13
Codelen table
    0:  4 1100
    4:  4 1101
    5:  3 000
    6:  3 001
    7:  3 010
    8:  3 011
    9:  3 100
   10:  3 101
   11:  4 1110
   12:  5 11110
   16:  6 111110
   17:  7 1111110
   18:  7 1111111
Reading in literal/length plus distance dictionary lengths
Literal/Repeat Length table
   10:  7 1011010
   32:  5 00100
   33:  9 111101000
   34:  8 11100010
   35: 12 111111111100
   37:  8 11100011
   38:  8 11100100
   39:  9 111101001
   40:  7 1011011
   41:  6 011110
   42:  9 111101010
   43:  7 1011100
   44:  6 011111
   45:  8 11100101
   46:  9 111101011
   47:  9 111101100
   48:  7 1011101
   49:  6 100000
   50:  6 100001
   51:  7 1011110
   52:  7 1011111
   53:  7 1100000
   54:  7 1100001
   55:  7 1100010
   56:  7 1100011
   57:  7 1100100
   58:  8 11100110
   59:  7 1100101
   60:  8 11100111
   61:  8 11101000
   62:  8 11101001
......

**awsdert** · 07-31-2021

Originally Posted by hamster_nz

The second table will never work - there are two entries of length 1, so that uses both codes '0' & '1', leaving no other possible codes for longer length codes.

A few posts ago supplied an edited bit of your code that assigned code symbols correctly, as well as a walk through of the assignment algorithm.

It would pay to revisit that, after working out why your table has too many entries with short lengths.

I've just double checked the data (no modifications from what I posted) and there's no duplicate lengths, are you sure you're not confusing symbols from 2 different tables? I added the names to make clear which table/tree they're from.

**awsdert** · 07-31-2021

Originally Posted by hamster_nz

If you can post a ASCII hex dump of *just* the zlib data I will turn up the debug level on my code and print out the code tables.

Did you mean this?

Code:

main.c:782: Code Symbols, foresee =  18, longest =  4, have =  32, used = 16
main.c:782: symbols[  0] (0x5598b41d14d0): src =   2, use = true, sym =   0, get = 0, cpy = 0, len = 1, name = '?', lit = 'ÿ' (  0), uid = 0
main.c:782: symbols[  2] (0x5598b41d1528): src =  18, use = true, sym =   0, get = 7, cpy = 11, len = 2, name = '?', lit = 'ÿ' (  0), uid = 10
main.c:782: symbols[ 12] (0x5598b41d16e0): src =   1, use = true, sym =   0, get = 0, cpy = 0, len = 4, name = '?', lit = 'ÿ' (  0), uid = 1100
main.c:782: symbols[ 13] (0x5598b41d170c): src =   4, use = true, sym =   0, get = 0, cpy = 0, len = 4, name = '?', lit = 'ÿ' (  0), uid = 1101
main.c:782: symbols[ 14] (0x5598b41d1738): src =  16, use = true, sym =   0, get = 2, cpy = 3, len = 4, name = '?', lit = 'ÿ' (  0), uid = 1110
main.c:782: symbols[ 15] (0x5598b41d1764): src =  17, use = true, sym =   0, get = 3, cpy = 3, len = 4, name = '?', lit = 'ÿ' (  0), uid = 1111
main.c:808: Length Symbols, foresee = 260, longest =  4, have = 261, used = 16
main.c:808: symbols[  0] (0x560a164e1a60): src =  97, use = true, sym =   0, get = 0, cpy = 1, len = 1, name = '?', lit = 'a' ( 97), uid = 0
main.c:808: symbols[  2] (0x560a164e1ab8): src =  98, use = true, sym =   0, get = 0, cpy = 2, len = 2, name = '?', lit = 'b' ( 98), uid = 10
main.c:808: symbols[ 12] (0x560a164e1c70): src = 256, use = true, sym =   0, get = 0, cpy = 4, len = 4, name = '?', lit = 'ÿ' (256), uid = 1100
main.c:808: symbols[ 13] (0x560a164e1c9c): src = 257, use = true, sym =   3, get = 0, cpy = 0, len = 4, name = '?', lit = 'ÿ' (257), uid = 1101
main.c:808: symbols[ 14] (0x560a164e1cc8): src = 258, use = true, sym =   4, get = 0, cpy = 0, len = 4, name = '?', lit = 'ÿ' (258), uid = 1110
main.c:808: symbols[ 15] (0x560a164e1cf4): src = 259, use = true, sym =   5, get = 0, cpy = 0, len = 4, name = '?', lit = 'ÿ' (259), uid = 1111
main.c:809: Distance Symbols, foresee =   7, longest =  2, have =   8, used = 4
main.c:809: symbols[  0] (0x560a164e4750): src =   0, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = '?', lit = 'ÿ' (260), uid = 00
main.c:809: symbols[  1] (0x560a164e477c): src =   4, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = '?', lit = 'ÿ' (264), uid = 01
main.c:809: symbols[  2] (0x560a164e47a8): src =   5, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = '?', lit = 'ÿ' (265), uid = 10
main.c:809: symbols[  3] (0x560a164e47d4): src =   6, use = true, sym =   0, get = 0, cpy = 0, len = 2, name = '?', lit = 'ÿ' (266), uid = 11

Just in case you didn't realise that's what those commented out PrintSymbols() calls produce when given the final tables, they iterate through the all the nodes up until it hits the used count and prints only those whose "use" boolean is equal to true, otherwise it only prints info about the table.

Edit: Ah, btw that "len" member is only used for generating the UIDs later (I deem UID less confusing than code when code is also used for the code table, confusing to say every code, leng & dist symbol has a code member, less confusing to say they all have a UID for huffman purposes).

The "get" member is for how many extra bits need to be read after the symbol's UID is found in the stream/buffer/file.

The "lit" member will always be of the original leng "alphabet" which is why it's used as the value to be put in the target buffer

Currently the "sym" member is being used for for how to look back

The "cpy" member is naturally supposed to hold the base number of characters to copy, the number obtained from using the "get" member to read more of the stream is added to this.

The "src" member is supposed to hold the non-huffman code mentioned in the tables of this document: Understanding gzip

I think that about covers the potentially confusing members

**awsdert** · 07-31-2021

Originally Posted by hamster_nz

I got a few spare minutes.

Attached is a program that should decode a .gz file less than 8kb, and if you add "-v" on the command line it will print out things like code tables/dictionaries along the way.

Code:

$ echo "Hi there" | gzip > a.gz

$ ./main2 a.gz
Have loaded 29 bytes from file
Last block = true
Compressed (default dictionary)
data: 48 'H'
data: 69 'i'
data: 20 ' '
data: 74 't'
data: 68 'h'
data: 65 'e'
data: 72 'r'
data: 65 'e'
data: 0a
  End of block 256
  This is the last block

It assumes that the zlib data start at byte 10.

Don't judge the code too harshly - it's just a hack to test my understanding, not production ready.

So here I'm gziping up the source file "inflate.c", then decoding it with 'verbose' enabled:

Code:

$ cat inflate.c | gzip > abc.gz
$ ./main2 -v abc.gz | more

And here is the output showing the code tables that were used:

Code:

Compressed (dynamic dictionary)
  HLIT 282, HDIST 29, HCLEN 13
Codelen table
    0:  4 1100
    4:  4 1101
    5:  3 000
    6:  3 001
    7:  3 010
    8:  3 011
    9:  3 100
   10:  3 101
   11:  4 1110
   12:  5 11110
   16:  6 111110
   17:  7 1111110
   18:  7 1111111
Reading in literal/length plus distance dictionary lengths
Literal/Repeat Length table
   10:  7 1011010
   32:  5 00100
   33:  9 111101000
   34:  8 11100010
   35: 12 111111111100
   37:  8 11100011
   38:  8 11100100
   39:  9 111101001
   40:  7 1011011
   41:  6 011110
   42:  9 111101010
   43:  7 1011100
   44:  6 011111
   45:  8 11100101
   46:  9 111101011
   47:  9 111101100
   48:  7 1011101
   49:  6 100000
   50:  6 100001
   51:  7 1011110
   52:  7 1011111
   53:  7 1100000
   54:  7 1100001
   55:  7 1100010
   56:  7 1100011
   57:  7 1100100
   58:  8 11100110
   59:  7 1100101
   60:  8 11100111
   61:  8 11101000
   62:  8 11101001
......

TBH I find that output to be unclear about what the values represent, part of the reason I print the member names is so I can more easily see what should've been what compared to what it actually is. Did look through you're code but didn't find what I was looking for so at present it wasn't of help to me, thx anyways

**awsdert** · 08-02-2021

Finally got that blasted algorithm right, for anyone wanting to reference my code (which I've already cleaned up a bit prior to getting the algorithm right) here's where it's uploaded:
Gitlab upload instance

Just don't expect it to be super clean, I literally only just got it right just now, I'll do a cleaner version for my paw project later.