Zlib / Gzip - Estimating Uncompressed Size in Advance

**Brandon9000** · 12-03-2012

I have a program which exchanges streams of data with a server across the Internet. It runs on Linux. Different users run it on a wide variety of flavors of Linux. The data streams are compressed. The application had been using bzip compression, but I am now changing it to use gzip. The application is working well with gzip now, but there is one thing about it which is less than satisfactory. These packets of data are stored in buffer variables and I need to make the buffers big enough to hold them. That means that when I compress data to send, I need to have an advance estimate of the compressed size to dynamically allocate a buffer for it, and when I uncompress data received, I need to have an estimate of the uncompressed size to allocate a buffer to store it. If I cannot do these things, then I will always have to use the largest size that any transmission might ever be, and that will be a big waste of memory.

I wish to emphasize that this data never gets stored into files. It is only stored in memory variables. I can estimate the compressed size before I compress data because the zlib library furnishes a function called compressBound(), which takes the uncompressed size and returns an upper bound for what the compressed size might be. I use this size to allocate the buffer. However, for decompressing received packets, I can find no way to figure out in advance what the uncompressed size will be. I have seen vague hints that I could parse the gzip header, but nothing approaching an instruction as to how to do that. Does anyone know how to estimate the uncompressed size of a gzip-compressed buffer?

Once again, this data is not in files, only in memory. Furthermore, this application rapidly exchanges large volumes of information with the server, and I cannot use the time or disk I/O to make some kind of temporary file for each packet. Nor can I shell out to give an operating system command like gunzip or something each time I receive a packet. This has to be handled with C++ calls alone.

Thanks in advance to any kind soul who can shed some illumination on this issue.

Brandon

**dmh2000** · 12-03-2012

RFC 1952 GZIP File Format Specification version 4.3

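Each gzip member stores the uncompressed size (modulo 2^32) in the last four bytes of its trailer, after the compressed data, not in the header. If your gzip stream has multiple members, you have to iterate over each one and sum the sizes.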

example of a file named x2.exe with original size 31744

Code:

1f          id1 
8b          id2
08          cm             = deflate
08          flag            (flag.fname is set)
9f 3f b6 50 mtime 
00          xfl 
0b          os              = NTFS
78 32 2e 65 78 65 00 fname  = x2.exe

data ... omitted

 
de 7f a6 3a       crc32 
00 7c 00 00       original size= 00007c00 == 31744

Multibyte fields are little-endian.
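For data held in memory, the layout above suggests a small hand-written reader. The following is a minimal sketch, assuming the buffer holds exactly one complete gzip member; `gzip_isize` is a hypothetical helper name, not a zlib function. Because the field is only the size modulo 2^32 and is supplied by the sender, it is probably best treated as an allocation hint, with the decompressor still checking output bounds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of zlib): reads the ISIZE field from the
// trailer of a single-member gzip stream held in memory.
// Returns false if the buffer cannot be a gzip member.
bool gzip_isize(const std::uint8_t* data, std::size_t len, std::uint32_t& out) {
    assert(data != nullptr || len == 0);

    // A member has at least a 10-byte header and an 8-byte trailer.
    if (len < 18) return false;

    // id1, id2 and cm (8 = deflate), as in the dump above.
    if (data[0] != 0x1f || data[1] != 0x8b || data[2] != 0x08) return false;

    // ISIZE is the last four bytes, little-endian.
    out = static_cast<std::uint32_t>(data[len - 4])
        | static_cast<std::uint32_t>(data[len - 3]) << 8
        | static_cast<std::uint32_t>(data[len - 2]) << 16
        | static_cast<std::uint32_t>(data[len - 1]) << 24;
    return true;
}
```

With the x2.exe bytes above, the trailer `00 7c 00 00` decodes to 0x7c00, that is 31744.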

**Brandon9000** · 12-03-2012

Thanks. Are there any access functions for the header components?

**Salem** · 12-04-2012

So what is wrong/hard about defining your protocol as being

- 2 byte compressed length, in network byte order (=n)
- 2 byte uncompressed length, in network byte order
- n bytes of compressed data.

The receiver does
- fetch 4 bytes, extract a compressed length(x) and uncompressed length(y)
- allocate 2 buffers of length x and y
- fetch a further x bytes into buff_x
- decompress buff_x into buff_y
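The framing described above can be sketched roughly as follows. This is an illustration only: `FrameHeader`, `make_frame` and `parse_header` are hypothetical names, and the compression step itself is left out so the example stays self-contained. Note that 2-byte length fields cap each packet at 65535 bytes; larger packets would need wider fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical frame layout, following the list above:
//   bytes 0-1: compressed length n, big-endian (network byte order)
//   bytes 2-3: uncompressed length, big-endian
//   bytes 4..: n bytes of compressed data
struct FrameHeader {
    std::uint16_t compressed_len;
    std::uint16_t uncompressed_len;
};

// Sender side: wraps an already-compressed payload in the 4-byte header.
std::vector<std::uint8_t> make_frame(const std::vector<std::uint8_t>& compressed,
                                     std::uint16_t uncompressed_len) {
    assert(compressed.size() <= 0xFFFF);  // a 2-byte field cannot describe more
    const std::uint16_t n = static_cast<std::uint16_t>(compressed.size());
    std::vector<std::uint8_t> frame;
    frame.reserve(4 + compressed.size());
    frame.push_back(static_cast<std::uint8_t>(n >> 8));
    frame.push_back(static_cast<std::uint8_t>(n & 0xFF));
    frame.push_back(static_cast<std::uint8_t>(uncompressed_len >> 8));
    frame.push_back(static_cast<std::uint8_t>(uncompressed_len & 0xFF));
    frame.insert(frame.end(), compressed.begin(), compressed.end());
    return frame;
}

// Receiver side: decodes the header so both buffers can be allocated
// before the payload is read. Returns false if fewer than 4 bytes arrived.
bool parse_header(const std::uint8_t* data, std::size_t len, FrameHeader& out) {
    if (len < 4) return false;
    out.compressed_len   = static_cast<std::uint16_t>((data[0] << 8) | data[1]);
    out.uncompressed_len = static_cast<std::uint16_t>((data[2] << 8) | data[3]);
    return true;
}
```

The receiver would call `parse_header` on the first 4 bytes, allocate buffers of `compressed_len` and `uncompressed_len` bytes, then read and decompress the payload.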

**Brandon9000** · 12-04-2012

The main thing wrong with it is that I have no idea what you're talking about. For instance, your first sentence is "2 bytes compressed length in network byte order (=n)." What two bytes? Are you saying that the header starts out with two bytes which are of no interest to my present purposes and must be bypassed? What is the significance of "(=n)"?

**Salem** · 12-04-2012

> The application had been using bzip compression, but I am now changing it to use gzip.
Sorry, I thought someone capable of doing what you describe in post #1 would be capable of creating and understanding a rudimentary protocol for exchanging information between two machines.

> What two bytes?
If I send you "Hello world\n", then the first two bytes are "He"

How do you know when the message is complete?

I could tell you that the message is complete when you get a \n.

I could also say that there is a length at the start of the message, say
"12,Hello world\n"
So you would start by reading a few bytes until you saw a comma
You would then interpret the 12 in some way, and deduce that 12 more bytes follow, namely Hello world\n

Communications protocol - Wikipedia, the free encyclopedia
How you do this is entirely up to you, so long as both ends agree on what the protocol is.
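As an illustration of the comma-delimited length prefix in the "12,Hello world\n" example, here is a minimal sketch. The function name `parse_length_prefixed` and the 9-digit limit on the length are assumptions made for this example, not part of any existing protocol.

```cpp
#include <cstddef>
#include <string>

// Hypothetical parser for messages shaped like "12,Hello world\n":
// a decimal length, a comma, then exactly that many bytes of body.
// Returns false if the prefix is malformed or the body is incomplete.
bool parse_length_prefixed(const std::string& input, std::string& body) {
    const std::size_t comma = input.find(',');
    // Require 1 to 9 digits before the comma (the cap avoids overflow).
    if (comma == std::string::npos || comma == 0 || comma > 9) return false;

    std::size_t length = 0;
    for (std::size_t i = 0; i < comma; ++i) {
        if (input[i] < '0' || input[i] > '9') return false;
        length = length * 10 + static_cast<std::size_t>(input[i] - '0');
    }

    // The declared number of bytes must already be present after the comma.
    if (input.size() - comma - 1 < length) return false;

    body = input.substr(comma + 1, length);
    return true;
}
```

A binary length field, as in the 2-byte scheme earlier in the thread, avoids the digit parsing but works the same way: the receiver learns the size before it reads the body.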

**dwks** · 12-04-2012

... in other words you don't have to send the gzipped data directly by itself. You can always attach some extra information to each packet, forming your own header. Your header can contain whatever information is necessary to make your life easier (the uncompressed size for example, and maybe the compressed size too unless you know that through other means). It's standard practice in network communication to send not just the data but also some metadata along with it.

**Brandon9000** · 12-04-2012

Thanks. It's a reasonable point of view, and true, but if the zlib header already contains the information which I want, one would hope that the authors of zlib would create some function to access the information and relieve the users of the need to do hand parsing. Most people will need to know how big to dimension a buffer to hold the output. Being pretty new to zlib, I was just trying to determine whether such functions exist or whether it is indeed up to me to do a hand parsing of the header.

**Brandon9000** · 12-04-2012

> Originally Posted by Salem
>
> So what is wrong/hard about defining your protocol as being
>
> - 2 byte compressed length, in network byte order (=n)
> - 2 byte uncompressed length, in network byte order
> - n bytes of compressed data.
>
> The receiver does
> - fetch 4 bytes, extract a compressed length(x) and uncompressed length(y)
> - allocate 2 buffers of length x and y
> - fetch a further x bytes into buff_x
> - decompress buff_x into buff_y

Okay, now I see what you're saying. I would rather not rewrite the existing server protocol, which software other than mine also uses, if gzip provides a way to access the information from the gzip stream.

**sean** · 12-04-2012

On the zlib home site you can find documentation on the library. There's a function called inflateGetHeader that looks like what you want. You can find the specification for the struct that it populates (gz_header) in the header file, zlib.h.
