Thread: Zlib / Gzip - Estimating Uncompressed Size in Advance

  1. #1
    Registered User
    Join Date
    Nov 2012
    Posts
    8

    Zlib / Gzip - Estimating Uncompressed Size in Advance

    I have a program which exchanges streams of data with a server across the Internet. It runs on Linux. Different users run it on a wide variety of flavors of Linux. The data streams are compressed. The application had been using bzip compression, but I am now changing it to use gzip. The application is working well with gzip now, but there is one thing about it which is less than satisfactory. These packets of data are stored in buffer variables and I need to make the buffers big enough to hold them. That means that when I compress data to send, I need to have an advance estimate of the compressed size to dynamically allocate a buffer for it, and when I uncompress data received, I need to have an estimate of the uncompressed size to allocate a buffer to store it. If I cannot do these things, then I will always have to use the largest size that any transmission might ever be, and that will be a big waste of memory.

    I wish to emphasize that this data never gets stored into files. It is only stored in memory variables. I can estimate the compressed size before I compress data because the zlib library provides a function called compressBound(), which takes the uncompressed size and returns an upper bound on the compressed size. I use this size to allocate the buffer. However, for decompressing received packets, I can find no way to figure out in advance what the uncompressed size will be. I have seen vague hints that I could parse the gzip header, but nothing approaching an instruction as to how to do that. Does anyone know how to estimate the uncompressed size of a gzip-compressed buffer?

    Once again, this data is not in files, only in memory. Furthermore, this application rapidly exchanges large volumes of information with the server, and I cannot use the time or disk I/O to make some kind of temporary file for each packet. Nor can I shell out to give an operating system command like gunzip or something each time I receive a packet. This has to be handled with C++ calls alone.

    Thanks in advance to any kind soul who can shed some illumination on this issue.

    Brandon

  2. #2
    Registered User
    Join Date
    Mar 2011
    Posts
    546
    RFC 1952 GZIP File Format Specification version 4.3

    The trailer of each gzip member holds the uncompressed size (the ISIZE field, stored modulo 2^32). If your gzip stream has multiple members, you have to iterate over each one and sum the sizes.

    example of a file named x2.exe with original size 31744
    Code:
    1f                    id1
    8b                    id2
    08                    cm     = deflate
    08                    flg    = FNAME set
    9f 3f b6 50           mtime
    00                    xfl
    0b                    os     = NTFS
    78 32 2e 65 78 65 00  fname  = "x2.exe" (NUL-terminated)

    ... compressed data omitted ...

    de 7f a6 3a           crc32
    00 7c 00 00           isize  = 0x00007c00 = 31744 (original size)

    multibyte fields are little endian
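
    The layout above suggests a way to read the size from a buffer in memory without touching a file. What follows is only a minimal sketch: gzip_isize is a made-up helper name (it is not a zlib function), and it assumes the buffer holds exactly one complete gzip member with nothing appended after it. RFC 1952 stores the size modulo 2^32, so it is best treated as a hint from a trusted sender and capped before allocating.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of zlib): reads the ISIZE field, i.e. the
// last 4 bytes of a single-member gzip stream, stored little endian.
// Returns false if the buffer is too short to hold a header and trailer,
// or does not start with the gzip magic bytes 1f 8b.
bool gzip_isize(const unsigned char* buf, std::size_t len, std::uint32_t& size_out)
{
    // 10-byte fixed header + 8-byte trailer (CRC32 + ISIZE) is the bare minimum.
    if (buf == nullptr || len < 18) return false;
    if (buf[0] != 0x1f || buf[1] != 0x8b) return false;

    const unsigned char* p = buf + len - 4;  // ISIZE is the final 4 bytes
    size_out = static_cast<std::uint32_t>(p[0])
             | static_cast<std::uint32_t>(p[1]) << 8
             | static_cast<std::uint32_t>(p[2]) << 16
             | static_cast<std::uint32_t>(p[3]) << 24;
    return true;
}
```

    For the x2.exe bytes shown above, the trailer 00 7c 00 00 decodes to 0x00007c00. A receiver would call gzip_isize(packet, packet_len, n) and, if it succeeds and n is within whatever limit the application allows, allocate n bytes for the output.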
    Last edited by dmh2000; 12-03-2012 at 10:52 AM. Reason: expanded example

  3. #3
    Registered User
    Join Date
    Nov 2012
    Posts
    8
    Thanks. Are there any access functions for the header components?

  4. #4
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    So what is wrong/hard about defining your protocol as being

    - 2 byte compressed length, in network byte order (=n)
    - 2 byte uncompressed length, in network byte order
    - n bytes of compressed data.

    The receiver does
    - fetch 4 bytes, extract a compressed length(x) and uncompressed length(y)
    - allocate 2 buffers of length x and y
    - fetch a further x bytes into buff_x
    - decompress buff_x into buff_y
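
    The framing described in the two lists above could be sketched roughly like this. FrameHeader, encode_frame and decode_header are names invented for illustration, and the sketch only builds and parses the 4-byte prefix; the socket reads and the actual compression calls are left out. Two-byte fields cap both lengths at 65535, so a real protocol with bigger packets would need wider fields.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical frame prefix: two 16-bit lengths in network byte order.
struct FrameHeader {
    std::uint16_t compressed_len;
    std::uint16_t uncompressed_len;
};

// Sender side: prepend both lengths (big endian) to the compressed bytes.
std::vector<unsigned char> encode_frame(const std::vector<unsigned char>& compressed,
                                        std::size_t uncompressed_len)
{
    if (compressed.size() > 0xFFFF || uncompressed_len > 0xFFFF)
        throw std::length_error("length does not fit in a 2-byte field");

    std::vector<unsigned char> out;
    out.reserve(4 + compressed.size());
    out.push_back(static_cast<unsigned char>(compressed.size() >> 8));
    out.push_back(static_cast<unsigned char>(compressed.size() & 0xFF));
    out.push_back(static_cast<unsigned char>(uncompressed_len >> 8));
    out.push_back(static_cast<unsigned char>(uncompressed_len & 0xFF));
    out.insert(out.end(), compressed.begin(), compressed.end());
    return out;
}

// Receiver side: decode the first 4 bytes so both buffers can be sized
// before the remaining compressed_len bytes are fetched.
bool decode_header(const unsigned char* buf, std::size_t len, FrameHeader& h)
{
    if (buf == nullptr || len < 4) return false;
    h.compressed_len   = static_cast<std::uint16_t>((buf[0] << 8) | buf[1]);
    h.uncompressed_len = static_cast<std::uint16_t>((buf[2] << 8) | buf[3]);
    return true;
}
```

    After decode_header succeeds, the receiver allocates compressed_len bytes for buff_x and uncompressed_len bytes for buff_y, reads the rest of the frame into buff_x, and decompresses into buff_y.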
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Registered User
    Join Date
    Nov 2012
    Posts
    8
    The main thing wrong with it is that I have no idea what you're talking about. For instance, your first sentence is "2 byte compressed length, in network byte order (=n)". What two bytes? Are you saying that the header starts out with two bytes which are of no interest to my present purposes and must be bypassed? What is the significance of "(=n)"?

  6. #6
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    > The application had been using bzip compression, but I am now changing it to use gzip.
    Sorry, I thought someone capable of doing what you describe in post #1 would be capable of creating and understanding a rudimentary protocol for exchanging information between two machines.

    > What two bytes?
    If I send you "Hello world\n", then the first two bytes are "He"

    How do you know when the message is complete?

    I could tell you that the message is complete when you get a \n.

    I could also say that there is a length at the start of the message, say
    "12,Hello world\n"
    So you would start by reading a few bytes until you saw a comma
    You would then interpret the 12 in some way, and deduce that 12 more bytes follow, namely Hello world\n
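
    As a rough illustration of the "12,Hello world\n" idea, a receiver might parse such a message as below. parse_length_prefixed is a name made up for this sketch, and the sanity limit on the length is an arbitrary assumption, not part of any real protocol.

```cpp
#include <cstddef>
#include <string>

// Hypothetical parser for "<decimal length>,<payload>" messages.
// On success, copies exactly <length> bytes after the comma into `payload`.
// Returns false if the prefix is malformed or the payload is not yet complete.
bool parse_length_prefixed(const std::string& msg, std::string& payload)
{
    const std::size_t comma = msg.find(',');
    if (comma == std::string::npos || comma == 0) return false;

    std::size_t n = 0;
    for (std::size_t i = 0; i < comma; ++i) {
        const char c = msg[i];
        if (c < '0' || c > '9') return false;   // only decimal digits allowed
        if (n > 100000) return false;           // arbitrary limit, also guards overflow
        n = n * 10 + static_cast<std::size_t>(c - '0');
    }

    if (msg.size() - comma - 1 < n) return false;  // fewer bytes than announced
    payload = msg.substr(comma + 1, n);
    return true;
}
```

    With the message from the post, the prefix says 12 and the 12 bytes after the comma are "Hello world\n", newline included.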

    See: Communications protocol (Wikipedia)
    How you do this is entirely up to you, so long as both ends agree on what the protocol is.

  7. #7
    Frequently Quite Prolix dwks
    Join Date
    Apr 2005
    Location
    Canada
    Posts
    8,057
    ... in other words, you don't have to send the gzipped data by itself. You can always attach some extra information to each packet, forming your own header. Your header can contain whatever information makes your life easier (the uncompressed size, for example, and maybe the compressed size too, unless you know that through other means). It's standard practice in network communication to send not just the data but also some meta-information along with it.
    dwk

    Seek and ye shall find. quaere et invenies.

    "Simplicity does not precede complexity, but follows it." -- Alan Perlis
    "Testing can only prove the presence of bugs, not their absence." -- Edsger Dijkstra
    "The only real mistake is the one from which we learn nothing." -- John Powell



  8. #8
    Registered User
    Join Date
    Nov 2012
    Posts
    8
    Thanks. It's a reasonable point of view, and true, but if the gzip stream already contains the information I want, one would hope that the authors of zlib had provided some function to access it and relieve users of the need to parse it by hand. Most people will need to know how big to make a buffer to hold the output. Being pretty new to zlib, I was just trying to determine whether such functions exist or whether it is indeed up to me to parse the header by hand.

  9. #9
    Registered User
    Join Date
    Nov 2012
    Posts
    8
    Quote: Originally Posted by Salem
    So what is wrong/hard about defining your protocol as being

    - 2 byte compressed length, in network byte order (=n)
    - 2 byte uncompressed length, in network byte order
    - n bytes of compressed data.

    The receiver does
    - fetch 4 bytes, extract a compressed length(x) and uncompressed length(y)
    - allocate 2 buffers of length x and y
    - fetch a further x bytes into buff_x
    - decompress buff_x into buff_y
    Okay, now I see what you're saying. I would rather not rewrite the existing server protocol, which software other than mine also uses, if gzip provides a way to get this information from the gzip stream.

  10. #10
    Registered User
    Join Date
    Sep 2001
    Posts
    4,912
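    On the zlib home site you can find documentation on the library. There's a function called inflateGetHeader() that looks like what you want. You can find the specification for the struct that it populates (gz_header) in the header file, zlib.h. Note that gz_header covers only the header fields (name, comment, modification time and so on); the uncompressed size lives in the trailer, so it is not among them.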
