Thread: bits and bytes in C

  1. #1
    Registered User
    Join Date
    May 2011
    Posts
    13

    bits and bytes in C

    Hey guys,

    I wanted some clarification in my understanding of bits and bytes in C. For practice, I'm writing an LZW compression program. The idea is that I will get input from stdin, encode it, output the codes, and then decode them. So the encoded file will be something like:

    ABC256257 for input ABCABBC...

    And my dictionary is: 0-255 for the 256 ASCII characters, 256 = AB, 257 = BC, 258 = CA, 259 = ABB, etc.

    The issue is that in 8 bits, I can only represent numbers from 0 to 255, so if I want to output "256" as a single byte, I will need 9 bits. I found a program that has gets and puts bits to stdout, but I'm not sure how to use it correctly. The two methods are:

    put(int code, int bits), which puts a code of size bits into stdout. get(int bits) will get the code of size bits.

    So I figured that put(256, 9) should put the number 256 into stdout, but all it does is put some weird symbols like €"Á °("

    My questions are:

    1. Do I have the right idea? should I be doing put(256, 9) or something else?
    2. If so, then why does that print weird symbols.

    Thanks for your help.

  2. #2
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    IIRC, the LZW compression uses more bits for output than input.
    If n bits are for input, then output bits are n+1 at the very least.

  3. #3
    Registered User
    Join Date
    May 2011
    Posts
    13
    Quote Originally Posted by itCbitC View Post
    IIRC, the LZW compression uses more bits for output than input.
    If n bits are for input, then output bits are n+1 at the very least.
    I don't see why this would be the case because the storage for the code/key would be consistent based on bits.

    Nonetheless, I guess my bigger question is: How do I pack 9 bits into a char? Or 10 bits, 11 bits, etc.

  4. #4
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    You don't pack 9 bits into an 8 bit variable... you use short or long integers instead.

    I haven't played with lzw (etc.) compression in quite a while but I'm sure there's an accomodation for that.

  5. #5
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    Quote Originally Posted by Ach1lles View Post
    I don't see why this would be the case because the storage for the code/key would be consistent based on bits.

    Nonetheless, I guess my bigger question is: How do I pack 9 bits into a char? Or 10 bits, 11 bits, etc.
    Then what's the point of assigning > 8 bit codewords to strings AB BC CA etc.
    You need to revise the symbol table so strings stay under the 8 bit boundary.

  6. #6
    Registered User
    Join Date
    May 2011
    Posts
    13
    Quote Originally Posted by CommonTater View Post
    You don't pack 9 bits into an 8 bit variable... you use short or long integers instead.

    I haven't played with lzw (etc.) compression in quite a while but I'm sure there's an accomodation for that.
    When I tried put(16, 256), I still get odd symbols like "SOH" "NUL" etc instead of "256".

    Quote Originally Posted by itCbitC View Post
    Then what's the point of assigning > 8 bit codewords to strings AB BC CA etc.
    You need to revise the symbol table so strings stay under the 8 bit boundary.
    I was unclear, so sorry about that. What I meant was that 0-255 store the 256 characters, so I'd have to use 256+ to store the combinations. When decoding, the number 256 would be deconstructed into AB, so the space requirements are the same.

    I think I need to create a bitstream and use delimiters to note where I go from 9 to 10 bits, etc, but how would I do that? Is that even the best solution?

  7. #7
    Registered User
    Join Date
    Mar 2009
    Posts
    344
    I'm also not that familiar with LZW compression. I do understand what you're seeing in the output, though. Regardless of what you're writing, you're dumping the file and displaying it with each character being 8 bits. Each of those 8 bit values correspond to a character (e.g. ASCII Code - The extended ASCII table) and that's what's being printed.

    So you write 256 followed by 257, each as 9 bits :

    100000000 100000001

    Then display it as 8 bits :

    10000000 01000000 01 (binary)
    128 64 1 (decimal)

    Each 8 bits corresponds to a character. That 128 would be the code for a euro, I believe, and 64 is @. 1 is not printable - it'll either not show up as anything or might be the -like character you're seeing occasionally. This doesn't exactly match your example but should give you an idea of what's going on.

    ETA : The wikipedia article about LZW is pretty detailed, e.g. http://en.wikipedia.org/wiki/Lempel%...le-width_codes
    Last edited by KCfromNC; 06-29-2011 at 09:52 AM.

  8. #8
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    Quote Originally Posted by Ach1lles View Post
    I was unclear, so sorry about that. What I meant was that 0-255 store the 256 characters, so I'd have to use 256+ to store the combinations. When decoding, the number 256 would be deconstructed into AB, so the space requirements are the same.
    Hopefully you know that code 257 can't be represented in 8 bits.
    Code:
    257 == 0x101 == 100000001
    If your input file is purely alphabetic, encode A=0x41 ... Z=0x5A.
    Do the same for strings i.e. AB=0x5B BC=0x5C CA=0x5D so on and so forth.
    Quote Originally Posted by Ach1lles View Post
    I think I need to create a bitstream and use delimiters to note where I go from 9 to 10 bits, etc, but how would I do that? Is that even the best solution?
    Perhaps you mean bitfields; only flip side is that they aren't portable.

  9. #9
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    6 bits 110011
    7 bits 0001000
    8 bits 11111111
    9 bits 111000111
    10 bits 0001111000
    Total 40 bits, or 5 bytes

    Packed into 8-bit bytes gets something looking like this
    11001100 01000111 11111111 00011100 01111000

    Create a function which takes an integer, and the number of bits of information it contains, and use &, |, >> and << to extract groups of bits, pack them with other bits already sent, and when you've completed a byte, store that byte in some buffer / file.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  10. #10
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by Salem View Post
    6 bits 110011
    7 bits 0001000
    8 bits 11111111
    9 bits 111000111
    10 bits 0001111000
    Total 40 bits, or 5 bytes

    Packed into 8-bit bytes gets something looking like this
    11001100 01000111 11111111 00011100 01111000

    Create a function which takes an integer, and the number of bits of information it contains, and use &, |, >> and << to extract groups of bits, pack them with other bits already sent, and when you've completed a byte, store that byte in some buffer / file.
    Finally some sensible advice! QFE
    Also, if you think about this, it means that your buffer where all these bits are being banged up against one another will need flushing when you're done. I.e. you can't write out a byte as you go if you've only got 3 meaningful bits in it, but if that's what you're left with at the end, then you have to write it out along with 5 extra zeros.

    Quote Originally Posted by Ach1lles
    So I figured that put(256, 9) should put the number 256 into stdout, but all it does is put some weird symbols like €"Á °("
    Both the input alphabet and the output values can contain unprintable data. So don't try and print it directly. If you really want to you could convert it to hex to print, or even to a string of '1's and '0's.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Bits and bytes- Please help.
    By chakra in forum C Programming
    Replies: 85
    Last Post: 05-30-2009, 11:22 PM
  2. bits & bytes
    By pelom in forum C Programming
    Replies: 8
    Last Post: 10-11-2005, 09:45 AM
  3. Bits & Bytes
    By jon_nc17 in forum A Brief History of Cprogramming.com
    Replies: 5
    Last Post: 11-13-2002, 03:56 PM
  4. Bits & Bytes
    By C of Green in forum C++ Programming
    Replies: 8
    Last Post: 06-21-2002, 06:50 PM
  5. Bits, Bytes & Nibbles!
    By Unregistered in forum C++ Programming
    Replies: 1
    Last Post: 10-13-2001, 10:22 PM