bits and bytes in C

**Ach1lles** · 06-29-2011

Hey guys,

I wanted some clarification in my understanding of bits and bytes in C. For practice, I'm writing an LZW compression program. The idea is that I will get input from stdin, encode it, output the codes, and then decode them. So the encoded file will be something like:

ABC256257 for input ABCABBC...

And my dictionary is: 0-255 for the 256 ASCII characters, 256 = AB, 257 = BC, 258 = CA, 259 = ABB, etc.

The issue is that in 8 bits, I can only represent numbers from 0 to 255, so if I want to output "256" as a single byte, I will need 9 bits. I found a program that has gets and puts bits to stdout, but I'm not sure how to use it correctly. The two methods are:

put(int code, int bits), which puts a code of size bits into stdout. get(int bits) will get the code of size bits.

So I figured that put(256, 9) should put the number 256 into stdout, but all it does is put some weird symbols like €"Á °("

My questions are:

1. Do I have the right idea? should I be doing put(256, 9) or something else?
2. If so, then why does that print weird symbols.

Thanks for your help.

**itCbitC** · 06-29-2011

IIRC, the LZW compression uses more bits for output than input.
If n bits are for input, then output bits are n+1 at the very least.

**Ach1lles** · 06-29-2011

Originally Posted by itCbitC

IIRC, the LZW compression uses more bits for output than input.
If n bits are for input, then output bits are n+1 at the very least.

I don't see why this would be the case because the storage for the code/key would be consistent based on bits.

Nonetheless, I guess my bigger question is: How do I pack 9 bits into a char? Or 10 bits, 11 bits, etc.

**~~CommonTater~~** · 06-29-2011

You don't pack 9 bits into an 8 bit variable... you use short or long integers instead.

I haven't played with lzw (etc.) compression in quite a while but I'm sure there's an accomodation for that.

**itCbitC** · 06-29-2011

Originally Posted by Ach1lles

I don't see why this would be the case because the storage for the code/key would be consistent based on bits.

Nonetheless, I guess my bigger question is: How do I pack 9 bits into a char? Or 10 bits, 11 bits, etc.

Then what's the point of assigning > 8 bit codewords to strings AB BC CA etc.
You need to revise the symbol table so strings stay under the 8 bit boundary.

**Ach1lles** · 06-29-2011

Originally Posted by CommonTater

You don't pack 9 bits into an 8 bit variable... you use short or long integers instead.

I haven't played with lzw (etc.) compression in quite a while but I'm sure there's an accomodation for that.

When I tried put(16, 256), I still get odd symbols like "SOH" "NUL" etc instead of "256".

Originally Posted by itCbitC

Then what's the point of assigning > 8 bit codewords to strings AB BC CA etc.
You need to revise the symbol table so strings stay under the 8 bit boundary.

I was unclear, so sorry about that. What I meant was that 0-255 store the 256 characters, so I'd have to use 256+ to store the combinations. When decoding, the number 256 would be deconstructed into AB, so the space requirements are the same.

I think I need to create a bitstream and use delimiters to note where I go from 9 to 10 bits, etc, but how would I do that? Is that even the best solution?

**KCfromNC** · 06-29-2011

I'm also not that familiar with LZW compression. I do understand what you're seeing in the output, though. Regardless of what you're writing, you're dumping the file and displaying it with each character being 8 bits. Each of those 8 bit values correspond to a character (e.g. ASCII Code - The extended ASCII table) and that's what's being printed.

So you write 256 followed by 257, each as 9 bits :

100000000 100000001

Then display it as 8 bits :

10000000 01000000 01 (binary)
128 64 1 (decimal)

Each 8 bits corresponds to a character. That 128 would be the code for a euro, I believe, and 64 is @. 1 is not printable - it'll either not show up as anything or might be the -like character you're seeing occasionally. This doesn't exactly match your example but should give you an idea of what's going on.

ETA : The wikipedia article about LZW is pretty detailed, e.g. http://en.wikipedia.org/wiki/Lempel%...le-width_codes

**itCbitC** · 06-29-2011

Originally Posted by Ach1lles

I was unclear, so sorry about that. What I meant was that 0-255 store the 256 characters, so I'd have to use 256+ to store the combinations. When decoding, the number 256 would be deconstructed into AB, so the space requirements are the same.

Hopefully you know that code 257 can't be represented in 8 bits.

Code:

257 == 0x101 == 100000001

If your input file is purely alphabetic, encode A=0x41 ... Z=0x5A.
Do the same for strings i.e. AB=0x5B BC=0x5C CA=0x5D so on and so forth.

Originally Posted by Ach1lles

I think I need to create a bitstream and use delimiters to note where I go from 9 to 10 bits, etc, but how would I do that? Is that even the best solution?

Perhaps you mean bitfields; only flip side is that they aren't portable.

**Salem** · 06-29-2011

6 bits 110011
7 bits 0001000
8 bits 11111111
9 bits 111000111
10 bits 0001111000
Total 40 bits, or 5 bytes

Packed into 8-bit bytes gets something looking like this
11001100 01000111 11111111 00011100 01111000

Create a function which takes an integer, and the number of bits of information it contains, and use &, |, >> and << to extract groups of bits, pack them with other bits already sent, and when you've completed a byte, store that byte in some buffer / file.

**iMalc** · 06-29-2011

Originally Posted by Salem

6 bits 110011
7 bits 0001000
8 bits 11111111
9 bits 111000111
10 bits 0001111000
Total 40 bits, or 5 bytes

Packed into 8-bit bytes gets something looking like this
11001100 01000111 11111111 00011100 01111000

Create a function which takes an integer, and the number of bits of information it contains, and use &, |, >> and << to extract groups of bits, pack them with other bits already sent, and when you've completed a byte, store that byte in some buffer / file.

Finally some sensible advice! QFE
Also, if you think about this, it means that your buffer where all these bits are being banged up against one another will need flushing when you're done. I.e. you can't write out a byte as you go if you've only got 3 meaningful bits in it, but if that's what you're left with at the end, then you have to write it out along with 5 extra zeros.

Originally Posted by Ach1lles

So I figured that put(256, 9) should put the number 256 into stdout, but all it does is put some weird symbols like €"Á °("

Both the input alphabet and the output values can contain unprintable data. So don't try and print it directly. If you really want to you could convert it to hex to print, or even to a string of '1's and '0's.

Thread: bits and bytes in C

Thread Tools

Search Thread

Display

bits and bytes in C

Similar Threads

Bits and bytes- Please help.

bits & bytes

Bits & Bytes

Bits & Bytes

Bits, Bytes & Nibbles!