Let me give my comments before actually answering the question:
Frankly, what is fast and what is slow depends on the CPU. The CPU has some Arithmetic Logic Units (ALUs) that do subtractions, multiplications and so on.
It also depends on the smallest operand width the ALU supports. If it only supports 32-bit operations, then when you hand it a single byte it simply pads out the remaining bytes, for example.
So, assuming the ALU works on 32 bits (an int, let's say), you want to take advantage of this by packing 4 bytes into one subtraction. Fine so far. The tricky part is this: is there an additional cost to doing so? If there is, then you gain some time but lose some elsewhere. Typically you won't gain anything in the end. It might even be slower, and it will make the compiler's optimization job (and any other kind of optimization) more difficult.
What I mean is this.
Code:
char value[4];
....
/* shift each byte into its own lane; cast through unsigned char so a negative char doesn't sign-extend */
int optValue = (int)(unsigned char)value[3] << 24 | (int)(unsigned char)value[2] << 16
             | (int)(unsigned char)value[1] << 8  | (int)(unsigned char)value[0];
This is nice, but the bit shifts, the array indexing, and the casts might all together take more time than the subtraction itself.
Of course you wouldn't do the above, you would do
Code:
char value[4];
int* optValue = (int*)value;  /* reinterpret the 4 bytes in place as one int */
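(A side note, not part of the original point: the pointer cast above technically breaks C's strict-aliasing rules and can fault on alignment-picky CPUs. A minimal sketch of the portable way, using a hypothetical helper; compilers lower the memcpy to the same single load:)
Code:
#include <string.h>

/* hypothetical helper: read 4 bytes as one int in native byte order */
int load4(const char *value) {
    int optValue;
    memcpy(&optValue, value, sizeof optValue);  /* optimizes to a single 4-byte load */
    return optValue;
}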
But what about the number you subtract from it?
If you do as before
Code:
char value;
....
/* replicate the one byte into all four lanes */
int optValue = (int)(unsigned char)value << 24 | (int)(unsigned char)value << 16
             | (int)(unsigned char)value << 8  | (int)(unsigned char)value;
Then you will again lose cycles and it might not really be worth it.
The best way would be to hardcode the replicated constants for all values from 0 to 255. So
Code:
#define B_255 0xFFFFFFFF
#define B_254 0xFEFEFEFE
....
then subtract normally.
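As an aside (my own suggestion, not the only way to do it), you can skip the 256 #defines entirely: multiplying a byte by 0x01010101 replicates it into all four lanes, because each partial product lands in its own byte with no carries. A minimal sketch:
Code:
#include <stdint.h>

/* broadcast one byte into all four lanes: 0xAB -> 0xABABABAB */
#define BCAST(x) ((uint32_t)(uint8_t)(x) * 0x01010101u)
For example, BCAST(254) produces the same value as B_254 above.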
To use them you should do
Code:
char* value = (char*)&optValue;
// value[1] is the second byte in memory (counting from zero)
which I guess you already do.
Note that endianness might come into play and ruin your attempt. This can of course be solved.
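For example, here is a quick check you can run (assuming a C99 compiler) to see which byte order you are dealing with:
Code:
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t x = 0x01020304;
    unsigned char *p = (unsigned char *)&x;
    /* little-endian prints 04, big-endian prints 01 */
    printf("first byte in memory: %02X\n", p[0]);
    return 0;
}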
Now the actual question:
There is a partial way around this. Use 2 bytes per value. Then you will have
#junk#value1#junk#value2
So each value has 8 bits of junk above it that will catch the borrows, which you don't care about.
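Here is a minimal sketch of that layout (my own illustration with made-up values, and one detail glossed over above: you have to seed each junk byte of the minuend with a 1 so a borrow out of the lane below stops in the junk instead of running into the next value):
Code:
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t a1 = 200, b1 = 50;  /* minuend lanes */
    uint32_t a2 = 60,  b2 = 70;  /* subtrahend lanes */

    /* layout: #junk#a#junk#b; seed the minuend's junk bytes with 1 */
    uint32_t m = (1u << 24) | (a1 << 16) | (1u << 8) | b1;
    uint32_t s = (a2 << 16) | b2;

    uint32_t d = m - s;  /* one 32-bit subtraction, two 8-bit results */

    /* mask off the junk; each lane wraps modulo 256 */
    printf("%u %u\n", (d >> 16) & 0xFF, d & 0xFF);  /* prints 140 236 */
    return 0;
}
The junk bytes end up holding the borrow information, which you simply mask away.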