This is, by the very nature of its redundancy, a performance-intensive application. You therefore need to work out 1) what you want to get done, and 2) how best to accomplish that goal with an eye towards performance.

Here are the givens:

1) It is never necessary to bit-shift by more than 7 bits, period. Anything beyond that is a whole-byte move.
2) Since this is a shift, and not a roll (rotate), it is not necessary to save the bits being shifted out.
3) memmove() is used, but it does not remove the need to shift every byte in order to realign the bits appropriately.
4) No additional buffering is necessary.

Here's the solution.

d = distance to be shifted

1) Determine the modulo, i.e. the number of bits to shift (d % 8).
2) If it is zero, the distance to be shifted is byte-aligned: just memmove() the data by d / 8 bytes and go to step 6.
3) Otherwise, shift every byte left or right by the modulo result, carrying bits over from the neighbouring byte.
4) Get the number of whole bytes to shift (d / 8).
5) memmove() the entire data block within the buffer by the result of step 4.
6) Exit.

It takes approximately 12 to 20 lines of code to do it.
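
For illustration, here is a rough sketch of the left-shift case in C. It is only one way to write the steps above, and it rests on a few assumptions of mine: the function name shift_left_bits is made up, the buffer is treated as one bit string with the most significant bit of buf[0] first, "left" means towards buf[0], and vacated bits are filled with zeros.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: shift the len-byte buffer left by d bits.
   Assumes MSB-first bit order; bits shifted out are discarded,
   vacated bits become 0. */
void shift_left_bits(uint8_t *buf, size_t len, size_t d)
{
    unsigned bits  = (unsigned)(d % 8);  /* step 1: sub-byte part, 0..7 */
    size_t   bytes = d / 8;              /* step 4: whole-byte part */

    if (bytes >= len) {                  /* everything is shifted out */
        memset(buf, 0, len);
        return;
    }
    if (bits != 0) {                     /* step 3: realign every byte */
        for (size_t i = 0; i + 1 < len; i++)
            buf[i] = (uint8_t)((buf[i] << bits) | (buf[i + 1] >> (8 - bits)));
        buf[len - 1] = (uint8_t)(buf[len - 1] << bits);
    }
    if (bytes != 0) {                    /* steps 2 and 5: whole-byte move */
        memmove(buf, buf + bytes, len - bytes);
        memset(buf + len - bytes, 0, bytes);
    }
}
```

A right shift would be the mirror image: walk the buffer from the last byte towards the first, pull the carried bits from buf[i - 1], and memmove() in the other direction.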