Can you describe what you are actually trying to achieve? As you imply, this does seem like a poor match for SSE. SSE works best when your data is aligned to 16 bytes and already arranged in groups of four 32-bit values (or two 64-bit values).
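To illustrate the kind of data layout I mean, here is a minimal sketch of a case where SSE fits well. The function name add_floats_sse is made up for this example, and it assumes that all three arrays are 16-byte aligned and that the element count is a multiple of four. It is not based on your code, since I don't know what that looks like.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats.
   Assumption: dst, a and b are 16-byte aligned and n is a multiple of 4.
   _mm_load_ps and _mm_store_ps require that alignment. */
void add_floats_sse(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);             /* aligned load of four floats */
        __m128 vb = _mm_load_ps(b + i);             /* aligned load of four floats */
        _mm_store_ps(dst + i, _mm_add_ps(va, vb));  /* four additions, aligned store */
    }
}
```

If your data is not aligned, or is not laid out as consecutive groups of four, you would need unaligned loads (_mm_loadu_ps) or extra shuffling to get it into that shape, and that tends to eat up much of the gain.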

--
Mats