I've just started looking at SSE but don't see how to accomplish an operation we may need: a right shift with a different shift count applied to each 32-bit element of a register. An integer divide would work nicely too, if one were available.
We read thousands of double-words (DWs), each bit-packed, and need to unpack them quickly. For instance, the high-order bit of the first DW may represent one value, the next 5 bits another value, and so on; the next DW is packed differently. No values span a DW boundary.
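To make the layout concrete, here is a minimal scalar sketch of pulling one field out of a DW. The helper name `extract_bits` and its parameters are made up for illustration; the real field positions would come from whatever describes each DW's packing.

```c
#include <stdint.h>

/* Hypothetical helper (not from any library): return the `width`-bit field
   whose least significant bit sits at bit position `lsb` of `dw`.
   Assumes 1 <= width <= 31 and lsb + width <= 32. */
static inline uint32_t extract_bits(uint32_t dw, unsigned lsb, unsigned width)
{
    /* Shift the field down to bit 0, then mask off everything above it. */
    return (dw >> lsb) & ((1u << width) - 1u);
}
```

For the first DW described above, the high-order bit would be `extract_bits(dw, 31, 1)` and the following 5 bits would be `extract_bits(dw, 26, 5)`.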
I'm investigating using SSE to unpack values from four DWs at a time.
One "pseudo-code" unpack cycle might go as follows:
PSRLD X bits //move next target value to the low-order bits
PAND with Y //discard irrelevant high-order bits
CVTDQ2PS //convert the integers to floats
MULPS by Z //scale as appropriate
Store 4 values
Question: Does any version of SSE support a different 'X' for each 32-bit element of an XMM register?
(If there were an integer divide, I think it would accomplish the same thing.)
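For reference, the SSE shift instructions (PSRLD, PSRAD and friends) apply one count to every element; a per-element count only arrived with AVX2 (VPSRLVD, exposed as `_mm_srlv_epi32`). The sketch below is one possible SSE2-only workaround, not a definitive implementation. It relies on the same idea as the divide: x >> n equals the high 32 bits of the 64-bit product x * 2^(32-n). The names `srl_per_lane_sse2` and `unpack4` are hypothetical, the shift counts are assumed to lie in 1..31, and the masked values are assumed to fit in 31 bits because the float conversion is signed.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: logical right shift of four 32-bit lanes, each by its
   own count. Assumption: every count is in 1..31. */
static inline __m128i srl_per_lane_sse2(__m128i v, const uint32_t counts[4])
{
    for (int i = 0; i < 4; ++i)
        assert(counts[i] >= 1 && counts[i] <= 31);

    /* Multiplier 2^(32-n) per lane; in real code this would be precomputed
       once per packing layout rather than rebuilt on every call. */
    __m128i mul = _mm_set_epi32((int)(1u << (32 - counts[3])),
                                (int)(1u << (32 - counts[2])),
                                (int)(1u << (32 - counts[1])),
                                (int)(1u << (32 - counts[0])));

    /* PMULUDQ multiplies only lanes 0 and 2, giving two 64-bit products. */
    __m128i even = _mm_mul_epu32(v, mul);
    /* Move lanes 1 and 3 down into lanes 0 and 2 and multiply those too. */
    __m128i odd = _mm_mul_epu32(_mm_srli_epi64(v, 32),
                                _mm_srli_epi64(mul, 32));

    /* High half of each product is the shifted value. For the even lanes,
       bring it down into lanes 0 and 2 (lanes 1 and 3 become zero). */
    even = _mm_srli_epi64(even, 32);
    /* For the odd lanes it already sits in lanes 1 and 3; clear lanes 0, 2. */
    odd = _mm_and_si128(odd, _mm_set_epi32(-1, 0, -1, 0));

    return _mm_or_si128(even, odd);
}

/* One unpack step over four DWs:
   out[i] = ((dw[i] >> shift[i]) & mask[i]) * scale[i] */
static inline void unpack4(const uint32_t dw[4], const uint32_t shift[4],
                           const uint32_t mask[4], const float scale[4],
                           float out[4])
{
    __m128i v = _mm_loadu_si128((const __m128i *)dw);
    v = srl_per_lane_sse2(v, shift);                        /* per-lane shift */
    v = _mm_and_si128(v, _mm_loadu_si128((const __m128i *)mask)); /* PAND */
    __m128 f = _mm_cvtepi32_ps(v);                          /* CVTDQ2PS */
    f = _mm_mul_ps(f, _mm_loadu_ps(scale));                 /* MULPS */
    _mm_storeu_ps(out, f);
}
```

With DWs {0xF8000000, 0x000003E0, 0x12345678, 0x80000000}, shifts {27, 5, 8, 31}, masks {0x1F, 0x1F, 0xFF, 0x1} and scales {1.0, 0.5, 2.0, 10.0}, `unpack4` stores 31.0, 15.5, 172.0 and 10.0. The two multiplies plus the recombination cost more than a single shift, so whether this beats a scalar loop would need measuring on the target hardware.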
Any other ideas how SSE might be used to unpack this data?