In my experience, SSE intrinsics in Visual Studio produce pretty poor code compared to inline assembler, so I wouldn't necessarily take that as a good indication of "true SSE performance".
Carmack's approximation seems interesting, as it's basically 2 iterations of the customary loop. I wonder how well it performs on a larger range of numbers. It also "messes up" the floating-point and integer units, since it overlays FPU data with integer data in order to do an integer subtraction on it. That's a bad idea unless absolutely necessary, because it forces the processor to sync the FPU with the integer unit: normally the integer unit operates independently of the FPU, and both units "prefer" to work independently.
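For reference, here is a rough sketch of the kind of float/integer overlay being discussed. It assumes the widely circulated form of the trick (the magic constant 0x5f3759df followed by a single Newton-Raphson step), which may not match the exact version posted earlier in the thread. The name rsqrt_approx is just an illustrative choice, and memcpy stands in for the usual pointer cast so the overlay stays well-defined C.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x). Assumes x is a positive, finite, normal float.
   The constant and the single refinement step follow the commonly quoted
   version of the trick; treat both as assumptions, not as the original code. */
float rsqrt_approx(float x)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);     /* view the float's bits as an integer */
    bits = 0x5f3759dfu - (bits >> 1);   /* the integer subtraction: first guess */
    memcpy(&y, &bits, sizeof y);        /* move the result back to a float */

    y = y * (1.5f - 0.5f * x * y * y);  /* one Newton-Raphson refinement */
    return y;
}
```

The two memcpy calls are where the value crosses between the floating-point and integer sides, which is exactly the crossing described above.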
In general, SIMD operations are only "meaningful" if there is a complete set of data.
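As a rough illustration of what a "complete set" means in practice, the sketch below only uses the packed SSE instruction when four floats are available, and handles any leftover elements one at a time. The names rsqrt_array and rsqrt_scalar are made up for this example, the scalar path uses the commonly quoted bit-trick constant as an assumption, and the SSE path is only compiled in when the compiler reports SSE support.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define USE_SSE 1
#else
#define USE_SSE 0
#endif

/* Scalar path for leftovers (and for builds without SSE).
   Assumes x is a positive, finite, normal float. */
static float rsqrt_scalar(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);    /* initial guess from the bit pattern */
    memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * x * y * y); /* one Newton-Raphson step */
}

/* Writes an approximation of 1/sqrt(in[i]) to out[i] for i in [0, n). */
void rsqrt_array(const float *in, float *out, size_t n)
{
    size_t i = 0;
#if USE_SSE
    /* Full groups of four: one packed instruction fills all four lanes. */
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_rsqrt_ps(_mm_loadu_ps(in + i)));
#endif
    /* Incomplete group: nothing to gain from SIMD, so go scalar. */
    for (; i < n; ++i)
        out[i] = rsqrt_scalar(in[i]);
}
```

With n = 6, for instance, the first four values go through the packed path and the last two fall through to the scalar loop. Both paths are approximations with different error bounds, so results for the tail will not match the packed lanes bit for bit.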