I like working in asm, as any optimization junkie does, but keep in mind that programming directly in asm restricts your code to a specific family of processors.
For example, Salem's code, compiled for x86 in i686 mode, becomes:
Code:
nthbyte_x:
movzx ecx, BYTE [esp+4]
mov eax, [esp+8]
sal ecx, 3
shr eax, cl
ret
But for x86 in amd64 mode, the same code becomes:
Code:
nthbyte_x:
movzx edi, dil
mov rax, rsi
lea ecx, [rdi*8]
shr rax, cl
ret
And on ARM (AArch32, for Cortex-A53):
Code:
nthbyte_x:
lsl r0, r0, #3
lsr r0, r1, r0
uxtb r0, r0
bx lr
But on ARM (AArch64, for Cortex-A53):
Code:
nthbyte_x:
ubfiz w0, w0, 3, 8
lsr x0, x1, x0
ret
And, to cite GCC as an example, it supports another 55 "kinds" of assembly (see the GCC documentation).
So if you write in asm, your code is not portable. If you are willing to restrict your implementation to one of these targets, it is fine to use asm; if not, DON'T USE IT!
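For comparison, here is a minimal sketch of a C source that could produce listings like the ones above. This is my reconstruction from the generated assembly, not necessarily Salem's exact code: the signature (an unsigned char index and an unsigned long value) is an assumption inferred from the register usage. The point is that this one function compiles unchanged for every target, while each asm listing works on exactly one.

```c
#include <limits.h>

/* Assumed reconstruction: return byte n of x, where byte 0 is the
 * least significant byte. The caller must keep n below
 * sizeof(unsigned long); a larger shift count is undefined behavior. */
unsigned char nthbyte_x(unsigned char n, unsigned long x)
{
    /* Shift the wanted byte down to the low position, then truncate. */
    return (unsigned char)(x >> (n * (unsigned)CHAR_BIT));
}
```

With this version, nthbyte_x(1, 0x11223344UL) yields 0x33 whether unsigned long is 32 or 64 bits wide, and the compiler picks the best instruction sequence for whatever target you build for.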