I could, however, see a game being programmed in C/C++ with core engine routines in NASM... but on its best day it would still be 10 times slower than Direct3D or OpenGL.

Hardware acceleration is here to stay. Live by it or die by it. Your choice.
How do you feel about matrix and quaternion operations, which are infamous for being very slow when sent to the FPU instead of the SSE/SSE2 registers? (I have seen code from a few people over at www.gamedev.net that is 10x-25x as fast as the C equivalent.) Because unless I missed a white paper, I didn't realize GPUs were being used for that kind of math.
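
To make the FPU-versus-SSE point concrete, here is a rough sketch of a 4x4 matrix times vector multiply written both ways. This is not the gamedev.net code; the function names (mat4_mul_vec4_scalar, mat4_mul_vec4) and the column-major layout are my own assumptions for illustration, and it makes no claim about the actual speedup, which depends on the compiler and CPU. The SSE path uses the standard intrinsics from <xmmintrin.h> and falls back to the scalar loop when SSE is not available.

```c
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE1 intrinsics: __m128, _mm_mul_ps, ... */
#endif

/* Hypothetical helper: out = m * v, with m stored column-major
 * (m[c*4 + r] is row r, column c). One float at a time.
 * out must not alias v. */
static void mat4_mul_vec4_scalar(const float m[16], const float v[4],
                                 float out[4])
{
    for (int r = 0; r < 4; ++r) {
        out[r] = m[0 * 4 + r] * v[0]
               + m[1 * 4 + r] * v[1]
               + m[2 * 4 + r] * v[2]
               + m[3 * 4 + r] * v[3];
    }
}

/* Hypothetical helper: same result, but each column is scaled by one
 * component of v and the four scaled columns are summed, so all four
 * output rows are computed in a single SSE register. */
static void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
#if defined(__SSE__)
    /* Unaligned loads so the caller does not need 16-byte alignment. */
    __m128 c0 = _mm_loadu_ps(m + 0);
    __m128 c1 = _mm_loadu_ps(m + 4);
    __m128 c2 = _mm_loadu_ps(m + 8);
    __m128 c3 = _mm_loadu_ps(m + 12);

    /* _mm_set1_ps broadcasts one scalar into all four lanes. */
    __m128 r = _mm_mul_ps(c0, _mm_set1_ps(v[0]));
    r = _mm_add_ps(r, _mm_mul_ps(c1, _mm_set1_ps(v[1])));
    r = _mm_add_ps(r, _mm_mul_ps(c2, _mm_set1_ps(v[2])));
    r = _mm_add_ps(r, _mm_mul_ps(c3, _mm_set1_ps(v[3])));

    _mm_storeu_ps(out, r);
#else
    /* No SSE on this target: use the plain C version. */
    mat4_mul_vec4_scalar(m, v, out);
#endif
}
```

Calling mat4_mul_vec4 with a translation matrix and a point gives the translated point, same as the scalar version. The column-major layout is what lets each column load straight into one register; with a row-major matrix you would need a transpose or horizontal adds, which is where a lot of naive SSE code loses its advantage.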