One thing that would speed up current designs is to lean on RISC principles more than CISC.
Also, it would help if you could load segment registers with descriptors w/o having to use LES, LDS, LFS, LGS, etc., for example:
mov es,[Source]
Opcodes like this, in other words. This would be about as fast as
LES EDI,dword ptr [ptr32]
The biggest improvement to the x86 design would probably be to implement the MOV instruction RISC-style rather than CISC-style. For instance, if a MOV took only 1 cycle instead of 3 or 4 (depending on usage), a program that executed 40000 MOVs would spend 40000 cycles on them instead of 120000 or 160000.
Also, if more opcodes were SIMD based like MMX and SSE2, the system would speed up a lot. And if MMX could do memory to memory operations, you could do this much faster:
Code:
;Additive color blending
;Adds Picture1 to screen image
;save ds
push ds
;load ds:esi with picture
lds esi,[Picture1]
;load es:edi with Screen
les edi,[Screen]
;load ecx with number of QWORDS in image
mov ecx,[size_in_bytes]
shr ecx,3 ;convert bytes to QWORDS
START:
;transfer 64 bits from image to MMX register 0
movq MM0,[ds:esi]
;perform packed add
paddd MM0,[es:edi]
;transfer 64-bits of last op back to screen
movq [es:edi],MM0
;increment edi by one QWORD
add edi,8
;increment esi by one QWORD
add esi,8
;decrement ecx by 1 QWORD
sub ecx,1
;jump to START while ecx>0 (LOOP would decrement ecx a second time)
jnz START
;clean up: leave MMX state so the FPU can be used again
emms
pop ds
Also, MMX has no packed add with saturation for DWORDs or QWORDs, which would help a lot. The PADDD used above does not saturate, so when a color byte goes past 0FFh it wraps around and the carry spills into the next byte of the same DWORD, which could cause a problem. There is a workaround via another MMX opcode, the byte-wise saturating add PADDUSB, but I did not implement it.
So if memory to memory operations were allowed, I could simply add pic1 to image1 and then blit the result.
But even as it stands, with MMX this operation is still blazing fast for a software implementation w/o using a 3D accelerator.
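To make the wraparound problem concrete, here is a rough scalar model in C of what happens to one 32-bit pixel. It is only a sketch: the function names add_dword_wrap and add_bytes_saturate are made up for this post, and the code imitates a single DWORD lane of PADDD and four byte lanes of PADDUSB rather than using real MMX instructions.

```c
#include <stdint.h>

/* Hypothetical helper: models one DWORD lane of PADDD.
   It is a plain 32-bit add, so a carry out of one color byte
   spills into the next byte of the same DWORD. */
static uint32_t add_dword_wrap(uint32_t a, uint32_t b)
{
    return a + b; /* unsigned overflow wraps modulo 2^32 */
}

/* Hypothetical helper: models four byte lanes of PADDUSB.
   Each byte is added on its own and clamped to 0xFF. */
static uint32_t add_bytes_saturate(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    int shift;

    for (shift = 0; shift < 32; shift += 8) {
        uint32_t sum = ((a >> shift) & 0xFFu) + ((b >> shift) & 0xFFu);
        if (sum > 0xFFu)
            sum = 0xFFu; /* clamp instead of wrapping */
        result |= sum << shift;
    }
    return result;
}
```

With a screen pixel of 00F0F0F0h and a picture pixel of 00202020h, add_dword_wrap gives 01111110h (every channel has wrapped and the carries have leaked upward), while add_bytes_saturate gives 00FFFFFFh, which is the result you want for additive blending.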