Originally Posted by
rcgldr
By trying to keep the inner loops of algorithms confined to memory spaces that fit within the cache. For multi-threaded programs, each core has its own L1 cache and usually its own L2 cache; the L3 cache and main memory are shared between cores.
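A common way to keep an inner loop's working set cache-resident is loop blocking (tiling). As a minimal sketch, here is a cache-blocked matrix transpose in C; the block size `B` is a hypothetical tuning parameter you'd adjust to your L1/L2 size, not a value from the post:

```c
#include <stddef.h>

enum { B = 64 };  /* tile edge length; tune to cache size (assumption) */

/* Blocked transpose of an n x n matrix: work proceeds in B x B tiles
   so the source rows and destination columns touched at any moment
   fit in cache, instead of streaming the whole matrix per row. */
void transpose_blocked(const double *src, double *dst, size_t n)
{
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t jj = 0; jj < n; jj += B)
            for (size_t i = ii; i < ii + B && i < n; i++)
                for (size_t j = jj; j < jj + B && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The same idea applies to any loop nest: restructure it so each tile of data is fully processed while it is still in L1/L2, rather than revisiting it after it has been evicted.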
Try to limit the number of heavily used variables to what the compiler can keep in registers when it optimizes the code. For x86 running in 64-bit mode, there are 16 general-purpose registers, and knowing that 16 registers are available, I wrote a 4-way bottom-up merge sort that uses 10 working pointers and 1 integer (for the run size). The pointers are used without indexing or offsets, which provides a slight performance improvement. When sorting a large array of 32-bit or 64-bit integers, it's about 15% faster than a 2-way bottom-up merge sort, and as fast as or slightly faster than quicksort.
Note - a 4-way merge sort involves the same total number of operations as a 2-way merge sort, except with about 1.5x the compares and 0.5x the moves (half as many merge passes over the data). Since each compare has already read the data, the moves are essentially just writes. The compares are a bit more cache friendly than the moves (writes), which explains the modest ~15% performance gain.
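To make the 4-way idea concrete, here is a minimal sketch of a 4-way bottom-up merge sort in C. This is not rcgldr's register-tuned version (it uses indices and a small selection loop rather than 10 unrolled pointers), but it shows the structure: each pass merges groups of four sorted runs, so the run size grows by 4x per pass instead of 2x:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Merge four sorted runs [r[0],r[1]) [r[1],r[2]) [r[2],r[3]) [r[3],r[4])
   of src into dst. Each output element costs up to 3 compares to pick
   the smallest head, but only one write — the 1.5x-compare / 0.5x-move
   trade-off described above. */
static void merge4(const int *src, const size_t r[5], int *dst)
{
    size_t p[4] = { r[0], r[1], r[2], r[3] };  /* run heads */
    for (size_t out = r[0]; out < r[4]; out++) {
        int best = -1;
        for (int k = 0; k < 4; k++)
            if (p[k] < r[k + 1] &&           /* run k not exhausted */
                (best < 0 || src[p[k]] < src[p[best]]))
                best = k;
        dst[out] = src[p[best]++];
    }
}

/* Bottom-up driver: run width quadruples each pass; src/dst buffers
   are swapped between passes, with a final copy back if needed. */
static void sort4way(int *a, int *tmp, size_t n)
{
    int *src = a, *dst = tmp;
    for (size_t w = 1; w < n; w *= 4) {
        for (size_t lo = 0; lo < n; lo += 4 * w) {
            size_t r[5];
            for (int k = 0; k <= 4; k++) {   /* clamp run ends to n */
                size_t b = lo + (size_t)k * w;
                r[k] = b < n ? b : n;
            }
            merge4(src, r, dst);
        }
        int *t = src; src = dst; dst = t;
    }
    if (src != a)
        memcpy(a, src, n * sizeof *a);
}
```

A production version would replace the inner selection loop with explicit compare chains over four run pointers, which is where keeping everything in registers pays off.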