Originally Posted by
rcgldr
By trying to keep the inner loops of algorithms confined to memory spaces that fit within the cache. For multi-threaded programs, each core has its own L1 cache and usually its own L2 cache; the L3 cache and main memory are shared between cores.
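A common way to keep an inner loop's working set cache-resident is loop blocking (tiling). As a minimal sketch, here is a cache-blocked matrix transpose in C; the block size `B` is a hypothetical tuning parameter you'd adjust to your L1/L2 size, not a value from the post:

```c
#include <stddef.h>

enum { B = 64 };  /* tile edge length; tune to cache size (assumption) */

/* Blocked transpose of an n x n matrix: work proceeds in B x B tiles
   so the source rows and destination columns touched at any moment
   fit in cache, instead of streaming the whole matrix per row. */
void transpose_blocked(const double *src, double *dst, size_t n)
{
    for (size_t ii = 0; ii < n; ii += B)
        for (size_t jj = 0; jj < n; jj += B)
            for (size_t i = ii; i < ii + B && i < n; i++)
                for (size_t j = jj; j < jj + B && j < n; j++)
                    dst[j * n + i] = src[i * n + j];
}
```

The same idea applies to any loop nest: restructure it so each tile of data is fully processed while it is still in L1/L2, rather than revisiting it after it has been evicted.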
Try to limit the number of heavily used variables to what the compiler can keep in registers when it optimizes the code. For x86 running in 64-bit mode, there are 16 general-purpose registers, and knowing that 16 registers are available, I wrote a 4-way bottom-up merge sort that uses 10 working pointers and 1 integer (for the run size). The pointers are used without indexing or offsets, which provides a slight performance improvement. When sorting a large array of 32-bit or 64-bit integers, it's about 15% faster than a 2-way bottom-up merge sort, and as fast as or slightly faster than quicksort.
Note - a 4-way merge sort involves the same total number of operations as a 2-way merge sort, except with about 1.5x the compares and 0.5x the moves (half as many merge passes over the data). Since each compare has already read the data, the moves are essentially just writes. The compares are a bit more cache friendly than the moves (writes), which explains the modest ~15% performance gain.
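To make the 4-way idea concrete, here is a minimal sketch of a 4-way bottom-up merge sort in C. This is not rcgldr's register-tuned version (it uses indices and a small selection loop rather than 10 unrolled pointers), but it shows the structure: each pass merges groups of four sorted runs, so the run size grows by 4x per pass instead of 2x:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Merge four sorted runs [r[0],r[1]) [r[1],r[2]) [r[2],r[3]) [r[3],r[4])
   of src into dst. Each output element costs up to 3 compares to pick
   the smallest head, but only one write — the 1.5x-compare / 0.5x-move
   trade-off described above. */
static void merge4(const int *src, const size_t r[5], int *dst)
{
    size_t p[4] = { r[0], r[1], r[2], r[3] };  /* run heads */
    for (size_t out = r[0]; out < r[4]; out++) {
        int best = -1;
        for (int k = 0; k < 4; k++)
            if (p[k] < r[k + 1] &&           /* run k not exhausted */
                (best < 0 || src[p[k]] < src[p[best]]))
                best = k;
        dst[out] = src[p[best]++];
    }
}

/* Bottom-up driver: run width quadruples each pass; src/dst buffers
   are swapped between passes, with a final copy back if needed. */
static void sort4way(int *a, int *tmp, size_t n)
{
    int *src = a, *dst = tmp;
    for (size_t w = 1; w < n; w *= 4) {
        for (size_t lo = 0; lo < n; lo += 4 * w) {
            size_t r[5];
            for (int k = 0; k <= 4; k++) {   /* clamp run ends to n */
                size_t b = lo + (size_t)k * w;
                r[k] = b < n ? b : n;
            }
            merge4(src, r, dst);
        }
        int *t = src; src = dst; dst = t;
    }
    if (src != a)
        memcpy(a, src, n * sizeof *a);
}
```

A production version would replace the inner selection loop with explicit compare chains over four run pointers, which is where keeping everything in registers pays off.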