Thread: How do you know if an algorithm uses the cache heavily?

  1. #1
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665

    How do you know if an algorithm uses the cache heavily?

    Basically, how do I know if the code I write will keep things in the cache vs using system memory?

    That's what a cache is, right? It's on-board memory inside a microprocessor, isn't it?

    Edit : What every programmer should know about memory, Part 1 [LWN.net]

    This seems amazing to me.
    Last edited by MutantJohn; 10-22-2013 at 08:18 PM.

  2. #2
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    There are special profiling tools that will give you that information, like VTune for Intel processors.

    Unfortunately it would have to be vendor-specific because it needs to read the chip's internal counters.

    Typical cache hit rates are 97-99%, if you are using memory sensibly.

  3. #3
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    OS profiling tools will also provide you with enough data to measure the use of cache by your programs. You just need to select the correct CPU counters.

    Cache-oriented programming (if we can call it that) is a bit of a black art. Regardless of the processor you use, they will all follow essentially the same three principles: Spatial Locality, Temporal Locality and Amdahl's Law. Rather than describing these to you, I'll add some references at the end of this post. So your processor, while it may implement things a bit differently, choosing to optimize a bit here or a bit there, will converge on these three principles. It is generally safe, from a performance point of view, to ignore the processor type and just write generic cache-friendly code.
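
    To make spatial locality concrete, here is a minimal sketch in plain standard C++ (nothing vendor-specific; the matrix size is just an illustrative assumption). Both functions compute the same sum over a row-major matrix, but the first walks memory sequentially and uses every cache line it fetches, while the second strides across rows and tends to miss on almost every access once the matrix outgrows the cache:

    Code:
    #include <cstddef>
    #include <vector>

    // A ROWS x COLS matrix stored in row-major order (the C/C++ default layout).
    constexpr std::size_t ROWS = 4096, COLS = 4096;

    // Cache-friendly: consecutive iterations touch consecutive addresses, so
    // every cache line fetched from memory is fully used before it is evicted.
    long long sum_row_major(const std::vector<int>& m)
    {
        long long sum = 0;
        for (std::size_t r = 0; r < ROWS; ++r)
            for (std::size_t c = 0; c < COLS; ++c)
                sum += m[r * COLS + c];
        return sum;
    }

    // Cache-hostile: successive accesses are COLS * sizeof(int) bytes apart, so
    // on a matrix this size almost every access misses the cache.
    long long sum_col_major(const std::vector<int>& m)
    {
        long long sum = 0;
        for (std::size_t c = 0; c < COLS; ++c)
            for (std::size_t r = 0; r < ROWS; ++r)
                sum += m[r * COLS + c];
        return sum;
    }

    int main()
    {
        std::vector<int> m(ROWS * COLS, 1);
        // Same result either way; only the memory-access pattern differs.
        return sum_row_major(m) == sum_col_major(m) ? 0 : 1;
    }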

    But... despite the memory cache being a processor-level concern, your operating system is going to define many aspects of your programming style. Data alignment, the scheduler, page coloring and prefetching are typical operating system features that will shape how you code your cache-optimized application. To some extent, your compiler/language will also have a say, since these techniques are largely defined or constrained by them.
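
    Data alignment is one place where this shows up directly in code. A minimal sketch in standard C++ (the 64-byte fallback line size is an assumption, and std::hardware_destructive_interference_size is only available on some C++17 implementations): two counters updated by different threads will ping-pong a shared cache line between cores unless each is aligned to its own line.

    Code:
    #include <atomic>
    #include <cstddef>
    #include <new>      // std::hardware_destructive_interference_size (C++17, optional)
    #include <thread>

    // Fall back to 64 bytes where the standard library doesn't provide the constant.
    #ifdef __cpp_lib_hardware_interference_size
    constexpr std::size_t CACHE_LINE = std::hardware_destructive_interference_size;
    #else
    constexpr std::size_t CACHE_LINE = 64;
    #endif

    // Bad: both counters typically share one cache line, so two threads that each
    // update "their own" counter still fight over that line (false sharing).
    struct CountersShared {
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    // Better: each counter is aligned to its own cache line, so independent
    // updates no longer invalidate each other's line.
    struct CountersPadded {
        alignas(CACHE_LINE) std::atomic<long> a{0};
        alignas(CACHE_LINE) std::atomic<long> b{0};
    };

    int main()
    {
        CountersPadded c;
        std::thread t1([&] { for (int i = 0; i < 1000000; ++i) c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (int i = 0; i < 1000000; ++i) c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join();
        t2.join();
        return 0;
    }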

    One of the reasons I find cache-oriented programming to be of little value is that I'm a strong supporter of OS-agnostic code. One must be dealing with some super duper highly specialized program for the performance gain to trump the advantage of easy code porting. But the other reason is that trying to optimize code for the CPU cache is actually a very difficult style of programming that complicates and delays our project, and provides only a temporary benefit, since processors and operating systems can make all our coding efforts obsolete in 5 or 10 years' time.

    The Elements of Cache Programming Style
    How The Memory Cache Works
    Memory Optimization (also covers aliasing and anti-aliasing techniques, an often overlooked problem in cache-oriented programming)
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  4. #4
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Thank you, Mario. That is a very insightful answer and gives me good advice. As long as those engineers keep making better and better parts, you're absolutely right. Now I feel better about myself

  5. #5
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Quote Originally Posted by MutantJohn View Post
    As long as those engineers keep making better and better parts, you're absolutely right.
    Well .... better is a debatable term.

    It's a bit hard to claim that modern processors are better than old ones, since they actually meet very different objectives. Older processors could be compared on clock speed and continuous linear throughput. Then processors were compared more on peak linear throughput (with clock speed a bragging right, but often not actually meaningful) due to the advent of things like pipelining and a multi-level memory hierarchy - despite the fact that real-world software can very rarely achieve peak linear throughput due to processor stalls and other phenomena (average throughput is actually a more useful measure of how a processor runs real software). Now processors are compared on peak bandwidth, as those measures suit multicore pipelined processors with multi-level caches, despite the fact that it is quite a difficult job to write software that correctly and efficiently exploits such features, and a lot of existing real-world software cannot be rearchitected appropriately.

    The thing is, these changes weren't about making a better product. They were about the ability of manufacturers to market their products, to encourage turnover (i.e. deliberate obsolescence), and to CLAIM it is better (e.g. as manufacturers stopped being able to measurably improve, they simply changed the metrics). And, by the way, the software industry is expected to magically follow along .... but has been spectacularly unsuccessful at doing so, except in some niche areas - with all the bad press that entails.


    Generally speaking, I agree with Mario. There has been a lot of fuss in the literature and by some big names about how things have changed relative to the past for software development. Even Stroustrup has jumped on the bandwagon, advocating use of vector rather than list, because there are benefits in terms of locality of reference, maximising use of fast cache rather than slow memory, and things like that on modern hardware. Their conclusions are perfectly fine for certain types of software (e.g. web servers that serve large numbers of clients) on modern hardware (i.e. a contemporary system with a multicore processor with multiple levels of cache that are substantially faster than RAM). The catch is that not all types of software benefit from doing things that way, not all software will run well on such processors and, worse, the semiconductor manufacturers will change their metrics (so a "better" processor in the future may be something completely different from today). There are, for example, continuing trends towards "green" processors, both in terms of reducing use of exotic/toxic materials and in terms of energy usage. It's a fair (albeit not certain) bet that as hardware architectures evolve so manufacturers can compete in those spheres, hardware engineers will not be shy about making changes that affect software development techniques (i.e. it will be necessary to rearchitect software again to really exploit future hardware features).
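
    To make the vector-versus-list point concrete (a minimal sketch in standard C++, not a benchmark, and subject to exactly the caveats above): both functions do the same logical work, but the vector's elements are contiguous, so traversal streams through memory and prefetches well, while each list node is a separate allocation, so traversal is a pointer chase that tends to miss the cache on every hop.

    Code:
    #include <list>
    #include <numeric>
    #include <vector>

    // Contiguous storage: sequential reads, friendly to the hardware prefetcher.
    long long sum_vector(const std::vector<int>& v)
    {
        return std::accumulate(v.begin(), v.end(), 0LL);
    }

    // Node-based storage: every ++it is a pointer chase to an essentially
    // unpredictable heap address.
    long long sum_list(const std::list<int>& l)
    {
        return std::accumulate(l.begin(), l.end(), 0LL);
    }

    int main()
    {
        std::vector<int> v(1000000, 1);
        std::list<int>   l(v.begin(), v.end());
        // Identical results; the difference shows up only in memory traffic.
        return sum_vector(v) == sum_list(l) ? 0 : 1;
    }

    None of which changes the caveats above - it only pays off for traversal-heavy code, and only on hardware where cache misses dominate.
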
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  6. #6
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    There are changes that do require software changes, like Intel's new Transactional Memory extension (which is really cool).

    But most hardware improvements do not require software changes. For example, if you take a normal CPU-intensive single-threaded application from 2007 (end of Pentium 4 era), and run it on an equivalently priced chip today that consumes even less power (~3.5 GHz i7), without even recompiling it, you'll see a ~3x/4x improvement just from internal architectural changes. Of course, if your program can be parallelized, you'd be looking at 10x to 20x improvement.

    I'd say it's safe to say it's definitely getting better, and hardware engineers deserve much more credit than you are giving them. They aren't just "playing" those number games. They have actually done some very remarkable innovation.

    Things like multi-level caches don't really need special programming techniques to exploit. If you just follow simple general guidelines, you will see improvements automatically in your code. Those guidelines have changed very little over the years (spatial and temporal locality, and minimize branches). For example, they have been trying very hard to minimize branch penalty by improving branch predictors, but that doesn't really change how you should write your code.
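
    As a (contrived) illustration of the "minimize branches" guideline - a sketch, not a measured benchmark: the first loop takes a data-dependent branch that a predictor can only guess at on random input, while the second expresses the same selection as arithmetic, so there is nothing left to mispredict.

    Code:
    #include <cstddef>
    #include <vector>

    // Sum only the non-negative elements of v.

    // Branchy: the `if` depends on the data, so on random input the branch
    // predictor guesses wrong roughly half the time.
    long long sum_nonnegative_branchy(const std::vector<int>& v)
    {
        long long sum = 0;
        for (int x : v)
            if (x >= 0)
                sum += x;
        return sum;
    }

    // Branchless: the comparison becomes a 0/1 factor, so there is no
    // data-dependent branch left to mispredict.
    long long sum_nonnegative_branchless(const std::vector<int>& v)
    {
        long long sum = 0;
        for (int x : v)
            sum += static_cast<long long>(x) * (x >= 0);
        return sum;
    }

    int main()
    {
        std::vector<int> v(1000000);
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = static_cast<int>(i % 7) - 3;   // mix of negative and non-negative values
        return sum_nonnegative_branchy(v) == sum_nonnegative_branchless(v) ? 0 : 1;
    }

    That said, optimizing compilers will often make this kind of transformation on their own, so measure before rewriting code this way.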

  7. #7
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Quote Originally Posted by cyberfish View Post
    For example, if you take a normal CPU-intensive single-threaded application from 2007 (end of Pentium 4 era), and run it on an equivalently priced chip today that consumes even less power (~3.5 GHz i7), without even recompiling it, you'll see a ~3x/4x improvement just from internal architectural changes. Of course, if your program can be parallelized, you'd be looking at 10x to 20x improvement.

    Have a look at the Wikipedia entry for Computer Performance. It quickly exposes the problems associated with computer performance: its many metrics, some overlapping, others complementary, many that will reveal flaws in a system otherwise thought to be optimized, and all revealing how bad they are at measuring actual program performance. What Grumpy was talking about was this very problem. Hardware metrics are becoming distant from how we measure software performance, to the point where we can no longer fully understand how to exploit the hardware to its fullest. It's also important to remember that a CPU improvement in one metric usually means some other was sacrificed (and those you will never know). I agree we can say CPUs are getting better. I think Grumpy agrees too. But they aren't getting Better.

    For something completely different, it's food for thought how your quote, for all the truth it contains, reduces the gains from coding techniques to an alarmingly small share.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  8. #8
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Quote Originally Posted by cyberfish View Post
    But most hardware improvements do not require software changes. For example, if you take a normal CPU-intensive single-threaded application from 2007 (end of Pentium 4 era), and run it on an equivalently priced chip today that consumes even less power (~3.5 GHz i7), without even recompiling it, you'll see a ~3x/4x improvement just from internal architectural changes. Of course, if your program can be parallelized, you'd be looking at 10x to 20x improvement.
    That is (part of) my point. The 10x to 20x is what you buy (because that is the basis of vendors' claims). The 3x/4x observed is what you get in practice, because the majority of programs cannot be parallelised all that much.

    Quote Originally Posted by cyberfish View Post
    I'd say it's safe to say it's definitely getting better, and hardware engineers deserve much more credit than you are giving them. They aren't just "playing" those number games. They have actually done some very remarkable innovation.
    I'm not arguing with the innovations by hardware engineers. I'm well aware of the strides they have made with materials (getting transistors smaller), manufacturing techniques (being able to manufacture more on a die affordably - the basis of Moore's law), and with architecting the processors and memory (superpipelining, cache architecture, speculative execution, etc etc).

    My point is that the manufacturers are made up of more than hardware engineers.

    Also, those 10x/20x factors (or 3x/4x practically) you're quoting are significantly less likely to be replicated in future, even from an engineering perspective. There is plenty of evidence of that in published semiconductor roadmaps.

    Quote Originally Posted by cyberfish View Post
    Things like multi-level caches don't really need special programming techniques to exploit.
    They do if your goal is to squeeze out more than the 3x/4x figures you're quoting (such as the 10x/20x figures). And those 3x/4x are not likely to be repeated much longer. Moore's law has not yet stopped, but it is certainly slowing down (the time for a doubling of the number of transistors that can be affordably manufactured in a die is longer than it was a few years ago).

    Quote Originally Posted by cyberfish View Post
    If you just follow simple general guidelines, you will see improvements automatically in your code. Those guidelines have changed very little over the years (spatial and temporal locality, and minimize branches).
    Some of those guidelines are actually a pretty big ask for software engineers .... a large part of the reason software is necessary is, for example, to implement logical change of execution path (i.e. branches and loops - which involve branches).

    Quote Originally Posted by cyberfish View Post
    For example, they have been trying very hard to minimize branch penalty by improving branch predictors, but that doesn't really change how you should write your code.
    It does affect how your code will behave, particularly code which is subject to the branch penalty. And branch predictors cannot (at least within the current state of knowledge) eliminate branch penalties. They simply shift around the likelihoods and impacts of some sections of code being affected by those penalties. Highly critical code (say, code that corrects for some very rare occurrence that will kill lots of people) is one type of code whose behaviour is compromised by branch prediction, unrolling, and other such techniques. With older processors, software engineers were able to exercise sufficient control so that such code would work as intended. With modern processors, that control is implicitly or explicitly removed from the software engineer - who is being constrained to use modern processors because of "commercial realities".

    The point is that modern processor design is all about trade-offs. CPUs are getting better by some metrics, but achieving that by making sacrifices against other metrics. One of the metrics that is being regularly sacrificed is the ability of software engineers to understand and reason about how their code will ACTUALLY be executed .... the abstractions that software engineers use to reason about software are getting further and further from reality.

    And the hardware manufacturers have amply demonstrated already that correct execution of software is less important than the performance metrics they use for marketing. That does not rest at the feet of hardware engineers. It rests at the feet of senior management and sales departments of manufacturers - the ones who control the pay of hardware engineers.
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  9. #9
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Not gonna lie, I think "green" processors are the way to go and they should be the future. I undervolt my CPU not only to reduce power but to keep it cool, so it's a win-win, as I've noticed no performance difference.

    And I really did like Mario's link to aliasing and how to write more memory-friendly code. So much to read

  10. #10
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Quote Originally Posted by Mario F. View Post
    Have a look at the Wikipedia entry for Computer Performance. It quickly exposes the problems associated with computer performance: its many metrics, some overlapping, others complementary, many that will reveal flaws in a system otherwise thought to be optimized, and all revealing how bad they are at measuring actual program performance. What Grumpy was talking about was this very problem. Hardware metrics are becoming distant from how we measure software performance, to the point where we can no longer fully understand how to exploit the hardware to its fullest. It's also important to remember that a CPU improvement in one metric usually means some other was sacrificed (and those you will never know). I agree we can say CPUs are getting better. I think Grumpy agrees too. But they aren't getting Better.
    Benchmarking is very tricky business indeed. Ultimately, what people care about is how fast the programs they actually use run. I'm saying CPUs are getting better because, for most/all people, the speed at which the programs they use can do things (compression, encryption, video encoding/decoding, etc) has increased by 3x/4x to 10x/20x on average over just the past few years.

    Quote Originally Posted by grumpy View Post
    They do if your goal is to squeeze out more than the 3x/4x figures you're quoting (such as the 10x/20x figures). And those 3x/4x are not likely to be repeated much longer. Moore's law has not yet stopped, but it is certainly slowing down (the time for a doubling of the number of transistors that can be affordably manufactured in a die is longer than it was a few years ago).
    Sure, to get closer to 10x/20x, you need to parallelize your program, but that is also a very old concept that has existed on supercomputers and mainframes for ages, and just recently (5-6 years ago) became relevant for consumer devices. It's also not a CPU-specific concept. If you properly multithread your program, it will be able to use 80-90% of any CPU's performance for many years to come, and you don't really have to change your code for specific processors. I don't think it's really fair to expect hardware to provide 10x/20x improvements all the time without any software changes. Programming is not supposed to be that easy.

    Quote Originally Posted by grumpy View Post
    Also, those 10x/20x factors (or 3x/4x practically) you're quoting are significantly less likely to be replicated in future, even from an engineering perspective. There is plenty of evidence of that in published semiconductor roadmaps.
    That is true and not really the topic of this discussion.

    Quote Originally Posted by grumpy View Post
    It does affect how your code will behave, particularly code which is subject to the branch penalty. And branch predictors cannot (at least within the current state of knowledge) eliminate branch penalties. They simply shift around the likelihoods and impacts of some sections of code being affected by those penalties. Highly critical code (say, code that corrects for some very rare occurrence that will kill lots of people) is one type of code whose behaviour is compromised by branch prediction, unrolling, and other such techniques. With older processors, software engineers were able to exercise sufficient control so that such code would work as intended. With modern processors, that control is implicitly or explicitly removed from the software engineer - who is being constrained to use modern processors because of "commercial realities".
    The way to eliminate branch penalties would be to remove pipelining, which is probably the second biggest advancement in CPU technology in the past 20 years (#1 is cache). Removing pipelining would bring CPU performance back probably by around 10 years. Is that worth it just to be able to more simply define CPU performance?

    Modern CPUs are definitely less predictable than old CPUs, but that's because they are more complex, and while no one likes complex things, it's a necessary evil most of the time to continue advancing. I would rather have a CPU that performs unpredictably (performance-wise) between 9x to 11x faster than another CPU that performs predictably at 1x.

    The same can be said about software as well. 20 years ago we typed simple commands one at a time and knew exactly what the computer was doing. The computer wouldn't run a single instruction without us typing a command. Nowadays, when you open Microsoft Word, before you even start typing, it's already doing a million things in the background. Performance is also unpredictable because the scheduler schedules threads to try to optimize for responsiveness. Should we go back to the old days just so we know exactly what computers are doing?

    Quote Originally Posted by grumpy View Post
    And the hardware manufacturers have amply demonstrated already that correct execution of software is less important than the performance metrics they use for marketing.
    Huh? When did that happen?

    When CPUs don't perform according to the ISA, it's definitely a hardware bug, and while hardware bugs do exist, they are much rarer than software bugs. The job of hardware designers is to make the CPU as fast as possible while conforming to the ISA (the result of a program should be exactly the same as if the instructions were executed sequentially, with the specified side effects), much like how compilers' optimizers try to make your code as fast as possible while maintaining correctness (the same result as a naive compilation).

  11. #11
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Quote Originally Posted by cyberfish View Post
    The way to eliminate branch penalties would be to remove pipelining, which is probably the second biggest advancement in CPU technology in the past 20 years (#1 is cache). Removing pipelining would bring CPU performance back probably by around 10 years. Is that worth it just to be able to more simply define CPU performance?
    That's just it. You consider performance to be king. Long live the king, and kill all who oppose him. That's still not true in all applications.

    A lot of safety-critical applications rely on events occurring with a specified timing relationship to some stimulus, where there is both a lower bound and an upper bound on the permitted time delay. Predictability of timing matters in those cases. Without it, it is considerably more difficult to limit the probability of people being unintentionally killed.

    Modern processors are being used in safety-critical applications, because of the "commercial reality" (i.e. it is only possible to cost-effectively acquire modern processors). There are documented cases of modern passenger vehicles traveling down a freeway at high speed and unable to be slowed down or stopped until they run out of fuel, because the software has entered a mode that the developer did not anticipate, in which it ignores driver requests (hitting the brake pedal, engaging the hand brake, changing gears, turning the ignition off). That is one of the consequences of complexity - if a developer can't understand what the software is doing and how the hardware really executes it, it is not always possible to adequately anticipate and mitigate hazards.

    Quote Originally Posted by cyberfish View Post
    Huh? When did that happen?

    When CPUs don't perform according to the ISA, it's definitely a hardware bug, and while hardware bugs do exist, they are much rarer than software bugs. The job of hardware designers is to make the CPU as fast as possible, while conforming to the ISA (result of programs should be exactly the same as if the instructions were executed sequentially, with the specified side effects), much like how compilers' optimizers try to make your code as fast as possible, while maintaining correctness (same result as an naive compilation).
    It's been happening for over 20 years.

    You're interpreting my words as arguing against progress. I'm not. I'm pointing out that, in the interests of progress, some things have been sacrificed. And one of those, despite what you say, is guarantees of correct execution of software by hardware - which software engineers still assume and rely on. It won't be obvious in consumer applications for smart phones, MP3 players, or playing internet backgammon (where the implications are small and hard to notice). It is obvious in other types of development.
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  12. #12
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    So basically, if your life depended on it, it wouldn't end well lol.

  13. #13
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    I am quite familiar with real time systems.

    There are still many very simple and predictable CPUs available - AVRs, PICs, Z80, etc, and they are used in embedded systems where instruction-level timing is important. Though usually they are only in charge of the critical subsystem and not the entire system, since they are not in the same order of magnitude performance-wise as "less predictable" processors. CPUs designed for critical applications (and even consumer applications) go through very rigorous testing and verification in both simulation and hardware, just like software designed for critical applications. That said, hardware bugs do come up once in a while, just like software bugs, because hardware engineers are not perfect either.

    Embedded system reliability is a multi-faceted problem, and while CPU "unpredictability" is theoretically a problem, I don't believe it is a very big one in most cases compared to all the other problems (scheduler unpredictability, bugs, hardware reliability, interference, etc). This is assuming the CPU is functionally correct. Otherwise it's a bug just like a software bug, either of which can have very bad consequences.

    Quote Originally Posted by grumpy View Post
    Modern processors are being used in safety-critical applications, because of the "commercial reality" (i.e. it is only possible to cost-effectively acquire modern processors). There are documented cases of modern passenger vehicles traveling down a freeway at high speed and unable to be slowed down or stopped until they run out of fuel, because the software has entered a mode that the developer did not anticipate, in which it ignores driver requests (hitting the brake pedal, engaging the hand brake, changing gears, turning the ignition off). That is one of the consequences of complexity - if a developer can't understand what the software is doing and how the hardware really executes it, it is not always possible to adequately anticipate and mitigate hazards.
    That is an example of very poor system design. The same can happen with or without hardware "complexities". There should be at least a way for the operator to mechanically disable the car. A software lockup is probably the most common and predictable problem in this kind of system, and the system designer should have designed for it.

    Designing a safety-critical system is a very difficult problem, and we have already come a very long way. When was the last time an airplane crashed purely because of a software/electronics malfunction? It's very rare, because of the redundancy built into every single layer of the system. Those are systems designed for failures, on all levels from software to mechanical linkages. Most smaller airplanes can continue flying normally even with a total electrical system failure. All control inputs from electronic systems can be mechanically overridden. On bigger airplanes that require electrical systems to work, there are always several redundant systems that are designed to be as independent and different as possible.

    The point is, I don't believe whether a function takes 20 clock cycles or 25 clock cycles is a very big deal in most cases, and in cases where it is, people use dedicated co-processors to deal with it.

    So yes, I do believe given correctness, performance is the most important thing for CPUs.

    Quote Originally Posted by grumpy View Post
    It's been happening for over 20 years.

    You're interpreting my words as arguing against progress. I'm not. I'm pointing out that, in the interests of progress, some things have been sacrificed. And one of those, despite what you say, is guarantees of correct execution of software by hardware - which software engineers still assume and rely on. It won't be obvious in consumer applications for smart phones, MP3 players, or playing internet backgammon (where the implications are small and hard to notice). It is obvious in other types of development.
    Any examples?

    Software engineers should be able to assume correct execution according to the ISA, just like how if we write standard-compliant code, we expect the compiler to produce a correct program, even if it does all kinds of crazy optimizations.
    Last edited by cyberfish; 10-27-2013 at 03:31 AM.

  14. #14
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Quote Originally Posted by cyberfish View Post
    Any examples?

    Software engineers should be able to assume correct execution according to the ISA, just like how if we write standard-compliant code, we expect the compiler to produce a correct program.
    The program may be correct. But its execution can still be flawed by the way the CPU processes it. This is the central point of this discussion.

    This is not about the tools we have to ensure proper execution (like memory barrier instructions). It's about the fact that you must accept that your program no longer executes as you may expect it to. The existence of these tools is how engineers managed to jump over the obstacle new processors presented: i.e. hardware-imposed unpredictable and potentially incorrect execution. Your ISA isn't by any means a guarantee of correct execution.

    To put it another way, you must jump through hoops to guarantee proper execution, which means your processor no longer guarantees proper execution if you don't jump through hoops.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  15. #15
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    I see where you are coming from, but I would argue that is part of the ISA. If the ISA doesn't guarantee order of memory operations without barriers, then if a program doesn't use barriers and expects in-order memory operations, the program is incorrect. I see ISA as the "contract" between hardware and software, and it should be very well documented.

    I agree that ISAs are more complex than they were before, but as long as correct programs (according to the ISA) still get executed correctly, I wouldn't say the CPU is "broken".

    It's similar to, for example, the use of "volatile" in C/C++. If you don't use it on a variable that can be changed externally (analogous to not using memory barriers when expecting memory operations to be in-order), your program may not work when compiled by an optimizing compiler (analogous to a modern CPU). It may work with a naive compiler (analogous to a naive in-order non-pipelined CPU), but that doesn't mean the program is correct. It is still incorrect according to the "contract" (C standard or ISA).
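
    To put that "contract" in code, here is a minimal C++11 sketch using std::atomic rather than volatile (chosen for illustration, since it is the portable way to express the barrier in standard C++): with a release store paired with an acquire load the hand-off is well defined, whereas with relaxed ordering (or a plain non-atomic flag) neither the C++ memory model nor a weakly ordered ISA guarantees the reader sees the data.

    Code:
    #include <atomic>
    #include <cassert>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    void producer()
    {
        data = 42;
        // Release: everything written before this store becomes visible to a
        // thread that observes ready == true with an acquire load.
        ready.store(true, std::memory_order_release);
    }

    void consumer()
    {
        // Acquire: pairs with the release store above.
        while (!ready.load(std::memory_order_acquire)) { /* spin */ }
        // Guaranteed by the release/acquire pair. With memory_order_relaxed on
        // both sides (or a plain non-atomic flag), neither the C++ memory model
        // nor a weakly ordered ISA would guarantee this.
        assert(data == 42);
    }

    int main()
    {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
        return 0;
    }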
