Thread: Pipeline Stalls?

  1. #1
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665

    Pipeline Stalls?

    I was arguing on a different forum about how to write "optimized" C/C++ code. The other person got really mad and just walked away. Mostly 'cause they started talking mad poop about how I probably don't know the difference between C and C++ or even understand programming at all (which made me lol; it's a videogame board, I'm not going into super detail talking about programming).

    Anyway, they threw around the term "pipeline stall". I tried reading up on this, but all I can find about it is super low-level computer architecture stuff. Basically, what causes a pipeline stall from the C perspective? What is it exactly? Why should I care?

    It seems like a super micro-optimization that doesn't honestly matter.

    It was also funny that they acted like changing thread priority was a valid means of "optimizing" code.

    Edit : Oh God, so much computer architecture! Forget it, tooooo boring XD

    Lol jk. But seriously, trying to read up on this is... It's a new world lol. I thought I was special because I was using the "restrict" keyword. Guess not O_o

    It seems like conditional code (using an if) creates a stall (duh!). It also seemed like using a function pointer prevents the processor from saturating the pipeline with future instructions. Is that a correct interpretation? I'm actually very curious about this now.
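    Like, is this the kind of thing people mean? Here's a toy sketch I cooked up while reading (my own made-up example, not from any article): the first loop has an unpredictable if that can mispredict and flush the pipeline, and the second is the same logic written so the compiler might emit a conditional move instead of a branch.
    Code:
    #include <stddef.h>

    /* Branchy: with unpredictable data, the if can mispredict, and a
       misprediction flushes the pipeline. */
    long sum_if(const int *a, size_t n, int limit)
    {
        long sum = 0;
        for (size_t i = 0; i < n; ++i)
            if (a[i] < limit)
                sum += a[i];
        return sum;
    }

    /* Branchless: same result, but the compiler may use a conditional
       move, so there is no branch to mispredict. */
    long sum_select(const int *a, size_t n, int limit)
    {
        long sum = 0;
        for (size_t i = 0; i < n; ++i)
            sum += (a[i] < limit) ? a[i] : 0;
        return sum;
    }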
    Last edited by MutantJohn; 07-03-2015 at 12:14 AM.

  2. #2
    Tweaking master Aslaville's Avatar
    Join Date
    Sep 2012
    Location
    Rogueport
    Posts
    528
    Quote Originally Posted by MutantJohn View Post

    Anyway, they threw around the term "pipeline stall". I tried reading up on this, but all I can find about it is super low-level computer architecture stuff. Basically, what causes a pipeline stall from the C perspective? What is it exactly? Why should I care?
    I'm not sure whether you should care; I don't care myself.

    I figured you might find this blog interesting: Playing with the CPU pipeline – Lol Engine .

  3. #3
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    Anyway, they threw around the term "pipeline stall".
    O_o

    A person throwing around the phrase "pipeline stall" is probably trying to be too clever and actually introducing code which can't be easily optimized.

    *shrug*

    An optimizer tuned for particular hardware has information about how long certain mechanisms take to complete. A good optimizer can order instructions so that something useful is done while waiting on other instructions, effectively reducing the cost of a stall.
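    For example (a toy sketch of mine, not the output of any real optimizer): in the following, the final add needs the loaded value, so if the load misses cache and nothing is scheduled in between, the pipeline simply waits. A scheduling compiler may move the independent multiply into that gap so the core has something useful to do.
    Code:
    /* 'loaded + other' depends on the load; the multiply does not, so a
       good optimizer can overlap it with the load's latency. */
    int mix(const int *p, int x, int y)
    {
        int loaded = *p;     /* potentially long-latency load */
        int other  = x * y;  /* independent work */
        return loaded + other;
    }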

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  4. #4
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    So I checked out that blog post. Man, some people really get into optimizing their code O_o

    I know I should too but damn, that's... That's a lot of effort lol.

    I think I have a better understanding of pipeline stalls now. I've seen a talk on instruction-level parallelism before, about how you can hide arithmetic latency by issuing more of the same instruction for different data. Very interesting to think about though.

    I appreciate your guys' help!

    And now I remember what we were arguing about in the first place. Someone was talking about video games being optimized for PCs vs consoles and all I could think was the exact opposite. Consoles and PCs both benefit from the same high-level optimizations. C and C++ are high-level languages so if you try to "optimize" for a PC, the console is just as likely to benefit as well. I was then proposing that one thing consoles have going for them is that each one has the exact same hardware so inlining assembly is much more viable than with PCs.

    This then leads me to wonder, how does assembly work? I've heard that assembly both is and isn't portable. So which is it? I've seen things like x86 architecture but is there a universal x86 instruction set that will work on every processor in that group? I've also heard that assembly instructions can vary among processors in the same "family" so I'm really just confused. I thought assembly was the exact opposite of portable.

  5. #5
    Tweaking master Aslaville's Avatar
    Join Date
    Sep 2012
    Location
    Rogueport
    Posts
    528
    Consoles and PCs both benefit from the same high-level optimizations.
    The catch here, though, is probably that PCs are x86 while most consoles are PPC, so to optimize around something like a pipeline stall you might need different techniques and tools.

    I've seen things like x86 architecture but is there a universal x86 instruction set that will work on every processor in that group? I've also heard that assembly instructions can vary among processors in the same "family" so I'm really just confused. I thought assembly was the exact opposite of portable.
    Most x86 CPUs will offer a common basic instruction set, though I don't think they're strictly required to. If they didn't, it would break existing software, which is clearly not the aim.

    Advanced instructions like the SIMD extensions do vary, though, and there is a way to detect whether a CPU supports certain instructions: the CPUID instruction.
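    For example, GCC and Clang expose that through builtins which wrap CPUID. A minimal sketch (compiler-specific, not standard C):
    Code:
    #include <stdio.h>

    int main(void)
    {
        __builtin_cpu_init();   /* populate the feature flags */

        printf("SSE2: %s\n", __builtin_cpu_supports("sse2") ? "yes" : "no");
        printf("AVX:  %s\n", __builtin_cpu_supports("avx")  ? "yes" : "no");
        printf("AVX2: %s\n", __builtin_cpu_supports("avx2") ? "yes" : "no");
        return 0;
    }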

  6. #6
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    Quote Originally Posted by MutantJohn View Post
    And now I remember what we were arguing about in the first place. Someone was talking about video games being optimized for PCs vs consoles and all I could think was the exact opposite. Consoles and PCs both benefit from the same high-level optimizations. C and C++ are high-level languages so if you try to "optimize" for a PC, the console is just as likely to benefit as well. I was then proposing that one thing consoles have going for them is that each one has the exact same hardware so inlining assembly is much more viable than with PCs.

    This then leads me to wonder, how does assembly work? I've heard that assembly both is and isn't portable. So which is it? I've seen things like x86 architecture but is there a universal x86 instruction set that will work on every processor in that group? I've also heard that assembly instructions can vary among processors in the same "family" so I'm really just confused. I thought assembly was the exact opposite of portable.
    Assembly is not portable. So the thing is that each processor needs some kind of instructions that tell it "what to do." Those are assembly instructions. But there is no universal standard for what instructions a processor can accept or understand. So yeah, different processors have different instruction sets. That's why assembly isn't portable, unlike C and C++. With C/C++, you ALWAYS (okay, that's a little excessive, but let's go with that) write the code in the same format, regardless of processor or architecture. The compiler just translates this input into the instruction set of the target architecture.

    Unfortunately, everything is not black and white. Processor design is tricky at best. Different processors execute different code with different levels of performance. Something that works great on one processor can work horribly on another. That's why optimizations for one architecture generally do not transfer over to another.

    The x86 instruction set works on all PC processors. However, you should know that not all processors are made equal. Some instructions work better on some processors and worse on others (SSE is a typical example where it actually often ran slower on AMD processors; at least, that was the case once upon a time). You also need to consider extensions to the instruction set. AVX, for example, is indeed part of x86, but doesn't work on all x86 processors. So yeah, instructions do "vary" among "families." As do "features." It's all part of cost savings and, of course, introducing new instructions for better performance in newer processors.

    Assembly is anything BUT portable.
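    That is also why code that wants to use something like AVX typically picks an implementation at runtime. A rough sketch of the idea (the function names are made up, the feature check is a GCC/Clang extension rather than standard C, and a real AVX path would use intrinsics built with the right target flags):
    Code:
    #include <stdio.h>

    static long sum_scalar(const int *a, int n)
    {
        long s = 0;
        for (int i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    static long sum_avx(const int *a, int n)
    {
        /* stand-in for an AVX-accelerated version */
        return sum_scalar(a, n);
    }

    int main(void)
    {
        long (*sum)(const int *, int) = sum_scalar;  /* safe default */
        if (__builtin_cpu_supports("avx"))           /* upgrade when supported */
            sum = sum_avx;

        int data[] = { 1, 2, 3, 4 };
        printf("%ld\n", sum(data, 4));
        return 0;
    }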
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  7. #7
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Quote Originally Posted by MutantJohn View Post
    This then leads me to wonder, how does assembly work? I've heard that assembly both is and isn't portable. So which is it? I've seen things like x86 architecture but is there a universal x86 instruction set that will work on every processor in that group? I've also heard that assembly instructions can vary among processors in the same "family" so I'm really just confused. I thought assembly was the exact opposite of portable.
    It depends on whom you ask. Assembly can indeed be seen as either non-portable or portable, even across different architectures.

    From the point of view of computer architecture, high-level programming languages are completely ignorant of any portability issues. It is the tools used to compile and translate the language to machine code that end up defining a programming language's portability. C++ is no more or less portable than C#. But C++ compilers exist for more platforms than C# ones, making C++ a better option for portability. Yet both language specifications are entirely platform-agnostic.

    An assembly language is largely defined by the processor's instruction set architecture (ISA). It is a low-level hardware programming language, but one that essentially provides English-based mnemonics for the opcodes defined in the ISA. One can conceivably write an assembler that allows for portable code across a number of different architectures. This assembler would just need to parse and transform code based on assembly directives indicating the target architecture. Such assembly languages and their assemblers already exist.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  8. #8
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    That's interesting. Thanks for the lowdown on assembly!

    It's funny, I've also learned how to indirectly avoid pipeline stalls.

    There's a somewhat popular CUDA presentation where the author notes that instruction level parallelism can help achieve higher throughput.

    So if you have some code that does:
    Code:
    A = a + b;

    X = A....;

    you can get more performance by doing:
    Code:
    A = a + b;
    B = c + d;
    C = e + f;
    
    X = A....;
    Y = B....;
    Z = C....;
    Although the author framed it in terms of "arithmetic latency". I've yet to really try it because it seems like a very annoying way to program.
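    If I understood it right, a plain C version of the trick would look something like this (my own sketch, not the code from the presentation): the single accumulator forms one long dependency chain, while four accumulators give the hardware independent additions it can overlap. (The floating-point results can differ slightly because the additions get reassociated.)
    Code:
    /* One accumulator: every add has to wait for the previous one. */
    double sum_single(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += a[i];
        return s;
    }

    /* Four accumulators: four independent chains the core can interleave. */
    double sum_quad(const double *a, int n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        int i = 0;
        for (; i + 3 < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; ++i)   /* leftover elements */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }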

  9. #9
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Quote Originally Posted by MutantJohn View Post
    That's interesting. Thanks for the lowdown on assembly!
    But in retrospect I feel I didn't make a clear enough case for why assembly can be seen as a portable language. Let me try another way...

    Assembly is little more than an abstraction of ISA opcodes into English-like mnemonics. On one system, 0x90 is defined as NOP, whereas on another system NOP has opcode 0x35. This means that within the same architecture family of processors, it is somewhat straightforward to define an assembly language that is portable across all of them. This is a more powerful level of portability than the one usually defined by operating system constraints, which is what constrains higher-level languages. So assembly can even be seen as "more portable" than higher-level programming languages. That is, the reach of a single trivial assembly program is potentially greater than that of a single trivial program written in a high-level language.

    How easy it is to define such a portable assembly language depends a lot on how the ISAs within a given architecture differ. But generally speaking, x86 assembly is largely portable across the x86 family of processors, going back to the 8086 in the late 70s, nearly 40 years ago. It is expected that x64 assembly will follow a similar path.

    "But Mario, this is not what we see in practice. Assembly code is usually written for a very tight group of processors". That's true. If all we had to go for was opcodes, assembly would be an incredibly powerful portable tool. But a processor is made up of more than gate controlled pathways. Memory organization -- the type and number of registers -- play a role too in the definition of the assembly language, for instance. The fact we address registers by name and that registers can have different constraints on data size, halts many of the hopes of "writing once and running anywhere". New generations of processors in the same family also tend to introduce new opcodes which are obviously not backwards compatible.

    What this means is that portable assembly code is potentially a nightmarish spaghetti mess of assembly macros and directives controlling processor-specific snippets of code, revealing one weakness of a portable assembly program: it is very hard to maintain.

    But let us step back for a minute.

    When we write systems-portable code in C or C++, which are considered high-level programming languages, do we expect anything better? Generally speaking, yes. C and C++ demand considerably less boilerplate code to tackle portability across systems. They abstract much of it away. But just like C and C++ have a compiler and linker that are the tools actually tackling all the problems of system portability (across a single family of processors), assembly has the assembler, which can likewise greatly reduce assembly boilerplate code. It just so happens that, traditionally, assemblers don't have much parsing to speak of, whereas C and C++ compilers are designed around much more complex parse trees. However, the potential for portable assemblers is there, just the same as for any high-level language.

    But what happens when we want to write C and C++ code not for different processors, but for different operating systems?
    Oh, boy! Then we get to see the dark side of high-level language portability. Entire libraries are designed to try to abstract it away. Without those libraries, your C and C++ code is potentially as nightmarish as assembly. But you can also write assembly "libraries". So, all in all, what one can say about C and C++ is that they are as difficult to port across operating systems as assembly is to port across processors, making both languages equally challenging to use depending on the task at hand. But, ironically enough, we are used to saying that C and C++ are portable languages.

    Newer interpreted languages like Java or Python are no different in this regard. They seem to have solved the problem of portability across different operating systems as well. But they have done it at the expense of large libraries and, of course, system- and operating-system-specific interpreters. There is really no portability. It's an illusion. They provide portability by selectively and progressively tackling it through the core language processes, all the way down to the processes that end up producing machine code. And guess what is at the lowest level before machine code is produced? Yep.

    So assembly, being an abstraction too, can be as portable as the person writing the assembler wishes it to be. And in many senses assembly is more portable than any high-level programming language, because its architecture-based constraint has a wider reach than the more limiting operating-system constraint of high-level languages. We, however, consider high-level languages portable. But that is OK; programming is full of idiosyncrasies, misnomers and wrong attributions.

    Quote Originally Posted by MutantJohn View Post
    Although the author framed it in terms of "arithmetic latency". I've yet to really try it because it seems like a very annoying way to program.
    Well, like any other very detailed optimization technique, you will only want to use it on critical sections of code in your performance-critical programs, and only when the results are worth the effort. That goes without saying, I guess. But when that very rare day comes when you actually need it, you will find it beautiful instead of annoying.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.
