Thread: Can we talk about APUs?

  1. #1
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665

    Can we talk about APUs?

    So, in the Rust thread Elysia mentioned APUs.

    Now, to me it seems like an APU is just a GPU and a CPU on the same "die". Well, at least one implementation of an APU is a CPU/GPU hybrid so let's just discuss that for the time being.

    Do we expect this to change how we program very much? Wikipedia notes that one common version is a CPU combined with an OpenCL-compatible GPU.

    I don't want to have to use OpenCL. I don't like the fact that my GPU kernels have to be written as strings and that any compilation bugs are only found at run-time. But that's just something I guess I need to get used to. I think CUDA is friendlier for C programmers, but that's because I heard it's really just a C API, in which case, go nVidia; it's pretty sexy.

    Also, nVidia has Project Denver, which is an ARM CPU paired with an nVidia GPU, so I imagine it'll be OpenCL compatible through its CUDA packages like it is now.

    But aside from these hardware specifics, should we expect to see much of a programming paradigm shift? Like, what differences should we see if there are any?

    Writing separate parallel kernels isn't exactly a radical departure from what we have now, although it is more interesting to think about launching an instance of a kernel for every point in a particle simulation, for instance, and how thread IDs become the way we identify points in the array. But that's just one isolated example.
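    Just to make that concrete for myself, here's a minimal sketch of the "one kernel instance per point" idea (the struct and names are made up for illustration, not from any real code):

    Code:
    #include <cuda_runtime.h>

    struct Particle { float x, y, z, vx, vy, vz; };

    // Each thread advances exactly one particle; the global thread ID doubles
    // as the index into the particle array.
    __global__ void update_positions(Particle *p, int n, float dt)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            p[i].x += p[i].vx * dt;
            p[i].y += p[i].vy * dt;
            p[i].z += p[i].vz * dt;
        }
    }

    // Host side: launch enough blocks to cover every particle.
    // update_positions<<<(n + 255) / 256, 256>>>(d_particles, n, dt);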

  2. #2
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Quote Originally Posted by MutantJohn View Post
    Now, to me it seems like an APU is just a GPU and a CPU on the same "die".
    That's a bit of an over-simplification, but alright for thinking about it.

    Quote Originally Posted by MutantJohn View Post
    Do we expect this to change how we program very much?
    It depends on how you reason about your software designs. Effective use of an APU requires thorough reasoning about which parts of a program will be executed on a GPU (high parallelism, simple instructions) versus a CPU (a relatively small number of instruction streams that may include complex instructions which are executed sequentially).

    The thing is, humans are not particularly good at reasoning about highly parallel processes, or the interaction (or rendezvous) between such processes. This makes programming a GPU more challenging than coding for a CPU (or a single CPU core).
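    As a trivial illustration of that interaction problem, consider counting how many values pass a test when thousands of threads share one counter - a minimal sketch (the kernel is made up for illustration):

    Code:
    #include <cuda_runtime.h>

    __global__ void count_hits(const float *values, int n, int *hits)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && values[i] > 0.5f) {
            // *hits += 1;        // reads fine serially, but races in parallel
            atomicAdd(hits, 1);   // correct, at the cost of contention on *hits
        }
    }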

    Quote Originally Posted by MutantJohn View Post
    I don't want to have to use OpenCL. I don't like the fact that my GPU kernels have to be written as strings and that any compilation bugs are only found at run-time. But that's just something I guess I need to get used to.
    It would also be reasonable to expect that to be addressed as OpenCL and its associated development environments mature - there is a lot of investment going toward that.

    Quote Originally Posted by MutantJohn View Post
    I think CUDA is friendlier for C programmers, but that's because I heard it's really just a C API, in which case, go nVidia; it's pretty sexy.
    In the near term (a small number of years) CUDA will probably still be "more friendly". It takes less effort for nVidia to target their own hardware platforms than it does for a consortium to target multiple hardware platforms, as is the case for OpenCL. CUDA also comes with trade-offs, such as being stuck on nVidia hardware. In the long run, since standards-based frameworks and hardware agnosticism make system development and sustainment easier, OpenCL is the way to bet. But CUDA will continue to appeal to folks who are happy to limit their software to one vendor's hardware in exchange for being able to squeeze out maximum performance (or exploit other features) on that hardware.

    CUDA actually uses a C++ compiler, so not all C code will compile. OpenCL is a development framework that deliberately seeks to be hardware agnostic. The trade-off with that is it doesn't exploit features that are specific to any hardware family, but is more likely to run reliably across hardware families.
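    A small example of the "not all C code will compile" point - the implicit conversion from void* that C allows is an error under the C++ front end used for .cu files:

    Code:
    #include <stdlib.h>

    int *make_buffer(size_t n)
    {
        /* int *p = malloc(n * sizeof *p); */     /* valid C, error in C++ */
        int *p = (int *)malloc(n * sizeof *p);    /* the cast keeps C++ happy */
        return p;
    }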

    Quote Originally Posted by MutantJohn View Post
    But aside from these hardware specifics, should we expect to see much of a programming paradigm shift? Like, what differences should we see if there are any?
    Apart from the need to reason better about parallelism, and the need to support it explicitly, I'd suggest few changes of paradigm. Maturation of development and host environments is more likely (but such maturation doesn't usually introduce new paradigms - it is more likely to refine how existing paradigms are supported).


    Also, bear in mind CPU/GPU/APU are not the only classes of hardware capable of (being programmed for) processing. There are also FPGAs, CPLDs, and other classes of hardware - each with advantages and disadvantages.
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  3. #3
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    I suspect that in the future we will see a unified CPU/GPU architecture, which makes sense since it makes it easier for compilers to schedule instructions appropriate to each of the two "approaches." But we're not there yet. I'd be interested in seeing some research on this, but I haven't seen any yet, nor any news of it. But surely, it must be coming. We're already seeing unified memory architectures which allow both the CPU and the GPU to read and write the same "virtual memory" (no need to copy stuff to the GPU). This sounds to me like GPUs are going to get things like the memory prefetchers that are common in CPUs.
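    As a minimal sketch of what that unified memory looks like in CUDA today (assuming a card and toolkit recent enough to support managed allocations), one pointer is visible to both the CPU and the GPU with no explicit copy:

    Code:
    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void doubler(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1024;
        float *data = NULL;
        cudaMallocManaged(&data, n * sizeof(float)); // visible to host and device

        for (int i = 0; i < n; ++i) data[i] = (float)i;   // CPU writes
        doubler<<<(n + 255) / 256, 256>>>(data, n);        // GPU reads/writes
        cudaDeviceSynchronize();                           // wait before CPU reads
        printf("%f\n", data[10]);                          // expect 20.0

        cudaFree(data);
        return 0;
    }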

    I don't know how it is with APUs, but with GPUs, even with high parallelism, there was still a big bottleneck between the memory and the GPU - there just was not enough bandwidth to hide the latency, so certain cores stall waiting for memory. That is precisely why modern CPUs speculate. But anyway, we'll probably also start to see more push-button compilation for both the CPU and the GPU, letting us write in our favorite language and automatically compile executable code for both. Microsoft kicked this off with its C++ AMP effort. Basically you write C++ code which, with a simple press of a button, can run on both the GPU and the CPU (with a few restrictions, of course). But the real deal here is that we're talking about real C++ code - not some dumbed-down subset.

    I don't know how much farther it has progressed, though. Last time I tried it, it crashed my graphics drivers every time I ran it, so it should be considered alpha, I guess. I also imagine we'll see more of a shift towards higher abstraction for GPUs. When did we last write desktop programs in C? Then why should we have to write in C for our GPUs? Today's GPUs are beasts, so they can easily handle higher-level languages for anything that does not require absolute maximum performance (i.e. anything but heavy computations and games). Again, C++ AMP is a good example of this. I don't really keep tabs on other GPU languages, though. But we're also seeing abstractions written directly into APIs and frameworks - such as Microsoft's Metro environment.

    There you can simply call APIs to do animations, which Windows will most likely run on the GPU. I expect we'll see more of this in the future. Again, at the far end of the future, I imagine we'll see a completely integrated CPU/GPU with a unified instruction set and unified compilers which take your source code and figure out which instructions to run with which "approach." That would make it no different from today's compilers, which generate CPU assembly instructions. Again, we'll see, but it's still far off.
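    The same "one source, both targets" idea already exists in a small way in plain CUDA - a __host__ __device__ function is compiled for both processors from a single definition. A rough sketch (made-up names, nothing AMP-specific):

    Code:
    #include <cuda_runtime.h>
    #include <cstdio>

    __host__ __device__ float lerp(float a, float b, float t)
    {
        return a + t * (b - a);   // one definition, two compilation targets
    }

    __global__ void lerp_kernel(float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = lerp(0.0f, 100.0f, i / (float)n);  // runs on the GPU
    }

    int main()
    {
        printf("%f\n", lerp(0.0f, 100.0f, 0.25f));   // same function on the CPU

        const int n = 256;
        float *d_out;
        cudaMalloc(&d_out, n * sizeof(float));
        lerp_kernel<<<1, n>>>(d_out, n);
        cudaDeviceSynchronize();
        cudaFree(d_out);
        return 0;
    }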
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  4. #4
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    But aside from these hardware specifics, should we expect to see much of a programming paradigm shift?
    O_o

    You will not see much of a change.

    As grumpy references, we are seeing tools arise which let us reason about massively parallel code using familiar paradigms.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther

  5. #5
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Interesting.

    grumpy and Elysia, you both mentioned GPUs being tasked with "simpler" computations. What do you mean by this? What counts as a simple computation versus a complex one?

    I was reading this paper on bringing Delaunay triangulations to the GPU, and one thing the author mentioned was the use of exact arithmetic for solving the determinants of 4x4 and 5x5 matrices. The routines were written in 1996 and are simple to adapt to CUDA by just putting __device__ in front of everything, and I'm sure that's what the author did. But they also mentioned that the exact kernels required a higher number of "registers", which caused a slowdown in performance. So the author broke the process into a fast kernel that examined the relative floating-point round-off errors and then conditionally launched a smaller number of threads for only the determinants that required exactness. For example, out of 2000 fast-kernel threads, maybe only 50 require the exact kernel.
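    If I understood the paper right, the structure is roughly like this - just a sketch of the pattern; the test and the arithmetic here are stand-ins, not the real insphere/orientation predicates:

    Code:
    #include <cuda_runtime.h>
    #include <math.h>

    // Pass 1: cheap single-precision test; flag results too close to zero
    // for the sign to be trusted.
    __global__ void fast_pass(const float *vals, int n, int *needs_exact)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            needs_exact[i] = (fabsf(vals[i]) < 1e-6f);  // uncertain: defer
    }

    // Pass 2: the register-heavy "exact" version, launched only over the
    // compacted list of flagged indices (e.g. 50 threads instead of 2000).
    __global__ void exact_pass(const float *vals, const int *flagged, int m,
                               double *out)
    {
        int k = blockIdx.x * blockDim.x + threadIdx.x;
        if (k < m)
            out[k] = (double)vals[flagged[k]];  // stand-in for exact arithmetic
    }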

    Are GPUs inherently slow at solving something relatively complex like this simply because each thread has fewer hardware resources?

  6. #6
    C++まいる!Cをこわせ!
    Join Date
    Oct 2007
    Location
    Inside my computer
    Posts
    24,654
    A CPU dedicates a large amount of logic to trying to extract parallelism from serial code and to avoiding stalls in the pipeline. This amounts to speculative execution, out-of-order execution, instruction re-ordering, branch prediction, memory prefetching and things I probably forget. Each core also shares a lot of logic. For example, a CPU may be able to execute 8 instructions at once, but it may only have 4 ALUs (integer units), so if there are 8 instructions at once which try to execute integer operations, 4 of them will be postponed until later. A CPU also has lots of complex instructions and different memory addressing modes (e.g. fetch this address, add 4 to it, and then fetch the value at that location). A lot of the area is dedicated to huge amounts of cache. Only a small amount of the actual CPU is dedicated to logic and the rest is just cache.

    A GPU does not have this. Until recently, GPUs had no cache. A GPU does not really need branch predictors, out-of-order scheduling, etc., because computer graphics is deterministic. It is a ton of math, but not a lot of branches. It also does a lot of memory accesses sequentially, so it is easy to know exactly what the GPU needs. Finally, graphics is just inherently parallelizable. Every pixel can be done individually. Every vertex can be done individually. So the GPU focuses on small processors which are good at floating-point math for a specific memory access pattern. The space a CPU would spend on extra logic and cache is instead dedicated to more processors.

    This is what I mean by simpler computations.
    GPU: Loves parallel computations. Hates computations that have a lot of branches or irregular memory access patterns.
    CPU: Can handle pretty much anything. Does not care if code is branch heavy. Can handle irregular memory access patterns. But it is not very parallel, so things that need a high degree of parallelism suffer greatly.

    Blocks of code that execute serially with few branches and access memory sequentially or with a deterministic stride - where there can be many such blocks executing in parallel with no inter-dependencies - I'd classify as simple. This is where the GPU shines. Remember that because there are so many processors on the GPU (the more, the better, because we can pretty much process all pixels in parallel; at full HD, 1920 x 1080 is over 2 million pixels), each processor only has a small amount of memory and dedicated logic.
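    A small made-up illustration of the difference: the first kernel is what the GPU loves (contiguous accesses, no branches); the second is what it hates (gathered accesses plus a data-dependent branch that splits the warp):

    Code:
    #include <cuda_runtime.h>

    __global__ void good_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i] * 2.0f + 1.0f;          // contiguous, branch-free
    }

    __global__ void bad_kernel(const float *in, const int *idx, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float v = in[idx[i]];                  // irregular (gathered) access
            if (((int)v) % 2 == 0)                 // data-dependent branch:
                out[i] = sqrtf(v);                 // half the warp goes here...
            else
                out[i] = v * v;                    // ...the other half goes here
        }
    }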
    Quote Originally Posted by Adak View Post
    io.h certainly IS included in some modern compilers. It is no longer part of the standard for C, but it is nevertheless, included in the very latest Pelles C versions.
    Quote Originally Posted by Salem View Post
    You mean it's included as a crutch to help ancient programmers limp along without them having to relearn too much.

    Outside of your DOS world, your header file is meaningless.

  7. #7
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    GPUs only become relevant if you want to implement some sort of accelerator in your code that takes advantage of the superior multitasking abilities of the GPU by offloading to it work that is usually done on the CPU. And even then, this is mostly only relevant if your work involves floating-point arithmetic - the area where GPUs shine the most. For the most part this isn't necessary or practical.

    APUs are simply an architectural choice and will have no effect on this type of coding mechanism. Accelerators have been designed for a long time, supported by discrete GPUs, which is the usual layout in our computers. In fact, one could argue against APUs, since the offloading is limited by any other work being done by the GPU part of the APU. Meaning you probably don't want to offload to the GPU part of an APU during a gaming session. Whereas discrete GPUs can be chained to the point of handling intensive rendering scenes and still leaving room for an accelerator.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  8. #8
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Interesting. This is a good thread for me.

    This also explains why I'm going gaga over GPU programming lately. And in a really good way. OpenCL also seems incredibly easy to learn from knowing CUDA. It's literally (figuratively literal) the same thing with different names.

    Okay, so let me make sure that I understand this correctly,

    GPGPU coding is amazing if you have the following conditions :

    1. Memory accesses are done in a linear fashion (i.e. a contiguous array is being read from or written to). If the array is not contiguous, accesses must be separated by the same constant stride for best performance.

    2. The operations have minimal branching. Even though this seems to be extra true for GPUs, it has long been a well-established idea in high-performance computing.

    3. The operations largely consist of floating point numbers. Not doubles 'cause it's too much for a GPU to handle, right? And if accuracy is an issue, there's always an exact method somehow.

    Am I missing anything? I'm really starting to like GPGPU computing and I want to continue with it because it's exactly up my alley and is exactly what I need.

  9. #9
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Most (if not all?) modern GPUs already support double precision, and have for a while. As of the NVIDIA GTX series, double-precision performance is roughly half of the single-precision rate, which is still faster than on a CPU. While cards from major vendors do emphasize single-precision computing, double precision isn't so far off in performance that it can't be used in most situations where it is needed. CUDA does require a flag to be set in the compiler for the kernel not to silently convert your doubles to floats. I'm not sure of the state of OpenCL support, but I'd wager not even a flag will be necessary. Meanwhile the Tesla series fully supports double-precision computing at the same speed as single precision (one instruction per cycle, per core). So if you plan to code for those cards, you aren't restricted in any way. Similarly, if you need to work on large amounts of double-precision values, then you'll necessarily need to move to the Tesla series (not sure about AMD's answer... FirePro?)
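    For reference, the kernel code itself is nothing special - the catch is only in the build settings. As far as I recall, without targeting a high enough architecture (compute capability 1.3 or later, e.g. nvcc -arch=sm_13), nvcc demotes doubles to floats with just a warning. A minimal sketch:

    Code:
    #include <cuda_runtime.h>

    // Plain double-precision kernel; the source is identical to the float
    // version apart from the type.
    __global__ void scale(double *x, double a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= a;
    }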
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  10. #10
    Unregistered User Yarin's Avatar
    Join Date
    Jul 2007
    Posts
    2,158
    Quote Originally Posted by MutantJohn View Post
    literally (figuratively literal)
    Kill it, kill it with fire.

  11. #11
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    Yarin, I'm assuming you're talking about figuratively killing this code I'm gonna write, right? Right then.

    And thank you, Mario. And you too, Elysia! Learning all this GPU stuff is really fun.

  12. #12
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Quote Originally Posted by MutantJohn View Post
    Interesting. This is a good thread for me.

    This also explains why I'm going gaga over GPU programming lately. And in a really good way. OpenCL also seems incredibly easy to learn from knowing CUDA. It's literally (figuratively literal) the same thing with different names.

    Okay, so let me make sure that I understand this correctly,

    GPGPU coding is amazing if you have the following conditions :

    1. Memory accesses are done in a linear fashion (i.e. a contiguous array is being read from or written to). If the array is not contiguous, accesses must be separated by the same constant stride for best performance.

    2. The operations have minimal branching. Even though this seems to be extra true for GPUs, it has long been a well-established idea in high-performance computing.

    3. The operations largely consist of floating point numbers. Not doubles 'cause it's too much for a GPU to handle, right? And if accuracy is an issue, there's always an exact method somehow.

    Am I missing anything? I'm really starting to like GPGPU computing and I want to continue with it because it's exactly up my alley and is exactly what I need.
    Another major deciding point is how long it will take to transfer the data to VRAM and back, vs just doing the computations on the CPU.

    For example, if you need to add 2 large arrays together element-wise, it fits all the conditions you listed, but it's a bad idea to send it over to the GPU, because in the time it takes for the CPU (or the DMA engine) to transfer all the data to the GPU and back, the CPU could have done the calculations itself.

    You only want to use the GPU if you need to do a significant amount of calculation per byte transferred.
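    If you want to see this for yourself, time the copies against the kernel with CUDA events - for an element-wise add, the transfers dominate. Something along these lines (a quick sketch, not tuned):

    Code:
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    __global__ void vec_add(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 24;                       // ~16M floats per array
        size_t bytes = n * sizeof(float);

        float *h_a = (float *)malloc(bytes), *h_b = (float *)malloc(bytes),
              *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);

        cudaEvent_t t0, t1, t2, t3;
        cudaEventCreate(&t0); cudaEventCreate(&t1);
        cudaEventCreate(&t2); cudaEventCreate(&t3);

        cudaEventRecord(t0);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);   // host -> device
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(t1);
        vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // the actual work
        cudaEventRecord(t2);
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);   // device -> host
        cudaEventRecord(t3);
        cudaEventSynchronize(t3);

        float ms_in, ms_kernel, ms_out;
        cudaEventElapsedTime(&ms_in, t0, t1);
        cudaEventElapsedTime(&ms_kernel, t1, t2);
        cudaEventElapsedTime(&ms_out, t2, t3);
        printf("copy in: %.2f ms  kernel: %.2f ms  copy out: %.2f ms\n",
               ms_in, ms_kernel, ms_out);

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }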

  13. #13
    Unregistered User Yarin's Avatar
    Join Date
    Jul 2007
    Posts
    2,158
    Quote Originally Posted by MutantJohn View Post
    Yarin, I'm assuming you're talking about figuratively killing this code I'm gonna write, right? Right then.
    I was talking about your using "literal" to mean the exact fking opposite thing.

  14. #14
    Registered User MutantJohn's Avatar
    Join Date
    Feb 2013
    Posts
    2,665
    But please don't actually try to kill me with fire, Yarin. That wouldn't be very becoming.

  15. #15
    Master Apprentice phantomotap's Avatar
    Join Date
    Jan 2008
    Posts
    5,108
    But please don't actually try to kill me with fire, Yarin. That wouldn't be very becoming.
    O_o

    Keep in mind that at no point did Yarin say he was going to literally kill you with fire.

    Soma
    “Salem Was Wrong!” -- Pedant Necromancer
    “Four isn't random!” -- Gibbering Mouther
