Thread: GCC -j

  1. #16
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by MK27 View Post
    Seems to me that popular opinion is against enabling hyper-threading on multi-core systems, though.
    Then why does Intel waste their time designing 10-core Xeons with hyperthreading (thus 20 logical cores) and why do companies waste thousands of dollars buying them? Hyperthreading requires a more intelligent scheduler but there's nothing wrong with it.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  2. #17
    (?<!re)tired Mario F.'s Avatar
    Join Date
    May 2006
    Location
    Ireland
    Posts
    8,446
    Quote Originally Posted by MK27 View Post
    Seems to me that popular opinion is against enabling hyper-threading on multi-core systems, though.
    Yeah. There have been a lot of misconceptions and false claims about HT and what it does, especially to single-threaded applications -- ST applications that are still at the core of much of what we use today. There's still the false belief circulating that HT can slow down ST applications. And this rumor has been perpetuated especially by the gaming community, where false rumors tend to spread like plague on a drifting ship.

    It's just not true. Besides, it's been a long road from when HT was introduced on single-core CPUs (when there was in fact an overhead for ST applications) to today. Not only has the technology been refined, but operating systems everywhere have been taking real advantage of it, coding their schedulers so applications make better use of it.

    The truth is simply that HT will make my OS and my MT applications faster. I did a recent upgrade to my main system and don't intend to do anything else until late next year. But my next processor will definitely be HT-ready.
    Originally Posted by brewbuck:
    Reimplementing a large system in another language to get a 25% performance boost is nonsense. It would be cheaper to just get a computer which is 25% faster.

  3. #18
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by brewbuck View Post
    Then why does Intel waste their time designing 10-core Xeons with hyperthreading (thus 20 logical cores) and why do companies waste thousands of dollars buying them?
    Hmm, I was under the impression Intel had discontinued hyperthreading because of benchmarks that showed "complementary" processes (doing completely different things) slightly benefited but "competing" processes (doing the same thing) were significantly penalized. I believe this had to do with cache thrashing.

    However, I notice after googling that it was not so much discontinued as just "absent" from their first dual cores due to their lineage (those designs descended from the Pentium M, which never had HT), and that it has been back in place since 2008, so evidently those issues were not serious and I stand corrected!
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  4. #19
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Quote Originally Posted by brewbuck View Post
    Hyperthreading is way cooler than just fast context switching. Modern superscalar CPUs can have dozens of instructions in flight simultaneously under good conditions. Hyperthreading is a "hack" (though a pretty advanced one) that allows two register contexts to interleave in the instruction decoder to pump the pipelines fuller. For example if one instruction stream is receiving a 200 cycle cache miss, the other hyperthread could step in and execute its instructions during the cycles that would otherwise be wasted.
    That's exactly what happens in a context switch. You switch contexts (registers et al.) so that a different task can run while others wait, whether it's because they need disk I/O or their time slice is up. Basically, context switching allows multiple processes to share a single processor/core by keeping track of the state (context) of a process so it can pick up where it left off. The OS used to do this scheduling exclusively. All hyperthreading does is do that at a hardware level. Sure, the OS sees 2 processors, and thinks that they're totally separate, but secretly, there's only one ALU -- one thing to perform your additions, multiplications, shifts, etc. In a simple example, the OS assigns process A to one virtual processor, and process B to the other to distribute the load. Instead of the OS needing to manage scheduling for processes A and B, they each have their own, dedicated virtual processor. The hyperthreading technology makes sure that both of those tasks get their turn on the one shared ALU, while the OS thinks they're both getting their own full processor. Yes, it does keep the pipelines fuller and helps speed things up, but I don't see it as being that much cooler. Just two sets of registers, one ALU and some hardware context switching. Maybe I'm just jaded.
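
    To make the "pick up where it left off" idea concrete, here's a minimal user-space sketch of a context switch using the POSIX <ucontext.h> API (removed from POSIX.1-2008 but still shipped by glibc). Each swapcontext() saves one set of registers and restores another, which is the same basic trick the OS scheduler performs:
    Code:
    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, task_ctx;

    static void task(void)
    {
        puts("task: running, yielding back to main");
        swapcontext(&task_ctx, &main_ctx);   /* save our state, resume main */
        puts("task: resumed exactly where it left off");
        /* returning here resumes main_ctx via uc_link */
    }

    int main(void)
    {
        static char stack[64 * 1024];

        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp   = stack;
        task_ctx.uc_stack.ss_size = sizeof stack;
        task_ctx.uc_link          = &main_ctx;
        makecontext(&task_ctx, task, 0);

        swapcontext(&main_ctx, &task_ctx);   /* first switch into the task  */
        puts("main: task yielded, switching back");
        swapcontext(&main_ctx, &task_ctx);   /* resume it mid-function      */
        puts("main: done");
        return 0;
    }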


    Quote Originally Posted by MK27 View Post
    I'd guess that you could then often have only half the real processors running, since the kernel treats all 12 as real and will assign based on that. It will not assign tasks to just even numbered processors or something.
    Why would you guess that? I never said the kernel wouldn't use all 12. Plus, the fact that it's hyperthreaded makes it all the more likely that all the ALUs are in use. My point was that a 6-core hyperthreaded CPU can only perform 6 additions/multiplications/shifts/etc. at once. Just because the kernel sees 12 processors/cores doesn't mean there are 12 full processors/cores. On a 6-core, hyperthreaded CPU, there are 12 sets of registers for 12 different threads (2 per core due to hyperthreading), but only one ALU per core. If hyperthreading provided 2 ALUs and 2 sets of registers per core, you would just have a 12-core CPU. Hyperthreading gives you 2 contexts/tasks per core, so while you can have 12 tasks lined up, one per virtual core, only 6 of them can be adding/multiplying/shifting at any instant. If all 12 tasks want to add numbers, then 6 of them have to wait their turn.
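
    On Linux you can actually see this split for yourself: sysfs reports which logical CPUs are hyperthread siblings of the same physical core. A quick sketch (Linux-only, assuming the usual sysfs layout):
    Code:
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        long n = sysconf(_SC_NPROCESSORS_ONLN);   /* logical CPUs the OS sees */
        for (long cpu = 0; cpu < n; cpu++) {
            char path[128], buf[64];
            snprintf(path, sizeof path,
                     "/sys/devices/system/cpu/cpu%ld/topology/thread_siblings_list",
                     cpu);
            FILE *f = fopen(path, "r");
            if (f && fgets(buf, sizeof buf, f))   /* e.g. "0,6\n" */
                printf("logical cpu %ld shares a core with: %s", cpu, buf);
            if (f) fclose(f);
        }
        return 0;
    }
    On a 6-core hyperthreaded part, the 12 logical CPUs pair up: each pair prints the same sibling list.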

  5. #20
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by anduril462 View Post
    That's exactly what happens in a context switch. You switch contexts (registers et al) so that a different task can run while others wait, whether it's because they need disk I/O or their time slice is up. Basically, context switching allows multiple processes to share a single processor/core by keeping track of the state (context) of a process so it can pick up where it left off. The OS used to do this scheduling exclusively. All hyperthreading does is do that at a hardware level.
    Hyperthreading is not context switching. The two hyperthreads execute simultaneously during the same clock cycles. You might have as many as 50 instructions in various stages of execution SIMULTANEOUSLY. Hyperthreading just means that these 50 instructions can come from different threads at the same time. The register context is duplicated; the other CPU resources are not. It's not "really fast context switching," it's literally executing two instruction streams simultaneously. No context is being "switched."

    Analogy time. Suppose there's a woodworking shop. You want to share the shop between two people.

    Policy 1 (Context switching): Only one guy can be in the shop at a time. While he's in there he can use whatever tools he wants, exclusively. Every hour, he switches places with the other guy, who then gets HIS exclusive use of the tools in the shop.

    Policy 2 (Hyperthreading): Both guys can be in the workshop at the same time. They share the tools, so if one is using the hammer, the other guy can't be using the hammer. But if the other guy has some cutting he needs to do he can use the saw while waiting for the hammer.

    You have to remember, modern processors might have 50 or more instructions in various phases of execution SIMULTANEOUSLY. They re-order instructions, they rename registers. They have lots and lots of execution units.

    My point was that a 6 core hyperthreaded CPU can only perform 6 additions/multiplications/shifts/etc at once.
    You don't really think a 2.1 billion transistor CPU design has only ONE adder, ONE multiplier, and ONE shifter on it do you?

    Also remember that memory is slow. REALLY slow. Access to a word which isn't cached might have a latency of 200 cycles or more.
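
    You can see the superscalar effect from plain C. A sketch, assuming nothing beyond a reasonably modern out-of-order x86: splitting one long dependency chain into four independent accumulators lets the core issue several adds in the same cycle, so the unrolled version typically measures noticeably faster even though it does the same number of additions.
    Code:
    #include <stdio.h>
    #include <stddef.h>

    double sum_one_chain(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += a[i];                    /* every add waits on the previous */
        return s;
    }

    double sum_four_chains(const double *a, size_t n)
    {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i];                   /* four independent chains: the   */
            s1 += a[i + 1];               /* core can issue these adds to   */
            s2 += a[i + 2];               /* different execution units in   */
            s3 += a[i + 3];               /* the same cycle                 */
        }
        for (; i < n; i++)                /* mop up any leftover elements   */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }

    int main(void)
    {
        static double a[1 << 20];
        for (size_t i = 0; i < sizeof a / sizeof *a; i++)
            a[i] = 1.0;
        printf("%f %f\n", sum_one_chain(a, 1 << 20), sum_four_chains(a, 1 << 20));
        return 0;
    }
    Compile both at -O2 with auto-vectorization off (-fno-tree-vectorize on gcc) if you want to isolate the effect.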
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  6. #21
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Man, do I hate being wrong, especially when it's because I'm being dumb. I kept seeing "execution unit" and interpreting it as "ALU", forgetting the whole superscalar thing. Kinda hard to do two additions on one ALU, but not so hard when your execution unit has 3 of them! You wouldn't believe the amount of stuff I read and re-read before that clicked. Oh well, learned some new stuff, re-learned some old stuff, and it was more interesting than doing my job, so I guess it's an all-around win. Well, maybe not all-around. My pride did take a bit of a beating.

  7. #22
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by brewbuck View Post
    Policy 2 (Hyperthreading): Both guys can be in the workshop at the same time. They share the tools, so if one is using the hammer, the other guy can't be using the hammer. But if the other guy has some cutting he needs to do he can use the saw while waiting for the hammer.
    Isn't that exactly what I was saying here?

    Quote Originally Posted by MK27 View Post
    benchmarks that showed "complementary" processes (doing completely different things) slightly benefited but "competing" processes (doing the same thing) were significantly penalized. I believe this had to do with cache thrashing.
    Quote Originally Posted by anduril462 View Post
    Just because the kernel sees 12 processors/cores doesn't mean there are 12 full processors/cores. On a 6-core, hyperthreaded CPU, there are 12 sets of registers for 12 different threads (2 per core due to hyperthreading), but there is only one ALU per core. If hyperthreading provided 2 ALUs and two sets of registers per core, you would just have a 12 core CPU. Hyperthreading gives you 2 contexts/tasks per core. That means that, while you can have 12 tasks lined up, one per virtual core, you can still only perform 6 additions/multiplications/shifts/whatever at once. If all 12 tasks want to add numbers, then 6 of them have to wait their turn.
    I'm very ignorant of the details here, but I was assuming those 12 virtual cores are static, 2 per ALU, i.e., if 8 tasks all want to do task A ("division") but the kernel queued them on the first 8 virtual cores (statically associated with ALUs 1-4), they will in the end not accomplish the task faster and may even incur some penalty... maybe this is a complex scheduler responsibility...
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  8. #23
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Quote Originally Posted by MK27 View Post
    I'm very ignorant of the details here, but I was assuming those 12 virtual cores are static, 2 per ALU, i.e., if 8 tasks all want to do task A ("division") but the kernel queued them on the first 8 virtual cores (statically associated with ALUs 1-4), they will in the end not accomplish the task faster and may even incur some penalty... maybe this is a complex scheduler responsibility...
    There are two threads per core in a hyperthreaded processor. Based on all the stuff I read today, I think the penalty you speak of only applies when the two threads have few or no instructions that can be run in parallel. At that point the overhead of HTT, while small, outweighs the advantages (you don't get any benefit when the threads can't actually run in parallel). This seems unlikely in most general cases, but it can be an issue if you have, for example, a complex, multi-threaded scientific program whose threads occupy most of the virtual cores and are very calculation-intensive, using lots of ALU and FPU operations and little I/O. There were also issues with Intel's replay system (used to fix scheduler issues in the old P4 processors), but those are gone in newer processors.

    Again, based on what I read, HTT does provide 2 virtual cores per physical core; however, an execution unit in a modern superscalar processor, like the new Intel Core series, has several ALUs (but not necessarily multiple FPUs or other functional units). HTT and other forms of instruction-level parallelism allow you to use them simultaneously, so long as there are no issues with data dependency and the like.
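
    If you want to test the penalty yourself, here's a hypothetical benchmark sketch (Linux/glibc only; pthread_setaffinity_np is a GNU extension, and the CPU numbers below assume logical CPUs 0 and 6 are siblings on one physical core -- check thread_siblings_list in sysfs for your actual topology). It pins two ALU-bound threads either to the two hyperthreads of one core or to two separate cores, and times each arrangement:
    Code:
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    static void *spin(void *arg)
    {
        (void)arg;
        volatile unsigned long x = 0;
        for (unsigned long i = 0; i < 200000000UL; i++)
            x += i;                          /* pure integer ALU work */
        return NULL;
    }

    static double run_pinned(int cpu_a, int cpu_b)
    {
        pthread_t ta, tb;
        cpu_set_t set;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&ta, NULL, spin, NULL);
        pthread_create(&tb, NULL, spin, NULL);

        /* pinning right after create is fine for a rough sketch */
        CPU_ZERO(&set); CPU_SET(cpu_a, &set);
        pthread_setaffinity_np(ta, sizeof set, &set);
        CPU_ZERO(&set); CPU_SET(cpu_b, &set);
        pthread_setaffinity_np(tb, sizeof set, &set);

        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        /* assumed topology: 0 and 6 share a core, 0 and 1 do not */
        printf("same core (siblings): %.2fs\n", run_pinned(0, 6));
        printf("separate cores:       %.2fs\n", run_pinned(0, 1));
        return 0;
    }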
    Last edited by anduril462; 10-05-2011 at 06:36 PM.

  9. #24
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by MK27 View Post
    I'm very ignorant of the details here, but I was assuming those 12 virtual cores are static, 2 per ALU, i.e., if 8 tasks all want to do task A ("division") but the kernel queued them on the first 8 virtual cores (statically associated with ALUs 1-4), they will in the end not accomplish the task faster and may even incur some penalty... maybe this is a complex scheduler responsibility...
    I was fortunate enough to meet James Reinders a month or two ago and we talked about all this stuff in detail. Moore's law just keeps on going, and the question for Intel is what to do with all those transistors. From what I gather, they are really taking a three-pronged approach:

    1. Use the transistors to add execution resources, registers, and more complex instruction decoders and schedulers, in order to execute as many data-independent instructions as possible simultaneously while supporting the traditional x86 and AMD64 instruction sets.

    2. Use them to add more vector registers and vector ALUs. For instance, on Sandy Bridge they doubled the vector length to 256 bits and added more of them. He says this trend should continue: we should see 512-bit or bigger SIMD vectors, and generally just more SIMD execution units (see the sketch after this list).

    3. Use them to put more cores on the same chip. By "cores" I mean traditional CPUs, as well as die-shrunk, simplified cores like the ones in MIC, and GPU stream processors. In five years you'll probably be seeing chips with hundreds of heterogeneous cores for various purposes. Maybe not on your desktop, but they'll be out there.
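
    To put point 2 in concrete terms, here's a minimal sketch of Sandy Bridge's 256-bit vectors, assuming an AVX-capable CPU and compiler (gcc -mavx): one instruction adds eight floats at once via the intrinsics in <immintrin.h>.
    Code:
    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float r[8];

        __m256 va = _mm256_loadu_ps(a);      /* load 8 floats into a YMM reg */
        __m256 vb = _mm256_loadu_ps(b);
        __m256 vr = _mm256_add_ps(va, vb);   /* one add, eight lanes         */
        _mm256_storeu_ps(r, vr);

        for (int i = 0; i < 8; i++)
            printf("%g ", r[i]);             /* prints 9 eight times */
        putchar('\n');
        return 0;
    }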
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}
