You will get a slowdown. The "number of elements" is set by a command-line parameter.
It simply divides the work across N threads, where N is the number of processors in your system.
Oh, you're right. My bad, I read the output from 'time' wrong XD
Edit:
I hope this doesn't sound too brown nosey, Elysia, but looking at your code is kind of a privilege to me. You're really good at C++. Like shoot, you're really good. I'm learning so much by looking at your code, thank you.
Last edited by MutantJohn; 11-26-2013 at 11:44 AM.
Quote: "You don't have control over cache (the hardware manages it, period)"

We do have a little bit of control over the cache: there are prefetch instructions, instructions that tell the CPU to read from memory but not cache the result (because you know it will only be needed once), and instructions to cache it only in the higher levels (because you know that by the time it is needed again it would have been flushed from L1 anyway even if you cached it there, but you still want it in L2).
GCC supports them through __builtin_prefetch, where the programmer gives "hints" about how much locality the fetched data has (whether it will be needed again later, etc.), and GCC decides how to implement that on the target.
Data Prefetch Support - GNU Project - Free Software Foundation (FSF)
Quote: "For x86 processors, it appears that those data prefetch options only apply to SSE/MMX instructions, not to general-purpose instructions that involve memory accesses."

The prefetch instructions are part of SSE/MMX, but the data they prefetch can be used by any instruction.
Not everything that can be counted counts, and not everything that counts can be counted
- Albert Einstein.
No programming language is perfect. There is not even a single best language; there are only languages well suited or perhaps poorly suited for particular purposes.
- Herbert Mayer
Another issue is that a simple search may be limited by memory bandwidth regardless of cache usage. In that case, sequential or near-sequential accesses to RAM are faster because of the RAS/CAS delays: streaming through memory mostly hits already-open rows, while parallel threads hitting different points in RAM keep triggering the longer delays.