# Most efficent data type to work with?

Printable View

• 07-04-2008
h3ro
Most efficent data type to work with?
Not long ago I bought "Writing great code" 1 and 2 by Randall Hyde. The books are great (at least so far), but maybe a bit over my level, so I have a few questions.

The book talks about how variables are loaded from memory and goes into details about how different CPUs does this. From what I understand a 32bit processor always loads 32bit variables. So if you only need to load a 16 bit variable the cpu still loads 32bit, but only gives you 16.

Would it be more efficient to have a program that packs say 4 chars into a int, when it comes to loading from memory?

The book also talks about arithmetic for both floating point numbers and integers. According to the book (from what I understand) is that most CPUs are optimized for working on only one bit size. So taking a 16bit variable and multiply it by a 16bit variable is slower then using to 32 bit variables, as the cpu first have to convert the two 16bit variables into 32bit for the arithmetic operation.

Does the OS have anything to say? The book is a bit old, so all the examples are under a 32bit OS with a 32bit processor.
What would the best variable size be for a 64bit processor under say XP 32bit?

EDIT:
As a side question, how do I know that my program is causing cache thrashing?

I might have gotten everything wrong, so a bit of insight would be great.

Thanks for reading
• 07-04-2008
zacs7
MIPS is a really good way to learn all of this stuff, plus there are lots of resources. Usually all of the registers will be 32 bit wide on a 32 bit processor, once a 16 bit value is moved into a register it's usually padded with zeros. I'm speaking from a broad MIPS perspective here :)

Consider the following,
Code:

```int x = 5, y = 10, r = 0; r = x + y;```
Might be:
Code:

```.data x:     .word 5 y:     .word 10 r:     .word 0 .text main:     lw \$t0, x        # load word...     lw \$t1, y     add \$t0, \$t0, \$t1     sw \$t0, r        # store word...```
But how would "addi \$t0, \$0, 156" (add immediate) fit into 1 word you ask? (be it 32 bits or whatever). Thus any effort you went to to have a 'optimum' data type would go to waste (as shifting/padding must occur anyway). Of course 156 would have to be shifted to 32 bits to fit in a register / go in memory. Since,
Code:

`addi (6 bits) \$t0 (5 bits -- 32 registers) \$0 (5 bits -- again, 32 registers) 156 (only have 16 bits to represent immediate value)`
> Would it be more efficient to have a program that packs say 4 chars into a int, when it comes to loading from memory?
Perhaps, but packing and unpacking is most likely going to be slower than going to memory 4 times... Not to mention the assembler will often pack itself (if memory is byte addressable).

I might be wrong... but hey, at least I'm adding more fuel to the fire. I'd suggest playing around with MIPS (ie a simulator MARS is good). Does wonders for learning.
• 07-04-2008
CornedBee
Generally, if you need a general-purpose integer variable, use int. The C++ standard says that the intention is that int is the fastest-to-work-with type of the machine.

How the machine loads data depends. For example, an Alpha always loads 64 bits, and masks as necessary. An x86, on the other hand, really loads a single byte if you request it - at least the early ones did. Perhaps the newer ones don't. In any case any modern CPU only loads from the cache anyway - cache<->cache or cache<->RAM transfers are done by lines, which probably means 16 bytes L1<->L2 and 512 bytes or 1k L2<->RAM.

Is it more efficient to pack bytes up? Depends on the platform. If the L1<->register transfer is done in a single cycle using hardware masking, no. If a load-byte instruction actually expands into load-word+mask (as on the alpha), yes. Do you care? Frankly, you shouldn't. First, prove that that's your bottleneck.

Answering the side note: by using CPU performance counters. All x86 CPUs have some monitors you can set to watch, e.g., the number of instructions executed, the number of CPU cycles used, the number of jump mispredictions or, interesting for you, the number of cache misses at the various levels. Your program is cache trashing if it has a high cache miss-to-instruction ratio.

To give you an example: at my university we wrote a game of life implementation where we used such a tool to analyze the performance. Our results: a 2.5:1 instruction-to-cycle ratio (showing that we made excellent use of the parallelization of instructions even a single CPU core can do) and about a 1:100000 L2 miss ratio (about 10000 cache misses while executing a billion instructions - our main data fit into the cache completely).
• 07-04-2008
VirtualAce
x86 supports byte-addressing but it does not internally perform byte fetches. All fetches are either 32-bit or 64-bit.
• 07-04-2008
h3ro
Quote:

Perhaps, but packing and unpacking is most likely going to be slower than going to memory 4 times... Not to mention the assembler will often pack itself (if memory is byte addressable).
Ah, ok. I was under the impression that getting memory was about the worst thing you could do.

Quote:

But how would "addi \$t0, \$0, 156" (add immediate) fit into 1 word you ask? (be it 32 bits or whatever). Thus any effort you went to to have a 'optimum' data type would go to waste (as shifting/padding must occur anyway). Of course 156 would have to be shifted to 32 bits to fit in a register / go in memory. Since,
Code:

addi (6 bits) \$t0 (5 bits -- 32 registers) \$0 (5 bits -- again, 32 registers) 156 (only have 16 bits to represent immediate value)
Im not sure I follow this 100%

Quote:

How the machine loads data depends. For example, an Alpha always loads 64 bits, and masks as necessary. An x86, on the other hand, really loads a single byte if you request it - at least the early ones did. Perhaps the newer ones don't. In any case any modern CPU only loads from the cache anyway - cache<->cache or cache<->RAM transfers are done by lines, which probably means 16 bytes L1<->L2 and 512 bytes or 1k L2<->RAM.
How would I go about checking how my machine does this?

Quote:

Do you care? Frankly, you shouldn't. First, prove that that's your bottleneck
As of now I just trying to understand what I have been reading, but I see your point.

Code:

`Our results: a 2.5:1 instruction-to-cycle ratio`
Wow, that sounds really good. Are the measuring tools something that comes with the OS, or is it a separate program?

Is it possible to make a rough estimate of how much date you can work on in your function before you start getting cache misses?
• 07-04-2008
CornedBee
Quote:

How would I go about checking how my machine does this?
You read the manufacturer's reference documentation.

Quote:

Are the measuring tools something that comes with the OS, or is it a separate program?
It's a separate program, and I forgot its name. But the program is merely an interface to the functionality built into the CPU.
oprofile apparently is a program that supports these counters.

Quote:

Is it possible to make a rough estimate of how much date you can work on in your function before you start getting cache misses?
Yes. Simply take the size of the cache. If your data is in small chunks and scattered all over memory, you'll get worse results, because the granularity of the cache is the cache line.
• 07-04-2008
zacs7
> Im not sure I follow this 100&#37;
I was just saying sometimes you don't even get to control what you're going to work with (ie there is no other way).