# Copying constant amount of data

1. ## Copying constant amount of data

I don't really know how the CPU copies a number of bytes. Does it take as long to copy a whole word (4 bytes on a 32-bit system) as it takes to copy 4 bytes separately? Which sizes can the CPU copy at a time; can it, for example, copy exactly 3 or 5 bytes at a time? And does every copy take the same time?

Here's an example: if n and m are known to the compiler during compilation, what is the fastest way of copying at least n bytes and at most m bytes from one place in memory to another? In one special case (color channels using SDL), there are 3 bytes which need to be copied. One could think the fastest way is to copy byte by byte. But in this case there's a fourth byte in both the source and the destination (since the video mode is set to 32 bits/pixel), which makes it possible to copy 4 bytes at a time, which I guess should make for a faster copy (3 times faster?). Here n = 3 and m = 4.

The compiler won't be able to optimize this for me, or will it? It probably knows only n, not m.

2. In general, this is a question that is not really C++ related as much as it is CPU related, and as such, there is no single answer to it. If you want to target specific CPUs, then yes, you can squeeze out extra performance by knowing such information. These days, however, I would not underestimate modern compilers and their ability to optimize code.

3. Copying a single 32-bit integer is faster than 4 individual byte copies, yes. However, 32-bit pixels also take more memory than 24-bit pixels, and with enough pixels that affects speed too. When just plain copying a row of pixels between bitmaps, both formats can use a memcpy that may copy multiple bytes at once. So in some cases 24-bit can be faster.
Brief experiments with my own software 3D engine put 24-bit rendering just 15% below 32-bit, speedwise.

You haven't said what you're doing exactly so I can't advise you which way to go at this stage.

4. Originally Posted by TriKri
I don't really know how the CPU copies a number of bytes. Does it take as long to copy a whole word (4 bytes on a 32-bit system) as it takes to copy 4 bytes separately? Which sizes can the CPU copy at a time; can it, for example, copy exactly 3 or 5 bytes at a time? And does every copy take the same time?
A 32-bit x86 CPU can copy 1, 2, or 4 bytes at a time with ordinary move instructions. Which one is used depends on the type of the data: characters are copied 1 byte at a time, shorts 2 bytes at a time, etc. There are also special operations for copying large blocks of data.

In one special case (color channels using SDL), there are 3 bytes which need to be copied. One could think the fastest way is to copy byte by byte. But in this case, there's a fourth byte in both the source and at the destination (since the video mode is put to 32 bits/pixel), which makes it possible to copy 4 bytes at a time, which I guess should make a faster copy (3 times faster?). Here n = 3 and m = 4.
Depends on the function used to do the copy. If you reinterpret_cast the 4 bytes to int, then it should copy all 4 bytes at once. If you use memcpy or similar functions, then it is likely that the compiler will copy one byte at a time, but it may also check whether a multi-byte operation is a better option. You can also use a POD struct for the three bytes, and rely on the compiler to decide whether to implement it with an extra padding byte or not. The padding would make it possible for the compiler to automatically use a 4-byte copy for the struct.

Here's an example: if n and m are known to the compiler during compilation, what is the fastest way of copying at least n bytes from one place in memory to another, and at most m bytes?

The compiler won't be able to optimize this for me, or will it? It probably knows only n, not m.
There is no built-in function to tell the compiler to perform such an operation. You can influence the operation used to copy the data by changing the type of the data. So if you want data to be copied four bytes at a time, you could cast to an int array and make sure that the first element lies on an even address, possibly a multiple of 4 (not sure how x86 handles it).

But the truth is for small n, it doesn't really matter, and for large n, you want to use functions like memcpy that can take advantage of the ability to copy large data blocks.
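The alignment point above can be checked in code. The following is only a sketch, assuming C++11 or later and a platform where converting a pointer to `std::uintptr_t` yields its address (true on mainstream x86 compilers, but implementation-defined in general). The helper names `is_aligned_for` and `bytes_until_aligned` are hypothetical and not from any library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if p sits on an address that is a multiple of alignof(T).
// Assumption: the pointer-to-integer conversion yields the address.
template <class T>
inline bool is_aligned_for(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Number of bytes to skip from p to reach the next address that is
// aligned for T (0 if p is already aligned).
template <class T>
inline std::size_t bytes_until_aligned(const void* p)
{
    const std::uintptr_t address = reinterpret_cast<std::uintptr_t>(p);
    const std::size_t misalignment = address % alignof(T);
    const std::size_t skip = (misalignment == 0) ? 0 : alignof(T) - misalignment;
    assert((address + skip) % alignof(T) == 0);
    return skip;
}
```

A copy routine could use `bytes_until_aligned<std::uint32_t>(dest)` to copy a few leading bytes one at a time before switching to 4-byte copies. On x86 an unaligned 4-byte access usually still works and is merely slower in some cases; other architectures may fault.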

5. Always use memcpy() for copying bytes. This little function is so essential that an incredible amount of optimization effort goes into it. Typically, compilers recognize a call to memcpy and use all the static flow information available to them (and profiling information, if you use profile-guided optimization) to select the best method of copying for that particular CPU. It might decide to copy the data in 16-byte blocks using the SSE load and store instructions, for example. If the particular hardware has a special memory transfer engine, the compiler might decide to call that. Etc, etc.

Trust the compiler.
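As a rough illustration of that advice, here is a sketch of moving one 32-bit value through memcpy with a compile-time constant size. The helper names `load_u32` and `store_u32` are hypothetical. With optimization enabled, mainstream compilers typically turn each call into a single 4-byte move, but that is a quality-of-implementation matter, not something the language guarantees.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reads 4 bytes starting at source into a 32-bit integer.
// Unlike *(int*)source, this imposes no alignment or aliasing requirements.
inline std::uint32_t load_u32(const unsigned char* source)
{
    assert(source != nullptr);
    std::uint32_t value;
    std::memcpy(&value, source, sizeof value);  // size is a compile-time constant
    return value;
}

// Writes the 4 bytes of value starting at dest.
inline void store_u32(unsigned char* dest, std::uint32_t value)
{
    assert(dest != nullptr);
    std::memcpy(dest, &value, sizeof value);
}
```

Copying one 32-bit pixel is then `store_u32(dest, load_u32(source));`, which also works when the pointers are not 4-byte aligned.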

6. Yes, but the compiler cannot know everything. So in this case with SDL, for example, do you mean I should use memcpy(dest, source, 3); instead of *(long*)dest = *(long*)source? The original problem is to copy a specific color, a BPP-byte array containing the values for the different channels, into each pixel of an image. Each pixel also takes up BPP bytes of memory. The thing is that it is allowed to copy BPP bytes (4 in this case), but it only has to copy NC*sizeof(T) bytes (3 in this case; the last byte is unused), where NC is the number of channels and T is the data type holding the value of each channel.

Currently I have this inline member function to do this for me:
Code:
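```cpp
template<class T, uint NC, size_t BPP>
inline void mp_image<T, NC, BPP>::CopyColor(byte *dest, byte *source)
{
    // Note: the casts below assume dest and source are suitably aligned.
    uint i = 0;
    // Copy whole ints while channel data remains and the pixel has room for one.
    while ( i < (NC * sizeof(T))/sizeof(int)*sizeof(int)            ||
           (i <  NC * sizeof(T) && i < BPP/sizeof(int)*sizeof(int)) ) {
        *(int*)dest = *(int*)source;
        dest   += sizeof(int);
        source += sizeof(int);   // the source must advance together with dest
        i      += sizeof(int);
    }
    // Then at most one short...
    if ( i < (NC * sizeof(T))/sizeof(short)*sizeof(short)            ||
        (i <  NC * sizeof(T) && i < BPP/sizeof(short)*sizeof(short)) ) {
        *(short*)dest = *(short*)source;
        dest   += sizeof(short);
        source += sizeof(short);
        i      += sizeof(short);
    }
    // ...and at most one trailing byte.
    if ( i < NC * sizeof(T)) {
        *dest = *source;
        dest++;
        source++;
        i++;
    }
}
```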
I don't know if this is good or just stupid, but in the SDL case this function would simply copy one int. byte here is an integer data type, the same as unsigned char.

7. do you mean I should use memcpy(dest, source, 3); instead of *(long*)dest = *(long*)source
Well, apart from your form copying 8 bytes on my system, you'd simply be lying to the compiler in one case. One of those two copies 3 bytes, the other 4. Yes, the compiler doesn't know it's safe to copy 4 bytes (and thus use a block transfer), but you could trivially supply it with this information by passing 4 to memcpy.

8. Originally Posted by CornedBee
Well, apart from your form copying 8 bytes on my system...
Oops, I meant a 32 bit integer, thought long always was that, but maybe not...

You're right, I should probably just pass BPP to memcpy. I realize now that BPP is a compile-time constant, so the call will probably be optimized for exactly this reason. Thanks!
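A minimal sketch of that idea, assuming the color buffer really holds BPP readable bytes (including the unused one) and does not overlap the image. The free functions `copy_color` and `fill_pixels` are hypothetical stand-ins for the `mp_image` members, which are not shown in full in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies one pixel's worth of bytes. BPP is a template parameter, so the
// size passed to memcpy is a compile-time constant the optimizer can see.
template <std::size_t BPP>
inline void copy_color(unsigned char* dest, const unsigned char* source)
{
    std::memcpy(dest, source, BPP);
}

// Writes the same color into pixel_count consecutive pixels.
// Assumption: color points to at least BPP bytes outside the pixel buffer.
template <std::size_t BPP>
inline void fill_pixels(unsigned char* pixels, std::size_t pixel_count,
                        const unsigned char* color)
{
    assert(pixels != nullptr && color != nullptr);
    for (std::size_t i = 0; i < pixel_count; ++i)
        copy_color<BPP>(pixels + i * BPP, color);
}
```

With BPP = 4, each iteration is a fixed 4-byte copy, which is the case the replies above expect the compiler to handle well.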

9. Oops, I meant a 32 bit integer, thought long always was that, but maybe not...
GCC on a 64-bit Linux system uses 8-byte longs; 64-bit Windows compilers keep long at 4 bytes.
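If exactly 32 bits are wanted, the fixed-width types from `<cstdint>` avoid guessing at the size of long. A small sketch, assuming C++11 or later; the alias `pixel32` and the function `long_bits` are hypothetical names used only for illustration.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Hypothetical alias for one 32-bit pixel; the name is not from SDL.
typedef std::uint32_t pixel32;

// Holds on every platform that provides std::uint32_t at all.
static_assert(sizeof(pixel32) * CHAR_BIT == 32,
              "pixel32 must be exactly 32 bits wide");

// Width of long in bits on the current target (it varies between platforms).
constexpr std::size_t long_bits()
{
    return sizeof(long) * CHAR_BIT;
}
```

Writing `*(pixel32*)dest` instead of `*(long*)dest` then means the same width on every target, although the alignment and aliasing caveats of the cast remain.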

10. Good to know. Is there any way to tell the word size at compile time, i.e. whether you're on a 16-bit, 32-bit, or 64-bit system (though I guess no one uses a 16-bit system these days)? Maybe a compile-time flag?

11. Closest portable method I believe you can use to determine the type of system is to do a sizeof(void *). If you want it in bits, multiply by 8 obviously. This depends upon the compiler, though, more than the system on which you're compiling.

12. I need to do it in the preprocessor to be able to set typedefs, but the preprocessor doesn't recognize the sizeof operator...

13. It does in C, and my understanding is that C++ resolves sizeof() when it can at compile-time and leaves only the ones it can't for run-time. Something like sizeof(void*) should work since it is clearly constant.

Code:
```cpp
#include <iostream>

#define TEH_SIZE sizeof(void *)

int main()
{
    std::cout << "Architecture is " << ((TEH_SIZE)*8) << " bits." << std::endl;
    return 0;
}
```
This produces the following output on a 32-bit Windows XP machine using MinGW:

Code:
`Architecture is 32 bits.`

14. Yes, but sizeof is still not interpreted by the preprocessor. I need typedefs for intw and uintw, which have the same size as a word. What I thought of was something like

Code:
```cpp
#if    (sizeof(void*) == 16)  //16-bit system
typedef   int16_t   intw;
typedef  uint16_t  uintw;
#elif  (sizeof(void*) == 32)  //32-bit system
typedef   int32_t   intw;
typedef  uint32_t  uintw;
#elif  (sizeof(void*) == 64)  //64-bit system
typedef   int64_t   intw;
typedef  uint64_t  uintw;
#endif
```
But this fails.

15. The number of bits in a char is, strictly speaking, able to have values other than 8. Therefore it is appropriate to remove the assumption of 8-bit chars, and express that as:
Code:
```cpp
#include <iostream>
#include <climits>

#define THE_SIZE (sizeof(void *))

int main()
{
    std::cout << "Architecture is " << (THE_SIZE*CHAR_BIT) << " bits." << std::endl;
    return 0;
}
```
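For the typedef question in post 14, one possible sketch is to test a limit macro instead of sizeof, since macros such as `UINTPTR_MAX` from `<cstdint>` are usable in `#if`. This assumes C++11 or later and, like the sizeof(void *) suggestion, measures the pointer width, which is not necessarily the CPU's register width. The names `intw`, `uintw` come from post 14; `word_bits` is a hypothetical addition.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Pick the word typedefs from the range of uintptr_t, which the
// preprocessor can compare because UINTPTR_MAX is an integer macro.
#if UINTPTR_MAX == 0xFFFFu
typedef std::int16_t   intw;
typedef std::uint16_t  uintw;
#elif UINTPTR_MAX == 0xFFFFFFFFu
typedef std::int32_t   intw;
typedef std::uint32_t  uintw;
#elif UINTPTR_MAX == 0xFFFFFFFFFFFFFFFFu
typedef std::int64_t   intw;
typedef std::uint64_t  uintw;
#else
#error "Unsupported pointer width"
#endif

// Assumption: uintptr_t has the same size as void*, as on mainstream targets.
static_assert(sizeof(uintw) == sizeof(void*),
              "uintw should match the pointer size");

// Width in bits, without assuming 8-bit chars.
constexpr std::size_t word_bits = sizeof(uintw) * CHAR_BIT;
```

If only a pointer-sized integer is needed, `std::intptr_t` and `std::uintptr_t` can be used directly and the `#if` chain becomes unnecessary.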
