Endianness and bit shifting

**brewbuck** · 06-05-2009

Originally Posted by synthetix

I don't know if this matters, but I am working with data from a SAN with probably up to around 500MB/sec throughput (4gbit fibre channel) and processing many thousands of files. Simple code tweaks may save hours of processing time, which is why I'm trying to do this as efficiently as possible.

If the endian swap is the way to go, then that works for me. I guess if it's in the "wrong" format, then I have no choice.

Thanks for all your help!

I seriously doubt that the byte swapping will take longer than the conversion:

BBBBBB00 GGGGBBBB RRGGGGGG RRRRRRRR --> BBBBBBBB GGGGGGBB RRRRGGGG 00RRRRRR

An alternative is to shift the data on a machine whose native endianness matches the endianness of the data, or alter the code which generates the image data to generate it with an endianness which matches what the consumer expects.
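To make the swap-then-shift idea concrete, here is a minimal sketch. It assumes each pixel is one big-endian 32-bit word holding 10-bit R, G and B fields followed by 2 padding bits, which is how the left-hand side of the diagram reads once its bytes are reversed. The names `rgb10` and `unpack_rgb10_be` are made up for illustration and do not come from the thread.

```c
#include <stdint.h>

/* Hypothetical container for one unpacked pixel (10 bits per channel). */
struct rgb10 {
    uint16_t r, g, b;
};

/*
 * Assumed layout, most significant bit first:
 *   RRRRRRRR RRGGGGGG GGGGBBBB BBBBBB00
 * 'bytes' points at the four bytes exactly as they sit in the file.
 */
static struct rgb10 unpack_rgb10_be(const unsigned char *bytes)
{
    /* Assemble the word from big-endian bytes; this works on any host. */
    uint32_t w = ((uint32_t)bytes[0] << 24) |
                 ((uint32_t)bytes[1] << 16) |
                 ((uint32_t)bytes[2] << 8)  |
                  (uint32_t)bytes[3];
    struct rgb10 px;

    px.r = (uint16_t)((w >> 22) & 0x3FF);
    px.g = (uint16_t)((w >> 12) & 0x3FF);
    px.b = (uint16_t)((w >> 2)  & 0x3FF);
    return px;
}
```

Building the word byte by byte sidesteps the host's endianness entirely, at the cost of four byte loads per pixel; whether that beats a word load plus a swap is something to measure.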

**synthetix** · 06-05-2009

Originally Posted by brewbuck

An alternative is to shift the data on a machine whose native endianness matches the endianness of the data, or alter the code which generates the image data to generate it with an endianness which matches what the consumer expects.

Ok, I may try that. I have a G5 machine running SuSE which is 64 bit capable and I believe is big endian, so I could try compiling this on that machine. I know the G5/PowerPC 970 has fast vector processing but I think that only applies to float values. Even so, it's only 2.3GHz per CPU, so even if it's faster without conversion I could probably just throw more hardware at the problem on a little-endian machine (like an Intel Core 2 based machine, etc).

**cyberfish** · 06-05-2009

I doubt the G5 will ever come close to the speed of the Xeons, even with the extra work for the Xeons.

Since you have a total of 8 cores on that Xeon, you can multithread your code to make it run almost 8 times as fast (this sounds like an easily parallelized problem). That would be quite a bit of work, though, so make sure you really need that speed first.

**cyberfish** · 06-05-2009

I can get 409 MB/s on a crap AMD laptop with -O3.

Code:

#include <stdio.h>
#include <stdlib.h>

#define N (256*1024*1024)

int swap_end(int i) {
	int rtn;
	unsigned char *rtn_ptr = &rtn;
	unsigned char *i_ptr = &i;
	
	rtn_ptr[0] = i_ptr[3];
	rtn_ptr[1] = i_ptr[2];
	rtn_ptr[2] = i_ptr[1];
	rtn_ptr[3] = i_ptr[0];

	return rtn;
}

int main() {
	int *arr = malloc(N*sizeof(int));

	srand(time(0));

	int i, r;
	for (i = 0; i < N; ++i) {
		arr[i] = rand();
	}

	int sum = 0;

	time_t t_start = time(0);
	
	for (r = 0; r < 10; ++r) {
		for (i = 0; i < N; ++i) {
			sum += swap_end(arr[i]);
		}
	}

	printf("Time: %ld Speed: %ld MB/s\n", time(0) - t_start, (10*1024) / (time(0) - t_start));

	printf("Sum: %d\n", sum); //so sum won't be optimized out

}

cyberfish@cyberfish-tablet:~$ gcc -fno-strict-aliasing -O3 a.c
a.c: In function ‘swap_end’:
a.c:8: warning: initialization from incompatible pointer type
a.c:9: warning: initialization from incompatible pointer type
cyberfish@cyberfish-tablet:~$ ./a.out
Time: 25 Speed: 409 MB/s
Sum: -282253132

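I passed -fno-strict-aliasing because swap_end accesses an int through an unsigned char *, although access through a character type is exempt from the strict aliasing rule, so the flag should not actually be required. The warnings come from the missing (unsigned char *) casts.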

Wouldn't be surprised if you can get much higher than that on your Xeons.
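Since time() only counts whole seconds, short runs of a benchmark like this are hard to compare. A possible refinement, sketched here under the assumption of a POSIX system with clock_gettime(), is to time the loop with the monotonic clock. The helper names `read_clock`, `elapsed_seconds` and `mb_per_second` are invented for this sketch.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime() under strict -std modes */
#include <time.h>

/* Read the monotonic clock; returns 0 on success, like clock_gettime(). */
static int read_clock(struct timespec *ts)
{
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Seconds elapsed between two timespec readings, as a double. */
static double elapsed_seconds(const struct timespec *start,
                              const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) +
           (double)(end->tv_nsec - start->tv_nsec) / 1e9;
}

/* Throughput in MB/s, with 1 MB = 1024*1024 bytes as in the benchmark. */
static double mb_per_second(double bytes, double seconds)
{
    return seconds > 0.0 ? bytes / (1024.0 * 1024.0) / seconds : 0.0;
}
```

Call read_clock() once before the loop and once after, then pass both readings to elapsed_seconds(). On older glibc versions clock_gettime() may need -lrt at link time.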

**matsp** · 06-06-2009

Using the bswap instruction, I get about a 10x performance improvement:

Code:

C:\tmp>gcc -O3 -DCSWAP bswap.c
bswap.c: In function 'swap_end':
bswap.c:9: warning: initialization from incompatible pointer type
bswap.c:10: warning: initialization from incompatible pointer type

C:\tmp>a
Time: 29 Speed: 353 MB/s
Sum: 576716800

C:\tmp>a
Time: 30 Speed: 341 MB/s
Sum: 576716800

C:\tmp>gcc -O3 bswap.c

C:\tmp>a
Time: 3 Speed: 3413 MB/s
Sum: 576716800

This is using gcc-mingw 3.4.5.

Change to the code is:

Code:

#ifdef CSWAP
int swap_end(int i) {
	int rtn;
	unsigned char *rtn_ptr = &rtn;
	unsigned char *i_ptr = &i;
	
	rtn_ptr[0] = i_ptr[3];
	rtn_ptr[1] = i_ptr[2];
	rtn_ptr[2] = i_ptr[1];
	rtn_ptr[3] = i_ptr[0];

	return rtn;
}
#else
int swap_end(int i)
{
  __asm__ __volatile__("bswap %0": "+r"(i));
  return i;
}
#endif

And I commented out the srand part, to allow it to get the SAME random numbers, so that sum is the same each run - that way I can say with some certainty that it's doing the same thing for both tasks.

The actual code generated changes from:

Code:

	movb	-17(%ebp), %al
	movb	%al, -24(%ebp)
	movb	-18(%ebp), %al
	movb	%al, -23(%ebp)
	movb	(%ebx), %al
	movb	%al, -22(%ebp)
	movb	-20(%ebp), %al
	movb	%al, (%ecx)

new code:

Code:

	bswap %eax

Using a few more registers would probably allow a bit more overlap between read/writes, but that would probably still make it slower than the bswap instruction. And it would certainly not help the rest of the code around it...

One slight problem is that different compilers will need different inline assembler syntax, so it won't be portable.

--
Mats
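For the portability concern, one compiler-neutral option is to write the swap with shifts and masks on an unsigned type. This is a minimal sketch, not something benchmarked in the thread; the name `swap32_portable` is made up here. Optimizing compilers can often recognize this pattern and emit a single bswap, but that is worth checking in the generated assembly rather than assuming.

```c
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value using only shifts and masks. */
static uint32_t swap32_portable(uint32_t x)
{
    return  (x >> 24)
         | ((x >> 8)  & 0x0000FF00u)
         | ((x << 8)  & 0x00FF0000u)
         |  (x << 24);
}
```

Using uint32_t rather than int avoids implementation-defined behaviour when right-shifting negative values.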

**cyberfish** · 06-06-2009

Wow that's amazing!

Guess nothing beats hardware implementations. That was the case for the BSR thread a while ago, too, but the difference wasn't this big.

Edit: There is another "solution": htonl().

**cyberfish** · 06-06-2009

On my machine (32-bit x86 Linux, GCC 4.3.3, Pentium Dual-Core),

"Naive" -
Time: 25 Speed: 409 MB/s
Sum: 946287732

bswap -
Time: 4 Speed: 2560 MB/s
Sum: 946287732

htonl -
Time: 3 Speed: 3413 MB/s
Sum: 946287732

(The difference between 4 and 3 seconds is probably just measurement error, since time() only has one-second resolution.)
Which leads me to believe that htonl is implemented using bswap, so it could be a more portable alternative. GCC 4.3 and later also provide a built-in for this, __builtin_bswap32().

Code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#include <arpa/inet.h>

#define N (256*1024*1024)

#if 0

int swap_end(int i) {
	int rtn;
	unsigned char *rtn_ptr = &rtn;
	unsigned char *i_ptr = &i;
	
	rtn_ptr[0] = i_ptr[3];
	rtn_ptr[1] = i_ptr[2];
	rtn_ptr[2] = i_ptr[1];
	rtn_ptr[3] = i_ptr[0];

	return rtn;
}

#endif

#if 0

int swap_end(int i) {
  __asm__ __volatile__("bswap %0": "+r"(i));
  return i;
}

#endif

#if 1

int swap_end(int i) {
	return htonl(i);
}

#endif

int main() {
	int *arr = malloc(N*sizeof(int));

	srand(1);

	int i, r;
	for (i = 0; i < N; ++i) {
		arr[i] = rand();
	}

	int sum = 0;

	time_t t_start = time(0);
	
	for (r = 0; r < 10; ++r) {
		for (i = 0; i < N; ++i) {
			sum += swap_end(arr[i]);
		}
	}

	printf("Time: %ld Speed: %ld MB/s\n", time(0) - t_start, (10*1024) / (time(0) - t_start));

	printf("Sum: %d\n", sum); //so sum won't be optimized out

}
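On the question of a compiler intrinsic: GCC 4.3 and later document a `__builtin_bswap32` built-in, which could be dropped into the benchmark as a fourth variant. The sketch below is an assumption about how one might wrap it, with a shift-and-mask fallback for other compilers; `swap32_builtin` is a made-up name and this variant was not timed in the thread.

```c
#include <stdint.h>

/* Byte-swap a 32-bit value, using the GCC built-in when it is available. */
static uint32_t swap32_builtin(uint32_t x)
{
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3))
    /* GCC built-in; on x86 this typically compiles to one bswap. */
    return __builtin_bswap32(x);
#else
    /* Fallback for compilers without the built-in. */
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
#endif
}
```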

**synthetix** · 06-07-2009

Wow, I had no idea you guys would be so helpful. Thanks so much!

I do, however, have more questions now:

1. Is bswap() an ANSI C function? If so, which header would I include? Cyberfish mentioned "hardware implementation," so I assume this function is written as assembly code? I read up on this and I assume this is a function to do exactly what I need, swap endian order as quickly as possible.

2. In terms of multithreading, this app may be a good candidate because I could assign one file to each thread and process 8 files simultaneously using 8 cores. I found some basic code on Wikipedia using functions from pthread.h, and it seemed pretty straightforward. I test compiled this code and it seemed to work OK (spawned 3 threads and CPU usage was 300%!). I'm sure there are tricks to optimizing it, though.

3. Regarding the assembler syntax, if I were sticking with GCC across x86/64 platforms (would be either Mac OS X/x86 or Linux/x86), would I be OK or would each unique CPU have its own requirements (quad core Xeon vs. Core 2 duo, for example).

Thanks again, guys. I have already learned a ton!

**cyberfish** · 06-07-2009

bswap is an assembly instruction. That's what I meant by "hardware implementation". There is no ANSI C function that does this. To use it in C, you can use the code matsp wrote:

Code:

int swap_end(int i) {
  __asm__ __volatile__("bswap %0": "+r"(i));
  return i;
}

Or, like I pointed out above, the htonl() (host to network byte order) function (#include <arpa/inet.h>) commonly used in network programming does exactly this, too. Judging by the performance, it's probably implemented on x86 using bswap. It has the added benefit that, if run on a big-endian machine, it won't do anything.

3. Regarding the assembler syntax, if I were sticking with GCC across x86/64 platforms (would be either Mac OS X/x86 or Linux/x86), would I be OK or would each unique CPU have its own requirements (quad core Xeon vs. Core 2 duo, for example).

As long as you are sticking with GCC, it should work with all x86 CPUs.

Bottom line is, I recommend htonl(). It's similar in performance to bswap, and is portable (to all CPUs, including big-endian).
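As a rough illustration of that recommendation, a whole buffer of big-endian words could be converted in place like this. It uses ntohl(), the network-to-host counterpart of htonl(), since the data is coming in from big-endian storage; the function name `words_be_to_host` is hypothetical.

```c
#include <arpa/inet.h>  /* ntohl(); POSIX header */
#include <stddef.h>
#include <stdint.h>

/* Convert 'count' big-endian 32-bit words to host byte order, in place. */
static void words_be_to_host(uint32_t *words, size_t count)
{
    size_t i;

    for (i = 0; i < count; ++i)
        words[i] = ntohl(words[i]);
}
```

On a big-endian host the loop leaves the data unchanged, so the same source works on both kinds of machine.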

**cyberfish** · 06-07-2009

As for threading, I would make sure you really need it first. Multithreading introduces all kinds of hard-to-find bugs (that may only happen once every 1000 runs) if you are not careful and experienced. The time used to debug multithreaded code can easily exceed the run time it saves.
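If threading does turn out to be worth it, the one-file-per-thread idea could look roughly like the sketch below. It is only an outline under several assumptions: each "file" is represented by an in-memory buffer of big-endian 32-bit words, the `swap_job` struct, `MAX_JOBS` cap and function names are invented for illustration, and no file I/O or error reporting is shown. Because each thread touches only its own buffer, no locking is needed.

```c
#include <arpa/inet.h>  /* ntohl() */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_JOBS 8  /* hypothetical cap: one thread per core on an 8-core box */

/* Hypothetical unit of work: one file's worth of big-endian 32-bit words. */
struct swap_job {
    uint32_t *words;
    size_t    count;
};

/* Thread entry point: converts one job's buffer to host order in place. */
static void *swap_worker(void *arg)
{
    struct swap_job *job = arg;
    size_t i;

    for (i = 0; i < job->count; ++i)
        job->words[i] = ntohl(job->words[i]);
    return NULL;
}

/*
 * Run one thread per job and wait for all of them.
 * Returns 0 on success, -1 if njobs is too large or a thread failed to start.
 */
static int run_swap_jobs(struct swap_job *jobs, size_t njobs)
{
    pthread_t threads[MAX_JOBS];
    size_t started, i;
    int status = 0;

    if (njobs > MAX_JOBS)
        return -1;

    for (started = 0; started < njobs; ++started) {
        if (pthread_create(&threads[started], NULL,
                           swap_worker, &jobs[started]) != 0) {
            status = -1;
            break;
        }
    }

    /* Join whatever was started, even after a failure. */
    for (i = 0; i < started; ++i)
        pthread_join(threads[i], NULL);

    return status;
}
```

Build with `gcc -pthread`. If the disks cannot deliver data faster than one core can swap it, the extra threads will mostly sit waiting on I/O.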

**synthetix** · 06-07-2009

Originally Posted by cyberfish

bswap is an assembly instruction. That's what I meant by "hardware implementation".

Okay, so basically __asm__ is a way of passing assembly code to GCC like so:
__asm__(<assembly code goes here>);
This lets GCC know it's assembly and to compile it as such. Is this right?

Or, like I pointed out above, the htonl() (host to network byte order) function (#include <arpa/inet.h>) commonly used in network programming does exactly this, too.

I added this to my code and it works perfectly!

**synthetix** · 06-07-2009

Originally Posted by cyberfish

As for threading, I would make sure you really need it first. Multithreading introduces all kinds of hard-to-find bugs (that may only happen once every 1000 runs) if you are not careful and experienced. The time used to debug multithreaded code can easily exceed the run time it saves.

If the code is processing faster than the disk can read, then yeah -- multithreading is probably not necessary. Frankly, I'm not doing any real calculations at this point, I'm simply moving data around. I've tested this on a local RAID which pushes about 300MB/sec r/w and it's _very_ fast.

Thanks for the help cyberfish!

**cyberfish** · 06-07-2009

Okay, so basically __asm__ is a way of passing assembly code to GCC like so:
__asm__(<assembly code goes here>);
This lets GCC know it's assembly and to compile it as such. Is this right?

Yes, it's called inline assembly.
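To spell out the extended form used earlier in the thread: the string is the instruction template, and the part after the colon lists operands with constraints. The sketch below is one way to guard it so the file still builds on non-x86 targets or non-GCC compilers; the name `swap32_asm` and the fallback branch are additions for illustration, not code from the thread.

```c
#include <stdint.h>

/* Byte-swap via inline assembly on x86 with GCC syntax, shifts elsewhere. */
static uint32_t swap32_asm(uint32_t x)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    /* "+r": x is kept in a general-purpose register, read and written. */
    __asm__("bswap %0" : "+r"(x));
    return x;
#else
    /* Non-x86 target or compiler without GCC-style asm: plain C fallback. */
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
#endif
}
```

The bswap instruction has been part of x86 since the 486, so the same asm line should serve a quad-core Xeon and a Core 2 Duo alike.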
