Thread: Endianness and bit shifting

  1. #16
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by synthetix View Post
    I don't know if this matters, but I am working with data from a SAN with probably up to around 500MB/sec throughput (4gbit fibre channel) and processing many thousands of files. Simple code tweaks may save hours of processing time, which is why I'm trying to do this as efficiently as possible.

    If the endian swap is the way to go, then that works for me. I guess if it's in the "wrong" format, then I have no choice.

    Thanks for all your help!
    I seriously doubt that the byte swapping will take longer than the conversion:

    BBBBBB00 GGGGBBBB RRGGGGGG RRRRRRRR --> BBBBBBBB GGGGGGBB RRRRGGGG 00RRRRRR

    An alternative is to shift the data on a machine whose native endianness matches the endianness of the data, or alter the code which generates the image data to generate it with an endianness which matches what the consumer expects.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  2. #17
    Registered User
    Join Date
    Jun 2009
    Posts
    101
    Quote Originally Posted by brewbuck View Post
    An alternative is to shift the data on a machine whose native endianness matches the endianness of the data, or alter the code which generates the image data to generate it with an endianness which matches what the consumer expects.
    Ok, I may try that. I have a G5 machine running SuSE which is 64 bit capable and I believe is big endian, so I could try compiling this on that machine. I know the G5/PowerPC 970 has fast vector processing but I think that only applies to float values. Even so, it's only 2.3GHz per CPU, so even if it's faster without conversion I could probably just throw more hardware at the problem on a little-endian machine (like an Intel Core 2 based machine, etc).

  3. #18
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    I doubt the G5 will ever come close to the speed of the Xeons, even with the extra work for the Xeons.

    Since you have a total of 8 cores on that Xeon, you can multithread your code to make it run almost 8 times as fast (this sounds like an easily parallelized problem). That would be quite a bit of work, though, so make sure you really need that speed first.

  4. #19
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    I can get 409MB/s on a crap AMD laptop in -O3.
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    #define N (256*1024*1024)
    
    int swap_end(int i) {
    	int rtn;
    	unsigned char *rtn_ptr = &rtn;
    	unsigned char *i_ptr = &i;
    	
    	rtn_ptr[0] = i_ptr[3];
    	rtn_ptr[1] = i_ptr[2];
    	rtn_ptr[2] = i_ptr[1];
    	rtn_ptr[3] = i_ptr[0];
    
    	return rtn;
    }
    
    int main() {
    	int *arr = malloc(N*sizeof(int));
    
    	srand(time(0));
    
    	int i, r;
    	for (i = 0; i < N; ++i) {
    		arr[i] = rand();
    	}
    
    	int sum = 0;
    
    	time_t t_start = time(0);
    	
    	for (r = 0; r < 10; ++r) {
    		for (i = 0; i < N; ++i) {
    			sum += swap_end(arr[i]);
    		}
    	}
    
    	printf("Time: %ld Speed: %ld MB/s\n", time(0) - t_start, (10*1024) / (time(0) - t_start));
    
    	printf("Sum: %d\n", sum); //so sum won't be optimized out
    
    }
    cyberfish@cyberfish-tablet:~$ gcc -fno-strict-aliasing -O3 a.c
    a.c: In function ‘swap_end’:
    a.c:8: warning: initialization from incompatible pointer type
    a.c:9: warning: initialization from incompatible pointer type
    cyberfish@cyberfish-tablet:~$ ./a.out
    Time: 25 Speed: 409 MB/s
    Sum: -282253132
    -fno-strict-aliasing needed because swap_end accesses an int through unsigned char *.

    Wouldn't be surprised if you can get much higher than that on your Xeons.

  5. #20
    Kernel hacker
    Join Date
    Jul 2007
    Location
    Farncombe, Surrey, England
    Posts
    15,677
    Using the bswap instruction, I get about 10x performance improvement:
    Code:
    C:\tmp>gcc -O3 -DCSWAP bswap.c
    bswap.c: In function 'swap_end':
    bswap.c:9: warning: initialization from incompatible pointer type
    bswap.c:10: warning: initialization from incompatible pointer type
    
    C:\tmp>a
    Time: 29 Speed: 353 MB/s
    Sum: 576716800
    
    C:\tmp>a
    Time: 30 Speed: 341 MB/s
    Sum: 576716800
    
    C:\tmp>gcc -O3 bswap.c
    
    C:\tmp>a
    Time: 3 Speed: 3413 MB/s
    Sum: 576716800
    This is using gcc-mignw 3.4.5

    Change to the code is:
    Code:
    #ifdef CSWAP
    int swap_end(int i) {
    	int rtn;
    	unsigned char *rtn_ptr = &rtn;
    	unsigned char *i_ptr = &i;
    	
    	rtn_ptr[0] = i_ptr[3];
    	rtn_ptr[1] = i_ptr[2];
    	rtn_ptr[2] = i_ptr[1];
    	rtn_ptr[3] = i_ptr[0];
    
    	return rtn;
    }
    #else
    int swap_end(int i)
    {
      __asm__ __volatile__("bswap %0": "+r"(i));
      return i;
    }
    #endif
    And I commented out the srand part, to allow it do get the SAME random numbers, so that sum is the same each run - that way I can say with some certainty that it's doing the same thing for both tasks.

    The actual code generated changes from:
    Code:
    	movb	-17(%ebp), %al
    	movb	%al, -24(%ebp)
    	movb	-18(%ebp), %al
    	movb	%al, -23(%ebp)
    	movb	(%ebx), %al
    	movb	%al, -22(%ebp)
    	movb	-20(%ebp), %al
    	movb	%al, (%ecx)
    new code:
    Code:
    	bswap %eax
    Using a few more registers would probably allow a bit more overlap between read/writes, but that would probably still make it slower than the bswap instruction. And it would certainly not help the rest of the code around it...

    One slight problem is that different compilers will need different inline assembler syntax, so it won't be portable.


    --
    Mats
    Compilers can produce warnings - make the compiler programmers happy: Use them!
    Please don't PM me for help - and no, I don't do help over instant messengers.

  6. #21
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Wow that's amazing!

    Guess nothing beats hardware implementations. That was the case for the BSR thread a while ago, too, but the difference wasn't this big.

    [edit]There is another "solution" - htonl()[/edit]
    Last edited by cyberfish; 06-06-2009 at 05:38 AM.

  7. #22
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    On my machine (32-bit x86 Linux, GCC 4.3.3, Pentium Dual-Core),

    "Naive" -
    Time: 25 Speed: 409 MB/s
    Sum: 946287732

    bswap -
    Time: 4 Speed: 2560 MB/s
    Sum: 946287732

    htonl -
    Time: 3 Speed: 3413 MB/s
    Sum: 946287732
    (4 and 3 is probably just experimental error)
    Which leads me to believe that htonl is implemented using bswap, so it could be a more portable alternative. GCC doesn't appear to have an intrinsic for bswap.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    
    #include <arpa/inet.h>
    
    #define N (256*1024*1024)
    
    #if 0
    
    int swap_end(int i) {
    	int rtn;
    	unsigned char *rtn_ptr = &rtn;
    	unsigned char *i_ptr = &i;
    	
    	rtn_ptr[0] = i_ptr[3];
    	rtn_ptr[1] = i_ptr[2];
    	rtn_ptr[2] = i_ptr[1];
    	rtn_ptr[3] = i_ptr[0];
    
    	return rtn;
    }
    
    #endif
    
    #if 0
    
    int swap_end(int i) {
      __asm__ __volatile__("bswap %0": "+r"(i));
      return i;
    }
    
    #endif
    
    #if 1
    
    int swap_end(int i) {
    	return htonl(i);
    }
    
    #endif
    
    int main() {
    	int *arr = malloc(N*sizeof(int));
    
    	srand(1);
    
    	int i, r;
    	for (i = 0; i < N; ++i) {
    		arr[i] = rand();
    	}
    
    	int sum = 0;
    
    	time_t t_start = time(0);
    	
    	for (r = 0; r < 10; ++r) {
    		for (i = 0; i < N; ++i) {
    			sum += swap_end(arr[i]);
    		}
    	}
    
    	printf("Time: %ld Speed: %ld MB/s\n", time(0) - t_start, (10*1024) / (time(0) - t_start));
    
    	printf("Sum: %d\n", sum); //so sum won't be optimized out
    
    }

  8. #23
    Registered User
    Join Date
    Jun 2009
    Posts
    101
    Wow, I had no idea you guys would be so helpful. Thanks so much!

    I do, however, have more questions now:

    1. Is bswap() an ANSI C function? If so, which header would I include? Cyberfish mentioned "hardware implementation," so I assume this function is written as assembly code? I read up on this and I assume this is a function to do exactly what I need, swap endian order as quickly as possible.

    2. In terms of multithreading, this app may be a good candidate because I could assign one file to each thread and process 8 files simultaneously using 8 cores. I found some basic code on Wikipedia using functions from pthread.h, and it seemed pretty straightforward. I test compiled this code and it seemed to work OK (spawned 3 threads and CPU usage was 300%!). I'm sure there are tricks to optimizing it, though.

    3. Regarding the assembler syntax, if I were sticking with GCC across x86/64 platforms (would be either Mac OS X/x86 or Linux/x86), would I be OK or would each unique CPU have its own requirements (quad core Xeon vs. Core 2 duo, for example).

    Thanks again, guys. I have already learned a ton!

  9. #24
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    bswap is an assembly instruction. That's what I meant by "hardware implementation". There is no ANSI C function that do this. To use it in C, you can use the code matsp wrote
    Code:
    int swap_end(int i) {
      __asm__ __volatile__("bswap %0": "+r"(i));
      return i;
    }
    .

    Or, like I pointed out above, the htonl() (host to network byte order) function (#include <arpa/inet.h>) commonly used in network programming does exaclty this, too. Judging by the performance, it's probably implemented on x86 using bswap. It has an added benefit that, if run on a big-endian machine, it won't do anything.

    3. Regarding the assembler syntax, if I were sticking with GCC across x86/64 platforms (would be either Mac OS X/x86 or Linux/x86), would I be OK or would each unique CPU have its own requirements (quad core Xeon vs. Core 2 duo, for example).
    As long as you are sticking with GCC, it should work with all x86 CPUs.

    Bottom line is, I recommend htonl(). It's similar in performance to bswap, and is portable (to all CPUs, including big-endian).

  10. #25
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    As for threading, I would make sure you really need it first. Mulththreading introduces all kinds of hard-to-find bugs (that may only happen once every 1000 runs) if you are not careful and experienced. The time used to debug multithreaded code can easily exceed the run time it saves.

  11. #26
    Registered User
    Join Date
    Jun 2009
    Posts
    101
    Quote Originally Posted by cyberfish View Post
    bswap is an assembly instruction. That's what I meant by "hardware implementation".
    Okay, so basically __asm__ is a way of passing assembly code to GCC like so:
    __asm__(<assembly code goes here>);
    The lets GCC know it's assembly and to compile it as such. Is this right?

    Or, like I pointed out above, the htonl() (host to network byte order) function (#include <arpa/inet.h>) commonly used in network programming does exaclty this, too.
    I added this to my code and it works perfectly!

  12. #27
    Registered User
    Join Date
    Jun 2009
    Posts
    101
    Quote Originally Posted by cyberfish View Post
    As for threading, I would make sure you really need it first. Mulththreading introduces all kinds of hard-to-find bugs (that may only happen once every 1000 runs) if you are not careful and experienced. The time used to debug multithreaded code can easily exceed the run time it saves.
    If the code is processing faster than the disk can read, then yeah -- multithreading is probably not necessary. Frankly, I'm not doing any real calculations at this point, I'm simply moving data around. I've tested this on a local RAID which pushes about 300MB/sec r/w and it's _very_ fast.

    Thanks for the help cyberfish!

  13. #28
    Registered User
    Join Date
    Dec 2006
    Location
    Canada
    Posts
    3,229
    Okay, so basically __asm__ is a way of passing assembly code to GCC like so:
    __asm__(<assembly code goes here>);
    The lets GCC know it's assembly and to compile it as such. Is this right?
    Yes, it's called inline assembly.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Float =>memory Representation
    By ganesh bala in forum Tech Board
    Replies: 3
    Last Post: 02-06-2009, 04:06 AM
  2. Formatting 64 bit integers
    By Dino in forum C++ Programming
    Replies: 6
    Last Post: 02-05-2008, 08:24 PM
  3. endianness and bit shifting
    By Dino in forum Tech Board
    Replies: 7
    Last Post: 01-30-2008, 11:53 PM
  4. endianness
    By DavidP in forum Tech Board
    Replies: 4
    Last Post: 06-13-2004, 04:44 PM