Thread: efficient bit manipulation of memory

  1. #1
    Registered User
    Join Date
    Feb 2005
    Posts
    46

    efficient bit manipulation of memory

    I have a large chunk of memory that I want to bit-wise complement. Since I want it to be fast, I don't really want to go byte by byte. What's the best way to do this?
    Here's a first implementation using longs, but I want to do a bigger chunk at a time. However I'm worried that doing a larger chunk might be dependent on the word size of the machine.

    Code:
    char *memory = 0xblahblahblah;
    unsigned int size = a zillion;
    
    int i;
    for (i = 0; i < size; i+=sizeof(unsigned long))
    {
       *((unsigned long *)(memory + i)) = ~(*((unsigned long *)(memory + i)));
    }

  2. #2
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    >> I have a large chunk of memory that I want to bit-wise complement. Since I want it to be fast, I don't really want to go byte by byte. What's the best way to do this? Here's a first implementation using longs, but I want to do a bigger chunk at a time. However I'm worried that doing a larger chunk might be dependent on the word size of the machine.

    Well, the problem is that if you try to process a size that isn't an even multiple of an unsigned long you'll write past the end of the buffer. I didn't compile this, but it should work fine:

    Code:
    void* compliment( void* buffer, size_t size )
    {
    	const unsigned long
    		mask = ~0UL;
    	size_t
    		words = size / sizeof( unsigned long ), 
    		bytes = size % sizeof( unsigned long );
    	unsigned long*
    		lptr = ( unsigned long* )buffer;
    	char*
    		cptr = ( ( char* )buffer ) + ( size - bytes );
    	while( words-- )
    		*lptr++ ^= mask;
    	while( bytes-- )
    		*cptr++ ^= mask;
    	return buffer;
    }
    EDIT:
    Whoops, I used assignment instead of XOR!
    Last edited by Sebastiani; 07-27-2009 at 07:00 PM.
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  3. #3
    and the hat of sweating
    Join Date
    Aug 2007
    Location
    Toronto, ON
    Posts
    3,545
    How portable does the code need to be?
    "I am probably the laziest programmer on the planet, a fact with which anyone who has ever seen my code will agree." - esbo, 11/15/2008

    "the internet is a scary place to be thats why i dont use it much." - billet, 03/17/2010

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,656
    Ah Tuesday - that must be premature optimisation day

    > However I'm worried that doing a larger chunk might be dependent on the word size of the machine.
    Yes, and in doing so you'll probably add a bunch of bugs in the process which might take days to find and fix.

    Finish the program, then profile it to find out where the REAL hot-spots are, and then work on those.

    There's an old saying
    "It's easier to make a working program efficient, than it is to make an efficient program work".
    In other words, make it "right" before you make it "fast".
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    >> Finish the program, then profile it to find out where the REAL hot-spots are, and then work on those. [...] In other words, make it "right" before you make it "fast".

    Well said.
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  6. #6
    Registered User
    Join Date
    Feb 2005
    Posts
    46
    Quote Originally Posted by Salem View Post
    Ah Tuesday - that must be premature optimisation day
    Well, I posted it on Monday, so there goes that theorem

    I will finish the program in a simple way first, but I was also asking out of curiosity.

    Also, the program is a memory latch-up test (detecting stuck-at 0/1 faults) for an embedded system, so writing and reading memory is going to be the major source for speedup. I already have a slower version written in another language that I need to speed up. I've made some algorithmic improvements so now I'm looking to squeeze out the last bit of performance.

  7. #7
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,656
    Depends on your architecture (check the ASM to make sure it's doing it right)

    1. Stop casting the pointer, make another one of the right type.

    2. Unroll the loop. Some instruction sets give you small constant displacements for free.
    It means you have 8 times fewer increment, branch and compare instructions which do nothing with the memory.

    Code:
    for ( i = 0 ; i < size ; i += sizeof(unsigned long) * 8 )
    {
      unsigned long *p = (unsigned long*)(memory + i);
      p[0] = ~p[0];
      p[1] = ~p[1];
      p[2] = ~p[2];
      p[3] = ~p[3];
      p[4] = ~p[4];
      p[5] = ~p[5];
      p[6] = ~p[6];
      p[7] = ~p[7];
    }
    But make sure your memory is both long-aligned and a multiple of 8 longs in size.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  8. #8
    Registered User
    Join Date
    Sep 2008
    Location
    Toronto, Canada
    Posts
    1,834
    Code:
    p[0] ^= -1
    might be faster. Less memory accesses since the constant is made part of the instruction. But you'd have to time it to make sure.

    Also, how about long long (or __int64). It's portable as long as the target machine/compiler supports 64-bit integers.

  9. #9
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    Quote Originally Posted by nonoob View Post
    Code:
    p[0] ^= -1
    might be faster. Less memory accesses since the constant is made part of the instruction. But you'd have to time it to make sure.
    Any decent compiler will convert either method to the same code. Anyway, the trick is to process the largest word size possible.

    But take care for alignment. Even if your architectures allows unaligned access, they may be very slow compared to aligned access. So your loop will probably look like:

    Code:
    while( !aligned )
        process bytes until aligned
    while( count >= sizeof( word ) )
        process words
    while( count-- )
        process final bytes
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  10. #10
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    >> But take care for alignment. Even if your architectures allows unaligned access, they may be very slow compared to aligned access.

    Good point!
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  11. #11
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    Ok, so here's an updated version, taking into account Brewbuck's suggestion:

    Code:
    void* compliment( void* buffer, size_t size )
    {
    	typedef unsigned long
    		word_t;
    	size_t
    		unaligned = ( size_t )buffer % sizeof( word_t ),
    		adjusted = size - unaligned;
    	char
    		* bptr = ( char* )buffer, 
    		* bpe = bptr + unaligned;
    	word_t
    		* lptr = ( word_t* )bpe, 
    		* lpe = lptr + ( adjusted / sizeof( word_t ) );
    	char
    		* cptr = ( char* )lpe, 
    		* cpe = cptr + ( adjusted % sizeof( word_t ) );
    	while( bptr != bpe )
    	{
    		*bptr = ~*bptr;
    		++bptr;
    	}	
    	while( lptr != lpe )
    	{
    		*lptr = ~*lptr;
    		++lptr;
    	}	
    	while( cptr != cpe )
    	{
    		*cptr = ~*cptr;
    		++cptr;
    	}	
    	return buffer;
    }
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Mutex and Shared Memory Segment Questions.
    By MadDog in forum Linux Programming
    Replies: 14
    Last Post: 06-20-2010, 04:04 AM
  2. Question about Bit Manipulation.
    By parastar in forum C Programming
    Replies: 5
    Last Post: 01-28-2009, 08:46 AM
  3. Assignment Operator, Memory and Scope
    By SevenThunders in forum C++ Programming
    Replies: 47
    Last Post: 03-31-2008, 06:22 AM
  4. Efficient basic arithmetic algorithms (8 bit)
    By Unregistered in forum C Programming
    Replies: 1
    Last Post: 10-23-2001, 09:56 AM