Thread: Swap Byte

  1. #1
    Registered User
    Join Date
    Feb 2003
    Posts
    175

    Swap Byte

    Hello,

    Help needed to optimize following program.

    Code:
    	int num,i, byte =0;
    
    	printf(" Enter the number\n");
    	scanf("%d",&num);
    	printf("Number is %x\n",num);
    	
    	i = 0;
    	
    	do
    	{
    		if (num%2)
    		{
    			byte = byte << 1;
    			byte |= 0x01;
    		}
    		else
    			byte = byte << 1;
    
    		num = num>>1; 
    		i++;
    	} while(i&7);
    	printf("Reversed number = %x and %d \n",byte, i);
    }
    Thanks,
    Tiger

  2. #2
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    For simple optimization, you want to avoid division and mod. Both are considered slow.

    I'm on the phone now, so I'll add a bit more later.

    [edit]
    Additionally, you want to unrole or avoid loops completely. They're also considered slower than non-loops.
    [/edit]

    Quzah.
    Last edited by quzah; 02-04-2003 at 09:05 PM.
    Hope is the first step on the road to disappointment.

  3. #3
    Registered User
    Join Date
    Feb 2003
    Posts
    175
    What i am looking at is,

    1. Loop takes 8 iterations, even if most significant bits are 0. Any logic that avoid unncessary iteration.

    2. There are no division in above program. Any replcacement for mod is greately apppreciated.

    Let me know.

    tiger

  4. #4
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Originally posted by Roaring_Tiger
    2. There are no division in above program. Any replcacement for mod is greately apppreciated.
    Yes there is. Look at your first check:

    if (num%2)

    Right there.

    There are many ways to optimize this. You can just hard code the swaps. You can use a loop with simple bit shifting. You can create a union. Many differnt things.

    Since this is fun, I'll show you the unionized method:
    Code:
    union byteu
    {
        unsigned char byte;
        struct 
        {
            unsigned char u0:1;
            unsigned char u1:1;
            unsigned char u2:1;
            unsigned char u3:1;
            unsigned char u4:1;
            unsigned char u5:1;
            unsigned char u6:1;
            unsigned char u7:1;
        } sbyte;
    };
    
    union byteu abyte, bbyte;
    
    abyte.byte = 123;
    
    bbyte.sbyte.u7 = abyte.sbyte.u0;
    bbyte.sbyte.u6 = abyte.sbyte.u1;
    bbyte.sbyte.u5 = abyte.sbyte.u2;
    bbyte.sbyte.u4 = abyte.sbyte.u3;
    bbyte.sbyte.u3 = abyte.sbyte.u4;
    bbyte.sbyte.u2 = abyte.sbyte.u5;
    bbyte.sbyte.u1 = abyte.sbyte.u6;
    bbyte.sbyte.u0 = abyte.sbyte.u7;
    
    abyte.byte = bbyte.byte;
    Vola. Swapped with 9 assignments.

    Quzah.
    Hope is the first step on the road to disappointment.

  5. #5
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Originally posted by Roaring_Tiger
    1. Loop takes 8 iterations, even if most significant bits are 0. Any logic that avoid unncessary iteration.
    2. There are no division in above program. Any replcacement for mod is greately apppreciated.
    A lookup table and other techniques are suggested in the following thread. Addressing your two questions, is something like the following what you had in mind?
    Code:
    #include <stdio.h>
    #include <limits.h>
    
    unsigned char foo(unsigned char value)
    {
        unsigned char mask = 1 << (CHAR_BIT - 1), result = 0;
        while(value) /* skip most significant bits that are zero */
        {
            if(value & 1) /* replace mod (machine dependency) */
            {
                result |= mask;
            }
            mask  >>= 1;
            value >>= 1;
        }
        return result;
    }
    
    int main(void)
    {
        unsigned char value = 5;
        printf("value = %X, reversed = %X\n", value, foo(value));
        return 0;
    }
    
    /* my output
    value = 5, reversed = A0
    */

  6. #6
    Registered User
    Join Date
    Feb 2003
    Posts
    175
    Thanks
    Last edited by Roaring_Tiger; 02-05-2003 at 08:28 PM.

  7. #7
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708

    Thumbs up good idea

    Originally posted by quzah
    Yes there is. Look at your first check:

    if (num%2)

    Right there.

    There are many ways to optimize this. You can just hard code the swaps. You can use a loop with simple bit shifting. You can create a union. Many differnt things.

    Since this is fun, I'll show you the unionized method:
    Code:
    union byteu
    {
        unsigned char byte;
        struct 
        {
            unsigned char u0:1;
            unsigned char u1:1;
            unsigned char u2:1;
            unsigned char u3:1;
            unsigned char u4:1;
            unsigned char u5:1;
            unsigned char u6:1;
            unsigned char u7:1;
        } sbyte;
    };
    
    union byteu abyte, bbyte;
    
    abyte.byte = 123;
    
    bbyte.sbyte.u7 = abyte.sbyte.u0;
    bbyte.sbyte.u6 = abyte.sbyte.u1;
    bbyte.sbyte.u5 = abyte.sbyte.u2;
    bbyte.sbyte.u4 = abyte.sbyte.u3;
    bbyte.sbyte.u3 = abyte.sbyte.u4;
    bbyte.sbyte.u2 = abyte.sbyte.u5;
    bbyte.sbyte.u1 = abyte.sbyte.u6;
    bbyte.sbyte.u0 = abyte.sbyte.u7;
    
    abyte.byte = bbyte.byte;
    Vola. Swapped with 9 assignments.

    Quzah.

    Kudos, Quzah.

    That was a great solution.
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  8. #8
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    I rarely find the use of bitfields to be an optimization when it comes to speed because what appears to be a simple assignment is often implmented by masking and shifting. For example,
    Code:
       ;
       ;        b.sbyte.u5 = a.sbyte.u2;
       ;
            mov       cl,byte ptr [ebp-1]
            shr       ecx,2
            and       ecx,1
            and       cl,1
            shl       ecx,5
            mov       dl,byte ptr [eax]
            and       dl,-33
            or        cl,dl
            mov       byte ptr [eax],cl
    Bitfields may pack data more tightly, but this size benefit may be taken at the cost of speed. Some exceptions might be on architectures that actually have bit-addressable memory in which an assignment would be an assignment.

    I had run a little comparison, which is less than perfect, but is as follows.
    Code:
    #include <stdio.h>
    #include <limits.h>
    #include <time.h>
    
    unsigned char foo(unsigned char value)
    {
        unsigned char mask = 1 << (CHAR_BIT - 1), result = 0;
        while ( value ) /* skip most significant bits that are zero */
        {
            if ( value & 1 ) /* replace mod (machine dependency) */
            {
                result |= mask;
            }
            mask  >>= 1;
            value >>= 1;
        }
        return result;
    }
    
    unsigned char bar(unsigned char num)
    {
        unsigned char byte = 0;
        int i = 0;
        do
        {
            if ( num % 2 )
            {
                byte = byte << 1;
                byte |= 0x01;
            }
            else
                byte = byte << 1;
    
            num = num >> 1;
            i++;
        } while ( i & 7 );
        return byte;
    }
    
    unsigned char baz(unsigned char value)
    {
        union byteu
        {
            unsigned char byte;
            struct
            {
                unsigned char u0:1;
                unsigned char u1:1;
                unsigned char u2:1;
                unsigned char u3:1;
                unsigned char u4:1;
                unsigned char u5:1;
                unsigned char u6:1;
                unsigned char u7:1;
            } sbyte;
        } a, b;
    
        a.byte = value;
    
        b.sbyte.u7 = a.sbyte.u0;
        b.sbyte.u6 = a.sbyte.u1;
        b.sbyte.u5 = a.sbyte.u2;
        b.sbyte.u4 = a.sbyte.u3;
        b.sbyte.u3 = a.sbyte.u4;
        b.sbyte.u2 = a.sbyte.u5;
        b.sbyte.u1 = a.sbyte.u6;
        b.sbyte.u0 = a.sbyte.u7;
    
        return b.byte;
    }
    
    #define CYCLES 100000000
    
    int main(void)
    {
        const char *name[] = { "foo", "bar", "baz" };
        unsigned char (*const function[])(unsigned char) = { foo, bar, baz };
        unsigned char value = 5;
        size_t i,j;
        printf("foo(%X) = %X\n", value, foo(value));
        printf("bar(%X) = %X\n", value, bar(value));
        printf("baz(%X) = %X\n", value, baz(value));
        fflush(stdout);
        for ( i = 0; i < sizeof(function)/sizeof(*function); ++i )
        {
            clock_t end, start = clock();
            for ( j = 0; j < CYCLES; ++j )
            {
                function [ i ] (j);
            }
            end = clock();
            printf("%s = %f\n", name [ i ], (end - start) / CLOCKS_PER_SEC);
            fflush(stdout);
        }
    
        return 0;
    }
    Here are the results I obtained.
    Code:
    foo(5) = A0
    bar(5) = A0
    baz(5) = A0
    foo = 9.614000
    bar = 12.738000
    baz = 17.766000
    If optimizing for speed was the goal, this bitfields implementation was 39% slower than the original.

    [OT]
    >Vola.
    Perhaps you mean voilą.
    [/OT]

  9. #9
    Registered User Cela's Avatar
    Join Date
    Jan 2003
    Posts
    362
    Code:
    union byteu
    {
        unsigned char byte;
        struct
        {
            unsigned char u0:1;
            unsigned char u1:1;
            unsigned char u2:1;
            unsigned char u3:1;
            unsigned char u4:1;
            unsigned char u5:1;
            unsigned char u6:1;
            unsigned char u7:1;
        } sbyte;
    };
    Just an aside, but bitfields have to be either signed or unsigned int, or _Bool for C99 :-)
    *Cela*

  10. #10
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Originally posted by Dave_Sinkula

    [snipped example]
    If optimizing for speed was the goal, this bitfields implementation was 39% slower than the original.
    You have made MSVC++6 very very unhappy. It compiles but generates a runtime error for me.

    Originally posted by Dave_Sinkula

    [OT]
    >Vola.
    Perhaps you mean voilą.
    [/OT]
    Hm. Yep. I didn't think it had an I in it.

    Quzah.
    Hope is the first step on the road to disappointment.

  11. #11
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Originally posted by quzah
    You have made MSVC++6 very very unhappy. It compiles but generates a runtime error for me.
    So it does.

    [/EDIT]
    I had gotten used to Borland, which has CLOCKS_PER_SEC like this.
    Code:
    #define CLOCKS_PER_SEC 1000.0
    I found that MSVC does it differently.
    Code:
    #define CLOCKS_PER_SEC 1000
    So I shouldn't be trusting that CLOCKS_PER_TICK will trigger an implicit conversion to floating point in code such as this.
    Code:
            printf("%s = %f\n", name[i], (end - start) / CLOCKS_PER_SEC);
    It needs a cast to ensure portability.
    [/EDIT]

    And once that has been fixed it shows that MSVC seems to do a much better job than Borland optimizing with bitfields.
    Code:
    foo(5) = A0
    bar(5) = A0
    baz(5) = A0
    foo = 45.004000
    bar = 43.082000
    baz = 16.133000
    Last edited by Dave_Sinkula; 02-06-2003 at 04:40 PM.

  12. #12
    Guest Sebastiani's Avatar
    Join Date
    Aug 2001
    Location
    Waterloo, Texas
    Posts
    5,708
    Quzahs solution completes the task in less than a dozen operations - all of them assignment, none are comparisons either. That is what makes this solution superior. You have comparisons for loop bounds, lots of arithametic and so forth, which make for a more intensive program and hence, longer code segments, more register accesses, long math, etc.
    Code:
    #include <cmath>
    #include <complex>
    bool euler_flip(bool value)
    {
        return std::pow
        (
            std::complex<float>(std::exp(1.0)), 
            std::complex<float>(0, 1) 
            * std::complex<float>(std::atan(1.0)
            *(1 << (value + 2)))
        ).real() < 0;
    }

  13. #13
    Just Lurking Dave_Sinkula's Avatar
    Join Date
    Oct 2002
    Posts
    5,005
    Originally posted by Sebastiani
    Quzahs solution completes the task in less than a dozen operations - all of them assignment, none are comparisons either. That is what makes this solution superior. You have comparisons for loop bounds, lots of arithametic and so forth, which make for a more intensive program and hence, longer code segments, more register accesses, long math, etc.
    I guess you missed a point I was trying to make: this solution may be "superior" on one implementation, and inferior on another (as in the example I gave).

    Two specific questions Roaring_Tiger asked were how to avoid unnecessary loop iterations and how to avoid a division. These were the points I addressed. Since quzah already showed an alternate approach, I tried to answer Roaring_Tiger's question as posted -- and offered a comparison to quzah's code. Note that in the implementation I tested, the loop with comparisons was faster than the bitfields. Surely this is a quality of implementation issue, but definitely something worth mention.

    If you really want to trim things down to minimal operations, why ignore the lookup table approach found in the link I posted? There you would have only one assignment. Or using similarly unrolled loops, the comparisons and explicit shift operations may be faster than bitfield assignment. On not. Again, I find it relevant to at least mention this.

    Generalizing that certain C contructs will always generate optimum code is not a wise thing to do.

  14. #14
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    This is my change to function 2 to use a lookup table. It scored horribly depending on the implementation:
    Code:
    unsigned char bar(unsigned char num)
    {
        unsigned char table[256] =
        {
            ... table here ...
        };
        return table[num];
    }
    Using a global variable or declaring the array as static gives nearly a 2x performance boost over my initial example.

    Using a non-global and non-static array, my initial example has roughly a 3x performance boost over the lookup table.

    Quzah.
    Hope is the first step on the road to disappointment.

  15. #15
    Registered User alex's Avatar
    Join Date
    Sep 2001
    Posts
    132

    More possibilities...

    Hi all!

    Just one question: what happened to the "standard" solution for this problem?
    Code:
    unsigned char quux(unsigned char value)
    {
       value = ((value&0xaa)>>1) | ((value&0x55)<<1);
       value = ((value&0xcc)>>2) | ((value&0x33)<<2);
       return (value>>4) | (value<<4);
    }
    I found this one still a bit faster:
    Code:
    unsigned char quuux(unsigned char value)
    {
       static unsigned char tab[16]=
             {0x0,0x8,0x4,0xc,0x2,0xa,0x6,0xe,
              0x1,0x9,0x5,0xd,0x3,0xb,0x7,0xf};
       return (tab[value&0xf]<<4) | (tab[value>>4]);
    }
    I got (on a 1000MHz Athlon):
    Code:
    foo(5) = A0
    bar(5) = A0
    baz(5) = A0
    quux(5) = A0
    quuux(5) = A0
    baseline(5) = 0
    foo = 9.210000
    bar = 9.980000
    baz = 3.340000
    quux = 1.950000
    quuux = 1.730000
    baseline = 1.130000
    Where the function baseline is obviously wrong, but shows the cost of the function call itself:
    Code:
    unsigned char baseline(unsigned char value)
    {
       return 0;
    }
    If I change the inner loop to call the function quux directly, instead of indirectly, it runs in 0.6 seconds:
    Code:
            volatile int result;
            clock_t end, start = clock();
            for ( j = 0; j < CYCLES; ++j )
            {
               result=quuux(j);
            }
            end = clock();
    Where the variable result is marked volatile, to make sure that the compiler really calls the function quuux... (it automatically inlined the function and then optimized it out alltogether)

    Have fun,
    alex

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. brace-enclosed error
    By jdc18 in forum C++ Programming
    Replies: 53
    Last Post: 05-03-2007, 05:49 PM
  2. About aes
    By gumit in forum C Programming
    Replies: 13
    Last Post: 10-24-2006, 03:42 PM
  3. Need some help regarding data structures
    By Afrinux in forum C Programming
    Replies: 15
    Last Post: 01-28-2006, 05:19 AM
  4. error: identifier "byte" is undefined.
    By Hulag in forum C++ Programming
    Replies: 4
    Last Post: 12-10-2003, 05:46 PM
  5. sorting a structure of arrays
    By Unregistered in forum C Programming
    Replies: 7
    Last Post: 03-15-2002, 11:45 AM