Thread: Optimizing hot section of code.

  1. #1
    Registered User
    Join Date
    May 2016
    Posts
    104

    Optimizing hot section of code.

    Greetings everybody, I'm trying to improve the performance of my program and according to my profiling tests, the function below alone accounts for the majority of the execution time, it is the biggest bottleneck by far. If y'all have a little time to spare, I'd appreciate some ideas on how to optimize it.

    The function compresses a 64 bit block of data into 32 bits, by using hard-coded values in the substitution boxes. The 8 boxes are defined in a static 3d array in a header file. I tried making them 8 separate arrays and eliminating the loop but that didn't improve anything, the compiler is probably unrolling the loop on its own.
    Originally I had a variable to hold the six bit segment but after testing, it turns out it is faster to compute the value every time, I don't know why. Maybe the compiler ran out of registers and the CPU has to issue a MOV instruction into memory on every iteration. /shrug

    Code:
    #define SEXTET 6
    #define QUARTET 4
    #define SEXTET_MASK 0x3f
    #define SEXTET_SEGMENT ((block >> blck_bit_pos) & SEXTET_MASK) //get six bits from block
    #define ROW ((SEXTET_SEGMENT >> 5) << 1) | (SEXTET_SEGMENT & 0x1) //Row index is formed by joining the first and last bit of the sextet.
    #define COLUMN (SEXTET_SEGMENT >> 1) & 0xf //COLUMN index is formed by the inner four bits of the sextet.
    
    static inline uint32_t compress(uint64_t block)
    {
        unsigned int    byte;
        unsigned int    blck_bit_pos;
        unsigned int    cmpsd_bit_pos;
        register uint32_t        compressed;
    
        blck_bit_pos = (sizeof(block) * BITS_PER_BYTE) - SEXTET;
        cmpsd_bit_pos = (sizeof(compressed) * BITS_PER_BYTE) - QUARTET;
        byte = 0;
        compressed = 0;
        while(byte < BITS_PER_BYTE)
        {
            compressed |= (uint32_t)g_sboxes[byte++][ROW][COLUMN] << cmpsd_bit_pos;
            blck_bit_pos -= SEXTET;
            cmpsd_bit_pos -= QUARTET;
        }
        return (compressed);
    }
    Last edited by Dren; 05-25-2019 at 09:33 PM.
    printf("I'm a %s.\n", strrev("Dren"));

  2. #2
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    37,485
    1. What's in g_sboxes
    2. Post a main() with a couple of test cases.

    > #define ROW ((SEXTET_SEGMENT >> 5) << 1) | (SEXTET_SEGMENT & 0x1) //Row index is formed by joining the first and last bit of the sextet.
    > #define COLUMN (SEXTET_SEGMENT >> 1) & 0xf //COLUMN index is formed by the inner four bits of the sextet.
    So is reordering your g_sboxes is not an option?
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  3. #3
    Registered User
    Join Date
    May 2016
    Posts
    104
    1. What's in g_sboxes
    This.
    So is reordering your g_sboxes is not an option?
    No, the S-boxes were provided by the NSA as they are to protect the algorithm against certain type of attacks. Not that it matters today, as DES has been broken for decades.
    2. Post a main() with a couple of test cases.
    I'm not sure how feasible that will be, as this thing is nested pretty deep down my program and is only a fragment, can't see how to make a valid test case with only this. I will give it a try when I get home and aim for an abridged version, but will probably end up with a lot more code.
    I could also post a link to my git repo which would make it easier for you to try, just git clone && make all.

    I was looking at the assembly output and it looks -to my inexperienced eye- very efficient. Bunch of bitwise and a few mov instructions, nothing costly.
    The issue here is the stupid amount of times the code has to execute. I feel a more effective way to extract the necessary bits of the sextet would make all the difference; alas I can't think of a better way to do it.
    printf("I'm a %s.\n", strrev("Dren"));

  4. #4
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    604
    try putting the initial values of the *_bit_pos variables into "static unsigned int const" variables with "pos" replaced with "set" and then set the *_bit_pos variables to those instead

  5. #5
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    37,485
    I meant you could do something like this, which preserves the content of the s-boxes, but rearranges them to remove the bit-twiddling indices.

    Code:
    // xx0
    s[0][4] = { 1, 2, 3, 4 };
    // xx1
    s[1][4] = { 5, 6, 7, 8 };
    
    // becomes
    s[8] = { 1, 5, 2, 6, 3, 7, 4, 8 };
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  6. #6
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    604
    Try this:
    Code:
    #define SEXTET 6
    #define QUARTET 4
    // Added these just to make the declaration clearer
    #define BITS_PER_UINT32 (sizeof(uint32_t) * BITS_PER_BYTE)
    #define BITS_PER_UINT64 (sizeof(uint64_t) * BITS_PER_BYTE)
    #define SEXTET_MASK 0x3f
    #define SEXTET_SEGMENT ((block >> blck_bit_pos) & SEXTET_MASK) //get six bits from block
    #define ROW ((sextet_segment >> 5) << 1) | (sextet_segment & 0x1) //Row index is formed by joining the first and last bit of the sextet.
    #define COLUMN (sextet_segment >> 1) & 0xf //COLUMN index is formed by the inner four bits of the sextet.
     
    static inline uint32_t compress(uint64_t block)
    {
       // Let the compiler work out when to fill these, it may have a more optimal method than setting them later
        unsigned int    pos = 0;
        unsigned int    blck_bit_pos = BITS_PER_UINT64 - SEXTET;
        unsigned int    cmpsd_bit_pos = BITS_PER_UINT32 - QUARTET;
        unsigned int		sextet_segment; // << Storage variable, removes repeated shifts
        register uint32_t compressed = 0;
        for ( ; pos < BITS_PER_BYTE; ++pos,
    			blck_bit_pos -= SEXTET, cmpsd_bit_pos -= QUARTET )
        {
    				sextet_segment = (unsigned int)SEXTET_SEGMENT;
            compressed |= (uint32_t)g_sboxes[pos][ROW][COLUMN] << cmpsd_bit_pos;
        }
        return (compressed);
    }
    At the very least it WILL shave some of that time it was using

  7. #7
    Registered User
    Join Date
    May 2016
    Posts
    104
    try putting the initial values of the *_bit_pos variables into "static unsigned int const" variables with "pos" replaced with "set" and then set the *_bit_pos variables to those instead
    Integral constants are resolved at compile time even without optimizations on, so the above expressions end up being: blck_bit_pos = 56 and cmpsd_bit_pos = 28.
    At the very least it WILL shave some of that time it was using.
    I had the same idea originally, but testing revealed that was actually slower than computing the sextet repeatedly. Counter intuitive, I know.
    There's a MOV (assignment) instruction involved; if the variable is on a register, the operation is basically free, otherwise the CPU has to access the bus and write to memory, which is orders of magnitude slower. Sadly, you can't control registers allocation.

    I meant you could do something like this
    Salem you are a freaking genius.
    I struggled with the idea for a bit, until it dawned on me. Of course it will not be as simple as transposing the arrays one on another. But I'll use a printf loop to help me figure out the correct order. If I've learned something in my time here is to absorb everything that comes out of your keyboard as a cactus absorbs moist in the morning desert. If it doesn't at first, it will make sense later.

    Thank you!

    PS: When testing my code in the macs at my school -where my project will be graded- I saw a dramatic speed up. encoding-decoding rate went from around 7MB/s on the i5 2520M 2.5GHZ in my laptop, to 32MB/s on the i5 in the mac. The mac has a bit more base clock speed but that alone wouldn't account for such a dramatic difference. I'm shocked. In contrast, Openssl achieves near to 42MB/s in my machine, and 56MB/s in the macs at the lab; and I know for a fact they've hand optimized their encryption algorithm in assembly. Can't wait to benchmark again after I've implemented your idea.
    printf("I'm a %s.\n", strrev("Dren"));

  8. #8
    Registered User
    Join Date
    May 2016
    Posts
    104
    Of course it will not be as simple as transposing the arrays one on another
    Pardon my lack of faith. It was as simple as you said. My hats off to you sir.

    :claps
    printf("I'm a %s.\n", strrev("Dren"));

  9. #9
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    37,485
    <burns>Excellent</burns>
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  10. #10
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    604
    Quote Originally Posted by Salem View Post
    <burns>Excellent</burns>
    You missed the extra e, "Eexcellant" XD

  11. #11
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    604
    Have you thought about using the asm keyword and making a block of assembly? or how bout just putting that one function inside an asm file and just directly control what is done within the function, means you need to be more careful with the code but should be doable

  12. #12
    Registered User
    Join Date
    Feb 2019
    Posts
    533
    Quote Originally Posted by awsdert View Post
    Have you thought about using the asm keyword and making a block of assembly? or how bout just putting that one function inside an asm file and just directly control what is done within the function, means you need to be more careful with the code but should be doable
    Writing a routine in assembly not necessarily will improve its performance. Here's a simple example:

    Code:
    # Makefile
    CFLAGS=-O2
    
    test: test.o f.o g.o
        $(CC) -o $@ $^
    
    test.o: test.c
    f.o: f.c
    
    g.o: g.asm
        nasm -felf64 -o $@ $<
    Code:
    /* test.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <inttypes.h>
    #include "cycle_counting.h"
    
    extern long f( int *, size_t );
    extern long g( int *, size_t );
    
    #define ARRAY_SIZE 128
    
    int main ( void )
    {
      static int array[ARRAY_SIZE];
      int *p;
      size_t count;
      long sum1, sum2;
      counter_T c1, c2;
    
      srand ( time ( NULL ) );
    
      // fill array with random values...
      count = ARRAY_SIZE;
      p = array;
      while ( count-- )
        *p++ = rand();
    
      c1 = BEGIN_TSC();
      sum1 = f( array, ARRAY_SIZE );
      END_TSC( &c1 );
    
      c2 = BEGIN_TSC();
      sum2 = g( array, ARRAY_SIZE );
      END_TSC( &c2 );
    
      printf( "sum = %ld, from f(): %" PRIu64 " cycles\n"
              "sum = %ld, from g(): %" PRIu64 " cycles\n",
              sum1, c1, sum2, c2 );
    }
    Code:
    /* f.c */
    #include <stddef.h>
    
    long f( int *p, size_t size )
    {
      long sum = 0L;
    
      while ( size-- )
        sum += *p++;
    
      return sum;
    }
    Code:
    ; g.asm
      bits 64
      default rel
    
      section .text
    
    ; long g( int *, size_t );
    ; Entry: RDI = ptr, RSI = size
    ; Returns RAX = sum.
      global g
    g:
      xor   rax,rax   ; sum = 0;
    .loop:
      test  rsi,rsi
      jz    .exit
      sub   rsi,1
      mov   ecx,[rdi+rsi*4]
      add   rax,rcx
      jmp   .loop
    .exit:
      ret
    Compiling, linking and testing I get:
    Code:
    $ make
    cc -O2   -c -o test.o test.c
    cc -O2   -c -o f.o f.c
    nasm -felf64 -o g.o g.asm
    cc -o test test.o f.o g.o
    $ ./test
    sum = 140386635490, from f(): 680 cycles
    sum = 140386635490, from g(): 892 cycles
    The assembly routine is 31% slower than the C routine (tested in an i5-3570 @ 3.4 GHz)!
    Here's the routine created by GCC:
    Code:
    $ objdump -dM intel f.o
    0000000000000000 <f>:
       0: 48 85 f6              test   rsi,rsi
       3: 74 23                 je     28 <f+0x28>
       5: 31 d2                 xor    edx,edx
       7: 31 c0                 xor    eax,eax
       9: 0f 1f 80 00 00 00 00  nop    DWORD PTR [rax+0x0]
      10: 48 63 0c 97           movsxd rcx,DWORD PTR [rdi+rdx*4]
      14: 48 83 c2 01           add    rdx,0x1
      18: 48 01 c8              add    rax,rcx
      1b: 48 39 f2              cmp    rdx,rsi
      1e: 75 f0                 jne    10 <f+0x10>
      20: f3 c3                 repz ret 
      22: 66 0f 1f 44 00 00     nop    WORD PTR [rax+rax*1+0x0]
      28: 31 c0                 xor    eax,eax
      2a: c3                    ret
    Notice the C implementation is more complex (testing the index two times to improve the effects of static branch predictor algorithm, and uses jumps qword alignment!)... In general, C compilers do a better job than manually created assembly routines.
    Last edited by flp1969; 05-30-2019 at 07:39 AM.

  13. #13
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    37,485
    Quote Originally Posted by awsdert View Post
    Have you thought about using the asm keyword and making a block of assembly? or how bout just putting that one function inside an asm file and just directly control what is done within the function, means you need to be more careful with the code but should be doable
    A better algorithm usually wins over crafting your own asm.
    In this case, Dren seems to have made good progress by rearranging all the lookup tables to remove lots of bit-twiddling and subscripting.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  14. #14
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    604
    flp1969: Well your's was most likely slower because of the move instruction in the loop, I was thinking more along the lines of using the registers in asm and never moving while in the loop, just changing the values, I just figured there must be a shorter way than what the compiler produced because the compiler might not be smart enough to realise it doesn't need to reset any registers during the loop

  15. #15
    Registered User
    Join Date
    Feb 2019
    Posts
    533
    Quote Originally Posted by awsdert View Post
    flp1969: Well your's was most likely slower because of the move instruction in the loop
    Nope... As I explain, the problem is that I did not aligned the instructions properly and not take advantage of the static branch prediction algorithm. My point: Good C compilers, as GCC and CLANG, take advantage of CPU characteristics that is easy not to take account by a regular programmer. Usually (not always), they create better code.

    I was thinking more along the lines of using the registers in asm and never moving while in the loop, just changing the values, I just figured there must be a shorter way than what the compiler produced because the compiler might not be smart enough to realise it doesn't need to reset any registers during the loop
    I am anxious to see how you intend to get values from a table without dealing with memory access... And, again, "shorter" way doesn't mean "faster" way, as I demonstrated with that example...

    As Salem said: A better algorithm will serve you better. Focusing on assembly, probably will not.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Help optimizing a code !!
    By thebenman in forum C Programming
    Replies: 9
    Last Post: 11-25-2014, 11:45 AM
  2. Optimizing code
    By KBriggs in forum C Programming
    Replies: 43
    Last Post: 06-05-2009, 04:09 PM
  3. Optimizing my code
    By Lina in forum C Programming
    Replies: 30
    Last Post: 10-22-2006, 01:31 PM
  4. Need help with section of Code
    By Harkin1987 in forum C Programming
    Replies: 9
    Last Post: 08-27-2006, 01:55 PM
  5. Help with section of code.
    By unejam2005 in forum C++ Programming
    Replies: 3
    Last Post: 12-11-2005, 06:38 PM

Tags for this Thread