Thread: I can't spot my mistake...

  1. #46
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,138
    Quote Originally Posted by hamster_nz View Post
    As I don't know your code or the problem you are solving , so I can't tell what you know is not right.

    What seems to be the problem? and what would you expect the correct output would be? What should be happening but isn't?
    In the output I gave, this:
    Code:
    test.c:122: reg_compare() 1: num.node = 8, val.node = 9
    test.c:126: reg_compare() 2: num.node = 8, val.node = 9
    test.c:130: reg_compare() 3: num.node = 8, val.node = 9
    test.c:140: reg_compare() 4: num.node = 8, val.node = 9
    Is the correct difference those values should ALWAYS have because they are supposed to refer to different registers, this:
    Code:
    test.c:122: reg_compare() 1: num.node = 9, val.node = 9
    test.c:126: reg_compare() 2: num.node = 9, val.node = 9
    test.c:130: reg_compare() 3: num.node = 9, val.node = 9
    test.c:140: reg_compare() 4: num.node = 9, val.node = 9
    Is an example of corrupt output as they refer to the same register

  2. #47
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,138
    Quote Originally Posted by hamster_nz View Post
    Found why it won't link.

    At least on my toolchain (gcc 7.5.0) the "-l alu" needs to be after the "test.o"

    I've done the following in main.mak to make it make:

    Code:
    BIN_FLAGS:=$(DBG_FLAGS) -fPIE $(COP)L . $(COP)l alu$(DBG_SFX)
    COMPILE_EXE=$(CC) $(BIN_FLAGS) $1 $(COP)o $2 $3 $(RPATH_FLAG)
    Code:
    BIN_FLAGS:=$(DBG_FLAGS) -fPIE $(COP)L . $(DBG_SFX)
     COMPILE_EXE=$(CC) $(BIN_FLAGS) $1 $(COP)o $2 $3 $(RPATH_FLAG) $(COP)l alu
    The change is now incorporated so no more needing to edit main.mak each time you update, Hodor since you were having the same issue could you give the update a try to see if this fix resolves it for you too.
    GitHub - awsdert/alu: Library modeled after the Arithmetic Logic Unit built into CPUs

  3. #48
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,138
    I managed to fix it, took me long enough to notice that I forgot about decrementing r in the scenario that count had been 0'd out, adding this line before restoring count fixed things:
    Code:
    r = SET2IF( count, count, r - 1 );
    Edit: While I have booked this week off work I still have some things to do so will be a while before I respond to any further posts (unless I see it right after making this edit)
    Last edited by awsdert; 09-16-2020 at 05:44 AM.

  4. #49
    Registered User
    Join Date
    Sep 2020
    Posts
    138
    Quote Originally Posted by awsdert View Post
    Well the 1st 2 structures (alu_src_t & alu_dst_t) hold pointers to the object that should be read from or written to and the various functions needed for interacting with them (along with a position pointer for the case of alu_src_t) so they are meant to be copied for skipping the branch that would otherwise check that 1st pointer, adding a few temporary variables would be faster than throwing away previously preloaded instructions just to load a different set, the 3rd (alu_base_t) is NEVER meant to have it's changes seen by the caller, the only change the caller should see in that scenario is the change made to *(src.nextpos) and the change made to alu_reg_data( alu, dst ), that is literally all the caller needs to see on that function, besides even if there is in error in that function I cannot track it properly as long as the node indexes are being given corrupt values.

    On another note I will make the change you suggested for the makefile regarding "-l alu" (honestly didn't expect fast reply so that is why I took a while to respond)
    I'm not sure that want you say is born out in testing.


    Here's the header file defining a test structure:


    Code:
    // perf.h - check the performance of passing a structure on the stack
    
    
    struct perf {
        struct perf *self;
        unsigned int i;
    };
    typedef struct perf perf_t;
    
    
    void perf_inc1(perf_t p);
    void perf_inc2(perf_t *p);



    Here's the two different routines, one that passes the structure on the stack, the other that uses a pointer:


    Code:
    // perf.c - check the performance of passing structure on the stack
    
    
    #include "perf.h"
    
    
    void perf_inc1(perf_t p) {
      p.self->i++;
    }
    
    
    void perf_inc2(perf_t *p) {
      if(p) {
         p->i++;
      }
    }

    And here's the test program that calls each 100,000,000 times and tells you how long each takes:


    Code:
    // main.c - test the performance of passing a structure on stack
    #include <stdio.h>
    #include <sys/time.h>
    #include "perf.h"
    
    
    
    
    int main(int argc, char *argv[]) {
       int i;
       long long s,e;
       struct timeval start_tv, end_tv;
    
    
       perf_t p;
       p.self = &p;
       p.i = 0;
    
    
       gettimeofday(&start_tv,NULL);
       for(i = 0; i < 1000*1000*100; i++) {
          perf_inc1(p);
       }
       gettimeofday(&end_tv, NULL);
       printf("p.i is %i\n",p.i);
    
    
       s = start_tv.tv_sec*1000000+start_tv.tv_usec;
       e = end_tv.tv_sec*1000000+end_tv.tv_usec;
       printf("perf_inc1() took %lli microseconds\n", e-s);
    
    
    
    
       gettimeofday(&start_tv,NULL);
       for(i = 0; i < 1000*1000*100; i++) {
          perf_inc2(&p);
       }
       gettimeofday(&end_tv, NULL);
       printf("p.i is %i\n",p.i);
    
    
       s = start_tv.tv_sec*1000000+start_tv.tv_usec;
       e = end_tv.tv_sec*1000000+end_tv.tv_usec;
       printf("perf_inc2() took %lli microseconds\n", e-s);
    }



    Are you saying that perf_inc1() should be faster than perf_inc2()?

  5. #50
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,138
    Quote Originally Posted by hamster_nz View Post
    I'm not sure that want you say is born out in testing.


    Here's the header file defining a test structure:


    Code:
    // perf.h - check the performance of passing a structure on the stack
    
    
    struct perf {
        struct perf *self;
        unsigned int i;
    };
    typedef struct perf perf_t;
    
    
    void perf_inc1(perf_t p);
    void perf_inc2(perf_t *p);



    Here's the two different routines, one that passes the structure on the stack, the other that uses a pointer:


    Code:
    // perf.c - check the performance of passing structure on the stack
    
    
    #include "perf.h"
    
    
    void perf_inc1(perf_t p) {
      p.self->i++;
    }
    
    
    void perf_inc2(perf_t *p) {
      if(p) {
         p->i++;
      }
    }

    And here's the test program that calls each 100,000,000 times and tells you how long each takes:


    Code:
    // main.c - test the performance of passing a structure on stack
    #include <stdio.h>
    #include <sys/time.h>
    #include "perf.h"
    
    
    
    
    int main(int argc, char *argv[]) {
       int i;
       long long s,e;
       struct timeval start_tv, end_tv;
    
    
       perf_t p;
       p.self = &p;
       p.i = 0;
    
    
       gettimeofday(&start_tv,NULL);
       for(i = 0; i < 1000*1000*100; i++) {
          perf_inc1(p);
       }
       gettimeofday(&end_tv, NULL);
       printf("p.i is %i\n",p.i);
    
    
       s = start_tv.tv_sec*1000000+start_tv.tv_usec;
       e = end_tv.tv_sec*1000000+end_tv.tv_usec;
       printf("perf_inc1() took %lli microseconds\n", e-s);
    
    
    
    
       gettimeofday(&start_tv,NULL);
       for(i = 0; i < 1000*1000*100; i++) {
          perf_inc2(&p);
       }
       gettimeofday(&end_tv, NULL);
       printf("p.i is %i\n",p.i);
    
    
       s = start_tv.tv_sec*1000000+start_tv.tv_usec;
       e = end_tv.tv_sec*1000000+end_tv.tv_usec;
       printf("perf_inc2() took %lli microseconds\n", e-s);
    }



    Are you saying that perf_inc1() should be faster than perf_inc2()?
    That's an invalid test because you're using the same object via different means, with one it is just self + pos while the other is (p + off) + pos, also obj.self should've been checked via a branch, whereas with what I did the responsibility of checking destination rests not with the ALU but with the functions it calls
    For the test to be valid it would need to look something like this:
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    
    typedef struct test
    {
    	int value;
    } test_t;
    
    void test_a( test_t *test )
    {
    	if ( test )
    		printf( "test>value = %i\n", test->value );
    }
    
    void test_b( test_t test )
    {
    	printf("test.value = %i", test.value );
    }
    
    void took( char *name, struct timeval init, struct timeval stop )
    {
    	ssize_t s, e;
    	
    	s = init.tv_sec * 1000000 + init.tv_usec;
    	e = stop.tv_sec * 1000000 + stop.tv_usec;
    	
    	printf("%s took %zi microseconds\n", name, e-s);
    }
    
    int main()
    {
    	test_t test = {0};
    	size_t i, stop = 1000*1000*100;
    	struct timeval tvInitA, tvStopA, tvInitB, tvStopB;
    	
    	gettimeofday(&tvInitA,NULL);
    	for( i = 0; i < stop; ++i ) {
    		test_a( &test );
    	}
    	gettimeofday(&tvStopA,NULL);
    	
    	gettimeofday(&tvInitB,NULL);
    	for( i = 0; i < stop; ++i ) {
    		test_b( test );
    	}
    	gettimeofday(&tvStopB,NULL);
    	
    	took( "test_a()", tvInitA, tvStopA );
    	took( "test_b()", tvInitB, tvStopB );
    	
    	return 0;
    }
    Edit 1: Changed code out for a test I'm currently running
    Edit 2: Test was taking too long so I killed it, set stop to 100 and ran again, here's the results:
    Edit 3: Made the mistake of leaving the excess in the results, removed now
    Edit 4: After multiple tests involving more values, I found that pointers only become a faster option when you go beyond the available CPU registers
    Code:
    ./pointer_branch_vs_object_copy_speed_test.AppImage
    ...
    test_a() took 96 microseconds
    test_b() took 13 microseconds
    Compilation finished successfully.
    Last edited by awsdert; 09-17-2020 at 03:55 AM.

  6. #51
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,138
    Quote Originally Posted by awsdert View Post
    That's an invalid test...
    For the test to be valid...
    Having adjusted my test to not print the values and just use them yielded similar results but with smaller amount of values, since the function you mentioned uses a total of 10 values I opted to base my test around that:
    Code:
    #define TEST_VAL_COUNT 10
    
    typedef struct test
    {
    	int val[TEST_VAL_COUNT];
    } test_t;
    
    int test_a( test_t *test )
    {
    	int i, ret = 0;
    	if ( test )
    	{
    		for ( i = 0; i < TEST_VAL_COUNT; ++i )
    		{
    			ret += test->val[i];
    		}
    	}
    	return ret;
    }
    
    int test_b( test_t test )
    {
    	int i, ret = 0;
    	for ( i = 0; i < TEST_VAL_COUNT; ++i )
    	{
    		ret += test.val[i];
    	}
    	return ret;
    }
    
    void took( char *name, struct timeval init, struct timeval stop )
    {
    	ssize_t s, e;
    	
    	s = init.tv_sec * 1000000 + init.tv_usec;
    	e = stop.tv_sec * 1000000 + stop.tv_usec;
    	
    	printf("%s took %zi microseconds\n", name, e-s);
    }
    
    int main()
    {
    	test_t test = {{0}};
    	size_t i, stop = SHRT_MAX;
    	struct timeval tvInitA, tvStopA, tvInitB, tvStopB;
    	
    	gettimeofday(&tvInitA,NULL);
    	for( i = 0; i < stop; ++i ) {
    		test_a( &test );
    	}
    	gettimeofday(&tvStopA,NULL);
    	
    	gettimeofday(&tvInitB,NULL);
    	for( i = 0; i < stop; ++i ) {
    		test_b( test );
    	}
    	gettimeofday(&tvStopB,NULL);
    	
    	took( "test_a()", tvInitA, tvStopA );
    	took( "test_b()", tvInitB, tvStopB );
    	
    	return 0;
    }
    With output:
    Code:
    gcc -o pointer_branch_vs_object_copy_speed_test.AppImage pointer_branch_vs_object_copy_speed_test.c && ./pointer_branch_vs_object_copy_speed_test.AppImage
    test_a() took 718 microseconds
    test_b() took 770 microseconds
    Compilation finished successfully.
    So in this case I concede the point that pointers despite the checks are faster

  7. #52
    Registered User
    Join Date
    Sep 2020
    Posts
    138
    I did a bit more testing, and moved the called routines into a different .c file so the compiler couldn't do local optimisations.

    This is for 100,000,000 calls of the following functions:

    Code:
    void test_a( test_t *test )
    {
       if ( test )
          total += test->value;
    }
    
    
    void test_b( test_t test )
    {
       total += test.value;
    }
    With only one int in the structure, placing the structure on the stack is quicker:

    test_a() took 315343 microseconds
    test_b() took 298095 microseconds

    With two ints in the structure, passing the address is quicker:

    test_a() took 305756 microseconds
    test_b() took 802630 microseconds


    With 16 ints in the structure time goes up again

    test_a() took 306505 microseconds
    test_b() took 1003469 microseconds

    So I agree in that as long as you only have one item in your structure, it looks as though it can be quicker. With more than one element all bets are off.


    Note that this was with the work being done in test_a and test_b staying the same, as verified by looking at the assembly
    Code:
    test_a: testq    %rdi, %rdi  << NULL POINTER TEST
        je    .L1    << JUMP IF NULL
        movl    (%rdi), %eax   << LOAD VALUE 
        addl    %eax, total(%rip)  << ADD IT TO A RUNNING TOTAL
    .L1:
        rep ret
    
    
    test_b: movl    8(%rsp), %eax  << LOAD THE VALUE
        addl    %eax, total(%rip)   << ADD IT TO THE RUNNING TOTAL
        ret

    Here's the assembler for the caller's loops, when the structures are large:

    Passing the address:
    Code:
    L4:
            movq    %rbp, %rdi
            addl    %ebx, 64(%rsp)
            addq    $1, %rbx
            call    test_a@PLT
            cmpq    $100000000, %rbx
            jne     .L4
    And this is the passing the copied test structure on the stack (this time with an excessive number of elements):
    Code:
    .L5:
            addl    %ebx, 64(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            call    test_b@PLT
            addq    $64, %rsp
            cmpq    $100000000, %rbx
            jne     .L5

    NOTE: Performance is most likely CPU dependant....

  8. #53
    misoturbutc Hodor's Avatar
    Join Date
    Nov 2013
    Posts
    1,795
    Quote Originally Posted by hamster_nz View Post
    I did a bit more testing, and moved the called routines into a different .c file so the compiler couldn't do local optimisations.

    This is for 100,000,000 calls of the following functions:

    Code:
    void test_a( test_t *test )
    {
       if ( test )
          total += test->value;
    }
    
    
    void test_b( test_t test )
    {
       total += test.value;
    }
    With only one int in the structure, placing the structure on the stack is quicker:

    test_a() took 315343 microseconds
    test_b() took 298095 microseconds

    With two ints in the structure, passing the address is quicker:

    test_a() took 305756 microseconds
    test_b() took 802630 microseconds


    With 16 ints in the structure time goes up again

    test_a() took 306505 microseconds
    test_b() took 1003469 microseconds

    So I agree in that as long as you only have one item in your structure, it looks as though it can be quicker. With more than one element all bets are off.


    Note that this was with the work being done in test_a and test_b staying the same, as verified by looking at the assembly
    Code:
    test_a: testq    %rdi, %rdi  << NULL POINTER TEST
        je    .L1    << JUMP IF NULL
        movl    (%rdi), %eax   << LOAD VALUE 
        addl    %eax, total(%rip)  << ADD IT TO A RUNNING TOTAL
    .L1:
        rep ret
    
    
    test_b: movl    8(%rsp), %eax  << LOAD THE VALUE
        addl    %eax, total(%rip)   << ADD IT TO THE RUNNING TOTAL
        ret

    Here's the assembler for the caller's loops, when the structures are large:

    Passing the address:
    Code:
    L4:
            movq    %rbp, %rdi
            addl    %ebx, 64(%rsp)
            addq    $1, %rbx
            call    test_a@PLT
            cmpq    $100000000, %rbx
            jne     .L4
    And this is the passing the copied test structure on the stack (this time with an excessive number of elements):
    Code:
    .L5:
            addl    %ebx, 64(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            pushq   120(%rsp)
            call    test_b@PLT
            addq    $64, %rsp
            cmpq    $100000000, %rbx
            jne     .L5

    NOTE: Performance is most likely CPU dependant....
    In test_a(), remove the test for test being NULL because that's the responsibility of the calling function (i.e. test != NULL should be a pre-condition). Either way you are of course correct but moving the test to the calling function will make it (test_a()) even faster.

  9. #54
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,138
    Quote Originally Posted by Hodor View Post
    In test_a(), remove the test for test being NULL because that's the responsibility of the calling function (i.e. test != NULL should be a pre-condition). Either way you are of course correct but moving the test to the calling function will make it (test_a()) even faster.
    The point of the test is to check branching speed against copying object data so doing so would render the test invalid, as for it being a precondition I'm considering adding a macro for pointer checks, that way the developer can decide if they want risky assumptions like that or not, anyways back to my ALU project I need a fresh pair of eyes (aside from my own) so could someone follow the ALU link in my signature, download and compile (in case something is not supporting your system/compiler yet) then take a quick glance through the code for any problems they can see (spend at most 10min if you need/want to keep the time short).

  10. #55
    Registered User
    Join Date
    Sep 2020
    Posts
    138
    Nobody can review a codebase in 10 minutes. This took much longer.

    I ran a profile over the latest build:

    Code:
    Flat profile:
    
    
    Each sample counts as 0.01 seconds.
     no time accumulated
    
    
      %   cumulative   self        self     total
     time   seconds   seconds    calls  Ts/call  Ts/call  name
      0.00      0.00     0.00    86753     0.00     0.00  alu_bit_dec
      0.00      0.00     0.00    39747     0.00     0.00  alu_bit_inc
      0.00      0.00     0.00     3159     0.00     0.00  alu_bit_set_bit
      0.00      0.00     0.00     1653     0.00     0.00  alu_reg_end_bit
      0.00      0.00     0.00      816     0.00     0.00  alu_reg_cmp
      0.00      0.00     0.00      191     0.00     0.00  alu_reg__shl
      0.00      0.00     0.00      186     0.00     0.00  alu_lowest_upto
      0.00      0.00     0.00      175     0.00     0.00  alu_reg_sub
      0.00      0.00     0.00      165     0.00     0.00  alu_get_reg_n.....

    As it stands, alu_bit_dec and alu_bit_inc are the most used functions, so an improvement in those will pay off many times over, compared with improving alu_reg_cmp().


    So let's see the code involved:

    From alu_bit.h, where the structure is defined:

    Code:
    typedef struct alu_bit {
       size_t s, b, p, *S, B;
    } alu_bit_t;

    For somebody who doesn't know the codebase, the members have zero context and meaning. That goes doubly so for using upper and lowecase member names (e.g. "s" and "S").

    From alu_bit.c:

    Code:
    struct alu_bit alu_bit_inc( struct alu_bit pos )
    {
       pos.b++;
       pos.p++;
       pos.B <<= 1;
       if ( pos.B )
         return pos;
       pos.p = 0;
       pos.B = SIZE_T_C(1);
       pos.s++;
       pos.S++;
       return pos;
    }
    And here is an example of using it:

    Code:
    int_t alu_reg_not( alu_t *alu, alu_reg_t num )
    {
      alu_bit_t n = {0};
      void *part;
    
    
      if ( alu )
      {
        num.node %= alu_used( alu );
    
    
        part = alu_reg_data( alu, num );
        n = alu_bit_set_bit( part, num.from );
    
    
        for ( ; n.b < num.upto; n = alu_bit_inc( n ) )
        {
          *(n.S) ^= n.B;
        }
    
    
        return 0;
      }
    
    
      return alu_err_null_ptr( "alu" );
    }

    I still really don't like the way you are copying a structure with lots of memebers on the stack, manipulate it inside the function then copying it back off the stack into the original structure. It will be slow.


    However, lets improve the code a bit....


    So let's work out what is going on:


    Code:
    struct alu_bit alu_bit_inc( struct alu_bit pos )
    {
       pos.b++;       // Increment the bit number
       pos.p++;       // Increment the position number, in this word
       pos.B <<= 1;   // Shift the mask over one bit
       if ( pos.B )   // Have we not used all the bits in this word?
         return pos;  // Return
       pos.p = 0;     // We are on bit zero of the word
       pos.B = 1;     // We are using the rightmost bit of the word
       pos.s++;       // Increment the index of the word being checked
       pos.S++;       // Increment the pointer to the word being checked
    
    
       return pos;
       
    }

    So now rename the items in the structure to give meaning:


    Code:
    typedef struct alu_bit {
       size_t bit_mask_in_word;   // formally B 
       size_t bit_number_in_reg;  // formally b
       size_t position_in_word;   // formally p
       size_t *current_word,      // formally S
       size_t current_word_index; // formally s 
    } alu_bit_t;

    An update the code that uses them:


    Code:
    struct alu_bit alu_bit_inc( struct alu_bit pos )
    {
       pos.bit_number_in_reg++;
       pos.position_in_word++;
       pos.bit_mask_in_word <<= 1;
       if ( pos.bit_mask_in_word == 0 )
       {
          // Move onto the next word
          pos.position_in_word = 0;
          pos.bit_mask_in_word = 1;
          pos.current_word++;
          pos.current_word_index++;
       }
       return pos;
    }

    Oh, so now it all becomes clear. At this point I see what is going on have another look at alu_reg_not() andunderstand how it is working through and flipping bits one by one in a block of data.

    If I was re-writing everything from scratch, is what it would look like:

    Code:
    ....
    #define REG_DATA_BITS 32
    typdef uint32_t reg_data_t
    
    
    struct alu {
      int registers_used;   // Number of elements allocated
      uint32_t **reg_data;  // Pointer to an array of pointers that hold the data, dynamically allocated
      uint32_t *reg_size;   // Pointer to an array of sizes, dynamically allocated
    };
    typedef struct alu alu_t;
    
    
    ....
    
    
    int_t alu_reg_not( alu_t *alu, uint32_t num )
    {
      reg_data_t *data;
      uint32_t to_flip;
    
    
      if(!alu) {
        alu_err_null_ptr( "alu" );
      }
    
    
      if(num >= alu->registers_used) {
        alu_err_reg_num( "alu" );
      }
    
    
      data = alu->reg_data[num];     // get the pointer to data for this register 'num'
    
    
      while(to_flip = alu->reg_size[num]; to_flip >= REG_DATA_BITS; to_flip -= REG_DATA_BITS) {
        *data = ~(*data);            // Flip all the bits
        data++;                      // move onto the next word
      }
    
    
      // Now flip any remaining bits
      if(to_flip > 0) {
        reg_data_t temp = 1;
        temp <<= to_flip;
        *data ^= temp-1;
      }
      return 0;
    }

    It might literally be 100s of times faster. it makes 1/32th the loop iterations, and doesn't do all that mucking around with the "alu_bit" members, with all the calling and return overheads.
    Last edited by hamster_nz; 09-17-2020 at 06:03 PM.

  11. #56
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,138
    Quote Originally Posted by hamster_nz View Post
    Nobody can review a codebase in 10 minutes. This took much longer.

    I ran a profile over the latest build:

    Code:
    Flat profile:
    
    
    Each sample counts as 0.01 seconds.
     no time accumulated
    
    
      %   cumulative   self        self     total
     time   seconds   seconds    calls  Ts/call  Ts/call  name
      0.00      0.00     0.00    86753     0.00     0.00  alu_bit_dec
      0.00      0.00     0.00    39747     0.00     0.00  alu_bit_inc
      0.00      0.00     0.00     3159     0.00     0.00  alu_bit_set_bit
      0.00      0.00     0.00     1653     0.00     0.00  alu_reg_end_bit
      0.00      0.00     0.00      816     0.00     0.00  alu_reg_cmp
      0.00      0.00     0.00      191     0.00     0.00  alu_reg__shl
      0.00      0.00     0.00      186     0.00     0.00  alu_lowest_upto
      0.00      0.00     0.00      175     0.00     0.00  alu_reg_sub
      0.00      0.00     0.00      165     0.00     0.00  alu_get_reg_n.....

    As it stands, alu_bit_dec and alu_bit_inc are the most used functions, so an improvement in those will pay off many times over, compared with improving alu_reg_cmp().


    So let's see the code involved:

    From alu_bit.h, where the structure is defined:

    Code:
    typedef struct alu_bit {
       size_t s, b, p, *S, B;
    } alu_bit_t;

    For somebody who doesn't know the codebase, the members have zero context and meaning. That goes doubly so for using upper and lowecase member names (e.g. "s" and "S").

    From alu_bit.c:

    Code:
    struct alu_bit alu_bit_inc( struct alu_bit pos )
    {
       pos.b++;
       pos.p++;
       pos.B <<= 1;
       if ( pos.B )
         return pos;
       pos.p = 0;
       pos.B = SIZE_T_C(1);
       pos.s++;
       pos.S++;
       return pos;
    }
    And here is an example of using it:

    Code:
    int_t alu_reg_not( alu_t *alu, alu_reg_t num )
    {
      alu_bit_t n = {0};
      void *part;
    
    
      if ( alu )
      {
        num.node %= alu_used( alu );
    
    
        part = alu_reg_data( alu, num );
        n = alu_bit_set_bit( part, num.from );
    
    
        for ( ; n.b < num.upto; n = alu_bit_inc( n ) )
        {
          *(n.S) ^= n.B;
        }
    
    
        return 0;
      }
    
    
      return alu_err_null_ptr( "alu" );
    }

    I still really don't like the way you are copying a structure with lots of memebers on the stack, manipulate it inside the function then copying it back off the stack into the original structure. It will be slow.


    However, lets improve the code a bit....


    So let's work out what is going on:


    Code:
    struct alu_bit alu_bit_inc( struct alu_bit pos )
    {
       pos.b++;       // Increment the bit number
       pos.p++;       // Increment the position number, in this word
       pos.B <<= 1;   // Shift the mask over one bit
       if ( pos.B )   // Have we not used all the bits in this word?
         return pos;  // Return
       pos.p = 0;     // We are on bit zero of the word
       pos.B = 1;     // We are using the rightmost bit of the word
       pos.s++;       // Increment the index of the word being checked
       pos.S++;       // Increment the pointer to the word being checked
    
    
       return pos;
       
    }

    So now rename the items in the structure to give meaning:


    Code:
    typedef struct alu_bit {
       size_t bit_mask_in_word;   // formally B 
       size_t bit_number_in_reg;  // formally b
       size_t position_in_word;   // formally p
       size_t *current_word,      // formally S
       size_t current_word_index; // formally s 
    } alu_bit_t;

    An update the code that uses them:


    Code:
    struct alu_bit alu_bit_inc( struct alu_bit pos )
    {
       pos.bit_number_in_reg++;
       pos.position_in_word++;
       pos.bit_mask_in_word <<= 1;
       if ( pos.bit_mask_in_word == 0 )
       {
          // Move onto the next word
          pos.position_in_word = 0;
          pos.bit_mask_in_word = 1;
          pos.current_word++;
          pos.current_word_index++;
       }
       return pos;
    }

    Oh, so now it all becomes clear. At this point I see what is going on have another look at alu_reg_not() andunderstand how it is working through and flipping bits one by one in a block of data.

    If I was re-writing everything from scratch, is what it would look like:

    Code:
    ....
    #define REG_DATA_BITS 32
    typdef uint32_t reg_data_t
    
    
    struct alu {
      int registers_used;   // Number of elements allocated
      uint32_t **reg_data;  // Pointer to an array of pointers that hold the data, dynamically allocated
      uint32_t *reg_size;   // Pointer to an array of sizes, dynamically allocated
    };
    typedef struct alu alu_t;
    
    
    ....
    
    
    int_t alu_reg_not( alu_t *alu, uint32_t num )
    {
      reg_data_t *data;
      uint32_t to_flip;
    
    
      if(!alu) {
        alu_err_null_ptr( "alu" );
      }
    
    
      if(num >= alu->registers_used) {
        alu_err_reg_num( "alu" );
      }
    
    
      data = alu->reg_data[num];     // get the pointer to data for this register 'num'
    
    
      while(to_flip = alu->reg_size[num]; to_flip >= REG_DATA_BITS; to_flip -= REG_DATA_BITS) {
        *data = ~(*data);            // Flip all the bits
        data++;                      // move onto the next word
      }
    
    
      // Now flip any remaining bits
      if(to_flip > 0) {
        reg_data_t temp = 1;
        temp <<= to_flip;
        *data ^= temp-1;
      }
      return 0;
    }

    It might literally be 100s of times faster. it makes 1/32th the loop iterations, and doesn't do all that mucking around with the "alu_bit" members, with all the calling and return overheads.
    The reason I fiddle with bit members is because the function may be handed a number where it should not be doing every segment and/or bit, like in the example of alu_reg_sub() which is used by alu_reg_divide() to deal with subtracting the division value from a section of the original number (like you would do in normal math because it is faster than doing dozens of subtractions to find the same number). On another note how did you do the profiling? That's one common thing that I definitely haven't learnt yet. Also as for 10min thing I was more expecting a quick glance through for anything your eyes pick out without looking deeply into the code, I will take into account that some functions like alu_reg_not() can be done quicker though so I will work on that. Edit: Forgot to mention I will take into account that naming problem, originally I set them like that because I knew I would use them a lot and wanted to save myself some trouble when error fixing. Since all that's left is floating math I can safely go ahead and change the names, I will still use shorter though because the
    structure name is enough to describe what dealing with.
    Last edited by awsdert; 09-18-2020 at 01:48 AM.

  12. #57
    Registered User
    Join Date
    Sep 2020
    Posts
    138
    Profiling is really simple and cool - you'll love it! If using GCC, compile your code with the "-pg" option. Run it, then afterwards run "gprof [program_name]" to generate the report.

    If you compile with other optimizations it can change the report, as functions can get 'inlined' and so on.

  13. #58
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,138
    Quote Originally Posted by hamster_nz View Post
    Profiling is really simple and cool - you'll love it! If using GCC, compile your code with the "-pg" option. Run it, then afterwards run "gprof [program_name]" to generate the report.

    If you compile with other optimizations it can change the report, as functions can get 'inlined' and so on.
    Thanks While I'm implementing that into my makefile targets/goals do you mind having a look at this function, I seem to have made a mistake while trying to implement faster code for it:
    Code:
    int_t alu_reg_not( alu_t *alu, alu_reg_t num )
    {
    	alu_bit_t n, e;
    	void *part;
    	size_t i, stop, mask, mask_init, mask_last;
    	
    	if ( alu )
    	{
    		num.node %= alu_used( alu );
    		part = alu_reg_data( alu, num );
    		
    		n = alu_bit_set_bit( part, num.from );
    		e = alu_bit_set_bit( part, num.upto - 1 );
    		
    		mask = 0;
    		mask = mask_init = mask_last = ~mask;
    		
    		mask_init <<= n.pos;
    		mask_last >>= (bitsof(size_t) - e.pos);
    		
    		stop = e.seg - n.seg;
    		
    		*(n.ptr) ^= SET2IF( stop, mask_init, mask_init & mask_last );
    		
    		for ( i = 1; i < stop; ++i ) n.ptr[i] = ~(n.ptr[i]);
    		
    		*(e.ptr) ^= SET1IF( stop, mask_last );
    	
    		return 0;
    	}
    	
    	return alu_err_null_ptr( "alu" );
    }
    Example of output currently:
    Code:
    test.c:204: reg_modify() ~0x00000000DEADC0DE, Expected 0xFFFFFFFF21523F21, Got 0x7FFFFFFF21523F21, op = '~'

  14. #59
    Registered User awsdert's Avatar
    Join Date
    Jan 2015
    Posts
    1,138
    Well I tried adding the profiling as a target/goal, as you might have guessed I didn't succeed:
    Code:
    make profile run
    MAKECMDGOALS=profile run
    make -j 1 --no-print-directory -f main.mak profile run
    PRJ_SRC_FILES = 'test.c alu_bit.c alu_int.c alu_main.c alu_math.c alu_mem.c alu_uint.c alu_vec.c'
    Checking 3rd Party libraries are upto date
    cd 'cloned/fbstdc' && git config pull.rebase true && git pull
    Already up to date.
    Finished checking
    PRJ_DST_BIN=alu_p.AppImage
    PRJ_DST_LIB=libalu_p.so
    make[1]: Nothing to be done for 'profile'.
    gprof ./alu_p.AppImage
    gmon.out: No such file or directory
    make[1]: *** [main.mak:83: run] Error 1
    make: *** [1st.mak:6: profile] Error 2
    Compilation failed.
    Any ideas why it failed?

  15. #60
    Registered User
    Join Date
    Sep 2020
    Posts
    138
    Quote Originally Posted by awsdert View Post
    Well I tried adding the profiling as a target/goal, as you might have guessed I didn't succeed:
    Code:
    make profile run
    MAKECMDGOALS=profile run
    make -j 1 --no-print-directory -f main.mak profile run
    PRJ_SRC_FILES = 'test.c alu_bit.c alu_int.c alu_main.c alu_math.c alu_mem.c alu_uint.c alu_vec.c'
    Checking 3rd Party libraries are upto date
    cd 'cloned/fbstdc' && git config pull.rebase true && git pull
    Already up to date.
    Finished checking
    PRJ_DST_BIN=alu_p.AppImage
    PRJ_DST_LIB=libalu_p.so
    make[1]: Nothing to be done for 'profile'.
    gprof ./alu_p.AppImage
    gmon.out: No such file or directory
    make[1]: *** [main.mak:83: run] Error 1
    make: *** [1st.mak:6: profile] Error 2
    Compilation failed.
    Any ideas why it failed?
    Can't see it compiling any of the -c- source with "-pg" in that output.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. can anyone help to spot mistake in the code
    By chess_queen in forum C Programming
    Replies: 1
    Last Post: 10-21-2012, 10:37 AM
  2. Can you spot my mistake?
    By Brewer in forum C Programming
    Replies: 13
    Last Post: 11-12-2006, 12:50 PM
  3. going to a certain spot in a file...
    By agerealm in forum C++ Programming
    Replies: 3
    Last Post: 05-17-2002, 02:31 AM

Tags for this Thread