Nobody can review a codebase in 10 minutes. This took much longer.
I ran a profile over the latest build:
Code:
Flat profile:

Each sample counts as 0.01 seconds.
 no time accumulated

  %   cumulative    self               self     total
 time   seconds    seconds    calls   Ts/call  Ts/call  name
 0.00      0.00      0.00     86753     0.00     0.00   alu_bit_dec
 0.00      0.00      0.00     39747     0.00     0.00   alu_bit_inc
 0.00      0.00      0.00      3159     0.00     0.00   alu_bit_set_bit
 0.00      0.00      0.00      1653     0.00     0.00   alu_reg_end_bit
 0.00      0.00      0.00       816     0.00     0.00   alu_reg_cmp
 0.00      0.00      0.00       191     0.00     0.00   alu_reg__shl
 0.00      0.00      0.00       186     0.00     0.00   alu_lowest_upto
 0.00      0.00      0.00       175     0.00     0.00   alu_reg_sub
 0.00      0.00      0.00       165     0.00     0.00   alu_get_reg_n.....
As it stands, alu_bit_dec() and alu_bit_inc() are by far the most-called functions, so an improvement in those will pay off many times over compared with, say, improving alu_reg_cmp().
So let's see the code involved:
From alu_bit.h, where the structure is defined:
Code:
typedef struct alu_bit {
    size_t s, b, p, *S, B;
} alu_bit_t;
For somebody who doesn't know the codebase, the members have zero context and meaning. That goes doubly for member names that differ only in case (e.g. "s" and "S").
From alu_bit.c:
Code:
struct alu_bit alu_bit_inc( struct alu_bit pos )
{
    pos.b++;
    pos.p++;
    pos.B <<= 1;

    if ( pos.B )
        return pos;

    pos.p = 0;
    pos.B = SIZE_T_C(1);
    pos.s++;
    pos.S++;

    return pos;
}
And here is an example of using it:
Code:
int_t alu_reg_not( alu_t *alu, alu_reg_t num )
{
    alu_bit_t n = {0};
    void *part;

    if ( alu )
    {
        num.node %= alu_used( alu );
        part = alu_reg_data( alu, num );
        n = alu_bit_set_bit( part, num.from );

        for ( ; n.b < num.upto; n = alu_bit_inc( n ) )
        {
            *(n.S) ^= n.B;
        }

        return 0;
    }

    return alu_err_null_ptr( "alu" );
}
I still really don't like the way you copy a structure with lots of members onto the stack, manipulate it inside the function, then copy the whole thing back off the stack into the original structure. It will be slow.
However, let's improve the code a bit. First, let's work out what is going on:
Code:
struct alu_bit alu_bit_inc( struct alu_bit pos )
{
    pos.b++;       // Increment the bit number
    pos.p++;       // Increment the position number, in this word
    pos.B <<= 1;   // Shift the mask over one bit

    if ( pos.B )   // Have we not used all the bits in this word?
        return pos;

    pos.p = 0;     // We are on bit zero of the word
    pos.B = 1;     // We are using the rightmost bit of the word
    pos.s++;       // Increment the index of the word being checked
    pos.S++;       // Increment the pointer to the word being checked

    return pos;
}
So now rename the members of the structure to give them meaning:
Code:
typedef struct alu_bit {
    size_t bit_mask_in_word;    // formerly B
    size_t bit_number_in_reg;   // formerly b
    size_t position_in_word;    // formerly p
    size_t *current_word;       // formerly S
    size_t current_word_index;  // formerly s
} alu_bit_t;
And update the code that uses them:
Code:
struct alu_bit alu_bit_inc( struct alu_bit pos )
{
    pos.bit_number_in_reg++;
    pos.position_in_word++;
    pos.bit_mask_in_word <<= 1;

    if ( pos.bit_mask_in_word == 0 )
    {
        // Move onto the next word
        pos.position_in_word = 0;
        pos.bit_mask_in_word = 1;
        pos.current_word++;
        pos.current_word_index++;
    }

    return pos;
}
Oh, so now it all becomes clear. At this point I can have another look at alu_reg_not() and understand how it works through a block of data, flipping bits one by one.
If I were re-writing everything from scratch, this is what it would look like:
Code:
....

#define REG_DATA_BITS 32
typedef uint32_t reg_data_t;

struct alu {
    uint32_t registers_used; // Number of registers allocated
    reg_data_t **reg_data;   // Array of per-register data pointers, dynamically allocated
    uint32_t *reg_size;      // Array of per-register sizes in bits, dynamically allocated
};
typedef struct alu alu_t;

....

int_t alu_reg_not( alu_t *alu, uint32_t num )
{
    reg_data_t *data;
    uint32_t to_flip;

    if ( !alu )
        return alu_err_null_ptr( "alu" );

    if ( num >= alu->registers_used )
        return alu_err_reg_num( "alu" );

    data = alu->reg_data[num]; // Get the pointer to the data for register 'num'

    for ( to_flip = alu->reg_size[num]; to_flip >= REG_DATA_BITS; to_flip -= REG_DATA_BITS )
    {
        *data = ~(*data); // Flip a whole word's worth of bits
        data++;           // Move onto the next word
    }

    // Now flip any remaining bits
    if ( to_flip > 0 )
    {
        reg_data_t temp = 1;
        temp <<= to_flip;
        *data ^= temp - 1;
    }

    return 0;
}
It might literally be hundreds of times faster: it makes 1/32nd of the loop iterations, and does away with all that mucking around with the "alu_bit" members, plus the call and return overhead of alu_bit_inc() on every single bit.
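One way to gain confidence in the word-wise rewrite is to check it against the bit-by-bit result on a small buffer. A standalone sketch of just the flipping logic (the function names here are made up for the demo, not from the codebase):

```c
#include <stddef.h>
#include <stdint.h>

#define REG_DATA_BITS 32
typedef uint32_t reg_data_t;

/* Flip the low `nbits` bits of `data`, a whole word at a time,
 * then mask-flip the leftover bits -- same shape as the
 * rewritten alu_reg_not() above. */
static void flip_bits_wordwise( reg_data_t *data, uint32_t nbits )
{
    uint32_t to_flip;

    for ( to_flip = nbits; to_flip >= REG_DATA_BITS; to_flip -= REG_DATA_BITS )
    {
        *data = ~(*data); // Flip a whole word's worth of bits
        data++;           // Move onto the next word
    }

    if ( to_flip > 0 )
    {
        reg_data_t temp = (reg_data_t)1 << to_flip;
        *data ^= temp - 1; // Mask of to_flip low ones
    }
}

/* Reference implementation: flip the same bits one at a time. */
static void flip_bits_bitwise( reg_data_t *data, uint32_t nbits )
{
    for ( uint32_t i = 0; i < nbits; i++ )
        data[i / REG_DATA_BITS] ^= (reg_data_t)1 << (i % REG_DATA_BITS);
}
```

Both routines must leave the buffer identical for any bit count, including counts that don't land on a word boundary.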