I have written a program that uses a queue structure. It is implemented as a struct containing a void * array, and indexes pointing to the first and last element in the queue.
I use this structure to store pointers, but I also use it to store long ints.
So when I wrote the first version of the code when I needed to store a long int I would malloc a space for it and store it in the queue:
Code:
pdata=malloc(sizeof(node_id));
*pdata = v;
queue_add(qbfs,(void *)pdata);
(node_id is a typedef for long int and queue_add is the function I wrote for inserting an element in the queue).
For removing an element I would read from the pointer and then free it:
Code:
pdata = (node_id*)queue_get(qbfs);
u = *pdata;
free(pdata);
Now I wanted my code to run faster, and I thought all these mallocs and frees were not good. So I created a variant of my queue structure that contains a long int array, to be able to store my long ints directly. This made the code run _much_ slower. I tried every variation I could think of, but everything just seems to slow my code down. In the end I've done a version which uses the same queue structure but casts the long ints to pointers, replacing the above snippets of code by lines such as:
Code:
queue_add(qbfs,(void *)v);
u = (node_id)queue_get(qbfs);
(I have checked that long ints and pointers are the same size on the machine I'm using, and that this does not change the results of the program).
The resulting code is much slower. To compare the running times of the two:
Code:
real 0m21.208s
user 0m3.688s
sys 0m17.517s
versus
Code:
real 0m3.900s
user 0m2.740s
sys 0m1.148s
I use some random calls in my program but I used a srandom(0) at the beginning of both versions, and checked that the results are identical.
I've run diff between the two versions a thousand time to check that the differences are only in the queue management.
I've compiled the two versions of the code in the same way (no option except -W and -Wall), and am running them on the same machine. I have done the compilations and timing several times to check that I was not getting mixed up between which version is which.
I've isolated these snippets of code to launch them on their own in a program that does 2 millions inserts and two millions remove from the queue, and in this case, the cast version is faster.
I must be missing something really obvious but I am out of my depth.
Any idea how using the same code, with less malloc and free, can run slower? And why is it spending so much more time in sys state?
Thank you for reading.