Thread: Using less malloc and free makes code _slower_

  1. #1
    Registered User
    Join Date
    Aug 2011
    Posts
    8

    Using less malloc and free makes code _slower_

    I have written a program that uses a queue structure. It is implemented as a struct containing a void * array, and indexes pointing to the first and last element in the queue.
    I use this structure to store pointers, but I also use it to store long ints.

    So in the first version of the code, when I needed to store a long int I would malloc space for it and store the pointer in the queue:
    Code:
    pdata=malloc(sizeof(node_id));
    *pdata = v;
    queue_add(qbfs,(void *)pdata);
    (node_id is a typedef for long int and queue_add is the function I wrote for inserting an element in the queue).

    For removing an element I would read from the pointer and then free it:
    Code:
    pdata = (node_id*)queue_get(qbfs);
    u = *pdata;
    free(pdata);
    Now I wanted my code to run faster, and I thought all these mallocs and frees were not good. So I created a variant of my queue structure that contains a long int array, to be able to store my long ints directly. This made the code run _much_ slower. I tried every variation I could think of, but everything just seemed to slow my code down. In the end I wrote a version that uses the same queue structure but casts the long ints to pointers, replacing the above snippets with lines such as:
    Code:
    queue_add(qbfs,(void *)v);
    
    u = (node_id)queue_get(qbfs);
    (I have checked that long ints and pointers are the same size on the machine I'm using, and that this does not change the results of the program).
    The resulting code is much slower. To compare the running times of the two:
    Code:
    real    0m21.208s
    user    0m3.688s
    sys     0m17.517s
    versus

    Code:
    real    0m3.900s
    user    0m2.740s
    sys     0m1.148s
    I use some random() calls in my program, but I call srandom(0) at the beginning of both versions and have checked that the results are identical.
    I've run diff between the two versions a thousand times to check that the differences are only in the queue management.

    I've compiled the two versions of the code in the same way (no option except -W and -Wall), and am running them on the same machine. I have done the compilations and timing several times to check that I was not getting mixed up between which version is which.

    I've isolated these snippets of code to run them on their own in a program that does two million inserts and two million removes from the queue, and in that case the cast version is faster.
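    (For the record, the standalone test is just a timing loop along these lines; this is a simplified sketch of the malloc/free half only, the real test goes through queue_add and queue_get:)
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    typedef long int node_id;
    
    int main(void)
    {
        clock_t start = clock();
        for (long i = 0; i < 2000000; i++) {
            /* same pattern as in bfs_r: allocate, store, (use), free */
            node_id *pdata = malloc(sizeof *pdata);
            *pdata = i;
            free(pdata);
        }
        printf("%.3f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);
        return 0;
    }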

    I must be missing something really obvious but I am out of my depth.
    Any idea how the same code, with fewer calls to malloc and free, can run slower? And why is it spending so much more time in the sys state?

    Thank you for reading.

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    I think that you should post your code.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    What is the address value of "v"? Is it a multiple of 4?

    It could be a difference of one unaligned access per operation vs. many, which would also explain the extra sys time.
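    A quick way to check is something like this (just a sketch; test against whatever alignment the platform expects):
    Code:
    #include <stdint.h>
    #include <stdio.h>
    
    /* Warn if p is not aligned to sizeof(long). */
    void check_align(void *p)
    {
        if ((uintptr_t)p % sizeof(long) != 0)
            printf("%p is misaligned\n", p);
    }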

    gg

  4. #4
    Registered User
    Join Date
    Aug 2011
    Posts
    8
    laserlight: I was reluctant to post my code because it's a bit large and I was unable to isolate the problem in a small bit of code. Anyway, in case it helps, I've put it up here:
    https://bitbucket.org/chlorine/queue/src

    There are three versions:
    malloc-free.c is the original one, in which I do a malloc before inserting a pointer in the queue, and a free after removing it.
    cast.c is the one that I described, in which I cast a long int to void * before inserting it in the queue (will only work on architectures in which long int and void * are the same size)
    longint-queue.c is a third version in which I created a queue variant for storing long ints. It has the same bad performance as cast.c

    All the differences are in the function bfs_r, which performs a breadth first search in a graph (hence the need for a queue).

    It needs parameters to run, the ones I used are:
    100000 200000 1 3000 5 10 0


    Codeplug: thanks for the input, I would never have thought to check that. However, as far as I can tell all pointers are properly aligned.

  5. #5
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Quote Originally Posted by chlorine View Post
    malloc-free.c is the original one, in which I do a malloc before inserting a pointer in the queue, and a free after removing it.
    cast.c is the one that I described, in which I cast a long int to void * before inserting it in the queue (will only work on architectures in which long int and void * are the same size)
    Casting a long int to void * successfully requires more than the two types being the same size. The bit pattern representing a pointer with a specified value is not necessarily the same bit pattern as an integral type with the same value (i.e., a long int with value 42 is not necessarily represented with the same set of bits as a pointer with value 42). If the representations are different, and the value being converted is not known until runtime, the conversion itself involves a runtime overhead (typically swapping a few bits around).

    In any event, storing an int value in a void * within a linked list strikes me as rather pointless. A potential performance hit (what you're seeing will not be universal) is just one more reason not to do that.
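    If you do insist on stuffing integers into a void *, the conversion is at least better expressed through intptr_t from <stdint.h>, which exists for exactly this kind of round trip (a sketch, not a recommendation):
    Code:
    #include <stdint.h>
    
    typedef long int node_id;
    
    /* round-trip an integer through a pointer via intptr_t */
    static void *node_id_to_ptr(node_id v) { return (void *)(intptr_t)v; }
    static node_id ptr_to_node_id(void *p) { return (node_id)(intptr_t)p; }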
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  6. #6
    Registered User
    Join Date
    Aug 2011
    Posts
    8
    I didn't know that casting long ints to pointers could create an overhead. Indeed, the cast version isn't one I would want to use; I just created it for posting because it was closest to the original code.
    The version I'm wondering about is longint-queue.c. What I do is store values in a queue (not a linked list):
    Code:
    typedef struct queue{
      node_id size;
      void **elts;
      node_id begin;
      node_id end;
    }queue;
    In the original version I would malloc a pointer to a long int and insert it in the queue:
    Code:
      pdata=malloc(sizeof(node_id));
      *pdata = v;
      queue_add(qbfs,(void *)pdata);
    In longint-queue.c I created a variant of the queue structure that stores long ints rather than void *, and I replaced the above snippets of code with

    Code:
    node_id_queue_add(qbfs,v);
    This version is as slow as the cast one. I thought this was for similar reasons but maybe not. Sorry if it was misleading.
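    For reference, the variant is essentially the same structure with the void ** replaced by a node_id array, roughly like this (a simplified sketch, not the exact code from longint-queue.c):
    Code:
    typedef struct node_id_queue{
      node_id size;
      node_id *elts;   /* long ints stored directly instead of void * */
      node_id begin;
      node_id end;
    }node_id_queue;
    
    /* simplified add: wrap-around and full-queue handling omitted */
    void node_id_queue_add(node_id_queue *q, node_id v)
    {
      q->elts[q->end] = v;
      q->end = (q->end + 1) % q->size;
    }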

  7. #7
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    If it's any consolation, I seem to remember writing a memory pool for a hash table implementation, thinking it would be faster than malloc'ing each node, and getting a similar surprise, which led me to give up on the pool. Keep in mind that the OS (all of them, I believe) allocates memory in 4Kb pages. That's about 1000 ints. I'm not sure how truly expensive malloc/free is except when you need to get a new page from the OS.

    Have you tried profiling this beyond the overall runtime, e.g., with gprof (Linux)?

    Profiling with GProf

    The guidelines there look somewhat different than the ones I use, but I am sure you are capable of using google and reading man pages. Here's the basic notes I follow ("factorial.c" being a source and "factor" being the binary):

    Code:
    compile:
    gcc -pg -fprofile-arcs -ftest-coverage -o factor factorial.c
    
    run ./factor to create gmon.out
    
    gprof -b ./factor
    
    gcov factorial.c
    cat factorial.c.gcov
    This might not pinpoint the problem beyond what you've said, if the difference really is all about one particular function, but it will at least confirm (or disprove) that theory.
    Last edited by MK27; 08-11-2011 at 06:46 AM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  8. #8
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by MK27 View Post
    Keep in mind that the OS (all of them, I believe) allocates memory in 4Kb pages. That's 1000 ints. I'm not sure how truly expensive malloc/free is except when you need to get a new page from the OS.
    You may want to check that... Windows allocates and deallocates on 4-byte boundaries on x86, 8 bytes on x64... as a matter of data alignment. If you are using the OS-level memory management, HeapAlloc() allocates by bytes.
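    Minimal example (a sketch) of what I mean by byte-level allocation from the process heap:
    Code:
    #include <windows.h>
    #include <stdio.h>
    
    int main(void)
    {
        /* ask the process heap for exactly sizeof(long) bytes */
        long *p = HeapAlloc(GetProcessHeap(), 0, sizeof *p);
        if (p)
        {
            *p = 42;
            printf("%ld\n", *p);
            HeapFree(GetProcessHeap(), 0, p);
        }
        return 0;
    }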

  9. #9
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Page sizes and alignment boundaries (if there are alignment boundaries) depend on how the system addresses memory. So a 64-bit system (which can work with chunks 64 bits wide) will have different page sizes and alignment boundaries than a 32-bit system.
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  10. #10
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by CommonTater View Post
    You may want to check that... Windows allocates and deallocates on 4-byte boundaries on x86, 8 bytes on x64... as a matter of data alignment. If you are using the OS-level memory management, HeapAlloc() allocates by bytes.
    Yes, most OSes also align memory on a 4-byte boundary, but that is not what I am talking about.

    HeapAlloc() and malloc() allocate memory from the heap. That heap already exists and belongs to the program. The same is true for the stack frame.

    Stack and heap memory are allocated to the program by the OS in 4Kb pages. That memory is then allocated within the program using your allocation functions, which work on a byte (or word) level. This is as true on windows as it is everywhere else:

    Page (computer memory) - Wikipedia, the free encyclopedia

    Quote Originally Posted by wikipedia
    A page, memory page, or virtual page is a fixed-length contiguous block of virtual memory that is the smallest unit of data for the following:

    - memory allocation performed by the operating system for a program; and
    - transfer between main memory and any other auxiliary store, such as a hard disk drive.
    That article actually implies that the 4Kb page size is determined by the architecture and not the OS. To reiterate: there is a BIG difference between allocation by a program to a variable, and allocation by the OS for the program.

    Page size is kind of like block size on disk: even if your file is only 25 bytes, it will still take up an entire block (512 or 1024 or whatever) on disk. Have a look at the memory profile your programs have; they follow the same principle. They do not grow in multiples of 4 bytes every time you call malloc(). They grow by 4Kb when the heap is exhausted.

    So my point was (it's a guess, NB) that malloc only has serious overhead when the program has to get a new page. Hence, removing 1000 four-byte malloc calls might not make much difference vs. one 4Kb call. This does not explain why the OP's pool version is so much slower, but it would mean that trying to improve performance that way is chimeric.
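    On Linux you can see the page size for yourself (a minimal POSIX check):
    Code:
    #include <stdio.h>
    #include <unistd.h>
    
    int main(void)
    {
        /* typically prints 4096 on x86 Linux */
        printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
        return 0;
    }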
    Last edited by MK27; 08-11-2011 at 08:04 AM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  11. #11
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by MK27 View Post
    Stack and heap memory are allocated to the program by the OS in 4Kb pages.
    Oops... my bad... I read you as meaning malloc etc.

    Yes there are page size requirements for heap creation and heap expansion but allocation within the heap is merely a matter of alignment.

    When you malloc(1) you don't get 4k...

    However HeapCreate() will get you some multiple of 4k.

  12. #12
    Registered User
    Join Date
    Aug 2011
    Posts
    8
    MK27: I understand your point, and your mishaps do give a bit of consolation.

    However, I'm still bummed out that the second version runs so much slower than the original.

  13. #13
    Registered User
    Join Date
    Aug 2011
    Posts
    8
    MK27: sorry, I just realized I only replied to half of your post.

    I did try to use gprof, but the results were more confusing than helpful.
    If I understand the output correctly, it seems that in the slow version the whole code is slower, and not just the function I modified.
    Compare:
    Code:
    index % time    self  children    called     name
    ...
                    7.61   62.29    1000/1000        main [1]
    [2]     90.9    7.61   62.29    1000         bfs_r [2]
                    9.93   47.14 49031199/49031199     neighbours_r [3]
                    1.56    1.41 49031199/260124630     queue_add [5]
                    1.62    0.15 49031199/260124630     queue_get [6]
    ...
    for the quick version to
    Code:
    index % time    self  children    called     name
    ...
                   13.38   81.87    1000/1000        main [1]
    [2]     92.7   13.38   81.87    1000         bfs_r [2]
                   13.90   61.32 49031199/49031199     neighbours_r [3]
                    1.88    1.45 49031199/49031199     node_id_queue_add [11]
                    2.07    0.29 49031199/49031199     node_id_queue_get [13]
    ...
    for the slow one.

    bfs_r, which is the modified function, is slower (self seconds column) in the slow version, but so are its children!
    And the time difference does not lie mainly in the queue-handling functions, but in the neighbours_r function, which is exactly the same in both versions.

    I really don't know what to make of this.

  14. #14
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    You have said that the only difference is in bfs_r, but what you did not say was that it then calls a bunch of functions that are present in "malloc-free.c" but not used, which is strange.

    That and your naming conventions make the code somewhat hard to follow (no offense).

    In the flat profile I did for longint-queue, there are actually 5 more functions listed than in the one for malloc-free. They all have "node_id" in their name. What are those functions for and why are they unused in the malloc-free version? This means there are millions more function calls in the longint-queue version. A lot of those include assert(), which I believe is not a "high performance" tactic.

    I.e., the difference between the two is much more than what you have presented so far -- either this slipped your mind or you need to consider more carefully what the differences actually are.
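    If the asserts turn out to matter, one thing to try (just a thought) is compiling with NDEBUG defined, which turns assert() into a no-op:
    Code:
    gcc -DNDEBUG -W -Wall -o longint-queue longint-queue.c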
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  15. #15
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by chlorine View Post
    And the time difference does not lie mainly in the queue-handling functions, but in the neighbours_r function, which is exactly the same in both versions.
    Most of that is from the children of neighbours_r.

    I don't do a lot of profiling, but I think one of the limitations of gprof is the granularity; you get plenty of very small, fast functions that are called oodles of times but report 0.0 for a total, because (probably) gprof computes this by multiplying the number of calls, and one call is too fast to measure. This doesn't mean the cumulative time for 100000 calls was really 0.0. So those new "node_id" functions may be part of the issue. The longint version is calling both the old and the new functions, whereas the malloc one only uses the older subset.

    You can get more of a sense of these limitations here:

    Inaccuracy - GNU gprof
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge
