Thread: Using less malloc and free makes code _slower_

  1. #1
    Registered User
    Join Date
    Aug 2011
    Posts
    8

    Using less malloc and free makes code _slower_

    I have written a program that uses a queue structure. It is implemented as a struct containing a void * array, and indexes pointing to the first and last element in the queue.
    I use this structure to store pointers, but I also use it to store long ints.

    So in the first version of the code, when I needed to store a long int I would malloc space for it and store the pointer in the queue:
    Code:
    pdata=malloc(sizeof(node_id));
    *pdata = v;
    queue_add(qbfs,(void *)pdata);
    (node_id is a typedef for long int and queue_add is the function I wrote for inserting an element in the queue).

    For removing an element I would read from the pointer and then free it:
    Code:
    pdata = (node_id*)queue_get(qbfs);
    u = *pdata;
    free(pdata);
    Now I wanted my code to run faster, and I thought all these mallocs and frees were not good. So I created a variant of my queue structure that contains a long int array, to be able to store my long ints directly. This made the code run _much_ slower. I tried every variation I could think of, but everything just seemed to slow my code down. In the end I wrote a version that uses the same queue structure but casts the long ints to pointers, replacing the above snippets with lines such as:
    Code:
    queue_add(qbfs,(void *)v);
    
    u = (node_id)queue_get(qbfs);
    (I have checked that long ints and pointers are the same size on the machine I'm using, and that this does not change the results of the program).
    The resulting code is much slower. To compare the running times of the two:
    Code:
    real    0m21.208s
    user    0m3.688s
    sys     0m17.517s
    versus

    Code:
    real    0m3.900s
    user    0m2.740s
    sys     0m1.148s
    I use some random() calls in my program, but I call srandom(0) at the beginning of both versions and have checked that the results are identical.
    I've run diff between the two versions a thousand times to check that the differences are only in the queue management.

    I've compiled the two versions of the code in the same way (no option except -W and -Wall), and am running them on the same machine. I have done the compilations and timing several times to check that I was not getting mixed up between which version is which.

    I've isolated these snippets of code to run them on their own in a program that does two million inserts and two million removes from the queue, and in that case the cast version is faster.
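    (For the record, the standalone test is just a timing loop along these lines; this is a simplified sketch of the malloc/free half only, the real test goes through queue_add and queue_get:)
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    
    typedef long int node_id;
    
    int main(void)
    {
        clock_t start = clock();
        for (long i = 0; i < 2000000; i++) {
            /* same pattern as in bfs_r: allocate, store, (use), free */
            node_id *pdata = malloc(sizeof *pdata);
            *pdata = i;
            free(pdata);
        }
        printf("%.3f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);
        return 0;
    }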

    I must be missing something really obvious but I am out of my depth.
    Any idea how the same code, with fewer calls to malloc and free, can run slower? And why is it spending so much more time in the sys state?

    Thank you for reading.

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    I think that you should post your code.
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User Codeplug's Avatar
    Join Date
    Mar 2003
    Posts
    4,981
    What is the address value of "v"? Is it a multiple of 4?

    It could be a difference of one unaligned access per operation vs. many, which would also explain the extra sys time.
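    A quick way to check is something like this (just a sketch; test against whatever alignment the platform expects):
    Code:
    #include <stdint.h>
    #include <stdio.h>
    
    /* Warn if p is not aligned to sizeof(long). */
    void check_align(void *p)
    {
        if ((uintptr_t)p % sizeof(long) != 0)
            printf("%p is misaligned\n", p);
    }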

    gg

  4. #4
    Registered User
    Join Date
    Aug 2011
    Posts
    8
    laserlight: I was reluctant to post my code because it's a bit large and I was unable to isolate the problem in a small bit of code. Anyway, in case it helps, I've put it up here:
    https://bitbucket.org/chlorine/queue/src

    There are three versions:
    malloc-free.c is the original one, in which I do a malloc before inserting a pointer in the queue, and a free after removing it.
    cast.c is the one that I described, in which I cast a long int to void * before inserting it in the queue (will only work on architectures in which long int and void * are the same size)
    longint-queue.c is a third version in which I created a queue variant for storing long ints. It has the same bad performance as cast.c

    All the differences are in the function bfs_r, which performs a breadth first search in a graph (hence the need for a queue).

    It needs parameters to run, the ones I used are:
    100000 200000 1 3000 5 10 0


    Codeplug: thanks for the input, I would never have thought to check that. However, as far as I can tell all pointers are properly aligned.

  5. #5
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Quote Originally Posted by chlorine View Post
    malloc-free.c is the original one, in which I do a malloc before inserting a pointer in the queue, and a free after removing it.
    cast.c is the one that I described, in which I cast a long int to void * before inserting it in the queue (will only work on architectures in which long int and void * are the same size)
    Casting a long int to void * successfully requires more than the two types being the same size. The bit pattern representing a pointer with a specified value is not necessarily the same bit pattern as an integral type with the same value (i.e., a long int with value 42 is not necessarily represented with the same set of bits as a pointer with value 42). If the representations are different, and the value being converted is not known until runtime, the conversion itself involves a runtime overhead (typically swapping a few bits around).

    In any event, storing an int value in a void * within a linked list strikes me as rather pointless. A potential performance hit (what you're seeing will not be universal) is just one more reason not to do that.
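    If you do insist on stuffing integers into a void *, the conversion is at least better expressed through intptr_t from <stdint.h>, which exists for exactly this kind of round trip (a sketch, not a recommendation):
    Code:
    #include <stdint.h>
    
    typedef long int node_id;
    
    /* round-trip an integer through a pointer via intptr_t */
    static void *node_id_to_ptr(node_id v) { return (void *)(intptr_t)v; }
    static node_id ptr_to_node_id(void *p) { return (node_id)(intptr_t)p; }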
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  6. #6
    Registered User
    Join Date
    Aug 2011
    Posts
    8
    I didn't know that casting long ints to pointers could create an overhead. Indeed, the cast version isn't one I would want to use; I just created it for posting because it was closest to the original code.
    The version I'm wondering about is longint-queue.c. What I do is store values in a queue (not a linked list):
    Code:
    typedef struct queue{
      node_id size;
      void **elts;
      node_id begin;
      node_id end;
    }queue;
    In the original version I would malloc a pointer to a long int and insert it in the queue:
    Code:
      pdata=malloc(sizeof(node_id));
      *pdata = v;
      queue_add(qbfs,(void *)pdata);
    In longint-queue.c I created a variant of the queue structure that stores long ints rather than void *, and I replaced the above snippets of code with

    Code:
    node_id_queue_add(qbfs,v);
    This version is as slow as the cast one. I thought this was for similar reasons but maybe not. Sorry if it was misleading.
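    For reference, the variant is essentially the same structure with the void ** replaced by a node_id array, roughly like this (a simplified sketch, not the exact code from longint-queue.c):
    Code:
    typedef struct node_id_queue{
      node_id size;
      node_id *elts;   /* long ints stored directly instead of void * */
      node_id begin;
      node_id end;
    }node_id_queue;
    
    /* simplified add: wrap-around and full-queue handling omitted */
    void node_id_queue_add(node_id_queue *q, node_id v)
    {
      q->elts[q->end] = v;
      q->end = (q->end + 1) % q->size;
    }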

  7. #7
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    If it's any consolation, I seem to remember writing a memory pool for a hash table implementation, thinking it would be faster than malloc'ing each node, and getting a similar surprise, which led me to give up on the pool. Keep in mind that the OS (all of them, I believe) allocates memory in 4Kb pages. That's about 1000 ints. I'm not sure how truly expensive malloc/free is except when you need to get a new page from the OS.

    Have you tried profiling this beyond the overall runtime, e.g., with gprof (Linux)?

    Profiling with GProf

    The guidelines there look somewhat different than the ones I use, but I am sure you are capable of using google and reading man pages. Here's the basic notes I follow ("factorial.c" being a source and "factor" being the binary):

    Code:
    compile:
    gcc -pg -fprofile-arcs -ftest-coverage -o factor factorial.c
    
    run ./factor to create gmon.out
    
    gprof -b ./factor
    
    gcov factorial.c
    cat factorial.c.gcov
    This might not pinpoint the problem beyond what you've said, if the difference really is all about one particular function, but it will at least confirm (or disprove) that theory.
    Last edited by MK27; 08-11-2011 at 06:46 AM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  8. #8
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by MK27 View Post
    Keep in mind that the OS (all of them, I believe) allocates memory in 4Kb pages. That's 1000 ints. I'm not sure how truly expensive malloc/free is except when you need to get a new page from the OS.
    You may want to check that... Windows allocates and deallocates on 4-byte boundaries on x86, 8 bytes on x64... as a matter of data alignment. If you are using the OS-level memory management, HeapAlloc() allocates by bytes.
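    Minimal example (a sketch) of what I mean by byte-level allocation from the process heap:
    Code:
    #include <windows.h>
    #include <stdio.h>
    
    int main(void)
    {
        /* ask the process heap for exactly sizeof(long) bytes */
        long *p = HeapAlloc(GetProcessHeap(), 0, sizeof *p);
        if (p)
        {
            *p = 42;
            printf("%ld\n", *p);
            HeapFree(GetProcessHeap(), 0, p);
        }
        return 0;
    }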

  9. #9
    Registered User
    Join Date
    Jun 2005
    Posts
    6,815
    Page sizes and alignment boundaries (if there are alignment boundaries) depend on how the system addresses memory. So a 64-bit system (which can work with chunks 64 bits wide) will have different page sizes and alignment boundaries than a 32-bit system.
    Right 98% of the time, and don't care about the other 3%.

    If I seem grumpy or unhelpful in reply to you, or tell you you need to demonstrate more effort before you can expect help, it is likely you deserve it. Suck it up, Buttercup, and read this, this, and this before posting again.

  10. #10
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by CommonTater View Post
    You may want to check that... Windows allocates and deallocates on 4-byte boundaries on x86, 8 bytes on x64... as a matter of data alignment. If you are using the OS-level memory management, HeapAlloc() allocates by bytes.
    Yes, most OSes also align memory on a 4-byte boundary, but that is not what I am talking about.

    HeapAlloc() and malloc() allocate memory from the heap. That heap already exists and belongs to the program. The same is true for the stack frame.

    Stack and heap memory are allocated to the program by the OS in 4Kb pages. That memory is then allocated within the program using your allocation functions, which work on a byte (or word) level. This is as true on windows as it is everywhere else:

    Page (computer memory) - Wikipedia, the free encyclopedia

    Quote Originally Posted by wikipedia
    A page, memory page, or virtual page is a fixed-length contiguous block of virtual memory that is the smallest unit of data for the following:

    - memory allocation performed by the operating system for a program; and
    - transfer between main memory and any other auxiliary store, such as a hard disk drive.
    That article actually implies that the 4Kb page size is determined by the architecture and not the OS. To reiterate: there is a BIG difference between allocation by a program to a variable, and allocation by the OS for the program.

    Page size is kind of like block size on disk: even if your file is only 25 bytes, it will still take up an entire block (512 or 1024 or whatever) on disk. Have a look at the memory profile your programs have; they follow the same principle. They do not grow in multiples of 4 bytes every time you call malloc(). They grow by 4Kb when the heap is exhausted.

    So my point was (it's a guess, NB) that malloc only has serious overhead when the program has to get a new page. Hence, removing 1000 four-byte malloc calls might not make much difference vs. one 4Kb call. This does not explain why the OP's pool version is so much slower, but it would mean that trying to improve performance that way is chimeric.
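    On Linux you can see the page size for yourself (a minimal POSIX check):
    Code:
    #include <stdio.h>
    #include <unistd.h>
    
    int main(void)
    {
        /* typically prints 4096 on x86 Linux */
        printf("page size: %ld bytes\n", sysconf(_SC_PAGESIZE));
        return 0;
    }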
    Last edited by MK27; 08-11-2011 at 08:04 AM.
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  11. #11
    Banned
    Join Date
    Aug 2010
    Location
    Ontario Canada
    Posts
    9,547
    Quote Originally Posted by MK27 View Post
    Stack and heap memory are allocated to the program by the OS in 4Kb pages.
    Oops... my bad... I read you as meaning malloc etc.

    Yes there are page size requirements for heap creation and heap expansion but allocation within the heap is merely a matter of alignment.

    When you malloc(1) you don't get 4k...

    However HeapCreate() will get you some multiple of 4k.

  12. #12
    Registered User
    Join Date
    Aug 2011
    Posts
    8
    MK27: I understand your point, and your mishaps do give a bit of consolation.

    However, I'm still bummed out that the second version runs so much slower than the original.

  13. #13
    Registered User
    Join Date
    Aug 2011
    Posts
    8
    MK27: sorry, I just realized I only replied to half of your post.

    I did try to use gprof, but the results were more confusing than helpful.
    If I understand the output correctly, it seems that in the slow version the whole code is slower, and not just the function I modified.
    Compare:
    Code:
    index % time    self  children    called     name
    ...
                    7.61   62.29    1000/1000        main [1]
    [2]     90.9    7.61   62.29    1000         bfs_r [2]
                    9.93   47.14 49031199/49031199     neighbours_r [3]
                    1.56    1.41 49031199/260124630     queue_add [5]
                    1.62    0.15 49031199/260124630     queue_get [6]
    ...
    for the quick version to
    Code:
    index % time    self  children    called     name
    ...
                   13.38   81.87    1000/1000        main [1]
    [2]     92.7   13.38   81.87    1000         bfs_r [2]
                   13.90   61.32 49031199/49031199     neighbours_r [3]
                    1.88    1.45 49031199/49031199     node_id_queue_add [11]
                    2.07    0.29 49031199/49031199     node_id_queue_get [13]
    ...
    for the slow one.

    bfs_r, which is the modified function, is slower (self seconds column) in the slow version, but so are its children!
    And the time difference does not lie mainly in the queue-handling functions, but in the neighbours_r function, which is exactly the same in both versions.

    I really don't know what to make of this.

  14. #14
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    You have said that the only difference is in bfs_r, but what you did not say was that it then calls a bunch of functions that are present in "malloc-free.c" but not used, which is strange.

    That and your naming conventions make the code somewhat hard to follow (no offense).

    In the flat profile I did for longint-queue, there are actually 5 more functions listed than in the one for malloc-free. They all have "node_id" in their name. What are those functions for and why are they unused in the malloc-free version? This means there are millions more function calls in the longint-queue version. A lot of those include assert(), which I believe is not a "high performance" tactic.

    I.e., the difference between the two is much more than what you have presented so far -- either this slipped your mind or you need to consider more carefully what the differences actually are.
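    If the asserts turn out to matter, one thing to try (just a thought) is compiling with NDEBUG defined, which turns assert() into a no-op:
    Code:
    gcc -DNDEBUG -W -Wall -o longint-queue longint-queue.c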
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge

  15. #15
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    Quote Originally Posted by chlorine View Post
    And the time difference does not lie mainly in the queue-handling functions, but in the neighbours_r function, which is exactly the same in both versions.
    Most of that is from the children of neighbours_r.

    I don't do a lot of profiling, but I think one of the limitations of gprof is the granularity; you get plenty of very small, fast functions that are called oodles of times but report 0.0 for a total, because (probably) gprof computes this by multiplying the number of calls, and one call is too fast to measure. This doesn't mean the cumulative time for 100000 calls was really 0.0. So those new "node_id" functions may be part of the issue. The longint version is calling both the old and the new functions, whereas the malloc one only uses the older subset.

    You can get more of a sense of these limitations here:

    Inaccuracy - GNU gprof
    C programming resources:
    GNU C Function and Macro Index -- glibc reference manual
    The C Book -- nice online learner guide
    Current ISO draft standard
    CCAN -- new CPAN like open source library repository
    3 (different) GNU debugger tutorials: #1 -- #2 -- #3
    cpwiki -- our wiki on sourceforge
