Standard practice in C suggests that if you have to allocate a large number of fixed sized blocks, it is best to write a custom allocator, since malloc and free have 'too much' overhead associated with them. Well I went ahead and did this. I created an allocator that allocates a larger fixed size block, cuts it up into smaller pieces and then loads them up into a free list, which is nothing more than an array of pointers.

Allocation simply returns the next available pointer
Code:
 
/** fixed block size allocator.  */
typedef struct _blkmem {
    /** size of block. */
    int bsize ;
    /** length of free pointer list in memory in bytes. */
    int freelen ;
    /** number of pointers that are free. */
    int nfree ;
    /** array of free pointers */
    void **freeptrs ;
   /** length of malloc block list in memory in bytes. */
    int mlen ;
    /** array of pointers to malloced blocks. */ 
    void **allocs ;
    /**  number of blocks alloced */
    int nalloc ;
} Fmem ; 



void *allocF(Fmem *fm)
{
if (fm->nfree == 0) {
    moreFmem(fm) ;
}
fm->nfree = fm->nfree - 1 ;
return((fm->freeptrs)[fm->nfree]) ;
}
and my custom free simply puts the freed block back into the array and increments
the array index.

Code:
/** push a pointer to a list growing the list as needed. */
inline void pushptr(void *nptr, void ***ptr, unsigned int *plen, unsigned int *mlen)
{
void **parr;
unsigned int ns ;
*plen = *plen+1;
if (*plen * sizeof(void *) > *mlen) {
    /* grow the base array */
    *ptr = growmem(*ptr, *mlen, sizeof(void *), mlen) ;
}	
parr = *ptr ;
parr[*plen-1] = nptr ;

}


/** free a fixed block and return it to the free list.*/
void *freeF(Fmem *fm, void *ptr)
{
pushptr(ptr, &(fm->freeptrs), &(fm->nfree), &(fm->freelen)) ;
}
Now the shocking thing is that in my benchmarks using
gcc version 4.1.1 20070105 (Red Hat 4.1.1-51), my custom allocator is only
about 5-10% faster than malloc and free. The result holds even if I deliberately add a bunch of random mallocs and frees in between each malloc and free associated with the test case. Moreover for this test I deliberately create a situation where the edge conditions that require allocating more large blocks or growing my array of pointers happens only once. Profiling the code does not suggest any egregious performance bottlenecks. In theory I could get some incremental performance improvement by using pointers instead of array indices and perhaps using a shift instead of a multiply.

Ive noticed that in some of the example code online for custom allocators people use linked lists instead of an array of free pointers, but I would think that would be potentially slower since it requires an additional memory access instead of just pointer or index arithmetic. Is it possible that there are some fantastic optimizations in the newer versions of malloc and free or does my code have some performance issues that I have overlooked? For flexibility I do need to use some indirection for my arrays, but I am frankly quite surprised.