(Note: this is very similar to some of the suggestions above; I just like to be explicit.)
Originally Posted by
otaconBot
It's relatively small... as in, bits: num_vars*(sizeof(char))
In that case, I'd recommend a good-sized initial allocation, then reallocation when necessary. For example:
Code:
#include <stddef.h>  /* size_t */

struct dataset {
    size_t size;          /* Allocated */
    size_t used;          /* Used */
    unsigned char *data;
};

#define EMPTY { 0, 0, NULL }
Assuming your application runs on a typical desktop workstation or server, and has only one or just a few datasets at a time, I'd start with, say, 4096-byte allocations; between 64k and 8M, grow by 25% per call; and after 8M, grow in fixed 2M steps:
Code:
#include <stdlib.h>   /* realloc() */
#include <errno.h>    /* errno, ENOMEM */

static inline int dataset_need(struct dataset *const set, const size_t chars)
{
    if (set->used >= set->size || set->used + chars > set->size) {
        size_t need = set->used + chars;
        void *data;

        if (need < 65536)
            need = (need | 4095) + 1;       /* Round up to a multiple of 4096 */
        else
        if (need < 8388608)
            need = (5 * need) / 4;          /* Grow by 25% */
        else
            need = (need | 2097151) + 1;    /* Round up to a multiple of 2M */

        data = realloc(set->data, need * sizeof (*(set->data)));
        if (!data)
            return errno = ENOMEM;          /* Cannot grow, but data still intact. */

        set->size = need;
        set->data = data;
    }
    return 0;
}
Note that if you initialize your dataset to EMPTY (size and used zero, data NULL), you don't need any initial allocation; just make sure you call dataset_need(&set, chars) before adding chars new values. It will return nonzero (ENOMEM) if the dataset cannot be resized (out of memory). Even then, the current data is still accessible and valid; the function just failed to make room for new data. The function is static inline, meaning the compiler can optimize it away just as effectively as if it were a preprocessor macro, so calling it often should not be a (big) performance issue.
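For illustration, appending data might look something like the following, together with the struct and function above. The dataset_append() helper and the loop in main() are just my example usage, not part of the suggestion itself:
Code:
#include <stdlib.h>
#include <string.h>
#include <errno.h>

/* Append chars new values, growing the dataset on demand (illustrative only). */
static int dataset_append(struct dataset *const set,
                          const unsigned char *const src, const size_t chars)
{
    if (dataset_need(set, chars))
        return ENOMEM;             /* Could not grow; old contents still intact. */
    memcpy(set->data + set->used, src, chars);
    set->used += chars;
    return 0;
}

int main(void)
{
    struct dataset set = EMPTY;    /* No initial allocation needed. */
    unsigned char value;

    for (value = 0; value < 100; value++)
        if (dataset_append(&set, &value, 1))
            return EXIT_FAILURE;   /* Out of memory. */

    /* ... process set.data[0] .. set.data[set.used - 1] ... */

    free(set.data);
    return EXIT_SUCCESS;
}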
The rationale for my suggestion is the balance between allocated-but-unused memory and the number of realloc() calls.
To grow the data up to 65536 bytes, you do 16 realloc() calls, but you know you always have less than 4096 allocated but unused bytes (wasted). To grow the data from there to 8M, you do an additional 22 (since each call grows the size by 25%, and 1.25^22 ≈ 135 covers the needed factor of 8M/64k = 128), for a total of 38 realloc() calls. At any point, you know you have less than 25% allocated but unused. After that, you do a realloc() call for every new 2M, and are guaranteed to never have more than 2M allocated but unused.
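If you want to verify those counts yourself, a quick throwaway test like this (my own check, not part of the suggestion) just replays the growth schedule one byte at a time and counts the calls; it should report the 38 calls mentioned above:
Code:
#include <stdio.h>
#include <stddef.h>

int main(void)
{
    size_t size = 0, calls = 0;

    while (size < 8388608) {
        size_t need = size + 1;              /* Worst case: grow one byte at a time. */

        if (need < 65536)
            need = (need | 4095) + 1;        /* Round up to a multiple of 4096 */
        else
        if (need < 8388608)
            need = (5 * need) / 4;           /* Grow by 25% */
        else
            need = (need | 2097151) + 1;     /* Round up to a multiple of 2M */

        size = need;
        calls++;
    }

    printf("%zu realloc() calls to reach %zu bytes\n", calls, size);
    return 0;
}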
If this exact strategy does not fit your needs, the reallocation sizes are trivial to adjust. (I do not wish to imply this strategy is likely to be the best; it's just the one I'd start with, then see if it can be made better for typical workloads without penalizing odd cases too much. I wanted to show the example code and example case, so you'd see how few of those realloc() calls you need in practice with a typical strategy.)
That said, if there were a lot of churn -- not just steady growth from storing incoming data, but also data being discarded -- I'd use a different approach, probably very similar to phantomotap's suggestion.
Finally, I'd seriously reconsider your algorithm, to see whether it is possible to process the data concurrently with input, or online. For example, you might be able to discard data that provably does not affect the results, or perhaps use a binary heap or something to keep only the most important data. Algorithmic advances tend to have bigger rewards than any code optimization, in my experience.
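As a rough illustration of that last idea -- keeping only the most important data in a bounded structure -- a fixed-capacity binary min-heap could look something like this. The capacity and the "larger value = more important" criterion are placeholders of mine, not something from this thread:
Code:
#include <stddef.h>

#define HEAP_MAX 1024

static unsigned char heap[HEAP_MAX];  /* heap[0] is the smallest kept value. */
static size_t heap_used = 0;

static void heap_swap(const size_t a, const size_t b)
{
    unsigned char tmp = heap[a];
    heap[a] = heap[b];
    heap[b] = tmp;
}

/* Offer a new value; retains only the HEAP_MAX largest values seen so far. */
static void heap_offer(const unsigned char value)
{
    size_t i;

    if (heap_used < HEAP_MAX) {
        /* Room left: insert at the end and sift up. */
        i = heap_used++;
        heap[i] = value;
        while (i > 0 && heap[i] < heap[(i - 1) / 2]) {
            heap_swap(i, (i - 1) / 2);
            i = (i - 1) / 2;
        }
    } else
    if (value > heap[0]) {
        /* Full: replace the current minimum and sift down. */
        heap[0] = value;
        i = 0;
        for (;;) {
            size_t left = 2 * i + 1, right = 2 * i + 2, small = i;
            if (left < heap_used && heap[left] < heap[small])
                small = left;
            if (right < heap_used && heap[right] < heap[small])
                small = right;
            if (small == i)
                break;
            heap_swap(i, small);
            i = small;
        }
    }
    /* Otherwise the value is smaller than everything kept, so it is dropped. */
}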