std::vector in C?

**MutantJohn** · 09-06-2013

Hello All,

I may not have been that into C++ at first but I like having a try-catch again and a lot of other features are neat.

But by far and above, I am in love with vectors.

So what does a vector look like in C?

Is it an array where you allocate memory right behind it?

Like,

Code:

int *x = malloc(1*sizeof(*x));
x+1 = malloc(1*sizeof(*x));

Will this even compile? I haven't tested it yet but is that basically how'd you emulate push_back()?

**Elkvis** · 09-06-2013

you'd probably have a struct and a set of manipulator functions. templates also don't exist in C, but fortunately you can use void pointers. in C it's generally a lot simpler to just use a linked list. you lose the O(1) complexity of getting an item by position, and replace it with O(n), which can be a significant drawback if you have a large list, but implementing a linked list is trivial in C, compared to a fully functional implementation of a vector.

**MutantJohn** · 09-06-2013

Yeah, but now I'm determined to make that code work because it's an interesting problem.

Okay, here's what I want. We declare one unit set of memory in C, this is the

Code:

T *x = malloc(1*sizeof(*T));

line, where T is any datatype/structure, w/e.

What I want is to write the next unit block of memory after this, if that would even possible. That's what I mean by x+1 = malloc(). But thinking about it now, it would be possible to have memory allocated right after that initial declaration so my idea just can't physically work, can it?

Lame. I was hoping. What if I allocated specifically from the end of the pool? Like, I was writing from the last registers in memory, assuming ram is used in a from the bottom up sort of fashion. Or is it just scattered about loosely and intermittently?

**anduril462** · 09-06-2013

EDIT: To (hopefully) answer your question, in a simplistic manner:
> What if I allocated specifically from the end of the pool? Like, I was writing from the last registers in memory, assuming ram is used in a from the bottom up sort of fashion. Or is it just scattered about loosely and intermittently?
Maybe if you wrote your own memory manager, instead of using the built-in malloc/realloc/free setup. But that's probably not worth it for most things. You don't have control over what pieces of memory, or what addresses, the allocation functions give you. There is no way to specify "give me memory from the end of the pool", and the addresses given to you by subsequent malloc calls may or may not be sequential, thus you can't rely on them for this.

If I understand you correctly, what you want is basically an array that grows when you call push_back. So you would need a struct with, at the very least, a size and a pointer to the data. To make this generic, you need to keep an array of void * internally, to refer to the elements:

Code:

struct vector {
    size_t size;
    void **data;  // holds array of void *
};

When you create a new vector, you initialize size to 0 and data to NULL. Then push_back will call realloc with room for 1 more element in data each time. That is the trivial solution, but that results in a lot of calls to realloc, which can be slow and possibly cause lots of memory fragmentation.

Most serious implementations will initialize the array to something modest, like 64 elements, and grow by some factor when that is exceeded, perhaps 2x the first time or two, then 1.5x once or twice, and 1.2x or so every time after that. It wastes a little memory, but really helps keep the reallocs down. That means, however, that you need to track alloc'ed size and num elements used:

Code:

struct vector {
    size_t alloc_size;
    size_t last_elem;
    void **data;
};

Then, your new_vector call will allocate room for, e.g. 64 elements, and set alloc_size to 64, and last_elem to 0 or -1 (depending on how you want to track it). Then, when you call push_back, if last_elem is less than alloc_size, you simply store the data in the next available slot and increment last_elem. If, however, your currently allocated memory is full, you need to realloc data with an appropriate new size and (if it succeeds), increment alloc_size accordingly. Then you store the new element in the next available spot and increment last_elem.

**MutantJohn** · 09-06-2013

Dang, I was just hoping using addresses manually would lead to a faster realloc because you would no longer need to copy data to append to it. Then again, it'd be a lot of work but I think it'd be neat if you specifically target memory pools and arrange them any way you'd like.

**rcgldr** · 09-06-2013

Originally Posted by anduril462

Code:

struct vector {
    size_t alloc_size;
    size_t last_elem;
    void **data;  // array of void *
};

It would seem that void **data would require the code to allocate space for the array as well as the data. A vector implementation is closer to this (void * data versus void **data):

Code:

struct vector {
    size_t alloc_size;
    size_t last_elem;
    void *data;  // pointer to first element
};

You may also consider using end_elem instead of last_elem, where end_elem = 1 + last_elem. Note that realloc() doesn't copy data if the realloc can be done with the current internally allocated block of memory.

**anduril462** · 09-06-2013

Originally Posted by MutantJohn

Dang, I was just hoping using addresses manually would lead to a faster realloc because you would no longer need to copy data to append to it. Then again, it'd be a lot of work but I think it'd be neat if you specifically target memory pools and arrange them any way you'd like.

So you're suggesting something like:

Code:

Foo *f = malloc(0x1800, sizeof(*f));  // give me enough memory for a Foo object at address 0x1800 (or at a "pool" of memory starting at 0x1800)

Which seems okay at a very (and I stress very) superficial level. But, if you're going to specify the starting address of the pool you wish to allocate from, then you must know the starting address of all your pools. That means, if your vector pool starts at 0x1000 and the next pool is at 0x1100, you have a fixed size for your vector of 256 bytes. At that point, you could just pre-allocate everything to some "maximum" size, but that defeats the whole purpose of dynamic memory. You might as well just use an array. It's a fixed maximum size, and avoids all the complications, overhead and potential memory leaks of a dynamic memory management system.

A couple other concerns:

How do you know what addresses are "safe" or available for dynamic memory? You would have to know where your heap starts, how big it is, etc at compile time to make any sensible decisions. That's very system dependent. So much for portability.
Even if you can work out heap starting address and size in a portable manner, how do you design your app to work make smart decisions about how big to make pools for vectors, Foo objects, etc when different systems out there have such a huge difference in how much memory they have available?

**rcgldr** · 09-06-2013

Originally Posted by MutantJohn

Dang, I was just hoping using addresses manually would lead to a faster realloc because you would no longer need to copy data to append to it. Then again, it'd be a lot of work but I think it'd be neat if you specifically target memory pools and arrange them any way you'd like.

As mentioned previously, realloc() won't copy data if the realloc can be done in place using the left over space in the current block of allocated memory.

The issue with memory pools is that a program doesn't normally know anything about physical memory. For X86 (Intel / AMD) systems, malloc(), realloc(), and similar functions interact with the operating system to allocate 4 k pages from physical memory and map those pages into a programs virtual address space. Since the page size is 4k (4096 bytes), and to prevent an excessive number of virtual to physical mapping table entries, you might want to consider allocating in chunks of 64k or more, although it's possible that malloc() and realloc() are already allocating in larger than 4 k chunks.

"no longer need to copy data to append to it" - the concept of a vector is that it is a continuous memory array (continuous within the virtual address space, the physical 4 k pages of memory could be scattered). No special adjustments of iterators or pointers needs to be done in order to access the elements of a vector.

**rcgldr** · 09-06-2013

Originally Posted by rcgldr

It's possible that malloc() and realloc() are already allocating in larger than 4 k chunks.

Using Visual Studio 2005, with an initial allocation of 16 bytes, the released (non-debug mode) version of realloc() returns a new pointer with a resize to about 48 k bytes. I didn't check to see if this pattern continues.

**smokeyangel** · 09-06-2013

Fun ideas!

Originally Posted by MutantJohn

Lame. I was hoping. What if I allocated specifically from the end of the pool? Like, I was writing from the last registers in memory, assuming ram is used in a from the bottom up sort of fashion. Or is it just scattered about loosely and intermittently?

You can't do anything like that with malloc. The whole point of malloc is to give you "some memory" on request without you having to know where the memory is, whether it's available for use or used by another data structure, how to mark it as unused when it's freed....

But suppose you could. Suppose you allocated your vector at the beginning of the heap and other data at the end, so the vector could grow without running into other data items. What happens when you want to have another vector - where does it go? You'd have to keep track of where all your vectors were in order to know where to allocate new ones.

Like I said, you can't do this with malloc. The documentation for malloc on Linux has links to the low level OS calls you could use if you were very determined: malloc(3): allocate/free dynamic memory - Linux man page
You could use sbrk() to set/get the heap limit and merrily write what you want where you want. You couldn't use malloc in the same code though, as it wouldn't know where your data was and would happily overwrite it.

C++ doesn't avoid doing reallocs - from here: vector - C++ Reference

Originally Posted by vector doc

Libraries can implement different strategies for growth to balance between memory usage and reallocations, but in any case, reallocations should only happen at logarithmically growing intervals of size so that the insertion of individual elements at the end of the vector can be provided with amortized constant time complexity (see push_back).

As vectors get larger, copies get more expensive. But reallocs become less frequent, so the cost is mitigated. Small vectors are cheap to copy, so we can afford to call realloc at smaller intervals. We can do the same in C...

Originally Posted by rcgldr

It would seem that void **data would require the code to allocate space for the array as well as the data. A vector implementation is closer to this (void * data versus void **data):

Code:

struct vector {
    size_t alloc_size;
    size_t last_elem;
    void *data;  // pointer to first element
};

Using this sort of thing, with realloc and a lot of nasty pointer arithmetic, you can make yourself a sort of vector-like thing in C.

I'd use void* rather than void** as above, store the element size, and I'd probably store 'maximum elements' rather than alloc size:

Code:

struct vector {
    size_t alloc_size;
    size_t elements;
    size_t max_elems;
    void *data;  // pointer to first element
};

C doesn't have templates or overloaded functions, so everything will have to be in terms of void* to make this usable generically. Here are a handful of example functions:

Code:

Vect * InitVect(size_t elsize)
{
   Vect *v = malloc(sizeof(Vect));
   v->elements = 0;
   v->max_elems = INITIAL_SZ/elsize;
   v->elsize = elsize;
   v->data = malloc(v->max_elems * v->elsize);   
   return v;
}

void push_back(Vect *v, void* el)
{
   if (v->elements > v->max_elems-1)
   {
      void *temp;
      // exponential growth --> realloc at logarithmic intervals
      v->max_elems *=2;
      temp = realloc(v->data, v->max_elems * v->elsize);
      if (!temp)
      {
          printf("\nFatal error: realloc failed");
          Destroy(v);
          exit(1);
      }
      v->data = temp;
   }
   memcpy(v->data+(v->elements * v->elsize), el, v->elsize);
   v->elements++;   
}

void *get(Vect *v, size_t i)
{
    return v->data + (i * v->elsize);    
}

void set(Vect *v, size_t i, void* el)
{
   memcpy(v->data + (v->elsize * i), el, v->elsize);   
}

Since everything has to be of type void*, you'd have to deal with pointers-to-data when using it. I couldn't think of a way to get around that.

Code:

int main(void)
{
   Vect *v = InitVect(sizeof(int));
   int in;
   int *out;

   push_back(v, 10); // NO.
   in = 10;
   push_back(v, &in); // yes   
}

These snippets are totally unsafe -- if you try to set() something way outside of the vector, the code will happily write to memory outside of the vector, much like array accesses aren't bounds-checked. That could be easily fixed. Also, because everything is a void* you have absolutely no type checking. You could pass a double or a char and the code would still treat it like it was int-sized.

Other vector-like functions can be implemented in much the same way and will have (as far as I can see) the same complexity (efficiency) as C++ vectors. Iterators would just be void* pointers to particular elements. Things like reserve() and capacity() are trivial.

Out of interest, I decided to have a look at how expensive the copy really is. These experiments were with gcc 4.7.2 with -O2 on a 64-bit Ubuntu box. I added some debugging lines to the code to print out when realloc needed to do a copy. This will provoke a copy every realloc:

Code:

   Vect *v = InitVect(sizeof(int));
   Vect *v2 = InitVect(sizeof(int));   
   int i;   
   for (i = 0; i < NUM; i++)
   {   
       push_back(v, &i);              
       push_back(v2, &i);                           
   }

Separating this into two separate loops gets rid of the copies for smaller reallocs, but for large reallocs (4MB and higher) a copy always takes place, and the memory is allocated from somewhere else.
I implemented a reserve, and did a reserve(NUM). That got rid of all the copies except one, but the program didn't run any faster (according to the output of time).
I ran the test with valgrind --tool=callgrind, and had a look in kcachegrind. For large values of NUM (20000000), everything else is simply dwarfed by the time spent memcpying the elements into the vector. Realloc time is negligible.

For a small value of NUM (2000), realloc accounted for 1.8% of execution time when no copying took place, and 2.6% with the pathological interleaved push_back calls. To put that in context: the for loop in the test was responsible for 2.7%. Again, memcpy took the most time.

So given all that, I wouldn't worry about the time taken if realloc needs to do a copy.

Although this has been an interesting topic to investigate, I must say that if you're in love with std::vector, just use std::vector!

Thread: std::vector in C?

Thread Tools

Search Thread

Display

std::vector in C?

Similar Threads

vector<vector<const char*>> behaving strangely

applying the std::find function to a vector of structures vector<myStruct) vec

How do we apply std::max_element to a vector of structs (vector<myStruct>)?

Converting string vector to integer vector

Simple Question memset(vector<int>, 0, sizeof(vector<int>))