Thread: minimizing bitfield size in memory

  1. #1
    Registered User
    Join Date
    Jul 2011
    Posts
    9

    minimizing bitfield size in memory

    I'm trying to update some old code of mine. Previously I used a bitfield built from a char array, since I only had to track a short int's worth of integers. These bits track a user's progress through the system, and there is no constraint on the order in which they proceed through the bitfield.
    Code:
    char bitfield[4096];
    That worked nicely. Now, however, circumstances have changed and I must track up to an int's worth of integers, or rather, whether each user has encountered a particular page in the db. It's easy enough to just increase the size of the bitfield. For the save file this makes almost no difference, as I use RLE compression to store the bits. However, I noticed that with the same number of users my program now takes up much more memory, which is easily attributed to the increased size of the bitfield.
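    For concreteness, here is a minimal sketch of the kind of packed char-array bitfield described above (PAGE_MAX and the macro names are illustrative, not the original code):
    ```c
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>

    /* Illustrative sketch: one bit per page id, packed into a char array.
       PAGE_MAX and the macro names are assumptions, not the original code. */
    #define PAGE_MAX 65536
    static unsigned char bitfield[PAGE_MAX / CHAR_BIT];

    #define BIT_SET(f, n)  ((f)[(n) / CHAR_BIT] |= 1u << ((n) % CHAR_BIT))
    #define BIT_TEST(f, n) (((f)[(n) / CHAR_BIT] >> ((n) % CHAR_BIT)) & 1u)

    int main(void)
    {
        memset(bitfield, 0, sizeof bitfield);
        BIT_SET(bitfield, 12345);              /* user visited page 12345 */
        printf("%u %u\n", BIT_TEST(bitfield, 12345), BIT_TEST(bitfield, 12346));
        return 0;
    }
    ```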

    I tinkered with using RLE as linked structures, but that actually increased the memory usage, and I rather expected it would, given the generally short runs of users through linked documents, unless they were visiting many contiguous documents.
    Code:
    struct rlebitfield {
        int count;
        bool bit;                  /* needs <stdbool.h> */
        struct rlebitfield *next;
    };
    This ended up being larger than the initial storage space required.

    Is there a better way to store large bitfields in memory?

  2. #2
    Registered User
    Join Date
    Jul 2011
    Posts
    9
    I should add that there is an initial block of numbers that does not need to be tracked per user; I don't know if that makes any difference. Currently we are just over the 32767 original tracking numbers, but we are going to be adding quite a few more very shortly. The only guaranteed stopping point is the current top number, which goes up almost every day and rarely, if ever, goes down.

  3. #3
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    I'm unsure about this bitfield application concept, so it'd be nice to get more information.
    Are you free()'ing the malloc()'d struct rlebitfield objects when users have exited?

  4. #4
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Do you have more users, or more tracking numbers? If you have more tracking numbers than users, have each tracking number store which users know about it, instead of the other way around.


    Quzah.
    Hope is the first step on the road to disappointment.

  5. #5
    Registered User
    Join Date
    Jul 2011
    Posts
    9
    Quote Originally Posted by itCbitC View Post
    I'm unsure about this bitfield application concept, so it'd be nice to get more information.
    Are you free()'ing the malloc()'d struct rlebitfield objects when users have exited?
    Yes, and also when two of them collide and merge. The problem comes up when someone visits, say, pages 1, 3, 5, 7, 9. Then a unique rlebitfield node is created for each, in this fashion:

    Code:
    /* conceptually, one node per run (shown as designated initializers): */
    { .count = 1, .bit = 0 }   /* index 0, the zero-index start of tracking */
    { .count = 1, .bit = 1 }   /* index 1 */
    { .count = 1, .bit = 0 }   /* index 2 */
    { .count = 1, .bit = 1 }   /* index 3 */
    And so on. Storing things this way only becomes efficient when the number of contiguous set or unset bits exceeds the size, in bits, of the node structure itself.
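    That break-even point can be checked directly: a node only pays for itself when the run it encodes is longer, in bits, than the node. A small sketch (the exact numbers depend on platform struct layout and padding):
    ```c
    #include <limits.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch of the break-even arithmetic: each RLE node costs a fixed number
       of bits, so runs shorter than that inflate memory use rather than save it.
       The struct mirrors the one in the thread; sizes are platform-dependent. */
    struct rlebitfield {
        int count;
        bool bit;
        struct rlebitfield *next;
    };

    int main(void)
    {
        size_t node_bits = sizeof(struct rlebitfield) * CHAR_BIT;
        printf("one node costs %zu bits\n", node_bits);
        printf("so a run must exceed %zu bits to be worth encoding\n", node_bits);
        return 0;
    }
    ```
    On a typical 64-bit platform the struct is 16 bytes, so any run under 128 bits costs more as an RLE node than as raw bits.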

  6. #6
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    I could be wrong but it seems like an application of the one-to-many type.
    Figure out if there are multiple users per page or vice versa and code around that.

  7. #7
    Registered User
    Join Date
    Jul 2011
    Posts
    9
    Quote Originally Posted by quzah View Post
    Do you have more users, or more tracking numbers? If you have more tracking numbers than users, have each tracking number store which users know about it, instead of the other way around.


    Quzah.
    I've thought about that approach, and while I'm not dismissing it, there is the issue that the files associated with the tracking numbers are never changed. To further complicate matters, the users do not have an easy-to-track number of their own and are most easily referred to by each user's in-memory structure. I could create a separate file and memory structure keyed on the user's user name, but I can see that once someone has been through several thousand documents this would create bloat both on disk and in memory. In the space taken by "thisismyuser" I could have tracked 96 bits on disk, and using a pointer to the user in memory would cost 32 bits per pointer.

  8. #8
    Registered User
    Join Date
    Jul 2011
    Posts
    9
    Quote Originally Posted by itCbitC View Post
    I could be wrong but it seems like an application of the one-to-many type.
    Figure out if there are multiple users per page or vice versa and code around that.
    There are multiple users per page, and many pages per user. Multiple users can look at the same page at different times, or look at completely different sets of pages. I do group the pages into sections, 97 distinct sections at the moment, each hosting a range of indexes, e.g. 100-400. Using RLE compression has kept the storage size on disk way down, but in practice it doesn't seem to help at all in memory. And with a fully expanded bitfield, even though it is limited by the current top index number, the size of the structure starts to become a memory concern. Of course, if a user visited all 35000 current pages it would be simple to just set a single bit that says "I've been through it all, stop tracking me." RLE in memory would also work out okay if a user went through every page in order, but there is no constraint forcing that, and imposing one would in effect ruin the system.

  9. #9
    Registered User
    Join Date
    Jul 2011
    Posts
    9
    I think ideally what I would like to do is compress the bitfield in memory while retaining the ability to flip individual bits without decompressing any large segment of the field. Any ideas in that direction?

  10. #10
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Code:
    struct visitedpages
    {
        size_t lowestpage;
        size_t highestpage;
        unsigned char bitfield[];  /* C99 flexible array member */
    };
    ...
    struct visitedpages *mypages;
    size_t low, high;
    low = loadlownumber(); /* wherever you stored this, go get it */
    high = loadhighnumber(); /* "" */
    /* CHAR_BIT is in <limits.h>; the +1 rounds the byte count up */
    mypages = malloc( (high - low) / CHAR_BIT + 1 + sizeof *mypages );
    mypages->lowestpage = low;
    mypages->highestpage = high;
    loadfield( mypages );
    Store the highest page visited and the lowest page visited, and use a bitfield of highest - lowest bits for all the pages known.
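    A compilable sketch of this windowed-bitfield idea, with illustrative set/get helpers and a main() driver added (the byte-count rounding and the helper names are assumptions, not part of the snippet above):
    ```c
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Windowed bitfield: only bits in [lowestpage, highestpage] are stored,
       so pages below the window cost nothing. Helper names are illustrative. */
    struct visitedpages {
        size_t lowestpage;
        size_t highestpage;
        unsigned char bitfield[];   /* C99 flexible array member */
    };

    static void setpage(struct visitedpages *v, size_t page)
    {
        size_t off = page - v->lowestpage;      /* offset into the window */
        v->bitfield[off / CHAR_BIT] |= 1u << (off % CHAR_BIT);
    }

    static int getpage(const struct visitedpages *v, size_t page)
    {
        size_t off = page - v->lowestpage;
        return (v->bitfield[off / CHAR_BIT] >> (off % CHAR_BIT)) & 1;
    }

    int main(void)
    {
        size_t low = 100, high = 34000;          /* stand-ins for the loaders */
        size_t nbytes = (high - low) / CHAR_BIT + 1;   /* round up */
        struct visitedpages *v = malloc(sizeof *v + nbytes);
        v->lowestpage = low;
        v->highestpage = high;
        memset(v->bitfield, 0, nbytes);

        setpage(v, 150);                         /* user visited page 150 */
        printf("%d %d\n", getpage(v, 150), getpage(v, 151));
        free(v);
        return 0;
    }
    ```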


    Quzah.
    Hope is the first step on the road to disappointment.

  11. #11
    Registered User
    Join Date
    Oct 2008
    Location
    TX
    Posts
    2,059
    So instead of a sequential bitfield, how about a sparse one? That way you don't need to generate the entire set of 35000 struct objects, one per page, at once for each user; instead, dynamically generate each struct object on an as-needed basis. The tradeoff may be that the application becomes CPU bound instead of memory bound.

    Don't know if that makes sense, as I'm not sure what statistics you need to track and collect. Just my 2c.

  12. #12
    Registered User
    Join Date
    Jul 2011
    Posts
    9
    Quote Originally Posted by itCbitC View Post
    So instead of a sequential bitfield, how about a sparse one? That way you don't need to generate the entire set of 35000 struct objects, one per page, at once for each user; instead, dynamically generate each struct object on an as-needed basis. The tradeoff may be that the application becomes CPU bound instead of memory bound.

    Don't know if that makes sense, as I'm not sure what statistics you need to track and collect. Just my 2c.
    This could be the way to go, since I have more CPU power to play with on the server than RAM. Maybe only load a bitfield for the section the user is in, then detect when they switch sections and swap bitfields in from the user file. That would marginally increase the storage size on disk and actually use less RAM than the current setup. I like it. The only issue I see in implementing it is minimizing the impact of reopening and resaving the user file between section switches. Perhaps I could buffer a few of each user's most commonly used sections. That's maybe a topic for another thread, though. Thanks for the advice and the sounding board; it's been quite a lot of help.
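    A rough sketch of that section-swapping idea. SECTION_BITS, the struct, and the load stub are all hypothetical; a real version would RLE-decode the section from the user file rather than zero it:
    ```c
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>

    /* Keep only the bits for the section the user is currently in;
       swap when they cross a section boundary. All names are illustrative. */
    #define SECTION_BITS 1024

    struct sectionwindow {
        int section;                                  /* which section is loaded */
        unsigned char bits[SECTION_BITS / CHAR_BIT];  /* that section's bits only */
    };

    /* Stub: a real version would save the old section back to the user file,
       then RLE-decode the new one from disk. */
    static void load_section(struct sectionwindow *w, int section)
    {
        w->section = section;
        memset(w->bits, 0, sizeof w->bits);
    }

    static void visit(struct sectionwindow *w, long page)
    {
        int section = (int)(page / SECTION_BITS);
        if (section != w->section)
            load_section(w, section);                 /* swap on boundary cross */
        long off = page % SECTION_BITS;
        w->bits[off / CHAR_BIT] |= 1u << (off % CHAR_BIT);
    }

    int main(void)
    {
        struct sectionwindow w = { -1, {0} };
        visit(&w, 150);     /* loads section 0 */
        visit(&w, 2000);    /* crosses into section 1, swaps */
        printf("current section: %d\n", w.section);
        return 0;
    }
    ```
    Resident memory per user drops from the full bitfield to one section's worth, at the cost of a disk round-trip on each boundary crossing.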
    Last edited by Thadius; 07-21-2011 at 08:50 PM.

  13. #13
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    I'm a little unsure on a few things here because you may be misusing a few terms.
    So let me get this straight:
    Is your char bitfield[4096] simply used as an array of 32768 individually accessed bits?
    And now you effectively need an array of 4294967295 accessible bits?
    But you simply want to compress this 512MB array of bits into something smaller, perhaps by using RLE compression?

    If I'm not understanding correctly then can you clarify please?
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  14. #14
    ATH0 quzah's Avatar
    Join Date
    Oct 2001
    Posts
    14,826
    Quote Originally Posted by iMalc View Post
    And now you effectively need an array of 4294967295 accessible bits?
    But you simply want to compress this 512MB array of bits into something smaller, perhaps by using RLE compression?

    If I'm not understanding correctly then can you clarify please?
    Well when you put it like that, this...
    Quote Originally Posted by Thadius View Post
    Since in the space of "thisismyuser" I could have tracked 96 bits on disk....
    ...looks a bit silly.


    Quzah.
    Hope is the first step on the road to disappointment.

  15. #15
    Registered User
    Join Date
    Jul 2011
    Posts
    9
    Quote Originally Posted by iMalc View Post
    I'm a little unsure on a few things here because you may be misusing a few terms.
    So let me get this straight:
    Is your char bitfield[4096] simply used as an array of 32768 individually accessed bits?
    And now you effectively need an array of 4294967295 accessible bits?
    But you simply want to compress this 512MB array of bits into something smaller, perhaps by using RLE compression?

    If I'm not understanding correctly then can you clarify please?
    The char[4096] was the old allocation of bits. The new allocation could at maximum be the positive range of a signed int, so more like 2,147,483,647.
    In practical usage the current top bit is at about index 34000, with about another 60000 indexes to come. But in five years, who really knows? I had already implemented an RLE-style compression on the bitfield; sadly it used more memory than just a bitfield spanning 0 to the top index.

    As for it looking silly: I'll take RLE and a char array of at most size 2 (16 bits on this machine) per bit per document over storing an average of 96 bits per user per document, as the post I responded to with that example suggested. Imagine 100 users accessing 34000 indexes, with each index storing a unique user id or name. Now imagine we have kept each id down to a single 32-bit integer: 32 bits × 100 users × 34000 indexes ≈ 12.97 MB. That would be the long-term figure; it would start out very small. Now compare the 62 KB that seems to be the maximum disk size of a user file at the moment, which contains not only the bitfield of a user who has been on the system for two years but all of the other data relevant to that user as well. Consider too that currently the user files are only about 9 KB with all the user data and the compressed bitfield. Even if every user's file bloated to 62 KB, I would still use far less than that ~13 MB of disk space.

    This was simply a case of my not having simplified matters enough, as the method itCbitC led me to quite handily solved the problem. With a minimal amount of code I was able to shrink the RAM usage by splitting the bitfield into sections while loaded into memory, with only a slight increase in the needed storage on disk. RLE still has some shortcomings, as the example with the struct fields above shows, and it's much the same on disk. Luckily that is mostly a transitory state; for a new user the RLE starts off very small:
    Code:
    0 0     // start of storage indicator / 0 index
    1 1     // one bit read, with value 1
    33999 0 // 33999 bits with value 0
    0 -1    // end of storage
    With a user who has been through the entire sequence (which would probably never happen) it would be just as small:
    Code:
    0 0     // start
    33999 1 // 33999 bits with value 1
    0 -1    // end of storage

    The biggest flare-ups in disk storage size would theoretically happen if a user accessed, say, every odd-numbered index, since that would require a separate run for each index visited. Just as luckily, this is unlikely to occur.
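    A small sketch of an encoder for the (count, bit) on-disk format described above. The "0 0" start and "0 -1" end markers follow the post; everything else is an assumption:
    ```c
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>

    /* Writes a packed bit array as (count, bit) run pairs, one per line,
       framed by the start/end markers from the post. Illustrative only. */
    static void rle_write(FILE *out, const unsigned char *bits, long nbits)
    {
        fprintf(out, "0 0\n");                       /* start-of-storage marker */
        long i = 0;
        while (i < nbits) {
            int val = (bits[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1;
            long run = 0;
            while (i < nbits &&
                   ((bits[i / CHAR_BIT] >> (i % CHAR_BIT)) & 1) == val) {
                run++;
                i++;
            }
            fprintf(out, "%ld %d\n", run, val);      /* one run per line */
        }
        fprintf(out, "0 -1\n");                      /* end-of-storage marker */
    }

    int main(void)
    {
        unsigned char bits[2] = { 0x02, 0x00 };      /* only bit 1 set */
        rle_write(stdout, bits, 16);
        return 0;
    }
    ```
    Note how the worst case shows up directly: alternating bits produce one line per bit, which is exactly the odd-numbered-index flare-up described above.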

    As for the practicality of using RLE for in-RAM storage of bits: testing with the current user files, with bitsets below 32,767 bits in size, produced an in-memory footprint larger than the uncompressed bitfield. I believe this is largely due to the randomness in the order in which a user might access each index, though the way the structure is set up could have something to do with it. The method I hammered out earlier, thanks to itCbitC's comment, counts up to the start of the relevant section, expands the bitfield for that section into memory (usually fewer than 1000 bits, often more like 300), then releases the file. It does use more CPU time and accesses the hard drive more frequently, but the in-memory size (even when caching a couple of bitfields to reduce disk reads/writes) ends up being smaller than even the original code's, though the only real compression happens in the disk storage phase. I'm quite satisfied with this method.

    I hope that clarifies my perceptions on the matter.

