Thread: compact table representation

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    222

    compact table representation

    Hi,

    A data structure problem: I have a table of the following format:

    Code:
    C/R  1  2  3  4  5  6  7 ...
    1    x  x  x
    2            x
    3                   x
    4    x
    5            x  x x
    6
    ...
    So my column and row names are integer numbers. If the outcome of some game for a certain column and row label is a match, then we have an x at that position. The range of the integer names for both columns and rows is quite large (imagine it is so large that you would need a machine with 500GB of RAM to hold this table even if the x's are treated as regular chars). Every row contains at least one x, and the same holds for every column. However, the number of x's in a row or a column can be bigger than 1. Does anyone have a suggestion on how to store this table efficiently (using as little memory as possible)?

    The data structure should support efficient column-wise access; that is, if I want to get all values for column 4, I should be able to do that in O(N) time, where N is the number of rows.

    cheers

    baxy

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    If you only need to access it efficiently by column, then perhaps one solution is to have an array of linked lists. The index of each element corresponds to a column number. Each linked list stores the row numbers for which there is an 'x' in that column.
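
    A minimal sketch of that idea, in case it helps (the names rownode, columns and add_x are made up for illustration, and error handling is kept to a bare minimum):
    Code:
    #include <stdlib.h>
    #include <stdio.h>
    
    struct rownode {            /* one 'x': the row it appears in */
        long            row;
        struct rownode *next;
    };
    
    /* columns[c] points to a linked list of the rows that have an 'x' in column c */
    static struct rownode **columns;
    static long             ncolumns;
    
    static int add_x(long col, long row)
    {
        struct rownode *n = malloc(sizeof *n);
        if (!n)
            return -1;
        n->row  = row;
        n->next = columns[col];  /* push onto the front of column col's list */
        columns[col] = n;
        return 0;
    }
    
    int main(void)
    {
        ncolumns = 8;
        columns = calloc(ncolumns, sizeof *columns);
    
        add_x(4, 2);             /* e.g. an x at column 4, row 2, as in the example table */
        add_x(4, 5);
    
        /* walk column 4: time is proportional to the number of x's in it */
        for (const struct rownode *n = columns[4]; n; n = n->next)
            printf("column 4 has an x in row %ld\n", n->row);
    
        return 0;
    }
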
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    You spotted the hole in my problem right away. That would be a nice solution if there were only a few x's in my table, but since my table is almost full of x's, storing an int per x instead of a char is actually more expensive. I was thinking of storing ranges of x's within each column, but I came to the conclusion that this would require me to first build the full table and then create a compact version of it.

  4. #4
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Hmm... if you are sure that your table will be densely populated, then maybe you could store the positions that are empty instead.
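
    The same shape of structure works for that, just inverted: each column's list records the empty rows instead. A rough sketch (again with made-up names, and only enough code to show the lookup):
    Code:
    #include <stdlib.h>
    #include <stdio.h>
    
    /* Same shape as the earlier sketch, but columns[c] now lists the EMPTY rows
       of column c, since the table is mostly x's. */
    struct rownode { long row; struct rownode *next; };
    
    static struct rownode **columns;
    static long             ncolumns;
    
    static void add_empty(long col, long row)
    {
        struct rownode *n = malloc(sizeof *n);
        n->row  = row;
        n->next = columns[col];
        columns[col] = n;
    }
    
    /* A cell is an 'x' unless its row appears in the column's empty-row list. */
    static int is_x(long col, long row)
    {
        for (const struct rownode *n = columns[col]; n; n = n->next)
            if (n->row == row)
                return 0;        /* recorded as empty */
        return 1;                /* not listed, so in a dense table it's an x */
    }
    
    int main(void)
    {
        ncolumns = 8;
        columns = calloc(ncolumns, sizeof *columns);
    
        add_empty(2, 0);                            /* column 2 is empty in row 0 */
        printf("col 2, row 0: %d\n", is_x(2, 0));   /* 0 */
        printf("col 2, row 5: %d\n", is_x(2, 5));   /* 1 */
        return 0;
    }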

  5. #5
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    How many unique values of x do you need? How many columns? How many rows?

    Since you wish to access the contents of each column as a unit, column-major data order is a given. (Multidimensional arrays in C use row-major order, where the values on each row are consecutive in memory.)

    If x has only one possible value, then you could pack the fields in each column in consecutive bits, with 1 indicating x and 0 indicating no x. That alone gives you a memory use of roughly (rows*columns/8) bytes, and very simple and fast access methods. (For example, there are very fast methods of calculating the number of x's in a given column, or in a given column on specific consecutive rows. Modifications are similarly trivial; most C compilers even support atomic built-ins, if you want to access the same data from multiple threads simultaneously without locking.)
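
    A minimal sketch of that bit packing, assuming column-major order and one bit per cell (the sizes and function names are made up for illustration, and __builtin_popcountll is a GCC/Clang built-in; a real table this size would be allocated dynamically or memory-mapped):
    Code:
    #include <stdio.h>
    #include <stdint.h>
    
    #define ROWS           4096UL
    #define COLS           4096UL
    #define WORDS_PER_COL  ((ROWS + 63UL) / 64UL)   /* round up to whole 64-bit words */
    
    /* column-major: all bits of column c are consecutive in memory */
    static uint64_t bitmap[COLS][WORDS_PER_COL];
    
    static void set_x(unsigned long col, unsigned long row)
    {
        bitmap[col][row / 64] |= 1ULL << (row % 64);
    }
    
    static int get_x(unsigned long col, unsigned long row)
    {
        return (bitmap[col][row / 64] >> (row % 64)) & 1ULL;
    }
    
    /* count the x's in one column, one 64-bit word at a time */
    static unsigned long count_column(unsigned long col)
    {
        unsigned long count = 0;
        for (unsigned long w = 0; w < WORDS_PER_COL; w++)
            count += (unsigned long)__builtin_popcountll(bitmap[col][w]);
        return count;
    }
    
    int main(void)
    {
        set_x(4, 2);
        set_x(4, 5);
        printf("column 4, row 2: %d\n", get_x(4, 2));       /* 1 */
        printf("x's in column 4: %lu\n", count_column(4));  /* 2 */
        return 0;
    }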

    On 64-bit architectures, memory-mapping such a bit map is quite efficient. I've shown elsewhere how to manipulate a terabyte data structure using file-backed memory maps without any issues; in your case, that's equivalent to a map with just under three million rows and columns (8,796,093,022,208 cells).

    Run-length encodings are sometimes fast, but manipulating them is often painful. (There are three cases -- change at the start of a run, change in the middle of a run, change at the end of a run -- that need quite clever code to handle changes efficiently, so usually it's easier to "unpack" the entire column first, then modify it, then re-pack it, instead of manipulating the packed data directly.)
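
    To illustrate the unpack/repack route, here is a minimal sketch of run-length packing one column of 0/1 cells (types, names and the tiny demo column are made up for illustration; the modify step would simply operate on the unpacked array):
    Code:
    #include <stdio.h>
    
    /* One run: 'count' consecutive cells that are all x (bit=1) or all empty (bit=0). */
    struct run { unsigned char bit; unsigned long count; };
    
    /* Pack a column given as an array of 0/1 cells into runs; returns the run count. */
    static size_t pack_column(const unsigned char *cells, size_t rows, struct run *out)
    {
        size_t nruns = 0;
        for (size_t i = 0; i < rows; i++) {
            if (nruns && out[nruns - 1].bit == cells[i])
                out[nruns - 1].count++;                      /* extend the current run */
            else
                out[nruns++] = (struct run){ cells[i], 1 };  /* start a new run */
        }
        return nruns;
    }
    
    /* Unpack runs back into one cell per row. */
    static void unpack_column(const struct run *runs, size_t nruns, unsigned char *cells)
    {
        size_t i = 0;
        for (size_t r = 0; r < nruns; r++)
            for (unsigned long k = 0; k < runs[r].count; k++)
                cells[i++] = runs[r].bit;
    }
    
    int main(void)
    {
        unsigned char col[10] = { 1,1,1,0,0,1,0,0,0,1 };
        struct run    runs[10];
        unsigned char back[10];
    
        size_t n = pack_column(col, 10, runs);
        unpack_column(runs, n, back);
    
        for (size_t r = 0; r < n; r++)
            printf("%lu x %d\n", runs[r].count, runs[r].bit);
        return 0;
    }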

    So, more details, please!

  6. #6
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    @nominal:

    There is no 'x' in the sense of something that has a value.
    It's just X as in X marks the spot.
    So, in essence, yes, x has only one possible value.

    He said that represented fully with chars (using spaces and X's) it would take 500GB.
    That's a row/column size of over 700,000 (assuming a square ... but is it?)

    A string of bits 700,000 bits long is under 90 kilobytes.
    700,000 of those is about 62.5 gigs.
    Actually, that's doable.

    So another question is how much memory are you allowed to use?
    And is space more important than time?

  7. #7
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by oogabooga View Post
    There is no 'x' in the sense of something that has a value.
    It's just X as in X marks the spot.
    So, in essence, yes, x has only one possible value.
    Thanks. I suspected that, just wanted to be sure.

    Quote Originally Posted by oogabooga View Post
    He said that represented fully with chars (using spaces and X's) it would take 500GB.
    In binary, that'd be 500GB/8, or about 63GB. That's not too bad at all. (I'd align each column to 256 bits (32 bytes), so there might be up to 255 unused bits per column.)
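
    For what it's worth, the per-column size with that alignment is easy to compute (a tiny sketch; the row count here is just an assumed example):
    Code:
    #include <stdio.h>
    
    int main(void)
    {
        unsigned long rows = 707107UL;                                     /* assumed, for illustration */
        unsigned long bytes_per_column = ((rows + 255UL) / 256UL) * 32UL;  /* pad up to 256 bits */
    
        printf("%lu bytes per column (%lu bits of padding)\n",
               bytes_per_column, bytes_per_column * 8UL - rows);
        return 0;
    }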

    In all use cases I can think of, the manipulation becomes I/O-bound. That is, very simple code can do all the kinds of manipulation you wish, and you'll be limited only by your storage I/O speed. (My two-disk RAID-0/1 array gives about 200 MB/s reads and 120 MB/s writes in Linux, and the kernel paging mechanisms get pretty darn close to that.)

    In Linux, if you have enough RAM to keep the entire dataset, or at least the active portion of it, in the page cache, then the storage I/O speed does not matter. But 64GB of RAM is still a bit more than a typical workstation (or computing cluster node) has.

    Based on some stuff I've done in the past with similar data, I'd say (at least in Linux on 64-bit architectures), memory-mapping the bit map file is the way to start. I can cobble up an example program (for Linux), if you like.
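
    Something along these lines, as a bare-bones 64-bit Linux sketch (the file name, the row/column counts, the 256-bit column padding and the column-major layout are all assumptions for illustration; the file is created sparse, and error handling is minimal):
    Code:
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/mman.h>
    
    #define ROWS          707107UL
    #define COLS          707107UL
    #define BYTES_PER_COL (((ROWS + 255UL) / 256UL) * 32UL)   /* pad each column to 256 bits */
    
    int main(void)
    {
        const char  *path = "bitmap.dat";            /* assumed file name */
        const size_t size = COLS * BYTES_PER_COL;    /* ~62.5 GB, created sparse below */
    
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd == -1 || ftruncate(fd, (off_t)size) == -1) {
            perror(path);
            return 1;
        }
    
        /* map the whole file; pages are loaded and written back by the kernel on demand */
        uint8_t *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
    
        /* column-major: column c starts at map + c * BYTES_PER_COL */
        unsigned long col = 4, row = 2;
        map[col * BYTES_PER_COL + row / 8] |= (uint8_t)(1U << (row % 8));   /* set an x */
    
        munmap(map, size);
        close(fd);
        return 0;
    }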

    However, if the data has patterns, or if you only ever add x's, and so on, there may be other data structures that give a BIG boost. Those data structures are only effective for specific data patterns, though, and typically perform very poorly for others. Therefore, yet another question:

    Is the data pattern symmetric? Does it have typical patterns? (In particular, are there typically long consecutive runs of x's in a column? Or are the x's scattered randomly?)

    As a side note, if you had to use a 32-bit architecture, you could split each column into a separate file. (Some OSes don't like millions of files in a single directory, but you can create a small directory hierarchy to limit it to a few thousand or tens of thousands of files (columns) per directory; in Linux, a large number of files in a directory is typically not that big of a problem.) It might also make sense if consecutive operations typically target a specific column. (In this case, some form of run-length or pattern encoding might make sense. Since disk I/O is usually the bottleneck with datasets of this size, saving disk space yields a speed boost, even if you spend a bit more CPU time in unpacking and packing the column data. Because the OS takes care of the file fragmentation, file size changing due to data modifications is not a big problem.)

    If your operations typically operate on specific (sets of) columns, then the one-file-per-column, with data compressed (or just raw bitmap) in each file, might be the most efficient way.

    Yup. Lots of options, lots of ways to go at it. But not enough information to make really good suggestions, I think.

  8. #8
    .
    Join Date
    Nov 2003
    Posts
    307
    Are you bound to using whole data types rather than single bits? You seem to have only two states -- off/on -- for each "cell" in your matrix.
    The bitmap below gives O(N) access to all columns of a given row, or to all rows of a given column.

    Code:
    // proof of concept
    #include <stdlib.h>
    #include <stdio.h>
    #include <limits.h>
    #include <time.h>
    
    static unsigned char table[512][64] = {{0x0}};   // 512 x 512 bitmap, 8 columns per byte
    
    void setrowcol(int col, int row)   // set one cell to X
    {
        int bit  = col % CHAR_BIT;     // bit position inside the byte
        int byte = col / CHAR_BIT;     // which byte of the row holds this column
        table[row][byte] |= (unsigned char)(1 << bit);
    }
    
    int getrowcol(int col, int row)    // read one cell
    {
        int bit  = col % CHAR_BIT;
        int byte = col / CHAR_BIT;
        return !!(table[row][byte] & (1 << bit));   // 1 or zero
    }
    
    int main(void)
    {
        int col, row;
    
        srand(time(NULL));
        col = rand() % 512;
        row = rand() % 512;
        setrowcol(col, row);
        printf("table[row=%d][col=%d]=%d\n", row, col, getrowcol(col, row));
        col /= 2;
        row /= 2;
        printf("table[row=%d][col=%d]=%d\n", row, col, getrowcol(col, row));
        return 0;
    }

  9. #9
    .
    Join Date
    Nov 2003
    Posts
    307
    This is a serious design question.

    Consider: Cache latency and Column-wise Access

    If you access each element (either in a bit table, or some other data structure) of a column when the whole structure is large, each access is going to cause a CPU data cache miss - in L1, L2, L3, whatever CPU caching your architecture has. Missing the cache on every access is ridiculously expensive. Since you state that it is possible to have more than half of the elements in a column or row be 'x', you will have the same problem with hashing into a sparse array: on inserting or retrieving data, a cache overhead hit will occur.

    This is much less of a problem going row-wise, depending on the architecture.

    Net result: a super-slow app, with lots of extra CPU overhead. Ulrich Drepper, a less-than-friendly glibc developer, has some low-level suggestions for using threads and restructuring loop code to prefetch data more efficiently and decrease run time for this kind of matrix or table operation. The paper is old, but it is still a valid read, especially for this kind of latency issue.

    http://www.akkadia.org/drepper/cpumemory.pdf

    Consider a Judy array approach. This is a sparse array implementation. If the array turns out not to be so sparse, though, it stops being a good idea for huge tables, for the above latency reasons plus the extra code overhead.

    https://code.google.com/p/judyarray/

    And if the matrix you create is the size you indicated, populated the way you said, then you will need to run massively parallel on a box with gobs of memory and CPUs, or your app will drag, big time.

  10. #10
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by jim mcnamara View Post
    If you access each element (either in a bit table, or some other data structure) of a column when the whole structure is large
    You're assuming row-major data ordering. We've already established that if the access patterns are such that the data in each column is accessed sequentially, it's best to store the data in column-major order.

    Since C natively uses row-major data order, that essentially means that you store the transpose of the data set, i.e. with consecutive data in a column consecutive in memory.

    To make any further suggestions, we really need more information about the data access patterns, and if possible, any known patterns in the data too.

    (Drepper's paper is certainly useful in considering how to best implement those patterns. For example, doing a transpose copy of the right-side matrix for matrix-matrix multiplication [using C data order] is an excellent case. For 1000×1000 matrices with double-precision floating-point values, Drepper showed 77% CPU time savings. In other words, doing the "extra" transposing copy cut the CPU time needed to do the matrix-matrix multiplication to under a quarter of the original CPU time.)
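
    For illustration, a rough sketch of that transpose-first idea (sizes and names are made up; this mirrors the idea rather than Drepper's exact code):
    Code:
    #include <stdio.h>
    
    #define N 512   /* small demo size; Drepper's measurement used 1000x1000 */
    
    static double a[N][N], b[N][N], bt[N][N], c[N][N];
    
    int main(void)
    {
        /* fill a and b with something (the values are irrelevant to the access pattern) */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = i + j;
                b[i][j] = i - j;
            }
    
        /* transpose b first, so the inner loop below reads both operands row-wise */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                bt[j][i] = b[i][j];
    
        /* c = a * b, scanning rows of bt sequentially instead of striding down b */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * bt[j][k];
                c[i][j] = sum;
            }
    
        printf("c[0][0] = %g\n", c[0][0]);
        return 0;
    }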

    Suppose we are talking about a dataset with millions of rows, with dense or easily compressible (non-random) data in each column, where accesses typically compare or otherwise operate on a small number of columns with each other (logical and? or? xor? nand? sum? difference? counts?) -- i.e. the working set is a small number of columns -- and where modifications are heavily clustered in time within the same column or group of columns. In that case I would personally split each column into a separate file, with each file containing the column data in compressed form. (This should drastically cut down the I/O time, and thus speed up the overall accesses; the assumption is that the dataset even in binary form is much, much larger than available RAM.) A small number of columns would be cached in memory, with the least recently used one reused to hold a new column. (When reuse happens, a simple flag tells whether the data has to be recompressed and written to storage, or can just be discarded because no changes have occurred.)
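
    A bare-bones sketch of that caching scheme (the cache size, the load_column/store_column stubs and all names are assumptions made up for illustration; a real version would add the compression, file I/O and error handling):
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #define ROWS        1000000UL
    #define COL_BYTES   ((ROWS + 7UL) / 8UL)   /* one bit per row, uncompressed here */
    #define CACHE_SLOTS 4                      /* how many columns to keep in RAM */
    
    struct slot {
        long           col;        /* which column this slot holds, -1 if empty */
        unsigned long  last_used;  /* for least-recently-used replacement */
        int            dirty;      /* 1 if the data must be written back on eviction */
        unsigned char *data;       /* COL_BYTES of column data */
    };
    
    static struct slot   cache[CACHE_SLOTS];
    static unsigned long tick;
    
    /* Stubs: a real version would read/write (and compress) one file per column. */
    static void load_column(long col, unsigned char *buf)        { memset(buf, 0, COL_BYTES); (void)col; }
    static void store_column(long col, const unsigned char *buf) { (void)col; (void)buf; }
    
    /* Return a cached buffer for column col, loading it and evicting the LRU slot if needed. */
    static unsigned char *get_column(long col, int for_writing)
    {
        struct slot *victim = &cache[0];
    
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].col == col) {                 /* already cached */
                cache[i].last_used = ++tick;
                cache[i].dirty |= for_writing;
                return cache[i].data;
            }
            if (cache[i].last_used < victim->last_used)
                victim = &cache[i];                    /* remember the least recently used */
        }
    
        if (victim->data == NULL)
            victim->data = malloc(COL_BYTES);
        else if (victim->dirty)
            store_column(victim->col, victim->data);   /* write back before reuse */
    
        load_column(col, victim->data);
        victim->col       = col;
        victim->last_used = ++tick;
        victim->dirty     = for_writing;
        return victim->data;
    }
    
    int main(void)
    {
        for (int i = 0; i < CACHE_SLOTS; i++)
            cache[i].col = -1;
    
        unsigned char *col4 = get_column(4, 1);
        col4[2 / 8] |= 1U << (2 % 8);                  /* set the x at column 4, row 2 */
        printf("column 4, row 2: %d\n", (col4[2 / 8] >> (2 % 8)) & 1);
        return 0;
    }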

    It's pretty frustrating to see such an interesting topic pop up, and the OP to choose not to provide any relevant details.

  11. #11
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    So the size of the raw uncompressed data here would be 62.5GB, given that using a char per cell would be 500GB.
    Having established that the size is huge, it is clear that the most effective scheme would take into account everything you could possibly know about the data. Something that large doesn't tend to be random, incompressible data.
    Even knowing the typical ratio of trues to falses could mean savings on the order of tens of GBs.

    I would not approach this problem without some real information about what the actual data is, or how it comes into existence.
    No problem where the compressed size of the data matters this much is worth blindly choosing some random idea for.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"
