Thread: compact table representation

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    222

    compact table representation

    Hi,

    A data structure problem: I have a table of the following format:

    Code:
    C/R  1  2  3  4  5  6  7 ...
    1    x  x  x
    2            x
    3                   x
    4    x
    5            x  x x
    6
    ...
    So my column and row names are integer numbers. If the outcome of some game for a certain column and row label is a match, then we have an x at that position. The range of the integer names for both columns and rows is quite large (imagine it is so large that you would need a machine with 500GB of RAM to hold this table even if the x's are treated as regular chars). Every row contains at least one x, and the same holds for every column. However, the number of x's in a row or a column can be bigger than 1. Does anyone have a suggestion on how to store this table efficiently (using as little memory as possible)?

    The data structure should support efficient column-wise access; that is, if I want to get all values for column 4, I should be able to do that in O(N) time, where N is the number of rows.

    cheers

    baxy

  2. #2
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    If you only need to access it efficiently by column, then perhaps one solution is to have an array of linked lists. The index of each element corresponds to a column number. Each linked list stores the row numbers for which there is an 'x' in that column.
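
    A minimal sketch of that idea, in case it helps (the names rownode, columns and add_x are made up for illustration, and error handling is kept to a bare minimum):
    Code:
    #include <stdlib.h>
    #include <stdio.h>
    
    struct rownode {            /* one 'x': the row it appears in */
        long            row;
        struct rownode *next;
    };
    
    /* columns[c] points to a linked list of the rows that have an 'x' in column c */
    static struct rownode **columns;
    static long             ncolumns;
    
    static int add_x(long col, long row)
    {
        struct rownode *n = malloc(sizeof *n);
        if (!n)
            return -1;
        n->row  = row;
        n->next = columns[col];  /* push onto the front of column col's list */
        columns[col] = n;
        return 0;
    }
    
    int main(void)
    {
        ncolumns = 8;
        columns = calloc(ncolumns, sizeof *columns);
    
        add_x(4, 2);             /* e.g. an x at column 4, row 2, as in the example table */
        add_x(4, 5);
    
        /* walk column 4: time is proportional to the number of x's in it */
        for (const struct rownode *n = columns[4]; n; n = n->next)
            printf("column 4 has an x in row %ld\n", n->row);
    
        return 0;
    }
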
    Quote Originally Posted by Bjarne Stroustrup (2000-10-14)
    I get maybe two dozen requests for help with some sort of programming or design problem every day. Most have more sense than to send me hundreds of lines of code. If they do, I ask them to find the smallest example that exhibits the problem and send me that. Mostly, they then find the error themselves. "Finding the smallest program that demonstrates the error" is a powerful debugging tool.
    Look up a C++ Reference and learn How To Ask Questions The Smart Way

  3. #3
    Registered User
    Join Date
    Jan 2011
    Posts
    222
    You spotted the hole in my problem right away. That would be a nice solution if there were only a few x's in my table, but since my table is almost full of x's, storing an int per x instead of a char is actually more expensive. I was thinking of storing ranges of x's within each column, but I came to the conclusion that this would require me to first build the full table and then create a compact version of it.

  4. #4
    C++ Witch laserlight's Avatar
    Join Date
    Oct 2003
    Location
    Singapore
    Posts
    28,413
    Hmm... if you are sure that your table will be densely populated, then maybe you could store the positions that are empty instead.
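
    The same shape of structure works for that, just inverted: each column's list records the empty rows instead. A rough sketch (again with made-up names, and only enough code to show the lookup):
    Code:
    #include <stdlib.h>
    #include <stdio.h>
    
    /* Same shape as the earlier sketch, but columns[c] now lists the EMPTY rows
       of column c, since the table is mostly x's. */
    struct rownode { long row; struct rownode *next; };
    
    static struct rownode **columns;
    static long             ncolumns;
    
    static void add_empty(long col, long row)
    {
        struct rownode *n = malloc(sizeof *n);
        n->row  = row;
        n->next = columns[col];
        columns[col] = n;
    }
    
    /* A cell is an 'x' unless its row appears in the column's empty-row list. */
    static int is_x(long col, long row)
    {
        for (const struct rownode *n = columns[col]; n; n = n->next)
            if (n->row == row)
                return 0;        /* recorded as empty */
        return 1;                /* not listed, so in a dense table it's an x */
    }
    
    int main(void)
    {
        ncolumns = 8;
        columns = calloc(ncolumns, sizeof *columns);
    
        add_empty(2, 0);                            /* column 2 is empty in row 0 */
        printf("col 2, row 0: %d\n", is_x(2, 0));   /* 0 */
        printf("col 2, row 5: %d\n", is_x(2, 5));   /* 1 */
        return 0;
    }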

  5. #5
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    How many unique values of x do you need? How many columns? How many rows?

    Since you wish to access the contents of each column as a unit, column-major data order is a given. (Multidimensional arrays in C use row-major order, where the values on each row are consecutive in memory.)

    If x has only one possible value, then you could pack the fields in each column in consecutive bits, with 1 indicating x and 0 indicating no x. That alone gives you a memory use of roughly (rows*columns/8) bytes, and very simple and fast access methods. (For example, there are very fast methods of calculating the number of x's in a given column, or in a given column on specific consecutive rows. Modifications are similarly trivial; most C compilers even support atomic built-ins, if you want to access the same data from multiple threads simultaneously without locking.)
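
    A minimal sketch of that bit packing, assuming column-major order and one bit per cell (the sizes and function names are made up for illustration, and __builtin_popcountll is a GCC/Clang built-in; a real table this size would be allocated dynamically or memory-mapped):
    Code:
    #include <stdio.h>
    #include <stdint.h>
    
    #define ROWS           4096UL
    #define COLS           4096UL
    #define WORDS_PER_COL  ((ROWS + 63UL) / 64UL)   /* round up to whole 64-bit words */
    
    /* column-major: all bits of column c are consecutive in memory */
    static uint64_t bitmap[COLS][WORDS_PER_COL];
    
    static void set_x(unsigned long col, unsigned long row)
    {
        bitmap[col][row / 64] |= 1ULL << (row % 64);
    }
    
    static int get_x(unsigned long col, unsigned long row)
    {
        return (bitmap[col][row / 64] >> (row % 64)) & 1ULL;
    }
    
    /* count the x's in one column, one 64-bit word at a time */
    static unsigned long count_column(unsigned long col)
    {
        unsigned long count = 0;
        for (unsigned long w = 0; w < WORDS_PER_COL; w++)
            count += (unsigned long)__builtin_popcountll(bitmap[col][w]);
        return count;
    }
    
    int main(void)
    {
        set_x(4, 2);
        set_x(4, 5);
        printf("column 4, row 2: %d\n", get_x(4, 2));       /* 1 */
        printf("x's in column 4: %lu\n", count_column(4));  /* 2 */
        return 0;
    }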

    On 64-bit architectures, memory-mapping such a bit map is quite efficient. I've shown elsewhere how to manipulate a terabyte data structure using file-backed memory maps without any issues; in your case, that's equivalent to a map with just under three million rows and columns (8,796,093,022,208 cells).

    Run-length encodings are sometimes fast, but manipulating them is often painful. (There are three cases -- change at the start of a run, change in the middle of a run, change at the end of a run -- that need quite clever code to handle changes efficiently, so usually it's easier to "unpack" the entire column first, then modify it, then re-pack it, instead of manipulating the packed data directly.)
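
    To illustrate the unpack/repack route, here is a minimal sketch of run-length packing one column of 0/1 cells (types, names and the tiny demo column are made up for illustration; the modify step would simply operate on the unpacked array):
    Code:
    #include <stdio.h>
    
    /* One run: 'count' consecutive cells that are all x (bit=1) or all empty (bit=0). */
    struct run { unsigned char bit; unsigned long count; };
    
    /* Pack a column given as an array of 0/1 cells into runs; returns the run count. */
    static size_t pack_column(const unsigned char *cells, size_t rows, struct run *out)
    {
        size_t nruns = 0;
        for (size_t i = 0; i < rows; i++) {
            if (nruns && out[nruns - 1].bit == cells[i])
                out[nruns - 1].count++;                      /* extend the current run */
            else
                out[nruns++] = (struct run){ cells[i], 1 };  /* start a new run */
        }
        return nruns;
    }
    
    /* Unpack runs back into one cell per row. */
    static void unpack_column(const struct run *runs, size_t nruns, unsigned char *cells)
    {
        size_t i = 0;
        for (size_t r = 0; r < nruns; r++)
            for (unsigned long k = 0; k < runs[r].count; k++)
                cells[i++] = runs[r].bit;
    }
    
    int main(void)
    {
        unsigned char col[10] = { 1,1,1,0,0,1,0,0,0,1 };
        struct run    runs[10];
        unsigned char back[10];
    
        size_t n = pack_column(col, 10, runs);
        unpack_column(runs, n, back);
    
        for (size_t r = 0; r < n; r++)
            printf("%lu x %d\n", runs[r].count, runs[r].bit);
        return 0;
    }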

    So, more details, please!

  6. #6
    - - - - - - - - oogabooga's Avatar
    Join Date
    Jan 2008
    Posts
    2,808
    @nominal:

    There is no 'x' in the sense of something that has a value.
    It's just X as in X marks the spot.
    So, in essence, yes, x has only one possible value.

    He said that represented fully with chars (using spaces and X's) it would take 500GB.
    That's a row/column size of over 700,000 (assuming a square ... but is it?)

    A string of bits 700,000 bits long is under 90 kilobytes.
    700,000 of those is about 62.5 gigs.
    Actually, that's doable.

    So another question is how much memory are you allowed to use?
    And is space more important than time?

  7. #7
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by oogabooga View Post
    There is no 'x' in the sense of something that has a value.
    It's just X as in X marks the spot.
    So, in essence, yes, x has only one possible value.
    Thanks. I suspected that, just wanted to be sure.

    Quote Originally Posted by oogabooga View Post
    He said that represented fully with chars (using spaces and X's) it would take 500GB.
    In binary, that'd be 500GB/8, or about 63GB. That's not too bad at all. (I'd align each column to 256 bits (32 bytes), so there might be up to 255 unused bits per column.)
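
    For what it's worth, the per-column size with that alignment is easy to compute (a tiny sketch; the row count here is just an assumed example):
    Code:
    #include <stdio.h>
    
    int main(void)
    {
        unsigned long rows = 707107UL;                                     /* assumed, for illustration */
        unsigned long bytes_per_column = ((rows + 255UL) / 256UL) * 32UL;  /* pad up to 256 bits */
    
        printf("%lu bytes per column (%lu bits of padding)\n",
               bytes_per_column, bytes_per_column * 8UL - rows);
        return 0;
    }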

    In all use cases I can think of, the manipulation becomes I/O-bound. That is, very simple code can do all the kinds of manipulation you wish, and you'll be limited only by your storage I/O speed. (My two-disk RAID-0/1 array gives about 200 MB/s reads and 120 MB/s writes in Linux, and the kernel paging mechanisms get pretty darn close to that.)

    In Linux, if you have enough RAM to keep the entire dataset, or at least the active portion of it, in the page cache, then the storage I/O speed does not matter. But 64GB of RAM is still a bit more than a typical workstation (or computing cluster node) has.

    Based on some stuff I've done in the past with similar data, I'd say (at least in Linux on 64-bit architectures), memory-mapping the bit map file is the way to start. I can cobble up an example program (for Linux), if you like.
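
    Something along these lines, as a bare-bones 64-bit Linux sketch (the file name, the row/column counts, the 256-bit column padding and the column-major layout are all assumptions for illustration; the file is created sparse, and error handling is minimal):
    Code:
    #include <stdio.h>
    #include <stdint.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/mman.h>
    
    #define ROWS          707107UL
    #define COLS          707107UL
    #define BYTES_PER_COL (((ROWS + 255UL) / 256UL) * 32UL)   /* pad each column to 256 bits */
    
    int main(void)
    {
        const char  *path = "bitmap.dat";            /* assumed file name */
        const size_t size = COLS * BYTES_PER_COL;    /* ~62.5 GB, created sparse below */
    
        int fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd == -1 || ftruncate(fd, (off_t)size) == -1) {
            perror(path);
            return 1;
        }
    
        /* map the whole file; pages are loaded and written back by the kernel on demand */
        uint8_t *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
    
        /* column-major: column c starts at map + c * BYTES_PER_COL */
        unsigned long col = 4, row = 2;
        map[col * BYTES_PER_COL + row / 8] |= (uint8_t)(1U << (row % 8));   /* set an x */
    
        munmap(map, size);
        close(fd);
        return 0;
    }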

    However, if the data has patterns, or if you only ever add x's, and so on, there may be other data structures that give a BIG boost. Those data structures are only effective for specific data patterns, though, and typically perform very poorly for others. Therefore, yet another question:

    Is the data pattern symmetric? Does it have typical patterns? (In particular, are there typically long consecutive runs of x's in a column? Or are the x's scattered randomly?)

    As a side note, if you had to use a 32-bit architecture, you could split each column into a separate file. (Some OSes don't like millions of files in a single directory, but you can create a small directory hierarchy to limit it to a few thousand or tens of thousands of files (columns) per directory; in Linux, a large number of files in a directory is typically not that big of a problem.) It might also make sense if consecutive operations typically target a specific column. (In this case, some form of run-length or pattern encoding might make sense. Since disk I/O is usually the bottleneck with datasets of this size, saving disk space yields a speed boost, even if you spend a bit more CPU time in unpacking and packing the column data. Because the OS takes care of the file fragmentation, file size changing due to data modifications is not a big problem.)

    If your operations typically operate on specific (sets of) columns, then the one-file-per-column, with data compressed (or just raw bitmap) in each file, might be the most efficient way.

    Yup. Lots of options, lots of ways to go at it. But not enough information to make really good suggestions, I think.

  8. #8
    .
    Join Date
    Nov 2003
    Posts
    307
    Are you bound to using whole data types rather than single bits? You seem to have only two states -- off/on -- for each "cell" in your matrix.
    The bitmap below gives O(N) access to all columns of a given row, or to all rows of a given column.

    Code:
    // proof of concept
    #include <stdlib.h>
    #include <stdio.h>
    #include <limits.h>
    #include <time.h>
    
    static unsigned char table[512][64] = {{0x0}};   // 512 x 512 bitmap, 8 columns per byte
    
    void setrowcol(int col, int row)   // set one cell to X
    {
        int bit  = col % CHAR_BIT;     // bit position inside the byte
        int byte = col / CHAR_BIT;     // which byte of the row holds this column
        table[row][byte] |= (unsigned char)(1 << bit);
    }
    
    int getrowcol(int col, int row)    // read one cell
    {
        int bit  = col % CHAR_BIT;
        int byte = col / CHAR_BIT;
        return !!(table[row][byte] & (1 << bit));   // 1 or zero
    }
    
    int main(void)
    {
        int col, row;
    
        srand(time(NULL));
        col = rand() % 512;
        row = rand() % 512;
        setrowcol(col, row);
        printf("table[row=%d][col=%d]=%d\n", row, col, getrowcol(col, row));
        col /= 2;
        row /= 2;
        printf("table[row=%d][col=%d]=%d\n", row, col, getrowcol(col, row));
        return 0;
    }

  9. #9
    .
    Join Date
    Nov 2003
    Posts
    307
    This is a serious design question.

    Consider: Cache latency and Column-wise Access

    If you access each element (either in a bit table, or some other data structure) of a column when the whole structure is large, each access is going to cause a CPU data cache miss - in L1, L2, L3, whatever CPU caching your architecture has. Missing the cache on every access is ridiculously expensive. Since you state that it is possible to have more than half of the elements in a column or row be 'x', you will have the same problem with hashing into a sparse array: on inserting or retrieving data, a cache overhead hit will occur.

    This is much less of a problem going row-wise, depending on the architecture.

    Net result: a super-slow app, with lots of extra CPU overhead. Ulrich Drepper, a less-than-friendly glibc developer, has some low-level suggestions for using threads and restructuring loop code to prefetch data more efficiently and decrease run time for this kind of matrix or table operation. The paper is old, but it is still a valid read, especially for this kind of latency issue.

    http://www.akkadia.org/drepper/cpumemory.pdf

    Consider a Judy array approach. This is a sparse array implementation. If the array turns out not to be so sparse, though, it stops being a good idea for huge tables, for the above latency reasons plus the extra code overhead.

    https://code.google.com/p/judyarray/

    And if the matrix you create is the size you indicated, populated the way you said, then you will need to run massively parallel on a box with gobs of memory and CPUs, or your app will drag, big time.

  10. #10
    Ticked and off
    Join Date
    Oct 2011
    Location
    La-la land
    Posts
    1,728
    Quote Originally Posted by jim mcnamara View Post
    If you access each element (either in a bit table, or some other data structure) of a column when the whole structure is large
    You're assuming row-major data ordering. We've already established that if the access patterns are such that the data in each column is accessed sequentially, it's best to store the data in column-major order.

    Since C natively uses row-major data order, that essentially means that you store the transpose of the data set, i.e. with consecutive data in a column consecutive in memory.

    To make any further suggestions, we really need more information about the data access patterns, and if possible, any known patterns in the data too.

    (Drepper's paper is certainly useful in considering how to best implement those patterns. For example, doing a transpose copy of the right-side matrix for matrix-matrix multiplication [using C data order] is an excellent case. For 1000×1000 matrices with double-precision floating-point values, Drepper showed 77% CPU time savings. In other words, doing the "extra" transposing copy cut the CPU time needed to do the matrix-matrix multiplication to under a quarter of the original CPU time.)
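
    For illustration, a rough sketch of that transpose-first idea (sizes and names are made up; this mirrors the idea rather than Drepper's exact code):
    Code:
    #include <stdio.h>
    
    #define N 512   /* small demo size; Drepper's measurement used 1000x1000 */
    
    static double a[N][N], b[N][N], bt[N][N], c[N][N];
    
    int main(void)
    {
        /* fill a and b with something (the values are irrelevant to the access pattern) */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                a[i][j] = i + j;
                b[i][j] = i - j;
            }
    
        /* transpose b first, so the inner loop below reads both operands row-wise */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                bt[j][i] = b[i][j];
    
        /* c = a * b, scanning rows of bt sequentially instead of striding down b */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += a[i][k] * bt[j][k];
                c[i][j] = sum;
            }
    
        printf("c[0][0] = %g\n", c[0][0]);
        return 0;
    }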

    Suppose we are talking about a dataset with millions of rows, with dense or easily compressible (non-random) data in each column, where accesses typically compare or otherwise operate on a small number of columns with each other (logical and? or? xor? nand? sum? difference? counts?) -- i.e. the working set is a small number of columns -- and where modifications are heavily clustered in time within the same column or group of columns. In that case I would personally split each column into a separate file, with each file containing the column data in compressed form. (This should drastically cut down the I/O time, and thus speed up the overall accesses; the assumption is that the dataset even in binary form is much, much larger than available RAM.) A small number of columns would be cached in memory, with the least recently used one reused to hold a new column. (When reuse happens, a simple flag tells whether the data has to be recompressed and written to storage, or can just be discarded because no changes have occurred.)
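
    A bare-bones sketch of that caching scheme (the cache size, the load_column/store_column stubs and all names are assumptions made up for illustration; a real version would add the compression, file I/O and error handling):
    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    
    #define ROWS        1000000UL
    #define COL_BYTES   ((ROWS + 7UL) / 8UL)   /* one bit per row, uncompressed here */
    #define CACHE_SLOTS 4                      /* how many columns to keep in RAM */
    
    struct slot {
        long           col;        /* which column this slot holds, -1 if empty */
        unsigned long  last_used;  /* for least-recently-used replacement */
        int            dirty;      /* 1 if the data must be written back on eviction */
        unsigned char *data;       /* COL_BYTES of column data */
    };
    
    static struct slot   cache[CACHE_SLOTS];
    static unsigned long tick;
    
    /* Stubs: a real version would read/write (and compress) one file per column. */
    static void load_column(long col, unsigned char *buf)        { memset(buf, 0, COL_BYTES); (void)col; }
    static void store_column(long col, const unsigned char *buf) { (void)col; (void)buf; }
    
    /* Return a cached buffer for column col, loading it and evicting the LRU slot if needed. */
    static unsigned char *get_column(long col, int for_writing)
    {
        struct slot *victim = &cache[0];
    
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i].col == col) {                 /* already cached */
                cache[i].last_used = ++tick;
                cache[i].dirty |= for_writing;
                return cache[i].data;
            }
            if (cache[i].last_used < victim->last_used)
                victim = &cache[i];                    /* remember the least recently used */
        }
    
        if (victim->data == NULL)
            victim->data = malloc(COL_BYTES);
        else if (victim->dirty)
            store_column(victim->col, victim->data);   /* write back before reuse */
    
        load_column(col, victim->data);
        victim->col       = col;
        victim->last_used = ++tick;
        victim->dirty     = for_writing;
        return victim->data;
    }
    
    int main(void)
    {
        for (int i = 0; i < CACHE_SLOTS; i++)
            cache[i].col = -1;
    
        unsigned char *col4 = get_column(4, 1);
        col4[2 / 8] |= 1U << (2 % 8);                  /* set the x at column 4, row 2 */
        printf("column 4, row 2: %d\n", (col4[2 / 8] >> (2 % 8)) & 1);
        return 0;
    }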

    It's pretty frustrating to see such an interesting topic pop up, and the OP to choose not to provide any relevant details.

  11. #11
    Algorithm Dissector iMalc's Avatar
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    So the size of the raw uncompressed data here would be 62.5GB, given that using a char per cell would be 500GB.
    Having established that the size is huge, it is clear that the most effective scheme would take into account everything you could possibly know about the data. Something that large doesn't tend to be random, incompressible data.
    Even knowing the typical ratio of trues to falses could mean savings on the order of tens of GBs.

    I would not approach this problem without some real information about what the actual data is, or how it comes into existence.
    No problem where the compressed size of the data matters this much is worth blindly choosing some random idea for.
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"
