Thread: Advice needed, how to encode large arrays

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    222

    Advice needed, how to encode large arrays

    Hi, i need some advice,

    I have an array of 5*10e10 integer values. Values span between 1 and 50000 they do not appear in nay particular order. Also i have an array of characters of the 3*10e9 size. I would like to store these two arrays on disc. if i store them as binary then since the integer array is encoded as long int i consume more memory then just writing the thing down as a regular txt file. In any case both the text and the binary are too large to be transported over the net. so my question is can anyone give my an advice on how to encode this particular case, the binary and character array so that i consume as less space on disc as possible.

    thank you

    baxy

    P.S.

    How to save an array of type long to disc as a type short

    just an example
    Code:
    long array;
    
    fwrite(&array,sizeof(short),NUMBER_OF_VALUES,fileIn)
    Is something like this possible (i don't see how but i am not the best programmer out there)
    Last edited by baxy; 10-17-2012 at 10:54 AM.

  2. #2
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    If you have on the order of billions of integer values, but they all fall within a range of 1 to 50000, you will have a lot of repeats. (assuming even distribution, every number will appear 50000 times for your first array).

    Enter compression. Yep, your standard old zip utilities. They look for patterns or repeated data in your input, and replace that by a much shorter sequence of bits in the output. The more repetition in your data, the more benefit you get from compression (higher compression ratio). Different algorithms work slightly better or worse in some cases (i.e. depending on the type of input), but any of them will work very well on your data. Also, the text version may be larger to begin with, but may compress better than the binary version, since you have fewer overall "symbols" (10 digits, plus perhaps a space char to separate them) versus the possible binary symbols (basically all 256 values for a byte).

    But the best thing to do is test them. This is very easy if you're on linux, just write your program to spit the data to stdout (the terminal/screen), but pipe that data through the bzip2 program:
    Code:
    ./your_program | bzip2 > foo.bz2
    You can try your program with binary and text output, create 2 different zip files and see which compresses better.

  3. #3
    Registered User
    Join Date
    Nov 2010
    Location
    Long Beach, CA
    Posts
    5,909
    Quote Originally Posted by baxy View Post
    P.S.

    How to save an array of type long to disc as a type short

    just an example
    Code:
    long array;
    
    fwrite(&array,sizeof(short),NUMBER_OF_VALUES,fileIn)
    Is something like this possible (i don't see how but i am not the best programmer out there)
    Yes, it's possible, but not that way. First, if short was a 16-bit type, you would need an unsigned short to hold values over 32,767. Second, short is guaranteed by the standard to be at least 16 bits, but it could be a larger type (e.g. 32-bits). I would recommend using fixed-size type, like uint16_t (make sure to #include <stdint.h>).

    You would have to cast the elements of the original array to the type you want as you write it. That either means creating an array of uint16_t of a reasonable size (say 1024 or so) and filling that one-by-one from your giant array, casting each time:
    Code:
    for offset = 0 to 5e10
        for i from 0 to 1023
            small_array[i] = (uint16_t) big_array[offset + i)
        fwrite(small_array, sizeof(small_array), 1024, fileIn)
        offset += 1024
    On a 64-bit machine, this could save you 75% alone in the binary file format, before compression.
    Last edited by anduril462; 10-17-2012 at 11:18 AM.

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,666
    > In any case both the text and the binary are too large to be transported over the net.
    How many copies, over what distance, in what timeframe?
    throughput and latency

    > so my question is can anyone give my an advice on how to encode this particular case
    In order to answer this, we would need to know how you arrived at so much data to begin with.
    Are there any peculiar or special properties of the data which might lead to novel compression approaches.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Registered User
    Join Date
    May 2012
    Location
    Arizona, USA
    Posts
    956
    Quote Originally Posted by Salem View Post
    > In any case both the text and the binary are too large to be transported over the net.
    How many copies, over what distance, in what timeframe?
    throughput and latency
    My favorite quote on that page:
    Quote Originally Posted by Andrew Tanenbaum
    Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

Popular pages Recent additions subscribe to a feed

Similar Threads

  1. Replies: 15
    Last Post: 12-23-2011, 11:35 AM
  2. Advice needed
    By Incantrix in forum Game Programming
    Replies: 4
    Last Post: 10-18-2010, 08:39 PM
  3. Help needed with VERY large numbers.
    By jwarner in forum C++ Programming
    Replies: 4
    Last Post: 01-18-2004, 12:01 PM
  4. Job advice needed regarding C++
    By joeyzt in forum A Brief History of Cprogramming.com
    Replies: 2
    Last Post: 06-20-2003, 01:10 AM
  5. advice needed
    By unregisterd in forum C Programming
    Replies: 3
    Last Post: 10-19-2002, 07:04 AM