Thread: Reading in large arrays ...

  1. #1
    Registered User
    Join Date
    Jul 2006
    Posts
    13

    Reading in large arrays ...

    I'm looking at the ITA problem "BitVector Genealogy" at Current ITA Software Hiring Puzzles.

    It involves 10,000 vectors of 10,000 bits each. I'm not used to dealing with this much data. My plan is to read in all the vectors and then build an array of ints holding the distance between each pair of vectors. Specifically, I'd read the bits into an array of strings (or a 2D array of bits) and compare two strings at a time, counting the indices where their values differ.

    The question I have is this:

    Is the best way to store these distances just to define a 10000x10000 array of ints (call it distAB)? Obviously distAB(x,y) = distAB(y,x) and distAB(x,x) = 0, so I could allocate it dynamically and only store half, but it still seems like a lot of memory, especially while it's being built from the array of 10,000 vectors (which I can free afterwards). Is this the proper way to do things?
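
    For what it's worth, the comparison routine I have in mind is only a few lines. A minimal sketch of the string version (N and hamming_str are my own names, not from the puzzle):

    Code:
    #define N 10000  /* bits per vector */

    /* count the indices at which two '0'/'1' strings differ */
    static int hamming_str(const char *a, const char *b)
    {
        int dist = 0;
        for (int i = 0; i < N; i++)
            if (a[i] != b[i])
                dist++;
        return dist;
    }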

  2. #2
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    If it's a bit vector, you need 10000/8 bytes per vector, not ints. Use unsigned char and set bits: 8 bits per char, 1250 bytes per individual. So the whole matrix will only be 12.5 MB, which is not so much. I just downloaded the data for that problem and noticed each bit is represented by a character (100 MB of data). You could go that route too, but methinks real bits will be better.

    You will still want to malloc() that on the heap rather than declaring it on the stack, I think.
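
    A quick sketch of what I mean, assuming the input is one line of 10000 '0'/'1' characters per individual (pack_bits and hamming_packed are my names; __builtin_popcount is a gcc extension, a small lookup table works elsewhere):

    Code:
    #include <stdlib.h>

    #define NBITS  10000
    #define NBYTES (NBITS / 8)   /* 1250 bytes per individual */

    /* pack a line of '0'/'1' chars into bytes; out must start zeroed */
    void pack_bits(const char *line, unsigned char *out)
    {
        for (int i = 0; i < NBITS; i++)
            if (line[i] == '1')
                out[i / 8] |= (unsigned char)(1u << (i % 8));
    }

    /* Hamming distance between two packed vectors: XOR, then popcount */
    int hamming_packed(const unsigned char *a, const unsigned char *b)
    {
        int dist = 0;
        for (int i = 0; i < NBYTES; i++)
            dist += __builtin_popcount((unsigned)(a[i] ^ b[i]));
        return dist;
    }

    int main(void)
    {
        /* 10000 individuals * 1250 bytes = 12.5 MB, on the heap */
        unsigned char *vecs = calloc(10000, NBYTES);
        if (!vecs)
            return 1;
        /* ... read each line i, then pack_bits(line, vecs + i * NBYTES) ... */
        free(vecs);
        return 0;
    }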
    Last edited by MK27; 04-26-2010 at 07:42 AM.

  3. #3
    Registered User
    Join Date
    Jul 2006
    Posts
    13
    Okay, so reading in is fine with a [1..10000][1..10000] array of bits at 12.5 MB.

    Determining the Hamming distance between each pair of vectors will require another [1..10000][1..10000] array of values up to 10000. With 4-byte ints that's actually 400 MB, though since no distance can exceed 10000, a 16-bit type with only the upper triangle stored gets it down to about 100 MB. I guess that's not too much memory; I just didn't know whether there was a better way to deal with (what I consider to be) large arrays of data.
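
    Something like this is what I had in mind for the triangular storage (tri_index is my own name for the upper-triangle mapping):

    Code:
    #include <stdint.h>
    #include <stdlib.h>

    #define N 10000

    /* map (i,j) with i < j into a flat upper-triangle array */
    static size_t tri_index(size_t i, size_t j)
    {
        return i * (2 * N - i - 1) / 2 + (j - i - 1);
    }

    int main(void)
    {
        /* N*(N-1)/2 entries of 2 bytes each: about 100 MB */
        uint16_t *dist = malloc((size_t)N * (N - 1) / 2 * sizeof *dist);
        if (!dist)
            return 1;
        /* dist[tri_index(x, y)] holds the distance between vectors x and y, x < y */
        free(dist);
        return 0;
    }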

    Thanks for your reply.

  4. #4
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Interesting problem. Unfortunately they don't have an office in Texas.
    The way I was thinking about it also involved finding distances, or rather matches. Just be careful: it says there's an 80% probability of corresponding bits being the same, which means the closest match is a vector that's about 80% identical, not 100%.

    Apart from that, I would try to make pairs, then pair up the pairs, and so on until I find the chain of the clones. Of course finding the absolute best solution is very difficult, but there are approaches that should work (at least in my head); a rough first step is sketched below.
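
    Roughly what's in my head for that first step (all names are mine, and dist() stands for whatever distance lookup gets built earlier in the thread):

    Code:
    #include <limits.h>

    #define N 10000

    /* assumed: distance between vectors i and j from the precomputed table */
    extern int dist(int i, int j);

    /* for every vector, record its single closest match as a candidate relative */
    void best_matches(int *match)
    {
        for (int i = 0; i < N; i++) {
            int best = -1, best_d = INT_MAX;
            for (int j = 0; j < N; j++) {
                if (j == i)
                    continue;
                int d = dist(i, j);
                if (d < best_d) {
                    best_d = d;
                    best = j;
                }
            }
            match[i] = best;
        }
    }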
