1. Reading in large arrays ...

I'm looking at the ITA problem "BitVector Genealogy" at

Current ITA Software Hiring Puzzles

It involves 10,000 vectors of 10,000 bits each. I'm not used to dealing with this much data. My plan is to read in all the vectors and then create an array of ints to store the distance between each pair of vectors. Currently I intend to read the bits into an array of strings (or a 2-D array of bits) and compare two strings to find which indices have different values.

The question I have is this:

Is the best way to store these distances just to define an array (call it distAB) that's a 10000x10000 array of ints? Obviously distAB(x,y) = distAB(y,x) and distAB(x,x) = 0, so I could allocate only half of it, but it still seems like a lot of memory, especially while I also have the array of 10,000 vectors in memory to build it from (which I can then get rid of). Is this the proper way to do things?

2. If it's a bit vector, you need 10000/8 bytes per vector, not ints. Use unsigned char and set individual bits: 8 bits per char, 1250 bytes per individual. The whole bit matrix (10,000 vectors x 1250 bytes) will then be only 12.5 MB, which is not much. I just downloaded the data for that problem and noticed it is a string of chars representing each bit (100 MB of data). You could go that route too, but methinks real bits will be better.

You will still want to malloc() that on the heap rather than declaring it on the stack, I think.
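A minimal sketch of what that packing and comparison could look like, assuming GCC/Clang (for the `__builtin_popcount` intrinsic) and that the input is the '0'/'1' character strings described above; the function names here are just illustrative:

```c
#include <stdlib.h>
#include <string.h>

#define NBITS  10000
#define NBYTES ((NBITS + 7) / 8)   /* 1250 bytes per individual */

/* Pack a string of '0'/'1' characters into a byte array, 8 bits per char. */
void pack_bits(const char *s, unsigned char *out)
{
    memset(out, 0, NBYTES);
    for (int i = 0; i < NBITS; i++)
        if (s[i] == '1')
            out[i / 8] |= (unsigned char)(1u << (i % 8));
}

/* Hamming distance between two packed vectors: XOR the bytes, count set bits. */
int hamming(const unsigned char *a, const unsigned char *b)
{
    int d = 0;
    for (int i = 0; i < NBYTES; i++)
        d += __builtin_popcount((unsigned)(a[i] ^ b[i]));  /* GCC/Clang builtin */
    return d;
}
```

With the vectors packed like this, each pairwise distance is 1250 XORs plus popcounts instead of 10,000 character comparisons.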

3. Okay, so reading in is good with an array [1..10000][1..10000] of bits for 12.5 megs.

Determining the Hamming distance between each pair of these vectors will require another array [1..10000][1..10000] of a number up to 10000. With 4-byte ints that's 400 MB; a 16-bit integer is enough for values up to 10000, which brings it down to 200 MB, and storing only one triangle (the matrix is symmetric) halves that again to roughly 100 MB. I guess that's not too much memory, I just didn't know if there was a better way to deal with (what I consider to be) large arrays of data.
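One common way to get that symmetric matrix down to a single triangle is to store only the entries with i < j in a flat heap-allocated array and compute the offset arithmetically. A sketch, assuming distances fit in 16 bits (they can't exceed 10000) and using an illustrative `tri_index` helper:

```c
#include <stdint.h>
#include <stdlib.h>

#define N 10000

/* Flat index of entry (i, j) with i < j in a packed upper triangle that
   excludes the diagonal. Row i starts after i*(2N - i - 1)/2 entries
   (the lengths N-1, N-2, ... of the rows above it, summed). */
size_t tri_index(size_t i, size_t j)
{
    return i * (2 * (size_t)N - i - 1) / 2 + (j - i - 1);
}

/* N*(N-1)/2 entries of 2 bytes each: about 100 MB on the heap. */
uint16_t *alloc_dist_triangle(void)
{
    return malloc(((size_t)N * (N - 1) / 2) * sizeof(uint16_t));
}
```

A lookup for arbitrary (x, y) then just swaps the arguments so the smaller index comes first, and returns 0 when x == y.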