Thread: Reading in large arrays ...

  1. #1
    Registered User
    Join Date
    Jul 2006
    Posts
    13

    Reading in large arrays ...

    I'm looking at the ITA problem "BitVector Genealogy" at Current ITA Software Hiring Puzzles.

    It involves 10,000 vectors of 10,000 bits each. I'm not used to dealing with this much data. My plan is to read in all the vectors and then build an array of ints holding the distance between each pair of vectors. Specifically, I'd read the bits into an array of strings (or a 2D array of bits) and compare two strings at a time, counting the indices where their values differ.

    The question I have is this:

    Is the best way to store these distances just to define a 10000x10000 array of ints (call it distAB)? Obviously distAB(x,y) = distAB(y,x) and distAB(x,x) = 0, so I could allocate it dynamically and only store half, but it still seems like a lot of memory, especially while it's being built from the array of 10,000 vectors (which I can free afterwards). Is this the proper way to do things?
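
    For what it's worth, the comparison routine I have in mind is only a few lines. A minimal sketch of the string version (N and hamming_str are my own names, not from the puzzle):

    Code:
    #define N 10000  /* bits per vector */

    /* count the indices at which two '0'/'1' strings differ */
    static int hamming_str(const char *a, const char *b)
    {
        int dist = 0;
        for (int i = 0; i < N; i++)
            if (a[i] != b[i])
                dist++;
        return dist;
    }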

  2. #2
    spurious conceit MK27's Avatar
    Join Date
    Jul 2008
    Location
    segmentation fault
    Posts
    8,300
    If it's a bit vector, you need 10000/8 bytes per vector, not ints. Use unsigned char and set bits: 8 bits per char, 1250 bytes per individual. So the whole matrix will only be 12.5 MB, which is not so much. I just downloaded the data for that problem and noticed each bit is represented by a character (100 MB of data). You could go that route too, but methinks real bits will be better.

    You will still want to malloc() that on the heap rather than declaring it on the stack, I think.
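
    A quick sketch of what I mean, assuming the input is one line of 10000 '0'/'1' characters per individual (pack_bits and hamming_packed are my names; __builtin_popcount is a gcc extension, a small lookup table works elsewhere):

    Code:
    #include <stdlib.h>

    #define NBITS  10000
    #define NBYTES (NBITS / 8)   /* 1250 bytes per individual */

    /* pack a line of '0'/'1' chars into bytes; out must start zeroed */
    void pack_bits(const char *line, unsigned char *out)
    {
        for (int i = 0; i < NBITS; i++)
            if (line[i] == '1')
                out[i / 8] |= (unsigned char)(1u << (i % 8));
    }

    /* Hamming distance between two packed vectors: XOR, then popcount */
    int hamming_packed(const unsigned char *a, const unsigned char *b)
    {
        int dist = 0;
        for (int i = 0; i < NBYTES; i++)
            dist += __builtin_popcount((unsigned)(a[i] ^ b[i]));
        return dist;
    }

    int main(void)
    {
        /* 10000 individuals * 1250 bytes = 12.5 MB, on the heap */
        unsigned char *vecs = calloc(10000, NBYTES);
        if (!vecs)
            return 1;
        /* ... read each line i, then pack_bits(line, vecs + i * NBYTES) ... */
        free(vecs);
        return 0;
    }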
    Last edited by MK27; 04-26-2010 at 07:42 AM.

  3. #3
    Registered User
    Join Date
    Jul 2006
    Posts
    13
    Okay, so reading in is fine with a [1..10000][1..10000] array of bits at 12.5 MB.

    Determining the Hamming distance between each pair of vectors will require another [1..10000][1..10000] array of values up to 10000. With 4-byte ints that's actually 400 MB, though since no distance can exceed 10000, a 16-bit type with only the upper triangle stored gets it down to about 100 MB. I guess that's not too much memory; I just didn't know whether there was a better way to deal with (what I consider to be) large arrays of data.
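
    Something like this is what I had in mind for the triangular storage (tri_index is my own name for the upper-triangle mapping):

    Code:
    #include <stdint.h>
    #include <stdlib.h>

    #define N 10000

    /* map (i,j) with i < j into a flat upper-triangle array */
    static size_t tri_index(size_t i, size_t j)
    {
        return i * (2 * N - i - 1) / 2 + (j - i - 1);
    }

    int main(void)
    {
        /* N*(N-1)/2 entries of 2 bytes each: about 100 MB */
        uint16_t *dist = malloc((size_t)N * (N - 1) / 2 * sizeof *dist);
        if (!dist)
            return 1;
        /* dist[tri_index(x, y)] holds the distance between vectors x and y, x < y */
        free(dist);
        return 0;
    }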

    Thanks for your reply.

  4. #4
    Registered User C_ntua's Avatar
    Join Date
    Jun 2008
    Posts
    1,853
    Interesting problem. Unfortunately they don't have an office in Texas.
    The way I was thinking about it also involved finding distances, or rather matches. Just be careful: it says there's an 80% probability of corresponding bits being the same, which means the closest match is a vector that's about 80% identical, not 100%.

    Apart from that, I would try to make pairs, then pair up the pairs, and so on until I find the chain of the clones. Of course finding the absolute best solution is very difficult, but there are approaches that should work (at least in my head); a rough first step is sketched below.
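
    Roughly what's in my head for that first step (all names are mine, and dist() stands for whatever distance lookup gets built earlier in the thread):

    Code:
    #include <limits.h>

    #define N 10000

    /* assumed: distance between vectors i and j from the precomputed table */
    extern int dist(int i, int j);

    /* for every vector, record its single closest match as a candidate relative */
    void best_matches(int *match)
    {
        for (int i = 0; i < N; i++) {
            int best = -1, best_d = INT_MAX;
            for (int j = 0; j < N; j++) {
                if (j == i)
                    continue;
                int d = dist(i, j);
                if (d < best_d) {
                    best_d = d;
                    best = j;
                }
            }
            match[i] = best;
        }
    }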
