I am working with massive collections of HDF (Hierarchical Data Format) files, each of which is essentially a database of matrices. I have a program that gathers statistics on these files and verifies them against one another. I need some guidelines on optimizing such a program, and on optimization in general. Is there a book specializing in C++ optimization that you could recommend?

An HDF file is structured roughly like this:

HDF file : folders (groups) : datasets : data. A dataset is similar to a matrix or array.

One of my statistical HDF files records the largest values seen across a given set of HDF files, all of which share a definite, common structure.

One idea I've had: instead of loading one HDF file and then updating the archive that keeps track, like this:

File 1: load the archive, check whether each value is the largest (for each member of each matrix in the hierarchy), and if it is, write it to the archive HDF.
Repeat n times.
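The per-file update step above could be sketched like this. This is a minimal sketch, not real HDF code: I'm assuming the archive and each input file have been flattened into a `std::vector<double>` (in the real program these would come from HDF dataset reads), and `updateArchiveMax` is a hypothetical name.

```cpp
#include <cstddef>
#include <vector>

// Sketch: merge one file's values into the running archive of maxima.
// Returns true if any archive entry was raised, i.e. the archive on disk
// would need to be rewritten for this file.
bool updateArchiveMax(std::vector<double>& archive, const std::vector<double>& file) {
    bool changed = false;
    for (std::size_t i = 0; i < archive.size() && i < file.size(); ++i) {
        if (file[i] > archive[i]) {
            archive[i] = file[i];
            changed = true;
        }
    }
    return changed;
}
```

The return value matters for the write cost: if no entry changed, the archive write for that file could be skipped entirely.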

do this instead:

Files 1-10: load 10 files into memory at a time, find the largest value among the 10, compare it to the current archive value (for each member of each matrix in the hierarchy), then write the maximum of the 10 to the archive only ONCE.
Repeat n/10 times.
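The batched variant reduces the 10 files to a single elementwise maximum in memory first, so the archive is read and written once per batch instead of once per file. A sketch, again with datasets flattened to `std::vector<double>` and the function name `batchMax` purely hypothetical:

```cpp
#include <cstddef>
#include <vector>

// Sketch: reduce a batch of k decoded files to one elementwise maximum.
// The result is then merged into the archive with a single compare-and-write
// pass, instead of k separate archive updates.
std::vector<double> batchMax(const std::vector<std::vector<double>>& batch) {
    std::vector<double> result = batch.front();  // start from the first file
    for (std::size_t f = 1; f < batch.size(); ++f)
        for (std::size_t i = 0; i < result.size(); ++i)
            if (batch[f][i] > result[i])
                result[i] = batch[f][i];
    return result;
}
```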

I believe the second option would be faster; what do you think?


I could also load one data value from one matrix of ALL the datasets, and then write the largest to the archive. But if the program were to crash, it would have to start over from scratch. With the previous options, it should be possible to resume the scanning/comparing/validating process after each HDF file, or after each set of 10 HDF files.
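The resumability of the first two options could rest on a tiny checkpoint: after each file (or batch) is merged into the archive, persist the index of the last completed input, and on restart continue from the next one. A sketch, with the file path and function names being assumptions of mine:

```cpp
#include <fstream>
#include <string>

// Sketch: persist the index of the last fully processed input file, so a
// crashed run can resume at lastDone + 1 instead of rescanning everything.
void saveCheckpoint(const std::string& path, long lastDone) {
    std::ofstream out(path, std::ios::trunc);
    out << lastDone;
}

long loadCheckpoint(const std::string& path) {
    std::ifstream in(path);
    long lastDone = -1;  // -1 means "nothing processed yet"
    if (in)
        in >> lastDone;
    return lastDone;
}
```

For this to be safe, the checkpoint should only be written after the archive update for that file/batch has actually reached disk.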

What is the best way (good practice) and the fastest way (optimization) to update a massive database such as this?

Summary of strategies:

1. Load one HDF file at a time and compare it to the archive.
2. Load 10 HDF files at a time and compare their combined maximum to the archive.
3. Load all HDF files (given parameters), but only one data value from each at a time.

Could you please comment on the efficiency and safety of each strategy?