I'm currently writing a data conversion program and would like to get better performance. In case the environment matters: the language is plain C on Solaris 10, running on an E25K with 8 CPUs and 16 GB of RAM allotted to my zone. I do not have admin rights on the box. I am compiling 64-bit with lots of compiler options for performance.
The process is very linear and most of the optimization examples I find are for making loops run in parallel and such. Well, I don't have any loops. I'm moving a lot of data from a set of source files, doing some transformation and validation, then writing to the appropriate target file. No recursion or matrix math here...
My initial test program simply spun through the source files and wrote empty target files. With that, I was able to process about 70,000 source records per second, which was acceptable and on par with the speed of simply copying the disk files from one place to another.
Once I started adding logic, the records per second started to drop drastically. I expected this to some degree, but adding just the basic initial logic cut the records per second in half, and after that the performance dropped in a pretty linear fashion as I added transformation logic. Mind you, most of the logic is moving source fields to the target, space padding the target, validating a date range, etc. Nothing complex by any stretch of the imagination; there are just a lot of fields.
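For illustration, the kind of per-field logic described above might look like the minimal sketch below. The helper names `copy_padded` and `date_in_range`, and the assumption that dates are 8-byte YYYYMMDD fields, are hypothetical and not taken from the actual program.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a fixed-width source field into a fixed-width target field.
 * The source is truncated if it is longer than the target, and the
 * target is space padded on the right if the source is shorter.
 * Neither field is NUL-terminated. */
static void copy_padded(char *dst, size_t dst_len,
                        const char *src, size_t src_len)
{
    size_t n = src_len < dst_len ? src_len : dst_len;

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, n);
    memset(dst + n, ' ', dst_len - n);
}

/* Return 1 if the 8-byte field d (assumed layout: YYYYMMDD) contains
 * only digits and lies within [lo, hi] inclusive, else 0.
 * With most significant digits first, a byte-wise compare orders dates
 * correctly. This does not check that month and day are valid. */
static int date_in_range(const char *d, const char *lo, const char *hi)
{
    size_t i;

    for (i = 0; i < 8; i++)
        if (d[i] < '0' || d[i] > '9')
            return 0;
    return memcmp(d, lo, 8) >= 0 && memcmp(d, hi, 8) <= 0;
}
```

Each call touches only a few bytes, so the cost per record should grow roughly with the number of fields, which would be consistent with the linear slowdown described above.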
Before I spend a lot of time trying to multi-thread the application, I wanted to see if my expectations are realistic. My thinking is that 8 CPUs should be able to keep up with the disk subsystem and that my conversion should not take any longer than the amount of time it takes to simply copy the data from one point to another. Possible?
Currently I'm processing like this:
1. mmap open all sources (there are about 10 to 15 depending)
2. collect counts of all source records in a given "set"
3. wait for any previous targets to finish writing to disk
4. process the current set of source records and write target records to memory buffers for each target
5. when a given target buffer is full, aiowrite to the target file
6. while there are source records remaining, go to step 2
Basically, I use aiowrite to get a little free asynchronous operation: any target buffers that are ready can be written while the next set of source records is being grouped (that is, read from the mmap'd source files). I also try to keep things as fast as possible by not moving the data more than necessary. Usually my transformation logic can move the data directly from the mmap'd file to the target buffer, and in other cases only a single move of the data needs to be done.
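Steps 3 to 5 above can be sketched as a double-buffered target, where one buffer is filled while the other is being written. Note that this sketch swaps in the portable POSIX `aio_write` interface rather than the Solaris `aiowrite` call the program actually uses, and that `struct target`, `target_put`, `target_flush`, `target_wait`, and `BUF_SIZE` are hypothetical names chosen for illustration. Depending on the platform, linking may need `-lrt`.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define BUF_SIZE (64 * 1024)   /* illustrative size, not a tuned value */

/* One target file with two buffers: one being filled, one being written. */
struct target {
    int          fd;
    off_t        offset;    /* file offset for the next write */
    int          active;    /* index of the buffer being filled */
    size_t       used;      /* bytes in the active buffer */
    int          pending;   /* nonzero while an async write is in flight */
    struct aiocb cb;
    char         buf[2][BUF_SIZE];
};

static void target_init(struct target *t, int fd)
{
    memset(t, 0, sizeof *t);
    t->fd = fd;
}

/* Step 3: block until the previous write to this target has finished. */
static int target_wait(struct target *t)
{
    const struct aiocb *const list[1] = { &t->cb };
    int err;
    ssize_t n;

    if (!t->pending)
        return 0;
    while ((err = aio_error(&t->cb)) == EINPROGRESS)
        aio_suspend(list, 1, NULL);
    n = aio_return(&t->cb);
    t->pending = 0;
    if (err != 0 || n != (ssize_t)t->cb.aio_nbytes) {
        fprintf(stderr, "async write failed or was short\n");
        return -1;
    }
    return 0;
}

/* Step 5: start an async write of the active buffer, then switch buffers. */
static int target_flush(struct target *t)
{
    if (target_wait(t) != 0)      /* the other buffer must be free first */
        return -1;
    if (t->used == 0)
        return 0;
    memset(&t->cb, 0, sizeof t->cb);
    t->cb.aio_fildes = t->fd;
    t->cb.aio_buf    = t->buf[t->active];
    t->cb.aio_nbytes = t->used;
    t->cb.aio_offset = t->offset;
    t->cb.aio_sigevent.sigev_notify = SIGEV_NONE;
    if (aio_write(&t->cb) != 0)
        return -1;
    t->pending = 1;
    t->offset += (off_t)t->used;
    t->active ^= 1;
    t->used = 0;
    return 0;
}

/* Step 4: append one target record, flushing first if it would not fit. */
static int target_put(struct target *t, const void *rec, size_t len)
{
    assert(len <= BUF_SIZE);
    if (t->used + len > BUF_SIZE && target_flush(t) != 0)
        return -1;
    memcpy(t->buf[t->active] + t->used, rec, len);
    t->used += len;
    return 0;
}
```

With two buffers per target, `target_put` only blocks when a buffer fills before the previous write has completed, which is the case where the disk, not the CPU, is the bottleneck. At end of input, a final `target_flush` followed by `target_wait` drains whatever is left.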
What I think I would like to do is create a thread that groups the source record sets into 8 independent memory locations. This thread's job is to simply keep those group locations full. Then 8 worker threads would pick the next source "set" from the pool and process it, and only have to sync on a mutex when writing to the target file.
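A rough sketch of that design with POSIX threads is shown below, assuming a bounded pool of 8 slots guarded by one mutex and two condition variables. Everything named here (`struct record_set`, `struct pool`, `pool_put`, `pool_get`, `run_pipeline`) is hypothetical, and the "write" under the target mutex is reduced to a byte counter so the example stays self-contained.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NSLOTS   8   /* independent memory locations for grouped sets */
#define NWORKERS 8   /* one worker per CPU */

/* Hypothetical descriptor of one grouped source record "set". */
struct record_set { const char *data; size_t len; };

struct pool {
    struct record_set slot[NSLOTS];
    int head, tail, count;         /* ring buffer state */
    int done;                      /* set once the sources are exhausted */
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
    pthread_mutex_t target_lock;   /* serializes writes to the target */
    size_t bytes_written;          /* stand-in for real target output */
};

static void pool_init(struct pool *p)
{
    p->head = p->tail = p->count = p->done = 0;
    p->bytes_written = 0;
    pthread_mutex_init(&p->lock, NULL);
    pthread_mutex_init(&p->target_lock, NULL);
    pthread_cond_init(&p->not_empty, NULL);
    pthread_cond_init(&p->not_full, NULL);
}

/* Grouper side: add one set, blocking while all slots are full. */
static void pool_put(struct pool *p, struct record_set rs)
{
    pthread_mutex_lock(&p->lock);
    while (p->count == NSLOTS)
        pthread_cond_wait(&p->not_full, &p->lock);
    p->slot[p->tail] = rs;
    p->tail = (p->tail + 1) % NSLOTS;
    p->count++;
    pthread_cond_signal(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
}

/* Grouper side: announce that no more sets will arrive. */
static void pool_finish(struct pool *p)
{
    pthread_mutex_lock(&p->lock);
    p->done = 1;
    pthread_cond_broadcast(&p->not_empty);
    pthread_mutex_unlock(&p->lock);
}

/* Worker side: returns 1 with *rs filled, or 0 when the pool is drained. */
static int pool_get(struct pool *p, struct record_set *rs)
{
    int got = 0;

    pthread_mutex_lock(&p->lock);
    while (p->count == 0 && !p->done)
        pthread_cond_wait(&p->not_empty, &p->lock);
    if (p->count > 0) {
        *rs = p->slot[p->head];
        p->head = (p->head + 1) % NSLOTS;
        p->count--;
        got = 1;
        pthread_cond_signal(&p->not_full);
    }
    pthread_mutex_unlock(&p->lock);
    return got;
}

static void *worker(void *arg)
{
    struct pool *p = arg;
    struct record_set rs;

    while (pool_get(p, &rs)) {
        /* Transformation and validation would run here, outside any lock. */
        pthread_mutex_lock(&p->target_lock);
        p->bytes_written += rs.len;   /* placeholder for the target write */
        pthread_mutex_unlock(&p->target_lock);
    }
    return NULL;
}

/* Run n pre-grouped sets through the pool; the caller acts as the grouper. */
static size_t run_pipeline(const struct record_set *sets, size_t n)
{
    struct pool p;
    pthread_t tid[NWORKERS];
    size_t i;
    int rc;

    pool_init(&p);
    for (i = 0; i < NWORKERS; i++) {
        rc = pthread_create(&tid[i], NULL, worker, &p);
        assert(rc == 0);
        (void)rc;
    }
    for (i = 0; i < n; i++)
        pool_put(&p, sets[i]);
    pool_finish(&p);
    for (i = 0; i < NWORKERS; i++)
        pthread_join(tid[i], NULL);
    return p.bytes_written;
}
```

One thing this layout does not preserve is ordering: workers finish sets in whatever order they happen to complete, so target records may land in a different order than the source sets unless each set carries a sequence number or a precomputed target offset.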
Any insight or feedback would be greatly appreciated.