Thread: Threads to keep the CPU faster than the disk?

  1. #1
    Registered User
    Join Date
    Jun 2007
    Location
    Michigan
    Posts
    12

    Threads to keep the CPU faster than the disk?

    Greetings,

    I'm currently writing a data conversion program and would like to get better performance out of it. In case the environment matters: the language is straight C on Solaris 10, an E25K with 8 CPUs and 16GB of RAM allotted to my zone. I do not have admin on the box. I am compiling 64-bit with lots of compiler options for performance.

    The process is very linear and most of the optimization examples I find are for making loops run in parallel and such. Well, I don't have any loops. I'm moving a lot of data from a set of source files, doing some transformation and validation, then writing to the appropriate target file. No recursion or matrix math here...

    I wrote an initial test program that basically spins through the source files and writes empty target files. Doing this I was able to process about 70,000 source records per second, which was acceptable and on par with the speed of simply copying the disk files from one place to another.

    Once I started adding logic, the records per second started to drop drastically. I expected this to some degree, but adding just the basic initial logic cut the rate in half, and after that performance dropped in a pretty linear fashion as I added transformation logic. Mind you, most of the logic is moving source to target and space-padding the target, validating a date range, etc. Nothing complex by any stretch of the imagination; there are just a lot of fields.

    Before I spend a lot of time trying to multi-thread the application, I wanted to see if my expectations are realistic. My thinking is that 8 CPUs should be able to keep up with the disk subsystem, and that the conversion should take no longer than simply copying the data from one place to another. Possible?

    Currently I'm processing like this:

    1. mmap open all sources (there are about 10 to 15 depending)
    2. collect counts of all source records in a given "set"
    3. wait for any previous targets to finish writing to disk
    4. process the current set of source records and write target records to memory buffers for each target
    5. when a given target buffer is full, aiowrite to the target file
    6. while there are source records, goto step 2

    Basically I use aiowrite to get a little async operation for free: any target buffers that are ready to be written can go out while the next set of source records is being grouped (read from the mmap'd source files). I also try to keep things fast by not moving the data more than necessary. Usually my transformation logic can move the data directly from the mmap'd file to the target buffer; in the other cases only a single move of the data is needed.
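
    In case it helps to see the shape of it, here's a stripped-down sketch of the double-buffer scheme (BUFSZ is arbitrary, fill_buffer() stands in for the real record-packing logic, and error handling is trimmed; the real code keeps a pair of buffers per target file):

    /* While one target buffer is in flight via aiowrite(), records are
     * packed into the other.  Solaris async I/O: link with -laio. */
    #include <sys/asynch.h>   /* aiowrite, aiowait, aio_result_t */
    #include <sys/types.h>
    #include <unistd.h>       /* SEEK_SET */

    #define BUFSZ (1 << 20)

    extern int fill_buffer(char *buf, int max);  /* hypothetical: packs records */

    static char bufs[2][BUFSZ];
    static aio_result_t res[2];

    void write_target(int fd)
    {
        off_t off = 0;
        int cur = 0, pending = 0;
        int n;

        while ((n = fill_buffer(bufs[cur], BUFSZ)) > 0) {
            if (pending)               /* keep at most one write in flight */
                aiowait(NULL);         /* reclaims the other buffer */
            aiowrite(fd, bufs[cur], n, off, SEEK_SET, &res[cur]);
            pending = 1;
            off += n;
            cur = !cur;                /* pack the idle buffer next */
        }
        if (pending)
            aiowait(NULL);             /* drain the last write */
    }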

    What I think I would like to do is create a thread that groups the source record sets into 8 independent memory locations. This thread's job is simply to keep those group locations full. Then 8 worker threads would each pick the next source "set" from the pool and process it, syncing on a mutex only when writing to the target files.
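
    In rough pthreads terms, the plan would look something like this (set_t's contents, next_group(), and process_set() are stand-ins for my real structures and logic; thread startup and shutdown are omitted):

    #include <pthread.h>
    #include <sys/types.h>

    #define NSLOTS   8
    #define NWORKERS 8
    #define MAX_SRC  16

    typedef struct {                /* one grouped "set" of source records */
        off_t start[MAX_SRC];       /* starting offset in each source file */
        int   count[MAX_SRC];       /* related record count in each source */
    } set_t;

    extern int  next_group(set_t *s);        /* hypothetical: scan mmap'd sources */
    extern void process_set(const set_t *s); /* hypothetical: transform + write */

    static set_t slots[NSLOTS];
    static int head, tail, nfull;
    static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

    void *grouper(void *unused)     /* one of these: keeps the slots full */
    {
        set_t s;
        while (next_group(&s)) {
            pthread_mutex_lock(&lock);
            while (nfull == NSLOTS)
                pthread_cond_wait(&not_full, &lock);
            slots[head] = s;
            head = (head + 1) % NSLOTS;
            nfull++;
            pthread_cond_signal(&not_empty);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    void *worker(void *unused)      /* NWORKERS of these: drain the slots */
    {
        set_t s;
        for (;;) {                  /* shutdown signalling omitted */
            pthread_mutex_lock(&lock);
            while (nfull == 0)
                pthread_cond_wait(&not_empty, &lock);
            s = slots[tail];
            tail = (tail + 1) % NSLOTS;
            nfull--;
            pthread_cond_signal(&not_full);
            pthread_mutex_unlock(&lock);
            process_set(&s);        /* only the target-file writes need a mutex */
        }
        return NULL;
    }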

    Any insight or feedback would be greatly appreciated.

    Thanks,
    Matthew

  2. #2
    Algorithm Dissector iMalc
    Join Date
    Dec 2005
    Location
    New Zealand
    Posts
    6,318
    Quote Originally Posted by matthew180
    Greetings,
    Well, I don't have any loops.
    Hehe, I had a good laugh at that one.
    And then saying "No Recursion" - Oh please stop, you're killing me!
    My homepage
    Advice: Take only as directed - If symptoms persist, please see your debugger

    Linus Torvalds: "But it clearly is the only right way. The fact that everybody else does it some other way only means that they are wrong"

  3. #3
    and the hat of int overfl Salem
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    Have you benchmarked the conversion, without the overhead of reading and writing files?

    If you can process faster than you can perform I/O, then all you need is two threads, because one will finish calculating before the other finishes the I/O.
    If processing is slower, then you need as many threads as it takes to bring the processing rate up to the I/O rate.

    Is each file (or file set) an independent unit, or do you need to merge the contents of several input files (or file sets) in some way?
    If they're totally independent, then threads will be a lot easier to work with, otherwise some horrible performance-sucking locking mechanisms may be needed.
    Even if they are not independent, could they be made independent with a final "merge" step (in the same way that a file can be sorted in fragments and the intermediate results merged to produce the result)?

    > mmap open all sources (there are about 10 to 15 depending)
    If you're just reading the files sequentially (not doing lots of seeking), I'm not sure that it really helps you. The really expensive bit is when you touch a page for the first time and then you have to wait while the OS loads that page (and maybe the next few as well for some read-ahead prediction). After you're done with it, it just sits there in memory until the OS needs that page for something else.
    A disk read to a fixed buffer, followed by a memcpy to your own buffer, looks like a better bet than a disk read to a virtual address followed some time later by a page fault.
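
    i.e. something along these lines, where the blocking happens at a predictable point (buffer size plucked out of the air):

    #include <sys/types.h>
    #include <unistd.h>
    #include <string.h>

    #define RBUFSZ (256 * 1024)

    static char rbuf[RBUFSZ];

    /* Sequential read into one reused buffer: you block here, at the
     * read() call, rather than at a page fault somewhere down the line. */
    ssize_t next_chunk(int fd, char *dst)
    {
        ssize_t n = read(fd, rbuf, sizeof rbuf);
        if (n > 0)
            memcpy(dst, rbuf, n);   /* or parse records straight out of rbuf */
        return n;
    }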

    Do all 8 CPUs share the same disk bandwidth?

    Is /tmp, say,
    a) large enough to hold the results, and
    b) on a separate disk, with a separate controller (not just another partition on the same physical HD as your data files)?
    If so, then you'll gain some interleaving between reading and writing by not thrashing the heads all over the same disk trying to read and write at the same time.

    Also, if the processing of a file (or file set) is independent from another file (or file set), then you could just try
    myprog file1 file2 file3 > result1 &
    myprog file4 file5 file6 > result2 &

    and let the OS just push the concurrent background processes onto different processors.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  4. #4
    Registered User
    Join Date
    Jun 2007
    Location
    Michigan
    Posts
    12
    @salem

    I have done a basic benchmark. When I started writing my code, before I had any conversion logic, my program would read all the source files and write empty targets. I used this as a baseline for how fast the disk subsystem could go, since there was no logic happening other than what was needed to drive the process. That was the initial 70,000 records per second I started off with.

    The source files are in the form of a flattened database: there is one master record file, and the other source files contain records that relate to the records in the master file. Each master record has a unique key, and every related record in the child source files carries that same key.

    Currently I'm "grouping" these records by reading a source master record, then looking in each child source file for related records, keeping the starting offset and record count for each source file. This "set" of records is then processed, and the targets are written. My idea for speeding things up is to move this grouping into its own thread, copying the groups into individual buffers that individual worker threads could then use to produce targets.
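
    The per-set bookkeeping is basically this (names and sizes are invented for the post):

    #include <sys/types.h>

    #define MAX_SOURCES 16

    typedef struct {
        char  key[32];               /* the master record's unique key */
        off_t start[MAX_SOURCES];    /* where the set begins in each source file */
        int   count[MAX_SOURCES];    /* how many related records each source holds */
    } record_set;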

    The target files are likewise flat-file representations of what will become database tables (Oracle, to be exact).

    So there is a natural boundary in the source files: a "set" of source records represents one complete source object as it existed in the legacy system.

    I mmap'd the source files primarily because I didn't want to move the data twice, i.e. once to a local buffer, then again to an output buffer (and a final time to disk). Also, everything I have read about the way Solaris handles mmap indicates that it should be fast. All the source files are read sequentially from start to finish, so each page is only loaded once, and you can use madvise to tell the OS what access pattern to expect. That does make a major impact (better speed), but not enough.
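
    For reference, my per-file setup is essentially this (error handling trimmed):

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Map one source file read-only and tell the VM we'll walk it
     * once, front to back. */
    char *map_source(const char *path, size_t *len)
    {
        struct stat st;
        char *p;
        int fd;

        fd = open(path, O_RDONLY);
        fstat(fd, &st);
        *len = st.st_size;
        p = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                          /* mapping stays valid after close */
        madvise(p, *len, MADV_SEQUENTIAL);  /* sequential read-ahead hint */
        return p;
    }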

    Yes, I believe all 8 CPUs share the same disk subsystem. I do not have physical access to the machine and I'm not completely aware of the physical build. The box sits in a data center somewhere with a 4TB array attached for this conversion (the source data is about 1.1TB). Each group of source files represents a month of data and is generally between 3GB and 10GB total (I have 84 months to process, 4 times per month for special conditions).

    /tmp is not large enough to be used, and unfortunately I cannot do anything about making sure the source and target files get written to different disk subsystems. I don't think the admins who set up the box really knew what they were doing either, so it is probably less than optimal. But I do know that I can copy a month's worth of data in about 2 or 3 minutes, and I'd like my conversion program to run in about that same time (i.e., keep the process I/O bound and make sure the CPUs are running faster than the disks).

    I will already be running multiple instances of the program, so each run needs to process a complete set of source files. The conversion from source set to target set is very linear, and a lot of the data within a set relies on previous data having already been converted.

    Matthew

  5. #5
    Registered User
    Join Date
    May 2007
    Posts
    147
    Salem's right on point....

    ...but I noticed you're using memory-mapped files for I/O?

    That suggests it may be tough to separate the processing from the I/O....

    So, have you checked to see that CPU utilization is pegged (for the thread you're running) when processing rates drop?

    If not, you're not CPU bound, you're I/O bound - and you'll need to rethink the problem.

    If you really are CPU bound, then threading will help, up to the point where you're no longer CPU bound.
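
    On Solaris you can watch that directly with microstate accounting, e.g.

    prstat -mL -p <pid> 5

    If the USR column sits near 100 the thread really is CPU bound; big numbers under SLP or LAT mean it's mostly waiting on I/O or the run queue instead.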

    You're otherwise thinking in the right ballpark with Salem's points in mind.
