mmap scales poorly... help?

Printable View

02-04-2011
graphicsMan

mmap scales poorly... help?

Hi all -

Here's some context:

I have files of some large-ish size consisting of chunks. I need to
load each chunk, do some work (call it munging) on it, and write the
munged chunk to a file. My test file is ~25 MB with about 17,000 chunks.

My system is a Nehalem-EX with 32 physical cores, 64 virtual ones with
hyperthreading. I'm running Ubuntu x86-46 with the 2.6.35-24-generic
kernel. I'm running the system over remote-X from my desk to a server room,
if this has any impact on answers.

Since I have a bunch-o-cores, I'd like to do this processing in
parallel. I have the list of chunks stored in an array, and I started
by synchronizing access with a pthread mutex wrapping a counter. I've
since moved on to using an atomic counter (fetch_and_inc) with no mutex.

My problem appears to be with mmap. I use boost::mapped_region as a
light wrapper around mmap, but I tried mmap directly with no
differences.

I tried two different ways of accessing the chunks: 1, I tried making
a mapped_region for each chunk, and 2, I created a mapped_region for
the entire file, and then simply grab the subset needed for a chunk.

The first way (1) yields the following times for the test file:

Threads | Time (s)
==============
1 | 1.927
2 | 1.299
4 | 1.158
8 | 6.413
16 | 10.217
32 | 11.049
64 | 11.445

The second way (2) yeilds the following times for the test file:

Threads | Time (s)
==============
1 | 1.889
2 | 1.136
4 | 0.774
8 | 0.761
16 | 6.092
32 | 6.116
64 | 6.252

Adding thread affinity DOES help, but not a lot. The trend is
similar, but about 10% faster.

Clearly the second method is preferable when possible (I many need
to do a similar thing keeping many file mappings on subsets of the
chunks open, and on 32-bit machine, the first is more attractive
because I use less virtual memory. On 64-bit, I'm not worried about
running out of my virtual address space :-) ).

The main point of this post, however, is that neither technique scales
past a handful of cores. This is where I hope someone might help.
Does anyone know about bottlenecks or potential bottlenecks with mmap
and multi-threading on more than a few cores? If so, is there a
solution? A kernel runtime tweak or even a kernel compile-time tweak
(I'm willing to build the kernel).

Any information that might be helpful is appreciated, even if it's
that I need to ask the question elsewhere (e.g. where might be more
appropriate?).

Thanks!
02-04-2011
brewbuck

I'm not sure that mmap() has much to do with what you are seeing. It's not expensive inside the kernel to set up a memory map of a file -- it just records a few things, allocates address space, then lets the page fault handler bring in the pages as they are accessed.

With that many cores, chances are you're playing cache ping-pong. Even with thread affinity under control, cache coherency is maintained on a per-cache-line basis, so even if threads never jump from core to core, and even if each thread is working on independent data, it could be that a cache line in one core aliases a cache line in another core -- the cache coherency mechanism will spend a lot of time bouncing cache lines back and forth even though no data is actually shared -- this is called the false sharing problem.

For these sorts of things you really ought to bust out a parallel process analyzer like Intel Parallel Studio.
02-04-2011
graphicsMan

Hey, thanks for the reply. I think you may have something there. If my chunks are typically in the kilobytes, is this still a likely candidate for the problem?

Also, I should note, when I run this in oprofile, the time for 64 threads is 0.88 seconds. Somehow the performance counting distrupts the problem.

I'll look into getting parallel studio.

Thanks again.
02-05-2011
Salem

nmap might be an interesting programming trick to save a bit of buffering work inside the kernel and your code, but it doesn't get past the physical reality of having to read the hard disk.

Seek times are measured in milliseconds, which puts it somewhere between 1E6 and 1E9 times SLOWER than the core of your system. You might have a multi-lane highway, but the only on-ramp is still a dirt track.

If one core can keep the system I/O bound, then adding more cores is only going to increase problems like the one brewbuck mentioned.

If further analysis does show that your system is indeed I/O bound in a bad way, then perhaps configuring your storage as a RAID array would help to boost disk performance.
02-05-2011
graphicsMan

thanks for the reply. From looking at the numbers in the table, we can see that I can get down at least to about the half-second range for the computation. By comparing that to the ~10s and ~5s numbers with 64 threads, we know something bad is happening that is not related to peak disk throughput. My disk is capable of 100 MB/s reads for large enough chunks, and for the file at hand, that's about a quarter second at best. The mapped regions will be accessed in a semi-random fashion, so we're not getting the best possible disk read rate in any case, even with the single-threaded version, but it's clear that we can perform all work in about half a second with the optimal choice of # of threads.
Finally, we should be able to more-or-less neglect the disk read, because in the example, I'd already loaded the file many times, so it was still in the file cache.

As for false sharing, this is a possibility I'm not ruling out, but I should mention that none of the data loaded from the file is directly modified -- it's read-only. The munging happens into a new buffer that isn't even the same size as the original chunk. Can false sharing affect read-only cache lines? If so, I should try to make my chunks start at 128 byte alignment. I can try this easily and see if it makes the problem go away, but not until Monday. I may just need to do as suggested and try out parallel studio.

Thanks for the help.
02-06-2011
Salem

>My test file is ~25 MB with about 17,000 chunks.
How long does it take to do this?

Code:

int nblocks = 0; char buff[BUFSIZ]; FILE *fp = fopen("chunky_file","rb"); while ( fread( buff, BUFSIZ, 1, fp ) ) { nblocks++; } fclose(fp); printf("%d\n",nblocks);

And how long does it take the second time you do it (after it has been cached perhaps)?

If you did all the work (processed 17,000 chunks) inside that simple single threaded loop, how long would it take?
Because this is your baseline for measuring "optimisation" against. I'm wondering if all this talk of "nmap, cores and threads" is all just "smoke and mirrors" premature optimisation.

> Finally, we should be able to more-or-less neglect the disk read, because in the example,
> I'd already loaded the file many times, so it was still in the file cache.
Why would an nmapped file be in cache?

> My test file is ~25 MB with about 17,000 chunks.
How big are the real files?
How much real memory does your x64 machine have?
02-06-2011
graphicsMan

Thanks for the response. I can try your load test tomorrow.

Real files may be as small as ~1MB, or as large as gigabytes. In the real program doing this, the code will need to run on 32-bit and 64-bit machines, and chunks might be accessed many times over the lifetime of the program (in those cases, we'd like to reuse the existing mapping, and I have an LRU cache of these), or potentially no times. The machines we are targeting for performance have at least 16 GB and are 64-bit Linux (my test machine has 32 GB). The code may eventually have to run on Windows and Mac, so I'm using boost as a portable wrapper for the mapping.

I'm trying to replace an existing system where we definitely spend far too much time doing these types of operations. Even single-threaded, my system is much faster, (and clearly this scales to a handful of threads), but I'm hoping to make this scale better, or at least degrade gracefully.
02-06-2011
Salem

Some reading
c - When should I use mmap for file access? - Stack Overflow
linux performance tuning - Google Search

Some API calls that may be worth a read.
madvise(2): give advice about use of memory - Linux man page
fadvise(2): Give advice about file access - Linux man page
readahead(2) - Linux man page
02-07-2011
graphicsMan

Okay. Once I moved to a single mapped file, I think I changed the bottleneck. I now try commenting out the "write out to files" portion of the operation, and I scale much better (to 20 or 30 threads), and performance degrades gracefully (only 30% slower at 64 threads than 24).

I think there's a lot of file-locking stuff that happens during the write-to-file portion, which can become a bottleneck with multi-threaded code. I don't have time to continue investigating this, but I'm fairly satisfied with where I ended up.

My solution is to threshold how many threads can run (set to 16). This could be higher if the file writing wasn't happening.

Lessons learned:
1) single file mapping is MUCH better than mapping each region of the file separately. Do this if you can. If you know you'll use the data right away, map with MAP_POPULATE, as this results in really fast paging.
2) multi-threaded code should be careful to limit concurrent file operations, as beyond a handful of threads, operating on files, especially writing, can become expensive.

Thanks for the suggestions Salem and brewbuck.