Thread: mmap scales poorly... help?

  1. #1
    Registered User
    Join Date
    Feb 2011
    Posts
    5

    mmap scales poorly... help?

    Hi all -

    Here's some context:

    I have files of some large-ish size consisting of chunks. I need to
    load each chunk, do some work (call it munging) on it, and write the
    munged chunk to a file. My test file is ~25 MB with about 17,000 chunks.

    My system is a Nehalem-EX with 32 physical cores, 64 virtual ones with
    hyperthreading. I'm running Ubuntu x86-64 with the 2.6.35-24-generic
    kernel. I'm running the system over remote X from my desk to a server
    room, in case that has any impact on answers.

    Since I have a bunch-o-cores, I'd like to do this processing in
    parallel. I have the list of chunks stored in an array, and I started
    by synchronizing access with a pthread mutex wrapping a counter. I've
    since moved on to using an atomic counter (fetch_and_inc) with no mutex.
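
    In case it helps, the work distribution looks roughly like this (a
    stripped-down sketch, not the real code; NUM_CHUNKS, chunks and
    munge_chunk() are placeholders):
    Code:
    #include <pthread.h>
    #include <stddef.h>

    // Placeholders standing in for the real data structures.
    enum { NUM_CHUNKS = 17000 };
    extern struct chunk *chunks[NUM_CHUNKS];
    extern void munge_chunk(struct chunk *c);

    static long next_chunk = 0;   // shared work counter

    static void *worker(void *arg)
    {
        (void)arg;
        for (;;) {
            // atomic fetch_and_inc -- no mutex needed
            long i = __sync_fetch_and_add(&next_chunk, 1);
            if (i >= NUM_CHUNKS)
                break;
            munge_chunk(chunks[i]);   // load chunk, munge, write result
        }
        return NULL;
    }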

    My problem appears to be with mmap. I use boost::mapped_region as a
    light wrapper around mmap, but I tried mmap directly with no
    differences.

    I tried two different ways of accessing the chunks: (1) making a
    mapped_region for each chunk, and (2) creating a single mapped_region
    for the entire file and simply grabbing the subset needed for each chunk.
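
    Roughly, in boost::interprocess terms, the two look like this (a sketch
    only, error handling omitted; the chunk offset and size are placeholder
    values, and I believe boost rounds a non-page-aligned offset down to a
    page boundary internally):
    Code:
    #include <boost/interprocess/file_mapping.hpp>
    #include <boost/interprocess/mapped_region.hpp>
    #include <cstddef>
    using namespace boost::interprocess;

    int main()
    {
        // Placeholder chunk coordinates -- the real ones come from the file's index.
        offset_t    chunk_offset = 4096;
        std::size_t chunk_size   = 1500;

        file_mapping fm("test_file", read_only);

        // Way (1): one mapped_region per chunk.
        mapped_region one_chunk(fm, read_only, chunk_offset, chunk_size);
        const char *p1 = static_cast<const char *>(one_chunk.get_address());

        // Way (2): one mapped_region over the whole file (size 0 == map it all),
        // then just point into it for each chunk.
        mapped_region whole(fm, read_only);
        const char *p2 = static_cast<const char *>(whole.get_address()) + chunk_offset;

        (void)p1; (void)p2;
        return 0;
    }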

    The first way (1) yields the following times for the test file:

    Threads | Time (s)
    ==============
    1 | 1.927
    2 | 1.299
    4 | 1.158
    8 | 6.413
    16 | 10.217
    32 | 11.049
    64 | 11.445

    The second way (2) yields the following times for the test file:

    Threads | Time (s)
    ==============
    1 | 1.889
    2 | 1.136
    4 | 0.774
    8 | 0.761
    16 | 6.092
    32 | 6.116
    64 | 6.252

    Adding thread affinity DOES help, but not a lot. The trend is
    similar, but about 10% faster.

    Clearly the second method is preferable when possible. (I may need to
    do a similar thing keeping many file mappings open on subsets of the
    chunks, and on a 32-bit machine the first approach is more attractive
    because it uses less virtual address space. On 64-bit, I'm not worried
    about running out of my virtual address space :-).)

    The main point of this post, however, is that neither technique scales
    past a handful of cores. This is where I hope someone might help.
    Does anyone know about bottlenecks or potential bottlenecks with mmap
    and multi-threading on more than a few cores? If so, is there a
    solution, perhaps a kernel runtime tweak or even a kernel compile-time
    tweak (I'm willing to build the kernel)?

    Any information that might be helpful is appreciated, even if it's
    that I need to ask the question elsewhere (e.g. where might be more
    appropriate?).

    Thanks!

  2. #2
    Officially An Architect brewbuck's Avatar
    Join Date
    Mar 2007
    Location
    Portland, OR
    Posts
    7,396
    I'm not sure that mmap() has much to do with what you are seeing. It's not expensive inside the kernel to set up a memory map of a file -- it just records a few things, allocates address space, then lets the page fault handler bring in the pages as they are accessed.

    With that many cores, chances are you're playing cache ping-pong. Even with thread affinity under control, cache coherency is maintained on a per-cache-line basis. So even if threads never jump from core to core, and even if each thread is working on independent data, a cache line in one core can alias a cache line in another core, and the coherency mechanism will spend a lot of time bouncing cache lines back and forth even though no data is actually shared -- this is called the false sharing problem.
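
    A contrived illustration (nothing to do with your actual code):
    Code:
    // Two threads increment *different* counters, but both counters live
    // in the same 64-byte cache line, so the line ping-pongs between
    // cores even though no data is logically shared.
    struct Counters {
        long a;   // written only by thread 1
        long b;   // written only by thread 2 -- same cache line as 'a'
    };

    // A common fix: pad the hot fields apart so they land on separate lines.
    struct PaddedCounters {
        long a;
        char pad[64 - sizeof(long)];   // assumes 64-byte lines (128 on some parts)
        long b;
    };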

    For these sorts of things you really ought to bust out a parallel process analyzer like Intel Parallel Studio.
    Code:
    //try
    //{
    	if (a) do { f( b); } while(1);
    	else   do { f(!b); } while(1);
    //}

  3. #3
    Registered User
    Join Date
    Feb 2011
    Posts
    5
    Hey, thanks for the reply. I think you may have something there. If my chunks are typically in the kilobytes, is this still a likely candidate for the problem?

    Also, I should note that when I run this under oprofile, the time for 64 threads is 0.88 seconds. Somehow the performance counting disrupts the problem.

    I'll look into getting parallel studio.

    Thanks again.

  4. #4
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    mmap might be an interesting programming trick to save a bit of buffering work inside the kernel and your code, but it doesn't get past the physical reality of having to read the hard disk.

    Seek times are measured in milliseconds, which puts it somewhere between 1E6 and 1E9 times SLOWER than the core of your system. You might have a multi-lane highway, but the only on-ramp is still a dirt track.

    If one core can keep the system I/O bound, then adding more cores is only going to increase problems like the one brewbuck mentioned.

    If further analysis does show that your system is indeed I/O bound in a bad way, then perhaps configuring your storage as a RAID array would help to boost disk performance.
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  5. #5
    Registered User
    Join Date
    Feb 2011
    Posts
    5
    Thanks for the reply. From the numbers in the tables, we can see that the computation can get down at least to about the half-second range. Comparing that to the ~10s and ~5s numbers with 64 threads, something bad is happening that is not related to peak disk throughput. My disk can do 100 MB/s reads for large enough chunks, and for the file at hand that's about a quarter second at best. The mapped regions are accessed in a semi-random fashion, so we're not getting the best possible disk read rate in any case, even with the single-threaded version, but it's clear that all the work can be done in about half a second with the optimal number of threads.
    Finally, we should be able to more-or-less neglect the disk read, because in the example, I'd already loaded the file many times, so it was still in the file cache.

    As for false sharing, this is a possibility I'm not ruling out, but I should mention that none of the data loaded from the file is directly modified -- it's read-only. The munging happens into a new buffer that isn't even the same size as the original chunk. Can false sharing affect read-only cache lines? If so, I should try to make my chunks start at 128-byte alignment. I can try that easily and see whether it makes the problem go away, but not until Monday. I may just need to do as suggested and try out Parallel Studio.
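
    For the record, the easy version of the alignment experiment is on the write side, since the munge output buffers are the only memory being modified. Something like this (just a sketch; 128 is a guess at twice the cache line size to cover adjacent-line prefetching):
    Code:
    #include <stdlib.h>

    // Give each thread's munge output buffer its own cache line(s), so
    // writes from different threads can never land in the same line.
    void *alloc_output_buffer(size_t bytes)
    {
        void *p = NULL;
        if (posix_memalign(&p, 128, bytes) != 0)
            return NULL;
        return p;
    }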

    Thanks for the help.
    Last edited by graphicsMan; 02-05-2011 at 05:52 PM.

  6. #6
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    >My test file is ~25 MB with about 17,000 chunks.
    How long does it take to do this?
    Code:
    #include <stdio.h>

    int main(void)
    {
        int nblocks = 0;
        char buff[BUFSIZ];
        FILE *fp = fopen("chunky_file", "rb");
        if (!fp) { perror("fopen"); return 1; }
        while (fread(buff, BUFSIZ, 1, fp)) {  /* counts whole BUFSIZ-sized blocks */
            nblocks++;
        }
        fclose(fp);
        printf("%d\n", nblocks);
        return 0;
    }
    And how long does it take the second time you do it (after it has been cached perhaps)?

    If you did all the work (processed 17,000 chunks) inside that simple single threaded loop, how long would it take?
    Because this is your baseline for measuring "optimisation" against. I'm wondering if all this talk of "mmap, cores and threads" is just "smoke and mirrors" premature optimisation.
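
    Something like this around the loop (or around the full 17,000-chunk munge pass) would give the wall-clock number -- just a sketch:
    Code:
    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* ... the fread loop (or all the real work) goes here ... */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("%.3f s\n", (t1.tv_sec - t0.tv_sec) +
                           (t1.tv_nsec - t0.tv_nsec) / 1e9);
        return 0;   /* link with -lrt on older glibc */
    }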

    > Finally, we should be able to more-or-less neglect the disk read, because in the example,
    > I'd already loaded the file many times, so it was still in the file cache.
    Why would an mmapped file be in cache?

    > My test file is ~25 MB with about 17,000 chunks.
    How big are the real files?
    How much real memory does your x64 machine have?
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  7. #7
    Registered User
    Join Date
    Feb 2011
    Posts
    5
    Thanks for the response. I can try your load test tomorrow.

    Real files may be as small as ~1MB, or as large as gigabytes. In the real program doing this, the code will need to run on 32-bit and 64-bit machines, and chunks might be accessed many times over the lifetime of the program (in those cases, we'd like to reuse the existing mapping, and I have an LRU cache of these), or potentially no times. The machines we are targeting for performance have at least 16 GB and are 64-bit Linux (my test machine has 32 GB). The code may eventually have to run on Windows and Mac, so I'm using boost as a portable wrapper for the mapping.

    I'm trying to replace an existing system where we definitely spend far too much time doing these types of operations. Even single-threaded, my system is much faster (and clearly this scales to a handful of threads), but I'm hoping to make it scale better, or at least degrade gracefully.

  8. #8
    and the hat of int overfl Salem's Avatar
    Join Date
    Aug 2001
    Location
    The edge of the known universe
    Posts
    39,659
    If you dance barefoot on the broken glass of undefined behaviour, you've got to expect the occasional cut.
    If at first you don't succeed, try writing your phone number on the exam paper.

  9. #9
    Registered User
    Join Date
    Feb 2011
    Posts
    5
    Okay. Once I moved to a single mapped file, I think I changed the bottleneck. If I comment out the "write out to files" portion of the operation, I scale much better (to 20 or 30 threads), and performance degrades gracefully (only about 30% slower at 64 threads than at 24).

    I think there's a lot of file-locking stuff that happens during the write-to-file portion, which can become a bottleneck with multi-threaded code. I don't have time to continue investigating this, but I'm fairly satisfied with where I ended up.

    My solution is to cap how many threads can run (set to 16 for now). The cap could be higher if the file writing weren't happening.
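
    Concretely the cap is nothing fancier than this (sketch):
    Code:
    #include <unistd.h>

    // Use the smaller of the online core count and an empirically chosen
    // threshold (16 worked best here because of the file writing).
    long pick_worker_count(void)
    {
        const long max_workers = 16;
        long hw = sysconf(_SC_NPROCESSORS_ONLN);
        return (hw > 0 && hw < max_workers) ? hw : max_workers;
    }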

    Lessons learned:
    1) A single file mapping is MUCH better than mapping each region of the file separately. Do this if you can. If you know you'll use the data right away, map with MAP_POPULATE, as this results in really fast paging (see the sketch after this list).
    2) Multi-threaded code should be careful to limit concurrent file operations; beyond a handful of threads, operating on files, especially writing, can become expensive.
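
    For lesson 1, the single mapping ends up looking something like this (a sketch with error checks omitted; how you find each chunk's offset is up to your file format):
    Code:
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    // One read-only mapping of the whole file, pre-faulted with
    // MAP_POPULATE (Linux-specific) so the first pass doesn't stall on
    // page faults.
    char *map_whole_file(const char *path, size_t *len_out)
    {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        char *base = (char *)mmap(NULL, st.st_size, PROT_READ,
                                  MAP_PRIVATE | MAP_POPULATE, fd, 0);
        close(fd);              /* the mapping stays valid after close */
        *len_out = st.st_size;
        return base;            /* chunk i lives at base + its offset */
    }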

    Thanks for the suggestions Salem and brewbuck.
