Hi all -

Here's some context:

I have files of some large-ish size consisting of chunks. I need to
load each chunk, do some work (call it munging) on it, and write the
munged chunk to a file. My test file is ~25 MB with about 17,000 chunks.

My system is a Nehalem-EX with 32 physical cores, 64 virtual ones with
hyperthreading. I'm running Ubuntu x86-64 with the 2.6.35-24-generic
kernel. I'm running the system over remote-X from my desk to a server room,
if this has any impact on answers.

Since I have a bunch-o-cores, I'd like to do this processing in
parallel. I have the list of chunks stored in an array, and I started
by synchronizing access with a pthread mutex wrapping a counter. I've
since moved on to using an atomic counter (fetch_and_inc) with no mutex.
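
For reference, here is a minimal sketch of the atomic-counter scheme
described above. It is not my actual code: std::atomic and std::thread
stand in for fetch_and_inc and pthreads, and process_chunks and the
work callback are hypothetical names used only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Calls work(i) exactly once for every i in [0, num_chunks), spread over
// num_threads worker threads. Each worker claims the next unprocessed
// index with an atomic fetch-and-add, so no mutex is needed.
// Assumption: work(i) is safe to call concurrently for distinct i.
template <typename Work>
void process_chunks(std::size_t num_chunks, unsigned num_threads, Work work) {
    assert(num_threads > 0);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&next, &work, num_chunks] {
            for (;;) {
                // Claim one chunk index; relaxed ordering suffices because
                // the counter only hands out indices.
                const std::size_t i =
                    next.fetch_add(1, std::memory_order_relaxed);
                if (i >= num_chunks) break;
                work(i);
            }
        });
    }
    for (auto& w : workers) w.join();
}
```

With this shape the only shared write in the dispatch path is the
counter itself, which is why I suspect the slowdown comes from
somewhere other than my own synchronization.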

My problem appears to be with mmap. I use boost::mapped_region as a
light wrapper around mmap, but I tried mmap directly with no
differences.

I tried two different ways of accessing the chunks: (1) I made a
mapped_region for each chunk, and (2) I created one mapped_region for
the entire file and then simply grabbed the subset needed for each chunk.
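
To make the two approaches concrete, here is a rough sketch using plain
POSIX mmap rather than boost::mapped_region. The names ChunkView,
map_chunk, unmap_chunk, Mapping, and map_whole_file are hypothetical
and exist only for this illustration. Note that a per-chunk mapping has
to start on a page boundary, so approach 1 maps from the enclosing page
and skips the slack bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

// Read-only view of one chunk. base/mapped_length describe the underlying
// mapping so that it can be released with munmap.
struct ChunkView {
    const char* data;
    std::size_t length;
    void* base;
    std::size_t mapped_length;
};

// Approach 1: one mmap call per chunk. The file offset passed to mmap must
// be a multiple of the page size, so round down and remember the slack.
ChunkView map_chunk(int fd, std::size_t offset, std::size_t length) {
    assert(length > 0);
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const std::size_t aligned = offset - (offset % page);
    const std::size_t slack = offset - aligned;
    void* base = mmap(nullptr, length + slack, PROT_READ, MAP_PRIVATE, fd,
                      static_cast<off_t>(aligned));
    if (base == MAP_FAILED) return ChunkView{nullptr, 0, nullptr, 0};
    return ChunkView{static_cast<const char*>(base) + slack, length, base,
                     length + slack};
}

void unmap_chunk(const ChunkView& view) {
    if (view.base != nullptr) munmap(view.base, view.mapped_length);
}

// Approach 2: map the whole file once; a chunk is then just base + offset.
struct Mapping {
    const char* base;
    std::size_t length;
};

Mapping map_whole_file(const char* path) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return Mapping{nullptr, 0};
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) {
        close(fd);
        return Mapping{nullptr, 0};
    }
    void* base = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                      PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) return Mapping{nullptr, 0};
    return Mapping{static_cast<const char*>(base),
                   static_cast<std::size_t>(st.st_size)};
}
```

In approach 1 every chunk costs an mmap/munmap pair, so with ~17,000
chunks all threads are repeatedly modifying the same process address
space. In approach 2 the threads only touch pages of a single existing
mapping.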

The first way (1) yields the following times for the test file:

Threads | Time (s)
==============
1 | 1.927
2 | 1.299
4 | 1.158
8 | 6.413
16 | 10.217
32 | 11.049
64 | 11.445

The second way (2) yields the following times for the test file:

Threads | Time (s)
==============
1 | 1.889
2 | 1.136
4 | 0.774
8 | 0.761
16 | 6.092
32 | 6.116
64 | 6.252

Adding thread affinity DOES help, but not a lot. The trend is
similar, but about 10% faster.
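
By "thread affinity" I mean pinning each worker to one CPU, roughly as
in the sketch below. It assumes Linux/glibc (pthread_setaffinity_np is
a non-portable GNU extension), and pin_current_thread_to_cpu is a
hypothetical helper name, not something from my real code.

```cpp
#include <cassert>
#include <pthread.h>
#include <sched.h>

// Restricts the calling thread to a single CPU. Returns true on success.
// Assumption: Linux with glibc; g++ defines _GNU_SOURCE by default, which
// exposes cpu_set_t, the CPU_* macros, and pthread_setaffinity_np.
bool pin_current_thread_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

Each worker would call this once at startup with its own CPU number
before it starts pulling chunks.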

Clearly the second method is preferable when possible. (I may need
to do a similar thing while keeping many file mappings on subsets of the
chunks open, and on a 32-bit machine the first is more attractive
because it uses less virtual address space. On 64-bit, I'm not worried
about running out of virtual address space :-) )

The main point of this post, however, is that neither technique scales
past a handful of cores. This is where I hope someone might help.
Does anyone know about bottlenecks or potential bottlenecks with mmap
and multi-threading on more than a few cores? If so, is there a
solution, such as a kernel runtime tweak or even a kernel compile-time
tweak? (I'm willing to build the kernel.)

Any information that might be helpful is appreciated, even if it's
that I need to ask the question elsewhere (and if so, where might be
more appropriate?).

Thanks!