
How to force two processes to run on the same CPU



fryguy
01-11-2011, 05:52 PM
Hi. Hard question here. I'm working on a software system that consists of multiple processes. It is written in C++ under Linux, and the processes communicate with each other using Linux shared memory.

As usual in software development, performance optimization was left for the final stage, and that's where I hit a big problem. The software has high performance requirements, but on machines with 4 or 8 CPU cores (usually with more than one physical CPU) it was only able to use 3 of them, wasting 25% of the CPU power on the former and more than 60% on the latter.
After a great deal of research, and having ruled out mutex and lock contention, I found that the time was being wasted in shmdt/shmat calls (detaching from and attaching to shared memory segments). After some more research, I found that these CPUs, usually AMD Opterons and Intel Xeons, use a memory architecture called NUMA, which basically means that each processor has its own fast "local" memory, while accessing memory attached to other CPUs is expensive.

After some testing, the problem seems to be that the software is designed so that, basically, any process can pass shared memory segments to any other process, and to any thread within them. This seems to kill performance, as processes are constantly accessing memory that belongs to other processes.

Now, the question is: is there any way to force groups of processes to execute on the same CPU? I don't mean forcing them to always execute on one specific processor; I don't care which one they run on, although that would do the job. Ideally, there would be a way to tell the kernel: if you schedule this process on one processor, you must also schedule its "brother" process (the process it communicates with through shared memory) on that same processor, so that performance is not penalized.
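
The closest thing I've found so far is hard CPU affinity, i.e. pinning both "brother" processes to the same fixed core with sched_setaffinity(2). A rough, untested sketch of that workaround is below, but it nails the processes to one particular core, which is exactly what I'd like to avoid:

/* Rough sketch of the affinity workaround: hard-pin the calling process
 * (and, elsewhere, its "brother") to CPU 0 with sched_setaffinity(2). */
#define _GNU_SOURCE 1
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>

static int pin_to_cpu(pid_t pid, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(pid, sizeof(set), &set);  /* pid 0 == calling process */
}

int main(void)
{
    if (pin_to_cpu(0, 0) != 0) {   /* both brothers would pin to the same CPU number */
        perror("sched_setaffinity");
        return 1;
    }
    /* ... attach the shared memory segments and do the real work here ... */
    return 0;
}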


Hard one, I know, as probably very few people know the Linux scheduler that well. But after countless days working on this, any suggestion that could remotely help would be greatly appreciated.

Codeplug
01-11-2011, 06:27 PM
http://linux.die.net/man/8/schedtool
http://linux.die.net/man/3/numa
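
If you go the numa(3) route, the basic idea is to run a task on a chosen node and allocate its memory on that same node. An untested sketch (build with -lnuma):

// Untested sketch of the numa(3) approach: keep a task and its memory on
// the same node. Build with -lnuma.
#include <numa.h>
#include <stdio.h>

int main()
{
    if (numa_available() == -1) {
        fprintf(stderr, "no NUMA support in this kernel/hardware\n");
        return 1;
    }

    int node = 0;                                  // pick one node for the whole group
    numa_run_on_node(node);                        // schedule this task on that node's CPUs

    void *buf = numa_alloc_onnode(1 << 20, node);  // 1 MiB allocated from the same node
    if (buf == 0)
        return 1;

    // ... work on buf; the "brother" processes would call numa_run_on_node()
    //     with the same node so all their memory traffic stays local ...

    numa_free(buf, 1 << 20);
    return 0;
}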

gg

Codeplug
01-11-2011, 07:44 PM
http://linux.die.net/man/8/numactl
That may be more appropriate for running your existing processes.
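
For example, something along the lines of:

numactl --cpunodebind=0 --membind=0 ./yourprog

should run an unmodified binary ("yourprog" is just a placeholder) with its CPUs and its memory allocations both restricted to node 0.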

http://www.halobates.de/numaapi3.pdf

gg

fryguy
01-12-2011, 03:19 AM
Thanks.


numactl(8) - Linux man page (http://linux.die.net/man/8/numactl)
That may be more appropriate for running your existing processes.


Umm, it looks like the kernel I'm using was not compiled with NUMA support:


$ numactl
libnuma: Warning: /sys not mounted or no numa system. Assuming one node: No such file or directory

What is the option that should be enabled when compiling the kernel?

Codeplug
01-12-2011, 08:36 AM
If you're sure you are on NUMA hardware...
Linux Kernel Driver Database: CONFIG_NUMA: # "NUMA Memory Allocation and Scheduler Support" (http://cateee.net/lkddb/web-lkddb/NUMA.html)
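
A quick way to check the kernel you're running is something like:

grep NUMA /boot/config-$(uname -r)

(assuming your distro installs the config file there); you want to see CONFIG_NUMA=y.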

gg

Codeplug
01-12-2011, 10:07 AM
>> I found out that the time was being wasted on shmdt/shmat calls (detach and attach to shared memory segments)
Are you saying that you're calling these more than once per shared memory object? These are typically only called once, and so would be negligible in any performance analysis.

Where NUMA may make a difference is in the read/write access to the shared memory itself. I wouldn't expect the performance of shmdt/shmat to change much, if at all.
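
In other words, the usual lifetime looks roughly like this (a sketch): attach once, keep the mapping for as long as the segment is used, detach once at the end.

/* Sketch of the typical SysV shm lifetime: one shmget/shmat up front,
 * many reads/writes through the same mapping, one shmdt at the end.
 * The peer process does the same shmget (same key) + shmat on its side. */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>

int main()
{
    key_t key = ftok("/tmp", 'A');                  /* both sides agree on the key */
    int id = shmget(key, 4096, IPC_CREAT | 0600);
    if (id == -1) { perror("shmget"); return 1; }

    char *p = (char *)shmat(id, 0, 0);              /* attach once */
    if (p == (char *)-1) { perror("shmat"); return 1; }

    for (int i = 0; i < 1000000; ++i)               /* all the real work goes through */
        p[i % 4096] = (char)i;                      /* the same mapping               */

    shmdt(p);                                       /* detach once, at the end */
    shmctl(id, IPC_RMID, 0);                        /* whoever owns it removes it */
    return 0;
}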

gg

fryguy
01-13-2011, 09:06 AM
>> I found out that the time was being wasted on shmdt/shmat calls (detach and attach to shared memory segments)
Are you saying that you're calling these more than once per shared memory object? These are typically only called once, and so would be negligible in any performance analysis.


Each segment is created only once, and attached and detached once each, so two calls per segment in total. There are so many calls to shmat/shmdt because about 5000 segments are created and deleted per second.



Where NUMA may make a difference is in the read/write access to the shared memory itself. I wouldn't expect the performance of shmdt/shmat to change much, if at all.

gg


You might be right. Anyway, I've found that shmat/shmdt is currently not the bottleneck. It was in the past, but not any more, as a lot of new code has been added in the meantime. Now the time is being wasted in the malloc implementation, even after switching to optimized memory allocators like Google's tcmalloc. I was hoping that both bottlenecks might be related, because the basic operations of malloc and shmget are similar (find free space and allocate it), and the consequences of both bottlenecks are also similar (four out of eight cores are left unused).

I was hoping that a NUMA-enabled kernel and some NUMA tuning might help in both cases, so I enabled NUMA support in the kernel. But now it looks like the kernel thinks there is only one NUMA node...


numactl --hardware
available: 1 nodes (0-0)

Is there anything else that needs to be done so that the kernel recognizes the two memory nodes?
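(I assume that if the kernel detected both nodes they would also show up as node0 and node1 under /sys/devices/system/node/.)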

Codeplug
01-13-2011, 03:08 PM
>> There are so many calls to shmat/dt because about 5000 segments are created/deleted per second.
5000/s - That sounds like a design flaw, or poor choice of IPC - without knowing anything else about it.
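
If the segments are roughly the same size, one option is to create and attach a small pool of them once at startup and recycle them, so that shmget/shmat/shmdt disappear from the hot path entirely. A rough sketch of what I mean (the segment ids would still be handed to the peer processes however you do that today):

// Rough sketch: pre-create and attach a pool of segments once, then recycle
// them, instead of 5000 shmget/shmat/shmdt cycles per second.
#include <sys/ipc.h>
#include <sys/shm.h>
#include <stdio.h>
#include <vector>

struct Segment { int id; void *addr; };

int main()
{
    const size_t kSegSize = 64 * 1024;
    std::vector<Segment> pool;

    for (int i = 0; i < 16; ++i) {                  // created and attached once, at startup
        Segment s;
        s.id = shmget(IPC_PRIVATE, kSegSize, IPC_CREAT | 0600);
        if (s.id == -1) { perror("shmget"); return 1; }
        s.addr = shmat(s.id, 0, 0);
        if (s.addr == (void *)-1) { perror("shmat"); return 1; }
        pool.push_back(s);
    }

    // ... hand segments out to workers and put them back on a free list when
    //     done, instead of creating and destroying them per message ...

    for (size_t i = 0; i < pool.size(); ++i) {      // detached and removed once, at shutdown
        shmdt(pool[i].addr);
        shmctl(pool[i].id, IPC_RMID, 0);
    }
    return 0;
}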

>> and the consequences of both bottlenecks are also similar -four out of eight cores are left unused-
That doesn't make sense to me. Are you saying that you have threads on all 8, but on 4 of the cores the code is spending all its time inside malloc?

>> Is there anything else that needs to be done so that the kernel recognizes the two memory nodes?
Check your BIOS settings and see if NUMA needs to be enabled there.

gg