Thread: How to force two processes to run on the same CPU

  1. #1
    Registered User
    Join Date
    Jan 2011
    Posts
    3

    How to force two processes to run on the same CPU

    Hi. Hard question here. I'm developing a software system that consists of multiple processes. It is written in C++ under Linux, and the processes communicate with each other through Linux shared memory.

    Usually, in software development, performance optimization is done in the final stage, and that is where I ran into a big problem. The software has high performance requirements, but on machines with 4 or 8 CPU cores (usually spread over more than one CPU), it was only able to use 3 cores, thus wasting 25% of the CPU power on the former and more than 60% on the latter.
    After a lot of research, and having ruled out mutex and lock contention, I found out that the time was being wasted on shmdt/shmat calls (detach and attach to shared memory segments). After some more research, I found out that these CPUs, which are usually AMD Opteron and Intel Xeon, use a memory architecture called NUMA, which basically means that each processor has its own fast "local memory", and accessing memory attached to other CPUs is expensive.

    After doing some tests, the problem seems to be that the software is designed so that, basically, any process can pass shared memory segments to any other process, and to any thread in them. This seems to kill performance, as processes are constantly accessing memory that belongs to other processes.

    Now, the question is: is there any way to force groups of processes to execute on the same CPU? I don't mean forcing them to always execute on one particular processor, as I don't care which one they run on, although that would do the job. Ideally, there would be a way to tell the kernel: if you schedule this process on one processor, you must also schedule its "brother" process (the process with which it communicates through shared memory) on that same processor, so that performance is not penalized.


    Hard one, I know, as probably very few people know the Linux scheduler that well. But after countless days working on this, any suggestion that could even remotely help would be greatly appreciated.

  2. #2

  3. #3
    Registered User Codeplug
    Join Date
    Mar 2003
    Posts
    4,981
    http://linux.die.net/man/8/numactl
    That may be more appropriate for running your existing processes.

    http://www.halobates.de/numaapi3.pdf
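
    If NUMA tooling is not available, plain CPU affinity is another way to keep a group of cooperating processes on the same cores. Below is a minimal C++ sketch, assuming Linux with glibc: it uses sched_setaffinity(2) and sched_getaffinity(2), and the helper names pin_to_cpus and allowed_cpus are made up for this example. Which CPU numbers share a socket depends on the machine, so any list passed in is an assumption that has to be checked first (for example in /sys/devices/system/cpu/cpuN/topology/physical_package_id).

```cpp
#include <sched.h>   // sched_setaffinity, sched_getaffinity, cpu_set_t (Linux, glibc)
#include <cassert>
#include <vector>

// Restrict the calling thread (pid 0 means "the caller") to the given CPU numbers.
// In a single-threaded process this pins the whole process.
// Returns true on success, false on an out-of-range CPU or a kernel error.
bool pin_to_cpus(const std::vector<int>& cpus) {
    assert(!cpus.empty() && "an empty CPU list would leave nowhere to run");
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus) {
        if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
        CPU_SET(cpu, &set);
    }
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Return the CPU numbers the caller is currently allowed to run on
// (empty if the kernel call fails).
std::vector<int> allowed_cpus() {
    std::vector<int> result;
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return result;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) result.push_back(cpu);
    }
    return result;
}
```

    A child created with fork() inherits the affinity mask, so a launcher could call pin_to_cpus({0, 1}) (hypothetical core numbers) and then start both communicating processes. Note that this pins them to fixed cores rather than letting the kernel pick a socket, and it only controls where the code runs; memory placement is still governed by the kernel's memory policy.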

    gg

  4. #4
    Registered User
    Join Date
    Jan 2011
    Posts
    3
    Thanks.

    Quote: Originally posted by Codeplug
    numactl(8) - Linux man page
    That may be more appropriate for running your existing processes.
    Umm, it looks like the kernel I'm using was not compiled with NUMA support:

    $ numactl
    libnuma: Warning: /sys not mounted or no numa system. Assuming one node: No such file or directory
    Which option should be enabled when compiling the kernel?

  5. #5

  6. #6
    Registered User Codeplug
    Join Date
    Mar 2003
    Posts
    4,981
    >> I found out that the time was being wasted on shmdt/shmat calls (detach and attach to shared memory segments)
    Are you saying that you're calling these more than once per shared memory object? These are typically only called once, and so would be negligible in any performance analysis.

    Where NUMA may make a difference is in the read/write access to the shared memory itself. I wouldn't expect the performance of shmdt/shmat to change much, if at all.
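
    To illustrate the "called once" pattern: a minimal C++ sketch, assuming a System V segment that is created and attached a single time and then reused for many messages. SharedBuffer and fill_messages are hypothetical names for this example, not code from the original poster's system, and IPC_PRIVATE is used only to keep the sketch self-contained (real cooperating processes would share a key or inherit the mapping across fork()).

```cpp
#include <sys/ipc.h>   // IPC_PRIVATE, IPC_CREAT, IPC_RMID
#include <sys/shm.h>   // shmget, shmat, shmdt, shmctl
#include <cassert>
#include <cstddef>
#include <cstring>

// Owns one System V segment: created and attached once in the constructor,
// detached and marked for removal once in the destructor.
class SharedBuffer {
public:
    explicit SharedBuffer(std::size_t bytes)
        : id_(shmget(IPC_PRIVATE, bytes, IPC_CREAT | 0600)), size_(bytes) {
        if (id_ == -1) return;
        void* p = shmat(id_, nullptr, 0);
        if (p != reinterpret_cast<void*>(-1)) addr_ = static_cast<char*>(p);
    }
    ~SharedBuffer() {
        if (addr_) shmdt(addr_);
        if (id_ != -1) shmctl(id_, IPC_RMID, nullptr);
    }
    SharedBuffer(const SharedBuffer&) = delete;
    SharedBuffer& operator=(const SharedBuffer&) = delete;

    bool ok() const { return addr_ != nullptr; }
    std::size_t size() const { return size_; }
    char* data() { assert(ok()); return addr_; }

private:
    int id_;
    std::size_t size_;
    char* addr_ = nullptr;
};

// Write up to `count` fixed-size messages into the same mapping, back to back,
// without any further shmat/shmdt calls. Returns how many messages fit.
std::size_t fill_messages(SharedBuffer& buf, std::size_t msg_size, std::size_t count) {
    if (!buf.ok() || msg_size == 0) return 0;
    std::size_t written = 0;
    for (; written < count && (written + 1) * msg_size <= buf.size(); ++written) {
        const int fill = static_cast<int>('A' + written % 26);
        std::memset(buf.data() + written * msg_size, fill, msg_size);
    }
    return written;
}
```

    With this pattern the per-message cost is a memory write rather than an attach/detach system call pair, which is why shmat/shmdt would normally not show up in a profile.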

    gg

  7. #7
    Registered User
    Join Date
    Jan 2011
    Posts
    3
    Quote: Originally posted by Codeplug
    >> I found out that the time was being wasted on shmdt/shmat calls (detach and attach to shared memory segments)
    Are you saying that you're calling these more than once per shared memory object? These are typically only called once, and so would be negligible in any performance analysis.
    They are created only once, and attached and detached once per segment, so twice in total. There are so many calls to shmat/dt because about 5000 segments are created/deleted per second.

    >> Where NUMA may make a difference is in the read/write access to the shared memory itself. I wouldn't expect the performance of shmdt/shmat to change much, if at all.

    You might be right. Anyway, I've found out that shmat/shmdt is currently not the bottleneck. It was in the past, but not now, as a lot of new code has been added in the meantime. Now the time is being wasted in the malloc implementation, even after switching to optimized memory allocators like Google's tcmalloc. I was hoping that both bottlenecks might be related, because the basics of the operations of malloc and shmget are similar -find free space and allocate it-, and the consequences of both bottlenecks are also similar -four out of eight cores are left unused-.

    I was hoping that a NUMA-enabled kernel and NUMA tuning might help in both cases. I enabled NUMA support in the kernel, but now it looks like the kernel thinks there is only one NUMA node...

    $ numactl --hardware
    available: 1 nodes (0-0)
    Is there anything else that needs to be done so that the kernel recognizes the two memory nodes?

  8. #8
    Registered User Codeplug
    Join Date
    Mar 2003
    Posts
    4,981
    >> There are so many calls to shmat/dt because about 5000 segments are created/deleted per second.
    5000/s - That sounds like a design flaw, or poor choice of IPC - without knowing anything else about it.

    >> and the consequences of both bottlenecks are also similar -four out of eight cores are left unused-
    That doesn't make sense to me. Are you saying that you have threads on all 8, but on 4 of the cores the code is spending all its time inside malloc?

    >> Is there anything else that needs to be done so that the kernel recognizes the two memory nodes?
    Check your BIOS settings and see if NUMA needs to be enabled there.
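
    Besides the BIOS, it can help to check what the kernel itself exposes. A minimal C++ sketch, assuming the usual sysfs layout where each NUMA node appears as a nodeN directory under /sys/devices/system/node (that directory is missing on kernels built without CONFIG_NUMA). count_numa_nodes is a hypothetical helper name; the path is a parameter so it can be pointed elsewhere.

```cpp
#include <dirent.h>   // opendir, readdir, closedir (POSIX)
#include <cassert>
#include <cctype>
#include <cstring>
#include <string>

// Count directory entries named "node<digits>" (node0, node1, ...).
// Returns 0 if the directory cannot be opened, e.g. no NUMA support in the kernel.
int count_numa_nodes(const std::string& dir = "/sys/devices/system/node") {
    assert(!dir.empty());
    DIR* d = opendir(dir.c_str());
    if (!d) return 0;

    int count = 0;
    while (dirent* entry = readdir(d)) {
        const char* name = entry->d_name;
        // Must start with "node" and have at least one character after it.
        if (std::strncmp(name, "node", 4) != 0 || name[4] == '\0') continue;
        bool all_digits = true;
        for (const char* p = name + 4; *p != '\0'; ++p) {
            if (!std::isdigit(static_cast<unsigned char>(*p))) all_digits = false;
        }
        if (all_digits) ++count;
    }
    closedir(d);
    return count;
}
```

    On the machine described above this would presumably return 1, in line with the numactl --hardware output, until the second node is recognized.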

    gg
