Processes not dying

04-18-2008
Elkvis

I have a server program that forks off child processes to handle network connections. For some reason, the child processes are not dying when I call exit(0). I use MySQL for database access, and it is high on my list of suspects, but at this point I have no proof that the MySQL client library is causing any problems. The network connections are getting closed, as evidenced by the fact that /proc/<pid>/fd shows only 0, 1, and 2 (stdin, stdout, and stderr, respectively) remaining. I install a SIGCHLD handler that calls wait() to collect the terminated processes, but the processes don't go away. They still show up in the output of ps ax, and not as zombies, so somehow the processes are still running. I've done everything I can think of to fix this, and I'm running out of ideas. Please give me a few suggestions of things to try.
04-18-2008
CornedBee

That should be quite impossible. For testing, though, you can try calling _exit() instead of exit(), thus avoiding atexit()-registered functions, one of which might, if it's very badly behaved, call longjmp().

But primarily I'd try making absolutely sure that the exit() call is reached at all.
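
To illustrate the difference between the two calls, here is a small sketch. It assumes a POSIX system; atexit_marker, report_fd, and atexit_handler_ran are hypothetical names invented for this example. A forked child registers an atexit() handler and then leaves through either exit() or _exit(), and the parent checks through a pipe whether the handler ran.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Write end of the pipe the atexit() handler reports to (illustrative). */
static int report_fd = -1;

static void atexit_marker(void)
{
    /* Reached only when the process terminates through exit(). */
    ssize_t n = write(report_fd, "x", 1);
    (void)n;
}

/* Fork a child that registers an atexit() handler and then terminates
 * with exit(0) or _exit(0). Returns 1 if the handler ran in the child,
 * 0 if it did not, -1 on error. */
int atexit_handler_ran(int use_underscore_exit)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                       /* child */
        close(fds[0]);
        report_fd = fds[1];
        atexit(atexit_marker);
        if (use_underscore_exit)
            _exit(0);                     /* skips atexit() handlers */
        exit(0);                          /* runs atexit() handlers */
    }

    close(fds[1]);                        /* parent */
    char c;
    ssize_t n = read(fds[0], &c, 1);      /* 0 means EOF: nothing was written */
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == 1 ? 1 : (n == 0 ? 0 : -1);
}
```

Note that libraries can register atexit() handlers of their own, so a program that never calls atexit() directly can still have handlers running at exit().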
04-18-2008
matsp

You can also attach to a "dying" process in gdb for example - just "attach <pid>" after starting gdb without any arguments.

--
Mats
04-18-2008
Elkvis

I'm not setting any atexit() functions, but I'll definitely give the _exit() thing a try. Also, I'm not especially familiar with gdb, but I'll try attaching to a process and see what happens.
04-18-2008
matsp

Ah, you need a "beginner's guide to GDB". OK: after you have attached to the process, you can "break" by pressing CTRL-C, then type "where" or "backtrace" to list which function you are in and the call stack of functions leading up to that point.

--
Mats
04-18-2008
CornedBee

Quote:

type "where" or "backtrace"

Or "bt". It's the shortest variant. And when you've used command line debuggers a bit, you'll really appreciate short commands ;)
04-18-2008
Elkvis

Quote:

Originally Posted by matsp

Ah, you need a "beginner's guide to GDB". OK: after you have attached to the process, you can "break" by pressing CTRL-C, then type "where" or "backtrace" to list which function you are in and the call stack of functions leading up to that point.

--
Mats

I'll have to rebuild the project in debug mode, but that's only about 10 minutes' worth. Any recommendations of a good "beginner's guide" for gdb?
04-18-2008
matsp

This may be a good place to start:
http://www.cs.princeton.edu/~benjasik/gdb/gdbtut.html

--
Mats
04-22-2008
brewbuck

If the processes are not in state 'Z' what state are they in?

Maybe the mysql client library is blocking SIGCHLD for some reason.
04-22-2008
Elkvis

Most are showing 'S+' for their status, meaning they are alive but sleeping (interruptible sleep; the '+' marks the foreground process group).

As for the mysql client library catching SIGCHLD, I doubt it. I'm definitely catching it in my handler.

We believe we have tracked this issue down. The client program was calling recv() after having read all available data from the socket, causing it to block, and of course the server sits and waits on recv() for each client connection, which also blocks, so it looks like it wasn't actually a problem with the server after all. However, I have since added a thread that starts in each child process. Each time a request comes in from a client connection, the child stores the time at which it occurred in a global variable, which the thread checks every 5 seconds. If the difference between the current time and the stored time is greater than a specific timeout, the thread exits and the process terminates. The client developers have since fixed this problem, and we are putting out our update today, after which we will see if it is fixed.
04-22-2008
brewbuck

Quote:

Originally Posted by Elkvis

Most are showing 'S+' for their status, meaning they are alive but sleeping (interruptible sleep; the '+' marks the foreground process group).

As for the mysql client library catching SIGCHLD, I doubt it. I'm definitely catching it in my handler.

You catching it in your handler has nothing to do with whether mysql is blocking the signal. The two are independent. You could install a signal handler which would never be called, because the signal itself is blocked. But you say you've figured it out, so that's probably not what's happening.

Quote:

We believe we have tracked this issue down. The client program was calling recv() after having read all available data from the socket, causing it to block, and of course the server sits and waits on recv() for each client connection, which also blocks, so it looks like it wasn't actually a problem with the server after all. However, I have since added a thread that starts in each child process. Each time a request comes in from a client connection, the child stores the time at which it occurred in a global variable, which the thread checks every 5 seconds. If the difference between the current time and the stored time is greater than a specific timeout, the thread exits and the process terminates. The client developers have since fixed this problem, and we are putting out our update today, after which we will see if it is fixed.

So instead of using TCP's intrinsic back-off and timing systems you are hacking in your own? That makes no sense. The recv() WILL eventually return after a timeout.

The problem seems to be the design of the protocol itself. You don't know there is no more data coming and blindly call recv().
04-23-2008
Elkvis

Quote:

Originally Posted by brewbuck

You catching it in your handler has nothing to do with whether mysql is blocking the signal. The two are independent. You could install a signal handler which would never be called, because the signal itself is blocked. But you say you've figured it out, so that's probably not what's happening.

The point I was trying to make is that I'm definitely catching the signal, because my SIGCHLD handler prints to the screen every time a child process terminates, and I can see it happening.

Quote:

Originally Posted by brewbuck

So instead of using TCP's intrinsic back-off and timing systems you are hacking in your own? That makes no sense. The recv() WILL eventually return after a timeout.

The problem seems to be the design of the protocol itself. You don't know there is no more data coming and blindly call recv().

Straight from the man page for recv(): (http://www.penguin-soft.com/penguin/...an2/recv.2.inc)
"If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking."

I have not set my sockets to non-blocking, therefore recv() should block until there is data available. Perhaps you have some suggestions for how I might take advantage of "TCP's intrinsic back-off and timing systems." Perhaps you could share them with me, rather than simply telling me that I'm doing it wrong.

I eagerly await your advice.
04-23-2008
brewbuck

Quote:

Originally Posted by Elkvis

Straight from the man page for recv(): (http://www.penguin-soft.com/penguin/...an2/recv.2.inc)
"If no messages are available at the socket, the receive calls wait for a message to arrive, unless the socket is nonblocking."

That is not the complete picture. recv() is a protocol-agnostic system, as is the entire socket layer in general. TCP/IP itself has underlying timeout features which will cause the connection to be treated as "dead" after a certain amount of time.

Quote:

I have not set my sockets to non-blocking, therefore recv() should block until there is data available. Perhaps you have some suggestions for how I might take advantage of "TCP's intrinsic back-off and timing systems." Perhaps you could share them with me, rather than simply telling me that I'm doing it wrong.

I didn't describe how to do it, because no description is necessary. It's all automatic.

Again, I think the fundamental problem is not having a way of knowing when the data is finished, leaving you to resort to such tricks. It's not robust -- what if data was merely delayed by a fraction of a second longer than your timeout? My suggestion is to modify the protocol so you know how many bytes to expect.
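
One way to do that, sketched under the assumption that each message is sent as a 4-byte big-endian length followed by that many payload bytes. This framing is an illustration, not the protocol used in this thread, and recv_all and recv_message are hypothetical helper names.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <stdint.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Receive exactly len bytes, looping because recv() may return fewer.
 * Returns 0 on success, -1 on error or if the peer closed early. */
int recv_all(int fd, void *buf, size_t len)
{
    unsigned char *p = buf;
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n == 0)
            return -1;                    /* peer closed the connection */
        if (n < 0) {
            if (errno == EINTR)
                continue;                 /* interrupted by a signal: retry */
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Read one message: a 4-byte big-endian length, then the payload.
 * Returns the payload length, or -1 on error or if the payload
 * would not fit into cap bytes. */
long recv_message(int fd, void *buf, size_t cap)
{
    uint32_t be_len;
    if (recv_all(fd, &be_len, sizeof be_len) == -1)
        return -1;
    uint32_t len = ntohl(be_len);
    if (len > cap)
        return -1;
    if (recv_all(fd, buf, len) == -1)
        return -1;
    return (long)len;
}
```

With framing like this the receiver never calls recv() for data that is not coming, so it only blocks while a message is expected or in flight.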

Did not mean to sound snippy.