View Full Version : Efficient watch dog process...



MutantJohn
03-11-2015, 09:30 AM
Okay, I have some questions about what's possible or what's not.

I finally read up on signal handling in C and Linux. Turns out, SIGSTOP cannot be caught or handled so I want to work around that.

What if I wrote a script that launches executable A. Every time A is started, if it is stopped, I want it to be killed or interrupted.

So I write a script that launches A then executable B which is the "watch dog". B watches over A and every time A is stopped, B sends a signal to A (SIGCONT, SIGINT, etc).

Is there an efficient way to do this?

My idea was to have B ping A every t seconds or something like that and then upon specific statuses, respond. Is there a better way to do this?

Also, assume that I might be able to alter the source code of A. I think sockets would then be a possibility and it would be performant as listening on a specific port has no performance overhead that I'm aware of.

Nominal Animal
03-11-2015, 03:44 PM
SIGSTOP cannot be caught or handled so I want to work around that.
Please don't.

If you don't want the user to suspend your application by pressing CTRL+Z, use stty susp undef (http://man7.org/linux/man-pages/man1/stty.1.html) (or the equivalent termios (http://man7.org/linux/man-pages/man3/termios.3.html) setting, .c_cc[VSUSP] = _POSIX_VDISABLE), or use Curses.

Other than that, SIGSTOP is rare, but useful. It's used in various checkpoint-restart schemes, for example. It is not used willy-nilly; it's not like there are SIGSTOP signals being sent for no reason! Working around SIGSTOP only makes your program difficult to manage, and annoying to use.

Note that sending the target process any signal other than SIGCONT depends on whether the target process has set a signal handler, or if the signal causes the process to die. If there is a signal handler, it will only be run AFTER the process continues, not immediately. If the signal causes the process to die, it will die immediately. Therefore, you should respond to a process stopping by sending it either the SIGCONT signal to resume it immediately, or the SIGKILL signal to kill it. (If you need the process to immediately resume and handle some other signal, send that signal first, followed immediately by SIGCONT.)


B watches over A and every time A is stopped, B sends a signal to A (SIGCONT, SIGINT, etc). Is there an efficient way to do this?
Yes. Have B start A, so that it is the parent of A, then wait for A to exit using waitpid(,,WUNTRACED) (http://man7.org/linux/man-pages/man2/waitpid.2.html); the WUNTRACED flag means waitpid() will return if A is stopped.

Better yet, have B start a new process group (setpgid(B,B) (http://man7.org/linux/man-pages/man2/setpgid.2.html) before starting A), and you can use a waitpid(-B,&status,WUNTRACED) (http://man7.org/linux/man-pages/man2/waitpid.2.html) to reap any child processes (in this process group). If a child process is stopped, it will return the PID of that process with (WIFSTOPPED(status)), so you can send it the signal. Or, if you just want to keep the children running, you can also use kill(-B,SIGCONT) to try and keep all child processes running.
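
A minimal sketch of such a waitpid() loop follows. It is a variation on the description above, not the program mentioned later in this post: the child is placed in its own new process group (rather than B making itself the group leader), and the name run_unstoppable() is invented for the example. Terminal and job-control details are deliberately ignored.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] in a new process group and resume it whenever it stops.
 * Returns the command's exit status, 128 + signal number if a signal
 * killed it, or -1 if fork() or waitpid() failed. */
int run_unstoppable(char *const argv[])
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;

    if (child == 0) {
        setpgid(0, 0);              /* child leads its own process group */
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }

    /* Set the group from the parent too, so it exists no matter which
     * process runs first. This fails harmlessly after the child's exec. */
    setpgid(child, child);

    for (;;) {
        /* Negative PID: wait for any child in process group `child`. */
        pid_t p = waitpid(-child, &status, WUNTRACED);

        if (p == -1) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal, retry */
            return -1;
        }

        if (WIFSTOPPED(status))
            kill(p, SIGCONT);       /* it was stopped: wake it up again */
        else if (p == child)
            return WIFEXITED(status) ? WEXITSTATUS(status)
                                     : 128 + WTERMSIG(status);
    }
}
```

The parent spends its time blocked in waitpid(), so nothing is polled. Replacing kill(p, SIGCONT) with kill(p, SIGKILL) gives the "kill it if it is ever stopped" behaviour instead.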


Also, assume that I might be able to alter the source code of A.
No need.


To verify this scheme works, I wrote a 115-line C program that executes the command given in its command-line parameters in a new process group, reflects all signals sent to B to the child process (it could also forward them to the entire process group), keeps the children running, and returns when the last process in the process group exits; the program itself returns the exit status of the command executed.

The program just installs signal handlers (for the reflected signals), then blocks in a waitpid() loop, so it does not consume CPU time at all during normal operation (only when reflecting signals or waking up child processes). There is no need to try to optimize the performance, as it is very lightweight anyway.

Although this particular use case is not something I'd ever use, it's very close to process supervisors -- small parent programs that monitor a daemon process or a set of related processes comprising a service --, especially in how the signals are delivered. Linux and Unix services are typically controlled via signals. For example, a SIGTERM signal is often used to tell the service to shut down, SIGHUP to tell it to reload its configuration (or restart), and SIGUSR1 to rotate or reopen log files (if using custom log files instead of syslog). These signals are delivered to the initial process. However, when the service comprises several processes, a small supervisor can be used to monitor the components, reflecting/forwarding the signals as necessary. And, of course, to detect if the process exits prematurely, or perhaps gets deadlocked or some such, and restart it automatically.

The above is common to all Unix systems, most current Linux systems (except possibly systemd), and those BSD derivatives that still use signals instead of some other IPC to control services. That makes this subject, in general, useful to know for anyone developing service daemons.

(A side note: Instead of moving to DBUS or other IPC methods to control services, we could just use signals bidirectionally. That is, we could standardize on services to respond to their parent process (or a process whose PID was set in an environment variable, or as a command-line parameter) using signals, updating their current state as necessary. sigqueue() (http://man7.org/linux/man-pages/man3/sigqueue.3.html) allows "queuing" the signal -- it's the same as kill(), except with an optional integer (or pointer) payload. The integer could be a set of standard status codes, similar to HTTP response codes. Realtime signals are queued, and therefore would be well suited for this. All this is POSIX, not Linux-specific.
If service processes sent a signal when they become operational (active and ready to respond to requests), dependency-based init systems would be simple, and services could be started in parallel without issues. Acknowledgements for "successfully restarted" or "successfully re-read configuration" or "successfully rotated log files" and similar would make typical service management tasks robust, without polling: the service manager could immediately tell the admin user if the action was successful. If you've ever had to write a service script for a closed-source Java service, you know these issues.
The downside to using signals is that we'd need to add a couple of lines (of POSIX C) to each service daemon. I think that'd work much better than the dbus mess, though.)
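
To make the side note a bit more concrete, here is a small sketch of status reporting with sigqueue(). Everything about the "protocol" is hypothetical: the use of SIGRTMIN, the function names, and the idea that status codes are non-negative integers. Only the POSIX calls themselves (sigprocmask(), sigqueue(), sigtimedwait()) are real.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical convention: the first realtime signal carries status codes. */
#define STATUS_SIGNAL SIGRTMIN

/* Manager side: block the status signal so reports queue up until
 * they are collected with wait_for_status(). */
int status_channel_init(void)
{
    sigset_t set;

    sigemptyset(&set);
    sigaddset(&set, STATUS_SIGNAL);
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Service side: report a (non-negative) status code to the manager. */
int report_status(pid_t manager, int code)
{
    union sigval value;

    value.sival_int = code;
    return sigqueue(manager, STATUS_SIGNAL, value);
}

/* Manager side: wait up to `seconds` for a report. Returns the status
 * code and stores the sender's PID in *sender (if not NULL), or
 * returns -1 on timeout or error. */
int wait_for_status(int seconds, pid_t *sender)
{
    sigset_t set;
    siginfo_t info;
    struct timespec timeout;

    sigemptyset(&set);
    sigaddset(&set, STATUS_SIGNAL);
    timeout.tv_sec = seconds;
    timeout.tv_nsec = 0;

    if (sigtimedwait(&set, &info, &timeout) == -1)
        return -1;

    if (sender)
        *sender = info.si_pid;
    return info.si_value.sival_int;
}
```

A service would call something like report_status(getppid(), 200) once it is ready to accept requests, and the manager would block in wait_for_status() instead of polling.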

MutantJohn
03-12-2015, 12:20 AM
Nominal, I'm going to be too busy over the next 3 or so days to attempt to implement what you've suggested but as always, thank you for the very well-written and thoughtful post. I'll have to reread it a few times to fully understand it but it's really, really nice to know that what I want to do is possible and then have a general direction. Thanks, man. :)

Nominal Animal
03-12-2015, 04:06 AM
No worries.

As usual, it's best to start small -- a minimal example that forks and has the child process execute the command specified on the command line (Hint: execvp(argv[1], argv + 1);), with the parent waiting for the child to exit, is a good start. Then modify the loop to also check if the child process was stopped instead of exited, and so on. One step at a time.
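
That first step might look something like the sketch below. The function name run_and_wait() and the 127 / 128 + signal return conventions (borrowed from what shells do) are choices made for this example only.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, execute argv[0] with arguments argv in the child, and wait for
 * it in the parent. Returns the exit status, 128 + signal number if
 * the child was killed by a signal, or -1 on fork()/waitpid() errors. */
int run_and_wait(char *const argv[])
{
    int status;
    pid_t child = fork();

    if (child == -1)
        return -1;

    if (child == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }

    /* Retry if a signal interrupts the wait. */
    while (waitpid(child, &status, 0) == -1)
        if (errno != EINTR)
            return -1;

    return WIFEXITED(status) ? WEXITSTATUS(status)
                             : 128 + WTERMSIG(status);
}
```

From main(), after checking that argc is at least 2, this would be called as run_and_wait(argv + 1). The next step is adding WUNTRACED to the waitpid() flags and checking WIFSTOPPED(status) inside the loop.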

I've saved the example program, so if you encounter issues or after you have implemented yours, I'd be happy to post my version (or the relevant snippets of it). It's only lightly tested, not stress-tested in a signal storm or with a limited fork-bombing child, but its overall logic is pretty sound.

MutantJohn
03-16-2015, 09:48 AM
Wait, am I forking the main process? Am I using shared memory? I'm trying to look up how to use setpgid now and it seems kind of odd. Where should I start?

MutantJohn
03-16-2015, 11:12 AM
Okay, I understand this stuff a lot better now but I'm confused why valgrind is giving me the following errors :


==17472== Memcheck, a memory error detector
==17472== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==17472== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==17472== Command: ./pgroup
==17472==
==17472== Syscall param execve(filename) points to unaddressable byte(s)
==17472== at 0x4EF8337: execve (execve.c:33)
==17472== by 0x4EF898F: execvpe (execvpe.c:63)
==17472== by 0x400D03: launch_process (in /home/christian/Desktop/work/pgroup)
==17472== by 0x400DDA: main (in /home/christian/Desktop/work/pgroup)
==17472== Address 0x51fc09d is 0 bytes after a block of size 13 alloc'd
==17472== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==17472== by 0x400D7C: main (in /home/christian/Desktop/work/pgroup)
==17472==
==17472== Syscall param execve(argv[i]) points to unaddressable byte(s)
==17472== at 0x4EF8337: execve (execve.c:33)
==17472== by 0x4EF898F: execvpe (execvpe.c:63)
==17472== by 0x400D03: launch_process (in /home/christian/Desktop/work/pgroup)
==17472== by 0x400DDA: main (in /home/christian/Desktop/work/pgroup)
==17472== Address 0x51fc09d is 0 bytes after a block of size 13 alloc'd
==17472== at 0x4C2AB80: malloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==17472== by 0x400D7C: main (in /home/christian/Desktop/work/pgroup)
==17472==


Code :


/*
Compile with : gcc -std=gnu11 -Wall -Wextra -pedantic -o pgroup pgroup.c
*/

/* Keep track of attributes of the shell. */


#define _POSIX_SOURCE
#include <sys/types.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
#include <termios.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>


pid_t shell_pgid;
struct termios shell_tmodes;
int shell_terminal;
int shell_is_interactive;


typedef struct _process
{
  char **argv;
} process;




/* Make sure the shell is running interactively as the foreground job
before proceeding. */


void
init_shell ()
{
  /* See if we are running interactively. */
  shell_terminal = STDIN_FILENO;
  shell_is_interactive = isatty (shell_terminal);

  if (shell_is_interactive)
    {
      /* Loop until we are in the foreground. */
      while (tcgetpgrp (shell_terminal) != (shell_pgid = getpgrp ()))
        kill (- shell_pgid, SIGTTIN);

      /* Ignore interactive and job-control signals. */
      signal (SIGINT, SIG_IGN);
      signal (SIGQUIT, SIG_IGN);
      signal (SIGTSTP, SIG_IGN);
      signal (SIGTTIN, SIG_IGN);
      signal (SIGTTOU, SIG_IGN);
      signal (SIGCHLD, SIG_IGN);

      /* Put ourselves in our own process group. */
      shell_pgid = getpid ();
      if (setpgid (shell_pgid, shell_pgid) < 0)
        {
          perror ("Couldn't put the shell in its own process group");
          exit (1);
        }

      /* Grab control of the terminal. */
      tcsetpgrp (shell_terminal, shell_pgid);

      /* Save default terminal attributes for shell. */
      tcgetattr (shell_terminal, &shell_tmodes);
    }
}


void
launch_process (process *p, pid_t pgid,
                int infile, int outfile, int errfile,
                int foreground)
{
  pid_t pid;

  if (shell_is_interactive)
    {
      /* Put the process into the process group and give the process group
         the terminal, if appropriate.
         This has to be done both by the shell and in the individual
         child processes because of potential race conditions. */
      pid = getpid ();
      if (pgid == 0) pgid = pid;
      setpgid (pid, pgid);
      if (foreground)
        tcsetpgrp (shell_terminal, pgid);

      /* Set the handling for job control signals back to the default. */
      signal (SIGINT, SIG_DFL);
      signal (SIGQUIT, SIG_DFL);
      signal (SIGTSTP, SIG_DFL);
      signal (SIGTTIN, SIG_DFL);
      signal (SIGTTOU, SIG_DFL);
      signal (SIGCHLD, SIG_DFL);
    }

  /* Set the standard input/output channels of the new process. */
  if (infile != STDIN_FILENO)
    {
      dup2 (infile, STDIN_FILENO);
      close (infile);
    }
  if (outfile != STDOUT_FILENO)
    {
      dup2 (outfile, STDOUT_FILENO);
      close (outfile);
    }
  if (errfile != STDERR_FILENO)
    {
      dup2 (errfile, STDERR_FILENO);
      close (errfile);
    }

  /* Exec the new process. Make sure we exit. */
  execvp (p->argv[0], p->argv);
  perror ("execvp");

  int status;
  waitpid(-1, &status, WUNTRACED);

  exit (1);
}


int main(void)
{
  process p;

  p.argv = malloc(2 * sizeof(char*));

  const char proc_name[] = "/usr/bin/nano";
  p.argv[0] = malloc(strlen(proc_name) * sizeof(char));
  memcpy(p.argv[0], proc_name, strlen(proc_name) * sizeof(char));
  p.argv[1] = NULL;

  init_shell();
  launch_process (&p, 0,
                  STDIN_FILENO, STDOUT_FILENO, STDERR_FILENO,
                  1);

  free(p.argv[0]);
  free(p.argv);

  return 0;
}


Am I just using these functions wrong?

Nominal Animal
03-16-2015, 12:34 PM
You have an off-by-one error here:



process p;
p.argv = malloc(2 * sizeof(char*));

const char proc_name[] = "/usr/bin/nano";
p.argv[0] = malloc(strlen(proc_name) * sizeof(char));
memcpy(p.argv[0], proc_name, strlen(proc_name) * sizeof(char));
p.argv[1] = NULL;

You forgot that a string should always have the terminating NUL byte ('\0'), i.e. strlen() + 1.

The C standard says sizeof (char) is 1.
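
A corrected allocation could be sketched as follows. The helper names make_argv() and free_argv() are made up for the example; the point is only the strlen() + 1 and the dropped sizeof (char).

```c
#include <stdlib.h>
#include <string.h>

/* Build a NULL-terminated argv array holding a single copy of `command`.
 * Returns NULL if memory allocation fails. */
char **make_argv(const char *command)
{
    size_t size = strlen(command) + 1;    /* +1 for the terminating NUL */
    char **argv = malloc(2 * sizeof *argv);

    if (!argv)
        return NULL;

    argv[0] = malloc(size);
    if (!argv[0]) {
        free(argv);
        return NULL;
    }

    memcpy(argv[0], command, size);       /* copies the NUL byte too */
    argv[1] = NULL;
    return argv;
}

/* Release an array created by make_argv(). */
void free_argv(char **argv)
{
    if (argv) {
        free(argv[0]);
        free(argv);
    }
}
```

With this, valgrind should no longer complain about execve() reading past the end of the 13-byte block, because the block is now 14 bytes and NUL-terminated.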

Why did you complicate things with the shell and terminal state?

That is actually quite complicated to do right -- your watchdog has to be the parent of the command processes, but it should be in a different session without a controlling terminal, so that e.g. ^Z does not suspend the watchdog. The simplest way to do that, I think, would be to use a parent (watchdog) and grandparent (original) process in addition to the target process(es). Then, watchdog can close standard streams before setsid() to detach from the terminal, but the original process can send them to the target process (before execv()) via a Unix domain socket pair. Whenever anything interesting happens, the watchdog could send (a usually superfluous) SIGCONT to the grandparent; this is especially important just before exiting.
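
Passing an open descriptor over a Unix domain socket pair is done with SCM_RIGHTS ancillary data. The sketch below shows only that one building block, not the whole watchdog/grandparent arrangement; send_fd() and recv_fd() are names invented for the example, and the code follows the usual pattern from the cmsg(3) man page.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send the open descriptor `fd` over the Unix domain socket `sock`.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 0;                 /* at least one data byte must be sent */
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    union {                         /* union ensures proper alignment */
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;

    memset(&msg, 0, sizeof msg);
    memset(&control, 0, sizeof control);
    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof control.buf;

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive a descriptor sent with send_fd().
 * Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    int fd;
    struct iovec iov;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;

    memset(&msg, 0, sizeof msg);
    iov.iov_base = &dummy;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof control.buf;

    if (recvmsg(sock, &msg, 0) <= 0)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

In the scheme described above, the original process would create the pair with socketpair(AF_UNIX, SOCK_STREAM, 0, sv) before forking, send its standard input, output and error with three send_fd() calls, and the target would dup2() the received descriptors into place before exec.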

It would be much more rewarding to ignore shell and terminal details first, and just get the waitpid() loop right. The terminal/shell state management can be applied on top, after you get that working.

(Full disclosure: I didn't bother to implement full terminal/session control; if you want, I can do that -- shouldn't be too hard, really, just complicated -- using ^Z from nano as the test case.)

MutantJohn
04-10-2015, 12:13 PM
Btw, I haven't given up on doing this. I was lectured on how if I submitted code it would never be looked at.

So I mellowed out for a little bit but my term is approaching an end (was a temp. contractor) and I wanna finish this and try to hand it in towards the end. Gonna yolo this so hard.

Thank you for your constant help, Nominal.