Howdy --
Looking for tips on how I can get my program to dump a core file when it crashes. The program itself is an SMTP server of about 20k lines of code. It is presently written to fork itself 12 times and then spawn 64 threads in each process. Each thread then handles a separate connection. The system is a Dell PowerEdge 1650 running Red Hat Enterprise Linux 3, with kernel 2.4.21-20.ELsmp.
Before anyone states the obvious, let me run down what I've tried already. I've tried enabling core dumps in the bash shell with the command 'ulimit -c unlimited', after which my settings look like so:
[root@jean root]# ulimit -a
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
max locked memory (kbytes, -l) 4
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 7168
virtual memory (kbytes, -v) unlimited
I have also tried enabling core dumps in code using the 'setrlimit' function call:
Code:
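#include <string.h>        /* memset */
#include <sys/resource.h>  /* getrlimit, setrlimit, RLIMIT_CORE */

/* Raise the soft core-file limit to unlimited.
 * Returns 1 on success, 0 on failure. slog() is the app's own logger. */
int enable_coredumps(void) {
    int state;
    struct rlimit rlim;

    memset(&rlim, 0, sizeof(rlim));
    state = getrlimit(RLIMIT_CORE, &rlim);
    if (state) {
        slog("(main:enable_coredumps) Could not get the kernel options. getrlimit = %i", state);
        return 0;
    }

    /* Only the soft limit changes; the hard limit (rlim_max) stays as read. */
    rlim.rlim_cur = RLIM_INFINITY;
    state = setrlimit(RLIMIT_CORE, &rlim);
    if (state) {
        slog("(main:enable_coredumps) Could not set kernel options for core dumping. setrlimit = %i", state);
        return 0;
    }

    return 1;
}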
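One thing the function above does not cover: setrlimit rejects a soft limit that is higher than the hard limit, and the return value alone does not say why a call failed. A minimal variant, offered as a sketch only, raises the soft limit to whatever the hard limit currently is and reports errno. The name raise_core_limit is hypothetical, and fprintf stands in for slog so the snippet compiles on its own.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core limit as far as the hard
 * limit allows. Returns 1 on success, 0 on failure. */
int raise_core_limit(void) {
    struct rlimit rlim;

    if (getrlimit(RLIMIT_CORE, &rlim) != 0) {
        fprintf(stderr, "getrlimit(RLIMIT_CORE): %s\n", strerror(errno));
        return 0;
    }

    /* The soft limit may never exceed the hard limit, so use it as the cap. */
    rlim.rlim_cur = rlim.rlim_max;

    if (setrlimit(RLIMIT_CORE, &rlim) != 0) {
        fprintf(stderr, "setrlimit(RLIMIT_CORE): %s\n", strerror(errno));
        return 0;
    }
    return 1;
}
```

If the hard limit is already 0 when the process starts, this still succeeds but leaves core dumps disabled, so it is worth logging rlim_max as well.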
So far nothing has worked.
Does anyone have any ideas? My first theory is that something about the size of my stack prevents the kernel from dumping the core. My second theory is that the problem results from the fact that I start the application as root, bind to port 25, and then setuid to another user. I then chdir to that user's home directory, so in theory the process can no longer access the directory where the app was started. To try to rule this theory out, I've started the app from that user's home directory, to no avail.
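On the second theory, one detail may matter: Linux clears a process's "dumpable" flag when its effective user ID changes, and a process with that flag cleared produces no core file regardless of RLIMIT_CORE or the working directory. The sketch below shows one way the privilege drop could restore the flag with prctl(PR_SET_DUMPABLE). The helper names reenable_dumpable and drop_privileges and their parameters are hypothetical, and I have not confirmed this is the cause here.

```c
#include <stdio.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: mark the calling process as dumpable again.
 * Returns 1 on success, 0 on failure. */
int reenable_dumpable(void) {
    if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) != 0) {
        perror("prctl(PR_SET_DUMPABLE)");
        return 0;
    }
    /* PR_GET_DUMPABLE returns the current flag; 1 means dumpable. */
    return prctl(PR_GET_DUMPABLE, 0, 0, 0, 0) == 1;
}

/* Hypothetical helper: drop root, move to a directory the new user can
 * write to, then restore the dumpable flag that setuid cleared. */
int drop_privileges(uid_t uid, gid_t gid, const char *dir) {
    if (setgid(gid) != 0) { perror("setgid"); return 0; }
    if (setuid(uid) != 0) { perror("setuid"); return 0; }
    if (chdir(dir) != 0)  { perror("chdir");  return 0; }
    return reenable_dumpable();
}
```

The order matters in this sketch: the prctl call has to come after setuid, since the change of user ID is what resets the flag.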
Alternatively, has anyone written a SIGSEGV handler that uses the ptrace function to record where the fault occurred? (Remember, this is a multi-threaded app, so in theory the parent thread could catch the SIGSEGV, and then ptrace all of the child threads before exiting, though I haven't actually tried to implement this.)
I have run this program extensively in valgrind and can find no trace of a bug. However, I cannot run the production server inside valgrind because of the number of simultaneous connections. This application tends to have between 100 and 500 simultaneous connections, and when running under valgrind it doesn't appear to accept connections fast enough.
Any help would be greatly appreciated.
Thanks in advance,
Ladar