Hello,

I am running a computationally very expensive program that, although parallelised, still runs for many days. It is therefore undesirable for it to stop when a runtime error occurs, since precious CPU time on our Linux cluster would be lost. To handle abort() and other runtime problems gracefully, I have installed a signal handler that throws an exception instead of letting the program exit right away. The exception can then be caught and the program can continue without wasting CPU time.

Or so I thought.

This is the problem I have:

I install a signal handler that displays a message and then throws an exception. Somewhere in the running code, a line calls abort(). This call is inside a try block. The corresponding catch (...) clause, however, is never reached. Instead, judging by the output, the C++ runtime calls terminate(), which calls abort() again, so the signal handler is invoked recursively without the program ever terminating (presumably until the stack overflows). This is the message I get (on stdout/stderr):

Code:
NeuroEvolution: Signal 6 (Aborted) received.  Throwing an exception...
terminate called after throwing an instance of 'int'
NeuroEvolution: Signal 6 (Aborted) received.  Throwing an exception...
NeuroEvolution: Signal 6 (Aborted) received.  Throwing an exception...
terminate called recursively
NeuroEvolution: Signal 6 (Aborted) received.  Throwing an exception...
NeuroEvolution: Signal 6 (Aborted) received.  Throwing an exception...
terminate called recursively
etc. (you get the picture). The "NeuroEvolution: ..." message is from my signal handler, which looks like this:

Code:
void CNeuroEvolution::ThrowingSignalHandler (int signal)
{
    // Report which signal arrived (strsignal gives its description)
    cerr << "NeuroEvolution: Signal " << signal
         << " (" << strsignal (signal) << ")"
         << " received.  Throwing an exception..." << endl;

    // Throw the signal number as an int; this is NeuroEvolution.cpp:2298 in the backtrace
    throw signal;
}
It was installed using this code:

Code:
/// handle signals like ABRT and SEGV by having them throw an exception
void CNeuroEvolution::InstallSignalHandler()
{
    struct sigaction signal_handling;

    signal_handling.sa_handler = &ThrowingSignalHandler;

    // block all signals while the handler is running
    sigfillset(&(signal_handling.sa_mask));
    signal_handling.sa_flags = SA_NOCLDSTOP;  // note: this flag only affects SIGCHLD

    sigaction(SIGILL, &signal_handling, NULL);
    sigaction(SIGTRAP, &signal_handling, NULL);
    sigaction(SIGABRT, &signal_handling, NULL);
    sigaction(SIGFPE, &signal_handling, NULL);
    sigaction(SIGSEGV, &signal_handling, NULL);
}
I thought that filling sa_mask with sigfillset would prevent any recursive call to the handler. I have also tried sigemptyset in its place, with no apparent change to the situation.
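
To make the sa_mask question concrete, here is a minimal sketch (not taken from the program; ProbeHandler, ProbeMaskDuringHandler and IsBlockedNow are hypothetical names, and SIGUSR1/SIGUSR2 merely stand in for the real signals). It shows that the signals in sa_mask are blocked only while the handler runs, and that the previous mask comes back when the handler returns normally. Two things it does not show: a handler that leaves via throw never returns normally, and POSIX specifies that abort() overrides blocking of SIGABRT, so sa_mask may not be able to prevent the recursion seen above in any case.

```cpp
#include <cassert>
#include <signal.h>

// Hypothetical names for illustration only; nothing here is from NeuroEvolution.cpp.
static volatile sig_atomic_t g_usr2_blocked_in_handler = -1;

// Handler that only records whether SIGUSR2 is blocked while it runs.
extern "C" void ProbeHandler(int)
{
    sigset_t current;
    sigemptyset(&current);
    // With a null "set" argument, sigprocmask only queries the current mask.
    sigprocmask(SIG_BLOCK, nullptr, &current);
    g_usr2_blocked_in_handler = sigismember(&current, SIGUSR2);
}

// Returns true if signo is blocked in the calling thread right now.
bool IsBlockedNow(int signo)
{
    sigset_t current;
    sigemptyset(&current);
    sigprocmask(SIG_BLOCK, nullptr, &current);
    return sigismember(&current, signo) == 1;
}

// Installs ProbeHandler for SIGUSR1 with a full sa_mask, raises SIGUSR1,
// and returns 1 if SIGUSR2 was blocked while the handler was running.
int ProbeMaskDuringHandler()
{
    struct sigaction sa = {};       // zero-initialise every field
    sa.sa_handler = &ProbeHandler;
    sigfillset(&sa.sa_mask);        // same idea as in InstallSignalHandler
    sa.sa_flags = 0;

    int rc = sigaction(SIGUSR1, &sa, nullptr);
    assert(rc == 0);
    (void)rc;

    raise(SIGUSR1);                 // the handler runs before raise() returns
    return g_usr2_blocked_in_handler;
}
```

Calling ProbeMaskDuringHandler() and then IsBlockedNow(SIGUSR2) lets one compare the mask inside the handler with the mask after it has returned.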

Here is the catch statement I would like to end up at (the abort occurs three call levels below DoSomething):

Code:
try
{
    // Something that could go wrong
    DoSomething(data);
}
catch (...)
{
    // something has gone wrong
    // avoid a program abort, print a message and do default action

    cerr << " NeuroEvolution: Error occurred, removing incorrect data... " << endl;

    // skip to the next iteration of the enclosing loop
    continue;
}
This is the relevant part of the stack backtrace:

Code:
#32247 0x00002b93bc7ffbbb in PAC::CNeuroEvolution::ThrowingSignalHandler (signal=6) at NeuroEvolution.cpp:2298
#32248 <signal handler called>
#32249 0x00002b93bedb8aa5 in raise () from /lib64/libc.so.6
#32250 0x00002b93bedb9e60 in abort () from /lib64/libc.so.6
#32251 0x00002b93be8afe5b in std::set_unexpected () from /usr/lib64/libstdc++.so.6
#32252 0x00002b93be8af24b in __cxa_bad_cast () from /usr/lib64/libstdc++.so.6
#32253 0x00002b93be8afceb in __gxx_personality_v0 () from /usr/lib64/libstdc++.so.6
#32254 0x00002b93bec84748 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#32255 0x00002b93bec848dc in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#32256 0x00002b93be8aff5d in __cxa_throw () from /usr/lib64/libstdc++.so.6
#32257 0x00002b93bc7ffbbb in PAC::CNeuroEvolution::ThrowingSignalHandler (signal=6) at NeuroEvolution.cpp:2298
#32258 <signal handler called>
#32259 0x00002b93bedb8aa5 in raise () from /lib64/libc.so.6
#32260 0x00002b93bedb9e60 in abort () from /lib64/libc.so.6
#32261 0x00002b93be8afe5b in std::set_unexpected () from /usr/lib64/libstdc++.so.6
#32262 0x00002b93be8af24b in __cxa_bad_cast () from /usr/lib64/libstdc++.so.6
#32263 0x00002b93be8afceb in __gxx_personality_v0 () from /usr/lib64/libstdc++.so.6
#32264 0x00002b93bec84748 in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#32265 0x00002b93bec848dc in _Unwind_RaiseException () from /lib64/libgcc_s.so.1
#32266 0x00002b93be8aff5d in __cxa_throw () from /usr/lib64/libstdc++.so.6
#32267 0x00002b93bc7ffbbb in PAC::CNeuroEvolution::ThrowingSignalHandler (signal=6) at NeuroEvolution.cpp:2298
#32268 <signal handler called>
#32269 0x00002b93bedb8aa5 in raise () from /lib64/libc.so.6
#32270 0x00002b93bedb9e60 in abort () from /lib64/libc.so.6
#32271 0x00002b93bedb2246 in __assert_fail () from /lib64/libc.so.6
#32272 0x00002b93bc7e7045 in CLinearGenome::SortSubNetworks (this=0x7fffee3e6720, Start=0, End=108, 
    Visited=0x7fffee3e6450) at LinearGenome.cpp:1106
I have spent a lot of time trying to get it to run and would be very happy if you could point me to a solution.

Please let me know if you need any more details. I am using gcc (GCC) 4.1.0 (SUSE Linux).

Many thanks in advance,

Nils.

PS: Please do not suggest fixing the cause of the abort() call; I have had that idea myself already :-) Most problems occur in a part of the code that I did not write myself, and it would be too time-consuming to debug all of it.