I like the topic of this thread! These are good questions, and important.
Some things may be learned from Erlang, with its light-weight processes that stop when a failure occurs, and may be restarted with clean state (all invariants established again).
Here is a blog post on Erlang’s “let it crash” principle: Let it crash (the right way…) | MazenHarake
Whether in Erlang or other languages, encapsulation of subsystem state is crucial, e.g. within a process, AppDomain or whatever. Global state shared between multiple components is problematic, as it can be difficult to ensure that it remains consistent (all invariants established) after arbitrary application failures.
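To make the restart-with-clean-state idea concrete outside Erlang, here is a minimal Python sketch. It is not how Erlang supervisors are implemented; `Counter` and `Supervisor` are hypothetical names invented for illustration. The point is only that the worker's state lives inside one object, so a failure can be answered by throwing that object away and building a fresh one.

```python
class Counter:
    """Hypothetical worker; all of its state lives inside the instance."""

    def __init__(self):
        self.total = 0  # invariant: total >= 0

    def handle(self, amount):
        if amount < 0:
            raise ValueError("negative amount would break the invariant")
        self.total += amount
        return self.total


class Supervisor:
    """Hypothetical supervisor: replaces a failed worker with a fresh one."""

    def __init__(self, factory):
        self.factory = factory
        self.worker = factory()
        self.restarts = 0

    def send(self, message):
        try:
            return self.worker.handle(message)
        except Exception:
            # Let it crash: discard the possibly inconsistent state and
            # start again from a state where all invariants hold.
            self.worker = self.factory()
            self.restarts += 1
            return None


supervisor = Supervisor(Counter)
supervisor.send(5)
supervisor.send(-1)  # the worker fails and is replaced
supervisor.send(2)   # the fresh worker starts from total == 0
```

Because nothing outside the worker holds a reference to its state, the restart cannot leave stale data behind. With shared global state this guarantee is lost.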
Terminating a failed process can be non-trivial, especially if physical state is involved. For example, I know a robotic assembly machine with multiple robot arms, where after a failure (which results in some inconsistent or undefined state) the arms have to be untangled and brought back into their idle positions before everything can be safely restarted. The cleanup code for getting the arms safely back into that position is more than 50% of the overall code size. Here, the entire physical system is thought of as transactional: every transaction starts from and ends in a well-defined idle state, where all invariants are trivially established.
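The transactional view can be sketched in a few lines of Python. Everything here is a made-up stand-in (`Machine`, `recover`, the position strings), and the real recovery code is of course the hard part. The sketch only shows the shape: a job may start only from the idle state, and whatever happens inside it, the system is brought back to idle before anything else runs.

```python
from contextlib import contextmanager


class Machine:
    """Hypothetical stand-in for the robot cell; positions are plain strings."""

    def __init__(self):
        self.arms = {"left": "idle", "right": "idle"}
        self.log = []

    def is_idle(self):
        return all(position == "idle" for position in self.arms.values())

    def recover(self):
        # In a real machine this is the large, careful untangling code.
        for name in self.arms:
            self.arms[name] = "idle"
        self.log.append("recovered")


@contextmanager
def transaction(machine):
    """Run one job that must start from and end in the idle state."""
    if not machine.is_idle():
        raise RuntimeError("refusing to start: machine is not idle")
    try:
        yield machine
    finally:
        # Runs on success and on failure alike.
        if not machine.is_idle():
            machine.recover()


machine = Machine()
try:
    with transaction(machine) as m:
        m.arms["left"] = "over conveyor"
        raise RuntimeError("gripper sensor timeout")  # simulated failure
except RuntimeError:
    pass  # the failure is reported elsewhere; the machine is idle again
```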
So keeping things simple, modularizing system state and thinking in invariants, sometimes even in transactions, seem to be powerful tools.
Not catching exceptions too early is also important. If they are caught deep inside service components rather than in the main program, faulty code may go undetected for a long time.
In some safety-critical systems, a central health monitor receives failure indications from all components, and tells the components how to react (stop component, restart component, restart entire system or stop entire system). The health monitor’s behavior is configurable because the right reaction depends on the specific system: an individual component cannot know which error handling strategy makes sense for every system it might be used in.
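A configurable health monitor of this kind can be as simple as a lookup table. The following Python sketch is an assumption about the general shape, not the API of any particular standard or product; the component names, failure names and the fallback choice are all invented for illustration.

```python
from enum import Enum


class Action(Enum):
    STOP_COMPONENT = "stop component"
    RESTART_COMPONENT = "restart component"
    RESTART_SYSTEM = "restart entire system"
    STOP_SYSTEM = "stop entire system"


class HealthMonitor:
    """Hypothetical monitor: maps reported failures to a configured action."""

    def __init__(self, policy, default=Action.STOP_SYSTEM):
        self.policy = policy    # (component, failure) -> Action
        self.default = default  # used for anything the table does not cover

    def report(self, component, failure):
        return self.policy.get((component, failure), self.default)


# The policy is per-system configuration, not part of any component.
policy = {
    ("logger", "disk_full"): Action.STOP_COMPONENT,
    ("sensor_io", "timeout"): Action.RESTART_COMPONENT,
    ("controller", "watchdog_expired"): Action.RESTART_SYSTEM,
}
monitor = HealthMonitor(policy)
```

Components only report what went wrong. A different system can reuse the same components with a different table. Falling back to the most conservative action for unlisted failures is a design choice of this sketch, not a rule.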
Also, it can help to distinguish between expected errors (which should be handled) and unexpected failures (which happen when the underlying architectural assumptions about the end systems are violated, e.g. “not enough RAM available”, or “thrust has been reversed although airplane is still in the air”). It is often necessary to distinguish several operating modes to be able to define such errors and failures (e.g. “airplane is in the air” versus “airplane is touching ground”).
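In code, the distinction can be made visible by using two separate exception types and by making the operating mode an explicit value. This is a toy Python sketch with hypothetical names (`Mode`, `ExpectedError`, `UnexpectedFailure`, and both functions), reusing the airplane example purely as illustration.

```python
from enum import Enum


class Mode(Enum):
    IN_AIR = "airplane is in the air"
    ON_GROUND = "airplane is touching ground"


class ExpectedError(Exception):
    """Anticipated by the design; the caller handles it and carries on."""


class UnexpectedFailure(Exception):
    """An architectural assumption was violated; escalate, do not handle locally."""


def request_reverse_thrust(mode):
    # A request in the wrong mode is an expected error: reject it.
    if mode is Mode.IN_AIR:
        raise ExpectedError("reverse thrust is not available in the air")
    return "reversed"


def check_thrust(mode, thrust_reversed):
    # Observing reversed thrust in the air means the assumptions are broken.
    if mode is Mode.IN_AIR and thrust_reversed:
        raise UnexpectedFailure("thrust reversed although airplane is in the air")


try:
    request_reverse_thrust(Mode.IN_AIR)
    handled = False
except ExpectedError:
    handled = True  # expected errors are handled right where they occur
```

Only `ExpectedError` is caught locally. An `UnexpectedFailure` is meant to propagate up to whatever decides the system-level reaction.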
To keep things robust, things need to be kept simple. Often, it is simpler (sometimes drastically simpler) if code is not interrupt-driven / event-driven, but instead periodically polls its inputs, does some computation, and then sets its outputs. Then it’s much easier to avoid reentrancy problems. For this reason, aerospace systems often allow only a single interrupt, namely for a timer that controls a simple periodic schedule. That’s also how most of the world’s true real-time systems operate, namely the lowly PLCs (not the fancy RTOSes with their complex-to-use interrupt priority schemes).
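The poll, compute, output cycle looks roughly like this in Python. It is only a sketch of the structure: the thermostat logic, the thresholds and all function names are invented, and `time.sleep` on a desktop OS gives no real-time guarantees.

```python
import time


def control_step(inputs, state):
    """Pure computation: a hypothetical thermostat with hysteresis."""
    temperature = inputs["temperature"]
    heater_on = state["heater_on"]
    if temperature < 19.0:
        heater_on = True
    elif temperature > 21.0:
        heater_on = False
    return {"heater_on": heater_on}


def run(read_inputs, write_outputs, period_s, cycles):
    """Fixed periodic schedule: sample inputs, compute, set outputs, wait."""
    state = {"heater_on": False}
    next_deadline = time.monotonic()
    for _ in range(cycles):
        inputs = read_inputs()               # 1. sample all inputs once
        state = control_step(inputs, state)  # 2. compute on that snapshot
        write_outputs(state)                 # 3. set all outputs
        next_deadline += period_s
        time.sleep(max(0.0, next_deadline - time.monotonic()))
    return state


# Usage with canned inputs instead of real hardware.
samples = iter([18.0, 20.0, 22.0])
outputs = []
run(
    read_inputs=lambda: {"temperature": next(samples)},
    write_outputs=lambda state: outputs.append(state["heater_on"]),
    period_s=0.01,
    cycles=3,
)
```

Nothing can interrupt `control_step` halfway through, and it always sees one consistent snapshot of the inputs, so reentrancy simply does not come up.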
There are a number of other interesting approaches, but they are too far away from the mainstream: monads of functional programming languages, their equivalents in dataflow process networks, or cycle-free programming. But it’s always about keeping state modular and controlled, and about keeping control flow simple (e.g. avoid cycles as much as possible).
Cuno