Exception handling, logging and always on? Coding patterns, tips & tricks...?

Hi all

There are so many wise heads here, and I have been looking for really experienced advice on how to structure your code and the underlying subsystems to make sure all errors are caught and the system can keep running without manual intervention.

Most of my setups have been prototyping and demonstrating what's possible, which puts the focus on "will it work?" rather than "will this run for years without problems?"

Exceptions thrown from the NETMF and Gadgeteer layers do not seem very informative, and acting on them might be problematic?

Logging to SD card or other media for later debugging also seems to be a kind of wizardry skill: how do you make sure you log enough, but not too much?

I haven't found any meaningful resources on this. Can you help?


I’m relatively new to the world of reliable resource-constrained computing. All of the patterns and practices I have in my toolkit are designed for the world of effectively-unlimited resources, and all of my micro-controller work so far has only had to survive the length of a demo. I would be very interested in how people have addressed this question.

I would add to the specific things that @ njbuch mentions: the use of telemetry (for health and perf monitoring) and watchdog timers. Do you routinely include a watchdog timer and what recovery strategies do you use?

On a larger scale, there's also the question of failsafe design for networks of devices. If you have a thermostat in your living room and a relay actuator on your boiler, have you included fail-safes (such as a local thermistor at the boiler with absolute min and max ambient temps)?

There’s probably a whole book to be written (or maybe one already out there to be read) on these topics.


@ njbuch - I don't have a complete template project that one could use in VS; that would be nice to have.

That said, I believe (hopefully someone proves me wrong) there will never be a 100% cover-it-all solution for this. But we can all help/contribute to get as close as possible and boost stability.

As a first step one could use/implement a watchdog to auto-reset on a possible failure and somehow (by LED, simple serial display/7-segment, logging the reset count on SD, etc.) record the number of resets for future use. Refer to https://www.ghielectronics.com/docs/31/watchdog for more info on watchdogs.
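The heartbeat/watchdog pattern could be sketched roughly like this. The actual watchdog calls are commented-out placeholders, since the real API names depend on your SDK version (check the GHI doc linked above):

```csharp
using System;
using System.Threading;
using Microsoft.SPOT;

// Sketch of the heartbeat principle: only feed the watchdog after a
// *successful* transaction, so repeated failures lead to a hardware reset.
public static class Program
{
    public static void Main()
    {
        // Watchdog.Enable(5000);  // placeholder: reset board if not fed for 5 s
        //                         // (substitute your SDK's real watchdog API)
        while (true)
        {
            try
            {
                DoOneTransaction();          // your real work goes here
                // Watchdog.ResetCounter();  // placeholder: feed on success only
            }
            catch (Exception ex)
            {
                // Log and continue; if failures persist, the watchdog stops
                // being fed and the device resets itself.
                Debug.Print("Transaction failed: " + ex.Message);
            }
            Thread.Sleep(1000);
        }
    }

    private static void DoOneTransaction() { /* ... */ }
}
```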

In addition one could think of a kind of trace number that is logged together with the error (date-stamped as well), so it can easily be used to see which statement failed or was executed last. I've tried to log the last executed statement line but didn't get it to work reliably, so I used the trace number instead. The heartbeat principle could be used as well: every time you successfully complete a transaction you restart the watchdog timeout, otherwise it will go off and reset your device.
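The trace-number idea can be as simple as a static checkpoint marker that is written out with every error (all names here are invented for illustration):

```csharp
using System;
using Microsoft.SPOT;

// Checkpoint-based tracing sketch: bump the trace number at interesting
// points, and log it (date-stamped) when something goes wrong.
public static class Trace
{
    public static int Checkpoint;   // last checkpoint reached

    public static void LogError(Exception ex)
    {
        // In a real system this would also go to SD/UART/network.
        Debug.Print(DateTime.Now + " ERROR at checkpoint " + Checkpoint + ": " + ex);
    }
}

// Usage inside some routine:
//   Trace.Checkpoint = 110; ReadSensors();
//   Trace.Checkpoint = 120; SendReport();
// An error logged with checkpoint 120 tells you SendReport() was in progress.
```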

But as @ mcalsyn said … reading material enough …

For full .NET we've used Microsoft's Patterns & Practices guidance; I don't know if that's still alive, but if so, you could take a couple of ideas from that area.

Hope this will help a bit.


Welcome to the world of QA and a product's life cycle, from the womb to the tomb!

I like the topic of this thread! These are good questions, and important.

Some things may be learned from Erlang, with its light-weight processes that stop when a failure occurs, and may be restarted with clean state (all invariants established again).
Here a blog post on Erlang’s “let it crash” principle: Let it crash (the right way…) | MazenHarake

Whether in Erlang or other languages, encapsulation of subsystem state is crucial, e.g. within a process, AppDomain or whatever. Global state shared between multiple components is problematic, as it can be difficult to ensure that it remains consistent (all invariants established) after arbitrary application failures.

Terminating a failed process can be non-trivial, especially if physical state is involved. For example, I know a robotic assembly machine with multiple robot arms, where after a failure (which results in some inconsistent or undefined state) the arms have to be untangled and brought back into their idle positions before everything can be safely restarted. The cleanup code for getting the arms safely back into that position is more than 50% of the overall code size. Here, the entire physical system is thought of as transactional: every transaction starts from and ends in a well-defined idle state, where all invariants are trivially established.

So keeping things simple, modularizing system state and thinking in invariants, sometimes even in transactions, seem to be powerful tools.

Not catching exceptions too early (e.g. within service components rather than in the main program) is also important; otherwise faulty code may go undetected for a long time.

In some safety-critical systems, a central health monitor receives failure indications from all components, and tells the components how to react (stop component, restart component, restart entire system or stop entire system). The health monitor’s behavior can be configured, as it depends on the specific system - the individual components cannot decide what error handling strategy makes sense, for every possible system.
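A toy sketch of that central-monitor idea might look like this (all type and member names are invented; a real system would add persistence, escalation, and safe-state handling):

```csharp
using System.Collections;

// Toy health-monitor sketch: components report failures, a central monitor
// decides the reaction based on a configurable, system-specific policy.
public enum Reaction { StopComponent, RestartComponent, RestartSystem, StopSystem }

public interface IComponent
{
    string Name { get; }
    void Stop();
    void Restart();
}

public class HealthMonitor
{
    private readonly Hashtable _policy = new Hashtable();

    // The policy is configured per system; components never decide themselves.
    public void Configure(string componentName, Reaction reaction)
    {
        _policy[componentName] = reaction;
    }

    public void ReportFailure(IComponent source)
    {
        object r = _policy[source.Name];
        Reaction reaction = r == null ? Reaction.RestartComponent : (Reaction)r;
        switch (reaction)
        {
            case Reaction.StopComponent:    source.Stop();    break;
            case Reaction.RestartComponent: source.Restart(); break;
            case Reaction.RestartSystem:    /* e.g. stop feeding the watchdog */ break;
            case Reaction.StopSystem:       /* drive outputs to a safe state */  break;
        }
    }
}
```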

Also, it can help to distinguish between expected errors (which should be handled) and unexpected failures (which happen when the underlying architectural assumptions about the end systems are violated, e.g. "not enough RAM available", or "thrust has been reversed although airplane is still in the air"). Often it is necessary to distinguish several operating modes to be able to define such errors and failures (e.g. "airplane is in the air" versus "airplane is touching ground").

To keep things robust, things need to be kept simple. Often, it is simpler (sometimes drastically simpler) if code is not interrupt-driven / event-driven, but uses e.g. periodic polling of inputs, doing some computation, and then setting up outputs. Then it’s much easier to avoid reentrance problems. For this reason, aerospace systems often allow only a single interrupt, namely for a timer that controls a simple periodic schedule. That’s also how most of the world’s true real-time systems operate, namely the lowly PLCs (not the fancy RTOSes with their complex-to-use interrupt priority schemes).
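The poll-compute-actuate cycle described above can be sketched as a PLC-style cyclic executive (all names and the cycle time are illustrative):

```csharp
using System.Threading;

// Cyclic executive sketch: read all inputs, compute, write all outputs,
// sleep until the next period. Nothing else runs concurrently, so there
// are no reentrance problems to reason about.
public static class CyclicExecutive
{
    private const int PeriodMs = 100;   // illustrative cycle time

    public static void Run()
    {
        while (true)
        {
            int[] inputs = ReadInputs();       // sample every input once per cycle
            int[] outputs = Compute(inputs);   // pure computation, no I/O
            WriteOutputs(outputs);             // actuate at a well-defined point
            Thread.Sleep(PeriodMs);            // simple fixed schedule
        }
    }

    private static int[] ReadInputs() { return new int[0]; }
    private static int[] Compute(int[] inputs) { return inputs; }
    private static void WriteOutputs(int[] outputs) { }
}
```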

There are a number of other interesting approaches, but too far away from the mainstream: monads of functional programming languages, their equivalents in dataflow process networks, or cycle-free programming. But it’s always about keeping state modular and controlled, and about keeping control flow simple (e.g. avoid cycles as much as possible).

Cuno


@ Cuno - Yep, sometimes a procedural approach or balanced line processing is not that bad. In the end, if your (software) building is not architected well enough, then the building is bound for a fall …

In all honesty, it is nice to do all that and be able to know the smallest issue that can hit your device, but WHY?

Let's take my WiFi router for instance: they are made by known brands, and yet they don't really offer a great way to log and debug errors; they still rely on user reports to release fixes and so on.

so again WHY go through the trouble?

I would prefer to spend time and energy on building a solid test case framework, instead!

Jay.


@ Jay Jay - Do you mean good testing [em]instead[/em] of good architecture and design - or in [em]addition[/em]?

Do you mean unit-testing??

No amount of testing will turn badly designed, unreliable code into robust code. Just doesn’t happen. Testing is indeed needed as well, but not instead of, careful design. Of course, many commercial products probably have bad design and also have been barely tested.


Another factor is the available skills: no matter how good your approach, how good your tools, how good your certified development process, you still need developers with the appropriate skill sets. Which does not say the rest is unimportant.


Agree completely with your comments Cuno - "provably correct" is always better than "tested as correct", so I have a lot of affinity for Erlang and process-algebra based approaches, and also for formal-methods approaches applied to critical code. Trouble is, those are very specialized skill sets, and not all of those approaches are easily applied to NETMF. (fwiw, the Microsoft Robotics Development Software grew out of what was a very pure and impractical process-algebra language experiment. It was then 'watered down' to fit within the language constraints of C# at the time, but I think it failed in a lot of ways on both safety and usability - a compromise that took the worst of both worlds.)

I do think there is a set of patterns and practices for NETMF, though, that do suffice for your average not life-and-safety-critical applications and I think it would be useful to enumerate some of those concrete patterns here. I don’t have a particularly good set built up myself, but I am hopeful others do and can share some of those tricks.

How do you do telemetry (monitoring of performance and activity)?
How do you do root-cause-analysis of failures after the fact? What do you log and how?
How do you use the WDT?
What design techniques have you used for fail-safe operation if a node fails or a network partition occurs?

These are easy problems to describe and maddeningly complex ones to account for properly (esp the failsafe design issue).

@ Jay Jay - There are issues that are impractical to test for. I would argue that no matter how solid your test framework is, for any N or more networked nodes (where N is usually a pretty small number - 2 or 3), you need an impractically large amount of test time and fuzzing in order to force all the different timing scenarios, and it is arguably impossible to know how close you are to complete coverage. If you can’t test to certain correctness (due to combinatorial timing complexity), and you don’t prove correctness (through formal methods), then you better have a good recovery strategy because eventually you will see that failure in the field that you didn’t see in the lab.


To catch all exceptions and react to them, there is only one way:
Put a try/catch block around your main routine.
Put a try/catch block in every event handler (at least if the events are fired from outside your code).
Put a try/catch block in every thread procedure.
Then:
Never do nothing in a catch block, except when you are sure that any exception thrown by the code can be safely ignored.
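The thread-procedure rule could be sketched like this (class and method names are invented; the logging call is a stand-in for whatever logger you use):

```csharp
using System;
using System.Threading;
using Microsoft.SPOT;

// Every thread gets a top-level try/catch, so an unhandled exception
// cannot silently kill the thread without leaving a trace.
public static class Worker
{
    public static Thread Start()
    {
        var t = new Thread(ThreadProc);
        t.Start();
        return t;
    }

    private static void ThreadProc()
    {
        try
        {
            while (true)
            {
                DoWork();
                Thread.Sleep(500);
            }
        }
        catch (Exception ex)
        {
            // Never swallow silently: log, then decide whether to restart
            // the thread, reset the device, etc.
            Debug.Print("Worker thread died: " + ex);
        }
    }

    private static void DoWork() { /* ... */ }
}
```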

Logging:
I always create my own logging system.
The globally accessible log method looks like this:
public static void Log(Loglevel level, string module, string message);
level is some kind of priority like:
Error: something unexpected bad happened
Warning: something expected happened, which normally does not cause too much trouble
Message: Just for info
Bulk: Detailed info, you normally only need in special cases
Debug: Even more detailed info you normally need for debugging issues only
module is the source where the message comes from like “HW-Layer”, “UI”, …
message is the message.
I also usually have an overload of the log method like this:
public static void Log(Exception ex, string module, string message);
which logs the exception in the error level. I usually call it in most of my catch blocks.
By this you can quickly change where a log message should be written to:
to Debug.Print, to UART, to Network, to SD-Card.
You also can easily manage which log level should be written to which target.
Normally you do not log messages in Bulk and Debug level by default.
I once tried to write my log messages to SD card, but I failed because of write performance (on a G120-based device).
With a logger like this, I never use Debug.Print directly.

That is how I do things like that.
Using these techniques I already made a quite complex G120-based device with networking, FTP, digital I/O, lots of data allocations, … which runs more or less forever (at least I had it running for more than a week in a simulation environment, and the customer (where it runs 24/7) never complains about any issues).


@ Reinhard Ostermeier - I guess it's a stupid question, but do you have code to share on this, or a pointer to some?

Unfortunately it's commercial code that I do not own personally.
But to begin with, it's really just something like:

public enum LogLevel
{
  Error,
  Warning,
  Message,
  Bulk,
  Debug
}

public static class MyLogger
{
  public static LogLevel MaxLogLevel = LogLevel.Message;

  private static readonly string[] _levelId = { "E", "W", "M", "B", "D" };

  public static void Log(LogLevel level, string module, string message)
  {
    if((int)level > (int)MaxLogLevel)
      return;
    Debug.Print(String.Concat(_levelId[(int)level], "-", DateTime.Now.ToString(), "-", module, ": ", message));
    // add additional log targets here
  }

  public static void Log(Exception ex, string module, string message)
  {
    Log(LogLevel.Error, module, String.Concat(message, "\n", ex.ToString()));
  }
}

and use it like this:

public static class Program
{
  public static void Main()
  {
    try
    {
      MyLogger.Log(LogLevel.Message, "Main proc", "Program started");

      for(int n = 0; n < 1000; ++n)
      {
        MyLogger.Log(LogLevel.Bulk, "Main proc", String.Concat("Loop iteration #", n));
      }
    }
    catch(Exception ex)
    {
      MyLogger.Log(ex, "Main proc", "Unhandled exception");
    }
  }
}

In fact you can put as much optimization into the actual log method as you wish, like running a separate thread with a FIFO buffer for the messages, to write them to SD card without slowing down the actual execution too much.

P.S. Disclaimer:
The code above was typed directly into the reply window and has not even been tested to compile correctly.
Any damage, especially damage to living things, like death, lies in the responsibility of the user. 8)

For the performance issue on SD write, I suppose one could use a bounded queue in memory and write to the SD card lazily in the background. Lower priority messages can be candidates for dropping if the rate at which messages are created clogs memory faster than you can write them out. Dropped messages can be counted and aggregated later into a single message (“X verbose and Y debug messages were dropped in the last 5 minutes”). The downside is that in a sudden and catastrophic failure, your most important messages (the last few) may be lost.
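A sketch of that bounded-queue idea (all names invented; the SD-write call is a placeholder for real file I/O):

```csharp
using System.Collections;
using System.Threading;

// Bounded in-memory queue with a background writer thread. Low-priority
// messages are dropped (and counted) when the queue is full, so logging
// can never grow memory faster than the SD card drains it.
public class LazyLogWriter
{
    private const int MaxQueued = 64;            // illustrative bound
    private readonly Queue _queue = new Queue();
    private int _dropped;

    public LazyLogWriter()
    {
        new Thread(WriterLoop).Start();
    }

    public void Enqueue(string line, bool lowPriority)
    {
        lock (_queue)
        {
            if (_queue.Count >= MaxQueued)
            {
                if (lowPriority) { _dropped++; return; }   // drop and count
                _queue.Dequeue();                          // else evict oldest
            }
            _queue.Enqueue(line);
        }
    }

    private void WriterLoop()
    {
        while (true)
        {
            string line = null;
            lock (_queue)
            {
                if (_queue.Count > 0) line = (string)_queue.Dequeue();
                else if (_dropped > 0)
                {
                    // Aggregate dropped messages into a single summary line.
                    line = _dropped + " low-priority messages dropped";
                    _dropped = 0;
                }
            }
            if (line != null) WriteToSd(line);   // placeholder for real SD I/O
            else Thread.Sleep(250);
        }
    }

    private void WriteToSd(string line) { /* append to log file */ }
}
```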

[Edit: yeah, what he said]

@ mcalsyn - The problem with lazy writes to SD card is not only that you might lose messages, but also that in NETMF the file system gets corrupted sometimes (often).
Most of the time this results in not being able to create new files in the folder, or at least the new files are invisible. I assume that the directory table gets corrupted.
The only solution to this is deleting the whole directory or, worst case, reformatting the SD card.

I’m not even shooting for provably correct software. Although when I took a driverless metro last week in Paris it was nice to know that its critical software was formally verified, http://www.prover.com/company/press/view/?id=47 .

And unfortunately you are correct, some of the cool really powerful approaches are too far off the mainstream and common skill set.

But I still have left some residual hope for modular design and for the “keep it simple” discipline.


@ Reinhard Ostermeier - I haven’t experienced that (because I have not worked with NETMF SD support much) and I don’t know what the underlying firmware code looks like, but I do know that there are mechanisms for avoiding that sort of outcome. The most hardened approach would probably be to create a single file one time and maintain your own always-consistent ring buffer within that file, perhaps even ensuring that it always falls at a known location on the card so that even if the filesystem indexing fails, you can still access its contents.
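The single-file ring buffer could be sketched like this (record sizes and names are illustrative; after setup, the file is only overwritten in place, so the directory table is never touched again):

```csharp
using System;
using System.IO;
using System.Text;

// Ring-buffer log in one preallocated file: fixed-size records and a write
// cursor that wraps. No files are created or deleted after setup.
public class RingLog
{
    private const int RecordSize = 64;
    private const int RecordCount = 1024;        // 64 KB total, illustrative
    private readonly FileStream _file;
    private int _cursor;                         // next record index

    public RingLog(string path)
    {
        _file = new FileStream(path, FileMode.OpenOrCreate, FileAccess.ReadWrite);
        if (_file.Length < RecordSize * RecordCount)
            _file.SetLength(RecordSize * RecordCount);   // preallocate once
    }

    public void Write(string message)
    {
        var record = new byte[RecordSize];               // zero-padded record
        byte[] bytes = Encoding.UTF8.GetBytes(message);
        int n = bytes.Length < RecordSize ? bytes.Length : RecordSize;
        Array.Copy(bytes, record, n);

        _file.Seek(_cursor * RecordSize, SeekOrigin.Begin);
        _file.Write(record, 0, RecordSize);
        _file.Flush();                                   // commit immediately

        _cursor = (_cursor + 1) % RecordCount;           // wrap around
    }
}
```

On recovery, the newest record can be found by scanning for the highest timestamp, which is why date-stamping every record (as suggested earlier in the thread) pays off.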