EMX Watchdog Query

Hi All,

I have an application running on an EMX module with the watchdog enabled. Now and again, when looking at my log files, I notice that the watchdog has kicked in and reset the device. This happens at one of our customer sites where we have quite a few of the devices installed. Not every device has restarted via the watchdog (they all run the same application), and they appear to run for varying lengths of time before the watchdog kicks in (a few weeks, 2-3 months). So I am trying to identify what the issue could be.

The watchdog runs on its own processing thread and is set with quite a high time-out of 2 minutes. I have a few questions that maybe someone can answer:

  1. On searching the forum, someone mentioned that the ChipworkX watchdog maximum time-out was actually 15 seconds. Is there a maximum time-out for the watchdog on the EMX module, i.e. I set it for 2 minutes, but is it actually ignoring this? (Note this is set at the very beginning of the application; I gather that once the time-out has been set for the first time it can’t be changed.)

  2. Has anybody come across any functionality or bug that can lock up all other processing threads? For example, during my initial development I discovered that if I unplugged the Ethernet cable while the application was making outbound TCP connections, the hanging TCP socket would block all other processing threads and the watchdog would eventually kick in. I fixed this by ensuring the application only listens for incoming TCP connections and makes no outbound ones (a minimal sketch of that listener pattern follows below).
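For reference, here is a rough sketch of the listen-only pattern I ended up with (using System.Net and System.Net.Sockets). The port, backlog and buffer size are placeholders rather than the real application values, and error handling is trimmed down:

            // Minimal sketch of the listen-only pattern; port, backlog and buffer size are placeholders.
            Socket listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
            listener.Bind(new IPEndPoint(IPAddress.Any, 5000));
            listener.Listen(1);

            while (true)
            {
                // Accept() blocks only this thread; the remote end opens the connection,
                // so there is no outbound Connect() call that can hang when the cable is pulled.
                Socket client = listener.Accept();
                try
                {
                    byte[] buffer = new byte[512];
                    int read = client.Receive(buffer);
                    // ... handle the request ...
                }
                finally
                {
                    client.Close();
                }
            }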

I have also noticed that during some routines the application appears to run slowly when executing other routines. For example, one processing thread periodically initiates an SPI data transfer, and during this period other functionality appears to run slowly, as if the SPI transfer thread is blocking the rest of the threads.

I understand that threads are managed via a time-slicing scheduler. Have any bugs been reported in this area (Googling didn’t seem to bring up anything)?

Finally, the application is currently running on framework 4.1. I am holding off moving to 4.2 until the SystemUpdate functionality has been added back. Plus, I gather from reading the forum that there is an issue with UDP on EMX in 4.2.

Sorry for the long post. Thanks in advance!

Using netmf 4.1 or 4.2?

This got lost in the long post :slight_smile:

You are probably seeing the effect of SPI taking processing power to complete its actions. It should not block the thread, but think of threads in this situation as a call stack of instructions from different sections of code.

There was a checksum issue with UDP; however, this has been fixed and will be available in the upcoming release.

Do you have any code snippets where you at least suspect that the lockout could occur?

Calling any native method, SPI or other, will block the entire system until it is completed. This is why NETMF is not real time.

But you have the timeout set to 2 minutes, so I do not see how SPI would take 2 minutes to send data! My guess is that there is an uncaught exception that terminates your app, and the watchdog then fires 2 minutes after.

I ran a test with Micro Framework 4.2 and can assure you that the watchdog on EMX does allow for at least a 2-minute reset counter. The test set the watchdog time-out to 121 seconds and reset the counter every 120 seconds, and it worked as expected.

Thanks for all the replies.

I will look into any activities that could be taking longer than two minutes to complete. The SPI transfer does involve large files, but, as you say, it has never taken anywhere near 2 minutes to complete.

As I mentioned earlier, I switched my TCP socket to be a listener rather than attempting connections. One question I do have: once a TCP connection has been made, if a send or receive involves a large number of bytes, does that itself lock up the entire system until all the data has been sent or received?

I don’t think the application is crashing out, as all the processing threads and the starting application thread have try/catches within their processing loops - though maybe there is a bug in something that ignores the try/catch and drops out regardless.
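For what it’s worth, the shape of each processing thread is roughly the following; WorkerBody() is just a placeholder for the real work that thread does:

            // Rough shape of each processing thread's loop; WorkerBody() is a placeholder.
            while (true)
            {
                try
                {
                    WorkerBody();
                }
                catch (Exception ex)
                {
                    // Log and carry on so a single failure does not kill the thread.
                    Debug.Print("Worker exception: " + ex.Message);
                }

                Thread.Sleep(100);   // placeholder idle time between iterations
            }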

With regards to moving to 4.2: will SystemUpdate be put back in within the next release?

@ Chris_MC,

Can you provide any sample code that we can test to understand the potential issue that you are experiencing with the TCP stack and watchdog?

Hi Aron, I will see if I can cut down my code and replicate the issue with the TCP.

The original change from the EMX making TCP connections to listening for TCP connections came about after I noticed some hanging problems myself when I unplugged the Ethernet cable. Then, after doing a little internet digging, I was led to these:

Socket.Connect blocks all threads? - Netduino Plus 2 (and Netduino Plus 1) - Netduino Forums and http://netmf.codeplex.com/workitem/950

Obviously the first discussion relates to a Netduino, but the issue itself points to the Micro Framework. It may all be solved when I can move to 4.2, but I will keep you updated if I discover more. Thanks again for the responses.

Maybe off topic, but what is the point of the watchdog? If I understand correctly, this solution is at the hardware level, so it’s code independent, and if the whole system freezes it restarts the device? Where is the catch, then - why isn’t there by default in ProgramStarted some thread that enables the watchdog and periodically resets the time-out?

@ tvinko - the watchdog is built into the CPU. You must “poke” the watchdog within the specified interval, otherwise a CPU reset will occur. For example, the watchdog is set for 30 seconds and your “alive” thread pokes the watchdog every 10 seconds. If it goes 31 seconds without getting poked, the CPU resets.
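A minimal sketch of that pattern, using the GHI Premium watchdog class that shows up in the snippets further down. The numbers are just the example values above, and I’m assuming ResetCounter() is the “poke” call - check the GHI documentation for the exact method name:

            // Sketch only: Enable() appears in the snippets below; ResetCounter() is assumed
            // to be the call that feeds ("pokes") the watchdog - check the GHI docs.
            GHI.Premium.Hardware.LowLevel.Watchdog.Enable(30000);              // 30 second time-out

            new Thread(delegate()
            {
                while (true)
                {
                    GHI.Premium.Hardware.LowLevel.Watchdog.ResetCounter();     // poke
                    Thread.Sleep(10000);                                        // every 10 seconds
                }
            }).Start();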

Remember, if there is an uncaught exception in your code, ALL running threads will die, including the thread “poking” your watchdog.

@ High-Speed Dan - this seems to me all the more reason that the watchdog should be enabled by default. Like you said, if there is an uncaught exception all threads die, and that is a clear sign that the board must be reset.

… because defaulting to restarting something because of a bug in your code isn’t cool. Yes, it makes the software look better, but restarts aren’t everything. Isn’t it better to know you’re causing an exception for some reason and deal with that than to be oblivious to it? Also, this could be very bad behaviour in many apps. Think about a machinery control system that moves a component to a location at start-up; that could be quite dangerous.

We use the watchdog combined with a “safe mode” execution that runs diagnostics and checks for hotfixes.
I believe this is good practice for real-world watchdog handling.


            //************************************************************
            // you can read this flag ***ONLY ONCE*** on power up     ****
            //************************************************************
            if (GHI.Premium.Hardware.LowLevel.Watchdog.LastResetCause == GHI.Premium.Hardware.LowLevel.Watchdog.ResetCause.WatchdogReset)
            {
                DebugOut("Watchdog did Reset");
                BootMode = 3;
            }
            else
            {
                DebugOut("Reset switch or system power");
                BootMode = 0;
            }

I agree @ Brett, I just figured that was outside the scope of this conversation. Also @ tvinko, don’t leave your watchdog enabled with a debugger attached :slight_smile:


Yes indeed, you don’t want the watchdog to kick in when you are stepping through your code…
I keep this block commented out during development:


            
            //************************************************************
            //**** Start the watchdog (must be the last step after init) ****
            //************************************************************
            // TODO !!! re-enable !!! THIS IS ONLY COMMENTED OUT TO ALLOW DEBUGGING
            //GHI.Premium.Hardware.LowLevel.Watchdog.Enable(10000);
            //DebugOut("Watchdog counter set");

@ Brett - I agree. But in my case the watchdog would be ideal. I have socket communication with a server application, and when the server is down, after some failed tries to contact the server, the board hangs.

@ tvinko - that’s actually why I have the watchdog enabled myself :slight_smile:


@ High-Speed Dan - if I had known about this watchdog before, it would have been the easier solution. For now I count failed requests and restart the board from my code after 100 failures. I’ll change this ugly solution of mine straight away :slight_smile:

Sorry, but if a timeout is something you have code to handle, then I personally think it’s better to handle it in your code than to rely on a watchdog. The watchdog is really good at dealing with things you can’t handle - where your code blocks and never returns from a call, or crashes unexpectedly - which in your scenario doesn’t seem to be the case: if you can count the failures, you’re obviously handling them appropriately. Another thing in your scenario: why does restarting the app help if the server is down? That just seems like a logic condition that, again, you can handle in your app.

Not every nail needs to be hit with a hammer shaped like a watchdog :slight_smile: