Modules completely crashing when connected to network

LucaP · March 8, 2021, 4:01pm

Hi, I am experiencing a very frustrating and very hard to reproduce problem with some modules I have in the field.

Whenever the modules are connected to the network, they crash without fail after anywhere between 1 and 120 minutes. They don’t come back until the power has been removed completely.

I think that the bootloader/tinyclr or something crashes because I get this message in the debugger:
TinyCLR application: Managed' has exited with code 0 (0x0).
The module also stops responding to debug requests or requests from TinyCLR config, until it has been reset by removing the power completely.

I am using an ENC28J60 connected to an SC20100 running TinyCLR 2.1 preview 3.

Is there a way to enable some sort of verbose mode so I can see what is happening before the software crashes? I’m trying to make this easier to reproduce so I can share that…

While I work on that, is there anything that comes to mind that I can try to fix this with? Should it even be possible for a managed operating system like this to crash so severely as it’s doing here?

What I’ve tried so far:

Removing the network cable from the device, but leaving the network code intact. This way, the module has run for weeks without problems.
Watching the memory usage. After a GC, it stays constant the whole time.
Disabling all network activity. Even when the NetworkController is only initialized and a cable is plugged in, the problem still occurs.

Other things that are running on my module:

A CAN bus interface
A USB Host interface (usb mass storage)

My code for using the ENC28J60 is the same as the sample code in the docs; I just changed my IP settings and some pin numbers.

Mike · March 8, 2021, 5:04pm

Do you have 100% coverage for exception handling?

LucaP · March 8, 2021, 5:20pm

I can’t 100% guarantee that, but whatever is throwing an exception right now is beyond my control (I think).

I’m not doing anything with the network; I just have a link to a switch…

Please correct me if I’m wrong

Gus_Issa · March 8, 2021, 5:27pm

One option is to switch to serial debugging (MOD pin) and connect a serial terminal to your device. Then you may see more info that will help in finding the issue.

sgtyar95 · March 8, 2021, 5:38pm

This is something that I am going to have to change under the hood in my own projects, but I have been told that instead of handling exceptions at the source, it can be better to rethrow them and catch them at the top level, if you can do so without completely breaking your code.

Another thing I’ve been doing is running my main loop in a new thread and monitoring the thread’s status so I can run code when it aborts. I’m still working on the details of that one: it’s hard to get data back from the thread if it dies, but it allows me to actually do stuff on exit, or even just try to restart the thread.
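The supervised-thread idea described above can be sketched roughly as follows. This is a language-neutral illustration written in Python, not TinyCLR code: the names `supervise`, `work`, and `on_exit` are hypothetical, and the restart policy (a fixed number of retries) is an assumption. It also only helps with managed exceptions; it cannot recover from a hang below the managed runtime.

```python
import threading
import time


def supervise(work, on_exit, max_restarts=3, restart_delay=0.0):
    """Run work() in a worker thread and restart it when it dies.

    work         -- the main-loop function (hypothetical name)
    on_exit      -- callback invoked with the exception that ended the thread
    max_restarts -- how many times to restart after a failure (assumed policy)
    Returns the list of exceptions that were caught at the top level.
    """
    failures = []

    def runner(result):
        # Top-level catch: anything not handled deeper in work() lands here,
        # so the error is handed back instead of silently killing the thread.
        try:
            work()
        except Exception as exc:
            result["error"] = exc

    for _ in range(max_restarts + 1):
        result = {}
        worker = threading.Thread(target=runner, args=(result,), name="main-loop")
        worker.start()
        worker.join()  # the supervisor waits here until the worker ends

        error = result.get("error")
        if error is None:
            return failures  # work() finished normally, nothing to restart

        failures.append(error)
        on_exit(error)  # log, flush state, blink an LED, ...
        time.sleep(restart_delay)

    return failures
```

Passing the exception out through a shared `result` dictionary is one way around the "hard to get data back" problem: the worker records why it died before it exits, and the supervisor reads it after `join()` returns.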

LucaP · March 8, 2021, 5:44pm

I’ll see if I can solder tiny wires to my board then…!

Or I’ll have to see if I can reproduce this on my development board. Does that one support the serial debug port?

Do you have any clue what this could be?
Could this be fixed with the upcoming network rewrite code? /** When will that be released? **/

mcalsyn · March 8, 2021, 6:39pm

My understanding is that the ENC28J60 isn’t robust on busy networks with a lot of physical-layer collisions. That may explain why you are seeing it only fail in the field, and may be something you can simulate locally in order to force a crash in your development environment.

All the SIT Dev boards I know of support the serial debug port, but depending on what rev you have it may have moved from COM1 to COM5 or vice versa and the silk-screen may not be right.

LucaP · March 8, 2021, 7:00pm

I have ENC28J60s on the same network, connected to G120s running NETMF, that have no such problems.

What would you recommend to simulate a busier network so I can make my module crash?

mcalsyn · March 8, 2021, 7:21pm

It might not be an issue if the firmware is using the fix referred to here : ENC28J60 frame collisions causing hang / lockup | Microchip. GHI - is this update in place? The result of this bug is that the network interface will wedge and perhaps cause an infinite loop until rebooted. My understanding is that it is a bug in the ENC that needs to be worked around in the TCP stack code.

I suspect that you could simulate frame collisions with a managed switch and a lot of traffic and confirm that you are getting frame collisions by monitoring network stats. Full disclosure: I have not done that myself, but I do know this bug is out there in the wild which is why I mention it. I only mention this in case you are seeing things work in the lab and fail in the field with low-level network errors.

Some managed switches also have SNMP counters for frame collisions and you could check if you are getting a high rate.

What makes me suspect that it might be something like this is the fact that you see crashes just by initializing the interface (activating the ENC and its interrupts and such), but not using it.

You could also perhaps put the failing units behind a switch to try to limit the traffic they see and see if the problem goes away.

LucaP · March 8, 2021, 8:13pm

@Dat_Tran @Gus_Issa can you confirm you have this fix in place?

I’m sorry if I’m pushing, but I need to fix this rather soon, as my modules are on a ship that is bound for France in a couple of weeks. I don’t feel like having to go there anytime…

Is there anything more that I can do to help?

Dat_Tran · March 8, 2021, 8:28pm

It is too late, since that was only 4 hours ago.

But we did make some changes to the network code; hope that helps.

Or give us a small project that we can use to reproduce it.

Thanks.

LucaP · March 8, 2021, 8:50pm

I understand it’s too late for preview 4. But do you have the fix described in the forum post in the TinyCLR software?

I will try with preview 4 tomorrow. I’m not sure if a sample project will help, because this problem is very hard to reproduce and can take hours to occur. It even seems like it won’t occur at all on a non-busy network.

Dat_Tran · March 8, 2021, 8:58pm

Nightmare is coming :d

Not yet, but we can take a look.

LucaP · March 9, 2021, 2:30pm

I’ve managed to get this problem to be reproducible on my desk. It does seem to be a packet collision of some sort, because it only happens on busy networks.

The way I reproduced this problem is by connecting my module to my company’s network, which ensures there is at least some noise (like ARPs) on the network.

On top of that, I am using a small Java tool to ping the module extremely fast. I have this tool open four times, which shortens the time it takes for the problem to occur because it generates more traffic.

I have been able to reproduce this using the SC20100 dev board with a mikroe ENC28J60 click running preview 4 of TinyCLR. This means you should be able to reproduce this problem at GHI.

It seems to have something to do with USB Host. When I don’t enable the USB host in my code, the network does stop working after a few minutes, but the rest of the program (in this case the led blink) keeps on working after the network crashes. If I enable USB host, the entire system crashes.

If you want to reproduce this, use this tool. I have used this command to run my pings:
fastping -n:100000 -l:32 -ms:0 -i:2 192.168.31.1
You can change the IP to whatever you want; I would recommend leaving the rest of the settings intact.

It’s important to have at least two cmd windows with this tool open at the same time. Since each instance waits for a reply before sending the next ping, there is zero chance of a collision if you only have one window. As I said, I use 4 windows at a time to get the error after only a couple of minutes, sometimes even faster. Since your circumstances might be different, I would recommend trying it for at least an hour before drawing any conclusions, because it has taken upwards of two hours for me to have this problem occur.
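Instead of opening the windows by hand, the parallel instances could be started from a small script. The following is only a sketch: it assumes the `fastping` executable is on the PATH, reuses the arguments from the command above verbatim, and the helper names `build_commands` and `run_all` are made up for this example.

```python
import subprocess

# Arguments copied verbatim from the fastping command in the post above.
FASTPING_ARGS = ["-n:100000", "-l:32", "-ms:0", "-i:2"]


def build_commands(target_ip, instances=4, exe="fastping"):
    """Return one argv list per parallel fastping instance.

    A single instance waits for each reply before sending the next ping,
    so at least two are needed for pings to overlap on the wire.
    `exe` is assumed to be resolvable via PATH.
    """
    if instances < 2:
        raise ValueError("need at least two instances to get overlapping pings")
    return [[exe, *FASTPING_ARGS, target_ip] for _ in range(instances)]


def run_all(commands):
    """Start every command at once, then wait for all of them to finish."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]  # exit codes, in start order
```

Calling `run_all(build_commands("192.168.31.1"))` starts four instances against the module's address used above; all instances share one console, so their output is interleaved.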

I have attached a sample project that will work on the SC20100 dev board. This code doesn’t do anything extra on the network; it just connects to the network and responds to the pings. When you run this program and point the four fastpings at it, you will find that the pings drop after a few minutes (could be longer depending on circumstances). At the same time, the PB0 LED on the dev board will stop blinking, indicating the board has crashed. If you had your debugger attached, you would see the message
TinyCLR application: Managed' has exited with code 0 (0x0).

I hope this is enough info for you to reproduce and hopefully fix this problem with the upcoming rc1 release. Should I create a GitHub issue for the matter?

I want to give a special thanks to @mcalsyn for pointing me in the right direction, I don’t think I would have ever found this without your help

LucaP · March 15, 2021, 9:50am

Is there any update regarding this problem? Are you able to reproduce the problem?

Dat_Tran · March 15, 2021, 1:20pm

Not yet. We are busy with other things, but will be back on the preview soon.

Dat_Tran · April 14, 2021, 5:16pm

Can you please share the sample project again? It has expired.

Thanks

LucaP · April 14, 2021, 6:11pm

I’ll do this when I get back to my computer tomorrow. You’ll have it when you wake up!

Although, @Gus_Issa seems to have a weird sleep schedule sometimes.

Dat_Tran · April 14, 2021, 6:18pm

Another question: we tried with 6 cmd windows running your fastping app with your command for about 20 minutes and did not find any issue.

We wrote the app as you suggested (ENC-SC20100 Dev). It just connects to the network and responds to the pings, with another thread blinking an LED.

Any other suggestions to reproduce it? After a few minutes, the app stops with the screen below:

LucaP · April 14, 2021, 6:24pm

Are you connecting the module directly to your pc?

It’s important to have it on a switch that has more traffic than just this module running.