G400 freezes for more than 15 min

xtomas22 · October 22, 2016, 1:24pm

Hi everyone,

I have been using the G400 for over a year now, but it seems to be glitchy with the latest 2016 R1 firmware.
After several days the G400 freezes for more than 15 minutes, but then wakes up / restarts automatically.
It seems to happen when I send too many ETH messages.
The most glitchy board is one that reads 24 inputs and sends an ETH message on every input change (roughly every 500 ms).
I read the inputs from shift registers using SPI in a loop, and there is some logic for outputs as well. When the board freezes, SPI is not working either, so it is nothing like an ETH connection error; the board is just brain dead, but only for 15 min.

Debugging is a problem because of the 14-day error period… :wall:

Does anybody have a similar problem?
Any idea would help a lot, thanks.

Eddie, Czech Republic

Gus_Issa · October 22, 2016, 4:19pm

I do not recall any changes made in the latest G400 firmware, so that is probably not the problem. Still, you can load the older firmware and see if it helps narrow down the problem. A 14-day error period is going to be tough. Maybe speed things up to make the error happen faster.

Mike · October 22, 2016, 10:50pm

Could be a garbage collection problem. Since the G400 has a lot of memory, collection could be delayed for 14 days. If a lot of objects were allocated and freed, and the structure and inter-object relationships are complex, then garbage collection could take a very long time.

Solution might be to reuse objects rather than freeing and reallocating.

xtomas22 · October 23, 2016, 4:53am

Thanks Gus and Mike.

Actually, I reuse all my objects and arrays. Well, maybe not all of them; I may have missed something.
How does the GC actually decide when to start? I didn't invoke the GC manually.
Is it even possible for a GC run to take 15 min?

Now I am generating faster changes on the inputs, and the first freeze occurred after 14 h, but unfortunately without the debugger attached.
Now I am waiting for another freeze in debug mode.

mhstr · October 23, 2016, 11:56am

@ xtomas22 - Maybe try invoking the GC manually from your program after it has been running for some time, ideally at a point where you are sure the program has gone through all of its code paths.
You can add a flag that is checked on every program iteration. Then you can set this flag from the debugger whenever you want and run the GC manually based on it.

Mike · October 23, 2016, 12:54pm

This is a great idea. If you force GC often, then it will run very fast. It is an easy thing to try. Create a thread, and do GC every minute. If the problem goes away, then you know you have a memory issue.

Something like concatenation of strings can generate a lot of garbage.

xtomas22 · October 23, 2016, 1:01pm

Last GC after about 20 h in DEBUG mode, no freeze yet, so I'm not sure whether the GC is causing these problems…


GC: 270msec 750276 bytes used, 66355488 bytes available
Type 0F (STRING              ):    264 bytes
Type 11 (CLASS               ):  11940 bytes
Type 12 (VALUETYPE           ):    240 bytes
Type 13 (SZARRAY             ): 330900 bytes
  Type 01 (BOOLEAN             ):     24 bytes
  Type 03 (U1                  ): 215856 bytes
  Type 04 (CHAR                ):    492 bytes
  Type 07 (I4                  ):    300 bytes
  Type 0B (U8                  ):  61848 bytes
  Type 11 (CLASS               ):  52380 bytes
Type 15 (FREEBLOCK           ): 66355488 bytes
Type 16 (CACHEDBLOCK         ):   1260 bytes
Type 17 (ASSEMBLY            ):  38076 bytes
Type 18 (WEAKCLASS           ):     48 bytes
Type 19 (REFLECTION          ):    192 bytes
Type 1B (DELEGATE_HEAD       ):    576 bytes
Type 1C (DELEGATELIST_HEAD   ):     48 bytes
Type 1D (OBJECT_TO_EVENT     ):    288 bytes
Type 1E (BINARY_BLOB_HEAD    ): 358044 bytes
Type 1F (THREAD              ):   1920 bytes
Type 20 (SUBTHREAD           ):    240 bytes
Type 21 (STACK_FRAME         ):   1980 bytes
Type 26 (WAIT_FOR_OBJECT_HEAD):     48 bytes
Type 27 (FINALIZER_HEAD      ):    504 bytes
Type 31 (IO_PORT             ):    216 bytes
Type 34 (APPDOMAIN_HEAD      ):     72 bytes
Type 36 (APPDOMAIN_ASSEMBLY  ):   3420 bytes

xtomas22 · October 23, 2016, 1:40pm

@ Mike - I have been working on my code for 2 years now and it is quite optimized. All serialization and deserialization of messages on UART and ETH runs in RLP and uses arrays shared between C# and RLP; all operations are locked and thread safe. I don't see any possible way that something consumes memory or causes an error of any kind.

I'm using my Core library on 8 different types of NETMF boards and in a PC server app. Everything is well tested. This latest glitchy board reads fast-changing inputs and sends a bit mask on every input change. The first occurrence of this problem was with the latest SDK 2016 R1, but this board is new and has never run with older firmware, so I can't be sure.

I have this problem with other boards too, but they have much less communication, so the error occurs after a longer period.

I don't have much data because the last deployment was 1 month ago and in a different country. I have to collect more data.

If I find nothing and manual GC doesn't help, I will have to downgrade the firmware and try again…

Mike · October 23, 2016, 3:29pm

@ xtomas22 - I would install the prior firmware and see if the problem persists.

If the problem goes away, then it might indicate a firmware problem, or a change in the firmware that exposes a problem in your code.

If the problem does not go away, then I suspect there is still an unhandled issue in your code.

My initial thought is there could be a race condition that appears more frequently under load. You said you have other systems, with lower loads, that experience the problem at longer intervals. This would tend to support a race condition.

Are any of your other systems G400s? The greater speed could expose a race condition.

I just had a .NET system I developed back in 2001/2 fail. It had not experienced a failure in at least ten years, which is good since I don’t have control of some of the data it receives and processes. The failure was due to the introduction of a new USB to serial cable. The cable was bad and/or low quality, and resulted in an error condition which I was handling incorrectly. I was confident that my program was solid. Hubris has its cost.

Brett · October 23, 2016, 4:09pm

Are you 100% sure that the device goes unresponsive for 15 minutes (I assume that’s what your fail-state is), or does something else happen, like does the device restart to restore service? Do you have enough logging to understand that kind of level of info?

xtomas22 · October 23, 2016, 5:32pm

@ Brett - I'm sure that the device goes unresponsive for 15 min. No ETH communication, no SPI ReadWrite operations. I don't have any more info, but I am working on it. Now I have my test device under heavy load and in DEBUG mode, and I am waiting.

All code catches and handles exceptions and stores them in a Queue(10). An async thread tests the ETH connection, sends these exceptions to the server and stores them in an SQLite database once the device is connected. But nothing arrives, so I assume that the device goes unresponsive and restarts after 15 min, because no exception is sent from the queue. Another possibility is that no exception is thrown and the HAL freezes???
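
A minimal sketch of a bounded queue like the Queue(10) described above, written in C for illustration. The names (`err_queue_t`, `err_queue_push`, `err_queue_pop`) are hypothetical, and the overflow policy (overwrite the oldest entry when full) is an assumption; the post does not say what the real code does when more than 10 exceptions are pending.

```c
#include <stddef.h>

#define ERR_QUEUE_CAPACITY 10u  /* matches the Queue(10) mentioned in the post */

/* Hypothetical fixed-size ring buffer of error codes. */
typedef struct {
    int    items[ERR_QUEUE_CAPACITY];
    size_t head;   /* index of the oldest stored entry */
    size_t count;  /* number of stored entries */
} err_queue_t;

void err_queue_init(err_queue_t *q)
{
    q->head = 0;
    q->count = 0;
}

/* Store a code; when the queue is full the oldest entry is overwritten
   (assumed policy, not confirmed by the thread). */
void err_queue_push(err_queue_t *q, int code)
{
    size_t tail = (q->head + q->count) % ERR_QUEUE_CAPACITY;
    q->items[tail] = code;
    if (q->count < ERR_QUEUE_CAPACITY) {
        q->count++;
    } else {
        q->head = (q->head + 1u) % ERR_QUEUE_CAPACITY;
    }
}

/* Remove the oldest code into *code; returns 1 on success, 0 if empty. */
int err_queue_pop(err_queue_t *q, int *code)
{
    if (q->count == 0) {
        return 0;
    }
    *code = q->items[q->head];
    q->head = (q->head + 1u) % ERR_QUEUE_CAPACITY;
    q->count--;
    return 1;
}
```

With a queue like this, a board that hangs completely would leave the queue untouched, which is consistent with nothing arriving at the server after the freeze.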

I have trouble reproducing this error under DEBUG… but it could be random… GC + something… I don't know…

xtomas22 · October 23, 2016, 5:43pm

@ Mike - All my devices are based on G400s. Yes, I had similar problems with RS-485 messages, because some characters arrived incorrectly. But this implementation is really simple: SPI.ReadWrite => SignalMask != SignalMaskNew => ETH.Send(SignalMaskNew).
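
The read / compare / send flow above can be sketched roughly as follows. This is plain C for illustration only, not the actual NETMF code: `poll_inputs`, `send_fn` and `INPUT_MASK_24` are made-up names, and the 24-bit mask is taken from the "24 inputs" mentioned in the first post.

```c
#include <stdint.h>

#define INPUT_MASK_24 0x00FFFFFFu  /* 24 inputs, one bit each (assumption) */

/* Hypothetical callback standing in for ETH.Send(SignalMaskNew). */
typedef void (*send_fn)(uint32_t mask);

/* Compare the freshly read mask with the stored one and send only on a change.
   Returns 1 when a change was detected, 0 otherwise. */
int poll_inputs(uint32_t *signal_mask, uint32_t signal_mask_new, send_fn send)
{
    signal_mask_new &= INPUT_MASK_24;      /* ignore bits above the 24 inputs */
    if (signal_mask_new == *signal_mask) {
        return 0;                          /* nothing changed, nothing to send */
    }
    *signal_mask = signal_mask_new;        /* remember the new state */
    if (send != 0) {
        send(signal_mask_new);             /* one message per input change */
    }
    return 1;
}
```

In a loop, `signal_mask_new` would come from the SPI shift-register read, so a stalled SPI transfer would also stop all ETH traffic from this path.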

Now my heavy-load test generates a signal change every 5 ms. After several hours nothing has happened. Until I manage to reproduce this error in DEBUG mode, I'm lost…

xtomas22 · October 25, 2016, 2:35am

Recent findings:

The error occurs after a long time, about 14 days. I put the G400 under heavy load, but the error does not occur. Only once did the error occur after 10 h, but that was just a coincidence, since I started the tests on electronics that were already running.

My tests:

1) Sending a signal change to the input every 5 ms
2) Calling ArrayList.Add every 1 ms to overflow memory
2.1) I leave the electronics in this state
2.2) Calling GC after the Exception

All tests passed, because the electronics continuously kept servicing the peripherals (SPI, ETH).

It looks as if the error occurred directly on the peripheral and the HAL failed to clean it up.
I'm still confused about the recovery time of 16 min 55 s, but this time is always the same.

Any idea?

P.S.: All my boards are based on the G400 and all of them have ETH, but some use SPI and some use UART…
Could it be the lwIP network stack problem I found here?
https://blogs.msdn.microsoft.com/netmfteam/2015/10/20/net-micro-framework-4-4-is-now-available/

Gus_Issa · October 25, 2016, 9:45am

I want to setup a G400 with a simple test that runs for 14 days but then what if it froze for 15 minutes while I was away from it! This is an interesting issue to solve.

With your setup, can you run something simpler, maybe without networking, to see if it will still behave the same after 14 days?! If you have a simple setup, we can do the same on our end.

xtomas22 · October 25, 2016, 10:54am

My logger runs on the server. If there is no communication between the server and the G400 for longer than 10 s, the server logs an OFFLINE message (or an ONLINE message after the first communication). That's how I know it's happening.
Our client reported that control signals from the machine reach the G400 board inputs, but there is no output activity for 15 min. That's how I know that the G400 goes unresponsive. Now I just track these 15 min OFFLINE → ONLINE windows in the log, and I can see that this is happening even on boards with ETH+UART, not only on this ETH+SPI board…
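
A rough sketch of the OFFLINE / ONLINE logging rule described above, in C for illustration. Only the 10 s threshold comes from the post; the type and function names (`link_monitor_t`, `link_on_message`, `link_on_tick`) are hypothetical and this is not the actual server code.

```c
#include <stdint.h>

#define OFFLINE_TIMEOUT_MS 10000u  /* 10 s without communication, per the post */

typedef enum {
    LINK_NO_CHANGE,
    LINK_WENT_ONLINE,   /* log an ONLINE message */
    LINK_WENT_OFFLINE   /* log an OFFLINE message */
} link_event_t;

typedef struct {
    int      online;      /* 1 = currently ONLINE, 0 = OFFLINE */
    uint32_t last_rx_ms;  /* time of the last message from the board */
} link_monitor_t;

void link_init(link_monitor_t *m, uint32_t now_ms)
{
    m->online = 0;
    m->last_rx_ms = now_ms;
}

/* Call whenever a message from the board arrives. */
link_event_t link_on_message(link_monitor_t *m, uint32_t now_ms)
{
    m->last_rx_ms = now_ms;
    if (!m->online) {
        m->online = 1;
        return LINK_WENT_ONLINE;
    }
    return LINK_NO_CHANGE;
}

/* Call periodically; reports OFFLINE once the silence exceeds the timeout. */
link_event_t link_on_tick(link_monitor_t *m, uint32_t now_ms)
{
    if (m->online && (uint32_t)(now_ms - m->last_rx_ms) > OFFLINE_TIMEOUT_MS) {
        m->online = 0;
        return LINK_WENT_OFFLINE;
    }
    return LINK_NO_CHANGE;
}
```

The length of a freeze is then the time between a `LINK_WENT_OFFLINE` event and the next `LINK_WENT_ONLINE` event, minus the 10 s detection delay.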

Now we are soldering more boards for parallel testing of these scenarios:

1) Original code - SDK 2016
2) Original code with manual GC - SDK 2016
3) Original code - SDK 2015

FOR YOU: Our client can disconnect this ETH+SPI board from the network. SPI will keep running, but without ETH activity. Then I won't be able to find OFFLINE → ONLINE messages in the log anymore, but the client will notice the error immediately, because it causes production problems… (no control signals)

After I find anything useful, I will report back.

I use UDP for Ethernet communication.

xtomas22 · November 15, 2016, 3:53pm

Test results (numbering follows the scenario list above):

ad 1) Freeze after 14 days
ad 2) Freeze after 8 days
ad 3) Not tested
new 4) SDK 2016 without ETH cable (no ETH and no RLP serialization) - Running without any kind of problem…

It looks like an ETH problem or a problem in my simple RLP serializer.
I wrote 2 simple RLP tests:

1) Division by zero error (RLP returned to NETMF without any error???)
2) Infinite loop (RLP stopped the NETMF code)

But if it is a problem in my RLP library, how is it possible that after 17 min the board restarts, or just forces the RLP task to exit, and then works fine?
In my tests the board freezes forever when I call the RLP infinite-loop method…

Any ideas? Any knowledge about NETMF timeouts (RLP, ETH, SPI)?
What if there is an SPI.Read timeout of 1,000,000 ms => 16.7 min (because this timeout is in milliseconds instead of microseconds)?
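
As a quick check of the arithmetic behind that guess, here is a small C sketch (the helper name `ms_to_min_sec` is made up for illustration). A value of 1,000,000 ms works out to 16 min 40 s, which is close to, but not exactly, the 16 min 55 s recovery time reported earlier in the thread.

```c
#include <stdint.h>

/* Hypothetical helper: a duration split into whole minutes and leftover seconds. */
typedef struct {
    uint32_t minutes;
    uint32_t seconds;
} min_sec_t;

/* Convert milliseconds to minutes and seconds, dropping any fraction of a second. */
min_sec_t ms_to_min_sec(uint32_t ms)
{
    min_sec_t r;
    uint32_t total_seconds = ms / 1000u;
    r.minutes = total_seconds / 60u;
    r.seconds = total_seconds % 60u;
    return r;
}
```

The 15 s gap between the two figures does not rule the idea in or out on its own; it only shows the numbers are in the same range.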

Thanks

Gus_Issa · November 16, 2016, 8:43am

@ xtomas22 - It is possible that your RLP is using some memory that is collected/shuffled by the garbage collector, which is what is causing the system to crash. With the large memory size on G400, it is reasonable to say that the error only happens after a few days.

xtomas22 · November 16, 2016, 1:04pm

Ok, but:

Why does the board wake up after 17 min?
Why did the board not crash after a manual GC(true)?

Now I have a stopwatch around every ETH.SendTo, SPI.WriteRead, RLP.Serialize and RLP.Deserialize call and log every operation that takes longer than 2 ms, but we have to wait several days for the first crash.

I double-checked the logs and it looks like only the SPI boards are causing this problem.

Please, can you check the timeout for the SPI.WriteRead operation in the G400-D driver?
Is there any way to set SPI timeouts?

Thanks

Gus_Issa · November 16, 2016, 2:26pm

@ xtomas22 - board wakes up due to the watchdog.

There is no timeout for SPI.

xtomas22 · November 16, 2016, 3:26pm

Ok, so a combination of RLP and GC. I use 2 static arrays for serialization and deserialization. I call RLP.Initialize after boot to store the pointers. Every time I call RLP.Serialize or RLP.Deserialize I check the pointers. If the array starts with the “RLP:” string, the pointer is OK; otherwise I return -1 and re-initialize the pointers from C#.
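
The marker check described above might look roughly like this on the native side. This is a sketch under assumptions, not the actual RLP library: the function name `rlp_check_buffer` and the explicit length parameter are hypothetical; only the “RLP:” prefix and the -1 return value come from the post.

```c
#include <stddef.h>
#include <string.h>

#define RLP_MARKER     "RLP:"
#define RLP_MARKER_LEN 4u

/* Returns 0 when the shared buffer still starts with the marker,
   -1 when the stored pointer looks stale and should be re-initialized. */
int rlp_check_buffer(const unsigned char *buf, size_t len)
{
    if (buf == NULL || len < RLP_MARKER_LEN) {
        return -1;
    }
    return memcmp(buf, RLP_MARKER, RLP_MARKER_LEN) == 0 ? 0 : -1;
}
```

Note that a check like this can only detect a problem after the fact. If the managed array were moved and the old location still held the bytes “RLP:”, the check would pass even though the pointer no longer refers to the live array.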

How can I protect my arrays against the GC? Maybe with fixed in C#?
How can I suppress the GC while I'm inside RLP? What happens when I'm inside RLP and, let's say, an input interrupt occurs?

Thanks