G120 NETMF CAN Overrun error

HalfGeek · January 17, 2020, 11:21am

Hi,

I have been using a G120 (Panda III) for a while in some projects, running 2 CAN networks with no issues. However, with recent software changes I have a lot more user-defined events firing to handle various things. Whenever a few events are firing closely together, I get CAN overrun errors. The documentation simply says that this means “A CAN message was lost because the hardware was not able to receive the message in time.”
I know that my CAN handler is receiving only 1 or 2 messages at a time (I try and grab 20 at a time usually and simply add them to a queue for processing outside the event handler), so I am not hitting the HW buffer limit while processing other things (I am not sure how big you can set the HW limit as it seems happy to accept any number!).

Has anyone had something similar, or does anyone know any other causes of the overrun error? It is not generating any errors on the CAN bus, so I assume that something is being starved of processing. From what I can see, it only happens when other events are firing, so I am suspicious that it happens when the CAN controller cannot access some resource or bus.

Any thoughts much appreciated! I have run out of things to try!

Thanks in advance

Nick

Gus_Issa · January 17, 2020, 12:00pm

The easiest thing to do is enable hardware filters on the messages so you only receive what you need, if this is an option.

We have new hardware that is a lot faster. I am curious whether the new hardware can handle all of your messages. Please email me directly g@ghiele…

HalfGeek · January 17, 2020, 12:22pm

Thanks Gus,

The CAN bus itself is busy, but I have some pretty limiting GroupFilters set already which give me regular messages approximately every 100-110ms (Network Management), and also the sporadic message sequence (automotive diagnostics) that I am trying to react to. I assume the GroupFilters are the hardware filters, and not a software level filter?

Incoming CAN load (after the filters) is low; this can be seen from there usually being only 1 message in the buffer each time the MessageAvailable event fires.

Unfortunately, we are stuck with this hardware for now due to the installed user base that we are trying to service. I would love to move on to some newer, faster components, but Version 2 is still on the drawing board.

It just seems that the CAN HW is affected whenever I do something in the other areas of SW, so I am trying to understand the interactions; I may be able to limit the rate of certain things, especially while doing diagnostic CAN. E.g. does the HW raise that error only when its incoming buffers are full, or could it also raise it if other internal comms are blocked, etc.?

Cheers!

Nick

Gus_Issa · January 17, 2020, 12:46pm

Do you allocate new objects in your processing loops? This is something that you can possibly optimize.

As for new hardware, we are not asking you to switch but it will add to the long list of in-field real-life tests we have done with customers.

HalfGeek · January 17, 2020, 1:20pm

Hi Gus,

This is my current WIP for the handler for MessageAvailable event to try and get the quickest turnaround.

    private void incomingMessageHandler(ControllerAreaNetwork sender, ControllerAreaNetwork.MessageAvailableEventArgs e)
    {
        while (sender.AvailableMessages > 0)
        {
            TempCANMessage = sender.ReadMessage();
            // Send continuation frame if needed (done here for max responsiveness)
            if ((flowControlTrigger) && (TempCANMessage.ArbitrationId == flowControlTriggerAddress) && ((TempCANMessage.Data[0] & 0xF0) == 0x10))
            {
                can.SendMessage(flowControlMessage);
                ClearFlowControlTrigger();
            }
            if (canEnabled)
            {
                // Convert GHI CAN message to new copy of JNX CAN message to avoid data overwriting and add to queue
                incomingCANmessageQueue.Enqueue(new CANMessage(TempCANMessage));
            }
            IncomingCANMessageEvent.Set();
        }
    }

I do have to convert the GHI CAN message to our own wrapper struct (with some extra info). This is mainly because if I add the GHI CAN message to the queue it will only be a reference and, if I do get another message quickly after it (some diag messages burst frames at 1-4ms), then I end up processing a queue that all have the same message.

I originally used the ReadMessages() method, but it actually turned out quicker to call the ReadMessage() method repeatedly until the buffer was empty, at least for the few messages I need to react to! Both have the same issue with having to clone the GHI CAN message when adding it to the queue to avoid referencing issues.

All versions of my handler are quite happy with high CAN loading (without Overrun errors) as long as nothing much else is going on in the system. If I handle button presses (using interrupts on input pins), or update the SPI-based display, or trigger the diagnostic handling (using user-defined events), then I get CAN Overrun errors.
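
For readers unfamiliar with the `(Data[0] & 0xF0) == 0x10` check in the handler above: assuming the diagnostic traffic follows ISO-TP (ISO 15765-2), the high nibble of the first data byte is the frame type, and `0x1` marks a first frame that the receiver must answer with a flow-control frame. The sketch below shows that decoding and builds the flow-control frame once, up front, so nothing has to be allocated inside the receive handler. The class `IsoTp` and its members are hypothetical names for illustration, not part of the GHI library, and the zero padding is an assumption (some ECUs expect a different padding byte).

```csharp
using System;

// Hypothetical helper, not part of the GHI API.
public static class IsoTp
{
    // ISO 15765-2 frame types, carried in the high nibble of data byte 0.
    public const int SingleFrame = 0;
    public const int FirstFrame = 1;
    public const int ConsecutiveFrame = 2;
    public const int FlowControl = 3;

    public static int FrameType(byte[] data)
    {
        return (data[0] >> 4) & 0x0F;
    }

    // A first frame carries a 12-bit total payload length:
    // the low nibble of byte 0 followed by all of byte 1.
    public static int FirstFrameLength(byte[] data)
    {
        return ((data[0] & 0x0F) << 8) | data[1];
    }

    // Build the 8-byte payload of a "continue to send" flow-control frame.
    // blockSize 0 means the sender need not wait for further flow control;
    // separationTimeMs (0x00-0x7F) is the minimum gap between consecutive frames.
    public static byte[] BuildFlowControl(byte blockSize, byte separationTimeMs)
    {
        byte[] payload = new byte[8]; // remaining bytes stay 0x00 (assumed padding)
        payload[0] = 0x30;            // frame type 3, flow status 0 (continue to send)
        payload[1] = blockSize;
        payload[2] = separationTimeMs;
        return payload;
    }
}
```

Building the payload once at start-up and reusing it for every response keeps the time-critical path in the handler down to a comparison and a send.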

Your comment about in-field real-life tests intrigued me - I’ll drop you an email as well!
Nick

Gus_Issa · January 17, 2020, 2:23pm

I am not sure what else can be done. But this info is very good for our engineers to verify that TinyCLR 2.0 can handle this better. I will pass this on.

HalfGeek · January 17, 2020, 2:58pm

Cheers for looking - I think we may just be pushing the limits of the hardware in peak loading. When I get some time, a clean sweep through the code may eliminate some old redundant loads and help things.

Nick

Mike · January 17, 2020, 4:08pm

Any indications of excessive garbage collection when running in debugger?

Pre-allocation and reuse of objects being cloned to?
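
One way to read the pre-allocation suggestion, as a minimal sketch rather than a drop-in fix: replace the per-message `new CANMessage(...)` with a fixed ring of frames allocated once at start-up, and copy each incoming message's bytes into the next free slot. Copying (rather than storing the reference) still avoids the aliasing problem described earlier, but the steady state creates no garbage. `CanFrame` and `CanFrameRing` are hypothetical names standing in for the wrapper type and queue; the code avoids generics so it should also suit NETMF.

```csharp
using System;

// Hypothetical stand-in for the application's own CAN message wrapper.
public class CanFrame
{
    public uint ArbitrationId;
    public int Length;
    public readonly byte[] Data = new byte[8]; // classic CAN payload is at most 8 bytes
}

// Fixed-capacity ring of pre-allocated frames. Write and Read copy bytes
// between existing objects, so neither allocates after construction.
public class CanFrameRing
{
    private readonly CanFrame[] slots;
    private readonly object sync = new object();
    private int head;  // index of the oldest unread frame
    private int tail;  // index of the next slot to write
    private int count;

    public int Dropped; // frames rejected because the ring was full

    public CanFrameRing(int capacity)
    {
        slots = new CanFrame[capacity];
        for (int i = 0; i < capacity; i++) slots[i] = new CanFrame();
    }

    // Intended for the receive handler: copy the incoming frame into a free slot.
    public bool Write(uint id, byte[] data, int length)
    {
        lock (sync)
        {
            if (count == slots.Length) { Dropped++; return false; }
            CanFrame slot = slots[tail];
            if (length > slot.Data.Length) length = slot.Data.Length;
            slot.ArbitrationId = id;
            slot.Length = length;
            Array.Copy(data, slot.Data, length);
            tail = (tail + 1) % slots.Length;
            count++;
            return true;
        }
    }

    // Intended for the processing thread: copy the oldest frame into a caller-owned object.
    public bool Read(CanFrame destination)
    {
        lock (sync)
        {
            if (count == 0) return false;
            CanFrame slot = slots[head];
            destination.ArbitrationId = slot.ArbitrationId;
            destination.Length = slot.Length;
            Array.Copy(slot.Data, destination.Data, slot.Length);
            head = (head + 1) % slots.Length;
            count--;
            return true;
        }
    }
}
```

In the handler this would look like `ring.Write(msg.ArbitrationId, msg.Data, msg.Length)` followed by setting the event, with the processing thread calling `ring.Read(scratch)` on a single reusable `scratch` frame. The `Dropped` counter makes it visible when the ring, rather than the CAN hardware, is the place where frames are being lost.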

Brett · January 17, 2020, 10:31pm

Panda3 is NOT a G120, it’s only G80. Do you perhaps mean Cobra3?

HalfGeek · January 21, 2020, 8:13am

My apologies - I did mean Cobra III! We moved from Panda II to Cobra III (Panda III did not do IFU etc., which we needed, so we dallied with it only a short while). But my aging brain is too easily confused.

Nick

HalfGeek · January 21, 2020, 8:18am

Hi Mike,

Nothing I could see. I actually have the following set

Debug.EnableGCMessages(false); // Turns off Garbage Collection debug messages to avoid CAN timing glitching

This was because, left to its own devices, GC would run sporadically, but whenever it did it would pause CAN for 100ms or so, which is too much for our systems. And the debug messages made it worse when there was no debugger attached.

From your suggestion though, I suspect we would benefit from seeing where I can re-use existing objects rather than create new ones repeatedly.

I turned the messaging back on, and I can see the initial GC but nothing reported while the CAN Overruns occur…

The debugging target runtime is loading the application assemblies and starting execution.
Ready.

GC: 3msec 820332 bytes used, 6519336 bytes available
Type 0F (STRING ): 612 bytes
Type 11 (CLASS ): 9240 bytes
Type 12 (VALUETYPE ): 624 bytes
Type 13 (SZARRAY ): 4296 bytes
Type 03 (U1 ): 1512 bytes
Type 04 (CHAR ): 492 bytes
Type 06 (U2 ): 228 bytes
Type 07 (I4 ): 120 bytes
Type 0F (STRING ): 216 bytes
Type 11 (CLASS ): 1728 bytes
Type 15 (FREEBLOCK ): 6519336 bytes
Type 16 (CACHEDBLOCK ): 264 bytes
Type 17 (ASSEMBLY ): 35256 bytes
Type 18 (WEAKCLASS ): 48 bytes
Type 19 (REFLECTION ): 192 bytes
Type 1B (DELEGATE_HEAD ): 756 bytes
Type 1D (OBJECT_TO_EVENT ): 384 bytes
Type 1E (BINARY_BLOB_HEAD ): 761568 bytes
Type 1F (THREAD ): 1152 bytes
Type 20 (SUBTHREAD ): 144 bytes
Type 21 (STACK_FRAME ): 924 bytes
Type 27 (FINALIZER_HEAD ): 432 bytes
Type 31 (IO_PORT ): 576 bytes
Type 34 (APPDOMAIN_HEAD ): 72 bytes
Type 36 (APPDOMAIN_ASSEMBLY ): 3792 bytes
CAN 1 error received:0
CAN 1 error received:0
CAN 1 error received:0
CAN 1 error received:0
CAN 1 error received:0
CAN 1 error received:0
CAN 1 error received:0
CAN 1 error received:0
CAN 1 error received:0