Cerberus ENC28 TCP Protocol Problems Found via Wireshark

csailor · December 21, 2013, 5:59pm

I’m developing a sensor that uses a Cerberus and an ENC28 as its core, which a PLC or PC can communicate with via Modbus TCP. Everything seemed to be working fine until the sensor started becoming unresponsive after a day or two of running with the PLC taking readings. By unresponsive I mean that the Cerberus won’t even respond to an ICMP ping! This does not happen even after a week of running with a PC taking readings. So I fired up Wireshark to see what was happening.

My first test was to see what the packet exchange between the PLC and a PC looked like. I wrote a Win7 app that uses the same core “server” code that runs on the Cerberus. In the attached image PLCToPCWireshark.jpg you can see a smooth exchange between the two devices. The PLC address is 34.34.34.23 and the PC address is 34.34.34.45. This configuration runs for days without issue.

The second test was to see what the packet exchange between the PLC and one of the Cerberus sensors looked like. You can see this in the attached image PLCToSensorWireshark.jpg. The PLC address is 34.34.34.23 and the Cerberus address is 34.34.34.134.

As you can see, the two exchanges differ: the Cerberus sensor sends extra packets back to the PLC that the PC does not.

So the question is, could these extra packets from the Cerberus cause the Cerberus communications to lock up? I’ve tried setting the DontLinger, Linger, and KeepAlive socket options on the Cerberus and nothing affects the outcome. One might be led to believe that the PLC is the problem, but when it talks to the PC app for an extended period of time it keeps humming along.

12/22/2013 - Update.

In my continued testing I’ve found that even a PC, after communicating with the Cerberus sensor for a week, will also cause the sensor to lock up completely, to the point that it won’t even respond to an ICMP ping. So there is something going on with the ENC/NETMF stack or something. The only way to get the Cerberus back online is to cycle power on the unit. This is the code for the server:

Set up the listening socket:

// Create a TCP socket and listen on port 502 on all interfaces.
tcpPort502ServerSocket = new Socket( AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp );
tcpPort502LocalEndPoint = new IPEndPoint( IPAddress.Any, 502 );
tcpPort502ServerSocket.Bind( tcpPort502LocalEndPoint );
tcpPort502ServerSocket.Listen( 2 );

// Accept connections on a dedicated thread.
tcpPort502ListeningThread = new Thread( WaitForPort502Connections );
tcpPort502ListeningThread.Start();

Listening Thread:

private void WaitForPort502Connections()
{
	while ( true )
	{
		// Block until a client connects, hand it to the request handler, then close it.
		Socket clientSocket = tcpPort502ServerSocket.Accept();

		ProcessPort502ClientRequest( clientSocket );

		clientSocket.Close();
		clientSocket = null;
	}
}
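A more defensive variant of this loop might look like the sketch below. It is written against desktop .NET so it can be tried without hardware, and the names `GuardedServer`, `ServeOne`, and `ClientHandler` are illustrative, not part of the original app. The idea is to give each accepted socket finite timeouts and to close it in a `finally` block, so that a stalled or faulting client cannot leave the handle open.

```csharp
using System;
using System.Net.Sockets;

// Illustrative delegate type for the per-client request handler.
public delegate void ClientHandler(Socket client);

public static class GuardedServer
{
    // Accepts one connection, applies finite timeouts, and always closes the socket.
    // Returns true if the handler completed, false if it raised a SocketException.
    public static bool ServeOne(Socket listener, ClientHandler handler, int timeoutMs)
    {
        Socket client = listener.Accept();   // blocks until a client connects
        try
        {
            client.ReceiveTimeout = timeoutMs;
            client.SendTimeout = timeoutMs;
            handler(client);
            return true;
        }
        catch (SocketException)
        {
            // A timeout or reset on this client should not kill the listening thread.
            return false;
        }
        finally
        {
            client.Close();
        }
    }
}
```

The listening thread would then call `ServeOne` inside its `while ( true )` loop. Whether this changes the lockup behaviour on the ENC28 is untested; it only removes one way for the thread itself to die or block forever.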

James_ghielectroncs · December 23, 2013, 10:39am

@ csailor - Could you provide us with the simplest possible test case that will reproduce this issue?

csailor · December 23, 2013, 10:57am

James,

Basically the Cerberus unit is opening 3 ports to listen on:

TCP Stream Port 134 with 1 listening queue
TCP Stream Port 502 (ModbusIP) with 3 listening queues
UDP Port 30303

Each Modbus request sent from the PC is 12 bytes, and the Cerberus unit returns 11 bytes.

The PC app should poll the Cerberus unit every second on port 502 until the Cerberus unit locks up, which will possibly take a week or so. If you set the polling time to 500ms you can get the Cerberus to lock up after about 3 days.

The port 502 handler simply triggers a digital output and reads an analog input. There is a little math being performed, but nothing heavy.
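For anyone reproducing this, those sizes are consistent with a standard Modbus TCP read of a single 16-bit register: a 7-byte MBAP header plus a 5-byte request PDU, and a 7-byte MBAP header plus a 4-byte response PDU. The sketch below builds such a request and validates such a response. The function code (0x04, Read Input Registers), unit ID, and register address are assumptions, since the thread does not state them, and `ModbusFrames` is a hypothetical helper name.

```csharp
using System;

public static class ModbusFrames
{
    // Builds a 12-byte Modbus TCP request: MBAP header (7 bytes) + PDU (5 bytes).
    public static byte[] BuildReadRequest(ushort transactionId, byte unitId,
                                          byte functionCode, ushort startAddress, ushort quantity)
    {
        byte[] f = new byte[12];
        f[0] = (byte)(transactionId >> 8);
        f[1] = (byte)(transactionId & 0xFF);
        f[2] = 0; f[3] = 0;                 // protocol id is always 0 for Modbus
        f[4] = 0; f[5] = 6;                 // length: unit id + 5-byte PDU follow
        f[6] = unitId;
        f[7] = functionCode;
        f[8] = (byte)(startAddress >> 8);
        f[9] = (byte)(startAddress & 0xFF);
        f[10] = (byte)(quantity >> 8);
        f[11] = (byte)(quantity & 0xFF);
        return f;
    }

    // Returns true and the register value if 'response' is a well-formed
    // 11-byte single-register reply matching 'request'.
    public static bool TryParseSingleRegister(byte[] request, byte[] response, out ushort value)
    {
        value = 0;
        if (request == null || request.Length < 8) return false;
        if (response == null || response.Length != 11) return false;
        if (response[0] != request[0] || response[1] != request[1]) return false; // transaction id
        if (response[2] != 0 || response[3] != 0) return false;                   // protocol id
        if (response[4] != 0 || response[5] != 5) return false;  // unit + fc + count + 2 data bytes
        if (response[7] != request[7]) return false;             // function code must echo
        if (response[8] != 2) return false;                      // byte count for one register
        value = (ushort)((response[9] << 8) | response[10]);
        return true;
    }
}
```

A test client can count every reply that fails `TryParseSingleRegister`, or every request that times out, as a failed read.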

If needed I can supply you with both the PC and Cerberus apps.

James_ghielectroncs · December 23, 2013, 11:33am

@ csailor - It would be best to supply us with the apps, and details about any special network setup, if it exists.

csailor · December 23, 2013, 12:40pm

James,

I’ve attached the apps as you requested. The network is a standard unmanaged 10/100/1000 switch. The Cerberus sensor is powered over PoE; I split off the power lines before hooking up the ENC28.

The QuickMsgApp in the zip file is the Win7 app.

Well I tried to attach the zip file renamed to jpg but that didn’t work. Where can I send the file?

James_ghielectroncs · December 23, 2013, 1:49pm

You can send the file to james.dudeck at ghielectronics dot com

csailor · December 26, 2013, 9:39pm

James,

I’ve collected more data that points to the ENC28 stack locking up totally! I’ve added another thread that runs in the background and, every 15 seconds, reports in the debugging output window the number of port 502 requests the client has issued. If the client queries the Cerberus with a 100ms delay between requests, the ENC28 stops responding to further requests after about 14.5 hours. It won’t even answer an ICMP ping. The monitoring thread keeps running and keeps showing the number of client requests, although of course that number no longer changes because the ENC28 is no longer responding to the client. I’ve captured the server thread, socket, and network interface variables before and after the lockup, and they contain the same information. This is what leads me to believe that the ENC28 stack is completely locked up and does not respond to anything, while the Cerberus itself is working just fine.

At this point I would say that GHI has a problem, but what that problem is I don’t know. I know that if I hit the reset button on the Cerberus board, everything starts working again after it reboots. I could reset the entire unit after 1000 client requests, but that is a bandaid and I shouldn’t have to go to that extreme to get this thing working. Is there a way to reset the ENC28 via a method call or something? Again, this isn’t the way I want to handle this problem, as there should be a fix!

I appreciate any and all help here!

Here are the contents of the three variables, both before and after the lockup:

netif  {Microsoft.SPOT.Net.NetworkInformation.NetworkInterface[1]}  Microsoft.SPOT.Net.NetworkInformation.NetworkInterface[]
  [0]  {Microsoft.SPOT.Net.NetworkInformation.NetworkInterface}  Microsoft.SPOT.Net.NetworkInformation.NetworkInterface
    _dnsAddress1  0  uint
    _dnsAddress2  0  uint
    _flags  0  uint
    _gatewayAddress  19014178  uint
    _interfaceIndex  0  int {uint}
    _ipAddress  2250383906  uint
    _macAddress  {byte[6]}  byte[]
      [0]  52  byte
      [1]  52  byte
      [2]  52  byte
      [3]  52  byte
      [4]  52  byte
      [5]  134  byte
    _networkInterfaceType  6  Microsoft.SPOT.Net.NetworkInformation.NetworkInterfaceType {uint}
    _subnetMask  16777215  uint
    DnsAddresses  {string[0]}  string[]
    GatewayAddress  "34.34.34.1"  string
    IPAddress  "34.34.34.134"  string
    IsDhcpEnabled  0  bool {int}
    IsDynamicDnsEnabled  0  bool {int}
    NetworkInterfaceType  Ethernet  Microsoft.SPOT.Net.NetworkInformation.NetworkInterfaceType
    PhysicalAddress  {byte[6]}  byte[]
      [0]  52  byte
      [1]  52  byte
      [2]  52  byte
      [3]  52  byte
      [4]  52  byte
      [5]  134  byte
    SubnetMask  "255.255.255.0"  string
    Static members

tcpPort502ServerSocket  {System.Net.Sockets.Socket}  System.Net.Sockets.Socket
  Available  0  int {uint}
  LocalEndPoint  {0.0.0.0:502}  System.Net.EndPoint {System.Net.IPEndPoint}
  m_fBlocking  true  bool
  m_Handle  1  int
  m_localEndPoint  {0.0.0.0:502}  System.Net.EndPoint {System.Net.IPEndPoint}
  m_recvTimeout  -1  int
  m_sendTimeout  -1  int
  ReceiveTimeout  -1  int
  RemoteEndPoint  {0.0.0.0:3}  System.Net.EndPoint {System.Net.IPEndPoint}
  SendTimeout  -1  int

tcpPort502ListeningThread  {System.Threading.Thread}  System.Threading.Thread
  IsAlive  true  bool
  m_AppDomain  Cannot fetch the value of field 'm_AppDomain' because information about the containing class is unavailable.  object
  m_Delegate  {System.Threading.ThreadStart}  System.Delegate {System.Threading.ThreadStart}
  m_Id  5  int
  m_Priority  2  int
  m_Thread  Cannot fetch the value of field 'm_Thread' because information about the containing class is unavailable.  object
  ManagedThreadId  5  int
  Priority  Normal  System.Threading.ThreadPriority
  ThreadState  WaitSleepJoin | Suspended  System.Threading.ThreadState
  Static members
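A note for anyone comparing the raw fields in this dump: the `uint` values are the same addresses as the string properties, stored with the first octet in the least significant byte. A small helper makes that easy to check (the name `Ipv4.FromLittleEndian` is illustrative, not a NETMF API):

```csharp
public static class Ipv4
{
    // Converts a packed IPv4 address whose first octet is in the low byte
    // into dotted-decimal form.
    public static string FromLittleEndian(uint raw)
    {
        return (raw & 0xFF) + "." +
               ((raw >> 8) & 0xFF) + "." +
               ((raw >> 16) & 0xFF) + "." +
               ((raw >> 24) & 0xFF);
    }
}
```

With the values above, `Ipv4.FromLittleEndian(2250383906)` gives "34.34.34.134", `19014178` gives "34.34.34.1", and `16777215` gives "255.255.255.0", matching the `IPAddress`, `GatewayAddress`, and `SubnetMask` strings.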

csailor · December 27, 2013, 10:13am

More Info…

I tried calling PowerState.RebootDevice( false ), and that didn’t reset the ENC28. So I added toggling of the ENC28’s reset line before calling RebootDevice, and that does reset the ENC28 along with the Cerberus. I will perform a reset every hour to make sure the unit stays connected and responds to requests.

However, I still call this a bandaid and I believe the ENC28 issue needs to be addressed. I’ve read a few articles after Googling the subject and see that this has been a problem with the ENC28 before. I don’t know if GHI has addressed the issue recently, but it looks to be a problem nonetheless.

Brett · December 27, 2013, 4:15pm

I just want to check… you did provide GHI the test harness to repro this, correct?

csailor · December 27, 2013, 4:16pm

Brett,

Yes I did provide James with the client and Cerberus apps.

James_ghielectroncs · December 30, 2013, 2:29pm

We do have the test apps, but due to the prolonged usage before the issue appears, we will be testing this after the new year because of the holidays.

jango_jas · January 3, 2014, 2:46pm

@ csailor - Hi. I have a similar issue with Cerberus and ENC28. A computer and a Cerberus exchange 30 UDP messages per second in my application. Everything works fine until I start a huge file transfer on my network. At that moment, the ENC28 becomes unresponsive and never responds again. I have to restart the board.

Could you try transferring a huge file on your network and see if the problem occurs? If so, it could be simpler to reproduce.

Gus_Issa · January 3, 2014, 3:00pm

A quick note here about the Cerb family of devices. As they have very little RAM, it is very easy to lose incoming packets on the network, so this needs to be taken into consideration. My guess is that the device throws an out-of-memory exception and exits, rather than locking up.

csailor · January 3, 2014, 3:57pm

Gus,

I too thought that maybe this was happening, so I performed my testing while running the app under a debugger. I didn’t get one exception! To prove this to myself I kicked off another thread just to report what was happening, as I detailed in an earlier posting. The Cerberus keeps running, but the ENC28 stops responding. The only reason I sped up the packet rate was that I didn’t want to wait a week for it to become unresponsive again. I normally poll the Cerberus every 30 seconds, but given enough time it will still stop responding. It’s like the ENC28 needs to be kicked every X hours to keep it going, which is why I chose to kick it every hour. With this implementation of resetting the ENC28 and then rebooting the Cerberus every hour, I’ve been able to run continuously for an entire week making reads every 100ms without a lockup! As of this posting the test app has made 5,526,000 requests with 658 failures. That is a failure rate of about 0.01%, which is acceptable of course.

So it is working: I’ve solved the problem with a bandaid and have had a unit running for an entire week, which is all good. However, there is still the fact that I have to kick the ENC28 to keep it working.

@ jango_jas - I’ve been running on a network that has <512 byte packets and the Cerberus/ENC28 still has issues. But the test system that has been running for a week works even though I’ve moved around 500+MB files on the same network.

Jay_Jay · January 3, 2014, 6:16pm

Does anyone know the code to reset the ENC28? I DON’T want to reset the board…

Thanks for sharing…

csailor · January 3, 2014, 6:25pm

@ Jay Jay - Here is the code I use to reset the ENC28 on the Cerberus:

The only problem is that the ENC28 is set up during the startup of the Cerberus, so to prevent issues I reset the Cerberus after resetting the ENC28. You can try just resetting the ENC28, but I wouldn’t guarantee the Cerberus will behave properly. Maybe someone else can enlighten us as to whether resetting the Cerberus is imperative.

					
//
// Declare the ENC's Reset output...
//
private GT.Interfaces.DigitalOutput enc28ResetLine;

//
// Configure the ENC28's reset output...
//
try
{
	enc28ResetLine = new GT.Interfaces.DigitalOutput( socket, GT.Socket.Pin.Four, true, null );
}
catch ( Exception ex )
{
	Debug.Print( "GT.Interfaces.DigitalOutput enc28ResetLine Error: " + ex.Message );
}

//
// Perform the reset...
//
bool currentState = enc28ResetLine.Read();
currentState = !currentState;
enc28ResetLine.Write( currentState );
Thread.Sleep( 50 );
currentState = !currentState;
enc28ResetLine.Write( currentState );
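The toggle sequence above depends on whatever level the pin happens to be at when it runs. A slightly more explicit version is sketched below, assuming the ENC28J60 reset pin is active-low (which matches the initial `true` state used above). `IResetLine` and `Enc28Reset` are hypothetical names standing in for `GT.Interfaces.DigitalOutput`, so the logic can be exercised without hardware.

```csharp
using System.Threading;

// Hypothetical abstraction over the digital output; not a Gadgeteer type.
public interface IResetLine
{
    void Write(bool level);
}

public static class Enc28Reset
{
    // Drives the reset line low for holdMs, then releases it high
    // and waits settleMs before returning.
    public static void Pulse(IResetLine line, int holdMs, int settleMs)
    {
        line.Write(false);       // assert reset (active-low)
        Thread.Sleep(holdMs);
        line.Write(true);        // release reset
        Thread.Sleep(settleMs);  // give the chip time to come back up
    }
}
```

On the Cerberus, a thin wrapper whose `Write` forwards to `enc28ResetLine.Write` would implement the interface. As noted above, the network stack was initialised at startup, so a reboot may still be needed afterwards.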

Jay_Jay · January 4, 2014, 11:02am

Thanks for sharing, will try it later and report back…

I think this should be built into the firmware, where one would call a simple function to reset the module… what do you all think?

Reinhard_Ostermeier · January 4, 2014, 11:44am

If I remember right, the constructor of the ENC28 interface class from the premium lib has a parameter for the reset pin.
But I don’t know what it is used for.

csailor · January 9, 2014, 8:06am

James,

Any word on your testing of this issue??? I’ve got over 10 million reads now with a reset every hour. I have noticed that even after resetting the unit every hour there are times when the unit will still become unresponsive and will continue this way for 30+ minutes. It won’t even respond to a ping. But, once the hour timer kicks in and resets the ENC28 and Cerberus all is well. I’ve witnessed this on many occasions when I’ve noticed that the unit is no longer taking readings. I’ve looked at the client app and it is timing out trying to take readings. This will continue until the hour timer does its thing.

Again this is further proof that it isn’t anything related to the Cerberus but rather the ENC28 becoming unresponsive altogether.

Any clues?
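Since the fixed hourly reset can leave the unit dark for 30 minutes or more, one option is to trigger the reset when requests stop arriving rather than on a fixed schedule. This is only a sketch: `StallDetector` and its threshold are hypothetical, it assumes the client polls at a steady rate so that silence really does mean a lockup, and the caller would still perform the ENC28 reset and reboot described earlier.

```csharp
public class StallDetector
{
    private readonly int maxIdleChecks;
    private long lastSeenCount = -1;
    private int idleChecks;

    public StallDetector(int maxIdleChecks)
    {
        this.maxIdleChecks = maxIdleChecks;
    }

    // Call periodically with the total number of requests served so far.
    // Returns true once the count has not changed for maxIdleChecks
    // consecutive calls, meaning no request arrived in that time.
    public bool Check(long requestCount)
    {
        if (requestCount != lastSeenCount)
        {
            lastSeenCount = requestCount;
            idleChecks = 0;
            return false;
        }
        idleChecks++;
        return idleChecks >= maxIdleChecks;
    }
}
```

Called from the existing 15-second monitoring thread with `maxIdleChecks` set to 20, this would request a reset after about five minutes of silence instead of waiting for the hourly timer.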