Networking and the state of NetMF 4.4

_Peter · November 29, 2015, 7:52am

Has anyone got any results running this on 4.4?

mcalsyn · November 29, 2015, 7:57am

Not on the GHI hardware, but I am working on the 4.4 version of the serialwifi nuget packages for the Oxy+Neon (ESP8266), and the server test is passing. I will do a stress run overnight tonight.

RoSchmi · November 29, 2015, 8:08am

sh1 · November 29, 2015, 9:16am

@ RoSchmi - Thanks very much for taking the time to report back on your findings. For me it’s all about finding the changes and differences that might lead to a solution. I wrote the original code quickly to try to replicate the problem, and there was no HTTP header because I was calling the server from my own socket-based test app. The only thing the longer string is likely to do is push the send over the MTU size, so that it takes more than one packet.

I do run on port 80 here and @ mcalsyn made the change to another port number so for me that is the same.

In reality the only differences I see are unrelated to the webserver code itself.

1 - You are using DHCP (and dynamic DNS) and I am using static addressing
2 - You are calling the EthernetENC28J60 constructor directly and I am going via Gadgeteer
3 - Your TinyBooter version is different to mine (TinyCLR is the same)

I will replicate exactly what you have done and report back.

Thanks again

Steve

Reinhard_Ostermeier · November 29, 2015, 9:49am

@ sh - As far as I know the Gadgeteer networking wrappers are not very reliable. I would initialize the networking directly using pure NETMF.
Also, a fixed IP setup should not bring any less reliability.
I use Static IP all the time, DHCP might only add the issue of not getting an IP address.

PHITEK · November 29, 2015, 10:18am

Found my issue, was using SPI1, should have used SPI2 contrary to what the schematic states for socket X6 on the G400HDR.



I will probably make @ RoSchmi's changes, and then try some testing with network errors.

RoSchmi · November 29, 2015, 10:45am

@ sh - After more testing I must unhappily admit that I saw some hangs too. Fiddler showed that the ENC28 did not answer the next request and that the ENC28 connection did not recover. I also saw that the HTTP connection was more reliable, and even faster, when the Raptor application was running in debugging mode in Visual Studio.
From my work with the RN-171 WiFly module I suspect that the cause of this strange behaviour could be that the socket.close command of the HTTP server does not arrive at the ENC28 at the right time: the socket that is to be closed has already been closed by another mechanism (maybe because of a time gap between sending the last bytes and the socket.close command). The socket.close command then closes the socket of the next request before it is answered, and the client waits in vain for its response. Garbage collection could be a reason for a time gap between the last data bytes and the socket.close command. In my RN-171 HTTP server I therefore sent the data in chunks and called the garbage collector in code before sending the last chunk, so that a collection cannot fall into this vulnerable window.
I know that these thoughts are still in some way speculative.

sh1 · November 29, 2015, 10:50am

I have been trying to improve this network crash issue, which always ends up with the thread in a hung state, and I keep thinking about asynchronous or non-blocking sockets. NetMF (in theory) does not have non-blocking sockets, but there is a private field in the Socket class which is used internally to wrap non-blocking calls to the underlying native socket.

If this field is true (which it always is) then a call to Socket.Accept() or Socket.Connect() actually calls Poll(-1, SelectMode.SelectRead), which is where the thread effectively hangs. I am not suggesting it’s a fix for my problem, but we can make these calls non-blocking and implement our own Socket.Poll with a timeout so that at least our thread keeps running when these situations occur.

I have added 2 fields and changed my constructor to

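// Requires: using System; using System.Reflection;
Type _socketType;          // the NetMF Socket type, resolved by name
FieldInfo _blockingInfo;   // handle to the private m_fBlocking field

public SampleTCPListener()
{
    _socketType = Type.GetType("System.Net.Sockets.Socket");
    // m_fBlocking is private, so it has to be looked up with NonPublic | Instance.
    _blockingInfo = _socketType.GetField("m_fBlocking", BindingFlags.NonPublic | BindingFlags.Instance);
}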

and then in the listening loop we can do this which gives more control over the loop and means the thread won’t hang. The Socket.Accept call or the whole block needs exception handling as it will throw a WSAEWOULDBLOCK exception (10035) if there is no connection waiting but this is exactly how a non-blocking socket should behave. In theory this won’t happen (often) as the Socket.Poll call should tell us that there is a connection waiting.

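// Switch the listening socket to non-blocking via the reflected private field.
_blockingInfo.SetValue(_socket, false);

while (_isActive)
{
    // The Poll timeout is in microseconds; true means a connection is waiting.
    if (_socket.Poll(1, SelectMode.SelectRead))
    {
        Socket clientSocket;
        try
        {
            clientSocket = _socket.Accept();
        }
        catch (SocketException ex)
        {
            // WSAEWOULDBLOCK (10035): nothing to accept after all, so poll again.
            if (ex.ErrorCode != 10035) throw;
            continue;
        }

        new Thread(() =>
        {
            try
            {
                OnSocket(clientSocket);
            }
            catch (Exception)
            {
                // Placeholder for real handling; "throw;" keeps the original stack trace.
                throw;
            }
        }).Start();
    }
    else
    {
        Thread.Sleep(20);
    }
}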

I am not sure it will fix my problem and reflection generally means you are doing something you shouldn’t but for my debugging purposes I prefer this to any of the thread wrappers or execution constraint methods that I have seen to date.

Steve

sh1 · November 29, 2015, 11:00am

@ RoSchmi - Thanks for your feedback, I have been working down several paths to try and understand the issue and in a separate post I have described a non-blocking socket approach that is at least making my debugging a little easier as the thread doesn’t hang anymore.

I was about to reply to you and report that the changes from your post have not improved anything at my end. It’s hard to be exact because the ‘hangs’ are not at a fixed frequency, but I changed everything to be in line with your post and could still not get more than a few minutes (9 was the maximum) before the socket was no longer usable.

I do agree with your view that running in the debugger seems better; I originally put this down to an issue with Debug.Print statements when there is no debugger attached (which was a reported problem at one point) but I now have a full logging facade and the problem persists.

Thanks though, it’s very helpful to get any input and reports and I am very appreciative of your time

RoSchmi · November 29, 2015, 12:52pm

It is not easy to understand why it runs so much better on my side. Did you use my HTTP client app? (See the link in my previous post.) I think the behaviour of the client is important too.
I found one more thing: when I had Fiddler running and the server was not running in Visual Studio, the server hung after a few requests. After I included Connection:close and Cache-Control:no-cache in the header, it worked with seemingly no problems.


string header = "HTTP/1.1 200 OK\r\nContent-type: text/html\r\nConnection:close\r\nCache-Control:no-cache\r\nContent-Length:" + stringDocument.Length + "\r\n\r\n";
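
A minimal sketch of building the same response header as bytes, added as a caution rather than as the code used in the test above. HTTP defines Content-Length as a count of body bytes, so `stringDocument.Length` (a character count) only matches when the document is pure ASCII. The class and method names here are hypothetical, and UTF-8 is an assumed encoding.

```csharp
using System;
using System.Text;

public static class HttpResponseBuilder
{
    // Builds the complete response (header plus body) as bytes ready for Socket.Send.
    // The name and the UTF-8 choice are assumptions made for illustration.
    public static byte[] Build(string body)
    {
        byte[] bodyBytes = Encoding.UTF8.GetBytes(body);

        // Content-Length counts body bytes, which can exceed body.Length for non-ASCII text.
        string header = "HTTP/1.1 200 OK\r\n" +
                        "Content-Type: text/html; charset=utf-8\r\n" +
                        "Connection: close\r\n" +
                        "Cache-Control: no-cache\r\n" +
                        "Content-Length: " + bodyBytes.Length + "\r\n\r\n";
        byte[] headerBytes = Encoding.UTF8.GetBytes(header);

        // Join header and body so the response can go out in one Send call.
        byte[] response = new byte[headerBytes.Length + bodyBytes.Length];
        Array.Copy(headerBytes, 0, response, 0, headerBytes.Length);
        Array.Copy(bodyBytes, 0, response, headerBytes.Length, bodyBytes.Length);
        return response;
    }
}
```

Usage would be along the lines of `clientSocket.Send(HttpResponseBuilder.Build(stringDocument));`.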

sh1 · November 29, 2015, 2:01pm

@ RoSchmi - I wanted to steer clear of any particular protocol in the testing and I wish to use the TCP/IP stack for things other than HTTP so it’s important for me to keep as close to the raw socket as possible.

I have my own test client which in its basic form does nothing more than connect, send a very small amount of data and read the response from the server over a basic socket without any higher level abstractions.

It really is as simple a socket implementation as possible and I am 100% confident in it.

As to why our results are different I have now gone full circle and I am taking this improved and simplified test server and running on as many different network setups as I can. I have a packet sniffer linked with my test client and hopefully a picture will emerge.

Too early yet but I am getting some evidence that either the amount or nature of the other traffic on the network has quite a big effect on regularity of the hangs.

RoSchmi · November 30, 2015, 3:11am

@ sh - I just saw what @ andre.m already mentioned in an earlier post: the receive and send timeouts are set to -1 (infinite). Isn’t it necessary to set them to a finite time (e.g. 2 min, or longer if you want to send long files)?
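
A minimal sketch of setting finite timeouts, assuming the `ReceiveTimeout` and `SendTimeout` properties of `System.Net.Sockets.Socket` (values in milliseconds). The helper class is hypothetical, and the two-minute value is only the example figure from this post, not a tested recommendation.

```csharp
using System.Net.Sockets;

public static class SocketTimeouts
{
    // Two minutes in milliseconds (the example figure from the post, not a tuned value).
    public const int TwoMinutesMs = 2 * 60 * 1000;

    // Replaces the infinite (-1) defaults with the same finite timeout in both directions.
    public static void Apply(Socket socket, int timeoutMs)
    {
        socket.ReceiveTimeout = timeoutMs;
        socket.SendTimeout = timeoutMs;
    }
}
```

Calling `SocketTimeouts.Apply(clientSocket, SocketTimeouts.TwoMinutesMs);` straight after `Accept()` should make a stalled blocking send or receive fail with an exception instead of waiting forever, so the call needs a try/catch around it.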

sh1 · November 30, 2015, 4:27am

OK - definite progress in terms of finding the issue.

Using the reflected private field within the Socket class to change the socket to a non-blocking one has allowed me to at least keep the thread running after a ‘hang’ in the network layer. This helped and was a lot cleaner than all the thread wrappers and timeouts that have been suggested.

The bigger breakthrough though was in setting up the Raptor + ENC28 to a client laptop via a crossover cable and with as few ‘network’ capable services or applications running. With this extremely simple configuration and a very quiet network the mini ‘server’ has been running for 24 hours without a single issue and I have sped up the request frequency to 4Hz. Nearly 400,000 request / response cycles and still running!

If I go to the other extreme and put the Raptor + ENC28 on a very busy network with many other services running, more traffic etc. then the ‘hangs’ return in a few minutes. I am obviously now trying to dig deeper into the reasons using a packet sniffer. It seems that the hangs occur around the time of a single / multiple TCP retransmission packet(s) but I think this is symptomatic of the problem rather than the cause.

Reading around many other forums and more specifically the Microchip documentation (which is just beyond my level of comfortable understanding), it seems there is an issue with the way the ENC28J60 handles frame collisions, and the fix they suggest is to query the interrupt flags to determine what state everything is in rather than relying on the status registers.

There are plenty of ‘Arduino’ code fixes for what looks like the same problem, but maybe the NetMF drivers have already implemented a fix or don’t have the problem. I am beyond my comfort zone here, but I was hopeful that it might provide some insight into the issue that a more capable person than me could ‘run with’.

Steve

sh1 · November 30, 2015, 9:40am

Just to add further notes to my last post - I have implemented my full webserver, other network protocol server and clients and got it running alongside the main application code, which is doing pulse sampling, ADC and talking to various modules including RS232, SD card and CAN.

The good news is that on a very quiet network (literally the only packets being to / from my client and server via a crossover cable) I can serve a very complex website with over 1.5MB of HTML and scripts under 100% CPU load and it all works OK.

The bad news is that as soon as I introduce the system onto my office network it crashes within a few minutes every time and is not recoverable without a power cycle.

It feels like a little headway but I have no real idea where to go from here.

Reinhard_Ostermeier · November 30, 2015, 1:56pm

@ sh - Have you set a valid MAC Address for the ENC28 Module?
The default one is not a valid one, I think, and some routers or network switches do not like this.

RoSchmi · December 1, 2015, 9:09am

@ sh - Does the issue persist if you only make requests every few seconds and if you send only short responses?
In a busy network the transmission from one endpoint to the other may take longer, so the buffers of the ENC28 might fill up, which could lead to the hang.
If your application works with short responses and some time between the requests, it would be worth trying to slow down the data transmission to the ENC28 by sending data in chunks and calling Thread.Sleep between the chunks, to ensure that the client has taken the data before you put new data into the ENC28 on the server side.
We had a discussion some years ago and I don’t know if this problem was solved in the meantime.
https://www.ghielectronics.com/community/forum/topic?id=9878&page=3#msg98411
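
A minimal sketch of the chunk-and-pause idea from this post. The class name, the chunk size and the pause length are all hypothetical; suitable values for the ENC28 would have to be found by testing, and nothing here is confirmed to cure the hang.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;

public static class ThrottledSender
{
    // Sends data in chunks of at most chunkSize bytes, sleeping pauseMs between chunks
    // so the client has time to drain data before more is queued on the server side.
    // Returns the number of Send calls made.
    public static int Send(Socket client, byte[] data, int chunkSize, int pauseMs)
    {
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException("chunkSize");

        int offset = 0;
        int calls = 0;
        while (offset < data.Length)
        {
            int count = data.Length - offset;
            if (count > chunkSize) count = chunkSize;

            // Send may accept fewer bytes than requested, so advance by its return value.
            offset += client.Send(data, offset, count, SocketFlags.None);
            calls++;

            // No pause is needed after the final chunk.
            if (offset < data.Length) Thread.Sleep(pauseMs);
        }
        return calls;
    }
}
```

For example, `ThrottledSender.Send(clientSocket, responseBytes, 256, 10);` would send 256 bytes at a time with a 10 ms gap.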

sh1 · December 1, 2015, 12:06pm

Hi, the MAC address is 00-21-03-00-00-01, which was already on the ENC28 chip. It checks out to be a GHI MAC address, so I am guessing there is no reason to suspect any issues?

Reinhard_Ostermeier · December 1, 2015, 12:13pm

@ sh - As far as I know the MAC on the ENC28 is not unique, and I guess not even valid.
But you can generate a new unique MAC address with a valid checksum and write it to your board.
See here:
https://www.ghielectronics.com/community/codeshare/entry/822
I use this code in our commercial products as well, with our own serial number (stored on SD card) as the seed for the generator.
But normally we use these devices only in local networks without internet access.

If I remember right also the FEZConfig tool provides this functionality.
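
A minimal sketch of deriving a MAC address from a device serial number. This is not the code from the codeshare entry linked above, which is not reproduced here; it is a different, simpler technique that builds a locally administered unicast address (second-lowest bit of the first byte set, lowest bit clear). The class name and the 32-bit serial format are assumptions.

```csharp
public static class MacFromSerial
{
    // Derives a 6-byte MAC address from a 32-bit device serial number.
    // Addresses stay unique as long as serial numbers are unique on the network.
    public static byte[] Create(uint serial)
    {
        byte[] mac = new byte[6];
        mac[0] = 0x02;                  // locally administered (bit 1 set), unicast (bit 0 clear)
        mac[1] = 0x00;
        mac[2] = (byte)(serial >> 24);  // serial number, most significant byte first
        mac[3] = (byte)(serial >> 16);
        mac[4] = (byte)(serial >> 8);
        mac[5] = (byte)serial;
        return mac;
    }
}
```

`MacFromSerial.Create(0x12345678)` gives 02-00-12-34-56-78. How the address is then written to the board depends on the firmware, so that step is left out.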

sh1 · December 1, 2015, 12:26pm

@ Reinhard Ostermeier - Good advice and (assuming I can ever make the ethernet reliable enough) when I start thinking beyond the prototype stage I will definitely do what you suggest.

For now there is definitely no clash on my network (I have just gone back through many hours of Wireshark logs) and it’s a unique MAC address, at least for my network.

I am increasingly convinced that there is a driver issue with the ENC28 as per my previous posts, all the testing is consistent in this direction. Problem now is how to proceed to fix it.

Gus_Issa · December 1, 2015, 12:43pm

@ sh - The ENC28 is inherently slow due to the SPI bus it uses. I would expect it to lose packets in a busy system because of this limitation, but I would also expect the system overall to continue functioning, as TCP handles missing packets. Things will be slower but it should always work.

Do you have a closed and complete test we can replicate here? Seeing what you are seeing is step one towards finding the issue. And do you happen to own a G400D, EMX or G120E device? Those use the built-in Ethernet interface.