Networking and the state of NetMF 4.4

Having spent several days trying to get a reliable HTTP server running on a Raptor, and trawling every single post about networking issues (wrapper classes, execution constraints etc.), it seems very clear that the current NetMF 4.3 networking stack and/or the ENC28 drivers are in no way capable of supporting a commercial application.

Without asking for solutions: the issues are based around the well-reported Socket.Accept() method blocking indefinitely, a similar but less frequent issue with Socket.Send(), and a general instability in the networking stack. At very low ‘stress’ levels it’s possible to keep everything running for hours (and maybe days) at a time, but trying to serve larger amounts of data and/or serving it more frequently very quickly introduces issues.

I know the request will be for me to show some code snippets so that someone can work out what I am doing wrong, but I am 100% confident that it’s very simple C# socket code and I have tried many ways around the problem (ExecutionConstraints, thread wrappers, ThreadPools, bandwidth throttling, switches, hubs, crossover connections, no firewalls, better cables, different mainboard sockets, different SPI clock speeds etc.) and the problems remain.

You can improve the situation by sending very small packets of data at a time (this cuts down the number of Socket.Send hangs) but the Socket.Accept() issues are random and unpredictable.

My question isn’t how to fix the problem directly but whether the ‘talked about’ improvements in NetMF 4.4 are likely to have any effect on these issues and if yes, on what time scale can we think about (not expect!) a GHI rollout of 4.4 compatible firmware.

I am evaluating the board for a commercial project and honestly it’s a show stopper right now, so any guidance would be very gratefully received. The very helpful community on here is, in many ways, great, but it also very much blurs the picture, as there are so many conflicting answers to different problems and rumours about what someone heard might be happening.

GHI clearly have to take care of their commercial interests and are busy developing new stuff which is great but the bottom line remains (it really does, whatever anyone may think) that issues like this flaky network stack are a genuine, major issue for many.

This is not meant to start a war of words, I am simply after some clarity on the situation as I will need to look to other platforms very soon if there is no light at the end of the tunnel.

Thanks and regards,

Steve

Which board are you using? And what do you expect as far as connection count, speed and any other requirements for your commercial application?

@ Gus - Thanks for your response, I am using a Fez Raptor and an ENC28 module.

A throughput of 500 Kbit/s would be fine for my uses and the stack easily does that now; my issue is not performance per se but reliability. I can request 1KB of data every 10 seconds and the stack will stay running for 24 hours or more (but will still lock up at some point), and if you request 50KB every second it will last a few minutes. Every ‘hang’ leaves a thread stuck at either Socket.Accept() or Socket.Send(), and around 70% of the time the entire network layer goes offline and ping requests etc. are not answered.

I really appreciate your support, but it seems there is a general reluctance to accept that there is an issue with the networking. Despite Microsoft being rumoured to be unhappy with the Lwip integration into NetMF, other providers trying to produce their own stacks because of the reliability issues, and news that it’s all going to be fixed at some point (4.4 ??) we are told that it’s all fine and works well for most people.

I have read literally every post on the forum around networking and there is a common theme (especially around the Socket.Accept() issue), and none of the ‘potential’ workarounds provides a solution.

The problem with something like the TCP/IP stack in an embedded device is that it has to be 100% reliable (or at least recoverable by the application itself). If someone there could give me some guidance about how to fix the problem myself I would; I went 50% of the way to porting the Netduino.IP stack to the Raptor before realizing that it was unfinished.

In direct answer to your question, my ‘specification’ has shifted from any throughput or concurrent connection limits to simply trying to answer the question of whether NetMF can keep a network stack up and running indefinitely.

I specifically did not ask for solutions because many have been offered and I have tried them all but I am keen to know before I switch platforms whether the NetMF 4.4 release could be a silver bullet for my issues and if it was, on what time frame could we think about it.

Thanks very much,

Steve

Many thanks for this post.
Actually, we also have a commercial product based on the G400-S and ENC28 module. Right now the product is almost finished except for the web server. I have postponed this task as long as possible, to get more time to collect all the problematic points regarding network issues, and hoping that some reliable solution/example that can work 24/7 will be introduced.
What I am missing is an example covering all the issues described here (like Socket.Accept() or Socket.Send()).
A lot of us are beginners (including me) with little experience, especially in networking, so we would be really happy to see an example solving these issues.
I am not sure if the GHI network example solves all these network issues (well-known problems for some people, but really not for beginners); I am guessing not, because they are just examples. But we (beginners) would really appreciate a fairly complete example which can be used (of course with some modifications) in a real application. This would even increase interest in NETMF.
There are always some problems in real life, in every project, so logically there must be some here in NETMF networking too. The problems themselves are not the problem if there is at least some well-known workaround.
And here “well known” is the biggest problem. It would be perfect if GHI could bring some “well known” workaround example, not only with the new 4.4 framework but especially for the existing latest 4.3 framework.
These days almost all applications (customers) require reliable network connectivity (web server, FTP etc.) and the very popular GHI products should offer it. I am sure they can, and it is a very big benefit that we have very smart GHI guys with valuable technical support.

@ mhstr - Thanks for your input on this thread and I totally understand and agree with your views. In no way am I upset with GHI, but I am about to invest many thousands (with GHI I hope) to develop a custom board for my application.

Right now (after a long evaluation period and many hundreds of man hours coding, testing and debugging) I cannot say with any degree of certainty that the current issues with the network stack will be resolved.

Can they be resolved? Without doubt yes, but my concern is twofold. Firstly, I don’t think there is an acceptance that any problem exists, and therefore the motivation to fix it is not present. There is a very wide range of users for GHI to deal with, and I understand the problems of trying to work out what is an issue and what is simply bad application code. I feel for them, as I have software in use by end users and deal with the issue every day.

Secondly, if there is an issue (and I am going to say again with confidence that there is), it’s not something that I think we can resolve at the application level. The fundamental problem is in the native NetMF integration with the lwIP stack. Of course it’s open source, and of course that means anyone can fix it, but at that stage the whole advantage of NetMF (rapid and cheap development cycles) is rapidly disappearing. If someone wants to show me some high level fixes to the problem then I can assure you it would make my day, but all the references I have seen to date are sticking plasters around a much bigger issue, and all of them (ExecutionConstraints, wrappers etc.) are basically a way of trying to enforce a timeout on something.

The bottom line is that (at least in all my testing) all those approaches do not work as the underlying stack is ‘hung’. If the solution is to then do a full restart of the networking stack I am heading elsewhere as that is not a solution that stands up in the real world and in no sense can be considered a solution to the problem.

Sparse and intermittent issues need recovery ‘code’ for sure, but we are dealing with something that happens every few minutes on a 6" long network cable into a very high quality hub (about as perfect a network as you could hope for), so there is an issue without doubt. My own testing leads me to believe that there is some kind of buffering issue at the hardware level, as sending very small amounts of data at a time seems to make things better. It’s an open secret that the NetMF network stack needs a big overhaul, and my original question was simply: is it going to get one, and when?

My #1 suggestion is that if you’re embarking on a commercial project at scale, then you have a 1:1 conversation with someone at GHI who can likely walk you through these things in a much better way than chatting on the forum could do. You may never get anything tangible from the chat here, but if you’re going to get it anywhere it’ll be from GHI. Note, I’m in no way associated with GHI - I just think commercial discussions are best had more personally

@ Brett - Thanks for your message, and of course you are right but there is also an element of ‘chicken and egg’.

I was hoping to get to a point where I had ticked all of my ‘proof of concept’ requirements and then lead straight into a commercial discussion with GHI without wasting their time before I was ready to commit funds to the project.

I have touched base with them about that process in general and the comms haven’t been particularly forthcoming. Although they haven’t answered my questions these posts have achieved plenty of responses in a few hours where I have waited days for any response through the other direct channels

I also figured that knowing the situation about any forthcoming releases and what they may or may not include is beneficial to the wider audience. I stress again that I don’t have a one off ‘domain’ problem on my hands; there is a generic issue with the network stack and it must affect more users than just myself.

I will do whatever is best / easiest for GHI, but within a fairly short timeframe I need to make a decision about whether we drop NetMF and move on and I will continue to ask as many questions in as many places as possible to get those answers.

Thanks to those concerned for their views, they do make sense and I will reflect on what I should do next.

Thanks

Steve

While we’ve achieved quite reliable networking with our Mountaineer firmware (many boards working nonstop over months in industrial use), there are still a number of networking issues that we couldn’t fix without breaking something else (e.g. cable detection interfering with other stuff). Microsoft has publicly confirmed that the way lwIP is integrated into NETMF is fundamentally flawed. That’s why they have completely redesigned lwIP integration for NETMF 4.4 and done much testing with the new code. Unfortunately, this also makes switching to 4.4, at least for boards with networking hardware, a bit nontrivial.

I really hope this problem doesn’t become another example of this;

3 years later, and they are still saying “soon!”

(still bitter)

@ mtylerjr - GHI is not a one man operation. Please do not compare apples to oranges. Also a lot of assumptions are here without a single response to the problem from GHI yet. We haven’t verified that the current firmware doesn’t do what the customer needs.

This depends.
I use G120+ENC28 for several commercial projects. All of them use networking as a primary feature, and it works without any problems in a 24/7 environment.
In my case I do not run a web server on the board, and I also do not transfer large amounts of data so far.
But it runs absolutely stable.
There might be other cases (like running a webserver) where networking is less stable.
But saying it’s not usable at all is not correct.

Thanks for your response, and I am very happy to hear that you have had success with the networking stack, but I would stress again that my issues are not related to ‘large amounts of data’ or the specifics of running a webserver. If by chance you are using Socket.Accept() calls in your system and they are not hanging, I would very kindly ask to know more about your implementation.

Your comment about it not being usable makes perfect sense from your position because you have evidence to back it up 100%, but from my viewpoint a network stack that can only do a subset of what it is supposed to do is still not complete and reliable. I don’t want a webserver, I just want to be able to open a listening socket and accept incoming connections on it without it ending up in an unrecoverable state within minutes / hours.

Thanks again for your input (and encouragement!)

@ andre.m - I don’t think anyone claimed that 4.3 was perfect, and nobody is foolish enough to claim 4.4 is either. Improvements in a later version don’t mean that the earlier version was unusable. The 4.3 stack had some well-recognized and often painful problems, but it was certainly usable and definitely usable in production. So sayeth significant numbers of successful deployments.

@ sh - can you provide a minimal code sample illustrating the problem, because I have quite a few apps on various GHI processors doing Listen and Accept that work just dandy.

Ready to declare victory with 4.4 already? I’ll grant you it is all shiny and new, but there’s very few miles on it yet - I’m not such an optimist, and prefer to see it get a few more miles on it before I claim that it is ‘a fact’ that it is better. There is reason to believe it should be, but that’s a long way from being proven.

And this thread is wandering all over opinion-land based on a poorly documented potential bug. The core issue is that the user has a specific problem that has not been described to the standard required for providing a fix : a concise code sample exhibiting the problem.

I tried to jump one step ahead and get a 20,000 foot view of the current state of the framework and on what timeframe a potential solution might be available. I will follow this post with a code sample (once I have built a simple sample that exhibits the problem), but that doesn’t actually answer my questions, which were very clear:

1 - “Whether the ‘talked about’ improvements in NetMF 4.4 are likely to have any effect on these issues”
2 - “If yes, on what time scale can we think about (not expect!) a GHI rollout of 4.4 compatible firmware.”

I am well versed in software development cycles but to throw the problem back at the end users is not right when those practices are not really being used. Show me the bug life cycle and reporting guidelines and I will happily adhere to them.

A lot of forum threads touch on this subject, but having studied all of them carefully I don’t see any clear and accurate conclusions that resolve the issue. Even the question about the 4.4 rollout is mired in confusion. It will “take about 1 - 2 months so that means before the end of the year” BUT it turns out “we haven’t started yet”.

Maybe the whole community is happy with that and it’s mainly a hobbyist tool. I had high hopes of doing something great on an interesting platform at a very commercial level, but I can’t report back to my fellow directors and investors with no confidence in a resolution to what I see as a major flaw.

I will follow with a code sample and look forward to a solution !

Well-described bugs/issues with example code can be repro’d and addressed. General malaise cannot - it’s just not actionable. That’s why I pressed for a specific code sample. Then we (the community and/or ghi) can either help you out with 4.3 or definitively answer whether 4.4 exhibits a different behavior.

I’m not complaining about your post - just trying to keep things grounded in actionable work.

Here is a code sample that will exhibit the specific problem of the Socket.Accept() call ‘hanging’.

Environment

  • Fez Raptor mainboard and ENC28 connected to Socket 3 (but Socket 1 makes no difference)
  • Powered by DP module and 12V to barrel jack (but USB makes no difference)
  • No other SPI devices on the Raptor
  • NetMF version 4.3.1.0
  • Gadgeteer 2.43.1.0
  • CAT5 cable via an unmanaged hub and no other network traffic (but direct crossover to PC makes no difference)

Test Configuration

Instantiate the class and call Start with a port number. The class generates a random string to return to a calling client; the easiest way to test it is to point a browser at the IP address of the Raptor, and you should get a string response (obviously without all the HTTP response headers).

I have quickly automated the test by using the “Easy Auto Refresh” Chrome extension (free trial) to call the socket repeatedly, but any method of automating the request to the socket will work… The size of the response string can be set by the ‘StringLength’ constant in the class.

I get failures within minutes sending a 1 KB response every second. This is purely a test case and is not thread safe so there needs to be a large enough gap between requests to prevent other application code problems but the numbers given here should work fine.
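
As an alternative to a browser extension, any small client that connects, sends a few bytes, and reads until the server closes the connection will drive this test. The following is a minimal desktop .NET sketch of such a client, not part of the original test setup. The class and method names (`ListenerStressClient`, `RequestOnce`, `Run`) and the 5 second timeout are illustrative assumptions; the request bytes are only there because the sample listener returns early when nothing has been received.

```csharp
using System;
using System.IO;
using System.Net.Sockets;
using System.Text;
using System.Threading;

// Hypothetical desktop-side test client (runs on a PC, not on the board).
public static class ListenerStressClient
{
    // Connects once, sends a short request so the listener sees incoming data,
    // then reads until the server closes the connection. Returns bytes received.
    public static int RequestOnce(string host, int port, int timeoutMs)
    {
        using (var client = new TcpClient())
        {
            client.ReceiveTimeout = timeoutMs; // a stalled read throws IOException
            client.SendTimeout = timeoutMs;
            client.Connect(host, port);        // uses the OS default connect timeout

            NetworkStream stream = client.GetStream();
            byte[] request = Encoding.ASCII.GetBytes("GET / HTTP/1.0\r\n\r\n");
            stream.Write(request, 0, request.Length);

            var buffer = new byte[2048];
            int total = 0;
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                total += read;
            }
            return total;
        }
    }

    // Issues 'count' requests, 'intervalMs' apart, and returns the number of failures.
    public static int Run(string host, int port, int count, int intervalMs)
    {
        int failures = 0;
        for (int i = 1; i <= count; i++)
        {
            try
            {
                int received = RequestOnce(host, port, 5000);
                Console.WriteLine(DateTime.Now + " request " + i + ": " + received + " bytes");
            }
            catch (IOException ex)
            {
                failures++;
                Console.WriteLine(DateTime.Now + " request " + i + " failed: " + ex.Message);
            }
            catch (SocketException ex)
            {
                failures++;
                Console.WriteLine(DateTime.Now + " request " + i + " failed: " + ex.Message);
            }
            Thread.Sleep(intervalMs);
        }
        return failures;
    }
}
```

Calling something like `ListenerStressClient.Run("192.168.1.50", 80, 3600, 1000)` from a console application (the address and port are placeholders) would approximate one request per second for an hour, and the timestamp of the first failure shows roughly when the listener stopped responding.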

Expected Output

The socket responses will work for an ‘indeterminate’ amount of time but at some point the ‘server’ will stop responding to requests.

Running with a debugger attached, I then issue a Debugger->BreakAll and look at the ‘Threads’ window, where the ‘StartListen’ thread will be hung at the Socket.Accept() line. If I wrap this call in a Timer, an ExecutionConstraint or another Thread so that I can ‘break’ out of the Accept() call, any further calls on the ListenerSocket will throw exceptions immediately.

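    using System;
    using System.Net;
    using System.Net.Sockets;
    using System.Text;
    using System.Threading;

    public class SampleTCPListener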
    {
        private AutoResetEvent _listenerStarted;
        internal Socket _socket;
        internal Thread _thread;

        private IPAddress _interfaceAddress = IPAddress.Any;
        private int _receiveTimeout = -1;
        private int _sendTimeout = -1;
        private int _listenBacklog = 10;
        private bool _isActive = false;

        const int BufferSize = 1460;
        const int StringLength = 1 * 1024;

        public bool Start(int servicePort)
        {
            try
            {
                // Signalled by the listener thread once it is running.
                _listenerStarted = new AutoResetEvent(false);

                _interfaceAddress = System.Net.IPAddress.GetDefaultLocalAddress();

                _socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
                
                _socket.Bind(new IPEndPoint(_interfaceAddress, servicePort));
                _socket.ReceiveTimeout = _receiveTimeout;
                _socket.SendTimeout = _sendTimeout;
                _socket.Listen(_listenBacklog);

                _isActive = true;
                _thread = new Thread(StartListen);
                _thread.Start();

                _listenerStarted.WaitOne();
            }
            catch (Exception)
            {
                return false;
            }
            return true;
        }

        public bool Stop()
        {
            try
            {
                _isActive = false;
                _socket.Close();
            }
            catch (Exception)
            {
                return false;
            }

            return true;
        }

        private void StartListen()
        {
            //Thread.CurrentThread.Priority = ThreadPriority.AboveNormal;

            _listenerStarted.Set();

            while (_isActive)
            {
                using (Socket clientSocket = _socket.Accept())
                {
                    try
                    {
                        OnSocket(clientSocket);
                    }
                    catch (Exception)
                    {
                        // Rethrow without resetting the stack trace.
                        throw;
                    }
                }
            }

            _socket.Close();
        }

        protected virtual void OnSocket(Socket socket)
        {
            try
            {
                if (socket.Poll(-1, SelectMode.SelectRead))
                {
                    EndPoint remoteEndPoint = new IPEndPoint(0, 0);

                    if (socket.Available == 0)
                        return;

                    byte[] response = Encoding.UTF8.GetBytes(new String(RandomResponse(StringLength)));

                    // Use the encoded length so the byte count stays correct if the character set changes.
                    int numBytesToWrite = response.Length;
                    int offset = 0;

                    do
                    {
                        var sendSize = (numBytesToWrite <= BufferSize) ? numBytesToWrite : BufferSize;

                        var bytesSent = socket.Send(response, offset, sendSize, SocketFlags.None);
                        numBytesToWrite -= bytesSent;
                        offset += bytesSent;
                  
                    } while (numBytesToWrite > 0);
                }
            }
            catch (SocketException ex)
            {
                if (ex.ErrorCode == (int)SocketError.ConnectionReset)
                    return;
            }
            catch (Exception)
            {
                // Rethrow without resetting the stack trace.
                throw;
            }
            finally
            {
                socket.Close();
            }
        }

        private char[] RandomResponse(int length)
        {
            var chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
            
            var stringChars = new char[length];
            var random = new Random();

            for (int i = 0; i < stringChars.Length; i++)
            {
                stringChars[i] = chars[random.Next(chars.Length)];
            }

            return stringChars;
        }
    }
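
For readers unfamiliar with the kind of wrapper mentioned above, one variation on the accept loop is to poll the listening socket for a pending connection before calling Accept(), so that the loop itself never blocks indefinitely. The sketch below uses `Socket.Poll` with `SelectMode.SelectRead`, which reports a pending connection on a listening socket; `AcceptHelper` and `AcceptWithTimeout` are illustrative names. Whether this avoids the hang described in this thread is not established here: if the underlying stack itself is stuck, a poll may well be stuck too.

```csharp
using System.Net.Sockets;

// Hypothetical helper: bounded wait for a connection on a listening socket.
public static class AcceptHelper
{
    // Waits up to timeoutMs for a pending connection.
    // Returns the accepted client socket, or null if none arrived in time.
    public static Socket AcceptWithTimeout(Socket listener, int timeoutMs)
    {
        // Poll takes microseconds. Keep timeoutMs small enough that
        // timeoutMs * 1000 does not overflow an int.
        if (!listener.Poll(timeoutMs * 1000, SelectMode.SelectRead))
        {
            return null;
        }

        // A connection is pending, so Accept is expected to return promptly.
        return listener.Accept();
    }
}
```

In the sample above this would replace the direct `_socket.Accept()` call, with the loop simply continuing when null is returned, which also lets the `_isActive` flag be rechecked at a regular interval.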

My first comments for improvement (this is what I used, although still running on NetMF 4.2):

Handle each accept in its own thread.
Add the following code to the beginning of the accept thread:


DateTime timeout = DateTime.Now.AddSeconds(10);
while (workingSocket.Available == 0)
{
    if (DateTime.Now > timeout)
    {
        workingSocket.Close();
        return;
    }
    Thread.Sleep(25);
}

Set the linger parameter on the socket:


workingSocket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.Linger, 0x8000 | 0x0005 /*-2*/); // timeout: 0x8000, noclose: 0x0002 
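
Put together, the first two suggestions might look like the sketch below. It assumes that "its own thread" means processing each accepted socket on a separate worker thread (the post does not say which reading is intended), and the names `ClientWorker`, `ClientHandler` and `StartFor` are illustrative only. It avoids generics so that it stays close to what NETMF supports, but it has not been verified on NETMF hardware. The linger option is left out because the meaning of the packed value is not explained here.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;

// Hypothetical callback type for the per-client work.
public delegate void ClientHandler(Socket client);

// Carries one accepted socket onto its own worker thread.
public class ClientWorker
{
    private readonly Socket _client;
    private readonly int _waitSeconds;
    private readonly ClientHandler _handler;

    private ClientWorker(Socket client, int waitSeconds, ClientHandler handler)
    {
        _client = client;
        _waitSeconds = waitSeconds;
        _handler = handler;
    }

    // Starts a new thread that owns 'client' and closes it when finished.
    public static void StartFor(Socket client, int waitSeconds, ClientHandler handler)
    {
        var worker = new ClientWorker(client, waitSeconds, handler);
        new Thread(worker.Run).Start();
    }

    private void Run()
    {
        try
        {
            // Drop clients that connect but never send anything.
            if (WaitForData())
            {
                _handler(_client);
            }
        }
        catch (SocketException)
        {
            // Connection-level failure: give up on this client only.
        }
        finally
        {
            _client.Close();
        }
    }

    // Same polling wait as in the snippet above, returning false on timeout.
    private bool WaitForData()
    {
        DateTime deadline = DateTime.Now.AddSeconds(_waitSeconds);
        while (_client.Available == 0)
        {
            if (DateTime.Now > deadline)
            {
                return false;
            }
            Thread.Sleep(25);
        }
        return true;
    }
}
```

The accept loop then reduces to accepting a socket and handing it over with `ClientWorker.StartFor(socket, 10, handler)`, so a slow or silent client cannot hold up the next Accept() call.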

You are not helping here by throwing assumptions. No software is perfect anyway and it was not confirmed that 4.3 will not work for him nor that 4.4 will do.

@ RobvanSchelven - I really appreciate your help and advice in this matter. This was a sample I put together which exhibited the issues, and I kept it as simple as possible. From my notes I have tried handling each ‘client’ socket in its own thread without success, but I am happy to test it again.

Could I quickly clarify 2 points

1 - Are you suggesting ‘processing’ each socket returned by the Accept call in its own thread, or spawning a new thread for each and every Accept call?

2 - Could you clarify the linger options? In the full framework there is a LingerOption class which gets passed to the socket options and hides the low level enums away. Could you advise what 0x8000 and 0x0005 do?
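
For comparison, the full-framework form referred to in point 2 looks roughly like the sketch below. This is desktop .NET only, and `LingerExample` and `SetLinger` are illustrative names. How the packed integer in the NETMF call maps onto these two fields is not established in this thread, so no equivalence is claimed.

```csharp
using System.Net.Sockets;

// Hypothetical helper showing the desktop .NET linger API.
public static class LingerExample
{
    // Enables linger on close with the given timeout in seconds.
    public static void SetLinger(Socket socket, int seconds)
    {
        // LingerOption(enable, seconds): when enabled, Close() waits up to
        // 'seconds' for unsent data to be transmitted before giving up.
        socket.LingerState = new LingerOption(true, seconds);
    }
}
```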

Thanks

Steve