MQTT errors with cellular PPP and Command Mode

skeller · February 17, 2021, 9:10pm

I am using MQTT with Azure IoT over a cellular modem using PPP. Overall it is working pretty well, but occasionally, when I periodically check the signal strength of the cellular connection, I get errors and MQTT dies.

To check the signal strength I do the following:

    networkController.Suspend();
    // send "+++" to the modem to enter command mode
    // (the line must be idle for the guard time before and after it)
    // wait for "OK" from the modem

    // get signal strength (e.g. AT+CSQ) and connection status

    // issue "ATO\r\n" to return to PPP data mode
    // wait for "CONNECT" from the modem

    networkController.Resume();

During that time I stop sending any new traffic over MQTT or the network.
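
A minimal sketch of the break-in sequence described above, written in Python against a simulated modem so it runs without hardware (the real code would be TinyCLR C# talking to a UART). `FakeModem`, `read_until`, and `check_signal` are hypothetical names. `AT+CSQ` is the standard 3GPP signal-quality query; the 1 s guard time around `+++` is the common Hayes default (register S12) and an assumption for any particular modem; the second `+++` after about 10 s follows the retry skeller describes later in the thread.

```python
import time
from typing import Callable, List, Optional, Tuple

GUARD_TIME_S = 1.0      # assumption: common Hayes default (S12 = 50/50 s); check your modem
ESCAPE_RETRY_S = 10.0   # from the thread: send a second '+++' if no 'OK' after ~10 s


class FakeModem:
    """Hypothetical stand-in for the modem UART so the sketch runs without hardware."""

    def __init__(self, escapes_to_ignore: int = 0) -> None:
        self.escapes_to_ignore = escapes_to_ignore  # simulate a modem that misses a '+++'
        self.command_mode = False
        self.pending: List[str] = []   # lines the "modem" will answer with
        self.log: List[bytes] = []     # everything written to the port

    def write(self, data: bytes) -> None:
        self.log.append(data)
        if data == b"+++":
            if self.escapes_to_ignore > 0:
                self.escapes_to_ignore -= 1
            else:
                self.command_mode = True
                self.pending.append("OK")
        elif self.command_mode and data == b"AT+CSQ\r":
            self.pending += ["+CSQ: 18,99", "OK"]
        elif self.command_mode and data == b"ATO\r":
            self.command_mode = False
            self.pending.append("CONNECT")

    def read_line(self, timeout_s: float) -> Optional[str]:
        # A real port would block for up to timeout_s; the fake answers at once or not at all.
        return self.pending.pop(0) if self.pending else None


def read_until(modem: FakeModem, finals: Tuple[str, ...], timeout_s: float) -> Optional[List[str]]:
    """Collect lines until one starts with a final result code; None on timeout."""
    lines: List[str] = []
    while True:
        line = modem.read_line(timeout_s)
        if line is None:
            return None
        lines.append(line)
        if line.startswith(finals):
            return lines


def check_signal(modem: FakeModem, sleep: Callable[[float], None] = time.sleep,
                 max_escapes: int = 2) -> str:
    """Leave PPP data mode, query +CSQ, and return to data mode.

    The caller is expected to suspend networking before and resume it after."""
    for _ in range(max_escapes):
        sleep(GUARD_TIME_S)              # line must be silent before '+++'
        modem.write(b"+++")
        sleep(GUARD_TIME_S)              # ...and after it
        if read_until(modem, ("OK",), ESCAPE_RETRY_S) is not None:
            break
    else:
        raise TimeoutError("modem did not enter command mode")

    modem.write(b"AT+CSQ\r")             # CR is the default command-line terminator
    reply = read_until(modem, ("OK", "ERROR"), 5.0)
    if reply is None or reply[-1] != "OK":
        raise RuntimeError(f"AT+CSQ failed: {reply}")
    csq = next(line for line in reply if line.startswith("+CSQ:"))

    modem.write(b"ATO\r")                # back to PPP data mode
    if read_until(modem, ("CONNECT",), 10.0) is None:
        raise TimeoutError("modem did not return to data mode")
    return csq
```

Passing `sleep` in as a parameter keeps the guard times real on hardware while letting a test skip them.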

However, I still think the MQTT library is attempting to send traffic while the modem is in command mode, causing socket errors.

Before I dig into the MQTT library to come up with some way to suspend its traffic during the signal check, does anyone have other suggestions or ways to solve this?

Dave_McLaughlin · February 18, 2021, 2:19am

This is one of the reasons I would like to see support for multiple serial UARTs over the USB host port, so that we can have a data connection at the same time as being able to check the signal, SMS, etc.

mhardy · February 18, 2021, 2:48am

@skeller - While I have no help for your direct issue with PPP and MQTT, the topic is of deep interest to me, as to:

“How does PPP work”?

We use direct AT modem commands to direct the modem to do this-n-that.
Multitudes of AT commands and connecting logic are what we call a ‘Modem Driver’

Of course, this ‘Modem Driver’ does not provide the ‘ethernet’ type of context that PPP provides.

Belaying the details of a PPP/Ethernet context…

What is the ‘Core Logic’ within the modem that handles all the “What’s, When’s, How’s” logic to allow ‘Always Available Connectivity’?

By activating PPP, does a ‘Modem Internal’ driver handle all facets of maintaining an ‘Always Available Connection’?

Was there a ‘Super Guru’ engineer (at, say, Telit) who was able to take into account all geographies and all the dynamics of placing the modem ‘Anywhere’?

…Not a knock against PPP, I just don’t know what is at its ‘Internal Logical (driver) Heart’.

mhardy · February 18, 2021, 3:01am

After writing, I realize ‘Ethernet’ is not the proper description of what I meant.

The meaning was:

Set up a System.Net.Sockets.Socket as a listener socket.
Bind the Listener socket to the PPP interface.
Request a connection, get a connection and transact data.
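
A minimal sketch of those three steps, using Python's `socket` module as a stand-in for `System.Net.Sockets.Socket` (the TinyCLR API itself is not shown). `127.0.0.1` substitutes for the address assigned to the PPP interface; on the device you would bind to that address instead. Whether a cellular carrier routes inbound connections to the device at all is a separate question this sketch does not address. `serve_once` is a hypothetical name.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one connection and answer a single request."""
    conn, _addr = listener.accept()            # 3. get a connection...
    with conn:
        request = conn.recv(1024)
        conn.sendall(b"ack:" + request)        # ...and transact data


# 1. Set up a listener socket.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.settimeout(5)                         # don't hang forever if no one connects
# 2. Bind it to one interface's address. On the device this would be the IP
#    assigned to the PPP link; loopback is used here so the sketch runs anywhere.
listener.bind(("127.0.0.1", 0))                # port 0: let the OS choose a free port
listener.listen(1)
port = listener.getsockname()[1]

server = threading.Thread(target=serve_once, args=(listener,))
server.start()

# Stand-in for the remote side requesting a connection.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"hello")
    reply = client.recv(1024)

server.join()
listener.close()
print(reply)
```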

skeller · February 18, 2021, 3:25am

@mhardy, PPP is an old protocol that goes back to the dial-up days that allows the modem to be put into a somewhat ‘transparent’ mode so that data bytes can move freely over the connection. It is the “Data link layer”, so it is one layer above the actual physical connection. The TCP/IP protocol still runs on top of the PPP connection and in this case is handled by TinyCLR. The engineering behind it, on the cellular side, is pretty complicated and fortunately, the ‘Super Guru’ engineers standardized it all and have made it work fairly seamlessly. The modem requests a data connection from the cell provider based on the security information on the SIM and the APN settings. So there is always a chance that the modem can’t establish a connection if there is no service or the SIM card is bad, among other things. It is also possible that the connection drops and would need to be re-established, just like if someone unplugged an ethernet cable.
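
The ‘transparent’ part comes from the HDLC-like framing in RFC 1662: frames are delimited by 0x7E, and any 0x7E, 0x7D, or (under the default async control character map) control byte below 0x20 in the payload is sent as 0x7D followed by the byte XOR 0x20. A minimal Python sketch of just that byte stuffing, with the address/control fields, protocol field, and FCS omitted. Note that ‘+++’ is ordinary data and passes through unescaped, so a modem can only tell an intentional escape apart from data by the silent guard time around it.

```python
FLAG = 0x7E      # frame delimiter
ESCAPE = 0x7D    # control escape
DEFAULT_ACCM = 0xFFFFFFFF  # escape every control byte 0x00-0x1F until LCP negotiates otherwise


def stuff(payload: bytes, accm: int = DEFAULT_ACCM) -> bytes:
    """Wrap payload in flags, escaping bytes that must not appear raw on the line."""
    out = bytearray([FLAG])
    for b in payload:
        if b in (FLAG, ESCAPE) or (b < 0x20 and accm & (1 << b)):
            out += bytes([ESCAPE, b ^ 0x20])
        else:
            out.append(b)
    out.append(FLAG)
    return bytes(out)


def unstuff(frame: bytes) -> bytes:
    """Reverse stuff(): strip the flags and undo the escapes."""
    if len(frame) < 2 or frame[0] != FLAG or frame[-1] != FLAG:
        raise ValueError("frame must start and end with 0x7E")
    out = bytearray()
    escaped = False
    for b in frame[1:-1]:
        if escaped:
            out.append(b ^ 0x20)
            escaped = False
        elif b == ESCAPE:
            escaped = True
        else:
            out.append(b)
    return bytes(out)
```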

As @Dave_McLaughlin pointed out, using USB instead of a standard UART interface allows the status of the modem and connection to be polled over a secondary data connection without breaking into the PPP connection.

@Dave_McLaughlin, USB CDC support for modems did make it to “new feature” status in the TinyCLR Backlog, so it is at least on the list: “USB CDC support for Cellular Modems” (Issue #849, ghi-electronics/TinyCLR-Libraries on GitHub).

I think my solution will be to disconnect the MQTT connection, do the signal check, and reconnect with Clean Session = false. That should allow all the original subscriptions to carry over, and since the disconnect is deliberate, a “Last Will and Testament” message isn’t generated. It is not what I wanted, but with time being an issue on this project, I think it is the safest option.
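
For reference, in MQTT 3.1.1 Clean Session is bit 1 of the CONNECT packet's Connect Flags byte, and a graceful disconnect is the 2-byte DISCONNECT packet, on receipt of which the broker must discard the Will message without publishing it. So the DISCONNECT is what keeps the LWT from firing, while Clean Session = 0 is what keeps the subscriptions. The client library builds these bytes itself; this Python sketch (hypothetical `connect_flags` helper) only shows the bit layout. Whether a particular broker, Azure IoT Hub included, honors persistent sessions in full is worth checking in its own documentation.

```python
def connect_flags(clean_session: bool, will: bool = False, will_qos: int = 0,
                  will_retain: bool = False, password: bool = False,
                  username: bool = False) -> int:
    """Build the MQTT 3.1.1 CONNECT 'Connect Flags' byte."""
    if will_qos not in (0, 1, 2):
        raise ValueError("Will QoS must be 0, 1 or 2")
    if not will and (will_qos or will_retain):
        raise ValueError("Will QoS/Retain must be 0 when there is no Will message")
    if password and not username:
        raise ValueError("3.1.1 requires the User Name flag when Password is set")
    return ((int(username) << 7)
            | (int(password) << 6)
            | (int(will_retain) << 5)
            | (will_qos << 3)
            | (int(will) << 2)
            | (int(clean_session) << 1))   # bit 0 is reserved and must be 0


# Fixed header of DISCONNECT: packet type 14 in the high nibble, remaining length 0.
DISCONNECT = bytes([0xE0, 0x00])
```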

mhardy · February 18, 2021, 4:23am

@skeller
Understood on the roots of PPP.
Yes - ‘The engineering behind it, on the cellular side, is pretty complicated’

As to ‘fairly’, as stated - ‘and fortunately, the ‘Super Guru’ engineers standardized it all and have made it work fairly seamlessly.’

Given a good SIM (well inserted), correct APN settings, good-to-fair RSSI, and available service, i.e. all good-to-go, initial and ongoing connectivity has been established for, say, months or years… In other words, only connections dropped for external reasons (beyond our control) ‘need to be re-established, just like if someone unplugged an ethernet cable.’

In your experience, have you found that ‘fairly’ equates to approximately 1 min (max) of ‘No Connectivity’ before the PPP/Modem driver can re-establish a connection…?

I have not used PPP on the Telit chipset, which is at the ‘Belly-of-the-beast’ of our modem.
1min (max) of ‘No Connectivity’ is what our AT based modem driver roughly delivers.

@Dave_McLaughlin
Yep, not having to switch from data mode to command mode and back to data mode, back-n-forth, would be a great feature!

Such dialog is great…!
Hard to find folks that have walked many-a-mile down such a path(s)…

skeller · February 18, 2021, 4:41am

@mhardy, not sure I completely understand your question but I will try to answer.
The initial PPP connection is established within 5 or 10 seconds of the connection request. The errors I was referring to in the original post only happened occasionally. Every 60 seconds I would break into the PPP connection with ‘+++’ to enter command mode so I could check the signal strength (sometimes it would take 5 or 10 seconds and/or an additional ‘+++’ to enter command mode), and then re-init the PPP connection. For the most part, everything just picked up where it left off with no problems. But on occasion, while the modem was in command mode, MQTT would crash with various socket and SSL errors. Even with those errors, when I returned from the signal check the PPP connection was still active; it was just MQTT that died.
For now, I have moved the signal check out to every 5 minutes and I close the MQTT connection. It will increase my data usage slightly since it has to re-establish the TLS socket and the MQTT connection but it is necessary.
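
To put a number on ‘slightly’, the reconnect count is simple arithmetic. The bytes per reconnect (TLS handshake plus MQTT CONNECT/CONNACK and any resubscribes) are not known here and should be measured on the actual link, so in this minimal sketch they are a parameter rather than a figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def reconnects_per_month(interval_s: int, days: int = 30) -> int:
    """How many disconnect/reconnect cycles a fixed check interval causes."""
    return days * SECONDS_PER_DAY // interval_s


def monthly_overhead_bytes(interval_s: int, bytes_per_reconnect: int, days: int = 30) -> int:
    """Extra traffic per month; bytes_per_reconnect must come from a measurement."""
    return reconnects_per_month(interval_s, days) * bytes_per_reconnect
```

`reconnects_per_month(300)` gives 8,640 reconnects in a 30-day month versus 43,200 at a 60-second interval, so moving the check from 1 to 5 minutes cuts the reconnect overhead by a factor of five whatever each reconnect costs.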

I have not yet tested how to recover from a dropped or bad signal. I don’t have an easy way to test that, since disconnecting the antenna could damage the modem. I am connected to an external antenna because I am inside a metal building, and any small antenna that I could shield to simulate a low signal can’t even establish the initial connection. I will have to resolve that soon since it is a real-world situation; I will probably move development to my house for that part.
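
For that weak-signal testing, logging the +CSQ value in dBm makes it easier to correlate drops with signal level. Per 3GPP TS 27.007, `<rssi>` 0 means -113 dBm or less, 1 to 30 step linearly by 2 dB (-111 to -53 dBm), 31 means -51 dBm or greater, and 99 means not known or not detectable. On LTE, what a given module reports in +CSQ varies, so treat it as a rough indicator. A minimal parser (hypothetical helper name):

```python
from typing import Optional


def csq_to_dbm(reply: str) -> Optional[int]:
    """Parse '+CSQ: <rssi>,<ber>' and return RSSI in dBm, or None if unknown (99)."""
    if not reply.startswith("+CSQ:"):
        raise ValueError(f"not a +CSQ reply: {reply!r}")
    rssi = int(reply[len("+CSQ:"):].split(",")[0])   # int() tolerates the leading space
    if rssi == 99:
        return None                                  # not known or not detectable
    if not 0 <= rssi <= 31:
        raise ValueError(f"rssi out of range: {rssi}")
    return -113 + 2 * rssi                           # 0 means <= -113 dBm, 31 means >= -51 dBm
```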

mhardy · February 18, 2021, 5:10am

OK - I understand where you are coming from.

As said, we don’t use PPP, MQTT or Azure IoT.
…Not that any of these paths are incorrect…they just did not exist back-in-the-day.

If I could offer a little advice:

Whatever you are building needs to be developed / debugged / tested …in ‘all’ environments where the final product will be deployed. If that is not done, you end up in an endless loop of (no-idea-what-the-heck) support, etc.

Never let the amount of data consumed limit reliability.
Customers just want stuff that works and will pay if it does…

Dat_Tran · February 18, 2021, 2:25pm

Yes, MQTT is separate; it doesn’t know anything about PPP, Suspend, or Resume…

MQTT has a property “KeepAliveTimeout” (in seconds). Can you please try increasing it and make sure your suspend period is shorter than that value?

If it still fails, as usual, give us a small, simple project and we will see if we can help.
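
MQTT 3.1.1 defines the keep-alive rule precisely: if the client has sent no control packet for Keep Alive seconds it must send PINGREQ, and the broker must drop the connection if it hears nothing from the client for 1.5 x Keep Alive. A minimal sketch of checking a planned command-mode window against both rules, assuming the client pings only when idle as the spec requires (the function names are hypothetical, not part of the TinyCLR MQTT API):

```python
def ping_during_window(last_sent_s: float, keepalive_s: float,
                       window_start_s: float, window_len_s: float) -> bool:
    """True if an idle, spec-following client would owe a PINGREQ while the link is down."""
    if keepalive_s <= 0:
        return False                      # Keep Alive 0 disables the mechanism
    due = last_sent_s + keepalive_s
    while due < window_start_s:
        due += keepalive_s                # earlier pings are sent normally, one per interval
    return due < window_start_s + window_len_s


def broker_would_drop(last_packet_s: float, keepalive_s: float,
                      window_start_s: float, window_len_s: float) -> bool:
    """True if the broker's 1.5 x Keep Alive limit expires before the link is back.

    last_packet_s is the last packet the client sent before the window opened."""
    if keepalive_s <= 0:
        return False
    return window_start_s + window_len_s > last_packet_s + 1.5 * keepalive_s
```

With the numbers from the thread (Keep Alive 300, a publish every 30 s, a 5 to 10 s window) neither check fires, which fits the suspicion that the failures come from inbound traffic such as a PUBACK; these checks cannot predict that.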

skeller · February 18, 2021, 8:37pm

@Dat_Tran, thanks for jumping in. The PPP connection is in command mode for maybe 5 seconds; it all depends on how long the modem takes to respond to the ‘+++’. Sometimes it takes 10+ seconds. (If there is no response to the first ‘+++’ after 10 seconds, I send a second ‘+++’, and that gets its attention.)
My MQTT keep-alive is 300, and I am sending a message about every 30 seconds leading up to the break in the PPP connection and the signal check, so I don’t think it is trying to do a keep-alive ping. I think it might be in the middle of a message acknowledgment when I break the connection. I could wait a few seconds after the last publish before doing a signal check, but it would be a complete guess as to how long to wait, since I don’t know how long the broker will take to respond to the publish. Depending on the cellular network, that could be pretty variable.
I will go with the safe route for now and just close the connection. For future software revs I can try to get more creative.
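
One alternative to guessing a delay is to gate the break-in on there being no unacknowledged QoS 1 publishes. A minimal sketch; it assumes the MQTT client exposes publish and PUBACK events to hook (an assumption about the library), and `InFlightGate` and `wait_until_idle` are hypothetical names. It narrows the risky window but, like any delay, cannot stop broker-initiated traffic such as subscribed messages.

```python
import time
from typing import Callable, Set


class InFlightGate:
    """Hypothetical helper: remembers QoS 1 packet IDs that have not been PUBACKed yet."""

    def __init__(self) -> None:
        self._pending: Set[int] = set()

    def on_publish(self, packet_id: int) -> None:
        self._pending.add(packet_id)          # call when a QoS 1 PUBLISH is sent

    def on_puback(self, packet_id: int) -> None:
        self._pending.discard(packet_id)      # call when the matching PUBACK arrives

    def safe_to_break_in(self) -> bool:
        return not self._pending              # nothing we sent is awaiting an ack


def wait_until_idle(gate: InFlightGate, timeout_s: float, poll_s: float = 0.1,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until no publishes are in flight; False if timeout_s passes first."""
    deadline = clock() + timeout_s
    while not gate.safe_to_break_in():
        if clock() >= deadline:
            return False                      # give up: skip this check or disconnect instead
        sleep(poll_s)
    return True
```

New publishes must still be held back once the gate reports idle, otherwise one could slip in between the check and the ‘+++’.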

mhardy · February 19, 2021, 6:11am

When you say “I am sending a message about every 30 seconds”,
what is the source? …Azure IoT or your device?

Device to Azure IoT should not be a problem, i.e. the modem will increase dBm to connect to the tower.
Azure IoT to device is the tough problem, i.e. the modem sits at default dBm (power)…listening.
Are you running over LTE?
If you are, which version, carrier, etc.?

Every 30 sec is extreme for keeping the connection ‘alive’.
With the understanding that you are in dev mode, all is fine, but when rolling to production, messages every 30 sec may kill you on data plan limits, overages, etc.

A ‘true’ (udp) ping to the device costs very little in data.
A message other than a ‘ping’ (assuming MQTT at its most basic runs over TCP) will cost much more in data.

Just IMO… trying to coalesce all forum members’ empirical discoveries about cell connectivity!
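
For scale, here is the header arithmetic for one MQTT keep-alive exchange over TCP versus a bare UDP datagram, using minimum IPv4, TCP, and UDP header sizes. An MQTT PINGREQ and PINGRESP are 2 bytes each. TLS records, TCP options, PPP framing, and any carrier billing granularity are deliberately left out, so real figures will be higher; the ACK pattern is an assumption.

```python
IPV4_HEADER = 20   # bytes, no IP options
TCP_HEADER = 20    # bytes, no TCP options (timestamps typically add 12 more)
UDP_HEADER = 8     # bytes

MQTT_PINGREQ = 2   # fixed header only: 0xC0 0x00
MQTT_PINGRESP = 2  # fixed header only: 0xD0 0x00


def udp_packet_bytes(payload: int) -> int:
    return IPV4_HEADER + UDP_HEADER + payload


def tcp_segment_bytes(payload: int) -> int:
    return IPV4_HEADER + TCP_HEADER + payload


# One keep-alive round trip: PINGREQ, PINGRESP (assumed to also ACK the PINGREQ),
# and the client's bare ACK of the PINGRESP. No TLS, no PPP framing.
mqtt_ping_exchange = (tcp_segment_bytes(MQTT_PINGREQ)
                      + tcp_segment_bytes(MQTT_PINGRESP)
                      + tcp_segment_bytes(0))

# A one-way UDP datagram with an empty payload, for comparison.
bare_udp = udp_packet_bytes(0)

print(mqtt_ping_exchange, bare_udp)  # 124 28
```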

mhardy · February 19, 2021, 6:24am

BTW -

Cell documentation will say a session remains alive on some timer somewhere for (I seem to remember) 120 min.

NEVER believe such…!
That metric was derived from ‘absolutely perfect anechoic chamber’ testing.
One will never find the same conditions in the ‘real’ world.

mcalsyn · February 19, 2021, 4:58pm

Paraphrasing the original question as “Why does MQTT still send, even when I suspend my calls to it?”, Dat correctly pointed out the keep-alive interval. The reason is that within that keep-alive interval, the client (and maybe the server) are sending ‘pings’ back and forth even if you don’t send any packets or have any subscription traffic. Even if you stop making calls, pings will still happen and subscriptions will still result in inbound traffic.

You can slow the client->server pings with the keep-alive interval, but that only makes an inconvenient ping less probable, and doesn’t stop any inbound messages (unless you unsubscribe).

Multi-channel serial would solve the issue, but in the meantime, you’re just taking a chance that a ping interval or inbound message won’t occur while you have suspended the network. Even if you create a long timeout for keep-alive, how can you be certain that it won’t still occur during that outage window? Closing the MQTT connection is the only way to be certain.
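
This point can be made quantitative with a simple model (an assumption, not a measurement): a periodic ping with period T and random phase lands in a window of length w with probability min(1, w/T), and inbound messages arriving as a Poisson process at rate λ hit the window with probability 1 - e^(-λw). Neither is ever zero, and the chance of at least one collision grows with every check.

```python
import math


def periodic_hit_probability(window_s: float, period_s: float) -> float:
    """Chance a periodic event with random phase falls inside one window."""
    return min(1.0, window_s / period_s)


def poisson_hit_probability(window_s: float, rate_per_s: float) -> float:
    """Chance of at least one Poisson-distributed arrival inside one window."""
    return 1.0 - math.exp(-rate_per_s * window_s)


def at_least_once(p_per_check: float, checks: int) -> float:
    """Chance that at least one of `checks` independent windows is hit."""
    return 1.0 - (1.0 - p_per_check) ** checks
```

For example, a 10 s window against a 300 s ping period gives about 3.3% per check, yet over 100 checks the chance of at least one collision is roughly 97%. A longer keep-alive only shrinks the per-check figure; disconnecting removes it.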

skeller · February 19, 2021, 8:52pm

@mcalsyn, you are 100% correct. Increasing delays and keepalives only reduces the chances of the problem, it doesn’t eliminate it. Short of re-writing the MQTT library to “pause” all communications while there is a break in the PPP connection, disconnecting is the best option.
I would rather wait for the multi-channel USB connection than re-write MQTT libraries.

mhardy · February 21, 2021, 3:38am

I also support @mcalsyn on the MQTT issue!
The same is very true with cell: best to totally disconnect a session and restart.
Use brute force: just do it every so often, regardless of what the code may try to understand about ‘Being Connected’, especially with server-to-device (device always commandable) communications.