cnMaestro protocol & algorithm

Maciej_Jungst · May 30, 2016, 7:52pm

Presumably there is a specific communication mechanism between cnMaestro, running on the CN server, and the devices onboarded to one's account. As long as no packets are lost along the way through the Internet, this keeps all managed devices, such as cnPilot or ePMP, in the on-line state.

In some situations, however, a certain percentage of lost packets is a reality. I suspect this is the case in one of my deployments, where two cnPilot devices, one ePMP1000HotSpot and one E400, share Internet access through a single GSM modem in a small remote village where no other Internet access is available. No WiFi clients are connected, so there is no user traffic.

In cnMaestro, the Monitor > Notifications > Events list shows the ePMP1000HotSpot frequently toggling between on-line and off-line (see the attached exported download...pdfs). Over the same period the E400 shows far fewer off-line events.

Can CN disclose some details: how often does each device type send keep-alives to cnMaestro, how many keep-alives must cnMaestro receive, and at what intervals, to keep a device on-line, and how many keep-alives must be lost, and over how long, before cnMaestro marks the device off-line?

And, in general, what algorithm does cnMaestro use to detect the on-line/off-line state of devices?

Are the keep-alives acknowledged by the receiving side or not? How does it work?

Using the ping tool on both devices, I found that ICMP packet loss over the GSM Internet access can reach 30%-50%. Even so, the E400 seems to keep its on-line state in cnMaestro much more stably than the ePMP1000HotSpot.

If communication from the device to cnMaestro (and vice versa) is acknowledged, it would help diagnostics if the corresponding counters could be read from the device's memory once cnMaestro regains access to the remote device.

download_events_ePMP1000HotSpot.pdf (87.2 KB)
download_events_E400.pdf (91 KB)

ashutoshdatta · May 31, 2016, 5:04am

Hi Maciej,

We have a keepalive mechanism whereby the device sends a message every ~30 seconds and expects a reply from cnMaestro. For the device to conclude that the connection is down, it needs to experience "5 consecutive" losses (missed ping responses).
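A minimal Python sketch of the device-side rule just described, assuming each ~30-second keepalive either gets a reply or counts as exactly one miss, and that any reply resets the miss counter. The class and field names are hypothetical, not Cambium firmware; the `sent`/`replies` counters are the kind of diagnostics requested in this thread.

```python
from dataclasses import dataclass

MAX_CONSECUTIVE_MISSES = 5  # "5 consecutive" missed replies, per the reply above

@dataclass
class KeepaliveMonitor:
    """Hypothetical device-side tracker; not Cambium's implementation."""
    max_misses: int = MAX_CONSECUTIVE_MISSES
    sent: int = 0
    replies: int = 0
    consecutive_misses: int = 0
    connected: bool = True

    def record(self, reply_received: bool) -> bool:
        """Record one keepalive exchange; return whether the link is still considered up."""
        self.sent += 1
        if reply_received:
            self.replies += 1
            self.consecutive_misses = 0  # any reply resets the run of misses
        else:
            self.consecutive_misses += 1
            if self.consecutive_misses >= self.max_misses:
                self.connected = False  # a real device would now tear down and reconnect
        return self.connected

# Four misses followed by a reply keep the link up; five in a row take it down.
mon = KeepaliveMonitor()
for ok in [True, False, False, False, False, True, False, False, False, False, False]:
    mon.record(ok)
print(mon.connected, mon.sent, mon.replies)  # False 11 2
```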

cnMaestro is slightly more aggressive about closing stale connections: it does so if it does not receive even a single keepalive message (or data frame) from a device within ~120 seconds. So it is more likely that cnMaestro initiates cleanup of a stale connection than the device.
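A back-of-the-envelope comparison of the two timers, assuming a total outage that starts right after a successful exchange and a device that counts one miss per ~30-second interval. The function names are made up for illustration; real detection times will vary with when the outage begins and how the reply timeout is implemented.

```python
KEEPALIVE_INTERVAL_S = 30     # device sends a keepalive every ~30 s
DEVICE_MISS_THRESHOLD = 5     # device gives up after 5 consecutive missed replies
SERVER_IDLE_TIMEOUT_S = 120   # cnMaestro closes after ~120 s of silence (figure as stated here)

def device_detection_s(interval_s: int, threshold: int) -> int:
    """Rough time for the device to declare the link down: one miss per interval."""
    return interval_s * threshold

def first_to_detect(interval_s: int, threshold: int, server_timeout_s: int) -> str:
    """Which side would notice a total outage first under this simplified model."""
    if server_timeout_s < device_detection_s(interval_s, threshold):
        return "cnMaestro"
    return "device"  # ties are attributed to the device, arbitrarily

print(device_detection_s(KEEPALIVE_INTERVAL_S, DEVICE_MISS_THRESHOLD))  # 150
print(first_to_detect(KEEPALIVE_INTERVAL_S, DEVICE_MISS_THRESHOLD, SERVER_IDLE_TIMEOUT_S))  # cnMaestro
```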

These timers have been kept aggressive to maintain the system's overall responsiveness. This algorithm is common across all Cambium devices, so the difference in behavior you observed may be due to other external factors. If you can describe your topology, it would help us explain things in more detail.

Thanks

Ashutosh

Maciej_Jungst · August 3, 2016, 9:51pm

The topology is very simple: a cnPilot E400 is connected to the Internet via a GSM modem at an unattended rural site. Some packet loss is quite normal on such a connection, and the cnMaestro protocol should tolerate it.

The protocol should let us check the quality of the connection, with acknowledgements in both directions, sent/received packet counters, error counters, etc. Perhaps its robustness should even be configurable. Without this it is hard to trust the protocol. Several times I have seen a cnPilot fail to reconnect to cnMaestro even though pings from the cnPilot to the cloud (cloud.cambiumnetworks.com) were OK.

I also have remote cnPilot access points behind GSM modems, and when cnMaestro shows them as not connected I cannot be sure why: is it packet loss on the GSM link, or is the cnPilot, for some unknown reason, not reconnecting to cnMaestro?
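To put the 30%-50% ping loss in perspective, the following sketch estimates how likely a run of 5 consecutive missed keepalives is over a period of time. It assumes, crudely, that each ~30-second keepalive exchange is lost independently with probability p borrowed from the ping figures. Real GSM loss is bursty, and a keepalive carried over TCP is retransmitted rather than simply dropped, so the numbers are illustrative only. The function name is hypothetical; the technique is a small dynamic program over the current run length of losses.

```python
def prob_run_of_losses(p: float, k: int, n: int) -> float:
    """Probability that n independent exchanges, each lost with probability p,
    contain at least one run of k consecutive losses."""
    # state[j]: probability of no k-run so far and a trailing run of exactly j losses
    state = [1.0] + [0.0] * (k - 1)
    hit = 0.0  # probability mass that has already seen a k-run
    for _ in range(n):
        new = [0.0] * k
        new[0] = sum(state) * (1 - p)   # a delivered keepalive resets the run
        for j in range(k - 1):
            new[j + 1] = state[j] * p   # a loss extends the run by one
        hit += state[k - 1] * p         # a loss at run length k-1 completes a k-run
        state = new
    return hit

# 120 exchanges is roughly one hour at one keepalive every ~30 s.
for p in (0.3, 0.5):
    print(f"p={p}: {prob_run_of_losses(p, k=5, n=120):.3f}")
```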

Cambium_Rupam · August 9, 2016, 9:54am

Just a small correction to Ashutosh's previous statement:

cnMaestro waits up to ~180 seconds (3 minutes) for at least one keepalive message (or data frame) from a device.

If we don't receive any packet within ~180 seconds, we initiate the close from cnMaestro.
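A minimal sketch of the server-side idle check as corrected here: any keepalive or data frame refreshes a device's last-seen time, and devices silent for more than ~180 seconds are closed. The class name and device IDs are hypothetical, and the injectable clock exists only so the example can be run without waiting. Under the same simplified timing as before, 180 s is longer than the ~150 s (5 × 30 s) device-side figure, so the earlier remark about which side usually acts first may deserve a second look.

```python
import time
from typing import Callable, Dict, List

IDLE_TIMEOUT_S = 180.0  # "~180 seconds (3 minutes)" per the correction above

class IdleConnectionReaper:
    """Hypothetical server-side bookkeeping; not cnMaestro's actual code."""

    def __init__(self, timeout_s: float = IDLE_TIMEOUT_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def on_frame(self, device_id: str) -> None:
        """Any keepalive or data frame counts as proof of life."""
        self.last_seen[device_id] = self.clock()

    def reap(self) -> List[str]:
        """Return (and forget) devices silent for longer than the timeout."""
        now = self.clock()
        stale = [d for d, t in self.last_seen.items() if now - t > self.timeout_s]
        for d in stale:
            del self.last_seen[d]  # the server would close this connection here
        return stale

# Simulated clock: E400 is heard from at t=100 s, the hotspot only at t=0 s.
now = [0.0]
reaper = IdleConnectionReaper(clock=lambda: now[0])
reaper.on_frame("E400")
reaper.on_frame("ePMP1000HotSpot")
now[0] = 100.0
reaper.on_frame("E400")
now[0] = 200.0
stale = reaper.reap()
print(stale)  # ['ePMP1000HotSpot']
```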

Regards,

Rupam

Maciej_Jungst · August 10, 2016, 1:58pm

Since the cloud may be the only way to manage cnPilots, the link from the device to cnMaestro and back should be reliable. Which protocol, UDP or TCP, is used to carry keep-alives between the device and the cloud?

Cambium_Rupam · August 10, 2016, 2:00pm

We use TCP
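Because the transport is TCP, a lost IP packet is retransmitted by the TCP stack rather than silently dropped, so at the application level a keepalive tends to arrive late rather than not at all; the application-level timeouts discussed above then decide when a late reply counts as a miss. The following sketch runs a toy keepalive exchange over a real TCP connection on localhost. The `PING`/`PONG` line framing, the timeout value, and all function names are hypothetical, not cnMaestro's wire format.

```python
import socket
import threading

def recv_line(sock: socket.socket) -> bytes:
    """Read bytes until a newline; TCP is a byte stream, so messages need a delimiter."""
    buf = bytearray()
    while not buf.endswith(b"\n"):
        chunk = sock.recv(1)
        if not chunk:  # peer closed the connection
            break
        buf += chunk
    return bytes(buf)

def fake_cnmaestro(server: socket.socket, count: int) -> None:
    """Stand-in for the cloud side: accept one device and answer each PING with PONG."""
    conn, _ = server.accept()
    with conn:
        for _ in range(count):
            if recv_line(conn) == b"PING\n":
                conn.sendall(b"PONG\n")

def run_keepalives(count: int = 3, reply_timeout_s: float = 2.0) -> int:
    """Device side: send `count` keepalives over one TCP connection; return replies seen."""
    with socket.create_server(("127.0.0.1", 0)) as server:
        port = server.getsockname()[1]
        t = threading.Thread(target=fake_cnmaestro, args=(server, count))
        t.start()
        replies = 0
        with socket.create_connection(("127.0.0.1", port)) as dev:
            dev.settimeout(reply_timeout_s)  # application-level reply timeout
            for _ in range(count):
                dev.sendall(b"PING\n")
                try:
                    if recv_line(dev) == b"PONG\n":
                        replies += 1
                except socket.timeout:
                    # Counted as a miss; a late PONG would then be read as the next
                    # reply, so a real protocol would tag messages with sequence numbers.
                    pass
        t.join()
    return replies

print(run_keepalives())  # 3
```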