How robust are heartbeats to packet loss

Ken_Hohhof · December 19, 2020, 1:15am

Are missed heartbeats retried? We have an AP that will run for several hours and then have a momentary Ethernet drop. Obviously we need to find the root cause and fix it, which is proving difficult. But this should only result in a blip for customers, maybe not even noticeable.

What happens, however, is that a heartbeat request gets missed and the spectrum grant gets withdrawn. This results in a several-minute outage. During the pandemic, customers are stuck in their houses watching Netflix or doing Zoom classes pretty much all day, and they find these outages intolerable.

Given these Ethernet drops are brief, I’m surprised they result in a loss of the spectrum grant. In the logs we even see an occasional error looking up the cnMaestro (cloud) IP address. It seems like all these things should get retried a few times if they fail.

Eric_Ozrelic · December 19, 2020, 2:00am

We just ran into this recently. We had a 450m 3GHz AP that all of a sudden stopped transmitting for 10-15min in the middle of the day. After going through the logs we found:

12/14/2020 : 13:46:09 PST : Transmit time expired hence transitioning to granted state
12/14/2020 : 13:47:09 PST : Processing Transmit expiry timeout
12/14/2020 : 13:47:44 PST : AP transmit disabled, so stopping
12/14/2020 : 13:59:55 PST : CBSD is already registered with SAS
12/14/2020 : 13:59:55 PST : Reusing existing grant with SAS - Frequency = 3635000 kHz and EIRP = 46 dBm.
12/14/2020 : 13:59:55 PST : Enabling session for SM

We submitted this to Evan at Cambium, and after some investigating he found that the 450m was experiencing a high number of CRC and input errors on the Ethernet port, enough to make it miss so many heartbeats that it stopped transmitting. Thankfully, as shown in the logs, it was able to reuse the existing grant with the SAS and begin transmitting again.

The cause of the CRC errors was use of a Cambium LPU. We swapped it out for a Transtector LPU and all our errors went away and the AP has been operating correctly since.
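As a rough check on the outage length, the timestamps in the log excerpt above can be subtracted directly. This is only a minimal sketch: it assumes that "AP transmit disabled" marks the moment transmission stopped and that "Enabling session for SM" marks the moment service resumed. The helper names are hypothetical and not part of any Cambium tool.

```python
from datetime import datetime

# Log lines copied from the excerpt above.
LOG = """\
12/14/2020 : 13:46:09 PST : Transmit time expired hence transitioning to granted state
12/14/2020 : 13:47:09 PST : Processing Transmit expiry timeout
12/14/2020 : 13:47:44 PST : AP transmit disabled, so stopping
12/14/2020 : 13:59:55 PST : CBSD is already registered with SAS
12/14/2020 : 13:59:55 PST : Reusing existing grant with SAS - Frequency = 3635000 kHz and EIRP = 46 dBm.
12/14/2020 : 13:59:55 PST : Enabling session for SM
"""


def parse_line(line):
    """Split 'date : time TZ : message' into (datetime, message)."""
    date_part, time_part, message = line.split(" : ", 2)
    clock = time_part.split()[0]  # drop the time zone label
    stamp = datetime.strptime(f"{date_part} {clock}", "%m/%d/%Y %H:%M:%S")
    return stamp, message


def first_time(entries, needle):
    """Return the timestamp of the first entry whose message contains needle."""
    return next(stamp for stamp, message in entries if needle in message)


entries = [parse_line(line) for line in LOG.splitlines() if line.strip()]

# Assumption: these two messages bracket the period with no transmission.
stopped = first_time(entries, "AP transmit disabled")
resumed = first_time(entries, "Enabling session for SM")
off_air = resumed - stopped

print(off_air)  # 0:12:11
```

That works out to a little over 12 minutes, which matches the 10-15 minutes we saw.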

CambiumMatt · December 21, 2020, 3:13pm

Eric - The Cambium LPU failed?? I would like to better understand what you’re saying here… which model Transtector did you replace the recommended Cambium LPU with?

Nicholas_Eastman · December 21, 2020, 5:30pm

Depending on the type and timing of the drops, the radios will automatically drop their grant and stop transmitting. If a radio is in the middle of communicating with the SAS and the link drops, the radio sees that the connection has been terminated and drops the grant. We have seen similar issues with our radios when our proxy restarted on us. The proxy was only down for 30-50 seconds, but all of the radios checking their grant at that time (around 50+ APs) went offline until the next heartbeat, which was sometimes 5 minutes later.

Matt/Cambium, please correct me if I’m wrong, but what I understand from a discussion with their engineers is that the CBRS spec requires it: if the session is terminated during a heartbeat, the radio must act as if its grant was withdrawn until it can verify otherwise.

Edited: clarifying which spec. is to blame.

LuciaCambium · December 21, 2020, 10:44pm

According to the WInnForum protocol, every time a device sends a heartbeat request the SAS sends a response which includes the “heartbeat interval” and the “grant expiry time”. The heartbeat interval indicates when the device is expected to send the next heartbeat request; the grant expiry time indicates when the grant expires if it is not renewed. These times may not be exactly the same for all SASs, but they are typically a few minutes. Devices are expected to send a new heartbeat request before the heartbeat interval ends, so that the grant can be extended for a few more minutes, and so on. Cambium devices start sending the next heartbeat request well in advance, to allow for up to three retries in case the message is lost and the SAS does not send a response. As long as the response arrives before the grant expires, there is no outage. But if the response does not arrive, the device has 60 seconds to go off the air (i.e. there is an outage). The device, however, stores all details of the grant, and when the connection is restored it will reuse the same grant.

The heartbeat requests for all devices in a sector are sent at the same time. If connectivity issues occur right before these requests are sent, and they are not resolved, the sector can be off the air within minutes. If the sector has just sent the heartbeat requests, there is a margin of several more minutes before the sector is off the air (if the connectivity is not restored).
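The timing described above can be illustrated with a small model. This is a simplified sketch, not the actual device logic: the only values taken from the description are the limit of three retries and the 60-second shutdown window. The time of the first attempt, the gap between retries, and the expiry time are illustrative assumptions, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatTiming:
    # Seconds are measured from the last successful heartbeat response.
    first_attempt_s: float = 120.0   # assumed: when the next request is first sent
    retry_gap_s: float = 20.0        # assumed: spacing between retries
    max_retries: int = 3             # from the description: up to three retries
    expiry_s: float = 240.0          # assumed: when the grant expires if not renewed
    shutdown_window_s: float = 60.0  # from the description: time allowed to go off the air


def attempt_times(timing):
    """Times of the initial heartbeat request and each retry."""
    return [
        timing.first_attempt_s + i * timing.retry_gap_s
        for i in range(timing.max_retries + 1)
    ]


def grant_survives(timing, drop_start_s, drop_end_s):
    """True if at least one attempt falls outside the connectivity drop
    and before the grant expires. Round-trip time is ignored."""
    for when in attempt_times(timing):
        if when >= timing.expiry_s:
            break
        link_down = drop_start_s <= when < drop_end_s
        if not link_down:
            return True
    return False


timing = HeartbeatTiming()

# A 10-second drop that swallows the first request: a retry gets through.
print(grant_survives(timing, 115, 125))  # True

# A drop that starts just before the requests and is not resolved in time.
print(grant_survives(timing, 110, 200))  # False

# A drop that starts just after the heartbeat has already succeeded.
print(grant_survives(timing, 190, 230))  # True
```

In this model a drop only causes an outage if it covers the initial request and every retry, which is why the position of the drop relative to the heartbeat schedule matters as much as its length.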

Eric_Ozrelic · December 22, 2020, 1:51am

The model we pulled, which was causing the CRC and input errors, was the C000000L033A Gigabit Ethernet Surge Suppressor. We replaced it with a Transtector Systems ALPU-F140.

I’m wondering whether the Cambium LPU has issues when it doesn’t have full PoE running over it along with sync injection via the PacketFlux, as this is a 450m 3GHz?