SAS Outage Reported - Monday, April 6

Many of our customers registered with the Federated Wireless SAS experienced an outage on Monday that affected service for about 30 minutes.

I wanted to post here to explain what was found to be the cause, and the corrective actions that have been taken to prevent the issue from recurring.

In summary, a surge of heartbeat messages from Cambium devices saturated the Federated Wireless SAS's capacity to process requests, so heartbeats failed to complete for some customers. Affected devices either stopped transmitting or, in some cases, had intermittent connections. The outage lasted approximately 30 minutes, and nearly all radios self-recovered, except for those that customers manually moved back to Part 90 operations.

Federated has increased its capacity to handle traffic spikes like this by 30x. This will keep a traffic surge like this one from being a problem in the future. Further steps are being taken to add resilience to both Federated's and Cambium's processes, to reduce potential issues and to build confidence in these critical systems beyond this single failure mode.

We apologize for the service interruption, but are glad this revealed itself early on, as we can now keep similar causes from becoming an issue as we more fully exercise these systems.

This isn't a complete RFO. What caused the large number of Cambium devices to overload the Federated SAS? I understand Federated's issue, but what triggered it in the first place?

Numerous widespread internet outages were reported just before the period when the SAS was affected. These outages are the suspected root cause.

Significant and widespread outages caused CBRS radio messages to fail. By design, a radio whose messages fail transmits the heartbeat message more often in order to re-establish its connection with the SAS. When service was restored after the outages, many devices were queued up and sending heartbeats all at the same time (and more often than during normal operation). This caused the traffic spike that was seen on the SAS servers.

This is the working theory, and is the most likely root cause for this issue.
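
To give a rough feel for how this kind of surge adds up, here is a small back-of-the-envelope sketch. Every number in it (device count, heartbeat intervals, fraction of radios retrying) is a made-up assumption for illustration only; none of them are actual figures from Cambium or Federated.

```python
def heartbeat_rate(num_devices, normal_interval_s, retry_interval_s, fraction_retrying):
    """Estimate aggregate heartbeats per second arriving at a SAS.

    All parameters are hypothetical illustration values, not measured data.
    fraction_retrying is the share of radios that have fallen back to the
    faster retry cadence after their earlier messages failed.
    """
    retrying = num_devices * fraction_retrying
    steady = num_devices - retrying
    # Each group contributes (devices / interval) requests per second.
    return steady / normal_interval_s + retrying / retry_interval_s


# Assumed values: 10,000 radios, 60 s normal interval, 5 s retry interval.
baseline = heartbeat_rate(10_000, 60, 5, fraction_retrying=0.0)
surge = heartbeat_rate(10_000, 60, 5, fraction_retrying=0.5)

print(f"baseline: {baseline:.1f} heartbeats/s")
print(f"surge:    {surge:.1f} heartbeats/s")
print(f"ratio:    {surge / baseline:.1f}x")
```

With these assumed numbers, having half of the radios in the faster retry mode multiplies the request rate several times over, which is the general shape of the spike described above. The real ratio depends on the actual intervals and how many radios were affected.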
