Many of our customers registered with the Federated Wireless SAS experienced an outage on Monday that affected service for about 30 minutes.
I am posting here to explain what was found to be the cause and the corrective actions that have been taken to prevent the issue from recurring.
In summary, a surge of heartbeat messages from Cambium devices effectively consumed all of Federated Wireless's capacity to process requests, so heartbeats for some customers could not be completed. Affected devices either stopped transmitting or, in some cases, had intermittent connections. The outage lasted approximately 30 minutes, and nearly all radios recovered on their own, except those that customers had manually moved back to Part 90 operation.
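For anyone wondering why missed heartbeats take a radio off the air, here is a rough sketch in Python. It is a simplified model and not Cambium or Federated code: the `Radio` class, the `simulate` function, the 60-second heartbeat interval, and the 240-second authorization window are all assumptions chosen for illustration. The idea is that each successful heartbeat extends the radio's permission to transmit for a limited time, so a radio goes quiet once heartbeats have failed for longer than that window and resumes once they succeed again.

```python
from dataclasses import dataclass


@dataclass
class Radio:
    # Seconds of transmit authorization granted by one successful
    # heartbeat (illustrative value, not taken from the post).
    expiry_window: int
    # Time (in seconds) at which the current authorization lapses.
    authorized_until: int = 0

    def heartbeat(self, now: int, sas_responded: bool) -> None:
        # Only a completed heartbeat extends the authorization.
        if sas_responded:
            self.authorized_until = now + self.expiry_window

    def transmitting(self, now: int) -> bool:
        return now < self.authorized_until


def simulate(outage_start, outage_end, interval=60,
             expiry_window=240, duration=3600):
    """Return (time, transmitting) pairs for one hypothetical radio.

    The SAS is modeled as unresponsive for outage_start <= t < outage_end.
    """
    radio = Radio(expiry_window)
    timeline = []
    for now in range(0, duration, interval):
        sas_responded = not (outage_start <= now < outage_end)
        radio.heartbeat(now, sas_responded)
        timeline.append((now, radio.transmitting(now)))
    return timeline


if __name__ == "__main__":
    # A 30-minute window in which heartbeats cannot be completed.
    for t, on_air in simulate(outage_start=600, outage_end=2400):
        print(f"t={t:4d}s transmitting={on_air}")
```

In this model the radio keeps transmitting for a short while after heartbeats start failing, goes quiet once the last authorization lapses, and comes back by itself on the first heartbeat that completes, which matches the self-recovery most radios showed.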
Federated has increased its capacity to handle traffic spikes like this by 30x, which should prevent a similar surge from causing problems in the future. Further steps are being taken to add resilience to both Federated's and Cambium's processes, to reduce potential issues and to build confidence in these critical systems beyond this single failure mode.
We apologize for the service interruption, but we are glad the issue revealed itself early, as we can now eliminate similar causes before they become a problem as we exercise these systems more fully.