Cloudflare DNS outage - Loss of communication to sas.cbrs.cambiumnetworks.com

So we use Google as our SAS provider through Cambium; when Cloudflare had an outage today, all of our CBRS APs lost communication with Cambium's SAS URL, resulting in an outage of 15 minutes or so.

Is there something that can be done about this weak link?  If we were talking to Google's SAS directly, we wouldn't have had an outage.

Surely others noticed this?


We noticed a 28-minute outage today, 2:15 - 2:43 PM PDT. We did not initially recognize it as an outage.

It wasn't a DDoS. Cloudflare says:

This afternoon we saw an outage across some parts of our network. It was not as a result of an attack. It appears a router on our global backbone announced bad routes and caused some portions of the network to not be available. We believe we have addressed the root cause and are monitoring systems for stability now.

So a misbehaving Cloudflare router took down the Cambium SAS URL.


The problem is that cambiumnetworks.com uses Cloudflare for DNS only, and the NS records we get back pointing the SAS URL's zone at AWS DNS are only valid for 5 minutes. AWS DNS, in turn, says the SAS URL's record is valid for 900 seconds.

We really need some more redundancy here to maintain proper URL resolution for more than 900 seconds. Should Cloudflare DNS or AWS DNS go offline for longer than those TTLs, nobody will be able to resolve the URL... then shortly after that, all the CBRS-enabled APs stop functioning.
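
If you want to check those TTLs yourself, something like this works (a minimal sketch using the dnspython library; the hostnames and the roughly 5-minute / 900-second values are just what I described above, not re-verified here):

```python
# Minimal sketch, assuming dnspython (pip install dnspython) and the names above.
import dns.resolver

PARENT_ZONE = "cbrs.cambiumnetworks.com"   # delegated from Cloudflare to AWS DNS
SAS_HOST = "sas.cbrs.cambiumnetworks.com"  # the SAS URL the APs resolve

# TTL on the delegation (NS) records -- reported above as about 5 minutes.
ns_answer = dns.resolver.resolve(PARENT_ZONE, "NS")
print(f"{PARENT_ZONE} NS TTL: {ns_answer.rrset.ttl}s")
for rr in ns_answer:
    print("  nameserver:", rr.target)

# TTL on the SAS host's address record -- reported above as 900 seconds.
a_answer = dns.resolver.resolve(SAS_HOST, "A")
print(f"{SAS_HOST} A TTL: {a_answer.rrset.ttl}s")
```

Note that a recursive resolver may report the remaining cached TTL rather than the full value, so the numbers shrink between refreshes.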

The idea of cnMaestro talking directly to the Google SAS is a good one. How about it, Cambium? A few fewer single points of failure between our cnMaestro and the SAS would be good, right?


These comments are good ones... we are actively looking into what happened on Friday, which is obviously uncovering some opportunities to improve the resiliency of these cloud applications, and into the necessity for redundancy.  We will respond here once we determine the right actions to take...

Thanks Matt.

Here's an idea: instead of making the URL sas.cbrs.cambiumnetworks.com, how about (for example) "cbrs.cambiumnetworks-sas.com" or something similar, with the new domain "cambiumnetworks-sas.com" having its whois NS records point only at the "incapdns.net" DNS servers.  This way we're not requiring the Cloudflare nameservers to work for cambiumnetworks.com, nor the Amazon Route 53 nameservers to work for cbrs.cambiumnetworks.com, which in turn point to "incapdns.net" for sas.cbrs.cambiumnetworks.com.

Of course we'd still need incapdns.net to be responding to DNS queries 100% of the time :), so we'd still love the ability to communicate with the Google SAS via cnMaestro directly.  We peer with Google publicly and privately in three markets.
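
To make the dependency chain concrete, here's a rough sketch using the dnspython library (assuming a working recursive resolver) that walks up from the SAS hostname and prints each zone cut and the nameservers serving it, i.e. every DNS provider that has to be answering for the name to resolve:

```python
# Rough sketch, assuming dnspython; prints each zone in the delegation chain
# for the SAS hostname together with the nameservers currently serving it.
import dns.name
import dns.resolver

name = dns.name.from_text("sas.cbrs.cambiumnetworks.com")

while name != dns.name.root:
    zone = dns.resolver.zone_for_name(name)   # nearest enclosing zone cut
    nameservers = [str(rr.target) for rr in dns.resolver.resolve(zone, "NS")]
    print(f"{zone}  ->  {nameservers}")
    if zone == dns.name.root:
        break
    name = zone.parent()                      # step up toward the root
```

With the alternate-domain idea above, that list would collapse to the TLD plus a single provider instead of stacking Cloudflare, Route 53, and incapdns.net on top of each other.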

FWIW, our Google SAS contact said that Google did not have any outage Friday and was not directly impacted by the Cloudflare outage.  They did see a small dip in incoming "CBSD-to-SAS" request volume during the Cloudflare outage window.  They're assuming this was because of some devices that relied on Cloudflare being operational to reach their SAS.


This was sent to our CBRS customers today, but I also wanted to post here for those that may be reviewing or deciding whether Cambium is the right partner for CBRS...



Cambium Networks’ Response to the CBRS Services Outage on July 17, 2020

On Friday, July 17, there was a nationwide Internet outage that impacted Cambium’s CBRS Services as well as many other services and websites on the Internet.  The outage lasted 27 minutes and impacted our CBRS SAS Services customers, as well as the broader community of Cambium customers.  We want to explain the root cause of the outage and provide transparency on what we are doing to protect against this and other types of issues in the future.

The Problem

On July 17, Cloudflare operations pushed a BGP router configuration that resulted in routing a significant portion of traffic through their Atlanta data center.  The ensuing traffic load took down the infrastructure in that data center.  The results were:

  1. Websites protected by the Cloudflare infrastructure, including those operated by Cambium Networks, were taken offline.
  2. The Cloudflare DNS service was interrupted, and websites such as cnMaestro may have been impacted because DNS was inaccessible. Existing connections to cnMaestro were unaffected.

Cloudflare advertises their DNS as “Always Available”, allowing “DNS resolution at the network edge in each of our data centers across 200+ cities, resulting in unparalleled redundancy and 100% uptime.”  Unfortunately, their operational procedures negated that advertised redundancy and resulted in a rare but substantial outage.  For additional information, Cloudflare has published a detailed review of the event on their blog.

The Impact on Cambium Networks

CBRS.  The most dramatic impact was to our CBRS domain proxy customers.  Some clients experienced outages due to loss of DNS service.  Connectivity between Cambium’s domain proxy service and SAS providers was also impacted, so even though most clients stayed connected to the proxy, this meant a loss of service.

Cambium Website.  The Cambium website uses Cloudflare as a WAF (Web Application Firewall) and was inaccessible during the outage.

cnMaestro.  Connectivity between managed devices and cnMaestro was minimally impacted:  new devices attempting to onboard may have failed during the outage due to DNS, but existing devices and Web UI sessions remained connected.  Onboarded devices did not lose configuration or operational status.

What Can Be Done

Cloudflare is a substantial public company with a world-class operations team.  Their track record is remarkable.  However, 100% uptime is a very high bar to achieve.  Cambium is investigating the following options to mitigate single-vendor risk:

DNS Resolution Issue.  Name resolution was unavailable both for our customers and for us when reaching upstream providers.  There are a number of possible approaches, involving both Cambium and our providers:

  1. Use multiple vendors for our authoritative servers and keep them all in sync.  This is the common approach, but adds complexity – especially with delegated DNS zones.
  2. Work with upstream vendors to ensure name resolution redundancy – longer TTLs, multiple authoritative servers, or alternative names.
  3. Use an active/passive approach where DNS nameservers are health checked and switched at signs of trouble.
  4. For CBRS: use a weighted list of multiple endpoints (in different DNS domains) for clients to connect with (see the sketch after this list).
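
As an illustration of option 4 only, here is a rough sketch of what client-side fallback across endpoints in independent DNS domains might look like (the second hostname and the weights are hypothetical, and real CBSD / domain proxy client logic would live in device firmware rather than Python):

```python
# Rough sketch of option 4: try SAS proxy endpoints published in independent
# DNS domains, in a weighted random order, so one DNS provider's outage is
# survivable. The second hostname and the weights are hypothetical.
import random
import socket

ENDPOINTS = [
    # (hostname, port, weight)
    ("sas.cbrs.cambiumnetworks.com", 443, 0.7),
    ("sas.cambiumnetworks-sas.example", 443, 0.3),  # hypothetical alternate domain
]

def pick_endpoint(timeout=5.0):
    """Return the first endpoint that both resolves and accepts a TCP connection."""
    # Weighted random ordering: larger key means the endpoint is tried earlier.
    ordered = sorted(ENDPOINTS,
                     key=lambda e: random.random() ** (1.0 / e[2]),
                     reverse=True)
    for host, port, _weight in ordered:
        try:
            # create_connection resolves the name (DNS) and proves reachability.
            with socket.create_connection((host, port), timeout=timeout):
                return host, port
        except OSError:
            continue  # DNS failure or unreachable endpoint -- try the next one
    raise RuntimeError("no SAS endpoint reachable")
```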

These and other alternatives are currently being investigated.  DNS redundancy would be particularly advantageous for CBRS.

WAF Firewalling Issue.  We are reviewing what design changes would be required to use two separate vendors and send traffic through both of them.  Because of the cost and complexity, this option has to be carefully designed and reviewed against other alternatives.

Summary

The new CBRS infrastructure promises an efficient method to dynamically share spectrum in a way that benefits a wide customer base.  It has introduced a new inline element that is outside of the direct control of network operators.  Interruptions in this infrastructure add a new element of risk that is balanced against the potential benefits.  Cambium’s mission is to keep the risks to the absolute minimum while delivering reliability, performance, and operational simplicity for customers. 

This outage has motivated us to look even more deeply into what we can do to make the Cambium CBRS Services robust and resilient.  We already have efforts underway to regionalize our server infrastructure for redundancy, we have just implemented a no-outage server upgrade procedure for cnMaestro Cloud 2.4.1, and other improvements are in progress.  We meet weekly with our SAS partners and have joint engineering initiatives with each of them.  We constantly exchange technical documentation and product roadmaps to stay in alignment.  In the three short months since the CBRS launch, we have implemented many engineering and operational improvements.

Please share your feedback, suggestions, and concerns with us.

Cambium CBRS Program Team
