Managing the Facebook outage with DNSdist

Oct 8, 2021


Earlier this week, the world was thrown into turmoil by a six-hour outage of Facebook and its associated platforms, WhatsApp and Instagram.

While it seems that the downtime was caused by a Facebook DNS failure, the outage also left DNS resolver operators around the world struggling to cope, resulting in a wide range of issues for both consumers and businesses.

The reason for this collateral damage: Facebook’s authoritative DNS servers were unavailable during the outage, and the situation was exacerbated by the fact that Facebook, like many other large domains, sets a time-to-live (TTL) of just 300 seconds on its DNS records to optimize routing. Answers already held in resolver caches therefore expired within five minutes of the outage starting, and could not be refreshed.

As a result, the caching resolvers at telcos and public resolvers were returning “SERVFAIL” responses to Facebook, Instagram, and WhatsApp queries. Clients, such as browsers and apps (including Facebook’s own mobile apps), were seeing the SERVFAIL response and immediately retrying the lookup for facebook.com and associated domains. This caused up to a fivefold increase in the number of DNS queries seen by DNS resolver operators around the world, as users and their apps continuously retried the DNS queries for Facebook.
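To get a feel for how retries turn into load, here is a rough back-of-the-envelope model in Python. The function name, the 10% traffic share, and the retry count are all illustrative assumptions, not measurements from the outage; the only figure taken from the text above is the roughly fivefold increase.

```python
def load_multiplier(affected_share: float, retries_per_lookup: int) -> float:
    """Query volume during the outage, relative to normal (1.0 = unchanged).

    affected_share: hypothetical fraction of normal queries that are for
        the failing domains.
    retries_per_lookup: hypothetical number of extra lookups a client
        sends after each SERVFAIL.
    """
    unaffected = 1.0 - affected_share
    affected = affected_share * (1 + retries_per_lookup)
    return unaffected + affected


# Hypothetical example: 10% of queries are for the failing domains and
# each failed lookup is retried 40 times.
print(load_multiplier(affected_share=0.10, retries_per_lookup=40))
```

With these made-up inputs the multiplier comes out at roughly five, which illustrates how a modest share of traffic can dominate total load once clients retry aggressively.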

For organizations running PowerDNS software, this did not lead to a total meltdown of DNS service, thanks to OX PowerDNS Recursor and DNSdist.

The OX PowerDNS Recursor combines outgoing queries and throttles them when an authoritative server (in this case Facebook’s) is down, which limits the overload of queries. DNSdist, our smart DNS cache, load balancer and DDoS protection proxy, caches SERVFAIL responses and is able to serve them extremely efficiently out of its packet cache, with only minimal CPU impact. This meant that the additional query load did not translate into an equivalent increase in CPU load.
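The caching half of this can be illustrated with a small Python sketch. It is a toy model and not DNSdist’s actual implementation: the class name, the `backend` callable, and the 60-second failure TTL are assumptions chosen for illustration. The idea it shows is simply that a SERVFAIL is stored for a short time, so repeated queries for the same name are answered from memory instead of being passed on.

```python
import time
from typing import Callable, Dict, Tuple

SERVFAIL = "SERVFAIL"


class ResponseCache:
    """Toy response cache keyed on (name, type). Hypothetical, for illustration."""

    def __init__(
        self,
        backend: Callable[[str, str], Tuple[str, float]],
        servfail_ttl: float = 60.0,  # assumed value, not taken from the article
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.backend = backend  # returns (rcode, ttl_seconds)
        self.servfail_ttl = servfail_ttl
        self.clock = clock
        self.entries: Dict[Tuple[str, str], Tuple[str, float]] = {}
        self.backend_calls = 0

    def resolve(self, qname: str, qtype: str = "A") -> str:
        key = (qname.lower(), qtype)
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[1] > now:
            return hit[0]  # served from cache, backend untouched
        self.backend_calls += 1
        rcode, ttl = self.backend(qname, qtype)
        if rcode == SERVFAIL:
            ttl = self.servfail_ttl  # keep failures only briefly
        self.entries[key] = (rcode, now + ttl)
        return rcode


def always_failing(qname: str, qtype: str) -> Tuple[str, float]:
    """Stand-in for a resolver whose upstream is unreachable."""
    return (SERVFAIL, 0.0)


cache = ResponseCache(always_failing)
for _ in range(1000):
    cache.resolve("facebook.com")
print(cache.backend_calls)
```

In this sketch, a thousand identical queries arriving within the failure TTL reach the backend only once. The short TTL is the design trade-off: long enough to absorb a retry storm, short enough that service resumes quickly once the authoritative servers come back.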

As a result, depending on the amount of headroom customers had, PowerDNS users either saw no impact on their DNS service, or minor additional latency, but none experienced meltdowns or service outages.

We know this not only because of the way DNSdist has been engineered, but also because customers have been contacting us to thank us (and the entire community around DNSdist) for allowing them to manage the Facebook outage with minimal impact, if any. As one customer explains:

“During the recent Facebook outage, we saw traffic from Facebook clients cause more than double the normal query volume in some locations. The responses our systems sent back to this influx of queries were all SERVFAIL, due to the unreachability of Facebook authoritative systems. DNSdist was able to respond to SERVFAIL requests exceptionally quickly, which allowed us to remain operational and without serious side effects during the wave of queries. We’ve been very pleased with the engineering choices made in DNSdist that allow it to respond with low latency in a variety of challenging circumstances, but still remain flexible enough for us to apply our policy and custom responses that allow us to deliver our services at a global scale.”

About the author

Bob Brandt

VP PowerDNS Engineering
