Earlier this week, the world was thrown into turmoil by a six-hour outage of Facebook and its associated platforms, WhatsApp and Instagram.
While it seems that the downtime was caused by a Facebook DNS failure, the outage also left DNS resolver operators around the world struggling to cope, resulting in a wide range of issues for both consumers and businesses.
The reason for this collateral damage: Facebook’s authoritative DNS servers were unreachable during the outage. The situation was exacerbated by the fact that Facebook, like many other large domains, sets a time-to-live (TTL) of just 300 seconds on its DNS records to optimize routing, so cached records expired within five minutes and resolvers had nothing left to answer from.
As a result, caching resolvers at telcos and public resolver operators were returning “SERVFAIL” responses to Facebook, Instagram, and WhatsApp queries. Clients such as browsers and apps, including Facebook’s own mobile apps, saw the SERVFAIL response and immediately retried the lookup for facebook.com and associated domains. This caused up to a fivefold increase in the number of DNS queries seen by DNS resolver operators around the world, as users and their apps continuously retried their queries for Facebook.
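To see how retries alone can multiply load, here is a rough back-of-the-envelope model in Python. The function name `load_multiplier` and the input numbers are purely illustrative assumptions, not measurements from the outage: it only shows how a modest share of failing lookups, each attempted many times, adds up to a multiple of normal volume.

```python
def load_multiplier(affected_share: float, attempts_per_lookup: int) -> float:
    """Return total query volume relative to normal.

    affected_share: hypothetical fraction of normal queries that target
        the unreachable domains (0.0 to 1.0).
    attempts_per_lookup: hypothetical number of times a client sends each
        failing lookup, counting the first attempt.
    """
    if not 0.0 <= affected_share <= 1.0:
        raise ValueError("affected_share must be between 0 and 1")
    if attempts_per_lookup < 1:
        raise ValueError("attempts_per_lookup must be at least 1")
    # Queries for healthy domains are unchanged; failing ones are repeated.
    unaffected = 1.0 - affected_share
    return unaffected + affected_share * attempts_per_lookup


# Illustrative inputs only: a quarter of queries fail, each tried 5 or 17 times.
print(load_multiplier(0.25, 5))
print(load_multiplier(0.25, 17))
```

With these assumed inputs, five attempts per lookup doubles the total volume and seventeen attempts yields a fivefold increase. Real client retry behavior varies widely, so treat the model as an intuition aid rather than a prediction.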
The OX PowerDNS Recursor combines outgoing queries and throttles them when an authoritative server, in this case Facebook’s, is down, which limits the query overload. DNSdist, our smart DNS cache, load balancer and DDoS protection proxy, caches SERVFAIL responses and is able to serve them extremely efficiently out of its packet cache, with only minimal CPU impact. This meant that the additional query load did not translate into an equivalent increase in CPU load.
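The idea of caching failure responses can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how DNSdist or the Recursor is actually implemented: the class `FailureCachingResolver`, the `failure_ttl` parameter and the `upstream` callable are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict

SERVFAIL = "SERVFAIL"


class FailureCachingResolver:
    """Toy front-end that remembers SERVFAIL answers for a short time."""

    def __init__(
        self,
        upstream: Callable[[str], str],
        failure_ttl: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.upstream = upstream        # hypothetical backend lookup function
        self.failure_ttl = failure_ttl  # seconds to keep a cached failure
        self.clock = clock              # injectable so the demo controls time
        self.upstream_calls = 0
        self._failures: Dict[str, float] = {}  # name -> cache expiry time

    def resolve(self, name: str) -> str:
        now = self.clock()
        expiry = self._failures.get(name)
        if expiry is not None and now < expiry:
            # Cheap path: answer from the cache without touching the backend.
            return SERVFAIL
        self.upstream_calls += 1
        answer = self.upstream(name)
        if answer == SERVFAIL:
            self._failures[name] = now + self.failure_ttl
        else:
            self._failures.pop(name, None)
        return answer


def unreachable_upstream(name: str) -> str:
    """Stand-in for a backend whose authoritative servers cannot be reached."""
    return SERVFAIL


fake_now = [0.0]
resolver = FailureCachingResolver(
    unreachable_upstream, failure_ttl=5.0, clock=lambda: fake_now[0]
)
for _ in range(1000):
    resolver.resolve("facebook.com")
print(resolver.upstream_calls)  # a single backend lookup served 1000 clients
```

In this sketch a thousand client retries cost one backend lookup; only after the cached failure expires does the next query reach the backend again. Keeping the failure TTL short is the design trade-off: it bounds the extra work during an outage while letting service recover quickly once the authoritative servers return.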
As a result, depending on the amount of headroom customers had, PowerDNS users either saw no impact on their DNS service, or minor additional latency, but none experienced meltdowns or service outages.
We know this not only because of the way DNSdist has been engineered, but also because customers have been contacting us to thank us (and the entire community around DNSdist): it allowed them to manage the Facebook outage with minimal impact, if any. As one customer explains:
“During the recent Facebook outage, we saw traffic from Facebook clients cause more than double the normal query volume in some locations. The responses our systems sent back to this influx of queries were all SERVFAIL, due to the unreachability of Facebook authoritative systems. DNSdist was able to respond to SERVFAIL requests exceptionally quickly, which allowed us to remain operational and without serious side effects during the wave of queries. We’ve been very pleased with the engineering choices made in DNSdist that allow it to respond with low latency in a variety of challenging circumstances, but still remain flexible enough for us to apply our policy and custom responses that allow us to deliver our services at a global scale.”