Incident Report: Spotify Outage on January 14, 2023




February 15, 2023

Published by Erik Lindblad, Staff Engineer

On January 14, between 00:15 UTC and 03:45 UTC, Spotify suffered an outage. The impact was small at first and escalated over the course of an hour until most functionality (including playback) was not working. Spotify engineers were first notified of the issue at 00:40 UTC, and our incident response team was immediately assembled. Due to the nature of the incident, triage took longer than one would expect, because our internal tooling was inaccessible. At 02:00 UTC, the root cause had been identified and mitigation was underway. Service was fully restored at 03:45 UTC.

What happened?

The issue was triggered by scheduled maintenance of our Git repository (GitHub Enterprise, or GHE), causing our internal DNS resolvers, which are a critical component of our internal DNS infrastructure, to fail. The resolvers pull configuration from several sources, GHE included. The dependency between this DNS component and our Git repository was well known, and there are several safeguards in place to handle Git outages.

However, a change had been introduced in the DNS component that created a new failure mode, in which invalid configurations could be applied to the DNS resolvers. GHE going down did not cause the outage: it was only when GHE was brought back up from maintenance that the invalid configuration was applied across the resolver fleet over the course of an hour. When a resolver received an invalid configuration, it entered a crashloop and was unable to serve responses to internal DNS queries.
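To make the failure mode concrete, here is a minimal sketch of the kind of safeguard that keeps an invalid configuration from crashing a resolver: validate an incoming config before applying it, and keep serving from the last known-good config if validation fails. This is an illustration only; the type and function names (ResolverConfig, validate, applyConfig) are our own assumptions, not Spotify's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// ResolverConfig is a stand-in for whatever configuration the resolvers
// pull from upstream sources (GHE among them).
type ResolverConfig struct {
	Zones []string
}

// validate is a hypothetical sanity check; a real one would verify that the
// config parses and covers the expected zones before it is applied.
func validate(c ResolverConfig) error {
	if len(c.Zones) == 0 {
		return errors.New("config contains no zones")
	}
	return nil
}

// applyConfig only swaps in a new config if it passes validation; otherwise
// the resolver keeps serving from the last known-good config instead of
// entering a crashloop.
func applyConfig(current, incoming ResolverConfig) ResolverConfig {
	if err := validate(incoming); err != nil {
		fmt.Printf("rejecting invalid config, keeping last known-good: %v\n", err)
		return current
	}
	return incoming
}

func main() {
	lastGood := ResolverConfig{Zones: []string{"internal.example."}}
	invalid := ResolverConfig{} // simulates the bad config produced after GHE came back up

	active := applyConfig(lastGood, invalid)
	fmt.Println("active zones:", active.Zones)
}
```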

As DNS failures ramped up, internal services and production traffic started failing. Since internal tooling (including the employee VPN) was also impacted, the response teams had to come up with novel troubleshooting methods, which led to a longer triage time than one would expect.

Timeline

[Figure: Traffic to a core service during the incident (green) and one week earlier (blue).]

(All times are in UTC.)

00:15 UTC – DNS resolvers begin to fail.

00:36 UTC – About 30% of the DNS resolver fleet down, first signs of user impact.

00:40 UTC – Spotify platform engineers alerted by automated monitoring.

01:15 UTC – 100% of our DNS resolver fleet down, severe impact.

02:00 UTC – Root cause identified.

02:15 UTC – Mitigation in place for internal tooling.

02:33 UTC – Mitigation in place for production traffic.

03:45 UTC – All core services recovered.

Where do we go from here?

  • We have fixed the bug that caused this outage.
  • We have made changes in a number of services and infrastructure that will improve resiliency against future DNS outages (one generic pattern is sketched after this list).
  • We plan to decommission the DNS component that failed, as part of an ongoing project aimed at removing complexity from our DNS infrastructure. We have raised the priority of this project.
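The report does not spell out which resiliency changes were made. One generic pattern for riding out a resolver outage, offered here purely as an assumption rather than something the report confirms, is for clients to fall back to a stale cached answer when a fresh lookup fails. A minimal Go sketch:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"sync"
	"time"
)

// staleCachingResolver wraps net.Resolver and remembers the last successful
// answer per host, so lookups can fall back to a stale result when the
// resolvers are down. A generic pattern, not Spotify's implementation.
type staleCachingResolver struct {
	mu    sync.Mutex
	cache map[string][]string
	r     *net.Resolver
}

func newStaleCachingResolver() *staleCachingResolver {
	return &staleCachingResolver{cache: map[string][]string{}, r: net.DefaultResolver}
}

func (s *staleCachingResolver) Lookup(ctx context.Context, host string) ([]string, error) {
	addrs, err := s.r.LookupHost(ctx, host)
	if err == nil {
		s.mu.Lock()
		s.cache[host] = addrs // refresh the cached answer on success
		s.mu.Unlock()
		return addrs, nil
	}
	// Fresh lookup failed: serve the stale cached answer if we have one.
	s.mu.Lock()
	stale, ok := s.cache[host]
	s.mu.Unlock()
	if ok {
		return stale, nil
	}
	return nil, err
}

func main() {
	res := newStaleCachingResolver()
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	addrs, err := res.Lookup(ctx, "example.com")
	fmt.Println(addrs, err)
}
```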


