Facebook went down worldwide after configuration changes to its backbone routers ultimately disrupted its data centers...
Facebook and its subsidiaries Instagram and WhatsApp suffered one of the biggest outages in history, lasting more than 5 hours. The outage began with a configuration change whose effects cascaded all the way into the data centers, bringing the services down. Facebook nonetheless said it has “no evidence that user data was compromised as a result of this downtime.”
The downtime also left Facebook employees, most of whom were working remotely, stranded and unable to communicate with each other. Facebook's internal tools were affected as well, which further complicated attempts to resolve the outage.
Outside observers trying to identify the underlying cause pointed to problems with BGP and DNS. Doug Madory, director of internet analysis at Kentik, said that an employee at Facebook had pushed an update to Facebook’s BGP records. The Border Gateway Protocol (BGP) carries the routing information that autonomous systems exchange with one another. Without it, a network cannot send or receive requests.
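To make the role of BGP more concrete, here is a minimal sketch of what happens to reachability when a route is withdrawn. It is a toy model, not real BGP: the `RoutingTable` class and its methods are hypothetical names invented for illustration, and the prefix `192.0.2.0/24` and AS number `64500` come from ranges reserved for documentation, not from Facebook's network.

```python
import ipaddress


class RoutingTable:
    """Toy model of the routes one network has learned from its neighbours.

    Hypothetical illustration only; real BGP also tracks paths, policies
    and timers that are left out here.
    """

    def __init__(self):
        # Maps an announced prefix to the autonomous system that originates it.
        self.routes = {}

    def announce(self, prefix, origin_as):
        """Record that origin_as claims to serve this prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        """Forget a prefix, as happens when its announcement is withdrawn."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin AS for address, or None if no route covers it."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        # Prefer the most specific (longest) matching prefix.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", 64500)
before = table.lookup("192.0.2.10")   # a route exists, traffic can be forwarded
table.withdraw("192.0.2.0/24")
after = table.lookup("192.0.2.10")    # no route left, the address is unreachable
```

Once the prefix is withdrawn, the lookup has nothing to match, which mirrors how the rest of the internet loses its path to a network whose routes disappear.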
This technical oversight wiped out the routing information that internet service providers relied on to reach Facebook. As a result, DNS servers stopped resolving Facebook's domain names. When someone typed the Facebook URL into a browser, DNS resolvers had no way to reach Facebook's nameservers, and the browser ended up showing the error DNS_PROBE_FINISHED_NXDOMAIN.
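From an application's point of view, a failure like this shows up as a name that simply cannot be resolved. The sketch below shows one way a Python program might detect that condition using the standard library's `socket.getaddrinfo`. The `resolve` helper and its `lookup` parameter are hypothetical names added for illustration, and the hostname and address in the example are reserved documentation values.

```python
import socket


def resolve(hostname, lookup=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or None if lookup fails.

    `lookup` defaults to the system resolver and can be swapped for a
    stand-in, which makes the failure path easy to exercise offline.
    """
    try:
        results = lookup(hostname, 443)
    except socket.gaierror:
        # The resolver could not produce an answer for this name.
        return None
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address is the first element of sockaddr.
    return sorted({info[4][0] for info in results})


def unreachable_nameservers(hostname, port):
    """Stand-in resolver that simulates a lookup with no usable answer."""
    raise socket.gaierror("name could not be resolved")


def working_resolver(hostname, port):
    """Stand-in resolver that returns one documentation address."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", port))]


during_outage = resolve("example.invalid", lookup=unreachable_nameservers)
normal = resolve("example.test", lookup=working_resolver)
```

Calling `resolve("example.com")` without the `lookup` argument would query the real system resolver. A browser wraps the same kind of failure in its own error label, such as the one quoted above.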
The outage sent Facebook's shares down 5.5% in the afternoon. Facebook apologized for the inconvenience and said it was working hard to restore access.
“People and businesses around the world rely on us every day to stay connected. We understand the impact outages like these have on people’s lives, and our responsibility to keep people informed about disruptions to our services. We apologize to all those affected, and we’re working to understand more about what happened today so we can continue to make our infrastructure more resilient,” wrote Santosh Janardhan in Facebook's blog post.