If you’re a Facebook user, you’ve probably already heard about the two-and-a-half-hour downtime the site suffered yesterday, during which people were unable to access Facebook’s website. Also affected were Facebook’s “Like” buttons, which are used on pages throughout the Internet. Yesterday’s downtime was the worst outage Facebook has experienced in over four years.
Facebook’s Director of Engineering, Robert Johnson, has posted a note on his Facebook profile explaining what caused the outage; you can view it at this link.
The outage was triggered by an automated system that checks for invalid configuration values in the cache and replaces them with updated values from the persistent store. According to Johnson, this process works well for transient problems in Facebook’s cache, but it does not work when the value in the persistent store is invalid as well.
When Facebook made a change to the persistent copy of a configuration value, the new value was interpreted as invalid. Because of this, every client saw the invalid value and attempted to fix it, which involves making a query to a cluster of databases. These queries overwhelmed the databases, as hundreds of thousands of requests were being made every second.
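To make the failure mode concrete, here is a minimal sketch of a "refresh the cache when the value looks invalid" read path. It is not Facebook's code: the names `CountingStore`, `is_valid`, `get_config` and `simulate`, and the rule that a valid setting is a positive integer, are all assumptions made up for illustration.

```python
class CountingStore:
    """Stand-in for the persistent store; counts the queries it receives."""

    def __init__(self, value):
        self.value = value
        self.queries = 0

    def read(self):
        self.queries += 1
        return self.value


def is_valid(value):
    # Hypothetical validity rule: a valid setting is a positive integer.
    return isinstance(value, int) and value > 0


def get_config(cache, store, key="setting"):
    """Return the cached value, refreshing it from the store if it looks invalid."""
    value = cache.get(key)
    if is_valid(value):
        return value
    fresh = store.read()  # the "fix" is a query to the database
    cache[key] = fresh    # stored even if the fresh value is itself invalid
    return fresh


def simulate(store_value, clients=1000):
    """Run `clients` reads against one shared cache; return the database query count."""
    cache = {}
    store = CountingStore(store_value)
    for _ in range(clients):
        get_config(cache, store)
    return store.queries
```

In this toy model, a valid persistent value costs a single database query no matter how many clients read it, because the first read repairs the cache. An invalid persistent value can never be repaired, so every one of the clients ends up querying the database.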
Also, each time a client made a query and received an error, it interpreted the error as an invalid value and deleted the corresponding cache key. Because of this, the queries continued even after the original problem had been fixed. As long as the databases were failing to service some of the requests they received, they were causing even more requests to be made, creating a feedback loop that did not allow Facebook’s databases to recover.
To break this loop, all traffic to the database cluster had to be stopped, which meant that Facebook essentially had to be “turned off”; this is why users were unable to access the website. Once the databases had recovered, Facebook began allowing more people back onto the website.
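The feedback loop and the way out of it can be sketched with a small simulation. This is a toy model under stated assumptions, not a description of Facebook's systems: time advances in discrete ticks, all clients check the cache at the same moment, the database can serve at most `capacity` queries per tick and returns errors for the rest, and the persistent value has already been fixed. The names `run_tick`, `simulate` and `admitted` are hypothetical.

```python
def run_tick(cache, clients, capacity, key="setting", good_value=42):
    """Simulate one time step; return the number of database queries issued."""
    # All clients look at the cache at the same moment, so if the key is
    # present nobody needs the database during this tick.
    if key in cache:
        return 0
    queries = clients                 # every client tries to repair the key
    served = min(queries, capacity)   # the database answers what it can
    failed = queries - served         # the remaining queries get an error
    if served:
        cache[key] = good_value       # successful responses repopulate the cache
    if failed:
        cache.pop(key, None)          # clients that saw an error delete the key
    return queries


def simulate(ticks, clients, capacity, admitted=None):
    """Return the query count per tick; admitted[i] caps the clients let in at tick i."""
    cache = {}
    history = []
    for i in range(ticks):
        active = clients if admitted is None else min(clients, admitted[i])
        history.append(run_tick(cache, active, capacity))
    return history
```

With 1,000 clients and a capacity of 100, `simulate(4, 1000, 100)` issues 1,000 queries on every tick: some requests always fail, the failures delete the key, and the load never drops. Passing `admitted=[1000, 0, 50, 1000]` cuts traffic off for one tick and then lets in fewer clients than the database can serve, which repopulates the cache; when full traffic returns on the last tick, no queries are needed.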
In addition, the system that checks for and attempts to correct invalid configuration values has been turned off, and alternatives to it are being explored.
If you would like more information about Facebook’s recent downtime, you may want to check out this note on the profile of Facebook’s Director of Engineering, Robert Johnson, as well as this post from TechCrunch.