Update on the April 5th, 2017 Outage

By DigitalOcean

Posted: April 4, 2017

Today, DigitalOcean’s control panel and API were unavailable for a period of four hours and fifty-six minutes. During this time, all running Droplets continued to function, but no additional Droplets or other resources could be created or managed. We know that you depend on our services, and an outage like this is unacceptable. We would like to apologize and take full responsibility for the situation. The trust you’ve placed in us is our most important asset, so we’d like to share all of the details about this event.

At 10:24 AM EDT on April 5th, 2017, we began to receive alerts that our public services were not functioning. Within three minutes of the initial alerts, we discovered that our primary database had been deleted. Seven minutes later, at 10:34 AM EDT, we commenced the recovery process, using one of our time-delayed database replicas. Over the next four hours and forty-six minutes, we copied and restored the data to our primary and secondary replicas. The duration of the outage was due to the time it took to copy the data between the replicas and restore it into an active server.

At 3:20 PM EDT the primary database was completely restored, and no data was lost.

Timeline of Events

T0.00 - 10:24 EDT - First observation of issues
T0.03 - 10:27 EDT - Verified that production database had been deleted on master
T0.10 - 10:34 EDT - Began recovery from time-delayed replica
T1.29 - 11:53 EDT - Backup of time-delayed replica completed
T2.10 - 12:34 EDT - Copy of backup to master completed; recovery commencing
T3.07 - 13:31 EDT - Recovery of master completed; copy of backups to replicas ongoing
T4.56 - 15:20 EDT - All systems restored
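
The T offsets above are elapsed time since the first alert, written as hours.minutes. As a minimal sketch of that arithmetic (the function names here are illustrative and not part of any DigitalOcean tooling), the following Python recomputes each offset from the clock times in the timeline:

```python
from datetime import datetime

# Timeline entries copied from the post: (offset label, clock time in EDT).
TIMELINE = [
    ("T0.00", "10:24"),
    ("T0.03", "10:27"),
    ("T0.10", "10:34"),
    ("T1.29", "11:53"),
    ("T2.10", "12:34"),
    ("T3.07", "13:31"),
    ("T4.56", "15:20"),
]


def elapsed_label(start: str, now: str) -> str:
    """Format the time between two same-day HH:MM clock times as 'T<h>.<mm>'."""
    fmt = "%H:%M"
    delta = datetime.strptime(now, fmt) - datetime.strptime(start, fmt)
    minutes = int(delta.total_seconds() // 60)
    return f"T{minutes // 60}.{minutes % 60:02d}"


def recompute_offsets(timeline):
    """Return (published label, recomputed label) pairs for every entry."""
    start = timeline[0][1]
    return [(label, elapsed_label(start, clock)) for label, clock in timeline]
```

Every recomputed label matches the published one, and the final entry gives the total outage of four hours and fifty-six minutes.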

Future Measures

The root cause of this incident was an engineer-driven configuration error. A process performing automated testing was misconfigured with production credentials. As such, we will be drastically reducing access to the primary system for certain actions to ensure this does not happen again.
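
One common complement to restricting access is a guard in the test harness itself that refuses to run against a production target. The sketch below is purely illustrative: it does not describe our actual tooling, and the variable names `APP_ENV` and `DATABASE_HOST`, the host name, and the function name are all hypothetical.

```python
from typing import Mapping

# Hypothetical set of hosts that automated tests must never touch.
PRODUCTION_HOSTS = {"db-primary.example.internal"}


class ProductionTargetError(RuntimeError):
    """Raised when a test run is configured against a production system."""


def assert_safe_test_target(env: Mapping[str, str]) -> None:
    """Fail fast unless the configuration clearly points at a test setup.

    `env` is any mapping of configuration values, for example os.environ.
    """
    app_env = env.get("APP_ENV", "")
    db_host = env.get("DATABASE_HOST", "")

    # Require an explicit opt-in: a missing or unexpected value is rejected.
    if app_env != "test":
        raise ProductionTargetError(
            f"APP_ENV must be 'test' for automated tests, got {app_env!r}"
        )

    # Reject known production hosts even when APP_ENV claims to be 'test'.
    if db_host in PRODUCTION_HOSTS:
        raise ProductionTargetError(
            f"refusing to run tests against production host {db_host!r}"
        )
```

A test runner could call `assert_safe_test_target(os.environ)` before any setup step that creates or drops data. A check like this only catches misconfiguration that reaches the harness; limiting which credentials can perform destructive actions remains the stronger safeguard.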

As noted above, the duration of the incident was primarily influenced by the speed of our network while reloading the data into our database. Although we expect this type of event to be rare, we are upgrading the network connectivity between our database servers and updating our hardware to improve the speed of recovery. We expect these improvements to be completed over the next few months.

In Conclusion

We wanted to share this information with you as soon as possible so that you can understand the nature of the outage and its impact. In the coming days, we will continue to assess further safeguards against developer error, work to improve our processes around data recovery, and explore ways to provide better real-time information during future customer-impacting events. We take the reliability of our service seriously and are committed to delivering a platform that you can depend on to run your mission-critical applications. The entire team at DigitalOcean thanks you for your understanding and, again, we apologize for the impact of this incident.