An Update on Last Week's Customer Shutdown Incident

Barry Cooks

Posted: June 4, 2019

Update 0120 UTC 5 June – We want to clarify that all customer details shared in this post have been approved by the customer in advance. We would never share such company information without express permission.

Original post

On May 29, DigitalOcean customer Raisup’s account was locked, and their resources were powered down due to a false positive generated by our anti-fraud and abuse automation system. The follow-up in handling the false positive resulted in a subsequent lock, and a communication of permanent denial of access to the account was sent to the customer. The account owner leveraged Twitter as an avenue to call attention to the mistake. Shortly thereafter, DigitalOcean investigated the issue and the Raisup account was unlocked and powered back on. We’d like to apologize and share more details about exactly what happened.

The Incident

The initial account lock and resource power down resulted from an automated service that monitors for cryptocurrency mining activity (Droplet CPU loads and Droplet create behaviors). These signals, coupled with a number of account-level signals (including payment history and current run rate compared to total payments), are used to determine whether automated action is warranted to minimize the impact of potentially fraudulent high CPU loads on other customers. Before any action is taken against an account, automated safeties are checked to avoid acting without warning on a customer that is in good standing.

Unfortunately in this case, the safeties were insufficient to prevent automated action. Additionally, because the customer was running on credit, they did not have a clear payment history, which meant that one of the primary safeties (payment history) was not triggered. The automated service created a support ticket on behalf of the customer to allow for rapid communication regarding the action.
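The gap described above can be illustrated with a minimal sketch. It is not DigitalOcean's actual automation: the names `AccountSignals`, `payment_history_safety`, and `should_auto_lock`, and the rule that any recorded payment counts as good standing, are assumptions made purely for illustration. The sketch shows how a safety keyed on payment history cannot vouch for a customer running on credit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AccountSignals:
    """Hypothetical bundle of the signals named in the post."""
    high_cpu_load: bool                # Droplet CPU load signal
    rapid_droplet_creation: bool       # Droplet create behavior signal
    total_payments: Optional[float]    # None when the account runs on credit
    allow_high_cpu_flag: bool = False  # safety flag set manually by an agent


def payment_history_safety(signals: AccountSignals) -> bool:
    """Return True when payment history vouches for the account.

    Assumed rule for illustration: any recorded payment counts.
    """
    return signals.total_payments is not None and signals.total_payments > 0


def should_auto_lock(signals: AccountSignals) -> bool:
    """Lock only if the behavior looks suspicious and no safety applies."""
    suspicious = signals.high_cpu_load and signals.rapid_droplet_creation
    if not suspicious:
        return False
    safeties = [payment_history_safety(signals), signals.allow_high_cpu_flag]
    return not any(safeties)


# Same behavior, different payment history.
credit_customer = AccountSignals(True, True, total_payments=None)
paying_customer = AccountSignals(True, True, total_payments=120.0)

print(should_auto_lock(credit_customer))  # True: no safety vouches for the account
print(should_auto_lock(paying_customer))  # False: payment history safety applies
```

In this sketch the only difference between the two accounts is the missing payment history, which is enough to change the outcome from no action to an automated lock.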

Upon recognizing that their resources had been powered off and the account locked, the customer replied to the ticket created for communication on the action. An Abuse Operations agent re-enabled the account 12 hours after the initial ticket. However, the agent mistakenly did not flag the account as approved for the CPU-intensive activity that was the cause of the initial flag.

On May 30, the same automated service then acted on the account a second time, due to the absence of a safety flag. Upon a second review by a different Abuse Operations agent (nearly 29 hours after the customer responded to the second flag), the agent failed to recognize this was a false positive, and the agent fully denied access back into the account. This action triggered the final “access denied” communication to the customer. At this point, the customer initiated the series of tweets to gain the attention of DigitalOcean.

After further investigation the Droplets were powered back on, access was regranted to the account, and the appropriate safeties were flagged. DigitalOcean leadership initiated communication with the customer to extend apologies, offer credit, and fully explain what happened to resolve the issue.

Timeline of Events

2019-05-29 16:43 UTC – Customer rapidly creates a batch of 10 Droplets, producing ~100% CPU load across all new worker Droplets.

2019-05-29 18:24 UTC – Cryptocurrency mining mitigation detects suspicious behavior, including very high CPU utilization on an account with no payment history, which results in an account lock. As a part of this lock a support ticket is automatically created on the customer’s behalf.

2019-05-29 18:37 UTC – Customer replies back to the ticket with a request to unlock.

2019-05-30 06:43 UTC – Action is taken due to the customer reaching out on social media and to Support. Support routes the issue to Abuse Ops. Account is unlocked by the responding Abuse Ops agent and a reply is sent by email, 12 hours after the customer responded. The Allow High CPU Usage flag is not set as part of the unlock.

2019-05-30 09:49 UTC – Account is locked and powered down by the cryptocurrency mitigation three hours after the customer powers their Droplets back on, when CPU usage on the same worker Droplets spikes back to 100%. Customer replies to the new Verification support ticket within 20 minutes.

2019-05-31 15:32 UTC – 29 hours after the customer’s response, the account is denied reactivation. Abuse Ops agent (different from initial agent) cites the link to an older account, connected through a shared SSH key, as additional justification for making the decision to deny access.

2019-05-31 19:21 UTC – Social escalation leads to the account being unlocked/powered back on.

2019-05-31 – Communication across multiple channels (Twitter, HackerNews, other media outlets) occurs to provide apologies and clarity on the situation. Customer is directly contacted by DO staff to offer apologies, situational insight, and credit.

2019-06-01 – Customer responds to direct contact, acknowledging the apology.
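The response delays cited in this post can be checked against the timestamps in the timeline. The sketch below uses only the times listed above; the dictionary keys and the helper names `utc` and `hours_minutes` are labels chosen here for illustration. The exact time of the customer's second reply is given only as "within 20 minutes" of the second lock, so the second interval is measured from the lock itself.

```python
from datetime import datetime, timezone


def utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM' timeline entry as a UTC timestamp."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M").replace(tzinfo=timezone.utc)


# Timestamps copied from the timeline; the key names are illustrative.
events = {
    "first_lock": utc("2019-05-29 18:24"),
    "first_reply": utc("2019-05-29 18:37"),
    "first_unlock": utc("2019-05-30 06:43"),
    "second_lock": utc("2019-05-30 09:49"),
    "denial": utc("2019-05-31 15:32"),
    "final_unlock": utc("2019-05-31 19:21"),
}


def hours_minutes(start: str, end: str) -> tuple:
    """Return the elapsed time between two events as (hours, minutes)."""
    total_minutes = int((events[end] - events[start]).total_seconds() // 60)
    return divmod(total_minutes, 60)


print(hours_minutes("first_reply", "first_unlock"))  # (12, 6)
print(hours_minutes("second_lock", "denial"))        # (29, 43)
print(hours_minutes("first_lock", "final_unlock"))   # (48, 57)
```

The first interval matches the 12 hours cited. The second, measured from the lock rather than from the customer's reply, comes to a little under 30 hours, and the span from the first lock to the final unlock is just under 49 hours.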

Key Findings and Concerns

This situation involved failures across people, process, and technology:

Technology

  • The safeties intended to prevent fraud and abuse algorithms from taking automated action on a healthy, non-abusive customer were inadequate for a customer lacking payment history.

Process

  • Response timeframes to the customer of 12 hours, then 29 hours, for subsequent locks were far too long.
  • From a ticket management standpoint, responses to account locks were not prioritized above less severe tickets.
  • The initial DigitalOcean response on Twitter failed to recognize the potential harm that had been caused, and did not show compassion to the customer situation.
  • The communication regarding denial of access to the account creates a sense of helplessness; the finality without explanation requires correcting.

People

  • Process for adding the Allow High CPU Utilization safety flag was not followed.
  • Guidelines for judgment on a reported false positive were not clear, resulting in the denial of access.

Future Measures

There were a number of issues and missteps that contributed to the incident. To prevent similar incidents from occurring in the future, we are considering the following measures:

  • Peer review of account terminations. For any account appealing a lock, two agents will be required to review the submission prior to issuing a final deny.
  • The template used for responses in account denial will be removed entirely. If account access is denied during an appeal, which is often the case as most appeals come from true bad actors, the agent must write a reasoned response.
  • Services that result in the power down of resources will no longer automatically take action on any account that was engaged more than 90 days prior, regardless of lack of payment history. These cases will be escalated for manual review.
  • We will revisit how communications around fraud and abuse related issues are handled on Twitter.
  • When an agent manually chooses to unlock an account, that account will have a safety applied to ignore automated security, fraud and abuse services for a designated period of time (timeframe TBD).
  • To address the extended delay on the account lock appeal, Support and Security Operations leadership will create new workflows to allow abuse-related events to leverage the 24/7 structure of Support.
  • Additional hiring has been approved for both Support and AbuseOps to reduce ticket queue wait times.
  • A service for centralizing safeties for anti-fraud and abuse automation is already under development.
  • Finally, we will be reviewing how we share information about accounts within our internal systems and services to better contextualize an account for expected versus unexpected behaviors.
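Two of the measures above, peer review before a final deny and a temporary safety after a manual unlock, are simple enough to sketch in code. This is an illustration under stated assumptions, not a description of DigitalOcean's systems: the classes `Appeal` and `Account` and the functions `manual_unlock` and `automation_may_act` are hypothetical. Because the post leaves the timeframe TBD, the grace period is a required argument with no default, and the seven-day value in the usage lines is an arbitrary placeholder.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional, Set


@dataclass
class Appeal:
    """Hypothetical appeal record tracking which agents voted to deny."""
    deny_votes: Set[str] = field(default_factory=set)

    def record_deny(self, agent_id: str) -> None:
        # A set ensures the same agent cannot count twice.
        self.deny_votes.add(agent_id)

    def can_issue_final_deny(self) -> bool:
        # Peer review: two distinct agents must agree before a final deny.
        return len(self.deny_votes) >= 2


@dataclass
class Account:
    """Hypothetical account record holding the post-unlock safety."""
    automation_exempt_until: Optional[datetime] = None


def manual_unlock(account: Account, now: datetime, grace_period: timedelta) -> None:
    """Apply the safety as part of the unlock so it cannot be forgotten."""
    account.automation_exempt_until = now + grace_period


def automation_may_act(account: Account, now: datetime) -> bool:
    """Automated services skip the account while the safety is active."""
    exempt_until = account.automation_exempt_until
    return exempt_until is None or now >= exempt_until


# Usage with placeholder values (the real timeframe is TBD in the post).
appeal = Appeal()
appeal.record_deny("agent-a")
print(appeal.can_issue_final_deny())  # False: only one reviewer so far

account = Account()
unlocked_at = datetime(2019, 5, 30, 6, 43, tzinfo=timezone.utc)
manual_unlock(account, unlocked_at, grace_period=timedelta(days=7))
print(automation_may_act(account, unlocked_at + timedelta(hours=3)))  # False
```

Tying the safety to the unlock action itself, rather than to a separate manual step, is the design choice that would have prevented the second lock described in this post.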

In Conclusion

We wanted to share the specific details around this incident as accurately and quickly as possible to give the community insight into what happened and how we handled it. We recognize the impact this had on a customer, and how this represented a breach of trust for the community, and for that we are deeply sorry. We have a number of takeaways to improve the technical, process, and people missteps that led to this failure. The entire team at DigitalOcean values and remains committed to the global community of developers.

Barry Cooks

Chief Technical Officer
