Outage Post-Mortem: July 19, 2019

What Happened? The Incident Report

At around 9 PM UTC on July 16th, 2019, a developer ran a test script while connected to the Habitica production database to make sure all the data for an upcoming feature was correct. The postmortem report reads:

I tried to connect with my read-only database account but it wouldn’t connect correctly (probably due to an issue in the library we use to connect to MongoDB in one of our code repositories). So I switched to my database account with write permissions and ran the script.

The script only read data, so it didn’t cause any issues, but at the same time I decided to investigate a test failure in the same code repository and ran the tests there.

Being used to the main code repository, where the tests use a different database than the rest of the code, I didn’t realize that the tests in this repository ran against the same database the test script was using, which was the production database.

Unfortunately, one of the tests cleaned up the data it had created by deleting all documents from the groups collection, which stores all the parties and guilds that exist on Habitica. This also affected quests and, in some cases, party or guild membership.

Chat messages are stored in a separate collection, so they were not affected.

Resolution and Recovery

About 3 hours after the incident, we were able to restore the most recent database backup we had access to. This reset all groups to their state at the time of the backup, with the exception of groups created between the last backup and the time of the incident, which the backup did not contain.

The following day we were able to restore the vast majority of the party and guild memberships that had been lost and, where possible, lost quest progress. We are offering all affected users who were in the middle of a quest and might have lost their progress 4 gems plus a copy of the quest scroll they were completing. Some users may not have been identified by our automated checks and may need to reach out to us at admin@habitica.com for further resolution.

Other parts of Habitica, including user accounts and tasks, were not affected.

We have been fixing and continue to fix smaller issues as they are discovered.

Corrective and Preventive Measures

Following the incident, we reviewed our procedures and came up with a series of steps to avoid similar issues in the future.

In particular, we will do the following:

  • Update the tests in the affected code repository to make sure they use a completely different database when running, even locally.
  • Change the permissions of database users by removing the ability to delete entire collections of documents.
  • In the few places where we must delete all documents from a collection (using collectionName.remove({})), we are changing the code to use collectionName.drop() instead. Coupled with the previous change, this will prevent the deletion of production data if a similar incident occurs.
  • Start using checklists for common maintenance operations so that every developer has a set of steps to follow and it’s more difficult to forget something.
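
As a rough illustration of the first three measures, a test setup could refuse to start unless the connection string points at a dedicated test database, and could clean up with drop() rather than remove({}). This is only a sketch under stated assumptions, not Habitica's actual code: the function names, the "-test" suffix convention, and the example URIs are all hypothetical. In MongoDB, dropping a collection requires the dropCollection privilege, while deleting documents requires the separate remove privilege, so a production account that lacks dropCollection would have the drop() call rejected.

```javascript
// Hypothetical helpers: the names and the "-test" suffix convention are
// assumptions made for illustration, not part of Habitica's codebase.

// Extract the database name from a MongoDB connection string.
// Returns '' when the URI does not name a database.
function databaseNameFromUri(uri) {
  const match = /^mongodb(?:\+srv)?:\/\/[^/]+\/([^?]*)/.exec(uri);
  return match ? match[1] : '';
}

// Throw unless the URI points at a database whose name ends with the
// expected test suffix. Meant to be called before any test connects.
function assertTestDatabase(uri, suffix = '-test') {
  const name = databaseNameFromUri(uri);
  if (!name.endsWith(suffix)) {
    throw new Error(`Refusing to run tests against database "${name}"`);
  }
  return name;
}

// Test cleanup that drops the collection instead of removing every document.
// `collection` is assumed to expose drop(), as a MongoDB driver collection
// does. An account without the dropCollection privilege would get an error
// here instead of silently deleting data.
async function resetCollection(collection) {
  await collection.drop();
}
```

A test runner could call assertTestDatabase with its configured connection string before connecting, so that a misconfigured environment fails fast instead of touching real data.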

Many of these recommendations have already been implemented as of this posting.

As always, we appreciate your support and patience as we repair the error and continue to work to improve the Habitica experience!
