We had complete downtime across all systems. The downtime was caused by the following:
A db lock implementation that spun indefinitely against the database and didn't release gracefully
A scan of the large, unindexed user table, which caused long lookup times when logging users in and signing them up
The second issue compounded the first: the slow table scans added to the database load that the spinning lock was already causing
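The effect of a missing index on a login lookup can be reproduced in miniature. The sketch below is illustrative only: it uses SQLite as a stand-in for our production database, and the `users` schema, the `email` lookup column, and the index name `idx_users_email` are assumptions made for the example, not our actual schema. It asks the query planner how it would run the same lookup before and after an index exists.

```python
import sqlite3

# Hypothetical login lookup; the real query and column names may differ.
LOGIN_LOOKUP = "SELECT id FROM users WHERE email = ?"


def build_demo_db(n_users=1000):
    """Create an in-memory users table with no index on the lookup column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO users (email, name) VALUES (?, ?)",
        ((f"user{i}@example.com", f"User {i}") for i in range(n_users)),
    )
    conn.commit()
    return conn


def query_plan(conn, sql, params=()):
    """Return the planner's description of how it would execute the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " | ".join(row[3] for row in rows)


if __name__ == "__main__":
    conn = build_demo_db()
    args = ("user42@example.com",)

    # Without an index the planner has to scan every row in the table.
    print("before:", query_plan(conn, LOGIN_LOOKUP, args))

    conn.execute("CREATE UNIQUE INDEX idx_users_email ON users (email)")

    # With the index the planner can search for the row directly.
    print("after: ", query_plan(conn, LOGIN_LOOKUP, args))
```

The exact plan wording varies between SQLite versions, but the first plan reports a scan of `users` and the second a search using `idx_users_email`. On a large table under login and signup traffic, that difference is what separates a quick lookup from sustained load on the database.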
Remediation:
Replaced our db lock implementation with one that does not spin
Denormalized the lookup information off the user table and added indexes for faster lookup times
Implemented this status page
Upgraded the database to the latest version
Upgraded the server the database runs on to double the CPU and memory
Created internal dashboards that give us detailed information about the health of our database queries, so we can diagnose and address query-related issues before they impact production traffic
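To illustrate the first remediation item, here is a minimal sketch of a database-backed lock that does not spin. It is not our production implementation. It assumes a hypothetical `locks` table and uses SQLite as a stand-in database, and the names `try_acquire`, `release`, `db_lock`, and `LockTimeout` are invented for the example. It makes a bounded number of attempts with exponential backoff and jitter, gives each lock an expiry so a crashed holder cannot block others forever, and always releases on exit.

```python
import random
import sqlite3
import time
from contextlib import contextmanager


class LockTimeout(Exception):
    """Raised when the lock is not acquired within the attempt budget."""


def create_lock_table(conn):
    # Hypothetical schema: one row per held lock, with a lease expiry time.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS locks ("
        "name TEXT PRIMARY KEY, owner TEXT NOT NULL, expires_at REAL NOT NULL)"
    )


def try_acquire(conn, name, owner, ttl_seconds, now=None):
    """Make a single acquisition attempt; return True if the lock was taken."""
    now = time.time() if now is None else now
    with conn:  # one transaction: commit on success, roll back on error
        # Reclaim the lock if the previous holder's lease has expired.
        conn.execute(
            "DELETE FROM locks WHERE name = ? AND expires_at <= ?", (name, now)
        )
        cur = conn.execute(
            "INSERT OR IGNORE INTO locks (name, owner, expires_at) VALUES (?, ?, ?)",
            (name, owner, now + ttl_seconds),
        )
    return cur.rowcount == 1


def release(conn, name, owner):
    """Release the lock, but only if this owner still holds it."""
    with conn:
        conn.execute(
            "DELETE FROM locks WHERE name = ? AND owner = ?", (name, owner)
        )


@contextmanager
def db_lock(conn, name, owner, ttl_seconds=30.0, max_attempts=5,
            base_delay=0.05, sleep=time.sleep):
    """Acquire with bounded, backed-off retries and release on exit."""
    for attempt in range(max_attempts):
        if try_acquire(conn, name, owner, ttl_seconds):
            break
        if attempt < max_attempts - 1:
            # Exponential backoff with jitter instead of a tight retry loop.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    else:
        raise LockTimeout(f"could not acquire {name!r} in {max_attempts} attempts")
    try:
        yield
    finally:
        release(conn, name, owner)
```

A caller would wrap its critical section in `with db_lock(conn, "nightly-job", "worker-1"):` and handle `LockTimeout` by giving up or rescheduling. The point of this design is that a contended lock produces a small, bounded number of queries rather than an unbounded stream of them against the database.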
No components marked as affected
Resolved