Degraded Diary Service
Incident Report for ResDiary
Postmortem

On Wednesday 13th October at 06:00* our development team attempted to release version 14.35.1 of our diary application to our servers hosted in the Microsoft Azure cloud. To provide resilience against failures, we deploy the application to a pool of servers so that if one or more servers become unhealthy, the remaining servers can continue to serve the application. When we upgrade the application, we do so one server at a time to allow for zero downtime.
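As an illustration only, the one-server-at-a-time upgrade can be sketched as below. This is a minimal sketch and not our actual deployment tooling: the `Server` class, the `rolling_upgrade` function, the `capacity_limited` stub and the "previous" version label are all hypothetical names chosen for the example. It shows why an upgrade that is interrupted partway through leaves the pool serving a mix of versions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    name: str
    version: str


def rolling_upgrade(
    pool: List[Server],
    new_version: str,
    provision: Callable[[Server, str], bool],
) -> bool:
    """Upgrade one server at a time and stop at the first failure.

    `provision` is a hypothetical stand-in for the cloud platform. It
    returns True if the server could be upgraded to `new_version`.
    """
    for server in pool:
        if not provision(server, new_version):
            # Servers upgraded so far keep the new version; the rest
            # keep the old one, so the pool is now mixed-version.
            return False
        server.version = new_version
    return True


def capacity_limited(max_successes: int) -> Callable[[Server, str], bool]:
    """Hypothetical platform stub: succeeds max_successes times, then fails."""
    remaining = [max_successes]

    def provision(server: Server, version: str) -> bool:
        if remaining[0] <= 0:
            return False
        remaining[0] -= 1
        return True

    return provision


# Four servers on the old version; the platform stops after two upgrades.
pool = [Server(f"web-{i}", "previous") for i in range(4)]
completed = rolling_upgrade(pool, "14.35.1", capacity_limited(2))
print(completed, [s.version for s in pool])
# prints: False ['14.35.1', '14.35.1', 'previous', 'previous']
```

Because the loop only ever touches one server at a time, every server in the pool is still serving some version of the application when the upgrade stops.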

On the morning that we attempted to release, Azure encountered a global issue that left it unable to provide any new servers. The failure occurred during our 14.35.1 application release and resulted in half of the servers serving the previous version of the application and the other half serving the new version. Fortunately, we were able to use our existing servers to keep the application running. Many Azure customers were left with no capacity between 05:12 and 11:45.

At 11:45 (during UK/EU lunch service), Azure fixed their service outage and automatically upgraded our remaining servers to the newer version of the application. This upgrade was unplanned and out of our control. It caused a significant increase in load on our database, which exacerbated a performance issue in the new version. After trying to scale up the capacity of our database and servers, we decided to roll back the release and redeploy it the following morning, when we could be sure that the Azure issues were resolved.

We're still considering ways to protect ourselves from global outages like this. Typically, the safeguard against a cloud issue involves spreading your servers over multiple geographic locations (for example, having some in Amsterdam and others in London); however, this outage occurred globally. Despite this, we will continue to investigate possible mitigations as part of our internal post-mortem process.

For further information about the Azure outage, see:

An Azure post-mortem will be available by Wednesday 20th October.

* All times UTC

Posted Oct 15, 2021 - 12:46 UTC

Resolved
This incident has been resolved.
Posted Oct 13, 2021 - 13:51 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted Oct 13, 2021 - 12:37 UTC
Investigating
We are currently investigating this issue.
Posted Oct 13, 2021 - 12:06 UTC
This incident affected: ResDiary Application (UK/Europe).