AU Diary Unavailable
Incident Report for ResDiary
Postmortem
  • At 11:03 BST an alert reported that au.resdiary.com had been unavailable since 11:02. The alert closed at 11:05, when the site became available again.
  • Although the site was reported as available again at 11:05, errors continued to be reported until 11:10, after which the service appears to have recovered fully.
  • The problem appears to have started with a brief period (27 seconds) in which the database was unavailable while it switched between performance tiers in Azure.

The database used by au.resdiary.com is hosted in Microsoft Azure. One benefit of this is that the resources allocated to the database can be adjusted in line with demand. Based on previous utilization, the database was scheduled to scale down to a lower service tier at 11 am BST (8 pm AEST). Behind the scenes, this works by creating a copy of the database and switching connections to the new copy once it is ready. During the switch there is a short period (up to 30 seconds) in which no new connections can be made to the database.

When this happened on 2 October 2018, it appears to have had a more severe impact on the ResDiary application than expected. Connections queued up while the database was unavailable, which caused errors and request timeouts, and errors continued for a few minutes after the database became available again. The database was unavailable for only 27 seconds, but the application did not handle this well, and that is what led to the prolonged service interruption.
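One common way for an application to ride out a short window like this is to retry failed database calls with exponential backoff rather than letting requests pile up and time out. The following is a minimal sketch only, not a description of the ResDiary application: `TransientDatabaseError`, `call_with_retry` and `make_flaky_query` are hypothetical names introduced for illustration, and the delays are assumptions chosen so that the total wait (1 + 2 + 4 + 8 + 16 = 31 seconds) slightly exceeds the 30-second window described above.

```python
import time


class TransientDatabaseError(Exception):
    """Hypothetical error standing in for 'database refused the connection'."""


def call_with_retry(operation, max_attempts=6, base_delay=1.0,
                    max_delay=16.0, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff.

    With the default values the waits are 1, 2, 4, 8 and 16 seconds,
    31 seconds in total, before the final attempt is allowed to fail.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientDatabaseError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Double the wait after each failure, capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            sleep(delay)


def make_flaky_query(failures):
    """Build a fake query that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def query():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientDatabaseError("database unavailable")
        return "ok"

    return query


# Simulate a database that rejects the first three attempts.
# Waits are recorded in a list instead of actually sleeping.
recorded = []
result = call_with_retry(make_flaky_query(3), sleep=recorded.append)
print(result, recorded)  # prints: ok [1.0, 2.0, 4.0]
```

In a real application the retry would wrap only operations that are safe to repeat, and would typically add random jitter so that many queued requests do not all reconnect at the same moment.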

As a result, we have disabled our scheduled service tier changes until we can implement them in a way that does not interrupt service.

Posted Oct 03, 2018 - 12:15 UTC

Resolved
Between 11:02 BST and 11:10 BST there was a short period of downtime followed by intermittent failed requests. This was triggered by a maintenance job that scales the database performance level in line with expected demand. Our alerting software notified us of the issue at 11:03, and the initial problem resolved itself by 11:05 without human intervention. On investigation, we found that errors continued until 11:10 while the application recovered, after which service resumed as normal.
Posted Oct 02, 2018 - 11:02 UTC