AU Diary Slow Requests

Incident Report for ResDiary

Postmortem

At around 08:40 BST the Australian diary application lost the ability to communicate with its SQL Server database for several minutes. During this period, a large backlog of requests built up on the web servers running the application, leading to increased CPU usage. The connection to the database was automatically restored, but unfortunately by this point the web servers were unable to cope with the volume of queued requests that they were trying to process.

As a result, the diary was unavailable for a short period of 1 to 2 minutes, followed by a longer period of around 20 minutes during which performance was significantly reduced. At around 09:00 BST engineers restarted each web server in turn, which allowed the servers to begin processing requests normally.

By 09:10 BST most requests were being processed correctly, and by 09:20 BST the incident was completely resolved.

We understand that problems like this are unacceptable, and we sincerely apologise for any inconvenience caused, especially during a busy Friday evening.

To prevent the same situation from happening again, we are going to take the following actions:

We are going to contact Azure to find out why the communication issue happened, and whether there is anything we can do to prevent it in future.
We have added additional web servers to our Australian infrastructure to help deal with the demand.
We are going to investigate configuration changes to our web servers that would allow them to recover from situations like this faster and without manual intervention.
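
As an illustration only of the kind of change the last action refers to, the sketch below shows one common way to stop a backlog from overwhelming a server: cap the request queue and discard requests that have waited too long. It is not ResDiary's actual configuration. The class name, method names, and limits are all hypothetical, and a real web server would normally expose equivalent settings rather than require custom code.

```python
import time
from collections import deque


class BoundedRequestQueue:
    """Hypothetical request queue that sheds load instead of growing without bound."""

    def __init__(self, max_size, max_age_seconds, clock=time.monotonic):
        self.max_size = max_size                # assumed cap on queued requests
        self.max_age_seconds = max_age_seconds  # assumed limit on time spent waiting
        self.clock = clock                      # injectable so the behaviour can be tested
        self._items = deque()                   # entries are (enqueued_at, request)

    def offer(self, request):
        """Queue a request. Returns False when full, so the caller can fail fast."""
        if len(self._items) >= self.max_size:
            return False
        self._items.append((self.clock(), request))
        return True

    def take(self):
        """Return the oldest request that is still fresh, or None if there is none."""
        while self._items:
            enqueued_at, request = self._items.popleft()
            if self.clock() - enqueued_at <= self.max_age_seconds:
                return request
            # Stale entry: the client has probably given up, so skip the work.
        return None
```

With limits like these, requests that pile up while the database is unreachable are either rejected immediately or dropped once they are stale, so a server has much less queued work to clear when the connection returns.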

Posted Oct 23, 2018 - 09:56 UTC

Resolved

At around 08:40 BST (17:40 AEST) on Friday 19 October 2018 some Australian diaries were unavailable or experienced reduced performance. Engineers were notified and began to restore normal performance around 09:00 BST. Normal service was completely restored by 09:20 BST.

Posted Oct 19, 2018 - 07:40 UTC