Slow Requests on AU Server

Incident Report for ResDiary

Postmortem

At 07:29 GMT our engineers were notified that the Australian diary application was experiencing reduced performance. On investigating, we found a large number of requests queued up, with significantly increased processing times. Our engineers began recycling the Australian web servers to clear queued requests that were never going to complete in time. This process was completed by around 07:50, after which the performance of the diary began to return to normal.

After further investigation, we have discovered that the problem was triggered by the application's backend database automatically choosing a different query plan for a particular query. In this case, switching plans caused the database to fall back to retrieving information from disk instead of from memory, which meant queries took much longer to return the required information. Our working theory is that the new plan required much more memory to execute, which caused the database to free up memory that had previously been used to speed up other queries.

When database queries started taking longer to process, a large backlog of requests built up on our web servers, which in turn put increased pressure on the database. Recycling the web servers reduced that pressure, which allowed the database to begin performing more normally. The database continued to experience reduced performance until around 09:30, at which point it automatically switched to a better-performing query plan, although most customer-facing impact would have been resolved by around 07:50.
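The way slower queries turn into a request backlog can be illustrated with Little's law (average requests in the system = arrival rate x average time in the system). The following is a minimal sketch only: the arrival rate and latencies are hypothetical figures chosen for illustration, not measurements from this incident.

```python
def in_flight_requests(arrival_rate_per_s: float, latency_ms: float) -> float:
    """Little's law: average number of requests in the system equals
    the arrival rate multiplied by the average time each request spends there."""
    return arrival_rate_per_s * latency_ms / 1000


# Hypothetical figures, for illustration only (not taken from the incident).
ARRIVAL_RATE_PER_S = 50      # requests arriving per second
NORMAL_LATENCY_MS = 200      # typical request time when queries are served from memory
DEGRADED_LATENCY_MS = 10000  # request time when queries fall back to disk

normal_load = in_flight_requests(ARRIVAL_RATE_PER_S, NORMAL_LATENCY_MS)
degraded_load = in_flight_requests(ARRIVAL_RATE_PER_S, DEGRADED_LATENCY_MS)

print(f"in flight at normal latency:   {normal_load:.0f}")
print(f"in flight at degraded latency: {degraded_load:.0f}")
```

With the same traffic, a 50-fold increase in latency means 50 times as many requests in flight at once, each holding a web server worker and, potentially, a database connection. That is the feedback loop described above.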

We understand the impact this has on our restaurant customers, especially as we head into the busiest time of the year, and we apologise for it. To prevent an outage like this from happening again, we are taking the following steps:

We have immediately scaled up the Australian database as a short-term mitigating measure.
We are analysing the affected query to try to reduce the resources it requires, making a situation like this less likely.
We are investigating measures that would allow our web servers to recover automatically instead of requiring engineer intervention.
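
One general technique for the last step is to have the request queue discard work that has already waited too long to be useful, so that a backlog drains on its own rather than needing a manual recycle. The sketch below is an illustration of that idea under our own assumptions, not a description of the measures under investigation; `DeadlineQueue`, `max_wait_s` and the other names are hypothetical.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional


@dataclass
class QueuedRequest:
    request_id: str
    enqueued_at: float  # clock reading when the request joined the queue


class DeadlineQueue:
    """FIFO queue that sheds requests which have waited longer than max_wait_s.

    Hypothetical illustration: a request that has already exceeded its
    deadline is dropped instead of being processed, so stale work cannot
    pile up behind a slow database.
    """

    def __init__(self, max_wait_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_wait_s = max_wait_s
        self._clock = clock
        self._items: Deque[QueuedRequest] = deque()
        self.shed_count = 0  # number of requests discarded as stale

    def put(self, request_id: str) -> None:
        self._items.append(QueuedRequest(request_id, self._clock()))

    def get(self) -> Optional[str]:
        """Return the next request still within its deadline, or None if empty."""
        now = self._clock()
        while self._items:
            item = self._items.popleft()
            if now - item.enqueued_at <= self._max_wait_s:
                return item.request_id
            self.shed_count += 1  # waited too long; drop it and try the next one
        return None


# Usage with a fake clock so the behaviour is deterministic.
fake_now = [0.0]
queue = DeadlineQueue(max_wait_s=5.0, clock=lambda: fake_now[0])
queue.put("old-request")
fake_now[0] = 4.0
queue.put("recent-request")
fake_now[0] = 6.0
next_request = queue.get()
```

In this example the first request has waited 6 seconds and is shed, while the second has waited only 2 seconds and is returned. The trade-off is that shed requests fail fast and must be retried by the client, in exchange for the servers staying responsive for new traffic.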

Posted Nov 27, 2018 - 15:21 UTC

Resolved

We have been monitoring the AU diary for the past few hours, and the system has been behaving normally. We are continuing to investigate the cause of the problem, and will provide a post-mortem report later.

Posted Nov 22, 2018 - 10:45 UTC

Monitoring

The AU servers are continuing to respond as normal. However, we are still monitoring closely while we investigate the cause of the problem.

Posted Nov 22, 2018 - 08:56 UTC

Update

The performance on the Australian servers is returning to normal but we are continuing to monitor closely.

Posted Nov 22, 2018 - 08:04 UTC

Investigating

We are aware of an issue on the AU server affecting diary performance. This is currently being investigated.

Posted Nov 22, 2018 - 07:50 UTC

This incident affected: ResDiary Application (Australia).