At 07:29 BST our engineers were notified that the Australian diary application was experiencing reduced performance. On investigation, we found a large number of requests queued, with processing times significantly increased. Our engineers began recycling the Australian web servers to clear queued requests that were never going to complete in time. This was completed by around 07:50, after which the performance of the diary began to return to normal.
After further investigation, we discovered that the problem was triggered by the application's backend database automatically choosing a different execution plan for a particular query. In this case, the new plan caused the database to fall back to retrieving information from disk instead of from memory, so queries took much longer to return the required information. Our working theory is that the new plan required much more memory to execute, which caused the database to free up memory that had previously been used to speed up other queries.
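To illustrate why a plan change matters so much, here is a minimal sketch, with invented relative costs (not measurements from our database), of how a query that switches from a cached index lookup to a disk-bound scan can become orders of magnitude more expensive:

```python
# Hypothetical cost model: relative cost units, chosen purely for illustration.
DISK_PAGE_COST = 100   # assumed cost of reading one page from disk
MEMORY_COST = 1        # assumed cost of one lookup served from the buffer cache

def index_plan_cost(rows_matched: int) -> int:
    """Plan A: index lookup answered from memory, one unit per matching row."""
    return MEMORY_COST * rows_matched

def scan_plan_cost(table_pages: int) -> int:
    """Plan B: full scan that reads every table page from disk."""
    return DISK_PAGE_COST * table_pages

# A query matching 50 rows in a 10,000-page table:
fast = index_plan_cost(50)       # 50 cost units
slow = scan_plan_cost(10_000)    # 1,000,000 cost units
print(f"plan switch made the query ~{slow // fast}x more expensive")
```

The exact numbers are arbitrary; the point is that plan cost scales with pages read from disk rather than rows matched, so a silent plan switch can turn a sub-millisecond query into one that dominates the database's I/O budget.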
When database queries started taking longer to process, a large backlog of requests built up on our web servers, which in turn put increased pressure on the database. When our engineers recycled the web servers, this reduced the pressure on the database and allowed it to begin performing more normally. The database continued to experience reduced performance until around 09:30, when it automatically switched back to a better-performing query plan, although most customer-facing impact had been resolved by around 07:50.
We understand the impact this incident caused for our restaurant customers, especially as we head into the busiest time of the year, and we apologise for it. To prevent an outage like this from happening again, we are taking the following steps: