At around 07:15 GMT, automated alerts started firing, indicating that the Australian diary application was experiencing performance problems. Unfortunately, because of a configuration problem that has since been resolved, our on-call engineers did not notice the alerts until 07:25.
Our engineers investigated and found that a large backlog of requests had built up and was not being processed in a timely manner. At 07:35 they decided to recycle one of the web servers to allow it to return to a healthy state. Initially it looked like this might have been enough to solve the problem, but after 10 minutes they decided to recycle the other web servers as well.
After doing this, all the servers began to process requests normally, and by 07:48 the system was operating normally again.
Unfortunately, the root cause of this problem is not yet clear. We are continuing to investigate, and have begun to implement mitigating measures to help prevent the servers from becoming overloaded like this.
Update 27/11/2018
After investigating a similar incident that occurred on Thursday 22/11/2018, we believe this incident was triggered by a problem with the backend database server for the Australian diary. See https://status.resdiary.com/incidents/1mktl1r85ps7 for more details, including the steps we are taking to mitigate this in future.