Between 18:02 and 18:10 on Friday 1 March 2019, an incident reduced performance for all diaries hosted on our UK servers (uk.resdiary.com). During this period customers would have had difficulty accessing their diaries and making bookings, and API integrations were also affected.
At around 18:03, health checks between our web application and its database began to fail, and as a result all of the web servers were removed from their load balancer pool. From that point until the health checks began to succeed again at around 18:07, customers would have been unable to access their diaries.
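To illustrate why a single database outage can remove every web server from the pool at once, the sketch below models a health check that depends on a shared database. It is a simplified illustration, not our actual configuration: the function names (`health_check`, `servers_in_pool`), the server names and the status codes are all assumptions made for the example.

```python
from typing import Callable, Dict, List


def health_check(database_ping: Callable[[], bool]) -> int:
    """Hypothetical health endpoint: 200 if the database responds, 503 otherwise."""
    try:
        return 200 if database_ping() else 503
    except Exception:
        # Any database error is reported to the load balancer as "unhealthy".
        return 503


def servers_in_pool(checks: Dict[str, Callable[[], int]]) -> List[str]:
    """Hypothetical load balancer rule: keep only servers whose check returns 200."""
    return [name for name, check in checks.items() if check() == 200]


def build_checks(names: List[str], ping: Callable[[], bool]) -> Dict[str, Callable[[], int]]:
    """Every server runs the same check against the same shared database."""
    return {name: (lambda: health_check(ping)) for name in names}


def database_up() -> bool:
    return True


def database_down() -> bool:
    raise ConnectionError("database unavailable during failover")


print(servers_in_pool(build_checks(["web1", "web2"], database_up)))    # ['web1', 'web2']
print(servers_in_pool(build_checks(["web1", "web2"], database_down)))  # []
```

Because every server's check depends on the same database, the checks fail together and the pool empties. They also recover together once the database responds again, which matches the automatic recovery we saw.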
After further investigation and communication with the Microsoft Azure support team, we discovered that Microsoft had performed a failover that migrated our database within the Azure data centre. The database was unavailable for around 222 seconds (roughly 3 minutes 42 seconds) during the failover.
Once the database became available again, our system recovered automatically, following a short period of reduced performance while it worked through a backlog of requests.
While investigating this issue, we also identified a problem with a particular database query. Although this query was not the cause of the outage, it may have reduced the performance of the database. We have made some configuration changes to the database, which appear to have improved performance, and we are currently investigating whether we can tune the misbehaving query.
We would like to apologise to any customers affected by this outage, and assure you that we are investigating ways to prevent a situation like this from happening in future.