All times in this report are in BST unless otherwise noted.
This report details the initial incident, caused by a CloudFlare outage, along with a follow-on incident affecting TheFork in Australia that the first incident triggered.
At 22:14 we began to receive alerts from our monitoring system that some of our sites were unavailable. At 22:22 two of our engineers began investigating and found that DNS resolution was failing for all of our sites. We verified that the sites themselves were still running correctly but could not be reached via their hostnames.
Because of this we began to suspect that CloudFlare were having an outage, and confirmed this by testing DNS resolution against both CloudFlare and our backup DNS provider (Azure DNS): queries to CloudFlare's nameservers were failing while queries to our backup provider were succeeding. At this point we also noticed that CloudFlare had declared a major incident, the details of which can be found here: https://www.cloudflarestatus.com/incidents/b888fyhbygb8.
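For illustration, this check amounts to querying the same hostname against each provider's nameservers directly, so a failure can be attributed to one resolver rather than to the sites themselves. A minimal sketch using the third-party dnspython package; the nameserver IPs are placeholders for the zone's actual assigned nameservers:

```python
# Minimal sketch of a per-provider DNS check using dnspython
# (pip install dnspython). The nameserver IPs below are placeholders;
# substitute the nameservers actually assigned to the zone.
import dns.resolver

NAMESERVERS = {
    "CloudFlare": ["192.0.2.1"],
    "Azure DNS": ["192.0.2.2"],
}

for provider, servers in NAMESERVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = servers
    resolver.lifetime = 5  # fail fast instead of hanging
    try:
        answer = resolver.resolve("www.resdiary.com", "A")
        print(f"{provider}: {[r.address for r in answer]}")
    except Exception as exc:
        print(f"{provider}: FAILED ({exc})")
```

A failure from only one provider's nameservers points at that resolver rather than at the records or the sites.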
At around 22:33, unsure how long it would take CloudFlare to resolve their issue, we decided to switch over to our backup provider. Our applications gradually became available again, and by around 23:00 all of them were functioning, although the recovery may have been due to CloudFlare resolving their incident rather than to our switch.
At 23:27 we received an alert that www.resdiary.com had become unavailable. We realised that it was unable to communicate with api.resdiary.com: that hostname is served by a CloudFlare load balancer that provides geographic load balancing, so Azure DNS had no record for api.resdiary.com and the name did not resolve via our backup provider.
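This is the kind of gap a routine parity check between the primary and backup zones would catch. A hedged sketch, again using dnspython; the hostname list and nameserver IP are illustrative:

```python
# Sketch of a backup-zone parity check using dnspython. The nameserver
# IP and hostname list are placeholders. A hostname served only by a
# CloudFlare load balancer, like api.resdiary.com, would fail this check.
import dns.resolver

BACKUP_NAMESERVERS = ["192.0.2.2"]
HOSTNAMES = ["www.resdiary.com", "api.resdiary.com", "au.resdiary.com"]

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = BACKUP_NAMESERVERS

for hostname in HOSTNAMES:
    try:
        resolver.resolve(hostname, "A")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        print(f"{hostname}: no record in the backup zone")
```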
On Monday 13/07/2020, as part of a migration to new infrastructure, we added new HA Proxy load balancers in front of our au.resdiary.com servers. Unfortunately a configuration error in these load balancers prevented the original client IP addresses from being reported to our backend servers. The issue caused no problems while we were using CloudFlare DNS, because requests were routed via CloudFlare infrastructure that correctly set the original client IP address.
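The report doesn't reproduce the configuration itself, but a common form of this mistake in an HTTP setup is omitting `option forwardfor`, which tells HA Proxy to pass the client address to the backend in an X-Forwarded-For header. A minimal sketch, assuming an HTTP listener; all names, paths, and addresses are placeholders rather than the actual au.resdiary.com configuration:

```
# Illustrative haproxy.cfg fragment; names, paths, and addresses
# are placeholders, not the actual au.resdiary.com config.
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend au_resdiary
    bind :443 ssl crt /etc/haproxy/certs/example.pem
    default_backend au_app

backend au_app
    # Without this directive the backend servers only ever see the
    # load balancer's own address as the request source.
    option forwardfor
    server app1 10.0.0.11:443 ssl verify none check
    server app2 10.0.0.12:443 ssl verify none check
```

Alternatively, HA Proxy can pass the client address at the connection level via the PROXY protocol (`send-proxy` on the server lines) if the backend supports it.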
Unfortunately, when we switched over to our backup DNS provider, some requests were routed via CloudFlare while others were routed directly to our HA Proxy load balancers. The majority of requests were unaffected, but the Reservation Service requires the IP address of a request to match the IP that originally requested the token. Requests could therefore fail if the initial token request was routed via CloudFlare and subsequent requests were routed directly to us, or vice versa.
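To make the failure mode concrete, here is an illustrative sketch of IP-bound token validation. It is not the Reservation Service's actual code, and the addresses are documentation IPs:

```python
# Illustrative sketch of IP-bound token validation; not the actual
# Reservation Service implementation. The "client IP" is whatever
# address the backend observes, which depends on the routing path.
import secrets

issued_tokens = {}  # token -> IP address it was issued to

def issue_token(client_ip):
    token = secrets.token_hex(16)
    issued_tokens[token] = client_ip
    return token

def validate(token, client_ip):
    # Returns False (an HTTP 401 in practice) when the token was issued
    # via one network path and this request arrived via another, because
    # the observed client IPs differ.
    return issued_tokens.get(token) == client_ip

token = issue_token("198.51.100.7")        # first request via CloudFlare
assert validate(token, "198.51.100.7")     # same path: succeeds
assert not validate(token, "203.0.113.9")  # direct path: rejected with 401
```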
At 09:08 the following morning we disabled CloudFlare proxying for au.resdiary.com so that all traffic was routed directly to us. With a single routing path the source IP was consistent, allowing requests to succeed. Shortly after this our partners at TheFork confirmed that the error rate was dropping, and by 09:29 we were no longer seeing 401s from the Reservation Service.
We will be taking the following actions as a result of this incident: