All times in this report are in BST unless otherwise noted.
This report details the initial incident, caused by a CloudFlare outage, along with a follow-on incident affecting TheFork in Australia that the first incident triggered.
At 22:14 we began to receive alerts from our monitoring system that some of our sites were unavailable. At 22:22 two of our engineers began investigating and found that DNS resolution was failing for all of our sites. We verified that the sites themselves were still running correctly but could not be reached via their hostnames.
Because of this we began to suspect that CloudFlare were having an outage, and confirmed this by testing DNS resolution against both CloudFlare and our backup DNS provider (Azure DNS): queries to CloudFlare's nameservers were failing while queries to our backup provider were succeeding. At this point we also noticed that CloudFlare had declared a major incident, the details of which can be found here: https://www.cloudflarestatus.com/incidents/b888fyhbygb8.
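For illustration, this check amounts to querying the same hostname against each provider's nameservers directly, so a failure can be attributed to one resolver rather than to the sites themselves. A minimal sketch using the third-party dnspython package; the nameserver IPs are placeholders for the zone's actual assigned nameservers:

```python
# Minimal sketch of a per-provider DNS check using dnspython
# (pip install dnspython). The nameserver IPs below are placeholders;
# substitute the nameservers actually assigned to the zone.
import dns.resolver

NAMESERVERS = {
    "CloudFlare": ["192.0.2.1"],
    "Azure DNS": ["192.0.2.2"],
}

for provider, servers in NAMESERVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = servers
    resolver.lifetime = 5  # fail fast instead of hanging
    try:
        answer = resolver.resolve("www.resdiary.com", "A")
        print(f"{provider}: {[r.address for r in answer]}")
    except Exception as exc:
        print(f"{provider}: FAILED ({exc})")
```

A failure from only one provider's nameservers points at that resolver rather than at the records or the sites.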
At around 22:33, unsure how long it would take CloudFlare to resolve their issue, we decided to switch over to our backup provider. Our applications gradually became available again, and by around 23:00 all of them were functioning, although the recovery may have been due to CloudFlare resolving their incident rather than to our switch.
At 23:27 we received an alert that www.resdiary.com had become unavailable. We realised that it was unable to communicate with api.resdiary.com: that hostname is served by a CloudFlare load balancer that provides geographic load balancing, so Azure DNS had no record for api.resdiary.com and the name did not resolve via our backup provider.
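This is the kind of gap a routine parity check between the primary and backup zones would catch. A hedged sketch, again using dnspython; the hostname list and nameserver IP are illustrative:

```python
# Sketch of a backup-zone parity check using dnspython. The nameserver
# IP and hostname list are placeholders. A hostname served only by a
# CloudFlare load balancer, like api.resdiary.com, would fail this check.
import dns.resolver

BACKUP_NAMESERVERS = ["192.0.2.2"]
HOSTNAMES = ["www.resdiary.com", "api.resdiary.com", "au.resdiary.com"]

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = BACKUP_NAMESERVERS

for hostname in HOSTNAMES:
    try:
        resolver.resolve(hostname, "A")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        print(f"{hostname}: no record in the backup zone")
```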
On Monday 13/07/2020, as part of a migration to new infrastructure, we added new HA Proxy load balancers in front of our au.resdiary.com servers. Unfortunately a configuration error in these load balancers prevented the original client IP addresses from being reported to our backend servers. The issue caused no problems while we were using CloudFlare DNS, because requests were routed via CloudFlare infrastructure that correctly set the original client IP address.
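The report doesn't reproduce the configuration itself, but a common form of this mistake in an HTTP setup is omitting `option forwardfor`, which tells HA Proxy to pass the client address to the backend in an X-Forwarded-For header. A minimal sketch, assuming an HTTP listener; all names, paths, and addresses are placeholders rather than the actual au.resdiary.com configuration:

```
# Illustrative haproxy.cfg fragment; names, paths, and addresses
# are placeholders, not the actual au.resdiary.com config.
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend au_resdiary
    bind :443 ssl crt /etc/haproxy/certs/example.pem
    default_backend au_app

backend au_app
    # Without this directive the backend servers only ever see the
    # load balancer's own address as the request source.
    option forwardfor
    server app1 10.0.0.11:443 ssl verify none check
    server app2 10.0.0.12:443 ssl verify none check
```

Alternatively, HA Proxy can pass the client address at the connection level via the PROXY protocol (`send-proxy` on the server lines) if the backend supports it.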
Unfortunately, when we switched over to our backup DNS provider, some requests were routed via CloudFlare while others were routed directly to our HA Proxy load balancers. The majority of requests were unaffected, but the Reservation Service requires the IP address of a request to match the IP that originally requested the token. Requests could therefore fail if the initial token request was routed via CloudFlare and subsequent requests were routed directly to us, or vice versa.
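To make the failure mode concrete, here is an illustrative sketch of IP-bound token validation. It is not the Reservation Service's actual code, and the addresses are documentation IPs:

```python
# Illustrative sketch of IP-bound token validation; not the actual
# Reservation Service implementation. The "client IP" is whatever
# address the backend observes, which depends on the routing path.
import secrets

issued_tokens = {}  # token -> IP address it was issued to

def issue_token(client_ip):
    token = secrets.token_hex(16)
    issued_tokens[token] = client_ip
    return token

def validate(token, client_ip):
    # Returns False (an HTTP 401 in practice) when the token was issued
    # via one network path and this request arrived via another, because
    # the observed client IPs differ.
    return issued_tokens.get(token) == client_ip

token = issue_token("198.51.100.7")        # first request via CloudFlare
assert validate(token, "198.51.100.7")     # same path: succeeds
assert not validate(token, "203.0.113.9")  # direct path: rejected with 401
```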
At 09:08 the following morning we disabled CloudFlare proxying for au.resdiary.com so that all traffic was routed directly to us. With a single routing path the source IP was consistent, allowing requests to succeed. Shortly after this our partners at TheFork confirmed that the error rate was dropping, and by 09:29 we were no longer seeing 401s from the Reservation Service.
We will be taking the following actions as a result of this incident: