Degraded performance of Diary application for ASSE

Incident Report for ResDiary

Postmortem

All times in this report are in SGT.

At 21:25 on 21/03/2019, an automatic configuration change was made to one of our SQL Azure databases. From that point onward, database performance was slightly reduced, but not enough to cause immediate user-visible problems. At around 09:00 on 22/03/2019, CPU usage on the database began to climb, reaching almost 80% by around 10:00.

At 12:57, one of our engineers received a call reporting a problem. After investigating, he scaled up the database at 13:35 to give it more capacity. Once the scaling operation completed at 13:37, the database began performing normally and normal operations resumed.

Further investigation suggests that the database was not under-provisioned: over the past month, its CPU usage has been consistently below 40%. This leads us to believe that the automated configuration change performed by Azure caused the incident.

We will be implementing additional monitoring for our databases over the next few weeks so that we are notified of issues like this well in advance of them becoming a problem. We are also currently in communication with the Azure support team to try to find out exactly what happened, and how we can prevent it from happening in future, since the problem appears to have been triggered by a configuration change at their end.
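As an illustration of the kind of early-warning check described above, the following is a minimal sketch only, not our actual monitoring setup. It flags the point at which CPU usage has stayed above a threshold for several consecutive samples. The function name, the 60% threshold (chosen here only because it sits between the usual sub-40% baseline and the near-80% peak), the three-sample window, and the sample values are all assumptions made for the example.

```python
from typing import Optional, Sequence


def first_sustained_breach(
    cpu_percent: Sequence[float],
    threshold: float = 60.0,  # hypothetical alert level, not a real setting
    sustain: int = 3,         # hypothetical number of consecutive samples
) -> Optional[int]:
    """Return the index of the sample at which CPU usage has been above
    `threshold` for `sustain` consecutive samples, or None if it never is."""
    run = 0
    for i, value in enumerate(cpu_percent):
        # Extend the current run on a breach; reset it otherwise.
        run = run + 1 if value > threshold else 0
        if run >= sustain:
            return i
    return None


# Illustrative values only; these are not measurements from the incident.
samples = [35.0, 38.0, 62.0, 41.0, 65.0, 70.0, 78.0]
alert_index = first_sustained_breach(samples)
```

Requiring several consecutive samples above the threshold, rather than a single one, is one way to avoid alerting on short spikes while still raising a notification before usage reaches a level that affects users.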

We sincerely apologise for any inconvenience caused by these issues, and we assure you that we are actively working to prevent problems like this from happening in future.

Posted Mar 27, 2019 - 10:57 UTC

Resolved

We have been monitoring the affected system over the last three hours and it has been performing normally. We will continue monitoring and provide a post-mortem as soon as we have more information.

Posted Mar 22, 2019 - 09:58 UTC

Monitoring

Engineers have applied a fix to the SQL database and will continue to monitor the results. We will provide a further update once they have concluded a full investigation.

Posted Mar 22, 2019 - 06:34 UTC

Identified

Engineers have noticed a problem with one of our SQL Server databases. We are working to identify and resolve the cause of the problem.

Posted Mar 22, 2019 - 05:44 UTC

Investigating

We are currently investigating the issue.

Posted Mar 22, 2019 - 05:13 UTC

This incident affected: ResDiary Application (S.E. ASIA).