SlackerTalk 2018-02-08 Outage: Root Cause Analysis and Post Mortem -- (edited)
Posted by
Beryllium (aka grayman)
Feb 8 '18, 08:50
|
At about 7:30, the server rebooted due to scheduled maintenance with the ISP.
When it came up, things went poorly.
It seems that the HTTP server didn't load properly, which I fixed with a manual intervention.
When HTTP came back up, HTTPS was supposed to - but it didn't. This turned out to be because the firewall re-read its configuration from disk and said "Ignore everything except HTTP", so I had to manually add a rule to allow HTTPS.
Total downtime (including partial downtime): about 1 hour 8 minutes.
Root Cause: Configuration is not stable enough to withstand reboots.
|
Responses:
|