Overview of the Incident: On October 24, 2023, during scheduled network maintenance at our data center, servers outside the planned scope were unexpectedly taken offline, causing significant service disruption on the Joomag platform. This postmortem outlines the events, our immediate response, and the steps we are taking to prevent similar incidents.
Sequence of Events:
- Planned Activity: As part of our routine maintenance, one server was scheduled to be taken offline and relocated to another rack.
- Unexpected Downtime: Contrary to our plans and preparations, four additional servers were inadvertently taken offline.
- Service Impact: This caused downtime for Joomag and affected multiple platform functions.
- Data Corruption: We encountered data corruption on one MySQL server and one ClickHouse server.
- Recovery Measures: Both servers were restored from backups with no data loss.
- Restoration Time: Restoring full service took approximately 6 hours in total.
Investigation and Current Understanding:
- We are actively investigating the root cause of the additional servers going offline.
- Initial assessments point to a possible procedural error or miscommunication during the maintenance, but analysis is ongoing.
Corrective and Preventive Actions:
- We took immediate steps to restore the affected services.
- We are revising our maintenance protocols to enhance safety and reliability.
- We are working closely with our hosting provider to avoid such incidents in the future.
Commitment to Service Excellence: We apologize for the inconvenience caused by this incident and appreciate your patience and understanding. Our dedication to providing dependable and high-quality services is stronger than ever. We will keep you informed about our progress and findings as we continue our investigation.