a year ago
Explanation of the Recent Production Outage - July 15th
During our recent upgrade, we faced configuration issues that impacted production. Key problems included:
- Stricter Security Defaults: New security defaults and admission controller changes led to permission failures.
- Configuration Discrepancies: Differences in parameters caused node instability and scheduling issues.
Our infrastructure team worked closely with our development team and brought all services back at 6.30pm EST. We have escalated and are in discussion with AWS Support to prevent future occurrences. Thank you for your understanding.
Sincere apologies to our customers for the outage.