Postmortem
We would like to extend our sincerest apologies to all partners in the North American region who may have experienced downtime between March 29th and April 1st, 2022, as we understand how critical IT Glue is to your organization.
At approximately 5:15PM (PDT) on Tuesday, March 29th, AWS performed an unplanned database upgrade outside of our regular maintenance window, which caused a brief 3-minute outage. This abrupt event triggered the cascade of events that followed.
At approximately 5:30AM (PDT) on Thursday, March 31st, an AWS cloud storage failure affected our database instances and increased IT Glue response times in the North America region. AWS engineers have confirmed the cloud storage failure in their correspondence with us. The failure caused intermittent but widespread outages for customers, which continued until 11AM (PDT).
After resolving the issue together with the AWS Technical Account Manager and support engineers, we failed over the primary database to a healthy instance and observed a successful recovery. After multiple confirmations from AWS engineers that the faulty cloud storage volumes had been removed, we decided it was safe to switch back to our primary (most scalable) database to ensure we were back to optimal performance, and scheduled this activity during the maintenance window between 3PM and 7PM (PDT) on March 31st.
However, despite multiple confirmations from Amazon, the database did not ramp back up to expected performance levels. This prolonged the maintenance window and caused intermittent outages that continued until 10AM (PDT) on April 1st.
While this issue is very unlikely to recur, business continuity for our partners is IT Glue’s TOP priority. Always striving for 100% uptime, we have already developed a preventative plan to optimize database performance in relation to the AWS cloud storage failure and to avoid similar outages in the future. We are also actively investigating the performance and reliability gains offered by clustered database products that have become available since the current database design was put into production.
Any incident that may cause interruption for our partners is always addressed with a rigorous internal review process. We hope our preventative plan and remediation efforts will further enhance our uptime and your IT Glue experience. As always, please subscribe to status.itglue.com to get the latest system status updates.
Lastly, we understand that your information is the most valuable asset of your organization. You can take advantage of the following features so you always have access to up-to-date IT Glue data:
We appreciate your continued support and partnership.