Scheduled Maintenance - IT Glue and MyGlue - NA March 31st, 2022 3:00pm - April 1st, 2022 10:00am PDT
Scheduled Maintenance Report for IT Glue
Postmortem

We would like to extend our sincerest apologies to all partners in the North American region who may have experienced downtime between March 29th and April 1st, 2022, as we understand how critical IT Glue is to your organization.

At approximately 5:15PM (PDT) on Tuesday, March 29th, AWS performed an unplanned database upgrade outside of our regular maintenance window, which caused a brief 3-minute outage. This abrupt change set off the cascade of events described below.

At approximately 5:30AM (PDT) on Thursday, March 31st, an AWS cloud storage failure affected our database instances and increased IT Glue response times in the North America region. AWS engineers confirmed the cloud storage failure in their correspondence with us. The failure caused intermittent but widespread outages for customers that continued until 11AM (PDT).

After resolving the issue with our AWS Technical Account Manager and support engineers, we failed over the primary database to a healthy instance and observed a successful recovery. After multiple confirmations from AWS engineers that the faulty cloud storage volumes had been removed, we decided it was safe to switch back to our primary (most scalable) database to return to optimal performance, and scheduled this activity for the maintenance window between 3 pm and 7 pm PDT on March 31st.

However, despite multiple confirmations from Amazon, the database did not ramp back up to expected performance levels. This prolonged the maintenance window and caused intermittent outages that continued until 10 am (PDT) on April 1st.

While this issue is very unlikely to repeat in the future, business continuity for our partners is IT Glue’s TOP priority. Always striving for 100% uptime, we have already developed a preventative plan to optimize database performance in light of the AWS cloud storage failure and to avoid related outages in the future. We are also actively investigating the performance and reliability gains offered by clustered database products that have become available since the current database design was put into production.

Any incident that may cause interruption for our partners is always addressed with a rigorous internal review process. We hope our preventative plan and remediation efforts will further enhance our uptime and your IT Glue experience. As always, please subscribe to status.itglue.com to get the latest system status updates.

Lastly, we understand that your information is the most valuable asset of your organization. You can take advantage of the following features so you always have access to up-to-date IT Glue data:

  1. CSV Export
  2. Runbook
  3. API Endpoint for Automated Account Export (NEW)

We appreciate your continuous support and partnership.

Posted Apr 01, 2022 - 15:33 PDT

Completed
The scheduled maintenance has been completed.
Posted Apr 01, 2022 - 10:00 PDT
Update
Users will no longer see any 502/500 errors, as IT Glue is now running stably. We apologize for any inconvenience caused by the errors you may have encountered during the maintenance.

Maintenance will end at 10:00 am PDT. We expect any delayed transactions to catch up over time. A full RCA will be shared with our customers once ready.

We thank you for your continued patience and partnership.
Posted Apr 01, 2022 - 09:46 PDT
Update
Scheduled maintenance is in progress. During the course of maintenance, you may see 502 Bad Gateway / 500 errors. We apologize for any inconvenience this may cause.
Posted Apr 01, 2022 - 07:53 PDT
Update
Maintenance progressing well. We will provide our next update at 9:00 am PDT.

During this time, users should be able to access the application, as we actively monitor. Thank you for your patience as we work hard behind the scenes.
Posted Apr 01, 2022 - 05:51 PDT
Update
We will continue to work on the maintenance overnight, as certain tasks are taking longer than initially forecasted. We will provide our next update on April 1st, 2022 at 6:00 am PDT.

During this time, users should be able to access the application, as we actively monitor. Thank you for your patience as we work hard behind the scenes.
Posted Apr 01, 2022 - 00:00 PDT
Update
We continue to make good progress on the maintenance and will have another update in 2 hours.

During this time, some users may experience sporadic degraded performance or 502 errors in the app. We apologize for any inconvenience this may cause.
Posted Mar 31, 2022 - 23:16 PDT
Update
We are making progress on the maintenance. We will have a further update at 11:30 pm Pacific.

During this time, some users may experience degraded performance or see 502 errors in the app. We apologize for any inconvenience this may cause.
Posted Mar 31, 2022 - 22:25 PDT
Update
Our maintenance is running longer than expected, so we need to extend the window by 2 more hours. Our team is engaged and working hard to finish as soon as possible. Thank you for your patience.
Posted Mar 31, 2022 - 20:47 PDT
Update
Scheduled maintenance is in progress. During the course of maintenance, we may experience extended downtime where you may see 502 Bad Gateway errors. As such we are extending the maintenance window by another hour.

Thank you for your patience as we work hard to finish the maintenance. We will provide more updates as we have them and apologize for any inconvenience.
Posted Mar 31, 2022 - 19:55 PDT
Update
Scheduled maintenance is in progress. During the course of maintenance, we may experience extended downtime where you may see 502 Bad Gateway errors. As such we are extending the maintenance window by another hour.

Thank you for your patience as we work hard to finish the maintenance. We will provide more updates as we have them and apologize for any inconvenience.
Posted Mar 31, 2022 - 18:47 PDT
Update
We are extending the maintenance window by another hour. We expect it to complete by 7:00 pm PDT.
Posted Mar 31, 2022 - 17:47 PDT
Update
We are extending the maintenance window by another hour, until 6:00 pm PDT.
Posted Mar 31, 2022 - 16:49 PDT
Update
We are extending the maintenance window for another hour. We expect it to complete at 5:00 pm PDT.
Posted Mar 31, 2022 - 16:00 PDT
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Mar 31, 2022 - 15:00 PDT
Scheduled
We will be undergoing planned emergency database maintenance on March 31st, 2022 from 3:00pm - 4:00pm PDT.

IT Glue and MyGlue will be unavailable to our North American Datacenter customers for 5 minutes during this time. Post maintenance some users may see nominal delays in the app, which will improve over time. We apologize for any inconvenience caused due to the maintenance.
Posted Mar 31, 2022 - 13:39 PDT
This scheduled maintenance affected: North America Primary Application Services and North America Primary File Services.