To our monday.com community,
On April 11th and 16th, we experienced some service interruptions on our US server that may have impacted your account. We want to share a more detailed update on what happened and how we plan to improve your experience moving forward.
The incidents resulted in a total of 59 minutes of system downtime and 62 minutes of degraded performance across the two dates. There was no data loss or security risk, and our EU and APAC servers were not affected.
We know that you rely on our platform and that any disruption impacts your workflow, and for that we deeply apologize. We take these interruptions very seriously, and we’ll continue to invest in preventing them from happening again.
The sections below describe each incident and the steps we are taking to improve your experience moving forward.
What happened
Thursday, April 11th – Status page retro
The root cause of the incident on April 11th was a rare issue in the infrastructure of one of our service providers, which resulted in 20 minutes of downtime.
Our monitoring system alerted our R&D teams to the issue, and we subsequently began moving the system to a backup. Technical difficulties during this move caused a further 62 minutes of system instability and read-only mode.
Since then, we have updated our system readiness processes and fine-tuned our backup servers, which enables us to recover faster and reduce interruptions.
Total downtime on the US server: 20 minutes
Total time of degraded performance on the US server: 62 minutes
No service interruption to EU and APAC servers
Tuesday, April 16th – Status page retro
In the early hours of Tuesday, we experienced 20 minutes of platform downtime. Our monitors detected the issue, and our teams worked to reboot the platform and begin investigating the cause.
During the investigation, a second issue occurred, causing the platform’s availability to fluctuate over the course of one hour and adding a further 19 minutes of downtime.
Our investigations confirmed that the issue was caused by a recently deployed monitoring service designed to flag large or complex queries. Although the new service had initially run smoothly within the platform, we identified it as the cause of the infrastructure lock and promptly reverted to an older version that had been operating for many months without any issues.
Total downtime on the US server: 39 minutes
No service interruption to EU and APAC servers
What’s next
Prevention and faster recovery
We continue to take a proactive approach to ensuring platform stability, with both immediate action items and long-term plans in place to reduce risk, improve our ability to recover quickly, and, most importantly, prevent platform instability through enhanced resilience.
Prevention & reduced risk:
- Ongoing improvements to monitoring abilities and system flags
- Comprehensive processes around implementation and deployment
- Continued investments to isolate core flows and make them independently resilient
Recovery & agility:
- Implementation of checks & procedures that enable much faster recovery
- Additional ready-to-go fallback systems
- Processes that will allow our teams to be more agile and make timely decisions
We’ll share progress on these action items in this blog post and via our X.com support account.
More communication
We understand that there’s room for improvement in how we share real-time updates with you, whether about planned maintenance or an unexpected issue. Our goal is for you to know what to expect and how to get more support from us in mitigating the impact on your daily work.
We recommend following our X.com support account and our Status Page, where updates are posted in real time.
Building a platform you can rely on
The bottom line is that the recent platform experience doesn’t align with what we aim to provide, and we sincerely apologize for the interruptions and any frustration or inconvenience caused. We continue to strive to deliver a platform that creates efficiency, brings impact, innovates along with you, and ultimately helps you meet your goals with ease.
The entire monday.com team and I thank you for your patience, ongoing trust, and understanding.
Sergei Liakhovetsky
VP Engineering, Infrastructure