To our monday.com community,
We are writing with an update on our recent platform outages. Our team worked relentlessly to resolve the issues, and the platform is now fully restored. We want to begin with a sincere apology for the unexpected disruption to your work. We know you and your team rely heavily on our platform, and for so many of you, it is the core workspace for all your business processes. This is not the experience we aspire to provide, and we remain committed to doing our best to ensure it doesn’t happen again.
To give you full transparency into what happened, we have outlined a detailed report of our findings below.
Before we dive into the details, it is important to clarify that these incidents were not caused by a cyberattack or a failure of our systems to scale, and none of our customers experienced any data loss.
On Monday and Tuesday, we encountered two significant system outages. The monday.com platform was down for a combined total of 2 hours and 56 minutes. While the system outages occurred on two consecutive days, these were two separate and unrelated incidents: connectivity issues related to our US region servers and downtime of one of our main databases. Throughout this situation, as always, we kept our user base up to date via our status page: https://status.monday.com.
Because the downtime was intermittent and spread over a longer period, many of you were unable to use the platform for most of the day. This undoubtedly added strain to an already difficult situation, and for this we are truly sorry.
Connectivity issues related to our US region servers
On Monday, April 11 at 8:56 AM EST, a mistake was made in configuring our production environment as part of a routine maintenance procedure.
How did we solve it?
As soon as our team became aware of the incident, we immediately worked on identifying the underlying cause. Once the root cause was discovered, we reverted the changes made in the production environment and deployed the fix to the entire system, resolving the issue.
Total downtime: 1 hour and 2 minutes
Main database downtime
On Tuesday, April 12 at 8:50 AM EST, one of our main databases stopped serving requests for no apparent reason and became non-operational almost immediately. Since this particular database is essential for viewing and changing data, the platform went down.
The moment the database went down, our engineering team started investigating the root cause and restoring the service. Since we weren’t able to identify the root cause initially, we took several mitigation steps to restore the service, including disabling non-critical flows, replacing the affected database, and expanding our production environment with additional backup servers.
Although those steps brought our service back up, they did not address the root cause, which led to 3 additional consecutive periods of downtime.
Total downtime: 1 hour and 54 minutes (intermittent over a 10-hour period)
How did we solve it?
Our team took a divide-and-conquer approach to finding the root cause, acting on several potential sources to reach a solution as fast as possible.
The same day, around 8:00 PM EST, we found the root cause: a rare and very specific combination of system events, errors, and processes that locked our main database, causing the downtime of our platform. These events and errors included multiple “floating” transactions that were opened and not finalized, which, along with other processes in the background, caused a large number of server restarts. Combined with a few more conditions and misconfigurations, these occurrences caused one of our databases to lock.
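For readers who want a concrete picture of how a transaction that is opened and never finalized can block other work, here is a minimal sketch. It is an illustration only: it uses SQLite from the Python standard library as a stand-in, not the database involved in this incident, and the table name and the function `demo_unfinished_transaction` are hypothetical. Locking behavior differs between database engines, so treat this as a simplified analogy rather than a reproduction of what happened.

```python
import os
import sqlite3
import tempfile


def demo_unfinished_transaction() -> tuple:
    """Return (blocked_while_open, recovered_after_rollback).

    Illustrative only: SQLite stands in for the real database.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None means we issue BEGIN/ROLLBACK ourselves.
        holder = sqlite3.connect(path, isolation_level=None)
        # A short timeout so the second writer gives up quickly.
        other = sqlite3.connect(path, timeout=0.1, isolation_level=None)
        holder.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")

        # Open a write transaction and leave it unfinished ("floating").
        holder.execute("BEGIN IMMEDIATE")
        holder.execute("INSERT INTO items (name) VALUES ('pending')")

        # A second connection cannot write while the first holds the lock.
        try:
            other.execute("INSERT INTO items (name) VALUES ('blocked')")
            blocked = False
        except sqlite3.OperationalError:
            blocked = True

        # Finalizing the transaction (here, a rollback) releases the lock.
        holder.execute("ROLLBACK")
        other.execute("INSERT INTO items (name) VALUES ('ok')")
        rows = other.execute("SELECT name FROM items").fetchall()

        holder.close()
        other.close()
        return blocked, rows == [("ok",)]


if __name__ == "__main__":
    print(demo_unfinished_transaction())
```

In this sketch the second writer fails until the first transaction is finalized, which is the general reason unfinished transactions are worth detecting and closing early.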
To gain confidence in our root cause analysis and resolve the incident, our teams acted in several directions:
- Deployed fixes to the errors that caused our servers to restart multiple times, and added another layer of defense against events of this type to prevent our databases from locking in the future.
- Improved our monitoring of server restarts, the conditions under which our databases may be locked, and the script that got stuck during this incident, which we tested in the most extreme scenarios to ensure stability.
- Removed system processes that were part of the cause of this event.
- Extended our production environment to offer a more solid backup in case of recurrent downtime.
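As a rough illustration of what monitoring server restarts can look like, the sketch below counts restarts inside a sliding time window and flags when the count exceeds a threshold. This is an assumption-laden example, not a description of our actual tooling: the class name `RestartMonitor`, the threshold, and the window length are all hypothetical.

```python
from collections import deque


class RestartMonitor:
    """Flag when more than `max_restarts` occur within `window_seconds`.

    Hypothetical sketch; names and thresholds are illustrative assumptions.
    """

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._events = deque()  # restart timestamps, oldest first

    def record_restart(self, timestamp: float) -> bool:
        """Record a restart and return True if the alert threshold is exceeded."""
        self._events.append(timestamp)
        # Drop restarts that have fallen out of the sliding window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) > self.max_restarts


if __name__ == "__main__":
    monitor = RestartMonitor(max_restarts=3, window_seconds=60)
    for t in (0, 10, 20, 30, 200):
        print(t, monitor.record_restart(t))
```

With a threshold of 3 restarts per 60 seconds, the fourth restart in quick succession raises the flag, while an isolated restart much later does not.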
At 9:13 AM EST on April 13, 2022, after we gained confidence in our root cause analysis and the fixes that were deployed, we officially closed our incident mode.
Our commitment to our customers
Our top priority is maintaining your trust and confidence in our platform. I would like to take a moment to acknowledge my team’s outstanding efforts in resolving this situation as quickly as possible. Resolving these outages was truly a team effort. I am dedicating my team’s resources to taking all necessary actions to prevent these issues from recurring and to mitigate their impact, including:
- Additional checks and procedures added to the deployment process
- Further investigations with our third-party cloud computing service provider
- Dedicated task force to ensure the continued stability of our platform beyond our regular monitoring procedures
All of us at monday.com are learning from this experience so that we can improve our communication around any future occurrences. We hope this post answers your questions, and we look forward to continuing to do incredible work together.