On Aug 16, 2025 at 4:20am we experienced degraded performance, followed by a partial outage of roughly 4 hours (5:14am to 9:22am).
The issue was triggered by a series of restarts of individual replicas of our service. Such restarts are part of normal operations, but this time they skewed the load across the service's deployment. This escalated into a cascade of failures and, ultimately, downtime until we recovered manually.
We also lost some data during the cascade: caches on gridBoxes were flushed, but their contents were not always successfully processed before being stored in the cloud.
A sub-optimal allocation procedure in our code became a problem when load balancing triggered rapid, cross-replica reassignment of persistent connections tied to specific functionality. The affected replicas received disproportionate load and a backlog of waiting requests, which caused further restarts of those replicas and fed a cascade that eventually affected the deployment as a whole.
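To illustrate the failure mode (this is a hypothetical sketch, not our actual allocation code; replica counts, connection counts, and the capacity threshold are made-up numbers), consider what happens when a restarting replica's persistent connections are handed wholesale to a single surviving replica instead of being spread evenly:

```python
def cascade(replicas: int, conns_per_replica: int, capacity: int) -> list[int]:
    """Return the order in which replicas fail under naive wholesale reassignment.

    Hypothetical model: each replica holds `conns_per_replica` persistent
    connections; a replica whose load exceeds `capacity` fails health checks
    and restarts, handing ALL of its connections to one surviving replica.
    """
    load = [conns_per_replica] * replicas
    failed: list[int] = []
    candidate = 0  # one routine restart kicks things off
    while candidate is not None:
        failed.append(candidate)
        moved = load[candidate]
        load[candidate] = 0
        survivors = [i for i in range(replicas) if i not in failed]
        if not survivors:
            break
        # Skewed allocation: every connection lands on a single replica.
        target = survivors[0]
        load[target] += moved
        # Overloaded replica restarts too, continuing the cascade.
        candidate = target if load[target] > capacity else None
    return failed

# With 4 replicas at 100 connections each and headroom for only 150,
# one routine restart takes down every replica in turn.
print(cascade(replicas=4, conns_per_replica=100, capacity=150))  # [0, 1, 2, 3]
```

With enough per-replica headroom (or with reassignment spread across all survivors), the same routine restart stops after one replica instead of cascading, which is the behavior our allocation procedure should have had.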
We had to recover the affected systems manually, as neither the services nor the infrastructure could recover automatically from the cascading failures.