Measurement API unavailable

Incident Report for gridX GmbH

Postmortem

On Aug 16, 2025, starting at 4:20am, we experienced degraded performance, followed by a partial outage of about 4 hours (5:14am to 9:22am).

The issue was triggered by a series of restarts of individual replicas of our service. Such restarts are part of normal operations, but in this case they skewed the load across the service's deployment. The skew eventually set off a cascade of failures that resulted in downtime until we recovered the service manually.

We lost some data during this cascading failure: caches on gridBoxes were emptied, but their contents were not always successfully processed before being cached in the cloud.

Root Cause

A sub-optimal allocation procedure in our code became a problem when load balancing caused rapid, cross-replica reassignment of persistent connections related to specific functionality. This put disproportionate load on the affected replicas and caused waiting requests to pile up on them, which led to further restarts of those replicas and ultimately to a cascade that affected the deployment as a whole.

Resolution

We had to recover the affected systems manually, as the services and infrastructure were unable to recover automatically from the cascading failures.

Action Items

  • Optimise the allocation procedure to limit the impact of connection reassignments
  • Reduce load balancer skew by removing unary methods from the target group matcher
  • Prevent the large allocation in the first place by consuming streams in batches
  • Fail faster to improve backoff behaviour and reduce the potential for cascading failures
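
As an illustration of the last two action items, the following is a minimal Python sketch. It is not the code running in our services. It assumes a generic iterable stream and a generic operation that raises ConnectionError on failure. All names (batched, backoff_delays, call_with_retries) and parameters are hypothetical. It shows how consuming a stream in fixed-size batches bounds the size of each allocation, and how a capped, jittered retry schedule with a fixed number of attempts lets a caller give up quickly instead of piling up waiting requests.

```python
import itertools
import random
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(stream: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield lists of at most batch_size items.

    Memory per step is bounded by batch_size, not by the stream length.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(stream)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def backoff_delays(base: float, cap: float, attempts: int,
                   rng: random.Random) -> List[float]:
    """Capped exponential backoff with full jitter.

    The n-th delay is drawn uniformly from [0, min(cap, base * 2**n)].
    """
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]


def call_with_retries(operation: Callable[[], T], delays: List[float],
                      sleep: Callable[[float], None]) -> T:
    """Call operation up to len(delays) + 1 times, sleeping between tries.

    Once the retry budget is used up, the last error is re-raised, so the
    caller fails fast instead of retrying indefinitely.
    """
    for delay in delays:
        try:
            return operation()
        except ConnectionError:
            sleep(delay)  # hypothetical failure type; wait, then retry
    return operation()  # final attempt; any error propagates to the caller
```

In this sketch the sleep function is passed in, so that the retry schedule can be exercised without real waiting. A production caller would typically pass time.sleep and combine the retry budget with a per-request deadline.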
Posted Aug 26, 2025 - 09:56 CEST

Resolved

Data has been backfilled - we are back to normal operations.
Posted Aug 16, 2025 - 15:00 CEST

Update

We continue to monitor the situation. Our systems are continuing to backfill historic data automatically.
Posted Aug 16, 2025 - 11:25 CEST

Monitoring

We had an issue in measurement processing starting at 5:30am CEST, which has since been resolved. We are currently monitoring the situation. Some historic data appears to have been lost; we are backfilling the available data.
Posted Aug 16, 2025 - 09:47 CEST
This incident affected: Private gridBox API.