Live Measurements unavailable

Incident Report for gridX GmbH

Postmortem

One of the action items from the recent raw-measurement processing incidents was a separate path for live calculations. Live calculations now use a different processing pipeline that looks only at the last 5 minutes of data. If an incident occurs, the new path allows the live view to recover quickly.
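To illustrate the idea of a live path that only considers recent data, here is a minimal sketch in Python. The function name, the dictionary layout of a measurement, and the `timestamp` field are assumptions made for illustration; they do not describe the actual pipeline.

```python
from datetime import datetime, timedelta, timezone

# Window length taken from the postmortem; everything else here is hypothetical.
LIVE_WINDOW = timedelta(minutes=5)


def within_live_window(measurements, now):
    """Return only the measurements whose timestamp lies in the last 5 minutes."""
    cutoff = now - LIVE_WINDOW
    return [m for m in measurements if cutoff <= m["timestamp"] <= now]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"timestamp": now - timedelta(minutes=10), "value": 1.0},  # too old, dropped
        {"timestamp": now - timedelta(minutes=2), "value": 2.0},   # recent, kept
    ]
    print(within_live_window(sample, now))
```

Because older data is ignored, a backlog that builds up during an incident does not have to be worked through before the live view is current again.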

The new pipeline was deployed to production on Wednesday morning, and processing kept up for many hours. Unfortunately, on one of the ElastiCache nodes, Engine CPU usage rose to around 100%, causing delays in raw-measurement processing. Clients received timeouts, which triggered retries that were far too aggressive. As a result, live measurements were only partially available to end users during the incident.

Root Cause

In the new live view measurement processing, we started batching measurements into a Kinesis record. This was one of the action items from our previous incidents and, as expected, it led to a significant improvement in our Kinesis stream processing. Unfortunately, the batch of measurements was too large for ElastiCache to process in a timely manner. Clients started receiving IO timeouts, triggering aggressive retries. This put additional pressure on ElastiCache’s Engine CPU, which it couldn’t recover from.

Resolution

We reduced the batch size and configured less aggressive retries. ElastiCache is now able to process requests as they arrive, and clients no longer receive timeouts.
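As an illustration of what "less aggressive" retries can look like, the following sketch uses capped exponential backoff with full jitter, a common general technique. It is not the configuration that was actually deployed; the function names, the attempt limit, and the delay values are all assumptions.

```python
import random
import time


def backoff_delays(count, base=0.1, cap=5.0, rng=random.random):
    """Yield `count` full-jitter delays in seconds, each at most `cap`."""
    for attempt in range(count):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retries(operation, max_attempts=5, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(); on TimeoutError, wait and retry, up to max_attempts calls."""
    delays = backoff_delays(max_attempts - 1, base, cap)
    while True:
        try:
            return operation()
        except TimeoutError:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted; surface the timeout to the caller
            sleep(delay)
```

Spreading retries out over time and bounding their number keeps a slow backend from being hit with even more load exactly when it is struggling.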

Action Items

  1. Decouple the batch size in Kinesis from the batch size in ElastiCache
  2. Configure sensible retries
  3. Introduce a Dead-Letter-Queue
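
Action items 1 and 3 could be sketched roughly as follows. This is a simplified illustration under stated assumptions: `cache_write` stands in for whatever call writes to ElastiCache, `dead_letters` is a plain list standing in for a real dead-letter queue, and the chunk size of 100 is an arbitrary placeholder.

```python
def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


def write_record(measurements, cache_write, dead_letters, cache_batch_size=100):
    """Write one stream record's measurements to the cache in smaller chunks.

    The record may hold many measurements, but each cache request carries at
    most `cache_batch_size` of them. Chunks that time out are parked in
    `dead_letters` for later handling instead of blocking the rest.
    Returns the number of measurements written successfully.
    """
    written = 0
    for chunk in chunked(measurements, cache_batch_size):
        try:
            cache_write(chunk)
            written += len(chunk)
        except TimeoutError:
            dead_letters.append(chunk)
    return written
```

With this split, the Kinesis batch size can be tuned for stream throughput while the cache batch size is tuned independently for what the cache can handle per request.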
Posted Nov 26, 2024 - 14:59 CET

Resolved

Live measurements were restored around 11:30 pm yesterday. We monitored the systems through the evening and they have been running smoothly.
Posted Nov 21, 2024 - 09:45 CET

Update

We are continuing to monitor for any further issues.
Posted Nov 20, 2024 - 23:13 CET

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Nov 20, 2024 - 23:13 CET

Update

We're scaling out raw measurement caching.
Posted Nov 20, 2024 - 22:11 CET

Identified

The issue has been identified and a fix is being implemented.
Posted Nov 20, 2024 - 21:45 CET

Investigating

We are currently experiencing a disruption in our data processing for live measurements.
Posted Nov 20, 2024 - 21:37 CET
This incident affected: Platform Components (beta) (Measurement processing).