Measurement API unavailable

Incident Report for gridX GmbH

Postmortem

One of our AWS MemoryDB clusters ran out of memory. This cluster stores our table metadata and a cache of our data index. An issue with the data index cache caused the cluster to run out of memory, which prevented it from responding to requests for table metadata. Without this metadata, requests for measurement data couldn’t be processed.

Root Cause

The measurement ingestion service was switched to a new Redis client library that uses server-assisted client-side caching (see https://redis.io/docs/latest/develop/reference/client-side-caching/). This caching technique caused a steady increase in MemoryDB memory consumption until the cluster reached 100% memory utilization. At that point, the cluster could no longer serve requests for table metadata. Without this metadata, raw measurements cannot be retrieved for processing.
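
A steady climb toward 100% utilization can be observed from the `used_memory` and `maxmemory` fields of the Redis `INFO memory` output. The following is a minimal sketch, not our production tooling: it assumes the `INFO` text has already been fetched from the server, and the function names, sample values, and thresholds are hypothetical. On managed services the same figure is usually read from the provider's memory metrics instead.

```python
from typing import Dict, Optional


def parse_info_memory(info_text: str) -> Dict[str, int]:
    """Extract the integer-valued fields from `INFO memory` text."""
    fields: Dict[str, int] = {}
    for line in info_text.splitlines():
        line = line.strip()
        # Skip blank lines, section headers ("# Memory"), and malformed lines.
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        if value.isdigit():
            fields[key] = int(value)
    return fields


def memory_utilization(info_text: str) -> Optional[float]:
    """Return used_memory / maxmemory, or None when no limit is reported."""
    fields = parse_info_memory(info_text)
    limit = fields.get("maxmemory", 0)
    if limit <= 0:
        return None
    return fields.get("used_memory", 0) / limit


def alert_level(utilization: Optional[float],
                warn_at: float = 0.75,   # hypothetical threshold
                page_at: float = 0.90) -> str:  # hypothetical threshold
    """Map a utilization fraction to a coarse alert level."""
    if utilization is None:
        return "unknown"
    if utilization >= page_at:
        return "page"
    if utilization >= warn_at:
        return "warn"
    return "ok"


# Made-up sample: 900 MiB used out of a 1 GiB limit.
SAMPLE_INFO = """# Memory
used_memory:943718400
used_memory_human:900.00M
maxmemory:1073741824
maxmemory_policy:noeviction
"""

level = alert_level(memory_utilization(SAMPLE_INFO))
```

With the sample above, utilization is roughly 88%, which falls between the two hypothetical thresholds and yields a warning rather than a page.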

Resolution

We scaled the MemoryDB cluster to a larger size.

Action Items

  1. The table metadata and the data index cache no longer share a cluster. The data index cache was moved to its own ElastiCache cluster.
  2. The ElastiCache cluster is now configured to evict keys when memory usage grows too large.
  3. Client-side caching has been disabled wherever it isn’t needed for performance.
  4. Alerting has been improved.
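
The effect of evicting keys under a memory limit can be illustrated with a small least-recently-used cache. This is a sketch only: the eviction policy actually configured is not stated in this report, the class and its byte accounting are hypothetical, and Redis itself uses an approximated LRU rather than the exact ordering shown here.

```python
from collections import OrderedDict
from typing import List, Optional


class BoundedLRUCache:
    """Toy cache that evicts least-recently-used keys once a byte budget is exceeded."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._items: "OrderedDict[str, bytes]" = OrderedDict()

    def get(self, key: str) -> Optional[bytes]:
        value = self._items.get(key)
        if value is not None:
            # A read marks the key as most recently used.
            self._items.move_to_end(key)
        return value

    def set(self, key: str, value: bytes) -> List[str]:
        """Store a value and return the keys evicted to stay within budget."""
        if key in self._items:
            # Overwriting: release the old value's bytes first.
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = value
        self.used_bytes += len(value)

        evicted: List[str] = []
        while self.used_bytes > self.max_bytes and self._items:
            # Pop from the front, which holds the least recently used key.
            old_key, old_value = self._items.popitem(last=False)
            self.used_bytes -= len(old_value)
            evicted.append(old_key)
        return evicted


cache = BoundedLRUCache(max_bytes=10)
cache.set("a", b"12345")
cache.set("b", b"12345")
cache.get("a")                       # "a" is now more recent than "b"
evicted = cache.set("c", b"123")     # exceeds the budget, so "b" is dropped
```

Instead of refusing writes once memory is full, the cache discards the entries that have gone longest without being used, which is acceptable for a cache because its contents can be rebuilt.
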
Posted Feb 13, 2025 - 14:01 CET

Resolved

This incident has been resolved.
Posted Feb 04, 2025 - 16:40 CET

Update

We will run a backfill for the historical data.
Posted Feb 04, 2025 - 14:33 CET

Monitoring

Scaling up the cache instance mitigated the issue. We continue to monitor the system closely.
Posted Feb 04, 2025 - 14:22 CET

Update

We have identified an issue within our caching infrastructure and scaled up the affected instances.
Posted Feb 04, 2025 - 14:08 CET

Identified

We are currently investigating issues with our Measurement API. Live and historical data are partially unavailable.
Posted Feb 04, 2025 - 14:00 CET
This incident affected: Platform Components (beta) (Measurement API).