Cluster issues

Incident Report for gridX GmbH

Postmortem

A misconfiguration in our Kubernetes node provisioning system (Karpenter) caused an outage of public-facing services on Jan 27, 2025, from 13:15 to 14:00 CET. Measurement ingestion was unaffected, and no data was lost. During the outage, our local EMS operated in fallback mode and kept full local functionality, though cloud optimizations may have been impacted.

Root Cause: A misconfiguration in a core Karpenter resource triggered a cascading issue that disrupted normal cluster operation. Because Karpenter provisions and removes nodes dynamically, immediate manual intervention was difficult.

Resolution: The misconfiguration was reverted.

Action Items:

  1. Implement a hybrid node provisioning approach that combines Karpenter with static autoscaling groups, so that capacity can be scaled manually when needed.
  2. Enhance testing procedures for node provisioning configurations to prevent a recurrence.
Posted Feb 12, 2025 - 15:59 CET

Resolved

This incident has been resolved.
Posted Jan 27, 2025 - 19:32 CET

Update

The systems are up again. Measurement data for the incident period might be missing at the moment. We're monitoring the backfill of measurement data.
Posted Jan 27, 2025 - 14:13 CET

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jan 27, 2025 - 14:01 CET

Identified

The issue has been identified and a fix is being implemented.
Posted Jan 27, 2025 - 13:51 CET

Investigating

We are investigating problems with our cloud infrastructure.
Posted Jan 27, 2025 - 13:25 CET
This incident affected: Frontend, Public API, Private gridBox API and Platform Components (beta) (Measurement ingestion, Asset inventory management, Measurement processing, Measurement aggregation, Measurement API, Backend HTTP API, Backend gRPC API).