One of the action items from the recent raw-measurement processing incidents was a separate path for live calculations. Live calculations now use a different processing pipeline that looks only at the last 5 minutes of data. If an incident occurs, the new path allows the live view to recover quickly.
The new pipeline was deployed to production on Wednesday morning, and processing kept up for many hours. Unfortunately, Engine CPU usage on one of the ElastiCache nodes then rose to around 100%, causing delays in raw-measurement processing. Clients received timeouts, which triggered retries that were far too aggressive. As a result, live measurements were only partially available to end users during the incident.
In the new live-view measurement processing, we started batching measurements into a Kinesis record, which was one of the action items from our previous incidents. As expected, this led to a significant improvement in our Kinesis stream processing. Unfortunately, the batches of measurements were too large for ElastiCache to process in a timely manner. Clients started receiving IO timeouts, which triggered aggressive retries. The retries put additional pressure on ElastiCache’s Engine CPU, from which it could not recover.
Since we reduced the batch size and configured less aggressive retries, ElastiCache has been able to process requests as they arrive, and clients no longer receive timeouts.
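
The two mitigations, smaller batches and gentler retries, can be illustrated with a minimal Python sketch. It is not the code that was deployed. The helper names (`chunked`, `backoff_delay`, `call_with_retries`) and all numeric defaults are hypothetical, and exponential backoff with full jitter is only one common way to make retries less aggressive. The actual batch size and retry settings are not stated in this report.

```python
import random
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], max_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most max_size items."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    for start in range(0, len(items), max_size):
        yield items[start:start + max_size]


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Return a random delay in [0, min(cap, base * 2**attempt)] seconds.

    The random spread ("full jitter") keeps many clients from retrying
    at the same instant. The base and cap values are illustrative only.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    operation: Callable[[], R],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> R:
    """Run operation, retrying on TimeoutError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        try:
            return operation()
        except TimeoutError:
            attempt += 1
            if attempt >= max_attempts:
                raise  # give up rather than keep loading the cache
            sleep(backoff_delay(attempt - 1))
```

In a consumer, each slice produced by `chunked(measurements, max_size)` would be written to the cache inside `call_with_retries`, so that a slow node sees fewer and smaller requests while it recovers instead of an immediate burst of repeats.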