Incident Summary
The Kafka-based event streaming platform becomes non-operational when a Kafka node goes offline after exceeding its capacity. As a downstream effect, all Enlighted gateways connected to Energy Manager Cloud instances disconnect.
Our team has been able to mitigate the impact by restarting the services. Most gateways come back online after the restart, but a percentage may not. Gateways that remain offline will need a power cycle; if that does not resolve the issue, the gateway will need to be replaced.
Who was Affected
- All customers using the Enlighted Energy Manager Cloud (EMC) Solution
Root Cause Analysis Discoveries
- A Kafka bug that affects the LogCleaner thread has been discovered in the live instance
Remediation Actions
- Allocated disk space was increased.
- Monitoring alert thresholds were lowered, and monitoring urgency was raised to improve response to Kafka platform environment issues.
- A minor Kafka upgrade is being urgently prepared to resolve the bug.
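As an illustration of the lowered alert thresholds described above, the following is a minimal sketch of a disk-usage check. It is not the monitoring configuration used in production: the threshold values, the function names `classify_disk_usage` and `check_path`, and the idea of checking a single path are all assumptions made for this example.

```python
import shutil

# Hypothetical thresholds, expressed as the fraction of disk space in use.
# These values are illustrative only and do not come from the incident report.
WARN_THRESHOLD = 0.70
CRITICAL_THRESHOLD = 0.85


def classify_disk_usage(used_bytes: int, total_bytes: int) -> str:
    """Return 'ok', 'warning', or 'critical' for the given disk usage."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= CRITICAL_THRESHOLD:
        return "critical"
    if ratio >= WARN_THRESHOLD:
        return "warning"
    return "ok"


def check_path(path: str) -> str:
    """Classify the usage of the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return classify_disk_usage(usage.used, usage.total)
```

Lowering `WARN_THRESHOLD` in a check like this raises the alert earlier, which leaves more time to respond before a node reaches capacity.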
Current Actions: (10/27/2022)
- Support is working with customers affected by offline gateways
- Preparations are being made to perform the necessary upgrade
- Further communication will be posted with additional information