Incident Notification: Kafka-Based System Outage, Thursday, Oct 27th, 2022

Incident Summary
The Kafka-based event streaming platform became non-operational when a Kafka node went offline after exceeding its capacity. As a downstream impact, all Enlighted gateways connected to Energy Manager Cloud instances disconnected.

Our team mitigated the impact by restarting the services. Most gateways came back online, but a percentage may not. Any gateway that remains offline will need a power cycle; if that does not resolve the issue, the gateway will need to be replaced.

Who was Affected

  • All customers using the Enlighted Energy Manager Cloud (EMC) Solution

Root Cause Analysis Discoveries

  • A Kafka bug affecting the LogCleaner thread was discovered in the live instance.

Remediation Actions

  • Allocated disk space was increased.
  • Monitoring alert thresholds were lowered, and alert urgency was raised, to improve response to Kafka platform environment issues.
  • A minor Kafka upgrade is being urgently prepared to resolve the bug.
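
For readers who run their own Kafka brokers, the disk space and alert-threshold items above can be illustrated with a minimal sketch of a disk-usage check. This is not the monitoring used on the EMC platform: the function names and the 70% and 85% thresholds are hypothetical, chosen only to show how lowering a threshold makes an alert fire earlier.

```python
import shutil

# Hypothetical thresholds; this notification does not state the real values.
WARN_FRACTION = 0.70
CRIT_FRACTION = 0.85


def classify_usage(used_fraction, warn=WARN_FRACTION, crit=CRIT_FRACTION):
    """Map a disk-usage fraction (0.0 to 1.0) to an alert level."""
    if used_fraction >= crit:
        return "critical"
    if used_fraction >= warn:
        return "warning"
    return "ok"


def check_path(path):
    """Return (level, used_fraction) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    used_fraction = usage.used / usage.total
    return classify_usage(used_fraction), used_fraction
```

Calling `check_path` on the directory that holds a broker's log segments (an assumed location, set by `log.dirs` in the broker configuration) returns the alert level together with the measured usage, so a scheduler can page someone before the volume fills.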

Current Actions (10/27/2022)

  • Support is working with customers affected by offline gateways
  • Preparations are being made to perform the necessary upgrade
  • Further communication will be posted with additional information