
Evergram


Decentralized messaging built on Evernode and Xahau.


Postmortem: Incident 2026-03-12


Evergram Incident Postmortem — March 12, 2026

Incident Summary

Severity: Critical — Full service outage
Incident Start: March 12, 2026, ~10:49 UTC
Incident End: March 13, 2026, ~00:36 UTC
Duration: ~13 hours 47 minutes
Systems Affected: Evergram messaging service (HotPocket cluster on Evernode)
Root Cause: Consensus failure caused by validator state divergence
Status: Resolved

1. Impact

During the incident window, Evergram was unavailable to users. The underlying validator cluster stopped producing new ledgers due to a distributed consensus failure.

  • Ledger progression halted; no transactions could be processed
  • The Gateway service remained reachable but could not advance state
  • Users experienced timeouts, inconsistent responses, or inability to interact with the platform
  • No data was lost or corrupted

System integrity was preserved throughout the incident.


2. Timeline

All timestamps in UTC.

Mar 12, ~10:49: Consensus failure begins; new ledgers stop being produced
Mar 12, ~10:49 to Mar 13, ~00:20: Service outage while investigation and mitigation proceed
Mar 13, ~00:20: Fault isolation measures applied to the affected validator
Mar 13, ~00:21: Validator state resynchronization begins
Mar 13, ~00:36: Consensus restored; normal ledger production resumes

3. Root Cause

Evergram operates on a distributed validator cluster using the HotPocket protocol on the Evernode network. Consensus requires participating validators to agree on the same ledger state.

During startup, one validator entered a divergent state and could not agree with its peers on the latest ledger. Because consensus could not be achieved, ledger production halted to preserve correctness and prevent inconsistent state propagation.

This behavior reflects a safety-first design: when agreement cannot be guaranteed, the system stops rather than risking data corruption.
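The halt-on-divergence behavior can be sketched as follows. This is a simplified illustration, not HotPocket's actual implementation: the names and the 2/3 supermajority rule are our assumptions. A round advances only when enough validators report the same candidate state hash; otherwise ledger production stops.

```typescript
type ValidatorReport = { id: string; stateHash: string };

type RoundOutcome =
  | { status: "advance"; hash: string }
  | { status: "halt" };

// Close a consensus round only if a supermajority of validators report
// the same candidate state hash; otherwise halt ledger production.
function closeRound(
  reports: ValidatorReport[],
  quorumFraction = 2 / 3, // illustrative threshold, not HotPocket's actual value
): RoundOutcome {
  // Tally how many validators vouch for each candidate state hash.
  const tally = new Map<string, number>();
  for (const r of reports) {
    tally.set(r.stateHash, (tally.get(r.stateHash) ?? 0) + 1);
  }
  for (const [hash, votes] of tally) {
    if (votes > reports.length * quorumFraction) {
      return { status: "advance", hash }; // quorum agrees: produce the ledger
    }
  }
  // No agreement: stop producing ledgers rather than risk a forked state.
  return { status: "halt" };
}
```

Note that in a small cluster, a single divergent validator can be enough to drop agreement below the quorum threshold, halting the ledger cluster-wide, which is consistent with what was observed in this incident.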

Network timing conditions likely contributed to the divergence: consensus timing parameters had not been fully tuned to real-world network latency, making a validator more likely to fall out of agreement with its peers during startup.


4. Resolution

The incident was resolved through controlled fault isolation and state resynchronization.

Mitigation Actions

  • The divergent validator was temporarily isolated from the consensus group
  • The affected node was restarted on a clean state
  • Synchronization with healthy peers was allowed to complete
  • Consensus timing parameters were adjusted to improve tolerance to real-world network latency
  • All validators were restarted in a coordinated manner to ensure configuration consistency

Once synchronization completed, the validator safely rejoined consensus and normal operation resumed.
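The isolate / reset / resync / rejoin sequence above can be sketched as a small state transition. This is a hypothetical model for illustration only; the type and function names are ours, not Evergram's actual tooling.

```typescript
type Validator = { id: string; stateHash: string; inConsensus: boolean };

// Recover a divergent validator: isolate it from the consensus group,
// reset its state, copy the ledger state from a healthy peer, and let
// it rejoin once its state matches.
function recoverDivergentValidator(
  cluster: Validator[],
  divergentId: string,
): Validator[] {
  // A healthy, in-consensus peer is required as the sync source.
  const peer = cluster.find((n) => n.id !== divergentId && n.inConsensus);
  if (!peer) throw new Error("no healthy peer available for resync");
  return cluster.map((n) =>
    n.id === divergentId
      ? { ...n, stateHash: peer.stateHash, inConsensus: true } // resynced, rejoined
      : n,
  );
}
```

After this step every validator reports the same state hash, so the quorum condition is satisfiable again and ledger production can resume.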

Operational Improvements Introduced

During the incident response, the team accelerated delivery of Evergram Manager, an authenticated administrative interface designed to improve operational visibility and control.

Capabilities include:

  • Service lifecycle management
  • Real-time system observability
  • Configuration management
  • Validator coordination tools

This tooling significantly reduces recovery time for future incidents and supports planned network expansion.


5. What Went Well

  • No data loss. State integrity was maintained throughout
  • Clear fault identification. Divergence was detected through system telemetry
  • Safe recovery procedure. Isolation and resynchronization restored operation without side effects
  • Rapid tooling improvement. New operational capabilities were deployed during response
  • Fast resynchronization once corrective action was applied

6. What Could Be Improved

  • Lack of automated alerting for consensus failure conditions
  • Insufficient operational tooling prior to the incident
  • Timing parameters not fully tuned to production network conditions
  • Absence of a documented runbook for divergence scenarios
  • Limited fault tolerance in the initial deployment topology
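For the first gap above, automated alerting on consensus failure, a minimal detector could simply watch whether the ledger sequence keeps advancing. A sketch under our own assumptions (the class name and 60-second threshold are illustrative, not an existing Evergram component):

```typescript
// Fire an alert when the ledger sequence number stops advancing.
class LedgerStallDetector {
  private lastSeq = -1;
  private lastAdvanceMs = 0;

  constructor(private readonly maxStallMs: number) {}

  // Feed periodic observations of the latest ledger sequence;
  // returns true once the stall threshold has been exceeded.
  observe(seq: number, nowMs: number): boolean {
    if (seq > this.lastSeq) {
      this.lastSeq = seq;
      this.lastAdvanceMs = nowMs;
      return false; // ledger advanced: healthy
    }
    return nowMs - this.lastAdvanceMs > this.maxStallMs;
  }
}
```

A detector like this would have flagged the March 12 halt within about a minute of the last produced ledger, rather than relying on user-facing timeouts to surface the outage.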

7. Lessons Learned

This incident reinforced several key principles for operating distributed systems:

  • Safety mechanisms worked as designed. The system halted rather than producing inconsistent state
  • Operational tooling is critical. Visibility and control dramatically reduce recovery time
  • Production conditions differ from test environments. Parameters must be tuned accordingly
  • Fault tolerance is essential for high availability
  • Incidents drive resilience. Improvements implemented during recovery leave the system stronger than before

Current Status

Evergram is fully operational. The validator cluster is synchronized, consensus is stable, and monitoring improvements are being deployed to prevent recurrence.


Closing Note

We take reliability seriously and appreciate the community’s patience during this outage. Transparency is a core value, and we will continue publishing incident reports as part of our commitment to building a robust and trustworthy platform.

Tags: incidents, node, hotpocket, evernode, evergram
