Evergram Incident Postmortem — March 12, 2026
Incident Summary
| Field | Details |
|---|---|
| Severity | Critical — Full service outage |
| Incident Start | March 12, 2026 ~10:49 UTC |
| Incident End | March 13, 2026 ~00:36 UTC |
| Duration | ~13 hours 47 minutes |
| Systems Affected | Evergram messaging service (HotPocket cluster on Evernode) |
| Root Cause | Consensus failure caused by validator state divergence |
| Status | Resolved |
1. Impact
During the incident window, Evergram was unavailable to users. The underlying validator cluster stopped producing new ledgers due to a distributed consensus failure.
- Ledger progression halted; no transactions could be processed
- The Gateway service remained reachable but could not advance state
- Users experienced timeouts, inconsistent responses, or inability to interact with the platform
- No data was lost or corrupted
System integrity was preserved throughout the incident.
2. Timeline
All timestamps in UTC.
| Time | Event |
|---|---|
| Mar 12 ~10:49 | Consensus failure begins; new ledgers stop being produced |
| Mar 12 ~10:49 → Mar 13 ~00:20 | Service outage while investigation and mitigation proceeded |
| Mar 13 ~00:20 | Fault isolation measures applied to the affected validator |
| Mar 13 ~00:21 | Validator state resynchronization begins |
| Mar 13 ~00:36 | Consensus restored; normal ledger production resumes |
3. Root Cause
Evergram operates on a distributed validator cluster using the HotPocket protocol on the Evernode network. Consensus requires participating validators to agree on the same ledger state.
During startup, one validator entered a divergent state and could not agree with its peers on the latest ledger. Because consensus could not be achieved, ledger production halted to preserve correctness and prevent inconsistent state propagation.
This behavior reflects a safety-first design: when agreement cannot be guaranteed, the system stops rather than risking data corruption.
Network latency exceeding the configured consensus timing tolerances likely contributed to the divergence.
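The halt-on-disagreement behavior can be illustrated with a minimal sketch. This is not HotPocket's actual implementation; the quorum threshold and hash values are hypothetical, chosen only to show why one divergent validator in a small cluster can stop ledger production entirely.

```python
# Minimal sketch (not the real HotPocket protocol): a ledger only closes
# when a quorum of validators report the same ledger-state hash.

QUORUM_FRACTION = 0.8  # hypothetical threshold; the real value is protocol-specific

def can_close_ledger(reported_hashes):
    """reported_hashes: one ledger-state hash per participating validator."""
    if not reported_hashes:
        return False
    # Size of the largest group of validators agreeing on a single hash.
    majority = max(reported_hashes.count(h) for h in set(reported_hashes))
    return majority / len(reported_hashes) >= QUORUM_FRACTION

# A 4-node cluster where one validator diverged on startup:
healthy = ["a1f3"] * 3
diverged = ["9c0e"]
print(can_close_ledger(healthy + diverged))  # 3/4 = 0.75 < 0.8 -> False
```

With agreement below quorum, the function returns `False` and no new ledger is produced, which matches the safety-first behavior described above: the cluster stalls rather than letting two inconsistent states propagate.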
4. Resolution
The incident was resolved through controlled fault isolation and state resynchronization.
Mitigation Actions
- The divergent validator was temporarily isolated from the consensus group
- The affected node was restarted on a clean state
- Synchronization with healthy peers was allowed to complete
- Consensus timing parameters were adjusted to improve tolerance to real-world network latency
- All validators were restarted in a coordinated manner to ensure configuration consistency
Once synchronization completed, the validator safely rejoined consensus and normal operation resumed.
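The isolate → clean restart → resync → rejoin sequence above can be sketched as follows. The class and function names here are illustrative only, not real HotPocket or Evernode APIs, and the sketch deliberately omits the timing-parameter changes, which are configuration rather than logic.

```python
# Illustrative model of the recovery sequence: isolate the divergent node,
# discard its state, resync from healthy peers, then rejoin consensus.
# Names are hypothetical; this is not HotPocket/Evernode code.

class Validator:
    def __init__(self, node_id, ledger_seq, state_hash):
        self.node_id = node_id
        self.ledger_seq = ledger_seq    # latest ledger this node holds
        self.state_hash = state_hash    # hash of that ledger's state
        self.in_consensus = True

def recover_divergent_validator(diverged, healthy_peers):
    # 1. Isolate: remove the node from the consensus group so the
    #    remaining validators are no longer blocked by its votes.
    diverged.in_consensus = False

    # 2. Clean restart: discard the divergent local state.
    diverged.ledger_seq = 0
    diverged.state_hash = None

    # 3. Resync: adopt the latest agreed state from the healthiest peer.
    reference = max(healthy_peers, key=lambda v: v.ledger_seq)
    diverged.ledger_seq = reference.ledger_seq
    diverged.state_hash = reference.state_hash

    # 4. Rejoin consensus only once the state matches every healthy peer.
    if all(diverged.state_hash == p.state_hash for p in healthy_peers):
        diverged.in_consensus = True
    return diverged
```

The key design point the sketch captures is step 4: the node is readmitted only after its state hash matches its peers, so rejoining can never reintroduce the divergence that caused the outage.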
Operational Improvements Introduced
During the incident response, the team accelerated delivery of Evergram Manager, an authenticated administrative interface designed to improve operational visibility and control.
Capabilities include:
- Service lifecycle management
- Real-time system observability
- Configuration management
- Validator coordination tools
This tooling should significantly reduce recovery time for future incidents and supports planned network expansion.
5. What Went Well
- No data loss. State integrity was maintained throughout
- Clear fault identification. Divergence was detected through system telemetry
- Safe recovery procedure. Isolation and resynchronization restored operation without side effects
- Rapid tooling improvement. New operational capabilities were deployed during response
- Fast resynchronization once corrective action was applied
6. What Could Be Improved
- Lack of automated alerting for consensus failure conditions
- Insufficient operational tooling prior to the incident
- Timing parameters not fully tuned to production network conditions
- Absence of a documented runbook for divergence scenarios
- Limited fault tolerance in the initial deployment topology
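The first gap above, automated alerting for consensus failures, admits a simple detector: if the latest ledger sequence stops advancing for longer than a timeout, page an operator. The sketch below is an assumption about how such a check could look; the 120-second threshold is a placeholder, not a tuned production value.

```python
import time

# Illustrative ledger-stall detector for the alerting gap noted above.
# The timeout is a placeholder assumption, not a recommended value.

STALL_TIMEOUT_SECONDS = 120

def ledger_stalled(last_seq, current_seq, last_advance_time, now=None):
    """Return True if the ledger has not advanced within the timeout.

    last_seq / current_seq: ledger sequence at the previous and current poll.
    last_advance_time: wall-clock time (seconds) of the last observed advance.
    """
    now = now if now is not None else time.time()
    if current_seq > last_seq:
        return False  # progress was made; no alert
    return (now - last_advance_time) > STALL_TIMEOUT_SECONDS
```

Run periodically from a monitor outside the validator cluster, a check like this would have flagged the March 12 stall within minutes of the 10:49 halt instead of relying on user-facing symptoms.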
7. Lessons Learned
This incident reinforced several key principles for operating distributed systems:
- Safety mechanisms worked as designed. The system halted rather than producing inconsistent state
- Operational tooling is critical. Visibility and control dramatically reduce recovery time
- Production conditions differ from test environments. Parameters must be tuned accordingly
- Fault tolerance is essential for high availability
- Incidents drive resilience. Improvements implemented during recovery leave the system stronger than before
Current Status
Evergram is fully operational. The validator cluster is synchronized, consensus is stable, and monitoring improvements are being deployed to prevent recurrence.
Closing Note
We take reliability seriously and appreciate the community’s patience during this outage. Transparency is a core value, and we will continue publishing incident reports as part of our commitment to building a robust and trustworthy platform.


