Evergram Incident Postmortem — March 12, 2026
Incident Summary
| Field | Details |
|---|---|
| Severity | Critical — Full service outage |
| Incident Start | March 12, 2026 ~10:49 UTC |
| Incident End | March 13, 2026 ~00:36 UTC |
| Duration | ~13 hours 47 minutes |
| Systems Affected | Evergram messaging service (HotPocket cluster on Evernode) |
| Root Cause | Consensus failure caused by validator state divergence |
| Status | Resolved |
1. Impact
During the incident window, Evergram was unavailable to users. The underlying validator cluster stopped producing new ledgers due to a distributed consensus failure.
- Ledger progression halted; no transactions could be processed
- The Gateway service remained reachable but could not advance state
- Users experienced timeouts, inconsistent responses, or inability to interact with the platform
- No data was lost or corrupted
System integrity was preserved throughout the incident.
2. Timeline
All timestamps in UTC.
| Time | Event |
|---|---|
| Mar 12 ~10:49 | Consensus failure begins; new ledgers stop being produced |
| Mar 12 ~10:49 → Mar 13 ~00:20 | Service outage while investigation and mitigation proceeded |
| Mar 13 ~00:20 | Fault isolation measures applied to the affected validator |
| Mar 13 ~00:21 | Validator state resynchronization begins |
| Mar 13 ~00:36 | Consensus restored; normal ledger production resumes |
3. Root Cause
Evergram operates on a distributed validator cluster using the HotPocket protocol on the Evernode network. Consensus requires participating validators to agree on the same ledger state.
During startup, one validator entered a divergent state and could not agree with its peers on the latest ledger. Because consensus could not be achieved, ledger production halted to preserve correctness and prevent inconsistent state propagation.
This behavior reflects a safety-first design: when agreement cannot be guaranteed, the system stops rather than risking data corruption.
Network latency exceeding the configured consensus timing tolerances likely contributed to the divergence.
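The halt-on-disagreement behavior can be illustrated with a minimal sketch. This is not HotPocket's actual implementation; the quorum threshold and hash values are hypothetical, chosen only to show why one divergent validator in a small cluster can stop ledger production entirely.

```python
# Minimal sketch (not the real HotPocket protocol): a ledger only closes
# when a quorum of validators report the same ledger-state hash.

QUORUM_FRACTION = 0.8  # hypothetical threshold; the real value is protocol-specific

def can_close_ledger(reported_hashes):
    """reported_hashes: one ledger-state hash per participating validator."""
    if not reported_hashes:
        return False
    # Size of the largest group of validators agreeing on a single hash.
    majority = max(reported_hashes.count(h) for h in set(reported_hashes))
    return majority / len(reported_hashes) >= QUORUM_FRACTION

# A 4-node cluster where one validator diverged on startup:
healthy = ["a1f3"] * 3
diverged = ["9c0e"]
print(can_close_ledger(healthy + diverged))  # 3/4 = 0.75 < 0.8 -> False
```

With agreement below quorum, the function returns `False` and no new ledger is produced, which matches the safety-first behavior described above: the cluster stalls rather than letting two inconsistent states propagate.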
4. Resolution
The incident was resolved through controlled fault isolation and state resynchronization.
Mitigation Actions
- The divergent validator was temporarily isolated from the consensus group
- The affected node was restarted on a clean state
- Synchronization with healthy peers was allowed to complete
- Consensus timing parameters were adjusted to improve tolerance to real-world network latency
- All validators were restarted in a coordinated manner to ensure configuration consistency
Once synchronization completed, the validator safely rejoined consensus and normal operation resumed.
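The isolate → clean restart → resync → rejoin sequence above can be sketched as follows. The class and function names here are illustrative only, not real HotPocket or Evernode APIs, and the sketch deliberately omits the timing-parameter changes, which are configuration rather than logic.

```python
# Illustrative model of the recovery sequence: isolate the divergent node,
# discard its state, resync from healthy peers, then rejoin consensus.
# Names are hypothetical; this is not HotPocket/Evernode code.

class Validator:
    def __init__(self, node_id, ledger_seq, state_hash):
        self.node_id = node_id
        self.ledger_seq = ledger_seq    # latest ledger this node holds
        self.state_hash = state_hash    # hash of that ledger's state
        self.in_consensus = True

def recover_divergent_validator(diverged, healthy_peers):
    # 1. Isolate: remove the node from the consensus group so the
    #    remaining validators are no longer blocked by its votes.
    diverged.in_consensus = False

    # 2. Clean restart: discard the divergent local state.
    diverged.ledger_seq = 0
    diverged.state_hash = None

    # 3. Resync: adopt the latest agreed state from the healthiest peer.
    reference = max(healthy_peers, key=lambda v: v.ledger_seq)
    diverged.ledger_seq = reference.ledger_seq
    diverged.state_hash = reference.state_hash

    # 4. Rejoin consensus only once the state matches every healthy peer.
    if all(diverged.state_hash == p.state_hash for p in healthy_peers):
        diverged.in_consensus = True
    return diverged
```

The key design point the sketch captures is step 4: the node is readmitted only after its state hash matches its peers, so rejoining can never reintroduce the divergence that caused the outage.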
Operational Improvements Introduced
During the incident response, the team accelerated delivery of Evergram Manager, an authenticated administrative interface designed to improve operational visibility and control.
Capabilities include:
- Service lifecycle management
- Real-time system observability
- Configuration management
- Validator coordination tools
This tooling should significantly reduce recovery time for future incidents and supports planned network expansion.
5. What Went Well
- No data loss. State integrity was maintained throughout
- Clear fault identification. Divergence was detected through system telemetry
- Safe recovery procedure. Isolation and resynchronization restored operation without side effects
- Rapid tooling improvement. New operational capabilities were deployed during response
- Fast resynchronization once corrective action was applied
6. What Could Be Improved
- Lack of automated alerting for consensus failure conditions
- Insufficient operational tooling prior to the incident
- Timing parameters not fully tuned to production network conditions
- Absence of a documented runbook for divergence scenarios
- Limited fault tolerance in the initial deployment topology
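The first gap above, automated alerting for consensus failures, admits a simple detector: if the latest ledger sequence stops advancing for longer than a timeout, page an operator. The sketch below is an assumption about how such a check could look; the 120-second threshold is a placeholder, not a tuned production value.

```python
import time

# Illustrative ledger-stall detector for the alerting gap noted above.
# The timeout is a placeholder assumption, not a recommended value.

STALL_TIMEOUT_SECONDS = 120

def ledger_stalled(last_seq, current_seq, last_advance_time, now=None):
    """Return True if the ledger has not advanced within the timeout.

    last_seq / current_seq: ledger sequence at the previous and current poll.
    last_advance_time: wall-clock time (seconds) of the last observed advance.
    """
    now = now if now is not None else time.time()
    if current_seq > last_seq:
        return False  # progress was made; no alert
    return (now - last_advance_time) > STALL_TIMEOUT_SECONDS
```

Run periodically from a monitor outside the validator cluster, a check like this would have flagged the March 12 stall within minutes of the 10:49 halt instead of relying on user-facing symptoms.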
7. Lessons Learned
This incident reinforced several key principles for operating distributed systems:
- Safety mechanisms worked as designed. The system halted rather than producing inconsistent state
- Operational tooling is critical. Visibility and control dramatically reduce recovery time
- Production conditions differ from test environments. Parameters must be tuned accordingly
- Fault tolerance is essential for high availability
- Incidents drive resilience. Improvements implemented during recovery leave the system stronger than before
Current Status
Evergram is fully operational. The validator cluster is synchronized, consensus is stable, and monitoring improvements are being deployed to prevent recurrence.
Closing Note
We take reliability seriously and appreciate the community’s patience during this outage. Transparency is a core value, and we will continue publishing incident reports as part of our commitment to building a robust and trustworthy platform.


