Status: Draft
Created: 2026-02-16
Updated: 2026-03-13
## Abstract

State replication enables standby nodes to stay in sync with the primary node by streaming transactions via NATS JetStream. The `magicblock-replicator` crate introduces a producer component (primary) that publishes transactions and block metadata, and a consumer component (standby) that replays them against their local AccountsDb. Standbys can catch up from any point using NATS message replay, with periodic snapshots enabling fast recovery.
## Motivation
Currently, each validator node maintains its own independent state. This works for single-node deployments but becomes problematic when:
- **Scaling read traffic:** multiple nodes serving read requests need consistent state
- **High availability:** hot spares should be able to take over instantly
- **Geographic distribution:** edge nodes need to stay synced with the source of truth
- **Testing/staging:** environment parity without re-processing the entire ledger
Transaction-level replication via NATS JetStream gives us durable messaging, automatic replay, and leader election out of the box.

## Specification

### Protocol Messages

All messages use bincode serialization with a 4-byte discriminator prefix:

- `0x01`: `Transaction(slot: u64, index: u32, tx: SanitizedTransaction)`
- `0x02`: `Block(slot: u64, blockhash: Hash)`
- `0x03`: `SuperBlock(slot: u64, checksum: Hash)`

### Leader Election

The primary role is determined via a NATS KV distributed lock.
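The stream carries three message types (Transaction, Block, SuperBlock), each prefixed with a 4-byte discriminator. A minimal, dependency-free sketch of that framing follows; the hand-rolled field encoding stands in for bincode, and the `WireMessage` type (with the transaction payload reduced to raw bytes) is illustrative, not the crate's actual definition:

```rust
// Wire format sketch: 4-byte little-endian discriminator, then payload fields.
// Hand-rolled encoding stands in for bincode to keep the sketch self-contained;
// a real SanitizedTransaction payload is elided to opaque bytes.

#[derive(Debug, PartialEq)]
enum WireMessage {
    /// 0x01: a transaction within a slot (payload bytes elided here).
    Transaction { slot: u64, index: u32, tx: Vec<u8> },
    /// 0x02: end-of-slot delimiter carrying the blockhash.
    Block { slot: u64, blockhash: [u8; 32] },
    /// 0x03: periodic AccountsDb checksum.
    SuperBlock { slot: u64, checksum: [u8; 32] },
}

fn encode(msg: &WireMessage) -> Vec<u8> {
    let mut out = Vec::new();
    match msg {
        WireMessage::Transaction { slot, index, tx } => {
            out.extend_from_slice(&1u32.to_le_bytes());
            out.extend_from_slice(&slot.to_le_bytes());
            out.extend_from_slice(&index.to_le_bytes());
            out.extend_from_slice(tx);
        }
        WireMessage::Block { slot, blockhash } => {
            out.extend_from_slice(&2u32.to_le_bytes());
            out.extend_from_slice(&slot.to_le_bytes());
            out.extend_from_slice(blockhash);
        }
        WireMessage::SuperBlock { slot, checksum } => {
            out.extend_from_slice(&3u32.to_le_bytes());
            out.extend_from_slice(&slot.to_le_bytes());
            out.extend_from_slice(checksum);
        }
    }
    out
}

fn decode(buf: &[u8]) -> Option<WireMessage> {
    let disc = u32::from_le_bytes(buf.get(..4)?.try_into().ok()?);
    let slot = u64::from_le_bytes(buf.get(4..12)?.try_into().ok()?);
    match disc {
        1 => {
            let index = u32::from_le_bytes(buf.get(12..16)?.try_into().ok()?);
            Some(WireMessage::Transaction { slot, index, tx: buf[16..].to_vec() })
        }
        2 => Some(WireMessage::Block { slot, blockhash: buf.get(12..44)?.try_into().ok()? }),
        3 => Some(WireMessage::SuperBlock { slot, checksum: buf.get(12..44)?.try_into().ok()? }),
        _ => None,
    }
}
```

An unknown discriminator decodes to `None` rather than panicking, which lets a consumer skip or quarantine messages from a newer protocol version.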
### Producer (Primary)

- Publishes transactions to the `TRANSACTIONS` stream as they're processed
- Sends a Block delimiter when a slot transition occurs
- Sends a SuperBlock every N blocks with an AccountsDb checksum
- Uploads snapshots to the NATS ObjectStore periodically
- Refreshes the leader lock every second; releases it on shutdown
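The slot-transition rule above (close out the finished slot with a Block delimiter before publishing the first transaction of the next slot) can be sketched as pure logic. `Emit` and `ProducerState` are hypothetical names for illustration, and the delimiter's blockhash field is omitted to keep the sketch small:

```rust
// Decides what the producer publishes when a transaction arrives:
// if the transaction belongs to a new slot, a Block delimiter for the
// finished slot goes out first. Types here are illustrative only.

#[derive(Debug, PartialEq)]
enum Emit {
    BlockDelimiter { slot: u64 },
    Transaction { slot: u64, index: u32 },
}

struct ProducerState {
    current_slot: u64,
    next_index: u32,
}

impl ProducerState {
    fn on_transaction(&mut self, tx_slot: u64) -> Vec<Emit> {
        let mut out = Vec::new();
        if tx_slot != self.current_slot {
            // Close out the previous slot before starting the next one.
            out.push(Emit::BlockDelimiter { slot: self.current_slot });
            self.current_slot = tx_slot;
            self.next_index = 0;
        }
        out.push(Emit::Transaction { slot: tx_slot, index: self.next_index });
        self.next_index += 1;
        out
    }
}
```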
### Consumer (Standby)
- Subscribes to the `TRANSACTIONS` stream
- Retrieves a snapshot from the ObjectStore for initial recovery
- Replays messages starting from the last acknowledged position:
| Message | Action |
| --- | --- |
| Transaction | Replay via `TransactionScheduler.replay()` |
| Block | Verify the blockhash matches the local observation, then transition the slot |
| SuperBlock | Compute the local AccountsDb checksum and compare with the received one |
- Acknowledges processed messages
- Monitors the leader lock for primary liveness
- Promotes itself to primary on leader timeout
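The per-message actions above can be sketched as a dispatch function. `Standby`, `Message`, and `ReplayError` are illustrative stand-ins for the crate's real types, and the replay call itself is elided:

```rust
// Consumer-side dispatch sketch: each stream message either replays a
// transaction, verifies a block boundary, or cross-checks state.

#[derive(Debug, PartialEq)]
enum ReplayError {
    BlockhashMismatch { slot: u64 },
    ChecksumMismatch { slot: u64 },
}

enum Message {
    Transaction { slot: u64, tx: Vec<u8> },
    Block { slot: u64, blockhash: [u8; 32] },
    SuperBlock { slot: u64, checksum: [u8; 32] },
}

struct Standby {
    slot: u64,
    observed_blockhash: [u8; 32],
    accounts_checksum: [u8; 32],
}

impl Standby {
    fn apply(&mut self, msg: Message) -> Result<(), ReplayError> {
        match msg {
            Message::Transaction { .. } => {
                // Real code would call TransactionScheduler.replay() here.
                Ok(())
            }
            Message::Block { slot, blockhash } => {
                if blockhash != self.observed_blockhash {
                    return Err(ReplayError::BlockhashMismatch { slot });
                }
                self.slot = slot + 1; // transition to the next slot
                Ok(())
            }
            Message::SuperBlock { slot, checksum } => {
                if checksum != self.accounts_checksum {
                    return Err(ReplayError::ChecksumMismatch { slot });
                }
                Ok(())
            }
        }
    }
}
```

Returning an error rather than panicking keeps the promotion and restart-from-snapshot policies (see Failure Modes) in the caller's hands.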
### Transaction Replay
Standbys replay transactions using the existing `TransactionScheduler.replay()` path. This ensures:

- The same SVM semantics as the primary
- Account locking and conflict resolution
- Program cache consistency
- Sysvar updates (Clock, SlotHashes)
Replayed transactions update AccountsDb but are not persisted to the ledger.
## Safeguards

### Validation Checks
- **Blockhash verification:** the standby ensures its observed blockhash matches each Block delimiter
- **Checksum verification:** each SuperBlock carries an AccountsDb hash for state validation
- **Transaction ordering:** NATS preserves message ordering within a stream
- **Gap detection:** missing messages are detected via sequence numbers
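Gap detection is cheap because JetStream assigns each stream message a monotonically increasing sequence number. A sketch of the check, with `SeqCheck` as a hypothetical result type:

```rust
// Sequence check sketch: a consumer compares each incoming JetStream
// sequence number against the last one it processed to detect
// redeliveries and gaps.

#[derive(Debug, PartialEq)]
enum SeqCheck {
    InOrder,
    Duplicate,                            // already processed (redelivery)
    Gap { missing: std::ops::Range<u64> }, // sequences that never arrived
}

fn check_sequence(last_seen: u64, incoming: u64) -> SeqCheck {
    if incoming == last_seen + 1 {
        SeqCheck::InOrder
    } else if incoming <= last_seen {
        SeqCheck::Duplicate
    } else {
        SeqCheck::Gap { missing: last_seen + 1..incoming }
    }
}
```

On a `Gap`, the consumer can re-request the missing range from the stream before acknowledging further messages.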
### Failure Modes
| Scenario | Detection | Response |
| --- | --- | --- |
| Replay failure | `TransactionError` from the scheduler | Log, halt, restart from snapshot |
| Checksum mismatch | SuperBlock hash differs | Alert, flag state divergence |
| Primary lock timeout | Lock TTL expires | Standby promotes to primary |
| NATS disconnect | Connection loss | Reconnect, resume from last ack |
| Snapshot unavailable | ObjectStore miss | Start from the stream beginning |
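The lock-timeout row is the one that drives failover, so its decision rule is worth pinning down. A deterministic sketch, using plain millisecond timestamps instead of reading the lock from NATS KV (the `LeaderLock` type is hypothetical):

```rust
// Promotion sketch: a standby watches the leader lock's last refresh time
// and promotes itself once the TTL has fully lapsed. Real code would read
// the lock entry from NATS KV; timestamps are plain u64 milliseconds here
// to keep the sketch deterministic.

struct LeaderLock {
    last_refresh_ms: u64,
    ttl_ms: u64,
}

fn should_promote(lock: &LeaderLock, now_ms: u64) -> bool {
    // saturating_sub guards against clock skew making `now` precede the refresh.
    now_ms.saturating_sub(lock.last_refresh_ms) > lock.ttl_ms
}
```

The primary's once-per-second refresh gives the TTL comfortable headroom, so a lapse genuinely indicates a dead or partitioned primary rather than a slow refresh.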
## Open Questions
- **Checksum algorithm:** SHA-256 of AccountsDb? A Merkle root? Something faster like xxHash?
- **Authentication:** TLS for NATS connections? NATS user accounts? Or trust the internal network?
- **Backpressure:** what if a standby can't keep up with the stream? NATS rate limiting? Consumer pacing?
- **Snapshot frequency:** how often should the primary upload snapshots? There is a trade-off between recovery time and upload overhead.
- **Replay failure policy:** if a transaction fails to replay on a standby (e.g., due to program differences), what should happen? Panic? Retry? Alert an operator?
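For the checksum question, whatever hash is chosen, an order-independent combination over per-account hashes is attractive because AccountsDb iteration order need not match between nodes. A sketch of that shape only; `DefaultHasher` is a placeholder (its output is not stable across Rust versions or processes, so it could never ship), and the real choice between SHA-256, xxHash, or a Merkle root remains open:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Order-independent AccountsDb checksum sketch: hash each (pubkey, data)
// pair independently, then combine with wrapping addition so the result
// does not depend on iteration order. DefaultHasher stands in for a real,
// cross-process-stable hash, which is exactly the open question.

fn account_hash(pubkey: &[u8; 32], data: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    pubkey.hash(&mut h);
    data.hash(&mut h);
    h.finish()
}

fn accounts_checksum<'a>(accounts: impl Iterator<Item = (&'a [u8; 32], &'a [u8])>) -> u64 {
    accounts.fold(0u64, |acc, (k, v)| acc.wrapping_add(account_hash(k, v)))
}
```

A Merkle root would additionally localize which accounts diverged, at the cost of maintaining the tree incrementally.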
## Implementation Phases
### Phase 1: Protocol and NATS Integration ✅
- Message type definitions with bincode serialization
- NATS JetStream broker connection and stream/bucket initialization
### Phase 2: Service Layer ✅

### Phase 3: Filesystem Watcher ✅

### Phase 4: Validation and Monitoring 📋

### Phase 5: Testing and Hardening 📋