
feat(sequencer): catchup from base #3057

Draft
julienrbrt wants to merge 12 commits into main from julien/catchup-base

Conversation

@julienrbrt
Member

Closes: #3016

Opus did a pretty good job. I'll be improving it.

@github-actions
Contributor

github-actions bot commented Feb 9, 2026

The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).

Build: ✅ passed · Format: ⏩ skipped · Lint: ✅ passed · Breaking: ✅ passed · Updated (UTC): Feb 16, 2026, 5:09 PM

@claude
Contributor

claude bot commented Feb 9, 2026

Claude finished @julienrbrt's task in 2m 5s — View job


Comprehensive Code Review

I've completed a thorough review of the sequencer catch-up feature implementation. The overall architecture is well-designed, but I've identified critical issues that will cause chain halts, along with several recommended improvements.


Critical Issues (Must Fix)

1. Non-Monotonic Block Timestamps During Catch-Up 🔴

Location: pkg/sequencers/single/sequencer.go:368-378

Severity: HIGH - Will cause chain halt

Issue: During catch-up mode, all blocks produced for the same DA epoch use the exact same timestamp (currentDAEndTime). Most execution layers (EVM, Cosmos SDK, etc.) require strictly monotonically increasing timestamps. If the sequencer produces multiple blocks per epoch or encounters an empty epoch, subsequent blocks will have the same or older timestamps, causing the execution layer to reject them and halt the chain.

Example Scenario:

// Epoch 100 has 3 forced txs, block size fits 1 tx per block:
Block 1: timestamp = epochTimestamp  // ✓ OK
Block 2: timestamp = epochTimestamp  // ✗ REJECTED (same as parent)
Block 3: timestamp = epochTimestamp  // ✗ REJECTED (same as parent)

Recommended Fix:

// Lines 368-378: rather than reusing the same timestamp for every block
// in an epoch, derive a strictly increasing timestamp per block by
// counting down the remaining forced txs toward the epoch end time.
if c.catchUpState.Load() == catchUpInProgress {
    daEndTime := c.currentDAEndTime
    if !daEndTime.IsZero() {
        // Each successive block in the epoch lands 1ms later than the
        // previous one and converges on daEndTime for the final block.
        timestamp = daEndTime.Add(-time.Duration(remainingForcedTxs) * time.Millisecond)
    }
}
// Never go backwards relative to the parent block; this requires
// tracking lastBlockTime (or querying the execution layer).
if !timestamp.After(lastBlockTime) {
    timestamp = lastBlockTime.Add(time.Millisecond)
}

The current logic at line 376 tries to handle this but has flaws:

  • It doesn't account for empty epochs (no timestamp update)
  • It doesn't ensure timestamps are strictly greater than the parent block
  • The millisecond offset might still produce collisions for fast block production

Impact: Chain will halt during catch-up when processing multi-block epochs.


2. Data Race on Catch-Up State Fields 🔴

Location: pkg/sequencers/single/sequencer.go:71, 429-431

Severity: MEDIUM-HIGH - Undefined behavior, potential corrupted state

Issue: The field currentDAEndTime (line 71) is a time.Time struct that is read and written concurrently without synchronization:

  • Written in fetchNextDAEpoch (line 482) by the block production goroutine
  • Read in GetNextBatch (line 370) by the same goroutine
  • Potentially read by isCatchingUp() (line 429-431) which could be called from RPC handlers, metrics collectors, or other goroutines

While catchUpState uses atomic.Int32, currentDAEndTime has no such protection. Concurrent access to non-atomic fields is a data race in Go, leading to undefined behavior.

Recommended Fix:

// Option 1: Use atomic.Pointer for time.Time
type Sequencer struct {
    ...
    catchUpState     atomic.Int32
    currentDAEndTime atomic.Pointer[time.Time]  // Changed from time.Time
}

// Update writes:
if !forcedTxsEvent.Timestamp.IsZero() {
    t := forcedTxsEvent.Timestamp.UTC()
    c.currentDAEndTime.Store(&t)
}

// Update reads:
daEndTime := c.currentDAEndTime.Load()
if daEndTime != nil && !daEndTime.IsZero() {
    timestamp = *daEndTime
}

// Option 2: Protect with mutex (if atomic pointer is too complex)
type Sequencer struct {
    ...
    catchUpMu        sync.RWMutex
    catchUpState     int32  // protected by catchUpMu
    currentDAEndTime time.Time  // protected by catchUpMu
}

Impact: Corrupted timestamps, possible panics, unpredictable behavior under concurrent access.


High Priority Issues

3. Magic Number - Define Catch-Up Threshold Constant

Location: pkg/sequencers/single/sequencer.go:569

Issue: The value 30*time.Second is hard-coded without explanation. This threshold determines when the sequencer enters catch-up mode based on timestamp age.

Recommendation:

// At package level:
const (
    // minCatchUpThreshold is the floor for the epoch-age threshold
    // that triggers catch-up mode.
    minCatchUpThreshold = 30 * time.Second
)

// Use in code:
if catchUpThreshold < minCatchUpThreshold {
    catchUpThreshold = minCatchUpThreshold
}

Rationale: Named constants improve code readability and make tuning easier during testing or operational adjustments.


4. Test Logger Configuration Inconsistency

Location: Multiple test functions in pkg/sequencers/single/sequencer_test.go

Issue: Some tests use zerolog.New(zerolog.NewConsoleWriter()) while others use zerolog.New(zerolog.NewTestWriter(t)) or zerolog.Nop(). Console writer produces verbose output during test runs, cluttering CI logs.

Affected lines: 1245, 1309, 1457, 1676 (and potentially more)

Recommendation:

// Consistent approach - only show logs for failed tests:
logger := zerolog.New(zerolog.NewTestWriter(t))

// Or completely silent:
logger := zerolog.Nop()

Impact: Minor - affects developer experience and CI log clarity.


Architecture & Design Review

✅ Strengths

  1. Well-structured state machine: The catchUpUnchecked → catchUpInProgress → catchUpDone state machine is clear and prevents redundant checks.

  2. Incremental DA height advancement (syncer.go:780-806): The logic to advance DAHeight by one epoch during catch-up is correctly implemented and properly handles the scenario Gemini described. This ensures forced inclusion verification works correctly during catch-up.

  3. Comprehensive test coverage: Tests cover key scenarios including:

    • Epoch gap detection
    • Mempool skipping during catch-up
    • DA timestamp usage
    • Exit conditions (HeightFromFuture)
    • Multi-epoch replay
  4. Proper separation of concerns: Catch-up logic is confined to the sequencer package and doesn't leak into unrelated components.

⚠️ Areas for Improvement

Code Quality

Empty Epoch Handling: The code doesn't explicitly handle empty DA epochs during catch-up. If an epoch has no forced inclusion transactions and returns an empty result with a zero timestamp, currentDAEndTime won't be updated, causing subsequent blocks to reuse stale timestamps.

Error Handling: In fetchNextDAEpoch (lines 454-478), when ErrHeightFromFuture is encountered, the function returns (0, nil). This magic 0 value requires callers to check for it. Consider using a sentinel error or a structured result type.

Context Cancellation: updateCatchUpState (line 531) calls GetLatestDAHeight without a timeout. If the DA layer is slow or unresponsive, this could block indefinitely. Consider:

queryCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()
latestDAHeight, err := c.daClient.GetLatestDAHeight(queryCtx)

Performance

Single DA Query Overhead: updateCatchUpState calls GetLatestDAHeight once per sequencer lifecycle. This is acceptable, but the implementation should document that this is intentional (not a performance bug).

Checkpoint Persistence: Every call to GetNextBatch persists the checkpoint to disk (line 350). For high-throughput scenarios, consider batching or using a write-behind cache.


Testing

Missing Test Cases:

  1. Multi-block per epoch: Test when multiple blocks are produced for a single DA epoch (timestamp monotonicity issue)
  2. Empty epoch handling: Test when a DA epoch has no forced inclusion transactions
  3. Concurrent access: Test data race scenarios (though this is hard to test reliably)
  4. Checkpoint crash recovery: Test resuming from a crash mid-epoch with partial TxIndex

Test Gaps (from codecov report):

  • block/internal/syncing/syncer.go: 22.22% patch coverage (13 lines missing)
  • block/internal/da/tracing.go: 0% coverage (10 lines missing)
  • block/internal/da/client.go: 0% coverage (9 lines missing)

Security Review

✅ No Major Security Issues

The catch-up mechanism doesn't introduce new attack vectors. Key security properties are maintained:

  1. Forced inclusion verification still works correctly due to the incremental DAHeight advancement in syncer.go
  2. No privilege escalation: Catch-up mode doesn't bypass any access controls
  3. No secret exposure: No new logging of sensitive data

⚠️ Operational Security Considerations

Chain Halt Risk: The timestamp monotonicity issue (#1) will cause the chain to halt during catch-up. This is a denial-of-service condition that requires manual intervention.

State Divergence: If the data race (#2) causes currentDAEndTime to be corrupted, blocks produced during catch-up may have invalid timestamps, potentially causing state divergence between full nodes and the sequencer.


Error Handling & Resilience

✅ Good Practices

  1. Graceful degradation: When GetLatestDAHeight fails, the sequencer logs a warning and continues (line 550-552)
  2. Proper error wrapping: Errors include context (DA height, epoch numbers) for debugging
  3. State recovery: Checkpoint mechanism allows resuming from crashes

⚠️ Issues

Silent Failure: If getBlockTimestamp fails in client.go:224-228, it falls back to time.Now() with only a log warning. This could cause based sequencers to produce blocks with incorrect timestamps. Consider returning an error instead:

if err != nil {
    return time.Time{}, fmt.Errorf("failed to get block timestamp: %w", err)
}

Documentation

Missing Documentation

  1. Catch-up behavior: The PR description mentions "Opus did a pretty good job. I'll be improving it" but doesn't explain:

    • When catch-up triggers
    • How long it takes
    • What operators should expect
  2. CHANGELOG.md entry is minimal (3 lines added). Should include:

    • Feature description
    • Breaking changes (if any)
    • Migration guide for operators
  3. Inline comments: The timestamp logic (lines 367-378) needs clearer explanation of the millisecond offset calculation.


Summary & Recommendations

Must Fix Before Merge

  1. Fix non-monotonic timestamps (Issue #1) - Will cause chain halt
  2. Fix data race on currentDAEndTime (Issue #2) - Undefined behavior

Recommended Improvements

  1. Define catch-up threshold constant (Issue #3)
  2. Standardize test logger configuration (Issue #4)
  3. Add timeout to GetLatestDAHeight call
  4. Add tests for multi-block-per-epoch scenarios
  5. Document empty epoch handling
  6. Improve CHANGELOG.md with detailed feature description

Overall Assessment

The catch-up feature is architecturally sound and addresses a real operational need. The syncer.go changes for incremental DA height advancement are correct and well-implemented. However, the critical timestamp and concurrency issues must be resolved before merging to prevent production incidents.

Estimated Risk: HIGH (chain halt due to timestamp issue)
Recommendation: Do not merge until critical issues #1 and #2 are fixed and verified with additional tests.


Additional Files to Review

The following files were modified but warrant deeper review:

  • scripts/test-catchup.sh (393 lines added) - New test script
  • test/mocks/da.go (+58 lines) - Mock changes
  • apps/evm/server/force_inclusion_test.go (+4 lines) - Test updates

Branch: julien/catchup-base

@gemini-code-assist
Contributor

Summary of Changes

Hello @julienrbrt, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a "catch-up" feature for the sequencer, designed to improve its resilience and consistency after periods of inactivity or downtime. By replaying missed Data Availability (DA) epochs, the sequencer resynchronizes its state, ensuring that all forced inclusion transactions are processed in the correct order and with accurate timestamps, so it maintains a consistent view of the chain's history. This mechanism lets the sequencer transition back to normal operation without compromising data integrity.

Highlights

  • Sequencer Catch-up Mechanism: Introduced a new mechanism allowing the sequencer to "catch up" on missed Data Availability (DA) epochs after extended downtime.
  • Forced Inclusion Priority: During catch-up, the sequencer temporarily processes only forced inclusion transactions, ensuring consistency with base sequencing nodes and temporarily skipping mempool transactions.
  • DA Height Synchronization: Modified syncer.go to incrementally advance the DAHeight by one epoch during catch-up, preventing issues with forced inclusion transaction verification.
  • Timestamp Alignment: Block timestamps generated during catch-up are now aligned with the DA epoch's end timestamp for historical accuracy.
  • Comprehensive Testing: Added extensive unit tests covering various catch-up scenarios, including detection, mempool skipping, timestamp usage, exit conditions, and multi-epoch replay.


Changelog
  • block/internal/syncing/syncer.go
    • Updated the logic for DAHeight updates to handle large discrepancies between event.DaHeight and newState.DAHeight. When a significant gap (more than one epoch) is detected, newState.DAHeight is now advanced by exactly one epoch per block, rather than jumping directly to event.DaHeight. This ensures correct verification of forced inclusion transactions during sequencer catch-up.
    • Added detailed comments explaining the new DAHeight update strategy and its importance for catch-up blocks and forced inclusion verification.
  • pkg/sequencers/single/sequencer.go
    • Added catchingUp (boolean) and currentDAEndTime (time.Time) to the Sequencer struct to manage the catch-up state and store the DA epoch's end timestamp.
    • Modified GetNextBatch to conditionally skip fetching transactions from the mempool when the sequencer is in catchingUp mode. This ensures that only forced inclusion transactions are included in batches during catch-up.
    • Updated GetNextBatch to use sequencer.currentDAEndTime as the block timestamp when catchingUp is true, aligning block timestamps with the historical DA epoch.
    • Introduced a public IsCatchingUp() method to query the sequencer's current catch-up status.
    • Modified fetchNextDAEpoch to store the forcedTxsEvent.Timestamp in currentDAEndTime and to call a new updateCatchUpState method to manage the catchingUp flag. It also now exits catch-up mode if StatusHeightFromFuture is returned by the DA client.
    • Implemented a new private method updateCatchUpState that determines whether the sequencer should be in catch-up mode based on the timestamp of the fetched DA epoch. It enters catch-up if the epoch timestamp is significantly in the past (more than one epoch's wall-clock duration) and exits if it's recent.
  • pkg/sequencers/single/sequencer_test.go
    • Added a comprehensive suite of tests (TestSequencer_CatchUp_...) to validate the new catch-up functionality. These tests cover:
    • Detection of old epochs and entering catch-up mode.
    • Skipping mempool transactions during catch-up.
    • Using DA epoch timestamps for blocks during catch-up.
    • Exiting catch-up mode when the DA head is reached (either a recent epoch or HeightFromFuture).
    • Scenarios where catch-up is not triggered (recent epoch, no forced inclusion configured).
    • Multi-epoch replay scenarios.
    • Correct advancement of the internal checkpoint (DAHeight) during catch-up.
Activity
  • The pull request addresses and closes issue https://github.com/evstack/ev-node/issues/3016.
  • The author indicates that this work builds upon a previous attempt by "Opus", aiming for improvement.

@codecov

codecov bot commented Feb 9, 2026

Codecov Report

❌ Patch coverage is 64.86486% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.10%. Comparing base (1da7634) to head (9288b29).

Files with missing lines Patch % Lines
block/internal/syncing/syncer.go 22.22% 13 Missing and 1 partial ⚠️
block/internal/da/tracing.go 0.00% 10 Missing ⚠️
block/internal/da/client.go 0.00% 9 Missing ⚠️
pkg/sequencers/single/sequencer.go 91.89% 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3057      +/-   ##
==========================================
+ Coverage   61.04%   61.10%   +0.05%     
==========================================
  Files         113      113              
  Lines       11440    11546     +106     
==========================================
+ Hits         6984     7055      +71     
- Misses       3674     3710      +36     
+ Partials      782      781       -1     
Flag Coverage Δ
combined 61.10% <64.86%> (+0.05%) ⬆️


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a robust catch-up mechanism for the sequencer, designed to handle restarts after extended downtime. While the overall approach for consuming and verifying catch-up blocks in the syncer, including the incremental advancement of DAHeight, is well-implemented and tested, the implementation of catch-up mode in the single sequencer has significant flaws. Specifically, it produces non-monotonic block timestamps when multiple blocks are generated for a single DA epoch or when empty epochs are encountered, which will likely cause the execution layer to reject blocks and halt the chain. Additionally, there is a data race on the new catch-up state fields due to a lack of synchronization primitives. Minor suggestions for code clarity and testing experience were also noted.

Comment on lines 362 to 365
timestamp := time.Now()
if c.catchingUp && !c.currentDAEndTime.IsZero() {
timestamp = c.currentDAEndTime
}

Severity: high (security)

During catch-up mode, the sequencer uses the timestamp from the last fetched DA epoch as the block timestamp. However, multiple blocks can be produced for a single DA epoch (e.g., if the epoch contains many transactions that don't fit in one block). In this case, all blocks produced for that epoch will have the exact same timestamp (c.currentDAEndTime). Furthermore, if an empty DA epoch is encountered, c.currentDAEndTime is not updated, causing the next block to reuse the timestamp from the previous non-empty epoch.

Most blockchain execution layers (like EVM) require that each new block has a timestamp strictly greater than its parent. If the sequencer produces multiple blocks with the same timestamp, the execution layer will reject them, halting the block production process and causing a Denial of Service.

Comment on lines 60 to 63
catchingUp bool
// currentDAEndTime is the DA epoch end timestamp from the last fetched epoch.
// Used as the block timestamp during catch-up to match based sequencing behavior.
currentDAEndTime time.Time

Severity: medium (security)

The fields catchingUp and currentDAEndTime are accessed concurrently without proper synchronization. GetNextBatch and fetchNextDAEpoch (which update these fields) can be called by the block production loop, while IsCatchingUp() can be called concurrently by other components (e.g., RPC, monitoring, or metrics).

In Go, concurrent read/write access to non-atomic fields is a data race, which leads to undefined behavior. Specifically, a race on time.Time (which is a struct) can result in a corrupted timestamp being read, potentially leading to invalid blocks and a Denial of Service.

Comment on lines 533 to 535
if catchUpThreshold < 30*time.Second {
catchUpThreshold = 30 * time.Second
}

Severity: medium

The value 30*time.Second is used as a minimum threshold for the catch-up logic. To improve readability and maintainability, consider defining this as a named constant at the package level, for example: const minCatchUpThreshold = 30 * time.Second, and using that constant here. This makes the intent of the value clearer and centralizes its definition.

// This test uses maxBytes to limit how many txs are fetched, triggering the unprocessed txs scenario.
func TestSequencer_CatchUp_DetectsOldEpoch(t *testing.T) {
ctx := context.Background()
logger := zerolog.New(zerolog.NewConsoleWriter())

Severity: medium

Using zerolog.New(zerolog.NewConsoleWriter()) in tests can lead to verbose output in the console during test runs. It's generally better to use a logger that integrates with the testing framework, such as zerolog.New(zerolog.NewTestWriter(t)), which only shows logs for failed tests, or zerolog.Nop() if logs are not needed. This practice is followed in other new tests in this file (e.g., TestSequencer_CatchUp_UsesDATimestamp). Applying this consistently would improve the developer experience. This comment also applies to lines 1309, 1457, and 1676.

Suggested change
- logger := zerolog.New(zerolog.NewConsoleWriter())
+ logger := zerolog.New(zerolog.NewTestWriter(t))


Development

Successfully merging this pull request may close these issues.

[FEATURE] Sequencer catchup on restart from base

1 participant