Skip to content

chore: replace starting_up bool with coordination mode#1033

Open
thlorenz wants to merge 22 commits intomasterfrom
thlorenz/schedulter_dual-mode+validator-keypair
Open

chore: replace starting_up bool with coordination mode#1033
thlorenz wants to merge 22 commits intomasterfrom
thlorenz/schedulter_dual-mode+validator-keypair

Conversation

@thlorenz
Copy link
Collaborator

@thlorenz thlorenz commented Mar 10, 2026

Summary

Replace the starting_up boolean flag with a proper CoordinationMode enum that tracks the
validator's lifecycle state (StartingUp → Primary or Replica).

Details

CoordinationMode (magicblock-core)

New module that replaces the starting_up boolean with a proper state machine:

  • StartingUp: Ledger replay phase — no validator signer required, no side effects
  • Primary: Full operation — requires validator signer, allows side effects
  • Replica: Strict replay mode — no validator signer, enforces ordering

The mode uses atomic operations for lock-free reads from program instruction handlers.

Scheduler Updates (magicblock-processor)

Modified ExecutionCoordinator to support the StartingUp state:

  • Added StartingUp(ReplicaMode) variant to track the pre-replay state
  • Implements mode-based transaction filtering (same logic as Replica during startup)
  • New methods switch_to_replica_mode_globally() and switch_to_primary_mode_globally() coordinate scheduler + global validator state

TransactionScheduler now respects validator type during mode transitions:

  • Standalone → Primary
  • StandBy/ReplicatOnly → Replica

Tests

Added comprehensive tests for StartingUp mode transitions and blocking behavior.

Summary by CodeRabbit

  • New Features

    • Adds a global coordination mode (StartingUp, Primary, Replica) controlling when intents are scheduled and how transactions are accepted.
  • Refactor

    • Replaces legacy startup signaling with an explicit mode-switch flow; scheduler transitions are driven by advertised target mode and legacy startup flags removed.
  • Tests

    • Adds unit and integration tests covering startup/StartingUp behavior, mode transitions, and scheduling rules.
  • Bug Fixes

    • Prevents premature scheduling of intents during startup.
  • Chores

    • Improves error reporting for failed mode-switch sends.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a lock-free global CoordinationMode enum (StartingUp, Primary, Replica) with query and transition helpers in magicblock-core. Introduces a StartingUp scheduling mode in the processor and updates ExecutionCoordinator and scheduler logic to be mode-aware. Replaces the STARTING_UP flag and Notify-based signaling with a channel-based mode switch propagated from MagicValidator (new is_standalone plus mode_tx/mode_rx) into TransactionSchedulerState. Startup now sends an explicit Scheduler/Target mode to the scheduler; related call sites and tests updated. Adds ApiError::FailedToSendModeSwitch.

Suggested reviewers

  • GabrielePicco
  • bmuddha
  • Dodecahedr0x
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch thlorenz/schedulter_dual-mode+validator-keypair

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@thlorenz thlorenz force-pushed the thlorenz/schedulter_dual-mode+validator-keypair branch from 16c1bc8 to bcba134 Compare March 10, 2026 01:50
@thlorenz thlorenz marked this pull request as draft March 10, 2026 01:52
@thlorenz thlorenz requested a review from bmuddha March 10, 2026 01:52
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
magicblock-processor/src/scheduler/coordinator.rs (1)

230-285: ⚠️ Potential issue | 🟠 Major

Align the local and global transition rules.

magicblock_core::coordination_mode::{switch_to_primary_mode,switch_to_replica_mode} only accepts StartingUp -> *, but the local helpers here still allow or tolerate states that the global state machine rejects. In particular, switch_to_primary_mode_globally() can move the local coordinator from Replica to Primary, and switch_to_replica_mode_globally() still calls the global switch even after the local helper warned and returned from Primary. An unexpected caller can therefore panic after only one side changed.

Either tighten the local scheduler to the same StartingUp -> {Primary, Replica} transitions as magicblock-core, or have the local helpers return whether a startup transition actually happened and only then invoke the global switch. If Replica -> Primary is intentional, the global state machine needs to support it too.

Suggested change
-    pub(super) fn switch_to_primary_mode(&mut self) {
-        if let CoordinationMode::Primary(_) = self.mode {
-            warn!("Tried to switch to primary mode more than once");
-            return;
-        }
+    pub(super) fn switch_to_primary_mode(&mut self) -> bool {
+        if !matches!(&self.mode, CoordinationMode::StartingUp(_)) {
+            warn!("Cannot switch to primary mode from non-startup state");
+            return false;
+        }
         let mode = PrimaryMode {
             blocked_txn_count: 0,
             max_blocked_txn: self.blocked_transactions.len()
                 * BLOCKED_TXN_MULTIPLIER,
         };
         self.mode = CoordinationMode::Primary(mode);
+        true
     }
@@
-    pub(super) fn switch_to_replica_mode(&mut self) {
+    pub(super) fn switch_to_replica_mode(&mut self) -> bool {
         match &self.mode {
             CoordinationMode::Replica(_) => {
                 warn!("Tried to switch to replica mode more than once");
-                return;
+                return false;
             }
             CoordinationMode::Primary(_) => {
                 warn!("Cannot switch from primary to replica mode");
-                return;
+                return false;
             }
             CoordinationMode::StartingUp(r) => {
                 self.mode = CoordinationMode::Replica(ReplicaMode {
                     pending: r.pending,
                 });
+                return true;
             }
         }
     }
@@
     pub(super) fn switch_to_primary_mode_globally(&mut self) {
-        self.switch_to_primary_mode();
-        magicblock_core::coordination_mode::switch_to_primary_mode();
+        if self.switch_to_primary_mode() {
+            magicblock_core::coordination_mode::switch_to_primary_mode();
+        }
     }
@@
     pub(super) fn switch_to_replica_mode_globally(&mut self) {
-        self.switch_to_replica_mode();
-        magicblock_core::coordination_mode::switch_to_replica_mode();
+        if self.switch_to_replica_mode() {
+            magicblock_core::coordination_mode::switch_to_replica_mode();
+        }
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@magicblock-processor/src/scheduler/coordinator.rs` around lines 230 - 285,
The local helpers can change state without the global helper accepting the same
transition; change switch_to_primary_mode and switch_to_replica_mode to return a
bool indicating whether a local transition from StartingUp actually occurred
(true only when you replace CoordinationMode::StartingUp with Primary/Replica),
keep the existing warns for no-op/invalid transitions, and then update
switch_to_primary_mode_globally and switch_to_replica_mode_globally to call
magicblock_core::coordination_mode::switch_to_primary_mode() or
::switch_to_replica_mode() only when the respective local helper returned true;
reference the CoordinationMode enum and PrimaryMode/ReplicaMode struct
constructors in the diff to locate where to add the return and conditional
global calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 701-705: The scheduler is being notified via
mode_switcher.notify_one() before reset_accounts_bank() runs, allowing
RPC/traffic to observe accounts during startup cleanup; move the notify_one()
call so it occurs only after reset_accounts_bank() (and any other startup-only
bank cleanup) completes, ensuring the StartingUp->running transition happens
after cleanup; update code in the function containing
mode_switcher.notify_one(), reset_accounts_bank(), and related startup steps to
call notify_one() last.

In `@magicblock-core/src/coordination_mode.rs`:
- Around line 66-96: The current switch_to_primary_mode and
switch_to_replica_mode use load + swap which has a TOCTOU race; replace the
logic with an atomic compare_exchange loop on COORDINATION_MODE that attempts to
change from CoordinationMode::StartingUp as u8 to the target
(CoordinationMode::Primary/Replica as u8) using appropriate orderings (Acquire
for load, Release for successful store, Acquire/Relaxed for failure as needed),
and treat a compare_exchange failure as a no-op if the current value already
equals the target; keep the assertion or error path only if the observed value
is neither StartingUp nor the target to preserve the original invariant.

In `@test-kit/src/lib.rs`:
- Around line 175-176: The scheduler currently stores a fixed is_standalone
boolean at construction (field is_standalone) and uses the constructor's
notify_one() to signal startup, which locks replicas into the wrong transition
path; update the startup notification so the message carries the desired target
mode (e.g., include the target enum/value instead of relying on is_standalone)
or change the constructor behavior to emit the startup-complete notification for
both initial modes and make the scheduler consult the mode in the notification
handler (rather than a fixed is_standalone) so new_replica_mode(..., false) can
transition out of StartingUp when switch_to_primary_mode() is called; adjust
code paths in new_replica_mode, switch_to_primary_mode, and the notification
handling to read the communicated target mode (or read a mutable mode field)
instead of deriving it from the immutable is_standalone flag.

---

Outside diff comments:
In `@magicblock-processor/src/scheduler/coordinator.rs`:
- Around line 230-285: The local helpers can change state without the global
helper accepting the same transition; change switch_to_primary_mode and
switch_to_replica_mode to return a bool indicating whether a local transition
from StartingUp actually occurred (true only when you replace
CoordinationMode::StartingUp with Primary/Replica), keep the existing warns for
no-op/invalid transitions, and then update switch_to_primary_mode_globally and
switch_to_replica_mode_globally to call
magicblock_core::coordination_mode::switch_to_primary_mode() or
::switch_to_replica_mode() only when the respective local helper returned true;
reference the CoordinationMode enum and PrimaryMode/ReplicaMode struct
constructors in the diff to locate where to add the return and conditional
global calls.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9cbe712f-85e2-4fe1-a292-eb22c09ea0a5

📥 Commits

Reviewing files that changed from the base of the PR and between 190bd7a and bcba134.

⛔ Files ignored due to path filters (1)
  • test-integration/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (11)
  • magicblock-api/src/magic_validator.rs
  • magicblock-core/src/coordination_mode.rs
  • magicblock-core/src/lib.rs
  • magicblock-processor/src/scheduler/coordinator.rs
  • magicblock-processor/src/scheduler/mod.rs
  • magicblock-processor/src/scheduler/state.rs
  • magicblock-processor/src/scheduler/tests.rs
  • programs/magicblock/src/schedule_transactions/process_scheduled_commit_sent.rs
  • programs/magicblock/src/test_utils/mod.rs
  • programs/magicblock/src/validator.rs
  • test-kit/src/lib.rs

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 718-728: The mode switch send currently ignores failures from
self.mode_switch_tx.try_send(target), which can leave the scheduler stuck;
change the code around the TargetMode selection and try_send call (the block
that sets target and calls self.mode_switch_tx.try_send(target)) to explicitly
match the Result, handling TrySendError::Full and TrySendError::Closed by
returning an Err (or panic if you prefer an invariant failure) so startup fails
fast; include a clear error/log message referencing the failed send and the
target mode so the caller of the surrounding function (the startup path in
magic_validator) can abort instead of silently continuing.

In `@magicblock-core/src/coordination_mode.rs`:
- Around line 22-26: Add a test-only reset path that sets the process-global
COORDINATION_MODE back to CoordinationMode::StartingUp so tests can return the
global state to a known value; implement a cfg(test) function (e.g., pub(crate)
fn reset_coordination_mode_for_tests()) that writes
AtomicU8::new(CoordinationMode::StartingUp as u8) or uses
COORDINATION_MODE.store(...) to restore StartingUp, and call this in tests
before mutating state so switch_to_primary_mode() and switch_to_replica_mode()
assertions do not leak between tests.

In `@test-kit/src/lib.rs`:
- Around line 180-183: Replace the silent ignores of TrySendError when
pre-sending target mode with explicit .expect() calls so failures are clearly
reported; locate the two places using
this.mode_switch_tx.try_send(TargetMode::Primary) (the pre-send when
primary_mode is true and the other send around lines 209–210) and change the
pattern from `let _ = ...` to calling `.expect("failed to send target mode to
mode_switch_tx")` (or a similarly descriptive message) on the Result so tests
fail with a clear diagnostic if the channel unexpectedly fails.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7b450a39-75cc-411d-8de1-54c75b5c0ca8

📥 Commits

Reviewing files that changed from the base of the PR and between 68a81dd and d22a126.

📒 Files selected for processing (5)
  • magicblock-api/src/magic_validator.rs
  • magicblock-core/src/coordination_mode.rs
  • magicblock-processor/src/scheduler/mod.rs
  • magicblock-processor/src/scheduler/state.rs
  • test-kit/src/lib.rs

Base automatically changed from bmuddha/scheduler/dual-mode to master March 10, 2026 10:52
@thlorenz thlorenz marked this pull request as ready for review March 10, 2026 10:53
…ypair

* master:
  feat: dual mode transaction scheduler (#1004)
  fix: tui txs (#1036)
  Fix: explorer rendering for failed transactions (#1032)
@github-actions
Copy link

github-actions bot commented Mar 10, 2026

Manual Deploy Available

You can trigger a manual deploy of this PR branch to testnet:

Deploy to Testnet 🚀

Alternative: Comment /deploy on this PR to trigger deployment directly.

⚠️ Note: Manual deploy requires authorization. Only authorized users can trigger deployments.

Comment updated automatically when the PR is synchronized.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
test-kit/src/lib.rs (1)

178-183: ⚠️ Potential issue | 🟠 Major

Replica test envs still have no path to complete startup into Replica.

Line 178 only pre-queues Primary, and Line 209 only exposes a primary switch. As a result, new_replica_mode(..., false) stays in StartingUp, and the deferred replica path has no public way to exit startup into Replica later. That leaves the harness exercising different coordination semantics than its API implies.

Proposed fix
-        // Pre-send the target mode so the scheduler picks it up once running.
-        if primary_mode {
-            this.mode_tx
-                .try_send(SchedulerMode::Primary)
-                .expect("failed to pre-send target mode to mode_tx");
-        }
+        // Pre-send the target mode for non-deferred startup so the scheduler
+        // leaves StartingUp in the intended mode once running.
+        if !defer_startup {
+            let initial_mode = if primary_mode {
+                SchedulerMode::Primary
+            } else {
+                SchedulerMode::Replica
+            };
+            this.mode_tx
+                .try_send(initial_mode)
+                .expect("failed to pre-send target mode to mode_tx");
+        }
pub fn switch_to_replica_mode(&self) {
    self.mode_tx
        .try_send(SchedulerMode::Replica)
        .expect("failed to send replica target mode to mode_tx");
}

Also applies to: 205-213

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test-kit/src/lib.rs` around lines 178 - 183, The test harness pre-queues only
SchedulerMode::Primary (this.mode_tx.try_send(SchedulerMode::Primary)) so
replica test environments never get a path to leave StartingUp; update the code
to also pre-send SchedulerMode::Replica when primary_mode is false and add a
public method mirroring the primary switch (e.g., switch_to_replica_mode) that
sends SchedulerMode::Replica on mode_tx; ensure new_replica_mode(...) can
receive that signal to transition out of StartingUp and mirror the existing
primary switch logic (references: mode_tx, SchedulerMode::Primary,
SchedulerMode::Replica, new_replica_mode, switch_to_replica_mode).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@test-kit/src/lib.rs`:
- Around line 178-183: The test harness pre-queues only SchedulerMode::Primary
(this.mode_tx.try_send(SchedulerMode::Primary)) so replica test environments
never get a path to leave StartingUp; update the code to also pre-send
SchedulerMode::Replica when primary_mode is false and add a public method
mirroring the primary switch (e.g., switch_to_replica_mode) that sends
SchedulerMode::Replica on mode_tx; ensure new_replica_mode(...) can receive that
signal to transition out of StartingUp and mirror the existing primary switch
logic (references: mode_tx, SchedulerMode::Primary, SchedulerMode::Replica,
new_replica_mode, switch_to_replica_mode).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5150c981-ba2d-44ea-90ea-848c096cc614

📥 Commits

Reviewing files that changed from the base of the PR and between ecb4721 and b2012e2.

📒 Files selected for processing (4)
  • magicblock-api/src/magic_validator.rs
  • magicblock-processor/src/scheduler/mod.rs
  • magicblock-processor/src/scheduler/state.rs
  • test-kit/src/lib.rs

Copy link
Collaborator

@bmuddha bmuddha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, but a few important things to address before we can merge

…ypair

* master:
  fix: standalone actions (#1041)
  feat: improve logs rendering (#1040)
  fix: delegation status of not eata accounts owned by the token program (#1039)
  improve: tui (#1038)
Copy link
Collaborator

@bmuddha bmuddha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to allow for bidirectional mode switching

@thlorenz
Copy link
Collaborator Author

Need to allow for bidirectional mode switching

I already responded that we'll add this when we add the code that actually uses this.
I don't think it's a good idea to add methods for future use, esp. if it's trivial to do so at that later point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants