chore: replace starting_up bool with coordination mode#1033
chore: replace starting_up bool with coordination mode#1033
Conversation
Co-authored-by: Amp <amp@ampcode.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a lock-free global CoordinationMode enum (StartingUp, Primary, Replica) with query and transition helpers in magicblock-core. Introduces a StartingUp scheduling mode in the processor and updates ExecutionCoordinator and scheduler logic to be mode-aware. Replaces the STARTING_UP flag and Notify-based signaling with a channel-based mode switch propagated from MagicValidator (new is_standalone plus mode_tx/mode_rx) into TransactionSchedulerState. Startup now sends an explicit Scheduler/Target mode to the scheduler; related call sites and tests updated. Adds ApiError::FailedToSendModeSwitch. Suggested reviewers
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
16c1bc8 to
bcba134
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
magicblock-processor/src/scheduler/coordinator.rs (1)
230-285:⚠️ Potential issue | 🟠 MajorAlign the local and global transition rules.
magicblock_core::coordination_mode::{switch_to_primary_mode,switch_to_replica_mode}only acceptsStartingUp -> *, but the local helpers here still allow or tolerate states that the global state machine rejects. In particular,switch_to_primary_mode_globally()can move the local coordinator fromReplicatoPrimary, andswitch_to_replica_mode_globally()still calls the global switch even after the local helper warned and returned fromPrimary. An unexpected caller can therefore panic after only one side changed.Either tighten the local scheduler to the same
StartingUp -> {Primary, Replica}transitions asmagicblock-core, or have the local helpers return whether a startup transition actually happened and only then invoke the global switch. IfReplica -> Primaryis intentional, the global state machine needs to support it too.Suggested change
- pub(super) fn switch_to_primary_mode(&mut self) { - if let CoordinationMode::Primary(_) = self.mode { - warn!("Tried to switch to primary mode more than once"); - return; - } + pub(super) fn switch_to_primary_mode(&mut self) -> bool { + if !matches!(&self.mode, CoordinationMode::StartingUp(_)) { + warn!("Cannot switch to primary mode from non-startup state"); + return false; + } let mode = PrimaryMode { blocked_txn_count: 0, max_blocked_txn: self.blocked_transactions.len() * BLOCKED_TXN_MULTIPLIER, }; self.mode = CoordinationMode::Primary(mode); + true } @@ - pub(super) fn switch_to_replica_mode(&mut self) { + pub(super) fn switch_to_replica_mode(&mut self) -> bool { match &self.mode { CoordinationMode::Replica(_) => { warn!("Tried to switch to replica mode more than once"); - return; + return false; } CoordinationMode::Primary(_) => { warn!("Cannot switch from primary to replica mode"); - return; + return false; } CoordinationMode::StartingUp(r) => { self.mode = CoordinationMode::Replica(ReplicaMode { pending: r.pending, }); + return true; } } } @@ pub(super) fn switch_to_primary_mode_globally(&mut self) { - self.switch_to_primary_mode(); - magicblock_core::coordination_mode::switch_to_primary_mode(); + if self.switch_to_primary_mode() { + magicblock_core::coordination_mode::switch_to_primary_mode(); + } } @@ pub(super) fn switch_to_replica_mode_globally(&mut self) { - self.switch_to_replica_mode(); - magicblock_core::coordination_mode::switch_to_replica_mode(); + if self.switch_to_replica_mode() { + magicblock_core::coordination_mode::switch_to_replica_mode(); + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@magicblock-processor/src/scheduler/coordinator.rs` around lines 230 - 285, The local helpers can change state without the global helper accepting the same transition; change switch_to_primary_mode and switch_to_replica_mode to return a bool indicating whether a local transition from StartingUp actually occurred (true only when you replace CoordinationMode::StartingUp with Primary/Replica), keep the existing warns for no-op/invalid transitions, and then update switch_to_primary_mode_globally and switch_to_replica_mode_globally to call magicblock_core::coordination_mode::switch_to_primary_mode() or ::switch_to_replica_mode() only when the respective local helper returned true; reference the CoordinationMode enum and PrimaryMode/ReplicaMode struct constructors in the diff to locate where to add the return and conditional global calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 701-705: The scheduler is being notified via
mode_switcher.notify_one() before reset_accounts_bank() runs, allowing
RPC/traffic to observe accounts during startup cleanup; move the notify_one()
call so it occurs only after reset_accounts_bank() (and any other startup-only
bank cleanup) completes, ensuring the StartingUp->running transition happens
after cleanup; update code in the function containing
mode_switcher.notify_one(), reset_accounts_bank(), and related startup steps to
call notify_one() last.
In `@magicblock-core/src/coordination_mode.rs`:
- Around line 66-96: The current switch_to_primary_mode and
switch_to_replica_mode use load + swap which has a TOCTOU race; replace the
logic with an atomic compare_exchange loop on COORDINATION_MODE that attempts to
change from CoordinationMode::StartingUp as u8 to the target
(CoordinationMode::Primary/Replica as u8) using appropriate orderings (Acquire
for load, Release for successful store, Acquire/Relaxed for failure as needed),
and treat a compare_exchange failure as a no-op if the current value already
equals the target; keep the assertion or error path only if the observed value
is neither StartingUp nor the target to preserve the original invariant.
In `@test-kit/src/lib.rs`:
- Around line 175-176: The scheduler currently stores a fixed is_standalone
boolean at construction (field is_standalone) and uses the constructor's
notify_one() to signal startup, which locks replicas into the wrong transition
path; update the startup notification so the message carries the desired target
mode (e.g., include the target enum/value instead of relying on is_standalone)
or change the constructor behavior to emit the startup-complete notification for
both initial modes and make the scheduler consult the mode in the notification
handler (rather than a fixed is_standalone) so new_replica_mode(..., false) can
transition out of StartingUp when switch_to_primary_mode() is called; adjust
code paths in new_replica_mode, switch_to_primary_mode, and the notification
handling to read the communicated target mode (or read a mutable mode field)
instead of deriving it from the immutable is_standalone flag.
---
Outside diff comments:
In `@magicblock-processor/src/scheduler/coordinator.rs`:
- Around line 230-285: The local helpers can change state without the global
helper accepting the same transition; change switch_to_primary_mode and
switch_to_replica_mode to return a bool indicating whether a local transition
from StartingUp actually occurred (true only when you replace
CoordinationMode::StartingUp with Primary/Replica), keep the existing warns for
no-op/invalid transitions, and then update switch_to_primary_mode_globally and
switch_to_replica_mode_globally to call
magicblock_core::coordination_mode::switch_to_primary_mode() or
::switch_to_replica_mode() only when the respective local helper returned true;
reference the CoordinationMode enum and PrimaryMode/ReplicaMode struct
constructors in the diff to locate where to add the return and conditional
global calls.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 9cbe712f-85e2-4fe1-a292-eb22c09ea0a5
⛔ Files ignored due to path filters (1)
test-integration/Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (11)
magicblock-api/src/magic_validator.rsmagicblock-core/src/coordination_mode.rsmagicblock-core/src/lib.rsmagicblock-processor/src/scheduler/coordinator.rsmagicblock-processor/src/scheduler/mod.rsmagicblock-processor/src/scheduler/state.rsmagicblock-processor/src/scheduler/tests.rsprograms/magicblock/src/schedule_transactions/process_scheduled_commit_sent.rsprograms/magicblock/src/test_utils/mod.rsprograms/magicblock/src/validator.rstest-kit/src/lib.rs
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@magicblock-api/src/magic_validator.rs`:
- Around line 718-728: The mode switch send currently ignores failures from
self.mode_switch_tx.try_send(target), which can leave the scheduler stuck;
change the code around the TargetMode selection and try_send call (the block
that sets target and calls self.mode_switch_tx.try_send(target)) to explicitly
match the Result, handling TrySendError::Full and TrySendError::Closed by
returning an Err (or panic if you prefer an invariant failure) so startup fails
fast; include a clear error/log message referencing the failed send and the
target mode so the caller of the surrounding function (the startup path in
magic_validator) can abort instead of silently continuing.
In `@magicblock-core/src/coordination_mode.rs`:
- Around line 22-26: Add a test-only reset path that sets the process-global
COORDINATION_MODE back to CoordinationMode::StartingUp so tests can return the
global state to a known value; implement a cfg(test) function (e.g., pub(crate)
fn reset_coordination_mode_for_tests()) that writes
AtomicU8::new(CoordinationMode::StartingUp as u8) or uses
COORDINATION_MODE.store(...) to restore StartingUp, and call this in tests
before mutating state so switch_to_primary_mode() and switch_to_replica_mode()
assertions do not leak between tests.
In `@test-kit/src/lib.rs`:
- Around line 180-183: Replace the silent ignores of TrySendError when
pre-sending target mode with explicit .expect() calls so failures are clearly
reported; locate the two places using
this.mode_switch_tx.try_send(TargetMode::Primary) (the pre-send when
primary_mode is true and the other send around lines 209–210) and change the
pattern from `let _ = ...` to calling `.expect("failed to send target mode to
mode_switch_tx")` (or a similarly descriptive message) on the Result so tests
fail with a clear diagnostic if the channel unexpectedly fails.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7b450a39-75cc-411d-8de1-54c75b5c0ca8
📒 Files selected for processing (5)
magicblock-api/src/magic_validator.rsmagicblock-core/src/coordination_mode.rsmagicblock-processor/src/scheduler/mod.rsmagicblock-processor/src/scheduler/state.rstest-kit/src/lib.rs
Amp-Thread-ID: https://ampcode.com/threads/T-019cd734-802b-77f0-8350-6e7e485d0920 Co-authored-by: Amp <amp@ampcode.com>
Amp-Thread-ID: https://ampcode.com/threads/T-019cd734-802b-77f0-8350-6e7e485d0920 Co-authored-by: Amp <amp@ampcode.com>
…in test-kit Amp-Thread-ID: https://ampcode.com/threads/T-019cd734-802b-77f0-8350-6e7e485d0920 Co-authored-by: Amp <amp@ampcode.com>
Manual Deploy AvailableYou can trigger a manual deploy of this PR branch to testnet: Alternative: Comment
Comment updated automatically when the PR is synchronized. |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
test-kit/src/lib.rs (1)
178-183:⚠️ Potential issue | 🟠 MajorReplica test envs still have no path to complete startup into
Replica.Line 178 only pre-queues
Primary, and Line 209 only exposes a primary switch. As a result,new_replica_mode(..., false)stays inStartingUp, and the deferred replica path has no public way to exit startup intoReplicalater. That leaves the harness exercising different coordination semantics than its API implies.Proposed fix
- // Pre-send the target mode so the scheduler picks it up once running. - if primary_mode { - this.mode_tx - .try_send(SchedulerMode::Primary) - .expect("failed to pre-send target mode to mode_tx"); - } + // Pre-send the target mode for non-deferred startup so the scheduler + // leaves StartingUp in the intended mode once running. + if !defer_startup { + let initial_mode = if primary_mode { + SchedulerMode::Primary + } else { + SchedulerMode::Replica + }; + this.mode_tx + .try_send(initial_mode) + .expect("failed to pre-send target mode to mode_tx"); + }pub fn switch_to_replica_mode(&self) { self.mode_tx .try_send(SchedulerMode::Replica) .expect("failed to send replica target mode to mode_tx"); }Also applies to: 205-213
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test-kit/src/lib.rs` around lines 178 - 183, The test harness pre-queues only SchedulerMode::Primary (this.mode_tx.try_send(SchedulerMode::Primary)) so replica test environments never get a path to leave StartingUp; update the code to also pre-send SchedulerMode::Replica when primary_mode is false and add a public method mirroring the primary switch (e.g., switch_to_replica_mode) that sends SchedulerMode::Replica on mode_tx; ensure new_replica_mode(...) can receive that signal to transition out of StartingUp and mirror the existing primary switch logic (references: mode_tx, SchedulerMode::Primary, SchedulerMode::Replica, new_replica_mode, switch_to_replica_mode).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@test-kit/src/lib.rs`:
- Around line 178-183: The test harness pre-queues only SchedulerMode::Primary
(this.mode_tx.try_send(SchedulerMode::Primary)) so replica test environments
never get a path to leave StartingUp; update the code to also pre-send
SchedulerMode::Replica when primary_mode is false and add a public method
mirroring the primary switch (e.g., switch_to_replica_mode) that sends
SchedulerMode::Replica on mode_tx; ensure new_replica_mode(...) can receive that
signal to transition out of StartingUp and mirror the existing primary switch
logic (references: mode_tx, SchedulerMode::Primary, SchedulerMode::Replica,
new_replica_mode, switch_to_replica_mode).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5150c981-ba2d-44ea-90ea-848c096cc614
📒 Files selected for processing (4)
magicblock-api/src/magic_validator.rsmagicblock-processor/src/scheduler/mod.rsmagicblock-processor/src/scheduler/state.rstest-kit/src/lib.rs
bmuddha
left a comment
There was a problem hiding this comment.
Good job, but a few important things to address before we can merge
bmuddha
left a comment
There was a problem hiding this comment.
Need to allow for bidirectional mode switching
I already responded that we'll add this when we add the code that actually uses this. |
Summary
Replace the
starting_upboolean flag with a properCoordinationModeenum that tracks thevalidator's lifecycle state (StartingUp → Primary or Replica).
Details
CoordinationMode (
magicblock-core)New module that replaces the
starting_upboolean with a proper state machine:The mode uses atomic operations for lock-free reads from program instruction handlers.
Scheduler Updates (
magicblock-processor)Modified
ExecutionCoordinatorto support the StartingUp state:StartingUp(ReplicaMode)variant to track the pre-replay stateswitch_to_replica_mode_globally()andswitch_to_primary_mode_globally()coordinate scheduler + global validator stateTransactionSchedulernow respects validator type during mode transitions:Tests
Added comprehensive tests for StartingUp mode transitions and blocking behavior.
Summary by CodeRabbit
New Features
Refactor
Tests
Bug Fixes
Chores