
Split brain fix #374

Open
Kaustubh1204 wants to merge 3 commits into apache:unstable from Kaustubh1204:split-brain-fix

@Kaustubh1204

This PR implements comprehensive safeguards against the split-brain scenario in kvrocks-controller as described in Issue #329.

Key Fixes:

  1. Reliable Failure Detection: probeNode() now performs quorum verification to prevent false-positive failovers.
  2. Node Fencing: promoteNewMaster() explicitly demotes the old master before promoting a new one.
  3. Conditional Store Writes: UpdateCluster() validates the controller’s leader lease to block Zombie Controllers.
  4. Atomic Failover Flow: Promotion and persistence are now sequenced, with rollback on failure, so that a failover cannot stop at the irreversible split-brain point.

Verification:

  • Specialized unit tests (split_brain_test.go) confirm Zombie Controllers are blocked and quorum checks work.
  • Existing cluster tests updated and passing.

All critical safety invariants are enforced, preventing both control-plane and data-plane split-brain scenarios.

cluster_shard.go (Failover failed because of sequence is 0 apache#366)
- Added quorum verification in probeNode() to prevent false-positive failovers.
- Added explicit node fencing in promoteNewMaster() to demote the old master before promotion.
- Updated UpdateCluster() to enforce leader lease validation (blocks zombie controllers).
- Wrapped promotion and persistence in an atomic flow with rollback/logging.
- Added split_brain_test.go to verify Zombie Controller, quorum, and node fencing scenarios.
@Kaustubh1204

This PR fully resolves Issue #329: Split-Brain Vulnerability in kvrocks-controller.

Key Fixes:

  1. Reliable Failure Detection: probeNode() performs quorum verification to prevent false-positive failovers.
  2. Node Fencing: promoteNewMaster() explicitly demotes the old master before promoting a new one.
  3. Conditional Store Writes: UpdateCluster() enforces leader lease validation to block Zombie Controllers.
  4. Atomic Failover Flow: Promotion and persistence are sequenced to prevent irreversible split-brain points.

Verification:

  • Specialized unit tests (split_brain_test.go) confirm Zombie Controllers are blocked and quorum checks work.
  • Existing cluster tests updated and passing.
  • All critical safety invariants are enforced, preventing both control-plane and data-plane split-brain scenarios.

Closes #329

