fix: data branch create table + DROP DATABASE race condition #23788

Merged
mergify[bot] merged 13 commits into matrixorigin:main from LeftHandCold:fix_clone_drop_race on Mar 5, 2026
Conversation


@LeftHandCold (Contributor) commented Mar 1, 2026

What type of PR is this?

  • API-change
  • BUG ✓
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #23766

What this PR does / why we need it:

Two-layer fix for orphan records in mo_tables caused by concurrent data branch create table (CLONE) and DROP DATABASE, which leads to an ExpectedEOB panic on checkpoint replay.

Fix 1 - Lock Service: Lock mode sync on holder promotion
Lock is a Go value type. When addHolder is called during tryHold,
modifications to l.value (lock mode byte) don't propagate to the
store. Add setMode() method to correctly update the stored lock
entry's mode when a waiter is promoted to holder (newHolder=true).
This ensures Exclusive locks actually block Shared requests.

Files: lock.go, lock_table_local.go, types.go
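The value-type pitfall behind Fix 1 can be reduced to a few lines. The `Lock`, store, and `setMode` below are simplified stand-ins, not the pkg/lockservice types (the real store is a btree and the real value byte packs more flags); only the copy semantics matter here — mutating the copy returned by a lookup never touches the stored entry, so a mode change must be written back explicitly.

```go
package main

import "fmt"

// Simplified stand-ins: the low bit of value carries the lock mode.
const (
	modeShared    byte = 0
	modeExclusive byte = 1
)

type Lock struct {
	value byte
}

func (l Lock) mode() byte { return l.value & 1 }

// setMode returns an updated copy plus whether anything changed, so the
// caller knows it must write the copy back into the store.
func (l Lock) setMode(m byte) (Lock, bool) {
	if l.mode() == m {
		return l, false
	}
	l.value = (l.value &^ 1) | m
	return l, true
}

func main() {
	store := map[string]Lock{"row1": {value: modeShared}}

	// Buggy pattern: Lock is a value type, so mutating the lookup copy
	// leaves the stored entry untouched.
	l := store["row1"]
	l.value = modeExclusive
	fmt.Println(store["row1"].mode() == modeExclusive) // false: store is stale

	// Fixed pattern: setMode + explicit write-back, as in the PR.
	if updated, changed := store["row1"].setMode(modeExclusive); changed {
		store["row1"] = updated
	}
	fmt.Println(store["row1"].mode() == modeExclusive) // true
}
```

This is why an Exclusive waiter promoted to holder could leave a Shared mode byte in storage: the promotion mutated a copy, and later conflict checks consulted the stale stored entry.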

Fix 2 - DDL Layer: Snapshot refresh after exclusive lock (clock.Now)
After acquiring the exclusive lock on mo_database in DropDatabase, call
refreshSnapshotAfterLock, which uses clock.Now() + WaitLogTailAppliedAt
+ UpdateSnapshot to advance the transaction snapshot. This ensures
Relations() sees all tables committed before the lock was acquired.

Uses clock.Now() instead of GetLatestCommitTS() (the v1 approach) because
doWrite's defer LIFO ordering causes unlock to execute before
updateLastCommitTS, making the v1 conditional check always false in the
critical scenario.

Files: ddl.go, ddl_test.go

@qodo-code-review

Review Summary by Qodo

Fix race condition between CLONE and DROP DATABASE via lock mode sync

🐞 Bug fix


Walkthroughs

Description
• Fix lock mode synchronization when waiter promoted to holder
  - Add setMode() method to update stored lock entry mode
  - Ensures Exclusive locks block Shared requests correctly
• Fix snapshot staleness in DROP DATABASE with concurrent CLONE
  - Use clock.Now() instead of GetLatestCommitTS() for snapshot refresh
  - Prevents orphan records in mo_tables and ExpectedEOB panic
• Add comprehensive tests for lock mode transitions
  - Verify Exclusive holder blocks Shared requests
  - Verify Shared holder allows subsequent Shared requests
Diagram
flowchart LR
  A["Concurrent CLONE<br/>and DROP DATABASE"] -->|Lock mode<br/>not synced| B["Stale lock mode<br/>in storage"]
  B -->|Shared requests<br/>incorrectly allowed| C["Orphan records<br/>in mo_tables"]
  C -->|Checkpoint replay| D["ExpectedEOB panic"]
  
  E["Fix 1: setMode()"] -->|Sync mode on<br/>holder promotion| F["Correct lock mode<br/>in storage"]
  G["Fix 2: clock.Now()<br/>snapshot refresh"] -->|Advance snapshot<br/>after exclusive lock| H["Relations() sees<br/>all committed tables"]
  F --> I["Race condition<br/>resolved"]
  H --> I

File Changes

1. pkg/lockservice/lock.go 🐞 Bug fix +23/-0

Add setMode() for lock mode synchronization

• Add setMode() method to update lock mode in stored lock entry
• Handles transitions between Shared and Exclusive modes
• Includes detailed comments explaining value type semantics

2. pkg/lockservice/lock_table_local.go 🐞 Bug fix +6/-0

Apply setMode() on holder promotion in lock acquisition

• Call setMode() after waiter promotion in acquireRowLockLocked()
• Call setMode() after waiter promotion in addRangeLockLocked()
• Write updated lock back to store when mode changes

3. pkg/lockservice/lock_table_local_test.go 🧪 Tests +129/-0

Add tests for lock mode transition correctness

• Add TestExclusiveHolderMustBlockSharedRequests test
 - Verifies Exclusive holder blocks subsequent Shared requests
 - Tests waiter promotion scenario
• Add TestSharedAfterExclusiveRelease test
 - Verifies Shared holder allows subsequent Shared requests
 - Tests mode transition from Exclusive to Shared

4. pkg/lockservice/types.go 📝 Documentation +6/-0

Document Lock value type behavior and setMode() requirement

• Add WARNING comment explaining Lock value type semantics
• Document that modifications to value field don't propagate to storage
• Explain setMode() requirement for mode changes

5. pkg/sql/compile/ddl.go 🐞 Bug fix +31/-20

Replace conditional snapshot refresh with clock.Now() approach

• Replace GetLatestCommitTS() conditional check with unconditional clock.Now() approach
• Extract snapshot refresh logic into refreshSnapshotAfterLock() function
• Use clock.Now() + WaitLogTailAppliedAt() to advance snapshot
• Add detailed comments explaining defer LIFO ordering issue

6. pkg/sql/compile/ddl_test.go 🧪 Tests +35/-29

Update tests for unconditional snapshot refresh logic

• Simplify test to reflect unconditional clock.Now() approach
• Remove snapshotTS and latestCommitTS comparison logic
• Rename test case to snapshot_always_refreshed_via_clock_now
• Add test for error propagation from refreshSnapshotAfterLock
• Remove test case for conditional snapshot refresh (no longer applicable)



qodo-code-review bot commented Mar 1, 2026

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. Snapshot refresh may stall 🐞 Bug ⛯ Reliability
Description
refreshSnapshotAfterLock waits via TxnClient.WaitLogTailAppliedAt(now) and then calls
TxnOperator.UpdateSnapshot(ts), but both paths ultimately depend on TimestampWaiter.GetTimestamp();
since GetTimestamp returns latest.Next(), passing that result into UpdateSnapshot can require a
subsequent logtail advance and may add latency or stall DDL under low commit activity. This is
triggered for every Exclusive mo_database lock (e.g. CreateDatabase and DropDatabase) because the
call is unconditional for lockMode==Exclusive.
Code

pkg/sql/compile/ddl.go[R4018-4024]

+var refreshSnapshotAfterLock = func(c *Compile) error {
+	now, _ := moruntime.ServiceRuntime(c.proc.GetService()).Clock().Now()
+	ts, err := c.proc.Base.TxnClient.WaitLogTailAppliedAt(c.proc.Ctx, now)
+	if err != nil {
+		return err
+	}
+	return c.proc.GetTxnOperator().UpdateSnapshot(c.proc.Ctx, ts)
+}
Evidence
refreshSnapshotAfterLock first calls TxnClient.WaitLogTailAppliedAt(now) and then
UpdateSnapshot(ts). WaitLogTailAppliedAt delegates to timestampWaiter.GetTimestamp, which returns
latest.Next(); UpdateSnapshot again calls timestampWaiter.GetTimestamp, so if the first call
returned latest.Next() (greater than latest), the second call can block until the waiter’s latest
timestamp advances to >= that value (i.e., an additional logtail/commit tick). Also, lockMoDatabase
calls refreshSnapshotAfterLock unconditionally for every Exclusive lock, expanding the impact to
DDLs beyond DropDatabase.

pkg/sql/compile/ddl.go[3983-4010]
pkg/sql/compile/ddl.go[4014-4025]
pkg/txn/client/client.go[515-522]
pkg/txn/client/timestamp_waiter.go[61-79]
pkg/txn/client/operator.go[714-740]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`refreshSnapshotAfterLock` currently calls `TxnClient.WaitLogTailAppliedAt(now)` and then `TxnOperator.UpdateSnapshot(ts)`. Both use `TimestampWaiter.GetTimestamp`, and `GetTimestamp` returns `latest.Next()`, so feeding its result into `UpdateSnapshot` can require an additional logtail/commit advance and may add latency or stall DDL under low activity.
### Issue Context
This helper is invoked for every `lockMoDatabase(..., Exclusive)` call, so the risk applies to multiple DDLs (not just `DropDatabase`).
### Fix Focus Areas
- pkg/sql/compile/ddl.go[3983-4025]
- pkg/txn/client/client.go[515-522]
- pkg/txn/client/timestamp_waiter.go[61-79]
- pkg/txn/client/operator.go[714-740]
### Suggested change
- Replace the helper body with a single snapshot update call, e.g.:
- compute `now` and then call `txnOp.UpdateSnapshot(ctx, now)` (no explicit `WaitLogTailAppliedAt` call).
- Reintroduce guards similar to the old code:
- only do this for pessimistic + RC isolation (or at minimum RC), and
- only when `txnOp.Txn().SnapshotTS` is behind the chosen timestamp to reduce unnecessary waits.
- If only `DropDatabase` needs this behavior, move the refresh call out of the shared `lockMoDatabase` helper and into `DropDatabase` (or add a parameter/flag to control it).



2. Range-lock mode inconsistency 🐞 Bug ✓ Correctness
Description
When a waiter is promoted to holder, the new code writes back an updated Lock mode for only the
single conflicting key; but a range lock is represented by two store entries (range-start and
range-end) with independent Lock.value bytes, so the paired entry can retain stale mode and still
allow/deny Shared requests incorrectly (especially at boundaries where a row lock can hit the
range-start key).
Code

pkg/lockservice/lock_table_local.go[R655-657]

+					if updated, changed := conflictWith.setMode(c.opts.Mode); changed {
+						l.mu.store.Add(conflictKey, updated)
+					}
Evidence
Range locks are created as two distinct Lock values (start/end) derived from one base lock; each
copy has its own value byte. The promotion fix updates mode via store.Add(conflictKey, updated)
but only for the one key being operated on, so the other half of the range pair can keep stale mode
bits. This matters because row-lock acquisition may operate on a range-start entry when
bytes.Equal(key,row) (i.e., when the row equals the range boundary).

pkg/lockservice/lock_table_local.go[401-427]
pkg/lockservice/lock_table_local.go[645-663]
pkg/lockservice/lock.go[47-53]
pkg/lockservice/lock.go[243-251]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Range locks are represented by two entries (range-start and range-end) with independent `Lock.value` bytes. The new promotion write-back updates only one key’s `value` via `store.Add`, so the paired entry can keep stale mode bits and still be consulted by later lock attempts (notably at boundary keys).
### Issue Context
- `newRangeLock` returns two `Lock` values; the range-start/range-end bits are set via value receivers, so each entry’s `value` diverges.
- `acquireRowLockLocked` can hit a range-start entry when the row equals the start boundary (`bytes.Equal(key,row)`), so stale mode on the start entry can still matter.
### Fix Focus Areas
- pkg/lockservice/lock_table_local.go[401-427]
- pkg/lockservice/lock_table_local.go[645-663]
- pkg/lockservice/lock.go[47-53]
- pkg/lockservice/lock.go[243-251]
### Suggested change
- When `setMode(...)` reports `changed==true` and the lock is part of a range lock, locate the paired entry (start↔end) and apply the same mode update to it as well.
- One approach under `l.mu`:
- If `conflictWith.isLockRangeEnd()`, walk backwards with `store.Prev(...)` until you find a `isLockRangeStart()` whose `holders` pointer (or `waiters` backing queue identity) matches.
- If `conflictWith.isLockRangeStart()`, walk forwards with `store.Seek/Range` until you find the matching `isLockRangeEnd()`.
- Add a targeted test that constructs a range lock, promotes an exclusive waiter to holder, and then attempts a shared row lock exactly at the range start key to ensure it blocks.




Remediation recommended

3. DDL test hides real logic 🐞 Bug ⛯ Reliability
Description
The updated DropDatabase test stubs refreshSnapshotAfterLock itself, so it does not validate the
production logic (clock.Now()+waiter behavior) and can miss regressions in the actual implementation
semantics.
Code

pkg/sql/compile/ddl_test.go[R904-913]

+		// Stub refreshSnapshotAfterLock to simulate the clock.Now() path:
+		// it calls WaitLogTailAppliedAt and UpdateSnapshot unconditionally.
+		refreshStub := gostub.Stub(&refreshSnapshotAfterLock, func(c *Compile) error {
+			ts, err := c.proc.Base.TxnClient.WaitLogTailAppliedAt(c.proc.Ctx, appliedTS)
+			if err != nil {
+				return err
+			}
+			return c.proc.GetTxnOperator().UpdateSnapshot(c.proc.Ctx, ts)
+		})
+		defer refreshStub.Reset()
Evidence
By replacing refreshSnapshotAfterLock with a stub, the test only asserts that DropDatabase invokes
the helper and propagates errors; it does not exercise the real code path that reads the runtime
clock and performs the actual wait/update behavior.

pkg/sql/compile/ddl_test.go[904-913]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Current tests stub `refreshSnapshotAfterLock`, which prevents them from validating the real implementation and the clock.Now() behavior.
### Issue Context
The helper’s correctness depends on clock.Now() and timestamp-waiter semantics; stubbing the helper bypasses those.
### Fix Focus Areas
- pkg/sql/compile/ddl_test.go[822-975]
- pkg/sql/compile/ddl.go[4014-4025]
### Suggested change
- Add a focused unit test for `refreshSnapshotAfterLock` itself:
- Provide a controllable runtime clock (or a mock ServiceRuntime/Clock if available in test utilities).
- Use a real or mocked `TimestampWaiter`/`TxnClient` behavior that reflects production semantics.
- Assert that `UpdateSnapshot` is called with the expected input and that the function does not require additional external stubbing.
- Keep the existing integration-style DropDatabase test, but avoid stubbing the helper in the main “happy path” subtest if feasible.




…e only

refreshSnapshotAfterLock was inside lockMoDatabase and triggered for all
Exclusive lock acquisitions. CreateDatabase also calls lockMoDatabase with
Exclusive mode, so during restore cluster operations the snapshot advance
caused CreateDatabase to see already-restored tables, producing:
  Duplicate entry '(0,mo_catalog,mo_branch_metadata)' for key '__mo_cpkey_col'

Move refreshSnapshotAfterLock to DropDatabase directly, since only
DropDatabase needs the snapshot advance (to enumerate tables via Relations()).
lockMoDatabase is now a simple wrapper around doLockMoDatabase.
LeftHandCold and others added 8 commits March 4, 2026 17:28
…eordering doWrite defers

Two fixes for the data-branch-create-table + DROP DATABASE race that
leaves orphan rows in mo_tables:

1. lockservice: sync mode on both ends of range lock (lock_table_local.go)

   When a waiter is promoted to holder, setMode updates Lock.value on
   the conflicting key only. But a range lock is stored as two
   independent btree entries (range-start and range-end) with separate
   Lock.value bytes. The paired entry retains stale mode, causing
   isLockModeAllowed to incorrectly allow or deny Shared requests.

   Add setModePairedRangeLock helper that scans past interleaved row
   locks to find and update the paired entry. Called from both
   acquireRowLockLocked and addRangeLockLocked when newHolder=true.
   No-op for row locks.
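The pairing logic above can be sketched with a sorted slice standing in for the btree store. The names (`entry`, `setModeBoth`, the flag constants) are hypothetical, not the lockservice API; the point is that a range lock lives as two independent entries, so an upgrade must scan past any interleaved row-lock keys to reach and update the paired half.

```go
package main

import (
	"fmt"
	"sort"
)

// Hypothetical flag layout: low bit is the Exclusive mode, the other two
// bits mark the two halves of a range lock.
const (
	flagExclusive  byte = 1 << 0
	flagRangeStart byte = 1 << 1
	flagRangeEnd   byte = 1 << 2
)

type entry struct {
	key   string
	value byte
}

func (e entry) isRangeStart() bool { return e.value&flagRangeStart != 0 }
func (e entry) isRangeEnd() bool   { return e.value&flagRangeEnd != 0 }

// setModeBoth upgrades the entry at key to Exclusive and, if it is one
// half of a range lock, scans past interleaved row locks to upgrade the
// paired half too. Updating only one half would leave stale mode bits
// that later conflict checks consult. No-op for plain row locks.
func setModeBoth(store []entry, key string) {
	i := sort.Search(len(store), func(i int) bool { return store[i].key >= key })
	store[i].value |= flagExclusive
	switch {
	case store[i].isRangeStart():
		for j := i + 1; j < len(store); j++ { // forward to the paired end
			if store[j].isRangeEnd() {
				store[j].value |= flagExclusive
				return
			}
		}
	case store[i].isRangeEnd():
		for j := i - 1; j >= 0; j-- { // backward to the paired start
			if store[j].isRangeStart() {
				store[j].value |= flagExclusive
				return
			}
		}
	}
}

func main() {
	// Range [a, d] stored as two entries, with a row lock on b between them.
	store := []entry{
		{"a", flagRangeStart},
		{"b", 0},
		{"d", flagRangeEnd},
	}
	setModeBoth(store, "a")
	for _, e := range store {
		fmt.Println(e.key, e.value&flagExclusive != 0)
	}
}
```

Upgrading at "a" also upgrades "d" while skipping the row lock at "b"; the same holds when starting from the range-end key.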

2. txn: reorder defers in doWrite to update latestCommitTS before unlock (operator.go)

   Original code registered two independent defers in doWrite:
     defer tc.unlock(ctx)                    // LIFO: executes first
     defer func() { closeLocked(); mu.Unlock() }()  // executes second

   This meant unlock released the lock-service lock before closeLocked
   triggered updateLastCommitTS. In the window between unlock and
   closeLocked, DROP DATABASE could acquire the Exclusive lock and call
   GetLatestCommitTS, getting a stale value. The UpdateSnapshot fix
   (layer 2) would then not trigger because SnapshotTS >= stale
   latestCommitTS.

   Merge unlock into the first defer so execution order becomes:
     closeLocked() -> mu.Unlock() -> unlock(ctx)
   This ensures latestCommitTS is updated before the lock is released.
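The defer-ordering change can be demonstrated in isolation. The helpers below are stand-ins for the operator.go code, with strings logged in place of the real closeLocked/unlock calls: Go runs deferred calls in LIFO order, so with two independent defers the one registered last fires first, while folding both actions into one defer makes the internal order explicit.

```go
package main

import "fmt"

// buggyOrder mimics the original doWrite: two independent defers, so the
// defer registered last ("unlock") runs first, releasing the lock before
// latestCommitTS is updated.
func buggyOrder() []string {
	var log []string
	run := func() {
		defer func() { log = append(log, "closeLocked+updateLastCommitTS") }()
		defer func() { log = append(log, "unlock") }() // registered last, runs first
	}
	run()
	return log
}

// fixedOrder mimics the fix: unlock is merged into the same defer, after
// closeLocked, so latestCommitTS is updated before the lock is released.
func fixedOrder() []string {
	var log []string
	run := func() {
		defer func() {
			log = append(log, "closeLocked+updateLastCommitTS")
			log = append(log, "unlock")
		}()
	}
	run()
	return log
}

func main() {
	fmt.Println(buggyOrder()) // [unlock closeLocked+updateLastCommitTS]
	fmt.Println(fixedOrder()) // [closeLocked+updateLastCommitTS unlock]
}
```

In the buggy ordering, the window between "unlock" and the commit-TS update is exactly where a concurrent DROP DATABASE could take the Exclusive lock and read a stale GetLatestCommitTS.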

mergify bot commented Mar 5, 2026

Merge Queue Status

Rule: main


  • Entered queue: 2026-03-05 07:41 UTC
  • Checks passed · in-place
  • Merged: 2026-03-05 07:42 UTC · at 1d8aaaf713fc514fd4929fceec3ac99069134f38

This pull request spent 10 seconds in the queue, with no time running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI / Compatibility Test With Target on Linux/x64(LAUNCH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI / Coverage
    • check-neutral = Matrixone Utils CI / Coverage
    • check-skipped = Matrixone Utils CI / Coverage


Labels

kind/bug (Something isn't working) · size/L (Denotes a PR that changes [500,999] lines)


Development

Successfully merging this pull request may close these issues.

5 participants