fix(data-branch): improve diff correctness, memory control, and output summary support #23789

Open
gouhongshen wants to merge 20 commits into matrixorigin:main from gouhongshen:fix/diff-insert-delete

gouhongshen (Contributor) commented Mar 2, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

Fixes #23751

What this PR does / why we need it:

  1. Add branch-level memory throttling to control peak memory usage during data branch operations.
  2. Reduce redundant allocations and copies in diff/merge paths to improve performance and stability.
  3. Support OUTPUT SUMMARY syntax for data branch diff.
  4. Refactor data branch diff output generation to improve maintainability and follow-up operations.

Behavior changes

  1. data branch diff ... output summary is now supported.
  2. Diff output SQL now models updates as DELETE + INSERT instead of REPLACE INTO for clearer semantics.
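To illustrate the second behavior change, here is a minimal sketch of rendering one updated row as a DELETE followed by an INSERT rather than a single REPLACE INTO. The helper `updateAsDeleteInsert`, the table `t1`, and its columns are hypothetical and not taken from this PR; values are assumed to be pre-formatted SQL literals, and real code would need proper escaping.

```go
package main

import (
	"fmt"
	"strings"
)

// updateAsDeleteInsert is a hypothetical helper (not from this PR) that renders
// one updated row as DELETE + INSERT instead of a single REPLACE INTO.
// pkVal and newVals are assumed to be already-formatted SQL literals.
func updateAsDeleteInsert(table, pkCol, pkVal string, cols, newVals []string) []string {
	del := fmt.Sprintf("DELETE FROM %s WHERE %s = %s;", table, pkCol, pkVal)
	ins := fmt.Sprintf("INSERT INTO %s (%s) VALUES (%s);",
		table, strings.Join(cols, ", "), strings.Join(newVals, ", "))
	// The DELETE comes first so the INSERT cannot collide with the old row.
	return []string{del, ins}
}

func main() {
	stmts := updateAsDeleteInsert("t1", "id", "1",
		[]string{"id", "name"}, []string{"1", "'new'"})
	for _, s := range stmts {
		fmt.Println(s)
	}
}
```

Splitting the update this way makes the removed and added row versions explicit in the output, which is the "clearer semantics" the description refers to.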

Tests

  1. Added/updated parser tests for OUTPUT SUMMARY.
  2. Added DML tests covering summary metrics, update split behavior, no-PK duplicates, null handling, and complex types.
  3. Updated BVT cases under test/distributed/cases/git4data/branch/{diff,merge}.

@qodo-code-review

Review Summary by Qodo

Refactor data branch operations with memory management, hash-based diff, and output formatting improvements

✨ Enhancement 🐞 Bug fix 🧪 Tests

Walkthroughs

Description
• Refactored data branch operations with comprehensive memory management improvements, including
  branchHashmapAllocator and branchHashmapDeallocator for throttled memory allocation
• Redesigned in-memory hashmap storage from linked-list buckets to hash-indexed memStore with LRU
  eviction and tombstone support, plus spillStore for efficient disk-based overflow
• Implemented hash-based diff algorithm for comparing data branches with LCA (Lowest Common
  Ancestor) support and conflict detection/resolution
• Added new data branch output operations module supporting multiple output modes: summary
  statistics, row count, limited rows, and file-based exports (CSV/SQL)
• Fixed commit timestamp indexing in in-memory committed insert filtering and added
  GetObjectCreateTS method for object creation timestamp retrieval
• Added support for OUTPUT SUMMARY and OUTPUT COUNT syntax in diff operations with comprehensive
  parsing and validation
• Consolidated data branch type definitions and constants into dedicated file for improved code
  organization
• Added comprehensive test coverage for diff output modes, summary validation, update splitting, and
  complex type handling
• Improved block data read function with proper error handling and DataSource parameter propagation
• Added debug logging for commit timestamp placeholder scenarios to diagnose TN nonappendable block
  issues
Diagram
flowchart LR
  A["Data Branch Operations"] --> B["Memory Management"]
  A --> C["Hash-based Diff"]
  A --> D["Output Formatting"]
  B --> B1["branchHashmapAllocator"]
  B --> B2["memStore with LRU"]
  B --> B3["spillStore for Overflow"]
  C --> C1["LCA Resolution"]
  C --> C2["Conflict Detection"]
  D --> D1["Summary Statistics"]
  D --> D2["CSV/SQL Export"]
  D --> D3["Batch Processing"]

File Changes

1. pkg/frontend/data_branch.go ✨ Enhancement +308/-3442

Refactor data branch operations with memory management improvements

• Removed unused imports and simplified import statements by removing encoding/hex,
 encoding/json, io, os, path, filepath, slices, time, and several internal packages
• Added new imports for memory management (malloc, rscthrottler) and TAE common utilities
• Implemented branchHashmapAllocator and branchHashmapDeallocator types for memory-aware hashmap
 allocation with throttling
• Removed large utility functions (runSql, tryFlushDeletesOrReplace, sqlValuesAppender,
 mergeDiffs, satisfyDiffOutputOpt, etc.) that were moved to separate files
• Refactored decideLCABranchTSFromBranchDAG to accept tableStuff parameter instead of individual
 relations
• Simplified getTablesCreationCommitTS function with improved snapshot resolution logic and
 removed dependency on engine.Engine and client.TxnOperator
• Added outputSQL and expandUpdate flags to compositeOption struct initialization
• Updated getTableStuff to initialize hashmapAllocator for memory management

pkg/frontend/data_branch.go


2. pkg/frontend/data_branch_types.go Refactoring +148/-0

Create consolidated data branch types and constants file

• New file created to consolidate data branch type definitions and constants
• Defined branchHashmapAllocator and branchHashmapDeallocator types for memory-aware allocation
 with throttling support
• Added constants for diff operations (diffInsert, diffDelete, diffUpdate) and LCA types
• Defined tableStuff, batchWithKind, retBatchList, and compositeOption types
• Added memory limit configuration constant dataBranchHashmapLimitRate set to 0.8
• Introduced new batch size and row count constants for SQL operations
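The idea of a memory-aware allocator capped at a fraction of a budget can be sketched as follows. This is only an illustration under stated assumptions: `throttledAllocator`, `newThrottledAllocator`, `Allocate`, and `Release` are hypothetical names, and the sketch does not use or reflect the actual `rscthrottler` or `malloc` APIs. Only the 0.8 rate comes from the summary above.

```go
package main

import (
	"fmt"
	"sync"
)

// limitRate mirrors the 0.8 value the summary reports for
// dataBranchHashmapLimitRate; everything else here is hypothetical.
const limitRate = 0.8

// throttledAllocator hands out byte slices until a byte budget is exhausted.
type throttledAllocator struct {
	mu    sync.Mutex
	limit int64 // maximum bytes that may be outstanding at once
	used  int64 // bytes currently outstanding
}

func newThrottledAllocator(totalBytes int64) *throttledAllocator {
	return &throttledAllocator{limit: int64(float64(totalBytes) * limitRate)}
}

// Allocate returns an n-byte buffer, or ok=false if the budget would be exceeded.
func (a *throttledAllocator) Allocate(n int) (buf []byte, ok bool) {
	a.mu.Lock()
	defer a.mu.Unlock()
	if a.used+int64(n) > a.limit {
		return nil, false // caller is expected to spill or back off
	}
	a.used += int64(n)
	return make([]byte, n), true
}

// Release returns the buffer's bytes to the budget.
func (a *throttledAllocator) Release(buf []byte) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.used -= int64(len(buf))
}

func main() {
	a := newThrottledAllocator(1000) // budget of 1000 bytes, 800 usable
	buf, ok := a.Allocate(800)
	fmt.Println("first allocation ok:", ok)
	_, ok = a.Allocate(1)
	fmt.Println("over-budget allocation ok:", ok)
	a.Release(buf)
	_, ok = a.Allocate(1)
	fmt.Println("allocation after release ok:", ok)
}
```

A rejected allocation is a signal rather than an error in this sketch; a caller could respond by spilling to disk, which matches the overflow role the summary assigns to spillStore.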

pkg/frontend/data_branch_types.go


3. pkg/sql/parsers/tree/data_branch_test.go 🧪 Tests +1/-1

Add Summary flag to diff output test

• Updated test case to add Summary: true flag to DiffOutputOpt struct initialization

pkg/sql/parsers/tree/data_branch_test.go


4. pkg/frontend/databranchutils/branch_hashmap.go ✨ Enhancement +1698/-370

Refactor hashmap storage and add spill statistics tracking

• Refactored in-memory storage from linked-list buckets to a hash-indexed memStore with LRU
 eviction and tombstone support
• Introduced spillStore with bucketed segments for efficient disk-based overflow handling with
 statistics tracking
• Added new API methods: GetByEncodedKey, PopByVectorsStream, PopByEncodedKeyValue,
 PopByEncodedFullValueExact, ShardCount, and shard-level cursor methods
• Implemented comprehensive spill statistics collection and summary reporting for performance
 monitoring
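As a rough illustration of a hash-indexed store with LRU eviction and tombstones, consider the simplified sketch below. It is not the PR's memStore: the type `lruStore`, its methods, and the `onEvict` hook (standing in for a spill path) are all hypothetical, and it uses Go's built-in map and `container/list` rather than a custom hash index.

```go
package main

import (
	"container/list"
	"fmt"
)

type entry struct {
	key       string
	value     []byte
	tombstone bool // true once the key has been logically deleted
}

// lruStore is a hypothetical, simplified in-memory store with LRU eviction.
// Evicted live entries are handed to onEvict, e.g. to be spilled to disk.
type lruStore struct {
	capacity int
	order    *list.List               // front = most recently used
	index    map[string]*list.Element // key -> element in order
	onEvict  func(key string, value []byte)
}

func newLRUStore(capacity int, onEvict func(string, []byte)) *lruStore {
	return &lruStore{capacity: capacity, order: list.New(),
		index: make(map[string]*list.Element), onEvict: onEvict}
}

func (s *lruStore) Put(key string, value []byte) {
	if el, ok := s.index[key]; ok {
		e := el.Value.(*entry)
		e.value, e.tombstone = value, false // re-inserting revives a tombstone
		s.order.MoveToFront(el)
		return
	}
	s.index[key] = s.order.PushFront(&entry{key: key, value: value})
	if s.order.Len() > s.capacity {
		e := s.order.Remove(s.order.Back()).(*entry)
		delete(s.index, e.key)
		// Simplification: evicted tombstones are dropped, not persisted.
		if !e.tombstone && s.onEvict != nil {
			s.onEvict(e.key, e.value)
		}
	}
}

func (s *lruStore) Get(key string) ([]byte, bool) {
	el, ok := s.index[key]
	if !ok || el.Value.(*entry).tombstone {
		return nil, false
	}
	s.order.MoveToFront(el)
	return el.Value.(*entry).value, true
}

// Delete marks the key as a tombstone instead of removing it immediately.
func (s *lruStore) Delete(key string) {
	if el, ok := s.index[key]; ok {
		e := el.Value.(*entry)
		e.value, e.tombstone = nil, true
	}
}

func main() {
	s := newLRUStore(2, func(k string, _ []byte) { fmt.Println("evicted:", k) })
	s.Put("a", []byte("1"))
	s.Put("b", []byte("2"))
	s.Put("c", []byte("3")) // exceeds capacity, least recently used key is evicted
	s.Delete("b")
	_, ok := s.Get("b")
	fmt.Println("b visible after delete:", ok)
}
```

A design that also spills to disk would likely need to persist tombstones so that a deleted key cannot resurface from a spilled copy; the sketch omits that step.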

pkg/frontend/databranchutils/branch_hashmap.go


5. pkg/tests/dml/dml_test.go 🧪 Tests +339/-29

Add comprehensive diff output and summary validation tests

• Added four new test functions for update splitting and complex type handling:
 runUpdateSplitDiffAsFile, runCompositeUpdateSplitDiffAsFile, runNoPKDuplicateDiffAsFile,
 runComplexTypeDiffAsFile
• Added runDiffOutputSummaryComplex test to validate diff summary metrics for divergent branch
 scenarios
• Added helper functions fetchDiffSummaryMetrics, fetchDiffCount, assertSummaryMetrics,
 assertSummaryMatchesCount for summary validation
• Changed expected diff output from REPLACE INTO to INSERT INTO with separate DELETE
 statements

pkg/tests/dml/dml_test.go


6. pkg/vm/engine/disttae/local_disttae_datasource.go 🐞 Bug fix +32/-2

Add object timestamp retrieval and fix commit ts indexing

• Added GetObjectCreateTS method to retrieve object creation timestamp from persistent state
• Fixed in-memory committed insert filtering to correctly handle SEQNUM_COMMITTS sequence number
 with proper vector indexing
• Added debug logging for missing commit timestamp columns in in-memory batches

pkg/vm/engine/disttae/local_disttae_datasource.go


7. pkg/frontend/stmt_kind.go Miscellaneous +3/-0

Add commented transaction handling note

• Added commented-out code block for potential future restriction on CTAS execution in explicit
 transactions

pkg/frontend/stmt_kind.go


8. pkg/frontend/data_branch_output.go ✨ Enhancement +1746/-0

Data branch output operations and result formatting

• New file implementing data branch output operations including diff result formatting, CSV/SQL
 export, and merge operations
• Supports multiple output modes: summary statistics, row count, limited rows, and file-based
 exports (CSV/SQL)
• Implements memory-efficient batch processing with buffer pooling and streaming output
• Provides SQL value formatting and file system abstraction for local and cloud storage paths
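The combination of buffer pooling and streaming output mentioned above can be sketched with the standard library's `sync.Pool` and `io.Writer`. This is an assumption-laden illustration, not the PR's code: `writeRows`, `bufPool`, and `chunkRecorder` are hypothetical names, and rows are plain strings rather than real batches.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// bufPool reuses buffers across calls so each export does not allocate anew.
var bufPool = sync.Pool{New: func() any { return new(bytes.Buffer) }}

// writeRows (hypothetical) streams rows to w, flushing every batchSize rows
// so that at most one batch is held in memory at a time.
func writeRows(w io.Writer, rows []string, batchSize int) error {
	if batchSize <= 0 {
		batchSize = 1
	}
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() { buf.Reset(); bufPool.Put(buf) }()
	for i, r := range rows {
		buf.WriteString(r)
		buf.WriteByte('\n')
		if (i+1)%batchSize == 0 {
			if _, err := w.Write(buf.Bytes()); err != nil {
				return err
			}
			buf.Reset()
		}
	}
	if buf.Len() > 0 { // flush the final partial batch
		_, err := w.Write(buf.Bytes())
		return err
	}
	return nil
}

// chunkRecorder is a small io.Writer that records each Write call,
// used here only to make the batching visible.
type chunkRecorder struct{ chunks []string }

func (c *chunkRecorder) Write(p []byte) (int, error) {
	c.chunks = append(c.chunks, string(p)) // copy, since the buffer is reused
	return len(p), nil
}

func main() {
	rec := &chunkRecorder{}
	if err := writeRows(rec, []string{"a", "b", "c"}, 2); err != nil {
		fmt.Println("write failed:", err)
		return
	}
	fmt.Println("number of flushed chunks:", len(rec.chunks))
}
```

Because the writer is an interface, the same function could target a local file or a remote object store writer, which is in the spirit of the file system abstraction the summary describes.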

pkg/frontend/data_branch_output.go


9. pkg/frontend/data_branch_hashdiff.go ✨ Enhancement +1528/-0

Hash-based data branch diff algorithm implementation

• New file implementing hash-based diff algorithm for comparing data branches
• Handles LCA (Lowest Common Ancestor) scenarios with conflict detection and resolution
• Implements parallel hashmap construction and row-by-row comparison logic
• Supports update expansion and conflict options (ACCEPT, SKIP, FAIL)
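The core of a hash-based diff can be sketched as two lookups over keyed rows. The version below is a single-threaded simplification using Go's built-in maps; it does not model LCA handling, conflict options, or parallel hashmap construction, and `diffRows` and `change` are hypothetical names rather than identifiers from this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// change records one difference between base and target.
type change struct {
	kind string // "INSERT", "DELETE", or "UPDATE"
	key  string
}

// diffRows (hypothetical) compares two key -> row-value maps and reports
// what must happen to base to make it equal to target.
func diffRows(base, target map[string]string) []change {
	var out []change
	for k, tv := range target {
		bv, ok := base[k]
		switch {
		case !ok:
			out = append(out, change{kind: "INSERT", key: k})
		case bv != tv:
			out = append(out, change{kind: "UPDATE", key: k})
		}
	}
	for k := range base {
		if _, ok := target[k]; !ok {
			out = append(out, change{kind: "DELETE", key: k})
		}
	}
	// Map iteration order is random, so sort by key for stable output.
	sort.Slice(out, func(i, j int) bool { return out[i].key < out[j].key })
	return out
}

func main() {
	base := map[string]string{"1": "a", "2": "b", "3": "c"}
	target := map[string]string{"1": "a", "2": "B", "4": "d"}
	for _, c := range diffRows(base, target) {
		fmt.Println(c.kind, c.key)
	}
}
```

With update expansion enabled, each UPDATE entry in such a result would presumably be emitted as a DELETE of the old row plus an INSERT of the new one.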

pkg/frontend/data_branch_hashdiff.go


10. pkg/frontend/databranchutils/branch_hashmap_test.go 🧪 Tests +173/-17

Branch hashmap test updates and new test cases

• Updated ForEach callback signature from rows [][]byte to single row []byte parameter
• Added new test cases for GetByEncodedKey, PopByEncodedKeyValue, and
 PopByEncodedFullValueExact methods
• Updated existing tests to match new callback signature and expectations

pkg/frontend/databranchutils/branch_hashmap_test.go


11. pkg/vm/engine/tae/blockio/read.go 🐞 Bug fix +7/-1

Block data read function signature and error handling

• Added ds (DataSource) parameter to readBlockData function signature
• Added error handling for LoadColumns operation to properly return errors
• Updated function calls to pass the new ds parameter through the call chain

pkg/vm/engine/tae/blockio/read.go


12. pkg/objectio/funcs.go Error handling +13/-0

Debug logging for commit timestamp placeholder handling

• Added debug logging for commit timestamp placeholder scenarios
• Logs file name, block number, sequence numbers, metadata column count, data type, and row count
• Helps diagnose issues with old version TN nonappendable blocks

pkg/objectio/funcs.go


13. pkg/sql/parsers/dialect/mysql/mysql_sql_test.go 🧪 Tests +20/-0

Data branch diff output modes parsing tests

• Added new test function TestDataBranchDiffOutputModes to validate output mode parsing
• Tests both output summary and output count syntax variants
• Verifies correct parsing of OutputOpt flags in DataBranchDiff statements

pkg/sql/parsers/dialect/mysql/mysql_sql_test.go


14. bench_delete_perf_20260122_190508.md Additional files +15/-0

...

bench_delete_perf_20260122_190508.md


15. bench_insert_perf_20260122_191042.md Additional files +14/-0

...

bench_insert_perf_20260122_191042.md


16. bench_insert_perf_20260123_103322.md Additional files +14/-0

...

bench_insert_perf_20260123_103322.md


17. pkg/common/rscthrottler/resource_throttler.go Additional files +16/-0

...

pkg/common/rscthrottler/resource_throttler.go


18. pkg/frontend/data_branch_helpers.go Additional files +534/-0

...

pkg/frontend/data_branch_helpers.go


19. pkg/frontend/databranchutils/BRANCH_HASHMAP_REDESIGN.md Additional files +316/-0

...

pkg/frontend/databranchutils/BRANCH_HASHMAP_REDESIGN.md


20. pkg/sql/parsers/dialect/mysql/keywords.go Additional files +1/-0

...

pkg/sql/parsers/dialect/mysql/keywords.go


21. pkg/sql/parsers/dialect/mysql/mysql_sql.go Additional files +8838/-8820

...

pkg/sql/parsers/dialect/mysql/mysql_sql.go


22. pkg/sql/parsers/dialect/mysql/mysql_sql.y Additional files +8/-1

...

pkg/sql/parsers/dialect/mysql/mysql_sql.y


23. pkg/sql/parsers/tree/data_branch.go Additional files +1/-0

...

pkg/sql/parsers/tree/data_branch.go


24. pkg/vm/engine/disttae/logtailreplay/partition_state.go Additional files +13/-0

...

pkg/vm/engine/disttae/logtailreplay/partition_state.go


25. pkg/vm/engine/disttae/txn_table_sharding_handle.go Additional files +2/-0

...

pkg/vm/engine/disttae/txn_table_sharding_handle.go


26. pkg/vm/engine/readutil/reader.go Additional files +3/-0

...

pkg/vm/engine/readutil/reader.go


27. test/distributed/cases/git4data/branch/diff/diff_1.result Additional files +3/-3

...

test/distributed/cases/git4data/branch/diff/diff_1.result


28. test/distributed/cases/git4data/branch/diff/diff_2.result Additional files +10/-10

...

test/distributed/cases/git4data/branch/diff/diff_2.result


29. test/distributed/cases/git4data/branch/diff/diff_3.result Additional files +0/-0

...

test/distributed/cases/git4data/branch/diff/diff_3.result


30. test/distributed/cases/git4data/branch/diff/diff_5.result Additional files +12/-12

...

test/distributed/cases/git4data/branch/diff/diff_5.result


31. test/distributed/cases/git4data/branch/diff/diff_7.result Additional files +54/-0

...

test/distributed/cases/git4data/branch/diff/diff_7.result


32. test/distributed/cases/git4data/branch/diff/diff_7.sql Additional files +61/-0

...

test/distributed/cases/git4data/branch/diff/diff_7.sql


33. test/distributed/cases/git4data/branch/diff/diff_8.result Additional files +173/-0

...

test/distributed/cases/git4data/branch/diff/diff_8.result


34. test/distributed/cases/git4data/branch/diff/diff_8.sql Additional files +136/-0

...

test/distributed/cases/git4data/branch/diff/diff_8.sql


35. test/distributed/cases/git4data/branch/merge/merge_2.result Additional files +3/-3

...

test/distributed/cases/git4data/branch/merge/merge_2.result


36. test/distributed/cases/git4data/branch/merge/merge_3.result Additional files +3/-3

...

test/distributed/cases/git4data/branch/merge/merge_3.result


37. test/distributed/cases/git4data/branch/merge/merge_4.result Additional files +33/-0

...

test/distributed/cases/git4data/branch/merge/merge_4.result


38. test/distributed/cases/git4data/branch/merge/merge_4.sql Additional files +37/-0

...

test/distributed/cases/git4data/branch/merge/merge_4.sql


39. test/distributed/cases/git4data/branch/merge/merge_5.result Additional files +44/-0

...

test/distributed/cases/git4data/branch/merge/merge_5.result


40. test/distributed/cases/git4data/branch/merge/merge_5.sql Additional files +55/-0

...

test/distributed/cases/git4data/branch/merge/merge_5.sql


41. test/distributed/cases/git4data/branch/merge/merge_6.result Additional files +90/-0

...

test/distributed/cases/git4data/branch/merge/merge_6.result


42. test/distributed/cases/git4data/branch/merge/merge_6.sql Additional files +62/-0

...

test/distributed/cases/git4data/branch/merge/merge_6.sql


43. test/distributed/cases/git4data/branch/merge/merge_7.result Additional files +38/-0

...

test/distributed/cases/git4data/branch/merge/merge_7.result


44. test/distributed/cases/git4data/branch/merge/merge_7.sql Additional files +34/-0

...

test/distributed/cases/git4data/branch/merge/merge_7.sql


Labels

  • do-not-merge/wip
  • kind/bug (Something isn't working)
  • kind/enhancement
  • kind/feature
  • kind/refactor (Code refactor)
  • kind/test-ci
  • size/XXL (Denotes a PR that changes 2000+ lines)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Request]: Enhance data branch diff with fallback, delta output, and summary statistics

5 participants