Open
212 commits
0c2561d
Add initial KV Cache benchmark implementation for MLPerf Storage v3
hazemawadalla Nov 21, 2025
073fe61
feat: Replace legacy spillover logic with Waterfall LRU architecture
hazemawadalla Dec 9, 2025
2eb39cf
Fix two runtime errors in RAG-enabled benchmark mode
hazemawadalla Dec 19, 2025
f78bf60
Add detailed README.md for running the different invocations of kv-ca…
hazemawadalla Dec 19, 2025
2464edf
fix: line endings from dos2unix; increase cpu memory to 4GB for mlper…
hazemawadalla Dec 19, 2025
70b8f69
Update MLperf v3 KV cache proposal.md to recommend using a minimum of…
hazemawadalla Dec 19, 2025
9e60b98
Add storage throughput metric, ShareGPT integration, LMCache validati…
hazemawadalla Jan 10, 2026
72f36c7
Refactor BenchmarkRun to use data contract pattern
wvaske Jan 12, 2026
db82626
Update MLPerf v3 submission guidelines with discovery test validation
hazemawadalla Jan 13, 2026
f1ff963
Improve test suite with HTML reporting and flexible tier assertions
hazemawadalla Jan 13, 2026
e016954
Add pytest-html dependency for HTML test reports
hazemawadalla Jan 13, 2026
c1e5ff7
Add unit test HTML report showing all 112 tests passing
hazemawadalla Jan 13, 2026
bcae6ed
Add comprehensive pytest-based unit testing framework
wvaske Jan 13, 2026
da454e4
Increase test coverage to 74% with additional unit tests
wvaske Jan 13, 2026
d1364f4
Update NVMe Bandwidth specification to 14,000 MB/s
hazemawadalla Jan 13, 2026
7d331e2
Fix KV cache size per token values in discovery doc
hazemawadalla Jan 13, 2026
c93d013
Merge pull request #224 from hazemawadalla/TF_KVCache
FileSystemGuy Jan 13, 2026
5b71200
Fix system_info metadata fallback for ClusterInformation
wvaske Jan 15, 2026
47e1adc
Add .gitignore for Python cache and common artifacts
wvaske Jan 15, 2026
671f8a2
Merge pull request #18 from wvaske/claude/fix-system-info-metadata-LXIM0
wvaske Jan 15, 2026
6bc9568
Add implementation plan for MPI-based cluster information collection
wvaske Jan 15, 2026
bc8589d
Implement Phase 1: MPI-based cluster information collector
wvaske Jan 15, 2026
4bb263f
Merge main into feature branch, resolving conflict in rules.py
wvaske Jan 15, 2026
bf4d4ef
Merge pull request #19 from wvaske/claude/resolve-merge-conflicts-8Bwyz
wvaske Jan 15, 2026
8b9b832
Merge pull request #17 from wvaske/benchmark_cls_refactor
wvaske Jan 15, 2026
0450b75
Implement Phase 2: Extend HostInfo and ClusterInformation data classes
wvaske Jan 15, 2026
eba0f8a
Implement Phase 3: Integrate cluster collection with Benchmark classes
wvaske Jan 15, 2026
c60177e
Fix mpi4py import error handling in MPI collector script
wvaske Jan 15, 2026
f0c8c0b
Add comprehensive code improvement plan
wvaske Jan 15, 2026
92cf152
Merge pull request #20 from wvaske/claude/cluster-info-collection-mpi…
wvaske Jan 15, 2026
ac05eff
Enhance code improvement plan with refactored phases
wvaske Jan 15, 2026
069fc9a
Merge pull request #21 from wvaske/claude/fix-improvement-plan-markdo…
wvaske Jan 15, 2026
f5664a8
Implement Phase 1: Foundation - Interfaces and Abstractions
wvaske Jan 15, 2026
75a59ab
Implement Phase 2: Modular Rules Engine
wvaske Jan 15, 2026
974f17d
Implement Phase 3: CLI Refactoring and Benchmark Registry
wvaske Jan 15, 2026
02462ab
Implement Phase 4: Test Infrastructure Enhancement
wvaske Jan 15, 2026
f680f60
Implement Phase 5: Documentation and Type Annotations
wvaske Jan 15, 2026
12338b7
Implement Phase 6: KV Cache Benchmark Integration
wvaske Jan 15, 2026
443de02
Implement Phase 7: Reporting System Refactoring
wvaske Jan 15, 2026
2581ced
Implement Phase 8: Error Handling and User Messaging
wvaske Jan 15, 2026
f873a85
Merge pull request #23 from wvaske/claude/code-improvement-UGEAO
wvaske Jan 15, 2026
9048404
Add fail-fast dependency validation and fix initialization bugs
wvaske Jan 23, 2026
f13dbec
Fix unhashable RunID bug in report generator
wvaske Jan 23, 2026
d8280e7
Update CLAUDE.md with test environment directories and examples
wvaske Jan 23, 2026
bc45cc5
docs: map existing codebase structure
wvaske Jan 23, 2026
8bb1170
docs: initialize project
wvaske Jan 23, 2026
5f46292
chore: add project config
wvaske Jan 23, 2026
c7a8a4c
docs: define v3.0 requirements
wvaske Jan 23, 2026
688e593
docs: create roadmap (10 phases)
wvaske Jan 23, 2026
4cfbd94
docs(phase-1): research package management foundation
wvaske Jan 23, 2026
68c8818
docs(01): create phase plan for Package Management Foundation
wvaske Jan 23, 2026
10fdd9f
docs: revise Plan 01-05 to address benchmark integration gap
wvaske Jan 23, 2026
dcd7a6a
feat(01-02): add uv CPU-only index configuration
wvaske Jan 23, 2026
29845be
feat(01-02): add packaging dependency for version parsing
wvaske Jan 23, 2026
f417e2c
feat(01-01): create lockfile module structure with data models
wvaske Jan 23, 2026
7f6a78b
docs(01-02): complete CPU-only PyTorch configuration plan
wvaske Jan 23, 2026
1074d62
docs(01-01): complete lockfile module foundation plan
wvaske Jan 23, 2026
d237420
feat(01-03): implement lockfile generator module
wvaske Jan 23, 2026
6b21501
feat(01-04): create lockfile validator module
wvaske Jan 23, 2026
d4ffa94
feat(01-03): update lockfile module exports for generator
wvaske Jan 23, 2026
8aa422b
docs(01-03): complete lockfile generator plan
wvaske Jan 23, 2026
0d05871
docs(01-04): complete runtime version validation plan
wvaske Jan 23, 2026
bfa2326
feat(01-05): add lockfile CLI argument builder
wvaske Jan 23, 2026
80436f4
feat(01-05): integrate lockfile subcommand into CLI
wvaske Jan 23, 2026
818abd9
feat(01-05): add lockfile command handler to main
wvaske Jan 23, 2026
f2df6f4
feat(01-05): add --verify-lockfile flag to benchmark commands
wvaske Jan 23, 2026
b1fe7b6
docs(01-05): complete CLI integration plan - Phase 1 complete
wvaske Jan 23, 2026
e476136
docs: mark Phase 1 complete in roadmap
wvaske Jan 23, 2026
005e767
docs(phase-2): research environment validation and fail-fast patterns
wvaske Jan 24, 2026
ee7a8dc
docs(02): create phase plan for Environment Validation and Fail-Fast
wvaske Jan 24, 2026
1504fa4
fix(02): revise plans based on checker feedback
wvaske Jan 24, 2026
fe133c1
feat(02-01): add environment module with OS detection and install hints
wvaske Jan 24, 2026
cc741fb
test(02-01): add unit tests for environment module
wvaske Jan 24, 2026
864a39a
docs(02-01): complete environment detection plan
wvaske Jan 24, 2026
b4231ad
feat(02-03): add ValidationIssue dataclass and SSH validators
wvaske Jan 24, 2026
bf7cb0f
feat(02-02): Add OS-aware dependency checking with install hints
wvaske Jan 24, 2026
ddb2924
test(02-03): add comprehensive tests for validators
wvaske Jan 24, 2026
2b227a3
docs(02-02): complete Executable Checking Module plan
wvaske Jan 24, 2026
48c4389
docs(02-03): complete SSH validation and issue collection plan
wvaske Jan 24, 2026
a78cc92
feat(02-04): add comprehensive fail-fast environment validator
wvaske Jan 24, 2026
966a360
test(02-04): add comprehensive tests for validate_benchmark_environment
wvaske Jan 24, 2026
08852bb
docs(02-04): complete pre-run validation orchestration plan
wvaske Jan 24, 2026
f84a56a
feat(02-05): integrate fail-fast validation into main.py
wvaske Jan 24, 2026
04c8eae
feat(02-05): add _validate_environment hook to Benchmark base class
wvaske Jan 24, 2026
8256646
test(02-05): add tests for benchmark validation integration
wvaske Jan 24, 2026
d5e5c11
docs(02-05): complete fail-fast validation integration plan
wvaske Jan 24, 2026
d11f5cf
docs(02): complete environment validation and fail-fast phase
wvaske Jan 24, 2026
953bc40
docs(03): research KV Cache benchmark integration phase
wvaske Jan 24, 2026
99e5b4f
docs(03): create phase plan for KV Cache benchmark integration
wvaske Jan 24, 2026
de8b84d
feat(03-01): add distributed execution arguments to KV cache CLI
wvaske Jan 24, 2026
2772561
test(03-01): add unit tests for KV cache CLI arguments
wvaske Jan 24, 2026
4e79081
docs(03-01): complete KV cache distributed CLI plan
wvaske Jan 24, 2026
ebe1b37
feat(03-02): add MPI execution support to KVCacheBenchmark
wvaske Jan 24, 2026
da03797
test(03-02): add unit tests for KVCacheBenchmark MPI execution
wvaske Jan 24, 2026
b307d58
docs(03-02): complete MPI execution support plan
wvaske Jan 24, 2026
3f6c500
feat(03-03): enhance KV cache metadata for history integration
wvaske Jan 24, 2026
f15c3ba
test(03-03): add metadata tests for history integration
wvaske Jan 24, 2026
6329825
docs(03-03): complete metadata and history integration plan
wvaske Jan 24, 2026
431bf37
docs(03): complete KV Cache Benchmark Integration phase
wvaske Jan 24, 2026
ed5439b
docs(04): research VectorDB benchmark integration phase
wvaske Jan 24, 2026
64c79f5
docs(04): create phase plan for VectorDB benchmark integration
wvaske Jan 24, 2026
0bc604e
fix(04): revise 04-02 plan based on checker feedback
wvaske Jan 24, 2026
fc54091
feat(04-01): rename vectordb run-search to run for CLI consistency
wvaske Jan 24, 2026
c54cc2e
feat(04-02): add metadata property to VectorDBBenchmark
wvaske Jan 24, 2026
32a279a
feat(04-02): add write_metadata calls to execute_run and execute_datagen
wvaske Jan 24, 2026
7c4cab5
test(04-01): update vectordb CLI tests for run command rename
wvaske Jan 24, 2026
e641b2a
docs(04-01): complete VectorDB CLI rename plan
wvaske Jan 24, 2026
91d0ca2
docs(04-02): complete VectorDB metadata integration plan
wvaske Jan 24, 2026
0b0ecf5
test(04-03): add VectorDB CLI argument parsing tests
wvaske Jan 24, 2026
2656cb9
test(04-03): add VectorDB benchmark class tests
wvaske Jan 24, 2026
f48f717
docs(04-03): complete VectorDB verification and integration plan
wvaske Jan 24, 2026
5f42b61
docs(04): complete VectorDB Benchmark Integration phase
wvaske Jan 24, 2026
e81506a
docs(05): research phase domain
wvaske Jan 24, 2026
32eb533
docs(05): create phase plan for benchmark validation pipeline
wvaske Jan 24, 2026
2bcb0c1
feat(05-01): add VectorDBRunRulesChecker class
wvaske Jan 24, 2026
ceafea4
feat(05-01): export VectorDBRunRulesChecker from run_checkers
wvaske Jan 24, 2026
fb81826
feat(05-01): export VectorDBRunRulesChecker from rules package
wvaske Jan 24, 2026
2088e3f
feat(05-02): route kv_cache and vector_database in BenchmarkVerifier
wvaske Jan 24, 2026
44ee9b9
feat(05-02): add VECTORDB_REQUIREMENTS to ClosedRequirementsFormatter
wvaske Jan 24, 2026
c4804b9
docs(05-01): complete VectorDBRunRulesChecker plan
wvaske Jan 24, 2026
d95d336
docs(05-02): complete BenchmarkVerifier routing plan
wvaske Jan 24, 2026
86e08f8
test(05-03): add VectorDBRunRulesChecker unit tests
wvaske Jan 24, 2026
a89abe3
docs(05-03): complete VectorDB rules checker tests plan
wvaske Jan 24, 2026
c04dbb2
docs(05): complete Benchmark Validation Pipeline Integration phase
wvaske Jan 24, 2026
792c7d9
docs(06): research SSH-based host collection phase
wvaske Jan 24, 2026
670c225
docs(06): create phase plan for SSH-Based Host Collection
wvaske Jan 24, 2026
98dda5c
feat(06-01): add MountInfo, CgroupInfo dataclasses and /proc parsers
wvaske Jan 24, 2026
d447848
feat(06-01): update collect_local_system_info with vmstat, mounts, cg…
wvaske Jan 24, 2026
bea6560
test(06-01): add unit tests for cluster_collector parsers
wvaske Jan 24, 2026
b5eeb53
docs(06-01): complete /proc parsers plan
wvaske Jan 24, 2026
bcf7d2d
feat(06-02): implement SSHClusterCollector class
wvaske Jan 24, 2026
4cf35a9
test(06-02): add unit tests for SSHClusterCollector
wvaske Jan 24, 2026
72642f1
docs(06-02): complete SSHClusterCollector implementation plan
wvaske Jan 24, 2026
36a90a0
feat(06-03): add ClusterSnapshots dataclass for start/end collection
wvaske Jan 24, 2026
009279f
feat(06-03): integrate SSH collection into benchmark base class
wvaske Jan 24, 2026
631d117
test(06-03): add tests for collection method selection and snapshots
wvaske Jan 24, 2026
09fe05e
docs(06-03): complete benchmark base integration plan
wvaske Jan 24, 2026
5f1c9e6
docs(06): complete SSH-Based Host Collection phase
wvaske Jan 24, 2026
25b51ad
docs(07): research time-series host data collection phase
wvaske Jan 24, 2026
77c0d63
docs(07): create phase plan
wvaske Jan 24, 2026
6017301
fix(07): revise plans based on checker feedback
wvaske Jan 24, 2026
0ee8476
feat(07-01): add TimeSeriesSample and TimeSeriesData dataclasses
wvaske Jan 24, 2026
2dac7cd
feat(07-01): add collect_timeseries_sample and TimeSeriesCollector
wvaske Jan 24, 2026
8db1754
test(07-01): add unit tests for time-series collection
wvaske Jan 24, 2026
936bfa5
docs(07-01): complete core time-series infrastructure plan
wvaske Jan 24, 2026
3ba446d
feat(07-02): add MultiHostTimeSeriesCollector for parallel multi-host…
wvaske Jan 24, 2026
6e69035
test(07-02): add unit tests for MultiHostTimeSeriesCollector
wvaske Jan 24, 2026
74b33f3
docs(07-02): complete multi-host time-series collection plan
wvaske Jan 24, 2026
518b7d5
feat(07-03): add time-series CLI arguments to all benchmark parsers
wvaske Jan 24, 2026
22f3ca3
feat(07-03): integrate time-series collection into Benchmark base class
wvaske Jan 24, 2026
b2e8a02
test(07-03): add unit tests for time-series benchmark integration
wvaske Jan 24, 2026
fc5a91f
docs(07-03): complete benchmark time-series integration plan
wvaske Jan 24, 2026
2c4355f
docs(07): complete Time-Series Host Data Collection phase
wvaske Jan 24, 2026
711157a
docs(08): research phase domain for new training models
wvaske Jan 24, 2026
b1078d7
docs(08): create phase plan for new training models
wvaske Jan 24, 2026
accf79c
fix(08): revise plans based on checker feedback
wvaske Jan 24, 2026
2ce2823
feat(08-01): add DLRM, RETINANET, FLUX model constants
wvaske Jan 24, 2026
ac6746c
feat(08-01): add DLRM workload configurations
wvaske Jan 24, 2026
986321d
feat(08-01): add RetinaNet workload configurations
wvaske Jan 24, 2026
c490869
feat(08-01): add Flux workload configurations
wvaske Jan 24, 2026
10dec80
docs(08-01): complete new training model configurations plan
wvaske Jan 24, 2026
91637bc
feat(08-02): add model validation rules for new training models
wvaske Jan 24, 2026
3642404
test(08-02): add unit tests for new training model validation
wvaske Jan 24, 2026
add6fce
docs(08-02): complete validation rules for new training models plan
wvaske Jan 24, 2026
466d1b3
docs(08): complete New Training Models phase
wvaske Jan 24, 2026
0e5073f
docs(09): research DLIO parquet format support
wvaske Jan 25, 2026
fec97e4
docs(09): create phase plan for DLIO parquet support
wvaske Jan 25, 2026
efd1f64
fix(09): revise plans based on checker feedback
wvaske Jan 25, 2026
6933978
docs(09-01): complete DLIO parquet format implementation plan
wvaske Jan 25, 2026
d340aa7
feat(09-02): add DLRM parquet configuration files
wvaske Jan 25, 2026
51d6a95
test(09-02): add unit tests for parquet format validation
wvaske Jan 25, 2026
e165c05
docs(09-02): complete parquet workload configuration plan
wvaske Jan 25, 2026
8e40ef3
docs(09): complete DLIO Parquet Support phase
wvaske Jan 25, 2026
50fe9e5
docs(10): research phase domain for progress indication
wvaske Jan 25, 2026
c38bb83
docs(10): create phase plan
wvaske Jan 25, 2026
028eccc
fix(10): revise plans based on checker feedback
wvaske Jan 25, 2026
ef5d042
feat(10-01): add Rich dependency and create progress module
wvaske Jan 25, 2026
e64c229
test(10-01): add unit tests for progress module
wvaske Jan 25, 2026
6289cec
docs(10-01): complete progress indication foundation plan
wvaske Jan 25, 2026
38d1505
feat(10-02): add stage indicators to benchmark run() method
wvaske Jan 25, 2026
5335506
feat(10-02): add spinners to cluster collection methods
wvaske Jan 25, 2026
0f54d0e
feat(10-03): add progress indication to main.py validation operations
wvaske Jan 25, 2026
7089a16
test(10-02): add unit tests for progress integration in base.py
wvaske Jan 25, 2026
4928667
docs(10-02): complete benchmark progress integration plan
wvaske Jan 25, 2026
aa2509b
docs(10-03): complete main.py progress integration plan
wvaske Jan 25, 2026
295ad82
docs(10): complete Progress Indication phase
wvaske Jan 25, 2026
3aba7e9
docs(11): research phase domain
wvaske Feb 2, 2026
f341b64
docs(11): create phase plan for comprehensive parquet support
wvaske Feb 2, 2026
8e7563b
feat(11-03): update DLIO dependency to wvaske fork with parquet support
wvaske Feb 2, 2026
b21ab5b
docs(11-03): complete DLIO fork dependency plan
wvaske Feb 2, 2026
5d93513
docs(11-01): complete parquet config and enum extensions plan
wvaske Feb 2, 2026
08c5456
docs(11-02): complete Parquet Reader/Generator Rewrite plan
wvaske Feb 2, 2026
0780b39
docs(11): complete Comprehensive Parquet Support phase
wvaske Feb 2, 2026
746656b
chore: add UAT docs, phase context, and uv lockfile
wvaske Feb 2, 2026
ab81a5a
feat: update DLRM parquet configs with schema-driven columns
wvaske Feb 3, 2026
77f43ba
chore: update test params and fix file permissions
wvaske Feb 4, 2026
12577c0
allow claude to bypass cla check (#234)
BarnacleBob Feb 9, 2026
4a0669a
Remove unused imports and ShareGPT dataset loader
FileSystemGuy Feb 13, 2026
4bbb7b7
Revise KV Cache Benchmark script for MLPerf updates
FileSystemGuy Feb 13, 2026
7af411a
Revise README for KV Cache benchmark implementation
FileSystemGuy Feb 13, 2026
5416486
Enhance KV cache benchmark with ShareGPT integration
FileSystemGuy Feb 13, 2026
a8be997
Update allowlist format in CLA workflow
FileSystemGuy Feb 13, 2026
06dc3e4
Merge pull request #238 from mlcommons/FileSystemGuy-KVCache-revert
FileSystemGuy Feb 15, 2026
7286ed5
Merge pull request #239 from mlcommons/FileSystemGuy-claudebot
FileSystemGuy Feb 17, 2026
8fca675
Updated config files for v3 workloads
wvaske Feb 18, 2026
75a68d6
Merge branch 'mlcommons:main' into main
wvaske Feb 18, 2026
596fb55
Merge branch 'main' of https://github.com/wvaske/mlperf-storage
wvaske Feb 18, 2026
d70c4ad
GSD: 12-dlrm-dataset-columns - T2 Generate and apply 200-column DLRM …
wvaske Feb 25, 2026
24746aa
Updated training workload yamls
wvaske Mar 2, 2026
0201c44
merging changes
wvaske Mar 2, 2026
f874472
Last workload yaml changes.
wvaske Mar 2, 2026
24192d1
Update allowlist in CLA workflow
wvaske Mar 2, 2026
Empty file modified .github/CODEOWNERS
100644 → 100755
Empty file.
Empty file modified .github/workflows/cla.yml
100644 → 100755
Empty file.
74 changes: 74 additions & 0 deletions .github/workflows/test.yml
@@ -0,0 +1,74 @@
name: Tests

on:
  push:
    branches: [main, master]
  pull_request:
    branches: [main, master]

jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ['3.10', '3.11', '3.12']

    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}

      - name: Install system dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y libopenmpi-dev openmpi-common

      - name: Install package and test dependencies
        run: |
          python -m pip install --upgrade pip
          # Install the package in editable mode without DLIO
          pip install -e ".[test]"

      - name: Run unit tests
        run: |
          pytest tests/unit -v --tb=short

      - name: Run unit tests with coverage
        run: |
          pytest tests/unit -v --cov=mlpstorage --cov-report=xml --cov-report=term-missing

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v4
        with:
          files: ./coverage.xml
          fail_ci_if_error: false
          verbose: true
        env:
          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}

  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install lint dependencies
        run: |
          python -m pip install --upgrade pip
          pip install ruff

      - name: Run ruff check
        run: |
          ruff check mlpstorage/ --output-format=github || true

      - name: Run ruff format check
        run: |
          ruff format --check mlpstorage/ || true
39 changes: 39 additions & 0 deletions .gitignore
@@ -0,0 +1,39 @@
# Python cache
__pycache__/
*.py[cod]
*$py.class
*.so

# Distribution / packaging
dist/
build/
*.egg-info/

# Virtual environments
venv/
.venv/
env/

# IDE
.idea/
.vscode/
*.swp
*.swo

# Test artifacts
.pytest_cache/
.coverage
htmlcov/
*.html

# OS files
.DS_Store
Thumbs.db


# Coding Agents
.agent/
.roo/
.vscode/
CLAUDE.md
.roomodes
100 changes: 100 additions & 0 deletions .planning/PROJECT.md
@@ -0,0 +1,100 @@
# MLPerf Storage Benchmark Suite v3.0

## What This Is

A benchmark orchestration framework for the MLCommons MLPerf Storage working group. The suite runs storage benchmarks aligned with MLPerf rules and reports results with verification of rules compliance.

## Core Value

**The ONE thing that must work:** Orchestrate multiple benchmark types (training, checkpointing, kv-cache, vectordb) across distributed systems and produce verified, rules-compliant results.

## Context

### Current State
- v2.0 release with Claude Code enhancements
- Training and checkpointing benchmarks use DLIO as underlying engine
- KV cache benchmark exists in separate directory (`kv_cache_benchmark/`)
- VectorDB benchmark code exists in external branch
- MPI-based execution and host collection for DLIO benchmarks
- Existing error handling and validation pipeline

### Target State (v3.0)
- Fully integrated KV cache and VectorDB benchmarks as Benchmark subclasses
- New training models (dlrm, retinanet, flux)
- Package version management with lockfiles
- SSH-based host collection for non-MPI benchmarks
- Time-series /proc/ data collection during benchmark execution
- Improved error messaging and user guidance

### Timeline
- **Feature freeze:** 6 weeks
- **Bugfix period:** 6 weeks
- **Code freeze:** 12 weeks total

## Requirements

### Validated (Existing)

- ✓ Training benchmark orchestration via DLIO — existing
- ✓ Checkpointing benchmark orchestration via DLIO — existing
- ✓ MPI-based distributed execution — existing
- ✓ Rules validation pipeline — existing
- ✓ Report generation — existing
- ✓ CLI with nested subcommands — existing
- ✓ Benchmark registry pattern — existing

### Active

- [ ] Package version lockfile management
- [ ] Remove GPU package dependencies (not used)
- [ ] KV cache Benchmark class (wraps kv-cache.py)
- [ ] KV cache MPI execution across hosts
- [ ] VectorDB Benchmark class (wraps load_vdb.py, compact_and_watch.py, simple_bench.py)
- [ ] SSH-based host collection for non-MPI benchmarks
- [ ] New training models: dlrm, retinanet, flux
- [ ] Improved error messaging for missing commands/packages
- [ ] Clear user guidance for resolving dependency issues
- [ ] Time-series /proc/ collection (diskstats, vmstat, cpuinfo, etc.)
- [ ] Parallel collection process (10 sec intervals) without impacting benchmark
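
The `/proc/` collection items above depend on turning raw kernel text into structured counters. As a rough illustration only, the sketch below parses `/proc/vmstat`-style text (one `name value` pair per line) into a dictionary. The function name `parse_vmstat` and the sample text are hypothetical and are not taken from the project's code.

```python
from typing import Dict


def parse_vmstat(text: str) -> Dict[str, int]:
    """Parse /proc/vmstat-style text: one 'name value' pair per line."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines rather than failing the whole sample.
        if len(parts) != 2:
            continue
        name, value = parts
        try:
            counters[name] = int(value)
        except ValueError:
            # Non-numeric values are ignored in this sketch.
            continue
    return counters


# Hypothetical sample text; on Linux the real input would be read from /proc/vmstat.
SAMPLE = "nr_free_pages 12345\npgpgin 678\npgpgout 910\n"
counters = parse_vmstat(SAMPLE)
print(counters)
```

Keeping the parser a pure function of text makes it easy to unit test without a live `/proc` filesystem; other files such as `/proc/diskstats` have different layouts and would need their own parsers.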

### Out of Scope

- GPU support — deliberately not supporting GPU execution
- Rewriting KV/VDB as native benchmarks — v3.0 wraps existing scripts
- Real-time monitoring UI — collection only, no visualization
- Cloud provider integrations — on-premise/bare-metal focus

## Key Decisions

| Decision | Rationale | Outcome |
|----------|-----------|---------|
| Lockfile for package versions | Reproducibility across systems, MPI version issues | Pending |
| Benchmark subclasses for KV/VDB | Minimal integration, reuse CLI and reporting infrastructure | Pending |
| SSH for non-MPI host collection | KV cache and VectorDB don't require MPI execution | Pending |
| Parallel process for time-series | Must not impact benchmark performance | Pending |
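
To make the SSH decision more concrete, here is a minimal sketch of how a collector might build a non-interactive `ssh` invocation for one host. The helper name `build_ssh_command`, the host name, and the default timeout are assumptions made for illustration; `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import shlex
from typing import List


def build_ssh_command(host: str, remote_command: List[str], timeout_s: int = 10) -> List[str]:
    """Build an argv list that runs remote_command on host over SSH."""
    return [
        "ssh",
        # BatchMode disables password prompts so a collector never hangs on input.
        "-o", "BatchMode=yes",
        # ConnectTimeout bounds how long an unreachable host can stall collection.
        "-o", f"ConnectTimeout={timeout_s}",
        host,
        # The remote side receives a single shell string, so quote it safely.
        shlex.join(remote_command),
    ]


# "node01" is a placeholder host name.
cmd = build_ssh_command("node01", ["cat", "/proc/vmstat"])
print(cmd)
```

The resulting list could be passed to `subprocess.run(cmd, capture_output=True, text=True)`; building the command separately from running it keeps the logic testable without network access.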

## Constraints

- **No GPU dependencies** — storage benchmark, not compute
- **MPI compatibility** — must work with various MPI implementations
- **Cross-platform** — Linux primarily, various distributions
- **Minimal dependencies** — reduce version conflict surface area

## External Code References

| Component | Location | Notes |
|-----------|----------|-------|
| KV cache benchmark | `kv_cache_benchmark/` (local) | Also: `mlcommons/storage/TF_KVCache` branch |
| VectorDB benchmark | `mlcommons/storage/TF_VDBBench` branch | Scripts: load_vdb.py, compact_and_watch.py, simple_bench.py |
| DLIO benchmark | External package | Upstream dependency for training/checkpointing |

## Success Metrics

- All 4 benchmark types (training, checkpointing, kv-cache, vectordb) runnable from unified CLI
- Package lockfile prevents version conflicts in CI
- Error messages guide users to resolution for common issues
- Host data collected for all benchmark types (MPI or SSH)
- Time-series collection runs without measurable benchmark impact

---
*Last updated: 2026-01-23 after initialization*
92 changes: 92 additions & 0 deletions .planning/REQUIREMENTS.md
@@ -0,0 +1,92 @@
# MLPerf Storage v3.0 Requirements

## v1 Requirements

### Package Management

- [x] **PKG-01**: Lockfile for Python dependencies with pinned versions
- [x] **PKG-02**: Remove GPU package dependencies from default install
- [x] **PKG-03**: Validate package versions match lockfile before benchmark execution
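
PKG-03 amounts to comparing pinned versions against what is installed. The sketch below shows one way such a check could look, assuming a simple `name==version` pin format. The real lockfile format used by the project may differ, and the function names `parse_pins` and `find_mismatches` are hypothetical.

```python
import re
from typing import Dict, List


def normalize(name: str) -> str:
    """Normalize a distribution name (lowercase, runs of -, _, . become -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pins(lock_text: str) -> Dict[str, str]:
    """Read 'name==version' pins, ignoring comments and other lines."""
    pins: Dict[str, str] = {}
    for raw in lock_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[normalize(name.strip())] = version.strip()
    return pins


def find_mismatches(pins: Dict[str, str], installed: Dict[str, str]) -> List[str]:
    """Return one message per pinned package that is missing or at another version."""
    problems: List[str] = []
    for name, wanted in sorted(pins.items()):
        found = installed.get(name)
        if found is None:
            problems.append(f"{name}: not installed (expected {wanted})")
        elif found != wanted:
            problems.append(f"{name}: installed {found}, expected {wanted}")
    return problems


# Illustrative data only; these versions are not the project's actual pins.
LOCK = "numpy==1.26.4\nPyYAML==6.0.1  # pinned\n"
installed = {"numpy": "1.26.4", "pyyaml": "6.0.2"}
problems = find_mismatches(parse_pins(LOCK), installed)
print(problems)
```

In a real run, the `installed` mapping could be built from `importlib.metadata.distributions()`, and a non-empty `problems` list would stop the benchmark before it starts.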

### Benchmark Integration

- [x] **BENCH-01**: KVCacheBenchmark class extending Benchmark base (wraps kv-cache.py)
- [x] **BENCH-02**: KV cache MPI execution across multiple hosts
- [x] **BENCH-03**: VectorDBBenchmark class extending Benchmark base (wraps VDB scripts)
- [x] **BENCH-04**: VectorDB CLI commands (run, datagen operations)
- [x] **BENCH-05**: Integration with existing validation/reporting pipeline
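
The wrapper approach in BENCH-01 and BENCH-03 can be pictured as a subclass that only knows how to assemble the command line for an existing script. The sketch below is an assumption-heavy illustration: the `Benchmark` base class here is a stand-in rather than the suite's real base class, and `--example-flag` is a placeholder rather than a real `kv-cache.py` option.

```python
import sys
from abc import ABC, abstractmethod
from typing import List


class Benchmark(ABC):
    """Hypothetical stand-in for the suite's Benchmark base class."""

    @abstractmethod
    def build_command(self) -> List[str]:
        """Return the argv list that launches the benchmark."""


class KVCacheBenchmark(Benchmark):
    """Wraps an existing script instead of reimplementing it."""

    def __init__(self, script_path: str, script_args: List[str]) -> None:
        self.script_path = script_path
        self.script_args = list(script_args)

    def build_command(self) -> List[str]:
        # Run the wrapped script with the same interpreter as the suite.
        return [sys.executable, self.script_path, *self.script_args]


# The flag below is a placeholder, not a documented kv-cache.py argument.
bench = KVCacheBenchmark("kv_cache_benchmark/kv-cache.py", ["--example-flag", "1"])
command = bench.build_command()
print(command)
```

Because the subclass only produces a command, shared concerns such as validation, metadata, and reporting can stay in the base class and be reused by every wrapped benchmark.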

### Training Updates

- [x] **TRAIN-01**: Add dlrm model configuration
- [x] **TRAIN-02**: Add retinanet model configuration
- [x] **TRAIN-03**: Add flux model configuration
- [x] **TRAIN-04**: Update DLIO to support parquet for data loaders, readers, data generation
- [x] **TRAIN-05**: Production-ready parquet reader with memory-efficient I/O
- [x] **TRAIN-06**: Update pyproject.toml to reference DLIO fork

### Host Collection

- [x] **HOST-01**: SSH-based host collection for non-MPI benchmarks
- [x] **HOST-02**: Collect /proc/ data (diskstats, vmstat, cpuinfo, filesystems, cgroups)
- [x] **HOST-03**: Collection at benchmark start and end
- [x] **HOST-04**: Time-series collection (10 sec intervals) during execution
- [x] **HOST-05**: Parallel collection process without benchmark performance impact
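
HOST-04 and HOST-05 describe periodic sampling that runs alongside the benchmark. A minimal sketch of that pattern follows. It uses a background thread for brevity where the requirement mentions a parallel process, and the class name `TimeSeriesSampler` and its interface are hypothetical.

```python
import threading
import time
from typing import Any, Callable, Dict, List


class TimeSeriesSampler:
    """Call sample_fn every interval_s seconds until stop() is called."""

    def __init__(self, sample_fn: Callable[[], Dict[str, Any]], interval_s: float = 10.0) -> None:
        self._sample_fn = sample_fn
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self.samples: List[Dict[str, Any]] = []

    def _run(self) -> None:
        # Event.wait() returns True once stop() sets the event, ending the loop;
        # otherwise it times out after interval_s and one more sample is taken.
        while not self._stop.wait(self._interval_s):
            self.samples.append({"timestamp": time.time(), "data": self._sample_fn()})

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()


# A short interval and a fake sampler keep the demonstration fast.
sampler = TimeSeriesSampler(lambda: {"pgpgin": 1}, interval_s=0.01)
sampler.start()
time.sleep(0.1)  # stands in for the benchmark run
sampler.stop()
print(len(sampler.samples))
```

Waiting on an `Event` instead of calling `time.sleep` lets `stop()` end collection promptly rather than after a full interval. Note that in this sketch the first sample is taken one interval after `start()`.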

### Error Handling & UX

- [x] **UX-01**: Detect missing commands/packages with actionable error messages
- [x] **UX-02**: Suggest installation steps for missing dependencies
- [x] **UX-03**: Validate environment before benchmark execution (fail-fast)
- [x] **UX-04**: Clear progress indication during long operations
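
UX-01 through UX-03 can be illustrated with a small pre-run check that looks up required executables and attaches a hint to each missing one. This is a sketch under assumptions: the hint texts and the function name `find_missing` are invented for illustration, while `shutil.which` is the standard-library lookup.

```python
import shutil
from typing import Dict, List

# Hypothetical hints; real messages would depend on the detected OS.
INSTALL_HINTS: Dict[str, str] = {
    "mpirun": "install an MPI implementation such as Open MPI",
    "ssh": "install an OpenSSH client",
}


def find_missing(commands: List[str]) -> List[str]:
    """Return an actionable message for every command not found on PATH."""
    messages: List[str] = []
    for cmd in commands:
        if shutil.which(cmd) is None:
            hint = INSTALL_HINTS.get(cmd, "install it and make sure it is on PATH")
            messages.append(f"Required command '{cmd}' was not found: {hint}")
    return messages


# A deliberately nonexistent command shows the failure path.
messages = find_missing(["definitely-not-a-real-command-xyz"])
for message in messages:
    print(message)
```

Collecting all problems before exiting, rather than stopping at the first one, lets a user fix every missing dependency in a single pass.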

---

## v2 Requirements (Deferred)

- [ ] Deeper KV cache integration (native implementation vs wrapper)
- [ ] Deeper VectorDB integration (native implementation vs wrapper)
- [ ] Real-time monitoring dashboard for time-series data
- [ ] Cloud provider integrations (AWS, GCP, Azure)

---

## Out of Scope

- **GPU support** — Storage benchmark, deliberately not supporting GPU execution
- **Rewriting KV/VDB as native benchmarks** — v3.0 wraps existing scripts
- **Real-time visualization** — Collection only, no visualization in v3.0
- **Windows support** — Linux-only target

---

## Traceability

| Requirement | Phase | Status |
|-------------|-------|--------|
| PKG-01 | Phase 1 | Complete |
| PKG-02 | Phase 1 | Complete |
| PKG-03 | Phase 1 | Complete |
| UX-01 | Phase 2 | Complete |
| UX-02 | Phase 2 | Complete |
| UX-03 | Phase 2 | Complete |
| BENCH-01 | Phase 3 | Complete |
| BENCH-02 | Phase 3 | Complete |
| BENCH-03 | Phase 4 | Complete |
| BENCH-04 | Phase 4 | Complete |
| BENCH-05 | Phase 5 | Complete |
| HOST-01 | Phase 6 | Complete |
| HOST-02 | Phase 6 | Complete |
| HOST-03 | Phase 6 | Complete |
| HOST-04 | Phase 7 | Complete |
| HOST-05 | Phase 7 | Complete |
| TRAIN-01 | Phase 8 | Complete |
| TRAIN-02 | Phase 8 | Complete |
| TRAIN-03 | Phase 8 | Complete |
| TRAIN-04 | Phase 9 | Complete |
| UX-04 | Phase 10 | Complete |
| TRAIN-05 | Phase 11 | Complete |
| TRAIN-06 | Phase 11 | Complete |

---
*Last updated: 2026-01-25*