
Feature/io-trace-log: Add io-tracing to a log file, along with tensor parallelism and # GPU parameters (#254)

Closed
russfellows wants to merge 35 commits into mlcommons:main from russfellows:feature/io-trace-log

Conversation

@russfellows

PR: Add --io-trace-log Trace Mode with Tensor-Parallel / Multi-GPU Support

Branch: feature/io-trace-log → main
Author: Russ Fellows <russ.fellows@mlcommons.org>
Commit: ac5970c
Files changed: 8 (+677 / −36)


Summary

This PR adds a purely logical trace mode to the KV cache benchmark. When
--io-trace-log <path> is specified, the benchmark runs the full inference
simulation (prefill, decode, eviction, multi-turn, prefix caching, etc.) but
performs no real GPU/CPU/NVMe I/O. Instead, every cache operation is
recorded to a structured CSV file for offline replay by an external storage
tool such as fio, sai3-bench, or warp.

This enables clean separation between the workload generation (what the
benchmark does) and the storage validation (what an external tool measures),
which is essential for MLPerf Storage submission workflows.

The PR also adds --num-gpus and --tensor-parallel arguments so that
large, real-world multi-GPU configurations (e.g. 8×H200, TP=8) can be
accurately modeled in both trace and normal benchmark modes.


Motivation

The previous benchmark could only exercise storage backends directly during
a run. There was no way to capture the operation stream for replay, and no
support for modeling multi-GPU tensor-parallel deployments. Trace mode
addresses both gaps:

  • A replay file from a 1-hour, 300 req/s run represents a realistic,
    model-accurate LLM KV cache workload that any storage tool can
    replay exactly, on any hardware.
  • Tensor-parallel modeling ensures that per-rank object sizes (1/TP of
    each KV entry) are reflected correctly in I/O sizes, matching what a
    real distributed inference environment would write and read.

New Features

--io-trace-log <path> — trace mode

When this flag is set:

  • All storage backends are replaced with NullBackend (no real I/O).
  • Every allocate, access, and demote operation is logged to CSV.
  • Output path ending in .zst enables zstd streaming compression
    (level 3, 10–20× ratio), reducing a 1-hour log from ~1 GB to ~50 MB.
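The writer side of this can be sketched in a few lines. This is an illustrative helper only, not the actual IOTracer API from kv_cache/tracer.py (which is additionally thread-safe); the column names follow the schema documented below:

```python
import csv
import io


def open_trace_writer(path):
    """Open a CSV trace writer; zstd-compress when the path ends in .zst.

    Sketch of the approach described above. The real IOTracer class
    differs in API and adds thread safety and a clean close() sequence.
    """
    if path.endswith(".zst"):
        import zstandard  # optional dependency (the 'compression' extra)
        raw = open(path, "wb")
        stream = zstandard.ZstdCompressor(level=3).stream_writer(raw)
        handle = io.TextIOWrapper(stream, encoding="utf-8", newline="")
    else:
        handle = open(path, "w", newline="")
    writer = csv.writer(handle)
    # Header row matches the documented column order.
    writer.writerow(["Timestamp", "Operation", "Object_Size_Bytes",
                     "Tier", "Key", "Phase"])
    return writer, handle
```

Closing the returned handle flushes the zstd stream, which is why trace mode needs an explicit shutdown step before the file is replayable.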

CSV columns (in output order):

  Column             Description
  Timestamp          Unix epoch, float, 6 decimal places
  Operation          Write or Read
  Object_Size_Bytes  Exact byte size of the KV cache object (TP-adjusted)
  Tier               Tier-0 = GPU VRAM, Tier-1 = CPU RAM, Tier-2 = NVMe
  Key                Cache entry identifier — use as object name / path in replay tools
  Phase              Prefill (initial write), Decode (per-token read), Evict (demotion)
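For quick offline inspection, a trace can be aggregated with the standard csv module. A minimal sketch, assuming a plain (uncompressed) trace and the column names above; the function name is hypothetical:

```python
import csv
from collections import defaultdict


def bytes_per_tier(path):
    """Sum Object_Size_Bytes per (Tier, Operation) from a plain-CSV trace.

    Small inspection helper using the documented column names. For a
    .zst trace, decompress first (e.g. with zstd -d or the Python
    zstandard package).
    """
    totals = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[(row["Tier"], row["Operation"])] += int(row["Object_Size_Bytes"])
    return dict(totals)
```

Summing per (Tier, Operation) gives the total bytes a replay tool would issue against each tier, a useful sanity check before handing the trace to fio or sai3-bench.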

--num-gpus N

Sets the total GPU count in the tensor-parallel group. Effective GPU tier
capacity = N × gpu-mem-gb. Example: --num-gpus 8 --gpu-mem-gb 141
models an 8×H200 node with 1,128 GB of total HBM.

--tensor-parallel N

TP degree for KV cache sharding. Each GPU rank stores 1/N of each KV entry,
so per-rank object sizes reported in the trace, cache stats, and XLSX export
are all divided by N. Must be ≥ 1 and ≤ --num-gpus.
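The two sizing rules can be written out directly. The function names here are hypothetical, used only to illustrate the arithmetic, not taken from the package:

```python
def gpu_tier_capacity_gb(num_gpus, gpu_mem_gb):
    """Effective GPU tier capacity: N x per-GPU memory (e.g. 8 x 141 = 1128 GB)."""
    return num_gpus * gpu_mem_gb


def per_rank_kv_bytes(bytes_per_token, seq_len, tensor_parallel, num_gpus):
    """Per-rank KV object size under tensor parallelism.

    Each of the TP ranks stores 1/TP of every KV entry, so the object
    sizes that appear in the trace, cache stats, and XLSX export are
    the full entry size divided by the TP degree.
    """
    if not 1 <= tensor_parallel <= num_gpus:
        raise ValueError("require 1 <= tensor_parallel <= num_gpus")
    return bytes_per_token * seq_len // tensor_parallel
```

With TP=1 the per-rank size equals the full entry size, so trace mode without tensor parallelism behaves exactly as before.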


Files Changed

  • kv_cache/tracer.py: New. IOTracer: thread-safe CSV writer with optional zstd compression, Key and Phase columns, clean close() sequence.
  • kv_cache/backends.py: New NullBackend: no-op write/read that tracks byte counts only; used by all tiers in trace mode.
  • kv_cache/cache.py: MultiTierCache accepts io_tracer= and tensor_parallel=; TP-adjusted size_bytes in trace mode; data sliced to per-rank shard in real mode; _demote_entry stats use TP-adjusted bytes/token.
  • kv_cache/benchmark.py: IntegratedBenchmark accepts io_trace_log=, num_gpus=, tensor_parallel=; manages IOTracer lifecycle; display banner shows 8x 141 GB GPU (total 1128 GB HBM) | TP=8 and per-rank KV sizes.
  • kv_cache/cli.py: --io-trace-log, --num-gpus, --tensor-parallel args; XLSX export includes both columns and Total GPU Memory.
  • kv_cache/workload.py: Validation: TP ≤ num_gpus; warn if TP not power-of-2; MAX_GPU_MEMORY_GB 1,024 → 65,536; MAX_CPU_MEMORY_GB 16,384 → 131,072 to support large multi-GPU nodes.
  • pyproject.toml: compression optional extra (zstandard>=0.21); included in full extra.
  • docs/io_trace_log_usage.md: New. User guide: all flags, CSV schema, compression size estimates, seven ready-to-run examples (single GPU, 8×H200 TP=8, prefill-only, decode-only, DeepSeek V3), trace inspection shell snippets, model reference table.

Usage Examples

# Capture a 60-second trace for an 8×H200 node, TP=8, compressed
python -m kv_cache.cli \
  --model llama3.1-70b-instruct \
  --num-users 64 \
  --duration 60 \
  --num-gpus 8 --gpu-mem-gb 141 \
  --tensor-parallel 8 \
  --io-trace-log kv_ops_llama70b_tp8.csv.zst

# Replay the trace with sai3-bench (illustrative)
sai3-bench replay --trace kv_ops_llama70b_tp8.csv.zst --endpoint s3://bucket

Compatibility

All existing code paths are completely unchanged when --io-trace-log
is not specified. There are no breaking changes to existing CLI arguments,
config files, or Python API.


Testing

  • All existing tests pass unmodified.
  • Trace mode validated end-to-end: CSV output correct for prefill, decode,
    and eviction operations across GPU/CPU/NVMe tiers.
  • zstd compression validated: output readable by zstd -d and standard
    Python zstandard reader.
  • TP division verified: object sizes in trace match kv_cache_size_per_token × seq_len ÷ tensor_parallel for TP ∈ {1, 2, 4, 8}.

FileSystemGuy and others added 30 commits November 25, 2025 08:41
Add initial KV Cache benchmark implementation for MLPerf Storage v3
…lcommons#219)

* feat: Replace legacy spillover logic with Waterfall LRU architecture

This is a major architectural upgrade to the core benchmark logic. Replacing
the original "Spillover" memory management strategy with the new "Waterfall
LRU" implementation to accurately simulate enterprise storage hierarchies.

Key Changes:
- Waterfall Eviction: Implemented recursive eviction (GPU -> CPU -> NVMe).
  New data now correctly lands in the fastest available tier, pushing cold
  data down, rather than the old behavior where new data skipped directly
  to NVMe if RAM was full.
- Static Buffer Optimization: Replaced the CPU-bound np.random generation
  with a pre-allocated static noise buffer. This removes the CPU bottleneck
  that was masking true storage latency, allowing us to fully saturate
  high-performance NVMe drives.
- Concurrency Hardening: Added semaphore-based concurrency limits
  (max_concurrent_allocs) and atomic memory reservations to prevent OOM
  crashes under heavy load.
- Storage Metrics: Added explicit tracking for nvme_tokens_processed to
  calculate true storage throughput separate from system throughput.
- Stress Test Validation: Verified that this new architecture correctly
  exposes storage latency limits (e.g., pushing P95 write latency >1000ms)
  where the old script artificially throttled the load.

* Fix two runtime errors in RAG-enabled benchmark mode

This patch addresses two bugs that surface when running the benchmark
with --enable-rag:

1. Race condition in process_requests (line 2693)

   Worker threads begin processing requests immediately upon benchmark
   start, while RAG document ingestion runs in a separate daemon thread.
   When a worker hits the 10% RAG query path before any documents have
   been ingested, random.choice() is called on an empty list, raising
   IndexError.

   Fixed by adding a truthiness check on self.rag_manager.documents
   before entering the RAG code path. An empty dict evaluates to False,
   so RAG queries are safely skipped until ingestion populates at least
   one document.
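The guard described in this fix can be sketched as follows. The helper name and the shape of the surrounding request loop are illustrative; only the truthiness check on the documents dict comes from the fix itself:

```python
import random


def maybe_run_rag_query(rag_manager, rag_fraction=0.10):
    """Guarded RAG path, per the fix described above.

    Skips RAG until ingestion has populated at least one document: an
    empty documents dict is falsy, so random.choice() is never called
    on an empty list. Returns the chosen document, or None when the
    request should fall through to the normal (non-RAG) path.
    """
    if rag_manager.documents and random.random() < rag_fraction:
        doc_id = random.choice(list(rag_manager.documents))
        return rag_manager.documents[doc_id]
    return None
```

Checking the dict first also avoids taking any lock or snapshot of the document list while the ingestion daemon is still warming up.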

2. Division by zero in KVCacheGenerator.generate (line 1097)

   The buffer slicing logic uses modulo to compute a pseudo-random start
   index: seed % (buffer_size - total_elements). When total_elements
   exactly equals buffer_size (an edge case permitted by the <= guard),
   the divisor becomes zero, raising ZeroDivisionError.

   Fixed by computing the divisor separately and defaulting start_idx
   to 0 when the divisor is zero.
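The fix amounts to computing the divisor before taking the modulo. A sketch with illustrative names (the actual identifiers in KVCacheGenerator.generate may differ):

```python
def pseudo_random_start(seed, buffer_size, total_elements):
    """Start index into the static noise buffer, per the fix above.

    The divisor is computed separately so that the edge case
    total_elements == buffer_size (permitted by the <= guard) yields
    start index 0 instead of raising ZeroDivisionError.
    """
    divisor = buffer_size - total_elements
    return seed % divisor if divisor > 0 else 0
```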

* Add detailed README.md for running the different invocations of kv-cache.py

* fix: line endings from dos2unix; increase cpu memory to 4GB for mlperf invocation

* Update MLperf v3 KV cache proposal.md to recommend using a minimum of 4G of DRAM to reduce Queue contention and unrealistic read amplification
- Add ConfigLoader class with YAML config file support and schema validation
- Add cfg() helper function for config-driven parameter access
- Add validate_args() with safety limits for protected system paths
- Rename all nvme_* metrics to storage_* for MLPerf terminology compliance
- Add extended QoS percentiles: P99.9 and P99.99 latency tracking
- Add per-tier bandwidth metrics (read/write GB/s per tier)
- Add per-tier KV bytes tracking for detailed storage analysis
- Fix GPU metadata desync bug via on_eviction_callback pattern
- Change eviction from single-shot to iterative loop until space freed
- Replace print statements with Python logging module
- Add waterfall LRU eviction with configurable high/low watermarks
- Add storage_health section with PASS/FAIL criteria
- Add storage_throughput_tokens_per_sec as primary MLPerf metric
- Add -c DIR option for custom config directory
- Generate and pass config.yaml to Python script via --config flag
- Add --xlsx-output support for Excel export
- Update jq queries for new storage_* metric names
- Add mlperf_submission workload with required trial parameters
- Enhance system detection for thread counts and memory limits
- Update metric parsing for storage_throughput primary metric
- Add 170+ tests covering all new functionality
- Add ConfigLoader tests: schema validation, defaults, file loading
- Add cfg() helper tests for config-driven parameters
- Add validate_args() tests for path safety and input validation
- Add extended QoS tests for P99.9 and P99.99 percentiles
- Add GPU eviction callback tests for metadata sync
- Add per-tier bandwidth and KV bytes metric tests
- Add storage_* metric naming tests for MLPerf compliance
- Add waterfall eviction tests with high/low watermarks
- Add storage_health PASS/FAIL criteria tests
- Add Configuration section with YAML parameter reference
- Add MLPerf Submission Guidelines with validated commands
- Add Excel metrics reference table with all output columns
- Add installation instructions including pyyaml dependency
- Add CLI arguments vs config file precedence documentation
- Add workload definitions and tier configuration examples
- Add troubleshooting section for common issues
- Add kv-cache-test-report.html with full test execution results
- All 170+ tests passing for v3.0 features
- Create unit_test_results directory for test artifacts
- Add P99.9 and P99.99 latency columns
- Add per-tier KV bytes columns (GPU, CPU, Storage)
- Add per-tier bandwidth columns (read/write GB/s)
- Add storage tier device vs host latency breakdown
- Rename nvme_entries to storage_entries for MLPerf compliance
- Add storage_throughput_tokens_per_sec as primary metric
- Add pyyaml>=6.0 for YAML configuration file parsing
- Required for ConfigLoader and --config CLI argument
- Add user_templates section with conversation patterns
- Add qos_profiles with latency thresholds per tier
- Add eviction settings with waterfall LRU parameters
- Add storage_health criteria for PASS/FAIL determination
- Add cache_sizing defaults for GPU/CPU/Storage tiers
- Provides validated defaults for all tunable parameters
Split the single ~3500-line kv-cache.py into a structured Python package
(kv_cache/) with 12 modules. Added MLA attention support, NVMe capacity
management, SSD preconditioning, disaggregated inference modes, and
streaming BurstGPT trace replay. Updated proposal and README with
corrected DeepSeek-V3 MLA calculations, capacity planning scope notes,
and repo cleanup.

Structural changes:
- kv_cache/ package: __init__, _compat, config, models, backends, cache,
  conversation, prefix_cache, rag, monitoring, workload, benchmark, cli
- kv-cache.py is now a thin shim importing from kv_cache
- Added pyproject.toml for pip-installable package

New features:
- MLA attention support (DeepSeek-V3: 70,272 bytes/token vs 1.7M MHA)
- 4 new models: deepseek-v3, qwen3-32b, gpt-oss-120b, gpt-oss-20b
- NVMe capacity tracking with LRU eviction (prevents disk exhaustion)
- SSD preconditioning (--precondition)
- Disaggregated inference (--prefill-only, --decode-only)
- Streaming BurstGPT trace replay (--trace-speedup, --replay-cycles)
- Config-driven model definitions via config.yaml
- RAG retrieval distribution (zipfian/uniform), document eviction

Documentation:
- Corrected DeepSeek-V3 from MHA formula to MLA in all capacity tables
- Scoped capacity planning claims to storage throughput (no tier promotion)
- Restructured GDS section around production GPU-origin KV cache
- Added NVMe terminology note (benchmark works with any block device)
- Fixed stale class names and default ranges in README

Repo cleanup:
- Moved kv-cache-wrapper.sh to utils/
- Added utils/run_benchmarks_256gb.sh
- Removed kv-cache_sharegpt_replay.py (merged into package)
- Removed discovery_results_and_analysis/, lmcache_results_*, proposal PDF
README: Corrected DeepSeek-V3 KV cache from MHA formula (1,748,992
bytes/token, 1.7 MB) to MLA formula (70,272 bytes/token, 69 KB).
Updated all derived tables: per-user RAM 13.4 GB -> 0.54 GB, removed
from 128 GB exclusion list, fixed model reference table.

Moved validate.sh to utils/ alongside other shell scripts.
The code reads decode_batch_size from config.yaml via
cfg('decode', 'batch_size', default=32). Updated the proposal
code snippet to match the actual implementation.
The "Two Separate Eviction Mechanisms" section now explicitly
distinguishes metadata-only eviction (ConversationManager removes
dict entries; .npy files remain on disk) from physical file deletion
(MultiTierCache calls path.unlink(), permanently removing .npy files
from the filesystem). Added actual code paths from backends.py and
cache.py to replace the pseudocode.
… compatibility

Major Features:
=============

1. DLIO s3dlio Backend Integration
   - Installed s3dlio as alternative storage backend to s3pytorchconnector
   - Patched DLIO enumerations.py to add StorageType.S3DLIO
   - Patched storage_factory.py to instantiate S3dlioStorage
   - Copied s3dlio_storage.py into DLIO installation
   - Multi-protocol support: s3://, az://, gs://, file://, direct://

2. s3torchconnector Drop-In Compatibility Layer
   - Created s3dlio/python/s3dlio/compat/s3torchconnector.py (482 lines)
   - Full API compatibility: S3Item, S3IterableDataset, S3MapDataset, S3Checkpoint
   - Zero-code migration: users change only import statement
   - Extends s3torchconnector with Azure/GCS/file:// support
   - All runtime tests passing (test_compat_runtime.py)

3. Environment Setup & Tooling
   - setup_env.sh: Supports both uv and pip/venv workflows
   - install_s3dlio_backend.py: Automated DLIO patching
   - verify_s3dlio.py: 5-point integration validation (all passing)
   - Test suite: Import tests + runtime tests with file:// backend

4. Comprehensive Documentation
   - S3DLIO_INTEGRATION.md: Complete usage guide (400+ lines)
   - S3TORCHCONNECTOR_MIGRATION.md: Migration guide in s3dlio repo
   - QUICKSTART.md: 2-minute migration guide
   - SUCCESS_SUMMARY.md: Detailed success report
   - INTEGRATION_SUMMARY.md: Technical project summary
   - QUICKREF.md: Command reference cheat sheet

5. Analysis & Architecture Docs (NEW)
   - ANALYSIS_ZERO_COPY_AND_PLUGINS.md: Performance analysis
   - ZERO_COPY_VISUAL.md: Visual diagrams of zero-copy issues
   - Identified critical bytes() conversion performance bugs
   - Plugin architecture analysis and recommendations

Dependencies:
============
- DLIO Benchmark: main branch from argonne-lcf/dlio_benchmark
- s3dlio: v0.9.39 from local ../s3dlio (editable install)
- Python 3.12.9, PyTorch 2.10.0, TensorFlow 2.20.0
- Package manager: uv (with pip/venv fallback)

Test Results:
============
✅ All 5 integration checks pass (verify_s3dlio.py)
✅ All runtime tests pass (test_compat_runtime.py)
✅ S3IterableDataset streaming works
✅ S3MapDataset random access works
✅ S3Checkpoint save/load works
✅ file:// backend tested successfully

🟡 TODO: Benchmark zero-copy vs current implementation
🟡 TODO: Test with real S3/MinIO endpoints

Architecture:
============
- Multi-protocol support via URI scheme detection
- Zero-copy design (when BytesView conversions removed)
- Compatible with PyTorch DataLoader and NumPy operations
- Backward compatible with existing DLIO configs

Next Steps:
==========
1. Fix zero-copy by removing bytes() conversions
2. Add storage_library YAML config support
3. Create file:// backend test suite
4. Benchmark performance improvements
5. Test with real S3/Azure/GCS endpoints

Performance Expectations (After Zero-Copy Fix):
=============================================
- Throughput: 5-10 GB/s (vs 2-3 GB/s with copies)
- Memory: 1x usage (vs 2-3x with copies)
- CPU: Minimal overhead (no memcpy operations)

perf: Fix zero-copy performance by removing bytes() conversions

Critical Performance Fixes:
- Removed bytes() conversions in s3dlio_storage.py (lines 232, 234)
  Now returns BytesView directly for zero-copy performance
- Updated compat/s3torchconnector.py with dual interface:
  • read() - returns BytesView (zero-copy, fast)
  • read_bytes() - returns bytes (creates copy, compatible)
- Reinstalled s3dlio backend into DLIO with zero-copy fix

Testing & Verification:
- Updated test_compat_runtime.py to verify BytesView and buffer protocol
- All tests pass with zero-copy confirmed
- Created test_zerocopy_direct.py - proves BytesView works with PyTorch/NumPy

Test Infrastructure:
- Created generate_test_data.py - generates 10 NPZ files for testing
- Created zerocopy_file_test.yaml - DLIO config using file:// backend

Key Results:
- BytesView returned throughout (buffer protocol compatible)
- PyTorch torch.frombuffer() works (zero-copy)
- NumPy np.frombuffer() works (zero-copy)
- Memory addresses match between frameworks (proof of zero-copy)
- file:// backend tested successfully (local testing without S3)

Performance Impact:
- Before: 2-3x memory copies → ~2-3 GB/s throughput
- After: 0 copies → ~5-10 GB/s throughput expected
- Memory usage: 50% reduction (no duplicate copies)

Files Modified:
- s3dlio/python/s3dlio/integrations/dlio/s3dlio_storage.py
- s3dlio/python/s3dlio/compat/s3torchconnector.py
- test_compat_runtime.py

Files Added:
- generate_test_data.py
- test_zerocopy_direct.py
- configs/dlio/workload/zerocopy_file_test.yaml
- test_dlio_storage.py

BREAKING CHANGE: S3Item.read() now returns BytesView instead of bytes.
For strict bytes compatibility, use S3Item.read_bytes() instead.

Add storage_library config and multi-endpoint support

Features:
- storage_library YAML config for easy A/B testing (s3dlio vs s3torchconnector)
- Multi-endpoint load balancing (s3dlio native round-robin/random)
- MPI-based endpoint distribution (OMPI_COMM_WORLD_RANK)
- Separate checkpoint storage (different bucket/filesystem)
- S3Client/S3ClientConfig compatibility layer in s3dlio

Implementation:
- Patched DLIO s3_torch_storage.py to support storage_library config
- Extended s3dlio.compat.s3torchconnector with S3Client API
- Added install_storage_library_patch.py for automatic installation
- Created 6 example YAML configs (s3dlio, s3torchconnector, multi-endpoint, MPI, hybrid)

Testing:
- test_storage_library.py - 5 comprehensive tests (all passing)
- test_ab_comparison.py - A/B comparison between libraries
- test_multi_endpoint.py - Multi-endpoint selection logic
- test_mpi_basic.py - MPI environment verification (8 ranks tested)
- test_dlio_mpi.py - DLIO + MPI integration test

Documentation:
- docs/STORAGE_LIBRARY_GUIDE.md - Complete guide to storage_library config
- docs/MULTI_ENDPOINT_GUIDE.md - Multi-endpoint configuration guide (500+ lines)
- README_STORAGE_LIBRARY.md - Implementation summary

Verified:
- Both s3torchconnector and s3dlio work with identical APIs
- MPI environment working (OpenMPI 4.1.6, mpi4py 4.1.1)
- Zero-copy architecture maintained throughout
- Easy A/B testing via single line config change

Add performance benchmarks and comprehensive zero-copy verification

Core Features:
- benchmark_s3dlio_write.py: Uses s3dlio's 300 GB/s Rust-based data generation
  * test_data_generation_speed(): Verifies 50-300 GB/s capability
  * test_s3_write_performance(): Full write benchmark (20-30 GB/s target)
  * test_zero_copy_verification(): PyTorch/NumPy memory address validation
- benchmark_s3dlio_read.py: Zero-copy read benchmark with throughput
- PERFORMANCE_TESTING.md: Complete remote testing guide (5-min quick start)
- ZERO_COPY_CODE_REVIEW.md: Comprehensive 4-path code review
  * Found and documented 1 bug in S3Client reader (bytes() conversion)
  * Verified 95% zero-copy compliance (100% after fix)
- QUICK_TEST_GUIDE.md: Ultra-brief reference for remote deployment

Critical Bug Fix (in s3dlio repo):
- Fixed S3Client._S3Reader.read() line 614: bytes(data) -> data
- Performance impact: Restores 50-70% throughput for non-ranged reads
- Now maintains BytesView zero-copy throughout entire stack

Performance Targets:
- Data generation: 50-300 GB/s (Rust-based, unlimited threads)
- Storage write: 20-30 GB/s (S3/MinIO cluster)
- Storage read: 20-30 GB/s
- Zero memory copies in hot path

Testing Requirements:
- High-performance S3 (MinIO cluster on NVMe)
- 100+ Gbps network
- 16-32 CPU cores
- Validated via file:// backend before remote testing

Add head-to-head library comparison benchmarks

New Features:
- benchmark_write_comparison.py: Write benchmark with library comparison
  * --compare-libraries: Run s3dlio and s3torchconnector back-to-back
  * --library {s3dlio,s3torchconnector}: Test single library
  * Defaults: 2000 files × 100 MB = 200 GB, 32 threads
  * Flexible: Supports 16-500 MB files, 32-64 threads, 200-2000 GB tests

- benchmark_read_comparison.py: Read benchmark with library comparison
  * Same comparison mode for read performance
  * Zero-copy validation for s3dlio
  * Side-by-side throughput comparison

Meeting User Requirements:
✅ Switch between libraries (--library flag)
✅ Head-to-head comparison (--compare-libraries)
✅ 32+ threads (default 32, supports 64+)
✅ 16+ MB files (default 100 MB, supports 16-1000 MB)
✅ 200+ GB data (default 200 GB, supports up to TB+)
✅ Real performance testing at 20-30 GB/s targets

Documentation:
- BENCHMARK_COMPARISON_GUIDE.md: Complete usage guide with examples
- BENCHMARK_TOOLS_SUMMARY.md: Quick reference and validation results
- SESSION_SUMMARY.md: Full session history and testing checklist

Example Usage:
  # Head-to-head comparison (RECOMMENDED)
  python benchmark_write_comparison.py --compare-libraries --endpoint http://localhost:9000

  # Maximum performance (500 MB files, 64 threads)
  python benchmark_write_comparison.py --files 400 --size 500 --threads 64 --compare-libraries

  # Quick validation
  python benchmark_write_comparison.py --skip-write-test

Output Format:
  Metric                    s3dlio          s3torchconnector   Difference
  -------------------------------------------------------------------------
  Throughput (GB/s)         24.50           18.20              1.35x

  🏁 FINAL VERDICT:
     s3dlio is 1.35x FASTER than s3torchconnector
     Performance gain: +34.6%

Tested:
✅ Zero-copy verification works
✅ Data generation (s3dlio Rust backend)
✅ Both libraries import correctly
✅ Command-line arguments parsed correctly

Replace example performance numbers with placeholder notation

Issue: Documentation showed specific performance values (24.50 GB/s, 18.20 GB/s,
etc.) that looked like actual measurements but were only example/placeholder values.

Changes:
- Replaced all specific numbers with placeholder notation:
  * XX.XX = s3dlio throughput
  * YY.YY = s3torchconnector throughput
  * A.BC = Speedup factor
  * T1.TT, T2.TT = Test duration
  * FFF.F, GGG.G = Files per second
  * PP.P = Performance gain %
  * SS.S = Time saved %

- Added clear notes: "Values shown are placeholder examples only"
- Added placeholder legends explaining what each symbol represents
- Changed ranges (24-30 → XX-YY, 18-22 → AA-BB, etc.)

Affected Files:
- BENCHMARK_COMPARISON_GUIDE.md
- BENCHMARK_TOOLS_SUMMARY.md

This makes it crystal clear that these are NOT actual benchmark results;
real performance testing on high-performance hardware is still pending.

feat: Add 4-library support and fix critical unique data generation bug

BREAKING: Write benchmark now generates unique data per file (was reusing same data)

Major Changes:
- Extended both benchmarks to support 4 libraries:
  * s3dlio: Zero-copy, Rust-based (S3/Azure/GCS/file/direct)
  * s3torchconnector: AWS official S3 library
  * minio: MinIO Python SDK (S3-compatible)
  * azstoragetorch: Azure Storage for PyTorch (BlobIO API)

- New comparison modes:
  * --compare LIB1 LIB2 ...: Compare specific libraries
  * --compare-all: Compare all installed libraries
  * --compare-libraries: Legacy 2-way mode (backward compatible)

Critical Bug Fix (Write Benchmark):
- BEFORE: Generated data once, reused for all files (INVALID)
- AFTER: Generates UNIQUE data per file using:
  * s3dlio: s3dlio.generate_data_with_threads() (~1 GB/s per-file)
  * Others: dgen-py streaming API (~0.4 GB/s per-file)
- No copying (generate-only approach, faster than copy)
- Each file has unique content (valid for storage testing)

Data Generation:
- Replaced s3dlio with dgen-py for neutral data generation
- dgen-py is independent library (not tied to s3dlio)
- Available on PyPI: pip install dgen-py

Library-Specific Implementations:
- MinIO: S3-compatible put_object/get_object with BytesIO
- Azure: BlobIO file-like interface with DefaultAzureCredential
- Proper client setup for each library (endpoint parsing, auth)
- Resource cleanup (MinIO: response.close() + release_conn())

Documentation:
- MULTI_LIBRARY_SUPPORT.md: Research and API analysis
- MULTI_LIBRARY_IMPLEMENTATION_SUMMARY.md: Implementation details

Testing:
- All syntax validated
- Library detection logic tested
- Comparison modes verified
- Unique data generation verified (hash testing)
- Ready for production use with MinIO/Azure endpoints

docs: Consolidate documentation into 6 focused guides

Consolidated 20+ markdown files into 6 comprehensive guides in docs/:

New Documentation (6 files):
✅ QUICK_START.md - 5-minute setup and first benchmark
✅ STORAGE_LIBRARIES.md - Complete guide to all 4 libraries
✅ PERFORMANCE_TESTING.md - Comprehensive benchmarking
✅ PARQUET_FORMATS.md - Parquet/HDF5/TFRecord byte-range architecture
✅ S3DLIO_INTEGRATION.md - s3dlio deep dive (existing, kept)
✅ MULTI_ENDPOINT.md - Load balancing (renamed)

Removed 19 redundant files:
- Session docs: SESSION_SUMMARY, MISSION_COMPLETE, SUCCESS_SUMMARY, INTEGRATION_SUMMARY
- Zero-copy: ZERO_COPY_CODE_REVIEW, ZERO_COPY_VISUAL, ANALYSIS_ZERO_COPY_AND_PLUGINS
- Quick starts: QUICKSTART, QUICKREF, QUICK_TEST_GUIDE
- Library docs: MULTI_LIBRARY_SUPPORT, MULTI_LIBRARY_IMPLEMENTATION_SUMMARY, README_STORAGE_LIBRARY, docs/STORAGE_LIBRARY_GUIDE
- Benchmarks: BENCHMARK_COMPARISON_GUIDE, BENCHMARK_TOOLS_SUMMARY, PERFORMANCE_TESTING (root)
- Other: README_S3DLIO, PARQUET_BYTE_RANGE_ARCHITECTURE

Added:
- parquet_byte_range_example.py - Working Parquet byte-range demo

Root directory cleaned: 23 markdown files → 5 (original repo state)
Documentation centralized in docs/ with focused, non-overlapping guides

feat: Add comprehensive s3dlio configs for Azure Blob and data generation

Added complete workflow configs covering both data generation and training phases:

Training Configs (4 variants):
- pytorch_s3dlio.yaml - Production with environment variables (UPDATED)
- pytorch_s3dlio_local_test.yaml - Local testing with hardcoded credentials (NEW)
- pytorch_s3dlio_multiendpoint.yaml - Multi-endpoint load balancing (NEW)
- pytorch_s3dlio_azure.yaml - Azure Blob Storage support (NEW)

Data Generation Configs (3 variants):
- datagen_s3dlio_s3.yaml - Generate to single S3 endpoint (NEW)
- datagen_s3dlio_multiendpoint.yaml - Generate to multi-endpoint (4x faster) (NEW)
- datagen_s3dlio_azure.yaml - Generate to Azure Blob Storage (NEW)

Documentation:
- README_S3DLIO_CONFIGS.md - Complete workflows and examples (NEW)

Key Features:
✅ Environment variable support for secure credential management
✅ Azure Blob Storage configurations (az:// URIs)
✅ Multi-endpoint load balancing for 4x performance
✅ Two-phase workflow: generate data → train
✅ Clear comments explaining data_folder usage
✅ Production and local testing variants

Addresses:
- data_folder clarification (only used during generate_data: True)
- Multiple endpoint configuration (endpoint_uris list)
- Environment variable substitution (${AWS_ACCESS_KEY_ID}, etc.)
- Azure Blob authentication options (connection string, account key, managed identity)

Add s3dlio storage library validation and testing

- Validated s3dlio with PyTorch (NPZ) and TensorFlow (TFRecord)
- Complete round-trip testing (generate -> read with s3dlio)
- Documented test commands in S3DLIO_TEST_RECORD.md
- Added storage library testing status tracking
- Created reference YAML configs for s3dlio integration
- Added handoff document for session continuity (Feb 7, 2026)
- Archived previous test configs
- Updated README for s3dlio command patterns

All tests passing with file:// protocol. Cloud protocols (s3://, az://) pending.
Prepares groundwork for streaming checkpoint implementation.
…s3dlio)

- Add URI-based storage handler with 3 library backends
- Integrate s3dlio v0.9.40 native API (put_bytes, get_bytes, list)
- Apply PR mlcommons#232 fix for empty data_dir handling
- Add comprehensive test suite with 3 validated implementations
- Organize project structure (tests/, docs/, patches/)
- Document MLP vs dpsi architectural comparison

Changes preserved in patches/ directory for flexible integration approach.
Test results: All 3 libraries working (s3torch: 30s, minio: 15s, s3dlio: 31s)
Moved 20 top-level Python test files to tests/integration/:
- benchmark_*_comparison.py (4 files)
- benchmark_s3dlio_*.py (2 files)
- test_*.py (10 files)
- install_*.py (2 files)
- Other utilities (2 files)

These integration tests validate s3dlio, minio, and s3torchconnector
storage libraries and belong with the multi-library support feature.
- Comprehensive strategy for managing two feature branches
- PR readiness action plan with step-by-step workflow
- Executable setup script for branch creation
- Security: Use environment variables for S3 credentials
Optimize checkpoint data generation by replacing torch.rand() and
tf.random.uniform() with dgen-py (Rust-based random data generator).

Performance Improvements:
- PyTorch: torch.rand() → gen_random_tensor() (155x speedup)
- TensorFlow: tf.random.uniform() → gen_random_tensor() (155x speedup)
- Data generation: 1.54 GB/s → 239 GB/s (NumPy → dgen-py)

Key Changes (PR#2):
- dlio_benchmark/dlio_benchmark/checkpointing/pytorch_checkpointing.py
  - Replaced torch.rand() and torch.randint() with gen_random_tensor()
  - Added dtype mapping for NumPy/PyTorch compatibility

- dlio_benchmark/dlio_benchmark/checkpointing/tf_checkpointing.py
  - Replaced tf.random.uniform() with gen_random_tensor()
  - Added dtype mapping for NumPy/TensorFlow compatibility

Test Suite:
- tests/checkpointing/compare_methods.py
  - Comprehensive test comparing original DLIO vs streaming methods
  - Uses dgen_py.create_bytearrays() for 1654x faster buffer allocation

Complete Package:
- Includes full dlio_benchmark package for standalone functionality
- Depends on utility.py gen_random_tensor() (already present in DLIO)
- All __init__.py, configs, and dependencies included

Configuration:
- Set DLIO_DATA_GEN=dgen to enable (auto-fallback to numpy if unavailable)
- Compatible with existing DLIO configs (no config changes required)
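The DLIO_DATA_GEN=dgen switch with NumPy auto-fallback described above can be sketched as below. This is an illustrative sketch only: the `gen_random_tensor` import and its `(shape, dtype)` signature are assumed from the commit message, not taken from the actual patched checkpointing modules, and `make_random_buffer` and `DTYPE_MAP` are hypothetical names.

```python
import os
import numpy as np

# Hypothetical dtype map between framework dtype strings and NumPy dtypes;
# the real mapping lives in the patched checkpointing modules.
DTYPE_MAP = {"float32": np.float32, "float16": np.float16, "int8": np.int8}

def make_random_buffer(shape, dtype="float32"):
    """Generate random checkpoint data, preferring dgen-py when enabled
    via DLIO_DATA_GEN=dgen and auto-falling back to NumPy otherwise,
    mirroring the behavior the commit message describes."""
    if os.environ.get("DLIO_DATA_GEN") == "dgen":
        try:
            # Assumed dgen-py entry point; signature is illustrative.
            from dgen_py import gen_random_tensor
            return gen_random_tensor(shape, DTYPE_MAP[dtype])
        except ImportError:
            pass  # auto-fallback to NumPy when dgen-py is unavailable
    rng = np.random.default_rng()
    return rng.random(shape, dtype=np.float64).astype(DTYPE_MAP[dtype])
```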
… checkpoint I/O

Merge streaming checkpoint implementation from streaming-checkpoint-poc branch
to complete the dgen-py optimization feature set.

This provides two complementary optimizations:
1. dgen-py integration: 155x faster data generation (already in dlio_benchmark/)
2. StreamingCheckpointing: Producer-consumer pattern with minimal memory footprint

StreamingCheckpointing Features:
- Producer-consumer architecture with shared memory buffers
- Multi-backend support (file, s3dlio) via StorageWriter interface
- Buffer pool pattern (4 buffers default, ~128MB vs 24GB for original)
- Overlapping generation and I/O for maximum throughput
- Configurable fadvise modes (none, sequential, dontneed)

Example Usage:
  checkpoint = StreamingCheckpointing(
      chunk_size=32 * 1024 * 1024,  # 32 MB chunks
      num_buffers=4,                 # 128 MB total memory
      use_dgen=True,                 # Use dgen-py for generation
      fadvise_mode='dontneed'        # Drop pages after write
  )
  checkpoint.write_checkpoint(output_path, total_bytes)

Test Suite:
- tests/checkpointing/compare_methods.py demonstrates both approaches:
  - Method 1: Original DLIO (pre-generate all data, uses dgen-py)
  - Method 2: Streaming (producer-consumer, uses dgen-py + StreamingCheckpointing)
  - Method 3: S3Checkpoint compatibility layer test

Files Added:
- mlpstorage/checkpointing/__init__.py
- mlpstorage/checkpointing/streaming_checkpoint.py (427 lines)
- mlpstorage/checkpointing/storage_writers/__init__.py
- mlpstorage/checkpointing/storage_writers/base.py
- mlpstorage/checkpointing/storage_writers/file_writer.py
- mlpstorage/checkpointing/storage_writers/s3dlio_writer.py

This completes the checkpoint optimization work, providing both:
- Speed: dgen-py 155x faster generation
- Memory: StreamingCheckpointing reduces memory from 24GB to 128MB for 24GB checkpoint
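The buffer-pool producer-consumer pattern behind that memory reduction can be sketched generically with `queue.Queue`. This is a minimal sketch of the pattern, not the real StreamingCheckpointing class: `stream_checkpoint` and `write_fn` are hypothetical names, and the real implementation uses shared memory buffers and a StorageWriter interface.

```python
import threading
import queue

def stream_checkpoint(write_fn, total_bytes,
                      chunk_size=32 * 1024 * 1024, num_buffers=4):
    """Producer fills pooled buffers; consumer drains them to storage.
    Peak memory stays at num_buffers * chunk_size (e.g. 4 x 32 MB = 128 MB)
    instead of total_bytes, and generation overlaps with I/O."""
    free = queue.Queue()    # pool of reusable buffers
    ready = queue.Queue()   # filled buffers awaiting I/O
    for _ in range(num_buffers):
        free.put(bytearray(chunk_size))

    def producer():
        remaining = total_bytes
        while remaining > 0:
            buf = free.get()              # block until a buffer is free
            n = min(chunk_size, remaining)
            ready.put((buf, n))           # hand filled buffer to consumer
            remaining -= n
        ready.put(None)                   # end-of-stream marker

    t = threading.Thread(target=producer)
    t.start()
    written = 0
    while (item := ready.get()) is not None:
        buf, n = item
        write_fn(memoryview(buf)[:n])     # the actual storage write
        written += n
        free.put(buf)                     # return buffer to the pool
    t.join()
    return written
```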
- Implement StreamingCheckpointing with producer-consumer pattern
- Add storage writers for s3dlio, minio, and s3torch backends
- Support multi-endpoint load balancing via environment variables
- Enable concurrent checkpoint I/O without blocking training loops
- Add test_streaming_backends.py for multi-library backend testing
- Add demo_checkpoint_methods.sh to demonstrate different checkpoint approaches
- Add demo_streaming_checkpoint.sh for interactive streaming checkpoint demo
- Update tests/README.md with detailed test documentation
- Add MULTI_ENDPOINT_GUIDE.md with comprehensive multi-endpoint documentation
- Add Streaming-Chkpt-Guide.md with StreamingCheckpointing usage guide
- Add pr-stream-chkpt/ directory with PR-specific documentation
- Update README.md with StreamingCheckpointing section
- Remove redundant MULTI_ENDPOINT.md and PR_Readiness_Plan.md
- Update .gitignore to exclude Test-Backup/ and development artifacts
- Remove hardcoded AWS credentials from test_streaming_backends.py
- Remove hardcoded AWS credentials from test_mlp_*.sh scripts
- Replace with environment variable validation and helpful error messages
- Remove internal IP address exposure (172.16.1.40)
- All tests now require AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_ENDPOINT_URL to be set
- Workflow documented elsewhere, not needed in PR
This change enables users to clone the fork and get a complete working
environment with all multi-library storage and StreamingCheckpointing
features without needing to separately manage the dlio_benchmark fork.

Note: This change is ONLY for the integrated-main branch in the personal
fork. The formal PR mlcommons#249 to mlcommons/storage maintains the upstream
argonne-lcf/dlio_benchmark reference.
…ences

- Remove outdated docs files: IMPLEMENTATION_COMPARISON.md, STORAGE_LIBRARY_HANDOFF.md, TF_ObjectBranch-Strategy.md
- Remove all azstoragetorch references from STORAGE_LIBRARIES.md (library removed from project)
- Remove specific performance numbers from PERFORMANCE_TESTING.md (environment-dependent)
- Update PERFORMANCE_TESTING.md to show relative performance only
- Rewrite STORAGE_LIBRARY_TESTING_STATUS.md to focus on HOW to run tests
- Update documentation to reflect 3 supported libraries: s3dlio, minio, s3torchconnector
- Remove azstoragetorch support from benchmark_write_comparison.py
- Remove azstoragetorch support from benchmark_read_comparison.py
- Update documentation to reflect 3 supported libraries (s3dlio, minio, s3torchconnector)
- Remove azstoragetorch examples from PARQUET_FORMATS.md
- Update QUICK_START.md and README_S3DLIO_CONFIGS.md
- Delete outdated HANDOFF_2026-02-07.md document

azstoragetorch was never fully integrated and is not part of the project scope.
The 3 core storage libraries provide complete S3/Azure/GCS coverage via s3dlio.
Russ Fellows added 5 commits February 19, 2026 10:53
Update s3dlio dependency to require version 0.9.50 or newer from PyPI.
This version includes all necessary features for multi-library storage
support and StreamingCheckpointing.
Remove dlio_benchmark directory from git repository since it's now
installed as a dependency from GitHub. This eliminates redundancy:

- dlio_benchmark is installed via: git+https://github.com/russfellows/dlio_benchmark.git@main
- Local directory kept for development but not tracked in git
- Added dlio_benchmark/ to .gitignore
- Backup created: Test-Backup/dlio_benchmark_full_20260219_105808.tar.gz

This makes the repository cleaner and ensures users get dlio_benchmark
from the correct source (russfellows fork with multi-library support).
- Add dgen-py>=0.2.0, minio, s3torchconnector to dependencies
- Remove native Azure backend support (Azure only via s3dlio with az:// URIs)
- Update documentation to clarify Azure Blob Storage exclusively via s3dlio
- Remove broken references to azure_writer.AzureStorageWriter
…support

When --io-trace-log <path> is specified the benchmark runs in pure logical
trace mode: no real GPU/CPU/NVMe I/O is performed. Instead every KV cache
operation is recorded to a structured CSV file for offline replay by an
external storage tool (fio, sai3-bench, warp, etc.).

This enables clean separation between workload generation (what the
benchmark does) and storage validation (what an external tool measures),
which is essential for MLPerf Storage submission workflows.

New flags
---------
--io-trace-log <path>
    Activates trace mode. Path ending in .zst enables streaming zstd
    compression (level 3, ~10-20x ratio). Requires the 'zstandard' package.

--num-gpus N  (default: 1)
    Total GPUs in the tensor-parallel group.
    Effective GPU tier capacity = N x --gpu-mem-gb.
    Example: --num-gpus 8 --gpu-mem-gb 141 models an 8xH200 node (1128 GB HBM).

--tensor-parallel N  (default: 1)
    TP degree for KV cache sharding. Per-rank object sizes in the trace,
    cache stats, and XLSX export are divided by N.
    Must be >= 1 and <= --num-gpus. Non-power-of-2 values emit a warning.
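The capacity and sharding arithmetic these flags imply can be written out in a few lines. This is a hedged sketch of the documented rules, not the benchmark's actual validation code; `per_rank_bytes` and `validate_topology` are illustrative names.

```python
def per_rank_bytes(object_size_bytes: int, tensor_parallel: int) -> int:
    """Each TP rank holds 1/TP of a KV cache object (sizes in the trace
    are divided by the TP degree)."""
    return object_size_bytes // tensor_parallel

def validate_topology(num_gpus: int, tensor_parallel: int, gpu_mem_gb: float):
    """Mirror the documented rules: 1 <= TP <= num_gpus, warn on
    non-power-of-2 TP, and total Tier-0 capacity = num_gpus x gpu_mem_gb."""
    if not (1 <= tensor_parallel <= num_gpus):
        raise ValueError("--tensor-parallel must be >= 1 and <= --num-gpus")
    warning = None
    if tensor_parallel & (tensor_parallel - 1) != 0:  # not a power of 2
        warning = f"TP={tensor_parallel} is not a power of 2"
    total_hbm_gb = num_gpus * gpu_mem_gb
    return total_hbm_gb, warning

# 8xH200 node: 8 GPUs x 141 GB = 1128 GB HBM, TP=8
total_gb, warn = validate_topology(num_gpus=8, tensor_parallel=8, gpu_mem_gb=141)
```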

CSV output format
-----------------
Columns: Timestamp, Operation, Object_Size_Bytes, Tier, Key, Phase
  Timestamp        Unix epoch (float, 6 decimal places)
  Operation        'Write' or 'Read'
  Object_Size_Bytes  TP-adjusted byte size of the KV cache object
  Tier             'Tier-0' (GPU), 'Tier-1' (CPU), 'Tier-2' (NVMe)
  Key              Cache entry identifier for replay tool correlation
  Phase            'Prefill', 'Decode', or 'Evict'
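A trace row with this schema can be produced and parsed with the standard `csv` module. The column names come from the schema above; the writer itself is an illustrative sketch, not the actual IOTracer, and `format_row` is a hypothetical helper.

```python
import csv
import io
import time

COLUMNS = ["Timestamp", "Operation", "Object_Size_Bytes", "Tier", "Key", "Phase"]

def format_row(op, size_bytes, tier, key, phase, ts=None):
    """One trace row: Unix-epoch timestamp with 6 decimal places,
    'Write'/'Read', TP-adjusted size, tier label, key, and phase."""
    ts = time.time() if ts is None else ts
    return {"Timestamp": f"{ts:.6f}", "Operation": op,
            "Object_Size_Bytes": size_bytes, "Tier": tier,
            "Key": key, "Phase": phase}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=COLUMNS)
writer.writeheader()
writer.writerow(format_row("Write", 2_097_152, "Tier-0",
                           "req42_layer0", "Prefill", ts=1700000000.0))

# A replay tool would read the rows back the same way:
rows = list(csv.DictReader(io.StringIO(buf.getvalue())))
```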

Files changed
-------------
kv_cache/tracer.py      New. IOTracer: thread-safe CSV writer with optional
                        zstd compression, Key and Phase columns, context-manager
                        support, clean close() sequence.
kv_cache/backends.py    New NullBackend: no-op write/read that tracks byte
                        counts only; used for all tiers in trace mode.
kv_cache/cache.py       MultiTierCache accepts io_tracer= and tensor_parallel=;
                        TP-adjusted size_bytes in all trace rows; per-rank
                        data slicing in real mode.
kv_cache/benchmark.py   IntegratedBenchmark accepts io_trace_log=, num_gpus=,
                        tensor_parallel=; manages IOTracer lifecycle; banner
                        shows '8x 141 GB GPU (total 1128 GB HBM) | TP=8'.
kv_cache/cli.py         --io-trace-log, --num-gpus, --tensor-parallel args;
                        XLSX export includes Num GPUs, Tensor Parallel, and
                        Total GPU Memory columns.
kv_cache/workload.py    Validates TP <= num_gpus; warns if TP not power-of-2;
                        MAX_GPU_MEMORY_GB 1024->65536; MAX_CPU_MEMORY_GB
                        16384->131072 to support large multi-GPU nodes.
pyproject.toml          'compression' optional extra (zstandard>=0.21);
                        included in 'full' extra.
docs/io_trace_log_usage.md  New user guide: all flags, CSV schema, compression
                        size estimates, seven ready-to-run examples (single GPU,
                        8xH200 TP=8, prefill-only, decode-only, DeepSeek V3),
                        trace inspection shell snippets, model table.
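The ".zst enables streaming compression" behavior can be approximated with the `zstandard` package's streaming writer API. This is a minimal sketch of the file-opening logic only, assuming the documented level-3 compression; `open_trace` is an illustrative name, not the IOTracer's actual API.

```python
import io

def open_trace(path):
    """Open a trace file for text writing; a '.zst' suffix enables
    streaming zstd compression at level 3, as described above.
    Plain paths write uncompressed CSV."""
    raw = open(path, "wb")
    if path.endswith(".zst"):
        import zstandard  # provided by the 'compression' optional extra
        writer = zstandard.ZstdCompressor(level=3).stream_writer(raw)
    else:
        writer = raw
    # Wrap the byte stream so CSV rows can be written as text.
    return io.TextIOWrapper(writer, encoding="utf-8", newline="")
```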
@russfellows russfellows requested a review from a team February 26, 2026 21:54
@russfellows russfellows requested a review from a team as a code owner February 26, 2026 21:54
@github-actions

MLCommons CLA bot:
Thank you very much for your submission, we really appreciate it. Before we can accept your contribution, we ask that you sign the MLCommons CLA (Apache 2). Please use this [Google form] (https://forms.gle/Ew1KkBVpyeJDuRw67) to initiate authorization. If you are from an MLCommons member organization, we will request that you be added to the CLA. If you are not from a member organization, we will email you a CLA to sign. For any questions, please contact support@mlcommons.org.
3 out of 5 committers have signed the MLCommons CLA.
@FileSystemGuy
@hazemawadalla
@dslik
Eva Luator
Russ Fellows
Eva Luator and Russ Fellows do not appear to be GitHub users. You need a GitHub account after you become an MLCommons member. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request

@russfellows
Author

This went into the wrong repository. It was supposed to go into my fork. Sorry, closing.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 26, 2026
@russfellows russfellows deleted the feature/io-trace-log branch February 26, 2026 22:01