[DRAFT] Benchmark Platform for Nature Methods Paper #60
- plan_01: Benchmark infrastructure with orchestration, storage, comparison
- plan_02: Dataset acquisition with fail-loud validation and caching
- plan_03: Tool adapters for OpenHCS, CellProfiler, ImageJ, Python
- plan_04: Metric collectors (Time, Memory, GPU, Correctness)
- plan_05: Pipeline equivalence system for fair comparison
All plans include:
- UML class diagrams
- Flow diagrams
- Sequence diagrams
- Complete implementation code
- Integration examples
Ready for implementation following smell-loop approval.
Research findings from publications using BBBC datasets:
- Complete BBBC021/022/038 dataset specifications with real URLs, sizes, formats
- Real CellProfiler pipeline parameters from actual analysis.cppipe files
- Evaluation metrics from NuSeT (2020), Cimini et al. (2023), and other benchmarking papers
- Illumination correction parameters from Singh et al. (2014)
- Ground truth availability and usage strategies
- Preprocessing pipelines and subsetting approaches
Files added:
- plan_02_ADDENDUM_real_dataset_specs.md: Complete BBBC dataset specs, download strategies, validation without checksums
- plan_03_ADDENDUM_real_pipelines.md: Real CellProfiler pipeline from BBBC021 analysis.cppipe with all 27 modules
- plan_04_ADDENDUM_correctness_metrics.md: Pixel-level and object-level evaluation metrics from publications
- RESEARCH_SUMMARY.md: Complete investigation report with all sources cited
All findings sourced from publications, GitHub repos, and BBBC downloads. No handwaving.
Remaining gaps (require downloads to fill):
- BBBC022 filename pattern (need to download 1 plate to reverse-engineer)
- Dataset checksums (not provided by Broad, will compute or skip)
- File manifests (impractical to list 39,600 files, will use count validation)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implement proper ABC-compliant handlers for BBBC datasets:
BBBC021Handler (ImageXpress format):
- Pattern: {Well}_{Site}_{Channel}{UUID}.tif (e.g., G10_s1_w1BEDC2073...tif)
- Channels: w1=DAPI, w2=Tubulin, w4=Actin
- FilenameParser with regex for Well/Site/Channel extraction
- MetadataHandler for CSV metadata (BBBC021_v1_image.csv)
- No virtual mapping needed (already flat structure)
BBBC038Handler (Kaggle nuclei, PNG format):
- Folder-based organization: stage1_train/{ImageId}/images/{ImageId}.png
- No structured filename pattern (uses ImageId as identifier)
- FilenameParser accepts .png files, extracts ImageId from path
- MetadataHandler for metadata.xlsx and CSV labels
- Handles segmentation masks in separate masks/ folders
Both handlers:
- Implement all abstract methods from MicroscopeHandler ABC
- Define compatible_backends (DISK only)
- Auto-register via _microscope_type class attribute
- Support FileManager abstraction throughout
No handwaving - ready for benchmark platform integration.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
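The auto-registration via the `_microscope_type` class attribute mentioned in this commit can be illustrated with a minimal sketch. Everything here is a stand-in: `HANDLER_REGISTRY`, `AutoRegisterMeta`, the `parse_filename` method, and the type strings are hypothetical names chosen for illustration, and the real OpenHCS metaclass may work differently.

```python
from abc import ABCMeta, abstractmethod

# Hypothetical registry; the actual OpenHCS registry is not shown in this PR text.
HANDLER_REGISTRY = {}


class AutoRegisterMeta(ABCMeta):
    """Registers any class whose body declares a _microscope_type attribute."""

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        microscope_type = namespace.get("_microscope_type")
        if microscope_type is not None:
            HANDLER_REGISTRY[microscope_type] = cls


class MicroscopeHandler(metaclass=AutoRegisterMeta):
    """Stand-in for the MicroscopeHandler ABC (one illustrative abstract method)."""

    @abstractmethod
    def parse_filename(self, filename):
        ...


class BBBC021Handler(MicroscopeHandler):
    _microscope_type = "bbbc021"  # assumed type string

    def parse_filename(self, filename):
        return {"filename": filename}


class BBBC038Handler(MicroscopeHandler):
    _microscope_type = "bbbc038"  # assumed type string

    def parse_filename(self, filename):
        return {"filename": filename}
```

With this shape, defining a handler class is enough to make it discoverable; no explicit registration call is needed at the end of the module.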
Implement ABC-compliant handlers with PASSING TESTS for BBBC datasets:
BBBC021Handler (ImageXpress-like with UUID):
- Parses: G10_s1_w1{UUID}.tif (original files in Week#/Week#_##### subdirectories)
- Constructs: G10_s1_w1_z001_t001.tif (virtual workspace with all components)
- Pattern handles BOTH original (with UUID) and virtual (with z/t) filenames
- Flattens Week#/Week#_##### folder structure to plate root
- Adds default z_index=1, timepoint=1 for pattern discovery consistency
- Channels: w1=DAPI, w2=Tubulin, w4=Actin (w3 not used)
BBBC038Handler (Kaggle nuclei, PNG):
- Parses: {hex_id}.png from stage1_train/{ImageId}/images/ subdirectories
- ImageId treated as unique "well" identifier
- Single channel, single site, no Z or timepoint
- Flattens folder structure to stage1_train/ directory
Both handlers:
- Follow virtual workspace architecture: ALL components in constructed filenames
- Implement all MicroscopeHandler ABC methods
- Auto-register via _microscope_type
- Compatible backends: [DISK]
- Ready for benchmark platform integration
Tests included:
- BBBC021: 6 real filenames from BBBC021_v1_image.csv (ALL PASS)
- BBBC038: 3 hex ID filenames (ALL PASS)
- Roundtrip: parse → construct → parse (ALL PASS)
No handwaving - tested with actual BBBC filenames.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
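The parse → construct → parse roundtrip described in this commit can be sketched as follows. This is an illustration, not the handler's actual code: the function names, the returned dictionary keys, and the assumption that the UUID suffix is 36 characters of hex digits and hyphens are all hypothetical.

```python
import re

# Channel names as listed in the commit message (w3 is not used).
CHANNEL_NAMES = {1: "DAPI", 2: "Tubulin", 4: "Actin"}

# One pattern for both forms: original (UUID suffix) and virtual (_z###_t###).
# The 36-character UUID shape is an assumption about the original filenames.
_FILENAME = re.compile(
    r"^(?P<well>[A-Z]\d{2})_s(?P<site>\d+)_w(?P<channel>\d)"
    r"(?:_z(?P<z_index>\d+)_t(?P<timepoint>\d+)|[0-9A-Fa-f-]{36})\.tif$"
)


def parse_filename(filename):
    """Extract components from an original or virtual BBBC021 filename."""
    match = _FILENAME.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized BBBC021 filename: {filename!r}")
    return {
        "well": match["well"],
        "site": int(match["site"]),
        "channel": int(match["channel"]),
        # Original files carry no z/t, so both default to 1.
        "z_index": int(match["z_index"] or 1),
        "timepoint": int(match["timepoint"] or 1),
    }


def construct_filename(components):
    """Build the virtual-workspace filename containing all components."""
    return (
        f"{components['well']}_s{components['site']}_w{components['channel']}"
        f"_z{components['z_index']:03d}_t{components['timepoint']:03d}.tif"
    )
```

Because the constructed name always carries z and t, parsing it again yields the same components as parsing the original, which is the property the roundtrip test checks.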
MICROSCOPE DETECTION & REGISTRATION:
- Add MetadataDetectMixin: reusable detect() implementation delegating to metadata handler
- Add TiffPixelSizeMixin: extract pixel size and channel names from TIFF tags
- BBBC021Handler: implement detect() via filename pattern matching
- BBBC038Handler: implement detect() via stage1_train folder detection
- ImageXpressHandler, OperaPhenixHandler, OpenHCSMicroscopeHandler: use MetadataDetectMixin
- Remove hardcoded handler registration at end of bbbc.py (now automatic via metaclass)
METADATA CACHING:
- Simplify MetadataCache: remove per-file mtime tracking and validation checks
- Cache is now explicit-clear-only (no automatic invalidation)
- Reduces complexity while maintaining correctness for single-plate workflows
REGISTRY DISCOVERY:
- LazyDiscoveryDict: skip cache when secondary registries present
- After discovery, populate secondary registries via _register_secondary hook
- Prevents stale cache from blocking secondary registry population
SIGNAL BATCHING (ImageBrowser performance):
- ColumnFilterWidget.select_all/select_none: always block signals during batch updates
- Emit single filter_changed signal at end instead of N signals
- Fixes signal storm when clicking 'None' button on 96-well filter (96 -> 1 signal)
DEPENDENCIES:
- Add tqdm>=4.66.5 for progress indication
This refactor improves:
- Microscope detection: deterministic, side-effect-free, testable
- Code reuse: mixins eliminate duplication across handlers
- Performance: signal batching prevents UI thrashing
- Maintainability: explicit registration removed, automatic via metaclass
ARCHITECTURE:
- Contracts: ToolAdapter, MetricCollector, DatasetSpec (immutable specs)
- Datasets: Registry of BBBC021, BBBC022, BBBC038 with download/extract/validate
- Pipelines: Registry of benchmark pipelines (nuclei_segmentation)
- Metrics: TimeMetric (perf_counter), MemoryMetric (RSS sampling)
- Adapters: OpenHCSAdapter implementing ToolAdapter contract
- Runner: Orchestrates tool validation, dataset acquisition, execution
DATASET ACQUISITION:
- Download with progress bars (tqdm)
- Extract zip archives atomically
- Validate by image count (±5% tolerance) or manifest
- Cache to ~/.cache/openhcs/benchmark_datasets/{id}/
- Fast path: skip re-download if cached and valid
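The count-based validation with ±5% tolerance listed above could look roughly like this. It is a sketch under assumptions: the function name, the suffix list, and the error type are hypothetical and are not taken from the benchmark code.

```python
from pathlib import Path


def validate_image_count(root, expected, tolerance=0.05,
                         suffixes=(".tif", ".tiff", ".png")):
    """Count image files under root and fail loudly if outside tolerance."""
    found = sum(
        1 for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in suffixes
    )
    lower = expected * (1 - tolerance)
    upper = expected * (1 + tolerance)
    if not lower <= found <= upper:
        raise ValueError(
            f"{root}: found {found} images, expected {expected} "
            f"(allowed range {lower:.0f} to {upper:.0f})"
        )
    return found
```

A cached dataset that passes this check can take the fast path and skip the download. One that fails raises immediately instead of silently benchmarking on partial data.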
OPENHCS ADAPTER:
- Validates OpenHCS installation
- Creates FileManager and microscope handler
- Runs minimal segmentation pipeline: blur → threshold → label
- Supports parameter validation (threshold_method, declump_method, diameter_range)
- Collects metrics via context managers
- Returns normalized BenchmarkResult with provenance
METRICS:
- TimeMetric: wall-clock execution time (perf_counter)
- MemoryMetric: peak RSS memory in background thread (psutil)
- Both implement MetricCollector ABC (context manager pattern)
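A minimal sketch of the context-manager pattern described above, using only the standard library. The method names (`start`, `stop`, `result`) and the `PeakMetric` class are assumptions for illustration. The commit's MemoryMetric samples RSS with psutil, so here the sampling function is injected and could be something like `lambda: psutil.Process().memory_info().rss` where psutil is installed.

```python
import threading
import time
from abc import ABC, abstractmethod


class MetricCollector(ABC):
    """Context-manager contract: start on enter, stop on exit, then read result()."""

    @abstractmethod
    def start(self): ...

    @abstractmethod
    def stop(self): ...

    @abstractmethod
    def result(self): ...

    def __enter__(self):
        self.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.stop()
        return False  # never swallow exceptions from the measured block


class TimeMetric(MetricCollector):
    """Wall-clock time via time.perf_counter()."""

    def start(self):
        self._t0 = time.perf_counter()
        self._elapsed = 0.0

    def stop(self):
        self._elapsed = time.perf_counter() - self._t0

    def result(self):
        return self._elapsed


class PeakMetric(MetricCollector):
    """Tracks the peak of read_value() sampled from a background thread."""

    def __init__(self, read_value, interval=0.01):
        self._read = read_value
        self._interval = interval
        self._stop_event = threading.Event()
        self._peak = None

    def _run(self):
        # Event.wait returns False on timeout, True once stop() sets the event.
        while not self._stop_event.wait(self._interval):
            self._peak = max(self._peak, self._read())

    def start(self):
        self._peak = self._read()
        self._stop_event.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        self._stop_event.set()
        self._thread.join()
        self._peak = max(self._peak, self._read())  # final sample

    def result(self):
        return self._peak
```

Usage follows the usual pattern: `with TimeMetric() as t: run_pipeline()` and then `t.result()`, where `run_pipeline` is a placeholder for the measured work.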
PIPELINES:
- NUCLEI_SEGMENTATION: Otsu threshold + morphological operations
- Parameters: opening_radius, diameter_range, fill_holes
- Extensible: easy to add CELL_PAINTING, etc.
DATASETS:
- BBBC021_SINGLE_PLATE: 720 images, 839 MB
- BBBC022_SINGLE_PLATE_DNA: 3,456 images, 7.8 GB
- BBBC038_FULL: 33,215 images, 382 MB
- All with validation rules and microscope type
This enables:
- Reproducible benchmarking across tools
- Standardized metrics collection
- Dataset caching and validation
- Easy tool adapter implementation
- Extensible pipeline registry
…tion
SUMMARY
=======
Add complete CellProfiler conversion infrastructure for benchmarking OpenHCS against CellProfiler. Uses a two-phase approach: one-time library absorption (LLM converts entire CellProfiler library), then instant .cppipe conversion (registry lookup, no LLM needed at conversion time).
CONVERTER INFRASTRUCTURE (benchmark/converter/)
===============================================
- absorb.py: CLI for one-time library absorption
  python -m benchmark.converter.absorb --model google/gemini-3-flash-preview
- library_absorber.py: Core absorption logic
  - Scans cellprofiler_source/library/modules/_*.py
  - LLM converts each to OpenHCS format
  - Validates: syntax, @numpy decorator, 'image' first param, no relative imports
  - Writes to cellprofiler_library/functions/
- llm_converter.py: Dual-backend LLM converter
  - Ollama (local): model names like 'qwen2.5-coder:7b'
  - OpenRouter (cloud): model names like 'google/gemini-3-flash-preview'
  - Auto-detects backend from model name format (org/model = OpenRouter)
- system_prompt.py: Comprehensive first-principles OpenHCS explanation (~470 lines)
  - Dimensional dataflow architecture
  - ProcessingContract semantics (PURE_2D, PURE_3D, FLEXIBLE, VOLUMETRIC_TO_SLICE)
  - Multi-input operations (stack along dim 0, unstack inside function)
  - special_outputs/special_inputs for labels and measurements
  - Conversion rules and template
- contract_inference.py: Runtime contract inference
- source_locator.py: CellProfiler source code locator
- parser.py: .cppipe file parser
- pipeline_generator.py: Generate OpenHCS pipelines
- settings_binder.py: Bind .cppipe settings to function kwargs
- convert.py: CLI for .cppipe conversion
ABSORBED LIBRARY (benchmark/cellprofiler_library/)
=================================================
26 CellProfiler modules converted to OpenHCS functions:
closing, colortogray, combineobjects, convertimagetoobjects, convertobjectstoimage, correctilluminationapply, crop, dilateimage, enhanceedges, enhanceorsuppressfeatures, erodeimage, erodeobjects, expandorshrinkobjects, fillobjects, gaussianfilter, measureimageoverlap, measureobjectsizeshape, medialaxis, medianfilter, morphologicalskeleton, opening, overlayobjects, reducenoise, savecroppedobjects, threshold, watershed
CELLPROFILER SOURCE (benchmark/cellprofiler_source/)
====================================================
Extracted CellProfiler source code for LLM reference:
- modules/: 90 module class files
- library/modules/: 27 pure algorithm implementations
- library/functions/: Core utility functions
- library/opts/: Enums and options
EXAMPLE PIPELINES
=================
- benchmark/cellprofiler_pipelines/: Original .cppipe files + converted
- benchmark/pipelines/: OpenHCS benchmark pipelines (numpy, cupy, gpu variants)
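The four validation checks listed for library_absorber.py (syntax, @numpy decorator, 'image' first param, no relative imports) can be expressed with the standard `ast` module. This is a sketch of one way to do it, not the absorber's actual code. The function name and the choice to check only public top-level functions are assumptions.

```python
import ast


def _decorator_name(node):
    """Return the bare name of a decorator such as @numpy, @x.numpy or @numpy(...)."""
    if isinstance(node, ast.Call):
        node = node.func
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return None


def validate_converted_source(source):
    """Return a list of problems found in LLM-generated source (empty if valid)."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    errors = []
    for node in ast.walk(tree):
        # level > 0 means "from . import x" or "from ..pkg import y".
        if isinstance(node, ast.ImportFrom) and node.level > 0:
            errors.append(f"relative import on line {node.lineno}")

    # Assumption: only public top-level functions must satisfy the contract.
    functions = [
        n for n in tree.body
        if isinstance(n, ast.FunctionDef) and not n.name.startswith("_")
    ]
    if not functions:
        errors.append("no public function defined")
    for func in functions:
        if not any(_decorator_name(d) == "numpy" for d in func.decorator_list):
            errors.append(f"{func.name}: missing @numpy decorator")
        params = func.args.posonlyargs + func.args.args
        if not params or params[0].arg != "image":
            errors.append(f"{func.name}: first parameter must be 'image'")
    return errors
```

Since `ast.parse` never executes the code, the check is safe to run on untrusted LLM output before anything is written to cellprofiler_library/functions/.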
EXPERIMENTAL - may be reverted.
- flash_config.py: Remove max_fps cap (None instead of 60)
- geometry_tracking.py: New orthogonal geometry tracking
  - WidgetSizeMonitor: Detects size changes in watched widgets
  - AutoGeometryTracker: Discovers geometry-affecting widgets
  - FlashGeometryTracker: Queues flashes during layout changes
  - Eliminates timing race conditions by state transitions, not arbitrary delays
CHANGES:
- system_prompt.py: Request structured JSON output with contract, category, confidence, reasoning
- llm_converter.py: Parse JSON response, populate ConversionResult with LLM-inferred metadata
- library_absorber.py: Use LLM-inferred values instead of hardcoded pure_2d/0.5 defaults
- pipeline_generator.py: Map category → variable_components (z_projection→Z_INDEX, channel_operation→CHANNEL)
- Removed LLM fallback mode - purely deterministic conversion from absorbed library
- Deleted broken ExampleHuman_openhcs.py (garbage from early LLM run)
CONTRACTS.JSON NOW INCLUDES:
- contract: PURE_2D | PURE_3D | FLEXIBLE | VOLUMETRIC_TO_SLICE
- category: image_operation | z_projection | channel_operation
- confidence: 0.0-1.0 (LLM's confidence in inference)
- reasoning: Why this contract/category was chosen
PIPELINE GENERATION:
- Fail-loud if modules missing from absorbed library (no fallback)
- variable_components derived from LLM-inferred category
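To make the contracts.json fields concrete, here is a hedged sketch of parsing and validating one LLM response in a fail-loud way. The four field names and their allowed values come from the commit message. The `InferredMetadata` dataclass and `parse_llm_metadata` function are hypothetical names, and the real ConversionResult may carry more fields.

```python
import json
from dataclasses import dataclass

CONTRACTS = {"PURE_2D", "PURE_3D", "FLEXIBLE", "VOLUMETRIC_TO_SLICE"}
CATEGORIES = {"image_operation", "z_projection", "channel_operation"}
REQUIRED = {"contract", "category", "confidence", "reasoning"}


@dataclass(frozen=True)
class InferredMetadata:
    contract: str
    category: str
    confidence: float
    reasoning: str


def parse_llm_metadata(raw):
    """Parse one JSON object from the LLM; raise ValueError on any violation."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["contract"] not in CONTRACTS:
        raise ValueError(f"unknown contract: {data['contract']!r}")
    if data["category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {data['category']!r}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return InferredMetadata(
        data["contract"], data["category"], confidence, str(data["reasoning"])
    )
```

Rejecting malformed responses at this boundary keeps the later pipeline generation step deterministic, since it only ever sees entries that passed validation.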
…bsorbed modules
Implemented LLM-powered converter system that transpiles CellProfiler pipelines (.cppipe) into native OpenHCS pipelines. Successfully absorbed all 88 CellProfiler modules using Claude Opus 4.5 and converted both benchmark pipelines (ExampleHuman and ExampleFly) to runnable OpenHCS code.
Three-phase system:
- (1) Absorption - LLM extracts pure algorithms from CellProfiler source, infers contracts and categories
- (2) Parsing - deterministic .cppipe parsing
- (3) Generation - maps modules to OpenHCS functions with proper variable_components
Key features: ROI+CSV materialization for segmentation, infrastructure module handling (LoadData/ExportToSpreadsheet), retry logic, registry system with contracts.json.
Results: 88 absorbed modules (segmentation, measurements, image processing, morphology, projections, transformations), 2 converted pipelines (ExampleHuman 4 modules, ExampleFly 9 modules).
Technical highlights: CamelCase registry fix, dual-axis resolution integration, special I/O handling, fail-loud error handling.
- Fixed parameter name normalization to exactly match SettingsBinder logic
  - Parenthetical content (e.g., '(Min,Max)') is now removed before normalization
  - This fixes mapping of tuple parameters like 'Typical diameter (Min,Max)' -> [min_diameter, max_diameter]
- Fixed FunctionStep API usage to use tuple pattern: func=(function, {kwargs})
  - Previously, kwargs were incorrectly passed directly to FunctionStep
  - The kwargs dict is now passed as the second element of the tuple
- Backfilled parameter mappings for 83/88 absorbed CellProfiler functions
  - Used Gemini 3.0 Flash to generate mappings from original source + absorbed function
  - Mappings stored in function docstrings as single source of truth
  - Added backfill_parameter_mappings.py script
- Generated pipelines now have proper kwargs instead of comments
  - ExampleFly: min_diameter=10, max_diameter=40 correctly mapped from tuple
  - ExampleHuman: min_diameter=8, max_diameter=80 correctly mapped from tuple
  - All other parameters properly translated using docstring mappings
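The normalization and tuple mapping described in this commit can be sketched as below. The function names, the mapping dictionary format, the "10,40" raw value format, and the integer conversion are all assumptions for illustration. The actual SettingsBinder logic and docstring mapping format are not shown in this PR text.

```python
import re


def normalize_setting_name(name):
    """Drop parenthetical content, then lowercase and snake_case the rest."""
    without_parens = re.sub(r"\([^)]*\)", "", name)
    return re.sub(r"[^a-z0-9]+", "_", without_parens.lower()).strip("_")


def bind_setting(name, raw_value, mappings):
    """Map one setting to function kwargs; fail loudly on unknown names."""
    key = normalize_setting_name(name)
    if key not in mappings:
        raise KeyError(f"no parameter mapping for setting {name!r} ({key})")
    targets = mappings[key]
    if isinstance(targets, str):
        return {targets: raw_value}
    # Tuple setting: split "min,max" across several kwargs (assumed integer values).
    parts = [part.strip() for part in raw_value.split(",")]
    if len(parts) != len(targets):
        raise ValueError(f"{name!r}: expected {len(targets)} values, got {len(parts)}")
    return dict(zip(targets, (int(part) for part in parts)))
```

With a mapping of `{"typical_diameter": ["min_diameter", "max_diameter"]}`, binding `'Typical diameter (Min,Max)'` with the value `"10,40"` yields the two kwargs reported for ExampleFly above.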
…semantics
Used LLM (Gemini 3.0 Flash Preview) to analyze all 88 absorbed functions and determine
correct categories based on input shape expectations and iteration semantics.
Changes:
- Created recategorize_functions.py script for LLM-based recategorization
- Updated contracts.json with 7 category changes (81 unchanged)
Category changes:
z_projection (3 functions):
- MakeProjection: Processes z-stacks (D, H, W) → (H, W) projections
- Morphologicalskeleton: Has volumetric parameter for 3D processing
- TrackObjects: Processes temporal sequences (frames over time)
channel_operation (4 functions):
- CorrectIlluminationCalculate: Per-channel illumination correction
- IdentifyPrimaryObjects: Segment same marker across all sites per channel
- RescaleIntensity: Per-channel intensity normalization
- Tile: Assembles sites into montage per channel
Impact:
- IdentifyPrimaryObjects now uses VariableComponents.CHANNEL instead of SITE
- MakeProjection now uses VariableComponents.Z_INDEX instead of SITE
- Generated pipelines have semantically correct iteration order
- Functions receive correct input shapes based on their processing semantics
All changes verified against OpenHCS PURE_2D contract behavior:
- PURE_2D unstacks dim 0 and calls function on each (H, W) slice
- variable_components controls what dim 0 represents (sites, channels, or z-slices)
- Total function calls remain the same, only iteration order changes
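The claim that the total number of function calls is unchanged can be checked with a toy model. This is not OpenHCS code: images are represented only by their metadata, and `build_stacks` and `run_pure_2d` are hypothetical helpers that mimic "group everything except the variable component, then call the function once per slice along dim 0".

```python
from collections import defaultdict
from itertools import product


def build_stacks(images, variable_component):
    """Group images so that dim 0 of each stack runs over variable_component."""
    stacks = defaultdict(list)
    for image in images:
        key = tuple(sorted(
            (name, value) for name, value in image.items()
            if name != variable_component
        ))
        stacks[key].append(image)
    return list(stacks.values())


def run_pure_2d(func, images, variable_component):
    """Call func once per slice of every stack; return the number of calls."""
    calls = 0
    for stack in build_stacks(images, variable_component):
        for slice_2d in stack:  # PURE_2D: unstack dim 0, one call per (H, W) slice
            func(slice_2d)
            calls += 1
    return calls


# Toy plate: 2 sites x 3 channels = 6 images.
images = [{"site": s, "channel": c} for s, c in product((1, 2), (1, 2, 3))]
```

Grouping by site gives three stacks of two slices, grouping by channel gives two stacks of three, and both orders visit each of the six images exactly once.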
…ents semantics
Updated LLM recategorization prompt with correct dimensional dataflow semantics:
- image_operation (SITE): Single-channel operations across all sites (default)
- z_projection (Z_INDEX): Functions that NEED z-stacks (projections, 3D ops)
- channel_operation (CHANNEL): Functions that NEED multiple channels simultaneously
Results:
- channel_operation (4): ColorToGray, GrayToColorRgb, MeasureColocalization, UnmixColors
- z_projection (2): MakeProjection, Morphologicalskeleton
- image_operation (82): Everything else (single-channel operations)
Fixed incorrect categorizations from previous run:
- IdentifyPrimaryObjects: channel_operation → image_operation ✓
- CorrectIlluminationCalculate: channel_operation → image_operation ✓
- RescaleIntensity: channel_operation → image_operation ✓
- Tile: channel_operation → image_operation ✓
- TrackObjects: z_projection → image_operation ✓ (time-lapse uses sequential_components)
Added correct categorizations:
- ColorToGray: image_operation → channel_operation ✓
- MeasureColocalization: image_operation → channel_operation ✓
- UnmixColors: image_operation → channel_operation ✓
- GrayToColorRgb: image_operation → channel_operation ✓ (manual fix)
Regenerated pipelines with correct variable_components.
UnmixColors has PURE_2D contract, which means it receives (H, W) and processes each site independently. PURE_2D with channel_operation would unstack dimension 0 and process each channel independently, which defeats the purpose.
Dimensional dataflow rule:
- PURE_2D contract → ALWAYS image_operation (processes each site independently)
- FLEXIBLE/PURE_3D contract → can be channel_operation or z_projection (processes dim 0 together)
Final categorizations:
- channel_operation (3): ColorToGray, GrayToColorRgb, MeasureColocalization
  All have FLEXIBLE contract and process multiple channels together
- z_projection (2): MakeProjection, Morphologicalskeleton
  Process z-stacks (volumetric data)
- image_operation (83): Everything else, including all PURE_2D functions
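The dimensional dataflow rule stated in this commit, combined with the category → variable_components mapping from the earlier commits, can be written as a small consistency check. This is a sketch: the `VariableComponents` enum below is a local stand-in for the OpenHCS one, and `resolve_variable_component` is a hypothetical helper name.

```python
from enum import Enum


class VariableComponents(Enum):
    """Local stand-in for the OpenHCS enum referenced in the commit messages."""
    SITE = "site"
    CHANNEL = "channel"
    Z_INDEX = "z_index"


CATEGORY_TO_COMPONENT = {
    "image_operation": VariableComponents.SITE,
    "z_projection": VariableComponents.Z_INDEX,
    "channel_operation": VariableComponents.CHANNEL,
}


def resolve_variable_component(contract, category):
    """Apply the rule: PURE_2D functions must be image_operation."""
    if category not in CATEGORY_TO_COMPONENT:
        raise ValueError(f"unknown category: {category!r}")
    if contract == "PURE_2D" and category != "image_operation":
        raise ValueError(
            f"PURE_2D cannot be {category}: it would unstack dim 0 and "
            "process each slice independently"
        )
    return CATEGORY_TO_COMPONENT[category]
```

Running a check like this over every contracts.json entry would have flagged the UnmixColors case (PURE_2D with channel_operation) automatically.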
Key changes:
1. measure_colocalization: Added channel_1/channel_2 params for arbitrary N-channel input
2. gray_to_color_rgb: Added red/green/blue_channel params for arbitrary N-channel input
3. gray_to_color_cmyk: Added channel selection params for arbitrary N-channel input
4. Fixed @numpy decorator: Removed invalid contract=ProcessingContract.X usage
5. Removed unused ProcessingContract imports from all 88 functions
6. Rewrote __init__.py with dynamic function loading from contracts.json
7. Regenerated pipelines with correct variable_components
The dimensional dataflow compiler perspective:
- Dimension 0 can be of ARBITRARY size (1, 2, 3, 4, 5, ... N)
- Functions should parameterize channel selection, not hardcode indices
- ProcessingContract is orthogonal to variable_components
1. Removed unused ProcessingContract import from header template
2. Removed duplicate imports in header template
3. Changed to dynamic function loading with get_function()
4. Fixed measurecolocalization parameter mapping:
   - 'Select images to measure' -> (pipeline-handled) (requires pipeline context)
   - 'Run all metrics?' -> (pipeline-handled) (multi-param not auto-mappable)
5. Regenerated ExampleFly and ExampleHuman pipelines with clean parameters
1. Removed duplicate parameter mapping from _outline helper function
2. Added correct mapping to identify_tertiary_objects docstring
3. Object selection settings ('Select the larger/smaller identified objects')
are now (pipeline-handled) since they're @special_inputs
4. Only shrink_primary is an actual function parameter
In OpenHCS, @special_inputs are wired at compile time by name matching,
not passed as string parameters. CellProfiler's object naming convention
doesn't map directly to function kwargs.
- Categorized all 88 absorbed functions into FLEXIBLE vs PURE_3D contracts
- Identified critical architectural issues:
  * Contract mismatch (PURE_2D vs PURE_3D)
  * Tuple handling bug in _execute_pure_2d
  * Inconsistent special outputs format
- Created phased refactoring plan with timeline and risk mitigation
- Documented 14 FLEXIBLE functions (support true 3D + slice-by-slice)
- Documented 74 PURE_3D functions (always internal slicing)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Comprehensive design document covering:
- Architecture comparison (CellProfiler vs OpenHCS)
- Identified abstraction leaks (A1-A3, B1-B4, C1-C4)
- What we're certain about (contract system, aggregation orthogonality)
- Design proposal: AggregationSpec and compile-time symbol resolution
- Implementation phases
- Open questions for further discussion
Detailed mapping of:
- Core concept mapping (pipeline, data containers, object model)
- Semantic gaps requiring new concepts (ObjectRegistry, etc.)
- Adapter layer design for CellProfiler modules
- ProcessingContract mapping
- Measurement naming conventions
- Settings system mapping
- Abstraction leak analysis
…sign doc
Includes:
- Essential files to read (OpenHCS core + CellProfiler integration)
- Detailed execution flow diagram
- ProcessingContract implementation with code snippets
- Special outputs system explanation
- CellProfiler workspace structure
- Absorbed function patterns (current buggy vs required)
- Key terms glossary
- Quick reference: what to read when

Benchmark Platform for Nature Methods Paper
Overview
Complete architectural plans for benchmarking OpenHCS against CellProfiler, ImageJ, and Python scripts for the Nature Methods publication.
Plans Included
✅ plan_01_benchmark_infrastructure.md
Orchestration layer and comparison engine
run_benchmark()
✅ plan_02_dataset_acquisition.md
Automatic dataset download and validation
✅ plan_03_tool_adapters.md
Normalize heterogeneous tools to uniform interface
✅ plan_04_metric_collectors.md
Context manager metrics for transparent collection
✅ plan_05_pipeline_equivalence.md
Equivalent analysis pipelines across all tools
Architecture Highlights
Orthogonal Concerns
Each plan solves ONE problem completely:
Declarative API
Fail-Loud Philosophy
Platform, Not Application
Diagrams Included
All plans include:
Implementation Status
Next Steps
Expected Results
Based on OpenHCS architecture:
This is a DRAFT PR for planning purposes. Implementation will follow smell-loop approval.
Pull Request opened by Augment Code with guidance from the PR author