chore(bench): experiment postmortems, artifact promotion, log update #126
jack-arturo wants to merge 1 commit into main
Add postmortem infrastructure and archive three experiments:

- #79 (accepted): priority_ids fetch bug fix, merged as PR #125
- #74 (rejected): entity expansion precision, zero benchmark delta
- PR #80 (rejected): BM25+rerank+query expansion, -3.83pp regression

Promote 5 comparison JSONs to tests/benchmarks/results/ for durable record. Update EXPERIMENT_LOG.md with 7 new rows and postmortem links. Prune 10 experiment worktrees and 9 local branches.
Walkthrough

This PR documents benchmark results and postmortem analyses for three experiments (Issue …
Actionable comments posted: 1
🧹 Nitpick comments (2)
benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md (2)
61-71: Make reproduction commands commit-pinned.

Line 63 references the branch in a comment, but the command block should include an explicit checkout to commit `a122ba2` so reruns stay deterministic.

Suggested doc patch

```bash
+# Reproduce exactly from recorded revision
+git checkout a122ba2
+
 # Full port evaluation
 make bench-eval BENCH=locomo-mini CONFIG=baseline  # on exp/pr80-enhanced-recall-v2
 make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md` around lines 61 - 71, The reproduction commands in the code block (the make bench-eval and make bench-compare invocations) are not pinned to a commit; prepend an explicit git checkout to commit a122ba2 before the make commands so reruns are deterministic (i.e., add a step that runs git checkout a122ba2 prior to the make bench-eval/make bench-compare commands in the same snippet).
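The pinning requested here could also be checked programmatically before a rerun. The following is a minimal Python sketch, not part of the PR: `current_head` and `matches_pin` are hypothetical helper names, and the only external call is `git rev-parse HEAD`. It assumes the short SHA `a122ba2` quoted in the comment is the revision to reproduce from.

```python
import subprocess

# Short SHA quoted in the review comment; treated here as the pinned revision.
PINNED_COMMIT = "a122ba2"


def current_head(repo_dir: str = ".") -> str:
    """Return the full SHA of HEAD in repo_dir (requires git on PATH)."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def matches_pin(head_sha: str, pinned: str = PINNED_COMMIT) -> bool:
    """True when the full HEAD SHA starts with the pinned (possibly short) SHA."""
    return bool(pinned) and head_sha.lower().startswith(pinned.lower())
```

A wrapper script might call `matches_pin(current_head())` and refuse to start the benchmark when it returns `False`, which keeps the check separate from the `make` targets themselves.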
81-86: Consider adding artifact checksums for integrity tracking.

Since these are promoted benchmark records, adding SHA256 values would make future integrity verification straightforward.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md` around lines 81 - 86, Add SHA256 checksums for each promoted artifact to enable integrity tracking: compute the SHA256 hash for each listed file (`tests/benchmarks/results/compare_pr80_judge_off_20260311.json`, `tests/benchmarks/results/compare_pr80_judge_on_20260311.json`, `tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`) and append the checksum next to each entry in the "Promoted Artifacts" list (e.g., "- filename — SHA256: <hex>"), ensuring the exact filename strings from the diff are used so the checksums clearly map to the artifacts.
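One way to produce the checksum entries described above is sketched below in Python using only the standard library. The helper names `sha256_of` and `checksum_lines` are illustrative assumptions rather than anything in the repository. The output line format follows the `- filename — SHA256: <hex>` example from the comment.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large result files are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(paths):
    """Build one 'Promoted Artifacts' list entry per path, in the given order."""
    # \u2014 is the dash used in the reviewer's example format.
    return [f"- {p} \u2014 SHA256: {sha256_of(p)}" for p in paths]
```

Run from the repository root, `checksum_lines` could be given the three `tests/benchmarks/results/compare_pr80_*.json` paths named in the comment and its output pasted into the postmortem.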
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: f1f949e9-14f3-48ce-90c2-303ad3278cda
📒 Files selected for processing (10)
- benchmarks/EXPERIMENT_LOG.md
- benchmarks/postmortems/.gitkeep
- benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
- benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
- benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
- tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
- tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
- tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
- tests/benchmarks/results/compare_pr80_judge_off_20260311.json
- tests/benchmarks/results/compare_pr80_judge_on_20260311.json
    "baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json",
    "test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"

Replace absolute artifact paths with repo-relative paths.
Line 12 and Line 13 embed a local machine path (/Users/jgarturo/...), which hurts portability and leaks local environment details. Keep these paths repo-relative like the other promoted artifacts.
📦 Proposed fix
    - "baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json",
    - "test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"
    + "baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_233631.json",
    + "test_file": "benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json` around
lines 12 - 13, The JSON entries baseline_file and test_file currently contain
absolute local paths; replace them with repo-relative paths (e.g.,
"benchmarks/results/locomo-mini_baseline_20260310_233631.json" and
"benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json") so the
artifact references are portable and do not leak local filesystem details.
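The path rewrite in this comment can be sketched as a small Python helper. This is a hedged illustration only: `to_repo_relative` and `relativize_artifact` are hypothetical names, the repository root is supplied by the caller, and the sketch assumes `baseline_file` and `test_file` are top-level keys of the comparison JSON, which the excerpt does not confirm.

```python
from pathlib import PurePosixPath


def to_repo_relative(path: str, repo_root: str) -> str:
    """Return path relative to repo_root; leave it unchanged if that is not possible."""
    candidate = PurePosixPath(path)
    if not candidate.is_absolute():
        return path  # already relative
    try:
        return str(candidate.relative_to(repo_root))
    except ValueError:
        return path  # absolute but outside the repository


def relativize_artifact(doc: dict, repo_root: str,
                        keys=("baseline_file", "test_file")) -> dict:
    """Copy doc with the listed path fields rewritten as repo-relative paths."""
    out = dict(doc)
    for key in keys:
        if isinstance(out.get(key), str):
            out[key] = to_repo_relative(out[key], repo_root)
    return out
```

The input dictionary is copied rather than mutated, so the original artifact stays available for comparison. Applying something like this at promotion time, rather than by hand, would also keep later artifacts free of local paths.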
Summary

- `benchmarks/postmortems/` directory with 3 postmortems
- 5 comparison JSONs promoted to `tests/benchmarks/results/`
- `EXPERIMENT_LOG.md` with 7 new rows + postmortem links

Test plan

- `make test` passes (no functional changes)
- `benchmarks/results/`