Skip to content

chore(bench): experiment postmortems, artifact promotion, log update#126

Open
jack-arturo wants to merge 1 commit intomainfrom
chore/experiment-hygiene-archive
Open

chore(bench): experiment postmortems, artifact promotion, log update#126
jack-arturo wants to merge 1 commit intomainfrom
chore/experiment-hygiene-archive

Conversation

@jack-arturo
Copy link
Member

Summary

Test plan

  • No code changes — documentation and artifacts only
  • make test passes (no functional changes)
  • Postmortem links in EXPERIMENT_LOG.md resolve correctly
  • Promoted artifacts match originals in benchmarks/results/

Add postmortem infrastructure and archive three experiments:
- #79 (accepted): priority_ids fetch bug fix, merged as PR #125
- #74 (rejected): entity expansion precision, zero benchmark delta
- PR #80 (rejected): BM25+rerank+query expansion, -3.83pp regression

Promote 5 comparison JSONs to tests/benchmarks/results/ for durable
record. Update EXPERIMENT_LOG.md with 7 new rows and postmortem links.
Prune 10 experiment worktrees and 9 local branches.
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

Summary by CodeRabbit

Release Notes

  • Documentation

    • Added comprehensive benchmark postmortem reports documenting evaluation outcomes for recent feature improvements, including detailed performance metrics and analysis findings.
  • Tests

    • Added benchmark comparison result snapshots for performance tracking across multiple evaluated configurations.

Walkthrough

This PR documents benchmark results and postmortem analyses for three experiments (Issue #74 entity expansion, PR #80 enhanced recall, Issue #79 priority IDs fetch), adding entries to the experiment log with detailed analysis documents and corresponding benchmark result artifacts.

Changes

Cohort / File(s) Summary
Experiment Log Update
benchmarks/EXPERIMENT_LOG.md
Added 8 new benchmark entries across 2026-03-10 to 2026-03-12 documenting main-refresh, PR #80 port variants, BM25 configurations, entity expansion, and priority IDs experiments with corresponding category breakdowns.
Postmortem Documentation
benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md, benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md, benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
Added three postmortem analysis documents detailing experimental hypotheses, benchmark results, root cause analyses, and follow-up recommendations; Issue #74 and PR #80 marked as REJECTED/NON-PROMOTED, Issue #79 marked as merged/accepted.
Benchmark Result Artifacts
tests/benchmarks/results/compare_*.json
Added 6 new JSON benchmark comparison files recording baseline vs. test accuracy, deltas, and per-category performance metrics for PR #80 variants, Issue #74, and Issue #79 experiments.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

Suggested labels

enhancement

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding benchmark postmortem documentation, promoting artifacts, and updating the experiment log.
Description check ✅ Passed The description provides clear information about the changeset, including the three postmortems, artifact promotion, log updates, and cleanup actions.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch chore/experiment-hygiene-archive
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot added the enhancement New feature or request label Mar 13, 2026
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md (2)

61-71: Make reproduction commands commit-pinned.

Line 63 references the branch in a comment, but the command block should include an explicit checkout to commit a122ba2 so reruns stay deterministic.

Suggested doc patch
 ```bash
+# Reproduce exactly from recorded revision
+git checkout a122ba2
+
 # Full port evaluation
 make bench-eval BENCH=locomo-mini CONFIG=baseline  # on exp/pr80-enhanced-recall-v2
 make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md` around lines 61 -
71, The reproduction commands in the code block (the make bench-eval and make
bench-compare invocations) are not pinned to a commit; prepend an explicit git
checkout to commit a122ba2 before the make commands so reruns are deterministic
(i.e., add a step that runs git checkout a122ba2 prior to the make
bench-eval/make bench-compare commands in the same snippet).

81-86: Consider adding artifact checksums for integrity tracking.

Since these are promoted benchmark records, adding SHA256 values would make future integrity verification straightforward.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md` around lines 81 -
86, Add SHA256 checksums for each promoted artifact to enable integrity
tracking: compute the SHA256 hash for each listed file
(`tests/benchmarks/results/compare_pr80_judge_off_20260311.json`,
`tests/benchmarks/results/compare_pr80_judge_on_20260311.json`,
`tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`) and append
the checksum next to each entry in the "Promoted Artifacts" list (e.g., "-
filename — SHA256: <hex>"), ensuring the exact filename strings from the diff
are used so the checksums clearly map to the artifacts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`:
- Around line 12-13: The JSON entries baseline_file and test_file currently
contain absolute local paths; replace them with repo-relative paths (e.g.,
"benchmarks/results/locomo-mini_baseline_20260310_233631.json" and
"benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json") so the
artifact references are portable and do not leak local filesystem details.

---

Nitpick comments:
In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md`:
- Around line 61-71: The reproduction commands in the code block (the make
bench-eval and make bench-compare invocations) are not pinned to a commit;
prepend an explicit git checkout to commit a122ba2 before the make commands so
reruns are deterministic (i.e., add a step that runs git checkout a122ba2 prior
to the make bench-eval/make bench-compare commands in the same snippet).
- Around line 81-86: Add SHA256 checksums for each promoted artifact to enable
integrity tracking: compute the SHA256 hash for each listed file
(`tests/benchmarks/results/compare_pr80_judge_off_20260311.json`,
`tests/benchmarks/results/compare_pr80_judge_on_20260311.json`,
`tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`) and append
the checksum next to each entry in the "Promoted Artifacts" list (e.g., "-
filename — SHA256: <hex>"), ensuring the exact filename strings from the diff
are used so the checksums clearly map to the artifacts.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f1f949e9-14f3-48ce-90c2-303ad3278cda

📥 Commits

Reviewing files that changed from the base of the PR and between 5d3708c and b2c20aa.

📒 Files selected for processing (10)
  • benchmarks/EXPERIMENT_LOG.md
  • benchmarks/postmortems/.gitkeep
  • benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
  • benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
  • benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
  • tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
  • tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
  • tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
  • tests/benchmarks/results/compare_pr80_judge_off_20260311.json
  • tests/benchmarks/results/compare_pr80_judge_on_20260311.json

Comment on lines +12 to +13
"baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json",
"test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Replace absolute artifact paths with repo-relative paths.
Line 12 and Line 13 embed a local machine path (/Users/jgarturo/...), which hurts portability and leaks local environment details. Keep these paths repo-relative like the other promoted artifacts.

📦 Proposed fix
-  "baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json",
-  "test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"
+  "baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_233631.json",
+  "test_file": "benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json` around
lines 12 - 13, The JSON entries baseline_file and test_file currently contain
absolute local paths; replace them with repo-relative paths (e.g.,
"benchmarks/results/locomo-mini_baseline_20260310_233631.json" and
"benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json") so the
artifact references are portable and do not leak local filesystem details.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant