chore(bench): experiment postmortems, artifact promotion, log update by jack-arturo · Pull Request #126 · verygoodplugins/automem

jack-arturo · 2026-03-13T17:20:23Z

Summary

Add benchmarks/postmortems/ directory with 3 postmortems:
- Bug: priority_ids parameter only boosts relevance, doesn't fetch specific memories #79 (accepted): priority_ids fetch bug fix, merged as PR fix(recall): priority_ids parameter only boosts relevance (#79) #125, zero benchmark delta
- recall: Graph expansion follows too many hops through generic/hub nodes #74 (rejected): entity expansion precision, zero delta — benchmark doesn't exercise graph expansion
- PR feat: enhanced recall — BM25 search, LLM reranking, query expansion #80 (rejected): BM25+rerank+query expansion, -3.83pp regression driven by open-domain -11.4pp
Promote 5 comparison JSON artifacts to committed tests/benchmarks/results/
Update EXPERIMENT_LOG.md with 7 new rows + postmortem links
Prune 10 experiment worktrees and 9 local branches

Test plan

No code changes — documentation and artifacts only
make test passes (no functional changes)
Postmortem links in EXPERIMENT_LOG.md resolve correctly
Promoted artifacts match originals in benchmarks/results/

Add postmortem infrastructure and archive three experiments: - #79 (accepted): priority_ids fetch bug fix, merged as PR #125 - #74 (rejected): entity expansion precision, zero benchmark delta - PR #80 (rejected): BM25+rerank+query expansion, -3.83pp regression Promote 5 comparison JSONs to tests/benchmarks/results/ for durable record. Update EXPERIMENT_LOG.md with 7 new rows and postmortem links. Prune 10 experiment worktrees and 9 local branches.

coderabbitai · 2026-03-13T17:20:44Z

📝 Walkthrough

Summary by CodeRabbit

Release Notes

Documentation
- Added comprehensive benchmark postmortem reports documenting evaluation outcomes for recent feature improvements, including detailed performance metrics and analysis findings.
Tests
- Added benchmark comparison result snapshots for performance tracking across multiple evaluated configurations.

Walkthrough

This PR documents benchmark results and postmortem analyses for three experiments (Issue #74 entity expansion, PR #80 enhanced recall, Issue #79 priority IDs fetch), adding entries to the experiment log with detailed analysis documents and corresponding benchmark result artifacts.

Changes

Cohort / File(s)	Summary
Experiment Log Update `benchmarks/EXPERIMENT_LOG.md`	Added 8 new benchmark entries across 2026-03-10 to 2026-03-12 documenting main-refresh, PR `#80` port variants, BM25 configurations, entity expansion, and priority IDs experiments with corresponding category breakdowns.
Postmortem Documentation `benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md`, `benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md`, `benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md`	Added three postmortem analysis documents detailing experimental hypotheses, benchmark results, root cause analyses, and follow-up recommendations; Issue `#74` and PR `#80` marked as REJECTED/NON-PROMOTED, Issue `#79` marked as merged/accepted.
Benchmark Result Artifacts `tests/benchmarks/results/compare_*.json`	Added 6 new JSON benchmark comparison files recording baseline vs. test accuracy, deltas, and per-category performance metrics for PR `#80` variants, Issue `#74`, and Issue `#79` experiments.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

chore(bench): add health check, baselines, and fix eval harness #102: Modifies benchmark experiment artifacts and infrastructure in related directories
docs(bench): add PR #73, #80, and #87 experiment results #103: Updates benchmarks/EXPERIMENT_LOG.md with experiment entries including PR #80 references

Suggested labels

enhancement

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: adding benchmark postmortem documentation, promoting artifacts, and updating the experiment log.
Description check	✅ Passed	The description provides clear information about the changeset, including the three postmortems, artifact promotion, log updates, and cleanup actions.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch chore/experiment-hygiene-archive

📝 Coding Plan

Generate coding plan for human review comments

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md (2)

61-71: Make reproduction commands commit-pinned.

Line 63 references the branch in a comment, but the command block should include an explicit checkout to commit a122ba2 so reruns stay deterministic.

Suggested doc patch

 ```bash
+# Reproduce exactly from recorded revision
+git checkout a122ba2
+
 # Full port evaluation
 make bench-eval BENCH=locomo-mini CONFIG=baseline  # on exp/pr80-enhanced-recall-v2
 make bench-compare BENCH=locomo-mini CONFIG=baseline BASELINE=baseline

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md` around lines 61 -
71, The reproduction commands in the code block (the make bench-eval and make
bench-compare invocations) are not pinned to a commit; prepend an explicit git
checkout to commit a122ba2 before the make commands so reruns are deterministic
(i.e., add a step that runs git checkout a122ba2 prior to the make
bench-eval/make bench-compare commands in the same snippet).

81-86: Consider adding artifact checksums for integrity tracking.

Since these are promoted benchmark records, adding SHA256 values would make future integrity verification straightforward.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md` around lines 81 -
86, Add SHA256 checksums for each promoted artifact to enable integrity
tracking: compute the SHA256 hash for each listed file
(`tests/benchmarks/results/compare_pr80_judge_off_20260311.json`,
`tests/benchmarks/results/compare_pr80_judge_on_20260311.json`,
`tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`) and append
the checksum next to each entry in the "Promoted Artifacts" list (e.g., "-
filename — SHA256: <hex>"), ensuring the exact filename strings from the diff
are used so the checksums clearly map to the artifacts.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`:
- Around line 12-13: The JSON entries baseline_file and test_file currently
contain absolute local paths; replace them with repo-relative paths (e.g.,
"benchmarks/results/locomo-mini_baseline_20260310_233631.json" and
"benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json") so the
artifact references are portable and do not leak local filesystem details.

---

Nitpick comments:
In `@benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md`:
- Around line 61-71: The reproduction commands in the code block (the make
bench-eval and make bench-compare invocations) are not pinned to a commit;
prepend an explicit git checkout to commit a122ba2 before the make commands so
reruns are deterministic (i.e., add a step that runs git checkout a122ba2 prior
to the make bench-eval/make bench-compare commands in the same snippet).
- Around line 81-86: Add SHA256 checksums for each promoted artifact to enable
integrity tracking: compute the SHA256 hash for each listed file
(`tests/benchmarks/results/compare_pr80_judge_off_20260311.json`,
`tests/benchmarks/results/compare_pr80_judge_on_20260311.json`,
`tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json`) and append
the checksum next to each entry in the "Promoted Artifacts" list (e.g., "-
filename — SHA256: <hex>"), ensuring the exact filename strings from the diff
are used so the checksums clearly map to the artifacts.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f1f949e9-14f3-48ce-90c2-303ad3278cda

📥 Commits

Reviewing files that changed from the base of the PR and between 5d3708c and b2c20aa.

📒 Files selected for processing (10)

benchmarks/EXPERIMENT_LOG.md
benchmarks/postmortems/.gitkeep
benchmarks/postmortems/2026-03-11_issue74_entity_expansion_precision.md
benchmarks/postmortems/2026-03-11_pr80_enhanced_recall.md
benchmarks/postmortems/2026-03-12_issue79_priority_ids_fetch.md
tests/benchmarks/results/compare_issue74_entity_precision_20260311.json
tests/benchmarks/results/compare_issue79_priority_ids_20260311.json
tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json
tests/benchmarks/results/compare_pr80_judge_off_20260311.json
tests/benchmarks/results/compare_pr80_judge_on_20260311.json

coderabbitai · 2026-03-13T17:24:54Z

tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json

+  "baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json",
+  "test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"


⚠️ Potential issue | 🟠 Major

Replace absolute artifact paths with repo-relative paths.
Line 12 and Line 13 embed a local machine path (/Users/jgarturo/...), which hurts portability and leaks local environment details. Keep these paths repo-relative like the other promoted artifacts.

📦 Proposed fix

- "baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json", - "test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json" + "baseline_file": "benchmarks/results/locomo-mini_baseline_20260310_233631.json", + "test_file": "benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/benchmarks/results/compare_pr80_bm25_only_f10_judge_off.json` around lines 12 - 13, The JSON entries baseline_file and test_file currently contain absolute local paths; replace them with repo-relative paths (e.g., "benchmarks/results/locomo-mini_baseline_20260310_233631.json" and "benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json") so the artifact references are portable and do not leak local filesystem details.

coderabbitai bot added the enhancement New feature or request label Mar 13, 2026

coderabbitai bot reviewed Mar 13, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(bench): experiment postmortems, artifact promotion, log update#126

chore(bench): experiment postmortems, artifact promotion, log update#126
jack-arturo wants to merge 1 commit intomainfrom
chore/experiment-hygiene-archive

jack-arturo commented Mar 13, 2026

Uh oh!

coderabbitai bot commented Mar 13, 2026 •

edited

Loading

Summary by CodeRabbit

Release Notes

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		"baseline_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_baseline_20260310_233631.json",
		"test_file": "/Users/jgarturo/Projects/OpenAI/automem/benchmarks/results/locomo-mini_pr80_bm25_only_f10_20260311_025443.json"

Conversation

jack-arturo commented Mar 13, 2026

Summary

Test plan

Uh oh!

coderabbitai bot commented Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

coderabbitai bot commented Mar 13, 2026 •

edited

Loading