-
Notifications
You must be signed in to change notification settings - Fork 46
Open
Description
NeMo-Evaluator doesn't clone the repo but pip installs it and runs it in its own container.
So it must fetch files and dependencies differently.
Problem: the current PR introduces lots of changes
NeMo's execution flow
1. Discovers the framework via nemo_evaluator/openhands_benchmarks/framework.yml
2. Installs the package via pip
3. Runs the command from framework.yml → python3 -m benchmarks.scripts.run_benchmark ...
4. run_benchmark.py calls generate_config() then invokes individual benchmark CLI entrypoints
5. Parses results via output.py which reads *.report.json → looks for report["benchmark"]
---
For each partially-related change, the minimal version:
1. Docker image handling → Minimal: just the env var overrides
The PR adds two things: env var overrides + try/except fallback logic everywhere.
Keep: IMAGE_TAG_PREFIX env var in version.py + the SDK_SHORT_SHA → IMAGE_TAG_PREFIX renames across files. Also EVAL_AGENT_SERVER_IMAGE env var in constants.py. And
_local_image_exists() in image_utils.py. These are one-liners that let NeMo point to its own registry/tags.
Drop: All the try/except pre-built-image fallback blocks in commit0, gaia, swtbench, multiswebench, openagentsafety prepare_workspace(). These are ~15-20 line blocks
added to 5 different benchmarks. NeMo can set SKIP_BUILD=1 (which already exists for swebench/swtbench/multiswebench) or ensure images are pre-pulled. For commit0/gaia
that lack skip-build, a simple SKIP_BUILD env var check would be much smaller than a full try/except/fallback pattern.
Saves: ~100 lines of duplicated try/except blocks.
---
2. --conversation-timeout → Minimal: remove from framework.yml
The PR adds the flag to args_parser.py, EvalMetadata, fake_user_response.py, and all 7 benchmarks. But the only reason it's here is that framework.yml references it.
Minimal: Don't include --conversation-timeout in framework.yml's command template. Remove it from run_benchmark.py. The benchmarks already work without it
(conversations run until completion). This is a genuinely useful feature, but it belongs in a separate PR.
Saves: ~50 lines across 10 files.
---
3. --skip-failed-samples → Minimal: remove from framework.yml
Same story. The PR adds the flag, a SampleFailedError class, and new logic in evaluation.py's run loop.
Minimal: Don't include --skip-failed-samples in framework.yml. The existing behavior (skip failures, continue) already works for NeMo. The skip_failed_samples: false
default in framework.yml actually makes things worse for NeMo (stops on first failure).
Saves: ~40 lines across 9 files + the entire SampleFailedError class.
---
4. uv fallback → Keep, but extract to a helper
This IS needed — NeMo's container won't have uv. But the same 10-line pattern is copy-pasted 4 times (swebench, swebenchmultimodal, multiswebench, swtbench eval
scripts).
Minimal: Create a tiny utility function:
def get_python_cmd(module: str) -> list[str]:
"""Return command to run a Python module, preferring uv if available."""
...
Call it from each eval script. Same functionality, ~30 fewer lines of duplication.
---
5. LLM config refactor → Minimal: resolve api_key_env in generate_config() instead
The PR creates load_llm_config() with api_key_env resolution + base_url stripping, then refactors all 7 benchmarks to use it.
Minimal: Have generate_config() resolve the env var and normalize the URL at generation time, writing a clean JSON with literal api_key and clean base_url. Then
existing benchmark code reads it unchanged — zero modifications to any benchmark's main().
# In generate_llm_config.py:
if api_key_env:
llm_config["api_key"] = os.environ[api_key_env] # resolve immediately
if base_url and base_url.endswith("/chat/completions"):
base_url = base_url.removesuffix("/chat/completions")
Saves: The entire llm_config.py file + changes to all 7 benchmark main() functions (~50 lines).
---
6. EVAL_AGENT_SERVER_IMAGE env var → Keep as-is
It's a 3-line change in constants.py. Minimal and necessary.
---
7. Packaging improvements → Keep, but trim package-data
Keep: nemo_evaluator* in packages.find, nemo_evaluator = ["**/*.yml"], deferred imports, version.py fallback, Path.relative_to fallback.
Drop: benchmarks = ["**/*.j2", "**/Dockerfile*", "**/*.json"] — Dockerfiles and JSON schemas aren't needed at pip-install runtime. The .j2 templates are found via
Path(__file__).parent / "prompts" which works with pip install already. If some edge case needs them, add only "**/*.j2".
---
8. Error output enrichment → Drop entirely
The error_type/error_message/error_stack fields in test_result and traceback.format_exc() capture are nice debugging improvements but NeMo's output.py doesn't read
them. Pure noise in a NeMo PR.
---
Summary: what the minimal NeMo PR looks like
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────┐
│ What to keep │ Lines │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ nemo_evaluator/ directory (framework.yml, output.py, init.py) — simplified framework.yml without conversation-timeout/skip-failed-samples │ ~280 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ run_benchmark.py + generate_llm_config.py (with api_key resolved at generation) — no load_llm_config.py │ ~250 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ "benchmark" field added to all .report.json outputs │ ~30 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ OpenAgentSafety generate_report() │ ~40 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ IMAGE_TAG_PREFIX + EVAL_AGENT_SERVER_IMAGE env var overrides │ ~15 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ _local_image_exists() utility │ ~15 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ pyproject.toml changes (dependency, entrypoints, package-data for .yml) │ ~15 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ version.py graceful fallback │ ~10 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ Deferred imports + Path.relative_to fallback │ ~20 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ uv fallback (as a small utility, not copy-pasted) │ ~25 │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
│ Total │ ~700 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────┘
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────┐
│ What to drop / split into separate PRs │ Lines saved │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ --conversation-timeout (all plumbing) │ ~50 │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ --skip-failed-samples + SampleFailedError │ ~40 │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ load_llm_config.py + refactoring all benchmark main() functions │ ~50 │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ Try/except pre-built image fallback blocks (5 benchmarks) │ ~100 │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ Error output enrichment (error_type/stack in test_result) │ ~20 │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ All 6 unrelated bug fixes (GAIA critic, pd.isna, error re-raising, MultiSWEBench build, MultiSWEBench dataset name, ruff config) │ ~80 │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
│ Total saved │ ~340 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘
That gets the PR from ~1325 additions down to ~700, with a much clearer "this is NeMo integration" story and no unrelated bug fixes polluting the review.
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels