Nemo Evaluator Integration with OpenHands #447

@simonrosenberg

Description

NeMo-Evaluator doesn't clone the repo; it pip-installs the package and runs it inside its own container.
So it must fetch files and dependencies differently.

Problem: the current PR bundles many partially related changes, which makes it hard to review as a single NeMo integration.

  NeMo's execution flow

  1. Discovers the framework via nemo_evaluator/openhands_benchmarks/framework.yml
  2. Installs the package via pip
  3. Runs the command from framework.yml → python3 -m benchmarks.scripts.run_benchmark ...
  4. run_benchmark.py calls generate_config() then invokes individual benchmark CLI entrypoints
  5. Parses results via output.py which reads *.report.json → looks for report["benchmark"]
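
  Step 5 above can be sketched as follows. This is a reading of the described flow, not output.py's actual code; the function name and file layout are assumptions, only the `*.report.json` glob and the `report["benchmark"]` lookup come from the flow above:

  ```python
  import json
  from pathlib import Path

  def collect_reports(output_dir: str) -> dict[str, dict]:
      """Index every *.report.json in output_dir by its "benchmark" field,
      the way output.py is described to. Reports missing the field are skipped."""
      reports = {}
      for path in Path(output_dir).glob("*.report.json"):
          report = json.loads(path.read_text())
          benchmark = report.get("benchmark")
          if benchmark is not None:
              reports[benchmark] = report
      return reports
  ```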

  ---
  For each partially related change, here is the minimal version:

  1. Docker image handling → Minimal: just the env var overrides

  The PR adds two things: env var overrides + try/except fallback logic everywhere.

  Keep: IMAGE_TAG_PREFIX env var in version.py + the SDK_SHORT_SHA → IMAGE_TAG_PREFIX renames across files. Also EVAL_AGENT_SERVER_IMAGE env var in constants.py. And
  _local_image_exists() in image_utils.py. These are one-liners that let NeMo point to its own registry/tags.
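
  A plausible shape for `_local_image_exists()`, to illustrate why it is small enough to keep. The `docker image inspect` invocation is an assumption; the PR's actual implementation may differ:

  ```python
  import subprocess

  def _local_image_exists(image: str) -> bool:
      """Check whether a Docker image is already present locally, so callers
      can skip pulling or building. Sketch only; details are assumed."""
      result = subprocess.run(
          ["docker", "image", "inspect", image],
          capture_output=True,  # suppress docker's own output
      )
      return result.returncode == 0
  ```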

  Drop: All the try/except pre-built-image fallback blocks in commit0, gaia, swtbench, multiswebench, openagentsafety prepare_workspace(). These are ~15-20 line blocks
  added to 5 different benchmarks. NeMo can set SKIP_BUILD=1 (which already exists for swebench/swtbench/multiswebench) or ensure images are pre-pulled. For commit0/gaia
  that lack skip-build, a simple SKIP_BUILD env var check would be much smaller than a full try/except/fallback pattern.
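
  The commit0/gaia check could be as small as the following. The env var name mirrors the existing SKIP_BUILD convention for swebench/swtbench/multiswebench; the exact semantics ("0" or empty means build) are an assumption:

  ```python
  import os

  def should_skip_build() -> bool:
      """Mirror the existing SKIP_BUILD convention: any value other than
      empty or "0" means "use pre-pulled images, don't build"."""
      return os.environ.get("SKIP_BUILD", "0") not in ("", "0")
  ```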

  Saves: ~100 lines of duplicated try/except blocks.

  ---
  2. --conversation-timeout → Minimal: remove from framework.yml

  The PR adds the flag to args_parser.py, EvalMetadata, fake_user_response.py, and all 7 benchmarks. But the only reason it's here is that framework.yml references it.

  Minimal: Don't include --conversation-timeout in framework.yml's command template. Remove it from run_benchmark.py. The benchmarks already work without it
  (conversations run until completion). This is a genuinely useful feature, but it belongs in a separate PR.

  Saves: ~50 lines across 10 files.

  ---
  3. --skip-failed-samples → Minimal: remove from framework.yml

  Same story. The PR adds the flag, a SampleFailedError class, and new logic in evaluation.py's run loop.

  Minimal: Don't include --skip-failed-samples in framework.yml. The existing behavior (skip failures, continue) already works for NeMo. The skip_failed_samples: false
  default in framework.yml actually makes things worse for NeMo (stops on first failure).

  Saves: ~40 lines across 9 files + the entire SampleFailedError class.
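
  For reference, the existing skip-and-continue behavior this point relies on is roughly the following. The loop shape and names are assumptions, not evaluation.py's actual code:

  ```python
  import logging

  logger = logging.getLogger(__name__)

  def run_all(samples, run_sample, write_result) -> int:
      """Evaluate each sample; log and skip failures instead of aborting.
      Returns the number of samples that completed."""
      completed = 0
      for sample in samples:
          try:
              result = run_sample(sample)
          except Exception:
              logger.exception("sample failed; skipping")
              continue
          write_result(result)
          completed += 1
      return completed
  ```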

  ---
  4. uv fallback → Keep, but extract to a helper

  This IS needed — NeMo's container won't have uv. But the same 10-line pattern is copy-pasted 4 times (swebench, swebenchmultimodal, multiswebench, swtbench eval
  scripts).

  Minimal: Create a tiny utility function:
  def get_python_cmd(module: str) -> list[str]:
      """Return command to run a Python module, preferring uv if available."""
      ...
  Call it from each eval script. Same functionality, ~30 fewer lines of duplication.
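
  Filled in, the helper might look like this. Assumes `uv run python -m` is the uv-side invocation the eval scripts currently use; if the copy-pasted pattern invokes uv differently, adjust accordingly:

  ```python
  import shutil
  import sys

  def get_python_cmd(module: str) -> list[str]:
      """Return command to run a Python module, preferring uv if available."""
      if shutil.which("uv"):  # uv present: let it manage the environment
          return ["uv", "run", "python", "-m", module]
      # NeMo's container has no uv; fall back to the current interpreter
      return [sys.executable, "-m", module]
  ```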

  ---
  5. LLM config refactor → Minimal: resolve api_key_env in generate_config() instead

  The PR creates load_llm_config() with api_key_env resolution + base_url stripping, then refactors all 7 benchmarks to use it.

  Minimal: Have generate_config() resolve the env var and normalize the URL at generation time, writing a clean JSON with literal api_key and clean base_url. Then
  existing benchmark code reads it unchanged — zero modifications to any benchmark's main().

  # In generate_llm_config.py:
  if api_key_env:
      llm_config["api_key"] = os.environ[api_key_env]  # resolve immediately
  if base_url and base_url.endswith("/chat/completions"):
      base_url = base_url.removesuffix("/chat/completions")

  Saves: The entire llm_config.py file + changes to all 7 benchmark main() functions (~50 lines).

  ---
  6. EVAL_AGENT_SERVER_IMAGE env var → Keep as-is

  It's a 3-line change in constants.py. Minimal and necessary.

  ---
  7. Packaging improvements → Keep, but trim package-data

  Keep: nemo_evaluator* in packages.find, nemo_evaluator = ["**/*.yml"], deferred imports, version.py fallback, Path.relative_to fallback.

  Drop: benchmarks = ["**/*.j2", "**/Dockerfile*", "**/*.json"] — Dockerfiles and JSON schemas aren't needed at pip-install runtime. The .j2 templates are found via
  Path(__file__).parent / "prompts" which works with pip install already. If some edge case needs them, add only "**/*.j2".
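
  The Path.relative_to fallback kept above presumably guards the pip-installed case, along these lines. Sketch only; the helper name and call sites are assumptions:

  ```python
  from pathlib import Path

  def safe_relative(path: Path, base: Path) -> Path:
      """relative_to() raises ValueError when path is not under base
      (e.g. when the package is pip-installed into site-packages rather
      than run from a repo checkout); fall back to the path itself."""
      try:
          return path.relative_to(base)
      except ValueError:
          return path
  ```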

  ---
  8. Error output enrichment → Drop entirely

  The error_type/error_message/error_stack fields in test_result and traceback.format_exc() capture are nice debugging improvements but NeMo's output.py doesn't read
  them. Pure noise in a NeMo PR.

  ---
  Summary: what the minimal NeMo PR looks like
  ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────┐
  │                                                               What to keep                                                                │ Lines │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ nemo_evaluator/ directory (framework.yml, output.py, __init__.py) — simplified framework.yml without conversation-timeout/skip-failed-samples │ ~280  │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ run_benchmark.py + generate_llm_config.py (with api_key resolved at generation) — no load_llm_config.py                                   │ ~250  │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ "benchmark" field added to all .report.json outputs                                                                                       │ ~30   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ OpenAgentSafety generate_report()                                                                                                         │ ~40   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ IMAGE_TAG_PREFIX + EVAL_AGENT_SERVER_IMAGE env var overrides                                                                              │ ~15   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ _local_image_exists() utility                                                                                                             │ ~15   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ pyproject.toml changes (dependency, entrypoints, package-data for .yml)                                                                   │ ~15   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ version.py graceful fallback                                                                                                              │ ~10   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ Deferred imports + Path.relative_to fallback                                                                                              │ ~20   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ uv fallback (as a small utility, not copy-pasted)                                                                                         │ ~25   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ Total                                                                                                                                     │ ~700  │
  └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────┘
  ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────┐
  │                                              What to drop / split into separate PRs                                              │ Lines saved │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ --conversation-timeout (all plumbing)                                                                                            │ ~50         │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ --skip-failed-samples + SampleFailedError                                                                                        │ ~40         │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ load_llm_config.py + refactoring all benchmark main() functions                                                                  │ ~50         │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ Try/except pre-built image fallback blocks (5 benchmarks)                                                                        │ ~100        │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ Error output enrichment (error_type/stack in test_result)                                                                        │ ~20         │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ All 6 unrelated bug fixes (GAIA critic, pd.isna, error re-raising, MultiSWEBench build, MultiSWEBench dataset name, ruff config) │ ~80         │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ Total saved                                                                                                                      │ ~340        │
  └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘
  That gets the PR from ~1325 additions down to ~700, with a much clearer "this is NeMo integration" story and no unrelated bug fixes polluting the review.
