Nemo Evaluator Integration with OpenHands #447

@simonrosenberg

Description

NeMo-Evaluator doesn't clone the repo; it pip-installs the package and runs it inside its own container.
So it must fetch files and dependencies differently.

Problem: the current PR bundles many partially related changes, which makes it hard to review as a single NeMo integration.

  NeMo's execution flow

  1. Discovers the framework via nemo_evaluator/openhands_benchmarks/framework.yml
  2. Installs the package via pip
  3. Runs the command from framework.yml → python3 -m benchmarks.scripts.run_benchmark ...
  4. run_benchmark.py calls generate_config() then invokes individual benchmark CLI entrypoints
  5. Parses results via output.py which reads *.report.json → looks for report["benchmark"]
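
  Step 5 above can be sketched as follows. This is a reading of the described flow, not output.py's actual code; the function name and file layout are assumptions, only the `*.report.json` glob and the `report["benchmark"]` lookup come from the flow above:

  ```python
  import json
  from pathlib import Path

  def collect_reports(output_dir: str) -> dict[str, dict]:
      """Index every *.report.json in output_dir by its "benchmark" field,
      the way output.py is described to. Reports missing the field are skipped."""
      reports = {}
      for path in Path(output_dir).glob("*.report.json"):
          report = json.loads(path.read_text())
          benchmark = report.get("benchmark")
          if benchmark is not None:
              reports[benchmark] = report
      return reports
  ```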

  ---
  For each partially related change, here is the minimal version:

  1. Docker image handling → Minimal: just the env var overrides

  The PR adds two things: env var overrides + try/except fallback logic everywhere.

  Keep: IMAGE_TAG_PREFIX env var in version.py + the SDK_SHORT_SHA → IMAGE_TAG_PREFIX renames across files. Also EVAL_AGENT_SERVER_IMAGE env var in constants.py. And
  _local_image_exists() in image_utils.py. These are one-liners that let NeMo point to its own registry/tags.
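
  A plausible shape for `_local_image_exists()`, to illustrate why it is small enough to keep. The `docker image inspect` invocation is an assumption; the PR's actual implementation may differ:

  ```python
  import subprocess

  def _local_image_exists(image: str) -> bool:
      """Check whether a Docker image is already present locally, so callers
      can skip pulling or building. Sketch only; details are assumed."""
      result = subprocess.run(
          ["docker", "image", "inspect", image],
          capture_output=True,  # suppress docker's own output
      )
      return result.returncode == 0
  ```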

  Drop: All the try/except pre-built-image fallback blocks in commit0, gaia, swtbench, multiswebench, openagentsafety prepare_workspace(). These are ~15-20 line blocks
  added to 5 different benchmarks. NeMo can set SKIP_BUILD=1 (which already exists for swebench/swtbench/multiswebench) or ensure images are pre-pulled. For commit0/gaia
  that lack skip-build, a simple SKIP_BUILD env var check would be much smaller than a full try/except/fallback pattern.
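
  The commit0/gaia check could be as small as the following. The env var name mirrors the existing SKIP_BUILD convention for swebench/swtbench/multiswebench; the exact semantics ("0" or empty means build) are an assumption:

  ```python
  import os

  def should_skip_build() -> bool:
      """Mirror the existing SKIP_BUILD convention: any value other than
      empty or "0" means "use pre-pulled images, don't build"."""
      return os.environ.get("SKIP_BUILD", "0") not in ("", "0")
  ```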

  Saves: ~100 lines of duplicated try/except blocks.

  ---
  2. --conversation-timeout → Minimal: remove from framework.yml

  The PR adds the flag to args_parser.py, EvalMetadata, fake_user_response.py, and all 7 benchmarks. But the only reason it's here is that framework.yml references it.

  Minimal: Don't include --conversation-timeout in framework.yml's command template. Remove it from run_benchmark.py. The benchmarks already work without it
  (conversations run until completion). This is a genuinely useful feature, but it belongs in a separate PR.

  Saves: ~50 lines across 10 files.

  ---
  3. --skip-failed-samples → Minimal: remove from framework.yml

  Same story. The PR adds the flag, a SampleFailedError class, and new logic in evaluation.py's run loop.

  Minimal: Don't include --skip-failed-samples in framework.yml. The existing behavior (skip failures, continue) already works for NeMo. The skip_failed_samples: false
  default in framework.yml actually makes things worse for NeMo (stops on first failure).

  Saves: ~40 lines across 9 files + the entire SampleFailedError class.
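
  For reference, the existing skip-and-continue behavior this point relies on is roughly the following. The loop shape and names are assumptions, not evaluation.py's actual code:

  ```python
  import logging

  logger = logging.getLogger(__name__)

  def run_all(samples, run_sample, write_result) -> int:
      """Evaluate each sample; log and skip failures instead of aborting.
      Returns the number of samples that completed."""
      completed = 0
      for sample in samples:
          try:
              result = run_sample(sample)
          except Exception:
              logger.exception("sample failed; skipping")
              continue
          write_result(result)
          completed += 1
      return completed
  ```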

  ---
  4. uv fallback → Keep, but extract to a helper

  This IS needed — NeMo's container won't have uv. But the same 10-line pattern is copy-pasted 4 times (swebench, swebenchmultimodal, multiswebench, swtbench eval
  scripts).

  Minimal: Create a tiny utility function:
  def get_python_cmd(module: str) -> list[str]:
      """Return command to run a Python module, preferring uv if available."""
      ...
  Call it from each eval script. Same functionality, ~30 fewer lines of duplication.
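
  Filled in, the helper might look like this. Assumes `uv run python -m` is the uv-side invocation the eval scripts currently use; if the copy-pasted pattern invokes uv differently, adjust accordingly:

  ```python
  import shutil
  import sys

  def get_python_cmd(module: str) -> list[str]:
      """Return command to run a Python module, preferring uv if available."""
      if shutil.which("uv"):  # uv present: let it manage the environment
          return ["uv", "run", "python", "-m", module]
      # NeMo's container has no uv; fall back to the current interpreter
      return [sys.executable, "-m", module]
  ```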

  ---
  5. LLM config refactor → Minimal: resolve api_key_env in generate_config() instead

  The PR creates load_llm_config() with api_key_env resolution + base_url stripping, then refactors all 7 benchmarks to use it.

  Minimal: Have generate_config() resolve the env var and normalize the URL at generation time, writing a clean JSON with literal api_key and clean base_url. Then
  existing benchmark code reads it unchanged — zero modifications to any benchmark's main().

  # In generate_llm_config.py:
  if api_key_env:
      llm_config["api_key"] = os.environ[api_key_env]  # resolve immediately
  if base_url and base_url.endswith("/chat/completions"):
      base_url = base_url.removesuffix("/chat/completions")

  Saves: The entire llm_config.py file + changes to all 7 benchmark main() functions (~50 lines).

  ---
  6. EVAL_AGENT_SERVER_IMAGE env var → Keep as-is

  It's a 3-line change in constants.py. Minimal and necessary.

  ---
  7. Packaging improvements → Keep, but trim package-data

  Keep: nemo_evaluator* in packages.find, nemo_evaluator = ["**/*.yml"], deferred imports, version.py fallback, Path.relative_to fallback.

  Drop: benchmarks = ["**/*.j2", "**/Dockerfile*", "**/*.json"] — Dockerfiles and JSON schemas aren't needed at pip-install runtime. The .j2 templates are found via
  Path(__file__).parent / "prompts" which works with pip install already. If some edge case needs them, add only "**/*.j2".
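
  The Path.relative_to fallback kept above presumably guards the pip-installed case, along these lines. Sketch only; the helper name and call sites are assumptions:

  ```python
  from pathlib import Path

  def safe_relative(path: Path, base: Path) -> Path:
      """relative_to() raises ValueError when path is not under base
      (e.g. when the package is pip-installed into site-packages rather
      than run from a repo checkout); fall back to the path itself."""
      try:
          return path.relative_to(base)
      except ValueError:
          return path
  ```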

  ---
  8. Error output enrichment → Drop entirely

  The error_type/error_message/error_stack fields in test_result and traceback.format_exc() capture are nice debugging improvements but NeMo's output.py doesn't read
  them. Pure noise in a NeMo PR.

  ---
  Summary: what the minimal NeMo PR looks like
  ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬───────┐
  │                                                               What to keep                                                                │ Lines │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ nemo_evaluator/ directory (framework.yml, output.py, __init__.py) — simplified framework.yml without conversation-timeout/skip-failed-samples │ ~280  │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ run_benchmark.py + generate_llm_config.py (with api_key resolved at generation) — no load_llm_config.py                                   │ ~250  │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ "benchmark" field added to all .report.json outputs                                                                                       │ ~30   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ OpenAgentSafety generate_report()                                                                                                         │ ~40   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ IMAGE_TAG_PREFIX + EVAL_AGENT_SERVER_IMAGE env var overrides                                                                              │ ~15   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ _local_image_exists() utility                                                                                                             │ ~15   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ pyproject.toml changes (dependency, entrypoints, package-data for .yml)                                                                   │ ~15   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ version.py graceful fallback                                                                                                              │ ~10   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ Deferred imports + Path.relative_to fallback                                                                                              │ ~20   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ uv fallback (as a small utility, not copy-pasted)                                                                                         │ ~25   │
  ├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼───────┤
  │ Total                                                                                                                                     │ ~700  │
  └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴───────┘
  ┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┬─────────────┐
  │                                              What to drop / split into separate PRs                                              │ Lines saved │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ --conversation-timeout (all plumbing)                                                                                            │ ~50         │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ --skip-failed-samples + SampleFailedError                                                                                        │ ~40         │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ load_llm_config.py + refactoring all benchmark main() functions                                                                  │ ~50         │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ Try/except pre-built image fallback blocks (5 benchmarks)                                                                        │ ~100        │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ Error output enrichment (error_type/stack in test_result)                                                                        │ ~20         │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ All 6 unrelated bug fixes (GAIA critic, pd.isna, error re-raising, MultiSWEBench build, MultiSWEBench dataset name, ruff config) │ ~80         │
  ├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┼─────────────┤
  │ Total saved                                                                                                                      │ ~340        │
  └──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┘
  That gets the PR from ~1325 additions down to ~700, with a much clearer "this is NeMo integration" story and no unrelated bug fixes polluting the review.
