Fix: laminar trace timeline to account for idle wait time by simonrosenberg · Pull Request #464 · OpenHands/benchmarks

simonrosenberg · 2026-02-27T16:50:46Z

Summary

Fix evaluation datapoint trace timeline when the number of tasks exceeds concurrent workers.

Problem: The root "Evaluation" span was created in the parent process alongside the datapoint, before a worker picked up the task. When more tasks were queued than available workers, the span included idle wait time, resulting in inflated durations and misleading timelines in Laminar.

Fix: The "Evaluation" span is now created in the child process when execution actually begins. The span's trace ID is then linked back to the Laminar datapoint via update_datapoint, so the timeline accurately reflects real execution time.
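To make the inflation concrete, here is a small deterministic simulation. It does not use Laminar at all; `simulate_spans` is a hypothetical helper that assumes every task is submitted at t=0 and dispatched FIFO to the earliest free worker. It compares a span opened at submission time (the old parent-side behavior) with one opened when a worker starts the task (the new child-side behavior).

```python
import heapq
from typing import List, Tuple


def simulate_spans(durations: List[float], workers: int) -> List[Tuple[float, float]]:
    """Return (parent_side, child_side) span durations for each task.

    Assumptions for this sketch: all tasks are submitted at t=0 and are
    served in FIFO order by `workers` identical workers.
    parent_side: span opened at submission, closed when the task finishes.
    child_side:  span opened when a worker actually starts the task.
    """
    free_at = [0.0] * workers  # min-heap of times at which each worker is free
    heapq.heapify(free_at)
    results = []
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        end = start + duration
        heapq.heappush(free_at, end)
        results.append((end, end - start))  # submission time is 0
    return results


# Four 10-second tasks on two workers: the last two wait 10 s in the queue.
for parent_side, child_side in simulate_spans([10.0, 10.0, 10.0, 10.0], workers=2):
    print(f"parent-side span: {parent_side:5.1f}s   child-side span: {child_side:5.1f}s")
```

With two workers, the third and fourth tasks show a 20-second parent-side span even though each ran for only 10 seconds; the child-side span stays at 10 seconds for every task.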

Changes

- Two-phase datapoint linking: Parent creates the datapoint immediately (for UI progress), child creates the eval span when work begins, then links them via update_datapoint_trace_id
- _execute_single_attempt helper: Extracted from _process_one_mp to reduce nesting and improve readability
- _cleanup_workspace helper: Consolidates workspace archive capture and teardown
- Span lifecycle safety: Both eval_span and exec_span are always closed in finally blocks
- Defensive trace linking: get_span_context() and UUID conversion failures are caught so they don't crash the worker
- Improved logging: Trace link failures logged at ERROR with full traceback
- Test coverage: Added tests for datapoint trace linking, failure modes, and non-zero UUID validation
- Dependency: Requires lmnr>=0.7.41 for update_datapoint trace ID support
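The two-phase linking and the defensive handling described above can be sketched roughly as follows. This is not the PR's code and not the lmnr API: `InMemoryDatapoints`, `trace_id_to_uuid`, and `link_trace` are hypothetical names used only for illustration. The one external fact relied on is that OpenTelemetry trace IDs are 128-bit integers and that zero denotes an invalid trace ID.

```python
import logging
import uuid
from typing import Dict, Optional

logger = logging.getLogger(__name__)


class InMemoryDatapoints:
    """Hypothetical stand-in for the Laminar client (NOT the lmnr API)."""

    def __init__(self) -> None:
        self.trace_ids: Dict[uuid.UUID, Optional[uuid.UUID]] = {}

    def create_datapoint(self) -> uuid.UUID:
        # Phase 1 (parent): the datapoint exists right away, with no trace yet.
        datapoint_id = uuid.uuid4()
        self.trace_ids[datapoint_id] = None
        return datapoint_id

    def update_datapoint(self, datapoint_id: uuid.UUID, trace_id: uuid.UUID) -> None:
        self.trace_ids[datapoint_id] = trace_id


def trace_id_to_uuid(trace_id: int) -> uuid.UUID:
    """Convert a 128-bit trace ID to a UUID, rejecting the all-zero ID."""
    if trace_id == 0:
        raise ValueError("trace ID is zero (invalid span context)")
    return uuid.UUID(int=trace_id)  # raises ValueError if out of 128-bit range


def link_trace(client: InMemoryDatapoints, datapoint_id: uuid.UUID, trace_id: int) -> bool:
    """Phase 2 (worker): link the span's trace ID; never crash the worker."""
    try:
        client.update_datapoint(datapoint_id, trace_id_to_uuid(trace_id))
        return True
    except Exception:
        # logger.exception logs at ERROR level and includes the traceback.
        logger.exception("Failed to link trace to datapoint %s", datapoint_id)
        return False
```

In this sketch a failed link leaves the datapoint without a trace but lets the evaluation itself continue, which is the trade-off the "defensive trace linking" item describes.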

This PR supersedes #415 (which was from a fork and couldn't be updated due to permission issues).

Original PR by @Rainhunter13

simonrosenberg · 2026-02-27T16:52:38Z

@Rainhunter13 can you give a quick look to check that the conflicts were resolved correctly? Will then merge asap
Thank you so much!

all-hands-bot

🟡 Acceptable - Solves a real observability problem (inflated timelines due to idle wait), but needs attention to edge cases and testing.

Linus-Style Analysis: The core approach is sound - move span creation to where work actually happens. The data structure change (tuple → UUID) simplifies the API. However, there are some pragmatic concerns around failure modes and testing gaps.

benchmarks/utils/evaluation.py

benchmarks/utils/laminar.py

tests/test_workspace_cleanup.py

pyproject.toml

all-hands-bot

The design is sound (fixing queue-time inflation), but implementation has complexity issues that should be addressed.

benchmarks/utils/evaluation.py

tests/test_workspace_cleanup.py

all-hands-bot

🟡 Acceptable - Solves a real observability problem (moving span creation from queue-time to execution-time), but has resource management issues that need attention.

Linus-Style Analysis: The core approach is sound - you fixed the actual problem (inflated timelines). The data structure simplification (tuple → single UUID) is good. However, there are pragmatic concerns around resource cleanup that will bite you in production.

Key Insight: You fixed the observability problem but introduced resource leaks - OpenTelemetry spans must be explicitly closed or they accumulate in memory.

benchmarks/utils/evaluation.py

tests/test_workspace_cleanup.py

benchmarks/utils/evaluation.py

all-hands-bot

🟡 Acceptable - Core fix is sound (moving span creation to execution time), but has logic and type safety issues that should be addressed.

Linus-Style Analysis: The data structure simplification (tuple → UUID) and span lifecycle fix are good. Moving span creation to where work actually happens eliminates the queue-time inflation problem elegantly. However, there are pragmatic issues with dead code, type safety, and test quality.

benchmarks/utils/evaluation.py

tests/test_workspace_cleanup.py

benchmarks/utils/evaluation.py

all-hands-bot

🟡 Acceptable - Core fix is sound, but test coverage has limitations.

Linus-Style Analysis: The data structure simplification (tuple → UUID) and moving span creation to execution time is exactly the right fix. Extracting _cleanup_workspace and _execute_single_attempt reduces nesting and improves readability. This solves a real observability problem (inflated timelines from queue wait).

Key Insight: Good pragmatic design - span lifecycle now matches actual work lifecycle instead of queue lifecycle.

Verdict: ✅ Worth merging

tests/test_workspace_cleanup.py

all-hands-bot

🟡 Acceptable - Core fix is sound (moving span creation to execution time), but has dead code and resource initialization issues that need attention.

benchmarks/utils/evaluation.py

tests/test_workspace_cleanup.py

benchmarks/utils/evaluation.py

all-hands-bot

🟡 Acceptable - Core fix is sound (moving span creation to execution time), solves a real observability problem. Previous critical issues (uninitialized variables, dead code) have been addressed. Minor improvements suggested below.

benchmarks/utils/evaluation.py

tests/test_workspace_cleanup.py

benchmarks/utils/evaluation.py

all-hands-bot

🟡 Acceptable - Previous critical issues (uninitialized spans, dead code) are now fixed. Core design is sound, but test quality remains a concern.

tests/test_workspace_cleanup.py

benchmarks/utils/evaluation.py

Previously, the root "Evaluation" span was created in the parent process alongside the datapoint, before a worker picked up the task. When more tasks were queued than available workers, the span included idle wait time, resulting in inflated durations and misleading timelines in Laminar.

Now, the "Evaluation" span is created in the child process when execution actually begins. The span's trace ID is then linked back to the Laminar datapoint via update_datapoint, so the timeline accurately reflects real execution time.

Requires lmnr>=0.7.41 for update_datapoint trace ID support.

Co-Authored-By: Simon Rosenberg <simon@rosenberg.dev>

- Fix span lifecycle safety: initialize eval_span/exec_span to None with guarded .end() calls in finally blocks
- Broaden exception handling for trace linking to catch all failures
- Clarify _execute_single_attempt return convention in docstring
- Add tests for datapoint trace linking, failure resilience
- Satisfy pyright return type with unreachable assertion
- Update uv.lock for lmnr>=0.7.41

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
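The "initialize to None, guard .end() in finally" pattern from the first item might look roughly like this. It is a minimal sketch with a hypothetical `RecordingSpan` class in place of real Laminar spans; the span name "Execution" and the `fail_on_exec_span` switch are assumptions made for the example.

```python
from typing import Callable, List, Optional


class RecordingSpan:
    """Hypothetical span stand-in that records when it is ended."""

    def __init__(self, name: str, log: List[str]) -> None:
        self.name = name
        self._log = log

    def end(self) -> None:
        self._log.append(f"end:{self.name}")


def run_attempt(
    work: Callable[[], str], log: List[str], fail_on_exec_span: bool = False
) -> str:
    # Both names exist before the try block, so the finally block can
    # safely test them even if span creation itself fails part-way.
    eval_span: Optional[RecordingSpan] = None
    exec_span: Optional[RecordingSpan] = None
    try:
        eval_span = RecordingSpan("Evaluation", log)
        if fail_on_exec_span:
            raise RuntimeError("could not start execution span")
        exec_span = RecordingSpan("Execution", log)
        return work()
    finally:
        # Close the inner span first, then the root span; skip any span
        # that was never created.
        if exec_span is not None:
            exec_span.end()
        if eval_span is not None:
            eval_span.end()
```

Without the `None` initialization, a failure before `exec_span` is assigned would raise a `NameError` (an `UnboundLocalError`) inside the finally block and mask the original exception.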

all-hands-bot

🟡 Acceptable - Core fix is sound (moving span creation to execution time) and solves a real observability problem. Previous critical issues have been addressed. Minor improvements suggested in inline comments.

Linus-Style Analysis: The data structure simplification (tuple → UUID) and moving span creation to where work actually happens is exactly the right fix. Helper extraction reduces nesting and improves readability. This eliminates the queue-time inflation problem elegantly.

Key Insight: Span lifecycle now matches work lifecycle instead of queue lifecycle - good taste.

all-hands-bot · 2026-03-04T18:55:27Z

benchmarks/utils/evaluation.py

-                        instance,
-                        resource_factor=resource_factor,
-                        forward_env=LMNR_ENV_VARS,
+                eval_span_ctx = Laminar.get_laminar_span_context(eval_span)


🟡 Suggestion: Type Safety

The return type of Laminar.get_laminar_span_context(eval_span) is inferred but not explicitly typed. Since this is passed to Laminar.start_active_span(..., parent_span_context=eval_span_ctx) later (line 707), it should have an explicit type annotation.

Import the proper type from lmnr:

from lmnr import SpanContext  # or whatever the actual type is
eval_span_ctx: SpanContext = Laminar.get_laminar_span_context(eval_span)

This was flagged in a previous review but resolved using implicit inference. Explicit is better for maintainability and type checking.

benchmarks/utils/laminar.py

simonrosenberg · 2026-03-04T19:57:26Z

✅ Evaluation Run Successful

Triggered a SWE-bench evaluation (5 instances) with this branch:

| Field | Value |
| --- | --- |
| K8s Job | `eval-22684638709-claude-son` |
| Evaluation Run ID | 22684638709 |
| Model | claude-sonnet-4-5-20250929 |
| Result | All 5 instances completed successfully |
| Duration | ~39 minutes |

Datadog Logs: View logs

neubig

Looks fine to me, although some of the AI review comments might be worth reflecting on.

simonrosenberg self-assigned this Feb 27, 2026

simonrosenberg marked this pull request as ready for review February 27, 2026 16:52

all-hands-bot reviewed Feb 27, 2026

simonrosenberg requested a review from all-hands-bot February 27, 2026 17:24

all-hands-bot reviewed Feb 27, 2026

simonrosenberg requested a review from all-hands-bot February 27, 2026 17:36

all-hands-bot reviewed Feb 27, 2026

simonrosenberg requested a review from all-hands-bot March 4, 2026 13:50

all-hands-bot reviewed Mar 4, 2026

simonrosenberg requested a review from all-hands-bot March 4, 2026 15:05

all-hands-bot reviewed Mar 4, 2026

simonrosenberg force-pushed the fix_laminar_evals_timeline branch from 53c9497 to 167be13 Compare March 4, 2026 16:11

simonrosenberg requested a review from all-hands-bot March 4, 2026 18:00

all-hands-bot reviewed Mar 4, 2026

simonrosenberg requested a review from all-hands-bot March 4, 2026 18:10

all-hands-bot reviewed Mar 4, 2026

simonrosenberg requested a review from all-hands-bot March 4, 2026 18:20

all-hands-bot reviewed Mar 4, 2026

simonrosenberg mentioned this pull request Mar 4, 2026

Rename max_retries to max_attempts for clarity #482

Open

Rakhman Asmatullayev and others added 2 commits March 4, 2026 15:45

simonrosenberg force-pushed the fix_laminar_evals_timeline branch from 1457b6e to 9441a34 Compare March 4, 2026 18:47

simonrosenberg requested a review from all-hands-bot March 4, 2026 18:52

all-hands-bot reviewed Mar 4, 2026

neubig approved these changes Mar 4, 2026

simonrosenberg merged commit 9c97a9b into main Mar 4, 2026
4 checks passed

simonrosenberg deleted the fix_laminar_evals_timeline branch March 4, 2026 20:26
