Skip to content

Fix: laminar trace timeline to account for idle wait time#464

Merged
simonrosenberg merged 2 commits intomainfrom
fix_laminar_evals_timeline
Mar 4, 2026
Merged

Fix: laminar trace timeline to account for idle wait time#464
simonrosenberg merged 2 commits intomainfrom
fix_laminar_evals_timeline

Conversation

@simonrosenberg
Copy link
Collaborator

@simonrosenberg simonrosenberg commented Feb 27, 2026

Summary

Fix evaluation datapoint trace timeline when the number of tasks exceeds concurrent workers.

Problem: The root "Evaluation" span was created in the parent process alongside the datapoint, before a worker picked up the task. When more tasks were queued than available workers, the span included idle wait time, resulting in inflated durations and misleading timelines in Laminar:

image

Fix: The "Evaluation" span is now created in the child process when execution actually begins. The span's trace ID is then linked back to the Laminar datapoint via update_datapoint, so the timeline accurately reflects real execution time.

Changes

  • Two-phase datapoint linking: Parent creates the datapoint immediately (for UI progress), child creates the eval span when work begins, then links them via update_datapoint_trace_id
  • _execute_single_attempt helper: Extracted from _process_one_mp to reduce nesting and improve readability
  • _cleanup_workspace helper: Consolidates workspace archive capture and teardown
  • Span lifecycle safety: Both eval_span and exec_span are always closed in finally blocks
  • Defensive trace linking: get_span_context() and UUID conversion failures are caught so they don't crash the worker
  • Improved logging: Trace link failures logged at ERROR with full traceback
  • Test coverage: Added tests for datapoint trace linking, failure modes, and non-zero UUID validation
  • Dependency: Requires lmnr>=0.7.41 for update_datapoint trace ID support

This PR supersedes #415 (which was from a fork and couldn't be updated due to permission issues).

Original PR by @Rainhunter13

@simonrosenberg simonrosenberg self-assigned this Feb 27, 2026
@simonrosenberg simonrosenberg marked this pull request as ready for review February 27, 2026 16:52
@simonrosenberg
Copy link
Collaborator Author

@Rainhunter13 can you give a quick look to check that the conflicts were resolved correctly? Will then merge asap
Thank you so much!

Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Solves a real observability problem (inflated timelines due to idle wait), but needs attention to edge cases and testing.

Linus-Style Analysis: The core approach is sound - move span creation to where work actually happens. The data structure change (tuple → UUID) simplifies the API. However, there are some pragmatic concerns around failure modes and testing gaps.

Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The design is sound (fixing queue-time inflation), but implementation has complexity issues that should be addressed.

Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Solves a real observability problem (moving span creation from queue-time to execution-time), but has resource management issues that need attention.

Linus-Style Analysis: The core approach is sound - you fixed the actual problem (inflated timelines). The data structure simplification (tuple → single UUID) is good. However, there are pragmatic concerns around resource cleanup that will bite you in production.

Key Insight: You fixed the observability problem but introduced resource leaks - OpenTelemetry spans must be explicitly closed or they accumulate in memory.

Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Core fix is sound (moving span creation to execution time), but has logic and type safety issues that should be addressed.

Linus-Style Analysis: The data structure simplification (tuple → UUID) and span lifecycle fix are good. Moving span creation to where work actually happens eliminates the queue-time inflation problem elegantly. However, there are pragmatic issues with dead code, type safety, and test quality.

Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Core fix is sound, but test coverage has limitations.

Linus-Style Analysis: The data structure simplification (tuple → UUID) and moving span creation to execution time is exactly the right fix. Extracting _cleanup_workspace and _execute_single_attempt reduces nesting and improves readability. This solves a real observability problem (inflated timelines from queue wait).

Key Insight: Good pragmatic design - span lifecycle now matches actual work lifecycle instead of queue lifecycle.

Verdict: ✅ Worth merging

@simonrosenberg simonrosenberg force-pushed the fix_laminar_evals_timeline branch from 53c9497 to 167be13 Compare March 4, 2026 16:11
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Core fix is sound (moving span creation to execution time), but has dead code and resource initialization issues that need attention.

Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Core fix is sound (moving span creation to execution time), solves a real observability problem. Previous critical issues (uninitialized variables, dead code) have been addressed. Minor improvements suggested below.

Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Previous critical issues (uninitialized spans, dead code) are now fixed. Core design is sound, but test quality remains a concern.

Rakhman Asmatullayev and others added 2 commits March 4, 2026 15:45
Previously, the root "Evaluation" span was created in the parent process
alongside the datapoint, before a worker picked up the task. When more
tasks were queued than available workers, the span included idle wait
time, resulting in inflated durations and misleading timelines in Laminar.

Now, the "Evaluation" span is created in the child process when execution
actually begins. The span's trace ID is then linked back to the Laminar
datapoint via update_datapoint, so the timeline accurately reflects real
execution time. Requires lmnr>=0.7.41 for update_datapoint trace ID
support.

Co-Authored-By: Simon Rosenberg <simon@rosenberg.dev>
- Fix span lifecycle safety: initialize eval_span/exec_span to None
  with guarded .end() calls in finally blocks
- Broaden exception handling for trace linking to catch all failures
- Clarify _execute_single_attempt return convention in docstring
- Add tests for datapoint trace linking, failure resilience
- Satisfy pyright return type with unreachable assertion
- Update uv.lock for lmnr>=0.7.41

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@simonrosenberg simonrosenberg force-pushed the fix_laminar_evals_timeline branch from 1457b6e to 9441a34 Compare March 4, 2026 18:47
Copy link
Collaborator

@all-hands-bot all-hands-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Acceptable - Core fix is sound (moving span creation to execution time) and solves a real observability problem. Previous critical issues have been addressed. Minor improvements suggested in inline comments.

Linus-Style Analysis: The data structure simplification (tuple → UUID) and moving span creation to where work actually happens is exactly the right fix. Helper extraction reduces nesting and improves readability. This eliminates the queue-time inflation problem elegantly.

Key Insight: Span lifecycle now matches work lifecycle instead of queue lifecycle - good taste.

instance,
resource_factor=resource_factor,
forward_env=LMNR_ENV_VARS,
eval_span_ctx = Laminar.get_laminar_span_context(eval_span)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Suggestion: Type Safety

The return type of Laminar.get_laminar_span_context(eval_span) is inferred but not explicitly typed. Since this is passed to Laminar.start_active_span(..., parent_span_context=eval_span_ctx) later (line 707), it should have an explicit type annotation.

Import the proper type from lmnr:

from lmnr import SpanContext  # or whatever the actual type is

eval_span_ctx: SpanContext = Laminar.get_laminar_span_context(eval_span)

This was flagged in a previous review but resolved using implicit inference. Explicit is better for maintainability and type checking.

@simonrosenberg
Copy link
Collaborator Author

✅ Evaluation Run Successful

Triggered a SWE-bench evaluation (5 instances) with this branch:

Field Value
K8s Job eval-22684638709-claude-son
Evaluation Run ID 22684638709
Model claude-sonnet-4-5-20250929
Result All 5 instances completed successfully
Duration ~39 minutes

Datadog Logs: View logs

Copy link
Contributor

@neubig neubig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me, although some of the AI review comments might be worth reflecting.

@simonrosenberg simonrosenberg merged commit 9c97a9b into main Mar 4, 2026
4 checks passed
@simonrosenberg simonrosenberg deleted the fix_laminar_evals_timeline branch March 4, 2026 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants