Fix: laminar trace timeline to account for idle wait time #464
simonrosenberg merged 2 commits into main
Conversation
@Rainhunter13 can you take a quick look to check that the conflicts were resolved correctly? Will then merge asap
all-hands-bot left a comment
🟡 Acceptable - Solves a real observability problem (inflated timelines due to idle wait), but needs attention to edge cases and testing.
Linus-Style Analysis: The core approach is sound - move span creation to where work actually happens. The data structure change (tuple → UUID) simplifies the API. However, there are some pragmatic concerns around failure modes and testing gaps.
all-hands-bot left a comment
The design is sound (fixing queue-time inflation), but implementation has complexity issues that should be addressed.
all-hands-bot left a comment
🟡 Acceptable - Solves a real observability problem (moving span creation from queue-time to execution-time), but has resource management issues that need attention.
Linus-Style Analysis: The core approach is sound - you fixed the actual problem (inflated timelines). The data structure simplification (tuple → single UUID) is good. However, there are pragmatic concerns around resource cleanup that will bite you in production.
Key Insight: You fixed the observability problem but introduced resource leaks - OpenTelemetry spans must be explicitly closed or they accumulate in memory.
all-hands-bot left a comment
🟡 Acceptable - Core fix is sound (moving span creation to execution time), but has logic and type safety issues that should be addressed.
Linus-Style Analysis: The data structure simplification (tuple → UUID) and span lifecycle fix are good. Moving span creation to where work actually happens eliminates the queue-time inflation problem elegantly. However, there are pragmatic issues with dead code, type safety, and test quality.
all-hands-bot left a comment
🟡 Acceptable - Core fix is sound, but test coverage has limitations.
Linus-Style Analysis: The data structure simplification (tuple → UUID) and moving span creation to execution time is exactly the right fix. Extracting _cleanup_workspace and _execute_single_attempt reduces nesting and improves readability. This solves a real observability problem (inflated timelines from queue wait).
Key Insight: Good pragmatic design - span lifecycle now matches actual work lifecycle instead of queue lifecycle.
Verdict: ✅ Worth merging
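The queue-time inflation the reviews describe can be seen in a minimal sketch. The `Span` class below is an illustrative stand-in, not lmnr's or OpenTelemetry's API; the sleeps simulate a task waiting in the queue versus doing real work:

```python
import time

class Span:
    """Illustrative stand-in for a tracing span (not the real lmnr API)."""
    def __init__(self, name: str):
        self.name = name
        self.start = time.monotonic()
        self.end_time = None

    def end(self) -> None:
        self.end_time = time.monotonic()

    @property
    def duration(self) -> float:
        return self.end_time - self.start

# Old behavior: the span is created at enqueue time, before a worker is free.
enqueue_span = Span("Evaluation")

time.sleep(0.2)                 # simulated idle queue wait (all workers busy)

# New behavior: the span is created only when execution actually begins.
exec_span = Span("Evaluation")
time.sleep(0.1)                 # the actual work

enqueue_span.end()
exec_span.end()

# The enqueue-time span includes the 0.2s wait; the exec-time span does not.
print(f"enqueue-time span: {enqueue_span.duration:.2f}s")
print(f"exec-time span:    {exec_span.duration:.2f}s")
```

With more queued tasks than workers, the gap between the two durations grows with queue depth, which is exactly the inflated-timeline symptom the PR fixes.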
Force-pushed from 53c9497 to 167be13 (Compare)
all-hands-bot left a comment
🟡 Acceptable - Core fix is sound (moving span creation to execution time), but has dead code and resource initialization issues that need attention.
all-hands-bot left a comment
🟡 Acceptable - Core fix is sound (moving span creation to execution time), solves a real observability problem. Previous critical issues (uninitialized variables, dead code) have been addressed. Minor improvements suggested below.
all-hands-bot left a comment
🟡 Acceptable - Previous critical issues (uninitialized spans, dead code) are now fixed. Core design is sound, but test quality remains a concern.
Previously, the root "Evaluation" span was created in the parent process alongside the datapoint, before a worker picked up the task. When more tasks were queued than available workers, the span included idle wait time, resulting in inflated durations and misleading timelines in Laminar. Now, the "Evaluation" span is created in the child process when execution actually begins. The span's trace ID is then linked back to the Laminar datapoint via update_datapoint, so the timeline accurately reflects real execution time. Requires lmnr>=0.7.41 for update_datapoint trace ID support.

Co-Authored-By: Simon Rosenberg <simon@rosenberg.dev>
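The trace-ID linking can be sketched as follows. OpenTelemetry trace IDs are 128-bit integers, the same width as a UUID, which is what makes the conversion natural; `update_datapoint` here is a stub standing in for the Laminar call named in the commit message (the real one requires lmnr>=0.7.41):

```python
import uuid

def trace_id_to_uuid(trace_id_int: int) -> uuid.UUID:
    # OpenTelemetry trace IDs are 128-bit integers, the same width as a UUID.
    return uuid.UUID(int=trace_id_int)

def update_datapoint(datapoint_id: str, trace_id: uuid.UUID) -> None:
    # Stub standing in for lmnr's update_datapoint call.
    print(f"linking datapoint {datapoint_id} -> trace {trace_id}")

# Example: a 128-bit trace ID as produced by an OpenTelemetry span context.
trace_uuid = trace_id_to_uuid(0x0123456789ABCDEF0123456789ABCDEF)
update_datapoint("dp-1", trace_uuid)
```

Because the child process owns the span, this conversion (and the network call behind the real `update_datapoint`) happens worker-side, which is why the commit also broadens exception handling around it: a linking failure must not crash the worker.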
- Fix span lifecycle safety: initialize eval_span/exec_span to None with guarded .end() calls in finally blocks
- Broaden exception handling for trace linking to catch all failures
- Clarify _execute_single_attempt return convention in docstring
- Add tests for datapoint trace linking, failure resilience
- Satisfy pyright return type with unreachable assertion
- Update uv.lock for lmnr>=0.7.41

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
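The span-lifecycle-safety fix in the first bullet follows a standard pattern, sketched below with a hypothetical `Span` class (not the PR's actual code): initialize both spans to `None` before the `try`, and guard each `.end()` in `finally`, so a failure between the two span creations never leaks the first span.

```python
created_spans = []

class Span:
    """Hypothetical span that records whether it was closed."""
    def __init__(self, name: str):
        self.name = name
        self.closed = False
        created_spans.append(self)

    def end(self) -> None:
        self.closed = True

def process_one(fail_early: bool = False) -> str:
    eval_span = None   # may never be assigned if setup fails early
    exec_span = None
    try:
        eval_span = Span("Evaluation")
        if fail_early:
            raise RuntimeError("crash before the execution span exists")
        exec_span = Span("Execute")
        return "ok"
    finally:
        # Guarded .end() calls: close only the spans actually created.
        if exec_span is not None:
            exec_span.end()
        if eval_span is not None:
            eval_span.end()
```

Without the `None` initialization, the `finally` block would raise `UnboundLocalError` on the early-failure path and mask the original exception.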
Force-pushed from 1457b6e to 9441a34 (Compare)
all-hands-bot left a comment
🟡 Acceptable - Core fix is sound (moving span creation to execution time) and solves a real observability problem. Previous critical issues have been addressed. Minor improvements suggested in inline comments.
Linus-Style Analysis: The data structure simplification (tuple → UUID) and moving span creation to where work actually happens is exactly the right fix. Helper extraction reduces nesting and improves readability. This eliminates the queue-time inflation problem elegantly.
Key Insight: Span lifecycle now matches work lifecycle instead of queue lifecycle - good taste.
```python
instance,
resource_factor=resource_factor,
forward_env=LMNR_ENV_VARS,
eval_span_ctx = Laminar.get_laminar_span_context(eval_span)
```
🟡 Suggestion: Type Safety
The return type of Laminar.get_laminar_span_context(eval_span) is inferred but not explicitly typed. Since this is passed to Laminar.start_active_span(..., parent_span_context=eval_span_ctx) later (line 707), it should have an explicit type annotation.
Import the proper type from lmnr:

```python
from lmnr import SpanContext  # or whatever the actual type is

eval_span_ctx: SpanContext = Laminar.get_laminar_span_context(eval_span)
```

This was flagged in a previous review but resolved using implicit inference. Explicit is better for maintainability and type checking.
✅ Evaluation Run Successful. Triggered a SWE-bench evaluation (5 instances) with this branch.
Datadog Logs: View logs
neubig left a comment
Looks fine to me, although some of the AI review comments might be worth reflecting on.
Summary
Fix the evaluation datapoint trace timeline when the number of tasks exceeds the number of concurrent workers.
Problem: The root "Evaluation" span was created in the parent process alongside the datapoint, before a worker picked up the task. When more tasks were queued than available workers, the span included idle wait time, resulting in inflated durations and misleading timelines in Laminar:
Fix: The "Evaluation" span is now created in the child process when execution actually begins. The span's trace ID is then linked back to the Laminar datapoint via update_datapoint, so the timeline accurately reflects real execution time.

Changes

- `update_datapoint_trace_id`: links the span's trace ID back to the Laminar datapoint
- `_execute_single_attempt` helper: extracted from `_process_one_mp` to reduce nesting and improve readability
- `_cleanup_workspace` helper: consolidates workspace archive capture and teardown
- `eval_span` and `exec_span` are always closed in `finally` blocks
- `get_span_context()` and UUID conversion failures are caught so they don't crash the worker
- Requires `lmnr>=0.7.41` for `update_datapoint` trace ID support

This PR supersedes #415 (which was from a fork and couldn't be updated due to permission issues).
Original PR by @Rainhunter13