
docs(prd): add PRD for DAG-based concurrent execution#2194

Open
osterman wants to merge 2 commits into main from osterman/dag-scheduler-prd

@osterman (Member) commented Mar 14, 2026

what

Added a comprehensive Product Requirements Document (PRD) for implementing DAG-based concurrent execution in Atmos. The PRD proposes a ready-queue scheduler that runs components of all types (Terraform, Packer, Ansible, custom registry) concurrently while respecting the dependency graph. Defaults stay safe: execution is sequential unless parallelism is opted into via --max-concurrency.

why

Currently Atmos executes components sequentially even when they have no dependencies and could safely run in parallel. For large deployments with dozens or hundreds of components, this serialization is the dominant bottleneck. The PRD establishes architectural principles, justifies ready-queue scheduling through industry research (Terragrunt, Make, Ninja, Bazel, Buck2, and 10+ other tools all use this pattern), and provides a phased rollout plan. The document also addresses critical concerns: output isolation under concurrency via stream injection, integration with legacy built-in component types without requiring migration, and configuration of concurrency defaults through atmos.yaml.

references

Summary by CodeRabbit

  • Documentation
    • Added a PRD describing DAG-aware concurrent execution with a ready-queue scheduler, phased rollout, and configurable max-concurrency (default 1).
    • Describes per-node streaming and logging, output labeling, JSON summaries, and visualization guidance for debugging DAGs.
    • Includes examples, integration points for mixed component types, and operational/rollout considerations.

Implement a ready-queue scheduler for component-type-agnostic concurrent
execution with bounded concurrency, proper output isolation, and safe defaults.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@osterman osterman requested a review from a team as a code owner March 14, 2026 05:04
@github-actions github-actions bot added the size/m Medium size PR label Mar 14, 2026
@github-actions bot commented Mar 14, 2026

Dependency Review

✅ No vulnerabilities or license issues found.

Scanned Files

None

@osterman osterman added the no-release Do not create a new release (wait for additional code changes) label Mar 14, 2026
@coderabbitai (Contributor) bot commented Mar 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9a6e0250-d29d-481b-99a1-d9d6ca6dc73d

📥 Commits

Reviewing files that changed from the base of the PR and between 16c4179 and 1adabc7.

📒 Files selected for processing (1)
  • docs/prd/dag-concurrent-execution.md

Walkthrough

Adds a new PRD describing a DAG-aware concurrent executor: a ready-queue scheduler with bounded worker pool, phased rollout, integration points for replacing sequential execution, stream-injectable per-node outputs, and configuration defaults (max_concurrency default 1).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| DAG Concurrency PRD: docs/prd/dag-concurrent-execution.md | Adds a new PRD specifying a ready-queue DAG scheduler architecture (pkg/scheduler/), separation from dependency graph data (pkg/dependency/), phased implementation plan (4 phases), per-node stream injection and logging, concurrency primitives (errgroup, semaphores), CLI/env/config max_concurrency, and integration notes for cross-type components and legacy built-ins. |

Sequence Diagram(s)

sequenceDiagram
    participant CLI
    participant Scheduler
    participant DependencyGraph as DepGraph
    participant WorkerPool
    participant Subprocess

    CLI->>DepGraph: load graph
    CLI->>Scheduler: start execution (max_concurrency)
    Scheduler->>DepGraph: request ready nodes
    DepGraph-->>Scheduler: ready node(s)
    Scheduler->>WorkerPool: dispatch node job
    WorkerPool->>Subprocess: run node (stream-injected stdout/stderr)
    Subprocess-->>WorkerPool: complete (exit + outputs)
    WorkerPool-->>Scheduler: node finished (result, logs)
    Scheduler->>DepGraph: mark node complete, request new ready nodes
    DepGraph-->>Scheduler: new ready node(s) or done
    Scheduler->>CLI: emit JSON summary / per-node logs when complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

no-release

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and directly reflects the main change: adding a PRD document about DAG-based concurrent execution, which matches the changeset perfectly. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |


@coderabbitai (Contributor) bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/prd/dag-concurrent-execution.md`:
- Around line 273-319: The Run signature and loop must be fixed: change
Scheduler.Run(ctx context.Context) to return error (or return a *Result that
wraps g.Wait() error) so returning g.Wait() type-checks; also avoid deadlock by
making all channel operations respect context cancellation — when receiving the
next node from ready use a select with case <-ctx.Done() to return
ctx.Err()/g.Wait(), and when enqueuing dependents inside the g.Go closure (the
loop that updates inDegree and does ready <- s.graph.GetNode(dep)) use a
non-blocking/ctx-aware send (select with case ready<-node and case <-ctx.Done())
so failFast cancellation won't leave goroutines blocked. Apply these changes in
Scheduler.Run, around the ready channel receive and the dependents enqueue logic
inside the g.Go closure, and ensure completed/total bookkeeping still guarded by
mu.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7f195206-7caf-465b-9592-b65508e033f5

📥 Commits

Reviewing files that changed from the base of the PR and between 76a269b and 16c4179.

📒 Files selected for processing (1)
  • docs/prd/dag-concurrent-execution.md

@codecov bot commented Mar 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.29%. Comparing base (bbac3f8) to head (1adabc7).
⚠️ Report is 1 commit behind head on main.


@@            Coverage Diff             @@
##             main    #2194      +/-   ##
==========================================
- Coverage   77.29%   77.29%   -0.01%     
==========================================
  Files         960      960              
  Lines       91088    91088              
==========================================
- Hits        70410    70404       -6     
- Misses      16593    16602       +9     
+ Partials     4085     4082       -3     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 77.29% <ø> (-0.01%) ⬇️ |

see 4 files with indirect coverage changes


Address CodeRabbit review: fix Run() to return *Result (not error from
g.Wait()), add ctx.Done() select to prevent deadlock on fail-fast
cancellation, and use context-aware channel sends in dependent enqueuing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
