Add support for suspendable jobs #16657

Draft

ranger-ross wants to merge 3 commits into rust-lang:master from ranger-ross:suspendable-tasks

Conversation

ranger-ross (Member)

What does this PR try to resolve?

This PR fixes a bug in -Zfine-grain-locking that prevented other jobs from running while a job was waiting on a lock.
Cargo's job queue has never supported suspending a job that is already executing.
This PR adds support for suspending and resuming jobs that are being executed.

This also paves the way for better blocking messages to the user (originally attempted in #16463).

Tracking issue: #4282

Flow

  • Before taking an exclusive lock (and potentially blocking), we do a .try_lock() to see whether we would block.
    • If we would block, we send a Message::Blocked, letting the job queue reclaim the token so another job can run while we are blocked.
    • If try_lock succeeds, we have the lock and never blocked, so we simply start compiling.
  • Once we acquire the lock, we send a Message::Unblocked letting the job queue know we are ready to continue.
  • Before compiling, to respect the --jobs limit, we wait for the job queue to reschedule us by calling JobState.resume.recv(), which blocks the current thread.
  • The job queue will eventually call ActiveJob.resume.send(), which lets the job resume execution.

Design

  • We leverage the existing Queue<Message> in JobState to notify the job queue that we are going to block.
  • For resuming tasks that are blocked, I used a std::sync::mpsc::channel.
flowchart LR
    JQ["Job Queue (DrainState)"]
    J["Jobs"]

    J -- "Queue<Message>" --> JQ
    JQ -- "resume mpsc::channel" --> J

How to test and review this PR?

To test this, I found it easiest to do it synthetically by injecting a std::thread::sleep in the lock manager for a given lock, along with passing --jobs 1 so that a single blocked job blocks the entire build.

pub fn lock(&self, key: &LockKey) -> CargoResult<()> {
    let locks = self.locks.read().unwrap();
    if key.0.to_str().unwrap().contains("libc") {
        std::thread::sleep(std::time::Duration::from_secs(5));
    }
    // ....
}

Test script

cargo new foo; cd foo; cargo add tokio --features full

rm -rf target build
alias lc=/path/to/cargo
CARGO_BUILD_BUILD_DIR=build lc -Zbuild-dir-new-layout -Zfine-grain-locking build --jobs 1

Doing this on master results in a blocked build that never completes, while with the changes in this PR the build finishes.

r? @epage

rustbot added the A-build-execution (Area: anything dealing with executing the compiler) and S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties) labels on Feb 20, 2026.
Contributor:

Could you provide the motivation for this commit and in particular why each lock type was chosen with the confines of this commit?

Member Author (ranger-ross):

Sorry, I should have made that clear in the commit message.

The issue I am trying to solve is holding the MutexGuard while waiting for the file lock.
That is a problem because it means only a single lock can be taken at a time. (I am a bit surprised I overlooked this previously.)

For the lock manager, we only insert into the hashmap during the single-threaded prepare-fingerprint phase, so we take RwLock::write() there. During the multithreaded Job section, we take RwLock::read() on the hashmap.

Contributor:

Sounds like this is a fix independent of the rest of the PR?

Can we split it out and update the title to focus on the user impact?

Contributor:

I worry about the maintainability of having side-band communication going on.

Is it possible for us to manage this within the existing flow? For example, could we use Poll on Job::run to return early if we couldn't acquire the lock?

Member Author (ranger-ross):

Hmmm, I think it might be tricky to use Poll with the generic structure of Job (e.g. how to resume the job, since internally it's a closure).

Though I am happy to look into that to see what is possible.

Contributor:

Wait, if we know a priori which jobs will actually build (see #16659 (comment)), then why can't we do all of the lock processing before the build, grabbing our exclusive locks and blocking then, rather than waiting?

Member Author (ranger-ross):

Oh, that's an interesting idea.
So basically the DrainState could try to take the lock before calling DrainState::run (running the job).

Since this is single-threaded, it will change the way we take the locks.
I am imagining spawning a thread to take each lock so as to avoid blocking the main job-queue loop.
I think this could be encapsulated in the LockManager (and use try_lock() as an optimization to avoid spawning a new thread).

Contributor:

I'm saying that we acquire the locks in compile and downgrade after build.

Member Author (ranger-ross):

Hmmm, well if you try to take the locks in compile (assuming you are referring to BuildRunner::compile), that would block all jobs since it is single-threaded. Unless I am missing something?

My rationale for trying to put them in DrainState is that it does the job scheduling, so it has the power to schedule another job that is not blocked.

Contributor:

That's super::compile, which currently coordinates locks.

How much does grabbing locks as we go benefit us in practice, when the build we block on would hold a shared lock until it's done, blocking our job and all dependents anyway? We can build some jobs that don't overlap, so a little?

Member Author (ranger-ross):

> We can build some jobs that don't overlap, so a little?

The benefit we get is when there are build units that do not overlap.
From my perspective, this is the main benefit of fine-grain locking over what we currently have.
If we were to wait for everything up front, I feel like the benefits would be limited to a few more niche scenarios?

Contributor:

The goal we are working towards is non-blocking. If there is overlap, then we are blocking.

So the question then is does the benefit from some overlap justify a more complex architecture (both for this and blocking messages).

Member Author (ranger-ross), Feb 23, 2026:

I did some thinking on this, and I think we might be able to get away with simply locking before kicking off jobs.

The primary use case for fine-grain locking is to avoid rust-analyzer's cargo check blocking the user's cargo build, which slows down iteration during development.

If the user is making small incremental changes to their project, full rebuilds should be uncommon. So we will likely take shared locks on most of the dependencies and don't need exclusive locks for those; we just need exclusive locks on the things that changed (dirty units). Build scripts are still a problem with this, but I think they will not be dirty unless the user is editing the build script itself.

I think the big question for whether this is worth the extra complexity of suspending jobs is whether we care about build-script units (or any other units shared between build and check) blocking.

@ranger-ross ranger-ross marked this pull request as draft February 21, 2026 09:05
@rustbot rustbot removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Feb 21, 2026
github-merge-queue bot pushed a commit that referenced this pull request Feb 21, 2026
### What does this PR try to resolve?

Split out of #16657

Currently we hold a `MutexGuard` while waiting for the file lock,
preventing multiple locks from being taken at once.
This means that a single blocking unit can prevent other units from
running.

For the lock manager, we only insert into the hashmap during the
single-threaded prepare-fingerprint phase, so we take `RwLock::write()`
there. During the multithreaded `Job` section, we take `RwLock::read()`
on the hashmap.

### How to test and review this PR?

See the "How to test and review this PR" section of
#16657

However, unlike that PR, the `--jobs 1` reproduction will not work with
the changes here.
We would need `--jobs 2` and to make sure that the first build unit
being blocked causes the second one to block. Though I think this is a
bit difficult to set up a test for without the changes in
#16657

r? @epage