Don't panic in duration underflow #12693

Closed
ConradIrwin wants to merge 1 commit into bytecodealliance:main from ConradIrwin:fix-interval-duration

Conversation

@ConradIrwin

In debug mode we check for tokens being dropped out of order, but in a
release build this can fail.

Closes #12692

@ConradIrwin ConradIrwin requested a review from a team as a code owner February 27, 2026 23:32
@ConradIrwin ConradIrwin requested review from cfallin and removed request for a team February 27, 2026 23:32
@cfallin
Member

cfallin commented Feb 28, 2026

Thanks for your report and this PR!

I don't think any of the current contributors have touched this code in a while, so we may not have a full understanding of its invariants, but I'm surprised that "tokens are dropped out of order" -- do you have a reproduction of that? (I see the companion issue you filed said you had no repro, but maybe you do with this PR?)

In any case, I'm happy to merge it as a conservative fix but I'd like to understand if there's an underlying invariant violation that it would mask first, too.

@cfallin
Member

cfallin commented Feb 28, 2026

(As an aside, I'm also surprised you're seeing this in (presumably production) telemetry -- is the timing infra enabled in your configuration?)

@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Feb 28, 2026
@ConradIrwin
Author

@cfallin No reproduction, just a somewhat surprising crash that showed up in our telemetry.

I think timings are logged if RUST_LOG=debug is exported (regardless of build settings), which seems plausible for a Zed user working on Rust.

Claude's (very hand-wavy) explanation for the out-of-order timing was that, as these things happen on different threads, the drop ordering was not guaranteed; but I'm not sure I believe it. (It seems like some care is taken to avoid this, hence the debug assert here: https://github.com/ConradIrwin/wasmtime/blob/c5214df981a88d30226e9b3ac1b88da32395806a/cranelift/codegen/src/timing.rs#L269).

The other potential explanations are something weird with Instants (e.g. this purported issue: zed-industries/zed#46636); or, given this has only happened once, a particularly targeted cosmic ray.

@cfallin
Member

cfallin commented Feb 28, 2026

Claude's (very hand-wavy) explanation for the out-of-order timing was that, as these things happen on different threads, the drop ordering was not guaranteed; but I'm not sure I believe it. (It seems like some care is taken to avoid this, hence the debug assert here: https://github.com/ConradIrwin/wasmtime/blob/c5214df981a88d30226e9b3ac1b88da32395806a/cranelift/codegen/src/timing.rs#L269).

OK, that's fortunately not possible -- Claude must not have seen the thread_local! on the relevant data structure here, such that compilations in different threads never interact. Drop order should be purely a function of the (hopefully deterministic!) pass driver structure in one compilation in one thread.

I'll take a closer look at this code and try to understand its invariants next week -- in the meantime, any further info you can gather about repros or conditions this is seen under are of course appreciated. Thanks!

@alexcrichton
Member

This is perhaps one way to reproduce this:

    #[test]
    fn test() {
        let _a = process_file();
        drop(canonicalize_nans());
        take_current().total();
    }

The current invariant of dropping-in-order doesn't take into account when take_current() is called in the middle. Whether that's possible within Wasmtime/Cranelift, though, I'm not entirely sure. One theoretical way this could happen is that a rayon worker thread is in the middle of compiling something, so it's got half the timing for a function, and then it does some synchronization and ends up work-stealing a function to compile. This second function finishes and Wasmtime calls take_current(), and then rayon would later resume the original compilation.

Cranelift itself doesn't have any rayon infrastructure, but Wasmtime/wasmtime-internal-cranelift might have just enough to trigger this? The best fix for this might be to reset the timing data when starting to compile a function and restore it once compilation is finished.

@ConradIrwin
Author

zed-industries/zed#46636 seems to be hitting this too (but on a linux realtime kernel). They claim there are known bugs with Instant subtraction on a realtime linux kernel (but I can't find any other references to this).

The other thing I notice is that Cranelift makes heavy use of catch_unwind; but unfortunately Zed's panic handler acts more like abort-on-panic. I assume from the comments that the catch_unwind is intended as a backup mechanism (not the primary mechanism) to prevent problems.

In terms of concurrency: we run a bunch of rayon stuff ourselves, and I'm pretty sure we can call Wasmtime compilation multiple times in parallel. I'm not familiar with how rayon manages thread pools, but if it has a global thread pool then interleaving between different compiles seems possible in the way we use it.

@cfallin
Member

cfallin commented Mar 2, 2026

The other thing I notice is that crane lift makes heavy use of catch_unwind

Not really? The only two calls to that function in cranelift/ are in the filetest runner and bugpoint, two separate CLI utilities not involved with the main compilation flow.

I think that Alex is onto the right hunch here with take_current -- I will try to get to this today.

@cfallin
Member

cfallin commented Mar 3, 2026

The current invariant of dropping-in-order doesn't take into account when take_current() is called in the middle. Whether that's possible within Wasmtime/Cranelift, though, I'm not entirely sure. One theoretical way this could happen is that a rayon worker thread is in the middle of compiling something, so it's got half the timing for a function, and then it does some synchronization and ends up work-stealing a function to compile. This second function finishes and Wasmtime calls take_current(), and then rayon would later resume the original compilation.

Cranelift itself doesn't have any rayon infrastructure, but Wasmtime/wasmtime-internal-cranelift might have just enough to trigger this? The best fix for this might be to reset the timing data when starting to compile a function and restore it once compilation is finished.

I've read through the logic in wasmtime-internal-cranelift now and I'm not seeing how we could invoke take_current() in the middle of a compilation (or nest two compilations) -- there are no async functions or await points here, there's no recursion, and take_current is only called once at the toplevel after invoking Cranelift.

I strongly believe that this PR's approach of using a saturating subtraction is wrong: it is hiding an actual bug deeper in the logic by just ignoring a mismatch.

We could tighten up some assertions, e.g. have a generation number in PassTimes so that if it is prematurely reset and then a drop-token later increments, we catch that properly. If something in Zed or Wasmtime is "holding it wrong" that wouldn't resolve the actual bug but it would help to find it. @ConradIrwin any progress on getting a reproducer?

@ConradIrwin
Author

@cfallin None from my side. That said, the reporter in zed-industries/zed#46636 seems to be able to reproduce on demand (albeit under different circumstances). He thinks this is an expected bug in Rust's time calculations as described in rust-lang/rust#56612, but feel free to reach out directly.

Looking closer at telemetry, in the last few weeks we've seen this overflow 14 times on Linux and 3 times on Windows (no macOS). (Prior to that it was definitely also happening on Linux, and probably elsewhere, but we made some changes to crash grouping so it's hard to know for sure.) This is across roughly O(10,000,000) app opens in the same period, so it's not exactly common.

(Also, my mistake on the catch_unwind: I was seeing it in the stack trace inside rayon, which propagates panics by default anyway.)

@ConradIrwin
Author

If you wanted to update the assertions to try to catch things earlier, I'm willing to ship them to Zed to help out.

@cfallin
Member

cfallin commented Mar 3, 2026

OK, I've put up #12709 for this. I'll go ahead and close this PR if you don't mind since I don't think the "hide the issue by ignoring underflow" approach is what we want.

@cfallin cfallin closed this Mar 3, 2026