Don't panic in duration underflow #12693

Closed
ConradIrwin wants to merge 1 commit into bytecodealliance:main from ConradIrwin:fix-interval-duration

Conversation

@ConradIrwin

In debug mode we check for tokens being dropped out of order, but in a
release build this can fail.

Closes #12692

@ConradIrwin ConradIrwin requested a review from a team as a code owner February 27, 2026 23:32
@ConradIrwin ConradIrwin requested review from cfallin and removed request for a team February 27, 2026 23:32
@cfallin
Member

cfallin commented Feb 28, 2026

Thanks for your report and this PR!

I don't think any of the current contributors have touched this code in a while, so we may not have a full understanding of its invariants, but I'm surprised that "tokens are dropped out of order" -- do you have a reproduction of that? (I see the companion issue you filed said you had no repro, but maybe you do with this PR?)

In any case, I'm happy to merge it as a conservative fix but I'd like to understand if there's an underlying invariant violation that it would mask first, too.

@cfallin
Member

cfallin commented Feb 28, 2026

(As an aside, I'm also surprised you're seeing this in (presumably production) telemetry -- is the timing infra enabled in your configuration?)

@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Feb 28, 2026
@ConradIrwin
Author

@cfallin No reproduction, just a somewhat surprising crash that showed up in our telemetry.

I think timings are logged if RUST_LOG=debug is exported (regardless of build settings), which seems plausible for a Zed user working on Rust.

Claude's (very hand-wavy) explanation for the out-of-order timing was that, as these things happen on different threads, the drop ordering was not guaranteed; but I'm not sure I believe it. (It seems like some care is taken to avoid this, hence the debug assert here: https://github.com/ConradIrwin/wasmtime/blob/c5214df981a88d30226e9b3ac1b88da32395806a/cranelift/codegen/src/timing.rs#L269).

The other potential explanations are something weird with Instants (e.g. this purported issue: zed-industries/zed#46636); or, given this has only happened once, a particularly targeted cosmic ray.

@cfallin
Member

cfallin commented Feb 28, 2026

Claude's (very hand-wavy) explanation for the out-of-order timing was that, as these things happen on different threads, the drop ordering was not guaranteed; but I'm not sure I believe it. (It seems like some care is taken to avoid this, hence the debug assert here: https://github.com/ConradIrwin/wasmtime/blob/c5214df981a88d30226e9b3ac1b88da32395806a/cranelift/codegen/src/timing.rs#L269).

OK, that's fortunately not possible -- Claude must not have seen the thread_local! on the relevant data structure here, such that compilations in different threads never interact. Drop order should be purely a function of the (hopefully deterministic!) pass driver structure in one compilation in one thread.

I'll take a closer look at this code and try to understand its invariants next week -- in the meantime, any further info you can gather about repros or conditions this is seen under are of course appreciated. Thanks!

@alexcrichton
Member

This is perhaps one way to reproduce this:

    #[test]
    fn test() {
        let _a = process_file();
        drop(canonicalize_nans());
        take_current().total();
    }

The current invariant of dropping-in-order doesn't take into account when take_current() is called in the middle. Whether that's possible within Wasmtime/Cranelift, though, I'm not entirely sure. One theoretical way this could happen is that a rayon worker thread is in the middle of compiling something, so it's got half the timing for a function, and then it does some synchronization and ends up work-stealing a function to compile. This second function finishes and Wasmtime calls take_current(), and then rayon would later resume the original compilation.

Cranelift itself doesn't have any rayon infrastructure, but Wasmtime/wasmtime-internal-cranelift might have just enough to trigger this? The best fix for this might be to reset the timing data when starting to compile a function and restore it once compilation is finished.

@ConradIrwin
Author

zed-industries/zed#46636 seems to be hitting this too (but on a linux realtime kernel). They claim there are known bugs with Instant subtraction on a realtime linux kernel (but I can't find any other references to this).

The other thing I notice is that Cranelift makes heavy use of catch_unwind; but unfortunately Zed's panic handler acts more like abort-on-panic. I assume from the comments that the catch_unwind is intended as a backup mechanism (not the primary mechanism) to prevent problems.

In terms of concurrency: we run a bunch of rayon stuff ourselves, and I'm pretty sure we can call Wasmtime compilation multiple times in parallel. I'm not familiar with how rayon manages thread pools, but if it has a global thread pool then interleaving between different compiles seems possible in the way we use it.

@cfallin
Member

cfallin commented Mar 2, 2026

The other thing I notice is that crane lift makes heavy use of catch_unwind

Not really? The only two calls to that function in cranelift/ are in the filetest runner and bugpoint, two separate CLI utilities not involved with the main compilation flow.

I think that Alex is onto the right hunch here with take_current -- I will try to get to this today.

@cfallin
Member

cfallin commented Mar 3, 2026

The current invariant of dropping-in-order doesn't take into account when take_current() is called in the middle. Whether that's possible within Wasmtime/Cranelift, though, I'm not entirely sure. One theoretical way this could happen is that a rayon worker thread is in the middle of compiling something, so it's got half the timing for a function, and then it does some synchronization and ends up work-stealing a function to compile. This second function finishes and Wasmtime calls take_current(), and then rayon would later resume the original compilation.

Cranelift itself doesn't have any rayon infrastructure, but Wasmtime/wasmtime-internal-cranelift might have just enough to trigger this? The best fix for this might be to reset the timing data when starting to compile a function and restore it once compilation is finished.

I've read through the logic in wasmtime-internal-cranelift now and I'm not seeing how we could invoke take_current() in the middle of a compilation (or nest two compilations) -- there are no async functions or await points here, there's no recursion, and take_current is only called once at the toplevel after invoking Cranelift.

I strongly believe that this PR's approach of using a saturating subtraction is wrong: it is hiding an actual bug deeper in the logic by just ignoring a mismatch.

We could tighten up some assertions, e.g. have a generation number in PassTimes so that if it is prematurely reset and then a drop-token later increments, we catch that properly. If something in Zed or Wasmtime is "holding it wrong" that wouldn't resolve the actual bug but it would help to find it. @ConradIrwin any progress on getting a reproducer?

@ConradIrwin
Author

@cfallin None from my side. That said, the reporter in zed-industries/zed#46636 seems to be able to reproduce on demand (albeit under different circumstances). He thinks this is an expected bug in Rust's time calculations as described in rust-lang/rust#56612, but feel free to reach out directly.

Looking closer at telemetry, in the last few weeks we've seen this overflow 14 times on Linux and 3 times on Windows (no macOS). (Prior to that it was definitely also happening on Linux, and probably elsewhere, but we made some changes to crash grouping so it's hard to know for sure.) This is across roughly O(10,000,000) app opens in the same period, so it's not exactly common.

(Also, my mistake on the catch_unwind: I was seeing it in the stack trace inside rayon, which propagates panics by default anyway.)

@ConradIrwin
Author

If you wanted to update the assertions to try to catch things earlier, I'm willing to ship them to Zed to help out.

@cfallin
Member

cfallin commented Mar 3, 2026

OK, I've put up #12709 for this. I'll go ahead and close this PR if you don't mind since I don't think the "hide the issue by ignoring underflow" approach is what we want.

@cfallin cfallin closed this Mar 3, 2026