Don't panic in duration underflow#12693
Don't panic in duration underflow#12693ConradIrwin wants to merge 1 commit intobytecodealliance:mainfrom
Conversation
In debug mode we check for tokens being dropped out of order, but in a release build this can fail.
|
Thanks for your report and this PR! I don't think any of the current contributors have touched this code in a while so we may not have a full understanding of its invariants, but I'm surprised that "tokens are dropped out of order" -- do you have a reproduction of that? (I see your companion issue filed said you had no repro, but maybe you do with this PR?) In any case, I'm happy to merge it as a conservative fix but I'd like to understand if there's an underlying invariant violation that it would mask first, too. |
|
(As an aside, I'm also surprised you're seeing this in (presumably production) telemetry -- is the timing infra enabled in your configuration?) |
|
@cfallin no reproduction, just a somewhat surprising crash showed up in our telemetry. I think timings are logged if Claude's (very hand-wavy) explanation for the out of order timing was that as these things happen on different threads, the drop ordering was not guaranteed; but not sure I believe it. (It seems like some care is taken to avoid this, hence the debug assert here: https://github.com/ConradIrwin/wasmtime/blob/c5214df981a88d30226e9b3ac1b88da32395806a/cranelift/codegen/src/timing.rs#L269). The other potential options are something weird with instants (e.g. this purported issue zed-industries/zed#46636); or given this has only happened once, a particularly targeted cosmic ray. |
OK, that's fortunately not possible -- Claude must not have seen the I'll take a closer look at this code and try to understand its invariants next week -- in the meantime, any further info you can gather about repros or conditions this is seen under are of course appreciated. Thanks! |
|
This is perhaps one way to reproduce this: #[test]
fn test() {
let _a = process_file();
drop(canonicalize_nans());
take_current().total();
}The current invariant of dropping-in-order doesn't take into account when Cranelift itself doesn't have any rayon infrastructure but Wasmtime/ |
|
zed-industries/zed#46636 seems to be hitting this too (but on a linux realtime kernel). They claim there are known bugs with Instant subtraction on a realtime linux kernel (but I can't find any other references to this). The other thing I notice is that crane lift makes heavy use of In terms of concurrency; we run a bunch of rayon stuff ourselves, and I'm pretty sure we can call wasmtime compile multiple times in parallel. I'm not familiar with how rayon manages thread pools but if it has a global thread pool then interleaving between different compiles seems possible in the way we use it. |
Not really? The only two calls to that function in I think that Alex is onto the right hunch here with |
I've read through the logic in I strongly believe that this PR's approach of using a saturating subtraction is wrong: it is hiding an actual bug deeper in the logic by just ignoring a mismatch. We could tighten up some assertions, e.g. have a generation number in |
|
@cfallin None from my side. That said, the reporter in zed-industries/zed#46636 seems to be able to reproduce on demand (albeit under different circumstances). He thinks this is an expected bug in Rust's time calculations as described in rust-lang/rust#56612, but feel free to reach out directly. Looking closer at telemetry, in the last few weeks we've seen this overflow 14 times on Linux, and 3 times on Windows (no macOS). (Prior to that it was also happening definitely on Linux and probably elsewhere, but we made some changes to crash grouping so it's hard to know for sure); this is across roughly O(10,000,000) app opens in the same time period, so it's not exactly common. (also my mistake on the catch_unwind; I was seeing it in the stack trace inside Rayon, which propagates it by default anyway) |
|
If you wanted to update the assertions to try and catch things earlier, I'm willing to ship it to zed to help out. |
|
OK, I've put up #12709 for this. I'll go ahead and close this PR if you don't mind since I don't think the "hide the issue by ignoring underflow" approach is what we want. |
In debug mode we check for tokens being dropped out of order, but in a
release build this can fail.
Closes #12692