Skip to content

Cranelift: robustify timing infrastructure against mis-use and/or system clock bugs.#12709

Merged
cfallin merged 1 commit intobytecodealliance:mainfrom
cfallin:cranelift-timing-asserts-and-instant-backstop
Mar 3, 2026
Merged

Cranelift: robustify timing infrastructure against mis-use and/or system clock bugs.#12709
cfallin merged 1 commit intobytecodealliance:mainfrom
cfallin:cranelift-timing-asserts-and-instant-backstop

Conversation

@cfallin
Copy link
Member

@cfallin cfallin commented Mar 3, 2026

In #12692, it was observed that the computation of time spent in nested timing spans, minus child time, was underflowing.

Correct operation of the handling of nested spans depends on the invariant that the accumulated time for child spans is less than or equal to a parent span once timing is completed. This property should hold as long as the system clock is monotonic, and as long as timing tokens are dropped in-order, so that the elapsed time of a parent truly is computed after the elapsed time of a child ends.

The timing state may also temporarily violate this invariant whenever a token is pending (still on stack and timing): the child time of any completed child spans will be counted, but the parent has not yet been. Hence, taking a snapshot of the state and computing "parent minus children" on that snapshot may observe cases that yield underflow.

This PR makes the infrastructure more robust along a few different dimensions:

  • It hardens the clock source we use to have a locally-ensured guarantee of monotonicity, since we rely on this for logical correctness. In particular, for each thread (since timing spans never move between threads), we track the last Instant that was used by the timing infrastructure, and use that value (zero time passed) if the system clock moves backward.

  • It hardens the assert about pass-timing token drop order from a debug_assert to an assert. If this invariant is being violated, we want to know about it noisily, rather than observing a subtraction underflow or other inconsistency later.

  • It adds an assert in take_current() to ensure that a snapshot is never taken when any pass timing is in progress.

This should address any theoretically possible sources of #12692, as far as I can tell.

…tem clock bugs.

In bytecodealliance#12692, it was observed that the computation of time spent in
nested timing spans, minus child time, was underflowing.

Correct operation of the handling of nested spans depends on the
invariant that the accumulated time for child spans is less than or
equal to a parent span once timing is completed. This property should
hold as long as the system clock is monotonic, and as long as timing
tokens are dropped in-order, so that the elapsed time of a parent
truly is computed after the elapsed time of a child ends.

The timing state may also temporarily violate this invariant whenever
a token is pending (still on stack and timing): the child time of any
completed child spans will be counted, but the parent has not yet
been. Hence, taking a snapshot of the state and computing "parent
minus children" on that snapshot may observe cases that yield
underflow.

This PR makes the infrastructure more robust along a few different
dimensions:

- It hardens the clock source we use to have a locally-ensured
  guarantee of monotonicity, since we rely on this for logical
  correctness. In particular, for each thread (since timing spans never
  move between threads), we track the last `Instant` that was used by
  the timing infrastructure, and use that value (zero time passed) if
  the system clock moves backward.

- It hardens the assert about pass-timing token drop order from a
  `debug_assert` to an `assert`. If this invariant is being violated,
  we want to know about it noisily, rather than observing a
  subtraction underflow or other inconsistency later.

- It adds an assert in `take_current()` to ensure that a snapshot is
  never taken when any pass timing is in progress.

This should address any theoretically possible sources of bytecodealliance#12692, as
far as I can tell.
@cfallin cfallin requested a review from a team as a code owner March 3, 2026 18:44
@cfallin cfallin requested review from alexcrichton and removed request for a team March 3, 2026 18:44
@cfallin cfallin added this pull request to the merge queue Mar 3, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Mar 3, 2026
@cfallin
Copy link
Member Author

cfallin commented Mar 3, 2026

CI failure on merge queue in the OpenVino-on-Windows test:

    Permission denied (os error 2)
> using cached artifact: D:\a\wasmtime\wasmtime\target\debug\build\wasmtime-wasi-nn-23e0bb9516bdddbc\out\fixtures\model.onnx

I'll try landing again to see if it's spurious. cc @jlb6740 if you have seen this before or have any other ideas (if it becomes a repeating issue we'll have to disable the test)

@cfallin cfallin added this pull request to the merge queue Mar 3, 2026
Merged via the queue into bytecodealliance:main with commit 94a5e2c Mar 3, 2026
45 checks passed
@cfallin cfallin deleted the cranelift-timing-asserts-and-instant-backstop branch March 3, 2026 22:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants