Skip to content

Update README.md#1

Open
Endilll wants to merge 1 commit intomainfrom
ruleset-test
Open

Update README.md#1
Endilll wants to merge 1 commit intomainfrom
ruleset-test

Conversation

@Endilll
Copy link
Owner

@Endilll Endilll commented Sep 5, 2025

No description provided.

Endilll pushed a commit that referenced this pull request Feb 4, 2026
…lvm#178069)

Kernel panic is a special case, and there is no signal or exception for
that so we need to rely on special workaround called `dumptid`.
FreeBSDKernel plugin is supposed to find this thread and set it manually
through `SetStopInfo()` in `CalculateStopInfo()` like Mach core plugin
does.

Before (We had to find and select crashed thread list otherwise thread 1
was selected by default):
```
➜ sudo lldb /boot/panic/kernel -c /var/crash/vmcore.last
(lldb) target create "/boot/panic/kernel" --core "/var/crash/vmcore.last"
Core file '/var/crash/vmcore.last' (x86_64) was loaded.
(lldb) bt
* thread #1, name = '(pid 12991) dtrace'
  * frame #0: 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff8015882f780, flags=259) at sched_ule.c:2448:26
    frame #1: 0xffffffff80bd38d2 kernel`mi_switch(flags=259) at kern_synch.c:530:2
    frame llvm#2: 0xffffffff80c29799 kernel`sleepq_switch(wchan=0xfffff8014edff300, pri=0) at subr_sleepqueue.c:608:2
    frame llvm#3: 0xffffffff80c29b76 kernel`sleepq_catch_signals(wchan=0xfffff8014edff300, pri=0) at subr_sleepqueue.c:523:3
    frame llvm#4: 0xffffffff80c29d32 kernel`sleepq_timedwait_sig(wchan=<unavailable>, pri=<unavailable>) at subr_sleepqueue.c:704:11
    frame llvm#5: 0xffffffff80bd2e2d kernel`_sleep(ident=0xfffff8014edff300, lock=0xffffffff81df2880, priority=768, wmesg="uwait", sbt=2573804118162, pr=0, flags=512) at kern_synch.c:215:10
    frame llvm#6: 0xffffffff80be8622 kernel`umtxq_sleep(uq=0xfffff8014edff300, wmesg="uwait", timo=0xfffffe0279cb3d20) at kern_umtx.c:843:11
    frame llvm#7: 0xffffffff80bef87a kernel`do_wait(td=0xfffff8015882f780, addr=<unavailable>, id=0, timeout=0xfffffe0279cb3d90, compat32=1, is_private=1) at kern_umtx.c:1316:12
    frame llvm#8: 0xffffffff80bed264 kernel`__umtx_op_wait_uint_private(td=0xfffff8015882f780, uap=0xfffffe0279cb3dd8, ops=<unavailable>) at kern_umtx.c:3990:10
    frame llvm#9: 0xffffffff80beaabe kernel`sys__umtx_op [inlined] kern__umtx_op(td=<unavailable>, obj=<unavailable>, op=<unavailable>, val=<unavailable>, uaddr1=<unavailable>, uaddr2=<unavailable>, ops=<unavailable>) at kern_umtx.c:4999:10
    frame llvm#10: 0xffffffff80beaa89 kernel`sys__umtx_op(td=<unavailable>, uap=<unavailable>) at kern_umtx.c:5024:10
    frame llvm#11: 0xffffffff81122cd1 kernel`amd64_syscall [inlined] syscallenter(td=0xfffff8015882f780) at subr_syscall.c:165:11
    frame llvm#12: 0xffffffff81122c19 kernel`amd64_syscall(td=0xfffff8015882f780, traced=0) at trap.c:1208:2
    frame llvm#13: 0xffffffff810f1dbb kernel`fast_syscall_common at exception.S:570
```

After:
```
➜ sudo ./build/bin/lldb /boot/panic/kernel -c /var/crash/vmcore.last
(lldb) target create "/boot/panic/kernel" --core "/var/crash/vmcore.last"
Core file '/var/crash/vmcore.last' (x86_64) was loaded.
(lldb) bt
* thread llvm#18, name = '(pid 5409) powerd (crashed)', stop reason = kernel panic
  * frame #0: 0xffffffff80bc6c91 kernel`__curthread at pcpu_aux.h:57:2 [inlined]
    frame #1: 0xffffffff80bc6c91 kernel`doadump(textdump=0) at kern_shutdown.c:399:2
    frame llvm#2: 0xffffffff804b3b7a kernel`db_dump(dummy=<unavailable>, dummy2=<unavailable>, dummy3=<unavailable>, dummy4=<unavailable>) at db_command.c:596:10
    frame llvm#3: 0xffffffff804b396d kernel`db_command(last_cmdp=<unavailable>, cmd_table=<unavailable>, dopager=true) at db_command.c:508:3
    frame llvm#4: 0xffffffff804b362d kernel`db_command_loop at db_command.c:555:3
    frame llvm#5: 0xffffffff804b7026 kernel`db_trap(type=<unavailable>, code=<unavailable>) at db_main.c:267:3
    frame llvm#6: 0xffffffff80c16aaf kernel`kdb_trap(type=3, code=0, tf=0xfffffe01b605b930) at subr_kdb.c:790:13
    frame llvm#7: 0xffffffff8112154e kernel`trap(frame=<unavailable>) at trap.c:614:8
    frame llvm#8: 0xffffffff810f14c8 kernel`calltrap at exception.S:285
    frame llvm#9: 0xffffffff81da2290 kernel`cn_devtab + 64
    frame llvm#10: 0xfffffe01b605b8b0
    frame llvm#11: 0xffffffff84001c43 dtrace.ko`dtrace_panic(format=<unavailable>) at dtrace.c:652:2
    frame llvm#12: 0xffffffff84005524 dtrace.ko`dtrace_action_panic(ecb=0xfffff80539cad580) at dtrace.c:7022:2 [inlined]
    frame llvm#13: 0xffffffff840054de dtrace.ko`dtrace_probe(id=88998, arg0=14343377283488, arg1=<unavailable>, arg2=<unavailable>, arg3=<unavailable>, arg4=<unavailable>) at dtrace.c:7665:6
    frame llvm#14: 0xffffffff83e5213d systrace.ko`systrace_probe(sa=<unavailable>, type=<unavailable>, retval=<unavailable>) at systrace.c:226:2
    frame llvm#15: 0xffffffff8112318d kernel`syscallenter(td=0xfffff801318d5780) at subr_syscall.c:160:4 [inlined]
    frame llvm#16: 0xffffffff81123112 kernel`amd64_syscall(td=0xfffff801318d5780, traced=0) at trap.c:1208:2
    frame llvm#17: 0xffffffff810f1dbb kernel`fast_syscall_common at exception.S:570
```
Endilll pushed a commit that referenced this pull request Feb 4, 2026
In this PR i move the insertion point in the
`yieldReplacementForFusedProducer` because i ran into some issue where a
`tensor.extract_slices` tried to use a result of `affine.apply` that was
inserted at the end of the block instead of the start of it.

This is the full error of the test i added before this change:

```mlir
third-party/llvm-project/mlir/test/Interfaces/TilingInterface/tile-fuse-and-yield-using-scfforall.mlir:83:11: error: operand #1 does not dominate this use
  %pack = linalg.pack %gen#1
          ^
third-party/llvm-project/mlir/test/Interfaces/TilingInterface/tile-fuse-and-yield-using-scfforall.mlir:83:11: note: see current operation: %24 = "tensor.extract_slice"(%23, %36, %8) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
third-party/llvm-project/mlir/test/Interfaces/TilingInterface/tile-fuse-and-yield-using-scfforall.mlir:71:12: note: operand defined here (op in the same block)
  %gen:2 = linalg.generic {
           ^
// -----// IR Dump After InterpreterPass Failed (transform-interpreter) //----- //
#map = affine_map<(d0, d1) -> (d0, d1)>
#map1 = affine_map<(d0) -> (d0 * 16)>
#map2 = affine_map<(d0) -> (d0 * -16 + 32)>
#map3 = affine_map<(d0) -> (16, d0 * -16 + 32)>
#map4 = affine_map<(d0) -> (d0 - 1)>
"builtin.module"() ({
  "func.func"() <{function_type = (tensor<32x1024xf32>) -> (tensor<32x1024xf32>, tensor<2x512x16x2xi8>), sym_name = "fuse_pack_consumer_into_multi_output_generic"}> ({
  ^bb0(%arg1: tensor<32x1024xf32>):
    %2 = "arith.constant"() <{value = 0 : i8}> : () -> i8
    %3 = "tensor.empty"() : () -> tensor<32x1024xf32>
    %4 = "tensor.empty"() : () -> tensor<32x1024xi8>
    %5 = "tensor.empty"() : () -> tensor<2x512x16x2xi8>
    %6:2 = "linalg.generic"(%arg1, %3, %4) <{indexing_maps = [#map, #map, #map], iterator_types = [#linalg.iterator_type<parallel>, #linalg.iterator_type<parallel>], operandSegmentSizes = array<i32: 1, 2>}> ({
    ^bb0(%arg9: f32, %arg10: f32, %arg11: i8):
      %41 = "arith.fptoui"(%arg9) : (f32) -> i8
      "linalg.yield"(%arg9, %41) : (f32, i8) -> ()
    }) : (tensor<32x1024xf32>, tensor<32x1024xf32>, tensor<32x1024xi8>) -> (tensor<32x1024xf32>, tensor<32x1024xi8>)
    %7:3 = "scf.forall"(%5, %3, %4) <{operandSegmentSizes = array<i32: 0, 0, 0, 3>, staticLowerBound = array<i64: 0>, staticStep = array<i64: 1>, staticUpperBound = array<i64: 2>}> ({
    ^bb0(%arg2: index, %arg3: tensor<2x512x16x2xi8>, %arg4: tensor<32x1024xf32>, %arg5: tensor<32x1024xi8>):
      %8 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index
      %9 = "affine.apply"(%arg2) <{map = #map2}> : (index) -> index
      %10 = "affine.min"(%arg2) <{map = #map3}> : (index) -> index
      %11 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      %12 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index
      %13 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      %14 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index
      %15 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      %16 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index
      %17 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      %18 = "tensor.extract_slice"(%arg1, %12, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
      %19 = "tensor.empty"() : () -> tensor<32x1024xf32>
      %20 = "tensor.extract_slice"(%19, %14, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
      %21 = "tensor.extract_slice"(%3, %14, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
      %22 = "tensor.empty"() : () -> tensor<32x1024xi8>
      %23 = "tensor.extract_slice"(%22, %16, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8>
      %24 = "tensor.extract_slice"(%4, %16, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8>
      %25 = "tensor.empty"() : () -> tensor<32x1024xf32>
      %26 = "tensor.extract_slice"(%25, %38, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
      %27 = "tensor.extract_slice"(%arg4, %38, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xf32>, index, index) -> tensor<?x1024xf32>
      %28 = "tensor.empty"() : () -> tensor<32x1024xi8>
      %29 = "tensor.extract_slice"(%28, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8>
      %30 = "tensor.extract_slice"(%arg5, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8>
      %31:2 = "linalg.generic"(%18, %27, %30) <{indexing_maps = [#map, #map, #map], iterator_types = [#linalg.iterator_type<parallel>, #linalg.iterator_type<parallel>], operandSegmentSizes = array<i32: 1, 2>}> ({
      ^bb0(%arg6: f32, %arg7: f32, %arg8: i8):
        %40 = "arith.fptoui"(%arg6) : (f32) -> i8
        "linalg.yield"(%arg6, %40) : (f32, i8) -> ()
      }) : (tensor<?x1024xf32>, tensor<?x1024xf32>, tensor<?x1024xi8>) -> (tensor<?x1024xf32>, tensor<?x1024xi8>)
      %32 = "tensor.extract_slice"(%6#1, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<32x1024xi8>, index, index) -> tensor<?x1024xi8>
      %33 = "tensor.empty"() : () -> tensor<2x512x16x2xi8>
      %34 = "tensor.extract_slice"(%33, %arg2) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 1, 512, 16, 2>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<2x512x16x2xi8>, index) -> tensor<1x512x16x2xi8>
      %35 = "tensor.extract_slice"(%arg3, %arg2) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 1, 512, 16, 2>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<2x512x16x2xi8>, index) -> tensor<1x512x16x2xi8>
      %36 = "linalg.pack"(%31#1, %35, %2) <{inner_dims_pos = array<i64: 0, 1>, operandSegmentSizes = array<i32: 1, 1, 1, 0>, static_inner_tiles = array<i64: 16, 2>}> : (tensor<?x1024xi8>, tensor<1x512x16x2xi8>, i8) -> tensor<1x512x16x2xi8>
      %37 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      %38 = "affine.apply"(%arg2) <{map = #map1}> : (index) -> index
      %39 = "affine.apply"(%10) <{map = #map4}> : (index) -> index
      "scf.forall.in_parallel"() ({
        "tensor.parallel_insert_slice"(%36, %arg3, %arg2) <{operandSegmentSizes = array<i32: 1, 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808, 0, 0, 0>, static_sizes = array<i64: 1, 512, 16, 2>, static_strides = array<i64: 1, 1, 1, 1>}> : (tensor<1x512x16x2xi8>, tensor<2x512x16x2xi8>, index) -> ()
        "tensor.parallel_insert_slice"(%31#0, %arg4, %38, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<?x1024xf32>, tensor<32x1024xf32>, index, index) -> ()
        "tensor.parallel_insert_slice"(%31#1, %arg5, %8, %10) <{operandSegmentSizes = array<i32: 1, 1, 1, 1, 0>, static_offsets = array<i64: -9223372036854775808, 0>, static_sizes = array<i64: -9223372036854775808, 1024>, static_strides = array<i64: 1, 1>}> : (tensor<?x1024xi8>, tensor<32x1024xi8>, index, index) -> ()
      }) : () -> ()
    }) : (tensor<2x512x16x2xi8>, tensor<32x1024xf32>, tensor<32x1024xi8>) -> (tensor<2x512x16x2xi8>, tensor<32x1024xf32>, tensor<32x1024xi8>)
    "func.return"(%7#1, %7#0) : (tensor<32x1024xf32>, tensor<2x512x16x2xi8>) -> ()
  }) : () -> ()
  "builtin.module"() ({
    "transform.named_sequence"() <{arg_attrs = [{transform.readonly}], function_type = (!transform.any_op) -> (), sym_name = "__transform_main"}> ({
    ^bb0(%arg0: !transform.any_op):
      %0 = "transform.structured.match"(%arg0) <{ops = ["linalg.pack"]}> : (!transform.any_op) -> !transform.any_op
      %1:2 = "transform.test.fuse_and_yield"(%0) <{tile_interchange = [], tile_sizes = [1], use_forall = true}> : (!transform.any_op) -> (!transform.any_op, !transform.any_op)
      "transform.yield"() : () -> ()
    }) : () -> ()
  }) {transform.with_named_sequence} : () -> ()
}) : () -> ()
``` 

I also noticed that Interface tests are missing from the bazel overlay
so i also added this.
Endilll pushed a commit that referenced this pull request Feb 4, 2026
…m#167446)

Add SVE optimization for AArch64 architectures. The idea is to use
predicate registers to avoid branching.
Microbench in repo shows considerable improvements on NV GB10 (locked on
largest X925):

```
======================================================================
BENCHMARK STATISTICS (time in nanoseconds)
======================================================================

memcpy_Google_A:
  Old - Mean: 3.1257 ns, Median: 3.1162 ns
  New - Mean: 2.8402 ns, Median: 2.8265 ns
  Improvement: +9.14% (mean), +9.30% (median)

memcpy_Google_B:
  Old - Mean: 2.3171 ns, Median: 2.3159 ns
  New - Mean: 1.6589 ns, Median: 1.6593 ns
  Improvement: +28.40% (mean), +28.35% (median)

memcpy_Google_D:
  Old - Mean: 8.7602 ns, Median: 8.7645 ns
  New - Mean: 8.4307 ns, Median: 8.4308 ns
  Improvement: +3.76% (mean), +3.81% (median)

memcpy_Google_L:
  Old - Mean: 1.7137 ns, Median: 1.7091 ns
  New - Mean: 1.4530 ns, Median: 1.4553 ns
  Improvement: +15.22% (mean), +14.85% (median)

memcpy_Google_M:
  Old - Mean: 1.9823 ns, Median: 1.9825 ns
  New - Mean: 1.4826 ns, Median: 1.4840 ns
  Improvement: +25.20% (mean), +25.15% (median)

memcpy_Google_Q:
  Old - Mean: 1.6812 ns, Median: 1.6784 ns
  New - Mean: 1.1538 ns, Median: 1.1517 ns
  Improvement: +31.37% (mean), +31.38% (median)

memcpy_Google_S:
  Old - Mean: 2.1816 ns, Median: 2.1786 ns
  New - Mean: 1.6297 ns, Median: 1.6287 ns
  Improvement: +25.29% (mean), +25.24% (median)

memcpy_Google_U:
  Old - Mean: 2.2851 ns, Median: 2.2825 ns
  New - Mean: 1.7219 ns, Median: 1.7187 ns
  Improvement: +24.65% (mean), +24.70% (median)

memcpy_Google_W:
  Old - Mean: 2.0408 ns, Median: 2.0361 ns
  New - Mean: 1.5260 ns, Median: 1.5252 ns
  Improvement: +25.23% (mean), +25.09% (median)

uniform_384_to_4096:
  Old - Mean: 26.9067 ns, Median: 26.8845 ns
  New - Mean: 26.8083 ns, Median: 26.8149 ns
  Improvement: +0.37% (mean), +0.26% (median)
```
The beginning of the memcpy function looks like the following:
```
Dump of assembler code for function _ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm:
   0x0000000000001340 <+0>:     cbz     x2, 0x143c <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+252>
   0x0000000000001344 <+4>:     cbz     x0, 0x1440 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+256>
   0x0000000000001348 <+8>:     cbz     x1, 0x1444 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+260>
   0x000000000000134c <+12>:    subs    x8, x2, #0x20
   0x0000000000001350 <+16>:    b.hi    0x1374 <_ZN22__llvm_libc_22_0_0_git6memcpyEPvPKvm+52>  // b.pmore
   0x0000000000001354 <+20>:    rdvl    x8, #1
   0x0000000000001358 <+24>:    whilelo p0.b, xzr, x2
   0x000000000000135c <+28>:    ld1b    {z0.b}, p0/z, [x1]
   0x0000000000001360 <+32>:    whilelo p1.b, x8, x2
   0x0000000000001364 <+36>:    ld1b    {z1.b}, p1/z, [x1, #1, mul vl]
   0x0000000000001368 <+40>:    st1b    {z0.b}, p0, [x0]
   0x000000000000136c <+44>:    st1b    {z1.b}, p1, [x0, #1, mul vl]
   0x0000000000001370 <+48>:    ret
```

---------

Co-authored-by: Guillaume Chatelet <chatelet.guillaume@gmail.com>
Endilll pushed a commit that referenced this pull request Feb 6, 2026
…8306)

In FreeBSD, allproc is a prepend list and new processes are appended at
head. This results in reverse pid order, so we first need to order pid
incrementally then print threads according to the correct order.

Before:
```
Process 0 stopped
* thread #1: tid = 101866, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff8015882f780, flags=259) at sched_ule.c:2448:26, name = '(pid 12991) dtrace'
  thread llvm#2: tid = 101915, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80158825780, flags=259) at sched_ule.c:2448:26, name = '(pid 11509) zsh'
  thread llvm#3: tid = 101942, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80142599000, flags=259) at sched_ule.c:2448:26, name = '(pid 11504) ftcleanup'
  thread llvm#4: tid = 101545, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80131898000, flags=259) at sched_ule.c:2448:26, name = '(pid 5599) zsh'
  thread llvm#5: tid = 100905, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff80131899000, flags=259) at sched_ule.c:2448:26, name = '(pid 5598) sshd-session'
  thread llvm#6: tid = 101693, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff8015886e780, flags=259) at sched_ule.c:2448:26, name = '(pid 5595) sshd-session'
  thread llvm#7: tid = 101626, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801588be000, flags=259) at sched_ule.c:2448:26, name = '(pid 5592) sh'
...
```

After:
```
(lldb) thread list
Process 0 stopped
* thread #1: tid = 100000, 0xffffffff80bf9322 kernel`sched_switch(td=0xffffffff81abe840, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel'
  thread llvm#2: tid = 100035, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d9780, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_0'
  thread llvm#3: tid = 100036, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d9000, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_1'
  thread llvm#4: tid = 100037, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d8780, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_2'
  thread llvm#5: tid = 100038, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d8000, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_3'
  thread llvm#6: tid = 100039, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d7780, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_4'
  thread llvm#7: tid = 100040, 0xffffffff80bf9322 kernel`sched_switch(td=0xfffff801052d7000, flags=259) at sched_ule.c:2448:26, name = '(pid 0) kernel/softirq_5'
...
```

Signed-off-by: Minsoo Choo <minsoochoo0122@proton.me>
Endilll pushed a commit that referenced this pull request Feb 24, 2026
…er. (llvm#181941)

The progress event reporter has a thread that reports events every 250
millisecond. and is destroyed in its destructor.

When in event reporter desctructor, the event reporter may have pending
event but the call mutex is destroyed leading to the crash.

Relevant stack trace from CI.
```
[2026-02-13T17:46:13.577Z] libc++abi: terminating due to uncaught exception of type std::__1::system_error: mutex lock failed: Invalid argument 
[2026-02-13T17:46:13.577Z] PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash report from ~/Library/Logs/DiagnosticReports/. 

[2026-02-13T17:46:13.577Z]  #0 0x0000000102b6943c llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake-os-verficiation/lldb-build/bin/lldb-dap+0x10008943c) 
[2026-02-13T17:46:13.577Z]  #1 0x0000000102b67368 llvm::sys::RunSignalHandlers() (/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake-os-verficiation/lldb-build/bin/lldb-dap+0x100087368) 
[2026-02-13T17:46:13.577Z]  llvm#2 0x0000000102b69f20 SignalHandler(int, __siginfo*, void*) (/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake-os-verficiation/lldb-build/bin/lldb-dap+0x100089f20)
[2026-02-13T17:46:13.577Z]  llvm#3 0x000000018bbdb744 (/usr/lib/system/libsystem_platform.dylib+0x1804e3744) 
[2026-02-13T17:46:13.577Z]  llvm#4 0x000000018bbd1888 (/usr/lib/system/libsystem_pthread.dylib+0x1804d9888) 
[2026-02-13T17:46:13.577Z]  llvm#5 0x000000018bad6850 (/usr/lib/system/libsystem_c.dylib+0x1803de850) 
[2026-02-13T17:46:13.577Z]  llvm#6 0x000000018bb85858 (/usr/lib/libc++abi.dylib+0x18048d858) 
[2026-02-13T17:46:13.577Z]  llvm#7 0x000000018bb744bc (/usr/lib/libc++abi.dylib+0x18047c4bc) 
[2026-02-13T17:46:13.577Z]  llvm#8 0x000000018b7a0424 (/usr/lib/libobjc.A.dylib+0x1800a8424) 
[2026-02-13T17:46:13.577Z]  llvm#9 0x000000018bb84c2c (/usr/lib/libc++abi.dylib+0x18048cc2c) 
[2026-02-13T17:46:13.577Z] llvm#10 0x000000018bb88394 (/usr/lib/libc++abi.dylib+0x180490394) 
[2026-02-13T17:46:13.577Z] llvm#11 0x000000018bb8833c (/usr/lib/libc++abi.dylib+0x18049033c) 
[2026-02-13T17:46:13.577Z] llvm#12 0x000000018bb01b90 (/usr/lib/libc++.1.dylib+0x180409b90) 
[2026-02-13T17:46:13.577Z] llvm#13 0x000000018bb01b34 (/usr/lib/libc++.1.dylib+0x180409b34) 
[2026-02-13T17:46:13.577Z] llvm#14 0x000000018bb038a0 (/usr/lib/libc++.1.dylib+0x18040b8a0) 
[2026-02-13T17:46:13.577Z] llvm#15 0x0000000102b6fbac lldb_dap::DAP::Send(std::__1::variant<lldb_dap::protocol::Request, lldb_dap::protocol::Response, lldb_dap::protocol::Event> const&) (/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake-os-verficiation/lldb-build/bin/lldb-dap+0x10008fbac) 
[2026-02-13T17:46:13.577Z] llvm#16 0x0000000102b6f890 lldb_dap::DAP::SendJSON(llvm::json::Value const&) (/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake-os-verficiation/lldb-build/bin/lldb-dap+0x10008f890) 
[2026-02-13T17:46:13.577Z] llvm#17 0x0000000102b78788 std::__1::__function::__func<lldb_dap::DAP::DAP(lldb_dap::Log&, lldb_dap::ReplMode, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>, bool, llvm::StringRef, lldb_private::transport::JSONTransport<lldb_dap::ProtocolDescriptor>&, lldb_private::MainLoopPosix&)::$_0, std::__1::allocator<lldb_dap::DAP::DAP(lldb_dap::Log&, lldb_dap::ReplMode, std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>, bool, llvm::StringRef, lldb_private::transport::JSONTransport<lldb_dap::ProtocolDescriptor>&, lldb_private::MainLoopPosix&)::$_0>, void (lldb_dap::ProgressEvent&)>::operator()(lldb_dap::ProgressEvent&) (/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake-os-verficiation/lldb-build/bin/lldb-dap+0x100098788) 
[2026-02-13T17:46:13.577Z] llvm#18 0x0000000102b8939c lldb_dap::ProgressEventManager::ReportIfNeeded() (/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake-os-verficiation/lldb-build/bin/lldb-dap+0x1000a939c) 
[2026-02-13T17:46:13.577Z] llvm#19 0x0000000102b8982c lldb_dap::ProgressEventReporter::ReportStartEvents() (/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake-os-verficiation/lldb-build/bin/lldb-dap+0x1000a982c) 
[2026-02-13T17:46:13.577Z] llvm#20 0x0000000102b8a038 void* std::__1::__thread_proxy[abi:nn200100]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct>>, lldb_dap::ProgressEventReporter::ProgressEventReporter(std::__1::function<void (lldb_dap::ProgressEvent&)>)::$_0>>(void*) (/Users/ec2-user/jenkins/workspace/llvm.org/as-lldb-cmake-os-verficiation/lldb-build/bin/lldb-dap+0x1000aa038) 
[2026-02-13T17:46:13.577Z] llvm#21 0x000000018bbd1c08 (/usr/lib/system/libsystem_pthread.dylib+0x1804d9c08) [2026-02-13T17:46:13.577Z] llvm#22 0x000000018bbccba8 (/usr/lib/system/libsystem_pthread.dylib+0x1804d4ba8)
```
rdar://170331108
Endilll pushed a commit that referenced this pull request Feb 24, 2026
I created an issue about this in llvm#179976.

Clang's Address Sanitizer installs its own SEH filter which handles some
types of uncaught exceptions. Along with register values and some other
information, it also generates a stack trace. However, current logic is
incomplete. It relies on DbgHelp's SymFunctionTableAccess64 and
SymGetModuleBase64 which won't work with machine code that has its
RUNTIME_FUNCTION entry registered with Rtl* (e.g. RtlAddFunctionTable)
system calls. Most likely, this is because DbgHelp either relies on
information in PDB files or considers PDATA and XDATA only from loaded
EXE and DLL modules. Either way, consider the following example:

```
#include <windows.h>
#include <iostream>
#include <vector>

typedef union _UNWIND_CODE {
    struct {
        BYTE CodeOffset;
        BYTE UnwindOp : 4;
        BYTE OpInfo : 4;
    };
    USHORT FrameOffset;
} UNWIND_CODE, * PUNWIND_CODE;

typedef struct _UNWIND_INFO {
    BYTE Version : 3;
    BYTE Flags : 5;
    BYTE SizeOfProlog;
    BYTE CountOfCodes;
    BYTE FrameRegister : 4;
    BYTE FrameOffset : 4;
    UNWIND_CODE UnwindCode[1]; // Variable size
} UNWIND_INFO, * PUNWIND_INFO;

#define UWOP_PUSH_NONVOL      0
#define UWOP_ALLOC_LARGE      1
#define UWOP_ALLOC_SMALL      2
#define UWOP_SET_FPREG        3
#define UWOP_SAVE_NONVOL      4
#define UWOP_SAVE_NONVOL_FAR  5
#define UWOP_SAVE_XMM128      8
#define UWOP_SAVE_XMM128_FAR  9
#define UWOP_PUSH_MACHFRAME   10

int main() {
    // PUSH RBX         (0x53)                - Save non-volatile register
    // SUB RSP, 0x20    (0x48 0x83 0xEC 0x20) - Allocate 32 bytes (shadow space)
    // XOR RAX, RAX     (0x48 0x31 0xC0)      - Zero out RAX
    // MOV RAX, [RAX]   (0x48 0x8B 0x00)      - Dereference NULL
    
    std::vector<unsigned char> code = {
        0x53,
        0x48, 0x83, 0xEC, 0x20,
        0x48, 0x31, 0xC0,
        0x48, 0x8B, 0x00
    };

    size_t codeSize = code.size();
    size_t totalSize = 100;

    LPVOID pMemory = VirtualAlloc(NULL, totalSize, MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    
    BYTE* pCodeBase = (BYTE*)pMemory;
    PUNWIND_INFO pUnwindInfo = (PUNWIND_INFO)(pCodeBase + codeSize);    

    size_t alignmentPadding = 0;
    if ((size_t)pUnwindInfo % 4 != 0) {
        alignmentPadding = 4 - ((size_t)pUnwindInfo % 4);
        pUnwindInfo = (PUNWIND_INFO)((BYTE*)pUnwindInfo + alignmentPadding);
    }

    memcpy(pCodeBase, code.data(), codeSize);

    pUnwindInfo->Version = 1;
    pUnwindInfo->Flags = UNW_FLAG_NHANDLER;
    pUnwindInfo->Flags = 0; 
    pUnwindInfo->SizeOfProlog = 5; 
    pUnwindInfo->CountOfCodes = 2; 
    pUnwindInfo->FrameRegister = 0;
    pUnwindInfo->FrameOffset = 0;

    pUnwindInfo->UnwindCode[0].CodeOffset = 5;
    pUnwindInfo->UnwindCode[0].UnwindOp = UWOP_ALLOC_SMALL;
    pUnwindInfo->UnwindCode[0].OpInfo = 3; 

    pUnwindInfo->UnwindCode[1].CodeOffset = 1;
    pUnwindInfo->UnwindCode[1].UnwindOp = UWOP_PUSH_NONVOL;
    pUnwindInfo->UnwindCode[1].OpInfo = 3; // RBX

    RUNTIME_FUNCTION tableEntry = {};
    tableEntry.BeginAddress = 0;
    tableEntry.EndAddress = (DWORD)codeSize;
    tableEntry.UnwindData = (DWORD)((BYTE*)pUnwindInfo - (BYTE*)pMemory);

    DWORD64 baseAddress = (DWORD64)pMemory;
    RtlAddFunctionTable(&tableEntry, 1, baseAddress);

    typedef void(*FuncType)();
    FuncType myFunc = (FuncType)pMemory;
    myFunc();

    return 0;
}
```

Windows' kernel can propagate hardware exception through that function,
so clearly these entries are at least partially correct. Right now,
ASan's stack walking produces this (compiled with latest release,
clang++):

```
PS D:\Local Projects\cpp-playground> ./a.exe
=================================================================
==14216==ERROR: AddressSanitizer: access-violation on unknown address 0x000000000000 (pc 0x0199561c0008 bp 0x004cf0cffb30 sp 0x004cf0cff970 T0)
==14216==The signal is caused by a READ memory access.
==14216==Hint: address points to the zero page.
    #0 0x0199561c0007  (<unknown module>)
    #1 0x000000000000  (<unknown module>)
    llvm#2 0x000000000000  (<unknown module>)

==14216==Register values:
rax = 0  rbx = 4cf0cffaa0  rcx = 7ffcb97b4e28  rdx = 19955dc0000
rdi = 11bf564a0040  rsi = 0  rbp = 4cf0cffb30  rsp = 4cf0cff970
r8  = 7ffffffffffffffc  r9  = 1  r10 = 0  r11 = 246
r12 = 0  r13 = 0  r14 = 0  r15 = 0
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: access-violation (<unknown module>)
==14216==ABORTING
```

Frames one and two is just some stack space allocated by that dynamic
function. While patched version produces this:

```
PS D:\Local Projects\cpp-playground> ./a.exe
=================================================================
==13660==ERROR: AddressSanitizer: access-violation on unknown address 0x000000000000 (pc 0x01ed5ad70008 bp 0x00d76492f650 sp 0x00d76492f490 T0)
==13660==The signal is caused by a READ memory access.
==13660==Hint: address points to the zero page.
    #0 0x01ed5ad70007  (<unknown module>)
    #1 0x7ff732e518a1 in main (D:\Local Projects\cpp-playground\a.exe+0x1400018a1)
    llvm#2 0x7ff732e56a9b in invoke_main D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:78
    llvm#3 0x7ff732e56a9b in __scrt_common_main_seh D:\a\_work\1\s\src\vctools\crt\vcstartup\src\startup\exe_common.inl:288
    llvm#4 0x7ffcb878e8d6  (C:\WINDOWS\System32\KERNEL32.DLL+0x18002e8d6)
    llvm#5 0x7ffcb966c53b  (C:\WINDOWS\SYSTEM32\ntdll.dll+0x18008c53b)

==13660==Register values:
rax = 0  rbx = d76492f5c0  rcx = 7ffcb97b4e28  rdx = 1ed5a870000
rdi = 12135afa0040  rsi = 0  rbp = d76492f650  rsp = d76492f490
r8  = 7ffffffffffffffc  r9  = 1  r10 = 0  r11 = 246
r12 = 0  r13 = 0  r14 = 0  r15 = 0
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: access-violation (<unknown module>)
==13660==ABORTING
```

Now we see that stack walking handled our dynamic function properly.
Interestingly enough, it appears that other overloaded version of
UnwindSlow procedure that works without CONTEXT structure already has
some logic to handle this. Theoretically, symbolizer should also be able
to provide some information about these functions, but I don't think
that this is necessary.

I added SANITIZER_WINDOWS64 check because I am pretty sure Microsoft
only mentions these functions for 64 bit version of their OS. I also
can't check how this works on ARM.
Endilll pushed a commit that referenced this pull request Mar 3, 2026
Using code/ideas from the x86 backend to optimize a select on a bitcast
integer. The previous aarch64 approach was to individually extract the
bits from the mask, which is kind of terrible.

https://rust.godbolt.org/z/576sndT66

```llvm
define void @if_then_else8(ptr %out, i8 %mask, ptr %if_true, ptr %if_false) {
start:
  %t = load <8 x i32>, ptr %if_true, align 4
  %f = load <8 x i32>, ptr %if_false, align 4
  %m = bitcast i8 %mask to <8 x i1>
  %s = select <8 x i1> %m, <8 x i32> %t, <8 x i32> %f
  store <8 x i32> %s, ptr %out, align 4
  ret void
}
```

turned into 

```asm
if_then_else8:                          // @if_then_else8
        sub     sp, sp, llvm#16
        ubfx    w8, w1, llvm#4, #1
        and     w11, w1, #0x1
        ubfx    w9, w1, llvm#5, #1
        fmov    s1, w11
        ubfx    w10, w1, #1, #1
        fmov    s0, w8
        ubfx    w8, w1, llvm#6, #1
        ldp     q5, q2, [x3]
        mov     v1.h[1], w10
        ldp     q4, q3, [x2]
        mov     v0.h[1], w9
        ubfx    w9, w1, llvm#2, #1
        mov     v1.h[2], w9
        ubfx    w9, w1, llvm#3, #1
        mov     v0.h[2], w8
        ubfx    w8, w1, llvm#7, #1
        mov     v1.h[3], w9
        mov     v0.h[3], w8
        ushll   v1.4s, v1.4h, #0
        ushll   v0.4s, v0.4h, #0
        shl     v1.4s, v1.4s, llvm#31
        shl     v0.4s, v0.4s, llvm#31
        cmlt    v1.4s, v1.4s, #0
        cmlt    v0.4s, v0.4s, #0
        bsl     v1.16b, v4.16b, v5.16b
        bsl     v0.16b, v3.16b, v2.16b
        stp     q1, q0, [x0]
        add     sp, sp, llvm#16
        ret
```

With this PR that instead emits

```asm
if_then_else8:
   adrp x8, .LCPI0_1
   dup v0.4s, w1
   ldr q1, [x8, :lo12:.LCPI0_1]
   adrp x8, .LCPI0_0
   ldr q2, [x8, :lo12:.LCPI0_0]
   ldp q4, q3, [x2]
   and v1.16b, v0.16b, v1.16b
   and v0.16b, v0.16b, v2.16b
   ldp q5, q2, [x3]
   cmeq v1.4s, v1.4s, #0
   cmeq v0.4s, v0.4s, #0
   bsl v1.16b, v2.16b, v3.16b
   bsl v0.16b, v5.16b, v4.16b
   stp q0, q1, [x0]
   ret
```

So substantially shorter. Instead of building the mask
element-by-element, this approach (by virtue of not splitting) instead
splats the mask value into all vector lanes, performs a bitwise and with
powers of 2, and compares with zero to construct the mask vector.

cc rust-lang/rust#122376
cc llvm#175769
Endilll pushed a commit that referenced this pull request Mar 3, 2026
llvm#184186)

…83889)"

This reverts commit 2342db0.

Revert "[CMake] Propagate dependencies to OBJECT libraries in
`add_llvm_library` (llvm#183541)"

This reverts commit e3c0454.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant