Support Mixed-Precision Quantization #4

Merged
lhpqaq merged 1 commit into master from mixed-pr on Jan 13, 2026
Conversation

@lhpqaq (Owner) commented on Jan 13, 2026

No description provided.

lhpqaq merged commit 6d349f9 into master on Jan 13, 2026
48 of 64 checks passed
lhpqaq added a commit that referenced this pull request Jan 14, 2026
commit f62a80b
Author: lhpqaq <liuhaopeng@apache.org>
Date:   Wed Jan 14 21:04:06 2026 +0800

    simd

commit d8b99c9
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 12:29:55 2026 +0000

    Implement ggml-level Q8_0 KV cache optimization: add ggml_vec_mad_q8_0

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>
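A scalar sketch of the operation this commit adds: `ggml_vec_mad_q8_0` accumulates `y[i] += v * x[i]` where `x` is a Q8_0-quantized vector (blocks of 32 int8 values sharing one scale). The real kernel is SIMD-optimized; this reference version also simplifies ggml's fp16 block scale to a plain float so the sketch stays self-contained.

```c
#include <stdint.h>

#define QK8_0 32  // ggml Q8_0 block size: 32 elements per block

// Simplified Q8_0 block: ggml stores the scale as fp16 (ggml_fp16_t);
// a float is used here to keep the sketch self-contained.
typedef struct {
    float  d;          // per-block scale
    int8_t qs[QK8_0];  // quantized values
} block_q8_0;

// Scalar reference for y[i] += v * dequant(x[i]) over a Q8_0 vector.
// The kernel in the commit vectorizes this inner loop.
static void vec_mad_q8_0(int n, float *y, const block_q8_0 *x, float v) {
    for (int ib = 0; ib < n / QK8_0; ib++) {
        const float dv = x[ib].d * v;  // fold the block scale into the multiplier
        for (int j = 0; j < QK8_0; j++) {
            y[ib*QK8_0 + j] += dv * x[ib].qs[j];
        }
    }
}
```

Operating on whole blocks like this is what lets flash attention accumulate directly against a quantized V cache without a separate dequantize pass.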

commit a94c7a5
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 10:15:24 2026 +0000

    Improve comments explaining V quantization constraints in non-flash attention paths

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit d53bf50
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 10:12:12 2026 +0000

    Allow K cache quantization with non-flash attention, V quantization requires flash attention

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>
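The rule this commit establishes can be sketched as a small validation function (names here are illustrative, not the PR's actual identifiers): a quantized K cache is accepted on any attention path, but a quantized V cache is rejected unless flash attention is enabled, since the non-flash path reads V with strides that quantized blocks cannot satisfy.

```c
#include <stdbool.h>

// Illustrative cache-type enum; the PR supports F16 plus Q8_0/Q4_0/Q5_0.
typedef enum { TYPE_F16, TYPE_Q8_0, TYPE_Q4_0, TYPE_Q5_0 } cache_type;

static bool is_quantized(cache_type t) { return t != TYPE_F16; }

// Sketch of the constraint from the commit above.
static bool kv_cache_config_valid(cache_type k, cache_type v, bool flash_attn) {
    if (is_quantized(v) && !flash_attn) {
        return false;  // quantized V requires flash attention
    }
    return true;       // quantized K is allowed either way
}
```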

commit 96da180
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 09:44:38 2026 +0000

    Fix non-flash attention path V tensor strides to use n_ctx-based layout

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 610f5f8
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 08:29:10 2026 +0000

    Implement separate KV cache types for kv_self, kv_cross, and kv_pad

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 294c6c5
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 08:11:28 2026 +0000

    Document three KV cache types (kv_self, kv_cross, kv_pad) with usage recommendations

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit db20ad3
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:58:27 2026 +0000

    Add validation: quantized KV cache types require flash attention enabled

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit f653dff
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:48:26 2026 +0000

    Add quantized KV cache performance analysis and optimization recommendations

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 93ebaef
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:14:07 2026 +0000

    Fix comment about block size determination

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 317b980
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:12:03 2026 +0000

    Implement KV cache quantization with Q8_0, Q4_0, Q5_0 support using ggml_row_size

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>
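A rough model of the `ggml_row_size` arithmetic this commit relies on: each of the supported quant types packs QK = 32 elements into a fixed-size block, so a cache row of n elements (n assumed block-aligned) occupies n/32 blocks. The block byte sizes below match ggml's `block_q8_0` (fp16 scale + 32 int8), `block_q4_0` (fp16 scale + 16 nibble-packed bytes), and `block_q5_0` (fp16 scale + 4 high-bit bytes + 16 nibble-packed bytes).

```c
#include <stddef.h>

enum { QK = 32 };  // elements per block for Q8_0, Q4_0, Q5_0
enum {
    SZ_Q8_0 = 2 + 32,      // 34 bytes/block
    SZ_Q5_0 = 2 + 4 + 16,  // 22 bytes/block
    SZ_Q4_0 = 2 + 16,      // 18 bytes/block
};

// Mirrors ggml_row_size(type, n) for block-quantized types:
// bytes per row = (n / block_size) * bytes_per_block.
static size_t row_size(size_t block_bytes, size_t n) {
    return (n / QK) * block_bytes;
}
```

For example, a 4096-element row costs 8192 bytes in F16 but 4352 bytes in Q8_0, roughly a 47% saving, which is the point of quantizing the KV cache.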

commit 32d9709
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 19:51:11 2026 +0000

    Implement separate K/V cache type configuration for mixed precision

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 3429b9a
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:29:07 2026 +0000

    Add practical KV cache optimization strategies section with implementation guidance

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 512a42b
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:18:32 2026 +0000

    Revert Q8_0 KV cache code changes; document block alignment issue and mixed precision strategies

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit b3d0e2d
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:08:20 2026 +0000

    Improve KV cache Q8_0 documentation and help text clarity

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit bb5674c
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:06:25 2026 +0000

    Add KV Cache Q8_0 quantization support for reduced memory usage

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 0fe0dab
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 17:53:49 2026 +0000

    Initial plan

commit 6d349f9
Merge: 679bdb5 f92fc36
Author: haopeng <liuhaopeng@apache.org>
Date:   Wed Jan 14 01:52:08 2026 +0800

    Merge pull request #4 from lhpqaq/mixed-pr

    Support Mixed-Precision Quantization

commit f92fc36
Author: lhpqaq <liuhaopeng@apache.org>
Date:   Mon Jan 12 14:44:23 2026 +0800

    Support Mixed-Precision Quantization

commit 679bdb5
Author: KITAITI Makoto <KitaitiMakoto@gmail.com>
Date:   Mon Jan 5 17:41:22 2026 +0900

    ruby : fix segmentation fault (ggml-org#3591)

    * Mark long live variable

    * Fix test for Whisper::Token#deconstruct_keys(nil)

    * Don't use long live variable

    * Fix indentation

commit e9898dd
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 18:26:42 2025 +0200

    sync : ggml

commit ce03f8e
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 18:24:07 2025 +0200

    ggml : bump version to 0.9.5 (ggml/1410)

commit 7359ac9
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 13:13:57 2025 +0200

    talk-llama : sync llama.cpp

commit 54fa821
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 13:09:05 2025 +0200

    sync : ggml

commit 8189f2c
Author: gatbontonpc <gatbontonpc@gmail.com>
Date:   Wed Dec 31 00:39:48 2025 -0800

    metal : add count_equal op (llama/18314)

    * add count equal for metal

    * remove trailing whitespace

    * updated doc ops table

    * changed shmem to i32

    * added multi tg and templating

    * removed BLAS support from Metal docs

    * Apply suggestions from code review

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

    * add memset to set dst to 0

    * metal : cleanup

    ---------

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

commit 2d250f8
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Wed Dec 31 09:37:00 2025 +0100

    CUDA: fix KQ max calculation (llama/18487)

commit 5deaf8f
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 09:53:48 2025 +0200

    metal : remove BF16 x F16 kernels (llama/18456)

commit 4679331
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Wed Dec 31 14:23:44 2025 +0800

    sycl: add newline at the end of CMakeLists.txt (llama/18503)

commit a363549
Author: Rahul Sathe <150351592+rrsathe@users.noreply.github.com>
Date:   Wed Dec 31 06:38:44 2025 +0530

    Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (llama/18345)

    * cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x

    * [AI] sycl: auto-detect and skip incompatible IntelSYCL package

    Automatically detect compiler versions with incompatible IntelSYCL
    CMake configuration files and fall back to manual SYCL flags instead
    of requiring users to set options manually.

    Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
    has SYCL_FEATURE_TEST_EXTRACT invocation errors.

    * refactor: improve SYCL provider handling and error messages in CMake configuration

    * refactor: enhance SYCL provider validation and error handling in CMake configuration

    * ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes

commit c995536
Author: Charles Xu <charles.xu@arm.com>
Date:   Tue Dec 30 13:04:53 2025 +0100

    kleidiai: add and integrate SVE 256-bit vector-length kernel (llama/18458)

    * kleidiai: add and integrate SVE 256-bit vector-length kernel

    * updated for review comments

commit 6d4aa96
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Tue Dec 30 17:40:46 2025 +0800

    CUDA: add log line when mxfp4 acceleration is used (llama/18483)

    * CUDA: add log line when mxfp4 acceleration is used

    * add in backend_get_features

commit 5765c5b
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Mon Dec 29 17:58:20 2025 +0100

    CUDA: fix replacment of bad archs in CMake (llama/18457)

commit d6cb240
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Mon Dec 29 09:35:42 2025 +0100

    CUDA: Blackwell features for non-native builds (llama/18436)

commit e49e88b
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Mon Dec 29 14:07:17 2025 +0800

    cuda: fix race condition in cumsum (llama/18448)

    * ggml-cuda: fix race condition in cumsum

    * remove unneccesary sync_threads

commit 20f5729
Author: uvos <carl@uvos.xyz>
Date:   Sun Dec 28 20:12:55 2025 +0100

    HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (llama/18202)

commit b8d209f
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Sun Dec 28 20:53:36 2025 +0800

    Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (llama/18426)

commit 54fe9a6
Author: o7si <32285332+o7si@users.noreply.github.com>
Date:   Sun Dec 28 18:34:41 2025 +0800

    rpc: fix segfault on invalid endpoint format (llama/18387)

    * rpc: fix segfault on invalid endpoint format

    * rpc: add error log for failed endpoint connection

commit b3788ef
Author: Boian Berberov <7432115+bberberov@users.noreply.github.com>
Date:   Sun Dec 28 07:33:29 2025 +0000

    cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (llama/18186)

    * minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

    * cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

    - `ivybridge`
    - `piledriver`
    - `cannonlake`
    - `cascadelake`
    - `cooperlake`
    - `zen4`

    Resolves: #17966

commit 31fc2c3
Author: QDelta <60222316+QDelta@users.noreply.github.com>
Date:   Sat Dec 27 20:33:14 2025 -0500

    ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (llama/18413)

commit a800a3a
Author: lhez <lih@qti.qualcomm.com>
Date:   Sat Dec 27 15:51:14 2025 -0800

    opencl: allow resizing transpose buffers (llama/18384)

    * opencl: allow resizing transpose buffers instead of using fixed sizes

    * opencl: remove commented code

commit 29f8155
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Sat Dec 27 19:56:27 2025 +0800

    ggml-cuda: Use same regex for GGML_NATIVE=OFF (llama/18407)

commit 015b618
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 16:12:58 2025 -0600

    vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (llama/18352)

    Run a preprocess to count how many times each expert is used, and use this to
    quickly discard workgroups that aren't needed.

commit e37c8ed
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 11:15:50 2025 -0600

    vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (llama/18349)

    * vulkan: Use BK=32 for coopmat2 mul_mat_id

    * vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

    Disable robustness, remove the OOB check in decodeFuncB, and initialize the
    row_ids to zero to avoid OOB access.

    Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
    to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
    zero and remove the '& (BN - 1)'. This allows the compiler to common some of
    the shared memory loads.

commit 331c6cc
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 11:15:02 2025 -0600

    vulkan: Use BK=32 for coopmat2 mul_mat_id (llama/18332)

commit 35cb4ab
Author: Eve <139727413+netrunnereve@users.noreply.github.com>
Date:   Fri Dec 26 17:12:11 2025 +0000

    vulkan: small dequantization improvements (llama/18380)

    * iq4_xs

    * quants

commit 181e36f
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 10:00:57 2025 -0600

    vulkan: Support UPSCALE w/antialias (llama/18327)

commit 67473fe
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 09:53:46 2025 -0600

    vulkan: handle rope with large number of rows (llama/18306)

commit 33f75a8
Author: 0Marble <85058989+0Marble@users.noreply.github.com>
Date:   Fri Dec 26 09:12:04 2025 +0800

    CANN: implement the SSM_CONV operator (llama/17737)

    * CANN: implement SSM_CONV operator

    Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
    Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

    * CANN: remove custom error limit for SSM_CONV

    * CANN: merge SSM_CONV tensor shape/strides into one line

    ---------

    Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

commit 5177835
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Fri Dec 26 01:35:14 2025 +0800

    ggml-cuda: fix regex for arch list (llama/18371)

    * ggml-cuda: fix regex for arch list

    * make regex exact

commit 8e02f09
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Thu Dec 25 23:55:38 2025 +0800

    cuda: optimize cumsum cub path (llama/18362)

    * cuda: optimize cumsum cub path

    * remove heavy perf test

commit ea07c5d
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Thu Dec 25 22:12:11 2025 +0800

    ggml-cuda: fix blackwell native builds (llama/18361)

    * ggml-cuda: fix blackwell native builds

    Replace 12x in native architectures by 12xa

    * replace for GGML_NATIVE=OFF too

    * only replace for native

    * remove 120f-virtual for default compilation

    ---------

    Co-authored-by: Aman Gupta <aman>

commit 5f0488f
Author: Penglin Cai <1402538448@qq.com>
Date:   Thu Dec 25 16:46:09 2025 +0800

    CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (llama/17934)

    * CONV_TRANSPOSE_1D kernel_size>255

    * remove condition check

    * fix the bug of type conversion

    * removing trailing whitespaces

    * fix: return true in the switch case

commit db75fff
Author: Aadeshveer Singh <24b0926@iitb.ac.in>
Date:   Thu Dec 25 09:41:13 2025 +0530

    ggml : optimize cuda cumsum fallback kernel (llama/18343)

commit 41e578e
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Wed Dec 24 22:28:26 2025 +0800

    CUDA: experimental native mxfp4 support for blackwell (llama/17906)

    * CUDA: experimental native mxfp4 support for blackwell

    * optimize load_tiles

    * optimize quantize_mxfp4

    * cleanup

    * first pass review: formatting

    * use interleaved layout for mma

    * mmq: add assert for size

    * use __nv_fp4x4_e2m1

    * use iter_k as 512, cleanup

    * Use 1200 as blackwell instead of 1000

    * address review comments

    * mmq: fix stride

    * quantize.cu: use reference impl of e8m0 scale

    * address review comments

    * add 120f-virtual + minor fixes

    ---------

    Co-authored-by: Aman Gupta <aman>

commit f863735
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Wed Dec 24 05:36:34 2025 -0600

    vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (llama/18302)

commit bab2c02
Author: Wang Weixuan <wangweixvan@gmail.com>
Date:   Wed Dec 24 17:50:24 2025 +0800

    CANN : refactor ACL graph cache (llama/17752)

    Move the graph property checking code into methods of LRU cache.

    Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>

commit 1356600
Author: Ruben Ortlam <picard12@live.de>
Date:   Wed Dec 24 08:59:14 2025 +0100

    vulkan: use fewer FA rows for small cache runs (llama/18280)

commit ec9239d
Author: TianHao324 <tianhao42@huawei.com>
Date:   Wed Dec 24 14:55:33 2025 +0800

    CANN: Uses yarn_ramp cache in ROPE (llama/17725)

commit 9bdd465
Author: Chris Rohlf <chris.rohlf@gmail.com>
Date:   Tue Dec 23 04:56:49 2025 -0500

    rpc : add check for rpc buffer type (llama/18242)

commit e4c8961
Author: nullname <chraac@gmail.com>
Date:   Tue Dec 23 15:13:24 2025 +0800

    ggml-hexagon: create generalized functions for cpu side op (llama/17500)

    * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

    * refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

    * refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

    * add comment

    * refactor: remove redundant buffer checks in hexagon supported operations

    * wip

    * add missing include to fix weak symbol warning

    * add ggml_hexagon_op_generic

    * refactor: simplify tensor operation initialization and buffer management in hexagon implementation

    * refactor: streamline hexagon operation initialization and buffer management

    * refactor: update function signatures and streamline request handling in hexagon operations

    * wip

    * ggml-hexagon: clean up code formatting and improve unary operation handling

    * wip

    * rename

    * fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

    * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

    refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

    refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

    refactor: remove redundant buffer checks in hexagon supported operations

    add missing include to fix weak symbol warning

    add ggml_hexagon_op_generic

    refactor: simplify tensor operation initialization and buffer management in hexagon implementation

    refactor: streamline hexagon operation initialization and buffer management

    refactor: update function signatures and streamline request handling in hexagon operations

    ggml-hexagon: clean up code formatting and improve unary operation handling

    fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

    * hexagon: fix merge conflicts

    * hexagon: minor cleanup for buffer support checks

    * hexagon: factor out op_desc and the overal op logging

    * hexagon: further simplify and cleanup op dispatch logic

    * snapdragon: update adb scripts to use llama-cli and llama-completion

    * fix pipeline failure

    ---------

    Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

commit 2f33395
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Mon Dec 22 13:56:52 2025 -0500

    ggml-hexagon: gelu optimization (llama/18151)

    * feat: working gelu with src0 put on vtcm

    * feat: gelu ping-pong for both in and out

    * fix: fixu compile error

    * break: distinguish dma ddr->vtcm and vtcm->ddr operation

    * fix: fix dma queue size

    * break: update dma api to either pop src or dst ptr

    * fix: fix activation vtcm allocation issue for src1 when swapperd

    * refactor: ping-pong gelu logic to avoid unnecessary if else

    * dma: improved queue interface and prefetch handling

    * gelu: fix N+2 block prefetch

    ---------

    Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

commit 5b0c1c1
Author: Taimur Ahmad <taimur.ahmad@10xengineers.ai>
Date:   Mon Dec 22 23:20:23 2025 +0500

    llamafile: add rvv support for sgemm kernels (llama/18199)

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

commit f2fe1e5
Author: lhez <lih@qti.qualcomm.com>
Date:   Mon Dec 22 10:19:01 2025 -0800

    opencl: unpack q4_0 for adreno in get_tensor (llama/18278)

commit dbbe6c1
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Mon Dec 22 11:03:13 2025 -0600

    vulkan: Extend rope fusions to allow mrope (llama/18264)

    Extend the test-backend-ops tests as well.

commit 98e59a4
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 14:52:09 2025 -0600

    vulkan: Implement set_tensor_async and the event interfaces (llama/18047)

    The goal is to enable the async loading code paths in
    llama_model_loader::load_all_data, originally from #7896. This works and the
    loads themselves are faster, but with host visible vidmem I think the cost of
    allocating/mapping vidmem moves and becomes more expensive, and I don't see a
    benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
    significant improvement in model loading time.

commit b68b12f
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Sun Dec 21 19:33:08 2025 +0100

    llama: fix RPC for -fit on (llama/18233)

commit b893e08
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:32:58 2025 -0600

    vulkan: fix im2col overflowing maxworkgroupcount (llama/18180)

commit f407c5e
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:27:34 2025 -0600

    vulkan/cuda: fix topk_moe with exp_probs_b (llama/18071)

    I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
    and added coverage for exp_probs_b and some other missing combinations. This
    exposed a bug in both CUDA and Vulkan backends where they were assuming the
    input to argsort and the input to get_rows are the same. I'd like to optimize
    this graph in another change, but for now just get it functional.

    CUDA also had a bug where it got n_experts from the wrong place, leading to
    GGML_ASSERT failures in some of the new tests.

commit ad6ee38
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:17:58 2025 -0600

    vulkan: support GGML_UNARY_OP_XIELU (llama/18062)

commit 3cd141f
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:05:08 2025 -0600

    vulkan: in graph_optimize, try to group ADD operations (llama/18060)

    I saw the adds not staying together in the new nemotron 3 nano model.

commit 449fc7c
Author: lovedheart <6277001+lovedheart@users.noreply.github.com>
Date:   Sun Dec 21 09:59:52 2025 +0100

    Vulkan: some improvement on mul_mat_iq2_xs (llama/18031)

    * Some improvement on mul_mat_iq2_xs

    Refactor calculations for db values and grid data to optimize performance and reduce redundancy.

    * Fix trailing whitespace

commit 0983985
Author: Aadeshveer Singh <24b0926@iitb.ac.in>
Date:   Sat Dec 20 16:58:57 2025 +0530

    Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context (llama/18212)

commit 17a4cb1
Author: Alfred <zxu3@clemson.edu>
Date:   Fri Dec 19 12:42:28 2025 -0500

    ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (llama/17977)

    * feat: implement real Q8_0

    * feat: adding cmake option for configuring FP32 quantize group size

    * typo: set() shall be used

    ---------

    Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>

commit 195d8d0
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Thu Dec 18 23:36:46 2025 -0600

    vulkan: Add perf logger mode with concurrency (llama/17944)

    This implements a variation of the perf logger where rather than timing each
    operation individually with effectively a barrier in between, we put the
    timing boundaries where we already synchronize and time the groups of work
    that normally overlap. This can be useful to help understand whether
    individual operations need to be optimized, or if the group is already running
    efficiently.

    GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
    GGML_VK_PERF_LOGGER is also set).

    GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.

commit fea481f
Author: Xuan-Son Nguyen <son@huggingface.co>
Date:   Fri Dec 19 00:18:01 2025 +0100

    model : add ASR support for LFM2-Audio-1.5B (conformer) (llama/18106)

    * ASR with LFM2-Audio-1.5B

    * Set rope_theta

    * Fix comment

    * Remove rope_theta setting

    * Address PR feedback

    * rename functions to conformer

    * remove some redundant ggml_cont

    * fix missing tensor

    * add prefix "a." for conv tensors

    * remove redundant reshape

    * clean up

    * add test model

    ---------

    Co-authored-by: Tarek Dakhran <tarek@liquid.ai>

commit 956fac4
Author: Taimur Ahmad <taimur.ahmad@10xengineers.ai>
Date:   Thu Dec 18 19:02:09 2025 +0500

    ggml-cpu: extend support for RVV floating-point kernels (llama/17318)

    * cmake: add BF16 RVV flag for ggml-cpu

    * ggml-cpu: add floating-point conversion kernels

    * ggml: add floating-point kernels

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

    * ggml-cpu: fix lmul in vec_dot_bf16

    * ggml-cpu: change redsum to lmul 4, fix leftover

    ---------

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

commit 325a9b7
Author: yulo <77381088+zhang-hui-yulo@users.noreply.github.com>
Date:   Thu Dec 18 19:50:56 2025 +0800

    remove i_major_dual (llama/18157)

    Co-authored-by: zhang hui <you@example.com>

commit c3a1608
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Wed Dec 17 16:38:21 2025 -0500

    ggml-hexagon: swiglu_oai operation (llama/18114)

    * snapshot: debug ggml-hexagon swiglu-oai

    * fix: fix hvx_min_scalar_f32

    * feat: working swiglu-oai

    * chore: fix formating isue

commit c7ccedb
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Wed Dec 17 13:39:32 2025 -0500

    ggml-hexagon: gelu operation (llama/17921)

    * feat: inital support for gelu using sigmoid approximation

    * snapshot: faster gelu using polynomial approximation

    * test: disable l2-block prefetch in polynomail approximation

    * Revert "test: disable l2-block prefetch in polynomail approximation"

    This reverts commit 72339994d45b2bed887e79994403c378d90b62b5.

    * Revert "snapshot: faster gelu using polynomial approximation"

    This reverts commit 2a787a61d11f9e63e5943a2e6d134b2f0c402ace.

    * debug: temporarily disable unnecessary log message for debug purpose

    * Feat: optiized unaligned sigmoid_f32

    * Feat: larger l2prefetch block

    * feat: apply unaligned-load optimization on mul and mul_scalar

    * Revert "debug: temporarily disable unnecessary log message for debug purpose"

    This reverts commit 84f2f23aa9f17e2fa826db969cd825d0ab192995.

    * refactor: cleanup commented unused code

    * chore: reformat code with clang-formatter to pass cli test

    * Revert "chore: reformat code with clang-formatter to pass cli test"

    This reverts commit 952877ec24732b12010c7fa7ed3fc8de4b74e718.

    * fix: fix loop overflow

    * chore: fix formating ci error

commit 1f72f00
Author: Alberto Cabrera Pérez <1478977+Alcpz@users.noreply.github.com>
Date:   Wed Dec 17 11:39:13 2025 +0000

    ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) (llama/18096)

    * wip: skeleton for q8_0 repack

    * q8_0 repack GEMV implementations

    * GEMM implementations

    * Formatting

    * Fixed format consistency of repack gemm and gemv declarations

    * gemv and gemm generic location consistent with declarations

    * Removed non-correct unused variables statements

    * Cleanup, consistent style

    * Missing generic fallbacks for x86 and powerpc

commit 9118c05
Author: yulo <77381088+zhang-hui-yulo@users.noreply.github.com>
Date:   Wed Dec 17 16:34:54 2025 +0800

    HIP: Refactor mma for RDNA and CDNA (llama/17990)

    * mma.cuh for rdna4

    * mma for rdna3

    * mmq for rdna4

    * mmq for rdna3

    * align i-major and j-major

    * cdna

    * fix cuda error

    * add missing tile of mfma

    * fix j-major wrong ne on CDNA

    * fix gramma and empty spaces

    ---------

    Co-authored-by: zhang hui <you@example.com>

commit 6114e69
Author: KITAITI Makoto <KitaitiMakoto@gmail.com>
Date:   Wed Dec 24 16:52:16 2025 +0900

    ruby : add Whisper::Token, fix model URI (ggml-org#3575)

    * Define and use macro to get context safely

    * Add test to check SEGV

    * Move installation guid after usage

    * [skip ci]Change doc slightly

    * [skip ci]Fix a typo in README

    * [skip ci]Add carry_initial_prompt option in README

    * Define GetVADSegments and use it

    * Use GetContext

    * Fix download URI of small.en-tdrz

    * Fix URI of CoreML models corresponding to quantized models

    * Cache computed string

    * Remove unused argument

    * Add Whisper::Token

    * Add document comments

    * Rename function: rb_whisper_token_s_new -> ruby_whisper_token_s_init

    * Fix size of token

    * Insert _get into function names

    * Add Whisper::Token#text

    * Add test for Whisper::Token#text

    * Declare static if possible

    * Change method names

    * Add Whisper::Token#deconstruct_keys

    * Add tests for Whisper::Token#deconstruct_keys

    * Add signatures for Whisper::Token

    * Complete signature

    * [skip ci]Add n_tokens to document of Segment
lhpqaq added a commit that referenced this pull request Jan 14, 2026
commit 9833290
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 18:05:14 2026 +0000

    Add ggml_vec_mad_q4_0 with SIMD optimization (NEON + AVX2) and integrate into flash attention

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>
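A scalar sketch of what this commit's `ggml_vec_mad_q4_0` computes (the committed kernel adds NEON and AVX2 paths; this reference version also simplifies ggml's fp16 block scale to a float). In ggml's Q4_0 layout, byte j of a block holds element j in its low nibble and element j + 16 in its high nibble, each stored with a +8 offset.

```c
#include <stdint.h>

#define QK4_0 32  // Q4_0 block: 32 elements packed into 16 bytes

// Simplified Q4_0 block (ggml stores d as fp16; float here for a
// self-contained sketch).
typedef struct {
    float   d;              // per-block scale
    uint8_t qs[QK4_0 / 2];  // two 4-bit values per byte
} block_q4_0;

// Scalar reference for y[i] += v * dequant(x[i]) over a Q4_0 vector.
static void vec_mad_q4_0(int n, float *y, const block_q4_0 *x, float v) {
    for (int ib = 0; ib < n / QK4_0; ib++) {
        const float dv = x[ib].d * v;
        float *yb = y + ib*QK4_0;
        for (int j = 0; j < QK4_0/2; j++) {
            const int lo = (x[ib].qs[j] & 0x0F) - 8;  // element j
            const int hi = (x[ib].qs[j] >>   4) - 8;  // element j + 16
            yb[j]           += dv * lo;
            yb[j + QK4_0/2] += dv * hi;
        }
    }
}
```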

commit f62a80b
Author: lhpqaq <liuhaopeng@apache.org>
Date:   Wed Jan 14 21:04:06 2026 +0800

    simd

commit d8b99c9
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 12:29:55 2026 +0000

    Implement ggml-level Q8_0 KV cache optimization: add ggml_vec_mad_q8_0

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit a94c7a5
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 10:15:24 2026 +0000

    Improve comments explaining V quantization constraints in non-flash attention paths

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit d53bf50
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 10:12:12 2026 +0000

    Allow K cache quantization with non-flash attention, V quantization requires flash attention

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 96da180
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 09:44:38 2026 +0000

    Fix non-flash attention path V tensor strides to use n_ctx-based layout

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 610f5f8
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 08:29:10 2026 +0000

    Implement separate KV cache types for kv_self, kv_cross, and kv_pad

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 294c6c5
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 08:11:28 2026 +0000

    Document three KV cache types (kv_self, kv_cross, kv_pad) with usage recommendations

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit db20ad3
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:58:27 2026 +0000

    Add validation: quantized KV cache types require flash attention enabled

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit f653dff
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:48:26 2026 +0000

    Add quantized KV cache performance analysis and optimization recommendations

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 93ebaef
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:14:07 2026 +0000

    Fix comment about block size determination

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 317b980
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:12:03 2026 +0000

    Implement KV cache quantization with Q8_0, Q4_0, Q5_0 support using ggml_row_size

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>
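For context on the row-size computation this commit relies on: Q8_0 stores each block of 32 values as a 2-byte fp16 scale plus 32 int8 quants (34 bytes per block), so a row shrinks roughly 3.8x versus F32. A minimal sketch, not the real `ggml_row_size` signature:

```cpp
#include <cstddef>

// Minimal sketch of a Q8_0 row-size computation (not the real ggml API):
// each block packs 32 values into a 2-byte fp16 scale + 32 int8 quants.
constexpr size_t QK8_0           = 32;     // values per block
constexpr size_t BLOCK_SIZE_Q8_0 = 2 + 32; // bytes per block

size_t row_size_q8_0(size_t n_elems) {
    // assumes n_elems is a multiple of the block size
    return n_elems / QK8_0 * BLOCK_SIZE_Q8_0;
}
```

For a 4096-wide row this gives 4352 bytes, versus 16384 bytes for F32.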

commit 32d9709
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 19:51:11 2026 +0000

    Implement separate K/V cache type configuration for mixed precision

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 3429b9a
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:29:07 2026 +0000

    Add practical KV cache optimization strategies section with implementation guidance

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 512a42b
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:18:32 2026 +0000

    Revert Q8_0 KV cache code changes; document block alignment issue and mixed precision strategies

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit b3d0e2d
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:08:20 2026 +0000

    Improve KV cache Q8_0 documentation and help text clarity

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit bb5674c
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:06:25 2026 +0000

    Add KV Cache Q8_0 quantization support for reduced memory usage

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 0fe0dab
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 17:53:49 2026 +0000

    Initial plan

commit 6d349f9
Merge: 679bdb5 f92fc36
Author: haopeng <liuhaopeng@apache.org>
Date:   Wed Jan 14 01:52:08 2026 +0800

    Merge pull request #4 from lhpqaq/mixed-pr

    Support Mixed-Precision Quantization

commit f92fc36
Author: lhpqaq <liuhaopeng@apache.org>
Date:   Mon Jan 12 14:44:23 2026 +0800

    Support Mixed-Precision Quantization

commit 679bdb5
Author: KITAITI Makoto <KitaitiMakoto@gmail.com>
Date:   Mon Jan 5 17:41:22 2026 +0900

    ruby : fix segmentation fault (ggml-org#3591)

    * Mark long live variable

    * Fix test for Whisper::Token#deconstruct_keys(nil)

    * Don't use long live variable

    * Fix indentation

commit e9898dd
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 18:26:42 2025 +0200

    sync : ggml

commit ce03f8e
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 18:24:07 2025 +0200

    ggml : bump version to 0.9.5 (ggml/1410)

commit 7359ac9
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 13:13:57 2025 +0200

    talk-llama : sync llama.cpp

commit 54fa821
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 13:09:05 2025 +0200

    sync : ggml

commit 8189f2c
Author: gatbontonpc <gatbontonpc@gmail.com>
Date:   Wed Dec 31 00:39:48 2025 -0800

    metal : add count_equal op (llama/18314)

    * add count equal for metal

    * remove trailing whitespace

    * updated doc ops table

    * changed shmem to i32

    * added multi tg and templating

    * removed BLAS support from Metal docs

    * Apply suggestions from code review

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

    * add memset to set dst to 0

    * metal : cleanup

    ---------

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

commit 2d250f8
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Wed Dec 31 09:37:00 2025 +0100

    CUDA: fix KQ max calculation (llama/18487)

commit 5deaf8f
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 09:53:48 2025 +0200

    metal : remove BF16 x F16 kernels (llama/18456)

commit 4679331
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Wed Dec 31 14:23:44 2025 +0800

    sycl: add newline at the end of CMakeLists.txt (llama/18503)

commit a363549
Author: Rahul Sathe <150351592+rrsathe@users.noreply.github.com>
Date:   Wed Dec 31 06:38:44 2025 +0530

    Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (llama/18345)

    * cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x

    * [AI] sycl: auto-detect and skip incompatible IntelSYCL package

    Automatically detect compiler versions with incompatible IntelSYCL
    CMake configuration files and fall back to manual SYCL flags instead
    of requiring users to set options manually.

    Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
    has SYCL_FEATURE_TEST_EXTRACT invocation errors.

    * refactor: improve SYCL provider handling and error messages in CMake configuration

    * refactor: enhance SYCL provider validation and error handling in CMake configuration

    * ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes

commit c995536
Author: Charles Xu <charles.xu@arm.com>
Date:   Tue Dec 30 13:04:53 2025 +0100

    kleidiai: add and integrate SVE 256-bit vector-length kernel (llama/18458)

    * kleidiai: add and integrate SVE 256-bit vector-length kernel

    * updated for review comments

commit 6d4aa96
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Tue Dec 30 17:40:46 2025 +0800

    CUDA: add log line when mxfp4 acceleration is used (llama/18483)

    * CUDA: add log line when mxfp4 acceleration is used

    * add in backend_get_features

commit 5765c5b
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Mon Dec 29 17:58:20 2025 +0100

    CUDA: fix replacement of bad archs in CMake (llama/18457)

commit d6cb240
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Mon Dec 29 09:35:42 2025 +0100

    CUDA: Blackwell features for non-native builds (llama/18436)

commit e49e88b
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Mon Dec 29 14:07:17 2025 +0800

    cuda: fix race condition in cumsum (llama/18448)

    * ggml-cuda: fix race condition in cumsum

    * remove unnecessary sync_threads

commit 20f5729
Author: uvos <carl@uvos.xyz>
Date:   Sun Dec 28 20:12:55 2025 +0100

    HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (llama/18202)

commit b8d209f
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Sun Dec 28 20:53:36 2025 +0800

    Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (llama/18426)

commit 54fe9a6
Author: o7si <32285332+o7si@users.noreply.github.com>
Date:   Sun Dec 28 18:34:41 2025 +0800

    rpc: fix segfault on invalid endpoint format (llama/18387)

    * rpc: fix segfault on invalid endpoint format

    * rpc: add error log for failed endpoint connection

commit b3788ef
Author: Boian Berberov <7432115+bberberov@users.noreply.github.com>
Date:   Sun Dec 28 07:33:29 2025 +0000

    cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (llama/18186)

    * minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

    * cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

    - `ivybridge`
    - `piledriver`
    - `cannonlake`
    - `cascadelake`
    - `cooperlake`
    - `zen4`

    Resolves: #17966

commit 31fc2c3
Author: QDelta <60222316+QDelta@users.noreply.github.com>
Date:   Sat Dec 27 20:33:14 2025 -0500

    ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (llama/18413)

commit a800a3a
Author: lhez <lih@qti.qualcomm.com>
Date:   Sat Dec 27 15:51:14 2025 -0800

    opencl: allow resizing transpose buffers (llama/18384)

    * opencl: allow resizing transpose buffers instead of using fixed sizes

    * opencl: remove commented code

commit 29f8155
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Sat Dec 27 19:56:27 2025 +0800

    ggml-cuda: Use same regex for GGML_NATIVE=OFF (llama/18407)

commit 015b618
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 16:12:58 2025 -0600

    vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (llama/18352)

    Run a preprocess to count how many times each expert is used, and use this to
    quickly discard workgroups that aren't needed.
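The counting pass described above can be sketched roughly like this (a hypothetical host-side C++ stand-in for the actual Vulkan shader logic):

```cpp
#include <vector>

// Hypothetical sketch of the preprocess: tally how often each expert
// appears in the routing ids, so workgroups assigned to unused experts
// can be discarded before any matmul work is done.
std::vector<int> count_expert_usage(const std::vector<int> & expert_ids, int n_experts) {
    std::vector<int> counts(n_experts, 0);
    for (int id : expert_ids) {
        counts[id]++;
    }
    return counts;
}

bool workgroup_needed(const std::vector<int> & counts, int expert) {
    return counts[expert] > 0; // zero-use experts are skipped entirely
}
```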

commit e37c8ed
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 11:15:50 2025 -0600

    vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (llama/18349)

    * vulkan: Use BK=32 for coopmat2 mul_mat_id

    * vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

    Disable robustness, remove the OOB check in decodeFuncB, and initialize the
    row_ids to zero to avoid OOB access.

    Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
    to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
    zero and remove the '& (BN - 1)'. This allows the compiler to common some of
    the shared memory loads.

commit 331c6cc
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 11:15:02 2025 -0600

    vulkan: Use BK=32 for coopmat2 mul_mat_id (llama/18332)

commit 35cb4ab
Author: Eve <139727413+netrunnereve@users.noreply.github.com>
Date:   Fri Dec 26 17:12:11 2025 +0000

    vulkan: small dequantization improvements (llama/18380)

    * iq4_xs

    * quants

commit 181e36f
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 10:00:57 2025 -0600

    vulkan: Support UPSCALE w/antialias (llama/18327)

commit 67473fe
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 09:53:46 2025 -0600

    vulkan: handle rope with large number of rows (llama/18306)

commit 33f75a8
Author: 0Marble <85058989+0Marble@users.noreply.github.com>
Date:   Fri Dec 26 09:12:04 2025 +0800

    CANN: implement the SSM_CONV operator (llama/17737)

    * CANN: implement SSM_CONV operator

    Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
    Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

    * CANN: remove custom error limit for SSM_CONV

    * CANN: merge SSM_CONV tensor shape/strides into one line

    ---------

    Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

commit 5177835
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Fri Dec 26 01:35:14 2025 +0800

    ggml-cuda: fix regex for arch list (llama/18371)

    * ggml-cuda: fix regex for arch list

    * make regex exact

commit 8e02f09
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Thu Dec 25 23:55:38 2025 +0800

    cuda: optimize cumsum cub path (llama/18362)

    * cuda: optimize cumsum cub path

    * remove heavy perf test

commit ea07c5d
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Thu Dec 25 22:12:11 2025 +0800

    ggml-cuda: fix blackwell native builds (llama/18361)

    * ggml-cuda: fix blackwell native builds

    Replace 12x in native architectures by 12xa

    * replace for GGML_NATIVE=OFF too

    * only replace for native

    * remove 120f-virtual for default compilation

    ---------

    Co-authored-by: Aman Gupta <aman>

commit 5f0488f
Author: Penglin Cai <1402538448@qq.com>
Date:   Thu Dec 25 16:46:09 2025 +0800

    CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (llama/17934)

    * CONV_TRANSPOSE_1D kernel_size>255

    * remove condition check

    * fix the bug of type conversion

    * removing trailing whitespaces

    * fix: return true in the switch case

commit db75fff
Author: Aadeshveer Singh <24b0926@iitb.ac.in>
Date:   Thu Dec 25 09:41:13 2025 +0530

    ggml : optimize cuda cumsum fallback kernel (llama/18343)

commit 41e578e
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Wed Dec 24 22:28:26 2025 +0800

    CUDA: experimental native mxfp4 support for blackwell (llama/17906)

    * CUDA: experimental native mxfp4 support for blackwell

    * optimize load_tiles

    * optimize quantize_mxfp4

    * cleanup

    * first pass review: formatting

    * use interleaved layout for mma

    * mmq: add assert for size

    * use __nv_fp4x4_e2m1

    * use iter_k as 512, cleanup

    * Use 1200 as blackwell instead of 1000

    * address review comments

    * mmq: fix stride

    * quantize.cu: use reference impl of e8m0 scale

    * address review comments

    * add 120f-virtual + minor fixes

    ---------

    Co-authored-by: Aman Gupta <aman>

commit f863735
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Wed Dec 24 05:36:34 2025 -0600

    vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (llama/18302)

commit bab2c02
Author: Wang Weixuan <wangweixvan@gmail.com>
Date:   Wed Dec 24 17:50:24 2025 +0800

    CANN : refactor ACL graph cache (llama/17752)

    Move the graph property checking code into methods of LRU cache.

    Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>

commit 1356600
Author: Ruben Ortlam <picard12@live.de>
Date:   Wed Dec 24 08:59:14 2025 +0100

    vulkan: use fewer FA rows for small cache runs (llama/18280)

commit ec9239d
Author: TianHao324 <tianhao42@huawei.com>
Date:   Wed Dec 24 14:55:33 2025 +0800

    CANN: Uses yarn_ramp cache in ROPE (llama/17725)

commit 9bdd465
Author: Chris Rohlf <chris.rohlf@gmail.com>
Date:   Tue Dec 23 04:56:49 2025 -0500

    rpc : add check for rpc buffer type (llama/18242)

commit e4c8961
Author: nullname <chraac@gmail.com>
Date:   Tue Dec 23 15:13:24 2025 +0800

    ggml-hexagon: create generalized functions for cpu side op (llama/17500)

    * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

    * refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

    * refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

    * add comment

    * refactor: remove redundant buffer checks in hexagon supported operations

    * wip

    * add missing include to fix weak symbol warning

    * add ggml_hexagon_op_generic

    * refactor: simplify tensor operation initialization and buffer management in hexagon implementation

    * refactor: streamline hexagon operation initialization and buffer management

    * refactor: update function signatures and streamline request handling in hexagon operations

    * wip

    * ggml-hexagon: clean up code formatting and improve unary operation handling

    * wip

    * rename

    * fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

    * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

    refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

    refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

    refactor: remove redundant buffer checks in hexagon supported operations

    add missing include to fix weak symbol warning

    add ggml_hexagon_op_generic

    refactor: simplify tensor operation initialization and buffer management in hexagon implementation

    refactor: streamline hexagon operation initialization and buffer management

    refactor: update function signatures and streamline request handling in hexagon operations

    ggml-hexagon: clean up code formatting and improve unary operation handling

    fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

    * hexagon: fix merge conflicts

    * hexagon: minor cleanup for buffer support checks

    * hexagon: factor out op_desc and the overall op logging

    * hexagon: further simplify and cleanup op dispatch logic

    * snapdragon: update adb scripts to use llama-cli and llama-completion

    * fix pipeline failure

    ---------

    Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

commit 2f33395
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Mon Dec 22 13:56:52 2025 -0500

    ggml-hexagon: gelu optimization (llama/18151)

    * feat: working gelu with src0 put on vtcm

    * feat: gelu ping-pong for both in and out

    * fix: fix compile error

    * break: distinguish dma ddr->vtcm and vtcm->ddr operation

    * fix: fix dma queue size

    * break: update dma api to either pop src or dst ptr

    * fix: fix activation vtcm allocation issue for src1 when swapped

    * refactor: ping-pong gelu logic to avoid unnecessary if else

    * dma: improved queue interface and prefetch handling

    * gelu: fix N+2 block prefetch

    ---------

    Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

commit 5b0c1c1
Author: Taimur Ahmad <taimur.ahmad@10xengineers.ai>
Date:   Mon Dec 22 23:20:23 2025 +0500

    llamafile: add rvv support for sgemm kernels (llama/18199)

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

commit f2fe1e5
Author: lhez <lih@qti.qualcomm.com>
Date:   Mon Dec 22 10:19:01 2025 -0800

    opencl: unpack q4_0 for adreno in get_tensor (llama/18278)

commit dbbe6c1
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Mon Dec 22 11:03:13 2025 -0600

    vulkan: Extend rope fusions to allow mrope (llama/18264)

    Extend the test-backend-ops tests as well.

commit 98e59a4
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 14:52:09 2025 -0600

    vulkan: Implement set_tensor_async and the event interfaces (llama/18047)

    The goal is to enable the async loading code paths in
    llama_model_loader::load_all_data, originally from #7896. This works and the
    loads themselves are faster, but with host visible vidmem I think the cost of
    allocating/mapping vidmem moves and becomes more expensive, and I don't see a
    benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
    significant improvement in model loading time.

commit b68b12f
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Sun Dec 21 19:33:08 2025 +0100

    llama: fix RPC for -fit on (llama/18233)

commit b893e08
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:32:58 2025 -0600

    vulkan: fix im2col overflowing maxworkgroupcount (llama/18180)

commit f407c5e
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:27:34 2025 -0600

    vulkan/cuda: fix topk_moe with exp_probs_b (llama/18071)

    I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
    and added coverage for exp_probs_b and some other missing combinations. This
    exposed a bug in both CUDA and Vulkan backends where they were assuming the
    input to argsort and the input to get_rows are the same. I'd like to optimize
    this graph in another change, but for now just get it functional.

    CUDA also had a bug where it got n_experts from the wrong place, leading to
    GGML_ASSERT failures in some of the new tests.

commit ad6ee38
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:17:58 2025 -0600

    vulkan: support GGML_UNARY_OP_XIELU (llama/18062)

commit 3cd141f
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:05:08 2025 -0600

    vulkan: in graph_optimize, try to group ADD operations (llama/18060)

    I saw the adds not staying together in the new nemotron 3 nano model.

commit 449fc7c
Author: lovedheart <6277001+lovedheart@users.noreply.github.com>
Date:   Sun Dec 21 09:59:52 2025 +0100

    Vulkan: some improvement on mul_mat_iq2_xs (llama/18031)

    * Some improvement on mul_mat_iq2_xs

    Refactor calculations for db values and grid data to optimize performance and reduce redundancy.

    * Fix trailing whitespace

commit 0983985
Author: Aadeshveer Singh <24b0926@iitb.ac.in>
Date:   Sat Dec 20 16:58:57 2025 +0530

    Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context (llama/18212)

commit 17a4cb1
Author: Alfred <zxu3@clemson.edu>
Date:   Fri Dec 19 12:42:28 2025 -0500

    ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (llama/17977)

    * feat: implement real Q8_0

    * feat: adding cmake option for configuring FP32 quantize group size

    * typo: set() shall be used

    ---------

    Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>
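As background, Q8_0 quantization of one block can be sketched as follows (illustrative code, not the hexagon backend's actual kernels): the scale is taken from the block's largest magnitude, and each value is rounded to an int8.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative Q8_0 quantization of a single 32-value block (not the real
// NPU kernel): d = amax / 127, each value stored as round(x / d).
struct BlockQ8 {
    float  d;      // per-block scale (fp16 in the on-disk format)
    int8_t qs[32]; // quantized values
};

BlockQ8 quantize_q8_0_block(const float (&x)[32]) {
    float amax = 0.0f;
    for (float v : x) {
        amax = std::max(amax, std::fabs(v));
    }
    BlockQ8 b{};
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < 32; i++) {
        b.qs[i] = (int8_t) std::lround(x[i] * id);
    }
    return b;
}
```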

commit 195d8d0
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Thu Dec 18 23:36:46 2025 -0600

    vulkan: Add perf logger mode with concurrency (llama/17944)

    This implements a variation of the perf logger where rather than timing each
    operation individually with effectively a barrier in between, we put the
    timing boundaries where we already synchronize and time the groups of work
    that normally overlap. This can be useful to help understand whether
    individual operations need to be optimized, or if the group is already running
    efficiently.

    GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
    GGML_VK_PERF_LOGGER is also set).

    GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.

commit fea481f
Author: Xuan-Son Nguyen <son@huggingface.co>
Date:   Fri Dec 19 00:18:01 2025 +0100

    model : add ASR support for LFM2-Audio-1.5B (conformer) (llama/18106)

    * ASR with LFM2-Audio-1.5B

    * Set rope_theta

    * Fix comment

    * Remove rope_theta setting

    * Address PR feedback

    * rename functions to conformer

    * remove some redundant ggml_cont

    * fix missing tensor

    * add prefix "a." for conv tensors

    * remove redundant reshape

    * clean up

    * add test model

    ---------

    Co-authored-by: Tarek Dakhran <tarek@liquid.ai>

commit 956fac4
Author: Taimur Ahmad <taimur.ahmad@10xengineers.ai>
Date:   Thu Dec 18 19:02:09 2025 +0500

    ggml-cpu: extend support for RVV floating-point kernels (llama/17318)

    * cmake: add BF16 RVV flag for ggml-cpu

    * ggml-cpu: add floating-point conversion kernels

    * ggml: add floating-point kernels

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

    * ggml-cpu: fix lmul in vec_dot_bf16

    * ggml-cpu: change redsum to lmul 4, fix leftover

    ---------

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

commit 325a9b7
Author: yulo <77381088+zhang-hui-yulo@users.noreply.github.com>
Date:   Thu Dec 18 19:50:56 2025 +0800

    remove i_major_dual (llama/18157)

    Co-authored-by: zhang hui <you@example.com>

commit c3a1608
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Wed Dec 17 16:38:21 2025 -0500

    ggml-hexagon: swiglu_oai operation (llama/18114)

    * snapshot: debug ggml-hexagon swiglu-oai

    * fix: fix hvx_min_scalar_f32

    * feat: working swiglu-oai

    * chore: fix formatting issue

commit c7ccedb
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Wed Dec 17 13:39:32 2025 -0500

    ggml-hexagon: gelu operation (llama/17921)

    * feat: initial support for gelu using sigmoid approximation

    * snapshot: faster gelu using polynomial approximation

    * test: disable l2-block prefetch in polynomial approximation

    * Revert "test: disable l2-block prefetch in polynomial approximation"

    This reverts commit 72339994d45b2bed887e79994403c378d90b62b5.

    * Revert "snapshot: faster gelu using polynomial approximation"

    This reverts commit 2a787a61d11f9e63e5943a2e6d134b2f0c402ace.

    * debug: temporarily disable unnecessary log message for debug purpose

    * Feat: optimized unaligned sigmoid_f32

    * Feat: larger l2prefetch block

    * feat: apply unaligned-load optimization on mul and mul_scalar

    * Revert "debug: temporarily disable unnecessary log message for debug purpose"

    This reverts commit 84f2f23aa9f17e2fa826db969cd825d0ab192995.

    * refactor: cleanup commented unused code

    * chore: reformat code with clang-formatter to pass cli test

    * Revert "chore: reformat code with clang-formatter to pass cli test"

    This reverts commit 952877ec24732b12010c7fa7ed3fc8de4b74e718.

    * fix: fix loop overflow

    * chore: fix formatting ci error

commit 1f72f00
Author: Alberto Cabrera Pérez <1478977+Alcpz@users.noreply.github.com>
Date:   Wed Dec 17 11:39:13 2025 +0000

    ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) (llama/18096)

    * wip: skeleton for q8_0 repack

    * q8_0 repack GEMV implementations

    * GEMM implementations

    * Formatting

    * Fixed format consistency of repack gemm and gemv declarations

    * gemv and gemm generic location consistent with declarations

    * Removed non-correct unused variables statements

    * Cleanup, consistent style

    * Missing generic fallbacks for x86 and powerpc

commit 9118c05
Author: yulo <77381088+zhang-hui-yulo@users.noreply.github.com>
Date:   Wed Dec 17 16:34:54 2025 +0800

    HIP: Refactor mma for RDNA and CDNA (llama/17990)

    * mma.cuh for rdna4

    * mma for rdna3

    * mmq for rdna4

    * mmq for rdna3

    * align i-major and j-major

    * cdna

    * fix cuda error

    * add missing tile of mfma

    * fix j-major wrong ne on CDNA

    * fix grammar and empty spaces

    ---------

    Co-authored-by: zhang hui <you@example.com>

commit 6114e69
Author: KITAITI Makoto <KitaitiMakoto@gmail.com>
Date:   Wed Dec 24 16:52:16 2025 +0900

    ruby : add Whisper::Token, fix model URI (ggml-org#3575)

    * Define and use macro to get context safely

    * Add test to check SEGV

    * Move installation guide after usage

    * [skip ci]Change doc slightly

    * [skip ci]Fix a typo in README

    * [skip ci]Add carry_initial_prompt option in README

    * Define GetVADSegments and use it

    * Use GetContext

    * Fix download URI of small.en-tdrz

    * Fix URI of CoreML models corresponding to quantized models

    * Cache computed string

    * Remove unused argument

    * Add Whisper::Token

    * Add document comments

    * Rename function: rb_whisper_token_s_new -> ruby_whisper_token_s_init

    * Fix size of token

    * Insert _get into function names

    * Add Whisper::Token#text

    * Add test for Whisper::Token#text

    * Declare static if possible

    * Change method names

    * Add Whisper::Token#deconstruct_keys

    * Add tests for Whisper::Token#deconstruct_keys

    * Add signatures for Whisper::Token

    * Complete signature

    * [skip ci]Add n_tokens to document of Segment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant