Support Mixed-Precision Quantization #4

Merged
lhpqaq merged 1 commit into master from mixed-pr on Jan 13, 2026
Conversation

@lhpqaq (Owner) commented on Jan 13, 2026

No description provided.

lhpqaq merged commit 6d349f9 into master on Jan 13, 2026
48 of 64 checks passed
lhpqaq added a commit that referenced this pull request Jan 14, 2026
commit f62a80b
Author: lhpqaq <liuhaopeng@apache.org>
Date:   Wed Jan 14 21:04:06 2026 +0800

    simd

commit d8b99c9
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 12:29:55 2026 +0000

    Implement ggml-level Q8_0 KV cache optimization: add ggml_vec_mad_q8_0

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>
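A scalar sketch of the operation this commit adds: `ggml_vec_mad_q8_0` accumulates `y[i] += v * x[i]` where `x` is a Q8_0-quantized vector (blocks of 32 int8 values sharing one scale). The real kernel is SIMD-optimized; this reference version also simplifies ggml's fp16 block scale to a plain float so the sketch stays self-contained.

```c
#include <stdint.h>

#define QK8_0 32  // ggml Q8_0 block size: 32 elements per block

// Simplified Q8_0 block: ggml stores the scale as fp16 (ggml_fp16_t);
// a float is used here to keep the sketch self-contained.
typedef struct {
    float  d;          // per-block scale
    int8_t qs[QK8_0];  // quantized values
} block_q8_0;

// Scalar reference for y[i] += v * dequant(x[i]) over a Q8_0 vector.
// The kernel in the commit vectorizes this inner loop.
static void vec_mad_q8_0(int n, float *y, const block_q8_0 *x, float v) {
    for (int ib = 0; ib < n / QK8_0; ib++) {
        const float dv = x[ib].d * v;  // fold the block scale into the multiplier
        for (int j = 0; j < QK8_0; j++) {
            y[ib*QK8_0 + j] += dv * x[ib].qs[j];
        }
    }
}
```

Operating on whole blocks like this is what lets flash attention accumulate directly against a quantized V cache without a separate dequantize pass.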

commit a94c7a5
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 10:15:24 2026 +0000

    Improve comments explaining V quantization constraints in non-flash attention paths

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit d53bf50
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 10:12:12 2026 +0000

    Allow K cache quantization with non-flash attention, V quantization requires flash attention

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>
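The rule this commit establishes can be sketched as a small validation function (names here are illustrative, not the PR's actual identifiers): a quantized K cache is accepted on any attention path, but a quantized V cache is rejected unless flash attention is enabled, since the non-flash path reads V with strides that quantized blocks cannot satisfy.

```c
#include <stdbool.h>

// Illustrative cache-type enum; the PR supports F16 plus Q8_0/Q4_0/Q5_0.
typedef enum { TYPE_F16, TYPE_Q8_0, TYPE_Q4_0, TYPE_Q5_0 } cache_type;

static bool is_quantized(cache_type t) { return t != TYPE_F16; }

// Sketch of the constraint from the commit above.
static bool kv_cache_config_valid(cache_type k, cache_type v, bool flash_attn) {
    if (is_quantized(v) && !flash_attn) {
        return false;  // quantized V requires flash attention
    }
    return true;       // quantized K is allowed either way
}
```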

commit 96da180
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 09:44:38 2026 +0000

    Fix non-flash attention path V tensor strides to use n_ctx-based layout

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 610f5f8
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 08:29:10 2026 +0000

    Implement separate KV cache types for kv_self, kv_cross, and kv_pad

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 294c6c5
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 08:11:28 2026 +0000

    Document three KV cache types (kv_self, kv_cross, kv_pad) with usage recommendations

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit db20ad3
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:58:27 2026 +0000

    Add validation: quantized KV cache types require flash attention enabled

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit f653dff
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:48:26 2026 +0000

    Add quantized KV cache performance analysis and optimization recommendations

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 93ebaef
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:14:07 2026 +0000

    Fix comment about block size determination

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 317b980
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:12:03 2026 +0000

    Implement KV cache quantization with Q8_0, Q4_0, Q5_0 support using ggml_row_size

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>
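A rough model of the `ggml_row_size` arithmetic this commit relies on: each of the supported quant types packs QK = 32 elements into a fixed-size block, so a cache row of n elements (n assumed block-aligned) occupies n/32 blocks. The block byte sizes below match ggml's `block_q8_0` (fp16 scale + 32 int8), `block_q4_0` (fp16 scale + 16 nibble-packed bytes), and `block_q5_0` (fp16 scale + 4 high-bit bytes + 16 nibble-packed bytes).

```c
#include <stddef.h>

enum { QK = 32 };  // elements per block for Q8_0, Q4_0, Q5_0
enum {
    SZ_Q8_0 = 2 + 32,      // 34 bytes/block
    SZ_Q5_0 = 2 + 4 + 16,  // 22 bytes/block
    SZ_Q4_0 = 2 + 16,      // 18 bytes/block
};

// Mirrors ggml_row_size(type, n) for block-quantized types:
// bytes per row = (n / block_size) * bytes_per_block.
static size_t row_size(size_t block_bytes, size_t n) {
    return (n / QK) * block_bytes;
}
```

For example, a 4096-element row costs 8192 bytes in F16 but 4352 bytes in Q8_0, roughly a 47% saving, which is the point of quantizing the KV cache.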

commit 32d9709
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 19:51:11 2026 +0000

    Implement separate K/V cache type configuration for mixed precision

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 3429b9a
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:29:07 2026 +0000

    Add practical KV cache optimization strategies section with implementation guidance

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 512a42b
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:18:32 2026 +0000

    Revert Q8_0 KV cache code changes; document block alignment issue and mixed precision strategies

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit b3d0e2d
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:08:20 2026 +0000

    Improve KV cache Q8_0 documentation and help text clarity

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit bb5674c
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:06:25 2026 +0000

    Add KV Cache Q8_0 quantization support for reduced memory usage

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 0fe0dab
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 17:53:49 2026 +0000

    Initial plan

commit 6d349f9
Merge: 679bdb5 f92fc36
Author: haopeng <liuhaopeng@apache.org>
Date:   Wed Jan 14 01:52:08 2026 +0800

    Merge pull request #4 from lhpqaq/mixed-pr

    Support Mixed-Precision Quantization

commit f92fc36
Author: lhpqaq <liuhaopeng@apache.org>
Date:   Mon Jan 12 14:44:23 2026 +0800

    Support Mixed-Precision Quantization

commit 679bdb5
Author: KITAITI Makoto <KitaitiMakoto@gmail.com>
Date:   Mon Jan 5 17:41:22 2026 +0900

    ruby : fix segmentation fault (ggml-org#3591)

    * Mark long live variable

    * Fix test for Whisper::Token#deconstruct_keys(nil)

    * Don't use long live variable

    * Fix indentation

commit e9898dd
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 18:26:42 2025 +0200

    sync : ggml

commit ce03f8e
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 18:24:07 2025 +0200

    ggml : bump version to 0.9.5 (ggml/1410)

commit 7359ac9
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 13:13:57 2025 +0200

    talk-llama : sync llama.cpp

commit 54fa821
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 13:09:05 2025 +0200

    sync : ggml

commit 8189f2c
Author: gatbontonpc <gatbontonpc@gmail.com>
Date:   Wed Dec 31 00:39:48 2025 -0800

    metal : add count_equal op (llama/18314)

    * add count equal for metal

    * remove trailing whitespace

    * updated doc ops table

    * changed shmem to i32

    * added multi tg and templating

    * removed BLAS support from Metal docs

    * Apply suggestions from code review

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

    * add memset to set dst to 0

    * metal : cleanup

    ---------

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

commit 2d250f8
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Wed Dec 31 09:37:00 2025 +0100

    CUDA: fix KQ max calculation (llama/18487)

commit 5deaf8f
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 09:53:48 2025 +0200

    metal : remove BF16 x F16 kernels (llama/18456)

commit 4679331
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Wed Dec 31 14:23:44 2025 +0800

    sycl: add newline at the end of CMakeLists.txt (llama/18503)

commit a363549
Author: Rahul Sathe <150351592+rrsathe@users.noreply.github.com>
Date:   Wed Dec 31 06:38:44 2025 +0530

    Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (llama/18345)

    * cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x

    * [AI] sycl: auto-detect and skip incompatible IntelSYCL package

    Automatically detect compiler versions with incompatible IntelSYCL
    CMake configuration files and fall back to manual SYCL flags instead
    of requiring users to set options manually.

    Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
    has SYCL_FEATURE_TEST_EXTRACT invocation errors.

    * refactor: improve SYCL provider handling and error messages in CMake configuration

    * refactor: enhance SYCL provider validation and error handling in CMake configuration

    * ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes

commit c995536
Author: Charles Xu <charles.xu@arm.com>
Date:   Tue Dec 30 13:04:53 2025 +0100

    kleidiai: add and integrate SVE 256-bit vector-length kernel (llama/18458)

    * kleidiai: add and integrate SVE 256-bit vector-length kernel

    * updated for review comments

commit 6d4aa96
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Tue Dec 30 17:40:46 2025 +0800

    CUDA: add log line when mxfp4 acceleration is used (llama/18483)

    * CUDA: add log line when mxfp4 acceleration is used

    * add in backend_get_features

commit 5765c5b
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Mon Dec 29 17:58:20 2025 +0100

    CUDA: fix replacment of bad archs in CMake (llama/18457)

commit d6cb240
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Mon Dec 29 09:35:42 2025 +0100

    CUDA: Blackwell features for non-native builds (llama/18436)

commit e49e88b
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Mon Dec 29 14:07:17 2025 +0800

    cuda: fix race condition in cumsum (llama/18448)

    * ggml-cuda: fix race condition in cumsum

    * remove unneccesary sync_threads

commit 20f5729
Author: uvos <carl@uvos.xyz>
Date:   Sun Dec 28 20:12:55 2025 +0100

    HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (llama/18202)

commit b8d209f
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Sun Dec 28 20:53:36 2025 +0800

    Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (llama/18426)

commit 54fe9a6
Author: o7si <32285332+o7si@users.noreply.github.com>
Date:   Sun Dec 28 18:34:41 2025 +0800

    rpc: fix segfault on invalid endpoint format (llama/18387)

    * rpc: fix segfault on invalid endpoint format

    * rpc: add error log for failed endpoint connection

commit b3788ef
Author: Boian Berberov <7432115+bberberov@users.noreply.github.com>
Date:   Sun Dec 28 07:33:29 2025 +0000

    cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (llama/18186)

    * minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

    * cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

    - `ivybridge`
    - `piledriver`
    - `cannonlake`
    - `cascadelake`
    - `cooperlake`
    - `zen4`

    Resolves: #17966

commit 31fc2c3
Author: QDelta <60222316+QDelta@users.noreply.github.com>
Date:   Sat Dec 27 20:33:14 2025 -0500

    ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (llama/18413)

commit a800a3a
Author: lhez <lih@qti.qualcomm.com>
Date:   Sat Dec 27 15:51:14 2025 -0800

    opencl: allow resizing transpose buffers (llama/18384)

    * opencl: allow resizing transpose buffers instead of using fixed sizes

    * opencl: remove commented code

commit 29f8155
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Sat Dec 27 19:56:27 2025 +0800

    ggml-cuda: Use same regex for GGML_NATIVE=OFF (llama/18407)

commit 015b618
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 16:12:58 2025 -0600

    vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (llama/18352)

    Run a preprocess to count how many times each expert is used, and use this to
    quickly discard workgroups that aren't needed.

commit e37c8ed
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 11:15:50 2025 -0600

    vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (llama/18349)

    * vulkan: Use BK=32 for coopmat2 mul_mat_id

    * vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

    Disable robustness, remove the OOB check in decodeFuncB, and initialize the
    row_ids to zero to avoid OOB access.

    Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
    to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
    zero and remove the '& (BN - 1)'. This allows the compiler to common some of
    the shared memory loads.

commit 331c6cc
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 11:15:02 2025 -0600

    vulkan: Use BK=32 for coopmat2 mul_mat_id (llama/18332)

commit 35cb4ab
Author: Eve <139727413+netrunnereve@users.noreply.github.com>
Date:   Fri Dec 26 17:12:11 2025 +0000

    vulkan: small dequantization improvements (llama/18380)

    * iq4_xs

    * quants

commit 181e36f
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 10:00:57 2025 -0600

    vulkan: Support UPSCALE w/antialias (llama/18327)

commit 67473fe
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 09:53:46 2025 -0600

    vulkan: handle rope with large number of rows (llama/18306)

commit 33f75a8
Author: 0Marble <85058989+0Marble@users.noreply.github.com>
Date:   Fri Dec 26 09:12:04 2025 +0800

    CANN: implement the SSM_CONV operator (llama/17737)

    * CANN: implement SSM_CONV operator

    Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
    Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

    * CANN: remove custom error limit for SSM_CONV

    * CANN: merge SSM_CONV tensor shape/strides into one line

    ---------

    Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

commit 5177835
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Fri Dec 26 01:35:14 2025 +0800

    ggml-cuda: fix regex for arch list (llama/18371)

    * ggml-cuda: fix regex for arch list

    * make regex exact

commit 8e02f09
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Thu Dec 25 23:55:38 2025 +0800

    cuda: optimize cumsum cub path (llama/18362)

    * cuda: optimize cumsum cub path

    * remove heavy perf test

commit ea07c5d
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Thu Dec 25 22:12:11 2025 +0800

    ggml-cuda: fix blackwell native builds (llama/18361)

    * ggml-cuda: fix blackwell native builds

    Replace 12x in native architectures by 12xa

    * replace for GGML_NATIVE=OFF too

    * only replace for native

    * remove 120f-virtual for default compilation

    ---------

    Co-authored-by: Aman Gupta <aman>

commit 5f0488f
Author: Penglin Cai <1402538448@qq.com>
Date:   Thu Dec 25 16:46:09 2025 +0800

    CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (llama/17934)

    * CONV_TRANSPOSE_1D kernel_size>255

    * remove condition check

    * fix the bug of type conversion

    * removing trailing whitespaces

    * fix: return true in the switch case

commit db75fff
Author: Aadeshveer Singh <24b0926@iitb.ac.in>
Date:   Thu Dec 25 09:41:13 2025 +0530

    ggml : optimize cuda cumsum fallback kernel (llama/18343)

commit 41e578e
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Wed Dec 24 22:28:26 2025 +0800

    CUDA: experimental native mxfp4 support for blackwell (llama/17906)

    * CUDA: experimental native mxfp4 support for blackwell

    * optimize load_tiles

    * optimize quantize_mxfp4

    * cleanup

    * first pass review: formatting

    * use interleaved layout for mma

    * mmq: add assert for size

    * use __nv_fp4x4_e2m1

    * use iter_k as 512, cleanup

    * Use 1200 as blackwell instead of 1000

    * address review comments

    * mmq: fix stride

    * quantize.cu: use reference impl of e8m0 scale

    * address review comments

    * add 120f-virtual + minor fixes

    ---------

    Co-authored-by: Aman Gupta <aman>

commit f863735
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Wed Dec 24 05:36:34 2025 -0600

    vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (llama/18302)

commit bab2c02
Author: Wang Weixuan <wangweixvan@gmail.com>
Date:   Wed Dec 24 17:50:24 2025 +0800

    CANN : refactor ACL graph cache (llama/17752)

    Move the graph property checking code into methods of LRU cache.

    Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>

commit 1356600
Author: Ruben Ortlam <picard12@live.de>
Date:   Wed Dec 24 08:59:14 2025 +0100

    vulkan: use fewer FA rows for small cache runs (llama/18280)

commit ec9239d
Author: TianHao324 <tianhao42@huawei.com>
Date:   Wed Dec 24 14:55:33 2025 +0800

    CANN: Uses yarn_ramp cache in ROPE (llama/17725)

commit 9bdd465
Author: Chris Rohlf <chris.rohlf@gmail.com>
Date:   Tue Dec 23 04:56:49 2025 -0500

    rpc : add check for rpc buffer type (llama/18242)

commit e4c8961
Author: nullname <chraac@gmail.com>
Date:   Tue Dec 23 15:13:24 2025 +0800

    ggml-hexagon: create generalized functions for cpu side op (llama/17500)

    * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

    * refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

    * refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

    * add comment

    * refactor: remove redundant buffer checks in hexagon supported operations

    * wip

    * add missing include to fix weak symbol warning

    * add ggml_hexagon_op_generic

    * refactor: simplify tensor operation initialization and buffer management in hexagon implementation

    * refactor: streamline hexagon operation initialization and buffer management

    * refactor: update function signatures and streamline request handling in hexagon operations

    * wip

    * ggml-hexagon: clean up code formatting and improve unary operation handling

    * wip

    * rename

    * fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

    * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

    refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

    refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

    refactor: remove redundant buffer checks in hexagon supported operations

    add missing include to fix weak symbol warning

    add ggml_hexagon_op_generic

    refactor: simplify tensor operation initialization and buffer management in hexagon implementation

    refactor: streamline hexagon operation initialization and buffer management

    refactor: update function signatures and streamline request handling in hexagon operations

    ggml-hexagon: clean up code formatting and improve unary operation handling

    fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

    * hexagon: fix merge conflicts

    * hexagon: minor cleanup for buffer support checks

    * hexagon: factor out op_desc and the overal op logging

    * hexagon: further simplify and cleanup op dispatch logic

    * snapdragon: update adb scripts to use llama-cli and llama-completion

    * fix pipeline failure

    ---------

    Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

commit 2f33395
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Mon Dec 22 13:56:52 2025 -0500

    ggml-hexagon: gelu optimization (llama/18151)

    * feat: working gelu with src0 put on vtcm

    * feat: gelu ping-pong for both in and out

    * fix: fixu compile error

    * break: distinguish dma ddr->vtcm and vtcm->ddr operation

    * fix: fix dma queue size

    * break: update dma api to either pop src or dst ptr

    * fix: fix activation vtcm allocation issue for src1 when swapperd

    * refactor: ping-pong gelu logic to avoid unnecessary if else

    * dma: improved queue interface and prefetch handling

    * gelu: fix N+2 block prefetch

    ---------

    Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

commit 5b0c1c1
Author: Taimur Ahmad <taimur.ahmad@10xengineers.ai>
Date:   Mon Dec 22 23:20:23 2025 +0500

    llamafile: add rvv support for sgemm kernels (llama/18199)

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

commit f2fe1e5
Author: lhez <lih@qti.qualcomm.com>
Date:   Mon Dec 22 10:19:01 2025 -0800

    opencl: unpack q4_0 for adreno in get_tensor (llama/18278)

commit dbbe6c1
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Mon Dec 22 11:03:13 2025 -0600

    vulkan: Extend rope fusions to allow mrope (llama/18264)

    Extend the test-backend-ops tests as well.

commit 98e59a4
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 14:52:09 2025 -0600

    vulkan: Implement set_tensor_async and the event interfaces (llama/18047)

    The goal is to enable the async loading code paths in
    llama_model_loader::load_all_data, originally from #7896. This works and the
    loads themselves are faster, but with host visible vidmem I think the cost of
    allocating/mapping vidmem moves and becomes more expensive, and I don't see a
    benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
    significant improvement in model loading time.

commit b68b12f
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Sun Dec 21 19:33:08 2025 +0100

    llama: fix RPC for -fit on (llama/18233)

commit b893e08
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:32:58 2025 -0600

    vulkan: fix im2col overflowing maxworkgroupcount (llama/18180)

commit f407c5e
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:27:34 2025 -0600

    vulkan/cuda: fix topk_moe with exp_probs_b (llama/18071)

    I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
    and added coverage for exp_probs_b and some other missing combinations. This
    exposed a bug in both CUDA and Vulkan backends where they were assuming the
    input to argsort and the input to get_rows are the same. I'd like to optimize
    this graph in another change, but for now just get it functional.

    CUDA also had a bug where it got n_experts from the wrong place, leading to
    GGML_ASSERT failures in some of the new tests.

commit ad6ee38
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:17:58 2025 -0600

    vulkan: support GGML_UNARY_OP_XIELU (llama/18062)

commit 3cd141f
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:05:08 2025 -0600

    vulkan: in graph_optimize, try to group ADD operations (llama/18060)

    I saw the adds not staying together in the new nemotron 3 nano model.

commit 449fc7c
Author: lovedheart <6277001+lovedheart@users.noreply.github.com>
Date:   Sun Dec 21 09:59:52 2025 +0100

    Vulkan: some improvement on mul_mat_iq2_xs (llama/18031)

    * Some improvement on mul_mat_iq2_xs

    Refactor calculations for db values and grid data to optimize performance and reduce redundancy.

    * Fix trailing whitespace

commit 0983985
Author: Aadeshveer Singh <24b0926@iitb.ac.in>
Date:   Sat Dec 20 16:58:57 2025 +0530

    Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context (llama/18212)

commit 17a4cb1
Author: Alfred <zxu3@clemson.edu>
Date:   Fri Dec 19 12:42:28 2025 -0500

    ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (llama/17977)

    * feat: implement real Q8_0

    * feat: adding cmake option for configuring FP32 quantize group size

    * typo: set() shall be used

    ---------

    Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>

commit 195d8d0
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Thu Dec 18 23:36:46 2025 -0600

    vulkan: Add perf logger mode with concurrency (llama/17944)

    This implements a variation of the perf logger where rather than timing each
    operation individually with effectively a barrier in between, we put the
    timing boundaries where we already synchronize and time the groups of work
    that normally overlap. This can be useful to help understand whether
    individual operations need to be optimized, or if the group is already running
    efficiently.

    GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
    GGML_VK_PERF_LOGGER is also set).

    GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.

commit fea481f
Author: Xuan-Son Nguyen <son@huggingface.co>
Date:   Fri Dec 19 00:18:01 2025 +0100

    model : add ASR support for LFM2-Audio-1.5B (conformer) (llama/18106)

    * ASR with LFM2-Audio-1.5B

    * Set rope_theta

    * Fix comment

    * Remove rope_theta setting

    * Address PR feedback

    * rename functions to conformer

    * remove some redundant ggml_cont

    * fix missing tensor

    * add prefix "a." for conv tensors

    * remove redundant reshape

    * clean up

    * add test model

    ---------

    Co-authored-by: Tarek Dakhran <tarek@liquid.ai>

commit 956fac4
Author: Taimur Ahmad <taimur.ahmad@10xengineers.ai>
Date:   Thu Dec 18 19:02:09 2025 +0500

    ggml-cpu: extend support for RVV floating-point kernels (llama/17318)

    * cmake: add BF16 RVV flag for ggml-cpu

    * ggml-cpu: add floating-point conversion kernels

    * ggml: add floating-point kernels

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

    * ggml-cpu: fix lmul in vec_dot_bf16

    * ggml-cpu: change redsum to lmul 4, fix leftover

    ---------

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

commit 325a9b7
Author: yulo <77381088+zhang-hui-yulo@users.noreply.github.com>
Date:   Thu Dec 18 19:50:56 2025 +0800

    remove i_major_dual (llama/18157)

    Co-authored-by: zhang hui <you@example.com>

commit c3a1608
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Wed Dec 17 16:38:21 2025 -0500

    ggml-hexagon: swiglu_oai operation (llama/18114)

    * snapshot: debug ggml-hexagon swiglu-oai

    * fix: fix hvx_min_scalar_f32

    * feat: working swiglu-oai

    * chore: fix formating isue

commit c7ccedb
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Wed Dec 17 13:39:32 2025 -0500

    ggml-hexagon: gelu operation (llama/17921)

    * feat: inital support for gelu using sigmoid approximation

    * snapshot: faster gelu using polynomial approximation

    * test: disable l2-block prefetch in polynomail approximation

    * Revert "test: disable l2-block prefetch in polynomail approximation"

    This reverts commit 72339994d45b2bed887e79994403c378d90b62b5.

    * Revert "snapshot: faster gelu using polynomial approximation"

    This reverts commit 2a787a61d11f9e63e5943a2e6d134b2f0c402ace.

    * debug: temporarily disable unnecessary log message for debug purpose

    * Feat: optiized unaligned sigmoid_f32

    * Feat: larger l2prefetch block

    * feat: apply unaligned-load optimization on mul and mul_scalar

    * Revert "debug: temporarily disable unnecessary log message for debug purpose"

    This reverts commit 84f2f23aa9f17e2fa826db969cd825d0ab192995.

    * refactor: cleanup commented unused code

    * chore: reformat code with clang-formatter to pass cli test

    * Revert "chore: reformat code with clang-formatter to pass cli test"

    This reverts commit 952877ec24732b12010c7fa7ed3fc8de4b74e718.

    * fix: fix loop overflow

    * chore: fix formating ci error

commit 1f72f00
Author: Alberto Cabrera Pérez <1478977+Alcpz@users.noreply.github.com>
Date:   Wed Dec 17 11:39:13 2025 +0000

    ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) (llama/18096)

    * wip: skeleton for q8_0 repack

    * q8_0 repack GEMV implementations

    * GEMM implementations

    * Formatting

    * Fixed format consistency of repack gemm and gemv declarations

    * gemv and gemm generic location consistent with declarations

    * Removed non-correct unused variables statements

    * Cleanup, consistent style

    * Missing generic fallbacks for x86 and powerpc

commit 9118c05
Author: yulo <77381088+zhang-hui-yulo@users.noreply.github.com>
Date:   Wed Dec 17 16:34:54 2025 +0800

    HIP: Refactor mma for RDNA and CDNA (llama/17990)

    * mma.cuh for rdna4

    * mma for rdna3

    * mmq for rdna4

    * mmq for rdna3

    * align i-major and j-major

    * cdna

    * fix cuda error

    * add missing tile of mfma

    * fix j-major wrong ne on CDNA

    * fix gramma and empty spaces

    ---------

    Co-authored-by: zhang hui <you@example.com>

commit 6114e69
Author: KITAITI Makoto <KitaitiMakoto@gmail.com>
Date:   Wed Dec 24 16:52:16 2025 +0900

    ruby : add Whisper::Token, fix model URI (ggml-org#3575)

    * Define and use macro to get context safely

    * Add test to check SEGV

    * Move installation guid after usage

    * [skip ci]Change doc slightly

    * [skip ci]Fix a typo in README

    * [skip ci]Add carry_initial_prompt option in README

    * Define GetVADSegments and use it

    * Use GetContext

    * Fix download URI of small.en-tdrz

    * Fix URI of CoreML models corresponding to quantized models

    * Cache computed string

    * Remove unused argument

    * Add Whisper::Token

    * Add document comments

    * Rename function: rb_whisper_token_s_new -> ruby_whisper_token_s_init

    * Fix size of token

    * Insert _get into function names

    * Add Whisper::Token#text

    * Add test for Whisper::Token#text

    * Declare static if possible

    * Change method names

    * Add Whisper::Token#deconstruct_keys

    * Add tests for Whisper::Token#deconstruct_keys

    * Add signatures for Whisper::Token

    * Complete signature

    * [skip ci]Add n_tokens to document of Segment
lhpqaq added a commit that referenced this pull request Jan 14, 2026
commit 9833290
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 18:05:14 2026 +0000

    Add ggml_vec_mad_q4_0 with SIMD optimization (NEON + AVX2) and integrate into flash attention

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>
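A scalar sketch of what this commit's `ggml_vec_mad_q4_0` computes (the committed kernel adds NEON and AVX2 paths; this reference version also simplifies ggml's fp16 block scale to a float). In ggml's Q4_0 layout, byte j of a block holds element j in its low nibble and element j + 16 in its high nibble, each stored with a +8 offset.

```c
#include <stdint.h>

#define QK4_0 32  // Q4_0 block: 32 elements packed into 16 bytes

// Simplified Q4_0 block (ggml stores d as fp16; float here for a
// self-contained sketch).
typedef struct {
    float   d;              // per-block scale
    uint8_t qs[QK4_0 / 2];  // two 4-bit values per byte
} block_q4_0;

// Scalar reference for y[i] += v * dequant(x[i]) over a Q4_0 vector.
static void vec_mad_q4_0(int n, float *y, const block_q4_0 *x, float v) {
    for (int ib = 0; ib < n / QK4_0; ib++) {
        const float dv = x[ib].d * v;
        float *yb = y + ib*QK4_0;
        for (int j = 0; j < QK4_0/2; j++) {
            const int lo = (x[ib].qs[j] & 0x0F) - 8;  // element j
            const int hi = (x[ib].qs[j] >>   4) - 8;  // element j + 16
            yb[j]           += dv * lo;
            yb[j + QK4_0/2] += dv * hi;
        }
    }
}
```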

commit f62a80b
Author: lhpqaq <liuhaopeng@apache.org>
Date:   Wed Jan 14 21:04:06 2026 +0800

    simd

commit d8b99c9
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 12:29:55 2026 +0000

    Implement ggml-level Q8_0 KV cache optimization: add ggml_vec_mad_q8_0

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit a94c7a5
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 10:15:24 2026 +0000

    Improve comments explaining V quantization constraints in non-flash attention paths

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit d53bf50
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 10:12:12 2026 +0000

    Allow K cache quantization with non-flash attention, V quantization requires flash attention

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 96da180
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 09:44:38 2026 +0000

    Fix non-flash attention path V tensor strides to use n_ctx-based layout

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 610f5f8
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 08:29:10 2026 +0000

    Implement separate KV cache types for kv_self, kv_cross, and kv_pad

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 294c6c5
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 08:11:28 2026 +0000

    Document three KV cache types (kv_self, kv_cross, kv_pad) with usage recommendations

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit db20ad3
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:58:27 2026 +0000

    Add validation: quantized KV cache types require flash attention enabled

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit f653dff
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:48:26 2026 +0000

    Add quantized KV cache performance analysis and optimization recommendations

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 93ebaef
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:14:07 2026 +0000

    Fix comment about block size determination

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 317b980
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Wed Jan 14 07:12:03 2026 +0000

    Implement KV cache quantization with Q8_0, Q4_0, Q5_0 support using ggml_row_size

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>
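For context on the row-size computation this commit relies on: Q8_0 stores each block of 32 values as a 2-byte fp16 scale plus 32 int8 quants (34 bytes per block), so a row shrinks roughly 3.8x versus F32. A minimal sketch, not the real `ggml_row_size` signature:

```cpp
#include <cstddef>

// Minimal sketch of a Q8_0 row-size computation (not the real ggml API):
// each block packs 32 values into a 2-byte fp16 scale + 32 int8 quants.
constexpr size_t QK8_0           = 32;     // values per block
constexpr size_t BLOCK_SIZE_Q8_0 = 2 + 32; // bytes per block

size_t row_size_q8_0(size_t n_elems) {
    // assumes n_elems is a multiple of the block size
    return n_elems / QK8_0 * BLOCK_SIZE_Q8_0;
}
```

For a 4096-wide row this gives 4352 bytes, versus 16384 bytes for F32.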

commit 32d9709
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 19:51:11 2026 +0000

    Implement separate K/V cache type configuration for mixed precision

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 3429b9a
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:29:07 2026 +0000

    Add practical KV cache optimization strategies section with implementation guidance

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 512a42b
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:18:32 2026 +0000

    Revert Q8_0 KV cache code changes; document block alignment issue and mixed precision strategies

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit b3d0e2d
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:08:20 2026 +0000

    Improve KV cache Q8_0 documentation and help text clarity

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit bb5674c
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 18:06:25 2026 +0000

    Add KV Cache Q8_0 quantization support for reduced memory usage

    Co-authored-by: lhpqaq <63844184+lhpqaq@users.noreply.github.com>

commit 0fe0dab
Author: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Date:   Tue Jan 13 17:53:49 2026 +0000

    Initial plan

commit 6d349f9
Merge: 679bdb5 f92fc36
Author: haopeng <liuhaopeng@apache.org>
Date:   Wed Jan 14 01:52:08 2026 +0800

    Merge pull request #4 from lhpqaq/mixed-pr

    Support Mixed-Precision Quantization

commit f92fc36
Author: lhpqaq <liuhaopeng@apache.org>
Date:   Mon Jan 12 14:44:23 2026 +0800

    Support Mixed-Precision Quantization

commit 679bdb5
Author: KITAITI Makoto <KitaitiMakoto@gmail.com>
Date:   Mon Jan 5 17:41:22 2026 +0900

    ruby : fix segmentation fault (ggml-org#3591)

    * Mark long live variable

    * Fix test for Whisper::Token#deconstruct_keys(nil)

    * Don't use long live variable

    * Fix indentation

commit e9898dd
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 18:26:42 2025 +0200

    sync : ggml

commit ce03f8e
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 18:24:07 2025 +0200

    ggml : bump version to 0.9.5 (ggml/1410)

commit 7359ac9
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 13:13:57 2025 +0200

    talk-llama : sync llama.cpp

commit 54fa821
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 13:09:05 2025 +0200

    sync : ggml

commit 8189f2c
Author: gatbontonpc <gatbontonpc@gmail.com>
Date:   Wed Dec 31 00:39:48 2025 -0800

    metal : add count_equal op (llama/18314)

    * add count equal for metal

    * remove trailing whitespace

    * updated doc ops table

    * changed shmem to i32

    * added multi tg and templating

    * removed BLAS support from Metal docs

    * Apply suggestions from code review

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

    * add memset to set dst to 0

    * metal : cleanup

    ---------

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

commit 2d250f8
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Wed Dec 31 09:37:00 2025 +0100

    CUDA: fix KQ max calculation (llama/18487)

commit 5deaf8f
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Dec 31 09:53:48 2025 +0200

    metal : remove BF16 x F16 kernels (llama/18456)

commit 4679331
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Wed Dec 31 14:23:44 2025 +0800

    sycl: add newline at the end of CMakeLists.txt (llama/18503)

commit a363549
Author: Rahul Sathe <150351592+rrsathe@users.noreply.github.com>
Date:   Wed Dec 31 06:38:44 2025 +0530

    Work around broken IntelSYCLConfig.cmake in Intel oneAPI 2025.x (llama/18345)

    * cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x

    * [AI] sycl: auto-detect and skip incompatible IntelSYCL package

    Automatically detect compiler versions with incompatible IntelSYCL
    CMake configuration files and fall back to manual SYCL flags instead
    of requiring users to set options manually.

    Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
    has SYCL_FEATURE_TEST_EXTRACT invocation errors.

    * refactor: improve SYCL provider handling and error messages in CMake configuration

    * refactor: enhance SYCL provider validation and error handling in CMake configuration

    * ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes

commit c995536
Author: Charles Xu <charles.xu@arm.com>
Date:   Tue Dec 30 13:04:53 2025 +0100

    kleidiai: add and integrate SVE 256-bit vector-length kernel (llama/18458)

    * kleidiai: add and integrate SVE 256-bit vector-length kernel

    * updated for review comments

commit 6d4aa96
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Tue Dec 30 17:40:46 2025 +0800

    CUDA: add log line when mxfp4 acceleration is used (llama/18483)

    * CUDA: add log line when mxfp4 acceleration is used

    * add in backend_get_features

commit 5765c5b
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Mon Dec 29 17:58:20 2025 +0100

    CUDA: fix replacement of bad archs in CMake (llama/18457)

commit d6cb240
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Mon Dec 29 09:35:42 2025 +0100

    CUDA: Blackwell features for non-native builds (llama/18436)

commit e49e88b
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Mon Dec 29 14:07:17 2025 +0800

    cuda: fix race condition in cumsum (llama/18448)

    * ggml-cuda: fix race condition in cumsum

    * remove unnecessary sync_threads

commit 20f5729
Author: uvos <carl@uvos.xyz>
Date:   Sun Dec 28 20:12:55 2025 +0100

    HIP: Use mmq on MFMA devices for MUL_MAT_ID in cases where a lot of splits would be generated (llama/18202)

commit b8d209f
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Sun Dec 28 20:53:36 2025 +0800

    Revert "ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (#18413)" (llama/18426)

commit 54fe9a6
Author: o7si <32285332+o7si@users.noreply.github.com>
Date:   Sun Dec 28 18:34:41 2025 +0800

    rpc: fix segfault on invalid endpoint format (llama/18387)

    * rpc: fix segfault on invalid endpoint format

    * rpc: add error log for failed endpoint connection

commit b3788ef
Author: Boian Berberov <7432115+bberberov@users.noreply.github.com>
Date:   Sun Dec 28 07:33:29 2025 +0000

    cmake: Added more x86_64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On` (llama/18186)

    * minor: Consolidated `#include <immintrin.h>` under `ggml-cpu-impl.h`

    * cmake: Added more x86-64 CPU backends when building with `GGML_CPU_ALL_VARIANTS=On`

    - `ivybridge`
    - `piledriver`
    - `cannonlake`
    - `cascadelake`
    - `cooperlake`
    - `zen4`

    Resolves: #17966

commit 31fc2c3
Author: QDelta <60222316+QDelta@users.noreply.github.com>
Date:   Sat Dec 27 20:33:14 2025 -0500

    ggml-cuda: use CMAKE_CUDA_ARCHITECTURES if set when GGML_NATIVE=ON (llama/18413)

commit a800a3a
Author: lhez <lih@qti.qualcomm.com>
Date:   Sat Dec 27 15:51:14 2025 -0800

    opencl: allow resizing transpose buffers (llama/18384)

    * opencl: allow resizing transpose buffers instead of using fixed sizes

    * opencl: remove commented code

commit 29f8155
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Sat Dec 27 19:56:27 2025 +0800

    ggml-cuda: Use same regex for GGML_NATIVE=OFF (llama/18407)

commit 015b618
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 16:12:58 2025 -0600

    vulkan: preprocess mul_mat_id experts and discard workgroups more quickly (llama/18352)

    Run a preprocess to count how many times each expert is used, and use this to
    quickly discard workgroups that aren't needed.
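The counting pass described above can be sketched roughly like this (a hypothetical host-side C++ stand-in for the actual Vulkan shader logic):

```cpp
#include <vector>

// Hypothetical sketch of the preprocess: tally how often each expert
// appears in the routing ids, so workgroups assigned to unused experts
// can be discarded before any matmul work is done.
std::vector<int> count_expert_usage(const std::vector<int> & expert_ids, int n_experts) {
    std::vector<int> counts(n_experts, 0);
    for (int id : expert_ids) {
        counts[id]++;
    }
    return counts;
}

bool workgroup_needed(const std::vector<int> & counts, int expert) {
    return counts[expert] > 0; // zero-use experts are skipped entirely
}
```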

commit e37c8ed
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 11:15:50 2025 -0600

    vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader (llama/18349)

    * vulkan: Use BK=32 for coopmat2 mul_mat_id

    * vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader

    Disable robustness, remove the OOB check in decodeFuncB, and initialize the
    row_ids to zero to avoid OOB access.

    Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
    to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
    zero and remove the '& (BN - 1)'. This allows the compiler to common some of
    the shared memory loads.

commit 331c6cc
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 11:15:02 2025 -0600

    vulkan: Use BK=32 for coopmat2 mul_mat_id (llama/18332)

commit 35cb4ab
Author: Eve <139727413+netrunnereve@users.noreply.github.com>
Date:   Fri Dec 26 17:12:11 2025 +0000

    vulkan: small dequantization improvements (llama/18380)

    * iq4_xs

    * quants

commit 181e36f
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 10:00:57 2025 -0600

    vulkan: Support UPSCALE w/antialias (llama/18327)

commit 67473fe
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Fri Dec 26 09:53:46 2025 -0600

    vulkan: handle rope with large number of rows (llama/18306)

commit 33f75a8
Author: 0Marble <85058989+0Marble@users.noreply.github.com>
Date:   Fri Dec 26 09:12:04 2025 +0800

    CANN: implement the SSM_CONV operator (llama/17737)

    * CANN: implement SSM_CONV operator

    Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com>
    Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

    * CANN: remove custom error limit for SSM_CONV

    * CANN: merge SSM_CONV tensor shape/strides into one line

    ---------

    Co-authored-by: Sujin Kang, <waterjin326@gmail.com>

commit 5177835
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Fri Dec 26 01:35:14 2025 +0800

    ggml-cuda: fix regex for arch list (llama/18371)

    * ggml-cuda: fix regex for arch list

    * make regex exact

commit 8e02f09
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Thu Dec 25 23:55:38 2025 +0800

    cuda: optimize cumsum cub path (llama/18362)

    * cuda: optimize cumsum cub path

    * remove heavy perf test

commit ea07c5d
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Thu Dec 25 22:12:11 2025 +0800

    ggml-cuda: fix blackwell native builds (llama/18361)

    * ggml-cuda: fix blackwell native builds

    Replace 12x in native architectures by 12xa

    * replace for GGML_NATIVE=OFF too

    * only replace for native

    * remove 120f-virtual for default compilation

    ---------

    Co-authored-by: Aman Gupta <aman>

commit 5f0488f
Author: Penglin Cai <1402538448@qq.com>
Date:   Thu Dec 25 16:46:09 2025 +0800

    CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (llama/17934)

    * CONV_TRANSPOSE_1D kernel_size>255

    * remove condition check

    * fix the bug of type conversion

    * removing trailing whitespaces

    * fix: return true in the switch case

commit db75fff
Author: Aadeshveer Singh <24b0926@iitb.ac.in>
Date:   Thu Dec 25 09:41:13 2025 +0530

    ggml : optimize cuda cumsum fallback kernel (llama/18343)

commit 41e578e
Author: Aman Gupta <amangupta052@gmail.com>
Date:   Wed Dec 24 22:28:26 2025 +0800

    CUDA: experimental native mxfp4 support for blackwell (llama/17906)

    * CUDA: experimental native mxfp4 support for blackwell

    * optimize load_tiles

    * optimize quantize_mxfp4

    * cleanup

    * first pass review: formatting

    * use interleaved layout for mma

    * mmq: add assert for size

    * use __nv_fp4x4_e2m1

    * use iter_k as 512, cleanup

    * Use 1200 as blackwell instead of 1000

    * address review comments

    * mmq: fix stride

    * quantize.cu: use reference impl of e8m0 scale

    * address review comments

    * add 120f-virtual + minor fixes

    ---------

    Co-authored-by: Aman Gupta <aman>

commit f863735
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Wed Dec 24 05:36:34 2025 -0600

    vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (llama/18302)

commit bab2c02
Author: Wang Weixuan <wangweixvan@gmail.com>
Date:   Wed Dec 24 17:50:24 2025 +0800

    CANN : refactor ACL graph cache (llama/17752)

    Move the graph property checking code into methods of LRU cache.

    Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>

commit 1356600
Author: Ruben Ortlam <picard12@live.de>
Date:   Wed Dec 24 08:59:14 2025 +0100

    vulkan: use fewer FA rows for small cache runs (llama/18280)

commit ec9239d
Author: TianHao324 <tianhao42@huawei.com>
Date:   Wed Dec 24 14:55:33 2025 +0800

    CANN: Uses yarn_ramp cache in ROPE (llama/17725)

commit 9bdd465
Author: Chris Rohlf <chris.rohlf@gmail.com>
Date:   Tue Dec 23 04:56:49 2025 -0500

    rpc : add check for rpc buffer type (llama/18242)

commit e4c8961
Author: nullname <chraac@gmail.com>
Date:   Tue Dec 23 15:13:24 2025 +0800

    ggml-hexagon: create generalized functions for cpu side op (llama/17500)

    * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

    * refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

    * refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

    * add comment

    * refactor: remove redundant buffer checks in hexagon supported operations

    * wip

    * add missing include to fix weak symbol warning

    * add ggml_hexagon_op_generic

    * refactor: simplify tensor operation initialization and buffer management in hexagon implementation

    * refactor: streamline hexagon operation initialization and buffer management

    * refactor: update function signatures and streamline request handling in hexagon operations

    * wip

    * ggml-hexagon: clean up code formatting and improve unary operation handling

    * wip

    * rename

    * fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

    * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility

    refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility

    refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity

    refactor: remove redundant buffer checks in hexagon supported operations

    add missing include to fix weak symbol warning

    add ggml_hexagon_op_generic

    refactor: simplify tensor operation initialization and buffer management in hexagon implementation

    refactor: streamline hexagon operation initialization and buffer management

    refactor: update function signatures and streamline request handling in hexagon operations

    ggml-hexagon: clean up code formatting and improve unary operation handling

    fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations

    * hexagon: fix merge conflicts

    * hexagon: minor cleanup for buffer support checks

    * hexagon: factor out op_desc and the overall op logging

    * hexagon: further simplify and cleanup op dispatch logic

    * snapdragon: update adb scripts to use llama-cli and llama-completion

    * fix pipeline failure

    ---------

    Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

commit 2f33395
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Mon Dec 22 13:56:52 2025 -0500

    ggml-hexagon: gelu optimization (llama/18151)

    * feat: working gelu with src0 put on vtcm

    * feat: gelu ping-pong for both in and out

    * fix: fix compile error

    * break: distinguish dma ddr->vtcm and vtcm->ddr operation

    * fix: fix dma queue size

    * break: update dma api to either pop src or dst ptr

    * fix: fix activation vtcm allocation issue for src1 when swapped

    * refactor: ping-pong gelu logic to avoid unnecessary if else

    * dma: improved queue interface and prefetch handling

    * gelu: fix N+2 block prefetch

    ---------

    Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>

commit 5b0c1c1
Author: Taimur Ahmad <taimur.ahmad@10xengineers.ai>
Date:   Mon Dec 22 23:20:23 2025 +0500

    llamafile: add rvv support for sgemm kernels (llama/18199)

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

commit f2fe1e5
Author: lhez <lih@qti.qualcomm.com>
Date:   Mon Dec 22 10:19:01 2025 -0800

    opencl: unpack q4_0 for adreno in get_tensor (llama/18278)

commit dbbe6c1
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Mon Dec 22 11:03:13 2025 -0600

    vulkan: Extend rope fusions to allow mrope (llama/18264)

    Extend the test-backend-ops tests as well.

commit 98e59a4
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 14:52:09 2025 -0600

    vulkan: Implement set_tensor_async and the event interfaces (llama/18047)

    The goal is to enable the async loading code paths in
    llama_model_loader::load_all_data, originally from #7896. This works and the
    loads themselves are faster, but with host visible vidmem I think the cost of
    allocating/mapping vidmem moves and becomes more expensive, and I don't see a
    benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
    significant improvement in model loading time.

commit b68b12f
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Sun Dec 21 19:33:08 2025 +0100

    llama: fix RPC for -fit on (llama/18233)

commit b893e08
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:32:58 2025 -0600

    vulkan: fix im2col overflowing maxworkgroupcount (llama/18180)

commit f407c5e
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:27:34 2025 -0600

    vulkan/cuda: fix topk_moe with exp_probs_b (llama/18071)

    I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
    and added coverage for exp_probs_b and some other missing combinations. This
    exposed a bug in both CUDA and Vulkan backends where they were assuming the
    input to argsort and the input to get_rows are the same. I'd like to optimize
    this graph in another change, but for now just get it functional.

    CUDA also had a bug where it got n_experts from the wrong place, leading to
    GGML_ASSERT failures in some of the new tests.

commit ad6ee38
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:17:58 2025 -0600

    vulkan: support GGML_UNARY_OP_XIELU (llama/18062)

commit 3cd141f
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Sun Dec 21 03:05:08 2025 -0600

    vulkan: in graph_optimize, try to group ADD operations (llama/18060)

    I saw the adds not staying together in the new nemotron 3 nano model.

commit 449fc7c
Author: lovedheart <6277001+lovedheart@users.noreply.github.com>
Date:   Sun Dec 21 09:59:52 2025 +0100

    Vulkan: some improvement on mul_mat_iq2_xs (llama/18031)

    * Some improvement on mul_mat_iq2_xs

    Refactor calculations for db values and grid data to optimize performance and reduce redundancy.

    * Fix trailing whitespace

commit 0983985
Author: Aadeshveer Singh <24b0926@iitb.ac.in>
Date:   Sat Dec 20 16:58:57 2025 +0530

    Added comments explaining thread block size selection logic based on row count and column size, derived from historical commit context (llama/18212)

commit 17a4cb1
Author: Alfred <zxu3@clemson.edu>
Date:   Fri Dec 19 12:42:28 2025 -0500

    ggml-hexagon: Implement true Q8_0 quantization on Hexagon NPU for more accurate mixed-precision matmul operations (llama/17977)

    * feat: implement real Q8_0

    * feat: adding cmake option for configuring FP32 quantize group size

    * typo: set() shall be used

    ---------

    Co-authored-by: ngdxzy <zhenyu_xu@uri.edu>
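As background, Q8_0 quantization of one block can be sketched as follows (illustrative code, not the hexagon backend's actual kernels): the scale is taken from the block's largest magnitude, and each value is rounded to an int8.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative Q8_0 quantization of a single 32-value block (not the real
// NPU kernel): d = amax / 127, each value stored as round(x / d).
struct BlockQ8 {
    float  d;      // per-block scale (fp16 in the on-disk format)
    int8_t qs[32]; // quantized values
};

BlockQ8 quantize_q8_0_block(const float (&x)[32]) {
    float amax = 0.0f;
    for (float v : x) {
        amax = std::max(amax, std::fabs(v));
    }
    BlockQ8 b{};
    b.d = amax / 127.0f;
    const float id = b.d != 0.0f ? 1.0f / b.d : 0.0f;
    for (int i = 0; i < 32; i++) {
        b.qs[i] = (int8_t) std::lround(x[i] * id);
    }
    return b;
}
```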

commit 195d8d0
Author: Jeff Bolz <jbolz@nvidia.com>
Date:   Thu Dec 18 23:36:46 2025 -0600

    vulkan: Add perf logger mode with concurrency (llama/17944)

    This implements a variation of the perf logger where rather than timing each
    operation individually with effectively a barrier in between, we put the
    timing boundaries where we already synchronize and time the groups of work
    that normally overlap. This can be useful to help understand whether
    individual operations need to be optimized, or if the group is already running
    efficiently.

    GGML_VK_PERF_LOGGER_CONCURRENT=1 enables the new mode (when
    GGML_VK_PERF_LOGGER is also set).

    GGML_VK_SYNC_LOGGER=1 replaces the ENABLE_SYNC_LOGGING compile time switch.

commit fea481f
Author: Xuan-Son Nguyen <son@huggingface.co>
Date:   Fri Dec 19 00:18:01 2025 +0100

    model : add ASR support for LFM2-Audio-1.5B (conformer) (llama/18106)

    * ASR with LFM2-Audio-1.5B

    * Set rope_theta

    * Fix comment

    * Remove rope_theta setting

    * Address PR feedback

    * rename functions to conformer

    * remove some redundant ggml_cont

    * fix missing tensor

    * add prefix "a." for conv tensors

    * remove redundant reshape

    * clean up

    * add test model

    ---------

    Co-authored-by: Tarek Dakhran <tarek@liquid.ai>

commit 956fac4
Author: Taimur Ahmad <taimur.ahmad@10xengineers.ai>
Date:   Thu Dec 18 19:02:09 2025 +0500

    ggml-cpu: extend support for RVV floating-point kernels (llama/17318)

    * cmake: add BF16 RVV flag for ggml-cpu

    * ggml-cpu: add floating-point conversion kernels

    * ggml: add floating-point kernels

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

    * ggml-cpu: fix lmul in vec_dot_bf16

    * ggml-cpu: change redsum to lmul 4, fix leftover

    ---------

    Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>

commit 325a9b7
Author: yulo <77381088+zhang-hui-yulo@users.noreply.github.com>
Date:   Thu Dec 18 19:50:56 2025 +0800

    remove i_major_dual (llama/18157)

    Co-authored-by: zhang hui <you@example.com>

commit c3a1608
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Wed Dec 17 16:38:21 2025 -0500

    ggml-hexagon: swiglu_oai operation (llama/18114)

    * snapshot: debug ggml-hexagon swiglu-oai

    * fix: fix hvx_min_scalar_f32

    * feat: working swiglu-oai

    * chore: fix formatting issue

commit c7ccedb
Author: Shouyu <65317431+joeldushouyu@users.noreply.github.com>
Date:   Wed Dec 17 13:39:32 2025 -0500

    ggml-hexagon: gelu operation (llama/17921)

    * feat: initial support for gelu using sigmoid approximation

    * snapshot: faster gelu using polynomial approximation

    * test: disable l2-block prefetch in polynomial approximation

    * Revert "test: disable l2-block prefetch in polynomial approximation"

    This reverts commit 72339994d45b2bed887e79994403c378d90b62b5.

    * Revert "snapshot: faster gelu using polynomial approximation"

    This reverts commit 2a787a61d11f9e63e5943a2e6d134b2f0c402ace.

    * debug: temporarily disable unnecessary log message for debug purpose

    * Feat: optimized unaligned sigmoid_f32

    * Feat: larger l2prefetch block

    * feat: apply unaligned-load optimization on mul and mul_scalar

    * Revert "debug: temporarily disable unnecessary log message for debug purpose"

    This reverts commit 84f2f23aa9f17e2fa826db969cd825d0ab192995.

    * refactor: cleanup commented unused code

    * chore: reformat code with clang-formatter to pass cli test

    * Revert "chore: reformat code with clang-formatter to pass cli test"

    This reverts commit 952877ec24732b12010c7fa7ed3fc8de4b74e718.

    * fix: fix loop overflow

    * chore: fix formatting ci error

commit 1f72f00
Author: Alberto Cabrera Pérez <1478977+Alcpz@users.noreply.github.com>
Date:   Wed Dec 17 11:39:13 2025 +0000

    ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) (llama/18096)

    * wip: skeleton for q8_0 repack

    * q8_0 repack GEMV implementations

    * GEMM implementations

    * Formatting

    * Fixed format consistency of repack gemm and gemv declarations

    * gemv and gemm generic location consistent with declarations

    * Removed non-correct unused variables statements

    * Cleanup, consistent style

    * Missing generic fallbacks for x86 and powerpc

commit 9118c05
Author: yulo <77381088+zhang-hui-yulo@users.noreply.github.com>
Date:   Wed Dec 17 16:34:54 2025 +0800

    HIP: Refactor mma for RDNA and CDNA (llama/17990)

    * mma.cuh for rdna4

    * mma for rdna3

    * mmq for rdna4

    * mmq for rdna3

    * align i-major and j-major

    * cdna

    * fix cuda error

    * add missing tile of mfma

    * fix j-major wrong ne on CDNA

    * fix grammar and empty spaces

    ---------

    Co-authored-by: zhang hui <you@example.com>

commit 6114e69
Author: KITAITI Makoto <KitaitiMakoto@gmail.com>
Date:   Wed Dec 24 16:52:16 2025 +0900

    ruby : add Whisper::Token, fix model URI (ggml-org#3575)

    * Define and use macro to get context safely

    * Add test to check SEGV

    * Move installation guide after usage

    * [skip ci]Change doc slightly

    * [skip ci]Fix a typo in README

    * [skip ci]Add carry_initial_prompt option in README

    * Define GetVADSegments and use it

    * Use GetContext

    * Fix download URI of small.en-tdrz

    * Fix URI of CoreML models corresponding to quantized models

    * Cache computed string

    * Remove unused argument

    * Add Whisper::Token

    * Add document comments

    * Rename function: rb_whisper_token_s_new -> ruby_whisper_token_s_init

    * Fix size of token

    * Insert _get into function names

    * Add Whisper::Token#text

    * Add test for Whisper::Token#text

    * Declare static if possible

    * Change method names

    * Add Whisper::Token#deconstruct_keys

    * Add tests for Whisper::Token#deconstruct_keys

    * Add signatures for Whisper::Token

    * Complete signature

    * [skip ci]Add n_tokens to document of Segment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant