[SYCL] add OP GATED_DELTA_NET to support Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-GGUF #20455

Merged
NeoZhangJianyu merged 1 commit into ggml-org:master from arthw:add_gated_delta_net on Mar 14, 2026

Conversation

@arthw (Contributor) commented Mar 12, 2026

Fixes issue #20423.
Adds OP GATED_DELTA_NET.
All UT cases pass.
Updates ops.md.
All OPs now run on the GPU.
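For context, GATED_DELTA_NET is the fused gated delta rule recurrence used by Qwen3.5's linear-attention (SSM) layers. A minimal single-head reference sketch, assuming the conventional formulation S_t = α_t · S_{t−1} · (I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ with o_t = S_t q_t; the names, layout, and exact placement of the decay are illustrative, not the actual ggml/SYCL kernel:

```cpp
#include <vector>

// Hedged reference for one gated-delta-rule step on a single head.
// S is the d_v x d_k state (row-major); alpha is the scalar decay gate,
// beta the scalar write gate. This mirrors the math only, not the fused
// SYCL kernel added in this PR.
static void gated_delta_step(std::vector<float> & S,
                             const float * q, const float * k,  // length d_k
                             const float * v, float * o,        // length d_v
                             float alpha, float beta, int d_k, int d_v) {
    for (int i = 0; i < d_v; ++i) {
        // kS = row_i(S) . k : the part of the old association lying along k
        float kS = 0.0f;
        for (int j = 0; j < d_k; ++j) {
            kS += S[i*d_k + j] * k[j];
        }
        float out = 0.0f;
        for (int j = 0; j < d_k; ++j) {
            // decay the state, erase the old value along k, write the new one
            S[i*d_k + j] = alpha * (S[i*d_k + j] - beta * kS * k[j]) + beta * v[i] * k[j];
            out += S[i*d_k + j] * q[j];  // o_t = S_t q_t
        }
        o[i] = out;
    }
}
```

The "autoregressive" and "chunked" variants that appear in the logs further down presumably differ only in whether this update is applied token by token or batched over a chunk of tokens before the state is materialized.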

Here are the performance results:

| GPU                     | PP (t/s)        | TG (t/s)       |
| ----------------------- | --------------- | -------------- |
| Arc A770                | 90.83 -> 339.22 | 10.52 -> 11.73 |
| iGPU (UHD Graphics 770) | 26.21 -> 33.42  | 2.12 -> 2.24   |

@github-actions bot added the documentation, ggml, and SYCL labels on Mar 12, 2026
@strtgbb commented Mar 12, 2026

Nice

| Test    | b8284  | This PR |
| ------- | ------ | ------- |
| pp2000  | 208.98 | 409.55  |
| pp20000 | 43.09  | 52.21   |
| tg300   | 41.85  | 47.68   |

Arc B580, Linux, Qwen3.5-4B-Q6_K, SYCL

On b8284 with Vulkan I still see the error that this PR fixes for SYCL: `layer 0 is assigned to device Vulkan0 but the fused Gated Delta Net tensor is assigned to device CPU (usually due to missing support)`
Will there be a separate PR for that?

@NeoZhangJianyu (Contributor)

> Nice
>
> Arc B580, Qwen3.5-4B-Q6_K, SYCL
>
> | Test    | b8284  | This PR |
> | ------- | ------ | ------- |
> | pp2000  | 208.98 | 409.55  |
> | pp20000 | 43.09  | 52.21   |
> | tg300   | 41.85  | 47.68   |
>
> I still see the error that this fixes on b8284 Vulkan: layer 0 is assigned to device Vulkan0 but the fused Gated Delta Net tensor is assigned to device CPU (usually due to missing support). Will there be a separate PR for that?

Thank you for sharing the test results!

The SYCL and Vulkan backends are separate; Vulkan needs another PR.
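To make the separation concrete: each ggml backend advertises op support on its own, and the scheduler moves unsupported ops to the CPU, which is exactly the Vulkan message quoted above. A deliberately simplified sketch (the enum and function here are invented for illustration, not the real ggml API):

```cpp
#include <string>

// Hypothetical per-backend support table. ggml's real dispatch is richer,
// but the principle is the same: every backend opts into every op separately.
enum op_kind { OP_GATED_DELTA_NET /* , ... */ };

static bool backend_supports(const std::string & backend, op_kind op) {
    if (op == OP_GATED_DELTA_NET) {
        // After this PR the SYCL backend claims the op; Vulkan does not yet,
        // so Vulkan graphs still split the fused tensor off to the CPU.
        return backend == "SYCL" || backend == "CPU";
    }
    return true;
}
```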

@WizardlyBump17 commented Mar 13, 2026

Before:

| Model        | Parameters | Quantization | pp512 (t/s)    | tg128 (t/s)  | CLI Parameters    |
| ------------ | ---------- | ------------ | -------------- | ------------ | ----------------- |
| Qwen3.5-9B   | 8.95B      | Q8_0         | 217.69 ± 3.51  | 9.85 ± 0.17  | --n-gpu-layers 99 |
| Qwen3.5-9B   | 8.95B      | Q4_K_M       | 214.83 ± 2.66  | 32.73 ± 0.38 | --n-gpu-layers 99 |
| Qwen3.5-4B   | 4.21B      | Q8_0         | 246.88 ± 2.63  | 17.41 ± 0.00 | --n-gpu-layers 99 |
| Qwen3.5-4B   | 4.21B      | Q4_K_M       | 248.52 ± 3.11  | 45.92 ± 0.05 | --n-gpu-layers 99 |
| Qwen3.5-2B   | 1.88B      | BF16         | 135.00 ± 4.91  | 6.47 ± 0.05  | --n-gpu-layers 99 |
| Qwen3.5-2B   | 1.88B      | Q8_0         | 581.90 ± 9.18  | 35.41 ± 0.03 | --n-gpu-layers 99 |
| Qwen3.5-2B   | 1.88B      | Q4_K_M       | 603.45 ± 20.62 | 77.47 ± 0.66 | --n-gpu-layers 99 |
| Qwen3.5-0.8B | 0.75B      | BF16         | 285.41 ± 2.18  | 11.26 ± 0.34 | --n-gpu-layers 99 |
| Qwen3.5-0.8B | 0.75B      | Q8_0         | 694.80 ± 12.38 | 64.99 ± 0.02 | --n-gpu-layers 99 |
| Qwen3.5-0.8B | 0.75B      | Q4_K_M       | 661.21 ± 35.89 | 98.46 ± 1.03 | --n-gpu-layers 99 |

After:

| Model        | Parameters | Quantization | pp512 (t/s)     | tg128 (t/s)   | CLI Parameters    |
| ------------ | ---------- | ------------ | --------------- | ------------- | ----------------- |
| Qwen3.5 27B  | 26.90 B    | Q2_K         | 199.64 ± 3.58   | 8.94 ± 0.27   | --n-gpu-layers 99 |
| Qwen3.5 9B   | 8.95 B     | Q8_0         | 664.37 ± 5.12   | 10.32 ± 0.18  | --n-gpu-layers 99 |
| Qwen3.5 9B   | 8.95 B     | Q4_K_M       | 697.43 ± 5.55   | 38.17 ± 0.45  | --n-gpu-layers 99 |
| Qwen3.5 4B   | 4.21 B     | F16          | 1161.00 ± 0.93  | 36.13 ± 0.02  | --n-gpu-layers 99 |
| Qwen3.5 4B   | 4.21 B     | Q8_0         | 1182.21 ± 9.96  | 18.96 ± 0.02  | --n-gpu-layers 99 |
| Qwen3.5 4B   | 4.21 B     | Q4_K_M       | 1234.99 ± 3.21  | 59.98 ± 0.11  | --n-gpu-layers 99 |
| Qwen3.5 2B   | 1.88 B     | BF16         | 169.08 ± 2.16   | 6.42 ± 0.43   | --n-gpu-layers 99 |
| Qwen3.5 2B   | 1.88 B     | F16          | 2787.86 ± 2.67  | 65.77 ± 0.06  | --n-gpu-layers 99 |
| Qwen3.5 2B   | 1.88 B     | Q8_0         | 2861.57 ± 3.23  | 38.88 ± 0.10  | --n-gpu-layers 99 |
| Qwen3.5 2B   | 1.88 B     | Q4_K_M       | 2986.40 ± 5.09  | 100.17 ± 0.72 | --n-gpu-layers 99 |
| Qwen3.5 0.8B | 752.39 M   | BF16         | 410.79 ± 5.43   | 12.09 ± 0.09  | --n-gpu-layers 99 |
| Qwen3.5 0.8B | 752.39 M   | F16          | 5043.84 ± 12.73 | 119.63 ± 1.68 | --n-gpu-layers 99 |
| Qwen3.5 0.8B | 752.39 M   | Q8_0         | 5176.11 ± 4.61  | 77.92 ± 0.06  | --n-gpu-layers 99 |
| Qwen3.5 0.8B | 752.39 M   | Q4_K_M       | 5310.50 ± 15.18 | 135.37 ± 0.76 | --n-gpu-layers 99 |

Ryzen 7 5700X3D, B580

@NeoZhangJianyu (Contributor)

@WizardlyBump17
It's great to see the performance-increase data here.

Could you update your post with the GPU info and OS?

Thank you!

@savvadesogle

Great PR!

But there are some problems.

Driver: 8509
Windows 11
A770

1. Some operations on CPU

According to Task Manager, some operations run on the CPU during token generation.


2. Benchmark gets stuck on FA runs (though not always):

Full verbose log:

C:\llm\llama-cpp\SYCL\add_gated_delta_net\build\bin>llama-bench -m T:\models\lmstudio-community\Qwen3.5-9B-GGUF\Qwen3.5-9B-Q4_K_M.gguf  -ngl 100 -fa 0,1 --verbose
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
llama_model_load_from_file_impl: using device SYCL0 (Intel(R) Arc(TM) A770 Graphics) (unknown id) - 15930 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 427 tensors from T:\models\lmstudio-community\Qwen3.5-9B-GGUF\Qwen3.5-9B-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen_Qwen3.5 9B
llama_model_loader: - kv   3:                           general.basename str              = Qwen_Qwen3.5
llama_model_loader: - kv   4:                         general.size_label str              = 9B
llama_model_loader: - kv   5:                         qwen35.block_count u32              = 32
llama_model_loader: - kv   6:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv   7:                    qwen35.embedding_length u32              = 4096
llama_model_loader: - kv   8:                 qwen35.feed_forward_length u32              = 12288
llama_model_loader: - kv   9:                qwen35.attention.head_count u32              = 16
llama_model_loader: - kv  10:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  11:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  12:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  13:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  14:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  15:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  16:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  17:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  18:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  19:                  qwen35.ssm.time_step_rank u32              = 32
llama_model_loader: - kv  20:                      qwen35.ssm.inner_size u32              = 4096
llama_model_loader: - kv  21:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  22:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  23:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  24:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  25:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  26:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  27:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  29:            tokenizer.ggml.padding_token_id u32              = 248044
llama_model_loader: - kv  30:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  31:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  177 tensors
llama_model_loader: - type q4_K:  204 tensors
llama_model_loader: - type q5_K:   24 tensors
llama_model_loader: - type q6_K:   22 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 5.23 GiB (5.02 BPW)
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 248062 '<|fim_suffix|>' is not marked as EOG
load: control token: 248072 '<tts_pad>' is not marked as EOG
load: control token: 248050 '<|box_end|>' is not marked as EOG
load: control token: 248048 '<|object_ref_end|>' is not marked as EOG
load: control token: 248055 '<|vision_pad|>' is not marked as EOG
load: control token: 248060 '<|fim_prefix|>' is not marked as EOG
load: control token: 248049 '<|box_start|>' is not marked as EOG
load: control token: 248073 '<tts_text_bos>' is not marked as EOG
load: control token: 248052 '<|quad_end|>' is not marked as EOG
load: control token: 248054 '<|vision_end|>' is not marked as EOG
load: control token: 248075 '<tts_text_bos_single>' is not marked as EOG
load: control token: 248056 '<|image_pad|>' is not marked as EOG
load: control token: 248071 '<|audio_end|>' is not marked as EOG
load: control token: 248045 '<|im_start|>' is not marked as EOG
load: control token: 248047 '<|object_ref_start|>' is not marked as EOG
load: control token: 248051 '<|quad_start|>' is not marked as EOG
load: control token: 248053 '<|vision_start|>' is not marked as EOG
load: control token: 248057 '<|video_pad|>' is not marked as EOG
load: control token: 248061 '<|fim_middle|>' is not marked as EOG
load: control token: 248070 '<|audio_start|>' is not marked as EOG
load: control token: 248074 '<tts_text_eod>' is not marked as EOG
load: control token: 248076 '<|audio_pad|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 32
print_info: n_head                = 16
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 12288
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 4096
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 32
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = 9B
print_info: model params          = 8.95 B
print_info: general.name          = Qwen_Qwen3.5 9B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248044 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = true, direct_io = false)
load_tensors: layer   0 assigned to device SYCL0, is_swa = 0
load_tensors: layer   1 assigned to device SYCL0, is_swa = 0
load_tensors: layer   2 assigned to device SYCL0, is_swa = 0
load_tensors: layer   3 assigned to device SYCL0, is_swa = 0
load_tensors: layer   4 assigned to device SYCL0, is_swa = 0
load_tensors: layer   5 assigned to device SYCL0, is_swa = 0
load_tensors: layer   6 assigned to device SYCL0, is_swa = 0
load_tensors: layer   7 assigned to device SYCL0, is_swa = 0
load_tensors: layer   8 assigned to device SYCL0, is_swa = 0
load_tensors: layer   9 assigned to device SYCL0, is_swa = 0
load_tensors: layer  10 assigned to device SYCL0, is_swa = 0
load_tensors: layer  11 assigned to device SYCL0, is_swa = 0
load_tensors: layer  12 assigned to device SYCL0, is_swa = 0
load_tensors: layer  13 assigned to device SYCL0, is_swa = 0
load_tensors: layer  14 assigned to device SYCL0, is_swa = 0
load_tensors: layer  15 assigned to device SYCL0, is_swa = 0
load_tensors: layer  16 assigned to device SYCL0, is_swa = 0
load_tensors: layer  17 assigned to device SYCL0, is_swa = 0
load_tensors: layer  18 assigned to device SYCL0, is_swa = 0
load_tensors: layer  19 assigned to device SYCL0, is_swa = 0
load_tensors: layer  20 assigned to device SYCL0, is_swa = 0
load_tensors: layer  21 assigned to device SYCL0, is_swa = 0
load_tensors: layer  22 assigned to device SYCL0, is_swa = 0
load_tensors: layer  23 assigned to device SYCL0, is_swa = 0
load_tensors: layer  24 assigned to device SYCL0, is_swa = 0
load_tensors: layer  25 assigned to device SYCL0, is_swa = 0
load_tensors: layer  26 assigned to device SYCL0, is_swa = 0
load_tensors: layer  27 assigned to device SYCL0, is_swa = 0
load_tensors: layer  28 assigned to device SYCL0, is_swa = 0
load_tensors: layer  29 assigned to device SYCL0, is_swa = 0
load_tensors: layer  30 assigned to device SYCL0, is_swa = 0
load_tensors: layer  31 assigned to device SYCL0, is_swa = 0
load_tensors: layer  32 assigned to device SYCL0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.post_attention_norm.weight
create_tensor: loading tensor blk.0.attn_qkv.weight
create_tensor: loading tensor blk.0.attn_gate.weight
create_tensor: loading tensor blk.0.ssm_conv1d.weight
create_tensor: loading tensor blk.0.ssm_dt.bias
create_tensor: loading tensor blk.0.ssm_a
create_tensor: loading tensor blk.0.ssm_beta.weight
create_tensor: loading tensor blk.0.ssm_alpha.weight
create_tensor: loading tensor blk.0.ssm_norm.weight
create_tensor: loading tensor blk.0.ssm_out.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.post_attention_norm.weight
create_tensor: loading tensor blk.1.attn_qkv.weight
create_tensor: loading tensor blk.1.attn_gate.weight
create_tensor: loading tensor blk.1.ssm_conv1d.weight
create_tensor: loading tensor blk.1.ssm_dt.bias
create_tensor: loading tensor blk.1.ssm_a
create_tensor: loading tensor blk.1.ssm_beta.weight
create_tensor: loading tensor blk.1.ssm_alpha.weight
create_tensor: loading tensor blk.1.ssm_norm.weight
create_tensor: loading tensor blk.1.ssm_out.weight
create_tensor: loading tensor blk.1.ffn_gate.weight
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.post_attention_norm.weight
create_tensor: loading tensor blk.2.attn_qkv.weight
create_tensor: loading tensor blk.2.attn_gate.weight
create_tensor: loading tensor blk.2.ssm_conv1d.weight
create_tensor: loading tensor blk.2.ssm_dt.bias
create_tensor: loading tensor blk.2.ssm_a
create_tensor: loading tensor blk.2.ssm_beta.weight
create_tensor: loading tensor blk.2.ssm_alpha.weight
create_tensor: loading tensor blk.2.ssm_norm.weight
create_tensor: loading tensor blk.2.ssm_out.weight
create_tensor: loading tensor blk.2.ffn_gate.weight
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.post_attention_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_q_norm.weight
create_tensor: loading tensor blk.3.attn_k_norm.weight
create_tensor: loading tensor blk.3.ffn_gate.weight
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.post_attention_norm.weight
create_tensor: loading tensor blk.4.attn_qkv.weight
create_tensor: loading tensor blk.4.attn_gate.weight
create_tensor: loading tensor blk.4.ssm_conv1d.weight
create_tensor: loading tensor blk.4.ssm_dt.bias
create_tensor: loading tensor blk.4.ssm_a
create_tensor: loading tensor blk.4.ssm_beta.weight
create_tensor: loading tensor blk.4.ssm_alpha.weight
create_tensor: loading tensor blk.4.ssm_norm.weight
create_tensor: loading tensor blk.4.ssm_out.weight
create_tensor: loading tensor blk.4.ffn_gate.weight
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.post_attention_norm.weight
create_tensor: loading tensor blk.5.attn_qkv.weight
create_tensor: loading tensor blk.5.attn_gate.weight
create_tensor: loading tensor blk.5.ssm_conv1d.weight
create_tensor: loading tensor blk.5.ssm_dt.bias
create_tensor: loading tensor blk.5.ssm_a
create_tensor: loading tensor blk.5.ssm_beta.weight
create_tensor: loading tensor blk.5.ssm_alpha.weight
create_tensor: loading tensor blk.5.ssm_norm.weight
create_tensor: loading tensor blk.5.ssm_out.weight
create_tensor: loading tensor blk.5.ffn_gate.weight
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.post_attention_norm.weight
create_tensor: loading tensor blk.6.attn_qkv.weight
create_tensor: loading tensor blk.6.attn_gate.weight
create_tensor: loading tensor blk.6.ssm_conv1d.weight
create_tensor: loading tensor blk.6.ssm_dt.bias
create_tensor: loading tensor blk.6.ssm_a
create_tensor: loading tensor blk.6.ssm_beta.weight
create_tensor: loading tensor blk.6.ssm_alpha.weight
create_tensor: loading tensor blk.6.ssm_norm.weight
create_tensor: loading tensor blk.6.ssm_out.weight
create_tensor: loading tensor blk.6.ffn_gate.weight
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.post_attention_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_q_norm.weight
create_tensor: loading tensor blk.7.attn_k_norm.weight
create_tensor: loading tensor blk.7.ffn_gate.weight
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.post_attention_norm.weight
create_tensor: loading tensor blk.8.attn_qkv.weight
create_tensor: loading tensor blk.8.attn_gate.weight
create_tensor: loading tensor blk.8.ssm_conv1d.weight
create_tensor: loading tensor blk.8.ssm_dt.bias
create_tensor: loading tensor blk.8.ssm_a
create_tensor: loading tensor blk.8.ssm_beta.weight
create_tensor: loading tensor blk.8.ssm_alpha.weight
create_tensor: loading tensor blk.8.ssm_norm.weight
create_tensor: loading tensor blk.8.ssm_out.weight
create_tensor: loading tensor blk.8.ffn_gate.weight
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.post_attention_norm.weight
create_tensor: loading tensor blk.9.attn_qkv.weight
create_tensor: loading tensor blk.9.attn_gate.weight
create_tensor: loading tensor blk.9.ssm_conv1d.weight
create_tensor: loading tensor blk.9.ssm_dt.bias
create_tensor: loading tensor blk.9.ssm_a
create_tensor: loading tensor blk.9.ssm_beta.weight
create_tensor: loading tensor blk.9.ssm_alpha.weight
create_tensor: loading tensor blk.9.ssm_norm.weight
create_tensor: loading tensor blk.9.ssm_out.weight
create_tensor: loading tensor blk.9.ffn_gate.weight
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.post_attention_norm.weight
create_tensor: loading tensor blk.10.attn_qkv.weight
create_tensor: loading tensor blk.10.attn_gate.weight
create_tensor: loading tensor blk.10.ssm_conv1d.weight
create_tensor: loading tensor blk.10.ssm_dt.bias
create_tensor: loading tensor blk.10.ssm_a
create_tensor: loading tensor blk.10.ssm_beta.weight
create_tensor: loading tensor blk.10.ssm_alpha.weight
create_tensor: loading tensor blk.10.ssm_norm.weight
create_tensor: loading tensor blk.10.ssm_out.weight
create_tensor: loading tensor blk.10.ffn_gate.weight
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.post_attention_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_q_norm.weight
create_tensor: loading tensor blk.11.attn_k_norm.weight
create_tensor: loading tensor blk.11.ffn_gate.weight
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.post_attention_norm.weight
create_tensor: loading tensor blk.12.attn_qkv.weight
create_tensor: loading tensor blk.12.attn_gate.weight
create_tensor: loading tensor blk.12.ssm_conv1d.weight
create_tensor: loading tensor blk.12.ssm_dt.bias
create_tensor: loading tensor blk.12.ssm_a
create_tensor: loading tensor blk.12.ssm_beta.weight
create_tensor: loading tensor blk.12.ssm_alpha.weight
create_tensor: loading tensor blk.12.ssm_norm.weight
create_tensor: loading tensor blk.12.ssm_out.weight
create_tensor: loading tensor blk.12.ffn_gate.weight
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.post_attention_norm.weight
create_tensor: loading tensor blk.13.attn_qkv.weight
create_tensor: loading tensor blk.13.attn_gate.weight
create_tensor: loading tensor blk.13.ssm_conv1d.weight
create_tensor: loading tensor blk.13.ssm_dt.bias
create_tensor: loading tensor blk.13.ssm_a
create_tensor: loading tensor blk.13.ssm_beta.weight
create_tensor: loading tensor blk.13.ssm_alpha.weight
create_tensor: loading tensor blk.13.ssm_norm.weight
create_tensor: loading tensor blk.13.ssm_out.weight
create_tensor: loading tensor blk.13.ffn_gate.weight
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.post_attention_norm.weight
create_tensor: loading tensor blk.14.attn_qkv.weight
create_tensor: loading tensor blk.14.attn_gate.weight
create_tensor: loading tensor blk.14.ssm_conv1d.weight
create_tensor: loading tensor blk.14.ssm_dt.bias
create_tensor: loading tensor blk.14.ssm_a
create_tensor: loading tensor blk.14.ssm_beta.weight
create_tensor: loading tensor blk.14.ssm_alpha.weight
create_tensor: loading tensor blk.14.ssm_norm.weight
create_tensor: loading tensor blk.14.ssm_out.weight
create_tensor: loading tensor blk.14.ffn_gate.weight
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.post_attention_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_q_norm.weight
create_tensor: loading tensor blk.15.attn_k_norm.weight
create_tensor: loading tensor blk.15.ffn_gate.weight
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.post_attention_norm.weight
create_tensor: loading tensor blk.16.attn_qkv.weight
create_tensor: loading tensor blk.16.attn_gate.weight
create_tensor: loading tensor blk.16.ssm_conv1d.weight
create_tensor: loading tensor blk.16.ssm_dt.bias
create_tensor: loading tensor blk.16.ssm_a
create_tensor: loading tensor blk.16.ssm_beta.weight
create_tensor: loading tensor blk.16.ssm_alpha.weight
create_tensor: loading tensor blk.16.ssm_norm.weight
create_tensor: loading tensor blk.16.ssm_out.weight
create_tensor: loading tensor blk.16.ffn_gate.weight
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.post_attention_norm.weight
create_tensor: loading tensor blk.17.attn_qkv.weight
create_tensor: loading tensor blk.17.attn_gate.weight
create_tensor: loading tensor blk.17.ssm_conv1d.weight
create_tensor: loading tensor blk.17.ssm_dt.bias
create_tensor: loading tensor blk.17.ssm_a
create_tensor: loading tensor blk.17.ssm_beta.weight
create_tensor: loading tensor blk.17.ssm_alpha.weight
create_tensor: loading tensor blk.17.ssm_norm.weight
create_tensor: loading tensor blk.17.ssm_out.weight
create_tensor: loading tensor blk.17.ffn_gate.weight
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.post_attention_norm.weight
create_tensor: loading tensor blk.18.attn_qkv.weight
create_tensor: loading tensor blk.18.attn_gate.weight
create_tensor: loading tensor blk.18.ssm_conv1d.weight
create_tensor: loading tensor blk.18.ssm_dt.bias
create_tensor: loading tensor blk.18.ssm_a
create_tensor: loading tensor blk.18.ssm_beta.weight
create_tensor: loading tensor blk.18.ssm_alpha.weight
create_tensor: loading tensor blk.18.ssm_norm.weight
create_tensor: loading tensor blk.18.ssm_out.weight
create_tensor: loading tensor blk.18.ffn_gate.weight
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.post_attention_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_q_norm.weight
create_tensor: loading tensor blk.19.attn_k_norm.weight
create_tensor: loading tensor blk.19.ffn_gate.weight
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.post_attention_norm.weight
create_tensor: loading tensor blk.20.attn_qkv.weight
create_tensor: loading tensor blk.20.attn_gate.weight
create_tensor: loading tensor blk.20.ssm_conv1d.weight
create_tensor: loading tensor blk.20.ssm_dt.bias
create_tensor: loading tensor blk.20.ssm_a
create_tensor: loading tensor blk.20.ssm_beta.weight
create_tensor: loading tensor blk.20.ssm_alpha.weight
create_tensor: loading tensor blk.20.ssm_norm.weight
create_tensor: loading tensor blk.20.ssm_out.weight
create_tensor: loading tensor blk.20.ffn_gate.weight
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.post_attention_norm.weight
create_tensor: loading tensor blk.21.attn_qkv.weight
create_tensor: loading tensor blk.21.attn_gate.weight
create_tensor: loading tensor blk.21.ssm_conv1d.weight
create_tensor: loading tensor blk.21.ssm_dt.bias
create_tensor: loading tensor blk.21.ssm_a
create_tensor: loading tensor blk.21.ssm_beta.weight
create_tensor: loading tensor blk.21.ssm_alpha.weight
create_tensor: loading tensor blk.21.ssm_norm.weight
create_tensor: loading tensor blk.21.ssm_out.weight
create_tensor: loading tensor blk.21.ffn_gate.weight
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.post_attention_norm.weight
create_tensor: loading tensor blk.22.attn_qkv.weight
create_tensor: loading tensor blk.22.attn_gate.weight
create_tensor: loading tensor blk.22.ssm_conv1d.weight
create_tensor: loading tensor blk.22.ssm_dt.bias
create_tensor: loading tensor blk.22.ssm_a
create_tensor: loading tensor blk.22.ssm_beta.weight
create_tensor: loading tensor blk.22.ssm_alpha.weight
create_tensor: loading tensor blk.22.ssm_norm.weight
create_tensor: loading tensor blk.22.ssm_out.weight
create_tensor: loading tensor blk.22.ffn_gate.weight
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.post_attention_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_q_norm.weight
create_tensor: loading tensor blk.23.attn_k_norm.weight
create_tensor: loading tensor blk.23.ffn_gate.weight
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.post_attention_norm.weight
create_tensor: loading tensor blk.24.attn_qkv.weight
create_tensor: loading tensor blk.24.attn_gate.weight
create_tensor: loading tensor blk.24.ssm_conv1d.weight
create_tensor: loading tensor blk.24.ssm_dt.bias
create_tensor: loading tensor blk.24.ssm_a
create_tensor: loading tensor blk.24.ssm_beta.weight
create_tensor: loading tensor blk.24.ssm_alpha.weight
create_tensor: loading tensor blk.24.ssm_norm.weight
create_tensor: loading tensor blk.24.ssm_out.weight
create_tensor: loading tensor blk.24.ffn_gate.weight
create_tensor: loading tensor blk.24.ffn_down.weight
create_tensor: loading tensor blk.24.ffn_up.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.post_attention_norm.weight
create_tensor: loading tensor blk.25.attn_qkv.weight
create_tensor: loading tensor blk.25.attn_gate.weight
create_tensor: loading tensor blk.25.ssm_conv1d.weight
create_tensor: loading tensor blk.25.ssm_dt.bias
create_tensor: loading tensor blk.25.ssm_a
create_tensor: loading tensor blk.25.ssm_beta.weight
create_tensor: loading tensor blk.25.ssm_alpha.weight
create_tensor: loading tensor blk.25.ssm_norm.weight
create_tensor: loading tensor blk.25.ssm_out.weight
create_tensor: loading tensor blk.25.ffn_gate.weight
create_tensor: loading tensor blk.25.ffn_down.weight
create_tensor: loading tensor blk.25.ffn_up.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.post_attention_norm.weight
create_tensor: loading tensor blk.26.attn_qkv.weight
create_tensor: loading tensor blk.26.attn_gate.weight
create_tensor: loading tensor blk.26.ssm_conv1d.weight
create_tensor: loading tensor blk.26.ssm_dt.bias
create_tensor: loading tensor blk.26.ssm_a
create_tensor: loading tensor blk.26.ssm_beta.weight
create_tensor: loading tensor blk.26.ssm_alpha.weight
create_tensor: loading tensor blk.26.ssm_norm.weight
create_tensor: loading tensor blk.26.ssm_out.weight
create_tensor: loading tensor blk.26.ffn_gate.weight
create_tensor: loading tensor blk.26.ffn_down.weight
create_tensor: loading tensor blk.26.ffn_up.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.post_attention_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_q_norm.weight
create_tensor: loading tensor blk.27.attn_k_norm.weight
create_tensor: loading tensor blk.27.ffn_gate.weight
create_tensor: loading tensor blk.27.ffn_down.weight
create_tensor: loading tensor blk.27.ffn_up.weight
create_tensor: loading tensor blk.28.attn_norm.weight
create_tensor: loading tensor blk.28.post_attention_norm.weight
create_tensor: loading tensor blk.28.attn_qkv.weight
create_tensor: loading tensor blk.28.attn_gate.weight
create_tensor: loading tensor blk.28.ssm_conv1d.weight
create_tensor: loading tensor blk.28.ssm_dt.bias
create_tensor: loading tensor blk.28.ssm_a
create_tensor: loading tensor blk.28.ssm_beta.weight
create_tensor: loading tensor blk.28.ssm_alpha.weight
create_tensor: loading tensor blk.28.ssm_norm.weight
create_tensor: loading tensor blk.28.ssm_out.weight
create_tensor: loading tensor blk.28.ffn_gate.weight
create_tensor: loading tensor blk.28.ffn_down.weight
create_tensor: loading tensor blk.28.ffn_up.weight
create_tensor: loading tensor blk.29.attn_norm.weight
create_tensor: loading tensor blk.29.post_attention_norm.weight
create_tensor: loading tensor blk.29.attn_qkv.weight
create_tensor: loading tensor blk.29.attn_gate.weight
create_tensor: loading tensor blk.29.ssm_conv1d.weight
create_tensor: loading tensor blk.29.ssm_dt.bias
create_tensor: loading tensor blk.29.ssm_a
create_tensor: loading tensor blk.29.ssm_beta.weight
create_tensor: loading tensor blk.29.ssm_alpha.weight
create_tensor: loading tensor blk.29.ssm_norm.weight
create_tensor: loading tensor blk.29.ssm_out.weight
create_tensor: loading tensor blk.29.ffn_gate.weight
create_tensor: loading tensor blk.29.ffn_down.weight
create_tensor: loading tensor blk.29.ffn_up.weight
create_tensor: loading tensor blk.30.attn_norm.weight
create_tensor: loading tensor blk.30.post_attention_norm.weight
create_tensor: loading tensor blk.30.attn_qkv.weight
create_tensor: loading tensor blk.30.attn_gate.weight
create_tensor: loading tensor blk.30.ssm_conv1d.weight
create_tensor: loading tensor blk.30.ssm_dt.bias
create_tensor: loading tensor blk.30.ssm_a
create_tensor: loading tensor blk.30.ssm_beta.weight
create_tensor: loading tensor blk.30.ssm_alpha.weight
create_tensor: loading tensor blk.30.ssm_norm.weight
create_tensor: loading tensor blk.30.ssm_out.weight
create_tensor: loading tensor blk.30.ffn_gate.weight
create_tensor: loading tensor blk.30.ffn_down.weight
create_tensor: loading tensor blk.30.ffn_up.weight
create_tensor: loading tensor blk.31.attn_norm.weight
create_tensor: loading tensor blk.31.post_attention_norm.weight
create_tensor: loading tensor blk.31.attn_q.weight
create_tensor: loading tensor blk.31.attn_k.weight
create_tensor: loading tensor blk.31.attn_v.weight
create_tensor: loading tensor blk.31.attn_output.weight
create_tensor: loading tensor blk.31.attn_q_norm.weight
create_tensor: loading tensor blk.31.attn_k_norm.weight
create_tensor: loading tensor blk.31.ffn_gate.weight
create_tensor: loading tensor blk.31.ffn_down.weight
create_tensor: loading tensor blk.31.ffn_up.weight
done_getting_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type SYCL_Host, using CPU instead
load_tensors: offloading output layer to GPU
load_tensors: offloading 31 repeating layers to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   545.62 MiB
load_tensors:        SYCL0 model buffer size =  4810.28 MiB
.............................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (512) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
Build with Macros:
  GGML_SYCL_FORCE_MMQ: no
  GGML_SYCL_F16: yes
  GGML_SYCL_GRAPH: yes
  GGML_SYCL_DNNL: yes
Running with Environment Variables:
  GGML_SYCL_DEBUG: 0
  GGML_SYCL_DISABLE_OPT: 0
  GGML_SYCL_DISABLE_GRAPH: 1
  GGML_SYCL_DISABLE_DNN: 0
  GGML_SYCL_PRIORITIZE_DMMV: 0
  GGML_SYCL_ENABLE_FLASH_ATTN: 1
Found 1 SYCL devices:
|  |                   |                                       |       |Max    |        |Max  |Global |                     |
|  |                   |                                       |       |compute|Max work|sub  |mem    |                     |
|ID|        Device Type|                                   Name|Version|units  |group   |group|size   |       Driver version|
|--|-------------------|---------------------------------------|-------|-------|--------|-----|-------|---------------------|
| 0| [level_zero:gpu:0]|                Intel Arc A770 Graphics|  12.55|    512|    1024|   32| 16704M|           1.14.36605|
SYCL Optimization Feature:
|ID|        Device Type|Reorder|
|--|-------------------|-------|
| 0| [level_zero:gpu:0]|      Y|
set_abort_callback: call
llama_context:  SYCL_Host  output buffer size =     0.95 MiB
llama_kv_cache: layer   0: filtered
llama_kv_cache: layer   1: filtered
llama_kv_cache: layer   2: filtered
llama_kv_cache: layer   3: dev = SYCL0
llama_kv_cache: layer   4: filtered
llama_kv_cache: layer   5: filtered
llama_kv_cache: layer   6: filtered
llama_kv_cache: layer   7: dev = SYCL0
llama_kv_cache: layer   8: filtered
llama_kv_cache: layer   9: filtered
llama_kv_cache: layer  10: filtered
llama_kv_cache: layer  11: dev = SYCL0
llama_kv_cache: layer  12: filtered
llama_kv_cache: layer  13: filtered
llama_kv_cache: layer  14: filtered
llama_kv_cache: layer  15: dev = SYCL0
llama_kv_cache: layer  16: filtered
llama_kv_cache: layer  17: filtered
llama_kv_cache: layer  18: filtered
llama_kv_cache: layer  19: dev = SYCL0
llama_kv_cache: layer  20: filtered
llama_kv_cache: layer  21: filtered
llama_kv_cache: layer  22: filtered
llama_kv_cache: layer  23: dev = SYCL0
llama_kv_cache: layer  24: filtered
llama_kv_cache: layer  25: filtered
llama_kv_cache: layer  26: filtered
llama_kv_cache: layer  27: dev = SYCL0
llama_kv_cache: layer  28: filtered
llama_kv_cache: layer  29: filtered
llama_kv_cache: layer  30: filtered
llama_kv_cache: layer  31: dev = SYCL0
llama_kv_cache:      SYCL0 KV buffer size =    16.00 MiB
llama_kv_cache: size =   16.00 MiB (   512 cells,   8 layers,  1/1 seqs), K (f16):    8.00 MiB, V (f16):    8.00 MiB
llama_memory_recurrent, layer   0: dev = SYCL0
llama_memory_recurrent, layer   1: dev = SYCL0
llama_memory_recurrent, layer   2: dev = SYCL0
llama_memory_recurrent: layer   3: skipped
llama_memory_recurrent, layer   4: dev = SYCL0
llama_memory_recurrent, layer   5: dev = SYCL0
llama_memory_recurrent, layer   6: dev = SYCL0
llama_memory_recurrent: layer   7: skipped
llama_memory_recurrent, layer   8: dev = SYCL0
llama_memory_recurrent, layer   9: dev = SYCL0
llama_memory_recurrent, layer  10: dev = SYCL0
llama_memory_recurrent: layer  11: skipped
llama_memory_recurrent, layer  12: dev = SYCL0
llama_memory_recurrent, layer  13: dev = SYCL0
llama_memory_recurrent, layer  14: dev = SYCL0
llama_memory_recurrent: layer  15: skipped
llama_memory_recurrent, layer  16: dev = SYCL0
llama_memory_recurrent, layer  17: dev = SYCL0
llama_memory_recurrent, layer  18: dev = SYCL0
llama_memory_recurrent: layer  19: skipped
llama_memory_recurrent, layer  20: dev = SYCL0
llama_memory_recurrent, layer  21: dev = SYCL0
llama_memory_recurrent, layer  22: dev = SYCL0
llama_memory_recurrent: layer  23: skipped
llama_memory_recurrent, layer  24: dev = SYCL0
llama_memory_recurrent, layer  25: dev = SYCL0
llama_memory_recurrent, layer  26: dev = SYCL0
llama_memory_recurrent: layer  27: skipped
llama_memory_recurrent, layer  28: dev = SYCL0
llama_memory_recurrent, layer  29: dev = SYCL0
llama_memory_recurrent, layer  30: dev = SYCL0
llama_memory_recurrent: layer  31: skipped
llama_memory_recurrent:      SYCL0 RS buffer size =    50.25 MiB
llama_memory_recurrent: size =   50.25 MiB (     1 cells,  32 layers,  1 seqs), R (f32):    2.25 MiB, S (f32):   48.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 20480
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
sched_reserve: resolving fused Gated Delta Net support:
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
sched_reserve: fused Gated Delta Net (autoregressive) enabled
graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =    1
sched_reserve: fused Gated Delta Net (chunked) enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
sched_reserve:      SYCL0 compute buffer size =   493.00 MiB
sched_reserve:  SYCL_Host compute buffer size =    21.02 MiB
sched_reserve: graph nodes  = 1872
sched_reserve: graph splits = 2
sched_reserve: reserve took 70.37 ms, sched copies = 1
attach_threadpool: call
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |  0 |           pp512 |       1407.45 ± 1.29 |
llama_perf_context_print:        load time =    6430.39 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  3072 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =    8256.93 ms /  3073 tokens
llama_perf_context_print:    graphs reused =          5
~llama_context:      SYCL0 compute buffer size is 493.0000 MiB, matches expectation of 493.0000 MiB
~llama_context:  SYCL_Host compute buffer size is  21.0157 MiB, matches expectation of  21.0157 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 256
llama_context: n_ctx_seq     = 256
llama_context: n_batch       = 128
llama_context: n_ubatch      = 128
llama_context: causal_attn   = 1
llama_context: flash_attn    = disabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (256) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  SYCL_Host  output buffer size =     0.95 MiB
llama_kv_cache: layer   0: filtered
llama_kv_cache: layer   1: filtered
llama_kv_cache: layer   2: filtered
llama_kv_cache: layer   3: dev = SYCL0
llama_kv_cache: layer   4: filtered
llama_kv_cache: layer   5: filtered
llama_kv_cache: layer   6: filtered
llama_kv_cache: layer   7: dev = SYCL0
llama_kv_cache: layer   8: filtered
llama_kv_cache: layer   9: filtered
llama_kv_cache: layer  10: filtered
llama_kv_cache: layer  11: dev = SYCL0
llama_kv_cache: layer  12: filtered
llama_kv_cache: layer  13: filtered
llama_kv_cache: layer  14: filtered
llama_kv_cache: layer  15: dev = SYCL0
llama_kv_cache: layer  16: filtered
llama_kv_cache: layer  17: filtered
llama_kv_cache: layer  18: filtered
llama_kv_cache: layer  19: dev = SYCL0
llama_kv_cache: layer  20: filtered
llama_kv_cache: layer  21: filtered
llama_kv_cache: layer  22: filtered
llama_kv_cache: layer  23: dev = SYCL0
llama_kv_cache: layer  24: filtered
llama_kv_cache: layer  25: filtered
llama_kv_cache: layer  26: filtered
llama_kv_cache: layer  27: dev = SYCL0
llama_kv_cache: layer  28: filtered
llama_kv_cache: layer  29: filtered
llama_kv_cache: layer  30: filtered
llama_kv_cache: layer  31: dev = SYCL0
llama_kv_cache:      SYCL0 KV buffer size =     8.00 MiB
llama_kv_cache: size =    8.00 MiB (   256 cells,   8 layers,  1/1 seqs), K (f16):    4.00 MiB, V (f16):    4.00 MiB
llama_memory_recurrent, layer   0: dev = SYCL0
llama_memory_recurrent, layer   1: dev = SYCL0
llama_memory_recurrent, layer   2: dev = SYCL0
llama_memory_recurrent: layer   3: skipped
llama_memory_recurrent, layer   4: dev = SYCL0
llama_memory_recurrent, layer   5: dev = SYCL0
llama_memory_recurrent, layer   6: dev = SYCL0
llama_memory_recurrent: layer   7: skipped
llama_memory_recurrent, layer   8: dev = SYCL0
llama_memory_recurrent, layer   9: dev = SYCL0
llama_memory_recurrent, layer  10: dev = SYCL0
llama_memory_recurrent: layer  11: skipped
llama_memory_recurrent, layer  12: dev = SYCL0
llama_memory_recurrent, layer  13: dev = SYCL0
llama_memory_recurrent, layer  14: dev = SYCL0
llama_memory_recurrent: layer  15: skipped
llama_memory_recurrent, layer  16: dev = SYCL0
llama_memory_recurrent, layer  17: dev = SYCL0
llama_memory_recurrent, layer  18: dev = SYCL0
llama_memory_recurrent: layer  19: skipped
llama_memory_recurrent, layer  20: dev = SYCL0
llama_memory_recurrent, layer  21: dev = SYCL0
llama_memory_recurrent, layer  22: dev = SYCL0
llama_memory_recurrent: layer  23: skipped
llama_memory_recurrent, layer  24: dev = SYCL0
llama_memory_recurrent, layer  25: dev = SYCL0
llama_memory_recurrent, layer  26: dev = SYCL0
llama_memory_recurrent: layer  27: skipped
llama_memory_recurrent, layer  28: dev = SYCL0
llama_memory_recurrent, layer  29: dev = SYCL0
llama_memory_recurrent, layer  30: dev = SYCL0
llama_memory_recurrent: layer  31: skipped
llama_memory_recurrent:      SYCL0 RS buffer size =    50.25 MiB
llama_memory_recurrent: size =   50.25 MiB (     1 cells,  32 layers,  1 seqs), R (f32):    2.25 MiB, S (f32):   48.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 13664
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 128, n_seqs = 1, n_outputs = 1
sched_reserve: resolving fused Gated Delta Net support:
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
sched_reserve: fused Gated Delta Net (autoregressive) enabled
graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =    1
sched_reserve: fused Gated Delta Net (chunked) enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  128, n_seqs =  1, n_outputs =  128
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  128, n_seqs =  1, n_outputs =  128
sched_reserve:      SYCL0 compute buffer size =   123.25 MiB
sched_reserve:  SYCL_Host compute buffer size =     5.13 MiB
sched_reserve: graph nodes  = 1872
sched_reserve: graph splits = 2
sched_reserve: reserve took 54.14 ms, sched copies = 1
attach_threadpool: call
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
| qwen35 9B Q4_K - Medium        |   5.23 GiB |     8.95 B | SYCL       | 100 |  0 |           tg128 |         24.44 ± 0.07 |
llama_perf_context_print:        load time =    9211.62 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   641 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   35410.01 ms /   642 tokens
llama_perf_context_print:    graphs reused =        631
~llama_context:      SYCL0 compute buffer size is 123.2500 MiB, matches expectation of 123.2500 MiB
~llama_context:  SYCL_Host compute buffer size is   5.1289 MiB, matches expectation of   5.1289 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (512) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  SYCL_Host  output buffer size =     0.95 MiB
llama_kv_cache: layer   0: filtered
llama_kv_cache: layer   1: filtered
llama_kv_cache: layer   2: filtered
llama_kv_cache: layer   3: dev = SYCL0
llama_kv_cache: layer   4: filtered
llama_kv_cache: layer   5: filtered
llama_kv_cache: layer   6: filtered
llama_kv_cache: layer   7: dev = SYCL0
llama_kv_cache: layer   8: filtered
llama_kv_cache: layer   9: filtered
llama_kv_cache: layer  10: filtered
llama_kv_cache: layer  11: dev = SYCL0
llama_kv_cache: layer  12: filtered
llama_kv_cache: layer  13: filtered
llama_kv_cache: layer  14: filtered
llama_kv_cache: layer  15: dev = SYCL0
llama_kv_cache: layer  16: filtered
llama_kv_cache: layer  17: filtered
llama_kv_cache: layer  18: filtered
llama_kv_cache: layer  19: dev = SYCL0
llama_kv_cache: layer  20: filtered
llama_kv_cache: layer  21: filtered
llama_kv_cache: layer  22: filtered
llama_kv_cache: layer  23: dev = SYCL0
llama_kv_cache: layer  24: filtered
llama_kv_cache: layer  25: filtered
llama_kv_cache: layer  26: filtered
llama_kv_cache: layer  27: dev = SYCL0
llama_kv_cache: layer  28: filtered
llama_kv_cache: layer  29: filtered
llama_kv_cache: layer  30: filtered
llama_kv_cache: layer  31: dev = SYCL0
llama_kv_cache:      SYCL0 KV buffer size =    16.00 MiB
llama_kv_cache: size =   16.00 MiB (   512 cells,   8 layers,  1/1 seqs), K (f16):    8.00 MiB, V (f16):    8.00 MiB
llama_memory_recurrent, layer   0: dev = SYCL0
llama_memory_recurrent, layer   1: dev = SYCL0
llama_memory_recurrent, layer   2: dev = SYCL0
llama_memory_recurrent: layer   3: skipped
llama_memory_recurrent, layer   4: dev = SYCL0
llama_memory_recurrent, layer   5: dev = SYCL0
llama_memory_recurrent, layer   6: dev = SYCL0
llama_memory_recurrent: layer   7: skipped
llama_memory_recurrent, layer   8: dev = SYCL0
llama_memory_recurrent, layer   9: dev = SYCL0
llama_memory_recurrent, layer  10: dev = SYCL0
llama_memory_recurrent: layer  11: skipped
llama_memory_recurrent, layer  12: dev = SYCL0
llama_memory_recurrent, layer  13: dev = SYCL0
llama_memory_recurrent, layer  14: dev = SYCL0
llama_memory_recurrent: layer  15: skipped
llama_memory_recurrent, layer  16: dev = SYCL0
llama_memory_recurrent, layer  17: dev = SYCL0
llama_memory_recurrent, layer  18: dev = SYCL0
llama_memory_recurrent: layer  19: skipped
llama_memory_recurrent, layer  20: dev = SYCL0
llama_memory_recurrent, layer  21: dev = SYCL0
llama_memory_recurrent, layer  22: dev = SYCL0
llama_memory_recurrent: layer  23: skipped
llama_memory_recurrent, layer  24: dev = SYCL0
llama_memory_recurrent, layer  25: dev = SYCL0
llama_memory_recurrent, layer  26: dev = SYCL0
llama_memory_recurrent: layer  27: skipped
llama_memory_recurrent, layer  28: dev = SYCL0
llama_memory_recurrent, layer  29: dev = SYCL0
llama_memory_recurrent, layer  30: dev = SYCL0
llama_memory_recurrent: layer  31: skipped
llama_memory_recurrent:      SYCL0 RS buffer size =    50.25 MiB
llama_memory_recurrent: size =   50.25 MiB (     1 cells,  32 layers,  1 seqs), R (f32):    2.25 MiB, S (f32):   48.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 20480
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
sched_reserve: resolving fused Gated Delta Net support:
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
sched_reserve: fused Gated Delta Net (autoregressive) enabled
graph_reserve: reserving a graph for ubatch with n_tokens =   16, n_seqs =  1, n_outputs =    1
sched_reserve: fused Gated Delta Net (chunked) enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
sched_reserve:      SYCL0 compute buffer size =   501.00 MiB
sched_reserve:  SYCL_Host compute buffer size =    17.02 MiB
sched_reserve: graph nodes  = 1833
sched_reserve: graph splits = 2
sched_reserve: reserve took 71.20 ms, sched copies = 1
attach_threadpool: call
set_n_threads: n_threads = 36, n_threads_batch = 36

@strtgbb commented Mar 13, 2026

> 1. Some operations on CPU

Your log shows `done_getting_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type SYCL_Host, using CPU instead`.
I see the same in my logs. I expect that issue is beyond the scope of this PR.
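For background, that message comes from the weight-loading path: each weight is first offered to the device's preferred host buffer type and falls back to plain CPU buffers when that type can't take it. A rough sketch under that assumption (every name and the support rule below are invented for illustration; they are not the real llama.cpp symbols):

```cpp
#include <cstdio>

enum buft { BUFT_SYCL_HOST, BUFT_CPU };

struct tensor_meta {
    const char * name;       // e.g. "token_embd.weight"
    bool         is_kquant;  // e.g. true for q4_K
};

// Stand-in predicate: pretend the pinned host buffer type rejects k-quants.
static bool buft_supports_tensor(buft b, const tensor_meta & t) {
    return b == BUFT_CPU || !t.is_kquant;
}

static buft pick_buft(const tensor_meta & t) {
    if (buft_supports_tensor(BUFT_SYCL_HOST, t)) {
        return BUFT_SYCL_HOST;  // preferred pinned host memory
    }
    std::printf("tensor '%s' cannot be used with preferred buffer type, using CPU instead\n",
                t.name);
    return BUFT_CPU;            // the fallback seen in both logs
}
```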

@NeoZhangJianyu (Contributor)

> 1. Some operations on CPU
>
> Your log shows `done_getting_tensors: tensor 'token_embd.weight' (q4_K) (and 0 others) cannot be used with preferred buffer type SYCL_Host, using CPU instead`. I see the same in my logs. I expect that issue is beyond the scope of this PR.

Yes, this is a separate issue; I will check it later.

It would be great to create another issue to track it.

Thank you!

NeoZhangJianyu merged commit a93c0ef into ggml-org:master on Mar 14, 2026
70 of 80 checks passed