DGX Spark (GB10) llama.cpp build question #20405
Replies: 3 comments 4 replies
-
Finished building on the Spark. Ran a few llama-bench benchmarks with the KV cache set to Q8_0.
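For reference, a llama-bench run with the KV cache quantized to Q8_0 might look like the following sketch (the model path is a placeholder; -ctk/-ctv select the K and V cache types, and -fa enables flash attention, which llama.cpp requires for a quantized V cache):

```shell
# Benchmark with both the K and V caches quantized to Q8_0.
# model.gguf is a placeholder path; substitute your own model.
./build/bin/llama-bench -m model.gguf -ctk q8_0 -ctv q8_0 -fa 1
```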
-
I don't see any specific mention of which flags to pass when building llama.cpp. Curious what people have been doing with their llama.cpp builds? Specifically, I wonder whether the -DCMAKE_CUDA_ARCHITECTURES=121 flag is needed, since the GB10 is CUDA compute capability 12.1.
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_NATIVE=ON
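If you're unsure which architecture value applies, one option (a sketch, not verified on a Spark) is to query the GPU's compute capability with nvidia-smi, or let CMake 3.24+ detect the local GPU via the `native` keyword instead of hard-coding 121:

```shell
# Report the compute capability of the installed GPU
# (per the question above, a GB10 should report 12.1).
nvidia-smi --query-gpu=compute_cap --format=csv,noheader

# Alternatively, have CMake target whatever GPU is present
# (requires CMake 3.24 or newer):
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON \
      -DCMAKE_CUDA_ARCHITECTURES=native
```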