native QLoRA training with reward-weighted SFT and GRPO #20453
srossitto79 wants to merge 10 commits into ggml-org:master from
Conversation
According to the llama.cpp AI usage policy:
Sorry, but given that you are a new contributor and that the PR description is clearly machine generated, I do not, in a vacuum, have high confidence that the implementation in this PR is correct. Unfortunately, as of right now I also lack the technical expertise to judge the correctness of the implementation; I would have to read up on this first. As such, I'm giving the review of this PR a low priority until I am less busy with other matters, in a month or so.
No problem, take your time. I deliberately added this note at the end of the PR to be clear: I made the changes on a fork that had seriously diverged from upstream, so to prepare the PR I cloned upstream again and used Claude Code with Opus to transfer my changes onto the fresh repo code. I had to do some fixes manually; at first it forgot a few things here and there (a declaration, an assert), but in the end I aligned everything with my working dev version, and for the models I tested, this branch gave the same results as my dev repo. The only thing I don't like much is the way I interfaced Python with the GRPO finetuner: I used stdin/stdout/stderr to deliver the prompts, the responses, and the computed rewards. It's a rather rudimentary way to do IPC, but it works, it was fast to implement, and I provided an example of its usage in Python. I am open to better solutions if you have other ideas. I hope to see some comments about the training results that can help me improve the code or validate my observations.
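The line-based stdin/stdout exchange described above could look roughly like the sketch below. This is an illustration only, not the PR's actual protocol: a tiny inline echo process stands in for the real C++ trainer, and all message field names (`id`, `prompt`, `completion`) are hypothetical.

```python
import json
import subprocess
import sys

# Stand-in "trainer": reads one JSON object per line on stdin, writes one
# JSON reply per line on stdout (here it just upper-cases the prompt).
FAKE_TRAINER = r"""
import sys, json
for line in sys.stdin:
    msg = json.loads(line)
    reply = {"id": msg["id"], "completion": msg["prompt"].upper()}
    print(json.dumps(reply), flush=True)
"""

def run_step(proc, sample_id, prompt):
    # Driver side: send one request line, block on one reply line.
    proc.stdin.write(json.dumps({"id": sample_id, "prompt": prompt}) + "\n")
    proc.stdin.flush()
    return json.loads(proc.stdout.readline())

proc = subprocess.Popen(
    [sys.executable, "-c", FAKE_TRAINER],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
reply = run_step(proc, 0, "hello grpo")
print(reply["completion"])
# In the real loop the driver would now score this completion and send the
# reward back on the next line.
proc.stdin.close()
proc.wait()
```

The appeal of this design, as noted above, is that it needs no extra dependencies on either side; the cost is that framing and error handling are entirely manual.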
Suboptimally worded PR descriptions are OK; autogenerated ones are not. My opinion is that carefully checked machine-generated code can be OK, but I want proof, in the form of a writeup, that the person clicking the submit button actually understands it. Otherwise I will end up wasting too much time reviewing broken code that will never get merged.
Summary
Native QLoRA + Reward-Weighted SFT + GRPO training pipeline for quantized GGUF models.
- `llama_adapter_lora_init` and `llama-export-lora`.
- `reward`/`score` field: cross-entropy loss is scaled by the normalized reward before backprop — no extra flags needed.
- `--grpo-mode` implements a full GRPO training loop via a line-based IPC protocol between a Python driver (prompt sampling + reward scoring) and the C++ process (model state + generation + gradient updates).

Key features
- `--freeze-layers N` — skip LoRA on the first N layers; backward pass auto-pruned
- `--grad-checkpoint N` — mark every Nth forward node persistent to reduce peak activation VRAM
- `--train-on-prompt` / `--shuffle-dataset` / `--save-every N` / `--val-split`
- `OUT_PROD` CUDA kernel — dequantize on GPU + cuBLAS for the backward matmul
- `OUT_PROD_ID` for MoE backward — enables LoRA on dense FFN layers in Mixtral/Nemotron-MoE

Files changed
- `examples/qlora_training/` (C++ trainer, Python GRPO driver, README, sample data)
- `ggml/src/ggml.c`, `ggml/src/ggml-opt.cpp`, `ggml/include/ggml-opt.h`, `ggml/include/ggml.h`
- `ggml/src/ggml-cuda/out-prod.cu` (new), `ggml/src/ggml-cuda/ggml-cuda.cu`, `ggml/src/ggml-cuda/opt-step-adamw.cu`
- `ggml/src/ggml-cpu/ops.cpp`
- `include/llama.h`, `src/llama-context.cpp`, `src/llama-context.h`
- `common/common.h`, `common/arg.cpp`
- `examples/training/finetune.cpp` (added `shuffle` + `grad_checkpoint_interval` params)

Tested on
Test plan
- Builds with `-DGGML_CUDA=ON` on Linux/Windows
- `llama-finetune-qlora` with sample data: `examples/qlora_training/sample_data.jsonl`
- Loads with `llama-cli --lora` and merges with `llama-export-lora`
- `examples/training/finetune` (existing example) to verify no regression
- `python3 examples/qlora_training/grpo_example.py --model <model>`

🤖 PR Generated with Claude Code
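To make the reward-weighted SFT idea from the summary concrete, here is a minimal sketch of scaling per-sample cross-entropy by a normalized reward before the backward pass. The normalization scheme used by this PR is not spelled out here, so the choice below (divide by the batch-mean reward, so weights average 1) is an assumption for illustration only.

```python
def reward_weighted_loss(ce_losses, rewards):
    """Scale each sample's cross-entropy loss by its normalized reward.

    ce_losses: per-sample cross-entropy values for one batch
    rewards:   per-sample rewards (e.g. from a `reward`/`score` field)
    Normalization is hypothetical: each reward is divided by the batch mean,
    so a batch of equal rewards reduces to plain mean cross-entropy.
    """
    mean_r = sum(rewards) / len(rewards)
    weights = [r / mean_r for r in rewards]
    return sum(l * w for l, w in zip(ce_losses, weights)) / len(ce_losses)

# Samples with above-average reward contribute more to the loss:
loss = reward_weighted_loss([2.0, 1.0], [3.0, 1.0])  # weights become [1.5, 0.5]
```

With equal rewards this degenerates to ordinary SFT, which matches the "no extra flags needed" behavior described above: the weighting only kicks in when samples actually carry a reward field.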
Disclosure: AI tools were used in an assistive capacity for mechanical porting
of already-written and tested code from a development fork to a clean upstream base.
All code was authored and extensively tested by the contributor.
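For context on the GRPO side of the pipeline: the "group-relative" part of the algorithm normalizes rewards within a group of completions sampled for the same prompt and uses the z-scores as advantages, avoiding a learned value function. The sketch below illustrates that general idea and is not taken from this PR's code.

```python
def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score of each reward within its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for, say, 3 completions of one prompt; advantages sum to ~0,
# so the group's average completion is neither reinforced nor penalized.
advs = group_advantages([1.0, 2.0, 3.0])
```

In a full GRPO loop these advantages would weight the policy-gradient update for each completion's tokens; in this PR's design the Python driver supplies the rewards over the IPC channel and the C++ process applies the resulting updates.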