
native QLoRA training with reward-weighted SFT and GRPO #20453

Open
srossitto79 wants to merge 10 commits into ggml-org:master from srossitto79:feat/qlora-training

Conversation

@srossitto79

Summary

Native QLoRA + Reward-Weighted SFT + GRPO training pipeline for quantized GGUF models.

  • QLoRA SFT: Trains F32 LoRA A/B adapters on frozen quantized models (Q4_K_M, Q6_K, etc.). Saved adapters are directly compatible with llama_adapter_lora_init and llama-export-lora.
  • Reward-Weighted SFT: When the dataset contains a reward/score field, cross-entropy loss is scaled by the normalized reward before backprop — no extra flags needed.
  • GRPO (online RL): --grpo-mode implements a full GRPO training loop via a line-based IPC protocol between a Python driver (prompt sampling + reward scoring) and the C++ process (model state + generation + gradient updates).
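
The reward-weighted SFT idea above (scale each example's cross-entropy by its normalized reward) can be sketched in plain Python. This is an illustrative sketch, not the PR's ggml code; the exact normalization the trainer uses is an assumption (here: rewards divided by the batch mean).

```python
import math

def reward_weighted_ce(logits, targets, rewards):
    """Mean cross-entropy over a batch, with each example's loss scaled by
    its reward normalized to a batch mean of 1. Hypothetical sketch of the
    reward-weighted SFT objective described in the PR summary."""
    mean_r = sum(rewards) / len(rewards)
    # Guard against a degenerate all-zero (or zero-mean) reward batch.
    norm = [r / mean_r for r in rewards] if mean_r != 0 else [1.0] * len(rewards)
    total = 0.0
    for row, t, w in zip(logits, targets, norm):
        # Numerically stable log-softmax cross-entropy for one prediction.
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += w * (lse - row[t])
    return total / len(logits)
```

With uniform rewards this reduces to ordinary mean cross-entropy; a zero-reward example contributes nothing to the gradient.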

Key features

  • --freeze-layers N — skip LoRA on first N layers; backward auto-pruned
  • --grad-checkpoint N — mark every Nth forward node persistent to reduce peak activation VRAM
  • --train-on-prompt / --shuffle-dataset / --save-every N / --val-split
  • Quantized OUT_PROD CUDA kernel — dequantize on GPU + cuBLAS for backward matmul
  • OUT_PROD_ID for MoE backward — enables LoRA on dense FFN layers in Mixtral/Nemotron-MoE
  • Per-backend momentum allocation — avoids cross-device mismatch with partial GPU offload
  • Scheduler graph inflation — measures actual training graph size to prevent OOM
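
The QLoRA setup above keeps the quantized weight matrix frozen and routes all gradient flow through the small F32 low-rank pair. A minimal sketch of the adapter math (plain Python with toy matrices, not the PR's ggml implementation; the function and argument names are illustrative):

```python
def matvec(M, x):
    # Dense matrix-vector product over nested lists.
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(w_frozen, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x).

    w_frozen stands in for the dequantized base weight (d_out x d_in),
    which receives no gradients; only A (r x d_in) and B (d_out x r)
    are trainable F32 tensors, as in the PR's QLoRA SFT description."""
    base = matvec(w_frozen, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]
```

Because only A and B are updated, the saved adapter is just these two small matrices per layer, which is why it can be loaded with `llama_adapter_lora_init` or merged with `llama-export-lora`.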

Files changed

  • New: examples/qlora_training/ (C++ trainer, Python GRPO driver, README, sample data)
  • Modified: ggml/src/ggml.c, ggml/src/ggml-opt.cpp, ggml/include/ggml-opt.h, ggml/include/ggml.h
  • Modified: ggml/src/ggml-cuda/out-prod.cu (new), ggml/src/ggml-cuda/ggml-cuda.cu, ggml/src/ggml-cuda/opt-step-adamw.cu
  • Modified: ggml/src/ggml-cpu/ops.cpp
  • Modified: include/llama.h, src/llama-context.cpp, src/llama-context.h
  • Modified: common/common.h, common/arg.cpp
  • Modified: examples/training/finetune.cpp (added shuffle + grad_checkpoint_interval params)

Tested on

  • Qwen3 1.7B Q4_K_M (full GPU, RTX 4060 Ti 16 GB)
  • Nemotron 15B Q4_K_M (partial offload -ngl 13)
  • SFT, Reward-Weighted SFT, and GRPO modes

Test plan

  • Build with -DGGML_CUDA=ON on Linux/Windows
  • Run llama-finetune-qlora with sample data: examples/qlora_training/sample_data.jsonl
  • Verify adapter loads with llama-cli --lora and merges with llama-export-lora
  • Run examples/training/finetune (existing example) to verify no regression
  • Run GRPO mode: python3 examples/qlora_training/grpo_example.py --model <model>
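
The GRPO step above trains on group-relative advantages: each sampled completion's reward is normalized against its sampling group. The standard form of that normalization (an assumption about this PR's exact implementation; the epsilon and population-std convention are mine) looks like:

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage for each completion in one prompt's group:
    (r_i - mean) / (std + eps). Sketch of the usual GRPO normalization,
    not necessarily the exact code in this PR."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    sd = math.sqrt(var)
    return [(r - mu) / (sd + eps) for r in rewards]
```

Completions scoring above their group's mean get a positive advantage and are reinforced; below-mean ones are penalized, with no separate value model needed.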

🤖 PR Generated with Claude Code

Disclosure: AI tools were used in an assistive capacity for mechanical porting
of already-written and tested code from a development fork to a clean upstream base.
All code was authored and extensively tested by the contributor.

@github-actions github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs), examples, python (python script changes), and ggml (changes relating to the ggml tensor library for machine learning) labels on Mar 12, 2026
@srossitto79 srossitto79 marked this pull request as draft March 12, 2026 12:14
@srossitto79 srossitto79 marked this pull request as ready for review March 12, 2026 12:42
@srossitto79 srossitto79 requested a review from CISC as a code owner March 12, 2026 12:42
@JohannesGaessler
Contributor

According to the llama.cpp AI usage policy:

> It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

Sorry, but given that you are a new contributor and that the PR description is clearly machine generated, I do not, in a vacuum, have high confidence that the implementation in this PR is correct. Unfortunately, as of right now I also do not have the technical expertise to judge the correctness of the implementation; I would have to read up on this first. As such, I'm giving the review of this PR a low priority until I am less busy with other matters, in a month or so.

@srossitto79 (Author)

> According to the llama.cpp AI usage policy:
>
> > It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).
>
> Sorry, but given that you are a new contributor and that the PR description is clearly machine generated I do in a vacuum not have high confidence that the implementation in this PR is correct. Unfortunately I do as of right now also not have the technical expertise to judge the correctness of the implementation, I would have to read up on this first. As such I'm giving the review of this PR a low priority until I am less busy with other matters in a month or so.

No problem, take your time, I deliberately added this at the end of the PR to be clear:

🤖 PR Generated with [Claude Code](https://claude.com/claude-code)

Disclosure: AI tools were used in an assistive capacity for mechanical porting
of already-written and tested code from a development fork to a clean upstream base.
All code was authored and extensively tested by the contributor.

I made the changes on a fork that had seriously diverged from upstream, so to prepare the PR I cloned upstream again and used Claude Code with Opus to transfer my changes onto the fresh repo. I had to make some fixes manually; at first it forgot a few things here and there (a declaration, an assert), but in the end I aligned everything with my working dev version, and for the models I tested, this branch gave the same results as my dev repo.
With Qwen3-1.7B, but also with Nemotron-30B (yes, I can now finetune a Nemotron-3-30B on my machine with this, using CPU+GPU; bye bye Unsloth!), I observed the training loss go down progressively. It looks healthy, the model's responses don't break down, and it seems to retain the knowledge I trained it on. I used Claude to write the PR description because, as you can see, it writes better English than me :-)
So take your time; the functionality definitely needs to be tested by someone other than me, and the training results need to be validated by someone who isn't me as well.

The only thing I don't like much is the way I interfaced Python with the GRPO finetuner: I used stdin/stdout/stderr to deliver the prompts, the responses, and the computed rewards. It's a rather rudimentary way to do IPC, but it works, it was fast to implement, and I provided a usage example in Python. Anyway, I'm open to better solutions if you have other ideas.
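
For reference, the line-based stdin/stdout round trip being described can be sketched like this. The child process here is a stand-in reward scorer, not the real trainer binary, and the one-line-per-message framing is an assumption; the PR's actual protocol fields are not spelled out in this thread.

```python
import subprocess
import sys

# Spawn a stand-in "scorer" child: it reads one completion per line on
# stdin and writes one reward per line on stdout, mimicking the PR's
# line-based IPC between the Python driver and the C++ trainer.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys\n"
     "for line in sys.stdin:\n"
     "    # Stand-in reward: longer completions score higher.\n"
     "    print(len(line.strip()) / 100.0, flush=True)\n"],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)

def score(completion: str) -> float:
    # One request line out, one reward line back; flush so the child
    # sees the message immediately instead of waiting on a full buffer.
    child.stdin.write(completion + "\n")
    child.stdin.flush()
    return float(child.stdout.readline())

reward = score("hello world")
child.stdin.close()
child.wait()
```

The main fragility of this scheme is framing: any stray newline in a prompt or completion desynchronizes the two sides, which is likely why reviewers might prefer length-prefixed messages or a socket.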

I hope to see some comments about the training results that can help me improve the code or validate my observations.

@JohannesGaessler
Contributor

> I used Claude to write the PR description because as you can see it write better English than me :-)

Suboptimally worded PR descriptions are OK; autogenerated ones are not. My opinion is that carefully checked machine-generated code can be OK, but I want proof, in the form of a writeup, that the person clicking the submit button actually understands it. Otherwise I will end up wasting too much time reviewing broken code that will never get merged.
