CPPO: Contrastive Perception for Vision Language Policy Optimization

[Figure: CPPO overview]

This repository contains the description and implementation of CPPO, a reinforcement learning framework for finetuning vision–language models (VLMs).

Methodology

1. Entropy-Based Perception Token Detection

For each generated response, CPPO identifies perception tokens by measuring the increase in predictive entropy when the input image is replaced with an information-removing perturbation. Tokens with the largest entropy increase are selected as perception-dependent tokens. This process:

  • Requires no external supervision

  • Is fully model-driven

  • Preserves the natural reasoning structure of the VLM
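
The concrete perturbation and selection cutoff are implementation details of the released code and paper; the snippet below is only a minimal PyTorch-style sketch of the selection step, assuming per-token logits from two decoding passes (original image vs. an information-removing perturbation such as a blanked-out image) and a hypothetical top-fraction knob top_frac.

import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    # Predictive entropy of each generated token; logits: [seq_len, vocab_size].
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)  # [seq_len]

def detect_perception_tokens(logits_with_image: torch.Tensor,
                             logits_image_removed: torch.Tensor,
                             top_frac: float = 0.2) -> torch.Tensor:
    # Tokens whose entropy increases the most when the image is replaced by an
    # information-removing perturbation are marked as perception tokens.
    delta = token_entropy(logits_image_removed) - token_entropy(logits_with_image)
    k = max(1, int(top_frac * delta.numel()))
    mask = torch.zeros_like(delta, dtype=torch.bool)
    mask[delta.topk(k).indices] = True
    return mask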

2. Contrastive Perception Loss (CPL)

For each detected perception token, CPPO applies a token-level contrastive loss:

  • Anchor: token distribution conditioned on the original image
  • Positive: distribution conditioned on an information-preserving perturbation
  • Negative: distribution conditioned on an information-removing perturbation
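
The exact functional form of CPL is defined in the paper rather than here; as one plausible instantiation, the sketch below scores the anchor distribution against the positive and negative distributions with a negative-KL similarity and applies an InfoNCE-style cross-entropy over the pair. The temperature tau and the choice of KL as the similarity measure are assumptions made for illustration.

import torch
import torch.nn.functional as F

def kl_similarity(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    # Negative KL(p || q) between token distributions; higher means more similar.
    p_logp = F.log_softmax(p_logits, dim=-1)
    q_logp = F.log_softmax(q_logits, dim=-1)
    return -(p_logp.exp() * (p_logp - q_logp)).sum(dim=-1)  # [num_tokens]

def contrastive_perception_loss(anchor_logits: torch.Tensor,
                                positive_logits: torch.Tensor,
                                negative_logits: torch.Tensor,
                                tau: float = 0.5) -> torch.Tensor:
    # Anchor: original image; positive: information-preserving perturbation;
    # negative: information-removing perturbation. Inputs: [num_tokens, vocab_size].
    sim_pos = kl_similarity(anchor_logits, positive_logits) / tau
    sim_neg = kl_similarity(anchor_logits, negative_logits) / tau
    pair = torch.stack([sim_pos, sim_neg], dim=-1)               # [num_tokens, 2]
    target = torch.zeros(pair.shape[0], dtype=torch.long, device=pair.device)
    return F.cross_entropy(pair, target, reduction='none')       # per-token CPL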

3. Integration with Reinforcement Learning

CPPO augments the standard RL objective with the Contrastive Perception Loss:

  • CPL is applied only to perception tokens
  • CPL is gated by positive advantage, ensuring it reinforces successful trajectories

This design yields targeted perception improvement while maintaining RL stability.
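
The precise weighting and gating scheme lives in the training code; the sketch below only illustrates how a gated CPL term could be folded into the RL loss, using per-token CPL values (e.g. from the function above) and the advantage of the rollout each token belongs to. The weight cpl_weight is a hypothetical hyperparameter, and the RL term is shown schematically as a precomputed scalar policy loss.

import torch

def cppo_objective(rl_loss: torch.Tensor,          # scalar policy loss (e.g. a PPO/GRPO-style clipped loss)
                   cpl_per_token: torch.Tensor,    # [num_perception_tokens]
                   token_advantages: torch.Tensor, # [num_perception_tokens], advantage of each token's rollout
                   cpl_weight: float = 0.1) -> torch.Tensor:
    # Apply CPL only where the advantage is positive, so the perception signal
    # reinforces trajectories the RL objective already rewards.
    gate = (token_advantages > 0).float()
    gated_cpl = (gate * cpl_per_token).sum() / gate.sum().clamp(min=1.0)
    return rl_loss + cpl_weight * gated_cpl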

[Figure: CPPO methodology]

Main Results

CPPO is evaluated on a wide range of multimodal reasoning benchmarks and consistently improves over the baseline RL objective.

[Figure: CPPO results]

Training

We provide implementations of CPPO for both synchronous and asynchronous training regimes.
Each variant builds on a widely used large-scale RL framework while adding the CPPO implementation.

Training with CPPO in Synchronous Settings

The synchronous pipeline is built on top of verl, where rollout generation and training are synchronized.

Training with CPPO in Asynchronous Settings

For higher throughput and improved hardware utilization, CPPO is also integrated with AReaL, which decouples generation and training resources.

Quick Start

To train models such as Qwen2.5-VL-3B or Qwen2.5-VL-7B:

  1. Navigate to the verl or AReaL directory.
  2. Follow the provided environment and dataset setup instructions.
  3. Launch training using the CPPO examples.
# For synchronous training
cd verl
bash examples/cppo/run_qwen2_5_vl-3b_virl39k.sh

# For asynchronous training
cd AReaL
bash examples/cppo/run_qwen2_5_vl-3b_geometry3k.sh

Citation

If you find this work useful, please consider giving us a star and citing our work.

@article{rezaei2026cppo,
    title={CPPO: Contrastive Perception for Vision Language Policy Optimization},
    author={Rezaei, Ahmad and Gholami, Mohsen and Ranjbar Alvar, Saeed and Cannons, Kevin and Hossain, Mohammad Asiful and Zhou, Weimin and Zhou, Shunbo and Zhang, Yong and Akbari, Mohammad},
    journal={arXiv preprint arXiv:XXXX.XXXXX},
    year={2026}
}