What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom

Python 3.11+ | License: MIT | Paper | Homepage | Eval Logs | Zhihu Blog

(Figure: overview of the MED framework)

Overview

TL;DR: Vision tool-use RL enhances model performance by reducing tool-induced harm, but does not significantly improve tool-based correction of intrinsic failures.

This repository provides the MED (Measure-Explain-Diagnose) framework for analyzing vision tool-use reinforcement learning. We decompose performance improvements into intrinsic capability changes and tool-induced effects, providing fine-grained insights into what vision RL truly learns.

Key Findings

  • Performance gains are primarily driven by intrinsic learning - Models improve their base reasoning capabilities
  • Tool-use RL mainly reduces tool-induced harm - Reduces errors from tool invocation and weakens tool pattern interference
  • Limited improvement in tool-based correction - Tools don't significantly improve correction of intrinsic failures
  • Current vision RL learns to "safely coexist with tools" - Rather than fully mastering their strategic use

The MED Framework

The MED framework provides a coarse-to-fine analysis of vision tool-use reinforcement learning through three sequential steps:

  • Measure: Quantify tool-induced drift by decomposing tool-available drift into intrinsic and tool-induced components.
  • Explain: Decompose the tool-induced performance gap into Gross Gain and Gross Harm via a 4-term analysis.
  • Diagnose: Factorize each term into Mass, Policy, and Quality to probe the root causes of term evolution.
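
In symbols, a hedged summary of the decomposition chain (notation mirrors the result sections below; the exact definitions in Section 3 of the paper may differ in detail):

% Measure: tool-available drift splits into an intrinsic and a tool-induced part
f_w(t) = f_{wo}(t) + \Delta_{\text{tool}}(t)

% Explain: the tool-induced gap is a signed sum of four terms
G(t) = \text{Acc}_w(t) - \text{Acc}_{wo}(t) = \big(\text{Call Gain} + \text{Schema Gain}\big) - \big(\text{Call Harm} + \text{Schema Harm}\big)

% Diagnose: each term factorizes into Mass x Policy x Quality
\text{Term} \approx \underbrace{P(D)}_{\text{Mass}} \cdot \underbrace{P(\text{call}\mid D)}_{\text{Policy}} \cdot \underbrace{P(\text{outcome}\mid \text{call}, D)}_{\text{Quality}}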

This repository contains the core methodology from our paper (Section 3), including:

  • 4-term decomposition - Call Gain, Schema Gain, Call Harm, Schema Harm
  • Factor analysis - Decompose each term into Mass (domain size), Policy (when to call), Quality (how to use)
  • Visualization tools - Generate all figures (Measure, Explain, Diagnose) from the paper

Installation

# Install uv package manager (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/GAIR-NLP/Med.git
cd Med

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

# For training environment (includes torch, transformers, flash-attn, etc.)
uv pip install -e ".[train]"

Requirements: Python 3.11+, uv package manager

Training

Training Data

The training dataset (~15k samples from 12 data sources) is available on HuggingFace:

hf download Med2026/Med_training_data --repo-type dataset --local-dir data/Med_training_data/
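
As a quick sanity check after the download, you can inspect one parquet shard with pandas (a minimal sketch; the shard name below is just the first file referenced in the training section, and the column layout is whatever the dataset ships with):

# Peek at one shard of the downloaded training data (requires pandas + pyarrow).
import pandas as pd

shard = "data/Med_training_data/train-00000-of-00030.parquet"
df = pd.read_parquet(shard)        # load a single shard into a DataFrame
print(df.shape)                    # rows / columns in this shard
print(df.columns.tolist())         # inspect the schema before pointing DATA_TRAIN_FILE at it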

Ray Cluster Setup

For distributed training, set up a Ray cluster. Here is an example for a 2-node cluster with 8 GPUs per node.

Start the Head Node: Run this command on your designated head node. The dashboard will be accessible at http://<head_node_ip>:8265.

ray start --head --dashboard-host=0.0.0.0

Note down the address provided (e.g., xxxxxx:6379).

Start Worker Node(s): Run this command on each worker node, replacing xxxxxx:6379 with the address from the head node.

ray start --address=xxxxxx:6379

Verify Cluster Status: On the head node, run ray status to confirm that all nodes have joined and all GPUs (16 in this example) are detected.
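
If you prefer to verify from Python instead of the CLI, a minimal sketch (assuming the ray package is installed and the script is run on a node that has joined the cluster):

# Attach to the running cluster and report its aggregate resources.
import ray

ray.init(address="auto")              # connect to the cluster started via `ray start`
resources = ray.cluster_resources()   # e.g. {"CPU": ..., "GPU": 16.0, ...}
print(f"GPUs visible to Ray: {resources.get('GPU', 0)}")
ray.shutdown()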

Reward Server

The reward server is a remote FastAPI service used to calculate reward values during training.

To start the reward server:

bash recipe/med/scripts/reward_server.sh

  • PORT: Specifies the network port on which the reward server will listen for incoming requests.
  • WORKERS: Sets the number of worker processes for the server.

Upon successful launch, a file named with a unique JOB_ID will be created in the .reward_server/ directory. This file contains the IP address and port of the running server (e.g., your_server_ip:8192).

Note: Take note of this JOB_ID, as it is required for configuring REMOTE_REWARD_JOB_ID in the training script.
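
Because the job file simply records the server address, it can also be read back programmatically. A small sketch, assuming the file holds a single ip:port line as in the example above:

# Resolve the reward server address from .reward_server/<JOB_ID>.
from pathlib import Path

job_id = "j-xxxxxxxxxx"                                              # placeholder; use your actual JOB_ID
host_port = (Path(".reward_server") / job_id).read_text().strip()    # e.g. "your_server_ip:8192"
print(f"Reward server reachable at http://{host_port}")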

Launch Training

For a comprehensive list of all configurable parameters and hyperparameters, please refer to recipe/med/scripts/train.sh. Before running experiments, configure the following environment variables to match your setup.

Set Base Directory and Python Path: Point BASE_DIR to your cloned repository root so that all scripts can locate configs and modules correctly.

export BASE_DIR="/path/to/Med"
export PYTHONPATH=${BASE_DIR}/:${PYTHONPATH}

Set Node and GPU Counts: Adjust these values based on your actual cluster configuration (e.g., for 2 nodes with 8 GPUs each):

export NUM_NODES=2
export GPUS_PER_NODE=8

Configure Reward Server Job ID: Set REMOTE_REWARD_JOB_ID to the identifier(s) of your previously launched reward server(s). This enables the training pipeline to locate the reward server's address.

export REMOTE_REWARD_JOB_ID="j-xxxxxxxxxx"

Set Training Data:

export DATA_TRAIN_FILE="[/path/to/your/data/Med_training_data/train-00000-of-00030.parquet]"

Model Loading and Checkpointing: Configure paths for loading initial model weights and saving training states, along with the save frequency.

  • ACTOR_LOAD_PATH: Path to the initial model checkpoint to load.
  • TRAIN_SAVE_FREQ: Frequency to save the training state (e.g., 5 for every 5 steps, -1 to disable saving).
  • TRAIN_SAVE_PATH: Directory where training checkpoints will be stored.

export ACTOR_LOAD_PATH="/path/to/Qwen2.5-VL-7B-Instruct"
export TRAIN_SAVE_FREQ=10
export TRAIN_SAVE_PATH="/path/to/checkpoints"

Set Wandb API Key: Required for logging training metrics to Weights & Biases.

export WANDB_API_KEY="your-wandb-api-key"

Start Training: First serve the vision tool, then launch the training script. The entry point recipe/med/scripts/run.sh handles this sequence automatically:

bash recipe/med/scripts/run.sh

This script will:

  1. Verify Ray cluster status
  2. Start the vision tool server (recipe/med/scripts/serve_vision_tool.sh)
  3. Launch the training pipeline (recipe/med/scripts/train.sh)

Reproducing Paper Figures

Step 1: Download Evaluation Logs

Download the evaluation logs from HuggingFace:

# Using HuggingFace CLI
hf download Med2026/Med-eval-logs --repo-type dataset --local-dir evals/

# Or using Python API
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Med2026/Med-eval-logs", repo_type="dataset", local_dir="evals/")

This downloads evaluation results for 6 perception benchmarks across 21 training checkpoints:

  • VStar
  • HRBench (4k)
  • HRBench (8k)
  • VisualProb (easy)
  • VisualProb (medium)
  • VisualProb (hard)

Step 2: Generate CSV Data

Extract metrics from evaluation logs:

bash scripts/run_create_csv.sh

This creates CSV files alongside each evaluation log, containing performance metrics, the 4-term decomposition, and the factor analysis across all checkpoints.
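
To get a feel for the output before plotting, the generated CSVs can be loaded with pandas (a minimal sketch; the glob pattern is an assumption about where the script writes its files, and the columns are whatever scripts/run_create_csv.sh emits):

# List and preview the metrics CSVs produced in the previous step.
import glob
import pandas as pd

csv_files = sorted(glob.glob("evals/**/*.csv", recursive=True))
print(f"Found {len(csv_files)} CSV files")
df = pd.read_csv(csv_files[0])     # per-checkpoint metrics, 4-term decomposition, factors
print(df.head())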

Step 3: Generate Paper Figures

Generate all figures using the plotting script:

bash scripts/run_plot_paper_figures.sh

This generates two types of figures in the figures/ directory:

Aggregated figures (averaged across all 6 benchmarks):

  • {exp_name}_measure.pdf - MEASURE: Intrinsic vs tool-induced drift over training
  • {exp_name}_explain.pdf - EXPLAIN: 4-term decomposition (Call/Schema Gain/Harm)
  • {exp_name}_diagnose.pdf - DIAGNOSE: Factor analysis (Mass × Policy × Quality)

Per-benchmark figures (individual benchmark breakdowns):

  • {exp_name}_per_bench_exp{N}_measure.pdf - MEASURE for each benchmark
  • {exp_name}_per_bench_exp{N}_explain.pdf - EXPLAIN for each benchmark
  • {exp_name}_per_bench_exp{N}_diagnose.pdf - DIAGNOSE for each benchmark

Understanding the Results

The MED framework provides three levels of analysis, each visualized in separate figures:

MEASURE: Quantifying Drift Components

(Figure: MEASURE, decomposition of tool-available drift)

The MEASURE figure decomposes the tool-available drift f_w(t) into two components:

  • Grey area: Intrinsic drift f_wo(t) - performance change without tool access
  • Colored area: Tool-induced drift Δ_tool(t) - change in the tool-induced performance gap
    • Green: positive relative gain (f_w > f_wo)
    • Red: negative relative drift (f_wo > f_w)
    • Color intensity: tool call rate

Tool contribution ratio S_tool (top progress bar): the fraction of the total drift magnitude attributable to tool effects

Key finding: Tool-induced effects account for only ~20-30% of total improvement. Most gains come from intrinsic capability improvements.
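
A minimal numeric sketch of this decomposition (with made-up accuracies, taking the first checkpoint as the reference point; the paper's exact conventions may differ):

# Accuracy with and without tool access at three checkpoints (illustrative numbers).
acc_w  = [0.50, 0.54, 0.58]
acc_wo = [0.48, 0.51, 0.55]

f_w    = [a - acc_w[0] for a in acc_w]               # tool-available drift
f_wo   = [a - acc_wo[0] for a in acc_wo]             # intrinsic drift
d_tool = [fw - fwo for fw, fwo in zip(f_w, f_wo)]    # tool-induced drift = change in the gap

# Tool contribution ratio: share of total drift magnitude coming from tool effects.
t = -1
denom = abs(f_wo[t]) + abs(d_tool[t])
s_tool = abs(d_tool[t]) / denom if denom else 0.0
print(round(f_w[t], 3), round(f_wo[t], 3), round(d_tool[t], 3), round(s_tool, 3))   # 0.08 0.07 0.01 0.125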

EXPLAIN: 4-Term Decomposition

(Figure: EXPLAIN, 4-term decomposition of the tool-induced gap)

The EXPLAIN figure decomposes the tool-induced performance gap G(t) = Acc_w(t) - Acc_wo(t) into:

Gross Gain (green, positive contributions):

  • Call Gain (Term 1): Intrinsic failures corrected by tool execution
  • Schema Gain (Term 2): Intrinsic failures recovered under tool schema without invocation

Gross Harm (red, negative contributions):

  • Call Harm (Term 3): Intrinsic successes lost due to tool calls
  • Schema Harm (Term 4): Intrinsic successes lost under tool schema without invocation

Net gap G(t) (yellow diamonds): Call Gain + Schema Gain - Call Harm - Schema Harm

Key finding: Gross Gain stagnates (Call Gain plateaus) while Gross Harm decreases consistently, indicating RL primarily reduces tool-induced harm rather than maximizing tool-based correction.
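
A hedged sketch of the bookkeeping behind these four terms at the per-example level (this assumes each evaluated example records intrinsic correctness, tool-available correctness, and whether a tool call was made; terms are expressed as fractions of the full evaluation set, which makes them sum exactly to the gap, though the paper's normalization may differ):

# Per-example records: (intrinsic_correct, tool_available_correct, tool_called).
examples = [
    (False, True,  True),   # failure fixed via a tool call          -> Call Gain
    (False, True,  False),  # failure fixed under the schema alone   -> Schema Gain
    (True,  False, True),   # success broken by a tool call          -> Call Harm
    (True,  False, False),  # success broken under the schema alone  -> Schema Harm
    (True,  True,  False),  # unchanged either way
]
n = len(examples)

call_gain   = sum(not i and t and c     for i, t, c in examples) / n
schema_gain = sum(not i and t and not c for i, t, c in examples) / n
call_harm   = sum(i and not t and c     for i, t, c in examples) / n
schema_harm = sum(i and not t and not c for i, t, c in examples) / n

g = (call_gain + schema_gain) - (call_harm + schema_harm)
acc_w  = sum(t for _, t, _ in examples) / n
acc_wo = sum(i for i, _, _ in examples) / n
assert abs(g - (acc_w - acc_wo)) < 1e-9     # the four terms exactly recover G(t)
print(call_gain, schema_gain, call_harm, schema_harm, g)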

DIAGNOSE: Factor Analysis

(Figure: DIAGNOSE, factor analysis of each term)

The DIAGNOSE figure factorizes each of the four terms into:

  • Mass (grey): Domain size P(D) - capacity for gain/harm
  • Policy (blue): Calling probability P(call|D) - when to use the tool
  • Quality (orange): Success rate P(✓|call,D) - how well the tool is used

Thick line: term value (left axis). Thin lines: individual factors (right axis).

Key findings:

  • Limited failure correction: Call Gain quality P(✓|call, failures) shows little improvement on current and persistent failure sets
  • Reduced breakage: Call Harm quality P(✗|call, successes) decreases, indicating fewer errors on already-solved instances
  • Schema interference mitigation: Schema Harm decreases as the model becomes less sensitive to the tool prompt
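
As an illustration of this factorization for a single term, here is a sketch of Call Gain over the intrinsic-failure domain with hypothetical counts (the exact estimators in the analysis code may differ):

# Call Gain = P(D) * P(call | D) * P(correct | call, D), with D = intrinsic failures.
n_total           = 1000   # evaluated examples (hypothetical)
n_failures        = 400    # intrinsic failures: the domain D for gain terms
n_failures_called = 240    # failures on which the tool was invoked
n_failures_fixed  = 60     # called failures that become correct with the tool

mass    = n_failures / n_total                   # P(D): capacity for gain
policy  = n_failures_called / n_failures         # P(call | D): when the tool is used
quality = n_failures_fixed / n_failures_called   # P(correct | call, D): how well it is used

call_gain = mass * policy * quality              # == n_failures_fixed / n_total == 0.06
print(round(mass, 2), round(policy, 2), round(quality, 2), round(call_gain, 2))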

Bottom Line

Current vision tool-use RL learns to safely coexist with tools rather than master them:

  1. Tool effects contribute minimally (~20-30%) compared to intrinsic improvements
  2. RL primarily reduces harm (fewer tool-induced errors) rather than increasing gain (better failure correction)
  3. Models improve at not breaking existing capabilities, but show limited progress in using tools to fix hard cases

Citation

If you find this work helpful, please cite our paper:

@article{ma2026does,
  title={What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom},
  author={Ma, Yan and Zhang, Weiyu and Li, Tianle and Du, Linge and Shen, Xuyang and Liu, Pengfei},
  journal={arXiv preprint arXiv:2602.01334},
  year={2026}
}

Roadmap

We are progressively open-sourcing components of the MED project:

  • Evaluation logs - Available at HuggingFace
  • Analysis code - MED framework implementation (recipe/med/analysis_plot/)
  • Training data - Available at HuggingFace
  • Training code - GRPO-based RL training pipeline (recipe/med/)
  • Evaluation data - Benchmark datasets (6 perception tasks)
  • Evaluation code - Evaluation pipeline for tool-free and tool-available protocols

Stay tuned for updates!

License

This project is licensed under the MIT License - see the LICENSE file for details.
