What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom
TL;DR: Vision tool-use RL enhances model performance by reducing tool-induced harm, but does not significantly improve tool-based correction of intrinsic failures.
This repository provides the MED (Measure-Explain-Diagnose) framework for analyzing vision tool-use reinforcement learning. We decompose performance improvements into intrinsic capability changes and tool-induced effects, providing fine-grained insights into what vision RL truly learns.
- Performance gains are primarily driven by intrinsic learning - Models improve their base reasoning capabilities
- Tool-use RL mainly reduces tool-induced harm - Reduces errors from tool invocation and weakens tool pattern interference
- Limited improvement in tool-based correction - Tools don't significantly improve correction of intrinsic failures
- Current vision RL learns to "safely coexist with tools" - Rather than fully mastering their strategic use
The MED framework provides a coarse-to-fine analysis of vision tool-use reinforcement learning through three sequential steps: Measure, Explain, and Diagnose.
This repository contains the core methodology from our paper (Section 3), including:
- 4-term decomposition - Call Gain, Schema Gain, Call Harm, Schema Harm
- Factor analysis - Decompose each term into Mass (domain size), Policy (when to call), Quality (how to use)
- Visualization tools - Generate all figures (Measure, Explain, Diagnose) from the paper
```bash
# Install uv package manager (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone https://github.com/GAIR-NLP/Med.git
cd Med

# Create virtual environment and install dependencies
uv venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
uv pip install -e .

# For training environment (includes torch, transformers, flash-attn, etc.)
uv pip install -e ".[train]"
```

Requirements: Python 3.11+, uv package manager
The training dataset (~15k samples from 12 data sources) is available on HuggingFace:
```bash
hf download Med2026/Med_training_data --repo-type dataset --local-dir data/Med_training_data/
```
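If you prefer the Python API, the same download can be done with huggingface_hub (a minimal sketch mirroring the eval-logs example later in this README):

```python
from huggingface_hub import snapshot_download

# Download the ~15k-sample training set into data/Med_training_data/
snapshot_download(
    repo_id="Med2026/Med_training_data",
    repo_type="dataset",
    local_dir="data/Med_training_data/",
)
```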
For distributed training, set up a Ray cluster. Here is an example for a 2-node cluster, each with 8 GPUs.
Start the Head Node: Run this command on your designated head node. The dashboard will be accessible at http://<head_node_ip>:8265.
```bash
ray start --head --dashboard-host=0.0.0.0
```

Note down the address provided (e.g., xxxxxx:6379).
Start Worker Node(s): Run this command on each worker node, replacing xxxxxx:6379 with the address from the head node.
```bash
ray start --address=xxxxxx:6379
```

Verify Cluster Status: On the head node, run ray status to confirm that all nodes have joined and all GPUs (16 in this example) are detected.
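You can also double-check the cluster from Python (a small sanity check, assuming it is run on the head node after all workers have joined):

```python
import ray

# Attach to the already-running cluster started with `ray start`
ray.init(address="auto")

resources = ray.cluster_resources()
print("Nodes:", len(ray.nodes()))           # expect 2 in this example
print("GPUs: ", resources.get("GPU", 0))    # expect 16 in this example
```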
The reward server is a remote FastAPI service used to calculate reward values during training.
To start the reward server:
```bash
bash recipe/med/scripts/reward_server.sh
```

- PORT: Specifies the network port on which the reward server will listen for incoming requests.
- WORKERS: Sets the number of worker processes for the server.
Upon successful launch, a file named with a unique JOB_ID will be created in the .reward_server/ directory. This file contains the IP address and port of the running server (e.g., your_server_ip:8192).
Note: Take note of this JOB_ID, as it is required for configuring REMOTE_REWARD_JOB_ID in the training script.
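If you want to resolve the server address programmatically, the sketch below reads it back from the job file; the exact filename and file layout under .reward_server/ are assumptions, so adapt it to what you see locally:

```python
from pathlib import Path

# Hypothetical example: assumes the file is named after the JOB_ID and
# contains a single "ip:port" line, e.g. your_server_ip:8192
job_id = "j-xxxxxxxxxx"
address = (Path(".reward_server") / job_id).read_text().strip()
print(address)
```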
For a comprehensive list of all configurable parameters and hyperparameters, please refer to recipe/med/scripts/train.sh. Before running experiments, configure the following environment variables to match your setup.
Set Base Directory and Python Path: Point BASE_DIR to your cloned repository root so that all scripts can locate configs and modules correctly.
```bash
export BASE_DIR="/path/to/Med"
export PYTHONPATH=${BASE_DIR}/:${PYTHONPATH}
```

Set Node and GPU Counts: Adjust these values based on your actual cluster configuration (e.g., for 2 nodes with 8 GPUs each):
```bash
export NUM_NODES=2
export GPUS_PER_NODE=8
```

Configure Reward Server Job ID: Set REMOTE_REWARD_JOB_ID to the identifier(s) of your previously launched reward server(s). This enables the training pipeline to locate the reward server's address.
```bash
export REMOTE_REWARD_JOB_ID="j-xxxxxxxxxx"
```

Set Training Data:
```bash
export DATA_TRAIN_FILE="[/path/to/your/data/Med_training_data/train-00000-of-00030.parquet]"
```

Model Loading and Checkpointing: Configure paths for loading initial model weights and saving training states, along with the save frequency.
- ACTOR_LOAD_PATH: Path to the initial model checkpoint to load.
- TRAIN_SAVE_FREQ: Frequency to save the training state (e.g., 5 for every 5 steps, -1 to disable saving).
- TRAIN_SAVE_PATH: Directory where training checkpoints will be stored.
```bash
export ACTOR_LOAD_PATH="/path/to/Qwen2.5-VL-7B-Instruct"
export TRAIN_SAVE_FREQ=10
export TRAIN_SAVE_PATH="/path/to/checkpoints"
```

Set Wandb API Key: Required for logging training metrics to Weights & Biases.
```bash
export WANDB_API_KEY="your-wandb-api-key"
```

Start Training: First serve the vision tool, then launch the training script. The entry point recipe/med/scripts/run.sh handles this sequence automatically:
```bash
bash recipe/med/scripts/run.sh
```

This script will:
- Verify Ray cluster status
- Start the vision tool server (recipe/med/scripts/serve_vision_tool.sh)
- Launch the training pipeline (recipe/med/scripts/train.sh)
Download the evaluation logs from HuggingFace:
```bash
# Using HuggingFace CLI
hf download Med2026/Med-eval-logs --repo-type dataset --local-dir evals/
```

```python
# Or using Python API
from huggingface_hub import snapshot_download
snapshot_download(repo_id="Med2026/Med-eval-logs", repo_type="dataset", local_dir="evals/")
```

This downloads evaluation results for 6 perception benchmarks across 21 training checkpoints:
- VStar
- HRBench (4k)
- HRBench (8k)
- VisualProb (easy)
- VisualProb (medium)
- VisualProb (hard)
Extract metrics from evaluation logs:
```bash
bash scripts/run_create_csv.sh
```

This creates CSV files in each eval-log directory with performance metrics, the 4-term decomposition, and factor analysis across all checkpoints.
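To spot-check the extracted metrics, you can load one of the generated CSVs with pandas (the path below is hypothetical; use whatever run_create_csv.sh produced in your eval-log directories):

```python
import pandas as pd

# Hypothetical path; actual filenames depend on your eval-log layout
df = pd.read_csv("evals/vstar/metrics.csv")

print(df.columns.tolist())  # performance, 4-term decomposition, and factor columns
print(df.head())
```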
Generate all figures using the plotting script:
```bash
bash scripts/run_plot_paper_figures.sh
```

This generates two types of figures in the figures/ directory:
Aggregated figures (averaged across all 6 benchmarks):
- {exp_name}_measure.pdf - MEASURE: Intrinsic vs tool-induced drift over training
- {exp_name}_explain.pdf - EXPLAIN: 4-term decomposition (Call/Schema Gain/Harm)
- {exp_name}_diagnose.pdf - DIAGNOSE: Factor analysis (Mass × Policy × Quality)
Per-benchmark figures (individual benchmark breakdowns):
- {exp_name}_per_bench_exp{N}_measure.pdf - MEASURE for each benchmark
- {exp_name}_per_bench_exp{N}_explain.pdf - EXPLAIN for each benchmark
- {exp_name}_per_bench_exp{N}_diagnose.pdf - DIAGNOSE for each benchmark
The MED framework provides three levels of analysis, each visualized in separate figures:
The MEASURE figure decomposes the tool-available drift f_w(t) into two components:
- Grey area: Intrinsic drift f_wo(t) - performance change without tool access
- Colored area: Tool-induced drift Δ_tool(t) - change in the tool-induced performance gap
- Green: positive relative gain (f_w > f_wo)
- Red: negative relative drift (f_wo > f_w)
- Color intensity: tool call rate

Tool contribution ratio S_tool (top progress bar): fraction of total drift magnitude from tool effects
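As a concrete illustration of this decomposition, the sketch below computes the intrinsic and tool-induced drift from checkpoint-level accuracies; the exact normalization used for S_tool here is an assumption (see Section 3 of the paper for the precise definition):

```python
def measure_decomposition(acc_w_0, acc_wo_0, acc_w_t, acc_wo_t):
    """Split the tool-available drift f_w(t) into intrinsic and tool-induced parts.

    acc_w_*  : accuracy with tool access (initial checkpoint / checkpoint t)
    acc_wo_* : accuracy without tool access (initial checkpoint / checkpoint t)
    """
    f_w = acc_w_t - acc_w_0                                    # tool-available drift f_w(t)
    f_wo = acc_wo_t - acc_wo_0                                 # intrinsic drift f_wo(t)
    delta_tool = (acc_w_t - acc_wo_t) - (acc_w_0 - acc_wo_0)   # change in the tool-induced gap
    assert abs(f_w - (f_wo + delta_tool)) < 1e-9               # f_w(t) = f_wo(t) + Δ_tool(t)
    # Assumed form of the tool contribution ratio: share of total drift magnitude
    s_tool = abs(delta_tool) / (abs(f_wo) + abs(delta_tool) + 1e-12)
    return f_w, f_wo, delta_tool, s_tool
```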
Key finding: Tool-induced effects account for only ~20-30% of total improvement. Most gains come from intrinsic capability improvements.
The EXPLAIN figure decomposes the tool-induced performance gap G(t) = Acc_w(t) - Acc_wo(t) into:
Gross Gain (green, positive contributions):
- Call Gain (Term 1): Intrinsic failures corrected by tool execution
- Schema Gain (Term 2): Intrinsic failures recovered under tool schema without invocation
Gross Harm (red, negative contributions):
- Call Harm (Term 3): Intrinsic successes lost due to tool calls
- Schema Harm (Term 4): Intrinsic successes lost under tool schema without invocation
Net gap G(t) (yellow diamonds): Call Gain + Schema Gain - Call Harm - Schema Harm
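A minimal sketch of how the four terms can be tallied from per-instance outcomes; the Instance record and its field names are illustrative, not the repository's actual data schema:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    correct_wo: bool  # tool-free (intrinsic) answer is correct
    correct_w: bool   # tool-available answer is correct
    called: bool      # the tool was actually invoked under the tool schema

def explain_decomposition(instances):
    """Return the four EXPLAIN terms and the net gap G(t), as fractions of the eval set."""
    n = len(instances)
    call_gain   = sum(not x.correct_wo and x.called     and x.correct_w     for x in instances) / n
    schema_gain = sum(not x.correct_wo and not x.called and x.correct_w     for x in instances) / n
    call_harm   = sum(x.correct_wo     and x.called     and not x.correct_w for x in instances) / n
    schema_harm = sum(x.correct_wo     and not x.called and not x.correct_w for x in instances) / n
    gap = call_gain + schema_gain - call_harm - schema_harm  # equals Acc_w(t) - Acc_wo(t)
    return call_gain, schema_gain, call_harm, schema_harm, gap
```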
Key finding: Gross Gain stagnates (Call Gain plateaus) while Gross Harm decreases consistently, indicating RL primarily reduces tool-induced harm rather than maximizing tool-based correction.
The DIAGNOSE figure factorizes each of the four terms into:
- Mass (grey): Domain size P(D) - capacity for gain/harm
- Policy (blue): Calling probability P(call|D) - when to use the tool
- Quality (orange): Success rate P(✓|call,D) - how well the tool is used
Thick line: term value (left axis). Thin lines: individual factors (right axis).
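For example, using the illustrative Instance records from the EXPLAIN sketch above, Call Gain factors as Mass × Policy × Quality over the intrinsic-failure domain (a sketch, not the repository's implementation):

```python
def diagnose_call_gain(instances):
    """Factor Call Gain into Mass x Policy x Quality over the intrinsic-failure domain D."""
    failures = [x for x in instances if not x.correct_wo]      # domain D: intrinsic failures
    called = [x for x in failures if x.called]
    mass = len(failures) / len(instances)                                    # P(D)
    policy = len(called) / len(failures) if failures else 0.0                # P(call | D)
    quality = (sum(x.correct_w for x in called) / len(called)) if called else 0.0  # P(correct | call, D)
    call_gain = mass * policy * quality                        # the product recovers the Call Gain term
    return mass, policy, quality, call_gain
```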
Key findings:
- Limited failure correction: Call Gain quality P(✓|call, failures) shows little improvement on current and persistent failure sets
- Reduced breakage: Call Harm quality P(✗|call, successes) decreases, indicating fewer errors on already-solved instances
- Schema interference mitigation: Schema Harm decreases as the model becomes less sensitive to the tool prompt
Current vision tool-use RL learns to safely coexist with tools rather than master them:
- Tool effects contribute minimally (~20-30%) compared to intrinsic improvements
- RL primarily reduces harm (fewer tool-induced errors) rather than increasing gain (better failure correction)
- Models improve at not breaking existing capabilities, but show limited progress in using tools to fix hard cases
If you find this work helpful, please cite our paper:
```bibtex
@article{ma2026does,
  title={What Does Vision Tool-Use Reinforcement Learning Really Learn? Disentangling Tool-Induced and Intrinsic Effects for Crop-and-Zoom},
  author={Ma, Yan and Zhang, Weiyu and Li, Tianle and Du, Linge and Shen, Xuyang and Liu, Pengfei},
  journal={arXiv preprint arXiv:2602.01334},
  year={2026}
}
```

We are progressively open-sourcing components of the MED project:
- Evaluation logs - Available at HuggingFace
- Analysis code - MED framework implementation (recipe/med/analysis_plot/)
- Training data - Available at HuggingFace
- Training code - GRPO-based RL training pipeline (recipe/med/)
- Evaluation data - Benchmark datasets (6 perception tasks)
- Evaluation code - Evaluation pipeline for tool-free and tool-available protocols
Stay tuned for updates!
This project is licensed under the MIT License - see the LICENSE file for details.



