This repo contains the code for the following work:
- MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems
Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Austin Xu, Xiaoxiao He, Yingbo Zhou, Semih Yavuz, Hao Wang, Shafiq Joty
[📄 Paper]
Installing the client-server framework for MAS process evaluation:
```
pip install -e .
python -c "import mas_proceval"
```

Starting the Judge server:

```
python -m mas_proceval.servers.server_judge
```

Using the client-server framework as a plug-in for the existing MAS systems:
```python
from mas_proceval import BaseClient, llm_parallel_search_decorator

client = BaseClient()

# this is the function on which we want to perform the search:
@llm_parallel_search_decorator
def function():
    pass
```
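As a more concrete, purely hypothetical illustration, the same decorator can wrap an actual step of your MAS. The function name, its signature, and the assumption that the decorator fans out multiple candidate completions of the decorated step and keeps the one preferred by the running judge server are ours for illustration, not a documented API contract:

```python
from mas_proceval import BaseClient, llm_parallel_search_decorator

client = BaseClient()

# Hypothetical MAS step: one debater producing its next argument.
# We assume the decorator samples several candidate outputs of this step
# in parallel and keeps the one the judge server scores highest.
@llm_parallel_search_decorator
def debater_turn(question: str, history: list[str]) -> str:
    prompt = f"Question: {question}\nPrevious arguments:\n" + "\n".join(history)
    # ... call your underlying LLM with `prompt` here ...
    return prompt  # placeholder so the sketch is self-contained
```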
GPT-5-mini Judge:

```
python -m mas_proceval.servers.server_judge
```

Process Reward Model (PRM) (Qwen2.5-Math-PRM-7B):

```
python -m mas_proceval.servers.server_prm
```

Reward Model (RM) (Skywork-Reward-V2):

```
python -m mas_proceval.servers.server_rm
```

Judge servers should be started before launching any MAS experiments and will remain running to serve evaluation requests. You can run multiple judge servers simultaneously if needed.
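If you prefer launching the servers from a single script instead of separate terminals, a minimal sketch using only the Python standard library and the module paths listed above (no extra CLI flags are assumed) is:

```python
import subprocess
import sys

# Start the judge, PRM, and RM servers as background processes.
# Only the module paths documented above are used; add whatever
# flags your configuration requires.
SERVER_MODULES = [
    "mas_proceval.servers.server_judge",  # GPT-5-mini Judge
    "mas_proceval.servers.server_prm",    # Qwen2.5-Math-PRM-7B
    "mas_proceval.servers.server_rm",     # Skywork-Reward-V2
]

procs = [subprocess.Popen([sys.executable, "-m", mod]) for mod in SERVER_MODULES]
print(f"Started {len(procs)} evaluation servers; leave them running during MAS experiments.")
```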
Each MAS architecture has its own mechanism for enabling process verification. Below are detailed instructions for each system.
Agent-level verification evaluates each individual debater's response:
```
python run_benchmarks.py --benchmark {dataset} --process-eval agent
```

Round-level verification evaluates complete debate rounds:

```
python run_benchmarks.py --benchmark {dataset} --process-eval round
```
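To run every combination of benchmark and verification granularity with run_benchmarks.py in one go, a small driver sketch is shown below; the dataset names are placeholders for whatever benchmarks your setup provides:

```python
import itertools
import subprocess

# Hypothetical sweep over benchmarks and process-verification levels.
# Replace the dataset names with the benchmarks available in your setup.
datasets = ["aime24", "gaia"]
levels = ["agent", "round"]

for dataset, level in itertools.product(datasets, levels):
    subprocess.run(
        ["python", "run_benchmarks.py",
         "--benchmark", dataset,
         "--process-eval", level],
        check=True,  # stop early if a run fails
    )
```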
For round-level (iteration-level) verification, ensure your bash script uses the standard generation file:

```
python llmlp_gen_{dataset}.py  # Round/iteration level
```

Then run:

```
bash AIME/exp_aime_dylan.sh  # For AIME
bash GAIA/exp_gaia.sh        # For GAIA
```

For agent-level verification, modify your bash script to use the agent-call-level variant:

```
# In the bash script
python llmlp_gen_{dataset}_sub.py  # Agent call level
```

Then execute the same bash commands as above.
Phase 1 - Optimization:
```
python -m examples.maas.optimize --dataset {dataset} --round 1 --sample 4 --exec_model_name "gpt-5-mini"
```

Phase 2 - Testing with Process Verification:

For agent-level verification:

```
cp graph_sub.py graph.py
```

For iteration-level verification:

```
cp graph_iter.py graph.py
```

Then run the testing phase:

```
python -m examples.maas.optimize --dataset {dataset} --round 1 --sample 4 --exec_model_name "gpt-5-mini" --is_test True
```

The template files contain pre-configured decorators (see the contrast sketched after this list):
- `graph_sub.py`: Individual operators are decorated for fine-grained evaluation
- `graph_iter.py`: Complete workflow iterations are decorated for coarse-grained evaluation
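Schematically, the difference between the two templates is where the decorator sits. The class and method names below are illustrative only, not the actual contents of graph_sub.py or graph_iter.py:

```python
from mas_proceval import llm_parallel_search_decorator

class WorkflowSub:
    # graph_sub.py style: each individual operator is decorated,
    # so every fine-grained agent call is verified separately.
    @llm_parallel_search_decorator
    def generate(self, problem):
        ...

    @llm_parallel_search_decorator
    def refine(self, problem, draft):
        ...

class WorkflowIter:
    # graph_iter.py style: only the complete iteration is decorated,
    # so one full pass through the workflow is verified as a unit.
    @llm_parallel_search_decorator
    def run_iteration(self, problem):
        draft = self.generate(problem)
        return self.refine(problem, draft)

    def generate(self, problem):
        ...

    def refine(self, problem, draft):
        ...
```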
AFlow requires manual configuration ONLY in testing phases. Process verification is enabled through a combination of configuration flags and decorator placement.
Step 1 - Agent-level During Optimization:
Edit optimizer.py and set:
```python
use_mas = False  # For Validation Phase
use_mas = True   # For Testing Phase
```

Then run the optimization phase:

```
python run.py --dataset AIME24 --max_rounds 10 --validation_rounds 2 --opt_model_name gpt-5-mini --exec_model_name gpt-5-mini
```

Step 2 - Iteration-level During Testing:
After optimization completes and generates the workflow graph:
- Set `use_mas = False` in `optimizer.py`
- Manually add the `@llm_parallel_search_decorator` to appropriate methods in the generated `graph.py` (see the sketch after this list)
- Refer to `MaAS/.../graph_iter.py` for examples of proper decorator placement
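A minimal sketch of that manual edit, assuming the generated graph.py exposes a workflow class with an async entry point (the class and method names are illustrative; follow the referenced graph_iter.py for the exact placement):

```python
# graph.py (generated by AFlow) -- illustrative shape only
from mas_proceval import llm_parallel_search_decorator

class Workflow:
    ...

    # Manually added: decorate the method whose output the judge
    # should verify (here, the whole iteration).
    @llm_parallel_search_decorator
    async def __call__(self, problem: str):
        ...
```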
For agent-level verification, use the search_sub.py scripts:
AIME24 or AIME25:
```
python _aime/search_sub.py --dataset aime24 --expr_name aime24_results --n_generation 5
```

GAIA:

```
python _gaia/search_sub.py --expr_name gaia_results --n_generation 5
```

For iteration-level verification, use the search_iter.py scripts:

AIME24 or AIME25:

```
python _aime/search_iter.py --dataset aime24 --expr_name aime24_results --n_generation 5
```

GAIA:

```
python _gaia/search_iter.py --expr_name gaia_results --n_generation 5
```

For agent-level verification, modify the import to use the agent-call-level search function:

```python
from async_search_sub import search  # Agent call level
```

For iteration-level verification, modify the import to use the iteration-level search function:

```python
from async_search_iter import search  # Iteration level
```

After setting the appropriate import, run the planning phase:

AIME24/25 Planning:

```
python async_main_question.py --dataset workflow_search/aime24 --option plan --meta_model gpt-5-mini --node_model gpt-5-mini --blocks COT COT_SC Reflexion LLM_debate --n_generation 2 --save_dir test_iter_results
```

GAIA Planning:

```
python async_main_question.py --dataset workflow_search/gaia --option plan --meta_model gpt-5-mini --node_model gpt-5-nano --blocks COT COT_SC Reflexion LLM_debate WebSearch --n_generation 2 --save_dir test_results
```

Oracle Verification: For validating generated responses with an oracle judge:

```
python main_judge.py --dataset aime24 --judge_method oracle --baseline workflow_search --model gpt-5-mini --node_model gpt-5-mini --min_sample 0 --max_sample 29 --max_response_per_sample 5 --save_dir test_results
```

This repository extends and builds upon several foundational works in the field of Multi-Agent Systems (MAS). We are grateful to the authors of the following projects for open-sourcing their codebases:
Our work introduces MAS-ProVe, a systematic empirical study of process verification for MAS.
If you use this codebase or the integrated architectures in your research, please cite our work and the original papers:
@misc{venkataramani2026masproveunderstandingprocessverification,
title={MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems},
author={Vishal Venkataramani and Haizhou Shi and Zixuan Ke and Austin Xu and Xiaoxiao He and Yingbo Zhou and Semih Yavuz and Hao Wang and Shafiq Joty},
year={2026},
eprint={2602.03053},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.03053},
}