Automatically optimize your LLM prompts using gradient-based beam search
APO Paper · AgentLightning Docs · Google ADK Docs
A CLI tool that takes your LLM prompt and makes it better — automatically. It uses the APO (Automatic Prompt Optimization) algorithm to iteratively critique, edit, and improve prompts through beam search over "textual gradients." Built with Microsoft AgentLightning for the optimization loop and Google ADK with LiteLLM for flexible model access across 100+ LLM providers.
APO treats prompt optimization like gradient descent, but with natural language instead of numbers:
- Evaluate — Run your prompt on a batch of tasks and measure performance
- Critique — An LLM generates "textual gradients" — natural language feedback about what's wrong
- Edit — Another LLM applies the critique to produce improved prompt candidates
- Select — Beam search keeps the top-k performing prompts
- Repeat — Iterate until convergence
"APO is an iterative prompt optimization algorithm that uses LLM-generated textual gradients to improve prompts through a beam search process." — Pryzant et al., 2023
APO includes a real-time dashboard (powered by AgentLightning) that launches automatically at http://localhost:4747 during optimization.
| Tab | What It Shows |
|---|---|
| Rollouts | Every prompt-on-task execution with status, input, duration |
| Resources | The prompt templates being optimized (your beam of candidates) |
| Traces | Detailed LLM call logs per rollout |
| Runners | Parallel workers and their current state |
| Settings | AgentLightning configuration |
- Python 3.12+
- Linux or WSL (AgentLightning requires Unix — see Windows Setup)
- API Key for at least one LLM provider (the model is auto-detected from your key)
```bash
# Clone the repository
git clone https://github.com/pouriamrt/apo-adk-cli.git
cd apo-adk-cli

# Create virtual environment and install
uv venv
source .venv/bin/activate
uv pip install -e .

# Set up your API key
cp .env.example .env
# Edit .env with your actual API key
```

```bash
# 1. Evaluate your current prompt to get a baseline score
apo evaluate \
  --prompt "Answer: {input}" \
  --dataset examples/sample_dataset.json \
  --eval-mode reference \
  --verbose

# 2. Optimize it
apo optimize \
  --prompt "Answer: {input}" \
  --dataset examples/sample_dataset.json \
  --beam-width 3 \
  --beam-rounds 3 \
  --output optimized_prompt.txt

# 3. Evaluate the optimized prompt to see the improvement
apo evaluate \
  --prompt-file optimized_prompt.txt \
  --dataset examples/sample_dataset.json \
  --eval-mode reference \
  --verbose
```

```bash
apo optimize \
  --prompt "Your prompt template with {input} placeholder" \
  --dataset path/to/dataset.json \
  --model "gemini/gemini-2.5-flash" \
  --optimizer-model "gemini-2.5-flash" \
  --eval-mode auto \
  --beam-width 3 \
  --beam-rounds 5 \
  --n-runners 4 \
  --output optimized_prompt.txt \
  --verbose
```

| Option | Default | Description |
|---|---|---|
| `--prompt` | — | Prompt template string (must contain `{input}`) |
| `--prompt-file` | — | Path to a prompt template file (alternative to `--prompt`) |
| `--dataset` | required | Path to dataset file (JSON or CSV) |
| `--model` | auto-detected | LLM for running prompt rollouts (LiteLLM format) |
| `--optimizer-model` | derived from `--model` | LLM for APO gradient/edit steps |
| `--eval-mode` | `auto` | Scoring mode: `auto`, `reference`, or `llm-judge` |
| `--beam-width` | `3` | Number of top prompts kept per round |
| `--beam-rounds` | `5` | Number of optimization iterations |
| `--n-runners` | `4` | Parallel rollout workers |
| `--output` / `-o` | — | Save best prompt to file |
| `--verbose` / `-v` | `false` | Show per-rollout details |
```bash
apo evaluate \
  --prompt-file my_prompt.txt \
  --dataset data.json \
  --model "openai/gpt-4o" \
  --eval-mode reference \
  --verbose
```

JSON format:

```json
[
  {"input": "What is the capital of France?", "expected_output": "Paris"},
  {"input": "Translate 'hello' to Spanish", "expected_output": "hola"},
  {"input": "Summarize this article: ..."}
]
```

CSV format:

```csv
input,expected_output
What is the capital of France?,Paris
Translate 'hello' to Spanish,hola
```

| Field | Required | Description |
|---|---|---|
| `input` | Yes | The variable part injected into the `{input}` placeholder |
| `expected_output` | No | Ground truth for reference-based scoring |
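Loading and validating these two formats is straightforward; below is a minimal sketch (not the project's actual `loader.py`) that accepts JSON or CSV and enforces the required `input` field:

```python
import csv
import json
from pathlib import Path

def load_dataset(path):
    """Load a JSON or CSV dataset and validate the required `input` field."""
    path = Path(path)
    if path.suffix == ".json":
        rows = json.loads(path.read_text())      # list of {"input": ..., ...} dicts
    elif path.suffix == ".csv":
        with path.open(newline="") as f:
            rows = list(csv.DictReader(f))       # header row becomes dict keys
    else:
        raise ValueError(f"Unsupported dataset format: {path.suffix}")
    for i, row in enumerate(rows):
        if not row.get("input"):                 # `expected_output` stays optional
            raise ValueError(f"Row {i} is missing the required 'input' field")
    return rows
```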
Your prompt must contain the {input} placeholder. This is where each dataset task gets injected:
```
You are a helpful assistant. Answer the following question
accurately and concisely.

{input}
```
APO optimizes the entire prompt text while preserving the {input} placeholder.
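Filling the placeholder is plain Python string formatting; for example:

```python
template = (
    "You are a helpful assistant. Answer the following question\n"
    "accurately and concisely.\n\n"
    "{input}"
)
task = {"input": "What is the capital of France?"}

# Each dataset row's `input` value replaces the placeholder
prompt = template.format(input=task["input"])
print(prompt)
```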
| Mode | When to Use | How It Works |
|---|---|---|
| `reference` | Dataset has `expected_output` | Fuzzy string matching + containment scoring |
| `llm-judge` | No ground truth available | A separate LLM grades output quality 0.0–1.0 |
| `auto` (default) | Mixed datasets | Uses `reference` when `expected_output` exists, `llm-judge` otherwise |
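The `reference` mode's combination of containment and fuzzy matching can be approximated with the standard library. This is a sketch; the project's scorer may weight the two checks differently:

```python
from difflib import SequenceMatcher

def reference_score(output: str, expected: str) -> float:
    """Score in [0, 1]: exact containment first, then fuzzy similarity."""
    out, exp = output.strip().lower(), expected.strip().lower()
    if exp and exp in out:  # containment: the expected answer appears verbatim
        return 1.0
    return SequenceMatcher(None, out, exp).ratio()  # fuzzy string matching

print(reference_score("The capital is Paris.", "Paris"))  # → 1.0
```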
Any model supported by LiteLLM works. Set the corresponding API key:
| Provider | Model String | Environment Variable |
|---|---|---|
| OpenAI | `openai/gpt-5.2` | `OPENAI_API_KEY` |
| Google Gemini | `gemini/gemini-2.5-flash` | `GOOGLE_API_KEY` |
| Anthropic | `anthropic/claude-sonnet-4-6` | `ANTHROPIC_API_KEY` |
| Ollama (local) | `ollama/llama3` | — |
| Azure OpenAI | `azure/gpt-4` | `AZURE_API_KEY` |
When --model is omitted, APO automatically picks the best model based on which API key is set:
| API Key | Default Model |
|---|---|
| `OPENAI_API_KEY` | `openai/gpt-5.2` |
| `GOOGLE_API_KEY` | `gemini/gemini-2.5-flash` |
| `ANTHROPIC_API_KEY` | `anthropic/claude-sonnet-4-6` |
If multiple keys are set, priority is: OpenAI > Google > Anthropic. You can always override with --model.
See the full LiteLLM provider list for 100+ supported models.
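The auto-detection described above amounts to a priority-ordered environment check; a sketch (function and table are illustrative, not the tool's internals):

```python
import os

# Priority order from the docs: OpenAI > Google > Anthropic
_DEFAULTS = [
    ("OPENAI_API_KEY", "openai/gpt-5.2"),
    ("GOOGLE_API_KEY", "gemini/gemini-2.5-flash"),
    ("ANTHROPIC_API_KEY", "anthropic/claude-sonnet-4-6"),
]

def detect_model() -> str:
    """Return the default model for the first API key found in the environment."""
    for env_var, model in _DEFAULTS:
        if os.environ.get(env_var):
            return model
    raise RuntimeError("No LLM API key found; set one or pass --model")
```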
```
ai-prompt-optimizer/
├── pyproject.toml              # Dependencies & build config
├── .env.example                # API key template
├── src/
│   └── apo/
│       ├── __init__.py
│       ├── cli/
│       │   └── commands.py     # Click CLI (optimize, evaluate)
│       ├── core/
│       │   ├── config.py       # APO parameter defaults
│       │   ├── optimizer.py    # Orchestrator: ties APO + ADK together
│       │   └── rollout.py      # @agl.rollout: ADK ↔ AgentLightning bridge
│       ├── evaluation/
│       │   ├── scorer.py       # Scoring dispatcher (auto/reference/llm-judge)
│       │   ├── reference.py    # Fuzzy string matching scorer
│       │   └── llm_judge.py    # LLM-as-judge scorer
│       └── data/
│           └── loader.py       # Dataset loading (JSON/CSV) + validation
├── examples/
│   ├── sample_dataset.json     # Example dataset (8 Q&A pairs)
│   └── sample_prompt.txt       # Example prompt template
├── tests/
│   ├── test_loader.py          # Dataset loader tests
│   ├── test_scoring.py         # Scorer tests
│   └── test_integration.py     # End-to-end CLI tests
└── docs/
    ├── apo-architecture.png    # Architecture diagram
    ├── apo-sequence.png        # Sequence diagram
    └── plans/                  # Design & implementation docs
```
AgentLightning depends on gunicorn, which is Unix-only. On Windows, use WSL (Windows Subsystem for Linux):
```bash
# 1. Install WSL if you haven't
wsl --install -d Ubuntu

# 2. In WSL, install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
source ~/.local/bin/env

# 3. Navigate to the project
cd /mnt/c/path/to/AI-Prompt-optimizer

# 4. Create a Linux venv and install
uv venv .venv-wsl --python 3.12
source .venv-wsl/bin/activate
uv pip install -e .

# 5. Set your API key and run
export GOOGLE_API_KEY="your-key-here"
apo optimize --prompt "Answer: {input}" --dataset examples/sample_dataset.json
```

The APO algorithm (Pryzant et al., 2023) adapts gradient descent to natural language:
Instead of numerical gradients, APO asks an LLM to critique the current prompt:
"The prompt is too vague — it doesn't specify the desired output format. The model sometimes returns full sentences when a single word would suffice."
A second LLM call "applies" the gradient by editing the prompt in the corrective direction:
Before:

```
Answer: {input}
```

After:

```
Answer the following question with a single word or short phrase. Be precise and direct. {input}
```
APO maintains a beam of the top-k performing prompts. Each round:
- Sample parent prompts from the beam
- Generate `branch_factor` new candidates per parent via textual gradients
- Evaluate all candidates on a validation set
- Keep the top `beam_width` prompts for the next round
This combines exploration (generating diverse candidates) with exploitation (keeping the best performers).
All APO parameters with their defaults:
```python
APOConfig(
    model="gemini/gemini-2.5-flash",     # Rollout model (auto-detected if omitted)
    optimizer_model="gemini-2.5-flash",  # Gradient/edit model (derived from --model)
    beam_width=3,                        # Top prompts per round
    branch_factor=2,                     # Candidates per parent
    beam_rounds=5,                       # Optimization iterations
    gradient_batch_size=4,               # Samples for gradient computation
    val_batch_size=8,                    # Validation set size
    n_runners=4,                         # Parallel rollout workers
    eval_mode="auto",                    # auto | reference | llm-judge
    verbose=False,                       # Detailed output
)
```

| Technology | Role |
|---|---|
| AgentLightning | APO algorithm & training loop |
| Google ADK | Agent framework for prompt execution |
| LiteLLM | Unified interface to 100+ LLM providers |
| Click | CLI framework |
| Rich | Terminal output formatting |
- APO Paper: Automatic Prompt Optimization with "Gradient Descent" and Beam Search — Pryzant et al., 2023
- AgentLightning APO Docs: Algorithm Zoo — APO
- Google ADK + LiteLLM: Model Configuration
- Gemini OpenAI-Compatible API: OpenAI Compatibility
This project is for educational and research purposes.


