This repository contains all the materials for the 1-hour workshop *Hands-on AI agent evaluation: building benchmarks with Harbor*, which we ran during LLMday Warsaw on 12 Feb 2026.
Slides:
Make sure you have the following tools installed:
- Harbor - the main tool, which executes LLM tasks and tests the results.
  - We recommend installing Harbor system-wide with uv: `uv tool install harbor`
- Docker Desktop - all the tasks are executed locally within Docker containers.
- LLM API keys (if you don't have one, we are happy to provide one!)
- During this workshop, we use OpenRouter for LLM API calls, to easily access models from various providers.
Set your OpenRouter API key:

```bash
export OPENROUTER_API_KEY=...
```
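To confirm the key actually works (not just that it is set), you can send a minimal request straight to OpenRouter's OpenAI-compatible chat completions endpoint. The model slug below is inferred from the Harbor examples further down, so adjust it if needed; the one-word prompt keeps the cost negligible:

```bash
# Minimal paid request to verify the key; the model slug is an assumption
# based on the "openrouter/anthropic/claude-haiku-4.5" example below.
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "anthropic/claude-haiku-4.5", "messages": [{"role": "user", "content": "ping"}]}'
```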
Make sure you have both Harbor and Docker installed:

```bash
harbor --version  # 0.1.43 or above
docker --version  # 29.0.1 or above
```
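If your Harbor version is older than that and you installed it with uv as recommended above, you can upgrade it in place:

```bash
uv tool upgrade harbor
```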
Make sure variables are set and Docker is running:

```bash
env | grep "OPENROUTER_API_KEY"
docker ps
```
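If `docker ps` fails, the Docker daemon is probably not running; start Docker Desktop and retry. A quick end-to-end smoke test of the Docker setup (plain Docker, nothing Harbor-specific):

```bash
docker run --rm hello-world
```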
Running a basic task:

```bash
harbor run -p "tasks/example-task" --agent terminus-2 --model openrouter/anthropic/claude-haiku-4.5
```

Task run results are stored in the jobs/ directory. There is a nice UI available within Harbor:
```bash
harbor view jobs/
```
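You can also inspect the raw artifacts on disk. The layout sketched below is illustrative (`<job-id>` stands for an actual job directory name) and may differ between Harbor versions:

```bash
ls jobs/           # one subdirectory per job
ls jobs/<job-id>/  # logs and results for the individual runs of that job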
Using different agents, for example Claude Code:

```bash
ANTHROPIC_API_KEY=sk-ant-api03-... \
harbor run -p "tasks/example-task" --agent claude-code --model claude-sonnet-4-5-20250929
```

Note that different providers may use different names for the same model; here is a list of models by Anthropic.
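For example, the model above is named `claude-sonnet-4-5-20250929` in Anthropic's own API, while via OpenRouter the same model would be addressed with an `openrouter/anthropic/...` slug. The exact slug below is an assumption extrapolated from the haiku example earlier, so verify it against OpenRouter's model list:

```bash
# Same model via OpenRouter; the "claude-sonnet-4.5" slug is assumed, not verified.
harbor run -p "tasks/example-task" --agent terminus-2 \
  --model openrouter/anthropic/claude-sonnet-4.5
```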
Task with multiple runs – k=3 trials, n=2 concurrent runs:
```bash
harbor run -p "tasks/example-task" -a terminus-2 -m openrouter/anthropic/claude-haiku-4.5 -k 3 -n 2
```

Tasks can be organized in datasets:
```bash
harbor run \
  --dataset compilebench@1.0 \
  --task-name "c*" \
  --agent terminus-2 \
  --model openai/gpt-5.2
```

Harbor's CLI is generally helpful. For a summary of `harbor run` options, see:
```bash
harbor run --help
```