This repository contains all the materials for the 1-hour workshop *Hands-on AI agent evaluation: building benchmarks with Harbor*, which we ran during LLMday Warsaw on 12 Feb 2026.
Slides:
Make sure you have the following tools installed:
- Harbor - the main tool, which executes LLM tasks and tests the results.
  - We recommend installing Harbor system-wide with uv: `uv tool install harbor`
- Docker Desktop - all the tasks are executed locally within Docker containers.
- LLM API keys (if you don't have one, we are happy to provide one!)
- During this workshop, we use OpenRouter for LLM API calls, to easily access models from various providers.
Set your OpenRouter API key:

```bash
export OPENROUTER_API_KEY=...
```
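To confirm the key actually works (not just that it is set), you can send a minimal request straight to OpenRouter's OpenAI-compatible chat completions endpoint. The model slug below is inferred from the Harbor examples further down, so adjust it if needed; the one-word prompt keeps the cost negligible:

```bash
# Minimal paid request to verify the key; the model slug is an assumption
# based on the "openrouter/anthropic/claude-haiku-4.5" example below.
curl -s https://openrouter.ai/api/v1/chat/completions \
  -H "Authorization: Bearer $OPENROUTER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "anthropic/claude-haiku-4.5", "messages": [{"role": "user", "content": "ping"}]}'
```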
Make sure you have both Harbor and Docker installed:

```bash
harbor --version  # 0.1.43 or above
docker --version  # 29.0.1 or above
```
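If your Harbor version is older than that and you installed it with uv as recommended above, you can upgrade it in place:

```bash
uv tool upgrade harbor
```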
Make sure variables are set and Docker is running:

```bash
env | grep "OPENROUTER_API_KEY"
docker ps
```
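If `docker ps` fails, the Docker daemon is probably not running; start Docker Desktop and retry. A quick end-to-end smoke test of the Docker setup (plain Docker, nothing Harbor-specific):

```bash
docker run --rm hello-world
```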
Running a basic task:

```bash
harbor run -p "tasks/example-task" --agent terminus-2 --model openrouter/anthropic/claude-haiku-4.5
```

Task run results are stored in the jobs/ directory. There is a nice UI available within Harbor:
```bash
harbor view jobs/
```
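You can also inspect the raw artifacts on disk. The layout sketched below is illustrative (`<job-id>` stands for an actual job directory name) and may differ between Harbor versions:

```bash
ls jobs/           # one subdirectory per job
ls jobs/<job-id>/  # logs and results for the individual runs of that job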
Using different agents, for example Claude Code:

```bash
ANTHROPIC_API_KEY=sk-ant-api03-... \
harbor run -p "tasks/example-task" --agent claude-code --model claude-sonnet-4-5-20250929
```

Note that different providers may use different names for the same model; here is a list of models by Anthropic.
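For example, the model above is named `claude-sonnet-4-5-20250929` in Anthropic's own API, while via OpenRouter the same model would be addressed with an `openrouter/anthropic/...` slug. The exact slug below is an assumption extrapolated from the haiku example earlier, so verify it against OpenRouter's model list:

```bash
# Same model via OpenRouter; the "claude-sonnet-4.5" slug is assumed, not verified.
harbor run -p "tasks/example-task" --agent terminus-2 \
  --model openrouter/anthropic/claude-sonnet-4.5
```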
Task with multiple runs – k=3 trials, n=2 concurrent runs:
```bash
harbor run -p "tasks/example-task" -a terminus-2 -m openrouter/anthropic/claude-haiku-4.5 -k 3 -n 2
```

Tasks can be organized in datasets:
```bash
harbor run \
  --dataset compilebench@1.0 \
  --task-name "c*" \
  --agent terminus-2 \
  --model openai/gpt-5.2
```

Harbor's CLI is generally helpful. For a summary of `harbor run` options, see:
```bash
harbor run --help
```