QuesmaOrg/harbor-workshop

Hands-on AI agent evaluation: building benchmarks with Harbor

This repository contains all the materials for the one-hour workshop "Hands-on AI agent evaluation: building benchmarks with Harbor", which we ran at LLMday Warsaw on 12 February 2026.

Slides:

Prerequisites

Make sure you have the following tools installed:

  • Harbor - the main tool, which executes LLM tasks and tests the results.
    • We recommend installing Harbor system-wide with uv: uv tool install harbor
  • Docker Desktop - all the tasks are executed locally within Docker containers.
  • LLM API keys (if you don't have one, we are happy to provide one!)
    • During this workshop, we use OpenRouter for LLM API calls, which gives easy access to models from various providers.

Set your OpenRouter API key:

export OPENROUTER_API_KEY=...

Make sure you have both Harbor and Docker installed:

harbor --version  # 0.1.43 or above
docker --version  # 29.0.1 or above
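The "or above" checks in the comments are easy to get wrong by eye, because version numbers do not sort as plain strings (0.1.5 is older than 0.1.43). A minimal sketch of a numeric comparison follows. The function name version_ge is hypothetical, and it assumes purely numeric, dot-separated versions.

```shell
# version_ge HAVE WANT (hypothetical helper, not part of Harbor or Docker):
# returns 0 when HAVE >= WANT, comparing dot-separated numeric fields in order.
version_ge() {
    have="$1"
    want="$2"
    while [ -n "$have" ] || [ -n "$want" ]; do
        # Take the leading field of each version; a missing field counts as 0.
        h="${have%%.*}"
        w="${want%%.*}"
        h="${h:-0}"
        w="${w:-0}"
        if [ "$h" -gt "$w" ]; then return 0; fi
        if [ "$h" -lt "$w" ]; then return 1; fi
        # Drop the field just compared; empty the string once no dots remain.
        case "$have" in *.*) have="${have#*.}" ;; *) have="" ;; esac
        case "$want" in *.*) want="${want#*.}" ;; *) want="" ;; esac
    done
    return 0
}

version_ge "0.1.43" "0.1.43" && echo "0.1.43 is new enough"   # prints: 0.1.43 is new enough
version_ge "0.1.5" "0.1.43" || echo "0.1.5 is too old"        # prints: 0.1.5 is too old
```

To check a real installation, pass the version number printed by harbor --version or docker --version as the first argument. Extracting that number is left out here, since the exact output format of those commands is not shown in this README.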

Make sure the API key variable is set and Docker is running:

env | grep "OPENROUTER_API_KEY"
docker ps
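Note that env | grep prints the key itself, which may not be what you want on a shared or projected screen. The checks above can be sketched as a small preflight script that reports only whether each prerequisite is present. The function names require_cmd, require_env, and preflight are hypothetical helpers, not part of Harbor.

```shell
# require_cmd NAME: succeed if NAME resolves to a command on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "ok: $1 found"
    else
        echo "missing: $1 is not on PATH"
        return 1
    fi
}

# require_env NAME: succeed if variable NAME is set and non-empty.
# The value is never printed, so the key does not end up on screen.
require_env() {
    eval "value=\${$1:-}"
    if [ -n "$value" ]; then
        echo "ok: $1 is set"
    else
        echo "missing: $1 is empty or unset"
        return 1
    fi
}

# preflight: run every check and report all problems rather than
# stopping at the first one.
preflight() {
    status=0
    require_cmd harbor || status=1
    require_cmd docker || status=1
    require_env OPENROUTER_API_KEY || status=1
    # docker ps exits non-zero when the Docker daemon cannot be reached.
    if docker ps >/dev/null 2>&1; then
        echo "ok: Docker daemon is reachable"
    else
        echo "missing: Docker daemon is not reachable"
        status=1
    fi
    return "$status"
}
```

After sourcing the file, run preflight; it returns a non-zero status if anything is missing.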

Running example tasks

Running a basic task:

harbor run -p "tasks/example-task" --agent terminus-2 --model openrouter/anthropic/claude-haiku-4.5

Task run results are stored in the jobs/ directory. Harbor includes a UI for browsing them:

harbor view jobs/

Using different agents, for example Claude Code:

ANTHROPIC_API_KEY=sk-ant-api03-... \
harbor run -p "tasks/example-task" --agent claude-code --model claude-sonnet-4-5-20250929
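The command above puts the variable assignment on the same line as the command, which passes the key to that single harbor process without exporting it into the rest of the shell session. A minimal sketch of that shell behavior, using a hypothetical DEMO_API_KEY variable and a child sh standing in for harbor:

```shell
demo_scoped_env() {
    # Start from a clean state so the demonstration is repeatable.
    unset DEMO_API_KEY
    # The assignment is placed only in the environment of this one child process.
    DEMO_API_KEY="placeholder" sh -c 'echo "inside: ${DEMO_API_KEY:-unset}"'
    # The calling shell never received the variable.
    echo "after: ${DEMO_API_KEY:-unset}"
}

demo_scoped_env
# prints:
# inside: placeholder
# after: unset
```

Use export instead, as with OPENROUTER_API_KEY earlier, when several commands in the session need the same key.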

Note that different providers may use different names for the same model; see Anthropic's list of models for the names it uses.

Running a task multiple times – k=3 trials, n=2 concurrent runs:

harbor run -p "tasks/example-task" -a terminus-2 -m openrouter/anthropic/claude-haiku-4.5 -k 3 -n 2
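To compare several models on the same task with identical trial settings, the command above can be wrapped in a loop. The sketch below is a dry run by default: run_models and the HARBOR_CMD variable are hypothetical, and the function only echoes the commands until HARBOR_CMD is set to harbor. It uses only flags that appear elsewhere in this README, and both model names are copied from commands in it.

```shell
# run_models TASK_PATH TRIALS CONCURRENCY MODEL...
# Builds one "harbor run" invocation per model, run sequentially.
run_models() {
    task_path="$1"
    trials="$2"
    concurrency="$3"
    shift 3
    for model in "$@"; do
        # HARBOR_CMD is left unquoted on purpose so that the default
        # "echo harbor" splits into two words and just prints the command.
        ${HARBOR_CMD:-echo harbor} run -p "$task_path" -a terminus-2 \
            -m "$model" -k "$trials" -n "$concurrency"
    done
}

run_models "tasks/example-task" 3 2 \
    "openrouter/anthropic/claude-haiku-4.5" \
    "openai/gpt-5.2"
# prints:
# harbor run -p tasks/example-task -a terminus-2 -m openrouter/anthropic/claude-haiku-4.5 -k 3 -n 2
# harbor run -p tasks/example-task -a terminus-2 -m openai/gpt-5.2 -k 3 -n 2
```

Once the printed commands look right, set HARBOR_CMD=harbor in the shell and call run_models again to execute them. Each model presumably needs the matching provider key to be configured.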

Tasks can be organized in datasets:

harbor run \
  --dataset compilebench@1.0 \
  --task-name "c*" \
  --agent terminus-2 \
  --model openai/gpt-5.2
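The quotes around "c*" in the command above matter: without them, the shell may expand the pattern against files in the current directory before harbor ever sees it. The sketch below demonstrates only that shell behavior, using a temporary directory with made-up file names; it does not call Harbor, and how Harbor itself matches the pattern against task names is not covered here.

```shell
show_glob_quoting() {
    demo_dir="$(mktemp -d)" || return 1
    # Hypothetical file names, chosen so that two of them match c*.
    touch "$demo_dir/cjson" "$demo_dir/curl" "$demo_dir/jq"
    (
        cd "$demo_dir" || exit 1
        # Unquoted: the shell replaces the pattern with matching file names.
        echo "unquoted:" c*
        # Quoted: the pattern reaches the command unchanged.
        echo "quoted:" "c*"
    )
    rm -r "$demo_dir"
}

show_glob_quoting
# prints:
# unquoted: cjson curl
# quoted: c*
```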

Harbor has built-in help. For a summary of the harbor run options, run:

harbor run --help

Notes

About

Harbor workshop @LLMDay Warsaw, Feb 12th 2026
