Harbor

Harbor is a framework from the creators of Terminal-Bench for evaluating and optimizing agents and language models. It uses LiteLLM to call 100+ LLM providers.

# Install
pip install harbor

# Run a benchmark with any LiteLLM-supported model
harbor run --dataset terminal-bench@2.0 \
  --agent claude-code \
  --model anthropic/claude-opus-4-1 \
  --n-concurrent 4
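
Because the model is chosen with a LiteLLM-style `provider/model` string, comparing models should only require changing the `--model` value. The sketch below is a dry run: it prints one `harbor run` command per model rather than executing anything, and it uses only the flags shown in the command above. The `build_cmd` helper is hypothetical, and the second model name is an illustrative placeholder that may not match an identifier your provider accepts.

```shell
#!/bin/sh
# Dry-run sketch: print one `harbor run` command per model instead of running it.
# Only flags that appear in the example above are used.

# build_cmd is a hypothetical helper, not part of Harbor.
# $1: model string in LiteLLM's provider/model form
build_cmd() {
  printf 'harbor run --dataset terminal-bench@2.0 --agent claude-code --model %s --n-concurrent 4\n' "$1"
}

# The first model comes from the example above; the second is an
# illustrative placeholder to replace with a model you have access to.
for model in "anthropic/claude-opus-4-1" "anthropic/claude-sonnet-4-5"; do
  build_cmd "$model"
done
```

Once the printed commands look right, they can be run one at a time, or the script's output can be piped to `sh`.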

Key features:

  • Evaluate agents like Claude Code, OpenHands, Codex CLI

  • Build and share benchmarks and environments

  • Run experiments in parallel across cloud providers (Daytona, Modal)

  • Generate rollouts for RL optimization

  • GitHub

  • Documentation