The open-source benchmark for evaluating AI agents. Compare agents across tool-use, reasoning, code generation, research, and multi-step tasks.
| # | Agent | Framework | Model | Overall | Tool Use | Reasoning | Code | Research | Multi-Step | Avg Time |
|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | GPT-4o Agent | OpenAI Assistants | gpt-4o | 87 | 92 | 88 | 90 | 85 | 80 | 3.2s |
| 🥈 | Claude Sonnet Agent | Custom | claude-sonnet-4-6 | 85 | 88 | 90 | 86 | 82 | 79 | 2.8s |
| 🥉 | LangChain ReAct | LangChain | gpt-4o-mini | 72 | 80 | 70 | 75 | 68 | 67 | 4.5s |
| #4 | CrewAI Research Team | CrewAI | gpt-4o | 69 | 65 | 62 | 60 | 88 | 70 | 8.2s |
| #5 | AutoGPT Classic | AutoGPT | gpt-4o-mini | 58 | 70 | 50 | 55 | 60 | 55 | 12.0s |
| #6 | Ollama Local Agent | Custom | qwen2.5:32b | 52 | 60 | 55 | 48 | 50 | 47 | 6.5s |
Scores are based on 25 standardized tasks across 5 categories; the Overall score is the mean of the five category scores. Higher is better. Submit your own benchmarks via the CLI.
An agent is a single async function: `(task: string) => Promise<string>`. Wrapper adapters are provided for LangChain, CrewAI, and OpenAI.
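The contract above can be sketched as a minimal TypeScript agent. The rule-based handler below is purely illustrative; a real submission would delegate to an LLM or framework:

```typescript
// The required agent shape: one async function from a task string to an answer string.
type Agent = (task: string) => Promise<string>;

// A toy rule-based agent used only to illustrate the contract.
// (Illustrative: a real agent would call a model or framework here.)
const myAgent: Agent = async (task: string): Promise<string> => {
  // Handle a simple arithmetic task directly.
  const m = task.match(/add (\d+) and (\d+)/i);
  if (m) return String(Number(m[1]) + Number(m[2]));
  return `Unhandled task: ${task}`;
};

export default myAgent;
```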
`npx agentbench run -a ./my-agent.ts` executes all tasks with parallel execution, timeout enforcement, and automatic scoring.
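The per-task timeout can be understood as a race between the agent and a timer. This sketch shows the general technique with `Promise.race`; it is an assumption about the approach, not agentbench's actual internals:

```typescript
// Sketch of per-task timeout enforcement via Promise.race.
// (Assumption: illustrates the technique only; the real harness may differ.)
type Agent = (task: string) => Promise<string>;

async function runWithTimeout(agent: Agent, task: string, ms: number): Promise<string> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`task timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first wins: the agent's answer or the timeout rejection.
    return await Promise.race([agent(task), timeout]);
  } finally {
    clearTimeout(timer); // Avoid a dangling timer keeping the process alive.
  }
}
```

Running many such races concurrently (e.g. with `Promise.allSettled`) gives parallel execution while keeping each task bounded.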
Get a detailed HTML report with charts. Compare multiple agents side-by-side. Submit your results to appear on this leaderboard.
```shell
npm install -g @agentbench/cli
agentbench init my-agent
agentbench run -a ./my-agent.ts -n "My Agent" --framework openai --model gpt-4o
# Open the generated HTML report

# Compare two agents:
agentbench compare results-a.json results-b.json
```

- **Tool Use:** Calculator, JSON parsing, pattern extraction, unit conversion
- **Reasoning:** Logic puzzles, sequences, syllogisms, analogies, counterfactuals
- **Code Generation:** FizzBuzz, palindromes, API fetch, SQL queries, algorithms
- **Research:** Summarization, fact extraction, comparison, definitions
- **Multi-Step:** Data pipelines, planning, text analysis, code review
Install the CLI, run the benchmark, and see how your agent stacks up.