The open-source benchmark for evaluating AI agents. Compare agents across tool-use, reasoning, code generation, research, and multi-step tasks.
| # | Agent | Framework | Model | Overall | Tool Use | Reasoning | Code | Research | Multi-Step | Avg Time |
|---|---|---|---|---|---|---|---|---|---|---|
| 🥇 | GPT-4o Agent | OpenAI Assistants | gpt-4o | 87 | 92 | 88 | 90 | 85 | 80 | 3.2s |
| 🥈 | Claude Sonnet Agent | Custom | claude-sonnet-4-6 | 85 | 88 | 90 | 86 | 82 | 79 | 2.8s |
| 🥉 | LangChain ReAct | LangChain | gpt-4o-mini | 72 | 80 | 70 | 75 | 68 | 67 | 4.5s |
| #4 | CrewAI Research Team | CrewAI | gpt-4o | 69 | 65 | 62 | 60 | 88 | 70 | 8.2s |
| #5 | AutoGPT Classic | AutoGPT | gpt-4o-mini | 58 | 70 | 50 | 55 | 60 | 55 | 12.0s |
| #6 | Ollama Local Agent | Custom | qwen2.5:32b | 52 | 60 | 55 | 48 | 50 | 47 | 6.5s |
Scores are based on 25 standardized tasks across 5 categories; the Overall score is the mean of the five category scores. Higher is better. Submit your own benchmarks via the CLI.
An agent is a single async function: `(task: string) => Promise<string>`. Wrapper adapters are provided for LangChain, CrewAI, and OpenAI.
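The contract above can be sketched as a minimal TypeScript agent. The rule-based handler below is purely illustrative; a real submission would delegate to an LLM or framework:

```typescript
// The required agent shape: one async function from a task string to an answer string.
type Agent = (task: string) => Promise<string>;

// A toy rule-based agent used only to illustrate the contract.
// (Illustrative: a real agent would call a model or framework here.)
const myAgent: Agent = async (task: string): Promise<string> => {
  // Handle a simple arithmetic task directly.
  const m = task.match(/add (\d+) and (\d+)/i);
  if (m) return String(Number(m[1]) + Number(m[2]));
  return `Unhandled task: ${task}`;
};

export default myAgent;
```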
`npx agentbench run -a ./my-agent.ts` executes all tasks with parallel execution, timeout enforcement, and automatic scoring.
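The per-task timeout can be understood as a race between the agent and a timer. This sketch shows the general technique with `Promise.race`; it is an assumption about the approach, not agentbench's actual internals:

```typescript
// Sketch of per-task timeout enforcement via Promise.race.
// (Assumption: illustrates the technique only; the real harness may differ.)
type Agent = (task: string) => Promise<string>;

async function runWithTimeout(agent: Agent, task: string, ms: number): Promise<string> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`task timed out after ${ms}ms`)), ms);
  });
  try {
    // Whichever settles first wins: the agent's answer or the timeout rejection.
    return await Promise.race([agent(task), timeout]);
  } finally {
    clearTimeout(timer); // Avoid a dangling timer keeping the process alive.
  }
}
```

Running many such races concurrently (e.g. with `Promise.allSettled`) gives parallel execution while keeping each task bounded.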
Get a detailed HTML report with charts. Compare multiple agents side-by-side. Submit your results to appear on this leaderboard.
```shell
npm install -g @agentbench/cli
agentbench init my-agent
agentbench run -a ./my-agent.ts -n "My Agent" --framework openai --model gpt-4o
# Open the generated HTML report

# Compare two agents:
agentbench compare results-a.json results-b.json
```

- **Tool Use:** Calculator, JSON parsing, pattern extraction, unit conversion
- **Reasoning:** Logic puzzles, sequences, syllogisms, analogies, counterfactuals
- **Code Generation:** FizzBuzz, palindromes, API fetch, SQL queries, algorithms
- **Research:** Summarization, fact extraction, comparison, definitions
- **Multi-Step:** Data pipelines, planning, text analysis, code review
Install the CLI, run the benchmark, and see how your agent stacks up.