
AI Coding Leaderboards

Overview

This page lists benchmark leaderboards for LLMs used to help with coding.

It covers both Coding Assistants and Autonomous Coding Agents.

Leaderboards

| Leaderboard | Category | Tasks | Metrics |
| --- | --- | --- | --- |
| Aider LLM Leaderboards | Coding Assistant | 225 Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust | Two-pass pass rates; cost per run; edit correctness |
| EvalPlus Leaderboard | Coding Assistant | HumanEval+ (164 hand-verified Python problems); MBPP+ (399 sanitized Python problems) | pass@1 (greedy decoding; see the pass@k sketch below); extended efficiency via EvalPerf |
| TabbyML Coding LLMs Leaderboard | Coding Assistant | Amazon CCEval next-line tasks in Python, JS, Go, … | Next-line accuracy (exact match of the very next line) |
| MHPP Leaderboard | Coding Assistant | 210 “Mostly Hard” multi-step Python problems | pass@1 (greedy decoding); sampling (T=0.7, 100 runs) |
| Copilot Arena | Coding Assistant | Paired autocomplete and inline-editing comparisons | Elo-style rankings from user votes |
| WebDev Arena Leaderboard | Coding Assistant | Real-time web development challenges between models | Win rate; task completion; user voting |
| SWE-bench | Autonomous Agent | 2,294 real-world “fail-to-pass” GitHub issues from 12 Python repositories | % of issues resolved |
| HAL (Holistic Agent Leaderboard) | Autonomous Agent | 13 benchmarks (e.g., SWE-bench Verified, USACO, Cybench, TAU-bench) across many domains | Cost-controlled evaluations; success rates; Pareto fronts |
| TBench | Autonomous Agent | Terminal-based complex tasks in realistic environments | Task success rate; command accuracy; time to completion |
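Several leaderboards above (EvalPlus, MHPP) report pass@k, usually pass@1. The standard unbiased estimator from the HumanEval paper computes pass@k from n generated samples per problem, of which c pass all tests. The sketch below is illustrative only; the function name and example numbers are not taken from any leaderboard's codebase.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated for a problem
    c: number of those samples that pass all tests
    k: sample budget being evaluated
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 - probability that a random size-k subset contains no passing sample.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Hypothetical example: 100 samples, 23 correct.
print(pass_at_k(100, 23, 1))   # 0.23 (equals the raw pass rate)
print(pass_at_k(100, 23, 10))  # ~0.94 (any of 10 samples may succeed)
```

Copilot Arena and WebDev Arena instead rank models from pairwise user votes. Production arenas typically fit a Bradley-Terry model over all recorded votes; the online Elo update below is only a simplified illustration of how pairwise preferences turn into a ranking.

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a pairwise comparison.

    score_a: 1.0 if model A's output was preferred, 0.0 if B's, 0.5 for a tie.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Hypothetical vote: the lower-rated model wins and gains rating.
print(elo_update(1000.0, 1100.0, score_a=1.0))
```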

Takeaways


Leaderboards provide a quantitative, objective way to compare models and tools.

Comparing results across multiple metrics and leaderboards helps avoid choosing a solution that overfits to a single benchmark.