
AI Coding Leaderboards

Overview

This page lists benchmarks and leaderboards for LLMs that are used to help with coding.

It covers both Coding Assistants and Autonomous Coding Agents.

Leaderboards

| Leaderboard | Category | Tasks | Metrics |
| --- | --- | --- | --- |
| Aider LLM Leaderboards | Coding Assistant | 225 Exercism exercises across C++, Go, Java, JavaScript, Python, Rust | Two-pass pass rate; cost per run; edit-format correctness |
| EvalPlus Leaderboard | Coding Assistant | HumanEval+ (164 hand-verified Python problems); MBPP+ (399 sanitized Python problems) | pass@1 (greedy decoding); extended efficiency evaluation via EvalPerf |
| TabbyML Coding LLMs Leaderboard | Coding Assistant | Amazon CCEval next-line completion tasks in Python, JavaScript, Go, and others | Next-line accuracy (exact match of the very next line) |
| MHPP Leaderboard | Coding Assistant | 210 “Mostly Hard” multi-step Python problems | pass@1 (greedy decoding); sampling (T=0.7, 100 runs) |
| Copilot Arena | Coding Assistant | Paired autocomplete and inline-editing comparisons | Elo-style rankings from user votes |
| WebDev Arena Leaderboard | Coding Assistant | Real-time web-development challenges between models | Win rate; task completion; user voting |
| SWE-bench | Autonomous Agent | 2,294 real-world “Fail-to-Pass” GitHub issues from 12 Python repositories | % of issues resolved |
| HAL (Holistic Agent Leaderboard) | Autonomous Agent | 13 benchmarks (e.g., SWE-bench Verified, USACO, Cybench, TAU-bench) across many domains | Cost-controlled evaluations; success rates; Pareto frontiers |
| TBench | Autonomous Agent | Terminal-based complex tasks in realistic environments | Task success rate; command accuracy; time to completion |
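Several of the Coding Assistant leaderboards above (e.g. EvalPlus and MHPP) report pass@1 or pass@k. For reference, here is a minimal sketch of the standard unbiased pass@k estimator introduced with HumanEval; the sample counts in the usage example are hypothetical, not taken from any leaderboard.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = samples that pass all unit tests, k = attempt budget.
    Returns the estimated probability that at least one of k
    randomly drawn samples solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 100 samples per problem (as in MHPP's
# T=0.7, 100-run setting), 37 of which pass the tests.
print(round(pass_at_k(100, 37, 1), 3))   # 0.37
print(round(pass_at_k(100, 37, 10), 3))  # ~0.99
```

A leaderboard typically averages this estimate over all problems in the benchmark; pass@1 with greedy decoding is simply the fraction of problems solved on the single deterministic attempt.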

Measuring AI Ability to Complete Long Tasks

Tip

The length of tasks (measured by how long they take human professionals) that generalist autonomous frontier model agents can complete with 50% reliability has been doubling approximately every 7 months for the last 6 years (see the illustrative calculation below).

Source: Measuring AI Ability to Complete Long Tasks (METR)
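To make the doubling trend concrete, the small calculation below projects task length forward under that trend; the 1-hour baseline is a hypothetical starting point, not a figure from the METR paper.

```python
# Hypothetical baseline: assume agents can currently complete tasks that
# take a human professional about 1 hour, and that the 50%-reliability
# task length keeps doubling every 7 months (the trend cited above).
BASELINE_HOURS = 1.0
DOUBLING_MONTHS = 7

def projected_task_hours(months_from_now: float) -> float:
    """Task length (in human-hours) the trend predicts after the given
    number of months: baseline * 2^(months / doubling_time)."""
    return BASELINE_HOURS * 2 ** (months_from_now / DOUBLING_MONTHS)

for months in (0, 7, 14, 28):
    print(f"{months:>2} months: ~{projected_task_hours(months):.1f} human-hours")
# 0 months: ~1.0, 7 months: ~2.0, 14 months: ~4.0, 28 months: ~16.0
```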

Other Evaluations

In addition to leaderboards, it is useful to read about other evaluations and experiments, for example:

Takeaways

Key Takeaways

Leaderboards are a good way to quantitatively and objectively compare solutions.

Comparing across multiple metrics and leaderboards helps avoid selecting a solution that overfits to a single benchmark.
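As a rough illustration of the second point, the sketch below averages each model's rank across several leaderboards; the model names and scores are entirely hypothetical placeholders, and a real comparison should use each board's own published metric.

```python
# Hypothetical scores (higher is better) for three models on three boards.
scores = {
    "aider_polyglot":  {"model_a": 61.0, "model_b": 55.0, "model_c": 48.0},
    "evalplus_pass1":  {"model_a": 0.78, "model_b": 0.83, "model_c": 0.74},
    "swebench_solved": {"model_a": 0.33, "model_b": 0.41, "model_c": 0.29},
}

def mean_rank(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Average each model's rank (1 = best) across leaderboards, so a model
    that dominates only one benchmark cannot win the overall comparison."""
    totals: dict[str, list[int]] = {}
    for board in scores.values():
        ranked = sorted(board, key=board.get, reverse=True)
        for rank, model in enumerate(ranked, start=1):
            totals.setdefault(model, []).append(rank)
    return {m: sum(r) / len(r) for m, r in totals.items()}

print(mean_rank(scores))
# {'model_a': 1.67, 'model_b': 1.33, 'model_c': 3.0} -> model_b best overall
```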