# AI Coding Leaderboards
Overview
This page lists leaderboards and benchmarks for LLMs used to help with coding.
It covers both Coding Assistants and Autonomous Coding Agents.
## Leaderboards
| Leaderboard | Category | Tasks | Metrics |
|---|---|---|---|
| Aider LLM Leaderboards | Coding Assistant | 225 Exercism exercises across C++, Go, Java, JavaScript, Python, and Rust | Pass rate after two attempts; cost per run; percent using the correct edit format |
| EvalPlus Leaderboard | Coding Assistant | HumanEval+ (164 hand-verified Python problems); MBPP+ (399 sanitized Python problems) | pass@1 (greedy decoding); extended efficiency evaluation via EvalPerf |
| TabbyML Coding LLMs Leaderboard | Coding Assistant | Amazon CCEval next-line tasks in Python, JS, Go… | Next-line accuracy (exact match of the very next line) |
| MHPP Leaderboard | Coding Assistant | 210 “Mostly Hard” multi-step Python problems | pass@1 (greedy); sampled pass rates (T=0.7, 100 runs) |
| Copilot Arena | Coding Assistant | Paired autocomplete and inline-editing comparisons | Elo-style rankings from user votes |
| WebDev Arena Leaderboard | Coding Assistant | Real-time web development challenges between models | Win rate; task completion; user voting |
| SWE-bench | Autonomous Agent | 2,294 real-world “Fail-to-Pass” GitHub issues from 12 Python repos | % of issues resolved |
| HAL (Holistic Agent Leaderboard) | Autonomous Agent | 13 benchmarks (e.g., SWE-bench Verified, USACO, Cybench, TAU-bench) across many domains | Cost-controlled evaluations; success rates; Pareto fronts |
| TBench | Autonomous Agent | Terminal-based complex tasks in realistic environments | Task success rate; command accuracy; time to completion |
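
Several of the leaderboards above (EvalPlus, MHPP) report pass@1 or sampled pass rates. As a minimal sketch of how this family of metrics is typically computed, the snippet below implements the standard unbiased pass@k estimator popularized by the HumanEval paper; it is illustrative only and not taken from any leaderboard's codebase, and the sample counts in the example are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations is correct, given that c of the n passed."""
    if n - c < k:
        return 1.0  # any draw of k samples must include a correct one
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 100 samples per problem, 37 of which pass the tests.
print(pass_at_k(n=100, c=37, k=1))   # 0.37
print(pass_at_k(n=100, c=37, k=10))  # ~0.99
```
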
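Copilot Arena and WebDev Arena rank models from pairwise user votes. The sketch below shows one simple way such votes can be turned into Elo-style ratings; the arenas' actual aggregation may differ (e.g., Bradley-Terry fitting), and the K-factor, initial rating, and model names here are arbitrary illustrative choices.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_ratings(votes, k: float = 32.0, initial: float = 1000.0):
    """Sequentially update ratings from (winner, loser) vote pairs.
    `k` (update step) and `initial` rating are illustrative defaults."""
    ratings = defaultdict(lambda: initial)
    for winner, loser in votes:
        e_win = expected_score(ratings[winner], ratings[loser])
        ratings[winner] += k * (1.0 - e_win)
        ratings[loser] -= k * (1.0 - e_win)
    return dict(ratings)

# Hypothetical votes: each tuple is (preferred model, other model).
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]
print(elo_ratings(votes))
```
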
## Takeaways
Key Takeaways
- Leaderboards are a good way to compare solutions quantitatively and objectively.
- Comparing across multiple metrics and leaderboards helps avoid solutions that overfit to a single benchmark.