DARPA AI Cyber Challenge Tools Comparison
Overview
This report is auto-generated by ChatGPT 5.
Prompt
I want a report on the tools listed on https://archive.aicyberchallenge.com comparing their operation and features. I want a table comparison and a detailed comparison. I want to highlight what traditional non-AI tools were used - and what AI tools were used and how.
AIxCC Finalists and Tools Comparison
Introduction
The DARPA AI Cyber Challenge (AIxCC) tasked teams with building fully automated Cyber Reasoning Systems (CRSs) to find and fix vulnerabilities in real-world open-source software[1][2]. Seven finalist teams advanced from the 2024 semifinals to the 2025 finals, each open-sourcing their CRS. These systems varied in design – some enhanced traditional program analysis techniques with AI, while others were AI-first in their approach[3][4]. This report compares all finalist tools (including their semifinal and final iterations) on their core operation, technical approach, key features, use of traditional vs. AI techniques, performance results, and documentation. We also include SHERPA, an AIxCC-developed harness-generation tool released alongside the finalists. Table 1 summarizes the tools and their characteristics, followed by detailed analyses of each.
Comparison Table of AIxCC Tools
Tool (Team) | Core Approach & Features | AI-Based Techniques | Traditional Techniques | Performance (Finals) |
---|---|---|---|---|
Atlantis (Team Atlanta) | Ensemble of multiple CRSs for C, Java, etc., run in parallel (N-version design) for robustness[5]. Integrates diverse strategies (fuzzing, concolic, static) for max coverage[6]. | LLMs in three modes: Augmented (assist fuzz/tools), Opinionated (provide hints), Driven (autonomous agents)[7][8]. Fine-tuned custom Llama2-7B models specialized for C code analysis[9]. LLM-generated input mutators, grammar dictionaries and PoV exploit agents[10]. | Heavy use of fuzzing (LibAFL, libFuzzer, AFL++ ensembles) and concolic execution (extended SymCC)[11]. Custom directed fuzzers for C/Java[11]. Kubernetes-based scaling; disabled non-PoV patching to avoid false patches[12]. | 1st place – 43 vulns found, 31 patched[13]. Highest overall score (393)[13]. Most “real-world” vulns patched (achievement: Prince of the Patch[14]). |
Buttercup (Trail of Bits) | Hybrid fuzzing+analysis pipeline with LLM-assisted input generation and multi-agent patching. Emphasizes accuracy (90%+)[15] and broad CWE coverage[16]. Cost-efficient design using cheaper LLMs[17][18]. | LLM-generated test cases augment libFuzzer/Jazzer to hit complex conditions (e.g. SQLi, path traversal)[19]. Multi-agent patching with separate agents for vulnerability analysis and fix generation[20]. Used “non-reasoning” LLMs at scale (100k+ queries) for efficiency[21][17]. | Integrated coverage-guided fuzzers (libFuzzer for C/C++, Jazzer for Java) with grammar and dictionary inputs[20]. Static analysis (tree-sitter parsing, code queries) to locate bug patterns[22]. Conservative patch strategy: no patch submitted without a triggering PoV[12]. | 2nd place – 28 vulns found, 19 patched[23]. Score 219. Achieved 90%+ accuracy and 20 unique CWE exploits[16][24]. Noted for largest patch (300+ lines, “LOC Ness Monster” award)[25]. |
RoboDuck (Theori) | “LLM-first” autonomous CRS that minimizes classical fuzzing[26]. Uses an agent-based workflow to find, exploit, and patch bugs without human input[27]. Uniquely capable of generating proofs of vulnerability without traditional fuzzing or symbolic execution[27]. | Heavy use of LLM agents for code understanding, exploit generation, and patching. Two different LLMs scan code: one per function, one on large code chunks[28]. Agents follow reverse-engineering style playbooks constrained to stay on task[29]. LLMs generate PoVs for complex formats (URLs, binary protocols) by reasoning about program semantics[30]. | Used static analyzer Infer to surface bug candidates (interprocedural null derefs, overflows, etc.)[31]. Basic fuzzing as backup (leveraging OSS-Fuzz harnesses with libFuzzer)[32]. Some integration of SARIF static reports for validation. Aggressive patch strategy: allowed submitting speculative patches (up to 2 without PoV per confirmed PoV) to maximize scores[12][33]. | 3rd place – 34 vulns found, 20 patched[34]. Score 211. Earned “Secret Sauce” for uniquely exploited challenge[35]. Spent \~half the top teams’ LLM budget yet remained competitive[36][37]. |
Fuzzing Brain (All You Need) | AI-centric “brain” directs fuzzing and analysis. Most AI-driven system – LLMs handle vulnerability reasoning, strategic decisions, and code writing (≈90% of codebase written with AI aid)[4]. Emphasizes quick finds (scored within 5 minutes on a task, “Faster Than Pizza Delivery”) and high SARIF accuracy[38]. | Uses thousands of concurrent LLM agents in parallel to analyze code and hypothesize exploits[39]. Primary vulnerability detection via AI reasoning; traditional fuzzers only kick in as a fallback validation[39]. LLMs used for system architecture decisions and patch generation. Achieved >80% correctness in static finding classification (SARIF) via AI triage[38]. | Leveraged fuzzing pipelines (implied by team name and achievements) as secondary confirmation for AI findings. Likely used standard fuzzers and runtime instrumentation to verify AI-identified bugs and patches (the system won “-Ofast” first-to-find real C bug award)[38]. Patching likely also AI-driven with some traditional compile/test loops for validation. | 4th place – 28 vulns found, 14 patched[40]. Score 154. Earned achievements for fastest find and high SARIF precision[38]. Demonstrated extremely AI-forward methodology[4], though with higher cost per point due to massive parallelism[37]. |
Artiphishell (Shellphish) | Massively multi-agent CRS (60+ AI agents coordinating)[41] built by an academic hacker collective. Agent-based architecture tackles tasks in parallel (parsing, fuzzing, exploit generation, patching) with minimal human tuning[41]. Enhances classic fuzzing with AI-driven grammar inference (“Grammar Guy” subsystem)[42]. Focus on telemetry and reliability. | Agents use OpenAI, Anthropic, and Google Gemini LLM APIs (multi-LLM)[43] for various roles. Grammar Guy LLM agent evolves input grammars from coverage feedback to fuzz complex formats (SQL, URLs, binary data)[42] – a major LLM budget consumer that boosted bug-finding. Other agents monitor and “babysit” LLM outputs to keep them on track (role-based prompting, etc.)[44][45]. AI also used for rich telemetry logging of CRS reasoning steps (earned “Best Telemetry” honor)[46]. | Built on years of CRS experience (Shellphish was a CGC 2016 finalist). Uses traditional fuzzers (e.g., AFL-based or QEMU-based) guided by AI-generated grammars[42]. Likely employed symbolic execution or static analysis modules from prior frameworks (angr/Driller heritage) enhanced with AI. Very cautious patch policy – no patches without PoV to ensure 95%+ patch accuracy[46][12]. | 5th place – 28 vulns found, 11 patched[47]. Score 136. Extremely high patch correctness (over 95% success)[46] by design. Won “The Limit Does Not Exist” for creating a PoV over 1 MB[48]. Heavy use of AI (multi-LLM) yielded solid finds, but slower patch output kept score lower. |
Bug Buster (42‑b3yond‑6ug) | High-scalability fuzzing platform with advanced scheduling and some AI assistance. Employed BandFuzz reinforcement-learning scheduler to coordinate up to \~2,000 fuzzer instances in parallel[49] for maximal coverage. Explored static program slicing to focus fuzzing on bug-prone code (though faced toolchain challenges)[50]. Introduced “super patch” concept to fix multiple bugs in one go[9][51]. | Applied AI in specific roles: RL-based orchestration of fuzzers (learned which targets to fuzz more)[52], LLM-generated fuzzing seeds (the team trained an LLM to propose interesting inputs)[53][54], and LLM-assisted patch generalization (“super patches” that address root causes of multiple crashes)[9][51]. Also used LLMs to analyze crash reports into structured “vulnerability objects” for triage planning[55]. | Relied on a wide array of traditional techniques: large-scale AFL++-based fuzzing (3 independent fuzz pipelines)[56], plus static analysis integration (SARIF parsing, CodeQL queries, etc. for validation)[52]. Program slicing/static path analysis was attempted to guide fuzzing but had compatibility issues[50]. Used a mix of kernel fuzzing and user-space targets; implemented crash deduplication logic to cluster same root cause. Patching was ambitious but buggy – their “super patches” fixed multiple vulnerabilities at once when it worked[51]. | 6th place – 41 vulns found, 3 patched[57]. Score 105. Second-highest PoV points (strong bug-finding), top SARIF accuracy[58], but patch failures hurt overall rank[58]. Achieved “Czar of the SARIF” (most correct static assessments) and “Hunger for Knowledge” (peaked at 1K+ LLM requests/min)[59] – demonstrating brute-force scale. Final result marred by a patch pipeline bug, illustrating the fragility of such a complex system[60][58]. |
Lacrosse (SIFT) | Modernized legacy CRS (evolved from a 10-year-old platform)[61][62], augmented with AI. Emphasized proven fuzzing techniques: deployed 300–500 fuzzing agents (comparable scale to ToB) orchestrated by a central agent “Optimus Zero”[55]. Focused on stability and slow-but-steady analysis. | Used LLMs for higher-level reasoning and crash analysis. For each crash, an LLM created a detailed “vulnerability object” describing the bug, classification, and patch plan[55]. LLMs provided semantic guidance to fuzzers and consensus-based patch decisions (multiple AI models had to agree before a fix)[55][63]. | Strong bias toward traditional fuzzing for PoV generation (minimized fancy AI bug-hunting)[64]. Many fuzzers running in parallel with a global state manager (Optimus Zero) to distribute tasks[55]. Employed standard static tools to triage crashes. Patch submissions used a confidence algorithm and LLM consensus to avoid incorrect fixes[63], reflecting a cautious, engineering-driven approach. | 7th place – 1 vuln found, 1 patched[65]. Score 10. Despite limited scored results, received “Consensus Contributor” for most submissions (albeit mostly duplicates) and “Raiders of the Lost PoV” for finding a unique real bug others missed[66]. Suggests the system was robust but perhaps too conservative and slow to compete within time limits. |
Table 1: Summary of AIxCC finalist tools (with team names) and their characteristics in the final competition. All tools were open-sourced[67]. Performance metrics are from the final scored round at DEF CON 33[13][23]. "PoV" = proof-of-vulnerability exploit.
Detailed Tool Analysis and Comparison
In this section, we examine each tool in depth, highlighting how they work, their unique strengths or weaknesses, and noteworthy differences. We also note documentation availability (e.g. open-source repositories or papers) and any evolution from semifinal to final round.
Atlantis (Team Atlanta) – Multi-Engine Ensemble CRS
Core Operation & Approach: Atlantis is an ensemble system composed of multiple specialized CRSs that run in parallel[5]. Team Atlanta split efforts into independent engines for C/C++ bugs, Java bugs, a language-agnostic “Multilang” engine, plus separate components for patching and SARIF (static analysis report) handling[68]. This N-version programming design maximized fault tolerance and coverage – if one approach failed or crashed, others continued unaffected[69][70]. For example, Atlantis-C aggressively instrumented code with heavy compile-time modifications to find memory errors, whereas Atlantis-Multilang took a lighter, instrumentation-free approach better suited for rapid, general bug discovery[71][72]. By diversifying strategies, Atlantis achieved robust overall performance. Team Atlanta prioritized engineering rigor and stability – a critical bug in a patch-skip heuristic nearly derailed them hours before the deadline, underscoring their emphasis on robustness testing[73][74].
Traditional Techniques: Unlike some competitors that abandoned classic program analysis, Atlantis invested heavily in traditional methods, scaled up to real-world software. They integrated multiple fuzzing engines (both coverage-guided and directed fuzzers) and even ran them concurrently (ensemble fuzzing), as research shows this finds more bugs than any single fuzzer[75][76]. Their toolkit included LibAFL (with custom mutators for C and Java), AFL++, and Jazzer for Java fuzzing[11]. They also extended SymCC (a symbolic execution tool) to work on large codebases and built custom concolic executors for Java[11]. Each of these components required significant engineering to handle the scale of OSS projects[77][78]. Atlantis also employed directed fuzzing guided by static analysis: static analyzers would flag potential bug locations, and fuzzers then specifically targeted those functions or code paths[79]. In short, Team Atlanta did not give up on conventional program analysis – they adapted and scaled it, supplementing with AI rather than replacing it[80][81]. For patching, they took a conservative stance; they completely disabled any patch submission that wasn’t backed by a PoV exploit, to avoid penalties for incorrect patches[12]. This caution paid off with a high patch success rate.
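As a concrete illustration of the ensemble idea, the sketch below runs an AFL++ instance and a libFuzzer harness side by side and periodically syncs AFL++'s queue into a shared corpus. It is only a minimal approximation of such a setup: the harness binaries (`./target_afl`, `./target_libfuzzer`), directory names, and the 30-second sync interval are assumptions for illustration, not details of Atlantis itself.

```python
"""Minimal sketch of ensemble fuzzing: several engines run in parallel on a
shared corpus so that one engine's discoveries can seed the others.
Paths, binary names, and timing are illustrative assumptions."""
import subprocess
import time
from pathlib import Path

CORPUS = Path("shared_corpus")  # assumed to already hold at least one seed input

# Hypothetical harness binaries: one built for AFL++, one for libFuzzer.
JOBS = [
    # AFL++ reads seeds from -i and writes its queue under -o.
    ["afl-fuzz", "-i", str(CORPUS), "-o", "afl_out", "--", "./target_afl", "@@"],
    # libFuzzer treats positional directories as corpora it both reads and extends.
    ["./target_libfuzzer", str(CORPUS)],
]

def main() -> None:
    procs = [subprocess.Popen(cmd) for cmd in JOBS]
    try:
        while True:
            # Periodically copy AFL++'s queue back into the shared corpus so
            # the libFuzzer instance benefits from its coverage discoveries.
            for seed in Path("afl_out/default/queue").glob("id:*"):
                dest = CORPUS / seed.name.replace(":", "_")
                if not dest.exists():
                    dest.write_bytes(seed.read_bytes())
            time.sleep(30)
    finally:
        for p in procs:
            p.terminate()

if __name__ == "__main__":
    main()
```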
AI-Based Techniques: Although rich in traditional tooling, Atlantis strategically infused AI at multiple levels. They describe three tiers of LLM integration[7][82]: (1) LLM-Augmented components, where AI assists existing tools (e.g. generating seed inputs for fuzzers, suggesting likely vulnerable API calls that fuzzers should focus on, summarizing static analysis output); (2) LLM-Opinionated components, where AI provides hints or “opinions” that the system treats as suggestions – for instance, a Harness Reverser AI that guesses input file formats or network protocols used by the target software, to help fuzzers produce valid inputs[83][82]; and (3) LLM-Driven components, where an agentic AI has autonomy to navigate the codebase. An example of the last is MLLA (Machine Learning Lab Assistant) – an AI agent that reads through a repository and autonomously crafts exploit inputs (“blobs”) for suspected bugs, especially for challenges in “delta mode” (where a recent code diff may contain the bug)[8][84]. Team Atlanta was unique in deploying fine-tuned LLMs: they fine-tuned a 7-billion-parameter Llama2 model on security-relevant data (especially for C language analysis) to serve as an internal reasoning engine[9]. This gave them a proprietary AI tuned for vulnerability patterns in C/C++ code (buffer overflows, use-after-free, etc.), potentially with lower latency and cost than calling large external APIs. LLMs were also used in directed test generation – e.g. Atlantis had language-specific prompt templates to generate exploit code or inputs once a vulnerability was suspected[79]. Another novel use was creating function-level dictionaries of sensitive values: their AI would analyze a function and output a list of keywords or values (e.g. special file names, command strings) that a fuzzer could use as mutation dictionary entries[79]. Overall, Atlantis treated its AI components as powerful assistants that were carefully “babysat” and orchestrated. The team relied on LiteLLM, an open-source abstraction layer, to interface with multiple AI providers and to handle failures (rate limits, token limits, etc.) gracefully[85][86]. This prevented vendor lock-in and let them switch between models like GPT-4, Claude, and others as needed, or even use their fine-tuned model in place of an API call.
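The following sketch shows how a provider-agnostic LLM call with fallback might be wired using the open-source LiteLLM package. The model chain, timeout, and prompt are illustrative assumptions; Team Atlanta's actual routing and retry logic is not reproduced here.

```python
"""Sketch of a provider-agnostic LLM call with fallback, in the spirit of the
LiteLLM layer described above. Model names, the retry policy, and the prompt
are assumptions for illustration."""
from litellm import completion  # pip install litellm

# Try a primary hosted model first, then fall back to alternatives (which
# could include a locally served fine-tuned model behind an OpenAI-style API).
MODEL_CHAIN = ["gpt-4o", "claude-3-5-sonnet-20240620", "ollama/llama2"]

def ask(prompt: str) -> str:
    last_err = None
    for model in MODEL_CHAIN:
        try:
            resp = completion(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content
        except Exception as err:  # rate limits, context overflows, outages...
            last_err = err        # ...are treated the same: move down the chain.
    raise RuntimeError(f"all providers failed: {last_err}")

if __name__ == "__main__":
    print(ask("List C standard-library calls that commonly cause buffer overflows."))
```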
Performance & Results: Atlantis delivered the strongest performance of the finals, outscoring all others by a wide margin. It discovered 43 vulnerabilities and successfully patched 31 of them, earning a total of 393 points – the highest of any team[13]. Notably, Atlantis excelled in both quantity and quality of patches. It earned the special award “Bundle Baron” for submitting the most bundle (PoV+patch) combinations that scored[87], and “Most Real-World Vulns Patched” (the Prince of the Patch title) for fixing the greatest number of genuine vulnerabilities[14]. The system’s diversity paid off – it was able to find bugs across many categories (CWE coverage) and did so quickly enough to also gain a “Professional Assassin” achievement for having over 95% of its PoVs be successful[88]. Interestingly, Atlantis found a few 0-day vulnerabilities in real open-source code (3 in Java, 3 in C) that were not planted by DARPA[89][90], showing its real-world efficacy. In terms of efficiency, Team Atlanta used the largest compute budget (nearly \$74k) and a substantial LLM budget (\~\$29k) during the finals[18], reflecting their at-scale approach. The cost per point for Atlantis was about \$263[91] – higher than some more cost-optimized teams, but still reasonable given the victory. Documentation-wise, Team Atlanta has provided extensive resources: they published a technical report on their blog with system design insights[5][92] and have open-sourced their code (organized into components like Atlantis-C, Atlantis-Java, etc. on GitHub[93]). The comprehensive approach of combining “all of the above” – classical analysis with cutting-edge AI – was validated by their win.
Strengths & Weaknesses: Atlantis’ strength was clearly its balance and thoroughness. It left no stone unturned: fuzzers, static analyzers, custom AIs, all worked in concert. This resulted in a top score and high reliability (very few incorrect patches). A possible weakness of this approach is complexity – maintaining multiple CRSs and integration between them is a heavy engineering lift, evidenced by the one-line bug that almost broke the system[94][74]. However, Team Atlanta mitigated that through aggressive testing and redundancy. Another challenge is cost: they threw significant computing resources at the problem, which might be hard to replicate outside of a funded competition. But overall, Atlantis demonstrated that hybrid systems can outperform purely AI or purely traditional ones, and its design to “not trust any single method too much” proved very effective.
Buttercup (Trail of Bits) – AI-Augmented Fuzzing & Patching
Core Operation & Approach: Buttercup (Trail of Bits’ CRS) is built on the idea of augmenting classic vulnerability discovery tools with AI, in a focused and cost-effective way. The system’s core loop still resembles a traditional fuzzing and static analysis pipeline, but nearly every stage is turbocharged by machine learning. For instance, Buttercup uses libFuzzer (for C/C++) and Jazzer (for Java) on each target program, but it improves their effectiveness by supplying intelligent inputs generated by LLMs rather than relying purely on random mutations[19][20]. The team specifically calls out that their LLMs could generate inputs capturing complex attack patterns that simple fuzzers might miss – e.g. a well-formed SQL injection string or a malicious file path – which were then fed into the fuzzers’ corpus[19]. This approach significantly sped up code coverage and bug trigger time (an important factor given limited competition time)[19][22]. Buttercup also integrated static code analysis: it used parsers (like Tree-sitter) and custom query scripts to identify suspicious code constructs (e.g. unsafe uses of strcpy, or use of exec() calls)[22]. These were used to guide fuzzing or to prioritize which issues to patch.
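A minimal sketch of the LLM-seeded fuzzing idea is shown below: ask a model for inputs that exercise a specific attack pattern, then drop them into the libFuzzer corpus directory so the mutation engine can take over. The prompt, model name, and corpus path are assumptions rather than Buttercup's actual configuration.

```python
"""Rough sketch of LLM-seeded fuzzing: request inputs that probe a specific
attack pattern, then write them into the libFuzzer corpus so the mutation
engine refines them. Prompt, model, and paths are illustrative assumptions."""
from pathlib import Path
from openai import OpenAI  # pip install openai

client = OpenAI()
CORPUS = Path("corpus/http_parser")
CORPUS.mkdir(parents=True, exist_ok=True)

PROMPT = (
    "The fuzz target parses raw HTTP/1.1 requests. Produce 5 requests that "
    "probe edge cases (oversized headers, path traversal in the URI, "
    "chunked-encoding tricks). Separate them with a line containing only '---'."
)

def generate_seeds() -> None:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT}],
    )
    for i, seed in enumerate(resp.choices[0].message.content.split("\n---\n")):
        # Each seed becomes one corpus file; libFuzzer picks it up on the next
        # run (or via -merge) and mutates it toward new coverage.
        (CORPUS / f"llm_seed_{i:03d}").write_bytes(seed.encode())

if __name__ == "__main__":
    generate_seeds()
```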
Where Buttercup truly innovated was in patching. Trail of Bits built a multi-agent patch generation system that emphasizes separation of concerns[20]. For example, one agent might specialize in isolating the buggy code snippet, another in drafting a fix, and a third in evaluating the fix’s correctness. They intentionally avoided using the most expensive, “generalist” AI models (like GPT-4) in favor of what they call “non-reasoning LLMs”[17][37] – presumably smaller or fine-tuned models that are cheaper to run at scale. This is reflected in their costs: Buttercup made over 100k LLM requests during the finals but still only spent \~$21k on LLM API usage, far less than Team Atlanta[21][37]. The team noted that the top two finishers (Atlanta and ToB) spent the most on AI queries, suggesting that judicious use of LLMs was “money well-spent” in terms of points[36].
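The separation-of-concerns idea can be sketched as two chained prompts: one agent only explains the bug, a second only writes the fix. The prompts and the stand-in model name below are assumptions; Buttercup's real agent roles and validation steps are more elaborate.

```python
"""Sketch of separation of concerns in multi-agent patching: one agent only
describes the root cause, a second only writes the diff, so neither prompt has
to do everything at once. Prompts and model names are assumptions."""
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # stand-in for a cheaper "non-reasoning" model

def run(role: str, content: str) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": role},
                  {"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

def propose_patch(source: str, crash_report: str) -> str:
    # Agent 1: describe the root cause and the exact lines involved.
    analysis = run(
        "You are a vulnerability analyst. Explain the root cause of the crash "
        "and name the function and lines that must change. Do not write code.",
        f"Crash report:\n{crash_report}\n\nSource:\n{source}",
    )
    # Agent 2: produce a unified diff constrained by agent 1's analysis.
    return run(
        "You are a patch author. Output only a unified diff that fixes the "
        "described bug without changing unrelated behavior.",
        f"Analysis:\n{analysis}\n\nSource:\n{source}",
    )
```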
Traditional Techniques: Despite the AI enhancements, Buttercup is grounded in well-understood tools. It leans heavily on coverage-guided fuzzing (libFuzzer for native code, Jazzer for JVM) to generate concrete PoVs. The fuzzing campaigns were likely distributed across multiple cores/nodes for speed. They also utilized static bug-finding tools: although not explicitly named in sources, Buttercup’s use of tree-sitter implies they were parsing source code abstract syntax trees to find patterns (like searching for known dangerous functions, etc.)[22]. These static findings could then be confirmed by fuzzing or passed to the patcher. Another classic tool category Buttercup likely used is sanitizers (AddressSanitizer, etc.) to catch memory errors, given their background in security research. For exploit generation (PoVs), Buttercup seems to have relied mostly on fuzzers to create crashing inputs, rather than symbolic execution or manual exploit crafting, as no mention of those is made – instead, they improved fuzzers via AI seeds. On the patch side, Trail of Bits exercised caution: like Team Atlanta, they never submitted a patch without an accompanying PoV[12]. Their blog suggests they believed patch accuracy was a key differentiator in winning second place[24]. To further ensure quality, they built cross-validation: Buttercup would test its candidate patches against known exploits (its own and perhaps even those of other teams if it obtained them) to avoid releasing a faulty patch[24][63]. This strategy aligns with Trail of Bits’ strength in secure coding – they treated each patch almost like a mini code review, aided by automation.
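A PoV-gated patch check of this kind might look like the sketch below: apply the candidate diff, rebuild, replay every known proof-of-vulnerability, and only accept the patch if none still crashes. The build command, harness path, and directory layout are assumptions.

```python
"""Sketch of a PoV-gated patch check: apply the candidate diff, rebuild, and
replay every known proof-of-vulnerability; accept the patch only if none still
crashes. Build command, harness path, and directory layout are assumptions."""
import subprocess
from pathlib import Path

def patch_survives_povs(diff_file: str, repo: str = "target_repo") -> bool:
    # Apply the candidate patch; a failed apply disqualifies it immediately.
    if subprocess.run(["git", "-C", repo, "apply", diff_file]).returncode != 0:
        return False
    try:
        # Rebuild the sanitizer-instrumented harness (hypothetical build script).
        if subprocess.run(["./build.sh"], cwd=repo).returncode != 0:
            return False
        # Replay every stored PoV; a non-zero exit means the bug still fires,
        # since sanitizers abort the process on a detected violation.
        for pov in Path("povs").glob("*"):
            try:
                run = subprocess.run([f"{repo}/fuzz_harness", str(pov)], timeout=60)
            except subprocess.TimeoutExpired:
                return False  # a hang after patching is also treated as failure
            if run.returncode != 0:
                return False
        return True
    finally:
        # Leave the checkout clean for the next candidate.
        subprocess.run(["git", "-C", repo, "checkout", "--", "."])

if __name__ == "__main__":
    print(patch_survives_povs("candidate.diff"))
```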
AI-Based Techniques: Buttercup’s motto could be “use AI to play to our tools’ strengths.” A prime example is their use of LLMs to generate specialized fuzzing inputs[19]. Instead of having an AI blindly hunt for bugs, Buttercup asks it to produce inputs that exercise tricky code paths. The LLM likely has some understanding of what input format or API might reach deep into the code – for instance, if fuzzing an HTTP server, an LLM could generate a variety of HTTP request packets with unusual headers to trigger parsing edge cases. These AI-crafted inputs are then incorporated into the fuzzers’ corpus, after which libFuzzer’s fast mutation engine takes over to refine and find a crashing payload[19]. This human-inspired but machine-augmented approach improved harness saturation time, meaning the time to hit interesting code was reduced[19]. Another AI aspect is multi-agent coordination for patching[20]. Buttercup’s architecture separates vulnerability detection from repair: one agent (or LLM prompt) can focus on understanding what the bug is and marking its location, while another focuses only on generating a fix in code[20]. By dividing the task, they reduce the likelihood of one large model hallucinating or making a complex error. They also mention using cheaper models that lack chain-of-thought reasoning[17]; this implies they might use multiple simpler completions or a fine-tuned code model (like Codex or Code Llama) for tasks like patch generation, rather than an expensive generalist model. Impressively, Buttercup managed these AI resources to achieve the lowest cost-per-point among top teams – \$181/point, indicating high efficiency[91][95]. In addition, an AI crash analysis component likely triaged fuzzer crashes: similar to others, Buttercup probably used an LLM to read sanitizer logs or stack traces and group crashes by root cause, filtering out false positives (like benign out-of-memory errors). This kind of AI filtering improves precision by automatically discarding non-security crashes, so the team only spends time patching real issues[96] (Buttercup indeed scored high on accuracy).
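Crash grouping is often done with a cheap signature step before any LLM is involved, so that only one representative per root cause needs expensive analysis; the sketch below uses the top in-project stack frames of an AddressSanitizer report as that signature. The frame-count heuristic and report parsing are assumptions, not a description of Buttercup's triage code.

```python
"""Sketch of crash deduplication ahead of LLM triage: bucket sanitizer reports
by their top stack frames so only one representative per root cause needs
further analysis. The depth heuristic and frame filtering are assumptions."""
import re
from collections import defaultdict

FRAME_RE = re.compile(r"#\d+ 0x[0-9a-f]+ in (\S+)")  # e.g. "#2 0x4f3a21 in png_read_row"

def signature(asan_report: str, depth: int = 3) -> tuple[str, ...]:
    frames = FRAME_RE.findall(asan_report)
    # Drop interceptor/runtime frames so the signature reflects project code.
    frames = [f for f in frames if not f.startswith("__")]
    return tuple(frames[:depth])

def dedupe(reports: list[str]) -> dict[tuple[str, ...], list[str]]:
    buckets: dict[tuple[str, ...], list[str]] = defaultdict(list)
    for report in reports:
        buckets[signature(report)].append(report)
    return buckets  # one triage call per bucket, not per crash
```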
Performance & Results: Buttercup finished as the runner-up (2nd place), with 28 vulnerabilities found and 19 patches applied in the finals[23]. Its total score (219 points) narrowly edged out Theori for the second spot[97][98]. The breadth of bugs Buttercup found was remarkable: it successfully produced proof-of-vulnerabilities across 20 distinct CWE categories, ranging from memory errors to injection flaws[16]. This earned them the “Cornucopia” achievement for diversity of CWEs[15]. Moreover, Buttercup maintained \~90% accuracy, meaning very few of its submissions were incorrect or invalid[99]. The system even patched one challenge with a huge 300+ line fix, winning the quirky “LOC Ness Monster” award[25] – demonstrating its ability to handle complex issues requiring substantial code changes. The team’s focus on correctness over sheer quantity likely contributed to their success over teams that found more bugs but lost points to bad patches (e.g. 42-b3yond). In terms of timeline, Buttercup was also quick – it scored its first points in under 5 minutes on at least one task (achievement: “Faster Than Pizza Delivery”, which actually went to FuzzingBrain[38] but Buttercup also achieved a \<5min score per their blog[100]). All these indicate a well-rounded performance. For documentation, Trail of Bits provided an excellent post-mortem blog post[101] detailing these insights and even an earlier “tale of the tape” analysis[3][42]. Their open-source code is available on GitHub and is said to be runnable on a laptop[102] – illustrating their emphasis on practical, lightweight design.
Strengths & Weaknesses: Buttercup’s strength lies in augmented automation – it took dependable techniques (fuzzing, static checks) and made them better with AI, rather than trying to reinvent the wheel entirely. This made it efficient and accurate. A key strength was resource management: using cheaper LLMs and fewer compute cycles but still achieving results. If there is a weakness, it is that Buttercup may miss bugs that require deep reasoning beyond input generation – for instance, logic flaws or complex exploit scenarios that fuzzing can’t easily uncover. Their reliance on simpler models might also limit some capabilities (they explicitly avoided “reasoning” models, and the cheaper models they used could be less effective for patches that need creative insight, though their results show little evidence of struggle there). Another potential drawback is that by not submitting speculative patches, they may have left some points on the table – Theori’s strategy of patching without PoV could cover latent vulnerabilities, but Buttercup chose not to risk it. Nonetheless, their second place indicates this strategy was wise. In summary, Buttercup was an exemplary demonstration of expert-guided AI design: it encoded a lot of security expertise up front and let AI handle the tedious or large-scale parts, leading to a very balanced system.
RoboDuck (Theori) – LLM-First Autonomous Security Agent
Core Operation & Approach: RoboDuck is Theori’s cyber reasoning system, notable for being highly AI-centric. The team explicitly set out with an “LLM-first” design philosophy, using large language models for all aspects of the challenge and only falling back on traditional techniques as a safety net[103][104]. This was a bold approach: unlike others that started from fuzzers or static analyzers, Theori treated the LLM as the primary engine driving bug discovery, exploit development, and patch writing. In fact, Theori’s blog mentions their CRS can produce working PoV exploits without using fuzzing or symbolic execution at all[27], which is quite unique. They implemented a full pipeline of autonomous agents that read the code, reason about potential vulnerabilities, generate an input to trigger each bug, and then create a patch[27]. This means RoboDuck behaves almost like a team of virtual security researchers collaborating.
The system pipeline can be summarized as: Bug Finding → Bug Filtering → PoV Generation → Patch Generation → “Scaffolding & Glue” (integration)[105]. For bug finding, RoboDuck actually does use a bit of traditional analysis: Theori leveraged Facebook’s Infer static analyzer to find memory issues and null dereferences in C/C++ code[31]. However, Infer’s reports were extremely noisy (99.9% false positives in their workflow)[106], so the LLM agents were tasked with reviewing Infer’s output and deciding which reports likely represent real bugs[28]. Separately, they employed two distinct LLM-based static analysis methods: one agent looked at code function-by-function to spot potential bugs, and another looked at larger “blob” chunks of code (spanning multiple functions or files) to catch issues that manifest in broader context[107]. This dual strategy was probably to mitigate LLM context window limits and vantage – fine-grained vs. coarse-grained. After initial bug candidates were identified, a filtering agent would cross-check them to eliminate duplicates or those deemed non-exploitable.
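A rough reconstruction of the Infer-filtering step might look like the following: read infer-out/report.json, show each finding plus surrounding source to a model, and keep only findings judged plausibly real. The report field names, prompt, and model are assumptions.

```python
"""Sketch of LLM-assisted filtering of Infer output: read infer-out/report.json,
show each finding plus surrounding source to a model, and keep only the ones it
judges plausibly exploitable. Field names, prompt, and model are assumptions."""
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def code_context(path: str, line: int, radius: int = 20) -> str:
    lines = Path(path).read_text(errors="replace").splitlines()
    lo, hi = max(0, line - radius), min(len(lines), line + radius)
    return "\n".join(f"{i + 1}: {l}" for i, l in enumerate(lines[lo:hi], start=lo))

def plausible_bugs(report_path: str = "infer-out/report.json") -> list[dict]:
    keep = []
    for finding in json.loads(Path(report_path).read_text()):
        prompt = (
            f"Infer reports {finding['bug_type']} in {finding['procedure']} "
            f"({finding['file']}:{finding['line']}): {finding['qualifier']}\n\n"
            f"Context:\n{code_context(finding['file'], finding['line'])}\n\n"
            "Is this reachable and plausibly exploitable? Answer YES or NO, "
            "then one sentence of justification."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
        )
        if resp.choices[0].message.content.strip().upper().startswith("YES"):
            keep.append(finding)
    return keep
```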
For PoV generation, RoboDuck did something truly novel: it had LLM agents craft exploits (inputs) directly for the suspected bugs[27][108]. This is challenging because it requires the AI to understand how to trigger the bug in practice. Theori’s advantage was to treat this like a reverse engineering problem – their agents would simulate the steps a human exploit developer might take, but entirely in code and memory space. They constrained the LLMs with what they call “reverse engineering workflows” to prevent the AI from going off-track[29]. For example, an agent might perform these steps: read the vulnerable function; identify what input parameters or file data influence the vulnerability; then propose a concrete input that would hit it. If the vulnerability was, say, a buffer overflow when parsing a certain file format, the agent would try to produce a file in that format that overflows the buffer. Notably, Theori did not rely on random fuzzing to find the exact trigger – they tried to reason it out with the model’s knowledge. This is where they saved a lot of the fuzzing budget that others spent; Theori’s LLM query spend (\~\$11.5k) was half that of Buttercup, and they spent a moderate amount on compute (\~\$20k), meaning they weren’t running huge fuzzer farms[18]. They did have fuzzing “available for backup”[103] – likely they still ran some fuzzers using existing harnesses to ensure that if the AI missed something, it could be caught. But the emphasis was clearly on AI.
Patch generation in RoboDuck was done via (unsurprisingly) LLMs as well. Once an exploit and bug were confirmed, an agent would modify the code to fix the vulnerability, ensuring it doesn’t break functionality. The blog implies most teams, including Theori, used similar high-level approaches for patching – essentially instructing an LLM to produce a code diff that addresses the problem[109]. The interesting twist is Theori’s risk-taking on patches: they developed a model to decide when to submit patches without PoVs, implementing an aggressive 2:1 patch to PoV ratio strategy[12][33]. This means if RoboDuck thought a piece of code was buggy but couldn’t immediately produce a PoV, it might still submit a speculative patch (but capped at two such speculative fixes for each confirmed bug). This was meant to maximize points under the scoring rules. It was a calculated risk, as bad patches hurt the accuracy multiplier, but Theori bet that their AI could identify real bugs even if it failed to exploit them in time – a bet that seems to have paid off moderately well.
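The 2:1 budget can be captured with a few lines of bookkeeping, as in the toy sketch below; the ratio comes from the description above, while the data structure and method names are purely illustrative.

```python
"""Toy bookkeeping for the aggressive submission policy described above: allow
at most two speculative (no-PoV) patches for every confirmed PoV already
banked. The real decision model was surely richer; this only shows the gate."""
from dataclasses import dataclass

@dataclass
class PatchBudget:
    max_speculative_per_pov: int = 2
    confirmed_povs: int = 0
    speculative_submitted: int = 0

    def record_pov(self) -> None:
        self.confirmed_povs += 1

    def may_submit_speculative(self) -> bool:
        allowance = self.max_speculative_per_pov * self.confirmed_povs
        return self.speculative_submitted < allowance

    def record_speculative_patch(self) -> None:
        self.speculative_submitted += 1

# With one confirmed PoV banked, two speculative patches clear the gate; a third does not.
budget = PatchBudget()
budget.record_pov()
for _ in range(2):
    assert budget.may_submit_speculative()
    budget.record_speculative_patch()
assert not budget.may_submit_speculative()
```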
Traditional Techniques: Although de-emphasized, Theori’s CRS did incorporate traditional methods in supporting roles. We’ve mentioned Infer static analysis for C/C++ bug hints[31] – they even patched some bugs in Infer itself to improve its output for AIxCC’s needs[106]. They also mention using some fuzzing and static analysis as backup[103]. Specifically, they would run the standard OSS-Fuzz harnesses for each challenge with libFuzzer, mainly to ensure they didn’t miss any crashes; but interestingly, Theori notes that no other system they know of can use LLMs at repository scale to generate inputs that trigger bugs, which RoboDuck did[110][111]. This suggests that their fuzzers might not have been the primary source of PoVs. They likely also used sanitizers (ASan, UBSan) to catch memory violations when running either fuzzers or AI-generated tests, to know when a bug was triggered. The static analysis results were integrated via SARIF as well, since they had to assess provided static analyzer reports from the organizers – Theori presumably leaned on their LLM to evaluate those too (the competition provided some SARIF that could be false positives, worth 1 point each if correctly assessed[2]). On the exploit side, Theori did not explicitly mention any use of symbolic execution or concolic testing – they appear to have consciously avoided those, focusing on LLM reasoning instead[112]. Finally, for patch testing, they likely did run the patched program against the PoV to verify it stops the crash (common to all teams). Theori’s approach to traditional tech can be seen as “trust the AI, use classical tools as verification.” It’s the inverse of teams like Lacrosse or 42-beyond, which trusted fuzzers and used AI as an assistant.
AI-Based Techniques: RoboDuck’s entire DNA is AI. The system uses multiple LLMs, possibly from different vendors (they referenced multi-LLM usage for static analysis)[107]. A notable design choice: They used two different LLMs in static bug finding, which hints they might have used e.g. GPT-4 for one pass and Claude or a fine-tuned model for another, to get varied “opinions” on what code is risky[107]. This is akin to having two experts review the code. The results of those were likely merged. The agentic aspect is key: Theori’s blog posts (and follow-up ones they published on their site) describe how they built effective LLM agents with constrained behavior[113][114]. For example, they might use a pseudo-code interpreter or step-by-step prompting to force the agent to reason systematically about an exploit (“If the buffer is of size N and input length is M, choose M = N+1 to overflow…”). They also mention preventing the AI from “wandering” by constraining workflows[29] – perhaps by breaking tasks into smaller prompts or using guardrail code to stop nonsense actions.
One major advantage of AI they exploited was handling complex input formats. Traditional fuzzers often struggle to generate inputs that are valid enough to pass basic parsing (e.g., a correct image file header). Theori’s agents, with their knowledge base, could generate syntactically correct and semantically plausible inputs for these formats[108]. For instance, if the target was an XML parser, an AI can produce a well-structured XML that still triggers a bug deep inside – something a dumb fuzzer might not achieve quickly. The AI’s capability to understand protocols (like a well-formed URL or PDF) was a differentiator for RoboDuck[108]. Additionally, Theori built an entire public GitHub repository (theori-io/aixcc-public) with logs and agent traces for transparency[115], implying confidence in their AI agent approach and a desire to analyze it post-competition.
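In code, LLM-driven PoV generation plus verification might be sketched as below: ask the model for a candidate input aimed at one suspected bug, then confirm it by running a sanitizer-instrumented harness. The harness path, prompt, and base64 convention are assumptions; Theori's agents are far more structured than a single prompt.

```python
"""Sketch of LLM-driven PoV generation for a structured format: request one
candidate input aimed at a suspected bug, then confirm it by running the
sanitizer-instrumented harness. Harness path and prompt are assumptions."""
import base64
import subprocess
import tempfile
from openai import OpenAI

client = OpenAI()

def candidate_pov(bug_summary: str) -> bytes:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                "Suspected bug: " + bug_summary + "\n"
                "Produce one input file that is valid enough to reach the "
                "vulnerable code and trigger it. Reply with base64 only."
            ),
        }],
    )
    return base64.b64decode(resp.choices[0].message.content.strip())

def triggers_crash(pov: bytes, harness: str = "./fuzz_harness") -> bool:
    with tempfile.NamedTemporaryFile() as f:
        f.write(pov)
        f.flush()
        # ASan/UBSan abort with a non-zero exit code when a violation fires.
        return subprocess.run([harness, f.name], timeout=30).returncode != 0

if __name__ == "__main__":
    pov = candidate_pov("heap overflow in the PNG chunk-length check of parse_ihdr()")
    print("crash reproduced" if triggers_crash(pov) else "no crash")
```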
Performance & Results: RoboDuck placed 3rd in the final competition. It found 34 vulnerabilities and patched 20 of them[34], for a total score of 211 points. This was just slightly behind Trail of Bits’ 219 points, a very close race for second. Theori’s system had a strong showing in vulnerability finding – it actually found the second-most vulns after Team Atlanta. Its patch count was a bit lower proportionally (perhaps due to their riskier patch strategy or some speculative patches not yielding points). Still, they demonstrated the viability of an AI-heavy approach. The stats also reveal Theori used about half the LLM budget of the top two teams but achieved \~95% of their scores[36][37], indicating good efficiency. Achievements Theori earned include “Thrifty” (they had the least dollars spent per vuln patched among teams)[116][37] – a testament to their economical use of AI and compute. They also got “Secret Sauce” for submitting a unique PoV that no one else did[117], likely attributable to their creative AI-driven exploit generation finding a scenario others missed. Theori didn’t snag the SARIF or speed trophies, implying their approach was potent but perhaps a bit slower early on or not as focused on static warnings (understandable since they had so many false positives to filter through Infer). Documentation from Theori is extensive: they published multiple blog posts (overview, LLM agent design, agent trajectory examples) sharing their methodology[118][27], and their code (though somewhat curated) is on GitHub.
Strengths & Weaknesses: The strength of RoboDuck is clear – it showed that AI agents can perform end-to-end hacking tasks with minimal human intervention. This is a preview of what AI-driven security might become. It excelled at complex logical reasoning tasks (like exploit generation for tricky formats) and wasn’t limited by needing an instrumented harness for everything – a smart agent can find a bug in any part of the code. Additionally, it achieved a lot with fewer resources (cost-effective). However, the weaknesses are the flip side: an LLM-first system can be unpredictable. Theori themselves likely struggled with managing the “creativity” of the AI – hence the need to babysit and constrain it[119][45]. When their AI agents were right, they were brilliant; when they were wrong, valuable time could be wasted chasing phantom bugs or formulating a PoV that doesn’t quite work. Another weakness was patch accuracy: by pushing more speculative fixes, they risked penalties. If their AI misjudged a bug and patched something incorrectly, that would hurt their score multiplier (though they evidently balanced this with the 2:1 model). In the finals, Theori’s accuracy multiplier was a bit lower than teams like ToB or Shellphish (since they submitted some patches without proof)[12]. But Theori’s daring approach clearly paid dividends, and it underscores how far pure AI reasoning has come – an LLM-first system nearly matched a top security firm’s hybrid tool while spending roughly half the LLM budget and relying far less on classical fuzzing.
All You Need is a Fuzzing Brain – Massively Scaled AI System
Core Operation & Approach: All You Need is a Fuzzing Brain (often shortened to FuzzingBrain) was the most AI-forward system among the finalists[4]. This team treated the CRS as a large-scale AI orchestration problem. According to Trail of Bits’ analysis, FuzzingBrain used thousands of concurrent AI agents in parallel to reason about vulnerabilities, essentially brute-forcing the unreliability of individual LLMs by sheer quantity and diversity[39]. In other words, instead of one very smart agent, they had many “pretty smart” agents trying different approaches simultaneously. The team integrated these agents with an underlying fuzzing platform – the name suggests a focus on fuzzing, and indeed they did have traditional fuzzers in the loop, but in a secondary role. The architecture likely involved an AI coordinator that would spin up subtasks (like analyzing a particular source file or function) and assign them to AI workers. These workers might generate possible inputs or identify suspect code paths. Then, some of those hypotheses would be validated either by running the program or by cross-checking with other agents.
The team’s emphasis on AI is highlighted by the claim that 90% of their entire codebase was written using AI assistance[4]. This is a fascinating detail: they apparently dogfooded AI to even build their CRS (e.g., using Copilot or ChatGPT to generate large chunks of their code). It implies the team culture was to lean into AI at every opportunity. They also gave the AI high-level strategic roles – for instance, using AI for “system architecture and decision-making”[4]. Possibly, they had an agent monitoring the performance of various components and re-balancing resources (maybe deciding to allocate more CPUs to fuzzing vs. static analysis on the fly, based on AI suggestions).
Despite the AI-heavy design, the system did incorporate fuzzing – after all, it’s in their name. Achievements show they won “SARIF Mastery” for high accuracy on static analysis reports[38], and “-OFast” for being first to find a real (non-seeded) C vulnerability[38] (which suggests they had a very fast detection on at least one target, possibly via static analysis or a quick AI insight). They also scored early in tasks (“Faster than Pizza Delivery”)[38], indicating a fast pipeline in some cases. The core idea was likely: throw a large swarm of AI agents at the code, have them propose potential bugs (with reasoning), then use fuzzers to verify those that need confirmation. This flips the typical approach – here AI is finding bugs and fuzzing is double-checking, rather than fuzzing finding and AI triaging. Given that they placed mid-pack, this approach had mixed success, but it’s noteworthy.
Traditional Techniques: FuzzingBrain did not entirely discard traditional methods. It sounds like they still ran fuzzers and other analyses when appropriate, but perhaps triggered by AI. For example, if an AI agent flagged a function as vulnerable (say, a buffer copy with no bounds checking), they might then launch a focused fuzzing campaign on that function or module to produce a concrete crash (or use a directed fuzzing mode). They presumably used the OSS-Fuzz harnesses for targets where available, similar to others, but possibly under the guidance of the AI (e.g., picking which harness to prioritize). Because they got the SARIF accuracy award, they clearly processed the static analyzer outputs given by the organizers very effectively[38]. This likely involved a classical component: maybe they wrote a script to filter out SARIF warnings that didn’t correlate to any dynamic findings, or they used AI to decide which SARIF entries to mark true/false. But an 80%+ accuracy on SARIF suggests a systematic method, since random guessing would be worse. They might have cross-referenced SARIF with code context using an LLM, which is semi-traditional (static analysis result + AI).
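One plausible way to triage SARIF reports with an LLM is sketched below: walk the results in a SARIF 2.1.0 file, pull the flagged region from the source tree, and ask a model for a true/false verdict. How FuzzingBrain actually did this is not public, so the prompt, model, and context window are assumptions.

```python
"""Sketch of SARIF triage of the kind described above: iterate over results in
a SARIF 2.1.0 file, extract the flagged region from the source tree, and ask a
model for a true/false-positive verdict. A plausible reconstruction only."""
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()

def assess(sarif_path: str, src_root: str = ".") -> list[tuple[str, bool]]:
    sarif = json.loads(Path(sarif_path).read_text())
    verdicts = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            loc = result["locations"][0]["physicalLocation"]
            uri = loc["artifactLocation"]["uri"]
            line = loc.get("region", {}).get("startLine", 1)
            snippet = "\n".join(
                Path(src_root, uri).read_text(errors="replace")
                .splitlines()[max(0, line - 15): line + 15]
            )
            prompt = (
                f"Static analysis says: {result['message']['text']} "
                f"at {uri}:{line}.\n\nCode:\n{snippet}\n\n"
                "Is this a real vulnerability? Answer TRUE or FALSE with one reason."
            )
            resp = client.chat.completions.create(
                model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
            )
            answer = resp.choices[0].message.content.strip().upper()
            verdicts.append((result["message"]["text"], answer.startswith("TRUE")))
    return verdicts
```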
On fuzzing, they likely used a variety of fuzzers (AFL++, libFuzzer, etc.), given that their team likely included academic fuzzing experts. They did not win any achievements related to fuzzing scale, so their fuzz infrastructure may have been modest (especially compared to Team Atlanta or 42-b3yond-6ug). Instead, they applied fuzzing more tactically as a backup. For patching, not much is published, but presumably they did something similar to the others: prompt an LLM to fix the code, perhaps with multiple agents voting (since they favored scale, they could generate multiple patch candidates with different LLMs or prompts and choose the best). There’s mention in ToB’s blog that FuzzingBrain’s patching was largely AI-driven with minimal validation – since they prioritized scoring quickly, sometimes at the expense of accuracy, their cost per point was relatively high ($490) and accuracy might have suffered[37]. However, they avoided catastrophic failures and ended in 4th place, which is respectable for a newer team.
AI-Based Techniques: This team embraced AI at scale. If Shellphish used 60 agents, FuzzingBrain used “thousands”[39]. They leveraged model diversity – running different types of models in parallel. This could mean using different prompts or even entirely different AI systems (OpenAI vs. Anthropic vs. open-source models) concurrently, under the theory that what one model misses, another might catch. The phrase “overcome AI unreliability through scale and model diversity”[39] directly implies that by sheer volume of agents, the system’s overall output becomes robust (the wisdom of crowds, but for AIs). This is an intriguing strategy: if one has the budget to do thousands of LLM calls in parallel, one can then filter their outputs for the best ones. Indeed, they had agents whose job was likely to evaluate the outputs of other agents – a hierarchy where some AI judges others’ findings (maybe by cross-checking with program semantics or known bug patterns). The result was that \~90% of their PoVs came from direct AI reasoning (not fuzzers)[39]. In cases where those AI-born PoVs failed or were uncertain, only then did they turn on traditional fuzzing as a “fallback validation mechanism”[39]. That indicates extreme reliance on AI.
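The “many agents, then filter” pattern can be sketched as a fan-out with a consensus filter: the same question goes to several model/persona combinations in parallel, and only findings reported by more than one agent survive. The models, personas, vote threshold, and the crude exact-string matching below are all assumptions; a real system would normalize findings by code location.

```python
"""Sketch of parallel agent fan-out with a consensus filter: ask several
model/persona combinations the same question and keep only findings that more
than one agent reports. Models, personas, and threshold are assumptions."""
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI()
MODELS = ["gpt-4o-mini", "gpt-4o"]          # stand-ins for a diverse model pool
STYLES = ["as a memory-safety auditor", "as an input-validation auditor"]

def one_agent(model: str, style: str, code: str) -> list[str]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   f"Review this C function {style}. List suspected bugs, one "
                   f"per line, as 'line <n>: <issue>'. Code:\n{code}"}],
    )
    return [l.strip() for l in resp.choices[0].message.content.splitlines() if l.strip()]

def consensus_findings(code: str, min_votes: int = 2) -> list[str]:
    jobs = [(m, s) for m in MODELS for s in STYLES]
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        results = list(pool.map(lambda ms: one_agent(*ms, code), jobs))
    # Exact-string voting is a simplification; real systems would bucket
    # findings by file and line before counting agreement.
    votes = Counter(finding for agent in results for finding in agent)
    return [f for f, n in votes.items() if n >= min_votes]
```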
Additionally, the team might have used reinforcement learning or planning for their AI agents. The name itself is a play on “All you need is a brain” – possibly they attempted some brain-like global strategy (maybe a central AI that learned which agents to trust). It’s known one team in the finals (Team Atlanta) fine-tuned models; it’s possible FuzzingBrain also fine-tuned or trained a custom model, though there is no public evidence of this. Instead, they likely leveraged existing APIs heavily. Their large compute spend (\$63k, second highest) and significant LLM spend (\$12k)[37] show they poured resources into this massive parallel AI approach. They actually had the highest cost per point among top teams[120], suggesting a bit of inefficiency – the AI swarm found a lot but also wasted effort on false leads, it seems. Documentation on their methods is scarce (no public blog from them as of writing), but one can infer they were testing the limits of how far multi-agent AI can go in this domain.
Performance & Results: FuzzingBrain secured 4th place, finding 28 vulnerabilities and patching 14[40], for a score of 154. This put them just behind the top three, and notably ahead of Shellphish and others in total points[121][122]. The performance indicates their approach was effective at discovering bugs – matching Shellphish’s find count and coming close to Theori’s (28 vs. 34). Their patch count was moderate; some patches likely failed or they didn’t patch everything they found. The fact that they scored lower suggests either slower patching or accuracy penalties. Indeed, their cost per point and anecdotal evidence imply they may have submitted some incorrect patches that hurt their multiplier (though they didn’t come last in patch points by any means). However, their notable achievements show strength in specific areas: they were the fastest to score on at least one challenge (the “pizza” speed achievement) and excelled at the static analysis portion with SARIF Mastery (>=80% accuracy)[38]. The first-to-find real vuln (-OFast) award indicates a keen eye for an actual bug (maybe found by an AI reading source code before any fuzzer could even compile the program)[38]. These honors underscore that a swarming AI can outpace and outperform some manual or traditional methods, at least initially. The trade-off was possibly coordination and focus – more isn’t always better if it floods the system with noise. But overall, their 4th-place finish as a newer team, likely a group of students or researchers, is impressive. The team’s code is on GitHub under the o2lab organization[123], which points to an academic research lab. The repository contains directories for crs, cp_config, etc., and even a proprietary folder[124][125], which suggests some components (perhaps the LLM models or data) were not fully open-sourced or are placeholders.
Strengths & Weaknesses: The clear strength of FuzzingBrain was mass parallelism and AI depth. It essentially brute-forced the problem with AI, leveraging a “more is more” philosophy. This likely uncovered some non-obvious bugs quickly and provided resilience (if one agent got stuck, another completely different approach was also running). It also allowed them to be extremely AI-centric without being derailed by one model’s quirks – consensus or majority voting among agents could surface correct answers. The approach’s weakness is efficiency and coherence. Running so many agents risks duplicating work and requires very good aggregation logic to make sense of their outputs. The high cost per point suggests a lot of wasted cycles on fruitless inquiries[120]. Also, managing thousands of AIs is complex; perhaps the integration and debugging of such a system is difficult, and timing issues or race conditions could occur. In the final scoring, they did well but not enough to beat more targeted approaches – indicating that while you can throw AIs at the wall to see what sticks, a clever strategy (like Team Atlanta’s multi-engine or ToB’s guided fuzzing) can still have an edge. In terms of documentation, FuzzingBrain is less visible; they might share insights later, but currently we rely on the observations of others for their description. In summary, All You Need is a Fuzzing Brain demonstrated the power of massively parallel AI in cybersecurity, achieving solid results but at great expense – a classic time vs. money vs. reliability trade-off.
Artiphishell (Shellphish) – Academic Multi-Agent System
Core Operation & Approach: Artiphishell, by the veteran hacking team Shellphish, is an agent-based CRS in which a large number of specialized AI agents collaborate to find and fix bugs[41]. Over 60 independent AI agents were programmed, each with a distinct role (the name itself is a pun on “artificial” and “Shellphish”)[41]. This architecture draws on Shellphish’s extensive experience building automated hacking systems (they competed in the original 2016 DARPA Cyber Grand Challenge). The ethos was to decompose the immense task of securing a codebase into smaller subtasks, handled by expert agents. For example, one agent (nicknamed “Grammar Guy”) focused on inferring input grammars and generating complex test inputs[42], another might focus on monitoring program telemetry or feedback, another on generating patches for memory errors, and so on. This way, Artiphishell’s design mimics a team of hackers with different specialties working together – but automated.
Shellphish’s CRS heavily emphasizes autonomy and scale. They wanted minimal human intervention; once deployed, the agents communicate and iterate on their own. They set up extensive telemetry logging and monitoring of the AI decisions (even earning the “Best Telemetry” award)[46], which suggests they instrumented their CRS to record agent actions, LLM prompts, responses, etc., both for debugging and perhaps for scoring bonus (if they shared useful data with DARPA). Being an academic team, Shellphish likely experimented with various approaches: fuzzing, concolic execution (their open-source angr framework is for symbolic execution), static analysis (their members have research in that area too), and integrated LLMs on top. The repository indicates a complex infrastructure (with Azure deployment scripts, etc.)[126][127], and environment variables for multiple AI APIs[43]. The multi-agent approach was validated by their professors: quotes show they view this architecture as a representation of “how much AI has changed the way we attack difficult problems”[128][129].
Traditional Techniques: Shellphish has a pedigree in program analysis, so Artiphishell certainly didn’t ditch those methods. They augmented fuzzing significantly through AI. The “Grammar Guy” subsystem is essentially a smart fuzzing assistant: it uses LLMs to generate input grammars and progressively refine them based on coverage feedback[42]. This means Artiphishell did run fuzzers (likely AFL or libFuzzer or their own legacy fuzzers from CGC days), but instead of relying on random mutations, they had grammar rules to produce more structured inputs (e.g., if testing a SQL parser, Grammar Guy might learn the SQL syntax and feed fuzzing with valid but tricky SQL queries). According to ToB, this grammar-based fuzzing approach significantly increased their bug-finding capability for complex formats[42], at the cost of a large portion of their LLM budget (likely a worthy trade-off). Aside from fuzzing, Shellphish historically use symbolic execution; it’s not confirmed, but they possibly integrated their angr framework to systematically explore some code paths or to solve inputs for certain conditions, with AI guiding the target selection. They also likely used static analysis tools, perhaps parsing the code with their own tools or open-source analyzers to supply hints to the agents.
For exploitation (PoVs), Shellphish’s strategy was somewhat middle-ground. They did manage to create at least one enormous PoV payload (earning “The Limit Does Not Exist” for >1MB exploit)[48], meaning they had the capability to generate complex exploits. It’s possible they used some concolic or brute-force method to build that payload, or an agent just concatenated a huge input together to overwhelm something. Patching-wise, Shellphish took a conservative approach like Trail of Bits and Team Atlanta: they only submitted patches when they had a corresponding PoV to prove the fix[12]. In fact, they boasted one of the highest patch accuracies – over 95% of their patches were correct[46]. This indicates they avoided speculative fixes and carefully validated each patch (likely by rerunning all known PoVs on the patched program as part of their pipeline). As a result, they may have left some vulnerabilities unpatched if they couldn’t reliably confirm a fix in time, which might have limited their score, but it kept their penalties minimal.
AI-Based Techniques: Artiphishell’s multi-agent system employed a variety of LLMs – as evidenced by environment keys for OpenAI, Anthropic, and Gemini (Google) models[43]. This suggests they dynamically chose models for tasks: perhaps using GPT-4 for complex reasoning, Claude for long-text analysis, and Gemini (PaLM) for code generation, etc., exploiting each model’s strengths. Over 60 AI agents implies a wide range of AI roles. Some hypothesized roles: code summarizer agents (to distill large code into manageable chunks for others), vulnerability hypothesis agents (scanning for common bug patterns), exploit agents (like a simplified version of what Theori did, but maybe not as central), patch agents (given a location and bug type, draft a code change), verification agents (review a patch diff to ensure it addresses the bug and doesn’t introduce others), and coordination agents (deciding which agent should work on what). They likely implemented a message-passing or blackboard system for agents to share information, which is non-trivial engineering. Given the “we had to babysit LLMs” comment referenced at DEF CON by a member[130], Shellphish found that you can’t just let 60 agents run wild – you need oversight logic. They possibly had a scheduler that monitored agent outputs and killed or re-assigned tasks if an agent got stuck or hallucinated too much.
One specifically cited agent, Grammar Guy, was a standout AI feature[42]. This agent would generate grammar rules (likely in a format like ANTLR or a simple BNF) to describe input formats of targets, then evolve them. For example, if fuzzing found some new code coverage when sending \<script> tags, Grammar Guy might infer that the input is HTML-like and add more HTML grammar rules. Over time, the inputs get more sophisticated, allowing fuzzers to explore deeper states of the program. This is a clever synergy of AI with fuzzing: the AI provides intuition on input structure (like a human writing a grammar for Peach fuzzer), but automatically. Shellphish also built agents for telemetry, which implies some AI(s) were summarizing what the system was doing in human-readable form or detecting anomalies (they reported LLM and CRS activity to organizers, possibly for that Best Telemetry award)[46].
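A heavily simplified version of such a coverage-feedback grammar loop appears below: the model maintains a grammar, emits sample inputs from it, the harness reports coverage, and the coverage number feeds the next prompt. The harness path, the use of libFuzzer's `-runs=0` replay to read a coverage figure, and all prompts are assumptions rather than Grammar Guy's actual design.

```python
"""Very rough sketch of an LLM-in-the-loop grammar refinement cycle: the model
maintains a grammar and emits sample inputs, the harness replays the corpus to
report coverage, and that number is fed back into the next prompt. All paths
and prompts are illustrative assumptions."""
import re
import subprocess
from pathlib import Path
from openai import OpenAI

client = OpenAI()
CORPUS = Path("grammar_corpus")
CORPUS.mkdir(exist_ok=True)

def measure_cov(harness: str = "./fuzz_harness") -> int:
    # libFuzzer with -runs=0 replays the corpus and prints "cov: N" statistics.
    out = subprocess.run([harness, "-runs=0", str(CORPUS)],
                         capture_output=True, text=True).stderr
    hits = re.findall(r"cov: (\d+)", out)
    return int(hits[-1]) if hits else 0

def refine(grammar: str, coverage: int) -> tuple[str, list[str]]:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content":
                   f"Current BNF grammar for the target's input:\n{grammar}\n"
                   f"Corpus coverage is {coverage} edges. Extend the grammar to "
                   "reach new features, then give 10 sample inputs derived from "
                   "it. Return the grammar, a line '===', then the samples "
                   "separated by lines containing only '---'."}],
    )
    text = resp.choices[0].message.content
    new_grammar, _, samples = text.partition("\n===\n")
    return new_grammar, [s for s in samples.split("\n---\n") if s.strip()]

grammar = "<input> ::= <anything>"
for round_no in range(5):
    grammar, samples = refine(grammar, measure_cov())
    for i, s in enumerate(samples):
        (CORPUS / f"r{round_no}_{i}").write_text(s)
```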
In terms of heavy AI usage, Shellphish’s LLM budget was actually relatively low (\~\$2.9k, lowest among finalists) and their compute spend was high (\~\$54.9k)[131]. This suggests they self-hosted a lot of models (compute cost) or relied more on local analysis than API calls. Maybe they ran some open models on GPUs for cheaper per-query cost, or they used smaller models extensively which cost less per call. The multi-agent approach might have emphasized many shallow AI tasks (which would align with using smaller models in parallel rather than a few giant model calls). The “Hunger for Knowledge” achievement for 1K+ LLM requests/minute actually went to 42-b3yond-6ug, not Shellphish[132], so Shellphish wasn’t spamming external APIs as heavily. Instead, they might have had optimized uses or local instances.
Performance & Results: Artiphishell finished 5th place, with 28 bugs found and 11 patches applied[47], totaling 136 points. The placement belies the sophistication of their system – they clearly built something advanced and it did successfully operate (finding and fixing a good number of issues), but it didn't score as high as expected. There are a few possible reasons: one, they may have focused on quality over quantity, given their very high patch accuracy (they won "The Doctor is In" for patch success >95%)[46]. That means they likely passed on patching some bugs they weren't sure about, costing them potential points but avoiding penalties. Two, multi-agent coordination overhead might have slowed them down; they found as many bugs as FuzzingBrain (28) and only slightly fewer than Theori (34), but patched significantly fewer. This suggests their patch pipeline was a bottleneck or some agents underperformed. For example, it's possible they identified many crashes but couldn't develop PoVs for all of them (since they wouldn't patch without a PoV). If an exploit agent fails to deliver, those bugs remain unpatched and earn minimal points (perhaps only SARIF assessment points). The scoreboard shows Shellphish's PoV count was decent but their patch count lagged.
However, Shellphish still had notable accomplishments. They handled at least one scenario requiring an enormous input (1MB+ PoV)[48] – perhaps an AI agent decided to flood an input to trigger a worst-case behavior or memory issue, and it worked. They also had telemetry insight into their system, which presumably helped them debug and iterate on it quickly. Their 5th place came with a significant points gap below 4th, but a sizable gap above 6th (136 vs 105), showing they were clearly more effective than the bottom two teams. As an academic team, the knowledge gained might be just as important; indeed, their mentors considered the breakthroughs and learning itself a reward[133]. Documentation on Artiphishell is currently limited – the UCSB news piece provides a high-level view[41], team members gave a DEF CON talk and appeared on the CTF Radio podcast shedding more technical light, and they shared some post-mortem analysis in an X/Twitter thread[134]. The code is open on GitHub, though it may not include all private data (there's mention of a private wiki and CI system)[127].
Strengths & Weaknesses: Artiphishell's main strength is its comprehensive AI-agent ecosystem. It attempted to emulate a full cybersecurity team with AI, which likely made it very flexible – it could handle many different types of challenges (any language, any bug type) by delegating to appropriate agents. Its integration of grammar-based fuzzing was cutting-edge and gave it an edge in handling complex-input software (like targets requiring well-formed inputs)[42]. Additionally, Shellphish's cautious approach to patching meant their system's outputs were highly reliable (a valuable trait outside competition). On the flip side, the complexity of managing 60+ agents is a weakness – overhead and inter-agent communication can slow things down or cause failures if not perfectly tuned. Indeed, team members mentioned having to babysit the AI (keeping their "Jack-Jack"-like, unpredictably powerful LLMs under control)[130][135]. Another weakness is that a multi-agent system can have blind spots if agents don't share information effectively; e.g., an exploit agent might not realize another agent already explored a path. This could lead to duplicate work or missed optimizations. Shellphish's lower LLM usage suggests they might have under-utilized some AI potential, relying more on internal logic – possibly a deliberate cost-saving measure, but it may have limited the raw power available compared to others who spent more. In conclusion, Artiphishell was an ambitious melding of classic hacking techniques with AI teamwork. It didn't top the scoreboard, but it validated the multi-agent approach and no doubt provided numerous insights (and future research directions) to its creators. The tool is available for the community, and given Shellphish's academic roots, we may see research papers or open data coming from this effort in the future.
Bug Buster (42-b3yond-6ug) – Scalable Fuzzing with AI Scheduling and Super-Patching¶
Core Operation & Approach: Bug Buster, by team 42-b3yond-6ug (a team of academics from Northwestern and others[136][137]), built a CRS optimized for massive scale fuzzing and clever patching techniques. Their philosophy leaned on proven methods like fuzz testing and then amplified them using AI. They put significant effort into scaling out fuzzing infrastructure: their system ran three parallel fuzzing pipelines during the semifinal and continued to expand, ultimately coordinating up to ~2,000 fuzzing instances cluster-wide in the finals[56][138]. This was orchestrated via a custom scheduler named BandFuzz, which employed reinforcement learning (RL) to allocate resources to fuzzers in real-time[54][139]. Essentially, BandFuzz would monitor which fuzzers were yielding coverage or crashes and dynamically decide which fuzzing jobs to prioritize or spawn next, learning optimal scheduling policies. The team's prior research on BandFuzz even won a fuzzing competition, indicating that part of their system was state-of-the-art in fuzz management[54].
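A minimal way to picture BandFuzz-style scheduling (this is an illustrative multi-armed-bandit sketch under stated assumptions, not the team's RL implementation) is to treat each fuzzer/target pair as a bandit arm, reward arms for new coverage or crashes, and keep steering time slices toward the arms that are currently paying off:

```python
import math
import random

class UCBScheduler:
    """UCB1-style allocator: each arm is a (fuzzer, target) pair; names are made up."""
    def __init__(self, arms):
        self.arms = arms
        self.pulls = {a: 0 for a in arms}
        self.reward = {a: 0.0 for a in arms}

    def pick(self):
        total = sum(self.pulls.values()) + 1
        def ucb(a):
            if self.pulls[a] == 0:
                return float("inf")           # try every arm at least once
            mean = self.reward[a] / self.pulls[a]
            return mean + math.sqrt(2 * math.log(total) / self.pulls[a])
        return max(self.arms, key=ucb)

    def update(self, arm, new_coverage, crashes):
        self.pulls[arm] += 1
        self.reward[arm] += new_coverage + 10 * crashes   # crashes weighted higher

if __name__ == "__main__":
    random.seed(0)
    # Simulated campaign: one fuzzing job yields new edges more often than the others.
    yields = {"libpng_afl": 0.6, "sqlite_libfuzzer": 0.3, "kernel_syzkaller": 0.1}
    sched = UCBScheduler(list(yields))
    for _ in range(500):
        arm = sched.pick()
        sched.update(arm, new_coverage=int(random.random() < yields[arm]), crashes=0)
    print(sched.pulls)   # most time slices flow to the most productive fuzzer
```

A production scheduler would track much richer state (per-seed yields, crash deduplication, target difficulty) and could replace UCB with a learned policy, but the allocation loop has the same shape.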
Apart from raw fuzzing power, 42-b3yond integrated static program analysis to guide their efforts. One key tool was a program slicer – a static analysis that can narrow down which parts of the code influence a certain point (like a crash point or a target variable)[140]. They used slicing to map out paths to target locations (e.g., to focus fuzzing on code that leads to a potential bug). However, during the competition this slicer ran into issues (written for LLVM 14, it broke on C23 code)[141], which hampered their Round 1 performance and tied up developers in ultimately unsuccessful porting efforts[141]. This anecdote shows the risk of relying on tools that might not keep pace with new languages or standards – a valuable lesson they learned.
Another highlight of Bug Buster’s approach is what they called “super patches.” Instead of patching each bug individually, they attempted to create patches that fix multiple vulnerabilities at once[9][51]. The system could detect when different crashes actually stemmed from the same root cause, or even if unrelated bugs could be solved by one general hardening change[51]. For example, if two different out-of-bounds crashes in code could both be prevented by adding an array bounds check in one location, they’d deploy a single patch for both. This turned accuracy penalties on their head – a patch that fixes two bugs yields double benefit with one potential penalty. It’s a creative tactic: by “holistically” addressing issues, they aimed to maximize the patch score multiplier. It’s unclear how often this succeeded, but it was innovative enough to impress other teams[51][142]. It likely involved AI assistance to generalize a fix and recognize patterns among crashes.
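The grouping step behind such "super patches" might look roughly like the toy sketch below (purely illustrative; the crash data, the root-cause heuristic, and the LLM prompt are all assumptions): crashes whose stacks converge on the same frame are grouped, and a single fix is requested for the whole group.

```python
from collections import defaultdict

LIBC_FRAMES = {"memcpy", "memset", "strcpy"}   # frames to skip when localizing

crashes = [  # hypothetical triaged crashes: (id, sanitizer bug type, stack frames)
    ("c1", "heap-buffer-overflow", ["memcpy", "decode_row", "read_image"]),
    ("c2", "heap-buffer-overflow", ["memset", "decode_row", "read_image"]),
    ("c3", "null-deref",           ["lookup", "parse_header"]),
]

def root_frame(frames):
    """First non-libc frame: a crude stand-in for root-cause localization."""
    return next((f for f in frames if f not in LIBC_FRAMES), frames[-1])

def ask_llm_for_single_fix(group):
    """Placeholder: a real system would prompt an LLM with both crashes' context."""
    ids = ", ".join(c[0] for c in group)
    return f"propose one bounds check in {root_frame(group[0][2])} covering {ids}"

groups = defaultdict(list)
for crash in crashes:
    groups[root_frame(crash[2])].append(crash)

for frame, group in groups.items():
    if len(group) > 1:                       # candidate for a single "super patch"
        print(f"{frame}: {len(group)} crashes -> {ask_llm_for_single_fix(group)}")
    else:
        print(f"{frame}: 1 crash -> ordinary per-bug patch")
```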
Traditional Techniques: 42-b3yond-6ug's foundation was fuzzing. They used established fuzzers (probably AFL++, libFuzzer, and possibly others like Honggfuzz) in large numbers. They mention kernel fuzzing and directed fuzzing as separate efforts in their narrative[53][143], so they even attempted to fuzz an OS kernel (they found a Linux kernel bug in the semifinals) – ambitious, given kernel fuzzing's complexity. They also wrote CodeQL queries (a static analysis query language) to find vulnerabilities in source code quickly (the blog says "everyone was writing CodeQL" during development)[53], meaning they automated searches for known bug patterns (like spotting suspicious uses of functions, integer overflow patterns, etc.). Additionally, they implemented triaging pipelines: when fuzzers found crashes, their system automatically triaged them (reproducing the crash, deduplicating it, and classifying it by type). This is also where SARIF (static analysis report) integration came in: they validated static warnings by matching them against dynamic evidence of a bug[52].
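Conceptually, SARIF validation of this kind boils down to matching a static warning's location against frames from dynamically observed crashes; the sketch below is a hedged illustration of that matching step (the data and the line-distance tolerance are made up, not taken from their system):

```python
# Toy SARIF results and crash frames; real inputs come from analyzers and sanitizers.
sarif_results = [
    {"ruleId": "cpp/overflow", "file": "src/png.c",  "line": 120},
    {"ruleId": "cpp/null",     "file": "src/util.c", "line": 45},
]
crash_frames = [  # (file, line) pairs extracted from sanitizer stack traces
    ("src/png.c", 118), ("src/png.c", 121), ("src/io.c", 300),
]

def matches(warning, frames, slack=5):
    """A warning is 'confirmed' if any crash frame lands near its reported location."""
    return any(f == warning["file"] and abs(l - warning["line"]) <= slack
               for f, l in frames)

for w in sarif_results:
    verdict = "confirmed by PoV" if matches(w, crash_frames) else "unverified"
    print(f'{w["ruleId"]} @ {w["file"]}:{w["line"]} -> {verdict}')
```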
They also took reliability seriously: their story recounts how in the semifinals many components crashed but their system had backups and "absorbed the hits"[56]. This resilience came from redundancy – e.g., three independent fuzzing pipelines means that if one fuzzer process dies, two others are still running. They also logged component status in a key-value store (KV cache) to recover state if something failed[144]. They had to address scaling issues, like their seed synchronization service bottlenecking when 2000 fuzzers generated new inputs too quickly[145]. They solved it via a MapReduce-style approach to distribute the syncing load[146]. This level of engineering reflects strong systems-building for large workloads, arguably more so than AI specialization. For patching, beyond the super patch idea, they likely used typical methods of running regression tests and cross-testing PoVs against patches to ensure fixes worked (though anecdotally a late bug in their patch component cost them – more below).
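For the seed-synchronization bottleneck mentioned above, a MapReduce-style fix can be pictured as sharding seeds by content hash so no single corpus service sees every update; the sketch below is an assumption-laden illustration of that idea, not their implementation:

```python
import hashlib
from collections import defaultdict

def shard_of(seed: bytes, num_shards: int) -> int:
    """Route a seed to a shard based on its content hash."""
    return int.from_bytes(hashlib.sha256(seed).digest()[:4], "big") % num_shards

def sync_round(per_fuzzer_new_seeds, num_shards: int = 8):
    # "Map": every fuzzer's new seeds are routed to shards by hash.
    shards = defaultdict(set)
    for new_seeds in per_fuzzer_new_seeds:
        for seed in new_seeds:
            shards[shard_of(seed, num_shards)].add(seed)   # per-shard deduplication
    # "Reduce": each shard contributes its unique seeds to the merged global corpus.
    merged = set()
    for shard_seeds in shards.values():
        merged |= shard_seeds
    return merged

if __name__ == "__main__":
    fuzzers = [[b"AAAA", b"BBBB"], [b"BBBB", b"CCCC"], [b"AAAA"]]
    print(len(sync_round(fuzzers)), "unique seeds after sync")   # -> 3
```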
AI-Based Techniques: Bug Buster integrated AI in a few focused ways instead of everywhere. One was the BandFuzz RL scheduler – essentially applying reinforcement learning to maximize fuzzing output (coverage and crashes)[54]. The RL agent would consider the "state" (perhaps which programs or which seeds have yielded results) and allocate "actions" (which target to fuzz next, or which seed to mutate) to maximize reward (new coverage or crashes). They found this outperformed static scheduling in fuzzing competitions[54]. Another AI use was seed generation via LLMs. They mention training an "LLM seed generator" and writing a paper on it (which was initially rejected)[53][147]. This tool presumably uses an LLM to suggest initial seeds for fuzzers – similar in spirit to what Buttercup did, but possibly with a custom model trained for the purpose. It might generate inputs especially for structured formats. The fact that a student spent two months writing a paper on it shows they had a dedicated AI component for this, which might have been key in early bug findings.
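An LLM seed generator of the kind described could be as simple as the sketch below (the llm() call is a stub and the prompt wording is invented; a real system would query a hosted or locally fine-tuned model and validate its outputs):

```python
import json
import pathlib

def llm(prompt: str) -> str:
    # Placeholder: a real call would return model output. Here we fake a reply
    # containing a few JSON-encoded HTTP-request seeds.
    return json.dumps(["GET /index.html HTTP/1.1\r\nHost: a\r\n\r\n",
                       "POST /upload HTTP/1.1\r\nContent-Length: 4\r\n\r\nAAAA"])

def generate_seed_corpus(format_description: str, out_dir: str):
    """Ask the (stubbed) model for example inputs and write them as a fuzzer corpus."""
    prompt = (f"Produce diverse, well-formed example inputs for this format, "
              f"as a JSON list of strings:\n{format_description}")
    seeds = json.loads(llm(prompt))
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, seed in enumerate(seeds):
        p = out / f"seed_{i:03d}"
        p.write_bytes(seed.encode())          # the fuzzer picks these up as its corpus
        paths.append(p)
    return paths

if __name__ == "__main__":
    print(generate_seed_corpus("HTTP/1.1 requests handled by the target server", "corpus_demo"))
```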
For patching, their concept of “super patches” likely involved AI help. Recognizing that two crashes relate might be done via static analysis clustering, but coming up with a single fix that addresses multiple issues without regression is tricky. It sounds like an LLM could have been asked: “Given these two crash locations, is there a single code change that would prevent both?” – a task requiring semantic understanding. The Trail of Bits blog confirms they had an LLM-based patching process to do this multi-bug fix[9]. If it worked, it could yield big points, but if it backfired (fixes one bug but not fully, or introduces new issues), it could hurt.
Additionally, they used AI for crash analysis: after a fuzzer generated a crash, an LLM could turn the raw data (addresses, dumps) into a structured description, which they call vulnerability objects (like a JSON with the type of bug, location, etc.)[55]. This would help in triaging and planning patches. They might also have used LLMs to assist in writing CodeQL queries or translating static analysis results, though this is not explicitly stated. Interestingly, 42-b3yond's LLM budget was minimal (~$1.1k) while their compute cost was high (~$38k)[131]. This suggests they ran AI mostly on-premise or used open models. Indeed, they hammered a local model with 1,000+ requests per minute (earning "Hunger for Knowledge")[132], which aligns with running a smaller model many times. This could have been for vulnerability object creation or static analysis tasks, which can be done repetitively with a fine-tuned local model, keeping API costs low.
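A "vulnerability object" pipeline might look like the following hedged sketch, which parses a sanitizer report into structured fields and leaves the free-text summary to a (stubbed) local LLM; the field names and schema are illustrative, not the team's actual format:

```python
import json
import re

ASAN_REPORT = """\
==1234==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000010
READ of size 8 at 0x602000000010 thread T0
    #0 0x4f1b2d in decode_row src/png.c:121
    #1 0x4f0aa0 in read_image src/png.c:88
"""

def llm_summarize(report: str) -> str:
    """Stub for a local-model call that explains the root cause in plain language."""
    return "Out-of-bounds read while decoding a row; length check missing."

def to_vuln_object(report: str) -> dict:
    bug_type = re.search(r"AddressSanitizer: ([\w-]+)", report).group(1)
    frame = re.search(r"#0 \S+ in (\S+) (\S+):(\d+)", report)
    return {
        "type": bug_type,                       # e.g. heap-buffer-overflow
        "function": frame.group(1),
        "file": frame.group(2),
        "line": int(frame.group(3)),
        "summary": llm_summarize(report),       # free-text root-cause description
        "suggested_fix_hint": "bounds-check row length before the read",
    }

print(json.dumps(to_vuln_object(ASAN_REPORT), indent=2))
```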
Performance & Results: Bug Buster ended up 6th place (out of 7 finalists). In the final round, it found an impressive 41 vulnerabilities – the second-highest count – but only 3 patches succeeded[57]. This massive disparity is what sank their score to 105 points[57]. It indicates that while their finding mechanism (fuzzers + AI scheduling) was top-notch, their fixing mechanism struggled. In fact, their raw PoV score was the 2nd best and they were 1st in SARIF (static) points[58]. But patch points were very low (6th among 7)[58], so the patches either failed or were never submitted for many bugs. The team's postmortem confirms this: internally their patch success in tests was ~90% for mitigating PoVs, but something went wrong during the actual competition causing patch submissions to fail unexpectedly[58][148]. They suspect a system bug was responsible – a painful lesson that "every system has cracks, and in a competition like this, you only get one chance"[149].
Despite the low placement, 42-b3yond-6ug had notable achievements: they received "Czar of the SARIF" for the most correct static analysis evaluations and "Giant Slayer" for tackling a huge repo (5M+ lines)[59]. The latter suggests they scored on one of the largest projects (maybe the Linux kernel or another large codebase) where others struggled. This indicates their system could handle scale better than some (fuzzing at scale helps on giant code). They also got "Hunger For Knowledge" for their heavy LLM query rate[132]. Their semifinal performance was actually top-tier (they were among the top-scoring teams in the semis)[150][151], which earned them the finalist spot and funding. In the finals, ironically, their heavy focus on patch innovation might have backfired due to a critical implementation bug (their account of a patch component rushed in during the final week without full review stands out[152]). So the system found many bugs (including in the Linux kernel, where most teams found none[153]), but the patcher collapsing meant a low final score. The team lead later expressed admiration for Theori's LLM-first approach[154], indicating a recognition that their traditional-heavy approach had limitations in comparison.
Strengths & Weaknesses: Bug Buster's strength was raw vulnerability-finding power. They arguably built the best bug-finding machine among all teams – 41 vulns is testament to that. Their RL-driven fuzzing and extensive engineering on scaling paid off in triggering lots of crashes, even in hard targets like Linux. They also nailed static analysis, meaning they weren't fooled by many false alarms. Another strength was creativity in patching (super patches) – although it didn't show up in points, the idea is solid for real-world use where you want minimal patches that solve maximum issues. However, weaknesses were evident: the patch pipeline's fragility was a huge Achilles' heel. It shows that balancing the two halves (find & fix) is crucial; their finding outran their fixing. Possibly too many crashes overwhelmed their patching process, or they over-engineered patch strategies when a simpler per-bug patch might have sufficed. Additionally, the slicing-tool distraction in Round 1 and integration struggles suggest they faced project management challenges – tackling many complex components (fuzzing, RL, slicing, concolic, patching, etc.) with a limited team can stretch resources thin. In terms of AI, while they used it smartly in places, one could argue they could have benefited from more advanced LLM reasoning in patching or exploit generation, rather than relying so heavily on fuzzing (which finds crashes but doesn't inherently prioritize them). Documentation from this team is very rich – one member's personal blog gave a detailed, candid account of their journey[54][56], and the team likely has academic papers in the works on BandFuzz and LLM seed generation. They open-sourced their code (archived on GitHub)[155] and are reflective about what went wrong, which provides valuable lessons. In summary, Bug Buster was a high-risk, high-complexity system that excelled in discovery but faltered in remediation, showing the need to weight both halves equally in such competitions.
Lacrosse – Legacy CRS with AI Enhancements¶
Core Operation & Approach: Lacrosse was a bit of an outlier among finalists. It appears to have been built on a legacy CRS codebase from nearly a decade ago, updated for AIxCC[61][62]. This likely refers to technology from the original DARPA Cyber Grand Challenge (2016) or similar competitions. The team behind it (possibly a company like SIFT or a government lab, as suggested by interviews[156][157]) essentially took a tried-and-true automated vulnerability discovery system and gave it a “glow-up” with some modern AI additions[61][62]. The core remained fuzzing-centric and reliability-focused. Lacrosse deployed on the order of 300–500 fuzzing agents in parallel, which is comparable to Trail of Bits’ scale and shows a significant commitment to fuzzing horsepower[55]. These fuzzers were orchestrated by a central intelligence called “Optimus Zero”[55]. Optimus Zero managed global state and distributed tasks among fuzzers and analysis components, prioritizing tasks much like a general might in the field. This suggests a design similar to older CRS where a central controller divides work (some to fuzzers, some to static analyzers, etc., based on what's left to do).
Lacrosse prioritized proven techniques over experimental AI. As the Trail of Bits blog noted, they favored “focusing on agent orchestration rather than AI-driven vulnerability discovery”[64]. In practice, this means Lacrosse relied on fuzzing with robust engineering (ensuring fuzzers run properly, outputs are collected, etc.) and didn’t take as many risks with untested AI logic that might mislead or crash. The approach is akin to running a well-oiled fuzzing farm with careful oversight. That said, they did integrate AI in specific supportive roles (next section).
Traditional Techniques: Being based on an older CRS, Lacrosse almost certainly employed classical symbolic execution and static analysis components from that era. Many CGC-era systems had custom symbolic executors (to generate PoVs by solving path constraints), plus capabilities like taint analysis to see which input bytes influence a crash. If Lacrosse's base was one of those, it likely retained those capabilities. However, open-source critical infrastructure code (Linux, Apache, etc.) is far bigger and more complex than CGC's synthetic binaries, so they probably leaned more on fuzzing, which scales better. They also adhered to traditional PoV generation via fuzzing: they largely used fuzzers to generate crashing inputs rather than attempting AI-generated, format-compliant exploits[64].
Lacrosse also did static analysis of a sort: their achievements show "Consensus Contributor" (most PoVs/vulns submitted), which implies they detected a lot of potential issues and submitted many, even duplicates[66]. This could mean an aggressive static analysis flagged many possible bugs and they attempted many PoVs (some redundant or invalid, hence the most submissions but not a high score) – perhaps a legacy triage mechanism that was not refined for modern code and therefore oversubmitted. Another achievement, "Professional Assassin" (PoV success >95%)[66], is harder to reconcile with submitting the most PoVs. One plausible explanation is that the submission count was dominated by static-analysis (SARIF) vulnerability reports, while the smaller number of PoVs they actually submitted mostly worked: they were careful to officially submit a PoV only once they were fairly sure (yielding the high success rate), and hammered static vuln reports, since being wrong there costs little beyond zero points.
The “Raiders of the Lost POV” award suggests Lacrosse found at least one genuine vulnerability that DARPA did not plant in the code[66]. That’s a credit to their thoroughness; likely their fuzzers or symbolic execution uncovered a 0-day in a target (maybe the one bug they did patch successfully). For patching, Lacrosse took a middle-of-the-road strategy: not as conservative as Shellphish, not as aggressive as Theori. They used LLM consensus and a confidence algorithm to decide on patches[63]. This implies they might generate a patch with AI, then have multiple AI models or runs evaluate it, and if they all (or a majority) agree it’s a good fix, they submit it. This is a smart way to reduce hallucinations – essentially a voting scheme among AI “opinions.” They didn’t have the fastest or most numerous patches, but this approach likely aimed to avoid accuracy penalties while still patching whenever they had enough confidence.
AI-Based Techniques: Lacrosse integrated AI in a supporting role rather than core. First, they used LLMs for higher-level reasoning tasks. This included analyzing crash reports to produce “vulnerability objects” – structured descriptions of a crash’s root cause, type, etc., and even suggestions for patching[55]. Such an object might say, for example: “Heap buffer overflow in function parse_image() when handling BMP files. Cause: misuse of memcpy length.” This gives the developers (or automated system) a clear picture. Having AI create these likely helped Optimus Zero to decide how to patch or whether to prioritize that bug.
They also used AI for semantic understanding tasks that classical tools struggle with. For instance, figuring out input formats or protocol details might be handled by an AI reading the documentation or inferring from code. Lacrosse might have had an AI agent to reverse-engineer input format if needed (similar to Shellphish’s Grammar Guy, but probably less central). Another key AI usage is in their patching process: LLM consensus. They might generate several patch options with GPT or similar, then evaluate each with one or multiple models to pick the best. Or, they might have one LLM propose a patch and another review it for correctness. That aligns with a careful, consensus-driven approach[63].
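A consensus gate of that sort could be as simple as the sketch below (an assumption about how an "LLM consensus and confidence algorithm" might be wired up; the reviewer functions are stubs standing in for separate model calls or runs):

```python
from typing import Callable, List

# A reviewer scores a candidate patch: (diff, bug_description) -> confidence in [0, 1].
Reviewer = Callable[[str, str], float]

def review_model_a(diff, bug): return 0.9   # stand-ins for independent model opinions
def review_model_b(diff, bug): return 0.8
def review_model_c(diff, bug): return 0.4

def should_submit(diff: str, bug: str, reviewers: List[Reviewer],
                  approve_at: float = 0.7, quorum: float = 2 / 3) -> bool:
    """Submit only if enough independent reviews rate the patch above threshold."""
    votes = [r(diff, bug) for r in reviewers]
    approvals = sum(v >= approve_at for v in votes)
    return approvals / len(votes) >= quorum     # a majority of models must agree

if __name__ == "__main__":
    diff = "--- a/png.c\n+++ b/png.c\n+ if (len > row_size) return -1;\n"
    print(should_submit(diff, "heap overflow in decode_row",
                        [review_model_a, review_model_b, review_model_c]))   # True (2 of 3)
```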
The orchestrator, Optimus Zero, probably also had some AI logic. It might have used heuristic or learned (AI) strategies to assign tasks – e.g., using machine learning to predict which module is likely vulnerable based on past data, and thus allocate more fuzzing there. But given the limited performance, it might have been rule-based rather than learned, as they emphasized reliability over new techniques.
The cost data shows Lacrosse spent very little: $631 on LLMs and $7.1k on compute (the lowest by far)[131]. That is likely because they didn't use external AI heavily (perhaps some small-model usage or minimal queries) and their scale was smaller than others' (with 300–500 agents, possibly not running at full throttle or not all heavy processes). Their cost per point was the highest, at $751, meaning the approach was not resource-efficient in terms of score[120]. But they likely aimed for a robust system within a modest budget, possibly as a small-business-track team or similar.
Performance & Results: Lacrosse finished 7th (last) in the finals, with only 1 vulnerability found and 1 patched[65], for 10 points. This is a stark drop from the others and indicates something went wrong or underperformed. It's possible their system actually found more but didn't get credit due to duplicates or timing. The fact that they submitted the most vulns/PoVs (Consensus Contributor) suggests they tried many, but only one successfully scored. This likely means many of their submissions were duplicates of others' findings or came too late. In the DEF CON final format, when multiple teams find the same bug, the team that submits first (with a PoV or patch) gets more points, and late patches to a bug someone else already patched can reduce accuracy, since the bug is already fixed. Lacrosse might have identified bugs more slowly than others, so by the time they submitted, another team had already patched the issue, making their patch either unnecessary or harmful to their score. This is speculation, but it fits the data: lots of attempts and little payoff, possibly due to lateness.
On the positive side, the one bug they did patch gave them those achievements (because it was a unique real bug that only they found). So their conservative, thorough approach did yield one unique discovery. But competitively, speed and volume are key, and they fell behind. Possibly their legacy system, while stable, wasn't designed for the complexity of real modern code and needed more AI infusion to be competitive, but they may have run out of time to modernize everything. The ToB blog's admiring remark – "This says a lot about its ability to …" (the quote is truncated, but likely refers to its staying power or adaptability) – suggests respect that an old system could still make the finals[61].
Team Lacrosse’s documentation is limited publicly, but DARPA posted a YouTube team interview[158], and a CTF Radio podcast exists with their team members discussing their final submission[157]. These likely elaborate on their approach and motivations (e.g., perhaps a DARPA legacy or a focus on integration with government systems). The open-source archive shows their code on GitHub as well.
Strengths & Weaknesses: Lacrosse’s strength was stability and low risk. By using a known architecture and scaling it moderately, they avoided catastrophic failures – their system did run end-to-end and didn’t bomb out (evidenced by their achievements and at least one success). The orchestration by Optimus Zero indicates good control over their processes, likely leading to fewer crashes of the CRS itself (something even Team Atlanta worried about – one bug can kill a CRS[74]). They also achieved near-perfect PoV success ratio, meaning they rarely submitted a PoV that didn’t work[66]. That suggests strong validation and correctness checking on their part – a valuable trait in real deployments, where false positives are a waste of time. And finding a unique bug shows they did have capability beyond trivial finds.
However, their weaknesses were speed and adaptability. They were too conservative and slow in a game that rewarded fast exploitation. Their fuzzers may have found crashes, but they may have waited too long to confirm them or to generate PoVs. Or they might have targeted the wrong things first (perhaps spending time on a huge codebase where they found only one bug, while ignoring easier targets early on). The minimal use of AI meant they didn't take advantage of the leaps in reasoning or analysis that others did – an old system might not catch subtle logic bugs or might drown in false positives without an AI to filter them. Also, having fewer fuzzers than others (500 vs thousands) and not pushing them as hard (given the low compute cost) meant less coverage and fewer crashes overall.
In essence, Lacrosse was a testament to reliability over novelty, which in this first-of-its-kind AI challenge, wasn’t enough to score high. They effectively demonstrated that an older CRS, even with some LLM help, can still function on modern code, but to win, embracing more AI and aggressive strategies is necessary. For real-world transition, though, some of Lacrosse’s philosophies (like patch confidence algorithms and not overspending resources) could be beneficial.
SHERPA – Automated Fuzz Harness Generation (Auxiliary Tool)¶
Core Operation & Approach: SHERPA (Security Harness Engineering for Robust Program Analysis) is an open-source tool released alongside the competition, focused on a specific but crucial problem: generating fuzzing harnesses for attacker-controlled entry points in software[159][160]. In simpler terms, SHERPA automates the creation of the glue code that lets a fuzzer target the highest-level interfaces of a program – those that real attackers would use – rather than internal helper functions. This addresses a big gap: fuzzers often test library APIs or internal functions because that’s where harnesses exist (like OSS-Fuzz provides many harnesses at low-level APIs), but real attacks typically go through top-level entry points (like feeding a whole file or network packet into the program)[161][162]. SHERPA shifts fuzzing “up the stack” to these entry points[162], meaning it tries to fuzz the program in a way that mirrors actual usage (e.g., fuzz the main() or the request handler rather than just a parsing subroutine).
To do this, SHERPA’s workflow is: given a large project (say an open-source server or utility), analyze its coverage gaps – identify parts of code not exercised by existing fuzzers[163], focus on those that are attacker-controlled entry points (like functions reading external input)[160][164], and then use LLMs to generate fuzz harness code for those entry points[165][166]. The harness code is basically a piece of C/C++ (or Java) that calls into the target code in the right way with some fuzzer-provided data. SHERPA then compiles this harness, runs a fuzzing campaign on it, and monitors for crashes[167][168]. If compilation fails or linking issues arise, SHERPA has an auto-fix loop where an LLM tries to fix build errors in the harness code and rebuild[169][170]. This is very useful because writing correct harnesses can be tricky with dependencies and type matching – SHERPA essentially iterates until the harness compiles (“Build-Until-Green”)[171].
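The Build-Until-Green loop can be pictured with the following simplified sketch (SHERPA's real pipeline drives actual compilers and OSS-Fuzz builds; here both the compiler and the LLM fixer are stand-ins):

```python
from typing import Optional, Tuple

def compile_harness(source: str) -> Tuple[bool, str]:
    """Stand-in for invoking the real compiler; returns (ok, error_message)."""
    if "#include <stdint.h>" not in source:
        return False, "error: unknown type name 'uint8_t'"
    return True, ""

def llm_fix(source: str, error: str) -> str:
    """Stand-in for sending the harness plus compiler error back to an LLM."""
    if "uint8_t" in error:
        return "#include <stdint.h>\n" + source
    return source

def build_until_green(source: str, max_rounds: int = 5) -> Optional[str]:
    for _ in range(max_rounds):
        ok, err = compile_harness(source)
        if ok:
            return source                    # harness compiles: hand off to fuzzing
        source = llm_fix(source, err)        # feed the error back and retry
    return None                              # give up on this target

harness = ("int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) "
           "{ parse(data, size); return 0; }")
print(build_until_green(harness))
```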
When fuzzing finds crashes with these new harnesses, SHERPA again invokes an LLM for crash analysis. It automatically triages crashes, filtering out false positives (like if the harness itself was flawed or a benign bug) by using AI to analyze stack traces and error patterns[172][173]. The LLM can discard >80% of raw crashes as not real bugs[174][175], achieving a ~67% precision of real bugs in those that remain[176][174] – a huge efficiency gain. Finally, SHERPA outputs maintainer-friendly artifacts: for each validated bug, it provides a reproducible test case, an analysis of the bug, and even patch guidance[177][178]. The idea is to make it easy for open-source maintainers to understand and fix the issue, bridging the gap from raw fuzzer crash to actionable report.
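Reduced to its essentials, that triage stage is deduplication followed by an LLM judgment per unique crash; the sketch below illustrates the flow with stubbed data and a stubbed "is this a real bug?" check (the filtering rule shown is an assumption, not SHERPA's actual logic):

```python
import hashlib

raw_crashes = [  # hypothetical crashes from the generated harnesses
    {"stack": ("decode_row", "read_image"), "error": "heap-buffer-overflow"},
    {"stack": ("decode_row", "read_image"), "error": "heap-buffer-overflow"},  # duplicate
    {"stack": ("LLVMFuzzerTestOneInput",),  "error": "null passed by harness"},
]

def signature(crash) -> str:
    """Deduplicate crashes by a hash of their stack frames."""
    return hashlib.sha1("|".join(crash["stack"]).encode()).hexdigest()[:12]

def llm_is_real_bug(crash) -> bool:
    # Placeholder for the LLM judgment; here, crashes rooted in the harness entry
    # point itself are treated as harness artifacts rather than target bugs.
    return crash["stack"][0] != "LLVMFuzzerTestOneInput"

unique = {signature(c): c for c in raw_crashes}.values()
real = [c for c in unique if llm_is_real_bug(c)]
print(f"{len(raw_crashes)} raw -> {len(list(unique))} unique -> {len(real)} likely real")
```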
Traditional Techniques: SHERPA stands on existing fuzzing technology – it doesn’t replace fuzzers like AFL/libFuzzer; it augments them. It uses coverage analysis (likely via coverage traces from existing tests or fuzzers) to see what parts of code are not fuzzed yet[176]. Then it picks those areas, specifically focusing on code that handles untrusted input (for instance, functions that parse file formats or process network data – essentially anything that reads from an external source). This targeting may use static analysis or simple heuristics (like function names containing “parse” or pointer arguments that look like data buffers).
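One plausible (and purely hypothetical) version of such targeting heuristics is sketched below: among uncovered functions, prefer those whose names and signatures suggest they consume attacker-controlled buffers.

```python
functions = [  # hypothetical output of coverage + signature analysis
    {"name": "parse_request", "params": ["const char *buf", "size_t len"], "covered": False},
    {"name": "hash_internal", "params": ["uint64_t x"],                    "covered": False},
    {"name": "decode_png",    "params": ["const uint8_t *data", "size_t"], "covered": True},
]

INPUT_HINTS = ("parse", "decode", "read", "handle", "load")

def looks_like_entry_point(fn) -> bool:
    """Crude heuristic: input-sounding name plus a byte-buffer-style parameter."""
    named_like_input = any(h in fn["name"] for h in INPUT_HINTS)
    takes_buffer = any("*" in p and ("char" in p or "uint8_t" in p) for p in fn["params"])
    return named_like_input and takes_buffer

targets = [f["name"] for f in functions if not f["covered"] and looks_like_entry_point(f)]
print("harness candidates:", targets)   # -> ['parse_request']
```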
Once a target function is chosen, writing a harness is partly straightforward (calling the function with fuzzer data), but often needs some scaffolding: e.g., if the function expects a complex object, SHERPA might need to construct one (maybe by calling initializers or setting global state). It likely uses static analysis to see what that function requires (like what global variables need to be set, or what format the input should be). Some static analysis could be analyzing function signatures or reading comments (LLM can help here by reading code and inferring usage). SHERPA then fuzzes with libFuzzer or AFL (it’s integrated into OSS-Fuzz style pipelines)[179]. The harnesses it generates are OSS-Fuzz compliant, so they can be contributed back to OSS-Fuzz easily if desired[180].
For triage, beyond LLM, it uses sanitizer output and likely checks if a crash is unique (deduplicating by crash stack or ASan report). It filtered 100+ raw crashes to 27 high-signal ones, from which 18 were confirmed real bugs[176][181]. Those numbers indicate heavy filtering. It also uses an interesting trick: if a harness or crash is due to misuse of API or unreachable condition, they filter that out (some automated logic likely flags known harness errors or unreachable states)[172].
AI-Based Techniques: SHERPA's power comes from integrating LLMs at two key points: harness generation and crash analysis. For harness generation, SHERPA uses LLM agents that understand code context. The LLM is fed information about the target function and possibly surrounding code (such as function docstrings or definitions)[182][179]. It then outputs a candidate harness in C/C++ code. Essentially, the LLM is writing new code – a fuzz entry function that calls into the target code appropriately. This is where prompt engineering matters: they likely instruct the LLM on which calls are safe to use and how to structure a fuzz harness, and provide an example. They mention "Key Prompt Engineering Innovations: Security-First Target Selection"[183], meaning the prompt emphasizes picking real attack surfaces. A "Multi-Stage AI Pipeline" is also referenced[184] – Stage 1: target selection (the LLM may help identify which functions are entry points), Stage 2: harness code generation, Stage 3: build fixing, Stage 4: fuzzing (non-AI), and Stage 5: crash triage by LLM[185][166].
For build fixing, when the harness fails to compile, an LLM is given the compiler error output and the harness code, and asked to adjust the code to fix the error[186][169]. This is a common use of codex-like models and can save a lot of human time. SHERPA keeps iterating this until it compiles or maybe a limit is reached. This automated debug loop is crucial for scaling to many targets without manual intervention.
During crash analysis, an LLM is tasked with reading sanitizer logs or core dumps to decide if a crash is likely a legit security bug or a false positive. It filtered out ~100 false positives automatically[172][173]. The LLM also categorizes the bug (CWE type, severity) and possibly suggests a patch (the readme says each bug comes with patch guidance[187]). That suggests the LLM might even point out "checking this length before copying would fix it" or similar. SHERPA's readme lists a summary of bug types it found (CWE-770 uncontrolled mem alloc, etc.)[188][189], which shows it can find a variety, many denial-of-service type issues (out-of-memory, infinite loops, etc.) that typical fuzzing might find but not prioritize.
SHERPA’s use of AI is targeted and effective: rather than trying to analyze entire program flows, it uses AI to solve specific subproblems (writing harness code and interpreting crashes) that humans normally have to do. This dramatically speeds up fuzzing on new code sections (it found 18 new bugs across multiple OSS-Fuzz projects in minutes to hours, which might take humans weeks)[174][175].
Performance & Results: SHERPA is not a competitor system but a tool produced (apparently by a collaboration including team members from Shellphish/Trail of Bits – contributors include "blue-orbiter Alex" (possibly Alex Groce?) and Matt Lehman of Trail of Bits[190][191]). It was presented as a big success: it uncovered 18 confirmed real vulnerabilities (in widely-used projects) from 27 high-signal crashes, out of 127+ raw crashes found[176][181]. That's 18 bugs that presumably were unknown before (the text says CVE-class bugs, presumably candidates for CVEs) in production software. These include multiple memory safety issues (overflows, double-free, etc.) and several algorithmic complexity or DoS issues[188][189]. The precision of ~67% from crash to real bug[176][174] is far better than typical fuzzing (1-5% per their note)[192]. And the speed is noted: minutes of targeted fuzzing to find these vs. weeks of coverage fuzzing normally[174][175]. This is a huge improvement indicating how focusing on entry points pays off.
The SHERPA approach “changes everything” in their words because it targets bugs that attackers actually exploit, not just obscure internal errors[161][160]. For example, many fuzzers find crashes deep in libraries that might not be reachable in actual use, whereas SHERPA finds issues in how an app handles a whole file or request, which are more directly exploitable (like memory blowups or overflow on real inputs)[161][160].
Documentation for SHERPA is thorough – the GitHub readme reads like a whitepaper with motivation, methodology, and even a case study (LevelDB)[193]. It’s open-source (MIT licensed) and intended to be extended (they invite adding YAML configs for more targets)[194]. It’s a tool likely to be widely useful beyond the competition.
Strengths & Weaknesses: SHERPA’s strength is enabling fuzzing where it previously wasn’t feasible. It leverages AI to dramatically reduce the manual labor of writing harnesses and triaging fuzz results, which are two big bottlenecks in fuzzing adoption. By moving fuzz targets to real entry points, it finds high-impact bugs quickly[161][160]. The combination of coverage analysis and AI ensures it focuses on un-fuzzed code that matters, making it efficient. The automated build-fix loop with AI is also a strength – it’s like having a junior engineer fix your code under supervision. Another strength is that SHERPA outputs actionable artifacts that can directly help maintainers fix issues, bridging the gap between security researchers and developers.
One possible weakness is that LLM-generated harnesses might sometimes be imperfect – for example, the LLM could write a harness that assumes something about the code that isn’t true, or doesn’t correctly initialize some structure, which could lead to false crashes or missed bugs. However, SHERPA’s iterative fixes and AI triage presumably catch many such issues (false crash = filter out). The accuracy of LLM in understanding complex APIs might also vary – if a project requires deep domain knowledge to initialize properly, the harness might not hit the bug in a correct way. But since SHERPA still found many bugs, it indicates it worked well for the targets they tried.
Another challenge could be scaling to extremely large projects – the coverage analysis and target identification might be slow for huge codebases, and the LLM context might not fit if a target requires reading a lot of code. They did mention applying it to multiple OSS-Fuzz projects and specifically call out a new harness for LevelDB as an example (with a detailed write-up included)[195][193]. It likely works best for modular projects with clear entry points.
Overall, SHERPA is an impressive marriage of AI and fuzzing, automating a critical but tedious part of vulnerability discovery. It’s not a direct competitor’s CRS, but rather a tool that any team (and now the public) can use to improve fuzz coverage on real software. It shows how the innovations from AIxCC can be spun off to benefit the broader ecosystem immediately[196][197].
Conclusion – Comparative Insights¶
The AIxCC finalists collectively demonstrated a spectrum of approaches combining traditional cybersecurity techniques with AI, each with unique strengths:
- Team Atlanta's Atlantis proved that an ensemble-of-tools strategy augmented by targeted LLM integration can yield the most robust results. Its strength was balancing innovation (custom models, multi-agent LLM use) with proven methods (fuzzing, concolic execution)[81][9], resulting in top scores across the board. Atlantis highlighted that engineering reliability and diversity in approach (multiple independent CRSs) is as important as raw AI prowess[198][68].
- Trail of Bits' Buttercup showed the value of AI-assisted efficiency. By inserting AI where it adds clear value – generating smarter inputs and assisting patch generation – Buttercup achieved high accuracy and breadth with less expenditure[24][37]. It underscored that one can leverage AI to boost traditional tools (LLM-generated seeds for fuzzers, multi-agent patching) without over-relying on expensive models, thus optimizing cost/benefit[19][15].
- Theori's RoboDuck pushed the envelope on LLM-driven autonomy, confirming that an AI-first system can indeed find and exploit real bugs with minimal human-like heuristics[103][110]. Its success in 3rd place validated the concept of chaining LLM "agents" to perform tasks like static analysis, exploit development, and patching in a loop[27][30]. RoboDuck's performance suggests that as AI models improve, this approach could become even more dominant – though Theori had to carefully constrain its agents to keep them on track[29].
- All You Need is a Fuzzing Brain demonstrated the power (and cost) of massively parallel AI. It found and fixed bugs by orchestrating an army of AI workers in parallel[39], achieving notable wins in speed and static accuracy[38]. The lesson from FuzzingBrain is that scaling out AI can brute-force results, but efficiency suffers (highest cost per point)[120]. This approach might be more viable as AI inference costs drop, but for now it was an expensive route to fourth place – albeit one that shows AI "swarm intelligence" can rival more methodical strategies.
- Shellphish's Artiphishell embodied a comprehensive multi-agent ecosystem. With 60+ agents dividing every sub-task[41], it was arguably the most ambitious integration of AI and security expertise. It excelled in patch correctness and complex-input fuzzing (via Grammar Guy)[42][46], though it lagged in volume of outputs. The takeaway is that highly modular AI systems can work and ensure quality, but coordinating them is complex – a single bottleneck or miscoordination can slow the whole system. Artiphishell's design will likely inspire future research into multi-agent autonomy and how to optimize it.
- 42-b3yond-6ug's Bug Buster emphasized extreme scale fuzzing and clever patching, finding the most bugs after the winner[57]. It proved that given enough compute and refined scheduling (RL-based), fuzzers remain incredibly potent[139][52]. However, their stumble in patching ("super patches" with a bug) highlights that novel AI ideas need rigorous testing – an insightful lesson that development speed and code quality can be a deciding factor even with great algorithms[152][58]. Bug Buster's journey also illustrated the difficulty of integrating multiple advanced components (fuzzing, RL, static slicing, etc.) under time pressure – a more focused scope might have yielded a higher rank.
- Lacrosse showed a conservative, legacy-informed approach, which, while yielding minimal points, serves as a baseline: an earlier-generation CRS updated with modest AI can still operate on modern targets[62], but it will likely be outpaced by more aggressive AI integration[64]. Lacrosse's strong reliability yet low score tells us that in this new era, playing it safe is not enough – AI and automation are required to achieve state-of-the-art results, even if the traditional methods give a stable foundation.
- SHERPA, finally, is a standout contribution that, while not a competitor's CRS, complements the above by addressing a practical gap with AI. It exemplifies how AI can be surgically applied to automate tedious security tasks (harness writing, triage)[164][167], dramatically amplifying human ability to secure code. The fact that SHERPA found numerous bugs quickly in real software[176][174] underscores the real-world impact potential of these AIxCC innovations.
In summary, the competition affirmed that AI is now an indispensable part of automated cybersecurity: all top teams used machine learning in some form to achieve results unattainable by classical techniques alone. However, it also showed that hybrid approaches – blending AI with time-tested methods – currently have an edge over purely AI or purely traditional systems. The winning teams were those who managed that blend most effectively (or in Theori’s case, tamed an AI-first approach with enough guardrails).
Key comparative takeaways include:
- Core Approaches: Teams like Atlanta and Shellphish that combined multiple approaches (fuzzing + static + concolic + AI) performed extremely well, suggesting that diversity equals strength in CRS design[198][75]. In contrast, a single-technique focus (e.g., the fuzzing-heavy 42-b3yond-6ug or the AI-heavy FuzzingBrain) had either patch or cost issues.
- AI Usage: Using AI to assist and enhance (Buttercup, Shellphish) tended to yield high accuracy and efficient use of budget[24][37]. Fully AI-driven systems (Theori, FuzzingBrain) found lots of bugs but required careful constraint or heavy resources[29][120]. A striking point was how all teams used AI for patching – there was broad consensus that LLMs are suitable for generating code fixes[109]. The differentiator was how much trust and autonomy they gave those AI components in the overall loop.
- Traditional Techniques: The competition reinforced that techniques like fuzzing are not obsolete – in fact, they were central to every team's success, but often turbocharged by AI. Teams that extended and scaled traditional methods (Atlanta's multi-fuzzers, ToB's static queries, 42's RL fuzzing) ended up in the top half[11][22]. Traditional static analysis alone was insufficient due to noise – but integrated with AI (Theori, FuzzingBrain, Shellphish SARIF handling), it became useful[108][38].
- Performance Metrics: Final scores aside, different tools excelled in different metrics. For instance, 42-b3yond-6ug led in raw finds, Shellphish led in patch accuracy, FuzzingBrain in speed of first find, ToB in CWE coverage[16][48]. This suggests each CRS carved out a niche of strength. A future ideal system might integrate the strengths of all – the coverage of Buttercup, the depth of RoboDuck, the thoroughness of Atlantis, the harness generation of SHERPA – to cover all metrics.
- Documentation & Usability: Most teams have open-sourced their code and provided documentation or blogs, but complexity varies. Buttercup appears relatively turnkey (they emphasize one can run it on a laptop)[102], whereas Atlantis and Artiphishell are complex distributed systems that require cloud infrastructure and careful configuration[126][127]. Usability will be key for these CRSs to be adopted outside the competition. In that sense, SHERPA and Buttercup set good examples by focusing on being accessible and well-documented[102][193].
In conclusion, the AIxCC finalists collectively advanced the state-of-the-art by demonstrating that AI and traditional program analysis can be fused to automatically secure real-world software at scale and speed previously impossible. Each tool brought a unique innovation: Atlantis with multi-engine fault tolerance[5], Buttercup with cost-effective LLM integration[18], RoboDuck with end-to-end AI reasoning[27], FuzzingBrain with massive agent parallelism[39], Artiphishell with large-scale agent collaboration[41], Bug Buster with RL-guided fuzzing and multi-bug patches[52][51], and Lacrosse with a demonstration of stability vs. innovation trade-offs[64]. And supporting all of them, SHERPA showcased an immediately applicable tool for the community[199][182].
The competition may be over, but its impact is just beginning: these open-source CRSs and tools can now be tested, combined, and built upon by the broader cybersecurity community[196][197]. The strengths of each can potentially compensate for the weaknesses of others. The future likely lies in hybrid systems that use AI intelligently – neither ignoring its power nor using it blindly – to cover the full spectrum of software security tasks. AIxCC has indeed marked “a pivotal inflection point” for cyber defense[200], showing that automated systems augmented by AI are not just a research dream but a practical reality that can find and fix critical vulnerabilities at scale[1][201]. The tools compared in this report are the first generation of that reality; with continued development and community involvement, we can expect even more robust AI-driven security solutions in the coming years. Each of these systems, in its own way, contributes to that vision of “Security-AGI” – autonomous agents defending code, faster than attackers can find bugs[202][203].
[1] [15] [16] [17] [18] [20] [21] [22] [24] [25] [36] [37] [91] [95] [99] [100] [101] [102] [120] [131] Trail of Bits' Buttercup wins 2nd place in AIxCC Challenge – The Trail of Bits Blog
https://blog.trailofbits.com/2025/08/09/trail-of-bits-buttercup-wins-2nd-place-in-aixcc-challenge/
[2] [26] [27] [28] [31] [32] [103] [104] [105] [106] [107] [109] [110] [111] [112] [118] AI Cyber Challenge and Theori's RoboDuck - Theori BLOG
https://theori.io/blog/aixcc-and-roboduck-63447
[3] [4] [9] [10] [12] [19] [29] [30] [33] [39] [42] [51] [52] [55] [61] [62] [63] [64] [79] [96] [108] [142] AIxCC finals: Tale of the tape – The Trail of Bits Blog
https://blog.trailofbits.com/2025/08/07/aixcc-finals-tale-of-the-tape/
[5] [6] [7] [8] [11] [44] [45] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] [80] [81] [82] [83] [84] [85] [86] [92] [94] [119] [130] [135] [198] AIxCC Final and Team Atlanta
https://team-atlanta.github.io/blog/post-afc/
[13] [14] [23] [34] [35] [38] [40] [46] [47] [48] [57] [59] [65] [66] [67] [87] [88] [89] [90] [97] [98] [116] [117] [121] [122] [132] [196] [197] [201] aicyberchallenge.com
[41] [128] [129] [133] Shellphish Submits ARTIPHISHELL to the Final Round of the DARPA AI Cyber Competition | The Robert Mehrabian College of Engineering - UC Santa Barbara
[43] [126] [127] GitHub - shellphish/artiphishell
https://github.com/shellphish/artiphishell
[49] [50] [53] [54] [56] [58] [60] [138] [139] [140] [141] [143] [144] [145] [146] [147] [148] [149] [152] [154] We lost the AIxCC. So, what now?
https://www.whexy.com/en/posts/aixcc
[93] Team-Atlanta/aixcc-afc-atlantis - GitHub
https://github.com/Team-Atlanta/aixcc-afc-atlantis
[113] Theori on X: "So, how did our #AIxCC finalist RoboDuck actually pull ...
https://x.com/theori_io/status/1955886400953589901
[114] Theori on X: "Find the full breakdown of how RoboDuck works, from ...
https://x.com/theori_io/status/1953978623360397373
[115] theori-io/aixcc-public: Public results and products from ... - GitHub
https://github.com/theori-io/aixcc-public
[123] [124] [125] GitHub - o2lab/asc-crs-all-you-need-is-a-fuzzing-brain
https://github.com/o2lab/asc-crs-all-you-need-is-a-fuzzing-brain
[134] Wenbo Guo (@WenboGuo4) / X
[136] [150] AI Security Tools — August 2025 - Medium
https://medium.com/@taleliyahu/ai-security-tools-august-2025-9a6c449f0402
[137] ARPA-H to join DARPA in announcing AI Cyber Challenge winners ...
[151] DARPA Selects Small Businesses to Compete in the AI Cyber ...
https://www.darpa.mil/news/2024/small-business-ai-cyber-challenge
[153] oss-security - Re: AI Cyber Challenge (AIxCC) semi-final results from DEF CON 32 (2024)
https://www.openwall.com/lists/oss-security/2024/08/19/2
[155] GitHub - 42-b3yond-6ug/42-b3yond-6ug-asc: This repository hosts our team’s submission to the AI Cyber Challenge Semifinal Competition.
https://github.com/42-b3yond-6ug/42-b3yond-6ug-asc
[156] DEF CON 33 Video Team - DEFCON AIxCC Lacrosse Team
https://www.youtube.com/watch?v=YdJ6NoxzuKs
[157] LACROSSE's AIxCC Final Submission | 01C CTF Radiooo - YouTube
https://www.youtube.com/watch?v=uO1LnENZV_4
[158] DARPA AI Cyber Challenge - Team Lacrosse - YouTube
https://www.youtube.com/watch?v=ULJcATTuT9U
[159] AIxCC Competition Archive | AIxCC Competition Archive
https://archive.aicyberchallenge.com/
[160] [161] [162] [163] [164] [165] [166] [167] [168] [169] [170] [171] [172] [173] [174] [175] [176] [177] [178] [179] [180] [181] [182] [183] [184] [185] [186] [187] [188] [189] [190] [191] [192] [193] [194] [195] [199] GitHub - AIxCyberChallenge/sherpa: Security Harness Engineering for Robust Program Analysis
https://github.com/AIxCyberChallenge/sherpa
[200] AIxCC Semifinal Competition (ASC) - AI Cyber Challenge