
Why Read this Guide?
This guide shows how cybersecurity professionals can get up and running quickly with AI tools to:
- augment your skills and amplify your impact
- reduce your toil
Large Language Models (hosted and open source) and BERT-based language model tools will be covered, e.g.:
- NotebookLM
- Prompts and Prompt Techniques applied to Cybersecurity
- BERTopic
- Embeddings for CyberSecurity
- AI Agents for orchestrated workflows
Risk Based Prioritization Guide
Check out another guide I co-wrote with various thought leaders in vulnerability management: https://riskbasedprioritization.github.io/
The Risk Based Prioritization described in that guide significantly reduces the:
- cost of vulnerability management
- risk, by reducing the time adversaries have access to the vulnerable systems they are trying to exploit.
Preface¶
Language Models are powerful tools that can be applied to CyberSecurity.
I enjoy learning about, playing with, and applying these tools to better solve problems:
- I wrote this guide for me, to organize my thoughts and my play time as I play with, and apply, Language Models.
- I've found by putting something out there, you get something back.
- It's the guide I wish existed already.
This approach worked well for the Risk-Based Prioritization guide...
You may find it useful.
Introduction¶
About this Guide
This guide is currently in an early-access state and is written to organize my play time as I play with, and apply, these tools.
The content is about solving real problems, e.g. how to:
- view the main topics in a set of documents
- validate assigned CWEs, and suggest CWEs to assign
- chat with large documents
- extract configuration parameters from user manuals.
These examples were driven by a user need.
While the examples focus on specific areas, the techniques can be applied generally to many others.
After reading this guide you should be able to:
- Apply Language Models to augment and amplify your skills.
- Understand the types of problems that suit Language Models, and those that don't
Overview¶
Intended Audience¶
The intended audience is people wanting to go beyond the hype and basics of Large Language Models.
No prior knowledge is assumed to read the guide - it provides just enough information to understand the advanced topics covered.
A basic knowledge of Python and Jupyter notebooks is required to run the code (with the data provided, or on your own data).
How to Use This Guide¶
How to Contribute to This Guide¶
You can contribute content or suggest changes.
Writing Style¶
The "writing style" in this guide is succinct, and leads with an opinion, with data and code to back it up i.e. data analysis plots (with source code where possible) and observations and takeaways that you can assess - and apply to your data and environment. This allows the reader to assess the opinion and the code/data and rationale behind it.
Different, and especially opposing, opinions backed by data are especially welcome - they will help shape this guide.
Quote
If we have data, let’s look at data. If all we have are opinions, let’s go with mine.
Notes¶
Notes
- This guide is not affiliated with any Tool/Company/Vendor/Standard/Forum/Data source.
- Mention of a vendor in this guide is not a recommendation or endorsement of that vendor.
- This guide is a living document i.e. it will change and grow over time - with your input.
This guide is not about which tool is better than another.
"Don't fall in love with models: they're expendable. Fall in love with data!"
Julien Simon, Chief Evangelist, Hugging Face
Warning
This space is rapidly evolving, so the content in this guide may become outdated or inaccurate.
Warning
You are responsible for your data and where it goes.
If you don't understand where your data goes, and what happens to it for a given model or tool, then find out before you use private or personal data.
To evaluate models and tools, you can start with public data.
Model Types¶
Overview
This section gives an overview of different model types.
Introduction¶
Different types of text models are designed with varying architectures, training data, and optimization goals, leading to distinct capabilities and best-fit use cases.
While many models focus primarily on text, an increasing number are becoming multimodal, capable of processing and generating information across different types of data, such as text, images, audio, and even video.
Today, these different types of models are generally accessed by selecting a specific model or API endpoint provided by the developer (rather than through a single interface that automatically picks the best type of model for the task).
Deep Research¶
Models in this category are typically designed for in-depth information retrieval, synthesis, and report generation from vast amounts of data, often involving browsing and analyzing multiple sources. They aim to provide comprehensive and well-supported answers to complex queries.
Key Insights:
- Extensive Information Gathering: Excel at searching and processing information from large and diverse datasets, including the web or private document repositories.
- Synthesis and Structuring: Capable of synthesizing information from various sources into coherent and structured reports or summaries.
- Handling Complexity: Designed to tackle complex, multi-faceted research questions that require connecting information across different domains.
- Citation and Verification: Often include features for citing sources, allowing users to verify the information presented.
Use Cases:
- Generating detailed reports on niche or complex topics.
- Performing market research by analyzing industry trends and competitor information.
- Assisting academics and researchers in literature reviews and synthesizing findings.
- Providing comprehensive answers to complex legal, medical, or scientific questions.
- Analyzing large volumes of internal documents to extract insights.
Examples: Perplexity Deep Research, ChatGPT Deep Research, Gemini Deep Research, HuggingFace Open Deep Research, Claude 3 Opus
Reasoning¶
Reasoning-focused models are optimized to perform complex logical deductions, solve problems requiring multiple steps, and understand intricate relationships between concepts. They are built to "think" through problems rather than just retrieving information or generating text based on patterns.
Key Insights:
- Logical Deduction: Strong capabilities in applying logical rules and deriving conclusions from given premises.
- Multi-Step Problem Solving: Can break down complex problems into smaller, manageable steps and follow a chain of thought to reach a solution.
- Mathematical and Scientific Reasoning: Often perform well on mathematical problems, coding challenges, and scientific inquiries that require step-by-step analysis.
- Reduced Hallucination in Complex Tasks: While still a challenge for all LLMs, models focused on reasoning aim to reduce the likelihood of generating false or inconsistent information in complex scenarios by showing their work or using techniques like self-correction.
Use Cases:
- Solving complex mathematical equations or proofs.
- Debugging code and suggesting logical fixes.
- Analyzing data and drawing reasoned conclusions.
- Assisting in strategic planning by evaluating scenarios and predicting outcomes.
- Providing step-by-step explanations for complex concepts or solutions.
- Excelling at benchmarks requiring logical inference and problem-solving.
Examples: DeepSeek-R1, OpenAI’s GPT-4, Google’s Gemini Ultra, Anthropic’s Claude 3 Sonnet, Meta’s Llama 3
General Purpose¶
General-purpose LLMs are designed to be versatile and handle a wide array of natural language tasks. They are trained on broad datasets to provide a good balance of capabilities across different domains without being specifically optimized for one.
Key Insights:
- Versatility: Capable of performing a wide range of tasks, including text generation, summarization, translation, question answering, and creative writing.
- Broad Knowledge: Possess a vast amount of general knowledge from their diverse training data.
- Adaptability: Can often adapt to different styles and formats based on the prompt.
- Accessibility: Typically the most widely available and accessible models for everyday use.
Use Cases:
- Drafting emails, articles, and other written content.
- Summarizing documents or long texts.
- Translating text between languages.
- Answering general knowledge questions.
- Brainstorming ideas and assisting in creative writing.
- Powering chatbots and virtual assistants for a variety of inquiries.
Examples: GPT-4 Turbo, Gemini Pro, Claude 3 Haiku, Mistral Large, Cohere Command R+
Code¶
Code-focused models are specifically trained on large datasets of code from various programming languages and sources. They are designed to understand, generate, and assist with programming tasks.
Key Insights:
- Code Generation: Can generate code snippets, functions, or even entire programs based on natural language descriptions or prompts.
- Code Completion: Provide intelligent suggestions for completing code as developers type.
- Code Explanation and Documentation: Can explain how code works and generate documentation.
- Debugging and Error Detection: Assist in identifying potential errors and suggesting fixes in code.
- Code Translation: Translate code between different programming languages.
- Support for Multiple Languages: Trained on a wide variety of programming languages.
Use Cases:
- Speeding up software development by generating boilerplate code.
- Assisting developers in learning new programming languages or frameworks.
- Automating repetitive coding tasks.
- Improving code quality through suggestions and error detection.
- Generating test cases for software.
- Helping non-programmers understand or modify code.
Examples: CodeLlama (Meta), StarCoder (ServiceNow), Codex (OpenAI), DeepSeek Coder, Google’s Codey
Tip
This blog gives a good overview of different models' capabilities for code and echoes my experience.
Tip
Code Generation is a small part of Software Engineering.
Different model types are suitable for different Software Engineering Artifacts
LLMs for CyberSecurity¶
LLMs for CyberSecurity Users and Use Cases¶

Image from Generative AI and Large Language Models for Cyber Security: All Insights You Need.
Empathy Map¶

See Original Post.
Tip
See also MITRE’s Innovation Toolkit https://itk.mitre.org/toolkit/tools-at-a-glance/ a collection of proven and repeatable problem-solving methods to help you and your team do something different that makes a difference.
Targeted PreMortem for Trustworthy AI¶
In general, it is good practice to start with the end in mind, à la the "Destination Postcard" from the book Switch by Chip and Dan Heath, which looks at the aspirational positive outcomes.
It is also useful to run Premortems to proactively identify failures so they can be avoided, helping ensure those positive outcomes.
Quote
The Targeted Premortem (TPM) is a variant of Klein's Premortem Technique, which uses prospective hindsight to proactively identify failures. This variant targets brainstorming on reasons for losing trust in AI in the context of the sociotechnical system into which it is integrated. That is, the prompts are targeted to specific evidence-based focus areas where trust has been lost in AI. This tool comes with instructions, brainstorming prompts, and additional guidance on how to analyze the outcomes of a TPM session with users, developers, and other stakeholders.
References¶
LLMs for CyberSecurity References¶
- Generative AI and Large Language Models for Cyber Security: All Insights You Need, May 2024
- A Comprehensive Review of Large Language Models in Cyber Security, September 2024
- Large Language Models in Cybersecurity: State-of-the-Art, January 2024
- How Large Language Models Are Reshaping the Cybersecurity Landscape | Global AI Symposium talk, September 2024
- Large Language Models for Cyber Security: A Systematic Literature Review, July 2024
- Using AI for Offensive Security, June 2024
Agents for CyberSecurity References¶
- Blueprint for AI Agents in Cybersecurity - Leveraging AI Agents to Evolve Cybersecurity Practices
- Building AI Agents: Lessons Learned over the past Year
Comparing LLMs¶
There are several sites that allow comparisons of LLMs e.g.
- https://winston-bosan.github.io/llm-pareto-frontier/
- LLM Arena Pareto Frontier: Performance vs Cost
- https://artificialanalysis.ai/
- Independent analysis of AI models and API providers. Understand the AI landscape to choose the best model and provider for your use-case
- https://llmpricecheck.com/
- Compare and calculate the latest prices for LLM (Large Language Models) APIs from leading providers such as OpenAI GPT-4, Anthropic Claude, Google Gemini, Meta Llama 3, and more. Use our streamlined LLM Price Check tool to start optimizing your AI budget efficiently today!
- https://openrouter.ai/rankings?view=day
- Compare models used via OpenRouter
- https://github.com/vectara/hallucination-leaderboard
- LLM Hallucination Rate leaderboard
- https://lmarena.ai/?leaderboard
- Chatbot Arena is an open platform for crowdsourced AI benchmarking
- https://aider.chat/docs/leaderboards/
- Benchmark to evaluate an LLM’s ability to follow instructions and edit code successfully without human intervention
- https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
- Benchmark to evaluate language understanding models across broader and more challenging tasks
See also Economics of LLMs: Evaluations vs Pricing - Looking at which model to use for which task
Books¶
- Build a Large Language Model (from Scratch) by Sebastian Raschka, PhD
- LLM Engineer's Handbook by Paul Iusztin and Maxime Labonne
- AI Engineering by Chip Huyen
- Hands-On Large Language Models: Language Understanding and Generation, Oct 2024, Jay Alammar and Maarten Grootendorst
- Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG, October 2024, Louis-Francois Bouchard and Louie Peters
- LLMs in Production From language models to successful products, December 2024, Christopher Brousseau and Matthew Sharp
- Fundamentals of Secure AI Systems with Personal Data, June 2025, Enrico Glerean
NotebookLM¶
Overview
LLMs change the information retrieval paradigm. Instead of searching for information where we go to the information, we can chat with our documents and ask questions of them, so that the information comes to us in the form of an answer.
In this section, we'll use NotebookLM, and we just need to import our documents to be able to chat with them.
ChatGPT4o is also used for comparison to highlight where one is better applied than the other depending on the context.
- Both tools use LLMs, but NotebookLM uses a "Closed System" (only the document sources you provide), whereas ChatGPT4o bases its answers on the open internet content available at the time it was trained, plus any documents you provide.
Tip
Your responses from NotebookLM may be different than the examples shown here. LLMs will give different responses to the same question.
NotebookLM¶
Tip
Quote
NotebookLM lets you read, take notes, ask questions, organize your ideas, and much more -- all with the power of Google AI helping you at every step of the way.
Quote
Audio Overview, a new way to turn your documents into engaging audio discussions. With one click, two AI hosts start up a lively “deep dive” discussion based on your sources. They summarize your material, make connections between topics, and banter back and forth. You can even download the conversation and take it on the go.
Quote
It runs on the company’s Gemini 1.5 Pro model (released Dec 2023), the same AI that powers the Gemini Advanced chatbot. (ref)
Key Features and Benefits of Gemini 1.5 Models¶
Per the Gemini 1.5 Technical Report, the key features and benefits of Gemini 1.5 models are:
- Highly Compute-Efficient Multimodal Models
- Capable of recalling and reasoning over fine-grained information from millions of tokens of context, including long documents, videos, and audio.
- Benchmark Performance
- Outperforms other models such as Claude 3.0 (200k tokens) and GPT-4 Turbo (128k tokens) in next-token prediction and retrieval up to 10M tokens (approximately 7M words).
- Unprecedented Context Handling
- Handles extremely long contexts, up to at least 10M tokens (approximately 7M words).
- Capable of processing long-form mixed-modality inputs, including entire document collections, multiple hours of video, and almost five days of audio.
- Near-perfect recall on long-context retrieval tasks across various modalities.
- Realistic Multimodal Long-Context Benchmarks
- Excels in tasks requiring retrieval and reasoning over multiple parts of the context.
- Outperforms all competing models across all modalities, even those augmented with external retrieval methods.
These features make Gemini 1.5 models a generational leap over existing models, offering unparalleled performance in processing and understanding extensive and complex multimodal information.
Tip
Such systems map document content to vectors (numeric representations of words or tokens in multi-dimensional space).
Queries are based on similarity (proximity in vector space).
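As a rough illustration of that idea, here is a minimal sketch using the sentence-transformers library; the model name and example texts are arbitrary choices for illustration, not what NotebookLM uses internally.
```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Any small embedding model works for illustration; NotebookLM's internals are not public.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "CVSS measures the severity of a vulnerability.",
    "CISA KEV lists vulnerabilities known to be exploited in the wild.",
    "A dog is a domesticated animal.",
]
query = "How do I rank vulnerabilities by exploitation?"

# Map text to vectors (numeric representations in multi-dimensional space)
doc_vectors = model.encode(documents, convert_to_tensor=True)
query_vector = model.encode(query, convert_to_tensor=True)

# Retrieve by similarity (proximity in vector space, here cosine similarity)
scores = util.cos_sim(query_vector, doc_vectors)[0]
for doc, score in sorted(zip(documents, scores.tolist()), key=lambda pair: -pair[1]):
    print(f"{score:.3f}  {doc}")
```
The document whose vector is closest to the query vector scores highest, which is the basis of retrieval in such systems.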
Document Loading¶
Documents are loaded via GoogleDrive, PDFs, Text files, Copied text, Web page URL.
Tip
Any sources can be used, e.g. books in PDF format, websites, text files.
Using a file of the site content (if available), e.g. a PDF, is generally more reliable than using a URL to that site; it ensures all the content is ingested.
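If only a URL is available, one pragmatic option is to save a local text copy of the page and upload that file instead; a minimal sketch (the URL and output filename here are placeholders):
```python
# pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# Placeholder URL: any documentation page you want to ingest
url = "https://example.com/docs/manual.html"

html = requests.get(url, timeout=30).text
text = BeautifulSoup(html, "html.parser").get_text(separator="\n")

# Save as a plain text file that can then be uploaded to NotebookLM as a source
with open("manual.txt", "w", encoding="utf-8") as f:
    f.write(text)
```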
Closed System¶
These documents become the corpus where information is retrieved from, with references to the document(s) the information was retrieved from.
Quote
“NotebookLM is a closed system.” This means the AI won’t perform any web searches beyond what you, the user, give it in a prompt. Every response it generates pertains only to the information it has on hand. (ref)
Quote
“source-grounded AI”: you define a set of documents that are important to your work—called “sources” in the NotebookLM parlance—and from that point on, you can have an open-ended conversation with the language model where its answers will be “grounded” in the information you’ve selected. It is as if you are giving the AI instant expertise in whatever domain you happen to be working in. (ref)
What is a dog?¶
This is illustrated with a simple example of asking our Notebook on CWEs "what is a dog?":
Quote
The sources provided do not contain any information about dogs. Therefore, the query "what is a dog?" cannot be answered from the given sources.

Sharing¶
Unlike Google Docs, it is not possible to share a NotebookLM publicly - sharing is done directly via email addresses.
How To Use NotebookLM¶
References¶
- Introducing NotebookLM, Oct 19, 2023, Steven Johnson who contributed to NotebookLM
- Getting The Most Out Of Notes In NotebookLM, Mar 18, 2024, Steven Johnson
- How To Use NotebookLM As A Research Tool, Feb 19, 2024, Steven Johnson
- Google's NotebookLM is now an even smarter assistant and better fact-checker, June 7, 2024
- Using Google’s NotebookLM for Data Science: A Comprehensive Guide, Dec 7, 2023
- How to use Google’s genAI-powered note-taking app, Feb 15, 2024
Takeaways¶
Takeaways
- NotebookLM is a powerful free solution from Google that allows users to quickly and easily build a source-grounded AI (where users define the set of documents) and then have an open-ended conversation with the language model where its answers will be “grounded” in the information users selected.
- The support for large contexts means that large documents can be processed - as demonstrated in the following sections.
- I found it a useful tool / companion for the research I was doing on vulnerability management to augment my knowledge and capabilities.
NotebookLM Vulnerability Standards¶
Overview
In this chapter, we'll feed the main vulnerability standards to NotebookLM.
Data Sources¶
These files are available in docs/NotebookLM/InputDocs for convenience.
Data Sources
The vulnerability standards PDFs are available in https://github.com/CyberSecAI/CyberSecAI.github.io/tree/main/data/NotebookLM_VulnerabilityStandards

CISA KEV was then loaded as a text file.
Quote
write an introductory article in less than 250 words that gives an overview of all these sources and how they fit together

Each circled number is a clickable link to the reference text in the original sources.
I want to rank vulnerabilities by Exploitation¶
Quote
I am currently using CVSS to rank vulnerabilities. I also want to rank based on exploitation as I heard only a small percentage of CVEs are actually exploited. What standard should I use. How should I use them. Provide an answer with rationale and examples.

Explain risk like I'm a 5 year old¶
Quote
I want to prioritize vulnerabilities by risk. But I don't understand risk and the components or factors of risk. Can you explain that to me like I'm a 5 year old. Then explain how I can use that for my risk based prioritization approach. Please provide a rationale

CVSS Base Parameters for CVSS v3 and v4¶
Quote
what parameters contribute to the CVSS base parameter score? How do these parameters differ in CVSS 3 versus CVSS 4

ChatGPT4o Response¶

Failure
Hallucination: Remediation Level (RL) is NOT part of the Exploitability Metrics.
Quote
Remediation Level (RL): New metric considering how available and effective remediation methods are (Not Defined, Official Fix, Temporary Fix, Workaround, Unavailable).
ChatGPT4o Check Response¶

Takeaways¶
Takeaways
- NotebookLM does a good job assimilating these verbose standards and was competently able to answer specific factual questions, and broader opinionated questions.
- Lots of references to the source material were provided in the answers which help understand and validate the answers.
NotebookLM CAPEC¶
Overview
In this chapter, we'll use NotebookLM to ingest the Common Attack Pattern Enumeration and Classification (CAPEC) list.
- In the current version (CAPEC List Version 3.9 as at June 2024), there are 559 Total Attack Patterns.
- It can be unwieldy to navigate or assimilate this information.
Using NotebookLM, we can chat with the CAPEC document and ask questions, so that the information comes to us in the form of an answer.
For comparison, we will also ask some of the same questions of ChatGPT4o covering 2 cases:
- without uploading any documents to it.
- uploading a document to it.
Note that CAPEC appears far less often than CWE on the internet, so an LLM will have much less context for CAPEC.
Data Sources¶
https://capec.mitre.org/data/downloads.html offers various formats (XML, CSV, HTML), and views:
- Mechanisms of Attack: https://capec.mitre.org/data/slices/1000.html
- Domains of Attack: https://capec.mitre.org/data/slices/3000.html
- We'll use this one.
NotebookLM - Upload from - Web page URL - Paste website link - https://capec.mitre.org/data/slices/3000.html
what is the CAPEC ID that related to xss¶

ChatGPT4o Answer¶
ChatGPT4o did not provide all the CAPECs that NotebookLM did.

ChatGPT4o Answer with Hallucination¶

Failure
Hallucination
"CAPEC-633: Client-Side XSS using Content-Security-Policy Bypass" is not valid https://capec.mitre.org/data/definitions/633.html
ChatGPT4o Answer with Hallucination with Uploaded CAPEC File¶
ChatGPT4o UI did not process the CAPEC HTML file, and the UI does not accept URLs, so the CSV file was uploaded instead.

ChatGPT4o Validate the Hallucination¶
In a different ChatGPT4o session (new context to avoid the hallucination), we ask ChatGPT4o to validate the CAPEC.

ChatGPT4o Review and Rate the Responses from NotebookLM and ChatGPT4o¶
ChatGPT4o was used to review and rate both responses (copied directly from the NotebookLM and ChatGPT4o responses).



Takeaways¶
Takeaways
- For less common or referenced text like CAPEC, ChatGPT4o can hallucinate even when the CAPEC list is uploaded to it.
- NotebookLM did not hallucinate (a benefit of the "Closed System" approach).
NotebookLM ATTACK¶
Overview
As part of the Proactive Software Supply Chain Risk Management (P-SSCRM) Framework that I've been collaborating on, we wanted to apply MITRE ATT&CK. It's a detailed specification, so NotebookLM can help us.
In this chapter, we'll use NotebookLM to ingest the MITRE ATT&CK Enterprise Tactics.
- In the current version, the MITRE ATT&CK Matrix for Enterprise consists of 14 tactics and hundreds of techniques and sub-techniques.
- It can be unwieldy to navigate or assimilate this information.
Using NotebookLM, we can chat with the MITRE ATT&CK Matrix and ask questions, so that the information comes to us in the form of an answer.
For comparison, we will also ask the same question of Claude 3.5 Sonnet.
Data Sources¶
The MITRE ATTACK Tactics and Techniques are available online at https://attack.mitre.org/tactics/enterprise/ as 1 webpage for each of the 14 Tactics.
- However, loading these webpages (or the "Version Permalink" pages) into NotebookLM did not work.
MITRE ATTACK is also available as an Excel file from https://attack.mitre.org/resources/attack-data-and-tools/
- https://attack.mitre.org/docs/enterprise-attack-v15.1/enterprise-attack-v15.1.xlsx
- Note: The data is also available as JSON.
So we can convert that to a text file and load those as follows:
- Open the Excel file
- For the tabs we want (e.g. Tactics, Techniques, Mitigations), delete the columns we don't want
- Export each tab as a csv file
- Transpose the data to a text file
- Import the text files
Data Sources
The MITRE ATTACK csv and text files, and script to convert, are available in https://github.com/CyberSecAI/CyberSecAI.github.io/tree/main/data/Notebook_Attack
Convert MITRE ATTACK to a Text File for Import¶
Claude 3.5 Sonnet Prompt:
i have a csv file with these columns ID name description tactics detection "data sources" "is sub-technique" "sub-technique of"
I want to extract the content into a document to read.
Each row should be extracted as follows: the name of the first column, then the ccontent for the first column. Then the second column name and content.
And so on for each column.
And do this for all rows
I want python code to do this
The LLM generated generic code that was applied to convert each csv file to a text file.
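A minimal sketch of what such a conversion script might look like (the file names below are examples; the actual generated script is in the repo linked above):
```python
import csv
import sys

def csv_to_text(csv_path: str, txt_path: str) -> None:
    """Write each CSV row as 'column name: content' lines, with a blank line between rows."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(txt_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            for column, content in row.items():
                dst.write(f"{column}: {content}\n")
            dst.write("\n")

if __name__ == "__main__":
    # e.g. python csv_to_text.py techniques.csv MITRE_ATTACK_techniques.txt
    csv_to_text(sys.argv[1], sys.argv[2])
```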
Load each text file¶
So we can just load each one into NotebookLM as a text file:
- MITRE_ATTACK_mitigations.txt
- MITRE_ATTACK_tactics.txt
- MITRE_ATTACK_techniques.txt
Tell me about MITRE techniques related to supply chain¶
NotebookLM Answer¶

Claude Sonnet 3.5 Answer¶

Takeaways¶
Takeaways
- Any data or document in text format can be converted to a format suitable for import to an LM.
NotebookLM Config¶
Overview
I came across this via https://tldrsec.com/p/tldr-sec-237 (an excellent newsletter) in the "AI + Security" section, and it piqued my interest!
One area of research is using LLMs for infrastructure configuration as detailed in https://www.coguard.io/post/coguard-uses-openai-cybersecurity-grant-to-automate-infrastructure-security and the associated repo.
Here we take on the first task [C1], Extraction of security and uptime-relevant configuration parameters from manuals, using the example provided in the repo: Apache Spark.
Details¶
Task¶
Quote
[C1] Extraction of security and uptime-relevant configuration parameters from manuals. The goal of this component is simple to describe, but hard to accomplish. Given a manual for a software component, extract the configuration parameters and define the security relevant ones from it.
Example: For Apache Spark, the manual is provided on the general configuration page online, i.e. in HTML format, and there is also a specific security page. The expectation would be to at least extract the parameters from the security page, as well as some log-related items from the general page. In total, when manually examining the configuration parameters, it totals approximately 80 parameters that are security relevant. You can find these in the Appendix A.
Original Prompt¶
The original prompt used is per https://github.com/coguardio/coguard_openai_rule_auto_generation_research/tree/master?tab=readme-ov-file#extraction-of-security-relevant-parameters-in-c1
Quote
You are an assistant that is a cybersecurity expert and knows their configuration files. Your response here should be a JSON list of strings Using the documentation of Apache Spark, which configuration parameters are relevant from a security perspective?.
Expected Answer¶
The expected answer is per https://github.com/coguardio/coguard_openai_rule_auto_generation_research/tree/master?tab=readme-ov-file#appendix-a.
Quote
The following parameters were identified by the CoGuard team by hand as relevant from a security point of view.
spark.yarn.shuffle.server.recovery.disabled
spark.authenticate
spark.authenticate.secret
spark.authenticate.secret.file
spark.authenticate.secret.driver.file
spark.authenticate.secret.executor.file
spark.network.crypto.enabled
spark.network.crypto.config.*
spark.network.crypto.saslFallback
spark.authenticate.enableSaslEncryption
spark.network.sasl.serverAlwaysEncrypt
spark.io.encryption.enabled
spark.io.encryption.keySizeBits
spark.io.encryption.keygen.algorithm
spark.io.encryption.commons.config.*
spark.ui.allowFramingFrom
spark.ui.filters
spark.acls.enable
spark.admin.acls
spark.admin.acls.groups
spark.modify.acls
spark.modify.acls.groups
spark.ui.view.acls
spark.ui.view.acls.groups
spark.user.groups.mapping
spark.history.ui.acls.enable
spark.history.ui.admin.acls
spark.history.ui.admin.acls.groups
spark.ssl.enabled
spark.ssl.port
spark.ssl.enabledAlgorithms
spark.ssl.keyPassword
spark.ssl.keyStore
spark.ssl.keyStorePassword
spark.ssl.keyStoreType
spark.ssl.protocol
spark.ssl.needClientAuth
spark.ssl.trustStore
spark.ssl.trustStorePassword
spark.ssl.trustStoreType
spark.ssl.ui.enabled
spark.ssl.ui.port
spark.ssl.ui.enabledAlgorithms
spark.ssl.ui.keyPassword
spark.ssl.ui.keyStore
spark.ssl.ui.keyStorePassword
spark.ssl.ui.keyStoreType
spark.ssl.ui.protocol
spark.ssl.ui.needClientAuth
spark.ssl.ui.trustStore
spark.ssl.ui.trustStorePassword
spark.ssl.ui.trustStoreType
spark.ssl.standalone.enabled
spark.ssl.standalone.port
spark.ssl.standalone.enabledAlgorithms
spark.ssl.standalone.keyPassword
spark.ssl.standalone.keyStore
spark.ssl.standalone.keyStorePassword
spark.ssl.standalone.keyStoreType
spark.ssl.standalone.protocol
spark.ssl.standalone.needClientAuth
spark.ssl.standalone.trustStore
spark.ssl.standalone.trustStorePassword
spark.ssl.standalone.trustStoreType
spark.ssl.historyServer.enabled
spark.ssl.historyServer.port
spark.ssl.historyServer.enabledAlgorithms
spark.ssl.historyServer.keyPassword
spark.ssl.historyServer.keyStore
spark.ssl.historyServer.keyStorePassword
spark.ssl.historyServer.keyStoreType
spark.ssl.historyServer.protocol
spark.ssl.historyServer.needClientAuth
spark.ssl.historyServer.trustStore
spark.ssl.historyServer.trustStorePassword
spark.ssl.historyServer.trustStoreType
spark.ui.xXssProtection
spark.ui.xContentTypeOptions.enabled
spark.ui.strictTransportSecurity
Data Sources¶
The data sources are per above:
- https://spark.apache.org/docs/latest/configuration.html
- https://spark.apache.org/docs/latest/security.html
Data Sources
Copies of the html files are available in https://github.com/CyberSecAI/CyberSecAI.github.io/tree/main/data/NotebookLM_Config
Setup¶
Prepare Validation File¶
- Copy and paste the expected answer to a text file ./data/NotebookLM_Config/security_parameters_manual.txt.
- Sort alphabetically to allow diff comparison with answer from NotebookLM.
cat ./data/NotebookLM_Config/security_parameters_manual.txt | sort > ./data/NotebookLM_Config/security_parameters_manual_sorted.txt
Attempt 1: Use the Provided Prompt¶
Create A New NotebookLM With The 2 Data Sources Only¶
New NotebookLM. Sources - Upload from - Web page URL for the 2 Data Sources listed above.

Submit the prompt¶

Quote
You are an assistant that is a cybersecurity expert and knows their configuration files. Your response here should be a JSON list of strings Using the documentation of Apache Spark, which configuration parameters are relevant from a security perspective?.
Save The Result¶
- Click the Copy button.
- Create a new file security_parameters.json and save the result, then remove the code-fence markers at the beginning and end of the file so the file contains JSON only.
- Sort the answer
jq -r '.[]' ./data/NotebookLM_Config/security_parameters.json | sort > ./data/NotebookLM_Config/security_parameters.txt
Compare The Answer With The Expected Answer¶
- 60 parameters were retrieved
- We can see that the main difference relates to "spark.ssl." parameters.
- Searching manually in the 2 Data Sources above reveals that these config strings (e.g. "spark.ssl.ui.needClientAuth") are not actually listed directly in the documentation - they are documented using placeholders.
- The LLM didn't understand that - so we'll let it know - then ask it again.
diff data/NotebookLM_Config/security_parameters_manual.txt data/NotebookLM_Config/security_parameters.txt
1c1,4
< spark.yarn.shuffle.server.recovery.disabled
---
> hadoop.security.credential.provider.path
> spark.acls.enable
> spark.admin.acls
> spark.admin.acls.groups
2a6
> spark.authenticate.enableSaslEncryption
4d7
< spark.authenticate.secret.file
7,11c10,14
< spark.network.crypto.enabled
< spark.network.crypto.config.*
< spark.network.crypto.saslFallback
< spark.authenticate.enableSaslEncryption
< spark.network.sasl.serverAlwaysEncrypt
---
> spark.authenticate.secret.file
> spark.history.ui.acls.enable
> spark.history.ui.admin.acls
> spark.history.ui.admin.acls.groups
> spark.io.encryption.commons.config.*
15,20c18,29
< spark.io.encryption.commons.config.*
< spark.ui.allowFramingFrom
< spark.ui.filters
< spark.acls.enable
< spark.admin.acls
< spark.admin.acls.groups
---
> spark.kerberos.access.hadoopFileSystems
> spark.kerberos.keytab
> spark.kerberos.principal
> spark.kubernetes.hadoop.configMapName
> spark.kubernetes.kerberos.krb5.configMapName
> spark.kubernetes.kerberos.krb5.path
> spark.kubernetes.kerberos.tokenSecret.itemKey
> spark.kubernetes.kerberos.tokenSecret.name
> spark.mesos.driver.secret.envkeys
> spark.mesos.driver.secret.filenames
> spark.mesos.driver.secret.names
> spark.mesos.driver.secret.values
23,28c32,39
< spark.ui.view.acls
< spark.ui.view.acls.groups
< spark.user.groups.mapping
< spark.history.ui.acls.enable
< spark.history.ui.admin.acls
< spark.history.ui.admin.acls.groups
---
> spark.network.crypto.config.*
> spark.network.crypto.enabled
> spark.network.crypto.saslFallback
> spark.network.sasl.serverAlwaysEncrypt
> spark.redaction.regex
> spark.redaction.string.regex
> spark.security.credentials.${service}.enabled
> spark.sql.redaction.options.regex
30d40
< spark.ssl.port
36d45
< spark.ssl.protocol
37a47,48
> spark.ssl.port
> spark.ssl.protocol
41,77c52,57
< spark.ssl.ui.enabled
< spark.ssl.ui.port
< spark.ssl.ui.enabledAlgorithms
< spark.ssl.ui.keyPassword
< spark.ssl.ui.keyStore
< spark.ssl.ui.keyStorePassword
< spark.ssl.ui.keyStoreType
< spark.ssl.ui.protocol
< spark.ssl.ui.needClientAuth
< spark.ssl.ui.trustStore
< spark.ssl.ui.trustStorePassword
< spark.ssl.ui.trustStoreType
< spark.ssl.standalone.enabled
< spark.ssl.standalone.port
< spark.ssl.standalone.enabledAlgorithms
< spark.ssl.standalone.keyPassword
< spark.ssl.standalone.keyStore
< spark.ssl.standalone.keyStorePassword
< spark.ssl.standalone.keyStoreType
< spark.ssl.standalone.protocol
< spark.ssl.standalone.needClientAuth
< spark.ssl.standalone.trustStore
< spark.ssl.standalone.trustStorePassword
< spark.ssl.standalone.trustStoreType
< spark.ssl.historyServer.enabled
< spark.ssl.historyServer.port
< spark.ssl.historyServer.enabledAlgorithms
< spark.ssl.historyServer.keyPassword
< spark.ssl.historyServer.keyStore
< spark.ssl.historyServer.keyStorePassword
< spark.ssl.historyServer.keyStoreType
< spark.ssl.historyServer.protocol
< spark.ssl.historyServer.needClientAuth
< spark.ssl.historyServer.trustStore
< spark.ssl.historyServer.trustStorePassword
< spark.ssl.historyServer.trustStoreType
< spark.ui.xXssProtection
---
> spark.ssl.useNodeLocalConf
> spark.ui.allowFramingFrom
> spark.ui.filters
> spark.ui.strictTransportSecurity
> spark.ui.view.acls
> spark.ui.view.acls.groups
79c59,60
< spark.ui.strictTransportSecurity
\ No newline at end of file
---
> spark.ui.xXssProtection
> spark.user.groups.mapping
Note
In the next section, we'll use an LLM to do the comparison.
Here we used traditional methods, i.e. diff.
Attempt 2: Explain about Config Namespace Placeholders¶
The LLM did not understand the concept of Config Namespace placeholders from the documents:

So this time, we explain Config Namespace placeholders as part of the prompt.
Submit The Prompt¶

Quote
Note that Config Namespace placeholders are used for some security configurations. These are placeholders "${ns}" that can take values listed under Config Namespace. For these security configurations placeholders list all possible security configurations for each Config Namespace e.g. For Property Name: "${ns}.enabled", "spark.ssl.enabled" and "spark.ssl.historyServer" would be some possible configurations now that you understand namespace placeholders, list all the configuration parameters that are relevant from a security perspective replacing all placeholders with possible config namespace values. Your response here should be a JSON list of strings only. list the configuration parameters that are relevant from a security perspective replacing all placeholders with possible config namespace values and repeated patterns
Note
Note the duplication in the prompt to emphasize what we want
Quote
"list the configuration parameters that are relevant from a security perspective replacing all placeholders with possible config namespace values and repeated patterns" in the prompt.
security_parameters_ns.json is the resulting file, which has 96 config parameters - more than the 79 config parameters in the human-generated expected answer.
Save The Result¶
- Click the Copy button.
- Create a new file security_parameters_ns.json and save the result, then remove the code-fence markers at the beginning and end of the file so the file contains JSON only.
- Sort the answer and ensure there are no duplicates.
jq -r '.[]' ./data/NotebookLM_Config/security_parameters_ns.json | sort | uniq > ./data/NotebookLM_Config/security_parameters_ns.txt
Compare The Answer With The Expected Answer¶
In this case, we use ChatGPT4o to do the diff, copying and pasting the values from each file (a non-LLM alternative is sketched after the file list below):
- ./data/NotebookLM_Config/security_parameters_manual.txt: the expected answers
- ./data/NotebookLM_Config/security_parameters_ns.txt: the actual answers
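For reference, the same list comparison can be done without an LLM; a minimal sketch using set operations on the two sorted files (paths as above):
```python
def load_params(path: str) -> set:
    """Load one configuration parameter per line, ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

manual = load_params("./data/NotebookLM_Config/security_parameters_manual.txt")
notebooklm = load_params("./data/NotebookLM_Config/security_parameters_ns.txt")

print("In manual but not in notebooklm:")
print("\n".join(sorted(manual - notebooklm)))
print("\nIn notebooklm but not in manual:")
print("\n".join(sorted(notebooklm - manual)))
```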
Prompt¶
List the values that are in ===manual=== but not in ===notebooklm===
List the values that are in ===notebooklm=== but not in ===manual===
===manual===
spark.yarn.shuffle.server.recovery.disabled
spark.authenticate
spark.authenticate.secret
spark.authenticate.secret.file
spark.authenticate.secret.driver.file
spark.authenticate.secret.executor.file
spark.network.crypto.enabled
spark.network.crypto.config.*
spark.network.crypto.saslFallback
spark.authenticate.enableSaslEncryption
spark.network.sasl.serverAlwaysEncrypt
spark.io.encryption.enabled
spark.io.encryption.keySizeBits
spark.io.encryption.keygen.algorithm
spark.io.encryption.commons.config.*
spark.ui.allowFramingFrom
spark.ui.filters
spark.acls.enable
spark.admin.acls
spark.admin.acls.groups
spark.modify.acls
spark.modify.acls.groups
spark.ui.view.acls
spark.ui.view.acls.groups
spark.user.groups.mapping
spark.history.ui.acls.enable
spark.history.ui.admin.acls
spark.history.ui.admin.acls.groups
spark.ssl.enabled
spark.ssl.port
spark.ssl.enabledAlgorithms
spark.ssl.keyPassword
spark.ssl.keyStore
spark.ssl.keyStorePassword
spark.ssl.keyStoreType
spark.ssl.protocol
spark.ssl.needClientAuth
spark.ssl.trustStore
spark.ssl.trustStorePassword
spark.ssl.trustStoreType
spark.ssl.ui.enabled
spark.ssl.ui.port
spark.ssl.ui.enabledAlgorithms
spark.ssl.ui.keyPassword
spark.ssl.ui.keyStore
spark.ssl.ui.keyStorePassword
spark.ssl.ui.keyStoreType
spark.ssl.ui.protocol
spark.ssl.ui.needClientAuth
spark.ssl.ui.trustStore
spark.ssl.ui.trustStorePassword
spark.ssl.ui.trustStoreType
spark.ssl.standalone.enabled
spark.ssl.standalone.port
spark.ssl.standalone.enabledAlgorithms
spark.ssl.standalone.keyPassword
spark.ssl.standalone.keyStore
spark.ssl.standalone.keyStorePassword
spark.ssl.standalone.keyStoreType
spark.ssl.standalone.protocol
spark.ssl.standalone.needClientAuth
spark.ssl.standalone.trustStore
spark.ssl.standalone.trustStorePassword
spark.ssl.standalone.trustStoreType
spark.ssl.historyServer.enabled
spark.ssl.historyServer.port
spark.ssl.historyServer.enabledAlgorithms
spark.ssl.historyServer.keyPassword
spark.ssl.historyServer.keyStore
spark.ssl.historyServer.keyStorePassword
spark.ssl.historyServer.keyStoreType
spark.ssl.historyServer.protocol
spark.ssl.historyServer.needClientAuth
spark.ssl.historyServer.trustStore
spark.ssl.historyServer.trustStorePassword
spark.ssl.historyServer.trustStoreType
spark.ui.xXssProtection
spark.ui.xContentTypeOptions.enabled
spark.ui.strictTransportSecurity
===notebooklm===
hadoop.security.credential.provider.path
spark.acls.enable
spark.admin.acls
spark.admin.acls.groups
spark.authenticate
spark.authenticate.enableSaslEncryption
spark.authenticate.secret
spark.authenticate.secret.driver.file
spark.authenticate.secret.executor.file
spark.authenticate.secret.file
spark.history.ui.acls.enable
spark.history.ui.admin.acls
spark.history.ui.admin.acls.groups
spark.io.encryption.commons.config.*
spark.io.encryption.enabled
spark.io.encryption.keySizeBits
spark.io.encryption.keygen.algorithm
spark.kerberos.access.hadoopFileSystems
spark.kerberos.keytab
spark.kerberos.principal
spark.kubernetes.hadoop.configMapName
spark.kubernetes.kerberos.krb5.configMapName
spark.kubernetes.kerberos.krb5.path
spark.kubernetes.kerberos.tokenSecret.itemKey
spark.kubernetes.kerberos.tokenSecret.name
spark.mesos.driver.secret.envkeys
spark.mesos.driver.secret.filenames
spark.mesos.driver.secret.names
spark.mesos.driver.secret.values
spark.modify.acls
spark.modify.acls.groups
spark.network.crypto.config.*
spark.network.crypto.enabled
spark.network.crypto.saslFallback
spark.network.sasl.serverAlwaysEncrypt
spark.redaction.regex
spark.redaction.string.regex
spark.security.credentials.${service}.enabled
spark.sql.redaction.options.regex
spark.ssl.enabled
spark.ssl.enabledAlgorithms
spark.ssl.historyServer.enabled
spark.ssl.historyServer.enabledAlgorithms
spark.ssl.historyServer.keyPassword
spark.ssl.historyServer.keyStore
spark.ssl.historyServer.keyStorePassword
spark.ssl.historyServer.keyStoreType
spark.ssl.historyServer.needClientAuth
spark.ssl.historyServer.port
spark.ssl.historyServer.protocol
spark.ssl.historyServer.trustStore
spark.ssl.historyServer.trustStorePassword
spark.ssl.historyServer.trustStoreType
spark.ssl.keyPassword
spark.ssl.keyStore
spark.ssl.keyStorePassword
spark.ssl.keyStoreType
spark.ssl.needClientAuth
spark.ssl.port
spark.ssl.protocol
spark.ssl.standalone.enabled
spark.ssl.standalone.enabledAlgorithms
spark.ssl.standalone.keyPassword
spark.ssl.standalone.keyStore
spark.ssl.standalone.keyStorePassword
spark.ssl.standalone.keyStoreType
spark.ssl.standalone.needClientAuth
spark.ssl.standalone.port
spark.ssl.standalone.protocol
spark.ssl.standalone.trustStore
spark.ssl.standalone.trustStorePassword
spark.ssl.standalone.trustStoreType
spark.ssl.trustStore
spark.ssl.trustStorePassword
spark.ssl.trustStoreType
spark.ssl.ui.enabled
spark.ssl.ui.enabledAlgorithms
spark.ssl.ui.keyPassword
spark.ssl.ui.keyStore
spark.ssl.ui.keyStorePassword
spark.ssl.ui.keyStoreType
spark.ssl.ui.needClientAuth
spark.ssl.ui.port
spark.ssl.ui.protocol
spark.ssl.ui.trustStore
spark.ssl.ui.trustStorePassword
spark.ssl.ui.trustStoreType
spark.ssl.useNodeLocalConf
spark.ui.allowFramingFrom
spark.ui.filters
spark.ui.strictTransportSecurity
spark.ui.view.acls
spark.ui.view.acls.groups
spark.ui.xContentTypeOptions.enabled
spark.ui.xXssProtection
spark.user.groups.mapping
Response¶

Values in ===manual=== but not in ===notebooklm===
spark.yarn.shuffle.server.recovery.disabled
Values in ===notebooklm=== but not in ===manual===
hadoop.security.credential.provider.path
spark.kerberos.access.hadoopFileSystems
spark.kerberos.keytab
spark.kerberos.principal
spark.kubernetes.hadoop.configMapName
spark.kubernetes.kerberos.krb5.configMapName
spark.kubernetes.kerberos.krb5.path
spark.kubernetes.kerberos.tokenSecret.itemKey
spark.kubernetes.kerberos.tokenSecret.name
spark.mesos.driver.secret.envkeys
spark.mesos.driver.secret.filenames
spark.mesos.driver.secret.names
spark.mesos.driver.secret.values
spark.redaction.regex
spark.redaction.string.regex
spark.security.credentials.${service}.enabled
spark.sql.redaction.options.regex
spark.ssl.useNodeLocalConf
Observation
- NotebookLM did not understand the concept of placeholder config parameters from the source documents. Since there are several config parameters per placeholder, it missed many config parameters (Attempt 1).
- When the prompt explained about placeholder config parameters, NotebookLM found all but 1 of the 79 configuration parameters related to security (Attempt 2).
- NotebookLM found 18 security-related config parameters that were not in the human-generated expected answer; based on my review, these appear to be valid security-related configuration parameters.
- 1 config parameter was not found by NotebookLM, and it's not obvious why:
spark.yarn.shuffle.server.recovery.disabled
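For reference, once the Config Namespace values are known, the placeholder expansion itself is mechanical; a minimal sketch using the SSL namespaces and property suffixes reflected in the expected answer above (illustrative only):
```python
# Config Namespace values and ${ns} property suffixes from the Spark SSL documentation
namespaces = ["spark.ssl", "spark.ssl.ui", "spark.ssl.standalone", "spark.ssl.historyServer"]
suffixes = [
    "enabled", "port", "enabledAlgorithms", "keyPassword", "keyStore",
    "keyStorePassword", "keyStoreType", "protocol", "needClientAuth",
    "trustStore", "trustStorePassword", "trustStoreType",
]

# Expand "${ns}.<suffix>" into concrete configuration parameters
expanded = [f"{ns}.{suffix}" for ns in namespaces for suffix in suffixes]
print(len(expanded))  # 48 SSL-related parameters
for param in expanded:
    print(param)
```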
ChatGPT4o¶
Original Prompt¶
You are an assistant that is a cybersecurity expert and knows their configuration files. Your response here should be a JSON list of strings
Using only the documentation of Apache Spark from the URLs below (and no other source), which configuration parameters are relevant from a security perspective?
The data sources are :
1. https://spark.apache.org/docs/latest/configuration.html
2. https://spark.apache.org/docs/latest/security.html
[
"spark.authenticate",
"spark.authenticate.secret",
"spark.authenticate.secret.file",
"spark.authenticate.secret.driver.file",
"spark.authenticate.secret.executor.file",
"spark.network.crypto.enabled",
"spark.network.crypto.config.*",
"spark.network.crypto.saslFallback",
"spark.authenticate.enableSaslEncryption",
"spark.network.sasl.serverAlwaysEncrypt",
"spark.io.encryption.enabled",
"spark.io.encryption.keySizeBits",
"spark.io.encryption.keygen.algorithm",
"spark.io.encryption.commons.config.*",
"spark.ui.allowFramingFrom",
"spark.ui.filters",
"spark.acls.enable",
"spark.admin.acls",
"spark.admin.acls.groups",
"spark.modify.acls",
"spark.modify.acls.groups",
"spark.ui.view.acls",
"spark.ui.view.acls.groups",
"spark.user.groups.mapping",
"spark.history.ui.acls.enable",
"spark.history.ui.admin.acls",
"spark.history.ui.admin.acls.groups",
"spark.ssl.enabled",
"spark.ssl.port",
"spark.ssl.enabledAlgorithms",
"spark.ssl.keyPassword",
"spark.ssl.keyStore",
"spark.ssl.keyStorePassword",
"spark.ssl.keyStoreType",
"spark.ssl.protocol",
"spark.ssl.needClientAuth",
"spark.ssl.trustStore",
"spark.ssl.trustStorePassword",
"spark.ssl.trustStoreType",
"spark.ui.xXssProtection",
"spark.ui.xContentTypeOptions.enabled",
"spark.ui.strictTransportSecurity",
"spark.kerberos.access.hadoopFileSystems",
"spark.security.credentials.${service}.enabled"
]
Prompt Explaining Config Namespace placeholders¶
Note that Config Namespace placeholders are used for some security configurations. These are placeholders "${ns}" that can take values listed under Config Namespace. For these security configurations placeholders list all possible security configurations for each Config Namespace e.g. For Property Name: "${ns}.enabled", "spark.ssl.enabled" and "spark.ssl.historyServer" would be some possible configurations now that you understand namespace placeholders, list all the configuration parameters that are relevant from a security perspective replacing all placeholders with possible config namespace values. Your response here should be a JSON list of strings only. list the configuration parameters that are relevant from a security perspective replacing all placeholders with possible config namespace values and repeated patterns
[
"spark.ssl.enabled",
"spark.ssl.historyServer.enabled",
"spark.security.credentials.enabled",
"spark.ssl.port",
"spark.ssl.historyServer.port",
"spark.security.credentials.port",
"spark.ssl.enabledAlgorithms",
"spark.ssl.historyServer.enabledAlgorithms",
"spark.security.credentials.enabledAlgorithms",
"spark.ssl.keyPassword",
"spark.ssl.historyServer.keyPassword",
"spark.security.credentials.keyPassword",
"spark.ssl.keyStore",
"spark.ssl.historyServer.keyStore",
"spark.security.credentials.keyStore",
"spark.ssl.keyStorePassword",
"spark.ssl.historyServer.keyStorePassword",
"spark.security.credentials.keyStorePassword",
"spark.ssl.keyStoreType",
"spark.ssl.historyServer.keyStoreType",
"spark.security.credentials.keyStoreType",
"spark.ssl.protocol",
"spark.ssl.historyServer.protocol",
"spark.security.credentials.protocol",
"spark.ssl.needClientAuth",
"spark.ssl.historyServer.needClientAuth",
"spark.security.credentials.needClientAuth",
"spark.ssl.trustStore",
"spark.ssl.historyServer.trustStore",
"spark.security.credentials.trustStore",
"spark.ssl.trustStorePassword",
"spark.ssl.historyServer.trustStorePassword",
"spark.security.credentials.trustStorePassword",
"spark.ssl.trustStoreType",
"spark.ssl.historyServer.trustStoreType",
"spark.security.credentials.trustStoreType",
"spark.ssl.historyServer",
"spark.ssl.historyServer.historyServer",
"spark.security.credentials.historyServer",
"spark.authenticate",
"spark.authenticate.secret",
"spark.authenticate.secret.file",
"spark.authenticate.secret.driver.file",
"spark.authenticate.secret.executor.file",
"spark.network.crypto.enabled",
"spark.network.crypto.config.*",
"spark.network.crypto.saslFallback",
"spark.authenticate.enableSaslEncryption",
"spark.network.sasl.serverAlwaysEncrypt",
"spark.io.encryption.enabled",
"spark.io.encryption.keySizeBits",
"spark.io.encryption.keygen.algorithm",
"spark.io.encryption.commons.config.*",
"spark.ui.allowFramingFrom",
"spark.ui.filters",
"spark.acls.enable",
"spark.admin.acls",
"spark.admin.acls.groups",
"spark.modify.acls",
"spark.modify.acls.groups",
"spark.ui.view.acls",
"spark.ui.view.acls.groups",
"spark.user.groups.mapping",
"spark.history.ui.acls.enable",
"spark.history.ui.admin.acls",
"spark.history.ui.admin.acls.groups",
"spark.ui.xXssProtection",
"spark.ui.xContentTypeOptions.enabled",
"spark.ui.strictTransportSecurity",
"spark.kerberos.access.hadoopFileSystems",
"spark.security.credentials.${service}.enabled"
]
Compare The Answer With The Expected Answer¶
- ChatGPT4o missed these 2 config placeholder namespaces (these were not provided in the example in the prompt - only the first and last namespaces were):
- spark.ssl.ui
- spark.ssl.standalone
- ChatGPT4o wrongly interpreted spark.security.credentials.${service}.enabled as using the same ${ns} placeholder, and appended the ${ns} placeholder parameters to spark.security.credentials.
{
"manual_not_in_chatgpt4o": [
"spark.yarn.shuffle.server.recovery.disabled",
"spark.ssl.ui.enabled",
"spark.ssl.ui.port",
"spark.ssl.ui.enabledAlgorithms",
"spark.ssl.ui.keyPassword",
"spark.ssl.ui.keyStore",
"spark.ssl.ui.keyStorePassword",
"spark.ssl.ui.keyStoreType",
"spark.ssl.ui.protocol",
"spark.ssl.ui.needClientAuth",
"spark.ssl.ui.trustStore",
"spark.ssl.ui.trustStorePassword",
"spark.ssl.ui.trustStoreType",
"spark.ssl.standalone.enabled",
"spark.ssl.standalone.port",
"spark.ssl.standalone.enabledAlgorithms",
"spark.ssl.standalone.keyPassword",
"spark.ssl.standalone.keyStore",
"spark.ssl.standalone.keyStorePassword",
"spark.ssl.standalone.keyStoreType",
"spark.ssl.standalone.protocol",
"spark.ssl.standalone.needClientAuth",
"spark.ssl.standalone.trustStore",
"spark.ssl.standalone.trustStorePassword",
"spark.ssl.standalone.trustStoreType"
],
"chatgpt4o_not_in_manual": [
"spark.security.credentials.enabled",
"spark.security.credentials.port",
"spark.security.credentials.enabledAlgorithms",
"spark.security.credentials.keyPassword",
"spark.security.credentials.keyStore",
"spark.security.credentials.keyStorePassword",
"spark.security.credentials.keyStoreType",
"spark.security.credentials.protocol",
"spark.security.credentials.needClientAuth",
"spark.security.credentials.trustStore",
"spark.security.credentials.trustStorePassword",
"spark.security.credentials.trustStoreType",
"spark.security.credentials.historyServer",
"spark.security.credentials.${service}.enabled"
]
}
Observation
- spark.yarn.shuffle.server.recovery.disabled was not found by ChatGPT4o (or NotebookLM).
- ChatGPT4o found 44 parameters in the first prompt.
- When the concept of placeholder config parameters was explained in the second prompt, ChatGPT4o made 2 mistakes and thereby got many config parameters wrong:
- it did not extend the example to all the placeholder config parameters
- it appended the ${ns} placeholder parameters to a different placeholder, ${service}
Claude Sonnet 3.5¶
Note
Claude 3.5 Sonnet with a Claude Pro subscription was used, as both documents were too large for the free version.
Original Prompt¶
You are an assistant that is a cybersecurity expert and knows their configuration files. Your response here should be a JSON list of strings
Using only the documentation of Apache Spark from the URLs below (and no other source), which configuration parameters are relevant from a security perspective?
The data sources are :
1. https://spark.apache.org/docs/latest/configuration.html
2. https://spark.apache.org/docs/latest/security.html
Based on my analysis of the Apache Spark documentation, here is a JSON list of configuration parameters that are relevant from a security perspective:
[
"spark.authenticate",
"spark.authenticate.secret",
"spark.network.crypto.enabled",
"spark.network.crypto.saslFallback",
"spark.authenticate.enableSaslEncryption",
"spark.network.sasl.serverAlwaysEncrypt",
"spark.io.encryption.enabled",
"spark.io.encryption.keySizeBits",
"spark.io.encryption.keygen.algorithm",
"spark.ssl.enabled",
"spark.ssl.keyStore",
"spark.ssl.keyStorePassword",
"spark.ssl.keyPassword",
"spark.ssl.trustStore",
"spark.ssl.trustStorePassword",
"spark.ssl.protocol",
"spark.ssl.enabledAlgorithms",
"spark.ui.filters",
"spark.acls.enable",
"spark.admin.acls",
"spark.admin.acls.groups",
"spark.modify.acls",
"spark.modify.acls.groups",
"spark.ui.view.acls",
"spark.ui.view.acls.groups",
"spark.eventLog.enabled",
"spark.eventLog.dir",
"spark.history.kerberos.enabled",
"spark.history.kerberos.principal",
"spark.history.kerberos.keytab",
"spark.kerberos.principal",
"spark.kerberos.keytab",
"spark.hadoop.security.authentication",
"spark.yarn.security.tokens.hbase.enabled",
"spark.yarn.security.tokens.hive.enabled",
"spark.yarn.security.credentials.hadoop.enabled",
"spark.redaction.regex",
"spark.ui.proxyRedirectUri",
"spark.driver.log.dfsDir"
]
Prompt Explaining Config Namespace placeholders¶
Note that Config Namespace placeholders are used for some security configurations. These are placeholders "${ns}" that can take values listed under Config Namespace. For these security configurations placeholders list all possible security configurations for each Config Namespace e.g. For Property Name: "${ns}.enabled", "spark.ssl.enabled" and "spark.ssl.historyServer" would be some possible configurations now that you understand namespace placeholders, list all the configuration parameters that are relevant from a security perspective replacing all placeholders with possible config namespace values. Your response here should be a JSON list of strings only. list the configuration parameters that are relevant from a security perspective replacing all placeholders with possible config namespace values and repeated patterns
I understand. Here's a revised JSON list of configuration parameters relevant from a security perspective, with namespace placeholders expanded and repeated patterns included:
[
"spark.authenticate",
"spark.authenticate.secret",
"spark.network.crypto.enabled",
"spark.network.crypto.saslFallback",
"spark.authenticate.enableSaslEncryption",
"spark.network.sasl.serverAlwaysEncrypt",
"spark.io.encryption.enabled",
"spark.io.encryption.keySizeBits",
"spark.io.encryption.keygen.algorithm",
"spark.ssl.enabled",
"spark.ssl.ui.enabled",
"spark.ssl.standalone.enabled",
"spark.ssl.historyServer.enabled",
"spark.ssl.keyStore",
"spark.ssl.ui.keyStore",
"spark.ssl.standalone.keyStore",
"spark.ssl.historyServer.keyStore",
"spark.ssl.keyStorePassword",
"spark.ssl.ui.keyStorePassword",
"spark.ssl.standalone.keyStorePassword",
"spark.ssl.historyServer.keyStorePassword",
"spark.ssl.keyPassword",
"spark.ssl.ui.keyPassword",
"spark.ssl.standalone.keyPassword",
"spark.ssl.historyServer.keyPassword",
"spark.ssl.trustStore",
"spark.ssl.ui.trustStore",
"spark.ssl.standalone.trustStore",
"spark.ssl.historyServer.trustStore",
"spark.ssl.trustStorePassword",
"spark.ssl.ui.trustStorePassword",
"spark.ssl.standalone.trustStorePassword",
"spark.ssl.historyServer.trustStorePassword",
"spark.ssl.protocol",
"spark.ssl.ui.protocol",
"spark.ssl.standalone.protocol",
"spark.ssl.historyServer.protocol",
"spark.ssl.enabledAlgorithms",
"spark.ssl.ui.enabledAlgorithms",
"spark.ssl.standalone.enabledAlgorithms",
"spark.ssl.historyServer.enabledAlgorithms",
"spark.ui.filters",
"spark.acls.enable",
"spark.admin.acls",
"spark.admin.acls.groups",
"spark.modify.acls",
"spark.modify.acls.groups",
"spark.ui.view.acls",
"spark.ui.view.acls.groups",
"spark.eventLog.enabled",
"spark.eventLog.dir",
"spark.history.kerberos.enabled",
"spark.history.kerberos.principal",
"spark.history.kerberos.keytab",
"spark.kerberos.principal",
"spark.kerberos.keytab",
"spark.hadoop.security.authentication",
"spark.yarn.security.tokens.hbase.enabled",
"spark.yarn.security.tokens.hive.enabled",
"spark.yarn.security.credentials.hadoop.enabled",
"spark.redaction.regex",
"spark.ui.proxyRedirectUri",
"spark.driver.log.dfsDir",
"spark.ssl.keyStoreType",
"spark.ssl.ui.keyStoreType",
"spark.ssl.standalone.keyStoreType",
"spark.ssl.historyServer.keyStoreType",
"spark.ssl.trustStoreType",
"spark.ssl.ui.trustStoreType",
"spark.ssl.standalone.trustStoreType",
"spark.ssl.historyServer.trustStoreType",
"spark.ssl.needClientAuth",
"spark.ssl.ui.needClientAuth",
"spark.ssl.standalone.needClientAuth",
"spark.ssl.historyServer.needClientAuth",
"spark.ui.allowFramingFrom",
"spark.ui.xXssProtection",
"spark.ui.xContentTypeOptions.enabled",
"spark.ui.strictTransportSecurity"
]
Compare The Answer With The Expected Answer¶
Values in ===manual=== but not in ===claude35===:
spark.yarn.shuffle.server.recovery.disabled
spark.authenticate.secret.file
spark.authenticate.secret.driver.file
spark.authenticate.secret.executor.file
spark.network.crypto.config.*
spark.io.encryption.commons.config.*
spark.user.groups.mapping
spark.history.ui.acls.enable
spark.history.ui.admin.acls
spark.history.ui.admin.acls.groups
spark.ssl.port
spark.ssl.ui.port
spark.ssl.standalone.port
spark.ssl.historyServer.port
Values in ===claude35=== but not in ===manual===:
spark.eventLog.enabled
spark.eventLog.dir
spark.history.kerberos.enabled
spark.history.kerberos.principal
spark.history.kerberos.keytab
spark.kerberos.principal
spark.kerberos.keytab
spark.hadoop.security.authentication
spark.yarn.security.tokens.hbase.enabled
spark.yarn.security.tokens.hive.enabled
spark.yarn.security.credentials.hadoop.enabled
spark.redaction.regex
spark.ui.proxyRedirectUri
spark.driver.log.dfsDir
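The comparison above is just a set difference and can be scripted rather than done by eye. A minimal sketch, where `manual_params` and `claude35_params` are illustrative stand-ins for the full lists:

```python
# Minimal sketch: compare a human-curated list of security-relevant Spark
# configuration parameters with an LLM-generated list.
# The example values below are illustrative, not the full lists.

manual_params = {
    "spark.authenticate",
    "spark.ssl.ui.port",
    "spark.yarn.shuffle.server.recovery.disabled",
}

claude35_params = {
    "spark.authenticate",
    "spark.ssl.ui.enabled",
    "spark.eventLog.enabled",
}

print("Values in ===manual=== but not in ===claude35===:")
for param in sorted(manual_params - claude35_params):
    print(param)

print("Values in ===claude35=== but not in ===manual===:")
for param in sorted(claude35_params - manual_params):
    print(param)
```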
Observation
- spark.yarn.shuffle.server.recovery.disabled was not found by Claude3.5 (or ChatGPT4o or NotebookLM).
- Claude3.5 found 39 parameters in the first prompt.
- When the concept of placeholder config parameters was explained in the second prompt, Claude3.5 understood it.
- Claude3.5 found 14 parameters more than the human-generated answer, including several parameters that the other LLMs did not find.
Takeaways¶
Takeaways
- NotebookLM, ChatGPT4o and Claude3.5 Pro did reasonably well at extracting the config parameters related to security
- NotebookLM performed best, missing just 1, and finding 18 more than the human-generated answer.
- Claude3.5 found 79 parameters, missed 14, and found 14 more than the human-generated answer.
- ChatGPT 4o found 71 parameters.
- Overall, with everything-as-code (infrastructure, policy, LLM answers, ....), and LLMs being able to process code, there's a lot of benefit and promise in applying LLMs.
- The feedback to CoGuard is via https://github.com/coguardio/coguard_openai_rule_auto_generation_research/issues/2
NotebookLM Secure Code¶
Overview
In two separate conversations recently, the topic of using LLMs for secure coding came up. One of the concerns that is often raised is that GenAI Code is not secure because GenAI is trained on arbitrary code on the internet.
I was curious how NotebookLM would work for generating or reviewing secure code i.e. a closed system that has been provided a lot of guidance on secure code (and not arbitrary examples).
Claude Sonnet 3.5 was also used for comparison.
Vulnerability Types¶
Secure Programming with Static Analysis classifies vulnerability types as follows:

LLMs go beyond understanding syntax to understanding semantics, and may be effective in the 3 quadrants where traditional static analysis isn't.
But in this simple test case below, the focus is on Generic defects visible in the code, as an initial proof of concept.
Data Sources¶
Two books I had on Java were loaded to NotebookLM:
- The CERT Oracle Secure Coding Standard for Java
- The same material is available on https://wiki.sei.cmu.edu/confluence/display/java/SEI+CERT+Oracle+Coding+Standard+for+Java
- Java Coding Guidelines: 75 Recommendations for Reliable and Secure Programs
Test Data¶
NIST Software Assurance Reference Dataset (SARD) was used as the test dataset.
Quote
The Software Assurance Reference Dataset (SARD) is a growing collection of test programs with documented weaknesses. Test cases vary from small synthetic programs to large applications. The programs are in C, C++, Java, PHP, and C#, and cover over 150 classes of weaknesses.
e.g. CWE: 191 Integer Underflow https://samate.nist.gov/SARD/test-cases/252126/versions/1.0.0#4
Setup¶
- Import both PDFs into a new NotebookLM.
Test¶
Test Code CWE: 191 Integer Underflow¶
https://samate.nist.gov/SARD/test-cases/252126/versions/1.0.0#4
Review Test Code¶
Comments are removed from https://samate.nist.gov/SARD/test-cases/252126/versions/1.0.0#4 so the code fits in the prompt window.

Claude 3.5¶

Generate Code: Write Secure Code to Multiply 2 numbers¶

Use BigInteger Instead¶

Claude 3.5¶


Llama 3.1 405B Code Training¶
Llama 3.1 405B was released July 2024.
The training process to generate good code is described in https://www.deeplearning.ai/the-batch/issue-260/.
Quote
The pretrained model was fine-tuned to perform seven tasks, including coding and reasoning, via supervised learning and direct preference optimization (DPO). Most of the fine-tuning data was generated by the model itself and curated using a variety of methods including agentic workflows. For instance,
To generate good code to learn from, the team:
- Generated programming problems from random code snippets.
- Generated a solution to each problem, prompting the model to follow good programming practices and explain its thought process in comments.
- Ran the generated code through a parser and linter to check for issues like syntax errors, style issues, and uninitialized variables.
- Generated unit tests.
- Tested the code on the unit tests.
- If there were any issues, regenerated the code, giving the model the original question, code, and feedback.
- If the code passed all tests, added it to the dataset.
- Fine-tuned the model.
- Repeated this process several times.
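The pipeline above is essentially a generate, lint, test, and repair loop. A minimal sketch of that loop follows; every callable passed in (`generate_problem`, `generate_solution`, `lint`, `generate_tests`, `run_tests`, `regenerate`) is a hypothetical placeholder for the steps described in the article, not Meta's actual implementation:

```python
# Illustrative sketch of the generate / lint / test / repair loop described above.
# All callables passed in are hypothetical placeholders, not Meta's pipeline.

def build_code_dataset(snippets, generate_problem, generate_solution,
                       lint, generate_tests, run_tests, regenerate,
                       max_attempts=3):
    dataset = []
    for snippet in snippets:
        problem = generate_problem(snippet)     # programming problem from a random code snippet
        code = generate_solution(problem)       # solution following good practices, with comments
        for _ in range(max_attempts):
            issues = lint(code)                 # parser/linter: syntax errors, style, uninitialized vars
            tests = generate_tests(problem)
            failures = run_tests(code, tests)
            if not issues and not failures:
                dataset.append({"problem": problem, "code": code})
                break
            # Regenerate, giving the model the original question, code, and feedback
            code = regenerate(problem, code, issues + failures)
    return dataset
    # The resulting dataset is used to fine-tune the model,
    # and the whole process is repeated several times.
```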
Takeaways¶
Takeaways
- NotebookLM with 2 Secure Code Java references performed well in these simple test cases.
- LLMs in conjunction with traditional code assurance tools can be used to "generate good code".
Ended: NotebookLM
Grounded Closed System ↵
Grounding¶
Overview
The accurate answers from NotebookLM highlight the benefits of a grounded closed system.
NotebookLM also provides links to the content it references in the data sources.
There are many tools that can be used to build such a system.
Grounding Overview¶
Quote
What is Grounding?
Grounding is the process of using large language models (LLMs) with information that is use-case specific, relevant, and not available as part of the LLM's trained knowledge. It is crucial for ensuring the quality, accuracy, and relevance of the generated output. While LLMs come with a vast amount of knowledge already, this knowledge is limited and not tailored to specific use-cases. To obtain accurate and relevant output, we must provide LLMs with the necessary information. In other words, we need to "ground" the models in the context of our specific use-case.
Motivation for Grounding
The primary motivation for grounding is that LLMs are not databases, even if they possess a wealth of knowledge. They are designed to be used as general reasoning and text engines. LLMs have been trained on an extensive corpus of information, some of which has been retained, giving them a broad understanding of language, the world, reasoning, and text manipulation. However, we should use them as engines rather than stores of knowledge.
https://techcommunity.microsoft.com/t5/fasttrack-for-azure/grounding-llms/ba-p/3843857
Quote
In generative AI, grounding is the ability to connect model output to verifiable sources of information. If you provide models with access to specific data sources, then grounding tethers their output to these data and reduces the chances of inventing content. This is particularly important in situations where accuracy and reliability are significant.
Grounding provides the following benefits:
- Reduces model hallucinations, which are instances where the model generates content that isn't factual.
- Anchors model responses to specific information.
- Enhances the trustworthiness and applicability of the generated content.
https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/overview
Retrieval Augmented Generation (RAG)¶
https://techcommunity.microsoft.com/t5/fasttrack-for-azure/grounding-llms/ba-p/3843857
Quote
Retrieval Augmented Generation (RAG) is the primary technique for grounding and the only one I will discuss in detail. RAG is a process for retrieving information relevant to a task, providing it to the language model along with a prompt, and relying on the model to use this specific information when responding. While sometimes used interchangeably with grounding, RAG is a distinct technique, albeit with some overlap. It is a powerful and easy-to-use method, applicable to many use-cases.
Fine-tuning, on the other hand, is an "honourable mention" when it comes to grounding. It involves orchestrating additional training steps to create a new version of the model that builds on the general training and infuses the model with task-relevant information. In the past, when we had less capable models, fine-tuning was more prevalent. However, it has become less relevant as time-consuming, expensive, and not offering a significant advantage in many scenarios.
The general consensus among experts in the field is that fine-tuning typically results in only a 1-2% improvement in accuracy (depending on how accuracy is defined). While there may be specific scenarios where fine-tuning offers more significant gains, it should be considered a last-resort option for optimisation, rather than the starting go-to technique.
Info
Unlike RAG, fine tuning changes some of the model weights. In some cases, it can lead to reduced performance via catastrophic forgetting.
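To make the retrieve-then-prompt pattern described above concrete, here is a minimal RAG sketch. The `embed()` and `llm()` callables are placeholders for whatever embedding model and LLM you use; the ranking is a simple cosine similarity over whole documents, which a production system would replace with a vector database and chunking:

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_answer(question, documents, embed, llm, top_k=3):
    """Retrieve the most relevant documents and ground the LLM's answer in them.

    embed(text) -> vector and llm(prompt) -> str are placeholders for your
    embedding model and language model of choice.
    """
    q_vec = embed(question)
    ranked = sorted(documents,
                    key=lambda doc: cosine_similarity(embed(doc), q_vec),
                    reverse=True)
    context = "\n\n".join(ranked[:top_k])
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm(prompt)
```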
Vertex AI Grounding¶
Google announced Grounding in April 2024.
Quote
You can ground language models to your own text data using Vertex AI Search as a datastore. With Vertex AI Search you integrate your own data, regardless of format, to refine the model output. Supported data types include:
- Website data: Directly use content from your website.
- Unstructured data: Utilize raw, unformatted data.
When you ground to your specific data the model can perform beyond its training data. By linking to designated data stores within Vertex AI Search, the grounded model can produce more accurate and relevant responses, and responses directly related to your use case.
https://cloud.google.com/vertex-ai/generative-ai/docs/grounding/overview#ground-private
RAG Architectures¶
https://www.linkedin.com/posts/weaviate-io_struggling-to-keep-up-with-new-rag-variants-activity-7272294342122192896-iMs1 is the image source, and it contains other useful articles on RAG.
Takeaways¶
Takeaways
- Where a lot of the information needed is captured in documentation, e.g. the MITRE CWE specification, Grounding is an effective, efficient, and easy option to improve the quality of responses.
Ended: Grounded Closed System
CWE Assignment ↵
Overview¶
Overview
There are several options to consider when building an LM solution for CWE assignment as described here.
Tip
Vulnerability Root Cause Mapping with CWE: Challenges, Solutions, and Insights from Grounded LLM-based Analysis details a more comprehensive production solution using LLMs for CWE Mapping.
Approach to using Language Models¶
Don't Train A Model On Bad Data!¶
It is possible to train a Language Model as a Classifier to assign CWEs to a CVE Description - and there are several research papers that took that approach e.g.
- V2W-BERT: A Framework for Effective Hierarchical Multiclass Classification of Software Vulnerabilities
- Automated Mapping of CVE Vulnerability Records to MITRE CWE Weaknesses
The problems with this approach:
- It's delusional: based on my research and experience, assigned CWEs are frequently incorrect - Garbage In, Garbage Out.
Quote
There has been significant interest in using AI/ML in various applications to use and/or map to CWE, but in my opinion there are a number of significant hurdles, e.g. you can't train on "bad mappings" to learn how to do good mappings.
- It removes a lot of the context that could be available to an LM by reducing the reference target down to a set of values or classes (for the given input CVE Descriptions).
Train on Good Data and the Full Standard¶
We can "train" on "good mappings".
- The CWE standard includes known "good mappings" e.g. CWE-917 Observed Examples includes CVE-2021-44228 and its Description.
- The count of these CVE Observed Examples varies significantly per CWE.
- There are ~3K CVE Observed Examples in the CWE standard.
- The Top25 Dataset of known-good mappings contains ~6K CVEs with known-good CWE mappings by MITRE.
- We can use the full CWE standard and associated known good CWE mappings as the target, allowing an LLM to compare the CVE Description (and other data) to this.
- And moreover, prompt the LLM to provide similar CVEs to support its rationale for the CWE assignment
Tip
Rather than train a model on bad data, we can ask a model to assign / validate a CWE based on its understanding of the CWEs available (and its understanding of CWEs assigned to similar CVEs based on the Observed and Top25 Examples for each CWE in the standard).
We can ask the model to follow the MITRE CWE Guidance when assigning a CWE.
Closed or Open Model¶
We can use a Closed or Open Model:
- a closed-model with access to the CWE specification only (and no other data) e.g. NotebookLM
- an open-model with access to the CWE specification and other data
RAG Corpus¶
Representations of the MITRE CWE Specification:
- PDF version of the MITRE CWE Specification
- JSON version of the MITRE CWE Specification
- JSON version of the modified MITRE CWE Specification, with parts added and removed to make it more relevant to CWE assignment for an LLM, as described here
Tip
JSON is preferred over PDF as PDF is generally more lossy because it is less structured.
GPT setups¶
Different GPT setups e.g.
- ChatGPT GPT
- Requires you and users to have a paid ChatGPT subscription
- NoteBookLM
- Anyone with a Google account can get up and running in 5 minutes for free.
- VertexAI
- This allows the most customization - but there's more effort to set it up and it is not free.
Prompts¶
- Various Prompts, and Prompt Engineering Techniques, can be used depending on what you want.
Model and Environment¶
For processing 250K+ CVE Descriptions, speed, latency and cost are important considerations, in addition to accuracy.
Based on comparing LLMs as at September 2024, Gemini 1.5 Flash was chosen.
There are different Google AI Environments:
- Google AI Studio - lower learning curve, cost, and capability
- Vertex AI Studio, or Vertex AI in general
ChatGPT CWE GPT¶
Overview
Here we'll use ChatGPT with the MITRE CWE specification to aid mapping CWEs to vulnerability descriptions.
This is a no-code option.
We'll attempt to implement a closed grounded system to ensure the accuracy of the data (and mitigate hallucinations)
- Grounded: content is provided to inform the answers
- Closed system: answers come from only the documents you provide
Result
The result is a Grounded Open-System i.e. we can ground - but can't create a closed system - and we still get hallucinations in some cases.
Tip
Vulnerability Root Cause Mapping with CWE: Challenges, Solutions, and Insights from Grounded LLM-based Analysis details a more comprehensive production solution using LLMs for CWE Mapping.
Recipe¶
- Use ChatGPT GPTs which are "custom versions of ChatGPT that combine instructions, extra knowledge, and any combination of skills."
- Grounded: Provide the MITRE CWE specification as the "extra knowledge" in JSON format (not PDF).
- Closed system: Limit the GPT to that knowledge only i.e. disable web search
- but we see that the GPT still has knowledge from its training data
MITRE CWE Specification¶
- Use a JSON text version instead of PDF
- ensures all the relevant text we want is fed to the model (e.g. text extraction from PDFs can be lossy for tables)
- allows us to remove content that is not relevant e.g. "ContentHistory" can contain a lot of text that is not relevant to CWE assignment
- Get the MITRE CWE Specification as JSON from https://github.com/CWE-CAPEC/REST-API-wg/blob/main/json_repo/cwe.json
- Remove the "ContentHistory" entries as this is not useful to assign CWEs and is a lot of content
- use Claude 3.5 to generate the python code to do this
- Split it into smaller files
- because the single file is too large to import
- use Claude 3.5 to generate the python code to do this
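A minimal sketch of that preprocessing is shown below. The key name "ContentHistory", the "Weaknesses" list, and the output layout are assumptions based on the steps above; inspect the actual structure of cwe.json and adjust accordingly:

```python
import json
from pathlib import Path

def strip_and_split(cwe_json_path, out_dir, chunk_size=200):
    """Remove ContentHistory entries and split the CWE JSON into smaller files.

    Assumption: "ContentHistory" is the key to drop and "Weaknesses" is the
    top-level list of weakness entries; verify against the real cwe.json.
    """
    data = json.loads(Path(cwe_json_path).read_text(encoding="utf-8"))

    def remove_content_history(node):
        if isinstance(node, dict):
            node.pop("ContentHistory", None)
            for value in node.values():
                remove_content_history(value)
        elif isinstance(node, list):
            for item in node:
                remove_content_history(item)

    remove_content_history(data)

    weaknesses = data.get("Weaknesses", [])
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(0, len(weaknesses), chunk_size):
        chunk = weaknesses[i:i + chunk_size]
        out_file = out_dir / f"cwe_{i // chunk_size:03d}.json"
        out_file.write_text(json.dumps(chunk, indent=2), encoding="utf-8")
```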
Configure ChatGPT CWE GPT¶
- Import the MITRE CWE Specification as split JSON files
- Disable all capabilities:
- Web browsing is disabled so the answers come from the imported MITRE CWE Specification
- Provide example starter prompts:
- what is the best CWE to describe the root cause weakness in CVE "an issue in the Pickle Python library of some product allows attackers to execute arbitrary commands". Provide CVEs with the most similar root cause to support your answer.
- what is the best CWE to describe the root cause weakness in CVE "ProductX contains a default SSH public key in the authorized_keys file. A remote attacker could use this key to gain root privileges.". Provide CVEs with the most similar root cause to support your answer.
- what cwe ids are associated with xss. list them all
- what cwe ids are associated with path or directory traversal. list them all
- Publish it so others can use it: https://chatgpt.com/g/g-d8guurMPs-cwe-map

Check if the System is Closed¶
Quote
What is a dog?
Failure
Observations
The system is not closed because the GPT can answer the question even though there is no information about dogs in the MITRE CWE specification.
Example Usage: CWE-502¶
Quote
what is the best CWE to describe the root cause weakness in CVE "an issue in the Pickle Python library of some product allows attackers to execute arbitrary commands".
let's think this out step by step. Before you answer, validate your answer by checking that at least one of ObservedExamples for that CWE is similar to the description I provided.

CWE-502 includes this as part of the ObservedExamples
{
"Reference": "CVE-2011-2520",
"Description": "Python script allows local users to execute code via pickled data.",
"Link": "https://www.cve.org/CVERecord?id=CVE-2011-2520"
},
{
"Reference": "CVE-2012-4406",
"Description": "Unsafe deserialization using pickle in a Python script.",
"Link": "https://www.cve.org/CVERecord?id=CVE-2012-4406"
},
Example Usage: CWE-1394¶
Quote
what is the best CWE to describe the root cause weakness in CVE "ProductX contains a default SSH public key in the authorized_keys file. A remote attacker could use this key to gain root privileges."
let's think this out step by step. Before you answer, validate your answer by checking that at least one of ObservedExamples for that CWE is similar to the description I provided.

CWE-1394 includes this as part of the ObservedExamples
{
"Reference": "CVE-2016-1561",
"Description": "backup storage product has a default SSH public key in the authorized_keys file, allowing root access",
"Link": "https://www.cve.org/CVERecord?id=CVE-2016-1561"
},
Example Usage: CWE-1394 with Example CVEs¶

Failure
"CVE-2013-0138: Network device with a default SSH key that can be exploited to gain unauthorized access."
This is not the description for https://nvd.nist.gov/vuln/detail/CVE-2013-0138.
CVE-2013-0138 is not listed in the MITRE CWE specification.
Note: https://nvd.nist.gov/vuln/detail/CVE-2013-0137 relates to a known SSH key.
Observations
Hallucinations happened for the CVE IDs that the GPT gave as supporting examples whether the "Web Browsing" Capability was enabled or not.
Takeaways¶
Takeaways
- Grounding ChatGPT with the MITRE CWE specification helps reduce hallucinations but does not eliminate them.
- Hallucinations happened for the CVE IDs that the GPT gave as supporting examples whether the "Web Browsing" Capability was enabled or not.
- It is not possible to make ChatGPT GPT a closed system because it has knowledge outside the provided content and will answer from that.
NotebookLM CWE¶
Overview
In this chapter, we'll use NotebookLM to ingest the Common Weakness Enumeration (CWE) list.
- In the current version (CWE Version 4.14 2024-02-29 as at June 2024), there are over 1400 CWEs, and 2789 pages in the PDF version of the list.
- It can be unwieldy to navigate or assimilate this information.
Using NotebookLM, we can chat with the CWE document and ask questions, so that the information comes to us in the form of an answer.
For comparison, we will also ask some of the same questions of ChatGPT4o covering 2 cases:
- without uploading any documents to it.
- uploading a document to it.
Success
This content and approach was presented to the CWE Root Cause Mapping Working Group mid July.
The feedback was very positive!
Data Sources¶
MITRE CWE Specification Preparation¶
Tip
cwe_latest.pdf from https://cwe.mitre.org/data/downloads.html has 2789 pages (version CWE Version 4.14 2024-02-29).
- This is too big for NotebookLM to ingest.
- Use a JSON text version instead of PDF:
- ensures all the relevant text we want is fed to the model (e.g. text extraction from PDFs can be lossy for tables)
- allows us to remove content that is not relevant e.g. "ContentHistory" can contain a lot of text that is not relevant to CWE assignment
Data Sources
The files in https://github.com/CyberSecAI/cwe_top25/tree/main/data_out/output_jsonl are imported into NotebookLM. They are generated as follows:
- Get the MITRE CWE Specification as JSON from https://github.com/CWE-CAPEC/REST-API-wg/blob/main/json_repo/cwe.json
- Remove the "ContentHistory" entries as this is not useful to assign CWEs and is a lot of content
- use Claude 3.5 to generate the python code to do this
- Split it into smaller files
- because the single file is too large to import
- use Claude 3.5 to generate the python code to do this
- The output files are JSONL with a txt extension so NoteBookLM recognizes them: https://github.com/CyberSecAI/cwe_top25/tree/main/data_out/output_jsonl
Import MITRE CWE Specification to NoteBookLM¶
- Download all jsonl.txt files from https://github.com/CyberSecAI/cwe_top25/tree/main/data_out/output_jsonl
- Open https://notebooklm.google.com/ in a browser
- Add sources

what cwe ids are associated with buffer overflow?¶


Note
https://cwe.mitre.org/data/definitions/121.html contains only 1 Observed Example:
- CVE-2021-35395. This is listed under CWE-121: Stack-based Buffer Overflow
The other CVE examples are from the Top25 Mapping Examples included in the JSONL files
- CVE-2021-20046, CVE-2021-20048: Stack-based buffer overflows in SonicOS HTTP response headers allow for Denial of Service and potentially remote code execution.
- CVE-2021-21887: A stack-based buffer overflow in Lantronix PremierWave 2050 8.9.0.0R4 allows remote code execution via a crafted HTTP request.
- CVE-2021-35395: A stack-based buffer overflow vulnerability exists in the DHCP client of Lantronix PremierWave 2050 8.9.0.0R4.
CWE Assignment Prompt¶
The notebook can be used to assign CWEs to a given CVE Description using a specific prompt: https://github.com/CyberSecAI/CWEMap/blob/main/prompts/assign_cwes/system.md
NoteBookLM does not allow the full prompt to be copied into the chat as is - it needs to be copied in 3 smaller parts.
Example CVE-2024-5158¶
CVE-2024-5158
Type Confusion in V8 in Google Chrome prior to 125.0.6422.76 allowed a remote attacker to potentially perform arbitrary read/write via a crafted HTML page. (Chromium security severity: High) https://nvd.nist.gov/vuln/detail/CVE-2024-5158

Example CVE-2022-30271¶
CVE-2022-30271
The Motorola ACE1000 RTU through 2022-05-02 ships with a hardcoded SSH private key and initialization scripts (such as /etc/init.d/sshd_service) only generate a new key if no private-key file exists. Thus, this hardcoded key is likely to be used by default. https://nvd.nist.gov/vuln/detail/CVE-2022-30271

Warning
Some of the CWE Abstraction Level values are incorrect; we’re asking for too much here in one prompt with this simple setup.
We need to either simplify or decompose into separate prompts.
Sharing Notebooks¶
Tip
The notebook can be shared with other users via email (using the share button top right) to allow them to use the notebook. See https://support.google.com/notebooklm/answer/14276471?hl=en.
Sharing is via a user's email.
Takeaways¶
Takeaways
- I found it useful, quick, and convenient to be able to ask questions of the CWE list, as a complement to the traditional search method I've used in the past.
- It is also useful to suggest CWEs (and associated info) for a given Vulnerability Description.
- The Vulnrichment chapter covers a much more detailed use case - and NoteBookLM performed better than other LLMs.
Google Vertex AI Agent Builder¶
Overview
Here we'll use Vertex AI Agent Builder with the MITRE CWE specification to aid mapping CWEs to vulnerability descriptions.
This is a no-code option.
We'll implement a closed grounded system to ensure the accuracy of the data (and mitigate hallucinations)
- Grounded: content is provided to inform the answers
- Closed system: answers come from only the documents you provide
Tip
In other words, we'll build NotebookLM.
- where NotebookLM is basically a combination of Vertex AI Search for Unstructured (PDFs, HTML, etc.), Vertex AI Grounding, and a custom UX/UI.
- But we'll take advantage of the structured data (JSON) that we have for MITRE CWE list, instead of using the unstructured data from the MITRE CWE list PDF.
Result
The result is a grounded closed system (comparable in performance and accuracy to NotebookLM).
But we don't have reference links to the source content in the response - I didn't add that part yet, though it's standard functionality that is easy to add in Vertex AI.
Grounding Confidence¶
Quote
For each response generated from the content of your connected data stores, the agent evaluates a confidence level, which gauges the confidence that all information in the response is supported by information in the data stores. You can customize which responses to allow by selecting the lowest confidence level you are comfortable with. Only responses at or above that confidence level will be shown.
There are 5 confidence levels to choose from: VERY_LOW, LOW, MEDIUM, HIGH, and VERY_HIGH.
https://cloud.google.com/dialogflow/vertex/docs/concept/tools
Quote
To create a data store and connect it to your app, you can use the Tools link in the left navigation of the console. Follow the instructions to create a data store. https://cloud.google.com/dialogflow/vertex/docs/concept/tools
Recipe¶
Same recipe as before, but we'll use Google Vertex AI Agent Builder.
MITRE CWE Specification¶
Same MITRE CWE Specification as the data source.
Build Vertex AI Agent¶
This link gives the steps with links to the details, summarized here:
- Vertex AI Agent Builder
- Create App
- Select app type
- Agent (preview) "Built using natural language, agents can answer questions from data, connect with business systems through tools and more"
- Create Data Store
- The MITRE CWE JSON data is converted to jsonl format for import
- It takes ~5 minutes to ingest (create embeddings for) the jsonl file
- There are lots of other Settings available like Logging, Git integration to push/pull agents from a Github repo, or just download the JSON file that represents the agent.
- The built agent supports interactions via the API.
Tip
To create a Grounded Open system, select the "search" app type.
The agent will retrieve information from the local documents you provide and via web search.
Note
Alternatively these steps can be implemented with code e.g. https://cloud.google.com/vertex-ai/generative-ai/docs/multimodal/ground-gemini#generative-ai-gemini-grounding-python_vertex_ai_sdk
Quote
Note: Conversation history is used as context during tool invocation. Learn more
Data Preprocessing¶
Remove unneeded sections from the json
- content_history
- views
- categories
Convert to jsonl for import
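A minimal sketch of that preprocessing and conversion follows. The key names ("content_history", "views", "categories", "weaknesses") mirror the bullets above but may need adjusting to the actual cwe.json structure:

```python
import json
from pathlib import Path

def cwe_json_to_jsonl(cwe_json_path, jsonl_path,
                      drop_keys=("content_history", "views", "categories")):
    """Drop unneeded sections and write one weakness per line (JSONL) for import.

    Assumption: drop_keys and "weaknesses" match the real cwe.json keys;
    check the file before relying on them.
    """
    data = json.loads(Path(cwe_json_path).read_text(encoding="utf-8"))
    for key in drop_keys:
        data.pop(key, None)

    with open(jsonl_path, "w", encoding="utf-8") as out:
        for weakness in data.get("weaknesses", []):
            out.write(json.dumps(weakness) + "\n")
```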
Data Import¶



Check if the System is Closed¶
Quote
What is a dog?
Success
The system is closed: the agent can't answer the question because there is no information about dogs in the MITRE CWE specification.
What are the different types of XSS?¶

What CWE IDs Relate To XSS?¶

What Is The Parent Weakness Or CWE For XSS And CSRF?¶

What CWE IDs Relate To Path Or Directory Traversal? List All CWE IDs And Their Description In A Table¶

What is the CWE Associated With CVE¶
Quote
What is the CWE associated with CVE-2021-27104 "Accellion FTA OS Command Injection Vulnerability"
What is the CWE associated with "Cisco Small Business RV320 and RV325 Routers Information Disclosure Vulnerability"

Example Usage: CWE-1394¶
Quote
what is the best CWE to describe the root cause weakness in CVE "ProductX contains a default SSH public key in the authorized_keys file. A remote attacker could use this key to gain root privileges.".

Other App Builder Docs¶
These were not used or required, but are listed here as I found them informative.
- https://codelabs.developers.google.com/codelabs/vertex-ai-conversation#0
- https://codelabs.developers.google.com/build-google-quality-rag#0
- https://github.com/GoogleCloudPlatform/generative-ai/tree/main
- https://github.com/GoogleCloudPlatform/generative-ai/tree/main/conversation/chat-app
- https://cloud.google.com/generative-ai-app-builder/docs/samples?language=python
- https://cloud.google.com/generative-ai-app-builder/docs/samples/genappbuilder-multi-turn-search?hl=en
- https://cloud.google.com/generative-ai-app-builder/docs/samples/genappbuilder-import-documents?hl=en
- https://www.googlecloudcommunity.com/gc/Community-Blogs/Building-and-Deploying-AI-Agents-with-LangChain-on-Vertex-AI/bc-p/750793#M415
- https://medium.com/google-cloud/gen-ai-grounding-with-vertex-ai-llm-3cb1cbe9f9d2
- https://medium.com/google-cloud/designing-data-store-hybrid-agents-with-dialogflow-cx-vertex-ai-agents-070082f07cb4
- https://www.cloudskillsboost.google/paths/236/course_templates/978/labs/488165
- https://colab.research.google.com/github/GoogleCloudPlatform/generative-ai/blob/2d22382cf72840dcde313db2f2feb2115f9fbd70/gemini/grounding/intro-grounding-gemini.ipynb?hl=es-419
- https://docs.llamaindex.ai/en/stable/examples/agent/agentic_rag_using_vertex_ai/
- https://saschaheyer.medium.com/vertex-ai-grounding-large-language-models-8335f838990f
Takeaways¶
Takeaways
- Google Vertex AI Agent Builder allows/requires more control over the agent than the ChatGPT GPTs currently.
- Google Vertex AI Agent Builder supports a Closed (or Open) System with Grounding and Grounding Confidence threshold unlike ChatGPT GPTs currently.
- This comes close to NotebookLM but
- does not provide references from the original documents from which the answer was determined.
CISA Vulnrichment¶
Overview
An incorrect CWE assignment caught my eye while reading a post
I was reading a post on LinkedIn and the CWE assigned by CISA ADP looked wrong so...
- I used my NotebookLM CWE notebook, and other LLMs, to determine the appropriate CWE.
- I then raised an issue: https://github.com/cisagov/vulnrichment/issues/84.
- I then decided to dig a bit more into this... specifically the CWEs assigned by CISA ADP.
Using LLMs to find incorrect CWE assignments at scale
I used langchain to create a consensus of LLMs to review all CWEs assigned by CISA ADP to find issues:
- These issues were found automatically by a consensus of 3 LLMs (current state-of-the-art: ChatGPT4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet) that were asked to review CWEs assigned to CVEs by CISA ADP.
- The consensus output was then reviewed by a human (me).
- I created 3 Vulnrichment Github Issues initially and these were accepted by CISA Vulnrichment and resolved promptly!
- I then provided a report to CISA Vulnrichment for all CWEs that were incorrect based on the consensus.
Using LLMs to assign correct CWEs
Finally, I showed how NotebookLM can be used for CWE assignment
- It avoids the problem of
- training language models on bad data (existing CVE CWE assignments)
- training humans on the detailed CWE standard (though a basic understanding is still required)
- NotebookLM did well in recommending a CWE given a CVE Description.... and providing a supporting CVE from the CWE Observed Examples in the CWE standard.
- NotebookLM has a large context window which allows it to digest the large CWE standard, and it is source-grounded as described in the NotebookLM chapter.
- Notebook on CWEs describes how to create this NotebookLM for CWEs
The different approaches used (and the subscription plan used):
- code:
- ChatGPT4o OpenAI Batch API (Plus Plan)
- langchain calling 2 LLMs via APIs: Gemini 1.5 Pro, Claude 3.5 Sonnet (ChatGPT4 or ChatGPT4o is also supported in the code (but commented out) if your plan supports that)
- no-code using the browser chat interface is not shown here but it is useful for initial evaluation
- Gemini 1.5 Pro (subscription)
- Claude 3.5 Sonnet (prepay)
Success
The approach described here resulted in successfully identifying incorrectly assigned CWEs, and identifying the correct CWEs, and a report to CISA Vulnrichment.
Consensus of 3 LLMs¶

Source Code
CISA Vulnrichment¶
Quote
The CISA Vulnrichment project is the public repository of CISA's enrichment of public CVE records through CISA's ADP (Authorized Data Publisher) container. In this phase of the project, CISA is assessing new and recent CVEs and adding key SSVC decision points. Once scored, some higher-risk CVEs will also receive enrichment of CWE, CVSS, and CPE data points, where possible.
I have great admiration for CISA and their pragmatic initiatives like CISA KEV and SSVC and have spoken about them and applied them in production.
Tip
One of the many benefits of this Vulnrichment project is that feedback can be provided as GitHub issues and the team is VERY responsive ✅ 🙌
- The 'Bug' label was assigned the same day to the 3 issues I submitted: https://github.com/cisagov/vulnrichment/issues?q=is%3Aissue+author%3ACrashedmind+is%3Aclosed.
- The changes were accepted and applied the next working day and a comment was added to the ticket which was then closed.
My overall goal here was to
- Show that LLMs could augment human analysts where vulnerability enrichment today is largely done manually.
- Show how to use them for this purpose.
- Get people to use LLMs to improve the quality of the CVE data in general, and in this specific example case, the CWE data.
- Maximize the value of the CISA ADP data and enrichment by improving CWE-assignment quality.
Get CVEs Enriched by CISA ADP¶
What to ask the LLMs?¶
Different approaches are possible when providing the CVE Description to the LLM:
- provide the CWE assigned as part of the CVE, and ask the LLM if it agrees or not, and only if not, why
- This reduces the output token count/cost by only outputting the verbose rationale in case of disagreement
- ask the LLM to assign one or more CWEs, with rationale
The first approach is simpler and cheaper (the shorter response output uses fewer tokens), and better as a first-pass option to get the low-hanging fruit.
The second approach could be used at the time of CWE assignment to get a second opinion.
Consensus¶
To minimize human effort, 3 LLMs are used and the consensus is reviewed
- The LLMs are state-of-the-art models from different providers i.e. the best available and reasonably independent.
- The results are sorted by consensus (i.e. 3 models in agreement, then 2 models in agreement, ...) and by the LLMs' confidence in their responses.
- A Human (me) then reviewed (sorted by consensus and confidence) and made the final decision.
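A minimal sketch of that sorting step with pandas, assuming the per-LLM Agree/Confidence answers have already been merged into one table; the column names and example rows are illustrative:

```python
import pandas as pd

# Hypothetical merged results: one row per CVE, one Agree/Confidence pair per LLM.
df = pd.DataFrame({
    "cve_id": ["CVE-2024-0001", "CVE-2024-0002"],
    "gpt4o_agree": ["No", "Yes"],
    "gemini_agree": ["No", "Yes"],
    "claude_agree": ["No", "No"],
    "gpt4o_confidence": [0.9, 0.7],
    "gemini_confidence": [0.8, 0.6],
    "claude_confidence": [0.95, 0.8],
})

agree_cols = ["gpt4o_agree", "gemini_agree", "claude_agree"]
conf_cols = ["gpt4o_confidence", "gemini_confidence", "claude_confidence"]

# Consensus = number of LLMs that disagree with the assigned CWE.
# Review the strongest disagreements (highest consensus, highest confidence) first.
df["disagree_count"] = (df[agree_cols] == "No").sum(axis=1)
df["mean_confidence"] = df[conf_cols].mean(axis=1)
review_order = df.sort_values(["disagree_count", "mean_confidence"], ascending=False)
print(review_order[["cve_id", "disagree_count", "mean_confidence"]])
```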
Recipe¶
- Get the Vulnrichment subset of CVEs where CISA ADP assigned a CWE (regardless of whether the CWE was the same or different than that assigned by the CNA) into a sheet/CSV file.
- ~1.8K CVEs (with CISA ADP assigned CWEs) of ~10K CVEs (in Vulnrichment)
- As a dry-run submit e.g. 50 CVE Descriptions, CWEs to each of the 3 LLMs to review via the chat UI in one prompt
- Ask ChatGPT4o (via Batch API) to Agree (Yes/No) with the assigned CWE (and provide a Confidence score, and rationale if not)
- Sort these by Confidence score i.e. start with the highest Confidence ones.
- Assign the same task to Gemini and Claude via APIs driven by langchain
Create a Prompt¶
Chat Interface - Table Output¶
caption_system_prompt =
You are a cybersecurity expert specializing in identifying Common Weakness Enumeration (CWE) IDs from CVE descriptions.
Your goal is to say if you Agree with the assigned CWE ID or not.
You will be provided with a CVE ID and description and a CWE ID that has been assigned to that CVE description.
Please provide the response in a table with columns 'cve_id', 'CWE_ID', 'Agree', 'Rationale', 'Confidence' where
1. Agree: string // Yes or No
2. Rationale: string // Only if you do not Agree, provide a rationale why not
3. Confidence: string // a confidence score between 0 and 1
The table output allows copy-and-pasting by a human into a sheet.
The prompt consists of these parts:
- Role + Task: which is the same for the Chat and API interface
- Output format: which is different for the Chat and API interface
- A binary value Agree is requested
- The rationale only if there is disagreement. This saves on output tokens.
- A Confidence score to limit impacts of hallucinations, and as a way to assess and prioritize responses by confidence.
- No (Few-shot) examples are provided. Based on the results, these were not necessary.
- If Few-shot examples were required, I'd submit multiple CVEs in a single batch request (because putting the examples in each single CVE request would add a LOT of input tokens)
Batch API Interface - JSON Output¶
caption_system_prompt =
You are a cybersecurity expert specializing in identifying Common Weakness Enumeration (CWE) IDs from CVE descriptions.
Your goal is to say if you Agree with the assigned CWE ID or not.
You will be provided with a CVE ID and description and a CWE ID that has been assigned to that CVE description.
You will output a json object containing the following information:
{
Agree: string // Yes or No
Rationale: string // Only if you do not Agree, provide a rationale why not
Confidence: string // a confidence score between 0 and 1
}
Use JSON Mode
ChatGPT and Gemini 1.5 support JSON mode that always outputs valid JSON. Use it!
While you can prompt an LLM to output JSON, it may not always output valid JSON and you're left with a cleanup exercise (a friend of mine had that experience when they first tried this 😉)
Claude doesn't have a formal "JSON Mode" though, in my usage, it always produced valid JSON.
Tip
It is possible to submit multiple CVEs in one prompt for each batch entry i.e. similar to what is done when using the Chat interface.
- This is what is done here. 10 CVE CWE assignments are sent per batch (though the OpenAI Batch API example contains 1 entry only)
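A minimal sketch of building and submitting such a batch with the OpenAI Python SDK, using JSON mode. The prompt text, file names, and the `cve_rows` structure are illustrative placeholders for your own CSV of CVE/CWE pairs, and for brevity only one CVE is packed per request here:

```python
import json
from openai import OpenAI

# Illustrative system prompt; the real one is shown above.
SYSTEM_PROMPT = (
    "You are a cybersecurity expert reviewing assigned CWE IDs. "
    "Output a JSON object with Agree, Rationale, Confidence."
)

# Hypothetical input: (cve_id, description, assigned CWE) tuples from your CSV.
cve_rows = [
    ("CVE-2024-0001", "Example description of a buffer overflow ...", "CWE-787"),
]

# 1. Build the JSONL batch input file, one request per line.
with open("cwe_review_batch.jsonl", "w", encoding="utf-8") as f:
    for cve_id, description, cwe in cve_rows:
        request = {
            "custom_id": cve_id,
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o",
                "response_format": {"type": "json_object"},  # JSON mode: always valid JSON
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": f"{cve_id}: {description}\nAssigned CWE: {cwe}"},
                ],
            },
        }
        f.write(json.dumps(request) + "\n")

# 2. Upload the file and create the batch job (completes within 24 hours).
client = OpenAI()  # requires OPENAI_API_KEY in the environment
batch_file = client.files.create(file=open("cwe_review_batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)
```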
LLMs¶
Gemini 1.5 Pro API via Langchain¶
The API interface (via Langchain) was used in this example submitting multiple CVEs in one prompt.
Claude 3.5 Sonnet API via Langchain¶
Model¶
Currently: Claude 3.5 Sonnet was used as it has the best performance vs cost for Claude models.
Interface¶
Currently: Claude does not support a native Batch API interface - though Amazon Bedrock supports batching of prompts to models including Claude.
The API interface (via Langchain) was used in this example submitting multiple CVEs in one prompt.
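A minimal sketch of calling both models through LangChain with a shared prompt. The package names (`langchain-google-genai`, `langchain-anthropic`) and model identifiers reflect the integrations at the time of writing and may need updating; the prompt content is illustrative:

```python
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_google_genai import ChatGoogleGenerativeAI   # needs GOOGLE_API_KEY
from langchain_anthropic import ChatAnthropic                # needs ANTHROPIC_API_KEY

# Illustrative system prompt; the real one is shown above.
SYSTEM_PROMPT = (
    "You are a cybersecurity expert reviewing assigned CWE IDs. "
    "Output a JSON object with Agree, Rationale, Confidence for each CVE."
)

# Multiple CVE/CWE pairs can be packed into one user message, as described above.
user_prompt = (
    "CVE-2024-0001: Example description of a buffer overflow ... Assigned CWE: CWE-787\n"
    "CVE-2024-0002: Example description of an SQL injection ... Assigned CWE: CWE-89"
)

models = {
    "gemini-1.5-pro": ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0),
    "claude-3-5-sonnet": ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0),
}

messages = [SystemMessage(content=SYSTEM_PROMPT), HumanMessage(content=user_prompt)]

# The same generic code works across providers - one benefit of LangChain noted in the Takeaways.
for name, model in models.items():
    response = model.invoke(messages)
    print(name, response.content)
```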
ChatGPT4o Batch API¶
Model¶
gpt-4o
Plan¶
The Plus plan subscription was used.
Quote
There are some restrictions:
- The file can contain up to 50,000 requests.
- The file cannot be more than 100 MB in size.
Enqueued token limit reached for gpt-4o in organization XYZ. Limit: 90,000 enqueued tokens. Please try again once some in_progress batches have been completed.'
Interface¶
Batch Interface API.
The ~1800 ADP CVE-CWE pairs were split into 15 files of 100 CVE-CWE pair prompts to comfortably fit under this token limit.
- very little effort was spent to optimize the file size (number of prompts per batch), or the prompt size.
- The cost to process the ~1800 ADP CVE-CWE pairs: ~$2.
Observations¶
Leak¶
Several CVE Descriptions that include "leak" were incorrectly assigned "CWE-200 Exposure of Sensitive Information to an Unauthorized Actor".
These were actually resource leaks (memory, program objects like handles etc...), not leakage of sensitive data.
Gemini 1.5 Pro Hallucinations¶
Failure
From a sample of 30 assigned CWEs, Gemini 1.5 Pro had 3 hallucinations when asked to provide a rationale for its response (response text shown below, with hallucination in bold).
Quote
- While CWE-400 (Uncontrolled Resource Consumption) could be a potential consequence, the core issue described in the CVE is about improper handling of the Python crash handler within a chroot environment. This misconfiguration could allow an attacker to potentially escape the chroot and execute code in the context of the Apport process. A more appropriate CWE might be CWE-247 (Improper Handling of Chroot Environments) or CWE-22 (Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal')).
- CWE-247: DEPRECATED: Reliance on DNS Lookups in a Security Decision
- CWE-243: Creation of chroot Jail Without Changing Working Directory is the closest to "Improper Handling of Chroot Environments"
- The assigned CWE-436 (Unspecified Encoding or Escaping of Output) doesn't fit well. The description argues that the plugin itself doesn't handle file uploads, making it a matter of what WordPress allows to be uploaded. If the core WordPress installation doesn't prevent polyglot files, it's not directly the plugin's fault. A more suitable CWE might be CWE-284 (Improper Access Control) if the plugin fails to restrict access to already uploaded polyglot files, allowing unauthorized viewing or download.
- CWE-116 is the correct CWE: CWE-116: Improper Encoding or Escaping of Output
- While CWE-57 (Relative Path Traversal) could be a factor in exploiting this vulnerability, the description more directly aligns with CWE-664: Improper Control of File Deletion or Renaming Operations. The vulnerability allows attackers to delete arbitrary files, which is the core characteristic of CWE-664. Relative path traversal might be a means to reach different files, but the primary issue is the improper control over file deletion.
- CWE-664 correct title is CWE-664: Improper Control of a Resource Through its Lifetime
Tip
While Gemini 1.5 Pro is the backend for NotebookLM, no hallucinations were detected in NotebookLM (the benefit of a source-grounded closed system).
NotebookLM¶
NotebookLM works very well for suggesting a relevant CWE for a given CVE Description (or reviewing an assigned CWE)
- It suggests related CVEs to support the CWE it suggests (sometimes prompting is required). These CVEs are part of the "Observed Examples" CVEs that are listed under a CWE in the CWE standard.
- The numbered grey circles in the diagram below are links to references from the CWE standard.
CVE-2024-35059
CVE-2023-49224
Note
The NotebookLM prompts above are deliberately not applying prompt engineering principles to show that NotebookLM still provides a useful response.
Refinements¶
Note: Some refinements are possible, but were not implemented in this first pass to minimize Time-To-Value
- The full CWE standard was used here for illustration purposes (and to take on the harder problem of a large specification)
- A subset of CWEs could be used if that is desired.
- In practice, several hundred CWEs are assigned to CVEs.
- Remove the 25 "DEPRECATED:" entries
- The text from the references in the CVE "References to Advisories, Solutions, and Tools" was not retrieved and fed to the LLM as part of the CVE Description for CWE review or assignment.
- These references were reviewed manually (for the consensus of incorrect CWE assignments)
- In some cases, this has additional text available that can inform the CWE assignment beyond the CVE Description alone
- It is relatively easy to retrieve this content automatically but this content varies significantly by type (text, image, pdf), quality, structure.
- Separately, it is common that these links break because the original website or post is removed, or it's not in the latest version in Github,... - so it would be useful to have the extracted text at the time of CWE assignment.
- Additional known good CVE descriptions and CWE assignments could be incorporated into the corpus, to augment the limited CVE observed examples that are part of the CWE standard.
- Reducing the set of CWEs to the desired population, or providing guidance in the prompt on what CWEs to use (e.g. "don't use CWE's marked as Discouraged")
- As I was interested in comparing LLM responses, I did not optimize the LLM usage (all LLMs were fed all CVE-CWEs)
- This can be done in several ways e.g.
- Ask each LLM in turn to review the previous assessments by LLMs
- Sampling
Takeaways¶
Takeaways
- The value of CVE data depends on its quality.
  - For all published CVEs to date, the quality of CWEs assigned is questionable.
  - A large part of that is that humans can't grok ~1000 CWEs. LLMs can.
- Using LLMs to suggest or validate CWEs can reduce the manual effort and error in CWE assignment.
- LLMs can validate CWEs at scale e.g. using Batch mode, or multiple CVEs per prompt, or both.
- LLMs perform well at this task and, given they can be automated, can augment the human manual effort, and improve the quality of assigned CWEs.
- Langchain makes it easier to have generic code that works across multiple LLMs.
- LLM JSON Mode should be used where possible to reduce bad JSON output and subsequent cleanup.
- Based on a manual review of the subset where all 3 LLMs disagreed with the CWE assignment, > 75% of these CWEs were incorrect (and a report with these was submitted to CISA Vulnrichment)
- I did not dig into the subset where 2 of 3 LLMs disagreed.
- A closed-system that is grounded on the CWE standard only e.g. NotebookLM, performs very well for assigning CWEs, or reviewing assigned CWEs (though it does not have an API so can't do this at scale), and no hallucinations were observed.
Ended: CWE Assignment
Context Engineering ↵
Overview¶
Abstract
This section looks at Context Engineering, and gives Prioritized Context Engineering Steps for Agentic AI.
Quote
"Context engineering is the delicate art and science of filling the context window with just the right information for the next step." Andrej Karpathy, https://www.youtube.com/watch?v=LCEmiRjPEtQ
Intent Engineering over Context Engineering over Prompt Engineering¶
Context Engineering¶
Quote
+1 for "context engineering" over "prompt engineering".
People associate prompts with short task descriptions you'd give an LLM in your day-to-day use. When in every industrial-strength LLM app, context engineering is the delicate art and science of filling the context window with just the right information for the next step. Science because doing this right involves task descriptions and explanations, few shot examples, RAG, related (possibly multimodal) data, tools, state and history, compacting... Too little or of the wrong form and the LLM doesn't have the right context for optimal performance. Too much or too irrelevant and the LLM costs might go up and performance might come down. Doing this well is highly non-trivial. And art because of the guiding intuition around LLM psychology of people spirits.
On top of context engineering itself, an LLM app has to:
- break up problems just right into control flows
- pack the context windows just right
- dispatch calls to LLMs of the right kind and capability
- handle generation-verification UIUX flows
- a lot more - guardrails, security, evals, parallelism, prefetching, ...
So context engineering is just one small piece of an emerging thick layer of non-trivial software that coordinates individual LLM calls (and a lot more) into full LLM apps. The term "ChatGPT wrapper" is tired and really, really wrong.
https://x.com/karpathy/status/1937902205765607626 @karpathy Jun 25
Intent Engineering¶
Quote
I really like it! But it’s still one level too deep.
I think the actual skill is more like Intent engineering. Articulating what you want extremely clearly.
As critical as it is, the context is still just support.
The more important part is the clear thinking.
https://x.com/DanielMiessler/status/1937937649005957260 @DanielMiessler Jun 25
Tip
See also https://muness.com/posts/intent-engineering/.
Breakthrough Method for Agile AI-Driven Development (BMAD) is an example framework where intent is expressed and checked.
Guides from Anthropic and OpenAI¶
Guides from Anthropic and OpenAI
Anthropic (on Context Engineering for Agents):
- Introduces the concept of "context rot," where model performance degrades as the context window fills.
- Stresses the importance of providing the smallest possible set of high-signal tokens.
- Recommends dynamic, "just-in-time" context retrieval.
- For long-horizon tasks, it suggests compaction, structured note-taking, and multi-agent architectures.
- They beta released a Memory Tool "to store and retrieve information across conversations through a memory file directory"
OpenAI Cookbook (on Session Memory):
- Provides a concrete implementation of short-term memory management through "session memory."
- Details specific techniques like "context trimming" (keeping the last N turns) and "context summarization."
- Offers practical parameters for implementation, such as context_limit and keep_last_n_turns.
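A minimal sketch of the "keep the last N turns" trimming idea, assuming a simple list of chat messages; the parameter name `keep_last_n_turns` mirrors the cookbook's, but the implementation here is illustrative:

```python
def trim_context(messages, keep_last_n_turns=5):
    """Keep the system message plus only the most recent conversation turns.

    messages: list of dicts like {"role": "system" | "user" | "assistant", "content": str}
    A "turn" here is one user message and the assistant reply that follows it.
    """
    system_msgs = [m for m in messages if m["role"] == "system"]
    chat_msgs = [m for m in messages if m["role"] != "system"]
    # Two messages per turn (user + assistant); keep only the tail of the history.
    trimmed = chat_msgs[-2 * keep_last_n_turns:]
    return system_msgs + trimmed
```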
Essential Knowledge¶
The best references covering the Problem and Solution space
- Breunig, D. (2025, June 22). How Long Contexts Fail (and how to fix them).
  Analyzes the failure modes of excessively long context windows:
  - Context poisoning: when a hallucination makes it into the context
  - Context distraction: when the context overwhelms the training
  - Context confusion: when superfluous context influences the response
  - Context clash: when parts of the context disagree
- Breunig, D. (2025, June 26). How to Fix Your Context.
  Follow-up article offering tactics to avoid context failure: RAG, tool loadout, context pruning, summarization, and offloading.
- Martin, R. L. (2025, June 23). Context Engineering.
  Overview of practical strategies for structuring and optimizing context for LLMs. Key practices: write, select, compress, isolate.
- IndyDevDan. (2025, July). Elite Context Engineering with Claude Code.
  Demonstrates advanced context engineering techniques using Claude Code. Core mantra: R&D = Reduce + Delegate: reduce what the primary agent sees; delegate everything else to sub-agents or background primaries, with crisp contracts and logs.
Prioritized Context Engineering Steps for Agentic AI¶
1) Minimize the Primary Context (treat it like a cache)¶
- Keep only: goal, immediate instructions, tiny retrieved slice, output schema.
- Keep ephemeral scratchpads for step-by-step reasoning; purge after use.
- Offload history, rules, and big docs to external memory or files.
- 🔑 Impact: Faster, cheaper, less distracted agent; fewer long-context failure modes.
2) Write Context Intentionally (don’t just append everything)¶
- Clear system message for goals, rules, constraints, evaluation criteria.
- Define output schemas (JSON/table) and acceptance checks.
- Put canonical facts/API contracts in stable references, not the live window.
- Log key decisions so context can be rebuilt if needed.
- 🔑 Impact: Prevents bloated prompts; establishes reliable “ground truth”.
3) Select Context Dynamically (quality > quantity)¶
- Retrieve only what’s relevant now via embeddings/filters.
- Cap retrieval: few high-signal snippets; drop duplicates/stale text.
- Prefer structured facts (tables/JSON) over long prose.
- 🔑 Impact: Model focuses on the right information; fewer hallucinations.
4) Use Context Priming (instead of big always-on memory files)¶
- Create reusable prime commands per task (Purpose → Read/Fetch → Steps → Report).
- Keep a tiny universal core (guardrails + IO schema) always loaded.
- Prime on demand for bugfix, feature, research, etc.
- 🔑 Impact: Minimal startup tokens with task-specific readiness.
5) Kill Always-On Baggage (tools/MCP autoloading)¶
- Don’t preload tool packs/MCP servers by default; load lazily per task.
- Unload after use; explicitly justify any autoload.
- Audit startup context and remove dead weight.
- 🔑 Impact: Frees large chunks of window; cuts token waste and latency.
6) Delegate Heavy Work to Sub-Agents (properly)¶
- One sub-agent = one job with a tight system prompt.
- Let sub-agents consume their windows; return distilled reports + artifacts.
- Standard report contract: {summary, key_facts, risks, artifacts[]}.
- 🔑 Impact: Keeps the primary window small while scaling capability.
7) Delegate Long/Expensive Tasks to Background Primary Agents¶
- Kick off separate agents for crawls, planning, batch edits, evals.
- They write report files + artifacts; foreground ingests concise summaries.
- Use this to “get out of the loop” while work continues.
- 🔑 Impact: Parallelizes work; maintains a lean, responsive main loop.
Tip
Claude Code CLI God Mode is an example of this, delegating to a different Primary model (that is the best tool for the job).
8) Compress Context (mostly outside the window)¶
- Rolling conversation summaries; hierarchical notes (facts → decisions → actions).
- Store raw data externally; insert only pointers + summaries.
- Heuristic trimming rules (dedupe, last-N, relevance thresholds).
- 🔑 Impact: Extends effective history without clogging the window.
9) Isolate Contexts (sandboxing & pipelines)¶
- Keep runtime state objects and scratchpads per agent.
- Avoid leaking sub-agent history into the primary prompt.
- Build multi-agent pipelines with explicit handoffs.
- 🔑 Impact: Reduces poisoning, distraction, and cross-task conflicts.
10) Log Context Bundles for replay/handoff¶
- Save {run_id, primes, reads, tool_calls, key_findings, decisions, outputs, next_steps}.
- Use bundles to reprime new agents after window blow-ups or for continuity.
- Keep bundles concise—no verbatim dumps of huge content.
- 🔑 Impact: Reproducibility and seamless continuation across sessions.
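A minimal sketch of what such a bundle might look like as a typed structure; the field names follow the list above, everything else is illustrative:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ContextBundle:
    """Concise, replayable record of an agent run (no verbatim dumps of large content)."""
    run_id: str
    primes: List[str] = field(default_factory=list)        # prime commands used
    reads: List[str] = field(default_factory=list)         # files/URLs read (pointers, not contents)
    tool_calls: List[str] = field(default_factory=list)
    key_findings: List[str] = field(default_factory=list)
    decisions: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)        # paths to artifacts/reports
    next_steps: List[str] = field(default_factory=list)
```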
11) Defend Against Context Failure Modes¶
- Poisoning: verify tool outputs; tag trust levels; gate propagation.
- Confusion: filter irrelevant turns; dedupe aggressively.
- Conflict: detect/resolve contradictions before composing answers.
- 🔑 Impact: Higher reliability and fewer cascading errors.
12) Standardize Schemas & Measure What Matters¶
- Standard prompt skeletons, report JSON, artifact layout, retrieval policies.
- Track startup_tokens, peak_tokens, #agents_spawned, attempts_to_success, retrieval_hits.
- Set SLOs (e.g., startup ≤10% window, attempts ≤3) and alert on drift.
- 🔑 Impact: Consistency at scale; makes optimization and debugging straightforward.
Context Engineering 101 cheat sheet¶

Context Engineering 101 cheat sheet by Lena Hall https://x.com/lenadroid/status/1943685060785524824
References¶
Academic Papers¶
- Zhang, W., Gupta, R., Müller, F., et al. (2025, July). A Survey of Context Engineering for Large Language Models. arXiv preprint. https://arxiv.org/pdf/2507.13334
  - Comprehensive academic survey reviewing context engineering techniques, challenges, and future research directions.
- Haseeb, M., et al. (2025, August). Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code. arXiv preprint. https://arxiv.org/html/2508.08322v1
  - Proposes a multi-agent workflow combining intent translation, semantic retrieval, document synthesis, and Claude Code to improve code generation and validation in large codebases.
- Hua, Q., Ye, L., Fu, D., Xiao, Y., Cai, X., Wu, Y., Lin, J., Wang, J., & Liu, P. (2025, October). Context Engineering 2.0: The Context of Context Engineering. arXiv preprint. https://arxiv.org/abs/2510.26493
  - Foundational position paper that situates context engineering historically, defines it as entropy reduction between human intentions and machine-readable representations, and introduces a four-era framework with design considerations for context collection, management, and usage.
Blogs & Articles¶
- Martin, R. L. (2025, June). Context Engineering.
  - Blog post introducing principles of context engineering for AI systems, including prompt design, retrieval, and orchestration strategies. https://rlancemartin.github.io/2025/06/23/context_engineering/
- Schmid, P. (2025, June). Context Engineering: Optimizing Prompts and Retrieval for LLMs.
  - Detailed exploration of context engineering patterns and their practical application in LLM development. https://www.philschmid.de/context-engineering
- LlamaIndex (2025, June). Context Engineering – What it is, and techniques to consider.
  - Official blog post from LlamaIndex outlining core techniques in context engineering for enterprise AI assistants. https://www.llamaindex.ai/blog/context-engineering-what-it-is-and-techniques-to-consider
- Datacamp (2025, June). Context Engineering: The Next Frontier of AI Development.
  - Educational article explaining how context engineering enhances AI reliability, accuracy, and enterprise integration. https://www.datacamp.com/blog/context-engineering
- Landgraf, T. (2025, August). Context Engineering for Claude Code: Mastering Deep Technical Knowledge.
  - Medium article introducing advanced context engineering workflows for Claude Code, including knowledge-file creation, project architecture awareness, and technical validation. https://medium.com/%40tl_99311/context-engineering-for-claude-code-mastering-deep-technical-knowledge-bae14f158289
- Breunig, D. (2025, June). How to Fix Your Context.
  - Practical guide to diagnosing long-context failures and applying tactics like RAG, tool loadouts, quarantine, pruning, summarization, and offloading to stabilize agent behavior. https://www.dbreunig.com/2025/06/26/how-to-fix-your-context.html
- LangChain. (n.d.). Context engineering in agents.
  - Official LangChain documentation page explaining context engineering as the core of reliable agents, with middleware patterns for managing model, tool, and life-cycle context. https://docs.langchain.com/oss/python/langchain/context-engineering
- LangChain. (2025, July). Context Engineering.
  - LangChain blog post (“Context Engineering for Agents”) organizing context engineering strategies into write, select, compress, and isolate patterns, with concrete examples from real agents. https://blog.langchain.com/context-engineering-for-agents/
- Osmani, A. (2024, July). Context Engineering: Bringing Engineering Discipline to Prompts.
  - Substack essay framing context engineering as bringing software-engineering discipline and information architecture to prompt and context design for production AI systems. https://addyo.substack.com/p/context-engineering-bringing-engineering
Open Source Repositories & Tools¶
- Kimai, D. (2025, July). Context-Engineering (GitHub Repository).
  - Open-source repository providing frameworks and examples for implementing context engineering strategies in real-world projects. https://github.com/davidkimai/Context-Engineering
- Context Engineering Template (2025, August). Context-Engineering-Intro (GitHub Repository).
  - Template repository by coleam00 for setting up Claude Code-based context engineering projects, including examples, global rules, and workflows. https://github.com/coleam00/context-engineering-intro
Media & Visualizations¶
- YouTube (2025, September). Elite Context Engineering with Claude Code — IndyDevDan [Video].
  - Advanced context engineering techniques using Claude Code, as presented by IndyDevDan. https://www.youtube.com/watch?v=Kf5-HWJPTIE
- YouTube (2025, September). Context Engineering for Agents — Lance Martin, LangChain [Video].
  - Discussion on context engineering specifically for agent frameworks, by Lance Martin from LangChain. https://www.youtube.com/watch?v=_IlTcWciEC4
- Droid, L. (2025, July). Context Engineering Visualization.
  - Illustrative summary of context engineering techniques, shared via social media. https://x.com/lenadroid/status/1943685060785524824/photo/1
- YouTube (2025, September). Advanced Context Engineering for Agents — Dexter Horthy [Video].
  - Walks through why naive back-and-forth prompting fails, how spec-first development keeps teams aligned, and why “everything is context engineering.” From compaction strategies to subagents and planning workflows, he shows how intentional context management turns AI coding from prototypes into production. https://www.youtube.com/watch?v=IS_y40zY-hc
Takeaways¶
Key Takeaways
- Treat the Context Window Like a CPU Cache, Not a Hard Drive: The core principle is to keep the primary agent's context minimal and focused—containing only the immediate goal, instructions, and a small slice of relevant data. Offload history, large documents, and non-essential tools to external memory to improve speed, reduce cost, and avoid distraction.
- Reduce and Delegate, Don't Accumulate: Instead of bloating a single agent's context, delegate heavy or specialized tasks to sub-agents or background agents. These agents work in their own isolated contexts and return only concise, distilled reports, enabling complex work without overwhelming the primary agent.
- Context Must Be Actively Managed, Not Passively Appended: Implement dynamic strategies like just-in-time retrieval (RAG), summarization, and context isolation to ensure the model always has high-quality, relevant information. Actively defend against common failure modes like poisoning and confusion by verifying inputs and filtering irrelevant data.
Prompt Engineering ↵
Prompt Engineering¶
Overview
A large part of working with LLMs is knowing how to prompt them to get the information you want.
ChatGPT is used here, but the techniques apply in general to any LLM.
Tip
With newer Reasoning models, there is less of a need for extensive prompts.
But this Prompt Guidance is still applicable.
Prompt Guides¶
There are many books, guides, and articles on Prompt Engineering. Some of the better ones are listed here:
- Prompt Engineering Guide
- OpenAI Prompt Engineering Guide
- Best practices for prompt Engineering with the OpenAI API
- Google Gemini Prompting guide 101 - A quick-start handbook for effective prompts, April 2024 edition
- How I Won Singapore’s GPT-4 Prompt Engineering Competition, Dec 2023
- Prompt Engineering for Generative AI Book, May 2024
- Google Prompt design strategies
- Anthropic Prompt Engineering overview
- Best Prompt Techniques for Best LLM Responses, Feb 2024
Prompt Taxonomy¶
Prompt Taxonomy Turn, Expression, Level of Details, Role¶

The above proposed Prompt Taxonomy is TELeR: Turn, Expression, Level of Details, Role from TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks
Prompt Taxonomy 26 Guiding Principles¶
The above "26 guiding principles designed to streamline the process of querying and prompting large language models" is from Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4
Prompt Tactics¶
Some tactics I find particularly useful:
- Ask the LLM to generate the prompt.
  - Do this in 2 parts (task decomposition):
    - Ask the LLM what techniques are best to use for effective prompt engineering.
    - Ask the LLM to create a prompt using these techniques for the objective and example you specify.
- Ask the LLM to provide a confidence score for their answer.
  - People can be fooled by the fluency of LLM answers, even when they are wrong (aka hallucinations).
  - LLMs can estimate how confident they are in their answer.
  - So asking an LLM to rate its confidence in the answer can reduce blind trust.
- Ask the LLM (or a different LLM) to review the answer and provide improvements.
Ask the LLM to generate the prompt¶
Ask the LLM what techniques are best to use for effective prompt engineering?¶

Ask the LLM to create a prompt using these techniques for the objective and example you specify¶

_functions.php in cpCommerce 1.2.x, possibly including 1.2.9, sends a redirect but does not exit when it is called directly, which allows remote attackers to bypass a protection mechanism to conduct remote file inclusion and directory traversal attacks, execute arbitrary PHP code, or read arbitrary files via the GLOBALS[prefix] parameter, a different vector than CVE-2003-1500.
#CVE# _functions.php in cpCommerce 1.2.x, possibly including 1.2.9, sends a redirect but does not exit when it is called directly, which allows remote attackers to bypass a protection mechanism to conduct remote file inclusion and directory traversal attacks, execute arbitrary PHP code, or read arbitrary files via the GLOBALS[prefix] parameter, a different vector than CVE-2003-1500.
#IMPACT# Remote Code Execution, Information Disclosure, Security Bypass
#RATIONALE# The description specifies multiple potential impacts:
1. **Remote Code Execution**: The ability to "execute arbitrary PHP code" directly implies a high severity impact where an attacker can run malicious code on the affected system.
2. **Information Disclosure**: The mention of "read arbitrary files" suggests that attackers can access sensitive information stored in files.
3. **Security Bypass**: The bypass of a protection mechanism to conduct remote file inclusion and directory traversal attacks implies that security controls are being circumvented.
#CONFIDENCE# 95
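If you want to run this kind of structured prompt programmatically rather than through the chat interface, here is a minimal sketch assuming the OpenAI Python client and an API key; the model name is illustrative, and the output format simply mirrors the #CVE#/#IMPACT#/#RATIONALE#/#CONFIDENCE# example above.

```python
# Assumes the OpenAI Python client (`pip install openai`) and an OPENAI_API_KEY
# environment variable; the model name below is illustrative.
from openai import OpenAI

client = OpenAI()

PROMPT_TEMPLATE = """You are a vulnerability analyst.
Given the CVE description below, answer in exactly this format:
#CVE# <restate the description>
#IMPACT# <comma-separated impact types>
#RATIONALE# <numbered reasons tied to phrases in the description>
#CONFIDENCE# <0-100 confidence score for your answer>

CVE description:
{description}
"""

description = (
    "_functions.php in cpCommerce 1.2.x sends a redirect but does not exit when it "
    "is called directly, which allows remote attackers to bypass a protection "
    "mechanism to conduct remote file inclusion and directory traversal attacks, "
    "execute arbitrary PHP code, or read arbitrary files."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(description=description)}],
)
print(response.choices[0].message.content)
```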

Ask the LLM to review the answer and provide improvements.¶

LLM Tools to Improve Prompts¶
In addition to using LLMs to generate prompts (aka Meta Prompts) and review prompts in an ad hoc manner via the chat interface, there are various tools that use LLMs to improve prompts:
- Fabric Prompt to Improve Prompts
- Anthropic Console supports testing and improving prompts (for Anthropic - but likely the resulting prompt would be portable to other LLMs)
- Claude can generate prompts, create test variables, and show you the outputs of prompts side by side.
- https://x.com/AnthropicAI/status/1810747792807342395
- Google Prompt Gallery
Prompts used in various Code Generation Tools¶
https://github.com/x1xhlol/system-prompts-and-models-of-ai-tools
FULL v0, Cursor, Manus, Same.dev, Lovable, Devin, Replit Agent, Windsurf Agent & VSCode Agent (And other Open Sourced) System Prompts, Tools & AI Models.
These are useful as reference examples.
Takeaways
- Getting the right prompt to get what you want out of an LLM can sometimes feel like art or interrogation. There are several options covered here:
- Prompt Templates
- Prompt Frameworks
- Ask an LLM to generate a prompt
- LLM-based tools for prompt refinement
Fabric¶
Overview
One of Fabric's primary features is helping people collect and integrate prompts
Fabric Prompt Collection and Framework¶
https://github.com/danielmiessler/fabric/tree/main
Quote
Since the start of 2023 and GenAI we've seen a massive number of AI applications for accomplishing tasks. It's powerful, but it's not easy to integrate this functionality into our lives.
In other words, AI doesn't have a capabilities problem—it has an integration problem. Fabric was created to address this by enabling everyone to granularly apply AI to everyday challenges.
One of fabric's primary features is helping people collect and integrate prompts, which we call Patterns, into various parts of their lives.
The prompts (aka Patterns) are available at https://github.com/danielmiessler/fabric/tree/main/patterns
Fabric Prompts for CyberSecurity¶
Fabric has Prompts that are specific to CyberSecurity e.g.
- https://github.com/danielmiessler/fabric/blob/main/patterns/create_threat_model/system.md
- https://github.com/danielmiessler/fabric/blob/main/patterns/analyze_logs/system.md
- https://github.com/danielmiessler/fabric/blob/main/patterns/analyze_malware/system.md
- https://github.com/danielmiessler/fabric/blob/main/patterns/analyze_threat_report/system.md
- https://github.com/danielmiessler/fabric/blob/main/patterns/analyze_threat_report_trends/system.md
- https://github.com/danielmiessler/fabric/blob/main/patterns/write_hackerone_report/system.md
- https://github.com/danielmiessler/fabric/blob/main/patterns/write_semgrep_rule/system.md
Fabric Prompt to Improve Prompts¶
https://github.com/danielmiessler/fabric/blob/main/patterns/improve_prompt/system.md
Fabric Extract Wisdom¶
We'll take Goldman Sachs, Top of Mind, Gen AI: Too Much Spend, Too Little Benefit?, June 2024, as the example content to extract wisdom from. It's a 30-page report that is dense and varied.
Below is the summary by Claude 3.5 Sonnet using the extract_article_wisdom prompt.
SUMMARY
Goldman Sachs analyzes AI's economic impact, discussing potential benefits, challenges, and investment implications in their "Top of Mind" report.
IDEAS
AI could increase US productivity by 9% and GDP growth by 6.1% over the next decade.
The current AI investment boom may be unsustainable, with potential for a market correction.
Power demand from AI data centers could lead to a significant energy crunch in the coming years.
AI's economic impact may be more limited than many expect, with only 4.6% of tasks potentially automated within a decade.
The semiconductor industry faces supply constraints that could limit AI growth in the near term.
Utilities companies are emerging as unexpected beneficiaries of the AI boom due to increased power demand.
AI technology may struggle to replicate human cognitive abilities, limiting its transformative potential.
The cost of AI infrastructure is currently high, raising questions about return on investment.
AI's impact on corporate profitability will be crucial for long-term equity returns.
The AI theme in the stock market may have room to run despite recent gains.
Data centers could consume up to 8% of total US power demand by 2030.
AI development is outpacing the ability of power infrastructure to support it.
The US may struggle to build the necessary infrastructure to support AI growth.
AI spending by tech giants is not seen as irrational exuberance by some analysts.
The AI investment cycle is still in the infrastructure buildout phase, with applications yet to emerge.
European power demand could increase by 40-50% over the next decade due to AI and electrification.
AI data centers can consume up to 10 times more energy than traditional data centers.
The US utility industry has not experienced significant load growth in almost two decades.
AI chip demand is outstripping supply, particularly for High-Bandwidth Memory technology.
The pace of AI model improvements may be slower than many anticipate.
QUOTES
"Given the focus and architecture of generative AI technology today... truly transformative changes won't happen quickly and few—if any—will likely occur within the next 10 years." - Daron Acemoglu
"AI technology is exceptionally expensive, and to justify those costs, the technology must be able to solve complex problems, which it isn't designed to do." - Jim Covello
"Spending is certainly high today in absolute dollar terms. But this capex cycle seems more promising than even previous capex cycles." - Kash Rangan
"The devil is ultimately in the details. So, I don't have a strong prior as to how much of the current investment boom will be wasted vs. productive. But I expect both will happen." - Daron Acemoglu
"Until we reach a level of saturation in terms of human capacity to consume data, any amount of efficiency gains will undoubtedly be gobbled up by even more demand." - Brian Janous
"The risk that our children or grandchildren in 2074 accuse us of moving too slowly in 2024 at the expense of growth seems far lower than the risk that we end up moving too quickly and destroy institutions, democracy, and beyond in the process." - Daron Acemoglu
"I'm loathe to use the word 'bubble' because I don't believe that AI is a bubble, but most bubbles in history ended either because the cost of capital changed dramatically or end-demand deteriorated and affected companies' ability to deploy capital, not because companies retreated from investing in a technology where the payoff was taking longer than expected." - Eric Sheridan
"Utilities have not experienced a period of load growth in almost two decades and are not prepared for—or even capable of matching—the speed at which AI technology is developing." - Brian Janous
"The US has unfortunately lost the ability to build large infrastructure projects—this is a task better suited for 1930s America, not 2030s America." - Brian Janous
"Every human invention should be celebrated, and generative AI is a true human invention. But too much optimism and hype may lead to the premature use of technologies that are not yet ready for prime time." - Daron Acemoglu
FACTS
Data centers currently account for only about 1% of global power demand.
The US utility industry has not experienced significant load growth in almost two decades.
60% of workers today are employed in occupations that did not exist in 1940.
The US corporate profit share of GDP is near its highest levels since World War II.
Northern Virginia is home to the most data centers in the US.
Commercial power consumption in Virginia rose 37% from 2016 to 2023.
AI data centers can consume up to 10 times more energy than traditional data centers.
The average power usage effectiveness (PUE) of cloud data centers declined by only 0.2 over the course of a decade.
Wait times for power projects to connect to the grid currently range from 40-70 months.
The High-Bandwidth Memory market is expected to grow at a ~100% compound annual growth rate over the next few years.
The S&P 500 Shiller P/E ratio is currently in the 97th percentile since 1900.
Utilities returned 16% between March and May, making them the best-performing sector in the S&P 500.
European electricity consumption has declined by around 10% from its 2008 peak.
TSMC's Chip-on-Wafer-on-Substrate packaging capacity is expected to more than double in 2024.
The US is expected to add nearly 800 gigawatts of wind and solar over the coming 10-15 years.
REFERENCES
ChatGPT
Global Macro Research by Goldman Sachs
MIT's Daron Acemoglu's research
Jim Covello's analysis
Joseph Briggs' economic forecasts
Kash Rangan and Eric Sheridan's market analysis
Brian Janous' insights on power infrastructure
Carly Davenport's utility sector analysis
Toshiya Hari, Anmol Makkar, and David Balaban's semiconductor industry research
Ryan Hammond's equity market analysis
Christian Mueller-Glissmann's long-term return forecasting framework
Alberto Gandolfi's European utilities analysis
Hongcen Wei's power consumption analysis
David Autor's research on technological innovation and employment
Robert Shiller's P/E ratio data
EIA (Energy Information Administration) data
Census Bureau data
FactSet financial data
TSMC's capacity projections
REPowerEU Plan
RECOMMENDATIONS
Investors should remain invested in AI infrastructure providers despite high valuations.
Utilities companies should find ways to extract more efficiencies from the current power system.
Policymakers should focus on investing in America's transmission infrastructure.
Companies should tie AI spending to visible revenue growth to maintain investor confidence.
Investors should consider "Electrification Compounders" in the utilities sector as potential beneficiaries of AI growth.
The US should prioritize expanding long-duration energy storage to meet growing power demand.
Tech companies should prepare for potential power constraints when planning AI infrastructure.
Investors should monitor the emergence of AI's "killer application" as a sign of the technology's maturity.
Policymakers should address regulatory lags and interconnection constraints in the power sector.
Companies should be cautious about premature use of AI technologies that are not yet fully developed.
Investors should watch for signs of corporate profitability decline as a potential trigger for reduced AI spending.
The semiconductor industry should focus on expanding High-Bandwidth Memory and advanced packaging capacity.
Utilities should prepare for a potential doubling of grid capacity in some regions over the next decade.
Investors should consider the impact of AI on corporate profitability when assessing long-term equity returns.
Policymakers should balance the need for AI innovation with potential risks to institutions and democracy.
Process Youtube Transcripts¶
Fabric, and LLMs in general, can process Youtube Transcripts (which are text).
The Youtube Transcript for a video can be retrieved in different ways e.g. https://www.youtube.com/watch?v=oMZN810xfck:
- Click the Transcript button under a video and copy and paste, or download, to a text file
- Use Fabric helper function: installer/client/cli/yt.py (or other tool)
This download and processing can be done in one command with Fabric, where:
- -s is for streaming
- -p is for pattern
- extract_wisdom is the Fabric prompt that extracts wisdom from any text. It addresses the problem of too much content and too little time.
Takeaways¶
Takeaways
- Fabric is a good resource for prompts with many being related to CyberSecurity.
Ended: Prompt Engineering
Ended: Context Engineering
Software Engineering ↵
Overview¶
Abstract
This Software Engineering section explores how AI is fundamentally transforming software engineering, from changing development paradigms to redefining the role of traditional artifacts.
It covers immutable principles that remain constant regardless of who or what creates the software, and examines the evolving landscape of AI-powered development tools.
1. How AI is Changing Software Engineering¶
An examination of software engineering's evolution through three distinct eras: traditional Waterfall methodologies, Agile development with its emphasis on communication over documentation, and the current AI-driven transformation.
The analysis traces two parallel developments:
- Software 2.0 where "AI is eating software" by replacing traditional algorithms with neural networks,
- Software 1.0 Redux where "AI is feeding software" through generative tools that accelerate traditional development (introducing the concept of "Vibe Coding").
Key Takeaways:
- Accelerated Innovation: AI compresses idea-to-prototype cycles from weeks to minutes, fundamentally changing development velocity
- Inverted Priorities: Upstream artifacts (requirements, architecture, contracts) become primary deliverables, while code becomes a generated byproduct
- Evolved Skill Sets: Success now requires mastering requirements engineering, system architecture, and human-AI collaboration alongside traditional coding skills
- New Paradigms: We're witnessing both Software 2.0 (AI eating software) and Software 1.0 Redux (AI feeding traditional development), each suited to different problem domains
2. Software Assurance¶
A comprehensive overview of Software Assurance fundamentals that remain constant across all Software Development Life Cycle (SDLC) methodologies and development entities. This section covers the immutable principles that govern quality software development, including:
- verification ("Are you building it right?") versus validation ("Are you building the right thing?")
- fundamental architectural principles including the Laws of Software Architecture and how quality attributes drive design decisions.
Key Takeaways:
- The software requirements specification is the key artifact that distinguishes verification activities from validation activities
- Understanding the what (requirements) independent of the how (design) is crucial
- The why for the how (captured in Architecture Decision Records) is paramount: "Why is more important than how"
- Everything in software architecture involves trade-offs
- Quality attributes are the primary drivers of architectural design
3. Software Artifacts¶
Building on the foundational concepts from the previous sections, this explores how software engineering artifacts are evolving from passive documentation into active contracts that guide AI-driven code generation and system behavior.
As AI transforms development workflows, these artifacts become the core deliverables that specify system requirements, architecture, and behavior in machine-readable formats that AI tools can directly consume and act upon.
Key Takeaways:
- Precision Over Brevity - AI needs explicit, unambiguous specifications
- Machine-Readable Formats - Use structured data (JSON, YAML, Markdown with frontmatter)
- Version Everything - Track changes to specifications alongside generated code
- Automate Validation - Ensure AI outputs meet specification requirements (see the sketch after this list)
- Maintain Traceability - Link generated code back to originating specifications
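As a concrete illustration of the "Automate Validation" takeaway, here is a minimal sketch that checks an AI-generated artifact against a machine-readable contract using the jsonschema package; the contract and payload are hypothetical examples, not part of any specific toolchain:

```python
# Illustrative only: validate an AI-generated artifact against a machine-readable spec.
# Assumes `pip install jsonschema`; the contract and payload are hypothetical examples.
from jsonschema import ValidationError, validate

API_CONTRACT = {
    "type": "object",
    "required": ["endpoint", "method", "auth"],
    "properties": {
        "endpoint": {"type": "string", "pattern": "^/"},
        "method": {"enum": ["GET", "POST", "PUT", "DELETE"]},
        "auth": {"enum": ["none", "api_key", "oauth2"]},
    },
}

generated_spec = {"endpoint": "/v1/findings", "method": "GET", "auth": "oauth2"}

try:
    validate(instance=generated_spec, schema=API_CONTRACT)
    print("Generated artifact conforms to the contract")
except ValidationError as err:
    print(f"Specification violation: {err.message}")
```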
4. AI Coding Tools¶
A comprehensive analysis of AI-powered development tools, categorized by autonomy levels from basic code completion to autonomous feature development. This classification helps developers understand the capabilities and limitations of different tools.
The comparison spans from Level 1 (basic autocomplete) through Level 4 (autonomous agents), with Level 5 remaining theoretical. The analysis shows how AI tools are progressively taking on more of the coding burden, allowing developers to focus on higher-level concerns.
Key Takeaways:
- Tools range from simple code completion to autonomous feature development
- Developers can mix and match tools based on autonomy levels and specific needs
- Understanding each tool's autonomy level is crucial for effective adoption
- The field evolves rapidly, with tools advancing between autonomy levels frequently
5. Leaderboards¶
A curated collection of benchmarks and evaluation metrics for Large Language Models (LLMs) specifically focused on coding capabilities. These leaderboards provide objective measures for comparing the performance of different AI models in software development tasks.
Key Takeaways:
- Leaderboards are a good way to quantitatively and objectively compare solutions.
- Comparison across multiple metrics and leaderboards avoids solutions that overfit to a benchmark.
Takeaways¶
Key Takeaways
This Software Engineering section demonstrates how AI is not merely automating existing processes but fundamentally reshaping software engineering. While core principles like verification, validation, and architectural trade-offs remain constant, the methods of implementation are evolving rapidly. The shift toward AI-driven development emphasizes the critical importance of well-defined specifications and requirements as the new primary deliverables, with generated code becoming a secondary artifact.
Success in this new paradigm requires mastering both traditional software engineering principles and emerging AI collaboration skills, understanding when to apply different levels of AI autonomy, and maintaining quality through robust specification and validation processes.
How AI is Changing Software Engineering¶
Overview
This section gives an overview of Software Engineering and how it has evolved with AI and Generative AI.
It looks at the early days of Software Engineering and the transition from Waterfall to Agile - and the emphasis on communication over code and documentation.
Then it looks at how AI gave us Software 2.0 where "AI is eating software", and how Generative AI gave us Software 1.0 Redux where "AI is feeding software" (and Vibe Coding).
Andrej Karpathy quotes are used throughout as a consistent commentary from someone who has been formative in this space.
Software engineering
Quote
Software engineering is a branch of both computer science and engineering focused on designing, developing, testing, and maintaining software applications. It involves applying engineering principles and computer programming expertise to develop software systems that meet user needs.
Evolution of Software Engineering as Communication over Code¶
Effective software development depends fundamentally on clear communication between all stakeholders—from business experts to developers to end users. The evolution from rigid, document-heavy approaches to collaborative, feedback-driven methodologies illustrates how the industry has learned to prioritize human communication over process documentation.
The Waterfall Challenge¶
The Waterfall model dominated software development for decades through its linear, sequential approach. However, this methodology created significant communication barriers. Requirements, design documents, and other pre-coding artifacts quickly became outdated as projects progressed. The code itself became the only reliable source of truth, but most stakeholders couldn't understand it.
This disconnect between documentation and reality often led teams to solve the wrong problems. The fundamental issue was the lack of continuous feedback loops between those who understood the business needs and those building the software.
The Agile Response¶
During the 1990s, lightweight development methodologies emerged in response to these heavyweight, document-centric approaches. Agile software development became the most influential of these approaches, emphasizing iterative development, continuous feedback, and direct collaboration between business stakeholders and development teams.
Rather than relying on comprehensive documentation, Agile methodologies prioritize working software and frequent communication. This shift recognized that software development is fundamentally a collaborative, creative process that benefits from ongoing dialogue rather than upfront specification.
The Language of Collaboration¶

Tip
See the Design Council's Systemic Design Framework for an evolution of the Double Diamond that recognizes the importance of the 'invisible activities' that sit around the design process: orientation and value setting, continuing the journey, collaboration and connection, and leadership and storytelling.
Software engineering thought leaders have consistently emphasized how critical shared understanding becomes in successful projects.
- Alistair Cockburn's research demonstrated that face-to-face communication, particularly when enhanced by shared modeling tools like whiteboards, represents the most effective form of team communication. His work on communication effectiveness showed that physical proximity and multiple communication modalities dramatically improve information transfer and reduce misunderstandings.
- Eric Evans, in his foundational work on Domain-Driven Design, introduced the concept of "ubiquitous language"—a shared vocabulary that bridges the gap between business experts and technical teams. This common language emerges from ongoing collaboration and becomes embedded in both conversations and code, ensuring that business terminology permeates throughout the software system.
- Dan North extended these ideas when developing Behavior-Driven Development (BDD), creating frameworks that use natural language constructs to describe software behavior in terms that all stakeholders can understand. BDD treats the specification of desired behavior as a ubiquitous language for the entire project team, helping prevent communication breakdowns between developers and business stakeholders.
These approaches share a common insight: successful software development requires more than technical expertise—it demands ongoing communication, shared vocabulary, and collaborative understanding of both the problem domain and the solution being built.
Software 2.0¶

The emergence of artificial intelligence is fundamentally changing how we think about software development.
In 2017, Andrej Karpathy introduced the concept of "Software 2.0" to describe this transformation, where traditional human-written code gives way to AI-generated solutions.
Quote
Software (1.0) is eating the world, and now AI (Software 2.0) is eating software.
The "classical stack" of Software 1.0 is what we're all familiar with — it is written in languages such as Python, C++, etc. It consists of explicit instructions to the computer written by a programmer. By writing each line of code, the programmer identifies a specific point in program space with some desirable behavior.
In contrast, Software 2.0 is written in much more abstract, human unfriendly language, such as the weights of a neural network. No human is involved in writing this code because there are a lot of weights (typical networks might have millions), and coding directly in weights is kind of hard.
In Software 1.0, human-engineered source code (e.g. some .cpp files) is compiled into a binary that does useful work.
In Software 2.0 most often the source code comprises
- the dataset that defines the desirable behavior and
- the neural net architecture that gives the rough skeleton of the code, but with many details (the weights) to be filled in.
It is likely that any setting where the program is not obvious but one can repeatedly evaluate the performance of it (e.g. — did you classify some images correctly? do you win games of Go?) will be subject to this transition, because the optimization can find much better code than what a human can write.
Andrej Karpathy, Nov 11, 2017 https://karpathy.medium.com/software-2-0-a64152b37c35
Note
"Software (1.0) is eating the world" is a reference to "Why Software Is Eating the World", Marc Andreessen, 2011.
https://a16z.com/why-software-is-eating-the-world/ where a16z is "AndreessenHorowitz."
Software 2.0 Example: Tesla Autopilot transition from C++ code to AI¶
Tesla's Autopilot system demonstrates this transition in practice. As Karpathy explained:
Quote
Neural network can eat through the [programming] stack… When I joined Tesla, there was a ton of C++ code, and now there's much, much less C++ code in the code that runs in the car.
Neural network initially was just doing a detection on the image level, then it went for multiple images, it gives you prediction, then multiple images over time give you a prediction, and you're discarding C++ code. And eventually you're just giving steering commands.
… I do suspect that the end-to-end systems for Tesla in, say, 10 years, it is just a neural net. I mean, the videos stream into a neural net and commands come out.
https://www.linkedin.com/pulse/andrej-karpathy-8-big-ideas-mikael-alemu-gorsky-eckuf/
This evolution illustrates how AI systems can progressively replace traditional programmatic logic, moving toward end-to-end neural networks that process inputs and generate outputs directly.
Data as the New Source Code¶
In this Software 2.0 world, datasets become the primary artifact of development.
Hugging Face has emerged as "Software 2.0's GitHub," hosting over 400K+ datasets alongside 1.7M+ models, where repositories contain datasets and commits represent additions and edits of labels rather than code changes.
This shift fundamentally changes how we think about version control, collaboration, and the artifacts that define our systems. The focus moves from managing code repositories to curating and versioning the data that trains our AI systems.
As Karpathy noted:
Quote
Github is a very successful home for Software 1.0 code. Is there space for a Software 2.0 Github? In this case repositories are datasets and commits are made up of additions and edits of the labels.
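A minimal sketch of what "datasets as source code" looks like in practice, using the Hugging Face datasets library; the dataset name and revision are illustrative placeholders:

```python
# Assumes `pip install datasets`; the dataset name and revision are placeholders.
from datasets import load_dataset

# Pin a dataset revision the way you would pin a source-code commit:
# in Software 2.0 terms, this is the "source" the model is built from.
ds = load_dataset("imdb", split="train", revision="main")
print(len(ds), ds[0]["label"])
```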
Software 3.0¶
Software 3.0 is using English to program LLMs.
The LLM can be programmed to
- carry out the end action e.g. sentiment classification.
- write code to carry out the end action e.g. a web or mobile app.
Quote
Software 1.0 is code, 2.0 was weights of tightly scoped ML models, 3.0 is using English to program LLMs. Just like how in certain use cases software 2.0 would “eat away” at software 1.0 code, software 3.0 will now eat away at a lot of places where we would previously have used 1.0/2.0 https://x.com/ethanniser/status/1935056628350599473
Quote
We are witnessing the rise of a new paradigm: Software 3.0. If traditional coding was Software 1.0, and neural networks trained on data represented Software 2.0, Software 3.0 is where your prompts directly program LLMs https://x.com/huwng_tran/status/1935052873769042115
Quote
There is a new category of consumer/manipulator of digital information:
- Humans (GUIs)
- Computers (APIs)
- NEW: Agents <- computers... but human-like
This new category of consumer/manipulator is in line with "From People to Machines as Consumers/Customers/Creators" per thoughts-on-the-future-from-the-past.
Software 1.0 Redux¶
Note
"Software 1.0 Redux" refers to how fundamental software engineering principles remain unchanged, but the development process itself has been transformed by Generative AI and LLMs. All traditional artifacts become even more relevant as AI-generated code depends on high-quality specifications.
Not all problems are suited to the Software 2.0 paradigm of neural networks replacing traditional code.
For such problems, Large Language Models (LLMs) are transforming how we write traditional software, creating what we might call "Software 1.0 Redux", where AI feeds and enhances traditional software, rather than eating it à la Software 2.0.
- While AI handles low-level implementation details, human developers are elevated to higher levels of abstraction, using natural language to specify intent and behavior.
- The skill shifts from syntax mastery to effective requirements management, architecture, and system design, and clean coding principles.
Thoughts on the Future from the Past¶
In the past, I captured some thoughts publicly on how I thought the future of software would play out.
These predictions have materialized through today's Generative AI tools, though at the time the Generative AI that we take for granted today did not exist.
More Continuity between Requirements – Documentation – Test – SW¶
Six years ago (March 2019), as part of a presentation on DevSecOps - Thoughts from the Trenches, I concluded with my view of the next 10 years:

For "More Continuity between Requirements – Documentation – Test – SW", these points were already playing out at that time:
- Less heavy lifting required by people to build software
- The breadth and depth and rate of change of new technologies means that developers can't keep pace
- Value moves further up the stack – the lower layers become commodities
Tip
The "More Continuity between Requirements – Documentation – Test – SW" is now playing out as "Software 1.0 Redux" where the non-code artifacts (requirements, architecture and design documentation, tests, user documentation) become the contracts from which code is generated by LLMs—and the code becomes more of a byproduct. These contracts are maintained and versioned alongside the code.
Natural language prompts have become first-class citizens in development workflows, allowing developers to specify intent at higher levels of abstraction while AI handles implementation details. Prompt engineering is an important skill currently, though the need for prompt engineering diminishes with more capable models (e.g. Reasoning, Research).
Diagrams as Code¶
Five years ago (May 2020), as part of a "Thoughts For The Future" chapter for The Hitchhiker's Guide to PlantUML! (linked from plantuml.com), I wrote:
Quote
1.3. Machine Processing Of Text Files
Having a diagram source as a text file is powerful because it allows for machine processing e.g.
- If standard building blocks are used, it allows automated analysis and recognition of the diagram text source, and recommendations to the user e.g. if an arrow text includes "TLS" to indicate the link is secured, then an external program can provide recommendations on TLS protocol version, cipher-suites etc…
- As companies move towards standard architecture icon sets (AWS, MS/Azure, Google,…), it is possible to process an existing architecture diagram image with optical recognition (and machine learning) and create the text (plantuml) equivalent.
Quote
Having diagrams as code that can live beside the system code, that the stakeholders are equally comfortable editing and viewing, reduces the gap i.e. "Where system diagrams meet system reality"
https://crashedmind.github.io/PlantUMLHitchhikersGuide/about/AboutPlantUML.html#diagrams-as-code
I felt strongly enough about this that I wrote a guide on PlantUML, a diagram-as-code tool.
Experiment
One of my first experiments with an LLM (text-only) was to ask it to do a security review of a diagram-as-code architecture.
And when multi-modal LLMs became available, with a diagram-as-image architecture.
Validation Through Current AI Tools¶
These predictions have materialized through today's generative AI tools:
- e.g. Claude's Artifacts feature exemplifies the upstream artifacts concept, allowing users to generate substantial, standalone content like code snippets, documents, or website designs that appear in a dedicated window for iterative refinement. Meanwhile, Claude Code takes this further by operating directly in the terminal, understanding entire codebases and helping developers code faster by executing routine tasks, explaining complex code, and handling git workflows through natural language commands.
- The "diagrams as code" vision has also been realized—modern AI can now generate Mermaid diagrams, SVG graphics, and complete architecture visualizations from natural language descriptions, then provide contextual analysis including security recommendations and threat modeling based on those diagrams. It can also generate the code given an image.
- I've used LLMs to create architecture diagrams, and then generate code from those diagrams.
But AWS DevGenius is a more recent complete example application that exemplifies this.
Example DevGenius - AWS Solution Generator¶
AWS DevGenius (April 2025) is an example of this.
Quote
DevGenius is an AI-powered application that transforms project ideas into complete, ready-to-deploy AWS solutions. It leverages Amazon Bedrock and Claude AI models to provide architecture diagrams, cost estimates, infrastructure as code, and comprehensive technical documentation.
Conversational Solution Architecture Building: DevGenius enables customers to design solution architectures in a conversational manner. Users can create architecture diagrams (in draw.io format) and refine them interactively. Once the design is finalized, they can generate end-to-end code automation using CDK or CloudFormation templates, and deploy it in their AWS account with a single click. Additionally, customers can receive cost estimates for running the architecture in production, along with detailed documentation for the solution.
Build Solution Architecture from Whiteboard Drawings: For customers who already have their architecture in image form (e.g., whiteboard drawings), DevGenius allows them to upload the image. Once uploaded, DevGenius analyzes the architecture and provides a detailed explanation. Customers can then refine the design conversationally and, once finalized, generate end-to-end code automation using CDK or CloudFormation. Cost estimates and comprehensive documentation are also available.
From Code-Centric to Contract-Centric Development¶
This transformation reflects a broader trend toward what a16z calls "upstream artifacts" becoming the primary deliverables:
In agent-driven workflows, high-level inputs become the core deliverable, and code becomes the byproduct of those inputs, more like a compiled artifact than a manually authored source.
The New Development Hierarchy¶
| Aspect | Traditional (Software 1.0) | AI-Enhanced (Software 1.0 Redux) |
|---|---|---|
| Primary Artifact | Code | Requirements, contracts, specifications |
| Developer Role | Implementation focused | Architecture and validation focused |
| Skill Emphasis | Syntax, algorithms, debugging | Requirements engineering, system design |
| Version Control | Code changes | Contract changes + generated artifacts |
Critical Upstream Artifacts¶
The most valuable artifacts in AI-driven development become the specifications that guide code generation:
| Artifact Category | Examples | AI-Centric Value |
|---|---|---|
| Requirements | User stories, acceptance criteria, functional specs | Clear evaluation targets for AI validation |
| Architecture & Design | ADRs, system diagrams, API contracts | Unambiguous constraints for code generation |
| Data & Schemas | JSON schemas, database models, API specifications | Structured templates for consistent output |
| Security & Compliance | Threat models, security policies, audit requirements | Automated guardrails and validation rules |
Everything-as-Code: The Infrastructure Precedent¶
This shift toward specification-driven development builds on existing "everything-as-code" movements:
- Infrastructure as Code (Terraform, CloudFormation)
- Policy as Code (Open Policy Agent)
- Diagrams as Code (Mermaid, PlantUML)
- Configuration as Code (Kubernetes manifests)
These approaches already demonstrated the value of maintaining human-readable specifications that generate operational artifacts. AI extends this pattern to application code itself.
Blurred Boundaries: Developer Roles Evolve¶
The lines between traditional software engineering roles are blurring:
Developer + Requirements Analyst: Developers must now excel at translating business needs into precise specifications that AI can implement reliably.
Human + Machine Collaboration: Success requires understanding both what AI can do well (pattern matching, code generation) and what humans do better (creative problem-solving, architectural judgment).
This evolution suggests developers will spend less time on implementation minutiae and more time on:
- System design and architectural decision-making
- Requirements engineering and specification writing
- Quality validation of AI-generated outputs
- Integration orchestration across AI and human-created components
Coding for AI Agents vs. Humans¶
Coding for AI Agents vs. Humans, May 2025, Martin Treiber
Quote
As AI coding assistants and autonomous agents (often powered by large language models) become more involved in software development, best practices in coding must account for a new "audience." Traditionally, code is written by and for human developers, emphasizing readability and maintainability for people. In contrast, code intended to be generated or maintained by AI agents may prioritize different qualities to align with an AI's interpretive capabilities.
| ASPECT | CODE OPTIMIZED FOR HUMAN DEVS | CODE OPTIMIZED FOR AI AGENTS |
|---|---|---|
| Design & Patterns | Uses established design patterns and abstractions for maintainability; architecture planned with human intuition in mind. E.g. heavily employs DRY and common patterns to avoid duplication. | Prefers straightforward implementations unless instructed otherwise; may duplicate logic or use simpler patterns by default. Requires explicit guidance to enforce complex patterns or project-specific architectures. |
| Readability & Naming | Emphasizes self-explanatory code: clear but concise names, standard formatting, and comments only where needed. Relies on code being self-documenting and consistent by convention. | Emphasizes explicit clarity: very descriptive names (the AI doesn't mind length), extremely consistent naming and styling. Heavy inline documentation and docstrings are included to provide context (since AI can update). |
| Performance Trade-offs | Prioritizes readability and maintainability; avoids premature optimizations. Optimizes hotspots after profiling, using complex techniques only with justification (and documenting them for colleagues). | Prioritizes correctness and completeness on first pass; optimizations are applied via iteration or prompts. Tends toward robust, error-checked code even if slightly slower. Can leverage known optimizations or libraries. |
Ultimately, good code is good code – many best practices overlap whether the consumer is human or AI. Clean architecture, readable style, and efficient execution benefit both. The key differences lie in emphasis: AI agents "read" code differently than humans, so code meant for them leans into consistency, explicit context, and simplicity of structure.
From Professional Development to Rapid Prototyping via Vibe Coding¶
While professional software engineering requires rigorous practices, AI has also enabled what Andrej Karpathy terms "vibe coding"— a more experimental approach where developers "see things, say things, run things, and copy-paste things, and it mostly works."
Though Karpathy notes this isn't traditional software engineering and has limitations, vibe coding serves an important role in enabling rapid prototyping and proof-of-concepts for both experienced developers and newcomers to programming.
AI Across the Development Lifecycle¶
Tracy Bannon's excellent talk Applying AI to the SDLC: New Ideas and Gotchas! - Leveraging AI to Improve Software Engineering includes a model for where AI can be used with DevSecOps.

It demonstrates how AI can enhance every phase of software development:
| Phase | AI Applications | Current Tools |
|---|---|---|
| Plan & Design | Generate design alternatives, review architectural decisions | Claude Artifacts, GitHub Copilot Chat |
| Code | Auto-complete, generate from natural language specs | GitHub Copilot, Claude Code, Cursor |
| Test & QA | Auto-generate test cases, intelligent fuzzing | GitHub Copilot, automated testing tools |
| Deploy & Ops | Predictive scaling, infrastructure optimization | AI-powered DevOps platforms |
| Monitor | Automated root-cause analysis, performance insights | AI monitoring and observability tools |
This comprehensive integration shows AI's potential to transform every aspect of the software development lifecycle.
Takeaways¶
Key Takeaways
- Accelerated Innovation: AI compresses idea-to-prototype cycles from weeks to minutes, fundamentally changing development velocity.
- Inverted Priorities: Upstream artifacts (requirements, architecture, contracts) become primary deliverables, while code becomes a generated byproduct.
- Evolved Skill Sets: Success now requires mastering requirements engineering, system architecture, and human-AI collaboration alongside traditional coding skills.
- New Paradigms: We're witnessing both Software 2.0 (AI eating software) and Software 1.0 Redux (AI feeding traditional development), each suited to different problem domains.
Software Assurance¶
Overview
This section provides a brief overview of Software Assurance fundamentals that are independent of the Software Development Life Cycle (SDLC) Methodology or what entity is creating the software:
- Software Assurance
- Verification and Validation
- Architecture
- Laws of Software Architecture
- Quality Attributes Drive the Architecture Design
It also covers the what, the how, and the why for the how of Software Engineering.
Quote
"Don't worry about that specification paperwork. We'd better hurry up and start coding, because we're going to have a whole lot of debugging to do."
Verifying And Validating Software Requirements And Design Specifications, Barry Boehm, 1984
Software Assurance¶
Software assurance aims to ensure the reliability, safety, and security of software products.
Software assurance (SwA) is a critical process in software development that ensures the reliability, safety, and security of software products.
There are several types of software assurance initiatives, including:
- Certification and accreditation: These programs establish standards and guidelines for software development, and verify that software products meet these standards. Certification and accreditation can help to ensure that software products are reliable, secure, and compliant with regulations and industry standards.[6]
- Training and education: These initiatives provide software developers with the knowledge and skills they need to create high-quality, secure software. Training and education can include courses on software testing, secure coding practices, and industry standards and best practices.[7]
- Code analysis and testing: These initiatives use tools and techniques to analyze software code and identify defects or vulnerabilities. Code analysis and testing can include static analysis, dynamic analysis, and fuzz testing, among other techniques.[8]
- Threat modeling and risk assessment: These initiatives assess the potential risks and threats to a software system, and identify strategies for mitigating these risks. Threat modeling and risk assessment can help to ensure that software systems are designed to be resilient to attacks and other threats.[21]
Software assurance is executed through a series of activities that aim to ensure the reliability, safety, and security of software products. These activities include requirements analysis, design reviews, code inspections, testing, and formal verification.
Validation vs. Verification¶
verification and validation to resolve issues early in the software life cycle
The basic objectives in verification and validation (V&V) of software requirements and design specifications are to identify and resolve software problems and high-risk issues early in the software life cycle.
Verification - to establish the truth of the correspondence between a software product and its specification.
- "Are you building it right?"
Validation - to establish the fitness or worth of a software product for its operational mission.
- "Are you building the right thing?"
Verification activities begin in the Product Design phase and conclude with the acceptance test. They do not lead to changes in the requirements baseline; only to changes in the refinements descending from it.
On the other hand, validation identifies problems which must be resolved by a change of the requirements specification. Thus, there are validation activities which occur throughout the software life cycle, including the development phase.
- For example, a simulation of the product design may establish not only that the design cannot meet the baseline performance requirements (verification), but also that the performance requirements are too stringent for any cost-effective product designs, and therefore need to be changed (validation).
Basic V&V criteria for software requirements and design specifications:
- completeness
- consistency
- feasibility
- testability
Guidelines For Verifying And Validating Software Requirements And Design Specifications, Barry W. Boehm
Continuous Validation And Verification Process¶
A continuous validation and verification process. Validation ensures the requirements correctly capture the users' and stakeholders' expectations and should be performed whenever a translation of requirements from one domain to another occurs. https://www.infoq.com/articles/ieee-verification-and-validation-for-software-systems/
The key artifact that distinguishes verification activities from validation activities is the software requirements specification.
- This covers what the software product will do (but not how it will do it; this is to be done in the design specification).
"Incremental Architecture-centric Validation & Verification Improves Qualification Confidence"
Continuous Verification & Validation of Critical Software via DevSecOps, https://apps.dtic.mil/sti/pdfs/AD1187382.pdf
Note
These assurance steps apply independent of the Software Development Life Cycle (SDLC) Methodology.
Architecture¶
Laws of Software Architecture¶
Fundamentals of Software Architecture, Mark Richards & Neal Ford, defines some fundamental Laws of Software Architecture.
First Law of Software Architecture¶
"Everything in software architecture is a trade-off."
- Corollary 1: If an architect thinks they have discovered something that isn't a trade-off, more likely they just haven't identified the trade-off yet.
Second Law of Software Architecture¶
"Why is more important than how."
Quality Attributes Drive The Architecture Design¶
Quality Attributes drive the architecture design
Requirements for a software system fall into the following two categories:
- Functional requirements: These requirements describe the business capabilities that the system must provide, as well as its behavior at run-time.
- Non-functional requirements: These requirements describe the "Quality Attributes" that the system must meet in delivering functional requirements.
Quality Attributes drive the architecture design.
Continuous Architecture: Sustainable Architecture in an Agile and Cloud-Centric World
Architecture Decision Records (ADR)¶
ADRs as a Decision Log
An ADR can help you understand the reasons for a chosen architectural decision, along with its trade-offs and consequences. The collection of ADRs created and maintained in a project constitute its decision log. https://adr.github.io/
ADRs are widely used in practice.
ADRs as Documentation
Architecture Decision Records can be used as an effective means to document a software architecture.
- The Context section of an ADR provides an excellent opportunity to describe the specific area of the system that requires an architecture decision to be made. This section also provides an opportunity to describe the alternatives.
- Perhaps more important is that the Decision section describes the reasons why a particular decision is made, which is by far the best form of architecture documentation.
- The Consequences section adds the final piece to the architecture documentation by describing additional aspects of a particular decision, such as the trade-off analysis of choosing performance over scalability.
Chapter 19: Architecture Decisions, Fundamentals of Software Architecture, Mark Richards & Neal Ford
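To make the Context / Decision / Consequences structure concrete, here is a minimal Python sketch that scaffolds a new ADR file. The directory path, file naming, and template wording are illustrative assumptions, not a prescribed standard.

```python
# A minimal sketch (not a prescribed standard): scaffold a new ADR file with the
# Context / Decision / Consequences sections described above.
# The directory path, file naming, and template wording are illustrative assumptions.
from datetime import date
from pathlib import Path

ADR_TEMPLATE = """# ADR-{number:03d}: {title}

Date: {today}

## Context
What area of the system needs a decision? What alternatives were considered?

## Decision
What was decided and, most importantly, why?

## Consequences
Trade-offs and follow-on effects of this decision (e.g. performance vs. scalability).
"""

def new_adr(number: int, title: str, adr_dir: str = "docs/adr") -> Path:
    """Create a new ADR file from the template and return its path."""
    directory = Path(adr_dir)
    directory.mkdir(parents=True, exist_ok=True)
    slug = title.lower().replace(" ", "-")
    adr_file = directory / f"{number:03d}-{slug}.md"
    adr_file.write_text(
        ADR_TEMPLATE.format(number=number, title=title, today=date.today())
    )
    return adr_file

if __name__ == "__main__":
    print(new_adr(1, "Use PostgreSQL for persistence"))
```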
The What How Why Of Software Engineering¶
People often struggle with what vs how
As a software architect for most of my career, I've seen people (customers and technical people) struggle with the what vs how.
They will often specify the implementation details (the how), but not what they want and why.
- Or they will specify what they know exists - not what they actually want.
I've seen this also in a security context e.g.
- how: "I want to share a secret key with a trusted 3rd party"
- The wrong thing to do is start by describing secure ways to share keys (which some technical people will do regardless of their seniority)
- The right thing to do is ask "why, what are you trying to do? Can you give an example?", but often people don't ask.
- what: "I want to give a trusted 3rd party read access to data somewhere so they can do something"
- This may not require sharing keys depending on the context. But even if it does require sharing keys, you can give advice appropriate to the use case and context.
Framing the "what"
Think of me as a good plumber:
- Ask for what you want, not what you think you can get. In other words, assume you can have anything you want, now what do you want?
- You tell me what features you want. I'll take care of the plumbing details.
What How Why¶
- The what is captured in the Software Requirements.
- The how is captured in the System Design.
- The why for the how is captured in Decision Records covering the reasons and tradeoffs associated with the decisions. These decisions can be at an Architecture or Design level.
Takeaways¶
Takeaways
The key artifact that distinguishes verification activities from validation activities is the software requirements.
It is important to understand the what independent of the how (requirements vs design).
It is important to understand the why for the how (e.g. ADRs) aka
"Why is more important than how."
Everything in software architecture is a trade-off.
Quality Attributes drive the architecture design.
Software Engineering Artifacts in the AI Era¶
Overview
This section builds on How AI is Changing Software Engineering and Software Assurance to focus on the Software Engineering artifacts, which are becoming the core deliverable, the active contracts that guide code generation and system behavior.
Artifacts¶
1. Research & Discovery¶
- Technical Research Documentation - Market analysis, technology evaluations, feasibility studies
- AI Model Research - Model selection criteria, capability assessments, performance benchmarks
- Competitive Analysis - Feature comparisons, architectural patterns, industry standards
2. Requirements & Specifications ⭐¶
These become critical contracts for AI code generation
- User Stories with clear acceptance criteria for AI validation
- Functional & Non-functional Requirements - Precise, unambiguous specifications
- Use-case Diagrams and Scenarios - Visual specifications for complex workflows
- Feature Backlog with AI-friendly descriptions and constraints
- Prompt Libraries - Reusable natural language specifications for common patterns (a minimal sketch follows this list)
- Prompt Engineering Specifications - Detailed instructions optimized for AI performance and reliability
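A prompt library can start very simply. This is a minimal Python sketch; the template names and wording are illustrative assumptions, not from any specific tool, and in practice these templates would live under version control (e.g. an .ai/prompts/ folder).

```python
# A minimal sketch of a reusable prompt library entry.
# Template names and wording are illustrative assumptions, not from any specific tool.
PROMPT_LIBRARY = {
    "user-story": (
        "As a {role}, I want {capability} so that {benefit}.\n"
        "Acceptance criteria:\n{acceptance_criteria}"
    ),
    "threat-model-review": (
        "Review the following design excerpt for STRIDE threats and list any "
        "applicable CWEs:\n{design_excerpt}"
    ),
}

def render(name: str, **fields: str) -> str:
    """Fill a named prompt template with the given fields."""
    return PROMPT_LIBRARY[name].format(**fields)

print(render(
    "user-story",
    role="security analyst",
    capability="to extract configuration parameters from user manuals",
    benefit="I can validate hardening baselines",
    acceptance_criteria="- parameters returned as JSON\n- source page cited",
))
```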
3. Architecture & Design ⭐¶
Primary artifacts that guide AI implementation
- Architecture Diagrams (UML, component diagrams, system context)
- Architectural Decision Records (ADRs) - Critical for maintaining AI-generated code consistency
- System Design Documents - High-level constraints and patterns
- API Contracts & Specifications (OpenAPI/Swagger) - Precise interfaces for AI code generation
- Data Schemas (JSON Schema, database models) - Structured templates for consistent output
- Diagrams-as-Code (Mermaid, PlantUML) - Machine-readable architectural specifications
- Diagrams-as-Images (any) - Multi-Modal-LLM-readable architectural specifications
4. Implementation¶
- Generated Source Code - Now a byproduct of upstream specifications
- Code Repositories with enhanced metadata tracking (human vs. AI contributions)
- AI Generation Prompts - Versioned natural language instructions
- Code Review Guidelines for AI-generated code
- Refactoring Documentation - Plans and histories for AI-assisted refactoring
- Legacy Migration Plans - Strategies for AI-assisted modernization
- Dataset Management - Versioning, curation, and labeling of training data as code artifacts
- Traceability Matrices - Automated linkage between specifications, generated code, and test artifacts
- Metadata Standards - Explicit schema for documenting AI contributions vs. human edits, linked to audit trails
5. Testing & Quality Assurance¶
- Test Plans with AI-generated test case specifications
- Automated Test Suites (unit, integration, end-to-end)
- AI-Generated Test Cases - Coverage for edge cases and scenarios
- Quality Gates - Automated validation rules for AI-generated code
- Performance Benchmarks - Baseline metrics for AI optimization
- Bug Tracking with AI-assistance classification and resolution suggestions
6. Security & Compliance ⭐¶
Critical guardrails for AI-generated code
- Threat Models - Security constraints for AI code generation
- Security Policies as Code - Automated security validation rules
- Vulnerability Scans of AI-generated code
- Compliance Checklists - Regulatory requirements embedded in AI workflows
- Audit Trails - Enhanced tracking of AI-generated changes
- AI Code Reviews - Code reviews performed by AI
7. Deployment & Operations¶
- Infrastructure as Code (IaC) - Terraform, CloudFormation specifications
- CI/CD Pipeline Configurations - Automated deployment workflows
- Release Notes - Generated from commit messages and change logs
- Monitoring & Observability - AI-powered dashboards and alerting
- Runbooks - Operational procedures with AI-assisted troubleshooting
8. Maintenance & Evolution¶
- Change Logs - Automated tracking of modifications
- Incident Response - AI-assisted root cause analysis
- Post-mortem Reports - Learning from failures with AI insights
- Dependency Management - Automated updates and security patches
- Technical Debt Tracking - AI-identified improvement opportunities
- Continuous Improvement Plans - Strategies integrating feedback loops from operational AI performance to refine specifications and AI model performance over time
9. Communication & Collaboration¶
- Project Documentation - Living documents that evolve with the codebase
- Team Knowledge Base - Searchable, AI-enhanced documentation
- Decision Logs - Context for architectural and design choices
- User Documentation - Generated from code and specifications
- Training Materials - AI-assisted onboarding and skill development
- Role Definitions for Human-AI Interaction - Clear delineation of responsibilities, authority, and interaction protocols for AI vs. human developers
- Ubiquitous Language Glossaries - Domain-driven language consistently used across requirements, code, prompts, and AI interactions
Artifact Formats for AI Integration¶
Multi-modal LLMs can process diverse document formats, but strategic format selection enhances AI comprehension and workflow efficiency.
Recommended Primary Formats¶
Markdown (.md) - Optimal for AI processing
- Semi-structured, lightweight format ideal for LLMs
- Human-readable with machine-parseable structure
- Supports frontmatter for metadata (see the sketch below)
- Version control friendly
Structured Data
- JSON/YAML - Configuration files, schemas, API definitions
- TOML - Configuration with human-friendly syntax
- CSV - Tabular data, test cases, requirements matrices
Dataset Management Formats - CSV, Parquet, JSONL for structured training data versioning and curation
Diagrams-as-Code
- Mermaid - Flowcharts, sequence diagrams, system architecture
- PlantUML - UML diagrams, component relationships
- DOT/Graphviz - Network diagrams, dependency graphs
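As an example of the Markdown frontmatter recommended above, a minimal Python sketch, assuming the third-party python-frontmatter package and a hypothetical requirements file:

```python
# A minimal sketch: read artifact metadata from Markdown frontmatter.
# Assumes the third-party "python-frontmatter" package (pip install python-frontmatter)
# and a hypothetical requirements file that starts with, for example:
#   ---
#   version: 0.2
#   tags: [requirements, ai]
#   ai-instructions: "Treat acceptance criteria as hard constraints"
#   ---
import frontmatter

post = frontmatter.load("docs/system-docs/02-requirements/user-stories.md")

# Metadata (tags, version, AI instructions) is exposed as a dict...
print(post.metadata.get("version"))
print(post.metadata.get("tags", []))

# ...and the Markdown body (without frontmatter) is available for prompting.
print(post.content[:200])
```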
Document Conversion Pipeline¶
Modern tools enable seamless format conversion for AI consumption:
| Tool | Input Formats | Use Case |
|---|---|---|
| MarkItDown | PDF, DOCX, PPTX, XLSX, Images | Legacy document conversion |
| Docling | PDF, DOCX, HTML, Images | Enterprise document processing |
| Pandoc | 40+ formats | Universal document conversion |
Best Practices:
- Start with Markdown for new artifacts
- Convert legacy documents using automated tools
- Use frontmatter for metadata (tags, version, AI instructions)
- Maintain original formats alongside Markdown for compliance
This approach ensures artifacts remain both human-readable and AI-processable throughout the development lifecycle.
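A minimal sketch of the "Convert legacy documents" practice above, assuming the markitdown package (one of the tools in the table); the same pattern applies to Docling or Pandoc, and the file paths are illustrative.

```python
# A minimal sketch of the legacy-document conversion step, assuming the
# "markitdown" package (pip install markitdown); the same pattern applies to
# Docling or Pandoc. File paths are illustrative.
from pathlib import Path
from markitdown import MarkItDown

converter = MarkItDown()
result = converter.convert("legacy-docs/design-spec.docx")

# Keep the original alongside the Markdown version for compliance,
# and write the Markdown where AI tooling expects it.
out = Path("docs/system-docs/03-architecture/design-spec.md")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(result.text_content, encoding="utf-8")
print(f"Wrote {out}")
```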
AI-Enhanced Artifact Workflow¶
Traditional vs. AI-Driven Development¶
| Aspect | Traditional Approach | AI-Enhanced Approach |
|---|---|---|
| Primary Focus | Code implementation | Specification quality |
| Artifact Role | Documentation | Active contracts |
| Maintenance | Manual updates | Automated synchronization |
| Quality Assurance | Post-development review | Continuous validation |
| Knowledge Transfer | Human documentation | AI-searchable knowledge base |
Repository Structure for AI-First Development¶
project-root/
├── README.md
├── CONTRIBUTING.md
├── .ai/ # AI-specific configurations
│ ├── prompts/ # Reusable prompt templates
│ ├── models/ # Model configurations
│ └── validation/ # AI output validation rules
│
├── docs/
│ ├── system-docs/ # System-focused documentation
│ │ ├── 01-research/
│ │ │ ├── technical-research.md
│ │ │ ├── ai-model-research.md
│ │ │ └── competitive-analysis.md
│ │ │
│ │ ├── 02-requirements/ # ⭐ CRITICAL for AI generation
│ │ │ ├── user-stories.md
│ │ │ ├── functional-requirements.md
│ │ │ ├── use-case-scenarios.md
│ │ │ ├── feature-backlog.md
│ │ │ ├── prompt-libraries.md
│ │ │ └── prompt-engineering-specifications.md
│ │ │
│ │ ├── 03-architecture/ # ⭐ PRIMARY contracts
│ │ │ ├── architecture-diagrams/
│ │ │ │ ├── diagrams-as-code/
│ │ │ │ └── diagrams-as-images/
│ │ │ ├── ADR/
│ │ │ ├── system-design-documents.md
│ │ │ ├── api-contracts/
│ │ │ └── data-schemas/
│ │ │
│ │ ├── 04-implementation/
│ │ │ ├── generated-source-code/
│ │ │ ├── code-repositories-metadata.md
│ │ │ ├── ai-generation-prompts.md
│ │ │ ├── code-review-guidelines.md
│ │ │ ├── refactoring-documentation.md
│ │ │ ├── legacy-migration-plans.md
│ │ │ ├── dataset-management.md
│ │ │ ├── traceability-matrices.md
│ │ │ └── metadata-standards.md
│ │ │
│ │ ├── 05-testing/
│ │ │ ├── test-plans.md
│ │ │ ├── automated-test-suites.md
│ │ │ ├── ai-generated-test-cases.md
│ │ │ ├── quality-gates.md
│ │ │ ├── performance-benchmarks.md
│ │ │ └── bug-tracking.md
│ │ │
│ │ ├── 06-security/ # ⭐ AI guardrails
│ │ │ ├── threat-models.md
│ │ │ ├── security-policies-as-code.md
│ │ │ ├── vulnerability-scans.md
│ │ │ ├── compliance-checklists.md
│ │ │ ├── audit-trails.md
│ │ │ └── ai-code-reviews.md
│ │ │
│ │ ├── 07-deployment-operations/
│ │ │ ├── infrastructure-as-code/
│ │ │ ├── ci-cd-pipeline-configurations.md
│ │ │ ├── release-notes.md
│ │ │ ├── monitoring-observability.md
│ │ │ └── runbooks.md
│ │ │
│ │ ├── 08-maintenance-evolution/
│ │ │ ├── change-logs.md
│ │ │ ├── incident-response.md
│ │ │ ├── post-mortem-reports.md
│ │ │ ├── dependency-management.md
│ │ │ ├── technical-debt-tracking.md
│ │ │ └── continuous-improvement-plans.md
│ │ │
│ │ └── 09-communication-collaboration/
│ │ ├── project-documentation.md
│ │ ├── team-knowledge-base.md
│ │ ├── decision-logs.md
│ │ ├── role-definitions-human-ai.md
│ │ └── ubiquitous-language-glossaries.md
│ │
│ └── user-docs/ # User-facing documentation
│ ├── user-documentation.md
│ └── training-materials.md
│
├── src/ # Generated and human code
├── infra/ # IaC specifications
├── .github/workflows/ # AI-enhanced CI/CD
└── scripts/ai/ # AI automation scripts
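A minimal Python sketch that scaffolds the top level of this layout; the directory list is abridged, and the .gitkeep files are just one convention for keeping empty directories under version control.

```python
# A minimal sketch that scaffolds the top level of the layout above.
# The directory list is abridged, and the .gitkeep files are just one convention
# for keeping empty directories under version control.
from pathlib import Path

DIRECTORIES = [
    ".ai/prompts", ".ai/models", ".ai/validation",
    "docs/system-docs/02-requirements",
    "docs/system-docs/03-architecture/ADR",
    "docs/system-docs/06-security",
    "docs/user-docs",
    "src", "infra", ".github/workflows", "scripts/ai",
]

for d in DIRECTORIES:
    path = Path(d)
    path.mkdir(parents=True, exist_ok=True)
    (path / ".gitkeep").touch()

print(f"Scaffolded {len(DIRECTORIES)} directories.")
```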
Key Principles for AI-Era Artifacts¶
Best Practices
- Precision Over Brevity - AI needs explicit, unambiguous specifications
- Machine-Readable Formats - Use structured data (JSON, YAML, Markdown with frontmatter)
- Version Everything - Track changes to specifications alongside generated code
- Automate Validation - Ensure AI outputs meet specification requirements (see the sketch below)
- Maintain Traceability - Link generated code back to originating specifications
These artifacts serve as the foundation for AI-driven development, transforming from passive documentation into active contracts that guide intelligent code generation and system evolution.
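As one concrete instance of "Automate Validation" above, a minimal sketch that checks an AI-generated artifact against one of the data schemas kept under 03-architecture/data-schemas/, assuming the jsonschema package; the file paths and schema contents are hypothetical.

```python
# A minimal sketch of the "Automate Validation" principle: check an AI-generated
# artifact against a data schema. Assumes the "jsonschema" package;
# file paths and schema contents are hypothetical.
import json
from jsonschema import ValidationError, validate

with open("docs/system-docs/03-architecture/data-schemas/user.schema.json") as f:
    schema = json.load(f)

with open("build/ai-output/user-record.json") as f:
    ai_output = json.load(f)

try:
    validate(instance=ai_output, schema=schema)
    print("AI output conforms to the specification.")
except ValidationError as err:
    # Fail the quality gate (e.g. in CI) if the output violates the contract.
    print(f"Specification violation: {err.message}")
    raise SystemExit(1)
```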
Software Engineering Agents ↵
AI Coding Tools¶
Overview
This page gives a comparison of AI Coding Tools by Autonomy Levels.
It starts by defining Autonomy Levels for AI Coding Tools.
Then it categorizes and compares these AI Coding Tools based on these Autonomy Levels.
Tip
Observation
Current integrated development environments (IDEs) are optimized for human programmers.
They are not optimal:
- For Software Engineering Artifacts in the AI Era
- For shared understanding among stakeholders
- As we move From Code-Centric to Contract-Centric Development
Since I made this observation, there have been some relevant updates:
- Amazon Kiro introduces a “spec-driven development” philosophy
- Sean Grove (OpenAI), July 2025, discusses "The New Code", where specifications, not traditional code, are the primary valuable artifact for aligning human and AI intent.
- Future IDEs might function as "integrated thought clarifiers," helping to surface and resolve ambiguities in specifications.
- Moving from disparate machine encodings to a unified human encoding (specifications) will simplify how we solve problems.
- "There's a good chance that, by the end of the year, people aren't using IDEs anymore", Mastering Claude Code in 30 minutes, Anthropic, 22 May 2025
Autonomy Levels for AI Coding Tools¶
We define autonomy levels similar to how autonomous driving is categorized:
- Level 1 – Basic Code Completion: AI offers single-line or small code suggestions based on context. The developer is fully in control, and the AI assists with micro-tasks (comparable to basic cruise control in cars). Tools at this level accelerate typing but do not generate complex code blocks or make decisions.
- Level 2 – Advanced Code Generation (AI Pair Programmer): AI can produce larger code snippets or entire functions from natural language prompts or broader context. The AI behaves like a junior pair programmer – it writes significant blocks of code or fixes simple bugs on request, but each AI output is a single-step operation reviewed by the human. Developers still guide the overall structure and verify the results.
- Level 3 – Context-Aware Agents (Supervised Automation): AI acts more agentically, performing multi-step coding tasks in an iterative loop to fulfill a high-level goal. The developer provides a goal (e.g. “Add a profile feature” or “Fix this bug”) and the AI plans steps, writes or edits multiple files, runs code or tests, debugs, and iterates with minimal prompting. The human oversees the process and may intervene or review final changes, but does not micromanage each step. This is akin to an intern or junior dev autonomously coding under supervision.
- Level 4 – Highly Autonomous Coding: The AI can handle complex projects end-to-end, needing only minimal human input (like a prompt or spec) and no required oversight on each change. It can design solutions, write code across the entire codebase, test, debug, and even proactively improve code without awaiting human review. At this level, you might trust the AI as you would a strong senior engineer: provide requirements and let it deliver a solution to review. (In practice, developers may still do a final review, but the AI’s output is intended to be production-ready.)
- Level 5 – Full Autonomy (Experimental/Future): At this theoretical stage, an AI coding agent sets its own goals and continuously improves software without explicit human-defined tasks. It could analyze user needs or system metrics and autonomously decide to implement features or optimizations. This remains speculative – current tools do not reach Level 5, but research projects hint at this direction (e.g. AutoGPT agents that take an open-ended goal and break it into sub-tasks to achieve it). No mainstream coding tool today operates at full Level 5 autonomy.
Comparison Table of Developer AI Tools by Autonomy and Capabilities
The following table provides a summarized comparison of the developer AI tools discussed in this report, categorized by their primary autonomy level and highlighting their key characteristics, interaction styles, strengths, and limitations. This table serves as a quick reference to differentiate the capabilities and intended use cases of various tools in the rapidly evolving landscape of AI-assisted software development. Understanding these distinctions is crucial for developers and engineering leaders seeking to effectively integrate AI into their workflows.
| Tool Name | Autonomy Level | Interaction Style | Key Strengths | Notable Limitations |
|---|---|---|---|---|
| VS Code IntelliSense | 1 (Basic Assist) | Inline IDE suggestions as you type | Fast, context-aware completions for syntax & names based on language semantics and source analysis 7; works offline; reliable for known code patterns; extensible. | No true "AI" generation of novel logic; limited to current file/project context; cannot understand higher-level intent [User Query]. |
| Tabnine | 1 (with L3 agent features) | IDE plugin, ML-based autocomplete; AI chat; AI agents for specific tasks 8 | ML-driven suggestions for full lines/functions; supports many languages; local models for privacy 8; team-trained models 10; agents for docs, review, Jira tasks.8 | Core completion reactive; agent features newer; quality varies with context; large local models can be resource-intensive.10 |
| GitHub Copilot (Standard Mode) | 2 (Pair Programmer) | IDE plugin (inline suggestions); Copilot Chat interface 10 | Generates larger code blocks/functions from comments/context 10; broad framework knowledge; seamless IDE integration; chat for Q&A/explanations.10 | Can produce incorrect/insecure code requiring review 6; limited full-project context; cloud-dependent; privacy/licensing concerns [User Query]. |
| Amazon CodeWhisperer | 2 (Pair Programmer) | IDE plugin (inline suggestions) 11 | Optimized for AWS services and APIs 11; generates multi-line completions from comments; security scans 11; free for individuals.47 (Becoming part of Amazon Q Developer 11) | Strong AWS bias, less effective for non-AWS work 47; suggestions can be basic; output needs review; requires AWS account [User Query]. |
| OpenAI Codex (API) / GPT Models | 2 (Pair Programmer) | API/Playground; via ChatGPT (prompt-response); new "Codex" agent in ChatGPT 12 | Powerful natural language to code generation (esp. GPT-4.1 13); many languages; new "Codex" agent can edit files, run commands.48 | Original Codex API deprecated 12; API use requires prompt engineering, output needs validation; context limits; advanced access often paid [User Query]. |
| Replit Ghostwriter | 2 (Pair Programmer) | Replit IDE integration (inline autocomplete & Ghostwriter Chat) 49 | Tight Replit integration for instant execution/testing; multi-line/function completion, refactoring, explanation, proactive debugger 50; good for learning/prototyping. | Primarily Replit-locked 51; quality similar to other LLM tools (needs review); advanced AI features are paid 49; internet-dependent.51 |
| GitHub Copilot (Agent Mode) | 3 (Agentic – supervised) | IDE chat/command (VS Code); GitHub issue assignment 14 | Plans & executes multi-file edits, runs tests, iterates on errors autonomously based on high-level goals/issues 14; creates PRs 44; developer kept in loop. | Newer feature (stable in VS Code 14); can misinterpret complex requests, output needs review; performance varies with repo complexity.14 |
| Cursor (AI Editor) | 3 (Agentic – supervised) | Full AI-based code editor (VS Code fork) with chat & agent modes 15 | Deep codebase indexing for global context 15; multi-file ops, refactoring; Agent Mode explores, plans, executes with tools 15; "YOLO mode" for auto-execution (user term for auto-run) 57; checkpoints for safety.15 | Standalone editor (though high VS Code compatibility 56); auto-run/YOLO mode is risky 57; performance/cost depends on chosen LLM API keys.55 |
| Windsurf (Cascade Agent) | 3 (Agentic – supervised) | AI-driven IDE (VS Code-based) with "Cascade" agent (Write/Chat modes) 63 | Cascade agent modifies code, runs it, debugs, iterates until success (Write Mode) 16; uses tools (search, terminal, web); auto-fixes lint errors 63; "Turbo mode" for auto-execution.63 | Newer tool; auto-execution ("Turbo mode") risky; must use its IDE; can make errors requiring human fix 68; some features paid/enterprise.65 |
| ChatGPT (with Plugins & Advanced Data Analysis) | 3 (Agentic – supervised) | Conversational AI (web UI or API) with Code Interpreter & plugins (tool use) 75 | Highly flexible multi-tool agent: can fetch docs, run/debug Python code iteratively, use Git (via plugins) 75; strong natural language understanding (GPT-4).74 | Not an IDE tool by default (copy-paste workflow unless plugins bridge gap) [User Query]; plugin reliability varies; context limits; requires paid subscription for advanced features.82 |
| Aider (CLI + Git assistant) | 3 (Agentic – supervised) | CLI chat tool working on local Git repo 17 | Excellent Git integration (auto-commits, diffs); auto-runs linters/tests & attempts fixes 17; supports various LLMs (cloud/local) 17; open-source.18 | CLI-based (less visual, steeper curve) 18; requires API key setup; auto-fix quality depends on tests/linters; model errors possible [User Query]. |
| Amazon Q Developer CLI | 3 (Agentic – supervised) | Interactive CLI agent (natural language to CLI commands & code edits) 85 | AI orchestrates local CLI tools (compilers, package managers, AWS CLI) 86; strong for AWS/cloud tasks; uses powerful models (Claude via Bedrock) 85; iterative adjustments based on feedback.85 | Newer tool; executing commands automatically has inherent risks [User Query]; primarily benefits AWS users; requires AWS setup [User Query]. |
| Anthropic Claude Code | 4 (High Autonomy) | Terminal-based chat agent; IDE integrations (VS Code, JetBrains) 19 | Handles full-codebase context (million-token); edits files with diffs; runs tests, debugs, iterates; performs Git ops (branch, commit, PR, merge); designs modular solutions from prompts; proactively refactors and optimizes code; SDK for custom agents 19, 69 | Output is production-ready with minimal review; CLI interface may deter some users; currently in beta 19; API costs can add up for large use cases 71; very large monorepos may still need human guidance |
| Google Jules | 4 (High Autonomy) | Asynchronous cloud-based agent; GitHub integration (task assignment & PRs) 20 | Autonomous coding agent: takes high-level goals (features, bugs), works independently in cloud VM, presents PRs 20; uses Gemini 2.5 Pro; provides plan/reasoning; audio changelogs.20 | Very new (public beta May 2025) 20; cloud-only (code sent to Google VM); asynchronous (no real-time interaction); high autonomy needs careful review of results [User Query]. |
| Cognition Devin | 4 (High Autonomy) | Cloud AI service (task assignment via dashboard/Slack/CLI) 22 | AI Software Engineer: plans & executes complex end-to-end tasks using own shell, editor, browser 21; strong SWE-bench (13.86%) 23; Devin 2.0 has interactive planning, search, wiki.22 | Expensive enterprise tool [User Query]; success on very complex/novel tasks still limited 23; requires codebase access by third-party; some demo authenticity concerns raised.22 |
| GPT-Engineer (open-source) | 4 (High Autonomy, experimental) | CLI tool: single prompt -> clarification dialogue -> codebase generation 24 | Can bootstrap entire (simple) projects from one prompt; automates initial spec-to-code process 24; open-source and customizable.24 | Highly prompt-dependent; generated code needs significant review/debugging for complex apps 24; primarily for initial generation, not iterative work on existing code [User Query]. |
| AutoGPT (for coding tasks) | 4 (High Autonomy, experimental) | Self-directed AI agent framework (user sets high-level goal, agent plans & executes with tools) 98 | Fully automated workflow: breaks goal into sub-tasks, uses tools (web search, file I/O, code execution) iteratively without human prompts.98 | Often inefficient, prone to loops or incorrect paths without precise goals/human feedback 100; not specialized for coding; high API costs; mainly experimental for dev tasks.99 |
Note
Tools at Level 5 are omitted from the table as no productized examples currently exist; this level remains theoretical and experimental.
This comparison underscores that as we progress from Level 1 to Level 4, AI tools take on more of the coding burden — from merely completing the next line of code to handling whole features. Developers today can mix and match these tools to suit their needs, but it’s crucial to understand each tool’s autonomy and limits to use them effectively. The field is evolving rapidly, and what is Level 3 today may become Level 4 tomorrow. Keeping an eye on these autonomy levels helps set expectations and guides us in adopting AI tools that truly enhance productivity while maintaining quality and control in software development.
Works cited¶
- The 2025 AI Index Report | Stanford HAI, accessed May 26, 2025, https://hai.stanford.edu/ai-index/2025-ai-index-report
- AI: A Declaration of Autonomy - Accenture, accessed May 26, 2025, https://www.accenture.com/content/dam/accenture/final/accenture-com/document-3/Accenture-Tech-Vision-2025.pdf
- What are AI Agents?- Agents in Artificial Intelligence Explained - AWS, accessed May 26, 2025, https://aws.amazon.com/what-is/ai-agents/
- A More Practical Future for Autonomous Vehicles | S&P Global, accessed May 26, 2025, https://www.spglobal.com/automotive-insights/en/blogs/fuel-for-thought-a-more-practical-future-for-autonomous-vehicles
- Levels of intelligence: Navigating the future of AI, from robotic arms to autonomous cars, accessed May 26, 2025, https://roboticsandautomationnews.com/2025/05/17/levels-of-intelligence-navigating-the-future-of-ai-from-robotic-arms-to-autonomous-cars/90943/
- GitHub CoPilot: Pros and Cons 2025 - PeerSpot, accessed May 26, 2025, https://www.peerspot.com/products/github-copilot-pros-and-cons
- IntelliSense - Visual Studio Code, accessed May 26, 2025, https://code.visualstudio.com/docs/editing/intellisense
- Tabnine AI Code Assistant | private, personalized, protected, accessed May 26, 2025, https://www.tabnine.com/
- Plans & Pricing | Tabnine: The AI code assistant that you control, accessed May 26, 2025, https://www.tabnine.com/pricing/
- 20 Best AI-Powered Coding Assistant Tools in 2025 - Spacelift, accessed May 26, 2025, https://spacelift.io/blog/ai-coding-assistant-tools
- What is CodeWhisperer? - CodeWhisperer - AWS Documentation, accessed May 26, 2025, https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-cwspr.html
- Codex Open AI: The Ultimate Guide to AI-Powered Coding in 2025 - Emelia.io, accessed May 26, 2025, https://emelia.io/hub/codex-open-ai
- Introducing GPT-4.1 in the API - OpenAI, accessed May 26, 2025, https://openai.com/index/gpt-4-1/
- Introducing GitHub Copilot agent mode (preview) - Visual Studio Code, accessed May 26, 2025, https://code.visualstudio.com/blogs/2025/02/24/introducing-copilot-agent-mode
- Agent Mode - Cursor, accessed May 26, 2025, https://docs.cursor.com/chat/agent
- Windsurf AI Agentic Code Editor: Features, Setup, and Use Cases | DataCamp, accessed May 26, 2025, https://www.datacamp.com/tutorial/windsurf-ai-agentic-code-editor
- Aider - AI Pair Programming in Your Terminal, accessed May 26, 2025, https://aider.chat/
- Aider vs Windsurf: Which AI Coding Assistant Should You Choose? | UI Bakery Blog, accessed May 26, 2025, https://uibakery.io/blog/aider-vs-windsurf
- Claude Code overview - Anthropic API, accessed May 26, 2025, https://docs.anthropic.com/en/docs/agents/claude-code/introduction
- Jules: Google's autonomous AI coding agent - Google Blog, accessed May 26, 2025, https://blog.google/technology/google-labs/jules/
- Devin | The AI Software Engineer, accessed May 26, 2025, https://devin.ai/
- Report: Cognition Business Breakdown & Founding Story | Contrary ..., accessed May 26, 2025, https://research.contrary.com/company/cognition
- SWE-bench technical report - Cognition, accessed May 26, 2025, https://cognition.ai/blog/swe-bench-technical-report
- GPT Engineer | AI Low-Code/No-Code Tool Review 2025 - LogicBalls, accessed May 26, 2025, https://logicballs.com/ai-tools/gpt-engineer
- From Prompt to Codebase: The Power of GPT Engineer - Kanaries Docs, accessed May 26, 2025, https://docs.kanaries.net/topics/ChatGPT/gpt-engineer
- arxiv.org, accessed May 26, 2025, https://arxiv.org/html/2504.15228
- arXiv:2504.15228v1 [cs.AI] 21 Apr 2025, accessed May 26, 2025, https://arxiv.org/pdf/2504.15228
- DeepMind's AlphaDev Reinvents Algorithms | Enterprise AI Daily, accessed May 26, 2025, https://www.enterpriseaisolutions.io/p/beyond-human-limits-f2bc
- MuZero, AlphaZero, and AlphaDev: Optimizing computer systems - Google DeepMind, accessed May 26, 2025, https://deepmind.google/discover/blog/muzero-alphazero-and-alphadev-optimizing-computer-systems/
- AlphaDev discovers faster sorting algorithms - Google DeepMind, accessed May 26, 2025, https://deepmind.google/discover/blog/alphadev-discovers-faster-sorting-algorithms/
- Meta AI Introduces Meta LLM Compiler: A State-of-the-Art LLM that Builds upon Code Llama with Improved Performance for Code Optimization and Compiler Reasoning : r/machinelearningnews - Reddit, accessed May 26, 2025, https://www.reddit.com/r/machinelearningnews/comments/1dqn8to/meta_ai_introduces_meta_llm_compiler_a/
- Meta's LLM Compiler: Innovating Code Optimization with AI ..., accessed May 26, 2025, https://www.unite.ai/metas-llm-compiler-innovating-code-optimization-with-ai-powered-compiler-design/
- A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops - arXiv, accessed May 26, 2025, https://arxiv.org/html/2412.17149v1
- Recursive self-improvement - Wikipedia, accessed May 26, 2025, https://en.wikipedia.org/wiki/Recursive_self-improvement
- Level 5 Autonomy: How Close Are We to Fully Self-Driving Cars? (Latest Industry Stats), accessed May 26, 2025, https://patentpc.com/blog/level-5-autonomy-how-close-are-we-to-fully-self-driving-cars-latest-industry-stats
- In response to critiques of Guaranteed Safe AI - AI Alignment Forum, accessed May 26, 2025, https://www.alignmentforum.org/posts/DZuBHHKao6jsDDreH/in-response-to-critiques-of-guaranteed-safe-ai
- Part 1: How Sourcegraph scales with the Language Server Protocol, accessed May 26, 2025, https://sourcegraph.com/blog/part-1-how-sourcegraph-scales-with-the-language-server-protocol
- JavaScript in Visual Studio Code, accessed May 26, 2025, https://code.visualstudio.com/docs/languages/javascript
- Features - CodeWhisperer - AWS Documentation, accessed May 26, 2025, https://docs.aws.amazon.com/codewhisperer/latest/userguide/features.html
- Code completion | IntelliJ IDEA Documentation - JetBrains, accessed May 26, 2025, https://www.jetbrains.com/help/idea/auto-completing-code.html
- IntelliJ IDEA | Features - JetBrains, accessed May 26, 2025, https://www.jetbrains.com/idea/features/
- IntelliCode for Visual Studio | Microsoft Learn, accessed May 26, 2025, https://learn.microsoft.com/en-us/visualstudio/ide/intellicode-visual-studio?view=vs-2022
- IntelliCode Whole-line autocompletions - Visual Studio (Windows) | Microsoft Learn, accessed May 26, 2025, https://learn.microsoft.com/en-us/visualstudio/ide/visual-studio-whole-line-completions?view=vs-2022
- GitHub Copilot features, accessed May 26, 2025, https://docs.github.com/en/copilot/about-github-copilot/github-copilot-features
- About individual Copilot plans and benefits - GitHub Docs, accessed May 26, 2025, https://docs.github.com/en/copilot/managing-copilot/managing-copilot-as-an-individual-subscriber/getting-started-with-copilot-on-your-personal-account/about-individual-copilot-plans-and-benefits
- GitHub Copilot Pros and Cons, accessed May 26, 2025, https://www.netguru.com/blog/github-copilot
- Best AI Code Assistants In 2025 - Top 5 Tools For Developers - Revoyant, accessed May 26, 2025, https://www.revoyant.com/blog/best-ai-code-assistants-2025-top-tools
- Introducing Codex - OpenAI, accessed May 26, 2025, https://openai.com/index/introducing-codex/
- How to Use Replit AI for free - Apidog, accessed May 26, 2025, https://apidog.com/blog/how-to-use-replit-ai-for-free/
- Replit Ghostwriter vs. Copilot: Which is Better? - CodeStringers, accessed May 26, 2025, https://www.codestringers.com/insights/replit-ghostwriter-vs-copilot/
- 11 Best AI Coding Tools—Guide for Developers (2025 updated) - Golden Owl, accessed May 26, 2025, https://goldenowl.asia/blog/best-ai-for-coding
- Replit — Announcing Ghostwriter Chat: The first conversational AI ..., accessed May 26, 2025, https://blog.replit.com/gw-chat-launch
- Agent mode 101: All about GitHub Copilot's powerful mode - The ..., accessed May 26, 2025, https://github.blog/ai-and-ml/github-copilot/agent-mode-101-all-about-github-copilots-powerful-mode/
- Complete Guide to GitHub Copilot Agent Mode: Transforming Development Paradigms, accessed May 26, 2025, https://blog.wadan.co.jp/en/tech/github-copilot-agent-mode
- Cursor AI Review: Best AI Code Editor in 2025?, accessed May 26, 2025, https://www.fahimai.com/cursor-ai
- Cursor AI: An In-Depth Review (May 2025 Update) | Engine, accessed May 26, 2025, https://www.enginelabs.ai/blog/cursor-ai-an-in-depth-review-may-2025-update
- Guide to Cursor | Software.com, accessed May 26, 2025, https://www.software.com/ai-index/tools/cursor
- docs.cursor.com, accessed May 26, 2025, https://docs.cursor.com/context/codebase-indexing#:~:text=For%20better%20and%20more%20accurate,initialize%20indexing%20for%20that%20workspace.
- Codebase Indexing - Cursor, accessed May 26, 2025, https://docs.cursor.com/context/codebase-indexing
- Top 5 AI IDEs for Coding with Windsurf in 2025 By Girish Kot - Peerlist, accessed May 26, 2025, https://peerlist.io/gkotte/articles/top-5-ai-ides-for-coding-with-windsurf-in-2025
- How I use Cursor (+ my best tips) - Builder.io, accessed May 26, 2025, https://www.builder.io/blog/cursor-tips
- Cursor AI Unleashed: The Agent-Powered Code Editor Transforming Developer Workflow, accessed May 26, 2025, https://savemyleads.com/blog/useful/cursor-ai
- Windsurf (formerly Codeium) - The most powerful AI Code Editor, accessed May 26, 2025, https://windsurf.com/
- Windsurf Editor by Codeium for Revit: AI Automation in Revit - ArchiLabs, accessed May 26, 2025, https://archilabs.ai/posts/windsurf-for-revit
- Cascade - Windsurf Docs, accessed May 26, 2025, https://docs.windsurf.com/windsurf/cascade/cascade
- Windsurf Reviews, Ratings & Features 2025 | Gartner Peer Insights, accessed May 26, 2025, https://www.gartner.com/reviews/market/ai-code-assistants/vendor/windsurf-exafunction/product/windsurf-exafunction
- Windsurf Editor Changelogs - Codeium, accessed May 26, 2025, https://codeium.com/changelog
- Windsurf Reviews 2025: Details, Pricing, & Features | G2, accessed May 26, 2025, https://www.g2.com/products/codeium/reviews
- Introducing Claude 4 - Anthropic, accessed May 26, 2025, https://www.anthropic.com/news/claude-4
- Claude Code: Best practices for agentic coding - Anthropic, accessed May 26, 2025, https://www.anthropic.com/engineering/claude-code-best-practices
- Introducing Claude 4 \ Anthropic, accessed May 26, 2025, https://anthropic.com/news/claude-4
- Claude Anthropic Reviews & Ratings | Pros & Cons - Software Finder, accessed May 26, 2025, https://softwarefinder.com/artificial-intelligence/claude-anthropic/reviews
- 15 Pros & Cons of Claude [2025] - DigitalDefynd, accessed May 26, 2025, https://digitaldefynd.com/IQ/pros-cons-of-claude/
- ChatGPT — Release Notes - OpenAI Help Center, accessed May 26, 2025, https://help.openai.com/en/articles/6825453-chatgpt-release-notes
- How to Use ChatGPT's New Code Interpreter - DevDynamics, accessed May 26, 2025, https://devdynamics.ai/blog/how-to-use-chatgpts-new-code-interpreter/
- Why ChatGPT's Code Interpreter is the Future of Programming - BotPenguin, accessed May 26, 2025, https://botpenguin.com/blogs/why-chatgpt-code-interpreter-is-future-of-programming
- 10 of The Best ChatGPT Plugins to Get The Most From AI in 2024 | DataCamp, accessed May 26, 2025, https://www.datacamp.com/blog/best-chat-gpt-plugins
- The Top 9 ChatGPT Plugins for Developers in 2025 - ISHIR, accessed May 26, 2025, https://www.ishir.com/blog/119322/the-top-9-chatgpt-plugins-for-developers-in-2024.htm
- GitLab ChatGPT (OpenAI) Integration - Quick Connect - Zapier, accessed May 26, 2025, https://zapier.com/apps/gitlab/integrations/chatgpt
- HighwayofLife/awesome-chatgpt-plugins - GitHub, accessed May 26, 2025, https://github.com/HighwayofLife/awesome-chatgpt-plugins
- Explore the Top 10 Best Chatgpt Plugins of 2025 - Murf AI, accessed May 26, 2025, https://murf.ai/blog/best-chatgpt-plugins
- ChatGPT AI Tools Review 2025: Pros, Cons, and Pricing - Sonary, accessed May 26, 2025, https://sonary.com/b/open-ai/chatgpt-writing-tool+ai-tools/
- Aider Documentation, accessed May 26, 2025, https://aider.chat/docs/
- Best AI Coding Assistants as of May 2025 - Shakudo, accessed May 26, 2025, https://www.shakudo.io/blog/best-ai-coding-assistants
- Amazon Q Developer announces a new CLI agent within the command line - AWS, accessed May 26, 2025, https://aws.amazon.com/about-aws/whats-new/2025/03/amazon-q-developer-cli-agent-command-line/
- A lightning fast, new agentic coding experience within the Amazon Q Developer CLI - AWS, accessed May 26, 2025, https://aws.amazon.com/blogs/devops/introducing-the-enhanced-command-line-interface-in-amazon-q-developer/
- Amazon Q Developer Reviews 2025: Details, Pricing, & Features | G2, accessed May 26, 2025, https://www.g2.com/products/amazon-q-developer/reviews
- Best AI Coding Assistant 2025: Complete Guide to Cline and Cursor ..., accessed May 26, 2025, https://cline.bot/blog/best-ai-coding-assistant-2025-complete-guide-to-cline-and-cursor
- Qodo AI Reviews: Use Cases, Pricing & Alternatives - Futurepedia, accessed May 26, 2025, https://www.futurepedia.io/tool/qodo
- Qodo Reviews 2025: Details, Pricing, & Features | G2, accessed May 26, 2025, https://www.g2.com/products/qodo/reviews
- Google I/O 2025: Google's answer to Microsoft and OpenAI's AI coding agents, Jules is now available for everyone to try - The Times of India, accessed May 26, 2025, https://timesofindia.indiatimes.com/technology/tech-news/google-i/o-2025-googles-answer-to-microsoft-and-openais-ai-coding-agents-jules-is-now-available-for-everyone-to-try/articleshow/121298997.cms
- Google Counters GitHub & Microsoft with Jules Agent & Enhanced Gemini AI, accessed May 26, 2025, https://visualstudiomagazine.com/articles/2025/05/20/google-counters-github-microsoft-with-jules-agent-enhanced-gemini-ai.aspx
- What Google I/O 2025's AI Announcements Mean for Developers - RisingStack Engineering, accessed May 26, 2025, https://blog.risingstack.com/google-io-2025-recap/
- Devin AI vs Cursor: Speed & Accuracy Test Results - Trickle AI, accessed May 26, 2025, https://www.trickle.so/blog/devin-ai-or-cursor
- Generate Complete Apps with Ease: GPT Engineer Review and Setup - Toolify.ai, accessed May 26, 2025, https://www.toolify.ai/ai-news/generate-complete-apps-with-ease-gpt-engineer-review-and-setup-751744
- Build Any App with One Prompt Using GPT-Engineer - YouTube, accessed May 26, 2025, https://m.youtube.com/watch?v=-W73zQr0rl4&pp=ygUII21nY29kZXI%3D
- What I learned in 6 months of working on a CodeGen dev tool GPT Pilot, accessed May 26, 2025, https://dev.to/zvone187/what-i-learned-in-6-months-of-working-on-a-codegen-dev-tool-gpt-pilot-4fij
- AutoGPT - AI Agent Reviews, Features, Use Cases & Alternatives (2025), accessed May 26, 2025, https://aiagentsdirectory.com/agent/autogpt
- Autogpt Examples: Expert Tips for Success - Codoid, accessed May 26, 2025, https://codoid.com/ai/autogpt-examples-expert-tips-for-success/
- What is AutoGPT and How to Use It? - GeeksforGeeks, accessed May 26, 2025, https://www.geeksforgeeks.org/what-is-autogpt-and-how-to-use-it/
- AutoGPT Example Guide: With Hands-On Applications - PageTraffic, accessed May 26, 2025, https://www.pagetraffic.com/blog/autogpt-example/
- Multi-Agent Collaboration Mechanisms: A Survey of LLMs - arXiv, accessed May 26, 2025, https://arxiv.org/html/2501.06322v1
- Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications - arXiv, accessed May 26, 2025, https://arxiv.org/html/2412.05449v1
- Agent-Oriented Programming Examples: Real-World Applications and Use Cases, accessed May 26, 2025, https://smythos.com/ai-agents/ai-agent-development/agent-oriented-programming-examples/
- 20 AI Agent Examples in 2025 - AutoGPT, accessed May 26, 2025, https://autogpt.net/20-ai-agents-examples/
- 2025 AI predictions: The future of application development - OutSystems, accessed May 26, 2025, https://www.outsystems.com/blog/posts/ai-predictions/
- AI Agents in 2025: Expectations vs. Reality | IBM, accessed May 26, 2025, https://www.ibm.com/think/insights/ai-agents-2025-expectations-vs-reality
- Revolutionary AI Coding Tool: Cody - Made for Efficient Programming, accessed May 26, 2025, https://www.toolify.ai/ai-news/revolutionary-ai-coding-tool-cody-made-for-efficient-programming-395707
- The 9 best AI coding tools for developers of all levels - Lummi, accessed May 26, 2025, https://www.lummi.ai/blog/best-ai-coding-tools
- AI for Testing: Context & Eval with Sourcegraph - AI Native Dev, accessed May 26, 2025, https://ainativedev.io/podcast/ai-testing-context-eval-sourcegraph
- Agentic code generation: The future of software development - AI Accelerator Institute, accessed May 26, 2025, https://www.aiacceleratorinstitute.com/agentic-code-generation-the-future-of-software-development/
Report on Agent-Based Frameworks for Software Engineering¶
Overview
Per previous sections, GenAI allows a shift from Code-Centric to Contract-Centric Development where the upstream artifacts guide the code generation (per standard Software Engineering).
This section is a Research Report on Agent-Based Frameworks for Software Engineering 1.0 Redux (aka Agent-Based Frameworks Supporting Upstream Software Engineering Artifacts)
- These are Level 4 – Highly Autonomous Coding: "The AI can handle complex projects end-to-end, needing only minimal human input"
Some of these build on existing solutions, e.g. SuperClaude and Claude Flow build on Claude Code, and BMAD can build on Claude Code or other tools.
The report is from ChatGPT 4o DeepResearch with the following brief:
I’ll get started on a report that surveys frameworks using agents for software engineering, particularly those that support upstream artifacts like requirements, design models, or architecture. I’ll include both open-source and commercial tools, and provide a high-level overview along with a comparison table highlighting key features and capabilities.
Note that, technically, it is possible to use Project Rules for modular persona design e.g. .windsurfrules or .cursor/rules, but the focus here is on existing solutions that embody personas and associated artifacts to build a solution.
Software engineering is increasingly leveraging AI agent frameworks to assist not only with coding but also with upstream development artifacts like requirements documents, design models, and architectural specifications. Below, we survey several notable frameworks – both open-source and commercial – that employ autonomous or semi-autonomous agents across the software lifecycle. We describe each framework’s high-level approach, the upstream artifacts it supports, its agent-based characteristics, integration with development tools, and current maturity. A comparison table at the end summarizes their features side-by-side.
Comparison of Agent-Based SE Frameworks¶
| Framework | Upstream Artifact Support | Agent-Based Approach | Integration | Maturity/Status |
|---|---|---|---|---|
| BMAD-Method | PRDs, architecture docs, and user stories. | Specialized agents (Analyst, PM, Architect, Scrum Master, Dev, QA); checklist-driven. | CLI + Cursor IDE integration. NPM package. | Open-source; v4 active; moderate maturity (~2k GitHub stars). |
| SuperClaude | Architectural analysis, design ideas, implementation estimation. | One Claude instance with 9 cognitive personas (Architect, Security, QA, etc.). | Claude Code config layer; CLI/chat commands; Git memory; VS Code support planned. | Open-source; mature in niche (~5k stars); stable and growing. |
| ChatDev | User stories, architecture designs, test plans, project docs. | Simulated software team (CEO, CTO, CPO, Dev, Tester, Reviewer); sequential flow. | Python CLI; generates project folders; minimal IDE tooling. | Research-grade; very active (~27k stars); experimental features. |
| MetaGPT | User stories, requirements, data models, API specs, design diagrams. | Multi-agent system with SOPs for each role (PM, Architect, Engineer, QA). | Python CLI; flexible LLM backend; outputs markdown & Mermaid diagrams. | Very mature; open-source; extremely popular (~57k stars). |
| Claude Flow | Requirements analysis, architectural design, test plans, pseudocode. | Parallel multi-agent workflow (SPARC phases); Memory Bank; agent coordination tools. | Claude Code shell script; CLI; Claude tools (BatchTool, WebFetch); Git auto-commits. | New but structured; open-source; early adoption; rapidly evolving. |
| Tabnine AI Agents | Jira issue descriptions (as requirements), plus test coverage and validation reports. | Specialized agents for Implementation, Validation, Testing, and Code Review. Works in human-in-the-loop cycles. | Deep IDE (VS Code, JetBrains) + Jira + CI integration. Supports enterprise/on-prem. | Commercial; generally available since 2024; mature for enterprise use. |
| IBM SWE Agents | Bug reports, change requests (as requirements). Test plans derived from fix/feature context. | Research-stage agents: Bug Fixer, Code Editor, Test Generator. Coordinated via orchestration layer. | GitHub-triggered (issue tags); future integration into watsonx / IBM DevOps expected. | Research prototype (2024); strong benchmarks; early enterprise trials. |
BMAD-METHOD (Open-Source)¶
Overview: The Breakthrough Method for Agile AI-Driven Development (BMAD) is an open-source framework that orchestrates multiple specialized AI agents to mimic an Agile software team. BMAD introduces custom agent personas for common software roles and guides a project through phases from initial idea to deployment. It was designed to solve problems in AI-assisted dev workflows (like context loss and planning drift) by front-loading planning with dedicated agents.
Upstream Artifact Support: BMAD explicitly supports creation of upstream artifacts such as product requirements documents (PRDs) and architecture design documents. In the “planning” phase, specialized agents (Analyst, Product Manager, Architect, etc.) collaborate with the user to generate detailed and consistent specifications, including comprehensive PRDs and technical architecture documentation. These planning outputs are much richer than generic AI-generated tasks, providing a solid blueprint for development. BMAD can also produce a high-level brief (project summary) and even optional UX design outlines before coding begins.
Agent Architecture and Coordination: BMAD’s agent roster mirrors an Agile team. In the planning phase, agents like Business Analyst/Analyst, Product Manager, and Architect work together (via prompt scripts) to produce the requirement and design artifacts. Once planning is done, the framework shifts to the IDE for implementation: a Scrum Master agent breaks the plan into “story” files (detailed development tasks with embedded context), and a Dev agent then implements code based on these stories. A QA agent may verify the outputs. A central Orchestrator agent coordinates the hand-offs (ensuring each story contains all necessary context for the Dev agent) and runs checklists to audit consistency across artifacts. Notably, each agent operates in a turn-based sequence with human-in-the-loop checkpoints – for example, agents pause for user feedback at certain steps. This structured multi-agent pipeline enforces consistency from requirements through design to code.
Integration and Tooling: BMAD is implemented as a Node.js package with configuration files and prompt templates. It can be run in chat UIs (e.g. ChatGPT or Gemini) for the planning stage, then integrates with the Cursor IDE (an AI-enabled code editor) for coding. The planning artifacts (PRD, architecture doc, etc.) are saved as project files which the Scrum Master agent later “shards” into story files in the IDE. BMAD is designed to be flexible – users can customize agent prompts or add “expansion packs” for domains beyond software (e.g. game development or DevOps).
Maturity: BMAD is a vibrant open-source project (MIT-licensed) with an active community. It has gone through multiple iterations (V2, V3, now V4 as of mid-2025) and garnered significant interest (over 2k GitHub stars). Users report that BMAD’s approach can save substantial LLM usage costs by shifting work to the planning phase. Overall, BMAD provides a comprehensive AI-driven agile methodology – it is relatively new but evolving rapidly, indicating a growing maturity and real-world experimentation.
SuperClaude (Open-Source)¶
Overview: SuperClaude is an open-source configuration framework that augments Anthropic’s Claude (a large language model) with a team of specialized agent personas and command protocols. Rather than interacting with a single generalist AI, SuperClaude lets developers invoke different “cognitive archetypes” (agents) tailored to specific software engineering tasks. The framework is defined by a set of rules (in RULES.md) and a Model Context Protocol (MCP) that gives the AI long-term memory, tool usage abilities, and an efficient communication style. SuperClaude’s goal is to turn a generic coding assistant into a context-aware, role-specialized development partner.
Upstream Artifact Support: SuperClaude primarily targets the design and implementation stages, but it does facilitate upstream design thinking and planning. For example, it includes a persona called “The Architect” (/persona:architect) whose sole focus is high-level system design and architecture. When activated, this agent will ask questions about scalability and maintainability and apply proven design patterns, effectively helping produce or critique an architectural model of the system. Using the /design command, SuperClaude can generate software designs – for instance, it can output proposed data models or API specifications following a Domain-Driven Design approach. It does not explicitly generate a formal requirements document with its own persona, but it leverages the developer’s input (e.g. user stories or feature descriptions) as the requirement and can estimate effort (/user:estimate) based on the design. In essence, SuperClaude’s upstream support lies in architecture and planning: it helps design the solution and plan implementation steps before writing code, though the initial requirements need to be provided by the user.
Agent-Based Characteristics: SuperClaude features nine specialized AI personas that the user can switch between, each configured with different priorities and toolsets. These include roles like Architect, Frontend Developer, Backend Developer, Security Expert, QA Tester, Analyzer (debugger), and even a Mentor for explanations. Internally, all personas are powered by the same LLM (Claude), but SuperClaude uses modular configuration files and “flag” commands to shape the LLM’s behavior per role. The framework enforces collaboration and parallelism: for example, you can spawn a dedicated frontend and backend agent to work simultaneously on different components of a feature. Agents communicate implicitly via the shared context and the MCP tool outputs. SuperClaude emphasizes autonomy with oversight – agents will perform tasks (like coding or testing) on their own, but important rules (e.g. never output insecure code) are baked into the persona profiles. A unique aspect is the Evidence-Based rule: agents must cite documentation for their decisions, reducing hallucinations. Coordination is handled by the MCP’s decision matrix which decides which tool or persona to invoke based on triggers (user flags, natural language cues, or code context).
Integration: SuperClaude is used alongside Claude’s coding interface (Claude Code) or via a CLI. It installs as a .claude/ configuration on your system. Developers interact with it through chat commands (for example, in a terminal or chat UI that connects to Claude). The framework provides 18 special “/user:” commands for tasks like /user:design, /user:build, /user:analyze, /user:test, etc., each optionally combined with persona or tool flags. It currently relies on Anthropic’s Claude models (Claude 2, Claude Instant, etc.), and a VS Code extension is on the roadmap. Integration with version control is notable: SuperClaude can checkpoint and roll back conversation/code state with Git commands, effectively maintaining memory across sessions. Overall, it acts as an overlay on top of your IDE or CLI, providing an agentic layer that controls Claude for multi-step, multi-role tasks.
Maturity: SuperClaude is an MIT-licensed project launched in mid-2023 and has gained traction (5k+ GitHub stars). It’s actively maintained (v2.0.1 as of 2025) with a growing community of contributors. Many developers have praised its practical impact – e.g. using the Architect persona to generate scalable designs, and the Git-based memory to manage iterative changes. Because it builds on a robust LLM (Claude) and adds pragmatic features (like token compression for long contexts), SuperClaude is considered a stable and useful addition to a developer’s toolkit, albeit currently tied to Claude’s availability.
ChatDev (Open-Source Research)¶
Overview: ChatDev is an open-source research framework that demonstrates how multiple LLM-based agents can collaborate to autonomously develop a software application. Introduced by researchers in late 2023, ChatDev simulates an entire software startup (a “virtual software company”) with agents fulfilling different organizational roles. These agents communicate in natural language, following a structured workflow resembling the waterfall model: phases of designing, coding, testing, and documenting are carried out sequentially by the respective specialists. The core idea is to study collective intelligence – how a team of AI agents can outperform a single agent by divide-and-conquer and collaboration.
Upstream Artifact Support: ChatDev explicitly includes the design phase at the front of its lifecycle. It starts with a software concept or requirement provided by the user (e.g. a one-line idea for an app). The agents then elaborate this into upstream artifacts. For example, the Chief Product Officer (CPO) agent defines product requirements, user stories, and possibly a brief specification of features. The Chief Technology Officer (CTO) agent takes those requirements and produces a system design or architecture, determining the technical approach. Indeed, ChatDev is reported to output documents like project requirements, UI design drafts, and even competitive analyses of the idea. It can also generate technical design artifacts such as data model specifications or API designs, often represented as Mermaid diagrams for architecture. After the design is settled, other agents proceed to coding, but importantly ChatDev creates a persistent set of upstream documents (requirements and design docs) that guide the development. The presence of a Reviewer agent also implies it produces documentation or review reports (ensuring the final product aligns with initial requirements), and a Tester agent generates test plans or reports, which are downstream QA artifacts. In summary, ChatDev’s design and planning outputs are a central feature – it delivers comprehensive documentation from requirements to design models, before any code is written.
Agent Roles and Collaboration: The framework organizes agents in a multi-agent collaboration network with distinct roles akin to job titles in a company. Key roles include: CEO (sets the high-level goal or approves plans), CPO (focus on user needs and requirements), CTO (technical design decisions), Programmer (writes code), Tester (tests the code), Reviewer (performs code review and documentation), and even an Art Designer (which could create UI/UX assets or design elements). These agents communicate in a series of “functional seminars” – essentially structured chat rounds dedicated to specific tasks (brainstorming features, designing system, coding modules, etc.). The collaboration is typically organized as a chain: the output of the planning seminar (requirements from CPO) feeds into the design seminar (led by CTO), which then feeds coding (Programmer), then testing, and so forth. ChatDev’s implementation uses a chain-of-messages orchestration (sometimes called ChatChain) where each agent gets to contribute and sees prior agents’ messages. Notably, the researchers implemented standard operating procedures (SOPs) for each role to ensure coherence (e.g. the CPO agent follows a template to produce requirement lists, the CTO agent follows a template to output design specs). This structured multi-agent dialogue allowed ChatDev to significantly outperform single-agent approaches (like GPT-Engineer) on quality metrics of the final software. Each agent is essentially an instance of an LLM (such as GPT-4) prompted to act in-role, and a controller script manages the turn-taking and information flow. There is no heavy tool use by agents in the published version (they primarily communicate and generate code), but the design ensures that each phase’s output becomes an artifact passed down the pipeline.
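To make the chain-of-messages idea concrete, here is a minimal, self-contained sketch (not ChatDev's actual code) of a sequential role pipeline in which each agent's output is persisted as an artifact and becomes the next agent's input. The `call_llm` function is a placeholder to swap for a real model call.

```python
# Minimal sketch of a ChatChain-style role pipeline (illustrative only, not ChatDev code).
# call_llm is a placeholder: swap in a real LLM client (e.g. an OpenAI or Anthropic SDK call).

def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Placeholder LLM call; returns a canned string so the sketch runs offline."""
    return f"[{system_prompt.splitlines()[0]}] response to: {user_prompt[:60]}..."

ROLES = [
    ("CPO", "You are the Chief Product Officer. Turn the idea into requirements and user stories."),
    ("CTO", "You are the Chief Technology Officer. Produce a system design for the requirements."),
    ("Programmer", "You are the Programmer. Write code that implements the design."),
    ("Tester", "You are the Tester. Write a test plan and report for the code."),
]

def run_chat_chain(idea: str) -> dict[str, str]:
    artifacts: dict[str, str] = {}
    context = idea
    for role, system_prompt in ROLES:
        output = call_llm(system_prompt, context)
        artifacts[role] = output   # each phase's output is persisted as an artifact
        context = output           # ...and becomes the next agent's input
    return artifacts

if __name__ == "__main__":
    for role, artifact in run_chat_chain("A one-line idea: a todo app with reminders").items():
        print(f"--- {role} ---\n{artifact}\n")
```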
Integration and Usage: ChatDev can be run as a Python program using open LLM APIs. It’s provided as a framework for research/experimentation, and the authors also offered a web demo (SaaS) for a period. Integration with developer tooling is minimal in the basic ChatDev – it’s more of an automated planning and coding engine that produces a project (code + docs) in a workspace folder. The outputs include code files, diagrams (as Mermaid markdown), and documentation files that one can open in an IDE. Because ChatDev agents can execute code (especially the Tester agent running the program), a Docker-based sandbox is used for safety.
Maturity: As a research prototype, ChatDev is at the proof-of-concept stage (initial version released late 2023). It has received attention for demonstrating the feasibility of multi-agent automated development. IBM noted that ChatDev, along with similar framework MetaGPT, achieved better completeness and consistency in generated software than single-agent methods. Since release, ChatDev’s team has been extending it with more advanced collaboration mechanisms (e.g. a graph-based agent network called MacNet for scalability). The project is active on GitHub under an Apache-2.0 license, with ongoing improvements in efficiency and support for more complex scenarios. While not production-ready for industry use, ChatDev provides a template for agentic SDLC automation and a benchmark for future frameworks.
MetaGPT (Open-Source)¶
Overview: MetaGPT is another open-source multi-agent framework that gained wide popularity as an “AI software company in a box.” It assigns different GPT-based agents to classic software team roles and coordinates their efforts to build software from a high-level specification. Released in mid-2023 (and quickly amassing thousands of users), MetaGPT proved that given a short prompt describing a desired application, a suite of agents can produce not only code but also a range of planning artifacts and documentation. It emphasizes carefully orchestrated Standard Operating Procedures (SOPs) for each role to ensure the agents collaborate effectively.
Upstream Artifact Support: MetaGPT excels at generating upstream artifacts. Starting from a one-line requirement input, MetaGPT’s agents will output: user stories, competitive analysis, detailed requirements, data structures and API designs, and other design documents. The framework explicitly lists that it produces “user stories / competitive analysis / requirements / data structures / APIs / documents, etc.” as part of its deliverables. For example, given a prompt to create a simple e-commerce app, MetaGPT’s Product Manager agent might generate user personas and user stories; the Architect agent would design the database schema and component diagram; a Market Analyst agent (if included) might provide a brief competitive feature comparison. These artifacts are saved into a structured workspace (often as Markdown files or images for diagrams). Notably, MetaGPT uses Mermaid.js to create UML-like diagrams for architecture and workflows – this means it actually visualizes the design model (class diagrams, flow charts) as part of its output. This comprehensive upstream support is a standout feature: MetaGPT doesn’t jump straight to code, but first fleshes out specifications and designs to guide the coding stage. This results in a more organized project that a human can review (and modify) before implementation.
Agent Roles and Mechanism: In MetaGPT, multiple GPT-4 (or similar) instances are each assigned a role such as Product Manager, Project Manager, Architect, Software Engineer, QA, etc.. At minimum, the published version included Product Manager, Architect, Project Manager, and Engineer roles. These agents follow a fixed coordination pattern: typically the Product Manager analyzes the raw requirement and expands it, the Architect plans the system design, the Project Manager organizes tasks, and the Engineer writes the code. Communication between agents is orchestrated so that each agent’s output becomes input for the next. The project uses a prompt templating approach to enforce that, for example, the Architect agent’s prompt includes the Product Manager’s output (user stories, requirements) and then instructs, “As the software architect, design the system based on these.” Each role has a predefined prompt template (SOP) to maintain structure. For instance, the Architect’s SOP might ensure they produce a section on data schema, a section on module design, etc. MetaGPT’s controller then runs these agents in sequence (with possible iterations if something is incomplete). This clear division of labor allowed MetaGPT to demonstrate high completeness and consistency in generated projects. It also can recruit additional agents dynamically if needed – for example, if a specific specialized task comes up, it could spawn a new agent with an appropriate skill (the framework was designed to be extensible). However, by default, the core team handles most tasks. The outcome is that each agent’s “deliverable” is saved: the Product Manager’s user_stories.md, the Architect’s design.md (with diagrams), the Engineer’s code files, etc., giving a multi-perspective result.
Integration: MetaGPT is primarily a CLI/command-line tool. Developers install it (Python and Node dependencies) and run it by providing a prompt. It then generates a new project directory with all the files. Because it’s not tied to an IDE, one would typically open the resulting files in their preferred IDE for review. There isn’t an interactive loop with human feedback in the default flow – it attempts autonomy from requirement to final product. That said, users can intervene between phases if using it stepwise. MetaGPT can leverage external tools or APIs if configured (e.g. searching for information if needed), but most of its knowledge comes from the base LLM. The project’s documentation suggests using Docker for sandboxing if the agents need to execute code/tests. Comparison with ChatDev: Both are similar in concept; indeed, an evaluation by ChatDev’s authors found MetaGPT performing well but noted ChatDev’s communication mechanism yielded higher quality in some aspects. Regardless, MetaGPT integration is straightforward – it acts as an automated software project generator one can run on a local machine.
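As a usage sketch, MetaGPT is typically driven from the command line with a one-line requirement. The snippet below wraps that invocation from Python; the `metagpt` console command and its argument form are assumptions to verify against the README for the version you install.

```python
# Sketch: driving MetaGPT from Python by shelling out to its CLI.
# Assumption: a `metagpt` console command is on PATH and accepts a one-line
# requirement as its argument (check the README for your installed version).
import subprocess

def run_metagpt(requirement: str) -> None:
    # MetaGPT writes its outputs (user stories, design docs, diagrams, code)
    # into a workspace directory under the current working directory.
    subprocess.run(["metagpt", requirement], check=True)

if __name__ == "__main__":
    run_metagpt("Create a simple command-line 2048 game")
```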
Maturity: MetaGPT is fairly mature as an open project – it went viral (over 17k GitHub stars by late 2023) and has an active community of users and contributors. It inspired many derivative projects. As of 2024, it remains under active development (now part of the FoundationAgents organization). MetaGPT is mostly a demonstration; while it can produce a working app for simple requirements, real-world applications likely need iteration and refinement by human developers. However, its strength in producing structured documentation and multi-agent reasoning is a valuable contribution. The framework is often cited as revolutionary in showcasing multi-agent collaboration in SE. It’s open-source (MIT license) and is frequently updated with improvements in prompt strategies and support for new LLMs. Companies and researchers sometimes use MetaGPT as a baseline for agent-based software development experiments.
Tabnine AI Agents for SDLC (Commercial)¶
Overview: Tabnine, known for its AI code completion, has introduced a suite of AI agents to support various stages of the software development lifecycle (SDLC) in a commercial offering. In late 2024, Tabnine announced AI agents for Atlassian Jira – specifically an “Implementation Agent” and a “Validation Agent” – that work from issue descriptions to code and test changes. This is part of Tabnine’s vision of “AI-driven development” where AI agents and human engineers collaborate on tasks, with AI managing many aspects of the development workflow. In addition to the Jira-focused agents, Tabnine also has an AI Test Agent (for generating and running tests) and an AI Code Review Agent (for analyzing pull requests), extending agent support to QA and code quality steps.
Upstream Artifact Support: Tabnine’s approach uniquely ties into requirements as captured in issue trackers. The Jira Implementation Agent takes the natural-language requirements in a Jira issue (e.g. user story or bug description) and automatically generates the code to implement that issue. In doing so, it effectively reads an upstream artifact – the issue text, which often contains acceptance criteria or a feature description – and translates it into code changes. While it doesn’t produce a separate requirements specification (the requirement is the Jira ticket itself), it ensures that the requirement is addressed. The Jira Validation Agent then acts on the requirement as well: it checks that the generated code indeed fulfills the issue’s requirements. This agent will verify functional behavior and suggest fixes if the code diverges from the specified criteria. By doing this validation, Tabnine is effectively formalizing the link between requirements and implementation. Moreover, because Tabnine’s agents can propose additional tests, one can view those tests as artifacts derived from requirements (testing each acceptance criterion). Tabnine’s platform doesn’t output design diagrams or full architecture documents; instead, it integrates into the agile process by going from user story to code and tests. For many teams, the Jira ticket itself is the upstream artifact, and Tabnine’s agents ensure it’s fully handled. In summary, Tabnine supports upstream artifacts in the sense of consuming and honoring them (Jira stories, bug reports) and automatically producing downstream results (code, validation) aligned to them.
Agent Architecture: Tabnine’s SDLC agents operate within the development pipeline as specialized AI services. The Implementation Agent is triggered by a user action (one-click from a Jira issue) and it performs multi-step planning internally: analyzing the issue, retrieving relevant repository context, and then generating code changes. The Validation Agent similarly analyzes code diffs and the issue text to ensure alignment. These can be seen as distinct agents with different objectives that communicate via shared context (the code changes and issue description). Tabnine’s system likely uses large language models (possibly GPT-4 or proprietary models) under the hood for each agent. Coordination occurs in a closed-loop: after implementation, validation runs, and the developer is then presented with the combined outcome (new code plus a validation report). Tabnine also allows the developer to iteratively refine the output by asking the agent to adjust if something is off, effectively supporting a human-in-the-loop feedback cycle. Beyond Jira, the AI Test Agent can generate unit or integration tests automatically when pointed at code changes (using the code and possibly documentation as input). The Code Review Agent will examine a pull request’s diff and leave comments or suggestions, behaving like a static analysis + reviewer. These are separate agents focusing on artifacts like pull request descriptions, commit diffs, and test coverage reports, which are intermediate artifacts in the dev process. All Tabnine agents are orchestrated to keep the developer in control – for instance, they do not auto-merge code; a human reviews suggestions from the Jira agents and decides to accept changes or not.
Integration: Tabnine’s agents integrate deeply with popular developer tools. The Jira agents integrate Atlassian Jira with the development environment: the developer can trigger the Implementation Agent from the Jira UI or IDE, and it will interface with the codebase (through the IDE or repository). The results (code changes) are delivered either as a Git branch/commit or as a patch the developer applies. The Validation Agent can run as part of CI or within the IDE to confirm the issue is resolved. Tabnine’s IDE plugin (for VS Code, IntelliJ, etc.) is the primary user interface, now augmented with chat-based interactions for these agents. So a developer might open a Tabnine chat panel in VS Code, ask the Jira agent to implement a specific issue, and watch as the code appears. Tabnine emphasizes enterprise readiness: code stays private (the models can be self-hosted/on-premises if needed), addressing confidentiality. This makes it appealing for organizations that use Jira workflows. Essentially, Tabnine is bringing agentic automation into existing DevOps – linking issue trackers to IDE to version control in a seamless flow.
Maturity: Tabnine’s agent features are relatively new (general availability announced in Sep 2024), but Tabnine itself is an established company in AI coding tools. The Jira Integration is likely in active use by early-adopter teams. As a commercial product, it is polished for professional environments (with considerations for code security, team settings, etc.). The concept of “issue-to-code” that Tabnine pioneered is a significant step forward, and no major competitor offered that at the time, which suggests a level of innovation and maturity in their implementation. It’s still evolving; future updates may expand the range of upstream artifacts (for instance, parsing design documents or architecture diagrams linked in tickets). In the current state, Tabnine’s AI agents provide a production-ready assistive system that offloads routine coding from human devs, while ensuring requirement coverage. This is a commercially supported solution, making it one of the more practical and immediately usable frameworks in this list for industry teams.
IBM’s Software Engineering Agent Suite (Research/Prototype)¶
Overview: IBM Research has been actively developing an AI agent suite for automating software engineering tasks. In late 2024, IBM announced an experimental set of Software Engineering (SWE) AI agents (v1.0) aimed at reducing developers’ workload, particularly by addressing bug backlogs. These agents leverage multiple large language models and operate together to perform tasks such as bug discovery, code editing, and test generation. IBM’s work is positioned within its larger watsonx initiative – using IBM’s Granite series foundation models and an orchestration framework to coordinate the agents. While still in testing, this effort is a key example of a commercial research-led framework, likely to be integrated into enterprise tooling in the near future.
Upstream Artifact Support: The IBM SWE agents currently focus more on code maintenance and quality tasks than on early-phase artifacts. They do not appear to generate requirements or design models; instead, they take existing developer inputs (like bug reports or change requests) and act on them. For instance, a developer can tag a GitHub issue with a specific label (IBM’s “SWE”) to hand it to the agent – the agent will then analyze the described bug (an upstream artifact in the form of an issue report) and locate the problematic code and propose a fix. In this sense, the agent uses a natural language bug description (which is a kind of requirement for a fix) as input. Another agent IBM has built can create and execute tests, which means it takes a piece of code or a feature spec and produces test cases (test scripts are an artifact upstream of validation). These test-generation agents implicitly reason about the intended behavior (which overlaps with requirement understanding). We can surmise that as IBM’s framework evolves, it might incorporate more design-level reasoning (IBM has a history of AI tools for architecture, e.g. past projects with UML models, but those have not been explicitly mentioned in this LLM-based context). For now, IBM’s agents excel at reading and acting on software artifacts like bug reports, code diffs, and test results, but do not produce design documentation or formal requirement specs themselves.
Agent-Based Characteristics: IBM’s approach uses multiple specialized LLM agents coordinated for end-to-end tasks. For example, one agent focuses on bug localization and repair – it likely parses an issue description, scans the repository (perhaps using a code-search tool) to find suspect functions, then generates a patch. Another agent focuses on code edit requests – a developer can ask to refactor or modify code via a natural language instruction, and the agent will edit the lines accordingly using an IBM Granite model fine-tuned for code. There is also a test-generation agent that creates new tests and possibly runs them to ensure a bug fix didn’t break anything. IBM is building an orchestration framework to chain these agents into workflows. For instance, an orchestration might be: when a bug report comes in, trigger bug-fix agent; then trigger test agent to generate regression tests; then perhaps trigger a code quality agent to review the fix. The agents communicate via shared artifacts (a diff produced by one is input for another, etc.). Coordination, according to IBM’s chief scientist, is aimed at letting these agents handle tasks with minimal human intervention, reducing backlog and freeing developers for new feature work. IBM has reported metrics from benchmarks (SWE-Bench) indicating their agents can localize and fix issues quite efficiently (five minutes on average for a fix, with around 23.7% success on a standard test suite – among the top performers on that benchmark). This indicates a fairly sophisticated interplay of analysis and action by the agents.
Integration: As of the announcement, the IBM SWE agents were in a test phase with integration likely via GitHub (issue tagging) and potentially IDE plugins or CI tools in the future. IBM would presumably offer this through its watsonx.ai platform or Cloud services once ready, meaning it could integrate with enterprise Git repositories and project management systems. There’s mention that tagging a GitHub issue triggers the agent, which suggests a developer workflow integration that’s quite seamless – just mark an issue and let the AI handle it. For editing tasks, a developer might use a chat interface or command within an IDE (“IBM, implement this change…”). IBM’s orchestration framework is intended to ease creating multi-agent workflows, so integration might also allow custom pipelines – e.g. an organization could configure: static analysis agent -> security fix agent -> test agent, etc., as part of their DevOps pipeline. Because these agents use IBM’s own LLM (Granite) on watsonx, adoption would involve accessing IBM’s cloud or on-prem model deployments.
Maturity: IBM’s agent suite is still in research preview (as of late 2024). It’s not a generally available product yet, but IBM’s communications imply they are moving towards that. They’ve proven viability through benchmarks and are refining the system. IBM has a long history of transferring AI research to enterprise tools, so we can expect these multi-agent capabilities to appear in offerings like IBM ELM (Engineering Lifecycle Management) or cloud DevOps solutions. For now, it’s a cutting-edge commercial framework being tested in real-world scenarios. Its strength lies in addressing well-defined tasks (bugs, edits, tests). If we compare to other frameworks: IBM’s is less about full project generation and more about augmentation of the development pipeline. It’s also one of the few targeting bug fixing explicitly. In terms of upstream support, it is limited in this early stage, but as the framework grows, IBM might extend agents into design and requirement management (areas IBM has interest and tools in). Overall, IBM’s work underscores that major industry players see value in teams of AI agents working alongside human developers, and they are investing to integrate that into software engineering practice.
Claude Flow (formerly Claude-SPARC)¶
High-Level Overview¶
Claude Flow is an agentic orchestration system designed to run on Anthropic's Claude Code platform. It formalizes the SPARC methodology (Specification, Pseudocode, Architecture, Refinement, Completion) in a practical multi-agent setup, optimized for scalable software delivery. It incorporates:
- Modular execution phases
- A Memory Bank for state retention
- Agent coordination protocols
- Integrated tooling (e.g., BatchTool, WebFetchTool, Git commit hooks)
- Optional modes (backend-only, frontend-only, full-stack)
Upstream Artifact Support¶
Claude Flow explicitly supports:
- Requirements analysis (via project spec ingestion)
- Architecture modeling (e.g. component breakdowns)
- Design refinement (via pseudocode and system planning stages)

All of this is structured via the SPARC methodology, with output artifacts stored in a project workspace.
Agent-Based Architecture¶
- Parallel multi-agent execution with Claude instances working across different SPARC stages.
- An integration ledger and task assignment map coordinate responsibilities.
- A shared memory mechanism ensures context is preserved and reused.
- Emphasizes concurrency (e.g. backend and frontend workstreams) and agent learning.
Integration & Maturity¶
- Shell-based runner (claude-flow.sh) compatible with Claude Code.
- Auto-commits changes and tracks output through Git.
- Web research and code synthesis are integrated via Claude’s tools.
- Still early-stage but more structured and production-ready than the prior gist version.
- Open-source (MIT license) with growing interest and contributions.
References¶
- Claude Flow (formerly Claude-SPARC) https://github.com/ruvnet/claude-flow
- BMAD-METHOD GitHub Repository https://github.com/bmad-method/bmad
- BMAD Documentation: Agile AI Development Pipeline https://bmad-method.github.io/docs/pipeline
- SuperClaude GitHub Repository https://github.com/dsdanielpark/superclaude
- SuperClaude: RULES.md (Agent Rules and Protocols) https://github.com/dsdanielpark/superclaude/blob/main/RULES.md
- SuperClaude: Model Context Protocol https://github.com/dsdanielpark/superclaude/blob/main/MCP.md
- BMAD V4 Design Docs https://github.com/bmad-method/bmad/discussions/33
- SuperClaude Quick Start & Persona Command Reference https://github.com/dsdanielpark/superclaude#quick-start
- SuperClaude: Claude Code Tooling Integration https://github.com/dsdanielpark/superclaude/blob/main/claude-config.md
- BMAD YouTube Channel https://www.youtube.com/@bmad-method
- MetaGPT GitHub Repository https://github.com/geekan/MetaGPT
- MetaGPT Paper: “MetaGPT: Meta Programming for Multi-Agent Collaborative Coding” https://arxiv.org/abs/2308.00352
- ChatDev GitHub Repository https://github.com/openbmb/ChatDev
- ChatDev Web Demo https://chatdev.streamlit.app/
- ChatDev Paper: “ChatDev: Revolutionizing Software Development with AI Agents” https://arxiv.org/abs/2310.01894
- ChatDev: MacNet and Collaboration Framework Update https://github.com/OpenBMB/ChatDev/pull/158
- MetaGPT: MGX Spin-off Platform https://github.com/geekan/mgx
- MetaGPT: SOP Design Templates https://github.com/geekan/MetaGPT/tree/master/meta/sop
- Claude Flow: Memory Bank and Agent Coordination https://github.com/ruvnet/claude-flow/blob/main/docs/memory.md
- BMAD Expansion Packs (e.g., Game Dev, DevOps) https://github.com/bmad-method/bmad/discussions/27
- IBM SWE Agents – Official Announcement https://research.ibm.com/blog/swe-agent-suite-ai
- IBM SWE Agents on GitHub (Prototype) https://github.com/IBM/swe-agent
- IBM Granite Models Overview (Watsonx) https://www.ibm.com/blog/ibm-granite-open-source-models/
- Tabnine AI Agents for Jira – Product Page https://www.tabnine.com/blog/tabnine-jira-agents/
- Tabnine Test and Review Agents – Announcement Blog https://www.tabnine.com/blog/tabnine-code-review-agent-and-test-agent/
Software Engineering with GenAI¶
Overview
This section provides an overview of the context an agent (human or LLM) needs for Software Engineering.
Software Engineering¶
"Incremental Architecture-centric Validation & Verification Improves Qualification Confidence"
Continuous Verification & Validation of Critical Software via DevSecOps, https://apps.dtic.mil/sti/pdfs/AD1187382.pdf
How Anthropic uses Claude for Production Code¶
Informative talk from Anthropic on how they responsibly merged a 22,000-line change, heavily written by Claude, into their production reinforcement learning codebase by employing several key strategies:
- Embracing the Product Manager Role for Claude: This wasn't a single prompt but involved days of human work to define requirements and guide Claude
- The human team acted as product managers for Claude, providing the necessary context and direction
- Focusing on Leaf Nodes: The change was largely concentrated in "leaf nodes" of the codebase
- These are parts of the system where nothing else depends on them, making it acceptable for some technical debt to exist because they are unlikely to change or have further things built upon them
- Heavy Human Review for Core Architecture
- For the parts of the codebase that were important and needed to be extensible, heavy human review was conducted
- Designing for Verifiability:
- They carefully designed stress tests for stability and ran them for long durations, allowing stability to be measured without reading the code
- The entire system was designed to have easily human-verifiable inputs and outputs, enabling correctness verification based on these without needing to read the full underlying implementation
- These elements created verifiable checkpoints that allowed confidence in the change even without understanding the full code
By combining these methods, Anthropic was able to achieve the same level of confidence in this AI-generated change as in any other change, but deliver it in a fraction of the time and effort it would have taken to write and review every line by hand. This approach allowed them to think differently about their engineering capabilities, realizing that larger features and changes could be undertaken more efficiently.
Software Engineering Artifacts¶
Code is informed by upstream activities (requirements, architecture, design, components, etc.) and downstream activities (tests, user feedback).
- The code itself is the source of truth for what was implemented, but not why it was implemented that way, nor whether it does the right thing or does it correctly.
- The source of truth for these aspects is captured in other Software Engineering Artifacts.
Tip
These Software Engineering Artifacts inform the code and should live with the code. This means they should be in sync with the code and accessible when code is generated. Specifically, they should be located in the same organization/repository as the code, in a format that is both LLM and human-friendly (Software Engineering Artifact Formats).
Quote
At Google, our most successful efforts have been when documentation is treated like code and incorporated into the traditional engineering workflow, making it easier for engineers to write and maintain simple documents. https://abseil.io/resources/swe-book/html/ch10.html#what_qualifies_as_documentationquestion
Documentation must be as close to the code as possible. It must be easy to maintain, easy to keep up to date, and easy to find.
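One lightweight way to keep these artifacts living with the code, as the tip above suggests, is to enforce their presence in CI. The sketch below is illustrative only; the expected file names are assumptions to adapt to your repository layout.

```python
# Illustrative CI-style check that upstream artifacts live alongside the code.
# The expected paths are assumptions; adapt them to your repository layout.
from pathlib import Path
import sys

EXPECTED_ARTIFACTS = [
    "docs/requirements.md",
    "docs/architecture.md",
    "docs/adr",            # directory of Architecture Decision Records
]

def check_artifacts(repo_root: str = ".") -> int:
    missing = [p for p in EXPECTED_ARTIFACTS if not (Path(repo_root) / p).exists()]
    for p in missing:
        print(f"MISSING software engineering artifact: {p}")
    return 1 if missing else 0

if __name__ == "__main__":
    sys.exit(check_artifacts())
```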
Software Engineering Knowledge¶
Domain Knowledge¶
Code is also informed by Domain Knowledge.
Taking Security as an example Domain
https://baxbench.com/ is a recent (2025) benchmark evaluating LLMs on secure and correct code generation. It demonstrates that even flagship LLMs are not yet ready for coding automation, frequently generating insecure or incorrect code.
Three levels of Security Reminder are provided to the models:
No Security Reminder: the current best result is 25% (the insecure fraction of the correct code generated).
- "The models are only prompted to complete the coding task. The prompt contains no security-specific instructions, reflecting a realistic interaction with a developer that does not make explicit security considerations."
Generic Security Reminder: the current best result is 18.2% (the insecure fraction of the correct code generated).
- The models are prompted to complete the coding task and are explicitly reminded to make security considerations and follow security best practices.
Oracle Security Reminder: the current best result is 10% (the insecure fraction of the correct code generated).
- The models are prompted to complete the coding task and are explicitly reminded to avoid specific security vulnerabilities that could occur in the given task. This setting assumes an unrealistic oracle that anticipates all security pitfalls, "where the developer anticipates all the security vulnerabilities associated with the scenario and gives specific instructions to avoid them." This prompt provides an upper bound on the models' security performance.
The models are not provided with any security guidance or additional context (security knowledge for the specific task, such as threat models, security design patterns, or secure code guidelines).
- It is very likely that the models would achieve better benchmark results with the right context and guidance. This could be verified by evaluating such a setup against the benchmark.
Domain knowledge must be available to the Agent
Even a simple reminder makes a significant difference, and context-specific reminders yield a much greater one (see the prompt-construction sketch below).
Taking Security as an example Domain:
- Security guidance or additional context (security knowledge for the specific task, such as threat models, security design patterns, or secure code guidelines) can significantly improve performance.
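To illustrate the reminder levels discussed above, here is a minimal prompt-construction sketch. The reminder wording is an assumption for illustration, and the "specific" level stands in for the oracle/task-specific guidance; in practice that guidance would come from your own threat models, secure design patterns, or coding guidelines.

```python
# Sketch: building a code-generation prompt with different levels of security reminder.
# The reminder wording is illustrative; the "specific" level corresponds to
# task-specific (oracle-style) guidance drawn from your own security knowledge.

GENERIC_REMINDER = "Follow security best practices and avoid introducing vulnerabilities."

def build_prompt(task: str, level: str = "none", task_specific_guidance: str | None = None) -> str:
    parts = [f"Task: {task}"]
    if level == "generic":
        parts.append(f"Security reminder: {GENERIC_REMINDER}")
    elif level == "specific" and task_specific_guidance:
        parts.append(f"Security guidance for this task:\n{task_specific_guidance}")
    return "\n\n".join(parts)

if __name__ == "__main__":
    guidance = "- Parameterize all SQL queries (CWE-89)\n- Hash passwords with a salted KDF (CWE-916)"
    print(build_prompt("Implement a login endpoint in Flask", level="specific",
                       task_specific_guidance=guidance))
```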
Tacit Knowledge¶
Tacit knowledge is knowledge that resides in personal experience, skills, and intuition rather than being explicitly documented. In other words, it is the "know-how" that comes from doing, rather than just knowing the theory.
Agents need a mechanism to build tacit knowledge
In addition to Software Engineering Artifacts and Domain Knowledge, agents need a mechanism to build tacit knowledge.
In a recent study, Developers report AI not utilizing important tacit knowledge or context as a factor likely to contribute to slowdown.
Implicit repository context (Limits AI performance, Raises developer performance)
One developer notes that AI often acts like a new contributor to the repository, and that “AI doesn’t pick the right location to make the edits.” Another developer notes that while “we [..] know the data that will interact with the code, but the model doesn’t know the data. It doesn’t know we need to take care of this weird case of backwards compatibility and [thus] keep this specific line. And this is very hard to give as [context to the model].”.
Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity, "C.1.5 Implicit repository context (Limits AI performance, Raises developer performance)" July 2025
Takeaways¶
Takeaways
The context an agent (human or LLM) needs for Software Engineering includes:
- Software Engineering Artifacts (including, but not limited to, Code)
- Software Engineering Knowledge
- Domain Knowledge
- Tacit Knowledge
A Software Engineering setup should have a way to manage this context and provide it to the agent when required.
AI Coding Leaderboards¶
Overview
This page is a list of benchmarks for LLMs that are used to help with coding.
It covers both Coding Assistants and Autonomous Coding Agents
Leaderboards¶
| Leaderboard | Category | Tasks | Metrics |
|---|---|---|---|
| Aider LLM Leaderboards | Coding Assistant | 225 Exercism exercises across C++, Go, Java, JavaScript, Python, Rust | Two-pass pass rates; cost per run; edit correctness |
| EvalPlus Leaderboard | Coding Assistant | HumanEval+ (164 hand-verified Python); MBPP+ (399 sanitized Python) | pass@1 (greedy); extended efficiency via EvalPerf |
| TabbyML Coding LLMs Leaderboard | Coding Assistant | Amazon CCEval next-line tasks in Python, JS, Go… | Next-line accuracy (exact-match of very next line) |
| MHPP Leaderboard | Coding Assistant | 210 “Mostly Hard” multi-step Python problems | pass@1 (greedy); sampling (T=0.7, 100 runs) |
| Copilot Arena | Coding Assistant | Paired autocomplete & inline-editing comparisons | ELO-style rankings from user votes |
| WebDev Arena Leaderboard | Coding Assistant | Real-time web development challenges between models | Win rate; task completion; user voting |
| SWE-bench | Autonomous Agent | 2,294 real-world “Fail-to-Pass” GitHub issues from 12 Python repos | % of issues resolved |
| HAL (Holistic Agent Leaderboard) | Autonomous Agent | 13 benchmarks (e.g., SWE-bench Verified, USACO, Cybench, TAU-bench) across many domains | Cost-controlled evaluations; success rates; Pareto fronts |
| TBench | Autonomous Agent | Terminal-based complex tasks in realistic environments | Task success rate; command accuracy; time-to-completion |
Measuring AI Ability to Complete Long Tasks¶
Tip
The length of tasks (measured by how long they take human professionals) that generalist autonomous frontier model agents can complete with 50% reliability has been doubling approximately every 7 months for the last 6 years
Measuring AI Ability to Complete Long Tasks
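As a quick back-of-envelope illustration of that trend, the snippet below projects the 50%-reliability task horizon forward assuming a 7-month doubling time. The starting horizon of 1 hour is an illustrative assumption, not a figure from the paper.

```python
# Back-of-envelope projection of the 50%-reliability task horizon,
# assuming a doubling time of ~7 months. The 1-hour starting value is
# an illustrative assumption, not a number taken from the paper.

DOUBLING_MONTHS = 7

def projected_horizon_hours(start_hours: float, months_ahead: int) -> float:
    return start_hours * 2 ** (months_ahead / DOUBLING_MONTHS)

if __name__ == "__main__":
    for months in (0, 12, 24, 36):
        print(f"+{months:2d} months: ~{projected_horizon_hours(1.0, months):6.1f} hours")
```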
Other Evaluations¶
In addition to leaderboards, it is useful to read about other evaluations and experiments e.g.
- How far can we push AI autonomy in code generation?, Birgitta Böckeler ThoughtWorks, August 2025
Takeaways¶
Key Takeaways
Leaderboards are a good way to quantitatively and objectively compare solutions.
Comparison across multiple metrics and leaderboards avoids solutions that overfit to a benchmark.
Ended: Software Engineering Agents
Software Engineering 1.0 Redux ↵
Software Engineering 1.0 Redux¶
Overview
This section is a hands-on, end-to-end walkthrough that turns the thoughts on Software Engineering 1.0 Redux from the previous pages into reality.
Exploratory
Pre-Code
Tip
All of these pre-code outputs were created with a Gemini Gem loaded with a BMAD-Method Full Stack team (modified to add a security agent).
- This gives a guided interactive experience with different Software Engineering roles.
Phases are shown in chronological order - note that a second security review was done based on a newly published OWASP guide.
Code
| Phase | Input | Chat History | Output |
|---|---|---|---|
| Kanban (via GitHub Project) to plan and track work | - User Stories<br>- Security Stories ("S-" prefix)<br>- Security Test Cases | Chat history | Kanban Board of Stories |
| Code Review | Deliberately Vulnerable test app for test and demo purposes | - | Vulnerability Report that includes independent analysis and triage of results, and confidence levels |
| Docker Container | - | - | Dockerfile Security Review (Reviewer: Security-Reviewer Agent (AI)) |
Phase: Existing Exploratory Documentation¶
In the exploratory part of the project at the start I had some existing documentation:
- A Cost Analysis was created by using:
    - ChatGPT and Gemini Reasoning models to create an analysis
    - getting them to cross-review each other's output and amend their own output as a result
    - then taking the amended ChatGPT version, as it was more succinct, which is what I was looking for.
    - The cost analysis was a back-of-napkin effort to understand likely operational costs early, to see if they were a show-stopper or not.
- Research Documents using ChatGPT and Gemini Research models and Anthropic
- ADRs using an existing completed example I liked, and redoing it for the architecture decisions I had made, using ChatGPT and Gemini Reasoning models and Anthropic.
- Some Functional Requirements and use cases and user stories (in a Doc format) from MITRE CWE team.
Phase: BMAD The Planning Workflow¶
Having reviewed and played with some of the SWE agents, I found the BMAD-Method closest to my views on Software Engineering 1.0 Redux, so it is used here.
There are two BMad workflows:
See the updated BMAD flows detailing the Security activities I added:
BMAD The Planning Workflow¶
BMAD method breaks down the requirements and other upstream artifacts into epics and stories, enabling Claude Code to generate structured code and docs with little human intervention (Level 4 Autonomy).
Key Planning Phases
- Optional Analysis: Analyst conducts market research and competitive analysis
- Project Brief: Foundation document created by Analyst or user
- PRD Creation: PM transforms brief into comprehensive product requirements
- Architecture Design: Architect creates technical foundation based on PRD
- Validation & Alignment: PO ensures all documents are consistent and complete
- Refinement: Updates to epics, stories, and documents as needed
- Environment Transition: Critical switch from web UI to IDE for development workflow
- Document Preparation: PO shards large documents for development consumption
Here we build the Upstream Software Engineering Artifacts.
Specifically these are the artifacts that are built with the workflow below.
Setup¶
The BMAD-METHOD is used.
Specifically, the "Fastest Start: Web UI Full Stack Team at your disposal" path is used, where you create a Gemini Gem with a BMAD-provided text file.
- All these documents are created via a browser interface (Gemini GEM) so the process is IDE-independent!
- Gemini's long-context window allows it to keep the various documents produced in context so it can make consistent changes across them all.
Tip
The created files are available via
- the Google Gem Canvas
- the prompt response when you request a specific document
Observation
I was impressed how well this setup worked!
- following a logical workflow and prompting for choices or input at each stage (and not getting lost)
- allowing me to request Gemini Gem to output a document at any time (so I could review and version control it before changes)
- allowing me to refine the content or ask questions across all the documents as required e.g. if I suggested a change that impacted multiple documents then this was detected by Gemini and the updates made. This ensured consistency across the artefacts.
Some choices/decisions were deliberately postponed, e.g. data exchange formats and schemas, per Principle #3 - Assume variability; preserve options. This results in a PARTIAL status at this point.
Interaction with BMAD Gemini Gem¶
Success
See the full chat history of creating a Product Brief, PRD, UX requirements, Architecture document.
Tip
I worked the pre-existing document content into the The Planning Workflow by providing the content to the BMAD Persona at the relevant time via the prompt.
Phase: Security Review¶
BMAD-Method did not have a dedicated Security persona or security documents (though it does include some basic security checks).
Other Threat Modeling solutions could be used to create these security documents
- See example security artifacts
Tip
The different Threat Modeling solutions had different features I liked, so I decided to build a BMAD Method Security Agent with all the features I wanted.
The BMAD Method Security Agent fits in the BMAD Planning Workflow
- It reviews the existing documents, creates security documents including threat models, security test cases, and security updates to the architecture document.
- Unlike the other solutions, this Security Agent can also be used during the code implementation phase.
Adding a BMAD Method Security Agent¶
I added a Security Agent to BMAD per commits from Crashedmind. The commit descriptions give the details.
You can browse through the files e.g.:
Tip
The Security Agent creates these Security review documents
| Document | Description |
|---|---|
| security_assessment.md | Initial security assessment and analysis |
| threat_model.md | Complete STRIDE threat model |
| dread.md | Quantitative DREAD risk assessment |
| security_story_report.md | Actionable security user stories |
| [security_test_cases.md](https://github.com/CWE-ChatBot/CWE-ChatBot/tree/main/docs/security/bmad_fullagent_security/security_test_cases.md) | Security test cases and validation scenarios |
| security_architecture_checklist.md | Architecture checklist validation |
| compliance_audit.md | GDPR compliance audit |
Building the Bundles¶
The builder gathers all the relevant files for all the agents into single files in https://github.com/CyberSecAI/BMAD-METHOD/tree/feature/add-new-agent-security/dist/teams
- e.g. team-fullstack.txt contains all the agents including security, and all the other files they need.
Create a Gemini Gem with the FullStack team¶
- https://gemini.google.com/
- Explore Gems
- New Gems
- CopyNPaste team-fullstack.txt into instructions
- Save
Using the BMAD Method Security Agent via the Gemini Gem with FullStack team¶
- Collate the project brief, PRD, architecture, etc. md files into one file (e.g. using gitingest, or the simple collation script sketched after this list)
- At the prompt say "here's my existing documentation" and copyNpaste the collated md file into the prompt
- Click the Submit Arrow
- The workflow kicks off
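For the collation step above, a plain Python script also works if you prefer not to use gitingest; the sketch below simply concatenates the planning documents into one Markdown file. The input file names are assumptions; adjust them to your docs folder.

```python
# Sketch: collate planning documents into a single Markdown file for pasting into the Gem.
# The input file names are assumptions; adjust them to your docs folder.
from pathlib import Path

DOCS = ["project_brief.md", "prd.md", "architecture.md"]

def collate(docs_dir: str = "docs", out_file: str = "collated_docs.md") -> None:
    chunks = []
    for name in DOCS:
        path = Path(docs_dir) / name
        if path.exists():
            chunks.append(f"\n\n# FILE: {name}\n\n{path.read_text(encoding='utf-8')}")
    Path(out_file).write_text("".join(chunks), encoding="utf-8")
    print(f"Wrote {out_file} ({len(chunks)} documents)")

if __name__ == "__main__":
    collate()
```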
Success
See the
- full chat history using the Security Agent.
- full chat history creating and using the Security Agent (Video)
- output security artifacts.
-
changes to the architecture document by the architect based on the security agent's findings
- "| July 23, 2025 | 2.0 | Integrated security agent's findings (WAF, AI Guardrails, DoS Protection, Enhanced Logging). | Winston (Architect) |"
Document Preparation for development¶
Key Planning Phases
When to move to the IDE: Once you have your PRD, Architecture, optional UX and Briefs - it's time to switch over to the IDE to shard your docs and start implementing the actual code!
Coming soon..
Phase: Core Development Cycle¶
Vulnerability Assessment Analyst Tanya
"Use for performing security scans, identifying and documenting software vulnerabilities, analyzing code for weaknesses, validating patches, assessing system configurations, verifying compliance with security standards, and generating vulnerability reports for audit or remediation purposes."
The original brief from a user AVSuun
Adding a Vulnerability Assessment Analyst¶
I wanted to give the VulnerabilityTech agent access to tools for
- Static Analysis Security Testing (SAST) - combining traditional SAST tools with LLM review
- Dependency checking
I decided to use Claude Code Sub-agents for this (recently announced: https://docs.anthropic.com/en/docs/claude-code/sub-agents).
- I only focused on Claude Code - not other environments that BMAD may work on.
BMAD Agents vs Claude Code Sub-Agents¶
Claude Code Sub-Agents¶
- What They Are: Specialized AI assistants with custom system prompts and tool access
- Architecture: Defined as Markdown files with YAML frontmatter in `.claude/agents/` (see the sketch after this list)
- Operation: Isolated context windows with specific expertise and granular tool permissions
- Delegation: Automatic selection by Claude Code or explicit user invocation
- Benefits: Focused expertise, faster analysis, parallel processing capabilities
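As a sketch of what such a sub-agent definition looks like on disk, the snippet below writes a Markdown file with YAML frontmatter into `.claude/agents/`. The frontmatter fields shown (name, description, tools) follow the pattern described in Anthropic's sub-agent documentation, but verify the exact schema against the current docs before relying on it.

```python
# Sketch: generate a Claude Code sub-agent definition (Markdown + YAML frontmatter)
# under .claude/agents/. Verify the frontmatter schema against Anthropic's current docs;
# the fields shown here (name, description, tools) are based on the documented pattern.
from pathlib import Path
from textwrap import dedent

def write_subagent(name: str, description: str, tools: list[str], system_prompt: str) -> Path:
    path = Path(".claude/agents") / f"{name}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(dedent(f"""\
        ---
        name: {name}
        description: {description}
        tools: {', '.join(tools)}
        ---
        {system_prompt}
        """), encoding="utf-8")
    return path

if __name__ == "__main__":
    p = write_subagent(
        name="dependency-scanner",
        description="Scans third-party dependencies for known vulnerabilities and license issues.",
        tools=["Read", "Grep", "Bash"],
        system_prompt="You are a security analyst focused on dependency and supply chain risk.",
    )
    print(f"Wrote {p}")
```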
BMad Framework Current State¶
- 12 Specialized Agents: Comprehensive personas in `bmad-core/agents/`
- Complex Configuration: YAML-based with dependencies and external task references
- Full Context Maintenance: Complete operating instructions and workflow definitions
- Command Syntax: `*command` activation with extensive persona definitions
Key Architectural Differences¶
- Complexity: BMad = comprehensive personas vs Claude = focused sub-agents
- Dependencies: BMad = external task references vs Claude = self-contained
- Context: BMad = full context maintenance vs Claude = isolated context windows
- Activation: BMad = explicit commands vs Claude = automatic delegation
Claude Code sub-agents Setup¶
See how I setup Claude Code sub-agents to work with BMAD agents that describes how BMAD agents call Claude Code sub-agents.
Four Claude Code sub-agents were added:
| Claude sub-agent | Purpose |
|---|---|
| dependency-scanner | I am a specialized security analyst focused on third-party component security assessment and supply chain risk analysis. My expertise covers dependency vulnerability scanning, license compliance, and supply chain security validation according to NIST SSDF practices. |
| pattern-analyzer | I am a specialized security pattern analyst focused on detecting secure and insecure coding patterns across multiple programming languages. My expertise leverages language-specific security knowledge from expansion packs to validate secure coding practices and identify anti-patterns that introduce vulnerabilities. |
| security-reviewer | I am a Level 2 orchestrator sub-agent that coordinates comprehensive security analysis by leveraging specialized tool sub-agents. My role is to orchestrate hybrid SAST + LLM security analysis for maximum accuracy and coverage, focusing on practical, exploitable security issues that pose real risks to applications and systems. |
| test-validator | I am a specialized security testing analyst focused on validating the quality, coverage, and effectiveness of security tests within software projects. My expertise ensures that security testing meets NIST SSDF standards and provides robust protection against real-world threats. |
I also added a tests dir to BMAD to be able to test this.
- This includes a deliberately vulnerable app to test:
VulnerabilityTech Agent Report¶
Success
See example output report.
Other Claude Code Workflows¶
There are other Claude Code workflows; some examples are given here.
UI Designer¶
https://www.youtube.com/watch?v=TyGx277x9hQ
Retrospectives¶
https://www.youtube.com/watch?v=ai_sSQH1Pn4&t=478s
Multitasking¶
https://www.geeky-gadgets.com/how-to-use-git-worktrees-with-claude-code-for-seamless-multitasking/
There are UI tools built on Git worktrees to support this.
Other non-Claude Code Workflows¶
Accessibility¶
https://github.com/scragz/kornelius/blob/main/prompts/audit/a11y.prompt
References¶
- https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/
- https://www.geeky-gadgets.com/how-to-use-git-worktrees-with-claude-code-for-seamless-multitasking/
- https://github.com/ryoppippi/ccusage
- https://github.com/eyaltoledano/claude-task-master/blob/main/docs/tutorial.md
- https://www.reddit.com/r/vibecoding/comments/1lu37up/vibecoding_is_straight_chaos_without_instructions/
- https://www.reddit.com/r/vibecoding/comments/1l5o93n/lets_talk_about_security/
Software Engineering 1.0 Redux Security¶
Overview
This section covers the Core Review Security touchpoint.
We touched on this point earlier in NotebookLM Secure Code.
Software Security Touchpoints¶
Image from Secure Programming with Static Analysis by Brian Chess, Jacob West
Finding Defects in Source Code¶
Image from Secure Programming with Static Analysis by Brian Chess, Jacob West
The best way to find a particular defect depends on whether it is generic or context specific, and whether it is visible in the code or only in the design.
LLMs can complement the static analysis sweet spot (generic defects that are visible in the code)
- "SAST tools often rely on predefined patterns and rules, which can result in high false-positive rates and an inability to detect novel or context-dependent vulnerabilities" per ref.
- Additional context can be given to LLMs to improve their performance e.g. vulnerability reports on similar code LSAST: Enhancing Cybersecurity through LLM-supported Static Application Security Testing
How to use LLMs and SAST Tools Together¶
LLMs can be used with Code Analysis Tools in different ways:
- Create a formatted report of findings:
- this is more an editorial activity - not code analysis
- Utilising LLMs for False Positive Mitigation (FPM) on SAST Warnings:
- Traditional SAST tools often generate a high volume of false positive alerts, which can be time-consuming and resource-intensive for developers to manually review.
- LLMs can be integrated to automate this false positive mitigation process
- Direct Integration of SAST Results with LLMs:
- SAST tools provide initial insights into known vulnerabilities based on predefined rules and patterns. Their findings, including details like CWE IDs and line numbers, can be formatted and directly incorporated into the prompts provided to LLMs alongside the target code.
- This approach, referred to as "RAW LSAST," has been shown to significantly enhance the LLM’s ability to detect vulnerabilities that static scanners alone might miss, thereby improving overall detection accuracy (a minimal sketch of this prompt construction follows the list)
- Independently look for issues
- e.g. using an analysis of CPGs.
- Actively exercise the code by creating or guiding test cases.
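Here is a minimal sketch of feeding SAST findings to an LLM for triage, assuming a generic finding schema (tool, CWE ID, file, line, message) and a placeholder `call_llm`. It is not tied to any particular SAST tool or to the LSAST implementation; it only illustrates formatting a finding plus the flagged code into the prompt.

```python
# Sketch: format SAST findings (CWE ID, file, line, message) into an LLM prompt
# together with the flagged code, for triage / false positive review.
# `call_llm` is a placeholder for a real model call; the finding schema is an assumption.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Finding:
    tool: str
    cwe_id: str
    file: str
    line: int
    message: str

def code_snippet(path: str, line: int, context: int = 10) -> str:
    p = Path(path)
    if not p.exists():
        return "(source file not available in this sketch)"
    lines = p.read_text(encoding="utf-8").splitlines()
    start, end = max(0, line - 1 - context), min(len(lines), line + context)
    return "\n".join(f"{n:5d} | {text}" for n, text in enumerate(lines[start:end], start=start + 1))

def build_triage_prompt(finding: Finding) -> str:
    return (
        f"A static analysis tool ({finding.tool}) reported {finding.cwe_id} "
        f"at {finding.file}:{finding.line}: {finding.message}\n\n"
        f"Relevant code:\n{code_snippet(finding.file, finding.line)}\n\n"
        "Is this a true positive? Explain, and if real, suggest a minimal fix."
    )

def call_llm(prompt: str) -> str:
    return "[model response placeholder]"  # replace with a real LLM client call

if __name__ == "__main__":
    f = Finding("bandit", "CWE-89", "app/db.py", 42, "Possible SQL injection via string formatting")
    print(build_triage_prompt(f))
```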
Quote
While traditional tools like Bandit are valuable, our results show that LLMs provide a complementary advantage by analyzing fragmented and non-compilable code and detecting complex vulnerability patterns that existing tools may miss. Additionally, the ability of LLMs to generate security tests adds a useful layer of verification, potentially enhancing the accuracy of vulnerability assessments.
Quote
Our findings indicate that SAST tools exhibit low vulnerability detection ratios while maintaining a low marked function ratio, akin to a low false positive rate. In contrast, LLMs demonstrate high vulnerability detection ratios but are accompanied by elevated marked function ratios (akin to high false positive rates). Through ensemble approaches, we demonstrate that combining SAST tools and LLMs mitigates their respective limitations, resulting in improved overall vulnerability detection performance.
Quote
Analysis reveals that the optimal approaches differ across programming languages. The best choice will depend on the user’s acceptance of the trade-off between detection ratios and marked function ratios.
Quote
The reasoning-oriented models consistently produced fewer false positives, suggesting that their internal steps for “validating” potential vulnerabilities lead to more precise outcomes.
CASTLE: Benchmarking Dataset for Static Code Analyzers and LLMs towards CWE Detection
Code Context¶
A challenge with using LLMs is to get the precise and complete code context to the LLM as described in Utilizing Precise and Complete Code Context to Guide LLM in Automatic False Positive Mitigation. They developed a tool to extract this code context.
Quote
- First, we propose a line-level precise code slicer eCPG-Slicer. It constructs an extended Code Property Graph (eCPG) and then extracts line-level code context related to the warning within the given files.
- Second, we propose a linear complexity algorithm, FARF, which is used to identify source files that have dependencies related to a warning, enabling the slicer to extract the complete code context.
- Third, we integrate our eCPG-Slicer and FARF algorithm into our LLM4FPM framework. LLM4FPM can efficiently drive LLMs to give judgements for a given warning generated by SAST tools.
Note
The source code for these does not appear to be publicly available. But Joern is an open source solution used for such purposes.
Quote
A code property graph of a program is a graph representation of the program obtained by merging its abstract syntax trees (AST), control-flow graphs (CFG) and program dependence graphs (PDG) at statement and predicate nodes. https://en.wikipedia.org/wiki/Code_property_graph
Quote
Joern is a platform for analyzing source code, bytecode, and binary executables. It generates code property graphs (CPGs), a graph representation of code for cross-language code analysis. Code property graphs are stored in a custom graph database. https://github.com/joernio/joern
Quote
- Precise Code Context. The extracted code snippet should focus on control flows and data flows relevant to the warning, capturing the precise code context while omitting unnecessary parts that might distract the LLM and lead to incorrect or ambiguous conclusions.
- Complete Code Context. The analysis should account for key information often missing from bug reports, such as references to global variables or invoked functions located in other files. Without this, the extracted context remains incomplete.
- Correct Conclusions. After obtaining precise and complete code context, there is an opportunity to more effectively guide the LLM to make accurate judgments on bug reports from SAST tools.
Utilizing Precise and Complete Code Context to Guide LLM in Automatic False Positive Mitigation
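The eCPG-Slicer and FARF tooling described above are not public, so as a much simpler stand-in, the sketch below gathers the lines around a warning plus the definitions of functions called nearby, even if they live in other files. It only illustrates the "precise and complete context" idea; a real slicer uses control- and data-flow analysis (e.g. Joern/CPG-based) rather than regexes.

```python
# Very rough stand-in for a code slicer: given a warning location, gather the
# surrounding lines plus the definitions of any functions called nearby, even if
# they live in other files. Illustrative only; a real slicer (e.g. eCPG/Joern-based)
# uses control- and data-flow analysis rather than regexes.
import re
from pathlib import Path

CALL_RE = re.compile(r"\b([a-zA-Z_]\w*)\s*\(")

def naive_slice(repo: str, file: str, line: int, window: int = 5) -> str:
    src = (Path(repo) / file).read_text(encoding="utf-8").splitlines()
    lo, hi = max(0, line - 1 - window), min(len(src), line + window)
    focus = src[lo:hi]
    called = {m.group(1) for l in focus for m in CALL_RE.finditer(l)}

    parts = [f"# {file}:{line} (warning site)"] + focus
    for py in Path(repo).rglob("*.py"):  # pull in matching function definitions from other files
        for i, l in enumerate(py.read_text(encoding="utf-8").splitlines(), start=1):
            stripped = l.strip()
            if stripped.startswith("def ") and stripped[4:].split("(")[0].strip() in called:
                parts.append(f"# from {py}:{i}: {stripped}")
    return "\n".join(parts)

# Example usage (hypothetical paths):
#   print(naive_slice("my_repo", "app/views.py", 120))
```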
References¶
- AI Cyber Challenge (2025). AIxCC Open Source Archive.
- Official archive providing open-source access to all seven finalist Cyber Reasoning Systems (CRSs), released to accelerate adoption of AIxCC-developed technology in cybersecurity and software development. https://archive.aicyberchallenge.com
- See DARPA AI Cyber Challenge Tools Comparison Report
- Brian, C., & West, J. (2007). Secure programming with static analysis. Addison-Wesley.
- McGraw, G. (2006). Software security: Building security in. Addison-Wesley.
- Zhang, Y., Li, Y., Wang, T., et al. (2025). Towards effective complementary security analysis using large language models. arXiv preprint. https://arxiv.org/html/2506.16899v1
- Xu, H., Chen, M., & Zhou, L. (2025). Leveraging large language models for command injection vulnerability analysis in Python: An empirical study on popular open-source projects. arXiv preprint. https://arxiv.org/html/2505.15088v1
- Li, J., Wang, P., & Huang, K. (2025). LLM vs. SAST: A technical analysis on detecting coding bugs of GPT4-Advanced Data Analysis. arXiv preprint. https://arxiv.org/html/2506.15212v1
- Müller, T., & Kim, D. (2024). Comparison of static application security testing tools and large language models for repo-level vulnerability detection. arXiv preprint. https://arxiv.org/abs/2407.16235
- Shen, R., & Zhao, F. (2024). LSAST: Enhancing cybersecurity through LLM-supported static application security testing. arXiv preprint. https://arxiv.org/html/2409.15735v2
- Kumar, V., & Singh, A. (2025). CASTLE: Benchmarking dataset for static code analyzers and LLMs towards CWE detection. arXiv preprint. https://arxiv.org/html/2503.09433v1
- Li, C., & Zhang, H. (2024). Source code vulnerability detection: Combining code language models and code property graphs. arXiv preprint. https://arxiv.org/html/2404.14719v1
- Kolla, H. (2024). SecLint: An agentic code vulnerability detection tool using RAG. Substack. https://harishkolla.substack.com/p/seclint-an-agentic-code-vulnerability
- Gross, C. (2025, August). Slice: SAST + LLM interprocedural context extractor. noperator.dev. https://noperator.dev/posts/slice/
- Google Security Team. (2024). Google Big Sleep AI tool finds critical Chrome vulnerability. LinkedIn. https://lnkd.in/gSkA3kRC
- Heelan, S. (2024). Using o3 to discover a remote zeroday vulnerability in the Linux kernel’s SMB implementation. LinkedIn. https://lnkd.in/gHZqC_k3
- Wang, Z., Shi, T., Song, D., et al. (2024). CyberGym: Evaluating AI agents' cybersecurity capabilities with real-world vulnerabilities at scale. UC Berkeley. https://www.cybergym.io/
- Gaucher, R., Ermilov, V., & [Author], A. (2024). Finding vulnerabilities in modern web apps using Claude Code and OpenAI Codex. LinkedIn. https://lnkd.in/gGWaZzHk
- Li, Z., Li, Y., Wang, T., et al. (2024). IRIS: LLM-assisted static analysis for detecting security vulnerabilities. arXiv preprint. https://arxiv.org/abs/2405.17238
- (2025). Harnessing the power of LLMs for vulnerability detection. arXiv preprint. https://arxiv.org/html/2504.18423v1
- (2025). Advancing software vulnerability detection with reasoning-enabled large language models. Applied Sciences. https://www.mdpi.com/2076-3417/15/12/6651
- Khare, A., Dutta, S., Li, Z., Solko-Breslin, A., Alur, R., & Naik, M. (2023). Understanding the effectiveness of large language models in detecting security vulnerabilities. arXiv preprint. https://arxiv.org/abs/2311.16169
- Yang, Y., Xu, B., Gao, X., & Sun, H. (2025). Context-enhanced vulnerability detection based on large language models. arXiv preprint. https://arxiv.org/abs/2504.16877
- Kaniewski, S., Schmidt, F., Enzweiler, M., Menth, M., & Heer, T. (2025). A systematic literature review on detecting software vulnerabilities with large language models. arXiv preprint. https://arxiv.org/abs/2507.22659
- See Awesome-LLM4SVD for the surveyed papers, taxonomy, datasets: https://github.com/hs-esslingen-it-security/Awesome-LLM4SVD
- Sheng, Z., Chen, Z., Gu, S., Huang, H., Gu, G., & Huang, J. (2025). LLMs in software security: A survey of vulnerability detection techniques and insights. arXiv preprint. https://arxiv.org/abs/2502.07049
- Klieber, W., & Flynn, L. (2024, October 7). Evaluating static analysis alerts with LLMs. Cybersecurity Engineering, SEI blog. https://www.sei.cmu.edu/blog/evaluating-static-analysis-alerts-with-llms/
- Ferro, A. (2024). Beyond static analysis: AI-powered vulnerability detection. Medium. https://medium.com/engaging-innovation/beyond-static-analysis-ai-powered-vulnerability-detection-690ff4e1f94c
- (2024). Research on LLM for vulnerability detection – Fernweh. Blog post. https://blog.wohin.me/posts/recent-llm-for-vuln-detection/
- Nice summary of 18 studies on LLM for vulnerability detection
In Depth Articles on Tools¶
- Hakimian, P. (2025, October 31). WTF is … AI-Native SAST? Hackerman’s Hacking Tutorials. https://parsiya.net/blog/wtf-is-ai-native-sast/.
- scabench-org. (2025). Hound: Language-agnostic AI auditor that autonomously builds and refines adaptive knowledge graphs for deep, iterative code reasoning. GitHub repository. https://github.com/scabench-org/hound.
- Karger, R. (2024, November 1). How ZeroPath Works: Technical deep-dive into ZeroPath’s SAST methodology—from AST generation to AI-powered vulnerability discovery and automated patch generation. ZeroPath Blog. https://zeropath.com/blog/how-zeropath-works.
- Rogers, J. (2025, September 18). Hacking with AI SASTs: An overview of “AI Security Engineers” / “LLM Security Scanners” for Penetration Testers and Security Teams. Joshua.Hu — Joshua Rogers’ Scribbles. https://joshua.hu/llm-engineer-review-sast-security-ai-tools-pentesters.
Policy-as-Code Served Pre and Post Coding¶
Overview
This chapter explains how to turn large, static security documents into working policy-as-code that GenAI Security Agents can apply during design, coding, and review. Instead of asking engineers to memorize external and internal standards and guidance, we:
- Break large documents into small, testable rule cards
- Compile those rule cards into reusable knowledge packs
- Expose the packs through security agents and interactive skills
- Wire them into software engineering workflows so guidance appears at the right time and place
The end goal is simple: security guidance lives as close as possible to the code, served automatically at the time of coding as pre-coding guidance and post-coding checks—achieving Continuous Software Assurance.
🧑🏻💻 Source code: https://github.com/CyberSecAI/genai-sec-agents
Goals¶

The Policy-as-Code engine for GenAI Security Agents is designed to:
- Operationalise standards: Turn external frameworks and internal policies into concrete, testable rules.
- Embed security into workflows: Make security checks first-class citizens in planning, design, coding, and review.
- Support both pre-code and post-code use: Give guidance before the first line of code, and validation once code exists. In general, give guidance pre and post any artifact creation.
- Provide traceability: Make it clear which requirement, from which document, led to each recommendation.
- Enable incremental adoption: Start with pre-built knowledge packs, then extend with your own documentation over time.
Core Concept: From Documents to Rule Cards¶
The core idea is simple:
Take large security documents and convert them into small, testable rule cards that can be applied by Claude Code agents and skills.
Source Documents¶

Typical sources include:
- OWASP and ASVS guidance as used in this example
- Internal security policies and standards
- Regulator rules and industry frameworks
- Architecture decision records and security guidelines
These are treated as the source of truth, not something the LLM invents.
Atomic Rule Cards¶
From those documents we create atomic rule cards:
- Each card represents a single, precise requirement (“API must use OAuth2 with short-lived access tokens”, not “use strong auth”)
- Each card has:
  - A clear condition (when it applies)
  - A requirement (what must be true)
  - Rationale / references (why it matters, where it came from)
  - Example checks or questions the agent should ask
Tip
Think of rule cards as “linters for architecture and security design” - small, focused rules that are easy to test and revise.
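To make this concrete, here is a minimal, hypothetical rule card expressed as a Python dictionary. The field names are illustrative only; the actual card schema used in the repository may differ.

```python
# Hypothetical atomic rule card: one precise, testable requirement.
# Field names are illustrative; the real schema in the repo may differ.
rule_card = {
    "id": "AUTH-001",
    "domain": "authentication",
    "condition": "Service exposes an HTTP API to external clients",
    "requirement": "API must use OAuth2 with short-lived access tokens (<= 15 minutes)",
    "rationale": "Limits the blast radius of a leaked or stolen token",
    "references": ["OWASP ASVS V2 - Authentication"],
    "checks": [
        "Is OAuth2 (not API keys or basic auth) used for external clients?",
        "Is the access token lifetime 15 minutes or less?",
    ],
}
```

Keeping each card this small is what makes it easy to review, test, and revise in isolation.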
Compiled Knowledge Packs¶
Rule cards are grouped and compiled into knowledge packs:
- Packs align with domains (authentication, secrets, logging, etc.)
- Each pack can be:
  - Loaded by a specialist security agent
  - Queried via interactive skills, semantic search, or grep
  - Used in combination for broader reviews
This compilation step normalises the content into a machine-friendly format that agents can load efficiently.
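As a rough sketch of that compilation step, the snippet below groups rule cards by domain and writes one JSON pack per domain. The paths, file layout, and pack structure are assumptions for illustration; the repository's own build scripts define the real format.

```python
# Minimal sketch of compiling rule cards into per-domain knowledge packs.
# Paths, file layout, and pack structure are assumed for illustration only.
import json
from collections import defaultdict
from pathlib import Path

def compile_packs(cards_dir: str, packs_dir: str) -> None:
    packs = defaultdict(list)

    # Each rule card is assumed to be a small JSON file with a "domain" field.
    for card_file in Path(cards_dir).glob("*.json"):
        card = json.loads(card_file.read_text())
        packs[card.get("domain", "general")].append(card)

    out = Path(packs_dir)
    out.mkdir(parents=True, exist_ok=True)

    # One pack per domain, e.g. authentication.json, secrets.json, logging.json.
    for domain, cards in packs.items():
        pack = {"domain": domain, "rule_count": len(cards), "rules": cards}
        (out / f"{domain}.json").write_text(json.dumps(pack, indent=2))

if __name__ == "__main__":
    compile_packs("rule_cards", "knowledge_packs")
```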
Agents and Skills¶

The .claude/ folder in a project then becomes the delivery vehicle:
- Security agents: Deep specialists for cross-cutting reviews (e.g. "comprehensive security review", "authentication specialist").
- Interactive skills: Specialized context modifiers that can be triggered via slash commands or automatically by Claude (e.g. `/authentication-security`, `/secrets-check`, `/logging-review`).
Note
The same underlying rule cards and packs power all of these entry points - you don't have to duplicate logic across agents and skills.
When to Use Skills or Agents¶
Skills vs Agents
Skills are best for:
- Automatic context injection - Claude detects when they're needed via pure LLM reasoning
- Interactive workflows - Progressive disclosure loads only what's needed when matched
- Token efficiency - Small discovery cost (name + description), larger only when activated
- Learning and exploration - Rich examples and guidance for understanding capabilities
- Multi-skill composition - Multiple skills can auto-activate together for complex requests
Agents are best for:
- Explicit parallel execution - Multiple sub-agents run simultaneously via Agent SDK
- Programmatic invocation - CI/CD pipelines, pre-commit hooks, automated validation
- Deterministic workflows - User controls exactly which agents run and when
- Background execution - Long-running tasks with monitoring and output retrieval
- Orchestration patterns - Main agent delegates to specialist sub-agents
Critical distinction: Skills are context modifiers that inject specialized instructions and can constrain allowed tools/model selection. Agents are task executors that run in separate contexts for parallel analysis.
Complementary, not competing: Use skills for automatic context management in interactive workflows. Use agents for explicit parallel execution in programmatic workflows. Both share the same rule knowledge base via symlinks.
Access Patterns: Six Ways LLMs Use Security Knowledge¶
The Policy-as-Code engine exposes six complementary access patterns. Each pattern balances:
- How predictable the invocation is
- How much context it needs
- How “deep” the analysis should be
- How it affects the primary Claude Code CLI context
Summary of Patterns¶
| Pattern | Activation | Token Cost | Context Window Impact | Use Case |
|---|---|---|---|---|
| Slash Commands | Deterministic (user-explicit) | Variable | Invokes skills or direct commands | Explicit developer-invoked guidance |
| Skills | LLM-triggered (based on description) or via slash command | 1.5k-15k per invocation | Context modifier with progressive disclosure | Specialized instructions and tool/model constraints |
| Agents | Explicit (Task tool) | 15k+ | Separate context (not main window) | Parallel analysis, deep validation |
| Semantic Search | Explicit (tool) | Variable | On-demand reading into main context | Standards research, best practices lookup. No vector or RAG DB needed. |
| Grep | Explicit (tool) | Minimal | On-demand reading into main context | Direct pattern search in rules/corpus |
| CLAUDE.md | Automatic (every turn) | 1-5k per turn (~15k per 5-turn conversation) | Always present in main context | Workflow orchestration, security enforcement |
Note: numbers are given as representative examples. YMMV!
Skills vs Slash Commands
Skills and Slash Commands work together, not as alternatives:
- Skills are prompt-based context modifiers that inject specialized instructions and can constrain allowed tools and model selection
- Slash Commands are invocation methods that can trigger skills deterministically (the user types `/skill-name`)
- Skills can ALSO be triggered automatically via pure LLM reasoning when Claude reads skill descriptions and decides they're relevant
- A skill can be invoked both ways: explicitly via slash command OR automatically when the LLM detects the need
Skills Auto-Activation is Probabilistic and Naive
Automatic skill activation is hit-and-miss and heavily depends on skill description quality:
Naive Discovery Mechanism:
- Skill matching uses substring and semantic matching on the `description` field in SKILL.md frontmatter
- The matching algorithm is not sophisticated: no embedding models, no matching scores, no ranking
- If your description is vague or lacks keywords matching the user's request, the skill won't activate at all
- Skills can fail to activate even when they seem clearly relevant
No Debug Capability:
- There is no matching score or confidence metric available
- There is no debug mode to see why a skill didn't activate
- The only way to troubleshoot is trial-and-error refinement of descriptions
Critical Success Factors:
- Include specific keywords that users are likely to mention (e.g., "JWT", "OAuth2", "password hashing")
- List explicit use cases in the description ("Use this skill when reviewing login code, authentication flows, or credential handling")
- Add trigger phrases that map to common developer requests
- Test extensively with realistic user prompts to validate activation rates (a rough offline check is sketched after this section)
- Quality of CLAUDE.md matters - clear instructions can guide Claude to prefer certain skills in specific contexts
Recommendation: For security-critical workflows, do NOT rely solely on automatic skill activation. Use explicit slash commands or deterministic agent invocation for mandatory security checks. Consider skills as helpful guidance that MAY activate, not guaranteed enforcement.
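To support the "test extensively with realistic user prompts" factor above, here is a rough offline sanity check: it flags test prompts that share no keywords with a skill's description. It does not replicate Claude's actual activation (which is pure LLM reasoning, not keyword scoring); the SKILL.md path and frontmatter parsing here are assumptions.

```python
# Rough offline check: does a skill description mention the keywords that
# realistic user prompts are likely to contain? This is NOT how Claude matches
# skills (that is pure LLM reasoning); it only flags obviously vague descriptions.
import re
from pathlib import Path

def description_from_skill_md(path: str) -> str:
    """Naively pull the description field out of SKILL.md YAML frontmatter."""
    text = Path(path).read_text()
    match = re.search(r"^description:\s*(.+)$", text, re.MULTILINE)
    return match.group(1).strip() if match else ""

def missing_keyword_prompts(description: str, test_prompts: list[str]) -> list[str]:
    """Return the prompts that share no significant word with the description."""
    stopwords = {"the", "a", "is", "my", "this", "for", "in", "of"}
    desc_words = set(re.findall(r"[a-z0-9]+", description.lower()))
    flagged = []
    for prompt in test_prompts:
        prompt_words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
        if not (desc_words & (prompt_words - stopwords)):
            flagged.append(prompt)
    return flagged

if __name__ == "__main__":
    # Path is an assumption for illustration.
    description = description_from_skill_md(".claude/skills/authentication-security/SKILL.md")
    prompts = [
        "Is this JWT handling secure?",
        "Review my login flow for credential handling issues",
        "Check password hashing in this module",
    ]
    for prompt in missing_keyword_prompts(description, prompts):
        print(f"Description shares no keywords with: {prompt!r} - consider adding trigger terms")
```

A check like this is only a pre-filter; the real validation is still interactive testing against Claude Code itself.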
When to Use Which Pattern¶
Claude's Autonomy in Pattern Selection
Claude Code can autonomously choose which access patterns to use based on task requirements and instructions in the CLAUDE.md file:
- The patterns described below are guidance for system designers, not rigid constraints on Claude's behavior
- Claude may decide to use Skills, Agents, Semantic Search, or Grep based on what's most appropriate for the current task
- `CLAUDE.md` instructions can guide Claude to prefer certain patterns in specific contexts (e.g., "always use the security agent for authentication code")
- Developers can invoke patterns explicitly via slash commands, OR Claude can choose them autonomously based on LLM reasoning
- This flexibility allows Claude to adapt its approach based on context, task complexity, and available information
- Slash Commands: Use when you want deterministic, user-controlled skill invocation:
  - A developer explicitly calls `/secrets-security` while working on config to invoke a specific skill.
  - An architect explicitly asks `/api-authentication-review` to trigger authentication review skills on an ADR.
- Skills: Use when you want to inject specialized context and constraints into Claude's behavior:
  - Define skills that Claude can automatically detect and invoke based on pure LLM reasoning.
  - Constrain which tools Claude can use during specific security reviews.
  - Modify Claude's model selection or execution permissions for specialized tasks.
  - Skills can be triggered either explicitly via slash commands OR automatically when Claude determines they're relevant.
- Agents: Use when you need broad, cross-cutting security analysis:
  - “Review this PR against our security standards.”
  - “Assess this architecture document for threats and missing controls.”
- Semantic Search: Use when the question is primarily research:
  - “What are our password policies?”
  - “How do we handle multi-tenant isolation?”
- Grep: Use when you need exact phrase matching (see the sketch after this list):
  - “Show all rules that mention ‘JWT’.”
  - “Find references to `SAMESITE` cookies.”
- CLAUDE.md: Use to orchestrate when security runs at all:
  - Automatically invoke a security agent when certain file types or keywords appear.
  - Ensure security checks become part of the default coding workflow, not an afterthought.
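As referenced in the Grep item above, the sketch below shows the equivalent exact-match search over compiled rule packs, assuming the illustrative JSON layout from earlier. Inside Claude Code this role is played by the built-in Grep tool; the script only shows the kind of data being searched.

```python
# Minimal sketch: exact-phrase search over compiled knowledge packs.
# Assumes the per-domain JSON packs sketched earlier; inside Claude Code the
# built-in Grep tool plays this role over the rules/corpus files.
import json
from pathlib import Path

def grep_rules(packs_dir: str, term: str) -> list[dict]:
    """Return rule cards whose text mentions the exact term (case-insensitive)."""
    hits = []
    for pack_file in Path(packs_dir).glob("*.json"):
        pack = json.loads(pack_file.read_text())
        for rule in pack.get("rules", []):
            if term.lower() in json.dumps(rule).lower():
                hits.append({
                    "pack": pack_file.name,
                    "id": rule.get("id"),
                    "requirement": rule.get("requirement"),
                })
    return hits

if __name__ == "__main__":
    for hit in grep_rules("knowledge_packs", "JWT"):
        print(hit)
```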
Layered Architecture¶
The Policy-as-Code engine follows a simple layered architecture:
1. Source Documents
   - Raw security content: standards, policies, guidelines.
   - Semantic search and Grep search these.
2. Atomic Rule Cards
   - Normalised, testable rules with conditions and references.
3. Compiled Rule Sets (Knowledge Packs)
   - Grouped JSON bundles for efficient loading.
4. Agents & Skills
   - Human-facing and Claude-code-facing entry points defined in `.claude/`.
This separation is important:
- Security experts can iterate on rule cards without touching agent wiring.
- Developers only see the agents and skills, not the internal representation.
- You can run validation on rule cards and packs independently from model behaviour (see the sketch after the note below).
Note
This layout is intentionally compatible with both current Claude Code workflows and other agent frameworks. The rule packs are just data - any agent that can load them can use them.
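As a minimal example of validating rule cards independently of model behaviour, the sketch below checks that each card has the fields described earlier. The schema and file layout are assumptions carried over from the earlier illustrative examples, not the repository's actual format.

```python
# Minimal sketch: validate rule cards without involving any LLM.
# Field names and file layout follow the earlier illustrative examples,
# not necessarily the repository's actual schema.
import json
from pathlib import Path

REQUIRED_FIELDS = ("id", "domain", "condition", "requirement", "references")

def validate_cards(cards_dir: str) -> list[str]:
    """Return a list of human-readable problems found in the rule cards."""
    problems = []
    for card_file in Path(cards_dir).glob("*.json"):
        card = json.loads(card_file.read_text())
        for field in REQUIRED_FIELDS:
            if not card.get(field):
                problems.append(f"{card_file.name}: missing or empty '{field}'")
        # A card should be atomic: one requirement, not a bundle of them.
        if isinstance(card.get("requirement"), list):
            problems.append(f"{card_file.name}: 'requirement' should be a single statement")
    return problems

if __name__ == "__main__":
    for problem in validate_cards("rule_cards"):
        print(problem)
```

A check like this can run in CI so that malformed or vague cards are caught before they are compiled into packs.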
Two Modes of Adoption¶
The repository supports two complementary adoption paths.
1. Use Pre-Built Security Knowledge¶
For teams who want fast value with minimal setup:
- Copy the provided `.claude/` folder into your project.
- You immediately get:
  - A curated set of security skills
  - A suite of security agents
  - Pre-populated rule packs aligned to common standards (e.g. OWASP / ASVS families)
Typical uses:
- Pre-code: Use skills while drafting designs, ADRs, and user stories to catch issues early.
- Post-code: Use agents to review pull requests, configuration files, Dockerfiles, etc.
2. Build Your Own Knowledge Packs¶
For organisations with strong internal standards or regulatory needs:
- Modify the provided tooling to:
  - Ingest your internal security policies and standards
  - Shard them into rule cards (a simple sharding sketch follows the tip below)
  - Compile them into custom knowledge packs
  - Wire those packs into new or existing agents and skills
- Requirements (high level):
  - A Python environment
  - The ability to run the build scripts locally or in CI
Tip
Start by extending the existing packs - add a small number of internal rules for your most critical risks (e.g. data residency, customer data handling) before attempting a full policy migration.
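As referenced above, here is a minimal sketch of the sharding idea: splitting an internal policy written in Markdown into candidate rule card stubs, one per section, for a human to refine. The repository ships its own build scripts; this standalone snippet and its input format are assumptions for illustration.

```python
# Minimal sketch: shard a Markdown policy document into candidate rule card stubs.
# One stub per "##" section, ready for human review and refinement.
# The repository's own build scripts define the real ingestion pipeline;
# the input format here is an assumption for illustration.
import json
import re
from pathlib import Path

def shard_policy(policy_md: str, out_dir: str, domain: str) -> None:
    text = Path(policy_md).read_text()
    # Split on level-2 headings; each section becomes one candidate card.
    sections = re.split(r"^## +", text, flags=re.MULTILINE)[1:]
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    for index, section in enumerate(sections, start=1):
        title, _, body = section.partition("\n")
        stub = {
            "id": f"{domain.upper()}-{index:03d}",
            "domain": domain,
            "condition": "TODO: when does this apply?",
            "requirement": title.strip(),
            "rationale": body.strip()[:500],
            "references": [policy_md],
        }
        (out / f"{stub['id']}.json").write_text(json.dumps(stub, indent=2))

if __name__ == "__main__":
    # Input file and domain name are hypothetical.
    shard_policy("internal_policies/data_residency.md", "rule_cards", "data-residency")
```

Each stub still needs a human to tighten the condition and requirement, which is exactly where security engineers add value.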
How It Fits the Software Engineering 1.0 Redux Flow¶
This engine is designed to plug into the phases already described in the Software Engineering 1.0 Redux and Software Engineering Security sections.
Pre-Code: Planning and Security Review¶
- Planning
  - Use skills to ensure PRDs and architecture docs include required security content.
  - Architect and Security roles can call targeted skills for authentication, secrets, logging, and data protection.
- Security Review
  - Run a comprehensive security agent over:
    - Product brief
    - Architecture document
    - Security stories
  - Generate:
    - Security assessment
    - Threat model
    - Security test cases
    - Security stories to feed into the backlog
Core Development: Implementation and Code Review¶
- During implementation
  - Developers call skills in-context in their IDE:
    - “Is this JWT handling secure?”
    - “Check this API route for common auth issues.”
  - Agents can be invoked periodically for deeper checks on key modules.
- Code review
  - Use a security agent as a specialized reviewer:
    - Run over diffs, not just whole files.
    - Map findings back to the rule cards that were violated.
    - Output concrete, testable recommendations (often directly convertible into security stories and test cases).
Note
This keeps the security source of truth in one place (rule cards and packs), while letting different roles access it in ways that match their workflows.
Personas and How They Use It¶
The same knowledge packs are surfaced differently for different roles:
- Architect
  - Uses skills to check architecture documents against security requirements.
  - Invokes a security agent for full design reviews.
- Product Owner / PM
  - Ensures security requirements are explicitly captured in PRDs and stories.
  - Uses outputs (security stories, test cases) to plan and track work.
- Security Engineer / Analyst
  - Curates rule cards and packs.
  - Owns the security validation of the agents.
  - Reviews high-risk findings and refines the rules.
- Developer
  - Uses skills inline in the IDE for quick checks.
  - Treats the security agent as a specialised reviewer alongside human reviewers.
Note
This aligns with the broader guide’s stance: LLMs augment, not replace, the existing roles. The Policy-as-Code engine simply gives each role a more usable interface to the organization’s security brain.
Example Workflow: From Policy to Actionable Checks¶
A concrete end-to-end flow might look like this:
1. Ingest standards
   - Security team selects a subset of OWASP / ASVS and internal standards as initial scope.
2. Create rule cards
   - Rule cards are authored and reviewed for clarity, conditions, and references.
3. Build knowledge packs
   - Packs are compiled by domain (auth, secrets, etc.) and validated.
4. Wire into agents and skills
   - Security agents load the full set of relevant packs.
   - Skills are exposed for common queries (“check my auth”, “review my logging”).
5. Connect to engineering workflows
   - `.claude/` is added to the repo.
   - CLAUDE.md is updated so that:
     - Certain files trigger automatic security checks.
     - Developers have easy access to security skills.
6. Iterate based on findings
   - False positives and gaps feed back into rule card revisions.
   - New standards or guidance become new or updated cards.
Success
Once this loop is running, your security standards stop being static, irrelevant documents and become a living, executable policy-as-code system that evolves along with your software.
Takeaways¶
Takeaways
- Large, static security documents are converted into atomic rule cards and knowledge packs.
- The same packs power skills, agents, search, and workflow automation via CLAUDE.md.
- Teams can start quickly with pre-built security content and then extend with internal policies.
- The engine fits naturally into the Software Engineering 1.0 Redux lifecycle: planning, design, implementation, and review.
- Security becomes embedded and executable, not a separate after-the-fact checklist.
Glossary¶
Rule Cards : Small, testable units representing a single, precise security requirement. Each card contains a condition (when it applies), a requirement (what must be true), rationale/references (why it matters), and example checks. Think of them as "linters for architecture and security design."
Knowledge Packs : Compiled collections of rule cards grouped by security domain (authentication, secrets, logging, etc.). Packs are normalized into machine-friendly JSON format that can be efficiently loaded by agents and skills.
Context Injection : The process by which skills and CLAUDE.md automatically insert specialized instructions and security rules into Claude's conversation context. Skills use progressive disclosure to load minimal metadata initially, then full content when semantically matched.
Progressive Disclosure : A token-efficiency strategy where skills load information incrementally: (1) discovery phase loads only name + description, (2) activation phase loads full skill instructions when matched, (3) on-demand phase loads detailed rules (JSON) only if needed.
Context Modifiers : Skills that inject specialized instructions and can constrain Claude's behavior by modifying allowed tools, model selection, and execution permissions. Distinguished from agents which are task executors in separate contexts.
LLM Reasoning : The mechanism by which Claude autonomously decides when to activate skills. Based purely on language understanding of skill descriptions and user requests—not algorithmic pattern matching, embeddings, or ML-based intent detection.
Agent SDK : Claude Code's framework for explicit parallel execution of multiple sub-agents. Agents run in separate contexts (not the main conversation window) and enable programmatic invocation for CI/CD pipelines and automated validation.
References¶
- GenAI Security Agents Repository - Source code for policy-as-code engine with rule cards and knowledge packs (2025)
- Claude Skills Deep Dive - Technical analysis of skills architecture and activation mechanisms (October 26, 2025)
- Claude Code Internals Analysis - Reverse engineering of context injection and prompt augmentation (2025)
- Agent Skills for the Real World - Anthropic's official documentation on agent skills (2024)
- Technical Deep Dive into Context Injection - Detailed architecture analysis (2025)
- How to Make Claude Code Skills Activate Reliably - Practical, test-driven guide to improving Claude Code skill activation reliability using hook strategies (incl. “forced eval” vs “LLM eval”), with results from 200+ tests and implementation details for .claude/hooks/ and .claude/settings.json (November 16, 2025)
DARPA AI Cyber Challenge Tools Comparison¶
Overview
This report is auto-generated by ChatGPT 5.
Prompt
I want a report on the tools listed on https://archive.aicyberchallenge.com comparing their operation and features. I want a table comparison and a detailed comparison. I want to highlight what traditional non-AI tools were used - and what AI tools were used and how.
AIxCC Finalists and Tools Comparison¶
Introduction¶
The DARPA AI Cyber Challenge (AIxCC) tasked teams with building fully automated Cyber Reasoning Systems (CRSs) to find and fix vulnerabilities in real-world open-source software[1][2]. Seven finalist teams advanced from the 2024 semifinals to the 2025 finals, each open-sourcing their CRS. These systems varied in design – some enhanced traditional program analysis techniques with AI, while others were AI-first in their approach[3][4]. This report compares all finalist tools (including their semifinal and final iterations) on their core operation, technical approach, key features, use of traditional vs. AI techniques, performance results, and documentation. We also include SHERPA, an AIxCC-developed harness-generation tool released alongside the finalists. Table 1 summarizes the tools and their characteristics, followed by detailed analyses of each.
Comparison Table of AIxCC Tools¶
| Tool (Team) | Core Approach & Features | AI-Based Techniques | Traditional Techniques | Performance (Finals) |
|---|---|---|---|---|
| Atlantis (Team Atlanta) | Ensemble of multiple CRSs for C, Java, etc., run in parallel (N-version design) for robustness[5]. Integrates diverse strategies (fuzzing, concolic, static) for max coverage[6]. | LLMs in three modes: Augmented (assist fuzz/tools), Opinionated (provide hints), Driven (autonomous agents)[7][8]. Fine-tuned custom Llama2-7B models specialized for C code analysis[9]. LLM-generated input mutators, grammar dictionaries and PoV exploit agents[10]. | Heavy use of fuzzing (LibAFL, libFuzzer, AFL++ ensembles) and concolic execution (extended SymCC)[11]. Custom directed fuzzers for C/Java[11]. Kubernetes-based scaling; disabled non-PoV patching to avoid false patches[12]. | 1st place – 43 vulns found, 31 patched[13]. Highest overall score (393)[13]. Most “real-world” vulns patched (achievement: Prince of the Patch[14]). |
| Buttercup (Trail of Bits) | Hybrid fuzzing+analysis pipeline with LLM-assisted input generation and multi-agent patching. Emphasizes accuracy (90%+)[15] and broad CWE coverage[16]. Cost-efficient design using cheaper LLMs[17][18]. | LLM-generated test cases augment libFuzzer/Jazzer to hit complex conditions (e.g. SQLi, path traversal)[19]. Multi-agent patching with separate agents for vulnerability analysis and fix generation[20]. Used “non-reasoning” LLMs at scale (100k+ queries) for efficiency[21][17]. | Integrated coverage-guided fuzzers (libFuzzer for C/C++, Jazzer for Java) with grammar and dictionary inputs[20]. Static analysis (tree-sitter parsing, code queries) to locate bug patterns[22]. Conservative patch strategy: no patch submitted without a triggering PoV[12]. | 2nd place – 28 vulns found, 19 patched[23]. Score 219. Achieved 90%+ accuracy and 20 unique CWE exploits[16][24]. Noted for largest patch (300+ lines, “LOC Ness Monster” award)[25]. |
| RoboDuck (Theori) | “LLM-first” autonomous CRS that minimizes classical fuzzing[26]. Uses an agent-based workflow to find, exploit, and patch bugs without human input[27]. Uniquely capable of generating proofs of vulnerability without traditional fuzzing or symbolic execution[27]. | Heavy use of LLM agents for code understanding, exploit generation, and patching. Two different LLMs scan code: one per function, one on large code chunks[28]. Agents follow reverse-engineering style playbooks constrained to stay on task[29]. LLMs generate PoVs for complex formats (URLs, binary protocols) by reasoning about program semantics[30]. | Used static analyzer Infer to surface bug candidates (interprocedural null derefs, overflows, etc.)[31]. Basic fuzzing as backup (leveraging OSS-Fuzz harnesses with libFuzzer)[32]. Some integration of SARIF static reports for validation. Aggressive patch strategy: allowed submitting speculative patches (up to 2 without PoV per confirmed PoV) to maximize scores[12][33]. | 3rd place – 34 vulns found, 20 patched[34]. Score 211. Earned “Secret Sauce” for uniquely exploited challenge[35]. Spent \~half the top teams’ LLM budget yet remained competitive[36][37]. |
| Fuzzing Brain (All You Need) | AI-centric “brain” directs fuzzing and analysis. Most AI-driven system – LLMs handle vulnerability reasoning, strategic decisions, and code writing (≈90% of codebase written with AI aid)[4]. Emphasizes quick finds (scored within 5 minutes on a task, “Faster Than Pizza Delivery”) and high SARIF accuracy[38]. | Uses thousands of concurrent LLM agents in parallel to analyze code and hypothesize exploits[39]. Primary vulnerability detection via AI reasoning; traditional fuzzers only kick in as a fallback validation[39]. LLMs used for system architecture decisions and patch generation. Achieved >80% correctness in static finding classification (SARIF) via AI triage[38]. | Leveraged fuzzing pipelines (implied by team name and achievements) as secondary confirmation for AI findings. Likely used standard fuzzers and runtime instrumentation to verify AI-identified bugs and patches (the system won “-Ofast” first-to-find real C bug award)[38]. Patching likely also AI-driven with some traditional compile/test loops for validation. | 4th place – 28 vulns found, 14 patched[40]. Score 154. Earned achievements for fastest find and high SARIF precision[38]. Demonstrated extremely AI-forward methodology[4], though with higher cost per point due to massive parallelism[37]. |
| Artiphishell (Shellphish) | Massively multi-agent CRS (60+ AI agents coordinating)[41] built by an academic hacker collective. Agent-based architecture tackles tasks in parallel (parsing, fuzzing, exploit generation, patching) with minimal human tuning[41]. Enhances classic fuzzing with AI-driven grammar inference (“Grammar Guy” subsystem)[42]. Focus on telemetry and reliability. | Agents use OpenAI, Anthropic, and Google Gemini LLM APIs (multi-LLM)[43] for various roles. Grammar Guy LLM agent evolves input grammars from coverage feedback to fuzz complex formats (SQL, URLs, binary data)[42] – a major LLM budget consumer that boosted bug-finding. Other agents monitor and “babysit” LLM outputs to keep them on track (role-based prompting, etc.)[44][45]. AI also used for rich telemetry logging of CRS reasoning steps (earned “Best Telemetry” honor)[46]. | Built on years of CRS experience (Shellphish was a CGC 2016 finalist). Uses traditional fuzzers (e.g., AFL-based or QEMU-based) guided by AI-generated grammars[42]. Likely employed symbolic execution or static analysis modules from prior frameworks (angr/Driller heritage) enhanced with AI. Very cautious patch policy – no patches without PoV to ensure 95%+ patch accuracy[46][12]. | 5th place – 28 vulns found, 11 patched[47]. Score 136. Extremely high patch correctness (over 95% success)[46] by design. Won “The Limit Does Not Exist” for creating a PoV over 1 MB[48]. Heavy use of AI (multi-LLM) yielded solid finds, but slower patch output kept score lower. |
| Bug Buster (42‑b3yond‑6ug) | High-scalability fuzzing platform with advanced scheduling and some AI assistance. Employed BandFuzz reinforcement-learning scheduler to coordinate up to \~2,000 fuzzer instances in parallel[49] for maximal coverage. Explored static program slicing to focus fuzzing on bug-prone code (though faced toolchain challenges)[50]. Introduced “super patch” concept to fix multiple bugs in one go[9][51]. | Applied AI in specific roles: RL-based orchestration of fuzzers (learned which targets to fuzz more)[52], LLM-generated fuzzing seeds (the team trained an LLM to propose interesting inputs)[53][54], and LLM-assisted patch generalization (“super patches” that address root causes of multiple crashes)[9][51]. Also used LLMs to analyze crash reports into structured “vulnerability objects” for triage planning[55]. | Relied on a wide array of traditional techniques: large-scale AFL++-based fuzzing (3 independent fuzz pipelines)[56], plus static analysis integration (SARIF parsing, CodeQL queries, etc. for validation)[52]. Program slicing/static path analysis was attempted to guide fuzzing but had compatibility issues[50]. Used a mix of kernel fuzzing and user-space targets; implemented crash deduplication logic to cluster same root cause. Patching was ambitious but buggy – their “super patches” fixed multiple vulnerabilities at once when it worked[51]. | 6th place – 41 vulns found, 3 patched[57]. Score 105. Second-highest PoV points (strong bug-finding), top SARIF accuracy[58], but patch failures hurt overall rank[58]. Achieved “Czar of the SARIF” (most correct static assessments) and “Hunger for Knowledge” (peaked at 1K+ LLM requests/min)[59] – demonstrating brute-force scale. Final result marred by a patch pipeline bug, illustrating the fragility of such a complex system[60][58]. |
| Lacrosse (SIFT) | Modernized legacy CRS (evolved from a 10-year-old platform)[61][62], augmented with AI. Emphasized proven fuzzing techniques: deployed 300–500 fuzzing agents (comparable scale to ToB) orchestrated by a central agent “Optimus Zero”[55]. Focused on stability and slow-but-steady analysis. | Used LLMs for higher-level reasoning and crash analysis. For each crash, an LLM created a detailed “vulnerability object” describing the bug, classification, and patch plan[55]. LLMs provided semantic guidance to fuzzers and consensus-based patch decisions (multiple AI models had to agree before a fix)[55][63]. | Strong bias toward traditional fuzzing for PoV generation (minimized fancy AI bug-hunting)[64]. Many fuzzers running in parallel with a global state manager (Optimus Zero) to distribute tasks[55]. Employed standard static tools to triage crashes. Patch submissions used a confidence algorithm and LLM consensus to avoid incorrect fixes[63], reflecting a cautious, engineering-driven approach. | 7th place – 1 vuln found, 1 patched[65]. Score 10. Despite limited scored results, received “Consensus Contributor” for most submissions (albeit mostly duplicates) and “Raiders of the Lost PoV” for finding a unique real bug others missed[66]. Suggests the system was robust but perhaps too conservative and slow to compete within time limits. |
Table 1: Summary of AIxCC finalist tools (with team names) and their characteristics in the final competition. All tools were open-sourced[67]. Performance metrics are from the final scored round at DEF CON 33[13][23]. "PoV" = proof-of-vulnerability exploit.
Detailed Tool Analysis and Comparison¶
In this section, we examine each tool in depth, highlighting how they work, their unique strengths or weaknesses, and noteworthy differences. We also note documentation availability (e.g. open-source repositories or papers) and any evolution from semifinal to final round.
Atlantis (Team Atlanta) – Multi-Engine Ensemble CRS¶
Core Operation & Approach: Atlantis is an ensemble system composed of multiple specialized CRSs that run in parallel[5]. Team Atlanta split efforts into independent engines for C/C++ bugs, Java bugs, a language-agnostic “Multilang” engine, plus separate components for patching and SARIF (static analysis report) handling[68]. This N-version programming design maximized fault tolerance and coverage – if one approach failed or crashed, others continued unaffected[69][70]. For example, Atlantis-C aggressively instrumented code with heavy compile-time modifications to find memory errors, whereas Atlantis-Multilang took a lighter, instrumentation-free approach better suited for rapid, general bug discovery[71][72]. By diversifying strategies, Atlantis achieved robust overall performance. Team Atlanta prioritized engineering rigor and stability – a critical bug in a patch-skip heuristic nearly derailed them hours before the deadline, underscoring their emphasis on robustness testing[73][74].
Traditional Techniques: Unlike some competitors that abandoned classic program analysis, Atlantis invested heavily in traditional methods, scaled up to real-world software. They integrated multiple fuzzing engines (both coverage-guided and directed fuzzers) and even ran them concurrently (ensemble fuzzing), as research shows this finds more bugs than any single fuzzer[75][76]. Their toolkit included LibAFL (with custom mutators for C and Java), AFL++, and Jazzer for Java fuzzing[11]. They also extended SymCC (a symbolic execution tool) to work on large codebases and built custom concolic executors for Java[11]. Each of these components required significant engineering to handle the scale of OSS projects[77][78]. Atlantis also employed directed fuzzing guided by static analysis: static analyzers would flag potential bug locations, and fuzzers then specifically targeted those functions or code paths[79]. In short, Team Atlanta did not give up on conventional program analysis – they adapted and scaled it, supplementing with AI rather than replacing it[80][81]. For patching, they took a conservative stance; they completely disabled any patch submission that wasn’t backed by a PoV exploit, to avoid penalties for incorrect patches[12]. This caution paid off with a high patch success rate.
AI-Based Techniques: Although rich in traditional tooling, Atlantis strategically infused AI at multiple levels. They describe three tiers of LLM integration[7][82]: (1) LLM-Augmented components, where AI assists existing tools (e.g. generating seed inputs for fuzzers, suggesting likely vulnerable API calls that fuzzers should focus on, summarizing static analysis output); (2) LLM-Opinionated components, where AI provides hints or “opinions” that the system treats as suggestions – for instance, a Harness Reverser AI that guesses input file formats or network protocols used by the target software, to help fuzzers produce valid inputs[83][82]; and (3) LLM-Driven components, where an agentic AI has autonomy to navigate the codebase. An example of the last is MLLA (Machine Learning Lab Assistant) – an AI agent that reads through a repository and autonomously crafts exploit inputs (“blobs”) for suspected bugs, especially for challenges in “delta mode” (where a recent code diff may contain the bug)[8][84]. Team Atlanta uniquely leveraged fine-tuned LLMs: they fine-tuned a 7-billion-parameter Llama2 model on security-relevant data (especially for C language analysis) to use as an internal reasoning engine[9]. This gave them a proprietary AI tuned for vulnerability patterns in C/C++ code (buffer overflows, use-after-free, etc.), potentially with lower latency and cost than calling large external APIs. LLMs were also used in directed test generation – e.g. Atlantis had language-specific prompt templates to generate exploit code or inputs once a vulnerability was suspected[79]. Another novel use was creating function-level dictionaries of sensitive values: their AI would analyze a function and output a list of keywords or values (e.g. special file names, command strings) that a fuzzer could mutate with[79]. Overall, Atlantis treated AI as powerful assistants that were carefully “babysat” and orchestrated. The team developed an abstraction layer called LiteLLM to interface with multiple AI providers and to handle failures (rate limits, token limits, etc.) gracefully[85][86]. This prevented vendor lock-in and let them switch between models like GPT-4, Claude, and others as needed, or even use their fine-tuned model in place of an API call.
Performance & Results: Atlantis delivered the strongest performance of the finals, outscoring all others by a wide margin. It discovered 43 vulnerabilities and successfully patched 31 of them, earning a total of 393 points – the highest of any team[13]. Notably, Atlantis excelled in both quantity and quality of patches. It earned the special award “Bundle Baron” for submitting the most bundle (PoV+patch) combinations that scored[87], and “Most Real-World Vulns Patched” (the Prince of the Patch title) for fixing the greatest number of genuine vulnerabilities[14]. The system’s diversity paid off – it was able to find bugs across many categories (CWE coverage) and did so quickly enough to also gain a “Professional Assassin” achievement for having over 95% of its PoVs be successful[88]. Interestingly, Atlantis found a few 0-day vulnerabilities in real open-source code (3 in Java, 3 in C) that were not planted by DARPA[89][90], showing its real-world efficacy. In terms of efficiency, Team Atlanta used the largest compute budget (nearly \$74k) and a substantial LLM budget (~\$29k) during the finals[18], reflecting their at-scale approach. The cost per point for Atlantis was about \$263[91] – higher than some more cost-optimized teams, but still reasonable given the victory. Documentation-wise, Team Atlanta has provided extensive resources: they published a technical report on their blog with system design insights[5][92] and have open-sourced their code (organized into components like Atlantis-C, Atlantis-Java, etc. on GitHub[93]). The comprehensive approach of combining “all of the above” – classical analysis with cutting-edge AI – was validated by their win.
Strengths & Weaknesses: Atlantis’ strength was clearly its balance and thoroughness. It left no stone unturned: fuzzers, static analyzers, custom AIs, all worked in concert. This resulted in a top score and high reliability (very few incorrect patches). A possible weakness of this approach is complexity – maintaining multiple CRSs and integration between them is a heavy engineering lift, evidenced by the one-line bug that almost broke the system[94][74]. However, Team Atlanta mitigated that through aggressive testing and redundancy. Another challenge is cost: they threw significant computing resources at the problem, which might be hard to replicate outside of a funded competition. But overall, Atlantis demonstrated that hybrid systems can outperform purely AI or purely traditional ones, and its design to “not trust any single method too much” proved very effective.
Buttercup (Trail of Bits) – AI-Augmented Fuzzing & Patching¶
Core Operation & Approach: Buttercup (Trail of Bits’ CRS) is built on the idea of augmenting classic vulnerability discovery tools with AI, in a focused and cost-effective way. The system’s core loop still resembles a traditional fuzzing and static analysis pipeline, but nearly every stage is turbocharged by machine learning. For instance, Buttercup uses libFuzzer (for C/C++) and Jazzer (for Java) on each target program, but it improves their effectiveness by supplying intelligent inputs generated by LLMs rather than relying purely on random mutations[19][20]. The team specifically calls out that their LLMs could generate inputs capturing complex attack patterns that simple fuzzers might miss – e.g. a well-formed SQL injection string or a malicious file path – which were then fed into the fuzzers’ corpus[19]. This approach significantly sped up code coverage and bug trigger time (an important factor given limited competition time)[19][22]. Buttercup also integrated static code analysis: it used parsers (like Tree-sitter) and custom query scripts to identify suspicious code constructs (e.g. unsafe uses of strcpy, or use of exec() calls)[22]. These were used to guide fuzzing or to prioritize which issues to patch.
Where Buttercup truly innovated was in patching. Trail of Bits built a multi-agent patch generation system that emphasizes separation of concerns[20]. For example, one agent might specialize in isolating the buggy code snippet, another in drafting a fix, and a third in evaluating the fix’s correctness. They intentionally avoided using the most expensive, “generalist” AI models (like GPT-4) in favor of what they call “non-reasoning LLMs”[17][37] – presumably smaller or fine-tuned models that are cheaper to run at scale. This is reflected in their costs: Buttercup made over 100k LLM requests during the finals but still only spent \~$21k on LLM API usage, far less than Team Atlanta[21][37]. The team noted that the top two finishers (Atlanta and ToB) spent the most on AI queries, suggesting that judicious use of LLMs was “money well-spent” in terms of points[36].
Traditional Techniques: Despite the AI enhancements, Buttercup is grounded in well-understood tools. It leans heavily on coverage-guided fuzzing (libFuzzer for native code, Jazzer for JVM) to generate concrete PoVs. The fuzzing campaigns were likely distributed across multiple cores/nodes for speed. They also utilized static bug-finding tools: although not explicitly named in sources, Buttercup’s use of tree-sitter implies they were parsing source code abstract syntax trees to find patterns (like searching for known dangerous functions, etc.)[22]. These static findings could then be confirmed by fuzzing or passed to the patcher. Another classic tool category Buttercup likely used is sanitizers (AddressSanitizer, etc.) to catch memory errors, given their background in security research. For exploit generation (PoVs), Buttercup seems to have relied mostly on fuzzers to create crashing inputs, rather than symbolic execution or manual exploit crafting, as no mention of those is made – instead, they improved fuzzers via AI seeds. On the patch side, Trail of Bits exercised caution: like Team Atlanta, they never submitted a patch without an accompanying PoV[12]. Their blog suggests they believed patch accuracy was a key differentiator in winning second place[24]. To further ensure quality, they built cross-validation: Buttercup would test its candidate patches against known exploits (its own and perhaps even those of other teams if it obtained them) to avoid releasing a faulty patch[24][63]. This strategy aligns with Trail of Bits’ strength in secure coding – they treated each patch almost like a mini code review, aided by automation.
AI-Based Techniques: Buttercup’s motto could be “use AI to play to our tools’ strengths.” A prime example is their use of LLMs to generate specialized fuzzing inputs[19]. Instead of having an AI blindly hunt for bugs, Buttercup asks it to produce inputs that exercise tricky code paths. The LLM likely has some understanding of what input format or API might reach deep into the code – for instance, if fuzzing an HTTP server, an LLM could generate a variety of HTTP request packets with unusual headers to trigger parsing edge cases. These AI-crafted inputs are then incorporated into the fuzzers’ corpus, after which libFuzzer’s fast mutation engine takes over to refine and find a crashing payload[19]. This human-inspired but machine-augmented approach improved harness saturation time, meaning the time to hit interesting code was reduced[19]. Another AI aspect is multi-agent coordination for patching[20]. Buttercup’s architecture separates vulnerability detection from repair: one agent (or LLM prompt) can focus on understanding what the bug is and marking its location, while another focuses only on generating a fix in code[20]. By dividing the task, they reduce the likelihood of one large model hallucinating or making a complex error. They also mention using cheaper models that lack chain-of-thought reasoning[17]; this implies they might use multiple simpler completions or a fine-tuned code model (like Codex or Code Llama) for tasks like patch generation, rather than an expensive generalist model. Impressively, Buttercup managed these AI resources to achieve the lowest cost-per-point among top teams – \$181/point, indicating high efficiency[91][95]. In addition, an AI crash analysis component likely triaged fuzzer crashes: similar to others, Buttercup probably used an LLM to read sanitizer logs or stack traces and group crashes by root cause, filtering out false positives (like benign out-of-memory errors). This kind of AI filtering improves precision by automatically discarding non-security crashes, so the team only spends time patching real issues[96] (Buttercup indeed scored high on accuracy).
Performance & Results: Buttercup finished as the runner-up (2nd place), with 28 vulnerabilities found and 19 patches applied in the finals[23]. Its total score (219 points) narrowly edged out Theori for the second spot[97][98]. The breadth of bugs Buttercup found was remarkable: it successfully produced proof-of-vulnerabilities across 20 distinct CWE categories, ranging from memory errors to injection flaws[16]. This earned them the “Cornucopia” achievement for diversity of CWEs[15]. Moreover, Buttercup maintained \~90% accuracy, meaning very few of its submissions were incorrect or invalid[99]. The system even patched one challenge with a huge 300+ line fix, winning the quirky “LOC Ness Monster” award[25] – demonstrating its ability to handle complex issues requiring substantial code changes. The team’s focus on correctness over sheer quantity likely contributed to their success over teams that found more bugs but lost points to bad patches (e.g. 42-b3yond). In terms of timeline, Buttercup was also quick – it scored its first points in under 5 minutes on at least one task (achievement: “Faster Than Pizza Delivery”, which actually went to FuzzingBrain[38] but Buttercup also achieved a \<5min score per their blog[100]). All these indicate a well-rounded performance. For documentation, Trail of Bits provided an excellent post-mortem blog post[101] detailing these insights and even an earlier “tale of the tape” analysis[3][42]. Their open-source code is available on GitHub and is said to be runnable on a laptop[102] – illustrating their emphasis on practical, lightweight design.
Strengths & Weaknesses: Buttercup’s strength lies in augmented automation – it took dependable techniques (fuzzing, static checks) and made them better with AI, rather than trying to reinvent the wheel entirely. This made it efficient and accurate. A key strength was resource management: using cheaper LLMs and fewer compute cycles but still achieving results. If there’s a weakness, it might be that Buttercup might miss bugs that require deep reasoning beyond input generation – for instance, logic flaws or complex exploit scenarios that fuzzing can’t easily uncover. Their reliance on simpler models might also limit some capabilities (they explicitly avoided “reasoning” models, which might be less effective for patches that need creative insight, though their results show little evidence of struggle there). Another potential drawback is that by not submitting speculative patches, they may have left some points on the table – Theori’s strategy of patching without PoV could find latent vulnerabilities, but Buttercup chose not to risk it. Nonetheless, their second place indicates this strategy was wise. In summary, Buttercup was an exemplary demonstration of human-in-the-loop AI design: it encoded a lot of security expertise and let AI handle the tedious or large-scale parts, leading to a very balanced system.
RoboDuck (Theori) – LLM-First Autonomous Security Agent¶
Core Operation & Approach: RoboDuck is Theori’s cyber reasoning system, notable for being highly AI-centric. The team explicitly set out with an “LLM-first” design philosophy, using large language models for all aspects of the challenge and only falling back on traditional techniques as a safety net[103][104]. This was a bold approach: unlike others that started from fuzzers or static analyzers, Theori treated the LLM as the primary engine driving bug discovery, exploit development, and patch writing. In fact, Theori’s blog mentions their CRS can produce working PoV exploits without using fuzzing or symbolic execution at all[27], which is quite unique. They implemented a full pipeline of autonomous agents that read the code, reason about potential vulnerabilities, generate an input to trigger each bug, and then create a patch[27]. This means RoboDuck behaves almost like a team of virtual security researchers collaborating.
The system pipeline can be summarized as: Bug Finding → Bug Filtering → PoV Generation → Patch Generation → “Scaffolding & Glue” (integration)[105]. For bug finding, RoboDuck actually does use a bit of traditional analysis: Theori leveraged Facebook’s Infer static analyzer to find memory issues and null dereferences in C/C++ code[31]. However, Infer’s reports were extremely noisy (99.9% false positives in their workflow)[106], so the LLM agents were tasked with reviewing Infer’s output and deciding which reports likely represent real bugs[28]. Separately, they employed two distinct LLM-based static analysis methods: one agent looked at code function-by-function to spot potential bugs, and another looked at larger “blob” chunks of code (spanning multiple functions or files) to catch issues that manifest in broader context[107]. This dual strategy was probably to mitigate LLM context window limits and vantage – fine-grained vs. coarse-grained. After initial bug candidates were identified, a filtering agent would cross-check them to eliminate duplicates or those deemed non-exploitable.
For PoV generation, RoboDuck did something truly novel: it had LLM agents craft exploits (inputs) directly for the suspected bugs[27][108]. This is challenging because it requires the AI to understand how to trigger the bug in practice. Theori’s advantage was to treat this like a reverse engineering problem – their agents would simulate the steps a human exploit developer might take, but entirely in code and memory space. They constrained the LLMs with what they call “reverse engineering workflows” to prevent the AI from going off-track[29]. For example, an agent might perform these steps: Read the vulnerable function; Identify what input parameters or file data influence the vulnerability; Then propose a concrete input that would hit the vulnerability. If the vulnerability was, say, a buffer overflow when parsing a certain file format, the agent would try to produce a file in that format that overflows the buffer. Notably, Theori did not rely on random fuzzing to find the exact trigger – they tried to reason it out with the model’s knowledge. This is where they saved a lot of the fuzzing budget that others spent; Theori’s LLM query spend (~\$11.5k) was half that of Buttercup, and they spent a moderate amount on compute (~\$20k), meaning they weren’t running huge fuzzer farms[18]. They did have fuzzing “available for backup”[103] – likely they still ran some fuzzers using existing harnesses to ensure that if the AI missed something, it could be caught. But the emphasis was clearly on AI.
Patch generation in RoboDuck was done via (unsurprisingly) LLMs as well. Once an exploit and bug were confirmed, an agent would modify the code to fix the vulnerability, ensuring it doesn’t break functionality. The blog implies most teams, including Theori, used similar high-level approaches for patching – essentially instructing an LLM to produce a code diff that addresses the problem[109]. The interesting twist is Theori’s risk-taking on patches: they developed a model to decide when to submit patches without PoVs, implementing an aggressive 2:1 patch to PoV ratio strategy[12][33]. This means if RoboDuck thought a piece of code was buggy but couldn’t immediately produce a PoV, it might still submit a speculative patch (but capped at two such speculative fixes for each confirmed bug). This was meant to maximize points under the scoring rules. It was a calculated risk, as bad patches hurt the accuracy multiplier, but Theori bet that their AI could identify real bugs even if it failed to exploit them in time – a bet that seems to have paid off moderately well.
Traditional Techniques: Although de-emphasized, Theori’s CRS did incorporate traditional methods in supporting roles. We’ve mentioned Infer static analysis for C/C++ bug hints[31] – they even patched some bugs in Infer itself to improve its output for AIxCC’s needs[106]. They also mention using some fuzzing and static analysis as backup[103]. Specifically, they would run the standard OSS-Fuzz harnesses for each challenge with libFuzzer, mainly to ensure they didn’t miss any crashes; but interestingly, Theori notes that no other system they know of can use LLMs at repository scale to generate inputs that trigger bugs, which RoboDuck did[110][111]. This suggests that their fuzzers might not have been the primary source of PoVs. They likely also used sanitizers (ASan, UBSan) to catch memory violations when running either fuzzers or AI-generated tests, to know when a bug was triggered. The static analysis results were integrated via SARIF as well, since they had to assess provided static analyzer reports from the organizers – Theori presumably leaned on their LLM to evaluate those too (the competition provided some SARIF that could be false positives, worth 1 point each if correctly assessed[2]). On the exploit side, Theori did not explicitly mention any use of symbolic execution or concolic testing – they appear to have consciously avoided those, focusing on LLM reasoning instead[112]. Finally, for patch testing, they likely did run the patched program against the PoV to verify it stops the crash (common to all teams). Theori’s approach to traditional tech can be seen as “trust the AI, use classical tools as verification.” It’s the inverse of teams like Lacrosse or 42-beyond, which trusted fuzzers and used AI as an assistant.
AI-Based Techniques: RoboDuck’s entire DNA is AI. The system uses multiple LLMs, possibly from different vendors (they referenced multi-LLM usage for static analysis)[107]. A notable design choice: They used two different LLMs in static bug finding, which hints they might have used e.g. GPT-4 for one pass and Claude or a fine-tuned model for another, to get varied “opinions” on what code is risky[107]. This is akin to having two experts review the code. The results of those were likely merged. The agentic aspect is key: Theori’s blog posts (and follow-up ones they published on their site) describe how they built effective LLM agents with constrained behavior[113][114]. For example, they might use a pseudo-code interpreter or step-by-step prompting to force the agent to reason systematically about an exploit (“If the buffer is of size N and input length is M, choose M = N+1 to overflow…”). They also mention preventing the AI from “wandering” by constraining workflows[29] – perhaps by breaking tasks into smaller prompts or using guardrail code to stop nonsense actions.
One major advantage of AI they exploited was handling complex input formats. Traditional fuzzers often struggle to generate inputs that are valid enough to pass basic parsing (e.g., a correct image file header). Theori’s agents, with their knowledge base, could generate syntactically correct and semantically plausible inputs for these formats[108]. For instance, if the target was an XML parser, an AI can produce a well-structured XML that still triggers a bug deep inside – something a dumb fuzzer might not achieve quickly. The AI’s capability to understand protocols (like a well-formed URL or PDF) was a differentiator for RoboDuck[108]. Additionally, Theori built an entire public GitHub repository (theori-io/aixcc-public) with logs and agent traces for transparency[115], implying confidence in their AI agent approach and a desire to analyze it post-competition.
Performance & Results: RoboDuck placed 3rd in the final competition. It found 34 vulnerabilities and patched 20 of them[34], for a total score of 211 points. This was just slightly behind Trail of Bits’ 219 points, a very close race for second. Theori’s system had a strong showing in vulnerability finding – it actually found the second-most vulns after Team Atlanta. Its patch count was a bit lower proportionally (perhaps due to their riskier patch strategy or some speculative patches not yielding points). Still, they demonstrated the viability of an AI-heavy approach. The stats also reveal Theori used about half the LLM budget of the top two teams but achieved \~95% of their scores[36][37], indicating good efficiency. Achievements Theori earned include “Thrifty” (they had the least dollars spent per vuln patched among teams)[116][37] – a testament to their economical use of AI and compute. They also got “Secret Sauce” for submitting a unique PoV that no one else did[117], likely attributable to their creative AI-driven exploit generation finding a scenario others missed. Theori didn’t snag the SARIF or speed trophies, implying their approach was potent but perhaps a bit slower early on or not as focused on static warnings (understandable since they had so many false positives to filter through Infer). Documentation from Theori is extensive: they published multiple blog posts (overview, LLM agent design, agent trajectory examples) sharing their methodology[118][27], and their code (though somewhat curated) is on GitHub.
Strengths & Weaknesses: The strength of RoboDuck is clear – it showed that AI agents can perform end-to-end hacking tasks with minimal human intervention. This is a preview of what AI-driven security might become. It excelled at complex logical reasoning tasks (like exploit generation for tricky formats) and wasn’t limited by needing an instrumented harness for everything – a smart agent can find a bug in any part of the code. Additionally, it achieved a lot with fewer resources (cost-effective). However, the weaknesses are the flip side: an LLM-first system can be unpredictable. Theori themselves likely struggled with managing the “creativity” of the AI – hence the need to babysit and constrain it[119][45]. When their AI agents were right, they were brilliant; when they were wrong, valuable time could be wasted chasing phantom bugs or formulating a PoV that doesn’t quite work. Another weakness was patch accuracy: by pushing more speculative fixes, they risked penalties. If their AI misjudged a bug and patched something incorrectly, that would hurt their score multiplier (though they evidently balanced this with the 2:1 model). In finals, Theori’s accuracy multiplier was a bit lower than teams like ToB or Shellphish (since they submitted some patches without proof)[12]. But Theori’s daring approach clearly paid dividends, and it underscores how far pure AI reasoning has come – nearly matching a top security firm’s tool with half the bug-finding being done by classical fuzzers.
All You Need is a Fuzzing Brain – Massively Scaled AI System¶
Core Operation & Approach: All You Need is a Fuzzing Brain (often shortened to FuzzingBrain) was the most AI-forward system among the finalists[4]. This team treated the CRS as a large-scale AI orchestration problem. According to Trail of Bits’ analysis, FuzzingBrain used thousands of concurrent AI agents in parallel to reason about vulnerabilities, essentially brute-forcing the unreliability of individual LLMs by sheer quantity and diversity[39]. In other words, instead of one very smart agent, they had many “pretty smart” agents trying different approaches simultaneously. The team integrated these agents with an underlying fuzzing platform – the name suggests a focus on fuzzing, and indeed they did have traditional fuzzers in the loop, but in a secondary role. The architecture likely involved an AI coordinator that would spin up subtasks (like analyzing a particular source file or function) and assign them to AI workers. These workers might generate possible inputs or identify suspect code paths. Then, some of those hypotheses would be validated either by running the program or by cross-checking with other agents.
The team’s emphasis on AI is highlighted by the claim that 90% of their entire codebase was written using AI assistance[4]. This is a fascinating detail: they apparently dogfooded AI to even build their CRS (e.g., using Copilot or ChatGPT to generate large chunks of their code). It implies the team culture was to lean into AI at every opportunity. They also gave the AI high-level strategic roles – for instance, using AI for “system architecture and decision-making”[4]. Possibly, they had an agent monitoring the performance of various components and re-balancing resources (maybe deciding to allocate more CPUs to fuzzing vs. static analysis on the fly, based on AI suggestions).
Despite the AI-heavy design, the system did incorporate fuzzing – after all, it’s in their name. Achievements show they won “SARIF Mastery” for high accuracy on static analysis reports[38], and “-OFast” for being first to find a real (non-seeded) C vulnerability[38] (which suggests they had a very fast detection on at least one target, possibly via static analysis or a quick AI insight). They also scored early in tasks (“Faster than Pizza Delivery”)[38], indicating a fast pipeline in some cases. The core idea was likely: throw a large swarm of AI agents at the code, have them propose potential bugs (with reasoning), then use fuzzers to verify those that need confirmation. This flips the typical approach – here AI is finding bugs and fuzzing is double-checking, rather than fuzzing finding and AI triaging. Given that they placed mid-pack, this approach had mixed success, but it’s noteworthy.
Traditional Techniques: FuzzingBrain did not entirely discard traditional methods. It sounds like they still ran fuzzers and other analyses when appropriate, but perhaps triggered by AI. For example, if an AI agent flagged a function as vulnerable (say, a buffer copy with no bounds checking), they might then launch a focused fuzzing campaign on that function or module to produce a concrete crash (or use a directed fuzzing mode). They presumably used the OSS-Fuzz harnesses for targets where available, similar to others, but possibly under the guidance of the AI (e.g., picking which harness to prioritize). Because they got the SARIF accuracy award, they clearly processed the static analyzer outputs given by the organizers very effectively[38]. This likely involved a classical component: maybe they wrote a script to filter out SARIF warnings that didn’t correlate to any dynamic findings, or they used AI to decide which SARIF entries to mark true/false. But an 80%+ accuracy on SARIF suggests a systematic method, since random guessing would be worse. They might have cross-referenced SARIF with code context using an LLM, which is semi-traditional (static analysis result + AI).
On fuzzing, they likely used a variety of fuzzers (AFL++, libFuzzer, etc.), given their team composition might include academic fuzzing experts. They didn’t win any achievements for huge fuzzing scale or such, so their fuzz infrastructure might have been modest (especially compared to Team Atlanta or 42-beyond). Instead, they applied fuzzing more tactically as a backup. For patching, not much is published, but presumably they did similar to others: prompt an LLM to fix the code, perhaps with multiple agents voting (since they like scale, they could generate multiple patch candidates with different LLMs or prompts and choose the best). There’s mention in ToB’s blog that FuzzingBrain’s patching was largely AI-driven with minimal validation – since they prioritized speed (score quickly) sometimes at expense of mistakes, their cost per point was relatively high ($490) and accuracy might have suffered[37]. However, they avoided catastrophic failures and ended in 4th place, which is respectable for a newer team.
AI-Based Techniques: This team embraced AI at scale. If Shellphish used 60 agents, FuzzingBrain used “thousands”[39]. They leveraged model diversity – running different types of models in parallel. This could mean using different prompts or even entirely different AI systems (OpenAI vs. Anthropic vs. open-source models) concurrently, under the theory that what one model misses, another might catch. The phrase “overcome AI unreliability through scale and model diversity”[39] directly implies that by sheer volume of agents, the system’s overall output becomes robust (the wisdom of crowds, but for AIs). This is an intriguing strategy: if one has the budget to do thousands of LLM calls in parallel, one can then filter their outputs for the best ones. Indeed, they had agents whose job was likely to evaluate the outputs of other agents – a hierarchy where some AI judges others’ findings (maybe by cross-checking with program semantics or known bug patterns). The result was that ~90% of their PoVs came from direct AI reasoning (not fuzzers)[39]. In cases where those AI-born PoVs failed or were uncertain, only then did they turn on traditional fuzzing as a “fallback validation mechanism”[39]. That indicates extreme reliance on AI.
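A rough sketch of that “many agents propose, the program decides” pattern is shown below. It assumes hypothetical `propose_pov` (one LLM worker suggesting a candidate crashing input) and `run_target` (executing the harness under sanitizers) helpers – FuzzingBrain’s real interfaces are not public – and only illustrates the shape of the swarm-then-verify loop.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def propose_pov(agent_id: int, code_context: str) -> bytes:
    """Hypothetical agent call: ask one LLM worker for a candidate crashing input."""
    raise NotImplementedError

def run_target(candidate: bytes) -> bool:
    """Hypothetical executor: return True if the target crashes on this input
    (e.g. run the OSS-Fuzz harness under ASan and check the exit status)."""
    raise NotImplementedError

def swarm_povs(code_context: str, n_agents: int = 1000) -> list[bytes]:
    """Run many independent agents in parallel and keep only inputs that
    actually reproduce a crash; unverified hypotheses are discarded."""
    confirmed = []
    with ThreadPoolExecutor(max_workers=64) as pool:
        futures = [pool.submit(propose_pov, i, code_context) for i in range(n_agents)]
        for fut in as_completed(futures):
            try:
                candidate = fut.result()
            except Exception:
                continue  # a failed agent costs nothing but budget
            if run_target(candidate):
                confirmed.append(candidate)
    return confirmed  # if empty, fall back to plain fuzzing as validation
```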
Additionally, the team might have used reinforcement learning or planning for their AI agents. The name itself is a play on “All you need is a brain” – possibly they attempted some brain-like global strategy (maybe a central AI that learned which agents to trust). It’s known one team in finals (Team Atlanta) fine-tuned models; it’s possible FuzzingBrain also fine-tuned or trained a custom model, but no evidence here. Instead, they likely leveraged existing APIs heavily. Their large compute spend ($63k, second highest) and significant LLM spend ($12k)[37] show they poured resources into this massive parallel AI approach. They actually had the highest cost per point among top teams[120], suggesting a bit of inefficiency – the AI swarm found a lot but also wasted effort on false leads, it seems. Documentation on their methods is scarce (no public blog from them as of writing), but one can infer they were testing the limits of how far multi-agent AI can go in this domain.
Performance & Results: FuzzingBrain secured 4th place, finding 28 vulnerabilities and patching 14[40], for a score of 154. This put them just behind the top three, and notably ahead of Shellphish and others in total points[121][122]. The performance indicates their approach was effective at discovering bugs – tying Shellphish in finds and even Theori (28 vs 34, not far off). Their patch count was moderate; some patches likely failed or they didn’t patch everything they found. The fact that they scored lower suggests either slower patching or accuracy penalties. Indeed, their cost per point and anecdotal evidence imply they may have submitted some incorrect patches that hurt their multiplier (though they didn’t come last in patch points by any means). However, their notable achievements show strength in specific areas: They were the fastest to score on at least one challenge (the “pizza” speed achievement) and excelled at the static analysis portion with SARIF Mastery (>=80% accuracy)[38]. The first-to-find real vuln (-OFast) award indicates a keen eye for an actual bug (maybe found by an AI reading source code before any fuzzer could even compile the program)[38]. These honors underscore that a swarming AI can outpace and outperform some manual or traditional methods, at least initially. The trade-off was possibly coordination and focus – more isn’t always better if it floods the system with noise. But overall, their 4th-place finish as a likely independent or new team (the name suggests perhaps a group of students or researchers) is impressive. The team’s code is on GitHub under an organization o2lab[123], hinting they might be an academic lab (perhaps at Northwestern or another university, since one search result connected a Wang Xuezhen with this team). The repository contains directories for crs, cp_config, etc., and even a proprietary folder[124][125], which suggests some components (perhaps the LLM models or data) were not fully open-sourced or are placeholders.
Strengths & Weaknesses: The clear strength of FuzzingBrain was mass parallelism and AI depth. It essentially brute-forced the problem with AI, leveraging a “more is more” philosophy. This likely uncovered some non-obvious bugs quickly and provided resilience (if one agent got stuck, another completely different approach was also running). It also allowed them to be extremely AI-centric without being derailed by one model’s quirks – consensus or majority voting among agents could surface correct answers. The approach’s weakness is efficiency and coherence. Running so many agents risks duplicating work and requires very good aggregation logic to make sense of their outputs. The high cost per point suggests a lot of wasted cycles on fruitless inquiries[120]. Also, managing thousands of AIs is complex; perhaps the integration and debugging of such a system is difficult, and timing issues or race conditions could occur. In the final scoring, they did well but not enough to beat more targeted approaches – indicating that while you can throw AIs at the wall to see what sticks, a clever strategy (like Team Atlanta’s multi-engine or ToB’s guided fuzzing) can still have an edge. In terms of documentation, FuzzingBrain is less visible; they might share insights later, but currently we rely on the observations of others for their description. In summary, All You Need is a Fuzzing Brain demonstrated the power of massively parallel AI in cybersecurity, achieving solid results but at great expense – a classic time vs. money vs. reliability trade-off.
Artiphishell (Shellphish) – Academic Multi-Agent System¶
Core Operation & Approach: ARTIPHISHELL, by the veteran hacking team Shellphish, is an agent-based CRS where a large number of specialized AI agents collaborate to find and fix bugs[41]. Over 60 independent AI agents were programmed, each with distinct roles (hence the name a pun on “artificial shellphish”)[41]. This architecture draws on Shellphish’s extensive experience building automated hacking systems (they competed in the original 2016 DARPA Cyber Grand Challenge). The ethos was to decompose the immense task of securing a codebase into smaller subtasks, handled by expert agents. For example, one agent (nicknamed “Grammar Guy”) focused on inferring input grammars and generating complex test inputs[42], another might focus on monitoring program telemetry or feedback, another on generating patches for memory errors, and so on. This way, Artiphishell’s design mimics a team of hackers with different specialties working together – but automated.
Shellphish’s CRS heavily emphasizes autonomy and scale. They wanted minimal human intervention; once deployed, the agents communicate and iterate on their own. They set up extensive telemetry logging and monitoring of the AI decisions (even earning the “Best Telemetry” award)[46], which suggests they instrumented their CRS to record agent actions, LLM prompts, responses, etc., both for debugging and perhaps for scoring bonus (if they shared useful data with DARPA). Being an academic team, Shellphish likely experimented with various approaches: fuzzing, concolic execution (their open-source angr framework is for symbolic execution), static analysis (their members have research in that area too), and integrated LLMs on top. The repository indicates a complex infrastructure (with Azure deployment scripts, etc.)[126][127], and environment variables for multiple AI APIs[43]. The multi-agent approach was validated by their professors: quotes show they view this architecture as a representation of “how much AI has changed the way we attack difficult problems”[128][129].
Traditional Techniques: Shellphish has a pedigree in program analysis, so Artiphishell certainly didn’t ditch those methods. They augmented fuzzing significantly through AI. The “Grammar Guy” subsystem is essentially a smart fuzzing assistant: it uses LLMs to generate input grammars and progressively refine them based on coverage feedback[42]. This means Artiphishell did run fuzzers (likely AFL or libFuzzer or their own legacy fuzzers from CGC days), but instead of relying on random mutations, they had grammar rules to produce more structured inputs (e.g., if testing a SQL parser, Grammar Guy might learn the SQL syntax and feed fuzzing with valid but tricky SQL queries). According to ToB, this grammar-based fuzzing approach significantly increased their bug-finding capability for complex formats[42], at the cost of a large portion of their LLM budget (likely a worthy trade-off). Aside from fuzzing, Shellphish has historically used symbolic execution; it’s not confirmed, but they possibly integrated their angr framework to systematically explore some code paths or to solve inputs for certain conditions, with AI guiding the target selection. They also likely used static analysis tools, perhaps parsing the code with their own tools or open-source analyzers to supply hints to the agents.
For exploitation (PoVs), Shellphish’s strategy was somewhat middle-ground. They did manage to create at least one enormous PoV payload (earning “The Limit Does Not Exist” for >1MB exploit)[48], meaning they had the capability to generate complex exploits. It’s possible they used some concolic or brute-force method to build that payload, or an agent just concatenated a huge input together to overwhelm something. Patching-wise, Shellphish took a conservative approach like Trail of Bits and Team Atlanta: they only submitted patches when they had a corresponding PoV to prove the fix[12]. In fact, they boasted one of the highest patch accuracies – over 95% of their patches were correct[46]. This indicates they avoided speculative fixes and carefully validated each patch (likely by rerunning all known PoVs on the patched program as part of their pipeline). As a result, they may have left some vulnerabilities unpatched if they couldn’t reliably confirm a fix in time, which might have limited their score, but it kept their penalties minimal.
AI-Based Techniques: Artiphishell’s multi-agent system employed a variety of LLMs – as evidenced by environment keys for OpenAI, Anthropic, and Gemini (Google) models[43]. This suggests they dynamically chose models for tasks: perhaps using GPT-4 for complex reasoning, Claude for long-text analysis, and Gemini for code generation, etc., exploiting each model’s strengths. Over 60 AI agents implies a wide range of AI roles. Some hypothesized roles: code summarizer agents (to distill large code into manageable chunks for others), vulnerability hypothesis agents (scanning for common bug patterns), exploit agents (like a simplified version of what Theori did, but maybe not as central), patch agents (given a location and bug type, draft a code change), verification agents (review a patch diff to ensure it addresses the bug and doesn’t introduce others), and coordination agents (deciding which agent should work on what). They likely implemented a message-passing or blackboard system for agents to share information, which is non-trivial engineering. Given the “we had to babysit LLMs” comment referenced at DEF CON by a member[130], Shellphish found that you can’t just let 60 agents run wild – you need oversight logic. They possibly had a scheduler that monitored agent outputs and killed or re-assigned tasks if an agent got stuck or hallucinated too much.
One specifically cited agent, Grammar Guy, was a standout AI feature[42]. This agent would generate grammar rules (likely in a format like ANTLR or a simple BNF) to describe input formats of targets, then evolve them. For example, if fuzzing found some new code coverage when sending <script> tags, Grammar Guy might infer that the input is HTML-like and add more HTML grammar rules. Over time, the inputs get more sophisticated, allowing fuzzers to explore deeper states of the program. This is a clever synergy of AI with fuzzing: the AI provides intuition on input structure (like a human writing a grammar for Peach fuzzer), but automatically. Shellphish also built agents for telemetry, which implies some AI(s) were summarizing what the system was doing in human-readable form or detecting anomalies (they reported LLM and CRS activity to organizers, possibly for that Best Telemetry award)[46].
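A minimal sketch of that coverage-guided grammar-refinement loop follows. It is not Shellphish’s Grammar Guy; `query_llm` and `fuzz_with_grammar` are hypothetical placeholders, and the grammar is kept as a plain dict of BNF-style productions purely for illustration.

```python
def query_llm(prompt: str) -> str:
    """Hypothetical LLM call that returns new grammar productions, one per line."""
    raise NotImplementedError

def fuzz_with_grammar(grammar: dict[str, list[str]], budget: int) -> tuple[int, list[str]]:
    """Hypothetical fuzz step: generate `budget` inputs from the grammar, run the
    target, and return (new_edges_covered, inputs_that_gained_coverage)."""
    raise NotImplementedError

def evolve_grammar(grammar: dict[str, list[str]], rounds: int = 10) -> dict[str, list[str]]:
    """Coverage-guided grammar refinement in the spirit of 'Grammar Guy': whenever
    fuzzing with the current grammar finds new coverage, ask an LLM to propose
    productions that generalize the inputs that caused it."""
    for _ in range(rounds):
        new_edges, interesting = fuzz_with_grammar(grammar, budget=10_000)
        if new_edges == 0:
            continue  # nothing learned this round; keep the grammar as-is
        prompt = (
            "These inputs reached new code in the target:\n"
            + "\n".join(interesting[:5])
            + "\nPropose additional BNF productions (one per line, as "
            "'NONTERMINAL ::= expansion') that generalize them."
        )
        for line in query_llm(prompt).splitlines():
            if "::=" in line:
                lhs, rhs = (part.strip() for part in line.split("::=", 1))
                grammar.setdefault(lhs, []).append(rhs)
    return grammar
```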
In terms of heavy AI usage, Shellphish’s LLM budget was actually relatively low (~$2.9k, lowest among finalists) and their compute was high (~$54.9k)[131]. This suggests they self-hosted a lot of models (compute cost) or relied more on local analysis than API calls. Maybe they ran some open models on GPUs for cheaper per-query cost, or they used smaller models extensively which cost less per call. The multi-agent approach might have emphasized many shallow AI tasks (which might align with using smaller models in parallel rather than a few giant model calls). Their “Hunger for Knowledge” achievement about 1K+ LLM requests/minute actually went to 42-beyond, not Shellphish[132], so Shellphish wasn’t spamming external APIs as heavily. Instead, they might have had optimized uses or local instances.
Performance & Results: Artiphishell finished 5th place, with 28 bugs found and 11 patches applied[47], totaling 136 points. The placement belies the sophistication of their system – they clearly built something advanced and it did successfully operate (finding and fixing a good number of issues), but it didn’t score as high as expected. There are a few possible reasons: one, they may have focused on quality over quantity, given their very high patch accuracy (they won “The Doctor is In” for patch success >95%)[46]. That means they likely passed on patching some bugs they weren’t sure about, costing them potential points but avoiding penalties. Two, multi-agent coordination overhead might have slowed them down; they found as many bugs as Theori and FuzzingBrain (28 vs 34/28), but patched significantly fewer. This suggests maybe their patch pipeline was a bottleneck or some agents underperformed. For example, it’s possible they identified many crashes but couldn’t develop PoVs for all (since they wouldn’t patch without PoV). If an exploit agent fails to deliver, those bugs remain unpatched and earn minimal points (just SARIF maybe). The scoreboard shows Shellphish’s PoV count was decent but patch count lagged.
However, Shellphish still had notable accomplishments. They handled at least one scenario requiring an enormous input (1MB+ PoV)[48] – maybe an AI agent decided to flood an input to trigger a worst-case behavior or memory issue, which worked. They also had a telemetric insight into their system which presumably helped them debug and iterate on it quickly. Their 5th place came with a significant points gap below 4th, but a huge gap above 6th (136 vs 105), showing they were clearly more effective than the bottom two teams. As an academic team, the knowledge gained might be as important; indeed, their mentors considered the breakthroughs and learning itself a reward[133]. Documentation on Artiphishell is currently limited – the UCSB news piece provides a high-level view[41], and a DEF CON talk was given by team members (CTF Radio podcast) shedding more technical light, including some post-mortem analysis via an X/Twitter thread[134]. The code is open on GitHub, though it may not include all private data (there’s mention of a private wiki and CI system)[127].
Strengths & Weaknesses: Artiphishell’s main strength is its comprehensive AI-agent ecosystem. It attempted to emulate a full cybersecurity team with AI, which likely made it very flexible – it could handle many different types of challenges (any language, any bug type) by delegating to appropriate agents. Its integration of grammar-based fuzzing was cutting-edge and gave it an edge in handling complex input software (like those requiring well-formed inputs)[42]. Additionally, Shellphish’s cautious approach to patching meant their system’s outputs were highly reliable (a valuable trait outside competition). On the flip side, the complexity of managing 60+ agents is a weakness – overhead and inter-agent communication can slow things down or cause failures if not perfectly tuned. Indeed, team members mentioned having to babysit the AI (keeping the “Jack-Jack” multi-powerful LLM under control)[130][135]. Another weakness is that a multi-agent system can sometimes have blind spots if agents don’t share info effectively; e.g., an exploit agent might not realize another agent already explored a path, etc. This could lead to duplicate work or missed optimizations. Shellphish’s lower LLM usage suggests they might have under-utilized some AI potential, relying more on internal logic – possibly a deliberate cost-saving measure, but it may have limited the raw power available compared to others who spent more. In conclusion, Artiphishell was an ambitious melding of classic hacking techniques with AI teamwork. It didn’t top the scoreboard, but it validated the multi-agent approach and no doubt provided numerous insights (and future research directions) to its creators. The tool is available for the community, and given Shellphish’s academic roots, we may see research papers or open data coming from this effort in the future.
Bug Buster (42‐b3yond‐6ug) – Scalable Fuzzing with AI Scheduling and Super-Patching¶
Core Operation & Approach: Bug Buster, by team 42-b3yond-6ug (a team of academics from Northwestern and others[136][137]), built a CRS optimized for massive scale fuzzing and clever patching techniques. Their philosophy leaned on proven methods like fuzz testing and then amplified them using AI. They put significant effort into scaling out fuzzing infrastructure: their system ran three parallel fuzzing pipelines during the semifinal and continued to expand, ultimately coordinating up to \~2000 fuzzing instances cluster-wide in the finals[56][138]. This was orchestrated via a custom scheduler named BandFuzz, which employed reinforcement learning (RL) to allocate resources to fuzzers in real-time[54][139]. Essentially, BandFuzz would monitor which fuzzers were yielding coverage or crashes and dynamically decide which fuzzing jobs to prioritize or spawn next, learning optimal scheduling policies. The team’s prior research on BandFuzz even won a fuzzing competition, indicating that part of their system was state-of-the-art in fuzz management[54].
Apart from raw fuzzing power, 42-b3yond integrated static program analysis to guide their efforts. One key tool was a program slicer – a static analysis that can narrow down which parts of the code influence a certain point (like a crash point or a target variable)[140]. They used slicing to map out paths to target locations (e.g., to focus fuzzing on code that leads to a potential bug). However, during the competition, this slicer ran into issues (written for LLVM 14, it broke on C23 code)[141], which hampered their Round 1 performance and occupied developers with porting efforts unsuccessfully[141]. This anecdote shows the risk of relying on certain tools that might not scale to new languages or standards – a valuable lesson they learned.
Another highlight of Bug Buster’s approach is what they called “super patches.” Instead of patching each bug individually, they attempted to create patches that fix multiple vulnerabilities at once[9][51]. The system could detect when different crashes actually stemmed from the same root cause, or even if unrelated bugs could be solved by one general hardening change[51]. For example, if two different out-of-bounds crashes in code could both be prevented by adding an array bounds check in one location, they’d deploy a single patch for both. This turned accuracy penalties on their head – a patch that fixes two bugs yields double benefit with one potential penalty. It’s a creative tactic: by “holistically” addressing issues, they aimed to maximize the patch score multiplier. It’s unclear how often this succeeded, but it was innovative enough to impress other teams[51][142]. It likely involved AI assistance to generalize a fix and recognize patterns among crashes.
Traditional Techniques: 42-b3yond-6ug’s foundation was fuzzing. They used established fuzzers (probably AFL++, libFuzzer, and possibly others like Honggfuzz) in large numbers. They mention kernel fuzzing and directed fuzzing as separate efforts in their narrative[53][143], so they even attempted to fuzz an OS kernel (they found a Linux kernel bug in semifinals) – that’s ambitious given kernel fuzzing’s complexity. They also wrote CodeQL queries (a static analysis query language) to find vulnerabilities in source code quickly (the blog says “everyone was writing CodeQL” during development)[53], meaning they automated searches for known bug patterns (like spotting suspicious uses of functions, integer overflow patterns, etc.). Additionally, they implemented triaging pipelines: when fuzzers found crashes, their system automatically triaged them (by reproducing the crash, deduplicating it, classifying it by type). This is where they integrated with SARIF (static analysis reports) and other data – they had a SARIF integration to validate static warnings, matching them with dynamic evidence of a bug[52].
They also took reliability seriously: their story recounts how in semifinals many components crashed but their system had backups and “absorbed the hits”[56]. This resilience came from redundancy – e.g., three independent fuzzing pipelines means if one fuzzer process dies, two others are still running. They also logged component status in a key-value store (KV cache) to recover state if something failed[144]. They had to address scaling issues, like their seed synchronization service bottlenecking when 2000 fuzzers generated new inputs too quickly[145]. They solved it via a MapReduce style approach to distribute the syncing load[146]. This level of engineering shows strong software development to handle large workloads, arguably more so than AI specialization. For patching, beyond the super patch idea, they likely used typical methods of running regression tests and cross-testing PoVs with patches to ensure fixes worked (though anecdotally a late bug in their patch component cost them – more below).
AI-Based Techniques: Bug Buster integrated AI in a few focused ways instead of everywhere. One was the BandFuzz RL scheduler – this was essentially applying reinforcement learning to maximize fuzzing output (coverage/crashes)[54]. The RL agent would consider the “state” (maybe which programs or which seeds have yielded results) and allocate “actions” (which target to fuzz next, or which seed to mutate) to maximize reward (new coverage or crashes). They found this outperformed static scheduling in fuzzing competitions[54]. Another AI use was seed generation via LLMs. They mention training an “LLM seed generator” and writing a paper on it (which got initially rejected)[53][147]. This tool presumably uses an LLM to suggest initial seeds for fuzzers – similar in spirit to what Buttercup did, but possibly with a custom model trained for it. It might generate inputs especially for structured formats. The fact a student spent two months writing a paper on it shows they had a dedicated AI component for it, which might have been key in early bug findings.
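For intuition, here is a minimal epsilon-greedy bandit over fuzzing harnesses – a much simpler stand-in for BandFuzz’s actual RL formulation. The `run_fuzzer_slice` function is a hypothetical placeholder that would run one fuzzer for a time slice and report a reward such as new coverage edges plus a bonus per unique crash.

```python
import random
from collections import defaultdict

def run_fuzzer_slice(harness: str, seconds: int) -> float:
    """Hypothetical: run one fuzzer on `harness` for `seconds` and return the
    observed reward (e.g. new coverage edges plus a bonus per unique crash)."""
    raise NotImplementedError

def schedule(harnesses: list[str], rounds: int = 1000, eps: float = 0.1) -> dict[str, float]:
    """Epsilon-greedy bandit over fuzzing targets: mostly exploit the harness with
    the best average reward so far, occasionally explore the others."""
    totals = defaultdict(float)  # cumulative reward per harness
    pulls = defaultdict(int)     # number of slices given to each harness
    for _ in range(rounds):
        if random.random() < eps or not pulls:
            choice = random.choice(harnesses)
        else:
            # Untried harnesses get infinite estimated value, so they are tried first.
            choice = max(harnesses,
                         key=lambda h: totals[h] / pulls[h] if pulls[h] else float("inf"))
        reward = run_fuzzer_slice(choice, seconds=60)
        totals[choice] += reward
        pulls[choice] += 1
    return {h: totals[h] / pulls[h] for h in harnesses if pulls[h]}
```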
For patching, their concept of “super patches” likely involved AI help. Recognizing that two crashes relate might be done via static analysis clustering, but coming up with a single fix that addresses multiple issues without regression is tricky. It sounds like an LLM could have been asked: “Given these two crash locations, is there a single code change that would prevent both?” – a task requiring semantic understanding. The Trail of Bits blog confirms they had an LLM-based patching process to do this multi-bug fix[9]. If it worked, it could yield big points, but if it backfired (fixes one bug but not fully, or introduces new issues), it could hurt.
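As a sketch of how such a “super patch” step might look (assuming, without knowing their pipeline, that crashes arrive with parsed stack traces), one could cluster crashes by a shared root frame and ask a model for a single fix per cluster; `query_llm` is again a hypothetical placeholder, not their code.

```python
from collections import defaultdict

def shared_root_frame(stack: list[str], depth: int = 3) -> tuple[str, ...]:
    """Use the deepest few frames of the crash stack as a crude root-cause signature."""
    return tuple(stack[:depth])

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call returning a unified diff."""
    raise NotImplementedError

def propose_super_patches(crashes: list[dict]) -> list[str]:
    """Cluster crashes by signature and request one patch per cluster.
    Each crash dict is assumed to have 'stack' (list of frames) and 'report' keys."""
    clusters = defaultdict(list)
    for crash in crashes:
        clusters[shared_root_frame(crash["stack"])].append(crash)
    patches = []
    for signature, group in clusters.items():
        prompt = (
            f"The following {len(group)} crashes all pass through {signature[0]}.\n"
            + "\n---\n".join(c["report"] for c in group)
            + "\nPropose a single minimal patch (unified diff) that prevents all of "
            "them without changing intended behavior."
        )
        patches.append(query_llm(prompt))
    return patches
```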
Additionally, they used AI for crash analysis: after generating a crash, an LLM could turn the raw data (addresses, dumps) into a structured description, which they call vulnerability objects (like a JSON with type of bug, location, etc.)[55]. This would help in triaging and planning patches. They might also have used LLMs to assist writing CodeQL queries or to translate static analysis results, though not explicitly stated. Interestingly, 42-b3yond’s LLM budget was minimal ($1.1k) while their compute was high ($38k)[131]. This suggests they ran AI mostly on-premise or used open models. Indeed, they hammered some local model with 1000+ requests/minute (earning “Hunger for Knowledge”)[132], which aligns with running a smaller model many times. This could have been for the vulnerability object creation or static analysis tasks, which can be done with a fine-tuned local model repetitively, keeping API costs low.
Performance & Results: Bug Buster ended up 6th place (out of 7 finalists). In the final round, it found an impressive 41 vulnerabilities – the second-highest count – but only 3 patches succeeded[57]. This massive disparity is what sank their score to 105 points[57]. It indicates that while their finding mechanism (fuzzers + AI scheduling) was top-notch, their fixing mechanism struggled. In fact, their raw PoV score was the 2nd best and they were 1st in SARIF (static) points[58]. But patch points were very low (6th among 7)[58], so the patches either failed or were never submitted for many bugs. The team’s postmortem confirms this: internally their patch success in tests was \~90% for mitigating PoVs, but something went wrong during the actual competition causing patch submissions to fail unexpectedly[58][148]. They suspect a system bug was responsible – a painful lesson that “every system has cracks, and in a competition like this, you only get one chance”[149].
Despite the low placement, 42-b3yond-6ug had notable achievements: they received “Czar of the SARIF” for most correct static analysis evaluations and “Giant Slayer” for tackling a huge repo (5M+ lines)[59]. The latter suggests they scored on one of the largest projects (maybe the Linux kernel or another large codebase) where others struggled. This indicates their system could handle scale better than some (fuzzing at scale helps on giant code). They also got “Hunger For Knowledge” for their heavy LLM query rate[132]. Their semifinal performance was actually top-tier (they were among top scoring teams in semis)[150][151], which earned them the finalist spot and funding. In finals, ironically, their heavy focus on patch innovation might have backfired due to a critical implementation bug (their narrative of a patch component rushed in last week without full review stands out[152]). So, the system found many bugs (some even in Linux which most teams found none[153]), but the patcher collapsing meant low final score. The team lead expressed admiration for Theori’s LLM-first approach later[154], indicating a recognition that their traditional-heavy approach had limitations in comparison.
Strengths & Weaknesses: Bug Buster’s strength was raw vulnerability finding power. They arguably built the best bug-finding machine among all teams – 41 vulns is testament to that. Their RL-driven fuzzing and extensive engineering on scaling paid off in triggering lots of crashes, even in hard targets like Linux. They also nailed static analysis, meaning they weren’t fooled by many false alarms. Another strength was creativity in patching (super patches) – although it didn’t reflect in points, the idea is solid for real-world use where you want minimal patches that solve maximum issues. However, weaknesses were evident: the patch pipeline’s fragility was a huge Achilles’ heel. It shows that balancing the two halves (find & fix) is crucial; their find outran their fix. Possibly too many crashes overwhelmed their patching process or they over-engineered patch strategies when a simpler per-bug patch might have sufficed. Additionally, the slicing tool distraction in Round 1 and integration struggles suggest they faced project management challenges – tackling many complex components (fuzzing, RL, slicing, concolic, patching, etc.) with a limited team can stretch resources thin. In terms of AI, while they used it smartly in places, one could argue they could have benefited from using more advanced LLM reasoning in patching or exploit generation, rather than relying so heavily on fuzzing (which finds crashes but doesn’t inherently prioritize them). Documentation from this team is very rich – one member’s personal blog gave a detailed, candid account of their journey[54][56], and the team has likely academic papers in the works on BandFuzz and LLM seed generation. They open-sourced their code (archived on GitHub)[155] and are reflective about what went wrong, which provides valuable lessons. In summary, Bug Buster was a high-risk, high-complexity system that excelled in discovery but faltered in remediation, showing the need to weight both equally in such competitions.
Lacrosse – Legacy CRS with AI Enhancements¶
Core Operation & Approach: Lacrosse was a bit of an outlier among finalists. It appears to have been built on a legacy CRS codebase from nearly a decade ago, updated for AIxCC[61][62]. This likely refers to technology from the original DARPA Cyber Grand Challenge (2016) or similar competitions. The team behind it (possibly a company like SIFT or a government lab, as suggested by interviews[156][157]) essentially took a tried-and-true automated vulnerability discovery system and gave it a “glow-up” with some modern AI additions[61][62]. The core remained fuzzing-centric and reliability-focused. Lacrosse deployed on the order of 300–500 fuzzing agents in parallel, which is comparable to Trail of Bits’ scale and shows a significant commitment to fuzzing horsepower[55]. These fuzzers were orchestrated by a central intelligence called “Optimus Zero”[55]. Optimus Zero managed global state and distributed tasks among fuzzers and analysis components, prioritizing tasks much like a general might in the field. This suggests a design similar to older CRS where a central controller divides work (some to fuzzers, some to static analyzers, etc., based on what's left to do).
Lacrosse prioritized proven techniques over experimental AI. As the Trail of Bits blog noted, they favored “focusing on agent orchestration rather than AI-driven vulnerability discovery”[64]. In practice, this means Lacrosse relied on fuzzing with robust engineering (ensuring fuzzers run properly, outputs are collected, etc.) and didn’t take as many risks with untested AI logic that might mislead or crash. The approach is akin to running a well-oiled fuzzing farm with careful oversight. That said, they did integrate AI in specific supportive roles (next section).
Traditional Techniques: Being based on an older CRS, Lacrosse almost certainly employed classical symbolic execution and static analysis components from that era. Many CGC-era systems had custom symbolic executors (to generate PoVs by solving path constraints), plus things like taint analysis to see what input bytes influence a crash. If Lacrosse’s base was one of those, it likely retained those capabilities. However, open-source critical infrastructure code (Linux, Apache, etc.) is far bigger and more complex than CGC’s synthetic binaries, so they probably leaned more on fuzzing which scales better. They also adhered to traditional PoV generation via fuzzing: meaning they largely used fuzzers to generate crashing inputs and did not attempt AI generative exploits for format compliance[64].
Lacrosse also did static analysis of a sort: their achievements show “Consensus Contributor” (most PoVs/vulns submitted), which implies they detected a lot of potential issues and tried to submit many, even if some were duplicates[66]. This could mean they had an aggressive static analysis that flagged many possible bugs and they attempted many PoVs (some might have been redundant or invalid – hence the most submissions but not a high score). It could also be that an older system lacked a refined triage mechanism for modern code, so it oversubmitted. Another achievement, “Professional Assassin” (PoV success >95%)[66], is interesting: how do you reconcile submitting the most PoVs with a 95% success rate? Possibly they submitted a large number of static-analysis-based SARIF issues (counted as “vulns”) alongside a smaller number of PoVs that mostly worked. Or perhaps failed PoV attempts were never officially submitted and so didn’t count against the ratio. The most plausible reading is that they were careful to officially submit PoVs only once they were fairly sure (leading to the high success rate), while hammering static vuln reports (since those cost little to be wrong besides scoring 0 points).
The “Raiders of the Lost POV” award suggests Lacrosse found at least one genuine vulnerability that DARPA did not plant in the code[66]. That’s a credit to their thoroughness; likely their fuzzers or symbolic execution uncovered a 0-day in a target (maybe the one bug they did patch successfully). For patching, Lacrosse took a middle-of-the-road strategy: not as conservative as Shellphish, not as aggressive as Theori. They used LLM consensus and a confidence algorithm to decide on patches[63]. This implies they might generate a patch with AI, then have multiple AI models or runs evaluate it, and if they all (or a majority) agree it’s a good fix, they submit it. This is a smart way to reduce hallucinations – essentially a voting scheme among AI “opinions.” They didn’t have the fastest or most numerous patches, but this approach likely aimed to avoid accuracy penalties while still patching whenever they had enough confidence.
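A minimal sketch of such a consensus gate is below – not Lacrosse’s actual confidence algorithm. `generate_patch` and `review_patch` are hypothetical wrappers around writer and reviewer models; the point is simply that a patch is submitted only when enough independent reviews agree.

```python
def generate_patch(model: str, bug_report: str) -> str:
    """Hypothetical: ask one model to draft a patch (unified diff) for the bug."""
    raise NotImplementedError

def review_patch(model: str, bug_report: str, patch: str) -> bool:
    """Hypothetical: ask a reviewer model whether the patch fixes the bug
    without breaking functionality; returns its yes/no verdict."""
    raise NotImplementedError

def consensus_patch(bug_report: str,
                    writers: list[str],
                    reviewers: list[str],
                    threshold: float = 0.67) -> str | None:
    """Submit only a patch that a supermajority of reviewer models approve of."""
    for writer in writers:
        patch = generate_patch(writer, bug_report)
        votes = sum(review_patch(r, bug_report, patch) for r in reviewers)
        if votes / len(reviewers) >= threshold:
            return patch  # confident enough to submit
    return None  # withhold rather than risk an accuracy penalty
```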
AI-Based Techniques: Lacrosse integrated AI in a supporting role rather than core. First, they used LLMs for higher-level reasoning tasks. This included analyzing crash reports to produce “vulnerability objects” – structured descriptions of a crash’s root cause, type, etc., and even suggestions for patching[55]. Such an object might say, for example: “Heap buffer overflow in function parse_image() when handling BMP files. Cause: misuse of memcpy length.” This gives the developers (or automated system) a clear picture. Having AI create these likely helped Optimus Zero to decide how to patch or whether to prioritize that bug.
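A plausible (purely illustrative) shape for such a vulnerability object, expressed as a Python dataclass – the field names are assumptions, not Lacrosse’s actual schema:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class VulnObject:
    """One plausible shape for the 'vulnerability object' described above."""
    bug_type: str      # e.g. "heap-buffer-overflow"
    cwe: str           # e.g. "CWE-787"
    function: str      # e.g. "parse_image"
    file: str
    line: int
    trigger: str       # short description of the triggering input
    root_cause: str    # e.g. "memcpy length taken from untrusted header"
    patch_hint: str    # e.g. "clamp the length before the copy"

example = VulnObject(
    bug_type="heap-buffer-overflow",
    cwe="CWE-787",
    function="parse_image",
    file="src/bmp.c",
    line=142,
    trigger="BMP file with oversized size field",
    root_cause="memcpy length taken from untrusted header",
    patch_hint="validate the declared size against the allocated buffer",
)
print(json.dumps(asdict(example), indent=2))
```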
They also used AI for semantic understanding tasks that classical tools struggle with. For instance, figuring out input formats or protocol details might be handled by an AI reading the documentation or inferring from code. Lacrosse might have had an AI agent to reverse-engineer input format if needed (similar to Shellphish’s Grammar Guy, but probably less central). Another key AI usage is in their patching process: LLM consensus. They might generate several patch options with GPT or similar, then evaluate each with one or multiple models to pick the best. Or, they might have one LLM propose a patch and another review it for correctness. That aligns with a careful, consensus-driven approach[63].
The orchestrator, Optimus Zero, probably also had some AI logic. It might have used heuristic or learned (AI) strategies to assign tasks – e.g., using machine learning to predict which module is likely vulnerable based on past data, and thus allocate more fuzzing there. But given the limited performance, it might have been rule-based rather than learned, as they emphasized reliability over new techniques.
The cost data shows Lacrosse spent very little: $631 on LLMs and $7.1k on compute (the lowest by far)[131]. That’s maybe because they didn’t utilize external AI much (maybe some small model usage or minimal queries) and their scale was smaller than others (if 300–500 agents, possibly not running full throttle or not all heavy processes). Their cost per point was highest, $751, meaning the approach was not resource-efficient in terms of score[120]. But they likely aimed for a robust system within modest budget, possibly as a small business track team or similar.
Performance & Results: Lacrosse finished 7th (last) in the finals, with only 1 vulnerability found and 1 patched[65], for 10 points. This is a stark drop from others and indicates something went wrong or underperformed. It’s possible their system actually found more but maybe didn’t get credit due to duplicates or timing. The fact they submitted the most vulns/PoVs (consensus contributor) suggests they tried many, but they only successfully scored one. This likely means many of their submissions were duplicates of others’ findings or came too late. In the DEF CON final format, when multiple teams find the same bug, the one who submits first (with PoV or patch) gets more points, and late patches to a bug someone else already patched can reduce accuracy if the bug is already fixed. Lacrosse might have identified bugs slower than others, so by the time they submitted, another team had patched it, making their patch either unnecessary or harmful to their score. This is speculation, but fits the data: lots of attempts, little payoff, could be due to lateness.
On the positive side, the one bug they did patch gave them those achievements (because it was a unique real bug that only they found). So their conservative, thorough approach did yield one unique discovery. But competitively, speed and volume are key, and they fell behind. Possibly their legacy system, while stable, wasn’t designed for the complexity of real modern code and needed more AI infusion to be competitive, but they perhaps ran out of time to modernize everything. The ToB blog’s admiring remark – “This says a lot about its ability to …” (the quote cuts off, but likely refers to its ability to endure or adapt) – suggests respect for the fact that an old system could still make the finals[61].
Team Lacrosse’s documentation is limited publicly, but DARPA posted a YouTube team interview[158], and a CTF Radio podcast exists with their team members discussing their final submission[157]. These likely elaborate on their approach and motivations (e.g., perhaps a DARPA legacy or a focus on integration with government systems). The open-source archive shows their code on GitHub as well.
Strengths & Weaknesses: Lacrosse’s strength was stability and low risk. By using a known architecture and scaling it moderately, they avoided catastrophic failures – their system did run end-to-end and didn’t bomb out (evidenced by their achievements and at least one success). The orchestration by Optimus Zero indicates good control over their processes, likely leading to fewer crashes of the CRS itself (something even Team Atlanta worried about – one bug can kill a CRS[74]). They also achieved near-perfect PoV success ratio, meaning they rarely submitted a PoV that didn’t work[66]. That suggests strong validation and correctness checking on their part – a valuable trait in real deployments, where false positives are a waste of time. And finding a unique bug shows they did have capability beyond trivial finds.
However, their weaknesses were speed and adaptability. They were too conservative and slow in a game that rewarded fast exploitation. Possibly their fuzzers found crashes but maybe they waited too long to confirm them or to generate PoVs. Or they might have targeted the wrong things first (maybe they spent time on a huge codebase and only found one bug, while ignoring easier targets early on). The minimal use of AI meant they didn’t take advantage of leaps in reasoning or analysis that others did – an old system might not catch subtle logic bugs or might drown in false positives without an AI to filter. Also, having fewer fuzzers than others (500 vs thousands) and not pushing them as hard (given the low compute cost) meant generating less coverage and crashes overall.
In essence, Lacrosse was a testament to reliability over novelty, which in this first-of-its-kind AI challenge, wasn’t enough to score high. They effectively demonstrated that an older CRS, even with some LLM help, can still function on modern code, but to win, embracing more AI and aggressive strategies is necessary. For real-world transition, though, some of Lacrosse’s philosophies (like patch confidence algorithms and not overspending resources) could be beneficial.
SHERPA – Automated Fuzz Harness Generation (Auxiliary Tool)¶
Core Operation & Approach: SHERPA (Security Harness Engineering for Robust Program Analysis) is an open-source tool released alongside the competition, focused on a specific but crucial problem: generating fuzzing harnesses for attacker-controlled entry points in software[159][160]. In simpler terms, SHERPA automates the creation of the glue code that lets a fuzzer target the highest-level interfaces of a program – those that real attackers would use – rather than internal helper functions. This addresses a big gap: fuzzers often test library APIs or internal functions because that’s where harnesses exist (like OSS-Fuzz provides many harnesses at low-level APIs), but real attacks typically go through top-level entry points (like feeding a whole file or network packet into the program)[161][162]. SHERPA shifts fuzzing “up the stack” to these entry points[162], meaning it tries to fuzz the program in a way that mirrors actual usage (e.g., fuzz the main() or the request handler rather than just a parsing subroutine).
To do this, SHERPA’s workflow is: given a large project (say an open-source server or utility), analyze its coverage gaps – identify parts of code not exercised by existing fuzzers[163], focus on those that are attacker-controlled entry points (like functions reading external input)[160][164], and then use LLMs to generate fuzz harness code for those entry points[165][166]. The harness code is basically a piece of C/C++ (or Java) that calls into the target code in the right way with some fuzzer-provided data. SHERPA then compiles this harness, runs a fuzzing campaign on it, and monitors for crashes[167][168]. If compilation fails or linking issues arise, SHERPA has an auto-fix loop where an LLM tries to fix build errors in the harness code and rebuild[169][170]. This is very useful because writing correct harnesses can be tricky with dependencies and type matching – SHERPA essentially iterates until the harness compiles (“Build-Until-Green”)[171].
When fuzzing finds crashes with these new harnesses, SHERPA again invokes an LLM for crash analysis. It automatically triages crashes, filtering out false positives (like if the harness itself was flawed or a benign bug) by using AI to analyze stack traces and error patterns[172][173]. The LLM can discard >80% of raw crashes as not real bugs[174][175], achieving a \~67% precision of real bugs in those that remain[176][174] – a huge efficiency gain. Finally, SHERPA outputs maintainer-friendly artifacts: for each validated bug, it provides a reproducible test case, an analysis of the bug, and even patch guidance[177][178]. The idea is to make it easy for open-source maintainers to understand and fix the issue, bridging the gap from raw fuzzer crash to actionable report.
Traditional Techniques: SHERPA stands on existing fuzzing technology – it doesn’t replace fuzzers like AFL/libFuzzer; it augments them. It uses coverage analysis (likely via coverage traces from existing tests or fuzzers) to see what parts of code are not fuzzed yet[176]. Then it picks those areas, specifically focusing on code that handles untrusted input (for instance, functions that parse file formats or process network data – essentially anything that reads from an external source). This targeting may use static analysis or simple heuristics (like function names containing “parse” or pointer arguments that look like data buffers).
Once a target function is chosen, writing a harness is partly straightforward (calling the function with fuzzer data), but often needs some scaffolding: e.g., if the function expects a complex object, SHERPA might need to construct one (maybe by calling initializers or setting global state). It likely uses static analysis to see what that function requires (like what global variables need to be set, or what format the input should be). Some static analysis could be analyzing function signatures or reading comments (LLM can help here by reading code and inferring usage). SHERPA then fuzzes with libFuzzer or AFL (it’s integrated into OSS-Fuzz style pipelines)[179]. The harnesses it generates are OSS-Fuzz compliant, so they can be contributed back to OSS-Fuzz easily if desired[180].
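For the simplest case – an entry point that already takes a byte buffer – harness generation can be little more than filling a template; the sketch below renders an OSS-Fuzz-style libFuzzer harness from a function name. Real targets usually need extra scaffolding (init calls, struct setup), which is exactly what the LLM step handles; the `parse_request` entry point here is hypothetical.

```python
HARNESS_TEMPLATE = """\
#include <stdint.h>
#include <stddef.h>

// Entry point under test (illustrative signature).
extern int {function}(const uint8_t *data, size_t size);

// libFuzzer harness: forward the fuzzer's raw bytes to the entry point.
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {{
    {function}(data, size);
    return 0;
}}
"""

def render_harness(function: str, out_path: str = "harness.c") -> str:
    """Write a minimal libFuzzer-compatible harness for a byte-oriented entry point."""
    source = HARNESS_TEMPLATE.format(function=function)
    with open(out_path, "w") as fh:
        fh.write(source)
    return out_path

# Example: a hypothetical top-level parser entry point.
render_harness("parse_request")
```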
For triage, beyond LLM, it uses sanitizer output and likely checks if a crash is unique (deduplicating by crash stack or ASan report). It filtered 100+ raw crashes to 27 high-signal ones, from which 18 were confirmed real bugs[176][181]. Those numbers indicate heavy filtering. It also uses an interesting trick: if a harness or crash is due to misuse of API or unreachable condition, they filter that out (some automated logic likely flags known harness errors or unreachable states)[172].
AI-Based Techniques: SHERPA’s power comes from integrating LLMs at two key points: harness generation and crash analysis. For harness generation, SHERPA uses LLM agents that understand code context. The LLM is fed with information about the target function and possibly surrounding code (maybe function docstrings or definitions)[182][179]. It then outputs a candidate harness in C/C++ code. Essentially, the LLM is writing new code – a fuzz entry function that calls into the target code appropriately. This is where prompt engineering matters: they likely instruct the LLM to use only safe calls, show it how to structure a fuzz harness, and provide an example. They mention “Key Prompt Engineering Innovations: Security-First Target Selection”[183], meaning the prompt emphasizes picking real attack surfaces. Also, a “Multi-Stage AI Pipeline” is referenced[184] – Stage 1: target selection (perhaps an LLM helps identify which functions are entry points); Stage 2: harness code generation; Stage 3: build fixing; Stage 4: fuzzing (the one non-AI stage); Stage 5: crash triage by LLM[185][166].
For build fixing, when the harness fails to compile, an LLM is given the compiler error output and the harness code, and asked to adjust the code to fix the error[186][169]. This is a common use of codex-like models and can save a lot of human time. SHERPA keeps iterating this until it compiles or maybe a limit is reached. This automated debug loop is crucial for scaling to many targets without manual intervention.
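A minimal sketch of such a “Build-Until-Green” loop is below (not SHERPA’s implementation): compile with clang under libFuzzer/ASan instrumentation, and on failure hand the source plus compiler errors to a hypothetical `query_llm` for a corrected version. Linking against the actual target library is omitted for brevity.

```python
import subprocess

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call that returns a corrected harness source file."""
    raise NotImplementedError

def build_until_green(source: str, max_attempts: int = 5) -> str | None:
    """Repeatedly compile the harness; on failure, feed the compiler errors and the
    current source back to an LLM and retry with its corrected version."""
    for attempt in range(max_attempts):
        with open("harness.c", "w") as fh:
            fh.write(source)
        proc = subprocess.run(
            ["clang", "-fsanitize=fuzzer,address", "harness.c", "-o", "harness"],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return "harness"  # built successfully
        source = query_llm(
            "This libFuzzer harness fails to compile.\n\nSource:\n" + source
            + "\n\nCompiler errors:\n" + proc.stderr
            + "\n\nReturn the full corrected source file only."
        )
    return None  # give up after max_attempts
```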
During crash analysis, an LLM is tasked with reading sanitizer logs or core dumps to decide if a crash is likely a legit security bug or a false positive. It filtered out \~100 false positives automatically[172][173]. The LLM also categorizes the bug (CWE type, severity) and possibly suggests a patch (the readme says each bug comes with patch guidance[187]). That suggests the LLM might even point out “checking this length before copying would fix it” or similar. SHERPA’s readme lists a summary of bug types it found (CWE-770 uncontrolled mem alloc, etc.)[188][189], which shows it can find a variety, many denial-of-service type issues (out-of-memory, infinite loops, etc.) that typical fuzzing might find but not prioritize.
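The crash-triage step can be sketched the same way: feed the harness and the sanitizer report to a model and ask for a structured verdict. The JSON fields and the `query_llm` helper below are illustrative assumptions, not SHERPA’s actual schema.

```python
import json

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call returning a JSON verdict."""
    raise NotImplementedError

def triage_crash(asan_report: str, harness_source: str) -> dict:
    """Ask a model whether a sanitizer report reflects a real bug in the target
    or an artifact of the generated harness; classification fields are illustrative."""
    prompt = (
        "You are triaging a fuzzing crash.\n\nHarness:\n" + harness_source
        + "\n\nSanitizer report:\n" + asan_report
        + "\n\nAnswer with JSON: {\"real_bug\": true|false, \"cwe\": \"CWE-###\", "
        "\"severity\": \"low|medium|high\", \"rationale\": \"...\"}"
    )
    try:
        return json.loads(query_llm(prompt))
    except json.JSONDecodeError:
        return {"real_bug": False, "rationale": "unparseable model output"}

def keep_high_signal(crashes: list[dict]) -> list[dict]:
    """Filter raw crashes down to those the model judges to be genuine bugs.
    Each crash dict is assumed to have 'report' and 'harness' keys."""
    return [c for c in crashes if triage_crash(c["report"], c["harness"]).get("real_bug")]
```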
SHERPA’s use of AI is targeted and effective: rather than trying to analyze entire program flows, it uses AI to solve specific subproblems (writing harness code and interpreting crashes) that humans normally have to do. This dramatically speeds up fuzzing on new code sections (it found 18 new bugs across multiple OSS-Fuzz projects in minutes to hours, which might take humans weeks)[174][175].
Performance & Results: SHERPA is not a competitor system but a tool produced (apparently by a collaboration including team members from Shellphish/Trail of Bits – contributors include “blue-orbiter Alex” (possibly Alex Groce?) and Matt Lehman of Trail of Bits[190][191]). It was presented as a big success: it uncovered 18 confirmed real vulnerabilities (in widely-used projects) from 27 high-signal crashes, out of 127+ raw crashes found[176][181]. That’s 18 bugs that presumably were unknown before (the text says CVE-class bugs, presumably candidates for CVEs) in production software. These include multiple memory safety issues (overflows, double-free, etc.) and several algorithmic complexity or DoS issues[188][189]. The precision of \~67% from crash to real bug[176][174] is far better than typical fuzzing (1-5% per their note)[192]. And the speed is noted: minutes of targeted fuzzing to find these vs. weeks of coverage fuzzing normally[174][175]. This is a huge improvement indicating how focusing on entry points pays off.
The SHERPA approach “changes everything” in their words because it targets bugs that attackers actually exploit, not just obscure internal errors[161][160]. For example, many fuzzers find crashes deep in libraries that might not be reachable in actual use, whereas SHERPA finds issues in how an app handles a whole file or request, which are more directly exploitable (like memory blowups or overflow on real inputs)[161][160].
Documentation for SHERPA is thorough – the GitHub readme reads like a whitepaper with motivation, methodology, and even a case study (LevelDB)[193]. It’s open-source (MIT licensed) and intended to be extended (they invite adding YAML configs for more targets)[194]. It’s a tool likely to be widely useful beyond the competition.
Strengths & Weaknesses: SHERPA’s strength is enabling fuzzing where it previously wasn’t feasible. It leverages AI to dramatically reduce the manual labor of writing harnesses and triaging fuzz results, which are two big bottlenecks in fuzzing adoption. By moving fuzz targets to real entry points, it finds high-impact bugs quickly[161][160]. The combination of coverage analysis and AI ensures it focuses on un-fuzzed code that matters, making it efficient. The automated build-fix loop with AI is also a strength – it’s like having a junior engineer fix your code under supervision. Another strength is that SHERPA outputs actionable artifacts that can directly help maintainers fix issues, bridging the gap between security researchers and developers.
One possible weakness is that LLM-generated harnesses might sometimes be imperfect – for example, the LLM could write a harness that assumes something about the code that isn’t true, or doesn’t correctly initialize some structure, which could lead to false crashes or missed bugs. However, SHERPA’s iterative fixes and AI triage presumably catch many such issues (false crashes get filtered out). The LLM’s accuracy in understanding complex APIs might also vary – if a project requires deep domain knowledge to initialize properly, the harness might not exercise the buggy code correctly. But since SHERPA still found many bugs, it indicates the approach worked well for the targets they tried.
Another challenge could be scaling to extremely large projects – the coverage analysis and target identification might be slow for huge codebases, and the LLM context might not fit if a target requires reading a lot of code. They did mention applying it to multiple OSS-Fuzz projects and specifically call out a new harness for LevelDB as an example (with a detailed write-up included)[195][193]. It likely works best for modular projects with clear entry points.
Overall, SHERPA is an impressive marriage of AI and fuzzing, automating a critical but tedious part of vulnerability discovery. It’s not a direct competitor’s CRS, but rather a tool that any team (and now the public) can use to improve fuzz coverage on real software. It shows how the innovations from AIxCC can be spun off to benefit the broader ecosystem immediately[196][197].
Conclusion – Comparative Insights¶
The AIxCC finalists collectively demonstrated a spectrum of approaches combining traditional cybersecurity techniques with AI, each with unique strengths:
- Team Atlanta’s Atlantis proved that an ensemble-of-tools strategy augmented by targeted LLM integration can yield the most robust results. Its strength was balancing innovation (custom models, multi-agent LLM use) with proven methods (fuzzing, concolic execution)[81][9], resulting in top scores across the board. Atlantis highlighted that engineering reliability and diversity in approach (multiple independent CRSs) is as important as raw AI prowess[198][68].
- Trail of Bits’ Buttercup showed the value of AI-assisted efficiency. By inserting AI where it adds clear value – generating smarter inputs and assisting patch generation – Buttercup achieved high accuracy and breadth with less expenditure[24][37]. It underscored that one can leverage AI to boost traditional tools (LLM-generated seeds for fuzzers, multi-agent patching) without over-relying on expensive models, thus optimizing cost/benefit[19][15].
- Theori’s RoboDuck pushed the envelope on LLM-driven autonomy, confirming that an AI-first system can indeed find and exploit real bugs with minimal human-like heuristics[103][110]. Its success in 3rd place validated the concept of chaining LLM “agents” to perform tasks like static analysis, exploit development, and patching in a loop[27][30]. RoboDuck’s performance suggests that as AI models improve, this approach could become even more dominant – though Theori had to carefully constrain its agents to keep them on track[29].
- All You Need is a Fuzzing Brain demonstrated the power (and cost) of massively parallel AI. It found and fixed bugs by orchestrating an army of AI workers in parallel[39], achieving notable wins in speed and static accuracy[38]. The lesson from FuzzingBrain is that scaling out AI can brute-force results, but efficiency suffers (highest cost per point)[120]. This approach might be more viable as AI inference costs drop, but for now it was an expensive route to fourth place – albeit one that shows AI “swarm intelligence” can rival more methodical strategies.
- Shellphish’s Artiphishell embodied a comprehensive multi-agent ecosystem. With 60+ agents dividing every sub-task[41], it was arguably the most ambitious integration of AI and security expertise. It excelled in patch correctness and complex input fuzzing (via Grammar Guy)[42][46], though it lagged in volume of outputs. The takeaway is that highly modular AI systems can work and ensure quality, but coordinating them is complex – a single bottleneck or miscoordination can slow the whole system. Artiphishell’s design will likely inspire future research into multi-agent autonomy and how to optimize it.
- 42-b3yond-6ug’s Bug Buster emphasized extreme scale fuzzing and clever patching, finding the most bugs after the winner[57]. It proved that given enough compute and refined scheduling (RL-based), fuzzers remain incredibly potent[139][52]. However, their stumble in patching (“super patches” with a bug) highlights that novel AI ideas need rigorous testing – an insightful lesson that development speed and code quality can be a deciding factor even with great algorithms[152][58]. Bug Buster’s journey also illustrated the difficulty of integrating multiple advanced components (fuzzing, RL, static slicing, etc.) under time pressure – a more focused scope might have yielded a higher rank.
- Lacrosse showed a conservative, legacy-informed approach, which, while yielding minimal points, serves as a baseline: an earlier-generation CRS updated with modest AI can still operate on modern targets[62], but it will likely be outpaced by more aggressive AI integration[64]. Lacrosse’s strong reliability yet low score tells us that in this new era, playing it safe is not enough – AI and automation are required to achieve state-of-the-art results, even if the traditional methods give a stable foundation.
- SHERPA, finally, is a standout contribution that, while not a competitor’s CRS, complements the above by addressing a practical gap with AI. It exemplifies how AI can be surgically applied to automate tedious security tasks (harness writing, triage)[164][167], dramatically amplifying human ability to secure code. The fact that SHERPA found numerous bugs quickly in real software[176][174] underscores the real-world impact potential of these AIxCC innovations.
In summary, the competition affirmed that AI is now an indispensable part of automated cybersecurity: all top teams used machine learning in some form to achieve results unattainable by classical techniques alone. However, it also showed that hybrid approaches – blending AI with time-tested methods – currently have an edge over purely AI or purely traditional systems. The winning teams were those who managed that blend most effectively (or in Theori’s case, tamed an AI-first approach with enough guardrails).
Key comparative takeaways include:
- Core Approaches: Teams like Atlanta and Shellphish that combined multiple approaches (fuzzing + static + concolic + AI) performed extremely well, suggesting that diversity equals strength in CRS design[198][75]. In contrast, a single-technique focus (e.g., fuzzing-heavy 42-beyond or AI-heavy FuzzingBrain) had either patch or cost issues.
- AI Usage: Using AI to assist and enhance (Buttercup, Shellphish) tended to yield high accuracy and efficient use of budget[24][37]. Fully AI-driven systems (Theori, FuzzingBrain) found lots of bugs but required careful constraint or heavy resources[29][120]. A striking point was how all teams used AI for patching – there was broad consensus that LLMs are suitable for generating code fixes[109]. The differentiator was how much trust and autonomy they gave the AI in the overall loop.
- Traditional Techniques: The competition reinforced that techniques like fuzzing are not obsolete – in fact, they were central to every team’s success, but often turbocharged by AI. Teams that extended and scaled traditional methods (Atlanta’s multi-fuzzers, ToB’s static queries, 42’s RL fuzzing) ended up in the top half[11][22]. Traditional static analysis alone was insufficient due to noise – but integrated with AI (Theori, FuzzingBrain, Shellphish SARIF handling), it became useful[108][38].
- Performance Metrics: Final scores aside, different tools excelled in different metrics. For instance, 42-beyond led in raw finds, Shellphish led in patch accuracy, FuzzingBrain in speed of first find, ToB in CWE coverage[16][48]. This suggests each CRS carved out a niche of strength. A future ideal system might integrate the strengths of all – the coverage of Buttercup, the depth of RoboDuck, the thoroughness of Atlantis, the harness generation of SHERPA – to cover all metrics.
- Documentation & Usability: Most teams have open-sourced their code and provided documentation or blogs, but complexity varies. Buttercup appears relatively turnkey (they emphasize one can run it on a laptop)[102], whereas Atlantis and Artiphishell are complex distributed systems that require cloud infrastructure and careful configuration[126][127]. Usability will be key for these CRSs to be adopted outside the competition. In that sense, SHERPA and Buttercup set good examples by focusing on being accessible and well-documented[102][193].
In conclusion, the AIxCC finalists collectively advanced the state-of-the-art by demonstrating that AI and traditional program analysis can be fused to automatically secure real-world software at scale and speed previously impossible. Each tool brought a unique innovation: Atlantis with multi-engine fault tolerance[5], Buttercup with cost-effective LLM integration[18], RoboDuck with end-to-end AI reasoning[27], FuzzingBrain with massive agent parallelism[39], Artiphishell with large-scale agent collaboration[41], Bug Buster with RL-guided fuzzing and multi-bug patches[52][51], and Lacrosse with a demonstration of stability vs. innovation trade-offs[64]. And supporting all, SHERPA showcased an immediately applicable tool for the community[199][182].
The competition may be over, but its impact is just beginning: these open-source CRSs and tools can now be tested, combined, and built upon by the broader cybersecurity community[196][197]. The strengths of each can potentially compensate for the weaknesses of others. The future likely lies in hybrid systems that use AI intelligently – neither ignoring its power nor using it blindly – to cover the full spectrum of software security tasks. AIxCC has indeed marked “a pivotal inflection point” for cyber defense[200], showing that automated systems augmented by AI are not just a research dream but a practical reality that can find and fix critical vulnerabilities at scale[1][201]. The tools compared in this report are the first generation of that reality; with continued development and community involvement, we can expect even more robust AI-driven security solutions in the coming years. Each of these systems, in its own way, contributes to that vision of “Security-AGI” – autonomous agents defending code, faster than attackers can find bugs[202][203].
[1] [15] [16] [17] [18] [20] [21] [22] [24] [25] [36] [37] [91] [95] [99] [100] [101] [102] [120] [131] Trail of Bits' Buttercup wins 2nd place in AIxCC Challenge -The Trail of Bits Blog
https://blog.trailofbits.com/2025/08/09/trail-of-bits-buttercup-wins-2nd-place-in-aixcc-challenge/
[2] [26] [27] [28] [31] [32] [103] [104] [105] [106] [107] [109] [110] [111] [112] [118] AI Cyber Challenge and Theori's RoboDuck - Theori BLOG
https://theori.io/blog/aixcc-and-roboduck-63447
[3] [4] [9] [10] [12] [19] [29] [30] [33] [39] [42] [51] [52] [55] [61] [62] [63] [64] [79] [96] [108] [142] AIxCC finals: Tale of the tape -The Trail of Bits Blog
https://blog.trailofbits.com/2025/08/07/aixcc-finals-tale-of-the-tape/
[5] [6] [7] [8] [11] [44] [45] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] [78] [80] [81] [82] [83] [84] [85] [86] [92] [94] [119] [130] [135] [198] AIxCC Final and Team Atlanta
https://team-atlanta.github.io/blog/post-afc/
[13] [14] [23] [34] [35] [38] [40] [46] [47] [48] [57] [59] [65] [66] [67] [87] [88] [89] [90] [97] [98] [116] [117] [121] [122] [132] [196] [197] [201] aicyberchallenge.com
[41] [128] [129] [133] Shellphish Submits ARTIPHISHELL to the Final Round of the DARPA AI Cyber Competition | The Robert Mehrabian College of Engineering - UC Santa Barbara
[43] [126] [127] GitHub - shellphish/artiphishell
https://github.com/shellphish/artiphishell
[49] [50] [53] [54] [56] [58] [60] [138] [139] [140] [141] [143] [144] [145] [146] [147] [148] [149] [152] [154] We lost the AIxCC. So, what now?
https://www.whexy.com/en/posts/aixcc
[93] Team-Atlanta/aixcc-afc-atlantis - GitHub
https://github.com/Team-Atlanta/aixcc-afc-atlantis
[113] Theori on X: "So, how did our #AIxCC finalist RoboDuck actually pull ...
https://x.com/theori_io/status/1955886400953589901
[114] Theori on X: "Find the full breakdown of how RoboDuck works, from ...
https://x.com/theori_io/status/1953978623360397373
[115] theori-io/aixcc-public: Public results and products from ... - GitHub
https://github.com/theori-io/aixcc-public
[123] [124] [125] GitHub - o2lab/asc-crs-all-you-need-is-a-fuzzing-brain
https://github.com/o2lab/asc-crs-all-you-need-is-a-fuzzing-brain
[134] Wenbo Guo (@WenboGuo4) / X
[136] [150] AI Security Tools — August 2025 - Medium
https://medium.com/@taleliyahu/ai-security-tools-august-2025-9a6c449f0402
[137] ARPA-H to join DARPA in announcing AI Cyber Challenge winners ...
[151] DARPA Selects Small Businesses to Compete in the AI Cyber ...
https://www.darpa.mil/news/2024/small-business-ai-cyber-challenge
[153] oss-security - Re: AI Cyber Challenge (AIxCC) semi-final results from DEF CON 32 (2024)
https://www.openwall.com/lists/oss-security/2024/08/19/2
[155] GitHub - 42-b3yond-6ug/42-b3yond-6ug-asc: This repository hosts our team’s submission to the AI Cyber Challenge Semifinal Competition.
https://github.com/42-b3yond-6ug/42-b3yond-6ug-asc
[156] DEF CON 33 Video Team - DEFCON AIxCC Lacrosse Team
https://www.youtube.com/watch?v=YdJ6NoxzuKs
[157] LACROSSE's AIxCC Final Submission | 01C CTF Radiooo - YouTube
https://www.youtube.com/watch?v=uO1LnENZV_4
[158] DARPA AI Cyber Challenge - Team Lacrosse - YouTube
https://www.youtube.com/watch?v=ULJcATTuT9U
[159] AIxCC Competition Archive | AIxCC Competition Archive
https://archive.aicyberchallenge.com/
[160] [161] [162] [163] [164] [165] [166] [167] [168] [169] [170] [171] [172] [173] [174] [175] [176] [177] [178] [179] [180] [181] [182] [183] [184] [185] [186] [187] [188] [189] [190] [191] [192] [193] [194] [195] [199] GitHub - AIxCyberChallenge/sherpa: Security Harness Engineering for Robust Program Analysis
https://github.com/AIxCyberChallenge/sherpa
[200] AIxCC Semifinal Competition (ASC) - AI Cyber Challenge
AI-Powered Threat Modeling for Secure System Design¶
Overview
This section covers AI-powered threat modeling for secure system design, focusing on open-source tools and prompt-based solutions. Below is a comparison of features across notable open-source tools that assist in early design-time threat modeling.
Threat Modeling¶
Quote
- The best use of threat modeling is to improve the security and privacy of a system through early and frequent analysis.
- Threat modeling must align with an organization’s development practices and follow design changes in iterations that are each scoped to manageable portions of the system.
- The outcomes of threat modeling are meaningful when they are of value to stakeholders.
- Dialog is key to establishing the common understandings that lead to value, while documents record those understandings, and enable measurement.
Quote
Threat modeling is an investment—in my view, it’s a good one, because finding and mitigating threats in the design phase of your workload feature can reduce the relative cost of mitigation, compared to finding the threats later. Consistently implementing threat modeling will likely also improve your security posture over time.
Open Source Tools Comparison¶
Comparison of features and capabilities of open-source AI-driven threat modeling tools (focused on early design-phase usage, except AI Security Analyzer which also supports code-level analysis).
| Tool | Focus & Approach | Key Features | AI / Model Integration |
|---|---|---|---|
| AWS Threat Designer | Design-phase automated threat modeling; Web UI (AWS Cloud stack). Users upload system architecture info (including diagrams) to generate threat models. | – Architecture diagram analysis: Multi-modal LLM analyzes system diagrams to identify components and relationships. – Threat catalog: Generates a comprehensive list of potential threats with interactive filtering and refinement. – Iterative replay: Allows rerunning the model after design changes to see updated risk postures. – Exports & history: Supports exporting results to PDF/DOCX and browsing past models in a catalog. | Uses AWS Bedrock with large models (Anthropic Claude 4 Sonnet by default) for NLP and vision; serverless backend. Multimodal LLM interprets text and diagrams to generate threats. |
| AWS Threat Composer | Design-phase guided threat modeling; Web app or VS Code extension. Emphasizes manual brainstorming with structured guidance (“threat model as code”). | – Data capture: Records system description, architecture and dataflow diagrams, assumptions, etc., as part of the model. – Threat grammar: Uses a prescriptive template to iteratively craft threat statements (with examples and suggestions) instead of starting from scratch. – Insights dashboard: Highlights gaps like threats without mitigations, unprioritized threats, category coverage, etc., to answer “did we do enough?”. – Packs & export: Supports reusable threat/mitigation packs for bulk addition (in self-hosted mode), and can generate a consolidated threat model document. | No built-in generative model – focuses on human-driven input with static guidance. (Uses local storage; no data leaves the browser. “Threat Grammar” framework provides structure rather than AI generation.) |
| StrideGPT | Design-phase automated STRIDE threat modeling; Streamlit web app or Docker CLI. Given an app description (and optionally diagrams or repo link), it auto-generates a full threat model. | – STRIDE threats generation: Uses LLM to output threats categorized by STRIDE (Spoofing, Tampering, etc.) for the provided system description. – Risk & mitigations: Automatically provides DREAD risk scores and suggests mitigation steps for each identified threat. – Attack trees & tests: Produces potential attack paths (attack trees) and even Gherkin-style security test cases based on threats. – Multi-modal input: Can accept architecture diagrams or flowcharts for analysis with vision-capable models; also can analyze a project’s repository (e.g. README) to enrich the threat model. – No data retention: Does not store inputs or results on the server, focusing on privacy. | Supports multiple LLMs via API or local runtime: OpenAI GPT-4 (and newer GPT-4.1), Anthropic Claude 3/4, Google Gemini (2.0/2.5) and others. Also compatible with self-hosted local models (Ollama, LM Studio). This flexible backend allows using the best available model for analysis. |
| AI Security Analyzer | Code-centered security analyzer with threat modeling output; CLI tool (Python) that scans an existing codebase to produce security documentation. Geared toward integrating threat modeling in later stages (after code exists) as well as design review. | – Multi-faceted analysis: Generates a Security Design Review document which includes threat modeling, attack surface analysis, attack trees, mitigation strategies, and identified vulnerabilities. – Code-aware threat identification: Parses project source code (multiple languages: Python, Java, JavaScript, Go, Android, etc.) to find assets, entry points, and potential threats specific to the implementation. – Automated documentation: Outputs comprehensive Markdown reports (e.g. security_design.md) that consolidate the security findings and model. – Flexible deployment: Can run via Python (Poetry) or Docker; cross-platform support (Windows, MacOS, Linux). | Leverages LLMs to analyze code and generate text. Supports multiple model providers: OpenAI API (GPT-4 family), Anthropic (Claude), Google PaLM/Gemini via API, and OpenRouter. The user supplies an API key for the chosen model, and the tool orchestrates prompts to produce the security docs. |
Open Source Tools Details¶
AWS Threat Designer¶
AWS Threat Designer: AI-powered threat modeling for secure system design.
- See blogpost: Accelerate threat modeling with generative AI for an in-depth overview of the solution.
Quote
Effective threat modeling examines data flows, trust boundaries, and potential attack vectors to create a comprehensive security strategy tailored to the specific system.
In a shift-left approach to security, threat modeling serves as a critical early intervention. By implementing threat modeling during the design phase—before a single line of code is written—organizations can identify and address potential vulnerabilities at their inception point.
AWS Accelerate threat modeling with generative AI, JUN 2025
Each function generates specialized prompts for different phases of the threat modeling process, including:
- Asset identification
- Data flow analysis
- Gap analysis
- Threat identification and improvement
- Response structuring
https://github.com/awslabs/threat-designer/tree/main?tab=readme-ov-file#prerequisites
The backend is written in Python: threat-designer/backend/threat_designer/
The associated prompts are https://github.com/awslabs/threat-designer/blob/0554b6a97c08e38bb92504ba13768780adb0301f/backend/threat_designer/prompts.py
See also:
1. https://aws-samples.github.io/threat-technique-catalog-for-aws/Techniques/
2. https://awssecuritydigest.com/articles/threat-modelling-in-aws
AWS Threat Composer¶
A simple threat modeling tool to help humans to reduce time-to-value when threat modeling https://github.com/awslabs/threat-composer#readme
STRIDE GPT¶
Quote
Features:
- Simple and user-friendly interface
- Generates threat models based on the STRIDE methodology
- Multi-modal: Use architecture diagrams, flowcharts, etc. as inputs for threat modelling across all supported vision-capable models
- Generates attack trees to enumerate possible attack paths
- Suggests possible mitigations for identified threats
- Supports DREAD risk scoring for identified threats
- Generates Gherkin test cases based on identified threats
- GitHub repository analysis for comprehensive threat modelling
- No data storage; application details are not saved
- Supports models accessed via OpenAI API, Azure OpenAI Service, Google AI API, Mistral API, or locally hosted models via Ollama and 🆕 LM Studio Server
- Available as a Docker container image for easy deployment
- Environment variable support for secure configuration
The app https://stridegpt.streamlit.app/ has a tab for each of these outputs.
It supports Bring Your Own LLM Key i.e. you choose the LLM and provide your API key.
AI Security Analyzer¶
Quote
AI Security Analyzer is a Python-based tool that analyzes your project's codebase and automatically generates detailed security documentation. It supports multiple analysis types:
- 🔒 Security Design Documentation
- 🎯 Threat Modeling
- 🔍 Attack Surface Analysis
- 🌳 Attack Tree Analysis
- 🛡️ Mitigation Strategies
- 🐛 Vulnerabilities
https://github.com/xvnpw/ai-security-analyzer
It can generate four different types of security documents:
- 🔒 Security Design Documentation: Generating detailed security design review.
- 🎯 Threat Modeling: Performing threat modeling analysis.
- 🔍 Attack Surface Analysis: Identifying potential entry points and vulnerabilities in the project’s attack surface.
- 🌳 Attack Tree Analysis: Visualizing potential attack vectors and their hierarchies through attack tree.
https://xvnpw.github.io/posts/scaling-threat-modeling-with-ai/
The associated prompts are in the repository.
Additional Prompt Resources¶
Beyond full tools, there are also open-source prompt libraries focused on security threat modeling:
- Fabric by Daniel Miessler: A crowdsourced collection of AI prompt “patterns.” It includes a create_stride_threat_model pattern to guide an LLM in producing a STRIDE-based threat model from a system description and a create_threat_scenarios pattern for generating detailed attack scenarios. These patterns can be used with various GPT-based systems to jump-start threat modeling exercises.
- Kornelius Security Audit Prompt: An open prompt (from the Kornelius project) that provides a template for security auditing via LLMs. This prompt script can be adapted to evaluate a system’s security posture by enumerating threats and checks.
Each of the above resources provides ready-made prompt structures that practitioners can use with their AI of choice to conduct threat modeling, complementing the dedicated tools compared in the table.
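To show how little glue is needed to apply such a prompt pattern, here is a minimal Python sketch that sends a system description to a chat model with a STRIDE-style instruction. The prompt wording is illustrative (not the exact Fabric create_stride_threat_model pattern) and the model name is an assumption.

```python
# Minimal sketch: STRIDE-style threat modeling prompt via the OpenAI Python SDK.
# The prompt text is illustrative, not the exact Fabric pattern; model name is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a security architect. Given a system description, produce a STRIDE "
    "threat model: for each of Spoofing, Tampering, Repudiation, Information "
    "Disclosure, Denial of Service, and Elevation of Privilege, list plausible "
    "threats, affected components, and candidate mitigations."
)

system_description = """
A public web app: React front end -> REST API (Python/FastAPI) -> PostgreSQL.
Users authenticate with OIDC; the API also calls a third-party payments service.
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: substitute whatever model you have access to
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": system_description},
    ],
)
print(response.choices[0].message.content)
```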
Overview¶
Abstract
This section looks at some Claude Code setups I use.
I'll add more over time when I'm done playing with them...
Claude Code CLI God Mode¶
I use Claude Code CLI as my main tool.
I wanted a way to use other models via Claude Code to
- reduce Claude Code token usage / cost (currently on Pro plan)
- use the best tool for the job but maintain context in Claude Code
- Gemini is good for large code bases or data given its large context window
- ChatGPT is good for debugging
There are different ways to achieve this:
- using proxy routers per https://youtu.be/EkNfythQNRg
- using Claude Code as the router via a Claude Code command
Having played with the former, I settled on the latter as it worked better.
Highlights from the Reddit Community
“Gemini CLI feels like garbage… but it has that huge context window we all love. So I added instructions to CLAUDE.md to have Claude use [Gemini CLI in non‑interactive mode] when it needs to gather information about a large part of the codebase.”
“Gemini is good at explaining single, independent things … Claude is good at doing complex tasks that require a lot of memory, deep thinking, reasoning.”
https://www.reddit.com/r/ChatGPTCoding/comments/1lm3fxq/gemini_cli_is_awesome_but_only_when_you_make
This was then extended as a Claude Code command god-cli to support OpenAI Codex also
https://github.com/alexsmedile/god-cli/blob/main/god-cli.md.
So a "Map repo architecture, find cause of memory leak, and propose a precise patch" prompt to Claude Code would
- Route to Gemini → architecture map.
- Route to Codex → debugging/diagnosis.
- Route to Claude → minimal diff patch.
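The same routing idea can be sketched outside Claude Code. The Python sketch below shells out to each CLI in non-interactive mode; the flags shown (gemini -p, codex exec, claude -p) and the task-to-tool mapping are assumptions based on the CLIs at the time of writing - check each tool's --help for your installed versions.

```python
# Minimal sketch of CLI routing: send each sub-task to the tool best suited for it.
# Assumes the gemini, codex, and claude CLIs are installed and authenticated; the
# exact non-interactive flags may differ by version - check each CLI's --help.
import subprocess

def run_cli(cmd: list[str]) -> str:
    """Run a CLI in non-interactive mode and return its stdout."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

def route(task: str, prompt: str) -> str:
    if task == "architecture":      # large-context repo analysis
        return run_cli(["gemini", "-p", prompt])
    if task == "debug":             # diagnosis / root-cause hunting
        return run_cli(["codex", "exec", prompt])
    return run_cli(["claude", "-p", prompt])  # default: minimal diff patch

if __name__ == "__main__":
    arch = route("architecture", "Map this repo's architecture and key data flows.")
    diagnosis = route("debug", f"Given this architecture:\n{arch}\nFind the likely memory leak.")
    patch = route("patch", f"Propose a minimal diff to fix:\n{diagnosis}")
    print(patch)
```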
Breakthrough Method of Agile AI-Driven Development¶
I researched and evaluated different Agent-Based Frameworks for Software Engineering. Breakthrough Method of Agile AI-Driven Development was the one I selected as being closest to my views.
Software Engineering 1.0 Redux shows this in action with the inputs, chat history, output artifacts.
I added several features that I will merge back:
- a security architect (Chris)
- a vulnerability tech (Tanja) for code analysis
- uses sub-agents to combine traditional SAST, Dependency tools with LLMs
CLAUDE.md¶
This CLAUDE.md addresses some of the undesirable behaviors I observed:
- sycophancy - I prefer my reality raw
- code without tests - so a TDD approach
- code without checking for existing code or being overly eager to code
- doing something to get the job done - but not the actual documented plan e.g.
- mocks or stubs
- instead of installing a documented needed dependency, working around it with a lesser alternative.
Takeaways¶
Key Takeaways
- By instructing Claude Code (via a god-cli Command) to route specific tasks to specific CLI tools (Google Gemini CLI, OpenAI Codex CLI), you get the best tool for the job, while saving $.
- Breakthrough Method of Agile AI-Driven Development (with my security additions) fits my current views of Software Engineering with GenAI.
- CLAUDE.md addresses some of the undesirable behaviors I observed.
Ended: Software Engineering 1.0 Redux
Ended: Software Engineering
CyberSecurity Models ↵
Overview¶
Overview
CyberSecurity Models are models that are optimized or configured specifically for the CyberSecurity domain, e.g. via domain-specific training data, fine-tuning, or added knowledge.
Benchmark¶
CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence is a suite of benchmark tasks and datasets to evaluate LLMs in Cyber Threat Intelligence (CTI).

It consists of:
- (a) CTI-MCQ, a knowledge evaluation dataset comprising multiple-choice questions aimed at testing LLMs’ understanding of crucial CTI concepts, including standards, threat identification, detection strategies, mitigation techniques, and best practices
- (b) CTI-RCM, which involves mapping Common Vulnerabilities and Exposures (CVE) descriptions to Common Weakness Enumeration (CWE) categories
- (c) CTI-VSP, which requires calculating Common Vulnerability Scoring System (CVSS) scores
- (d) CTI-ATE, which focuses on extracting MITRE ATT&CK techniques from threat descriptions
- (e) CTI-TAA, where LLMs are tasked with analyzing publicly available threat reports and attributing them to specific threat actors or malware families
Benchmark Usage¶
CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence is being used by some vendors for their CTI models e.g.
- Google https://security.googleblog.com/2025/04/google-launches-sec-gemini-v1-new.html
- Cisco https://huggingface.co/fdtn-ai/Foundation-Sec-8B
- Trend Micro https://huggingface.co/trend-cybertron
CTI-RCM¶
The Cyber Threat Intelligence Root Cause Mapping (CTI-RCM) benchmark evaluates an LLM's ability to understand the nuances of vulnerability descriptions, identify a vulnerability's underlying root cause, and accurately classify it according to the CWE taxonomy.
CTI-RCM includes 1000 CVE --> CWE Mappings using the CVE Description only (not CVE Reference link content).
- The 1000 is a random sample of vulnerabilities reported in the year 2024: https://github.com/xashru/cti-bench/blob/main/data/cti-rcm.tsv
- It also includes 1000 CVEs from 2021: https://github.com/xashru/cti-bench/blob/main/data/cti-rcm-2021.tsv
CTI-RCM Exploratory Data Analysis¶
Exploratory Data Analysis of the CTI-RCM dataset with code
An Exploratory Data Analysis of the dataset is available in https://github.com/CyberSecAI/cti_bench_rcm.
Limitations¶
- The dataset is not "known-good" i.e. how do we know the CVE-CWE mappings are correct?
- For comparison, the MITRE CWE Top25 2023 analysis identified a significant number of assigned CWEs that were incorrect
- The dataset contains a limited number of CWEs per https://github.com/CyberSecAI/cti_bench_rcm
- For comparison, the MITRE CWE Top25 2023 contains ~400 unique CWEs
- The dataset contains a limited number of CVEs per https://github.com/CyberSecAI/cti_bench_rcm
- For comparison, the MITRE CWE Top25 2022, 2023 include ~~7K CVE-CWE mappings each
- The dataset contains 1 CWE per CVE
- For comparison, the MITRE CWE Top25 2022, 2023 datasets contain many CVEs with more than 1 CWE per CVE
- The dataset is highly imbalanced per https://github.com/CyberSecAI/cti_bench_rcm
- Some CWEs appear a lot more than others per https://github.com/CyberSecAI/cti_bench_rcm
- The dataset contains the CVE Description only i.e. not the CVE Reference Information in the CVE links
- This generally contains more information than the CVE Description alone.
- A benchmark dataset should use the full CVE info available in the CVE including the Reference Content.
- For comparison, the MITRE CWE Top25 2022, 2023 mappings relied on the Reference content in many cases because the CVE Description was not sufficient.
- The evaluation is a binary equals per https://github.com/xashru/cti-bench/blob/main/evaluation/evaluation.ipynb i.e. correct if the CWE assigned by the model equals the CWE in the benchmark dataset.
- Based on my work with the MITRE CWE-RCMWG, a model may assign CWEs that are (correctly) more specific (childOf) than the CWE in the CVE. Does this make it "wrong" or "more right"?
- An evaluation metric should support this e.g. use distance between benchmark and assigned CWEs, where exact match is distance 0, parent/child is distance 1, grandparent/grandchild or cousin is distance 2 (a minimal sketch of such a metric follows this list).
- An evaluation metric should support the case where there is more than one CWE for a CVE.
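Here is a minimal sketch of such a distance-aware metric. The parent/child sample is a tiny, simplified subset of the CWE hierarchy; a real implementation would load the full hierarchy (e.g. from the CWE XML) and handle multiple CWEs per CVE.

```python
# Minimal sketch of a distance-aware CWE scoring metric (illustrative only).
# CHILD_TO_PARENT is a tiny sample of parent/child edges - not the full CWE hierarchy.
from collections import deque

CHILD_TO_PARENT = {
    "CWE-787": "CWE-119",  # Out-of-bounds Write is a child of CWE-119
    "CWE-125": "CWE-119",  # Out-of-bounds Read is a child of CWE-119
    "CWE-120": "CWE-119",  # Classic Buffer Overflow is a child of CWE-119
}

def cwe_distance(a: str, b: str, max_hops: int = 4) -> int | None:
    """Shortest number of parent/child hops between two CWEs (0 = exact match)."""
    if a == b:
        return 0
    # Build an undirected adjacency list from the parent map.
    adj: dict[str, set[str]] = {}
    for child, parent in CHILD_TO_PARENT.items():
        adj.setdefault(child, set()).add(parent)
        adj.setdefault(parent, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist >= max_hops:
            continue
        for nbr in adj.get(node, ()):
            if nbr == b:
                return dist + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return None  # unrelated within max_hops

def score(benchmark_cwe: str, predicted_cwe: str) -> float:
    """1.0 for exact match, 0.5 for parent/child, 0.25 for distance 2, else 0."""
    d = cwe_distance(benchmark_cwe, predicted_cwe)
    return {0: 1.0, 1: 0.5, 2: 0.25}.get(d, 0.0)

print(score("CWE-119", "CWE-787"))  # parent/child -> 0.5
print(score("CWE-787", "CWE-125"))  # siblings (distance 2) -> 0.25
```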
Agents¶
Why Agents?¶
Why agents are the next frontier of generative AI, July 2024
Tip
"If one agent can't work well, then why is multi-agent useful?"
- Grouping tools/responsibilities can give better results. An agent is more likely to succeed on a focused task than if it has to select from dozens of tools.
- Separate prompts can give better results. Each prompt can have its own instructions and few-shot examples. Each agent could even be powered by a separate fine-tuned LLM!
- Helpful conceptual model to develop. You can evaluate and improve each agent individually without breaking the larger application.
Multi-agent designs allow you to divide complicated problems into tractable units of work that can be targeted by specialized agents and LLM programs.
Using Agents can significantly improve performance
GPT-3.5 (zero shot) was 48.1% correct. GPT-4 (zero shot) does better at 67.0%. However, the improvement from GPT-3.5 to GPT-4 is dwarfed by incorporating an iterative agent workflow. Indeed, wrapped in an agent loop, GPT-3.5 achieves up to 95.1%. https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance
An agentic workflow in which the LLM is prompted to focus on one thing at a time can give better performance. https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-5-multi-agent-collaboration/
Agentic AI Design Patterns
- Reflection: The LLM examines its own work to come up with ways to improve it.
- Tool Use: The LLM is given tools such as web search, code execution, or any other function to help it gather information, take action, or process data.
- Planning: The LLM comes up with, and executes, a multistep plan to achieve a goal (for example, writing an outline for an essay, then doing online research, then writing a draft, and so on).
- Multi-agent collaboration: More than one AI agent work together, splitting up tasks and discussing and debating ideas, to come up with better solutions than a single agent would.
https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance
Reflection using a multi-agent framework
Further, we can implement Reflection using a multi-agent framework. I've found it convenient to create two different agents, one prompted to generate good outputs and the other prompted to give constructive criticism of the first agent's output. The resulting discussion between the two agents leads to improved responses.
https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-2-reflection/
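A minimal sketch of that two-agent Reflection pattern is below, using the same LangChain stack as the skeleton code later in this section: a generator prompt, a critic prompt, and one revision pass. The prompts, the example finding, and the default model are assumptions.

```python
# Minimal sketch of the Reflection pattern: a generator agent plus a critic agent,
# each with its own prompt, followed by one revision pass. Prompts/model are illustrative.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(temperature=0)

generator = ChatPromptTemplate.from_template(
    "Write a short security advisory for this finding:\n{finding}"
) | llm

critic = ChatPromptTemplate.from_template(
    "You are a harsh reviewer. List concrete problems (missing impact, vague "
    "remediation, wrong severity) in this advisory:\n{draft}"
) | llm

reviser = ChatPromptTemplate.from_template(
    "Revise the advisory to address the critique.\nAdvisory:\n{draft}\nCritique:\n{critique}"
) | llm

finding = "SQL injection in the /search endpoint of the customer portal."
draft = generator.invoke({"finding": finding}).content
critique = critic.invoke({"draft": draft}).content
final = reviser.invoke({"draft": draft, "critique": critique}).content
print(final)
```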
Like the design pattern of Planning, I find the output quality of multi-agent collaboration hard to predict, especially when allowing agents to interact freely and providing them with multiple tools. The more mature patterns of Reflection and Tool Use are more reliable.
https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-5-multi-agent-collaboration/
LLMs in Production¶
- Building LLMs for Production: Enhancing LLM Abilities and Reliability with Prompting, Fine-Tuning, and RAG, October 2024, Louis-Francois Bouchard and Louie Peters
- LLMs in Production From language models to successful products, December 2024, Christopher Brousseau and Matthew Sharp
- The Shift from Models to Compound AI Systems, Feb 18, 2024
- LLM Engineer's Handbook, by Paul Iusztin and Maxime Labonne, Oct 2024
- What We’ve Learned From A Year of Building with LLMs. A practical guide to building successful LLM products, covering the tactical, operational, and strategic., June 2024
- Google's Approach for Secure AI Agents, Santiago (Sal) Díaz Christoph Kern Kara Olive, Google (2025)
Agents in Production¶
- Building effective agents (Anthropic), Dec 19, 2024
- Pitfalls and Best Practices — 5 lessons from LLMs in Production, Raza Habib, LLMs in Prod Con 2 https://www.youtube.com/watch?v=TRAUNcPj8KU
- Generative AI Agents in Production: Best Practices and Lessons Learned, Patrick Marlow https://www.youtube.com/watch?v=3y17VO3C8jo and whitepaper https://www.kaggle.com/whitepaper-agents
- Fine-Tuning LLMs: Best Practices and When to Go Small, Mark Kim-Huang, MLOps Meetup https://www.youtube.com/watch?v=1WSUfWojoe0
- Everything you ever wanted to know about LLMOps Maturity Models, Alex Strick van Linschoten Nov 26, 2024 https://www.zenml.io/blog/everything-you-ever-wanted-to-know-about-llmops-maturity-models
- A comprehensive collection of over 300 real-world generative AI implementations that showcases how organizations are successfully deploying LLMs in production https://www.zenml.io/blog/demystifying-llmops-a-practical-database-of-real-world-generative-ai-implementations
- Stanford Webinar - Agentic AI: A Progression of Language Model Usage and a list of 40 takeaways
References¶
- https://huggingface.co/spaces/galileo-ai/agent-leaderboard "How do top LLMs perform in real-world agentic scenarios?"
- https://github.com/Jason2Brownlee/awesome-llm-books a list of "Awesome LLM Books"
- https://www.deeplearning.ai/the-batch/how-agents-can-improve-llm-performance
- https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-2-reflection
- https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-3-tool-use
- https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-4-planning
- https://www.deeplearning.ai/the-batch/agentic-design-patterns-part-5-multi-agent-collaboration
- https://www.linkedin.com/pulse/potential-impact-future-ai-agents-agentic-series-part-44-techwards-84h3f
- https://www.resilientcyber.io/p/agentic-ais-intersection-with-cybersecurity
- https://blog.langchain.dev/langgraph-multi-agent-workflows/
- Mixture-of-Agents Enhances Large Language Model Capabilities, June 2024, https://arxiv.org/abs/2406.04692
- Data Interpreter: An LLM Agent For Data Science, Mar 2024, https://arxiv.org/abs/2402.18679
- AgentCoder: Multi-Agent Code Generation with Effective Testing and Self-optimisation, May 2024, https://arxiv.org/html/2312.13010v3
- The COMPLETE TRUTH About AI Agents (2024) https://www.youtube.com/watch?v=HGS5LfyxLUU
- Securing Agentic AI: A Comprehensive Threat Model and Mitigation Framework for Generative AI Agents, Apr 2025, https://arxiv.org/html/2504.19956v1
Building Multi Agent Systems¶
Overview
Here we will discuss how to build a multi agent system using LangChain and LangGraph.
We will illustrate the architecture of the system using a Mermaid diagram (generated via a GPT).
We will then convert the diagram to LangChain and LangGraph code (using a GPT)
No-code solutions for building Multi Agent Systems are also available.
Some examples:
The building/coding of these agents was AI-assisted
- Architecture Diagram: Bootstrap Mermaid diagram was created (using Claude 3.5 Sonnet)
- Code: Bootstrap code was generated from the Architecture Diagram (using Claude 3.5 Sonnet)
- Code Editor: Cursor AI was used (with Claude 3.5 Sonnet)
- Documentation: CrewAI.com Chat with our docs link (ChatGPT agent). This was also used to generate code snippets as an answer to a question.
Multi Agent Systems Problem and Solution Space¶
For open-ended problems, ReAct is appropriate.
But for smaller Problem and Solution Spaces (more bounded), agents and their tools can be more defined and deterministic, and this allows for more optimization.
Multi Agent Systems Frameworks¶
There are several frameworks for building Multi Agent Systems (MAS), and many more being developed.
Example popular frameworks:
Some frameworks are best suited to more complex multi agent use cases - where autonomy and conversation is required.
- Chatty agent conversations can be expensive in time and money as they mean more LLM calls.
- These frameworks can also be useful for rapid prototyping and development of Multi Agent Systems - getting a working system quickly, that can then be optimized and customized with a lower-level framework.
For simpler use cases, where we want to prescribe/control how agents interact with each other, we can use LangChain and LangGraph (or no-code solution based on LangGraph).
- LangChain does have LCEL (LangChain Expression Language) but LangGraph on LangChain may be a better option.
- Another option again is to write (and maintain) your own plumbing to get a simpler Multi Agent System. And it seems like lots of people are rolling their own.
My requirements here are:
- Minimal Time-To-Value and Time-To-Learning from the initial solution
- Support by tools
- including Observability tooling e.g. LangFuse, LangSmith, LangTrace
- Representation and support for production Deployment on Cloud platforms (AWS, GCP)
Architecture¶
graph TD
Start((Start)) --> A[Get Vulnerability Description]
A --> |Vulnerability Description| C[Extract vulnerability keyphrases from Vulnerability Description]
C --> |Vulnerability KeyPhrases| D[Get CVEs with similar weakness keyphrases \nfrom CWE Observed Examples]
C --> |Vulnerability KeyPhrases| E[Get CVEs with similar weakness keyphrases \nfrom Top25 CWE Mappings]
C --> |Vulnerability KeyPhrases| I[Get CVEs with similar weakness keyphrases \nfrom NVD]
A-. Vulnerability Description .-> R
D --> R[Create report]
E --> R[Create report]
I --> R[Create report]
R --> |Draft Report|G[Review Report]
G -->|Human feedback| G
G --> |Final Report|End((End))
Note
Text from the NVD References could also be retrieved to augment the report input, as sometimes they contain information that is not in the CVE Description.
That has not been done here (yet) and is not shown in the diagram.
Diagram Code
graph TD
Start((Start)) --> A[Vulnerability Description]
A --> |Vulnerability Description| C[Extract vulnerability keyphrases from Vulnerability Description]
C --> |Vulnerability Description\nVulnerability KeyPhrases| D[Get CVEs with similar weakness keyphrases \nfrom CWE Observed Examples]
C --> |Vulnerability Description\nVulnerability KeyPhrases| E[Get CVEs with similar weakness keyphrases \nfrom Top25 CWE Mappings]
C --> |Vulnerability Description\nVulnerability KeyPhrases| I[Get CVEs with similar weakness keyphrases \nfrom NVD]
D --> R[Create report]
E --> R[Create report]
I --> R[Create report]
R --> |Draft Report|G[Review Report]
G -->|Human feedback| G
G --> |Final Report|End((End))
Architecture to Code¶
Claude 3.5 Sonnet
Multi Agent System Skeleton Code¶
from typing import TypedDict, Annotated, Sequence
from langchain_core.messages import BaseMessage
from langgraph.graph import StateGraph, END
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
import asyncio
# Define the state
class AgentState(TypedDict):
messages: Annotated[Sequence[BaseMessage], "The messages in the conversation"]
vulnerability_description: Annotated[str, "The original vulnerability description"]
vulnerability_keyphrases: Annotated[list, "Extracted keyphrases from the vulnerability description"]
cwe_observed_cves: Annotated[list, "CVEs from CWE Observed Examples"]
top25_cwe_cves: Annotated[list, "CVEs from Top25 CWE Mappings"]
nvd_cves: Annotated[list, "CVEs from NVD"]
draft_report: Annotated[str, "The draft report"]
final_report: Annotated[str, "The final report"]
# Initialize the language model
llm = ChatOpenAI(temperature=0)
# Define the agents
def extract_keyphrases(state):
prompt = ChatPromptTemplate.from_template(
"Extract key phrases related to the vulnerability from the following description: {vulnerability_description}"
)
chain = prompt | llm
response = chain.invoke({"vulnerability_description": state["vulnerability_description"]})
keyphrases = response.content.split(", ")
return {"vulnerability_keyphrases": keyphrases}
async def get_cwe_observed_cves(state):
# Simulating an asynchronous API call
await asyncio.sleep(1)
# In a real scenario, you would query a CVE database here
return {"cwe_observed_cves": ["CVE-2021-1234", "CVE-2022-5678"]}
async def get_top25_cwe_cves(state):
await asyncio.sleep(1)
return {"top25_cwe_cves": ["CVE-2020-9876", "CVE-2023-4321"]}
async def get_nvd_cves(state):
await asyncio.sleep(1)
return {"nvd_cves": ["CVE-2019-8765", "CVE-2024-1111"]}
def create_report(state):
prompt = ChatPromptTemplate.from_template(
"Create a comprehensive report based on the following information:\n"
"Vulnerability Description: {vulnerability_description}\n"
"Key Phrases: {vulnerability_keyphrases}\n"
"CVEs from CWE Observed Examples: {cwe_observed_cves}\n"
"CVEs from Top25 CWE Mappings: {top25_cwe_cves}\n"
"CVEs from NVD: {nvd_cves}\n"
"Provide a detailed analysis and recommendations."
)
chain = prompt | llm
response = chain.invoke(state)
return {"draft_report": response.content}
def review_report(state):
prompt = ChatPromptTemplate.from_template(
"Review and improve the following report:\n{draft_report}\n"
"Provide any necessary corrections or additions."
)
chain = prompt | llm
response = chain.invoke(state)
return {"final_report": response.content}
# Define the graph
workflow = StateGraph(AgentState)
# Add nodes
workflow.add_node("extract_keyphrases", extract_keyphrases)
workflow.add_node("get_cwe_observed_cves", get_cwe_observed_cves)
workflow.add_node("get_top25_cwe_cves", get_top25_cwe_cves)
workflow.add_node("get_nvd_cves", get_nvd_cves)
workflow.add_node("create_report", create_report)
workflow.add_node("review_report", review_report)
# Add edges: fan out from keyphrase extraction to the three CVE lookups,
# then fan back in to report creation and review
workflow.add_edge("extract_keyphrases", "get_cwe_observed_cves")
workflow.add_edge("extract_keyphrases", "get_top25_cwe_cves")
workflow.add_edge("extract_keyphrases", "get_nvd_cves")
workflow.add_edge("get_cwe_observed_cves", "create_report")
workflow.add_edge("get_top25_cwe_cves", "create_report")
workflow.add_edge("get_nvd_cves", "create_report")
workflow.add_edge("create_report", "review_report")
# Finish after the report has been reviewed
workflow.add_edge("review_report", END)
# Set entry point
workflow.set_entry_point("extract_keyphrases")
# Compile the graph
app = workflow.compile()
# Run the graph
config = {"vulnerability_description": "A buffer overflow vulnerability in the XYZ software allows remote attackers to execute arbitrary code."}
result = app.invoke(config)
print(result["final_report"])
Ended: CyberSecurity Models
TopicModeling ↵
BERTopic¶
Overview
Topic modeling can be used to discover the underlying themes and patterns in a collection of documents (e.g. CVE Descriptions, or CWE Descriptions,...).
It is an unsupervised machine learning technique that’s capable of scanning a set of documents, detecting word and phrase patterns within them, and automatically clustering word groups and similar expressions that best characterize the set of documents.
The output is collections of documents that the algorithm has grouped together, as well as clusters of words and expressions that it used to infer these relations.
BERTopic¶
BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF. In general, it outperforms statistical methods such as LSA and LDA because it is Language Model based.
Quote
Before diving into a classification task, text clustering allows for getting an intuitive understanding of the task but also of its complexity.
BERTopic is a topic modeling technique that assumes that clusters of semantically similar documents are a powerful way of generating and describing clusters. The documents in each cluster are expected to describe a major theme and combined they might represent a topic.
Extracts from the book co-authored by the BERTopic creator: https://www.oreilly.com/library/view/hands-on-large-language/9781098150952/
Topic Model CWEs¶
I've used BERTopic to Topic Model lots of datasets to quickly get an intuition for the data.
In this example, we'll look at CWEs.
- Specifically, generate a Topic Model of the ~1000 CWEs in 1400 View, so we can better understand the underlying patterns and themes.
I did this initially as part of a LinkedIn Thread, to illustrate Topic Modeling as a solution, but also the ease and speed at which it could be done.
Code¶
Source Code
The starting point code is deliberately chosen as unrelated to CyberSecurity: https://github.com/splevine/harnessing-nlp-cx-demo Mar 2024 (where this repo gives an excellent overview of NLP and BERTopic and SetFit).
The main changes I made were:
- Don't use the wandb (Weights & Biases) service as it was not needed for the demo
- Use the 1400 CWE view 1400.csv as the input Data
- Use SentenceTransformer("basel/ATTACK-BERT") as the embedding_model so that CyberSecurity embeddings are used.
The code and visualizations are available here: https://colab.research.google.com/drive/168i3oHekL4Cva2s2w01KMq6mnffDgxIS?usp=sharing No effort was made to optimize the topic modeling i.e. it was completely unsupervised.
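For reference, a minimal sketch of the core of that notebook is below, assuming the 1400.csv export has a CWE description column named "Description" (the actual column name in the notebook may differ).

```python
# Minimal sketch of topic modeling CWE descriptions with BERTopic and a
# CyberSecurity embedding model. The column name "Description" is an assumption.
import pandas as pd
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

# Load the CWE-1400 view export (one row per CWE).
cwes = pd.read_csv("1400.csv")
docs = cwes["Description"].fillna("").tolist()

# Use CyberSecurity-tuned embeddings rather than the default model.
embedding_model = SentenceTransformer("basel/ATTACK-BERT")

topic_model = BERTopic(embedding_model=embedding_model, verbose=True)
topics, probs = topic_model.fit_transform(docs)

# Inspect the topics and visualize the clusters.
print(topic_model.get_topic_info().head(20))
fig = topic_model.visualize_documents(docs)  # returns a Plotly figure; fig.show() or render in a notebook
```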
Topic Model of the ~1000 CWEs in 1400 View¶
The top-level categorizations of CWE View 1400 per https://cwe.mitre.org/data/definitions/1400.html.
The Topic Model of the CWE 1400 view 1400.csv with the top-level categorizations overlaid in bold, to allow comparison of the top-level groupings assigned by MITRE with those from the Topic Modeling.
Other visualizations are available in the Colab Notebook.
Takeaways¶
Takeaways
- Topic Modeling with BERTopic is a powerful unsupervised learning approach to quickly and easily get an intuition for a set of documents.
Ended: TopicModeling
References ↵
Talks and Research¶
Overview
This section highlights relevant talks and research.
My Talks and Research¶
| Title | Venue | Author(s) | Date (YYYY/M) |
|---|---|---|---|
| Leveraging generative AI for efficient CVE enrichment | CyberRisk Summit | Chris Madden | 2024/11 |
| Vulnerability Root Cause Mapping with CWE: Challenges, Solutions, and Insights from Grounded LLM-based Analysis. Available soon on FIRST YouTube. | FIRST VulnCon | Chris Madden, Alec Summers | 2025/4 |
LLMs For CyberSecurity References¶
- Generative AI and Large Language Models for Cyber Security: All Insights You Need, May 2024
- A Comprehensive Review of Large Language Models in Cyber Security, September 2024
- Large Language Models in Cybersecurity: State-of-the-Art, January 2024
- How Large Language Models Are Reshaping the Cybersecurity Landscape | Global AI Symposium talk, September 2024
- Large Language Models for Cyber Security: A Systematic Literature Review, July 2024
- Using AI for Offensive Security, June 2024
LLMs and GenAI in General¶
| Title | Venue | Author(s) | Date (YYYY/M) |
|---|---|---|---|
| Microsoft New Future of Work Report 2023 | Report | Microsoft | 2023/12 |
| Intelligent Agents in AI Really Can Work Alone. Here’s How. | Post | Gartner | 2024/10 |
Ended: References
Annex ↵
Learning¶
Books¶
- Prompt Engineering for Generative AI James Phoenix, Mike Taylor
- Hands-On Large Language Models Jay Alammar, Maarten Grootendorst
- AI Agents in Action, Publication in Early 2025 (estimated)
- Prompt Engineering in Practice, Publication in Spring 2025 (estimated)
- LLMs in Production From language models to successful products
- AI Apps with LangChain
- Natural Language Processing in Action, 2nd ed.
- Data Storytelling with Generative AI using Altair and Python
- langchain in your pocket
Newsletters¶
Lists¶
- Applying LLMs & GenAI to Cyber Security (list of resources) Dylan Williams
- https://joelbooks.medium.com/top-8-books-about-generative-ai-2024-llms-gpts-diffusion-models-b70663f72686
- https://emt.gartnerweb.com/ngw/globalassets/en/cybersecurity/documents/predicts-2024-cybersecurity-ai.pdf
- https://www.wardsauto.com/industry/gartner-predicts-highs-and-lows-of-ai-in-cybersecurity
Courses¶
- AI and GenAI Courses | Andrew Ng and others
- Hugging Face courses
~: shorthand for "approximately"
~~: shorthand for "very approximately"