# Overview
There are several options to consider when building a Language Model (LM) solution for CWE assignment, as described here.
## Approach to using Language Models
### Don't Train A Model On Bad Data!
It is possible to train a Language Model as a classifier that assigns CWEs to a CVE Description, and several research papers took that approach, e.g.
- V2W-BERT: A Framework for Effective Hierarchical Multiclass Classification of Software Vulnerabilities
- Automated Mapping of CVE Vulnerability Records to MITRE CWE Weaknesses
The problems with this approach:

- It is delusional, based on my research and experience of incorrectly assigned CWEs in general: Garbage In, Garbage Out.

    > There has been significant interest in using AI/ML in various applications to use and/or map to CWE, but in my opinion there are a number of significant hurdles, e.g. you can't train on "bad mappings" to learn how to do good mappings.

- It removes a lot of the context that could be available to an LM, by reducing the reference target down to a set of values or classes (for the given input CVE Descriptions).
### Train on Good Data and the Full Standard
We can instead "train" on "good mappings":

- The CWE standard includes known good mappings, e.g. the CWE-917 Observed Examples include CVE-2021-44228 and its Description.
- The count of these CVE Observed Examples varies significantly per CWE.
- There are ~3K CVE Observed Examples in the CWE standard (see the extraction sketch after this list).
- The Top25 Dataset contains ~6K CVEs with known-good CWE mappings curated by MITRE.
- We can use the full CWE standard and the associated known-good CWE mappings as the target, allowing an LLM to compare the CVE Description (and other data) against it.
- Moreover, we can prompt the LLM to cite similar CVEs to support its rationale for the CWE assignment.
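As a rough illustration, here is a minimal sketch of harvesting these known-good pairs from a JSON rendering of the CWE specification. The schema used below (`weaknesses`, `id`, `observed_examples`, etc.) is an assumed, simplified layout for illustration, not the actual MITRE file format:

```python
import json

# Load a JSON rendering of the CWE specification.
# NOTE: the field names below are assumptions for illustration;
# adjust them to the actual file's schema.
with open("cwe_spec.json") as f:
    cwe_spec = json.load(f)

# Build (CVE description, CWE id) pairs from the Observed Examples,
# i.e. the known-good mappings embedded in the standard itself.
known_good = []
for weakness in cwe_spec["weaknesses"]:
    for example in weakness.get("observed_examples", []):
        known_good.append(
            {
                "cve_id": example["reference"],        # e.g. "CVE-2021-44228"
                "description": example["description"],
                "cwe_id": weakness["id"],              # e.g. "CWE-917"
            }
        )

print(f"{len(known_good)} known-good CVE->CWE mappings")  # ~3K expected
```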
> **Tip**
>
> Rather than train a model on bad data, we can ask a model to assign / validate a CWE based on its understanding of the CWEs available (and its understanding of CWEs assigned to similar CVEs, based on the Observed and Top25 Examples for each CWE in the standard).
>
> We can also ask the model to follow the MITRE CWE Guidance when assigning a CWE.
## Closed or Open Model
We can use a Closed or Open Model:

- a closed model with access to the CWE specification only (and no other data), e.g. NotebookLM
- an open model with access to the CWE specification and other data
## RAG Corpus
Representations of the MITRE CWE Specification:

- PDF version of the MITRE CWE Specification
- JSON version of the MITRE CWE Specification
- JSON version of a modified MITRE CWE Specification, with parts added and removed to make it more relevant to CWE assignment by an LLM, as described here
> **Tip**
>
> JSON is preferred over PDF: PDF is generally more lossy because it is less structured.
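A sketch of the kind of modification meant here, under the same assumed JSON layout as above. The fields kept are illustrative choices, not the actual modification script:

```python
import json

# Fields assumed useful for CWE assignment; everything else is dropped
# to reduce token count and noise in the RAG corpus. Illustrative only.
KEEP = {"id", "name", "description", "extended_description",
        "mapping_notes", "observed_examples"}

with open("cwe_spec.json") as f:
    cwe_spec = json.load(f)

trimmed = [
    {k: v for k, v in weakness.items() if k in KEEP}
    for weakness in cwe_spec["weaknesses"]
]

with open("cwe_spec_trimmed.json", "w") as f:
    json.dump(trimmed, f, indent=2)
```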
## GPT setups
Different GPT setups are possible, e.g.

- ChatGPT GPT
    - Requires you and your users to have a paid ChatGPT subscription.
- NotebookLM
    - Anyone with a Google account can get up and running in 5 minutes for free.
- Vertex AI
    - Allows the most customization, but takes more effort to set up and is not free.
## Prompts
- Various Prompts, and Prompt Engineering Techniques, can be used depending on what you want (see the example prompt below).
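As an illustration, a minimal assignment prompt might look like the following. The wording and placeholder names are illustrative, not the exact prompt used:

```python
# Illustrative prompt template: asks the model to follow the MITRE CWE
# mapping guidance and to cite similar CVEs in support of its rationale.
PROMPT_TEMPLATE = """You are assigning a CWE to a CVE.

Follow the MITRE CWE mapping guidance: prefer the most specific weakness
the evidence supports, and use mapping notes where available.

CVE description:
{cve_description}

Candidate CWE entries (retrieved from the trimmed specification):
{retrieved_cwe_entries}

Return:
1. The single best CWE ID and name.
2. Your rationale.
3. Similar CVEs (e.g. from the Observed Examples) that support the mapping.
"""
```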
## Model and Environment
For processing 250K+ CVE Descriptions, speed, latency, and cost are important considerations in addition to accuracy.

Based on a comparison of LLMs as of September 2024, Gemini 1.5 Flash was chosen.
There are different Google AI Environments:

- Google AI Studio: lower learning curve, cost, and capability
- Vertex AI Studio, or Vertex AI in general
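A minimal sketch of calling Gemini 1.5 Flash via the `google-generativeai` SDK (the Google AI Studio route; Vertex AI has its own `vertexai` SDK entry point). `YOUR_API_KEY` is a placeholder, and the inline prompt is a shortened stand-in for the template shown in the Prompts section:

```python
import google.generativeai as genai

# Authenticate with a Google AI Studio API key (placeholder below).
genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-flash")

# Shortened stand-in for the full assignment prompt.
prompt = (
    "Following the MITRE CWE mapping guidance, assign the best CWE to this "
    "CVE description and cite similar CVEs that support the mapping:\n\n"
    "Apache Log4j2 JNDI features do not protect against attacker-controlled "
    "LDAP endpoints, allowing remote code execution."
)

response = model.generate_content(prompt)
print(response.text)
```

For bulk runs over 250K+ descriptions, this call would typically be wrapped in batching, concurrency, and retry logic; that plumbing is omitted here.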