# Overview
There are several options to consider when building a Language Model (LM) solution for CWE assignment, as described here.
## Approach to using Language Models
### Don't Train A Model On Bad Data!
It is possible to train a Language Model as a classifier that assigns CWEs to a CVE Description, and several research papers took that approach, e.g.
- V2W-BERT: A Framework for Effective Hierarchical Multiclass Classification of Software Vulnerabilities
- Automated Mapping of CVE Vulnerability Records to MITRE CWE Weaknesses
The problems with this approach:

- It is delusional, based on my research and experience of incorrectly assigned CWEs in general: Garbage In, Garbage Out.

    > There has been significant interest in using AI/ML in various applications to use and/or map to CWE, but in my opinion there are a number of significant hurdles, e.g. you can't train on "bad mappings" to learn how to do good mappings.

- It removes a lot of the context that could be available to an LM, by reducing the reference target down to a set of values or classes (for the given input CVE Descriptions).
### Train on Good Data and the Full Standard
We can instead "train" on "good mappings":

- The CWE standard includes known good mappings, e.g. the CWE-917 Observed Examples include CVE-2021-44228 and its Description.
- The count of these CVE Observed Examples varies significantly per CWE.
- There are ~3K CVE Observed Examples in the CWE standard (see the extraction sketch after this list).
- The Top25 Dataset contains ~6K CVEs with known-good CWE mappings curated by MITRE.
- We can use the full CWE standard and the associated known-good CWE mappings as the target, allowing an LLM to compare the CVE Description (and other data) against it.
- Moreover, we can prompt the LLM to cite similar CVEs to support its rationale for the CWE assignment.
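As a rough illustration, here is a minimal sketch of harvesting these known-good pairs from a JSON rendering of the CWE specification. The schema used below (`weaknesses`, `id`, `observed_examples`, etc.) is an assumed, simplified layout for illustration, not the actual MITRE file format:

```python
import json

# Load a JSON rendering of the CWE specification.
# NOTE: the field names below are assumptions for illustration;
# adjust them to the actual file's schema.
with open("cwe_spec.json") as f:
    cwe_spec = json.load(f)

# Build (CVE description, CWE id) pairs from the Observed Examples,
# i.e. the known-good mappings embedded in the standard itself.
known_good = []
for weakness in cwe_spec["weaknesses"]:
    for example in weakness.get("observed_examples", []):
        known_good.append(
            {
                "cve_id": example["reference"],        # e.g. "CVE-2021-44228"
                "description": example["description"],
                "cwe_id": weakness["id"],              # e.g. "CWE-917"
            }
        )

print(f"{len(known_good)} known-good CVE->CWE mappings")  # ~3K expected
```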
> **Tip**
>
> Rather than train a model on bad data, we can ask a model to assign / validate a CWE based on its understanding of the CWEs available (and its understanding of CWEs assigned to similar CVEs, based on the Observed and Top25 Examples for each CWE in the standard).
>
> We can also ask the model to follow the MITRE CWE Guidance when assigning a CWE.
## Closed or Open Model
We can use a Closed or Open Model:

- a closed model with access to the CWE specification only (and no other data), e.g. NotebookLM
- an open model with access to the CWE specification and other data
## RAG Corpus
Representations of the MITRE CWE Specification:

- PDF version of the MITRE CWE Specification
- JSON version of the MITRE CWE Specification
- JSON version of a modified MITRE CWE Specification, with parts added and removed to make it more relevant to CWE assignment by an LLM, as described here
> **Tip**
>
> JSON is preferred over PDF: PDF is generally more lossy because it is less structured.
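A sketch of the kind of modification meant here, under the same assumed JSON layout as above. The fields kept are illustrative choices, not the actual modification script:

```python
import json

# Fields assumed useful for CWE assignment; everything else is dropped
# to reduce token count and noise in the RAG corpus. Illustrative only.
KEEP = {"id", "name", "description", "extended_description",
        "mapping_notes", "observed_examples"}

with open("cwe_spec.json") as f:
    cwe_spec = json.load(f)

trimmed = [
    {k: v for k, v in weakness.items() if k in KEEP}
    for weakness in cwe_spec["weaknesses"]
]

with open("cwe_spec_trimmed.json", "w") as f:
    json.dump(trimmed, f, indent=2)
```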
## GPT setups
Different GPT setups are possible, e.g.

- ChatGPT GPT
    - Requires you and your users to have a paid ChatGPT subscription.
- NotebookLM
    - Anyone with a Google account can get up and running in 5 minutes for free.
- Vertex AI
    - Allows the most customization, but takes more effort to set up and is not free.
## Prompts
- Various Prompts, and Prompt Engineering Techniques, can be used depending on what you want (see the example prompt below).
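As an illustration, a minimal assignment prompt might look like the following. The wording and placeholder names are illustrative, not the exact prompt used:

```python
# Illustrative prompt template: asks the model to follow the MITRE CWE
# mapping guidance and to cite similar CVEs in support of its rationale.
PROMPT_TEMPLATE = """You are assigning a CWE to a CVE.

Follow the MITRE CWE mapping guidance: prefer the most specific weakness
the evidence supports, and use mapping notes where available.

CVE description:
{cve_description}

Candidate CWE entries (retrieved from the trimmed specification):
{retrieved_cwe_entries}

Return:
1. The single best CWE ID and name.
2. Your rationale.
3. Similar CVEs (e.g. from the Observed Examples) that support the mapping.
"""
```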
## Model and Environment
For processing 250K+ CVE Descriptions, speed, latency, and cost are important considerations in addition to accuracy.

Based on a comparison of LLMs as of September 2024, Gemini 1.5 Flash was chosen.
There are different Google AI Environments:

- Google AI Studio: lower learning curve, cost, and capability
- Vertex AI Studio, or Vertex AI in general
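A minimal sketch of calling Gemini 1.5 Flash via the `google-generativeai` SDK (the Google AI Studio route; Vertex AI has its own `vertexai` SDK entry point). `YOUR_API_KEY` is a placeholder, and the inline prompt is a shortened stand-in for the template shown in the Prompts section:

```python
import google.generativeai as genai

# Authenticate with a Google AI Studio API key (placeholder below).
genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-flash")

# Shortened stand-in for the full assignment prompt.
prompt = (
    "Following the MITRE CWE mapping guidance, assign the best CWE to this "
    "CVE description and cite similar CVEs that support the mapping:\n\n"
    "Apache Log4j2 JNDI features do not protect against attacker-controlled "
    "LDAP endpoints, allowing remote code execution."
)

response = model.generate_content(prompt)
print(response.text)
```

For bulk runs over 250K+ descriptions, this call would typically be wrapped in batching, concurrency, and retry logic; that plumbing is omitted here.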