---
library_name: transformers
tags:
- cybersecurity
- mpnet
- classification
- fine-tuned
license: creativeml-openrail-m
language:
- en
base_model:
- sentence-transformers/all-mpnet-base-v2
---
# AttackGroup-MPNET - Model Card for MPNet Cybersecurity Classifier
This is a fine-tuned MPNet model specialized for classifying cybersecurity threat groups based on textual descriptions of their tactics and techniques.
## Model Details
### Model Description
This model is a fine-tuned MPNet classifier specialized in categorizing cybersecurity threat groups based on textual descriptions of their tactics, techniques, and procedures (TTPs).
- **Developed by:** Dženan Hamzić
- **Model type:** Transformer-based classification model (MPNet)
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model:** microsoft/mpnet-base (with intermediate MLM fine-tuning)
### Model Sources
- **Base Model:** [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base)
## Uses
### Direct Use
This model classifies textual cybersecurity descriptions into known cybersecurity threat groups.
### Downstream Use
Integration into Cyber Threat Intelligence platforms, SOC incident analysis tools, and automated threat detection systems.
### Out-of-Scope Use
- General-purpose language tasks, or any task outside the cybersecurity domain
## Bias, Risks, and Limitations
This model specializes in cybersecurity contexts. Predictions for unrelated contexts may be inaccurate.
### Recommendations
Always verify predictions with cybersecurity analysts before using in critical decision-making scenarios.
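One way to put this recommendation into practice is to route low-confidence predictions to an analyst. The sketch below is an illustration, not part of the released model: `needs_review` and the 0.5 threshold are hypothetical, and softmax scores from a fine-tuned classifier are not guaranteed to be well calibrated. It takes the logits as a plain list, for example `outputs.logits[0].tolist()`.

```python
import math

def softmax(logits):
    """Convert raw logits (a list of floats) into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def needs_review(logits, threshold=0.5):
    """Return (predicted_label, confidence, flag).

    flag is True when the top probability falls below `threshold`,
    a hypothetical cut-off that should be tuned on held-out data.
    """
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[label]
    return label, confidence, confidence < threshold
```

Predictions flagged this way could be queued for manual triage rather than acted on automatically.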
## How to Get Started with the Model (Classification)
```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Download the mapping from label index to group ID
label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)
with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)

# Load the fine-tuned MPNet classifier
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "selfconstruct3d/AttackGroup-MPNET", num_labels=len(label_to_groupid)
).to(device)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET")


def predict_group(sentence):
    """Return the predicted group ID for a single sentence."""
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).cpu().item()

    # The JSON mapping uses string keys
    predicted_groupid = label_to_groupid[str(predicted_label)]
    return predicted_groupid


# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
predicted_class = predict_group(sentence)
print(f"Predicted GroupID: {predicted_class}")
```
Example output: `Predicted GroupID: G0001` (see https://attack.mitre.org/groups/G0001/)
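A single argmax hides how close the runner-up groups were. As a minimal sketch, the helper below ranks the logits and returns the top candidates with their MITRE ATT&CK group pages. `top_k_groups` is a hypothetical helper, not part of the model repository; it assumes the logits are passed as a plain list (for example `outputs.logits[0].tolist()`) and that the mapping uses string keys, as `label_to_groupid.json` does in the example above.

```python
def top_k_groups(logits, label_to_groupid, k=3):
    """Return the k highest-scoring (group_id, url, logit) tuples.

    `logits` is a plain list of floats; `label_to_groupid` maps
    stringified label indices to group IDs.
    """
    # Sort label indices by descending logit
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    results = []
    for i in ranked[:k]:
        group_id = label_to_groupid[str(i)]
        url = f"https://attack.mitre.org/groups/{group_id}/"
        results.append((group_id, url, logits[i]))
    return results
```

Because only the ordering matters here, the raw logits are ranked directly and no softmax is needed.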
## How to Get Started with the Model (Embeddings)
```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# The label mapping is only needed here to set the number of output labels
label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)
with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)

# Load the fine-tuned classification model and its tokenizer
model_name = "selfconstruct3d/AttackGroup-MPNET"
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(label_to_groupid)
).to(device)


def get_embedding(sentence):
    """Return the first-token embedding of a sentence as a 1-D NumPy array."""
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        # Call the MPNet encoder directly, bypassing the classification head
        outputs = classifier_model.mpnet(input_ids=input_ids, attention_mask=attention_mask)
        # Take the hidden state at position 0 (the CLS-style start token)
        cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy().flatten()

    return cls_embedding


# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
embedding = get_embedding(sentence)
print("Embedding shape:", embedding.shape)
print("Embedding values:", embedding)
```
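Embeddings like these are usually compared with cosine similarity. The following is a minimal sketch using NumPy; `cosine_similarity` is a hypothetical helper, and returning 0.0 for a zero vector is a convention chosen here, not something the model card specifies.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Convention chosen for this sketch: zero vectors match nothing
        return 0.0
    return float(np.dot(a, b) / denom)
```

With the function from the block above, two descriptions could be compared as `cosine_similarity(get_embedding(text_a), get_embedding(text_b))`.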
## Training Details
### Training Data
To be announced.
### Training Procedure
- Fine-tuned from: MLM fine-tuned MPNet ("mpnet_mlm_cyber_finetuned-v2")
- Epochs: 32
- Learning rate: 5e-6
- Batch size: 16
## Evaluation
### Testing Data, Factors & Metrics
- **Testing Data:** Stratified sample from original dataset.
- **Metrics:** Accuracy, Weighted F1 Score
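The card does not say which implementation produced these metrics. As a reference, the sketch below computes both from label lists using the conventional definitions, where weighted F1 averages the per-class F1 scores with weights proportional to each class's support in the true labels (the same convention as scikit-learn's `average="weighted"`). The function names are illustrative.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is > 0 because n >= 1
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score
```

Weighted F1 can differ noticeably from macro F1 when some threat groups have far fewer samples than others.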
### Results
| Metric                         | Value   |
|--------------------------------|---------|
| Classification Accuracy (Test) | 0.9564  |
| Weighted F1 Score (Test)       | 0.9577  |
## Evaluation Results
| Model | Accuracy | F1 Macro | F1 Weighted | Embedding Variability |
|-----------------------|----------|----------|-------------|-----------------------|
| **AttackGroup-MPNET** | **0.85** | **0.759**| **0.847** | 0.234 |
| GTE Large | 0.66 | 0.571 | 0.667 | 0.266 |
| E5 Large v2 | 0.64 | 0.541 | 0.650 | 0.355 |
| Original MPNet | 0.63 | 0.534 | 0.619 | 0.092 |
| BGE Large | 0.53 | 0.418 | 0.519 | 0.366 |
| SupSimCSE | 0.50 | 0.373 | 0.479 | 0.227 |
| MLM Fine-tuned MPNet | 0.44 | 0.272 | 0.411 | 0.125 |
| SecBERT | 0.41 | 0.315 | 0.410 | 0.591 |
| SecureBERT_Plus | 0.36 | 0.252 | 0.349 | 0.267 |
| CySecBERT | 0.34 | 0.235 | 0.323 | 0.229 |
| ATTACK-BERT | 0.33 | 0.240 | 0.316 | 0.096 |
| Secure_BERT | 0.00 | 0.000 | 0.000 | 0.007 |
| CyBERT                | 0.00     | 0.000    | 0.000       | 0.015                 |

| Model                | Similarity Search Recall@5 | Few-shot Accuracy | In-dist Similarity | OOD Similarity | Robustness Similarity |
|----------------------|----------------------------|-------------------|--------------------|----------------|-----------------------|
| **AttackGroup-MPNET**| **0.934** | **0.857** | 0.235 | 0.017 | 0.948 |
| Original MPNet | 0.786 | 0.643 | 0.217 | -0.004 | 0.941 |
| E5 Large v2 | 0.778 | 0.679 | 0.727 | 0.013 | 0.977 |
| GTE Large | 0.746 | 0.786 | 0.845 | 0.002 | 0.984 |
| BGE Large | 0.632 | 0.750 | 0.533 | -0.006 | 0.970 |
| SupSimCSE | 0.616 | 0.571 | 0.683 | -0.015 | 0.978 |
| SecBERT | 0.468 | 0.429 | 0.586 | -0.001 | 0.970 |
| CyBERT | 0.452 | 0.250 | 1.000 | -0.001 | 1.000 |
| ATTACK-BERT | 0.362 | 0.571 | 0.157 | -0.005 | 0.950 |
| CySecBERT | 0.424 | 0.500 | 0.734 | -0.015 | 0.954 |
| Secure_BERT | 0.424 | 0.250 | 0.990 | 0.050 | 0.998 |
| SecureBERT_Plus | 0.406 | 0.464 | 0.981 | 0.040 | 0.998 |
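
The card does not define how Similarity Search Recall@5 was computed. The sketch below shows one common reading, stated here as an assumption: a query counts as a hit if at least one of its `k` nearest indexed embeddings (by cosine similarity) carries the same group label. `recall_at_k` is a hypothetical helper, and it assumes every embedding is non-zero.

```python
import numpy as np

def recall_at_k(query_emb, query_labels, index_emb, index_labels, k=5):
    """Fraction of queries with a same-label item among the top-k neighbours."""
    q = np.asarray(query_emb, dtype=np.float64)
    d = np.asarray(index_emb, dtype=np.float64)
    # L2-normalise so that the dot product equals cosine similarity
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    sims = q @ d.T  # shape: (num_queries, num_indexed)
    # Indices of the k most similar indexed items for each query
    top_k = np.argsort(-sims, axis=1)[:, :k]
    index_labels = np.asarray(index_labels)
    hits = [
        bool(np.any(index_labels[top_k[i]] == query_labels[i]))
        for i in range(len(query_labels))
    ]
    return float(np.mean(hits))
```

Other definitions exist, such as the share of all relevant items retrieved in the top `k`, so the numbers in the table should not be assumed to match this sketch.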
### Single Prediction Example
```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned MPNet classifier
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "selfconstruct3d/AttackGroup-MPNET"
).to(device)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET")

# Download the mapping from label index to group ID
label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)
with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)


def predict_group(sentence):
    """Return the predicted group ID for a single sentence."""
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).cpu().item()

    # The JSON mapping uses string keys
    predicted_groupid = label_to_groupid[str(predicted_label)]
    return predicted_groupid


# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
predicted_class = predict_group(sentence)
print(f"Predicted GroupID: {predicted_class}")
```
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).
- **Hardware Type:** [To be filled by user]
- **Hours used:** [To be filled by user]
- **Cloud Provider:** [To be filled by user]
- **Compute Region:** [To be filled by user]
- **Carbon Emitted:** [To be filled by user]
## Technical Specifications
### Model Architecture
- MPNet architecture with classification head (768 -> 512 -> num_labels)
- Only the last 10 transformer layers were fine-tuned
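
The two points above can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the author's training code: the card gives only the layer sizes (768 -> 512 -> num_labels), so the ReLU activation, the dropout rate, and both function names are hypothetical.

```python
import torch
import torch.nn as nn

def build_head(num_labels, hidden_size=768, inner_size=512, dropout=0.1):
    """Two-layer classification head: hidden_size -> inner_size -> num_labels."""
    # ReLU and dropout are assumptions; the card states only the layer sizes
    return nn.Sequential(
        nn.Linear(hidden_size, inner_size),
        nn.ReLU(),
        nn.Dropout(dropout),
        nn.Linear(inner_size, num_labels),
    )

def freeze_all_but_last(layers, n_trainable=10):
    """Disable gradients for every layer except the final `n_trainable`."""
    layers = list(layers)
    cutoff = max(len(layers) - n_trainable, 0)
    for layer in layers[:cutoff]:
        for param in layer.parameters():
            param.requires_grad = False
```

For a Hugging Face MPNet classifier, the list of transformer layers would typically be reachable as `model.mpnet.encoder.layer`, so `freeze_all_but_last(model.mpnet.encoder.layer, 10)` would leave the first 2 of 12 layers frozen in a base-sized model.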
## Model Card Authors
- Dženan Hamzić
## Model Card Contact
- https://www.linkedin.com/in/dzenan-hamzic/