---
library_name: transformers
tags:
- cybersecurity
- mpnet
- classification
- fine-tuned
license: creativeml-openrail-m
language:
- en
base_model:
- sentence-transformers/all-mpnet-base-v2
---

# AttackGroup-MPNET - Model Card for MPNet Cybersecurity Classifier

This is a fine-tuned MPNet model specialized for classifying cybersecurity threat groups based on textual descriptions of their tactics and techniques.

## Model Details

### Model Description

Given an English description of an adversary's tactics, techniques, and procedures (TTPs), the model predicts the threat group the description most likely refers to and returns its MITRE ATT&CK group ID (for example, `G0001`).

- **Developed by:** Dženan Hamzić
- **Model type:** Transformer-based classification model (MPNet)
- **Language(s) (NLP):** English
- **License:** Apache-2.0
- **Finetuned from model:** microsoft/mpnet-base (with intermediate MLM fine-tuning)

### Model Sources

- **Base Model:** [microsoft/mpnet-base](https://huggingface.co/microsoft/mpnet-base)

## Uses

### Direct Use

This model classifies textual cybersecurity descriptions into known cybersecurity threat groups.

### Downstream Use

Integration into Cyber Threat Intelligence platforms, SOC incident analysis tools, and automated threat detection systems.

### Out-of-Scope Use

- General-purpose language tasks, or any task outside the cybersecurity domain

## Bias, Risks, and Limitations

This model specializes in cybersecurity contexts. Predictions for unrelated contexts may be inaccurate.

### Recommendations

Always have cybersecurity analysts verify predictions before using them in critical decision-making.
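
One way to put this recommendation into practice is to route low-confidence predictions to an analyst queue. The sketch below is a minimal illustration, not part of the released model: the function name `split_for_review`, the `0.8` threshold, and the toy confidences are all assumptions, and the confidence is assumed to be the softmax probability of the predicted class.

```python
def split_for_review(predictions, min_confidence=0.8):
    """Split (text, group_id, confidence) tuples into accepted and review lists.

    min_confidence is an arbitrary placeholder; tune it on validation data.
    """
    accepted, review = [], []
    for text, group_id, confidence in predictions:
        target = accepted if confidence >= min_confidence else review
        target.append((text, group_id, confidence))
    return accepted, review


# Toy predictions with made-up confidences, for illustration only
toy_predictions = [
    ("Spearphishing with malicious links", "G0001", 0.93),
    ("Use of scheduled tasks for persistence", "G0002", 0.41),
]
accepted, review = split_for_review(toy_predictions)
print("Needs analyst review:", review)
```

A high confidence score does not make an attribution correct, so a threshold like this only prioritizes analyst attention; it does not replace it.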

## How to Get Started with the Model (Classification)

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)

with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)

# Load the fine-tuned MPNet classifier
classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "selfconstruct3d/AttackGroup-MPNET",
    num_labels=len(label_to_groupid),
).to(device)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET")

def predict_group(sentence):
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).cpu().item()

    predicted_groupid = label_to_groupid[str(predicted_label)]
    return predicted_groupid

# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
predicted_class = predict_group(sentence)
print(f"Predicted GroupID: {predicted_class}")
```

Example output: `Predicted GroupID: G0001` (group page: https://attack.mitre.org/groups/G0001/)
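
The classification example returns only the single best group. When several groups share similar techniques, a ranked shortlist can be more useful. The following sketch, written in plain Python so it runs without a GPU or model download, converts a list of logits into the top `k` candidates. The helper name `top_k_groups` and the three-entry mapping are hypothetical; with the real model, the logits could be obtained with `outputs.logits[0].tolist()`.

```python
import math


def top_k_groups(logits, label_to_groupid, k=3):
    """Return the k most probable (group_id, probability) pairs."""
    # Numerically stable softmax over the raw logits
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank label indices by probability, highest first
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # The JSON mapping uses string keys, as in the example above
    return [(label_to_groupid[str(i)], probs[i]) for i in ranked[:k]]


# Toy logits and a hypothetical mapping, for illustration only
toy_mapping = {"0": "G0001", "1": "G0002", "2": "G0003"}
candidates = top_k_groups([2.0, 0.5, 1.0], toy_mapping, k=2)
for group_id, prob in candidates:
    print(f"{group_id}: {prob:.3f}")
```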


## How to Get Started with the Model (Embeddings)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from huggingface_hub import hf_hub_download
import json

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)

with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)


# Load your fine-tuned classification model
model_name = "selfconstruct3d/AttackGroup-MPNET"
tokenizer = AutoTokenizer.from_pretrained(model_name)
classifier_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(label_to_groupid)).to(device)

def get_embedding(sentence):
    classifier_model.eval()

    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        outputs = classifier_model.mpnet(input_ids=input_ids, attention_mask=attention_mask)
        cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy().flatten()

    return cls_embedding

# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
embedding = get_embedding(sentence)
print("Embedding shape:", embedding.shape)
print("Embedding values:", embedding)
```
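
Embeddings like the one returned by `get_embedding` are typically compared with cosine similarity. The sketch below shows that computation in plain Python on toy vectors; the function name `cosine_similarity` is illustrative, and real embeddings can be passed in as lists via `embedding.tolist()`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two sentence embeddings
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Values close to 1 indicate that two descriptions point in nearly the same direction in embedding space; values near 0 indicate unrelated descriptions.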


## Training Details

### Training Data

To be announced.

### Training Procedure

- Fine-tuned from: MLM fine-tuned MPNet ("mpnet_mlm_cyber_finetuned-v2")
- Epochs: 32
- Learning rate: 5e-6
- Batch size: 16

## Evaluation

### Testing Data, Factors & Metrics

- **Testing Data:** Stratified sample from original dataset.
- **Metrics:** Accuracy, Weighted F1 Score
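
For readers unfamiliar with the two metrics, the sketch below computes accuracy and support-weighted F1 from scratch on toy labels. It follows the standard definitions (per-class F1 averaged with weights proportional to each class's share of the true labels) and is not the evaluation script used for this model; the function names and toy labels are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_label(y_true, y_pred, label) * n
               for label, n in support.items()) / total


# Toy labels, for illustration only
y_true = ["G0001", "G0001", "G0001", "G0002"]
y_pred = ["G0001", "G0001", "G0002", "G0002"]
print(f"accuracy={accuracy(y_true, y_pred):.4f}")
print(f"weighted_f1={weighted_f1(y_true, y_pred):.4f}")
```

Weighted F1 rewards performance on frequent groups more than on rare ones, which is why it can differ noticeably from macro F1 on imbalanced data.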

### Results

| Metric                         | Value  |
|--------------------------------|--------|
| Classification accuracy (test) | 0.9564 |
| Weighted F1 score (test)       | 0.9577 |


## Evaluation Results

| Model                 | Accuracy | F1 Macro | F1 Weighted | Embedding Variability |
|-----------------------|----------|----------|-------------|-----------------------|
| **AttackGroup-MPNET** | **0.85** | **0.759**| **0.847**   | 0.234                 |
| GTE Large             | 0.66     | 0.571    | 0.667       | 0.266                 |
| E5 Large v2           | 0.64     | 0.541    | 0.650       | 0.355                 |
| Original MPNet        | 0.63     | 0.534    | 0.619       | 0.092                 |
| BGE Large             | 0.53     | 0.418    | 0.519       | 0.366                 |
| SupSimCSE             | 0.50     | 0.373    | 0.479       | 0.227                 |
| MLM Fine-tuned MPNet  | 0.44     | 0.272    | 0.411       | 0.125                 |
| SecBERT               | 0.41     | 0.315    | 0.410       | 0.591                 |
| SecureBERT_Plus       | 0.36     | 0.252    | 0.349       | 0.267                 |
| CySecBERT             | 0.34     | 0.235    | 0.323       | 0.229                 |
| ATTACK-BERT           | 0.33     | 0.240    | 0.316       | 0.096                 |
| Secure_BERT           | 0.00     | 0.000    | 0.000       | 0.007                 |
| CyBERT                | 0.00     | 0.000    | 0.000       | 0.015                 |


| Model                | Similarity Search Recall@5 | Few-shot Accuracy | In-dist Similarity | OOD Similarity | Robustness Similarity |
|----------------------|----------------------------|-------------------|--------------------|----------------|-----------------------|
| **AttackGroup-MPNET**| **0.934**                  | **0.857**         | 0.235              | 0.017          | 0.948                 |
| Original MPNet       | 0.786                      | 0.643             | 0.217              | -0.004         | 0.941                 |
| E5 Large v2          | 0.778                      | 0.679             | 0.727              | 0.013          | 0.977                 |
| GTE Large            | 0.746                      | 0.786             | 0.845              | 0.002          | 0.984                 |
| BGE Large            | 0.632                      | 0.750             | 0.533              | -0.006         | 0.970                 |
| SupSimCSE            | 0.616                      | 0.571             | 0.683              | -0.015         | 0.978                 |
| SecBERT              | 0.468                      | 0.429             | 0.586              | -0.001         | 0.970                 |
| CyBERT               | 0.452                      | 0.250             | 1.000              | -0.001         | 1.000                 |
| ATTACK-BERT          | 0.362                      | 0.571             | 0.157              | -0.005         | 0.950                 |
| CySecBERT            | 0.424                      | 0.500             | 0.734              | -0.015         | 0.954                 |
| Secure_BERT          | 0.424                      | 0.250             | 0.990              | 0.050          | 0.998                 |
| SecureBERT_Plus      | 0.406                      | 0.464             | 0.981              | 0.040          | 0.998                 |
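
This card does not define how "Similarity Search Recall@5" was computed. One common reading, assumed here, is the fraction of queries for which at least one of the `k` nearest neighbours carries the same group label as the query. The sketch below implements that reading on toy data; the function name and the labels are hypothetical, and the actual benchmark may use a different definition.

```python
def recall_at_k(query_labels, retrieved_labels, k=5):
    """Fraction of queries whose top-k retrieved labels include the query's label.

    retrieved_labels[i] lists the labels of the neighbours of query i,
    ordered from most to least similar.
    """
    hits = sum(
        label in retrieved[:k]
        for label, retrieved in zip(query_labels, retrieved_labels)
    )
    return hits / len(query_labels)


# Toy data: two queries with three ranked neighbours each
queries = ["G0001", "G0002"]
neighbours = [
    ["G0002", "G0001", "G0003"],  # correct group appears at rank 2
    ["G0001", "G0003", "G0004"],  # correct group never appears
]
print(recall_at_k(queries, neighbours, k=2))
```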


### Single Prediction Example

```python

import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned MPNet classifier
classifier_model = AutoModelForSequenceClassification.from_pretrained("selfconstruct3d/AttackGroup-MPNET").to(device)

# Load the matching tokenizer
tokenizer = AutoTokenizer.from_pretrained("selfconstruct3d/AttackGroup-MPNET")


label_to_groupid_file = hf_hub_download(
    repo_id="selfconstruct3d/AttackGroup-MPNET",
    filename="label_to_groupid.json"
)

with open(label_to_groupid_file, "r") as f:
    label_to_groupid = json.load(f)

def predict_group(sentence):
    classifier_model.eval()
    encoding = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=128,
        return_tensors="pt"
    )
    input_ids = encoding["input_ids"].to(device)
    attention_mask = encoding["attention_mask"].to(device)

    with torch.no_grad():
        outputs = classifier_model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predicted_label = torch.argmax(logits, dim=1).cpu().item()

    predicted_groupid = label_to_groupid[str(predicted_label)]
    return predicted_groupid

# Example usage
sentence = "APT38 has used phishing emails with malicious links to distribute malware."
predicted_class = predict_group(sentence)
print(f"Predicted GroupID: {predicted_class}")
```

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute).

- **Hardware Type:** [To be filled by user]
- **Hours used:** [To be filled by user]
- **Cloud Provider:** [To be filled by user]
- **Compute Region:** [To be filled by user]
- **Carbon Emitted:** [To be filled by user]

## Technical Specifications

### Model Architecture

- MPNet architecture with classification head (768 -> 512 -> num_labels)
- Only the last 10 transformer layers were fine-tuned
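
To give a sense of the size of the classification head, the sketch below counts its trainable parameters. It assumes the head consists of exactly two fully connected layers with biases (768 -> 512 and 512 -> `num_labels`) and nothing else; the activation, any dropout, and the real value of `num_labels` are not stated in this card, so the value `100` used in the example is a placeholder.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def head_params(num_labels, hidden=768, intermediate=512):
    """Parameters in a 768 -> 512 -> num_labels head (assumed two linear layers)."""
    return linear_params(hidden, intermediate) + linear_params(intermediate, num_labels)


# num_labels=100 is a placeholder; the real value is len(label_to_groupid)
print(head_params(num_labels=100))
```

If the encoder follows the 12-layer MPNet base configuration, fine-tuning the last 10 layers would leave the first two encoder layers at their pretrained weights.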

## Model Card Authors

- Dženan Hamzić

## Model Card Contact

- https://www.linkedin.com/in/dzenan-hamzic/