---
library_name: transformers
tags:
- Vulnerability
- C/C++
- Detection
datasets:
- DetectVul/devign
language:
- en
base_model:
- microsoft/unixcoder-base
---
# Model Card: UniXcoder for Code Vulnerability Detection
## Model Summary
This model is a fine-tuned version of **Microsoft's UniXcoder**, optimized for detecting vulnerabilities in C/C++ code. It is trained on the **DetectVul/devign** dataset and achieves **68.34% accuracy** with an **F1 score of 62.14%**. The model takes in a code snippet and classifies it as either **safe (0)** or **vulnerable (1)**.
## Model Details
- **Developed by:** mahdin70 (Mukit Mahdin)
- **Finetuned from:** `microsoft/unixcoder-base`
- **Language(s):** English (for code comments & metadata), C/C++
- **License:** MIT
- **Task:** Code vulnerability detection
- **Dataset Used:** `DetectVul/devign`
- **Architecture:** Transformer-based sequence classification
## Model Sources
- **Repository:** [Add Hugging Face Model Link Here]
- **Paper (UniXcoder):** [https://arxiv.org/abs/2203.03850](https://arxiv.org/abs/2203.03850)
- **Demo (Optional):** [Add Gradio/Streamlit Link Here]
## Uses
### Direct Use
This model can be used for **static code analysis**, security audits, and automatic vulnerability detection in software repositories. It is useful for:
- **Developers**: To analyze their code for potential security flaws.
- **Security Teams**: To scan repositories for known vulnerabilities.
- **Researchers**: To study vulnerability detection in AI-powered systems.
### Downstream Use
This model can be integrated into **IDE plugins**, **CI/CD pipelines**, or **security scanners** to provide real-time vulnerability detection.
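As an illustration of the CI/CD use case, the sketch below scans a checkout for C/C++ sources and fails the build if any file is flagged. This is a minimal sketch, not a released integration: the repository ID is the placeholder used throughout this card, and the file-discovery logic and exit-code convention are assumptions.
```python
import sys
from pathlib import Path

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "your_username/unixcoder-code-vulnerability-detector"  # placeholder ID from this card
SUFFIXES = {".c", ".cc", ".cpp", ".h", ".hpp"}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def is_vulnerable(code: str) -> bool:
    """Classify one snippet; label 1 means vulnerable."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1)) == 1

flagged = [
    path
    for path in Path(".").rglob("*")
    if path.suffix in SUFFIXES and is_vulnerable(path.read_text(errors="ignore"))
]
for path in flagged:
    print(f"⚠️ potentially vulnerable: {path}")

# A non-zero exit code fails the CI job
sys.exit(1 if flagged else 0)
```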
### Out-of-Scope Use
- The model is **not meant to replace human security experts**.
- It may not generalize well to **languages other than C/C++**.
- False positives/negatives may occur due to dataset limitations.
## Bias, Risks, and Limitations
- **False Positives & False Negatives:** The model may flag safe code as vulnerable or miss actual vulnerabilities.
- **Limited to C/C++:** The model was trained on a dataset primarily composed of **C and C++ code**. It may not perform well on other languages.
- **Dataset Bias:** The training data may not cover all possible vulnerabilities.
### Recommendations
Users should **not rely solely on the model** for security assessments. Instead, it should be used alongside **manual code review and static analysis tools**.
## How to Get Started with the Model
Use the code below to load the model and run inference on a sample code snippet:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned model (placeholder repository ID)
tokenizer = AutoTokenizer.from_pretrained("your_username/unixcoder-code-vulnerability-detector")
model = AutoModelForSequenceClassification.from_pretrained("your_username/unixcoder-code-vulnerability-detector")

# Sample code snippet with a classic unchecked strcpy
code_snippet = """
void process(char *input) {
    char buffer[50];
    strcpy(buffer, input); // Potential buffer overflow
}
"""

# Tokenize the input (UniXcoder accepts up to 512 tokens)
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding="max_length", max_length=512)

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_label = torch.argmax(predictions, dim=1).item()

# Output the result: label 1 = vulnerable, label 0 = safe
print("⚠️ Vulnerable Code" if predicted_label == 1 else "✅ Safe Code")
```
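The same checkpoint can also be wrapped in the higher-level `pipeline` API. Unless `id2label` was customized in the model config, the pipeline reports the default `LABEL_0` (safe) / `LABEL_1` (vulnerable) names; the score shown in the comment is illustrative.
```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="your_username/unixcoder-code-vulnerability-detector",
)

# Tokenizer kwargs can be passed per call to keep long files within 512 tokens
result = classifier("strcpy(buffer, input);", truncation=True, max_length=512)
print(result)  # e.g. [{'label': 'LABEL_1', 'score': 0.87}] -> vulnerable
```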
## Training Details
### Training Data
- **Dataset:** `DetectVul/devign`
- **Classes:** `0 (Safe)`, `1 (Vulnerable)`
- **Size:** 50,000+ code snippets
### Training Procedure
- **Optimizer:** AdamW
- **Loss Function:** Cross-Entropy Loss
- **Batch Size:** 8
- **Learning Rate:** 2e-5
- **Epochs:** 3
- **Hardware Used:** 2x T4 GPU
- **Mixed Precision:** FP16
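The original training script was not published; the following is a minimal sketch of an equivalent run with the Hugging Face `Trainer`, which uses AdamW and cross-entropy loss by default, matching the settings above. The `func`/`target` column names and the 80/20 split are assumptions based on the original Devign release and the evaluation setup described below.
```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed column names: "func" = source code, "target" = 0/1 label
dataset = load_dataset("DetectVul/devign")["train"]
split = dataset.train_test_split(test_size=0.2, seed=42)  # 80/20, matching the evaluation below

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")

def tokenize(batch):
    enc = tokenizer(batch["func"], truncation=True, padding="max_length", max_length=512)
    enc["labels"] = batch["target"]
    return enc

tokenized = split.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/unixcoder-base", num_labels=2
)

args = TrainingArguments(
    output_dir="unixcoder-vuln-detector",
    learning_rate=2e-5,             # per the table above
    per_device_train_batch_size=8,  # batch size 8
    num_train_epochs=3,             # 3 epochs
    fp16=True,                      # mixed precision
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()
print(trainer.evaluate())  # reports the eval cross-entropy loss, as in the table below
```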
### Training Metrics
| Metric | Score |
|---------|--------|
| **Train Loss** | 0.4835 |
| **Evaluation Loss** | 0.6855 |
| **Accuracy** | 68.34% |
| **F1 Score** | 62.14% |
| **Precision** | 69.18% |
| **Recall** | 56.40% |
## Evaluation
### Testing Data & Metrics
The model was evaluated using **20% of the dataset**, with the following results:
- **Evaluation Accuracy:** 68.34%
- **F1 Score:** 62.14%
- **Precision:** 69.18%
- **Recall:** 56.40%
- **Evaluation Runtime:** 41.16 sec
- **Evaluation Speed:** 53.1 samples/sec
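The scores above are standard binary classification metrics. For reference, a minimal sketch of how they can be recomputed from model predictions with scikit-learn; the label arrays here are illustrative, not the actual evaluation outputs.
```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative arrays; in practice these come from the held-out 20% split
y_true = [0, 1, 1, 0, 1, 0]  # ground truth (1 = vulnerable)
y_pred = [0, 1, 0, 0, 1, 1]  # model predictions

print(f"Accuracy : {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")  # flagged snippets that are truly vulnerable
print(f"Recall   : {recall_score(y_true, y_pred):.4f}")     # vulnerable snippets actually caught
print(f"F1 Score : {f1_score(y_true, y_pred):.4f}")
```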
## Environmental Impact
| Factor | Value |
|---------|--------|
| **GPU Used** | 2x T4 GPU |
| **Training Time** | ~1 hour |
## Citation
If you use this model in your research or applications, please cite:
```
@article{guo2022unixcoder,
  title={UniXcoder: Unified Cross-Modal Pre-training for Code Representation},
  author={Guo, Daya and Lu, Shuai and Duan, Nan and Wang, Yanlin and Zhou, Ming and Yin, Jian},
  journal={arXiv preprint arXiv:2203.03850},
  year={2022}
}
```
## Model Card Authors
- **Mukit Mahdin**
- Contact: [[email protected]]