|
--- |
|
library_name: peft |
|
base_model: deepseek-ai/deepseek-coder-6.7b-instruct |
|
license: mit |
|
language: |
|
- en |
|
--- |
|
|
|
# Model Card for deepseek-coder-6.7b-vulnerability-detection |
|
A fine-tuned version of `deepseek-coder-6.7b-instruct` that aims to improve vulnerability detection in Solidity smart contracts and to provide informative explanations of what the vulnerabilities are and how to fix them.
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
The model is used with the following prompt template:
|
``` |
|
Below are one or more Solidity codeblocks. The codeblocks might contain vulnerable code. |
|
If there is a vulnerability, please provide a description of the vulnerability in terms of the code that is responsible for it.
|
Describe how an attacker would be able to take advantage of the vulnerability so the explanation is even more clear. |
|
|
|
Output only the description of the vulnerability and the attack vector. No additional information is needed.
|
|
|
If there is no vulnerability, output "There is no vulnerability".
|
|
|
Codeblocks: |
|
{} |
|
``` |
|
|
|
When one or more codeblocks are provided to the model using this prompt, the model will output:

1. Whether there is a vulnerability or not.
|
2. What the vulnerability is. |
|
3. How an attacker would take advantage of the detected vulnerability. |
|
|
|
Afterwards, the above output can be chained to produce a solution: the context then contains the code, the vulnerability, and the attack vector, so deducing a fix becomes a more straightforward task (see the sketch below).

Additionally, the same fine-tuned model can be used for the solution recommendation, since the fine-tuning is low-rank (LoRA) and much of the base model's general ability is preserved.
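
As an illustration, a minimal sketch of that chaining, assuming `tokenizer`, `model`, `prompt`, and `codeblocks` are set up as in the "How to Get Started" section below. The follow-up prompt wording and the `generate_reply` helper are illustrative, not part of the released model:

```python
# Hypothetical sketch: chain the detection output into a fix request.
# Assumes `tokenizer`, `model`, `prompt`, and `codeblocks` are defined as in
# the "How to Get Started" section of this card.
def generate_reply(messages):
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(inputs, max_new_tokens=512, eos_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)

# Step 1: detect and describe the vulnerability.
messages = [{'role': 'user', 'content': prompt.format(codeblocks)}]
description = generate_reply(messages)

# Step 2: feed the description back in and ask for a fix (wording is illustrative).
messages += [
    {'role': 'assistant', 'content': description},
    {'role': 'user', 'content': 'Suggest a fix for the vulnerability described above.'},
]
solution = generate_reply(messages)
print(solution)
```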
|
|
|
|
|
- **Developed by:** Kristian Apostolov

- **Shared by:** Kristian Apostolov

- **Model type:** Decoder

- **Language(s) (NLP):** English

- **License:** MIT

- **Finetuned from model:** deepseek-ai/deepseek-coder-6.7b-instruct
|
|
|
### Model Sources
|
|
|
- **Repository:** https://huggingface.co/msc-smart-contract-auditing/deepseek-coder-6.7b-vulnerability-detection
|
|
|
## Uses |
|
|
|
Provide code from a smart contract to the model for a preliminary audit.
|
|
|
### Direct Use |
|
|
|
Paste one or more Solidity codeblocks into the prompt template above to obtain a description of any detected vulnerability and its attack vector.
|
|
|
### Out-of-Scope Use |
|
|
|
|
|
|
A malicious entity could use the model to detect a 0-day vulnerability and exploit it rather than report it.
|
|
|
## Bias, Risks, and Limitations |
|
|
|
The training data could be improved: audits sometimes describe vulnerabilities that are not contained in the code itself but are part of a larger context the model does not see.
|
|
|
### Recommendations |
|
|
|
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'msc-smart-contract-auditing/deepseek-coder-6.7b-vulnerability-detection'

# Load the fine-tuned model; with peft installed, the LoRA adapter in this
# repository is resolved on top of the base model automatically.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

# The tokenizer didn't save properly, so load it from the base model instead.
tokenizer = AutoTokenizer.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-instruct",
    trust_remote_code=True,
    force_download=True,
)

prompt = \
"""
Below are one or more Solidity codeblocks. The codeblocks might contain vulnerable code.
If there is a vulnerability, please provide a description of the vulnerability in terms of the code that is responsible for it.
Describe how an attacker would be able to take advantage of the vulnerability so the explanation is even more clear.

Output only the description of the vulnerability and the attack vector. No additional information is needed.

If there is no vulnerability, output "There is no vulnerability".

Codeblocks:
{}
"""

codeblocks = "Your code here"

messages = [
    {'role': 'user', 'content': prompt.format(codeblocks)}
]

inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=True, top_k=25, top_p=0.95, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
description = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)

print(description)
```

|
## Training Details |
|
|
|
### Training Data |
|
|
|
https://huggingface.co/datasets/msc-smart-contract-auditing/audits-with-reasons |
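
The dataset can be loaded with the `datasets` library; a small sketch (the split and column names are not documented in this card, so inspect them before use):

```python
from datasets import load_dataset

# Load the audit dataset used for fine-tuning. The exact split and column
# names are not documented in this card, so print the dataset to inspect them.
dataset = load_dataset("msc-smart-contract-auditing/audits-with-reasons")
print(dataset)
```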
|
|
|
### Training Procedure |
|
|
|
```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,              # rank
    lora_alpha=32,     # scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05, # dropout rate for LoRA layers
)

training_args = TrainingArguments(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    warmup_steps=5,
    num_train_epochs=1,
    learning_rate=2e-4,
    fp16=True,
    logging_steps=1,
    optim="adamw_8bit",
    weight_decay=0.01,
    lr_scheduler_type="linear",
    seed=3407,
    output_dir="outputs",
)
```
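
For context, a minimal sketch of how these two configs would plug into a standard `peft` + `transformers` training loop. The actual training script is not published, so the `Trainer` wiring and the `train_dataset` variable below are assumptions:

```python
# Hypothetical sketch, assuming a standard peft + transformers Trainer setup.
# `train_dataset` is assumed to be a tokenized version of the
# audits-with-reasons dataset and is not shown here.
from peft import get_peft_model
from transformers import AutoModelForCausalLM, DataCollatorForLanguageModeling, Trainer

base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-instruct",
    trust_remote_code=True,
)
peft_model = get_peft_model(base_model, lora_config)  # wrap the listed projection modules with LoRA adapters
peft_model.print_trainable_parameters()               # only the low-rank matrices are trainable

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: tokenized prompt/response pairs
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```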
|
|
|
#### Training Hyperparameters |
|
|
|
- **Training regime:** fp16 mixed precision |
|
|
|
## Evaluation |
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
#### Testing Data |
|
|
|
https://huggingface.co/datasets/msc-smart-contract-auditing/audits-with-reasons |
|
|
|
### Framework versions |
|
|
|
- PEFT 0.11.1 |