---
datasets:
- fbnhnsl/FIM_Solidity_Dataset
language:
- en
metrics:
- bleu
- meteor
base_model:
- deepseek-ai/deepseek-coder-1.3b-base
pipeline_tag: text-generation
tags:
- code
license: cc-by-4.0
---

This is a finetuned deepseek-coder-1.3b-base model for automatic code completion of Solidity code. The model was finetuned with QLoRA on an FIM-transformed and Slither-audited dataset. The corresponding dataset is available on Hugging Face as [fbnhnsl/FIM_Solidity_Dataset](https://huggingface.co/datasets/fbnhnsl/FIM_Solidity_Dataset).
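
For context, the sketch below shows how such a QLoRA setup is typically assembled with `peft` and `bitsandbytes`. It is a minimal illustration only; the rank, alpha, target modules, and quantization settings are assumptions, not the values used to train this model:

```python
# Hypothetical sketch of a QLoRA finetuning setup on the FIM dataset;
# all hyperparameters below are illustrative assumptions
from datasets import load_dataset
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
import torch

dataset = load_dataset("fbnhnsl/FIM_Solidity_Dataset")

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_model = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-1.3b-base",
    quantization_config=bnb_config,
)

# Attach low-rank adapters to the attention projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
```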

Example usage:
```python
# Load the finetuned model
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pretrained_checkpoint = 'deepseek-ai/deepseek-coder-1.3b-base'
finetuned_checkpoint = 'path/to/model'

tokenizer = AutoTokenizer.from_pretrained(finetuned_checkpoint)

old_model = AutoModelForCausalLM.from_pretrained(pretrained_checkpoint)
# The finetuned tokenizer adds special tokens, so the embedding matrix must be
# resized to match before the adapter weights are loaded
old_model.resize_token_embeddings(len(tokenizer))

finetuned_model = PeftModel.from_pretrained(old_model, finetuned_checkpoint).to(device)

# ----------------------------------------------------------------------------
# General automatic code completion
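# <|secure_function|> is one of the special tokens added during finetuning;
# it presumably conditions generation on the Slither-audited (secure) examples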
code_example = '''<|secure_function|>\tfunction add('''

model_inputs = tokenizer(code_example, return_tensors="pt").to(device)

input_ids = model_inputs["input_ids"]
attention_mask = model_inputs["attention_mask"]

generated_ids = finetuned_model.generate(input_ids,
                                         do_sample=True,
                                         max_length=256,
                                         num_beams=4,
                                         temperature=0.3,
                                         pad_token_id=tokenizer.eos_token_id,
                                         attention_mask=attention_mask)

print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])

# Expected output:
# 	function add(uint256 a, uint256 b) internal pure returns (uint256) {
#     return a + b;
#	}

# ----------------------------------------------------------------------------
# Fill-in-the-middle
def generate_fim(prefix, suffix, model, tokenizer, max_length=256):
    # Wrap prefix and suffix in the model's FIM sentinel tokens; the model
    # then generates the missing middle part
    input_text = f"<|fim_begin|>{prefix}<|fim_hole|>{suffix}<|fim_end|>"
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        inputs,
        max_length=max_length,
        num_beams=8,
        temperature=0.3,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, i.e. the infilled middle
    middle = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
    return prefix + middle + suffix

prefix = '''pragma solidity ^0.8.0;\n\n'''

suffix = '''\n\ncontract FOO is Context, IERC20, Ownable {'''

print(generate_fim(prefix, suffix, finetuned_model, tokenizer))

# Expected output:
# pragma solidity ^0.8.0;
#
# import "@openzeppelin/contracts/utils/Context.sol" as Context;
# import "@openzeppelin/contracts/interfaces/IERC20.sol" as IERC20;
# import "@openzeppelin/contracts/access/Ownable.sol" as Ownable;
#
# contract FOO is Context, IERC20, Ownable {

```
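
If you prefer a standalone checkpoint that does not require `peft` at inference time, the adapter can be merged into the base weights. A short sketch (the output path is a placeholder):

```python
# Merge the LoRA adapter into the base model and save a standalone checkpoint
merged_model = finetuned_model.merge_and_unload()
merged_model.save_pretrained("path/to/merged_model")
tokenizer.save_pretrained("path/to/merged_model")
```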

If you use this model, please cite it as follows:

```latex
@misc{hensel2025fim_model,
  title  = {Finetuned deepseek-coder-1.3b-base model for automatic code completion of Solidity code},
  author = {Fabian Hensel},
  year   = {2025}
}
```