---
license: mit
---
# Python clone detection

This is a CodeBERT model for detecting Python code clones, fine-tuned on the dataset shared by [PoolC](https://github.com/PoolC) on [Hugging Face Hub](https://huggingface.co/datasets/PoolC/1-fold-clone-detection-600k-5fold). The original source code for using the model can be found at https://github.com/sangHa0411/CloneDetection/blob/main/inference.py.

# How to use

To use the model efficiently, refer to this repository: https://github.com/RepoAnalysis/PythonCloneDetection, which contains a class that integrates data preprocessing, input tokenization, and model inference.

You can also follow the original inference source code at https://github.com/sangHa0411/CloneDetection/blob/main/inference.py.

For convenience, a pipeline for this model has also been implemented, and you can initialize it with only two lines of code:
```python
from transformers import pipeline

pipe = pipeline(model="Lazyhope/python-clone-detection", trust_remote_code=True)
```
To use it, pass a pair of code snippets as a tuple:
```python
code1 = """def token_to_inputs(feature):
    inputs = {}
    for k, v in feature.items():
        inputs[k] = torch.tensor(v).unsqueeze(0)

    return inputs"""
code2 = """def f(feature):
    return {k: torch.tensor(v).unsqueeze(0) for k, v in feature.items()}"""

is_clone = pipe((code1, code2))
is_clone
# {False: 1.3705984201806132e-05, True: 0.9999862909317017}
```
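
The returned dictionary maps each label to a probability, so a yes/no decision still has to be derived from it. The following is a minimal sketch of one way to do that, assuming the output always has the `{False: p, True: p}` shape shown above; the helper name `decide_clone` and the `threshold` parameter are hypothetical and not part of the pipeline.

```python
def decide_clone(scores, threshold=0.5):
    """Return True if the probability of the True label reaches the threshold.

    `scores` is assumed to look like the pipeline output shown above:
    a dict with the keys False and True mapping to probabilities.
    """
    return scores[True] >= threshold


# Scores copied from the example output above.
scores = {False: 1.3705984201806132e-05, True: 0.9999862909317017}

print(decide_clone(scores))
```

Raising `threshold` makes the decision stricter, which may be useful when false positives are more costly than missed clones.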

# Credits

We would like to thank the original team and authors of the model and the fine-tuning dataset:
- [PoolC](https://github.com/PoolC)
- [sangHa0411](https://github.com/sangHa0411)
- [snoop2head](https://github.com/snoop2head)

# License

This model is released under the MIT license.