---
pipeline_tag: sentence-similarity
datasets:
- gonglinyuan/CoSQA
- AdvTest
tags:
- sentence-transformers
- feature-extraction
- code-similarity
language: en
license: apache-2.0
---

# mpnet-code-search

This is a finetuned [sentence-transformers](https://www.SBERT.net) model. It was trained on natural language-programming language pairs, improving its performance on code search and retrieval applications.

## Usage (Sentence-Transformers)

This model can be loaded with [sentence-transformers](https://www.SBERT.net):

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["Print hello world to stdout", "print('hello world')"]

model = SentenceTransformer('sweepai/mpnet-code-search')
embeddings = model.encode(sentences)
print(embeddings)
```
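For search, the query embedding is compared against precomputed code-chunk embeddings, typically by cosine similarity. A minimal ranking sketch, with toy vectors standing in for real `model.encode` output (`rank_by_cosine` is an illustrative helper, not part of the library):

```python
import numpy as np

def rank_by_cosine(query_emb, code_embs):
    """Return indices of code embeddings sorted by descending cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    scores = c @ q
    return np.argsort(-scores)

# Toy 3-dim embeddings standing in for model.encode(...) output.
query = np.array([1.0, 0.0, 0.0])
corpus = np.array([
    [0.9, 0.1, 0.0],  # most similar to the query
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
print(rank_by_cosine(query, corpus))  # index 0 ranks first
```

In practice the corpus embeddings are computed once per repository and only the query is embedded at search time.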

## Evaluation Results

MRR on the CoSQA and AdvTest datasets:

- Base model
- Finetuned model

---
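MRR (mean reciprocal rank) averages 1/rank of the first correct result over all queries. A minimal sketch, assuming one gold snippet per query:

```python
def mean_reciprocal_rank(ranked_ids, gold_ids):
    """ranked_ids: per-query list of candidate ids, best first.
    gold_ids: the single correct id for each query."""
    total = 0.0
    for ranking, gold in zip(ranked_ids, gold_ids):
        rank = ranking.index(gold) + 1  # 1-based position of the correct hit
        total += 1.0 / rank
    return total / len(gold_ids)

# Correct answer at rank 1 and rank 2 -> MRR = (1 + 0.5) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b"], ["b", "a"]], ["a", "a"]))  # 0.75
```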

## Background

This project aims to improve the performance of the fine-tuned SBERT MPNet model for coding applications.

We developed this model for use in our own app, [Sweep, an AI-powered junior developer](https://github.com/sweepai/sweep).

## Intended Uses

Our model is intended for code search applications, allowing users to search with natural language prompts and find the corresponding code chunks.

## Chunking (Open-Source)

We developed our own chunking algorithm to improve the quality of a repository's code snippets. This tree-based algorithm is described in [our blog post](https://docs.sweep.dev/blogs/chunking-2m-files).

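As a rough illustration of the idea (not the production implementation, which is described in the blog post), tree-based chunking splits source at syntax-tree boundaries rather than at arbitrary character offsets, so a function or class is never cut in half:

```python
import ast

def chunk_source(source, max_chars=300):
    """Greedily pack top-level AST nodes (functions, classes, statements)
    into chunks no longer than max_chars, never splitting a node."""
    lines = source.splitlines()
    chunks, current = [], ""
    for node in ast.parse(source).body:
        text = "\n".join(lines[node.lineno - 1:node.end_lineno])
        if current and len(current) + len(text) + 1 > max_chars:
            chunks.append(current)
            current = text
        else:
            current = current + "\n" + text if current else text
    if current:
        chunks.append(current)
    return chunks

src = "def a():\n    return 1\n\ndef b():\n    return 2\n"
print(chunk_source(src, max_chars=20))  # two chunks, one per function
```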
### Demo

We created an [interactive demo](https://huggingface.co/spaces/sweepai/chunker) for our new chunking algorithm.

---

## Training Procedure

### Base Model

We use the pretrained [`sentence-transformers/all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2) model. Please refer to its model card for a more detailed overview of the training data.

### Finetuning

We finetune the model using a contrastive objective.
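The card does not specify the exact loss; a common contrastive objective for (natural language, code) pairs is in-batch multiple-negatives ranking, where each query's paired snippet is the positive and the other snippets in the batch serve as negatives. A plain-NumPy sketch of that loss (assumed for illustration, including the `scale=20.0` temperature):

```python
import numpy as np

def multiple_negatives_loss(query_embs, code_embs, scale=20.0):
    """In-batch contrastive loss: row i of query_embs pairs with row i
    of code_embs; all other rows in the batch act as negatives."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = code_embs / np.linalg.norm(code_embs, axis=1, keepdims=True)
    logits = scale * (q @ c.T)                   # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # stabilize softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()          # cross-entropy vs. the true pair
```

Correctly matched pairs should yield a much lower loss than shuffled ones, which is what pulls matching query and code embeddings together during finetuning.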

#### Hyperparameters

We trained on 8x NVIDIA A5000 GPUs.

#### Training Data

| Dataset | Number of training tuples |
|---|---|
| [CoSQA](https://huggingface.co/datasets/gonglinyuan/CoSQA) | 20,000 |
| [AdvTest](https://github.com/microsoft/CodeXGLUE/blob/main/Text-Code/NL-code-search-Adv/README.md) | 250,000 |
| **Total** | **270,000** |