---
pipeline_tag: sentence-similarity
datasets:
- gonglinyuan/CoSQA
- AdvTest
tags:
- sentence-transformers
- feature-extraction
- code-similarity
language: en
license: apache-2.0
---

# mpnet-code-search

This is a finetuned [sentence-transformers](https://www.SBERT.net) model. It was trained on Natural Language-Programming Language pairs, improving performance on code search and retrieval applications.

## Usage (Sentence-Transformers)

This model can be loaded with [sentence-transformers](https://www.SBERT.net):

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# A natural-language query paired with a candidate code snippet
sentences = ["Print hello world to stdout", "print('hello world')"]

model = SentenceTransformer('sweepai/mpnet-code-search')
embeddings = model.encode(sentences)
print(embeddings)
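
For retrieval, the query embedding is typically compared against each code-chunk embedding with cosine similarity, and chunks are ranked by score. A minimal stdlib sketch on toy 3-dimensional vectors (these are illustrative numbers, not real model outputs):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for a query and two candidate code chunks
query = [0.9, 0.1, 0.2]
code_match = [0.8, 0.2, 0.3]
code_other = [0.1, 0.9, 0.1]

# The better-matching chunk scores higher
print(cosine_similarity(query, code_match) > cosine_similarity(query, code_other))  # True
```

With real embeddings from `model.encode`, the same comparison can be done in bulk with `sentence_transformers.util.cos_sim`.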

## Evaluation Results

MRR on the CoSQA and AdvTest datasets:
- Base model
- Finetuned model
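
MRR (Mean Reciprocal Rank) averages, over all queries, the reciprocal of the rank at which the first correct code snippet appears. A minimal sketch (the ranks below are illustrative, not actual evaluation results):

```python
def mean_reciprocal_rank(ranks):
    # ranks: 1-based rank of the first correct result for each query
    return sum(1.0 / r for r in ranks) / len(ranks)

# Four illustrative queries whose correct snippets appear at ranks 1, 2, 1, 4
print(mean_reciprocal_rank([1, 2, 1, 4]))  # 0.6875
```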

---

## Background

This project aims to improve the performance of the finetuned SBERT MPNet model on coding applications.

We developed this model for use in our own app, [Sweep, an AI-powered junior developer](https://github.com/sweepai/sweep).

## Intended Uses

Our model is intended for code search applications, allowing users to enter natural language queries and retrieve the corresponding code chunks.

## Chunking (Open-Source)

We developed our own chunking algorithm to improve the quality of a repository's code snippets. This tree-based algorithm is described in [our blog post](https://docs.sweep.dev/blogs/chunking-2m-files).

### Demo

We created an [interactive demo](https://huggingface.co/spaces/sweepai/chunker) for our new chunking algorithm.
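
As a rough illustration of the tree-based idea (not Sweep's actual algorithm, which operates on tree-sitter parse trees across many languages), a Python file can be split at syntax-tree node boundaries rather than at arbitrary line counts, so chunks align with whole definitions:

```python
import ast

def chunk_by_ast(source, max_chars=200):
    """Split source at top-level AST node boundaries, greedily packing
    consecutive nodes into chunks no longer than max_chars (a single
    oversized node still becomes its own chunk)."""
    lines = source.splitlines(keepends=True)
    # Extract the source text spanned by each top-level node
    spans = ["".join(lines[node.lineno - 1 : node.end_lineno])
             for node in ast.parse(source).body]
    chunks, current = [], ""
    for text in spans:
        if current and len(current) + len(text) > max_chars:
            chunks.append(current)
            current = ""
        current += text
    if current:
        chunks.append(current)
    return chunks

src = "import os\n\ndef f():\n    return 1\n\ndef g():\n    return 2\n"
for chunk in chunk_by_ast(src, max_chars=30):
    print(repr(chunk))
```

Aligning chunk boundaries with the syntax tree keeps each embedding focused on a coherent unit of code, which is the motivation behind the algorithm in the blog post above.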

---

## Training Procedure

### Base Model

We use the pretrained [`sentence-transformers/all-mpnet-base-v2`](https://huggingface.co/sentence-transformers/all-mpnet-base-v2). Please refer to its model card for a more detailed overview of the training data.

### Finetuning

We finetune the model using a contrastive objective.
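
A common contrastive setup for (query, code) pairs uses in-batch negatives: each query's paired snippet is the positive, and every other snippet in the batch is a negative (in sentence-transformers this is `MultipleNegativesRankingLoss`; the exact loss used here is not stated, so treat this as a sketch of the general form). A minimal numeric version on toy similarity scores:

```python
import math

def contrastive_loss(sim_matrix):
    """Mean cross-entropy over rows, where sim_matrix[i][j] is the
    similarity between query i and code snippet j, and the diagonal
    entries are the positive pairs."""
    total = 0.0
    for i, row in enumerate(sim_matrix):
        log_denom = math.log(sum(math.exp(s) for s in row))
        total += log_denom - row[i]  # negative log-softmax at the positive
    return total / len(sim_matrix)

# Toy 2x2 similarities: positives (diagonal) score higher than negatives,
# so the loss is small; it shrinks further as the margin grows.
sims = [[0.9, 0.1],
        [0.2, 0.8]]
print(contrastive_loss(sims))
```

Training pushes the diagonal (paired) similarities up relative to the off-diagonal ones, which is exactly what a retrieval model needs.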

#### Hyperparameters

We trained on 8x A5000 GPUs.

#### Training Data

| Dataset | Number of training tuples |
|---|---|
| [CoSQA](https://huggingface.co/datasets/gonglinyuan/CoSQA) | 20,000 |
| [AdvTest](https://github.com/microsoft/CodeXGLUE/blob/main/Text-Code/NL-code-search-Adv/README.md) | 250,000 |
| **Total** | 270,000 |