---
license: apache-2.0
datasets:
- bigcode/the-stack-v2
- tiiuae/falcon-refinedweb
library_name: transformers
language:
- code
- en
---

## SageLite-l

### Model Description
SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of tasks in both code and text. SageLite went through three stages of training:
1. **MLM Pretraining**: Standard masked language model (MLM) pretraining on mixed code and text data ([The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)).
2. **Contrastive Pre-Finetuning**: Learning from a large number of positive pairs mined from web data and GitHub.
3. **Contrastive Fine-Tuning**: Fine-tuning on a small amount of synthetic data.
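
The contrastive stages above typically optimize an InfoNCE-style objective over positive pairs. The following is a minimal sketch with in-batch negatives; the temperature value and pairing scheme are illustrative, not the exact SageLite training recipe:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: each query's positive is the
    matching row of pos_emb; all other rows in the batch act as negatives."""
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    logits = q @ p.T / temperature      # (B, B) cosine-similarity logits
    labels = torch.arange(q.size(0))    # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Toy batch of 4 pairs with 8-dimensional embeddings
torch.manual_seed(0)
q = torch.randn(4, 8)
loss = info_nce_loss(q, q.clone())      # identical pairs -> near-zero loss
```

Pulling the matched pair together on the diagonal while pushing apart every other in-batch combination is what lets the model learn retrieval-ready embeddings from positive pairs alone, without explicitly mined hard negatives.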

---

### **Training Data**
This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages (10 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.

---


### **How to Use**
This checkpoint consists of an encoder (850M parameters) that extracts embeddings of 1536 dimensions for both code and text. It can be loaded using the Hugging Face Transformers library and employs the [StarCoder tokenizer](https://arxiv.org/pdf/2305.06161.pdf).

```python
from transformers import AutoModel, AutoTokenizer

# Specify the checkpoint
checkpoint = "SageLite/SageLite-l"
device = "cuda"  # Use "cpu" if GPU is unavailable

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# Example usage
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
embedding = model(inputs)[0]  # Extract the embedding
```
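
For retrieval, candidates are typically ranked by cosine similarity to the query embedding. A minimal sketch using placeholder vectors in place of actual model outputs (the 768-dimensional size and the synthetic "similar candidate" are illustrative only):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings standing in for model outputs:
# one query and three candidate snippets.
torch.manual_seed(0)
query = torch.randn(1, 768)
candidates = torch.randn(3, 768)
candidates[1] = query[0] + 0.1 * torch.randn(768)  # make candidate 1 similar

# Rank candidates by cosine similarity to the query
scores = F.cosine_similarity(query, candidates)    # shape (3,)
ranking = scores.argsort(descending=True)          # best match first
```

In practice you would replace `query` and `candidates` with the embeddings produced by the snippet above, computed over your corpus.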

### **Code Retrieval Performance**

#### 1. Code2Code Search

| Model Name          | # Params | Embd Dim | Python | Java  | JS    | TS     | C#     | C      | Ruby   | PHP    | Go     | Avg    |
|---------------------|----------|----------|--------|-------|-------|--------|--------|--------|--------|--------|--------|--------|
| OpenAI-Code-01      | NA       | 3072     | 21.92  | 8.90  | 4.90  | 5.70   | 3.15   | 11.58  | 26.25  | 16.60  | 9.40   | 12.04  |
| OpenAI-Text-3-Small | NA       | 1536     | 25.18  | 12.61 | 8.00  | 9.44   | 5.46   | 15.86  | 30.70  | 23.33  | 11.20  | 15.57  |
| OpenAI-Text-3-Large | NA       | 3072     | 40.57  | 25.33 | 20.09 | 22.00  | 11.84  | 31.90  | 42.54  | 41.84  | 21.75  | 28.65  |
| CodeSage-v2-Small   | 130M     | 1024     | 45.60  | 33.65 | 39.96 | 47.78  | 19.19  | 30.55  | 40.12  | 55.39  | 30.96  | 38.13  |
| CodeSage-v2-Base    | 356M     | 1024     | 55.86  | 42.89 | 45.29 | 54.58  | 23.90  | 38.52  | 56.02  | 64.56  | 42.88  | 47.17  |
| CodeSage-v2-Large   | 1.3B     | 2048     | 61.11  | 47.09 | 51.18 | 60.67  | 28.04  | 43.40  | 60.74  | 67.87  | 43.86  | 51.55  |
| SageLite-s          | 80M      | 768      | 47.93  | 30.83 | 35.15 | 37.64  | 18.14  | 30.53  | 42.89  | 50.70  | 21.69  | 35.06  |
| SageLite-l          | 850M     | 1536     | 64.46  | 45.53 | 50.80 | 54.71  | 30.66  | 47.46  | 61.01  | 68.68  | 39.25  | 51.40  |

#### 2. NL2Code Search

| Model Name          | # Params | CoSQA | AdvTest | Python | Java  | JS    | PHP    | Go     | Ruby   | Avg    |
|---------------------|----------|-------|---------|--------|-------|-------|--------|--------|--------|--------|
| OpenAI-Code-01      | NA       | 52.20 | 36.03   | 63.13  | 67.85 | 62.30 | 57.47  | 85.22  | 69.28  | 61.69  |
| OpenAI-Text-3-Small | NA       | 52.48 | 34.10   | 62.62  | 65.87 | 60.28 | 54.85  | 81.96  | 67.57  | 59.97  |
| OpenAI-Text-3-Large | NA       | 55.21 | 46.83   | 70.81  | 72.89 | 68.12 | 59.58  | 87.60  | 75.22  | 67.03  |
| CodeSage-v2-Small   | 130M     | 52.39 | 47.28   | 68.79  | 68.13 | 65.77 | 60.20  | 80.26  | 72.46  | 64.41  |
| CodeSage-v2-Base    | 356M     | 50.74 | 52.00   | 70.46  | 70.89 | 69.61 | 62.81  | 82.37  | 73.71  | 66.57  |
| CodeSage-v2-Large   | 1.3B     | 53.18 | 56.31   | 74.18  | 72.33 | 72.49 | 65.26  | 84.67  | 76.61  | 69.38  |
| SageLite-s          | 80M      | 56.49 | 42.32   | 67.59  | 66.62 | 62.32 | 58.87  | 79.36  | 70.75  | 63.04  |
| SageLite-l          | 850M     | 59.76 | 55.55   | 74.25  | 71.76 | 69.35 | 61.62  | 84.09  | 77.14  | 69.19  |

---

### **Text Retrieval Performance ([MTEB Retrieval](https://huggingface.co/spaces/mteb/leaderboard))**

| Metric                        | SageLite-s | SageLite-l |
|-------------------------------|------------|------------|
| ArguAna                       | 57.75      | 60.71      |
| CQADupstackWordpressRetrieval | 32.42      | 38.63      |
| FiQA2018                      | 34.85      | 46.73      |
| NFCorpus                      | 29.97      | 33.70      |
| QuoraRetrieval                | 85.35      | 87.50      |
| SCIDOCS                       | 18.99      | 21.38      |
| SciFact                       | 68.43      | 69.05      |
| Touche2020                    | 24.41      | 21.43      |
| TRECCOVID                     | 70.88      | 76.08      |
| FEVER                         | 71.72      | 73.64      |
| HotpotQA                      | 58.81      | 62.96      |
| NQ                            | 48.26      | 54.48      |
| DBPedia                       | 34.83      | 40.69      |
| ClimateFEVER                  | 25.69      | 26.20      |
| MSMARCO                       | 35.01      | 36.55      |
| Average                       | 46.49      | 49.98      |

---