---
license: apache-2.0
datasets:
- bigcode/the-stack-v2
- tiiuae/falcon-refinedweb
library_name: transformers
language:
- code
- en
---

## SageLite-l

### Model Description

SageLite is a new family of open embedding models with an encoder architecture that supports a wide range of tasks in both code and text. SageLite went through three stages of training:

1. **MLM Pretraining**: Standard masked language modeling (MLM) pretraining on mixed code and text data ([The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)).
2. **Contrastive Pre-Finetuning**: Learning from a large number of positive pairs mined from web data and GitHub.
3. **Contrastive Fine-Tuning**: Fine-tuning on a small amount of synthetic data.

---

### **Training Data**

This checkpoint is trained on both [The-Stack-v2](https://huggingface.co/datasets/bigcode/the-stack-v2) and [Falcon-refinedweb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb). Supported languages (10 in total) are: English, C, C#, Go, Java, JavaScript, TypeScript, PHP, Python, and Ruby.

---

### **How to Use**

This checkpoint consists of an encoder (850M parameters) that extracts code embeddings of 1536 dimensions. It can be loaded with the Hugging Face Transformers library and uses the [StarCoder tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
```python
from transformers import AutoModel, AutoTokenizer

# Specify the checkpoint
checkpoint = "SageLite/SageLite-l"
device = "cuda"  # use "cpu" if a GPU is unavailable

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)

# Example usage
code_snippet = "def print_hello_world():\tprint('Hello World!')"
inputs = tokenizer.encode(code_snippet, return_tensors="pt").to(device)
embedding = model(inputs)[0]  # extract the embedding
```

### **Code Retrieval Performance**

#### 1. Code2Code Search

| Model Name | # Params | Embd Dim | Python | Java | JS | TS | C# | C | Ruby | PHP | Go | Avg |
|---------------------|----------|----------|--------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| OpenAI-Code-01 | NA | 3072 | 21.92 | 8.90 | 4.90 | 5.70 | 3.15 | 11.58 | 26.25 | 16.60 | 9.40 | 12.04 |
| OpenAI-Text-3-Small | NA | 1536 | 25.18 | 12.61 | 8.00 | 9.44 | 5.46 | 15.86 | 30.70 | 23.33 | 11.20 | 15.57 |
| OpenAI-Text-3-Large | NA | 3072 | 40.57 | 25.33 | 20.09 | 22.00 | 11.84 | 31.90 | 42.54 | 41.84 | 21.75 | 28.65 |
| CodeSage-v2-Small | 130M | 1024 | 45.60 | 33.65 | 39.96 | 47.78 | 19.19 | 30.55 | 40.12 | 55.39 | 30.96 | 38.13 |
| CodeSage-v2-Base | 356M | 1024 | 55.86 | 42.89 | 45.29 | 54.58 | 23.90 | 38.52 | 56.02 | 64.56 | 42.88 | 47.17 |
| CodeSage-v2-Large | 1.3B | 2048 | 61.11 | 47.09 | 51.18 | 60.67 | 28.04 | 43.40 | 60.74 | 67.87 | 43.86 | 51.55 |
| SageLite-s | 80M | 768 | 47.93 | 30.83 | 35.15 | 37.64 | 18.14 | 30.53 | 42.89 | 50.70 | 21.69 | 35.06 |
| SageLite-l | 850M | 1536 | 64.46 | 45.53 | 50.80 | 54.71 | 30.66 | 47.46 | 61.01 | 68.68 | 39.25 | 51.40 |

#### 2. NL2Code Search

| Model Name | # Params | CoSQA | AdvTest | Python | Java | JS | PHP | Go | Ruby | Avg |
|---------------------|----------|-------|---------|--------|-------|-------|-------|-------|-------|-------|
| OpenAI-Code-01 | NA | 52.20 | 36.03 | 63.13 | 67.85 | 62.30 | 57.47 | 85.22 | 69.28 | 61.69 |
| OpenAI-Text-3-Small | NA | 52.48 | 34.10 | 62.62 | 65.87 | 60.28 | 54.85 | 81.96 | 67.57 | 59.97 |
| OpenAI-Text-3-Large | NA | 55.21 | 46.83 | 70.81 | 72.89 | 68.12 | 59.58 | 87.60 | 75.22 | 67.03 |
| CodeSage-v2-Small | 130M | 52.39 | 47.28 | 68.79 | 68.13 | 65.77 | 60.20 | 80.26 | 72.46 | 64.41 |
| CodeSage-v2-Base | 356M | 50.74 | 52.00 | 70.46 | 70.89 | 69.61 | 62.81 | 82.37 | 73.71 | 66.57 |
| CodeSage-v2-Large | 1.3B | 53.18 | 56.31 | 74.18 | 72.33 | 72.49 | 65.26 | 84.67 | 76.61 | 69.38 |
| SageLite-s | 80M | 56.49 | 42.32 | 67.59 | 66.62 | 62.32 | 58.87 | 79.36 | 70.75 | 63.04 |
| SageLite-l | 850M | 59.76 | 55.55 | 74.25 | 71.76 | 69.35 | 61.62 | 84.09 | 77.14 | 69.19 |

---

### **Text Retrieval Performance ([MTEB Retrieval](https://huggingface.co/spaces/mteb/leaderboard))**

| Metric | SageLite-s | SageLite-l |
|-------------------------------|------------|------------|
| ArguAna | 57.75 | 60.71 |
| CQADupstackWordpressRetrieval | 32.42 | 38.63 |
| FiQA2018 | 34.85 | 46.73 |
| NFCorpus | 29.97 | 33.70 |
| QuoraRetrieval | 85.35 | 87.50 |
| SCIDOCS | 18.99 | 21.38 |
| SciFact | 68.43 | 69.05 |
| Touche2020 | 24.41 | 21.43 |
| TRECCOVID | 70.88 | 76.08 |
| FEVER | 71.72 | 73.64 |
| HotpotQA | 58.81 | 62.96 |
| NQ | 48.26 | 54.48 |
| DBPedia | 34.83 | 40.69 |
| ClimateFEVER | 25.69 | 26.20 |
| MSMARCO | 35.01 | 36.55 |
| Average | 46.49 | 49.98 |

---
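In the retrieval evaluations above, candidates are ranked by the similarity between query and candidate embeddings. A minimal sketch of that scoring step, using NumPy with small toy vectors standing in for real model outputs (the vectors, dimensionality, and candidate names here are illustrative only, not produced by SageLite):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional embeddings standing in for real model outputs.
query = np.array([0.9, 0.1, 0.0, 0.2])  # e.g. the embedding of a Python snippet
candidates = {
    "java_equivalent": np.array([0.8, 0.2, 0.1, 0.3]),
    "unrelated_code":  np.array([0.0, 0.9, 0.8, 0.1]),
}

# Rank candidates by similarity to the query, highest first.
ranked = sorted(candidates.items(),
                key=lambda kv: cosine_similarity(query, kv[1]),
                reverse=True)
print([name for name, _ in ranked])  # the semantically closest candidate ranks first
```

In practice, L2-normalizing all embeddings once up front reduces ranking over a large candidate pool to a single matrix product.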