---
library_name: transformers
datasets:
- bigcode/the-stack-v2
- andreagurioli1995/SynthCode2Code2NL-neardedup
license: bigcode-openrail-m
base_model:
- andreagurioli1995/ModularStarEncoder
---
# ModularStarEncoder-1B Fine-Tuned model
ModularStarEncoder-finetuned is an encoder built on top of [ModularStarEncoder-1B Pre-trained](https://huggingface.co/andreagurioli1995/ModularStarEncoder) and fine-tuned on [SynthCode2Code2NL](https://huggingface.co/datasets/andreagurioli1995/SynthCode2Code2NL-neardedup).
It is an encoder for various retrieval tasks that lets the end user select the model size that meets their memory and computational constraints.
We built ModularStarEncoder on top of [StarCoder-2](https://huggingface.co/bigcode/starcoder2-15b), reducing its size from 15B to 1B parameters in bfloat16.
The model is fine-tuned with a [CLIP objective](https://github.com/mlfoundations/open_clip/blob/main/src/open_clip/loss.py).
- **Paper:** [Link](arxiv.paper)
- **Languages:** English, Go, Ruby, Python, Java, C++, PHP, C, JavaScript
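
The CLIP objective mentioned above is a symmetric contrastive loss over paired embeddings. The following is a minimal pure-Python sketch of that idea, not the training code used for this model: the function names and the fixed `logit_scale` value are illustrative assumptions (open_clip uses a learnable temperature), and the embeddings are assumed to be L2-normalized.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def clip_loss(query_emb, target_emb, logit_scale=1.0):
    """Symmetric cross-entropy over the pairwise similarity matrix.

    query_emb[i] and target_emb[i] form a positive pair; every other
    combination in the batch acts as a negative. With L2-normalized
    inputs the dot product equals the cosine similarity.
    logit_scale is a fixed, hypothetical temperature for illustration.
    """
    n = len(query_emb)
    logits = [
        [logit_scale * sum(a * b for a, b in zip(q, t)) for t in target_emb]
        for q in query_emb
    ]
    # Query -> target direction: the correct class for row i is column i.
    loss_q2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Target -> query direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2q = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return (loss_q2t + loss_t2q) / 2


# Toy batch of two pairs with 2-dimensional unit vectors.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(clip_loss(aligned, aligned, logit_scale=10.0))  # small: pairs match
print(clip_loss(aligned, swapped, logit_scale=10.0))  # large: pairs mismatched
```

The loss is low when each query is most similar to its own target and high when it is more similar to another item in the batch, which is what pushes matching code and text embeddings together during fine-tuning.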
### How to use
```python
from transformers import AutoModel, AutoTokenizer

# Load the model (trust_remote_code=True loads the custom model code from the Hub)
model = AutoModel.from_pretrained("andreagurioli1995/ModularStarEncoder-finetuned", trust_remote_code=True)

# Load the tokenizer. Note: the tokenizer applies LEFT padding!
tokenizer = AutoTokenizer.from_pretrained("andreagurioli1995/ModularStarEncoder-finetuned")

language = "yourlanguagelowercased"

# Instruction to use when embedding a code snippet written in a programming language
instruction_code = f"Represent this {language} code snippet for retrieval:"

# Instruction to use when embedding a natural-language (English) description
instruction_natural_language = "Represent this code description for retrieving supporting snippets of code:"

code_snippet = "your code to embed here"

# Follow this pattern to embed a code snippet or a natural-language query
sentence = f"{tokenizer.sep_token}{instruction_code}{tokenizer.sep_token}{code_snippet}{tokenizer.cls_token}"

# Tokenize the sentence
tokenized_sentence = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=2048)

# Embed the tokenized sentence
embedded_sentence = model(**tokenized_sentence)
```
The model returns three elements:
- `projected_pooled_normalized`: a list of the projected, pooled, and normalized embeddings from the five exit points;
- `raw_hidden_states`: the raw representations from all hidden states of the model, without pooling, normalization, or projection;
- `attentions`: the attention scores from the encoder.
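
Because the embeddings in `projected_pooled_normalized` are already normalized, a dot product between two of them can serve as a cosine similarity for retrieval. The sketch below illustrates this with plain Python lists standing in for embeddings taken from one exit point. The helper names `build_input` and `rank_snippets`, the `<sep>`/`<cls>` placeholder tokens, and the toy vectors are all hypothetical; in real use the special tokens come from the tokenizer and the vectors from the model output.

```python
def build_input(text, instruction, sep_token, cls_token):
    """Wrap text in the prompt pattern shown above (hypothetical helper)."""
    return f"{sep_token}{instruction}{sep_token}{text}{cls_token}"


def rank_snippets(query_vec, snippet_vecs):
    """Return (index, score) pairs sorted by descending dot product.

    Assumes all vectors are L2-normalized and come from the same exit
    point, so the dot product is the cosine similarity.
    """
    scores = [sum(q * s for q, s in zip(query_vec, vec)) for vec in snippet_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]


# A natural-language query uses the description instruction,
# while each candidate snippet uses the code instruction.
query_prompt = build_input(
    "reverse a list in place",
    "Represent this code description for retrieving supporting snippets of code:",
    sep_token="<sep>",  # placeholder; use tokenizer.sep_token in practice
    cls_token="<cls>",  # placeholder; use tokenizer.cls_token in practice
)
print(query_prompt)

# Toy 2-dimensional unit vectors standing in for real embeddings.
query_embedding = [0.6, 0.8]
snippet_embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
ranked = rank_snippets(query_embedding, snippet_embeddings)
print([index for index, _ in ranked])  # -> [1, 2, 0]
```

Queries and candidates should be compared using embeddings from the same exit point. Which list position corresponds to which layer is not stated here, so check the model code before picking an index.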
### Training
We fine-tuned ModularStarEncoder with a batch size of 2048 contrastive samples for 20,000 training steps.
The pre-training and fine-tuning were conducted on 512 NVIDIA Ampere (64GB) GPUs using the [Leonardo](https://arxiv.org/abs/2307.16885) supercomputer, requiring 450,000 GPU working hours.
| Hyperparameter            | Value     |
|---------------------------|-----------|
| Hidden size               | 1024      |
| Max. position embeddings  | 2048      |
| Num. of attention heads   | 12        |
| Num. of key-value heads   | 4         |
| Num. of hidden layers     | 36        |
| Attention                 | GQA       |
| Num. of parameters        | ≈1B       |
| Loss function             | CLIP loss |
| Multi-layer loss          | Yes       |
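
With grouped-query attention (GQA), several query heads share one key-value head. As a rough illustration of what 12 attention heads and 4 key-value heads imply, the sketch below maps each query head to its key-value head using the common contiguous-grouping convention. This is an assumption about the layout, not a description of this model's implementation, and `kv_head_for_query_head` is a hypothetical helper.

```python
NUM_ATTENTION_HEADS = 12  # from the hyperparameter table
NUM_KEY_VALUE_HEADS = 4   # from the hyperparameter table


def kv_head_for_query_head(query_head,
                           num_heads=NUM_ATTENTION_HEADS,
                           num_kv_heads=NUM_KEY_VALUE_HEADS):
    """Map a query head index to the key-value head it shares.

    Assumes contiguous grouping: consecutive query heads share one
    key-value head.
    """
    if num_heads % num_kv_heads != 0:
        raise ValueError("num_heads must be divisible by num_kv_heads")
    if not 0 <= query_head < num_heads:
        raise IndexError("query_head out of range")
    group_size = num_heads // num_kv_heads
    return query_head // group_size


# Collect the query heads served by each key-value head.
groups = {}
for q in range(NUM_ATTENTION_HEADS):
    groups.setdefault(kv_head_for_query_head(q), []).append(q)
print(groups)
# -> {0: [0, 1, 2], 1: [3, 4, 5], 2: [6, 7, 8], 3: [9, 10, 11]}
```

Under this layout each key-value head serves three query heads, so the key-value cache is one third the size it would be with one key-value head per query head.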
### License
The model is licensed under the BigCode OpenRAIL-M v1 license agreement. You can find the full agreement [here](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement).