|
--- |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- transformers |
|
- Qwen2 |
|
license: other |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
base_model: Alibaba-NLP/gte-Qwen2-1.5B-instruct |
|
--- |
|
|
|
|
|
|
|
|
|
## Qodo-Embed-1 |
|
**Qodo-Embed-1** is a state-of-the-art code embedding model designed for retrieval tasks in the software development domain.
|
It is offered in two sizes: Lite (1.5B) and Medium (7B). The model is optimized for natural language-to-code and code-to-code retrieval, making it highly effective for applications such as code search, retrieval-augmented generation (RAG), and contextual understanding of programming languages.
|
This model outperforms all previous open-source models on the CoIR and MTEB leaderboards, achieving best-in-class performance at a significantly smaller size than competing models.
|
|
|
### Languages Supported
|
* Python |
|
* C++ |
|
* C# |
|
* Go |
|
* Java |
|
* JavaScript
|
* PHP |
|
* Ruby |
|
* TypeScript
|
|
|
|
|
## Model Information |
|
- Model Size: 1.5B |
|
- Embedding Dimension: 1536 |
|
- Max Input Tokens: 32k |
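
When loading with Sentence Transformers, the effective sequence length can be inspected or capped through the library's `max_seq_length` attribute. A minimal sketch, assuming the Lite checkpoint name used elsewhere in this card:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qodo/Qodo-Embed-1-Lite")
print(model.max_seq_length)   # sequence cap currently in effect
model.max_seq_length = 8192   # optionally trade context length for speed/memory
```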
|
|
|
## Requirements |
|
```
transformers>=4.39.2
flash_attn>=2.5.6
```
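
For example, the pinned versions can be installed with pip (note that `flash_attn` compiles CUDA kernels, so a GPU environment with a CUDA toolchain is assumed; the Sentence Transformers example below additionally needs the `sentence-transformers` package):

```
pip install "transformers>=4.39.2" "flash_attn>=2.5.6" sentence-transformers
```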
|
|
|
## Usage |
|
|
|
### Sentence Transformers |
|
|
|
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Qodo/Qodo-Embed-1-Lite")
# Run inference
sentences = [
    'accumulator = sum(item.value for item in collection)',
    'result = reduce(lambda acc, curr: acc + curr.amount, data, 0)',
    'matrix = [[i*j for j in range(n)] for i in range(n)]'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1536]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
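
The same API extends to natural language-to-code retrieval: embed a query and candidate snippets, then rank the snippets by similarity. A minimal sketch; the query and snippets here are illustrative, not part of the model card:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qodo/Qodo-Embed-1-Lite")

# Embed one natural-language query and two candidate code snippets
query_embedding = model.encode(["sum the values of a collection"])
code_embeddings = model.encode([
    'accumulator = sum(item.value for item in collection)',
    'matrix = [[i*j for j in range(n)] for i in range(n)]'
])

# Rank candidates by similarity to the query; result has shape (1, 2)
similarities = model.similarity(query_embedding, code_embeddings)
print(int(similarities.argmax()))  # index of the best-matching snippet
```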
|
|
|
### Transformers |
|
|
|
```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, the final position holds the last real token of every sequence
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, index each sequence at its last non-padding token
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


# Natural-language queries to score against the code documents below
queries = [
    'how to handle memory efficient data streaming',
    'implement binary tree traversal'
]

documents = [
    """def process_in_chunks():
    buffer = deque(maxlen=1000)
    for record in source_iterator:
        buffer.append(transform(record))
        if len(buffer) >= 1000:
            yield from buffer
            buffer.clear()""",

    """class LazyLoader:
    def __init__(self, source):
        self.generator = iter(source)
        self._cache = []

    def next_batch(self, size=100):
        while len(self._cache) < size:
            try:
                self._cache.append(next(self.generator))
            except StopIteration:
                break
        return self._cache.pop(0) if self._cache else None""",

    """def dfs_recursive(root):
    if not root:
        return []
    stack = []
    stack.extend(dfs_recursive(root.right))
    stack.append(root.val)
    stack.extend(dfs_recursive(root.left))
    return stack"""
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('Qodo/Qodo-Embed-1-Lite', trust_remote_code=True)
model = AutoModel.from_pretrained('Qodo/Qodo-Embed-1-Lite', trust_remote_code=True)

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```
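
Because the embeddings are L2-normalized, the score matrix holds cosine similarities scaled by 100, so ranking each query row gives the retrieval result. A small follow-on sketch, reusing `scores`, `queries`, and `documents` from the block above:

```python
# For each query, report the index of the highest-scoring document
best = scores.argmax(dim=1)
for query, idx in zip(queries, best.tolist()):
    print(f'{query!r} -> documents[{idx}]')
```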
|
|
|
|
|
|
|
|
|
## License
|
QodoAI-Open-RAIL-M |
|