---
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- transformers
- Qwen2
license: other
license_name: qodoai-open-rail-m
license_link: LICENSE
pipeline_tag: sentence-similarity
library_name: sentence-transformers
base_model: Alibaba-NLP/gte-Qwen2-1.5B-instruct
---
## Qodo-Embed-1
**Qodo-Embed-1** is a state-of-the-art code embedding model designed for retrieval tasks in the software development domain.
It is offered in two sizes: lite (1.5B) and medium (7B). The model is optimized for natural language-to-code and code-to-code retrieval, making it highly effective for applications such as code search, retrieval-augmented generation (RAG), and contextual understanding of programming languages.
This model outperforms all previous open-source models on the CoIR and MTEB leaderboards, achieving best-in-class performance at a significantly smaller size than competing models.
### Languages Supported:
* Python
* C++
* C#
* Go
* Java
* JavaScript
* PHP
* Ruby
* TypeScript
## Model Information
- Model Size: 1.5B
- Embedding Dimension: 1536
- Max Input Tokens: 32k
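These properties can also be read from the loaded model; a minimal sketch, assuming the Sentence Transformers setup shown under Usage below:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qodo/Qodo-Embed-1-1.5B")

# Embedding dimension and the configured maximum sequence length
print(model.get_sentence_embedding_dimension())  # 1536
print(model.max_seq_length)
```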
## Requirements
```
transformers>=4.39.2
flash_attn>=2.5.6
```
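Both packages can be installed with pip, for example `pip install "transformers>=4.39.2" "flash-attn>=2.5.6"`; the Sentence Transformers example below additionally requires the `sentence-transformers` package. Note that building `flash-attn` typically requires a CUDA toolchain.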
## Usage
### Sentence Transformers
```python
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("Qodo/Qodo-Embed-1-1.5B")
# Run inference
sentences = [
    'accumulator = sum(item.value for item in collection)',
    'result = reduce(lambda acc, curr: acc + curr.amount, data, 0)',
    'matrix = [[i*j for j in range(n)] for i in range(n)]'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1536]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
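For natural language-to-code retrieval, the same API can be used to rank candidate snippets against a query. A minimal sketch; the query and code snippets below are illustrative examples, not drawn from the model's data:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qodo/Qodo-Embed-1-1.5B")

# Illustrative natural-language query and candidate code snippets (hypothetical examples)
query = "read a JSON file and return its contents as a dictionary"
code_snippets = [
    "def load_config(path):\n    with open(path) as f:\n        return json.load(f)",
    "def fibonacci(n):\n    return n if n < 2 else fibonacci(n - 1) + fibonacci(n - 2)",
    "async def fetch_page(session, url):\n    async with session.get(url) as resp:\n        return await resp.text()",
]

query_embedding = model.encode([query])
snippet_embeddings = model.encode(code_snippets)

# Rank snippets by similarity to the query (higher is more relevant)
scores = model.similarity(query_embedding, snippet_embeddings)[0]
best = int(scores.argmax())
print(code_snippets[best])
```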
### Transformers
```python
import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # If every sequence has a real token in the last position, the batch is left-padded
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]
# Example natural-language queries and the code documents to search over
queries = [
    'how to handle memory efficient data streaming',
    'implement binary tree traversal'
]
documents = [
    """def process_in_chunks():
    buffer = deque(maxlen=1000)
    for record in source_iterator:
        buffer.append(transform(record))
        if len(buffer) >= 1000:
            yield from buffer
            buffer.clear()""",
    """class LazyLoader:
    def __init__(self, source):
        self.generator = iter(source)
        self._cache = []
    def next_batch(self, size=100):
        while len(self._cache) < size:
            try:
                self._cache.append(next(self.generator))
            except StopIteration:
                break
        return self._cache.pop(0) if self._cache else None""",
    """def dfs_recursive(root):
    if not root:
        return []
    stack = []
    stack.extend(dfs_recursive(root.right))
    stack.append(root.val)
    stack.extend(dfs_recursive(root.left))
    return stack"""
]
input_texts = queries + documents
tokenizer = AutoTokenizer.from_pretrained('Qodo/Qodo-Embed-1-1.5B', trust_remote_code=True)
model = AutoModel.from_pretrained('Qodo/Qodo-Embed-1-1.5B', trust_remote_code=True)
max_length = 8192
# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```
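The resulting `scores` matrix has one row per query and one column per document, so it can be used directly for retrieval. A short sketch that continues from the code above; the `torch.topk` ranking loop is illustrative and not part of the original example:

```python
# Rank the documents for each query by score (continues from the snippet above)
top = torch.topk(scores, k=scores.shape[1], dim=1)
for qi, query in enumerate(queries):
    print(f"Query: {query}")
    for rank, di in enumerate(top.indices[qi].tolist()):
        print(f"  {rank + 1}. document {di} (score: {scores[qi, di].item():.2f})")
```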
## License
[QodoAI-Open-RAIL-M](https://www.qodo.ai/open-rail-m-license/)