|
--- |
|
tags: |
|
- sentence-transformers |
|
- sentence-similarity |
|
- feature-extraction |
|
- transformers |
|
- Qwen2 |
|
license: other |
|
pipeline_tag: sentence-similarity |
|
library_name: sentence-transformers |
|
base_model: Alibaba-NLP/gte-Qwen2-1.5B-instruct |
|
--- |
|
|
|
|
|
|
|
|
|
## Qodo-Embed-1 |
|
**Qodo-Embed-1** is a state-of-the-art code embedding model designed for retrieval tasks in the software development domain.
|
It is offered in two sizes: Lite (1.5B) and Medium (7B). The model is optimized for natural language-to-code and code-to-code retrieval, making it highly effective for applications such as code search, retrieval-augmented generation (RAG), and contextual understanding of programming languages.
|
This model outperforms all previous open-source models on the CoIR and MTEB leaderboards, achieving best-in-class performance at a significantly smaller size than competing models.
|
|
|
### Languages Supported
|
* Python |
|
* C++ |
|
* C# |
|
* Go |
|
* Java |
|
* JavaScript
|
* PHP |
|
* Ruby |
|
* TypeScript
|
|
|
|
|
## Model Information |
|
- Model Size: 1.5B |
|
- Embedding Dimension: 1536 |
|
- Max Input Tokens: 32k |
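
When loading with Sentence Transformers, the effective sequence length can be inspected or capped through the library's `max_seq_length` attribute. A minimal sketch, assuming the Lite checkpoint name used elsewhere in this card:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qodo/Qodo-Embed-1-Lite")
print(model.max_seq_length)   # sequence cap currently in effect
model.max_seq_length = 8192   # optionally trade context length for speed/memory
```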
|
|
|
## Requirements |
|
```
transformers>=4.39.2
flash_attn>=2.5.6
```
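
For example, the pinned versions can be installed with pip (note that `flash_attn` compiles CUDA kernels, so a GPU environment with a CUDA toolchain is assumed; the Sentence Transformers example below additionally needs the `sentence-transformers` package):

```
pip install "transformers>=4.39.2" "flash_attn>=2.5.6" sentence-transformers
```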
|
|
|
## Usage |
|
|
|
### Sentence Transformers |
|
|
|
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Qodo/Qodo-Embed-1-Lite")
# Run inference
sentences = [
    'accumulator = sum(item.value for item in collection)',
    'result = reduce(lambda acc, curr: acc + curr.amount, data, 0)',
    'matrix = [[i*j for j in range(n)] for i in range(n)]'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1536]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
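
The same API extends to natural language-to-code retrieval: embed a query and candidate snippets, then rank the snippets by similarity. A minimal sketch; the query and snippets here are illustrative, not part of the model card:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qodo/Qodo-Embed-1-Lite")

# Embed one natural-language query and two candidate code snippets
query_embedding = model.encode(["sum the values of a collection"])
code_embeddings = model.encode([
    'accumulator = sum(item.value for item in collection)',
    'matrix = [[i*j for j in range(n)] for i in range(n)]'
])

# Rank candidates by similarity to the query; result has shape (1, 2)
similarities = model.similarity(query_embedding, code_embeddings)
print(int(similarities.argmax()))  # index of the best-matching snippet
```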
|
|
|
### Transformers |
|
|
|
```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, the final position holds the last real token of every sequence
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, index each sequence at its last non-padding token
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


# Natural-language queries to score against the code documents below
queries = [
    'how to handle memory efficient data streaming',
    'implement binary tree traversal'
]

documents = [
    """def process_in_chunks():
    buffer = deque(maxlen=1000)
    for record in source_iterator:
        buffer.append(transform(record))
        if len(buffer) >= 1000:
            yield from buffer
            buffer.clear()""",

    """class LazyLoader:
    def __init__(self, source):
        self.generator = iter(source)
        self._cache = []

    def next_batch(self, size=100):
        while len(self._cache) < size:
            try:
                self._cache.append(next(self.generator))
            except StopIteration:
                break
        return self._cache.pop(0) if self._cache else None""",

    """def dfs_recursive(root):
    if not root:
        return []
    stack = []
    stack.extend(dfs_recursive(root.right))
    stack.append(root.val)
    stack.extend(dfs_recursive(root.left))
    return stack"""
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained('Qodo/Qodo-Embed-1-Lite', trust_remote_code=True)
model = AutoModel.from_pretrained('Qodo/Qodo-Embed-1-Lite', trust_remote_code=True)

max_length = 8192

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
```
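
Because the embeddings are L2-normalized, the score matrix holds cosine similarities scaled by 100, so ranking each query row gives the retrieval result. A small follow-on sketch, reusing `scores`, `queries`, and `documents` from the block above:

```python
# For each query, report the index of the highest-scoring document
best = scores.argmax(dim=1)
for query, idx in zip(queries, best.tolist()):
    print(f'{query!r} -> documents[{idx}]')
```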
|
|
|
|
|
|
|
|
|
## License
|
QodoAI-Open-RAIL-M |
|