---
license: cc-by-nc-4.0
base_model:
- nomic-ai/CodeRankEmbed
---

`SweRankEmbed-Small` is a 137M-parameter bi-encoder with an 8192-token context length for code retrieval. It significantly outperforms other embedding models on the issue localization task.

The model was trained on large-scale issue localization data collected from public Python GitHub repositories. Check out our [blog post](https://gangiswag.github.io/SweRank/) and [paper](https://arxiv.org/abs/2505.07849) for more details!

You can combine `SweRankEmbed` with our [`SweRankLLM-Small`]() or [`SweRankLLM-Large`]() rerankers for even higher-quality ranking performance.

Link to code: [https://github.com/gangiswag/SweRank](https://github.com/gangiswag/SweRank)

## Performance

SweRank models achieve state-of-the-art localization performance on benchmarks such as SWE-Bench-Lite and LocBench, considerably outperforming agent-based approaches that rely on Claude 3.5.

| Model Name | SWE-Bench-Lite Func@10 | LocBench Func@15 |
| --- | --- | --- |
| OpenHands (Claude 3.5) | 70.07 | 59.29 |
| LocAgent (Claude 3.5) | 77.37 | 60.71 |
| CodeRankEmbed (137M) | 58.76 | 50.89 |
| GTE-Qwen2-7B-Instruct (7B) | 70.44 | 57.14 |
| SweRankEmbed-Small (137M) | 74.45 | 63.39 |
| SweRankEmbed-Large (7B) | 82.12 | 67.32 |
| + GPT-4.1 reranker | 87.96 | 74.64 |
| + SweRankLLM-Small (7B) reranker | 86.13 | 74.46 |
| + SweRankLLM-Large (32B) reranker | 88.69 | 76.25 |

## Usage with Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SweRankEmbed-Small", trust_remote_code=True)

queries = ['Calculate the n-th factorial']
documents = ['def fact(n):\n if n < 0:\n raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']

# Queries are encoded with the "query" prompt; documents are encoded as-is
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Similarity scores between each query and each document
scores = query_embeddings @ document_embeddings.T

for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    # Sort documents by decreasing similarity score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
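
For issue localization, the same interface can rank many candidate functions from a repository against an issue description; the top-ranked candidates can then be passed to the `SweRankLLM` rerankers mentioned above. A minimal sketch, where the issue text, candidate snippets, and `top_k` are illustrative placeholders rather than benchmark data:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SweRankEmbed-Small", trust_remote_code=True)

# Hypothetical issue description and candidate functions mined from a repository
issue = "TypeError when a pathlib.Path is passed to load_config"
candidates = [
    "def load_config(path):\n    with open(path) as f:\n        return json.load(f)",
    "def save_config(cfg, path):\n    with open(path, 'w') as f:\n        json.dump(cfg, f)",
    "def get_logger(name):\n    return logging.getLogger(name)",
]

# Encode the issue as a query and the candidate functions as documents
issue_embedding = model.encode([issue], prompt_name="query")
candidate_embeddings = model.encode(candidates)

# Rank candidates by similarity and keep the top-k for downstream reranking
scores = (issue_embedding @ candidate_embeddings.T)[0]
top_k = 2
for idx in np.argsort(-scores)[:top_k]:
    print(f"{scores[idx]:.4f}", candidates[idx].splitlines()[0])
```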

## Usage with Hugging Face Transformers

**Important**: the query prompt must include the following task instruction prefix: "*Represent this query for searching relevant code: *"

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Salesforce/SweRankEmbed-Small')
model = AutoModel.from_pretrained('Salesforce/SweRankEmbed-Small', add_pooling_layer=False)
model.eval()

query_prefix = 'Represent this query for searching relevant code: '
queries = ['Calculate the n-th factorial']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

documents = ['def fact(n):\n if n < 0:\n raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Compute token embeddings and take the [CLS] token as the sequence embedding
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# Normalize embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

# Cosine similarity between queries and documents
scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))

for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    # Sort documents by decreasing similarity score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
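
The snippets above truncate inputs at 512 tokens for brevity. Since the model supports an 8192-token context, you can raise `max_length` when embedding long functions or whole files. Continuing from the Transformers snippet above (the value 8192 simply matches the stated context length):

```python
# Embed longer inputs by raising max_length up to the 8192-token context window
long_document_tokens = tokenizer(
    documents, padding=True, truncation=True, return_tensors='pt', max_length=8192
)
with torch.no_grad():
    long_document_embeddings = model(**long_document_tokens)[0][:, 0]
long_document_embeddings = torch.nn.functional.normalize(long_document_embeddings, p=2, dim=1)
```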

## Citation

If you find this work useful in your research, please consider citing our paper:

```
@article{reddy2025swerank,
  title={SweRank: Software Issue Localization with Code Ranking},
  author={Reddy, Revanth Gangi and Suresh, Tarun and Doo, JaeHyeok and Liu, Ye and Nguyen, Xuan Phi and Zhou, Yingbo and Yavuz, Semih and Xiong, Caiming and Ji, Heng and Joty, Shafiq},
  journal={arXiv preprint arXiv:2505.07849},
  year={2025}
}
```