---
license: cc-by-nc-4.0
base_model:
- nomic-ai/CodeRankEmbed
---
`SweRankEmbed-Small` is a 137M-parameter bi-encoder with an 8192-token context length for code retrieval. It significantly outperforms other embedding models on the issue localization task.

The model was trained on large-scale issue localization data collected from public Python GitHub repositories. Check out our [blog post](https://gangiswag.github.io/SweRank/) and [paper](https://arxiv.org/abs/2505.07849) for more details!

You can combine `SweRankEmbed` with our [`SweRankLLM-Small`]() or [`SweRankLLM-Large`]() rerankers for even higher-quality ranking, as sketched below.

Link to code: [https://github.com/gangiswag/SweRank](https://github.com/gangiswag/SweRank)
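
As a rough illustration of the retrieve-then-rerank pipeline, here is a minimal sketch: stage 1 uses `SweRankEmbed-Small` to shortlist candidate functions for an issue, and stage 2 hands the shortlist to a reranker. The example issue, the function snippets, and the `rerank` stub are hypothetical placeholders; the actual listwise reranking code for the `SweRankLLM` models lives in the repository linked above.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("Salesforce/SweRankEmbed-Small", trust_remote_code=True)

# Hypothetical issue report and candidate functions from a repository.
issue = "Crash: fact() recurses forever when called with a negative number"
functions = {
    "utils.fact": "def fact(n):\n    return 1 if n == 0 else n * fact(n - 1)",
    "utils.fib": "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)",
}

ids, snippets = zip(*functions.items())
query_emb = model.encode([issue], prompt_name="query")
doc_embs = model.encode(list(snippets))

# Stage 1: dense retrieval; keep the top-k candidates by dot-product score.
scores = (query_emb @ doc_embs.T)[0]
shortlist = [ids[i] for i in np.argsort(-scores)[:10]]

# Stage 2: rerank the shortlist. Identity stub for illustration only --
# the real SweRankLLM rerankers are run via the SweRank repository code.
def rerank(issue_text, candidate_ids):
    return candidate_ids

print(rerank(issue, shortlist))
```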

## Performance

SweRank models achieve state-of-the-art localization performance on benchmarks such as SWE-Bench-Lite and LocBench, considerably outperforming agent-based approaches that rely on Claude-3.5. Func@k measures whether a ground-truth function appears in the top-k results.

| Model Name  | SWE-Bench-Lite Func@10 | LocBench Func@15 |
| ------------------------------------------------------------------- | -------------------------------- | -------------------------------- |
| OpenHands (Claude 3.5)    | 70.07                            |  59.29 |
| LocAgent (Claude 3.5)     | 77.37                            |  60.71 |
| CodeRankEmbed (137M)      | 58.76                            |  50.89 |
| GTE-Qwen2-7B-Instruct (7B)| 70.44                            |  57.14 |
| SweRankEmbed-Small (137M) | 74.45                            |  63.39 |
| SweRankEmbed-Large (7B)   | 82.12                            |  67.32 |
| + GPT-4.1 reranker        | 87.96                            |  74.64 |
| + SweRankLLM-Small (7B) reranker        | 86.13                            |  74.46 |
| + SweRankLLM-Large (32B) reranker       | 88.69                            |  76.25 |


## Usage with Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SweRankEmbed-Small", trust_remote_code=True)
queries = ['Calculate the n-th factorial']
documents = ['def fact(n):\n if n < 0:\n  raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']

query_embeddings = model.encode(queries, prompt_name="query")  # "query" applies the task instruction prefix
document_embeddings = model.encode(documents)

# Score every document against every query with a dot product
scores = query_embeddings @ document_embeddings.T

for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
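
With sentence-transformers v3 or later, the manual dot product above can also be written with the built-in similarity helper, which applies the model's configured similarity function:

```python
# Equivalent scoring via the built-in helper (sentence-transformers >= 3.0).
scores = model.similarity(query_embeddings, document_embeddings)
```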

## Usage with Hugging Face Transformers

**Important**: the query prompt must include the following task instruction prefix: "*Represent this query for searching relevant code: *"

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Salesforce/SweRankEmbed-Small')
model = AutoModel.from_pretrained('Salesforce/SweRankEmbed-Small', trust_remote_code=True, add_pooling_layer=False)
model.eval()

query_prefix = 'Represent this query for searching relevant code: '
queries  = ['Calculate the n-th factorial']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

documents = ['def fact(n):\n if n < 0:\n  raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Compute embeddings: CLS pooling takes the final hidden state of the first token
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# Normalize embeddings so the dot product below equals cosine similarity
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
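
The Transformers example above truncates inputs at 512 tokens. Since the model supports a context length of 8192 tokens, longer functions or whole files can be encoded by raising the limit:

```python
# Use the full 8192-token context window for long code inputs.
document_tokens = tokenizer(documents, padding=True, truncation=True,
                            return_tensors='pt', max_length=8192)
```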

## Citation

If you find this work useful in your research, please consider citing our paper:

```bibtex
@article{reddy2025swerank,
  title={SweRank: Software Issue Localization with Code Ranking},
  author={Reddy, Revanth Gangi and Suresh, Tarun and Doo, JaeHyeok and Liu, Ye and Nguyen, Xuan Phi and Zhou, Yingbo and Yavuz, Semih and Xiong, Caiming and Ji, Heng and Joty, Shafiq},
  journal={arXiv preprint arXiv:2505.07849},
  year={2025}
}
```