---
license: cc-by-nc-4.0
base_model:
- nomic-ai/CodeRankEmbed
---

`SweRankEmbed-Small` is a 137M-parameter bi-encoder with an 8192-token context length for code retrieval. It significantly outperforms other embedding models on the issue localization task.

The model was trained on large-scale issue localization data collected from public Python GitHub repositories. Check out our [blog post](https://gangiswag.github.io/SweRank/) and [paper](https://arxiv.org/abs/2505.07849) for more details!

You can combine `SweRankEmbed` with our [`SweRankLLM-Small`]() or [`SweRankLLM-Large`]() rerankers for even higher-quality ranking performance.

Link to code: [https://github.com/gangiswag/SweRank](https://github.com/gangiswag/SweRank)

## Performance

SweRank models achieve state-of-the-art localization performance on benchmarks such as SWE-Bench-Lite and LocBench, considerably outperforming agent-based approaches that rely on Claude 3.5.

| Model Name | SWE-Bench-Lite Func@10 | LocBench Func@15 |
| --- | --- | --- |
| OpenHands (Claude 3.5) | 70.07 | 59.29 |
| LocAgent (Claude 3.5) | 77.37 | 60.71 |
| CodeRankEmbed (137M) | 58.76 | 50.89 |
| GTE-Qwen2-7B-Instruct (7B) | 70.44 | 57.14 |
| SweRankEmbed-Small (137M) | 74.45 | 63.39 |
| SweRankEmbed-Large (7B) | 82.12 | 67.32 |
| + GPT-4.1 reranker | 87.96 | 74.64 |
| + SweRankLLM-Small (7B) reranker | 86.13 | 74.46 |
| + SweRankLLM-Large (32B) reranker | 88.69 | 76.25 |

## Usage with Sentence-Transformers

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SweRankEmbed-Small", trust_remote_code=True)

queries = ['Calculate the n-th factorial']
documents = ['def fact(n):\n if n < 0:\n raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']

# Queries are encoded with the "query" prompt; documents are encoded as-is
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Similarity scores between each query and each document
scores = query_embeddings @ document_embeddings.T

for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    # Sort documents by decreasing similarity score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
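
For issue localization, the same interface can rank many candidate functions from a repository against an issue description; the top-ranked candidates can then be passed to the `SweRankLLM` rerankers mentioned above. A minimal sketch, where the issue text, candidate snippets, and `top_k` are illustrative placeholders rather than benchmark data:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Salesforce/SweRankEmbed-Small", trust_remote_code=True)

# Hypothetical issue description and candidate functions mined from a repository
issue = "TypeError when a pathlib.Path is passed to load_config"
candidates = [
    "def load_config(path):\n    with open(path) as f:\n        return json.load(f)",
    "def save_config(cfg, path):\n    with open(path, 'w') as f:\n        json.dump(cfg, f)",
    "def get_logger(name):\n    return logging.getLogger(name)",
]

# Encode the issue as a query and the candidate functions as documents
issue_embedding = model.encode([issue], prompt_name="query")
candidate_embeddings = model.encode(candidates)

# Rank candidates by similarity and keep the top-k for downstream reranking
scores = (issue_embedding @ candidate_embeddings.T)[0]
top_k = 2
for idx in np.argsort(-scores)[:top_k]:
    print(f"{scores[idx]:.4f}", candidates[idx].splitlines()[0])
```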

## Usage with Hugging Face Transformers

**Important**: the query prompt must include the following task instruction prefix: "*Represent this query for searching relevant code: *"

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('Salesforce/SweRankEmbed-Small')
model = AutoModel.from_pretrained('Salesforce/SweRankEmbed-Small', add_pooling_layer=False)
model.eval()

query_prefix = 'Represent this query for searching relevant code: '
queries = ['Calculate the n-th factorial']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

documents = ['def fact(n):\n if n < 0:\n raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Compute token embeddings and take the [CLS] token as the sequence embedding
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# Normalize embeddings
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

# Cosine similarity between queries and documents
scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))

for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    # Sort documents by decreasing similarity score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
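
The snippets above truncate inputs at 512 tokens for brevity. Since the model supports an 8192-token context, you can raise `max_length` when embedding long functions or whole files. Continuing from the Transformers snippet above (the value 8192 simply matches the stated context length):

```python
# Embed longer inputs by raising max_length up to the 8192-token context window
long_document_tokens = tokenizer(
    documents, padding=True, truncation=True, return_tensors='pt', max_length=8192
)
with torch.no_grad():
    long_document_embeddings = model(**long_document_tokens)[0][:, 0]
long_document_embeddings = torch.nn.functional.normalize(long_document_embeddings, p=2, dim=1)
```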

## Citation

If you find this work useful in your research, please consider citing our paper:

```
@article{reddy2025swerank,
  title={SweRank: Software Issue Localization with Code Ranking},
  author={Reddy, Revanth Gangi and Suresh, Tarun and Doo, JaeHyeok and Liu, Ye and Nguyen, Xuan Phi and Zhou, Yingbo and Yavuz, Semih and Xiong, Caiming and Ji, Heng and Joty, Shafiq},
  journal={arXiv preprint arXiv:2505.07849},
  year={2025}
}
```