---

license: apache-2.0
pipeline_tag: text-classification
tags:
  - transformers
  - sentence-transformers
  - text-embeddings-inference
language:
  - ko
  - multilingual
---



# upskyy/ko-reranker-8k

**ko-reranker-8k** is a model fine-tuned from [BAAI/bge-reranker-v2-m3](https://huggingface.co/BAAI/bge-reranker-v2-m3) on [Korean data](https://huggingface.co/datasets/upskyy/ko-wiki-reranking).

## Usage
## Using FlagEmbedding
```bash
pip install -U FlagEmbedding
```

Get relevance scores (higher scores indicate more relevance):

```python
from FlagEmbedding import FlagReranker

# Setting use_fp16=True speeds up computation at the cost of a slight performance degradation
reranker = FlagReranker('upskyy/ko-reranker-8k', use_fp16=True)

# Score a single [query, passage] pair
score = reranker.compute_score(['query', 'passage'])
print(score)  # -8.3828125

# Pass normalize=True to map the score into the 0-1 range by applying a sigmoid function
score = reranker.compute_score(['query', 'passage'], normalize=True)
print(score)  # 0.000228713314721116

# Score several [query, passage] pairs at once
scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']])
print(scores)  # [-11.2265625, 8.6875]

# normalize=True applies the same sigmoid mapping to every score in the batch
scores = reranker.compute_score([['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']], normalize=True)
print(scores)  # [1.3315579521758342e-05, 0.9998313472460109]
```
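Because higher scores indicate more relevance, reranking a candidate list comes down to sorting passages by their scores. The following is a minimal sketch of that step in plain Python; `rank_passages` is a hypothetical helper (not part of FlagEmbedding), and the scores are the example values quoted above rather than fresh model output.

```python
from typing import List, Optional, Tuple


def rank_passages(
    passages: List[str],
    scores: List[float],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Hypothetical helper: pair each passage with its score and sort best-first."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    # Higher score means more relevant, so sort in descending order of score.
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


passages = [
    'hi',
    'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear '
    'or simply panda, is a bear species endemic to China.',
]
# Assumed inputs: the example scores shown above for the query 'what is panda?'.
scores = [-11.2265625, 8.6875]

ranked = rank_passages(passages, scores, top_k=1)
print(ranked[0][0])
```

In practice, `scores` would come from `reranker.compute_score(...)` called on `[query, passage]` pairs built from the same `passages` list, in the same order.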


## Using Hugging Face Transformers

Get relevance scores (higher scores indicate more relevance):


```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the tokenizer and the sequence-classification model, then switch to inference mode
tokenizer = AutoTokenizer.from_pretrained('upskyy/ko-reranker-8k')
model = AutoModelForSequenceClassification.from_pretrained('upskyy/ko-reranker-8k')
model.eval()

# Each item is a [query, passage] pair
pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]

with torch.no_grad():
    inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # One logit per pair; flatten to a 1-D tensor of relevance scores
    scores = model(**inputs, return_dict=True).logits.view(-1).float()
    print(scores)
```
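The Transformers snippet prints raw logits, whereas the FlagEmbedding example states that `normalize=True` applies a sigmoid function to map scores into the 0-1 range. The sketch below shows that mapping in plain Python, assuming the raw scores have already been converted to floats (for example with `scores.tolist()`); the `sigmoid` helper is written here for illustration and is not taken from either library.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function: maps any real score into (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Assumed inputs: the raw example scores quoted in the FlagEmbedding section.
raw_scores: List[float] = [-11.2265625, 8.6875]
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)
```

When working directly with the `scores` tensor, `torch.sigmoid(scores)` should give the same mapping without leaving PyTorch.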



## Citation

```bibtex
@misc{li2023making,
      title={Making Large Language Models A Better Foundation For Dense Retrieval},
      author={Chaofan Li and Zheng Liu and Shitao Xiao and Yingxia Shao},
      year={2023},
      eprint={2312.15503},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{chen2024bge,
      title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
      author={Jianlv Chen and Shitao Xiao and Peitian Zhang and Kun Luo and Defu Lian and Zheng Liu},
      year={2024},
      eprint={2402.03216},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```


## Reference

- [Dongjin-kr/ko-reranker](https://huggingface.co/Dongjin-kr/ko-reranker)
- [reranker-kr](https://github.com/aws-samples/aws-ai-ml-workshop-kr/tree/master/genai/aws-gen-ai-kr/30_fine_tune/reranker-kr)
- [FlagEmbedding](https://github.com/FlagOpen/FlagEmbedding)