Update README.md

And to output a string of a number between 1-7.

To produce a continuous score that can be used for reranking query-context pairs (i.e. a method with few ties), we calculate the expected value of the scores.
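
Concretely, if $p_k$ is the probability the model assigns to the score token $k$ (recovered by exponentiating the returned logprobs), the continuous score is the expectation (notation ours, summarising what the scripts below compute):

```math
\mathbb{E}[s] = \sum_{k=1}^{7} k \, p_k, \qquad p_k = \exp(\mathrm{logprob}_k)
```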

We include scripts to do this in vLLM, LMDeploy, and OpenAI (hosted for free on Huggingface):

### vLLM

Install [vLLM](https://github.com/vllm-project/vllm/) using `pip install vllm`.

```python
# … (script body not shown in this diff)
print(expected_vals)
# [6.66570732 1.86686378 1.01102923]
```
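
The body of the vLLM script falls outside this diff's hunks. As a rough sketch of the same expectation computation against vLLM's offline `LLM` API (our reconstruction, not the repo's script; the prompt format and model name are taken from the OpenAI example below):

```python
from vllm import LLM, SamplingParams
import numpy as np

llm = LLM(model="lightblue/lb-reranker-0.5B-v1.0")
tokenizer = llm.get_tokenizer()

# One generated token is enough for the score; logprobs=14 comfortably covers "1"-"7".
sampling_params = SamplingParams(temperature=0.0, max_tokens=1, logprobs=14)

def expected_score(question: str, context: str) -> float:
    messages = [
        {"role": "system", "content": "Given a query and a piece of text, output a score of 1-7 based on how related the query is to the text. 1 means least related and 7 is most related."},
        {"role": "user", "content": f"<<<Query>>>\n{question}\n\n<<<Context>>>\n{context}"},
    ]
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    output = llm.generate([prompt], sampling_params)[0].outputs[0]
    # output.logprobs[0] maps token_id -> Logprob(logprob=..., decoded_token=...)
    return sum(
        int(lp.decoded_token) * np.exp(lp.logprob)
        for lp in output.logprobs[0].values()
        if lp.decoded_token and lp.decoded_token.strip().isdigit()
    )
```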

### LMDeploy

Install [LMDeploy](https://github.com/InternLM/lmdeploy) using `pip install lmdeploy`.

```python
# … (script body not shown in this diff)
print(expected_vals)
# [6.66415229 1.84342025 1.01133205]
```
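
Likewise for LMDeploy, whose script body is also elided here. A minimal sketch, assuming LMDeploy's `pipeline` API with `GenerationConfig(logprobs=...)` (the logprobs plumbing may differ across LMDeploy versions, so treat this as illustrative rather than the repo's code):

```python
from lmdeploy import pipeline, GenerationConfig
from transformers import AutoTokenizer
import numpy as np

model_name = "lightblue/lb-reranker-0.5B-v1.0"
pipe = pipeline(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# top_k=1 for greedy decoding; logprobs=14 to cover the score tokens "1"-"7".
gen_config = GenerationConfig(max_new_tokens=1, top_k=1, logprobs=14)

def expected_score(question: str, context: str) -> float:
    messages = [
        {"role": "system", "content": "Given a query and a piece of text, output a score of 1-7 based on how related the query is to the text. 1 means least related and 7 is most related."},
        {"role": "user", "content": f"<<<Query>>>\n{question}\n\n<<<Context>>>\n{context}"},
    ]
    response = pipe(messages, gen_config=gen_config)
    # response.logprobs[0] maps token_id -> logprob for the single generated token
    score = 0.0
    for token_id, logprob in response.logprobs[0].items():
        token = tokenizer.decode([token_id]).strip()
        if token.isdigit():
            score += int(token) * np.exp(logprob)
    return score
```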

### OpenAI (Hosted on Huggingface)

Install [openai](https://github.com/openai/openai-python) using `pip install openai`.
272 |
+
|
273 |
+
```python
|
274 |
+
from openai import OpenAI
|
275 |
+
import numpy as np
|
276 |
+
from multiprocessing import Pool
|
277 |
+
from tqdm.auto import tqdm
|
278 |
+
|
279 |
+
client = OpenAI(
|
280 |
+
base_url="https://api-inference.huggingface.co/v1/",
|
281 |
+
api_key="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" # Change this to an access token from https://huggingface.co/settings/tokens
|
282 |
+
)
|
283 |
+
|
284 |
+
def make_reranker_input(t, q):
|
285 |
+
return f"<<<Query>>>\n{q}\n\n<<<Context>>>\n{t}"
|
286 |
+
|
287 |
+
def make_reranker_inference_conversation(context, question):
|
288 |
+
system_message = "Given a query and a piece of text, output a score of 1-7 based on how related the query is to the text. 1 means least related and 7 is most related."
|
289 |
+
|
290 |
+
return [
|
291 |
+
{"role": "system", "content": system_message},
|
292 |
+
{"role": "user", "content": make_reranker_input(context, question)},
|
293 |
+
]
|
294 |
+
|
295 |
+
def get_reranker_score(context_question_tuple):
|
296 |
+
question, context = context_question_tuple
|
297 |
+
|
298 |
+
messages = make_reranker_inference_conversation(context, question)
|
299 |
+
|
300 |
+
completion = client.chat.completions.create(
|
301 |
+
model="lightblue/lb-reranker-0.5B-v1.0",
|
302 |
+
messages=messages,
|
303 |
+
max_tokens=1,
|
304 |
+
temperature=0.0,
|
305 |
+
logprobs=True,
|
306 |
+
top_logprobs=5, # Max allowed by the openai API as top_n_tokens must be >= 0 and <= 5. If this gets changed, fix to > 7.
|
307 |
+
)
|
308 |
+
|
309 |
+
logprobs = completion.choices[0].logprobs.content[0].top_logprobs
|
310 |
+
|
311 |
+
calculated_score = sum([int(x.token) * np.exp(x.logprob) for x in logprobs])
|
312 |
+
|
313 |
+
return calculated_score
|
314 |
+
|
315 |
+
query_texts = [
|
316 |
+
("What is the scientific name of apples?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
|
317 |
+
("What is the Chinese word for 'apple'?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
|
318 |
+
("What is the square root of 999?", "An apple is a round, edible fruit produced by an apple tree (Malus spp., among them the domestic or orchard apple; Malus domestica)."),
|
319 |
+
]
|
320 |
+
|
321 |
+
with Pool(processes=16) as p: # Allows for parallel processing
|
322 |
+
expected_vals = list(tqdm(p.imap(get_reranker_score, query_texts), total=len(query_texts)))
|
323 |
+
|
324 |
+
print(expected_vals)
|
325 |
+
# [6.64866580, 1.85144404, 1.010719508]
|
326 |
+
```
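
Note one caveat of the hosted route: the API caps `top_logprobs` at 5, so the expectation is taken over at most five of the seven score tokens, and any probability mass on the remaining two is dropped. In practice the mass concentrates on a few neighbouring scores, which is presumably why the values above stay close to the vLLM and LMDeploy outputs.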

# Evaluation

We perform an evaluation on 9 datasets from the [BEIR benchmark](https://github.com/beir-cellar/beir) that none of the evaluated models have been trained upon (to our knowledge).