---
license: cc-by-4.0
language:
- en
tags:
- retrieval
- retriever
- rag
inference: false
---

<img src="logo.png" width=25%>

# Model Description
RoBERTa ReRanker for Retrieved Results, or **R*** (pronounced R-star), is a cross-encoder model that reranks retrieved results to improve their relevance and accuracy. Paired with a generative model, **R*** reorders retrieved passages so that the most relevant context reaches the LLM first. Based on the [RoBERTa tiny](https://huggingface.co/haisongzhang/roberta-tiny-cased) architecture, **R*** is specialized in distinguishing relevant from irrelevant query-passage pairs, thereby refining the output of LLMs in retrieval and generative tasks. This model is an experiment presented at [PACLIC 38 (2024)](https://sites.google.com/view/paclic38), whose proceedings are to be published in the ACL Anthology.

## Training Data
R* was trained on a dataset derived from the MS MARCO passage ranking dataset, consisting of 2.5 million query-positive passage pairs and an equal number of query-negative passage pairs, totaling 5 million query-passage pairs. This ensures a balanced training approach, exposing R* to both relevant and irrelevant examples equally.
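As a rough illustration of what a balanced pair set looks like, the sketch below expands (query, positive, negative) triples into labeled query-passage pairs. The triple format, the function name `triples_to_pairs`, and the example texts are assumptions made for illustration; the actual preprocessing lives in the training notebook and may differ.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]        # (query, positive passage, negative passage)
LabeledPair = Tuple[str, str, int]   # (query, passage, label)


def triples_to_pairs(triples: List[Triple]) -> List[LabeledPair]:
    """Expand each triple into one positive (1) and one negative (0) pair."""
    pairs: List[LabeledPair] = []
    for query, positive, negative in triples:
        pairs.append((query, positive, 1))
        pairs.append((query, negative, 0))
    return pairs


# Hypothetical toy data, not taken from MS MARCO.
triples = [
    ("what is a reranker", "A reranker reorders retrieved passages.", "Bananas are yellow."),
    ("capital of france", "Paris is the capital of France.", "The V100 is a GPU."),
]
pairs = triples_to_pairs(triples)
num_positive = sum(label for _, _, label in pairs)
print(len(pairs), num_positive)
```

Because every triple contributes exactly one pair of each label, the resulting set is balanced by construction.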

## Training Procedure
Training was framed as binary classification: the model learns to assign each query-passage pair a continuous relevance score from 0 (irrelevant) to 1 (relevant). The model was trained for 7 epochs with a batch size of 2048 on a Colab Pro instance equipped with a V100 GPU (16 GB VRAM) and 51 GB RAM, which took approximately 16 hours.
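One common way to obtain a 0-to-1 score from a binary classifier is to pass a single output logit through a sigmoid and train with binary cross-entropy. The sketch below shows that arithmetic in plain Python. It is an assumption for illustration only: this card does not state the exact loss or output head used for R*, and the helper names are hypothetical.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw logit to a score in the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def binary_cross_entropy(score: float, label: int) -> float:
    """Loss for one pair; label is 1 (relevant) or 0 (irrelevant)."""
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(score + eps) + (1 - label) * math.log(1.0 - score + eps))


score = sigmoid(3.0)  # a confident "relevant" prediction
loss_if_relevant = binary_cross_entropy(score, 1)
loss_if_irrelevant = binary_cross_entropy(score, 0)
print(score, loss_if_relevant, loss_if_irrelevant)
```

The loss is small when a high score meets a positive label and large when it meets a negative one, which is what pushes scores toward 0 or 1 during training.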

## Evaluation and Performance
Coming soon.

## Use Cases
R* is particularly suitable for applications that demand high precision in information retrieval, such as RAG reranking, search engine results, document searching in legal or academic databases, recommendation systems, and beyond.

## How to Use
### With Transformers
For usage with the Transformers library, you can follow this generic example:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model = AutoModelForSequenceClassification.from_pretrained('jaspercatapang/R-star')
tokenizer = AutoTokenizer.from_pretrained('jaspercatapang/R-star')

# The first list holds the queries and the second the passages; items are paired by position.
queries = ['Your query here', 'Your query here']
passages = ['First candidate passage', 'Second candidate passage']
features = tokenizer(queries, passages, padding=True, truncation=True, return_tensors="pt")

model.eval()
with torch.no_grad():
    scores = model(**features).logits  # raw model outputs, one row per pair
    print(scores)
```

### With SentenceTransformers
```python
from sentence_transformers import CrossEncoder

model = CrossEncoder('jaspercatapang/R-star', max_length=512)
# Each tuple is one (query, passage) pair to score.
pairs = [
    ('Your query here', 'First candidate passage'),
    ('Your query here', 'Second candidate passage'),
]
scores = model.predict(pairs)  # one score per pair, in input order
print(scores)
```
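Scores come back in the same order as the input pairs, so reranking amounts to sorting the passages by score. The helper below is a minimal sketch of that step. The name `rerank`, the `top_k` parameter, and the example scores are hypothetical; in practice the scores would come from the model rather than being written by hand.

```python
from typing import List, Sequence, Tuple


def rerank(passages: Sequence[str], scores: Sequence[float], top_k: int = 3) -> List[Tuple[str, float]]:
    """Return the top_k passages sorted by descending relevance score."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    ranked = sorted(
        zip(passages, (float(s) for s in scores)),
        key=lambda item: item[1],
        reverse=True,
    )
    return ranked[:top_k]


passages = ["Passage A", "Passage B", "Passage C"]
scores = [0.12, 0.93, 0.41]  # made-up values standing in for model output
top = rerank(passages, scores, top_k=2)
print(top)
```

In a RAG pipeline, only the `top_k` passages would then be passed to the generative model as context.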

### Training and Evaluation
1. For training, the Colab notebook can be found [here](https://colab.research.google.com/drive/1F105XTCchub-flcGB1XqqoaYlJr16YR3).
2. For evaluation, the Colab notebook can be found [here](https://colab.research.google.com/drive/1H5RppJX9cfRXd8Hls2_Vis5sb6SHB1zf).

## Limitations
Based on our evaluation, R* tends to favor longer passages when scoring, which could introduce a bias. This is true for most cross-encoder models. It is advisable to preprocess text to normalize passage lengths for fair comparison. Note that R* is optimized for passage-level comparisons and may not perform well on word- or phrase-level similarity tasks.
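One simple way to normalize passage lengths is to cap every passage at the same number of whitespace-separated words before scoring. The sketch below assumes that approach; `truncate_words` and the default of 100 words are illustrative choices, not values taken from the evaluation.

```python
from typing import List


def truncate_words(passage: str, max_words: int = 100) -> str:
    """Keep at most max_words whitespace-separated words of a passage."""
    words = passage.split()
    return " ".join(words[:max_words])


def normalize_lengths(passages: List[str], max_words: int = 100) -> List[str]:
    """Apply the same word cap to every candidate passage."""
    return [truncate_words(p, max_words) for p in passages]


candidates = [
    "short passage",
    "a much longer passage " * 50,  # 200 words
]
normalized = normalize_lengths(candidates, max_words=20)
lengths = [len(p.split()) for p in normalized]
print(lengths)
```

Truncation discards text past the cap, so a sliding-window split may be preferable when relevant content can appear late in a passage.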

## Ethical Considerations
The use of R* introduces several ethical considerations, including potential biases in the training data, privacy concerns, and the implications of automating decision-making processes. Users are encouraged to critically evaluate the model's fairness and transparency, ensuring its equitable use across diverse demographics.

## Contact Details
For additional information or inquiries about R*, please contact the developer via [email protected]

## Disclaimer
R* is an AI language model developed by Jasper Kyle Catapang. It is provided "as is" without warranty of any kind, express or implied. The model developer shall not be liable for any direct or indirect damages arising from the use of this model.

## Acknowledgments
Thank you to Microsoft for the MS MARCO dataset. We would also like to extend our gratitude to [Haisong Zhang](https://huggingface.co/haisongzhang) for the base model.