---
datasets:
- unicamp-dl/mmarco
language:
- pt
pipeline_tag: text2text-generation
base_model: unicamp-dl/ptt5-v2-base
license: apache-2.0
---

## Introduction

MonoPTT5 models are T5 rerankers for the Portuguese language. Starting from [ptt5-v2 checkpoints](https://huggingface.co/collections/unicamp-dl/ptt5-v2-666538a650188ba00aa8d2d0), they were trained for 100k steps on a mixture of Portuguese and English data from the mMARCO dataset.

For further information on the training and evaluation of these models, please refer to our paper, [ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language](https://arxiv.org/abs/2406.10806).

## Usage

The easiest way to use our models is through the `rerankers` package. After installing it with `pip install "rerankers[transformers]"`, the following code serves as a minimal working example:

```python
from rerankers import Reranker
import torch

query = "O futebol é uma paixão nacional"
docs = [
    "O futebol é superestimado e não deveria receber tanta atenção.",
    "O futebol é uma parte essencial da cultura brasileira e une as pessoas.",
]

ranker = Reranker(
    "unicamp-dl/monoptt5-base",
    inputs_template="Pergunta: {query} Documento: {text} Relevante:",
    dtype=torch.float32  # or bfloat16 if supported by your GPU
)

results = ranker.rank(query, docs)

print("Classification results:")
for result in results:
    print(result)

# Loading T5Ranker model unicamp-dl/monoptt5-base
# No device set
# Using device cuda
# Using dtype torch.float32
# Loading model unicamp-dl/monoptt5-base, this might take a while...
# Using device cuda.
# Using dtype torch.float32.
# T5 true token set to ▁Sim
# T5 false token set to ▁Não
# Returning normalised scores...
# Inputs template set to Pergunta: {query} Documento: {text} Relevante:

# Classification results:
# document=Document(text='O futebol é uma parte essencial da cultura brasileira e une as pessoas.', doc_id=1, metadata={}) score=0.8186910152435303 rank=1
# document=Document(text='O futebol é superestimado e não deveria receber tanta atenção.', doc_id=0, metadata={}) score=0.008028557524085045 rank=2
```
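
The log output above mentions a true token (`▁Sim`), a false token (`▁Não`), and normalised scores. The sketch below illustrates how such a score could be derived, assuming it is a softmax over the logits of those two tokens at the first decoding step, as in the usual monoT5 recipe. It is not the `rerankers` implementation: the helper names (`build_prompt`, `relevance_score`, `rank_indices`) and the logit values are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence

# Template copied from the usage example; everything else here is illustrative.
TEMPLATE = "Pergunta: {query} Documento: {text} Relevante:"


def build_prompt(query: str, text: str) -> str:
    """Fill the input template for one query-document pair."""
    return TEMPLATE.format(query=query, text=text)


def relevance_score(true_logit: float, false_logit: float) -> float:
    """Two-way softmax: probability mass assigned to the 'true' token."""
    shift = max(true_logit, false_logit)  # subtract the max for numerical stability
    e_true = math.exp(true_logit - shift)
    e_false = math.exp(false_logit - shift)
    return e_true / (e_true + e_false)


def rank_indices(scores: Sequence[float]) -> List[int]:
    """Return document indices sorted from most to least relevant."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical (true_logit, false_logit) pairs for two documents.
example_logits = [(-3.0, 1.8), (2.0, 0.5)]
scores = [relevance_score(t, f) for t, f in example_logits]
order = rank_indices(scores)  # document 1 outranks document 0 here
```

Because the softmax runs over two tokens only, each score lies strictly between 0 and 1, so scores are directly comparable across documents for the same query.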

For additional configurations and more advanced usage, consult the `rerankers` [GitHub repository](https://github.com/AnswerDotAI/rerankers).

## Citation

If you use our models, please cite:

```
@misc{piau2024ptt5v2,
      title={ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language},
      author={Marcos Piau and Roberto Lotufo and Rodrigo Nogueira},
      year={2024},
      eprint={2406.10806},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```