---
license: mit
datasets:
- unicamp-dl/mmarco
language:
- pt
tags:
- colbert
- ColBERT
---
**Disclaimer**: This model is based on a model trained for Brazilian Portuguese. Furthermore, mMARCO was translated from MS MARCO using Google Translate, which also tends to be biased towards Brazilian Portuguese, so the model might not perform as well on European Portuguese.

## Training

#### Details

The model is initialized from the [ricardoz/BERTugues-base-portuguese-cased](https://huggingface.co/ricardoz/BERTugues-base-portuguese-cased) model and fine-tuned on 10M triples via pairwise softmax cross-entropy loss over the computed scores of the positive and negative passages associated with a query. It was trained on a single Tesla A100 GPU with 40GB of memory for 200k steps with 10% warmup steps, using a batch size of 96 and the AdamW optimizer with a constant learning rate of 3e-06. Total training time was around 12 hours.
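
As a rough illustration of this loss (not the actual training code), a minimal PyTorch sketch is given below; `pos_scores` and `neg_scores` stand for the ColBERT relevance scores of the positive and negative passage of each triple in a batch, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def pairwise_softmax_ce(pos_scores: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    # Each row of `logits` holds the pair [score(q, d+), score(q, d-)] for one triple.
    logits = torch.stack([pos_scores, neg_scores], dim=1)  # (batch_size, 2)
    # The correct "class" is always index 0, i.e. the positive passage.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```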
#### Data

The model is fine-tuned on the Portuguese version of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of the MS MARCO dataset. The triples are sampled from the ~39.8M triples of [triples.train.small.tsv](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset).
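
Each line of that file is, roughly, a tab-separated (query, positive passage, negative passage) triple, so it can be streamed with a few lines of Python (a minimal sketch assuming the standard MS MARCO layout; the helper name is illustrative):

```python
def iter_triples(path: str):
    """Yield (query, positive_passage, negative_passage) tuples from a
    MS MARCO-style triples TSV, one tab-separated triple per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            query, positive, negative = line.rstrip("\n").split("\t")
            yield query, positive, negative
```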
## Evaluation

The model is evaluated on the smaller development set of mMARCO-pt, which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k).
| model                                              | Vocab.     | #Param. | Size  | MRR@10 | R@50  | R@1000 |
|:---------------------------------------------------|:-----------|--------:|------:|-------:|------:|-------:|
| **ColBERTv1.0-BERTugues-base-portuguese-mmarcoPT** | Portuguese |    110M | 440MB |  26.90 | 65.26 |  70.21 |
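
For reference, the two metrics can be computed as in the minimal sketch below (illustrative function names and input layout, not the evaluation code used to produce the table above).

```python
def mrr_at_k(first_relevant_ranks, k=10):
    """MRR@k. `first_relevant_ranks[i]` is the 1-based rank of the first relevant
    passage for query i, or None if no relevant passage was retrieved."""
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None and rank <= k:
            total += 1.0 / rank
    return total / len(first_relevant_ranks)

def recall_at_k(ranked_ids, relevant_ids, k):
    """R@k averaged over queries. `ranked_ids[i]` is the ranked list of passage ids
    returned for query i, `relevant_ids[i]` the set of relevant passage ids."""
    per_query = []
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        found = len(set(ranked[:k]) & relevant)
        per_query.append(found / len(relevant) if relevant else 0.0)
    return sum(per_query) / len(per_query)
```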