ColBERT-X for English-Chinese CLIR using Translate-Distill

CLIR Model Setting

  • Query language: English
  • Query length: 32 tokens max
  • Document language: Chinese
  • Document length: 180 tokens max (use MaxP to aggregate passage scores if needed)
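
For documents longer than the 180-token limit, MaxP scores each passage separately and takes the highest passage score as the document score. The sketch below is a minimal illustration, not part of PLAID-X: the helper names (split_into_passages, maxp_score), the non-overlapping windows, and the toy term-count scorer are all assumptions made for the example. In practice the tokens would come from the model's own tokenizer and the scores from the retrieval model.

```python
from typing import Callable, List, Sequence


def split_into_passages(tokens: Sequence[str], max_len: int = 180) -> List[List[str]]:
    """Split a tokenized document into non-overlapping windows of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(tokens[i:i + max_len]) for i in range(0, len(tokens), max_len)]


def maxp_score(
    query: Sequence[str],
    doc_tokens: Sequence[str],
    score_passage: Callable[[Sequence[str], Sequence[str]], float],
    max_len: int = 180,
) -> float:
    """MaxP: the document score is the maximum score over its passages."""
    passages = split_into_passages(doc_tokens, max_len)
    if not passages:
        raise ValueError("document has no tokens")
    return max(score_passage(query, passage) for passage in passages)


def toy_scorer(query: Sequence[str], passage: Sequence[str]) -> float:
    """Hypothetical stand-in scorer: counts passage tokens that appear in the query."""
    query_terms = set(query)
    return float(sum(1 for token in passage if token in query_terms))


# A 190-token document becomes two passages (180 + 10 tokens).
document = ["a"] * 180 + ["b"] * 10
doc_score = maxp_score(["b"], document, toy_scorer)
```

Overlapping windows are a common variation; non-overlapping windows are used here only to keep the example short.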

Model Description

Translate-Distill is a training technique that produces state-of-the-art CLIR dense retrieval models through translation and distillation. plaidx-large-zho-tdist-t53b-engeng is trained with a KL-divergence loss against scores from the t53b MonoT5 reranker, which was run on English MS MARCO training queries and English passages.
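
As a rough illustration of the distillation objective, the sketch below computes a KL divergence between the teacher's and the student's score distributions over the passages sampled for one query. It is a simplified pure-Python version under stated assumptions, not the PLAID-X training code: the function names are hypothetical, the scores are made up, and the direction KL(teacher || student) over softmax-normalized scores is assumed here.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one query's passage scores."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_norm for s in scores]


def kl_distill_loss(teacher_scores: Sequence[float], student_scores: Sequence[float]) -> float:
    """KL(teacher || student) between softmax distributions over the same passages."""
    if len(teacher_scores) != len(student_scores):
        raise ValueError("teacher and student must score the same passages")
    teacher_log = log_softmax(teacher_scores)
    student_log = log_softmax(student_scores)
    return sum(math.exp(t) * (t - s) for t, s in zip(teacher_log, student_log))


# Made-up scores for one query with nway = 6 passages.
teacher = [4.0, 2.5, 1.0, 0.5, -1.0, -2.0]
student = [3.0, 3.0, 0.0, 1.0, -0.5, -1.5]
loss = kl_distill_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two rankings diverge, so minimizing it pushes the student toward the reranker's relative preferences rather than its raw score values.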

Teacher Models:

Training Parameters

  • learning rate: 5e-6
  • update steps: 200,000
  • nway (number of passages per query): 6 (randomly selected from 50)
  • per-device batch size (number of query-passage sets): 8
  • training GPUs: 8 NVIDIA V100 GPUs with 32 GB memory each
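
To give a sense of scale, the arithmetic below derives the effective batch size from the parameters above. It assumes plain data-parallel training with one process per GPU and no gradient accumulation, which the list does not state, so the totals are estimates.

```python
# Values copied from the training parameters above.
per_device_batch = 8      # query-passage sets per GPU
num_gpus = 8
nway = 6                  # passages per query
update_steps = 200_000

# Assumption: one process per GPU, no gradient accumulation.
queries_per_step = per_device_batch * num_gpus      # 64 queries per update
passages_per_step = queries_per_step * nway         # 384 query-passage pairs per update
total_query_sets = queries_per_step * update_steps  # 12,800,000 query-passage sets overall
```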

Usage

To properly load ColBERT-X models from the Hugging Face Hub, please use the following version of PLAID-X.

pip install PLAID-X==0.3.1

The following code snippet loads the model through the Hugging Face API.

from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

# Load the checkpoint from the Hugging Face Hub with a default ColBERTConfig.
checkpoint = Checkpoint('hltcoe/plaidx-large-zho-tdist-t53b-engeng', colbert_config=ColBERTConfig())

For a full tutorial, please refer to the PLAID-X Jupyter Notebook, which is part of the SIGIR 2023 CLIR Tutorial.

BibTeX entry and Citation Info

Please cite the following two papers if you use the model.

@inproceedings{colbert-x,
    author = {Suraj Nair and Eugene Yang and Dawn Lawrie and Kevin Duh and Paul McNamee and Kenton Murray and James Mayfield and Douglas W. Oard},
    title = {Transfer Learning Approaches for Building Cross-Language Dense Retrieval Models},
    booktitle = {Proceedings of the 44th European Conference on Information Retrieval (ECIR)},
    year = {2022},
    url = {https://arxiv.org/abs/2201.08471}
}
@inproceedings{translate-distill,
    author = {Eugene Yang and Dawn Lawrie and James Mayfield and Douglas W. Oard and Scott Miller},
    title = {Translate-Distill: Learning Cross-Language Dense Retrieval by Translation and Distillation},
    booktitle = {Proceedings of the 46th European Conference on Information Retrieval (ECIR)},
    year = {2024},
    url = {https://arxiv.org/abs/2401.04810}
}