arxiv:2108.08787

Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval

Published on Aug 19, 2021

Abstract

We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations. The goal of this resource is to spur research in dense retrieval techniques in non-English languages, motivated by recent observations that existing techniques for representation learning perform poorly when applied to out-of-distribution data. As a starting point, we provide zero-shot baselines for this new dataset based on a multi-lingual adaptation of DPR that we call "mDPR". Experiments show that although the effectiveness of mDPR is much lower than BM25, dense representations nevertheless appear to provide valuable relevance signals, improving BM25 results in sparse-dense hybrids. In addition to analyses of our results, we also discuss future challenges and present a research agenda in multi-lingual dense retrieval. Mr. TyDi can be downloaded at https://github.com/castorini/mr.tydi.
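
The abstract does not say how the sparse-dense hybrids are formed. As a rough illustration only, one common approach is to normalize the BM25 and dense scores for each query and interpolate them linearly. The sketch below assumes that approach; the function names, the min-max normalization, and the weight `alpha` are hypothetical choices for illustration, not the authors' method.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so sparse and dense scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal; no ranking signal to preserve.
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(
    sparse: Dict[str, float],
    dense: Dict[str, float],
    alpha: float = 0.5,  # assumed weight on the sparse (BM25) score
) -> List[Tuple[str, float]]:
    """Fuse two per-query score maps by linear interpolation.

    A document missing from one list contributes 0 for that component.
    Returns (doc_id, fused_score) pairs sorted by descending score.
    """
    sparse_n = min_max_normalize(sparse)
    dense_n = min_max_normalize(dense)
    doc_ids = set(sparse_n) | set(dense_n)
    fused = {
        d: alpha * sparse_n.get(d, 0.0) + (1.0 - alpha) * dense_n.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Toy scores for a single query (made-up numbers, not benchmark results).
bm25 = {"d1": 10.0, "d2": 5.0, "d3": 0.0}
mdpr = {"d2": 0.9, "d3": 0.1, "d4": 0.5}
ranking = hybrid_scores(bm25, mdpr, alpha=0.5)
```

In this toy case, a document that is scored reasonably by both retrievers ("d2") can outrank the top BM25 hit ("d1"), which is the kind of complementary signal the abstract describes. In practice `alpha` would be tuned on held-out data.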
