Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
Abstract
We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations. The goal of this resource is to spur research in dense retrieval techniques in non-English languages, motivated by recent observations that existing techniques for representation learning perform poorly when applied to out-of-distribution data. As a starting point, we provide zero-shot baselines for this new dataset based on a multi-lingual adaptation of DPR that we call "mDPR". Experiments show that although the effectiveness of mDPR is much lower than BM25, dense representations nevertheless appear to provide valuable relevance signals, improving BM25 results in sparse-dense hybrids. In addition to analyses of our results, we also discuss future challenges and present a research agenda in multi-lingual dense retrieval. Mr. TyDi can be downloaded at https://github.com/castorini/mr.tydi.