arxiv:2012.14210

The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes

Published on Dec 28, 2020

Abstract

Information retrieval using dense low-dimensional representations has recently become popular and has been shown to outperform traditional sparse representations such as BM25. However, no previous work has investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance of dense representations decreases more quickly than that of sparse representations as the index size increases. In extreme cases, this can even lead to a tipping point where, at a certain index size, sparse representations outperform dense representations. We show that this behavior is tightly connected to the number of dimensions of the representations: the lower the dimension, the higher the chance of false positives, i.e., returning irrelevant documents.
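
The link between dimensionality, index size, and false positives can be illustrated with a small simulation. The sketch below is not the paper's experiment: it assumes that irrelevant documents behave like random unit vectors, and it measures the highest cosine similarity that any of them reaches against a random query. The function names and parameters are hypothetical and chosen only for this illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a vector uniformly from the unit sphere in `dim` dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def max_random_similarity(dim, index_size, rng):
    """Highest cosine similarity between one query and `index_size`
    random (i.e. irrelevant, by assumption) documents."""
    query = random_unit_vector(dim, rng)
    best = -1.0
    for _ in range(index_size):
        doc = random_unit_vector(dim, rng)
        # Both vectors have unit length, so the dot product is the cosine.
        best = max(best, sum(q * d for q, d in zip(query, doc)))
    return best


def mean_max_similarity(dim, index_size, trials=20, seed=0):
    """Average the best irrelevant score over several random queries."""
    rng = random.Random(seed)
    total = sum(max_random_similarity(dim, index_size, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for dim in (8, 128):
        for index_size in (10, 1000):
            score = mean_max_similarity(dim, index_size)
            print(f"dim={dim:4d} index_size={index_size:5d} mean max cosine={score:.3f}")
```

Under these assumptions, the best-scoring irrelevant document gets closer to the query as the index grows, and the effect is stronger in fewer dimensions. A relevant document with a fixed similarity to the query is therefore more likely to be outranked in a large, low-dimensional index.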