AI & ML interests

Information Retrieval

We present Spacerini, a modular framework for seamlessly building and deploying interactive search applications, designed to facilitate the qualitative analysis of large-scale research datasets.

In the current AI research landscape, billion-token textual corpora are widely used to pre-train large language models and conversational agents, which are then applied to a variety of downstream tasks. However, as is clear from both immediate community feedback and more principled research, the factuality and fairness of such models’ generations remain elusive, as models tend to hallucinate facts and memorize rather than abstract knowledge. To understand these failure modes, researchers often turn to the training data in search of the source of questionable model predictions.

Spacerini enables such qualitative analysis by integrating features from both the Pyserini toolkit and the Hugging Face ecosystem. Users can index their collections and deploy them as ad-hoc search engines, making the retrieval of relevant data points quick and efficient. The user-friendly interface allows users to search through massive datasets in a no-code fashion, making Spacerini broadly accessible to anyone looking to qualitatively audit their text collections. Spacerini can also be used by IR researchers who want to demonstrate the capabilities of their indices in a simple and interactive way.
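
To make the index-then-search workflow concrete, the following is a minimal, self-contained sketch of ad-hoc retrieval with BM25 scoring in plain Python. It is not Spacerini's or Pyserini's API: the class `TinyBM25Index`, the `tokenize` helper, the sample documents, and the parameter values `k1=0.9` and `b=0.4` are all assumptions chosen for illustration, and a real deployment would rely on a Lucene index built with Pyserini instead.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


class TinyBM25Index:
    """Illustrative in-memory inverted index with BM25 ranking."""

    def __init__(self, docs, k1=0.9, b=0.4):
        # k1 and b are illustrative BM25 parameters, not tuned values.
        self.k1, self.b = k1, b
        self.doc_len = {}
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        for doc_id, text in docs.items():
            tokens = tokenize(text)
            self.doc_len[doc_id] = len(tokens)
            for term, tf in Counter(tokens).items():
                self.postings[term][doc_id] = tf
        self.avg_len = sum(self.doc_len.values()) / max(len(self.doc_len), 1)

    def search(self, query, k=10):
        # Score every document that shares at least one term with the query.
        n_docs = len(self.doc_len)
        scores = defaultdict(float)
        for term in tokenize(query):
            plist = self.postings.get(term, {})
            if not plist:
                continue
            idf = math.log(1 + (n_docs - len(plist) + 0.5) / (len(plist) + 0.5))
            for doc_id, tf in plist.items():
                length_norm = 1 - self.b + self.b * self.doc_len[doc_id] / self.avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        # Return the top-k (doc_id, score) pairs, best first.
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Made-up sample collection, for illustration only.
docs = {
    "d1": "large language models hallucinate facts",
    "d2": "bm25 ranks documents by term overlap",
    "d3": "language models memorize training data",
}
index = TinyBM25Index(docs)
for doc_id, score in index.search("hallucinate facts"):
    print(doc_id, round(score, 3))
```

In this sketch the query only matches `d1`, so a researcher auditing a corpus would be pointed directly at the one document containing the terms of interest. The same pattern, backed by a proper index and a web interface, is what makes no-code inspection of large collections practical.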

The framework is open-sourced and available on GitHub: https://github.com/castorini/hf-spacerini.