Reliable, Reproducible, and Really Fast Leaderboards with Evalica
Abstract
The rapid advancement of natural language processing (NLP) technologies, such as instruction-tuned large language models (LLMs), calls for the development of modern evaluation protocols with human and machine feedback. We introduce Evalica, an open-source toolkit that facilitates the creation of reliable and reproducible model leaderboards. This paper presents its design, evaluates its performance, and demonstrates its usability through its Web interface, command-line interface, and Python API.
Community
Tired of waiting on slow leaderboard computations? Struggling to rank machine learning models quickly and accurately? Evalica is a Python library for fast, efficient, and correctly implemented ranking using methods such as Elo, Bradley-Terry, and average win rate (see the sketch below): https://github.com/dustalov/evalica.
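To give a feel for the API, here is a minimal sketch based on the project README. The exact entry points (`evalica.elo`, `evalica.bradley_terry`), the `Winner` enum, and the `.scores` attribute are assumptions to verify against the repository:

```python
# Hedged sketch of the Evalica Python API; names are taken from the project
# README and may differ in your installed version -- check the repository.
import evalica

# Pairwise comparisons: element i of xs was compared against element i of ys,
# and winners[i] records the outcome of that comparison.
xs = ["pizza", "burger", "pizza", "burger"]
ys = ["burger", "sushi", "sushi", "sushi"]
winners = [
    evalica.Winner.X,     # pizza beat burger
    evalica.Winner.Y,     # sushi beat burger
    evalica.Winner.X,     # pizza beat sushi
    evalica.Winner.Draw,  # burger and sushi tied
]

# Each ranking method returns a result object whose .scores is a pandas
# Series indexed by item name; higher scores indicate stronger items.
elo_result = evalica.elo(xs, ys, winners)
bt_result = evalica.bradley_terry(xs, ys, winners)

print(elo_result.scores.sort_values(ascending=False))
print(bt_result.scores.sort_values(ascending=False))
```

Assuming the methods share this (xs, ys, winners) call shape, switching from Elo to Bradley-Terry or another ranking method is a one-line change.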
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- GR-NLP-TOOLKIT: An Open-Source NLP Toolkit for Modern Greek (2024)
- Meaning Typed Prompting: A Technique for Efficient, Reliable Structured Output Generation (2024)
- Can Language Models Replace Programmers? REPOCOD Says 'Not Yet' (2024)
- Crystal: Illuminating LLM Abilities on Language and Code (2024)
- The First Prompt Counts the Most! An Evaluation of Large Language Models on Iterative Example-based Code Generation (2024)
- Does Prompt Formatting Have Any Impact on LLM Performance? (2024)
- PDL: A Declarative Prompt Programming Language (2024)