BiblioPage: A Dataset of Scanned Title Pages for Bibliographic Metadata Extraction Paper • 2503.19658 • Published 1 day ago • 1
SkyLadder: Better and Faster Pretraining via Context Window Scheduling Paper • 2503.15450 • Published 7 days ago • 11
InsectSet459: an open dataset of insect sounds for bioacoustic machine learning Paper • 2503.15074 • Published 8 days ago • 1
Brazilian legal datasets ⚖️ Collection A collection of data extracted from the courts of Brazil (and other legal websites) • 31 items • Updated 7 days ago • 2
Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru Paper • 2503.07587 • Published 16 days ago • 10
Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia Paper • 2503.07920 • Published 16 days ago • 95
JurisTCU: A Brazilian Portuguese Information Retrieval Dataset with Query Relevance Judgments Paper • 2503.08379 • Published 16 days ago • 2
EuroBERT: Scaling Multilingual Encoders for European Languages Paper • 2503.05500 • Published 19 days ago • 75
Article HuggingFace, IISc partner to supercharge model building on India's diverse languages 28 days ago • 17
rank1 Collection rank1 is the first test-time compute reasoning model in IR • 15 items • Updated 27 days ago • 3
OWLS: Scaling Laws for Speech Recognition and Translation Collection 🦉 A suite of Whisper-style models from 250M to 18B parameters. Trained on up to 360K hours of data. 16 kHz sampling rate. • 7 items • Updated 17 days ago • 4
Minions: Cost-efficient Collaboration Between On-device and Cloud Language Models Paper • 2502.15964 • Published Feb 21 • 1
"Actionable Help" in Crises: A Novel Dataset and Resource-Efficient Models for Identifying Request and Offer Social Media Posts Paper • 2502.16839 • Published about 1 month ago • 1
Slam Collection All resources for SpeechLMs from "Slamming: Training a Speech Language Model on One GPU in a Day". We provide tokeniser, lm, and datasets • 6 items • Updated 30 days ago • 13