Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models Paper • 2505.22232 • Published May 28 • 18
Fineweb2-Classifier Collection Training datasets for the fineweb2 classifier. • 5 items • Updated Apr 11
Test_Data_fineweb-edu Collection Test Data sampled from fineweb-edu and annotated by humans • 3 items • Updated Apr 11
GPT-SW3: An Autoregressive Language Model for the Nordic Languages Paper • 2305.12987 • Published May 22, 2023
Investigating Multilingual Instruction-Tuning: Do Polyglot Models Demand for Multilingual Instructions? Paper • 2402.13703 • Published Feb 21, 2024
Tokenizer Choice For LLM Training: Negligible or Crucial? Paper • 2310.08754 • Published Oct 12, 2023 • 2