zephyr story
sources mentioned by hf.co/thomwolf tweet: x.com/Thom_Wolf/status/1720503998518640703
HuggingFaceH4/zephyr-7b-beta • Text Generation • Updated 1 day ago • 1.29M • 1.59k
mistralai/Mistral-7B-v0.1 • Text Generation • Updated Jul 24 • 448k • 3.41k
stingning/ultrachat • Viewer • Updated Feb 22 • 774k • 1.96k • 416
openbmb/UltraFeedback • Viewer • Updated Dec 29, 2023 • 64k • 1.47k • 323
A little guide to building Large Language Models in 2024
Resources mentioned by @thomwolf in https://x.com/Thom_Wolf/status/1773340316835131757
Yi: Open Foundation Models by 01.AI • Paper • 2403.04652 • Published Mar 7 • 62
A Survey on Data Selection for Language Models • Paper • 2402.16827 • Published Feb 26 • 3
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research • Paper • 2402.00159 • Published Jan 31 • 59
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only • Paper • 2306.01116 • Published Jun 1, 2023 • 31