---
title: README
emoji: 👍
colorFrom: purple
colorTo: green
sdk: static
pinned: false
---

# HuggingFaceTB

This is the home of the SmolLM series of small language models and of high-quality pre-training datasets such as Cosmopedia and SmolLM-Corpus.

We released:

- FineWeb-Edu: a version of the FineWeb dataset filtered for educational content; paper available here.
- Cosmopedia: the largest open synthetic dataset, with 25B tokens and more than 30M samples. It contains synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1; blog post available here.
- SmolLM-Corpus: the pre-training corpus of the SmolLM models, combining Cosmopedia v0.2, FineWeb-Edu dedup, and Python-Edu; blog post available here.
- SmolLM and SmolLM2: two series of strong small models, each available in three sizes: 135M, 360M, and 1.7B parameters (a loading sketch follows this list).
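
All of the above are hosted on the Hugging Face Hub, so they can be pulled with the standard `datasets` and `transformers` libraries. The following is a minimal sketch rather than an official quickstart: the repository and config names used here (`HuggingFaceTB/smollm-corpus`, `cosmopedia-v2`, `HuggingFaceTB/SmolLM2-135M-Instruct`) are assumptions about the published IDs, so check the individual dataset and model cards for the exact ones.

```python
# Illustrative sketch, not from the org card: repo/config names below are
# assumptions; see the dataset and model cards for the published IDs.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stream one subset of SmolLM-Corpus instead of downloading it in full.
corpus = load_dataset(
    "HuggingFaceTB/smollm-corpus",  # assumed dataset repo ID
    "cosmopedia-v2",                # assumed config name for one subset
    split="train",
    streaming=True,
)
print(next(iter(corpus))["text"][:200])  # peek at the first sample ("text" field assumed)

# Run the smallest instruct checkpoint locally.
checkpoint = "HuggingFaceTB/SmolLM2-135M-Instruct"  # assumed model repo ID
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

messages = [{"role": "user", "content": "What is gravity?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Streaming avoids materializing the full corpus on disk, which matters at pre-training scale, and the 135M checkpoint is small enough to run on CPU.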

## News 🗞️

*Evaluation of SmolLM2 and other models on common benchmarks. For more details, refer to the model card.*

*Comparison of models finetuned on SmolTalk and Orca AgentInstruct 1M. For more details, refer to the dataset card.*