---
license: apache-2.0
datasets:
- PleIAs/common_corpus
language:
- en
- fr
- es
- de
- it
- la
- nl
- pl
---
**Pleias-nano-1b-Preview** is an early preview of a 1.21-billion-parameter base model trained by [Pleias](https://huggingface.co/PleIAs) with [Tracto AI](https://tracto.ai/) on [Common Corpus](https://huggingface.co/datasets/PleIAs/common_corpus).

Like all the base and specialized models from Pleias, Pleias-nano-1b-Preview has only been trained on open data that is out of copyright (public domain) or under a permissive license.

## Description

Pleias-nano-1b-Preview is a transformer base model, entirely pretrained from scratch, using an architecture similar to Llama/GPT-NeoX for easier deployment and inference.

It includes the following features, which would apply to any responsibly trained variant:
* Only trained on open data under a permissive license and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
* Extensive multilingual support for the main European languages.
* A new tokenizer designed for enhanced document processing tasks and better multilingual support.
* Extremely low level of toxicity and problematic content.

Pleias-nano-1b-Preview has demonstrated unusual abilities for multilingual generation in its size range. Fully supported languages include English, French, Spanish, German, Italian, Dutch, Latin and Portuguese.

Given its size, Pleias-nano-1b-Preview can run on CPU without any compression loss. We provide a first GGUF variant as part of our release.

## Recommended use

As a base model, Pleias-nano-1b-Preview is only able to run continuation prompts.

Text generation currently supports a range of creative writing tasks in multiple European languages. For more consistent results, we recommend using a low or zero temperature with a slight repetition penalty (1.2).

Pleias-nano-1b-Preview has been successfully adapted for continued pretraining and full fine-tuning on document processing tasks such as RAG, translation or OCR correction.
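The decoding settings recommended above can be sketched with the Hugging Face `transformers` library. This is a minimal sketch, not an official snippet: the repository id `PleIAs/Pleias-nano-1b-Preview` is an assumption (the text does not state it), the `max_new_tokens` cap is illustrative, and the helper names `generation_settings` and `continue_text` are hypothetical. A "null" temperature is expressed here as greedy decoding (`do_sample=False`).

```python
# Hypothetical repository id: the model card text does not state it.
MODEL_ID = "PleIAs/Pleias-nano-1b-Preview"


def generation_settings(max_new_tokens=200):
    """Return decoding settings that follow the recommendation above."""
    return {
        "do_sample": False,                # greedy decoding, i.e. zero temperature
        "repetition_penalty": 1.2,         # the slight repetition penalty suggested above
        "max_new_tokens": max_new_tokens,  # illustrative cap, not from the model card
    }


def continue_text(prompt, model_id=MODEL_ID):
    """Continue `prompt` with the base model (plain continuation, no chat template)."""
    # Imported lazily so the settings helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Tokenize the prompt and generate a continuation with the settings above.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **generation_settings())
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `continue_text("The history of the printing press begins")` would download the weights on first use and return the prompt followed by its continuation. Because this is a base model, the prompt should read as the start of a document rather than as an instruction.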
Given the small size of the model, we do not recommend fine-tuning methods based on LoRA.

## Example

## Training

Pleias-nano-1b-Preview was fully pretrained on TractoAI, on the ISEG GPU cluster by Nebius AI, using 192 H100s for 5 days. The pretraining code relied on [the fork of Nanotron developed by TractoAI](https://github.com/tractoai/nanotron). We provide the complete settings as a YAML file as part of our release.

The training schedule includes 518,000 steps (batch size 1,024) over three epochs (nearly 5 trillion tokens):
* A lightly filtered version of Common Corpus (1.6 trillion tokens).
* A filtered and enhanced version of Common Corpus (1,086,324,736,000 tokens).
* A repeat of the previous set.

## Update

Pleias-nano-1b-Preview is currently released as an early preview.

The model will undergo several more rounds of post-training to enhance reasoning capacities and fine-tunability, as well as in anticipation of a generalist instruct version.