---
license: apache-2.0
datasets:
- PleIAs/common_corpus
language:
- en
- fr
- es
- de
- it
- la
- nl
- pl
---
**Pleias-1b-Preview** is an early preview of a 1.21 billion parameter base model trained by Pleias with Tracto AI on Common Corpus.

Like all the base and specialized models from Pleias, Pleias-1b-Preview has only been trained on open data that is either out of copyright (public domain) or under a permissive license.

## Description
Pleias-1b-Preview is a transformer base model, pretrained entirely from scratch, using an architecture similar to Llama/GPT-NeoX for easier deployment and inference.

It includes the following features, which would apply to any responsibly trained variant:
* Trained only on open data under a permissive license and in compliance with the European AI Act. By design, all Pleias models are unable to output copyrighted content.
* Extensive multilingual support for the main European languages.
* A new tokenizer designed for enhanced document processing tasks and better multilingual support.
* Extremely low levels of toxicity and problematic content.

Pleias-1b-Preview has demonstrated unusual abilities for multilingual generation in its size range. Fully supported languages include English, French, Spanish, German, Italian, Dutch, Latin and Portuguese.

Given its size, Pleias-1b-Preview can run on CPU without any compression loss. We provide a first GGUF variant as part of our release.
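
Below is a minimal sketch of a CPU run with llama-cpp-python against the GGUF variant; the local file name, context size and thread count are illustrative assumptions, not values taken from this card.

```python
# CPU-only inference sketch using llama-cpp-python and the GGUF variant.
# The file name below is hypothetical: replace it with the GGUF file
# actually shipped in the release.
from llama_cpp import Llama

llm = Llama(
    model_path="pleias-1b-preview.gguf",  # hypothetical local file name
    n_ctx=2048,                           # assumed context window
    n_threads=4,                          # number of CPU threads to use
)

out = llm(
    "The history of the printing press begins",
    max_tokens=64,
    temperature=0.0,     # low/null temperature, per the recommendations below
    repeat_penalty=1.2,  # slight repetition penalty
)
print(out["choices"][0]["text"])
```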

## Recommended use
As a base model, Pleias-1b-Preview is only able to run continuation prompts.

Text generation currently supports a range of creative writing tasks in multiple European languages. For more consistent results we recommend using a low or null temperature with a slight repetition penalty (1.2), as in the sketch below.
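
A minimal continuation-prompt sketch with transformers follows; the Hub repository id is an assumption based on the model name, and the generation settings mirror the recommendation above (greedy decoding, repetition penalty of 1.2).

```python
# Continuation-prompt sketch; the repository id is assumed, not confirmed here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PleIAs/Pleias-1b-Preview"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "La Bibliothèque nationale de France conserve"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,          # "null" temperature: plain greedy decoding
    repetition_penalty=1.2,   # slight repetition penalty, as recommended
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```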

Pleias-1b-Preview has been successfully adapted, through continuous pretraining and full fine-tuning, to document processing tasks such as RAG, translation or OCR correction. Given the small size of the model, we do not recommend fine-tuning methods based on LoRA; a full fine-tuning sketch follows.
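
The sketch below outlines a plain full fine-tune with the transformers Trainer; the dataset, hyperparameters, sequence length and repository id are all illustrative assumptions rather than settings from this card.

```python
# Full fine-tuning sketch (no LoRA). Every concrete value here is an
# assumption to make the example runnable, not a recommendation from the card.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "PleIAs/Pleias-1b-Preview"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # padding fallback for training
model = AutoModelForCausalLM.from_pretrained(model_id)

# Hypothetical plain-text corpus with one document per line.
dataset = load_dataset("text", data_files={"train": "train.txt"})["train"]

def tokenize(batch):
    # 2,048 is an assumed maximum sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="pleias-1b-finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,  # use bfloat16 if the hardware supports it
    logging_steps=50,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```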

## Example

## Training
Pleias-1b-Preview was fully pretrained with Tracto AI on 192 H100s for 5 days. Pretraining code relied on Nanotron, the Hugging Face library. We provide the complete settings as a yaml file as part of our release.

The training schedule includes 518,000 steps (batch size 1,024) over three epochs (nearly 5 trillion tokens); a back-of-envelope token count follows the list:
* A lightly filtered version of Common Corpus (1.6 trillion tokens)
* A filtered and enhanced version of Common Corpus (1,086,324,736,000 tokens).
* A repeat of the previous set.
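
Assuming a 2,048-token sequence length (a figure not stated on this card), the step count and batch size above reproduce the token figure listed for the filtered corpus pass:

```python
# Back-of-envelope token count; the sequence length is an assumption.
steps = 518_000
batch_size = 1_024  # sequences per optimizer step
seq_len = 2_048     # assumed tokens per sequence

print(f"{steps * batch_size * seq_len:,}")  # 1,086,324,736,000
```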

## Update
Pleias-1b-Preview is currently released as an early preview.

The model will undergo several more rounds of post-training to enhance reasoning capacities and fine-tunability, as well as in anticipation of a generalist instruct version.