---
license: apache-2.0
pipeline_tag: text-generation
language:
- it
- en
tags:
- pretrained
datasets:
- uonlp/CulturaX
- HuggingFaceFW/fineweb
- bigcode/the-stack-v2
inference:
  parameters:
    temperature: 0.5
    do_sample: true
widget:
- text: 'La capitale dell''Italia è '
  example_title: Example 1
- text: 'Nel mezzo del cammin di nostra vita '
  example_title: Example 2
- text: 'Una cena senza vino è come '
  example_title: Example 3
---

<div style="text-align: center; display: flex; flex-direction: column; align-items: center;">
<img src="https://cdn-uploads.huggingface.co/production/uploads/5f0b462819cb630495b814d7/DVA4MnFUs3UHBnTrX9jG6.png" style="max-width: 550px; height: auto;">
</div>

# Model Card for Minerva-7B-base-v1.0
Minerva is the first family of **LLMs pretrained from scratch on Italian**, developed by [Sapienza NLP](https://nlp.uniroma1.it)
in collaboration with [Future Artificial Intelligence Research (FAIR)](https://fondazione-fair.it/) and [CINECA](https://www.cineca.it/).
Notably, the Minerva models are truly open (data and model) Italian-English LLMs, with approximately half of the pretraining data
consisting of Italian text.

* [Minerva LLMs - website](https://nlp.uniroma1.it/minerva/)

## Description
This is the model card for **Minerva-7B-base-v1.0**, a 7-billion-parameter model trained on 2.2 trillion tokens (1 trillion in Italian,
1 trillion in English, and 200 billion in code).

This model is part of the Minerva LLM family:

* [Minerva-350M-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-350M-base-v1.0)
* [Minerva-1B-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-1B-base-v1.0)
* [Minerva-3B-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0)
* [Minerva-7B-base-v1.0](https://huggingface.co/sapienzanlp/Minerva-7B-base-v1.0)

## 🚨⚠️🚨 Bias, Risks, and Limitations 🚨⚠️🚨
*This section identifies foreseeable harms and misunderstandings.*

This is a foundation model that has not undergone any alignment. The model may:

- Overrepresent some viewpoints and underrepresent others
- Contain stereotypes
- Contain [personal information](#personal-data-and-information)
- Generate:
  - Racist and sexist content
  - Hateful, abusive, or violent language
  - Discriminatory or prejudicial language
  - Content that may not be appropriate for all settings, including sexual content
- Make errors, including producing incorrect information or historical claims as if they were factual
- Generate irrelevant or repetitive outputs

We are aware of the biases and potential problematic/toxic content that current pretrained large language models exhibit: more specifically, as probabilistic models of (Italian and English) languages, they reflect and amplify the biases of their training data.
For more information about this issue, please refer to our survey:
* [Biases in Large Language Models: Origins, Inventory, and Discussion](https://dl.acm.org/doi/full/10.1145/3597307)

## How to use Minerva with Hugging Face transformers

```python
import transformers
import torch

model_id = "sapienzanlp/Minerva-7B-base-v1.0"

# Initialize the pipeline.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Input text for the model.
input_text = "La capitale dell'Italia è"

# Compute the outputs.
output = pipeline(
    input_text,
    max_new_tokens=128,
)

# Output:
# [{'generated_text': "La capitale dell'Italia è la città di Roma, che si trova a [...]"}]
```
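
If you prefer working with the model and tokenizer directly rather than through the pipeline (for example, to control the sampling parameters yourself), the following is a minimal sketch using the standard `AutoModelForCausalLM` / `AutoTokenizer` interface; the sampling values mirror the widget settings above and are illustrative, not a recommended configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sapienzanlp/Minerva-7B-base-v1.0"

# Load the tokenizer and the model weights (bfloat16 to reduce memory usage).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Tokenize the prompt and move it to the model device.
inputs = tokenizer("La capitale dell'Italia è", return_tensors="pt").to(model.device)

# Sample a continuation; temperature and do_sample mirror the widget settings above.
output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.5,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```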

## Model Architecture

Minerva-7B-base-v1.0 is a Transformer model based on the Mistral architecture, in which the number of layers, the number of heads, and the hidden state dimension have been modified to reach 7B parameters.
Please take a look at the configuration file for a detailed breakdown of the hyperparameters we chose for this model.
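
For a quick look at those hyperparameters without downloading the weights, you can load the configuration with `transformers` (a minimal sketch; the field names follow the Mistral configuration class):

```python
from transformers import AutoConfig

# Load the model configuration only (no weights are downloaded).
config = AutoConfig.from_pretrained("sapienzanlp/Minerva-7B-base-v1.0")

# A few of the architecture hyperparameters (Mistral-style field names).
print(config.num_hidden_layers)    # number of Transformer layers
print(config.hidden_size)          # hidden state dimension
print(config.num_attention_heads)  # attention heads
print(config.num_key_value_heads)  # KV heads (grouped-query attention)
```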

The Minerva LLM family is composed of:

| Model Name | Tokens | Layers | Hidden Size | Attention Heads | KV Heads | Sliding Window | Max Context Length |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Minerva-350M-base-v1.0 | 70B (35B it + 35B en) | 16 | 1152 | 16 | 4 | 2048 | 16384 |
| Minerva-1B-base-v1.0 | 200B (100B it + 100B en) | 16 | 2048 | 16 | 4 | 2048 | 16384 |
| Minerva-3B-base-v1.0 | 660B (330B it + 330B en) | 32 | 2560 | 32 | 8 | 2048 | 16384 |
| Minerva-7B-base-v1.0 | 2.2T (1T it + 1T en + 200B code) | 32 | 4096 | 32 | 8 | None | 4096 |

## Model Training

Minerva-7B-base-v1.0 was trained using [llm-foundry 0.8.0](https://github.com/riccorl/llm-foundry) from [MosaicML](https://mosaicml.com/). The hyperparameters used are the following:

| Model Name | Optimizer | lr | betas | eps | weight decay | Scheduler | Warmup Steps | Batch Size (Tokens) | Total Steps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Minerva-350M-base-v1.0 | Decoupled AdamW | 2e-4 | (0.9, 0.95) | 1e-8 | 0.0 | Cosine | 2% | 4M | 16,690 |
| Minerva-1B-base-v1.0 | Decoupled AdamW | 2e-4 | (0.9, 0.95) | 1e-8 | 0.0 | Cosine | 2% | 4M | 47,684 |
| Minerva-3B-base-v1.0 | Decoupled AdamW | 2e-4 | (0.9, 0.95) | 1e-8 | 0.0 | Cosine | 2% | 4M | 157,357 |
| Minerva-7B-base-v1.0 | AdamW | 3e-4 | (0.9, 0.95) | 1e-5 | 0.1 | Cosine | 2000 | 4M | 591,558 |
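
For illustration only, the 7B row above (AdamW, lr 3e-4, betas (0.9, 0.95), eps 1e-5, weight decay 0.1, cosine schedule with 2,000 warmup steps over 591,558 steps) corresponds roughly to the PyTorch setup sketched below. This is a hedged sketch of the configuration, not the actual training code, which used llm-foundry.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder module: the real run used llm-foundry, not this script.
model = torch.nn.Linear(8, 8)

# AdamW with the Minerva-7B hyperparameters from the table above.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,
    betas=(0.9, 0.95),
    eps=1e-5,
    weight_decay=0.1,
)

# Cosine decay with 2,000 linear warmup steps over 591,558 total steps.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,
    num_training_steps=591_558,
)
```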

## Model Evaluation

We assessed our model using the [LM-Evaluation-Harness](https://github.com/EleutherAI/lm-evaluation-harness) library, a comprehensive framework for testing generative language models across a wide range of evaluation tasks.

All the reported benchmark data was already present in the LM-Evaluation-Harness suite.
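
For reference, an evaluation along these lines can be launched through the harness's Python API roughly as follows. This is a sketch only: the exact task names, few-shot settings, and API surface depend on the harness version, and the tasks shown are illustrative rather than the precise setup used here.

```python
import lm_eval

# Run a small evaluation with the Hugging Face backend of the harness.
# Task names and few-shot counts are illustrative, not the exact setup used here.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sapienzanlp/Minerva-7B-base-v1.0,dtype=bfloat16",
    tasks=["hellaswag", "arc_challenge"],
    num_fewshot=5,
)
print(results["results"])
```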

**Italian** Data:
| Task | Accuracy |
| --- | --- |
<!-- | [xcopa](https://huggingface.co/datasets/xcopa) (0-shot) | 0.694 |
| [Hellaswag](https://huggingface.co/datasets/alexandrainst/m_hellaswag) (5-shot) | 0.5293 |
| [Belebele](https://huggingface.co/datasets/facebook/belebele) (5-shot) | 0.2333 |
| [TruthfulQA MC 1](https://huggingface.co/datasets/alexandrainst/m_truthfulqa) (0-shot) | 0.2363 |
| [TruthfulQA MC 2](https://huggingface.co/datasets/alexandrainst/m_truthfulqa) (0-shot) | 0.3731 |
| [M MMLU](https://huggingface.co/datasets/alexandrainst/m_mmlu) (5-shot) | 0.2612 |
| [arc challenge](https://huggingface.co/datasets/alexandrainst/m_arc) (5-shot) | 0.3268 | -->

**English** Data:
| Task | Accuracy |
| --- | --- |
<!-- | [Hellaswag](https://huggingface.co/datasets/Rowan/hellaswag) (5-shot) | 0.6168 |
| [piqa](https://huggingface.co/datasets/piqa) (5-shot) | 0.7535 |
| [sciq](https://huggingface.co/datasets/sciq) (5-shot) | 0.925 |
| [Belebele](https://huggingface.co/datasets/facebook/belebele) (5-shot) | 0.2278 |
| [TruthfulQA MC 1](https://huggingface.co/datasets/truthful_qa) (0-shot) | 0.2142 |
| [TruthfulQA MC 2](https://huggingface.co/datasets/truthful_qa) (0-shot) | 0.3643 |
| [M MMLU](https://huggingface.co/datasets/alexandrainst/m_mmlu) (5-shot) | 0.263 |
| [arc challenge](https://huggingface.co/datasets/allenai/ai2_arc) (5-shot) | 0.3319 |
| [arc easy](https://huggingface.co/datasets/allenai/ai2_arc) (5-shot) | 0.6540 | -->

## Training Data

<!-- Minerva-7B-base-v1.0 was trained on 1T Italian tokens and 1T English tokens sampled from CulturaX.

We have extracted some statistics on Italian (115B tokens) and English (210B tokens) documents from CulturaX on the selected sources:

*Proportion of number of tokens per domain (Italian)*
<img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_it.png?raw=true" alt="italian-tok-counts" border="0" width="1800px">

*Proportion of number of tokens per domain (English)*
<img src="https://github.com/Andrew-Wyn/images/blob/master/minerva/top_25_url_tokens_proportion_culturax_en.png?raw=true" alt="english-tok-counts" border="0" width="1800px"> -->

## Tokenizer Fertility

The tokenizer fertility measures the average number of tokens produced per tokenized word.
A tokenizer displaying high fertility values in a particular language typically indicates that it segments words in that language extensively.
The tokenizer fertility is closely tied to the inference speed of the model for a specific language,
as higher values mean longer sequences of tokens to generate and thus lower inference speed.

**Fertility computed over a sample of CulturaX (CX) data and Wikipedia (Wp):**

| Model | Voc. Size | Fertility IT (CX) | Fertility EN (CX) | Fertility IT (Wp) | Fertility EN (Wp) |
| --- | --- | --- | --- | --- | --- |
| Mistral-7B-v0.1 | 32000 | 1.87 | 1.32 | 2.05 | 1.57 |
| gemma-7b | 256000 | 1.42 | 1.18 | 1.56 | 1.34 |
| Minerva-3B-base-v1.0 | 32768 | 1.39 | 1.32 | 1.66 | 1.59 |
| Minerva-7B-base-v1.0 | 51200 | - | - | - | - |
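
As a rough illustration, per-word fertility can be estimated as sketched below. This is a minimal sketch: the whitespace-based word split and the sample sentence are assumptions for illustration, not the exact protocol used to produce the table above.

```python
from transformers import AutoTokenizer

def fertility(tokenizer, text: str) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    n_words = len(text.split())
    n_tokens = len(tokenizer(text, add_special_tokens=False)["input_ids"])
    return n_tokens / n_words

tokenizer = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-7B-base-v1.0")
sample_it = "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura."
print(f"Fertility on the Italian sample: {fertility(tokenizer, sample_it):.2f}")
```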

## Notice

Minerva-7B-base-v1.0 is a pretrained base model and, therefore, has no moderation mechanisms.

## The Sapienza NLP Team

* **Riccardo Orlando:** data preprocessing, model training
* **Pere-Lluis Huguet Cabot:** data preprocessing, vocabulary, evaluation
* **Luca Moroni:** data curation, data analysis, downstream tasks, evaluation
* **Simone Conia:** data curation, evaluation, project supervision
* **Edoardo Barba:** data preprocessing, downstream tasks, project supervision
* **Roberto Navigli:** project coordinator

### Special thanks for their support
* Giuseppe Fiameni, Nvidia
* Sergio Orlandini, CINECA

## Acknowledgments

This work was funded by the PNRR MUR project [PE0000013-FAIR](https://fondazione-fair.it).
We acknowledge the [CINECA](https://www.cineca.it) award "IscB_medit" under the ISCRA initiative, for the availability of high-performance computing resources and support.