---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
language:
- bg
- ca
- code
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- gl
- hr
- hu
- it
- lt
- lv
- mt
- nl
- nn
- "no"
- oc
- pl
- pt
- ro
- ru
- sh
- sk
- sl
- sr
- sv
- uk
base_model:
- BSC-LT/salamandra-2b
---

![](./images/salamandra_header.png)

# Salamandra Model Card (Aina Hack)

Salamandra is a highly multilingual model pre-trained from scratch that comes in three sizes (2B, 7B and 40B parameters), each with base and instruction-tuned variants. 
This model card corresponds to the 2B instruction-tuned version built specifically for [AinaHack](https://projecteaina.cat/ainahack/), 
an event launched by the Generalitat de Catalunya to create AI tools for the Catalan administration.

To visit the model cards of other Salamandra versions, please refer to the [Model Index](#model-index).

The entire Salamandra family is released under a permissive [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
Along with the open weights, all training scripts and configuration files are made publicly available in [this GitHub repository](https://github.com/langtech-bsc/salamandra).

> [!WARNING]
> **DISCLAIMER:** This model is a first proof-of-concept designed to demonstrate the instruction-following capabilities of recently released base models.
> It has been optimized to engage in conversation but has *NOT* been aligned through RLHF to filter or avoid sensitive topics.
> As a result, it may generate harmful or inappropriate content.
> The team is actively working to enhance its performance through further instruction and alignment with RL techniques.

---

## Model Details

### Description

Transformer-based decoder-only language model that has been pre-trained from scratch on 7.8 trillion tokens of highly curated data.
The pre-training corpus contains text in 35 European languages and code.

### Hyperparameters

The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).

### Architecture

|                         |               |
|-------------------------|:--------------|
| Total Parameters        | 2,253,490,176 |
| Embedding Parameters    | 524,288,000   |
| Layers                  | 24            |
| Hidden size             | 2,048         |
| Attention heads         | 16            |
| Context length          | 8,192         |
| Vocabulary size         | 256,000       |
| Precision               | bfloat16      |
| Embedding type          | RoPE          |
| Activation Function     | SwiGLU        |
| Layer normalization     | RMS Norm      |
| Flash attention         | ✅            |
| Grouped Query Attention | ❌            |
| Num. query groups       | N/A           |

---

## Intended Use

### Direct Use

The models are intended for both research and commercial use in any of the languages included in the training data. 
The base models are intended either for language generation or to be further fine-tuned for specific use-cases. 
The instruction-tuned variants can be used as general-purpose assistants, as long as the user is fully aware of the model’s limitations.

### Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. 
Any downstream application must comply with current laws and regulations. 
Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged. 

---

## Hardware and Software

### Training Framework

Pre-training was conducted using NVIDIA’s [NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html), 
which leverages PyTorch Lightning for efficient model training in highly distributed settings.

The instruction-tuned versions were produced with [FastChat](https://github.com/lm-sys/FastChat).

### Compute Infrastructure

All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and
operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x NVIDIA Hopper GPUs with 64 GB of HBM2 memory each
- 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz with 32 cores each (64 cores in total)
- 4x NDR200 (bandwidth per node: 800 Gb/s)
- 512 GB of main memory (DDR5)
- 460 GB of NVMe storage

|Model|Nodes|GPUs|
|:---:|:---:|:---:|
|2B|64|256|
|7B|128|512|
|40B|256 / 512|1,024 / 2,048|
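
The GPU counts in the table follow from the node specification above, with four GPUs per node. A minimal check, using only figures from this section (the dictionary layout is just for illustration):

```python
# Each accelerated node has 4 GPUs (see the node specification above).
GPUS_PER_NODE = 4

# Node counts per model size, copied from the table; 40B lists two configurations.
nodes_per_model = {"2B": [64], "7B": [128], "40B": [256, 512]}

gpus_per_model = {
    model: [nodes * GPUS_PER_NODE for nodes in node_counts]
    for model, node_counts in nodes_per_model.items()
}

for model, gpu_counts in gpus_per_model.items():
    print(model, gpu_counts)
```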

---

## How to use

The instruction-following models use the commonly adopted ChatML template:

```jinja
{%- if not date_string is defined %}{%- set date_string = "2024-09-30" %}{%- endif %}{%- set system_message = messages[0].content if messages[0].role == "system" else "system message. Today Date: "+ date_string -%}{%- if messages[0].role == "system" -%}{%- set messages = messages[1:] -%}{%- endif -%}{{ "<|im_start|>system\n" + system_message + "<|im_end|>\n" }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
```
Where `system_message` is used to guide the model during generation and `date_string` can be set to allow the model to respond with the current date.

The exact same chat template should be used for an enhanced conversational experience.
The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the following snippet.

```python
from datetime import datetime

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-2b-instruct-aina-hack"

text = "At what temperature does water boil?"

# Load the tokenizer and the model weights in bfloat16.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# A conversation is a list of {"role", "content"} turns.
messages = [{"role": "user", "content": text}]
date_string = datetime.today().strftime("%Y-%m-%d")

# Render the ChatML prompt; date_string is forwarded to the chat template.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string,
)

# The template already contains the special tokens, so none are added here.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Using this template, each turn is preceded by a `<|im_start|>` delimiter and the role of the entity 
(either `user`, for content supplied by the user, or `assistant` for LLM responses), and finished with the `<|im_end|>` token.
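
To make the resulting prompt layout concrete without downloading the model, the following pure-Python sketch mirrors the logic of the Jinja template shown above. The function name `render_chatml` is hypothetical and is not part of the tokenizer API; in practice `tokenizer.apply_chat_template` should be preferred. Unlike the Jinja template, the sketch also tolerates an empty message list.

```python
def render_chatml(messages, add_generation_prompt=True, date_string="2024-09-30"):
    """Illustrative re-implementation of the chat template (hypothetical helper)."""
    # Use the first message as the system prompt if it has the "system" role,
    # otherwise fall back to the template's default, which embeds the date.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        messages = messages[1:]
    else:
        system_message = "system message. Today Date: " + date_string

    prompt = "<|im_start|>system\n" + system_message + "<|im_end|>\n"

    # Every turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n.
    for message in messages:
        prompt += (
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>\n"
        )

    # Open an assistant turn so that generation continues from there.
    if add_generation_prompt:
        prompt += "<|im_start|>assistant\n"
    return prompt


example = render_chatml(
    [{"role": "user", "content": "At what temperature does water boil?"}],
    date_string="2024-10-01",
)
print(example)
```

For a single user turn, the rendered prompt consists of a system block, the user block, and an opened assistant block that the model is expected to complete.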

---

## Data

### Pretraining Data

The training corpus consists of 2.4 trillion tokens covering 35 European languages and 92 programming languages, amounting to 33 TB of pre-processed text. 
Languages were sampled manually: Spain's co-official languages (Spanish, Catalan, Galician and Basque) were oversampled by a factor of 2, code was undersampled by half, 
and the remaining languages were kept as is, resulting in the following distribution:

![lang distrib](./images/corpus_languages.png)
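
The sampling rule described above (2x for Spain's co-official languages, 0.5x for code, 1x otherwise) can be sketched as a simple reweighting. The raw token counts below are hypothetical placeholders chosen only to show the mechanics; they are not the real corpus statistics, and the function name `resample` is likewise illustrative.

```python
# Sampling factors taken from the description above; any language not
# listed here keeps a factor of 1.0.
SAMPLING_FACTOR = {"es": 2.0, "ca": 2.0, "gl": 2.0, "eu": 2.0, "code": 0.5}


def resample(raw_tokens):
    """Return each language's share of the corpus after reweighting."""
    weighted = {
        lang: count * SAMPLING_FACTOR.get(lang, 1.0)
        for lang, count in raw_tokens.items()
    }
    total = sum(weighted.values())
    return {lang: weight / total for lang, weight in weighted.items()}


# Hypothetical raw token counts (arbitrary units), for illustration only.
raw = {"en": 100, "ca": 10, "code": 40}
shares = resample(raw)

for lang, share in shares.items():
    print(f"{lang}: {share:.1%}")
```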

This highly multilingual corpus is predominantly composed of data from Colossal OSCAR, 
which contributes a significant 66.06% of the total tokens. 
Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%. 
The next largest sources are French FR at 3.12% and Proof Pile at 1.98%. 
Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing between 1.3% and 1.5%. 
These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
The remaining 10% comes from smaller sources in various languages.

The model was trained for 3 epochs over the 2.4-trillion-token corpus, followed by two final rounds of 0.3 trillion higher-quality tokens each, 
meaning that the total number of tokens seen during pre-training amounts to roughly 7.8 trillion.
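
The token budget can be checked with a few lines of arithmetic. The sketch assumes that each of the two final rounds covers 0.3 trillion tokens, which is the reading under which three epochs over the 2.4-trillion-token corpus add up to the stated total of roughly 7.8 trillion.

```python
BILLION = 10**9

unique_tokens = 2_400 * BILLION          # size of the training corpus
epochs = 3                               # full passes over the corpus
final_rounds = 2                         # extra rounds on higher-quality data
tokens_per_final_round = 300 * BILLION   # assumption: 0.3 trillion per round

total_tokens = epochs * unique_tokens + final_rounds * tokens_per_final_round

print(f"total tokens seen: {total_tokens:,}")
```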

### Finetuning Data

This instruction-tuned variant has been trained with a mixture of 276k English, Spanish, and Catalan multi-turn instructions gathered from open datasets:

| Dataset               | ca     | en     | es     |
|-----------------------|:------:|:------:|:------:|
| alpaca-cleaned        | -      | 50,000 | -      |
| aya-dataset           | -      | 3,944  | 3,854  |
| CoQCat                | 4,797  | -      | -      |
| databricks-dolly-15k  | -      | 15,011 | -      |
| dolly-3k-ca           | 3,232  | -      | -      |
| flores-instr          | 1,994  | 1,994  | 3,988  |
| MentorCA              | 7,122  | -      | -      |
| MentorES              | -      | -      | 7,122  |
| no-robots             | -      | 9,499  | -      |
| oasst-ca              | 2,518  | -      | -      |
| oasst2                | 750    | 31,086 | 15,438 |
| open-orca             | -      | 50,000 | -      |
| RagMultilingual       | 16,043 | 14,997 | 11,263 |
| tower-blocks          | -      | 19,895 | 2,000  |
| **Total** | **36,456** | **196,426** | **43,665** |
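
The per-language totals and the overall figure of about 276k instructions can be reproduced from the rows of the table. In the sketch below, a dash in the table is treated as zero, which is an assumption about how the table should be read.

```python
# (ca, en, es) instruction counts per dataset, copied from the table above.
# Assumption: "-" in the table means the dataset has no data for that language.
counts = {
    "alpaca-cleaned": (0, 50_000, 0),
    "aya-dataset": (0, 3_944, 3_854),
    "CoQCat": (4_797, 0, 0),
    "databricks-dolly-15k": (0, 15_011, 0),
    "dolly-3k-ca": (3_232, 0, 0),
    "flores-instr": (1_994, 1_994, 3_988),
    "MentorCA": (7_122, 0, 0),
    "MentorES": (0, 0, 7_122),
    "no-robots": (0, 9_499, 0),
    "oasst-ca": (2_518, 0, 0),
    "oasst2": (750, 31_086, 15_438),
    "open-orca": (0, 50_000, 0),
    "RagMultilingual": (16_043, 14_997, 11_263),
    "tower-blocks": (0, 19_895, 2_000),
}

# Sum each language column, then all three columns together.
ca_total, en_total, es_total = (sum(column) for column in zip(*counts.values()))
grand_total = ca_total + en_total + es_total

print(f"ca={ca_total:,} en={en_total:,} es={es_total:,} total={grand_total:,}")
```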


---

## Ethical Considerations and Limitations

We examine the presence of undesired societal and cognitive biases in this model using different benchmarks. For societal biases, we test performance using the BBQ dataset (Parrish et al., 2022) in the original English and the Regard dataset (Sheng et al., 2019). We report that, while the model achieves moderate accuracies (between 0.5 and 0.6 depending on the social group) in disambiguated settings, it performs very poorly in ambiguous settings. Taken together, these results suggest the pervasiveness of social biases that may have an effect on task performance.

Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings. For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe significant but moderately weak primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers. We measure majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects, with a small effect size. This suggests that the model is relatively robust against the examined cognitive biases.

We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources in all languages present in the training data. We aim to gradually extend and expand our analyses in future work.

These results can be expected from a model that has undergone only a preliminary instruction tuning. These tests are performed in order to show the biases the model may contain. We urge developers to take them into account and perform safety testing and tuning tailored to their specific applications of the model.

---

## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright(c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### Funding
This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).

This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU 
within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

### Acknowledgements

This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support. 

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At the national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria. 

At the international level, we thank the Welsh government, DFKI, the Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially to Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been especially appreciated throughout the entire process.

Their valuable efforts have been instrumental in the development of this work.

### Disclaimer
Be aware that the model may contain biases or other unintended distortions. 
When third parties deploy systems or provide services based on this model, or use the model themselves, 
they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, 
including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

### Citation

Technical report and paper coming soon.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Model Index
|Model|Base|Instruct|
|:---:|:---:|:---:|
|2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
|7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
|40B| WiP | WiP |