---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
language:
- bg
- ca
- code
- cs
- cy
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- gl
- hr
- hu
- it
- lt
- lv
- mt
- nl
- nn
- "no"
- oc
- pl
- pt
- ro
- ru
- sh
- sk
- sl
- sr
- sv
- uk
base_model:
- BSC-LT/salamandra-2b
---

![](./images/salamandra_header.png)

# Salamandra Model Card (Aina Hack)

Salamandra is a highly multilingual model pre-trained from scratch that comes in three 
sizes (2B, 7B and 40B parameters), each with a base and an instruction-tuned variant. 
This model card corresponds to the 2B instruction-tuned version built specifically for [AinaHack](https://projecteaina.cat/ainahack/), 
an event launched by the Generalitat de Catalunya to create AI tools for the Catalan administration.

To visit the model cards of other Salamandra versions, please refer to the [Model Index](#model-index).

The entire Salamandra family is released under a permissive [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
Along with the open weights, all training scripts and configuration files are made publicly available in [this GitHub repository](https://github.com/langtech-bsc/salamandra).

> [!WARNING]
> **DISCLAIMER:** This model is a first proof-of-concept designed to demonstrate the instruction-following capabilities of recently released base models.
> It has been optimized to engage in conversation but has *NOT* been aligned through RLHF to filter or avoid sensitive topics.
> As a result, it may generate harmful or inappropriate content.
> The team is actively working to enhance its performance through further instruction and alignment with RL techniques.

---

## Model Details

### Description

Transformer-based decoder-only language model that has been pre-trained from scratch on 7.8 trillion tokens of highly curated data.
The pre-training corpus contains text in 35 European languages and code.

### Hyperparameters

The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).

### Architecture

|                         |               |
|-------------------------|:--------------|
| Total Parameters        | 2,253,490,176 |
| Embedding Parameters    | 524,288,000   |
| Layers                  | 24            |
| Hidden size             | 2,048         |
| Attention heads         | 16            |
| Context length          | 8,192         |
| Vocabulary size         | 256,000       |
| Precision               | bfloat16      |
| Embedding type          | RoPE          |
| Activation Function     | SwiGLU        |
| Layer normalization     | RMS Norm      |
| Flash attention         | ✅            |
| Grouped Query Attention | ❌            |
| Num. query groups       | N/A           |
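
Some of the figures in the table can be cross-checked against each other. The sketch below uses only values from the table; it assumes that the embedding parameters correspond to a single vocabulary-by-hidden-size matrix and that the hidden size is split evenly across attention heads (the per-head dimension is not stated in this card).

```python
# Values copied from the architecture table above.
total_params = 2_253_490_176
vocab_size = 256_000
hidden_size = 2_048
attention_heads = 16

# Assumption: one embedding matrix of shape (vocab_size, hidden_size).
embedding_params = vocab_size * hidden_size

# Assumption: the hidden size is divided evenly across the attention heads.
head_dim = hidden_size // attention_heads

# Share of the total parameter count taken up by the embedding matrix.
embedding_share = embedding_params / total_params

print(f"{embedding_params:,}")   # 524,288,000
print(head_dim)                  # 128
print(f"{embedding_share:.1%}")  # 23.3%
```

With a 256,000-token vocabulary, the embedding matrix accounts for a little under a quarter of the 2B model's parameters.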

---

## Intended Use

### Direct Use

The models are intended for both research and commercial use in any of the languages included in the training data. 
The base models are intended either for language generation or to be further fine-tuned for specific use-cases. 
The instruction-tuned variants can be used as general-purpose assistants, as long as the user is fully aware of the model’s limitations.

### Out-of-scope Use

The model is not intended for malicious activities, such as harming others or violating human rights. 
Any downstream application must comply with current laws and regulations. 
Irresponsible usage in production environments without proper risk assessment and mitigation is also discouraged. 

---

## Hardware and Software

### Training Framework

Pre-training was conducted using NVIDIA’s [NeMo Framework](https://docs.nvidia.com/nemo-framework/index.html), 
which leverages PyTorch Lightning for efficient model training in highly distributed settings.

The instruction-tuned versions were produced with [FastChat](https://github.com/lm-sys/FastChat).

### Compute Infrastructure

All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/marenostrum-5), a pre-exascale EuroHPC supercomputer hosted and
operated by Barcelona Supercomputing Center.

The accelerated partition is composed of 1,120 nodes with the following specifications:
- 4x Nvidia Hopper GPUs with 64 GB of HBM2 memory
- 2x Intel Sapphire Rapids 8460Y+ at 2.3 GHz with 32 cores each (64 cores)
- 4x NDR200 (800 Gb/s bandwidth per node)
- 512 GB of main memory (DDR5)
- 460 GB of NVMe storage

|Model|Nodes|GPUs|
|:---:|:---:|:---:|
|2B|64|256|
|7B|128|512|
|40B|256 / 512|1,024 / 2,048|
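
The GPU column follows from the node counts and the four GPUs per node listed in the specification. A minimal sketch of that arithmetic, using only numbers from this section:

```python
GPUS_PER_NODE = 4  # from the node specification above

# Node counts from the table; the 40B model lists two configurations.
nodes_per_model = {"2B": [64], "7B": [128], "40B": [256, 512]}

gpus_per_model = {
    model: [nodes * GPUS_PER_NODE for nodes in node_counts]
    for model, node_counts in nodes_per_model.items()
}

print(gpus_per_model)
# {'2B': [256], '7B': [512], '40B': [1024, 2048]}
```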

---

## How to use

The instruction-following models use the commonly adopted ChatML template:

```jinja
{%- if not date_string is defined %}{%- set date_string = "2024-09-30" %}{%- endif %}{%- set system_message = messages[0].content if messages[0].role == "system" else "system message. Today Date: "+ date_string -%}{%- if messages[0].role == "system" -%}{%- set messages = messages[1:] -%}{%- endif -%}{{ "<|im_start|>system\n" + system_message + "<|im_end|>\n" }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
```
Where `system_message` is used to guide the model during generation and `date_string` can be set to allow the model to respond with the current date.

The exact same chat template should be used for an enhanced conversational experience.
The easiest way to apply it is by using the tokenizer's built-in functions, as shown in the following snippet.

```python
from datetime import datetime

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "BSC-LT/salamandra-2b-instruct-aina-hack"

text = "At what temperature does water boil?"

# Load the tokenizer and the model weights in bfloat16.
# device_map="auto" requires the `accelerate` package.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# A conversation is a list of {"role", "content"} dictionaries.
messages = [{"role": "user", "content": text}]
date_string = datetime.today().strftime("%Y-%m-%d")

# Render the ChatML prompt as text; date_string is forwarded to the template.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string,
)

# The template already inserts the special tokens, so do not add them again.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Using this template, each turn is preceded by a `<|im_start|>` delimiter and the role of the entity 
(either `user`, for content supplied by the user, or `assistant` for LLM responses), and finished with the `<|im_end|>` token.
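
To see what the template produces without loading the model, the sketch below re-implements it in plain Python. The function name `render_chatml` is hypothetical and is not part of `transformers`; the logic is a hand translation of the Jinja template above, intended for inspection only. The one deliberate difference is a guard for an empty message list, which the template itself does not handle.

```python
DEFAULT_DATE = "2024-09-30"  # default date_string used by the template


def render_chatml(messages, date_string=DEFAULT_DATE, add_generation_prompt=False):
    """Hand-written equivalent of the chat template (hypothetical helper)."""
    # A leading system message replaces the default system prompt.
    if messages and messages[0]["role"] == "system":
        system_message = messages[0]["content"]
        messages = messages[1:]
    else:
        system_message = "system message. Today Date: " + date_string

    prompt = "<|im_start|>system\n" + system_message + "<|im_end|>\n"

    # Every remaining turn is wrapped in <|im_start|>role ... <|im_end|>.
    for message in messages:
        prompt += (
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>\n"
        )

    # An open assistant turn tells the model that it should answer next.
    if add_generation_prompt:
        prompt += "<|im_start|>assistant\n"
    return prompt


conversation = [{"role": "user", "content": "At what temperature does water boil?"}]
print(render_chatml(conversation, date_string="2024-10-01", add_generation_prompt=True))
# <|im_start|>system
# system message. Today Date: 2024-10-01<|im_end|>
# <|im_start|>user
# At what temperature does water boil?<|im_end|>
# <|im_start|>assistant
```

In practice, `tokenizer.apply_chat_template` remains the recommended way to build prompts, since it always uses the template shipped with the model.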

---

## Data

### Pretraining Data

The training corpus consists of 2.4 trillion tokens covering 35 European languages and 92 programming languages. It amounts to a total of 33TB of pre-processed text. 
Sampling weights were set manually: Spain's co-official languages (Spanish, Catalan, Galician and Basque) were oversampled by a factor of two, code was undersampled by half, 
and the rest of the languages were kept as is, resulting in the following distribution:

![lang distrib](./images/corpus_languages.png)
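
The sampling rule described above can be sketched as follows. The weights (x2 for the co-official languages, x0.5 for code, x1 otherwise) come from the text; the token counts are made-up placeholders, not the real corpus statistics, and the function names are hypothetical.

```python
OVERSAMPLED = {"es", "ca", "gl", "eu"}  # Spain's co-official languages


def sampling_weight(lang):
    """Weight applied to a language, following the rule described above."""
    if lang in OVERSAMPLED:
        return 2.0
    if lang == "code":
        return 0.5
    return 1.0


def sampled_distribution(raw_tokens):
    """Share of each language in the corpus after re-weighting."""
    weighted = {lang: n * sampling_weight(lang) for lang, n in raw_tokens.items()}
    total = sum(weighted.values())
    return {lang: w / total for lang, w in weighted.items()}


# Made-up raw token counts (in billions), for illustration only.
toy_counts = {"en": 400.0, "es": 100.0, "ca": 25.0, "code": 200.0, "de": 100.0}
dist = sampled_distribution(toy_counts)

for lang, share in dist.items():
    print(f"{lang}: {share:.1%}")
```

In this toy example, Spanish starts with a quarter of the English token count but ends up with half of its share, while code drops from twice the German count to the same share as German.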

This highly multilingual corpus is predominantly composed of data from Colossal OSCAR, 
which contributes a significant 66.06% of the total tokens. 
Following this, Starcoder provides 11.91%, and Spanish Crawling adds 3.34%. 
The next largest sources are French FR at 3.12% and Proof Pile at 1.98%. 
Other notable contributions include Macocu, Pile of Law, and Eurlex, each contributing between 1.3% and 1.5%. 
These major sources collectively form the bulk of the corpus, ensuring a rich and diverse dataset for training the language model.
The remaining 10% comes from smaller sources in various languages.

The model was trained for 3 epochs over the 2.4 trillion-token corpus, with two final rounds of 0.3 trillion higher-quality tokens each, 
meaning that the total number of tokens seen during pre-training amounts to roughly 7.8 trillion tokens.

### Finetuning Data

This instruction-tuned variant has been trained with a mixture of 276k English, Spanish, and Catalan multi-turn instructions gathered from open datasets:

| Dataset               | ca     | en     | es     |
|-----------------------|:------:|:------:|:------:|
| alpaca-cleaned        | -      | 50,000 | -      |
| aya-dataset           | -      | 3,944  | 3,854  |
| CoQCat                | 4,797  | -      | -      |
| databricks-dolly-15k  | -      | 15,011 | -      |
| dolly-3k-ca           | 3,232  | -      | -      |
| flores-instr          | 1,994  | 1,994  | 3,988  |
| MentorCA              | 7,122  | -      | -      |
| MentorES              | -      | -      | 7,122  |
| no-robots             | -      | 9,499  | -      |
| oasst-ca              | 2,518  | -      | -      |
| oasst2                | 750    | 31,086 | 15,438 |
| open-orca             | -      | 50,000 | -      |
| RagMultilingual       | 16,043 | 14,997 | 11,263 |
| tower-blocks          | -      | 19,895 | 2,000  |
| **Total** | **36,456** | **196,426** | **43,665** |
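
The per-language totals and the overall figure of roughly 276k instructions can be reproduced from the table. The sketch below copies the counts as `(ca, en, es)` tuples, with `-` entered as 0:

```python
# (ca, en, es) instruction counts copied from the table above; "-" is 0.
counts = {
    "alpaca-cleaned":       (0,      50_000, 0),
    "aya-dataset":          (0,      3_944,  3_854),
    "CoQCat":               (4_797,  0,      0),
    "databricks-dolly-15k": (0,      15_011, 0),
    "dolly-3k-ca":          (3_232,  0,      0),
    "flores-instr":         (1_994,  1_994,  3_988),
    "MentorCA":             (7_122,  0,      0),
    "MentorES":             (0,      0,      7_122),
    "no-robots":            (0,      9_499,  0),
    "oasst-ca":             (2_518,  0,      0),
    "oasst2":               (750,    31_086, 15_438),
    "open-orca":            (0,      50_000, 0),
    "RagMultilingual":      (16_043, 14_997, 11_263),
    "tower-blocks":         (0,      19_895, 2_000),
}

# Sum each language column, then add the columns together.
totals = [sum(column) for column in zip(*counts.values())]

print(totals)       # [36456, 196426, 43665]
print(sum(totals))  # 276547
```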


---

## Ethical Considerations and Limitations

We examine the presence of undesired societal and cognitive biases in this model using different benchmarks. For societal biases, we test performance using the BBQ dataset (Parrish et al., 2022) in the original English and the Regard dataset (Sheng et al., 2019). While accuracies are moderate (between 0.5 and 0.6, depending on the social group) in disambiguated settings, the model performs very poorly in ambiguous settings. Taken together, these results suggest the pervasiveness of social biases that may have an effect on task performance.

Our cognitive bias analysis focuses on positional effects in 0-shot settings and majority class bias in few-shot settings. For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe significant but weak-to-moderate primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers. We measure majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects, with a small effect size. This suggests that the model is relatively robust against the examined cognitive biases.
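
As an illustration of what a primacy effect looks like in multiple-choice data, the sketch below computes how often each answer position is picked. It is not the evaluation procedure used for this model: the helper name and the predictions are made up, and the comparison with a uniform baseline only makes sense if the position of the correct answer is balanced across questions.

```python
from collections import Counter


def position_pick_rates(picked_positions, num_options):
    """Fraction of questions on which each answer position was chosen."""
    picks = Counter(picked_positions)
    total = len(picked_positions)
    return [picks.get(i, 0) / total for i in range(num_options)]


# Made-up predictions (0 = first listed answer) for ten four-option questions.
toy_picks = [0, 0, 1, 0, 2, 0, 3, 1, 0, 2]
rates = position_pick_rates(toy_picks, num_options=4)

print(rates)  # [0.5, 0.2, 0.2, 0.1]

# With balanced answer positions, an unbiased model would pick each slot
# about 25% of the time; a clear excess on the first slot hints at primacy.
uniform_rate = 1 / 4
print(rates[0] > uniform_rate)  # True
```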

We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources in all languages present in the training data. We aim to gradually extend and expand our analyses in future work.

These results can be expected from a model that has undergone only preliminary instruction tuning. The tests are performed to expose the biases the model may contain. We urge developers to take them into account and to perform safety testing and tuning tailored to their specific applications of the model.

---

## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright (c) 2024 by Language Technologies Unit, Barcelona Supercomputing Center.

### Funding
This work has been promoted and financed by the Government of Catalonia through the [Aina Project](https://projecteaina.cat/).

This work is funded by the _Ministerio para la Transformación Digital y de la Función Pública_ - Funded by EU – NextGenerationEU 
within the framework of [ILENIA Project](https://proyectoilenia.es/) with reference 2022/TL22/00215337.

### Acknowledgements

This project has benefited from the contributions of numerous teams and institutions, mainly through data contributions, knowledge transfer or technical support. 

In Catalonia, many institutions have been involved in the project. Our thanks to Òmnium Cultural, Parlament de Catalunya, Institut d'Estudis Aranesos, Racó Català, Vilaweb, ACN, Nació Digital, El món and Aquí Berguedà.

At the national level, we are especially grateful to our ILENIA project partners: CENID, HiTZ and CiTIUS for their participation. We also extend our genuine gratitude to the Spanish Senate and Congress, Fundación Dialnet, Fundación Elcano and the ‘Instituto Universitario de Sistemas Inteligentes y Aplicaciones Numéricas en Ingeniería (SIANI)’ of the University of Las Palmas de Gran Canaria.

At the international level, we thank the Welsh government, DFKI, the Occiglot project, especially Malte Ostendorff, and The Common Crawl Foundation, especially Pedro Ortiz, for their collaboration. We would also like to give special thanks to the NVIDIA team, with whom we have met regularly, especially to Ignacio Sarasua, Adam Henryk Grzywaczewski, Oleg Sudakov, Sergio Perez, Miguel Martinez, Felipes Soares and Meriem Bendris. Their constant support has been greatly appreciated throughout the entire process.

Their valuable efforts have been instrumental in the development of this work.

### Disclaimer
Be aware that the model may contain biases or other unintended distortions. 
When third parties deploy systems or provide services based on this model, or use the model themselves, 
they bear the responsibility for mitigating any associated risks and ensuring compliance with applicable regulations, 
including those governing the use of Artificial Intelligence.

The Barcelona Supercomputing Center, as the owner and creator of the model, shall not be held liable for any outcomes resulting from third-party use.

### Citation

Technical report and paper coming soon.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Model Index
|Model|Base|Instruct|
|:---:|:---:|:---:|
|2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
|7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
|40B| WiP | WiP |