Update README.md
README.md
CHANGED
@@ -48,16 +48,14 @@ language:
 >
 > The weights will be promptly updated as soon as the training process is complete.
 
-#
+# ALIA-40b Model Card
 
-
-sizes — 2B, 7B and 40B parameters — with their respective base and instruction-tuned variants.
-This model card corresponds to the 40B base version.
+ALIA-40b is a highly multilingual model pre-trained from scratch that will come with its respective base and instruction-tuned variants.
 
-To visit the model cards of other
+To visit the model cards of other ALIA versions, please refer to the [Model Index](#model-index).
 
-
-Along with the open weights, all training scripts and configuration files are made publicly available in [this GitHub repository](https://github.com/langtech-bsc/
+This model is released under a permissive [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0).
+Along with the open weights, all training scripts and configuration files are made publicly available in [this GitHub repository](https://github.com/langtech-bsc/alia).
 
 ---
 
@@ -70,7 +68,7 @@ The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
 
-The full list of hyperparameters
+The full list of hyperparameters can be found [here](https://github.com/langtech-bsc/alia/blob/main/configs/bsc_40b.yaml).
 
 ### Architecture
 
@@ -145,7 +143,7 @@ This section offers examples of how to perform inference using various methods.
 You'll find different techniques for running inference, including Huggingface's Text Generation Pipeline, multi-GPU configurations, and vLLM for scalable and efficient generation.
 
 #### Inference with Huggingface's Text Generation Pipeline
-The Huggingface Text Generation Pipeline provides a straightforward way to run inference using the
+The Huggingface Text Generation Pipeline provides a straightforward way to run inference using the ALIA-40b model.
 
 ```bash
 pip install transformers torch accelerate sentencepiece protobuf
@@ -156,7 +154,7 @@ pip install transformers torch accelerate sentencepiece protobuf
 ```python
 from transformers import pipeline, set_seed
 
-model_id = "BSC-LT/
+model_id = "BSC-LT/ALIA-40b"
 
 # Sample prompts
 prompts = [
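The diff context cuts the pipeline snippet off after the prompt list. For orientation only, here is a minimal, self-contained sketch of how a complete call with the new model ID could look; the prompts, dtype, and generation settings are illustrative assumptions, not text from the README:

```python
import torch
from transformers import pipeline, set_seed

model_id = "BSC-LT/ALIA-40b"

# Sample prompts (illustrative)
prompts = [
    "El mercat del barri és",
    "The future of renewable energy in Europe",
]

# Build a text-generation pipeline; device_map="auto" shards the 40B weights
# across the available GPUs and bfloat16 keeps the memory footprint manageable.
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

set_seed(1)  # reproducible sampling

for prompt in prompts:
    outputs = generator(prompt, max_new_tokens=50, do_sample=True, temperature=0.7)
    print(outputs[0]["generated_text"])
```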
@@ -202,7 +200,7 @@ pip install transformers torch accelerate sentencepiece protobuf
 from transformers import AutoTokenizer, AutoModelForCausalLM
 import torch
 
-model_id = "BSC-LT/
+model_id = "BSC-LT/ALIA-40b"
 
 # Input text
 text = "El mercat del barri és"
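Again, only the top of the multi-GPU snippet is visible in the diff. A hedged sketch of how the rest of such a load-and-generate flow typically looks with `AutoModelForCausalLM`; the device placement and generation arguments are assumptions, not taken from the card:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "BSC-LT/ALIA-40b"

# Input text
text = "El mercat del barri és"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights to fit the 40B model
    device_map="auto",           # spread layers across all visible GPUs
)

# Tokenize on the model's device and generate a short greedy completion
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=25, do_sample=False)

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```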
@@ -246,7 +244,7 @@ pip install vllm
 ```python
 from vllm import LLM, SamplingParams
 
-model_id = "BSC-LT/
+model_id = "BSC-LT/ALIA-40b"
 
 # Sample prompts
 prompts = [
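The vLLM snippet is likewise truncated by the diff context. A sketch of how the remainder of a vLLM generation script commonly looks; `tensor_parallel_size` and the sampling values are placeholders to adapt to your hardware, not values from the README:

```python
from vllm import LLM, SamplingParams

model_id = "BSC-LT/ALIA-40b"

# Sample prompts (illustrative)
prompts = [
    "El mercat del barri és",
    "The capital of Galicia is",
]

# Sampling configuration; tune to taste
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=50)

# tensor_parallel_size splits the 40B model across GPUs; adjust to your setup
llm = LLM(model=model_id, tensor_parallel_size=4)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"{output.prompt!r} -> {output.outputs[0].text!r}")
```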
@@ -444,7 +442,7 @@ We provide an extense Datasheet section following the best practices defined by
 
 **For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.**
 
-The purpose of creating this dataset is to pre-train
+The purpose of creating this dataset is to pre-train a family of multilingual models with high performance in a large number of
 European languages (35) and code (including 92 different programming languages). In addition, we aim to represent especially the co-official
 languages of Spain: Spanish, Catalan, Galician, and Basque. This is the reason why we carry out an oversampling of these languages.
 
@@ -630,7 +628,7 @@ and the [Ungoliant](https://github.com/oscar-project/ungoliant) pipeline was use
 
 **Has the dataset been used for any tasks already? If so, please provide a description.**
 
-Pre-train the Salamandra model family.
+Pre-train the ALIA model and the Salamandra model family.
 
 **What (other) tasks could the dataset be used for?**
 
@@ -1066,4 +1064,4 @@ Technical report coming soon.
 |:---:|:---:|:---:|
 |2B| [Link](https://huggingface.co/BSC-LT/salamandra-2b) | [Link](https://huggingface.co/BSC-LT/salamandra-2b-instruct) |
 |7B| [Link](https://huggingface.co/BSC-LT/salamandra-7b) | [Link](https://huggingface.co/BSC-LT/salamandra-7b-instruct) |
-|40B| [Link](https://huggingface.co/BSC-LT/
+|40B| [Link](https://huggingface.co/BSC-LT/ALIA-40b) | WiP |