Update README.md
README.md CHANGED
@@ -7,11 +7,11 @@ language:
- vi
license: llama3
---

# Llama3 8B CPT SEA-LIONv2 Instruct

SEA-LION is a collection of Large Language Models (LLMs) which have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.

Llama3 8B CPT SEA-LIONv2 Instruct is a multilingual model which has been fine-tuned with around **100,000 English instruction-completion pairs** alongside a smaller pool of around **50,000 instruction-completion pairs** from other ASEAN languages, such as Indonesian, Thai and Vietnamese. These instructions have been carefully curated and rewritten to ensure that the model was trained on truly open, commercially permissive and high-quality datasets.

SEA-LION stands for _Southeast Asian Languages In One Network_.

@@ -20,15 +20,15 @@ SEA-LION stands for _Southeast Asian Languages In One Network_.
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Languages:** English, Indonesian, Thai, Vietnamese, Tamil
- **License:** [Llama3 Community License](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)

## Model Details

### Base model
We performed instruction tuning in English and also in ASEAN languages such as Indonesian, Thai and Vietnamese on our [continued pre-trained Llama3 CPT 8B SEA-LIONv2](https://huggingface.co/aisingapore/llama3-8b-cpt-sealionv2-base), a decoder model using the Llama3 architecture, to create Llama3 8B CPT SEA-LIONv2 Instruct.
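
Because the instruct model uses the standard Llama3 decoder architecture, it can be loaded with the Hugging Face `transformers` library. The snippet below is a minimal sketch only: the repository id (`aisingapore/llama3-8b-cpt-sealionv2-instruct`) is inferred from the base-model link above and the prompt is an arbitrary example; neither is taken from this card.

```python
# Minimal usage sketch (assumed repository id, arbitrary example prompt).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aisingapore/llama3-8b-cpt-sealionv2-instruct"  # assumption: actual repo id may differ

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Llama3-style instruct models ship a chat template; apply_chat_template builds the prompt.
messages = [{"role": "user", "content": "Apa sentimen dari kalimat berikut ini? Saya suka makanan ini."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```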

### Benchmark Performance
We evaluated Llama3 8B CPT SEA-LIONv2 Instruct on both general language capabilities and instruction-following capabilities.

#### General Language Capabilities
For the evaluation of general language capabilities, we employed the [BHASA evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.

@@ -121,10 +121,10 @@ Current SEA-LION models, including this commercially permissive release, have no
## Technical Specifications
### Fine-Tuning Details
Llama3 8B CPT SEA-LIONv2 Instruct was fine-tuned on 8x A100-40GB GPUs using parameter-efficient fine-tuning (PEFT) in the form of LoRA.
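
This card does not state the LoRA hyperparameters, so the following is only an illustrative sketch of what LoRA-based parameter-efficient fine-tuning looks like with the `peft` library; the rank, alpha, dropout and target modules are placeholder values, not the configuration used for this model.

```python
# Illustrative LoRA fine-tuning setup with peft (placeholder hyperparameters).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_id = "aisingapore/llama3-8b-cpt-sealionv2-base"  # the continued pre-trained base model

model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=16,            # placeholder rank
    lora_alpha=32,   # placeholder scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical Llama3 attention projections
    task_type="CAUSAL_LM",
)

# Wrap the base model so that only the low-rank adapter weights are trainable.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The training loop itself (e.g. transformers.Trainer or trl's SFTTrainer over the
# instruction-completion pairs) is omitted from this sketch.
```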

## Data
Llama3 8B CPT SEA-LIONv2 Instruct was trained on a wide range of instructions that were manually and stringently verified by our team. A large portion of the effort was dedicated to ensuring that each instruction-completion pair seen by the model is of high quality; any erroneous pairs were corrected and rewritten by native speakers, or else dropped from our mix.

In addition, special care was taken to ensure that the datasets used had commercially permissive licenses, verified against the original data source.