# SEA-LION

SEA-LION is a collection of LLMs which has been pretrained and instruct-tuned for the South-East Asia (SEA) region.
The models range from 3 billion to 7 billion parameters.
This is the card for the SEA-LION 7B model.

SEA-LION stands for <i>South-East Asia Languages In One Network</i>.


## Model Details

### Model Description

The SEA-LION model is a significant leap forward in the field of natural language processing and understanding,
specifically trained to understand South-East Asia (SEA) regional context.

SEA-LION is built on the robust MPT architecture and utilize a vocabulary size of 256K.

The model employs our proprietary SEABPETokenizer for tokenization.
Our SEABPETokenizer is specially tailored for SEA languages, ensuring optimal model performance.

The training data for SEA-LION encompasses 980B tokens.

- **Developed by:** Products Pillar, AI Singapore
- **Funded by:** Singapore NRF
- **Model type:** Decoder
- **Language(s) (NLP):** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
- **License:** MIT License


## Training Details

### Data

SEA-LION was trained on 980B tokens of the following data:

| Data Source               | Tokens | Percentage |
|---------------------------|-------:|:----------:|
| RefinedWeb - English      | 571.3B |     62.80% |
| mC4 - Chinese             |  91.2B |     10.03% |
| mC4 - Indonesian          |   3.6B |      0.40% |
| mC4 - Malay               |   0.7B |      0.08% |
| mC4 - Filipino            |   1.3B |      0.15% |
| mC4 - Burmese             |   1.2B |      0.13% |
| mC4 - Vietnamese          |  63.4B |      6.97% |
| mC4 - Thai                |  10.8B |      1.19% |
| mC4 - Lao                 |   0.3B |      0.03% |
| mC4 - Khmer               |   0.9B |      0.11% |
| mC4 - Tamil               |   2.5B |      0.28% |
| the Stack - Python        |  20.9B |      2.30% |
| the Stack - Javascript    |  55.6B |      6.11% |
| the Stack - Shell         |   1.3B |      0.14% |
| the Stack - SQL           |   6.4B |      0.70% |
| the Stack - Markdown      |  26.6B |      2.91% |
| RedPajama - StackExchange |  21.2B |      2.33% |
| RedPajama - ArXiv         |  30.6B |      3.35% |


### Infrastructure

SEA-LION was trained using [MosaicML Composer](https://github.com/mosaicml/composer)
on the following hardware:

| Training Details     | SEA-LION 7B  |
|----------------------|:------------:|
| AWS EC2 p4d.24xlarge | 32 instances |
| Nvidia A100 40GB GPU | 256          |
| Training Duration    | 22 days      |


### Configuration

| HyperParameter    | SEA-LION 7B        |
|-------------------|:------------------:|
| Precision         | bfloat16           |
| Optimizer         | decoupled_adamw    |
| Scheduler         | cosine_with_warmup |
| Learning Rate     | 6.0e-5             |
| Global Batch Size | 2048               |
| Micro Batch Size  | 4                  |


## Technical Specifications

### Model Architecture and Objective

SEA-LION is a decoder model using the MPT architecture.

| Parameter       | SEA-LION 7B |
|-----------------|:-----------:|
| Layers          | 32          |
| d_model         | 4096        |
| head_dim        | 32          |
| Vocabulary      | 256000      |
| Sequence Length | 2048        |


### Tokenizer Details

We sample 20M lines from the training data to train the tokenizer.<br>
The framework for training is [SentencePiece](https://github.com/google/sentencepiece).<br>
The tokenizer type is Byte-Pair Encoding (BPE).


## The Team

Hamsawardhini Rengarajan<br>
Lam Zhiwen Clarence<br>
Leong Weiqi<br>
Li Yier<br>
Liu Darius<br>
Lovenia Holy<br>
Ng Raymond<br>
Ngui Jian Gang<br>
Ong Tat-Wee David<br>
Railey Montalan<br>
Tai Ngee Chia<br>
Tan Choon Meng<br>
Thanh Ngan Nguyen<br>
Teo Jin Howe<br>
Teo Wei Yi<br>
William Tjhi<br>
Yeo Yeow Tong<br>
Yong Xianbin<br>
Yosephine<br>
Leslie Teo<br>

## Contact

For more info, please contact us at seallm@aisingapore.org