---
license: apache-2.0
language:
- ind
datasets:
- uonlp/CulturaX
tags:
- t5
---
|
|
|
## IndoNanoT5 Base |
|
|
|
IndoNanoT5 Base is an Indonesian sequence-to-sequence language model based on the [T5](https://arxiv.org/abs/1910.10683) architecture. We pre-trained it on the Indonesian portion of the open-source [uonlp/CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) corpus. On a held-out subset of the corpus, our model achieved an evaluation loss of 2.082, which corresponds to a perplexity of about 8.02.
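
The two figures are consistent, since perplexity is simply the exponential of the cross-entropy evaluation loss:

```python
import math

# Perplexity is the exponentiated cross-entropy loss reported above.
print(math.exp(2.082))  # ≈ 8.02
```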
|
|
|
This model was trained using the [nanoT5](https://github.com/PiotrNawrot/nanoT5) PyTorch framework. All training was done on an NVIDIA H100 GPU. [LazarusNLP/IndoNanoT5-base](https://huggingface.co/LazarusNLP/IndoNanoT5-base) is released under the Apache 2.0 license.
|
|
|
## Model Details
|
|
|
- **Developed by**: [LazarusNLP](https://lazarusnlp.github.io/) |
|
- **Model type**: Encoder-decoder T5 transformer language model |
|
- **Language(s)**: Indonesian |
|
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.html) |
|
- **Contact**: [Wilson Wongso](https://wilsonwongso.dev/) |
|
|
|
## Use in 🤗 Transformers
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
model_checkpoint = "LazarusNLP/IndoNanoT5-base" |
|
|
|
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint) |
|
model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint) |
|
``` |
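
Continuing from the snippet above, a minimal sanity check is to let the pre-trained checkpoint fill a T5 sentinel token. Note this is only an illustrative sketch: the model was pre-trained with the span-corruption objective, the Indonesian example sentence is our own, and downstream tasks will generally require fine-tuning first.

```python
# Span-corruption pre-training means the model fills sentinel tokens;
# the input below ("The capital of Indonesia is <extra_id_0>.") is a
# made-up example, not from the model card.
input_text = "Ibu kota Indonesia adalah <extra_id_0>."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```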
|
|
|
## Training Datasets |
|
|
|
Around 4B tokens from the following corpus were used during pre-training; a sketch for streaming it follows the list.
|
|
|
- [Cleaned, Enormous, and Public: The Multilingual Fuel to Democratize Large Language Models for 167 Languages](https://huggingface.co/datasets/uonlp/CulturaX) |
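
For reference, the corpus can be streamed with 🤗 Datasets rather than downloaded in full. This is a hedged sketch: the `"id"` config name is an assumption based on the dataset card's per-language layout, and access may require accepting the dataset's terms on the Hub.

```python
from datasets import load_dataset

# Stream the Indonesian split of CulturaX without downloading the full corpus.
# The "id" config name is assumed from the dataset card's language configs.
dataset = load_dataset("uonlp/CulturaX", "id", split="train", streaming=True)
print(next(iter(dataset)))
```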
|
|
|
## Training Hyperparameters |
|
|
|
The following hyperparameters were used during training; a sketch of the learning-rate schedule they imply follows the list:
|
|
|
- `total_steps`: 65536 |
|
- `input_length`: 512 |
|
- `batch_size`: 128 |
|
- `grad_acc`: 1 |
|
- `base_lr`: 5e-3 |
|
- `optimizer`: AdamWScaled with `betas=(0.9,0.999)` and `epsilon=1e-08` |
|
- `weight_decay`: 0.0 |
|
- `lr_scheduler`: cosine |
|
- `warmup_steps`: 10000 |
|
- `final_cosine`: 1e-5 |
|
- `grad_clip`: 1.0 |
|
- `precision`: `bf16` |
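
Putting the schedule-related values together, the implied learning-rate curve is a linear warmup over 10,000 steps to the base rate of 5e-3, followed by cosine decay to the final rate of 1e-5 at step 65,536. The function below is our own illustration of that curve, not code from nanoT5.

```python
import math

def lr_at_step(step, total_steps=65536, warmup_steps=10000,
               base_lr=5e-3, final_lr=1e-5):
    # Linear warmup to base_lr, then cosine decay down to final_lr.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 10000, 40000, 65536):
    print(s, lr_at_step(s))
```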
|
|
|
## Acknowledgements |
|
|
|
We would like to acknowledge [nanoT5](https://github.com/PiotrNawrot/nanoT5) for inspiring this project. |
|
|
|
## Credits |
|
|
|
IndoNanoT5 Base is developed with love by:
|
|
|
<div style="display: flex;"> |
|
<a href="https://github.com/anantoj"> |
|
<img src="https://github.com/anantoj.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
|
|
<a href="https://github.com/DavidSamuell"> |
|
<img src="https://github.com/DavidSamuell.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
|
|
<a href="https://github.com/stevenlimcorn"> |
|
<img src="https://github.com/stevenlimcorn.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
|
|
<a href="https://github.com/w11wo"> |
|
<img src="https://github.com/w11wo.png" alt="GitHub Profile" style="border-radius: 50%;width: 64px;margin:0 4px;"> |
|
</a> |
|
</div> |