Extend vocabulary and Pretrain

We used SentencePiece to train a new tokenizer on Vietnamese, English, and Chinese text. Its vocabulary was then merged with Flan-T5's original vocabulary, dropping duplicate tokens. The merged vocabulary contains 106,611 tokens.
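
The merge step can be pictured as an order-preserving union. The following is only a minimal sketch on plain Python lists, not the script used for this model: the function name `merge_vocabularies` and the toy tokens are hypothetical, and a real merge would operate on the SentencePiece model files rather than on lists.

```python
def merge_vocabularies(base_tokens, new_tokens):
    """Append tokens from new_tokens that are not already in base_tokens.

    The order of base_tokens is preserved, so existing token ids stay valid.
    """
    seen = set(base_tokens)
    merged = list(base_tokens)
    for token in new_tokens:
        if token not in seen:
            seen.add(token)
            merged.append(token)
    return merged


# Toy example with made-up tokens (not the real vocabularies).
base = ["▁the", "▁model", "ing"]
new = ["▁model", "▁việt", "▁nam", "ing"]
merged = merge_vocabularies(base, new)
print(len(merged))  # 5
```

Keeping the original tokens first means the pretrained embeddings for those ids remain usable; only the appended tokens need newly initialised embeddings.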

We then ran one epoch of continual (incremental) pretraining on the Flan-T5-Large model, using a diverse corpus of more than 100 GB drawn from the following sources:

  • NewsCorpus
  • Vietnamese Wikipedia
  • Vietnamese books
  • Vietnamese legal documents
  • Vietnamese legal text
  • English Wikipedia
  • Chinese Text
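
Because the merged vocabulary is larger than Flan-T5's original one, the embedding matrix has to grow before pretraining can continue. This card does not describe how that was done; the sketch below assumes the Hugging Face `resize_token_embeddings` method and uses a tiny, randomly initialised T5 with made-up sizes instead of the real checkpoint.

```python
from transformers import T5Config, T5ForConditionalGeneration

# Tiny, randomly initialised T5 used only for illustration (assumed sizes,
# not the Flan-T5-Large configuration).
config = T5Config(vocab_size=100, d_model=16, d_kv=4, d_ff=32,
                  num_layers=1, num_heads=2)
model = T5ForConditionalGeneration(config)

old_rows = model.get_input_embeddings().weight.shape[0]

# Stand-in for the size of the merged vocabulary, e.g. len(tokenizer).
new_vocab_size = 120
model.resize_token_embeddings(new_vocab_size)

new_rows = model.get_input_embeddings().weight.shape[0]
print(old_rows, new_rows)
```

The rows for existing token ids are kept, and the added rows are newly initialised, which is why further pretraining on the new languages is needed.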

How to use

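import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Hatto/HattoFlanT5-Large")
model = AutoModelForSeq2SeqLM.from_pretrained("Hatto/HattoFlanT5-Large")
# Move the model to the GPU when one is available; otherwise stay on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)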

Finetune and Benchmark

  • Wikilingua
  • Vietnews
  • Pho_NER
  • .....

Citation

  • Hatto
  • Ipcoms
