Multi-criteria BERT base Thai with Lattice for Word Segmentation

This model is a variant of the pre-trained BERT model. It is based on bert-base-multilingual-cased, was pre-trained on Thai text, and was fine-tuned for word segmentation. This version processes input text at the character level and incorporates word-level information through a lattice structure.
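To make the idea of a character-level input with word-level lattice information more concrete, the following is a minimal sketch of how a word lattice can be built over a character sequence. It is an illustration only, not the LATTE implementation: the function names `build_lattice` and `words_per_character`, the `max_len` parameter, and the four-word toy lexicon are all assumptions introduced here. The real model's lexicon and the way it consumes the lattice are described in the paper and the linked repositories.

```python
from typing import List, Set, Tuple

# Hypothetical edge type for this sketch: (start, end, word), with end exclusive.
Edge = Tuple[int, int, str]


def build_lattice(text: str, lexicon: Set[str], max_len: int = 4) -> List[Edge]:
    """Return every lexicon word found in `text` as a character span."""
    edges: List[Edge] = []
    for start in range(len(text)):
        # Try every substring of up to `max_len` characters starting here.
        for end in range(start + 1, min(start + max_len, len(text)) + 1):
            word = text[start:end]
            if word in lexicon:
                edges.append((start, end, word))
    return edges


def words_per_character(text: str, edges: List[Edge]) -> List[List[str]]:
    """For each character position, list the candidate words that cover it."""
    covering: List[List[str]] = [[] for _ in text]
    for start, end, word in edges:
        for i in range(start, end):
            covering[i].append(word)
    return covering


# Toy example (assumed lexicon): the string can be split two different ways,
# so the lattice keeps both sets of candidate words.
text = "ตากลม"
lexicon = {"ตา", "กลม", "ตาก", "ลม"}
edges = build_lattice(text, lexicon)
for char, words in zip(text, words_per_character(text, edges)):
    print(char, words)
```

In this sketch each character ends up paired with all candidate words that span it, which is the kind of word-level context a lattice can attach to a character-level sequence. Overlapping candidates are kept rather than resolved, leaving the choice between them to the model.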

The scripts for the pre-training are available at tchayintr/latte-ptm-ws.

The LATTE scripts are available at tchayintr/latte-ws.

Model architecture

The model architecture is described in this paper.

Training Data

The model is trained on multiple Thai word-segmented datasets: best2010, lst20, tlc (tnhc), vistec-tp-th-2021 (vistec2021), and wisesight_sentiment (ws160). The datasets can be accessed as follows:

best2010
lst20
tlc
vistec-tp-th-2021
wisesight_sentiment

Licenses

The pre-trained model is distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International license (CC BY-SA 4.0).

Acknowledgments

This model was trained on GPU servers provided by the Okumura-Funakoshi NLP Group.
