Multi-criteria BERT base Thai with Lattice for Word Segmentation

This is a variant of the pre-trained BERT model. Starting from bert-base-multilingual-cased, the model was pre-trained on Thai text and fine-tuned for word segmentation. This version processes input text at the character level and incorporates word-level information through a lattice structure.
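To make the idea of a character-level input with a word lattice more concrete, the following is a minimal sketch of how candidate words can be enumerated over a character sequence. It is not taken from the LATTE scripts: the function name build_lattice, the toy lexicon, and the max_len parameter are all assumptions made for illustration, and the actual implementation may build and encode its lattice differently.

```python
from typing import List, Set, Tuple

# Hypothetical helper (not part of the LATTE code base): enumerate every
# substring of `text` that appears in `lexicon` and record its span.
def build_lattice(text: str, lexicon: Set[str], max_len: int = 4) -> List[Tuple[int, int, str]]:
    lattice = []
    n = len(text)
    for start in range(n):
        # Consider candidate words of 1..max_len characters from `start`.
        for end in range(start + 1, min(n, start + max_len) + 1):
            candidate = text[start:end]
            if candidate in lexicon:
                lattice.append((start, end, candidate))
    return lattice

# Toy lexicon, assumed for this example only.
toy_lexicon = {"ตา", "กลม", "ตาก", "ลม"}
text = "ตากลม"

chars = list(text)  # character-level view of the input
word_spans = build_lattice(text, toy_lexicon)  # word-level candidates

for start, end, word in word_spans:
    print(start, end, word)
```

In this toy input the lattice holds two competing readings (ตา + กลม and ตาก + ลม). A lattice-based segmenter can attend to such candidate words alongside the characters when deciding where word boundaries fall.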

The scripts for the pre-training are available at tchayintr/latte-ptm-ws.

The LATTE scripts are available at tchayintr/latte-ws.

Model architecture

The model architecture is described in this paper.

Training Data

The model was trained on multiple Thai word segmentation datasets, including best2010, lst20, tlc (tnhc), vistec-tp-th-2021 (vistec2021), and wisesight_sentiment (ws160). The datasets can be accessed as follows:

Licenses

The pre-trained model is distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 license.

Acknowledgments

This model was trained on GPU servers provided by the Okumura-Funakoshi NLP Group.

