metadata
datasets:
- lst20
language:
- th
widget:
- text: วัน ที่ _ 12 _ มีนาคม นี้ _ ฉัน จะ ไป เที่ยว วัดพระแก้ว _ ที่ กรุงเทพ
library_name: transformers
HoogBERTa
This repository contains the Thai pretrained language representation HoogBERTa_base, fine-tuned for the sentence boundary classification task.
Documentation
Prerequisite
Because HoogBERTa uses subword-nmt BPE encoding, input text must be pre-tokenized into words according to the BEST standard before it is passed to the model. The examples in this document use AttaCut for that step:
pip install attacut
Getting Started
To load the tokenizer and model from the Hugging Face Hub, run:
from transformers import RobertaTokenizerFast, RobertaForTokenClassification
from attacut import tokenize
import torch
tokenizer = RobertaTokenizerFast.from_pretrained("lst-nectec/HoogBERTa-SENTENCE-lst20")
model = RobertaForTokenClassification.from_pretrained("lst-nectec/HoogBERTa-SENTENCE-lst20")
To run sentence boundary classification on a single input, use the following commands:
from transformers import pipeline
nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
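# Tokenize each space-delimited chunk with AttaCut and escape literal underscores,
# because "_" is used below as the marker for an original space.
all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))

# Rejoin the chunks, marking each original space with " _ ".
sentence = " _ ".join(all_sent)
print(nlp(sentence))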
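The pipeline returns one prediction per token rather than a list of sentences. The following is a minimal sketch of one way to turn those predictions into sentence strings. It assumes the standard `token-classification` output with `aggregation_strategy="none"`, where each item is a dict containing `entity` and the character offsets `start` and `end`. The function name `split_on_boundaries` and the label `"E"` are hypothetical; the actual label names of this model are not documented here and should be checked in `model.config.id2label`.

```python
from typing import Dict, List


def split_on_boundaries(text: str, predictions: List[Dict], end_label: str) -> List[str]:
    """Cut `text` after every token whose predicted label equals `end_label`."""
    pieces = []
    start = 0
    for pred in predictions:
        if pred["entity"] == end_label:
            # "end" is the character offset just past the token in `text`.
            # Strip spaces and the "_" space markers left at the cut points.
            pieces.append(text[start:pred["end"]].strip(" _"))
            start = pred["end"]
    # Keep whatever follows the last predicted boundary.
    pieces.append(text[start:].strip(" _"))
    return [piece for piece in pieces if piece]


# Hand-written stand-in for pipeline output; "E" is a hypothetical label name.
demo_text = "a b _ c d _ e"
demo_predictions = [
    {"entity": "E", "word": "b", "start": 2, "end": 3},
    {"entity": "E", "word": "d", "start": 8, "end": 9},
]
demo_sentences = split_on_boundaries(demo_text, demo_predictions, end_label="E")
```

With real output, the call would look like `split_on_boundaries(sentence, nlp(sentence), end_label=...)`, where `sentence` is the preprocessed string passed to the pipeline and the label is whichever one marks the end of a sentence for this model.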
For batch processing, preprocess each input the same way and pass the resulting list to the pipeline:
from transformers import pipeline
nlp = pipeline('token-classification', model=model, tokenizer=tokenizer, aggregation_strategy="none")
sentenceL = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
inputList = []
for sentX in sentenceL:
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))

    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

print(nlp(inputList))
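The single-input and batch examples repeat the same preprocessing steps, so they can be factored into a helper. This is a sketch, not part of the HoogBERTa API: the name `to_hoogberta_input` is hypothetical, and the word tokenizer is passed in as an argument so the block runs without AttaCut installed. In practice you would pass `attacut.tokenize`.

```python
from typing import Callable, List


def to_hoogberta_input(text: str, word_tokenize: Callable[[str], List[str]]) -> str:
    """Word-tokenize each space-delimited chunk and mark original spaces with " _ "."""
    chunks = []
    for chunk in text.split(" "):
        words = " ".join(word_tokenize(chunk))
        # Escape literal underscores so they are not confused with the space marker.
        chunks.append(words.replace("_", "[!und:]"))
    return " _ ".join(chunks)


def char_tokenize(chunk: str) -> List[str]:
    """Stand-in tokenizer for the demo only: one token per character."""
    return list(chunk)


demo_input = to_hoogberta_input("ab c_d", char_tokenize)
```

With AttaCut available, the batch example reduces to `nlp([to_hoogberta_input(s, tokenize) for s in sentenceL])`.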
Huggingface Models
HoogBERTaEncoder
- HoogBERTa: Feature Extraction and Masked Language Modeling
HoogBERTaMuliTaskTagger
- HoogBERTa-NER-lst20: Named-entity recognition (NER) based on LST20
- HoogBERTa-POS-lst20: Part-of-speech tagging (POS) based on LST20
- HoogBERTa-SENTENCE-lst20: Clause Boundary Classification based on LST20
Citation
Please cite as:
@inproceedings{porkaew2021hoogberta,
title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
year = {2021},
address={Online}
}