HoogBERTa
This repository contains the Thai pretrained language representation model HoogBERTa_base, which can be used for feature extraction and masked language modeling tasks.
Since HoogBERTa uses subword-nmt BPE encoding, input text must be pre-tokenized following the BEST standard before it is passed to the model. We use attacut for this step:
pip install attacut
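As a quick illustration of what the pre-tokenization step produces (the segmentation in the comment is indicative; attacut's output can vary across versions):

from attacut import tokenize

print(tokenize("ฉันจะไปเที่ยววัดพระแก้ว"))
# indicative output: ['ฉัน', 'จะ', 'ไป', 'เที่ยว', 'วัด', 'พระแก้ว']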
To initialize the model from the hub, use the following commands:
from transformers import AutoTokenizer, AutoModel
from attacut import tokenize
import torch
tokenizer = AutoTokenizer.from_pretrained("lst-nectec/HoogBERTa")
model = AutoModel.from_pretrained("lst-nectec/HoogBERTa")
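If a GPU is available, the model can optionally be moved to it in the usual PyTorch way (this step is not part of the original instructions; tokenized inputs must then be moved as well):

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
# Later calls then need inputs on the same device, e.g. tokenizer(...).to(device)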
To extract token features from the RoBERTa-based encoder, use the following commands:
model.eval()

sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    # Word-segment each space-separated chunk with attacut; literal underscores
    # are escaped because " _ " is reserved as the chunk separator below.
    all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))

sentence = " _ ".join(all_sent)
tokenized_text = tokenizer(sentence, return_tensors='pt')
token_ids = tokenized_text['input_ids']
with torch.no_grad():
    features = model(**tokenized_text, output_hidden_states=True).hidden_states[-1]
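features now holds the last hidden state with shape (batch_size, sequence_length, hidden_size). A quick sanity check (the 768 below assumes the usual RoBERTa-base hidden size, which this card does not state explicitly):

# One feature vector per subword token.
print(features.shape)  # e.g. torch.Size([1, <num_subwords>, 768]) for a base-sized model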
For batch processing, apply the same preprocessing to each sentence and pad the batch:
model.eval()

sentenceL = ["วันที่ 12 มีนาคมนี้", "ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
inputList = []
for sentX in sentenceL:
    # Apply the same pre-tokenization to every sentence in the batch.
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_", "[!und:]"))
    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

tokenized_text = tokenizer(inputList, padding=True, return_tensors='pt')
token_ids = tokenized_text['input_ids']
with torch.no_grad():
    features = model(**tokenized_text, output_hidden_states=True).hidden_states[-1]
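With padding=True, the padded positions also receive hidden states; mask them out before pooling token features into per-sentence vectors. A minimal mean-pooling sketch (an addition, not part of the original instructions):

# Zero out padded positions, then average over the sequence dimension.
mask = tokenized_text['attention_mask'].unsqueeze(-1).type_as(features)
sentence_vectors = (features * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
print(sentence_vectors.shape)  # (batch_size, hidden_size)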
To use HoogBERTa as an embedding layer, use:
with torch.no_grad():
    features = model(token_ids, output_hidden_states=True).hidden_states[-1]  # token_ids is a tensor of dtype torch.long
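The card also lists masked language modeling. Below is a minimal mask-filling sketch; it assumes the hub checkpoint ships MLM head weights loadable through AutoModelForMaskedLM, which is an assumption rather than something stated above:

from transformers import AutoModelForMaskedLM

mlm = AutoModelForMaskedLM.from_pretrained("lst-nectec/HoogBERTa")  # assumed to include MLM weights
mlm.eval()

# The input is already pre-tokenized (space-separated words), matching the preprocessing above.
inputs = tokenizer("ฉัน จะ ไป " + tokenizer.mask_token, return_tensors='pt')
with torch.no_grad():
    logits = mlm(**inputs).logits

# Top-5 candidate tokens for the masked position.
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top5 = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top5.tolist()))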
HoogBERTaEncoder:
- Feature extraction and masked language modeling

HoogBERTaMuliTaskTagger:
- Named-entity recognition (NER) based on LST20
- Part-of-speech tagging (POS) based on LST20
- Clause boundary classification based on LST20
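The tagger interface itself is not shown on this card. The sketch below assumes the companion hoogberta Python package exposes the two classes above with this API; treat the imports and method names as assumptions to verify against the upstream HoogBERTa repository:

# Assumed interface of the companion hoogberta package (not confirmed by this card).
from hoogberta.encoder import HoogBERTaEncoder
from hoogberta.multitagger import HoogBERTaMuliTaskTagger

encoder = HoogBERTaEncoder(cuda=False)        # feature extraction / MLM backbone
tagger = HoogBERTaMuliTaskTagger(cuda=False)  # joint NER, POS, and clause-boundary tagger

output = tagger.nlp("วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ")
print(output)  # per-token tags for the three LST20 tasks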
Please cite as:
@inproceedings{porkaew2021hoogberta,
title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
year = {2021},
address = {Online}
}