---
license: mit
datasets:
- best2009
- scb_mt_enth_2020
- oscar
- wikipedia
language:
- th
widget:
- text: วัน ที่ _ 12 _ มีนาคม นี้ _ ฉัน จะ ไป <mask> วัดพระแก้ว _ ที่ กรุงเทพ
library_name: transformers
---
# HoogBERTa
This repository provides HoogBERTa_base, a pretrained Thai language representation that can be used for **feature extraction and masked language modeling**.
# Documentation
## Prerequisite
Since HoogBERTa uses subword-nmt BPE encoding, input text must be pre-tokenized according to the [BEST](https://huggingface.co/datasets/best2009) standard before it is passed to the model. The examples in this README use AttaCut for that step:
```
pip install attacut
```
## Getting Started
To initialize the model from the Hub, use the following commands:
```python
from transformers import AutoTokenizer, AutoModel
from attacut import tokenize
import torch
tokenizer = AutoTokenizer.from_pretrained("lst-nectec/HoogBERTa")
model = AutoModel.from_pretrained("lst-nectec/HoogBERTa")
```
HoogBERTa is based on the RoBERTa architecture. To extract token-level features, use the following commands:
```python
model.eval()
sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"

# Word-segment each space-separated chunk with AttaCut and escape literal underscores
all_sent = []
sentences = sentence.split(" ")
for sent in sentences:
    all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))

# Rejoin the chunks, marking each original space with " _ "
sentence = " _ ".join(all_sent)
tokenized_text = tokenizer(sentence, return_tensors = 'pt')
token_ids = tokenized_text['input_ids']

with torch.no_grad():
    # Last-layer hidden states, one vector per subword token
    features = model(**tokenized_text, output_hidden_states = True).hidden_states[-1]
```
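The pre-tokenization steps in the snippet above can be wrapped in a small reusable helper. The sketch below is illustrative only: `to_hoogberta_input` and its `word_tokenize` parameter are hypothetical names that are not part of the HoogBERTa repository, and skipping empty chunks produced by repeated spaces is an added assumption. Any word segmenter can be passed in; `attacut.tokenize` is the one used in this README.

```python
from typing import Callable, List


def to_hoogberta_input(text: str, word_tokenize: Callable[[str], List[str]]) -> str:
    """Pre-tokenize raw text into the space-separated form used above.

    Hypothetical helper, not part of the HoogBERTa repository.
    """
    chunks = []
    for chunk in text.split(" "):
        if not chunk:
            # Assumption: ignore empty chunks caused by repeated spaces.
            continue
        words = word_tokenize(chunk)
        # Escape literal underscores, as the snippet above does,
        # so that a standalone "_" only ever marks an original space.
        chunks.append(" ".join(words).replace("_", "[!und:]"))
    # Mark each original space with a standalone "_" token.
    return " _ ".join(chunks)


if __name__ == "__main__":
    # Stand-in segmenter for demonstration: one "word" per character.
    # In practice, pass attacut.tokenize instead.
    print(to_hoogberta_input("ab c_d", list))
```

With AttaCut installed, the call would look like `to_hoogberta_input(sentence, tokenize)`, and the result can be passed to `tokenizer(...)` as in the snippet above.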
For batch processing, pad the inputs to a common length:
```python
model.eval()
sentenceL = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]

# Pre-tokenize every sentence the same way as in the single-sentence example
inputList = []
for sentX in sentenceL:
    sentences = sentX.split(" ")
    all_sent = []
    for sent in sentences:
        all_sent.append(" ".join(tokenize(sent)).replace("_","[!und:]"))
    sentence = " _ ".join(all_sent)
    inputList.append(sentence)

# padding=True pads shorter sentences to the longest one in the batch
tokenized_text = tokenizer(inputList, padding = True, return_tensors = 'pt')
token_ids = tokenized_text['input_ids']

with torch.no_grad():
    features = model(**tokenized_text, output_hidden_states = True).hidden_states[-1]
```
To use HoogBERTa as an embedding layer, pass a tensor of token IDs directly to the model:
```python
# token_ids must be a tensor of dtype torch.long
with torch.no_grad():
    features = model(token_ids, output_hidden_states = True).hidden_states[-1]
```
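For the masked language modeling use case, the input is a pre-tokenized sentence in which one word is replaced by `<mask>`, as in the widget example in this model card's metadata. The sketch below builds such an input from an already pre-tokenized string. `mask_word` is a hypothetical helper, not part of the HoogBERTa repository, and it assumes that words are separated by single spaces and that `<mask>` is the tokenizer's mask token.

```python
def mask_word(pretokenized: str, index: int, mask_token: str = "<mask>") -> str:
    """Replace the word at position `index` with the mask token.

    Hypothetical helper; `pretokenized` is a space-separated string in which
    a standalone "_" marks an original space.
    """
    words = pretokenized.split(" ")
    if not 0 <= index < len(words):
        raise IndexError(f"index {index} is out of range for {len(words)} tokens")
    words[index] = mask_token
    return " ".join(words)


if __name__ == "__main__":
    # Hand-segmented here for illustration; real input should come from
    # the pre-tokenization step described in this README.
    text = "วัน ที่ _ 12 _ มีนาคม นี้ _ ฉัน จะ ไป เที่ยว วัดพระแก้ว _ ที่ กรุงเทพ"
    print(mask_word(text, 11))
```

The resulting string could then be passed to a model with a masked-language-modeling head, for example through the `fill-mask` pipeline in `transformers`; that step is not shown here and has not been verified against this checkpoint.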
# Huggingface Models
1. `HoogBERTaEncoder`
- [HoogBERTa](https://huggingface.co/lst-nectec/HoogBERTa): `Feature Extraction` and `Masked Language Modeling`
2. `HoogBERTaMuliTaskTagger`
- [HoogBERTa-NER-lst20](https://huggingface.co/lst-nectec/HoogBERTa-NER-lst20): `Named-entity recognition (NER)` based on LST20
- [HoogBERTa-POS-lst20](https://huggingface.co/lst-nectec/HoogBERTa-POS-lst20): `Part-of-speech tagging (POS)` based on LST20
- [HoogBERTa-SENTENCE-lst20](https://huggingface.co/lst-nectec/HoogBERTa-SENTENCE-lst20): `Clause Boundary Classification` based on LST20
# Citation
Please cite as:
``` bibtex
@inproceedings{porkaew2021hoogberta,
title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
year = {2021},
address={Online}
}
```
Download the full-text [PDF](https://drive.google.com/file/d/1hwdyIssR5U_knhPE2HJigrc0rlkqWeLF/view?usp=sharing).

Check out the code on [GitHub](https://github.com/lstnlp/HoogBERTa).