lst-nectec/HoogBERTa

new5558 committed on Mar 31, 2023

Commit e4b1b40 • 1 Parent(s): d9a424a

docs: update readme

Files changed (1)

README.md +89 -0

README.md ADDED

@@ -0,0 +1,89 @@

+---
+license: mit
+datasets:
+- scb_mt_enth_2020
+- oscar
+- best2009
+- wikipedia
+language:
+- th
+library_name: fairseq
+---
+# HoogBERTa
+This repository includes the Thai pretrained language representation (HoogBERTa_base) and the fine-tuned model for multitask sequence labeling.
+# Documentation
+To initialize the model from the Hub, use the following commands:
+```python
+from transformers import AutoTokenizer, AutoModel
+tokenizer = AutoTokenizer.from_pretrained("new5558/HoogBERTa")
+model = AutoModel.from_pretrained("new5558/HoogBERTa")
+```
+To annotate POS, NE, and clause boundaries, use the following commands:
+```
+```
+To extract token features based on the RoBERTa architecture, use the following commands:
+```python
+import torch  # needed for torch.no_grad()
+from attacut import tokenize  # assumption: `tokenize` is AttaCut's Thai word segmenter
+model.eval()
+sentence = "วันที่ 12 มีนาคมนี้ ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"
+# Segment each space-delimited chunk; escape literal "_" because "_" marks spaces below.
+all_sent = [" ".join(tokenize(sent)).replace("_", "[!und:]") for sent in sentence.split(" ")]
+sentence = " _ ".join(all_sent)
+with torch.no_grad():
+    token_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
+    features = model(token_ids)  # per-token vectors are in features.last_hidden_state
+```
+For batch processing:
+```python
+# Requires `import torch` and a Thai word segmenter `tokenize` (assumed here to be AttaCut's).
+model.eval()
+sentenceL = ["วันที่ 12 มีนาคมนี้","ฉันจะไปเที่ยววัดพระแก้ว ที่กรุงเทพ"]
+inputList = []
+for sentX in sentenceL:
+    # Segment each space-delimited chunk, escaping literal "_" before "_" is used as the space marker.
+    all_sent = [" ".join(tokenize(sent)).replace("_", "[!und:]") for sent in sentX.split(" ")]
+    inputList.append(" _ ".join(all_sent))
+with torch.no_grad():
+    batch = tokenizer(inputList, padding=True, return_tensors="pt")
+    token_ids = batch.input_ids
+    # Pass the attention mask so that padding tokens are ignored.
+    features = model(token_ids, attention_mask=batch.attention_mask)
+```
+To use HoogBERTa as an embedding layer, use:
+```python
+with torch.no_grad():
+    features = model(token_ids)  # token_ids must be a tensor of dtype torch.long
+```
+# Citation
+Please cite as:
+```bibtex
+@inproceedings{porkaew2021hoogberta,
+  title = {HoogBERTa: Multi-task Sequence Labeling using Thai Pretrained Language Representation},
+  author = {Peerachet Porkaew and Prachya Boonkwan and Thepchai Supnithi},
+  booktitle = {The Joint International Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP 2021)},
+  year = {2021},
+  address = {Online}
+}
+```
+Download full-text [PDF](https://drive.google.com/file/d/1hwdyIssR5U_knhPE2HJigrc0rlkqWeLF/view?usp=sharing)
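
The single-sentence and batch snippets in the README repeat the same text preparation: split on spaces, segment each chunk into words, escape literal underscores as `[!und:]`, and rejoin the chunks with ` _ ` as the space marker. A minimal sketch of that step as a reusable helper might look like the following. The names `preprocess`, `segment`, and `toy_segment` are hypothetical and do not appear in the README; `segment` stands in for the README's `tokenize` function, and `toy_segment` is only a placeholder so the example runs without a Thai word segmenter installed.

```python
from typing import Callable, List


def preprocess(text: str, segment: Callable[[str], List[str]]) -> str:
    """Hypothetical helper mirroring the README's preprocessing steps.

    `segment` is assumed to behave like the README's `tokenize`:
    it takes one space-free chunk and returns a list of words.
    """
    chunks = []
    for chunk in text.split(" "):
        # Join the segmented words with single spaces.
        words = " ".join(segment(chunk))
        # Escape literal underscores, since "_" is reserved as the space marker.
        chunks.append(words.replace("_", "[!und:]"))
    # Original spaces between chunks are represented by " _ ".
    return " _ ".join(chunks)


def toy_segment(chunk: str) -> List[str]:
    # Placeholder only: cuts the chunk into 2-character pieces.
    # It is NOT a real Thai segmenter.
    return [chunk[i:i + 2] for i in range(0, len(chunk), 2)]


print(preprocess("ab_cd ef", toy_segment))
```

With a real segmenter passed as `segment`, the returned string could be handed to `tokenizer(...)` in the same way as the `sentence` variable in the README snippets.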