	---
	base_model: airesearch/wangchanberta-base-att-spm-uncased
	tags:
	- generated_from_trainer
	datasets:
	- universal_dependencies
	metrics:
	- accuracy
	- recall
	- precision
	- f1
	model-index:
	- name: wangchanberta-ud-thai-pud-upos
	  results:
	  - task:
	      name: Token Classification
	      type: token-classification
	    dataset:
	      name: universal_dependencies
	      type: universal_dependencies
	      config: th_pud
	      split: test
	      args: th_pud
	    metrics:
	    - name: Accuracy
	      type: accuracy
	      value: 0.9883334914161055
	widget:
	- text: นักวิจัยกล่าวว่าการวิเคราะห์ดีเอ็นเอของเนื้องอกอาจช่วยอธิบายถึงสาเหตุที่แท้จริงของมะเร็งชนิดอื่นๆ ได้
	  example_title: test_example_1
	- text: >-
	    คือผมไม่ได้ชอบกดดันพวกคุณหรอกนะ แต่ชะตากรรมของสาธารณรัฐอยู่ในกำมือคุณ
	  example_title: test_example_2
	language:
	- th
	library_name: transformers
	---

	# wangchanberta-ud-thai-pud-upos

	This model is a fine-tuned version of [airesearch/wangchanberta-base-att-spm-uncased](https://huggingface.co/airesearch/wangchanberta-base-att-spm-uncased) on the `th_pud` configuration of the universal_dependencies dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.0442
	- Macro avg precision: 0.9221
	- Macro avg recall: 0.9178
	- Macro avg f1: 0.9199
	- Weighted avg precision: 0.9883
	- Weighted avg recall: 0.9883
	- Weighted avg f1: 0.9883
	- Accuracy: 0.9883
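The macro averages are lower than the weighted ones. Macro averaging gives every tag equal weight, while weighted averaging scales each tag's score by its number of gold tokens, so a gap like this usually means some infrequent tags are predicted less reliably. The sketch below illustrates the two averages in plain Python on made-up toy labels (not data from this model); the helper names are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: (f1, support)} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        scores[label] = (f1, support[label])
    return scores


def macro_f1(scores):
    """Unweighted mean of per-tag F1: every tag counts equally."""
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(scores):
    """Mean of per-tag F1 weighted by the number of gold tokens per tag."""
    total = sum(n for _, n in scores.values())
    return sum(f1 * n for f1, n in scores.values()) / total


# Toy example (assumption, for illustration only): one rare tag is always missed.
y_true = ["NOUN"] * 8 + ["VERB"] + ["SYM"]
y_pred = ["NOUN"] * 8 + ["VERB"] + ["NOUN"]

scores = per_class_f1(y_true, y_pred)
print(f"macro F1:    {macro_f1(scores):.4f}")
print(f"weighted F1: {weighted_f1(scores):.4f}")
```

In the toy data the missed `SYM` token pulls the macro score down far more than the weighted score, because `SYM` accounts for only one of ten tokens.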

	## Model description

	This model is trained on the UD Thai PUD corpus with Universal Part-of-Speech (UPOS) tags and is intended for part-of-speech tagging of Thai text.

	## Example
	```python
	from transformers import AutoModelForTokenClassification, AutoTokenizer, TokenClassificationPipeline

	# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
	model = AutoModelForTokenClassification.from_pretrained("Pavarissy/wangchanberta-ud-thai-pud-upos")
	tokenizer = AutoTokenizer.from_pretrained("Pavarissy/wangchanberta-ud-thai-pud-upos")

	# aggregation_strategy="simple" merges adjacent sub-word tokens that share a tag;
	# it replaces the deprecated grouped_entities=True argument.
	pos_tagger = TokenClassificationPipeline(model=model, tokenizer=tokenizer, aggregation_strategy="simple")
	outputs = pos_tagger("ประเทศไทย อยู่ใน ทวีป เอเชีย")
	print(outputs)
	# [{'entity_group': 'NOUN', 'score': 0.419697, 'word': '', 'start': 0, 'end': 1}, {'entity_group': 'PROPN', 'score': 0.8809489, 'word': 'ประเทศไทย', 'start': 0, 'end': 9}, {'entity_group': 'VERB', 'score': 0.7754166, 'word': 'อยู่ใน', 'start': 9, 'end': 16}, {'entity_group': 'NOUN', 'score': 0.9976932, 'word': 'ทวีป', 'start': 16, 'end': 21}, {'entity_group': 'PROPN', 'score': 0.97770107, 'word': 'เอเชีย', 'start': 21, 'end': 28}]
	```
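The first entry in the example output has an empty `word` and a low score; it may be a tokenizer artifact, though the card does not say. As a minimal sketch, the pipeline output can be reduced to `(word, tag)` pairs while skipping such entries. The `to_tagged_words` helper is hypothetical, and the sample list is copied from the output shown above so the snippet runs without downloading the model.

```python
# Entries copied from the example output above.
outputs = [
    {"entity_group": "NOUN", "score": 0.419697, "word": "", "start": 0, "end": 1},
    {"entity_group": "PROPN", "score": 0.8809489, "word": "ประเทศไทย", "start": 0, "end": 9},
    {"entity_group": "VERB", "score": 0.7754166, "word": "อยู่ใน", "start": 9, "end": 16},
    {"entity_group": "NOUN", "score": 0.9976932, "word": "ทวีป", "start": 16, "end": 21},
    {"entity_group": "PROPN", "score": 0.97770107, "word": "เอเชีย", "start": 21, "end": 28},
]


def to_tagged_words(entities):
    """Hypothetical helper: return (word, UPOS tag) pairs, dropping empty words."""
    return [
        (entity["word"].strip(), entity["entity_group"])
        for entity in entities
        if entity["word"].strip()
    ]


tagged = to_tagged_words(outputs)
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

Filtering on the empty string rather than on the score keeps low-confidence but real words in the result; a score threshold could be added if that suits the application better.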

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 2e-05
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 10
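To make the `linear` scheduler setting concrete, the sketch below computes the learning rate at a given optimizer step in plain Python. It assumes zero warmup steps (the card lists no warmup value) and takes the total of 1250 steps from the training results table; `linear_lr` is a hypothetical helper, not a Transformers function.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 10 epochs x 125 steps per epoch, as reported in the training results table.
TOTAL_STEPS = 1250

# Assumption: warmup_steps=0, since the card does not list a warmup setting.
for step in (0, 625, 1250):
    print(step, linear_lr(step, TOTAL_STEPS))
```

Under these assumptions the rate starts at 2e-05, is halved by the end of epoch 5, and reaches zero at the final step.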

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Macro avg precision | Macro avg recall | Macro avg f1 | Weighted avg precision | Weighted avg recall | Weighted avg f1 | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:-------------------:|:----------------:|:------------:|:----------------------:|:-------------------:|:---------------:|:--------:|
	| No log | 1.0 | 125 | 0.5563 | 0.8103 | 0.7235 | 0.7552 | 0.8574 | 0.8522 | 0.8495 | 0.8522 |
	| No log | 2.0 | 250 | 0.2316 | 0.8701 | 0.8460 | 0.8564 | 0.9320 | 0.9315 | 0.9310 | 0.9315 |
	| No log | 3.0 | 375 | 0.1635 | 0.8903 | 0.8729 | 0.8809 | 0.9511 | 0.9511 | 0.9508 | 0.9511 |
	| 0.5782 | 4.0 | 500 | 0.1112 | 0.9037 | 0.8964 | 0.8998 | 0.9687 | 0.9685 | 0.9685 | 0.9685 |
	| 0.5782 | 5.0 | 625 | 0.0860 | 0.9110 | 0.9050 | 0.9079 | 0.9752 | 0.9752 | 0.9751 | 0.9752 |
	| 0.5782 | 6.0 | 750 | 0.0675 | 0.9160 | 0.9103 | 0.9131 | 0.9815 | 0.9814 | 0.9814 | 0.9814 |
	| 0.5782 | 7.0 | 875 | 0.0588 | 0.9189 | 0.9138 | 0.9163 | 0.9839 | 0.9839 | 0.9839 | 0.9839 |
	| 0.1073 | 8.0 | 1000 | 0.0514 | 0.9214 | 0.9155 | 0.9184 | 0.9858 | 0.9858 | 0.9858 | 0.9858 |
	| 0.1073 | 9.0 | 1125 | 0.0463 | 0.9225 | 0.9171 | 0.9197 | 0.9877 | 0.9876 | 0.9876 | 0.9876 |
	| 0.1073 | 10.0 | 1250 | 0.0442 | 0.9221 | 0.9178 | 0.9199 | 0.9883 | 0.9883 | 0.9883 | 0.9883 |


	### Framework versions

	- Transformers 4.34.1
	- Pytorch 2.1.0+cu118
	- Datasets 2.14.6
	- Tokenizers 0.14.1