|
--- |
|
license: apache-2.0 |
|
base_model: line-corporation/line-distilbert-base-japanese |
|
tags: |
|
- generated_from_trainer |
|
model-index: |
|
- name: fluency-score-classification-ja |
|
results: [] |
|
--- |
|
|
|
|
|
# fluency-score-classification-ja |
|
|
|
This model is a fine-tuned version of [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) on the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main) (Japanese grammatical error dataset).
|
It achieves the following results on the evaluation set: |
|
- Loss: 0.1912 |
|
- ROC AUC: 0.9811 |
|
|
|
## Model description |
|
This model wraps [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) with [DistilBertForSequenceClassification](https://huggingface.co/docs/transformers/v4.34.0/en/model_doc/distilbert#transformers.DistilBertForSequenceClassification) to make a binary classifier. |
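
As a rough sketch (not the exact training code), the classification head can be attached to the base checkpoint as shown below; `num_labels=2` is an assumption that matches the `[not_fluent, fluent]` output used later in this card:

```python
from transformers import DistilBertForSequenceClassification

# Sketch only: load the base Japanese DistilBERT checkpoint and add a
# randomly initialized 2-way classification head (num_labels=2 is assumed
# to match the [not_fluent, fluent] labels described in this card).
model = DistilBertForSequenceClassification.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", num_labels=2)
```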
|
|
|
## Intended uses & limitations |
|
This model classifies whether a given Japanese text is fluent (i.e., free of grammatical errors).
|
Example usage: |
|
|
|
```python
# Load the tokenizer & the model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "liwii/fluency-score-classification-ja")

# Tokenize the inputs
input_tokens = tokenizer([
    '黒い猫が',
    '黒い猫がいます',
    'あっちの方で黒い猫があくびをしています',
    'あっちの方でで黒い猫ががあくびをしています',
    'ある日の暮方の事である。一人の下人が、羅生門の下で雨やみを待っていた。'
], return_tensors='pt', padding=True)

# Make predictions without computing gradients
with torch.no_grad():
    output = model(**input_tokens)
    # Probabilities of [not_fluent, fluent]
    probs = torch.nn.functional.softmax(output.logits, dim=1)

probs[:, 1]  # => tensor([0.1007, 0.2416, 0.5635, 0.0453, 0.7701])
```
|
|
|
The scores may be low for short sentences even if they contain no grammatical errors, because the training dataset consists of long sentences.
|
|
|
## Training and evaluation data |
|
From the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main), 512 rows were used as the evaluation dataset and the rest as the training dataset.
|
In each split, the "original" sentences are used as examples with the "fluent" label and the "perturbed" sentences as examples with the "not fluent" label.
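
A minimal sketch of that labeling scheme, assuming the dataset has been exported to a table with `original` and `perturbed` columns (the actual file layout of ja_perturbed may differ, and the file name below is hypothetical):

```python
import pandas as pd

# Assumption: "ja_perturbed.tsv" is a hypothetical local export of the dataset
# with "original" and "perturbed" columns; the real layout may differ.
df = pd.read_csv("ja_perturbed.tsv", sep="\t")

# Hold out 512 rows for evaluation, as described above.
eval_df = df.sample(n=512, random_state=42)
train_df = df.drop(eval_df.index)

def to_examples(split):
    # label 1 = fluent ("original"), label 0 = not fluent ("perturbed")
    return ([(text, 1) for text in split["original"]] +
            [(text, 0) for text in split["perturbed"]])

train_examples = to_examples(train_df)
eval_examples = to_examples(eval_df)
```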
|
|
|
## Training procedure |
|
The model was fine-tuned for 5 epochs, with the parameters of the original DistilBERT backbone frozen so that only the classification head was updated.
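
A minimal sketch of the freezing step, assuming the classifier is a `DistilBertForSequenceClassification` instance as described above:

```python
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", num_labels=2)

# Freeze the DistilBERT backbone so that only the classification head
# (pre_classifier + classifier) receives gradient updates.
for param in model.distilbert.parameters():
    param.requires_grad = False
```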
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (a sketch mapping them onto `TrainingArguments` follows the list):
|
- learning_rate: 1e-05 |
|
- train_batch_size: 64 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- distributed_type: tpu |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 5 |
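
A hedged sketch of how the values above map onto `transformers.TrainingArguments`; the original training script is not included in this card, so treat this as illustrative only:

```python
from transformers import TrainingArguments

# Illustrative only: argument names follow the standard Trainer API, but the
# actual training script used for this model is not published here.
training_args = TrainingArguments(
    output_dir="fluency-score-classification-ja",  # hypothetical output path
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```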
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | ROC AUC |
|:-------------:|:-----:|:----:|:---------------:|:-------:|
| 0.4582        | 1.0   | 647  | 0.2887          | 0.9679  |
| 0.2664        | 2.0   | 1294 | 0.2224          | 0.9761  |
| 0.2177        | 3.0   | 1941 | 0.2047          | 0.9793  |
| 0.1899        | 4.0   | 2588 | 0.1944          | 0.9807  |
| 0.1865        | 5.0   | 3235 | 0.1912          | 0.9811  |
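
ROC AUC is computed over the binary fluent / not fluent labels of the held-out rows. As a toy illustration of the metric (not the actual evaluation script), assuming scikit-learn is available:

```python
from sklearn.metrics import roc_auc_score

# Toy example: labels use 1 = fluent, 0 = not fluent; scores are the
# model's P(fluent) for each sentence.
labels = [1, 0, 1, 0]
scores = [0.92, 0.13, 0.78, 0.40]
print(roc_auc_score(labels, scores))  # => 1.0 for this perfectly separable toy case
```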
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.34.0 |
|
- Pytorch 2.0.0+cu118 |
|
- Datasets 2.14.5 |
|
- Tokenizers 0.14.0 |