|
--- |
|
license: apache-2.0 |
|
base_model: line-corporation/line-distilbert-base-japanese |
|
tags: |
|
- generated_from_trainer |
|
model-index: |
|
- name: fluency-score-classification-ja |
|
results: [] |
|
--- |
|
|
|
|
|
# fluency-score-classification-ja |
|
|
|
This model is a fine-tuned version of [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) on the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main) (Japanese grammatical error dataset).
|
It achieves the following results on the evaluation set: |
|
- Loss: 0.1912 |
|
- ROC AUC: 0.9811 |
|
|
|
## Model description |
|
This model wraps [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) with [DistilBertForSequenceClassification](https://huggingface.co/docs/transformers/v4.34.0/en/model_doc/distilbert#transformers.DistilBertForSequenceClassification) to make a binary classifier. |
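
As a rough sketch (not the exact training code), the classification head can be attached to the base checkpoint as shown below; `num_labels=2` is an assumption that matches the `[not_fluent, fluent]` output used later in this card:

```python
from transformers import DistilBertForSequenceClassification

# Sketch only: load the base Japanese DistilBERT checkpoint and add a
# randomly initialized 2-way classification head (num_labels=2 is assumed
# to match the [not_fluent, fluent] labels described in this card).
model = DistilBertForSequenceClassification.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", num_labels=2)
```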
|
|
|
## Intended uses & limitations |
|
This model classifies whether a given Japanese text is fluent (i.e., free of grammatical errors).
|
Example usage: |
|
|
|
```python
# Load the tokenizer & the model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "liwii/fluency-score-classification-ja")

# Tokenize the inputs
input_tokens = tokenizer([
    '黒い猫が',
    '黒い猫がいます',
    'あっちの方で黒い猫があくびをしています',
    'あっちの方でで黒い猫ががあくびをしています',
    'ある日の暮方の事である。一人の下人が、羅生門の下で雨やみを待っていた。'
], return_tensors='pt', padding=True)

# Make predictions without computing gradients
with torch.no_grad():
    output = model(**input_tokens)
    # Probabilities of [not_fluent, fluent]
    probs = torch.nn.functional.softmax(output.logits, dim=1)

probs[:, 1]  # => tensor([0.1007, 0.2416, 0.5635, 0.0453, 0.7701])
```
|
|
|
The scores may be low for short sentences even if they contain no grammatical errors, because the training dataset consists of long sentences.
|
|
|
## Training and evaluation data |
|
From the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main), 512 rows were used as the evaluation dataset and the rest as the training dataset.
|
In each split, the "original" sentences are used as examples with the "fluent" label and the "perturbed" sentences as examples with the "not fluent" label.
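
A minimal sketch of that labeling scheme, assuming the dataset has been exported to a table with `original` and `perturbed` columns (the actual file layout of ja_perturbed may differ, and the file name below is hypothetical):

```python
import pandas as pd

# Assumption: "ja_perturbed.tsv" is a hypothetical local export of the dataset
# with "original" and "perturbed" columns; the real layout may differ.
df = pd.read_csv("ja_perturbed.tsv", sep="\t")

# Hold out 512 rows for evaluation, as described above.
eval_df = df.sample(n=512, random_state=42)
train_df = df.drop(eval_df.index)

def to_examples(split):
    # label 1 = fluent ("original"), label 0 = not fluent ("perturbed")
    return ([(text, 1) for text in split["original"]] +
            [(text, 0) for text in split["perturbed"]])

train_examples = to_examples(train_df)
eval_examples = to_examples(eval_df)
```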
|
|
|
## Training procedure |
|
The model was fine-tuned for 5 epochs, with the parameters of the original DistilBERT backbone frozen so that only the classification head was updated.
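
A minimal sketch of the freezing step, assuming the classifier is a `DistilBertForSequenceClassification` instance as described above:

```python
from transformers import DistilBertForSequenceClassification

model = DistilBertForSequenceClassification.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", num_labels=2)

# Freeze the DistilBERT backbone so that only the classification head
# (pre_classifier + classifier) receives gradient updates.
for param in model.distilbert.parameters():
    param.requires_grad = False
```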
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (a sketch mapping them onto `TrainingArguments` follows the list):
|
- learning_rate: 1e-05 |
|
- train_batch_size: 64 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- distributed_type: tpu |
|
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- num_epochs: 5 |
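
A hedged sketch of how the values above map onto `transformers.TrainingArguments`; the original training script is not included in this card, so treat this as illustrative only:

```python
from transformers import TrainingArguments

# Illustrative only: argument names follow the standard Trainer API, but the
# actual training script used for this model is not published here.
training_args = TrainingArguments(
    output_dir="fluency-score-classification-ja",  # hypothetical output path
    learning_rate=1e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=5,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```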
|
|
|
### Training results |
|
|
|
| Training Loss | Epoch | Step | Validation Loss | ROC AUC |
|:-------------:|:-----:|:----:|:---------------:|:-------:|
| 0.4582        | 1.0   | 647  | 0.2887          | 0.9679  |
| 0.2664        | 2.0   | 1294 | 0.2224          | 0.9761  |
| 0.2177        | 3.0   | 1941 | 0.2047          | 0.9793  |
| 0.1899        | 4.0   | 2588 | 0.1944          | 0.9807  |
| 0.1865        | 5.0   | 3235 | 0.1912          | 0.9811  |
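
ROC AUC is computed over the binary fluent / not fluent labels of the held-out rows. As a toy illustration of the metric (not the actual evaluation script), assuming scikit-learn is available:

```python
from sklearn.metrics import roc_auc_score

# Toy example: labels use 1 = fluent, 0 = not fluent; scores are the
# model's P(fluent) for each sentence.
labels = [1, 0, 1, 0]
scores = [0.92, 0.13, 0.78, 0.40]
print(roc_auc_score(labels, scores))  # => 1.0 for this perfectly separable toy case
```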
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.34.0 |
|
- Pytorch 2.0.0+cu118 |
|
- Datasets 2.14.5 |
|
- Tokenizers 0.14.0 |