---
language: ja
license: cc-by-sa-4.0
tags:
- generated_from_trainer
- text-classification
metrics:
- accuracy
widget:
- text: 💪(^ω^ 🍤)
  example_title: Facemark 1
- text: (੭ु∂∀6)੭ु⁾⁾ ஐ•*¨*•.¸¸
  example_title: Facemark 2
- text: :-P
  example_title: Facemark 3
- text: (o.o)
  example_title: Facemark 4
- text: (10/7~)
  example_title: Non-facemark 1
- text: ??<<「ニャア(しゃーねぇな)」プイッ
  example_title: Non-facemark 2
- text: (0.01)
  example_title: Non-facemark 3
base_model: cl-tohoku/bert-base-japanese-whole-word-masking
---


# Facemark Detection

This model classifies a given text as a facemark (1) or not (0).

This model is a fine-tuned version of [cl-tohoku/bert-base-japanese-whole-word-masking](https://huggingface.co/cl-tohoku/bert-base-japanese-whole-word-masking) on an original facemark dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

## Model description

This model classifies a given text as a facemark (1) or not (0).
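
A minimal inference sketch using the 🤗 Transformers `pipeline` API. The model id below is a placeholder for wherever this checkpoint is hosted, and the base model's Japanese tokenizer typically requires `fugashi` and `ipadic` to be installed:

```python
from transformers import pipeline

# "your-username/facemark-detection" is a placeholder: replace it with the
# actual repository id (or a local path) of this fine-tuned checkpoint.
classifier = pipeline("text-classification", model="your-username/facemark-detection")

print(classifier("💪(^ω^ 🍤)"))  # expected: the facemark class (1)
print(classifier("(3,000円)"))   # expected: the non-facemark class (0)
```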

## Intended uses & limitations

Extract a facemark-prone portion of text and pass it to the model.
Candidate facemarks can be extracted with a regex, but regex extraction usually also picks up many non-facemarks; this model is meant to filter those out.

For example, I used the following regex pattern in Perl to extract facemark-prone text.

```perl
use strict;
use warnings;
use utf8;

my $input_text = "facemark prone text";

# Character classes: ordinary text vs. everything else.
my $text          = '[0-9A-Za-zぁ-ヶ一-龠]';
my $non_text      = '[^0-9A-Za-zぁ-ヶ一-龠]';
my $allow_text    = '[ovっつ゜ニノ三二]';  # ordinary characters that may still appear around a facemark
my $hw_kana       = '[ヲ-゚]';              # half-width katakana
my $open_branket  = '[\(∩꒰(]';
my $close_branket = '[\)∩꒱)]';

# Facemark body: 3-8 arbitrary characters between brackets that do not start
# with a run of 3 or more plain-text / half-width-kana characters, optionally
# surrounded by non-text "decoration" characters.
my $around_face   = '(?:' . $non_text . '|' . $allow_text . ')*';
my $face          = '(?!(?:' . $text . '|' . $hw_kana . '){3,8}).{3,8}';
my $face_char     = $around_face . $open_branket . $face . $close_branket . $around_face;

my $facemark;
if ($input_text =~ /($face_char)/) {
  $facemark = $1;
}
```
Examples of facemarks are:
```
  (^U^)←
  。\n\n⊂( *・ω・ )⊃
  っ(。>﹏<)
  タカ( ˘ω' ) ヤスゥ…
  。(’↑▽↑)
  ……💰( ˘ω˘ )💰
  ーーー(*´꒳`*)!(
  …(o:∇:o)
 !!…(;´Д`)?
  (*´﹃ `*)✿
```
Examples of non-facemarks are:
```
  (3,000円)
  : (1/3)
  (@nVApO)
  (10/7~)
  ?<<「ニャア(しゃーねぇな)」プイッ
  (残り 51字)
  (-0.1602)
  (25-0)
  (コーヒー飲んだ)
  (※軽トラ)
```

This model is intended to be used on facemark-prone text like the above.
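
The sketch below shows one way the extraction and classification steps might be combined in Python. It uses a deliberately simplified candidate regex rather than the full Perl pattern above, and the model id is again a placeholder:

```python
import re

from transformers import pipeline

# Placeholder model id; replace with the actual repository id of this checkpoint.
classifier = pipeline("text-classification", model="your-username/facemark-detection")

# Simplified candidate pattern (much looser than the Perl pattern above): a short
# bracketed span plus a little surrounding context. It over-extracts on purpose;
# the classifier is what separates facemarks from ordinary parentheses.
CANDIDATE = re.compile(r".{0,3}[\(∩꒰（].{3,8}[\)∩꒱）].{0,3}")

def extract_facemarks(text: str) -> list[str]:
    """Return the candidate substrings of `text` that the model labels as facemarks."""
    candidates = CANDIDATE.findall(text)
    if not candidates:
        return []
    results = classifier(candidates)
    # Depending on the checkpoint config, the positive label may be "1" or "LABEL_1".
    return [c for c, r in zip(candidates, results) if r["label"].endswith("1")]

print(extract_facemarks("タカ( ˘ω' ) ヤスゥ… と (3,000円)"))
```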

## Training and evaluation data

Facemark data was collected manually and automatically from Twitter timelines.

* train.csv: 35,591 samples (29,911 facemark, 5,680 non-facemark)
* test.csv: 3,954 samples (3,315 facemark, 639 non-facemark)
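
The exact CSV layout is not shown on this card. `run_glue.py` with custom CSV files expects a `label` column plus a text column, so the files presumably look roughly like this (the `sentence` column name is an assumption):

```
sentence,label
💪(^ω^ 🍤),1
(10/7~),0
```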

## Training procedure

```bash
# run from the root of the 🤗 Transformers repository checkout
python ./examples/pytorch/text-classification/run_glue.py \
   --model_name_or_path=cl-tohoku/bert-base-japanese-whole-word-masking \
   --do_train --do_eval \
   --max_seq_length=128 --per_device_train_batch_size=32 \
   --use_fast_tokenizer=False --learning_rate=2e-5 --num_train_epochs=50  \
   --output_dir=facemark_classify \
   --save_steps=1000 --save_total_limit=3 \
   --train_file=train.csv \
   --validation_file=test.csv 
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 50.0

### Training results

It achieves the following results on the evaluation set:
- Loss: 0.1301
- Accuracy: 0.9896

### Framework versions

- Transformers 4.26.0.dev0
- Pytorch 1.11.0+cu102
- Datasets 2.7.1
- Tokenizers 0.13.2