---
inference: false
language: ja
license: apache-2.0
mask_token: "[MASK]"
widget:
- text: "LINE株式会社で[MASK]の研究・開発をしている。"
---

# LINE DistilBERT Japanese

This is a DistilBERT model pre-trained on 131 GB of Japanese web text.
The teacher model is a BERT-base model built in-house at LINE.
The model was trained by [LINE Corporation](https://linecorp.com/).

## For Japanese

A Japanese version of this README is available at https://github.com/line/LINE-DistilBERT-Japanese/blob/main/README_ja.md.

## How to use

```python
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("line-corporation/line-distilbert-base-japanese", trust_remote_code=True)
model = AutoModel.from_pretrained("line-corporation/line-distilbert-base-japanese")

sentence = "LINE株式会社で[MASK]の研究・開発をしている。"
print(model(**tokenizer(sentence, return_tensors="pt")))
```
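
The snippet above only prints the raw encoder outputs. To actually predict the masked token, the same checkpoint can be loaded with a masked-LM head, as in the sketch below; this assumes the repository also ships the MLM head weights (if it does not, `transformers` will warn that the head is newly initialized and the predictions will be meaningless).

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL = "line-corporation/line-distilbert-base-japanese"

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(MODEL)
model.eval()

sentence = "LINE株式会社で[MASK]の研究・開発をしている。"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and list the top-5 candidate subwords for it.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos[0]].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```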

### Requirements

```txt
fugashi
sentencepiece
unidic-lite
```

## Model architecture

The model architecture is the DistilBERT base model: 6 layers, 768-dimensional hidden states, 12 attention heads, and 66M parameters.
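
As a quick sanity check, these numbers can be read from the published configuration. The sketch below assumes the checkpoint uses a standard `DistilBertConfig` (field names `n_layers`, `dim`, `n_heads`):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("line-corporation/line-distilbert-base-japanese")
# Standard DistilBertConfig field names; expected values are 6, 768, 12 (and a 32768-token vocabulary).
print(config.n_layers, config.dim, config.n_heads, config.vocab_size)
```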

## Evaluation

The evaluation by [JGLUE](https://github.com/yahoojapan/JGLUE) is as follows:

| model name             | #Params | MARC-ja | JNLI |       JSTS       |  JSQuAD   | JCommonsenseQA |
|------------------------|:-------:|:-------:|:----:|:----------------:|:---------:|:--------------:|
|                        |         |   acc   | acc  | Pearson/Spearman |   EM/F1   |      acc       |
| LINE-DistilBERT        |   68M   |  95.6   | 88.9 |    89.2/85.1     | 87.3/93.3 |      76.1      |
| Laboro-DistilBERT      |   68M   |  94.7   | 82.0 |    87.4/82.7     | 70.2/87.3 |      73.2      |
| BandaiNamco-DistilBERT |   68M   |  94.6   | 81.6 |    86.8/82.1     | 80.0/88.0 |      66.5      |

## Tokenization

The texts are first tokenized by MeCab with the Unidic dictionary and then split into subwords by the SentencePiece algorithm. The vocabulary size is 32768.
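
The sketch below shows what this pipeline produces, using the example sentence from above with the mask replaced by an arbitrary word for illustration; the exact subword split depends on the shipped SentencePiece model.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "line-corporation/line-distilbert-base-japanese", trust_remote_code=True
)

# MeCab (Unidic) pre-tokenization followed by SentencePiece subword splitting.
print(tokenizer.tokenize("LINE株式会社で自然言語処理の研究・開発をしている。"))
print(tokenizer.vocab_size)  # expected: 32768
```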

## Licenses

The pretrained models are distributed under the terms of the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

## To cite this work

We haven't published any paper on this work. Please cite [this GitHub repository](http://github.com/line/LINE-DistilBERT-Japanese):

```bibtex
@misc{LINE-DistilBERT-Japanese,
  title = {LINE DistilBERT Japanese},
  author = {Koga, Kobayashi and Li, Shengzhe and Nakamachi, Akifumi and Sato, Toshinori},
  year = {2023},
  howpublished = {\url{http://github.com/line/LINE-DistilBERT-Japanese}}
}
```