roberta-long-japanese (jumanpp + sentencepiece, mC4 Japanese)

This is a long-input version of the RoBERTa Japanese model, pretrained on approximately 200M Japanese sentences. max_position_embeddings has been increased to 1282, allowing the model to handle much longer inputs than the basic RoBERTa model.

The tokenization model and logic are exactly the same as those of nlp-waseda/roberta-base-japanese. The input text should be pretokenized with Juman++ v2.0.0-rc3, and SentencePiece tokenization is then applied to the resulting whitespace-separated token sequence. See tokenizer_config.json for details.
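The pretokenization step can be sketched as a small wrapper that pipes raw text through the Juman++ command-line tool and returns a whitespace-separated string. This is a minimal sketch, not the authors' pipeline: the helper names are hypothetical, and the `--segment` flag is an assumption that should be checked against `jumanpp --help` for your installed version.

```python
import subprocess
from typing import Sequence


def normalize_segmented(line: str) -> str:
    """Collapse runs of whitespace so tokens are separated by single spaces."""
    return " ".join(line.split())


def segment_with_jumanpp(
    text: str,
    command: Sequence[str] = ("jumanpp", "--segment"),  # assumed CLI invocation
) -> str:
    """Send one line of raw text to Juman++ and return the segmented line.

    Requires the jumanpp binary on PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        list(command),
        input=text + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return normalize_segmented(result.stdout)


# The pure helper can be used on output that is already segmented.
print(normalize_segmented("まさに  オール マイ ティー な 商品 だ 。\n"))
```

The returned string is what would then be passed to the tokenizer, which applies SentencePiece to the whitespace-separated tokens.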

How to use

Please install Juman++ v2.0.0-rc3 and SentencePiece in advance.

You can load the model and the tokenizer via AutoModel and AutoTokenizer, respectively.

>>> from transformers import AutoModel, AutoTokenizer
>>> model = AutoModel.from_pretrained("megagonlabs/roberta-long-japanese")
>>> tokenizer = AutoTokenizer.from_pretrained("megagonlabs/roberta-long-japanese")
>>> model(**tokenizer("まさに オール マイ ティー な 商品 だ 。", return_tensors="pt")).last_hidden_state
tensor([[[ 0.1549, -0.7576,  0.1098,  ...,  0.7124,  0.8062, -0.9880],
         [-0.6586, -0.6138, -0.5253,  ...,  0.8853,  0.4822, -0.6463],
         [-0.4502, -1.4675, -0.4095,  ...,  0.9053, -0.2017, -0.7756],
         ...,
         [ 0.3505, -1.8235, -0.6019,  ..., -0.0906, -0.5479, -0.6899],
         [ 1.0524, -0.8609, -0.6029,  ...,  0.1022, -0.6802,  0.0982],
         [ 0.6519, -0.2042, -0.6205,  ..., -0.0738, -0.0302, -0.1955]]],
       grad_fn=<NativeLayerNormBackward0>)

Model architecture

The model architecture is almost the same as that of nlp-waseda/roberta-base-japanese, except that max_position_embeddings has been increased to 1282: 12 layers, 768-dimensional hidden states, and 12 attention heads.
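To relate max_position_embeddings to a practical input length, the sketch below assumes the standard Hugging Face RoBERTa convention, in which position ids start at padding_idx + 1, so a model with 1282 position embeddings and a padding index of 1 accepts up to 1280 tokens. The padding index, the count of two special tokens per sequence, and the helper names are assumptions for illustration; check the model's config.json and tokenizer before relying on them.

```python
from typing import List

MAX_POSITION_EMBEDDINGS = 1282  # stated in this model card
PADDING_IDX = 1                 # assumption: usual RoBERTa pad token id
NUM_SPECIAL_TOKENS = 2          # assumption: one start and one end token


def max_input_tokens(
    max_position_embeddings: int = MAX_POSITION_EMBEDDINGS,
    padding_idx: int = PADDING_IDX,
) -> int:
    """Longest sequence the position embedding table can index.

    Position ids run from padding_idx + 1 upward, so the first
    padding_idx + 1 rows are not available for real tokens.
    """
    return max_position_embeddings - (padding_idx + 1)


def split_into_windows(token_ids: List[int], window: int) -> List[List[int]]:
    """Split a long list of token ids into consecutive, non-overlapping windows."""
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]


limit = max_input_tokens()
content_budget = limit - NUM_SPECIAL_TOKENS  # tokens left for the text itself
windows = split_into_windows(list(range(3000)), content_budget)
print(limit, content_budget, [len(w) for w in windows])
```

Non-overlapping windows are the simplest choice; a stride with overlap may suit tasks where context across window boundaries matters.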

Training data and libraries

This model was trained on Japanese text extracted from mC4, Common Crawl's multilingual web crawl corpus. We used Sudachi to split the text into sentences and applied a simple rule-based filter to remove nonlinguistic segments of the mC4 multilingual corpus. The extracted text contains over 600M sentences in total, of which we used approximately 200M for pretraining.
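The actual filtering rules are not described here, so the following is only a hypothetical illustration of what a simple rule-based filter for nonlinguistic segments might look like. The character ranges, the thresholds, and the function name are all assumptions and do not reproduce the filter used for this model.

```python
import re

# Hiragana, katakana, and common CJK ideographs (illustrative ranges only).
_JAPANESE_CHAR = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")


def looks_linguistic(
    sentence: str,
    min_chars: int = 5,                # hypothetical threshold
    min_japanese_ratio: float = 0.5,   # hypothetical threshold
) -> bool:
    """Keep a segment only if it is long enough and mostly Japanese script."""
    text = sentence.strip()
    if len(text) < min_chars:
        return False
    japanese = len(_JAPANESE_CHAR.findall(text))
    return japanese / len(text) >= min_japanese_ratio


samples = [
    "まさにオールマイティーな商品だ。",
    "http://example.com/index.html",
    "|| >> 123",
    "OK",
]
kept = [s for s in samples if looks_linguistic(s)]
print(kept)
```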

We used the huggingface/transformers RoBERTa implementation for pretraining. Pretraining took about 700 hours on a GCP instance with 8 A100 GPUs, with Automatic Mixed Precision enabled.
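As a rough back-of-the-envelope check on the figures above, the snippet below converts the reported wall-clock time into GPU-hours and computes the share of extracted sentences used. It assumes the 700 hours are wall-clock time on all 8 GPUs and treats "over 600M" as exactly 600M, so the fraction is an upper bound.

```python
WALL_CLOCK_HOURS = 700              # reported pretraining time (approximate)
GPUS_PER_INSTANCE = 8               # A100 GPUs on the GCP instance
SENTENCES_EXTRACTED = 600_000_000   # lower bound ("over 600M")
SENTENCES_USED = 200_000_000        # approximate

# Total accelerator time, assuming all 8 GPUs ran for the full duration.
gpu_hours = WALL_CLOCK_HOURS * GPUS_PER_INSTANCE

# Upper bound on the share of extracted sentences used for pretraining.
used_fraction = SENTENCES_USED / SENTENCES_EXTRACTED

print(f"about {gpu_hours} A100 GPU-hours")
print(f"at most {used_fraction:.0%} of the extracted sentences")
```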

Licenses

The pretrained models are distributed under the terms of the MIT License.

Citations

Contains information from mC4 which is made available under the ODC Attribution License.

@article{2019t5,
    author = {Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu},
    title = {Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer},
    journal = {arXiv e-prints},
    year = {2019},
    archivePrefix = {arXiv},
    eprint = {1910.10683},
}