---
library_name: transformers
license: apache-2.0
language:
  - ja
---

# llm-jp-modernbert-base-v4-ja

This model is based on the ModernBERT-base architecture with the llm-jp-tokenizer. It was trained on the Japanese subset (3.4 TB) of the llm-jp-corpus v4 and supports a maximum sequence length of 8192.

## Usage

Please install the transformers library:

```bash
pip install "transformers>=4.48.0"
```

If your GPU supports FlashAttention 2, installing flash-attn is recommended:

```bash
pip install flash-attn --no-build-isolation
```
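
With flash-attn installed, the model can be loaded with FlashAttention 2 enabled. The snippet below is a minimal sketch; the bfloat16 dtype is an assumption on our part, not a setting taken from this card.

```python
import torch
from transformers import AutoModelForMaskedLM

# Minimal sketch: load the model with FlashAttention 2 enabled.
# bfloat16 is an assumption; FlashAttention 2 requires a half-precision dtype.
model = AutoModelForMaskedLM.from_pretrained(
    "llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```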

Using AutoModelForMaskedLM:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "日本の首都は<MASK|LLM-jp>です。"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Get the prediction for the masked position:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token:  東京
```
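
The same fill-mask step can also be run through the pipeline API. This is a minimal sketch rather than part of the original usage example:

```python
from transformers import pipeline

# Minimal sketch: the fill-mask pipeline handles tokenization, mask lookup,
# and decoding internally.
fill_mask = pipeline("fill-mask", model="llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k")
for candidate in fill_mask("日本の首都は<MASK|LLM-jp>です。", top_k=3):
    print(candidate["token_str"], candidate["score"])
```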

## Training

This model was trained with a max_seq_len of 1024 in stage 1, and then with a max_seq_len of 8192 in stage 2.

|                    | stage 1      | stage 2 |
|--------------------|--------------|---------|
| max_seq_len        | 1024         | 8192    |
| max_steps          | 500,000      | 200,000 |
| Total batch size   | 3328         | 384     |
| Peak LR            | 5e-4         | 5e-5    |
| warmup step        | 24,000       |         |
| LR schedule        | Linear decay |         |
| Adam beta 1        | 0.9          |         |
| Adam beta 2        | 0.98         |         |
| Adam eps           | 1e-6         |         |
| MLM prob           | 0.30         |         |
| Gradient clipping  | 1.0          |         |
| weight decay       | 1e-5         |         |
| line_by_line       | True         |         |

Blank cells in the stage 2 column indicate the same value as in stage 1.

In theory, stage 1 consumes 1.7T tokens (500,000 steps × 3,328 batch size × 1,024 tokens), but because inputs are fed line by line, sequences shorter than 1,024 tokens do not fill the full context, so the actual number of tokens consumed is lower. Stage 2 theoretically consumes 0.6T tokens (200,000 × 384 × 8,192).
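
As a quick sanity check of these budgets (a worked calculation, not code from the original card):

```python
# Theoretical token budget per stage: max_steps * total_batch_size * max_seq_len.
stage1_tokens = 500_000 * 3_328 * 1_024   # ≈ 1.70e12 (1.7T)
stage2_tokens = 200_000 * 384 * 8_192     # ≈ 6.29e11 (0.6T)
print(f"stage 1: {stage1_tokens:.2e} tokens, stage 2: {stage2_tokens:.2e} tokens")
```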

For reference, Warner et al.'s ModernBERT uses 1.72T tokens for stage 1, 250B tokens for stage 2, and 50B tokens for stage 3.

## Evaluation

For the sentence classification tasks, the JSTS, JNLI, and JCoLA datasets from JGLUE were used. For the zero-shot sentence retrieval task, the Japanese subset of the miracl/miracl dataset was used. The evaluation code is available at https://github.com/speed1313/bert-eval.
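
For reference, a classification head can be attached in the standard transformers way before fine-tuning on a JGLUE-style task. The sketch below is illustrative only and is not the linked evaluation code; num_labels=3 assumes a three-way NLI label set as in JNLI, and the example sentences are made up.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative sketch (not the linked evaluation code): attach an untrained
# classification head for fine-tuning on a JGLUE-style task.
model_id = "llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Sentence-pair input, as in JNLI (example sentences are made up).
inputs = tokenizer("今日は晴れです。", "今日は天気が良い。", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 3]), from the untrained head
```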

| Model | JSTS | JNLI | JCoLA | Avg (JGLUE) | miracl | Avg |
|-------|------|------|-------|-------------|--------|-----|
| tohoku-nlp/bert-base-japanese-v3 | 0.9196 | 0.9117 | 0.8798 | 0.9037 | 0.74 | 0.8628 |
| sbintuitions/modernbert-ja-130m | 0.9159 | 0.9273 | 0.8682 | 0.9038 | 0.5069 | 0.8046 |
| sbintuitions/modernbert-ja-310m | 0.9317 | 0.9326 | 0.8832 | 0.9158 | 0.6569 | 0.8511 |
| llm-jp-modernbert-base-v3-stage1-500k | 0.9247 | 0.917 | 0.8555 | 0.8991 | 0.5515 | 0.8122 |
| llm-jp-modernbert-base-v3-stage2-200k | 0.9238 | 0.9108 | 0.8439 | 0.8928 | 0.5384 | 0.8042 |
| llm-jp-modernbert-base-v4-ja-stage1-100k | 0.9213 | 0.9182 | 0.8613 | 0.9003 | N/A | N/A |
| llm-jp-modernbert-base-v4-ja-stage1-300k | 0.9199 | 0.9187 | 0.852 | 0.8969 | N/A | N/A |
| llm-jp-modernbert-base-v4-ja-stage1-400k | 0.9214 | 0.9203 | 0.8555 | 0.8991 | N/A | N/A |
| llm-jp-modernbert-base-v4-ja-stage1-500k | 0.9212 | 0.9195 | 0.8451 | 0.8953 | 0.6025 | 0.8221 |
| llm-jp-modernbert-base-v4-ja-stage2-200k | 0.9177 | 0.9133 | 0.8439 | 0.8916 | 0.5739 | 0.8122 |