---
library_name: transformers
license: apache-2.0
language:
- ja
---
# llm-jp-modernbert-base-v4-ja
This model is based on the ModernBERT-base architecture and uses the llm-jp-tokenizer. It was trained on the Japanese subset (3.4 TB) of the llm-jp-corpus v4 and supports a maximum sequence length of 8192.
## Usage
Please install the transformers library.

```bash
pip install "transformers>=4.48.0"
```
If your GPU supports flash-attn 2, it is recommended to install flash-attn.

```bash
pip install flash-attn --no-build-isolation
```
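With flash-attn installed, FlashAttention-2 can be enabled when loading the model via the standard `transformers` argument. A minimal sketch (the `bfloat16` dtype and the `cuda` device are assumptions, not requirements stated in this card):

```python
import torch
from transformers import AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k"

# Load with FlashAttention-2; flash-attn requires a half-precision dtype and a CUDA device.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```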
Using AutoModelForMaskedLM:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "日本の首都は<MASK|LLM-jp>です。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: 東京
```
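To inspect more than the single best candidate, the logits at the masked position can be ranked with `torch.topk`. A small illustrative extension that reuses `outputs`, `masked_index`, and `tokenizer` from the snippet above:

```python
import torch

# Top-5 candidate tokens for the masked position.
top_k = torch.topk(outputs.logits[0, masked_index], k=5)
for token_id, score in zip(top_k.indices.tolist(), top_k.values.tolist()):
    print(f"{tokenizer.decode([token_id])}\t{score:.2f}")
```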
## Training
This model was trained with a max_seq_len of 1024 in stage 1, and then with a max_seq_len of 8192 in stage 2.
| Model | stage 1 | stage 2 |
|---|---|---|
| max_seq_len | 1024 | 8192 |
| max_steps | 500,000 | 200,000 |
| Total batch size | 3328 | 384 |
| Peak LR | 5e-4 | 5e-5 |
| warmup step | 24,000 | |
| LR schedule | Linear decay | |
| Adam beta 1 | 0.9 | |
| Adam beta 2 | 0.98 | |
| Adam eps | 1e-6 | |
| MLM prob | 0.30 | |
| Gradient clipping | 1.0 | |
| weight decay | 1e-5 | |
| line_by_line | True | |
Blank cells in the stage 2 column indicate the same value as in stage 1.
In theory, stage 1 consumes 1.7T tokens (3328 × 1024 × 500,000), but because training is line-by-line, sentences with fewer than 1024 tokens produce sequences shorter than the maximum length, so the actual consumption is lower. Stage 2 theoretically consumes 0.6T tokens (384 × 8192 × 200,000).
For reference, Warner et al.'s ModernBERT uses 1.72T tokens for stage 1, 250B tokens for stage 2, and 50B tokens for stage 3.
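For readers who want to reproduce comparable settings with the `transformers` Trainer, the stage 1 row of the table roughly maps onto the arguments below. This is a hedged sketch, not the actual training script: dataset preparation, the line-by-line preprocessing, and the distributed setup that yields the total batch size of 3328 are omitted.

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

# The llm-jp tokenizer, loaded here from the released checkpoint.
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k")

# Stage 1 hyperparameters from the table above.
training_args = TrainingArguments(
    output_dir="modernbert-stage1",  # assumption: any local output path
    max_steps=500_000,
    learning_rate=5e-4,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    weight_decay=1e-5,
    max_grad_norm=1.0,
)

# MLM masking probability of 0.30, as listed in the table.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.30)
```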
## Evaluation
For the sentence classification tasks, the JSTS, JNLI, and JCoLA datasets from JGLUE were used. For the zero-shot sentence retrieval task, the Japanese subset of the miracl/miracl dataset was used. The evaluation code is available at https://github.com/speed1313/bert-eval
| Model | JSTS | JNLI | JCoLA | Avg (JGLUE) | miracl | Avg |
|---|---|---|---|---|---|---|
| tohoku-nlp/bert-base-japanese-v3 | 0.9196 | 0.9117 | 0.8798 | 0.9037 | 0.74 | 0.8628 |
| sbintuitions/modernbert-ja-130m | 0.9159 | 0.9273 | 0.8682 | 0.9038 | 0.5069 | 0.8046 |
| sbintuitions/modernbert-ja-310m | 0.9317 | 0.9326 | 0.8832 | 0.9158 | 0.6569 | 0.8511 |
| llm-jp-modernbert-base-v3-stage1-500k | 0.9247 | 0.917 | 0.8555 | 0.8991 | 0.5515 | 0.8122 |
| llm-jp-modernbert-base-v3-stage2-200k | 0.9238 | 0.9108 | 0.8439 | 0.8928 | 0.5384 | 0.8042 |
| llm-jp-modernbert-base-v4-ja-stage1-100k | 0.9213 | 0.9182 | 0.8613 | 0.9003 | N/A | N/A |
| llm-jp-modernbert-base-v4-ja-stage1-300k | 0.9199 | 0.9187 | 0.852 | 0.8969 | N/A | N/A |
| llm-jp-modernbert-base-v4-ja-stage1-400k | 0.9214 | 0.9203 | 0.8555 | 0.8991 | N/A | N/A |
| llm-jp-modernbert-base-v4-ja-stage1-500k | 0.9212 | 0.9195 | 0.8451 | 0.8953 | 0.6025 | 0.8221 |
| llm-jp-modernbert-base-v4-ja-stage2-200k | 0.9177 | 0.9133 | 0.8439 | 0.8916 | 0.5739 | 0.8122 |
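For the sentence classification tasks, the checkpoint can be fine-tuned with a standard sequence-classification head. A minimal sketch (the label count and the example sentence pair are illustrative assumptions; the actual evaluation uses the code linked above):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Three labels, as in an NLI-style task such as JNLI; JSTS would use num_labels=1 (regression).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Sentence pairs are encoded as a single input with the tokenizer's separator.
# ("It is sunny today." / "The weather is nice today.")
inputs = tokenizer("今日は晴れです。", "今日は天気が良い。", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 3])
```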