---
library_name: transformers
license: apache-2.0
language:
- ja
---
# llm-jp-modernbert-base-v4-ja
This model is based on the ModernBERT-base architecture and uses the llm-jp-tokenizer. It was trained on the Japanese subset (3.4 TB) of the llm-jp-corpus v4 and supports a maximum sequence length of 8192.
## Usage
Please install the transformers library.

```bash
pip install "transformers>=4.48.0"
```
If your GPU supports flash-attn 2, it is recommended to install flash-attn.

```bash
pip install flash-attn --no-build-isolation
```
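With flash-attn installed, FlashAttention-2 can be enabled when loading the model via the standard `transformers` argument. A minimal sketch (the `bfloat16` dtype and the `cuda` device are assumptions, not requirements stated in this card):

```python
import torch
from transformers import AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k"

# Load with FlashAttention-2; flash-attn requires a half-precision dtype and a CUDA device.
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")
```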
Using AutoModelForMaskedLM:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "日本の首都は<MASK|LLM-jp>です。"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# To get predictions for the mask:
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
# Predicted token: 東京
```
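To inspect more than the single best candidate, the logits at the masked position can be ranked with `torch.topk`. A small illustrative extension that reuses `outputs`, `masked_index`, and `tokenizer` from the snippet above:

```python
import torch

# Top-5 candidate tokens for the masked position.
top_k = torch.topk(outputs.logits[0, masked_index], k=5)
for token_id, score in zip(top_k.indices.tolist(), top_k.values.tolist()):
    print(f"{tokenizer.decode([token_id])}\t{score:.2f}")
```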
## Training
This model was trained with a max_seq_len of 1024 in stage 1, and then with a max_seq_len of 8192 in stage 2.
| Model | stage 1 | stage 2 |
|---|---|---|
| max_seq_len | 1024 | 8192 |
| max_steps | 500,000 | 200,000 |
| Total batch size | 3328 | 384 |
| Peak LR | 5e-4 | 5e-5 |
| warmup step | 24,000 | |
| LR schedule | Linear decay | |
| Adam beta 1 | 0.9 | |
| Adam beta 2 | 0.98 | |
| Adam eps | 1e-6 | |
| MLM prob | 0.30 | |
| Gradient clipping | 1.0 | |
| weight decay | 1e-5 | |
| line_by_line | True | |
Blank cells in the stage 2 column indicate the same value as in stage 1.
In theory, stage 1 consumes 1.7T tokens (3328 × 1024 × 500,000), but because training is line-by-line, sentences with fewer than 1024 tokens produce sequences shorter than the maximum length, so the actual consumption is lower. Stage 2 theoretically consumes 0.6T tokens (384 × 8192 × 200,000).
For reference, Warner et al.'s ModernBERT uses 1.72T tokens for stage 1, 250B tokens for stage 2, and 50B tokens for stage 3.
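For readers who want to reproduce comparable settings with the `transformers` Trainer, the stage 1 row of the table roughly maps onto the arguments below. This is a hedged sketch, not the actual training script: dataset preparation, the line-by-line preprocessing, and the distributed setup that yields the total batch size of 3328 are omitted.

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    TrainingArguments,
)

# The llm-jp tokenizer, loaded here from the released checkpoint.
tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k")

# Stage 1 hyperparameters from the table above.
training_args = TrainingArguments(
    output_dir="modernbert-stage1",  # assumption: any local output path
    max_steps=500_000,
    learning_rate=5e-4,
    lr_scheduler_type="linear",
    warmup_steps=24_000,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    weight_decay=1e-5,
    max_grad_norm=1.0,
)

# MLM masking probability of 0.30, as listed in the table.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.30)
```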
## Evaluation
For the sentence classification tasks, the JSTS, JNLI, and JCoLA datasets from JGLUE were used. For the zero-shot sentence retrieval task, the Japanese subset of the miracl/miracl dataset was used. The evaluation code is available at https://github.com/speed1313/bert-eval
| Model | JSTS | JNLI | JCoLA | Avg (JGLUE) | miracl | Avg |
|---|---|---|---|---|---|---|
| tohoku-nlp/bert-base-japanese-v3 | 0.9196 | 0.9117 | 0.8798 | 0.9037 | 0.74 | 0.8628 |
| sbintuitions/modernbert-ja-130m | 0.9159 | 0.9273 | 0.8682 | 0.9038 | 0.5069 | 0.8046 |
| sbintuitions/modernbert-ja-310m | 0.9317 | 0.9326 | 0.8832 | 0.9158 | 0.6569 | 0.8511 |
| llm-jp-modernbert-base-v3-stage1-500k | 0.9247 | 0.917 | 0.8555 | 0.8991 | 0.5515 | 0.8122 |
| llm-jp-modernbert-base-v3-stage2-200k | 0.9238 | 0.9108 | 0.8439 | 0.8928 | 0.5384 | 0.8042 |
| llm-jp-modernbert-base-v4-ja-stage1-100k | 0.9213 | 0.9182 | 0.8613 | 0.9003 | N/A | N/A |
| llm-jp-modernbert-base-v4-ja-stage1-300k | 0.9199 | 0.9187 | 0.852 | 0.8969 | N/A | N/A |
| llm-jp-modernbert-base-v4-ja-stage1-400k | 0.9214 | 0.9203 | 0.8555 | 0.8991 | N/A | N/A |
| llm-jp-modernbert-base-v4-ja-stage1-500k | 0.9212 | 0.9195 | 0.8451 | 0.8953 | 0.6025 | 0.8221 |
| llm-jp-modernbert-base-v4-ja-stage2-200k | 0.9177 | 0.9133 | 0.8439 | 0.8916 | 0.5739 | 0.8122 |
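For the sentence classification tasks, the checkpoint can be fine-tuned with a standard sequence-classification head. A minimal sketch (the label count and the example sentence pair are illustrative assumptions; the actual evaluation uses the code linked above):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "llm-jp/llm-jp-modernbert-base-v4-ja-stage2-200k"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Three labels, as in an NLI-style task such as JNLI; JSTS would use num_labels=1 (regression).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Sentence pairs are encoded as a single input with the tokenizer's separator.
# ("It is sunny today." / "The weather is nice today.")
inputs = tokenizer("今日は晴れです。", "今日は天気が良い。", return_tensors="pt")
logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 3])
```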