	---
	library_name: transformers
	license: apache-2.0
	language:
	- ja
	- en
	---

	# Retrieva BERT Model
	RetrievaBERT is a Transformer encoder pre-trained with Megatron-LM.
	It is designed for Japanese text.

	## Model Details

	### Model Description

	Compared with traditional BERT models, this model has the following characteristics:
	- PreNorm: Improved stability during training.
	- SwiGLU: Enhanced activation function for better performance.
	- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
	- Max Sequence Length: 2048 tokens, allowing for longer context.
	- Parameters: 1.3 billion parameters.
	- Pre-training Objective: Masked Language Modeling (MLM) only; Next Sentence Prediction (NSP) is not used.
	- Token Type IDs: Not used in this model.
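To make the SwiGLU item concrete, here is a minimal pure-Python sketch of a SwiGLU feed-forward block in its commonly used form, `down(SiLU(gate(x)) * up(x))`. The names `swiglu_ffn`, `w_gate`, `w_up`, and `w_down` are illustrative, biases are omitted, and nothing here is taken from the RetrievaBERT implementation itself.

```python
import math


def silu(z: float) -> float:
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward block: down(SiLU(gate(x)) * up(x)), without biases."""
    gate = [silu(g) for g in matvec(w_gate, x)]   # gated branch
    up = matvec(w_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # element-wise product
    return matvec(w_down, hidden)                 # project back to model width


# Toy example: model width 2, FFN width 1 (hypothetical weights).
print(swiglu_ffn([1.0, 2.0], [[1.0, 0.0]], [[0.0, 1.0]], [[1.0], [2.0]]))
```

Unlike the two-matrix feed-forward block of the original BERT, this form uses three projection matrices, which matters when estimating parameter counts.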

	### Model Sources
	- Developed by: Retrieva, Inc.
	- Model type: Based on MegatronBERT Architecture.
	- Language(s) (NLP): Primarily Japanese; English is also supported.
	- License: Apache 2.0


	## Uses

	This model can be used as a Masked Language Model (MLM).
	However, it is primarily intended to be fine-tuned on downstream tasks.
	Depending on your use case, follow the appropriate section below.

	### Direct Use

	This model is pre-trained using Masked Language Modeling.
	The mask token is `<MASK|LLM-jp>`.
	Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.

	Example code for direct use:

	```python
	from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

	model_id = "retrieva-jp/bert-1.3b"
	model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
	pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)

	# English: "Hello! My name is <MASK|LLM-jp>!"
	text = "こんにちは！私の名前は<MASK|LLM-jp>です！"
	print(pipe(text))
	```

	### Downstream Use

	RetrievaBERT is compatible with Hugging Face's AutoModel classes.
	To fine-tune RetrievaBERT for a specific task, load it with the corresponding AutoModel class, passing `trust_remote_code=True` as in the direct-use example.
	For the detailed configuration, refer to the `config.json` file.
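As an illustration of the fine-tuning setup, the sketch below loads the checkpoint with a sequence-classification head. It assumes that the repository's custom code exposes a class for `AutoModelForSequenceClassification`, which should be confirmed against `config.json`; the helper names `load_for_sequence_classification` and `encode`, and the `max_length` default, are hypothetical.

```python
def load_for_sequence_classification(num_labels, model_id="retrieva-jp/bert-1.3b"):
    """Load the tokenizer and a classification model ready for fine-tuning."""
    # Imported inside the function so that defining the helper needs neither
    # the library nor network access; calling it downloads the checkpoint.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=num_labels,
        trust_remote_code=True,  # RetrievaBERT ships a custom model implementation
    )
    return tokenizer, model


def encode(tokenizer, texts, max_length=512):
    """Tokenize a batch of texts into padded PyTorch tensors."""
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_token_type_ids=False,  # the model does not use token type IDs
        return_tensors="pt",
    )


# Usage (left commented out because it downloads a 1.3B-parameter checkpoint):
# tokenizer, model = load_for_sequence_classification(num_labels=2)
# batch = encode(tokenizer, ["この映画は面白かった。"])
```

The returned model can then be passed to a standard training loop or to the `Trainer` API like any other Transformers model.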


	## Training Details

	### Training Data
	The Retrieva BERT model was pre-trained on the union of five datasets:
	- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2).
	- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb).
	- Chinese Wikipedia dumped on 20240120.
	- Korean Wikipedia dumped on 20240120.
	- [The Stack](https://huggingface.co/datasets/bigcode/the-stack).

	The model was trained on 180 billion tokens drawn from these datasets.

	### Training Procedure
	The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024.
	We adopted a curriculum similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps:

	- Sequence length 128: 31,000 steps.
	- Sequence length 256: 219,000 steps.
	- Sequence length 512: 192,000 steps.
	- Sequence length 2048: 12,000 steps.
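As a rough cross-check of the stated token budget, the stage list above can be turned into a token count. This sketch assumes that the batch size of 1,024 counts sequences, that it stayed constant across all stages, and that every sequence was filled to the stage's full length; none of these are confirmed by the card.

```python
# (sequence length, number of steps) for each curriculum stage listed above.
STAGES = [(128, 31_000), (256, 219_000), (512, 192_000), (2048, 12_000)]
BATCH_SIZE = 1_024  # sequences per step; assumed constant across stages


def tokens_per_stage(stages=STAGES, batch_size=BATCH_SIZE):
    """Token count of each stage, assuming every sequence is full length."""
    return [seq_len * steps * batch_size for seq_len, steps in stages]


per_stage = tokens_per_stage()
total_steps = sum(steps for _, steps in STAGES)
total_tokens = sum(per_stage)

for (seq_len, steps), tokens in zip(STAGES, per_stage):
    print(f"seq_len={seq_len:>4}  steps={steps:>7,}  tokens={tokens / 1e9:6.1f}B")
print(f"total steps:  {total_steps:,}")
print(f"total tokens: {total_tokens / 1e9:.1f}B")
```

Under these assumptions the stages add up to 454,000 steps and roughly 187 billion tokens, which is in the same range as the reported 180 billion; padding or partially filled sequences would lower the effective count.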

	#### Training Hyperparameters
	The model was trained with the following hyperparameters:

	- Learning rate: 1.5e-4
	- Learning rate decay style: Linear
	- Learning rate warmup fraction: 0.01
	- Minimum learning rate: 1e-6
	- Floating-point format: BF16
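The learning-rate settings above can be read as a warmup-then-linear-decay schedule. The sketch below is one plausible interpretation, not the exact Megatron-LM code path: it assumes linear warmup from zero over the first 1% of steps, linear decay to the minimum learning rate afterwards, and a single schedule spanning 454,000 steps (the sum of the curriculum stages, which is itself an assumption).

```python
MAX_LR = 1.5e-4
MIN_LR = 1e-6
WARMUP_FRACTION = 0.01
TOTAL_STEPS = 454_000  # assumption: one schedule across all curriculum stages


def learning_rate(step, total_steps=TOTAL_STEPS, max_lr=MAX_LR,
                  min_lr=MIN_LR, warmup_fraction=WARMUP_FRACTION):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = round(warmup_fraction * total_steps)
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / warmup_steps
    if step >= total_steps:
        return min_lr
    # Linear decay from max_lr (end of warmup) to min_lr (last step).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr - (max_lr - min_lr) * progress


for step in (0, 2_270, 4_540, 200_000, 454_000):
    print(f"step {step:>7,}: lr = {learning_rate(step):.3e}")
```

With these values the peak learning rate is reached after 4,540 steps, and the rate then falls steadily toward 1e-6 at the final step.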

	## Evaluation
	We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
	We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).

	| Model                             | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
	|-----------------------------------|-------------|--------------|---------------|----------|-----------|-----------|------------|
	| tohoku-nlp/bert-base-japanese-v3  | 0.957       | 0.914        | 0.876         | 0.906    | 0.878     | 0.946     | 0.849      |
	| tohoku-nlp/bert-large-japanese-v2 | 0.959       | 0.916        | 0.877         | 0.901    | 0.884     | 0.951     | 0.867      |
	| ku-nlp/deberta-v3-base-japanese   | 0.958       | 0.925        | 0.890         | 0.902    | 0.925     | 0.910     | 0.882      |
	| retrieva-jp/bert-1.3b             | 0.952       | 0.916        | 0.877         | 0.896    | 0.916     | 0.879     | 0.815      |


	## Technical Specifications

	### Model Architectures
	The Retrieva BERT model is based on BERT with the following hyperparameters:

	- Number of layers: 48
	- Hidden layer size: 1536
	- FFN hidden layer size: 4096
	- Number of attention heads: 24
	- Maximum length of position embeddings: 2048

	As mentioned earlier, the main differences from the original BERT are:
	- PreNorm: Improved stability during training.
	- SwiGLU: Enhanced activation function for better performance.
	- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.


	### Compute Infrastructure

	[TSUBAME 4](https://www.t4.gsic.titech.ac.jp/en/hardware)

	This model is based on results obtained from the TSUBAME deep-learning mini-camp.

	#### Software

	The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).

	## More Information

	https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)

	## Model Card Authors

	Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba

	## Model Card Contact
	[email protected]