---
library_name: transformers
license: apache-2.0
language:
- ja
- en
---

# Retrieva BERT Model

**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM. It is designed for use in Japanese.

## Model Details

### Model Description

Compared with traditional BERT models, RetrievaBERT offers several advanced features:

- **PreNorm**: Improved stability during training.
- **SwiGLU**: Enhanced activation function for better performance.
- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
- **Max Sequence Length**: 2048 tokens, allowing for longer context.
- **Parameters**: 1.3 billion parameters.
- **Pre-training Objective**: Masked Language Modeling (MLM) only; Next Sentence Prediction (NSP) is not used.
- **Token Type IDs**: Not used in this model.

### Model Sources

- **Developed by:** Retrieva, Inc.
- **Model type:** Based on MegatronBERT Architecture.
- **Language(s) (NLP):** Primarily Japanese (optional support for English).
- **License:** Apache 2.0

## Uses

This model can be used as a Masked Language Model (MLM). However, it is primarily intended to be fine-tuned on downstream tasks. Depending on your use case, follow the appropriate section below.

### Direct Use

This model was pre-trained using Masked Language Modeling, so it can fill in masked tokens directly. The mask token is available as `tokenizer.mask_token`. Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.

Example code for direct use:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "retrieva-jp/bert-1.3b"
# trust_remote_code=True is required because the model ships a custom implementation.
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# "Hello! My name is <mask>!" with the tokenizer's own mask token inserted.
text = f"こんにちは!私の名前は{tokenizer.mask_token}です!"
print(pipe(text))
```

### Downstream Use

RetrievaBERT is compatible with Hugging Face's AutoModels.
To fine-tune RetrievaBERT for your specific task, use the corresponding AutoModel class. For detailed configuration, refer to the `config.json` file.

## Training Details

### Training Data

The Retrieva BERT model was pre-trained on the union of five datasets:

- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2).
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb).
- Chinese Wikipedia dumped on 20240120.
- Korean Wikipedia dumped on 20240120.
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack).

The model was trained on 180 billion tokens drawn from the above datasets.

### Training Procedure

The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024. We adopted a curriculum similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps:

- Sequence length 128: 31,000 steps.
- Sequence length 256: 219,000 steps.
- Sequence length 512: 192,000 steps.
- Sequence length 2048: 12,000 steps.

#### Training Hyperparameters

The model was trained with the following hyperparameters:

- Learning rate: 1.5e-4
- Learning rate decay style: Linear
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating-point format: BF16

## Evaluation

We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set. We adjusted the learning rate and the number of training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).

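
The JSTS columns in the results table report Pearson and Spearman correlations between predicted and gold similarity scores. As a minimal sketch of what those two numbers measure, the pure-Python functions below compute both. The helper names and the sample scores are hypothetical illustrations; they are not taken from JGLUE's evaluation scripts or from this model's results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical gold and predicted similarity scores (not from the model card).
gold = [0.0, 1.2, 2.5, 3.1, 4.8]
pred = [0.3, 1.0, 2.9, 2.7, 4.5]
print(f"pearson={pearson(gold, pred):.3f} spearman={spearman(gold, pred):.3f}")
```

Pearson rewards predictions that are linearly related to the gold scores, while Spearman only checks that the two orderings agree, which is why the table lists them separately.
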
| Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
| :--- |---:|---:|---:|---:|---:|---:|---:|
| tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
| tohoku-nlp/bert-large-japanese-v2 | 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
| ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
| retrieva-jp/bert-1.3b | 0.952 | 0.916 | 0.877 | 0.896 | 0.916 | 0.879 | 0.815 |

## Technical Specifications

### Model Architectures

The Retrieva BERT model is based on BERT with the following hyperparameters:

- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048

As mentioned earlier, the main differences from the original BERT are:

- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.

### Compute Infrastructure

[TSUBAME 4](https://www.t4.gsic.titech.ac.jp/en/hardware)

This model is based on results obtained from the [TSUBAME deep-learning mini-camp](https://www.t4.gsic.titech.ac.jp/en/minicamp-dl-202406).

#### Software

The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).

## More Information

https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)

## Model Card Authors

Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba

## Model Card Contact

pr@retrieva.jp
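
## Appendix: Training Token Budget Check

The figures in the Training Procedure section can be cross-checked with simple arithmetic. The sketch below multiplies sequence length, batch size, and step count for each curriculum stage. It assumes that the batch size of 1,024 counts sequences per step and that every sequence is filled to the stage's full length. The card states neither, so the result is an upper bound rather than the exact token count.

```python
# Curriculum stages from the Training Procedure section: (sequence length, steps).
STAGES = [(128, 31_000), (256, 219_000), (512, 192_000), (2048, 12_000)]
BATCH_SIZE = 1_024  # assumption: counts sequences per optimizer step


def max_tokens(stages, batch_size):
    """Upper bound on tokens seen if every sequence is filled to its stage length."""
    return sum(seq_len * batch_size * steps for seq_len, steps in stages)


total_steps = sum(steps for _, steps in STAGES)
total_tokens = max_tokens(STAGES, BATCH_SIZE)
print(f"total steps: {total_steps:,}")
print(f"token upper bound: {total_tokens / 1e9:.1f}B")
```

Under these assumptions the bound comes to roughly 187 billion tokens over 454,000 steps, which is in line with the reported 180 billion once padding and shorter sequences are taken into account.
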