---
library_name: transformers
license: apache-2.0
language:
- ja
- en
---
# RetrievaBERT Model
**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM.
It is designed for use with Japanese text.
## What's New
- November 2024 (`v1.0.1`): Bug fix for the model parameters.
  - The bias of `up_proj` was mistakenly initialized with the bias of the gate projection; this has been fixed.
## Model Details
### Model Description
**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM.
It is designed for use with Japanese text.
This model offers several advanced features compared to traditional BERT models:
- **PreNorm**: Improved stability during training.
- **SwiGLU**: Enhanced activation function for better performance.
- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
- **Max Sequence Length**: 2048 tokens, allowing for longer context.
- **Parameters**: 1.3 billion parameters.
- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
- **Token Type IDs**: Not used in this model.
### Model Sources
- **Developed by:** Retrieva, Inc.
- **Model type:** Based on the MegatronBERT architecture.
- **Language(s) (NLP):** Primarily Japanese (optional support for English).
- **License:** Apache 2.0
## Uses
This model can be used as a Masked Language Model (MLM).
However, it is primarily intended to be fine-tuned on downstream tasks.
Depending on your use case, follow the appropriate section below.
### Direct Use
This model is pre-trained using Masked Language Modeling.
The mask token used is `<MASK|LLM-jp>`.
Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
Example code for direct use:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
model_id = "retrieva-jp/bert-1.3b"
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = "ใใใซใกใฏ๏ผ็งใฎๅๅใฏ<MASK|LLM-jp>ใงใ๏ผ"
print(pipe(text))
```
### Downstream Use
RetrievaBERT is compatible with Hugging Face's AutoModels.
To fine-tune RetrievaBERT for your specific task, use the corresponding AutoModel class.
For detailed configuration, refer to the model's `config.json` file.
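As a minimal illustration, the sketch below loads RetrievaBERT with a sequence-classification head through the Auto classes. The number of labels and the example sentence are placeholders, and whether a particular task head is exposed depends on the custom code shipped with the checkpoint, so treat this as a sketch rather than a verified recipe.
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "retrieva-jp/bert-1.3b"

# trust_remote_code=True is required because RetrievaBERT uses a custom implementation.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,  # placeholder: set to the number of classes in your task
    trust_remote_code=True,
)

# Forward pass on a single example ("This movie was very interesting.");
# fine-tuning would wrap this in a standard training loop or the Trainer.
inputs = tokenizer("この映画はとても面白かったです。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```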
## Training Details
### Training Data
The RetrievaBERT model was pre-trained on the combination of the following five datasets:
- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2).
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb).
- Chinese Wikipedia (dump of 2024-01-20).
- Korean Wikipedia (dump of 2024-01-20).
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack).
The model was trained on 180 billion tokens from these datasets.
### Training Procedure
The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024.
We adopted a curriculum learning approach similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps (a rough token-count check is sketched after the list):
- Sequence length 128: 31,000 steps.
- Sequence length 256: 219,000 steps.
- Sequence length 512: 192,000 steps.
- Sequence length 2048: 12,000 steps.
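As a rough sanity check (a sketch that assumes the global batch size of 1,024 sequences is constant across all stages), the schedule above implies a token count on the order of the reported 180 billion:
```python
# Approximate token count implied by the curriculum schedule,
# assuming a constant global batch size of 1,024 sequences per step.
batch_size = 1024
stages = [  # (sequence length, number of steps)
    (128, 31_000),
    (256, 219_000),
    (512, 192_000),
    (2048, 12_000),
]

total_tokens = sum(seq_len * batch_size * steps for seq_len, steps in stages)
print(f"{total_tokens / 1e9:.1f}B tokens")  # 187.3B, roughly the reported 180B
```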
#### Training Hyperparameters
The model was trained with the following hyperparameters:
- Learning rate: 1.5e-4.
- Learning rate decay style: Linear.
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating point format: BF16
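For illustration, the sketch below shows what a linear warmup-and-decay schedule with these values looks like. The actual schedule is implemented inside Megatron-LM; assuming the decay runs over the full 454,000 steps of the curriculum above is our simplification.
```python
# Sketch of a linear warmup + linear decay learning-rate schedule with the
# reported values. Assumes decay spans the full 454,000-step curriculum.
max_lr = 1.5e-4
min_lr = 1e-6
warmup_fraction = 0.01
total_steps = 454_000  # 31,000 + 219,000 + 192,000 + 12,000
warmup_steps = int(total_steps * warmup_fraction)

def learning_rate(step: int) -> float:
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / warmup_steps
    # Linear decay from max_lr down to min_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr - (max_lr - min_lr) * progress

print(learning_rate(0), learning_rate(warmup_steps), learning_rate(total_steps))
```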
## Evaluation
We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).
| Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
| :--- |---:|---:|---:|---:|---:|---:|---:|
| tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
| tohoku-nlp/bert-large-japanese-v2| 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
| ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
| retrieva-jp/bert-1.3b | 0.959 | 0.917 | 0.881 | 0.898 | 0.875 | 0.874 | 0.827 |
## Technical Specifications
### Model Architecture
The RetrievaBERT model is based on BERT with the following hyperparameters:
- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048
As mentioned earlier, the main differences from the original BERT are:
- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
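These values can be checked against the published configuration. A minimal sketch follows; the attribute names assume a standard BERT-style config, and the custom config may name some fields differently.
```python
from transformers import AutoConfig

# Inspect the architecture hyperparameters shipped with the checkpoint.
config = AutoConfig.from_pretrained("retrieva-jp/bert-1.3b", trust_remote_code=True)

# Attribute names assume a BERT-style config; "n/a" is printed if a field
# is named differently in the custom configuration.
for name in (
    "num_hidden_layers",       # 48
    "hidden_size",             # 1536
    "intermediate_size",       # 4096 (FFN hidden size)
    "num_attention_heads",     # 24
    "max_position_embeddings", # 2048
):
    print(name, getattr(config, name, "n/a"))
```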
### Compute Infrastructure
[TSUBAME 4](https://www.t4.gsic.titech.ac.jp/en/hardware)
This model is based on results obtained from the [TSUBAME deep-learning mini-camp](https://www.t4.gsic.titech.ac.jp/en/minicamp-dl-202406).
#### Software
The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
## More Information
https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
## Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
## Model Card Contact
[email protected]