---
library_name: transformers
license: apache-2.0
language:
- ja
- en
---
# Retrieva BERT Model
**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM.
It is designed for Japanese text.
## Model Details
### Model Description
**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM.
It is designed for Japanese text.
Compared with traditional BERT models, it differs in the following ways:
- **PreNorm**: Improved stability during training.
- **SwiGLU**: Enhanced activation function for better performance.
- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
- **Max Sequence Length**: 2048 tokens, allowing for longer context.
- **Parameters**: 1.3 billion parameters.
- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
- **Token Type IDs**: Not used in this model.
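To make the SwiGLU item concrete, here is a minimal pure-Python sketch of a SwiGLU feed-forward block in its commonly used bias-free form. It is an illustration only: the function names and the toy weights are assumptions, and the model's actual implementation lives in its remote code.

```python
import math


def silu(x: float) -> float:
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(weights, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Bias-free SwiGLU block: w_down @ (silu(w_gate @ x) * (w_up @ x)).

    Assumption: three weight matrices and no biases, as in the usual
    SwiGLU formulation; this is not taken from the model's source code.
    """
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Toy 2-dimensional example with identity weights (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
output = swiglu_ffn([0.0, 1.0], identity, identity, identity)
```

With identity weights, each output component reduces to `silu(x_i) * x_i`, which makes the gating easy to check by hand.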
### Model Sources
- **Developed by:** Retrieva, Inc.
- **Model type:** Based on MegatronBERT Architecture.
- **Language(s) (NLP):** Primarily Japanese (optional support for English).
- **License:** Apache 2.0
## Uses
This model can be used as a Masked Language Model (MLM).
However, it is primarily intended to be fine-tuned on downstream tasks.
Depending on your use case, follow the appropriate section below.
### Direct Use
This model is pre-trained using Masked Language Modeling.
The mask token used is `<MASK|LLM-jp>`.
Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
Example code for direct use:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
model_id = "retrieva-jp/bert-1.3b"
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Japanese input: "Hello! My name is <MASK|LLM-jp>!"
text = "こんにちは!私の名前は<MASK|LLM-jp>です!"
print(pipe(text))
```
### Downstream Use
RetrievaBERT is compatible with Hugging Face's Auto classes (`AutoModel*`).
To fine-tune RetrievaBERT for your specific task, use the corresponding AutoModel class.
For detailed configuration, refer to the `config.json` file.
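As a rough sketch of what "use the corresponding AutoModel class" could look like for a classification task, the helper below loads the checkpoint with a sequence-classification head. The helper name `load_classifier` and the default `num_labels=2` are hypothetical, and the sketch assumes the repository's custom code provides a sequence-classification head; check `config.json` before relying on it.

```python
MODEL_ID = "retrieva-jp/bert-1.3b"


def load_classifier(model_id: str = MODEL_ID, num_labels: int = 2):
    """Load the tokenizer and a classification model for fine-tuning.

    Assumptions: `num_labels=2` is a placeholder for your task, and the
    checkpoint's remote code supports AutoModelForSequenceClassification.
    """
    # Imported inside the function so that defining the helper does not
    # require transformers to be installed or the weights to be downloaded.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_id,
        num_labels=num_labels,
        trust_remote_code=True,  # required by the custom model implementation
    )
    return tokenizer, model
```

Calling `load_classifier()` downloads the 1.3B-parameter checkpoint, so it is best run on a machine with enough memory. The returned pair can then be passed to any standard fine-tuning loop.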
## Training Details
### Training Data
The Retrieva BERT model was pre-trained on the union of five datasets:
- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2).
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb).
- Chinese Wikipedia (dump of 2024-01-20).
- Korean Wikipedia (dump of 2024-01-20).
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack).
The model was trained on 180 billion tokens drawn from the above datasets.
### Training Procedure
The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024.
We adopted a curriculum similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps:
- Sequence length 128: 31,000 steps.
- Sequence length 256: 219,000 steps.
- Sequence length 512: 192,000 steps.
- Sequence length 2048: 12,000 steps.
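As a sanity check on the curriculum, the following sketch adds up the steps and computes an upper bound on the number of tokens seen. It assumes the batch size of 1,024 counts sequences and that every sequence is filled to the stage's full length; both are assumptions, not statements from the training logs.

```python
# (sequence length, number of steps) for each curriculum stage listed above.
STAGES = [(128, 31_000), (256, 219_000), (512, 192_000), (2048, 12_000)]
BATCH_SIZE = 1_024  # assumed to be measured in sequences per step


def tokens_upper_bound(stages, batch_size):
    """Tokens processed if every sequence is filled to its maximum length."""
    return sum(batch_size * seq_len * steps for seq_len, steps in stages)


total_steps = sum(steps for _, steps in STAGES)
total_tokens = tokens_upper_bound(STAGES, BATCH_SIZE)
print(f"{total_steps:,} steps, at most {total_tokens / 1e9:.1f}B tokens")
# prints "454,000 steps, at most 187.3B tokens"
```

The bound of roughly 187 billion tokens is in line with the 180 billion tokens reported under Training Data, since real batches contain some padding.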
#### Training Hyperparameters
The model was trained with the following hyperparameters:
- Learning rate: 1.5e-4
- Learning rate decay style: Linear
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating-point format: BF16
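The learning-rate settings above can be read as a linear warmup followed by a linear decay. The sketch below is one plausible interpretation, not the exact Megatron-LM scheduler: it assumes the warmup starts from zero, and the function name `learning_rate` and the `total_steps` value used in the example are hypothetical.

```python
def learning_rate(step, total_steps, peak_lr=1.5e-4, min_lr=1e-6,
                  warmup_fraction=0.01):
    """Linear warmup to peak_lr, then linear decay down to min_lr."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        # Assumption: warmup rises linearly from 0 to the peak rate.
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return min_lr
    # Fraction of the decay phase that is still ahead of us (1.0 -> 0.0).
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * remaining


# Example: assume one schedule spanning 454,000 steps (hypothetical).
schedule = [learning_rate(s, 454_000) for s in (0, 4_540, 227_000, 454_000)]
```

Whether a single schedule covered the whole run or was restarted at each sequence-length stage is not stated in this card, so treat the step count in the example as illustrative.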
## Evaluation
We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).
| Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
|:---: |:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
| tohoku-nlp/bert-large-japanese-v2| 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
| ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
| retrieva-jp/bert-1.3b | 0.952 | 0.916 | 0.877 | 0.896 | 0.916 | 0.879 | 0.815 |
## Technical Specifications
### Model Architectures
The Retrieva BERT model is based on BERT with the following hyperparameters:
- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048
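For a back-of-the-envelope check of how these sizes relate to the stated 1.3 billion parameters, the sketch below counts the main weight matrices. It ignores biases and normalization weights and assumes a three-matrix SwiGLU feed-forward block. The number of key/value heads and the vocabulary size are not given in this card, so `kv_heads` and `vocab_size` are hypothetical parameters; read the real values from `config.json`.

```python
def estimate_weights(layers=48, hidden=1536, ffn_hidden=4096, heads=24,
                     kv_heads=1, vocab_size=0):
    """Rough count of weight-matrix parameters (no biases, no norms).

    Assumptions: kv_heads and vocab_size are placeholders, and the
    feed-forward block uses three hidden x ffn_hidden matrices (SwiGLU).
    """
    head_dim = hidden // heads                    # 1536 // 24 = 64
    q_and_out = 2 * hidden * hidden               # query and output projections
    k_and_v = 2 * hidden * kv_heads * head_dim    # shared key/value projections
    swiglu = 3 * hidden * ffn_hidden              # gate, up and down matrices
    per_layer = q_and_out + k_and_v + swiglu
    return layers * per_layer + vocab_size * hidden


single_kv_head = estimate_weights(kv_heads=1)    # multi-query extreme
full_kv_heads = estimate_weights(kv_heads=24)    # one KV head per query head
```

Without embeddings, the two extremes give about 1.14 billion and 1.36 billion weights, which brackets the stated 1.3 billion parameters; the exact figure depends on the key/value grouping and the vocabulary size.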
As mentioned earlier, the main differences from the original BERT are:
- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
### Compute Infrastructure
[TSUBAME 4](https://www.t4.gsic.titech.ac.jp/en/hardware)
This model is based on results obtained from the TSUBAME deep-learning mini-camp.
#### Software
The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
## More Information
https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
## Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
## Model Card Contact
[email protected]