---
library_name: transformers
license: apache-2.0
language:
- ja
- en
---
# RetrievaBERT Model
**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM.
It is designed for use with Japanese text.
## What's New
- November 2024 (`v1.0.1`): Bug fix for the model parameters.
  - The bias of `up_proj` was mistakenly initialized with the bias of the gate projection; this has been fixed.
## Model Details
### Model Description
**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM.
It is designed for use with Japanese text.
This model offers several advanced features compared to traditional BERT models:
- **PreNorm**: Improved stability during training.
- **SwiGLU**: Enhanced activation function for better performance.
- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.
- **Max Sequence Length**: 2048 tokens, allowing for longer context.
- **Parameters**: 1.3 billion parameters.
- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).
- **Token Type IDs**: Not used in this model.
### Model Sources
- **Developed by:** Retrieva, Inc.
- **Model type:** Based on the MegatronBERT architecture.
- **Language(s) (NLP):** Primarily Japanese (optional support for English).
- **License:** Apache 2.0
## Uses
This model can be used as a Masked Language Model (MLM).
However, it is primarily intended to be fine-tuned on downstream tasks.
Depending on your use case, follow the appropriate section below.
### Direct Use
This model is pre-trained using Masked Language Modeling.
The mask token used is `<MASK|LLM-jp>`.
Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.
Example code for direct use:
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline
model_id = "retrieva-jp/bert-1.3b"
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)
text = "ใใใซใกใฏ๏ผ็งใฎๅๅใฏ<MASK|LLM-jp>ใงใ๏ผ"
print(pipe(text))
```
### Downstream Use
RetrievaBERT is compatible with Hugging Face's AutoModels.
To fine-tune RetrievaBERT for your specific task, use the corresponding AutoModel class.
For detailed configuration, refer to the model's `config.json` file.
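As a minimal illustration, the sketch below loads RetrievaBERT with a sequence-classification head through the Auto classes. The number of labels and the example sentence are placeholders, and whether a particular task head is exposed depends on the custom code shipped with the checkpoint, so treat this as a sketch rather than a verified recipe.
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "retrieva-jp/bert-1.3b"

# trust_remote_code=True is required because RetrievaBERT uses a custom implementation.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=2,  # placeholder: set to the number of classes in your task
    trust_remote_code=True,
)

# Forward pass on a single example ("This movie was very interesting.");
# fine-tuning would wrap this in a standard training loop or the Trainer.
inputs = tokenizer("この映画はとても面白かったです。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```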
## Training Details
### Training Data
The RetrievaBERT model was pre-trained on the combination of the following five datasets:
- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2).
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb).
- Chinese Wikipedia (dump of 2024-01-20).
- Korean Wikipedia (dump of 2024-01-20).
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack).
The model was trained on 180 billion tokens from these datasets.
### Training Procedure
The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024.
We adopted a curriculum learning approach similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps (a rough token-count check is sketched after the list):
- Sequence length 128: 31,000 steps.
- Sequence length 256: 219,000 steps.
- Sequence length 512: 192,000 steps.
- Sequence length 2048: 12,000 steps.
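As a rough sanity check (a sketch that assumes the global batch size of 1,024 sequences is constant across all stages), the schedule above implies a token count on the order of the reported 180 billion:
```python
# Approximate token count implied by the curriculum schedule,
# assuming a constant global batch size of 1,024 sequences per step.
batch_size = 1024
stages = [  # (sequence length, number of steps)
    (128, 31_000),
    (256, 219_000),
    (512, 192_000),
    (2048, 12_000),
]

total_tokens = sum(seq_len * batch_size * steps for seq_len, steps in stages)
print(f"{total_tokens / 1e9:.1f}B tokens")  # 187.3B, roughly the reported 180B
```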
#### Training Hyperparameters
The model was trained with the following hyperparameters:
- Learning rate: 1.5e-4.
- Learning rate decay style: Linear.
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating point format: BF16
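For illustration, the sketch below shows what a linear warmup-and-decay schedule with these values looks like. The actual schedule is implemented inside Megatron-LM; assuming the decay runs over the full 454,000 steps of the curriculum above is our simplification.
```python
# Sketch of a linear warmup + linear decay learning-rate schedule with the
# reported values. Assumes decay spans the full 454,000-step curriculum.
max_lr = 1.5e-4
min_lr = 1e-6
warmup_fraction = 0.01
total_steps = 454_000  # 31,000 + 219,000 + 192,000 + 12,000
warmup_steps = int(total_steps * warmup_fraction)

def learning_rate(step: int) -> float:
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / warmup_steps
    # Linear decay from max_lr down to min_lr.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr - (max_lr - min_lr) * progress

print(learning_rate(0), learning_rate(warmup_steps), learning_rate(total_steps))
```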
## Evaluation
We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set.
We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).
| Model | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
| :--- |---:|---:|---:|---:|---:|---:|---:|
| tohoku-nlp/bert-base-japanese-v3 | 0.957 | 0.914 | 0.876 | 0.906 | 0.878 | 0.946 | 0.849 |
| tohoku-nlp/bert-large-japanese-v2| 0.959 | 0.916 | 0.877 | 0.901 | 0.884 | 0.951 | 0.867 |
| ku-nlp/deberta-v3-base-japanese | 0.958 | 0.925 | 0.890 | 0.902 | 0.925 | 0.910 | 0.882 |
| retrieva-jp/bert-1.3b | 0.959 | 0.917 | 0.881 | 0.898 | 0.875 | 0.874 | 0.827 |
## Technical Specifications
### Model Architecture
The RetrievaBERT model is based on BERT with the following hyperparameters:
- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048
As mentioned earlier, the main differences from the original BERT are:
- PreNorm: Improved stability during training.
- SwiGLU: Enhanced activation function for better performance.
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.
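These values can be checked against the published configuration. A minimal sketch follows; the attribute names assume a standard BERT-style config, and the custom config may name some fields differently.
```python
from transformers import AutoConfig

# Inspect the architecture hyperparameters shipped with the checkpoint.
config = AutoConfig.from_pretrained("retrieva-jp/bert-1.3b", trust_remote_code=True)

# Attribute names assume a BERT-style config; "n/a" is printed if a field
# is named differently in the custom configuration.
for name in (
    "num_hidden_layers",       # 48
    "hidden_size",             # 1536
    "intermediate_size",       # 4096 (FFN hidden size)
    "num_attention_heads",     # 24
    "max_position_embeddings", # 2048
):
    print(name, getattr(config, name, "n/a"))
```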
### Compute Infrastructure
[TSUBAME 4](https://www.t4.gsic.titech.ac.jp/en/hardware)
This model is based on results obtained from the [TSUBAME deep-learning mini-camp](https://www.t4.gsic.titech.ac.jp/en/minicamp-dl-202406).
#### Software
The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).
## More Information
https://note.com/retrieva/n/n715bea2c2cd1 (in Japanese)
## Model Card Authors
Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba
## Model Card Contact
[email protected]