---
library_name: transformers
license: apache-2.0
language:
- ja
- en
---

# Retrieva BERT Model
**RetrievaBERT** is a Transformer encoder pre-trained with Megatron-LM.
It is designed for Japanese text.

## Model Details

### Model Description

This model offers several advanced features compared to traditional BERT models:
- **PreNorm**: Improved stability during training.  
- **SwiGLU**: Enhanced activation function for better performance.  
- **Grouped-Query Attention (Multi-Query Attention)**: Efficient attention mechanism.  
- **Max Sequence Length**: 2048 tokens, allowing for longer context.  
- **Parameters**: 1.3 billion parameters.  
- **Pre-training Objective**: Only Masked Language Modeling (MLM), not Next Sentence Prediction (NSP).  
- **Token Type IDs**: Not used in this model.

### Model Sources
- **Developed by:** Retrieva, Inc.
- **Model type:** Based on MegatronBERT Architecture.
- **Language(s) (NLP):** Primarily Japanese (optional support for English).
- **License:** Apache 2.0


## Uses

This model can be used as a Masked Language Model (MLM).
However, it is primarily intended to be fine-tuned on downstream tasks.
Depending on your use case, follow the appropriate section below.

### Direct Use

This model is pre-trained using Masked Language Modeling.
The mask token used is `<MASK|LLM-jp>`.
Note that you need to set `trust_remote_code` to `True` because RetrievaBERT uses a custom model implementation.  
  
Example code for direct use: 

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

model_id = "retrieva-jp/bert-1.3b"
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Japanese for "Hello! My name is <MASK|LLM-jp>!"
text = "こんにちは!私の名前は<MASK|LLM-jp>です!"
print(pipe(text))
```

### Downstream Use

RetrievaBERT is compatible with Hugging Face's `AutoModel` classes.
To fine-tune RetrievaBERT for a specific task, load it with the `AutoModel` class that matches the task.
For the detailed configuration, refer to the `config.json` file.
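
As an illustration of the paragraph above, the following is a minimal sketch of loading the model for a classification task. It assumes that the repository's custom code registers a sequence-classification head for `AutoModelForSequenceClassification`; the helper names `load_for_classification` and `encode`, the label count, and the `max_length` default are hypothetical and not part of this model card.

```python
MODEL_ID = "retrieva-jp/bert-1.3b"


def load_for_classification(num_labels):
    """Load the tokenizer and the model with a newly initialised classification head.

    Assumption: the repository's remote code supports
    AutoModelForSequenceClassification.
    """
    # Imported here so that `encode` below can be used without loading transformers.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels, trust_remote_code=True
    )
    return tokenizer, model


def encode(tokenizer, texts, max_length=512):
    """Tokenize a batch of texts, truncating to max_length (the model allows up to 2048)."""
    return tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )


def classify(tokenizer, model, texts):
    """Return raw logits for a batch of texts."""
    batch = encode(tokenizer, texts)
    # Token type IDs are not used by this model, so only these two tensors are passed.
    outputs = model(
        input_ids=batch["input_ids"], attention_mask=batch["attention_mask"]
    )
    return outputs.logits
```

The classification head is randomly initialised, so the logits are only meaningful after fine-tuning on labelled data.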


## Training Details

### Training Data
The Retrieva BERT model was pre-trained on the union of five datasets:
- [Japanese CommonCrawl Dataset by LLM-jp](https://gitlab.llm-jp.nii.ac.jp/datasets/llm-jp-corpus-v2)
- [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb)
- Chinese Wikipedia (dump of 2024-01-20)
- Korean Wikipedia (dump of 2024-01-20)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)

The model was trained on 180 billion tokens drawn from these datasets.

### Training Procedure
The model was trained on 4 to 32 H100 GPUs with a batch size of 1,024.
We adopted a curriculum similar to Sequence Length Warmup, training with the following sequence lengths and numbers of steps:

- Sequence length 128: 31,000 steps
- Sequence length 256: 219,000 steps
- Sequence length 512: 192,000 steps
- Sequence length 2048: 12,000 steps
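
The schedule above can be turned into a rough token budget. The sketch below assumes that the batch size of 1,024 is counted in sequences and that every sequence is filled to the full length of its stage, so the result is an upper bound; under those assumptions it comes to about 187 billion tokens, which is in line with the 180 billion tokens reported above.

```python
# Batch size and curriculum stages as listed in this model card.
BATCH_SIZE = 1024  # assumed to be counted in sequences, not tokens
SCHEDULE = [  # (sequence length, number of steps)
    (128, 31_000),
    (256, 219_000),
    (512, 192_000),
    (2048, 12_000),
]


def tokens_per_stage(schedule, batch_size):
    """Upper bound on tokens seen in each stage (every sequence fully packed)."""
    return [(seq_len, seq_len * steps * batch_size) for seq_len, steps in schedule]


stage_tokens = tokens_per_stage(SCHEDULE, BATCH_SIZE)
total_steps = sum(steps for _, steps in SCHEDULE)
total_tokens = sum(tokens for _, tokens in stage_tokens)

for seq_len, tokens in stage_tokens:
    print(f"seq_len={seq_len:>4}: {tokens / 1e9:6.1f}B tokens")
print(f"total steps : {total_steps:,}")
print(f"total tokens: {total_tokens / 1e9:.1f}B (upper bound)")
```

Most of the budget is spent at sequence lengths 256 and 512; the 2048-token stage is short in steps but still contributes a noticeable share because each step processes far more tokens.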

#### Training Hyperparameters
The model was trained with the following hyperparameters:

- Learning rate: 1.5e-4
- Learning rate decay style: Linear
- Learning rate warmup fraction: 0.01
- Minimum learning rate: 1e-6
- Floating-point format: BF16
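
To make the learning-rate settings concrete, here is a minimal sketch of a linear warmup followed by a linear decay. It is not taken from the training code: it assumes that one schedule spans all 454,000 steps of the curriculum, that warmup starts from zero, and that the decay ends exactly at the minimum learning rate. The function name `learning_rate` and the constant `TOTAL_STEPS` are hypothetical.

```python
TOTAL_STEPS = 454_000  # assumption: sum of the steps of all curriculum stages


def learning_rate(step, total_steps=TOTAL_STEPS, peak_lr=1.5e-4,
                  min_lr=1e-6, warmup_fraction=0.01):
    """Linear warmup to peak_lr, then linear decay down to min_lr."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    if step >= total_steps:
        return min_lr
    # Decay: fraction of the post-warmup phase that has elapsed.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + (peak_lr - min_lr) * (1.0 - progress)


for step in (0, 2_270, 4_540, 227_000, 454_000):
    print(f"step {step:>7,}: lr = {learning_rate(step):.3e}")
```

With a warmup fraction of 0.01, the peak learning rate is reached after 4,540 of the assumed 454,000 steps.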

## Evaluation
We fine-tuned the following models and evaluated them on the [JGLUE](https://github.com/yahoojapan/JGLUE) development set. 
We adjusted the learning rate and training epochs for each model and task in accordance with [the JGLUE paper](https://www.jstage.jst.go.jp/article/jnlp/30/1/30_63/_pdf/-char/ja).

| Model                            | MARC-ja/acc | JSTS/pearson | JSTS/spearman | JNLI/acc | JSQuAD/EM | JSQuAD/F1 | JComQA/acc |
|----------------------------------|-------------|--------------|---------------|----------|-----------|-----------|------------|
| tohoku-nlp/bert-base-japanese-v3 | 0.957       | 0.914        | 0.876         | 0.906    | 0.878     | 0.946     | 0.849      |
| tohoku-nlp/bert-large-japanese-v2| 0.959       | 0.916        | 0.877         | 0.901    | 0.884     | 0.951     | 0.867      |
| ku-nlp/deberta-v3-base-japanese  | 0.958       | 0.925        | 0.890         | 0.902    | 0.925     | 0.910     | 0.882      |
| retrieva-jp/bert-1.3b            | 0.952       | 0.916        | 0.877         | 0.896    | 0.916     | 0.879     | 0.815      |


## Technical Specifications

### Model Architectures
The Retrieva BERT model is based on BERT with the following hyperparameters:

- Number of layers: 48
- Hidden layer size: 1536
- FFN hidden layer size: 4096
- Number of attention heads: 24
- Maximum length of position embeddings: 2048

As mentioned earlier, the main differences from the original BERT are:
- PreNorm: Improved stability during training.  
- SwiGLU: Enhanced activation function for better performance.  
- Grouped-Query Attention (Multi-Query Attention): Efficient attention mechanism.  
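
As a rough sanity check of the 1.3 billion parameter figure, the sketch below counts only the large weight matrices implied by the hyperparameters above. It is an estimate under stated assumptions, not the exact count: biases, normalization weights, and position embeddings are ignored, SwiGLU is assumed to use three projection matrices of size hidden × FFN, and the number of key/value heads and the vocabulary size are not given in this model card, so `num_kv_heads` and `vocab_size` are hypothetical parameters.

```python
def estimate_parameters(num_layers=48, hidden=1536, ffn_hidden=4096,
                        num_heads=24, num_kv_heads=24, vocab_size=0):
    """Approximate weight count of the encoder (large matrices only)."""
    if hidden % num_heads != 0 or num_heads % num_kv_heads != 0:
        raise ValueError("head counts must divide evenly")
    head_dim = hidden // num_heads
    # Query and output projections are always hidden x hidden.
    q_and_out = 2 * hidden * hidden
    # Key and value projections shrink with fewer key/value heads (GQA / MQA).
    k_and_v = 2 * hidden * head_dim * num_kv_heads
    # SwiGLU: gate, up, and down projections (assumed layout).
    swiglu = 3 * hidden * ffn_hidden
    per_layer = q_and_out + k_and_v + swiglu
    # Token embedding matrix; 0 skips it because the vocabulary size is not stated here.
    embeddings = vocab_size * hidden
    return num_layers * per_layer + embeddings


# Two extremes for the unknown number of key/value heads.
for kv_heads in (24, 1):
    total = estimate_parameters(num_kv_heads=kv_heads)
    print(f"num_kv_heads={kv_heads:>2}: {total / 1e9:.2f}B parameters (without embeddings)")
```

Depending on the number of key/value heads, the layers alone account for roughly 1.14 to 1.36 billion weights, which is compatible with the stated model size once embeddings are added.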


### Compute Infrastructure

[TSUBAME 4](https://www.t4.gsic.titech.ac.jp/en/hardware)

This model is based on results obtained from the TSUBAME deep-learning mini-camp.

#### Software

The model was trained using [Megatron-LM](https://github.com/NVIDIA/Megatron-LM).

## Model Card Authors

Satoru Katsumata, Daisuke Kimura, Jiro Nishitoba

## Model Card Contact
[email protected]