Update README.md
README.md CHANGED
@@ -15,9 +15,12 @@ This is Mistral-12.25B-Instruct-v0.2, a depth-upscaled version of [mistralai/Mis
This model is intended to be used as a basis for further fine-tuning, or as a drop-in upgrade from the original 7-billion-parameter model.

This is a merge of pre-trained language models created using [mergekit](https://github.com/cg123/mergekit).

- UpStage
+ # UpStage's conclusions about the limitations of their research:

"Our study on the Depth Up-Scaling (DUS) has important limitations and considerations. One key limitation is the need for more thorough explorations of hyperparameters used in the DUS approach. Namely, we removed m = 8 layers from both ends of our base model, primarily due to hardware limitations. However, we have not yet determined if this value is optimal for enhancing performance. The extended time and cost of continued pretraining made it challenging to conduct more comprehensive experiments, which we aim to address in future work through various comparative analyses."
+
+ This model was made to help test whether 10.7B parameters (m = 8) performs better or worse than m < 8 (more than 10.7B parameters).
+
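For context, the layer arithmetic behind the m hyperparameter, as a sketch: it assumes n = 32 (the layer count of Mistral-7B, per the DUS paper); this model's exact slice boundaries are not given in this excerpt.

```latex
% DUS: duplicate the n-layer base model, drop m layers from the
% facing ends of the two copies, then concatenate them.
\[
L = 2\,(n - m)
\]
% n = 32 (Mistral-7B), m = 8:  L = 2(32 - 8) = 48 layers, the 10.7B case.
% Any m < 8 keeps more layers (L > 48), hence more than 10.7B parameters.
```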
## Merge Details

### Merge Method
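The excerpt cuts off before the merge configuration, so as a minimal sketch: a DUS-style passthrough merge is written in mergekit YAML roughly as below. The layer_range values shown are the m = 8 / 48-layer case from the quote, not necessarily the slices used for this 12.25B model.

```yaml
# Hypothetical DUS-style passthrough merge (m = 8 case from the quote).
# Two overlapping copies of the 32-layer base are stacked back to back.
slices:
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [0, 24]  # keep layers 0-23: drop m = 8 from the top
  - sources:
      - model: mistralai/Mistral-7B-Instruct-v0.2
        layer_range: [8, 32]  # keep layers 8-31: drop m = 8 from the bottom
merge_method: passthrough
dtype: bfloat16
```

Shrinking m widens both layer_range spans (for example, [0, 28] and [4, 32] at m = 4), which is the knob this model varies.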