Update README.md
README.md
FP16:

```
ollama run mistral-small:24b-instruct-2501-fp16
```
### Fine-tuning & context expansion

This model is an (untested) fine-tune of the base model, created with [unsloth](https://github.com/unslothai/unsloth)'s PEFT SFT workflow.
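As a rough illustration of what a PEFT SFT setup with unsloth looks like, the sketch below loads the base model and attaches LoRA adapters. It is an assumed reconstruction rather than the actual training script: the rank and alpha values are taken from the settings and log further down, and the target-module list is unsloth's common default, not something stated here.

```python
# Sketch only: an unsloth-style PEFT setup assumed from the settings below,
# not the exact script used for this fine-tune.
from unsloth import FastLanguageModel

# Requesting a 35,840-token context triggers unsloth's RoPE scaling,
# since the base model natively supports 32,768 tokens.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="uncensoredai/Mistral-Small-24B-Instruct-2501",
    max_seq_length=35_840,
    dtype=None,          # auto-selects bfloat16 on supported GPUs
    load_in_4bit=False,
)

# Attach LoRA adapters; r and alpha match the values reported in the log below.
# The target-module list is an assumption (unsloth's usual default).
model = FastLanguageModel.get_peft_model(
    model,
    r=304,
    lora_alpha=608,
    lora_dropout=0.0,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```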
#### Datasets

SFT was done on the following datasets:

1. 40% of the [cognitivecomputations/dolphin-r1](https://huggingface.co/datasets/cognitivecomputations/dolphin-r1/viewer/nonreasoning) dataset, and
2. 2% of the [fireworks-ai/long-chat](https://huggingface.co/datasets/fireworks-ai/long-chat?row=0) dataset for context expansion (a mixing sketch follows the list).
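One plausible way to build such a mixture with the Hugging Face `datasets` library is sketched below. The `nonreasoning` subset name, the `train` splits, the seed, and the assumption that both subsets have already been normalized to a shared chat column are all inferred from the links above, not a record of the actual preprocessing.

```python
# Sketch: sampling 40% of dolphin-r1 (nonreasoning) and 2% of long-chat.
# Subset/split names and the seed are assumptions based on the dataset links.
from datasets import load_dataset, concatenate_datasets

def take_fraction(ds, fraction, seed=42):
    """Randomly keep `fraction` of the rows."""
    n = int(len(ds) * fraction)
    return ds.shuffle(seed=seed).select(range(n))

dolphin = take_fraction(
    load_dataset("cognitivecomputations/dolphin-r1", "nonreasoning", split="train"),
    0.40,
)
long_chat = take_fraction(
    load_dataset("fireworks-ai/long-chat", split="train"),
    0.02,
)

# Both subsets would need to be mapped to one shared chat/text column before
# concatenation; assuming that normalization has been done, the mixture is:
mixture = concatenate_datasets([dolphin, long_chat]).shuffle(seed=42)
print(len(mixture), "training examples")
```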
### Training configuration

Context was expanded to a maximum of 35k tokens using unsloth's [RoPE scaling](https://arxiv.org/abs/2310.05209) capabilities.
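The scaling factor that unsloth reports in the training log further down follows directly from these two context lengths; a quick check, assuming simple linear position scaling:

```python
# Quick check of the RoPE scaling factor: extended context / native context.
native_ctx = 32_768      # base model's maximum sequence length
extended_ctx = 35_840    # target context after expansion (~35k)
print(round(extended_ctx / native_ctx, 3))   # 1.094, matching the training log
```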
#### Chat template

Mistral chat template format was used.
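For illustration, applying a Mistral-style chat template through the model's tokenizer looks roughly like this. The exact template string shipped with the tokenizer is what actually defines the format; the message content here is made up.

```python
# Sketch: formatting a conversation with the tokenizer's built-in chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("uncensoredai/Mistral-Small-24B-Instruct-2501")

messages = [
    {"role": "user", "content": "Summarize RoPE scaling in one sentence."},
]

# Renders the conversation in the Mistral chat format (e.g. [INST] ... [/INST]).
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```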
#### PEFT settings

The LoRA adapters were sized to roughly 1% of the base model's hidden parameters, resulting in the following training setup:
```bash
==((====))==  Unsloth 2025.1.8: Fast Mistral patching. Transformers: 4.48.2.
   \\   /|    GPU: NVIDIA H200. Max memory: 139.827 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

Unsloth: uncensoredai/Mistral-Small-24B-Instruct-2501 can only handle sequence lengths of at most 32768.
But with kaiokendev's RoPE scaling of 1.094, it can be magically be extended to 35840!

Loading checkpoint shards:   0%|          | 0/10 [00:00<?, ?it/s]

Total model parameters: 13,799,674,880
Total hidden parameters: 12,457,497,600
Total LM Head parameters: 671,088,640
Total Embedding parameters: 671,088,640
Hidden Size: 5120
# Hidden Layers: 40
Training Fraction: 0.01

Number of Training Parameters: 124,574,976.0
LoRA Rank (r): 304.00
LoRA Alpha (alpha_lora): 608.00
...

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 64,992 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 8,124
 "-____-"     Number of trainable parameters = 1,755,709,440
```
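The figures in this log line up with a simple sizing heuristic: take 1% of the hidden parameters as the trainable budget, derive the rank from a cost of 2 × hidden_size per layer per unit of rank, and set alpha to 2r; the larger trainable-parameter count in the final banner then follows from attaching rank-304 adapters to every attention and MLP projection. The arithmetic below is an inferred reconstruction (the per-module dimensions are assumed from the model's published config), not code from the actual run:

```python
# Sketch: reproducing the LoRA sizing figures from the log above.
# The sizing heuristic and the per-module dimensions are assumptions,
# but the arithmetic matches the reported numbers.
hidden_params = 12_457_497_600   # "Total hidden parameters"
hidden_size = 5_120              # "Hidden Size"
num_layers = 40                  # "# Hidden Layers"
training_fraction = 0.01         # "1% of base model's hidden parameters"

# Step 1: trainable-parameter budget.
budget = hidden_params * training_fraction
print(budget)                    # 124574976.0 -> "Number of Training Parameters"

# Step 2: rank from an assumed per-layer cost of 2 * hidden_size per rank.
rank = round(budget / (2 * hidden_size * num_layers))
alpha = 2 * rank
print(rank, alpha)               # 304 608 -> "LoRA Rank (r)", "LoRA Alpha"

# Step 3: actual trainable parameters once rank-304 adapters are attached to all
# attention and MLP projections (dimensions assumed from the model config).
per_rank_per_layer = (
    (5120 + 4096)      # q_proj
    + (5120 + 1024)    # k_proj
    + (5120 + 1024)    # v_proj
    + (4096 + 5120)    # o_proj
    + (5120 + 32768)   # gate_proj
    + (5120 + 32768)   # up_proj
    + (32768 + 5120)   # down_proj
)
print(per_rank_per_layer * num_layers * rank)   # 1755709440 -> "Number of trainable parameters"
```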