---
license: gemma
datasets:
- Mielikki/Erebus-87k
- allura-org/r_shortstories_24k
base_model: allura-org/G2-9B-Sugarquill-v0
pipeline_tag: text-generation
library_name: transformers
tags:
- llama-cpp
- gguf-my-repo
---

# Triangle104/G2-9B-Sugarquill-v0-Q8_0-GGUF
This model was converted to GGUF format from [`allura-org/G2-9B-Sugarquill-v0`](https://huggingface.co/allura-org/G2-9B-Sugarquill-v0) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/allura-org/G2-9B-Sugarquill-v0) for more details on the model.

---
## Model details

An experimental continued pretrain of Gemma-2-9B-It-SPPO-Iter3 on assorted short story data from the web. I was trying to diversify Gemma's prose without completely destroying its smarts. I think I half-succeeded? This model could have used another epoch of training, but even this is already more creative and descriptive than its base model, without becoming too silly. It doesn't seem to have degraded much in terms of core abilities either. Should be usable both for RP and raw completion storywriting. I originally planned to use this in a merge, but I feel like this model is interesting enough to be released on its own as well.

Model was trained by Auri.

Dedicated to Cahvay, who has wanted a Gemma finetune from me for months now, and to La Rata, who loves storywriter models.

GGUFs by Prodeus: https://huggingface.co/allura-org/G2-9B-Sugarquill-v0-GGUF

### Training notes

This model was trained for 2 epochs on 10k rows (~18.7M tokens), taken equally from the Erebus-87k and r_shortstories_24k datasets. It was trained on an 8xH100 SXM node for 30 minutes with rsLoRA. I got complete nonsense reported to my wandb during this run, and logging stopped altogether after step 13 for some reason. This seems to be directly related to Gemma, as my training setup worked flawlessly for Qwen. Thanks to Kearm for helping set up LLaMA-Factory on that node, and to Featherless for providing the node for EVA-Qwen2.5 training (and, unknowingly, for this model lol).

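As a rough sanity check on the numbers above, the effective batch size and an approximate optimizer step count can be worked out from values in the training config on this card (`per_device_train_batch_size: 1`, `gradient_accumulation_steps: 8`, `cutoff_len: 8192`, `val_size: 0.1`) and the 8 GPUs mentioned here. The sketch assumes that the ~18.7M token figure is the size of one epoch of data, that packing fills every sequence to the full 8192 tokens, and that the validation split removes 10% of the packed sequences. None of this is confirmed by the card, so treat the result as an order-of-magnitude estimate only.

```bash
#!/usr/bin/env bash
# Rough training-size estimate; token count and packing behaviour are assumptions.

per_device_bs=1        # per_device_train_batch_size from the config
grad_accum=8           # gradient_accumulation_steps from the config
num_gpus=8             # 8xH100 node
cutoff_len=8192        # cutoff_len from the config
total_tokens=18700000  # ~18.7M tokens (assumed to be one epoch of data)
epochs=2               # num_train_epochs from the config

# Effective batch size = per-device batch size * grad accum steps * number of GPUs
effective_bs=$((per_device_bs * grad_accum * num_gpus))

# Number of packed sequences, assuming each one is filled to cutoff_len
packed_seqs=$((total_tokens / cutoff_len))

# val_size: 0.1 holds out 10% of the sequences for evaluation (assumed)
train_seqs=$((packed_seqs * 9 / 10))

# Optimizer steps per epoch (rounded down) and over the whole run
steps_per_epoch=$((train_seqs / effective_bs))
total_steps=$((steps_per_epoch * epochs))

echo "effective batch size: ${effective_bs}"
echo "approx. optimizer steps: ${total_steps}"
```

Under these assumptions the run works out to an effective batch size of 64 sequences and about 64 optimizer steps in total.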
### Format

Model responds to Gemma instruct formatting, exactly like its base model.

<bos><start_of_turn>user
{user message}<end_of_turn>
<start_of_turn>model
{response}<end_of_turn><eos>

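To make the template concrete, here is a minimal shell sketch that wraps a single user message in the Gemma turn markers shown above and leaves the model turn open for generation. The function name `gemma_prompt` is hypothetical. It deliberately omits `<bos>`, on the assumption that the inference frontend adds the beginning-of-sequence token itself; prepend it manually if yours does not.

```bash
#!/usr/bin/env bash
# Hypothetical helper: wrap one user message in Gemma instruct formatting.
# <bos> is left out on the assumption that the runtime adds it.
gemma_prompt() {
  local user_message="$1"
  # The prompt ends right after the model turn marker so the model writes the reply.
  printf '<start_of_turn>user\n%s<end_of_turn>\n<start_of_turn>model\n' "$user_message"
}

# Example: print the prompt for a short story request.
gemma_prompt "Write a short story about a lighthouse keeper."
```

The printed text can be passed as the prompt to whichever frontend is in use, for example via the `-p` flag of the llama.cpp commands in this card.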
### Training config

See the LLaMA-Factory config:

### Model
model_name_or_path: UCLA-AGI/Gemma-2-9B-It-SPPO-Iter3
#ref_model: # Reference model for RL (optional, for everything besides SimPO, which doesn't take it at all)
#ref_model_quantization_bit: 8 # 8 or 4

### Method
stage: pt # pt, sft, rm, ppo, kto, dpo (includes orpo and simpo)
do_train: true
finetuning_type: lora # full, freeze or lora
lora_target: all
#pref_beta: 0.1
#pref_loss: simpo # sigmoid (dpo), orpo, simpo, ipo, hinge

### Reward model
#reward_model: RLHFlow/ArmoRM-Llama3-8B-v0.1 # or sfairXC/FsfairX-Gemma2-RM-v0.1 or nvidia/Llama-3.1-Nemotron-70B-Reward-HF
#reward_model_type: full # full, lora, api
#reward_model_adapters: # Path to RM LoRA adapter(s) if using a LoRA RM
#reward_model_quantization_bit: 8 # 4 or 8

### Freeze
#freeze_trainable_layers: # The number of trainable layers for freeze (partial-parameter) fine-tuning. Positive number means n last layers to train, negative - n first layers to train
#freeze_trainable_modules: # Name(s) of trainable modules for freeze (partial-parameter) fine-tuning. Use commas to separate
#freeze_extra_modules: # Name(s) of modules apart from hidden layers to be set as trainable. Use commas to separate

### LoRA
#loraplus_lr_ratio: 8.0
#loraplus_lr_embedding:
use_dora: false
use_rslora: true
lora_rank: 64 # 64 is optimal for most trains on instruct, if training on base - use rslora or dora
lora_alpha: 32
lora_dropout: 0.05
#pissa_init: true
#pissa_iter: 16
#pissa_convert: true

### QLoRA
quantization_bit: 8 # 2,3,4,5,6,8 in HQQ, 4 or 8 in bnb
quantization_method: hqq # bitsandbytes or hqq

### DeepSpeed
deepspeed: examples/deepspeed/ds_z2_config.json # ds_z3_config.json or ds_z2_config.json which is required for HQQ on multigpu

### Dataset
dataset: sugarquill-10k # define in data/dataset_info.json
cutoff_len: 8192
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 16
#template: chatml

### Output
output_dir: saves/gemma/lora/sugarquill-1
logging_steps: 3
save_steps: 50
plot_loss: true
compute_accuracy: true
overwrite_output_dir: true

### Train
per_device_train_batch_size: 1 # Effective b/s == per-device b/s * grad accum steps * number of GPUs
gradient_accumulation_steps: 8
learning_rate: 3.0e-5
optim: paged_adamw_8bit # paged_adamw_8bit or adamw_torch usually
num_train_epochs: 2.0
lr_scheduler_type: cosine # cosine, constant or linear
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000
packing: true
max_grad_norm: 1.0

### Opts
flash_attn: fa2 # auto, disabled, sdpa, fa2 | Gemma will fallback to eager
enable_liger_kernel: true # Pretty much must have if it works
#use_unsloth: true # May not work with multigpu idk
#use_adam_mini: true # Comment optim if using this

### Eval
val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 0.05

### Misc
include_num_input_tokens_seen: true
ddp_find_unused_parameters: false # Stupid thing tries to start distributed training otherwise
upcast_layernorm: true

### Inference for PPO
#max_new_tokens: 512
#temperature: 0.8
#top_k: 0
#top_p: 0.8

### Tracking
report_to: wandb # or tensorboard or mlflow | LOGIN BEFORE STARTING TRAIN OR ELSE IT WILL CRASH
run_name: G2-9B-Sugarquill-1

### Merge Adapter
#export_dir: models/G2-9B-Sugarquill
#export_size: 4
#export_device: gpu
#export_legacy_format: false

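The `use_rslora: true` setting changes how the adapter update is scaled. Standard LoRA multiplies the low-rank update by `lora_alpha / lora_rank`, while rank-stabilized LoRA uses `lora_alpha / sqrt(lora_rank)`. The short sketch below evaluates both factors for the values in this config (`lora_alpha: 32`, `lora_rank: 64`). It illustrates the general rsLoRA formula and is not taken from LLaMA-Factory's code.

```bash
#!/usr/bin/env bash
# Compare LoRA and rsLoRA scaling factors for the config's alpha and rank.
lora_alpha=32
lora_rank=64

# Standard LoRA: alpha / r
lora_scale=$(awk -v a="$lora_alpha" -v r="$lora_rank" 'BEGIN { printf "%.2f", a / r }')

# rsLoRA: alpha / sqrt(r)
rslora_scale=$(awk -v a="$lora_alpha" -v r="$lora_rank" 'BEGIN { printf "%.2f", a / sqrt(r) }')

echo "LoRA scale:   ${lora_scale}"    # 0.50
echo "rsLoRA scale: ${rslora_scale}"  # 4.00
```

With these values the rsLoRA factor is eight times the standard one, so the same `lora_alpha` and `lora_rank` would give a much weaker update if `use_rslora` were switched off.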

---
## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux):

```bash
brew install llama.cpp
```
Invoke the llama.cpp server or the CLI.

### CLI:
```bash
llama-cli --hf-repo Triangle104/G2-9B-Sugarquill-v0-Q8_0-GGUF --hf-file g2-9b-sugarquill-v0-q8_0.gguf -p "The meaning to life and the universe is"
```

### Server:
```bash
llama-server --hf-repo Triangle104/G2-9B-Sugarquill-v0-Q8_0-GGUF --hf-file g2-9b-sugarquill-v0-q8_0.gguf -c 2048
```

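Once the server is running, it can be queried over HTTP. The sketch below assumes llama-server's default listen address of `127.0.0.1:8080` and its OpenAI-compatible `/v1/chat/completions` endpoint. The function names and sampling values are illustrative, and the request body is built by hand without JSON escaping, so it only suits messages that contain no quotes or backslashes.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for talking to a locally running llama-server.

# Build a minimal chat-completions request body.
# Note: no JSON escaping is done, so keep the message free of quotes and backslashes.
chat_payload() {
  local user_message="$1"
  printf '{"messages":[{"role":"user","content":"%s"}],"max_tokens":256,"temperature":0.8}' "$user_message"
}

# Send the request; assumes the server listens on the default 127.0.0.1:8080.
ask_server() {
  curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$(chat_payload "$1")"
}

# Usage (requires the llama-server command above to be running):
# ask_server "Write the opening line of a ghost story."
```

When called this way, the server should apply the chat template stored in the GGUF file, so the Gemma turn markers do not need to be added by hand.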
Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.
```bash
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag along with other hardware-specific flags (for example, `LLAMA_CUDA=1` for Nvidia GPUs on Linux). Recent llama.cpp versions build with CMake rather than `make`, so check the repo's build instructions if this command fails.
```bash
cd llama.cpp && LLAMA_CURL=1 make
```

Step 3: Run inference through the main binary.
```bash
./llama-cli --hf-repo Triangle104/G2-9B-Sugarquill-v0-Q8_0-GGUF --hf-file g2-9b-sugarquill-v0-q8_0.gguf -p "The meaning to life and the universe is"
```
or
```bash
./llama-server --hf-repo Triangle104/G2-9B-Sugarquill-v0-Q8_0-GGUF --hf-file g2-9b-sugarquill-v0-q8_0.gguf -c 2048
```