---
base_model: unsloth/llama-3.2-3b-instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- sft
license: apache-2.0
language:
- en
datasets:
- BAAI/Infinity-Instruct
---
# Fine-tune Llama 3.2 3B Using Unsloth and BAAI/Infinity-Instruct Dataset
This model was fine-tuned on the "0625" version of the dataset; a model fine-tuned on the "7M" version will follow.
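If you want to inspect the training data, the subset can be loaded with the `datasets` library. A minimal sketch, assuming the "0625" configuration name matches the subset published on the Hugging Face Hub:

```python
from datasets import load_dataset

# Load the "0625" subset of BAAI/Infinity-Instruct (config name assumed to match the Hub listing)
dataset = load_dataset("BAAI/Infinity-Instruct", "0625", split="train")
print(dataset[0])
```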
## Uploaded Model
- **Developed by:** MateoRov
- **License:** apache-2.0
- **Fine-tuned from model:** unsloth/llama-3.2-3b-instruct-bnb-4bit
## Usage
Check the full repo on GitHub for a better understanding: https://github.com/Mateorovere/FineTuning-LLM-Llama3.2-3b

With the proper dependencies installed, you can run the model with the following code:
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the fine-tuned model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Generate the output
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=64,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

# Decode the outputs
result = tokenizer.batch_decode(outputs)
print(result)
```
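Note that `tokenizer.batch_decode(outputs)` returns the prompt together with the completion. If you only want the newly generated text, you can slice off the prompt tokens first; a small sketch that reuses `inputs` and `outputs` from the snippet above:

```python
# Keep only the tokens generated after the prompt, then decode them
generated_tokens = outputs[:, inputs.shape[1]:]
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
```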
To get the generation token by token:
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

# Load the fine-tuned model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Initialize the text streamer (skip_prompt=True hides the echoed prompt)
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate the output token by token
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
```
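If you prefer not to use Unsloth, a minimal sketch with plain `transformers` follows. It assumes the repository hosts merged full weights that `AutoModelForCausalLM` can load directly; if it only contains LoRA adapters, load the base model and attach the adapters with `peft` instead:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere"

# Assumes merged weights are available in the repo (not adapter-only files)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```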