---
base_model: unsloth/llama-3.2-3b-instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- sft
license: apache-2.0
language:
- en
datasets:
- BAAI/Infinity-Instruct
---

# Fine-tune Llama 3.2 3B Using Unsloth and BAAI/Infinity-Instruct Dataset

This model was fine-tuned on the "0625" version of the dataset; a model fine-tuned on the "7M" version will follow.
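
The dataset subsets can be pulled directly from the Hugging Face Hub with the `datasets` library. A minimal sketch, assuming the subset names on the Hub match the version names used here ("0625" and "7M"):

```python
from datasets import load_dataset

# Load the "0625" subset of Infinity-Instruct (config name assumed to
# match the version name referenced above)
train_data = load_dataset("BAAI/Infinity-Instruct", "0625", split="train")
print(train_data)
```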

## Uploaded Model

- **Developed by:** MateoRov
- **License:** apache-2.0
- **Fine-tuned from model:** unsloth/llama-3.2-3b-instruct-bnb-4bit
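
The full fine-tuning code lives in the GitHub repo linked in the Usage section below. As a rough orientation only, a minimal Unsloth + TRL SFT sketch is shown here; the LoRA rank, batch size, step count, and the column layout assumed for Infinity-Instruct are illustrative assumptions, not the exact settings used for this model.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the 4-bit base model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3.2-3b-instruct-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters for parameter-efficient fine-tuning
# (rank, alpha, and target modules are illustrative values)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Load the instruction data (subset name assumed)
dataset = load_dataset("BAAI/Infinity-Instruct", "0625", split="train")

# Render each example into a single "text" column. A ShareGPT-style
# "conversations" column with "from"/"value" keys is assumed here.
def to_text(example):
    role_map = {"human": "user", "gpt": "assistant"}
    messages = [
        {"role": role_map.get(turn["from"], turn["from"]), "content": turn["value"]}
        for turn in example["conversations"]
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

dataset = dataset.map(to_text)

# Supervised fine-tuning with TRL's SFTTrainer
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        max_steps=60,            # illustrative, not the real step count
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)
trainer.train()
```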

## Usage

Check my full repo on GitHub for a better understanding: https://github.com/Mateorovere/FineTuning-LLM-Llama3.2-3b


With the proper dependencies installed, you can run the model with the following code:

```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the fine-tuned model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Generate the output
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=64,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

# Decode the outputs
result = tokenizer.batch_decode(outputs)
print(result)
```

To stream the generation token by token:

```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

# Load the fine-tuned model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Initialize the text streamer (skip_prompt hides the echoed input)
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate the output token by token
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
```