---
base_model: unsloth/llama-3.2-3b-instruct-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
- sft
license: apache-2.0
language:
- en
datasets:
- BAAI/Infinity-Instruct
---
# Fine-tune Llama 3.2 3B Using Unsloth and BAAI/Infinity-Instruct Dataset
This model was fine-tuned on the "0625" version of the dataset; a model fine-tuned on the "7M" version will follow.
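If you want to inspect the training data, the subset can be loaded with the `datasets` library. A minimal sketch, assuming the "0625" configuration name matches the subset published on the Hugging Face Hub:

```python
from datasets import load_dataset

# Load the "0625" subset of BAAI/Infinity-Instruct (config name assumed to match the Hub listing)
dataset = load_dataset("BAAI/Infinity-Instruct", "0625", split="train")
print(dataset[0])
```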
## Uploaded Model
- **Developed by:** MateoRov
- **License:** apache-2.0
- **Fine-tuned from model:** unsloth/llama-3.2-3b-instruct-bnb-4bit
## Usage
Check the full repo on GitHub for a better understanding: https://github.com/Mateorovere/FineTuning-LLM-Llama3.2-3b

With the proper dependencies installed, you can run the model with the following code:
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

# Load the fine-tuned model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Generate the output
outputs = model.generate(
    input_ids=inputs,
    max_new_tokens=64,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)

# Decode the outputs
result = tokenizer.batch_decode(outputs)
print(result)
```
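Note that `tokenizer.batch_decode(outputs)` returns the prompt together with the completion. If you only want the newly generated text, you can slice off the prompt tokens first; a small sketch that reuses `inputs` and `outputs` from the snippet above:

```python
# Keep only the tokens generated after the prompt, then decode them
generated_tokens = outputs[:, inputs.shape[1]:]
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0])
```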
To get the generation token by token:
```python
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template
from transformers import TextStreamer

# Load the fine-tuned model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Apply the Llama 3.1 chat template to the tokenizer
tokenizer = get_chat_template(
    tokenizer,
    chat_template="llama-3.1",
)

# Define the input message
messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]

# Prepare the inputs
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to("cuda")

# Initialize the text streamer (skip_prompt=True hides the echoed prompt)
text_streamer = TextStreamer(tokenizer, skip_prompt=True)

# Generate the output token by token
_ = model.generate(
    input_ids=inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    use_cache=True,
    temperature=1.5,
    min_p=0.1,
)
```
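If you prefer not to use Unsloth, a minimal sketch with plain `transformers` follows. It assumes the repository hosts merged full weights that `AutoModelForCausalLM` can load directly; if it only contains LoRA adapters, load the base model and attach the adapters with `peft` instead:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "MateoRov/Llama3.2-3b-SFF-Infinity-MateoRovere"

# Assumes merged weights are available in the repo (not adapter-only files)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Continue the Fibonacci sequence: 1, 1, 2, 3, 5, 8,"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```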