---
license: creativeml-openrail-m
datasets:
- HuggingFaceTB/smoltalk
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
library_name: openvino
tags:
- Llama
- SmolTalk
- openvino
---
The **Llama-SmolTalk-3.2-1B-Instruct** model is a lightweight, instruction-tuned model designed for efficient text generation and conversational AI tasks. With a 1B parameter architecture, this model strikes a balance between performance and resource efficiency, making it ideal for applications requiring concise, contextually relevant outputs. The model has been fine-tuned to deliver robust instruction-following capabilities, catering to both structured and open-ended queries.
### Key Features:
1. **Instruction-Tuned Performance**: Optimized to understand and execute user-provided instructions across diverse domains.
2. **Lightweight Architecture**: With just 1 billion parameters, the model provides efficient computation and storage without compromising output quality.
3. **Versatile Use Cases**: Suitable for tasks like content generation, conversational interfaces, and basic problem-solving.
### Intended Applications:
- **Conversational AI**: Engage users with dynamic and contextually aware dialogue.
- **Content Generation**: Produce summaries, explanations, or other creative text outputs efficiently.
- **Instruction Execution**: Follow user commands to generate precise and relevant responses.
### Technical Details:
The model uses the OpenVINO IR format for inference, with a tokenizer optimized for seamless text input processing. It ships with the essential configuration files, including `config.json`, `generation_config.json`, and the tokenization files (`tokenizer.json` and `special_tokens_map.json`). The quantized weights are stored in the OpenVINO IR files (`openvino_model.xml` and `openvino_model.bin`), ensuring easy integration with OpenVINO-based workflows.
## Prompt format
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 26 July 2024
{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>
{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
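You normally do not need to assemble this string by hand: the same prompt can be produced with the Hugging Face chat template. Below is a minimal sketch; the repo id of the original model is assumed, any local copy of the tokenizer works just as well.
```python
# Minimal sketch: build the Llama 3.2 prompt string via the chat template
from transformers import AutoTokenizer

# assumption: tokenizer taken from the original model repository
tokenizer = AutoTokenizer.from_pretrained("prithivMLmods/Llama-SmolTalk-3.2-1B-Instruct")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain OpenVINO IR in one sentence."},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # starts with <|begin_of_text|><|start_header_id|>system<|end_header_id|> ...
```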
## This is the OpenVINO IR format of the model, quantized to int8
The model was created with the Optimum-Intel library CLI command.
#### Dependencies required to create the model
There is an open clash between the dependency versions required by optimum-intel and openvino-genai:
> ⚠️ Exporting tokenizers to OpenVINO is not supported for tokenizers version > 0.19 and openvino version <= 2024.4. Please downgrade to tokenizers version <= 0.19 to export tokenizers to OpenVINO.
So for the model conversion these are the only dependencies you need:
```
pip install -U "openvino>=2024.3.0" "openvino-genai"
pip install "torch>=2.1" "nncf>=2.7" "transformers>=4.40.0" "onnx<1.16.2" "optimum>=1.16.1" "accelerate" "datasets>=2.14.6" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
```
The instructions come from the excellent [OpenVINO notebooks](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html#prerequisites); a vanilla `pip install` would create clashes among dependency versions.
These commands will install, among others:
```
tokenizers==0.20.3
torch==2.5.1+cpu
transformers==4.46.3
nncf==2.14.0
numpy==2.1.3
onnx==1.16.1
openvino==2024.5.0
openvino-genai==2024.5.0.0
openvino-telemetry==2024.5.0
openvino-tokenizers==2024.5.0.0
optimum==1.23.3
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c454b0000279ac9801302d726fbbbc1152733315
```
#### How to quantize the original model
After the previous step you can run the following command (assuming you have downloaded all the model weights and files from the [official model repository](https://huggingface.co/prithivMLmods/Llama-SmolTalk-3.2-1B-Instruct) into a subfolder called `Llama-SmolTalk-3.2-1B-Instruct`):
```bash
optimum-cli export openvino --model .\Llama-SmolTalk-3.2-1B-Instruct\ --task text-generation-with-past --trust-remote-code --weight-format int8 ov_Llama-SmolTalk-3.2-1B-Instruct
```
This will start the conversion process and complete without any fatal error.
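The same int8 weight-only export can also be done from Python with the optimum-intel API; a rough sketch, assuming the same local folder layout as above:
```python
# Sketch: export + int8 weight-only quantization via the optimum-intel Python API
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Llama-SmolTalk-3.2-1B-Instruct"    # local folder with the original weights
out_dir = "ov_Llama-SmolTalk-3.2-1B-Instruct"  # destination for the OpenVINO IR files

model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,  # convert the PyTorch weights to OpenVINO IR
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
model.save_pretrained(out_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)
```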
#### Dependencies required to run the model with `openvino-genai`
If you only need to run models already converted to the OpenVINO IR format, the only package you need to install is `openvino-genai`:
```
pip install openvino-genai==2024.5.0
```
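As a quick smoke test that the runtime and the converted model work, a minimal generation script (the folder name is assumed to be the one produced by the export above) could look like this:
```python
# Minimal smoke test for the int8 OpenVINO IR model with openvino-genai
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_Llama-SmolTalk-3.2-1B-Instruct", "CPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=64))
```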
## How to use the model with openvino-genai
We followed the official tutorial at [https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html), with some changes because here we are using chat templates; refer to [https://huggingface.co/docs/transformers/main/chat_templating](https://huggingface.co/docs/transformers/main/chat_templating).
```python
# MAIN IMPORTS
import warnings
warnings.filterwarnings(action='ignore')
import datetime
import sys
from transformers import AutoTokenizer  # for chat templating
import openvino_genai as ov_genai
import tiktoken

def countTokens(text):
    """
    Use tiktoken to count the number of tokens
    text -> str input
    Return -> int number of tokens counted
    """
    encoding = tiktoken.get_encoding("r50k_base")
    numoftokens = len(encoding.encode(text))
    return numoftokens

# LOADING THE MODEL
print('Loading the model', end='')
model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')
# PROMPT FORMATTING - we use the tokenizer chat template
tokenizer = AutoTokenizer.from_pretrained(model_dir)
print('✅ done')
print('Ready for generation')
print('Starting now Normal Chat based interface with NO TURNS - chat history disabled...')
while True:
    # Reset the history on every turn (no multi-turn memory)
    history = []
    userinput = ""
    print("\033[1;30m")  # dark grey
    print("Enter your text (end input with Ctrl+D on Unix or Ctrl+Z on Windows) - type quit! to exit the chatroom:")
    print("\033[91;1m")  # red
    lines = sys.stdin.readlines()
    for line in lines:
        userinput += line  # readlines() already keeps the trailing newline
    if lines and "quit!" in lines[0].lower():
        print("\033[0mBYE BYE!")
        break
    history.append({"role": "user", "content": userinput})
    # Render the Llama 3.2 chat template and append the assistant header
    tokenized_chat = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    # START PIPELINE - stop generation at <|eot_id|> (eos_token_id = 128009)
    start = datetime.datetime.now()
    print("\033[92;1m")
    streamer = lambda x: print(x, end='', flush=True)
    output = pipe.generate(tokenized_chat,
                           temperature=0.2,
                           do_sample=True,
                           max_new_tokens=500,
                           repetition_penalty=1.178,
                           streamer=streamer,
                           eos_token_id=128009)
    print('')
    delta = datetime.datetime.now() - start
    totalseconds = delta.total_seconds()
    totaltokens = countTokens(output)
    genspeed = totaltokens / totalseconds
    # PRINT THE STATISTICS
    print('---')
    print(f'Generated in {delta}')
    print(f'🧮 Total number of generated tokens: {totaltokens}')
    print(f'⏱️ Generation time: {totalseconds:.0f} seconds')
    print(f'📈 speed: {genspeed:.2f} t/s')
```