---
license: creativeml-openrail-m
datasets:
- HuggingFaceTB/smoltalk
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
library_name: OpenVINO
tags:
- Llama
- SmolTalk
- openvino
---

The **Llama-SmolTalk-3.2-1B-Instruct** model is a lightweight, instruction-tuned model designed for efficient text generation and conversational AI tasks. With a 1B-parameter architecture, it strikes a balance between performance and resource efficiency, making it well suited for applications that require concise, contextually relevant outputs. The model has been fine-tuned to deliver robust instruction-following capabilities, covering both structured and open-ended queries.

### Key Features:
1. **Instruction-Tuned Performance**: Optimized to understand and execute user-provided instructions across diverse domains.
2. **Lightweight Architecture**: With just 1 billion parameters, the model provides efficient computation and storage without compromising output quality.
3. **Versatile Use Cases**: Suitable for tasks like content generation, conversational interfaces, and basic problem-solving.

### Intended Applications:
- **Conversational AI**: Engage users with dynamic and contextually aware dialogue.
- **Content Generation**: Produce summaries, explanations, or other creative text outputs efficiently.
- **Instruction Execution**: Follow user commands to generate precise and relevant responses.

### Technical Details:
The model uses the OpenVINO IR format for inference, with a tokenizer optimized for seamless text input processing. It comes with the essential configuration files, including `config.json`, `generation_config.json`, and the tokenization files (`tokenizer.json` and `special_tokens_map.json`). The original model stores its weights in PyTorch binary format (`pytorch_model.bin`); this repository contains the same weights converted to OpenVINO IR for easy integration into OpenVINO workflows.

## Prompt format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```

## This is the OpenVINO IR format of the model, quantized in int8

The model was created with the Optimum-Intel library CLI command.

#### Dependencies required to create the model

There is an open clash between the dependency versions required by optimum-intel and openvino-genai:

> ⚠️ Exporting tokenizers to OpenVINO is not supported for tokenizers version > 0.19 and openvino version <= 2024.4. Please downgrade to tokenizers version <= 0.19 to export tokenizers to OpenVINO.

So for the model conversion these are the only dependencies you need:

```
pip install -U "openvino>=2024.3.0" "openvino-genai"
pip install "torch>=2.1" "nncf>=2.7" "transformers>=4.40.0" "onnx<1.16.2" "optimum>=1.16.1" "accelerate" "datasets>=2.14.6" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
```

The instructions are from the amazing [OpenVINO notebooks](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html#prerequisites).
A vanilla `pip install` of the individual packages will create clashes among the dependency versions; the command above installs a combination that works together.
This command will install, among others:

```
tokenizers==0.20.3
torch==2.5.1+cpu
transformers==4.46.3
nncf==2.14.0
numpy==2.1.3
onnx==1.16.1
openvino==2024.5.0
openvino-genai==2024.5.0.0
openvino-telemetry==2024.5.0
openvino-tokenizers==2024.5.0.0
optimum==1.23.3
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c454b0000279ac9801302d726fbbbc1152733315
```

#### How to quantize the original model

After the previous step you can run the following command (assuming you downloaded all the model weights and files from the [official model repository](https://huggingface.co/prithivMLmods/Llama-SmolTalk-3.2-1B-Instruct) into a subfolder called `Llama-SmolTalk-3.2-1B-Instruct`):

```bash
optimum-cli export openvino --model .\Llama-SmolTalk-3.2-1B-Instruct\ --task text-generation-with-past --trust-remote-code --weight-format int8 ov_Llama-SmolTalk-3.2-1B-Instruct
```

This will start the export process and produce a stream of log messages, without any fatal error.
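If you prefer the Python API over the CLI, the same int8 weight-only export can also be done from optimum-intel directly. The snippet below is a rough sketch, not the exact command used for this repository; it assumes a recent optimum-intel (>= 1.16) that provides `OVWeightQuantizationConfig`, installed as described in the previous section.

```python
# Sketch of a Python-API equivalent of the optimum-cli export above
# (assumption: optimum-intel >= 1.16 with the OpenVINO backend installed)
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Llama-SmolTalk-3.2-1B-Instruct"      # local folder with the original PyTorch weights
out_dir = "ov_Llama-SmolTalk-3.2-1B-Instruct"    # destination folder for the OpenVINO IR files

# export=True converts the checkpoint to OpenVINO IR on the fly;
# bits=8 applies int8 weight-only compression, like --weight-format int8
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
model.save_pretrained(out_dir)

# Keep the tokenizer files next to the IR so the output folder is self-contained
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)
```

Both routes should leave you with the `openvino_model.xml`/`openvino_model.bin` pair (plus configuration and tokenizer files) that the `openvino-genai` pipeline below expects.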
#### Dependencies required to run the model with `openvino-genai`

If you simply need to run models already converted to OpenVINO IR format, the only package you need to install is `openvino-genai`:

```
pip install openvino-genai==2024.5.0
```

## How to use the model with openvino-genai

The code below follows the official tutorial at [https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html), with changes because here we are using chat templates (refer to [https://huggingface.co/docs/transformers/main/chat_templating](https://huggingface.co/docs/transformers/main/chat_templating)).

```python
# MAIN IMPORTS
import warnings
warnings.filterwarnings(action='ignore')
import datetime
import sys

from transformers import AutoTokenizer  # for chat templating
import openvino_genai as ov_genai
import tiktoken


def countTokens(text):
    """
    Use tiktoken to count the number of tokens
    text -> str input
    Return -> int number of tokens counted
    """
    encoding = tiktoken.get_encoding("r50k_base")
    numoftokens = len(encoding.encode(text))
    return numoftokens


# LOADING THE MODEL
print('Loading the model', end='')
model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')
# PROMPT FORMATTING - we use the tokenizer's chat template
tokenizer = AutoTokenizer.from_pretrained(model_dir)
print('✅ done')
print('Ready for generation')
print('Starting Normal Chat based interface with NO TURNS - chat history disabled...')
counter = 1
while True:
    # Reset history ALWAYS: every turn is independent, no chat history is kept
    history = []
    userinput = ""
    print("\033[1;30m")  # dark grey
    print("Enter your text (end input with Ctrl+D on Unix or Ctrl+Z on Windows) - type quit! to exit the chatroom:")
    print("\033[91;1m")  # red
    lines = sys.stdin.readlines()
    for line in lines:
        userinput += line + "\n"
    if "quit!" in lines[0].lower():
        print("\033[0mBYE BYE!")
        break
    history.append({"role": "user", "content": userinput})
    # add_generation_prompt=True appends the assistant header so the model starts its reply
    tokenized_chat = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    # START PIPELINE - eos_token_id=128009 is <|eot_id|>, the end-of-turn token of Llama 3.2
    start = datetime.datetime.now()
    print("\033[92;1m")  # green
    streamer = lambda x: print(x, end='', flush=True)
    output = pipe.generate(tokenized_chat,
                           temperature=0.2,
                           do_sample=True,
                           max_new_tokens=500,
                           repetition_penalty=1.178,
                           streamer=streamer,
                           eos_token_id=128009)
    print('')
    delta = datetime.datetime.now() - start
    totalseconds = delta.total_seconds()
    totaltokens = countTokens(output)
    genspeed = totaltokens / totalseconds
    # PRINT THE STATISTICS
    print('---')
    print(f'Generated in {delta}')
    print(f'🧮 Total number of generated tokens: {totaltokens}')
    print(f'⏱️ Generation time: {totalseconds:.0f} seconds')
    print(f'📈 speed: {genspeed:.2f} t/s')
```
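If you just want a quick sanity check before running the interactive script above, here is a minimal one-shot sketch under the same assumptions (converted model in `ov_Llama-SmolTalk-3.2-1B-Instruct`, CPU inference); `apply_chat_template` renders exactly the prompt format shown earlier in this card.

```python
# Minimal one-shot sketch (assumes the IR model was exported into the folder below
# and that transformers is installed for the chat template)
from transformers import AutoTokenizer
import openvino_genai as ov_genai

model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')
tokenizer = AutoTokenizer.from_pretrained(model_dir)

messages = [{"role": "user", "content": "Explain OpenVINO in one sentence."}]
# Renders the Llama 3.2 prompt format shown above and appends the assistant header
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# eos_token_id=128009 is <|eot_id|>, the end-of-turn token of Llama 3.2
output = pipe.generate(prompt, max_new_tokens=120, temperature=0.2,
                       do_sample=True, eos_token_id=128009)
print(output)
```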