---
license: creativeml-openrail-m
datasets:
- HuggingFaceTB/smoltalk
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
library_name: OpenVINO
tags:
- Llama
- SmolTalk
- openvino
---

The **Llama-SmolTalk-3.2-1B-Instruct** model is a lightweight, instruction-tuned model designed for efficient text generation and conversational AI tasks. With a 1B parameter architecture, this model strikes a balance between performance and resource efficiency, making it ideal for applications requiring concise, contextually relevant outputs. The model has been fine-tuned to deliver robust instruction-following capabilities, catering to both structured and open-ended queries.

### Key Features:
1. **Instruction-Tuned Performance**: Optimized to understand and execute user-provided instructions across diverse domains.
2. **Lightweight Architecture**: With just 1 billion parameters, the model provides efficient computation and storage without compromising output quality.
3. **Versatile Use Cases**: Suitable for tasks like content generation, conversational interfaces, and basic problem-solving.

### Intended Applications:
- **Conversational AI**: Engage users with dynamic and contextually aware dialogue.
- **Content Generation**: Produce summaries, explanations, or other creative text outputs efficiently.
- **Instruction Execution**: Follow user commands to generate precise and relevant responses.

### Technical Details:
The model uses the OpenVINO IR format for inference, with a tokenizer optimized for seamless text input processing. It comes with the essential configuration files, including `config.json`, `generation_config.json`, and the tokenization files (`tokenizer.json` and `special_tokens_map.json`). The original checkpoint stores its primary weights in PyTorch binary format (`pytorch_model.bin`); this repository ships those weights converted to OpenVINO IR, so they integrate directly with existing OpenVINO workflows.

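As a quick sanity check, the configuration files listed above can be inspected with the standard `transformers` helpers. A minimal sketch, assuming the converted model is in a local folder named `ov_Llama-SmolTalk-3.2-1B-Instruct` (created in the conversion step below):

```python
# Inspect the configuration files that ship with the model.
from transformers import AutoConfig, GenerationConfig

model_dir = "ov_Llama-SmolTalk-3.2-1B-Instruct"  # adjust to your local path

config = AutoConfig.from_pretrained(model_dir)            # reads config.json
gen_config = GenerationConfig.from_pretrained(model_dir)  # reads generation_config.json

print(config.model_type, config.hidden_size, config.num_hidden_layers)
print(gen_config)
```
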
## Prompt format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

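You normally do not have to assemble this template by hand: the tokenizer shipped with the model carries the same chat template, so `apply_chat_template` from `transformers` produces the string above. A minimal sketch, assuming the model files are in a local folder named `ov_Llama-SmolTalk-3.2-1B-Instruct`:

```python
# Build the Llama 3.2 prompt with the tokenizer's chat template
# instead of formatting the special tokens manually.
from transformers import AutoTokenizer

model_dir = "ov_Llama-SmolTalk-3.2-1B-Instruct"  # local folder with tokenizer.json
tokenizer = AutoTokenizer.from_pretrained(model_dir)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain OpenVINO IR in one sentence."},
]

# tokenize=False returns the formatted string; add_generation_prompt appends
# the <|start_header_id|>assistant<|end_header_id|> header for generation.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```
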
## This is the OpenVINO IR format of the model, quantized in int8
The model was created with the Optimum-Intel library CLI command.
#### Dependencies required to create the model
There is an open clash between the dependency versions required by optimum-intel and openvino-genai:
> ⚠️ Exporting tokenizers to OpenVINO is not supported for tokenizers version > 0.19 and openvino version <= 2024.4. Please downgrade to tokenizers version <= 0.19 to export tokenizers to OpenVINO.

So for the model conversion these are the only install commands you need:

```
pip install -U "openvino>=2024.3.0" "openvino-genai"
pip install "torch>=2.1" "nncf>=2.7" "transformers>=4.40.0" "onnx<1.16.2" "optimum>=1.16.1" "accelerate" "datasets>=2.14.6" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
```
The instructions come from the amazing [OpenVINO notebooks](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html#prerequisites).<br>
A vanilla `pip install` of the individual packages will create clashes among dependency versions.<br>
The command above will install, among others:
```
tokenizers==0.20.3
torch==2.5.1+cpu
transformers==4.46.3
nncf==2.14.0
numpy==2.1.3
onnx==1.16.1
openvino==2024.5.0
openvino-genai==2024.5.0.0
openvino-telemetry==2024.5.0
openvino-tokenizers==2024.5.0.0
optimum==1.23.3
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c454b0000279ac9801302d726fbbbc1152733315
```

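Given how version-sensitive the tokenizer export is, it can help to confirm what actually got installed. A minimal check using only the standard library (the package names are the ones pinned above):

```python
# Print the installed versions of the packages that matter for the export.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("tokenizers", "transformers", "openvino", "openvino-genai",
            "openvino-tokenizers", "optimum", "nncf", "onnx"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```
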
#### How to quantize the original model
After the previous step you are able to run the following command (assuming you downloaded all the model weights and files into a subfolder called `Llama-SmolTalk-3.2-1B-Instruct` from the [official model repository](https://huggingface.co/prithivMLmods/Llama-SmolTalk-3.2-1B-Instruct)):
```bash
optimum-cli export openvino --model .\Llama-SmolTalk-3.2-1B-Instruct\ --task text-generation-with-past --trust-remote-code --weight-format int8 ov_Llama-SmolTalk-3.2-1B-Instruct
```
This starts the export, prints a series of progress messages without any fatal error, and leaves the int8 OpenVINO IR in the `ov_Llama-SmolTalk-3.2-1B-Instruct` folder.

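If you prefer to stay in Python instead of using the CLI, `optimum-intel` exposes the same int8 weight compression through `OVModelForCausalLM`. The sketch below is only an alternative path under the same folder-name assumptions; the CLI command above is how the model on this card was actually produced, and it additionally exports the OpenVINO tokenizer files that `openvino-genai` relies on.

```python
# Sketch: int8 weight-only quantization via the optimum-intel Python API.
# Paths mirror the CLI example above.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

src_dir = "Llama-SmolTalk-3.2-1B-Instruct"     # original PyTorch checkpoint
out_dir = "ov_Llama-SmolTalk-3.2-1B-Instruct"  # destination for the int8 OpenVINO IR

model = OVModelForCausalLM.from_pretrained(
    src_dir,
    export=True,                                    # convert the PyTorch weights to OpenVINO IR
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
model.save_pretrained(out_dir)

# Keep the Hugging Face tokenizer files next to the IR for chat templating.
AutoTokenizer.from_pretrained(src_dir).save_pretrained(out_dir)
```
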
#### Dependencies required to run the model with `openvino-genai`
If you only need to run models that are already converted into OpenVINO IR format, the single dependency you have to install is openvino-genai:
```
pip install openvino-genai==2024.5.0
```

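As a quick smoke test, assuming the converted model is in a local folder named `ov_Llama-SmolTalk-3.2-1B-Instruct`, a minimal generation looks like this (the full chat-style script is in the next section):

```python
# Minimal smoke test: load the int8 IR on CPU and generate a short completion.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_Llama-SmolTalk-3.2-1B-Instruct", "CPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=120))
```
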
## How to use the model with openvino-genai
The script below follows the official tutorial at [https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html), with some changes because here we use chat templates.

For details on chat templates, refer to [https://huggingface.co/docs/transformers/main/chat_templating](https://huggingface.co/docs/transformers/main/chat_templating).

```python
# MAIN IMPORTS
import warnings
warnings.filterwarnings(action='ignore')
import datetime
from transformers import AutoTokenizer  # for chat templating
import openvino_genai as ov_genai
import tiktoken
import sys

def countTokens(text):
    """
    Use tiktoken to count the number of tokens
    text -> str input
    Return -> int number of tokens counted
    """
    encoding = tiktoken.get_encoding("r50k_base")  # context_count = len(encoding.encode(yourtext))
    numoftokens = len(encoding.encode(text))
    return numoftokens

# LOADING THE MODEL
print('Loading the model', end='')
model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')
# PROMPT FORMATTING - we use the tokenizer's chat template
tokenizer = AutoTokenizer.from_pretrained(model_dir)
print('✅ done')
print('Ready for generation')

print('Starting Normal Chat based interface with NO TURNS - chat history disabled...')
while True:
    # Reset history on every turn: no multi-turn memory is kept
    history = []
    userinput = ""
    print("\033[1;30m")  # dark grey
    print("Enter your text (end input with Ctrl+D on Unix or Ctrl+Z on Windows) - type quit! to exit the chatroom:")
    print("\033[91;1m")  # red
    lines = sys.stdin.readlines()
    for line in lines:
        userinput += line  # readlines() already keeps the trailing newline
    if lines and "quit!" in lines[0].lower():
        print("\033[0mBYE BYE!")
        break
    history.append({"role": "user", "content": userinput})
    # add_generation_prompt appends the assistant header shown in the prompt format above
    tokenized_chat = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    # START PIPELINE, setting eos_token_id = 128009 (<|eot_id|> for Llama 3.x)
    start = datetime.datetime.now()
    print("\033[92;1m")
    streamer = lambda x: print(x, end='', flush=True)
    output = pipe.generate(tokenized_chat, temperature=0.2,
                           do_sample=True,
                           max_new_tokens=500,
                           repetition_penalty=1.178,
                           streamer=streamer,
                           eos_token_id=128009)
    print('')
    delta = datetime.datetime.now() - start
    totalseconds = delta.total_seconds()
    totaltokens = countTokens(output)
    genspeed = totaltokens / totalseconds
    # PRINT THE STATISTICS
    print('---')
    print(f'Generated in {delta}')
    print(f'🧮 Total number of generated tokens: {totaltokens}')
    print(f'⏱️ Generation time: {totalseconds:.0f} seconds')
    print(f'📈 speed: {genspeed:.2f} t/s')
```

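The loop above deliberately resets `history` on every turn, so the model keeps no memory between questions. If you want real multi-turn chat, one option (a sketch, not part of the original script) is to keep the history list across turns and append the assistant reply before the next prompt; `openvino-genai` also offers `pipe.start_chat()` / `pipe.finish_chat()` for stateful chat.

```python
# Sketch: multi-turn variant, reusing the `pipe` and `tokenizer` objects
# created in the script above.
history = []
for userinput in ["Hi, who are you?", "Summarize that in five words."]:
    history.append({"role": "user", "content": userinput})
    prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    reply = str(pipe.generate(prompt, max_new_tokens=200, eos_token_id=128009))
    history.append({"role": "assistant", "content": reply})
    print(reply)
```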