---
license: creativeml-openrail-m
datasets:
- HuggingFaceTB/smoltalk
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
library_name: OpenVINO
tags:
- Llama
- SmolTalk
- openvino
---

The **Llama-SmolTalk-3.2-1B-Instruct** model is a lightweight, instruction-tuned model designed for efficient text generation and conversational AI tasks. With a 1B-parameter architecture, it strikes a balance between performance and resource efficiency, making it well suited for applications that require concise, contextually relevant outputs. The model has been fine-tuned to deliver robust instruction-following capabilities, covering both structured and open-ended queries.

### Key Features:
1. **Instruction-Tuned Performance**: Optimized to understand and execute user-provided instructions across diverse domains.
2. **Lightweight Architecture**: With just 1 billion parameters, the model provides efficient computation and storage without compromising output quality.
3. **Versatile Use Cases**: Suitable for tasks like content generation, conversational interfaces, and basic problem-solving.

### Intended Applications:
- **Conversational AI**: Engage users with dynamic and contextually aware dialogue.
- **Content Generation**: Produce summaries, explanations, or other creative text outputs efficiently.
- **Instruction Execution**: Follow user commands to generate precise and relevant responses.

### Technical Details:
The model uses the OpenVINO IR format for inference, with a tokenizer optimized for seamless text input processing. It comes with the essential configuration files, including `config.json`, `generation_config.json`, and the tokenization files (`tokenizer.json` and `special_tokens_map.json`). The original model stores its weights in PyTorch binary format (`pytorch_model.bin`); this repository contains the same weights converted to OpenVINO IR for easy integration into OpenVINO workflows.

## Prompt format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

```

## This is the OpenVINO IR format of the model, quantized in int8

The model was created with the Optimum-Intel library CLI command.

#### Dependencies required to create the model

There is an open clash between the dependency versions required by optimum-intel and openvino-genai:

> ⚠️ Exporting tokenizers to OpenVINO is not supported for tokenizers version > 0.19 and openvino version <= 2024.4. Please downgrade to tokenizers version <= 0.19 to export tokenizers to OpenVINO.

So for the model conversion these are the only dependencies you need:

```
pip install -U "openvino>=2024.3.0" "openvino-genai"
pip install "torch>=2.1" "nncf>=2.7" "transformers>=4.40.0" "onnx<1.16.2" "optimum>=1.16.1" "accelerate" "datasets>=2.14.6" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
```

The instructions are from the amazing [OpenVINO notebooks](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html#prerequisites).
A vanilla `pip install` of the individual packages will create clashes among the dependency versions; the command above installs a combination that works together.
This command will install, among others:

```
tokenizers==0.20.3
torch==2.5.1+cpu
transformers==4.46.3
nncf==2.14.0
numpy==2.1.3
onnx==1.16.1
openvino==2024.5.0
openvino-genai==2024.5.0.0
openvino-telemetry==2024.5.0
openvino-tokenizers==2024.5.0.0
optimum==1.23.3
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c454b0000279ac9801302d726fbbbc1152733315
```

#### How to quantize the original model

After the previous step you can run the following command (assuming you downloaded all the model weights and files from the [official model repository](https://huggingface.co/prithivMLmods/Llama-SmolTalk-3.2-1B-Instruct) into a subfolder called `Llama-SmolTalk-3.2-1B-Instruct`):

```bash
optimum-cli export openvino --model .\Llama-SmolTalk-3.2-1B-Instruct\ --task text-generation-with-past --trust-remote-code --weight-format int8 ov_Llama-SmolTalk-3.2-1B-Instruct
```

This will start the export process and produce a stream of log messages, without any fatal error.
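If you prefer the Python API over the CLI, the same int8 weight-only export can also be done from optimum-intel directly. The snippet below is a rough sketch, not the exact command used for this repository; it assumes a recent optimum-intel (>= 1.16) that provides `OVWeightQuantizationConfig`, installed as described in the previous section.

```python
# Sketch of a Python-API equivalent of the optimum-cli export above
# (assumption: optimum-intel >= 1.16 with the OpenVINO backend installed)
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "Llama-SmolTalk-3.2-1B-Instruct"      # local folder with the original PyTorch weights
out_dir = "ov_Llama-SmolTalk-3.2-1B-Instruct"    # destination folder for the OpenVINO IR files

# export=True converts the checkpoint to OpenVINO IR on the fly;
# bits=8 applies int8 weight-only compression, like --weight-format int8
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
model.save_pretrained(out_dir)

# Keep the tokenizer files next to the IR so the output folder is self-contained
AutoTokenizer.from_pretrained(model_id).save_pretrained(out_dir)
```

Both routes should leave you with the `openvino_model.xml`/`openvino_model.bin` pair (plus configuration and tokenizer files) that the `openvino-genai` pipeline below expects.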
#### Dependencies required to run the model with `openvino-genai`

If you simply need to run models already converted to OpenVINO IR format, the only package you need to install is `openvino-genai`:

```
pip install openvino-genai==2024.5.0
```

## How to use the model with openvino-genai

The code below follows the official tutorial at [https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html), with changes because here we are using chat templates (refer to [https://huggingface.co/docs/transformers/main/chat_templating](https://huggingface.co/docs/transformers/main/chat_templating)).

```python
# MAIN IMPORTS
import warnings
warnings.filterwarnings(action='ignore')
import datetime
import sys

from transformers import AutoTokenizer  # for chat templating
import openvino_genai as ov_genai
import tiktoken


def countTokens(text):
    """
    Use tiktoken to count the number of tokens
    text -> str input
    Return -> int number of tokens counted
    """
    encoding = tiktoken.get_encoding("r50k_base")
    numoftokens = len(encoding.encode(text))
    return numoftokens


# LOADING THE MODEL
print('Loading the model', end='')
model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')
# PROMPT FORMATTING - we use the tokenizer's chat template
tokenizer = AutoTokenizer.from_pretrained(model_dir)
print('✅ done')
print('Ready for generation')
print('Starting Normal Chat based interface with NO TURNS - chat history disabled...')
counter = 1
while True:
    # Reset history ALWAYS: every turn is independent, no chat history is kept
    history = []
    userinput = ""
    print("\033[1;30m")  # dark grey
    print("Enter your text (end input with Ctrl+D on Unix or Ctrl+Z on Windows) - type quit! to exit the chatroom:")
    print("\033[91;1m")  # red
    lines = sys.stdin.readlines()
    for line in lines:
        userinput += line + "\n"
    if "quit!" in lines[0].lower():
        print("\033[0mBYE BYE!")
        break
    history.append({"role": "user", "content": userinput})
    # add_generation_prompt=True appends the assistant header so the model starts its reply
    tokenized_chat = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    # START PIPELINE - eos_token_id=128009 is <|eot_id|>, the end-of-turn token of Llama 3.2
    start = datetime.datetime.now()
    print("\033[92;1m")  # green
    streamer = lambda x: print(x, end='', flush=True)
    output = pipe.generate(tokenized_chat,
                           temperature=0.2,
                           do_sample=True,
                           max_new_tokens=500,
                           repetition_penalty=1.178,
                           streamer=streamer,
                           eos_token_id=128009)
    print('')
    delta = datetime.datetime.now() - start
    totalseconds = delta.total_seconds()
    totaltokens = countTokens(output)
    genspeed = totaltokens / totalseconds
    # PRINT THE STATISTICS
    print('---')
    print(f'Generated in {delta}')
    print(f'🧮 Total number of generated tokens: {totaltokens}')
    print(f'⏱️ Generation time: {totalseconds:.0f} seconds')
    print(f'📈 speed: {genspeed:.2f} t/s')
```
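If you just want a quick sanity check before running the interactive script above, here is a minimal one-shot sketch under the same assumptions (converted model in `ov_Llama-SmolTalk-3.2-1B-Instruct`, CPU inference); `apply_chat_template` renders exactly the prompt format shown earlier in this card.

```python
# Minimal one-shot sketch (assumes the IR model was exported into the folder below
# and that transformers is installed for the chat template)
from transformers import AutoTokenizer
import openvino_genai as ov_genai

model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')
tokenizer = AutoTokenizer.from_pretrained(model_dir)

messages = [{"role": "user", "content": "Explain OpenVINO in one sentence."}]
# Renders the Llama 3.2 prompt format shown above and appends the assistant header
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# eos_token_id=128009 is <|eot_id|>, the end-of-turn token of Llama 3.2
output = pipe.generate(prompt, max_new_tokens=120, temperature=0.2,
                       do_sample=True, eos_token_id=128009)
print(output)
```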