---
license: creativeml-openrail-m
datasets:
- HuggingFaceTB/smoltalk
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
library_name: OpenVINO
tags:
- Llama
- SmolTalk
- openvino
---

The **Llama-SmolTalk-3.2-1B-Instruct** model is a lightweight, instruction-tuned model designed for efficient text generation and conversational AI tasks. With a 1B parameter architecture, this model strikes a balance between performance and resource efficiency, making it ideal for applications requiring concise, contextually relevant outputs. The model has been fine-tuned to deliver robust instruction-following capabilities, catering to both structured and open-ended queries.

### Key Features:
1. **Instruction-Tuned Performance**: Optimized to understand and execute user-provided instructions across diverse domains.
2. **Lightweight Architecture**: With just 1 billion parameters, the model provides efficient computation and storage without compromising output quality.
3. **Versatile Use Cases**: Suitable for tasks like content generation, conversational interfaces, and basic problem-solving.

### Intended Applications:
- **Conversational AI**: Engage users with dynamic and contextually aware dialogue.
- **Content Generation**: Produce summaries, explanations, or other creative text outputs efficiently.
- **Instruction Execution**: Follow user commands to generate precise and relevant responses.

### Technical Details:
The model leverages the OpenVINO IR format for inference, with a tokenizer optimized for seamless text input processing. It comes with essential configuration files, including `config.json`, `generation_config.json`, and tokenization files (`tokenizer.json` and `special_tokens_map.json`). The primary weights are stored in a PyTorch binary format (`pytorch_model.bin`), ensuring easy integration with existing workflows.

## Prompt format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
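
You normally don't need to assemble this template by hand: the chat template bundled with the tokenizer produces it. A minimal sketch, assuming the tokenizer files sit in the converted model folder (the folder name here is just an example):

```python
from transformers import AutoTokenizer

# The folder name is an example: point it at wherever the tokenizer files live
tokenizer = AutoTokenizer.from_pretrained("ov_Llama-SmolTalk-3.2-1B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain OpenVINO in one sentence."},
]

# tokenize=False returns the formatted prompt string shown above;
# add_generation_prompt=True appends the assistant header so the model starts its reply
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```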

## This is the OpenVINO IR format of the model, quantized in int8
The model was created with the Optimum-Intel library CLI command.

#### Dependencies required to create the model
There is an open clash between the dependency versions of optimum-intel and openvino-genai:

> ⚠️ Exporting tokenizers to OpenVINO is not supported for tokenizers version > 0.19 and openvino version <= 2024.4. Please downgrade to tokenizers version <= 0.19 to export tokenizers to OpenVINO.

So, to convert the model, the only dependencies you need are:

```
pip install -U "openvino>=2024.3.0" "openvino-genai"
pip install "torch>=2.1" "nncf>=2.7" "transformers>=4.40.0" "onnx<1.16.2" "optimum>=1.16.1" "accelerate" "datasets>=2.14.6" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
```
The instructions are from the amazing [OpenVINO notebooks](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html#prerequisites).<br>
A vanilla `pip install` would create clashes among dependency versions.<br>
These commands will install, among others:
```
tokenizers==0.20.3
torch==2.5.1+cpu
transformers==4.46.3
nncf==2.14.0
numpy==2.1.3
onnx==1.16.1
openvino==2024.5.0
openvino-genai==2024.5.0.0
openvino-telemetry==2024.5.0
openvino-tokenizers==2024.5.0.0
optimum==1.23.3
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c454b0000279ac9801302d726fbbbc1152733315
```
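
Before running the export, it can be worth confirming that your installed versions roughly match the list above. A small sanity-check sketch (nothing here is specific to this model):

```python
# Print the versions of the packages that matter for the export step
from importlib.metadata import version, PackageNotFoundError

packages = ["openvino", "openvino-genai", "openvino-tokenizers", "optimum",
            "transformers", "tokenizers", "nncf", "onnx", "torch"]

for pkg in packages:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```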

#### How to quantize the original model
After the previous step, you can run the following command (assuming you downloaded all the model weights and files into a subfolder called `Llama-SmolTalk-3.2-1B-Instruct` from the [official model repository](https://huggingface.co/prithivMLmods/Llama-SmolTalk-3.2-1B-Instruct)):
```bash
optimum-cli export openvino --model .\Llama-SmolTalk-3.2-1B-Instruct\ --task text-generation-with-past --trust-remote-code --weight-format int8 ov_Llama-SmolTalk-3.2-1B-Instruct
```
This will start the export process and print progress messages; it should complete without any fatal errors.

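Once the export has finished, a quick smoke test (a sketch, reusing the output folder name from the command above) confirms that the int8 IR loads and generates:

```python
import openvino_genai as ov_genai

# Load the freshly exported int8 IR on CPU and generate a few tokens
pipe = ov_genai.LLMPipeline("ov_Llama-SmolTalk-3.2-1B-Instruct", "CPU")
print(pipe.generate("Hello, who are you?", max_new_tokens=30))
```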

#### Dependencies required to run the model with `openvino-genai`
If you only need to run models that have already been converted to the OpenVINO IR format, the only package you need is `openvino-genai`:
```
pip install openvino-genai==2024.5.0
```
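
After installing, you can quickly check which devices the OpenVINO runtime can target on your machine (a sketch; the `openvino` package itself comes in as a dependency of `openvino-genai`):

```python
import openvino as ov

# 'CPU' is always listed; 'GPU' / 'NPU' appear only when the corresponding drivers are installed
core = ov.Core()
print(core.available_devices)
```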

## How to use the model with openvino-genai
The script below follows the official tutorial at [https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html), with some changes because here we are using chat templates.

For details on chat templates, refer to [https://huggingface.co/docs/transformers/main/chat_templating](https://huggingface.co/docs/transformers/main/chat_templating).

```python
# MAIN IMPORTS
import warnings
warnings.filterwarnings(action='ignore')
import datetime
from transformers import AutoTokenizer  # for chat templating
import openvino_genai as ov_genai
import tiktoken
import sys

def countTokens(text):
    """
    Use tiktoken to count the number of tokens
    text -> str input
    Return -> int number of tokens counted
    """
    encoding = tiktoken.get_encoding("r50k_base")  # context_count = len(encoding.encode(yourtext))
    numoftokens = len(encoding.encode(text))
    return numoftokens

# LOADING THE MODEL
print('Loading the model', end='')
model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')
# PROMPT FORMATTING - we use the tokenizer chat template
tokenizer = AutoTokenizer.from_pretrained(model_dir)
print('✅ done')
print('Ready for generation')

print('Starting now a normal chat-based interface with NO TURNS - chat history disabled...')
counter = 1
while True:
    # Reset history every turn (no multi-turn memory)
    history = []
    userinput = ""
    print("\033[1;30m")  # dark grey
    print("Enter your text (end input with Ctrl+D on Unix or Ctrl+Z on Windows) - type quit! to exit the chatroom:")
    print("\033[91;1m")  # red
    lines = sys.stdin.readlines()
    for line in lines:
        userinput += line + "\n"
    # Exit on empty input or on the quit! keyword
    if not lines or "quit!" in lines[0].lower():
        print("\033[0mBYE BYE!")
        break
    history.append({"role": "user", "content": userinput})
    # add_generation_prompt=True appends the assistant header, matching the prompt format above
    tokenized_chat = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    # START PIPELINE setting eos_token_id = 128009 (<|eot_id|>)
    start = datetime.datetime.now()
    print("\033[92;1m")
    streamer = lambda x: print(x, end='', flush=True)
    output = pipe.generate(tokenized_chat, temperature=0.2,
                           do_sample=True,
                           max_new_tokens=500,
                           repetition_penalty=1.178,
                           streamer=streamer,
                           eos_token_id=128009)
    print('')
    delta = datetime.datetime.now() - start
    totalseconds = delta.total_seconds()
    totaltokens = countTokens(output)
    genspeed = totaltokens / totalseconds
    # PRINT THE STATISTICS
    print('---')
    print(f'Generated in {delta}')
    print(f'🧮 Total number of generated tokens: {totaltokens}')
    print(f'⏱️ Generation time: {totalseconds:.0f} seconds')
    print(f'📈 speed: {genspeed:.2f} t/s')
```
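
The script above deliberately resets the history on every turn. If you want multi-turn conversations instead, `openvino-genai` pipelines also expose a chat mode; the following is a sketch along the lines of the official openvino-genai chat sample (method names assumed from the 2024.5 API):

```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_Llama-SmolTalk-3.2-1B-Instruct", "CPU")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 256

# In chat mode the pipeline keeps the conversation history and applies
# the model's chat template internally on every generate() call
pipe.start_chat()
while True:
    prompt = input("question:\n")
    if prompt.strip().lower() == "quit!":
        break
    print(pipe.generate(prompt, config), "\n----------")
pipe.finish_chat()
```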