---
license: creativeml-openrail-m
datasets:
- HuggingFaceTB/smoltalk
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
library_name: OpenVINO
tags:
- Llama
- SmolTalk
- openvino
---

The **Llama-SmolTalk-3.2-1B-Instruct** model is a lightweight, instruction-tuned model designed for efficient text generation and conversational AI tasks. With a 1B parameter architecture, this model strikes a balance between performance and resource efficiency, making it ideal for applications requiring concise, contextually relevant outputs. The model has been fine-tuned to deliver robust instruction-following capabilities, catering to both structured and open-ended queries.

### Key Features:
1. **Instruction-Tuned Performance**: Optimized to understand and execute user-provided instructions across diverse domains.
2. **Lightweight Architecture**: With just 1 billion parameters, the model provides efficient computation and storage without compromising output quality.
3. **Versatile Use Cases**: Suitable for tasks like content generation, conversational interfaces, and basic problem-solving.

### Intended Applications:
- **Conversational AI**: Engage users with dynamic and contextually aware dialogue.
- **Content Generation**: Produce summaries, explanations, or other creative text outputs efficiently.
- **Instruction Execution**: Follow user commands to generate precise and relevant responses.

### Technical Details:
The model uses the OpenVINO IR format for inference, with a tokenizer optimized for seamless text input processing. It comes with the essential configuration files, including `config.json`, `generation_config.json`, and the tokenization files (`tokenizer.json` and `special_tokens_map.json`). The original checkpoint stores its primary weights in PyTorch binary format (`pytorch_model.bin`); this repository ships those weights converted to OpenVINO IR, so they integrate directly with existing OpenVINO workflows.

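As a quick sanity check, the configuration files listed above can be inspected with the standard `transformers` helpers. A minimal sketch, assuming the converted model is in a local folder named `ov_Llama-SmolTalk-3.2-1B-Instruct` (created in the conversion step below):

```python
# Inspect the configuration files that ship with the model.
from transformers import AutoConfig, GenerationConfig

model_dir = "ov_Llama-SmolTalk-3.2-1B-Instruct"  # adjust to your local path

config = AutoConfig.from_pretrained(model_dir)            # reads config.json
gen_config = GenerationConfig.from_pretrained(model_dir)  # reads generation_config.json

print(config.model_type, config.hidden_size, config.num_hidden_layers)
print(gen_config)
```
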
## Prompt format

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

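You normally do not have to assemble this template by hand: the tokenizer shipped with the model carries the same chat template, so `apply_chat_template` from `transformers` produces the string above. A minimal sketch, assuming the model files are in a local folder named `ov_Llama-SmolTalk-3.2-1B-Instruct`:

```python
# Build the Llama 3.2 prompt with the tokenizer's chat template
# instead of formatting the special tokens manually.
from transformers import AutoTokenizer

model_dir = "ov_Llama-SmolTalk-3.2-1B-Instruct"  # local folder with tokenizer.json
tokenizer = AutoTokenizer.from_pretrained(model_dir)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Explain OpenVINO IR in one sentence."},
]

# tokenize=False returns the formatted string; add_generation_prompt appends
# the <|start_header_id|>assistant<|end_header_id|> header for generation.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```
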
## This is the OpenVINO IR format of the model, quantized in int8
The model was created with the Optimum-Intel library CLI command.
#### Dependencies required to create the model
There is an open clash between the dependency versions required by optimum-intel and openvino-genai:
> ⚠️ Exporting tokenizers to OpenVINO is not supported for tokenizers version > 0.19 and openvino version <= 2024.4. Please downgrade to tokenizers version <= 0.19 to export tokenizers to OpenVINO.

So for the model conversion these are the only install commands you need:

```
pip install -U "openvino>=2024.3.0" "openvino-genai"
pip install "torch>=2.1" "nncf>=2.7" "transformers>=4.40.0" "onnx<1.16.2" "optimum>=1.16.1" "accelerate" "datasets>=2.14.6" "git+https://github.com/huggingface/optimum-intel.git" --extra-index-url https://download.pytorch.org/whl/cpu
```
The instructions come from the amazing [OpenVINO notebooks](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html#prerequisites).<br>
A vanilla `pip install` of the individual packages will create clashes among dependency versions.<br>
The command above will install, among others:
```
tokenizers==0.20.3
torch==2.5.1+cpu
transformers==4.46.3
nncf==2.14.0
numpy==2.1.3
onnx==1.16.1
openvino==2024.5.0
openvino-genai==2024.5.0.0
openvino-telemetry==2024.5.0
openvino-tokenizers==2024.5.0.0
optimum==1.23.3
optimum-intel @ git+https://github.com/huggingface/optimum-intel.git@c454b0000279ac9801302d726fbbbc1152733315
```

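Given how version-sensitive the tokenizer export is, it can help to confirm what actually got installed. A minimal check using only the standard library (the package names are the ones pinned above):

```python
# Print the installed versions of the packages that matter for the export.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("tokenizers", "transformers", "openvino", "openvino-genai",
            "openvino-tokenizers", "optimum", "nncf", "onnx"):
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```
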
#### How to quantize the original model
After the previous step you are able to run the following command (assuming you downloaded all the model weights and files into a subfolder called `Llama-SmolTalk-3.2-1B-Instruct` from the [official model repository](https://huggingface.co/prithivMLmods/Llama-SmolTalk-3.2-1B-Instruct)):
```bash
optimum-cli export openvino --model .\Llama-SmolTalk-3.2-1B-Instruct\ --task text-generation-with-past --trust-remote-code --weight-format int8 ov_Llama-SmolTalk-3.2-1B-Instruct
```
This starts the export, prints a series of progress messages without any fatal error, and leaves the int8 OpenVINO IR in the `ov_Llama-SmolTalk-3.2-1B-Instruct` folder.

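If you prefer to stay in Python instead of using the CLI, `optimum-intel` exposes the same int8 weight compression through `OVModelForCausalLM`. The sketch below is only an alternative path under the same folder-name assumptions; the CLI command above is how the model on this card was actually produced, and it additionally exports the OpenVINO tokenizer files that `openvino-genai` relies on.

```python
# Sketch: int8 weight-only quantization via the optimum-intel Python API.
# Paths mirror the CLI example above.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

src_dir = "Llama-SmolTalk-3.2-1B-Instruct"     # original PyTorch checkpoint
out_dir = "ov_Llama-SmolTalk-3.2-1B-Instruct"  # destination for the int8 OpenVINO IR

model = OVModelForCausalLM.from_pretrained(
    src_dir,
    export=True,                                    # convert the PyTorch weights to OpenVINO IR
    quantization_config=OVWeightQuantizationConfig(bits=8),
)
model.save_pretrained(out_dir)

# Keep the Hugging Face tokenizer files next to the IR for chat templating.
AutoTokenizer.from_pretrained(src_dir).save_pretrained(out_dir)
```
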
#### Dependencies required to run the model with `openvino-genai`
If you only need to run models that are already converted into OpenVINO IR format, the single dependency you have to install is openvino-genai:
```
pip install openvino-genai==2024.5.0
```

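As a quick smoke test, assuming the converted model is in a local folder named `ov_Llama-SmolTalk-3.2-1B-Instruct`, a minimal generation looks like this (the full chat-style script is in the next section):

```python
# Minimal smoke test: load the int8 IR on CPU and generate a short completion.
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("ov_Llama-SmolTalk-3.2-1B-Instruct", "CPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=120))
```
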
## How to use the model with openvino-genai
The script below follows the official tutorial at [https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html](https://docs.openvino.ai/2024/notebooks/llm-question-answering-with-output.html), with some changes because here we use chat templates.

For details on chat templates, refer to [https://huggingface.co/docs/transformers/main/chat_templating](https://huggingface.co/docs/transformers/main/chat_templating).

```python
# MAIN IMPORTS
import warnings
warnings.filterwarnings(action='ignore')
import datetime
from transformers import AutoTokenizer  # for chat templating
import openvino_genai as ov_genai
import tiktoken
import sys

def countTokens(text):
    """
    Use tiktoken to count the number of tokens
    text -> str input
    Return -> int number of tokens counted
    """
    encoding = tiktoken.get_encoding("r50k_base")  # context_count = len(encoding.encode(yourtext))
    numoftokens = len(encoding.encode(text))
    return numoftokens

# LOADING THE MODEL
print('Loading the model', end='')
model_dir = 'ov_Llama-SmolTalk-3.2-1B-Instruct'
pipe = ov_genai.LLMPipeline(model_dir, 'CPU')
# PROMPT FORMATTING - we use the tokenizer's chat template
tokenizer = AutoTokenizer.from_pretrained(model_dir)
print('✅ done')
print('Ready for generation')

print('Starting Normal Chat based interface with NO TURNS - chat history disabled...')
while True:
    # Reset history on every turn: no multi-turn memory is kept
    history = []
    userinput = ""
    print("\033[1;30m")  # dark grey
    print("Enter your text (end input with Ctrl+D on Unix or Ctrl+Z on Windows) - type quit! to exit the chatroom:")
    print("\033[91;1m")  # red
    lines = sys.stdin.readlines()
    for line in lines:
        userinput += line  # readlines() already keeps the trailing newline
    if lines and "quit!" in lines[0].lower():
        print("\033[0mBYE BYE!")
        break
    history.append({"role": "user", "content": userinput})
    # add_generation_prompt appends the assistant header shown in the prompt format above
    tokenized_chat = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    # START PIPELINE, setting eos_token_id = 128009 (<|eot_id|> for Llama 3.x)
    start = datetime.datetime.now()
    print("\033[92;1m")
    streamer = lambda x: print(x, end='', flush=True)
    output = pipe.generate(tokenized_chat, temperature=0.2,
                           do_sample=True,
                           max_new_tokens=500,
                           repetition_penalty=1.178,
                           streamer=streamer,
                           eos_token_id=128009)
    print('')
    delta = datetime.datetime.now() - start
    totalseconds = delta.total_seconds()
    totaltokens = countTokens(output)
    genspeed = totaltokens / totalseconds
    # PRINT THE STATISTICS
    print('---')
    print(f'Generated in {delta}')
    print(f'🧮 Total number of generated tokens: {totaltokens}')
    print(f'⏱️ Generation time: {totalseconds:.0f} seconds')
    print(f'📈 speed: {genspeed:.2f} t/s')
```

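The loop above deliberately resets `history` on every turn, so the model keeps no memory between questions. If you want real multi-turn chat, one option (a sketch, not part of the original script) is to keep the history list across turns and append the assistant reply before the next prompt; `openvino-genai` also offers `pipe.start_chat()` / `pipe.finish_chat()` for stateful chat.

```python
# Sketch: multi-turn variant, reusing the `pipe` and `tokenizer` objects
# created in the script above.
history = []
for userinput in ["Hi, who are you?", "Summarize that in five words."]:
    history.append({"role": "user", "content": userinput})
    prompt = tokenizer.apply_chat_template(history, tokenize=False, add_generation_prompt=True)
    reply = str(pipe.generate(prompt, max_new_tokens=200, eos_token_id=128009))
    history.append({"role": "assistant", "content": reply})
    print(reply)
```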