jartine committed
Commit a3d9475 · verified · 1 Parent(s): c486577

Update README.md

Files changed (1)
  1. README.md +254 -0
README.md ADDED
@@ -0,0 +1,254 @@
---
base_model: mistralai/Mixtral-8x22B-Instruct-v0.1
model_creator: mistralai
quantized_by: jartine
license: apache-2.0
prompt_template: |
  [INST] {{prompt}} [/INST]
tags:
  - llamafile
language:
  - en
---

# Mixtral 8x22B Instruct v0.1 - llamafile

This repository contains executable weights (which we call
[llamafiles](https://github.com/Mozilla-Ocho/llamafile)) that run on
Linux, macOS, Windows, FreeBSD, OpenBSD, and NetBSD for AMD64 and ARM64.

- Model creator: [Mistral AI](https://mistral.ai/)
- Original model: [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)

## Quickstart

Assuming your system has at least 128GB of RAM, you can try running the
following commands, which download, concatenate, and execute the model.

```
( curl -L https://huggingface.co/jartine/Mixtral-8x22B-Instruct-v0.1-llamafile/resolve/main/Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat0
  curl -L https://huggingface.co/jartine/Mixtral-8x22B-Instruct-v0.1-llamafile/resolve/main/Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat1
) > Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
chmod +x Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile --help   # view manual
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile          # launch web gui + oai api
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -p ...   # cli interface (scriptable)
```

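When the llamafile runs in its default server mode, the OpenAI-compatible
API can be called from a script. The following is a minimal sketch rather
than an official client. It assumes the server listens on
`http://localhost:8080` and serves `/v1/chat/completions` (check `--help`
if your setup differs). The helper names and the `model` value are
placeholders chosen for illustration. With the server running,
`print(chat("Say hello in French."))` would send one request.

```python
import json
import urllib.request

# Assumed default address of a locally running llamafile server.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        # Placeholder name; assumed to be ignored by a single-model server.
        "model": "Mixtral-8x22B-Instruct-v0.1",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```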
Alternatively, you may download an official `llamafile` executable from
Mozilla Ocho on GitHub, in which case you can use the Mixtral llamafiles
as a simple weights data file.

```
llamafile -m Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile ...
```

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting)
of the README.

## Prompting

Prompt template:

```
[INST] {{prompt}} [/INST]
```

Command template:

```
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -p "[INST]{{prompt}}[/INST]"
```

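For scripted use, the template can be filled in programmatically before
the CLI is invoked. This is a small sketch under stated assumptions: it
uses the spaced form of the prompt template, the helper names are
hypothetical, and the llamafile is assumed to sit in the current
directory. Passing the resulting list to `subprocess.run(command)` would
launch the model.

```python
import shlex

# Spaced form of the template shown under "Prompt template".
PROMPT_TEMPLATE = "[INST] {prompt} [/INST]"

# Assumed location of the executable produced in the Quickstart.
LLAMAFILE = "./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile"


def format_prompt(user_text):
    """Wrap a single user message in the instruct template."""
    return PROMPT_TEMPLATE.format(prompt=user_text.strip())


def build_command(user_text, llamafile=LLAMAFILE):
    """Return the argv list for a one-shot CLI run using the -p flag."""
    return [llamafile, "-p", format_prompt(user_text)]


command = build_command("Write a haiku about llamas.")
# shlex.join shows the command as it would be typed in a POSIX shell.
print(shlex.join(command))
```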
## About llamafile

llamafile is a new format introduced by Mozilla Ocho on Nov 20th, 2023.
It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

In addition to being executables, llamafiles are also zip archives. Each
llamafile contains a GGUF file, which you can extract using the `unzip`
command. If you want to change or add files to your llamafiles, then the
`zipalign` command (distributed on the llamafile GitHub) should be used
instead of the traditional `zip` command.

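Because a llamafile is a zip archive, its contents can also be inspected
from Python. The sketch below assumes that the standard `zipfile` module
can open a llamafile the same way `unzip` does, which has not been
verified here against a full-size file. The function names are
hypothetical.

```python
import zipfile


def list_gguf_members(llamafile_path):
    """Return the names of the GGUF files stored inside a llamafile."""
    with zipfile.ZipFile(llamafile_path) as archive:
        return [name for name in archive.namelist() if name.endswith(".gguf")]


def extract_gguf(llamafile_path, output_dir="."):
    """Extract every GGUF member into output_dir and return the paths."""
    with zipfile.ZipFile(llamafile_path) as archive:
        return [
            archive.extract(name, path=output_dir)
            for name in archive.namelist()
            if name.endswith(".gguf")
        ]
```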
## About Upload Limits

Files that exceed the Hugging Face 50GB upload limit are split into
parts with a .cat𝑋 extension. You need to use the `cat` command locally
to join them back into a single file, in ascending order of 𝑋 (`.cat0`,
then `.cat1`, and so on).

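On systems without `cat`, the same joining step can be done in Python.
This is a minimal sketch with hypothetical helper names. It sorts the
parts by their numeric suffix so that `.cat10` would follow `.cat9`, and
it streams the data in chunks. The joined file still needs `chmod +x` on
Unix-like systems before it can be executed.

```python
import shutil
from pathlib import Path


def find_parts(directory, llamafile_name):
    """Return <llamafile_name>.cat0, .cat1, ... sorted by numeric suffix."""
    parts = [
        path
        for path in Path(directory).glob(llamafile_name + ".cat*")
        if path.suffix[len(".cat"):].isdigit()
    ]
    return sorted(parts, key=lambda path: int(path.suffix[len(".cat"):]))


def join_parts(output_path, part_paths, chunk_size=64 * 1024 * 1024):
    """Concatenate part files, in the order given, into output_path."""
    with open(output_path, "wb") as output:
        for part in part_paths:
            with open(part, "rb") as source:
                # Stream in chunks so multi-gigabyte parts never sit in memory.
                shutil.copyfileobj(source, output, chunk_size)
```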
## About Quantization Formats (General Advice)

Your choice of quantization format depends on three things:

1. Will it fit in RAM or VRAM?
2. Is your use case reading (e.g. summarization) or writing (e.g. chatbot)?
3. Will you run it on Windows? llamafiles bigger than 4.30 GB are hard to run there (see [gotchas](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas-and-troubleshooting))

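For the first question, a back-of-the-envelope estimate is often enough.
The sketch below multiplies a parameter count by the bits stored per
weight. Its inputs are assumptions rather than figures from this README:
roughly 141 billion total parameters for Mixtral 8x22B, the usual
llama.cpp block layouts for `Q4_0` and `Q8_0` (32 weights plus one 16-bit
scale per block), and a guessed 8 GB allowance for the KV cache and the
operating system.

```python
# Approximate bits stored per weight, including per-block scale factors.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "BF16": 16.0,
    "Q8_0": 8.5,  # (32 * 8 + 16) / 32
    "Q4_0": 4.5,  # (32 * 4 + 16) / 32
}

# Assumed total parameter count for Mixtral 8x22B (about 141 billion).
TOTAL_PARAMS = 141e9


def weights_size_gb(quant, params=TOTAL_PARAMS):
    """Estimate the size of the weights in gigabytes (10**9 bytes)."""
    return params * BITS_PER_WEIGHT[quant] / 8 / 1e9


def fits(quant, memory_gb, overhead_gb=8.0):
    """Rough check; overhead_gb is a guessed allowance, not a measurement."""
    return weights_size_gb(quant) + overhead_gb <= memory_gb


for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{weights_size_gb(quant):.0f} GB, fits in 128 GB: {fits(quant, 128)}")
```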
Good quants for writing (prediction speed) are Q5\_K\_M and Q4\_0. Text
generation is bounded by memory speed, so smaller quants help, but they
cause the LLM to hallucinate more. However, that doesn't mean they can't
think correctly. A highly degraded quant like `Q2_K` may not make a
great encyclopedia, but it's still capable of logical reasoning and
the emergent capabilities LLMs exhibit.

Good quants for reading (evaluation speed) are BF16, F16, Q8\_0, and
Q4\_0 (ordered from fastest to slowest). Prompt evaluation is bounded by
flop count, which means performance can be improved through software
engineering alone, e.g. BLAS algorithms. In that case, quantization
starts hurting more than it helps, since it competes for CPU resources
and makes it harder for the compiler to parallelize instructions.
Ideally, you want to use the simplest, smallest floating-point format
that's natively implemented by your hardware. In most cases, that's BF16
or FP16. However, llamafile is still able to offer respectable tinyBLAS
speedups for llama.cpp's simplest quants: Q8\_0 and Q4\_0.

## Hardware Choices (Mixtral 8x22B Specific)

This model is very large. Even at Q2 quantization, it's still well over
twice as large as the memory of the highest-tier NVIDIA gaming GPUs.
llamafile supports splitting models over multiple GPUs (for NVIDIA only
currently) if you have such a system. The easiest way to have one, if
you don't, is to pay a few bucks an hour to rent a 4x RTX 4090 rig off
vast.ai.

Mac Studio is a good option for running this model locally. An M2 Ultra
desktop from Apple is affordable and has 128GB of unified RAM+VRAM. If
you have one, then llamafile will use your Metal GPU. Try starting out
with the `Q4_0` quantization level.

Another good option for running large, large language models locally and
fully under your control is to just use CPU inference. We developed new
tensor multiplication kernels on the llamafile project specifically to
speed up "mixture of experts" LLMs like Mixtral. On an AMD Threadripper
Pro 7995WX with 256GB of 5200 MT/s RAM, llamafile v0.8 runs Mixtral
8x22B Q4\_0 on Linux at 98 tokens per second for evaluation and
9.44 tokens per second for prediction.

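The prediction figure can be sanity-checked against memory bandwidth,
since text generation is bounded by memory speed. The sketch below is a
rough upper bound under assumptions that are not stated in this README:
eight 64-bit memory channels on the Threadripper Pro system, about 39
billion active parameters per token for Mixtral 8x22B, and 4.5 bits per
weight for `Q4_0`. Under those assumptions the bound comes out near 15
tokens per second, so the measured 9.44 is plausible.

```python
def peak_bandwidth_gbs(transfers_per_second, channels, bytes_per_transfer=8):
    """Theoretical DRAM bandwidth in GB/s (64-bit channels by default)."""
    return transfers_per_second * channels * bytes_per_transfer / 1e9


def max_tokens_per_second(bandwidth_gbs, active_params, bits_per_weight):
    """Upper bound if each token requires reading every active weight once."""
    gb_per_token = active_params * bits_per_weight / 8 / 1e9
    return bandwidth_gbs / gb_per_token


# Assumptions: 8 memory channels, ~39 billion active parameters per token,
# and 4.5 bits per weight for Q4_0. Only the 5200 MT/s figure is from the text.
bandwidth = peak_bandwidth_gbs(5200e6, channels=8)
bound = max_tokens_per_second(bandwidth, active_params=39e9, bits_per_weight=4.5)
print(f"{bandwidth:.1f} GB/s, at most {bound:.1f} tokens/s")
```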
---

# Model Card for Mixtral-8x22B-Instruct-v0.1
The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1).

## Run the model
```python
import torch
from transformers import AutoModelForCausalLM
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.protocol.instruct.tool_calls import (
    Tool,
    Function,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.instruct.normalize import ChatCompletionRequest

device = "cuda" # the device to load the model onto

tokenizer_v3 = MistralTokenizer.v3()

# Describe one callable tool and ask a question that should trigger it
mistral_query = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the user's location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris"),
    ],
    model="test",
)

# encode_chat_completion returns plain Python token ids (a list of ints)
encodeds = tokenizer_v3.encode_chat_completion(mistral_query).tokens
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
# generate() needs a (batch, sequence) tensor, so wrap the ids before moving them
model_inputs = torch.tensor([encodeds]).to(device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
sp_tokenizer = tokenizer_v3.instruct_tokenizer.tokenizer
# convert the output tensor back to a list of ints for the tokenizer
decoded = sp_tokenizer.decode(generated_ids[0].tolist())
print(decoded)
```

# Instruct tokenizer
The HuggingFace tokenizer included in this release should match our own. To compare:
`pip install mistral-common`

```py
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.tokens.instruct.normalize import ChatCompletionRequest

from transformers import AutoTokenizer

tokenizer_v3 = MistralTokenizer.v3()

mistral_query = ChatCompletionRequest(
    messages=[
        UserMessage(content="How many experts ?"),
        AssistantMessage(content="8"),
        UserMessage(content="How big ?"),
        AssistantMessage(content="22B"),
        UserMessage(content="Noice 🎉 !"),
    ],
    model="test",
)
hf_messages = mistral_query.model_dump()['messages']

tokenized_mistral = tokenizer_v3.encode_chat_completion(mistral_query).tokens

tokenizer_hf = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x22B-Instruct-v0.1')
tokenized_hf = tokenizer_hf.apply_chat_template(hf_messages, tokenize=True)

assert tokenized_hf == tokenized_mistral
```

# Function calling and special tokens
This tokenizer includes additional special tokens related to function calling:
- [TOOL_CALLS]
- [AVAILABLE_TOOLS]
- [/AVAILABLE_TOOLS]
- [TOOL_RESULTS]
- [/TOOL_RESULTS]

If you want to use this model with function calling, please be sure to apply these tokens similarly to what is done in our [SentencePieceTokenizerV3](https://github.com/mistralai/mistral-common/blob/main/src/mistral_common/tokens/tokenizers/sentencepiece.py#L299).

# The Mistral AI Team
Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Antoine Roux,
Arthur Mensch, Audrey Herblin-Stoop, Baptiste Bout, Baudouin de Monicault,
Blanche Savary, Bam4d, Caroline Feldman, Devendra Singh Chaplot,
Diego de las Casas, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger,
Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona,
Jean-Malo Delignon, Jia Li, Justus Murke, Louis Martin, Louis Ternon,
Lucile Saulnier, Lélio Renard Lavaud, Margaret Jennings, Marie Pellat,
Marie Torelli, Marie-Anne Lachaux, Nicolas Schuhl, Patrick von Platen,
Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao,
Thibaut Lavril, Timothée Lacroix, Théophile Gervet, Thomas Wang,
Valera Nemychnikova, William El Sayed, William Marshall