---
base_model: mistralai/Mixtral-8x22B-Instruct-v0.1
model_creator: mistralai
quantized_by: jartine
license: apache-2.0
prompt_template: |
  [INST] {{prompt}} [/INST]
tags:
  - llamafile
language:
  - en
---

# Mixtral 8x22B Instruct v0.1 - llamafile

This repository contains executable weights (which we call
[llamafiles](https://github.com/Mozilla-Ocho/llamafile)) that run on
Linux, MacOS, Windows, FreeBSD, OpenBSD, and NetBSD for AMD64 and ARM64.

- Model creator: [Mistral AI](https://mistral.ai/)
- Original model: [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)

## Quickstart

Assuming your system has at least 128GB of RAM, you can try running the
following commands, which download, concatenate, and execute the model.

```
( curl -L https://huggingface.co/jartine/Mixtral-8x22B-Instruct-v0.1-llamafile/resolve/main/Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat0
  curl -L https://huggingface.co/jartine/Mixtral-8x22B-Instruct-v0.1-llamafile/resolve/main/Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat1
) > Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
chmod +x Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile --help # view manual
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile        # launch web gui + oai api
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -p ... # cli interface (scriptable)
```
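
Once the server is running, you can also reach it through its
OpenAI-compatible JSON API. A rough sketch, assuming the default
`http://localhost:8080` listen address described in the llamafile README
(the `model` field is just a placeholder):

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Mixtral-8x22B-Instruct-v0.1",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```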

Alternatively, you may download an official `llamafile` executable from
Mozilla Ocho on GitHub, in which case you can use these Mixtral llamafiles
as simple weights data files.

```
llamafile -m Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile ...
```

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas)
of the README.

## Prompting

Prompt template:

```
[INST] {{prompt}} [/INST]
```

Command template:

```
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -p "[INST]{{prompt}}[/INST]"
```

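For example, substituting a concrete prompt into the template (the prompt
text below is just an illustration):

```
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -p "[INST]Write a limerick about quantization.[/INST]"
```
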
## About llamafile

llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023.
It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

In addition to being executables, llamafiles are also zip archives. Each
llamafile contains a GGUF file, which you can extract using the `unzip`
command. If you want to change or add files to your llamafiles, then the
`zipalign` command (distributed on the llamafile GitHub) should be used
instead of the traditional `zip` command.

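As a rough illustration (the exact name of the GGUF file inside the
archive may differ):

```
# list everything stored inside the llamafile's zip archive
unzip -l Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile

# extract the embedded GGUF weights
unzip Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile '*.gguf'
```
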
## About Upload Limits

Files that exceed the Hugging Face 50GB upload limit have a .cat𝑋
extension. You need to use the `cat` command locally to concatenate them
back into a single file, in the same order.

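For example, to reassemble the `Q4_0` llamafile from the two pieces in
this repository:

```
cat Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat0 \
    Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat1 \
    > Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
chmod +x Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
```
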
## About Quantization Formats (General Advice)

Your choice of quantization format depends on three things:

1. Will it fit in RAM or VRAM?
2. Is your use case reading (e.g. summarization) or writing (e.g. chatbot)?
3. llamafiles bigger than 4.30 GB are hard to run on Windows (see [gotchas](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas))

Good quants for writing (prediction speed) are Q5\_K\_M and Q4\_0. Text
generation is bounded by memory speed, so smaller quants help, but they
cause the LLM to hallucinate more. However, that doesn't mean they can't
think correctly. A highly degraded quant like `Q2_K` may not make a
great encyclopedia, but it's still capable of logical reasoning and
other emergent capabilities that LLMs exhibit.

Good quants for reading (evaluation speed) are BF16, F16, Q4\_0, and
Q8\_0 (ordered from fastest to slowest). Prompt evaluation is bounded by
computation speed (flops), which means performance can be improved by
software engineering, e.g. BLAS algorithms. In that case quantization
starts hurting more than it helps, since it competes for CPU resources
and makes it harder for the compiler to parallelize instructions.
Ideally you want to use the simplest, smallest floating point format
that's natively implemented by your hardware. In most cases, that's BF16
or FP16. However, llamafile is still able to offer respectable tinyBLAS
speedups for llama.cpp's oldest and simplest quants: Q8\_0 and Q4\_0.

## Hardware Choices (Mixtral 8x22B Specific)

This model is very large. Even at Q2 quantization, it's still well over
twice the size of the VRAM on the highest-tier NVIDIA gaming GPUs.
llamafile supports splitting models over multiple GPUs (currently for
NVIDIA only) if you have such a system.

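A minimal sketch of a GPU run, assuming the `-ngl` (GPU layer offload)
flag from the llamafile README; lower the layer count if you run out of
VRAM:

```
# offload as many layers as will fit on your GPU(s)
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -ngl 999
```
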
Mac Studio is a good option for running this model. An M2 Ultra desktop
from Apple is affordable and has 128GB of unified RAM+VRAM. If you have
one, then llamafile will use your Metal GPU. Try starting out with the
`Q4_0` quantization level.

Another good option for running very large, large language models is to
just use CPU. We developed new tensor multiplication kernels on the
llamafile project specifically to speed up "mixture of experts" LLMs
like Mixtral. On an AMD Threadripper Pro 7995WX with 256GB of 5200 MT/s
RAM, llamafile v0.8 runs Mixtral 8x22B Q4\_0 at 98 tokens per second of
prompt evaluation, and it predicts 9.44 tokens per second.

---

# Model Card for Mixtral-8x22B-Instruct-v0.1
The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of the [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1).

## Run the model
```python
import torch
from transformers import AutoModelForCausalLM
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.protocol.instruct.tool_calls import (
    Tool,
    Function,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.request import ChatCompletionRequest

device = "cuda" # the device to load the model onto

tokenizer_v3 = MistralTokenizer.v3()

mistral_query = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the users location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris"),
    ],
    model="test",
)

# encode the request with Mistral's own tokenizer, then feed the token ids
# to the Hugging Face model as a batch of one sequence
encodeds = tokenizer_v3.encode_chat_completion(mistral_query).tokens
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
model_inputs = torch.tensor([encodeds], device=device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
sp_tokenizer = tokenizer_v3.instruct_tokenizer.tokenizer
decoded = sp_tokenizer.decode(generated_ids[0].tolist())
print(decoded)
```

# Instruct tokenizer
The HuggingFace tokenizer included in this release should match our own. To compare:
`pip install mistral-common`

```py
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.request import ChatCompletionRequest

from transformers import AutoTokenizer

tokenizer_v3 = MistralTokenizer.v3()

mistral_query = ChatCompletionRequest(
    messages=[
        UserMessage(content="How many experts ?"),
        AssistantMessage(content="8"),
        UserMessage(content="How big ?"),
        AssistantMessage(content="22B"),
        UserMessage(content="Noice 🎉 !"),
    ],
    model="test",
)
hf_messages = mistral_query.model_dump()['messages']

tokenized_mistral = tokenizer_v3.encode_chat_completion(mistral_query).tokens

tokenizer_hf = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x22B-Instruct-v0.1')
tokenized_hf = tokenizer_hf.apply_chat_template(hf_messages, tokenize=True)

assert tokenized_hf == tokenized_mistral
```

# Function calling and special tokens
This tokenizer includes additional special tokens related to function calling:
- [TOOL_CALLS]
- [AVAILABLE_TOOLS]
- [/AVAILABLE_TOOLS]
- [TOOL_RESULTS]
- [/TOOL_RESULTS]

If you want to use this model with function calling, please be sure to apply it similarly to what is done in our [SentencePieceTokenizerV3](https://github.com/mistralai/mistral-common/blob/main/src/mistral_common/tokens/tokenizers/sentencepiece.py#L299).

# The Mistral AI Team
Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Antoine Roux,
Arthur Mensch, Audrey Herblin-Stoop, Baptiste Bout, Baudouin de Monicault,
Blanche Savary, Bam4d, Caroline Feldman, Devendra Singh Chaplot,
Diego de las Casas, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger,
Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona,
Jean-Malo Delignon, Jia Li, Justus Murke, Louis Martin, Louis Ternon,
Lucile Saulnier, Lélio Renard Lavaud, Margaret Jennings, Marie Pellat,
Marie Torelli, Marie-Anne Lachaux, Nicolas Schuhl, Patrick von Platen,
Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao,
Thibaut Lavril, Timothée Lacroix, Théophile Gervet, Thomas Wang,
Valera Nemychnikova, William El Sayed, William Marshall