---
base_model: mistralai/Mixtral-8x22B-Instruct-v0.1
model_creator: mistralai
quantized_by: jartine
license: apache-2.0
prompt_template: |
  [INST] {{prompt}} [/INST]
tags:
  - llamafile
language:
  - en
---

# Mixtral 8x22B Instruct v0.1 - llamafile

This repository contains executable weights (which we call
[llamafiles](https://github.com/Mozilla-Ocho/llamafile)) that run on
Linux, MacOS, Windows, FreeBSD, OpenBSD, and NetBSD for AMD64 and ARM64.

- Model creator: [Mistral AI](https://mistral.ai/)
- Original model: [mistralai/Mixtral-8x22B-Instruct-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1)

## Quickstart

Assuming your system has at least 128GB of RAM, you can try running the
following commands, which download, concatenate, and execute the model.

```
( curl -L https://huggingface.co/jartine/Mixtral-8x22B-Instruct-v0.1-llamafile/resolve/main/Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat0
  curl -L https://huggingface.co/jartine/Mixtral-8x22B-Instruct-v0.1-llamafile/resolve/main/Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat1
) > Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
chmod +x Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile --help # view manual
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile        # launch web gui + oai api
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -p ... # cli interface (scriptable)
```
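
Once the server is running, you can also reach it through its
OpenAI-compatible JSON API. A rough sketch, assuming the default
`http://localhost:8080` listen address described in the llamafile README
(the `model` field is just a placeholder):

```
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Mixtral-8x22B-Instruct-v0.1",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```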

Alternatively, you may download an official `llamafile` executable from
Mozilla Ocho on GitHub, in which case you can use these Mixtral llamafiles
as simple weights data files.

```
llamafile -m Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile ...
```

For further information, please see the [llamafile
README](https://github.com/mozilla-ocho/llamafile/).

Having **trouble?** See the ["Gotchas"
section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas)
of the README.

## Prompting

Prompt template:

```
[INST] {{prompt}} [/INST]
```

Command template:

```
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -p "[INST]{{prompt}}[/INST]"
```

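For example, substituting a concrete prompt into the template (the prompt
text below is just an illustration):

```
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -p "[INST]Write a limerick about quantization.[/INST]"
```
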
## About llamafile

llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023.
It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp
binaries that run on the stock installs of six OSes for both ARM64 and
AMD64.

In addition to being executables, llamafiles are also zip archives. Each
llamafile contains a GGUF file, which you can extract using the `unzip`
command. If you want to change or add files to your llamafiles, then the
`zipalign` command (distributed on the llamafile GitHub) should be used
instead of the traditional `zip` command.

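As a rough illustration (the exact name of the GGUF file inside the
archive may differ):

```
# list everything stored inside the llamafile's zip archive
unzip -l Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile

# extract the embedded GGUF weights
unzip Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile '*.gguf'
```
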
## About Upload Limits

Files that exceed the Hugging Face 50GB upload limit have a .cat𝑋
extension. You need to use the `cat` command locally to concatenate them
back into a single file, in the same order.

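For example, to reassemble the `Q4_0` llamafile from the two pieces in
this repository:

```
cat Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat0 \
    Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile.cat1 \
    > Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
chmod +x Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile
```
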
## About Quantization Formats (General Advice)

Your choice of quantization format depends on three things:

1. Will it fit in RAM or VRAM?
2. Is your use case reading (e.g. summarization) or writing (e.g. chatbot)?
3. llamafiles bigger than 4.30 GB are hard to run on Windows (see [gotchas](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas))

Good quants for writing (prediction speed) are Q5\_K\_M and Q4\_0. Text
generation is bounded by memory speed, so smaller quants help, but they
cause the LLM to hallucinate more. However, that doesn't mean they can't
think correctly. A highly degraded quant like `Q2_K` may not make a
great encyclopedia, but it's still capable of logical reasoning and
other emergent capabilities that LLMs exhibit.

Good quants for reading (evaluation speed) are BF16, F16, Q4\_0, and
Q8\_0 (ordered from fastest to slowest). Prompt evaluation is bounded by
computation speed (flops), which means performance can be improved by
software engineering, e.g. BLAS algorithms. In that case quantization
starts hurting more than it helps, since it competes for CPU resources
and makes it harder for the compiler to parallelize instructions.
Ideally you want to use the simplest, smallest floating point format
that's natively implemented by your hardware. In most cases, that's BF16
or FP16. However, llamafile is still able to offer respectable tinyBLAS
speedups for llama.cpp's oldest and simplest quants: Q8\_0 and Q4\_0.

## Hardware Choices (Mixtral 8x22B Specific)

This model is very large. Even at Q2 quantization, it's still well over
twice the size of the VRAM on the highest-tier NVIDIA gaming GPUs.
llamafile supports splitting models over multiple GPUs (currently for
NVIDIA only) if you have such a system.

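A minimal sketch of a GPU run, assuming the `-ngl` (GPU layer offload)
flag from the llamafile README; lower the layer count if you run out of
VRAM:

```
# offload as many layers as will fit on your GPU(s)
./Mixtral-8x22B-Instruct-v0.1.Q4_0.llamafile -ngl 999
```
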
Mac Studio is a good option for running this model. An M2 Ultra desktop
from Apple is affordable and has 128GB of unified RAM+VRAM. If you have
one, then llamafile will use your Metal GPU. Try starting out with the
`Q4_0` quantization level.

Another good option for running very large, large language models is to
just use CPU. We developed new tensor multiplication kernels on the
llamafile project specifically to speed up "mixture of experts" LLMs
like Mixtral. On an AMD Threadripper Pro 7995WX with 256GB of 5200 MT/s
RAM, llamafile v0.8 runs Mixtral 8x22B Q4\_0 at 98 tokens per second of
prompt evaluation, and it predicts 9.44 tokens per second.

---

# Model Card for Mixtral-8x22B-Instruct-v0.1
The Mixtral-8x22B-Instruct-v0.1 Large Language Model (LLM) is an instruct fine-tuned version of the [Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1).

## Run the model
```python
import torch
from transformers import AutoModelForCausalLM
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.protocol.instruct.tool_calls import (
    Tool,
    Function,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.request import ChatCompletionRequest

device = "cuda" # the device to load the model onto

tokenizer_v3 = MistralTokenizer.v3()

mistral_query = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the users location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris"),
    ],
    model="test",
)

# encode the request with Mistral's own tokenizer, then feed the token ids
# to the Hugging Face model as a batch of one sequence
encodeds = tokenizer_v3.encode_chat_completion(mistral_query).tokens
model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x22B-Instruct-v0.1")
model_inputs = torch.tensor([encodeds], device=device)
model.to(device)

generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
sp_tokenizer = tokenizer_v3.instruct_tokenizer.tokenizer
decoded = sp_tokenizer.decode(generated_ids[0].tolist())
print(decoded)
```

# Instruct tokenizer
The HuggingFace tokenizer included in this release should match our own. To compare:
`pip install mistral-common`

```py
from mistral_common.protocol.instruct.messages import (
    AssistantMessage,
    UserMessage,
)
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.request import ChatCompletionRequest

from transformers import AutoTokenizer

tokenizer_v3 = MistralTokenizer.v3()

mistral_query = ChatCompletionRequest(
    messages=[
        UserMessage(content="How many experts ?"),
        AssistantMessage(content="8"),
        UserMessage(content="How big ?"),
        AssistantMessage(content="22B"),
        UserMessage(content="Noice 🎉 !"),
    ],
    model="test",
)
hf_messages = mistral_query.model_dump()['messages']

tokenized_mistral = tokenizer_v3.encode_chat_completion(mistral_query).tokens

tokenizer_hf = AutoTokenizer.from_pretrained('mistralai/Mixtral-8x22B-Instruct-v0.1')
tokenized_hf = tokenizer_hf.apply_chat_template(hf_messages, tokenize=True)

assert tokenized_hf == tokenized_mistral
```

# Function calling and special tokens
This tokenizer includes additional special tokens related to function calling:
- [TOOL_CALLS]
- [AVAILABLE_TOOLS]
- [/AVAILABLE_TOOLS]
- [TOOL_RESULTS]
- [/TOOL_RESULTS]

If you want to use this model with function calling, please be sure to apply it similarly to what is done in our [SentencePieceTokenizerV3](https://github.com/mistralai/mistral-common/blob/main/src/mistral_common/tokens/tokenizers/sentencepiece.py#L299).

# The Mistral AI Team
Albert Jiang, Alexandre Sablayrolles, Alexis Tacnet, Antoine Roux,
Arthur Mensch, Audrey Herblin-Stoop, Baptiste Bout, Baudouin de Monicault,
Blanche Savary, Bam4d, Caroline Feldman, Devendra Singh Chaplot,
Diego de las Casas, Eleonore Arcelin, Emma Bou Hanna, Etienne Metzger,
Gianna Lengyel, Guillaume Bour, Guillaume Lample, Harizo Rajaona,
Jean-Malo Delignon, Jia Li, Justus Murke, Louis Martin, Louis Ternon,
Lucile Saulnier, Lélio Renard Lavaud, Margaret Jennings, Marie Pellat,
Marie Torelli, Marie-Anne Lachaux, Nicolas Schuhl, Patrick von Platen,
Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao,
Thibaut Lavril, Timothée Lacroix, Théophile Gervet, Thomas Wang,
Valera Nemychnikova, William El Sayed, William Marshall