|
--- |
|
library_name: keras-hub |
|
--- |
|
# Model Summary
|
|
|
Mixtral is a family of large language models published by Mistral AI. The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts (MoE) model. Both pretrained and instruction-tuned variants are available, each with 7 billion active parameters.
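To make "Sparse Mixture of Experts" concrete: each MoE layer scores every token with a small router, activates only the top-k of its 8 experts, and mixes their outputs. The NumPy sketch below is purely illustrative; it is not the Mixtral implementation, and every name in it is made up.

```Python
import numpy as np

def top2_moe_layer(x, router_w, expert_ws):
    """Illustrative top-2 MoE routing: score experts, keep the best 2, mix outputs."""
    logits = x @ router_w                        # (tokens, num_experts) router scores
    top2 = np.argsort(logits, axis=-1)[:, -2:]   # indices of the 2 highest-scoring experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top2[t]]
        weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the selected experts only
        for w, e in zip(weights, top2[t]):
            out[t] += w * (x[t] @ expert_ws[e])           # weighted sum of expert outputs
    return out

# Toy sizes: 4 tokens, hidden size 8, 8 experts (each expert is just a matrix here).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
router_w = rng.normal(size=(8, 8))
expert_ws = rng.normal(size=(8, 8, 8))
print(top2_moe_layer(x, router_w, expert_ws).shape)  # (4, 8)
```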
|
|
|
Weights are released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).
|
|
|
## Links |
|
|
|
* [Mixtral Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/mixtral-quickstart-notebook) |
|
* [Mixtral API Documentation](https://keras.io/keras_hub/api/models/mixtral/) |
|
* [Mixtral Model Card](https://mistral.ai/news/mixtral-of-experts) |
|
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/) |
|
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/) |
|
|
|
## Installation |
|
|
|
Keras and KerasHub can be installed with: |
|
|
|
``` |
|
pip install -U -q keras-hub |
|
pip install -U -q keras |
|
``` |
|
|
|
JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
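Keras 3 selects its backend from the `KERAS_BACKEND` environment variable. A minimal sketch (the variable must be set before Keras is imported):

```Python
import os

# Pick one of "jax", "tensorflow", or "torch" before importing Keras.
os.environ["KERAS_BACKEND"] = "jax"

import keras
import keras_hub

print(keras.backend.backend())  # prints the active backend, e.g. "jax"
```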
|
|
|
## Presets |
|
|
|
The following model checkpoints are provided by the Keras team. Full code examples for each are available below. |
|
|
|
| Preset name              | Parameters | Description                                                                                                      |
|---------------------------|------------|------------------------------------------------------------------------------------------------------------------|
| mixtral_8_7b_en           | 7B         | 32-layer Mixtral MoE model with 7 billion active parameters and 8 experts per MoE layer.                          |
| mixtral_8_instruct_7b_en  | 7B         | Instruction fine-tuned 32-layer Mixtral MoE model with 7 billion active parameters and 8 experts per MoE layer.   |
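Any of the preset names above can be passed to `from_preset`. For instance, the base (non-instruct) checkpoint can be loaded as in the short sketch below; the instruction-tuned preset is used in the full examples that follow.

```Python
import keras_hub

# Load the base pretrained preset by the name listed in the table above.
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset(
    "mixtral_8_7b_en",
    dtype="bfloat16",
)
mixtral_lm.summary()
```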
|
|
|
## Example Usage |
|
```Python |
|
|
|
import keras |
|
import keras_hub |
|
import numpy as np |
|
|
|
# Basic text generation |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("mixtral_8_instruct_7b_en") |
|
mixtral_lm.generate("[INST] What is Keras? [/INST]", max_length=500) |
|
|
|
# Generate with batched prompts |
|
mixtral_lm.generate([ |
|
"[INST] What is Keras? [/INST]", |
|
"[INST] Give me your best brownie recipe. [/INST]" |
|
], max_length=500) |
|
|
|
# Using different sampling strategies |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("mixtral_8_instruct_7b_en") |
|
# Greedy sampling |
|
mixtral_lm.compile(sampler="greedy") |
|
mixtral_lm.generate("I want to say", max_length=30) |
|
|
|
# Beam search |
|
mixtral_lm.compile(sampler=keras_hub.samplers.BeamSampler(num_beams=2))
|
mixtral_lm.generate("I want to say", max_length=30) |
|
|
|
# Generate without preprocessing |
|
prompt = { |
|
"token_ids": np.array([[1, 315, 947, 298, 1315, 0, 0, 0, 0, 0]] * 2), |
|
"padding_mask": np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]] * 2), |
|
} |
|
|
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"mixtral_8_instruct_7b_en", |
|
preprocessor=None, |
|
dtype="bfloat16" |
|
) |
|
# Expert count and top-k routing are fixed by the preset's architecture,
# so generate() only needs the preprocessed inputs.
mixtral_lm.generate(prompt)
|
|
|
# Training on a single batch |
|
features = ["The quick brown fox jumped.", "I forgot my homework."] |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"mixtral_8_instruct_7b_en", |
|
dtype="bfloat16" |
|
) |
|
# The router auxiliary load-balancing loss is handled inside the model,
# so fit() takes only the usual Keras arguments.
mixtral_lm.fit(x=features, batch_size=2)
|
|
|
# Training without preprocessing |
|
x = { |
|
"token_ids": np.array([[1, 315, 947, 298, 1315, 369, 315, 837, 0, 0]] * 2), |
|
"padding_mask": np.array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]] * 2), |
|
} |
|
y = np.array([[315, 947, 298, 1315, 369, 315, 837, 0, 0, 0]] * 2) |
|
sw = np.array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]] * 2) |
|
|
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"mixtral_8_instruct_7b_en", |
|
preprocessor=None, |
|
dtype="bfloat16" |
|
) |
|
mixtral_lm.fit(x=x, y=y, sample_weight=sw, batch_size=2)
|
``` |
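The `token_ids` / `padding_mask` dictionaries used in the "without preprocessing" examples can be produced with the model's preprocessor. A small sketch, assuming the standard KerasHub class name `MixtralCausalLMPreprocessor` for this model family:

```Python
import keras_hub

# Assumed preprocessor class, following the usual KerasHub naming convention.
preprocessor = keras_hub.models.MixtralCausalLMPreprocessor.from_preset(
    "mixtral_8_instruct_7b_en",
    sequence_length=10,
)

# generate_preprocess() returns the token_ids / padding_mask dict that
# generate() expects when the model was built with preprocessor=None.
batch = preprocessor.generate_preprocess(["I want to say"])
print(batch["token_ids"].shape, batch["padding_mask"].shape)
```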
|
|
|
## Example Usage with Hugging Face URI |
|
|
|
```Python |
|
|
|
import keras |
|
import keras_hub |
|
import numpy as np |
|
|
|
# Basic text generation |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("hf://keras/mixtral_8_instruct_7b_en") |
|
mixtral_lm.generate("[INST] What is Keras? [/INST]", max_length=500) |
|
|
|
# Generate with batched prompts |
|
mixtral_lm.generate([ |
|
"[INST] What is Keras? [/INST]", |
|
"[INST] Give me your best brownie recipe. [/INST]" |
|
], max_length=500) |
|
|
|
# Using different sampling strategies |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("hf://keras/mixtral_8_instruct_7b_en") |
|
# Greedy sampling |
|
mixtral_lm.compile(sampler="greedy") |
|
mixtral_lm.generate("I want to say", max_length=30) |
|
|
|
# Beam search |
|
mixtral_lm.compile(sampler=keras_hub.samplers.BeamSampler(num_beams=2))
|
mixtral_lm.generate("I want to say", max_length=30) |
|
|
|
# Generate without preprocessing |
|
prompt = { |
|
"token_ids": np.array([[1, 315, 947, 298, 1315, 0, 0, 0, 0, 0]] * 2), |
|
"padding_mask": np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]] * 2), |
|
} |
|
|
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"hf://keras/mixtral_8_instruct_7b_en", |
|
preprocessor=None, |
|
dtype="bfloat16" |
|
) |
|
# Expert count and top-k routing are fixed by the preset's architecture,
# so generate() only needs the preprocessed inputs.
mixtral_lm.generate(prompt)
|
|
|
# Training on a single batch |
|
features = ["The quick brown fox jumped.", "I forgot my homework."] |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"hf://keras/mixtral_8_instruct_7b_en", |
|
dtype="bfloat16" |
|
) |
|
# The router auxiliary load-balancing loss is handled inside the model,
# so fit() takes only the usual Keras arguments.
mixtral_lm.fit(x=features, batch_size=2)
|
|
|
# Training without preprocessing |
|
x = { |
|
"token_ids": np.array([[1, 315, 947, 298, 1315, 369, 315, 837, 0, 0]] * 2), |
|
"padding_mask": np.array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]] * 2), |
|
} |
|
y = np.array([[315, 947, 298, 1315, 369, 315, 837, 0, 0, 0]] * 2) |
|
sw = np.array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]] * 2) |
|
|
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"hf://keras/mixtral_8_instruct_7b_en", |
|
preprocessor=None, |
|
dtype="bfloat16" |
|
) |
|
mixtral_lm.fit(x=x, y=y, sample_weight=sw, batch_size=2)
|
``` |
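After fine-tuning, the model can be saved as a local preset and shared, as described in the KerasHub Model Publishing Guide linked above. A minimal sketch, where the Kaggle handle is a placeholder:

```Python
import keras_hub

# `mixtral_lm` is the fine-tuned model from the example above.
mixtral_lm.save_to_preset("./mixtral_finetuned")

# Upload the preset directory; replace the handle with your own.
keras_hub.upload_preset(
    "kaggle://my_username/mixtral/keras/mixtral_finetuned",
    "./mixtral_finetuned",
)
```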
|
|