|
--- |
|
library_name: keras-hub |
|
--- |
|
# Model Summary
|
|
|
Mixtral is a family of large language models published by Mistral AI. The Mixtral-8x7B Large Language Model (LLM) is a pretrained generative Sparse Mixture of Experts (MoE) model. Both pretrained and instruction-tuned variants are available, each with 7 billion active parameters.
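To make "Sparse Mixture of Experts" concrete: each MoE layer scores every token with a small router, activates only the top-k of its 8 experts, and mixes their outputs. The NumPy sketch below is purely illustrative; it is not the Mixtral implementation, and every name in it is made up.

```Python
import numpy as np

def top2_moe_layer(x, router_w, expert_ws):
    """Illustrative top-2 MoE routing: score experts, keep the best 2, mix outputs."""
    logits = x @ router_w                        # (tokens, num_experts) router scores
    top2 = np.argsort(logits, axis=-1)[:, -2:]   # indices of the 2 highest-scoring experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top2[t]]
        weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the selected experts only
        for w, e in zip(weights, top2[t]):
            out[t] += w * (x[t] @ expert_ws[e])           # weighted sum of expert outputs
    return out

# Toy sizes: 4 tokens, hidden size 8, 8 experts (each expert is just a matrix here).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
router_w = rng.normal(size=(8, 8))
expert_ws = rng.normal(size=(8, 8, 8))
print(top2_moe_layer(x, router_w, expert_ws).shape)  # (4, 8)
```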
|
|
|
Weights are released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE). Keras model code is released under the [Apache 2 License](https://github.com/keras-team/keras-hub/blob/master/LICENSE).
|
|
|
## Links |
|
|
|
* [Mixtral Quickstart Notebook](https://www.kaggle.com/code/laxmareddypatlolla/mixtral-quickstart-notebook) |
|
* [Mixtral API Documentation](https://keras.io/keras_hub/api/models/mixtral/) |
|
* [Mixtral Model Card](https://mistral.ai/news/mixtral-of-experts) |
|
* [KerasHub Beginner Guide](https://keras.io/guides/keras_hub/getting_started/) |
|
* [KerasHub Model Publishing Guide](https://keras.io/guides/keras_hub/upload/) |
|
|
|
## Installation |
|
|
|
Keras and KerasHub can be installed with: |
|
|
|
``` |
|
pip install -U -q keras-hub |
|
pip install -U -q keras |
|
``` |
|
|
|
JAX, TensorFlow, and PyTorch come preinstalled in Kaggle Notebooks. For instructions on installing them in another environment, see the [Keras Getting Started](https://keras.io/getting_started/) page.
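Keras 3 selects its backend from the `KERAS_BACKEND` environment variable. A minimal sketch (the variable must be set before Keras is imported):

```Python
import os

# Pick one of "jax", "tensorflow", or "torch" before importing Keras.
os.environ["KERAS_BACKEND"] = "jax"

import keras
import keras_hub

print(keras.backend.backend())  # prints the active backend, e.g. "jax"
```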
|
|
|
## Presets |
|
|
|
The following model checkpoints are provided by the Keras team. Full code examples for each are available below. |
|
|
|
| Preset name              | Parameters | Description                                                                                                      |
|---------------------------|------------|------------------------------------------------------------------------------------------------------------------|
| mixtral_8_7b_en           | 7B         | 32-layer Mixtral MoE model with 7 billion active parameters and 8 experts per MoE layer.                          |
| mixtral_8_instruct_7b_en  | 7B         | Instruction fine-tuned 32-layer Mixtral MoE model with 7 billion active parameters and 8 experts per MoE layer.   |
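Any of the preset names above can be passed to `from_preset`. For instance, the base (non-instruct) checkpoint can be loaded as in the short sketch below; the instruction-tuned preset is used in the full examples that follow.

```Python
import keras_hub

# Load the base pretrained preset by the name listed in the table above.
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset(
    "mixtral_8_7b_en",
    dtype="bfloat16",
)
mixtral_lm.summary()
```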
|
|
|
## Example Usage |
|
```Python |
|
|
|
import keras |
|
import keras_hub |
|
import numpy as np |
|
|
|
# Basic text generation |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("mixtral_8_instruct_7b_en") |
|
mixtral_lm.generate("[INST] What is Keras? [/INST]", max_length=500) |
|
|
|
# Generate with batched prompts |
|
mixtral_lm.generate([ |
|
"[INST] What is Keras? [/INST]", |
|
"[INST] Give me your best brownie recipe. [/INST]" |
|
], max_length=500) |
|
|
|
# Using different sampling strategies |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("mixtral_8_instruct_7b_en") |
|
# Greedy sampling |
|
mixtral_lm.compile(sampler="greedy") |
|
mixtral_lm.generate("I want to say", max_length=30) |
|
|
|
# Beam search |
|
mixtral_lm.compile(sampler=keras_hub.samplers.BeamSampler(num_beams=2))
|
mixtral_lm.generate("I want to say", max_length=30) |
|
|
|
# Generate without preprocessing |
|
prompt = { |
|
"token_ids": np.array([[1, 315, 947, 298, 1315, 0, 0, 0, 0, 0]] * 2), |
|
"padding_mask": np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]] * 2), |
|
} |
|
|
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"mixtral_8_instruct_7b_en", |
|
preprocessor=None, |
|
dtype="bfloat16" |
|
) |
|
# Expert count and top-k routing are fixed by the preset's architecture,
# so generate() only needs the preprocessed inputs.
mixtral_lm.generate(prompt)
|
|
|
# Training on a single batch |
|
features = ["The quick brown fox jumped.", "I forgot my homework."] |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"mixtral_8_instruct_7b_en", |
|
dtype="bfloat16" |
|
) |
|
# The router auxiliary load-balancing loss is handled inside the model,
# so fit() takes only the usual Keras arguments.
mixtral_lm.fit(x=features, batch_size=2)
|
|
|
# Training without preprocessing |
|
x = { |
|
"token_ids": np.array([[1, 315, 947, 298, 1315, 369, 315, 837, 0, 0]] * 2), |
|
"padding_mask": np.array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]] * 2), |
|
} |
|
y = np.array([[315, 947, 298, 1315, 369, 315, 837, 0, 0, 0]] * 2) |
|
sw = np.array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]] * 2) |
|
|
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"mixtral_8_instruct_7b_en", |
|
preprocessor=None, |
|
dtype="bfloat16" |
|
) |
|
mixtral_lm.fit(x=x, y=y, sample_weight=sw, batch_size=2)
|
``` |
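The `token_ids` / `padding_mask` dictionaries used in the "without preprocessing" examples can be produced with the model's preprocessor. A small sketch, assuming the standard KerasHub class name `MixtralCausalLMPreprocessor` for this model family:

```Python
import keras_hub

# Assumed preprocessor class, following the usual KerasHub naming convention.
preprocessor = keras_hub.models.MixtralCausalLMPreprocessor.from_preset(
    "mixtral_8_instruct_7b_en",
    sequence_length=10,
)

# generate_preprocess() returns the token_ids / padding_mask dict that
# generate() expects when the model was built with preprocessor=None.
batch = preprocessor.generate_preprocess(["I want to say"])
print(batch["token_ids"].shape, batch["padding_mask"].shape)
```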
|
|
|
## Example Usage with Hugging Face URI |
|
|
|
```Python |
|
|
|
import keras |
|
import keras_hub |
|
import numpy as np |
|
|
|
# Basic text generation |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("hf://keras/mixtral_8_instruct_7b_en") |
|
mixtral_lm.generate("[INST] What is Keras? [/INST]", max_length=500) |
|
|
|
# Generate with batched prompts |
|
mixtral_lm.generate([ |
|
"[INST] What is Keras? [/INST]", |
|
"[INST] Give me your best brownie recipe. [/INST]" |
|
], max_length=500) |
|
|
|
# Using different sampling strategies |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset("hf://keras/mixtral_8_instruct_7b_en") |
|
# Greedy sampling |
|
mixtral_lm.compile(sampler="greedy") |
|
mixtral_lm.generate("I want to say", max_length=30) |
|
|
|
# Beam search |
|
mixtral_lm.compile(sampler=keras_hub.samplers.BeamSampler(num_beams=2))
|
mixtral_lm.generate("I want to say", max_length=30) |
|
|
|
# Generate without preprocessing |
|
prompt = { |
|
"token_ids": np.array([[1, 315, 947, 298, 1315, 0, 0, 0, 0, 0]] * 2), |
|
"padding_mask": np.array([[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]] * 2), |
|
} |
|
|
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"hf://keras/mixtral_8_instruct_7b_en", |
|
preprocessor=None, |
|
dtype="bfloat16" |
|
) |
|
# Expert count and top-k routing are fixed by the preset's architecture,
# so generate() only needs the preprocessed inputs.
mixtral_lm.generate(prompt)
|
|
|
# Training on a single batch |
|
features = ["The quick brown fox jumped.", "I forgot my homework."] |
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"hf://keras/mixtral_8_instruct_7b_en", |
|
dtype="bfloat16" |
|
) |
|
# The router auxiliary load-balancing loss is handled inside the model,
# so fit() takes only the usual Keras arguments.
mixtral_lm.fit(x=features, batch_size=2)
|
|
|
# Training without preprocessing |
|
x = { |
|
"token_ids": np.array([[1, 315, 947, 298, 1315, 369, 315, 837, 0, 0]] * 2), |
|
"padding_mask": np.array([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0]] * 2), |
|
} |
|
y = np.array([[315, 947, 298, 1315, 369, 315, 837, 0, 0, 0]] * 2) |
|
sw = np.array([[1, 1, 1, 1, 1, 1, 1, 0, 0, 0]] * 2) |
|
|
|
mixtral_lm = keras_hub.models.MixtralCausalLM.from_preset( |
|
"hf://keras/mixtral_8_instruct_7b_en", |
|
preprocessor=None, |
|
dtype="bfloat16" |
|
) |
|
mixtral_lm.fit(x=x, y=y, sample_weight=sw, batch_size=2)
|
``` |
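After fine-tuning, the model can be saved as a local preset and shared, as described in the KerasHub Model Publishing Guide linked above. A minimal sketch, where the Kaggle handle is a placeholder:

```Python
import keras_hub

# `mixtral_lm` is the fine-tuned model from the example above.
mixtral_lm.save_to_preset("./mixtral_finetuned")

# Upload the preset directory; replace the handle with your own.
keras_hub.upload_preset(
    "kaggle://my_username/mixtral/keras/mixtral_finetuned",
    "./mixtral_finetuned",
)
```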
|
|