Intel/neural-chat-7b-v3

@@ -2,7 +2,7 @@
 license: apache-2.0
 ---
-## Finetuning on [habana](https://habana.ai/) HPU
 This model is fine-tuned from [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) on the open-source dataset [Open-Orca/SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) and then aligned with the Direct Preference Optimization (DPO) algorithm. For more details, refer to our blog: [The Practice of Supervised Fine-tuning and Direct Preference Optimization on Habana Gaudi2](https://medium.com/@NeuralCompressor/the-practice-of-supervised-finetuning-and-direct-preference-optimization-on-habana-gaudi2-a1197d8a3cd3).
@@ -38,12 +38,37 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_ratio: 0.02
 - num_epochs: 2.0
-## Inference with transformers
 ```python
-import transformers
-model = transformers.AutoModelForCausalLM.from_pretrained(
-  'Intel/neural-chat-7b-v3'
 )
 ```
@@ -58,9 +83,8 @@ The license on this model does not constitute legal advice. We are not responsib
 ## Organizations developing the model
-The NeuralChat team with members from Intel/SATG/AIA/AIPT. Core team members: Kaokao Lv, Liang Lv, Chang Wang, Wenxin Zhang, Xuhui Ren, and Haihao Shen.
 ## Useful links
 * Intel Neural Compressor [link](https://github.com/intel/neural-compressor)
 * Intel Extension for Transformers [link](https://github.com/intel/intel-extension-for-transformers)
-* Intel Extension for PyTorch [link](https://github.com/intel/intel-extension-for-pytorch)

 license: apache-2.0
 ---
+## Fine-tuning on [Habana](https://habana.ai/) Gaudi
 This model is fine-tuned from [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) on the open-source dataset [Open-Orca/SlimOrca](https://huggingface.co/datasets/Open-Orca/SlimOrca) and then aligned with the Direct Preference Optimization (DPO) algorithm. For more details, refer to our blog: [The Practice of Supervised Fine-tuning and Direct Preference Optimization on Habana Gaudi2](https://medium.com/@NeuralCompressor/the-practice-of-supervised-finetuning-and-direct-preference-optimization-on-habana-gaudi2-a1197d8a3cd3).
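The alignment step mentioned above can be illustrated with a minimal sketch of the DPO objective for a single preference pair. This is not the authors' training code: the function name `dpo_loss`, the example log-probabilities, and the value `beta=0.1` are assumptions chosen for illustration, since the model card does not state them.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    beta is a hypothetical value; the model card does not specify it.
    """
    # How much more the policy likes each response than the frozen reference does
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)), in a stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0, so the loss is log(2)
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy favours the chosen response more than the reference does: lower loss
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

The loss falls as the policy raises the likelihood of the chosen response relative to the rejected one, measured against the reference model.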
 - lr_scheduler_warmup_ratio: 0.02
 - num_epochs: 2.0
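To make the warmup ratio concrete, the sketch below converts `lr_scheduler_warmup_ratio: 0.02` and `num_epochs: 2.0` into a number of warmup steps. It assumes the ratio is applied to the total number of optimizer steps and rounded up, which is how the Hugging Face `Trainer` is understood to handle it. The `steps_per_epoch` value is hypothetical, because this section does not give the dataset size or the effective batch size.

```python
import math

def warmup_steps(steps_per_epoch, num_epochs, warmup_ratio):
    """Number of warmup steps implied by a warmup ratio (rounded up)."""
    total_steps = math.ceil(steps_per_epoch * num_epochs)
    return math.ceil(total_steps * warmup_ratio)

# steps_per_epoch=1000 is a made-up figure used only for illustration
steps = warmup_steps(steps_per_epoch=1000, num_epochs=2.0, warmup_ratio=0.02)
print(steps)
```

With these illustrative numbers, the learning rate would ramp up over the first 2% of the 2,000 total steps before the scheduler takes over.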
+## FP32 Inference with transformers
 ```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
+model_name = "Intel/neural-chat-7b-v3"
+prompt = "Once upon a time, there existed a little girl,"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+inputs = tokenizer(prompt, return_tensors="pt").input_ids
+streamer = TextStreamer(tokenizer)
+model = AutoModelForCausalLM.from_pretrained(model_name)
+# Stream generated tokens to stdout as they are produced
+outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
+```
+## INT4 Inference with transformers
+```python
+from transformers import AutoTokenizer, TextStreamer
+from intel_extension_for_transformers.transformers import AutoModelForCausalLM, WeightOnlyQuantConfig
+model_name = "Intel/neural-chat-7b-v3"
+# Weight-only quantization: INT4 weights with INT8 compute
+config = WeightOnlyQuantConfig(compute_dtype="int8", weight_dtype="int4")
+prompt = "Once upon a time, there existed a little girl,"
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+inputs = tokenizer(prompt, return_tensors="pt").input_ids
+streamer = TextStreamer(tokenizer)
+model = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=config)
+outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)
 ```
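As a rough way to compare the FP32 and INT4 paths above, the sketch below estimates the memory needed for the weights alone. The parameter count of about 7.24 billion is an approximation for a Mistral-7B-sized model and is not taken from this model card. The estimate ignores quantization scales and zero-points, any layers left unquantized, activations, and the KV cache, so actual usage will be higher.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Approximate parameter count; an assumption, not a figure from the model card
NUM_PARAMS = 7.24e9

fp32_gb = weight_memory_gb(NUM_PARAMS, 32)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"FP32 weights: ~{fp32_gb:.2f} GB")
print(f"INT4 weights: ~{int4_gb:.2f} GB")
```

Under these assumptions, INT4 weights take about one eighth of the space of FP32 weights.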
 ## Organizations developing the model
+The NeuralChat team with members from Intel/DCAI/AISE. Core team members: Kaokao Lv, Liang Lv, Chang Wang, Wenxin Zhang, Xuhui Ren, and Haihao Shen.
 ## Useful links
 * Intel Neural Compressor [link](https://github.com/intel/neural-compressor)
 * Intel Extension for Transformers [link](https://github.com/intel/intel-extension-for-transformers)