Update README.md
README.md CHANGED

@@ -38,7 +38,8 @@ This is the SFT + DPO version of Mixtral Hermes 2, we will also be providing an
- BigBench
- TruthfulQA
3. [Prompt Format](#prompt-format)
-4. [Quantized Models](#quantized-models)
+4. [Inference Example Code](#inference-code)
+5. [Quantized Models](#quantized-models)

## Benchmark Results

@@ -127,6 +128,45 @@ In LM-Studio, simply select the ChatML Prefix on the settings side pane:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6317aade83d8d2fd903192d9/ls6WqV-GSxMw2RA3GuQiN.png)

+# Inference Code
+
+Here is example code using HuggingFace Transformers to run inference with the model (note: even in 4-bit, it will require more than 24 GB of VRAM):
+
+```python
+# Code to inference Hermes with HF Transformers
+# Requires the pytorch, transformers, bitsandbytes, sentencepiece, protobuf, and flash-attn packages
+
+import torch
+from transformers import LlamaTokenizer, MistralForCausalLM
+import bitsandbytes, flash_attn  # imported up front so missing packages fail fast
+
+tokenizer = LlamaTokenizer.from_pretrained('NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO', trust_remote_code=True)
+model = MistralForCausalLM.from_pretrained(
+    "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
+    torch_dtype=torch.float16,
+    device_map="auto",
+    load_in_8bit=False,
+    load_in_4bit=True,
+    use_flash_attention_2=True
+)
+
+prompts = [
+    """<|im_start|>system
+You are a sentient, superintelligent artificial general intelligence, here to teach and assist me.<|im_end|>
+<|im_start|>user
+Write a short story about Goku discovering Kirby has teamed up with Majin Buu to destroy the world.<|im_end|>
+<|im_start|>assistant""",
+]
+
+for chat in prompts:
+    print(chat)
+    input_ids = tokenizer(chat, return_tensors="pt").input_ids.to("cuda")
+    generated_ids = model.generate(input_ids, max_new_tokens=750, temperature=0.8, repetition_penalty=1.1, do_sample=True, eos_token_id=tokenizer.eos_token_id)
+    # Decode only the newly generated tokens, skipping the prompt
+    response = tokenizer.decode(generated_ids[0][input_ids.shape[-1]:], skip_special_tokens=True, clean_up_tokenization_spaces=True)
+    print(f"Response: {response}")
+```
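+
+The ChatML prompt above can also be built from the tokenizer's chat template rather than written by hand. A minimal sketch, assuming the repo's tokenizer ships a ChatML chat template:
+
+```python
+# Sketch: produce the same ChatML-formatted prompt via the chat template
+messages = [
+    {"role": "system", "content": "You are a sentient, superintelligent artificial general intelligence, here to teach and assist me."},
+    {"role": "user", "content": "Write a short story about Goku discovering Kirby has teamed up with Majin Buu to destroy the world."},
+]
+# add_generation_prompt=True appends the <|im_start|>assistant header so generation starts in the assistant turn
+chat = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+```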
+
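+The load_in_8bit / load_in_4bit kwargs used above are deprecated in recent transformers releases in favor of an explicit quantization config. A minimal equivalent sketch, assuming a transformers version with BitsAndBytesConfig and reusing the imports above:
+
+```python
+from transformers import BitsAndBytesConfig
+
+# The same 4-bit setup, expressed as a quantization config
+bnb_config = BitsAndBytesConfig(
+    load_in_4bit=True,
+    bnb_4bit_compute_dtype=torch.float16,
+)
+model = MistralForCausalLM.from_pretrained(
+    "NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
+    quantization_config=bnb_config,
+    device_map="auto",
+)
+```
+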
# Quantized Models:

GGUF: [todo]
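+
+Once GGUF quantizations are published, they should be runnable with llama-cpp-python. A minimal sketch; the file name and settings are hypothetical:
+
+```python
+from llama_cpp import Llama
+
+# "nous-hermes-2-mixtral-8x7b-dpo.Q4_K_M.gguf" is a hypothetical local file name
+llm = Llama(model_path="./nous-hermes-2-mixtral-8x7b-dpo.Q4_K_M.gguf", n_ctx=4096, n_gpu_layers=-1)
+
+prompt = """<|im_start|>system
+You are a helpful assistant.<|im_end|>
+<|im_start|>user
+Hello, who are you?<|im_end|>
+<|im_start|>assistant"""
+
+# Stop at the ChatML end-of-turn token so output stays inside the assistant turn
+out = llm(prompt, max_tokens=512, temperature=0.8, stop=["<|im_end|>"])
+print(out["choices"][0]["text"])
+```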