---
license: mit
language:
- en
---

<img src="https://huggingface.co/anakin87/zephyr-7b-alpha-sharded/resolve/main/zephyr_sharded.png" alt="Zephyr Logo" width="800" style="display:block; margin-left:auto; margin-right:auto"/>


# Zephyr 7B Alpha - Sharded 

**UPDATE**
The original model ([Zephyr 7B Alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha)) has since been sharded as well,
so you can load it directly instead of this copy.

---

🧩🧩🧩 Just a sharded version of [Zephyr 7B Alpha](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha).

💻 Using this version, you can smoothly load the model on Colab and play with it!

From the [original model card](https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha):
> Zephyr is a series of language models that are trained to act as helpful assistants. Zephyr-7B-α is the first model in the series, and is a fine-tuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) that was trained on a mix of publicly available, synthetic datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290). We found that removing the in-built alignment of these datasets boosted performance on [MT Bench](https://huggingface.co/spaces/lmsys/mt-bench) and made the model more helpful. However, this means that the model is likely to generate problematic text when prompted to do so and should only be used for educational and research purposes.

## Usage
This version of the model is meant primarily to run smoothly on **Colab**.
I suggest loading the model with **8-bit quantization**, so that some GPU memory stays free for inference.

*However, it is perfectly fine to load the model in half-precision or with stronger quantization (4-bit).*
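
To get a feel for why 8-bit is a comfortable middle ground, here is a rough back-of-the-envelope estimate of the memory needed just to hold the weights at each precision. This is only a sketch: the parameter count is an assumed round figure, and the numbers ignore activations, the KV cache, and any quantization overhead, so real usage will be higher.

```python
# Rough estimate of the memory needed to hold the model weights only.
# Assumption: ~7.2e9 parameters, an approximate figure for a "7B" Mistral-based model.
N_PARAMS = 7.2e9

# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {
    "half-precision (fp16)": 2.0,
    "8-bit": 1.0,
    "4-bit": 0.5,
}


def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Return the weight footprint in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(N_PARAMS, nbytes)
    for name, nbytes in BYTES_PER_PARAM.items()
}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB")
```

Whatever is left on your GPU after the weights are loaded is what inference has to work with, which is why the lower-precision options leave more headroom.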

```python
# Colab/Jupyter cell syntax: the leading "!" runs a shell command
! pip install transformers accelerate bitsandbytes

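from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

# Load the weights in 8-bit through bitsandbytes; device_map="auto" lets accelerate place them on the GPU
model = AutoModelForCausalLM.from_pretrained(
    "anakin87/zephyr-7b-alpha-sharded",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)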
tokenizer = AutoTokenizer.from_pretrained("anakin87/zephyr-7b-alpha-sharded")

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a rapper",
    },
    {"role": "user", "content": "What is GPU?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])

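# Example output (generated_text includes the prompt; sampling is on, so the reply varies between runs):
#<|system|>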
#You are a friendly chatbot who always responds in the style of a rapper</s>
#<|user|>
#What is GPU?</s>
#<|assistant|>
#Yo, what's up fam, you askin' 'bout the GPU?
#Well, let me break it down for you, it's a pretty sick dud
#It stands for Graphics Processing Unit, a tech that's quite rude
#This bad boy's the one that's in charge of all the graphics you see
#On your computer screen or your high-tech TV
#It's a powerful tool that can handle intense 3D games and movies
#And it's built to handle multiple tasks with ease
#So if you're looking to take your gaming or video editing to the next level
#Just make sure you've got a top-notch GPU to make it happen.
#Peace out!
```
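
If you are curious what `apply_chat_template` is doing in the snippet above, the sketch below rebuilds the same prompt layout by hand, based only on the example output shown there. `format_zephyr_prompt` is a hypothetical helper written for illustration, not part of `transformers`; the real template is stored with the tokenizer and its exact whitespace may differ, so keep using `apply_chat_template` in practice.

```python
def format_zephyr_prompt(messages, add_generation_prompt=True):
    """Hypothetical helper: render chat messages in the layout seen in the example output.

    Each message becomes "<|role|>", a newline, the content, and the "</s>" end marker.
    """
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}</s>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a rapper",
    },
    {"role": "user", "content": "What is GPU?"},
]

prompt = format_zephyr_prompt(messages)
print(prompt)
```

The printed prompt ends with an open `<|assistant|>` turn, which is the part the model fills in with its reply.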