dangvansam committed
Commit 0bc62bc · verified · 1 Parent(s): c47eb0d

Create README.md

Files changed (1)
  1. README.md +89 -0
README.md ADDED
---
language:
- en
- vi
- zh
base_model:
- google/gemma-2-27b-it
pipeline_tag: text-generation
tags:
- fp8
- vllm
- system-role
- langchain
license: gemma
---

# gemma-2-27b-it-FP8-fix-system-role

Quantized version of [gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it) with an updated **`chat_template`** that supports the **`system`** role, handling errors such as the following (a quick check of the template is sketched after this list):
- `Conversation roles must alternate user/assistant/user/assistant/...`
- `System role not supported`

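A minimal sketch of how to verify the updated template: render a conversation that begins with a `system` message (with the stock `google/gemma-2-27b-it` template, the same call raises `System role not supported`).

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with this repository, which carries the updated chat_template.
tok = AutoTokenizer.from_pretrained("dangvansam/gemma-2-27b-it-FP8-fix-system-role")

# Rendering a conversation that starts with a system message should now succeed.
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```
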
## Model Overview
- **Model Architecture:** Gemma 2
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 04/12/2024
- **Version:** 1.0

### Model Optimizations

This model was obtained by quantizing the weights and activations of [gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it) to the FP8 data type, ready for inference with vLLM >= 0.5.1.
This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%.

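As a rough, back-of-the-envelope illustration (the ~27.2B parameter count is an assumption for Gemma 2 27B; embeddings and non-quantized tensors shift the exact numbers):

```python
# Illustrative memory estimate only; real checkpoint sizes differ slightly.
num_params = 27.2e9                 # approximate parameter count of Gemma 2 27B
bf16_gb = num_params * 2 / 1e9      # 16-bit weights: ~54 GB
fp8_gb = num_params * 1 / 1e9       # 8-bit weights:  ~27 GB
print(f"BF16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB")
```
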
Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations.
[AutoFP8](https://github.com/neuralmagic/AutoFP8) is used for quantization, with calibration performed on a single instance of every token in random order.

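The per-tensor scheme can be sketched in a few lines of plain PyTorch (an illustration of the idea, not the actual AutoFP8 or vLLM kernels; it assumes the FP8 E4M3 format):

```python
import torch

def quantize_fp8_per_tensor(x: torch.Tensor):
    """Symmetric per-tensor FP8 quantization: one scale for the whole tensor."""
    finfo = torch.finfo(torch.float8_e4m3fn)         # max representable magnitude is 448
    scale = x.abs().max().clamp(min=1e-12) / finfo.max
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return x_fp8, scale                              # dequantize with x_fp8.float() * scale

w = torch.randn(4096, 4096)
w_fp8, w_scale = quantize_fp8_per_tensor(w)
print(w_scale.item(), (w_fp8.float() * w_scale - w).abs().max().item())
```
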
## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the examples below.

With the CLI:
```bash
vllm serve dangvansam/gemma-2-27b-it-FP8-fix-system-role -q fp8
```
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dangvansam/gemma-2-27b-it-FP8-fix-system-role",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who are you?"}
    ]
  }'
```

With Python:
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "dangvansam/gemma-2-27b-it-FP8-fix-system-role"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# The updated chat_template accepts a leading system message.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"}
]

# Render the conversation into a single prompt string.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
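For example, once the server from the CLI example above is running, the standard `openai` Python client can point at it (a sketch; vLLM does not check the API key by default, so any placeholder value works):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="dangvansam/gemma-2-27b-it-FP8-fix-system-role",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```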