dangvansam committed
Commit 0bc62bc · verified · 1 Parent(s): c47eb0d

Create README.md

Files changed (1)
  1. README.md +89 -0
README.md ADDED
---
language:
- en
- vi
- zh
base_model:
- google/gemma-2-27b-it
pipeline_tag: text-generation
tags:
- fp8
- vllm
- system-role
- langchain
license: gemma
---

# gemma-2-27b-it-FP8-fix-system-role

Quantized version of [gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it) with an updated **`chat_template`** that supports the **`system`** role, handling errors such as the following (a quick check of the template is sketched after this list):
- `Conversation roles must alternate user/assistant/user/assistant/...`
- `System role not supported`

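A minimal sketch of how to verify the updated template: render a conversation that begins with a `system` message (with the stock `google/gemma-2-27b-it` template, the same call raises `System role not supported`).

```python
from transformers import AutoTokenizer

# Load the tokenizer shipped with this repository, which carries the updated chat_template.
tok = AutoTokenizer.from_pretrained("dangvansam/gemma-2-27b-it-FP8-fix-system-role")

# Rendering a conversation that starts with a system message should now succeed.
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```
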
## Model Overview
- **Model Architecture:** Gemma 2
- **Input:** Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 04/12/2024
- **Version:** 1.0

### Model Optimizations

This model was obtained by quantizing the weights and activations of [gemma-2-27b-it](https://huggingface.co/google/gemma-2-27b-it) to the FP8 data type, ready for inference with vLLM >= 0.5.1.
This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%.

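As a rough, back-of-the-envelope illustration (the ~27.2B parameter count is an assumption for Gemma 2 27B; embeddings and non-quantized tensors shift the exact numbers):

```python
# Illustrative memory estimate only; real checkpoint sizes differ slightly.
num_params = 27.2e9                 # approximate parameter count of Gemma 2 27B
bf16_gb = num_params * 2 / 1e9      # 16-bit weights: ~54 GB
fp8_gb = num_params * 1 / 1e9       # 8-bit weights:  ~27 GB
print(f"BF16 ~{bf16_gb:.0f} GB, FP8 ~{fp8_gb:.0f} GB")
```
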
Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations.
[AutoFP8](https://github.com/neuralmagic/AutoFP8) is used for quantization, with calibration performed on a single instance of every token in random order.

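The per-tensor scheme can be sketched in a few lines of plain PyTorch (an illustration of the idea, not the actual AutoFP8 or vLLM kernels; it assumes the FP8 E4M3 format):

```python
import torch

def quantize_fp8_per_tensor(x: torch.Tensor):
    """Symmetric per-tensor FP8 quantization: one scale for the whole tensor."""
    finfo = torch.finfo(torch.float8_e4m3fn)         # max representable magnitude is 448
    scale = x.abs().max().clamp(min=1e-12) / finfo.max
    x_fp8 = (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)
    return x_fp8, scale                              # dequantize with x_fp8.float() * scale

w = torch.randn(4096, 4096)
w_fp8, w_scale = quantize_fp8_per_tensor(w)
print(w_scale.item(), (w_fp8.float() * w_scale - w).abs().max().item())
```
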
## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the examples below.

With the CLI:
```bash
vllm serve dangvansam/gemma-2-27b-it-FP8-fix-system-role -q fp8
```
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "dangvansam/gemma-2-27b-it-FP8-fix-system-role",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who are you?"}
    ]
  }'
```

With Python:
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "dangvansam/gemma-2-27b-it-FP8-fix-system-role"

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# The updated chat_template accepts a leading system message.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"}
]

# Render the conversation into a single prompt string.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
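For example, once the server from the CLI example above is running, the standard `openai` Python client can point at it (a sketch; vLLM does not check the API key by default, so any placeholder value works):

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="dangvansam/gemma-2-27b-it-FP8-fix-system-role",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    max_tokens=256,
)
print(response.choices[0].message.content)
```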