Update docs/transformers_deployment_guide_cn.md
docs/transformers_deployment_guide_cn.md CHANGED
@@ -24,9 +24,9 @@ model = AutoModelForCausalLM.from_pretrained(MODEL_PATH, device_map="auto", trus
 tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
 
 messages = [
-    {"role": "user", "content": "What is your favourite condiment?"},
-    {"role": "assistant", "content": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"},
-    {"role": "user", "content": "Do you have mayonnaise recipes?"}
+    {"role": "user", "content": [{"type": "text", "text": "What is your favourite condiment?"}]},
+    {"role": "assistant", "content": [{"type": "text", "text": "Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavour to whatever I'm cooking up in the kitchen!"}]},
+    {"role": "user", "content": [{"type": "text", "text": "Do you have mayonnaise recipes?"}]}
 ]
 
 text = tokenizer.apply_chat_template(
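The hunk above switches the guide's example messages from bare-string content to the typed-content format, where each turn's content is a list of parts. As a minimal sketch of how such messages flow through the chat template (the tokenize and add_generation_prompt arguments are illustrative, and MODEL_PATH stands in for the path defined earlier in the guide; the guide's own apply_chat_template call is truncated in this excerpt):

```python
from transformers import AutoTokenizer

MODEL_PATH = "path/to/model"  # placeholder; the guide defines MODEL_PATH earlier
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Each message's content is now a list of typed parts rather than a bare string.
messages = [
    {"role": "user", "content": [{"type": "text", "text": "What is your favourite condiment?"}]},
]

# Render the conversation into a prompt string; add_generation_prompt appends
# the template's assistant header so the model knows it is its turn to reply.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(text)
```

The list-of-parts format is the same structure used for multimodal inputs, which is presumably why the guide adopts it here.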
@@ -59,7 +59,7 @@ print(response)
 
 The code snippet above demonstrates inference without any optimization tricks. However, the model can be sped up considerably by leveraging [Flash Attention](../perf_train_gpu_one#flash-attention-2), which provides a faster implementation of the attention mechanism used inside the model.
 
-First, make sure the latest version of Flash Attention 2 is installed
+First, make sure the latest version of Flash Attention 2 is installed:
 
 ```bash
 pip install -U flash-attn --no-build-isolation
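As a companion to the install step in the second hunk, here is a minimal sketch of what enabling the faster kernel at load time might look like (assuming a transformers version that supports the attn_implementation argument; the dtype and device_map values are illustrative, and MODEL_PATH is a placeholder for the path used earlier in the guide):

```python
import torch
from transformers import AutoModelForCausalLM

MODEL_PATH = "path/to/model"  # placeholder; the guide defines MODEL_PATH earlier

# attn_implementation="flash_attention_2" swaps in the flash-attn kernel;
# it requires the flash-attn package and a half-precision dtype such as bfloat16.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
```

The rest of the guide's generation code is unchanged; only the load call differs.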