asherszhang committed
Commit e1382cc · verified · 1 Parent(s): e9d2b83

doc: add reason switch and agent function call parameters.

Files changed (1)
  1. README.md +53 -6
README.md CHANGED
@@ -90,7 +90,7 @@ Hunyuan-A13B-Instruct has achieved highly competitive performance across multipl
 
 
 ## Use with transformers
-The following code snippet shows how to use the transformers library to load and apply the model. It also demonstrates how to enable and disable the reasoning mode, and how to parse the reasoning process along with the final output.
+Below is an example of how to use this model with the Hugging Face transformers library. This includes loading the model and tokenizer, toggling reasoning (thinking) mode, and parsing both the reasoning process and final answer from the output.
 
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -101,13 +101,20 @@ model_name_or_path = os.environ['MODEL_PATH']
 # model_name_or_path = "tencent/Hunyuan-A13B-Instruct"
 
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
+model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
+                                             device_map="auto", trust_remote_code=True)  # You may want to use bfloat16 and/or move to GPU here
+
 messages = [
     {"role": "user", "content": "Write a short summary of the benefits of regular exercise"},
 ]
-tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt",
-    enable_thinking=True  # Toggle thinking mode (default: True)
-)
+
+tokenized_chat = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_tensors="pt",
+    enable_thinking=True  # Toggle thinking mode (default: True)
+)
 
 outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=4096)
 
@@ -125,6 +132,27 @@ print(f"thinking_content:{think_content}\n\n")
 print(f"answer_content:{answer_content}\n\n")
 ```
 
+### Fast and slow thinking switch
+
+This model supports two modes of operation:
+
+- Slow Thinking Mode (default): enables detailed internal reasoning steps before producing the final answer.
+- Fast Thinking Mode: skips the internal reasoning process for faster inference, going straight to the final answer.
+
+**Switching to Fast Thinking Mode:**
+
+To disable the reasoning process, set `enable_thinking=False` in the `apply_chat_template` call:
+```python
+tokenized_chat = tokenizer.apply_chat_template(
+    messages,
+    tokenize=True,
+    add_generation_prompt=True,
+    return_tensors="pt",
+    enable_thinking=False  # Use fast thinking mode
+)
+```
+
+
 ## Quantitative Compression
 We used our own `AngleSlim` compression tool to produce FP8 and INT4 quantization models. `AngleSlim` is expected to be open-sourced in early July and will support one-click quantization and compression of large models; in the meantime, you can download our quantized models directly for deployment testing.
 
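The example above ends by printing `think_content` and `answer_content`, but the lines that actually extract them from the generation fall outside this diff's context. As a minimal sketch (not part of the commit), assuming the model emits its reasoning wrapped in `<think>...</think>` tags ahead of the final answer, the parsing step might look like this, continuing from the `outputs` tensor produced above:

```python
import re

# Decode only the newly generated tokens, skipping the prompt portion.
output_text = tokenizer.decode(
    outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True
)

# Assumption: reasoning is delimited by <think>...</think>; everything after it is the answer.
match = re.search(r"<think>(.*?)</think>", output_text, flags=re.DOTALL)
think_content = match.group(1).strip() if match else ""
answer_content = output_text.split("</think>")[-1].strip()

print(f"thinking_content:{think_content}\n\n")
print(f"answer_content:{answer_content}\n\n")
```

Under the same assumption, running with `enable_thinking=False` would produce no `<think>` block, so `think_content` simply comes back empty.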
 
@@ -196,7 +224,7 @@ trtllm-serve \
 ```
 
 
-### vllm
+### vLLM
 
 #### Docker Image
 We provide a pre-built Docker image containing vLLM 0.8.5 with full support for this model. The official vLLM release is currently under development. **Note: CUDA 12.8 is required for this Docker image.**
@@ -237,6 +265,25 @@ docker run --privileged --user root --net=host --ipc=host \
 ```
 
 
+
+#### Tool Calling with vLLM
+
+To support agent-based workflows and function calling capabilities, this model includes specialized parsing mechanisms for handling tool calls and internal reasoning steps.
+
+For a complete working example of how to implement and use these features in an agent setting, please refer to our full agent implementation on GitHub:
+🔗 [Hunyuan A13B Agent Example](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/)
+
+When deploying the model using **vLLM**, the following parameters can be used to configure the tool parsing behavior:
+
+| Parameter | Value |
+|------------------------|--------------------------------------------------------------------------------------------------------------|
+| `--tool-parser-plugin` | [Local Hunyuan A13B Tool Parser File](https://github.com/Tencent-Hunyuan/Hunyuan-A13B/blob/main/agent/hunyuan_tool_parser.py) |
+| `--tool-call-parser`   | `hunyuan` |
+
+These settings enable vLLM to correctly interpret and route tool calls generated by the model according to the expected format.
+
+
+
 ### SGLang
 
 #### Docker Image
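The tool-calling hunk above stops at the parser flags. As an illustrative sketch only, the snippet below exercises function calling against a vLLM OpenAI-compatible server assumed to have been launched with those flags; the endpoint address, served model name, and the `get_weather` tool are placeholders for illustration, not values taken from this commit:

```python
# Sketch only. Assumes the server was started with something like:
#   vllm serve tencent/Hunyuan-A13B-Instruct \
#       --tool-parser-plugin hunyuan_tool_parser.py \
#       --tool-call-parser hunyuan \
#       --enable-auto-tool-choice
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local endpoint

# Hypothetical tool definition, used only to demonstrate the request shape.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="tencent/Hunyuan-A13B-Instruct",  # must match the name the server exposes
    messages=[{"role": "user", "content": "What's the weather in Shenzhen?"}],
    tools=tools,
)

# With the hunyuan tool parser enabled, tool calls come back as structured objects
# rather than raw text, so they can be dispatched directly.
print(resp.choices[0].message.tool_calls)
```

For the full request/response loop (executing the tool and feeding the result back to the model), see the agent example linked in the hunk above.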
 