amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV

Update README.md

#6

by haoyang-amd - opened Sep 17

base: refs/heads/main ← from: refs/pr/6

README.md +7 -5

@@ -8,7 +8,7 @@ base_model: mistralai/Mixtral-8x7B-Instruct-v0.1
 - ## Introduction
   This model was created by applying [Quark](https://quark.docs.amd.com/latest/index.html) with calibration samples from Pile dataset.
 - ## Quantization Stragegy
-  - ***Quantized Layers***: All linear layers excluding "lm_head", "*gate"
   - ***Weight***: FP8 symmetric per-tensor
   - ***Activation***: FP8 symmetric per-tensor
   - ***KV Cache***: FP8 symmetric  per-tensor
@@ -23,8 +23,9 @@ python3 quantize_quark.py \
         --output_dir Mixtral-8x7B-Instruct-v0.1-FP8-KV \
         --quant_scheme w_fp8_a_fp8 \
         --kv_cache_dtype fp8 \
-        --num_calib_data 128  \
-        --model_export quark_safetensors
 # If model size is too large for single GPU, please use multi GPU instead.
 python3 quantize_quark.py \
         --model_dir $MODEL_DIR \
@@ -33,6 +34,7 @@ python3 quantize_quark.py \
         --kv_cache_dtype fp8 \
         --num_calib_data 128  \
         --model_export quark_safetensors \
         --multi_gpu
 ```
 ## Deployment
@@ -53,9 +55,9 @@ The quantization evaluation results are conducted in pseudo-quantization mode, w
   <tr>
    <td>Perplexity-wikitext2
    </td>
-   <td>4.1391
    </td>
-   <td>4.2187
    </td>
   </tr>
 </table>

 - ## Introduction
   This model was created by applying [Quark](https://quark.docs.amd.com/latest/index.html) with calibration samples from Pile dataset.
 - ## Quantization Stragegy
+  - ***Quantized Layers***: All linear layers excluding "lm_head", "*.gate"
   - ***Weight***: FP8 symmetric per-tensor
   - ***Activation***: FP8 symmetric per-tensor
   - ***KV Cache***: FP8 symmetric  per-tensor
         --output_dir Mixtral-8x7B-Instruct-v0.1-FP8-KV \
         --quant_scheme w_fp8_a_fp8 \
         --kv_cache_dtype fp8 \
+        --num_calib_data 128 \
+        --model_export quark_safetensors \
+        --no_weight_matrix_merge
 # If model size is too large for single GPU, please use multi GPU instead.
 python3 quantize_quark.py \
         --model_dir $MODEL_DIR \
         --kv_cache_dtype fp8 \
         --num_calib_data 128  \
         --model_export quark_safetensors \
+        --no_weight_matrix_merge \
         --multi_gpu
 ```
 ## Deployment
   <tr>
    <td>Perplexity-wikitext2
    </td>
+   <td>4.1387
    </td>
+   <td>4.2207
    </td>
   </tr>
 </table>
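
The first hunk changes the exclusion pattern from `"*gate"` to `"*.gate"`. The diff does not state how Quark matches these patterns, so the sketch below assumes shell-style glob matching and uses Python's `fnmatch` only to illustrate why the extra dot can matter. The layer names are hypothetical and modelled on Mixtral-style module paths; they are not taken from this PR.

```python
from fnmatch import fnmatchcase

# Hypothetical module names for illustration only (not listed in the PR).
layer_names = [
    "lm_head",
    "model.layers.0.block_sparse_moe.gate",
    "model.layers.0.block_sparse_moe.experts.0.w1",
    "model.layers.0.mlp.up_gate",  # hypothetical: ends in "gate" but is not a ".gate" module
]


def excluded(names, patterns):
    """Return the names that match at least one glob pattern (case-sensitive)."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


# Assumption: exclusion patterns behave like shell globs.
old_excluded = excluded(layer_names, ["lm_head", "*gate"])   # pattern before the PR
new_excluded = excluded(layer_names, ["lm_head", "*.gate"])  # pattern after the PR

print("old:", old_excluded)
print("new:", new_excluded)
```

Under that assumption, `"*gate"` would also exclude any module whose name merely ends in `gate`, while `"*.gate"` only excludes modules whose final path component is exactly `gate`, such as the MoE router.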