ChatGLM-6B + ONNX
This model is exported from ChatGLM-6b with int8 quantization and optimized for ONNXRuntime inference. The export code is available in this repo.
Inference code for ONNXRuntime is uploaded with the model. Install the requirements and run streamlit run web-ui.py to start chatting. Currently, the MatMulInteger (for the u8s8 data type) and DynamicQuantizeLinear operators are only supported on CPU. Arm64 with Neon support (Apple M1/M2) should be reasonably fast.
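For readers unfamiliar with the two operators named above, the following is a minimal sketch of what u8s8 dynamic quantization computes: activations are quantized on the fly to unsigned 8-bit values, multiplied with signed 8-bit weights using integer accumulation, and then scaled back to floats. It follows the ONNX definitions of DynamicQuantizeLinear and MatMulInteger but is written in pure Python for readability. The input values, the weight matrix, `w_scale`, and the zero weight zero point are assumptions chosen for illustration, and this is not the code shipped with the model.

```python
# Sketch of u8s8 dynamic quantization following the ONNX operator definitions.
# Pure Python for readability; it is not the code shipped with this model.

def dynamic_quantize_linear(x):
    """Quantize floats to uint8; returns (quantized, scale, zero_point).

    Assumes x contains at least one non-zero value, so scale is non-zero.
    """
    x_min = min(0.0, min(x))  # the quantization range always includes 0
    x_max = max(0.0, max(x))
    scale = (x_max - x_min) / 255.0
    zero_point = max(0, min(255, round(-x_min / scale)))
    quantized = [max(0, min(255, round(v / scale) + zero_point)) for v in x]
    return quantized, scale, zero_point


def matmul_integer(a, a_zero_point, b):
    """Multiply uint8 rows `a` by an int8 matrix `b`, accumulating in integers.

    The weight zero point is taken to be 0 (symmetric int8 weights), which is
    an assumption of this sketch.
    """
    columns = list(zip(*b))
    return [
        [sum((av - a_zero_point) * bv for av, bv in zip(row, col)) for col in columns]
        for row in a
    ]


# Hypothetical activations and int8 weights, chosen only for illustration.
x = [-1.0, 0.4, 2.0]
w_q = [[100, -50], [0, 25], [-100, 10]]
w_scale = 0.01  # float weight = w_q * w_scale

x_q, x_scale, x_zp = dynamic_quantize_linear(x)
acc = matmul_integer([x_q], x_zp, w_q)
# Dequantize: both scales are folded back in after the integer matmul.
y = [[v * x_scale * w_scale for v in row] for row in acc]
print(x_q, x_zp, acc, y)
```

The integer matmul is the expensive step, which is why inference speed depends on how well the execution provider implements it for the u8s8 type combination.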
Usage
Clone with git-lfs:
git lfs clone https://huggingface.co/K024/ChatGLM-6b-onnx-u8s8
cd ChatGLM-6b-onnx-u8s8
pip install -r requirements.txt
streamlit run web-ui.py
Or use the huggingface_hub Python client library to download a snapshot of the repo:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="K024/ChatGLM-6b-onnx-u8s8", local_dir="./ChatGLM-6b-onnx-u8s8")
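After either download method, it can help to confirm that the files the usage steps rely on are actually present before launching the UI. The sketch below checks only for requirements.txt and web-ui.py, since those are the file names mentioned in this README. The helper name `missing_files` is hypothetical, and the model files are not checked because their names are not given here.

```python
from pathlib import Path

# Files the usage steps refer to; the model files themselves are not listed
# because their names are not given in this README.
EXPECTED_FILES = ("requirements.txt", "web-ui.py")


def missing_files(repo_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("./ChatGLM-6b-onnx-u8s8")
    if missing:
        print("Snapshot looks incomplete, missing:", ", ".join(missing))
    else:
        print("Snapshot looks complete; run: streamlit run web-ui.py")
```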
Code is released under the MIT license.
Model weights are released under the same license as ChatGLM-6b, see MODEL LICENSE.