---
license: mit
---
|
|
|
# **GLM-edge-1.5b-chat-onnx-cpu-int4** |
|
|
|
**Note: This is an unofficial build, intended only for testing and development.**
|
|
|
This is the INT4-quantized ONNX version of the glm-edge-1.5b-chat model, built for CPU inference.
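For reference, INT4 CPU models in this format are typically produced with the onnxruntime-genai model builder. The command below is only a sketch: the source model id, the output path, and builder support for the GLM-edge architecture are assumptions, not taken from this repository.

```bash
# Sketch only: -m (source model), -o (output dir), -p (precision) and
# -e (execution provider) are standard model-builder flags; GLM-edge
# support in the builder is assumed here, not verified.
python -m onnxruntime_genai.models.builder \
    -m THUDM/glm-edge-1.5b-chat \
    -o ./glm-edge-1.5b-chat-onnx-cpu-int4 \
    -p int4 \
    -e cpu
```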
|
|
|
1. Install |
|
|
|
```bash
pip install torch transformers onnx onnxruntime
pip install --pre onnxruntime-genai
```
|
2. Sample |
|
|
|
```python
import onnxruntime_genai as og

# Path to your local copy of this repository
model_folder = "Your glm-edge-1.5b-chat-onnx-cpu-int4 path"

model = og.Model(model_folder)

tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

search_options = {}
search_options['max_length'] = 2048
search_options['past_present_share_buffer'] = False

# GLM-edge chat template: wrap the user input in role tags
chat_template = "<|user|>{input}<|assistant|>"

text = """Please introduce yourself."""

prompt = chat_template.format(input=text)

input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(**search_options)
# Note: this follows the older onnxruntime-genai generation API
# (params.input_ids + compute_logits); recent releases replace these
# with generator.append_tokens().
params.input_ids = input_tokens

generator = og.Generator(model, params)

# Stream the reply token by token
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()

    new_token = generator.get_next_tokens()[0]
    print(tokenizer_stream.decode(new_token), end='', flush=True)
```
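The sample above decodes greedily. For more varied replies, you can enable sampling in `search_options` before `params.set_search_options(**search_options)` is called. `do_sample`, `temperature`, and `top_p` are standard onnxruntime-genai search options; the values below are illustrative, not tuned for this model.

```python
# Illustrative sampling settings; add before params.set_search_options(...)
search_options['do_sample'] = True   # sample instead of greedy decoding
search_options['temperature'] = 0.7  # example value, not tuned
search_options['top_p'] = 0.9        # example value, not tuned
```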
|
|
|
|