---
base_model:
- google/gemma-2-9b-it
pipeline_tag: text-generation
tags:
- directml
- windows
---

# Model Card for gemma-2-9b-it ONNX GenAI INT4 DirectML

## Model Details
google/gemma-2-9b-it quantized to ONNX GenAI INT4 with Microsoft DirectML optimization.<br>
The output is reformatted so that each sentence starts on a new line to improve readability.
<pre>
...
vNewDecoded = tokenizer_stream.decode(new_token)
# Start a new line after "." (\x2E), ":" (\x3A) or ";" (\x3B),
# unless the next token opens a list item (" *").
if re.findall("^[\x2E\x3A\x3B]$", vPreviousDecoded) and vNewDecoded.startswith(" ") and not vNewDecoded.startswith(" *"):
    vNewDecoded = "\n" + vNewDecoded.replace(" ", "", 1)
print(vNewDecoded, end='', flush=True)
vPreviousDecoded = vNewDecoded
...
</pre>
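The snippet above can be factored into a small, self-contained helper; a minimal sketch (the function name `reflow_token` is illustrative and not part of the repository's script):

```python
import re

# Tokens consisting solely of ".", ":" or ";" (\x2E, \x3A, \x3B) end a sentence.
_SENTENCE_END = re.compile(r"^[.:;]$")

def reflow_token(previous: str, current: str) -> str:
    """Move `current` onto a new line when `previous` ended a sentence,
    unless `current` opens a markdown list item (" *")."""
    if (_SENTENCE_END.match(previous)
            and current.startswith(" ")
            and not current.startswith(" *")):
        # drop the leading space and start a new line instead
        return "\n" + current.replace(" ", "", 1)
    return current
```

For example, `reflow_token(".", " The")` yields `"\nThe"`, while `reflow_token(",", " and")` and `reflow_token(".", " * item")` pass through unchanged.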

<img src="https://zci.sourceforge.io/epub/gemmaonnx.png">

### Model Description
google/gemma-2-9b-it quantized to ONNX GenAI INT4 with Microsoft DirectML optimization.<br>
https://onnxruntime.ai/docs/genai/howto/install.html#directml

Created using ONNX Runtime GenAI's builder.py<br>
https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/src/python/py/models/builder.py

Build options:<br>
INT4 accuracy level: FP32 (float32)
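The build can be reproduced with builder.py; a hedged sketch of the invocation (the flag names and the mapping of `int4_accuracy_level=1` to FP32 follow my reading of builder.py's options and may differ by version, so verify with `python builder.py --help`):

```shell
rem Download builder.py, then build an INT4 DirectML model from the base checkpoint.
python builder.py -m google/gemma-2-9b-it -o .\gemma2-9b-it-onnx-int4-dml ^
    -p int4 -e dml --extra_options int4_accuracy_level=1
```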

- **Developed by:** Mochamad Aris Zamroni

### Model Sources [optional]
https://huggingface.co/google/gemma-2-9b-it

### Direct Use
This is a Microsoft Windows DirectML optimized model.<br>
It may not work with ONNX execution providers other than DmlExecutionProvider.<br>
The needed Python scripts are included in this repository.

Prerequisites:<br>
1. Install Python 3.11 from the Windows Store:<br>
https://apps.microsoft.com/search/publisher?name=Python+Software+Foundation

2. Open a command prompt (cmd.exe).

3. Create a Python virtual environment, activate it, then install onnxruntime-genai-directml:<br>
mkdir c:\temp<br>
cd c:\temp<br>
python -m venv dmlgenai<br>
dmlgenai\Scripts\activate.bat<br>
pip install onnxruntime-genai-directml

4. Use onnxgenairun.py to get a chat interface.<br>
It is a modified version of "https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3-qa.py".<br>
The modification inserts a newline after ".", ":" and ";" so the output is easier to read.

rem Change directory to where the model and script files are stored<br>
cd this_onnx_model_directory<br>
python onnxgenairun.py --help<br>
python onnxgenairun.py -m . -v -g

5. (Optional but recommended) Device-specific optimization.<br>
a. Open "dml-device-specific-optim.py" in a text editor and change the file paths accordingly.<br>
b. Run the script: python dml-device-specific-optim.py<br>
c. Rename the original model.onnx to another file name, then rename the optimized ONNX file from step 5.b to model.onnx.<br>
d. Rerun step 4.

#### Speeds, Sizes, Times [optional]
6 tokens/s on a Radeon 780M with 8 GB of pre-allocated RAM.

#### Hardware
AMD Ryzen 7840U (Zen 4) with integrated Radeon 780M GPU<br>
32 GB RAM

#### Software
Microsoft DirectML on Windows 10

## Model Card Authors [optional]
Mochamad Aris Zamroni

## Model Card Contact
https://www.linkedin.com/in/zamroni/