---
license: cc-by-nc-4.0
base_model: google/gemma-2b
model-index:
- name: Octopus-V2-2B
  results: []
tags:
- function calling
- on-device language model
- android
inference: false
space: false
spaces: false
language:
- en
---
# Quantized Octopus V2: On-device language model for super agent

This repo includes two types of quantized models, **GGUF** and **AWQ**, for our Octopus V2 model at [NexaAIDev/Octopus-v2](https://huggingface.co/NexaAIDev/Octopus-v2).

<p align="center" width="100%">
  <a><img src="Octopus-logo.jpeg" alt="nexa-octopus" style="width: 40%; min-width: 300px; display: block; margin: auto;"></a>
</p>


# GGUF Quantization

To run the models, first download them to your local machine using either `git clone` or the [Hugging Face Hub](https://huggingface.co/docs/huggingface_hub/en/guides/download):
```
git clone https://huggingface.co/NexaAIDev/Octopus-v2-gguf-awq
```

## Run with [llama.cpp](https://github.com/ggerganov/llama.cpp) (Recommended) 

1. **Clone and compile:**

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Compile the source code:
make
```

2. **Execute the Model:**

Run the following command in the terminal:

```bash
./main -m ./path/to/octopus-v2-Q4_K_M.gguf -n 256 -p "Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Take a selfie for me with front camera\n\nResponse:"
```

## Run with [Ollama](https://github.com/ollama/ollama)

Since our models have not been uploaded to the Ollama server, please download the models and manually import them into Ollama by following these steps:

1. Install Ollama on your local machine. You can also follow the guide in the [Ollama GitHub repository](https://github.com/ollama/ollama/blob/main/docs/import.md):

```bash
git clone https://github.com/ollama/ollama.git ollama
```

2. Locate the local Ollama directory:
```bash
cd ollama
```

3. Create a `Modelfile` in your directory:
```bash
touch Modelfile
``` 

4. In the `Modelfile`, include a `FROM` statement pointing at the path to your local model:

```bash
FROM ./path/to/octopus-v2-Q4_K_M.gguf
```
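Optionally, the `Modelfile` can also pin sampling parameters via Ollama's `PARAMETER` directive; a sketch (the values below are illustrative defaults, not settings tuned by the authors):

```
FROM ./path/to/octopus-v2-Q4_K_M.gguf
PARAMETER temperature 0
PARAMETER num_predict 256
```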

5. Use the following command to add the model to Ollama:
```bash
ollama create octopus-v2-Q4_K_M -f Modelfile
```

6. Verify that the model has been successfully imported:
```bash
ollama ls
```

7. Run the model:
```bash
ollama run octopus-v2-Q4_K_M "Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Take a selfie for me with front camera\n\nResponse:"
```

# AWQ Quantization
Python example:

```python
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
import torch
import time
import numpy as np


def inference(input_text):
    """Generate a response and report latency and throughput."""
    start_time = time.time()
    input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
    input_length = input_ids["input_ids"].shape[1]
    generation_output = model.generate(
        input_ids["input_ids"],
        do_sample=False,
        max_length=1024
    )
    end_time = time.time()
    # Decode only the generated part
    generated_sequence = generation_output[:, input_length:].tolist()
    res = tokenizer.decode(generated_sequence[0])
    latency = end_time - start_time
    num_output_tokens = len(generated_sequence[0])
    throughput = num_output_tokens / latency
    return {"output": res, "latency": latency, "throughput": throughput}


# Initialize tokenizer and model
model_id = "/path/to/Octopus-v2-AWQ-NexaAIDev"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)

prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]

avg_throughput = []
for prompt in prompts:
    out = inference(prompt)
    avg_throughput.append(out["throughput"])
    print("nexa model result:\n", out["output"])
print("avg throughput:", np.mean(avg_throughput))
```
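The examples in this card all wrap the user query in the same instruction template, so it can be convenient to generate prompts with a small helper; a minimal sketch (the `build_prompt` name is ours, not part of the Octopus API):

```python
# Hypothetical helper that wraps a user query in the function-calling
# prompt format used by the examples above.
PROMPT_TEMPLATE = (
    "Below is the query from the users, please call the correct function "
    "and generate the parameters to call the function.\n\n"
    "Query: {query}\n\nResponse:"
)

def build_prompt(query: str) -> str:
    # Substitute the user's query into the shared instruction wrapper
    return PROMPT_TEMPLATE.format(query=query)

print(build_prompt("Take a selfie for me with front camera"))
```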

# Quantized GGUF & AWQ Models Benchmark

| Name                   | Quant method | Bits | Size     | Response (t/s) | Use Cases                           |
| ---------------------- | ------------ | ---- | -------- | -------------- | ----------------------------------- |
| Octopus-v2-AWQ         | AWQ          | 4    | 3.00 GB  | 63.83          | fast, high quality, recommended     |
| Octopus-v2-Q2_K.gguf   | Q2_K         | 2    | 1.16 GB  | 57.81          | fast but high loss, not recommended |
| Octopus-v2-Q3_K.gguf   | Q3_K         | 3    | 1.38 GB  | 57.81          | high loss, not recommended          |
| Octopus-v2-Q3_K_S.gguf | Q3_K_S       | 3    | 1.19 GB  | 52.13          | high loss, not recommended          |
| Octopus-v2-Q3_K_M.gguf | Q3_K_M       | 3    | 1.38 GB  | 58.67          | moderate loss, not very recommended |
| Octopus-v2-Q3_K_L.gguf | Q3_K_L       | 3    | 1.47 GB  | 56.92          | not very recommended                |
| Octopus-v2-Q4_0.gguf   | Q4_0         | 4    | 1.55 GB  | 68.80          | moderate speed, recommended         |
| Octopus-v2-Q4_1.gguf   | Q4_1         | 4    | 1.68 GB  | 68.09          | moderate speed, recommended         |
| Octopus-v2-Q4_K.gguf   | Q4_K         | 4    | 1.63 GB  | 64.70          | moderate speed, recommended         |
| Octopus-v2-Q4_K_S.gguf | Q4_K_S       | 4    | 1.56 GB  | 62.16          | fast and accurate, very recommended |
| Octopus-v2-Q4_K_M.gguf | Q4_K_M       | 4    | 1.63 GB  | 64.74          | fast, recommended                   |
| Octopus-v2-Q5_0.gguf   | Q5_0         | 5    | 1.80 GB  | 64.80          | fast, recommended                   |
| Octopus-v2-Q5_1.gguf   | Q5_1         | 5    | 1.92 GB  | 63.42          | very big, prefer Q4                 |
| Octopus-v2-Q5_K.gguf   | Q5_K         | 5    | 1.84 GB  | 61.28          | big, recommended                    |
| Octopus-v2-Q5_K_S.gguf | Q5_K_S       | 5    | 1.80 GB  | 62.16          | big, recommended                    |
| Octopus-v2-Q5_K_M.gguf | Q5_K_M       | 5    | 1.71 GB  | 61.54          | big, recommended                    |
| Octopus-v2-Q6_K.gguf   | Q6_K         | 6    | 2.06 GB  | 55.94          | very big, not very recommended      |
| Octopus-v2-Q8_0.gguf   | Q8_0         | 8    | 2.67 GB  | 56.35          | very big, not very recommended      |
| Octopus-v2-f16.gguf    | f16          | 16   | 5.02 GB  | 36.27          | extremely big                       |
| Octopus-v2.gguf        |              |      | 10.00 GB |                |                                     |

_Quantized with llama.cpp_
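When choosing between the files above, the trade-off is file size versus measured throughput; the selection can be scripted against the table, as in this sketch (our own helper with a few rows transcribed from the benchmark, not part of the repo):

```python
# A few (filename, size_gb, tokens_per_s) rows copied from the benchmark table
rows = [
    ("Octopus-v2-Q4_0.gguf",   1.55, 68.80),
    ("Octopus-v2-Q4_K_M.gguf", 1.63, 64.74),
    ("Octopus-v2-Q5_K_M.gguf", 1.71, 61.54),
    ("Octopus-v2-Q8_0.gguf",   2.67, 56.35),
]

def fastest_under(max_gb, rows):
    """Return the fastest-responding quant whose file fits under max_gb."""
    candidates = [r for r in rows if r[1] <= max_gb]
    return max(candidates, key=lambda r: r[2])[0] if candidates else None

print(fastest_under(1.7, rows))  # Q4_0 is the fastest quant under 1.7 GB
```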


**Acknowledgement**:  
We sincerely thank our community members, [Mingyuan](https://huggingface.co/ThunderBeee), [Zoey](https://huggingface.co/ZY6), [Brian](https://huggingface.co/JoyboyBrian), [Perry](https://huggingface.co/PerryCheng614), [Qi](https://huggingface.co/qiqiWav), [David](https://huggingface.co/Davidqian123) for their extraordinary contributions to this quantization effort.