File size: 7,214 Bytes
ae9f583 04a4be4 ae9f583 750c6d3 ae9f583 750c6d3 ae9f583 bea0f87 ae9f583 bea0f87 ae9f583 750c6d3 ae9f583 bea0f87 ae9f583 bea0f87 ae9f583 bea0f87 ae9f583 bea0f87 ae9f583 bea0f87 ae9f583 bea0f87 ae9f583 750c6d3 ae9f583 bea0f87 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 |
---
license: cc-by-nc-4.0
base_model: google/gemma-2b
model-index:
- name: Octopus-V2-2B
results: []
tags:
- function calling
- on-device language model
- android
inference: false
space: false
spaces: false
language:
- en
---
# Quantized Octopus V2: On-device language model for super agent
This repo includes two types of quantized models: **GGUF** and **AWQ**, for our Octopus V2 model at [NexaAIDev/Octopus-v2](https://huggingface.co/NexaAIDev/Octopus-v2)
<p align="center" width="100%">
<a><img src="Octopus-logo.jpeg" alt="nexa-octopus" style="width: 40%; min-width: 300px; display: block; margin: auto;"></a>
</p>
# GGUF Qauntization
To run the models, please download them to your local machine using either git clone or [Hugging Face Hub](https://huggingface.co/docs/huggingface_hub/en/guides/download)
```
git clone https://huggingface.co/NexaAIDev/Octopus-v2-gguf-awq
```
## Run with [llama.cpp](https://github.com/ggerganov/llama.cpp) (Recommended)
1. **Clone and compile:**
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Compile the source code:
make
```
2. **Execute the Model:**
Run the following command in the terminal:
```bash
./main -m ./path/to/octopus-v2-Q4_K_M.gguf -n 256 -p "Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Take a selfie for me with front camera\n\nResponse:"
```
## Run with [Ollama](https://github.com/ollama/ollama)
Since our models have not been uploaded to the Ollama server, please download the models and manually import them into Ollama by following these steps:
1. Install Ollama on your local machine. You can also following the guide from [Ollama GitHub repository](https://github.com/ollama/ollama/blob/main/docs/import.md)
```bash
git clone https://github.com/ollama/ollama.git ollama
```
2. Locate the local Ollama directory:
```bash
cd ollama
```
3. Create a `Modelfile` in your directory
```bash
touch Modelfile
```
4. In the Modelfile, include a `FROM` statement with the path to your local model, and the default parameters:
```bash
FROM ./path/to/octopus-v2-Q4_K_M.gguf
```
5. Use the following command to add the model to Ollama:
```bash
ollama create octopus-v2-Q4_K_M -f Modelfile
```
6. Verify that the model has been successfully imported:
```bash
ollama ls
```
7. Run the mode
```bash
ollama run octopus-v2-Q4_K_M "Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Take a selfie for me with front camera\n\nResponse:"
```
# AWQ Quantization
Python example:
```python
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
import torch
import time
import numpy as np
def inference(input_text):
start_time = time.time()
input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
input_length = input_ids["input_ids"].shape[1]
generation_output = model.generate(
input_ids["input_ids"],
do_sample=False,
max_length=1024
)
end_time = time.time()
# Decode only the generated part
generated_sequence = generation_output[:, input_length:].tolist()
res = tokenizer.decode(generated_sequence[0])
latency = end_time - start_time
num_output_tokens = len(generated_sequence[0])
throughput = num_output_tokens / latency
return {"output": res, "latency": latency, "throughput": throughput}
# Initialize tokenizer and model
model_id = "/path/to/Octopus-v2-AWQ-NexaAIDev"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=False)
model = AutoAWQForCausalLM.from_quantized(model_id, fuse_layers=True,
trust_remote_code=False, safetensors=True)
prompts = ["Below is the query from the users, please call the correct function and generate the parameters to call the function.\n\nQuery: Can you take a photo using the back camera and save it to the default location? \n\nResponse:"]
avg_throughput = []
for prompt in prompts:
out = inference(prompt)
avg_throughput.append(out["throughput"])
print("nexa model result:\n", out["output"])
print("avg throughput:", np.mean(avg_throughput))
```
# Quantized GGUF & AWQ Models Benchmark
| Name | Quant method | Bits | Size | Response (t/s) | Use Cases |
| ---------------------- | ------------ | ---- | -------- | -------------- | ----------------------------------- |
| Octopus-v2-AWQ | AWQ | 4 | 3.00 GB | 63.83 | fast, high quality, recommended |
| Octopus-v2-Q2_K.gguf | Q2_K | 2 | 1.16 GB | 57.81 | fast but high loss, not recommended |
| Octopus-v2-Q3_K.gguf | Q3_K | 3 | 1.38 GB | 57.81 | extremely not recommended |
| Octopus-v2-Q3_K_S.gguf | Q3_K_S | 3 | 1.19 GB | 52.13 | extremely not recommended |
| Octopus-v2-Q3_K_M.gguf | Q3_K_M | 3 | 1.38 GB | 58.67 | moderate loss, not very recommended |
| Octopus-v2-Q3_K_L.gguf | Q3_K_L | 3 | 1.47 GB | 56.92 | not very recommended |
| Octopus-v2-Q4_0.gguf | Q4_0 | 4 | 1.55 GB | 68.80 | moderate speed, recommended |
| Octopus-v2-Q4_1.gguf | Q4_1 | 4 | 1.68 GB | 68.09 | moderate speed, recommended |
| Octopus-v2-Q4_K.gguf | Q4_K | 4 | 1.63 GB | 64.70 | moderate speed, recommended |
| Octopus-v2-Q4_K_S.gguf | Q4_K_S | 4 | 1.56 GB | 62.16 | fast and accurate, very recommended |
| Octopus-v2-Q4_K_M.gguf | Q4_K_M | 4 | 1.63 GB | 64.74 | fast, recommended |
| Octopus-v2-Q5_0.gguf | Q5_0 | 5 | 1.80 GB | 64.80 | fast, recommended |
| Octopus-v2-Q5_1.gguf | Q5_1 | 5 | 1.92 GB | 63.42 | very big, prefer Q4 |
| Octopus-v2-Q5_K.gguf | Q5_K | 5 | 1.84 GB | 61.28 | big, recommended |
| Octopus-v2-Q5_K_S.gguf | Q5_K_S | 5 | 1.80 GB | 62.16 | big, recommended |
| Octopus-v2-Q5_K_M.gguf | Q5_K_M | 5 | 1.71 GB | 61.54 | big, recommended |
| Octopus-v2-Q6_K.gguf | Q6_K | 6 | 2.06 GB | 55.94 | very big, not very recommended |
| Octopus-v2-Q8_0.gguf | Q8_0 | 8 | 2.67 GB | 56.35 | very big, not very recommended |
| Octopus-v2-f16.gguf | f16 | 16 | 5.02 GB | 36.27 | extremely big |
| Octopus-v2.gguf | | | 10.00 GB | | |
_Quantized with llama.cpp_
**Acknowledgement**:
We sincerely thank our community members, [Mingyuan](https://huggingface.co/ThunderBeee), [Zoey](https://huggingface.co/ZY6), [Brian](https://huggingface.co/JoyboyBrian), [Perry](https://huggingface.co/PerryCheng614), [Qi](https://huggingface.co/qiqiWav), [David](https://huggingface.co/Davidqian123) for their extraordinary contributions to this quantization effort. |