all-gguf-same-where

matrixportal committed on Mar 28

Commit

5af094f

verified ·

1 Parent(s): f4971c6

Update app.py

Files changed (1)

app.py +126 -5

@@ -240,11 +240,6 @@ def process_model(model_id, q_method, use_imatrix, imatrix_q_method, private_rep
 ## ✅ Quantized Models Download List
 ### 🔍 Recommended Quantizations
-- **✨ General CPU Use:** [`Q4_K_M`](https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-q4_k_m.gguf) (Best balance of speed/quality)
-- **📱 ARM Devices:** [`Q4_0`](https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-q4_0.gguf) (Optimized for ARM CPUs)
-- **🏆 Maximum Quality:** [`Q8_0`](https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-q8_0.gguf) (Near-original quality)
-### 📦 Full Quantization Options
 | 🚀 Download | 🔢 Type | 📝 Notes |
 |:---------|:-----|:------|
 | [Download](https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-q2_k.gguf) | ![Q2_K](https://img.shields.io/badge/Q2_K-1A73E8) | Basic quantization |
@@ -262,6 +257,132 @@ def process_model(model_id, q_method, use_imatrix, imatrix_q_method, private_rep
 | [Download](https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-f16.gguf) | ![F16](https://img.shields.io/badge/F16-000000) | Maximum accuracy |
 💡 **Tip:** Use `F16` for maximum precision when quality is critical
 """
             # Update the README (using ModelCard)

+# GGUF Model Quantization & Usage Guide with llama.cpp
+## What is GGUF and Quantization?
+**GGUF** (commonly expanded as GPT-Generated Unified Format) is a binary model file format developed by the `llama.cpp` team as the successor to the older GGML format. It:
+- Supports multiple quantization levels
+- Works cross-platform
+- Enables fast loading and inference
+**Quantization** converts model weights to lower-precision data types (e.g., 4-bit integers instead of 16- or 32-bit floats). At the cost of a small loss in accuracy, this:
+- Reduces model size
+- Decreases memory usage
+- Speeds up inference
+## Step-by-Step Guide
+### 1. Prerequisites
+```bash
+# System updates
+sudo apt update && sudo apt upgrade -y
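+# Dependencies
+sudo apt install -y build-essential cmake git python3-pip
+# Clone and build llama.cpp with CMake (current releases no longer support the Makefile build)
+git clone https://github.com/ggerganov/llama.cpp
+cd llama.cpp
+cmake -B build
+# Binaries such as llama-cli are written to build/bin/
+cmake --build build --config Release -j 4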
+```
+### 2. Using Quantized Models from Hugging Face
+My automated quantization script publishes each quantized file at a URL of this form:
+```
+https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-q4_k_m.gguf
+```
+Download your quantized model directly:
+```bash
+wget https://huggingface.co/{new_repo_id}/resolve/main/{model_name.lower()}-q4_k_m.gguf
+```
+### 3. Running the Quantized Model
+Basic usage (CMake builds place the binary at `build/bin/llama-cli`; older releases named it `./main`):
+```bash
+./build/bin/llama-cli -m {model_name.lower()}-q4_k_m.gguf -p "Your prompt here" -n 128
+```
+Example with a creative writing prompt (the `[INST]` tags assume a Llama 2 or Mistral-style instruct model):
+```bash
+./build/bin/llama-cli -m {model_name.lower()}-q4_k_m.gguf \
+       -p "[INST] Write a short poem about AI quantization in the style of Shakespeare [/INST]" \
+       -n 256 -c 2048 -t 8 --temp 0.7
+```
+Adding sampling parameters (`--top-k`, `--top-p`):
+```bash
+./build/bin/llama-cli -m {model_name.lower()}-q4_k_m.gguf \
+       -p "Question: What is the GGUF format?\nAnswer:" \
+       -n 256 -c 2048 -t 8 --temp 0.7 --top-k 40 --top-p 0.9
+```
+### 4. Python Integration
+Install the Python package:
+```bash
+pip install llama-cpp-python
+```
+Example script:
+```python
+from llama_cpp import Llama
+# Initialize the model
+llm = Llama(
+    model_path="{model_name.lower()}-q4_k_m.gguf",
+    n_ctx=2048,
+    n_threads=8
+)
+# Run inference
+response = llm(
+    "[INST] Explain GGUF quantization to a beginner [/INST]",
+    max_tokens=256,
+    temperature=0.7,
+    top_p=0.9
+)
+print(response["choices"][0]["text"])
+```
+## Performance Tips
+1. **Hardware Utilization**:
+   - Set the thread count with `-t` (typically the number of physical CPU cores)
+   - Build with a GPU backend (e.g., CUDA or OpenCL) and offload layers with `-ngl` for GPU support
+2. **Memory Optimization**:
+   - Lower-bit quantizations (e.g., q4_k_m) use less RAM
+   - Adjust the context size with the `-c` parameter; smaller contexts need less memory
+3. **Speed/Accuracy Balance**:
+   - Higher-bit quantizations (e.g., q8_0) are larger and usually slower, but more accurate
+   - Set `--temp 0` for greedy, repeatable output
+## FAQ
+**Q: What quantization levels are available?**
+A: Common options include q4_0, q4_k_m, q5_0, q5_k_m, q8_0 (my script uses q4_k_m by default)
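+**Q: How much quality is lost with q4_k_m?**
+A: It depends on the model and task; typically a 2-5% accuracy reduction, in exchange for a file roughly 3x smaller than F16 (q4_k_m averages just under 5 bits per weight)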
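+**Q: How do I enable GPU support?**
+A: For NVIDIA GPUs, configure with `cmake -B build -DGGML_CUDA=ON`, then run `cmake --build build --config Release` (older Makefile builds used `make LLAMA_CUBLAS=1`)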
+## Useful Resources
+1. [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
+2. [GGUF Format Specs](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
+3. [Hugging Face Model Hub](https://huggingface.co/models)
 """
             # Update the README (using ModelCard)
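
Every download link in the template above follows the same pattern: the repo ID, then `resolve/main/`, then the lowercased model name and quantization type joined by a hyphen. A minimal sketch of that pattern follows. It is an illustration only, not code from this commit: the helper names `gguf_filename` and `gguf_download_url` and the example repo ID are hypothetical.

```python
def gguf_filename(model_name: str, quant: str) -> str:
    """Return the file name the template links to, e.g. 'mymodel-q4_k_m.gguf'."""
    # The template lowercases the model name; quant types appear lowercased in its URLs.
    return f"{model_name.lower()}-{quant.lower()}.gguf"


def gguf_download_url(new_repo_id: str, model_name: str, quant: str) -> str:
    """Build the direct-download URL for one quantized file in a Hub repo."""
    # Hypothetical helper: mirrors the URL pattern in the README template.
    filename = gguf_filename(model_name, quant)
    return f"https://huggingface.co/{new_repo_id}/resolve/main/{filename}"


if __name__ == "__main__":
    # "user/MyModel-GGUF" is a made-up repo ID used only for illustration.
    for quant in ("Q2_K", "Q4_K_M", "F16"):
        print(gguf_download_url("user/MyModel-GGUF", "MyModel", quant))
```

Keeping the file name and URL in separate helpers means the same name can be reused for the local `-m` argument after downloading.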