---
|
license: apache-2.0 |
|
language: |
|
- en |
|
tags: |
|
- text-generation-inference |
|
- transformers |
|
- unsloth |
|
- mistral |
|
base_model: Severian/Mistral-v0.2-Nexus-Internal-Knowledge-Map-7B |
|
pipeline_tag: text-generation |
|
quantized_by: Tanvir1337 |
|
--- |
|
# Tanvir1337/Mistral-v0.2-Nexus-Internal-Knowledge-Map-7B-GGUF |
|
|
|
This model has been quantized using [llama.cpp](https://github.com/ggerganov/llama.cpp/), a high-performance inference engine for large language models. |
|
|
|
## System Prompt Format |
|
|
|
To interact with the model, use the following prompt format: |
|
``` |
|
{System} |
|
### Prompt: |
|
{User} |
|
### Response: |
|
``` |
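As a minimal sketch, the template above can be filled in with a small helper function (the function name and example strings below are illustrative, not part of the model):

```python
def build_prompt(system: str, user: str) -> str:
    """Fill the model's prompt template with a system message and user input."""
    return f"{system}\n\n### Prompt:\n{user}\n\n### Response:\n"

# Example usage with placeholder text:
prompt = build_prompt(
    "You are a helpful assistant.",
    "Summarize the benefits of quantization.",
)
print(prompt)
```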
|
|
|
## Usage Instructions |
|
|
|
If you're new to using GGUF files, refer to [TheBloke's README](https://huggingface.co/TheBloke/CapybaraHermes-2.5-Mistral-7B-GGUF) for detailed instructions. |
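For example, with a local llama.cpp build you can run a GGUF file directly. The sketch below assumes a recent llama.cpp checkout (where the binary is named `llama-cli`; older builds call it `main`), and the model filename is a placeholder for whichever quant you downloaded:

```shell
# Placeholder filename — substitute the quant file you actually downloaded.
MODEL=./Mistral-v0.2-Nexus-Internal-Knowledge-Map-7B.Q5_K_M.gguf

# -ngl: number of layers to offload to the GPU (use 0 for CPU-only)
# -c:   context window size
./llama-cli -m "$MODEL" -ngl 33 -c 4096 \
  -p $'You are a helpful assistant.\n\n### Prompt:\nHello!\n\n### Response:\n'
```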
|
|
|
## Quantization Options |
|
|
|
The following graph compares various quantization types (lower is better): |
|
|
|
![image.png](https://www.nethype.de/huggingface_embed/quantpplgraph.png) |
|
|
|
For more information on quantization, see [Artefact2's notes](https://gist.github.com/Artefact2/b5f810600771265fc1e39442288e8ec9). |
|
|
|
## Choosing the Right Model File |
|
|
|
To select the optimal model file, consider the following factors: |
|
|
|
1. **Memory constraints**: Determine how much RAM and/or VRAM you have available. |
|
2. **Speed vs. quality**: If you prioritize speed, choose a quant whose file size fits entirely within your GPU's VRAM. For maximum quality, choose the largest quant that fits within the combined RAM and VRAM of your system.
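As a rough way to apply the memory check above, file size scales with parameter count times average bits per weight. The bits-per-weight figures below are approximate averages for each quant type, not exact values read from any specific file:

```python
# Rough rule of thumb: file size ≈ parameter count × bits per weight / 8.
# Approximate average bits-per-weight for common GGUF quant types.
APPROX_BPW = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

def approx_file_size_gb(n_params: float, quant: str) -> float:
    """Estimate GGUF file size in GB for a parameter count and quant type."""
    return n_params * APPROX_BPW[quant] / 8 / 1e9

# A 7B model at Q5_K_M lands near 5 GB; leave extra headroom
# for the context (KV cache) on top of the file size itself.
for quant in APPROX_BPW:
    print(f"{quant}: ~{approx_file_size_gb(7e9, quant):.1f} GB")
```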
|
|
|
**Quantization formats**: |
|
|
|
* **K-quants** (e.g., Q5_K_M): A good starting point, offering a balance between speed and quality. |
|
* **I-quants** (e.g., IQ3_M): Newer and more size-efficient at a given quality, but slower on CPU and best suited to builds with GPU offload support (e.g., cuBLAS for NVIDIA or rocBLAS for AMD).
|
|
|
**Hardware compatibility**: |
|
|
|
* **I-quants**: Not compatible with the Vulkan backend. If you have an AMD card, ensure you're using the rocBLAS (ROCm) build of llama.cpp or a compatible inference engine.
|
|
|
For more information on the features and trade-offs of each quantization format, refer to the [llama.cpp feature matrix](https://github.com/ggerganov/llama.cpp/wiki/Feature-matrix). |