---
language:
- hi
pipeline_tag: text-generation
tags:
- hindi
- quantization
- shuvom/yuj-v1
license: apache-2.0
quantized_by: shuvom
---
# yuj-v1-GGUF
- Model creator: [shuvom_](https://huggingface.co/shuvom)
- Original model: [shuvom/yuj-v1](https://huggingface.co/shuvom/yuj-v1)
<!-- description start -->
## Description
This repo contains GGUF format model files for [shuvom/yuj-v1](https://huggingface.co/shuvom/yuj-v1).
<!-- description end -->
<!-- README_GGUF.md-about-gguf start -->
### About GGUF
GGUF is a file format for storing models for inference, used mainly for large language models. It was introduced by the llama.cpp project as the successor to the older GGML format, and it lets you run inference on consumer-grade GPUs and CPUs.
[More info](https://github.com/ggerganov/llama.cpp)
## Provided files
| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| [yuj-v1.Q4_K_M.gguf](https://huggingface.co/shuvom/yuj-v1-GGUF/blob/main/yuj-v1.Q4_K_M.gguf) | Q4_K_M | 4 | 4.17 GB| 6.87 GB | medium, balanced quality - recommended |
## Usage
1. Install the llama.cpp Python client and the Hugging Face Hub CLI
```python
!pip install llama-cpp-python huggingface-hub
```
2. Download the GGUF model file
```python
!huggingface-cli download shuvom/yuj-v1-GGUF yuj-v1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
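Before loading the file, it can be worth checking that the download is really a GGUF file and not, for example, a truncated file or an HTML error page. The sketch below relies only on the fact that GGUF files begin with the four ASCII bytes `GGUF`; the helper name `looks_like_gguf` is hypothetical and not part of any library.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # GGUF files start with these four bytes


def looks_like_gguf(path: str) -> bool:
    """Return True if the file at `path` exists and starts with the GGUF magic bytes.

    Hypothetical helper for illustration: it checks the header only and
    says nothing about whether the rest of the file is complete.
    """
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as handle:
        return handle.read(4) == GGUF_MAGIC
```

For example, `looks_like_gguf("./yuj-v1.Q4_K_M.gguf")` should return `True` once the download in this step has finished.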
3. Load the model. Set `n_gpu_layers` to the number of layers to offload to the GPU, or to 0 if no GPU acceleration is available on your system.
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./yuj-v1.Q4_K_M.gguf",  # Download the model file first
    n_ctx=2048,  # The max sequence length to use - note that longer sequence lengths require much more resources
    n_threads=8,  # The number of CPU threads to use, tailor to your system and the resulting performance
    n_gpu_layers=35,  # The number of layers to offload to GPU, if you have GPU acceleration available
)
```
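The thread count and GPU offload values above depend on your machine. As a rough sketch, the settings could be assembled from the local CPU count and a flag for GPU availability. The helper `build_llama_kwargs` and its `use_gpu` flag are hypothetical, and using every logical core is only an assumed starting point, not a tuned value.

```python
import os


def build_llama_kwargs(model_path: str, use_gpu: bool, gpu_layers: int = 35) -> dict:
    """Assemble keyword arguments for Llama(...) (hypothetical helper).

    Assumption: all logical cores are used for n_threads; fewer threads
    may perform better on some systems.
    """
    return {
        "model_path": model_path,
        "n_ctx": 2048,
        "n_threads": os.cpu_count() or 1,  # fall back to 1 if the count is unknown
        "n_gpu_layers": gpu_layers if use_gpu else 0,  # 0 disables GPU offload
    }


kwargs = build_llama_kwargs("./yuj-v1.Q4_K_M.gguf", use_gpu=False)
print(kwargs)
# llm = Llama(**kwargs)  # requires llama-cpp-python and the downloaded model file
```

Passing `use_gpu=True` keeps the 35 offloaded layers from the example above; lower `gpu_layers` if the model does not fit in GPU memory.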
4. Chat Completion API
```python
llm = Llama(
    model_path="./yuj-v1.Q4_K_M.gguf",  # The file downloaded in step 2
    chat_format="llama-2",  # Set chat_format according to the model you are using
)
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            # Hindi for: "Yuj is one of the top bilingual models"
            "content": "युज शीर्ष द्विभाषी मॉडल में से एक है",
        },
    ]
)
print(response)
```
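`create_chat_completion` returns an OpenAI-style dictionary, so the generated text sits under `choices[0].message.content`. The sketch below shows one way to pull it out; `extract_reply` is a hypothetical helper, and `sample_response` is a hand-written stand-in with made-up content, not real model output.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant's text from a chat completion response dict."""
    return response["choices"][0]["message"]["content"]


# Hand-written stand-in that mimics the shape of a real response (assumption:
# only the fields needed for this example are included).
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "नमस्ते"},
            "finish_reason": "stop",
        }
    ]
}

print(extract_reply(sample_response))
```

With a loaded model, the same helper can be applied directly to the value returned by `llm.create_chat_completion(...)`.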