---
language:
- hi
pipeline_tag: text-generation
tags:
- hindi
- quantization
- shuvom/yuj-v1
license: apache-2.0
quantized_by: shuvom
---
# yuj-v1-GGUF
- Model creator: [shuvom_](https://huggingface.co/shuvom)
- Original model: [shuvom/yuj-v1](https://huggingface.co/shuvom/yuj-v1)

<!-- description start -->
## Description

This repo contains GGUF format model files for [shuvom/yuj-v1](https://huggingface.co/shuvom/yuj-v1).


<!-- description end -->
<!-- README_GGUF.md-about-gguf start -->
### About GGUF

GGUF and its predecessor GGML are file formats for storing models for inference, especially language models like GPT (Generative Pre-trained Transformer). They allow you to run inference on consumer-grade GPUs and CPUs.

[More info.](https://github.com/ggerganov/llama.cpp)

## Provided files

| Name | Quant method | Bits | Size | Max RAM required | Use case |
| ---- | ---- | ---- | ---- | ---- | ----- |
| [yuj-v1.Q4_K_M.gguf](https://huggingface.co/shuvom/yuj-v1-GGUF/blob/main/yuj-v1.Q4_K_M.gguf) | Q4_K_M | 4 | 4.17 GB| 6.87 GB | medium, balanced quality - recommended |

## Usage

1. Install the llama.cpp Python client (`llama-cpp-python`) and `huggingface-hub`:
```shell
pip install llama-cpp-python huggingface-hub
```
2. Download the GGUF-formatted model (a Python alternative is sketched below):
```shell
huggingface-cli download shuvom/yuj-v1-GGUF yuj-v1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
```
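If you'd rather download from Python instead of the CLI, here is a minimal sketch using `huggingface_hub`'s `hf_hub_download` (the repo and filename are the ones listed above):

```python
from huggingface_hub import hf_hub_download

# Download the quantized model file into the current directory
model_path = hf_hub_download(
    repo_id="shuvom/yuj-v1-GGUF",
    filename="yuj-v1.Q4_K_M.gguf",
    local_dir=".",
)
print(model_path)  # local path to the downloaded .gguf file
```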
3. Load the model, setting `n_gpu_layers` to the number of layers to offload to the GPU (set it to 0 if no GPU acceleration is available on your system):
```python
from llama_cpp import Llama

llm = Llama(
  model_path="./yuj-v1.Q4_K_M.gguf",  # Download the model file first
  n_ctx=2048,  # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8,            # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=35         # The number of layers to offload to GPU, if you have GPU acceleration available
)
```
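Once loaded, the model can also be used for plain text completion. A minimal sketch (the prompt here is an arbitrary illustration, not from the model card):

```python
# Simple completion call; max_tokens caps the generated length
output = llm(
    "Translate to Hindi: Hello, how are you?",
    max_tokens=64,
    echo=False,  # do not repeat the prompt in the output
)
print(output["choices"][0]["text"])
```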
4. Use the chat completion API:
```python
llm = Llama(model_path="./yuj-v1.Q4_K_M.gguf", chat_format="llama-2")  # Set chat_format according to the model you are using
llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a story writing assistant."},
        {
            "role": "user",
            # Hindi: "yuj is one of the top bilingual models"
            "content": "युज शीर्ष द्विभाषी मॉडल में से एक है"
        }
    ]
)
```
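`create_chat_completion` returns an OpenAI-style response dict, so the assistant's reply can be pulled out like this (a small sketch; the prompt is an arbitrary example):

```python
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "नमस्ते"}]  # Hindi: "Hello"
)
# The generated message lives under choices[0]["message"]["content"]
print(response["choices"][0]["message"]["content"])
```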