	---
	language:
	- hi
	pipeline_tag: text-generation
	tags:
	- hindi
	- quantization
	- shuvom/yuj-v1
	license: apache-2.0
	quantized_by: shuvom
	---
	# yuj-v1-GGUF
	- Model creator: [shuvom_](https://huggingface.co/shuvom)
	- Original model: [shuvom/yuj-v1](https://huggingface.co/shuvom/yuj-v1)

	<!-- description start -->
	## Description

	This repo contains GGUF format model files for [shuvom/yuj-v1](https://huggingface.co/shuvom/yuj-v1).


	<!-- description end -->
	<!-- README_GGUF.md-about-gguf start -->
	### About GGUF

	GGUF is a file format for storing models for inference, used mainly for language models; it succeeds the older GGML format. It lets you run inference on consumer-grade GPUs and CPUs.

	For more information, see the [llama.cpp repository](https://github.com/ggerganov/llama.cpp).

	## Provided files

	| Name | Quant method | Bits | Size | Max RAM required | Use case |
	| ---- | ---- | ---- | ---- | ---- | ----- |
	| [yuj-v1.Q4_K_M.gguf](https://huggingface.co/shuvom/yuj-v1-GGUF/blob/main/yuj-v1.Q4_K_M.gguf) | Q4_K_M | 4 | 4.17 GB | 6.87 GB | medium, balanced quality - recommended |

	## Usage

	1. Install the llama.cpp Python client (`llama-cpp-python`) and `huggingface-hub`
	```python
	!pip install llama-cpp-python huggingface-hub
	```
	2. Download the GGUF model file
	```python
	!huggingface-cli download shuvom/yuj-v1-GGUF yuj-v1.Q4_K_M.gguf --local-dir . --local-dir-use-symlinks False
	```
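Outside a notebook, the same file can be fetched from Python. The following is a minimal sketch using `hf_hub_download` from `huggingface_hub`; the helper name `download_model` and the two constants are illustrative and not part of this repo.

```python
REPO_ID = "shuvom/yuj-v1-GGUF"
FILENAME = "yuj-v1.Q4_K_M.gguf"


def download_model(local_dir: str = ".") -> str:
    """Download the GGUF file into local_dir and return its local path."""
    # Imported lazily so this module loads even before huggingface_hub is installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=FILENAME, local_dir=local_dir)
```

Calling `download_model()` should place the file in the current directory, which matches the `model_path` used in the next step.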
	3. Load the model. Set `n_gpu_layers` to the number of layers to offload to the GPU, or to 0 if no GPU acceleration is available on your system.
	```python
	from llama_cpp import Llama

	llm = Llama(
	    model_path="./yuj-v1.Q4_K_M.gguf",  # Download the model file first
	    n_ctx=2048,  # Max sequence length; longer sequences require much more memory
	    n_threads=8,  # Number of CPU threads; tune to your system and the resulting performance
	    n_gpu_layers=35,  # Number of layers to offload to the GPU; use 0 without GPU acceleration
	)
	```
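Once the model is loaded, a plain (non-chat) completion can be run by calling the `Llama` object directly. This is a small sketch, assuming the OpenAI-style dictionary that `llama-cpp-python` returns; the helper name `complete` and the default `max_tokens` value are illustrative choices.

```python
def complete(llm, prompt: str, max_tokens: int = 128) -> str:
    """Run a plain text completion and return only the generated text."""
    # Calling a Llama instance returns a dict with a "choices" list.
    output = llm(prompt, max_tokens=max_tokens)
    return output["choices"][0]["text"]
```

For example, `print(complete(llm, "your prompt here"))` prints only the generated continuation instead of the full response dictionary.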
	4. Use the chat completion API
	```python
	from llama_cpp import Llama

	# Set chat_format according to the model you are using
	llm = Llama(model_path="./yuj-v1.Q4_K_M.gguf", chat_format="llama-2")
	llm.create_chat_completion(
	    messages=[
	        {"role": "system", "content": "You are a story writing assistant."},
	        {
	            "role": "user",
	            # Hindi for "Yuj is one of the top bilingual models"
	            "content": "युज शीर्ष द्विभाषी मॉडल में से एक है",
	        },
	    ]
	)
	```
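`create_chat_completion` returns an OpenAI-style dictionary rather than a string. The sketch below wraps the call and pulls out only the assistant's reply; the helper name `chat` is hypothetical, and the default system prompt simply reuses the one from the example above.

```python
def chat(llm, user_prompt: str,
         system_prompt: str = "You are a story writing assistant.") -> str:
    """Send one user message and return the assistant's reply text."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    response = llm.create_chat_completion(messages=messages)
    # The reply text lives under choices[0]["message"]["content"].
    return response["choices"][0]["message"]["content"]
```

With the `llm` object from the previous step, `print(chat(llm, "युज शीर्ष द्विभाषी मॉडल में से एक है"))` would print just the generated text.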