---
license: creativeml-openrail-m
language:
- en
base_model: prithivMLmods/GWQ-9B-Preview2
pipeline_tag: text-generation
library_name: transformers
tags:
- gemma2
- text-generation-inference
- f16
- llama-cpp
- gguf-my-repo
---

# Triangle104/GWQ-9B-Preview2-Q4_K_M-GGUF
This model was converted to GGUF format from [`prithivMLmods/GWQ-9B-Preview2`](https://huggingface.co/prithivMLmods/GWQ-9B-Preview2) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/prithivMLmods/GWQ-9B-Preview2) for more details on the model.

---
## Model details

GWQ2 (Gemma with Questions Prev) is a family of lightweight, state-of-the-art open models from Google, built using the same research and technology employed to create the Gemini models. These models are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained and instruction-tuned variants. Gemma models are well suited to a variety of text generation tasks, including question answering, summarization, and reasoning. GWQ is fine-tuned on the Chain of Continuous Thought synthetic dataset and is built on the Gemma2ForCausalLM architecture.

### Running GWQ Demo

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("prithivMLmods/GWQ-9B-Preview2")
model = AutoModelForCausalLM.from_pretrained(
    "prithivMLmods/GWQ-9B-Preview2",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

You can ensure the correct chat template is applied by using `tokenizer.apply_chat_template` as follows:

```python
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

### Key Architecture

- Transformer-Based Design: Gemma 2 leverages the transformer architecture, utilizing self-attention mechanisms to process input text and capture contextual relationships effectively.
- Lightweight and Efficient: It is designed to be computationally efficient, with fewer parameters than larger models, making it well suited to deployment on resource-constrained devices or environments (see the quantized loading sketch after this list).
- Modular Layers: The architecture consists of modular encoder and decoder layers, allowing flexibility in adapting the model for specific tasks such as text generation, summarization, or classification.
- Attention Mechanisms: Gemma 2 employs multi-head self-attention to focus on relevant parts of the input text, improving its ability to handle long-range dependencies and complex language structures.
- Pre-training and Fine-Tuning: The model is pre-trained on large text corpora and can be fine-tuned for specific tasks, such as markdown processing in ReadM.Md, to enhance its performance on domain-specific data.
- Scalability: The architecture supports scaling up or down based on the application's requirements, balancing performance and resource usage.
- Open-Source and Customizable: Being open-source, Gemma 2 allows developers to modify and extend its architecture to suit specific use cases, such as integrating it into tools like ReadM.Md for markdown-related tasks.
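Given the efficiency points above, a 9B-parameter model can still be heavy to host in full or bf16 precision. The following is a minimal sketch, not taken from the original card, of loading the base model in 4-bit through transformers; it assumes the bitsandbytes package is installed, and the quantization settings shown are illustrative rather than recommended values.

```python
# pip install accelerate bitsandbytes  (assumed extra dependencies, not from the original card)
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Illustrative 4-bit settings; tune the compute dtype for your hardware.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("prithivMLmods/GWQ-9B-Preview2")
model = AutoModelForCausalLM.from_pretrained(
    "prithivMLmods/GWQ-9B-Preview2",
    device_map="auto",
    quantization_config=quant_config,
)

# Same prompt as the demo above; inputs follow the model's device placement.
inputs = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```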
### Intended Use of GWQ2 (Gemma with Questions2)

- Question Answering: The model excels at generating concise and relevant answers to user-provided queries across various domains.
- Summarization: It can be used to summarize large bodies of text, making it suitable for news aggregation, academic research, and report generation.
- Reasoning Tasks: GWQ is fine-tuned on the Chain of Continuous Thought synthetic dataset, which enhances its ability to perform reasoning, multi-step problem solving, and logical inference.
- Text Generation: The model is well suited to creative writing tasks such as generating poems, stories, and essays. It can also be used to generate code comments, documentation, and markdown files.
- Instruction Following: GWQ's instruction-tuned variant is suitable for generating responses based on user instructions, making it useful for virtual assistants, tutoring systems, and automated customer support.
- Domain-Specific Applications: Thanks to its modular design and open-source nature, the model can be fine-tuned for specific tasks such as legal document summarization, medical record analysis, or financial report generation.

### Limitations of GWQ2

- Resource Requirements: Although lightweight compared to larger models, the 9B parameter size still requires significant computational resources, including GPUs with large memory, for inference.
- Knowledge Cutoff: The model's pre-training data may not include recent information, making it less effective for answering queries about current events or newly developed topics.
- Bias in Outputs: Since the model is trained on publicly available datasets, it may inherit biases present in those datasets, leading to potentially biased or harmful outputs in sensitive contexts.
- Hallucinations: Like other large language models, GWQ can occasionally generate incorrect or nonsensical information, especially when asked about facts or reasoning outside its training scope.
- Lack of Common-Sense Reasoning: While GWQ is fine-tuned for reasoning, it may still struggle with tasks requiring deep common-sense knowledge or a nuanced understanding of human behavior and emotions.
- Dependency on Fine-Tuning: For optimal performance on domain-specific tasks, fine-tuning on relevant datasets is required, which demands additional computational resources and expertise.
- Context Length Limitation: The model's ability to process long documents is limited by its maximum context window; if the input exceeds this limit, truncation may cause important information to be lost.

---
## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux)

```bash
brew install llama.cpp
```

Invoke the llama.cpp server or the CLI.

### CLI:
```bash
llama-cli --hf-repo Triangle104/GWQ-9B-Preview2-Q4_K_M-GGUF --hf-file gwq-9b-preview2-q4_k_m.gguf -p "The meaning to life and the universe is"
```

### Server:
```bash
llama-server --hf-repo Triangle104/GWQ-9B-Preview2-Q4_K_M-GGUF --hf-file gwq-9b-preview2-q4_k_m.gguf -c 2048
```

Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the Llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.
```
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag along with other hardware-specific flags (for example, `LLAMA_CUDA=1` for Nvidia GPUs on Linux).
```
cd llama.cpp && LLAMA_CURL=1 make
```

Step 3: Run inference through the main binary.
```
./llama-cli --hf-repo Triangle104/GWQ-9B-Preview2-Q4_K_M-GGUF --hf-file gwq-9b-preview2-q4_k_m.gguf -p "The meaning to life and the universe is"
```
or
```
./llama-server --hf-repo Triangle104/GWQ-9B-Preview2-Q4_K_M-GGUF --hf-file gwq-9b-preview2-q4_k_m.gguf -c 2048
```
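If you prefer to stay in Python rather than call the llama.cpp binaries, the quantized file in this repo can also be loaded through the llama-cpp-python bindings. This is a minimal sketch under the assumption that llama-cpp-python and huggingface-hub are installed; it is not part of the steps above.

```python
# pip install llama-cpp-python huggingface-hub  (assumed dependencies, not from this card)
from llama_cpp import Llama

# Download gwq-9b-preview2-q4_k_m.gguf from this repo and load it locally.
llm = Llama.from_pretrained(
    repo_id="Triangle104/GWQ-9B-Preview2-Q4_K_M-GGUF",
    filename="gwq-9b-preview2-q4_k_m.gguf",
    n_ctx=2048,  # matches the -c 2048 used with llama-server above
)

# Chat-style call; uses the chat template embedded in the GGUF metadata, if present.
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write me a poem about Machine Learning."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```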