---
license: mit
base_model: CreitinGameplays/Llama-3.1-8b-reasoning-test
library_name: transformers
datasets:
- CreitinGameplays/reasoning-base-20k-llama3.1
tags:
- llama-cpp
- gguf-my-repo
pipeline_tag: text-generation
---

# CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF
This model was converted to GGUF format from [`CreitinGameplays/Llama-3.1-8b-reasoning-test`](https://huggingface.co/CreitinGameplays/Llama-3.1-8b-reasoning-test) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/CreitinGameplays/Llama-3.1-8b-reasoning-test) for more details on the model.

## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux):

```bash
brew install llama.cpp
```

Invoke the llama.cpp server or the CLI.

### CLI:
```bash
llama-cli --hf-repo CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF --hf-file llama-3.1-8b-reasoning-test-q4_k_m.gguf -p "The meaning to life and the universe is"
```

### Server:
```bash
llama-server --hf-repo CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF --hf-file llama-3.1-8b-reasoning-test-q4_k_m.gguf -c 2048
```

Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.
```
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag along with any other hardware-specific flags (for example, `LLAMA_CUDA=1` for NVIDIA GPUs on Linux).
```
cd llama.cpp && LLAMA_CURL=1 make
```

Step 3: Run inference through the main binary.
```
./llama-cli --hf-repo CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF --hf-file llama-3.1-8b-reasoning-test-q4_k_m.gguf -p "The meaning to life and the universe is"
```
or
```
./llama-server --hf-repo CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF --hf-file llama-3.1-8b-reasoning-test-q4_k_m.gguf -c 2048
```

-------------

### Run the model:
```python
from llama_cpp import Llama

# Load the model (using the full training context for inference)
llm = Llama.from_pretrained(
    repo_id="CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF",
    filename="*.gguf",
    verbose=False,
    n_gpu_layers=0,  # CPU-only; increase if using GPU
    n_batch=512,
    n_ctx=8192,
    n_ctx_per_seq=8192,
    f16_kv=True
)

# Set up the initial chat history with a system prompt.
chat_history = [
    {"role": "system", "content": """
You are a helpful assistant named Llama, made by Meta AI.
Always use your <|reasoning|> and <|end_reasoning|> tokens, without any text formatting, plain text only.
"""}
]

print("Enter 'quit' or 'exit' to stop the conversation.")

while True:
    # Prompt the user for input.
    user_input = input("\nUser: ")
    if user_input.lower() in ["quit", "exit"]:
        break

    # Append the new user message to the chat history.
    chat_history.append({"role": "user", "content": user_input})

    # Call the chat completion API in streaming mode with the updated conversation.
    output_stream = llm.create_chat_completion(
        messages=chat_history,
        temperature=0.4,
        top_p=0.95,
        max_tokens=4096,
        stream=True
    )

    collected_reply = ""
    last_finish_reason = None

    # Process each chunk as it arrives.
    print("Assistant: ", end="", flush=True)
    for chunk in output_stream:
        # Each chunk has a 'choices' list; we get the first choice's delta.
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            text = delta["content"]
            print(text, end="", flush=True)
            collected_reply += text
        if "finish_reason" in chunk["choices"][0]:
            last_finish_reason = chunk["choices"][0]["finish_reason"]

    # Add the assistant's reply to the conversation history.
    chat_history.append({"role": "assistant", "content": collected_reply})

    # Inform the user if generation stopped due to reaching the token limit.
    if last_finish_reason == "length":
        print("\n[Generation stopped: reached max_tokens. Consider increasing max_tokens or continuing the conversation.]")
```
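
The script above uses the `llama-cpp-python` bindings, which are separate from the llama.cpp binaries installed earlier. As a minimal sketch (assuming a standard CPU build is sufficient), they can be installed with pip:

```bash
pip install llama-cpp-python
```

For GPU offloading, `llama-cpp-python` generally needs to be built with the appropriate backend enabled (see its installation documentation for your platform), and `n_gpu_layers` in the script raised above 0.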