---
license: mit
base_model: CreitinGameplays/Llama-3.1-8b-reasoning-test
library_name: transformers
datasets:
- CreitinGameplays/reasoning-base-20k-llama3.1
tags:
- llama-cpp
- gguf-my-repo
pipeline_tag: text-generation
---

# CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF
This model was converted to GGUF format from [`CreitinGameplays/Llama-3.1-8b-reasoning-test`](https://huggingface.co/CreitinGameplays/Llama-3.1-8b-reasoning-test) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/CreitinGameplays/Llama-3.1-8b-reasoning-test) for more details on the model.

## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux):

```bash
brew install llama.cpp
```

Invoke the llama.cpp server or the CLI.

### CLI:
```bash
llama-cli --hf-repo CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF --hf-file llama-3.1-8b-reasoning-test-q4_k_m.gguf -p "The meaning to life and the universe is"
```

### Server:
```bash
llama-server --hf-repo CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF --hf-file llama-3.1-8b-reasoning-test-q4_k_m.gguf -c 2048
```

Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.
```
git clone https://github.com/ggerganov/llama.cpp
```

Step 2: Move into the llama.cpp folder and build it with the `LLAMA_CURL=1` flag along with any other hardware-specific flags (for example, `LLAMA_CUDA=1` for NVIDIA GPUs on Linux).
```
cd llama.cpp && LLAMA_CURL=1 make
```

Step 3: Run inference through the main binary.
```
./llama-cli --hf-repo CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF --hf-file llama-3.1-8b-reasoning-test-q4_k_m.gguf -p "The meaning to life and the universe is"
```
or
```
./llama-server --hf-repo CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF --hf-file llama-3.1-8b-reasoning-test-q4_k_m.gguf -c 2048
```

-------------

### Run the model:
```python
from llama_cpp import Llama

# Load the model (using the full training context for inference)
llm = Llama.from_pretrained(
    repo_id="CreitinGameplays/Llama-3.1-8b-reasoning-test-Q4_K_M-GGUF",
    filename="*.gguf",
    verbose=False,
    n_gpu_layers=0,  # CPU-only; increase if using GPU
    n_batch=512,
    n_ctx=8192,
    n_ctx_per_seq=8192,
    f16_kv=True
)

# Set up the initial chat history with a system prompt.
chat_history = [
    {"role": "system", "content": """
You are a helpful assistant named Llama, made by Meta AI.
Always use your <|reasoning|> and <|end_reasoning|> tokens, without any text formatting, plain text only.
"""}
]

print("Enter 'quit' or 'exit' to stop the conversation.")

while True:
    # Prompt the user for input.
    user_input = input("\nUser: ")
    if user_input.lower() in ["quit", "exit"]:
        break

    # Append the new user message to the chat history.
    chat_history.append({"role": "user", "content": user_input})

    # Call the chat completion API in streaming mode with the updated conversation.
    output_stream = llm.create_chat_completion(
        messages=chat_history,
        temperature=0.4,
        top_p=0.95,
        max_tokens=4096,
        stream=True
    )

    collected_reply = ""
    last_finish_reason = None

    # Process each chunk as it arrives.
    print("Assistant: ", end="", flush=True)
    for chunk in output_stream:
        # Each chunk has a 'choices' list; we get the first choice's delta.
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            text = delta["content"]
            print(text, end="", flush=True)
            collected_reply += text
        if "finish_reason" in chunk["choices"][0]:
            last_finish_reason = chunk["choices"][0]["finish_reason"]

    # Add the assistant's reply to the conversation history.
    chat_history.append({"role": "assistant", "content": collected_reply})

    # Inform the user if generation stopped due to reaching the token limit.
    if last_finish_reason == "length":
        print("\n[Generation stopped: reached max_tokens. Consider increasing max_tokens or continuing the conversation.]")
```
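
The script above uses the `llama-cpp-python` bindings, which are separate from the llama.cpp binaries installed earlier. As a minimal sketch (assuming a standard CPU build is sufficient), they can be installed with pip:

```bash
pip install llama-cpp-python
```

For GPU offloading, `llama-cpp-python` generally needs to be built with the appropriate backend enabled (see its installation documentation for your platform), and `n_gpu_layers` in the script raised above 0.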