---
license: mit
datasets:
- CreitinGameplays/DeepSeek-R1-Distill-Qwen-32B_NUMINA_train_amc_aime-llama3.1
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
---

# Llama 3.1 8B R1 Experimental

Chat template format:

```
<|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant named Llama, made by Meta AI.
You are focused on providing systematic, well-reasoned responses. Response Structure: - Format: <think>{{reasoning}}</think>{{answer}} - Reasoning: Minimum 6 logical steps only when it required in <think> block - Process: Think first, then answer.<|eot_id|><|start_header_id|>user<|end_header_id|>

How many r's are in strawberry?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
<think>
```
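
If you prefer not to assemble the special tokens by hand, the tokenizer's chat-template machinery can build an equivalent prompt. The sketch below is a convenience under some assumptions, not the card's official method: it assumes the repository keeps the standard Llama 3.1 chat template, appends the opening `<think>` tag manually (the stock template does not emit it), and writes the placeholders with single braces, which is what the f-string in the script further below actually sends.

```python
from transformers import AutoTokenizer

model_id = "CreitinGameplays/Llama-3.1-8B-R1-experimental"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# System prompt reproduced from the template above (single-brace placeholders).
system_prompt = (
    "You are a helpful AI assistant named Llama, made by Meta AI.\n"
    "You are focused on providing systematic, well-reasoned responses. "
    "Response Structure: - Format: <think>{reasoning}</think>{answer} "
    "- Reasoning: Minimum 6 logical steps only when it required in <think> block "
    "- Process: Think first, then answer."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "How many r's are in strawberry?"},
]

# add_generation_prompt=True appends the assistant header; the opening <think>
# tag is added by hand because the standard Llama 3.1 template does not add it.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\n"
print(prompt)
```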

Run this model:

```python
# test the model
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer


def main():
    model_id = "CreitinGameplays/Llama-3.1-8B-R1-experimental"

    # Load the tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(model_id, add_eos_token=True)

    # Load the model using bitsandbytes 8-bit quantization if CUDA is available.
    if torch.cuda.is_available():
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            load_in_8bit=True,
            device_map="auto"
        )
        device = torch.device("cuda")
    else:
        model = AutoModelForCausalLM.from_pretrained(model_id)
        device = torch.device("cpu")

    # Define the generation parameters.
    generation_kwargs = {
        "max_new_tokens": 2048,
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 1.0,
        "repetition_penalty": 1.08,
        "num_return_sequences": 1,
        "forced_eos_token_id": tokenizer.eos_token_id,
        "pad_token_id": tokenizer.eos_token_id
    }

    print("Enter your prompt (type 'exit' to quit):")
    while True:
        # Get user input.
        user_input = input("Input> ")
        if user_input.lower().strip() in ("exit", "quit"):
            break

        # Construct the prompt in the expected chat format.
        prompt = f"""
<|start_header_id|>system<|end_header_id|>

You are a helpful AI assistant named Llama, made by Meta AI.
You are focused on providing systematic, well-reasoned responses. Response Structure: - Format: <think>{{reasoning}}</think>{{answer}} - Reasoning: Minimum 6 logical steps only when it required in <think> block - Process: Think first, then answer.<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_input}<|eot_id|><|start_header_id|>assistant<|end_header_id|>
<think>
"""

        # Tokenize the prompt and send it to the selected device.
        input_ids = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=True).to(device)

        # Create a new TextStreamer instance for streaming responses.
        streamer = TextStreamer(tokenizer)
        generation_kwargs["streamer"] = streamer

        print("\nAssistant Response:")
        # Generate the text (tokens will stream to stdout via the streamer).
        outputs = model.generate(input_ids, **generation_kwargs)


if __name__ == "__main__":
    main()
```
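
Because the prompt already opens a `<think>` block, the closing tag is a convenient split point between the reasoning and the answer. A minimal post-processing sketch, continuing from the `outputs`, `input_ids`, and `tokenizer` objects in the script above (it assumes the model actually emits a closing `</think>` tag):

```python
# Continuing from the script above: recover the generated text
# (the streamer has already printed it token by token).
generated = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)

# Everything before the closing tag is the reasoning,
# everything after it is the final answer.
reasoning, _, answer = generated.partition("</think>")
print("Reasoning:\n", reasoning.strip())
print("Answer:\n", answer.strip())
```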

Alternatively, run the model through the `pipeline` API:

```python
import torch
from transformers import pipeline

model_id = "CreitinGameplays/Llama-3.1-8B-R1-experimental"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [{"role": "user", "content": "hello!"}]

outputs = pipe(
    messages,
    do_sample=True,  # sampling must be enabled for temperature to take effect
    temperature=0.6,
    repetition_penalty=1.08,
    max_new_tokens=2048
)

print(outputs[0]["generated_text"][-1])
```
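
When the pipeline is given a list of chat messages, `generated_text` comes back as the full conversation, so the last element is the assistant turn as a `{"role": ..., "content": ...}` dict. A small follow-up sketch for printing only the final answer (again assuming the reply closes its `</think>` block):

```python
# Continuing from the pipeline call above: the last turn is the assistant reply.
reply = outputs[0]["generated_text"][-1]["content"]

# Drop the <think>...</think> reasoning, if present, and keep only the answer.
answer = reply.split("</think>", 1)[-1].strip()
print(answer)
```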