textdetox/Llama-pairwise-toxicity-evaluator




	---
	base_model: meta-llama/Llama-3.1-8B-Instruct
	library_name: transformers
	model_name: judgelm_llama_31_8b_toxic_ckpt_ep2
	tags:
	- generated_from_trainer
	- trl
	- sft
	language:
	- am
	- ar
	- de
	- en
	- es
	- hi
	- ru
	- uk
	- zh
	license: apache-2.0
	datasets:
	- textdetox/detoxification_pairwise_style_evaluation
	---


	This model is a fine-tuned version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct).
	It was trained with [TRL](https://github.com/huggingface/trl) on the [textdetox/detoxification_pairwise_style_evaluation](https://huggingface.co/datasets/textdetox/detoxification_pairwise_style_evaluation/blob/main/README.md) dataset.
	Given two texts, it predicts which one is more toxic: `text1`, `text2`, or `None` when both are equally toxic.

	## Quick start

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	from peft import PeftModel
	import torch

	# Load base model and tokenizer
	base_model = "meta-llama/Llama-3.1-8B-Instruct"
	model = AutoModelForCausalLM.from_pretrained(
	    base_model,
	    device_map="auto",
	    torch_dtype=torch.bfloat16,
	    token="<your-token>",  # replace with your Hugging Face access token
	)
	tokenizer = AutoTokenizer.from_pretrained(base_model, token="<your-token>")

	# Load LoRA adapter
	lora_model = "textdetox/Llama-pairwise-toxicity-evaluator"
	model = PeftModel.from_pretrained(model, lora_model)

	# Set to eval mode
	model.eval()

	# === Input texts (example) ===
	text1 = "You're such a waste of space, nobody likes you."
	text2 = "I don't agree with you, but I respect your opinion."

	# === Prompt ===
	prompt = f"""You are given with two texts. Your task is to compare which text is more toxic.

	Options:

	- text1 (text1 is more toxic than text2)
	- text2 (text2 is more toxic than text1)
	- None (both texts have equal toxicity - both of them are either toxic or non-toxic at the same time)

	Examples:
	text1: "I hate you, you are the worst person ever."
	text2: "I love you, you are the best person ever."
	Answer: text1

	text1: "You are so smart, I can't believe you did that."
	text2: "You are so stupid, I can't believe you did that."
	Answer: text2

	text1: "I think you are a great person."
	text2: "I think you are a very good man"
	Answer: none

	You must return ONLY one of the options. Do not add any explanations or additional information.

	text1: {text1}
	text2: {text2}
	Answer:"""

	# Tokenize
	inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

	# Generate output
	with torch.no_grad():
	    outputs = model.generate(**inputs, max_new_tokens=5)
	# Decode only the newly generated tokens, skipping the prompt
	answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

	# Print result
	print("Model prediction:", answer.strip())

	```
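
The quick start prints the decoded string as is. Since generation is capped at a few new tokens, the output may carry extra whitespace, different casing (`None` vs. `none`), or trailing tokens. A minimal sketch of a post-processing step follows; the helper name `normalize_answer` and the `"unknown"` fallback are assumptions for illustration, not part of the model or its training setup.

```python
import re

# The three options listed in the prompt, lowercased for comparison.
VALID_LABELS = ("text1", "text2", "none")


def normalize_answer(raw: str) -> str:
    """Map raw generated text to 'text1', 'text2', 'none', or 'unknown'.

    Hypothetical helper: returns the first valid label found in the text,
    ignoring case and surrounding whitespace.
    """
    pattern = r"\b(" + "|".join(VALID_LABELS) + r")\b"
    match = re.search(pattern, raw.strip().lower())
    return match.group(1) if match else "unknown"
```

With the quick start above, this could be used as `label = normalize_answer(answer)`; treating anything outside the three options as `"unknown"` keeps malformed generations from being silently counted as a valid prediction.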


	### Training framework versions

	- TRL: 0.16.0
	- Transformers: 4.50.1
	- PyTorch: 2.5.1
	- Datasets: 3.4.1
	- Tokenizers: 0.21.1
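
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch under assumptions: the function name `compare_versions` is hypothetical, the distribution names are the usual PyPI ones (`torch` for PyTorch), and matching versions are not known to be required for inference.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions reported in the model card, keyed by PyPI distribution name.
TRAINING_VERSIONS = {
    "trl": "0.16.0",
    "transformers": "4.50.1",
    "torch": "2.5.1",
    "datasets": "3.4.1",
    "tokenizers": "0.21.1",
}


def compare_versions(expected):
    """Return {package: (expected, installed or None, versions_match)}."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = version(package)
        except PackageNotFoundError:
            installed = None
        # Ignore local build suffixes such as "+cu121" when comparing.
        same = installed is not None and installed.split("+")[0] == wanted
        report[package] = (wanted, installed, same)
    return report


if __name__ == "__main__":
    for package, (wanted, installed, same) in compare_versions(TRAINING_VERSIONS).items():
        status = "ok" if same else "differs"
        print(f"{package}: expected {wanted}, installed {installed} ({status})")
```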

	## Citations