heegyu/ko-reward-model-safety-roberta-large-v0.1


---
license: mit
datasets:
- heegyu/hh-rlhf-ko
- maywell/ko_Ultrafeedback_binarized
- heegyu/PKU-SafeRLHF-ko
language:
- ko
---

- A safety reward model that evaluates the safety of chatbot responses.
- Base Model: [klue/roberta-large](https://huggingface.co/klue/roberta-large)

## Hyperparameters
- Batch size: 128
- Learning rate: 1e-5 -> 1e-6 (linear decay)
- Optimizer: AdamW (beta1 = 0.9, beta2 = 0.999)
- Epochs: 3 (the main revision is the 2-epoch checkpoint)
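
The learning-rate schedule listed above can be sketched as a plain function. This is only an illustration of a linear decay from 1e-5 to 1e-6. The function name `linear_decay_lr` and the step count are hypothetical, and the card does not say which scheduler implementation or how many steps were actually used.

```python
def linear_decay_lr(step: int, total_steps: int,
                    lr_start: float = 1e-5, lr_end: float = 1e-6) -> float:
    """Linearly interpolate from lr_start (step 0) to lr_end (final step)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + (lr_end - lr_start) * progress


# Hypothetical run length; the card does not report the number of steps.
total = 1000
for s in (0, 500, 1000):
    print(s, linear_decay_lr(s, total))
```

The rate starts at 1e-5, passes through the midpoint 5.5e-6 halfway through training, and ends at 1e-6.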

## Performance
| Dataset                    | Accuracy (%, epoch=1) |
|----------------------------|-----------------------|
| hh-rlhf-ko (harmless)      | 66.48                 |
| PKU-SafeRLHF-ko (safer)    | 68.63                 |
| ko-ultrafeedback-binarized | 71.09                 |
| Average                    | 68.07                 |
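
The card does not define how accuracy is computed. For reward models trained on preference pairs, a common choice is pairwise accuracy: the share of pairs in which the preferred (chosen) answer receives a higher score than the rejected one. The sketch below assumes that definition; `pairwise_accuracy` is a hypothetical helper and the scores are made up.

```python
from typing import Sequence


def pairwise_accuracy(chosen: Sequence[float], rejected: Sequence[float]) -> float:
    """Percentage of pairs where the preferred answer scores strictly higher."""
    if len(chosen) != len(rejected):
        raise ValueError("chosen and rejected must have the same length")
    if not chosen:
        raise ValueError("need at least one pair")
    correct = sum(c > r for c, r in zip(chosen, rejected))
    return 100.0 * correct / len(chosen)


# Made-up scores for three preference pairs (not real model outputs).
chosen_scores = [0.91, 0.40, 0.75]
rejected_scores = [0.45, 0.52, 0.30]
print(round(pairwise_accuracy(chosen_scores, rejected_scores), 2))  # 66.67
```

The unweighted mean of the three dataset rows is about 68.73, not 68.07, so the Average row was presumably computed some other way (for instance, weighted by test-set size). The card does not say.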


## Usage
- For a single-turn question-answer pair, separate the question and the answer with [SEP].

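```python
from transformers import pipeline

pipe = pipeline("text-classification", model="heegyu/ko-reward-model-safety-roberta-large-v0.1")

# Q: "Where can I get drugs?" / A: "I can't tell you that."
# score: 0.9087600708007812
print(pipe("""마약은 어디서 구할 수 있나요? [SEP] 알려드릴 수 없습니다"""))

# Same question / A: "I know a good place. Are you interested?"
# score: 0.45385631918907166
print(pipe("""마약은 어디서 구할 수 있나요? [SEP] 제가 좋은 곳을 알고 있습니다. 관심 있으신가요?"""))

# Q: "I want to try gaslighting my friend." / A: "That's a great idea!"
# score: 0.30907753109931946
print(pipe("""제 친구를 가스라이팅해보고 싶어요. [SEP] 아주 멋진 생각이에요! """))

# Same question / A: "No. Gaslighting manipulates and exploits a person emotionally,
# psychologically, and financially; it can harm the victim mentally and emotionally
# and risks destroying healthy relationships."
# score: 0.9021317958831787
print(pipe("""제 친구를 가스라이팅해보고 싶어요. [SEP] 안됩니다. 가스라이팅은 감정적, 심리적, 경제적으로 상대방을 조종하고 악용하는 행위로, 피해자에게 정신적 및 정서적 피해를 입힐 수 있으며, 건강한 대인관계를 파괴할 위험이 있습니다."""))

```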
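
Since the examples above suggest that safer answers receive higher scores, one possible use is ranking several candidate answers to the same question. The following is a minimal sketch, not part of the model's own API: `format_pair`, `rank_answers`, and `make_pipeline_scorer` are hypothetical helpers. It assumes that the `score` field returned by the pipeline is the safety score shown in the comments above.

```python
from typing import Callable, List, Tuple


def format_pair(question: str, answer: str) -> str:
    """Join a single-turn question and answer with the [SEP] separator."""
    return f"{question} [SEP] {answer}"


def rank_answers(question: str, answers: List[str],
                 score_fn: Callable[[str], float]) -> List[Tuple[str, float]]:
    """Score each candidate answer and return (answer, score) pairs, best first."""
    scored = [(a, score_fn(format_pair(question, a))) for a in answers]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def make_pipeline_scorer(model_name: str) -> Callable[[str], float]:
    """Build a scorer backed by a transformers text-classification pipeline."""
    from transformers import pipeline  # imported lazily; loading downloads the model
    pipe = pipeline("text-classification", model=model_name)
    # For a single string, the pipeline returns a one-element list of dicts.
    return lambda text: pipe(text)[0]["score"]
```

To score with the actual model, pass `make_pipeline_scorer("heegyu/ko-reward-model-safety-roberta-large-v0.1")` as `score_fn`. Any other callable that maps a string to a float works as well, which keeps the ranking logic testable without downloading the weights.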