Model Card: distilroberta-base-rejection-v1
This model was originally developed and fine-tuned by Protect AI. It is a fine-tuned version of distilroberta-base, trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets.
The goal of this model is to detect LLM rejections when a prompt does not pass content moderation. It classifies responses into two categories:
- 0: Normal output
- 1: Rejection detected
On the evaluation set, the model achieves:
- Loss: 0.0544
- Accuracy: 0.9887
- Recall: 0.9810
- Precision: 0.9279
- F1 Score: 0.9537
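As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 score can be recomputed from the other two figures. A minimal sketch in plain Python (the variable names are illustrative, not part of the model's tooling):

```python
# Values copied from the evaluation results listed above.
precision = 0.9279
recall = 0.9810

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"{f1:.4f}")  # 0.9537
```

The result matches the reported F1 score to four decimal places. Precision is lower than recall, so on this evaluation set the model's errors are more often normal outputs flagged as rejections than rejections that were missed.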
Model Details
- Developed & fine-tuned by: ProtectAI.com
- Base model: distilroberta-base
- Language(s): English
- License: Apache 2.0
- Task: Text classification (Rejection detection)
Intended Use & Limitations
The model is designed to identify rejection responses in LLM outputs, particularly where a refusal or safeguard message is generated.
Limitations:
- Performance depends on the quality and domain of the training data.
- May underperform on text styles or topics underrepresented in training.
- Being based on distilroberta-base, it is case-sensitive.
Usage
With Hugging Face Transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Build a text-classification pipeline; inputs longer than 512 tokens are truncated.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

print(classifier("Sorry, but I can't assist with that."))
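The pipeline returns a list with one dictionary per input, each holding a label and a score. A minimal sketch of turning that output into a yes/no decision follows. The helper is_rejection, the 0.5 threshold, and the label strings REJECTION and LABEL_1 are assumptions made for illustration; check the model's id2label config for the names it actually emits.

```python
from typing import Any, Iterable, Mapping

# Assumed label names for class 1; verify against model.config.id2label.
REJECTION_LABELS = frozenset({"REJECTION", "LABEL_1"})


def is_rejection(
    results: Iterable[Mapping[str, Any]],
    threshold: float = 0.5,
    rejection_labels: frozenset = REJECTION_LABELS,
) -> bool:
    """Return True if any result carries a rejection label scored at or above threshold."""
    return any(
        r["label"] in rejection_labels and float(r["score"]) >= threshold
        for r in results
    )
```

With the classifier defined above, is_rejection(classifier("Sorry, but I can't assist with that.")) gives a boolean that can gate downstream handling. The threshold is a tunable choice; raising it trades recall for precision.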