Model Card: distilroberta-base-rejection-v1

This model was originally developed and fine-tuned by Protect AI. It is a fine-tuned version of distilroberta-base, trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets.

The model detects when an LLM rejects a prompt, for example because the prompt does not pass content moderation. It classifies responses into two categories:

0: Normal output
1: Rejection detected

On the evaluation set, the model achieves:

Loss: 0.0544
Accuracy: 0.9887
Recall: 0.9810
Precision: 0.9279
F1 Score: 0.9537
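
The reported F1 score can be cross-checked against the precision and recall above, since F1 is their harmonic mean. The following is a minimal sketch of that arithmetic; the function name is illustrative and not part of the model's code.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation results listed above.
precision = 0.9279
recall = 0.9810

f1 = f1_from_precision_recall(precision, recall)
print(round(f1, 4))  # prints 0.9537, matching the reported F1 score
```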

Model Details

Developed & fine-tuned by: ProtectAI.com
Base model: distilroberta-base
Language(s): English
License: Apache 2.0
Task: Text classification (Rejection detection)

Intended Use & Limitations

The model is designed to identify rejection responses in LLM outputs, particularly where a refusal or safeguard message is generated.

Limitations:

Performance depends on the quality and domain of the training data.
May underperform on text styles or topics underrepresented in training.
Because it is based on distilroberta-base, the model is case-sensitive, so the same text in different casing may be classified differently.

Usage

With Hugging Face Transformers

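from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Load the tokenizer and the fine-tuned classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Build a text-classification pipeline; inputs longer than 512 tokens are truncated
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    # Run on the GPU when one is available, otherwise on the CPU
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# Returns a list with one dict per input, holding the predicted label and its score
print(classifier("Sorry, but I can't assist with that."))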
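
The pipeline returns one dict per input with `label` and `score` keys. A minimal sketch of turning that output into a yes/no decision follows. The label string `"REJECTION"`, the 0.5 threshold, and the helper names are assumptions for illustration; the actual label names should be checked in `model.config.id2label`. The sample results are made up to show the shape of the data and are not real model output.

```python
from typing import Dict, List, Union

Result = Dict[str, Union[str, float]]

# Assumption: the rejection class is named "REJECTION" in the model config.
# Verify against model.config.id2label before relying on this value.
REJECTION_LABEL = "REJECTION"


def is_rejection(result: Result, threshold: float = 0.5,
                 rejection_label: str = REJECTION_LABEL) -> bool:
    """True if the predicted label is the rejection class with enough confidence."""
    return result["label"] == rejection_label and float(result["score"]) >= threshold


def filter_rejections(texts: List[str], results: List[Result],
                      threshold: float = 0.5) -> List[str]:
    """Keep only the texts whose classification result is a rejection."""
    return [text for text, res in zip(texts, results) if is_rejection(res, threshold)]


# Hypothetical results shaped like pipeline output (scores are illustrative only).
sample_texts = ["Sorry, but I can't assist with that.", "Here is the summary you asked for."]
sample_results = [
    {"label": "REJECTION", "score": 0.99},
    {"label": "NORMAL", "score": 0.98},
]

flagged = filter_rejections(sample_texts, sample_results)
print(flagged)
```

With a real classifier, `sample_results` would be replaced by `classifier(sample_texts)`. Raising the threshold trades recall for precision on the rejection class.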
