false positive

#1
by telelvis - opened

Hello!

Do you know why I get the following result on a prompt that appears to be benign?

$ python3.11 run.py "can you connect me with customer support representative?"
Hardware accelerator e.g. GPU is available in the environment, but no device argument is passed to the Pipeline object. Model will be on CPU.
can you connect me with customer support representative? [{'label': 'INJECTION', 'score': 0.9996516704559326}]

Thanks!

Katanemo org

The model is fine-tuned to classify jailbreak prompts. To get the benign score, compute 1 - jailbreaking_score. So in your case, the model is actually classifying the prompt as benign. Sorry for the confusing labels, I will update them.
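
For anyone reading along, here is a minimal sketch of the complement calculation described in this reply. The helper name `benign_score` is made up for illustration and is not part of the model's code. It takes the jailbreak score directly, because this thread does not settle which label in the pipeline output carries that score.

```python
def benign_score(jailbreak_score: float) -> float:
    """Return 1 - jailbreak_score, i.e. the complement of the jailbreak probability.

    Hypothetical helper: the caller decides which value from the pipeline
    output is the jailbreak score.
    """
    if not 0.0 <= jailbreak_score <= 1.0:
        raise ValueError("score must be a probability in [0, 1]")
    return 1.0 - jailbreak_score


# Score copied from the output in the original post.
reported = 0.9996516704559326
print(f"complement of reported score: {benign_score(reported):.6f}")
```

Whether the reported score should be read as the jailbreak score or as its complement depends on the label mapping, which the reply says will be updated.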

Sounds great, thanks!
Would you upload the license file too?
