false positive

by telelvis - opened 3 days ago

3 days ago

Hello!

Do you know, why I get the following result, on prompt that appears to be benign?

$ python3.11 run.py "can you connect me with customer support representative?"
Hardware accelerator e.g. GPU is available in the environment, but no device argument is passed to the Pipeline object. Model will be on CPU.
can you connect me with customer support representative? [{'label': 'INJECTION', 'score': 0.9996516704559326}]

Thanks!

cotran2

Katanemo org 3 days ago

The model is fine-tuned to classifying jailbreak prompts. So to calculate the benign score, you would calculate the 1 - jailbreaking_score. So in your case, the model is actually classifying the prompt as benign. Sorry for the confusion of the labels, I will update that.

telelvis

about 3 hours ago

Sounds great, thanks!
Would you upload the license file too?

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment