Model's bias against certain keywords
#1 opened by HelenGuo99
Hello,
I think the model is a bit biased against certain keywords, such as 'black' and 'white'. Examples like "I like black phones" and "It's white" gave me 'toxic' results, but I don't think they are toxic.
Hey Helen, yes, I think this is because the training data likely contains many comments marked as toxic that include the words "black" or "white", so the model may have learned that association. How to address this type of issue is quite a challenging question, and I am curious to hear how you or others think it could be handled!
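One way to make this kind of keyword bias visible is to probe the model with minimal pairs: the same sentence template with only the keyword swapped, so any score difference comes from the keyword alone. Here is a minimal sketch using the `transformers` text-classification pipeline; the model id `unitary/toxic-bert` is just a stand-in, not necessarily the checkpoint discussed in this thread.

```python
from transformers import pipeline

# NOTE: placeholder model id -- swap in the actual checkpoint under discussion.
clf = pipeline("text-classification", model="unitary/toxic-bert")

# Minimal pairs: the template stays fixed and only the keyword changes,
# so any difference in score is attributable to the keyword itself.
templates = ["I like {} phones.", "It's {}.", "The {} car is parked outside."]
keywords = ["black", "white", "red", "green"]

for template in templates:
    for word in keywords:
        text = template.format(word)
        pred = clf(text)[0]  # e.g. {'label': 'toxic', 'score': 0.87}
        print(f"{text!r} -> {pred['label']} ({pred['score']:.3f})")
```

If neutral sentences consistently score higher with "black"/"white" than with other color words, that points to a spurious keyword association rather than genuine toxicity. A common mitigation is counterfactual data augmentation, i.e. adding non-toxic training examples that contain those words.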
"Please kill yourself" returns 50/50 toxic/non-toxic result. Seems too sensitive to individual tokens rather than overall message, tone. :)
This sort of comment can be pretty common online...