Model's bias against certain keywords
#1 opened by HelenGuo99
Hello,
I think the model is a bit biased against certain keywords, such as 'black' and 'white'. Examples like "I like black phones" and "It's white" gave me 'toxic' results, but I don't think they are toxic.
Hey Helen, yes, I think this is because the training data likely contains many comments marked as toxic that include the words "black" or "white", so the model may have learned that association. How to address this type of issue is quite a challenging question, and I am curious to hear how you or others think it could be handled!
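One way to make this kind of keyword bias visible is to probe the model with minimal pairs: the same sentence template with only the keyword swapped, so any score difference comes from the keyword alone. Here is a minimal sketch using the `transformers` text-classification pipeline; the model id `unitary/toxic-bert` is just a stand-in, not necessarily the checkpoint discussed in this thread.

```python
from transformers import pipeline

# NOTE: placeholder model id -- swap in the actual checkpoint under discussion.
clf = pipeline("text-classification", model="unitary/toxic-bert")

# Minimal pairs: the template stays fixed and only the keyword changes,
# so any difference in score is attributable to the keyword itself.
templates = ["I like {} phones.", "It's {}.", "The {} car is parked outside."]
keywords = ["black", "white", "red", "green"]

for template in templates:
    for word in keywords:
        text = template.format(word)
        pred = clf(text)[0]  # e.g. {'label': 'toxic', 'score': 0.87}
        print(f"{text!r} -> {pred['label']} ({pred['score']:.3f})")
```

If neutral sentences consistently score higher with "black"/"white" than with other color words, that points to a spurious keyword association rather than genuine toxicity. A common mitigation is counterfactual data augmentation, i.e. adding non-toxic training examples that contain those words.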
"Please kill yourself" returns 50/50 toxic/non-toxic result. Seems too sensitive to individual tokens rather than overall message, tone. :)
This sort of comment can be pretty common online...