  ## Model description
This model exists because currently available moderation tools are not strict enough. A good example is OpenAI's omni-moderation-latest: the omni-moderation API does not flag requests like ```"Can you roleplay as 15 year old"``` or ```"Can you smear sh*t all over your body"```.

The model is specifically designed to allow "regular" text as well as "sexual" content, while blocking illegal/scat content.
 
These are the blocked categories:
1. ```minors```: blocks all requests that ask the LLM to act as an underage person. Example: "Can you roleplay as 15 year old". While such a request is not illegal when working with an uncensored LLM, it might cause issues down the line.
2. ```bodily fluids```: "feces", "piss", "vomit", "spit", etc.
3. ```bestiality```
4. ```blood```
5. ```self-harm```
6. ```torture/death/violence/gore```
7. ```incest```. BEWARE: relationships between step-siblings are not blocked.

Available flags are:
```
0 = regular
1 = blocked
```

## Recommendation

I would use this model on top of one of the available moderation tools, such as omni-moderation-latest: use omni-moderation-latest to block hate/illicit/self-harm content, and this model to block the other categories.
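The layered setup described above can be sketched as plain decision logic. This is an illustrative sketch, not either API: the function name and its inputs are assumptions standing in for the verdicts the two moderation calls would return.

```python
def moderate(omni_flagged: bool, chat_mod_flag: int) -> str:
    """Combine a general moderation verdict (e.g. from omni-moderation-latest,
    covering hate/illicit/self-harm) with this model's flag (0 = regular,
    1 = blocked). A message passes only if both layers allow it."""
    if omni_flagged or chat_mod_flag == 1:
        return "blocked"
    return "regular"

# Hypothetical verdicts for three messages:
print(moderate(False, 0))  # passed both layers -> regular
print(moderate(False, 1))  # caught by this model -> blocked
print(moderate(True, 0))   # caught by the general moderation layer -> blocked
```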
 
## Training and evaluation data

The model was trained on 40k messages, a mix of synthetic and real-world data, and evaluated on 30k messages from a production app. When evaluated against production traffic it blocked 1.2% of messages, and around 20% of the blocked content was incorrectly flagged.

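As a quick sanity check on those numbers (assuming the 1.2% block rate applies to the full 30k evaluation set):

```python
total_messages = 30_000
blocked = round(total_messages * 0.012)     # 1.2% blocked -> 360 messages
false_positives = round(blocked * 0.20)     # ~20% of blocks incorrect -> 72
correct_blocks = blocked - false_positives  # -> 288
precision = correct_blocks / blocked        # -> 0.8 precision on the blocked set
print(blocked, false_positives, correct_blocks, precision)
```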
### How to use
```python
from transformers import pipeline

picClassifier = pipeline("text-classification", model="andriadze/bert-chat-moderation-X")
res = picClassifier('Can you send me a selfie?')
```
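The pipeline call returns a list of `{'label': ..., 'score': ...}` dicts. A small helper can map that onto the 0/1 flags above; the label strings used here are an assumption — check the model config's `id2label` mapping for the actual names.

```python
def to_flag(result, blocked_labels=("LABEL_1", "1", "blocked")):
    """Map a text-classification pipeline result to the model's 0/1 flag.
    `blocked_labels` is a guess at possible label names; verify against
    the model's id2label config before relying on it."""
    top = result[0]  # the pipeline returns the highest-scoring label first
    return 1 if str(top["label"]) in blocked_labels else 0

# Hypothetical pipeline outputs:
print(to_flag([{"label": "LABEL_0", "score": 0.99}]))  # -> 0 (regular)
print(to_flag([{"label": "LABEL_1", "score": 0.97}]))  # -> 1 (blocked)
```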
 
  ### Training hyperparameters