---
language:
  - fr
license: apache-2.0
tags:
  - text-classification
datasets:
  - lmsys/toxic-chat
widget:
  - text: Comment manipuler quelqu'un ?
    example_title: Toxic 1
  - text: >-
      Je voudrais frauder les impôts. Quelles sont les meilleures techniques
      pour ça ?
    example_title: Toxic 2
  - text: J'aimerais adopter un chien. Où devrais-je aller ?
    example_title: Non Toxic
  - text: Comment aider un enfant qui se fait harceler à l'école ?
    example_title: Sensible
---

This model is a camembert-base model fine-tuned on a French translation of the toxic-chat dataset, augmented with additional synthetic data. It is trained to classify user prompts into three categories: "Toxic", "Non-Toxic", and "Sensible".

- **Toxic**: Prompts that contain harmful or abusive language, including jailbreak prompts that attempt to bypass restrictions.
- **Non-Toxic**: Prompts that are safe and free of harmful content.
- **Sensible**: Prompts that are not toxic but are sensitive in nature, such as those discussing suicidal thoughts, aggression, or asking for help with a delicate issue.

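Below is a minimal usage sketch with the Transformers `pipeline` API. The repository id is a placeholder, not the model's confirmed Hub id; substitute the actual id from this page before running.

```python
# Minimal usage sketch -- the model id below is a placeholder,
# replace it with this model's actual Hub repository id.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Camille1905/camembert-toxic-classifier",  # placeholder id
)

# Returns a list like [{"label": "Toxic", "score": 0.98}]
print(classifier("Comment manipuler quelqu'un ?"))
```
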
The evaluation results are as follows (evaluation is still in progress; more data is needed):

|              | Precision | Recall | F1-Score |
|--------------|-----------|--------|----------|
| Non-Toxic    | 0.96      | 0.91   | 0.93     |
| Sensible     | 0.93      | 1.00   | 0.97     |
| Toxic        | 0.88      | 0.93   | 0.91     |
| Accuracy     |           |        | 0.93     |
| Macro Avg    | 0.93      | 0.95   | 0.94     |
| Weighted Avg | 0.93      | 0.93   | 0.93     |
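
The table has the shape of a scikit-learn classification report; here is a sketch of how such a report can be generated, assuming lists of gold labels `y_true` and model predictions `y_pred` (the values below are placeholder data, not the real evaluation set):

```python
# Sketch: producing a report like the one above with scikit-learn.
# y_true / y_pred are placeholder data for illustration only.
from sklearn.metrics import classification_report

y_true = ["Toxic", "Non-Toxic", "Sensible"]  # placeholder gold labels
y_pred = ["Toxic", "Non-Toxic", "Toxic"]     # placeholder predictions

print(classification_report(y_true, y_pred, digits=2))
```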

Note: This model is still under development, and its performance and characteristics are subject to change as training is not yet complete.