Fine-tuned BERT for Toxicity Classification in Spanish

This is a fine-tuned BERT model that uses BETO as its base, a base-sized BERT model pre-trained specifically on Spanish. The model was fine-tuned on a gold-standard dataset for toxicity and incivility in Spanish, built from digital interactions around protest events.

The dataset comprises ~5M data points from three Latin American protest events: (a) protests against coronavirus measures and judicial reform in Argentina in August 2020; (b) protests against education budget cuts in Brazil in May 2019; and (c) the social outburst in Chile stemming from protests against the underground fare hike in October 2019. Since we focus on interactions in Spanish to elaborate a gold standard for digital interactions in this language, we prioritise Argentinian and Chilean data.

Labels: NONTOXIC and TOXIC.
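
A minimal sketch of turning the pipeline's `{label, score}` output into a boolean toxicity flag using the two labels above. The `is_toxic` helper and its default threshold are illustrative assumptions, not part of the model card's API:

```python
# Illustrative helper (not part of the model's API): map a pipeline
# prediction dict to a boolean flag based on the TOXIC label.
def is_toxic(prediction: dict, threshold: float = 0.5) -> bool:
    """Return True when the model assigns the TOXIC label with at
    least `threshold` confidence."""
    return prediction["label"] == "TOXIC" and prediction["score"] >= threshold

# The pipeline returns one dict like this per input text:
pred = {"label": "TOXIC", "score": 0.947}
print(is_toxic(pred))  # True
```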

Example of Classification

```python
# Pipeline as a high-level helper
from transformers import pipeline

toxic_classifier = pipeline("text-classification", model="bgonzalezbustamante/bert-spanish-toxicity")

# Non-toxic example ("Have an excellent day :)")
non_toxic = toxic_classifier("Que tengas un excelente día :)")

# Toxic example ("You are a damned wretch")
toxic = toxic_classifier("Eres un maldito infeliz")

# Print predictions
print(non_toxic)
print(toxic)
```

Output:

```
[{'label': 'NONTOXIC', 'score': 0.8723140358924866}]
[{'label': 'TOXIC', 'score': 0.9470418691635132}]
```

Validation Metrics

  • Accuracy: 0.835
  • Precision: 0.816
  • Recall: 0.886
  • F1-Score: 0.849
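
As a quick sanity check, the reported F1-score is the harmonic mean of the reported precision and recall; recomputing it from those two values reproduces the figure above (up to rounding):

```python
# Recompute F1 as the harmonic mean of the reported precision and recall.
precision, recall = 0.816, 0.886
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # close to the reported 0.849 (rounding aside)
```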
Model size: 110M parameters (Safetensors, F32 tensors).
