IndoBERTweet-SexuallyExplicit

Model Description

IndoBERTweet fine-tuned on the IndoToxic2024 dataset for sexually explicit content classification. It reaches an accuracy of 0.91 and a macro-F1 of 0.80, both obtained through stratified 10-fold cross-validation.
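For readers unfamiliar with the evaluation terms, the sketch below shows how accuracy, macro-F1, and a stratified fold split can be computed. It is illustrative only: the labels are made up (not taken from IndoToxic2024), the function names are hypothetical, and the authors' actual evaluation code is not shown in this card. Macro-F1 averages the per-class F1 scores with equal weight, so it can differ from accuracy when classes are unevenly represented.

```python
from collections import defaultdict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def stratified_folds(labels, k):
    """Assign indices to k folds so each fold keeps roughly the overall class ratio.

    Simplified: no shuffling, indices of each class are dealt out round-robin.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Made-up toy labels: 8 negatives and 2 positives, one positive missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(accuracy(y_true, y_pred))            # 0.9
print(round(macro_f1(y_true, y_pred), 3))  # 0.804
print(stratified_folds(y_true, 2))         # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used for the split. The hand-written version here only makes the idea visible.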

Supported Tokenizer

  • indolem/indobertweet-base-uncased

Example Code

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Specify the model and tokenizer name
model_name = "Exqrch/IndoBERTweet-SexuallyExplicit"
tokenizer_name = "indolem/indobertweet-base-uncased"

# Load the pre-trained model
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

text = "selamat pagi semua!"

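# Tokenize the text and run inference without tracking gradients
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    output = model(**inputs)
logits = output.logits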

# Get the predicted class label
predicted_class = torch.argmax(logits, dim=-1).item()

print(predicted_class)
--- Output ---
> 0
--- End of Output ---
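The example above prints only the arg-max class. If a confidence score is also wanted, the logits can be passed through a softmax. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the label meanings (0 = not sexually explicit, 1 = sexually explicit) are inferred from the sample predictions in this card rather than documented by the authors.

```python
import math

# Assumed label meanings, inferred from the sample predictions in this card;
# the card itself does not document them.
ASSUMED_LABELS = {0: "not sexually explicit", 1: "sexually explicit"}


def softmax(logits):
    """Convert a list of raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def describe(logits):
    """Return (class id, assumed label name, probability) for the top class."""
    probs = softmax(logits)
    class_id = max(range(len(probs)), key=probs.__getitem__)
    return class_id, ASSUMED_LABELS.get(class_id, "unknown"), probs[class_id]


def predict_with_confidence(text, model, tokenizer):
    """Run the model on one text and describe its top prediction."""
    import torch  # imported lazily so the helpers above stay framework-free

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # logits has shape (1, num_labels); take the single row as a plain list
    return describe(logits[0].tolist())
```

With `model` and `tokenizer` loaded as in the example code, `predict_with_confidence("selamat pagi semua!", model, tokenizer)` returns a tuple of class id, assumed label name, and probability. Softmax scores from a fine-tuned classifier are not necessarily well calibrated, so treat them as a rough indication only.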

Limitations

The model was trained only on Indonesian text. Its performance on code-switched text has not been evaluated.

Sample Output

Model name: Exqrch/IndoBERTweet-SexuallyExplicit
Text 1: billiard engak ntar bro?
Prediction: 0
Text 2: eh kerumah ku yok main bareng di ranjang
Prediction: 1
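Output in the format above can be produced for several texts at once. The sketch below is one possible way to do it, not the authors' script: `predict_batch` and `format_report` are hypothetical helper names, and `model` and `tokenizer` are assumed to be loaded as in the example code.

```python
def predict_batch(texts, model, tokenizer):
    """Return one predicted class id per input text."""
    import torch  # imported lazily; only needed when the model is actually run

    # padding=True pads shorter texts so the batch forms a rectangular tensor
    inputs = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.argmax(logits, dim=-1).tolist()


def format_report(model_name, texts, predictions):
    """Render predictions in the 'Text N / Prediction' layout used in this card."""
    lines = [f"Model name: {model_name}"]
    for number, (text, prediction) in enumerate(zip(texts, predictions), start=1):
        lines.append(f"Text {number}: {text}")
        lines.append(f"Prediction: {prediction}")
    return "\n".join(lines)
```

Typical use would be `print(format_report(model_name, texts, predict_batch(texts, model, tokenizer)))`, where `texts` is a list of strings.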

Citation

If you use this model, please cite:

@article{susanto2024indotoxic2024,
      title={IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language}, 
      author={Lucky Susanto and Musa Izzanardi Wijanarko and Prasetia Anugrah Pratama and Traci Hong and Ika Idris and Alham Fikri Aji and Derry Wijaya},
      year={2024},
      eprint={2406.19349},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.19349}, 
}
Model size: 111M params (Safetensors, tensor type F32)