metadata
license: cc-by-sa-4.0
datasets:
  - SuccubusBot/incoherent-text-dataset
language:
  - en
  - es
  - fr
  - de
  - zh
  - ja
  - ru
  - ar
  - hi
metrics:
  - accuracy
base_model:
  - distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
library_name: transformers

DistilBERT Incoherence Classifier (Multilingual)

This is a fine-tuned multilingual DistilBERT model that classifies text by coherence. It labels input as coherent or as one of six types of incoherence: grammatical_errors, random_bytes, random_tokens, random_words, run_on, or word_soup.

Model Details

  • Model: DistilBERT (distilbert-base-multilingual-cased)
  • Task: Text Classification (Coherence Detection)
  • Fine-tuning: The model was fine-tuned using a synthetically generated dataset that features various types of incoherence

Training Metrics

Epoch  Training Loss  Validation Loss  Accuracy  Precision  Recall    F1
1      0.343600       0.303963         0.880312  0.882746   0.880312  0.879637
2      0.245200       0.286482         0.900850  0.901156   0.900850  0.899612
3      0.149700       0.313061         0.906161  0.906049   0.906161  0.905103

Evaluation Metrics

The following metrics were measured on the test set:

Metric     Value
Loss       0.316272
Accuracy   0.903329
Precision  0.903704
Recall     0.903329
F1-Score   0.902359

Classification Report:

                    precision    recall  f1-score   support

          coherent       0.86      0.93      0.90      2051
grammatical_errors       0.88      0.76      0.81       599
      random_bytes       1.00      1.00      1.00       599
     random_tokens       1.00      1.00      1.00       600
      random_words       0.95      0.93      0.94       600
            run_on       0.85      0.79      0.82       600
         word_soup       0.89      0.83      0.86       599

          accuracy                           0.90      5648
         macro avg       0.92      0.89      0.90      5648
      weighted avg       0.90      0.90      0.90      5648
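
The two summary rows of the report can be reproduced from the per-class rows, which helps when reading them. The sketch below copies the rounded per-class values from the report (so results match only to two decimals) and computes both averages. The names `report`, `macro_avg`, and `weighted_avg` are illustrative and not part of the model's code.

```python
# Per-class values copied from the classification report:
# label: (precision, recall, f1-score, support)
report = {
    "coherent":           (0.86, 0.93, 0.90, 2051),
    "grammatical_errors": (0.88, 0.76, 0.81, 599),
    "random_bytes":       (1.00, 1.00, 1.00, 599),
    "random_tokens":      (1.00, 1.00, 1.00, 600),
    "random_words":       (0.95, 0.93, 0.94, 600),
    "run_on":             (0.85, 0.79, 0.82, 600),
    "word_soup":          (0.89, 0.83, 0.86, 599),
}

# Total number of test examples across all classes
total = sum(row[3] for row in report.values())


def macro_avg(col):
    """Unweighted mean over classes: every class counts equally."""
    return sum(row[col] for row in report.values()) / len(report)


def weighted_avg(col):
    """Mean over classes weighted by support: larger classes count more."""
    return sum(row[col] * row[3] for row in report.values()) / total


for name, col in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(f"{name:>9}  macro={macro_avg(col):.2f}  weighted={weighted_avg(col):.2f}")
```

Because the coherent class makes up roughly a third of the test set (2051 of 5648 examples), the weighted averages lean toward its scores, while the macro averages give the smaller incoherence classes equal say. Support-weighted recall is the same quantity as overall accuracy, which is why the Recall and Accuracy values in the metrics table coincide.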

Confusion Matrix

[Figure: confusion matrix]

The confusion matrix above shows the model's performance on each class.

Usage

Use this model to detect and categorize different types of text incoherence. The inference_example function provided in the notebook lets you test your own text; the script below does the same interactively through the transformers pipeline API.

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

MODEL_ID = "SuccubusBot/distilbert-multilingual-incoherence-classifier"

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

while True:
    text = input("Enter text (or type 'exit' to quit): ")
    if text.strip().lower() == "exit":
        break

    # top_k=None returns a score for every label instead of only the top one
    results = classifier(text, top_k=None)

    # Print each label with its confidence score, highest first
    for result in results:
        print(f"Label: {result['label']}, Confidence: {result['score']:.4f}")
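
For non-interactive use, such as screening a list of texts, a batch variant may be more convenient. The following is a minimal sketch, not part of the original notebook: `score_texts`, `pick_label`, and the `min_confidence` threshold are hypothetical names and values chosen for illustration, and the label strings are assumed to match the class names in the classification report.

```python
MODEL_ID = "SuccubusBot/distilbert-multilingual-incoherence-classifier"


def score_texts(texts):
    """Return, for each text, a list of {'label', 'score'} dicts covering every class."""
    # Imported here so that pick_label below can be used without transformers installed
    from transformers import pipeline

    classifier = pipeline("text-classification", model=MODEL_ID)
    # top_k=None requests scores for all labels;
    # truncation=True clips inputs that exceed the model's maximum length
    return classifier(list(texts), top_k=None, truncation=True)


def pick_label(scores, min_confidence=0.5):
    """Return the highest-scoring label, or None if its score is below min_confidence.

    min_confidence is an illustrative threshold, not a value tuned for this model.
    """
    best = max(scores, key=lambda s: s["score"])
    return best["label"] if best["score"] >= min_confidence else None


# Example usage (downloads the model on first run):
#   texts = ["The cat sat on the mat.", "mat the on cat sat purple seven"]
#   for text, scores in zip(texts, score_texts(texts)):
#       print(text, "->", pick_label(scores))
```

Returning None for low-confidence predictions leaves room to route borderline texts to manual review instead of forcing a label.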

Limitations

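The model was trained on a synthetically generated dataset, so the metrics reported above may not carry over to real-world text. Validate it on data from your own domain, and collect additional real-world examples if needed, before relying on its predictions. Even on the test set, recall is noticeably lower for grammatical_errors (0.76) and run_on (0.79) than for the other classes.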
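
One way to check the model before deployment is to measure per-class recall on a small hand-labeled sample from the target domain. The sketch below assumes you already have parallel lists of true and predicted labels; `per_class_recall` is a hypothetical helper, not part of the model's code.

```python
from collections import Counter


def per_class_recall(true_labels, predicted_labels):
    """Return {label: recall} for every label that occurs in true_labels."""
    # Number of examples per true class
    support = Counter(true_labels)
    # Number of correctly predicted examples per true class
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / count for label, count in support.items()}


# Toy data for illustration only; replace with labels from your own sample
true_labels = ["coherent", "coherent", "run_on", "run_on"]
predicted_labels = ["coherent", "run_on", "run_on", "run_on"]
print(per_class_recall(true_labels, predicted_labels))
```

Classes whose recall falls well below the test-set figures are candidates for collecting more real-world training examples.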