---
license: cc-by-sa-4.0
datasets:
- SuccubusBot/incoherent-text-dataset
language:
- en
- es
- fr
- de
- zh
- ja
- ru
- ar
- hi
metrics:
- accuracy
base_model:
- distilbert/distilbert-base-multilingual-cased
pipeline_tag: text-classification
library_name: transformers
---
# DistilBERT Incoherence Classifier (Multilingual)
This is a fine-tuned multilingual DistilBERT model that classifies text by coherence, distinguishing coherent text from several categories of incoherence.
## Model Details
- **Model:** DistilBERT (distilbert-base-multilingual-cased)
- **Task:** Text Classification (Coherence Detection)
- **Fine-tuning:** The model was fine-tuned on a synthetically generated dataset featuring various types of incoherence (a minimal reproduction sketch is shown below).
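The exact training script is not published with this card. As a rough guide, a fine-tune of this shape could be reproduced with the `Trainer` API along the following lines; the `text`/`label` column names, batch size, and learning rate are assumptions, not the values actually used.
```py
# Minimal fine-tuning sketch (assumptions: "text"/"label" columns, ClassLabel labels;
# batch size and learning rate are illustrative, not the values used for this model).
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

dataset = load_dataset("SuccubusBot/incoherent-text-dataset")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-multilingual-cased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)
num_labels = tokenized["train"].features["label"].num_classes

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-multilingual-cased", num_labels=num_labels
)

training_args = TrainingArguments(
    output_dir="incoherence-classifier",
    num_train_epochs=3,              # matches the three epochs reported below
    per_device_train_batch_size=32,  # assumption
    learning_rate=2e-5,              # assumption
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```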
## Training Metrics
| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
| :---- | :------------ | :------------ | :-------- | :-------- | :-------- | :------- |
| 1 | 0.343600 | 0.303963 | 0.880312 | 0.882746 | 0.880312 | 0.879637 |
| 2 | 0.245200 | 0.286482 | 0.900850 | 0.901156 | 0.900850 | 0.899612 |
| 3 | 0.149700 | 0.313061 | 0.906161 | 0.906049 | 0.906161 | 0.905103 |
## Evaluation Metrics
The following metrics were measured on the test set:
| Metric | Value |
| :---------- | :------- |
| Loss | 0.316272 |
| Accuracy | 0.903329 |
| Precision | 0.903704 |
| Recall | 0.903329 |
| F1-Score | 0.902359 |
## Classification Report
```
                    precision    recall  f1-score   support

          coherent       0.86      0.93      0.90      2051
grammatical_errors       0.88      0.76      0.81       599
      random_bytes       1.00      1.00      1.00       599
     random_tokens       1.00      1.00      1.00       600
      random_words       0.95      0.93      0.94       600
            run_on       0.85      0.79      0.82       600
         word_soup       0.89      0.83      0.86       599

          accuracy                           0.90      5648
         macro avg       0.92      0.89      0.90      5648
      weighted avg       0.90      0.90      0.90      5648
```
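The report above follows scikit-learn's `classification_report` format. For reference, a report like this can be regenerated from test-set predictions roughly as follows; the prediction arrays here are placeholders, not the actual evaluation outputs.
```py
# Illustrative only: regenerating a per-class report with scikit-learn.
# y_true / y_pred would come from running the classifier over the test split.
from sklearn.metrics import classification_report

labels = ["coherent", "grammatical_errors", "random_bytes", "random_tokens",
          "random_words", "run_on", "word_soup"]

y_true = ["coherent", "run_on", "word_soup"]            # placeholder ground truth
y_pred = ["coherent", "run_on", "grammatical_errors"]   # placeholder predictions

print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
```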
## Confusion Matrix
![Confusion Matrix](confusion_matrix.png)
The confusion matrix above shows the performance of the model on each class.
## Usage
This model can be used for text classification tasks, specifically for detecting and categorizing different types of text incoherence. You can use the `inference_example` function provided in the notebook, or the snippet below, to test your own text.
```py
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("SuccubusBot/distilbert-multilingual-incoherence-classifier")
model = AutoModelForSequenceClassification.from_pretrained("SuccubusBot/distilbert-multilingual-incoherence-classifier")
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

while True:
    text = input("Enter text (or type 'exit' to quit): ")
    if text.lower() == "exit":
        break
    # top_k=None returns confidence scores for all labels, not just the top one
    results = classifier(text, top_k=None)
    for result in results:
        print(f"Label: {result['label']}, Confidence: {result['score']:.4f}")
```
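The pipeline also accepts a list of texts. Because the DistilBERT backbone is capped at 512 tokens, it is worth passing `truncation=True` for long inputs; the example texts below are illustrative and reuse the `classifier` object defined above.
```py
# Batch classification; truncation guards against inputs longer than 512 tokens.
texts = [
    "The meeting was rescheduled to Thursday because the projector was broken.",
    "banana purple the of running quickly sideways lamp",
]
for text, result in zip(texts, classifier(texts, truncation=True)):
    print(f"{result['label']} ({result['score']:.3f}): {text}")
```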
## Limitations
The model was trained on a synthetically generated dataset, so its performance on real-world text may differ. Additional real-world data may need to be collected before evaluating or deploying this model in production.