|
--- |
|
license: cc-by-sa-4.0 |
|
datasets: |
|
- SuccubusBot/incoherent-text-dataset |
|
language: |
|
- en |
|
- es |
|
- fr |
|
- de |
|
- zh |
|
- ja |
|
- ru |
|
- ar |
|
- hi |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- distilbert/distilbert-base-multilingual-cased |
|
pipeline_tag: text-classification |
|
library_name: transformers |
|
--- |
|
|
|
# DistilBERT Incoherence Classifier (Multilingual) |
|
|
|
This is a fine-tuned multilingual DistilBERT model that classifies text by coherence. It distinguishes coherent text from several types of incoherence, such as grammatical errors, random bytes, random tokens, random words, run-on sentences, and word soup.
|
|
|
## Model Details |
|
|
|
- **Model:** DistilBERT (distilbert-base-multilingual-cased) |
|
- **Task:** Text Classification (Coherence Detection) |
|
- **Fine-tuning:** The model was fine-tuned on a synthetically generated dataset (SuccubusBot/incoherent-text-dataset) covering the incoherence types listed in the classification report below; a setup sketch follows this list.
|
|
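The exact training script is not part of this card. As a rough sketch, assuming a standard `transformers` sequence-classification setup (the label names below are taken from the classification report in this card; everything else is illustrative), the base checkpoint could be prepared for the seven-class task like this:

```py
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Label set taken from the classification report further down this card
labels = [
    "coherent", "grammatical_errors", "random_bytes",
    "random_tokens", "random_words", "run_on", "word_soup",
]
id2label = {i: label for i, label in enumerate(labels)}
label2id = {label: i for i, label in enumerate(labels)}

# Multilingual DistilBERT base model with a freshly initialized 7-way classification head
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-multilingual-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-multilingual-cased",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)
```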
|
## Training Metrics |
|
|
|
| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 | |
|
| :---- | :------------ | :------------ | :-------- | :-------- | :-------- | :------- | |
|
| 1 | 0.343600 | 0.303963 | 0.880312 | 0.882746 | 0.880312 | 0.879637 | |
|
| 2 | 0.245200 | 0.286482 | 0.900850 | 0.901156 | 0.900850 | 0.899612 | |
|
| 3 | 0.149700 | 0.313061 | 0.906161 | 0.906049 | 0.906161 | 0.905103 | |
|
|
|
## Evaluation Metrics |
|
|
|
The following metrics were measured on the test set: |
|
|
|
| Metric | Value | |
|
| :---------- | :------- | |
|
| Loss | 0.316272 | |
|
| Accuracy | 0.903329 | |
|
| Precision | 0.903704 | |
|
| Recall | 0.903329 | |
|
| F1-Score | 0.902359 | |
|
|
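The precision, recall, and F1 values above are weighted averages over the seven classes (consistent with the `weighted avg` row of the classification report below). As a minimal sketch of how such metrics can be computed from test-set predictions with scikit-learn (the actual evaluation script is not included here, and `y_true`/`y_pred` are placeholders):

```py
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder integer class ids; in practice these come from the test set and model predictions
y_true = [0, 1, 2, 0, 3]
y_pred = [0, 1, 1, 0, 3]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"Accuracy: {accuracy:.6f}  Precision: {precision:.6f}  Recall: {recall:.6f}  F1: {f1:.6f}")
```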
|
## Classification Report
|
|
|
``` |
|
precision recall f1-score support |
|
|
|
coherent 0.86 0.93 0.90 2051 |
|
grammatical_errors 0.88 0.76 0.81 599 |
|
random_bytes 1.00 1.00 1.00 599 |
|
random_tokens 1.00 1.00 1.00 600 |
|
random_words 0.95 0.93 0.94 600 |
|
run_on 0.85 0.79 0.82 600 |
|
word_soup 0.89 0.83 0.86 599 |
|
|
|
accuracy 0.90 5648 |
|
macro avg 0.92 0.89 0.90 5648 |
|
weighted avg 0.90 0.90 0.90 5648 |
|
``` |
|
|
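The seven class names above are the labels the model predicts. Assuming the label mapping is stored in the model configuration, as is standard for `transformers` sequence-classification checkpoints (worth verifying against the repository files), it can be inspected directly:

```py
from transformers import AutoConfig

config = AutoConfig.from_pretrained("SuccubusBot/distilbert-multilingual-incoherence-classifier")
print(config.id2label)  # e.g. {0: "coherent", 1: "grammatical_errors", ...}
```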
|
## Confusion Matrix |
|
|
|
 |
|
|
|
The confusion matrix above shows the performance of the model on each class. |
|
|
|
## Usage |
|
|
|
This model can be used for text classification, specifically for detecting and categorizing different types of text incoherence. You can use the `inference_example` function provided in the training notebook, or run the snippet below, to test your own text.
|
|
|
```py
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("SuccubusBot/distilbert-multilingual-incoherence-classifier")
model = AutoModelForSequenceClassification.from_pretrained("SuccubusBot/distilbert-multilingual-incoherence-classifier")

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

while True:
    text = input("Enter text (or type 'exit' to quit): ")
    if text.lower() == "exit":
        break

    # Classify the text; top_k=None returns a confidence score for every label
    results = classifier(text, top_k=None)

    # Print each label with its confidence score
    for result in results:
        print(f"Label: {result['label']}, Confidence: {result['score']:.4f}")
```
|
|
|
## Limitations |
|
|
|
The model was trained on a synthetically generated dataset, so care should be taken when applying it to real-world text. Additional real-world data may need to be collected before the model can be meaningfully evaluated or deployed in such settings.