SuccubusBot/distilbert-multilingual-incoherence-classifier


CortexPE committed on Mar 1

Commit 3df9967 (verified) · 1 parent: 4ad32b5

Update README.md


Files changed (1)

README.md +70 -32


@@ -1,22 +1,42 @@
-# DistilBERT Incoherence Classifier
-This is a fine-tuned DistilBERT model for classifying text based on its coherence. It can identify various types of incoherence.
 ## Model Details
--   **Model:** DistilBERT (distilbert-base-multilingual-cased)
--   **Task:** Text Classification (Coherence Detection)
--   **Fine-tuning:** The model was fine-tuned using a custom-generated dataset that features various types of incoherence.
-- **Training Dataset** The model was trained on the [incoherent-text-dataset](https://huggingface.co/datasets/your_huggingface_username/incoherent-text-dataset) dataset, located on Huggingface.
 ## Training Metrics
 | Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1       |
-| :---- | :------------ | :-------------- | :------- | :-------- | :----- | :------- |
-| 1     | 0.037500      | 0.071958        | 0.984995 | 0.985002  | 0.984995 | 0.984564 |
-| 2     | 0.008900      | 0.068670        | 0.985995 | 0.985973  | 0.985995 | 0.985603 |
-| 3     | 0.008500      | 0.058111        | 0.990330 | 0.990260  | 0.990330 | 0.990262 |
 ## Evaluation Metrics
@@ -24,28 +44,28 @@ The following metrics were measured on the test set:
 | Metric      | Value    |
 | :---------- | :------- |
-| Loss        | 0.049511 |
-| Accuracy    | 0.991    |
-| Precision   | 0.990958 |
-| Recall      | 0.991    |
-| F1-Score    | 0.990962 |
 ## Classification Report:
 ```
                     precision    recall  f1-score   support
-          coherent       0.99      0.99      0.99      1500
-grammatical_errors       0.96      0.94      0.95       250
-      random_bytes       1.00      1.00      1.00       250
-     random_tokens       1.00      1.00      1.00       250
-      random_words       1.00      1.00      1.00       250
-            run_on       1.00      0.99      1.00       250
-         word_soup       1.00      1.00      1.00       250
-          accuracy                           0.99      3000
-         macro avg       0.99      0.99      0.99      3000
-      weighted avg       0.99      0.99      0.99      3000
 ```
 ## Confusion Matrix
@@ -58,10 +78,28 @@ The confusion matrix above shows the performance of the model on each class.
 This model can be used for text classification tasks, specifically for detecting and categorizing different types of text incoherence. You can use the `inference_example` function provided in the notebook to test your own text.
-## Limitations
-The model has been trained on a generated dataset, so care must be taken in evaluating it in the real world. More data may need to be collected before evaluating this model in a real-world setting.
-## License
-CC-BY-SA 4.0

+---
+license: cc-by-sa-4.0
+datasets:
+- SuccubusBot/incoherent-text-dataset
+language:
+- en
+- es
+- fr
+- de
+- zh
+- ja
+- ru
+- ar
+- hi
+metrics:
+- accuracy
+base_model:
+- distilbert/distilbert-base-multilingual-cased
+pipeline_tag: text-classification
+library_name: transformers
+---
+# DistilBERT Incoherence Classifier (Multilingual)
+This is a fine-tuned DistilBERT-multilingual model for classifying text based on its coherence. It can identify various types of incoherence.
 ## Model Details
+- **Model:** DistilBERT (distilbert-base-multilingual-cased)
+- **Task:** Text Classification (Coherence Detection)
+- **Fine-tuning:** The model was fine-tuned using a synthetically generated dataset that features various types of incoherence
 ## Training Metrics
 | Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1       |
+| :---- | :------------ | :-------------- | :------- | :-------- | :----- | :------- |
+| 1     | 0.343600      | 0.303963        | 0.880312 | 0.882746  | 0.880312 | 0.879637 |
+| 2     | 0.245200      | 0.286482        | 0.900850 | 0.901156  | 0.900850 | 0.899612 |
+| 3     | 0.149700      | 0.313061        | 0.906161 | 0.906049  | 0.906161 | 0.905103 |
 ## Evaluation Metrics
 | Metric      | Value    |
 | :---------- | :------- |
+| Loss        | 0.316272 |
+| Accuracy    | 0.903329 |
+| Precision   | 0.903704 |
+| Recall      | 0.903329 |
+| F1-Score    | 0.902359 |
 ## Classification Report:
 ```
                     precision    recall  f1-score   support
+          coherent       0.86      0.93      0.90      2051
+grammatical_errors       0.88      0.76      0.81       599
+      random_bytes       1.00      1.00      1.00       599
+     random_tokens       1.00      1.00      1.00       600
+      random_words       0.95      0.93      0.94       600
+            run_on       0.85      0.79      0.82       600
+         word_soup       0.89      0.83      0.86       599
+          accuracy                           0.90      5648
+         macro avg       0.92      0.89      0.90      5648
+      weighted avg       0.90      0.90      0.90      5648
 ```
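As a sanity check on the updated report, the `macro avg` and `weighted avg` rows can be recomputed from the per-class rows. The sketch below is only illustrative: the `REPORT` dictionary and the helper names `macro_avg` and `weighted_avg` are hypothetical and not part of the repository, and because the inputs are the two-decimal values printed in the report, the results agree with the report only to about two decimals.

```python
# Per-class rows copied from the updated classification report:
# label -> (precision, recall, f1-score, support)
REPORT = {
    "coherent": (0.86, 0.93, 0.90, 2051),
    "grammatical_errors": (0.88, 0.76, 0.81, 599),
    "random_bytes": (1.00, 1.00, 1.00, 599),
    "random_tokens": (1.00, 1.00, 1.00, 600),
    "random_words": (0.95, 0.93, 0.94, 600),
    "run_on": (0.85, 0.79, 0.82, 600),
    "word_soup": (0.89, 0.83, 0.86, 599),
}

SUPPORT = 3  # index of the support column in each row


def macro_avg(report, column):
    """Unweighted mean of one metric column across all classes."""
    values = [row[column] for row in report.values()]
    return sum(values) / len(values)


def weighted_avg(report, column):
    """Mean of one metric column, weighted by each class's support."""
    total = sum(row[SUPPORT] for row in report.values())
    return sum(row[column] * row[SUPPORT] for row in report.values()) / total


if __name__ == "__main__":
    for name, col in (("precision", 0), ("recall", 1), ("f1-score", 2)):
        print(
            f"{name:>9}  macro={macro_avg(REPORT, col):.2f}  "
            f"weighted={weighted_avg(REPORT, col):.2f}"
        )
```

The macro average treats all seven classes equally, while the weighted average is dominated by the `coherent` class, which accounts for 2051 of the 5648 test samples.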
 ## Confusion Matrix
 This model can be used for text classification tasks, specifically for detecting and categorizing different types of text incoherence. You can use the `inference_example` function provided in the notebook to test your own text.
+```py
+from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+tokenizer = AutoTokenizer.from_pretrained("SuccubusBot/distilbert-multilingual-incoherence-classifier")
+model = AutoModelForSequenceClassification.from_pretrained("SuccubusBot/distilbert-multilingual-incoherence-classifier")
+classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
+while True:
+    text = input("Enter text (or type 'exit' to quit): ")
+    if text.lower() == "exit":
+        break
+    # top_k=None asks the pipeline for a score for every label, not just the top one
+    results = classifier(text, top_k=None)
+    # Print each label with its confidence score
+    for result in results:
+        print(f"Label: {result['label']}, Confidence: {result['score']:.4f}")
+```
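For callers that only need a coherent/incoherent decision, the per-label scores can be collapsed in a small post-processing step. The following is a minimal sketch under several assumptions: `summarize_scores` and `COHERENT_LABEL` are hypothetical names, the model's label strings are assumed to match the class names in the classification report, the 0.5 threshold is arbitrary, and the input is assumed to be the list of `{"label", "score"}` dictionaries that the pipeline returns for one text when called with `top_k=None`.

```python
# Assumed label name, taken from the classification report in the README.
COHERENT_LABEL = "coherent"


def summarize_scores(scores, threshold=0.5):
    """Collapse per-label scores for one text into a binary decision.

    scores: list of {"label": str, "score": float} dictionaries.
    threshold: minimum "coherent" score to accept the text (arbitrary default).
    """
    by_label = {item["label"]: item["score"] for item in scores}
    coherent_score = by_label.get(COHERENT_LABEL, 0.0)

    # Highest-scoring incoherence type, reported only when the text is rejected.
    incoherent = {lbl: s for lbl, s in by_label.items() if lbl != COHERENT_LABEL}
    top_issue = max(incoherent, key=incoherent.get) if incoherent else None

    is_coherent = coherent_score >= threshold
    return {
        "is_coherent": is_coherent,
        "coherent_score": coherent_score,
        "likely_issue": None if is_coherent else top_issue,
    }


if __name__ == "__main__":
    # Hand-written scores for illustration only; these are not model output.
    example = [
        {"label": "run_on", "score": 0.70},
        {"label": "coherent", "score": 0.20},
        {"label": "grammatical_errors", "score": 0.10},
    ]
    print(summarize_scores(example))
```

Given the lower recall reported for `grammatical_errors` and `run_on`, a stricter threshold may be worth trying when false accepts are costly.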
+## Limitations
+The model has been trained on a generated dataset, so care must be taken in evaluating it in the real world. More data may need to be collected before evaluating this model in a real-world setting.