CodeHima committed
Commit 963e8c2
Parent: 60cb73f

Update README.md

# TOSRoberta-base

## Model Overview

**Model Name:** TOSRoberta-base
**Model Type:** Sequence Classification
**Base Model:** [RoBERTa-base](https://huggingface.co/roberta-base)
**Language:** English
**Task:** Classification of unfairness levels in Terms of Service (ToS) documents

**Model Card Version:** 1.0
**Author:** CodeHima

## Model Description

The `TOSRoberta-base` model is a fine-tuned version of `RoBERTa-base` that classifies clauses in Terms of Service (ToS) documents into three categories:
- **Clearly Fair**
- **Potentially Unfair**
- **Clearly Unfair**

The model was fine-tuned on a custom dataset labeled with these categories to help identify unfair practices in ToS documents; a quick way to try it is sketched below.

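For a fast smoke test, the `pipeline` API wraps tokenization and prediction in one call. This is a minimal sketch; note that the label string returned depends on the `id2label` mapping saved in the model config, and may appear as generic `LABEL_0`/`LABEL_1`/`LABEL_2` (corresponding, in order, to the three categories above) if no names were stored. The example clause is illustrative, not from the training data.

```python
from transformers import pipeline

# Load the model behind a text-classification pipeline.
classifier = pipeline("text-classification", model="CodeHima/TOSRoberta-base")

# Returned label may be "LABEL_0"/"LABEL_1"/"LABEL_2" if the config
# stores no custom id2label names (0 = clearly_fair, 1 = potentially_unfair,
# 2 = clearly_unfair, per the mapping in "How to Use" below).
print(classifier("We may suspend your account at any time without notice."))
```
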
## Intended Use

### Primary Use Case
Classifying clauses from Terms of Service documents by fairness level. The model is particularly useful for legal analysts, researchers, and consumer protection agencies who need to flag potentially unfair clauses quickly.

### Limitations
- **Dataset Bias:** The model was trained on a specific dataset, which may introduce biases; it may not generalize well to all types of ToS documents.
- **Context Understanding:** The model may struggle with clauses that require deep contextual or legal understanding.

## Performance

### Training Configuration
- **Batch Size:** 32 (training), 16 (evaluation)
- **Learning Rate:** 1e-5
- **Epochs:** 10
- **Optimizer:** AdamW
- **Scheduler:** Linear with warmup
- **Training Framework:** PyTorch, via Hugging Face's `transformers` library (see the sketch after this list)
- **Mixed Precision Training:** Enabled (fp16)
- **Hardware:** A single NVIDIA T4 GPU (15 GB VRAM)

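These hyperparameters map directly onto the `Trainer` API. Below is a minimal reconstruction of such a setup, not the author's exact script: the `text`/`label` column names, the `validation` split name, and the 10% warmup ratio are assumptions not stated in this card.

```python
from datasets import load_dataset
from transformers import (RobertaForSequenceClassification, RobertaTokenizerFast,
                          Trainer, TrainingArguments)

# Dataset named in this card; "text"/"label" column names are assumptions.
dataset = load_dataset("CodeHima/TOS_DatasetV3")
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def tokenize(batch):
    # Fixed-length padding keeps the default data collator happy.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                         num_labels=3)

args = TrainingArguments(
    output_dir="TOSRoberta-base",
    per_device_train_batch_size=32,   # training batch size from the card
    per_device_eval_batch_size=16,    # evaluation batch size from the card
    learning_rate=1e-5,
    num_train_epochs=10,
    lr_scheduler_type="linear",       # linear schedule, as stated
    warmup_ratio=0.1,                 # assumed warmup fraction
    fp16=True,                        # mixed precision, as stated
    evaluation_strategy="epoch",      # "eval_strategy" in newer releases
    save_strategy="epoch",
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"],
                  eval_dataset=dataset["validation"])  # split name assumed
trainer.train()
```
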
### Training Metrics

| Epoch | Training Loss | Validation Loss | Accuracy | F1    | Precision | Recall |
|-------|---------------|-----------------|----------|-------|-----------|--------|
| 1     | 0.6681        | 0.6202          | 0.740    | 0.727 | 0.728     | 0.740  |
| 2     | 0.4398        | 0.4639          | 0.825    | 0.821 | 0.826     | 0.825  |
| 3     | 0.3735        | 0.4326          | 0.831    | 0.832 | 0.834     | 0.831  |
| 4     | 0.3428        | 0.4027          | 0.854    | 0.854 | 0.853     | 0.854  |
| 5     | 0.2838        | 0.4349          | 0.830    | 0.832 | 0.840     | 0.830  |
| 6     | 0.2180        | 0.4373          | 0.859    | 0.859 | 0.859     | 0.859  |
| 7     | 0.2668        | 0.5081          | 0.821    | 0.824 | 0.834     | 0.821  |
| 8     | 0.1396        | 0.4864          | 0.855    | 0.856 | 0.856     | 0.855  |
| 9     | 0.0850        | 0.5301          | 0.845    | 0.846 | 0.850     | 0.845  |
| 10    | 0.1036        | 0.5280          | 0.843    | 0.844 | 0.847     | 0.843  |

**Final Validation Accuracy:** 85.90% (best epoch: 6)
**Final Test Accuracy:** 85.65%

### Evaluation Metrics
On the held-out test set:
- **Accuracy:** 85.65%
- **F1 Score:** 85.60%
- **Precision:** 85.61%
- **Recall:** 85.65%

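These aggregate scores track accuracy closely, which is consistent with weighted averaging over the three classes. A sketch of a `compute_metrics` hook that would report the same four numbers during training, assuming scikit-learn and weighted averaging (neither is confirmed by this card):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Trainer hook returning the four metrics reported above."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # "weighted" averaging is an assumption; the card does not state it.
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0)
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1, "precision": precision, "recall": recall}
```

The hook is passed to the trainer as `Trainer(..., compute_metrics=compute_metrics)`.
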
## Dataset

The model was trained on the `CodeHima/TOS_DatasetV3` dataset, which contains labeled clauses from ToS documents and is split into training, validation, and test sets for reliable performance evaluation.

**Dataset Labels:**
- `clearly_fair`
- `potentially_unfair`
- `clearly_unfair`

## How to Use

Here’s how to run the model with the Hugging Face `transformers` library:

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

# Load the fine-tuned model and its tokenizer
model = RobertaForSequenceClassification.from_pretrained('CodeHima/TOSRoberta-base')
tokenizer = RobertaTokenizer.from_pretrained('CodeHima/TOSRoberta-base')
model.eval()

# Predict the unfairness level of a clause
text = "Insert clause text here."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
predicted_class = outputs.logits.argmax(-1).item()

# Map the predicted class index to the corresponding label
label_mapping = {0: 'clearly_fair', 1: 'potentially_unfair', 2: 'clearly_unfair'}
predicted_label = label_mapping[predicted_class]
print(f"Predicted Label: {predicted_label}")
```

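To score several clauses at once, or to inspect class probabilities instead of a hard label, the `model`, `tokenizer`, and `label_mapping` objects from the snippet above can be reused. The example clauses here are illustrative:

```python
clauses = [
    "You may cancel your subscription at any time.",
    "We reserve the right to change these terms without notice.",
]

# Tokenize the whole batch with dynamic padding.
batch = tokenizer(clauses, return_tensors="pt", truncation=True,
                  padding=True, max_length=128)

# Softmax over the logits gives a probability per class.
with torch.no_grad():
    probs = model(**batch).logits.softmax(dim=-1)

for clause, p in zip(clauses, probs):
    scores = {label_mapping[i]: round(p[i].item(), 3) for i in range(3)}
    print(clause, "->", scores)
```
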
## Ethical Considerations

- **Bias:** The model's predictions may reflect biases present in the training data.
- **Fair Use:** Use the model responsibly, especially in legal contexts where human oversight is critical.

## Conclusion

`TOSRoberta-base` is a practical tool for flagging unfair clauses in Terms of Service documents. While it performs well on its test set, it should be used alongside expert analysis, particularly in legally sensitive contexts.

**Model Repository:** [CodeHima/TOSRoberta-base](https://huggingface.co/CodeHima/TOSRoberta-base)