CordwainerSmith committed on
Commit 1497987
1 Parent(s): 708b3d0

Update README.md

Files changed (1): README.md (+72 -8)
README.md CHANGED
@@ -7,8 +7,11 @@ tags:
- pii-detection
- token-classification
- xlm-roberta
+ - privacy
+ - data-anonymization
+ - golemguard
datasets:
- - custom
+ - CordwainerSmith/GolemGuard
model-index:
- name: GolemPII-xlm-roberta-v1
  results:
@@ -29,13 +32,33 @@ model-index:

# GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model

- This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data.
+ This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data to achieve high accuracy in identifying and classifying various types of PII.

## Model Details
- Based on xlm-roberta-base
- - Fine-tuned on a custom Hebrew PII dataset
+ - Fine-tuned on the GolemGuard: Hebrew Privacy Information Detection Corpus
- Optimized for token classification tasks in Hebrew text

+ ## Intended Uses & Limitations
+
+ This model is intended for:
+
+ * **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy.
+ * **Data Anonymization:** Automating the de-identification of Hebrew documents in legal, medical, and other sensitive contexts.
+ * **Research:** Supporting research in Hebrew natural language processing and PII detection.
+
+ ## Training Parameters
+
+ * **Batch Size:** 32
+ * **Learning Rate:** 2e-5 with linear warmup and decay
+ * **Optimizer:** AdamW
+ * **Hardware:** Trained on a single NVIDIA A100 GPU
+
+ ## Dataset Details
+
+ * **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus
+ * **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard)
+
## Performance Metrics

### Final Evaluation Results
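Aside from the diff itself, the Dataset Details section added above points at a Hub-hosted corpus. A minimal sketch of pulling it with the `datasets` library follows; it is illustrative rather than part of the commit, and the split name and record layout are assumptions to check against the dataset card.

```python
# Minimal sketch (not from the commit): load the GolemGuard corpus referenced
# in the Dataset Details section. Split and field names are assumptions.
from datasets import load_dataset

golem_guard = load_dataset("CordwainerSmith/GolemGuard")
print(golem_guard)               # lists the available splits and features
print(golem_guard["train"][0])   # assumed "train" split; inspect one record
```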
 
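The Training Parameters above map naturally onto a Hugging Face `Trainer` run. The configuration below is a hypothetical reconstruction, not the author's script: the warmup ratio, label count, and output path are placeholders, while the batch size, learning rate, scheduler, optimizer, and the five epochs visible in the results table come from the card.

```python
# Hypothetical fine-tuning setup mirroring the stated hyperparameters
# (batch size 32, lr 2e-5 with linear warmup and decay, AdamW, 5 epochs).
# NUM_LABELS, warmup_ratio, and output_dir are placeholders, not documented values.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE_MODEL = "FacebookAI/xlm-roberta-base"
NUM_LABELS = 37  # placeholder: set to the actual number of PII tags in GolemGuard

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForTokenClassification.from_pretrained(BASE_MODEL, num_labels=NUM_LABELS)

training_args = TrainingArguments(
    output_dir="golempii-xlm-roberta-v1",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    lr_scheduler_type="linear",   # linear decay after warmup
    warmup_ratio=0.1,             # assumption: warmup length is not stated in the card
    num_train_epochs=5,           # matches the five epochs in the evaluation table
    optim="adamw_torch",          # AdamW, as listed under Training Parameters
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=tokenized_train,   # tokenized GolemGuard splits
#                   eval_dataset=tokenized_eval)     # (preprocessing not shown here)
# trainer.train()
```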
@@ -77,6 +100,10 @@ eval_accuracy: 0.999795
| 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868 | 0.997747 | 0.999750 |
| 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191 | 0.998086 | 0.999795 |

+ ## Model Architecture
+
+ The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning.
+
## Usage
```python
import torch
@@ -104,8 +131,45 @@ for token, label in zip(tokens, labels):
    print(f"Token: {token}, Label: {label}")
```

- ## Training Details
- - Training epochs: 5
- - Training speed: ~2.33 it/s (7615/7615 54:29)
- - Base model: xlm-roberta-base
- - Training language: Hebrew
+
+ ## License
+
+ The GolemPII-xlm-roberta-v1 model is released under the MIT License with the following additional terms:
+
+ ```
+ MIT License
+
+ Copyright (c) 2024 Liran Baba
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this dataset and associated documentation files (the "Dataset"), to deal
+ in the Dataset without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Dataset, and to permit persons to whom the Dataset is
+ furnished to do so, subject to the following conditions:
+
+ 1. The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Dataset.
+
+ 2. Any academic or professional work that uses this Dataset must include an
+ appropriate citation as specified below.
+
+ THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE
+ DATASET.
+ ```
+
+ ### How to Cite
+
+ If you use this model in your research, project, or application, please include the following citation:
+
+ For informal usage (e.g., blog posts, documentation):
+ ```
+ GolemPII-xlm-roberta-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-xlm-roberta-v1)
+ ```
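The Usage snippet in this diff is only partially visible: the hunks show its first and last lines, while the middle falls outside the changed region. For orientation, a generic token-classification flow with this checkpoint would look roughly like the sketch below; it is a reconstruction under stated assumptions, not the README's exact code, and the Hebrew example sentence is invented.

```python
# Generic inference sketch for the published checkpoint; not the README's exact snippet.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_name = "CordwainerSmith/GolemPII-xlm-roberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

text = "שלום, שמי ישראל ישראלי ומספר הטלפון שלי הוא 052-1234567"  # invented example
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[i.item()] for i in predicted_ids]

for token, label in zip(tokens, labels):
    print(f"Token: {token}, Label: {label}")
```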
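Tying this back to the Privacy Protection and Data Anonymization use cases under Intended Uses & Limitations: one common pattern is to run the checkpoint through the `pipeline` API with span aggregation and replace each detected entity with its label. The sketch below assumes a BIO-style tag scheme that the `simple` aggregation strategy can group into spans; it is illustrative only and not part of the commit.

```python
# Illustrative PII-masking helper (assumes BIO-style labels groupable by the
# "simple" aggregation strategy); not part of this commit.
from transformers import pipeline

pii_pipe = pipeline(
    "token-classification",
    model="CordwainerSmith/GolemPII-xlm-roberta-v1",
    aggregation_strategy="simple",
)

def anonymize(text: str) -> str:
    """Replace every detected PII span with a [LABEL] placeholder."""
    # Walk the detected spans from right to left so character offsets stay valid.
    for ent in sorted(pii_pipe(text), key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text

print(anonymize("שלום, שמי ישראל ישראלי ומספר הטלפון שלי הוא 052-1234567"))  # invented example
```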