CordwainerSmith committed on
Commit 5f2324a
1 Parent(s): b46639e

Update README.md

Files changed (1)
  1. README.md +73 -10
README.md CHANGED
@@ -1,16 +1,20 @@
 ---
 language: he
 license: mit
 tags:
 - hebrew
 - ner
 - pii-detection
 - token-classification
 - xlm-roberta
 datasets:
- - custom
 model-index:
- - name: GolemPII-xlm-roberta-v1
 results:
 - task:
 name: Token Classification
@@ -27,15 +31,35 @@ model-index:
 value: 0.9982
 ---

- # GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model

- This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data.

 ## Model Details
 - Based on xlm-roberta-base
- - Fine-tuned on a custom Hebrew PII dataset
 - Optimized for token classification tasks in Hebrew text

 ## Performance Metrics

 ### Final Evaluation Results
@@ -77,6 +101,10 @@ eval_accuracy: 0.999795
 | 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868 | 0.997747 | 0.999750 |
 | 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191 | 0.998086 | 0.999795 |

 ## Usage
 ```python
 import torch
@@ -104,8 +132,43 @@ for token, label in zip(tokens, labels):
 print(f"Token: {token}, Label: {label}")
 ```

- ## Training Details
- - Training epochs: 5
- - Training speed: ~2.33 it/s (7615/7615 54:29)
- - Base model: xlm-roberta-base
- - Training language: Hebrew
 ---
 language: he
 license: mit
+ library_name: transformers
 tags:
 - hebrew
 - ner
 - pii-detection
 - token-classification
 - xlm-roberta
+ - privacy
+ - data-anonymization
+ - golemguard
 datasets:
+ - CordwainerSmith/GolemGuard
 model-index:
+ - name: GolemPII-v1
 results:
 - task:
 name: Token Classification

 value: 0.9982
 ---

+ # GolemPII-v1 - Hebrew PII Detection Model

+ This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data to achieve high accuracy in identifying and classifying various types of PII.

 ## Model Details
 - Based on xlm-roberta-base
+ - Fine-tuned on the GolemGuard: Hebrew Privacy Information Detection Corpus
 - Optimized for token classification tasks in Hebrew text

+ ## Intended Uses & Limitations
+
+ This model is intended for:
+
+ * **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy.
+ * **Data Anonymization:** Automating the de-identification of Hebrew documents in legal, medical, and other sensitive contexts.
+ * **Research:** Supporting research in Hebrew natural language processing and PII detection.
+
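The anonymization use case above comes down to replacing predicted PII spans with placeholder tags. A minimal sketch of that post-processing step, assuming BIO-style labels such as `B-PER`/`I-PER` (the model's actual label set is not shown in this diff):

```python
def mask_pii(tokens, labels):
    """Replace each contiguous B-/I- labeled span with a single [TYPE] placeholder."""
    out = []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            out.append(f"[{label[2:]}]")  # open a new masked span, e.g. [PER]
        elif label.startswith("I-"):
            continue  # continuation of a span already masked by its B- token
        else:  # "O" — not PII, keep the token as-is
            out.append(token)
    return " ".join(out)

print(mask_pii(
    ["שלום", "דוד", "כהן", "גר", "בתל", "אביב"],
    ["O", "B-PER", "I-PER", "O", "B-LOC", "I-LOC"],
))  # שלום [PER] גר [LOC]
```

Real tokenizer output would also need subword merging before this step; the sketch assumes whole-word tokens.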
+ ## Training Parameters
+
+ * **Batch Size:** 32
+ * **Learning Rate:** 2e-5 with linear warmup and decay
+ * **Optimizer:** AdamW
+ * **Hardware:** Trained on a single NVIDIA A100 GPU
+
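The "linear warmup and decay" schedule above can be sketched as a plain function of the step count; the warmup length here is an illustrative assumption, as the card does not state it:

```python
def lr_at_step(step, total_steps, peak_lr=2e-5, warmup_steps=500):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0.

    warmup_steps=500 is an assumed value for illustration only.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # ramp up
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)  # decay

print(lr_at_step(250, 10_000))     # halfway through warmup -> 1e-05
print(lr_at_step(10_000, 10_000))  # final step -> 0.0
```

This mirrors what `transformers`' linear schedule with warmup computes per optimizer step.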
+ ## Dataset Details
+
+ * **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus
+ * **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard)
+
 ## Performance Metrics

 ### Final Evaluation Results

 | 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868 | 0.997747 | 0.999750 |
 | 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191 | 0.998086 | 0.999795 |

+ ## Model Architecture
+
+ The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning.
+
 ## Usage
 ```python
 import torch

 print(f"Token: {token}, Label: {label}")
 ```
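The diff truncates the Usage snippet between `import torch` and the final print loop. The core decoding step it ends with — taking the highest-scoring label index per token and mapping it to a label name via the model config's `id2label` — can be sketched without loading the model (the label map below is an illustrative subset, not the model's real one):

```python
def decode_predictions(logit_rows, id2label):
    """Pick the highest-scoring label id per token and map it to its string name."""
    labels = []
    for row in logit_rows:
        best = max(range(len(row)), key=lambda i: row[i])  # argmax over label scores
        labels.append(id2label[best])
    return labels

id2label = {0: "O", 1: "B-PER", 2: "I-PER"}  # hypothetical subset for illustration
logits = [[2.0, 0.1, 0.3], [0.2, 3.1, 0.4], [0.1, 0.2, 2.5]]  # one row per token
print(decode_predictions(logits, id2label))  # ['O', 'B-PER', 'I-PER']
```

With the real model, the rows would come from `outputs.logits[0]` and `id2label` from `model.config.id2label`.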

+
+ ## License
+
+ The GolemPII-v1 model is released under the MIT License with the following additional terms:
+
+ ```
+ MIT License
+
+ Copyright (c) 2024 Liran Baba
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this dataset and associated documentation files (the "Dataset"), to deal
+ in the Dataset without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Dataset, and to permit persons to whom the Dataset is
+ furnished to do so, subject to the following conditions:
+
+ 1. The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Dataset.
+
+ 2. Any academic or professional work that uses this Dataset must include an
+ appropriate citation as specified below.
+
+ THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE
+ DATASET.
+ ```
+
+ ### How to Cite
+
+ If you use this model in your research, project, or application, please include the following citation:
+
+ For informal usage (e.g., blog posts, documentation):
+ ```
+ GolemPII-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-v1)
+ ```