Commit 1497987 by CordwainerSmith, "Update README.md" (parent: 708b3d0)

README.md (changed):
tags:
- pii-detection
- token-classification
- xlm-roberta
- privacy
- data-anonymization
- golemguard
datasets:
- CordwainerSmith/GolemGuard
model-index:
- name: GolemPII-xlm-roberta-v1
  results:
# GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model

This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data to achieve high accuracy in identifying and classifying various types of PII.

## Model Details
- Based on xlm-roberta-base
- Fine-tuned on the GolemGuard: Hebrew Privacy Information Detection Corpus
- Optimized for token classification tasks in Hebrew text
## Intended Uses & Limitations

This model is intended for:

* **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy.
* **Data Anonymization:** Automating the process of de-identifying Hebrew documents in legal, medical, and other sensitive contexts.
* **Research:** Supporting research in Hebrew natural language processing and PII detection.
## Training Parameters

* **Batch Size:** 32
* **Learning Rate:** 2e-5 with linear warmup and decay
* **Optimizer:** AdamW
* **Hardware:** Trained on a single NVIDIA A100 GPU
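The schedule named above (linear warmup, then linear decay) can be sketched as a multiplier applied to the 2e-5 base rate. This is a minimal illustration, not the training script; the warmup and total step counts below are hypothetical, as the card does not state them:

```python
BASE_LR = 2e-5  # learning rate from the training parameters above

def linear_warmup_decay(step: int, warmup_steps: int, total_steps: int) -> float:
    """Multiplier on BASE_LR: ramps 0 -> 1 over the warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# e.g. with (hypothetical) 100 warmup steps out of 1000 total:
lrs = [BASE_LR * linear_warmup_decay(s, 100, 1000) for s in (0, 50, 100, 550, 1000)]
```

This matches the shape produced by `transformers.get_linear_schedule_with_warmup` when paired with AdamW.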
## Dataset Details

* **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus
* **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard)
## Performance Metrics

### Final Evaluation Results

eval_accuracy: 0.999795

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|----|----------|
| 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868 | 0.997747 | 0.999750 |
| 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191 | 0.998086 | 0.999795 |
## Model Architecture

The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning.
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the fine-tuned model and tokenizer from the Hub
model_name = "CordwainerSmith/GolemPII-xlm-roberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "..."  # Hebrew text to scan for PII

# Tokenize and run inference
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Map each token to its predicted label
predictions = outputs.logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions]

for token, label in zip(tokens, labels):
    print(f"Token: {token}, Label: {label}")
```
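For the anonymization use case, detected spans can be replaced with placeholders. The transformers token-classification pipeline (with `aggregation_strategy="simple"`) returns entity dicts carrying `start`/`end` character offsets, which a small helper can consume. This is a sketch; the entity label name below is hypothetical, as the model's actual label set lives in its config:

```python
def mask_pii(text: str, entities: list, mask_fmt: str = "[{}]") -> str:
    """Replace each detected span with a placeholder built from its entity group.

    `entities` are dicts with "start", "end", and "entity_group" keys, in the
    shape returned by transformers' token-classification pipeline with
    aggregation_strategy="simple". Spans are replaced right-to-left so that
    earlier character offsets remain valid as the string shrinks or grows.
    """
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + mask_fmt.format(ent["entity_group"]) + text[ent["end"]:]
    return text

# Illustrative entity dict (the label name "PHONE_NUMBER" is hypothetical):
entities = [{"start": 5, "end": 15, "entity_group": "PHONE_NUMBER"}]
print(mask_pii("call 0501234567 now", entities))  # call [PHONE_NUMBER] now
```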
|
133 |
|
134 |
+
|
135 |
+
## License
|
136 |
+
|
137 |
+
The GolemPII-xlm-roberta-v1 model is released under MIT License with the following additional terms:
|
138 |
+
|
139 |
+
```
|
140 |
+
MIT License
|
141 |
+
|
142 |
+
Copyright (c) 2024 Liran Baba
|
143 |
+
|
144 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
145 |
+
of this dataset and associated documentation files (the "Dataset"), to deal
|
146 |
+
in the Dataset without restriction, including without limitation the rights
|
147 |
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
148 |
+
copies of the Dataset, and to permit persons to whom the Dataset is
|
149 |
+
furnished to do so, subject to the following conditions:
|
150 |
+
|
151 |
+
1. The above copyright notice and this permission notice shall be included in all
|
152 |
+
copies or substantial portions of the Dataset.
|
153 |
+
|
154 |
+
2. Any academic or professional work that uses this Dataset must include an
|
155 |
+
appropriate citation as specified below.
|
156 |
+
|
157 |
+
THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
158 |
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
159 |
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
160 |
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
161 |
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
162 |
+
OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE
|
163 |
+
DATASET.
|
164 |
+
```
|
### How to Cite

If you use this model in your research, project, or application, please include the following citation:

For informal usage (e.g., blog posts, documentation):

```
GolemPII-xlm-roberta-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-xlm-roberta-v1)
```