---
language: he
license: mit
library_name: transformers
tags:
- hebrew
- ner
- pii-detection
- token-classification
- xlm-roberta
- privacy
- data-anonymization
- golemguard
datasets:
- CordwainerSmith/GolemGuard
model-index:
- name: GolemPII-v1
  results:
  - task:
      name: Token Classification
      # …
      value: 0.9982
---

# GolemPII-v1 - Hebrew PII Detection Model

This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data to achieve high accuracy in identifying and classifying various types of PII.

## Model Details
- Based on xlm-roberta-base
- Fine-tuned on the GolemGuard: Hebrew Privacy Information Detection Corpus
- Optimized for token classification tasks in Hebrew text

## Intended Uses & Limitations

This model is intended for:

* **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy.
* **Data Anonymization:** Automating the process of de-identifying Hebrew documents in legal, medical, and other sensitive contexts.
* **Research:** Supporting research in Hebrew natural language processing and PII detection.
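
The masking use case can be sketched with a small helper that replaces detected character spans with placeholder tags. The function name and the `(start, end, label)` span format are illustrative assumptions, not part of the model's API:

```python
def mask_pii(text: str, spans: list) -> str:
    """Replace detected PII character spans with [LABEL] placeholders.

    spans: non-overlapping (start, end, label) character offsets.
    """
    parts = []
    last = 0
    for start, end, label in sorted(spans):
        parts.append(text[last:start])   # keep text before the entity
        parts.append(f"[{label}]")       # substitute a placeholder tag
        last = end
    parts.append(text[last:])            # keep the trailing text
    return "".join(parts)

print(mask_pii("Call Dana at 054-1234567",
               [(5, 9, "NAME"), (13, 24, "PHONE")]))  # Call [NAME] at [PHONE]
```

The same helper works unchanged for Hebrew text, since it operates on character offsets rather than words.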

## Training Parameters

* **Batch Size:** 32
* **Learning Rate:** 2e-5 with linear warmup and decay
* **Optimizer:** AdamW
* **Hardware:** Trained on a single NVIDIA A100 GPU
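
The learning-rate schedule above (2e-5 with linear warmup and decay) can be sketched as a plain function. The total and warmup step counts below are assumed example values; the README does not state them:

```python
def linear_warmup_decay_lr(step: int, total_steps: int,
                           warmup_steps: int, base_lr: float = 2e-5) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Example: 10000 training steps with 1000 warmup steps (assumed values)
print(linear_warmup_decay_lr(500, 10000, 1000))   # 1e-05 (halfway through warmup)
print(linear_warmup_decay_lr(1000, 10000, 1000))  # 2e-05 (peak)
```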

## Dataset Details

* **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus
* **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard)

## Performance Metrics

### Final Evaluation Results

…

| 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868 | 0.997747 | 0.999750 |
| 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191 | 0.998086 | 0.999795 |
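
As a quick consistency check, the F1 column in the results table is the harmonic mean of the precision and recall columns, e.g. for the final epoch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Final-epoch precision and recall from the table above
print(round(f1_score(0.997981, 0.998191), 6))  # 0.998086
```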

## Model Architecture

The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning.

## Usage
```python
import torch
# …
for token, label in zip(tokens, labels):
    print(f"Token: {token}, Label: {label}")
```
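
To turn the per-token labels printed above into entity spans, the usual approach for token-classification models is to merge BIO-tagged tokens. This is a minimal sketch, assuming the model emits standard `B-`/`I-`/`O` labels (typical for NER models, but worth confirming against `model.config.id2label`):

```python
def merge_bio_spans(tokens, labels):
    """Merge BIO-tagged tokens into (entity_type, text) spans.

    Joins tokens with spaces for readability; real subword tokens
    (e.g. SentencePiece pieces) would need tokenizer-aware joining.
    """
    spans, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):          # a new entity starts
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)    # the current entity continues
        else:                               # "O" or inconsistent tag closes any open span
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

print(merge_bio_spans(
    ["Dani", "lives", "in", "Tel", "Aviv"],
    ["B-PER", "O", "O", "B-LOC", "I-LOC"],
))  # [('PER', 'Dani'), ('LOC', 'Tel Aviv')]
```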

## License

The GolemPII-v1 model is released under the MIT License with the following additional terms:

```
MIT License

Copyright (c) 2024 Liran Baba

Permission is hereby granted, free of charge, to any person obtaining a copy
of this dataset and associated documentation files (the "Dataset"), to deal
in the Dataset without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Dataset, and to permit persons to whom the Dataset is
furnished to do so, subject to the following conditions:

1. The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Dataset.

2. Any academic or professional work that uses this Dataset must include an
appropriate citation as specified below.

THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE
DATASET.
```

### How to Cite

If you use this model in your research, project, or application, please include the following citation:

For informal usage (e.g., blog posts, documentation):

```
GolemPII-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-v1)
```