CordwainerSmith committed
Commit b46639e • 1 Parent(s): 2d2e7a8
Upload folder using huggingface_hub
README.md CHANGED
@@ -1,18 +1,14 @@
 ---
 language: he
 license: mit
-library_name: transformers
 tags:
 - hebrew
 - ner
 - pii-detection
 - token-classification
 - xlm-roberta
-- privacy
-- data-anonymization
-- golemguard
 datasets:
--
+- custom
 model-index:
 - name: GolemPII-xlm-roberta-v1
   results:
@@ -33,33 +29,13 @@ model-index:
 
 # GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model
 
-This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data
+This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data.
 
 ## Model Details
 - Based on xlm-roberta-base
-- Fine-tuned on
+- Fine-tuned on a custom Hebrew PII dataset
 - Optimized for token classification tasks in Hebrew text
 
-## Intended Uses & Limitations
-
-This model is intended for:
-
-* **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy.
-* **Data Anonymization:** Automating the process of de-identifying Hebrew documents in legal, medical, and other sensitive contexts.
-* **Research:** Supporting research in Hebrew natural language processing and PII detection.
-
-## Training Parameters
-
-* **Batch Size:** 32
-* **Learning Rate:** 2e-5 with linear warmup and decay.
-* **Optimizer:** AdamW
-* **Hardware:** Trained on a single NVIDIA A100GPU.
-
-## Dataset Details
-
-* **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus
-* **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard)
-
 ## Performance Metrics
 
 ### Final Evaluation Results
@@ -101,10 +77,6 @@ eval_accuracy: 0.999795
 | 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868| 0.997747 | 0.999750 |
 | 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191| 0.998086 | 0.999795 |
 
-## Model Architecture
-
-The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning.
-
 ## Usage
 ```python
 import torch
@@ -132,45 +104,8 @@ for token, label in zip(tokens, labels):
     print(f"Token: {token}, Label: {label}")
 ```
 
-
-
-
-
-
-```
-MIT License
-
-Copyright (c) 2024 Liran Baba
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this dataset and associated documentation files (the "Dataset"), to deal
-in the Dataset without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Dataset, and to permit persons to whom the Dataset is
-furnished to do so, subject to the following conditions:
-
-1. The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Dataset.
-
-2. Any academic or professional work that uses this Dataset must include an
-appropriate citation as specified below.
-
-THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE
-DATASET.
-```
-
-### How to Cite
-
-If you use this model in your research, project, or application, please include the following citation:
-
-For informal usage (e.g., blog posts, documentation):
-```
-GolemPII-xlm-roberta-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-xlm-roberta-v1)
-```
-
-
+## Training Details
+- Training epochs: 5
+- Training speed: ~2.33 it/s (7615/7615 54:29)
+- Base model: xlm-roberta-base
+- Training language: Hebrew
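Note on the Usage section: the README's Python snippet is only partially visible in this diff, since the hunks include just its unchanged first and last lines (`import torch` and the final print loop). The sketch below is a hypothetical reconstruction of how such a token-classification checkpoint is typically loaded and run with the transformers library; the example sentence, the post-processing, and the variable names are assumptions, not the model card's own code.

```python
# Hypothetical usage sketch (not the model card's exact code): load the checkpoint with
# the standard transformers token-classification classes and tag a Hebrew sentence.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "CordwainerSmith/GolemPII-xlm-roberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Illustrative Hebrew sentence containing PII-like content (name and phone number).
text = "שם המטופל: ישראל ישראלי, טלפון 050-1234567"

# Tokenize and run the model without gradient tracking.
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index of each token back to its label string.
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions]

for token, label in zip(tokens, labels):
    print(f"Token: {token}, Label: {label}")
```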
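The removed Training Parameters section and the new Training Details list together describe the fine-tuning setup: batch size 32, learning rate 2e-5 with linear warmup and decay, AdamW, 5 epochs on xlm-roberta-base. A minimal sketch of how those hyperparameters could map onto a transformers `TrainingArguments` configuration is shown below; the label list, warmup ratio, and dataset wiring are placeholders, since the card does not specify them.

```python
# Illustrative fine-tuning configuration only: it mirrors the hyperparameters stated in the
# model card, while the label set and dataset plumbing are placeholders, not the actual
# GolemGuard preprocessing pipeline.
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
)

label_list = ["O", "B-PERSON", "I-PERSON", "B-PHONE", "I-PHONE"]  # placeholder label set

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/xlm-roberta-base", num_labels=len(label_list)
)

training_args = TrainingArguments(
    output_dir="golempii-xlm-roberta-v1",
    per_device_train_batch_size=32,   # batch size 32, as listed in the card
    learning_rate=2e-5,               # 2e-5, as listed in the card
    lr_scheduler_type="linear",       # linear decay after warmup
    warmup_ratio=0.1,                 # warmup fraction is an assumption; the card only says "linear warmup"
    num_train_epochs=5,               # 5 epochs, per the Training Details list
    optim="adamw_torch",              # AdamW optimizer
)

# Wiring up the (unspecified) tokenized dataset would then look roughly like:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=..., tokenizer=tokenizer)
# trainer.train()
```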