CordwainerSmith committed on
Commit b46639e
1 Parent(s): 2d2e7a8

Upload folder using huggingface_hub

Files changed (1)
  1. README.md +8 -73
README.md CHANGED
@@ -1,18 +1,14 @@
  ---
  language: he
  license: mit
- library_name: transformers
  tags:
  - hebrew
  - ner
  - pii-detection
  - token-classification
  - xlm-roberta
- - privacy
- - data-anonymization
- - golemguard
  datasets:
- - CordwainerSmith/GolemGuard
+ - custom
  model-index:
  - name: GolemPII-xlm-roberta-v1
    results:
@@ -33,33 +29,13 @@ model-index:

  # GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model

- This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data to achieve high accuracy in identifying and classifying various types of PII.
+ This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data.

  ## Model Details
  - Based on xlm-roberta-base
- - Fine-tuned on the GolemGuard: Hebrew Privacy Information Detection Corpus
+ - Fine-tuned on a custom Hebrew PII dataset
  - Optimized for token classification tasks in Hebrew text

- ## Intended Uses & Limitations
-
- This model is intended for:
-
- * **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy.
- * **Data Anonymization:** Automating the process of de-identifying Hebrew documents in legal, medical, and other sensitive contexts.
- * **Research:** Supporting research in Hebrew natural language processing and PII detection.
-
- ## Training Parameters
-
- * **Batch Size:** 32
- * **Learning Rate:** 2e-5 with linear warmup and decay.
- * **Optimizer:** AdamW
- * **Hardware:** Trained on a single NVIDIA A100 GPU.
-
- ## Dataset Details
-
- * **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus
- * **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard)
-
  ## Performance Metrics

  ### Final Evaluation Results
@@ -101,10 +77,6 @@ eval_accuracy: 0.999795
  | 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868 | 0.997747 | 0.999750 |
  | 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191 | 0.998086 | 0.999795 |

- ## Model Architecture
-
- The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning.
-
  ## Usage
  ```python
  import torch
@@ -132,45 +104,8 @@ for token, label in zip(tokens, labels):
      print(f"Token: {token}, Label: {label}")
  ```

-
- ## License
-
- The GolemPII-xlm-roberta-v1 model is released under the MIT License with the following additional terms:
-
- ```
- MIT License
-
- Copyright (c) 2024 Liran Baba
-
- Permission is hereby granted, free of charge, to any person obtaining a copy
- of this dataset and associated documentation files (the "Dataset"), to deal
- in the Dataset without restriction, including without limitation the rights
- to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
- copies of the Dataset, and to permit persons to whom the Dataset is
- furnished to do so, subject to the following conditions:
-
- 1. The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Dataset.
-
- 2. Any academic or professional work that uses this Dataset must include an
- appropriate citation as specified below.
-
- THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
- FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
- AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
- LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
- OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE
- DATASET.
- ```
-
- ### How to Cite
-
- If you use this model in your research, project, or application, please include the following citation:
-
- For informal usage (e.g., blog posts, documentation):
- ```
- GolemPII-xlm-roberta-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-xlm-roberta-v1)
- ```
-
-
+ ## Training Details
+ - Training epochs: 5
+ - Training speed: ~2.33 it/s (7615/7615 54:29)
+ - Base model: xlm-roberta-base
+ - Training language: Hebrew
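The README's Usage snippet appears only partially in the hunks above: the diff shows `import torch` and the final print loop, and elides everything in between. As a point of reference, here is a minimal sketch of what token-classification inference with this model typically looks like; the model id comes from the card, but the intermediate lines are an assumption, not the README's exact code:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed reconstruction: the README's own snippet is elided in this diff.
model_name = "CordwainerSmith/GolemPII-xlm-roberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

text = "ישראל ישראלי גר ברחוב הרצל 5, תל אביב"  # example Hebrew input
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each token to its highest-scoring label
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions]

for token, label in zip(tokens, labels):
    print(f"Token: {token}, Label: {label}")
```

Note that this prints a label for every word piece, including special tokens such as `<s>` and `</s>`; a production caller would normally filter those out.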
 
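The eval_precision, eval_recall, and eval_f1 figures in the Performance Metrics table are the entity-level scores conventionally computed for NER with a library such as seqeval (whether this model's training script used it is not shown in the diff). A toy illustration with invented label names, since the model's actual tag set does not appear here:

```python
from seqeval.metrics import f1_score, precision_score, recall_score

# Hypothetical BIO tags; the model's real label set is not shown in this diff.
y_true = [["O", "B-PER", "I-PER", "O", "B-LOC"]]
y_pred = [["O", "B-PER", "I-PER", "O", "O"]]

print(precision_score(y_true, y_pred))  # 1.0   (1 predicted entity, 1 correct)
print(recall_score(y_true, y_pred))     # 0.5   (2 gold entities, 1 recovered)
print(f1_score(y_true, y_pred))         # ~0.667 (harmonic mean of the two)
```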
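The removed Training Parameters section (batch size 32, learning rate 2e-5 with linear warmup and decay, AdamW, a single A100) together with the new Training Details (5 epochs) map onto a standard `transformers` fine-tuning configuration. A sketch of how those hyperparameters would look as `TrainingArguments`; the actual training script is not part of this commit, and the warmup ratio and output path are assumptions:

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters described in the README, not the author's script.
training_args = TrainingArguments(
    output_dir="golempii-xlm-roberta-v1",  # assumed output path
    num_train_epochs=5,                    # Training epochs: 5
    per_device_train_batch_size=32,        # Batch Size: 32
    learning_rate=2e-5,                    # Learning Rate: 2e-5
    lr_scheduler_type="linear",            # linear decay after warmup
    warmup_ratio=0.1,                      # warmup fraction is an assumption
    optim="adamw_torch",                   # Optimizer: AdamW
)
```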
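The removed Intended Uses section described masking PII for anonymization. One way to get from token-level predictions to masked text is the `transformers` pipeline with entity aggregation; a sketch assuming the model's labels follow the usual BIO scheme, where the `mask_pii` helper is illustrative and not part of the model card:

```python
from transformers import pipeline

# Aggregated token-classification pipeline: word pieces are merged into entity spans.
ner = pipeline(
    "token-classification",
    model="CordwainerSmith/GolemPII-xlm-roberta-v1",
    aggregation_strategy="simple",
)

def mask_pii(text: str) -> str:
    """Replace each detected PII span with its label, working right to left
    so that earlier character offsets stay valid."""
    for ent in sorted(ner(text), key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"] :]
    return text

print(mask_pii("ישראל ישראלי גר ברחוב הרצל 5, תל אביב"))
```

Deleting spans from right to left is the simplest way to avoid recomputing offsets after each replacement; a streaming anonymizer would batch inputs instead of calling the pipeline per document.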