---
language: he
license: mit
library_name: transformers
tags:
- hebrew
- ner
- pii-detection
- token-classification
- xlm-roberta
- privacy
- data-anonymization
- golemguard
datasets:
- CordwainerSmith/GolemGuard
model-index:
- name: GolemPII-v1
  results:
  - task:
      name: Token Classification
      # …
      value: 0.9982
---

# GolemPII-v1 - Hebrew PII Detection Model

This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data to achieve high accuracy in identifying and classifying various types of PII.

## Model Details
- Based on xlm-roberta-base
- Fine-tuned on the GolemGuard: Hebrew Privacy Information Detection Corpus
- Optimized for token classification tasks in Hebrew text

## Intended Uses & Limitations

This model is intended for:

* **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy.
* **Data Anonymization:** Automating the process of de-identifying Hebrew documents in legal, medical, and other sensitive contexts.
* **Research:** Supporting research in Hebrew natural language processing and PII detection.
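
The masking use case can be sketched with a small helper that replaces detected character spans with placeholder tags. The function name and the `(start, end, label)` span format are illustrative assumptions, not part of the model's API:

```python
def mask_pii(text: str, spans: list) -> str:
    """Replace detected PII character spans with [LABEL] placeholders.

    spans: non-overlapping (start, end, label) character offsets.
    """
    parts = []
    last = 0
    for start, end, label in sorted(spans):
        parts.append(text[last:start])   # keep text before the entity
        parts.append(f"[{label}]")       # substitute a placeholder tag
        last = end
    parts.append(text[last:])            # keep the trailing text
    return "".join(parts)

print(mask_pii("Call Dana at 054-1234567",
               [(5, 9, "NAME"), (13, 24, "PHONE")]))  # Call [NAME] at [PHONE]
```

The same helper works unchanged for Hebrew text, since it operates on character offsets rather than words.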

## Training Parameters

* **Batch Size:** 32
* **Learning Rate:** 2e-5 with linear warmup and decay
* **Optimizer:** AdamW
* **Hardware:** Trained on a single NVIDIA A100 GPU
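
The learning-rate schedule above (2e-5 with linear warmup and decay) can be sketched as a plain function. The total and warmup step counts below are assumed example values; the README does not state them:

```python
def linear_warmup_decay_lr(step: int, total_steps: int,
                           warmup_steps: int, base_lr: float = 2e-5) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Example: 10000 training steps with 1000 warmup steps (assumed values)
print(linear_warmup_decay_lr(500, 10000, 1000))   # 1e-05 (halfway through warmup)
print(linear_warmup_decay_lr(1000, 10000, 1000))  # 2e-05 (peak)
```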

## Dataset Details

* **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus
* **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard)

## Performance Metrics

### Final Evaluation Results

…

| 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868 | 0.997747 | 0.999750 |
| 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191 | 0.998086 | 0.999795 |
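
As a quick consistency check, the F1 column in the results table is the harmonic mean of the precision and recall columns, e.g. for the final epoch:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Final-epoch precision and recall from the table above
print(round(f1_score(0.997981, 0.998191), 6))  # 0.998086
```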

## Model Architecture

The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning.

## Usage
```python
import torch
# …
for token, label in zip(tokens, labels):
    print(f"Token: {token}, Label: {label}")
```
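
To turn the per-token labels printed above into entity spans, the usual approach for token-classification models is to merge BIO-tagged tokens. This is a minimal sketch, assuming the model emits standard `B-`/`I-`/`O` labels (typical for NER models, but worth confirming against `model.config.id2label`):

```python
def merge_bio_spans(tokens, labels):
    """Merge BIO-tagged tokens into (entity_type, text) spans.

    Joins tokens with spaces for readability; real subword tokens
    (e.g. SentencePiece pieces) would need tokenizer-aware joining.
    """
    spans, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):          # a new entity starts
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)    # the current entity continues
        else:                               # "O" or inconsistent tag closes any open span
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

print(merge_bio_spans(
    ["Dani", "lives", "in", "Tel", "Aviv"],
    ["B-PER", "O", "O", "B-LOC", "I-LOC"],
))  # [('PER', 'Dani'), ('LOC', 'Tel Aviv')]
```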

## License

The GolemPII-v1 model is released under the MIT License with the following additional terms:

```
MIT License

Copyright (c) 2024 Liran Baba

Permission is hereby granted, free of charge, to any person obtaining a copy
of this dataset and associated documentation files (the "Dataset"), to deal
in the Dataset without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Dataset, and to permit persons to whom the Dataset is
furnished to do so, subject to the following conditions:

1. The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Dataset.

2. Any academic or professional work that uses this Dataset must include an
appropriate citation as specified below.

THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE
DATASET.
```

### How to Cite

If you use this model in your research, project, or application, please include the following citation:

For informal usage (e.g., blog posts, documentation):

```
GolemPII-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-v1)
```