CordwainerSmith committed
Commit b46639e • 1 Parent(s): 2d2e7a8
Upload folder using huggingface_hub
README.md CHANGED
@@ -1,18 +1,14 @@
 ---
 language: he
 license: mit
-library_name: transformers
 tags:
 - hebrew
 - ner
 - pii-detection
 - token-classification
 - xlm-roberta
-- privacy
-- data-anonymization
-- golemguard
 datasets:
--
+- custom
 model-index:
 - name: GolemPII-xlm-roberta-v1
   results:
@@ -33,33 +29,13 @@ model-index:
 
 # GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model
 
-This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data
+This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data.
 
 ## Model Details
 - Based on xlm-roberta-base
-- Fine-tuned on
+- Fine-tuned on a custom Hebrew PII dataset
 - Optimized for token classification tasks in Hebrew text
 
-## Intended Uses & Limitations
-
-This model is intended for:
-
-* **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy.
-* **Data Anonymization:** Automating the process of de-identifying Hebrew documents in legal, medical, and other sensitive contexts.
-* **Research:** Supporting research in Hebrew natural language processing and PII detection.
-
-## Training Parameters
-
-* **Batch Size:** 32
-* **Learning Rate:** 2e-5 with linear warmup and decay.
-* **Optimizer:** AdamW
-* **Hardware:** Trained on a single NVIDIA A100GPU.
-
-## Dataset Details
-
-* **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus
-* **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard)
-
 ## Performance Metrics
 
 ### Final Evaluation Results
@@ -101,10 +77,6 @@ eval_accuracy: 0.999795
 | 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868| 0.997747 | 0.999750 |
 | 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191| 0.998086 | 0.999795 |
 
-## Model Architecture
-
-The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning.
-
 ## Usage
 ```python
 import torch
@@ -132,45 +104,8 @@ for token, label in zip(tokens, labels):
     print(f"Token: {token}, Label: {label}")
 ```
 
-
-
-
-
-
-```
-MIT License
-
-Copyright (c) 2024 Liran Baba
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this dataset and associated documentation files (the "Dataset"), to deal
-in the Dataset without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Dataset, and to permit persons to whom the Dataset is
-furnished to do so, subject to the following conditions:
-
-1. The above copyright notice and this permission notice shall be included in all
-copies or substantial portions of the Dataset.
-
-2. Any academic or professional work that uses this Dataset must include an
-appropriate citation as specified below.
-
-THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE
-DATASET.
-```
-
-### How to Cite
-
-If you use this model in your research, project, or application, please include the following citation:
-
-For informal usage (e.g., blog posts, documentation):
-```
-GolemPII-xlm-roberta-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-xlm-roberta-v1)
-```
-
-
+## Training Details
+- Training epochs: 5
+- Training speed: ~2.33 it/s (7615/7615 54:29)
+- Base model: xlm-roberta-base
+- Training language: Hebrew
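Note on the Usage section: the README's Python snippet is only partially visible in this diff, since the hunks include just its unchanged first and last lines (`import torch` and the final print loop). The sketch below is a hypothetical reconstruction of how such a token-classification checkpoint is typically loaded and run with the transformers library; the example sentence, the post-processing, and the variable names are assumptions, not the model card's own code.

```python
# Hypothetical usage sketch (not the model card's exact code): load the checkpoint with
# the standard transformers token-classification classes and tag a Hebrew sentence.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "CordwainerSmith/GolemPII-xlm-roberta-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

# Illustrative Hebrew sentence containing PII-like content (name and phone number).
text = "שם המטופל: ישראל ישראלי, טלפון 050-1234567"

# Tokenize and run the model without gradient tracking.
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring class index of each token back to its label string.
predictions = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[p.item()] for p in predictions]

for token, label in zip(tokens, labels):
    print(f"Token: {token}, Label: {label}")
```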
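The removed Training Parameters section and the new Training Details list together describe the fine-tuning setup: batch size 32, learning rate 2e-5 with linear warmup and decay, AdamW, 5 epochs on xlm-roberta-base. A minimal sketch of how those hyperparameters could map onto a transformers `TrainingArguments` configuration is shown below; the label list, warmup ratio, and dataset wiring are placeholders, since the card does not specify them.

```python
# Illustrative fine-tuning configuration only: it mirrors the hyperparameters stated in the
# model card, while the label set and dataset plumbing are placeholders, not the actual
# GolemGuard preprocessing pipeline.
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    TrainingArguments,
    Trainer,
)

label_list = ["O", "B-PERSON", "I-PERSON", "B-PHONE", "I-PHONE"]  # placeholder label set

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/xlm-roberta-base", num_labels=len(label_list)
)

training_args = TrainingArguments(
    output_dir="golempii-xlm-roberta-v1",
    per_device_train_batch_size=32,   # batch size 32, as listed in the card
    learning_rate=2e-5,               # 2e-5, as listed in the card
    lr_scheduler_type="linear",       # linear decay after warmup
    warmup_ratio=0.1,                 # warmup fraction is an assumption; the card only says "linear warmup"
    num_train_epochs=5,               # 5 epochs, per the Training Details list
    optim="adamw_torch",              # AdamW optimizer
)

# Wiring up the (unspecified) tokenized dataset would then look roughly like:
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=..., tokenizer=tokenizer)
# trainer.train()
```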