---
language: he
license: mit
library_name: transformers
tags:
- hebrew
- ner
- pii-detection
- token-classification
- xlm-roberta
- privacy
- data-anonymization
- golemguard
datasets:
- CordwainerSmith/GolemGuard
model-index:
- name: GolemPII-v1
  results:
  - task:
      name: Token Classification
      type: token-classification
    metrics:
    - name: F1
      type: f1
      value: 0.9982
    - name: Precision
      type: precision
      value: 0.9982
    - name: Recall
      type: recall
      value: 0.9982
---

# GolemPII-v1 - Hebrew PII Detection Model

This model detects personally identifiable information (PII) in Hebrew text. Although it is based on the multilingual XLM-RoBERTa model, it has been fine-tuned specifically on Hebrew data to identify and classify several types of PII with high accuracy.

## Model Details

- Based on `xlm-roberta-base`
- Fine-tuned on the GolemGuard: Hebrew Privacy Information Detection Corpus
- Optimized for token classification tasks in Hebrew text

## Intended Uses & Limitations

This model is intended for:

* **Privacy Protection:** Detecting and masking PII in Hebrew text to protect individual privacy.
* **Data Anonymization:** Automating the de-identification of Hebrew documents in legal, medical, and other sensitive contexts.
* **Research:** Supporting research in Hebrew natural language processing and PII detection.
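
The masking step mentioned in the list above can be illustrated without loading the model. The following is a minimal sketch, assuming detected entities are already available as non-overlapping `(start, end, label)` character spans; the `mask_spans` helper and the span format are hypothetical and are not part of this model's API.

```python
from typing import Iterable, Tuple

# Hypothetical span format: (start offset, end offset, entity label)
Span = Tuple[int, int, str]


def mask_spans(text: str, spans: Iterable[Span]) -> str:
    """Replace each character span with a "[LABEL]" placeholder.

    Assumes spans do not overlap. Spans are applied from right to left
    so that earlier offsets stay valid while the string changes length.
    """
    masked = text
    for start, end, label in sorted(spans, key=lambda s: s[0], reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


# Usage with a hand-written span (not real model output)
sample = "הטלפון שלי הוא 050-1234567"
phone = "050-1234567"
start = sample.index(phone)
print(mask_spans(sample, [(start, start + len(phone), "PHONE_NUM")]))
```

Working on character offsets rather than tokens keeps the original spacing and punctuation intact around each replaced entity.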

## Training Parameters

* **Batch Size:** 32
* **Learning Rate:** 2e-5 with linear warmup and decay.
* **Optimizer:** AdamW
* **Hardware:** Trained on a single NVIDIA A100 GPU.
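
To make "linear warmup and decay" concrete, here is a small sketch of such a schedule in plain Python. The peak rate of 2e-5 comes from the list above; the step counts are illustrative assumptions, since this card does not state the number of warmup steps or total training steps.

```python
def linear_warmup_decay_lr(step: int, total_steps: int, warmup_steps: int,
                           peak_lr: float = 2e-5) -> float:
    """Learning rate at a given optimizer step.

    Rises linearly from 0 to peak_lr over warmup_steps, then falls
    linearly back to 0 at total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative values only: 1000 total steps, 100 warmup steps (assumed)
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay_lr(step, total_steps=1000, warmup_steps=100))
```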

## Dataset Details

* **Dataset Name:** GolemGuard: Hebrew Privacy Information Detection Corpus
* **Dataset Link:** [https://huggingface.co/datasets/CordwainerSmith/GolemGuard](https://huggingface.co/datasets/CordwainerSmith/GolemGuard)

## Performance Metrics

### Final Evaluation Results
```
eval_loss: 0.000729
eval_precision: 0.9982
eval_recall: 0.9982
eval_f1: 0.9982
eval_accuracy: 0.999795
```

### Detailed Performance by Label

| Label            | Precision | Recall | F1-Score | Support |
|------------------|-----------|--------|----------|---------|
| BANK_ACCOUNT_NUM | 1.0000    | 1.0000 | 1.0000   | 4847    |
| CC_NUM           | 1.0000    | 1.0000 | 1.0000   | 234     |
| CC_PROVIDER      | 1.0000    | 1.0000 | 1.0000   | 242     |
| CITY             | 0.9997    | 0.9995 | 0.9996   | 12237   |
| DATE             | 0.9997    | 0.9998 | 0.9997   | 11943   |
| EMAIL            | 0.9998    | 1.0000 | 0.9999   | 13235   |
| FIRST_NAME       | 0.9937    | 0.9938 | 0.9937   | 17888   |
| ID_NUM           | 0.9999    | 1.0000 | 1.0000   | 10577   |
| LAST_NAME        | 0.9928    | 0.9921 | 0.9925   | 15655   |
| PHONE_NUM        | 1.0000    | 0.9998 | 0.9999   | 20838   |
| POSTAL_CODE      | 0.9998    | 0.9999 | 0.9999   | 13321   |
| STREET           | 0.9999    | 0.9999 | 0.9999   | 14032   |
| micro avg        | 0.9982    | 0.9982 | 0.9982   | 135049  |
| macro avg        | 0.9988    | 0.9987 | 0.9988   | 135049  |
| weighted avg     | 0.9982    | 0.9982 | 0.9982   | 135049  |
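
The macro and weighted rows can be approximately reproduced from the per-label rows, which helps explain why they differ: the macro average gives every label equal weight, while the weighted average scales each label by its support, so the lower-scoring but frequent `FIRST_NAME` and `LAST_NAME` labels pull it down. The sketch below uses the rounded values from the table, so its results should match the reported rows only to within rounding (roughly 1e-4); the helper names are illustrative.

```python
# Per-label rows copied from the table above: (precision, recall, f1, support)
rows = {
    "BANK_ACCOUNT_NUM": (1.0000, 1.0000, 1.0000, 4847),
    "CC_NUM": (1.0000, 1.0000, 1.0000, 234),
    "CC_PROVIDER": (1.0000, 1.0000, 1.0000, 242),
    "CITY": (0.9997, 0.9995, 0.9996, 12237),
    "DATE": (0.9997, 0.9998, 0.9997, 11943),
    "EMAIL": (0.9998, 1.0000, 0.9999, 13235),
    "FIRST_NAME": (0.9937, 0.9938, 0.9937, 17888),
    "ID_NUM": (0.9999, 1.0000, 1.0000, 10577),
    "LAST_NAME": (0.9928, 0.9921, 0.9925, 15655),
    "PHONE_NUM": (1.0000, 0.9998, 0.9999, 20838),
    "POSTAL_CODE": (0.9998, 0.9999, 0.9999, 13321),
    "STREET": (0.9999, 0.9999, 0.9999, 14032),
}


def macro_average(values):
    """Unweighted mean: every label counts equally."""
    return sum(values) / len(values)


def weighted_average(values, weights):
    """Mean weighted by support: frequent labels count more."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


supports = [row[3] for row in rows.values()]
total_support = sum(supports)

# Index 0 = precision, 1 = recall, 2 = F1
macro = [macro_average([row[i] for row in rows.values()]) for i in range(3)]
weighted = [weighted_average([row[i] for row in rows.values()], supports)
            for i in range(3)]

print("total support:", total_support)
print("macro avg:   ", [f"{m:.4f}" for m in macro])
print("weighted avg:", [f"{w:.4f}" for w in weighted])
```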

### Training Progress

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       | Accuracy |
|-------|---------------|-----------------|-----------|----------|----------|----------|
| 1     | 0.005800      | 0.002487        | 0.993109  | 0.993678 | 0.993393 | 0.999328 |
| 2     | 0.001700      | 0.001385        | 0.995469  | 0.995947 | 0.995708 | 0.999575 |
| 3     | 0.001200      | 0.000946        | 0.997159  | 0.997487 | 0.997323 | 0.999739 |
| 4     | 0.000900      | 0.000896        | 0.997626  | 0.997868 | 0.997747 | 0.999750 |
| 5     | 0.000600      | 0.000729        | 0.997981  | 0.998191 | 0.998086 | 0.999795 |

## Model Architecture

The model is based on the `FacebookAI/xlm-roberta-base` architecture, a transformer-based language model pre-trained on a massive multilingual dataset. No architectural modifications were made to the base model during fine-tuning.

## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Repository ID taken from the citation URL in this model card
repo_id = "CordwainerSmith/GolemPII-v1"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForTokenClassification.from_pretrained(repo_id)

# Example text (Hebrew): "Hello, my name is David Cohen and I live at
# 42 Herzl Street in Tel Aviv. My phone number is 050-1234567"
text = "שלום, שמי דוד כהן ואני גר ברחוב הרצל 42 בתל אביב. הטלפון שלי הוא 050-1234567"

# Tokenize and get predictions
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=2)

# Convert predictions to labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[t.item()] for t in predictions[0]]

# Print results (excluding special tokens and non-entity labels)
for token, label in zip(tokens, labels):
    if label != "O" and token not in tokenizer.all_special_tokens:
        print(f"Token: {token}, Label: {label}")
```
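
The loop above prints one line per subword token, because XLM-RoBERTa uses a SentencePiece tokenizer that splits words into pieces and marks the start of each word with `▁`. A possible way to regroup those pieces into whole words is sketched below. The `merge_subwords` helper is hypothetical, takes the label of the first piece as the word's label (a simplifying assumption), and treats label strings as opaque, so it should work with whatever labels `model.config.id2label` contains.

```python
from typing import List, Sequence, Tuple

WORD_START = "\u2581"  # SentencePiece word-boundary marker "▁"


def merge_subwords(
    tokens: Sequence[str],
    labels: Sequence[str],
    special_tokens: Sequence[str] = ("<s>", "</s>", "<pad>"),
) -> List[Tuple[str, str]]:
    """Group subword tokens into (word, label) pairs.

    A token starting with "▁" opens a new word; any other token is
    appended to the current word. The word keeps the label of its
    first piece. Special tokens are skipped.
    """
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token in special_tokens:
            continue
        if token.startswith(WORD_START) or not words:
            words.append((token.lstrip(WORD_START), label))
        else:
            text, first_label = words[-1]
            words[-1] = (text + token, first_label)
    # Drop empty entries left by a bare "▁" with nothing after it
    return [(text, label) for text, label in words if text]


# Hand-written tokens and labels for illustration (not real model output)
demo_tokens = ["<s>", "▁Tel", "▁Av", "iv", "▁050", "-", "123", "</s>"]
demo_labels = ["O", "CITY", "CITY", "CITY", "PHONE_NUM", "PHONE_NUM", "PHONE_NUM", "O"]
print(merge_subwords(demo_tokens, demo_labels))
```

With a loaded tokenizer, passing `special_tokens=tokenizer.all_special_tokens` avoids relying on the default tuple.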

## License

The GolemPII-v1 model is released under the MIT License with the following additional terms:

```
MIT License

Copyright (c) 2024 Liran Baba

Permission is hereby granted, free of charge, to any person obtaining a copy
of this dataset and associated documentation files (the "Dataset"), to deal
in the Dataset without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Dataset, and to permit persons to whom the Dataset is
furnished to do so, subject to the following conditions:

1. The above copyright notice and this permission notice shall be included in all
   copies or substantial portions of the Dataset.

2. Any academic or professional work that uses this Dataset must include an
   appropriate citation as specified below.

THE DATASET IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE DATASET OR THE USE OR OTHER DEALINGS IN THE
DATASET.
```

### How to Cite

If you use this model in your research, project, or application, please include the following citation:

For informal usage (e.g., blog posts, documentation):
```
GolemPII-v1 model by Liran Baba (https://huggingface.co/CordwainerSmith/GolemPII-v1)
```