---
language: he
license: mit
tags:
- hebrew
- ner
- pii-detection
- token-classification
- xlm-roberta
datasets:
- custom
model-index:
- name: GolemPII-xlm-roberta-v1
  results:
  - task:
      name: Token Classification
      type: token-classification
    metrics:
    - name: F1
      type: f1
      value: 0.9982
    - name: Precision
      type: precision
      value: 0.9982
    - name: Recall
      type: recall
      value: 0.9982
---
# GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model
This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data.
## Model Details

- Based on `xlm-roberta-base`
- Fine-tuned on a custom Hebrew PII dataset
- Optimized for token classification in Hebrew text (the label set can be inspected as sketched below)
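
Because the usage example below reads entity tags from `model.config.id2label`, the full tag inventory can be listed straight from the checkpoint config, without loading the model weights. A minimal sketch, using the same `{repo_id}` placeholder as the Usage section:

```python
from transformers import AutoConfig

# Replace {repo_id} with this model's Hugging Face repo ID
config = AutoConfig.from_pretrained("{repo_id}")

# Print the token-classification label set (id -> tag name)
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```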
## Performance Metrics

### Final Evaluation Results

| Metric | Value |
|---|---|
| eval_loss | 0.000729 |
| eval_precision | 0.9982 |
| eval_recall | 0.9982 |
| eval_f1 | 0.9982 |
| eval_accuracy | 0.999795 |
### Detailed Performance by Label

| Label | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| BANK_ACCOUNT_NUM | 1.0000 | 1.0000 | 1.0000 | 4847 |
| CC_NUM | 1.0000 | 1.0000 | 1.0000 | 234 |
| CC_PROVIDER | 1.0000 | 1.0000 | 1.0000 | 242 |
| CITY | 0.9997 | 0.9995 | 0.9996 | 12237 |
| DATE | 0.9997 | 0.9998 | 0.9997 | 11943 |
| EMAIL | 0.9998 | 1.0000 | 0.9999 | 13235 |
| FIRST_NAME | 0.9937 | 0.9938 | 0.9937 | 17888 |
| ID_NUM | 0.9999 | 1.0000 | 1.0000 | 10577 |
| LAST_NAME | 0.9928 | 0.9921 | 0.9925 | 15655 |
| PHONE_NUM | 1.0000 | 0.9998 | 0.9999 | 20838 |
| POSTAL_CODE | 0.9998 | 0.9999 | 0.9999 | 13321 |
| STREET | 0.9999 | 0.9999 | 0.9999 | 14032 |
| micro avg | 0.9982 | 0.9982 | 0.9982 | 135049 |
| macro avg | 0.9988 | 0.9987 | 0.9988 | 135049 |
| weighted avg | 0.9982 | 0.9982 | 0.9982 | 135049 |
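
A per-label breakdown in this format is what `seqeval`'s `classification_report` produces for BIO-tagged sequences. The sketch below is illustrative only: the tag sequences are made up, and the card does not state which tool generated the table above.

```python
from seqeval.metrics import classification_report

# Toy gold and predicted BIO tag sequences; a real evaluation would use
# the held-out set's labels and the model's predictions
y_true = [["B-FIRST_NAME", "I-FIRST_NAME", "O", "B-CITY", "O"]]
y_pred = [["B-FIRST_NAME", "I-FIRST_NAME", "O", "B-CITY", "O"]]

# Prints precision/recall/F1/support per entity type, plus the
# micro, macro, and weighted averages shown in the table above
print(classification_report(y_true, y_pred, digits=4))
```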
## Training Progress

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.005800 | 0.002487 | 0.993109 | 0.993678 | 0.993393 | 0.999328 |
| 2 | 0.001700 | 0.001385 | 0.995469 | 0.995947 | 0.995708 | 0.999575 |
| 3 | 0.001200 | 0.000946 | 0.997159 | 0.997487 | 0.997323 | 0.999739 |
| 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868 | 0.997747 | 0.999750 |
| 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191 | 0.998086 | 0.999795 |
## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Replace {repo_id} with this model's Hugging Face repo ID
tokenizer = AutoTokenizer.from_pretrained("{repo_id}")
model = AutoModelForTokenClassification.from_pretrained("{repo_id}")

# Example text (Hebrew): "Hello, my name is David Cohen and I live at
# 42 Herzl Street in Tel Aviv. My phone number is 050-1234567"
text = "שלום, שמי דוד כהן ואני גר ברחוב הרצל 42 בתל אביב. הטלפון שלי הוא 050-1234567"

# Tokenize and get predictions
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Convert predicted label IDs to label names
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[t.item()] for t in predictions[0]]

# Print results, skipping non-entity labels. XLM-RoBERTa uses a
# SentencePiece tokenizer, so word-initial pieces start with "▁";
# this check also filters out special tokens and sub-word continuations.
for token, label in zip(tokens, labels):
    if label != "O" and token.startswith("▁"):
        print(f"Token: {token}, Label: {label}")
```
## Training Details

- Training epochs: 5
- Training speed: ~2.33 it/s (7,615 steps in 54:29)
- Base model: `xlm-roberta-base`
- Training language: Hebrew
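
The training script itself is not part of this card. The sketch below shows one plausible `Trainer` setup consistent with the details above (5 epochs on `xlm-roberta-base`); the label list is reconstructed from the entity types in the metrics table, and the dataset handling and remaining hyperparameters are placeholders, not the actual configuration.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# BIO label list reconstructed from the entity types reported above
entity_types = [
    "BANK_ACCOUNT_NUM", "CC_NUM", "CC_PROVIDER", "CITY", "DATE", "EMAIL",
    "FIRST_NAME", "ID_NUM", "LAST_NAME", "PHONE_NUM", "POSTAL_CODE", "STREET",
]
labels = ["O"] + [f"{p}-{e}" for e in entity_types for p in ("B", "I")]
id2label = dict(enumerate(labels))
label2id = {label: i for i, label in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id,
)

args = TrainingArguments(
    output_dir="golempii-xlm-roberta",  # hypothetical output path
    num_train_epochs=5,                 # matches the training log above
    learning_rate=2e-5,                 # placeholder value
    per_device_train_batch_size=16,     # placeholder value
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    # train_dataset / eval_dataset would be the custom Hebrew PII dataset,
    # tokenized with word-aligned BIO labels; it is not published here
)
# trainer.train()
```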