---
language: he
license: mit
tags:
- hebrew
- ner
- pii-detection
- token-classification
- xlm-roberta
datasets:
- custom
model-index:
- name: GolemPII-xlm-roberta-v1
  results:
  - task:
      name: Token Classification
      type: token-classification
    metrics:
    - name: F1
      type: f1
      value: 0.9982
    - name: Precision
      type: precision
      value: 0.9982
    - name: Recall
      type: recall
      value: 0.9982
---

# GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model

This model detects personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been fine-tuned specifically on Hebrew data.

## Model Details

- Based on xlm-roberta-base
- Fine-tuned on a custom Hebrew PII dataset
- Optimized for token classification in Hebrew text

## Performance Metrics

### Final Evaluation Results

```
eval_loss: 0.000729
eval_precision: 0.9982
eval_recall: 0.9982
eval_f1: 0.9982
eval_accuracy: 0.999795
```

### Detailed Performance by Label

| Label            | Precision | Recall | F1-Score | Support |
|------------------|-----------|--------|----------|---------|
| BANK_ACCOUNT_NUM | 1.0000    | 1.0000 | 1.0000   | 4847    |
| CC_NUM           | 1.0000    | 1.0000 | 1.0000   | 234     |
| CC_PROVIDER      | 1.0000    | 1.0000 | 1.0000   | 242     |
| CITY             | 0.9997    | 0.9995 | 0.9996   | 12237   |
| DATE             | 0.9997    | 0.9998 | 0.9997   | 11943   |
| EMAIL            | 0.9998    | 1.0000 | 0.9999   | 13235   |
| FIRST_NAME       | 0.9937    | 0.9938 | 0.9937   | 17888   |
| ID_NUM           | 0.9999    | 1.0000 | 1.0000   | 10577   |
| LAST_NAME        | 0.9928    | 0.9921 | 0.9925   | 15655   |
| PHONE_NUM        | 1.0000    | 0.9998 | 0.9999   | 20838   |
| POSTAL_CODE      | 0.9998    | 0.9999 | 0.9999   | 13321   |
| STREET           | 0.9999    | 0.9999 | 0.9999   | 14032   |
| micro avg        | 0.9982    | 0.9982 | 0.9982   | 135049  |
| macro avg        | 0.9988    | 0.9987 | 0.9988   | 135049  |
| weighted avg     | 0.9982    | 0.9982 | 0.9982   | 135049  |

### Training Progress

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       | Accuracy |
|-------|---------------|-----------------|-----------|----------|----------|----------|
| 1     | 0.005800      | 0.002487        | 0.993109  | 0.993678 | 0.993393 | 0.999328 |
| 2     | 0.001700      | 0.001385        | 0.995469  | 0.995947 | 0.995708 | 0.999575 |
| 3     | 0.001200      | 0.000946        | 0.997159  | 0.997487 | 0.997323 | 0.999739 |
| 4     | 0.000900      | 0.000896        | 0.997626  | 0.997868 | 0.997747 | 0.999750 |
| 5     | 0.000600      | 0.000729        | 0.997981  | 0.998191 | 0.998086 | 0.999795 |

## Usage

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "{repo_id}"  # placeholder: replace with this model's Hub repository id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example text (Hebrew): "Hello, my name is David Cohen and I live at
# 42 Herzl Street in Tel Aviv. My phone number is 050-1234567"
text = "שלום, שמי דוד כהן ואני גר ברחוב הרצל 42 בתל אביב. הטלפון שלי הוא 050-1234567"

# Tokenize and get predictions
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Convert predicted ids to label strings
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[t.item()] for t in predictions[0]]

# Print entity tokens. Note: XLM-RoBERTa uses a SentencePiece tokenizer,
# so word starts carry a "▁" prefix; there is no BERT-style "##" marker.
for token, label in zip(tokens, labels):
    if token in tokenizer.all_special_tokens:
        continue
    if label != "O":
        print(f"Token: {token}, Label: {label}")
```

For aggregated entity spans rather than per-token output, see the `pipeline`-based sketch at the end of this card.

## Training Details

- Training epochs: 5
- Training speed: ~2.33 it/s (7615 steps in 54:29)
- Base model: xlm-roberta-base
- Training language: Hebrew

A hypothetical sketch of this fine-tuning setup also appears at the end of this card.
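
## Span-Level Extraction with `pipeline`

The usage example above prints one label per subword token. For redaction or masking workflows it is usually more convenient to work with aggregated entity spans. Below is a minimal sketch using the `transformers` token-classification `pipeline`; it assumes the model's label scheme lets `aggregation_strategy="simple"` merge adjacent pieces into one span, and `"{repo_id}"` is again a placeholder for this model's Hub repository id.

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="{repo_id}",  # placeholder: replace with this model's Hub repository id
    aggregation_strategy="simple",  # merge subword pieces into entity spans
)

text = "שלום, שמי דוד כהן ואני גר ברחוב הרצל 42 בתל אביב. הטלפון שלי הוא 050-1234567"
for entity in ner(text):
    # Each result carries the span text, its label, a confidence score,
    # and character offsets into the original string.
    print(f"{entity['entity_group']}: {entity['word']} "
          f"(score={entity['score']:.3f}) [{entity['start']}:{entity['end']}]")
```

The `start`/`end` character offsets make it straightforward to replace each detected span with a placeholder such as `[PHONE_NUM]` in the original text.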
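
## Fine-Tuning Sketch

The custom Hebrew PII dataset is not published, so the exact training script cannot be reproduced here. As a rough illustration of the setup described above (xlm-roberta-base fine-tuned for 5 epochs of token classification), here is a minimal sketch using the `transformers` `Trainer`. The label list is truncated, the toy dataset is a hypothetical stand-in, and every hyperparameter except the epoch count is an assumption rather than a documented value.

```python
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Hypothetical: the real label set covers the 12 entity types listed above,
# presumably with B-/I- prefixes for a BIO tagging scheme.
label_list = ["O", "B-PHONE_NUM", "I-PHONE_NUM"]  # truncated for illustration

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id={l: i for i, l in enumerate(label_list)},
)

def encode(words, tags):
    """Tokenize a pre-split sentence and align word-level tags to subwords."""
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for wid in enc.word_ids():
        if wid is None or wid == prev:
            labels.append(-100)  # ignore special tokens and continuation pieces
        else:
            labels.append(label_list.index(tags[wid]))
        prev = wid
    enc["labels"] = labels
    return enc

# Toy stand-in for the (unpublished) custom dataset.
train_dataset = [encode(["הטלפון", "שלי", "050-1234567"], ["O", "O", "B-PHONE_NUM"])]

args = TrainingArguments(
    output_dir="golempii-xlm-roberta",
    num_train_epochs=5,              # matches the card
    learning_rate=2e-5,              # assumed, not stated in the card
    per_device_train_batch_size=16,  # assumed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```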