---
language: he
license: mit
tags:
- hebrew
- ner
- pii-detection
- token-classification
- xlm-roberta
datasets:
- custom
model-index:
- name: GolemPII-xlm-roberta-v1
  results:
  - task:
      name: Token Classification
      type: token-classification
    metrics:
    - name: F1
      type: f1
      value: 0.9982
    - name: Precision
      type: precision
      value: 0.9982
    - name: Recall
      type: recall
      value: 0.9982
---

# GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model

This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data.

## Model Details
- Based on xlm-roberta-base
- Fine-tuned on a custom Hebrew PII dataset
- Optimized for token classification tasks in Hebrew text
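
A quick way to inspect the full label set the fine-tuned checkpoint predicts (the repo path below is assumed from this card's model name; adjust it if the repository lives elsewhere):

```python
# Sketch: list the entity labels shipped with the model config.
# The repo id is assumed from this card; adjust if it differs.
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "CordwainerSmith/GolemPII-xlm-roberta-v1"
)
print(model.config.id2label)  # e.g. {0: "O", 1: "B-FIRST_NAME", ...}
```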

## Performance Metrics

### Final Evaluation Results
```
eval_loss: 0.000729
eval_precision: 0.9982
eval_recall: 0.9982
eval_f1: 0.9982
eval_accuracy: 0.999795
```

### Detailed Performance by Label

| Label            | Precision | Recall | F1-Score | Support |
|------------------|-----------|--------|----------|---------|
| BANK_ACCOUNT_NUM | 1.0000    | 1.0000 | 1.0000   | 4847    |
| CC_NUM           | 1.0000    | 1.0000 | 1.0000   | 234     |
| CC_PROVIDER      | 1.0000    | 1.0000 | 1.0000   | 242     |
| CITY             | 0.9997    | 0.9995 | 0.9996   | 12237   |
| DATE             | 0.9997    | 0.9998 | 0.9997   | 11943   |
| EMAIL            | 0.9998    | 1.0000 | 0.9999   | 13235   |
| FIRST_NAME       | 0.9937    | 0.9938 | 0.9937   | 17888   |
| ID_NUM           | 0.9999    | 1.0000 | 1.0000   | 10577   |
| LAST_NAME        | 0.9928    | 0.9921 | 0.9925   | 15655   |
| PHONE_NUM        | 1.0000    | 0.9998 | 0.9999   | 20838   |
| POSTAL_CODE      | 0.9998    | 0.9999 | 0.9999   | 13321   |
| STREET           | 0.9999    | 0.9999 | 0.9999   | 14032   |
| micro avg        | 0.9982    | 0.9982 | 0.9982   | 135049  |
| macro avg        | 0.9988    | 0.9987 | 0.9988   | 135049  |
| weighted avg     | 0.9982    | 0.9982 | 0.9982   | 135049  |
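
Reports in this shape are typically produced with seqeval; a minimal sketch with toy BIO-tagged sequences (this is not necessarily the exact evaluation script used for this model):

```python
# Sketch: per-label entity report with seqeval's classification_report.
# y_true / y_pred are illustrative toy sequences, not the real eval data.
from seqeval.metrics import classification_report

y_true = [["B-FIRST_NAME", "I-FIRST_NAME", "O", "B-CITY", "O"]]
y_pred = [["B-FIRST_NAME", "I-FIRST_NAME", "O", "B-CITY", "O"]]

print(classification_report(y_true, y_pred, digits=4))
```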

### Training Progress

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       | Accuracy |
|-------|---------------|-----------------|-----------|----------|----------|----------|
| 1     | 0.005800      | 0.002487        | 0.993109  | 0.993678 | 0.993393 | 0.999328 |
| 2     | 0.001700      | 0.001385        | 0.995469  | 0.995947 | 0.995708 | 0.999575 |
| 3     | 0.001200      | 0.000946        | 0.997159  | 0.997487 | 0.997323 | 0.999739 |
| 4     | 0.000900      | 0.000896        | 0.997626  | 0.997868 | 0.997747 | 0.999750 |
| 5     | 0.000600      | 0.000729        | 0.997981  | 0.998191 | 0.998086 | 0.999795 |

## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("CordwainerSmith/GolemPII-xlm-roberta-v1")
model = AutoModelForTokenClassification.from_pretrained("CordwainerSmith/GolemPII-xlm-roberta-v1")

# Example text (Hebrew): "Hello, my name is David Cohen and I live at
# 42 Herzl Street in Tel Aviv. My phone is 050-1234567"
text = "שלום, שמי דוד כהן ואני גר ברחוב הרצל 42 בתל אביב. הטלפון שלי הוא 050-1234567"

# Tokenize and get per-token predictions
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Convert predicted label ids to label names
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[t.item()] for t in predictions[0]]

# Print results, skipping special tokens and non-entity labels.
# Note: XLM-RoBERTa uses SentencePiece ("▁" marks word starts), so a
# WordPiece-style "##" check would never match here.
special_tokens = set(tokenizer.all_special_tokens)
for token, label in zip(tokens, labels):
    if label != "O" and token not in special_tokens:
        print(f"Token: {token}, Label: {label}")
```
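
For many applications it is simpler to let the transformers pipeline API merge subword pieces into whole entity spans. A variant of the example above (same assumed repo id; `aggregation_strategy="simple"` groups contiguous subwords into one entity):

```python
# Sketch: span-level PII extraction via the token-classification pipeline.
# The repo id is assumed from this card; adjust if it differs.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="CordwainerSmith/GolemPII-xlm-roberta-v1",
    aggregation_strategy="simple",  # merge subword pieces into spans
)

# "My phone is 050-1234567"
for entity in ner("הטלפון שלי הוא 050-1234567"):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```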

## Training Details
- Training epochs: 5
- Training speed: ~2.33 it/s (7,615 steps in 54:29)
- Base model: xlm-roberta-base
- Training language: Hebrew
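
For context, a minimal sketch of training arguments consistent with the details above (5 epochs on xlm-roberta-base); batch size, learning rate, and weight decay are illustrative assumptions, not published values:

```python
# Sketch only: the card states the base model and epoch count; the other
# hyperparameters here are assumptions for illustration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="golempii-xlm-roberta-v1",
    num_train_epochs=5,              # stated in this card
    per_device_train_batch_size=16,  # assumption
    learning_rate=2e-5,              # assumption
    weight_decay=0.01,               # assumption
)
```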