---
language: he
license: mit
tags:
- hebrew
- ner
- pii-detection
- token-classification
- xlm-roberta
datasets:
- custom
model-index:
- name: GolemPII-xlm-roberta-v1
  results:
  - task:
      name: Token Classification
      type: token-classification
    metrics:
    - name: F1
      type: f1
      value: 0.9982
    - name: Precision
      type: precision
      value: 0.9982
    - name: Recall
      type: recall
      value: 0.9982
---

# GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model

This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been fine-tuned specifically on Hebrew data.

## Model Details
- Based on xlm-roberta-base
- Fine-tuned on a custom Hebrew PII dataset
- Optimized for token classification tasks in Hebrew text

## Performance Metrics

### Final Evaluation Results
```
eval_loss: 0.000729
eval_precision: 0.9982
eval_recall: 0.9982
eval_f1: 0.9982
eval_accuracy: 0.999795
```

### Detailed Performance by Label

| Label            | Precision | Recall | F1-Score | Support |
|------------------|-----------|--------|----------|---------|
| BANK_ACCOUNT_NUM | 1.0000    | 1.0000 | 1.0000   | 4847    |
| CC_NUM           | 1.0000    | 1.0000 | 1.0000   | 234     |
| CC_PROVIDER      | 1.0000    | 1.0000 | 1.0000   | 242     |
| CITY             | 0.9997    | 0.9995 | 0.9996   | 12237   |
| DATE             | 0.9997    | 0.9998 | 0.9997   | 11943   |
| EMAIL            | 0.9998    | 1.0000 | 0.9999   | 13235   |
| FIRST_NAME       | 0.9937    | 0.9938 | 0.9937   | 17888   |
| ID_NUM           | 0.9999    | 1.0000 | 1.0000   | 10577   |
| LAST_NAME        | 0.9928    | 0.9921 | 0.9925   | 15655   |
| PHONE_NUM        | 1.0000    | 0.9998 | 0.9999   | 20838   |
| POSTAL_CODE      | 0.9998    | 0.9999 | 0.9999   | 13321   |
| STREET           | 0.9999    | 0.9999 | 0.9999   | 14032   |
| micro avg        | 0.9982    | 0.9982 | 0.9982   | 135049  |
| macro avg        | 0.9988    | 0.9987 | 0.9988   | 135049  |
| weighted avg     | 0.9982    | 0.9982 | 0.9982   | 135049  |
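
The report above follows the format of an entity-level `classification_report` from the `seqeval` library, the standard tool for scoring token-classification models. A minimal sketch of producing such a report; the evaluation data is not part of this repository, and the BIO-style tag names below are an assumption:

```python
# pip install seqeval
from seqeval.metrics import classification_report

# Toy gold/predicted tag sequences in BIO format; the real evaluation set
# behind the table above is not included here.
y_true = [["B-FIRST_NAME", "I-FIRST_NAME", "O", "B-CITY", "O"]]
y_pred = [["B-FIRST_NAME", "I-FIRST_NAME", "O", "B-CITY", "O"]]

# Prints per-label precision/recall/F1/support plus micro, macro, and
# weighted averages, matching the columns shown above.
print(classification_report(y_true, y_pred, digits=4))
```
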
### Training Progress

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       | Accuracy |
|-------|---------------|-----------------|-----------|----------|----------|----------|
| 1     | 0.005800      | 0.002487        | 0.993109  | 0.993678 | 0.993393 | 0.999328 |
| 2     | 0.001700      | 0.001385        | 0.995469  | 0.995947 | 0.995708 | 0.999575 |
| 3     | 0.001200      | 0.000946        | 0.997159  | 0.997487 | 0.997323 | 0.999739 |
| 4     | 0.000900      | 0.000896        | 0.997626  | 0.997868 | 0.997747 | 0.999750 |
| 5     | 0.000600      | 0.000729        | 0.997981  | 0.998191 | 0.998086 | 0.999795 |
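
The once-per-epoch evaluation pattern in this table matches epoch-level evaluation in the `transformers` Trainer. A sketch of training arguments that would reproduce it; only the epoch count and base model are documented, so the learning rate and batch size below are illustrative assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="golempii-xlm-roberta-v1",
    num_train_epochs=5,              # documented above
    learning_rate=2e-5,              # assumption: common fine-tuning value
    per_device_train_batch_size=16,  # assumption: not documented
    evaluation_strategy="epoch",     # evaluate once per epoch, as in the table
    logging_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
```
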
## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "CordwainerSmith/GolemPII-xlm-roberta-v1"  # repo id inferred from this model card
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Example text (Hebrew): "Hello, my name is David Cohen and I live at
# 42 Herzl Street in Tel Aviv. My phone number is 050-1234567"
text = "שלום, שמי דוד כהן ואני גר ברחוב הרצל 42 בתל אביב. הטלפון שלי הוא 050-1234567"

# Tokenize and get predictions
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Convert predictions to labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[t.item()] for t in predictions[0]]

# Print results, skipping special tokens and non-entity labels.
# Note: XLM-RoBERTa uses a SentencePiece tokenizer, which marks word starts
# with a leading "▁" rather than the BERT-style "##" continuation prefix.
for token, label in zip(tokens, labels):
    if label != "O" and token not in tokenizer.all_special_tokens:
        print(f"Token: {token}, Label: {label}")
```
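
The loop above prints individual subword pieces. To get merged entity spans, and to redact them, which is the usual end goal of PII detection, the `transformers` token-classification pipeline can aggregate subwords into complete entities. A sketch, assuming the model's `id2label` mapping uses a BIO scheme the pipeline can aggregate:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="CordwainerSmith/GolemPII-xlm-roberta-v1",
    aggregation_strategy="simple",  # merge subword pieces into entity spans
)

text = "שלום, שמי דוד כהן ואני גר ברחוב הרצל 42 בתל אביב"
entities = ner(text)
for ent in entities:
    print(ent["entity_group"], ent["word"], round(float(ent["score"]), 3))

# Redact detected PII by replacing each span with its label, working
# right-to-left over character offsets so earlier offsets stay valid.
redacted = text
for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
    redacted = redacted[:ent["start"]] + f"[{ent['entity_group']}]" + redacted[ent["end"]:]
print(redacted)
```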

## Training Details
- Training epochs: 5
- Training speed: ~2.33 it/s (7,615 steps in 54:29)
- Base model: xlm-roberta-base
- Training language: Hebrew