---
language: he
license: mit
tags:
- hebrew
- ner
- pii-detection
- token-classification
- xlm-roberta
datasets:
- custom
model-index:
- name: GolemPII-xlm-roberta-v1
  results:
  - task:
      name: Token Classification
      type: token-classification
    metrics:
    - name: F1
      type: f1
      value: 0.9982
    - name: Precision
      type: precision
      value: 0.9982
    - name: Recall
      type: recall
      value: 0.9982
---

# GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model

This model is trained to detect personally identifiable information (PII) in Hebrew text. While based on the multilingual XLM-RoBERTa model, it has been specifically fine-tuned on Hebrew data.

## Model Details
- Based on xlm-roberta-base
- Fine-tuned on a custom Hebrew PII dataset
- Optimized for token classification tasks in Hebrew text
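
A quick way to inspect the full label set the fine-tuned checkpoint predicts (the repo path below is assumed from this card's model name; adjust it if the repository lives elsewhere):

```python
# Sketch: list the entity labels shipped with the model config.
# The repo id is assumed from this card; adjust if it differs.
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "CordwainerSmith/GolemPII-xlm-roberta-v1"
)
print(model.config.id2label)  # e.g. {0: "O", 1: "B-FIRST_NAME", ...}
```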

## Performance Metrics

### Final Evaluation Results
```
eval_loss: 0.000729
eval_precision: 0.9982
eval_recall: 0.9982
eval_f1: 0.9982
eval_accuracy: 0.999795
```

### Detailed Performance by Label

| Label            | Precision | Recall | F1-Score | Support |
|------------------|-----------|--------|----------|---------|
| BANK_ACCOUNT_NUM | 1.0000    | 1.0000 | 1.0000   | 4847    |
| CC_NUM           | 1.0000    | 1.0000 | 1.0000   | 234     |
| CC_PROVIDER      | 1.0000    | 1.0000 | 1.0000   | 242     |
| CITY             | 0.9997    | 0.9995 | 0.9996   | 12237   |
| DATE             | 0.9997    | 0.9998 | 0.9997   | 11943   |
| EMAIL            | 0.9998    | 1.0000 | 0.9999   | 13235   |
| FIRST_NAME       | 0.9937    | 0.9938 | 0.9937   | 17888   |
| ID_NUM           | 0.9999    | 1.0000 | 1.0000   | 10577   |
| LAST_NAME        | 0.9928    | 0.9921 | 0.9925   | 15655   |
| PHONE_NUM        | 1.0000    | 0.9998 | 0.9999   | 20838   |
| POSTAL_CODE      | 0.9998    | 0.9999 | 0.9999   | 13321   |
| STREET           | 0.9999    | 0.9999 | 0.9999   | 14032   |
| micro avg        | 0.9982    | 0.9982 | 0.9982   | 135049  |
| macro avg        | 0.9988    | 0.9987 | 0.9988   | 135049  |
| weighted avg     | 0.9982    | 0.9982 | 0.9982   | 135049  |
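
Reports in this shape are typically produced with seqeval; a minimal sketch with toy BIO-tagged sequences (this is not necessarily the exact evaluation script used for this model):

```python
# Sketch: per-label entity report with seqeval's classification_report.
# y_true / y_pred are illustrative toy sequences, not the real eval data.
from seqeval.metrics import classification_report

y_true = [["B-FIRST_NAME", "I-FIRST_NAME", "O", "B-CITY", "O"]]
y_pred = [["B-FIRST_NAME", "I-FIRST_NAME", "O", "B-CITY", "O"]]

print(classification_report(y_true, y_pred, digits=4))
```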

### Training Progress

| Epoch | Training Loss | Validation Loss | Precision | Recall   | F1       | Accuracy |
|-------|---------------|-----------------|-----------|----------|----------|----------|
| 1     | 0.005800      | 0.002487        | 0.993109  | 0.993678 | 0.993393 | 0.999328 |
| 2     | 0.001700      | 0.001385        | 0.995469  | 0.995947 | 0.995708 | 0.999575 |
| 3     | 0.001200      | 0.000946        | 0.997159  | 0.997487 | 0.997323 | 0.999739 |
| 4     | 0.000900      | 0.000896        | 0.997626  | 0.997868 | 0.997747 | 0.999750 |
| 5     | 0.000600      | 0.000729        | 0.997981  | 0.998191 | 0.998086 | 0.999795 |

## Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("CordwainerSmith/GolemPII-xlm-roberta-v1")
model = AutoModelForTokenClassification.from_pretrained("CordwainerSmith/GolemPII-xlm-roberta-v1")

# Example text (Hebrew): "Hello, my name is David Cohen and I live at
# 42 Herzl Street in Tel Aviv. My phone is 050-1234567"
text = "שלום, שמי דוד כהן ואני גר ברחוב הרצל 42 בתל אביב. הטלפון שלי הוא 050-1234567"

# Tokenize and get per-token predictions
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.argmax(outputs.logits, dim=2)

# Convert predicted label ids to label names
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[t.item()] for t in predictions[0]]

# Print results, skipping special tokens and non-entity labels.
# Note: XLM-RoBERTa uses SentencePiece ("▁" marks word starts), so a
# WordPiece-style "##" check would never match here.
special_tokens = set(tokenizer.all_special_tokens)
for token, label in zip(tokens, labels):
    if label != "O" and token not in special_tokens:
        print(f"Token: {token}, Label: {label}")
```
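
For many applications it is simpler to let the transformers pipeline API merge subword pieces into whole entity spans. A variant of the example above (same assumed repo id; `aggregation_strategy="simple"` groups contiguous subwords into one entity):

```python
# Sketch: span-level PII extraction via the token-classification pipeline.
# The repo id is assumed from this card; adjust if it differs.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="CordwainerSmith/GolemPII-xlm-roberta-v1",
    aggregation_strategy="simple",  # merge subword pieces into spans
)

# "My phone is 050-1234567"
for entity in ner("הטלפון שלי הוא 050-1234567"):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```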

## Training Details
- Training epochs: 5
- Training speed: ~2.33 it/s (7,615 steps in 54:29)
- Base model: xlm-roberta-base
- Training language: Hebrew
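
For context, a minimal sketch of training arguments consistent with the details above (5 epochs on xlm-roberta-base); batch size, learning rate, and weight decay are illustrative assumptions, not published values:

```python
# Sketch only: the card states the base model and epoch count; the other
# hyperparameters here are assumptions for illustration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="golempii-xlm-roberta-v1",
    num_train_epochs=5,              # stated in this card
    per_device_train_batch_size=16,  # assumption
    learning_rate=2e-5,              # assumption
    weight_decay=0.01,               # assumption
)
```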