---
library_name: transformers
tags: [token-classification, ner, deberta, privacy, pii-detection]
---

# Model Card for PII Detection with DeBERTa
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1RJMVrf8ZlbyYMabAQ2_GGm9Ln4FmMfoO)
This model is a fine-tuned version of [`microsoft/deberta-v3-base`](https://huggingface.co/microsoft/deberta-v3-base) for Named Entity Recognition (NER), specifically designed to detect Personally Identifiable Information (PII) entities such as names, SSNs, phone numbers, credit card numbers, bank account details, and addresses.

## Model Details

### Model Description

This transformer-based model is fine-tuned on a custom dataset to detect sensitive information, commonly categorized as PII. The model performs sequence labeling to identify entities using token-level classification.

- **Developed by:** Maskify
- **Finetuned from model:** `microsoft/deberta-v3-base`
- **Model type:** Token Classification (NER)
- **Language(s):** English
- **Use case:** PII detection in text

## Training Details

### Training Data
The model was fine-tuned on a custom dataset containing labeled examples of the following PII entity types:

- NAME
- SSN
- PHONE-NO
- CREDIT-CARD-NO
- BANK-ACCOUNT-NO
- BANK-ROUTING-NO
- ADDRESS
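
Token-level classification over these entity types typically uses a BIO tag set: an `O` tag for non-PII tokens plus `B-`/`I-` tags marking the beginning and inside of each entity span (the checkpoint's actual mapping can be read from `model.config.id2label`). A sketch of the resulting label space, assuming standard BIO tagging:

```python
# Build the BIO label space for the PII entity types listed above.
# (Assumes standard BIO tagging; the checkpoint's real mapping is in model.config.id2label.)
ENTITY_TYPES = [
    "NAME", "SSN", "PHONE-NO", "CREDIT-CARD-NO",
    "BANK-ACCOUNT-NO", "BANK-ROUTING-NO", "ADDRESS",
]

labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 15: O plus B-/I- for each of the 7 entity types
```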


### Epoch Logs

| Epoch | Train Loss | Val Loss | Precision | Recall | F1     | Accuracy |
|-------|------------|----------|-----------|--------|--------|----------|
| 1     | 0.3672     | 0.1987   | 0.7806    | 0.8114 | 0.7957 | 0.9534   |
| 2     | 0.1149     | 0.1011   | 0.9161    | 0.9772 | 0.9457 | 0.9797   |
| 3     | 0.0795     | 0.0889   | 0.9264    | 0.9825 | 0.9536 | 0.9813   |
| 4     | 0.0708     | 0.0880   | 0.9242    | 0.9842 | 0.9533 | 0.9806   |
| 5     | 0.0626     | 0.0858   | 0.9235    | 0.9851 | 0.9533 | 0.9806   |
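
The F1 column above is the harmonic mean of the precision and recall columns; for example, for epoch 2:

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Epoch-2 values from the table above
print(round(f1(0.9161, 0.9772), 4))  # 0.9457, matching the table
```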

## SeqEval Classification Report

| Label            | Precision | Recall | F1-score | Support |
|------------------|-----------|--------|----------|---------|
| ADDRESS          | 0.91      | 0.94   | 0.92     | 77      |
| BANK-ACCOUNT-NO  | 0.91      | 0.99   | 0.95     | 169     |
| BANK-ROUTING-NO  | 0.85      | 0.96   | 0.90     | 104     |
| CREDIT-CARD-NO   | 0.95      | 1.00   | 0.97     | 228     |
| NAME             | 0.98      | 0.97   | 0.97     | 164     |
| PHONE-NO         | 0.94      | 0.99   | 0.96     | 308     |
| SSN              | 0.87      | 1.00   | 0.93     | 90      |

### Summary
- **Micro avg:** 0.95
- **Macro avg:** 0.95
- **Weighted avg:** 0.95
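
Note that seqeval scores predictions at the entity level, not the token level: a predicted span counts as correct only when both its boundaries and its type match the gold span exactly. A simplified pure-Python sketch of that span-matching logic (illustrative helpers, not the actual seqeval implementation):

```python
def bio_spans(tags):
    """Extract (entity_type, start, end) spans from a BIO tag sequence."""
    spans, etype, start = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes a trailing span
        # A B- tag, or an I- tag whose type disagrees with the open span, starts a new span
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = tag[2:], i
        elif tag == "O" and etype is not None:
            spans.append((etype, start, i))
            etype = None
    return spans

def span_f1(true_tags, pred_tags):
    """Entity-level F1: only exact (type, start, end) matches count as true positives."""
    gold, pred = set(bio_spans(true_tags)), set(bio_spans(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```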

## Evaluation

### Testing Data
Evaluation was done on a held-out portion of the same labeled dataset.

### Metrics
- Precision
- Recall
- F1 (via seqeval)
- Entity-wise breakdown
- Token-level accuracy

### Results
- Entity-level F1-scores range from 0.90 to 0.97 across labels, with micro-, macro-, and weighted-average F1 of 0.95, showing robust PII detection.
### Recommendations

- Use human review in high-risk environments.
- Evaluate on your own domain-specific data before deployment.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Load the NER pipeline; "simple" aggregation groups subword tokens
# into whole-entity predictions.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Post-processing: collect entity values by type, merging adjacent
# subword fragments that aggregation left split.
def merge_tokens(ner_results):
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")  # strip any subword prefixes

        entities.setdefault(entity_type, [])
        if entities[entity_type] and not entity_value.startswith(" "):
            # Continuation of the previous fragment: merge it into the last value
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)

    return entities

def redact_text_with_labels(text):
    ner_results = nlp(text)

    # Merge tokens for multi-token entities (if any)
    cleaned_entities = merge_tokens(ner_results)

    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            # Replace each detected entity value with its label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")

    return redacted_text

# Example input
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run the pipeline and group the predictions
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n== NER Results ==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

# Redact the example, replacing each entity with its label
redacted_example = redact_text_with_labels(example)
print(f"\n== Redacted Example ==\n{redacted_example}")
```