---
library_name: transformers
tags: [token-classification, ner, deberta, privacy, pii-detection]
---
# Model Card for PII Detection with DeBERTa
[Open in Colab](https://colab.research.google.com/drive/1RJMVrf8ZlbyYMabAQ2_GGm9Ln4FmMfoO)
This model is a fine-tuned version of [`microsoft/deberta-v3-base`](https://huggingface.co/microsoft/deberta-v3-base) for Named Entity Recognition (NER), specifically designed for detecting Personally Identifiable Information (PII) entities such as names, SSNs, phone numbers, credit card numbers, bank account and routing numbers, and addresses.
## Model Details
### Model Description
This transformer-based model is fine-tuned on a custom dataset to detect sensitive information, commonly categorized as PII. The model performs sequence labeling to identify entities using token-level classification.
- **Developed by:** Maskify
- **Finetuned from model:** `microsoft/deberta-v3-base`
- **Model type:** Token Classification (NER)
- **Language(s):** English
- **Use case:** PII detection in text
## Training Details
### Training Data
The model was fine-tuned on a custom dataset containing labeled examples of the following PII entity types (the sketch after the list shows how to inspect the exact tag set):
- NAME
- SSN
- PHONE-NO
- CREDIT-CARD-NO
- BANK-ACCOUNT-NO
- BANK-ROUTING-NO
- ADDRESS
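The checkpoint ships its full tag set in the model config, so the exact labels (including any B-/I- prefixes) can be inspected directly. A quick check, assuming the published checkpoint used later in this card:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("AI-Enthusiast11/pii-entity-extractor")
print(config.id2label)  # maps class ids to the tag names listed above
```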
### Epoch Logs
| Epoch | Train Loss | Val Loss | Precision | Recall | F1 | Accuracy |
|-------|------------|----------|-----------|--------|--------|----------|
| 1 | 0.3672 | 0.1987 | 0.7806 | 0.8114 | 0.7957 | 0.9534 |
| 2 | 0.1149 | 0.1011 | 0.9161 | 0.9772 | 0.9457 | 0.9797 |
| 3 | 0.0795 | 0.0889 | 0.9264 | 0.9825 | 0.9536 | 0.9813 |
| 4 | 0.0708 | 0.0880 | 0.9242 | 0.9842 | 0.9533 | 0.9806 |
| 5 | 0.0626 | 0.0858 | 0.9235 | 0.9851 | 0.9533 | 0.9806 |
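These per-epoch numbers match the usual seqeval-based `compute_metrics` hook for token classification with the 🤗 `Trainer`. A hedged sketch of what such a function typically looks like (the `label_list` here is illustrative, and this is not the exact training script):

```python
import numpy as np
from seqeval.metrics import precision_score, recall_score, f1_score, accuracy_score

# Illustrative label list; the real one comes from the training dataset / config
label_list = ["O", "B-NAME", "I-NAME", "B-SSN", "I-SSN"]  # ...and so on

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Keep only positions with a real label (-100 marks special/subword tokens)
    true_labels = [
        [label_list[l] for l in row if l != -100]
        for row in labels
    ]
    true_preds = [
        [label_list[p] for p, l in zip(p_row, l_row) if l != -100]
        for p_row, l_row in zip(predictions, labels)
    ]
    return {
        "precision": precision_score(true_labels, true_preds),
        "recall": recall_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds),
        "accuracy": accuracy_score(true_labels, true_preds),
    }
```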
## SeqEval Classification Report
| Label | Precision | Recall | F1-score | Support |
|------------------|-----------|--------|----------|---------|
| ADDRESS | 0.91 | 0.94 | 0.92 | 77 |
| BANK-ACCOUNT-NO | 0.91 | 0.99 | 0.95 | 169 |
| BANK-ROUTING-NO | 0.85 | 0.96 | 0.90 | 104 |
| CREDIT-CARD-NO | 0.95 | 1.00 | 0.97 | 228 |
| NAME | 0.98 | 0.97 | 0.97 | 164 |
| PHONE-NO | 0.94 | 0.99 | 0.96 | 308 |
| SSN | 0.87 | 1.00 | 0.93 | 90 |
### Summary
- **Micro avg F1:** 0.95
- **Macro avg F1:** 0.95
- **Weighted avg F1:** 0.95
## Evaluation
### Testing Data
Evaluation was done on a held-out portion of the same labeled dataset.
### Metrics
- Precision
- Recall
- F1 (via seqeval)
- Entity-wise breakdown
- Token-level accuracy
### Results
- Per-entity F1 ranges from 0.90 (BANK-ROUTING-NO) to 0.97 (CREDIT-CARD-NO and NAME), with an overall F1 of 0.95, indicating robust PII detection across entity types.
### Recommendations
- Use human review in high-risk environments; the sketch below shows one way to flag low-confidence predictions for review.
- Evaluate on your own domain-specific data before deployment.
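One way to act on the human-review recommendation: the aggregated pipeline output includes a `score` per entity, so low-confidence predictions can be routed to a reviewer instead of being auto-redacted. A minimal sketch (the 0.90 threshold is an assumed starting point, not from this card; calibrate it on your own data):

```python
REVIEW_THRESHOLD = 0.90  # assumed cut-off; tune on your own data

def split_by_confidence(ner_results, threshold=REVIEW_THRESHOLD):
    """Separate entities safe to auto-redact from those needing human review."""
    auto_redact, needs_review = [], []
    for entity in ner_results:
        if entity["score"] >= threshold:
            auto_redact.append(entity)
        else:
            needs_review.append(entity)
    return auto_redact, needs_review
```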
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Load the pipeline
nlp = pipeline("ner", model=model_name, tokenizer=tokenizer, aggregation_strategy="simple")

# Post-processing logic to combine subword tokens
def merge_tokens(ner_results):
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")  # Remove subword prefixes

        # Handle token merging
        if entity_type not in entities:
            entities[entity_type] = []
        if entities[entity_type] and not entity_value.startswith(" "):
            # If a previous token exists and this one doesn't start a new word, merge it
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)
    return entities

def redact_text_with_labels(text):
    ner_results = nlp(text)
    # Merge tokens for multi-token entities (if any)
    cleaned_entities = merge_tokens(ner_results)
    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            # Replace each identified entity with its label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")
    return redacted_text

# Example input
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run the pipeline and merge subword tokens
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

# Redact the example, replacing each entity with its label
redacted_example = redact_text_with_labels(example)
print(f"\n==Redacted Example:==\n{redacted_example}")
```