---
library_name: transformers
tags: [token-classification, ner, deberta, privacy, pii-detection]
---
# Model Card for PII Detection with DeBERTa

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1RJMVrf8ZlbyYMabAQ2_GGm9Ln4FmMfoO)

This model is a fine-tuned version of [`microsoft/deberta-v3-base`](https://huggingface.co/microsoft/deberta-v3-base) for Named Entity Recognition (NER), designed to detect Personally Identifiable Information (PII) such as names, SSNs, phone numbers, credit card numbers, bank account and routing numbers, and addresses.
## Model Details

### Model Description
This transformer-based model is fine-tuned on a custom dataset to detect sensitive information commonly categorized as PII. It performs token-level classification (sequence labeling) to identify PII entities; a minimal usage sketch follows the list below.
- **Developed by:** Maskify
- **Finetuned from model:** [`microsoft/deberta-v3-base`](https://huggingface.co/microsoft/deberta-v3-base)
- **Model type:** Token classification (NER)
- **Language(s):** English
- **Use case:** PII detection in text
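
For a quick look at the token-level classification described above, the checkpoint can be loaded through the standard `transformers` token-classification pipeline. This is a minimal sketch; it assumes the `AI-Enthusiast11/pii-entity-extractor` repository id used in the Get Started section below, where the full post-processing example lives.

```python
from transformers import pipeline

# Minimal sketch: load the fine-tuned checkpoint as a token-classification pipeline.
pii_ner = pipeline(
    "token-classification",
    model="AI-Enthusiast11/pii-entity-extractor",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entity spans
)

# Each prediction is a dict with entity_group, score, word, start, and end.
print(pii_ner("My name is Mia Thompson and you can reach me at 727-814-3902."))
```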
## Training Details
### Training Data

The model was fine-tuned on a custom dataset containing labeled examples of the following PII entity types (an illustrative annotation sketch follows the list):

- NAME
- SSN
- PHONE-NO
- CREDIT-CARD-NO
- BANK-ACCOUNT-NO
- BANK-ROUTING-NO
- ADDRESS
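
The dataset itself is not released with this card, so the exact annotation format is not documented here. Purely as an illustration, a single labeled example under a standard BIO tagging scheme might look like the sketch below; the `tokens`/`ner_tags` field names and the exact label strings are assumptions, not the released format.

```python
# Hypothetical labeled example under a BIO scheme (field names and label strings assumed).
example = {
    "tokens":   ["Hi", ",", "I'm", "Mia", "Thompson", ",", "call", "727-814-3902", "."],
    "ner_tags": ["O",  "O", "O",   "B-NAME", "I-NAME", "O", "O",   "B-PHONE-NO",  "O"],
}
```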
### Epoch Logs

| Epoch | Train Loss | Val Loss | Precision | Recall | F1     | Accuracy |
|-------|------------|----------|-----------|--------|--------|----------|
| 1     | 0.3672     | 0.1987   | 0.7806    | 0.8114 | 0.7957 | 0.9534   |
| 2     | 0.1149     | 0.1011   | 0.9161    | 0.9772 | 0.9457 | 0.9797   |
| 3     | 0.0795     | 0.0889   | 0.9264    | 0.9825 | 0.9536 | 0.9813   |
| 4     | 0.0708     | 0.0880   | 0.9242    | 0.9842 | 0.9533 | 0.9806   |
| 5     | 0.0626     | 0.0858   | 0.9235    | 0.9851 | 0.9533 | 0.9806   |
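
The training script itself is not included in this card. For orientation only, a fine-tune like the one logged above can be set up with the standard `Trainer` API roughly as follows. This is a minimal sketch, not the authors' code: the base checkpoint and the 5-epoch schedule come from this card, while the BIO label scheme, the toy in-memory dataset, and the remaining hyperparameters are assumptions for illustration.

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# BIO label set derived from the entity types listed above.
entity_types = ["NAME", "SSN", "PHONE-NO", "CREDIT-CARD-NO",
                "BANK-ACCOUNT-NO", "BANK-ROUTING-NO", "ADDRESS"]
labels = ["O"] + [f"{prefix}-{t}" for t in entity_types for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

# Tiny stand-in for the (unreleased) custom PII dataset: pre-split words with BIO tags.
raw = Dataset.from_dict({
    "tokens": [["Call", "Mia", "Thompson", "at", "727-814-3902"]],
    "ner_tags": [["O", "B-NAME", "I-NAME", "O", "B-PHONE-NO"]],
})

def encode(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        ids, prev = [], None
        for w in word_ids:
            if w is None:
                ids.append(-100)               # special tokens are ignored by the loss
            elif w != prev:
                ids.append(label2id[tags[w]])  # label only the first sub-token of each word
            else:
                ids.append(-100)               # continuation sub-tokens are ignored
            prev = w
        enc["labels"].append(ids)
    return enc

train_ds = raw.map(encode, batched=True, remove_columns=raw.column_names)

model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id=label2id,
)

args = TrainingArguments(
    output_dir="pii-deberta",
    num_train_epochs=5,              # epoch count taken from the logs above
    per_device_train_batch_size=16,  # assumption
    learning_rate=2e-5,              # assumption
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

In the actual run, the custom PII dataset and a held-out evaluation split would take the place of the toy example above.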
## SeqEval Classification Report

| Label           | Precision | Recall | F1-score | Support |
|-----------------|-----------|--------|----------|---------|
| ADDRESS         | 0.91      | 0.94   | 0.92     | 77      |
| BANK-ACCOUNT-NO | 0.91      | 0.99   | 0.95     | 169     |
| BANK-ROUTING-NO | 0.85      | 0.96   | 0.90     | 104     |
| CREDIT-CARD-NO  | 0.95      | 1.00   | 0.97     | 228     |
| NAME            | 0.98      | 0.97   | 0.97     | 164     |
| PHONE-NO        | 0.94      | 0.99   | 0.96     | 308     |
| SSN             | 0.87      | 1.00   | 0.93     | 90      |

### Summary

- **Micro avg:** 0.95
- **Macro avg:** 0.95
- **Weighted avg:** 0.95
## Evaluation

### Testing Data

Evaluation was performed on a held-out split of the same labeled dataset.

### Metrics

- Precision
- Recall
- F1-score (via seqeval; see the sketch below)
- Entity-wise breakdown
- Token-level accuracy
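
Since the entity-level scores above are computed with seqeval, here is a minimal sketch of how such a report can be reproduced from BIO-tagged predictions. The label sequences shown are illustrative, not taken from the actual held-out split.

```python
from seqeval.metrics import classification_report, f1_score

# Illustrative gold and predicted tag sequences; in practice these come from the held-out split.
y_true = [["O", "B-NAME", "I-NAME", "O", "B-PHONE-NO"]]
y_pred = [["O", "B-NAME", "I-NAME", "O", "O"]]

print(classification_report(y_true, y_pred, digits=2))  # per-entity precision/recall/F1 plus averages
print("Micro F1:", f1_score(y_true, y_pred))
```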
### Results

- Per-entity F1-scores range from 0.90 to 0.97, with micro, macro, and weighted averages of 0.95, indicating robust PII detection across entity types.

### Recommendations

- Use human review in high-risk environments.
- Evaluate on your own domain-specific data before deployment.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Load the NER pipeline; aggregation_strategy="simple" merges sub-word pieces into whole spans.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")


# Post-processing: group the detected spans by entity type.
def merge_tokens(ner_results):
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        # The pipeline has already merged sub-word pieces; strip any stray markers and whitespace.
        entity_value = entity["word"].replace("##", "").strip()
        entities.setdefault(entity_type, []).append(entity_value)
    return entities


def redact_text_with_labels(text):
    ner_results = nlp(text)

    # Group the detected spans by entity type
    cleaned_entities = merge_tokens(ner_results)

    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            # Replace each detected span with its entity label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")

    return redacted_text


# Example input
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run the pipeline and group the results
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f" {entity_type}: {', '.join(values)}")

# Redact the example with labels
redacted_example = redact_text_with_labels(example)
print(f"\n==Redacted Example:==\n{redacted_example}")
```
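
With `aggregation_strategy="simple"` the pipeline already merges sub-word pieces into whole spans, so `merge_tokens` simply groups the detected spans by entity type. Running the script prints the detected entities grouped by type, followed by the same text with every detected span replaced by its label (for example `[NAME]`, `[BANK-ACCOUNT-NO]`, or `[PHONE-NO]`). Because the redaction relies on plain string replacement, each detected value is replaced wherever it occurs in the text; for stricter guarantees, consider redacting by the `start`/`end` character offsets returned by the pipeline.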