---
library_name: transformers
tags: [token-classification, ner, deberta, privacy, pii-detection]
---

# Model Card for PII Detection with DeBERTa
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1RJMVrf8ZlbyYMabAQ2_GGm9Ln4FmMfoO)
This model is a fine-tuned version of [`microsoft/deberta-v3-base`](https://huggingface.co/microsoft/deberta-v3-base) for Named Entity Recognition (NER), specifically designed to detect Personally Identifiable Information (PII) entities such as names, SSNs, phone numbers, credit card numbers, bank account details, and addresses.

## Model Details

### Model Description

This transformer-based model is fine-tuned on a custom dataset to detect sensitive information, commonly categorized as PII. The model performs sequence labeling to identify entities using token-level classification.

- **Developed by:** Maskify
- **Finetuned from model:** `microsoft/deberta-v3-base`
- **Model type:** Token Classification (NER)
- **Language(s):** English
- **Use case:** PII detection in text

## Training Details

### Training Data
The model was fine-tuned on a custom dataset containing labeled examples of the following PII entity types:

- NAME
- SSN
- PHONE-NO
- CREDIT-CARD-NO
- BANK-ACCOUNT-NO
- BANK-ROUTING-NO
- ADDRESS
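
Token-level classification over these entity types typically uses a BIO tag set: an `O` tag for non-PII tokens plus `B-`/`I-` tags marking the beginning and inside of each entity span (the checkpoint's actual mapping can be read from `model.config.id2label`). A sketch of the resulting label space, assuming standard BIO tagging:

```python
# Build the BIO label space for the PII entity types listed above.
# (Assumes standard BIO tagging; the checkpoint's real mapping is in model.config.id2label.)
ENTITY_TYPES = [
    "NAME", "SSN", "PHONE-NO", "CREDIT-CARD-NO",
    "BANK-ACCOUNT-NO", "BANK-ROUTING-NO", "ADDRESS",
]

labels = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

print(len(labels))  # 15: O plus B-/I- for each of the 7 entity types
```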


### Epoch Logs

| Epoch | Train Loss | Val Loss | Precision | Recall | F1     | Accuracy |
|-------|------------|----------|-----------|--------|--------|----------|
| 1     | 0.3672     | 0.1987   | 0.7806    | 0.8114 | 0.7957 | 0.9534   |
| 2     | 0.1149     | 0.1011   | 0.9161    | 0.9772 | 0.9457 | 0.9797   |
| 3     | 0.0795     | 0.0889   | 0.9264    | 0.9825 | 0.9536 | 0.9813   |
| 4     | 0.0708     | 0.0880   | 0.9242    | 0.9842 | 0.9533 | 0.9806   |
| 5     | 0.0626     | 0.0858   | 0.9235    | 0.9851 | 0.9533 | 0.9806   |
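
The F1 column above is the harmonic mean of the precision and recall columns; for example, for epoch 2:

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Epoch-2 values from the table above
print(round(f1(0.9161, 0.9772), 4))  # 0.9457, matching the table
```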

## SeqEval Classification Report

| Label            | Precision | Recall | F1-score | Support |
|------------------|-----------|--------|----------|---------|
| ADDRESS          | 0.91      | 0.94   | 0.92     | 77      |
| BANK-ACCOUNT-NO  | 0.91      | 0.99   | 0.95     | 169     |
| BANK-ROUTING-NO  | 0.85      | 0.96   | 0.90     | 104     |
| CREDIT-CARD-NO   | 0.95      | 1.00   | 0.97     | 228     |
| NAME             | 0.98      | 0.97   | 0.97     | 164     |
| PHONE-NO         | 0.94      | 0.99   | 0.96     | 308     |
| SSN              | 0.87      | 1.00   | 0.93     | 90      |

### Summary
- **Micro avg:** 0.95
- **Macro avg:** 0.95
- **Weighted avg:** 0.95
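
Note that seqeval scores predictions at the entity level, not the token level: a predicted span counts as correct only when both its boundaries and its type match the gold span exactly. A simplified pure-Python sketch of that span-matching logic (illustrative helpers, not the actual seqeval implementation):

```python
def bio_spans(tags):
    """Extract (entity_type, start, end) spans from a BIO tag sequence."""
    spans, etype, start = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes a trailing span
        # A B- tag, or an I- tag whose type disagrees with the open span, starts a new span
        if tag.startswith("B-") or (tag.startswith("I-") and etype != tag[2:]):
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = tag[2:], i
        elif tag == "O" and etype is not None:
            spans.append((etype, start, i))
            etype = None
    return spans

def span_f1(true_tags, pred_tags):
    """Entity-level F1: only exact (type, start, end) matches count as true positives."""
    gold, pred = set(bio_spans(true_tags)), set(bio_spans(pred_tags))
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```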

## Evaluation

### Testing Data
Evaluation was done on a held-out portion of the same labeled dataset.

### Metrics
- Precision
- Recall
- F1 (via seqeval)
- Entity-wise breakdown
- Token-level accuracy

### Results
- Entity-level F1-scores range from 0.90 to 0.97 across labels, with micro-, macro-, and weighted-average F1 of 0.95, showing robust PII detection.
### Recommendations

- Use human review in high-risk environments.
- Evaluate on your own domain-specific data before deployment.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Load the NER pipeline; "simple" aggregation groups subword tokens
# into whole-entity predictions.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Post-processing: collect entity values by type, merging adjacent
# subword fragments that aggregation left split.
def merge_tokens(ner_results):
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        entity_value = entity["word"].replace("##", "")  # strip any subword prefixes

        entities.setdefault(entity_type, [])
        if entities[entity_type] and not entity_value.startswith(" "):
            # Continuation of the previous fragment: merge it into the last value
            entities[entity_type][-1] += entity_value
        else:
            entities[entity_type].append(entity_value)

    return entities

def redact_text_with_labels(text):
    ner_results = nlp(text)

    # Merge tokens for multi-token entities (if any)
    cleaned_entities = merge_tokens(ner_results)

    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            # Replace each detected entity value with its label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")

    return redacted_text

# Example input
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run the pipeline and group the predictions
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n== NER Results ==\n")
for entity_type, values in cleaned_entities.items():
    print(f"  {entity_type}: {', '.join(values)}")

# Redact the example, replacing each entity with its label
redacted_example = redact_text_with_labels(example)
print(f"\n== Redacted Example ==\n{redacted_example}")
```