---
library_name: transformers
tags: [token-classification, ner, deberta, privacy, pii-detection]
---
# Model Card for PII Detection with DeBERTa

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1RJMVrf8ZlbyYMabAQ2_GGm9Ln4FmMfoO)

This model is a fine-tuned version of [`microsoft/deberta-v3-base`](https://huggingface.co/microsoft/deberta-v3-base) for Named Entity Recognition (NER), designed to detect Personally Identifiable Information (PII) such as names, SSNs, phone numbers, credit card numbers, bank account and routing numbers, and addresses.
## Model Details

### Model Description
This transformer-based model is fine-tuned on a custom dataset to detect sensitive information commonly categorized as PII. It performs token-level classification (sequence labeling) to identify PII entities; a minimal usage sketch follows the list below.
- **Developed by:** Maskify
- **Finetuned from model:** [`microsoft/deberta-v3-base`](https://huggingface.co/microsoft/deberta-v3-base)
- **Model type:** Token classification (NER)
- **Language(s):** English
- **Use case:** PII detection in text
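
For a quick look at the token-level classification described above, the checkpoint can be loaded through the standard `transformers` token-classification pipeline. This is a minimal sketch; it assumes the `AI-Enthusiast11/pii-entity-extractor` repository id used in the Get Started section below, where the full post-processing example lives.

```python
from transformers import pipeline

# Minimal sketch: load the fine-tuned checkpoint as a token-classification pipeline.
pii_ner = pipeline(
    "token-classification",
    model="AI-Enthusiast11/pii-entity-extractor",
    aggregation_strategy="simple",  # merge sub-word pieces into whole entity spans
)

# Each prediction is a dict with entity_group, score, word, start, and end.
print(pii_ner("My name is Mia Thompson and you can reach me at 727-814-3902."))
```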
## Training Details
### Training Data

The model was fine-tuned on a custom dataset containing labeled examples of the following PII entity types (an illustrative annotation sketch follows the list):

- NAME
- SSN
- PHONE-NO
- CREDIT-CARD-NO
- BANK-ACCOUNT-NO
- BANK-ROUTING-NO
- ADDRESS
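
The dataset itself is not released with this card, so the exact annotation format is not documented here. Purely as an illustration, a single labeled example under a standard BIO tagging scheme might look like the sketch below; the `tokens`/`ner_tags` field names and the exact label strings are assumptions, not the released format.

```python
# Hypothetical labeled example under a BIO scheme (field names and label strings assumed).
example = {
    "tokens":   ["Hi", ",", "I'm", "Mia", "Thompson", ",", "call", "727-814-3902", "."],
    "ner_tags": ["O",  "O", "O",   "B-NAME", "I-NAME", "O", "O",   "B-PHONE-NO",  "O"],
}
```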
### Epoch Logs

| Epoch | Train Loss | Val Loss | Precision | Recall | F1     | Accuracy |
|-------|------------|----------|-----------|--------|--------|----------|
| 1     | 0.3672     | 0.1987   | 0.7806    | 0.8114 | 0.7957 | 0.9534   |
| 2     | 0.1149     | 0.1011   | 0.9161    | 0.9772 | 0.9457 | 0.9797   |
| 3     | 0.0795     | 0.0889   | 0.9264    | 0.9825 | 0.9536 | 0.9813   |
| 4     | 0.0708     | 0.0880   | 0.9242    | 0.9842 | 0.9533 | 0.9806   |
| 5     | 0.0626     | 0.0858   | 0.9235    | 0.9851 | 0.9533 | 0.9806   |
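
The training script itself is not included in this card. For orientation only, a fine-tune like the one logged above can be set up with the standard `Trainer` API roughly as follows. This is a minimal sketch, not the authors' code: the base checkpoint and the 5-epoch schedule come from this card, while the BIO label scheme, the toy in-memory dataset, and the remaining hyperparameters are assumptions for illustration.

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# BIO label set derived from the entity types listed above.
entity_types = ["NAME", "SSN", "PHONE-NO", "CREDIT-CARD-NO",
                "BANK-ACCOUNT-NO", "BANK-ROUTING-NO", "ADDRESS"]
labels = ["O"] + [f"{prefix}-{t}" for t in entity_types for prefix in ("B", "I")]
label2id = {label: i for i, label in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

# Tiny stand-in for the (unreleased) custom PII dataset: pre-split words with BIO tags.
raw = Dataset.from_dict({
    "tokens": [["Call", "Mia", "Thompson", "at", "727-814-3902"]],
    "ner_tags": [["O", "B-NAME", "I-NAME", "O", "B-PHONE-NO"]],
})

def encode(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        ids, prev = [], None
        for w in word_ids:
            if w is None:
                ids.append(-100)               # special tokens are ignored by the loss
            elif w != prev:
                ids.append(label2id[tags[w]])  # label only the first sub-token of each word
            else:
                ids.append(-100)               # continuation sub-tokens are ignored
            prev = w
        enc["labels"].append(ids)
    return enc

train_ds = raw.map(encode, batched=True, remove_columns=raw.column_names)

model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id=label2id,
)

args = TrainingArguments(
    output_dir="pii-deberta",
    num_train_epochs=5,              # epoch count taken from the logs above
    per_device_train_batch_size=16,  # assumption
    learning_rate=2e-5,              # assumption
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

In the actual run, the custom PII dataset and a held-out evaluation split would take the place of the toy example above.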
## SeqEval Classification Report

| Label           | Precision | Recall | F1-score | Support |
|-----------------|-----------|--------|----------|---------|
| ADDRESS         | 0.91      | 0.94   | 0.92     | 77      |
| BANK-ACCOUNT-NO | 0.91      | 0.99   | 0.95     | 169     |
| BANK-ROUTING-NO | 0.85      | 0.96   | 0.90     | 104     |
| CREDIT-CARD-NO  | 0.95      | 1.00   | 0.97     | 228     |
| NAME            | 0.98      | 0.97   | 0.97     | 164     |
| PHONE-NO        | 0.94      | 0.99   | 0.96     | 308     |
| SSN             | 0.87      | 1.00   | 0.93     | 90      |

### Summary

- **Micro avg:** 0.95
- **Macro avg:** 0.95
- **Weighted avg:** 0.95
## Evaluation

### Testing Data

Evaluation was performed on a held-out split of the same labeled dataset.

### Metrics

- Precision
- Recall
- F1-score (via seqeval; see the sketch below)
- Entity-wise breakdown
- Token-level accuracy
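
Since the entity-level scores above are computed with seqeval, here is a minimal sketch of how such a report can be reproduced from BIO-tagged predictions. The label sequences shown are illustrative, not taken from the actual held-out split.

```python
from seqeval.metrics import classification_report, f1_score

# Illustrative gold and predicted tag sequences; in practice these come from the held-out split.
y_true = [["O", "B-NAME", "I-NAME", "O", "B-PHONE-NO"]]
y_pred = [["O", "B-NAME", "I-NAME", "O", "O"]]

print(classification_report(y_true, y_pred, digits=2))  # per-entity precision/recall/F1 plus averages
print("Micro F1:", f1_score(y_true, y_pred))
```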
### Results

- Per-entity F1-scores range from 0.90 to 0.97, with micro, macro, and weighted averages of 0.95, indicating robust PII detection across entity types.

### Recommendations

- Use human review in high-risk environments.
- Evaluate on your own domain-specific data before deployment.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "AI-Enthusiast11/pii-entity-extractor"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Load the NER pipeline; aggregation_strategy="simple" merges sub-word pieces into whole spans.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")


# Post-processing: group the detected spans by entity type.
def merge_tokens(ner_results):
    entities = {}
    for entity in ner_results:
        entity_type = entity["entity_group"]
        # The pipeline has already merged sub-word pieces; strip any stray markers and whitespace.
        entity_value = entity["word"].replace("##", "").strip()
        entities.setdefault(entity_type, []).append(entity_value)
    return entities


def redact_text_with_labels(text):
    ner_results = nlp(text)

    # Group the detected spans by entity type
    cleaned_entities = merge_tokens(ner_results)

    redacted_text = text
    for entity_type, values in cleaned_entities.items():
        for value in values:
            # Replace each detected span with its entity label
            redacted_text = redacted_text.replace(value, f"[{entity_type}]")

    return redacted_text


# Example input
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run the pipeline and group the results
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f" {entity_type}: {', '.join(values)}")

# Redact the example with labels
redacted_example = redact_text_with_labels(example)
print(f"\n==Redacted Example:==\n{redacted_example}")
```
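
With `aggregation_strategy="simple"` the pipeline already merges sub-word pieces into whole spans, so `merge_tokens` simply groups the detected spans by entity type. Running the script prints the detected entities grouped by type, followed by the same text with every detected span replaced by its label (for example `[NAME]`, `[BANK-ACCOUNT-NO]`, or `[PHONE-NO]`). Because the redaction relies on plain string replacement, each detected value is replaced wherever it occurs in the text; for stricter guarantees, consider redacting by the `start`/`end` character offsets returned by the pipeline.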