---
datasets:
- ai4privacy/pii-masking-400k
metrics:
- accuracy
- recall
- precision
- f1
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: token-classification
tags:
- pii
- privacy
- personal
- identification
---
# 🐟 PII-RANHA: Privacy-Preserving Token Classification Model
## Overview
PII-RANHA is a fine-tuned token classification model based on **ModernBERT-base** from Answer.AI. It is designed to identify and classify Personally Identifiable Information (PII) in text data. The model is trained on the `ai4privacy/pii-masking-400k` dataset and can detect 17 different PII categories, such as account numbers, credit card numbers, email addresses, and more.
This model is intended for privacy-preserving applications, such as data anonymization, redaction, or compliance with data protection regulations.
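For redaction use cases, the character offsets returned by a transformers token-classification pipeline can drive a simple masking pass. A minimal sketch (the `redact` helper and the sample spans are illustrative, not part of the model):

```python
def redact(text, entities, mask="[REDACTED]"):
    """Replace each detected PII span with a mask token.

    `entities` is a list of dicts with integer `start`/`end` offsets,
    as returned by a token-classification pipeline run with an
    aggregation strategy enabled.
    """
    # Process spans right-to-left so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[:ent["start"]] + mask + text[ent["end"]:]
    return text

# Hypothetical example: one detected email span
sample = "My email is john.doe@example.com."
spans = [{"start": 12, "end": 32}]
print(redact(sample, spans))  # My email is [REDACTED].
```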
## Model Details
### Model Architecture
- **Base Model**: `answerdotai/ModernBERT-base`
- **Task**: Token Classification
- **Number of Labels**: 18 (17 PII categories + "O" for non-PII tokens)
## Usage
### Installation
To use the model, ensure you have the `transformers` and `datasets` libraries installed:
```bash
pip install transformers datasets
```
### Inference Example
Here’s how to load and use the model for PII detection:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model and tokenizer
model_name = "scampion/piiranha"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create a token classification pipeline
pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)

# Example input
text = "My email is [email protected] and my phone number is 555-123-4567."

# Detect PII
results = pii_pipeline(text)
for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")
```
Example output:
```text
Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445
Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
Entity: ., Label: I-USERNAME, Score: 0.5871
Entity: do, Label: I-USERNAME, Score: 0.5350
Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
Entity: -, Label: I-SOCIALNUM, Score: 0.5948
Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
Entity: -, Label: I-SOCIALNUM, Score: 0.6151
Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
```
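As the raw output above shows, predictions arrive per BPE sub-token (the `Ġ` prefix marks a word-initial piece). The pipeline's `aggregation_strategy` argument (e.g. `"simple"`) can group these for you; the same idea can be sketched by hand with a hypothetical `merge_subwords` helper that joins consecutive same-label pieces:

```python
def merge_subwords(results):
    """Group consecutive sub-word predictions that share a label.

    Operates on raw pipeline output (no aggregation strategy), where
    the BPE tokenizer marks word-initial pieces with 'Ġ'.
    """
    merged = []
    for ent in results:
        word = ent["word"]
        starts_new_word = word.startswith("Ġ")
        text = word.lstrip("Ġ")
        if merged and merged[-1]["entity"] == ent["entity"] and not starts_new_word:
            # Continuation piece of the same word with the same label: append.
            merged[-1]["word"] += text
        else:
            merged.append({"word": text, "entity": ent["entity"]})
    return merged

raw = [
    {"word": "Ġj", "entity": "I-ACCOUNTNUM"},
    {"word": "ohn", "entity": "I-ACCOUNTNUM"},
    {"word": "Ġ555", "entity": "I-ACCOUNTNUM"},
]
print(merge_subwords(raw))
# [{'word': 'john', 'entity': 'I-ACCOUNTNUM'}, {'word': '555', 'entity': 'I-ACCOUNTNUM'}]
```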
## Training Details
### Dataset
The model was trained on the `ai4privacy/pii-masking-400k` dataset, which contains 400,000 examples of text with annotated PII tokens.
### Training Configuration
- **Batch Size:** 32
- **Learning Rate:** 5e-5
- **Epochs:** 4
- **Optimizer:** AdamW
- **Weight Decay:** 0.01
- **Scheduler:** Linear learning rate scheduler
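Assuming the standard Hugging Face `Trainer` was used (the card does not say), the hyperparameters above map onto `TrainingArguments` roughly as follows; this is a configuration sketch, not the author's verified training script:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the reported training configuration.
args = TrainingArguments(
    output_dir="piiranha-ft",        # placeholder output path
    per_device_train_batch_size=32,  # Batch Size: 32
    learning_rate=5e-5,              # Learning Rate: 5e-5
    num_train_epochs=4,              # Epochs: 4
    weight_decay=0.01,               # Weight Decay: 0.01
    lr_scheduler_type="linear",      # Linear learning rate scheduler
)
```

The optimizer defaults to an AdamW variant in `Trainer`, matching the configuration listed above.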
### Evaluation Metrics
The model was evaluated using the following metrics:
- Precision
- Recall
- F1 Score
- Accuracy
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|-------|----------|
| 1 | 0.017100 | 0.017944 | 0.897562 | 0.905612 | 0.901569 | 0.993549 |
| 2 | 0.011300 | 0.014114 | 0.915451 | 0.923319 | 0.919368 | 0.994782 |
| 3 | 0.005000 | 0.015703 | 0.919432 | 0.928394 | 0.923892 | 0.995136 |
| 4 | 0.001000 | 0.022899 | 0.921234 | 0.927212 | 0.924213 | 0.995267 |
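For reference, token-level versions of these metrics can be computed in a few lines. This simplified scorer ignores the "O" class and is illustrative only; the numbers in the table were presumably produced by a standard evaluation harness such as seqeval:

```python
def token_prf(true_labels, pred_labels, ignore="O"):
    """Token-level precision, recall, and F1 over PII labels.

    Counts a true positive only when predicted and gold labels match
    and are not the `ignore` (non-PII) class.
    """
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(t == p != ignore for t, p in pairs)
    fp = sum(p != ignore and t != p for t, p in pairs)
    fn = sum(t != ignore and t != p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy gold/predicted label sequences
true = ["O", "I-EMAIL", "I-EMAIL", "O", "I-TELEPHONENUM"]
pred = ["O", "I-EMAIL", "O",       "O", "I-TELEPHONENUM"]
print(token_prf(true, pred))  # precision 1.0, recall ~0.667, F1 ~0.8
```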
## License
This model is licensed under the Apache License 2.0 with the Commons Clause restriction. For more details, see the Commons Clause website.
For alternative licensing arrangements, contact the author.
## Author
- **Name:** Sébastien Campion
- **Email:** [email protected]
- **Date:** 2025-01-30
- **Version:** 0.1
## Citation
If you use this model in your work, please cite it as follows:
```bibtex
@misc{piiranha2025,
author = {Sébastien Campion},
title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
year = {2025},
version = {0.1},
url = {https://huggingface.co/sebastien-campion/piiranha},
}
```
## Disclaimer
This model is provided "as-is" without any guarantees of performance or suitability for specific use cases.
Always evaluate the model's performance in your specific context before deployment.