---
datasets:
- ai4privacy/pii-masking-400k
metrics:
- accuracy
- recall
- precision
- f1
base_model:
- answerdotai/ModernBERT-base
pipeline_tag: token-classification
tags:
- pii
- privacy
- personal
- identification
---
# 🐟 PII-RANHA: Privacy-Preserving Token Classification Model

## Overview
PII-RANHA is a fine-tuned token classification model based on **ModernBERT-base** from Answer.AI. It is designed to identify and classify Personally Identifiable Information (PII) in text data. The model is trained on the `ai4privacy/pii-masking-400k` dataset and can detect 17 different PII categories, such as account numbers, credit card numbers, email addresses, and more.

This model is intended for privacy-preserving applications, such as data anonymization, redaction, or compliance with data protection regulations.

## Model Details

### Model Architecture
- **Base Model**: `answerdotai/ModernBERT-base`
- **Task**: Token Classification
- **Number of Labels**: 18 (17 PII categories + "O" for non-PII tokens)
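
The label inventory is stored in the model configuration, so it can be listed without downloading the full weights. A minimal sketch, assuming the checkpoint name used in the inference example below:

```python
from transformers import AutoConfig

# Load only the configuration to inspect the label space
# (checkpoint name taken from the inference example below).
config = AutoConfig.from_pretrained("scampion/piiranha")

# id2label maps each of the 18 class indices to its tag:
# "O" plus the 17 PII categories.
for idx, label in sorted(config.id2label.items()):
    print(idx, label)
```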


## Usage

### Installation
To use the model, ensure you have the `transformers` and `datasets` libraries installed:

```bash
pip install transformers datasets
```

### Inference Example
Here’s how to load and use the model for PII detection:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the model and tokenizer
model_name = "scampion/piiranha"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Create a token classification pipeline
pii_pipeline = pipeline("token-classification", model=model, tokenizer=tokenizer)

# Example input
text = "My email is [email protected] and my phone number is 555-123-4567."

# Detect PII
results = pii_pipeline(text)
for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity']}, Score: {entity['score']:.4f}")

```

Example output:

```text
Entity: Ġj, Label: I-ACCOUNTNUM, Score: 0.6445
Entity: ohn, Label: I-ACCOUNTNUM, Score: 0.3657
Entity: ., Label: I-USERNAME, Score: 0.5871
Entity: do, Label: I-USERNAME, Score: 0.5350
Entity: Ġ555, Label: I-ACCOUNTNUM, Score: 0.8399
Entity: -, Label: I-SOCIALNUM, Score: 0.5948
Entity: 123, Label: I-SOCIALNUM, Score: 0.6309
Entity: -, Label: I-SOCIALNUM, Score: 0.6151
Entity: 45, Label: I-SOCIALNUM, Score: 0.3742
Entity: 67, Label: I-TELEPHONENUM, Score: 0.3440
```
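
The raw output above is at the subword level (the `Ġ` prefix marks a token that starts a new word). For span-level results, the pipeline can be built with an aggregation strategy, and the predicted spans can then be masked out of the input. A sketch, not part of the original card, using the standard `aggregation_strategy` option of the `transformers` pipeline:

```python
from transformers import pipeline

# Group subword predictions into whole entity spans.
pii_pipeline = pipeline(
    "token-classification",
    model="scampion/piiranha",
    aggregation_strategy="simple",
)

text = "My email is [email protected] and my phone number is 555-123-4567."

# Replace each detected span with its entity group, e.g. [TELEPHONENUM].
# Spans are processed right-to-left so earlier offsets stay valid.
redacted = text
for entity in sorted(pii_pipeline(text), key=lambda e: e["start"], reverse=True):
    redacted = redacted[: entity["start"]] + f"[{entity['entity_group']}]" + redacted[entity["end"]:]

print(redacted)
```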

## Training Details

### Dataset
The model was trained on the `ai4privacy/pii-masking-400k` dataset, which contains 400,000 examples of text with annotated PII tokens.
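
Since the install step above already pulls in the `datasets` library, the training data can be inspected directly. A minimal sketch; the split names follow the dataset's published configuration and are an assumption here:

```python
from datasets import load_dataset

# Load the PII-annotated corpus used for fine-tuning.
dataset = load_dataset("ai4privacy/pii-masking-400k")

print(dataset)               # available splits and their sizes
print(dataset["train"][0])   # one annotated example
```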

### Training Configuration
- **Batch Size:** 32 
- **Learning Rate:** 5e-5
- **Epochs:** 4
- **Optimizer:** AdamW
- **Weight Decay:** 0.01
- **Scheduler:** Linear learning rate scheduler
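
The training script itself is not included in this card; the sketch below only shows how the hyperparameters listed above map onto a standard `transformers` `TrainingArguments` object. The output directory and evaluation/saving cadence are illustrative assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="piiranha-modernbert-base",  # hypothetical output path
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=5e-5,
    num_train_epochs=4,
    weight_decay=0.01,
    optim="adamw_torch",          # AdamW optimizer
    lr_scheduler_type="linear",   # linear learning rate scheduler
    eval_strategy="epoch",        # assumed evaluation cadence
    save_strategy="epoch",        # assumed checkpointing cadence
)
```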

### Evaluation Metrics
The model was evaluated using the following metrics:
- Precision
- Recall
- F1 Score
- Accuracy

| Epoch | Training Loss | Validation Loss | Precision | Recall | F1    | Accuracy |
|-------|---------------|-----------------|-----------|--------|-------|----------|
| 1     | 0.017100      | 0.017944        | 0.897562  | 0.905612 | 0.901569 | 0.993549 |
| 2     | 0.011300      | 0.014114        | 0.915451  | 0.923319 | 0.919368 | 0.994782 |
| 3     | 0.005000      | 0.015703        | 0.919432  | 0.928394 | 0.923892 | 0.995136 |
| 4     | 0.001000      | 0.022899        | 0.921234  | 0.927212 | 0.924213 | 0.995267 |
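
The exact evaluation code is not published in this card. As an illustration only, `seqeval` is a common choice for scoring tagged sequences like the I-/O labels shown above; a minimal sketch with made-up gold and predicted tags:

```python
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy gold and predicted tag sequences in the I-/O scheme used by this model.
references  = [["O", "I-USERNAME", "O", "I-TELEPHONENUM", "I-TELEPHONENUM"]]
predictions = [["O", "I-USERNAME", "O", "I-TELEPHONENUM", "O"]]

print("precision:", precision_score(references, predictions))
print("recall:   ", recall_score(references, predictions))
print("f1:       ", f1_score(references, predictions))
print("accuracy: ", accuracy_score(references, predictions))
```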


## License
This model is licensed under the Apache License 2.0 with the Commons Clause restriction. For more details, see the Commons Clause website.
For alternative licensing terms, contact the author.

## Author
Name: Sébastien Campion

Email: [email protected]

Date: 2025-01-30

Version: 0.1

## Citation
If you use this model in your work, please cite it as follows:

```bibtex
@misc{piiranha2025,
  author = {Sébastien Campion},
  title = {PII-RANHA: A Privacy-Preserving Token Classification Model},
  year = {2025},
  version = {0.1},
  url = {https://huggingface.co/sebastien-campion/piiranha},
}
```

## Disclaimer
This model is provided "as-is" without any guarantees of performance or suitability for specific use cases. 
Always evaluate the model's performance in your specific context before deployment.