---
license: mit
datasets:
- ai4privacy/open-pii-masking-500k-ai4privacy
language:
- en
tags:
- pii
- redaction
- anonymisation
- english
model-index:
- name: english-anonymiser-openpii-ai4privacy
  results:
  - task:
      type: token-classification
      name: PII Masking
    dataset:
      type: ai4privacy/open-pii-masking-500k-ai4privacy
      name: Open PII Masking 500K
      split: english-validation
    metrics:
    - type: f1
      value: 0.9882
      name: F1 Score
    - type: precision
      value: 0.9882
      name: Precision
    - type: recall
      value: 0.9883
      name: Recall
    - type: accuracy
      value: 0.9917
      name: Accuracy
metrics:
- f1
- precision
- recall
library_name: transformers
pipeline_tag: token-classification
---

# English Anonymiser OpenPII (Ai4Privacy)

This model is designed to **redact Personally Identifiable Information (PII)** from English text. It has been fine-tuned exclusively on the English subset of the [open-pii-masking-500k-ai4privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy) dataset.

---
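
A minimal usage sketch, not taken from the original card. It assumes the checkpoint is published on the Hugging Face Hub under the hypothetical id `ai4privacy/english-anonymiser-openpii-ai4privacy` (only the model name appears in the metadata; the organisation prefix is a guess) and that it loads with the standard `transformers` token-classification pipeline. The `redact` and `anonymise` helpers and the `[LABEL]` placeholder format are illustrative choices, and the sample spans are hand-written rather than real model output.

```python
from typing import Any, Iterable, Mapping

# Assumed Hub id: the model name comes from the card metadata,
# the "ai4privacy/" prefix is a guess.
MODEL_ID = "ai4privacy/english-anonymiser-openpii-ai4privacy"


def redact(text: str, entities: Iterable[Mapping[str, Any]]) -> str:
    """Replace each detected span with a [LABEL] placeholder (illustrative format)."""
    # Apply replacements right to left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


def anonymise(text: str, model_id: str = MODEL_ID) -> str:
    """Run the token-classification pipeline and mask every predicted span."""
    # Imported lazily so `redact` can be used without transformers installed.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # merge sub-word tokens into whole spans
    )
    return redact(text, tagger(text))


# Hand-written spans in the shape the pipeline returns (no model download needed).
sample = "Call Jane on 555-0100."
spans = [
    {"entity_group": "GIVENNAME", "start": 5, "end": 9},
    {"entity_group": "TELEPHONENUM", "start": 13, "end": 21},
]
print(redact(sample, spans))  # Call [GIVENNAME] on [TELEPHONENUM].
```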

## Evaluation Metrics

The table below summarizes the detailed evaluation results per PII label:

| **Label**          | **TP** | **FP** | **FN** | **Accuracy** | **Precision** | **Recall** | **F1 Score**  |
|--------------------|:------:|:------:|:------:|:------------:|:-------------:|:----------:|:-------------:|
| SURNAME            | 3724   | 0      | 26     | 99.31%       | 100.0%        | 99.31%     | 99.65%        |
| O (Non-PII)        | 0      | 368    | 0      | 99.36%       | n/a           | n/a        | n/a           |
| TIME               | 1934   | 0      | 2      | 99.90%       | 100.0%        | 99.90%     | 99.95%        |
| DRIVERLICENSENUM   | 505    | 0      | 2      | 99.61%       | 100.0%        | 99.61%     | 99.80%        |
| PASSPORTNUM        | 566    | 0      | 0      | 100.0%       | 100.0%        | 100.0%     | 100.0%        |
| GIVENNAME          | 7557   | 0      | 163    | 97.89%       | 100.0%        | 97.89%     | 98.93%        |
| TELEPHONENUM       | 3637   | 0      | 4      | 99.89%       | 100.0%        | 99.89%     | 99.95%        |
| BUILDINGNUM        | 418    | 0      | 8      | 98.12%       | 100.0%        | 98.12%     | 99.05%        |
| AGE                | 164    | 0      | 5      | 97.04%       | 100.0%        | 97.04%     | 98.50%        |
| DATE               | 2335   | 0      | 0      | 100.0%       | 100.0%        | 100.0%     | 100.0%        |
| CITY               | 1717   | 0      | 85     | 95.28%       | 100.0%        | 95.28%     | 97.58%        |
| TITLE              | 363    | 0      | 21     | 94.53%       | 100.0%        | 94.53%     | 97.19%        |
| IDCARDNUM          | 2008   | 0      | 12     | 99.41%       | 100.0%        | 99.41%     | 99.70%        |
| GENDER             | 120    | 0      | 1      | 99.17%       | 100.0%        | 99.17%     | 99.59%        |
| CREDITCARDNUMBER   | 555    | 0      | 3      | 99.46%       | 100.0%        | 99.46%     | 99.73%        |
| SEX                | 77     | 0      | 2      | 97.47%       | 100.0%        | 97.47%     | 98.72%        |
| STREET             | 1379   | 0      | 8      | 99.42%       | 100.0%        | 99.42%     | 99.71%        |
| TAXNUM             | 343    | 0      | 14     | 96.08%       | 100.0%        | 96.08%     | 98.00%        |
| EMAIL              | 2607   | 0      | 1      | 99.96%       | 100.0%        | 99.96%     | 99.98%        |
| SOCIALNUM          | 421    | 0      | 1      | 99.76%       | 100.0%        | 99.76%     | 99.88%        |
| ZIPCODE            | 418    | 0      | 8      | 98.12%       | 100.0%        | 98.12%     | 99.05%        |

**Overall Evaluation:**
- **Accuracy:** 99.17%
- **Precision:** 98.82%
- **Recall:** 98.83%
- **F1 Score:** 98.82%

- **Total True Positives (TP):** 30,848
- **Total False Positives (FP):** 368
- **Total False Negatives (FN):** 366

**Macro-Averaged Metrics:**
- **Accuracy:** 98.56%
- **Precision:** 95.24%
- **Recall:** 93.83%
- **F1 Score:** 94.52%
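
The relationship between the per-label counts and the two sets of summary figures can be checked with a short sketch. It is an inference from the published numbers, not the authors' evaluation script: the overall precision, recall and F1 are consistent with micro-averaging (pooling TP, FP and FN over all labels), and the macro figures are consistent with an unweighted mean over all 21 rows in which the undefined `O (Non-PII)` entries count as 0. Under those assumptions the code below reproduces the reported precision, recall and F1 to two decimal places. Accuracy is left out because the table does not list true negatives. The helper name `prf` is illustrative.

```python
# (TP, FP, FN) per label, copied from the table above.
COUNTS = {
    "SURNAME": (3724, 0, 26),
    "O": (0, 368, 0),
    "TIME": (1934, 0, 2),
    "DRIVERLICENSENUM": (505, 0, 2),
    "PASSPORTNUM": (566, 0, 0),
    "GIVENNAME": (7557, 0, 163),
    "TELEPHONENUM": (3637, 0, 4),
    "BUILDINGNUM": (418, 0, 8),
    "AGE": (164, 0, 5),
    "DATE": (2335, 0, 0),
    "CITY": (1717, 0, 85),
    "TITLE": (363, 0, 21),
    "IDCARDNUM": (2008, 0, 12),
    "GENDER": (120, 0, 1),
    "CREDITCARDNUMBER": (555, 0, 3),
    "SEX": (77, 0, 2),
    "STREET": (1379, 0, 8),
    "TAXNUM": (343, 0, 14),
    "EMAIL": (2607, 0, 1),
    "SOCIALNUM": (421, 0, 1),
    "ZIPCODE": (418, 0, 8),
}


def prf(tp, fp, fn):
    """Precision, recall and F1; undefined ratios (0/0) are counted as 0 (assumption)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Micro average: pool the counts over all labels, then compute once.
total_tp, total_fp, total_fn = (sum(col) for col in zip(*COUNTS.values()))
micro = prf(total_tp, total_fp, total_fn)

# Macro average: compute per label, then take the unweighted mean over all rows.
per_label = [prf(*counts) for counts in COUNTS.values()]
macro = tuple(sum(col) / len(per_label) for col in zip(*per_label))

print("totals TP/FP/FN:", total_tp, total_fp, total_fn)
print("micro P/R/F1:", [f"{v:.2%}" for v in micro])
print("macro P/R/F1:", [f"{v:.2%}" for v in macro])
```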

---

## Model Behavior & Limitations

- **Evaluation Focus:**
  The metrics shown above reflect performance on a held-out split of the [open-pii-masking-500k-ai4privacy](https://huggingface.co/datasets/ai4privacy/open-pii-masking-500k-ai4privacy) dataset (recorded as `english-validation` in this card's metadata). Real-world performance may vary, so production use calls for additional safeguards and evaluation on your own data. For questions, contact support (at) ai4privacy.com.

---

## Disclaimer

This model card details the evaluation metrics for the English anonymiser. **Please note:**
- The model is provided **as-is** under the MIT License.
- It is intended solely for redaction purposes and does not perform full PII classification.
- Users should carefully test and evaluate its performance on their own data before deploying it in production environments.

---

*Ai4Privacy – Committed to protecting personal data in the age of AI.*