metadata

license: mit
datasets:
  - ai4privacy/open-pii-masking-500k-ai4privacy
language:
  - en
tags:
  - pii
  - redaction
  - anonymisation
  - english
model-index:
  - name: english-anonymiser-openpii-ai4privacy
    results:
      - task:
          type: token-classification
          name: PII Masking
        dataset:
          type: ai4privacy/open-pii-masking-500k-ai4privacy
          name: Open PII Masking 500K
          split: english-validation
        metrics:
          - type: f1
            value: 0.9882
            name: F1 Score
          - type: precision
            value: 0.9882
            name: Precision
          - type: recall
            value: 0.9883
            name: Recall
          - type: accuracy
            value: 0.9917
            name: Accuracy
metrics:
  - f1
  - precision
  - recall
library_name: transformers
pipeline_tag: token-classification

English Anonymiser OpenPII (Ai4Privacy)

This model is designed to redact Personally Identifiable Information (PII) from English text. It has been fine-tuned exclusively on the English subset of the open-pii-masking-500k-ai4privacy dataset.
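As a rough illustration of the redaction step, the sketch below replaces detected spans with label placeholders. It assumes entity dictionaries in the shape returned by the transformers token-classification pipeline when `aggregation_strategy="simple"` is set (`entity_group`, `start`, `end`). The `redact` helper, the `[LABEL]` placeholder format, the `MODEL_ID` placeholder, and the hand-written sample spans are illustrative assumptions, not part of the released model.

```python
from typing import Iterable, Mapping

# To obtain real spans, one could load the model roughly like this
# (MODEL_ID is a placeholder for this model's Hub repository id):
#   from transformers import pipeline
#   nlp = pipeline("token-classification", model=MODEL_ID,
#                  aggregation_strategy="simple")
#   spans = nlp(text)


def redact(text: str, entities: Iterable[Mapping]) -> str:
    """Replace each detected span with a [LABEL] placeholder (hypothetical helper)."""
    # Work right to left so that earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


# Hand-written spans for illustration only; this is not real model output.
sample = "Contact Jane Doe at jane@example.com."
spans = [
    {"entity_group": "GIVENNAME", "start": 8, "end": 12},
    {"entity_group": "SURNAME", "start": 13, "end": 16},
    {"entity_group": "EMAIL", "start": 20, "end": 36},
]
redacted = redact(sample, spans)
print(redacted)
```

Replacing spans from the end of the string backwards avoids recomputing offsets after each substitution.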

Evaluation Metrics

The table below summarizes the detailed evaluation results per PII label:

Label	TP	FP	FN	Accuracy	Precision	Recall	F1 Score
SURNAME	3724	0	26	99.31%	100.0%	99.31%	99.65%
O (Non-PII)	0	368	0	99.36%	n/a	n/a	n/a
TIME	1934	0	2	99.90%	100.0%	99.90%	99.95%
DRIVERLICENSENUM	505	0	2	99.61%	100.0%	99.61%	99.80%
PASSPORTNUM	566	0	0	100.0%	100.0%	100.0%	100.0%
GIVENNAME	7557	0	163	97.89%	100.0%	97.89%	98.93%
TELEPHONENUM	3637	0	4	99.89%	100.0%	99.89%	99.95%
BUILDINGNUM	418	0	8	98.12%	100.0%	98.12%	99.05%
AGE	164	0	5	97.04%	100.0%	97.04%	98.50%
DATE	2335	0	0	100.0%	100.0%	100.0%	100.0%
CITY	1717	0	85	95.28%	100.0%	95.28%	97.58%
TITLE	363	0	21	94.53%	100.0%	94.53%	97.19%
IDCARDNUM	2008	0	12	99.41%	100.0%	99.41%	99.70%
GENDER	120	0	1	99.17%	100.0%	99.17%	99.59%
CREDITCARDNUMBER	555	0	3	99.46%	100.0%	99.46%	99.73%
SEX	77	0	2	97.47%	100.0%	97.47%	98.72%
STREET	1379	0	8	99.42%	100.0%	99.42%	99.71%
TAXNUM	343	0	14	96.08%	100.0%	96.08%	98.00%
EMAIL	2607	0	1	99.96%	100.0%	99.96%	99.98%
SOCIALNUM	421	0	1	99.76%	100.0%	99.76%	99.88%
ZIPCODE	418	0	8	98.12%	100.0%	98.12%	99.05%

Overall Evaluation:

Accuracy: 99.17%
Precision: 98.82%
Recall: 98.83%
F1 Score: 98.82%
Total True Positives (TP): 30,848
Total False Positives (FP): 368
Total False Negatives (FN): 366
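
The overall precision, recall, and F1 figures are consistent with micro-averaging over the totals listed above. A minimal sketch of that arithmetic follows; the function name `micro_metrics` is illustrative, and the accuracy figure is not reproduced here because it depends on counts not listed in this card.

```python
def micro_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) computed from pooled counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1


# Totals copied from the overall evaluation above.
precision, recall, f1 = micro_metrics(tp=30_848, fp=368, fn=366)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```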

Macro-Averaged Metrics:

Accuracy: 98.56%
Precision: 95.24%
Recall: 93.83%
F1 Score: 94.52%
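
The macro-averaged precision, recall, and F1 appear to be unweighted means over all 21 rows of the table, with the O (Non-PII) row counted as 0 for each of them. That would explain why the macro precision is lower than every per-label precision. The sketch below recomputes the averages from the per-label counts under that assumption; `safe_div` and `macro_average` are illustrative names.

```python
# Per-label (TP, FP, FN) counts copied from the evaluation table.
COUNTS = {
    "SURNAME": (3724, 0, 26),
    "O (Non-PII)": (0, 368, 0),
    "TIME": (1934, 0, 2),
    "DRIVERLICENSENUM": (505, 0, 2),
    "PASSPORTNUM": (566, 0, 0),
    "GIVENNAME": (7557, 0, 163),
    "TELEPHONENUM": (3637, 0, 4),
    "BUILDINGNUM": (418, 0, 8),
    "AGE": (164, 0, 5),
    "DATE": (2335, 0, 0),
    "CITY": (1717, 0, 85),
    "TITLE": (363, 0, 21),
    "IDCARDNUM": (2008, 0, 12),
    "GENDER": (120, 0, 1),
    "CREDITCARDNUMBER": (555, 0, 3),
    "SEX": (77, 0, 2),
    "STREET": (1379, 0, 8),
    "TAXNUM": (343, 0, 14),
    "EMAIL": (2607, 0, 1),
    "SOCIALNUM": (421, 0, 1),
    "ZIPCODE": (418, 0, 8),
}


def safe_div(num: float, den: float) -> float:
    """Divide, treating an undefined ratio (zero denominator) as 0.0."""
    return num / den if den else 0.0


def macro_average(counts: dict) -> tuple[float, float, float]:
    """Unweighted mean of per-label precision, recall, and F1."""
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        precisions.append(safe_div(tp, tp + fp))
        recalls.append(safe_div(tp, tp + fn))
        f1s.append(safe_div(2 * tp, 2 * tp + fp + fn))
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


macro_p, macro_r, macro_f1 = macro_average(COUNTS)
print(f"precision={macro_p:.2%} recall={macro_r:.2%} f1={macro_f1:.2%}")
```

Under this assumption the recomputed values agree with the reported macro precision, recall, and F1 to two decimal places.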

Model Behavior & Limitations

Evaluation Focus:
The metrics above reflect performance on the English validation split of the open-pii-masking-500k-ai4privacy dataset. Real-world performance may vary, and production use may require additional safeguards beyond the model itself. For questions, contact support (at) ai4privacy.com.

Disclaimer

This model card details the evaluation metrics for the English anonymiser. Please note:

The model is provided as-is under the MIT License.
It is intended solely for redaction purposes and does not perform full PII classification.
Users should carefully test and evaluate its performance on their own data before deploying in production environments.

Ai4Privacy – Committed to protecting personal data in the age of AI.