src/privacy/util/code_detect/ner/pii_inference/README.md

Run NER inference

Here we provide the code used for training StarPII, an NER model for PII detection. We also provide the code (and Slurm scripts) used for running inference on StarCoderData. We were able to detect PII in ~800GB of text in 800 GPU hours on A100 80GB GPUs.

accelerate config
accelerate launch ner_inference.py --process_batch_size=100000 --out_path=processed_dataset

To replace the secrets detected with this code in StarCoderData, we used the following tokens:

<NAME>, <EMAIL>, <KEY>, <PASSWORD>
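
As a rough illustration of how the tokens above could be substituted for detected spans, here is a minimal sketch. It is not the code used for StarCoderData: the entity format (dicts with `tag`, `start`, `end` character offsets), the `REPLACEMENTS` mapping, and the `redact` helper are all assumptions made for this example.

```python
# Hypothetical mapping from a detected entity tag to its placeholder token.
REPLACEMENTS = {
    "NAME": "<NAME>",
    "EMAIL": "<EMAIL>",
    "KEY": "<KEY>",
    "PASSWORD": "<PASSWORD>",
}


def redact(text, entities):
    """Replace each detected span in `text` with its placeholder token.

    `entities` is assumed to be a list of dicts with keys "tag", "start"
    and "end" (character offsets, end exclusive) and no overlapping spans.
    """
    # Work from the last span to the first so earlier offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        token = REPLACEMENTS.get(ent["tag"])
        if token is None:
            continue  # tags without a placeholder are left untouched
        text = text[: ent["start"]] + token + text[ent["end"]:]
    return text
```

For example, `redact("contact jane@example.com now", [{"tag": "EMAIL", "start": 8, "end": 24}])` replaces the address with `<EMAIL>` and leaves the rest of the string unchanged.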

To mask IP addresses, we randomly selected an IP address from 5 synthetic, private, non-internet-facing IP addresses of the same type.
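
The IP masking step described above might look like the following sketch. The two pools are placeholder private addresses chosen for this example, not the addresses actually used for StarCoderData, and `mask_ip` is a hypothetical helper; "same type" is interpreted here as the same IP version (IPv4 or IPv6).

```python
import ipaddress
import random

# Placeholder pools: 5 private IPv4 and 5 unique-local IPv6 addresses.
# These are assumptions for illustration only.
IPV4_POOL = ["10.0.0.1", "10.0.0.2", "172.16.0.1", "192.168.0.1", "192.168.1.1"]
IPV6_POOL = ["fd00::1", "fd00::2", "fd00::3", "fd00::4", "fd00::5"]


def mask_ip(value, rng=random):
    """Return a random synthetic address of the same IP version as `value`.

    Raises ValueError if `value` is not a valid IPv4 or IPv6 address.
    """
    version = ipaddress.ip_address(value).version
    pool = IPV4_POOL if version == 4 else IPV6_POOL
    return rng.choice(pool)
```

Passing a seeded `random.Random` instance as `rng` makes the replacement reproducible across runs.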