File size: 752 Bytes
54fa0c8 |
1 2 3 4 5 6 7 8 9 10 11 12 |
## Run NER inference
Here we provide the code used for training [StarPII](https://huggingface.co/bigcode/starpii) and NER model for PII detection. We also provide the code (and `slurm` scripts) used for running Inference on [StarCoderData](https://huggingface.co/datasets/bigcode/starcoderdata), we were able to detect PII in ~800GB of text in 800 GPU hours on A100 80GB.
```
accelerate config
accelerate launch ner_inference.py --process_batch_size=100000 --out_path=processed_dataset
```
To replace secrets with this [code] in StarCoderData we used the following tokens:
```
<NAME>, <EMAIL>, <KEY>, <PASSWORD>
```
To mask IP addresses, we randomly selected an IP address from 5~synthetic, private, non-internet-facing IP addresses of the same type.
|