### Example for running PII detection and anonymization

In [1]:
from datasets import load_dataset

from pii_detection import scan_pii_batch
from pii_redaction import redact_pii_batch, random_replacements

ds = load_dataset("bigcode/pii-for-code", split="train")

FileNotFoundError: Couldn't find a dataset script at C:\New folder\bigcode-dataset\pii\bigcode\pii-for-code\pii-for-code.py or any data file in the same directory. Couldn't find 'bigcode/pii-for-code' on the Hugging Face Hub either: FileNotFoundError: Dataset 'bigcode/pii-for-code' doesn't exist on the Hub

In [None]:
ds_pii = ds.map(scan_pii_batch, batched=True, batch_size=100, num_proc=12)

In [3]:
print(f"Dataset after PII detection:\n{ds_pii}")
print(f"Number of samples that contained PII: {sum(ds_pii['has_secrets'])}")
print(f"Total number of secrets found: {sum(ds_pii['number_secrets'])}")

Dataset after PII detection:
Dataset({
 features: ['content', 'language', 'license', 'path', 'annotation_id', 'pii', 'pii_modified', 'id', 'secrets', 'has_secrets', 'number_secrets'],
 num_rows: 400
})
Number of samples that contained PII: 211
Total number of secrets found: 336


#### About the detection and anonymization:
* We detect secret keys with `detect-secrets` and mask them with one of these 4 randomly generated sequences (they can change with each execution on a new dataset):
 ```
 ['q8jtgev49gw1un9427qd9afza5vpuemo',
 'pj82ffu65gt9sh9v8n9s2fyupslmlcq4',
 'efijcf8z7r7pn0r25wfuh5vmpbrhoxkv',
 '1dgjoc8ebhmhzfxhcbmlh4ndb81gqeoe']
 ```
 
* We detect email addresses and mask them with one of these 4 emails (the first part is randomly generated; they can change with each execution on a new dataset):
 ```
 ['mynbi@email.com',
 'qpmzj@email.com',
 'plsgq@email.com',
 'ejeyd@email.com']
 ```

* We detect IP addresses (and DNS servers) and mask them with the random private addresses below (these are fixed). Note that private IP addresses aren't masked; we use Python's `ipaddress` library to determine whether an address is private:
```
{'IPv4': ['172.16.31.10',
 '172.16.58.3',
 '192.168.127.12',
 '192.168.3.11'],
'IPv6': ['fd00:c2b6:b24b:be67:2827:688d:e6a1:6a3b',
 'fc00:e968:6179::de52:7100',
 'fc00:db20:35b:7399::5',
 'fdf8:f53e:61e4::18']}
```
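The private-address check mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the code in `pii_detection`; the helper name `should_mask` is hypothetical.

```python
import ipaddress


def should_mask(candidate: str) -> bool:
    """Return True if `candidate` parses as an IP address that is not private.

    Hypothetical helper for illustration; the actual detection code may
    apply additional filters.
    """
    try:
        ip = ipaddress.ip_address(candidate)
    except ValueError:
        # Not a parseable IPv4/IPv6 address, so there is nothing to mask.
        return False
    return not ip.is_private


for candidate in ["8.8.8.8", "192.168.1.10", "fd00::1", "not-an-ip"]:
    print(candidate, should_mask(candidate))
```

`ipaddress.ip_address` accepts both IPv4 and IPv6 strings, so one check covers both families.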

Remarks:
* If the same secret appears multiple times in a file, we use the same replacement each time.
* To avoid mistaking version numbers for DNS server addresses, we only detect an address of the form x.x.x.x, where each x is a single digit, if the word "dns" or "server" appears in the nearby context.
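The two remarks above can be sketched as follows. This is an illustration under assumptions, not the implementation in `pii_redaction`: the function names, the regular expressions, and the 50-character context window are all hypothetical, replacements are taken from the pool in order rather than at random so the result is reproducible, and the private-address filter is omitted for brevity.

```python
import re

# Replacement pool copied from the IPv4 list above.
POOL = ["172.16.31.10", "172.16.58.3", "192.168.127.12", "192.168.3.11"]

# Assumed patterns: a loose IPv4 matcher and the single-digit x.x.x.x form.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SINGLE_DIGIT_RE = re.compile(r"\d\.\d\.\d\.\d")
CONTEXT_WORDS = ("dns", "server")


def looks_like_address(text, start, end, window=50):
    """Treat single-digit x.x.x.x as an address only near 'dns' or 'server'."""
    value = text[start:end]
    if not SINGLE_DIGIT_RE.fullmatch(value):
        return True
    # `window` (characters on each side) is an assumed value.
    context = text[max(0, start - window):end + window].lower()
    return any(word in context for word in CONTEXT_WORDS)


def redact_ipv4(text, pool=POOL):
    """Replace detected addresses, reusing one replacement per distinct value."""
    mapping = {}

    def substitute(match):
        if not looks_like_address(text, match.start(), match.end()):
            return match.group(0)  # probably a version number: keep it
        value = match.group(0)
        if value not in mapping:
            # Deterministic choice for this sketch; the real code picks randomly.
            mapping[value] = pool[len(mapping) % len(pool)]
        return mapping[value]

    return IPV4_RE.sub(substitute, text), mapping


new_text, refs = redact_ipv4("primary 93.184.216.34, fallback 93.184.216.34")
print(new_text)
print(refs)
```

Keeping the mapping per file means repeated occurrences of one address stay consistent after redaction, which preserves the structure of the code.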

In [49]:
# redaction
import random
from pprint import pprint
random.seed(0)

replacements = random_replacements()
pprint(replacements)
ds_redacted = ds_pii.map(lambda x: redact_pii_batch(x, replacements), batched=True, batch_size=100, num_proc=12, load_from_cache_file=False)

{'EMAIL': ['mynbi@email.com',
 'qpmzj@email.com',
 'plsgq@email.com',
 'ejeyd@email.com'],
 'IP_ADDRESS': {'IPv4': ['172.16.31.10',
 '172.16.58.3',
 '192.168.127.12',
 '192.168.3.11'],
 'IPv6': ['fd00:c2b6:b24b:be67:2827:688d:e6a1:6a3b',
 'fc00:e968:6179::de52:7100',
 'fc00:db20:35b:7399::5',
 'fdf8:f53e:61e4::18']},
 'KEY': ['q8jtgev49gw1un9427qd9afza5vpuemo',
 'pj82ffu65gt9sh9v8n9s2fyupslmlcq4',
 'efijcf8z7r7pn0r25wfuh5vmpbrhoxkv',
 '1dgjoc8ebhmhzfxhcbmlh4ndb81gqeoe']}
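The keys printed above are 32 characters drawn from lowercase letters and digits, and the emails have a 5-letter local part. A pool of that shape could be generated roughly as below. This is a sketch only: `make_replacements` is a hypothetical name, it is not the body of `random_replacements`, and it will not reproduce the exact values shown.

```python
import random
import string


def make_replacements(n=4, key_length=32, email_length=5, seed=None):
    """Build a pool of random keys and emails (hypothetical helper)."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    key_alphabet = string.ascii_lowercase + string.digits
    keys = ["".join(rng.choices(key_alphabet, k=key_length)) for _ in range(n)]
    emails = [
        "".join(rng.choices(string.ascii_lowercase, k=email_length)) + "@email.com"
        for _ in range(n)
    ]
    return {"EMAIL": emails, "KEY": keys}


pool = make_replacements(seed=0)
print(pool)
```

Passing a fixed `seed` makes the pool reproducible across runs, which plays the same role as the `random.seed(0)` call in the cell above.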


In [9]:
ds_redacted

Dataset({
 features: ['content', 'language', 'license', 'path', 'annotation_id', 'pii', 'pii_modified', 'id', 'secrets', 'has_secrets', 'number_secrets', 'new_content', 'redaction_refs'],
 num_rows: 400
})

In [None]:
import json

# print the ids of samples that contain at least 3 detected secrets
for e in ds_redacted:
    secrets = json.loads(e["secrets"])
    if len(secrets) >= 3:
        print(e["id"])

example 16

In [None]:
ds_redacted[16]["secrets"]

In [None]:
print("Old text:")
print(ds_redacted[16]["content"][1190:1500])

In [None]:
print("New text:")
print(ds_redacted[16]["new_content"][1190:1500])

In [None]:
print("New text with delimiters (for visualization in a space):")
print(ds_redacted[16]["redaction_refs"][1190:1500])

example 27

In [None]:
ds_redacted[27]["secrets"]

In [None]:
print("Old text:")
# we don't replace private IPs like 0.0.0.0
print(ds_redacted[27]["content"][150:250])

print("\nNew text:")
print(ds_redacted[27]["new_content"][150:250])

In [None]:
print("Old text:")
print(ds_redacted[27]["content"][270:670])

print("\nNew text:")
# here the first part of the key was detected and replaced with pj82ffu65gt9sh9v8n9s2fyupslmlcq
print(ds_redacted[27]["new_content"][270:470])

example 49

In [None]:
ds_redacted[49]["secrets"]

In [None]:
print("Old text:")
print(ds_redacted[49]["content"][30:70])

print("\nNew text:")
# here the first part of the key was detected and replaced with pj82ffu65gt9sh9v8n9s2fyupslmlcq
print(ds_redacted[49]["new_content"][30:70])