---
inference: false
license: apache-2.0
datasets:
- metricspace/AnonymeData
pipeline_tag: text2text-generation
---
# EntityAnonymization-3B-V0.9
EntityAnonymization identifies entities in texts and replaces them with randomised versions.
In a first pass, the model recognises the entities and creates a dictionary that maps each one to a similar but randomised variant.
In a second pass, the original text and the dictionary are provided, and the model generates the paraphrased variant.
Because the two steps are separate, the dictionary can be cached and later used in reverse to restore the original entities in an anonymised text that has been processed further.
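As a minimal sketch of the caching idea, the entity dictionary from the first pass could be stored as JSON and inverted later to map the randomised entities back to the originals. The helper names `save_entity_map` and `load_inverted_entity_map` and the file name `entity_map.json` are hypothetical and not part of the model's API.

```python
import json
import os
import tempfile

def save_entity_map(entity_map, path):
    """Write the original -> randomised mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entity_map, f, ensure_ascii=False, indent=4)

def load_inverted_entity_map(path):
    """Load the cached mapping and flip it to randomised -> original."""
    with open(path, "r", encoding="utf-8") as f:
        entity_map = json.load(f)
    return {randomised: original for original, randomised in entity_map.items()}

# Illustrative mapping, taken from the example output further below
entity_map = {
    "Mr. Benjamin Mitchell": "Mr. Edward Martin",
    "Mrs. Alice Williams": "Mrs. Charlotte Johnson",
}

# A temporary directory keeps the sketch self-contained
cache_path = os.path.join(tempfile.mkdtemp(), "entity_map.json")
save_entity_map(entity_map, cache_path)
inverted = load_inverted_entity_map(cache_path)
print(inverted["Mr. Edward Martin"])
```

Inverting the mapping only works cleanly if every randomised value is unique. If two originals were mapped to the same replacement, one of them would be lost.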
# License
This Natural Language Processing (NLP) model is made available under the Apache License, Version 2.0. You are free to use, modify, and distribute this software according to the terms and conditions of the Apache 2.0 License. For the full license text, please refer to the Apache 2.0 License.
# Usage and Specific Capabilities
## Text Length Limitation
The model is optimized to analyze texts containing up to 2048 tokens. If your text exceeds this limit, we recommend splitting it into smaller chunks, each containing no more than 2048 tokens. Each chunk can then be processed separately.
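One possible way to do the splitting is sketched below: lines are grouped greedily until adding the next one would exceed the limit. The function `split_into_chunks` and the whitespace-based counter are assumptions made for illustration. With the real tokenizer, a counter such as `lambda s: len(tokenizer(s).input_ids)` could be passed instead.

```python
def split_into_chunks(text, count_tokens, max_tokens=2048):
    """Greedily group lines into chunks whose token count stays within max_tokens."""
    chunks, current = [], []
    for paragraph in text.split("\n"):
        candidate = "\n".join(current + [paragraph])
        if current and count_tokens(candidate) > max_tokens:
            # Adding this line would exceed the limit: close the current chunk
            chunks.append("\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n".join(current))
    return chunks

# Stand-in token counter (whitespace split), used only to keep the sketch self-contained
def whitespace_token_count(s):
    return len(s.split())

sample = "one two three\nfour five\nsix seven eight nine"
chunks = split_into_chunks(sample, whitespace_token_count, max_tokens=5)
print(chunks)
```

A single line that is longer than the limit on its own is kept as one oversized chunk in this sketch and would need further splitting, for example at sentence boundaries.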
## Supported Languages
Bulgarian, Chinese, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Indonesian, Italian, Japanese, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Turkish
# Use Cases
## Entity Resampling and Anonymization
This model extracts entities from sensitive text and anonymizes them. It is designed to identify and protect confidential information, helping organizations comply with data privacy regulations and reducing the risk of inadvertently disclosing classified data and trade secrets.
# Example Usage
```python
!pip install sentencepiece
!pip install transformers
```
```python
import torch
import json
import re
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained("metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16)
model.to("cuda:0")
def extract_last_assistant_response(input_text):
    # Find the last occurrence of "ASSISTANT:" in the input text
    marker = "ASSISTANT:"
    start_index = input_text.rfind(marker)
    if start_index == -1:
        # No marker found: return the text unchanged
        return input_text.strip()
    # Keep everything after the marker
    response = input_text[start_index + len(marker):].strip()
    return response
# Input example
text_to_anonymize = '''Subject: HR Incident Report: Speculation of Drug Misuse by Mr. Benjamin Mitchell
Dear Mrs. Alice Williams,
I trust you're well. I wish to bring to your attention a concerning matter involving one of our esteemed employees, Mr. Benjamin Mitchell.
Employee Details:
Name: Benjamin Mitchell
Position: Senior Marketing Creative
Department: Marketing
Date of Joining: January 15, 2020
Reporting Manager: Mrs. Jane Fitzgerald
Incident Details:
Date: October 25, 2023
Location: Restroom, 4th Floor
Time: 11:45 AM
Description of Incident:
On the date specified, a few colleagues reported unusual behavior exhibited by Mr. Mitchell, which raised concerns about potential drug misuse. Witnesses mentioned that Benjamin appeared disoriented and was found in the restroom for an extended period. Some employees also discovered unidentified pills in close proximity to his chair.
Witness Accounts:
Ms. Emily Clark: "Benjamin seemed distracted and not his usual self today. He's been taking frequent breaks and appears a bit disoriented."
Mr. Robert Taylor: "I found some pills near his chair on the floor. It's concerning, and I felt it necessary to report."
Immediate Actions Taken:
Mr. Benjamin Mitchell was approached by HR for a preliminary conversation to understand the situation.
Mrs. Jane Fitzgerald, his reporting manager, was made aware of the concerns.
Recommendations:
It's crucial to have a private and supportive conversation with Mr. Mitchell to understand if there's an underlying issue.
Consider referring Benjamin to our Employee Assistance Program (EAP) for counseling or support.
It may be beneficial to organize a session on drug awareness and workplace safety for all employees.
It's of utmost importance to handle this situation with sensitivity and discretion, ensuring the wellbeing of Mr. Mitchell and maintaining the integrity of our workplace environment. This email serves as a formal documentation of the incident. We'll determine the subsequent course of action based on your guidance and the recommendations provided.
Looking forward to your direction on this matter.
'''
print(text_to_anonymize)
# Step 1: Extracting entities from text
prompt = f'USER: Resample the entities: {text_to_anonymize}\n\nASSISTANT:'
inputs = tokenizer(prompt, return_tensors='pt').to('cuda:0')
output_entities = model.generate(inputs.input_ids, max_new_tokens=300, do_sample=False, temperature=0.8, penalty_alpha=1.3, top_k=180, num_beams=5, repetition_penalty=2.3)
raw_output_entities_text = tokenizer.decode(output_entities[0])
entities = extract_last_assistant_response(raw_output_entities_text)
print('-----------Entities----------------')
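import ast  # parses the dictionary literal without executing code, unlike eval()
try:
    # Keep only the first {...} block of the model output
    entities = re.search(r"\{.*?\}", entities, re.DOTALL).group(0)
    # literal_eval accepts Python dict literals, including a trailing comma
    data_dict = ast.literal_eval(entities)
    formatted_json = json.dumps(data_dict, indent=4)
    print(formatted_json)
except (AttributeError, ValueError, SyntaxError):
    # Badly formatted dictionary: print the raw model output instead
    print(entities)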
#output
'''
{
"Mr. Benjamin Mitchell": "Mr. Edward Martin",
"Mrs. Alice Williams": "Mrs. Charlotte Johnson",
"January 15, 2020": "January 15, 2020",
"Mrs. Jane Fitzgerald": "Mrs. Jane Anderson",
"October 25, 2023": "October 25, 2023",
"4th Floor": "topmost floor",
"11:45 AM": "midday",
"Emily Clark": "Marie Foster",
"Employee Assistance Program (EAP)": "Personal Assistance Program (PAP)",
"Robert Taylor": "Benjamin Adams",
}
'''
# Step 2: Use entities to resample the original text
prompt_2 = f"USER: Rephrase with {entities}: {text_to_anonymize}\n\nASSISTANT:"
inputs = tokenizer(prompt_2, return_tensors='pt').to('cuda:0')
output_resampled = model.generate(inputs.input_ids, max_length=2048)
raw_output_resampled_text = tokenizer.decode(output_resampled[0])
resampled_text = extract_last_assistant_response(raw_output_resampled_text)
print('---------Anonymized Version--------')
print(resampled_text)
#output:
'''
Subject: HR Incident Report: Speculation of Drug Misuse by Mr. Edward Martin
Dear Mrs. Charlotte Johnson,
I trust you're well. I wish to bring to your attention a concerning matter involving one of our esteemed employees, Mr. Edward Martin.
Employee Details:
Name: Edward Martin
Position: Senior Marketing Creative
Department: Marketing
Date of Joining: January 15, 2020
Reporting Manager: Mrs. Jane Anderson
Incident Details:
Date: October 25, 2023
Location: Restroom, topmost floor
Time: midday
Description of Incident:
On the date specified, a few colleagues reported unusual behavior exhibited by Mr. Martin, which raised concerns about potential drug misuse. Witnesses mentioned that Edward appeared disoriented and was found in the restroom for an extended period. Some employees also discovered unidentified pills in close proximity to his chair.
Witness Accounts:
Ms. Marie Foster: "Edward seemed distracted and not his usual self today. He's been taking frequent breaks and appears a bit disoriented."
Mr. Benjamin Adams: "I found some pills near his chair on the floor. It's concerning, and I felt it necessary to report."
Immediate Actions Taken:
Mr. Edward Martin was approached by People Management for a preliminary conversation to understand the situation.
Mrs. Jane Anderson, his reporting manager, was made aware of the concerns.
Recommendations:
It's crucial to have a private and supportive conversation with Mr. Martin to understand if there's an underlying issue.
Consider referring Edward to our Personal Assistance Program (PAP) for counseling or support.
It may be beneficial to organize a session on drug awareness and workplace safety for all employees.
It's of utmost importance to handle this situation with sensitivity and discretion, ensuring the wellbeing of Mr. Martin and maintaining the integrity of our workplace environment. This email serves as a formal documentation of the incident. We'll determine the subsequent course of action based on your guidance and the recommendations provided.
Looking forward to your direction on this matter.
'''
```
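Since the second pass is generative, it can be worth checking that none of the original entities survived in the anonymized text. The following is a minimal sketch of such a check, using plain substring matching. The function name `find_leaked_entities` and the sample sentence are hypothetical.

```python
def find_leaked_entities(anonymized_text, entity_map):
    """Return original entities that still appear verbatim in the anonymized text."""
    leaked = []
    for original, replacement in entity_map.items():
        # Entities mapped to themselves cannot be distinguished from leaks, so skip them
        if original == replacement:
            continue
        if original in anonymized_text:
            leaked.append(original)
    return leaked

entity_map = {
    "Mr. Benjamin Mitchell": "Mr. Edward Martin",
    "January 15, 2020": "January 15, 2020",
    "Emily Clark": "Marie Foster",
}
sample_output = "Ms. Marie Foster reported the matter to Mr. Benjamin Mitchell on January 15, 2020."
print(find_leaked_entities(sample_output, entity_map))
```

This only catches exact matches of the dictionary keys. Partial mentions such as a surname on its own would need a separate check.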
# Example: Process the anonymized version with GPT-4 and restore the original entities
```python
import torch
import json
import re
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("metricspace/EntityAnonymization-3B-V0.9")
model = AutoModelForCausalLM.from_pretrained("metricspace/EntityAnonymization-3B-V0.9", torch_dtype=torch.bfloat16)
model.to("cuda:0")
# Anonymized input
anonymized_text = '''Subject: HR Incident Report: Speculation of Drug Misuse by Mr. Edward Martin
Dear Mrs. Charlotte Johnson,
I trust you're well. I wish to bring to your attention a concerning matter involving one of our esteemed employees, Mr. Edward Martin.
Employee Details:
Name: Edward Martin
Position: Senior Marketing Creative
Department: Marketing
Date of Joining: January 15, 2020
Reporting Manager: Mrs. Jane Anderson
Incident Details:
Date: October 25, 2023
Location: Restroom, topmost floor
Time: midday
Description of Incident:
On the date specified, a few colleagues reported unusual behavior exhibited by Mr. Martin, which raised concerns about potential drug misuse. Witnesses mentioned that Edward appeared disoriented and was found in the restroom for an extended period. Some employees also discovered unidentified pills in close proximity to his chair.
Witness Accounts:
Ms. Marie Foster: "Edward seemed distracted and not his usual self today. He's been taking frequent breaks and appears a bit disoriented."
Mr. Benjamin Adams: "I found some pills near his chair on the floor. It's concerning, and I felt it necessary to report."
Immediate Actions Taken:
Mr. Edward Martin was approached by People Management for a preliminary conversation to understand the situation.
Mrs. Jane Anderson, his reporting manager, was made aware of the concerns.
Recommendations:
It's crucial to have a private and supportive conversation with Mr. Martin to understand if there's an underlying issue.
Consider referring Edward to our Personal Assistance Program (PAP) for counseling or support.
It may be beneficial to organize a session on drug awareness and workplace safety for all employees.
It's of utmost importance to handle this situation with sensitivity and discretion, ensuring the wellbeing of Mr. Martin and maintaining the integrity of our workplace environment. This email serves as a formal documentation of the incident. We'll determine the subsequent course of action based on your guidance and the recommendations provided.
Looking forward to your direction on this matter.
'''
# Entities map
entities_map = '''
{
"Mr. Benjamin Mitchell": "Mr. Edward Martin",
"Mrs. Alice Williams": "Mrs. Charlotte Johnson",
"January 15, 2020": "January 15, 2020",
"Mrs. Jane Fitzgerald": "Mrs. Jane Anderson",
"October 25, 2023": "October 25, 2023",
"4th Floor": "topmost floor",
"11:45 AM": "midday",
"Emily Clark": "Marie Foster",
"Employee Assistance Program (EAP)": "Personal Assistance Program (PAP)",
"Robert Taylor": "Benjamin Adams",
}
'''
# Step 1: Processing anonymized text with GPT-4
import openai  # this example uses the legacy openai<1.0 interface (openai.ChatCompletion)
openai.api_key = "<API_KEY>"
completion = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        # The anonymized report defined above is appended to the instruction
        {"role": "user", "content": f"Write an official warning letter to the employee, that we do not tolerate it and with the next incident he gets fired in the name of Mrs. Charlotte Johnson Human Resources Manager. Here is the report with information. {anonymized_text}"}
    ]
)
print(completion.choices[0].message.content)
#output
'''
Subject: Official Warning – Substance Misuse Policy Violation
Dear Mr. Edward Martin,
We hope this letter finds you well. This letter serves as an official warning in regards to an incident that occurred on October 25, 2023, in which signs of drug misuse were reported. This alleged conduct is highly alarming and contrary to our company’s policies and guidelines.
The incident detailed allegations of unusual behavior indicative of possible substance abuse. Colleagues reported that you appeared disoriented and were found in the restroom for an extended period. Additionally, unidentified pills were discovered near your workspace.
Our company is committed to providing a safe and non-detrimental work environment for all its workforce. This commitment is compromised when any type of drug misuse occurs. We draw your attention to our Employee Handbook, specifically 'Section 5: Substance Misuse', that states any illegal drug use, substance misuse or distribution thereof is strictly prohibited and could be subject to severe disciplinary action, including termination.
This is an official warning that such behavior misaligns with our workplace norms and cannot be tolerated. Another incident like this or similar breach of company guidelines may lead to further disciplinary action, up to and including termination of employment.
Please note that this is not an assumption of your guilt but an assertion to remain vigilant against any practices that could harm you or the workplace environment. We encourage you to utilize our Personal Assistance Program (PAP) as a tool for counseling and support, if needed.
We believe in your potential to rectify this situation and to maintain the high standards we are all accustomed to in our organization.
Should you need assistance or if you wish to discuss this matter further, please feel free to reach out to me. We appreciate your immediate attention to this important issue.
Yours sincerely,
Mrs. Charlotte Johnson
Human Resources Manager
'''
# Step 2: Restore the original entities in the text processed by GPT-4.
import ast
def swap_keys_and_values_in_string(input_str):
    # Convert the input string to a dictionary
    input_dict = ast.literal_eval(input_str.strip())
    # Swap the keys and values
    swapped_dict = {v: k for k, v in input_dict.items()}
    # Convert the swapped dictionary back to a string
    swapped_str = str(swapped_dict)
    return swapped_str
gpt_response = completion.choices[0].message.content
entities_map = swap_keys_and_values_in_string(entities_map)
prompt = f"USER: Rephrase with {entities_map}: {gpt_response}\n\nASSISTANT:"
inputs = tokenizer(prompt, return_tensors='pt').to('cuda:0')
outputs = model.generate(inputs.input_ids, max_new_tokens=2048)
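output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Keep only the model's answer after the last "ASSISTANT:" marker
print(output_text.split("ASSISTANT:")[-1].strip())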
#output:
'''
Subject: Official Warning – Substance Misuse Policy Violation
Dear Mr. Benjamin Mitchell,
We hope this letter finds you well. This letter serves as an official warning in regards to an incident that occurred on January 15,
2020, in which signs of drug misuse were reported. This alleged conduct is highly alarming and contrary to our company’s policies and guidelines.
The incident detailed allegations of unusual behavior indicative of possible substance abuse. Colleagues reported that you appeared disoriented and
were found in the restroom for an extended period. Additionally, unidentified pills were discovered near your workspace.
Our company is committed to providing a safe and non-detrimental work environment for all its workforce. This commitment is compromised when any
type of drug misuse occurs. We draw your attention to our Employee Handbook, specifically 'Section 5: Substance Misuse', that states any illegal
drug use, substance misuse or distribution thereof is strictly prohibited and could be subject to severe disciplinary action, including termination.
This is an official warning that such behavior misaligns with our workplace norms and cannot be tolerated. Another incident like this or similar breach
of company guidelines may lead to further disciplinary action, up to and including termination of employment.
Please note that this is not an assumption of your guilt but an assertion to remain vigilant against any practices that could harm you or the workplace
environment. We encourage you to utilize our Employee Assistance Program (EAP) as a tool for counseling and support, if needed.
We believe in your potential to rectify this situation and to maintain the high standards we are all accustomed to in our organization.
Should you need assistance or if you wish to discuss this matter further, please feel free to reach out to me. We appreciate your immediate attention
to this important issue.
Yours sincerely,
Mrs. Alice Williams,
Human Resources Manager.
'''
```
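Model-based restoration is generative, so it can alter text beyond the entities. In the sample output above, for instance, the incident date differs from the date in the GPT-4 letter. Where an exact round trip matters, a deterministic string replacement over the cached dictionary may be a useful alternative or cross-check. The sketch below assumes the dictionary is available as a string in the same format as `entities_map`. The function name `restore_entities` and the sample letter are hypothetical.

```python
import ast

def restore_entities(text, entities_map_str):
    """Replace randomised entities with the originals using plain string replacement."""
    entity_map = ast.literal_eval(entities_map_str.strip())
    inverted = {randomised: original for original, randomised in entity_map.items()}
    # Replace longer strings first so shorter entities cannot clobber parts of longer ones
    for randomised in sorted(inverted, key=len, reverse=True):
        text = text.replace(randomised, inverted[randomised])
    return text

entities_map_str = '''
{
"Mr. Benjamin Mitchell": "Mr. Edward Martin",
"Employee Assistance Program (EAP)": "Personal Assistance Program (PAP)",
}
'''
letter = "Dear Mr. Edward Martin, please use our Personal Assistance Program (PAP)."
print(restore_entities(letter, entities_map_str))
```

Plain replacement only restores strings that appear exactly as dictionary values. Shortened mentions such as "Mr. Martin" or "Edward" are not covered unless they are added to the dictionary.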
…
# Dataset and Training Documentation for Audit
If you require the original dataset used for training this model, or further documentation related to its training and architecture for audit purposes, you can request this information by contacting us.
# Further Tuning Services for Custom Use Cases
For specialized needs or custom use cases, we offer further tuning services to adapt the model to your specific requirements. To inquire about these services, please reach out to us at:
📧 Email: [email protected]
Please note that the availability of the dataset, additional documentation, and tuning services may be subject to certain conditions and limitations.