---
license: apache-2.0
datasets:
- King-Harry/NinjaMasker-PII-Redaction-Dataset
language:
- en
tags:
- PII
- Redaction
- Masking
- LLM
- Llama2
---
About me • Harry.vc • X.com • Papers
# Model Card for King-Harry/NinjaMasker-PII-Redaction
This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios like call transcripts.
## News
- **[2023/10/06]** Building a significantly improved dataset with corrected stop tokens.
- **[2023/10/05]** NinjaMasker-PII-Redaction version 1 released.
## Model Details
### Model Description
This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.
- **Developed by:** [Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/)
- **Model type:** Fine-tuned Language Model
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** NousResearch/Llama-2-7b-chat-hf
### Model Sources
- **Repository:** Hosted on Hugging Face
- **Demo:** Coming soon
### Test the Model
Log into Hugging Face (if you are not already):
```python
!pip install transformers
from huggingface_hub import notebook_login
notebook_login()
```
Load the model:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging

# Suppress warnings from the transformers library
logging.set_verbosity(logging.CRITICAL)

# Load the model and tokenizer
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
Generate text:
```python
# Wrap the input in Llama-2 instruction tags and generate
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"[INST] {prompt} [/INST]")

# The output contains the prompt followed by the redacted continuation
print(result[0]['generated_text'])
```
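The pipeline output echoes the prompt, so a small helper can strip everything up to the closing `[/INST]` tag and keep only the model's redacted answer. A minimal sketch (the `[FIRST_NAME]` label below is illustrative, not a confirmed label name from the model's label set):

```python
def extract_response(generated_text: str) -> str:
    """Return only the model's answer from a Llama-2 style generation."""
    marker = "[/INST]"
    _, sep, answer = generated_text.partition(marker)
    # Fall back to the full text if the marker is absent
    return answer.strip() if sep else generated_text.strip()

# Hypothetical generation string for illustration:
sample = "[INST] My name is Harry [/INST] My name is [FIRST_NAME]"
print(extract_response(sample))  # -> My name is [FIRST_NAME]
```

This keeps downstream code independent of the prompt template: if the tag is ever missing, the helper returns the full generation rather than an empty string.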
## Uses
### Direct Use
The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.
### Downstream Use
The model has potential for numerous downstream applications, though specific use-cases are yet to be fully explored.
### Out-of-Scope Use
The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.
## Bias, Risks, and Limitations
The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.
### Recommendations
Users should be aware of the model's language-specific training and should exercise caution when using it in critical systems.
## Training Details
### Training Data
The model was trained on a dataset of 43,000 question/answer pairs containing various forms of PII, covering the 63 PII labels the model detects and masks.
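The question/answer pairs presumably follow the same `[INST] … [/INST]` format used in the inference example above. A hedged sketch of what one pair might look like (the `[PHONE_NUMBER]` label is an illustrative name; the full 63-label set is not listed in this card):

```python
def build_prompt(raw_text: str) -> str:
    # Wrap raw text in the Llama-2 instruction tags used at inference time
    return f"[INST] {raw_text} [/INST]"

# One hypothetical training pair: raw text in, masked text out
pair = {
    "input": build_prompt("Call me at 204-555-0188."),
    "output": "Call me at [PHONE_NUMBER].",
}
print(pair["input"])   # -> [INST] Call me at 204-555-0188. [/INST]
print(pair["output"])  # -> Call me at [PHONE_NUMBER].
```

Matching the training template at inference time matters for instruction-tuned models; mismatched tags typically degrade masking quality.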
#### Training Hyperparameters
- **Training regime:** FP16 mixed precision
#### Speeds, Sizes, Times
- **Hardware:** T4 GPU
- **Cloud Provider:** Google Colab Pro (for the extra RAM)
- **Training Duration:** ~4 hours
## Evaluation
Evaluation is pending.
## Environmental Impact
Training consumed roughly four hours on a single GPU; the resulting carbon footprint is modest, and exact calculations are pending.
- **Hardware Type:** T4 GPU
- **Hours used:** ~4
- **Cloud Provider:** Google Colab Pro
## Technical Specifications
### Model Architecture and Objective
The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.
#### Hardware
- **Training Hardware:** T4 GPU (with extra RAM)
#### Software
- **Environment:** Google Colab Pro
## Disclaimer
This model is a first-generation release and will be updated frequently.
## Model Card Authors
[Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/)
## Model Card Contact
harry.roy@gmail.com