---
license: apache-2.0
datasets:
- King-Harry/NinjaMasker-PII-Redaction-Dataset
language:
- en
tags:
- PII
- Redaction
- Masking
- LLM
- Llama2
---


🤗 About me • 🐱 Harry.vc • 🐦 X.com • 📃 Papers

# 🥷 Model Card for King-Harry/NinjaMasker-PII-Redaction

This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios such as call transcripts.

## News

- 🔥🔥🔥 [2023/10/06] **Building a new dataset**: creating a significantly improved dataset and fixing stop tokens.
- 🔥🔥🔥 [2023/10/05] **NinjaMasker-PII-Redaction** version 1 was released.

## Model Details

### 📖 Model Description

This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.

- **Developed by:** [Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/)
- **Model type:** Fine-tuned language model
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** NousResearch/Llama-2-7b-chat-hf

### 🌱 Model Sources

- **Repository:** Hosted on Hugging Face
- **Demo:** Coming soon

### 🧪 Test the model

Log in to Hugging Face (if you are not already logged in):

```python
!pip install transformers

from huggingface_hub import notebook_login

# Authenticate with the Hugging Face Hub
notebook_login()
```

Load the model and tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging

# Silence non-critical warnings
logging.set_verbosity(logging.CRITICAL)

# Load the model and tokenizer from the Hub
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```

Generate text:

```python
# Build a text-generation pipeline
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)

# Llama 2 chat models expect prompts wrapped in [INST] ... [/INST] tags
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"[INST] {prompt} [/INST]")

# Print the generated text
print(result[0]['generated_text'])
```

## Uses

### 🎯 Direct Use

The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.

### ⬇️ Downstream Use

The model has potential for numerous downstream applications, though specific use cases are yet to be fully explored.

### ❌ Out-of-Scope Use

The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.

## ⚖️ Bias, Risks, and Limitations

The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.

### 👍 Recommendations

Users should be aware of the model's English-only training and should exercise caution when using it in critical systems.

## 🏋️ Training Details

### 📊 Training Data

The model was trained on [King-Harry/NinjaMasker-PII-Redaction-Dataset](https://huggingface.co/datasets/King-Harry/NinjaMasker-PII-Redaction-Dataset), a dataset of 43,000 question/answer pairs containing various forms of PII. The model looks for 63 PII labels.

#### ⚙️ Training Hyperparameters

- **Training regime:** FP16

#### 🚀 Speeds, Sizes, Times

- **Hardware:** T4 GPU
- **Cloud provider:** Google Colab Pro (for the extra RAM)
- **Training duration:** ~4 hours

## 📋 Evaluation

Evaluation is pending.

## 🌍 Environmental Impact

Training ran for roughly four hours on a single T4 GPU; an exact carbon-footprint calculation is pending.

- **Hardware type:** T4 GPU
- **Hours used:** ~4
- **Cloud provider:** Google Colab Pro

## 📄 Technical Specifications

### 🏛️ Model Architecture and Objective

The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.
#### ๐Ÿ–ฅ๏ธ Hardware - **Training Hardware:** T4 GPU (with extra RAM) #### ๐Ÿ’พ Software - **Environment:** Google CoLab Pro - ## ๐Ÿช– Disclaimer - This model is in its first generation and will be updated rapidly. ## โœ๏ธ Model Card Authors [Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/) ## ๐Ÿ“ž Model Card Contact harry.roy@gmail.com