---
license: apache-2.0
datasets:
- King-Harry/NinjaMasker-PII-Redaction-Dataset
language:
- en
tags:
- PII
- Redaction
- Masking
- LLM
- Llama2
---

<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6488b81bc6b1f2b4c8d93d4e/pukpRSNaPbWiSKhuiGF8R.jpeg" alt="banner">
</p>

<p align="center">
<a href="https://www.linkedin.com/in/harryroy/" target="_blank">About me</a> • <a href="https://www.harry.vc/" target="_blank">Harry.vc</a> • <a href="https://X.com/" target="_blank">X.com</a> • <a href="https://arxiv.org/" target="_blank">Papers</a> <br>
</p>

# Model Card for King-Harry/NinjaMasker-PII-Redaction

This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios such as call transcripts.
## News

- [2023/10/06] **Building a new dataset**: work is underway on a significantly improved dataset with fixed stop tokens.
- [2023/10/05] **NinjaMasker-PII-Redaction** version 1 was released.
## Model Details

### Model Description

This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.

- **Developed by:** [Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/)
- **Model type:** Fine-tuned Language Model
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Fine-tuned from model:** NousResearch/Llama-2-7b-chat-hf

### Model Sources

- **Repository:** Hosted on Hugging Face
- **Demo:** Coming soon
### Test the model

Log in to Hugging Face (if you are not already logged in):

```python
# Install the library and authenticate with the Hub so the model can be downloaded
!pip install transformers
from huggingface_hub import notebook_login

notebook_login()
```
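If you are working in a plain Python script rather than a notebook, `huggingface_hub.login` can be used instead of `notebook_login`; the token below is a placeholder for your own access token.

```python
from huggingface_hub import login

# Placeholder token for illustration only; create a real one at huggingface.co/settings/tokens
login(token="hf_your_token_here")
```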
Load the model and tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging

# Silence non-critical warnings from the transformers library
logging.set_verbosity(logging.CRITICAL)

# Load the model and tokenizer (authentication comes from the login step above)
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
Generate text:

```python
# Build a text-generation pipeline and wrap the prompt in the Llama 2 instruction format
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"<s>[INST] {prompt} [/INST]")

# Print the generated text
print(result[0]['generated_text'])
```
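The pipeline returns the prompt wrapped in the `[INST]` tags together with the model's completion. If you only want the redacted text, a small cleanup step such as the following (a convenience sketch, not an official post-processing routine) can be applied:

```python
# Keep only the text generated after the closing [/INST] tag
generated = result[0]['generated_text']
redacted = generated.split("[/INST]", 1)[-1].strip()
print(redacted)
```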
## Uses

### Direct Use

The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.

### Downstream Use

The model has potential for numerous downstream applications, though specific use cases are yet to be fully explored.

### Out-of-Scope Use

The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.

## Bias, Risks, and Limitations

The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.

### Recommendations

Users should be aware of the model's language-specific training and should exercise caution when using it in critical systems.
## Training Details

### Training Data

The model was trained on a dataset of 43,000 question/answer pairs containing various forms of PII, and it is trained to detect 63 distinct PII labels.
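
To inspect the training data yourself, it can be loaded from the Hub with the `datasets` library. The split and column names are not documented in this card, so the `"train"` split below is an assumption; check the printed dataset structure first.

```python
from datasets import load_dataset

# Download the dataset from the Hugging Face Hub
dataset = load_dataset("King-Harry/NinjaMasker-PII-Redaction-Dataset")

print(dataset)              # shows the available splits and columns
print(dataset["train"][0])  # assumes a "train" split exists; adjust if it does not
```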
#### Training Hyperparameters

- **Training regime:** FP16
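
The full training script and remaining hyperparameters are not published in this card. Purely as an illustration of what an FP16 configuration with the Hugging Face `TrainingArguments` API can look like on a single T4, a setup along these lines is typical; every value below is an assumption, not the recorded setup for this model.

```python
from transformers import TrainingArguments

# Illustrative FP16 settings only; batch size, learning rate and epoch count are
# assumptions, not the hyperparameters actually used to train NinjaMasker-PII-Redaction.
training_args = TrainingArguments(
    output_dir="./ninjamasker-pii-redaction",
    fp16=True,                        # mixed-precision training, matching the FP16 regime above
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    num_train_epochs=1,
    logging_steps=25,
)
```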
#### Speeds, Sizes, Times

- **Hardware:** T4 GPU
- **Cloud Provider:** Google Colab Pro (for the extra RAM)
- **Training Duration:** ~4 hours
## Evaluation

Evaluation is pending.
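
In the meantime, a rough way to sanity-check the model is to compare its output against the reference answers in the training dataset. The `input` and `output` column names below are hypothetical (check the actual dataset schema), and exact-match accuracy is only a crude proxy for redaction quality; this sketch reuses the `pipe` pipeline from the usage example above.

```python
from datasets import load_dataset

# Hypothetical sanity check; the "input"/"output" column names are assumptions
eval_data = load_dataset("King-Harry/NinjaMasker-PII-Redaction-Dataset", split="train[:100]")

matches = 0
for example in eval_data:
    generated = pipe(f"<s>[INST] {example['input']} [/INST]")[0]["generated_text"]
    prediction = generated.split("[/INST]", 1)[-1].strip()
    matches += int(prediction == example["output"].strip())

print(f"Exact-match rate on 100 examples: {matches / len(eval_data):.2%}")
```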
## Environmental Impact

Given the significant computing resources used, the model likely has a substantial carbon footprint. Exact calculations are pending.

- **Hardware Type:** T4 GPU
- **Hours used:** ~4
- **Cloud Provider:** Google Colab Pro
## Technical Specifications

### Model Architecture and Objective

The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.

#### Hardware

- **Training Hardware:** T4 GPU (with extra RAM)

#### Software

- **Environment:** Google Colab Pro
## Disclaimer

This model is in its first generation and will be updated rapidly.

## Model Card Authors

[Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/)

## Model Card Contact

[email protected]