---
license: apache-2.0
datasets:
- King-Harry/NinjaMasker-PII-Redaction-Dataset
language:
- en
tags:
- PII
- Redaction
- Masking
- LLM
- Llama2
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/6488b81bc6b1f2b4c8d93d4e/pukpRSNaPbWiSKhuiGF8R.jpeg" alt="banner">
</p>
<p align="center">
πŸ€— <a href="https://www.linkedin.com/in/harryroy/" target="_blank">About me</a> β€’ 🐱 <a href="https://www.harry.vc/" target="_blank">Harry.vc</a> β€’ 🐦 <a href="https://X.com/" target="_blank">X.com</a> β€’ πŸ“ƒ <a href="https://arxiv.org/" target="_blank">Papers</a> <br>
</p>
# πŸ₯· Model Card for King-Harry/NinjaMasker-PII-Redaction
This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios like call transcripts.
## News
- πŸ”₯πŸ”₯πŸ”₯[2023/10/06] **Building New Dataset** — creating a significantly improved dataset and fixing stop tokens.
- πŸ”₯πŸ”₯πŸ”₯[2023/10/05] **NinjaMasker-PII-Redaction** version 1 was released.
## Model Details
### πŸ“– Model Description
This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.
- **Developed by:** [Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/)
- **Model type:** Fine-tuned Language Model
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** NousResearch/Llama-2-7b-chat-hf
### 🌱 Model Sources
- **Repository:** Hosted on HuggingFace
- **Demo:** Coming soon
### πŸ§ͺ Test the model
Log in to Hugging Face (if not already):
```python
!pip install transformers
from huggingface_hub import notebook_login

notebook_login()
```
Load the model and tokenizer:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging

# Silence non-critical warnings
logging.set_verbosity(logging.CRITICAL)

# Load the model and tokenizer (uses your Hugging Face login if the repo requires it)
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
Generate Text
```python
# Generate text
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"<s>[INST] {prompt} [/INST]")
# Print the generated text
print(result[0]['generated_text'])
```
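The pipeline output above includes the original instruction wrapper along with the model's answer. A small helper can extract just the redacted text; this is a sketch that assumes the model echoes the `<s>[INST] … [/INST]` prompt before its answer, and the `[FIRSTNAME]` label in the sample string is hypothetical:

```python
def extract_redacted(generated: str) -> str:
    """Return the text after the closing [/INST] tag, if present.

    Assumes the generated string echoes the "<s>[INST] ... [/INST]"
    prompt before the model's answer, as in the example above.
    """
    marker = "[/INST]"
    idx = generated.find(marker)
    if idx == -1:
        # No marker found; return the text unchanged (minus whitespace)
        return generated.strip()
    return generated[idx + len(marker):].strip()

# Hypothetical output shape (not actual model output):
sample = "<s>[INST] My name is Harry [/INST] My name is [FIRSTNAME]"
print(extract_redacted(sample))  # -> My name is [FIRSTNAME]
```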
## Uses
### 🎯 Direct Use
The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.
### ⬇️ Downstream Use
The model has potential for numerous downstream applications, though specific use-cases are yet to be fully explored.
### ❌ Out-of-Scope Use
The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.
## βš–οΈ Bias, Risks, and Limitations
The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.
### πŸ‘ Recommendations
Users should be aware of the model's language-specific training and should exercise caution when using it in critical systems.
## πŸ‹οΈ Training Details
### πŸ“Š Training Data
The model was trained on a dataset of 43,000 question/answer pairs containing various forms of PII, covering 63 distinct PII labels.
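Each question/answer pair can be serialized into the same Llama 2 instruction format used in the inference example above. The exact template used during training is not documented here, so this is an assumed sketch, and the replacement labels shown are hypothetical:

```python
def format_example(question: str, answer: str) -> str:
    """Wrap a question/answer pair in the Llama 2 instruction format.

    This mirrors the "<s>[INST] ... [/INST]" prompt used at inference
    in this card; the actual training template may differ.
    """
    return f"<s>[INST] {question} [/INST] {answer} </s>"

pair = (
    "My name is Harry and I live in Winnipeg.",
    "My name is [FIRSTNAME_1] and I live in [CITY_1].",  # hypothetical labels
)
print(format_example(*pair))
```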
#### βš™οΈ Training Hyperparameters
- **Training regime:** FP16
#### πŸš€ Speeds, Sizes, Times
- **Hardware:** T4 GPU
- **Cloud Provider:** Google Colab Pro (for the extra RAM)
- **Training Duration:** ~4 hours
## πŸ“‹ Evaluation
Evaluation is pending.
## 🌍 Environmental Impact
The model was fine-tuned for roughly four hours on a single T4 GPU; exact emissions calculations are pending.
- **Hardware Type:** T4 GPU
- **Hours used:** ~4
- **Cloud Provider:** Google Colab Pro
## πŸ“„ Technical Specifications
### πŸ›οΈ Model Architecture and Objective
The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.
#### πŸ–₯️ Hardware
- **Training Hardware:** T4 GPU (with extra RAM)
#### πŸ’Ύ Software
- **Environment:** Google Colab Pro
## πŸͺ– Disclaimer
This model is in its first generation and will be updated rapidly.
## ✍️ Model Card Authors
[Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/)
## πŸ“ž Model Card Contact
[email protected]