---
license: apache-2.0
datasets:
  - King-Harry/NinjaMasker-PII-Redaction-Dataset
language:
  - en
tags:
  - PII
  - Redaction
  - Masking
  - LLM
  - Llama2
---

πŸ€— About me β€’ πŸ± Harry.vc β€’ 🐦 X.com β€’ πŸ“ƒ Papers

πŸ₯· Model Card for King-Harry/NinjaMasker-PII-Redaction

This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios like call transcripts.

News

  • πŸ”₯πŸ”₯πŸ”₯ [2023/10/06] Building a new, significantly improved dataset and fixing stop tokens.
  • πŸ”₯πŸ”₯πŸ”₯ [2023/10/05] NinjaMasker-PII-Redaction version 1 was released.

Model Details

πŸ“– Model Description

This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.

  • Developed by: Harry Roy McLaughlin
  • Model type: Fine-tuned Language Model
  • Language(s) (NLP): English
  • License: Apache 2.0 (see metadata above)
  • Finetuned from model: NousResearch/Llama-2-7b-chat-hf

🌱 Model Sources

  • Repository: Hosted on Hugging Face
  • Demo: Coming soon

πŸ§ͺ Test the model

Log into Hugging Face (if you haven't already):

!pip install transformers
from huggingface_hub import notebook_login

notebook_login()

Load Model

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Load the model and tokenizer from the Hub
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

Generate Text

# Generate text
# max_new_tokens bounds only the generated tokens, so a long prompt doesn't eat the output budget
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_new_tokens=100)
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"<s>[INST] {prompt} [/INST]")

# Print the generated text
print(result[0]['generated_text'])
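
Note that the pipeline echoes the prompt back inside generated_text. A small sketch for keeping only the model's redacted continuation (it assumes the [/INST] tag appears exactly once):

# Strip the echoed prompt; everything after [/INST] is the model's output
redacted = result[0]['generated_text'].split("[/INST]", 1)[-1].strip()
print(redacted)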

Uses

🎯 Direct Use

The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.
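
For transcripts, a practical pattern is to redact line by line, reusing the pipe and prompt format from the quick-start above; the transcript lines here are made-up examples, not real data:

# Illustrative transcript; reuses `pipe`, `model`, and `tokenizer` from above
transcript = [
    "Agent: Can I confirm your email? Caller: it's [email protected]",
    "Caller: I live at 12 Main Street, apartment 4B, Winnipeg",
]

for line in transcript:
    out = pipe(f"<s>[INST] {line} [/INST]")
    # Keep only the redacted continuation after the [/INST] tag
    print(out[0]["generated_text"].split("[/INST]", 1)[-1].strip())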

⬇️ Downstream Use

The model has potential for numerous downstream applications, though specific use cases are yet to be fully explored.

❌ Out-of-Scope Use

The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.

βš–οΈ Bias, Risks, and Limitations

The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.

πŸ‘ Recommendations

Users should be aware of the model's language-specific training and should exercise caution when using it in critical systems.
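
One lightweight safeguard, offered as an illustration and not part of the model itself, is to scan redacted output for obvious residual PII shapes and route flagged text to human review:

import re

# Illustrative post-redaction check: flag outputs that still contain
# obvious PII-shaped strings (emails, long digit runs). Not exhaustive.
RESIDUAL_PII = re.compile(
    r"[\w.+-]+@[\w-]+\.[\w.]+"     # email-like
    r"|\b(?:\d[ -]?){7,}\b"        # 7+ digits, phone/card-like
)

def needs_review(redacted_text: str) -> bool:
    return bool(RESIDUAL_PII.search(redacted_text))

print(needs_review("My number is [PHONE_NUMBER]"))   # False
print(needs_review("Call me at 204 555 0199"))       # True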

πŸ‹οΈ Training Details

πŸ“Š Training Data

The model was trained on a dataset of 43,000 question/answer pairs containing various forms of PII, spanning the 63 PII labels the model looks for.
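
To inspect the data yourself, the dataset linked in the metadata above can be pulled with the datasets library. A minimal sketch; the column names aren't documented here, so print them rather than assuming them:

from datasets import load_dataset

# Dataset referenced in the metadata block at the top of this card
ds = load_dataset("King-Harry/NinjaMasker-PII-Redaction-Dataset", split="train")
print(ds.column_names)  # discover the actual field names
print(ds[0])            # one question/answer pair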

βš™οΈ Training Hyperparameters

  • Training regime: FP16
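
Since training used FP16, loading the model in half precision for inference is a natural fit on a 16 GB card like the T4. A minimal sketch; device_map="auto" requires the accelerate package, an assumption beyond the quick-start above:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Half-precision load: halves memory vs. FP32 and matches the FP16
# training regime. `device_map="auto"` needs the `accelerate` package.
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)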

πŸš€ Speeds, Sizes, Times

  • Hardware: T4 GPU
  • Cloud Provider: Google Colab Pro (for the extra RAM)
  • Training Duration: ~4 hours

πŸ“‹ Evaluation

Evaluation is pending.

🌍 Environmental Impact

Training used a single T4 GPU for roughly four hours. Exact emissions calculations are pending; a rough back-of-envelope estimate based on the reported hardware and duration follows the list below.

  • Hardware Type: T4 GPU
  • Hours used: ~4
  • Cloud Provider: Google Colab Pro
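
A rough sketch of that estimate; the T4's ~70 W TDP and a ~0.4 kg CO2eq/kWh grid intensity are assumptions, not measurements:

# Back-of-envelope training-emissions estimate; TDP and grid intensity assumed
gpu_power_kw = 0.07          # NVIDIA T4 TDP ~70 W
hours = 4                    # reported training duration
grid_kgco2_per_kwh = 0.4     # assumed average grid intensity

energy_kwh = gpu_power_kw * hours              # ~0.28 kWh
emissions_kg = energy_kwh * grid_kgco2_per_kwh # ~0.11 kg CO2eq
print(f"~{energy_kwh:.2f} kWh, ~{emissions_kg:.2f} kg CO2eq")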

πŸ“„ Technical Specifications

πŸ›οΈ Model Architecture and Objective

The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.
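
To confirm the underlying architecture, the model config can be inspected directly; a minimal sketch using the standard transformers Llama config attributes:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("King-Harry/NinjaMasker-PII-Redaction")
print(cfg.model_type)          # "llama"
print(cfg.num_hidden_layers)   # 32 for the 7B variant
print(cfg.hidden_size)         # 4096 for the 7B variant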

πŸ–₯️ Hardware

  • Training Hardware: T4 GPU (on a high-RAM Colab runtime)

πŸ’Ύ Software

  • Environment: Google Colab Pro

πŸͺ– Disclaimer

This model is in its first generation and will be updated rapidly.

✍️ Model Card Authors

Harry Roy McLaughlin

πŸ“ž Model Card Contact

[email protected]