🥷 Model Card for King-Harry/NinjaMasker-PII-Redaction

This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios like call transcripts.

News

  • 🔥🔥🔥[2023/10/06] Building a new, significantly improved dataset and fixing stop tokens.
  • 🔥🔥🔥[2023/10/05] NinjaMasker-PII-Redaction version 1 was released.

Model Details

📖 Model Description

This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.

  • Developed by: Harry Roy McLaughlin
  • Model type: Fine-tuned Language Model
  • Language(s) (NLP): English
  • License: TBD
  • Fine-tuned from model: NousResearch/Llama-2-7b-chat-hf

🌱 Model Sources

  • Repository: Hosted on Hugging Face
  • Demo: Coming soon

🧪 Test the model

Log in to Hugging Face (if you have not already)

!pip install transformers
from huggingface_hub import notebook_login

# Opens a prompt for your Hugging Face access token
notebook_login()

Load Model

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging

# Ignore warnings
logging.set_verbosity(logging.CRITICAL)

# Load the model and tokenizer (authentication uses the token saved at login)
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
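Before loading, it can help to check whether the weights will fit in memory. The following is a rough sketch only: it assumes the "7b" in the base model name means exactly 7 billion parameters and counts the weights alone, ignoring activations and other overhead. The names `weight_memory_gib` and `NOMINAL_PARAMS` are illustrative.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: "7b" is taken as exactly 7 billion parameters
NOMINAL_PARAMS = 7e9

fp32_gib = weight_memory_gib(NOMINAL_PARAMS, 4)  # 4 bytes per float32 weight
fp16_gib = weight_memory_gib(NOMINAL_PARAMS, 2)  # 2 bytes per float16 weight

print(f"float32 weights: ~{fp32_gib:.1f} GiB")
print(f"float16 weights: ~{fp16_gib:.1f} GiB")
```

If memory is tight, `from_pretrained` accepts a `torch_dtype` argument (for example `torch.float16`) to load the weights in half precision.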

Generate Text

# Generate text
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"<s>[INST] {prompt} [/INST]")

# Print the generated text
print(result[0]['generated_text'])
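By default the text-generation pipeline returns the prompt followed by the completion in `generated_text`, so the printed string includes the original, unredacted input. A minimal sketch of separating the two is shown here; the helper names `build_prompt` and `extract_response` are hypothetical, and the `[NAME]` placeholder in the sample is illustrative rather than the model's confirmed output format.

```python
INST_OPEN = "[INST]"
INST_CLOSE = "[/INST]"


def build_prompt(text: str) -> str:
    """Wrap raw text in the instruction markers used in the example above."""
    return f"<s>{INST_OPEN} {text} {INST_CLOSE}"


def extract_response(generated_text: str) -> str:
    """Return only the text after the last [/INST] marker.

    If the marker is missing, the input is returned unchanged (stripped).
    """
    _, sep, tail = generated_text.rpartition(INST_CLOSE)
    return tail.strip() if sep else generated_text.strip()


# Hypothetical string standing in for result[0]['generated_text']
sample = build_prompt("My name is Harry") + " My name is [NAME]"
print(extract_response(sample))
```

Alternatively, passing `return_full_text=False` when calling the pipeline asks it to return only the newly generated text.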

Uses

🎯 Direct Use

The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.
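Because the generation example caps output at `max_length=100` tokens, a full call transcript may not fit in a single call. One possible approach, sketched here under that assumption, is to redact the transcript one utterance at a time. The function `redact_transcript` and the stand-in `fake_redact` are hypothetical; in practice the callable would wrap the model.

```python
from typing import Callable, List


def redact_transcript(transcript: str, redact: Callable[[str], str]) -> List[str]:
    """Apply `redact` to each non-empty line of a transcript, preserving order."""
    redacted = []
    for line in transcript.splitlines():
        line = line.strip()
        if line:  # skip blank lines between utterances
            redacted.append(redact(line))
    return redacted


# Stand-in redactor for demonstration only; it does not call the model
def fake_redact(utterance: str) -> str:
    return utterance.replace("Harry", "[NAME]")


transcript = """Agent: Can I have your name?

Caller: It's Harry."""

for line in redact_transcript(transcript, fake_redact):
    print(line)
```

Splitting by line trades context for length: PII that spans several utterances, such as a phone number corrected over multiple turns, may be harder to catch this way.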

⬇️ Downstream Use

The model has potential for numerous downstream applications, though specific use-cases are yet to be fully explored.

❌ Out-of-Scope Use

The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.

⚖️ Bias, Risks, and Limitations

The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.

👍 Recommendations

Users should be aware of the model's language-specific training and should exercise caution when using it in critical systems.
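One way to exercise that caution is to run a simple independent check over the model's output before trusting it. The sketch below is an assumption about how such a check might look, not part of the model: the patterns in `RESIDUAL_PATTERNS` are illustrative, far from exhaustive, and will produce both false positives and misses.

```python
import re
from typing import Dict, List

# Illustrative patterns only; not part of the model and not exhaustive
RESIDUAL_PATTERNS = {
    "digit_run": re.compile(r"\d{3,}"),                    # 3+ consecutive digits
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),  # simple email shape
}


def find_residual_pii(text: str) -> Dict[str, List[str]]:
    """Return suspicious substrings that survived redaction, keyed by pattern name."""
    hits: Dict[str, List[str]] = {}
    for name, pattern in RESIDUAL_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits


# An empty dict means none of the illustrative patterns matched
print(find_residual_pii("Call 2034355 or mail a.b@example.com"))
print(find_residual_pii("My name is [NAME]"))
```

Output flagged by a check like this can be routed to manual review rather than released automatically.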

🏋️ Training Details

📊 Training Data

The model was trained on a dataset of 43,000 question/answer pairs containing various forms of PII. The model looks for 63 PII labels.

⚙️ Training Hyperparameters

  • Training regime: FP16

🚀 Speeds, Sizes, Times

  • Hardware: T4 GPU
  • Cloud Provider: Google Colab Pro (for the extra RAM)
  • Training Duration: ~4 hours

📋 Evaluation

Evaluation is pending.

🌍 Environmental Impact

Given the significant computing resources used, the model likely has a substantial carbon footprint. Exact calculations are pending.

  • Hardware Type: T4 GPU
  • Hours used: ~4
  • Cloud Provider: Google Colab Pro
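Until exact calculations are available, a rough bound on GPU energy can be sketched from the figures above. This assumes the T4 ran at its rated 70 W board power for the full 4 hours and ignores CPU, memory, and data-center overhead. The grid carbon intensity `ASSUMED_KG_CO2E_PER_KWH` is a placeholder, not a measured value for the actual training region.

```python
T4_MAX_POWER_WATTS = 70  # rated maximum board power of an NVIDIA T4
TRAINING_HOURS = 4       # "~4 hours" as reported in this card

# Placeholder assumption; replace with the real grid intensity if known
ASSUMED_KG_CO2E_PER_KWH = 0.4


def gpu_energy_kwh(power_watts: float, hours: float) -> float:
    """Energy drawn at constant power, in kilowatt-hours."""
    return power_watts * hours / 1000


def emissions_kg(energy_kwh: float, kg_co2e_per_kwh: float) -> float:
    """Convert energy to emissions using a grid carbon intensity."""
    return energy_kwh * kg_co2e_per_kwh


energy = gpu_energy_kwh(T4_MAX_POWER_WATTS, TRAINING_HOURS)
estimate = emissions_kg(energy, ASSUMED_KG_CO2E_PER_KWH)

print(f"GPU energy (GPU-only bound): {energy:.2f} kWh")
print(f"Emissions under the assumed intensity: {estimate:.3f} kg CO2e")
```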

📄 Technical Specifications

🏛️ Model Architecture and Objective

The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.

🖥️ Hardware

  • Training Hardware: T4 GPU (with extra RAM)

💾 Software

  • Environment: Google Colab Pro

🪖 Disclaimer

This model is in its first generation and will be updated rapidly.

✍️ Model Card Authors

Harry Roy McLaughlin

📞 Model Card Contact

[email protected]
