---
license: apache-2.0
datasets:
- King-Harry/NinjaMasker-PII-Redaction-Dataset
language:
- en
tags:
- PII
- Redaction
- Masking
- LLM
- Llama2
---


🤗 About me • 🐱 Harry.vc • 🐦 X.com • 📃 Papers

# 🥷 Model Card for King-Harry/NinjaMasker-PII-Redaction

This model is designed for the redaction and masking of Personally Identifiable Information (PII) in complex text scenarios such as call transcripts.

## News

- 🔥🔥🔥 [2023/10/06] **Building a new dataset**: creating a significantly improved dataset and fixing stop tokens.
- 🔥🔥🔥 [2023/10/05] **NinjaMasker-PII-Redaction** version 1 was released.

## Model Details

### 📖 Model Description

This model aims to handle complex and difficult instances of PII redaction that traditional classification models struggle with.

- **Developed by:** [Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/)
- **Model type:** Fine-tuned language model
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** NousResearch/Llama-2-7b-chat-hf

### 🌱 Model Sources

- **Repository:** Hosted on Hugging Face
- **Demo:** Coming soon

### 🧪 Test the model

Log in to Hugging Face (if you are not already logged in):

```python
!pip install transformers

from huggingface_hub import notebook_login

# Authenticate with the Hugging Face Hub
notebook_login()
```

Load the model and tokenizer:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, logging

# Silence non-critical warnings
logging.set_verbosity(logging.CRITICAL)

# Load the model and tokenizer from the Hub
model_name = "King-Harry/NinjaMasker-PII-Redaction"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```

Generate text:

```python
# Build a text-generation pipeline
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=100)

# Llama 2 chat models expect prompts wrapped in [INST] ... [/INST] tags
prompt = "My name is Harry and I live in Winnipeg. My phone number is ummm 204 no 203, ahh 4344, no 4355"
result = pipe(f"[INST] {prompt} [/INST]")

# Print the generated text
print(result[0]['generated_text'])
```

## Uses

### 🎯 Direct Use

The model is specifically designed for direct redaction and masking of PII in complex text inputs such as call transcripts.

### ⬇️ Downstream Use

The model has potential for numerous downstream applications, though specific use cases are yet to be fully explored.

### ❌ Out-of-Scope Use

The model is under development; use in critical systems requiring 100% accuracy is not recommended at this stage.

## ⚖️ Bias, Risks, and Limitations

The model is trained only on English text, which may limit its applicability in multilingual or non-English settings.

### 👍 Recommendations

Users should be aware of the model's English-only training and should exercise caution when using it in critical systems.

## 🏋️ Training Details

### 📊 Training Data

The model was trained on [King-Harry/NinjaMasker-PII-Redaction-Dataset](https://huggingface.co/datasets/King-Harry/NinjaMasker-PII-Redaction-Dataset), a dataset of 43,000 question/answer pairs containing various forms of PII. The model looks for 63 PII labels.

#### ⚙️ Training Hyperparameters

- **Training regime:** FP16

#### 🚀 Speeds, Sizes, Times

- **Hardware:** T4 GPU
- **Cloud provider:** Google Colab Pro (for the extra RAM)
- **Training duration:** ~4 hours

## 📋 Evaluation

Evaluation is pending.

## 🌍 Environmental Impact

Training ran for roughly four hours on a single T4 GPU; an exact carbon-footprint calculation is pending.

- **Hardware type:** T4 GPU
- **Hours used:** ~4
- **Cloud provider:** Google Colab Pro

## 📄 Technical Specifications

### 🏛️ Model Architecture and Objective

The model is a fine-tuned version of Llama 2 7B, tailored for PII redaction tasks.
#### ๐Ÿ–ฅ๏ธ Hardware - **Training Hardware:** T4 GPU (with extra RAM) #### ๐Ÿ’พ Software - **Environment:** Google CoLab Pro - ## ๐Ÿช– Disclaimer - This model is in its first generation and will be updated rapidly. ## โœ๏ธ Model Card Authors [Harry Roy McLaughlin](https://www.linkedin.com/in/harryroy/) ## ๐Ÿ“ž Model Card Contact harry.roy@gmail.com