Update README.md
Browse files
README.md
CHANGED
@@ -83,6 +83,7 @@ Evaluation was done on a held-out portion of the same labeled dataset.
|
|
83 |
## How to Get Started with the Model
|
84 |
|
85 |
```python
|
|
|
86 |
from transformers import AutoTokenizer, AutoModelForTokenClassification
|
87 |
from transformers import pipeline
|
88 |
|
@@ -90,7 +91,57 @@ model_name = "AI-Enthusiast11/pii-entity-extractor"
|
|
90 |
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
91 |
model = AutoModelForTokenClassification.from_pretrained(model_name)
|
92 |
|
93 |
-
|
94 |
-
|
95 |
-
|
96 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
83 |
## How to Get Started with the Model
|
84 |
|
85 |
```python
|
86 |
+
|
87 |
from transformers import AutoTokenizer, AutoModelForTokenClassification
|
88 |
from transformers import pipeline
|
89 |
|
|
|
91 |
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
92 |
model = AutoModelForTokenClassification.from_pretrained(model_name)
|
93 |
|
94 |
+
# Post-processing: stitch subword pieces back into whole entity strings
def merge_tokens(ner_results):
    """Re-assemble subword pieces emitted by the NER pipeline.

    Pieces are grouped by their ``entity_group`` label. A piece that does
    not begin with a space is glued onto the previous value under the same
    label; otherwise it starts a new value. WordPiece ``##`` markers are
    stripped before merging.

    Returns a dict mapping entity label -> list of merged entity strings.
    """
    merged = {}
    for item in ner_results:
        label = item["entity_group"]
        piece = item["word"].replace("##", "")  # drop wordpiece prefixes

        bucket = merged.setdefault(label, [])
        if bucket and not piece.startswith(" "):
            # Continuation of the previous value for this label — glue on.
            bucket[-1] += piece
        else:
            bucket.append(piece)
    return merged
|
111 |
+
|
112 |
+
def redact_text_with_labels(text):
    """Run NER over *text* and mask each detected entity with its label.

    Relies on the module-level ``nlp`` pipeline and the ``merge_tokens``
    helper. Every occurrence of a detected entity string is replaced by
    ``[ENTITY_TYPE]`` in the returned copy; *text* itself is unchanged.
    """
    # Merge subword pieces into whole entity strings, keyed by label.
    detected = merge_tokens(nlp(text))

    result = text
    for label, values in detected.items():
        for value in values:
            # Swap the raw entity text for its bracketed label.
            result = result.replace(value, f"[{label}]")

    return result
|
125 |
+
|
126 |
+
|
127 |
+
|
128 |
+
# Loading the pipeline.
# FIX: the original passed model=model_fine_tuned, but no such name exists in
# this snippet — the loaded model above is bound to `model`.
# aggregation_strategy="simple" groups subword tokens into entity spans.
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Example input (choose one from your examples)
example = "Hi, I’m Mia Thompson. I recently noticed that my electricity bill hasn’t been updated despite making the payment last week. I used account number 4893172051 linked with routing number 192847561. My service was nearly suspended, and I’d appreciate it if you could verify the payment. You can reach me at 727-814-3902 if more information is needed."

# Run pipeline and merge subword pieces into whole entity strings
ner_results = nlp(example)
cleaned_entities = merge_tokens(ner_results)

# Print the NER results, one line per entity type
print("\n==NER Results:==\n")
for entity_type, values in cleaned_entities.items():
    print(f" {entity_type}: {', '.join(values)}")

# Redact the single example with labels
redacted_example = redact_text_with_labels(example)

# Print the redacted result
print(f"\n==Redacted Example:==\n{redacted_example}")
|