|
--- |
|
language: he |
|
license: mit |
|
tags: |
|
- hebrew |
|
- ner |
|
- pii-detection |
|
- token-classification |
|
- xlm-roberta |
|
datasets: |
|
- custom |
|
model-index: |
|
- name: GolemPII-xlm-roberta-v1 |
|
results: |
|
- task: |
|
name: Token Classification |
|
type: token-classification |
|
metrics: |
|
- name: F1 |
|
type: f1 |
|
value: 0.9982 |
|
- name: Precision |
|
type: precision |
|
value: 0.9982 |
|
- name: Recall |
|
type: recall |
|
value: 0.9982 |
|
--- |
|
|
|
# GolemPII-xlm-roberta-v1 - Hebrew PII Detection Model |
|
|
|
This model detects personally identifiable information (PII) in Hebrew text. It is based on the multilingual XLM-RoBERTa model and has been fine-tuned specifically on Hebrew data.
|
|
|
## Model Details |
|
- Based on xlm-roberta-base |
|
- Fine-tuned on a custom Hebrew PII dataset |
|
- Optimized for token classification tasks in Hebrew text |
|
|
|
## Performance Metrics |
|
|
|
### Final Evaluation Results |
|
``` |
|
eval_loss: 0.000729 |
|
eval_precision: 0.9982 |
|
eval_recall: 0.9982 |
|
eval_f1: 0.9982 |
|
eval_accuracy: 0.999795 |
|
``` |
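
The card does not state which evaluation code produced these numbers. For orientation, here is a minimal sketch of how entity-level precision, recall, and F1 of this kind are commonly computed with the `seqeval` library; the tag sequences below are a toy example, not data from this model's test set.

```python
# Hedged sketch: seqeval is a common choice for entity-level NER metrics,
# but this card does not state which tool was actually used.
from seqeval.metrics import f1_score, precision_score, recall_score

# Toy gold / predicted BIO tag sequences, one inner list per sentence.
y_true = [["O", "B-FIRST_NAME", "B-LAST_NAME", "O", "B-PHONE_NUM", "I-PHONE_NUM"]]
y_pred = [["O", "B-FIRST_NAME", "B-LAST_NAME", "O", "B-PHONE_NUM", "I-PHONE_NUM"]]

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```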
|
|
|
### Detailed Performance by Label |
|
|
|
| Label | Precision | Recall | F1-Score | Support | |
|
|------------------|-----------|---------|----------|---------| |
|
| BANK_ACCOUNT_NUM | 1.0000 | 1.0000 | 1.0000 | 4847 | |
|
| CC_NUM | 1.0000 | 1.0000 | 1.0000 | 234 | |
|
| CC_PROVIDER | 1.0000 | 1.0000 | 1.0000 | 242 | |
|
| CITY | 0.9997 | 0.9995 | 0.9996 | 12237 | |
|
| DATE | 0.9997 | 0.9998 | 0.9997 | 11943 | |
|
| EMAIL | 0.9998 | 1.0000 | 0.9999 | 13235 | |
|
| FIRST_NAME | 0.9937 | 0.9938 | 0.9937 | 17888 | |
|
| ID_NUM | 0.9999 | 1.0000 | 1.0000 | 10577 | |
|
| LAST_NAME | 0.9928 | 0.9921 | 0.9925 | 15655 | |
|
| PHONE_NUM | 1.0000 | 0.9998 | 0.9999 | 20838 | |
|
| POSTAL_CODE | 0.9998 | 0.9999 | 0.9999 | 13321 | |
|
| STREET | 0.9999 | 0.9999 | 0.9999 | 14032 | |
|
| micro avg | 0.9982 | 0.9982 | 0.9982 | 135049 | |
|
| macro avg | 0.9988 | 0.9987 | 0.9988 | 135049 | |
|
| weighted avg | 0.9982 | 0.9982 | 0.9982 | 135049 | |
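
The table above lists the twelve entity types the model distinguishes; the exact tagging scheme (e.g. BIO) is not stated in this card. The tag set actually used by the checkpoint can be read from its config, for example as follows (using the same placeholder repo id as in the Usage example below):

```python
from transformers import AutoConfig

# Placeholder repo id - substitute the full Hugging Face Hub path of this model.
config = AutoConfig.from_pretrained("GolemPII-xlm-roberta-v1")

# id2label maps class indices to the tags the model was trained with
# (the exact scheme, e.g. B-CITY / I-CITY / O, is not stated in this card).
for idx, tag in sorted(config.id2label.items()):
    print(idx, tag)
```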
|
|
|
### Training Progress |
|
|
|
| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy | |
|
|-------|--------------|-----------------|-----------|---------|----------|----------| |
|
| 1 | 0.005800 | 0.002487 | 0.993109 | 0.993678| 0.993393 | 0.999328 | |
|
| 2 | 0.001700 | 0.001385 | 0.995469 | 0.995947| 0.995708 | 0.999575 | |
|
| 3 | 0.001200 | 0.000946 | 0.997159 | 0.997487| 0.997323 | 0.999739 | |
|
| 4 | 0.000900 | 0.000896 | 0.997626 | 0.997868| 0.997747 | 0.999750 | |
|
| 5 | 0.000600 | 0.000729 | 0.997981 | 0.998191| 0.998086 | 0.999795 | |
|
|
|
## Usage |
|
```python |
|
import torch |
|
from transformers import AutoTokenizer, AutoModelForTokenClassification |
|
|
|
model_id = "GolemPII-xlm-roberta-v1"  # placeholder: use the full Hub repo id of this model

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForTokenClassification.from_pretrained(model_id)
|
|
|
# Example text (Hebrew): "Hello, my name is David Cohen and I live at
# 42 Herzl Street in Tel Aviv. My phone number is 050-1234567"

text = "שלום, שמי דוד כהן ואני גר ברחוב הרצל 42 בתל אביב. הטלפון שלי הוא 050-1234567"
|
|
|
# Tokenize and get predictions |
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True) |
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
predictions = torch.argmax(outputs.logits, dim=2) |
|
|
|
# Convert predictions to labels |
|
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]) |
|
labels = [model.config.id2label[t.item()] for t in predictions[0]] |
|
|
|
# Print results, skipping special tokens and non-entity ("O") labels
|
for token, label in zip(tokens, labels): |
|
    if label != "O" and token not in tokenizer.all_special_tokens:
|
print(f"Token: {token}, Label: {label}") |
|
``` |
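
For most applications it is easier to let `transformers` merge subword pieces back into complete entity spans. A short follow-up using the token-classification pipeline, reusing the `model`, `tokenizer`, and `text` objects from the example above (`aggregation_strategy="simple"` groups consecutive pieces of the same entity):

```python
from transformers import pipeline

# Reuses model, tokenizer and text from the example above.
pii_detector = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",  # merge subword pieces into whole entities
)

for entity in pii_detector(text):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```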
|
|
|
## Training Details |
|
- Training epochs: 5 |
|
- Training speed: ~2.33 it/s (7,615 steps completed in 54:29)
|
- Base model: xlm-roberta-base |
|
- Training language: Hebrew |
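
The hyperparameters and dataset behind the details above are not published. For orientation only, here is a minimal fine-tuning sketch with the Hugging Face `Trainer`, using a toy stand-in dataset and assumed settings; the learning rate, batch size, and BIO tag set are assumptions, not values from this card.

```python
from datasets import Dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Hypothetical BIO tag set built from the entity types listed above
# (the card does not state the exact tagging scheme).
entity_types = [
    "BANK_ACCOUNT_NUM", "CC_NUM", "CC_PROVIDER", "CITY", "DATE", "EMAIL",
    "FIRST_NAME", "ID_NUM", "LAST_NAME", "PHONE_NUM", "POSTAL_CODE", "STREET",
]
label_list = ["O"] + [f"{p}-{t}" for t in entity_types for p in ("B", "I")]
label2id = {label: i for i, label in enumerate(label_list)}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-base",
    num_labels=len(label_list),
    id2label=dict(enumerate(label_list)),
    label2id=label2id,
)

# Toy stand-in for the unpublished Hebrew PII dataset: pre-split words with
# word-level tags (Hebrew for "My name is David Cohen").
raw = Dataset.from_dict({
    "tokens": [["שמי", "דוד", "כהן"]],
    "tags": [["O", "B-FIRST_NAME", "B-LAST_NAME"]],
})

def tokenize_and_align(example):
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    # Give every subword piece the tag of the word it came from; ignore specials.
    enc["labels"] = [
        -100 if word_id is None else label2id[example["tags"][word_id]]
        for word_id in enc.word_ids()
    ]
    return enc

train_dataset = raw.map(tokenize_and_align, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="golempii-xlm-roberta",
    num_train_epochs=5,              # matches the 5 epochs reported above
    learning_rate=2e-5,              # assumption: not stated in this card
    per_device_train_batch_size=16,  # assumption: not stated in this card
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```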
|
|