---
library_name: transformers
tags: []
---

# 🧠 GLiClass Gender Classifier — DeBERTaV3 Uni-Encoder (3-Class)

This model is designed for **text classification** in clinical narratives, specifically for determining a patient's **sex or gender**. It was fine-tuned using a **uni-encoder architecture** based on [`microsoft/deberta-v3-small`](https://huggingface.co/microsoft/deberta-v3-small), and outputs one of three labels:

- `male`
- `female`
- `sex undetermined`

---

## 🧪 Task

This is a **multi-class text classification** task over **clinical free text**. The model predicts the gender of a patient from discharge summaries, case descriptions, or medical notes.

> ⚠️ **It is strongly recommended to keep the labels and the input text in the same language** (e.g., both in Spanish or both in English) to ensure optimal model performance. Mixing languages may reduce accuracy.

---

## 🧩 Model Architecture

- Base: `microsoft/deberta-v3-small`
- Architecture: `DebertaV2ForSequenceClassification`
- Fine-tuned with a **uni-encoder** setup
- 3 output labels

---

## 🔍 Input Format

Each input sample must be a JSON object like this:

```json
{
  "text": "Paciente de 63 años que refería déficit de agudeza visual (AV)...",
  "all_labels": ["male", "female", "sex undetermined"],
  "true_labels": ["sex undetermined"]
}
```

---

## 🚀 Usage Example

```python
import json

import torch
from transformers import AutoTokenizer
from gliclass import GLiClassModel, ZeroShotClassificationPipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_path = "BSC-NLP4BIA/GLiClass-gender-classifier"
classification_type = "single-label"  # or "multi-label"
test_path = "path/to/your/test_data.json"

print(f"🔄 Loading model from {model_path}...")
model = GLiClassModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.to(device)

pipeline = ZeroShotClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    classification_type=classification_type,
    device=device,
)

with open(test_path, "r") as f:
    test_data = json.load(f)

# 🔍 Automatically infer candidate labels from the dataset
all_labels = set()
for sample in test_data:
    all_labels.update(sample["true_labels"])
candidate_labels = sorted(all_labels)
print(f"🧾 Candidate labels inferred: {candidate_labels}")

results = []
for sample in test_data:
    true_labels = sample["true_labels"]

    # The pipeline returns one list of {"label", "score"} dicts per input text
    output = pipeline(sample["text"], candidate_labels)
    top_results = output[0]

    # Single-label setting: keep only the highest-scoring label
    predicted_labels = [max(top_results, key=lambda x: x["score"])["label"]]
    score_dict = {d["label"]: d["score"] for d in top_results}

    entry = {
        "text": sample["text"],
        "true_labels": true_labels,
        "predicted_labels": predicted_labels,
    }
    # Add scores for each candidate label
    for label in candidate_labels:
        entry[f"score_{label}"] = score_dict.get(label, 0.0)

    results.append(entry)
```
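As a follow-up to the snippet above, the sketch below stores the predictions and computes a simple accuracy for the single-label setting. It reuses the `results` list built in the loop; the `predictions.json` output path is only illustrative.

```python
import json

# Continues from the usage example above: `results` holds one dict per test sample.
output_path = "predictions.json"  # illustrative path, adjust as needed

with open(output_path, "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)

# Simple accuracy for the single-label setting: the top-scoring label must
# match the single gold label of each sample.
correct = sum(entry["predicted_labels"] == entry["true_labels"] for entry in results)
accuracy = correct / len(results) if results else 0.0
print(f"✅ Saved {len(results)} predictions to {output_path} (accuracy: {accuracy:.3f})")
```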
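For a quick one-off check without preparing a test file, a minimal standalone sketch along the same lines can be used; the example sentence below is made up for illustration and follows the recommendation to keep the labels and the text in the same language.

```python
import torch
from transformers import AutoTokenizer
from gliclass import GLiClassModel, ZeroShotClassificationPipeline

model_path = "BSC-NLP4BIA/GLiClass-gender-classifier"
device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = GLiClassModel.from_pretrained(model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)

pipeline = ZeroShotClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    classification_type="single-label",
    device=device,
)

# Illustrative input text and the three candidate labels
text = "A 63-year-old woman presented with progressive loss of visual acuity."
labels = ["male", "female", "sex undetermined"]

scores = pipeline(text, labels)[0]  # one list of {"label", "score"} dicts per input
best = max(scores, key=lambda x: x["score"])
print(f"Predicted: {best['label']} ({best['score']:.3f})")
```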