---
library_name: transformers
tags: []
---

# 🧠 GLiClass Gender Classifier — DeBERTaV3 Uni-Encoder (3-Class)

This model is designed for **text classification** in clinical narratives, specifically for determining a patient's **sex or gender**. It was fine-tuned using a **uni-encoder architecture** based on [`microsoft/deberta-v3-small`](https://huggingface.co/microsoft/deberta-v3-small), and outputs one of three labels:

- `male`
- `female`
- `sex undetermined`

---

## 🧪 Task

This is a **multi-class text classification** task over **clinical free-text**. The model predicts the gender of a patient from discharge summaries, case descriptions, or medical notes.


> ⚠️ **It is strongly recommended to keep the labels and the input text in the same language** (e.g., both in Spanish or both in English) to ensure optimal model performance. Mixing languages may reduce accuracy.
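
For example, Spanish clinical text should be paired with Spanish label strings. A minimal sketch, assuming the `gliclass` package is installed (the Spanish label wording and the example sentence are illustrative, not taken from the training data):

```python
from transformers import AutoTokenizer
from gliclass import GLiClassModel, ZeroShotClassificationPipeline

model_path = "BSC-NLP4BIA/GLiClass-gender-classifier"
model = GLiClassModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

pipeline = ZeroShotClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    classification_type="single-label",
    device="cpu",  # use "cuda:0" if a GPU is available
)

# Spanish input text paired with Spanish candidate labels (illustrative wording)
text = "Paciente de 63 años que refería déficit de agudeza visual."
labels = ["varón", "mujer", "sexo indeterminado"]

print(pipeline(text, labels)[0])  # list of {"label": ..., "score": ...} dicts
```
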
---

## 🧩 Model Architecture

- Base: `microsoft/deberta-v3-small`
- Architecture: `DebertaV2ForSequenceClassification`
- Fine-tuned with a **uni-encoder** setup
- 3 output labels

---

## 🔍 Input Format

Each input sample must be a JSON object like this:

```json
{
  "text": "Paciente de 63 años que refería déficit de agudeza visual (AV)...",
  "all_labels": ["male", "female", "sex undetermined"],
  "true_labels": ["sex undetermined"]
}
```

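If you need to build such a test file yourself, a minimal sketch (the file name and the record contents are illustrative):

```python
import json

records = [
    {
        "text": "Paciente de 63 años que refería déficit de agudeza visual (AV)...",
        "all_labels": ["male", "female", "sex undetermined"],
        "true_labels": ["sex undetermined"],  # gold label(s), used for evaluation below
    },
]

# Write the records as a single JSON array, which is what the usage example reads
with open("test_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```
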
---

## 🚀 Usage Example

The script below loads the model with the `gliclass` library and runs inference over a JSON test file in the format described above.

```python
import json
from transformers import AutoTokenizer
from gliclass import GLiClassModel, ZeroShotClassificationPipeline
import torch

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_path = "BSC-NLP4BIA/GLiClass-gender-classifier"
classification_type = "single-label"  # or "multi-label"
test_path = "path/to/your/test_data.json"

print(f"🔄 Loading model from {model_path}...")
model = GLiClassModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.to(device)

pipeline = ZeroShotClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    classification_type=classification_type,
    device=device
)

with open(test_path, 'r', encoding='utf-8') as f:
    test_data = json.load(f)

# 🔍 Gather the candidate labels from the dataset's "all_labels" field
all_labels = set()
for sample in test_data:
    all_labels.update(sample["all_labels"])
candidate_labels = sorted(all_labels)

print(f"🧾 Candidate labels inferred: {candidate_labels}")

results = []

for sample in test_data:
    true_labels = sample["true_labels"]
    output = pipeline(sample["text"], candidate_labels)
    top_results = output[0]

    predicted_labels = [max(top_results, key=lambda x: x["score"])["label"]]
    score_dict = {d["label"]: d["score"] for d in top_results}
    
    entry = {
        "text": sample["text"],
        "true_labels": true_labels,
        "predicted_labels": predicted_labels
    }
    # Add scores for each candidate label
    for label in candidate_labels:
        entry[f"score_{label}"] = score_dict.get(label, 0.0)

    results.append(entry)
```
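
As a possible follow-up (not part of the original script), you can report a simple exact-match accuracy over the gold labels and save the predictions; the output file name is an arbitrary choice:

```python
# Exact-match accuracy against the gold labels
correct = sum(1 for r in results if r["predicted_labels"] == r["true_labels"])
print(f"✅ Accuracy: {correct / len(results):.3f} ({correct}/{len(results)})")

# Persist predictions and per-label scores for later inspection
with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```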