---
language: en
tags:
- veterinary
- pets
- classification
- vetbert
- BERT
  
widget:
- text: "Hx: 7 yo canine with history of vomiting intermittently since yesterday. No other concerns. Still eating and drinking normally. cPL negative."
  example_title: "Enteropathy"
---


# VetBERT Disease Syndrome Classifier

This is a finetuned version of the [VetBERT](https://huggingface.co/havocy28/VetBERT) model, designed to classify the disease syndrome within a veterinary clinical note.

This model builds on VetBERT, a pretrained language model for NLP tasks over veterinary clinical notes. The paper [Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes](https://aclanthology.org/2020.bionlp-1.17) (Hur et al., BioNLP 2020) introduced VetBERT: a BERT model initialized with ClinicalBERT (Bio + Clinical BERT) weights and further pretrained on the [VetCompass Australia](https://www.vetcompass.com.au/) corpus for tasks specific to veterinary medicine.

## Pretraining Data

VetBERT was initialized from the [Bio_ClinicalBERT model](https://huggingface.co/emilyalsentzer/Bio_ClinicalBERT), which was itself initialized from BERT, and was then pretrained on over 15 million veterinary clinical records comprising 1.3 billion tokens.

## Pretraining Hyperparameters

During the pretraining phase for VetBERT, we used a batch size of 32, a maximum sequence length of 512, and a learning rate of 5e-5. The dup factor, which controls how many times the input data is duplicated with different masks, was set to 5. All other parameters were left at their defaults (in particular, a masked language model probability of 0.15 and a maximum of 20 predictions per sequence).
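
As a rough illustration only, continued masked language model pretraining with these hyperparameters might look like the sketch below using the Hugging Face `Trainer`. This is not the original pipeline: the dup factor and max-predictions-per-sequence settings belong to BERT-style static masking and have no direct equivalent here (the data collator masks dynamically), and the corpus file name is a placeholder since VetCompass Australia data is not publicly distributed.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from Bio_ClinicalBERT, as VetBERT did
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForMaskedLM.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

# Placeholder corpus file; the VetCompass Australia corpus is not public
dataset = load_dataset("text", data_files={"train": "clinical_notes.txt"})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the stated MLM probability of 0.15
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="vetbert-pretraining",
    per_device_train_batch_size=32,
    learning_rate=5e-5,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```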

## VetBERT Finetuning

VetBERT was further finetuned on a set of 5,002 annotated clinical notes to classify the disease syndrome associated with each note, as outlined in the paper [Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes](https://aclanthology.org/2020.bionlp-1.17).
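
A minimal sketch of how such a finetuning run could be set up with the Hugging Face `Trainer` is shown below. The annotated dataset is not public, so the CSV file name, its `text`/`label` columns, and the training hyperparameters are placeholders rather than the exact setup from the paper.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("havocy28/VetBERT")

# Placeholder CSV with a "text" column and an integer "label" column
dataset = load_dataset("csv", data_files={"train": "annotated_notes.csv"})["train"]
num_labels = len(set(dataset["label"]))

model = AutoModelForSequenceClassification.from_pretrained(
    "havocy28/VetBERT", num_labels=num_labels
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="vetbertdx-finetuned",
    per_device_train_batch_size=32,   # assumed, not taken from the paper
    learning_rate=5e-5,               # assumed, not taken from the paper
    num_train_epochs=3,               # assumed, not taken from the paper
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```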

## How to use the model

Load the model via the transformers library:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the tokenizer and model from the Hugging Face Hub
model_name = 'havocy28/VetBERTDx'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example text to classify
text = "Hx: 7 yo canine with history of vomiting intermittently since yesterday. No other concerns. Still eating and drinking normally. cPL negative."

# Encode the text and prepare inputs for the model
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

# Predict and compute softmax to get probabilities
with torch.no_grad():
    logits = model(**inputs).logits
    probabilities = torch.softmax(logits, dim=-1)

# Retrieve label mapping from model's configuration
label_map = model.config.id2label

# Combine labels and probabilities, and sort by probability in descending order
sorted_probs = sorted(((prob.item(), label_map[idx]) for idx, prob in enumerate(probabilities[0])), reverse=True, key=lambda x: x[0])

# Display sorted probabilities and labels
for prob, label in sorted_probs:
    print(f"{label}: {prob:.4f}")
```
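
For quick experiments, the same classification can also be run through the `pipeline` API. The `top_k=None` argument (which replaces the older `return_all_scores` in recent transformers versions) returns scores for every label rather than only the top one.

```python
from transformers import pipeline

# Load the classifier; top_k=None returns all labels with their scores
classifier = pipeline("text-classification", model="havocy28/VetBERTDx", top_k=None)

text = "Hx: 7 yo canine with history of vomiting intermittently since yesterday. No other concerns. Still eating and drinking normally. cPL negative."

# For a single string, the pipeline returns a list containing one list of
# {"label": ..., "score": ...} dicts, sorted by score in descending order
for pred in classifier(text)[0]:
    print(f"{pred['label']}: {pred['score']:.4f}")
```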

## Citation

Please cite this article: Brian Hur, Timothy Baldwin, Karin Verspoor, Laura Hardefeldt, and James Gilkerson. 2020. [Domain Adaptation and Instance Selection for Disease Syndrome Classification over Veterinary Clinical Notes](https://aclanthology.org/2020.bionlp-1.17). In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, pages 156–166, Online. Association for Computational Linguistics.