# Model Card for BioGPT-FineTuned-MedicalTextbooks-FP16
## Model Overview

This model is a fine-tuned and quantized version of the microsoft/biogpt model, tailored for medical text understanding. It was fine-tuned on the dmedhi/medical-textbooks dataset from Hugging Face and then quantized to FP16 (half-precision) to reduce memory usage and speed up inference, with only a minor expected impact on accuracy (see Limitations). It is designed for tasks such as keyword extraction from medical texts and generative tasks in the biomedical domain.
## Model Details
```
Base Model: microsoft/biogpt
Fine-Tuning Dataset: dmedhi/medical-textbooks (15,970 rows)
Quantization: FP16 (half-precision) using PyTorch's .half() method
Model Type: Causal Language Model
Language: English
```
## Intended Use
This model is intended for:
- Keyword Extraction: Extracting relevant lines containing specific keywords (e.g., "anatomy") from medical textbooks, along with metadata like book names.
- Generative Tasks: Generating short explanations or summaries in the biomedical domain (e.g., answering questions like "What is anatomy?").
- Research and Education: Assisting researchers, students, and educators in exploring medical texts and generating insights.
## Out of Scope
- Real-time clinical decision-making or medical diagnosis (not evaluated for such tasks).
- Non-English text processing (not tested on other languages).
- Tasks requiring high precision in generative output without human oversight.
## Training Details

### Dataset

The model was fine-tuned on the dmedhi/medical-textbooks dataset, which contains excerpts from medical textbooks with two attributes:
- **text:** The content of the excerpt.
- **book:** The name of the source book (e.g., "Gray's Anatomy").
### Dataset Splits

- Original split: train (15,970 rows).
- Custom splits: 80% train (12,776 rows), 20% validation (3,194 rows), as sketched below.
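A minimal sketch of how this split can be reproduced with the `datasets` library (the seed is an assumption, not a value documented by this card):

```
from datasets import load_dataset

# Load the original single-split dataset from the Hugging Face Hub.
dataset = load_dataset("dmedhi/medical-textbooks")

# 80/20 train/validation split; seed=42 is illustrative, not the card's actual value.
splits = dataset["train"].train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(len(train_ds), len(val_ds))  # 12776 3194
```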
### Training Procedure

#### Preprocessing

- Tokenized the text field using the BioGPT tokenizer (microsoft/biogpt).
- Set max_length=512, with truncation and padding.
- Used input_ids as labels for causal language modeling (see the sketch below).
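A minimal sketch of this preprocessing, continuing from the `train_ds`/`val_ds` split above (the `tokenize_fn` name and `map` settings are illustrative):

```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/biogpt")

def tokenize_fn(batch):
    # Tokenize with the settings listed above.
    tokens = tokenizer(batch["text"], max_length=512, truncation=True, padding="max_length")
    # For causal language modeling, the labels are the input_ids themselves.
    tokens["labels"] = [ids.copy() for ids in tokens["input_ids"]]
    return tokens

tokenized_train = train_ds.map(tokenize_fn, batched=True, remove_columns=train_ds.column_names)
tokenized_val = val_ds.map(tokenize_fn, batched=True, remove_columns=val_ds.column_names)
```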
#### Fine-Tuning

- Fine-tuned microsoft/biogpt using Hugging Face's Trainer API (a sketch follows the block below).

```
Training arguments:
  Epochs: 1
  Batch size: 4 per device
  Learning rate: 2e-5
  Mixed precision: FP16 (fp16=True)
  Evaluation strategy: steps (every 1,000 steps)

Results:
  Training loss decreased from 2.8409 to 2.7006 over 3,194 steps.
  Validation loss decreased from 2.7317 to 2.6512.
```
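A sketch of the corresponding Trainer setup under the hyperparameters listed above (the output directory and logging settings are assumptions):

```
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("microsoft/biogpt")

training_args = TrainingArguments(
    output_dir="./biogpt_finetuned",  # assumed; matches the save path named below
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    fp16=True,                 # mixed-precision training
    eval_strategy="steps",     # "evaluation_strategy" on older transformers versions
    eval_steps=1000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
)
trainer.train()
```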
#### Quantization

- Converted the fine-tuned model to FP16 using PyTorch's .half() method (sketched below).
- Saved as ./biogpt_finetuned/final_model_fp16.
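A sketch of the conversion and save step, assuming the fine-tuned checkpoint lives at `./biogpt_finetuned`:

```
from transformers import AutoModelForCausalLM

# Load the fine-tuned FP32 model and cast every parameter to half precision.
model = AutoModelForCausalLM.from_pretrained("./biogpt_finetuned")
model = model.half()

# Persist the FP16 weights and config for later loading.
model.save_pretrained("./biogpt_finetuned/final_model_fp16")
```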
### Compute Infrastructure
- Hardware: 12 GB GPU (NVIDIA)
- Environment: Jupyter Notebook on Windows
- Framework: PyTorch, Hugging Face Transformers
- Training Time: Approximately 27 minutes for 1 epoch
## Evaluation

### Metrics
```
Training Loss: Decreased from 2.8409 to 2.7006.
Validation Loss: Decreased from 2.7317 to 2.6512.
Memory Usage: Post-quantization memory usage reported as ~661 MB (FP16), though actual savings may vary due to buffers and non-weight tensors.
```
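One way to reproduce the memory figure is to sum over the parameter tensors of a loaded `model` (a sketch; it counts only parameters, not buffers or non-weight tensors, which is why actual savings may vary):

```
def param_memory_mb(model):
    # element_size() is 2 bytes per FP16 parameter, 4 bytes per FP32 parameter.
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1024**2

print(f"{param_memory_mb(model):.0f} MB")  # roughly 661 MB for the FP16 model
```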
### Qualitative Testing

- **Generative Task:** Generated a reasonable response to "What is anatomy?": "What is anatomy? Anatomy is the basis of medicine..."
- **Keyword Extraction:** Extracted up to 10 lines containing a given keyword (e.g., "anatomy") with the corresponding book names (e.g., "Gray's Anatomy").
## Usage

### Installation

Ensure you have the required libraries installed:
```
pip install transformers torch datasets sacremoses
```
### Loading the Model

Load the quantized FP16 model and tokenizer:
```
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = "path/to/biogpt_finetuned/final_model_fp16"  # Update with your HF repo path

# Load in FP16 explicitly; from_pretrained otherwise upcasts the weights to FP32.
# Note: FP16 inference generally requires a GPU; call model.float() for CPU use.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
```
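You can confirm the precision after loading with `next(model.parameters()).dtype`, which should report `torch.float16`.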
### Example 1: Generative Inference

Generate text with the quantized model:
```
input_text = "What is anatomy?"
inputs = tokenizer(input_text, return_tensors="pt", padding=True, truncation=True, max_length=512)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=50)

output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(output_text)
```
### Example 2: Keyword Extraction
```
from datasets import load_from_disk

# Keyword extraction runs over the raw dataset on disk; it does not invoke the model.
original_datasets = load_from_disk('path/to/original_medical_textbooks')

def extract_lines_with_keyword(keyword, dataset_split='train', max_results=10):
    """Return up to max_results lines containing the keyword, with source book names."""
    dataset = original_datasets[dataset_split]
    matching_lines = []
    for entry in dataset:
        text = entry['text']
        book = entry['book']
        for line in text.split('\n'):
            if keyword.lower() in line.lower():
                matching_lines.append({'text': line.strip(), 'book': book})
                if len(matching_lines) >= max_results:
                    return matching_lines
    return matching_lines

keyword = "anatomy"
matching_lines = extract_lines_with_keyword(keyword)
for i, match in enumerate(matching_lines, 1):
    print(f"{i}. Text: {match['text']}")
    print(f"   Book: {match['book']}\n")
```
## Limitations

- Quantization Trade-offs: FP16 quantization may lead to minor accuracy degradation, though this has not been extensively evaluated.
- Dataset Bias: Fine-tuned only on dmedhi/medical-textbooks, which may not cover all medical domains or topics.
- Generative Quality: Generative outputs may require human oversight for correctness.
- Scalability: Keyword extraction relies on string matching, not semantic understanding, limiting its ability to capture nuanced relationships.