---
base_model:
- google-bert/bert-base-uncased
datasets:
- gayanin/pubmed-gastro-maskfilling
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: fill-mask
tags:
- medical
---
# **medBERT-base**
This repository contains **medBERT-base**, a BERT-based model fine-tuned on the *gayanin/pubmed-gastro-maskfilling* dataset for **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in medical and gastroenterological texts, with the goal of improving its handling of medical information in natural-language contexts.
## **Model Architecture**
- **Base Model**: `bert-base-uncased`
- **Task**: Masked Language Modeling (MLM) for medical texts
- **Tokenizer**: BERT's WordPiece tokenizer
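For illustration, here is a minimal sketch of how the WordPiece tokenizer breaks domain terms into subword pieces (the exact splits depend on the vocabulary):
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("suayptalha/medBERT-base")

# WordPiece splits out-of-vocabulary medical terms into subword pieces
print(tokenizer.tokenize("The gastroenterological exam revealed gastritis."))
```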
## **Usage**
### **Loading the Pre-trained Model**
You can load the pre-trained **medBERT-base** model using the Hugging Face `transformers` library:
```python
from transformers import BertTokenizer, BertForMaskedLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base').to(device)

input_text = "The patient was diagnosed with gastric cancer after a thorough examination."
masked_text = input_text.replace("gastric cancer", tokenizer.mask_token)

inputs = tokenizer(masked_text, return_tensors='pt').to(device)
with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and decode the highest-scoring prediction for it
mask_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_token_id = outputs.logits[0, mask_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)
```
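Alternatively, since the model card's pipeline tag is `fill-mask`, a short sketch using the `transformers` pipeline handles mask lookup and decoding for you:
```python
from transformers import pipeline

fill = pipeline("fill-mask", model="suayptalha/medBERT-base")

# The pipeline returns the top predictions for the [MASK] token
for prediction in fill("The patient was diagnosed with [MASK] after a thorough examination."):
    print(prediction["token_str"], round(prediction["score"], 4))
```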
### **Fine-tuning the Model**
To fine-tune the **medBERT-base** model on your own medical dataset, follow these steps:
1. Prepare your dataset (e.g., medical texts or gastroenterology-related information) in text format.
2. Tokenize the dataset and apply masking.
3. Train the model using the provided training loop.
Here's the training code:
https://github.com/suayptalha/medBERT-base/blob/main/medBERT-base.ipynb
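The notebook above contains the full training loop used for this model. As a rough, hedged sketch of steps 1-3 using the Hugging Face `Trainer` and `DataCollatorForLanguageModeling` (the text column name and split name are assumptions; adapt them to your dataset), it might look like this:
```python
from datasets import load_dataset
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# 1. Prepare the dataset (split name "train" and column name "text" are assumptions)
dataset = load_dataset("gayanin/pubmed-gastro-maskfilling", split="train")

tokenizer = BertTokenizerFast.from_pretrained("suayptalha/medBERT-base")
model = BertForMaskedLM.from_pretrained("suayptalha/medBERT-base")

# 2. Tokenize; masking is applied on the fly by the data collator (15% by default)
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# 3. Train with the hyperparameters listed under "Training Details"
args = TrainingArguments(
    output_dir="medBERT-finetuned",
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    num_train_epochs=1,
)

Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```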
## **Training Details**
### **Hyperparameters**
- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Number of Epochs**: 1
- **Max Sequence Length**: 512 tokens
### **Dataset**
- **Dataset Name**: *gayanin/pubmed-gastro-maskfilling*
- **Task**: Masked Language Modeling (MLM) on medical texts
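To get a feel for the data before training, you can inspect a record directly (a minimal sketch; the split name "train" is an assumption):
```python
from datasets import load_dataset

# Peek at the MLM fine-tuning data: features, row count, and one example record
dataset = load_dataset("gayanin/pubmed-gastro-maskfilling", split="train")
print(dataset)
print(dataset[0])
```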
## **Acknowledgements**
- The *gayanin/pubmed-gastro-maskfilling* dataset is available on the Hugging Face dataset hub and provides a rich collection of medical and gastroenterology-related information for training.
- This model was built with the Hugging Face `transformers` library, a state-of-the-art library for NLP models.