pipeline_tag: fill-mask
library_name: transformers
tags:
- math
---

# **mathBERT-base**

This repository contains a BERT-based model, **mathBERT-base**, fine-tuned on the *ddrg/named_math_formulas* dataset for the task of **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in mathematical formulas and expressions. The goal of this project is to improve the model's understanding and generation of math-related formulas in natural language contexts.

## **Model Architecture**

- **Base Model**: `bert-base-uncased`
- **Task**: Masked Language Modeling (MLM) for mathematical formulas
- **Tokenizer**: BERT's WordPiece tokenizer
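
For reference, here is what the WordPiece tokenizer does to a short formula. This is only an illustrative sketch; the exact subword pieces depend on the `bert-base-uncased` vocabulary.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Formulas are split into WordPiece subword units like any other text.
print(tokenizer.tokenize("The area of a circle is A = πr^2."))
```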

## **Usage**

### **Loading the Pre-trained Model**

You can load the pre-trained **mathBERT-base** model using the Hugging Face `transformers` library:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('<path_to_model>')

# Example input text
input_text = "The area of a circle is given by the formula A = πr^2."
inputs = tokenizer(input_text, return_tensors='pt')

# Mask a token and predict it
inputs['input_ids'][0, 4] = tokenizer.mask_token_id  # mask the token at position 4

with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring prediction at the masked position
masked_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)
```
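
Because the model is tagged `fill-mask`, the same check can also be run through the `pipeline` API. This is a minimal sketch; `<path_to_model>` is a placeholder for the actual model path or Hub id.

```python
from transformers import pipeline

# '<path_to_model>' is a placeholder; point it at the mathBERT-base weights.
fill = pipeline("fill-mask", model="<path_to_model>", tokenizer="bert-base-uncased")

# [MASK] is BERT's mask token; the pipeline returns the top candidate fills.
print(fill("The area of a circle is given by the formula A = [MASK]r^2."))
```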

### **Fine-tuning the Model**

To fine-tune the **mathBERT-base** model on your own dataset, follow these steps:

1. Prepare your dataset (e.g., mathematical formulas) in text format.
2. Tokenize the dataset and apply masking.
3. Train the model using the provided training loop.

Here's the training code:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from datasets import load_dataset
from tqdm.auto import tqdm

# Reuses the `tokenizer` and `model` loaded in the previous section.

# Load and preprocess data (dataset name taken from this model card;
# adjust the text column to match the dataset's schema)
dataset = load_dataset('ddrg/named_math_formulas')
data = dataset['train']['formula']

inputs = tokenizer(data, max_length=512, truncation=True, padding='max_length', return_tensors='pt')

# Keep the original token ids as labels before masking the inputs
inputs['labels'] = inputs['input_ids'].clone()

# Mask 15% of the non-special tokens (101 = [CLS], 102 = [SEP], 0 = [PAD] for bert-base-uncased)
random_tensor = torch.rand(inputs['input_ids'].shape)
masked_tensor = (random_tensor < 0.15) * (inputs['input_ids'] != 101) * (inputs['input_ids'] != 102) * (inputs['input_ids'] != 0)
nonzeros_indices = []
for i in range(len(masked_tensor)):
    nonzeros_indices.append(torch.flatten(masked_tensor[i].nonzero()).tolist())

for i in range(len(inputs['input_ids'])):
    inputs['input_ids'][i, nonzeros_indices[i]] = 103  # [MASK] token id

# Ignore unmasked positions in the loss
for i in range(len(inputs['input_ids'])):
    inputs['labels'][i] = torch.where(masked_tensor[i] == 0, torch.tensor(-100), inputs['labels'][i])

# Dataset and DataLoader
class MathDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings['input_ids'])

    def __getitem__(self, index):
        return {
            'input_ids': self.encodings['input_ids'][index],
            'labels': self.encodings['labels'][index],
            'attention_mask': self.encodings['attention_mask'][index],
            'token_type_ids': self.encodings['token_type_ids'][index]
        }

dataset = MathDataset(inputs)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

# Fine-tuning the model
optimizer = AdamW(model.parameters(), lr=5e-5)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.train()
epochs = 1

for epoch in range(epochs):
    loop = tqdm(dataloader, dynamic_ncols=True)
    for step, batch in enumerate(loop):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        loop.set_description(f"Epoch {epoch + 1}")
        loop.set_postfix(loss=loss.item())
```
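
After training, you will likely want to save the fine-tuned weights so they can be reloaded for inference. A minimal sketch; the output directory name is just a placeholder.

```python
# Save the fine-tuned model and tokenizer (directory name is a placeholder)
output_dir = "mathbert-base-finetuned"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Later: reload for inference
# model = BertForMaskedLM.from_pretrained(output_dir)
```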

## **Training Details**

### **Hyperparameters**

- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Number of Epochs**: 1
- **Max Sequence Length**: 512 tokens

### **Dataset**

- **Dataset Name**: *ddrg/named_math_formulas*
- **Task**: Masked Language Modeling (MLM) on mathematical formulas
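
To take a quick look at the training data before fine-tuning, you can inspect the dataset directly. A minimal sketch, assuming the dataset exposes a `train` split.

```python
from datasets import load_dataset

ds = load_dataset("ddrg/named_math_formulas", split="train")
print(ds)      # number of rows and column names
print(ds[0])   # one example record
```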

## **Acknowledgements**

- The *ddrg/named_math_formulas* dataset is available on the Hugging Face Hub and provides a rich collection of mathematical formulas for training.
- This model is built with the Hugging Face `transformers` library.