pipeline_tag: fill-mask
library_name: transformers
tags:
- math
---

# **mathBERT-base**

This repository contains a BERT-based model, **mathBERT-base**, fine-tuned on the *ddrg/named_math_formulas* dataset for the task of **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in mathematical formulas and expressions. The goal of this project is to improve the model's understanding and generation of math-related formulas in natural language contexts.

## **Model Architecture**

- **Base Model**: `bert-base-uncased`
- **Task**: Masked Language Modeling (MLM) for mathematical formulas
- **Tokenizer**: BERT's WordPiece tokenizer
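
For reference, here is what the WordPiece tokenizer does to a short formula. This is only an illustrative sketch; the exact subword pieces depend on the `bert-base-uncased` vocabulary.

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Formulas are split into WordPiece subword units like any other text.
print(tokenizer.tokenize("The area of a circle is A = πr^2."))
```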

## **Usage**

### **Loading the Pre-trained Model**

You can load the pre-trained **mathBERT-base** model using the Hugging Face `transformers` library:

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('<path_to_model>')

# Example input text
input_text = "The area of a circle is given by the formula A = πr^2."
inputs = tokenizer(input_text, return_tensors='pt')

# Mask a token and predict it
inputs['input_ids'][0, 4] = tokenizer.mask_token_id  # mask the token at position 4

with torch.no_grad():
    outputs = model(**inputs)

# Take the highest-scoring prediction at the masked position
masked_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print(predicted_token)
```
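
Because the model is tagged `fill-mask`, the same check can also be run through the `pipeline` API. This is a minimal sketch; `<path_to_model>` is a placeholder for the actual model path or Hub id.

```python
from transformers import pipeline

# '<path_to_model>' is a placeholder; point it at the mathBERT-base weights.
fill = pipeline("fill-mask", model="<path_to_model>", tokenizer="bert-base-uncased")

# [MASK] is BERT's mask token; the pipeline returns the top candidate fills.
print(fill("The area of a circle is given by the formula A = [MASK]r^2."))
```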

### **Fine-tuning the Model**

To fine-tune the **mathBERT-base** model on your own dataset, follow these steps:

1. Prepare your dataset (e.g., mathematical formulas) in text format.
2. Tokenize the dataset and apply masking.
3. Train the model using the provided training loop.

Here's the training code:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from datasets import load_dataset
from tqdm.auto import tqdm

# Reuses the `tokenizer` and `model` loaded in the previous section.

# Load and preprocess data (dataset name taken from this model card;
# adjust the text column to match the dataset's schema)
dataset = load_dataset('ddrg/named_math_formulas')
data = dataset['train']['formula']

inputs = tokenizer(data, max_length=512, truncation=True, padding='max_length', return_tensors='pt')

# Keep the original token ids as labels before masking the inputs
inputs['labels'] = inputs['input_ids'].clone()

# Mask 15% of the non-special tokens (101 = [CLS], 102 = [SEP], 0 = [PAD] for bert-base-uncased)
random_tensor = torch.rand(inputs['input_ids'].shape)
masked_tensor = (random_tensor < 0.15) * (inputs['input_ids'] != 101) * (inputs['input_ids'] != 102) * (inputs['input_ids'] != 0)
nonzeros_indices = []
for i in range(len(masked_tensor)):
    nonzeros_indices.append(torch.flatten(masked_tensor[i].nonzero()).tolist())

for i in range(len(inputs['input_ids'])):
    inputs['input_ids'][i, nonzeros_indices[i]] = 103  # [MASK] token id

# Ignore unmasked positions in the loss
for i in range(len(inputs['input_ids'])):
    inputs['labels'][i] = torch.where(masked_tensor[i] == 0, torch.tensor(-100), inputs['labels'][i])

# Dataset and DataLoader
class MathDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings['input_ids'])

    def __getitem__(self, index):
        return {
            'input_ids': self.encodings['input_ids'][index],
            'labels': self.encodings['labels'][index],
            'attention_mask': self.encodings['attention_mask'][index],
            'token_type_ids': self.encodings['token_type_ids'][index]
        }

dataset = MathDataset(inputs)
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)

# Fine-tuning the model
optimizer = AdamW(model.parameters(), lr=5e-5)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)
model.train()
epochs = 1

for epoch in range(epochs):
    loop = tqdm(dataloader, dynamic_ncols=True)
    for step, batch in enumerate(loop):
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        labels = batch['labels'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        loop.set_description(f"Epoch {epoch + 1}")
        loop.set_postfix(loss=loss.item())
```
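
After training, you will likely want to save the fine-tuned weights so they can be reloaded for inference. A minimal sketch; the output directory name is just a placeholder.

```python
# Save the fine-tuned model and tokenizer (directory name is a placeholder)
output_dir = "mathbert-base-finetuned"
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

# Later: reload for inference
# model = BertForMaskedLM.from_pretrained(output_dir)
```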

## **Training Details**

### **Hyperparameters**

- **Batch Size**: 16
- **Learning Rate**: 5e-5
- **Number of Epochs**: 1
- **Max Sequence Length**: 512 tokens

### **Dataset**

- **Dataset Name**: *ddrg/named_math_formulas*
- **Task**: Masked Language Modeling (MLM) on mathematical formulas
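
To take a quick look at the training data before fine-tuning, you can inspect the dataset directly. A minimal sketch, assuming the dataset exposes a `train` split.

```python
from datasets import load_dataset

ds = load_dataset("ddrg/named_math_formulas", split="train")
print(ds)      # number of rows and column names
print(ds[0])   # one example record
```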

## **Acknowledgements**

- The *ddrg/named_math_formulas* dataset is available on the Hugging Face Hub and provides a rich collection of mathematical formulas for training.
- This model is built with the Hugging Face `transformers` library.