suayptalha committed
Commit c108c5f · verified · 1 Parent(s): 4107162

Update README.md

Files changed (1)
  1. README.md +133 -1
README.md CHANGED
@@ -8,4 +8,136 @@ pipeline_tag: fill-mask
  library_name: transformers
  tags:
  - math
- ---
+ ---
+
+ # **mathBERT-base**
+
+ This repository contains a BERT-based model, **mathBERT-base**, fine-tuned on the *ddrg/named_math_formulas* dataset for the task of **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in mathematical formulas and expressions. The goal of this project is to improve the model's understanding and generation of math-related formulas in natural language contexts.
+
+ ## **Model Architecture**
+ - **Base Model**: `bert-base-uncased`
+ - **Task**: Masked Language Modeling (MLM) for mathematical formulas
+ - **Tokenizer**: BERT's WordPiece tokenizer (see the short example below)
+
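+ As a quick illustration of the tokenizer bullet above, WordPiece splits formula text into subword units. A minimal sketch using the stock `bert-base-uncased` tokenizer (the example formula is arbitrary):
+
+ ```python
+ from transformers import BertTokenizer
+
+ tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+
+ # WordPiece splits the formula into subword tokens such as 'a', '=', 'b', '^', '2'
+ print(tokenizer.tokenize("a = b^2 + c^2"))
+ ```
+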
+ ## **Usage**
+
+ ### **Loading the Pre-trained Model**
+
+ You can load the pre-trained **mathBERT-base** model using the Hugging Face `transformers` library:
+
+ ```python
+ import torch
+ from transformers import BertTokenizer, BertForMaskedLM
+
+ tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
+ model = BertForMaskedLM.from_pretrained('<path_to_model>')
+
+ # Example input text
+ input_text = "The area of a circle is given by the formula A = πr^2."
+ inputs = tokenizer(input_text, return_tensors='pt')
+
+ # Mask one token and predict it
+ masked_index = 4
+ inputs['input_ids'][0, masked_index] = tokenizer.mask_token_id
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+
+ predicted_token_id = torch.argmax(outputs.logits[0, masked_index]).item()
+ predicted_token = tokenizer.decode([predicted_token_id])
+ print(predicted_token)
+ ```
+
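+ The same check can be run with the `fill-mask` pipeline, which handles masking and decoding for you. A minimal sketch (`<path_to_model>` is the same placeholder as above):
+
+ ```python
+ from transformers import pipeline
+
+ # The pipeline predicts replacements for the literal [MASK] token in the input
+ fill_mask = pipeline('fill-mask', model='<path_to_model>', tokenizer='bert-base-uncased')
+
+ for prediction in fill_mask("The area of a circle is given by the formula A = π r ^ [MASK]."):
+     print(prediction['token_str'], prediction['score'])
+ ```
+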
+ ### **Fine-tuning the Model**
+
+ To fine-tune the **mathBERT-base** model on your own dataset, follow these steps:
+
+ 1. Prepare your dataset (e.g., mathematical formulas) in text format.
+ 2. Tokenize the dataset and apply masking.
+ 3. Train the model using the provided training loop.
+
+ Here's the training code:
+
+ ```python
+ import torch
+ from torch.optim import AdamW
+ from torch.utils.data import DataLoader
+ from datasets import load_dataset
+ from tqdm.auto import tqdm
+
+ # Load the training data
+ dataset = load_dataset('ddrg/named_math_formulas')
+ # The 'formula' column name is an assumption; adjust it to your dataset's schema
+ data = dataset['train']['formula']
+
+ # tokenizer and model are the objects loaded in the example above
+ inputs = tokenizer(data, max_length=512, truncation=True, padding='max_length', return_tensors='pt')
+
+ # Keep the original token IDs as labels before masking
+ inputs['labels'] = inputs['input_ids'].detach().clone()
+
+ # Mask ~15% of the input tokens, skipping [CLS] (101), [SEP] (102) and [PAD] (0)
+ random_tensor = torch.rand(inputs['input_ids'].shape)
+ masked_tensor = (random_tensor < 0.15) * (inputs['input_ids'] != 101) * (inputs['input_ids'] != 102) * (inputs['input_ids'] != 0)
+ nonzeros_indices = []
+ for i in range(len(masked_tensor)):
+     nonzeros_indices.append(torch.flatten(masked_tensor[i].nonzero()).tolist())
+
+ for i in range(len(inputs['input_ids'])):
+     inputs['input_ids'][i, nonzeros_indices[i]] = 103  # [MASK] token ID
+
+ # Compute the loss only on masked positions
+ for i in range(len(inputs['input_ids'])):
+     inputs['labels'][i] = torch.where(masked_tensor[i] == 0, torch.tensor(-100), inputs['labels'][i])
+
+ # Dataset and DataLoader
+ class MathDataset(torch.utils.data.Dataset):
+     def __init__(self, encodings):
+         self.encodings = encodings
+
+     def __len__(self):
+         return len(self.encodings['input_ids'])
+
+     def __getitem__(self, index):
+         return {
+             'input_ids': self.encodings['input_ids'][index],
+             'labels': self.encodings['labels'][index],
+             'attention_mask': self.encodings['attention_mask'][index],
+             'token_type_ids': self.encodings['token_type_ids'][index]
+         }
+
+ dataset = MathDataset(inputs)
+ dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
+
+ # Fine-tuning loop
+ optimizer = AdamW(model.parameters(), lr=5e-5)
+ device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
+ model.to(device)
+ model.train()
+ epochs = 1
+
+ for epoch in range(epochs):
+     loop = tqdm(dataloader, dynamic_ncols=True)
+     for step, batch in enumerate(loop):
+         optimizer.zero_grad()
+         input_ids = batch['input_ids'].to(device)
+         labels = batch['labels'].to(device)
+         attention_mask = batch['attention_mask'].to(device)
+         outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
+         loss = outputs.loss
+         loss.backward()
+         optimizer.step()
+         loop.set_description(f"Epoch {epoch + 1}")
+         loop.set_postfix(loss=loss.item())
+ ```
+
+ ## **Training Details**
+
+ ### **Hyperparameters**
+ - **Batch Size**: 16
+ - **Learning Rate**: 5e-5
+ - **Number of Epochs**: 1
+ - **Max Sequence Length**: 512 tokens
+
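+ For reference, the same hyperparameters map onto the `Trainer` API with `DataCollatorForLanguageModeling`, which applies 15% masking dynamically at batch time. This is only a sketch of an equivalent setup, not the exact script used to train this model, and the `formula` column name and `output_dir` are assumptions:
+
+ ```python
+ from datasets import load_dataset
+ from transformers import (BertForMaskedLM, BertTokenizerFast, DataCollatorForLanguageModeling,
+                           Trainer, TrainingArguments)
+
+ tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
+ model = BertForMaskedLM.from_pretrained('bert-base-uncased')
+
+ dataset = load_dataset('ddrg/named_math_formulas')
+
+ def tokenize(batch):
+     # 'formula' is an assumed column name; adjust it to the dataset's actual schema
+     return tokenizer(batch['formula'], max_length=512, truncation=True, padding='max_length')
+
+ tokenized = dataset['train'].map(tokenize, batched=True, remove_columns=dataset['train'].column_names)
+
+ # Randomly masks 15% of tokens in each batch, mirroring the manual masking above
+ collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
+
+ args = TrainingArguments(
+     output_dir='mathbert-mlm',
+     per_device_train_batch_size=16,
+     learning_rate=5e-5,
+     num_train_epochs=1,
+ )
+
+ trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
+ trainer.train()
+ ```
+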
+ ### **Dataset**
+ - **Dataset Name**: *ddrg/named_math_formulas*
+ - **Task**: Masked Language Modeling (MLM) on mathematical formulas
+
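+ To take a quick look at the data before training, the dataset can be loaded directly from the Hugging Face Hub. A minimal sketch (assuming a standard `train` split; the printed fields depend on the dataset's schema):
+
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset('ddrg/named_math_formulas', split='train')
+ print(ds)     # number of rows and column names
+ print(ds[0])  # one raw example
+ ```
+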
+ ## **Acknowledgements**
+
+ - The *ddrg/named_math_formulas* dataset is available on the Hugging Face dataset hub and provides a rich collection of mathematical formulas for training.
+ - This model uses the Hugging Face `transformers` library, a state-of-the-art library for NLP models.