suayptalha committed
Commit dd8f135 · verified · 1 Parent(s): 1508e2e

Update README.md

Files changed (1):
  1. README.md +17 -17
README.md CHANGED
@@ -12,30 +12,30 @@ tags:
  - math
  ---

- # **mathBERT-base**

- This repository contains a BERT-based model, **mathBERT-base**, fine-tuned on the *ddrg/named_math_formulas* dataset for the task of **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in mathematical formulas and expressions. The goal of this project is to improve the model's understanding and generation of math-related formulas in natural language contexts.

  ## **Model Architecture**
  - **Base Model**: `bert-base-uncased`
- - **Task**: Masked Language Modeling (MLM) for mathematical formulas
  - **Tokenizer**: BERT's WordPiece tokenizer

  ## **Usage**

  ### **Loading the Pre-trained Model**

- You can load the pre-trained **mathBERT-base** model using the Hugging Face `transformers` library:

- ```python
  from transformers import BertTokenizer, BertForMaskedLM
  import torch

- tokenizer = BertTokenizer.from_pretrained('suayptalha/mathBERT-base')
- model = BertForMaskedLM.from_pretrained('suayptalha/mathBERT-base').to("cuda")

- input_text = "The area of a circle is given by the formula A = πr^2."
- masked_text = input_text.replace("circle", tokenizer.mask_token)

  inputs = tokenizer(masked_text, return_tensors='pt').to("cuda")

@@ -45,19 +45,19 @@ predicted_token_id = torch.argmax(outputs.logits, dim=-1)

  predicted_token = tokenizer.decode(predicted_token_id[0, inputs['input_ids'].shape[1] - 1])
  print(predicted_token)
- ```

  ### **Fine-tuning the Model**

- To fine-tune the **mathBERT-base** model on your own dataset, follow these steps:

- 1. Prepare your dataset (e.g., mathematical formulas) in text format.
  2. Tokenize the dataset and apply masking.
  3. Train the model using the provided training loop.

  Here's the training code:

- https://github.com/suayptalha/mathBERT-base/blob/main/mathBERT-base.ipynb

  ## **Training Details**

@@ -68,10 +68,10 @@ https://github.com/suayptalha/mathBERT-base/blob/main/mathBERT-base.ipynb
  - **Max Sequence Length**: 512 tokens

  ### **Dataset**
- - **Dataset Name**: *ddrg/named_math_formulas*
- - **Task**: Masked Language Modeling (MLM) on mathematical formulas

  ## **Acknowledgements**

- - The *ddrg/named_math_formulas* dataset is available on the Hugging Face dataset hub and provides a rich collection of mathematical formulas for training.
- - This model uses the Hugging Face `transformers` library, which is a state-of-the-art library for NLP models
 
  - math
  ---

+ # **medBERT-base**

+ This repository contains a BERT-based model, **medBERT-base**, fine-tuned on the *gayanin/pubmed-gastro-maskfilling* dataset for the task of **Masked Language Modeling (MLM)**. The model is trained to predict masked tokens in medical and gastroenterological texts. The goal of this project is to improve the model's understanding and generation of medical information in natural language contexts.

  ## **Model Architecture**
  - **Base Model**: `bert-base-uncased`
+ - **Task**: Masked Language Modeling (MLM) for medical texts
  - **Tokenizer**: BERT's WordPiece tokenizer
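
As a quick illustration of how the WordPiece tokenizer handles domain terms (a minimal sketch; the exact subword split depends on the `bert-base-uncased` vocabulary):

```python
from transformers import BertTokenizer

# Load the WordPiece tokenizer shipped with the repository.
tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')

# Rare medical terms are split into subword pieces prefixed with '##'.
print(tokenizer.tokenize("The endoscopy revealed gastroesophageal reflux."))
```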

  ## **Usage**

  ### **Loading the Pre-trained Model**

+ You can load the pre-trained **medBERT-base** model using the Hugging Face `transformers` library:

+ ```python
  from transformers import BertTokenizer, BertForMaskedLM
  import torch

+ tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
+ model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base').to("cuda")

+ input_text = "The patient was diagnosed with gastric cancer after a thorough examination."
+ masked_text = input_text.replace("gastric cancer", tokenizer.mask_token)

  inputs = tokenizer(masked_text, return_tensors='pt').to("cuda")

  predicted_token = tokenizer.decode(predicted_token_id[0, inputs['input_ids'].shape[1] - 1])
  print(predicted_token)
+ ```
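
The example above reads the prediction at the final position of the input; a variant that looks up the `[MASK]` position explicitly is sketched below (an illustrative sketch, not part of the commit; it runs on CPU and assumes a single mask token):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained('suayptalha/medBERT-base')
model = BertForMaskedLM.from_pretrained('suayptalha/medBERT-base')

# Put a single [MASK] token where the missing span should go.
masked_text = f"The patient was diagnosed with {tokenizer.mask_token} after a thorough examination."
inputs = tokenizer(masked_text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and decode the highest-scoring token there.
mask_index = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```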

  ### **Fine-tuning the Model**

+ To fine-tune the **medBERT-base** model on your own medical dataset, follow these steps:

+ 1. Prepare your dataset (e.g., medical texts or gastroenterology-related information) in text format.
  2. Tokenize the dataset and apply masking.
  3. Train the model using the provided training loop.

  Here's the training code:

+ https://github.com/suayptalha/medBERT-base/blob/main/medBERT-base.ipynb
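
A minimal MLM fine-tuning sketch along the lines of steps 1-3, using the `datasets` library and the `Trainer` API (the split name, text column name, and hyperparameters below are assumptions for illustration and may differ from the linked notebook):

```python
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# 1. Prepare the dataset ('train' split and 'text' column are assumptions).
dataset = load_dataset('gayanin/pubmed-gastro-maskfilling', split='train')

# 2. Tokenize; the collator applies random [MASK] corruption to each batch.
def tokenize(batch):
    return tokenizer(batch['text'], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# 3. Train (hyperparameters here are illustrative, not the ones used for this model).
args = TrainingArguments(output_dir='medBERT-base', per_device_train_batch_size=16,
                         num_train_epochs=1, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator).train()
```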

  ## **Training Details**

  - **Max Sequence Length**: 512 tokens

  ### **Dataset**
+ - **Dataset Name**: *gayanin/pubmed-gastro-maskfilling*
+ - **Task**: Masked Language Modeling (MLM) on medical texts

  ## **Acknowledgements**

+ - The *gayanin/pubmed-gastro-maskfilling* dataset is available on the Hugging Face Hub and provides a rich collection of medical and gastroenterology-related information for training.
+ - This model uses the Hugging Face `transformers` library, a state-of-the-art library for NLP models.