---
language: en
tags:
- education
- K-12
license: apache-2.0
datasets:
- vasugoel/K-12Corpus

---

## K-12BERT model
K-12BERT is a BERT-based model obtained by continued pretraining on the K-12Corpus. BERT-like models have shown strong gains on domain-adaptive tasks, yet we found no such model for the education domain, and K-12 education in particular. To that end, we present K-12BERT, trained on our custom curated dataset extracted from both open and proprietary education resources.

The model was trained with a masked language modeling (MLM) objective in a continued pretraining fashion, because we lacked the resources to train it from scratch. This also saved considerable compute and let us reuse the knowledge already captured by BERT. We also preserve BERT's original vocabulary, so that performance can be evaluated under those conditions.
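
As a rough illustration of what an MLM objective asks of the model, the sketch below corrupts a token sequence and records which positions must be predicted. It assumes the masking recipe from the original BERT paper (15% of tokens selected; of those, 80% replaced by `[MASK]`, 10% by a random token, 10% left unchanged); this card does not state the rates used for K-12BERT. The function `mask_tokens` and the toy `vocab` are hypothetical names used only for illustration.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted, labels) under the assumed BERT masking recipe.

    labels[i] is the original token at positions the model must predict,
    and None at positions that do not contribute to the loss.
    """
    rng = rng or random.Random()
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= mask_prob:
            # Position not selected: keep the token, no prediction target.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK_TOKEN)          # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(tok)                 # 10%: leave unchanged
    return corrupted, labels


# Toy example; a real pipeline would operate on tokenizer ids instead of strings.
vocab = ["the", "plant", "needs", "water", "and", "light"]
tokens = ["the", "plant", "needs", "water"]
corrupted, labels = mask_tokens(tokens, vocab, rng=random.Random(0))
```

In continued pretraining, this corruption is applied to domain text (here, the K-12Corpus) and the already pretrained BERT weights are updated to recover the selected tokens.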

## Intended uses
We hope the community, especially researchers and professionals working in education, can use this model to advance AI in education. Given its many possible uses on online education platforms, we hope it contributes to better education resources for the upcoming generation.

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import BertTokenizer, BertModel

# The Auto classes work as well: AutoTokenizer for the tokenizer, and
# AutoModelForMaskedLM if you want MLM logits instead of hidden states.
tokenizer = BertTokenizer.from_pretrained("vasugoel/K-12BERT")
model = BertModel.from_pretrained("vasugoel/K-12BERT")

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)  # output.last_hidden_state holds the per-token features
```
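
The model returns one feature vector per token. If a single vector per text is needed, one common option (an assumption here, not something this card prescribes) is to average the token vectors while skipping padded positions. The sketch below is written in plain Python on nested lists so that it runs without downloading the model; `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padded positions.

    token_vectors: list of per-token feature vectors (lists of floats).
    attention_mask: list of 0/1 flags, 1 for real tokens and 0 for padding.
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    # zip(*kept) groups values by feature dimension across the kept tokens.
    return [sum(dim) / len(kept) for dim in zip(*kept)]


# Toy input: three tokens with 2-dimensional features, the last one is padding.
features = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(features, mask)
```

With the objects from the snippet above, the inputs could be obtained as `output.last_hidden_state[0].tolist()` and `encoded_input["attention_mask"][0].tolist()`.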

### BibTeX entry and citation info

```bibtex
@misc{https://doi.org/10.48550/arxiv.2205.12335,
  doi = {10.48550/ARXIV.2205.12335},
  url = {https://arxiv.org/abs/2205.12335},
  author = {Goel, Vasu and Sahnan, Dhruv and V, Venktesh and Sharma, Gaurav and Dwivedi, Deep and Mohania, Mukesh},
  keywords = {Computation and Language (cs.CL), Machine Learning (cs.LG), FOS: Computer and information sciences},
  title = {K-12BERT: BERT for K-12 education},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```