Upload 7 files
- README.md +74 -0
- config.json +48 -0
- pytorch_model.bin +3 -0
- special_tokens_map.json +1 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- vocab.txt +0 -0
README.md
CHANGED
@@ -1,3 +1,77 @@
---
language: en
tags:
- transformers
- feature-extraction
- materials
license: other
---

# PolymerNER

This model is a fine-tuned version of MaterialsBERT, trained on a dataset of 638 abstracts. It adds a linear layer on top of MaterialsBERT to predict the entity type of each token. The entity types predicted by this model are POLYMER, POLYMER\_FAMILY, ORGANIC, INORGANIC, MONOMER, PROP\_NAME, PROP\_VALUE, and MATERIAL\_AMOUNT.
This named entity recognition (NER) model was introduced in [this paper](https://www.nature.com/articles/s41524-023-01003-w). Refer to the paper for a more detailed description of the entities and the model's performance metrics. As MaterialsBERT is uncased, the NER model is also uncased.

## Intended uses & limitations

You can use the model for sequence labeling/entity tagging tasks on materials science text. The training, validation, and test data for this model consisted of abstracts related to polymers. However, the entity types tagged by this model are general, so the model can be applied to any materials science text to tag the entity types defined in its ontology.

## How to Use

Here is how to use this model to tag entities given some text:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

# Load the tokenizer and the fine-tuned token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained('pranav-s/PolymerNER', model_max_length=512)
model = AutoModelForTokenClassification.from_pretrained('pranav-s/PolymerNER')
# aggregation_strategy="simple" merges word pieces into entity spans; device=-1 selects the CPU
ner_pipeline = pipeline(task="ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=-1)
text = "Polyethylene has a glass transition temperature of -100 °C"
ner_output = ner_pipeline(text)
```
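With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below shows one possible way to collect the tagged spans by entity type. The helper name `group_entities`, the `min_score` threshold, and the sample list are illustrative assumptions. The sample values are hand-written and are not real model output.

```python
from collections import defaultdict


def group_entities(ner_output, min_score=0.0):
    """Group aggregated NER spans by entity type (hypothetical helper)."""
    grouped = defaultdict(list)
    for entity in ner_output:
        # Keep only spans whose confidence reaches the chosen threshold
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written sample in the pipeline's output format (illustrative, NOT real model output)
sample_output = [
    {"entity_group": "POLYMER", "score": 0.99, "word": "polyethylene", "start": 0, "end": 12},
    {"entity_group": "PROP_NAME", "score": 0.98, "word": "glass transition temperature", "start": 19, "end": 47},
    {"entity_group": "PROP_VALUE", "score": 0.97, "word": "-100 °c", "start": 51, "end": 58},
]

print(group_entities(sample_output))
```

In practice, `ner_output` from the snippet above would be passed in place of `sample_output`.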

## Training data

A training data set of 638 polymer abstracts was used. The data set is available [here](https://github.com/Ramprasad-Group/polymer_information_extraction).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
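
To make the `linear` scheduler setting concrete, here is a minimal sketch of how the learning rate would decay from 5e-05 to zero over training. It rests on several assumptions not stated in the model card: no warmup steps, no gradient accumulation, a single device, and all 638 abstracts used for training (the actual training split may be smaller). The function name `linear_lr` is hypothetical.

```python
import math


def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at a given optimizer step under linear decay (sketch)."""
    if warmup_steps > 0 and step < warmup_steps:
        # Optional linear warmup from 0 to base_lr (assumed unused here)
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_examples = 638  # assumption: every abstract is a training example
steps_per_epoch = math.ceil(num_examples / 8)  # train_batch_size: 8
total_steps = steps_per_epoch * 5  # num_epochs: 5

for step in (0, total_steps // 2, total_steps):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the run has 80 steps per epoch and 400 steps in total, and the rate halves by the midpoint of training.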

### Framework versions

- Transformers 4.17.0
- PyTorch 1.10.2
- Datasets 1.18.3
- Tokenizers 0.11.0

## Citation

If you find PolymerNER useful in your research, please cite the following paper:

```bibtex
@article{materialsbert,
  title={A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing},
  author={Shetty, Pranav and Rajan, Arunkumar Chitteth and Kuenneth, Chris and Gupta, Sonakshi and Panchumarti, Lakshmi Prerana and Holm, Lauren and Zhang, Chao and Ramprasad, Rampi},
  journal={npj Computational Materials},
  volume={9},
  number={1},
  pages={52},
  year={2023},
  publisher={Nature Publishing Group UK London}
}
```

<a href="https://huggingface.co/exbert/?model=pranav-s/PolymerNER">
<img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
</a>
config.json
ADDED
@@ -0,0 +1,48 @@
{
  "_name_or_path": "/data/pranav/projects/matbert/pretrained_models/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
  "architectures": [
    "BertForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "finetuning_task": "ner",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "INORGANIC",
    "1": "MATERIAL_AMOUNT",
    "2": "MONOMER",
    "3": "O",
    "4": "ORGANIC",
    "5": "POLYMER",
    "6": "POLYMER_FAMILY",
    "7": "PROP_NAME",
    "8": "PROP_VALUE"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "INORGANIC": 0,
    "MATERIAL_AMOUNT": 1,
    "MONOMER": 2,
    "O": 3,
    "ORGANIC": 4,
    "POLYMER": 5,
    "POLYMER_FAMILY": 6,
    "PROP_NAME": 7,
    "PROP_VALUE": 8
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.17.0.dev0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
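
The `id2label` mapping above is what turns the classifier's nine per-token scores into entity names. As a rough illustration, the sketch below picks the highest-scoring class for each token and looks it up in a copy of that mapping. The function name `decode_predictions` and the toy logits are assumptions made for this example. Real logits would come from the model's forward pass.

```python
# Copy of the id2label mapping from config.json, with integer keys
ID2LABEL = {
    0: "INORGANIC",
    1: "MATERIAL_AMOUNT",
    2: "MONOMER",
    3: "O",
    4: "ORGANIC",
    5: "POLYMER",
    6: "POLYMER_FAMILY",
    7: "PROP_NAME",
    8: "PROP_VALUE",
}


def decode_predictions(logits):
    """Map each token's score vector to the label with the highest score."""
    labels = []
    for row in logits:
        best_id = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        labels.append(ID2LABEL[best_id])
    return labels


# Toy scores for three tokens (made up for illustration, not model output)
toy_logits = [
    [0.1, 0.0, 0.2, 0.3, 0.1, 4.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0, 0.0, 0.1, 0.0, 0.2, 0.0],
    [0.0, 0.1, 0.0, 0.4, 0.0, 0.0, 0.0, 3.2, 0.9],
]
print(decode_predictions(toy_logits))
```

Note that the label set carries no `B-`/`I-` prefixes, so each token receives an entity type directly, with `O` marking tokens outside any entity.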
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:73c7ad087af05c9dbd3b8b99d4550a7b8e5eccf6c10efa7e445c505f579478ad
size 435677681
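
The three lines above are a Git LFS pointer: the repository stores only the SHA-256 digest and byte size of `pytorch_model.bin`, while the weights themselves live in LFS storage. One way to check a downloaded copy against the pointer is sketched below. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, as is the local path in the commented usage line.

```python
import hashlib

POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:73c7ad087af05c9dbd3b8b99d4550a7b8e5eccf6c10efa7e445c505f579478ad
size 435677681
"""


def parse_lfs_pointer(text):
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Return True if the file at `path` has the pointer's size and digest."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large weight files are not loaded into memory at once
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER_TEXT)
print(pointer["algo"], pointer["size"])
# Usage, assuming the weights were downloaded to this hypothetical path:
# matches_pointer("pytorch_model.bin", pointer)
```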
special_tokens_map.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "/data/pranav/projects/matbert/pretrained_models/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "BertTokenizer"}
vocab.txt
ADDED
The diff for this file is too large to render.