pranav-s committed
Commit deeaff2
1 Parent(s): 81b617c

Upload 7 files

README.md CHANGED
@@ -1,3 +1,77 @@
  ---
+ language: en
+ tags:
+ - transformers
+ - feature-extraction
+ - materials
  license: other
  ---
+
+ # PolymerNER
+
+ This model is a fine-tuned version of MaterialsBERT, trained on a dataset of 638 polymer abstracts, with a linear layer on top of MaterialsBERT that predicts the entity type of each token. The entity types predicted by this model are POLYMER, POLYMER\_FAMILY, ORGANIC, INORGANIC, MONOMER, PROP\_NAME, PROP\_VALUE, and MATERIAL\_AMOUNT.
+ This named entity recognition (NER) model was introduced in [this paper](https://www.nature.com/articles/s41524-023-01003-w). Refer to the paper for a more detailed description of the entities and the performance metrics of this model. As MaterialsBERT is uncased, the NER model is also uncased.
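+
+ The full label set, including the `O` (outside) tag, is stored in the model's `config.json` (added below in this commit) and can be inspected without downloading the weights. A minimal sketch:
+
+ ```python
+ from transformers import AutoConfig
+
+ # Fetch only the configuration file, not the model weights
+ config = AutoConfig.from_pretrained("pranav-s/PolymerNER")
+
+ # Mapping from class index to entity label,
+ # e.g. {0: "INORGANIC", 1: "MATERIAL_AMOUNT", 2: "MONOMER", 3: "O", ...}
+ print(config.id2label)
+ ```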
+
+ ## Intended uses & limitations
+
+ You can use the model for sequence labeling/entity-tagging tasks on materials science text. The training, validation, and test data for this model consisted of abstracts related to polymers. However, the entity types tagged by this model are general, so it can be applied to any materials science text to tag the entity types defined in its ontology.
+
+ ## How to use
+
+ Here is how to use this model to tag entities in a given piece of text:
+
+ ```python
+ from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
+
+ # Load the tokenizer and the fine-tuned token-classification model from the Hub
+ tokenizer = AutoTokenizer.from_pretrained('pranav-s/PolymerNER', model_max_length=512)
+ model = AutoModelForTokenClassification.from_pretrained('pranav-s/PolymerNER')
+
+ # aggregation_strategy="simple" merges sub-word tokens into whole entity spans;
+ # device=-1 runs on CPU (Transformers 4.17 expects an int here; pass a GPU index such as 0 for CUDA)
+ ner_pipeline = pipeline(task="ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=-1)
+
+ text = "Polyethylene has a glass transition temperature of -100 °C"
+ ner_output = ner_pipeline(text)
+ ```
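+
+ With `aggregation_strategy="simple"`, the pipeline returns one dictionary per detected entity span, with `entity_group`, `score`, `word`, `start`, and `end` keys. A short sketch of inspecting the output; the grouping and scores in the comments are illustrative, not from an actual run:
+
+ ```python
+ # Each item describes one aggregated entity span
+ for entity in ner_output:
+     print(f"{entity['word']!r} -> {entity['entity_group']} ({entity['score']:.2f})")
+
+ # Illustrative output for the example sentence (the model is uncased,
+ # so surface forms come back lowercased):
+ # 'polyethylene' -> POLYMER (0.99)
+ # 'glass transition temperature' -> PROP_NAME (0.99)
+ # '- 100 °c' -> PROP_VALUE (0.98)
+ ```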
+
+ ## Training data
+
+ A training data set of 638 polymer abstracts was used. The data set is provided [here](https://github.com/Ramprasad-Group/polymer_information_extraction).
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training (a sketch of equivalent `TrainingArguments` follows the list):
+ - learning_rate: 5e-05
+ - train_batch_size: 8
+ - eval_batch_size: 8
+ - seed: 42
+ - optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 5
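+
+ The training script itself is not part of this repository. A minimal sketch of how the listed hyperparameters map onto the Hugging Face `Trainer` API (the `output_dir` is hypothetical, and dataset loading/tokenization are omitted):
+
+ ```python
+ from transformers import TrainingArguments
+
+ # Hypothetical reconstruction of the settings listed above
+ training_args = TrainingArguments(
+     output_dir="polymer-ner",   # hypothetical output path
+     learning_rate=5e-05,
+     per_device_train_batch_size=8,
+     per_device_eval_batch_size=8,
+     seed=42,
+     adam_beta1=0.9,             # Adam betas=(0.9, 0.999)
+     adam_beta2=0.999,
+     adam_epsilon=1e-08,
+     lr_scheduler_type="linear",
+     num_train_epochs=5,
+ )
+ ```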
+
+ ### Framework versions
+
+ - Transformers 4.17.0
+ - PyTorch 1.10.2
+ - Datasets 1.18.3
+ - Tokenizers 0.11.0
+
+ ## Citation
+
+ If you find PolymerNER useful in your research, please cite the following paper:
+
+ ```latex
+ @article{materialsbert,
+ title={A general-purpose material property data extraction pipeline from large polymer corpora using natural language processing},
+ author={Shetty, Pranav and Rajan, Arunkumar Chitteth and Kuenneth, Chris and Gupta, Sonakshi and Panchumarti, Lakshmi Prerana and Holm, Lauren and Zhang, Chao and Ramprasad, Rampi},
+ journal={npj Computational Materials},
+ volume={9},
+ number={1},
+ pages={52},
+ year={2023},
+ publisher={Nature Publishing Group UK London}
+ }
+ ```
+
+ <a href="https://huggingface.co/exbert/?model=pranav-s/PolymerNER">
+ <img width="300px" src="https://cdn-media.huggingface.co/exbert/button.png">
+ </a>
config.json ADDED
@@ -0,0 +1,48 @@
+ {
+   "_name_or_path": "/data/pranav/projects/matbert/pretrained_models/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext",
+   "architectures": [
+     "BertForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "finetuning_task": "ner",
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "INORGANIC",
+     "1": "MATERIAL_AMOUNT",
+     "2": "MONOMER",
+     "3": "O",
+     "4": "ORGANIC",
+     "5": "POLYMER",
+     "6": "POLYMER_FAMILY",
+     "7": "PROP_NAME",
+     "8": "PROP_VALUE"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "INORGANIC": 0,
+     "MATERIAL_AMOUNT": 1,
+     "MONOMER": 2,
+     "O": 3,
+     "ORGANIC": 4,
+     "POLYMER": 5,
+     "POLYMER_FAMILY": 6,
+     "PROP_NAME": 7,
+     "PROP_VALUE": 8
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.17.0.dev0",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:73c7ad087af05c9dbd3b8b99d4550a7b8e5eccf6c10efa7e445c505f579478ad
+ size 435677681
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "/data/pranav/projects/matbert/pretrained_models/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext", "do_basic_tokenize": true, "never_split": null, "tokenizer_class": "BertTokenizer"}
vocab.txt ADDED
The diff for this file is too large to render. See raw diff