datawrestler committed on
Commit
cdb0643
1 Parent(s): de3e69d

initial sync of psych-search model and documentation

README.md ADDED
@@ -0,0 +1,59 @@
+ ---
+ language:
+ - en
+ tags:
+ - mental-health
+ license: apache-2.0
+ datasets:
+ - PubMed
+ ---
+
+ # Psych-Search
+
+ ## Model description
+
+ This model extends [allenai/scibert_scivocab_uncased](https://huggingface.co/allenai/scibert_scivocab_uncased). Starting from SciBERT as the base model, pretraining was continued on abstract text only, drawn from Psychology and Psychiatry research indexed in PubMed. Training ran for 10 epochs on approximately 3.5 million papers, and the model was evaluated on a task similar to BioASQ Task A.
+
+ ## Intended uses & limitations
+
+ #### How to use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+
+ mname = "datawrestler/psych-search"
+ tokenizer = AutoTokenizer.from_pretrained(mname)
+ model = AutoModel.from_pretrained(mname)
+ ```
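
The snippet above only loads the weights. As a minimal sketch of one way to turn them into sentence vectors, the example below mean-pools the final hidden states over non-padding tokens. The card does not say how embeddings should be pooled, so mean pooling is an assumption here, and `mean_pool` and `embed` are hypothetical helper names. Mean pooling is used instead of the pooler output because `config.json` lists `BertForMaskedLM` as the architecture, so the pooler weights are probably not part of the checkpoint.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors of each sequence, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)  # guard against all-padding rows
    return summed / counts


def embed(texts, mname="datawrestler/psych-search"):
    """Return one vector per input text (downloads the model on first call)."""
    # Imported lazily so mean_pool stays usable without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(mname)
    model = AutoModel.from_pretrained(mname)
    model.eval()
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    return mean_pool(output.last_hidden_state, batch["attention_mask"])
```

Calling `embed(["first abstract", "second abstract"])` should return a tensor with one row per text and 768 columns, matching `hidden_size` in `config.json`.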
+
+ #### Limitations and bias
+
+ This model was trained on all PubMed abstracts categorized under [Psychology and Psychiatry](https://meshb.nlm.nih.gov/treeView). As of March 1, this corresponded to approximately 3.2 million papers that contained abstract text. Papers in relevant sparse categories were back-translated to increase the representation of sparser mental health categories; the affected categories are listed under Training data.
+
+
+
+ ## Training data
+
+ This model was trained on all PubMed abstracts categorized under [Psychology and Psychiatry](https://meshb.nlm.nih.gov/treeView). As of March 1, this corresponded to approximately 3.2 million papers that contained abstract text. To increase the representation of sparser mental health categories, a subset of these papers was back-translated from English to French and then from French back to English. Papers tagged with the following categories were back-translated:
+ - Female
+ - Adult
+ - Middle Aged
+ - Depressive Disorder
+ - Risk Factors
+ - Mental Disorders
+ - Child, Preschool
+ - Mental Health
+
+ In aggregate, this process added 557,980 papers to the training data.
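
The back-translation step described above can be sketched as follows. This is an illustration under stated assumptions, not the author's pipeline: the card does not name the translation system, so the two translators are passed in as plain callables, and the record fields `abstract` and `categories` are hypothetical names.

```python
from typing import Callable, Dict, List, Set

Translator = Callable[[str], str]

# Categories listed in the model card as targets for back-translation.
TARGET_CATEGORIES: Set[str] = {
    "Female", "Adult", "Middle Aged", "Depressive Disorder",
    "Risk Factors", "Mental Disorders", "Child, Preschool", "Mental Health",
}


def back_translate(text: str, to_french: Translator, to_english: Translator) -> str:
    """Round-trip a text through French to obtain a paraphrase."""
    return to_english(to_french(text))


def augment(papers: List[Dict], to_french: Translator, to_english: Translator,
            targets: Set[str] = TARGET_CATEGORIES) -> List[Dict]:
    """Return back-translated copies of the papers tagged with a target category."""
    augmented = []
    for paper in papers:
        if targets & set(paper["categories"]):
            paraphrase = back_translate(paper["abstract"], to_french, to_english)
            augmented.append({**paper, "abstract": paraphrase})
    return augmented
```

The returned copies would be appended to the original corpus, so papers in the target categories appear twice: once verbatim and once as a paraphrase.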
48
+
49
+
50
+ ## Training procedure
51
+ Continued pretraining was on Psychology and Psychiatry PubMed papers for 10 epochs. Default parameters were used with the exception of gradient accumulation steps which was set at 4, with a per device train batch size of 32. 2 x Nvidia 3090's were used in the development of this model.
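
As a quick worked example, the settings above imply the number of sequences that contribute to each optimizer step. The calculation assumes both GPUs were used together for data-parallel training, which the card does not state explicitly; `effective_batch_size` is a hypothetical helper name.

```python
def effective_batch_size(per_device_batch_size: int,
                         num_devices: int,
                         gradient_accumulation_steps: int) -> int:
    """Sequences contributing to one optimizer step under data parallelism."""
    return per_device_batch_size * num_devices * gradient_accumulation_steps


# 32 sequences per device, 2 devices (assumed), 4 accumulation steps
print(effective_batch_size(32, 2, 4))  # prints 256
```

If only one GPU was used for a given run, the same formula gives 128 sequences per step.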
+
+
+ ## Eval results
+ To evaluate the utility of psych-search within the mental health domain, psych-search was fine-tuned on a task similar to [BioASQ Task A](http://bioasq.org/): large-scale biomedical indexing, in which the model predicts the MeSH descriptors associated with each paper under Psychology and Psychiatry. The evaluation metric is the micro F1 score across all second-level descriptors under Psychology and Psychiatry, which corresponds to 38 MeSH categories.
+
+ bert-base-uncased | SciBERT Scivocab Uncased | Psych-Search
+ -------|---------|----------
+ 0.7348 | 0.7394 | 0.7415
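
For readers unfamiliar with the metric, here is a minimal sketch of micro F1 for multi-label predictions. It pools true positives, false positives, and false negatives over every paper and label before computing a single score. This is not the author's evaluation code, and the label names in the example are illustrative only; they are not necessarily among the 38 categories used.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1 over per-paper label sets."""
    tp = fp = fn = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        tp += len(gold_labels & pred_labels)  # labels correctly predicted
        fp += len(pred_labels - gold_labels)  # predicted but not in gold
        fn += len(gold_labels - pred_labels)  # in gold but missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


gold = [{"Mental Disorders", "Mental Health"}, {"Depressive Disorder"}]
predicted = [{"Mental Disorders"}, {"Depressive Disorder", "Mental Health"}]
print(round(micro_f1(gold, predicted), 4))  # prints 0.6667
```

Because counts are pooled before averaging, frequent labels weigh more heavily in this score than rare ones.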
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "_name_or_path": "allenai/scibert_scivocab_uncased",
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.3.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 31090
+ }
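
As a rough sanity check, the values in this config can be used to estimate the parameter count of the checkpoint. The sketch below assumes the standard BERT layout with a masked-language-model head whose output projection is tied to the word embeddings, and assumes weights are stored as 32-bit floats; `bert_mlm_parameter_count` is a hypothetical helper name.

```python
def bert_mlm_parameter_count(vocab_size: int, hidden_size: int, num_hidden_layers: int,
                             intermediate_size: int, max_position_embeddings: int,
                             type_vocab_size: int) -> int:
    """Estimate parameters of a BERT encoder plus tied masked-LM head."""
    layer_norm = 2 * hidden_size  # scale and bias
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * hidden_size + layer_norm
    # Query, key, value, and output projections, each with a bias
    attention = 4 * (hidden_size * hidden_size + hidden_size) + layer_norm
    feed_forward = (
        hidden_size * intermediate_size + intermediate_size
        + intermediate_size * hidden_size + hidden_size
        + layer_norm
    )
    encoder = num_hidden_layers * (attention + feed_forward)
    # Head: dense transform, layer norm, and an output bias (decoder weight is tied)
    mlm_head = hidden_size * hidden_size + hidden_size + layer_norm + vocab_size
    return embeddings + encoder + mlm_head


params = bert_mlm_parameter_count(31090, 768, 12, 3072, 512, 2)
print(params)      # prints 109951090
print(params * 4)  # prints 439804360, bytes at 4 bytes per parameter
```

Under these assumptions the estimate is close to the 439894482-byte size recorded for `pytorch_model.bin`; the small remainder would be consistent with serialization overhead and non-parameter buffers.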
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:caf4ed78a3e912b87ab90e85836dc4e853f103a03c187d0a14d07dc35bfa1d02
+ size 439894482
special_tokens_map.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "special_tokens_map_file": null, "name_or_path": "allenai/scibert_scivocab_uncased", "do_basic_tokenize": true, "never_split": null}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3daf5c850f9edcdc9135b76c0d1a6e44578aec460ce056f7123a3e11f1db8d43
+ size 2159
vocab.txt ADDED
The diff for this file is too large to render.