Commit 220e83e by berendg · 1 Parent(s): 3f7ee1e

Create README.md

Files changed (1)
  1. README.md +149 -0
README.md ADDED
@@ -0,0 +1,149 @@
+ ---
+ license: mit
+ datasets:
+ - wikipedia
+ language:
+ - en
+ metrics:
+ - glue
+ ---
+ # Model Card for SzegedAI/babylm-strict-small-mlsm
+
+ <!-- Provide a quick summary of what the model is/does. -->
+
+ This base-sized DeBERTa model was created using the [Masked Latent Semantic Modeling](https://aclanthology.org/2023.findings-acl.876/) (MLSM) pre-training objective, which is a sample-efficient alternative to classic Masked Language Modeling (MLM).
+ During MLSM, the objective is to recover the latent semantic profile of the masked tokens, as opposed to recovering their exact identity.
+ The contextualized latent semantic profile used during pre-training is determined by performing sparse coding on the hidden representations of a partially pre-trained model (in this particular case, a base-sized DeBERTa model pre-trained over only 20 million input sequences).
+
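To make the objective described above more concrete, here is a minimal, illustrative sketch; it is not the training code from the MLSM repository. It assumes that the sparse code of a masked token is a non-negative vector, that this vector is normalized into a probability distribution, and that the model is trained with a cross-entropy loss between that target and its own softmax output. All function names and numbers below are hypothetical.

```python
import math


def normalize_sparse_code(sparse_code):
    """Turn a non-negative sparse code into a probability distribution."""
    total = sum(sparse_code)
    if total <= 0:
        raise ValueError("sparse code must have at least one positive entry")
    return [value / total for value in sparse_code]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def latent_semantic_loss(logits, sparse_code):
    """Cross-entropy between the target profile and the predicted profile.

    Assumption: the target is the normalized sparse code of the masked token.
    """
    target = normalize_sparse_code(sparse_code)
    predicted = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Hypothetical 5-dimensional example: only two latent properties are active.
sparse_code = [0.0, 1.5, 0.0, 0.5, 0.0]
good_logits = [-2.0, 3.0, -2.0, 2.0, -2.0]   # mass on the active properties
bad_logits = [3.0, -2.0, 2.0, -2.0, -2.0]    # mass on inactive properties

print(latent_semantic_loss(good_logits, sparse_code))
print(latent_semantic_loss(bad_logits, sparse_code))
```

In this toy setting the first loss is smaller than the second, because the first prediction puts its probability mass on the same latent properties as the target profile.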
+ ## Model Details
+
+ ### Model Description
+
+ <!-- Provide a longer summary of what this model is. -->
+
+ - **Developed by:** SzegedAI
+ - **Model type:** transformer encoder
+ - **Language:** English
+ - **License:** MIT
+
+ ### Model Sources
+
+ <!-- Provide the basic links for the model. -->
+
+ - **Repository:** [https://github.com/szegedai/MLSM](https://github.com/szegedai/MLSM)
+ - **Paper:** [Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling](https://underline.io/events/395/posters/15279/poster/78046-masked-latent-semantic-modeling-an-efficient-pre-training-alternative-to-masked-language-modeling?tab=abstract+%26+voting)
+
+ ## How to Get Started with the Model
+
+ The pre-trained model can be used in the usual manner. For example, to fine-tune it on a particular sequence classification task, load it as follows:
+
+ ```
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ # Load the tokenizer and the pre-trained encoder from the Hugging Face Hub.
+ tokenizer = AutoTokenizer.from_pretrained('SzegedAI/babylm-strict-small-mlsm')
+ # The classification head is newly initialized and has to be fine-tuned.
+ model = AutoModelForSequenceClassification.from_pretrained('SzegedAI/babylm-strict-small-mlsm')
+ ```
+
+ ## Training Details
+
+ ### Training Data
+
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+
+ The model was pre-trained using the 10 million token [BabyLM](https://babylm.github.io/) strict small dataset.
+
+ ### Training Procedure
+
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+
+ #### Preprocessing
+
+
+ #### Training Hyperparameters
+
+ Pre-training was conducted with a batch size of 128 sequences and gradient accumulation over 8 batches, resulting in an effective batch size of 1024.
+ A total of 100,000 update steps were performed using the AdamW optimizer and a linear learning rate schedule with a peak learning rate of 1e-4.
+ A maximum sequence length of 128 tokens was used throughout pre-training.
+
+ - **Training regime:** fp32
+
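As a rough illustration of the hyperparameters above, the sketch below computes the effective batch size (128 × 8 = 1024) and evaluates a linear learning rate schedule with a peak of 1e-4 over 100,000 update steps. The number of warmup steps is not stated in this card, so `WARMUP_STEPS` is a hypothetical value, and the function names are made up for this example.

```python
PER_STEP_BATCH_SIZE = 128        # sequences per forward/backward pass (from the card)
GRAD_ACCUMULATION_STEPS = 8      # batches accumulated per update (from the card)
TOTAL_UPDATE_STEPS = 100_000     # from the card
PEAK_LEARNING_RATE = 1e-4        # from the card
WARMUP_STEPS = 1_000             # hypothetical: not stated in the card


def effective_batch_size(per_step, accumulation_steps):
    """Number of sequences contributing to a single optimizer update."""
    return per_step * accumulation_steps


def linear_schedule(step, peak=PEAK_LEARNING_RATE,
                    total=TOTAL_UPDATE_STEPS, warmup=WARMUP_STEPS):
    """Linear warmup to `peak`, then linear decay to zero at `total`."""
    if step < warmup:
        return peak * (step / warmup)
    remaining = max(0, total - step)
    return peak * (remaining / (total - warmup))


print(effective_batch_size(PER_STEP_BATCH_SIZE, GRAD_ACCUMULATION_STEPS))
for step in (0, WARMUP_STEPS, 50_000, TOTAL_UPDATE_STEPS):
    print(step, linear_schedule(step))
```

With a different (or no) warmup phase only the first branch changes; the peak value and the linear decay to zero stay the same.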
+ ## Evaluation
+
+ <!-- This section describes the evaluation protocols and provides the results. -->
+
+ ### Metrics
+
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
+
+ The model was evaluated using the [BabyLM evaluation pipeline](https://github.com/babylm/evaluation-pipeline).
+
+ ### Results
+ The results below were obtained by fine-tuning the model on a wide range of tasks.
+ For each task, 4 fine-tuning runs were performed, differing only in the random initialization of the task-specific classification head.
+ Apart from reducing the batch size from 64 to 32 (to avoid out-of-memory errors), we used the hyperparameter settings recommended by the shared task.
+ The average and the standard deviation over these runs are reported for each task.
+
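The per-task average and standard deviation over the 4 runs can be computed as in the sketch below. The scores are hypothetical placeholders, not actual run results, and the card does not say whether the sample or the population standard deviation was used, so both are shown.

```python
import statistics

# Hypothetical accuracies of 4 fine-tuning runs on one task.
run_scores = [0.66, 0.68, 0.65, 0.67]

mean = statistics.mean(run_scores)
sample_std = statistics.stdev(run_scores)       # divides by n - 1
population_std = statistics.pstdev(run_scores)  # divides by n

print(f"avg={mean:.3f} sample_std={sample_std:.3f} population_std={population_std:.3f}")
```

With only 4 runs the two estimators differ noticeably, which is worth keeping in mind when comparing the reported deviations with other submissions.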
+ #### (Super)GLUE
+ Unless stated otherwise (in parentheses after the task name), the evaluation metric is accuracy.
+
+ | Task | Avg. | Std. |
+ |---|---|---|
+ | BoolQ | 0.667 | 0.013 |
+ | CoLA (MCC) | 0.417 | 0.022 |
+ | MNLI | 0.754 | 0.006 |
+ | MNLI-mm | 0.754 | 0.010 |
+ | MRPC (F1) | 0.765 | 0.019 |
+ | MultiRC | 0.568 | 0.068 |
+ | QNLI | 0.824 | 0.003 |
+ | QQP (F1) | 0.835 | 0.008 |
+ | RTE | 0.520 | 0.024 |
+ | SST2 | 0.892 | 0.006 |
+ | WSC | 0.608 | 0.016 |
+
+ #### MSGS
+ Results are reported as Matthews correlation coefficient (MCC).
+
+ | Task | Avg. | Std. |
+ |---|---|---|
+ | control_raising_control | 0.735 | 0.036 |
+ | control_raising_lexical_content_the | -0.073 | 0.300 |
+ | control_raising_relative_token_position | -0.652 | 0.140 |
+ | lexical_content_the_control | 1.000 | 0.000 |
+ | main_verb_control | 0.998 | 0.002 |
+ | main_verb_lexical_content_the | -0.947 | 0.071 |
+ | main_verb_relative_token_position | -0.395 | 0.204 |
+ | relative_position_control | 0.896 | 0.076 |
+ | syntactic_category_control | 0.784 | 0.078 |
+ | syntactic_category_lexical_content_the | -0.166 | 0.119 |
+ | syntactic_category_relative_position | -0.528 | 0.038 |
+
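For readers less familiar with MCC, the sketch below shows how the coefficient is computed from binary confusion-matrix counts and why it ranges from -1 to 1. The helper name `mcc_from_counts` and the counts are made up for illustration; the evaluation pipeline computes the metric itself.

```python
import math


def mcc_from_counts(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when a row or column of the matrix is empty.
    return numerator / denominator if denominator else 0.0


# Hypothetical counts on a balanced test set of 100 examples.
print(mcc_from_counts(tp=50, tn=50, fp=0, fn=0))    # every prediction correct
print(mcc_from_counts(tp=25, tn=25, fp=25, fn=25))  # chance-level predictions
print(mcc_from_counts(tp=0, tn=0, fp=50, fn=50))    # every prediction inverted
```

A negative value therefore means that the predictions are anti-correlated with the gold labels, rather than merely uninformative.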
+ ## Environmental Impact
+
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+
+ - **Hardware Type:** RTX A6000
+ - **Hours used:** 70
+ - **Carbon Emitted:** approx. 9 kg CO2 eq. (estimated with the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute))
+
+ ## Citation
+
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+ The MLSM pre-training objective is introduced in the ACL Findings paper _Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling_.
+
+ **BibTeX:**
+
+ ```
+ @inproceedings{berend-2023-masked,
+ title = "Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling",
+ author = "Berend, G{\'a}bor",
+ booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
+ month = jul,
+ year = "2023",
+ address = "Toronto, Canada",
+ publisher = "Association for Computational Linguistics",
+ url = "https://aclanthology.org/2023.findings-acl.876",
+ pages = "13949--13962",
+ abstract = "In this paper, we propose an alternative to the classic masked language modeling (MLM) pre-training paradigm, where the objective is altered from the reconstruction of the exact identity of randomly selected masked subwords to the prediction of their latent semantic properties. We coin the proposed pre-training technique masked latent semantic modeling (MLSM for short). In order to make the contextualized determination of the latent semantic properties of the masked subwords possible, we rely on an unsupervised technique which uses sparse coding. Our experimental results reveal that the fine-tuned performance of those models that we pre-trained via MLSM is consistently and significantly better compared to the use of vanilla MLM pretraining and other strong baselines.",
+ }
+ ```
+