AndyChiang committed on
Commit
39b71ab
1 Parent(s): 9467b77

Update README.md

  1. README.md +99 -0
README.md CHANGED
@@ -1,3 +1,102 @@
1
  ---
2
  license: mit
3
  ---
1
  ---
2
  license: mit
3
+ language: en
4
+ tags:
5
+ - bert
6
+ - cloze
7
+ - distractor
8
+ - generation
9
+ datasets:
10
+ - dgen
11
+ widget:
12
+ - text: "The only known planet with large amounts of water is [MASK]. [SEP] earth"
13
+ - text: "The products of photosynthesis are glucose and [MASK] else. [SEP] oxygen"
14
  ---
15
+
16
+ # cdgp-csg-bert-dgen
17
+
18
+ ## Model description
19
+
20
+ This model is the Candidate Set Generator (CSG) from **"CDGP: Automatic Cloze Distractor Generation based on Pre-trained Language Model", Findings of EMNLP 2022**.
21
+
22
+ It takes a stem and its answer as input and outputs a candidate set of distractors. It is fine-tuned on the [**DGen**](https://github.com/DRSY/DGen) dataset, starting from the [**bert-base-uncased**](https://huggingface.co/bert-base-uncased) model.
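
The input format can be read off the widget examples in the metadata above: the stem with the blank replaced by `[MASK]`, then `[SEP]`, then the answer. A minimal sketch of building such a string follows. The helper name `build_csg_input` and the `_` blank marker are assumptions made for illustration and are not part of the released code.

```python
def build_csg_input(stem: str, answer: str, blank: str = "_") -> str:
    """Format a cloze stem and its answer as '<stem with [MASK]> [SEP] <answer>'.

    `blank` is a hypothetical placeholder marking the gap in the stem.
    """
    if stem.count(blank) != 1:
        raise ValueError("stem must contain exactly one blank marker")
    return f"{stem.replace(blank, '[MASK]')} [SEP] {answer}"


sent = build_csg_input(
    "The only known planet with large amounts of water is _.", "earth"
)
print(sent)
```

The resulting string matches the first widget example and can be passed to a fill-mask pipeline as-is.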
23
+
24
+ For more details, see our **paper** or [**GitHub**](https://github.com/AndyChiangSH/CDGP).
25
+
26
+ ## How to use?
27
+
28
+ 1. Download the model with Hugging Face Transformers.
29
+ ```python
30
+ from transformers import BertTokenizer, BertForMaskedLM, pipeline
31
+
32
+ tokenizer = BertTokenizer.from_pretrained("AndyChiang/cdgp-csg-bert-dgen")
33
+ csg_model = BertForMaskedLM.from_pretrained("AndyChiang/cdgp-csg-bert-dgen")
34
+ ```
35
+
36
+ 2. Create an unmasker.
37
+ ```python
38
+ unmasker = pipeline("fill-mask", tokenizer=tokenizer, model=csg_model, top_k=10)
39
+ ```
40
+
41
+ 3. Use the unmasker to generate the candidate set of distractors.
42
+ ```python
43
+ sent = "The only known planet with large amounts of water is [MASK]. [SEP] earth"
44
+ cs = unmasker(sent)
45
+ print(cs)
46
+ ```
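
The fill-mask pipeline returns a list of dictionaries, each with a `score` and a `token_str` among its keys. One possible way to turn that list into a plain candidate set is sketched below. The helper `extract_candidates` is a hypothetical name, dropping the answer itself is an assumption about what a useful candidate set looks like, and the sample predictions are hand-written stand-ins rather than real model output.

```python
def extract_candidates(cs, answer):
    """Return predicted tokens from fill-mask output, best first, without the answer."""
    answer = answer.strip().lower()
    candidates = []
    for pred in sorted(cs, key=lambda p: p["score"], reverse=True):
        word = pred["token_str"].strip()
        # Skip the correct answer and any duplicate tokens.
        if word.lower() != answer and word not in candidates:
            candidates.append(word)
    return candidates


# Hand-written stand-in for the pipeline output (scores are made up).
fake_cs = [
    {"score": 0.30, "token_str": "mars"},
    {"score": 0.25, "token_str": "earth"},
    {"score": 0.10, "token_str": "venus"},
]
print(extract_candidates(fake_cs, "earth"))
```

With the real model, `extract_candidates(cs, "earth")` would be called on the `cs` list produced in step 3.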
47
+
48
+ ## Dataset
49
+
50
+ This model is fine-tuned on the [DGen](https://github.com/DRSY/DGen) dataset, which covers multiple domains including science, vocabulary, common sense and trivia. It is compiled from a wide variety of datasets including SciQ, MCQL, AI2 Science Questions, etc. The details of the DGen dataset are shown below.
51
+
52
+ | Number of questions | Train | Valid | Test |
53
+ | ------------------- | ----- | ----- | ----- |
54
+ | Middle school | 22056 | 3273 | 3198 |
55
+ | High school | 54794 | 7794 | 8318 |
56
+ | Total | 76850 | 11067 | 11516 |
57
+
58
+ You can also use the cleaned [dataset](https://github.com/AndyChiangSH/CDGP/blob/main/datasets/DGen.zip) that we provide.
59
+
60
+ ## Training
61
+
62
+ We fine-tune the model with a method called **"Answer-Relating Fine-Tune"**. More details are in our paper.
63
+
64
+ ### Training hyperparameters
65
+
66
+ The following hyperparameters were used during training:
67
+
68
+ - Pre-trained language model: [bert-base-uncased](https://huggingface.co/bert-base-uncased)
69
+ - Optimizer: Adam
70
+ - Learning rate: 0.0001
71
+ - Max length of input: 64
72
+ - Batch size: 64
73
+ - Epochs: 1
74
+ - Device: NVIDIA® Tesla T4 in Google Colab
75
+
76
+ ## Testing
77
+
78
+ The evaluation results of this model as a Candidate Set Generator in CDGP are as follows:
79
+
80
+ | P@1 | F1@3 | MRR | NDCG@10 |
81
+ | ----- | ---- | ----- | ------- |
82
+ | 10.81 | 7.72 | 18.15 | 24.47 |
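
For readers unfamiliar with these ranking metrics, the sketch below shows how they are commonly computed for a single question, given a ranked list of generated distractors and a set of ground-truth distractors. It uses the standard textbook definitions with binary relevance. The exact evaluation protocol behind the table (averaging, scaling to percentages, tie handling) is described in the paper and may differ, and the example words are invented.

```python
import math


def precision_at_k(ranked, gold, k):
    """Fraction of the top-k generated distractors that are ground-truth distractors."""
    return sum(1 for d in ranked[:k] if d in gold) / k


def recall_at_k(ranked, gold, k):
    """Fraction of ground-truth distractors found in the top k."""
    return sum(1 for d in ranked[:k] if d in gold) / len(gold)


def f1_at_k(ranked, gold, k):
    """Harmonic mean of precision@k and recall@k."""
    p, r = precision_at_k(ranked, gold, k), recall_at_k(ranked, gold, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def mrr(ranked, gold):
    """Reciprocal rank of the first ground-truth distractor (0 if none is found)."""
    for rank, d in enumerate(ranked, start=1):
        if d in gold:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, gold, k):
    """Binary-relevance NDCG: DCG of the top k divided by the best achievable DCG."""
    dcg = sum(
        1.0 / math.log2(i + 1)
        for i, d in enumerate(ranked[:k], start=1)
        if d in gold
    )
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, min(k, len(gold)) + 1))
    return dcg / idcg if idcg else 0.0


# Invented example: four generated distractors, three ground-truth distractors.
ranked = ["mars", "venus", "moon", "jupiter"]
gold = {"venus", "jupiter", "saturn"}
print(precision_at_k(ranked, gold, 1), f1_at_k(ranked, gold, 3),
      mrr(ranked, gold), ndcg_at_k(ranked, gold, 10))
```

Under this reading, the table values would correspond to these per-question scores averaged over the test set and expressed as percentages.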
83
+
84
+ ## Other models
85
+
86
+ ### Candidate Set Generator
87
+
88
+ | Models | CLOTH | DGen |
89
+ | ----------- | ----------------------------------------------------------------------------------- | -------------------------------------------------------------------------------- |
90
+ | **BERT** | [cdgp-csg-bert-cloth](https://huggingface.co/AndyChiang/cdgp-csg-bert-cloth) | [*cdgp-csg-bert-dgen*](https://huggingface.co/AndyChiang/cdgp-csg-bert-dgen) |
91
+ | **SciBERT** | [cdgp-csg-scibert-cloth](https://huggingface.co/AndyChiang/cdgp-csg-scibert-cloth) | [cdgp-csg-scibert-dgen](https://huggingface.co/AndyChiang/cdgp-csg-scibert-dgen) |
92
+ | **RoBERTa** | [cdgp-csg-roberta-cloth](https://huggingface.co/AndyChiang/cdgp-csg-roberta-cloth) | [cdgp-csg-roberta-dgen](https://huggingface.co/AndyChiang/cdgp-csg-roberta-dgen) |
93
+ | **BART** | [cdgp-csg-bart-cloth](https://huggingface.co/AndyChiang/cdgp-csg-bart-cloth) | [cdgp-csg-bart-dgen](https://huggingface.co/AndyChiang/cdgp-csg-bart-dgen) |
94
+
95
+ ### Distractor Selector
96
+
97
+ **fastText**: [cdgp-ds-fasttext](https://huggingface.co/AndyChiang/cdgp-ds-fasttext)
98
+
99
+
100
+ ## Citation
101
+
102
+ None