varadhbhatnagar committed on
Commit
a1317e4
·
1 Parent(s): 1e667a4

Create README.md

  1. README.md +145 -0
README.md ADDED
@@ -0,0 +1,145 @@
1
+ ---
3
+ license: apache-2.0
4
+ language:
5
+ - en
6
+ pipeline_tag: summarization
7
+ ---
8
+
9
+ # Model Card for DBART for Claim Summarization
10
+
11
+ <!-- Provide a quick summary of what the model is/does. -->
12
+
13
+ This model summarizes noisy social media claims into clean, concise claims that can be used for downstream tasks in a fact-checking pipeline.
14
+
15
+ # Model Details
16
+
17
+ This is the fine-tuned DBART (Distilled BART) model, trained with the 'Pre-processed with Mention Removed (P-M)' preprocessing strategy detailed in Table 2 of the paper.
18
+
19
+ ## Model Description
20
+
21
+ <!-- Provide a longer summary of what this model is. -->
22
+
23
+ - **Developed by:** Varad Bhatnagar, Diptesh Kanojia and Kameswari Chebrolu
24
+ - **Model type:** Summarization
25
+ - **Language(s) (NLP):** English
26
+ - **Finetuned from model:** https://huggingface.co/sshleifer/distilbart-cnn-12-6
27
+
28
+ ## Model Sources
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** https://github.com/varadhbhatnagar/FC-Claim-Det
33
+ - **Paper:** https://aclanthology.org/2022.coling-1.259/
34
+
35
+ ## Tokenizer
36
+
37
+ Same as https://huggingface.co/sshleifer/distilbart-cnn-12-6
38
+
39
+
40
+ # Uses
41
+
42
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
43
+
44
+ ## Direct Use
45
+
46
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
47
+
48
+ English-to-English summarization of noisy, fact-check-worthy claims found on social media.
49
+
50
+ ## Downstream Use
51
+
52
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
53
+
54
+ The generated summaries can be used for other tasks in a fact-checking pipeline, such as claim matching and evidence retrieval.
55
+
56
+ # Bias, Risks, and Limitations
57
+
58
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
59
+
60
+ As the [Google Fact Check Explorer](https://toolbox.google.com/factcheck/explorer) is an ever-growing and evolving system, current Retrieval@k results may not exactly match
61
+ those in the corresponding paper, as those experiments were conducted in April and May 2022.
62
+
63
+ # Training Details
64
+
65
+ ## Training Data
66
+
67
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
68
+
69
+ [Data](https://github.com/varadhbhatnagar/FC-Claim-Det/blob/main/public_data/released_data.csv)
70
+
71
+ ## Training Procedure
72
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
73
+ The pretrained Distilled BART model was fine-tuned on the 567 claim-summary pairs released with our paper.
74
+
75
+ ### Preprocessing
76
+
77
+ Pre-processed with Mention Removed (P-M). Apply this strategy to the input text before feeding it to the model for summarization.
78
+
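The mention-removal part of the P-M strategy might look roughly like the sketch below. It is an illustration only, not the authors' code: the helper name `remove_mentions` and the regular expression are assumptions, and the remaining pre-processing steps are described in the paper and are not reproduced here.

```python
import re

# Assumption: a mention is "@" followed by word characters (Twitter style).
# The lookbehind avoids touching e-mail addresses such as "a@b.com".
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def remove_mentions(text: str) -> str:
    """Drop @mentions and collapse the whitespace they leave behind."""
    without_mentions = MENTION_RE.sub(" ", text)
    return " ".join(without_mentions.split())
```

A call such as `remove_mentions('@user please check this claim')` would then be applied to each post before tokenization.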
79
+ # Evaluation
80
+ <!-- This section describes the evaluation protocols and provides the results. -->
81
+ Retrieval@5 and Mean Reciprocal Rank (MRR) scores are reported.
82
+
83
+ ## Results
84
+
85
+ - Retrieval@5 = 30.15
86
+ - MRR = 0.26
87
+
88
+ Further details can be found in the paper.
89
+
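For readers unfamiliar with the two metrics, the sketch below shows one common way to compute a top-k retrieval score and Mean Reciprocal Rank from the rank at which the correct fact-check was retrieved for each query. The function names and the convention of using `None` for a miss are assumptions made for illustration; the paper remains the reference for the exact evaluation protocol.

```python
from typing import Optional, Sequence


def retrieval_at_k(ranks: Sequence[Optional[int]], k: int = 5) -> float:
    """Percentage of queries whose correct result appears in the top k.

    Each entry is the 1-based rank of the correct result, or None if it
    was not retrieved at all (assumed convention).
    """
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return 100.0 * hits / len(ranks)


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries; a miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)
```

With hypothetical ranks `[1, 2, None, 4]`, three of the four queries land in the top 5 and the reciprocal ranks are 1, 0.5, 0 and 0.25.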
90
+ # Other Models from the Same Work
91
+
92
+ - [DPEGASUS](https://huggingface.co/varadhbhatnagar/fc-claim-det-DPEGASUS)
93
+ - [T5-Base](https://huggingface.co/varadhbhatnagar/fc-claim-det-T5-base)
94
+
95
+ # Citation
96
+
97
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
98
+
99
+ **BibTeX:**
100
+ ```
101
+ @inproceedings{bhatnagar-etal-2022-harnessing,
102
+ title = "Harnessing Abstractive Summarization for Fact-Checked Claim Detection",
103
+ author = "Bhatnagar, Varad and
104
+ Kanojia, Diptesh and
105
+ Chebrolu, Kameswari",
106
+ booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
107
+ month = oct,
108
+ year = "2022",
109
+ address = "Gyeongju, Republic of Korea",
110
+ publisher = "International Committee on Computational Linguistics",
111
+ url = "https://aclanthology.org/2022.coling-1.259",
112
+ pages = "2934--2945",
113
+ abstract = "Social media platforms have become new battlegrounds for anti-social elements, with misinformation being the weapon of choice. Fact-checking organizations try to debunk as many claims as possible while staying true to their journalistic processes but cannot cope with its rapid dissemination. We believe that the solution lies in partial automation of the fact-checking life cycle, saving human time for tasks which require high cognition. We propose a new workflow for efficiently detecting previously fact-checked claims that uses abstractive summarization to generate crisp queries. These queries can then be executed on a general-purpose retrieval system associated with a collection of previously fact-checked claims. We curate an abstractive text summarization dataset comprising noisy claims from Twitter and their gold summaries. It is shown that retrieval performance improves 2x by using popular out-of-the-box summarization models and 3x by fine-tuning them on the accompanying dataset compared to verbatim querying. Our approach achieves Recall@5 and MRR of 35{\%} and 0.3, compared to baseline values of 10{\%} and 0.1, respectively. Our dataset, code, and models are available publicly: https://github.com/varadhbhatnagar/FC-Claim-Det/.",
114
+ }
115
+ ```
116
+
117
+ # Model Card Authors
118
+
119
+ Varad Bhatnagar
120
+
121
+ # Model Card Contact
122
+
123
124
+
125
+ # How to Get Started with the Model
126
+
127
+ Use the code below to get started with the model.
128
+
+ ```python
+ import torch
+ from transformers import BartForConditionalGeneration, BartTokenizerFast
+
+ # Use a GPU when one is available; otherwise fall back to the CPU.
+ device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
+
+ # Load the fine-tuned tokenizer and model from the Hugging Face Hub.
+ hft = BartTokenizerFast.from_pretrained('varadhbhatnagar/fc-claim-det-DBART')
+ hfm = BartForConditionalGeneration.from_pretrained('varadhbhatnagar/fc-claim-det-DBART').to(device)
+
+ # Example noisy claim (apply the P-M preprocessing to your own inputs first).
+ row = 'hi satya my name is arman today i got this video which is being spread in whatsapp and it is being said that the all old age covid 19 patients are being killed in the government hospital kindly check the facts'
+
+ # Tokenize and move the input IDs to the same device as the model.
+ tokenized_text = hft.encode(row, return_tensors="pt").to(device)
+
+ # Beam search for a short summary of 5 to 15 tokens.
+ summary_ids = hfm.generate(tokenized_text,
+                            num_beams=6,
+                            no_repeat_ngram_size=2,
+                            min_length=5,
+                            max_length=15,
+                            early_stopping=True)
+
+ output = hft.decode(summary_ids[0], skip_special_tokens=True)
+ print(output)
+ ```