pvcastro committed
Commit 2535d20 · verified · 1 Parent(s): de52ba9

Create README.md

Files changed (1): README.md (+42 -0)
README.md ADDED
@@ -0,0 +1,42 @@
---
language:
- en
---

## EconBERTa: Towards Robust Extraction of Named Entities in Economics

EconBERTa ([Lasri et al., 2023](https://aclanthology.org/2023.findings-emnlp.774)) is a DeBERTa-based language model trained from scratch on the economics domain. It was pretrained following the [ELECTRA](https://arxiv.org/abs/2003.10555) approach, using a large corpus of 9.4B tokens from 1.5M economics papers (around 800,000 full articles and 700,000 abstracts).
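
Assuming the checkpoint is published in the standard Hugging Face `transformers` format, a minimal usage sketch for extracting contextual embeddings (for example, as features for downstream NER fine-tuning) could look like the following. The repository id below is a placeholder, not taken from this card; replace it with this model's actual path on the Hub.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Placeholder repository id: substitute the Hub path of this model.
model_id = "path/to/econberta"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

text = "The cash transfer program increased school enrollment among beneficiary households."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings, usable as inputs to a token-classification head.
print(outputs.last_hidden_state.shape)  # (1, num_tokens, hidden_size)
```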

### Citation

If you find EconBERTa useful for your work, please cite the following paper:

```bibtex
@inproceedings{lasri-etal-2023-econberta,
    title = "{E}con{BERT}a: Towards Robust Extraction of Named Entities in Economics",
    author = "Lasri, Karim and
      de Castro, Pedro Vitor Quinta and
      Schirmer, Mona and
      San Martin, Luis Eduardo and
      Wang, Linxi and
      Dulka, Tom{\'a}{\v{s}} and
      Naushan, Haaya and
      Pougu{\'e}-Biyong, John and
      Legovini, Arianna and
      Fraiberger, Samuel",
    editor = "Bouamor, Houda and
      Pino, Juan and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.774",
    doi = "10.18653/v1/2023.findings-emnlp.774",
    pages = "11557--11577",
    abstract = "Adapting general-purpose language models has proven to be effective in tackling downstream tasks within specific domains. In this paper, we address the task of extracting entities from the economics literature on impact evaluation. To this end, we release EconBERTa, a large language model pretrained on scientific publications in economics, and ECON-IE, a new expert-annotated dataset of economics abstracts for Named Entity Recognition (NER). We find that EconBERTa reaches state-of-the-art performance on our downstream NER task. Additionally, we extensively analyze the model{'}s generalization capacities, finding that most errors correspond to detecting only a subspan of an entity or failure to extrapolate to longer sequences. This limitation is primarily due to an inability to detect part-of-speech sequences unseen during training, and this effect diminishes when the number of unique instances in the training set increases. Examining the generalization abilities of domain-specific language models paves the way towards improving the robustness of NER models for causal knowledge extraction.",
}
```