---
license: apache-2.0
datasets:
- financial_phrasebank
- pauri32/fiqa-2018
- zeroshot/twitter-financial-news-sentiment
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- finance
---
|
|
|
|
We collected financial domain terms from Investopedia's financial terms dictionary, NYSSCPA's accounting terminology guide, and Harvey's Hypertextual Finance Glossary to expand RoBERTa's vocabulary.
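As a rough illustration of what expanding a vocabulary with collected terms involves, here is a minimal pure-Python sketch. The toy vocabulary, the term list, and the `expand_vocab` helper are all hypothetical and are not taken from this model's actual training code. With Hugging Face Transformers, the analogous steps are usually `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, though this card does not state which mechanism was used.

```python
from typing import Dict, Iterable, List


def expand_vocab(vocab: Dict[str, int], terms: Iterable[str]) -> List[str]:
    """Add unseen terms to `vocab` in place and return the terms that were added.

    Each new term receives the next free integer id, so existing ids never change.
    """
    added: List[str] = []
    next_id = max(vocab.values(), default=-1) + 1
    for term in terms:
        term = term.strip()
        if not term or term in vocab:
            continue  # skip empty strings, known tokens, and repeats
        vocab[term] = next_id
        added.append(term)
        next_id += 1
    return added


# Toy example: a tiny base vocabulary and a few glossary-style terms (made up).
base_vocab = {"<s>": 0, "</s>": 1, "bank": 2}
new_terms = ["EBITDA", "bank", "amortization", "EBITDA"]
added_terms = expand_vocab(base_vocab, new_terms)
```

Because new entries are appended after the existing ids, the original embeddings keep their positions and only the rows for the added terms need to be learned during further pretraining.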
|
|
Starting from RoBERTa with the added financial terms, we pretrained our model on multiple financial corpora:
|
|
- Financial Terms
  - [Investopedia's financial terms dictionary](https://www.investopedia.com/financial-term-dictionary-4769738)
  - [NYSSCPA's accounting terminology guide](https://www.nysscpa.org/professional-resources/accounting-terminology-guide)
  - [Harvey's Hypertextual Finance Glossary](https://people.duke.edu/~charvey/Classes/wpg/glossary.htm)
- Financial Datasets
  - [FPB](https://huggingface.co/datasets/financial_phrasebank)
  - [FiQA SA](https://huggingface.co/datasets/pauri32/fiqa-2018)
  - [SemEval2017 Task5](https://aclanthology.org/S17-2089/)
  - [Twitter Financial News Sentiment](https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment)
- Earnings Call
  - 2016-2023 earnings call transcripts of NASDAQ 100 component stocks
|
|
|
|
In the continual pretraining step, we applied the following experimental settings to achieve better fine-tuned results on the four financial datasets:
|
|
1. Masking Probability: 0.4 (instead of the default 0.15)
2. Warmup Steps: 0 (gave better results than models trained with warmup steps)
3. Epochs: 1 (one epoch is enough and avoids overfitting)
4. Weight Decay: 0.01
5. Train Batch Size: 64
6. FP16
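To make the masking setting concrete, the sketch below collects the settings listed above in a plain dictionary and shows what a masking probability of 0.4 means for a token sequence. The key names, the token ids, `MASK_ID`, and both helper functions are illustrative assumptions rather than this model's training code. Common masked language modeling collators also swap some selected tokens for random tokens or leave them unchanged; this simplified version always substitutes the mask id.

```python
import random
from typing import List, Sequence, Set

# Values copied from the settings list; the key names are illustrative.
PRETRAIN_SETTINGS = {
    "mlm_probability": 0.4,
    "warmup_steps": 0,
    "num_train_epochs": 1,
    "weight_decay": 0.01,
    "train_batch_size": 64,
    "fp16": True,
}

MASK_ID = 50264  # hypothetical id standing in for the <mask> token


def choose_mask_positions(
    token_ids: Sequence[int],
    special_ids: Set[int],
    mlm_probability: float,
    rng: random.Random,
) -> List[int]:
    """Pick each non-special position independently with probability `mlm_probability`."""
    return [
        i
        for i, tok in enumerate(token_ids)
        if tok not in special_ids and rng.random() < mlm_probability
    ]


def apply_mask(token_ids: Sequence[int], positions: Sequence[int], mask_id: int) -> List[int]:
    """Return a copy of `token_ids` with the chosen positions replaced by `mask_id`."""
    masked = list(token_ids)
    for i in positions:
        masked[i] = mask_id
    return masked


# Toy sequence: start marker (0), 10,000 ordinary tokens, end marker (2).
rng = random.Random(0)
tokens = [0] + list(range(10, 10010)) + [2]
positions = choose_mask_positions(
    tokens, {0, 2}, PRETRAIN_SETTINGS["mlm_probability"], rng
)
masked = apply_mask(tokens, positions, MASK_ID)
masked_fraction = len(positions) / 10000
```

On a sequence this long, `masked_fraction` lands close to 0.4, so the model sees far fewer intact context tokens per example than it would at the default 0.15.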
|
|