---
license: apache-2.0
datasets:
- briefai/LongShort-Dataset
language:
- en
pipeline_tag: text-generation
tags:
- pytorch
- dolly
- Gen-AI
- Finance
- KPI Extraction
---
|
# LongShort-Dolly-2-7B |
|
|
|
### Model Description |
|
|
|
LongShort-Dolly-2-7B is a large language model fine-tuned on earnings call documents to extract financial KPIs from them. It is based on the Dolly-2-7B architecture.
|
- Model creator: [Brief AI](https://huggingface.co/briefai) |
|
- Original model: [Dolly-2-7B](https://huggingface.co/databricks/dolly-v2-7b) |
|
|
|
### Dataset Description |
|
- Data Source: Factiva |
|
- Data Description: 28K+ Earnings Call Documents |
|
- Data Scope: 1K+ public companies |
|
- Fine Tuning Data: Collection of 60K+ samples. |
|
|
|
## Prompt template: LongShort-Dolly-2-7B |
|
|
|
```
[INST]Given the context, answer the question.

### Question:
Extract all the finance-based performance indicators and evaluation metrics.

### Context:
{context}

### Answer:
[/INST]
```
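
The template can also be filled programmatically. The following is a minimal sketch, not the authors' code: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names, the blank-line layout is an assumption about what the model saw during fine-tuning, and the sample sentence is made up for illustration.

```python
# Hypothetical helper for filling the LongShort prompt template.
# The blank-line layout below is an assumption; adjust it if the
# fine-tuning prompts used different whitespace.
PROMPT_TEMPLATE = (
    "[INST]Given the context, answer the question.\n"
    "\n"
    "### Question:\n"
    "Extract all the finance-based performance indicators and evaluation metrics.\n"
    "\n"
    "### Context:\n"
    "{context}\n"
    "\n"
    "### Answer:\n"
    "[/INST]"
)


def build_prompt(context: str) -> str:
    """Insert an earnings call excerpt into the prompt template."""
    # str.replace is used instead of str.format so that braces inside
    # the context text cannot raise a formatting error.
    return PROMPT_TEMPLATE.replace("{context}", context.strip())


# Made-up example sentence, for illustration only.
print(build_prompt("Revenue grew 12% year over year."))
```

Using `str.replace` rather than `str.format` is a deliberate choice here, since transcripts may contain literal braces.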
|
|
|
## Basics |
|
*This section provides information about the model type, version, license, funders, release date, developers, and contact information.* |
|
*It is useful for anyone who wants to reference the model.* |
|
|
|
|
|
**Developed by:** [Brief AI Team](https://huggingface.co/briefai) |
|
|
|
**Model Type:** Transformer-based Large Language Model |
|
|
|
**Version:** 1.0.0 |
|
|
|
**Languages:** English |
|
|
|
**License:** Apache 2.0 |
|
|
|
**Release Date Estimate:** Wednesday, 29 November 2023
|
|
|
**Send Questions to:** [email protected] |
|
|
|
**Cite as:** Brief AI LongShort Language Model |
|
|
|
**Funded by:** UChicago Data Science Institute |
|
|
|
**Mentored by:** Nick Kadochnikov |
|
|
|
## Technical Specifications |
|
*This section includes details about the model objective and architecture, and the compute infrastructure.* |
|
*It is useful for people interested in model development.* |
|
|
|
Please see [the LongShort training README](https://github.com/brief-ai-uchicago/LongShort-Dataset) for full details on replicating training. |
|
|
|
### Model Architecture and Objective |
|
|
|
* Modified from Dolly-2-7B |
|
|
|
**Objective:** Financial KPI extraction from earnings call documents. |
|
|
|
### Hardware and Software - Compute Infrastructure |
|
|
|
* 4 NVIDIA L4 GPUs & 48 vCPUs |
|
|
|
* Environment: PyTorch (pytorch-2.0 w/ CUDA-11.8; see [GitHub link](https://github.com/pytorch/pytorch))
|
|
|
* CPU: GCP G2 Standard 48 (Platform: Intel Cascade Lake) (Accelerator Optimized) |
|
|
|
* CPU memory: 192GB RAM |
|
|
|
* GPU memory: 30GB per GPU |
|
|
|
## Training |
|
*This section provides information about the training.* |
|
*It is useful for people who want to learn more about the model inputs and training footprint.* |
|
|
|
The following `bitsandbytes` quantization config was used during training:
|
|
|
* quant_method: bitsandbytes |
|
* load_in_8bit: False |
|
* load_in_4bit: True |
|
* llm_int8_threshold: 6.0 |
|
* llm_int8_skip_modules: None |
|
* llm_int8_enable_fp32_cpu_offload: False |
|
* llm_int8_has_fp16_weight: False |
|
* bnb_4bit_quant_type: nf4 |
|
* bnb_4bit_use_double_quant: True |
|
* bnb_4bit_compute_dtype: float16 |
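
The settings listed above correspond to arguments of `BitsAndBytesConfig` in `transformers`. The sketch below shows one way they might be expressed in code; it is an illustration rather than the training script itself. `QUANT_KWARGS` and `make_bnb_config` are hypothetical names, and the helper assumes `torch`, `transformers`, and `bitsandbytes` are installed.

```python
# Values copied from the quantization list above. "quant_method" is left
# out because BitsAndBytesConfig sets it internally.
QUANT_KWARGS = {
    "load_in_8bit": False,
    "load_in_4bit": True,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": None,
    "llm_int8_enable_fp32_cpu_offload": False,
    "llm_int8_has_fp16_weight": False,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "float16",
}


def make_bnb_config():
    """Build a BitsAndBytesConfig from QUANT_KWARGS (hypothetical helper)."""
    # Imported lazily so the settings can be inspected without the GPU stack.
    import torch
    from transformers import BitsAndBytesConfig

    kwargs = dict(QUANT_KWARGS)
    # Turn the dtype name into the corresponding torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return BitsAndBytesConfig(**kwargs)
```

The resulting object can be passed as `quantization_config=make_bnb_config()` to `AutoModelForCausalLM.from_pretrained`.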
|
|
|
Framework versions |
|
* PEFT 0.4.0 |
|
|
|
|
|
### Training Data |
|
*This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.* |
|
|
|
Details for the dataset can be found in [LongShort Dataset](https://github.com/brief-ai-uchicago/LongShort-Dataset) |
|
|
|
Training data includes: |
|
|
|
- 5000 Earnings Call Documents |
|
|
|
## How to use |
|
|
|
The model can be used and deployed with the Hugging Face ecosystem. It requires `transformers` and `accelerate` to be installed. The model can be downloaded from:
|
|
|
[LongShort-Dolly-2-7B](https://huggingface.co/briefai/LongShort-Dolly-2-7B) |
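
A minimal loading and generation sketch follows. It assumes the repository holds weights that `AutoModelForCausalLM` can load directly; if it only contains a PEFT adapter, the `peft` library would be needed instead. The function names, the `float16` dtype, and `max_new_tokens=256` are illustrative choices, not settings documented by the authors.

```python
MODEL_ID = "briefai/LongShort-Dolly-2-7B"


def load_model(model_id: str = MODEL_ID):
    """Load tokenizer and model; needs transformers, accelerate and torch."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device_map="auto" relies on accelerate to place the weights.
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    return tokenizer, model


def generate_answer(tokenizer, model, prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a completion and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only the model's answer is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


# Example (downloads several GB of weights, so it is left commented out):
# tokenizer, model = load_model()
# prompt = "..."  # fill the prompt template from this card with a transcript excerpt
# print(generate_answer(tokenizer, model, prompt))
```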
|
|
|
## Intended Use |
|
|
|
This model was created to enable public research on large language models (LLMs). LLMs are intended to be used for language generation or as a pre-trained base model that can be further fine-tuned for specific tasks. The use cases below are not exhaustive.
|
|
|
### Direct Use |
|
|
|
- Text generation |
|
|
|
- Exploring characteristics of language generated by a language model |
|
|
|
- Examples: Cloze tests, counterfactuals, generations with reframings |
|
|
|
### Downstream Use |
|
|
|
- Tasks that leverage language models include: Information Extraction, Question Answering, Summarization |
|
|
|
|
|
#### Out-of-scope Uses |
|
|
|
Using the model in high-stakes settings is out of scope. The model is not designed for critical decisions or for uses with any material consequences on an individual's livelihood or wellbeing. The model outputs content that appears factual but may not be correct.
|
|
|
Out-of-scope Uses Include: |
|
|
|
- Usage for evaluating or scoring individuals, such as for employment, education, or credit |
|
|
|
- Applying the model for critical automatic decisions, generating factual content, creating reliable summaries, or generating predictions that must be correct |
|
|
|
#### Misuse |
|
|
|
Intentionally using the model for harm, violating human rights, or other kinds of malicious activities is a misuse of this model. This includes:
|
|
|
- Spam generation |
|
|
|
- Disinformation and influence operations |
|
|
|
- Disparagement and defamation |
|
|
|
- Harassment and abuse |
|
|
|
- Deception
|
|
|
- Unconsented impersonation and imitation |
|
|
|
- Unconsented surveillance |
|
|
|
- Generating content without attribution to the model, as specified in the [RAIL License, Use Restrictions](https://huggingface.co/spaces/bigscience/license) |
|
|
|
## Intended Users |
|
|
|
### Direct Users |
|
|
|
- General Public |
|
|
|
- Researchers |
|
|
|
- Students |
|
|
|
- Educators |
|
|
|
- Engineers/developers |
|
|
|
- Non-commercial entities |
|
|
|
- Financial Industry |
|
|
|
## Risks and Limitations
|
*This section identifies foreseeable harms and misunderstandings.* |
|
|
|
Model may: |
|
|
|
- Overrepresent some viewpoints and underrepresent others |
|
|
|
- Contain stereotypes |
|
|
|
- Contain personal information
|
|
|
- Generate: |
|
|
|
- Hateful, abusive, or violent language |
|
|
|
- Discriminatory or prejudicial language |
|
|
|
- Content that may not be appropriate for all settings, including sexual content |
|
|
|
- Make errors, including producing incorrect information as if it were factual |
|
|
|
- Generate irrelevant or repetitive outputs |
|
|
|
- Induce users into attributing human traits to it, such as sentience or consciousness |
|
|
|
|
|
## Evaluation
|
*This section describes the evaluation protocols and provides the results.* |
|
|
|
Result: LongShort-Dolly-2-7B gives 45.4% accuracy on a validation set held out as 10% of the original training dataset.
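
The card does not describe how the validation split was drawn or how accuracy was scored. Purely as an illustration, a reproducible 90/10 hold-out split could be sketched as follows; `holdout_split`, the seed, and the toy data are assumptions, not the authors' procedure.

```python
import random


def holdout_split(samples, val_fraction=0.10, seed=0):
    """Shuffle deterministically and hold out a fraction for validation."""
    indices = list(range(len(samples)))
    # A seeded Random instance keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    n_val = round(len(samples) * val_fraction)
    val = [samples[i] for i in indices[:n_val]]
    train = [samples[i] for i in indices[n_val:]]
    return train, val


# Toy data standing in for the fine-tuning samples.
train, val = holdout_split(list(range(1000)))
print(len(train), len(val))  # 900 100
```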
|
|
|
|
|
|
|
**Train-time Evaluation:** |
|
|
|
Final checkpoint after 700 epochs: |
|
|
|
- Training Loss: 1.645 |
|
|
|
|
|
## Recommendations
|
*This section provides information on warnings and potential mitigations.* |
|
|
|
- Indirect users should be made aware when the content they're working with is created by the LLM. |
|
|
|
- Users should be aware of [Risks and Limitations](#risks-and-limitations), and include an appropriate age disclaimer or blocking interface as necessary. |
|
|
|
- Users of the model should provide mechanisms for those affected to provide feedback, such as an email address for comments. |
|
|
|
## Model Card Authors
|
Vishal Parameshwaran, Garima Sohi, Jose Gerala, Sanchit Narayan Kumar |
|
|
|
|
|
|