CryptoGPT-1.0-7B-lt - Sentiment analysis model for financial news dedicated to crypto-assets

This project introduces our lightest AI model for real-time market sentiment analysis of financial news about crypto-assets. Building on recent advances in natural language processing (NLP), the model classifies financial texts into predefined classes and evaluates their sentiment, giving insight into market trends.

1. Background and Problem Statement

The crypto-asset market is known for its volatility, which is driven by factors such as statements by influential figures, political decisions, and rumors. Traditional financial models often fail to interpret the impact of such events accurately. In response to this challenge, and inspired by the introduction of BloombergGPT, our project aims to democratize access to cutting-edge NLP for financial analysis of the crypto-asset market. Our research focused on developing a more accessible Large Language Model (LLM) that can analyze tweets, recent news, and market trends with limited resources.

2. Annotation Methodology

2.1. Annotation Objective

Our methodology is built around annotating financial texts with 21 classes that are specific to crypto-assets and have a significant impact on this market, such as “Regulation and legislation”, “Market sentiment”, and “ESG impact”. The objective is twofold: to enable precise categorization and to provide a comprehensive analysis of the sentiment reflected in the technical nuances of financial markets, particularly crypto-assets.

2.2. Annotation Method

We used one of our automatic annotation technologies in the annotation and categorization process, with the aim of greater reliability than human annotation. Our full input dataset of more than 15 million tokens allowed us to develop high-performance market sentiment analysis models dedicated to crypto-assets. The cryptoGPT-1.0-7B-lt model presented in this repository was fine-tuned on a much smaller input dataset of around 3.3 million tokens.

3. Fine-Tuning Strategy

Our fine-tuning process applies QLoRA to large language models for efficient adaptation. We optimized the training phase to run on a small infrastructure, saving significant resources and time without compromising model performance.
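As a rough illustration of why QLoRA suits a small infrastructure, the sketch below estimates the memory needed for the weights of a 7B-parameter model at 16-bit and 4-bit precision, and counts the trainable parameters that a LoRA adapter adds to one square projection matrix. The hidden size of 4096 matches Mistral-7B; the rank of 16 is an illustrative assumption and not the value used to train this model. The figures ignore activations, optimizer state, and quantization overhead.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA learns B (d_out x rank) and A (rank x d_in) instead of a full d_out x d_in update."""
    return rank * (d_in + d_out)

N_PARAMS = 7_000_000_000  # nominal "7B" parameter count
HIDDEN = 4096             # hidden size of Mistral-7B
RANK = 16                 # hypothetical LoRA rank, for illustration only

fp16_gb = weight_memory_gb(N_PARAMS, 16)
nf4_gb = weight_memory_gb(N_PARAMS, 4)
full = HIDDEN * HIDDEN
lora = lora_trainable_params(HIDDEN, HIDDEN, RANK)

print(f"16-bit weights: {fp16_gb:.1f} GB, 4-bit weights: {nf4_gb:.1f} GB")
print(f"Trainable params per {HIDDEN}x{HIDDEN} matrix: {lora:,} vs {full:,} ({lora / full:.2%})")
```

Under these assumptions the 4-bit base model needs about a quarter of the 16-bit weight memory, and the adapter trains under 1% of the parameters of each adapted matrix.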

4. Installation

To set up the environment for our model, run the following commands in a notebook cell (remove the leading "!" to run them in a regular shell):

!pip install --upgrade pip
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q huggingface_hub

Alternatively, you can create a requirements.txt file with the content below and then install all required packages from it:

bitsandbytes
git+https://github.com/huggingface/transformers.git
git+https://github.com/huggingface/peft.git
git+https://github.com/huggingface/accelerate.git
huggingface_hub
tokenizers==0.15.2

!pip install --upgrade pip
!pip install -r requirements.txt
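After installation, a quick check like the following can confirm that the packages used by the example code are importable. This is a minimal sketch using only the standard library; the REQUIRED list is an assumption derived from the install commands above plus torch, which the example code imports.

```python
from importlib.util import find_spec

# Top-level module names assumed from the install commands and the example imports
REQUIRED = ["torch", "bitsandbytes", "transformers", "peft", "accelerate", "huggingface_hub"]

def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if find_spec(name) is None]

missing = missing_packages(REQUIRED)
print("Missing packages:", ", ".join(missing) if missing else "none")
```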

5. Usage

The model analyzes financial texts, ideally about the crypto-asset market, and returns a sentiment analysis for each relevant category together with an overall assessment of the market impact.

6. Python Example Code

Here is a simple example of Python code to illustrate a basic use of the model for sentiment analysis:

import torch
from transformers import GenerationConfig, AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from huggingface_hub import login
import gc

gc.collect()
torch.cuda.empty_cache()

HF_TOKEN = "HF_TOKEN"  # Placeholder: replace with your own Hugging Face access token
MODEL_NAME = "mpetitguillaume/cryptoGPT-1.0-7B-lt"

def setup_device():
    """Returns the primary device for model computations (GPU if available, otherwise CPU)."""
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")

def login_to_hf_hub(token):
    """Authenticates with the Hugging Face Hub using a provided token."""
    login(token=token)

login_to_hf_hub(HF_TOKEN)

def load_model_and_tokenizer(model_name, bnb_config):
    """
    Loads the specified model and tokenizer from Hugging Face, applying quantization 
    configurations if provided. Also sets the tokenizer's pad token to its eos token.
    
    Args:
        model_name (str): Name of the model to load.
        bnb_config (BitsAndBytesConfig): Configuration for model quantization.

    Returns:
        tuple: The loaded model and tokenizer.
    """
    model = AutoModelForCausalLM.from_pretrained(
        model_name, 
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token  # Harmonize pad and eos tokens
    return model, tokenizer

def create_bnb_config():
    """Creates a BitsAndBytes configuration optimized for model performance."""
    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

def create_prompt_formats(news):
    """
    Builds the instruction-style prompt used to query the model.

    Args:
        news (str): The news text to analyze; if empty, the input section is omitted.

    Returns:
        str: The formatted prompt, ending with the response key.
    """
    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "### Input:"
    RESPONSE_KEY = "### Response:"
    prompt = """You need to relate news to some of the following 21 categories, provide a brief explanation, conduct sentiment analysis within each category, and offer an overall sentiment analysis, especially focusing on the financial markets, including cryptocurrencies. Determine whether it has a strongly positive, moderately positive, strongly negative, moderately negative, or negligible impact on market trends.
              Categories:
              1. Regulation and Legislation (keywords: government, SEC, regulator, law, regulation, legislation.)
              2. Adoption and Usage (keywords: adoption, usage, business, institution, partnership.)
              3. Geopolitical Events (keywords: geopolitics, conflict, election, economic policy.)
              4. Technology and Infrastructure (keywords: technology, infrastructure, updates, protocols, security.)
              5. Financial Market Performances (keywords: stock market, bond market, currencies, indices.)
              6. Market Sentiment (keywords: sentiment, confidence, opinion, investors.)
              7. Competition Between Cryptocurrencies (keywords: competition, fork, updates, new projects, cryptocurrencies.)
              8. Partnerships and Collaborations (keywords: partnership, collaboration, business, institution.)
              9. Initial Coin Offerings (keywords: ICO, token sales, fundraising, crowdfunding.)
              10. Media Coverage (keywords: media, media coverage, reporting, articles, news.)
              11. Exchange Listings (keywords: exchange platforms, listing, liquidity.)
              12. Exchange Delistings (keywords: exchange platforms, delisting, liquidity.)
              13. Exchange Volume and Liquidity (keywords: volume, liquidity, exchange, trading.)
              14. Market Manipulation and Fraud (keywords: manipulation, fraud, deception, investigation.)
              15. Influential Players' Interventions (keywords: influence, statements, personalities, entrepreneurs, analysts.)
              16. Expert Analysis and Forecasts (keywords: analysis, forecasts, experts, projections, predictions.)
              17. Integration with Financial Services (keywords: integration, financial services, banking, payments.)
              18. Macroeconomic Indicators (keywords: macroeconomics, inflation, interest rates, economic growth.)
              19. Cryptocurrency Events and Conferences (keywords: events, conferences, summits, forums, exhibitions related to cryptocurrencies.)
              20. Rumors and Speculations (keywords: rumors, speculations, buzz, leaks, unconfirmed information.)
              21. Impact ESG (ESG Impact) (keywords: environment, social, governance, sustainability, responsibility, ethics, impact, carbon footprint, energy consumption, mining, electronic waste, working conditions, transparency, corporate governance, diversity, inclusion, human rights, climate change.)
              If you don't know the category, response "OTHERS"."""

    blurb = f"{INTRO_BLURB}"
    instruction = f"{INSTRUCTION_KEY}\n{prompt}"
    input_context = f"{INPUT_KEY}\n{news}" if news else None
    response = f"{RESPONSE_KEY}\n"

    parts = [part for part in [blurb, instruction, input_context, response] if part]
    formatted_prompt = "\n\n".join(parts)

    return formatted_prompt

def generate_response(model, tokenizer, news):
    """
    Generates a text response for a given input news snippet using the model and tokenizer.
    
    Args:
        model (AutoModelForCausalLM): The model for generating responses.
        tokenizer (AutoTokenizer): The tokenizer for processing input and output texts.
        news (str): The news snippet to respond to.

    Returns:
        str: The generated text response.
    """
    generation_config = GenerationConfig(
        max_new_tokens=300,
        do_sample=True,
        top_p=0.1,
        temperature=0.01,
        pad_token_id=tokenizer.eos_token_id,
    )
    input_tensor = tokenizer(create_prompt_formats(news), return_tensors="pt", truncation=True)
    device = setup_device()
    with torch.inference_mode():
        outputs = model.generate(
            input_ids=input_tensor["input_ids"].to(device),
            attention_mask=input_tensor["attention_mask"].to(device),  # Keep the mask on the same device as the inputs
            generation_config=generation_config,
        )
        result = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    # Keep only the text generated after the response key, up to the next "###" section
    return result[0].split('### Response:')[1].split('###')[0].strip()

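bnb_config = create_bnb_config()
model, tokenizer = load_model_and_tokenizer(MODEL_NAME, bnb_config)

# Placeholder input: replace with the news text you want to analyze
news = "<paste a news article here>"
print(generate_response(model, tokenizer, news))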

Here is an example output for this code:

0. Financial Market Performances (mention of Jinzhou Bank's financial situation)
Neutral - The article suggests that Jinzhou Bank's financial situation was in good shape, with low bad-debt level and only a small percentage of personal-business loans having gone sour.

1. Market Manipulation and Fraud (mention of the potential involvement of a billionaire, Li Hejun, in the bank's problems)
Negative - The article suggests that the billionaire, Li Hejun, may be behind the bank's distress, indicating potential market manipulation or fraud.

2. Geopolitical Events (mention of China's banks and its economic policies)
Neutral - The article discusses China's banks and their financial situation, but it doesn't provide any clear geopolitical analysis or opinion.

Sentiment Analysis regarding the Cryptocurrency Market:
Somewhat Negative Impact on the Market: The article suggests that Jinzhou Bank's financial situation may be in trouble, and the billionaire Li Hejun might be involved. This could lead to rumors and speculations about the stability of China's banks and the economy. However, the article doesn't provide any clear sentiment analysis or opinion on the matter...
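To use such output programmatically, the per-category sentiments can be extracted with a small parser. The sketch below assumes the layout shown in the example above (a numbered category line followed by a "Sentiment - explanation" line); the function name and regular expressions are illustrative, the sample text is an abridged copy of the example, and real model output may deviate from this layout.

```python
import re

# Numbered category line, optionally followed by a parenthesized remark
CATEGORY_LINE = re.compile(r"^\d+\.\s+(?P<category>[^(]+?)\s*(?:\(.*\))?$")
# Sentiment label, a spaced hyphen, then the explanation
SENTIMENT_LINE = re.compile(r"^(?P<sentiment>[A-Za-z ]+?)\s+-\s+(?P<reason>.+)$")

def parse_category_sentiments(text):
    """Return (category, sentiment) pairs from numbered category blocks."""
    pairs = []
    pending = None  # category still waiting for its sentiment line
    for line in text.splitlines():
        line = line.strip()
        cat = CATEGORY_LINE.match(line)
        if cat:
            pending = cat.group("category")
            continue
        sent = SENTIMENT_LINE.match(line)
        if pending and sent:
            pairs.append((pending, sent.group("sentiment")))
            pending = None
    return pairs

# Abridged from the example output above
sample = """0. Financial Market Performances (mention of Jinzhou Bank's financial situation)
Neutral - The article suggests that Jinzhou Bank's financial situation was in good shape.

1. Market Manipulation and Fraud (mention of the potential involvement of a billionaire)
Negative - The article suggests that the billionaire may be behind the bank's distress."""

print(parse_category_sentiments(sample))
```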

7. Evaluation Results

The model was evaluated manually by experts. For this very lightweight model, we selected 50 financial articles dedicated to crypto-assets that cover a variety of categories, are rich in content, and represent several market trends. Six models were tested: fine-tuned LLaMa-2-7B, fine-tuned Mistral-7B, LLaMa-2-7B, LLaMa-2-13B, Mistral-7B, and GPT-3.5 Turbo. Each model was rated on a scale of 0 to 4, where 4 indicates optimal performance and 0 means unusable results. According to these results, our model, which is based on a fine-tuned Mistral-7B, outperformed GPT-3.5 (average score of 3.12 versus 2.9).

In our tests, our larger models performed well above GPT-4 and the largest known large language models.

Models          GPT-3.5   CryptoGPT-1.0-7B-lt   Mistral-7B   LLaMa-2-7B   LLaMa-2-13B
Average score   2.9       3.12                  0.48         0.38         0.68
Score 4         14        15                    0            0            0
Score 4 & 3     35        41                    0            0            0
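The counts in the table can be turned into shares of the evaluation set. The short sketch below assumes that the "Score 4" and "Score 4 & 3" rows are numbers of articles out of the 50 evaluated, which the table does not state explicitly; the dictionary keys are illustrative names.

```python
N_ARTICLES = 50  # size of the evaluation set stated above

# Counts copied from the table: articles scored 4, and articles scored 3 or 4
counts = {
    "GPT-3.5": {"score_4": 14, "score_3_or_4": 35},
    "CryptoGPT-1.0-7B-lt": {"score_4": 15, "score_3_or_4": 41},
}

def share(count, total=N_ARTICLES):
    """Fraction of the evaluation set represented by a count."""
    return count / total

for model_name, c in counts.items():
    print(f"{model_name}: {share(c['score_4']):.0%} scored 4, "
          f"{share(c['score_3_or_4']):.0%} scored 3 or 4")
```

Under that assumption, CryptoGPT-1.0-7B-lt reaches a score of 3 or 4 on 82% of the articles, compared with 70% for GPT-3.5.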

8. Reporting Issues

Please report any software bug or other problem with the models using the contact below:

Reporting issues with the model: [email protected]

9. License

This project is licensed under the MIT License - see the LICENSE file for details.

10. Contact

For any questions or to contribute to the project, please contact us at [email protected].
