metadata
license: llama3.1
base_model: Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers

Changelog

Theia-Llama-3.1-8B

Theia-Llama-3.1-8B is an open-source crypto LLM, trained on a carefully designed dataset from the crypto field.

Technical Implementation

Crypto-Oriented Dataset

The training dataset is curated from two primary sources to create a comprehensive representation of blockchain projects. The first source is data collected from CoinMarketCap, focusing on the top 2000 projects ranked by market capitalization. This includes a wide range of project-specific documents such as whitepapers, official blog posts, and news articles. The second core component of the dataset comprises detailed research reports on these projects gathered from various credible sources on the internet, providing in-depth insights into project fundamentals, development progress, and market impact. After constructing the dataset, both manual and algorithmic filtering are applied to ensure data accuracy and eliminate redundancy.
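The algorithmic filtering step is not described in detail here. As a minimal sketch of what redundancy removal could look like, the following drops documents whose text matches an earlier one after normalization. The function names and the exact-match hashing strategy are assumptions for illustration, not the pipeline actually used for this dataset.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def deduplicate(documents: list[str]) -> list[str]:
    """Keep the first occurrence of each document, comparing normalized hashes."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical sample documents; the second is a near-verbatim copy of the first.
docs = [
    "Bitcoin is a peer-to-peer electronic cash system.",
    "bitcoin is a  peer-to-peer electronic cash system. ",
    "Ethereum is a platform for smart contracts.",
]
unique_docs = deduplicate(docs)
```

Exact-match hashing only catches copies that differ in case or whitespace; near-duplicate detection (for example, shingling with MinHash) would be a natural extension.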

Model Fine-tuning and Quantization

Theia-Llama-3.1-8B is fine-tuned from the base model (Llama-3.1-8B) and tailored specifically to the cryptocurrency domain. We used LoRA (Low-Rank Adaptation) for fine-tuning because it adapts large pre-trained models to specific tasks with a smaller computational footprint. Training runs on LLaMA Factory, an open-source training framework, together with DeepSpeed, Microsoft's distributed training engine, to optimize resource utilization and training efficiency. Techniques such as ZeRO (Zero Redundancy Optimizer), offload, sparse attention, 1-bit Adam, and pipeline parallelism are employed to accelerate training and reduce memory consumption. Chainbase Labs has also built a fine-tuned model using the novel D-DoRA, a decentralized training scheme. Because the LoRA version is much easier for developers to deploy and experiment with, we are releasing it first for the Crypto AI community.
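To make the "smaller computational footprint" of LoRA concrete, here is a minimal pure-Python sketch of the low-rank update, y = Wx + (alpha / r) * B(Ax), and of the trainable-parameter count it implies. The layer size (4096 x 4096), rank (8), and scaling factor are illustrative assumptions; the hyperparameters used for Theia are not stated here.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """Frozen base output plus the scaled low-rank update B(Ax)."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_trainable_params(d_out, d_in, rank):
    """Parameters in the factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


# Tiny worked example: 2 x 2 frozen weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # rank x d_in
B = [[1.0], [2.0]]        # d_out x rank
y = lora_forward(W, A, B, [1.0, 2.0], alpha=2.0, rank=1)

# Assumed layer size for illustration only: 4096 x 4096 projection, rank 8.
full = 4096 * 4096
lora = lora_trainable_params(4096, 4096, 8)
fraction = lora / full
```

With these assumed sizes the adapter trains 65,536 parameters against 16,777,216 in the full matrix, which is 1/256 of the layer.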

In addition to fine-tuning, we have quantized the model into the GGUF format for efficient deployment. Model quantization reduces the precision of the model's weights from floating-point (typically FP16 or FP32) to lower-bit representations. Its primary benefit is a significantly smaller memory footprint and faster inference while maintaining an acceptable level of accuracy. This makes the model more accessible in resource-constrained environments, such as edge devices or lower-tier GPUs.
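The precision trade-off can be illustrated with a simplified sketch of symmetric 8-bit quantization. This is an assumption-laden toy: it uses a single scale per tensor, whereas GGUF quantization types work on blocks of weights with their own scales, and the sample weights are made up.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights for illustration.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest integer code bounds the error by half a step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage per weight: 2 bytes in FP16 versus 1 byte in int8 (plus the scale).
fp16_bytes = 2 * len(weights)
int8_bytes = 1 * len(weights)
```

Each weight is stored in one byte instead of two, at the cost of a rounding error of at most half the quantization step.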

Benchmark

To evaluate current LLMs in the crypto domain, we have proposed a benchmark for Crypto AI models, the first AI model benchmark tailored specifically to the crypto domain. Models are evaluated across seven dimensions, including crypto knowledge comprehension and generation, knowledge coverage, and reasoning capabilities. A detailed paper will follow to elaborate on this benchmark. Here we release initial results on understanding and generation capabilities in the crypto domain for 11 open-source and closed-source LLMs from providers including OpenAI, Google, Meta, Qwen, and DeepSeek. For the open-source LLMs, we chose models with a parameter size similar to ours (~8B). For the closed-source LLMs, we chose the popular models with the most end users.

| Model | Perplexity ↓ | BERT ↑ |
|---|---|---|
| Theia-Llama-3.1-8B-v1 | 1.184 | 0.861 |
| ChatGPT-4o | 1.256 | 0.837 |
| ChatGPT-4o-mini | 1.257 | 0.794 |
| ChatGPT-3.5-turbo | 1.233 | 0.838 |
| Claude-3-sonnet (~70b) | N.A. | 0.848 |
| Gemini-1.5-Pro | N.A. | 0.830 |
| Gemini-1.5-Flash | N.A. | 0.828 |
| Llama-3.1-8B-Instruct | 1.270 | 0.835 |
| Mistral-7B-Instruct-v0.3 | 1.258 | 0.844 |
| Qwen2.5-7B-Instruct | 1.392 | 0.832 |
| Gemma-2-9b | 1.248 | 0.832 |
| Deepseek-llm-7b-chat | 1.348 | 0.846 |
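For readers unfamiliar with the perplexity column, the sketch below shows the standard definition: the exponential of the negative mean log-probability a model assigns to each token of a reference text. How the benchmark obtains its token probabilities is not described here, so the inputs are hypothetical values.

```python
import math


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities assigned by a model to a reference answer.
logprobs = [math.log(0.9), math.log(0.8), math.log(0.85)]
ppl = perplexity(logprobs)

# A model that assigns probability 0.5 to every token has perplexity 2.
uniform_ppl = perplexity([math.log(0.5)] * 4)
```

A perplexity close to 1 means the model assigns high probability to the reference tokens, which is why lower values are better in the table.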

System Prompt

The system prompt used for training this model is:

You are a helpful assistant who will answer crypto related questions. 

Chat Format

The model uses the standard Llama 3.1 chat format. Here’s an example:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 29 September 2024

You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
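A prompt in this layout can be assembled by hand, as in the sketch below. The helper name `build_llama31_prompt` is hypothetical, and the sketch omits the "Cutting Knowledge Date" and "Today Date" lines shown in the example; in practice a tokenizer's built-in chat template is usually the safer way to produce this string.

```python
def build_llama31_prompt(system: str, user: str) -> str:
    """Assemble a single-turn prompt that ends at the assistant header."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = build_llama31_prompt(
    system="You are a helpful assistant who will answer crypto related questions.",
    user="What is a smart contract?",  # hypothetical user question
)
print(prompt)
```

The prompt stops after the assistant header so that the model generates the assistant turn and ends it with its own `<|eot_id|>` token.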

Tips for Performance

As a starting point, we recommend the following parameters:

sequence length = 256
temperature = 0
top-k-sampling = -1
top-p = 1
context window = 39680
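As a rough sketch, these settings could be translated into keyword arguments for the `generate` method in `transformers`. Two assumptions are made here: "sequence length" is read as a cap on newly generated tokens (`max_new_tokens`), and a temperature of 0 is treated as greedy decoding (`do_sample=False`), since sampling needs a positive temperature. The helper name and dictionary keys are hypothetical.

```python
# Recommended settings from the list above, under hypothetical key names.
RECOMMENDED = {
    "sequence_length": 256,
    "temperature": 0.0,
    "top_k": -1,
    "top_p": 1.0,
    "context_window": 39680,
}


def to_generate_kwargs(settings: dict) -> dict:
    """Map the recommended settings to generate-style keyword arguments."""
    # Assumption: "sequence length" limits the number of generated tokens.
    kwargs = {"max_new_tokens": settings["sequence_length"]}
    if settings["temperature"] == 0:
        # Temperature 0 means deterministic decoding; top-k and top-p do not apply.
        kwargs["do_sample"] = False
        return kwargs
    kwargs.update(
        do_sample=True,
        temperature=settings["temperature"],
        top_p=settings["top_p"],
        # -1 (top-k disabled) is mapped to 0, assumed here to disable the filter.
        top_k=max(settings["top_k"], 0),
    )
    return kwargs


generate_kwargs = to_generate_kwargs(RECOMMENDED)
```

The context window is not a sampling parameter; it would typically be set when loading or serving the model, not passed per request.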