---
license: llama3.1
base_model: Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
---
# Changelog
- [2024.10.30] Released [Theia-Llama-3.1-8B-v1.1](https://huggingface.co/Chainbase-Labs/Theia-Llama-3.1-8B-v1.1), supervised fine-tuned with abundant crypto fundamental knowledge and popular projects.
- [2024.10.10] Released [Theia-Llama-3.1-8B-v1](https://huggingface.co/Chainbase-Labs/Theia-Llama-3.1-8B-v1)
# Theia-Llama-3.1-8B
**Theia-Llama-3.1-8B is an open-source crypto LLM, trained on a carefully designed dataset from the crypto field.**
## Technical Implementation
### Crypto-Oriented Dataset
The training dataset is curated from two primary sources to create a comprehensive representation of blockchain
projects. The first source is data collected from **CoinMarketCap**, focusing on the top **2000 projects** ranked by
market capitalization. This includes a wide range of project-specific documents such as whitepapers, official blog
posts, and news articles. The second core component of the dataset comprises detailed research reports on these projects
gathered from various credible sources on the internet, providing in-depth insights into project fundamentals,
development progress, and market impact. After constructing the dataset, both manual and algorithmic filtering are
applied to ensure data accuracy and eliminate redundancy.
### Model Fine-tuning and Quantization
Theia-Llama-3.1-8B is fine-tuned from the base model (Llama-3.1-8B) and tailored to the cryptocurrency domain. We use
LoRA (Low-Rank Adaptation) for fine-tuning, since it adapts large pre-trained models to specific tasks with a smaller
computational footprint. Training runs on LLaMA Factory, an open-source training framework, with **DeepSpeed**,
Microsoft's distributed training engine, integrated to optimize resource utilization and training efficiency.
Techniques such as ZeRO (Zero Redundancy Optimizer), offload, sparse attention, 1-bit Adam, and pipeline parallelism
accelerate training and reduce memory consumption. A fine-tuned model is also built using the
novel [D-DoRA](https://docs.chainbase.com/theia/Developers/Glossary/D2ORA), a decentralized training scheme developed
by Chainbase Labs. Since the LoRA version is much easier for developers to deploy and experiment with, we release the
LoRA version first for the Crypto AI community.
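To make the "smaller computational footprint" of LoRA concrete, here is a minimal sketch in plain Python. It illustrates the general LoRA idea only: the matrix size, rank, and `alpha` below are hypothetical values chosen for the example, not the hyperparameters used to train Theia, and the helper names are made up for this sketch.

```python
# Minimal LoRA sketch in plain Python (no ML framework needed).
# All sizes and hyperparameters here are hypothetical, not Theia's settings.

def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, a, b, alpha, rank):
    """Compute W' = W + (alpha / rank) * B @ A, with W frozen and A, B trained."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Example: one hypothetical 4096 x 4096 projection with rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{lora / full:.2%}")  # 16777216 65536 0.39%
```

Only the two small matrices are trained, which is why the trainable parameter count for this example matrix drops to well under 1% of the full matrix.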
In addition to fine-tuning, we have quantized the model into the GGUF format for efficient deployment. Model
quantization reduces the precision of the model's weights from floating-point (typically FP16 or FP32) to lower-bit
representations.
The primary benefit of quantization is that it significantly reduces the model's memory footprint and
improves inference speed while maintaining an acceptable level of accuracy. This makes the model more accessible for use
in resource-constrained environments, such as on edge devices or lower-tier GPUs.
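The trade-off between precision and size can be sketched with a simple symmetric 8-bit scheme. This is an illustration of the general idea only: GGUF quantization types use their own block-wise layouts, and the function names and sample weights below are hypothetical.

```python
# Illustrative symmetric ("absmax") int8 quantization in plain Python.
# This is NOT the GGUF algorithm, only a sketch of the underlying idea.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # hypothetical weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one small integer plus a shared scale, so storage shrinks while the rounding error per weight stays within about half a quantization step.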
## Benchmark
To evaluate current LLMs in the crypto domain, we propose a benchmark for Crypto AI models, which is the first AI model
benchmark tailored specifically to the crypto domain. Models are evaluated across seven dimensions, including crypto
knowledge comprehension and generation, knowledge coverage, and reasoning capabilities. A detailed paper will follow to
elaborate on this benchmark. Here we release initial results for understanding and generation capabilities in the
crypto domain on 11 open-source and closed-source LLMs from OpenAI, Google, Meta, Qwen, and DeepSeek. For the
open-source LLMs, we chose models with a parameter size similar to ours (~8B). For the closed-source LLMs, we chose the
popular models with the most end users.
| Model | Perplexity ↓ | BERT ↑ |
|---------------------------|--------------|-----------|
| **Theia-Llama-3.1-8B-v1** | **1.184** | **0.861** |
| ChatGPT-4o | 1.256 | 0.837 |
| ChatGPT-4o-mini | 1.257 | 0.794 |
| ChatGPT-3.5-turbo | 1.233 | 0.838 |
| Claude-3-sonnet (~70b) | N.A. | 0.848 |
| Gemini-1.5-Pro | N.A. | 0.830 |
| Gemini-1.5-Flash | N.A. | 0.828 |
| Llama-3.1-8B-Instruct | 1.270 | 0.835 |
| Mistral-7B-Instruct-v0.3 | 1.258 | 0.844 |
| Qwen2.5-7B-Instruct | 1.392 | 0.832 |
| Gemma-2-9b | 1.248 | 0.832 |
| Deepseek-llm-7b-chat | 1.348 | 0.846 |
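For readers unfamiliar with the perplexity column, the sketch below shows the standard definition: the exponential of the mean negative log-likelihood per token. The benchmark's exact evaluation procedure is not described here, so this is an assumption about the metric in general, and the log-probabilities are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood) over tokens, using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-token log-probabilities assigned by a model to a reference text.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.85)]
print(round(perplexity(log_probs), 3))
```

A model that assigns probability 1 to every reference token reaches the minimum perplexity of 1, which is why lower values in the table are better.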
## System Prompt
The system prompt used for training this model is:
```
You are a helpful assistant who will answer crypto related questions.
```
## Chat Format
The model uses the standard Llama 3.1 chat format. Here’s an example:
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Cutting Knowledge Date: December 2023
Today Date: 29 September 2024
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```
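As a rough sketch, a prompt in this format can be assembled by hand as shown below. The helper name `build_llama31_prompt` and the sample question are hypothetical, the blank line after each header follows the published Llama 3.1 template, and the optional date lines from the example above are omitted. In practice, a tokenizer's built-in chat template would normally produce this string.

```python
# Hand-rolled Llama 3.1 style prompt builder (illustrative helper, not an official API).

SYSTEM_PROMPT = "You are a helpful assistant who will answer crypto related questions."


def build_llama31_prompt(system: str, user: str) -> str:
    """Return a single-turn prompt that ends where the assistant reply begins."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


# Hypothetical user question, paired with the system prompt used in training.
prompt = build_llama31_prompt(SYSTEM_PROMPT, "What is a layer-2 rollup?")
print(prompt)
```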
## Tips for Performance
As a starting point, we recommend the following parameters:
```
sequence length = 256
temperature = 0
top-k-sampling = -1
top-p = 1
context window = 39680
```
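How these settings map onto a given inference stack depends on the library. The sketch below translates them into keyword arguments in the style of the `generate()` method in `transformers`. It rests on two assumptions: that "sequence length" means the maximum number of newly generated tokens, and that a temperature of 0 is meant as greedy decoding. The helper name is hypothetical, and the context window is left out because it is typically a serving-side setting.

```python
# Hypothetical translation of the recommended settings into generate()-style kwargs.

RECOMMENDED = {
    "sequence_length": 256,   # assumed to mean max newly generated tokens
    "temperature": 0,
    "top_k": -1,              # -1 read as "top-k disabled"
    "top_p": 1,
    "context_window": 39680,  # not mapped below; usually set on the serving side
}


def to_generate_kwargs(params: dict) -> dict:
    """Build keyword arguments for a transformers-style generate() call."""
    kwargs = {"max_new_tokens": params["sequence_length"]}
    if params["temperature"] == 0:
        # Temperature 0 always picks the most likely token, i.e. greedy decoding,
        # so sampling is switched off and the sampling knobs are not passed.
        kwargs["do_sample"] = False
        return kwargs
    kwargs["do_sample"] = True
    kwargs["temperature"] = params["temperature"]
    kwargs["top_p"] = params["top_p"]
    # A non-positive top_k is mapped to 0, which switches top-k filtering off.
    kwargs["top_k"] = params["top_k"] if params["top_k"] > 0 else 0
    return kwargs


print(to_generate_kwargs(RECOMMENDED))
```

With the recommended values, decoding is deterministic, so the top-k and top-p settings have no effect on the output.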