---
license: apache-2.0
datasets:
- TriadParty/deepsword
language:
- zh
- en
---
## **Deepsword-34B-Base**

Introducing **wrath**, the latest entry in the Seven Deadly Sins series of models.
- Continued pre-training with QLoRA on Yi-34B
- High-quality martial arts (wuxia) novels as training data
- A carefully designed data-cleaning pipeline
This model is designed to serve as the base model for agents in a script-killing (jubensha) game pipeline. For this purpose, I collected approximately 10 GB of martial arts novels from various novel websites and private-tracker (PT) sites. However, this raw corpus contained a significant amount of duplicate and low-quality content, which I addressed with the following steps:
### 1. Define Data Quality Dimensions

For martial arts novels, high-quality works are typically represented by authors such as Jin Yong, Gu Long, and Liang Yusheng. In such works, plot complexity is the critical quality dimension, and it is also the focal point for the quality of game scripts.
### 2. Quantify Data Quality Dimensions

Given the emphasis on plot complexity, we approached this in several stages:
**Chapter summarization:**

- English: utilize [Hugging Face's LED-Large-Book-Summary model](https://huggingface.co/pszemraj/led-large-book-summary).
- Chinese: use the [Randeng-Pegasus-523M-Summary-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese) model.
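A minimal sketch of this stage is shown below, assuming chapters have already been split into plain-text strings. The `summarize_chapter` helper and the generation lengths are illustrative choices, and the Randeng model's card recommends its custom Pegasus tokenizer from the Fengshenbang-LM repository, so the generic `pipeline` call may need that wrapper in practice.

```python
from transformers import pipeline

# English chapters: long-document summarizer (LED accepts very long inputs).
en_summarizer = pipeline("summarization", model="pszemraj/led-large-book-summary")

# Chinese chapters: Randeng Pegasus summarizer (its card recommends a custom
# tokenizer from the Fengshenbang-LM repo; the generic pipeline is a shortcut).
zh_summarizer = pipeline("summarization", model="IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese")

def summarize_chapter(text: str, lang: str) -> str:
    """Return a short plot summary for one chapter (lang is 'en' or 'zh')."""
    summarizer = en_summarizer if lang == "en" else zh_summarizer
    out = summarizer(text, truncation=True, max_length=256, min_length=32)
    return out[0]["summary_text"]
```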
**Vectorization and complexity analysis:**

- Convert plot summaries into vectors using a BERT-based model.
- Measure transitions between chapters through cosine similarity or Euclidean distance.
- Develop a complexity algorithm focused on standard deviation and peak analysis.
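A sketch of this stage under stated assumptions: the sentence-transformers encoder named below stands in for the unspecified BERT-based model, and the peak rule (transitions more than one standard deviation above the mean) is one simple way to realize the standard-deviation/peak idea.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed encoder choice; any multilingual BERT-style sentence encoder would do.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def plot_complexity(chapter_summaries: list[str]) -> dict:
    """Score plot complexity from transitions between consecutive chapter summaries."""
    vecs = encoder.encode(chapter_summaries, normalize_embeddings=True)
    # With unit-normalized vectors, the dot product of consecutive rows is the
    # cosine similarity; low similarity marks a sharp turn in the plot.
    sims = np.sum(vecs[:-1] * vecs[1:], axis=1)
    transitions = 1.0 - sims
    # Standard deviation captures how uneven the pacing is; "peaks" count
    # abrupt twists, i.e. transitions well above the average transition.
    std = float(np.std(transitions))
    n_peaks = int(np.sum(transitions > transitions.mean() + std))
    return {"std": std, "n_peaks": n_peaks, "transitions": transitions}
```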
**Metric quantification:**

- Apply subjective weighting to the complexity metrics derived from chapter transitions.
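The weights themselves were chosen by hand; the values below are placeholders that only illustrate how the two signals from the previous step can be folded into a single ranking score.

```python
# Hypothetical weights; the actual weighting was set subjectively.
W_STD, W_PEAKS = 0.6, 0.4

def quality_score(std: float, n_peaks: int, n_chapters: int) -> float:
    # Normalize the peak count by book length so long novels are not favored by default.
    peak_rate = n_peaks / max(n_chapters - 1, 1)
    return W_STD * std + W_PEAKS * peak_rate
```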
### 3. Outcome

By employing these methods, we can effectively filter for novels of higher quality. The refined [dataset](https://huggingface.co/datasets/TriadParty/deepsword) has been shared for further use. The final step is continued pre-training; the specific parameters can be found in my previous model descriptions.
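For reference, here is a minimal sketch of the kind of QLoRA continued pre-training setup this implies, built on Transformers, PEFT, and bitsandbytes. The base-model ID, LoRA rank, and other hyperparameters are illustrative assumptions rather than the values actually used; see my previous model descriptions for those.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization so the 34B base fits in far less GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("01-ai/Yi-34B")
model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-34B",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections; rank and alpha are illustrative.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# The adapted model is then trained with the usual causal-LM objective on the
# cleaned wuxia corpus (e.g. with transformers.Trainer or a custom loop).
```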