---
license: apache-2.0
datasets:
- TriadParty/deepsword
language:
- zh
- en
---

## **Deepsword-34B-Base**

![f1d09b62cfa0687cf9070ee2a59a2a4.png](https://cdn-uploads.huggingface.co/production/uploads/630c1adea20a5367812196f6/0VTlW9BM-F_cbIF_ww4EP.png)

Introducing **Wrath**, part of the Seven Deadly Sins series of models.

- Continued QLoRA pre-training on Yi-34B

- High-quality martial arts novels

- Carefully designed data-cleaning process

This model is designed to serve as the base model for the agent used in Live Action Role Playing games. For this purpose, I've collected approximately 10 GB of martial arts novels, sourced from various novel websites and PT sites. However, this dataset includes a significant amount of duplicate and low-quality content. To address these issues, I've undertaken the following steps:

### 1. Define Data Quality Dimensions

For martial arts novels, high-quality works are typically represented by authors such as Jin Yong, Gu Long, and Liang Yusheng. In these works, plot complexity is an essential ingredient, and it is also the key factor for the quality of game scripts.

### 2. Quantify Data Quality Dimensions

Given the emphasis on plot complexity, we approached this in several stages:

**Chapter Summarization** (a usage sketch follows this list):

- English: utilize [Hugging Face's LED-Large-Book-Summary model](https://huggingface.co/pszemraj/led-large-book-summary).
- Chinese: use the [Randeng-Pegasus-523M-Summary-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese) model.

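
A minimal sketch of this summarization step, assuming chapters are already split into plain-text strings; the generation lengths below are illustrative defaults, not the exact settings used to build the dataset:

```python
# Summarize one chapter with the English LED model cited above.
from transformers import pipeline

en_summarizer = pipeline("summarization", model="pszemraj/led-large-book-summary")

def summarize_chapter(chapter: str) -> str:
    result = en_summarizer(chapter, max_length=256, min_length=64, truncation=True)
    return result[0]["summary_text"]

# Chinese chapters go through IDEA-CCNL/Randeng-Pegasus-523M-Summary-Chinese instead;
# it ships a custom Pegasus tokenizer, so follow that model card for loading.
```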
**Vectorization and Complexity Analysis** (see the sketch after this list):

- Convert the plot summaries into vectors using a BERT-based model.
- Measure the transitions between consecutive chapters through cosine similarity or Euclidean distance.
- Develop a complexity algorithm focused on the standard deviation and peaks of these transition values.

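
A minimal sketch of the vectorization step; the mean-pooled `bert-base-chinese` embeddings are an assumption, and any sentence encoder would work:

```python
# Embed each chapter summary and score how sharply the plot shifts between
# consecutive chapters (cosine distance; higher = sharper transition).
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def embed(summary: str) -> np.ndarray:
    inputs = tokenizer(summary, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # (1, seq_len, hidden_dim)
    return hidden.mean(dim=1).squeeze(0).numpy()       # mean-pooled summary vector

def transition_scores(summaries: list[str]) -> np.ndarray:
    vecs = np.stack([embed(s) for s in summaries])
    a, b = vecs[:-1], vecs[1:]
    cos = (a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return 1.0 - cos
```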
**Metric Quantification** (see the sketch after this list):

- Apply subjective weighting to the complexity metrics derived from the chapter transitions.

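
One way the transition statistics could be folded into a single score; the 0.6/0.4 weights and the peak criterion below are assumptions, not the exact weighting used for the released dataset:

```python
# Combine the spread (standard deviation) of the transition scores with how
# often the plot peaks above its average swing, under hand-picked weights.
import numpy as np
from scipy.signal import find_peaks

def complexity_score(transitions: np.ndarray, w_std: float = 0.6, w_peak: float = 0.4) -> float:
    spread = float(transitions.std())
    peaks, _ = find_peaks(transitions, height=float(transitions.mean()))
    peak_ratio = len(peaks) / max(len(transitions), 1)
    return w_std * spread + w_peak * peak_ratio
```

Novels can then be ranked by this score, keeping only the highest-scoring ones.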
### 3. Outcome

By employing these methods, we can effectively select the novels of higher quality. This refined [dataset](https://huggingface.co/datasets/TriadParty/deepsword) has been shared for further use. The next step is continued pretraining; the specific parameters can be found in my previous model descriptions.

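
For orientation, here is a minimal sketch of what QLoRA-style continued pretraining on Yi-34B could look like; the rank, target modules, and other hyperparameters below are placeholders, not the settings actually used (see the previous model descriptions for those):

```python
# Load Yi-34B in 4-bit and attach a LoRA adapter for continued pretraining.
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-34B", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))
# Train with a plain causal-LM objective (e.g. the transformers Trainer) on the
# filtered novels, then release the merged weights or keep the adapter.
```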
![image/png](https://cdn-uploads.huggingface.co/production/uploads/630c1adea20a5367812196f6/tOMnutLIoT3ImsocQ5hdt.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630c1adea20a5367812196f6/XNH2opnnJ9ZwV7ACcBcHL.png)

As you can see, the zero-shot performance is good. Settings from some of the pretraining novels are embedded very naturally into the characters.

This model is intended to serve as the base model for the agent that drives Live Action Role Playing (剧本杀) game sessions.

Key features:

1. Continued QLoRA pre-training on Yi-34B

2. High-quality martial arts novels

3. A carefully designed cleaning pipeline

Roughly 10 GB of martial arts novels were collected, crawled from various novel websites and PT sites. The raw data contained a large amount of duplicated content as well as many low-quality novels. To clean this data, the following work was done:

1. Define the data quality dimensions. For martial arts novels, the traditionally high-quality works are those of authors such as Jin Yong, Gu Long, and Liang Yusheng. In these works, plot complexity is an indispensable ingredient, and plot complexity is undoubtedly also what matters most for the game scripts.

2. Quantify the data quality dimensions. Since plot complexity was defined as the key dimension in the previous step, we can:

   (1) Summarize each chapter. Specifically, for English the recommended model is https://huggingface.co/pszemraj/led-large-book-summary; for Chinese, Randeng-Pegasus-523M-Summary-Chinese.

   (2) Turn the plot summaries into vectors with a BERT-based model, then define a complexity algorithm: first measure the magnitude of the transitions between chapters via cosine similarity or Euclidean distance.

   (3) Quantify the final metric by applying subjective weighting based on standard-deviation and peak analysis of the inter-chapter transition values.

3. In this way, a good number of decent-quality novels can be selected at the end. I have already shared this data. All that remains is continued pretraining on it; see my previous model for the specific parameters.

Results:

See the images above.

As you can see, the zero-shot performance is good. Settings from some of the novels are embedded very naturally into the characters.