---
license: apache-2.0
inference: false
---
# XVERSE-13B-256K
## 更新信息
**[2024/06/28]** 更新tokenizers。
**[2024/01/16]** 发布长序列对话模型 **XVERSE-13B-256K**,该版本模型最大支持 256K 的上下文窗口长度,约 25w 字的输入内容,可以协助进行文献总结、报告分析等任务。
**[2023/11/06]** 发布新版本的 **XVERSE-13B-2** 底座模型和 **XVERSE-13B-Chat-2** 对话模型,相较于原始版本,新版本的模型训练更加充分(从 1.4T 增加到 3.2T),各方面的能力均得到大幅提升,同时新增工具调用能力。
**[2023/09/26]** 发布 7B 尺寸的 [XVERSE-7B](https://github.com/xverse-ai/XVERSE-7B) 底座模型和 [XVERSE-7B-Chat](https://github.com/xverse-ai/XVERSE-7B) 对话模型,支持在单张消费级显卡部署运行,并保持高性能、全开源、免费可商用。
**[2023/08/22]** 发布经过指令精调的 XVERSE-13B-Chat 对话模型。
**[2023/08/07]** 发布 13B 尺寸的 XVERSE-13B 底座模型。
## Update Information
**[2024/06/28]** Updated tokenizers.
**[2024/01/16]** Released the long-sequence chat model **XVERSE-13B-256K**. This version supports a context window of up to 256K, roughly 250,000 Chinese characters of input, and can assist with tasks such as literature summarization and report analysis.
**[2023/11/06]** Released new versions of the **XVERSE-13B-2** base model and the **XVERSE-13B-Chat-2** chat model. Compared with the original versions, the new models are trained more extensively (from 1.4T to 3.2T tokens), which substantially improves all capabilities, and they add function-calling support.
**[2023/09/26]** Released the 7B-size [XVERSE-7B](https://github.com/xverse-ai/XVERSE-7B) base model and [XVERSE-7B-Chat](https://github.com/xverse-ai/XVERSE-7B) chat model, which can be deployed and run on a single consumer-grade graphics card while remaining high-performance, fully open source, and free for commercial use.
**[2023/08/22]** Released the aligned instruct-finetuned model XVERSE-13B-Chat.
**[2023/08/07]** Released the XVERSE-13B base model.
## Tokenizer版本说明
当使用的tokenizer版本低于0.19,可直接使用仓库中的tokenizer.json和tokenizer_config.json。对于0.19及以上版本,请使用tokenizer.json.update和tokenizer_config.json.update,需要将这两个文件中的所有内容复制并粘贴覆盖至现有的tokenizer.json和tokenizer_config.json文件中。
## Tokenizer Version Notes
For `tokenizers` versions below 0.19, use the `tokenizer.json` and `tokenizer_config.json` files from the repository directly. For versions 0.19 and above, use `tokenizer.json.update` and `tokenizer_config.json.update` instead: copy the full contents of these two files over the existing `tokenizer.json` and `tokenizer_config.json` files.
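The version rule above can be automated. The following is a minimal sketch, not part of the official tooling: the helper names `needs_updated_files` and `apply_tokenizer_update` and the local path `./XVERSE-13B-256K` are hypothetical, and the script assumes the repository files have already been downloaded to that directory.

```python
import re
import shutil
from pathlib import Path


def needs_updated_files(tokenizers_version: str) -> bool:
    """Return True for tokenizers 0.19 and above, per the rule stated above."""
    match = re.match(r"(\d+)\.(\d+)", tokenizers_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {tokenizers_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 19)


def apply_tokenizer_update(model_dir: str) -> list[str]:
    """Overwrite tokenizer.json / tokenizer_config.json with their .update versions.

    Returns the names of the files that were overwritten.
    """
    replaced = []
    for name in ("tokenizer.json", "tokenizer_config.json"):
        source = Path(model_dir) / f"{name}.update"
        if source.exists():
            shutil.copyfile(source, Path(model_dir) / name)
            replaced.append(name)
    return replaced


if __name__ == "__main__":
    from importlib.metadata import version  # reads the installed package version

    local_dir = "./XVERSE-13B-256K"  # hypothetical path to a local copy of the repository
    if needs_updated_files(version("tokenizers")):
        print("overwrote:", apply_tokenizer_update(local_dir))
```

Copying the `.update` files over the originals has the same effect as pasting their contents manually; keep a backup of the original files if you may switch back to an older `tokenizers` version.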
## 模型介绍
**XVERSE-13B-256K**是[**XVERSE-13B-2**](https://huggingface.co/xverse/XVERSE-13B)模型经过ABF+继续预训练、NTK+SFT 微调后的版本。
**XVERSE-13B-256K** 是由深圳元象科技自主研发的支持多语言的大语言模型(Large Language Model),主要应用的技术如下:
- **ABF**: ABF 的全称是 Adjusted Base Frequency,表示将位置编码 RoPE(Rotary Position Embedding)的频率从 10000 修改成 500000 。别小看这个数字的更改,它可以大幅减少前面序列 attention 的衰减速度,让后面的序列更好地获取所有序列的信息。
- **继续预训练**:在 XVERSE-13B-2 的基础上,使用 20% 的预训练数据进行 32K 的长序列继续预训练。通过少量长序列数据的继续预训练而不是从头开始的长序列预训练,可以大幅减少预训练的训练量。
- **NTK**: NTK 的全称是 Neural Tangent Kernel,翻译为神经正切核,是一种用于理解和分析深度神经网络行为的工具。使用了 NTK 的 RoPE 可以对 RoPE 的频率进行动态的插值。在保持分辨率的情况下(高频),进行频域空间的缩放(低频),从而实现位置空间的插值。
- **SFT数据**:自主构建包含单文本问答,多文本问答,摘要,代码补全等各类长序列数据,序列长度从 32K 到 256K 不等。
## Model Introduction
**XVERSE-13B-256K** is the long-sequence version of the [**XVERSE-13B-2**](https://huggingface.co/xverse/XVERSE-13B) model, obtained through continued pre-training with **ABF** followed by supervised fine-tuning (**SFT**) with **NTK**.
**XVERSE-13B-256K** is a multilingual large language model developed independently by Shenzhen Yuanxiang Technology. The main techniques applied are:
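- **ABF**: Adjusted Base Frequency (ABF) changes the base frequency of Rotary Position Embedding (RoPE) from 10,000 to 500,000. This greatly slows the decay of attention to earlier parts of the sequence, so later positions can better draw on information from the whole sequence.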
- **Continual-Pre-Training**: Starting from XVERSE-13B-2, continued pre-training on 32K-long sequences is carried out using 20% of the pre-training data. Continuing pre-training on a small amount of long-sequence data, rather than pre-training on long sequences from scratch, greatly reduces the amount of training required.
- **NTK**: Neural Tangent Kernel (NTK) is a tool for understanding and analyzing the behavior of deep neural networks. RoPE with NTK scaling interpolates the RoPE frequencies dynamically: the high-frequency components are kept to preserve resolution, while the low-frequency components are scaled, which amounts to interpolating in position space.
- **Data for SFT**: We built a diverse set of long-sequence data in-house, covering single-document question answering (QA), multi-document QA, summarization, code completion, and other types. Sequence lengths range from 32K to 256K.
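The effect of ABF and NTK-style scaling on the RoPE frequencies can be sketched numerically. The snippet below is illustrative only: it uses the commonly cited NTK-aware formula (base multiplied by `scale ** (d / (d - 2))`), which may differ from the exact variant used in XVERSE-13B-256K, and the head dimension of 128 and scale factor of 8 are assumptions rather than values taken from the model configuration.

```python
import math


def rope_inv_freq(head_dim: int, base: float) -> list[float]:
    """Per-pair RoPE rotation frequencies: base ** (-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """NTK-aware base adjustment (common formulation, assumed here)."""
    return base * scale ** (head_dim / (head_dim - 2))


HEAD_DIM = 128  # assumption: illustrative head dimension
SCALE = 8.0     # assumption: e.g. extending 32K-trained positions toward 256K

orig = rope_inv_freq(HEAD_DIM, 10_000.0)   # original RoPE base
abf = rope_inv_freq(HEAD_DIM, 500_000.0)   # ABF: larger base, slower low frequencies
ntk = rope_inv_freq(HEAD_DIM, ntk_scaled_base(500_000.0, SCALE, HEAD_DIM))

for name, freqs in (("base 10,000", orig), ("ABF 500,000", abf), ("ABF + NTK", ntk)):
    # The wavelength of the slowest component is the longest distance it can distinguish.
    longest = 2 * math.pi / freqs[-1]
    print(f"{name:12s} highest freq = {freqs[0]:.3f}, longest wavelength = {longest:,.0f} positions")
```

In this formulation the highest frequency stays at 1.0 in all three cases (resolution between neighboring positions is preserved), while the lowest frequency is divided by exactly the scale factor, which is the "keep high frequencies, scale low frequencies" behavior described in the NTK item above.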
## 评测结果
为了验证长序列的效果,这里我们使用了 LongBench 数据集。[ LongBench ](https://github.com/THUDM/LongBench)是第一个多任务、中英双语、针对大语言模型长文本理解能力的评测基准。 LongBench 由六大类、二十一个不同的任务组成,覆盖了单文档问答、多文档问答、摘要、Few shot任务、合成任务和代码补全等关键的长文本应用场景。 LongBench 包含 14 个英文任务、 5 个中文任务和 2 个代码任务,多数任务的平均长度在 5k-15k 之间,共包含 4750 条测试数据。评估结果如下:
| 能力维度 | 数据集 | XVERSE-13B-256K | GPT-3.5-Turbo-16K | Yi-6B-200K | LongChat-7B-16K | Llama2-7B-Chat-4K |
| :--------: | :-------------------: | :----: | :----------: | :--------: | :-----------: | :--------: |
| 多文档问答 | HotpotQA | 58.3 | 51.6 | 48.3 | 22.4 | 24.3 |
| | DuReader | 28.9 | 28.7 | 14.2 | 19.1 | 1.9 |
| 单文档问答 | NarrativeQA | 24.1 | 23.6 | 14.5 | 21.6 | 19.1 |
| | Qasper | 30.2 | 43.3 | 21.6 | 21.6 | 19.6 |
| 摘要 | VCSUM | 11.3 | 16.0 | 8.2 | 14.0 | 0.2 |
| Few shot | TREC | 72.0 | 68.0 | 71.0 | 61.5 | 60.5 |
| | LSHT | 35.0 | 29.2 | 38.0 | 20.8 | 19.8 |
| 合成任务 | PassageRetrieval-en | 63.0 | 71.0 | 6.0 | 24.0 | 9.2 |
| | PassageRetrieval-zh | 44.0 | 77.5 | 7.9 | 4.8 | 0.5 |
| 代码 | RepoBench-P | 55.6 | 53.6 | 61.5 | 54.7 | 42.4 |
对于上述所有比较模型,我们优先汇报其官方公布的结果。在缺少官方结果的情况下,我们采用自行执行的评估流程所获得的数据。
## Model Evaluation
To evaluate long-sequence performance, we use the LongBench dataset. [LongBench](https://github.com/THUDM/LongBench) is the first multi-task, bilingual (English and Chinese) benchmark for evaluating the long-text understanding of large language models. It consists of twenty-one tasks in six major categories, covering key long-text application scenarios: single-document QA, multi-document QA, summarization, few-shot tasks, synthetic tasks, and code completion. LongBench contains 14 English tasks, 5 Chinese tasks, and 2 code tasks; most tasks have an average length between 5k and 15k, and there are 4,750 test instances in total. The evaluation results are as follows:
| Capability | Dataset | XVERSE-13B-256K | GPT-3.5-Turbo-16K | Yi-6B-200K | LongChat-7B-16K | Llama2-7B-Chat-4K |
| :--------: | :-------------------: | :----: | :----------: | :--------: | :-----------: | :--------: |
| Multi-document QA | HotpotQA | 58.3 | 51.6 | 48.3 | 22.4 | 24.3 |
| | DuReader | 28.9 | 28.7 | 14.2 | 19.1 | 1.9 |
| Single-document QA | NarrativeQA | 24.1 | 23.6 | 14.5 | 21.6 | 19.1 |
| | Qasper | 30.2 | 43.3 | 21.6 | 21.6 | 19.6 |
| Summarization | VCSUM | 11.3 | 16.0 | 8.2 | 14.0 | 0.2 |
| Few-shot | TREC | 72.0 | 68.0 | 71.0 | 61.5 | 60.5 |
| | LSHT | 35.0 | 29.2 | 38.0 | 20.8 | 19.8 |
| Synthetic tasks | PassageRetrieval-en | 63.0 | 71.0 | 6.0 | 24.0 | 9.2 |
| | PassageRetrieval-zh | 44.0 | 77.5 | 7.9 | 4.8 | 0.5 |
| Code completion | RepoBench-P | 55.6 | 53.6 | 61.5 | 54.7 | 42.4 |
For all the comparison models above, we report their officially published results where available. Where official results are missing, we report results obtained from our own evaluation pipeline.
### Loading with Transformers
环境安装:
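Environment setup (run inside a clone of the [GitHub repository](https://github.com/xverse-ai/XVERSE-13B), which provides `requirements.txt`):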
```bash
pip install -r requirements.txt
```
可通过以下代码加载 XVERSE-13B-256K 模型进行对话:
The XVERSE-13B-256K model can be loaded and used for generation with the following code:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# trust_remote_code=True lets Transformers run the model code shipped in the repository.
tokenizer = AutoTokenizer.from_pretrained("xverse/XVERSE-13B-256K")
model = AutoModelForCausalLM.from_pretrained(
    "xverse/XVERSE-13B-256K",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = model.eval()

# Prompt: "Attractions in Beijing: the Forbidden City, the Temple of Heaven, the Great Wall, etc.\nAttractions in Shenzhen:"
inputs = tokenizer('北京的景点:故宫、天坛、万里长城等。\n深圳的景点:', return_tensors='pt').input_ids
inputs = inputs.to(model.device)  # follow the device_map placement instead of assuming CUDA
generated_ids = model.generate(inputs, max_new_tokens=64, eos_token_id=tokenizer.eos_token_id, repetition_penalty=1.1)
print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True))
```
更多细节,包括对话 demo 、模型微调及量化等,请参考我们的[Github](https://github.com/xverse-ai/XVERSE-13B)。
For more details, including the chat demo, model fine-tuning, and quantization, please refer to our [GitHub](https://github.com/xverse-ai/XVERSE-13B) repository.
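When working with long documents, it can help to check that a prompt fits in the context window before calling `generate`. The following is a minimal sketch under two assumptions: that "256K" means 256 × 1024 tokens, and that the window must hold both the prompt and the newly generated tokens. The helper `fits_context` is hypothetical and not part of the repository.

```python
CONTEXT_WINDOW = 256 * 1024  # assumption: "256K" read as 262,144 tokens


def fits_context(prompt_token_count: int, max_new_tokens: int,
                 context_window: int = CONTEXT_WINDOW) -> bool:
    """Return True if the prompt plus the requested new tokens fit in the window."""
    if prompt_token_count < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_token_count + max_new_tokens <= context_window


# With a tokenizer loaded as in the example above, the prompt length would be:
#   prompt_token_count = tokenizer(text, return_tensors="pt").input_ids.shape[1]
print(fits_context(prompt_token_count=200_000, max_new_tokens=64))   # True
print(fits_context(prompt_token_count=262_144, max_new_tokens=64))   # False
```

If the check fails, the prompt needs to be shortened (for example by splitting the document) or `max_new_tokens` reduced.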
## 局限性与免责申明
XVERSE-13B-256K 与其他所有 LLM 一样,在某些情况下可能会产生不准确、有偏见或其他令人反感的内容。因此,请谨慎使用模型生成的内容,请勿将生成的有害内容进行传播,在部署任何 XVERSE-13B-256K 的应用之前,开发人员应根据其具体应用对模型进行安全测试和调优。
我们强烈警告不要将 XVERSE-13B-256K 模型用于制造或传播有害信息,或进行任何可能损害公众、国家、社会安全或违反法规的活动。如果使用 XVERSE-13B-256K 模型产生任何问题,无论是数据安全问题、公共舆论风险,还是模型被误解、滥用、传播或不合规使用所引发的任何风险和问题,我们将不承担任何责任。
## Limitations and Disclaimer
Like all other Large Language Models (LLMs), XVERSE-13B-256K may produce inaccurate, biased, or otherwise offensive content under certain circumstances. Please use model-generated content with caution and do not disseminate harmful output. Before deploying any application built on XVERSE-13B-256K, developers should carry out safety testing and tuning of the model tailored to their specific application.
We strongly warn against using the XVERSE-13B-256K model to produce or spread harmful information, or to carry out any activity that might harm the public, national security, or social security, or that violates regulations. We assume no responsibility for any problems arising from the use of the XVERSE-13B-256K model, whether data security issues, public opinion risks, or any risks and issues caused by the model being misunderstood, misused, disseminated, or used in a non-compliant way.
## 模型开源协议
使用本仓库的源码需要遵循 [Apache-2.0](https://github.com/xverse-ai/XVERSE-13B/blob/main/LICENSE) 开源协议,使用 XVERSE-13B-256K 的模型权重则需要遵循[模型许可协议](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf)。
XVERSE-13B-256K 模型权重对学术研究**完全开放**,并且支持**免费商用**。如需申请商业许可证,请填写【[申请表](https://chat.xverse.cn/home/business.html)】,如有其他问题或合作,请联系 <[email protected]>
## Open Source License
The use of the source code in this repository must follow the [Apache-2.0](https://github.com/xverse-ai/XVERSE-13B/blob/main/LICENSE) open-source license, while the use of the model weights of XVERSE-13B-256K needs to adhere to the [Model License Agreement](https://github.com/xverse-ai/XVERSE-13B/blob/main/MODEL_LICENSE.pdf).
The XVERSE-13B-256K model weights are **fully open** to academic research and support **free commercial use**. To apply for a commercial license, please fill in the [application form](https://chat.xverse.cn/home/business.html). For other questions or collaborations, please contact <[email protected]>.