Zhuoyang Song committed on
Commit
5732dc5
1 Parent(s): cb2cbf3

add model card

Files changed (1)
  1. README.md +101 -0
README.md CHANGED
@@ -1,3 +1,104 @@
  ---
  license: apache-2.0
  ---
+
+ # Randeng-TransformerXL-5B-Abduction-Chinese
+
+ - GitHub: [Fengshenbang-LM](https://github.com/IDEA-CCNL/Fengshenbang-LM)
+ - Docs: [Fengshenbang-Docs](https://fengshenbang-doc.readthedocs.io/)
+ - Demo: [Reasoning Tree](https://idea.edu.cn/ccnl-act/reasoning/)
+
+ ## Brief Introduction
+
+ A Chinese generative model for abductive reasoning (inferring plausible causes from an observed event), based on Transformer-XL.
+
+ ## Model Taxonomy
+
+ | Demand | Task | Series | Model | Parameters | Extra |
+ | :----: | :----: | :----: | :----: | :----: | :----: |
+ | General | NLG (natural language generation) | Randeng | TransformerXL | 5.0B | Chinese causal reasoning |
+
+ ## Model Information
+
+ **Corpus Preparation**
+
+ * Wudao Corpus (280G version)
+ * Wudao Causal Corpus (2.3 million samples): sentence pairs with a causal relation, extracted from the Wudao Corpus (280G version) through connective (logic indicator) matching, manual annotation plus filtering with [GTSFactory](https://gtsfactory.com/), and data cleaning.
+
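The connective-matching step above can be illustrated with a minimal sketch. The two patterns and the function name `extract_causal_pair` are hypothetical; the card does not list the connectives that were actually used, and the real pipeline also involved manual annotation, GTSFactory filtering, and data cleaning, which are not shown here.

```python
import re

# Hypothetical connective patterns, for illustration only.
# "因为 X,所以 Y" (because X, so Y) and "由于 X,(因此) Y" (due to X, (therefore) Y).
CAUSAL_PATTERNS = [
    re.compile(r"^因为(?P<cause>[^,,。]+)[,,]所以(?P<effect>[^。]+)。?$"),
    re.compile(r"^由于(?P<cause>[^,,。]+)[,,](?:因此)?(?P<effect>[^。]+)。?$"),
]


def extract_causal_pair(sentence):
    """Return (cause, effect) if the sentence matches a pattern, else None."""
    text = sentence.strip()
    for pattern in CAUSAL_PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group("cause"), match.group("effect")
    return None


# "Because it rained, the match was cancelled."
print(extract_causal_pair("因为下雨,所以比赛取消。"))
```

Sentences without a recognised connective return `None`, so they would simply be skipped at this stage.
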
+ **Model Training**
+
+ 1. Pre-train the Transformer-XL model on the Wudao Corpus (280G version) and an annotated dataset of similar-sentence pairs (the same pre-training as [Randeng-TransformerXL-1.1B-Paraphrasing-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-TransformerXL-1.1B-Paraphrasing-Chinese)).
+ 2. Train the model on about 1.5 million samples of the causal corpus for the abductive generation task.
+ 3. Run closed-loop self-consistency training on the remaining 0.8 million samples of the causal corpus, together with [Randeng-TransformerXL-5B-Deduction-Chinese](https://huggingface.co/IDEA-CCNL/Randeng-TransformerXL-5B-Deduction-Chinese) and [Erlangshen-Roberta-330M-Causal-Chinese](https://huggingface.co/IDEA-CCNL/Erlangshen-Roberta-330M-Causal-Chinese):
+     * The two generative models perform deductive (causal) reasoning and abductive reasoning on each sample, using nucleus sampling and greedy decoding, to produce a large number of pseudo-samples.
+     * [Erlangshen-Roberta-330M-Causal-Chinese](https://huggingface.co/IDEA-CCNL/Erlangshen-Roberta-330M-Causal-Chinese) scores the causality of each pseudo-sample sentence pair and selects the training data for itself and for the generative models in the next iteration.
+
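One round of the self-consistency loop described above might be sketched as follows. This is an illustration under stated assumptions, not the authors' training code: `self_consistency_round`, the `threshold` parameter, and the three toy callables are hypothetical stand-ins for the deduction model, the abduction model, and the causality scorer, and the retraining step that follows each round is omitted.

```python
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (cause, effect)


def self_consistency_round(
    pairs: Iterable[Pair],
    deduce: Callable[[str], List[str]],
    abduce: Callable[[str], List[str]],
    score: Callable[[str, str], float],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the card
) -> List[Pair]:
    """Generate pseudo (cause, effect) pairs and keep those the scorer accepts."""
    candidates: List[Pair] = []
    for cause, effect in pairs:
        # Deduction: propose new effects for the known cause.
        candidates.extend((cause, e) for e in deduce(cause))
        # Abduction: propose new causes for the known effect.
        candidates.extend((c, effect) for c in abduce(effect))
    # Keep only pseudo-pairs whose causality score clears the threshold.
    return [(c, e) for c, e in candidates if score(c, e) >= threshold]


# Toy stand-ins so the sketch runs without any model weights.
def toy_deduce(cause: str) -> List[str]:
    return ["effect of " + cause]


def toy_abduce(effect: str) -> List[str]:
    return ["cause of " + effect]


def toy_score(cause: str, effect: str) -> float:
    return 1.0 if cause.startswith("cause of") else 0.0


kept = self_consistency_round(
    [("rain", "wet road")], toy_deduce, toy_abduce, toy_score
)
print(kept)
```

In the procedure the card describes, the retained pairs would then serve as training data for the scorer and both generative models before the next iteration.
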
+ ## Loading Models
+
+ ```shell
+ # Clone the repository that provides the fengshen package
+ git clone https://github.com/IDEA-CCNL/Fengshenbang-LM.git
+ cd Fengshenbang-LM
+ ```
+
+ ```python
+ from fengshen.models.transfo_xl_reasoning import TransfoXLModel
+ # The tokenizer is a T5Tokenizer, imported under an alias
+ from transformers import T5Tokenizer as TransfoXLTokenizer
+
+ model = TransfoXLModel.from_pretrained('IDEA-CCNL/Randeng-TransformerXL-5B-Abduction-Chinese')
+ tokenizer = TransfoXLTokenizer.from_pretrained(
+     "IDEA-CCNL/Randeng-TransformerXL-5B-Abduction-Chinese",
+     eos_token='<|endoftext|>',
+     pad_token='<|endoftext|>',
+     extra_ids=0
+ )
+ # Register the beginning-of-sequence token
+ tokenizer.add_special_tokens({'bos_token': '<bos>'})
+ ```
+
+ ## Usage Example
+
+ ```python
+ from fengshen.models.transfo_xl_reasoning import abduction_generate
+
+ # A single input string ("Corn prices keep rising")
+ input_text = "玉米价格持续上涨"
+ # A batch of input strings
+ input_texts = ["玉米价格持续上涨", "玉米价格持续上涨"]
+ # `model` and `tokenizer` are the objects created in the loading step
+ print(abduction_generate(model, tokenizer, input_text, device=0))
+ print(abduction_generate(model, tokenizer, input_texts, device=0))
+ ```
+
+ ## Citation
+
+ If you use this model in your work, please cite our [paper](https://arxiv.org/abs/2209.02970):
+
+ ```text
+ @article{fengshenbang,
+   author = {Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen and Ruyi Gan and Jiaxing Zhang},
+   title = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
+   journal = {CoRR},
+   volume = {abs/2209.02970},
+   year = {2022}
+ }
+ ```
+
+ You can also cite our [website](https://github.com/IDEA-CCNL/Fengshenbang-LM/):
+
+ ```text
+ @misc{Fengshenbang-LM,
+   title={Fengshenbang-LM},
+   author={IDEA-CCNL},
+   year={2021},
+   howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
+ }
+ ```