bit-dny committed on
Commit
c96ddb5
·
1 Parent(s): 2c09987

Initialize Model Card

  1. README.md +88 -0
README.md CHANGED
@@ -1,3 +1,91 @@
1
  ---
2
  license: apache-2.0
3
  ---
1
  ---
2
  license: apache-2.0
3
+ language:
4
+ - en
5
+ - zh
6
+ library_name: transformers
7
+ pipeline_tag: text-generation
8
  ---
9
+ # Model Card for MindLLM
10
+
11
+ <!-- Provide a quick summary of what the model is/does. -->
12
+
13
+ ## Model Details
14
+
15
+ ### Model Description
16
+
17
+ MindLLM 1.3B is a Transformer model with 1.3 billion parameters, developed by the *Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications* and the *Beijing Institute of Technology Southeast Academy of Information Technology*.
18
+
19
+ It was trained on bilingual data sources, including the Pile, Wudao, CBook, and self-collected data consisting of websites filtered for safety and educational value. On benchmarks testing common sense, language understanding, and logical reasoning, MindLLM performs strongly and even surpasses some models with fewer than 13 billion parameters.
20
+
21
+ Our model has been fine-tuned on an instruction dataset in chat format but has not been fine-tuned through reinforcement learning from human feedback (RLHF). We release this open-source model to give the research community an unrestricted small model for exploring vital safety challenges and adapting to domain-specific applications.
22
+
23
+ - **Developed by:** *Beijing Engineering Research Center of High Volume Language Information Processing and Cloud Computing Applications* & *Beijing Institute of Technology Southeast Academy of Information Technology*
24
+ - **Model type:** Pretrained Causal Language Model
25
+ - **Language(s) (NLP):** Chinese & English
26
+ - **License:** apache-2.0
27
+ - **Trained from Scratch**
28
+
29
+ ### Model Sources
30
+
31
+ - **Paper:** https://arxiv.org/abs/2310.15777
32
+
33
+ To cite this model, please use:
34
+ ```bib
35
+ @article{mindllm,
36
+ title={MindLLM: Pre-training Lightweight Large Language Model from Scratch, Evaluations and Domain Applications},
37
+ author={Yang, Yizhe and Sun, Huashan and Li, Jiawei and Liu, Runheng and Li, Yinghao and Liu, Yuhang and Huang, Heyan and Gao, Yang},
38
+ journal={arXiv preprint arXiv:2310.15777},
39
+ year={2023}
40
+ }
41
+ ```
42
+
43
+ ## Uses
44
+
45
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
46
+
47
+ ### Direct Use
48
+
49
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
50
+
51
+ Because the model has undergone supervised fine-tuning on instruction data in a special chat format, you can use it directly with a text-generation pipeline. This example generates a different sequence each time it is run:
52
+
53
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM, TextGenerationPipeline
+ device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to the CPU when no GPU is available
+ # 'mindllm_path' is a placeholder for the location of the MindLLM checkpoint.
+ tokenizer = AutoTokenizer.from_pretrained('mindllm_path')
+ tokenizer.model_max_length = 1024
+ model = AutoModelForCausalLM.from_pretrained('mindllm_path').to(device)
+ generator = TextGenerationPipeline(model=model, tokenizer=tokenizer, device=device)
+ # Prompt in the chat format: "Do you know what advantages electric vehicles have over traditional gasoline cars?"
+ context = "<user>: 你知道电动车相对传统汽油车有哪些优点吗?"
+ outputs = generator(context, max_new_tokens=1024, do_sample=True, num_beams=4, repetition_penalty=0.5, no_repeat_ngram_size=5, return_full_text=False)
+ print(outputs)
+ # Sample output (sampling makes each run differ):
+ # [{'generated_text': '电动车相对传统汽油车的优点包括:\n1. 更低的排放和更高的能源效率 - 电动车所产生的有害排放物质远少于汽油车,并且它们的能源利用效率更高。\n2. 更低的维护成本 - 电动车需要更少的保养和通常拥有较少的运动部件,从而降低了总体维护成本。\n3. 更低的燃料成本 - 电动车需要比汽油车少得多的燃料,因此随着时间的推移,可以节省成本。\n4. 更长的续航里程 - 电动车单次充电可以行驶比汽油车更远的距离,非常适合长途通勤。\n5. 更为安静的运行 - 电动车比汽油车要安静得多,使驾驶更加愉悦。'}]
+ ```
63
+
64
+ ## Training Details
65
+
66
+ ### Training Data
67
+
68
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
69
+
70
+ Our training corpus is a diverse blend of both English and Chinese language data sources. The English component originates from the Pile dataset, and the Chinese component comprises data from Wudao, CBooks, and data meticulously gathered through web crawling.
71
+
72
+ To ensure data quality, we run a thorough preprocessing pipeline: data cleaning to purge special tags, deduplication using Locality-Sensitive Hashing (LSH), and filtering to remove low-quality content, which comes predominantly from advertisements or inappropriate material. We also examine the relationship between data volume and model capacity, assess the impact of different data types on how well the model fits, and evaluate training stability when handling mixed data sources. This analysis offers insights into the vital role of pre-training data and the complexities of processing it. Finally, we tune the mixture of data sources used to construct the training set, guided by data engineering and experience.
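
The LSH-based deduplication step mentioned above can be illustrated with a small MinHash sketch. This is not the pipeline used for MindLLM, whose details are in the paper; the shingle size, the number of hash functions, the band count, and all function names below are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of character k-grams of a whitespace-normalized document."""
    text = " ".join(text.split())
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(shingle_set, num_perm=64):
    """Keep the minimum hash value under each of num_perm salted hash functions."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def near_duplicate_pairs(docs, num_perm=64, bands=16):
    """Split each signature into bands; documents that share a band are candidates."""
    rows = num_perm // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_perm)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    corpus = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick  brown fox jumps over the lazy dog",  # same text, extra space
        "c": "0123456789 9876543210 1357924680",
    }
    print(near_duplicate_pairs(corpus))
```

Using more bands with fewer rows per band makes the filter more sensitive to weaker overlaps, at the cost of more candidate pairs to verify. In practice, one document from each confirmed pair would be dropped.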
73
+
74
+
75
+ ### Training Procedure
76
+
77
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
78
+
79
+ This version of the model was trained on about 241 billion English tokens and 82 billion Chinese tokens with a two-stage training strategy. It was trained as an autoregressive language model, using cross-entropy loss.
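
To make the objective concrete, the sketch below computes the next-token cross-entropy loss for a single sequence in plain Python. It is a minimal illustration of the standard autoregressive objective, not the MindLLM training code; the function name and the toy inputs are assumptions.

```python
import math


def next_token_cross_entropy(logits, token_ids):
    """Mean negative log-likelihood of each token given the tokens before it.

    logits[t] holds the unnormalized vocabulary scores predicted after reading
    token_ids[: t + 1], so the targets are token_ids[1:].
    """
    targets = token_ids[1:]
    total = 0.0
    for scores, target in zip(logits, targets):
        # Numerically stable log-sum-exp gives the log of the softmax normalizer.
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)


if __name__ == "__main__":
    # Toy case: a vocabulary of 4 tokens and uniform predictions at every step.
    uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
    print(next_token_cross_entropy(uniform_logits, [1, 2, 3, 0]))
```

With uniform predictions the loss equals the logarithm of the vocabulary size, which is a useful sanity check for the starting loss of a randomly initialized model.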
80
+
81
+ This version of the model was also fine-tuned on 4 million Chinese instruction samples collected from open-source instruction-tuning datasets. The instruction-tuning stage enables the model to answer questions and hold multi-turn conversations **in Chinese**.
82
+
83
+ **For more detailed information, please refer to the paper.**
84
+
85
+ ## Evaluation
86
+
87
+ <!-- This section describes the evaluation protocols and provides the results. -->
88
+
89
+ ### Results on MMLU
90
+
91
+ ### Results on C-Eval