---
language:
- en
- zh
- id
- th
- vi
- ms
- lo
- my
- jv
- km
- su
- tl
tags:
- multilingual
- sea
- sailor
- sft
- chat
- instruction
widget:
- text: 如何制作烤鱼?
  example_title: Chinese
- text: How to bake fish?
  example_title: English
- text: Bagaimana cara memanggang ikan?
  example_title: Malay
- text: วิธีย่างปลา?
  example_title: Thai
- text: Bagaimana membuat bakaran ikan?
  example_title: Indonesian
- text: Làm thế nào để nướng cá?
  example_title: Vietnamese
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B
---

<div align="center">
<img src="sailor2_banner.jpg" width="700"/>
</div>

> The logo was generated by MidJourney

Sailor2 is a community-driven initiative that brings cutting-edge multilingual language models to South-East Asia (SEA).
Our research highlights a strong demand for models in the **8B and 20B parameter** range for production use, alongside **1B models** for specialized applications
such as speculative decoding and research.
These models, released under the **Apache 2.0 license**, provide enhanced accessibility to advanced language technologies across the region.

Developed with careful data curation, Sailor models are designed to understand and generate text across the diverse linguistic landscape of the SEA region.
Built from [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e), Sailor encompasses models of varying sizes, spanning from 0.5B to 14B parameters to meet different requirements.
We further fine-tune the base models on open-source datasets to obtain instruction-tuned models, namely Sailor-Chat.
Benchmarking results demonstrate Sailor's proficiency in tasks such as question answering and commonsense reasoning in SEA languages.

Sailor2 builds upon the foundation of the awesome multilingual model [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e) and
is continually pre-trained on **500B tokens** to better support **15 languages** with a unified model.
These languages include English, Chinese, Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray.
By addressing the growing demand for diverse, robust, and accessible language models, Sailor2 seeks to serve the underserved communities of SEA with open, inclusive, and accessible multilingual LLMs.

Refer to the [Sailor2 Website](https://sailorllm.github.io/) for more training details.

## Model Summary
- **Model Collections:** [Base Model & Chat Model](https://huggingface.co/collections/sail/sailor2-language-models-674d7c9e6b4dbbd9a869906b)
- **Project Website:** [sailorllm.github.io](https://sailorllm.github.io/)
- **Codebase:** [github.com/sail-sg/sailor2](https://github.com/sail-sg/sailor2)
- **Technical Report:** Coming Soon

## Training details

## Requirements
The code for Sailor2 is available in the latest Hugging Face `transformers`, and we advise you to install `transformers==4.46.3`.

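If you want to confirm the installed version before running the quickstart, a minimal (and optional) check is:

```python
import transformers

# The card advises transformers==4.46.3; print the installed version to
# confirm your environment before running the quickstart below.
print(transformers.__version__)
```
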
## Quickstart

The following code snippet shows how to load the tokenizer and model, and how to generate content.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the chat model and its tokenizer (use the same checkpoint for both;
# the 8B and 20B chat models can be substituted here).
model = AutoModelForCausalLM.from_pretrained(
    'sail/Sailor2-1B-Chat',
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained('sail/Sailor2-1B-Chat')

system_prompt = \
'You are an AI assistant named Sailor2, created by Sea AI Lab. \
As an AI assistant, you can answer questions in English, Chinese, and Southeast Asian languages \
such as Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray. \
Your responses should be friendly, unbiased, informative, detailed, and faithful.'

# All three prompts ask: "Give me a brief introduction to large language models."
prompt = "Beri saya pengenalan singkat tentang model bahasa besar."  # Indonesian
# prompt = "Hãy cho tôi một giới thiệu ngắn gọn về mô hình ngôn ngữ lớn."  # Vietnamese
# prompt = "ให้ฉันแนะนำสั้น ๆ เกี่ยวกับโมเดลภาษาขนาดใหญ่"  # Thai

# Build the chat-formatted input string using the model's chat template.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(device)
input_ids = model_inputs.input_ids.to(device)

generated_ids = model.generate(
    input_ids,
    max_new_tokens=512,
)

# Strip the prompt tokens so that only the newly generated response is decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
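
For interactive use you may prefer to stream the response token by token instead of waiting for the full generation. The sketch below is not part of the official quickstart; it reuses `model`, `tokenizer`, `text`, and `device` from the snippet above and relies on the `TextStreamer` utility from `transformers`:

```python
from transformers import TextStreamer

# Print tokens to stdout as they are generated; skip echoing the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model_inputs = tokenizer([text], return_tensors="pt").to(device)
_ = model.generate(
    **model_inputs,
    max_new_tokens=512,
    streamer=streamer,
)
```

The streamed variant prints directly to stdout, so there is no need to decode the returned ids.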

## License

Sailor2 is distributed under the terms of the Apache License 2.0.
There are no restrictions on research or commercial use.

## Citation

If you find Sailor2 useful, please cite our work as follows:

```bibtex
@misc{sailor2report,
  title={Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLM},
  author={Sailor2 Team},
  year={2024}
}
```


## Contact Us

If you have any questions, please raise an issue or contact us at [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected]).