IlyaGusev committed
Commit 095f268 · 1 Parent(s): 845fdfe

Create README.md

Files changed (1)
  1. README.md +133 -0
README.md ADDED
@@ -0,0 +1,133 @@
---
datasets:
- IlyaGusev/ru_turbo_alpaca
- IlyaGusev/ru_turbo_saiga
- IlyaGusev/ru_sharegpt_cleaned
- IlyaGusev/oasst1_ru_main_branch
- IlyaGusev/ru_turbo_alpaca_evol_instruct
- lksy/ru_instruct_gpt4
language:
- ru
pipeline_tag: conversational
license: cc-by-4.0
---

# GigaSaiga, Russian ruGPT-3.5-based chatbot

Based on [ruGPT-3.5-13B](https://huggingface.co/ai-forever/ruGPT-3.5-13B).

* This is an adapter-only version (an optional merge sketch is shown after the examples below).

Training code: [link](https://github.com/IlyaGusev/rulm/tree/master/self_instruct)

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

MODEL_NAME = "IlyaGusev/gigasaiga_lora"
DEFAULT_MESSAGE_TEMPLATE = "<s> {role}\n{content}</s>\n"
DEFAULT_SYSTEM_PROMPT = "Ты — Сайга, русскоязычный автоматический ассистент. Ты разговариваешь с людьми и помогаешь им."


class Conversation:
    def __init__(
        self,
        message_template=DEFAULT_MESSAGE_TEMPLATE,
        system_prompt=DEFAULT_SYSTEM_PROMPT,
        start_token_id=2,
        bot_token_id=46787
    ):
        self.message_template = message_template
        self.start_token_id = start_token_id
        self.bot_token_id = bot_token_id
        self.messages = [{
            "role": "system",
            "content": system_prompt
        }]

    def get_start_token_id(self):
        return self.start_token_id

    def get_bot_token_id(self):
        return self.bot_token_id

    def add_user_message(self, message):
        self.messages.append({
            "role": "user",
            "content": message
        })

    def add_bot_message(self, message):
        self.messages.append({
            "role": "bot",
            "content": message
        })

    def get_prompt(self, tokenizer):
        # Render every stored message with the template, then open a new bot
        # turn so the model continues as the assistant.
        final_text = ""
        for message in self.messages:
            message_text = self.message_template.format(**message)
            final_text += message_text
        final_text += tokenizer.decode([self.start_token_id, self.bot_token_id])
        return final_text.strip()


def generate(model, tokenizer, prompt, generation_config):
    data = tokenizer(prompt, return_tensors="pt")
    data = {k: v.to(model.device) for k, v in data.items()}
    output_ids = model.generate(
        **data,
        generation_config=generation_config
    )[0]
    # Keep only the newly generated tokens, dropping the prompt.
    output_ids = output_ids[len(data["input_ids"][0]):]
    output = tokenizer.decode(output_ids, skip_special_tokens=True)
    return output.strip()


# Load the base model in 8-bit and apply the LoRA adapter on top of it.
config = PeftConfig.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    load_in_8bit=True,
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(
    model,
    MODEL_NAME,
    torch_dtype=torch.float16
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)
print(generation_config)

inputs = ["Почему трава зеленая?", "Сочини длинный рассказ, обязательно упоминая следующие объекты. Дано: Таня, мяч"]
for inp in inputs:
    conversation = Conversation()
    conversation.add_user_message(inp)
    prompt = conversation.get_prompt(tokenizer)

    output = generate(model, tokenizer, prompt, generation_config)
    print(inp)
    print(output)
    print()
    print("==============================")
    print()
```
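
The `Conversation` helper also supports multi-turn dialogue. A minimal sketch, reusing `model`, `tokenizer`, `generate`, and `generation_config` from the snippet above; the follow-up question is only an illustration:

```python
# Multi-turn usage: append the model's reply with add_bot_message, then ask a
# follow-up; get_prompt re-renders the whole history on every call.
conversation = Conversation()
conversation.add_user_message("Почему трава зеленая?")
first_reply = generate(model, tokenizer, conversation.get_prompt(tokenizer), generation_config)

conversation.add_bot_message(first_reply)
conversation.add_user_message("А почему небо голубое?")  # illustrative follow-up: "And why is the sky blue?"
second_reply = generate(model, tokenizer, conversation.get_prompt(tokenizer), generation_config)
print(second_reply)
```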

Examples:
```
User: Почему трава зеленая?
Saiga:
```

```
User: Сочини длинный рассказ, обязательно упоминая следующие объекты. Дано: Таня, мяч
Saiga:
```
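
Since only the LoRA adapter is published here, it can optionally be folded into the base model to get a standalone checkpoint that loads without `peft`. A minimal sketch, assuming enough memory to hold ruGPT-3.5-13B in fp16; the output directory name `gigasaiga_merged` is arbitrary:

```python
import torch
from peft import PeftModel, PeftConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "IlyaGusev/gigasaiga_lora"
OUTPUT_DIR = "gigasaiga_merged"  # arbitrary local path

# Load the base model in fp16 (not 8-bit) so the adapter weights can be merged.
config = PeftConfig.from_pretrained(MODEL_NAME)
base_model = AutoModelForCausalLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.float16,
    device_map="auto"
)
model = PeftModel.from_pretrained(base_model, MODEL_NAME, torch_dtype=torch.float16)

# Fold the LoRA weights into the base weights and drop the PEFT wrappers.
merged_model = model.merge_and_unload()

# Save a standalone checkpoint plus the tokenizer.
merged_model.save_pretrained(OUTPUT_DIR)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
tokenizer.save_pretrained(OUTPUT_DIR)
```

The merged model can then be loaded with plain `AutoModelForCausalLM.from_pretrained(OUTPUT_DIR)`.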

v1:
- dataset code revision 7712a061d993f61c49b1e2d992e893c48acb3a87
- wandb [link](https://wandb.ai/ilyagusev/rulm_self_instruct/runs/lwgw4a1w)
- 7 datasets: ru_turbo_alpaca, ru_turbo_saiga, ru_sharegpt_cleaned, oasst1_ru_main_branch, gpt_roleplay_realm, ru_turbo_alpaca_evol_instruct (iteration 1/2), ru_instruct_gpt4
- Datasets merging script: [create_chat_set.py](https://github.com/IlyaGusev/rulm/blob/e4238fd9a196405b566a2d5838ab44b7a0f4dc31/self_instruct/src/data_processing/create_chat_set.py)
- saiga13b_v2 vs gigasaiga side-by-side comparison: 112-11-53