Artples committed on
Commit 920a49b · verified · 1 Parent(s): b4a8776

Delete Finetuning_NoteBook.ipynb

Files changed (1)
  1. Finetuning_NoteBook.ipynb +0 -597
Finetuning_NoteBook.ipynb DELETED
@@ -1,597 +0,0 @@
- {
- "cells": [
- {
- "cell_type": "markdown",
- "id": "ba5a3824",
- "metadata": {},
- "source": [
- "# Installing Required Libraries!"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "bb5c2ce5",
- "metadata": {},
- "source": [
- "Installing required libraries, including trl, transformers, accelerate, peft, datasets, and bitsandbytes."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fb17ce11",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# Checks if PyTorch is installed and installs it if not.\n",
- "try:\n",
- "    import torch\n",
- "    print(\"PyTorch is installed!\")\n",
- "except ImportError:\n",
- "    print(\"PyTorch is not installed.\")\n",
- "    !pip install -q torch\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "5f38ad58",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "!pip install -q --upgrade \"transformers==4.38.2\"\n",
- "!pip install -q --upgrade \"datasets==2.16.1\"\n",
- "!pip install -q --upgrade \"accelerate==0.26.1\"\n",
- "!pip install -q --upgrade \"evaluate==0.4.1\"\n",
- "!pip install -q --upgrade \"bitsandbytes==0.42.0\"\n",
- "!pip install -q --upgrade \"trl==0.7.11\"\n",
- "!pip install -q --upgrade \"peft==0.8.2\"\n",
- " "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "98e65745",
- "metadata": {},
- "source": [
- "# Load and Prepare the Dataset"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7cf4cbb2",
- "metadata": {},
- "source": [
- "The dataset is already formatted in a conversational format, which is supported by [trl](https://huggingface.co/docs/trl/index/), and ready for supervised finetuning."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7c50d411",
- "metadata": {},
- "source": [
- "\n",
- "**Conversational format:**\n",
- "\n",
- "\n",
- "```python\n",
- "{\"messages\": [{\"role\": \"system\", \"content\": \"You are...\"}, {\"role\": \"user\", \"content\": \"...\"}, {\"role\": \"assistant\", \"content\": \"...\"}]}\n",
- "{\"messages\": [{\"role\": \"system\", \"content\": \"You are...\"}, {\"role\": \"user\", \"content\": \"...\"}, {\"role\": \"assistant\", \"content\": \"...\"}]}\n",
- "{\"messages\": [{\"role\": \"system\", \"content\": \"You are...\"}, {\"role\": \"user\", \"content\": \"...\"}, {\"role\": \"assistant\", \"content\": \"...\"}]}\n",
- "```\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "60321c78",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "from datasets import load_dataset\n",
- "\n",
- "# Load dataset from the hub\n",
- "dataset = load_dataset(\"HuggingFaceH4/ultrachat_200k\", split=\"train_sft\")\n",
- "\n",
- "dataset = dataset.shuffle(seed=42)\n",
- " "
- ]
- },
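- {
- "cell_type": "markdown",
- "id": "ad0e5f01",
- "metadata": {},
- "source": [
- "A quick sanity check (an added illustrative sketch, not part of the original notebook): each `ultrachat_200k` record should expose a `messages` list of role/content dicts, which is the conversational format `trl` expects."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ad0e5f02",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# Illustrative check: look at the first two turns of one sample and the dataset size.\n",
- "print(dataset[0][\"messages\"][:2])\n",
- "print(f\"Number of training samples: {len(dataset)}\")\n"
- ]
- },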
- {
- "cell_type": "markdown",
- "id": "5fdaa4ee",
- "metadata": {},
- "source": [
- "# Load **mistralai/Mistral-7B-v0.1** for Finetuning"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e046840e",
- "metadata": {},
- "source": [
- "\n",
- "This process involves two key steps:\n",
- "\n",
- "1. **LLM Quantization:**\n",
- "   - We first load the selected large language model (LLM).\n",
- "   - We then use the `bitsandbytes` library to quantize the model, which can significantly reduce its memory footprint.\n",
- "\n",
- "> **Note:** The memory requirements of the model scale with its size. For instance, a 7B parameter model may require\n",
- "a 24GB GPU for fine-tuning.\n",
- "\n",
- "2. **Chat Model Preparation:**\n",
- "   - To train a model for chat/conversational tasks, we need to prepare both the model and its tokenizer.\n",
- "\n",
- "   - This involves adding special tokens to the tokenizer and the model itself. These tokens help the model\n",
- "     understand the different roles within a conversation.\n",
- "\n",
- "   - The **trl** library provides a convenient method called `setup_chat_format` for this purpose. This method performs the\n",
- "     following actions:\n",
- "\n",
- "     * Adds special tokens to the tokenizer, such as `<|im_start|>` and `<|im_end|>`, to mark the beginning and\n",
- "       end of each turn in a conversation.\n",
- "\n",
- "     * Resizes the model's embedding layer to accommodate the new tokens.\n",
- "\n",
- "     * Sets the tokenizer's chat template, which defines the format used to convert input data into a chat-like\n",
- "       structure. The default template is `chatml` from OpenAI.\n",
- "\n",
- "\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e2af96b6",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "import torch\n",
- "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n",
- "from trl import setup_chat_format\n",
- "\n",
- "# Hugging Face model id\n",
- "model_id = \"mistralai/Mistral-7B-v0.1\"\n",
- "\n",
- "# BitsAndBytesConfig: 4-bit NF4 quantization with double quantization\n",
- "bnb_config = BitsAndBytesConfig(\n",
- "    load_in_4bit=True, bnb_4bit_use_double_quant=True,\n",
- "    bnb_4bit_quant_type=\"nf4\", bnb_4bit_compute_dtype=torch.bfloat16\n",
- ")\n",
- "\n",
- "# Load model and tokenizer\n",
- "model = AutoModelForCausalLM.from_pretrained(\n",
- "    model_id,\n",
- "    device_map=\"auto\",\n",
- "    trust_remote_code=True,\n",
- "    torch_dtype=torch.bfloat16,\n",
- "    quantization_config=bnb_config\n",
- ")\n",
- "\n",
- "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
- "tokenizer.padding_side = \"right\"\n",
- "\n",
- "# Set chat template to OAI chatML\n",
- "model, tokenizer = setup_chat_format(model, tokenizer)\n",
- " "
- ]
- },
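- {
- "cell_type": "markdown",
- "id": "ad0e5f03",
- "metadata": {},
- "source": [
- "An added sanity check (illustrative sketch, not in the original notebook): render one dataset sample through the newly installed template to confirm `setup_chat_format` took effect."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ad0e5f04",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# Illustrative: the rendered string should contain <|im_start|> and <|im_end|>\n",
- "# markers from the chatml template.\n",
- "sample = dataset[0][\"messages\"]\n",
- "print(tokenizer.apply_chat_template(sample, tokenize=False))\n"
- ]
- },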
- {
- "cell_type": "markdown",
- "id": "1b837560",
- "metadata": {},
- "source": [
- "## Setting LoRA Config"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4617d5d0",
- "metadata": {},
- "source": [
- "The `SFTTrainer` provides native integration with `peft`, simplifying the process of efficiently tuning\n",
- "Large Language Models (LLMs) using techniques such as [LoRA](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms).\n",
- "The only requirement is to create the `LoraConfig` and pass it to the `SFTTrainer`.\n",
- " "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "b6244b7f",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "from peft import LoraConfig\n",
- "\n",
- "peft_config = LoraConfig(\n",
- "    lora_alpha=8,\n",
- "    lora_dropout=0.05,\n",
- "    r=6,\n",
- "    bias=\"none\",\n",
- "    target_modules=\"all-linear\",\n",
- "    task_type=\"CAUSAL_LM\"\n",
- ")\n",
- " "
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e5ffc4bd",
- "metadata": {},
- "source": [
- "## Setting the TrainingArguments"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "eac8898f",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# Installing tensorboard to report the metrics\n",
- "!pip install -q tensorboard\n",
- " "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "12aa9947",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "from transformers import TrainingArguments\n",
- "\n",
- "args = TrainingArguments(\n",
- "    output_dir=\"temp_/LChat-7b\",\n",
- "    num_train_epochs=100,\n",
- "    per_device_train_batch_size=3,\n",
- "    gradient_accumulation_steps=2,\n",
- "    gradient_checkpointing=True,\n",
- "    gradient_checkpointing_kwargs={'use_reentrant': False},\n",
- "    optim=\"adamw_torch_fused\",\n",
- "    logging_steps=10,\n",
- "    save_strategy='epoch',\n",
- "    learning_rate=0.075,\n",
- "    bf16=True,\n",
- "    max_grad_norm=0.3,\n",
- "    warmup_ratio=0.1,\n",
- "    lr_scheduler_type='cosine',\n",
- "    report_to='tensorboard',\n",
- "    max_steps=-1,\n",
- "    seed=42,\n",
- "    overwrite_output_dir=True,\n",
- "    remove_unused_columns=True\n",
- ")\n",
- " "
- ]
- },
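- {
- "cell_type": "markdown",
- "id": "ad0e5f05",
- "metadata": {},
- "source": [
- "A small added note (illustrative, not in the original notebook): the effective batch size is the per-device batch size multiplied by the gradient-accumulation steps."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ad0e5f06",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# Illustrative arithmetic: 3 samples per device x 2 accumulation steps\n",
- "# = 6 samples per optimizer step (per device).\n",
- "effective_batch = args.per_device_train_batch_size * args.gradient_accumulation_steps\n",
- "print(f\"Effective batch size per device: {effective_batch}\")\n"
- ]
- },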
- {
- "cell_type": "markdown",
- "id": "5c895809",
- "metadata": {},
- "source": [
- "## Setting the Supervised Finetuning Trainer (`SFTTrainer`)\n",
- "\n",
- "The `SFTTrainer` is a wrapper around the `transformers.Trainer` class and inherits all of its attributes and methods.\n",
- "The trainer takes care of properly initializing the `PeftModel`.\n",
- " "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "d269b68a",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "from trl import SFTTrainer\n",
- "\n",
- "trainer = SFTTrainer(\n",
- "    model=model,\n",
- "    args=args,\n",
- "    train_dataset=dataset,\n",
- "    peft_config=peft_config,\n",
- "    max_seq_length=2048,\n",
- "    tokenizer=tokenizer,\n",
- "    packing=True,\n",
- "    dataset_kwargs={'add_special_tokens': False, 'append_concat_token': False}\n",
- ")\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b05793a3",
- "metadata": {},
- "source": [
- "### Starting Training and Saving Model/Tokenizer\n",
- "\n",
- "We start training the model by calling the `train()` method on the trainer instance. This starts the training\n",
- "loop and trains the model for `100 epochs`. Checkpoints are saved to the output directory (**'temp_/LChat-7b'**)\n",
- "at the end of each epoch; the merged model is pushed to the Hub manually in a later step.\n",
- " "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f56066fc",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# Disable the KV cache; it is not useful during training and conflicts\n",
- "# with gradient checkpointing.\n",
- "model.config.use_cache = False\n",
- "\n",
- "# start training\n",
- "trainer.train()\n",
- "\n",
- "# save the PEFT (LoRA adapter) weights\n",
- "trainer.save_model()\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8bd579bb",
- "metadata": {},
- "source": [
- "### Free the GPU Memory to Prepare Merging `LoRA` Adapters with the Base Model\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e2b25dc2",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# Free the GPU memory\n",
- "del model\n",
- "del trainer\n",
- "torch.cuda.empty_cache()\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "8b9955ad",
- "metadata": {},
- "source": [
- "## Merging LoRA Adapters into the Original Model\n",
- "\n",
- "When using `LoRA`, we train only the adapters rather than the entire model. Consequently, during the\n",
- "model saving process, only the `adapter weights` are preserved, not the complete model. If we wish to save the\n",
- "entire model for easier usage with Text Generation Inference, we can merge the adapter weights into the model\n",
- "weights using the `merge_and_unload` method and then save the result with the `save_pretrained` method.\n",
- "The result is a standard model that is ready for inference.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "64d5cd68",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "import torch\n",
- "from peft import AutoPeftModelForCausalLM\n",
- "\n",
- "# Load the PEFT model on CPU\n",
- "model = AutoPeftModelForCausalLM.from_pretrained(\n",
- "    \"temp_/LChat-7b\",\n",
- "    torch_dtype=torch.float16,\n",
- "    low_cpu_mem_usage=True\n",
- ")\n",
- "\n",
- "# Merge LoRA with the base model and save\n",
- "merged_model = model.merge_and_unload()\n",
- "merged_model.save_pretrained(\"/LChat-7b\", safe_serialization=True, max_shard_size=\"2GB\")\n",
- "tokenizer.save_pretrained(\"/LChat-7b\")\n"
- ]
- },
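- {
- "cell_type": "markdown",
- "id": "ad0e5f07",
- "metadata": {},
- "source": [
- "An added smoke test (illustrative sketch, not in the original notebook): generate one reply with the merged model. Half-precision matmuls are limited on CPU, so this may need `merged_model.float()` or a GPU to run."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ad0e5f08",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# Illustrative only: a single chat-formatted generation with the merged model.\n",
- "from transformers import pipeline\n",
- "\n",
- "pipe = pipeline(\"text-generation\", model=merged_model, tokenizer=tokenizer)\n",
- "messages = [{\"role\": \"user\", \"content\": \"Hello, who are you?\"}]\n",
- "prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\n",
- "print(pipe(prompt, max_new_tokens=64)[0][\"generated_text\"])\n"
- ]
- },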
- {
- "cell_type": "markdown",
- "id": "e8f96a1d",
- "metadata": {},
- "source": [
- "### Copy all result folders from 'temp_/LChat-7b' to '/LChat-7b'"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0f28559e",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "import os\n",
- "import shutil\n",
- "\n",
- "source_folder = \"temp_/LChat-7b\"\n",
- "destination_folder = \"/LChat-7b\"\n",
- "os.makedirs(destination_folder, exist_ok=True)\n",
- "# Copy each subfolder (e.g. checkpoint directories) into the final model folder.\n",
- "for item in os.listdir(source_folder):\n",
- "    item_path = os.path.join(source_folder, item)\n",
- "    if os.path.isdir(item_path):\n",
- "        destination_path = os.path.join(destination_folder, item)\n",
- "        shutil.copytree(item_path, destination_path)\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "60bf3de1",
- "metadata": {},
- "source": [
- "### Generating a model card (README.md)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "97fe2e33",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "card = '''\n",
- "---\n",
- "license: apache-2.0\n",
- "tags:\n",
- "- generated_from_trainer\n",
- "- mistralai/Mistral\n",
- "- PyTorch\n",
- "- transformers\n",
- "- trl\n",
- "- peft\n",
- "- tensorboard\n",
- "base_model: mistralai/Mistral-7B-v0.1\n",
- "widget:\n",
- "  - example_title: Pirate!\n",
- "    messages:\n",
- "      - role: system\n",
- "        content: You are a pirate chatbot who always responds with Arr!\n",
- "      - role: user\n",
- "        content: \"There's a llama on my lawn, how can I get rid of him?\"\n",
- "    output:\n",
- "      text: >-\n",
- "        Arr! 'Tis a puzzlin' matter, me hearty! A llama on yer lawn be a rare\n",
- "        sight, but I've got a plan that might help ye get rid of 'im. Ye'll need\n",
- "        to gather some carrots and hay, and then lure the llama away with the\n",
- "        promise of a tasty treat. Once he's gone, ye can clean up yer lawn and\n",
- "        enjoy the peace and quiet once again. But beware, me hearty, for there\n",
- "        may be more llamas where that one came from! Arr!\n",
- "model-index:\n",
- "- name: LChat-7b\n",
- "  results: []\n",
- "datasets:\n",
- "- HuggingFaceH4/ultrachat_200k\n",
- "language:\n",
- "- en\n",
- "pipeline_tag: text-generation\n",
- "---\n",
- "\n",
- "# Model Card for LChat-7b:\n",
- "\n",
- "**LChat-7b** is a language model trained to act as a helpful assistant. It is a finetuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) that was trained using `SFTTrainer` on the publicly available dataset [HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k).\n",
- "\n",
- "## Training Procedure:\n",
- "\n",
- "The training code used to create this model was generated by [Menouar/LLM-FineTuning-Notebook-Generator](https://huggingface.co/spaces/Menouar/LLM-FineTuning-Notebook-Generator).\n",
- "\n",
- "## Training hyperparameters\n",
- "\n",
- "The following hyperparameters were used during the training:\n",
- "\n",
- "'''\n",
- "\n",
- "with open(\"/LChat-7b/README.md\", \"w\") as f:\n",
- "    f.write(card)\n",
- "\n",
- "args_dict = vars(args)\n",
- "\n",
- "# Append each training argument as a list item.\n",
- "with open(\"/LChat-7b/README.md\", \"a\") as f:\n",
- "    for k, v in args_dict.items():\n",
- "        f.write(f\"- {k}: {v}\")\n",
- "        f.write(\"\\n \\n\")\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6947c4c1",
- "metadata": {},
- "source": [
- "## Login to HF"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "bafb24fe",
- "metadata": {},
- "source": [
- "Replace `HF_TOKEN` with a valid token in order to push **'/LChat-7b'** to `huggingface_hub`."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e498576f",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# Install huggingface_hub\n",
- "!pip install -q huggingface_hub\n",
- "\n",
- "from huggingface_hub import login\n",
- "\n",
- "login(\n",
- "    token='HF_TOKEN',  # placeholder: replace with your own token\n",
- "    add_to_git_credential=True\n",
- ")\n",
- " "
- ]
- },
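- {
- "cell_type": "markdown",
- "id": "ad0e5f09",
- "metadata": {},
- "source": [
- "An added alternative (illustrative sketch, not in the original notebook): `huggingface_hub` also offers an interactive prompt, which avoids hard-coding a token in the notebook."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ad0e5f10",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "# Illustrative alternative to the hard-coded token above.\n",
- "from huggingface_hub import notebook_login\n",
- "notebook_login()\n"
- ]
- },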
- {
- "cell_type": "markdown",
- "id": "6f5071dd",
- "metadata": {},
- "source": [
- "## Pushing '/LChat-7b' to the Hugging Face account."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "13ba8863",
- "metadata": {},
- "outputs": [],
- "source": [
- "\n",
- "from huggingface_hub import HfApi, HfFolder\n",
- "\n",
- "# Instantiate the HfApi class\n",
- "api = HfApi()\n",
- "\n",
- "# Our Hugging Face repository\n",
- "repo_name = \"LChat-7b\"\n",
- "\n",
- "# Create a repository on the Hugging Face Hub\n",
- "repo = api.create_repo(token=HfFolder.get_token(), repo_type=\"model\", repo_id=repo_name)\n",
- "\n",
- "api.upload_folder(\n",
- "    folder_path=\"/LChat-7b\",\n",
- "    repo_id=repo.repo_id\n",
- ")\n"
- ]
- }
- ],
- "metadata": {
- "language_info": {
- "name": "python"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
- }