{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "ba5a3824",
   "metadata": {},
   "source": [
    "# Installing Required Libraries!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bb5c2ce5",
   "metadata": {},
   "source": [
    "Installing required libraries, including trl, transformers, accelerate, peft, datasets, and bitsandbytes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb17ce11",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Checks if PyTorch is installed and installs it if not.\n",
    "try:\n",
    "    import torch\n",
    "    print(\"PyTorch is installed!\")\n",
    "except ImportError:\n",
    "    print(\"PyTorch is not installed.\")\n",
    "    !pip install -q torch\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5f38ad58",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "!pip install -q --upgrade \"transformers==4.38.2\"\n",
    "!pip install -q --upgrade \"datasets==2.16.1\"\n",
    "!pip install -q --upgrade \"accelerate==0.26.1\"\n",
    "!pip install -q --upgrade \"evaluate==0.4.1\"\n",
    "!pip install -q --upgrade \"bitsandbytes==0.42.0\"\n",
    "!pip install -q --upgrade \"trl==0.7.11\"\n",
    "!pip install -q --upgrade \"peft==0.8.2\"\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "98e65745",
   "metadata": {},
   "source": [
    "# Load and Prepare the Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7cf4cbb2",
   "metadata": {},
   "source": [
    "The dataset is already formatted in a conversational format, which is supported by [trl](https://huggingface.co/docs/trl/index/), and ready for supervised finetuning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7c50d411",
   "metadata": {},
   "source": [
    "\n",
    "**Conversational format:**\n",
    "\n",
    "\n",
    "```python\n",
    "{\"messages\": [{\"role\": \"system\", \"content\": \"You are...\"}, {\"role\": \"user\", \"content\": \"...\"}, {\"role\": \"assistant\", \"content\": \"...\"}]}\n",
    "{\"messages\": [{\"role\": \"system\", \"content\": \"You are...\"}, {\"role\": \"user\", \"content\": \"...\"}, {\"role\": \"assistant\", \"content\": \"...\"}]}\n",
    "{\"messages\": [{\"role\": \"system\", \"content\": \"You are...\"}, {\"role\": \"user\", \"content\": \"...\"}, {\"role\": \"assistant\", \"content\": \"...\"}]}\n",
    "```\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "60321c78",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from datasets import load_dataset\n",
    "    \n",
    "# Load dataset from the hub\n",
    "dataset = load_dataset(\"HuggingFaceH4/ultrachat_200k\", split=\"train_sft\")\n",
    "    \n",
    "dataset = dataset.shuffle(seed=42)\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fdaa4ee",
   "metadata": {},
   "source": [
    "# Load **mistralai/Mistral-7B-v0.1** for Finetuning"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e046840e",
   "metadata": {},
   "source": [
    "\n",
    "This process involves two key steps:\n",
    "\n",
    "1. **LLM Quantization:**\n",
    "    - We first load the selected large language model (LLM).\n",
    "    - We then use the `bitsandbytes` library to quantize the model, which can significantly reduce its memory footprint.\n",
    "\n",
    "> **Note:** The memory requirements of the model scale with its size. For instance, a 7B parameter model may require \n",
    "a 24GB GPU for fine-tuning. \n",
    "\n",
    "2. **Chat Model Preparation:**\n",
    "    - To train a model for chat/conversational tasks, we need to prepare both the model and its tokenizer.\n",
    "    \n",
    "    - This involves adding special tokens to the tokenizer and the model itself. These tokens help the model \n",
    "    understand the different roles within a conversation. \n",
    "    \n",
    "    - The **trl** library provides a convenient function called `setup_chat_format` for this purpose. It performs the \n",
    "    following actions: \n",
    "    \n",
    "        * Adds special tokens to the tokenizer, such as `<|im_start|>` and `<|im_end|>`, to mark the beginning and \n",
    "        end of each message in a conversation. \n",
    "        \n",
    "        * Resizes the model's embedding layer to accommodate the new tokens.\n",
    "        \n",
    "        * Sets the tokenizer's chat template, which defines the format used to convert input data into a chat-like \n",
    "        structure. The default template is `chatml` from OpenAI.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2af96b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import torch\n",
    "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n",
    "from trl import setup_chat_format\n",
    "\n",
    "# Hugging Face model id\n",
    "model_id = \"mistralai/Mistral-7B-v0.1\"\n",
    "\n",
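    "# Quantization config: load the weights in 8-bit.\n",
    "# Note: the bnb_4bit_* options below only take effect when load_in_4bit=True.\n",
    "bnb_config = BitsAndBytesConfig(\n",
    "    load_in_8bit=True, bnb_4bit_use_double_quant=True, \n",
    "    bnb_4bit_quant_type=\"nf4\", bnb_4bit_compute_dtype=torch.bfloat16 \n",
    ")\n",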
    "\n",
    "# Load model and tokenizer\n",
    "model = AutoModelForCausalLM.from_pretrained(\n",
    "    model_id,\n",
    "    device_map=\"auto\",\n",
    "    trust_remote_code=True,\n",
    "    \n",
    "    torch_dtype=torch.bfloat16,\n",
    "    quantization_config=bnb_config\n",
    ")\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
    "tokenizer.padding_side = \"right\"\n",
    "\n",
    "\n",
    "# Set chat template to OAI chatML\n",
    "model, tokenizer = setup_chat_format(model, tokenizer)\n",
    "\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b837560",
   "metadata": {},
   "source": [
    "## Setting LoRA Config"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4617d5d0",
   "metadata": {},
   "source": [
    "The `SFTTrainer` provides native integration with `peft`, which simplifies efficient tuning of \n",
    "large language models (LLMs) with techniques such as \n",
    "[LoRA](https://magazine.sebastianraschka.com/p/practical-tips-for-finetuning-llms). The only requirement is to create \n",
    "the `LoraConfig` and pass it to the `SFTTrainer`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6244b7f",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from peft import LoraConfig\n",
    "\n",
    "peft_config = LoraConfig(\n",
    "    lora_alpha=8,\n",
    "    lora_dropout=0.05,\n",
    "    r=6,\n",
    "    bias=\"none\",\n",
    "    target_modules=\"all-linear\",\n",
    "    task_type=\"CAUSAL_LM\"\n",
    ")\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e5ffc4bd",
   "metadata": {},
   "source": [
    "## Setting the TrainingArguments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eac8898f",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Installing tensorboard to report the metrics\n",
    "!pip install -q tensorboard\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12aa9947",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from transformers import TrainingArguments\n",
    "\n",
    "args = TrainingArguments(\n",
    "    output_dir=\"temp_/LChat-7b\",\n",
    "    num_train_epochs=100,\n",
    "    per_device_train_batch_size=3,\n",
    "    gradient_accumulation_steps=2,\n",
    "    gradient_checkpointing=True,\n",
    "    gradient_checkpointing_kwargs={'use_reentrant': False},\n",
    "    optim=\"adamw_torch_fused\",\n",
    "    logging_steps=10,\n",
    "    save_strategy='epoch',\n",
    "    # Note: 0.075 is very high for LoRA fine-tuning; values around 1e-4 to 2e-4 are more common.\n",
    "    learning_rate=0.075,\n",
    "    bf16=True,\n",
    "    max_grad_norm=0.3,\n",
    "    warmup_ratio=0.1,\n",
    "    lr_scheduler_type='cosine',\n",
    "    report_to='tensorboard', \n",
    "    max_steps=-1,\n",
    "    seed=42,\n",
    "    overwrite_output_dir=True,\n",
    "    remove_unused_columns=True\n",
    ")\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c895809",
   "metadata": {},
   "source": [
    "## Setting the Supervised Finetuning Trainer (`SFTTrainer`)\n",
    "    \n",
    "This `SFTTrainer` is a wrapper around the `transformers.Trainer` class and inherits all of its attributes and methods.\n",
    "The trainer takes care of properly initializing the `PeftModel`.   \n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d269b68a",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from trl import SFTTrainer\n",
    "\n",
    "trainer = SFTTrainer(\n",
    "    model=model,\n",
    "    args=args,\n",
    "    train_dataset=dataset,\n",
    "    peft_config=peft_config,\n",
    "    max_seq_length=2048,\n",
    "    tokenizer=tokenizer,\n",
    "    packing=True,\n",
    "    dataset_kwargs={'add_special_tokens': False, 'append_concat_token': False}\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b05793a3",
   "metadata": {},
   "source": [
    "### Starting Training and Saving Model/Tokenizer\n",
    "\n",
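    "We start training the model by calling the `train()` method on the trainer instance. This starts the training \n",
    "loop and trains the model for 100 epochs. With `save_strategy='epoch'`, a checkpoint is written to the output directory \n",
    "(**'temp_/LChat-7b'**) at the end of every epoch. Because `push_to_hub` is not set in the `TrainingArguments`, nothing is \n",
    "uploaded during training; the push to the Hub happens in a separate step at the end of the notebook. \n",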
    "  \n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f56066fc",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "\n",
    "model.config.use_cache = False\n",
    "\n",
    "# start training\n",
    "trainer.train()\n",
    "\n",
    "# save the peft model\n",
    "trainer.save_model()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8bd579bb",
   "metadata": {},
   "source": [
    "### Free the GPU Memory to Prepare Merging `LoRA` Adapters with the Base Model\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2b25dc2",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "\n",
    "# Free the GPU memory\n",
    "del model\n",
    "del trainer\n",
    "torch.cuda.empty_cache()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8b9955ad",
   "metadata": {},
   "source": [
    "## Merging LoRA Adapters into the Original Model\n",
    "\n",
    "When using `LoRA`, we train only the adapters rather than the entire model. Consequently, saving the model \n",
    "stores only the adapter weights, not the complete model. To save the \n",
    "entire model for easier use with Text Generation Inference, we can merge the adapter weights into the base model \n",
    "weights with the `merge_and_unload` method and then save the result with the \n",
    "`save_pretrained` method. The result is a standalone model that is ready for inference.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64d5cd68",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import torch\n",
    "from peft import AutoPeftModelForCausalLM\n",
    "\n",
    "# Load Peft model on CPU\n",
    "model = AutoPeftModelForCausalLM.from_pretrained(\n",
    "    \"temp_/LChat-7b\",\n",
    "    torch_dtype=torch.float16,\n",
    "    low_cpu_mem_usage=True\n",
    ")\n",
    "    \n",
    "# Merge LoRA with the base model and save\n",
    "merged_model = model.merge_and_unload()\n",
    "merged_model.save_pretrained(\"/LChat-7b\", safe_serialization=True, max_shard_size=\"2GB\")\n",
    "tokenizer.save_pretrained(\"/LChat-7b\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8f96a1d",
   "metadata": {},
   "source": [
    "### Copy all result folders from 'temp_/LChat-7b' to '/LChat-7b'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0f28559e",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "import os\n",
    "import shutil\n",
    "\n",
    "source_folder = \"temp_/LChat-7b\"\n",
    "destination_folder = \"/LChat-7b\"\n",
    "os.makedirs(destination_folder, exist_ok=True)\n",
    "for item in os.listdir(source_folder):\n",
    "    item_path = os.path.join(source_folder, item)\n",
    "    if os.path.isdir(item_path):\n",
    "        destination_path = os.path.join(destination_folder, item)\n",
    "        shutil.copytree(item_path, destination_path, dirs_exist_ok=True)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "60bf3de1",
   "metadata": {},
   "source": [
    "### Generating a model card (README.md)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97fe2e33",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "card = '''\n",
    "---\n",
    "license: apache-2.0\n",
    "tags:\n",
    "- generated_from_trainer\n",
    "- mistralai/Mistral\n",
    "- PyTorch\n",
    "- transformers\n",
    "- trl\n",
    "- peft\n",
    "- tensorboard\n",
    "base_model: mistralai/Mistral-7B-v0.1\n",
    "widget:\n",
    "  - example_title: Pirate!\n",
    "    messages:\n",
    "      - role: system\n",
    "        content: You are a pirate chatbot who always responds with Arr!\n",
    "      - role: user\n",
    "        content: \"There's a llama on my lawn, how can I get rid of him?\"\n",
    "    output:\n",
    "      text: >-\n",
    "        Arr! 'Tis a puzzlin' matter, me hearty! A llama on yer lawn be a rare\n",
    "        sight, but I've got a plan that might help ye get rid of 'im. Ye'll need\n",
    "        to gather some carrots and hay, and then lure the llama away with the\n",
    "        promise of a tasty treat. Once he's gone, ye can clean up yer lawn and\n",
    "        enjoy the peace and quiet once again. But beware, me hearty, for there\n",
    "        may be more llamas where that one came from! Arr!\n",
    "model-index:\n",
    "- name: LChat-7b\n",
    "  results: []\n",
    "datasets:\n",
    "- HuggingFaceH4/ultrachat_200k\n",
    "language:\n",
    "- en\n",
    "pipeline_tag: text-generation\n",
    "---\n",
    "\n",
    "# Model Card for LChat-7b:\n",
    "\n",
    "**LChat-7b** is a language model trained to act as a helpful assistant. It is a finetuned version of [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1) that was trained using `SFTTrainer` on the publicly available dataset \n",
    "[HuggingFaceH4/ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k).\n",
    "\n",
    "## Training Procedure:\n",
    "\n",
    "The training code used to create this model was generated by [Menouar/LLM-FineTuning-Notebook-Generator](https://huggingface.co/spaces/Menouar/LLM-FineTuning-Notebook-Generator).\n",
    "\n",
    "\n",
    "\n",
    "## Training hyperparameters\n",
    "\n",
    "The following hyperparameters were used during the training:\n",
    "\n",
    "\n",
    "'''\n",
    "\n",
    "with open(\"/LChat-7b/README.md\", \"w\") as f:\n",
    "    f.write(card)\n",
    "\n",
    "args_dict = vars(args)\n",
    "\n",
    "with open(\"/LChat-7b/README.md\", \"a\") as f:\n",
    "    for k, v in args_dict.items():\n",
    "        f.write(f\"- {k}: {v}\")\n",
    "        f.write(\"\\n \\n\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6947c4c1",
   "metadata": {},
   "source": [
    "## Login to HF"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bafb24fe",
   "metadata": {},
   "source": [
    "Replace `HF_TOKEN` with a valid token in order to push **'/LChat-7b'** to `huggingface_hub`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e498576f",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "# Install huggingface_hub\n",
    "!pip install -q huggingface_hub\n",
    "    \n",
    "from huggingface_hub import login\n",
    "    \n",
    "login(\n",
    "        token='HF_TOKEN',  # placeholder: replace with your own access token\n",
    "        add_to_git_credential=True\n",
    ")\n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f5071dd",
   "metadata": {},
   "source": [
    "## Pushing '/LChat-7b' to the Hugging Face account."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13ba8863",
   "metadata": {},
   "outputs": [],
   "source": [
    "\n",
    "from huggingface_hub import HfApi\n",
    "\n",
    "# Instantiate the HfApi class\n",
    "api = HfApi()\n",
    "\n",
    "# Our Hugging Face repository\n",
    "repo_name = \"LChat-7b\"\n",
    "\n",
    "# Create the repository on the Hugging Face Hub (the token stored by login() is used automatically)\n",
    "repo = api.create_repo(repo_id=repo_name, repo_type=\"model\", exist_ok=True)\n",
    "\n",
    "api.upload_folder(\n",
    "    folder_path=\"/LChat-7b\",\n",
    "    repo_id=repo.repo_id\n",
    ")\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}