opyate committed
Commit
8af234b
1 Parent(s): 385a891

Upload notebook.ipynb

Files changed (1)
  1. notebook.ipynb +743 -0
notebook.ipynb ADDED
@@ -0,0 +1,743 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Dataset"
8
+ ]
9
+ },
10
+ {
11
+ "cell_type": "code",
12
+ "execution_count": 2,
13
+ "metadata": {},
14
+ "outputs": [
15
+ {
16
+ "name": "stderr",
17
+ "output_type": "stream",
18
+ "text": [
19
+ "/home/opyate/Documents/code/datura-ai/.venv/lib/python3.12/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
20
+ " from .autonotebook import tqdm as notebook_tqdm\n"
21
+ ]
22
+ }
23
+ ],
24
+ "source": [
25
+ "import datasets\n",
26
+ "\n",
27
+ "dataset = datasets.load_dataset(\"tiiuae/falcon-refinedweb\", streaming=True, split=\"train\")"
28
+ ]
29
+ },
30
+ {
31
+ "cell_type": "code",
32
+ "execution_count": 30,
33
+ "metadata": {},
34
+ "outputs": [
35
+ {
36
+ "name": "stdout",
37
+ "output_type": "stream",
38
+ "text": [
39
+ "{'content': 'these birches can be found in many places in Europe - the photos is from a short trip to Baden-Baden in 2007. the clouds in the background are the messengers of the storm Kyrill. here are some more moments of the trip: Baden-Baden.\\n-\\n“ast/ray” is a bilingual wordplay: “ast” means “twig” in German. and while “Baden-Baden” sounds like wordplay, too, it is the actual name of a rather well-know spa town that also dates back to Roman times. “Bad” is the German word for “bath”.\\nMirror effect turned out nice. I like', 'url': 'http://100parts.wordpress.com/2012/08/04/astray-baden-baden-day-31/', 'timestamp': datetime.datetime(2013, 5, 18, 10, 42), 'dump': 'CC-MAIN-2013-20', 'segment': '1368696382261', 'image_urls': []}\n",
40
+ "{'content': 'Watch Survivor Redemption Island Season 22 Episode 11: A Mystery Package Online S22e11 Free Stream Megavideo\\nArticle by StreamThatSeries\\nHorray!time for another dose of very exciting reality series with lots of twists.You must watch survivor redemption island season 22 episode 11 tonight with a title of “Mystery Package” coz surely this will stir things up on the merge tribe murlonio. But in case you will the episode on your television set at home , just follow through the link here to watch survivor redemption island season 22 episode 11 a mystery package online for free at anyhotstuff.com\\nHere’s some sort or recaps and tidbits for last week’s episode: 5. Rice Capades – I don’t really know where to begin with talking about Phillip this week. I know that I certainly don’t want to use the c-word to describe Phillip’s behavior. However, through all of the histrionics in this episode, one thing is clear: You can no longer make the argument that Phillip is playing a game with any sort of strategy.\\nPhillip told us last week that his intention was to get Boston Rob to want Phillip next to him at the end of the game. However, at this point Phillip is not showing any signs that winning this game is his objective at all. Phillip is certainly entitled to having an opinion and feeling the way he does but part of being a good Survivor player is occasionally hiding your feelings in the interests of winning the game. Phillip’s aggressive behavior has all but assured himself a zero percent chance at the million dollars. Even if Phillip makes it to the final three, he would need five votes to win the game. Considering that there will likely be five Zapatera tribe members on the jury, this incident is going to stick to Phillip like a certain color on rice.\\n4. The Great Divide – Besides for Phillip, I think the most unique thing about this season of Survivor is the lack of a merge. Oh, it technically happened, but this is the only season of Survivor I can remember where the merged tribe not only has a separate alliance but separate shelters and separate food rations too. Actually, this isn’t that uncommon, I know a lot of people in bad marriages who live the same way. The big question going forward will be how much will the Zapateras hold Boston Rob accountable for being shut out to this degree? Phillip has done a lot to take the heat off of Rob but don’t be surprised if at the end of the game, the Zapatera tribe continues to act as a group and completely vote together for whomever treated them best – which is why I maintain one of the three Ometepe women have a great shot to win this game.\\n3. Saving Sheppard – This season, I have spent much more time breaking down Boston Rob’s decisions in the game than any of the other players. In my opinion, Rob has far more control in this game than any other Survivor may have had at this point in history. Since Rob is controlling the vote, did Rob make the right move by keeping Phillip this week? On the one hand, he realizes that getting rid of Phillip would end a lot of the drama around camp and may even create some goodwill among the remaining Zapateras. Instead, Rob chose to keep Phillip around for another week and I agree with his decision. Phillip Sheppard is the best thing that happened to Boston Rob this season because Phillip���s distractions keep everybody from thinking about the game. 
When Phillip is going off about rice, feathers or kung fu, nobody is ever asking themselves about their position in the tribe or some big move they’re going to make – and that’s exactly how Boston Rob wants it to be.\\n2. Tribal Counseling – It was no surprise that the feud between Phillip and Steve spilled over in to this week’s tribal council. I thought that Jeff Probst showed why he is the best host on television by exploring both sides of the debate. What I still don’t understand is what happened to Phillip’s shirt? Has there ever been a Survivor contestant to attend tribal council topless before? Now with Julie gone the prospects seem pretty slim that Phillip will ever find his bathing suit. Phillip now can only hope to win a reward at some point to find the first clue to the hidden bathing suit.\\n1. Three’s Company – This week we had our first ever three-way duel on Redemption Island, which ended in Mike and Matt moving on and David getting eliminated. I think the top 2 people advancing at Redemption Island seems like it would help Matt in his quest to return to the game since it seems unlikely he would ever come in last (unless the duel involved having a strategy of some sort). Unfortunately, it looks like Redemption Island is starting to really take its toll on Matt. In Matt’s prayers he says that he doesn’t want to be on Survivor anymore but is simply carrying out God’s will. You would think that having an extra person on Redemption Island might help cheer Matt up a little bit, but apparently Mike Chisel isn’t that great of a roommate.\\nWhat you waiting for, watch Survivor: Redemption Island season 22 Episode 11 a Mystery Package Survivor: Redemption Tropical isle Year twenty-two Event 11: We all Can’t stand A lot of our Tribe is currently approaching globally regarding tv set landscape an important subject due to this usually are We all Dislike Much of our Tribe. So that it signifies usual account a single tribe has long been dislike other tribe the reason why? Inside the landscape with the Survivor: Redemption Tropical isle Month or year twenty-two Show 11 On line aboard just what occur inside this landscape. He together with Kristina duel at Redemption Island. On the Ometepe campy, tribe unity commences to help you out unravel. Thus test in order to savor Survivor: Redemption Tropical isle Length twenty-two Show 11 online. Survivor: Redemption Tropical area Period twenty-two Show 11:. Of which CBS Survivor: Redemption Tropical isle 22?11 show on tv alongside subject We all Can’t stand Many of our Tribe Survivor: Redemption Tropical isle Season twenty-two Event 11: EVERY Puzzle Deal. This particular CBS Survivor: Redemption Tropical isle 22?11 show on tv combined with subject ANY Puzzle Package deal shown about Saturday, Annual percentage rates 29 2011 for 08: 00 EVENING HOURS. This can be a new conclusion regarding Survivor: Redemption Tropical isle Year or so twenty-two Event 11: VIRTUALLY ANY Puzzle Deal: About Redemption Tropical isle, He will be having a mechanical disappointment, one more castaway will be voted through your video game. Previous shows: Episode 10 “Rice Wars” Phillip and also Steve clash. Episode 9 “The Colleague System” Rob tries to be able to secure an Ometepe connections, but Grant may really do the one to jeopardize the item. Survivor is an American version of this Survivor reality television sport show, itself derived from the Swedish television series Journey Robinson originally created in 1997 by Charlie Parsons. 
This series premiered on May perhaps 31, 2000 on CBS. It can be hosted by veteran television system personality, reporter and one-time performance show emcee Jeff Probst, who might be also an executive designer, and also executive that is caused by Mark Burnett and main creator Charlie Parsons. WATCH HERE : Watch Survivor: Redemption Island Season 22 Episode 11 The show maroons a small grouping strangers (as one or longer tribes) in a destitute locale, where they ought to provide food, water, open fire, and shelter for themselves, while competing in challenges to earn either a reward, or an immunity from expulsion from the game yearly of the successive votes for elimination. While a great deal rarer than elimination by vote, medical conditions, such as injury or infection, need eliminated several contestants. The last 2 or three survivors face a jury historically made from at least the final seven players voted shut off. That jury interrogates one last few, and then votes for those winner of the distinction of Sole Survivor in addition to a million dollar prize. The first U. S. season of Survivor followed identical general format as the particular Swedish series, but, subsequently, the show has introduced several twists over the core rules to keep the players on their toes and prevent players from influenced by strategies that succeeded with prior seasons. These alters have included tribal buttons, seasons starting with well over two tribes, the abil vity to exile a gamer from a tribe for a few days, and hidden immun ity idols that players are able to use to save themselves for tribal council. Season 22 It season’s cast features typically the return of Rob MICHAEL. and Russell. This is Russell’s third time for the show and Rob METERS. ‘s fourth. It is the very first time in the show’s hist ory that your cont estant has played several individual times. This per iod also feat ures two past NFL players (Grant and also Steve). CBS today annou nced 16 of 18 castaw ays who will compete against each additi onal on SURV IVOR: REDEMPTI ON TROPICAL ISLAND, when the Em my Award-win ning series returns as for the 22nd season, Wednesday, February. 16 (8: 00-9: 00 EVENING, ET/PT) on the CBS Video Network. Two of your 18 cas taways, to be reve aled later this 7-day period, are form er casta ways who will return to seek redemption. This edition of SURVI VOR will include a new twist when, for when, castaways who have been elimi nated th rough the ga me will have possibi lity to seek rede mption and re turn for the opport unity to win the mil lion mone tary prize. Each week at Tribal Coun cil beca use a cast away is vot ed from, they will be brou ght to an isola ted isl and gen erally known as “Rede mption Isla nd, ” where they ‘ll live alo ne in ex ile. To last on Redempt ion Isla nd, they must com pete in a duel about the next person elimin ated wit hin Tr ibal Cou ncil and pumped to the Island. The winner of each duel earns an appro priate to contin ue figh ting for ena ble you to retu rn to the game and the chance to compete for the subj ect of Sole Sur vivor; the part icular los er is sent hou se. The bat tle unfolds in Nica ragua where 18 casta ways will pos sibly be divi ded into two Trib es of nine: the Omet epe Tr ibe and then the Zapatera Tribe. The tr ibes are derived from indivi duals fro m all dif ferent bac kgrounds aid ed by the same ultimate goal: for being the Sole Survi vor. 
While 16 of yo ur cont estants are new to the competi tion, two are former castaways that will be gi ven an other opportu nity to com pete for the million dollar prize andf the other last shot at redem ption.\\nAbout the Author\\nWatch FULL EPISODE here for Free', 'url': 'http://100percentwinnersblog.com/watch-survivor-redemption-island-season-22-episode-11-a-mystery-package-online-s22e11-free-stream-megavideo/', 'timestamp': datetime.datetime(2013, 5, 18, 11, 2, 3), 'dump': 'CC-MAIN-2013-20', 'segment': '1368696382261', 'image_urls': []}\n",
41
+ "{'content': 'Pesky?\\nthis was a high school project for a president campaign in our government class, yes thats him, for a school project, you guys are crazy\\ni know his dad from work. very cool and funny guy!!', 'url': 'http://101squadron.com/blog/2007/05/pesky-peculiarities-of-css.html/comment-page-1', 'timestamp': datetime.datetime(2013, 5, 18, 10, 21, 35), 'dump': 'CC-MAIN-2013-20', 'segment': '1368696382261', 'image_urls': [['http://101squadron.com/uploaded_images/Conger_89_cal-757679.jpg', None], [\"http://101squadron.com/uploaded_images/Hammerin'-Hank-Conger-795118.jpg\", None], ['http://101squadron.com/uploaded_images/hconger06108180bm-753187.jpg', None], ['http://101squadron.com/uploaded_images/HHMex8Au-794539.jpg', None], ['http://101squadron.com/uploaded_images/Conger-back-712123.jpg', None]]}\n",
42
+ "{'content': 'metalkingdom.net [ 80′s @ 8 Feature Video – Big City Nights [VIDEO] By Chris Chapman March 13, 2012 It\\'s time to rock out to those talented Germans Rudolf and Klaus from the Scorpions! \"Big City Nights\" wasn\\'t as big as \"Rock You Like A Hurricane\" or \"Winds Of Change\", but it still stands the test of time. Read More Category: 80\\'s @ 8 Featured Video Tags: 80\\'s, big city nights, love at first sting, Scorpions Send to a friend! Print this page Share on Facebook Share on Twitter Pin it! Reddit This!', 'url': 'http://1037theloon.com/tags/scorpions/', 'timestamp': datetime.datetime(2013, 5, 18, 10, 21, 51), 'dump': 'CC-MAIN-2013-20', 'segment': '1368696382261', 'image_urls': []}\n",
43
+ "{'content': 'Splice Review\\nBlack Ops Escalation Map Pack [VIDEO]\\nScream 4 Review-No Spoilers\\nBest seem\\nThrashman’s Metal Pick Of The Week\\nNightmare On Elm Street [VIDEO]\\n2011 Dodge Charger\\nMr Skin’s Anatomy Awards\\n', 'url': 'http://1063thebuzz.com/category/reviews/page/7/', 'timestamp': datetime.datetime(2013, 5, 18, 10, 31, 9), 'dump': 'CC-MAIN-2013-20', 'segment': '1368696382261', 'image_urls': [['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'rotten-sound-cursed-large-promo-album-pic rotten-sound-cursed-large-promo-album-pic'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'rotten-sound-cursed-large-promo-album-pic rotten-sound-cursed-large-promo-album-pic'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'Twicebroken and Joshua David Twicebroken and Joshua David'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'splice dimension films'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'hqdefault24 hqdefault24'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'radioclown radioclown'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'hqdefault15 hqdefault15'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'Life On Kandahar Air Base Life On Kandahar Air Base'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'Havok-TimeIsUp-300x300 Havok-TimeIsUp-300x300'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'nightmare-300x244 freddy k'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', '2011-dodge-charger-embed-4 2011-dodge-charger-embed-4'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'Mr Skin Mr Skin'], ['https://s3.amazonaws.com/tsm-images/global/1x1.gif', 'hqdefault10 hqdefault10']]}\n"
44
+ ]
45
+ }
46
+ ],
47
+ "source": [
48
+ "dataset\n",
49
+ "\n",
50
+ "# show the first few examples\n",
51
+ "for example in dataset.take(5):\n",
52
+ " print(example)\n"
53
+ ]
54
+ },
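+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note: with `streaming=True` the records arrive in storage order (the five samples above all come from the same 2013 crawl segment, with URLs in alphabetical order). For a more representative sample, the stream can be shuffled through a buffer first; a minimal sketch using the `datasets` streaming API:\n",
+ "\n",
+ "```python\n",
+ "# approximate shuffle for an IterableDataset: fill a buffer, sample from it\n",
+ "shuffled = dataset.shuffle(seed=42, buffer_size=10_000)\n",
+ "for example in shuffled.take(2):\n",
+ "    print(example[\"url\"])\n",
+ "```"
+ ]
+ },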
55
+ {
56
+ "cell_type": "markdown",
57
+ "metadata": {},
58
+ "source": [
59
+ "# Model and tokenizer"
60
+ ]
61
+ },
62
+ {
63
+ "cell_type": "code",
64
+ "execution_count": 4,
65
+ "metadata": {},
66
+ "outputs": [
67
+ {
68
+ "name": "stderr",
69
+ "output_type": "stream",
70
+ "text": [
71
+ "You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.\n",
72
+ "Loading checkpoint shards: 100%|██████████| 2/2 [00:02<00:00, 1.07s/it]\n"
73
+ ]
74
+ }
75
+ ],
76
+ "source": [
77
+ "import torch\n",
78
+ "from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig\n",
79
+ "\n",
80
+ "model_id = \"luodian/llama-7b-hf\"\n",
81
+ "bnb_config = BitsAndBytesConfig(\n",
82
+ " load_in_4bit=True,\n",
83
+ " bnb_4bit_compute_dtype=torch.bfloat16\n",
84
+ ")\n",
85
+ "\n",
86
+ "tokenizer = AutoTokenizer.from_pretrained(model_id)\n",
87
+ "tokenizer.pad_token = tokenizer.eos_token\n",
88
+ "\n",
89
+ "model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, \n",
90
+ " device_map=\"auto\")"
91
+ ]
92
+ },
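+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "A quick sanity check that the 4-bit load worked (a sketch; `get_memory_footprint` counts parameter and buffer memory only, not activations):\n",
+ "\n",
+ "```python\n",
+ "# ~7B params at roughly half a byte each, plus the layers kept in higher precision\n",
+ "print(f\"{model.get_memory_footprint() / 1e9:.2f} GB\")\n",
+ "```"
+ ]
+ },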
93
+ {
94
+ "cell_type": "code",
95
+ "execution_count": 5,
96
+ "metadata": {},
97
+ "outputs": [
98
+ {
99
+ "name": "stdout",
100
+ "output_type": "stream",
101
+ "text": [
102
+ "Number of tokens in sample: 150\n",
103
+ "Number of tokens in sample: 2731\n",
104
+ "Number of tokens in sample: 52\n",
105
+ "Number of tokens in sample: 162\n",
106
+ "Number of tokens in sample: 78\n",
107
+ "Number of tokens in sample: 257\n",
108
+ "Number of tokens in sample: 1074\n",
109
+ "Number of tokens in sample: 505\n",
110
+ "Number of tokens in sample: 592\n",
111
+ "Number of tokens in sample: 932\n"
112
+ ]
113
+ }
114
+ ],
115
+ "source": [
116
+ "for example in dataset.take(10):\n",
117
+ " token_count = len(tokenizer.encode(example['content']))\n",
118
+ " print(f\"Number of tokens in sample: {token_count}\") "
119
+ ]
120
+ },
121
+ {
122
+ "cell_type": "markdown",
123
+ "metadata": {},
124
+ "source": [
125
+ "# Function to calculate validation loss"
126
+ ]
127
+ },
128
+ {
129
+ "cell_type": "code",
130
+ "execution_count": 5,
131
+ "metadata": {},
132
+ "outputs": [],
133
+ "source": [
134
+ "def calculate_validation_loss(model, dataloader):\n",
135
+ " model.eval()\n",
136
+ " total_loss = 0\n",
137
+ "\n",
138
+ " with torch.no_grad():\n",
139
+ " for batch in dataloader:\n",
140
+ " batch = {k: v.to(model.device) for k, v in batch.items()}\n",
141
+ " outputs = model(**batch, labels=batch[\"input_ids\"])\n",
142
+ " loss = outputs.loss\n",
143
+ " total_loss += loss.item()\n",
144
+ "\n",
145
+ " average_loss = total_loss / len(dataloader)\n",
146
+ " return average_loss"
147
+ ]
148
+ },
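+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One caveat: `labels=batch[\"input_ids\"]` also scores the padding positions, and because `pad_token = eos_token` those carry real token ids rather than the ignored index `-100`, which inflates the loss on short documents. A sketch of masking them out via the attention mask:\n",
+ "\n",
+ "```python\n",
+ "labels = batch[\"input_ids\"].clone()\n",
+ "labels[batch[\"attention_mask\"] == 0] = -100  # -100 is ignored by the LM loss\n",
+ "outputs = model(**batch, labels=labels)\n",
+ "```"
+ ]
+ },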
149
+ {
150
+ "cell_type": "markdown",
151
+ "metadata": {},
152
+ "source": [
153
+ "# Prepare data"
154
+ ]
155
+ },
156
+ {
157
+ "cell_type": "code",
158
+ "execution_count": 9,
159
+ "metadata": {},
160
+ "outputs": [],
161
+ "source": [
162
+ "def tokenize_function(examples):\n",
163
+ " return tokenizer(examples[\"content\"], padding=\"max_length\", truncation=True, max_length=512) "
164
+ ]
165
+ },
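+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "With `padding=\"max_length\"` and `truncation=True`, every example comes back as exactly 512 tokens: longer documents (e.g. the 2,731-token sample above) are cut off, shorter ones are padded. A quick check (sketch):\n",
+ "\n",
+ "```python\n",
+ "enc = tokenize_function({\"content\": \"a short document\"})\n",
+ "print(len(enc[\"input_ids\"]))  # 512, regardless of input length\n",
+ "```"
+ ]
+ },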
166
+ {
167
+ "cell_type": "code",
168
+ "execution_count": 13,
169
+ "metadata": {},
170
+ "outputs": [],
171
+ "source": [
172
+ "sample_size = 1000 # small, for a quick experiment\n",
173
+ "\n",
174
+ "# Approximate 80/20 split while streaming\n",
175
+ "train_dataset = []\n",
176
+ "validation_dataset = []\n",
177
+ "for i, item in enumerate(dataset):\n",
178
+ " if i % 5 == 0: # Every 5th item goes to validation (approximately 20%)\n",
179
+ " validation_dataset.append(item)\n",
180
+ " else:\n",
181
+ " train_dataset.append(item)\n",
182
+ " if len(train_dataset) >= sample_size and len(validation_dataset) >= (sample_size // 4): \n",
183
+ " # Stop once we have enough samples for both training and validation\n",
184
+ " break\n",
185
+ "\n",
186
+ "# Tokenize the training and validation samples\n",
187
+ "tokenized_train_dataset = []\n",
188
+ "for item in train_dataset:\n",
189
+ " tokenized_item = tokenize_function(item)\n",
190
+ " tokenized_train_dataset.append(tokenized_item)\n",
191
+ "\n",
192
+ "tokenized_validation_dataset = []\n",
193
+ "for item in validation_dataset:\n",
194
+ " tokenized_item = tokenize_function(item)\n",
195
+ " tokenized_validation_dataset.append(tokenized_item)\n",
196
+ "\n",
197
+ "# Convert to Dataset objects if needed\n",
198
+ "from datasets import Dataset\n",
199
+ "tokenized_train_dataset = Dataset.from_list(tokenized_train_dataset)\n",
200
+ "tokenized_validation_dataset = Dataset.from_list(tokenized_validation_dataset)\n",
201
+ "\n",
202
+ "# Convert the tokenized datasets to PyTorch tensors\n",
203
+ "tokenized_train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])\n",
204
+ "tokenized_validation_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])\n"
205
+ ]
206
+ },
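+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The same split can also be expressed lazily with the streaming API, avoiding the intermediate Python lists; a sketch using `skip`/`take`:\n",
+ "\n",
+ "```python\n",
+ "# first 250 records for validation, the next 1,000 for training\n",
+ "validation_stream = dataset.take(sample_size // 4)\n",
+ "train_stream = dataset.skip(sample_size // 4).take(sample_size)\n",
+ "```"
+ ]
+ },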
207
+ {
208
+ "cell_type": "markdown",
209
+ "metadata": {},
210
+ "source": [
211
+ "# Before training"
212
+ ]
213
+ },
214
+ {
215
+ "cell_type": "markdown",
216
+ "metadata": {},
217
+ "source": [
218
+ "## Calculate validation loss"
219
+ ]
220
+ },
221
+ {
222
+ "cell_type": "code",
223
+ "execution_count": 14,
224
+ "metadata": {},
225
+ "outputs": [
226
+ {
227
+ "name": "stderr",
228
+ "output_type": "stream",
229
+ "text": [
230
+ "We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)\n"
231
+ ]
232
+ },
233
+ {
234
+ "name": "stdout",
235
+ "output_type": "stream",
236
+ "text": [
237
+ "Validation loss before training: 9.958484322547912\n"
238
+ ]
239
+ }
240
+ ],
241
+ "source": [
242
+ "from torch.utils.data import DataLoader\n",
243
+ "\n",
244
+ "validation_dataloader = DataLoader(tokenized_validation_dataset, batch_size=2) \n",
245
+ "loss_before_training = calculate_validation_loss(model, validation_dataloader)\n",
246
+ "print(f\"Validation loss before training: {loss_before_training}\")"
247
+ ]
248
+ },
249
+ {
250
+ "cell_type": "markdown",
251
+ "metadata": {},
252
+ "source": [
253
+ "## Benchmark"
254
+ ]
255
+ },
256
+ {
257
+ "cell_type": "markdown",
258
+ "metadata": {},
259
+ "source": [
260
+ "Run this command:\n",
261
+ "\n",
262
+ "```\n",
263
+ "accelerate launch -m lm_eval --model hf \\\n",
264
+ " --model_args pretrained=luodian/llama-7b-hf,load_in_4bit=True,dtype=\"bfloat16\" \\\n",
265
+ " --tasks mmlu,hellaswag,truthfulqa \\\n",
266
+ " --batch_size auto:4 \\\n",
267
+ " --log_samples \\\n",
268
+ " --output_path results/before-training\n",
269
+ "```\n",
270
+ "\n",
271
+ "Output:\n",
272
+ "\n",
273
+ "```\n",
274
+ "hf (pretrained=luodian/llama-7b-hf,load_in_4bit=True,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (16,64,64,64)\n",
275
+ "| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|\n",
276
+ "|---------------------------------------|------:|------|-----:|-----------|---|-------:|---|-----:|\n",
277
+ "|hellaswag | 1|none | 0|acc |↑ | 0.5646|± |0.0049|\n",
278
+ "| | |none | 0|acc_norm |↑ | 0.7498|± |0.0043|\n",
279
+ "|mmlu | 2|none | |acc |↑ | 0.3126|± |0.0039|\n",
280
+ "| - humanities | 2|none | |acc |↑ | 0.3101|± |0.0067|\n",
281
+ "| - formal_logic | 1|none | 0|acc |↑ | 0.2778|± |0.0401|\n",
282
+ "| - high_school_european_history | 1|none | 0|acc |↑ | 0.3939|± |0.0382|\n",
283
+ "| - high_school_us_history | 1|none | 0|acc |↑ | 0.3824|± |0.0341|\n",
284
+ "| - high_school_world_history | 1|none | 0|acc |↑ | 0.3671|± |0.0314|\n",
285
+ "| - international_law | 1|none | 0|acc |↑ | 0.3884|± |0.0445|\n",
286
+ "| - jurisprudence | 1|none | 0|acc |↑ | 0.3519|± |0.0462|\n",
287
+ "| - logical_fallacies | 1|none | 0|acc |↑ | 0.3067|± |0.0362|\n",
288
+ "| - moral_disputes | 1|none | 0|acc |↑ | 0.3555|± |0.0258|\n",
289
+ "| - moral_scenarios | 1|none | 0|acc |↑ | 0.2380|± |0.0142|\n",
290
+ "| - philosophy | 1|none | 0|acc |↑ | 0.3441|± |0.0270|\n",
291
+ "| - prehistory | 1|none | 0|acc |↑ | 0.3364|± |0.0263|\n",
292
+ "| - professional_law | 1|none | 0|acc |↑ | 0.2803|± |0.0115|\n",
293
+ "| - world_religions | 1|none | 0|acc |↑ | 0.4503|± |0.0382|\n",
294
+ "| - other | 2|none | |acc |↑ | 0.3399|± |0.0084|\n",
295
+ "| - business_ethics | 1|none | 0|acc |↑ | 0.3300|± |0.0473|\n",
296
+ "| - clinical_knowledge | 1|none | 0|acc |↑ | 0.3208|± |0.0287|\n",
297
+ "| - college_medicine | 1|none | 0|acc |↑ | 0.2775|± |0.0341|\n",
298
+ "| - global_facts | 1|none | 0|acc |↑ | 0.3500|± |0.0479|\n",
299
+ "| - human_aging | 1|none | 0|acc |↑ | 0.3094|± |0.0310|\n",
300
+ "| - management | 1|none | 0|acc |↑ | 0.2524|± |0.0430|\n",
301
+ "| - marketing | 1|none | 0|acc |↑ | 0.3718|± |0.0317|\n",
302
+ "| - medical_genetics | 1|none | 0|acc |↑ | 0.4200|± |0.0496|\n",
303
+ "| - miscellaneous | 1|none | 0|acc |↑ | 0.4291|± |0.0177|\n",
304
+ "| - nutrition | 1|none | 0|acc |↑ | 0.3235|± |0.0268|\n",
305
+ "| - professional_accounting | 1|none | 0|acc |↑ | 0.2801|± |0.0268|\n",
306
+ "| - professional_medicine | 1|none | 0|acc |↑ | 0.2500|± |0.0263|\n",
307
+ "| - virology | 1|none | 0|acc |↑ | 0.2952|± |0.0355|\n",
308
+ "| - social sciences | 2|none | |acc |↑ | 0.3052|± |0.0083|\n",
309
+ "| - econometrics | 1|none | 0|acc |↑ | 0.2544|± |0.0410|\n",
310
+ "| - high_school_geography | 1|none | 0|acc |↑ | 0.2828|± |0.0321|\n",
311
+ "| - high_school_government_and_politics| 1|none | 0|acc |↑ | 0.3161|± |0.0336|\n",
312
+ "| - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.2538|± |0.0221|\n",
313
+ "| - high_school_microeconomics | 1|none | 0|acc |↑ | 0.2395|± |0.0277|\n",
314
+ "| - high_school_psychology | 1|none | 0|acc |↑ | 0.3358|± |0.0202|\n",
315
+ "| - human_sexuality | 1|none | 0|acc |↑ | 0.2901|± |0.0398|\n",
316
+ "| - professional_psychology | 1|none | 0|acc |↑ | 0.3333|± |0.0191|\n",
317
+ "| - public_relations | 1|none | 0|acc |↑ | 0.3000|± |0.0439|\n",
318
+ "| - security_studies | 1|none | 0|acc |↑ | 0.2286|± |0.0269|\n",
319
+ "| - sociology | 1|none | 0|acc |↑ | 0.4179|± |0.0349|\n",
320
+ "| - us_foreign_policy | 1|none | 0|acc |↑ | 0.3900|± |0.0490|\n",
321
+ "| - stem | 2|none | |acc |↑ | 0.2965|± |0.0081|\n",
322
+ "| - abstract_algebra | 1|none | 0|acc |↑ | 0.2600|± |0.0441|\n",
323
+ "| - anatomy | 1|none | 0|acc |↑ | 0.3407|± |0.0409|\n",
324
+ "| - astronomy | 1|none | 0|acc |↑ | 0.3487|± |0.0388|\n",
325
+ "| - college_biology | 1|none | 0|acc |↑ | 0.3403|± |0.0396|\n",
326
+ "| - college_chemistry | 1|none | 0|acc |↑ | 0.2700|± |0.0446|\n",
327
+ "| - college_computer_science | 1|none | 0|acc |↑ | 0.3300|± |0.0473|\n",
328
+ "| - college_mathematics | 1|none | 0|acc |↑ | 0.3000|± |0.0461|\n",
329
+ "| - college_physics | 1|none | 0|acc |↑ | 0.1765|± |0.0379|\n",
330
+ "| - computer_security | 1|none | 0|acc |↑ | 0.3500|± |0.0479|\n",
331
+ "| - conceptual_physics | 1|none | 0|acc |↑ | 0.3106|± |0.0303|\n",
332
+ "| - electrical_engineering | 1|none | 0|acc |↑ | 0.3310|± |0.0392|\n",
333
+ "| - elementary_mathematics | 1|none | 0|acc |↑ | 0.2619|± |0.0226|\n",
334
+ "| - high_school_biology | 1|none | 0|acc |↑ | 0.3645|± |0.0274|\n",
335
+ "| - high_school_chemistry | 1|none | 0|acc |↑ | 0.2512|± |0.0305|\n",
336
+ "| - high_school_computer_science | 1|none | 0|acc |↑ | 0.3800|± |0.0488|\n",
337
+ "| - high_school_mathematics | 1|none | 0|acc |↑ | 0.2481|± |0.0263|\n",
338
+ "| - high_school_physics | 1|none | 0|acc |↑ | 0.2450|± |0.0351|\n",
339
+ "| - high_school_statistics | 1|none | 0|acc |↑ | 0.2639|± |0.0301|\n",
340
+ "| - machine_learning | 1|none | 0|acc |↑ | 0.3125|± |0.0440|\n",
341
+ "|truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.2766|± |0.0157|\n",
342
+ "| | |none | 0|bleu_diff |↑ |-10.2902|± |0.8441|\n",
343
+ "| | |none | 0|bleu_max |↑ | 26.5005|± |0.8063|\n",
344
+ "| | |none | 0|rouge1_acc |↑ | 0.2619|± |0.0154|\n",
345
+ "| | |none | 0|rouge1_diff|↑ |-13.4103|± |0.8561|\n",
346
+ "| | |none | 0|rouge1_max |↑ | 51.0861|± |0.8835|\n",
347
+ "| | |none | 0|rouge2_acc |↑ | 0.2240|± |0.0146|\n",
348
+ "| | |none | 0|rouge2_diff|↑ |-15.4705|± |1.0517|\n",
349
+ "| | |none | 0|rouge2_max |↑ | 35.0729|± |1.0250|\n",
350
+ "| | |none | 0|rougeL_acc |↑ | 0.2619|± |0.0154|\n",
351
+ "| | |none | 0|rougeL_diff|↑ |-13.6375|± |0.8721|\n",
352
+ "| | |none | 0|rougeL_max |↑ | 48.4303|± |0.8983|\n",
353
+ "|truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2069|± |0.0142|\n",
354
+ "|truthfulqa_mc2 | 2|none | 0|acc |↑ | 0.3252|± |0.0131|\n",
355
+ "\n",
356
+ "| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|\n",
357
+ "|------------------|------:|------|------|------|---|-----:|---|-----:|\n",
358
+ "|mmlu | 2|none | |acc |↑ |0.3126|± |0.0039|\n",
359
+ "| - humanities | 2|none | |acc |↑ |0.3101|± |0.0067|\n",
360
+ "| - other | 2|none | |acc |↑ |0.3399|± |0.0084|\n",
361
+ "| - social sciences| 2|none | |acc |↑ |0.3052|± |0.0083|\n",
362
+ "| - stem | 2|none | |acc |↑ |0.2965|± |0.0081|\n",
363
+ "\n",
364
+ "```\n",
365
+ "\n"
366
+ ]
367
+ },
368
+ {
369
+ "cell_type": "markdown",
370
+ "metadata": {},
371
+ "source": [
372
+ "# Training"
373
+ ]
374
+ },
375
+ {
376
+ "cell_type": "markdown",
377
+ "metadata": {},
378
+ "source": [
379
+ "## Define training loop"
380
+ ]
381
+ },
382
+ {
383
+ "cell_type": "code",
384
+ "execution_count": 15,
385
+ "metadata": {},
386
+ "outputs": [],
387
+ "source": [
388
+ "from transformers import AdamW, get_linear_schedule_with_warmup\n",
389
+ "\n",
390
+ "optimizer = AdamW(model.parameters(), lr=5e-5)\n",
391
+ "\n",
392
+ "num_epochs = 3\n",
393
+ "num_training_steps = num_epochs * len(tokenized_train_dataset)\n",
394
+ "lr_scheduler = get_linear_schedule_with_warmup(\n",
395
+ " optimizer, num_warmup_steps=0, num_training_steps=num_training_steps\n",
396
+ ")"
397
+ ]
398
+ },
399
+ {
400
+ "cell_type": "markdown",
401
+ "metadata": {},
402
+ "source": [
403
+ "## Dataloader for training"
404
+ ]
405
+ },
406
+ {
407
+ "cell_type": "code",
408
+ "execution_count": 16,
409
+ "metadata": {},
410
+ "outputs": [],
411
+ "source": [
412
+ "from torch.utils.data import DataLoader\n",
413
+ "\n",
414
+ "train_dataloader = DataLoader(tokenized_train_dataset, shuffle=True, batch_size=2) "
415
+ ]
416
+ },
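+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Note: `num_training_steps` above was computed per sample, but the scheduler is stepped once per batch, so with `batch_size=2` the linear decay runs at half speed and never reaches zero. Counting optimizer updates instead (sketch):\n",
+ "\n",
+ "```python\n",
+ "# one optimizer/scheduler step per batch, not per sample\n",
+ "num_training_steps = num_epochs * len(train_dataloader)\n",
+ "```"
+ ]
+ },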
417
+ {
418
+ "cell_type": "markdown",
419
+ "metadata": {},
420
+ "source": [
421
+ "## Run training loop"
422
+ ]
423
+ },
424
+ {
425
+ "cell_type": "code",
426
+ "execution_count": 20,
427
+ "metadata": {},
428
+ "outputs": [
429
+ {
430
+ "name": "stderr",
431
+ "output_type": "stream",
432
+ "text": [
433
+ " 23%|██▎ | 686/3000 [04:50<16:20, 2.36it/s]\n"
434
+ ]
435
+ },
436
+ {
437
+ "name": "stdout",
438
+ "output_type": "stream",
439
+ "text": [
440
+ "Epoch 1/3 - Average Loss: 2.9037\n"
441
+ ]
442
+ },
443
+ {
444
+ "name": "stderr",
445
+ "output_type": "stream",
446
+ "text": []
447
+ },
448
+ {
449
+ "name": "stdout",
450
+ "output_type": "stream",
451
+ "text": [
452
+ "Epoch 2/3 - Average Loss: 2.0474\n"
453
+ ]
454
+ },
455
+ {
456
+ "name": "stderr",
457
+ "output_type": "stream",
458
+ "text": []
459
+ },
460
+ {
461
+ "name": "stdout",
462
+ "output_type": "stream",
463
+ "text": [
464
+ "Epoch 3/3 - Average Loss: 1.4295\n"
465
+ ]
466
+ }
467
+ ],
468
+ "source": [
469
+ "from tqdm.auto import tqdm\n",
470
+ "\n",
471
+ "progress_bar = tqdm(range(num_training_steps))\n",
472
+ "\n",
473
+ "model.train()\n",
474
+ "for epoch in range(num_epochs):\n",
475
+ " total_loss = 0\n",
476
+ "\n",
477
+ " for batch in train_dataloader:\n",
478
+ " batch = {k: v.to(model.device) for k, v in batch.items()}\n",
479
+ " outputs = model(**batch, labels=batch[\"input_ids\"])\n",
480
+ " loss = outputs.loss\n",
481
+ " total_loss += loss.item()\n",
482
+ " loss.backward()\n",
483
+ "\n",
484
+ " optimizer.step()\n",
485
+ " lr_scheduler.step()\n",
486
+ " optimizer.zero_grad()\n",
487
+ " progress_bar.update(1)\n",
488
+ "\n",
489
+ " average_loss = total_loss / len(train_dataloader)\n",
490
+ " print(f\"Epoch {epoch+1}/{num_epochs} - Average Loss: {average_loss:.4f}\") # Print the loss"
491
+ ]
492
+ },
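+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Aside: back-propagating through a 4-bit model directly, as above, is unusual; the common recipe (QLoRA) freezes the quantised base weights and trains small low-rank adapters instead. A sketch, assuming the `peft` library is installed:\n",
+ "\n",
+ "```python\n",
+ "from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training\n",
+ "\n",
+ "model = prepare_model_for_kbit_training(model)\n",
+ "lora_config = LoraConfig(\n",
+ "    r=16, lora_alpha=32, lora_dropout=0.05,\n",
+ "    target_modules=[\"q_proj\", \"v_proj\"], task_type=\"CAUSAL_LM\",\n",
+ ")\n",
+ "model = get_peft_model(model, lora_config)\n",
+ "model.print_trainable_parameters()  # typically <1% of all parameters\n",
+ "```"
+ ]
+ },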
493
+ {
494
+ "cell_type": "markdown",
495
+ "metadata": {},
496
+ "source": [
497
+ "## Loss minimisation\n",
498
+ "\n",
499
+ "We see clear loss minimisation after just a few training loops (3 epochs over 1000 samples).\n",
500
+ "\n",
501
+ "Epoch 1/3 - Average Loss: 2.9037\n",
502
+ "\n",
503
+ "Epoch 2/3 - Average Loss: 2.0474\n",
504
+ "\n",
505
+ "Epoch 3/3 - Average Loss: 1.4295"
506
+ ]
507
+ },
508
+ {
509
+ "cell_type": "markdown",
510
+ "metadata": {},
511
+ "source": [
512
+ "# Push model to hub"
513
+ ]
514
+ },
515
+ {
516
+ "cell_type": "code",
517
+ "execution_count": 22,
518
+ "metadata": {},
519
+ "outputs": [
520
+ {
521
+ "name": "stderr",
522
+ "output_type": "stream",
523
+ "text": [
524
+ "model.safetensors: 100%|██████████| 4.17G/4.17G [05:49<00:00, 11.9MB/s]\n"
525
+ ]
526
+ },
527
+ {
528
+ "data": {
529
+ "text/plain": [
530
+ "CommitInfo(commit_url='https://huggingface.co/opyate/llama-7b-hf-redefined-3ep-1k/commit/ba69710ff29bd8ef8e0eb2460fde7cf0b8d1522c', commit_message='Upload LlamaForCausalLM', commit_description='', oid='ba69710ff29bd8ef8e0eb2460fde7cf0b8d1522c', pr_url=None, pr_revision=None, pr_num=None)"
531
+ ]
532
+ },
533
+ "execution_count": 22,
534
+ "metadata": {},
535
+ "output_type": "execute_result"
536
+ }
537
+ ],
538
+ "source": [
539
+ "from huggingface_hub import HfApi\n",
540
+ "\n",
541
+ "api = HfApi()\n",
542
+ "\n",
543
+ "trained_model_id = \"opyate/llama-7b-hf-redefined-3ep-1k\"\n",
544
+ "api.create_repo(repo_id=trained_model_id)\n",
545
+ "tokenizer.push_to_hub(trained_model_id)\n",
546
+ "model.push_to_hub(trained_model_id)"
547
+ ]
548
+ },
549
+ {
550
+ "cell_type": "markdown",
551
+ "metadata": {},
552
+ "source": [
553
+ "# After training"
554
+ ]
555
+ },
556
+ {
557
+ "cell_type": "markdown",
558
+ "metadata": {},
559
+ "source": [
560
+ "## Calculate validation loss"
561
+ ]
562
+ },
563
+ {
564
+ "cell_type": "code",
565
+ "execution_count": 23,
566
+ "metadata": {},
567
+ "outputs": [
568
+ {
569
+ "name": "stdout",
570
+ "output_type": "stream",
571
+ "text": [
572
+ "Validation loss before training: 9.958484322547912\n",
573
+ "Validation loss after training: 3.7852714624404906\n"
574
+ ]
575
+ }
576
+ ],
577
+ "source": [
578
+ "loss_after_training = calculate_validation_loss(model, validation_dataloader)\n",
579
+ "print(f\"Validation loss before training: {loss_before_training}\")\n",
580
+ "print(f\"Validation loss after training: {loss_after_training}\")"
581
+ ]
582
+ },
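+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The losses are easier to read as perplexity, i.e. `exp` of the mean cross-entropy (a sketch; roughly 21,000 before vs 44 after on this validation sample):\n",
+ "\n",
+ "```python\n",
+ "import math\n",
+ "\n",
+ "print(f\"Perplexity before: {math.exp(loss_before_training):.1f}\")\n",
+ "print(f\"Perplexity after: {math.exp(loss_after_training):.1f}\")\n",
+ "```"
+ ]
+ },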
583
+ {
584
+ "cell_type": "markdown",
585
+ "metadata": {},
586
+ "source": [
587
+ "Observation: validation loss also minimised after training the model."
588
+ ]
589
+ },
590
+ {
591
+ "cell_type": "markdown",
592
+ "metadata": {},
593
+ "source": [
594
+ "## Benchmark\n",
595
+ "\n",
596
+ "Run this command:\n",
597
+ "\n",
598
+ "```\n",
599
+ "accelerate launch -m lm_eval --model hf \\\n",
600
+ " --model_args pretrained=opyate/llama-7b-hf-redefined-3ep-1k,load_in_4bit=True,dtype=\"bfloat16\" \\\n",
601
+ " --tasks mmlu,hellaswag,truthfulqa \\\n",
602
+ " --batch_size auto:4 \\\n",
603
+ " --log_samples \\\n",
604
+ " --output_path results/after-training\n",
605
+ "```\n",
606
+ "\n",
607
+ "Output:\n",
608
+ "\n",
609
+ "```\n",
610
+ "hf (pretrained=opyate/llama-7b-hf-redefined-3ep-1k,load_in_4bit=True,dtype=bfloat16), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (16,64,64,64)\n",
611
+ "| Tasks |Version|Filter|n-shot| Metric | | Value | |Stderr|\n",
612
+ "|---------------------------------------|------:|------|-----:|-----------|---|------:|---|-----:|\n",
613
+ "|hellaswag | 1|none | 0|acc |↑ | 0.2922|± |0.0045|\n",
614
+ "| | |none | 0|acc_norm |↑ | 0.3176|± |0.0046|\n",
615
+ "|mmlu | 2|none | |acc |↑ | 0.2400|± |0.0036|\n",
616
+ "| - humanities | 2|none | |acc |↑ | 0.2436|± |0.0063|\n",
617
+ "| - formal_logic | 1|none | 0|acc |↑ | 0.2698|± |0.0397|\n",
618
+ "| - high_school_european_history | 1|none | 0|acc |↑ | 0.2182|± |0.0323|\n",
619
+ "| - high_school_us_history | 1|none | 0|acc |↑ | 0.2843|± |0.0317|\n",
620
+ "| - high_school_world_history | 1|none | 0|acc |↑ | 0.2574|± |0.0285|\n",
621
+ "| - international_law | 1|none | 0|acc |↑ | 0.2314|± |0.0385|\n",
622
+ "| - jurisprudence | 1|none | 0|acc |↑ | 0.2778|± |0.0433|\n",
623
+ "| - logical_fallacies | 1|none | 0|acc |↑ | 0.2638|± |0.0346|\n",
624
+ "| - moral_disputes | 1|none | 0|acc |↑ | 0.2601|± |0.0236|\n",
625
+ "| - moral_scenarios | 1|none | 0|acc |↑ | 0.2380|± |0.0142|\n",
626
+ "| - philosophy | 1|none | 0|acc |↑ | 0.2283|± |0.0238|\n",
627
+ "| - prehistory | 1|none | 0|acc |↑ | 0.2623|± |0.0245|\n",
628
+ "| - professional_law | 1|none | 0|acc |↑ | 0.2327|± |0.0108|\n",
629
+ "| - world_religions | 1|none | 0|acc |↑ | 0.2339|± |0.0325|\n",
630
+ "| - other | 2|none | |acc |↑ | 0.2594|± |0.0078|\n",
631
+ "| - business_ethics | 1|none | 0|acc |↑ | 0.3200|± |0.0469|\n",
632
+ "| - clinical_knowledge | 1|none | 0|acc |↑ | 0.2528|± |0.0267|\n",
633
+ "| - college_medicine | 1|none | 0|acc |↑ | 0.2486|± |0.0330|\n",
634
+ "| - global_facts | 1|none | 0|acc |↑ | 0.2900|± |0.0456|\n",
635
+ "| - human_aging | 1|none | 0|acc |↑ | 0.3857|± |0.0327|\n",
636
+ "| - management | 1|none | 0|acc |↑ | 0.2524|± |0.0430|\n",
637
+ "| - marketing | 1|none | 0|acc |↑ | 0.2735|± |0.0292|\n",
638
+ "| - medical_genetics | 1|none | 0|acc |↑ | 0.2600|± |0.0441|\n",
639
+ "| - miscellaneous | 1|none | 0|acc |↑ | 0.2414|± |0.0153|\n",
640
+ "| - nutrition | 1|none | 0|acc |↑ | 0.2451|± |0.0246|\n",
641
+ "| - professional_accounting | 1|none | 0|acc |↑ | 0.2447|± |0.0256|\n",
642
+ "| - professional_medicine | 1|none | 0|acc |↑ | 0.1912|± |0.0239|\n",
643
+ "| - virology | 1|none | 0|acc |↑ | 0.2892|± |0.0353|\n",
644
+ "| - social sciences | 2|none | |acc |↑ | 0.2327|± |0.0076|\n",
645
+ "| - econometrics | 1|none | 0|acc |↑ | 0.2193|± |0.0389|\n",
646
+ "| - high_school_geography | 1|none | 0|acc |↑ | 0.1818|± |0.0275|\n",
647
+ "| - high_school_government_and_politics| 1|none | 0|acc |↑ | 0.2021|± |0.0290|\n",
648
+ "| - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.2077|± |0.0206|\n",
649
+ "| - high_school_microeconomics | 1|none | 0|acc |↑ | 0.2353|± |0.0276|\n",
650
+ "| - high_school_psychology | 1|none | 0|acc |↑ | 0.2349|± |0.0182|\n",
651
+ "| - human_sexuality | 1|none | 0|acc |↑ | 0.2443|± |0.0377|\n",
652
+ "| - professional_psychology | 1|none | 0|acc |↑ | 0.2598|± |0.0177|\n",
653
+ "| - public_relations | 1|none | 0|acc |↑ | 0.2909|± |0.0435|\n",
654
+ "| - security_studies | 1|none | 0|acc |↑ | 0.2041|± |0.0258|\n",
655
+ "| - sociology | 1|none | 0|acc |↑ | 0.2587|± |0.0310|\n",
656
+ "| - us_foreign_policy | 1|none | 0|acc |↑ | 0.2600|± |0.0441|\n",
657
+ "| - stem | 2|none | |acc |↑ | 0.2226|± |0.0074|\n",
658
+ "| - abstract_algebra | 1|none | 0|acc |↑ | 0.3100|± |0.0465|\n",
659
+ "| - anatomy | 1|none | 0|acc |↑ | 0.2519|± |0.0375|\n",
660
+ "| - astronomy | 1|none | 0|acc |↑ | 0.1711|± |0.0306|\n",
661
+ "| - college_biology | 1|none | 0|acc |↑ | 0.2292|± |0.0351|\n",
662
+ "| - college_chemistry | 1|none | 0|acc |↑ | 0.2200|± |0.0416|\n",
663
+ "| - college_computer_science | 1|none | 0|acc |↑ | 0.1300|± |0.0338|\n",
664
+ "| - college_mathematics | 1|none | 0|acc |↑ | 0.2400|± |0.0429|\n",
665
+ "| - college_physics | 1|none | 0|acc |↑ | 0.1667|± |0.0371|\n",
666
+ "| - computer_security | 1|none | 0|acc |↑ | 0.1600|± |0.0368|\n",
667
+ "| - conceptual_physics | 1|none | 0|acc |↑ | 0.2766|± |0.0292|\n",
668
+ "| - electrical_engineering | 1|none | 0|acc |↑ | 0.2483|± |0.0360|\n",
669
+ "| - elementary_mathematics | 1|none | 0|acc |↑ | 0.2196|± |0.0213|\n",
670
+ "| - high_school_biology | 1|none | 0|acc |↑ | 0.2097|± |0.0232|\n",
671
+ "| - high_school_chemistry | 1|none | 0|acc |↑ | 0.2217|± |0.0292|\n",
672
+ "| - high_school_computer_science | 1|none | 0|acc |↑ | 0.2200|± |0.0416|\n",
673
+ "| - high_school_mathematics | 1|none | 0|acc |↑ | 0.3000|± |0.0279|\n",
674
+ "| - high_school_physics | 1|none | 0|acc |↑ | 0.1457|± |0.0288|\n",
675
+ "| - high_school_statistics | 1|none | 0|acc |↑ | 0.1667|± |0.0254|\n",
676
+ "| - machine_learning | 1|none | 0|acc |↑ | 0.2768|± |0.0425|\n",
677
+ "|truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.0073|± |0.0030|\n",
678
+ "| | |none | 0|bleu_diff |↑ | 0.0010|± |0.0008|\n",
679
+ "| | |none | 0|bleu_max |↑ | 0.0187|± |0.0088|\n",
680
+ "| | |none | 0|rouge1_acc |↑ | 0.0220|± |0.0051|\n",
681
+ "| | |none | 0|rouge1_diff|↑ |-0.0077|± |0.0277|\n",
682
+ "| | |none | 0|rouge1_max |↑ | 0.0726|± |0.0213|\n",
683
+ "| | |none | 0|rouge2_acc |↑ | 0.0024|± |0.0017|\n",
684
+ "| | |none | 0|rouge2_diff|↑ | 0.0031|± |0.0043|\n",
685
+ "| | |none | 0|rouge2_max |↑ | 0.0048|± |0.0039|\n",
686
+ "| | |none | 0|rougeL_acc |↑ | 0.0220|± |0.0051|\n",
687
+ "| | |none | 0|rougeL_diff|↑ |-0.0093|± |0.0278|\n",
688
+ "| | |none | 0|rougeL_max |↑ | 0.0709|± |0.0208|\n",
689
+ "|truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2277|± |0.0147|\n",
690
+ "|truthfulqa_mc2 | 2|none | 0|acc |↑ | 0.4685|± |0.0171|\n",
691
+ "\n",
692
+ "| Groups |Version|Filter|n-shot|Metric| |Value | |Stderr|\n",
693
+ "|------------------|------:|------|------|------|---|-----:|---|-----:|\n",
694
+ "|mmlu | 2|none | |acc |↑ |0.2400|± |0.0036|\n",
695
+ "| - humanities | 2|none | |acc |↑ |0.2436|± |0.0063|\n",
696
+ "| - other | 2|none | |acc |↑ |0.2594|± |0.0078|\n",
697
+ "| - social sciences| 2|none | |acc |↑ |0.2327|± |0.0076|\n",
698
+ "| - stem | 2|none | |acc |↑ |0.2226|± |0.0074|\n",
699
+ "```"
700
+ ]
701
+ },
702
+ {
703
+ "cell_type": "markdown",
704
+ "metadata": {},
705
+ "source": [
706
+ "# Conclusion\n",
707
+ "\n",
708
+ "The benchmarks generally performed worse after training. E.g. before training, `hellaswag` on `acc_norm` showed reasonable performance (0.7498), then more than halved after training (0.3176). `truthfulqa_mc1` and `truthfulqa_mc2` show slight improvements.\n",
709
+ "\n",
710
+ "There are various reasons this could be:\n",
711
+ "- We used a quantised (4bit) model (quantisation reduces model weight precision, which introduces approximation error)\n",
712
+ "- and a very small number of training samples (not enough to learn from)\n",
713
+ "- might be issues with the training setup e.g., \n",
714
+ " - overfitting: The model might have memorized the training data too well and is now performing poorly on unseen data (the validation set used for `lm-eval`).\n",
715
+ " - data preparation: There could be problems with the training data, such as noise, inconsistencies, or biases that are negatively impacting the model's learning\n",
716
+ " - hyperparameters: The learning rate, batch size, weight decay, etc might not be optimal for this specific task and model, and can be remedied with hyper parameter search techniques.\n",
717
+ "\n",
718
+ "I generally used quantisation, small batch sizes, and a small sample size to complete this task in good time. I would have also liked to show a [Gradio leaderboard](https://huggingface.co/spaces/freddyaboulton/gradio_leaderboard/tree/main) to compare the bench results, but enough can be gleaned from just looking at the numbers."
719
+ ]
720
+ }
721
+ ],
722
+ "metadata": {
723
+ "kernelspec": {
724
+ "display_name": ".venv",
725
+ "language": "python",
726
+ "name": "python3"
727
+ },
728
+ "language_info": {
729
+ "codemirror_mode": {
730
+ "name": "ipython",
731
+ "version": 3
732
+ },
733
+ "file_extension": ".py",
734
+ "mimetype": "text/x-python",
735
+ "name": "python",
736
+ "nbconvert_exporter": "python",
737
+ "pygments_lexer": "ipython3",
738
+ "version": "3.12.3"
739
+ }
740
+ },
741
+ "nbformat": 4,
742
+ "nbformat_minor": 2
743
+ }