{ "cells": [
 { "cell_type": "code", "execution_count": null, "id": "2d8737a1", "metadata": {}, "outputs": [], "source": [
  "%pip install lighteval==0.6.2\n",
  "%pip install great-tables\n",
  "%pip install polars" ] },
 { "cell_type": "markdown", "id": "3d9ea816", "metadata": {}, "source": [
  "# Comparing different prompt formulations for the same task\n",
  "In this notebook, we use a very small model on a simple task. We focus on comparing several formulations of the input prompt to see how they affect the results we obtain." ] },
 { "cell_type": "code", "execution_count": 2, "id": "3684eec7", "metadata": {}, "outputs": [], "source": [
  "import string\n",
  "import os\n",
  "from datetime import timedelta\n",
  "from types import ModuleType\n",
  "from ast import literal_eval" ] },
 { "cell_type": "code", "execution_count": 3, "id": "a241de50", "metadata": {}, "outputs": [], "source": [
  "# For data visualization\n",
  "from great_tables import GT\n",
  "import polars as pl\n",
  "import polars.selectors as cs\n",
  "from datasets import load_dataset" ] },
 { "cell_type": "code", "execution_count": null, "id": "e522341e", "metadata": {}, "outputs": [], "source": [
  "# For evaluation\n",
  "import lighteval\n",
  "from lighteval.logging.evaluation_tracker import EvaluationTracker\n",
  "from lighteval.models.model_config import BaseModelConfig, VLLMModelConfig\n",
  "from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters\n",
  "from lighteval.metrics.metrics import Metrics\n",
  "from lighteval.tasks.lighteval_task import LightevalTaskConfig, Doc\n",
  "from lighteval.utils.utils import as_list, EnvConfig\n",
  "from lighteval.utils.imports import 
is_accelerate_available, is_tgi_available" ] },
 { "cell_type": "code", "execution_count": 5, "id": "d954ae1b", "metadata": {}, "outputs": [], "source": [
  "# Set these for your use case\n",
  "cache_dir = \"tmp\"\n",
  "max_samples = 10" ] },
 { "cell_type": "markdown", "id": "a52f8c1b", "metadata": {}, "source": [
  "## Comparing several formulations for the same task\n",
  "\n",
  "Let's compare:\n",
  "- a multiple-choice question answering (MCQA) evaluation\n",
  "- a generative evaluation\n",
  "\n",
  "and, for both, variations of the same prompts.\n",
  "\n",
  "We will use AI2's ARC dataset for our experiments, with the \"challenge\" subset. You can browse the dataset here: https://huggingface.co/datasets/allenai/ai2_arc?row=0." ] },
 { "cell_type": "markdown", "id": "6fc0902b", "metadata": {}, "source": [
  "### Let's define the core of our task" ] },
 { "cell_type": "code", "execution_count": 6, "id": "6e1c0cde", "metadata": {}, "outputs": [], "source": [
  "class ArcExplorationTask(LightevalTaskConfig):\n",
  "    def __init__(self, name, prompt_function, metric):\n",
  "        super().__init__(\n",
  "            name=name,\n",
  "            prompt_function=prompt_function,\n",
  "            metric=as_list(metric),\n",
  "            # This is a custom task\n",
  "            suite=[\"custom\"],\n",
  "            # This defines our dataset and its subsets\n",
  "            hf_repo=\"allenai/ai2_arc\",\n",
  "            hf_subset=\"ARC-Challenge\",\n",
  "            hf_avail_splits=[\"train\", \"validation\", \"test\"],\n",
  "            evaluation_splits=[\"test\"],\n",
  "            # Few-shot example parameters\n",
  "            few_shots_split=\"validation\",\n",
  "            few_shots_select=\"random\",\n",
  "            # Other parameters\n",
  "            stop_sequence=[\".\", \"\\n\"],\n",
  "            generation_size=100,\n",
  "        )" ] },
 { "cell_type": "markdown", "id": "eaa0a65a", "metadata": {}, "source": [
  "### Let's define our metrics\n",
  "\n",
  "For a multiple-choice evaluation, we want the length-normalized log-likelihood accuracy (= is the most likely choice the correct one?).\n",
  "\n",
  "For a generative evaluation, we want a quasi-exact match (= does the generated text match the reference?)." ] },
 { "cell_type": "code", "execution_count": 7, "id": "426c2ef5", "metadata": {}, "outputs": [], "source": [
  "metric_mcqa = Metrics.loglikelihood_acc_norm\n",
  "metric_gen = Metrics.quasi_exact_match" ] },
 { "cell_type": "markdown", "id": "aea37f61", "metadata": {}, "source": [
  "### Let's define functions for the different prompts\n",
  "\n",
  "A row of the ARC dataset is a dictionary of the following form:\n",
  "```python\n",
  "{\n",
  "    \"question\": \"the question, with an instruction\",\n",
  "    \"choices\": {\n",
  "        \"text\": [\"choice 1\", \"choice 2\", ...],\n",
  "        \"label\": [\"A\", \"B\", ...]\n",
  "    },\n",
  "    \"answerKey\": \"the gold label\"\n",
  "}\n",
  "```\n",
  "\n",
  "Each prompt function applies a template that maps this information to the expected keys (`query`, `choices`, `gold_index`, and an `instruction` if needed)." ] },
 { "cell_type": "markdown", "id": "7e994698", "metadata": {}, "source": [
  "First case: we define the most basic template possible.  \n",
  "The prompt looks like this:\n",
  "```\n",
  "<question>\n",
  "```\n",
  "and we look at `<choice>` directly." ] },
 { "cell_type": "code", "execution_count": 8, "id": "41bccce4", "metadata": {}, "outputs": [], "source": [
  "def arc_base(line, task_name: str = None):\n",
  "    query = line[\"question\"]\n",
  "    choices = line[\"choices\"][\"text\"]\n",
  "\n",
  "    return Doc(\n",
  "        task_name=task_name,\n",
  "        query=query,\n",
  "        choices=choices,\n",
  "        gold_index=line[\"choices\"][\"label\"].index(line[\"answerKey\"]),\n",
  "    )" ] },
 { "cell_type": "markdown", "id": "d87631d5", "metadata": {}, "source": [
  "Second case: we now add a bit of context. 
The prompt then looks like this:\n",
  "```\n",
  "Question: <question>\n",
  "Answer: \n",
  "```\n",
  "and again we look at `<choice>` directly." ] },
 { "cell_type": "code", "execution_count": 9, "id": "9865eafe", "metadata": {}, "outputs": [], "source": [
  "def arc_context(line, task_name: str = None):\n",
  "    query = f\"Question: {line['question']}\"\n",
  "    query += \"\\nAnswer: \"\n",
  "    choices = line[\"choices\"][\"text\"]\n",
  "    return Doc(\n",
  "        task_name=task_name,\n",
  "        query=query,\n",
  "        choices=choices,\n",
  "        gold_index=line[\"choices\"][\"label\"].index(line[\"answerKey\"]),\n",
  "    )" ] },
 { "cell_type": "markdown", "id": "132deef3", "metadata": {}, "source": [
  "Third case: we now add the choices to the prompt, which then looks like this:\n",
  "```\n",
  "Question: <question>\n",
  "A. <choice 1>\n",
  "B. <choice 2>\n",
  "...\n",
  "Answer: \n",
  "```\n",
  "and once again we look at `<choice>` directly." ] },
 { "cell_type": "code", "execution_count": 10, "id": "698367f7", "metadata": {}, "outputs": [], "source": [
  "letters = list(string.ascii_uppercase)" ] },
 { "cell_type": "code", "execution_count": 11, "id": "e7073026", "metadata": {}, "outputs": [], "source": [
  "def arc_context_choices(line, task_name: str = None):\n",
  "    query = f\"Question: {line['question']}\\n\"\n",
  "    query += \"\\n\".join([f\"{letters[ix]}. {choice}\" for ix, choice in enumerate(line[\"choices\"][\"text\"])])\n",
  "    query += \"\\nAnswer: \"\n",
  "    choices = line[\"choices\"][\"text\"]\n",
  "    return Doc(\n",
  "        task_name=task_name,\n",
  "        query=query,\n",
  "        choices=choices,\n",
  "        gold_index=line[\"choices\"][\"label\"].index(line[\"answerKey\"]),\n",
  "    )" ] },
 { "cell_type": "markdown", "id": "b835822d", "metadata": {}, "source": [
  "Last case: we do the same thing, but we look at the `<letter>` of each choice instead." 
] },
 { "cell_type": "code", "execution_count": 12, "id": "b9453b39", "metadata": {}, "outputs": [], "source": [
  "def arc_context_labels(line, task_name: str = None):\n",
  "    query = f\"Question: {line['question']}\\n\"\n",
  "    query += \"\\n\".join([f\"{letters[ix]}. {choice}\" for ix, choice in enumerate(line[\"choices\"][\"text\"])])\n",
  "    query += \"\\nAnswer: \"\n",
  "    # The candidates are now the letters (A, B, ...) rather than the choice texts\n",
  "    choices = [letters[ix] for ix in range(len(line[\"choices\"][\"text\"]))]\n",
  "    return Doc(\n",
  "        task_name=task_name,\n",
  "        query=query,\n",
  "        choices=choices,\n",
  "        gold_index=line[\"choices\"][\"label\"].index(line[\"answerKey\"]),\n",
  "    )" ] },
 { "cell_type": "markdown", "id": "da018e16", "metadata": {}, "source": [
  "### Putting it all together" ] },
 { "cell_type": "code", "execution_count": 13, "id": "304a74dd", "metadata": {}, "outputs": [], "source": [
  "task_module = ModuleType(\"task_module\")\n",
  "task_module.__file__ = \".\"\n",
  "task_module.TASKS_TABLE = [\n",
  "    ArcExplorationTask(\n",
  "        name=\"arc_base\",\n",
  "        prompt_function=arc_base,\n",
  "        metric=[metric_mcqa, metric_gen]\n",
  "    ),\n",
  "    ArcExplorationTask(\n",
  "        name=\"arc_context\",\n",
  "        prompt_function=arc_context,\n",
  "        metric=[metric_mcqa, metric_gen]\n",
  "    ),\n",
  "    ArcExplorationTask(\n",
  "        name=\"arc_context_choice\",\n",
  "        prompt_function=arc_context_choices,\n",
  "        metric=[metric_mcqa, metric_gen]\n",
  "    ),\n",
  "    ArcExplorationTask(\n",
  "        name=\"arc_context_labels\",\n",
  "        prompt_function=arc_context_labels,\n",
  "        metric=[metric_mcqa, metric_gen]\n",
  "    )\n",
  "]\n",
  "\n",
  "task_names = [\"arc_base\", \"arc_context\", \"arc_context_choice\", \"arc_context_labels\"]" ] },
 { "cell_type": "markdown", "id": "42a131fd", "metadata": {}, "source": [
  "# Let's run our evaluation!" 
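,
  "\n",
  "\n",
  "Before launching the full run, it can help to preview what each prompt function produces. The next cell is only a minimal sketch: `sample_line` is a made-up row that mimics the ARC schema (it is not taken from the dataset), and the loop simply reuses the four prompt functions defined above and prints the `query`, `choices` and `gold_index` fields of the resulting `Doc`." ] },
 { "cell_type": "code", "execution_count": null, "id": "preview-prompts-sketch", "metadata": {}, "outputs": [], "source": [
  "# Hypothetical row, invented for illustration; it only mimics the ARC schema\n",
  "sample_line = {\n",
  "    \"question\": \"Which of these is a source of light?\",\n",
  "    \"choices\": {\n",
  "        \"text\": [\"The Sun\", \"A rock\", \"A lake\", \"A tree\"],\n",
  "        \"label\": [\"A\", \"B\", \"C\", \"D\"],\n",
  "    },\n",
  "    \"answerKey\": \"A\",\n",
  "}\n",
  "\n",
  "# Print the prompt and the scored candidates for each formulation\n",
  "for prompt_fn in (arc_base, arc_context, arc_context_choices, arc_context_labels):\n",
  "    doc = prompt_fn(sample_line, task_name=prompt_fn.__name__)\n",
  "    print(f\"--- {prompt_fn.__name__} ---\")\n",
  "    print(doc.query)\n",
  "    print(\"choices:\", doc.choices, \"| gold_index:\", doc.gold_index)"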
] },
 { "cell_type": "code", "execution_count": 14, "id": "756566ff", "metadata": {}, "outputs": [], "source": [
  "if is_accelerate_available():\n",
  "    from accelerate import Accelerator, InitProcessGroupKwargs\n",
  "    accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=3000))])\n",
  "else:\n",
  "    accelerator = None" ] },
 { "cell_type": "code", "execution_count": 15, "id": "1ee62d6f", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [
  "WARNING:lighteval.logging.hierarchical_logger:WARNING: --max_samples WAS SET. THESE NUMBERS ARE ONLY PARTIAL AND SHOULD NOT BE USED FOR COMPARISON UNLESS YOU KNOW WHAT YOU ARE DOING.\n",
  "WARNING:lighteval.logging.hierarchical_logger:Test all gather {\n",
  "WARNING:lighteval.logging.hierarchical_logger: Test gather tensor\n",
  "WARNING:lighteval.logging.hierarchical_logger: gathered_tensor tensor([0]), should be [0]\n",
  "WARNING:lighteval.logging.hierarchical_logger:} [0:00:00.002244]\n",
  "WARNING:lighteval.logging.hierarchical_logger:Model loading {\n",
  "WARNING:lighteval.logging.hierarchical_logger: Tokenizer truncation and padding size set to the left side.\n",
  "WARNING:lighteval.logging.hierarchical_logger: We are not in a distributed setting. 
Setting model_parallel to False.\n", "WARNING:lighteval.logging.hierarchical_logger: Model parallel was set to False, max memory set to None and device map to None\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "4e00c4d6763240ed85bde9c5f8f6614e", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading checkpoint shards: 0%| | 0/2 [00:00.create_task..LightevalTaskFromConfig'>, 'custom|arc_context': .create_task..LightevalTaskFromConfig'>, 'custom|arc_context_choice': .create_task..LightevalTaskFromConfig'>, 'custom|arc_context_labels': .create_task..LightevalTaskFromConfig'>}\n", "WARNING:lighteval.logging.hierarchical_logger: allenai/ai2_arc ARC-Challenge\n", "WARNING:lighteval.logging.hierarchical_logger: allenai/ai2_arc ARC-Challenge\n", "WARNING:lighteval.logging.hierarchical_logger: allenai/ai2_arc ARC-Challenge\n", "WARNING:lighteval.logging.hierarchical_logger: allenai/ai2_arc ARC-Challenge\n", "WARNING:lighteval.logging.hierarchical_logger: Loading documents, and requests\n", "WARNING:lighteval.logging.hierarchical_logger:} [0:00:12.300826]\n", "WARNING:lighteval.logging.hierarchical_logger:Setting seeds and waiting for all processes {\n", "WARNING:lighteval.logging.hierarchical_logger: setting seed to 1234 for random and numpy\n", "WARNING:lighteval.logging.hierarchical_logger:} [0:00:00.000334]\n", "WARNING:lighteval.logging.hierarchical_logger:Evaluation {\n", "WARNING:lighteval.logging.hierarchical_logger: Evaluate on 4 tasks.\n", "WARNING:lighteval.logging.hierarchical_logger: Running RequestType.GREEDY_UNTIL requests\n", "WARNING:lighteval.logging.hierarchical_logger: \u001b[33mYou cannot select the number of dataset splits for a generative evaluation at the moment. 
Automatically inferring.\u001b[0m\n", "Greedy generation: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [06:41<00:00, 9.73s/it]\u001b[A\n", "Splits: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [06:41<00:00, 401.95s/it]\u001b[A\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "WARNING:lighteval.logging.hierarchical_logger: Running RequestType.LOGLIKELIHOOD requests\n", "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [05:26<00:00, 8.17s/it]\u001b[A\n", "1it [05:26, 326.90s/it]\n", "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [04:50<00:00, 7.25s/it]\u001b[A\n", "2it [10:16, 305.21s/it]\n", "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [03:27<00:00, 5.19s/it]\u001b[A\n", "3it [13:44, 260.60s/it]\n", "100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36/36 [02:56<00:00, 4.91s/it]\u001b[A\n", "4it [16:41, 250.28s/it]\n", "WARNING:lighteval.logging.hierarchical_logger:} [0:23:23.253299]\n", "WARNING:lighteval.logging.hierarchical_logger:Compiling results {\n", "WARNING:lighteval.logging.hierarchical_logger:} 
[0:00:00.000472]\n", "WARNING:lighteval.logging.hierarchical_logger:Cleaning up {\n", "WARNING:lighteval.logging.hierarchical_logger:} [0:00:00.000034]\n", "WARNING:lighteval.logging.hierarchical_logger:Saving experiment tracker\n", "WARNING:lighteval.logging.hierarchical_logger:Saving results to .../tmp/results/HuggingFaceTB/SmolLM-1.7B/results_2024-10-24T16-24-58.398434.json\n", "WARNING:lighteval.logging.hierarchical_logger:Saving details to .../tmp/details/HuggingFaceTB/SmolLM-1.7B/2024-10-24T16-24-58.398434\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "b1e4e357b68b41f8ac49469cddda318d", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Creating parquet from Arrow format: 0%| | 0/1 [00:00\n", "\n", "\n", "\n", "\n", " \n", " \n", " \n", "\n", " \n", " \n", "\n", "\n", " \n", " \n", "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "\n", "\n", "
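<table>\n",
  "<thead>\n",
  "<tr><th colspan=\"3\">Results</th></tr>\n",
  "<tr><th rowspan=\"2\">Prompt function</th><th colspan=\"2\">Evaluations</th></tr>\n",
  "<tr><th>Quasi Exact Match</th><th>Normalized Accuracy</th></tr>\n",
  "</thead>\n",
  "<tbody>\n",
  "<tr><td>arc base</td><td>0.0</td><td>0.3</td></tr>\n",
  "<tr><td>arc context</td><td>0.1</td><td>0.5</td></tr>\n",
  "<tr><td>arc context choice</td><td>0.4</td><td>0.3</td></tr>\n",
  "<tr><td>arc context labels</td><td>0.0</td><td>0.0</td></tr>\n",
  "<tr><td>all</td><td>0.125</td><td>0.275</td></tr>\n",
  "</tbody>\n",
  "</table>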
\n" ], "text/plain": [
  "shape: (5, 3)\n",
  "┌────────────────────┬───────────────────┬─────────────────────┐\n",
  "│ Prompt function    ┆ Quasi Exact Match ┆ Normalized Accuracy │\n",
  "│ ---                ┆ ---               ┆ ---                 │\n",
  "│ str                ┆ f64               ┆ f64                 │\n",
  "╞════════════════════╪═══════════════════╪═════════════════════╡\n",
  "│ arc base           ┆ 0.0               ┆ 0.3                 │\n",
  "│ arc context        ┆ 0.1               ┆ 0.5                 │\n",
  "│ arc context choice ┆ 0.4               ┆ 0.3                 │\n",
  "│ arc context labels ┆ 0.0               ┆ 0.0                 │\n",
  "│ all                ┆ 0.125             ┆ 0.275               │\n",
  "└────────────────────┴───────────────────┴─────────────────────┘" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [
  "results = pipeline.get_results()[\"results\"]\n",
  "results_processed = []\n",
  "for eval_name, eval_results in results.items():\n",
  "    results_processed.append({\n",
  "        \"Prompt function\": (eval_name.split(\":\")[1] if \":\" in eval_name else eval_name).replace(\"_\", \" \"),\n",
  "        \"Quasi Exact Match\": eval_results[\"qem\"],\n",
  "        \"Normalized Accuracy\": eval_results[\"acc_norm\"]\n",
  "    })\n",
  "results_data = pl.from_dicts(results_processed, strict=False)\n",
  "(GT(results_data.head(max_samples*4))\n",
  "    .tab_header(\"Results\")\n",
  "    .tab_spanner(label=\"Evaluations\", columns=[\"Quasi Exact Match\", \"Normalized Accuracy\"])\n",
  ")" ] },
 { "cell_type": "markdown", "id": "872c4403", "metadata": {}, "source": [
  "# Reading the results" ] },
 { "cell_type": "code", "execution_count": 18, "id": "f4aae325", "metadata": {}, "outputs": [
  { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2ca65572fde9462ab6c29b6f17974c65", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating train split: 0 examples [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" },
  { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "84e8f2c884de4ecb9b9e2d4c158128a2", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating train split: 0 examples [00:00, ? 
examples/s]" ] }, "metadata": {}, "output_type": "display_data" },
  { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "6dfe668ee83242fb9b9070b875db3f40", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating train split: 0 examples [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" },
  { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "11ff92775dca4ff5aca86c24fccc038b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Generating train split: 0 examples [00:00, ? examples/s]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
  "path = f\"{cache_dir}/details/HuggingFaceTB/SmolLM-1.7B/\"\n",
  "\n",
  "results = {}\n",
  "\n",
  "for root, _, files in os.walk(path):\n",
  "    for file in files:\n",
  "        eval_name = file.split(\"|\")[1]\n",
  "        results[eval_name] = load_dataset(\"parquet\", data_files=f\"{root}/{file}\")[\"train\"]" ] },
 { "cell_type": "code", "execution_count": 19, "id": "08165452", "metadata": {}, "outputs": [], "source": [
  "# Collect the transformed rows in a list (turned into a DataFrame later)\n",
  "transformed_data = []\n",
  "keys = [\"example\", \"gold\", \"predictions\", \"metrics\"]\n",
  "\n",
  "# Iterate over each sample, each field, and each evaluation\n",
  "for ix in range(max_samples * 2):\n",
  "    for key in keys:\n",
  "        cur_sample = {\"Sample\": f\"Sample {ix}\", \"Type\": key.capitalize()}\n",
  "        for eval_name, df in sorted(results.items()):\n",
  "            try:\n",
  "                cur_result = literal_eval(results[eval_name][ix][key])\n",
  "                if isinstance(cur_result, list):\n",
  "                    if len(cur_result) == 1:\n",
  "                        cur_sample[eval_name] = cur_result[0]\n",
  "                    else:\n",
  "                        cur_sample[eval_name] = \"\\n\".join([str(i) for i in cur_result])\n",
  "                elif isinstance(cur_result, dict):\n",
  "                    for metric, value in cur_result.items():\n",
  "                        cur_sample[eval_name] = str(value)\n",
  "                        cur_sample[\"Type\"] = f\"{key.capitalize()}: {metric}\"\n",
  "            except (SyntaxError, ValueError):\n",
  "                # Not a Python literal: keep the raw string\n",
  "                cur_sample[eval_name] = results[eval_name][ix][key]\n",
  "\n",
  "        for k, v in cur_sample.items():\n",
  "            # Replace Python newlines (\\n) with HTML <br> tags so they render in the table\n",
  "            if isinstance(v, str):\n",
  "                cur_sample[k] = v.replace(\"\\n\", \"<br>\")\n",
  "        transformed_data.append(cur_sample)" ] },
 { "cell_type": "markdown", "id": "e2f3d2f2", "metadata": {}, "source": [
  "### Let's look at the generation results" ] },
 { "cell_type": "code", "execution_count": 20, "id": "25cde03f", "metadata": {}, "outputs": [], "source": [
  "pl_data = pl.from_dicts(transformed_data, strict=False, infer_schema_length=200)" ] },
 { "cell_type": "code", "execution_count": 21, "id": "bcd7288e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<caption>Comparing our different prompts' outputs</caption>\n",
"<thead>\n",
"<tr><th></th><th colspan='4'>Samples</th></tr>\n",
"<tr><th></th><th>arc_base</th><th>arc_context</th><th>arc_context_choice</th><th>arc_context_labels</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th colspan='5'>Sample 0</th></tr>\n",
"<tr><th>Example</th><td>Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people?</td><td>Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people?<br />Answer:</td><td>Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people?<br />A. The air stays cleaner.<br />B. Cars can travel at faster speeds.<br />C. The skills of the drivers improve.<br />D. It becomes safer to drive on the roads.<br />Answer:</td><td>Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people?<br />A. The air stays cleaner.<br />B. Cars can travel at faster speeds.<br />C. The skills of the drivers improve.<br />D. It becomes safer to drive on the roads.<br />Answer:</td></tr>\n",
"<tr><th>Gold</th><td>The air stays cleaner.</td><td>The air stays cleaner.</td><td>The air stays cleaner.</td><td>A</td></tr>\n",
"<tr><th>Predictions</th><td>It reduces the amount of pollution in the air</td><td>1</td><td>It becomes safer to drive on the roads</td><td>D</td></tr>\n",
"<tr><th>Metrics: qem</th><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th colspan='5'>Sample 1</th></tr>\n",
"<tr><th>Example</th><td>Which statement correctly describes a physical characteristic of the Moon?</td><td>Question: Which statement correctly describes a physical characteristic of the Moon?<br />Answer:</td><td>Question: Which statement correctly describes a physical characteristic of the Moon?<br />A. The Moon is made of hot gases.<br />B. The Moon is covered with many craters.<br />C. The Moon has many bodies of liquid water.<br />D. The Moon has the ability to give off its own light.<br />Answer:</td><td>Question: Which statement correctly describes a physical characteristic of the Moon?<br />A. The Moon is made of hot gases.<br />B. The Moon is covered with many craters.<br />C. The Moon has many bodies of liquid water.<br />D. The Moon has the ability to give off its own light.<br />Answer:</td></tr>\n",
"<tr><th>Gold</th><td>The Moon is covered with many craters.</td><td>The Moon is covered with many craters.</td><td>The Moon is covered with many craters.</td><td>B</td></tr>\n",
"<tr><th>Predictions</th><td>a</td><td>1</td><td>The Moon has the ability to give off its own light</td><td>D</td></tr>\n",
"<tr><th>Metrics: qem</th><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th colspan='5'>Sample 2</th></tr>\n",
"<tr><th>Example</th><td>Which object in the solar system is orbited by a belt of asteroids?</td><td>Question: Which object in the solar system is orbited by a belt of asteroids?<br />Answer:</td><td>Question: Which object in the solar system is orbited by a belt of asteroids?<br />A. Pluto<br />B. Saturn<br />C. the Sun<br />D. the Moon<br />Answer:</td><td>Question: Which object in the solar system is orbited by a belt of asteroids?<br />A. Pluto<br />B. Saturn<br />C. the Sun<br />D. the Moon<br />Answer:</td></tr>\n",
"<tr><th>Gold</th><td>the Sun</td><td>the Sun</td><td>the Sun</td><td>C</td></tr>\n",
"<tr><th>Predictions</th><td></td><td>1</td><td>The Sun</td><td>D</td></tr>\n",
"<tr><th>Metrics: qem</th><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"<tr><th colspan='5'>Sample 3</th></tr>\n",
"<tr><th>Example</th><td>Light waves that cross from an air medium to a water medium will</td><td>Question: Light waves that cross from an air medium to a water medium will<br />Answer:</td><td>Question: Light waves that cross from an air medium to a water medium will<br />A. be focused into a straight line.<br />B. lose energy and dissipate.<br />C. change length and direction.<br />D. reflect off the water's surface.<br />Answer:</td><td>Question: Light waves that cross from an air medium to a water medium will<br />A. be focused into a straight line.<br />B. lose energy and dissipate.<br />C. change length and direction.<br />D. reflect off the water's surface.<br />Answer:</td></tr>\n",
"<tr><th>Gold</th><td>change length and direction.</td><td>change length and direction.</td><td>change length and direction.</td><td>C</td></tr>\n",
"<tr><th>Predictions</th><td>bend toward the normal</td><td>1</td><td>change length and direction</td><td>D</td></tr>\n",
"<tr><th>Metrics: qem</th><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"<tr><th colspan='5'>Sample 4</th></tr>\n",
"<tr><th>Example</th><td>How many valence electrons does selenium have?</td><td>Question: How many valence electrons does selenium have?<br />Answer:</td><td>Question: How many valence electrons does selenium have?<br />A. 3<br />B. 5<br />C. 6<br />D. 8<br />Answer:</td><td>Question: How many valence electrons does selenium have?<br />A. 3<br />B. 5<br />C. 6<br />D. 8<br />Answer:</td></tr>\n",
"<tr><th>Gold</th><td>6</td><td>6</td><td>6</td><td>C</td></tr>\n",
"<tr><th>Predictions</th><td></td><td>6</td><td>6</td><td>8</td></tr>\n",
"<tr><th>Metrics: qem</th><td>0</td><td>1</td><td>1</td><td>0</td></tr>\n",
"<tr><th colspan='5'>Sample 5</th></tr>\n",
"<tr><th>Example</th><td>As a warm moist air mass moving northward collides with a strong cold air mass moving southward, what observations will most likely be made?</td><td>Question: As a warm moist air mass moving northward collides with a strong cold air mass moving southward, what observations will most likely be made?<br />Answer:</td><td>Question: As a warm moist air mass moving northward collides with a strong cold air mass moving southward, what observations will most likely be made?<br />A. Thick fog develops.<br />B. Temperatures increase.<br />C. Clouds begin to form.<br />D. Winds die down.<br />Answer:</td><td>Question: As a warm moist air mass moving northward collides with a strong cold air mass moving southward, what observations will most likely be made?<br />A. Thick fog develops.<br />B. Temperatures increase.<br />C. Clouds begin to form.<br />D. Winds die down.<br />Answer:</td></tr>\n",
"<tr><th>Gold</th><td>Clouds begin to form.</td><td>Clouds begin to form.</td><td>Clouds begin to form.</td><td>C</td></tr>\n",
"<tr><th>Predictions</th><td>drought</td><td>1</td><td>Thick fog develops</td><td>D</td></tr>\n",
"<tr><th>Metrics: qem</th><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th colspan='5'>Sample 6</th></tr>\n",
"<tr><th>Example</th><td>Chemical weathering occurs when minerals in rocks are changed chemically. Which of these will most likely change the rate of chemical weathering on a rock?</td><td>Question: Chemical weathering occurs when minerals in rocks are changed chemically. Which of these will most likely change the rate of chemical weathering on a rock?<br />Answer:</td><td>Question: Chemical weathering occurs when minerals in rocks are changed chemically. Which of these will most likely change the rate of chemical weathering on a rock?<br />A. decrease in air temperature<br />B. increase in rainfall amounts<br />C. slow movement of a glacier<br />D. rapid growth of plant roots<br />Answer:</td><td>Question: Chemical weathering occurs when minerals in rocks are changed chemically. Which of these will most likely change the rate of chemical weathering on a rock?<br />A. decrease in air temperature<br />B. increase in rainfall amounts<br />C. slow movement of a glacier<br />D. rapid growth of plant roots<br />Answer:</td></tr>\n",
"<tr><th>Gold</th><td>increase in rainfall amounts</td><td>increase in rainfall amounts</td><td>increase in rainfall amounts</td><td>B</td></tr>\n",
"<tr><th>Predictions</th><td>a</td><td>1</td><td>increase in rainfall amounts</td><td>D</td></tr>\n",
"<tr><th>Metrics: qem</th><td>0</td><td>0</td><td>1</td><td>0</td></tr>\n",
"<tr><th colspan='5'>Sample 7</th></tr>\n",
"<tr><th>Example</th><td>A toy truck rolls over a smooth surface. If the surface is covered with sand, the truck will most likely roll</td><td>Question: A toy truck rolls over a smooth surface. If the surface is covered with sand, the truck will most likely roll<br />Answer:</td><td>Question: A toy truck rolls over a smooth surface. If the surface is covered with sand, the truck will most likely roll<br />A. slower<br />B. faster<br />C. at the same speed<br />Answer:</td><td>Question: A toy truck rolls over a smooth surface. If the surface is covered with sand, the truck will most likely roll<br />A. slower<br />B. faster<br />C. at the same speed<br />Answer:</td></tr>\n",
"<tr><th>Gold</th><td>slower</td><td>slower</td><td>slower</td><td>A</td></tr>\n",
"<tr><th>Predictions</th><td>farther</td><td>1</td><td>at the same speed</td><td>C</td></tr>\n",
"<tr><th>Metrics: qem</th><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th colspan='5'>Sample 8</th></tr>\n",
"<tr><th>Example</th><td>A town in northern Arkansas experienced colder than normal temperatures during part of the winter. Which change was most likely responsible for this?</td><td>Question: A town in northern Arkansas experienced colder than normal temperatures during part of the winter. Which change was most likely responsible for this?<br />Answer:</td><td>Question: A town in northern Arkansas experienced colder than normal temperatures during part of the winter. Which change was most likely responsible for this?<br />A. A southward dip in the jet stream<br />B. A northward movement of the jet stream<br />C. The Coriolis effect creating a low pressure area<br />D. The Coriolis effect creating a high pressure area<br />Answer:</td><td>Question: A town in northern Arkansas experienced colder than normal temperatures during part of the winter. Which change was most likely responsible for this?<br />A. A southward dip in the jet stream<br />B. A northward movement of the jet stream<br />C. The Coriolis effect creating a low pressure area<br />D. The Coriolis effect creating a high pressure area<br />Answer:</td></tr>\n",
"<tr><th>Gold</th><td>A southward dip in the jet stream</td><td>A southward dip in the jet stream</td><td>A southward dip in the jet stream</td><td>A</td></tr>\n",
"<tr><th>Predictions</th><td>The sun was farther away from the earth</td><td>1</td><td>The Coriolis effect creating a low pressure area</td><td>D</td></tr>\n",
"<tr><th>Metrics: qem</th><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th colspan='5'>Sample 9</th></tr>\n",
"<tr><th>Example</th><td>How much time is required for a bicycle to travel a distance of 100 m at an average speed of 2 m/s?</td><td>Question: How much time is required for a bicycle to travel a distance of 100 m at an average speed of 2 m/s?<br />Answer:</td><td>Question: How much time is required for a bicycle to travel a distance of 100 m at an average speed of 2 m/s?<br />A. 0.02 s<br />B. 50 s<br />C. 100 s<br />D. 200 s<br />Answer:</td><td>Question: How much time is required for a bicycle to travel a distance of 100 m at an average speed of 2 m/s?<br />A. 0.02 s<br />B. 50 s<br />C. 100 s<br />D. 200 s<br />Answer:</td></tr>\n",
"<tr><th>Gold</th><td>50 s</td><td>50 s</td><td>50 s</td><td>B</td></tr>\n",
"<tr><th>Predictions</th><td></td><td>100 m/2 m/s = 50 s</td><td>100 s</td><td>200 s</td></tr>\n",
"<tr><th>Metrics: qem</th><td>0</td><td>0</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n",
"</table>
\n", "\n", "
\n", " " ], "text/plain": [ "GT(_tbl_data=shape: (40, 6)\n", "┌──────────┬──────────────┬──────────────────┬─────────────────┬─────────────────┬─────────────────┐\n", "│ Sample ┆ Type ┆ arc_base ┆ arc_context ┆ arc_context_cho ┆ arc_context_lab │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ice ┆ els │\n", "│ str ┆ str ┆ str ┆ str ┆ --- ┆ --- │\n", "│ ┆ ┆ ┆ ┆ str ┆ str │\n", "╞══════════╪══════════════╪══════════════════╪═════════════════╪═════════════════╪═════════════════╡\n", "│ Sample 0 ┆ Example ┆ Cities control ┆ Question: ┆ Question: ┆ Question: │\n", "│ ┆ ┆ the amount of p… ┆ Cities control ┆ Cities control ┆ Cities control │\n", "│ ┆ ┆ ┆ the a… ┆ the a… ┆ the a… │\n", "│ Sample 0 ┆ Gold ┆ The air stays ┆ The air stays ┆ The air stays ┆ A │\n", "│ ┆ ┆ cleaner. ┆ cleaner. ┆ cleaner. ┆ │\n", "│ Sample 0 ┆ Predictions ┆ It reduces the ┆ 1 ┆  It becomes ┆  D │\n", "│ ┆ ┆ amount of pollu… ┆ ┆ safer to drive ┆ │\n", "│ ┆ ┆ ┆ ┆ on … ┆ │\n", "│ Sample 0 ┆ Metrics: qem ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ Sample 1 ┆ Example ┆ Which statement ┆ Question: Which ┆ Question: Which ┆ Question: Which │\n", "│ ┆ ┆ correctly desc… ┆ statement corr… ┆ statement corr… ┆ statement corr… │\n", "│ … ┆ … ┆ … ┆ … ┆ … ┆ … │\n", "│ Sample 8 ┆ Metrics: qem ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "│ Sample 9 ┆ Example ┆ How much time is ┆ Question: How ┆ Question: How ┆ Question: How │\n", "│ ┆ ┆ required for … ┆ much time is ┆ much time is ┆ much time is │\n", "│ ┆ ┆ ┆ req… ┆ req… ┆ req… │\n", "│ Sample 9 ┆ Gold ┆ 50 s ┆ 50 s ┆ 50 s ┆ B │\n", "│ Sample 9 ┆ Predictions ┆ ┆ 100 m/2 m/s = ┆ 100 s ┆ 200 s │\n", "│ ┆ ┆ ┆ 50 s ┆ ┆ │\n", "│ Sample 9 ┆ Metrics: qem ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n", "└──────────┴──────────────┴──────────────────┴─────────────────┴─────────────────┴─────────────────┘, _body=, _boxhead=Boxhead([ColInfo(var='Sample', type=, column_label='Sample', column_align='left', column_width=None), ColInfo(var='Type', type=, column_label='Type', column_align='left', column_width=None), ColInfo(var='arc_base', type=, 
column_label='arc_base', column_align='left', column_width=None), ColInfo(var='arc_context', type=, column_label='arc_context', column_align='left', column_width=None), ColInfo(var='arc_context_choice', type=, column_label='arc_context_choice', column_align='left', column_width=None), ColInfo(var='arc_context_labels', type=, column_label='arc_context_labels', column_align='left', column_width=None)]), _stub=, _spanners=Spanners([SpannerInfo(spanner_id='Samples', spanner_level=0, spanner_label='Samples', spanner_units=None, spanner_pattern=None, vars=['arc_base', 'arc_context', 'arc_context_choice', 'arc_context_labels'], built=None)]), _heading=Heading(title=\"Comparing our different prompts' outputs\", subtitle=None, preheader=None), _stubhead=None, _source_notes=[], _footnotes=[], _styles=[], _locale=, _formats=[], _substitutions=[], _options=Options(table_id=OptionsInfo(scss=False, category='table', type='value', value=None), table_caption=OptionsInfo(scss=False, category='table', type='value', value=None), table_width=OptionsInfo(scss=True, category='table', type='px', value='auto'), table_layout=OptionsInfo(scss=True, category='table', type='value', value='fixed'), table_margin_left=OptionsInfo(scss=True, category='table', type='px', value='auto'), table_margin_right=OptionsInfo(scss=True, category='table', type='px', value='auto'), table_background_color=OptionsInfo(scss=True, category='table', type='value', value='#FFFFFF'), table_additional_css=OptionsInfo(scss=False, category='table', type='values', value=[]), table_font_names=OptionsInfo(scss=False, category='table', type='values', value=['-apple-system', 'BlinkMacSystemFont', 'Segoe UI', 'Roboto', 'Oxygen', 'Ubuntu', 'Cantarell', 'Helvetica Neue', 'Fira Sans', 'Droid Sans', 'Arial', 'sans-serif']), table_font_size=OptionsInfo(scss=True, category='table', type='px', value='16px'), table_font_weight=OptionsInfo(scss=True, category='table', type='value', value='normal'), 
table_font_style=OptionsInfo(scss=True, category='table', type='value', value='normal'), table_font_color=OptionsInfo(scss=True, category='table', type='value', value='#333333'), table_font_color_light=OptionsInfo(scss=True, category='table', type='value', value='#FFFFFF'), table_border_top_include=OptionsInfo(scss=False, category='table', type='boolean', value=True), table_border_top_style=OptionsInfo(scss=True, category='table', type='value', value='solid'), table_border_top_width=OptionsInfo(scss=True, category='table', type='px', value='2px'), table_border_top_color=OptionsInfo(scss=True, category='table', type='value', value='#A8A8A8'), table_border_right_style=OptionsInfo(scss=True, category='table', type='value', value='none'), table_border_right_width=OptionsInfo(scss=True, category='table', type='px', value='2px'), table_border_right_color=OptionsInfo(scss=True, category='table', type='value', value='#D3D3D3'), table_border_bottom_include=OptionsInfo(scss=False, category='table', type='boolean', value=True), table_border_bottom_style=OptionsInfo(scss=True, category='table', type='value', value='solid'), table_border_bottom_width=OptionsInfo(scss=True, category='table', type='px', value='2px'), table_border_bottom_color=OptionsInfo(scss=True, category='table', type='value', value='#A8A8A8'), table_border_left_style=OptionsInfo(scss=True, category='table', type='value', value='none'), table_border_left_width=OptionsInfo(scss=True, category='table', type='px', value='2px'), table_border_left_color=OptionsInfo(scss=True, category='table', type='value', value='#D3D3D3'), heading_background_color=OptionsInfo(scss=True, category='heading', type='value', value=None), heading_align=OptionsInfo(scss=True, category='heading', type='value', value='center'), heading_title_font_size=OptionsInfo(scss=True, category='heading', type='px', value='125%'), heading_title_font_weight=OptionsInfo(scss=True, category='heading', type='value', value='initial'), 
heading_subtitle_font_size=OptionsInfo(scss=True, category='heading', type='px', value='85%'), heading_subtitle_font_weight=OptionsInfo(scss=True, category='heading', type='value', value='initial'), heading_padding=OptionsInfo(scss=True, category='heading', type='px', value='4px'), heading_padding_horizontal=OptionsInfo(scss=True, category='heading', type='px', value='5px'), heading_border_bottom_style=OptionsInfo(scss=True, category='heading', type='value', value='solid'), heading_border_bottom_width=OptionsInfo(scss=True, category='heading', type='px', value='2px'), heading_border_bottom_color=OptionsInfo(scss=True, category='heading', type='value', value='#D3D3D3'), heading_border_lr_style=OptionsInfo(scss=True, category='heading', type='value', value='none'), heading_border_lr_width=OptionsInfo(scss=True, category='heading', type='px', value='1px'), heading_border_lr_color=OptionsInfo(scss=True, category='heading', type='value', value='#D3D3D3'), column_labels_background_color=OptionsInfo(scss=True, category='column_labels', type='value', value=None), column_labels_font_size=OptionsInfo(scss=True, category='column_labels', type='px', value='100%'), column_labels_font_weight=OptionsInfo(scss=True, category='column_labels', type='value', value='normal'), column_labels_text_transform=OptionsInfo(scss=True, category='column_labels', type='value', value='inherit'), column_labels_padding=OptionsInfo(scss=True, category='column_labels', type='px', value='5px'), column_labels_padding_horizontal=OptionsInfo(scss=True, category='column_labels', type='px', value='5px'), column_labels_vlines_style=OptionsInfo(scss=True, category='table_body', type='value', value='none'), column_labels_vlines_width=OptionsInfo(scss=True, category='table_body', type='px', value='1px'), column_labels_vlines_color=OptionsInfo(scss=True, category='table_body', type='value', value='#D3D3D3'), column_labels_border_top_style=OptionsInfo(scss=True, category='column_labels', type='value', 
value='solid'), column_labels_border_top_width=OptionsInfo(scss=True, category='column_labels', type='px', value='2px'), column_labels_border_top_color=OptionsInfo(scss=True, category='column_labels', type='value', value='#D3D3D3'), column_labels_border_bottom_style=OptionsInfo(scss=True, category='column_labels', type='value', value='solid'), column_labels_border_bottom_width=OptionsInfo(scss=True, category='column_labels', type='px', value='2px'), column_labels_border_bottom_color=OptionsInfo(scss=True, category='column_labels', type='value', value='#D3D3D3'), column_labels_border_lr_style=OptionsInfo(scss=True, category='column_labels', type='value', value='none'), column_labels_border_lr_width=OptionsInfo(scss=True, category='column_labels', type='px', value='1px'), column_labels_border_lr_color=OptionsInfo(scss=True, category='column_labels', type='value', value='#D3D3D3'), column_labels_hidden=OptionsInfo(scss=False, category='column_labels', type='boolean', value=False), row_group_background_color=OptionsInfo(scss=True, category='row_group', type='value', value=None), row_group_font_size=OptionsInfo(scss=True, category='row_group', type='px', value='100%'), row_group_font_weight=OptionsInfo(scss=True, category='row_group', type='value', value='initial'), row_group_text_transform=OptionsInfo(scss=True, category='row_group', type='value', value='inherit'), row_group_padding=OptionsInfo(scss=True, category='row_group', type='px', value='8px'), row_group_padding_horizontal=OptionsInfo(scss=True, category='row_group', type='px', value='5px'), row_group_border_top_style=OptionsInfo(scss=True, category='row_group', type='value', value='solid'), row_group_border_top_width=OptionsInfo(scss=True, category='row_group', type='px', value='2px'), row_group_border_top_color=OptionsInfo(scss=True, category='row_group', type='value', value='#D3D3D3'), row_group_border_right_style=OptionsInfo(scss=True, category='row_group', type='value', value='none'), 
row_group_border_right_width=OptionsInfo(scss=True, category='row_group', type='px', value='1px'), row_group_border_right_color=OptionsInfo(scss=True, category='row_group', type='value', value='#D3D3D3'), row_group_border_bottom_style=OptionsInfo(scss=True, category='row_group', type='value', value='solid'), row_group_border_bottom_width=OptionsInfo(scss=True, category='row_group', type='px', value='2px'), row_group_border_bottom_color=OptionsInfo(scss=True, category='row_group', type='value', value='#D3D3D3'), row_group_border_left_style=OptionsInfo(scss=True, category='row_group', type='value', value='none'), row_group_border_left_width=OptionsInfo(scss=True, category='row_group', type='px', value='1px'), row_group_border_left_color=OptionsInfo(scss=True, category='row_group', type='value', value='#D3D3D3'), row_group_as_column=OptionsInfo(scss=False, category='row_group', type='boolean', value=False), table_body_hlines_style=OptionsInfo(scss=True, category='table_body', type='value', value='solid'), table_body_hlines_width=OptionsInfo(scss=True, category='table_body', type='px', value='1px'), table_body_hlines_color=OptionsInfo(scss=True, category='table_body', type='value', value='#D3D3D3'), table_body_vlines_style=OptionsInfo(scss=True, category='table_body', type='value', value='none'), table_body_vlines_width=OptionsInfo(scss=True, category='table_body', type='px', value='1px'), table_body_vlines_color=OptionsInfo(scss=True, category='table_body', type='value', value='#D3D3D3'), table_body_border_top_style=OptionsInfo(scss=True, category='table_body', type='value', value='solid'), table_body_border_top_width=OptionsInfo(scss=True, category='table_body', type='px', value='2px'), table_body_border_top_color=OptionsInfo(scss=True, category='table_body', type='value', value='#D3D3D3'), table_body_border_bottom_style=OptionsInfo(scss=True, category='table_body', type='value', value='solid'), table_body_border_bottom_width=OptionsInfo(scss=True, 
category='table_body', type='px', value='2px'), table_body_border_bottom_color=OptionsInfo(scss=True, category='table_body', type='value', value='#D3D3D3'), data_row_padding=OptionsInfo(scss=True, category='data_row', type='px', value='8px'), data_row_padding_horizontal=OptionsInfo(scss=True, category='data_row', type='px', value='5px'), stub_background_color=OptionsInfo(scss=True, category='stub', type='value', value=None), stub_font_size=OptionsInfo(scss=True, category='stub', type='px', value='100%'), stub_font_weight=OptionsInfo(scss=True, category='stub', type='value', value='initial'), stub_text_transform=OptionsInfo(scss=True, category='stub', type='value', value='inherit'), stub_border_style=OptionsInfo(scss=True, category='stub', type='value', value='solid'), stub_border_width=OptionsInfo(scss=True, category='stub', type='px', value='2px'), stub_border_color=OptionsInfo(scss=True, category='stub', type='value', value='#D3D3D3'), stub_row_group_background_color=OptionsInfo(scss=True, category='stub', type='value', value=None), stub_row_group_font_size=OptionsInfo(scss=True, category='stub', type='px', value='100%'), stub_row_group_font_weight=OptionsInfo(scss=True, category='stub', type='value', value='initial'), stub_row_group_text_transform=OptionsInfo(scss=True, category='stub', type='value', value='inherit'), stub_row_group_border_style=OptionsInfo(scss=True, category='stub', type='value', value='solid'), stub_row_group_border_width=OptionsInfo(scss=True, category='stub', type='px', value='2px'), stub_row_group_border_color=OptionsInfo(scss=True, category='stub', type='value', value='#D3D3D3'), source_notes_padding=OptionsInfo(scss=True, category='source_notes', type='px', value='4px'), source_notes_padding_horizontal=OptionsInfo(scss=True, category='source_notes', type='px', value='5px'), source_notes_background_color=OptionsInfo(scss=True, category='source_notes', type='value', value=None), source_notes_font_size=OptionsInfo(scss=True, 
category='source_notes', type='px', value='90%'), source_notes_border_bottom_style=OptionsInfo(scss=True, category='source_notes', type='value', value='none'), source_notes_border_bottom_width=OptionsInfo(scss=True, category='source_notes', type='px', value='2px'), source_notes_border_bottom_color=OptionsInfo(scss=True, category='source_notes', type='value', value='#D3D3D3'), source_notes_border_lr_style=OptionsInfo(scss=True, category='source_notes', type='value', value='none'), source_notes_border_lr_width=OptionsInfo(scss=True, category='source_notes', type='px', value='2px'), source_notes_border_lr_color=OptionsInfo(scss=True, category='source_notes', type='value', value='#D3D3D3'), source_notes_multiline=OptionsInfo(scss=False, category='source_notes', type='boolean', value=True), source_notes_sep=OptionsInfo(scss=False, category='source_notes', type='value', value=' '), row_striping_background_color=OptionsInfo(scss=True, category='row', type='value', value='rgba(128,128,128,0.05)'), row_striping_include_stub=OptionsInfo(scss=False, category='row', type='boolean', value=False), row_striping_include_table_body=OptionsInfo(scss=False, category='row', type='boolean', value=False), container_width=OptionsInfo(scss=False, category='container', type='px', value='auto'), container_height=OptionsInfo(scss=False, category='container', type='px', value='auto'), container_padding_x=OptionsInfo(scss=False, category='container', type='px', value='0px'), container_padding_y=OptionsInfo(scss=False, category='container', type='px', value='10px'), container_overflow_x=OptionsInfo(scss=False, category='container', type='overflow', value='auto'), container_overflow_y=OptionsInfo(scss=False, category='container', type='overflow', value='auto'), quarto_disable_processing=OptionsInfo(scss=False, category='quarto', type='logical', value=False), quarto_use_bootstrap=OptionsInfo(scss=False, category='quarto', type='logical', value=False)), _has_built=False)" ] }, "execution_count": 
21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(GT(pl_data.head(max_samples*4))\n", "    .tab_header(\"Comparing our different prompts' outputs\")\n", "    .tab_spanner(label=\"Samples\", columns=cs.starts_with(\"arc\"))\n", "    .tab_stub(rowname_col=\"Type\", groupname_col=\"Sample\")\n", "    .fmt_markdown(columns=cs.starts_with(\"arc\"))\n", ")" ] }, { "cell_type": "markdown", "id": "9fda1fd6", "metadata": {}, "source": [ "We can observe that:\n", "- the base format is too rigid in generative mode: the model never predicts the correct ending\n", "- the base format plus question/answer tags seems to push the model to output numbers, which is inadequate for many of the questions\n", "- however, in the last two cases, including the choices in the question helps the model predict one of the relevant choices!\n", "\n", "Interestingly, in the last case, when the choices are present but the model must predict the label, it does not manage to do so systematically (in the samples above it mostly answers `D`).\n", "\n", "In other cases, such as samples 3 and 4, the model does not select the same choice when predicting the label as when predicting the choice text, even though the choices are present in both prompts." ] }, { "cell_type": "markdown", "id": "8c2fe99d", "metadata": {}, "source": [ "### Let's look at the output log-probabilities" ] }, { "cell_type": "code", "execution_count": 22, "id": "1c5b7411", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Comparing our different prompts' outputs
\n", " Samples\n", "
arc_basearc_contextarc_context_choicearc_context_labels
Sample 10
ExampleCities control the amount of pollution that is allowed to come from cars. How does this most likely help people?Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people?
Answer:
Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people?
A. The air stays cleaner.
B. Cars can travel at faster speeds.
C. The skills of the drivers improve.
D. It becomes safer to drive on the roads.
Answer:
Question: Cities control the amount of pollution that is allowed to come from cars. How does this most likely help people?
A. The air stays cleaner.
B. Cars can travel at faster speeds.
C. The skills of the drivers improve.
D. It becomes safer to drive on the roads.
Answer:
GoldThe air stays cleaner.The air stays cleaner.The air stays cleaner.A
Predictions(-10.595719337463379, False)
(-17.943925857543945, False)
(-25.558387756347656, False)
(-19.944787979125977, False)
(-9.650793075561523, False)
(-18.44221305847168, False)
(-28.147789001464844, False)
(-21.082042694091797, False)
(-1.7918881177902222, False)
(-3.5758469104766846, False)
(-2.042778253555298, False)
(-0.5759693384170532, True)
(-1.445155382156372, False)
(-1.695155382156372, False)
(-1.445155382156372, False)
(-1.070155382156372, True)
Metrics: acc_norm | 1 | 1 | 0 | 0
Sample 11
Example | Which statement correctly describes a physical characteristic of the Moon? | Question: Which statement correctly describes a physical characteristic of the Moon?
Answer:
Question: Which statement correctly describes a physical characteristic of the Moon?
A. The Moon is made of hot gases.
B. The Moon is covered with many craters.
C. The Moon has many bodies of liquid water.
D. The Moon has the ability to give off its own light.
Answer:
Question: Which statement correctly describes a physical characteristic of the Moon?
A. The Moon is made of hot gases.
B. The Moon is covered with many craters.
C. The Moon has many bodies of liquid water.
D. The Moon has the ability to give off its own light.
Answer:
Gold | The Moon is covered with many craters. | The Moon is covered with many craters. | The Moon is covered with many craters. | B
Predictions | (-14.036760330200195, False)
(-14.156133651733398, False)
(-23.221176147460938, False)
(-22.99703598022461, False)
(-12.965611457824707, False)
(-11.864521026611328, False)
(-20.491010665893555, False)
(-21.728534698486328, False)
(-4.868813514709473, False)
(-3.7199177742004395, False)
(-3.739739418029785, False)
(-0.2140919268131256, True)
(-1.5505614280700684, False)
(-1.4255614280700684, False)
(-1.3005614280700684, True)
(-1.3005614280700684, False)
Metrics: acc_norm | 1 | 1 | 0 | 0
Sample 12
Example | Which object in the solar system is orbited by a belt of asteroids? | Question: Which object in the solar system is orbited by a belt of asteroids?
Answer:
Question: Which object in the solar system is orbited by a belt of asteroids?
A. Pluto
B. Saturn
C. the Sun
D. the Moon
Answer:
Question: Which object in the solar system is orbited by a belt of asteroids?
A. Pluto
B. Saturn
C. the Sun
D. the Moon
Answer:
Gold | the Sun | the Sun | the Sun | C
Predictions | (-5.751784801483154, False)
(-3.7467873096466064, False)
(-6.4582719802856445, False)
(-7.0832719802856445, False)
(-3.279644250869751, False)
(-3.654644250869751, False)
(-5.170290946960449, False)
(-5.920290946960449, False)
(-2.128627300262451, False)
(-1.8786273002624512, False)
(-0.7749634981155396, True)
(-2.774963617324829, False)
(-1.6950974464416504, False)
(-1.4450974464416504, False)
(-1.3200974464416504, False)
(-1.1950974464416504, True)
Metrics: acc_norm | 0 | 0 | 1 | 0
Sample 13
Example | Light waves that cross from an air medium to a water medium will | Question: Light waves that cross from an air medium to a water medium will
Answer:
Question: Light waves that cross from an air medium to a water medium will
A. be focused into a straight line.
B. lose energy and dissipate.
C. change length and direction.
D. reflect off the water's surface.
Answer:
Question: Light waves that cross from an air medium to a water medium will
A. be focused into a straight line.
B. lose energy and dissipate.
C. change length and direction.
D. reflect off the water's surface.
Answer:
Gold | change length and direction. | change length and direction. | change length and direction. | C
Predictions | (-25.314556121826172, False)
(-22.782466888427734, False)
(-22.68366050720215, False)
(-21.23833465576172, False)
(-17.779306411743164, False)
(-15.633138656616211, False)
(-14.039811134338379, False)
(-11.432372093200684, False)
(-4.998326301574707, False)
(-1.623903512954712, False)
(-0.5861192941665649, True)
(-2.0934457778930664, False)
(-1.625598669052124, False)
(-1.500598669052124, False)
(-1.375598669052124, False)
(-1.125598669052124, True)
Metrics: acc_norm | 0 | 0 | 1 | 0
Sample 14
Example | How many valence electrons does selenium have? | Question: How many valence electrons does selenium have?
Answer:
Question: How many valence electrons does selenium have?
A. 3
B. 5
C. 6
D. 8
Answer:
Question: How many valence electrons does selenium have?
A. 3
B. 5
C. 6
D. 8
Answer:
Gold | 6 | 6 | 6 | C
Predictions | (-6.6732683181762695, False)
(-6.9232683181762695, False)
(-7.2357683181762695, False)
(-6.7982683181762695, False)
(-3.4133384227752686, False)
(-4.663338661193848, False)
(-1.1633384227752686, True)
(-3.7883384227752686, False)
(-6.926815986633301, False)
(-6.051815986633301, False)
(-0.17681613564491272, True)
(-1.9268162250518799, False)
(-1.8513681888580322, False)
(-1.6013681888580322, False)
(-1.3513681888580322, False)
(-0.976368248462677, True)
Metrics: acc_norm | 0 | 1 | 1 | 0
Sample 15
Example | As a warm moist air mass moving northward collides with a strong cold air mass moving southward, what observations will most likely be made? | Question: As a warm moist air mass moving northward collides with a strong cold air mass moving southward, what observations will most likely be made?
Answer:
Question: As a warm moist air mass moving northward collides with a strong cold air mass moving southward, what observations will most likely be made?
A. Thick fog develops.
B. Temperatures increase.
C. Clouds begin to form.
D. Winds die down.
Answer:
Question: As a warm moist air mass moving northward collides with a strong cold air mass moving southward, what observations will most likely be made?
A. Thick fog develops.
B. Temperatures increase.
C. Clouds begin to form.
D. Winds die down.
Answer:
Gold | Clouds begin to form. | Clouds begin to form. | Clouds begin to form. | C
Predictions | (-16.48362159729004, False)
(-12.333986282348633, False)
(-10.768516540527344, False)
(-14.52383804321289, False)
(-15.831233024597168, False)
(-11.781991004943848, False)
(-10.162830352783203, False)
(-15.114853858947754, False)
(-0.6979869604110718, True)
(-1.7184417247772217, False)
(-2.8269424438476562, False)
(-2.2389371395111084, False)
(-1.4881373643875122, False)
(-1.4881373643875122, False)
(-1.3631373643875122, False)
(-1.2381373643875122, True)
Metrics: acc_norm | 1 | 1 | 0 | 0
Sample 16
Example | Chemical weathering occurs when minerals in rocks are changed chemically. Which of these will most likely change the rate of chemical weathering on a rock? | Question: Chemical weathering occurs when minerals in rocks are changed chemically. Which of these will most likely change the rate of chemical weathering on a rock?
Answer:
Question: Chemical weathering occurs when minerals in rocks are changed chemically. Which of these will most likely change the rate of chemical weathering on a rock?
A. decrease in air temperature
B. increase in rainfall amounts
C. slow movement of a glacier
D. rapid growth of plant roots
Answer:
Question: Chemical weathering occurs when minerals in rocks are changed chemically. Which of these will most likely change the rate of chemical weathering on a rock?
A. decrease in air temperature
B. increase in rainfall amounts
C. slow movement of a glacier
D. rapid growth of plant roots
Answer:
Gold | increase in rainfall amounts | increase in rainfall amounts | increase in rainfall amounts | B
Predictions | (-15.226205825805664, False)
(-18.300310134887695, False)
(-18.7595272064209, False)
(-16.90313148498535, False)
(-15.182734489440918, False)
(-16.29802131652832, False)
(-15.929485321044922, False)
(-15.493678092956543, False)
(-2.301724433898926, False)
(-1.6302365064620972, False)
(-1.4298421144485474, False)
(-2.3473901748657227, False)
(-1.734103798866272, False)
(-1.609103798866272, False)
(-1.234103798866272, False)
(-1.109103798866272, True)
Metrics: acc_norm | 0 | 0 | 0 | 0
Sample 17
Example | A toy truck rolls over a smooth surface. If the surface is covered with sand, the truck will most likely roll | Question: A toy truck rolls over a smooth surface. If the surface is covered with sand, the truck will most likely roll
Answer:
Question: A toy truck rolls over a smooth surface. If the surface is covered with sand, the truck will most likely roll
A. slower
B. faster
C. at the same speed
Answer:
Question: A toy truck rolls over a smooth surface. If the surface is covered with sand, the truck will most likely roll
A. slower
B. faster
C. at the same speed
Answer:
Gold | slower | slower | slower | A
Predictions | (-13.275317192077637, False)
(-10.096747398376465, False)
(-19.148441314697266, False)
(-2.1855695247650146, False)
(-2.1855695247650146, True)
(-7.597217559814453, False)
(-1.2695996761322021, False)
(-0.7695997357368469, True)
(-1.6472738981246948, False)
(-1.2452011108398438, False)
(-1.3702011108398438, False)
(-0.995201051235199, True)
Metrics: acc_norm | 0 | 1 | 0 | 0
Sample 18
Example | A town in northern Arkansas experienced colder than normal temperatures during part of the winter. Which change was most likely responsible for this? | Question: A town in northern Arkansas experienced colder than normal temperatures during part of the winter. Which change was most likely responsible for this?
Answer:
Question: A town in northern Arkansas experienced colder than normal temperatures during part of the winter. Which change was most likely responsible for this?
A. A southward dip in the jet stream
B. A northward movement of the jet stream
C. The Coriolis effect creating a low pressure area
D. The Coriolis effect creating a high pressure area
Answer:
Question: A town in northern Arkansas experienced colder than normal temperatures during part of the winter. Which change was most likely responsible for this?
A. A southward dip in the jet stream
B. A northward movement of the jet stream
C. The Coriolis effect creating a low pressure area
D. The Coriolis effect creating a high pressure area
Answer:
Gold | A southward dip in the jet stream | A southward dip in the jet stream | A southward dip in the jet stream | A
Predictions | (-14.345656394958496, False)
(-16.05936050415039, False)
(-28.54291534423828, False)
(-28.780471801757812, False)
(-14.603858947753906, False)
(-16.18315887451172, False)
(-27.121496200561523, False)
(-27.188758850097656, False)
(-1.6069732904434204, False)
(-3.61207914352417, False)
(-0.9148926734924316, True)
(-1.2960143089294434, False)
(-1.4954701662063599, False)
(-1.6204701662063599, False)
(-1.2454701662063599, True)
(-1.2454701662063599, False)
Metrics: acc_norm | 0 | 0 | 0 | 0
Sample 19
Example | How much time is required for a bicycle to travel a distance of 100 m at an average speed of 2 m/s? | Question: How much time is required for a bicycle to travel a distance of 100 m at an average speed of 2 m/s?
Answer:
Question: How much time is required for a bicycle to travel a distance of 100 m at an average speed of 2 m/s?
A. 0.02 s
B. 50 s
C. 100 s
D. 200 s
Answer:
Question: How much time is required for a bicycle to travel a distance of 100 m at an average speed of 2 m/s?
A. 0.02 s
B. 50 s
C. 100 s
D. 200 s
Answer:
Gold | 50 s | 50 s | 50 s | B
Predictions | (-9.211626052856445, False)
(-7.683445453643799, False)
(-7.148634433746338, False)
(-8.429720878601074, False)
(-7.6334710121154785, False)
(-4.587807655334473, False)
(-4.615830421447754, False)
(-6.059815883636475, False)
(-3.2665140628814697, False)
(-1.9924125671386719, False)
(-0.8778499960899353, True)
(-0.9978684186935425, False)
(-1.7351858615875244, False)
(-1.6101858615875244, False)
(-1.2351858615875244, False)
(-1.1101858615875244, True)
Metrics: acc_norm | 0 | 0 | 0 | 0
\n", "\n", "
\n", " " ], "text/plain": [ "GT(_tbl_data=shape: (40, 6)\n", "┌───────────┬─────────────┬──────────────────┬─────────────────┬─────────────────┬─────────────────┐\n", "│ Sample ┆ Type ┆ arc_base ┆ arc_context ┆ arc_context_cho ┆ arc_context_lab │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ ice ┆ els │\n", "│ str ┆ str ┆ str ┆ str ┆ --- ┆ --- │\n", "│ ┆ ┆ ┆ ┆ str ┆ str │\n", "╞═══════════╪═════════════╪══════════════════╪═════════════════╪═════════════════╪═════════════════╡\n", "│ Sample 10 ┆ Example ┆ Cities control ┆ Question: ┆ Question: ┆ Question: │\n", "│ ┆ ┆ the amount of p… ┆ Cities control ┆ Cities control ┆ Cities control │\n", "│ ┆ ┆ ┆ the a… ┆ the a… ┆ the a… │\n", "│ Sample 10 ┆ Gold ┆ The air stays ┆ The air stays ┆ The air stays ┆ A │\n", "│ ┆ ┆ cleaner. ┆ cleaner. ┆ cleaner. ┆ │\n", "│ Sample 10 ┆ Predictions ┆ (-10.59571933746 ┆ (-9.65079307556 ┆ (-1.79188811779 ┆ (-1.44515538215 │\n", "│ ┆ ┆ 3379, False), _boxhead=Boxhead([ColInfo(var='Sample', type=, column_label='Sample', column_align='left', column_width=None), ColInfo(var='Type', type=, column_label='Type', column_align='left', column_width=None), ColInfo(var='arc_base', type=, column_label='arc_base', column_align='left', column_width=None), ColInfo(var='arc_context', type=, column_label='arc_context', column_align='left', column_width=None), ColInfo(var='arc_context_choice', type=, column_label='arc_context_choice', column_align='left', column_width=None), ColInfo(var='arc_context_labels', type=, column_label='arc_context_labels', column_align='left', column_width=None)]), _stub=, _spanners=Spanners([SpannerInfo(spanner_id='Samples', spanner_level=0, spanner_label='Samples', spanner_units=None, spanner_pattern=None, vars=['arc_base', 'arc_context', 'arc_context_choice', 'arc_context_labels'], built=None)]), _heading=Heading(title=\"Comparing our different prompts' outputs\", subtitle=None, preheader=None), _stubhead=None, _source_notes=[], _footnotes=[], _styles=[], _locale=, _formats=[], 
_substitutions=[], _options=Options(table_id=OptionsInfo(scss=False, category='table', type='value', value=None), table_caption=OptionsInfo(scss=False, category='table', type='value', value=None), table_width=OptionsInfo(scss=True, category='table', type='px', value='auto'), table_layout=OptionsInfo(scss=True, category='table', type='value', value='fixed'), table_margin_left=OptionsInfo(scss=True, category='table', type='px', value='auto'), table_margin_right=OptionsInfo(scss=True, category='table', type='px', value='auto'), table_background_color=OptionsInfo(scss=True, category='table', type='value', value='#FFFFFF'), table_additional_css=OptionsInfo(scss=False, category='table', type='values', value=[]), table_font_names=OptionsInfo(scss=False, category='table', type='values', value=['-apple-system', 'BlinkMacSystemFont', 'Segoe UI', 'Roboto', 'Oxygen', 'Ubuntu', 'Cantarell', 'Helvetica Neue', 'Fira Sans', 'Droid Sans', 'Arial', 'sans-serif']), table_font_size=OptionsInfo(scss=True, category='table', type='px', value='16px'), table_font_weight=OptionsInfo(scss=True, category='table', type='value', value='normal'), table_font_style=OptionsInfo(scss=True, category='table', type='value', value='normal'), table_font_color=OptionsInfo(scss=True, category='table', type='value', value='#333333'), table_font_color_light=OptionsInfo(scss=True, category='table', type='value', value='#FFFFFF'), table_border_top_include=OptionsInfo(scss=False, category='table', type='boolean', value=True), table_border_top_style=OptionsInfo(scss=True, category='table', type='value', value='solid'), table_border_top_width=OptionsInfo(scss=True, category='table', type='px', value='2px'), table_border_top_color=OptionsInfo(scss=True, category='table', type='value', value='#A8A8A8'), table_border_right_style=OptionsInfo(scss=True, category='table', type='value', value='none'), table_border_right_width=OptionsInfo(scss=True, category='table', type='px', value='2px'), 
table_border_right_color=OptionsInfo(scss=True, category='table', type='value', value='#D3D3D3'), table_border_bottom_include=OptionsInfo(scss=False, category='table', type='boolean', value=True), table_border_bottom_style=OptionsInfo(scss=True, category='table', type='value', value='solid'), table_border_bottom_width=OptionsInfo(scss=True, category='table', type='px', value='2px'), table_border_bottom_color=OptionsInfo(scss=True, category='table', type='value', value='#A8A8A8'), table_border_left_style=OptionsInfo(scss=True, category='table', type='value', value='none'), table_border_left_width=OptionsInfo(scss=True, category='table', type='px', value='2px'), table_border_left_color=OptionsInfo(scss=True, category='table', type='value', value='#D3D3D3'), heading_background_color=OptionsInfo(scss=True, category='heading', type='value', value=None), heading_align=OptionsInfo(scss=True, category='heading', type='value', value='center'), heading_title_font_size=OptionsInfo(scss=True, category='heading', type='px', value='125%'), heading_title_font_weight=OptionsInfo(scss=True, category='heading', type='value', value='initial'), heading_subtitle_font_size=OptionsInfo(scss=True, category='heading', type='px', value='85%'), heading_subtitle_font_weight=OptionsInfo(scss=True, category='heading', type='value', value='initial'), heading_padding=OptionsInfo(scss=True, category='heading', type='px', value='4px'), heading_padding_horizontal=OptionsInfo(scss=True, category='heading', type='px', value='5px'), heading_border_bottom_style=OptionsInfo(scss=True, category='heading', type='value', value='solid'), heading_border_bottom_width=OptionsInfo(scss=True, category='heading', type='px', value='2px'), heading_border_bottom_color=OptionsInfo(scss=True, category='heading', type='value', value='#D3D3D3'), heading_border_lr_style=OptionsInfo(scss=True, category='heading', type='value', value='none'), heading_border_lr_width=OptionsInfo(scss=True, category='heading', type='px', 
value='1px'), heading_border_lr_color=OptionsInfo(scss=True, category='heading', type='value', value='#D3D3D3'), column_labels_background_color=OptionsInfo(scss=True, category='column_labels', type='value', value=None), column_labels_font_size=OptionsInfo(scss=True, category='column_labels', type='px', value='100%'), column_labels_font_weight=OptionsInfo(scss=True, category='column_labels', type='value', value='normal'), column_labels_text_transform=OptionsInfo(scss=True, category='column_labels', type='value', value='inherit'), column_labels_padding=OptionsInfo(scss=True, category='column_labels', type='px', value='5px'), column_labels_padding_horizontal=OptionsInfo(scss=True, category='column_labels', type='px', value='5px'), column_labels_vlines_style=OptionsInfo(scss=True, category='table_body', type='value', value='none'), column_labels_vlines_width=OptionsInfo(scss=True, category='table_body', type='px', value='1px'), column_labels_vlines_color=OptionsInfo(scss=True, category='table_body', type='value', value='#D3D3D3'), column_labels_border_top_style=OptionsInfo(scss=True, category='column_labels', type='value', value='solid'), column_labels_border_top_width=OptionsInfo(scss=True, category='column_labels', type='px', value='2px'), column_labels_border_top_color=OptionsInfo(scss=True, category='column_labels', type='value', value='#D3D3D3'), column_labels_border_bottom_style=OptionsInfo(scss=True, category='column_labels', type='value', value='solid'), column_labels_border_bottom_width=OptionsInfo(scss=True, category='column_labels', type='px', value='2px'), column_labels_border_bottom_color=OptionsInfo(scss=True, category='column_labels', type='value', value='#D3D3D3'), column_labels_border_lr_style=OptionsInfo(scss=True, category='column_labels', type='value', value='none'), column_labels_border_lr_width=OptionsInfo(scss=True, category='column_labels', type='px', value='1px'), column_labels_border_lr_color=OptionsInfo(scss=True, category='column_labels', 
type='value', value='#D3D3D3'), column_labels_hidden=OptionsInfo(scss=False, category='column_labels', type='boolean', value=False), row_group_background_color=OptionsInfo(scss=True, category='row_group', type='value', value=None), row_group_font_size=OptionsInfo(scss=True, category='row_group', type='px', value='100%'), row_group_font_weight=OptionsInfo(scss=True, category='row_group', type='value', value='initial'), row_group_text_transform=OptionsInfo(scss=True, category='row_group', type='value', value='inherit'), row_group_padding=OptionsInfo(scss=True, category='row_group', type='px', value='8px'), row_group_padding_horizontal=OptionsInfo(scss=True, category='row_group', type='px', value='5px'), row_group_border_top_style=OptionsInfo(scss=True, category='row_group', type='value', value='solid'), row_group_border_top_width=OptionsInfo(scss=True, category='row_group', type='px', value='2px'), row_group_border_top_color=OptionsInfo(scss=True, category='row_group', type='value', value='#D3D3D3'), row_group_border_right_style=OptionsInfo(scss=True, category='row_group', type='value', value='none'), row_group_border_right_width=OptionsInfo(scss=True, category='row_group', type='px', value='1px'), row_group_border_right_color=OptionsInfo(scss=True, category='row_group', type='value', value='#D3D3D3'), row_group_border_bottom_style=OptionsInfo(scss=True, category='row_group', type='value', value='solid'), row_group_border_bottom_width=OptionsInfo(scss=True, category='row_group', type='px', value='2px'), row_group_border_bottom_color=OptionsInfo(scss=True, category='row_group', type='value', value='#D3D3D3'), row_group_border_left_style=OptionsInfo(scss=True, category='row_group', type='value', value='none'), row_group_border_left_width=OptionsInfo(scss=True, category='row_group', type='px', value='1px'), row_group_border_left_color=OptionsInfo(scss=True, category='row_group', type='value', value='#D3D3D3'), row_group_as_column=OptionsInfo(scss=False, 
category='row_group', type='boolean', value=False), table_body_hlines_style=OptionsInfo(scss=True, category='table_body', type='value', value='solid'), table_body_hlines_width=OptionsInfo(scss=True, category='table_body', type='px', value='1px'), table_body_hlines_color=OptionsInfo(scss=True, category='table_body', type='value', value='#D3D3D3'), table_body_vlines_style=OptionsInfo(scss=True, category='table_body', type='value', value='none'), table_body_vlines_width=OptionsInfo(scss=True, category='table_body', type='px', value='1px'), table_body_vlines_color=OptionsInfo(scss=True, category='table_body', type='value', value='#D3D3D3'), table_body_border_top_style=OptionsInfo(scss=True, category='table_body', type='value', value='solid'), table_body_border_top_width=OptionsInfo(scss=True, category='table_body', type='px', value='2px'), table_body_border_top_color=OptionsInfo(scss=True, category='table_body', type='value', value='#D3D3D3'), table_body_border_bottom_style=OptionsInfo(scss=True, category='table_body', type='value', value='solid'), table_body_border_bottom_width=OptionsInfo(scss=True, category='table_body', type='px', value='2px'), table_body_border_bottom_color=OptionsInfo(scss=True, category='table_body', type='value', value='#D3D3D3'), data_row_padding=OptionsInfo(scss=True, category='data_row', type='px', value='8px'), data_row_padding_horizontal=OptionsInfo(scss=True, category='data_row', type='px', value='5px'), stub_background_color=OptionsInfo(scss=True, category='stub', type='value', value=None), stub_font_size=OptionsInfo(scss=True, category='stub', type='px', value='100%'), stub_font_weight=OptionsInfo(scss=True, category='stub', type='value', value='initial'), stub_text_transform=OptionsInfo(scss=True, category='stub', type='value', value='inherit'), stub_border_style=OptionsInfo(scss=True, category='stub', type='value', value='solid'), stub_border_width=OptionsInfo(scss=True, category='stub', type='px', value='2px'), 
stub_border_color=OptionsInfo(scss=True, category='stub', type='value', value='#D3D3D3'), stub_row_group_background_color=OptionsInfo(scss=True, category='stub', type='value', value=None), stub_row_group_font_size=OptionsInfo(scss=True, category='stub', type='px', value='100%'), stub_row_group_font_weight=OptionsInfo(scss=True, category='stub', type='value', value='initial'), stub_row_group_text_transform=OptionsInfo(scss=True, category='stub', type='value', value='inherit'), stub_row_group_border_style=OptionsInfo(scss=True, category='stub', type='value', value='solid'), stub_row_group_border_width=OptionsInfo(scss=True, category='stub', type='px', value='2px'), stub_row_group_border_color=OptionsInfo(scss=True, category='stub', type='value', value='#D3D3D3'), source_notes_padding=OptionsInfo(scss=True, category='source_notes', type='px', value='4px'), source_notes_padding_horizontal=OptionsInfo(scss=True, category='source_notes', type='px', value='5px'), source_notes_background_color=OptionsInfo(scss=True, category='source_notes', type='value', value=None), source_notes_font_size=OptionsInfo(scss=True, category='source_notes', type='px', value='90%'), source_notes_border_bottom_style=OptionsInfo(scss=True, category='source_notes', type='value', value='none'), source_notes_border_bottom_width=OptionsInfo(scss=True, category='source_notes', type='px', value='2px'), source_notes_border_bottom_color=OptionsInfo(scss=True, category='source_notes', type='value', value='#D3D3D3'), source_notes_border_lr_style=OptionsInfo(scss=True, category='source_notes', type='value', value='none'), source_notes_border_lr_width=OptionsInfo(scss=True, category='source_notes', type='px', value='2px'), source_notes_border_lr_color=OptionsInfo(scss=True, category='source_notes', type='value', value='#D3D3D3'), source_notes_multiline=OptionsInfo(scss=False, category='source_notes', type='boolean', value=True), source_notes_sep=OptionsInfo(scss=False, category='source_notes', type='value', 
value=' '), row_striping_background_color=OptionsInfo(scss=True, category='row', type='value', value='rgba(128,128,128,0.05)'), row_striping_include_stub=OptionsInfo(scss=False, category='row', type='boolean', value=False), row_striping_include_table_body=OptionsInfo(scss=False, category='row', type='boolean', value=False), container_width=OptionsInfo(scss=False, category='container', type='px', value='auto'), container_height=OptionsInfo(scss=False, category='container', type='px', value='auto'), container_padding_x=OptionsInfo(scss=False, category='container', type='px', value='0px'), container_padding_y=OptionsInfo(scss=False, category='container', type='px', value='10px'), container_overflow_x=OptionsInfo(scss=False, category='container', type='overflow', value='auto'), container_overflow_y=OptionsInfo(scss=False, category='container', type='overflow', value='auto'), quarto_disable_processing=OptionsInfo(scss=False, category='quarto', type='logical', value=False), quarto_use_bootstrap=OptionsInfo(scss=False, category='quarto', type='logical', value=False)), _has_built=False)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(GT(pl_data.tail(max_samples * 4))\n", " .tab_header(\"Comparing our different prompts' outputs\")\n", " .tab_spanner(label=\"Samples\", columns=cs.starts_with(\"arc\"))\n", " .tab_stub(rowname_col=\"Type\", groupname_col=\"Sample\")\n", " .fmt_markdown(columns=cs.starts_with(\"arc\"))\n", ")" ] }, { "cell_type": "markdown", "id": "07a8d041", "metadata": {}, "source": [ "Looking at the log-likelihoods of the generations, what curiously seems to work best for this model is not including any examples in the prompt, the opposite of what we saw with the generative evaluations." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 5 }