diff --git "a/Time_Series_Forcasting_XGBoost.ipynb" "b/Time_Series_Forcasting_XGBoost.ipynb" new file mode 100644--- /dev/null +++ "b/Time_Series_Forcasting_XGBoost.ipynb" @@ -0,0 +1,2223 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "dM9_ky3Urx-0", + "metadata": { + "id": "dM9_ky3Urx-0" + }, + "source": [ + "# Multivariate Time Series Forecasting using XGBoost Regressor: A Machine Learning Approach" + ] + }, + { + "cell_type": "markdown", + "id": "ec02f7eb", + "metadata": {}, + "source": [ + "#### The notebook aims to implement a **Machine learning based model** (**XGBoost Regressor**) on historical stock data of **INTEL**, spanning **1980 to 2023**.\n", + "\n", + "#### The primary objective is to forecast the **closing prices** of the stock capturing temporal patterns through relevant predictive variables." + ] + }, + { + "cell_type": "markdown", + "id": "504890d4", + "metadata": {}, + "source": [ + "#### XGBoost Regressor is a powerful machine learning algorithm based on the gradient boosting framework. 
It is an ensemble learning method that combines the predictions of multiple weak models (decision trees) into a strong predictive model: each tree is trained to correct the errors of the previous ones.\n", + "\n", + "--- \n", + "\n", + "### XGBoost can be used for time series forecasting by reframing the problem as a supervised learning task, that is, by transforming the time series data into a supervised learning format. This involves creating lagged features (past values) of the target variable, and optionally of other relevant input variables, to enable the model to learn patterns and predict future values.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "7d287fa1", + "metadata": {}, + "source": [ + "\n", + "\n", + "# Approach\n", + "\n", + "### Data preparation:\n", + "\n", + "* Creating additional new features from Date components and rolling statistics\n", + "* Creating lagged features for all the relevant time series variables, in our case the \"Closing Price\" and \"High\", to serve as target and input variables respectively.\n", + "\n", + "---\n", + "\n", + "### Defining the target variable\n", + "\n", + "* Defining the target variable as the future value we want to predict\n", + "\n", + "---\n", + "\n", + "### Chronological Splitting\n", + "\n", + "* Preserve temporal order when splitting data into train/test sets\n", + "\n", + "\n", + "---\n", + "### Defining Evaluation metrics and Creating a Baseline model \n", + "\n", + "* We define a very basic **naive** model for later comparison with the trained model across the defined evaluation metrics\n", + "\n", + "---\n", + "### Model training\n", + "\n", + "* Train the XGBoost regressor on this structured dataset to learn patterns and dependencies across multiple time series\n", + "\n", + "---\n", + "\n", + "### Making Predictions\n", + "\n", + "* Making predictions one step ahead in the future\n", + "\n", + "---\n", + "\n", + "### Evaluate the model\n", + "\n", + "* Evaluate the model using 
appropriate performance metrics\n", + "\n", + "---\n", + "### Hyperparameter Tuning/Optimization\n", + "\n", + "* Using Grid Search CV to find best parameters\n", + "\n", + "\n", + "### Making Predictions (n steps ahead)\n", + "\n", + "* Making predictions for n steps ahead in the future\n", + "\n", + "---\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "cd45fb74", + "metadata": {}, + "source": [ + "# Importing required libraries" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "qggkhEYpryK7", + "metadata": { + "id": "qggkhEYpryK7" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "import matplotlib.pyplot as plt\n", + "\n", + "from sklearn.model_selection import GridSearchCV\n", + "from xgboost import plot_importance\n", + "import xgboost as xgb\n", + "import seaborn as sns\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "de09a572", + "metadata": {}, + "source": [ + "# Data Loading and Visualization" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "452fda19", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 203 + }, + "id": "452fda19", + "outputId": "3ef649d3-ec89-4e18-e369-70b37cfe2021" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "( Date High Close\n", + " 0 1980-03-18 0.328125 0.322917\n", + " 1 1980-03-19 0.335938 0.330729\n", + " 2 1980-03-20 0.334635 0.329427\n", + " 3 1980-03-21 0.322917 0.317708\n", + " 4 1980-03-24 0.316406 0.311198,\n", + " (10919, 3))" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df = pd.read_csv('I:/CQAI/TSA/TSD/TSD/archive (1)/INTEL (1980 - 11.07.2023).csv',usecols ={\"Date\",\"Close\",\"High\"})\n", + "\n", + "df.head(),df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "d955833b", + "metadata": {}, + "outputs": [], + "source": [ + "df[\"Date\"]=pd.to_datetime(df[\"Date\"]) #convert date 
column to pandas datetime format" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "df2e37e1", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "\n", + "plt.figure(figsize=(10, 4))\n", + "\n", + "plt.plot(df[\"Date\"],df[\"Close\"],label=\"Intel closing price\", color=\"blue\")\n" + ] + }, + { + "cell_type": "markdown", + "id": "d1283281", + "metadata": {}, + "source": [ + "\n", + "\n", + "## Feature Engineering\n", + "\n", + "Features created:\n", + "\n", + "* **Date features**: Extracting components like\n", + "`dayofweek`, `year`, `month`, `dayofyear`, `sin_day`, `cos_day`. These help the model understand seasonality and calendar effects.\n", + "\n", + "* **Lag features**: Using past values of a variable as predictors (e.g., `Close_t-1`, `High_t-1`, etc.)\n", + "* **Rolling statistics**: Applying a 20-day moving average to smooth and represent recent trends (`Rolling_mean`)\n", + "* **Differenced feature**: The day-over-day change in `Close` (`Close_diff`)\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "98b56b8d", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "98b56b8d", + "outputId": "566561b1-4cd2-4b6d-e6a9-70155c175625" + }, + "outputs": [], + "source": [ + "\n", + "# Creating Date features\n", + "df['dayofweek'] = df['Date'].dt.dayofweek\n", + "df['year'] = df['Date'].dt.year\n", + "df['dayofyear'] = df['Date'].dt.dayofyear\n", + "df['sin_day'] = np.sin(df['dayofyear'])  # note: the argument is in radians, so the period is 2*pi days, not one year\n", + "df['cos_day'] = np.cos(df['dayofyear'])\n", + "df['month'] = df['Date'].dt.month\n", + "df['year'] = df['Date'].dt.year\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "9baf7e8b", + "metadata": {}, + "outputs": [], + "source": [ + "# One-step lags of Close and High\n", + "df[\"Close_t-1\"] = df.Close.shift(1)\n", + "df[\"High_t-1\"] = df.High.shift(1)\n", + "\n", + "# Differenced feature\n", + "df[\"Close_diff\"] = df.Close.diff(1)\n", + "\n", + "# 20-day rolling mean of Close\n", + "df[\"Rolling_mean\"]=df.Close.rolling(20).mean().reset_index(level=0, drop=True)" + ] + }, + { + "cell_type": 
"markdown", + "id": "71f4ca63", + "metadata": {}, + "source": [ + "\n", + "## Transform to a Supervised Learning Problem \n", + "\n", + "Now we create our target feature: a new column called **`Close_next_day`**, obtained by shifting the **`Close`** values one step up with `shift(-1)`, so that each row holds the following day's close.\n", + "\n", + "This new column contains the **one-step-ahead future values** of the `Close` price, making it suitable for supervised learning. The last row has no next-day value and is dropped afterwards.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "f1808121", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
DateHighClosedayofweekyeardayofyearsin_daycos_daymonthClose_t-1High_t-1Close_diffRolling_meanClose_next_day
01980-03-180.3281250.32291711980780.513978-0.8578033NaNNaNNaNNaN0.330729
11980-03-190.3359380.3307292198079-0.444113-0.89597130.3229170.3281250.007812NaN0.329427
21980-03-200.3346350.3294273198080-0.993889-0.11038730.3307290.335938-0.001302NaN0.317708
31980-03-210.3229170.3177084198081-0.6298880.77668630.3294270.334635-0.011719NaN0.311198
41980-03-240.3164060.31119801980840.733190-0.68002330.3177080.322917-0.006510NaN0.312500
\n", + "
" + ], + "text/plain": [ + " Date High Close dayofweek year dayofyear sin_day \\\n", + "0 1980-03-18 0.328125 0.322917 1 1980 78 0.513978 \n", + "1 1980-03-19 0.335938 0.330729 2 1980 79 -0.444113 \n", + "2 1980-03-20 0.334635 0.329427 3 1980 80 -0.993889 \n", + "3 1980-03-21 0.322917 0.317708 4 1980 81 -0.629888 \n", + "4 1980-03-24 0.316406 0.311198 0 1980 84 0.733190 \n", + "\n", + " cos_day month Close_t-1 High_t-1 Close_diff Rolling_mean \\\n", + "0 -0.857803 3 NaN NaN NaN NaN \n", + "1 -0.895971 3 0.322917 0.328125 0.007812 NaN \n", + "2 -0.110387 3 0.330729 0.335938 -0.001302 NaN \n", + "3 0.776686 3 0.329427 0.334635 -0.011719 NaN \n", + "4 -0.680023 3 0.317708 0.322917 -0.006510 NaN \n", + "\n", + " Close_next_day \n", + "0 0.330729 \n", + "1 0.329427 \n", + "2 0.317708 \n", + "3 0.311198 \n", + "4 0.312500 " + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['Close_next_day'] = df.Close.shift(-1)\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "8184344c", + "metadata": {}, + "outputs": [], + "source": [ + "df = df.dropna(subset=['Close_next_day'])" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "807c4055", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "10913 33.619999\n", + "10914 32.509998\n", + "10915 31.969999\n", + "10916 31.850000\n", + "10917 32.740002\n", + "Name: Close_next_day, dtype: float64" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df['Close_next_day'].tail()" + ] + }, + { + "cell_type": "markdown", + "id": "2ac05eed", + "metadata": {}, + "source": [ + "\n", + "## Train/Test Split\n", + "\n", + "This step involves **chronological splitting** of the dataset.\n", + "\n", + "We keep the **major portion** of the data for **training**, and use the **later sequential portion** for **testing**.\n", + "\n", + "This preserves the **temporal 
order** and avoids data leakage, which is essential for time series forecasting tasks.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "eA3q5pVfmUal", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "eA3q5pVfmUal", + "outputId": "19b69c13-d7ec-43c6-b772-806f57db290d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "10918" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(df)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "7571764a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "8734\n" + ] + } + ], + "source": [ + "# Index separating the training portion (first 80%) from the test portion (last 20%)\n", + "\n", + "split_idx = int(len(df) * 0.8)\n", + "print(split_idx)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "96aa41aa", + "metadata": { + "id": "96aa41aa" + }, + "outputs": [], + "source": [ + "# Train data\n", + "\n", + "df_train=df[:split_idx]\n", + "X_train=df_train.drop(['Close_next_day','Date'],axis=1).copy()\n", + "\n", + "# Close_next_day is the target\n", + "y_train=df_train['Close_next_day'].copy()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "840e2c99", + "metadata": { + "id": "840e2c99" + }, + "outputs": [], + "source": [ + "# Test data\n", + "\n", + "df_test=df[split_idx:]\n", + "X_test=df_test.drop(['Close_next_day','Date'],axis=1).copy()\n", + "y_test=df_test['Close_next_day'].copy()" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "892c14ca", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "892c14ca", + "outputId": "3ef15986-784c-46c7-a023-83d22bb9e94d" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "((8734, 12), (2184, 12))" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"X_train.shape,X_test.shape," + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "578cf9d8", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "((2184,), (8734,))" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y_test.shape,y_train.shape" + ] + }, + { + "cell_type": "markdown", + "id": "312f31df", + "metadata": {}, + "source": [ + "# Evaluation metrics \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ecb03afa", + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.metrics import mean_squared_error as MSE\n", + "\n", + "\n", + "def rmse(y_true, y_pred):\n", + "    score = np.sqrt(MSE(y_true, y_pred))\n", + "    print(\"Mean Value of Test Dataset:\", np.mean(y_true))  # use the argument, not the global y_test\n", + "    print(\"RMSE : % f\" %(score))\n", + "\n", + "def mape(y_true, y_pred):\n", + "    ape = np.abs((y_true - y_pred) / y_true)\n", + "    # Replace inf/NaN entries (division by zero) with an error of 100%\n", + "    ape[~np.isfinite(ape)] = 1.\n", + "    print(\"Mape\",np.mean(ape))\n", + "\n", + "def wmape(y_true, y_pred):\n", + "    print(\"Wmape\",np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true)))" + ] + }, + { + "cell_type": "markdown", + "id": "86905cf6", + "metadata": {}, + "source": [ + "# Creating the Baseline for Later Evaluation\n", + "\n", + "The baseline model assumes that the next day's closing price will be the same as today's closing price. In this setup, the **predicted value** is today’s close, while the **true value** is the actual closing price of the following day. This provides a simple benchmark: the error of a naïve persistence forecast. 
Ideally, our trained model should achieve a lower error than this baseline.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "98e0b35d", + "metadata": {}, + "outputs": [], + "source": [ + "y_pred = df_test[\"Close\"]\n", + "y_true = df_test['Close_next_day']" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "759e507d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mape 0.01403404955533739\n", + "Wmape 0.014251380880077104\n" + ] + } + ], + "source": [ + "mape(y_true,y_pred)\n", + "wmape(y_true,y_pred)" + ] + }, + { + "cell_type": "markdown", + "id": "d920d2e2", + "metadata": {}, + "source": [ + "## Handling Missing Values Before Training\n", + "\n", + "Before training, we re-inspect our dataset and impute the **NaN values** using the **`SimpleImputer`** with the default **mean strategy**.\n", + "\n", + "These missing values were introduced as a result of the **new features** we added — such as **lagged features** and **rolling statistics** — which naturally create missing values at the beginning of the time series.\n", + "\n", + "### Why this is important:\n", + "Imputing ensures that **each training example is complete**, allowing the model to learn effectively from all available data.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "64ab8415", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "High 0\n", + "Close 0\n", + "dayofweek 0\n", + "year 0\n", + "dayofyear 0\n", + "sin_day 0\n", + "cos_day 0\n", + "month 0\n", + "Close_t-1 1\n", + "High_t-1 1\n", + "Close_diff 1\n", + "Rolling_mean 19\n", + "dtype: int64" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "X_train.isnull().sum()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "cda45109", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "from sklearn.impute 
import SimpleImputer\n", + "imputer = SimpleImputer()\n", + "Xtr = imputer.fit_transform(X_train)\n", + "\n", + "\n", + "Xtst = imputer.transform(X_test)  # reuse the training-set means so preprocessing is consistent between training and testing\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "8f256f46", + "metadata": {}, + "source": [ + "## Model Training using `XGBRegressor` class from xgboost Library\n", + "\n", + "\n", + "### Parameters used:\n", + "\n", + "- `objective`: regression with squared loss (`reg:squarederror`)\n", + "- `n_estimators`: The number of boosting rounds (trees)\n", + "- `learning_rate`: The step size shrinkage used to prevent overfitting\n", + "\n", + "See the documentation: https://xgboost.readthedocs.io/en/stable/parameter.html#\n" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "73cb55a1", + "metadata": { + "id": "73cb55a1" + }, + "outputs": [], + "source": [ + "def fit_Regressor(x,y,n):\n", + " reg = xgb.XGBRegressor(\n", + " objective='reg:squarederror',\n", + " n_estimators=n,\n", + " learning_rate=0.01,\n", + " )\n", + "\n", + " reg.fit(x, y,\n", + " verbose=True, #eval_set=[(X_train, y_train), (X_test, y_test)] , eval_metric='mae')\n", + " ) \n", + " return reg\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5242ce9d", + "metadata": {}, + "outputs": [], + "source": [ + "reg =fit_Regressor(Xtr,y_train,230)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "6986b256", + "metadata": {}, + "outputs": [], + "source": [ + "#Predict for test data\n", + "yhat = reg.predict(Xtst)" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "e6610b92", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "e6610b92", + "outputId": "cbfb25e0-9508-4b4e-8a77-4b8abcce4745" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([32.18184 , 32.428726, 32.500183, ..., 31.38361 , 30.459702,\n", + " 30.474905], 
dtype=float32)" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "yhat" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "ee6ddb45", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "ee6ddb45", + "outputId": "feacc9bd-eaf9-4e62-ea7a-d68290191619" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "8734 34.310001\n", + "8735 34.540001\n", + "8736 33.759998\n", + "8737 33.820000\n", + "8738 33.580002\n", + " ... \n", + "10913 33.619999\n", + "10914 32.509998\n", + "10915 31.969999\n", + "10916 31.850000\n", + "10917 32.740002\n", + "Name: Close_next_day, Length: 2184, dtype: float64" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y_test" + ] + }, + { + "cell_type": "markdown", + "id": "1c12c8d7", + "metadata": {}, + "source": [ + "# Perform Evaluation on Model predictions" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "139477ee", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mean Value of Test Dataset: 43.15017851923077\n", + "RMSE : 3.395802\n", + "Mape 0.06181969728770354\n", + "Wmape 0.06443764574924396\n" + ] + } + ], + "source": [ + "rmse(y_test, yhat)\n", + "\n", + "mape(y_test, yhat)\n", + "wmape(y_test, yhat)" + ] + }, + { + "cell_type": "markdown", + "id": "2d279780", + "metadata": {}, + "source": [ + "#### From the above results we see that the MAPE and WMAPE of the trained model (about 0.06) are higher than those of the baseline (about 0.014), which means the untuned model performs worse than the naive forecast." + ] + }, + { + "cell_type": "markdown", + "id": "1d61d360", + "metadata": {}, + "source": [ + "# Plotting the Forecast (Original vs Predicted)" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "43d553b8", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 293 + }, + "id": "43d553b8", + "outputId": 
"bb311728-bea2-4dda-f0d7-66637eaca3c1" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.plot(y_test.values, label='Original')\n", + "plt.plot(yhat, color='red', label='XGboost')\n", + "plt.legend()" + ] + }, + { + "cell_type": "markdown", + "id": "813e5cb8", + "metadata": {}, + "source": [ + "## Hyperparameter Tuning \n", + "\n", + "After training our initial model, we perform **hyperparameter tuning** to improve the model’s forecasting accuracy and generalization.\n", + "\n", + "We use **`GridSearchCV`** from `sklearn.model_selection`, which automates the search for the best combination of hyperparameters.\n", + "\n", + "\n", + "\n", + "`GridSearchCV` systematically tests multiple combinations of hyperparameters using **cross-validation**.\n", + "\n", + "- Splits the training data into multiple folds (5 by default)\n", + "- Trains the model on all folds but one and validates it on the held-out fold\n", + "- Repeats this process for each combination of parameters\n", + "- Returns the set of parameters that **produces the best average performance**\n", + "\n", + "\n", + "This reduces the risk of choosing parameters that only suit one particular split, but it does not guarantee optimal generalization. Note that with its default settings `GridSearchCV` uses ordinary K-fold splits, which do not respect temporal order; for time series, a time-aware splitter such as `TimeSeriesSplit` is usually preferable.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "id": "197d6758", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "197d6758", + "outputId": "157b2232-bce1-43d0-f9d2-df10af438e1d" + }, + "outputs": [], + "source": [ + "\n", + "Xtr = pd.DataFrame(Xtr, columns=X_train.columns)\n", + "Xtst = pd.DataFrame(Xtst, columns=X_test.columns)\n", + "\n", + "\n", + "params = {\n", + " 'n_estimators': [600],  # number of boosting rounds (trees)\n", + " 'subsample': [0.6, 0.7, 0.8, 0.9],\n", + " 'colsample_bytree': [0.6, 0.7, 0.8, 0.9],\n", + " 'max_depth': [2,3],\n", + " 'gamma': [0.3, 0.4],\n", + " 'min_child_weight': [4,5]\n", + "}\n", + "# Initialize XGB and GridSearch\n", + "xgb_reg = xgb.XGBRegressor(nthread=-1, objective='reg:squarederror')\n", 
+ "grid = GridSearchCV(xgb_reg, params)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "id": "f6e4884f", + "metadata": {}, + "outputs": [], + "source": [ + "grid.fit(Xtr, y_train)\n", + "gridcv_xgb = grid.best_estimator_\n" + ] + }, + { + "cell_type": "markdown", + "id": "044d43d4", + "metadata": {}, + "source": [ + "## About Hyperparameters\n", + "\n", + "### `min_child_weight`\n", + "Minimum sum of instance weight (Hessian) needed in a child node. \n", + "➤ Larger values prevent the model from learning patterns that are too specific to individual samples (i.e., helps avoid overfitting).\n", + "\n", + "---\n", + "\n", + "### `gamma`\n", + "Minimum loss reduction required to make a further partition on a leaf node. \n", + "➤ Acts as a regularization parameter — higher values make the algorithm more conservative by limiting tree growth.\n", + "\n", + "---\n", + "\n", + "### `subsample`\n", + "Fraction of the training data used to build each tree. \n", + "➤ Adding this randomness reduces overfitting. Typical values are between **0.5 and 1.0**.\n", + "\n", + "---\n", + "\n", + "### `colsample_bytree`\n", + "Fraction of features (columns) randomly sampled for each tree. \n", + "➤ Encourages diversity among trees and reduces overfitting by randomly ignoring some features at each iteration.\n", + "\n", + "---\n", + "\n", + "### `max_depth`\n", + "Maximum depth of a decision tree. 
\n", + "➤ Controls the complexity of the model — deeper trees can model more intricate patterns but are prone to overfitting.\n" + ] + }, + { + "cell_type": "markdown", + "id": "a8e2e11e", + "metadata": {}, + "source": [ + "## Feature Importance Plot\n", + "\n", + "After training our best model using `GridSearchCV`, we visualize which features had the most impact on the model's predictions.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "id": "7da62198", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 295 + }, + "id": "7da62198", + "outputId": "cde72905-5bd3-4c35-e8e2-2c04f6f86cfb" + }, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "\n", + "_ = plot_importance(gridcv_xgb)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "id": "3e5e8220", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3e5e8220", + "outputId": "81e27a2d-025f-4628-ba31-8bca4279929a" + }, + "outputs": [ + { + "data": { + "text/html": [ + "
XGBRegressor(base_score=None, booster=None, callbacks=None,\n",
+              "             colsample_bylevel=None, colsample_bynode=None,\n",
+              "             colsample_bytree=0.9, device=None, early_stopping_rounds=None,\n",
+              "             enable_categorical=False, eval_metric=None, feature_types=None,\n",
+              "             feature_weights=None, gamma=0.3, grow_policy=None,\n",
+              "             importance_type=None, interaction_constraints=None,\n",
+              "             learning_rate=None, max_bin=None, max_cat_threshold=None,\n",
+              "             max_cat_to_onehot=None, max_delta_step=None, max_depth=3,\n",
+              "             max_leaves=None, min_child_weight=5, missing=nan,\n",
+              "             monotone_constraints=None, multi_strategy=None, n_estimators=600,\n",
+              "             n_jobs=None, nthread=-1, ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
" + ], + "text/plain": [ + "XGBRegressor(base_score=None, booster=None, callbacks=None,\n", + " colsample_bylevel=None, colsample_bynode=None,\n", + " colsample_bytree=0.9, device=None, early_stopping_rounds=None,\n", + " enable_categorical=False, eval_metric=None, feature_types=None,\n", + " feature_weights=None, gamma=0.3, grow_policy=None,\n", + " importance_type=None, interaction_constraints=None,\n", + " learning_rate=None, max_bin=None, max_cat_threshold=None,\n", + " max_cat_to_onehot=None, max_delta_step=None, max_depth=3,\n", + " max_leaves=None, min_child_weight=5, missing=nan,\n", + " monotone_constraints=None, multi_strategy=None, n_estimators=600,\n", + " n_jobs=None, nthread=-1, ...)" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "gridcv_xgb" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "9d17515c", + "metadata": { + "id": "9d17515c" + }, + "outputs": [], + "source": [ + "yhat = grid.predict(Xtst)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "9m3olnq0LyFO", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "9m3olnq0LyFO", + "outputId": "fb7dc17b-bdb0-484c-80a2-f9d3d94a2fde" + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mean Value of Test Dataset: 43.15017851923077\n", + "RMSE : 2.094299\n", + "Mape 0.03131156751819206\n", + "Wmape 0.033588469781125375\n" + ] + } + ], + "source": [ + "rmse(y_test, yhat)\n", + "\n", + "mape(y_test, yhat)\n", + "wmape(y_test, yhat)" + ] + }, + { + "cell_type": "markdown", + "id": "f911b510", + "metadata": {}, + "source": [ + "#### From the above results we see that after tuning the parameters, the MAPE and WMAPE dropped from about 0.06 to about 0.03. This is a clear improvement and closer to the baseline model's metrics, although still above the baseline's error of about 0.014." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 35, + "id": "ea67317a", + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 293 + }, + "id": "ea67317a", + "outputId": "7c6ffd72-8517-461f-b503-4b04085bfc0a" + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.plot(y_test.values, label='Original')\n", + "plt.plot(yhat, color='red', label='XGboost')\n", + "plt.legend()" + ] + }, + { + "cell_type": "markdown", + "id": "6b480c5c", + "metadata": {}, + "source": [ + "# Make Weekly Predictions (Target variable = next_week_close) " + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "ed4b906d", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr style=\"text-align: right;\"><th></th><th>Date</th><th>High</th><th>Close</th><th>dayofweek</th><th>year</th><th>dayofyear</th><th>sin_day</th><th>cos_day</th><th>month</th><th>Close_t-1</th><th>High_t-1</th><th>Close_diff</th><th>Rolling_mean</th><th>Close_next_day</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>0</th><td>1980-03-18</td><td>0.328125</td><td>0.322917</td><td>1</td><td>1980</td><td>78</td><td>0.513978</td><td>-0.857803</td><td>3</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>0.330729</td></tr>\n",
+ "    <tr><th>5</th><td>1980-03-25</td><td>0.317708</td><td>0.312500</td><td>1</td><td>1980</td><td>85</td><td>-0.176076</td><td>-0.984377</td><td>3</td><td>0.311198</td><td>0.316406</td><td>0.001302</td><td>NaN</td><td>0.309896</td></tr>\n",
+ "    <tr><th>10</th><td>1980-04-01</td><td>0.328125</td><td>0.322917</td><td>1</td><td>1980</td><td>92</td><td>-0.779466</td><td>-0.626444</td><td>4</td><td>0.321615</td><td>0.326823</td><td>0.001302</td><td>NaN</td><td>0.325521</td></tr>\n",
+ "    <tr><th>14</th><td>1980-04-08</td><td>0.317708</td><td>0.312500</td><td>1</td><td>1980</td><td>99</td><td>-0.999207</td><td>0.039821</td><td>4</td><td>0.311198</td><td>0.316406</td><td>0.001302</td><td>NaN</td><td>0.305990</td></tr>\n",
+ "    <tr><th>19</th><td>1980-04-15</td><td>0.308594</td><td>0.303385</td><td>1</td><td>1980</td><td>106</td><td>-0.727143</td><td>0.686487</td><td>4</td><td>0.307292</td><td>0.312500</td><td>-0.003907</td><td>0.314193</td><td>0.291667</td></tr>\n",
+ "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
+ "    <tr><th>10891</th><td>2023-05-30</td><td>30.040001</td><td>29.990000</td><td>1</td><td>2023</td><td>150</td><td>-0.714876</td><td>0.699251</td><td>5</td><td>29.000000</td><td>29.090000</td><td>0.990000</td><td>29.699000</td><td>31.440001</td></tr>\n",
+ "    <tr><th>10896</th><td>2023-06-06</td><td>31.450001</td><td>30.959999</td><td>1</td><td>2023</td><td>157</td><td>-0.079549</td><td>0.996831</td><td>6</td><td>29.860001</td><td>31.400000</td><td>1.099998</td><td>29.763000</td><td>31.280001</td></tr>\n",
+ "    <tr><th>10901</th><td>2023-06-13</td><td>33.950001</td><td>33.910000</td><td>1</td><td>2023</td><td>164</td><td>0.594933</td><td>0.803775</td><td>6</td><td>33.070000</td><td>33.299999</td><td>0.840000</td><td>30.450000</td><td>35.580002</td></tr>\n",
+ "    <tr><th>10905</th><td>2023-06-20</td><td>37.110001</td><td>35.000000</td><td>1</td><td>2023</td><td>171</td><td>0.976591</td><td>0.215105</td><td>6</td><td>36.369999</td><td>36.799999</td><td>-1.369999</td><td>31.703500</td><td>32.900002</td></tr>\n",
+ "    <tr><th>10910</th><td>2023-06-27</td><td>34.230000</td><td>34.099998</td><td>1</td><td>2023</td><td>178</td><td>0.877575</td><td>-0.479439</td><td>6</td><td>33.340000</td><td>33.990002</td><td>0.759998</td><td>32.746500</td><td>33.570000</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "<p>2238 rows × 14 columns</p>\n",
+ "
" + ], + "text/plain": [ + " Date High Close dayofweek year dayofyear sin_day \\\n", + "0 1980-03-18 0.328125 0.322917 1 1980 78 0.513978 \n", + "5 1980-03-25 0.317708 0.312500 1 1980 85 -0.176076 \n", + "10 1980-04-01 0.328125 0.322917 1 1980 92 -0.779466 \n", + "14 1980-04-08 0.317708 0.312500 1 1980 99 -0.999207 \n", + "19 1980-04-15 0.308594 0.303385 1 1980 106 -0.727143 \n", + "... ... ... ... ... ... ... ... \n", + "10891 2023-05-30 30.040001 29.990000 1 2023 150 -0.714876 \n", + "10896 2023-06-06 31.450001 30.959999 1 2023 157 -0.079549 \n", + "10901 2023-06-13 33.950001 33.910000 1 2023 164 0.594933 \n", + "10905 2023-06-20 37.110001 35.000000 1 2023 171 0.976591 \n", + "10910 2023-06-27 34.230000 34.099998 1 2023 178 0.877575 \n", + "\n", + " cos_day month Close_t-1 High_t-1 Close_diff Rolling_mean \\\n", + "0 -0.857803 3 NaN NaN NaN NaN \n", + "5 -0.984377 3 0.311198 0.316406 0.001302 NaN \n", + "10 -0.626444 4 0.321615 0.326823 0.001302 NaN \n", + "14 0.039821 4 0.311198 0.316406 0.001302 NaN \n", + "19 0.686487 4 0.307292 0.312500 -0.003907 0.314193 \n", + "... ... ... ... ... ... ... \n", + "10891 0.699251 5 29.000000 29.090000 0.990000 29.699000 \n", + "10896 0.996831 6 29.860001 31.400000 1.099998 29.763000 \n", + "10901 0.803775 6 33.070000 33.299999 0.840000 30.450000 \n", + "10905 0.215105 6 36.369999 36.799999 -1.369999 31.703500 \n", + "10910 -0.479439 6 33.340000 33.990002 0.759998 32.746500 \n", + "\n", + " Close_next_day \n", + "0 0.330729 \n", + "5 0.309896 \n", + "10 0.325521 \n", + "14 0.305990 \n", + "19 0.291667 \n", + "... ... 
\n", + "10891 31.440001 \n", + "10896 31.280001 \n", + "10901 35.580002 \n", + "10905 32.900002 \n", + "10910 33.570000 \n", + "\n", + "[2238 rows x 14 columns]" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Keep one observation per week: Tuesdays (pandas dayofweek == 1)\n", + "weekly_data = df[df['dayofweek'] == 1].copy()\n", + "weekly_data" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "6c356bcc", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "2238" + ] + }, + "execution_count": 37, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "len(weekly_data)" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "9fb4ae19", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "2014\n" + ] + } + ], + "source": [ + "# Chronological 90/10 split point (no shuffling)\n", + "split_idx = int(len(weekly_data) * 0.9)\n", + "print(split_idx)" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "4989c0d9", + "metadata": {}, + "outputs": [], + "source": [ + "# NOTE: this only relabels the column. The values are still the Close_next_day\n", + "# target built earlier, not the close one week ahead.\n", + "weekly_data.rename(columns={'Close_next_day': 'Close_next_week'}, inplace=True)" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "b1487820", + "metadata": {}, + "outputs": [], + "source": [ + "df_train = weekly_data[:split_idx]\n", + "X_train = df_train.drop(['Close_next_week', 'Date'], axis=1).copy()\n", + "y_train = df_train['Close_next_week'].copy()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "e368a407", + "metadata": {}, + "outputs": [], + "source": [ + "df_test = weekly_data[split_idx:]\n", + "X_test = df_test.drop(['Close_next_week', 'Date'], axis=1).copy()\n", + "y_test = df_test['Close_next_week'].copy()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0817e6b3", + "metadata": {}, + "outputs": [], + "source": [ + "from sklearn.impute import SimpleImputer\n", + "\n", + "# Mean imputation (SimpleImputer default), fitted on the training features only\n", + "imputer = SimpleImputer()\n", + "Xtr = imputer.fit_transform(X_train)\n", + "\n", + "Xtst = 
imputer.transform(X_test)  # Reuse the imputer fitted on the training data so preprocessing is consistent between train and test\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "19613d8a", + "metadata": {}, + "outputs": [], + "source": [ + "reg = fit_Regressor(Xtr, y_train, 600)  # 600 estimators, chosen via grid-search CV" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "2553e76e", + "metadata": {}, + "outputs": [], + "source": [ + "# Predict on the test data\n", + "yhat = reg.predict(Xtst)" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "d95fdd36", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "((224,), (224,))" + ] + }, + "execution_count": 46, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "y_test.shape, yhat.shape\n" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "66391990", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Mean Value of Test Dataset: 47.93718738839285\n", + "RMSE : 1.593557\n", + "Mape 0.023720529593762415\n", + "Wmape 0.024556221123795293\n" + ] + } + ], + "source": [ + "rmse(y_test, yhat)\n", + "mape(y_test, yhat)\n", + "wmape(y_test, yhat)" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "d8004acb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 48, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.plot(y_test.values, label='Original')\n", + "plt.plot(yhat, color='red', label='XGboost')\n", + "plt.legend()" + ] + } + ], + "metadata": { + "accelerator": "GPU", + "colab": { + "collapsed_sections": [], + "name": "Time Series Forcasting using Grid Search.ipynb", + "provenance": [] + }, + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.2" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}