{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Time series forecasting with ARIMA\n",
"\n",
"In this notebook, we demonstrate how to:\n",
"- prepare time series data for training an ARIMA times series forecasting model\n",
"- implement a simple ARIMA model to forecast the next HORIZON steps ahead (time *t+1* through *t+HORIZON*) in the time series\n",
"- evaluate the model \n",
"\n",
"\n",
"The data in this example is taken from the GEFCom2014 forecasting competition<sup>1</sup>. It consists of 3 years of hourly electricity load and temperature values between 2012 and 2014. The task is to forecast future values of electricity load. In this example, we show how to forecast one time step ahead, using historical load data only.\n",
"\n",
"<sup>1</sup>Tao Hong, Pierre Pinson, Shu Fan, Hamidreza Zareipour, Alberto Troccoli and Rob J. Hyndman, \"Probabilistic energy forecasting: Global Energy Forecasting Competition 2014 and beyond\", International Journal of Forecasting, vol.32, no.3, pp 896-913, July-September, 2016."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import warnings\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import datetime as dt\n",
"import math\n",
"\n",
"from pandas.plotting import autocorrelation_plot\n",
"# from pyramid.arima import auto_arima\n",
"from statsmodels.tsa.statespace.sarimax import SARIMAX\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"from common.utils import load_data, mape\n",
"from IPython.display import Image\n",
"\n",
"%matplotlib inline\n",
"pd.options.display.float_format = '{:,.2f}'.format\n",
"np.set_printoptions(precision=2)\n",
"warnings.filterwarnings(\"ignore\") # specify to ignore warning messages\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the data from csv into a Pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"energy = load_data('./data')[['load']]\n",
"energy.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot all available load data (January 2012 to Dec 2014)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"energy.plot(y='load', subplots=True, figsize=(15, 8), fontsize=12)\n",
"plt.xlabel('timestamp', fontsize=12)\n",
"plt.ylabel('load', fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create training and testing data sets\n",
"\n",
"We separate our dataset into train and test sets. We train the model on the train set. After the model has finished training, we evaluate the model on the test set. We must ensure that the test set cover a later period in time from the training set, to ensure that the model does not gain from information from future time periods.\n",
"\n",
"We will allocate the period 1st September 2014 to 31st October to training set (2 months) and the period 1st November 2014 to 31st December 2014 to the test set (2 months). Since this is daily consumption of energy, there is a strong seasonal pattern, but the consumption is most similar to the consumption in the recent days. Therefore, using a relatively small window of time for training the data should be sufficient.\n",
"\n",
"> NOTE: Since function we use to fit ARIMA model uses in-sample validation during feeting, we will omit the validation data from this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_start_dt = '2014-11-01 00:00:00'\n",
"test_start_dt = '2014-12-30 00:00:00' "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"energy[(energy.index < test_start_dt) & (energy.index >= train_start_dt)][['load']].rename(columns={'load':'train'}) \\\n",
" .join(energy[test_start_dt:][['load']].rename(columns={'load':'test'}), how='outer') \\\n",
" .plot(y=['train', 'test'], figsize=(15, 8), fontsize=12)\n",
"plt.xlabel('timestamp', fontsize=12)\n",
"plt.ylabel('load', fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data preparation\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our data preparation for the training set will involve the following steps:\n",
"\n",
"1. Filter the original dataset to include only that time period reserved for the training set\n",
"2. Scale the time series such that the values fall within the interval (0, 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create training set containing only the model features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train = energy.copy()[(energy.index >= train_start_dt) & (energy.index < test_start_dt)][['load']]\n",
"test = energy.copy()[energy.index >= test_start_dt][['load']]\n",
"\n",
"print('Training data shape: ', train.shape)\n",
"print('Test data shape: ', test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scale data to be in range (0, 1). This transformation should be calibrated on the training set only. This is to prevent information from the validation or test sets leaking into the training data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scaler = MinMaxScaler()\n",
"train['load'] = scaler.fit_transform(train)\n",
"train.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Original vs scaled data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"energy[(energy.index >= train_start_dt) & (energy.index < test_start_dt)][['load']].rename(columns={'load':'original load'}).plot.hist(bins=100, fontsize=12)\n",
"train.rename(columns={'load':'scaled load'}).plot.hist(bins=100, fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also scale the test data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test['load'] = scaler.transform(test)\n",
"test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implement ARIMA method"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An ARIMA, which stands for **A**uto**R**egressive **I**ntegrated **M**oving **A**verage, model can be created using the statsmodels library. In the next section, we perform the following steps:\n",
"1. Define the model by calling SARIMAX() and passing in the model parameters: p, d, and q parameters, and P, D, and Q parameters.\n",
"2. The model is prepared on the training data by calling the fit() function.\n",
"3. Predictions can be made by calling the forecast() function and specifying the number of steps (horizon) which to forecast\n",
"\n",
"In an ARIMA model there are 3 parameters that are used to help model the major aspects of a times series: seasonality, trend, and noise. These parameters are:\n",
"- **p** is the parameter associated with the auto-regressive aspect of the model, which incorporates past values. \n",
"- **d** is the parameter associated with the integrated part of the model, which effects the amount of differencing to apply to a time series. \n",
"- **q** is the parameter associated with the moving average part of the model.\n",
"\n",
"If our model has a seasonal component, we use a seasonal ARIMA model (SARIMA). In that case we have another set of parameters: P, D, and Q which describe the same associations as p,d, and q, but correspond with the seasonal components of the model."
]
},
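{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell below is an added illustrative sketch, not part of the original modelling pipeline: it applies first-order differencing, the transformation controlled by the **d** parameter, to the scaled training series using pandas' `diff()`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative only: first-order differencing (d=1) of the scaled training series.\n",
"# Differencing once removes a linear trend; the SARIMAX model later applies this internally.\n",
"train['load'].diff(1).dropna().plot(figsize=(15, 4), fontsize=12)\n",
"plt.ylabel('differenced load', fontsize=12)\n",
"plt.show()"
]
},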
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Specify the number of steps to forecast ahead\n",
"HORIZON = 3\n",
"print('Forecasting horizon:', HORIZON, 'hours')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting the best parameters for an Arima model can be challenging - somewhat subjective and time intesive, so we'll leave it as an exercise to the user. We used an **auto_arima()** function and some additional manual selection to find a decent model.\n",
"\n",
">NOTE: For more info on selecting an Arima model, please refer to the an arima notebook in /ReferenceNotebook directory."
]
},
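{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an added, hedged illustration of how one might begin that selection by hand, the cell below draws an autocorrelation plot of the scaled training series using the `autocorrelation_plot` imported earlier; pronounced spikes at regular lags suggest the seasonal period (24 hours here) and plausible autoregressive terms."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added sketch: autocorrelation of the scaled training series, a rough guide to p, q and the seasonal period\n",
"autocorrelation_plot(train['load'])\n",
"plt.show()"
]
},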
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"order = (4, 1, 0)\n",
"seasonal_order = (1, 1, 0, 24)\n",
"\n",
"model = SARIMAX(endog=train, order=order, seasonal_order=seasonal_order)\n",
"results = model.fit()\n",
"\n",
"print(results.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we display the distribution of residuals. A zero mean in the residuals may indicate that there is no bias in the prediction. "
]
},
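{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A minimal sketch using statsmodels' built-in plot_diagnostics(), which shows the\n",
"# standardized residuals, their histogram/density, a Q-Q plot, and a correlogram\n",
"results.plot_diagnostics(figsize=(15, 8))\n",
"plt.show()"
]
},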
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will perform the so-called **walk forward validation**. In practice, time series models are re-trained each time a new data becomes available. This allows the model to make the best forecast at each time step. \n",
"\n",
"Starting at the beginning of the time series, we train the model on the train data set. Then we make a prediction on the next time step. The prediction is then evaluated against the known value. The training set is then expanded to include the known value and the process is repeated. (Note that we keep the training set window fixed, for more efficient training, so every time we add a new observation to the training set, we remove the observation from the beginning of the set.)\n",
"\n",
"This process provides a more robust estimation of how the model will perform in practice. However, it comes at the computation cost of creating so many models. This is acceptable if the data is small or if the model is simple, but could be an issue at scale. \n",
"\n",
"Walk-forward validation is the gold standard of time series model evaluation and is recommended for your own projects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a test data point for each HORIZON step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_shifted = test.copy()\n",
"\n",
"for t in range(1, HORIZON):\n",
" test_shifted['load+'+str(t)] = test_shifted['load'].shift(-t, freq='H')\n",
" \n",
"test_shifted = test_shifted.dropna(how='any')\n",
"test_shifted.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Make predictions on the test data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%time\n",
"training_window = 720 # dedicate 30 days (720 hours) for training\n",
"\n",
"train_ts = train['load']\n",
"test_ts = test_shifted\n",
"\n",
"history = [x for x in train_ts]\n",
"history = history[(-training_window):]\n",
"\n",
"predictions = list()\n",
"\n",
"# let's user simpler model for demonstration\n",
"order = (2, 1, 0)\n",
"seasonal_order = (1, 1, 0, 24)\n",
"\n",
"for t in range(test_ts.shape[0]):\n",
" model = SARIMAX(endog=history, order=order, seasonal_order=seasonal_order)\n",
" model_fit = model.fit()\n",
" yhat = model_fit.forecast(steps = HORIZON)\n",
" predictions.append(yhat)\n",
" obs = list(test_ts.iloc[t])\n",
" # move the training window\n",
" history.append(obs[0])\n",
" history.pop(0)\n",
" print(test_ts.index[t])\n",
" print(t+1, ': predicted =', yhat, 'expected =', obs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare predictions to actual load"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_df = pd.DataFrame(predictions, columns=['t+'+str(t) for t in range(1, HORIZON+1)])\n",
"eval_df['timestamp'] = test.index[0:len(test.index)-HORIZON+1]\n",
"eval_df = pd.melt(eval_df, id_vars='timestamp', value_name='prediction', var_name='h')\n",
"eval_df['actual'] = np.array(np.transpose(test_ts)).ravel()\n",
"eval_df[['prediction', 'actual']] = scaler.inverse_transform(eval_df[['prediction', 'actual']])\n",
"eval_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute the **mean absolute percentage error (MAPE)** over all predictions\n",
"\n",
"$$MAPE = \\frac{1}{n} \\sum_{t=1}^{n}|\\frac{actual_t - predicted_t}{actual_t}|$$"
]
},
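{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the formula above (an added illustration, assuming `mape` from `common.utils` implements this same definition), the toy example below computes MAPE by hand for two points that are each off by 10%."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Added toy check: each point is off by 10%, so the mean absolute percentage error should be 0.10\n",
"actual_toy = np.array([100.0, 200.0])\n",
"predicted_toy = np.array([110.0, 180.0])\n",
"print(np.mean(np.abs((actual_toy - predicted_toy) / actual_toy)))"
]
},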
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if(HORIZON > 1):\n",
" eval_df['APE'] = (eval_df['prediction'] - eval_df['actual']).abs() / eval_df['actual']\n",
" print(eval_df.groupby('h')['APE'].mean())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('One step forecast MAPE: ', (mape(eval_df[eval_df['h'] == 't+1']['prediction'], eval_df[eval_df['h'] == 't+1']['actual']))*100, '%')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Multi-step forecast MAPE: ', mape(eval_df['prediction'], eval_df['actual'])*100, '%')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the predictions vs the actuals for the first week of the test set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if(HORIZON == 1):\n",
" ## Plotting single step forecast\n",
" eval_df.plot(x='timestamp', y=['actual', 'prediction'], style=['r', 'b'], figsize=(15, 8))\n",
"\n",
"else:\n",
" ## Plotting multi step forecast\n",
" plot_df = eval_df[(eval_df.h=='t+1')][['timestamp', 'actual']]\n",
" for t in range(1, HORIZON+1):\n",
" plot_df['t+'+str(t)] = eval_df[(eval_df.h=='t+'+str(t))]['prediction'].values\n",
"\n",
" fig = plt.figure(figsize=(15, 8))\n",
" ax = plt.plot(plot_df['timestamp'], plot_df['actual'], color='red', linewidth=4.0)\n",
" ax = fig.add_subplot(111)\n",
" for t in range(1, HORIZON+1):\n",
" x = plot_df['timestamp'][(t-1):]\n",
" y = plot_df['t+'+str(t)][0:len(x)]\n",
" ax.plot(x, y, color='blue', linewidth=4*math.pow(.9,t), alpha=math.pow(0.8,t))\n",
" \n",
" ax.legend(loc='best')\n",
" \n",
"plt.xlabel('timestamp', fontsize=12)\n",
"plt.ylabel('load', fontsize=12)\n",
"plt.show()"
]
}
],
"metadata": {
"kernel_info": {
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}