time series 1

pull/34/head
Jen Looper 4 years ago
parent b219b206e8
commit 618bbf4a33

@@ -1,30 +1,28 @@
# Introduction to Time Series Forecasting

[![Introduction to Time Series Forecasting](https://img.youtube.com/vi/mAv1SEXUKhE/0.jpg)](https://youtu.be/mAv1SEXUKhE "Introduction to Time Series Forecasting")

> Introduction to Time Series Forecasting with Francesca Lazzeri, starting at 7:13.

## [Pre-lecture quiz](link-to-quiz-app)

In this lesson and the following one, you will learn a bit about Time Series Forecasting, an interesting and valuable part of an ML scientist's repertoire that is a bit lesser known than other topics. Time Series Forecasting is a sort of crystal ball: based on the past performance of a variable such as price, you can predict its future potential value. It's a powerful and interesting field, especially in business, unsurprisingly, given its direct application to problems of value. While deep learning techniques have started to be used to gain more insight into the prediction of future performance, it remains a field greatly informed by classic ML techniques.

### Introduction

Suppose you maintain an array of smart parking meters that provide data about how often they are used and for how long over time. What if you could generate revenue to maintain your streets by tweaking the prices of the meters when there is greater demand for them? What if you could predict, based on a meter's past performance, its future value according to the laws of supply and demand? This is a challenge that could be tackled as a Time Series problem. It wouldn't make those in search of a rare parking spot in busy times very happy to have to pay more for it, but it would be a sure way to generate revenue to clean the streets!

Let's explore some of the types of Time Series algorithms and start a notebook in preparation for cleaning some data. The data you will analyze is taken from the GEFCom2014 forecasting competition. It consists of 3 years of hourly electricity load and temperature values between 2012 and 2014. Given the historical patterns of electricity load and temperature, you can predict future values of electricity load. In this example, you'll learn how to forecast one time step ahead, using historical load data only.

Before starting, however, it's useful to understand what's going on behind the scenes.

## Types of Time Series Forecasting

## Algorithms

## Some Math

## ARIMA

## [Topic 1]

File diff suppressed because one or more lines are too long

@@ -0,0 +1,493 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Time series forecasting with ARIMA\n",
"\n",
"In this notebook, we demonstrate how to:\n",
"- prepare time series data for training an ARIMA times series forecasting model\n",
"- implement a simple ARIMA model to forecast the next HORIZON steps ahead (time *t+1* through *t+HORIZON*) in the time series\n",
"- evaluate the model \n",
"\n",
"\n",
"The data in this example is taken from the GEFCom2014 forecasting competition<sup>1</sup>. It consists of 3 years of hourly electricity load and temperature values between 2012 and 2014. The task is to forecast future values of electricity load. In this example, we show how to forecast one time step ahead, using historical load data only.\n",
"\n",
"<sup>1</sup>Tao Hong, Pierre Pinson, Shu Fan, Hamidreza Zareipour, Alberto Troccoli and Rob J. Hyndman, \"Probabilistic energy forecasting: Global Energy Forecasting Competition 2014 and beyond\", International Journal of Forecasting, vol.32, no.3, pp 896-913, July-September, 2016."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import warnings\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np\n",
"import pandas as pd\n",
"import datetime as dt\n",
"import math\n",
"\n",
"from pandas.plotting import autocorrelation_plot\n",
"# from pyramid.arima import auto_arima\n",
"from statsmodels.tsa.statespace.sarimax import SARIMAX\n",
"from sklearn.preprocessing import MinMaxScaler\n",
"from common.utils import load_data, mape\n",
"from IPython.display import Image\n",
"\n",
"%matplotlib inline\n",
"pd.options.display.float_format = '{:,.2f}'.format\n",
"np.set_printoptions(precision=2)\n",
"warnings.filterwarnings(\"ignore\") # specify to ignore warning messages\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the data from csv into a Pandas dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"energy = load_data('./data')[['load']]\n",
"energy.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot all available load data (January 2012 to Dec 2014)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"energy.plot(y='load', subplots=True, figsize=(15, 8), fontsize=12)\n",
"plt.xlabel('timestamp', fontsize=12)\n",
"plt.ylabel('load', fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Create training and testing data sets\n",
"\n",
"We separate our dataset into train and test sets. We train the model on the train set. After the model has finished training, we evaluate the model on the test set. We must ensure that the test set cover a later period in time from the training set, to ensure that the model does not gain from information from future time periods.\n",
"\n",
"We will allocate the period 1st September 2014 to 31st October to training set (2 months) and the period 1st November 2014 to 31st December 2014 to the test set (2 months). Since this is daily consumption of energy, there is a strong seasonal pattern, but the consumption is most similar to the consumption in the recent days. Therefore, using a relatively small window of time for training the data should be sufficient.\n",
"\n",
"> NOTE: Since function we use to fit ARIMA model uses in-sample validation during feeting, we will omit the validation data from this notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_start_dt = '2014-11-01 00:00:00'\n",
"test_start_dt = '2014-12-30 00:00:00' "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"energy[(energy.index < test_start_dt) & (energy.index >= train_start_dt)][['load']].rename(columns={'load':'train'}) \\\n",
" .join(energy[test_start_dt:][['load']].rename(columns={'load':'test'}), how='outer') \\\n",
" .plot(y=['train', 'test'], figsize=(15, 8), fontsize=12)\n",
"plt.xlabel('timestamp', fontsize=12)\n",
"plt.ylabel('load', fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data preparation\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our data preparation for the training set will involve the following steps:\n",
"\n",
"1. Filter the original dataset to include only that time period reserved for the training set\n",
"2. Scale the time series such that the values fall within the interval (0, 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create training set containing only the model features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train = energy.copy()[(energy.index >= train_start_dt) & (energy.index < test_start_dt)][['load']]\n",
"test = energy.copy()[energy.index >= test_start_dt][['load']]\n",
"\n",
"print('Training data shape: ', train.shape)\n",
"print('Test data shape: ', test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scale data to be in range (0, 1). This transformation should be calibrated on the training set only. This is to prevent information from the validation or test sets leaking into the training data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scaler = MinMaxScaler()\n",
"train['load'] = scaler.fit_transform(train)\n",
"train.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Original vs scaled data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"energy[(energy.index >= train_start_dt) & (energy.index < test_start_dt)][['load']].rename(columns={'load':'original load'}).plot.hist(bins=100, fontsize=12)\n",
"train.rename(columns={'load':'scaled load'}).plot.hist(bins=100, fontsize=12)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's also scale the test data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test['load'] = scaler.transform(test)\n",
"test.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Implement ARIMA method"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An ARIMA, which stands for **A**uto**R**egressive **I**ntegrated **M**oving **A**verage, model can be created using the statsmodels library. In the next section, we perform the following steps:\n",
"1. Define the model by calling SARIMAX() and passing in the model parameters: p, d, and q parameters, and P, D, and Q parameters.\n",
"2. The model is prepared on the training data by calling the fit() function.\n",
"3. Predictions can be made by calling the forecast() function and specifying the number of steps (horizon) which to forecast\n",
"\n",
"In an ARIMA model there are 3 parameters that are used to help model the major aspects of a times series: seasonality, trend, and noise. These parameters are:\n",
"- **p** is the parameter associated with the auto-regressive aspect of the model, which incorporates past values. \n",
"- **d** is the parameter associated with the integrated part of the model, which effects the amount of differencing to apply to a time series. \n",
"- **q** is the parameter associated with the moving average part of the model.\n",
"\n",
"If our model has a seasonal component, we use a seasonal ARIMA model (SARIMA). In that case we have another set of parameters: P, D, and Q which describe the same associations as p,d, and q, but correspond with the seasonal components of the model."
]
},
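{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch (assuming the `energy` dataframe loaded above, and purely illustrative rather than part of the modeling pipeline), first-order differencing shows what the *d* parameter does, and the `autocorrelation_plot` imported earlier can help judge the autoregressive (*p*) structure:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# First-order differencing: the 'I' in ARIMA, controlled by the d parameter.\n",
"# It removes trend by modeling the change between consecutive observations.\n",
"differenced = energy['load'].diff().dropna()\n",
"\n",
"# The autocorrelation plot (imported at the top of this notebook) hints at\n",
"# how many lagged terms (the p parameter) carry predictive information.\n",
"autocorrelation_plot(differenced[:480])  # roughly the first 20 days of hourly data\n",
"plt.show()"
]
},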
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Specify the number of steps to forecast ahead\n",
"HORIZON = 3\n",
"print('Forecasting horizon:', HORIZON, 'hours')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Selecting the best parameters for an Arima model can be challenging - somewhat subjective and time intesive, so we'll leave it as an exercise to the user. We used an **auto_arima()** function and some additional manual selection to find a decent model.\n",
"\n",
">NOTE: For more info on selecting an Arima model, please refer to the an arima notebook in /ReferenceNotebook directory."
]
},
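{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, a hedged sketch of what such a search might look like with `auto_arima` from the `pmdarima` package (the successor to the `pyramid.arima` import commented out at the top of this notebook). It is left commented out because `pmdarima` is not installed or imported above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: requires `pip install pmdarima`, which this notebook does not assume.\n",
"# auto_arima searches over (p, d, q) and seasonal (P, D, Q, m) terms, ranking\n",
"# candidate models by information criteria such as AIC.\n",
"# from pmdarima import auto_arima\n",
"# stepwise_model = auto_arima(train, seasonal=True, m=24,\n",
"#                             suppress_warnings=True, trace=True)\n",
"# print(stepwise_model.order, stepwise_model.seasonal_order)"
]
},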
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"order = (4, 1, 0)\n",
"seasonal_order = (1, 1, 0, 24)\n",
"\n",
"model = SARIMAX(endog=train, order=order, seasonal_order=seasonal_order)\n",
"results = model.fit()\n",
"\n",
"print(results.summary())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we display the distribution of residuals. A zero mean in the residuals may indicate that there is no bias in the prediction. "
]
},
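{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal way to do this (a sketch, assuming the fitted `results` object from the cell above) is the `plot_diagnostics()` method on statsmodels' SARIMAX results, which includes a histogram and a Q-Q plot of the standardized residuals:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Standardized residual diagnostics for the fitted SARIMAX model:\n",
"# residual trace, histogram with density estimate, Q-Q plot, and correlogram.\n",
"results.plot_diagnostics(figsize=(15, 8))\n",
"plt.show()"
]
},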
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluate the model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will perform the so-called **walk forward validation**. In practice, time series models are re-trained each time a new data becomes available. This allows the model to make the best forecast at each time step. \n",
"\n",
"Starting at the beginning of the time series, we train the model on the train data set. Then we make a prediction on the next time step. The prediction is then evaluated against the known value. The training set is then expanded to include the known value and the process is repeated. (Note that we keep the training set window fixed, for more efficient training, so every time we add a new observation to the training set, we remove the observation from the beginning of the set.)\n",
"\n",
"This process provides a more robust estimation of how the model will perform in practice. However, it comes at the computation cost of creating so many models. This is acceptable if the data is small or if the model is simple, but could be an issue at scale. \n",
"\n",
"Walk-forward validation is the gold standard of time series model evaluation and is recommended for your own projects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a test data point for each HORIZON step."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_shifted = test.copy()\n",
"\n",
"for t in range(1, HORIZON):\n",
" test_shifted['load+'+str(t)] = test_shifted['load'].shift(-t, freq='H')\n",
" \n",
"test_shifted = test_shifted.dropna(how='any')\n",
"test_shifted.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Make predictions on the test data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"%%time\n",
"training_window = 720 # dedicate 30 days (720 hours) for training\n",
"\n",
"train_ts = train['load']\n",
"test_ts = test_shifted\n",
"\n",
"history = [x for x in train_ts]\n",
"history = history[(-training_window):]\n",
"\n",
"predictions = list()\n",
"\n",
"# let's user simpler model for demonstration\n",
"order = (2, 1, 0)\n",
"seasonal_order = (1, 1, 0, 24)\n",
"\n",
"for t in range(test_ts.shape[0]):\n",
" model = SARIMAX(endog=history, order=order, seasonal_order=seasonal_order)\n",
" model_fit = model.fit()\n",
" yhat = model_fit.forecast(steps = HORIZON)\n",
" predictions.append(yhat)\n",
" obs = list(test_ts.iloc[t])\n",
" # move the training window\n",
" history.append(obs[0])\n",
" history.pop(0)\n",
" print(test_ts.index[t])\n",
" print(t+1, ': predicted =', yhat, 'expected =', obs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compare predictions to actual load"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"eval_df = pd.DataFrame(predictions, columns=['t+'+str(t) for t in range(1, HORIZON+1)])\n",
"eval_df['timestamp'] = test.index[0:len(test.index)-HORIZON+1]\n",
"eval_df = pd.melt(eval_df, id_vars='timestamp', value_name='prediction', var_name='h')\n",
"eval_df['actual'] = np.array(np.transpose(test_ts)).ravel()\n",
"eval_df[['prediction', 'actual']] = scaler.inverse_transform(eval_df[['prediction', 'actual']])\n",
"eval_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute the **mean absolute percentage error (MAPE)** over all predictions\n",
"\n",
"$$MAPE = \\frac{1}{n} \\sum_{t=1}^{n}|\\frac{actual_t - predicted_t}{actual_t}|$$"
]
},
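{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the formula, a toy example with made-up numbers (independent of the data in this notebook):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy example: actual vs. predicted values chosen arbitrarily for illustration.\n",
"toy_actual = np.array([100.0, 200.0, 300.0])\n",
"toy_pred = np.array([110.0, 190.0, 300.0])\n",
"\n",
"# Per-step absolute percentage errors: 0.10, 0.05, 0.00 -> mean of 0.05\n",
"toy_mape = np.mean(np.abs(toy_actual - toy_pred) / np.abs(toy_actual))\n",
"print('Toy MAPE:', toy_mape * 100, '%')  # 5.0 %"
]
},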
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if(HORIZON > 1):\n",
" eval_df['APE'] = (eval_df['prediction'] - eval_df['actual']).abs() / eval_df['actual']\n",
" print(eval_df.groupby('h')['APE'].mean())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('One step forecast MAPE: ', (mape(eval_df[eval_df['h'] == 't+1']['prediction'], eval_df[eval_df['h'] == 't+1']['actual']))*100, '%')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Multi-step forecast MAPE: ', mape(eval_df['prediction'], eval_df['actual'])*100, '%')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Plot the predictions vs the actuals for the first week of the test set"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"if(HORIZON == 1):\n",
" ## Plotting single step forecast\n",
" eval_df.plot(x='timestamp', y=['actual', 'prediction'], style=['r', 'b'], figsize=(15, 8))\n",
"\n",
"else:\n",
" ## Plotting multi step forecast\n",
" plot_df = eval_df[(eval_df.h=='t+1')][['timestamp', 'actual']]\n",
" for t in range(1, HORIZON+1):\n",
" plot_df['t+'+str(t)] = eval_df[(eval_df.h=='t+'+str(t))]['prediction'].values\n",
"\n",
" fig = plt.figure(figsize=(15, 8))\n",
" ax = plt.plot(plot_df['timestamp'], plot_df['actual'], color='red', linewidth=4.0)\n",
" ax = fig.add_subplot(111)\n",
" for t in range(1, HORIZON+1):\n",
" x = plot_df['timestamp'][(t-1):]\n",
" y = plot_df['t+'+str(t)][0:len(x)]\n",
" ax.plot(x, y, color='blue', linewidth=4*math.pow(.9,t), alpha=math.pow(0.8,t))\n",
" \n",
" ax.legend(loc='best')\n",
" \n",
"plt.xlabel('timestamp', fontsize=12)\n",
"plt.ylabel('load', fontsize=12)\n",
"plt.show()"
]
}
],
"metadata": {
"kernel_info": {
"name": "python3"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
},
"nteract": {
"version": "nteract-front-end@1.0.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}