The analysis can be performed using a variety of methods, including frequency-domain and time-domain approaches.
Time Series Forecasting is the use of a model to predict future values based on patterns displayed by previously gathered data. While it is possible to use regression models to explore time series data, with time indices as x variables on a plot, this type of data is best analyzed using specialized models.

Time series data is a list of ordered observations, unlike data that can be analyzed by linear regression, where the order of the observations does not matter. The most common model for this kind of data is ARIMA, an acronym that stands for "Autoregressive Integrated Moving Average".
ARIMA models "relate the present value of a series to past values and past prediction errors." [source](https://online.stat.psu.edu/stat510/lesson/1/1.1). They are most appropriate for analyzing time-domain data, where data is ordered over time.
ARIMA models "relate the present value of a series to past values and past prediction errors." [source](https://online.stat.psu.edu/stat510/lesson/1/1.1). They are most appropriate for analyzing time-domain data, where data is ordered over time.
> There are several types of ARIMA models, which you can learn about [here](https://people.duke.edu/~rnau/411arim.htm) and which you will touch on in the next lesson.
> There are several types of ARIMA models, which you can learn about [here](https://people.duke.edu/~rnau/411arim.htm) and which you will touch on in the next lesson.
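As a quick preview of what the next lesson covers (this sketch is not part of the lesson's notebook, and the synthetic series and the `order=(1, 0, 1)` choice are purely illustrative assumptions), fitting an ARIMA model with the `statsmodels` library looks roughly like this:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# build a toy series in which each value depends on the previous value plus noise
rng = np.random.default_rng(0)
values = [0.0]
for _ in range(199):
    values.append(0.8 * values[-1] + rng.normal())
series = pd.Series(values, index=pd.date_range("2020-01-01", periods=200, freq="D"))

# order=(p, d, q): autoregressive lags, differencing, and moving-average lags
model = ARIMA(series, order=(1, 0, 1))
fitted = model.fit()
print(fitted.forecast(steps=3))  # predict the next three values
```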
In the next lesson, you will build an ARIMA model using [Univariate Time Series], which focuses on one variable that changes its value over time. Here is an example of such a dataset, with one measurement recorded per month:
| CO2 | YearMonth | Year | Month |
| :----: | :-------: | :---: | :---: |
| 330.97 | 1975.96 | 1975 | 12 |
✅ Identify the variable that changes over time in this dataset
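To make the idea concrete, here is a small sketch (assuming pandas, using only the single row shown in the table, and assuming the measured variable is CO2 as the header suggests) of what a univariate time series looks like in code: one variable, indexed by time.

```python
import pandas as pd

# one variable indexed by time; a real series would have one entry per month
co2 = pd.Series([330.97], index=pd.to_datetime(["1975-12-01"]), name="CO2")
print(co2)
```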
## Time Series [data characteristics](https://online.stat.psu.edu/stat510/lesson/1/1.1) to consider
When looking at time series data, you might notice that it has certain characteristics that you need to take into account and mitigate to better understand its patterns. If you think of time series data as providing a 'signal' that you want to analyze, these characteristics can be thought of as 'noise'. You will often need to reduce this 'noise' by offsetting some of these characteristics with statistical techniques.
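As a small illustration of offsetting such 'noise' (a sketch with made-up placeholder numbers, not data from this lesson), a rolling mean is one simple statistical technique for smoothing a series so its underlying pattern stands out:

```python
import pandas as pd

# placeholder daily values: noisy, but trending upward
series = pd.Series(
    [5, 7, 6, 9, 8, 12, 11, 15, 14, 18],
    index=pd.date_range("2020-01-01", periods=10, freq="D"),
)

# replace each value with the average of a 3-day window to damp short-term noise
smoothed = series.rolling(window=3).mean()
print(smoothed)
```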
Let's get started creating a time series model to predict future power usage given past usage.
> The data in this example is taken from the GEFCom2014 forecasting competition. It consists of 3 years of hourly electricity load and temperature values between 2012 and 2014.
>
> Tao Hong, Pierre Pinson, Shu Fan, Hamidreza Zareipour, Alberto Troccoli and Rob J. Hyndman, "Probabilistic energy forecasting: Global Energy Forecasting Competition 2014 and beyond", International Journal of Forecasting, vol.32, no.3, pp 896-913, July-September, 2016.
In the `working` folder of this lesson, open the `notebook.ipynb` file. Start by adding libraries that will help you load and visualize data:
```python
import os
import matplotlib.pyplot as plt

# helpers from the included `common` folder for preparing and loading the data
from common.utils import load_data
from common.extract_data import extract_data

%matplotlib inline
```
Note that you are using the files from the included `common` folder, which set up your environment and handle downloading the data.
Next, download, move and extract the zip file of data:
```python
# create ./data/energy.csv from the downloaded data the first time the notebook runs
data_dir = './data'

if not os.path.exists(os.path.join(data_dir, 'energy.csv')):
"\u001b[0;32m~/Documents/MSFT/curricula/Currriculum-Dev/ml-for-beginners/TimeSeries/1-Introduction/working/common/extract_data.py\u001b[0m in \u001b[0;36mextract_data\u001b[0;34m(data_dir)\u001b[0m\n\u001b[1;32m 24\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 25\u001b[0m \u001b[0;31m# load the data from Excel file\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 26\u001b[0;31m \u001b[0mdata\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_excel\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata_dir\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'GEFCom2014-E'\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'GEFCom2014-E.xlsx'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparse_date\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'Date'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 27\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 28\u001b[0m \u001b[0;31m# create timestamp variable from Date and Hour\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mTypeError\u001b[0m: read_excel() got an unexpected keyword argument 'parse_date'"
]
}
],
"source": [
"data_dir = './data'\n",
"\n",
"\n",
"if not os.path.exists(os.path.join(data_dir, 'energy.csv')):\n",
"In this notebook, we demonstrate how to:\n",
" extract_data(data_dir)"
"\n",
]
"setup time series data for this module\n",
},
"visualize the data\n",
{
"The data in this example is taken from the GEFCom2014 forecasting competition1. It consists of 3 years of hourly electricity load and temperature values between 2012 and 2014.\n",
"cell_type": "code",
"\n",
"execution_count": 4,
"1Tao Hong, Pierre Pinson, Shu Fan, Hamidreza Zareipour, Alberto Troccoli and Rob J. Hyndman, \"Probabilistic energy forecasting: Global Energy Forecasting Competition 2014 and beyond\", International Journal of Forecasting, vol.32, no.3, pp 896-913, July-September, 2016."
"metadata": {},
"outputs": [
{
"output_type": "error",
"ename": "FileNotFoundError",
"evalue": "[Errno 2] No such file or directory: './data/energy.csv'",
"\u001b[0;32m<ipython-input-4-a4cbf8be04ff>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0menergy\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mload_data\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata_dir\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'load'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 2\u001b[0m \u001b[0menergy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mhead\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m~/Documents/MSFT/curricula/Currriculum-Dev/ml-for-beginners/TimeSeries/1-Introduction/working/common/utils.py\u001b[0m in \u001b[0;36mload_data\u001b[0;34m(data_dir)\u001b[0m\n\u001b[1;32m 7\u001b[0m \u001b[0;34m\"\"\"Load the GEFCom 2014 energy load data\"\"\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 8\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 9\u001b[0;31m \u001b[0menergy\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mos\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpath\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdata_dir\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m'energy.csv'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mparse_dates\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'timestamp'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 10\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 11\u001b[0m \u001b[0;31m# Reindex the dataframe such that the dataframe has a record for every time point\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",