This file contains ambiguous Unicode characters that may be confused with others in your current locale. If your use case is intentional and legitimate, you can safely ignore this warning. Use the Escape button to highlight these characters.
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Linear and Polynomial Regression for Pumpkin Pricing - Lesson 3\n",
"\n",
"Load up required libraries and dataset. Convert the data to a dataframe containing a subset of the data: \n",
"\n",
"- Only get pumpkins priced by the bushel\n",
"- Convert the date to a month\n",
"- Calculate the price to be an average of high and low prices\n",
"- Convert the price to reflect the pricing by bushel quantity"
]
},
{
"cell_type": "code",
"execution_count": 167,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>City Name</th>\n",
" <th>Type</th>\n",
" <th>Package</th>\n",
" <th>Variety</th>\n",
" <th>Sub Variety</th>\n",
" <th>Grade</th>\n",
" <th>Date</th>\n",
" <th>Low Price</th>\n",
" <th>High Price</th>\n",
" <th>Mostly Low</th>\n",
" <th>...</th>\n",
" <th>Unit of Sale</th>\n",
" <th>Quality</th>\n",
" <th>Condition</th>\n",
" <th>Appearance</th>\n",
" <th>Storage</th>\n",
" <th>Crop</th>\n",
" <th>Repack</th>\n",
" <th>Trans Mode</th>\n",
" <th>Unnamed: 24</th>\n",
" <th>Unnamed: 25</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>BALTIMORE</td>\n",
" <td>NaN</td>\n",
" <td>24 inch bins</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>4/29/17</td>\n",
" <td>270.0</td>\n",
" <td>280.0</td>\n",
" <td>270.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>E</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>BALTIMORE</td>\n",
" <td>NaN</td>\n",
" <td>24 inch bins</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>5/6/17</td>\n",
" <td>270.0</td>\n",
" <td>280.0</td>\n",
" <td>270.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>E</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>BALTIMORE</td>\n",
" <td>NaN</td>\n",
" <td>24 inch bins</td>\n",
" <td>HOWDEN TYPE</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>9/24/16</td>\n",
" <td>160.0</td>\n",
" <td>160.0</td>\n",
" <td>160.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>N</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>BALTIMORE</td>\n",
" <td>NaN</td>\n",
" <td>24 inch bins</td>\n",
" <td>HOWDEN TYPE</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>9/24/16</td>\n",
" <td>160.0</td>\n",
" <td>160.0</td>\n",
" <td>160.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>N</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>BALTIMORE</td>\n",
" <td>NaN</td>\n",
" <td>24 inch bins</td>\n",
" <td>HOWDEN TYPE</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>11/5/16</td>\n",
" <td>90.0</td>\n",
" <td>100.0</td>\n",
" <td>90.0</td>\n",
" <td>...</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>N</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 26 columns</p>\n",
"</div>"
],
"text/plain": [
" City Name Type Package Variety Sub Variety Grade Date \\\n",
"0 BALTIMORE NaN 24 inch bins NaN NaN NaN 4/29/17 \n",
"1 BALTIMORE NaN 24 inch bins NaN NaN NaN 5/6/17 \n",
"2 BALTIMORE NaN 24 inch bins HOWDEN TYPE NaN NaN 9/24/16 \n",
"3 BALTIMORE NaN 24 inch bins HOWDEN TYPE NaN NaN 9/24/16 \n",
"4 BALTIMORE NaN 24 inch bins HOWDEN TYPE NaN NaN 11/5/16 \n",
"\n",
" Low Price High Price Mostly Low ... Unit of Sale Quality Condition \\\n",
"A scatterplot reminds us that we only have month data from August through December. We probably need more data to be able to draw conclusions in a linear fashion."
"Looks like correlation is pretty small, but there is some other more important relationship - because price points in the plot above seem to have several distinct clusters. Let's make a plot that will show different pumpkin varieties: "
"The slope of the line can be determined from linear regression coefficients:"
]
},
{
"cell_type": "code",
"execution_count": 178,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(array([-0.01751876]), 21.133734359909326)"
]
},
"execution_count": 178,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lin_reg.coef_, lin_reg.intercept_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the trained model to predict price:"
]
},
{
"cell_type": "code",
"execution_count": 179,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([16.64893156])"
]
},
"execution_count": 179,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Pumpkin price on programmer's day\n",
"\n",
"lin_reg.predict([[256]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Polynomial Regression\n",
"\n",
"Sometimes the relationship between features and the results is inherently non-linear. For example, pumpkin prices might be high in winter (months=1,2), then drop over summer (months=5-7), and then rise again. Linear regression is unable to fin this relationship accurately.\n",
"\n",
"In this case, we may consider adding extra features. Simple way is to use polynomials from input features, which would result in **polynomial regression**. In Scikit Learn, we can automatically pre-compute polynomial features using pipelines: "
"In the ideal world, we want to be able to predict prices for different pumpkin varieties using the same model. To take variety into account, we first need to convert it to numeric form, or **encode**. There are several way we can do it:\n",
"\n",
"* Simple numeric encoding that will build a table of different varieties, and then replace variety name by an index in that table. This is not the best idea for linear regression, because linear regression takes the numeric value of the index into account, and the numeric value is likely not to correlate numerically with the price.\n",
"* One-hot encoding, which will replace `Variety` column by 4 different columns, one for each variety, that will contain 1 if the corresponding row is of given variety, and 0 otherwise.\n",
"\n",
"The code below shows how we can can one-hot encode a variety:"
]
},
{
"cell_type": "code",
"execution_count": 181,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>FAIRYTALE</th>\n",
" <th>MINIATURE</th>\n",
" <th>MIXED HEIRLOOM VARIETIES</th>\n",
" <th>PIE TYPE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>70</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>71</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>72</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74</th>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1738</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1739</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1740</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1741</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1742</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>415 rows × 4 columns</p>\n",
"</div>"
],
"text/plain": [
" FAIRYTALE MINIATURE MIXED HEIRLOOM VARIETIES PIE TYPE\n",
"70 0 0 0 1\n",
"71 0 0 0 1\n",
"72 0 0 0 1\n",
"73 0 0 0 1\n",
"74 0 0 0 1\n",
"... ... ... ... ...\n",
"1738 0 1 0 0\n",
"1739 0 1 0 0\n",
"1740 0 1 0 0\n",
"1741 0 1 0 0\n",
"1742 0 1 0 0\n",
"\n",
"[415 rows x 4 columns]"
]
},
"execution_count": 181,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.get_dummies(new_pumpkins['Variety'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Linear Regression on Variety \n",
"\n",
"We will now use the same code as above, but instead of `DayOfYear` we will use our one-hot-encoded variety as input:"
]
},
{
"cell_type": "code",
"execution_count": 182,
"metadata": {},
"outputs": [],
"source": [
"X = pd.get_dummies(new_pumpkins['Variety'])\n",
"y = new_pumpkins['Price']"
]
},
{
"cell_type": "code",
"execution_count": 183,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean error: 5.24 (19.7%)\n",
"Model determination: 0.774085281105197\n"
]
}
],
"source": [
"def run_linear_regression(X,y):\n",
" X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)\n",
"Polynomial regression can also be used with categorical features that are one-hot-encoded. The code to train polynomial regression would essentially be the same as we have seen above."