diff --git a/2-Regression/3-Linear/README.md b/2-Regression/3-Linear/README.md
index bf8e5cc1..2a8ae959 100644
--- a/2-Regression/3-Linear/README.md
+++ b/2-Regression/3-Linear/README.md
@@ -77,7 +77,7 @@ A good linear regression model will be one that has a high (nearer to 1 than 0)
 
 In the code below, we will assume that we have cleaned up the data, and obtained a dataframe called `new_pumpkins`, similar to the following:
 
- | Month | DayOfYear | Variety | City | Package | Low Price | High Price | Price
+ID | Month | DayOfYear | Variety | City | Package | Low Price | High Price | Price
 ---|-------|-----------|---------|------|---------|-----------|------------|-------
 70 | 9 | 267 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 15.0 | 15.0 | 13.636364
 71 | 9 | 267 | PIE TYPE | BALTIMORE | 1 1/9 bushel cartons | 18.0 | 18.0 | 16.363636
@@ -97,11 +97,11 @@ Now that you have an understanding of the math behind linear regression, let's c
 
 From the previous lesson you have probably seen that the average price for different months looks like this:
 
-Average price by month
+Average price by month
 
 This suggests that there should be some correlation, and we can try training linear regression model to predict the relationship between `Month` and `Price`, or between `DayOfYear` and `Price`. Here is the scatter plot that shows the latter relationship:
 
-Scatter plot of Price vs. Day of Year
+Scatter plot of Price vs. Day of Year
 
 It looks like there are different clusters of prices corresponding to different pumpkin varieties. To confirm this hypothesis, let's plot each pumpkin category using different color. By passing `ax` parameter to the `scatter` plotting function we can plot all points on the same graph:
 
@@ -113,7 +113,7 @@ for i,var in enumerate(new_pumpkins['Variety'].unique()):
     ax = df.plot.scatter('DayOfYear','Price',ax=ax,c=colors[i],label=var)
 ```
 
-Scatter plot of Price vs. Day of Year
+Scatter plot of Price vs. Day of Year
 
 Our investigation suggests that variety has more effect on the overall price than actual selling date. So let us focus for the moment only on one pumpkin variety, and see what effect does the date have:
 
@@ -122,7 +122,7 @@ Our investigation suggests that variety has more effect on the overall price tha
 pie_pumpkins = new_pumpkins[new_pumpkins['Variety']=='PIE TYPE']
 pie_pumpkins.plot.scatter('DayOfYear','Price')
 ```
-Scatter plot of Price vs. Day of Year
+Scatter plot of Price vs. Day of Year
 
 If we now calculate the correlation between `Price` and `DayOfYear` using `corr` function, we will get something like `-0.27` - which means that training predictive model makes sense.
 
@@ -193,7 +193,7 @@ plt.scatter(X_test,y_test)
 plt.plot(X_test,pred)
 ```
 
-Linear regression
+Linear regression
 
 ## Polynomial Regression
 
@@ -223,7 +223,7 @@ Using `PolynomialFeatures(2)` means that we will include all second-degree polyn
 
 Pipeline can be used in the same manner as original `LinearRegression` object, i.e. we can `fit` the pipeline, and then use `predict` to get the prediction results. Here is the graph showing test data, and the approximation curve:
 
-Polynomial regression
+Polynomial regression
 
 Using polynomial regression we can get slightly lower MSE and higher determination, but not significantly. We need to take into account other features!
 
@@ -237,7 +237,7 @@ In the ideal world, we want to be able to predict prices for different pumpkin v
 
 Here you can see how average price depends on variety:
 
-Average price by variety
+Average price by variety
 
 To take variety into account, we first need to convert it to numeric form, or **encode**. There are several way we can do it: