notes on clustering

pull/34/head
Jen Looper 3 years ago
parent 9351f64825
commit 4dca818ad8

@@ -23,7 +23,7 @@ Before embarking on this curriculum, you need to have your computer set up and r
- Learn more about how to do this in this [set of videos](https://www.youtube.com/playlist?list=PLlrxD0HtieHhS8VzuMCfQD4uJ9yne1mE6)
- It's also recommended to grasp the basics of [Python](https://docs.microsoft.com/learn/paths/python-language/?WT.mc_id=academic-15963-cxa), a programming language useful for data scientists that we use in this course.
- We also use JavaScript a few times in this course when building web apps, so you will need to have [node](https://nodejs.org) and [npm](https://www.npmjs.com/) installed and [Visual Studio Code](https://code.visualstudio.com/) available for both Python and JavaScript development.
- Since you are here on [GitHub](https://github.com), working with this courseware, you might already have an account, but if not, create one and then fork this curriculum to use on your own. (Give us a star, too, please!)
- Since you are here on [GitHub](https://github.com), working with this course ware, you might already have an account, but if not, create one and then fork this curriculum to use on your own. (Give us a star, too, please!)
- Familiarize yourself with [Scikit-Learn](https://scikit-learn.org/stable/user_guide.html), which we reference in these lessons, as well.
### What is Machine Learning?

@@ -191,6 +191,8 @@ This model's accuracy is not very good, and the shape of the clusters gives you
![clusters](images/clusters.png)
This data is too imbalanced, too weakly correlated, and has too much variance between the column values to cluster well. In fact, the clusters that form are probably heavily influenced or skewed by the three genre categories we defined above. That was a learning process!
In Scikit-Learn's documentation, you can see that a model like this one, with clusters not very well demarcated, has a 'variance' problem:
![problem models](images/problems.png)
@@ -207,7 +209,7 @@ Variance is defined as "the average of the squared differences from the Mean."[s
Spend some time with this notebook, tweaking parameters. Can you improve the accuracy of the model by cleaning the data more (removing outliers, for example)? You can use weights to give more weight to given data samples. What else can you do to create better clusters?
Hint: Try to sale your data. There's commented code in the notebook that adds Standard Scaling to make the data columns resemble each other more closely in terms of range. You'll find that while the silhouette score goes down, the 'kink' in the elbow graph smooths out. This is because leaving the data unscaled allows data with less variance to carry more weight. Read a bit more on this problem [here](https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226#21226).
Hint: Try to scale your data. There's commented code in the notebook that adds Standard Scaling to make the data columns resemble each other more closely in terms of range. You'll find that while the silhouette score goes down, the 'kink' in the elbow graph smooths out. This is because leaving the data unscaled allows data with less variance to carry more weight. Read a bit more on this problem [here](https://stats.stackexchange.com/questions/21222/are-mean-normalization-and-feature-scaling-needed-for-k-means-clustering/21226#21226).
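The scaling hint can be tried outside the notebook with a minimal sketch like the one below. It assumes synthetic data in place of the lesson's dataset: the array `X`, the helper `fit_and_score`, and the choice of three clusters are illustrative and not taken from the notebook. K-Means relies on Euclidean distance, so the range of each column affects which points end up grouped together; `StandardScaler` rescales every column to zero mean and unit variance before clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's data (an assumption for illustration):
# two columns whose values live on very different ranges.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(0, 1, 300),      # small-range column
    rng.normal(0, 1000, 300),   # large-range column
])


def fit_and_score(data, n_clusters=3):
    """Fit K-Means and return (inertia, silhouette score) for the given data."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(data)
    return model.inertia_, silhouette_score(data, labels)


# Rescale every column to zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)

raw_inertia, raw_silhouette = fit_and_score(X)
scaled_inertia, scaled_silhouette = fit_and_score(X_scaled)

print(f"unscaled: inertia={raw_inertia:.1f}, silhouette={raw_silhouette:.3f}")
print(f"scaled:   inertia={scaled_inertia:.1f}, silhouette={scaled_silhouette:.3f}")
```

Swapping `X` for the notebook's feature columns and looping `fit_and_score` over several values of `n_clusters` would give the numbers needed to redraw the elbow graph with and without scaling.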
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/28/)
## Review & Self Study
