edits and image refresh

pull/38/head
Jen Looper 3 years ago
parent afcf395224
commit b333541d82

@ -45,23 +45,20 @@ Although the terms can be confused, machine learning (ML) is an important subset
## What you will learn in this course
In this curriculum, we are going to cover only the core concepts of machine learning that a beginner must know. We cover what we call 'Classical machine learning' primarily using Scikit-learn, an excellent library many students use to learn the basics. To understand broader concepts of artificial intelligence or deep learning, a strong fundamental knowledge of machine learning is indispensable, and so we would like to offer it here.
You will additionally learn the basics of Regression, Classification, Clustering, natural language processing, Time Series Forecasting, and Reinforcement Learning, as well as real-world applications, the history of ML, ML and Fairness, and how to use your model in web apps.
In this curriculum, we are going to cover only the core concepts of machine learning that a beginner must know. We cover what we call 'classical machine learning' primarily using Scikit-learn, an excellent library many students use to learn the basics. To understand broader concepts of artificial intelligence or deep learning, a strong fundamental knowledge of machine learning is indispensable, and so we would like to offer it here.
In this course you will learn:
- Core concepts of machine learning
- The history of ML
- core concepts of machine learning
- the history of ML
- ML and fairness
- The definition of "Classical machine learning"
- Regression
- Classification
- Clustering
- natural language processing
- Time Series Forecasting
- Reinforcement Learning
- Real-world applications
- regression ML techniques
- classification ML techniques
- clustering ML techniques
- natural language processing ML techniques
- time series forecasting ML techniques
- reinforcement learning
- real-world applications for ML
## What we will not cover
- deep learning

@ -4,9 +4,8 @@ The process of building, using, and maintaining machine learning models and the
- Understand the processes underpinning machine learning at a high level.
- Explore base concepts such as 'models', 'predictions', and 'training data'.
-
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/7/)
## Introduction
On a high level, the craft of creating machine learning (ML) processes is comprised of a number of steps:
@ -45,17 +44,15 @@ To be able to answer your question with any kind of certainty, you need a good a
A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets it is expressed as a column heading like 'date', 'size', or 'color'. Your target variable (often called the label), usually represented as `y` in code, represents the answer to the question you are trying to ask of your data: in December, what **color** pumpkins will be cheapest? In San Francisco, what neighborhoods will have the best real estate **price**?
🎓 **Feature Selection and Feature Extraction** How do you know which variable to choose when building a model? You'll probably go through a process of feature selection or feature extraction to choose the right variables for the most performant model. They're not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." ([source](https://wikipedia.org/wiki/Feature_selection))
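To make the distinction concrete, here is a minimal sketch that assumes Scikit-learn's bundled iris dataset and uses `SelectKBest` and `PCA` as stand-ins for selection and extraction. These particular tools are an illustration chosen here, not something the lesson prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Assumption: the iris dataset (150 rows, 4 feature columns) stands in for your data.
X, y = load_iris(return_X_y=True)

# Feature selection: keep 2 of the 4 original columns, values unchanged.
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)  # indices of the columns that were kept

# Feature extraction: build 2 new columns as combinations of all 4 originals.
X_extracted = PCA(n_components=2).fit_transform(X)

print("kept original columns:", kept)
print("selected shape:", X_selected.shape, "extracted shape:", X_extracted.shape)
```

Both results have two columns, but only the selected columns can be found verbatim in the original data.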
### Visualize your data
An important aspect of the data scientist's toolkit is the power to visualize data using several excellent libraries such as Seaborn or Matplotlib. Representing your data visually might allow you to uncover hidden correlations that you can leverage. Your visualizations might also help you to uncover bias or unbalanced data (as we discover in [Classification](../../4-Classification/2-Classifiers-1/README.md)).
### Split your dataset
Prior to training, you need to split your dataset into two or more parts of unequal size that still represent the data well.
- **Training**, this part of the dataset goes into your model to train it. The size of this chunk constitutes the majority of the original dataset.
- **Testing**. A test dataset is another independent group of data, often gathered from the original data, that you use to confirm the performance of the built model.
- **Training**. This part of the dataset is fit to your model to train it. This set constitutes the majority of the original dataset.
- **Testing**. A test dataset is an independent group of data, often gathered from the original data, that you use to confirm the performance of the built model.
- **Validating**. A validation set is a smaller independent group of examples that you use to tune the model's hyperparameters, or architecture, to improve the model. Depending on your data's size and the question you are asking, you might not need to build this third set (as we note in [Time Series Forecasting](../7-TimeSeries/1-Introduction/README.md)).
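The three-way split described above can be sketched with Scikit-learn's `train_test_split` called twice. The toy data, the 60/20/20 proportions, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 feature columns, one numeric target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# First hold out 20% of the rows for testing.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then carve a validation set out of the remaining 80 rows (25% of 80 = 20 rows).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The training set remains the largest part, and no row appears in more than one set.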
## Building a model
@ -65,11 +62,9 @@ Using your training data, your goal is to build a model, or a statistical repres
### Decide on a training method
Depending on your question and the nature of your data, you will choose a method to train it. Stepping through [Scikit-learn's documentation](https://scikit-learn.org/stable/user_guide.html), which we use in this course, you can explore many ways to train a model. Depending on your experience, you might have to try several different methods to build the best model. You are likely to go through a process whereby data scientists evaluate the performance of a model by feeding it unseen data, checking for accuracy, bias, and other quality-degrading issues, and selecting the most appropriate training method for the task at hand.
### Train
### Train a model
Armed with your training data, you are ready to 'fit' it to create a model. In many ML libraries you will find the code `model.fit`; this is where you send in your feature data as an array of values (usually `X`) and a target variable (usually `y`).
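As a minimal sketch of what that call looks like in Scikit-learn, the example below fits a `LinearRegression` to made-up data. The data and the choice of a linear model are assumptions for illustration; your own question will determine the estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data: y is roughly 3*x + 2 plus a little noise.
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 1))  # 2-D array: one row per sample, one column per feature
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.5, size=50)  # 1-D array of values to predict

model = LinearRegression()
model.fit(X, y)  # training happens in this call

# The learned slope and intercept should land close to 3 and 2.
print(model.coef_, model.intercept_)
```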
### Evaluate the model
Once the training process is complete (it can take many iterations, or 'epochs', to train a large model), you will be able to evaluate the model's quality by using test data to gauge its performance. This data is a subset of the original data that the model has not previously analyzed. You can print out a table of metrics about your model's quality.
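One way to print such a table in Scikit-learn is `classification_report`, shown here on a synthetic classification problem. The dataset, the `LogisticRegression` model, and the 80/20 split are assumptions chosen to keep the sketch short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for a real dataset.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)

# Keep 20% of the rows unseen during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate only on the rows the model has not analyzed before.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

print(f"accuracy on test data: {test_accuracy:.3f}")
print(report)  # precision, recall, f1-score, and support per class
```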
@ -80,6 +75,9 @@ In the context of machine learning, model fitting refers to the accuracy of the
🎓 **Underfitting** and **overfitting** are common problems that degrade the quality of the model, as the model fits either not well enough or too well. This causes the model to make predictions either too closely aligned or too loosely aligned with its training data. An overfit model predicts training data too well because it has learned the data's details and noise too well. An underfit model is not accurate as it can neither accurately analyze its training data nor data it has not yet 'seen'.
![overfitting model](images/overfitting.png)
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
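Underfitting and overfitting can also be seen numerically by comparing error on training data with error on unseen data. The sketch below uses NumPy polynomial fits on a made-up noisy sine curve; the data, the noise level, and the degrees 1, 3, and 9 are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def noisy_sine(n):
    """Return n points of sin(2*pi*x) on [0, 1] with Gaussian noise added."""
    x = rng.uniform(0.0, 1.0, size=n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = noisy_sine(20)   # small training set
x_test, y_test = noisy_sine(200)    # unseen data from the same process

def mse(coeffs, x, y):
    """Mean squared error of a polynomial (given by coeffs) on the points (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train MSE {errors[degree][0]:.4f}, test MSE {errors[degree][1]:.4f}")
```

The straight line (degree 1) is too simple for a sine curve, so its error is high on both sets, which is underfitting. Training error keeps falling as the degree rises; when the error on unseen data stops falling with it, or climbs, the extra flexibility is fitting noise, which is the signature of overfitting.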
## Parameter tuning
Once your initial training is complete, observe the quality of the model and consider improving it by tweaking its 'hyperparameters'. Read more about the process [in the documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?WT.mc_id=academic-15963-cxa).
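The linked documentation covers Azure Machine Learning. As a local alternative, here is a minimal sketch using Scikit-learn's `GridSearchCV`, which is swapped in for illustration and is not the method the link describes. The k-nearest-neighbors model, the candidate values, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# n_neighbors is a hyperparameter: it is set before training, not learned from data.
param_grid = {"n_neighbors": [1, 3, 5, 9, 15]}

# Try every candidate value, scoring each with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print("best hyperparameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```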

Binary file not shown. After: 99 KiB.

Binary file not shown. Before: 219 KiB.
