- Understand the process of doing machine learning at a high level.
- Explore concepts like model and training data.
# Introduction
At a high level, the craft of doing machine learning (ML) goes through a number of steps. Here's what the process looks like:
1. **Decide on the question**. You have a question you want answered; here's where you decide what that question should be.
1. **Collect and prepare data**. To be able to answer your question, you need data, lots of it.
1. **Choose a model**. A model is essentially an algorithm, and you need to train it so it can recognize what you need it to recognize.
1. **Train the model**. You take part of your collected data and feed it to the model so that the model changes based on its input. The model has internal weights that get adjusted based on what you feed it.
1. **Evaluate the model**. You use never-before-seen data from your collected data to see how the model is performing.
1. **Parameter tuning**. You revisit the parameters that control the training process and adjust them to see if the model can be improved.
1. **Predict**. Make predictions with your model based on new input.
## What question to ask
OK, so why are you doing machine learning? Well, you have a question you want answered, like whether there's a correlation between living habits and diabetes, and whether age is a factor. You think it over, and you have your question: you want to know what causes diabetes.
### Identifying factors
You have a hypothesis about what might cause diabetes: age, living habits, maybe a gene is involved. Great, you're off to a good start. But to get further, you need data, lots of data, the more the better.
## Data
To be able to answer your question with any kind of certainty, you need a lot of data, and the right type. There are two things you need to do at this point:
- **Collect data**. Collect data any way you can. For things like diabetes there are actually free datasets you can use, some even built into libraries. Once you've either used existing datasets or made a ton of measurements yourself, you have data. This data is also referred to as _training data_.
- **Prepare data**. First you need to fuse your data together if it comes from many different sources. You might need to improve the data a little at this point, like cleaning and editing it. Finally, you might also need to randomize it; this ensures there's no accidental correlation introduced by the order in which you later feed the data into the model for training. A small sketch of what preparation can look like follows this list.
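As a loose illustration (not from the lesson itself), basic preparation with pandas might look like this; the sources, column names and values are all hypothetical:

```python
import pandas as pd

# Hypothetical measurements coming from two different sources
a = pd.DataFrame({'age': [52, 61], 'bmi': [27.1, 31.4]})
b = pd.DataFrame({'age': [45, None], 'bmi': [22.9, 25.0]})

data = pd.concat([a, b], ignore_index=True)   # fuse the sources into one table
data = data.dropna().drop_duplicates()        # basic cleaning: drop missing and duplicate rows
data = data.sample(frac=1, random_state=42)   # shuffle the rows to remove ordering effects
print(data)
```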
NOTE: After all this data collection and preparation, ask yourself: can I address my intended question with this data? You need a resounding yes at this point, or there's no point in continuing.
### 🎓 Feature Variable
A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets it is expressed as a column heading like 'date', 'size' or 'color'.
### Visualizing your data
You most likely need to visualize your data at this point. There might be interesting correlations that you can make use of. One thing to be aware of is whether your collected dataset represents what you are likely to find in the wild. For example, say you collected a lot of images of dogs and cats and you want a model to recognize animals in general: you need to be aware of this bias towards dogs and cats, because the consequence might be that your model labels everything as a dog or a cat when it's in fact a squirrel.
### Split your dataset
You need to split your dataset at this point into some major parts; a sketch of such a split follows the list.
- **Training**. This part of the dataset is fed to your model to train it. Its size constitutes the majority of the original dataset.
- **Validation**. A validation set is a smaller independent group of examples that you use to tune the model's hyperparameters, or architecture, to improve the model.
- **Test dataset**. A test dataset is another independent group of data, often gathered from the original data, that you use to confirm the performance of the built model.
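Here's a minimal sketch of such a three-way split using Scikit-Learn's `train_test_split`; the 70/15/15 proportions and the stand-in data are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples with 4 features each
X = np.random.rand(100, 4)
y = np.random.rand(100)

# Carve off 30% of the data, then split that portion evenly into validation and test sets
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```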
## The model
A model is another word for an algorithm, or actually many smaller algorithms working together. Because we are doing machine learning, let's refer to it henceforth as the _model_.
🎓 **Feature Selection and Feature Extraction** How do you know which variable to choose when building a model? You'll probably go through a process of feature selection or feature extraction to choose the right variables for the most performant model. They're not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." [source](https://wikipedia.org/wiki/Feature_selection)
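As an illustration (not part of the quoted source), Scikit-Learn happens to offer both techniques: `SelectKBest` returns a subset of the original features, while `PCA` extracts new features from combinations of the originals:

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_diabetes(return_X_y=True)

# Feature selection: keep the 3 original features most correlated with the target
X_selected = SelectKBest(f_regression, k=3).fit_transform(X, y)

# Feature extraction: derive 3 new features as linear combinations of the originals
X_extracted = PCA(n_components=3).fit_transform(X)

print(X.shape, X_selected.shape, X_extracted.shape)  # (442, 10) (442, 3) (442, 3)
```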
### Decide on a model
What we need to know at this point is that we need to select a model type. There are lots of existing models out there, specialized in different things like images, text and so on. You are likely to go through a process whereby you evaluate the performance of candidate models, or some other relevant metric, by feeding them unseen data, and then select the most appropriate model for the task at hand.
## Training
At this point you are ready to go. You have your training data, that subset of your original dataset, and you need to feed it into the model.
## Evaluate the model
At this point, you want to check if your model is any good: is it able to answer the question you set out for it? The way to test is by using your evaluation data and seeing how the model performs. It's important that it's data the model hasn't seen before, so it simulates how the model would perform in the real world.
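As a rough sketch of what that looks like in code (the dataset and model here are assumptions for illustration; the hands-on lesson later walks through this properly), evaluation can be as simple as scoring a trained model on held-out data:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R² score on data the model has never seen
```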
### 🎓 Model Fitting
In the context of machine learning, model fitting refers to the accuracy of the model's underlying function as it attempts to analyze data with which it is not familiar.
**Underfitting** and **overfitting** are common problems that degrade the quality of the model as the model fits either not well enough or too well. This causes the model to make predictions either too closely aligned or too loosely aligned with its training data. An overfit model predicts training data too well because it has learned the data's details and noise too well. An underfit model is not accurate as it can neither accurately analyze its training data nor data it has not yet 'seen'.
## Parameter tuning
OK, you've made some initial assumptions before starting out. Now it's time to look at something called hyperparameters. What we are looking to do is to control the learning process and see if we can make it better. Hyperparameters affect the speed and quality of the learning process; unlike the model's internal weights, they are not learned from the data but are set by you before training.
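As a sketch of what tuning can look like (the Ridge model and the candidate `alpha` values here are arbitrary assumptions), Scikit-Learn's `GridSearchCV` tries each hyperparameter value with cross-validation and reports the best one:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# Try several values of the regularization strength `alpha`, a hyperparameter of Ridge regression
search = GridSearchCV(Ridge(), param_grid={'alpha': [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X, y)

print(search.best_params_)  # the alpha that scored best under cross-validation
```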
## Prediction
Hopefully you've made it to your goal. The whole point of this process was to combine an algorithm (i.e. a model) and training data so you can make predictions about data you haven't seen yet. Will that stock increase or decrease, will it be sunny tomorrow, and so on?
The lessons in this section cover types of Regression in the context of machine learning. Regression models can help determine the relationship between variables. This type of model can predict values such as length, temperature, or age, thus uncovering relationships between variables as it analyzes datapoints.
In this series of lessons, you'll discover the difference between Linear vs. Logistic Regression, and when you should use one or the other.
But before you do anything, make sure you have the right tools in place!
In this lesson, you will learn:
- How to configure your computer for local machine learning tasks
- How to work with Jupyter notebooks
- An introduction to Scikit-Learn, including installation
- An introduction to Linear Regression with a hands-on exercise
## Installations and Configurations
[![Using Python with Visual Studio Code](https://img.youtube.com/vi/7EXd4_ttIuw/0.jpg)](https://youtu.be/7EXd4_ttIuw "Using Python with Visual Studio Code")
> 🎥 Click the image above for a video: using Python within VS Code.
1. **Install Python**. Ensure that [Python](https://www.python.org/downloads/) is installed on your computer. You will use Python for many data science and machine learning tasks. Most computer systems already include a Python installation. There are useful [Python Coding Packs](https://code.visualstudio.com/learn/educators/installers?WT.mc_id=academic-15963-cxa) available as well, to ease the setup for some users.
Some usages of Python, however, require one version of the software, whereas others require a different version. For this reason, it's useful to work within a [virtual environment](https://docs.python.org/3/library/venv.html).
2. **Install Visual Studio Code**. Make sure you have Visual Studio Code installed on your computer. Follow these instructions to [install Visual Studio Code](https://code.visualstudio.com/) for the basic installation. You are going to use Python in Visual Studio Code in this course, so you might want to brush up on how to [configure Visual Studio Code](https://docs.microsoft.com/learn/modules/python-install-vscode?WT.mc_id=academic-15963-cxa) for Python development.
> Get comfortable with Python by working through this collection of [Learn modules](https://docs.microsoft.com/users/jenlooper-2911/collections/mp1pagggd5qrq7?WT.mc_id=academic-15963-cxa)
3. **Install Scikit-Learn** by following [these instructions](https://scikit-learn.org/stable/install.html). Since you need to ensure that you use Python 3, it's recommended that you use a virtual environment. Note, if you are installing this library on an M1 Mac, there are special instructions on the page linked above.
4. **Install Jupyter Notebook**. You will need to [install the Jupyter package](https://pypi.org/project/jupyter/).
## Your ML Authoring Environment
You are going to use **notebooks** to develop your Python code and create machine learning models. This type of file is a common tool for data scientists, and they can be identified by their suffix or extension `.ipynb`.
Notebooks are an interactive environment that allows the developer to both write code and add notes and documentation around the code, which is quite helpful for experimental or research-oriented projects.
### Working with a notebook
In this folder, you will find the file `notebook.ipynb`. If you open it in VS Code, assuming VS Code is properly configured, a Jupyter server will start with Python 3+ running. You will find areas of the notebook that can be 'run' by pressing arrows next to code blocks, and other areas that contain text.
### Exercise - work with a notebook
1. Open _notebook.ipynb_ in Visual Studio Code.
![VS Code with a notebook open](images/notebook.png)
A Jupyter server will start with Python 3+ running. You will find areas of the notebook containing pieces of code that can be run. You can run a code block by selecting the icon that looks like a play button.
You can interleave your code with comments to self-document the notebook.
1. Select the `md` icon and add a bit of markdown, such as the following text: **# Welcome to your notebook**.
    Next, add some Python code.

1. Type **print('hello notebook')** in the code block.

1. Select the arrow to run the code. You should see the printed statement:

    ```output
    hello notebook
    ```

✅ Think for a minute how different a web developer's working environment is versus that of a data scientist.

## Up and Running with Scikit-Learn

Now that Python is set up in your local environment and you are comfortable with Jupyter notebooks, let's get equally comfortable with Scikit-Learn (pronounce it `sci` as in `science`). Scikit-Learn provides an [extensive API](https://scikit-learn.org/stable/modules/classes.html#api-ref) to help you perform ML tasks.

According to their [website](https://scikit-learn.org/stable/getting_started.html), "Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities."

### Let's unpack some of this jargon:

> 🎓 A machine learning **model** is a mathematical model that generates predictions given data to which it has not been exposed. It builds these predictions based on its analysis of data and extrapolating patterns.

> 🎓 **[Supervised Learning](https://wikipedia.org/wiki/Supervised_learning)** works by mapping an input to an output based on example pairs. It uses **labeled** training data to build a function to make predictions. [Download a printable Zine about Supervised Learning](https://zines.jenlooper.com/zines/supervisedlearning.html). Regression, which is covered in this group of lessons, is a type of supervised learning.

> 🎓 **[Unsupervised Learning](https://wikipedia.org/wiki/Unsupervised_learning)** works similarly but it maps pairs using **unlabeled data**. [Download a printable Zine about Unsupervised Learning](https://zines.jenlooper.com/zines/unsupervisedlearning.html)
> 🎓 **[Model Fitting](https://scikit-learn.org/stable/auto_examples/model_selection/plot_underfitting_overfitting.html#sphx-glr-auto-examples-model-selection-plot-underfitting-overfitting-py)** in the context of machine learning refers to the accuracy of the model's underlying function as it attempts to analyze data with which it is not familiar. **Underfitting** and **overfitting** are common problems that degrade the quality of the model as the model fits either not well enough or too well. This causes the model to make predictions either too closely aligned or too loosely aligned with its training data. An overfit model predicts training data too well because it has learned the data's details and noise too well. An underfit model is not accurate as it can neither accurately analyze its training data nor data it has not yet 'seen'.
![overfitting vs. correct model](images/overfitting.png)
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
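To make this concrete, here is a small sketch in the spirit of the Scikit-Learn example linked above; the noisy cosine data and the polynomial degrees are assumptions chosen for illustration. A degree-1 polynomial underfits this data, a degree-15 polynomial overfits it, and cross-validated error exposes both:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noisy samples of a cosine curve
rng = np.random.RandomState(0)
X = np.sort(rng.rand(30))[:, np.newaxis]
y = np.cos(1.5 * np.pi * X.ravel()) + rng.randn(30) * 0.1

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, scoring='neg_mean_squared_error', cv=5).mean()
    print(f'degree {degree}: cross-validated MSE {mse:.3f}')
```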
> 🎓 **Data Preprocessing** is the process whereby data scientists clean and convert data for use in the machine learning lifecycle.
> 🎓 **Model Selection and Evaluation** is the process whereby data scientists evaluate the performance of a model or any other relevant metric of a model by feeding it unseen data, selecting the most appropriate model for the task at hand.
> 🎓 **Feature Variable** A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets it is expressed as a column heading like 'date', 'size' or 'color'.
> 🎓 **[Training and Testing](https://wikipedia.org/wiki/Training,_validation,_and_test_sets) datasets** Throughout this curriculum, you will divide up a dataset into at least two parts, one large group of data for 'training' and a smaller part for 'testing'. Sometimes you'll also find a 'validation' set. A training set is the group of examples you use to train a model. A validation set is a smaller independent group of examples that you use to tune the model's hyperparameters, or architecture, to improve the model. A test dataset is another independent group of data, often gathered from the original data, that you use to confirm the performance of the built model.
> 🎓 **Feature Selection and Feature Extraction** How do you know which variable to choose when building a model? You'll probably go through a process of feature selection or feature extraction to choose the right variables for the most performant model. They're not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." [source](https://wikipedia.org/wiki/Feature_selection)
In this course, you will use Scikit-Learn and other tools to build machine learning models to perform what we call 'traditional machine learning' tasks. We have deliberately avoided neural networks and deep learning, as they are better covered in our forthcoming 'AI for Beginners' curriculum.
Scikit-Learn makes it straightforward to build models and evaluate them for use. It is primarily focused on using numeric data and contains several ready-made datasets for use as learning tools. It also includes pre-built models for students to try. Let's explore the process of loading prepackaged data and using a built-in estimator to create your first ML model with Scikit-Learn, using some basic data.
## Exercise - your first Scikit-Learn notebook
> This tutorial was inspired by the [Linear Regression example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) on Scikit-Learn's web site.
In the _notebook.ipynb_ file associated with this lesson, clear out all the cells by pressing the 'trash can' icon.
In this section, you will work with a small dataset about diabetes that is built into Scikit-Learn for learning purposes. Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic Regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.
✅ There are many types of Regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use Linear Regression, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use Logistic Regression. You'll learn more about Logistic Regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.
Let's get started on this task.
### Import libraries
For this task we will import some libraries:
- **matplotlib**. It's a useful [graphing tool](https://matplotlib.org/) and we will use it to create a line plot.
- **numpy**. [numpy](https://numpy.org/doc/stable/user/whatisnumpy.html) is a useful library for handling numeric data in Python.
- **sklearn**. This is the Scikit-Learn library.
1. Add imports by typing the following code:
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, model_selection
```
Above you are importing `matplotlib` and `numpy`, and you are importing `datasets`, `linear_model` and `model_selection` from `sklearn`. `model_selection` is used for splitting data into training and test sets.
### The diabetes dataset
The built-in [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) includes 442 samples of data around diabetes, with 10 feature variables, some of which include:
- age: age in years
- bmi: body mass index
Now, load up the X and y data.
> 🎓 Remember, this is supervised learning, and we need a named 'y' target.
In a new code cell, load the diabetes dataset by calling `load_diabetes()`. The input `return_X_y=True` signals that `X` will be a data matrix, and `y` will be the regression target.
1. Add some print commands to show the shape of the data matrix and its first element:
```python
X, y = datasets.load_diabetes(return_X_y=True) # load the dataset as a (data, target) tuple
print(X.shape) # (442, 10): 442 samples with 10 features each
print(X[0]) # the feature values of the first sample
```
What you are getting back as a response is a tuple. You are assigning the first two values of the tuple to `X` and `y` respectively. Learn more [about tuples](https://wikipedia.org/wiki/Tuple).
> 🎓 A **tuple** is an [ordered list of elements](https://wikipedia.org/wiki/Tuple).
You can see that this data has 442 items, shaped in arrays of 10 elements.
✅ Think a bit about the relationship between the data and the regression target. Linear regression predicts relationships between feature X and target variable y. Can you find the [target](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) for the diabetes dataset in the documentation? What is this dataset demonstrating, given that target?
1. Next, select a portion of this dataset to plot by arranging it into a new array using numpy's `newaxis`. We are going to use Linear Regression to generate a line between values in this data, according to a pattern it determines.
```python
X = X[:, np.newaxis, 2] # select only the third feature (index 2), keeping the 2D shape Scikit-Learn expects
```
✅ At any time, print out the data to check its shape.
1. Now that you have data ready to be plotted, you can see if a machine can help determine a logical split between the numbers in this dataset. To do this, you need to split both the data (X) and the target (y) into test and training sets. Scikit-Learn has a straightforward way to do this; you can split your test data at a given point.
```python
# Hold out 33% of the data for testing
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33)
```
1. Now you are ready to train your model! Load up the Linear Regression model and train it with your X and y training sets using `model.fit()`:
```python
# Create the model and learn its coefficients from the training data
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
```
✅ `model.fit()` is a function you'll see in many ML libraries such as TensorFlow.
1. Then, create a prediction using test data, using the function `predict()`. This will be used to draw the line between data groups.
```python
y_pred = model.predict(X_test)
```
1. Now it's time to show the data in a plot. Matplotlib is a very useful tool for this task. Create a scatterplot of all the X and y test data, and use the prediction to draw a line in the most appropriate place, between the model's data groupings.
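The plotting code isn't spelled out at this point, so here is a minimal sketch, assuming the `X_test`, `y_test` and `y_pred` variables from the previous steps are still in scope; the axis labels are assumptions based on the bmi feature selected earlier:

```python
plt.scatter(X_test, y_test, color='black')           # the actual test data points
plt.plot(X_test, y_pred, color='blue', linewidth=3)  # the straight line the model learned
plt.xlabel('Scaled BMIs')
plt.ylabel('Disease Progression')
plt.show()
```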
![a scatterplot showing datapoints around diabetes](./images/scatterplot.png)
✅ Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model.
Congratulations, you just built your first Linear Regression model, created a prediction with it, and displayed it in a plot!