Merge branch 'main' into AnkitaSinghIE-patch-1

Jasmine Greenaway 3 years ago committed by GitHub
commit 6467a267a7

.gitignore

@ -307,6 +307,7 @@ paket-files/
# Python Tools for Visual Studio (PTVS)
__pycache__/
*.pyc
venv/
# Cake - Uncomment if you are using it
# tools/**
@ -350,3 +351,6 @@ MigrationBackup/
# Ionide (cross platform F# VS Code tools) working folder
.ionide/
4-Data-Science-Lifecycle/14-Introduction/README.md
.vscode/settings.json
Data/Taxi/*

@ -1,7 +1,12 @@
# Defining Data Science
[![Defining Data Science Video](images/video-def-ds.png)](https://youtu.be/pqqsm5reGvs)
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/01-Definitions.png)|
|:---:|
|Defining Data Science - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
---
[![Defining Data Science Video](images/video-def-ds.png)](https://youtu.be/pqqsm5reGvs)
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/0)

@ -1,5 +1,11 @@
# Introduction to Data Ethics
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/02-Ethics.png)|
|:---:|
| Data Science Ethics - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
---
We are all data citizens living in a datafied world.
Market trends tell us that by 2022, 1-in-3 large organizations will buy and sell their data through online [Marketplaces and Exchanges](https://www.gartner.com/smarterwithgartner/gartner-top-10-trends-in-data-and-analytics-for-2020/). As **App Developers**, we'll find it easier and cheaper to integrate data-driven insights and algorithm-driven automation into daily user experiences. But as AI becomes pervasive, we'll also need to understand the potential harms caused by the [weaponization](https://www.youtube.com/watch?v=TQHs8SA1qpk) of such algorithms at scale.
@ -17,13 +23,6 @@ In this lesson, we'll explore the fascinating area of data ethics - from core co
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/2) 🎯
## Sketchnote 🖼
> A Visual Guide to Data Ethics by [Nitya Narasimhan](https://twitter.com/nitya) / [(@sketchthedocs)](https://sketchthedocs.dev)
---
## Basic Definitions
Let's start by understanding the basic terminology.

@ -1,5 +1,9 @@
# Defining Data
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/03-DefiningData.png)|
|:---:|
|Defining Data - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Data are facts, information, observations and measurements that are used to make discoveries and to support informed decisions. A data point is a single unit of data within a dataset, which is a collection of data points. Datasets may come in different formats and structures, usually based on their source, or where the data came from. For example, a company's monthly earnings might be in a spreadsheet, but hourly heart rate data from a smartwatch may be in [JSON](https://stackoverflow.com/a/383699) format. It's common for data scientists to work with different types of data within a dataset.
This lesson focuses on identifying and classifying data by its characteristics and its sources.
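To make the format distinction above concrete, here is a tiny, invented example (both the field names and the values are made up for illustration) of hourly heart-rate readings represented as JSON and read from Python:

```python
import json

# Invented sample of hourly smartwatch heart-rate readings, stored as JSON text.
raw = '[{"hour": "09:00", "bpm": 72}, {"hour": "10:00", "bpm": 85}]'
readings = json.loads(raw)

for point in readings:              # each entry is one data point in this tiny dataset
    print(point["hour"], point["bpm"])
```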

@ -1,5 +1,9 @@
# A Brief Introduction to Statistics and Probability
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/04-Statistics-Probability.png)|
|:---:|
| Statistics and Probability - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Statistics and Probability Theory are two closely related areas of Mathematics that are highly relevant to Data Science. It is possible to operate with data without deep knowledge of mathematics, but it is still better to know at least some basic concepts. Here we will present a short introduction that will help you get started.
[![Intro Video](images/video-prob-and-stats.png)](https://youtu.be/Z5Zy85g4Yjw)

@ -1,5 +1,9 @@
# Working with Data: Relational Databases
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/05-RelationalData.png)|
|:---:|
| Working With Data: Relational Databases - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Chances are you have used a spreadsheet in the past to store information. You had a set of rows and columns, where the rows contained the information (or data), and the columns described the information (sometimes called metadata). A relational database is built upon this core principle of columns and rows in tables, allowing you to have information spread across multiple tables. This allows you to work with more complex data, avoid duplication, and have flexibility in the way you explore the data. Let's explore the concepts of a relational database.
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/8)
@ -10,7 +14,7 @@ A relational database has at its core tables. Just as with the spreadsheet, a ta
Let's begin our exploration by starting a table to store information about cities. We might start with their name and country. You could store this in a table as follows:
| city | country |
| City | Country |
| -------- | ------------- |
| Tokyo | Japan |
| Atlanta | United States |
@ -22,7 +26,7 @@ Notice the column names of **city**, **country** and **population** to describe
Chances are, the table above seems relatively familiar to you. Let's start to add some additional data to our burgeoning database - annual rainfall (in millimeters). We'll focus on the years 2018, 2019 and 2020. If we were to add it for Tokyo, it might look something like this:
| city | country | year | amount |
| City | Country | Year | Amount |
| ----- | ------- | ---- | ------ |
| Tokyo | Japan | 2020 | 1690 |
| Tokyo | Japan | 2019 | 1874 |
@ -32,7 +36,7 @@ What do you notice about our table? You might notice we're duplicating the name
OK, let's try something else. Let's add new columns for each year:
| city | country | 2018 | 2019 | 2020 |
| City | Country | 2018 | 2019 | 2020 |
| -------- | ------------- | ---- | ---- | ---- |
| Tokyo | Japan | 1445 | 1874 | 1690 |
| Atlanta | United States | 1779 | 1111 | 1683 |
@ -46,7 +50,7 @@ This is why we need multiple tables and relationships. By breaking apart our dat
Let's return to our data and determine how we want to split things up. We know we want to store the name and country for our cities, so this will probably work best in one table.
| city | country |
| City | Country |
| -------- | ------------- |
| Tokyo | Japan |
| Atlanta | United States |
@ -54,21 +58,23 @@ Let's return to our data and determine how we want to split things up. We know w
But before we create the next table, we need to figure out how to reference each city. We need some form of an identifier, ID or (in technical database terms) a primary key. A primary key is a value used to identify one specific row in a table. While this could be based on a value itself (we could use the name of the city, for example), it should almost always be a number or other identifier. We don't want the id to ever change as it would break the relationship. You will find in most cases the primary key or id will be an auto-generated number.
> [!NOTE] Primary key is frequently abbreviated as PK
### cities
| city_id | city | country |
| city_id | City | Country |
| ------- | -------- | ------------- |
| 1 | Tokyo | Japan |
| 2 | Atlanta | United States |
| 3 | Auckland | New Zealand |
> [!NOTE] You will notice we use the terms "id" and "primary key" interchangeably during this lesson. The concepts here apply to DataFrames, which you will explore later. DataFrames don't use the terminology of "primary key", however you will notice they behave much in the same way.
> [NOTE] You will notice we use the terms "id" and "primary key" interchangeably during this lesson. The concepts here apply to DataFrames, which you will explore later. DataFrames don't use the terminology of "primary key", however you will notice they behave much in the same way.
With our cities table created, let's store the rainfall. Rather than duplicating the full information about the city, we can use the id. We should also ensure the newly created table has an *id* column as well, as all tables should have an id or primary key.
### rainfall
| rainfall_id | city_id | year | amount |
| rainfall_id | city_id | Year | Amount |
| ----------- | ------- | ---- | ------ |
| 1 | 1 | 2018 | 1445 |
| 2 | 1 | 2019 | 1874 |
@ -80,7 +86,9 @@ With our cities table created, let's store the rainfall. Rather than duplicating
| 8 | 3 | 2019 | 942 |
| 9 | 3 | 2020 | 1176 |
Notice the **city_id** column inside the newly created **rainfall** table. This column contains values which reference the IDs in the **cities** table. In technical relational data terms, this is called a foreign key; it's a primary key from another table. You can just think of it as a reference or a pointer. **city_id** 1 references Tokyo.
Notice the **city_id** column inside the newly created **rainfall** table. This column contains values which reference the IDs in the **cities** table. In technical relational data terms, this is called a **foreign key**; it's a primary key from another table. You can just think of it as a reference or a pointer. **city_id** 1 references Tokyo.
> [!NOTE] Foreign key is frequently abbreviated as FK
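To make the primary key and foreign key ideas concrete, here is a minimal sketch (not part of the lesson itself) that builds the two tables in a throwaway, in-memory SQLite database from Python; the column types and the `REFERENCES` clause are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for experimenting
conn.executescript("""
CREATE TABLE cities (
    city_id INTEGER PRIMARY KEY,    -- auto-generated identifier (primary key)
    city    TEXT,
    country TEXT
);
CREATE TABLE rainfall (
    rainfall_id INTEGER PRIMARY KEY,
    city_id     INTEGER REFERENCES cities(city_id),  -- foreign key pointing back to cities
    year        INTEGER,
    amount      INTEGER
);
""")

conn.execute("INSERT INTO cities (city, country) VALUES (?, ?)", ("Tokyo", "Japan"))
conn.execute("INSERT INTO rainfall (city_id, year, amount) VALUES (1, 2020, 1690)")
print(conn.execute("SELECT * FROM rainfall").fetchall())
```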
## Retrieving the data
@ -100,7 +108,7 @@ FROM cities;
`SELECT` is where you list the columns, and `FROM` is where you list the tables.
> [!NOTE] SQL syntax is case-insensitive, meaning `select` and `SELECT` mean the same thing. However, depending on the type of database you are using the columns and tables might be case sensitive. As a result, it's a best practice to always treat everything in programming like it's case sensitive. When writing SQL queries common convention is to put the keywords in all upper-case letters.
> [NOTE] SQL syntax is case-insensitive, meaning `select` and `SELECT` mean the same thing. However, depending on the type of database you are using the columns and tables might be case sensitive. As a result, it's a best practice to always treat everything in programming like it's case sensitive. When writing SQL queries common convention is to put the keywords in all upper-case letters.
The query above will display all cities. Let's imagine we only wanted to display cities in New Zealand. We need some form of a filter. The SQL keyword for this is `WHERE`, or "where something is true".
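Continuing the small SQLite sketch from earlier in this lesson (illustrative only, not the lesson's exact query), a `WHERE` filter might look like this:

```python
# Reuses the `conn` connection from the earlier sketch; the parameter keeps the query safe.
rows = conn.execute(
    "SELECT city FROM cities WHERE country = ?",
    ("New Zealand",),
).fetchall()
print(rows)  # only the cities whose country matches the filter
```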
@ -154,7 +162,7 @@ Relational databases are centered around dividing information between multiple t
## 🚀 Challenge
TBD
There are numerous relational databases available on the internet. You can explore the data by using the skills you've learned above.
## Post-Lecture Quiz
@ -162,7 +170,11 @@ TBD
## Review & Self Study
- SQL content on Learn
There are several resources available on [Microsoft Learn](https://docs.microsoft.com/learn?WT.mc_id=academic-40229-cxa) for you to continue your exploration of SQL and relational database concepts:
- [Describe concepts of relational data](https://docs.microsoft.com//learn/modules/describe-concepts-of-relational-data?WT.mc_id=academic-40229-cxa)
- [Get Started Querying with Transact-SQL](https://docs.microsoft.com//learn/paths/get-started-querying-with-transact-sql?WT.mc_id=academic-40229-cxa) (Transact-SQL is a version of SQL)
- [SQL content on Microsoft Learn](https://docs.microsoft.com/learn/browse/?products=azure-sql-database%2Csql-server&expanded=azure&WT.mc_id=academic-40229-cxa)
## Assignment

@ -1,8 +1,59 @@
# Title
# Displaying airport data
You have been provided a [database](https://raw.githubusercontent.com/Microsoft/Data-Science-For-Beginners/main/2-Working-With-Data/05-relational-databases/airports.db) built on [SQLite](https://sqlite.org/index.html) which contains information about airports. The schema is displayed below. You will use the [SQLite extension](https://marketplace.visualstudio.com/items?itemName=alexcvzz.vscode-sqlite&WT.mc_id=academic-40229-cxa) in [Visual Studio Code](https://code.visualstudio.com?WT.mc_id=academic-40229-cxa) to display information about different cities' airports.
## Instructions
To get started with the assignment, you'll need to perform a couple of steps: install a bit of tooling and download the sample database.
### Setup your system
You can use Visual Studio Code and the SQLite extension to interact with the database.
1. Navigate to [code.visualstudio.com](https://code.visualstudio.com?WT.mc_id=academic-40229-cxa) and follow the instructions to install Visual Studio Code
1. Install the [SQLite extension](https://marketplace.visualstudio.com/items?itemName=alexcvzz.vscode-sqlite&WT.mc_id=academic-40229-cxa) as instructed on the Marketplace page
### Download and open the database
Next you will download and open the database.
1. Download the [database file from GitHub](https://raw.githubusercontent.com/Microsoft/Data-Science-For-Beginners/main/2-Working-With-Data/05-relational-databases/airports.db) and save it to a directory
1. Open Visual Studio Code
1. Open the database in the SQLite extension by selecting **Ctrl-Shift-P** (or **Cmd-Shift-P** on a Mac) and typing `SQLite: Open database`
1. Select **Choose database from file** and open the **airports.db** file you downloaded previously
1. After opening the database (you won't see an update on the screen), create a new query window by selecting **Ctrl-Shift-P** (or **Cmd-Shift-P** on a Mac) and typing `SQLite: New query`
Once open, the new query window can be used to run SQL statements against the database. You can use the command **Ctrl-Shift-Q** (or **Cmd-Shift-Q** on a Mac) to run queries against the database.
> [!NOTE] For more information about the SQLite extension, you can consult the [documentation](https://marketplace.visualstudio.com/items?itemName=alexcvzz.vscode-sqlite&WT.mc_id=academic-40229-cxa)
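If you would rather query the database from code instead of the extension, a small sketch like the one below (an assumption, not part of the assignment) runs against the downloaded file using Python's built-in `sqlite3` module:

```python
import sqlite3

conn = sqlite3.connect("airports.db")   # assumes the file sits next to this script

# Exploratory query only; the assignment questions themselves are left for you to write.
for (table_name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
    print(table_name)                    # lists the tables in the database

conn.close()
```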
## Database schema
A database's schema is its table design and structure. The **airports** database has two tables, `cities`, which contains a list of cities in the United Kingdom and Ireland, and `airports`, which contains the list of all airports. Because some cities may have multiple airports, two tables were created to store the information. In this exercise you will use joins to display information for different cities.
| Cities |
| ---------------- |
| id (PK, integer) |
| city (text) |
| country (text) |
| Airports |
| -------------------------------- |
| id (PK, integer) |
| name (text) |
| code (text) |
| city_id (FK to id in **Cities**) |
## Assignment
Create queries to return the following information:
1. all city names in the `Cities` table
1. all cities in Ireland in the `Cities` table
1. all airport names with their city and country
1. all airports in London, United Kingdom
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |
| Exemplary | Adequate | Needs Improvement |
| --------- | -------- | ----------------- |

@ -1,8 +1,10 @@
# Working with Data: Non-Relational Data
## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/10)
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/06-NoSQL.png)|
|:---:|
|Working with NoSQL Data - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/10)
Data is not limited to relational databases. This lesson focuses on non-relational data and will cover the basics of spreadsheets and NoSQL.

@ -1,5 +1,9 @@
# Working with Data: Python and the Pandas Library
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/07-WorkWithPython.png)|
|:---:|
|Working With Python - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
[![Intro Video](images/video-ds-python.png)](https://youtu.be/dZjWOGbsN4Y)
While databases offer very efficient ways to store data and query it using query languages, the most flexible way to process data is to write your own program to manipulate it. In many cases, a database query is the more effective approach. However, in some cases when more complex data processing is needed, it cannot be done easily using SQL.
@ -55,7 +59,7 @@ Pandas is centered around a few basic concepts.
Consider an example: we want to analyze sales of our ice-cream spot. Let's generate a series of sales numbers (number of items sold each day) for some time period:
```python
tart_date = "Jan 1, 2020"
start_date = "Jan 1, 2020"
end_date = "Mar 31, 2020"
idx = pd.date_range(start_date,end_date)
print(f"Length of index is {len(idx)}")
@ -267,7 +271,7 @@ Whether you already have structured or unstructured data, using Python you can p
**Learning Python**
* [Learn Python in a Fun Way with Turtle Graphics and Fractals](https://github.com/shwars/pycourse)
* [Take your First Steps with Python](https://docs.microsoft.com/en-us/learn/paths/python-first-steps/?WT.mc_id=acad-31812-dmitryso) Learning Path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=acad-31812-dmitryso)
* [Take your First Steps with Python](https://docs.microsoft.com/learn/paths/python-first-steps/?WT.mc_id=acad-31812-dmitryso) Learning Path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=acad-31812-dmitryso)
## Assignment

@ -1,15 +1,19 @@
# Working with Data: Data Preparation
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/08-DataPreparation.png)|
|:---:|
|Data Preparation - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## Pre-Lecture Quiz
[Pre-lecture quiz]()
[Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/14)
## 🚀 Challenge
## Post-Lecture Quiz
[Post-lecture quiz]()
[Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/15)
## Review & Self Study

@ -13,4 +13,4 @@ In these lessons, you will learn some of the ways that data can be managed, mani
### Credits
These lessons were written with ❤️ by [Christopher Harrison](https://twitter.com/geektrainer) and ...
These lessons were written with ❤️ by [Christopher Harrison](https://twitter.com/geektrainer), [Dmitry Soshnikov](https://twitter.com/shwars) and [Jasmine Greenaway](https://twitter.com/paladique)

@ -1,6 +1,10 @@
# Visualizing Quantities
In this lesson, you will use three different libraries to learn how to create interesting visualizations all around the concept of quantity. Using a cleaned dataset about the birds of Minnesota, you can learn many interesting facts about local wildlife.
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/09-Visualizing-Quantities.png)|
|:---:|
| Visualizing Quantities - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson you will explore how to use one of the many available Python libraries to learn how to create interesting visualizations all around the concept of quantity. Using a cleaned dataset about the birds of Minnesota, you can learn many interesting facts about local wildlife.
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/16)
## Observe wingspan with Matplotlib
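As a preview of what this section builds up to, here is a minimal sketch (the file path and column name are assumptions, not the lesson's exact code) for plotting one quantity with Matplotlib:

```python
import pandas as pd
import matplotlib.pyplot as plt

birds = pd.read_csv("../../data/birds.csv")     # path is an assumption; adjust to your copy
plt.plot(birds.index, birds["MaxWingspan"])      # column name is an assumption
plt.title("Max Wingspan per Bird")
plt.show()
```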

@ -1,5 +1,9 @@
# Visualizing Distributions
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/10-Visualizing-Distributions.png)|
|:---:|
| Visualizing Distributions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In the previous lesson, you learned some interesting facts about a dataset about the birds of Minnesota. You found some erroneous data by visualizing outliers and looked at the differences between bird categories by their maximum length.
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/18)

@ -1,18 +1,22 @@
# Visualizing Proportions
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/11-Visualizing-Proportions.png)|
|:---:|
|Visualizing Proportions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you will use a different nature-focused dataset to visualize proportions, such as how many different types of fungi populate a given dataset about mushrooms. Let's explore these fascinating fungi using a dataset sourced from Audubon listing details about 23 species of gilled mushrooms in the Agaricus and Lepiota families. You will experiment with tasty visualizations such as:
- Pie charts 🥧
- Donut charts 🍩
- Waffle charts 🧇
> 💡 A very interesting project called [Charticulator](https://charticulator.com) by Microsoft Research offers a free drag and drop interface for data visualizations. In one of their tutorials they also use this mushroom dataset! So you can explore the data and learn the library at the same time: https://charticulator.com/tutorials/tutorial4.html
> 💡 A very interesting project called [Charticulator](https://charticulator.com) by Microsoft Research offers a free drag and drop interface for data visualizations. In one of their tutorials they also use this mushroom dataset! So you can explore the data and learn the library at the same time: [Charticulator tutorial](https://charticulator.com/tutorials/tutorial4.html).
## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/20)
## Get to know your mushrooms 🍄
Mushrooms are very interesting. Let's import a dataset to study them.
Mushrooms are very interesting. Let's import a dataset to study them:
```python
import pandas as pd
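# Sketch, not part of the original listing: the file path is an assumption,
# adjust it to wherever you saved the mushroom dataset.
mushrooms = pd.read_csv('../../data/mushrooms.csv')
print(mushrooms.head())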
@ -30,7 +34,7 @@ A table is printed out with some great data for analysis:
| Edible | Bell | Smooth | White | Bruises | Anise | Free | Close | Broad | Brown | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Meadows |
| Poisonous | Convex | Scaly | White | Bruises | Pungent | Free | Close | Narrow | Brown | Enlarging | Equal | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban |
Right away, you notice that all the data is textual. You will have to edit this data to be able to use it in a chart. Most of the data, in fact, is represented as an object:
Right away, you notice that all the data is textual. You will have to convert this data to be able to use it in a chart. Most of the data, in fact, is represented as an object:
```python
print(mushrooms.select_dtypes(["object"]).columns)
@ -72,7 +76,7 @@ plt.pie(edibleclass['population'],labels=labels,autopct='%.1f %%')
plt.title('Edible?')
plt.show()
```
Voila, a pie chart showing the proportions of this data according to these two classes of mushroom. It's quite important to get the order of labels correct, especially here, so be sure to verify the order with which the label array is built!
Voila, a pie chart showing the proportions of this data according to these two classes of mushrooms. It's quite important to get the order of the labels correct, especially here, so be sure to verify the order with which the label array is built!
![pie chart](images/pie1.png)
@ -80,7 +84,7 @@ Voila, a pie chart showing the proportions of this data according to these two c
A somewhat more visually interesting pie chart is a donut chart, which is a pie chart with a hole in the middle. Let's look at our data using this method.
Take a look at the various habitats where mushrooms grow.
Take a look at the various habitats where mushrooms grow:
```python
habitat=mushrooms.groupby(['habitat']).count()
@ -106,9 +110,9 @@ plt.show()
![donut chart](images/donut.png)
This code draws a chart and a center circle, then adds that center circle in. Edit the width of the center circle by changing `0.40` to another value.
This code draws a chart and a center circle, then adds that center circle in the chart. Edit the width of the center circle by changing `0.40` to another value.
Donut charts can be tweaked several ways to change the labels. The labels in particular can be highlighted for readability. Learn more in the [docs](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html?highlight=donut).
Donut charts can be tweaked in several ways to change the labels. The labels in particular can be highlighted for readability. Learn more in the [docs](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html?highlight=donut).
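For reference, the center-circle trick generally looks like the minimal sketch below (the data here is illustrative, not the lesson's habitat counts):

```python
import matplotlib.pyplot as plt

sizes = [45, 30, 25]
labels = ["Woods", "Grasses", "Meadows"]          # illustrative habitat labels

plt.pie(sizes, labels=labels, autopct="%.1f%%")
center_circle = plt.Circle((0, 0), 0.40, color="white")  # 0.40 controls the size of the hole
plt.gca().add_artist(center_circle)
plt.title("Mushroom habitats")
plt.show()
```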
Now that you know how to group your data and then display it as a pie or donut, you can explore other types of charts. Try a waffle chart, which is just a different way of exploring quantity.
## Waffles!
@ -149,13 +153,13 @@ fig = plt.figure(
)
```
Using a waffle chart, you can plainly see the proportions of cap color of this mushroom dataset. Interestingly, there are many green-capped mushrooms!
Using a waffle chart, you can plainly see the proportions of cap colors in this mushroom dataset. Interestingly, there are many green-capped mushrooms!
![waffle chart](images/waffle.png)
✅ Pywaffle supports icons within the charts that use any icon available in [Font Awesome](https://fontawesome.com/). Do some experiments to create an even more interesting waffle chart using icons instead of squares.
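If you want to try that, a minimal PyWaffle sketch might look like the one below (the icon name and the counts are assumptions; PyWaffle pulls named icons from Font Awesome):

```python
import matplotlib.pyplot as plt
from pywaffle import Waffle

cap_colors = {"brown": 48, "gray": 30, "red": 22}   # illustrative counts summing to 100

fig = plt.figure(
    FigureClass=Waffle,
    rows=5,
    columns=20,                 # 5 x 20 = 100 blocks, one per unit in the values above
    values=cap_colors,
    icons="circle",             # swap in any Font Awesome icon name
    font_size=12,
    legend={"loc": "upper left", "bbox_to_anchor": (1, 1)},
)
plt.show()
```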
In this lesson you learned three ways to visualize proportions. First, you need to group your data into categories and then decide which is the best way to display the data - pie, donut, or waffle. All are delicious and gratify the user with an instant snapshot of a dataset.
In this lesson, you learned three ways to visualize proportions. First, you need to group your data into categories and then decide which is the best way to display the data - pie, donut, or waffle. All are delicious and gratify the user with an instant snapshot of a dataset.
## 🚀 Challenge

@ -2,10 +2,10 @@
## Instructions
Did you know you can create donut, pie and waffle charts in Excel? Using a dataset of your choice, create these three charts right in an Excel spreadsheet
Did you know you can create donut, pie, and waffle charts in Excel? Using a dataset of your choice, create these three charts right in an Excel spreadsheet.
## Rubric
| Exemplary | Adequate | Needs Improvement |
| ------------------------------------------------------- | ------------------------------------------------- | ------------------------------------------------------ |
| An Excel spreadsheet is presented with all three charts | An Excel spreadsheet is presented with two charts | An Excel spreadsheet is presented with only one charts |
| An Excel spreadsheet is presented with all three charts | An Excel spreadsheet is presented with two charts | An Excel spreadsheet is presented with only one chart |

@ -1,5 +1,9 @@
# Visualizing Relationships: All About Honey 🍯
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/12-visualizing-relationships.png)|
|:---:|
|Visualizing Relationships - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Continuing with the nature focus of our research, let's discover interesting visualizations to show the relationships between various types of honey, according to a dataset derived from the [United States Department of Agriculture](https://www.nass.usda.gov/About_NASS/index.php).
This dataset of about 600 items displays honey production in many U.S. states. So, for example, you can look at the number of colonies, yield per colony, total production, stocks, price per pound, and value of the honey produced in a given state from 1998-2012, with one row per year for each state.

@ -1,5 +1,9 @@
# Making Meaningful Visualizations
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/13-MeaningfulViz.png)|
|:---:|
| Meaningful Visualizations - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
> "If you torture the data long enough, it will confess to anything" -- [Ronald Coase](https://en.wikiquote.org/wiki/Ronald_Coase)
One of the basic skills of a data scientist is the ability to create a meaningful data visualization that helps answer questions you might have. Prior to visualizing your data, you need to ensure that it has been cleaned and prepared, as you did in prior lessons. After that, you can start deciding how best to present the data.
@ -19,7 +23,6 @@ In this lesson, you will review:
In previous lessons, you experimented with building all kinds of interesting data visualizations using Matplotlib and Seaborn for charting. In general, you can select the [right kind of chart](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/) for the question you are asking using this table:
| You need to: | You should use: |
| -------------------------- | ------------------------------- |
| Show data trends over time | Line |
@ -30,11 +33,12 @@ In previous lessons, you experimented with building all kinds of interesting dat
| Show proportions | Pie, Donut, Waffle |
> ✅ Depending on the makeup of your data, you might need to convert it from text to numeric to get a given chart to support it.
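For instance (a generic sketch, not tied to a specific lesson dataset), converting a text column into something numeric before charting might look like this:

```python
import pandas as pd

df = pd.DataFrame({"habitat": ["Woods", "Urban", "Woods", "Meadows"]})

# Option 1: count category frequencies, which are numeric and chart-ready.
counts = df["habitat"].value_counts()

# Option 2: encode the categories themselves as integer codes.
df["habitat_code"] = df["habitat"].astype("category").cat.codes

print(counts)
print(df)
```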
## Avoid deception
Even if a data scientist is careful to choose the right chart for the right data, there are plenty of ways that data can be displayed to prove a point, often at the cost of undermining the data itself. There are many examples of deceptive charts and infographics!
[![Deceptive Charts by Alberto Cairo](./images/tornado.png)](https://www.youtube.com/Low28hx4wyk "Deceptive charts")
[![How Charts Lie by Alberto Cairo](./images/tornado.png)](https://www.youtube.com/watch?v=oX74Nge8Wkw "How charts lie")
> 🎥 Click the image above for a conference talk about deceptive charts
@ -50,17 +54,17 @@ This notorious example uses color AND a flipped Y axis to deceive: instead of co
![bad chart 3](images/bad-chart-3.jpg)
This strange chart shows how proportion can be manipulated, to hilarious effect:
This strange chart shows how proportion can be manipulated, to hilarious effect:
![bad chart 4](images/bad-chart-4.jpg)
Comparing the incomparable is yet another shady trick. There is a [wonderful web site](https://tylervigen.com/spurious-correlations) all about 'spurious correlations' displaying 'facts' correlating things like the divorce rate in Maine and the consumption of margarine. A Reddit group also collects the [ugly uses](https://www.reddit.com/r/dataisugly/top/?t=all) of data.
It's important to understand how easily the eye can be fooled by deceptive charts. Even if the data scientist's intention is good, the choice of a bad type of chart, such as a pie chart showing too many categories, can be deceptive.
It's important to understand how easily the eye can be fooled by deceptive charts. Even if the data scientist's intention is good, the choice of a bad type of chart, such as a pie chart showing too many categories, can be deceptive.
## Color
You saw in the 'Florida gun violence' chart above how color can provide an additional layer of meaning to charts, especially ones not designed using libraries such as Matplotlib and Seaborn which come with various vetted color libraries and palettes. If you are making a chart by hand, do a little study of [color theory](https://colormatters.com/color-and-design/basic-color-theory)
You saw in the 'Florida gun violence' chart above how color can provide an additional layer of meaning to charts, especially ones not designed using libraries such as Matplotlib and Seaborn which come with various vetted color libraries and palettes. If you are making a chart by hand, do a little study of [color theory](https://colormatters.com/color-and-design/basic-color-theory)
> ✅ Be aware, when designing charts, that accessibility is an important aspect of visualization. Some of your users might be color blind - does your chart display well for users with visual impairments?
@ -78,18 +82,20 @@ While [color meaning](https://colormatters.com/color-symbolism/the-meanings-of-c
| orange | vibrance |
If you are tasked with building a chart with custom colors, ensure that your charts are both accessible and that the colors you choose coincide with the meaning you are trying to convey.
## Styling your charts for readability
Charts are not meaningful if they are not readable! Take a moment to consider styling the width and height of your chart to scale well with your data. If one variable (such as all 50 states) needs to be displayed, show the values vertically on the Y axis if possible, so as to avoid a horizontally-scrolling chart.
Label your axes, provide a legend if necessary, and offer tooltips for better comprehension of data.
If your data is textual and verbose on the X-axis, you can angle the text for better readability. [Matplotlib](https://matplotlib.org/stable/tutorials/toolkits/mplot3d.html) offers 3d plotting, if your data supports it. Sophisticated data visualizations can be produced using `mpl_toolkits.mplot3d`.
If your data is textual and verbose on the X axis, you can angle the text for better readability. [Matplotlib](https://matplotlib.org/stable/tutorials/toolkits/mplot3d.html) offers 3d plotting, if your data supports it. Sophisticated data visualizations can be produced using `mpl_toolkits.mplot3d`.
![3d plots](images/3d.png)
## Animation and 3D chart display
Some of the best data visualizations today are animated. Shirley Wu has amazing ones done with D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/)', where each flower is a visualization of a movie. Another example for the Guardian is 'bussed out', an interactive experience combining visualizations with Greensock and D3 plus a scrollytelling article format to show how NYC handles its homeless problem by busing people out of the city.
Some of the best data visualizations today are animated. Shirley Wu has amazing ones done with D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/)', where each flower is a visualization of a movie. Another example for the Guardian is 'bussed out', an interactive experience combining visualizations with Greensock and D3 plus a scrollytelling article format to show how NYC handles its homeless problem by bussing people out of the city.
![busing](images/busing.png)
@ -102,14 +108,15 @@ While this lesson is insufficient to go into depth to teach these powerful visua
You will complete a web app that will display an animated view of this social network. It uses a library that was built to create a [visual of a network](https://github.com/emiliorizzo/vue-d3-network) using Vue.js and D3. When the app is running, you can pull the nodes around on the screen to shuffle the data around.
![liaisons](images/liaisons.png)
## Project: Build a chart to show a network using D3.js
> This lesson folder includes a `solution` folder where you can find the completed project, for your reference.
1. Follow the instructions in the README.md file in the starter folder's root. Make sure you have NPM and Node.js running on your machine before installing your project's dependencies.
2. Open the `starter/src` folder. You'll discover an `assets` folder where you can find a .json file with all the letters from the novel, numbered, with a 'to' and 'from' annotation.
2. Open the `starter/src` folder. You'll discover an `assets` folder where you can find a .json file with all the letters from the novel, numbered, with a 'to' and 'from' annotation.
3. Complete the code in `components/Nodes.vue` to enable the visualization. Look for the method called `createLinks()` and add the following nested loop.
Loop through the .json object to capture the 'to' and 'from' data for the letters and build up the `links` object so that the visualization library can consume it:
@ -130,11 +137,14 @@ Loop through the .json object to capture the 'to' and 'from' data for the letter
}
this.links.push({ sid: f, tid: t });
}
```
```
Run your app from the terminal (`npm run serve`) and enjoy the visualization!
## 🚀 Challenge
Take a tour of the internet to discover deceptive visualizations. How does the author fool the user, and is it intentional? Try correcting the visualizations to show how they should look.
## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/25)
## Review & Self Study
@ -149,9 +159,10 @@ Take a look at these interesting visualizations for historical assets and artifacts
https://handbook.pubpub.org/
Look through this article on how animation can enhance your visualizations
Look through this article on how animation can enhance your visualizations:
https://medium.com/@EvanSinar/use-animation-to-supercharge-data-visualization-cd905a882ad4
## Assignment
[Build your own custom vis](assignment.md)
[Build your own custom visualization](assignment.md)

@ -1,5 +1,13 @@
# Introduction to the Data Science Lifecycle
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/14-DataScience-Lifecycle.png)|
|:---:|
| Introduction to the Data Science Lifecycle - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## Pre-Lecture Quiz
[Pre-lecture quiz]()
At this point you've probably come to the realization that data science is a process. This process can be broken down into 5 stages:
- Capturing
@ -61,9 +69,6 @@ On premise refers to hosting managing the data on your own equipment, like ownin
**Cold vs hot data**
When training your models, you may require more training data. Once you're content with your model, more data will arrive for the model to serve its purpose. In any case, the cost of storing and accessing data will increase as you accumulate more of it. Separating rarely used data, known as cold data, from frequently accessed hot data can be a cheaper data storage option through hardware or software services. If cold data needs to be accessed, it may take a little longer to retrieve in comparison to hot data.
Below is an example of the cost of owning an Azure Storage Account
[screenshot of Azure cost calculator]
### Managing Data
As you work with data you may discover that some of the data needs to be cleaned using some of the techniques covered in the lesson focused on [data preparation](2-Working-With-Data/08-data-preparation) to build accurate models. When new data arrives, it will need some of the same applications to maintain consistency in quality. Some projects will involve use of an automated tool for cleansing, aggregation, and compression before the data is moved to its final location. Azure Data Factory is an example of one of these tools.
@ -77,18 +82,27 @@ Here are some things that a team may do with security in mind:
- Let only certain project members alter the data
## Pre-Lecture Quiz
## 🚀 Challenge
[Pre-lecture quiz]()
There are many versions of the Data Science Lifecycle, where each step may have different names and a different number of stages, but will contain the same processes mentioned within this lesson.
## 🚀 Challenge
Explore the [Team Data Science Process lifecycle](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle) and the [Cross-industry standard process for data mining](https://www.datascience-pm.com/crisp-dm-2/). Name 3 similarities and differences between the two.
|Team Data Science Process (TDSP)|Cross-industry standard process for data mining (CRISP-DM)|
|--|--|
|![](../images/tdsp-lifecycle2.png) > Photo by [Microsoft](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle)| ![](../images/CRISP-DM.png) > Photo by [Data Science Process Alliance](https://www.datascience-pm.com/crisp-dm-2/)|
## Post-Lecture Quiz
## Post-Lecture Quiz
[Post-lecture quiz]()
## Review & Self Study
Applying the Data Science Lifecycle involves multiple roles and tasks, where some may focus on particular parts of each stage. The Team Data Science Process provides a few resources that explain the types of roles and tasks that someone may have in a project.
* [Team Data Science Process roles and tasks](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/roles-tasks)
* [Execute data science tasks: exploration, modeling, and deployment](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/execute-data-science-tasks)
## Assignment
[Assignment Title](assignment.md)
[Exploring and Assessing a Dataset](assignment.md)

@ -1,7 +1,19 @@
# Title
# Exploring and Assessing a Dataset
A client has approached your team for help in investigating a taxi customer's seasonal spending habits in New York City.
They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
Your team is in the [Capturing](Readme.md#Capturing) stage of the Data Science Lifecycle and you are in charge of exploring the dataset. You have been provided a notebook and data from Azure Open Datasets to explore and assess if the data can answer the client's question. You have decided to select a small sample of 1 summer month and 1 winter month in the year 2019.
## Instructions
In this directory is a [notebook](notebook.ipynb) that uses Python to load yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets) for the months of January and July 2019. These datasets have been joined together in a Pandas dataframe.
Your task is to identify the columns that are most likely to be required to answer this question, then reorganize the joined dataset so that these columns are displayed first.
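One generic way to move your chosen columns to the front of a DataFrame is shown below; the toy DataFrame and the column names are placeholders, not the answer to the assignment:

```python
import pandas as pd

# Toy stand-in for the joined taxi DataFrame from the provided notebook.
df = pd.DataFrame({"vendor_id": [1, 2], "tip_amount": [2.5, 0.0], "passenger_count": [1, 3]})

first_columns = ["tip_amount"]                          # the columns you identified (placeholder)
remaining = [c for c in df.columns if c not in first_columns]
df = df[first_columns + remaining]                      # chosen columns now come first
print(df.head())
```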
Finally, write 3 questions that you would ask the client for more clarification and better understanding of the problem.
## Rubric
Exemplary | Adequate | Needs Improvement

@ -0,0 +1,76 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"Copyright (c) Microsoft Corporation. All rights reserved.\r\n",
"\r\n",
"Licensed under the MIT License."
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"# Exploring NYC Taxi data in Winter and Summer\r\n",
"\r\n",
"Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to explore the columns that have been provided.\r\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"!pip install pandas"
],
"outputs": [],
"metadata": {
"scrolled": true
}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"import pandas as pd\r\n",
"import glob\r\n",
"\r\n",
"path = '../../data/Taxi/yellow_tripdata_2019-{}.csv'\r\n",
"july_taxi = pd.read_csv(path.format('07'))\r\n",
"january_taxi = pd.read_csv(path.format('01'))\r\n",
"\r\n",
"df = pd.concat([january_taxi, july_taxi])\r\n",
"\r\n",
"print(df)"
],
"outputs": [],
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.9.7 64-bit ('venv': venv)"
},
"language_info": {
"mimetype": "text/x-python",
"name": "python",
"pygments_lexer": "ipython3",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"version": "3.9.7",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"name": "04-nyc-taxi-join-weather-in-pandas",
"notebookId": 1709144033725344,
"interpreter": {
"hash": "6b9b57232c4b57163d057191678da2030059e733b8becc68f245de5a75abe84e"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -1,7 +1,17 @@
# Title
# Analyzing for answers
This assignment continues the process of the Data Science Lifecycle.
They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
Your team is in the [Analyzing](Readme.md) stage of the Data Science Lifecycle. You have been provided a notebook and data from Azure Open Datasets to explore. For summer you choose June, July, and August, and for winter you choose January, February, and December.
## Instructions
In this directory is a [notebook](notebook.ipynb) that uses Python to load 6 months of yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets) and Integrated Surface Data from NOAA. These datasets have been joined together in a Pandas dataframe.
Your task is to ___
## Rubric
Exemplary | Adequate | Needs Improvement

@ -0,0 +1,25 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"source": [
"# print(pd.read_csv('../../data/Taxi/yellow_tripdata_2019-01.csv'))\r\n",
"# all_files = glob.glob('../../data/Taxi/*.csv')\r\n",
"\r\n",
"# df = pd.concat((pd.read_csv(f) for f in all_files))\r\n",
"# print(df)"
],
"outputs": [],
"metadata": {}
}
],
"metadata": {
"orig_nbformat": 4,
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -1,5 +1,9 @@
# The Data Science Lifecycle: Communication
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/16-Communicating.png)|
|:---:|
| Data Science Lifecycle: Communication - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/30)
Test your knowledge of what's to come with the Pre-Lecture Quiz above!

@ -8,7 +8,7 @@ In these lessons, you'll explore some of the aspects of the Data Science lifeycl
1. [Introduction](14-Introduction/README.md)
2. [Analyzing](15-Analyzing/README.md)
3. [Communication](16-Communication/README.md)
3. [Communication](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/4-Data-Science-Lifecycle/16-communication)
### Credits


@ -1,4 +1,10 @@
# Data Science in the Cloud: in progress
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/17-DataScience-Cloud.png)|
|:---:|
| Data Science In The Cloud: Introduction - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
In this lesson, you will learn the fundamental principles of the Cloud, see why it can be interesting for you to use Cloud services to run your data science projects, and look at some examples of data science projects run in the Cloud.
@ -67,7 +73,7 @@ The steps necessary to create this project are as follows:
* Create an output sink and specify the job output
* Start the job
To view the full process, check out the [documentation](https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends).
To view the full process, check out the [documentation](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?WT.mc_id=academic-40229-cxa&ocid=AID30411099).
### Scientific papers analysis
@ -76,9 +82,9 @@ Lets take another example of a project created by [Dmitry Soshnikov](http://s
Dmitry created a tool that analyses COVID papers. By reviewing this project, you will see how you can create a tool that extracts knowledge from scientific papers, gains insights and helps researchers navigate through large collections of papers in an efficient way.
Let's see the different steps used for this:
* Extracting and pre-processing information with [Text Analytics for Health](https://docs.microsoft.com/en-us/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health?WT.mc_id=academic-40229-cxa)
* Using [Azure ML](https://azure.microsoft.com/services/machine-learning/?WT.mc_id=academic-40229-cxa) to parallelize the processing
* Storing and querying information with [Cosmos DB](https://azure.microsoft.com/services/cosmos-db/?WT.mc_id=academic-40229-cxa)
* Extracting and pre-processing information with [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health?WT.mc_id=academic-40229-cxa&ocid=AID3041109)
* Using [Azure ML](https://azure.microsoft.com/services/machine-learning?WT.mc_id=academic-40229-cxa&ocid=AID3041109) to parallelize the processing
* Storing and querying information with [Cosmos DB](https://azure.microsoft.com/services/cosmos-db?WT.mc_id=academic-40229-cxa&ocid=AID3041109)
* Create an interactive dashboard for data exploration and visualization using Power BI
To see the full process, visit [Dmitry's blog](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/).
@ -89,8 +95,8 @@ As you can see, we can leverage Cloud services in many ways to perform Data Scie
## Footnote
Sources:
* https://azure.microsoft.com/overview/what-is-cloud-computing
* https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends
* https://azure.microsoft.com/overview/what-is-cloud-computing?ocid=AID3041109
* https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?ocid=AID3041109
* https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/
## Post-Lecture Quiz

@ -1,5 +1,9 @@
# Data Science in the Cloud: The "Low code/No code" way
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/18-DataScience-Cloud.png)|
|:---:|
| Data Science In The Cloud: Low Code - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Table of contents:
- [Data Science in the Cloud: The "Low code/No code" way](#data-science-in-the-cloud-the-low-codeno-code-way)
@ -47,29 +51,29 @@ Table of contents:
### 1.1 What is Azure Machine Learning?
The Azure cloud platform is more than 200 products and cloud services designed to help you bring new solutions to life.
Data scientists expend a lot of effort exploring and pre-processing data, and trying various types of model-training algorithms to produce accurate models, which is time consuming, and often makes inefficient use of expensive compute hardware.
Data scientists expend a lot of effort exploring and pre-processing data, and trying various types of model-training algorithms to produce accurate models. These tasks are time consuming, and often make inefficient use of expensive compute hardware.
[Azure ML](https://docs.microsoft.com/EN-US/azure/machine-learning/overview-what-is-azure-machine-learning) is a cloud-based platform for building and operating machine learning solutions in Azure. It includes a wide range of features and capabilities that help data scientists prepare data, train models, publish predictive services, and monitor their usage. Most importantly, it helps data scientists increase their efficiency by automating many of the time-consuming tasks associated with training models; and it enables them to use cloud-based compute resources that scale effectively to handle large volumes of data while incurring costs only when actually used.
[Azure ML](https://docs.microsoft.com/azure/machine-learning/overview-what-is-azure-machine-learning?WT.mc_id=academic-40229-cxa&ocid=AID3041109) is a cloud-based platform for building and operating machine learning solutions in Azure. It includes a wide range of features and capabilities that help data scientists prepare data, train models, publish predictive services, and monitor their usage. Most importantly, it helps them to increase their efficiency by automating many of the time-consuming tasks associated with training models; and it enables them to use cloud-based compute resources that scale effectively, to handle large volumes of data while incurring costs only when actually used.
Azure ML provides all the tools developers and data scientists need for their machine learning workflows, including:
Azure ML provides all the tools developers and data scientists need for their machine learning workflows. These include:
- **Azure Machine Learning Studio** is a web portal in Azure Machine Learning for low-code and no-code options for model training, deployment, automation, tracking and asset management. The studio integrates with the Azure Machine Learning SDK for a seamless experience.
- **Jupyter Notebooks** to quickly prototype and test ML models
- **Azure Machine Learning Designer** allows to drag-n-drop modules to build experiments and then deploy pipelines in a low-code environment.
- **Automated machine learning UI (AutoML)** automates iterative tasks of machine learning model development allowing to build ML models with high scale, efficiency, and productivity all while sustaining model quality.
- **Data labeling**: an assisted ML tool to automatically label data.
- **Machine learning extension for Visual Studio Code** provides a full-featured development environment for building and managing ML projects.
- **Machine learning CLI** provides commands for managing Azure ML resources from the command line.
- **Integration with open-source frameworks** such as PyTorch, TensorFlow, and scikit-learn and many more for training, deploying, and managing the end-to-end machine learning process.
- **MLflow** is an open-source library for managing the life cycle of your machine learning experiments. **MLFlow Tracking** is a component of MLflow that logs and tracks your training run metrics and model artifacts, no matter your experiment's environment.
- **Azure Machine Learning Studio**: it is a web portal in Azure Machine Learning for low-code and no-code options for model training, deployment, automation, tracking and asset management. The studio integrates with the Azure Machine Learning SDK for a seamless experience.
- **Jupyter Notebooks**: quickly prototype and test ML models.
- **Azure Machine Learning Designer**: allows you to drag and drop modules to build experiments and then deploy pipelines in a low-code environment.
- **Automated machine learning UI (AutoML)**: automates iterative tasks of machine learning model development, allowing you to build ML models with high scale, efficiency, and productivity, all while sustaining model quality.
- **Data Labelling**: an assisted ML tool to automatically label data.
- **Machine learning extension for Visual Studio Code**: provides a full-featured development environment for building and managing ML projects.
- **Machine learning CLI**: provides commands for managing Azure ML resources from the command line.
- **Integration with open-source frameworks** such as PyTorch, TensorFlow, scikit-learn and many more for training, deploying, and managing the end-to-end machine learning process.
- **MLflow**: It is an open-source library for managing the life cycle of your machine learning experiments. **MLFlow Tracking** is a component of MLflow that logs and tracks your training run metrics and model artifacts, irrespective of your experiment's environment.
### 1.2 The Heart Failure Prediction Project
### 1.2 The Heart Failure Prediction Project:
What better way to learn than actually doing a project! In this lesson, we are going to explore two different ways of building a data science project for the prediction of heart failure attacks in Azure ML Studio, through Low code/No code and through the Azure ML SDK as shown in the following schema.
There is no doubt that making and building projects is the best way to put your skills and knowledge to the test. In this lesson, we are going to explore two different ways of building a data science project for the prediction of heart failure attacks in Azure ML Studio, through Low code/No code and through the Azure ML SDK, as shown in the following schema:
![project-schema](img/project-schema.PNG)
Both ways has its pro and cons. The Low code/No code way is easier to start with because it is mostly interacting with a GUI (Graphical User Interface) without knowledge of code required. This method is great at the beginning of a project to quickly test if a project is viable and to create POC (Proof Of Concept). However, once a project grows and things need to be production ready, it is not maintainable to create resources by hand through the GUI. We need to programmatically automate everything, from the creation of resources, to the deployment of a model. This is where knowing how to use the Azure ML SDK is critical.
Each way has its own pros and cons. The Low code/No code way is easier to start with as it involves interacting with a GUI (Graphical User Interface), with no prior knowledge of code required. This method enables quick testing of the project's viability and the creation of a POC (Proof Of Concept). However, as the project grows and things need to be production ready, it is not feasible to create resources by hand through the GUI. We need to programmatically automate everything, from the creation of resources to the deployment of a model. This is where knowing how to use the Azure ML SDK becomes crucial.
| | Low code/No code | Azure ML SDK |
|-------------------|------------------|---------------------------|
@ -77,34 +81,35 @@ Both ways has its pro and cons. The Low code/No code way is easier to start with
| Time to develop | Fast and easy | Depends on code expertise |
| Production ready | No | Yes |
### 1.3 The Heart Failure Dataset
### 1.3 The Heart Failure Dataset:
Cardiovascular diseases (CVDs) are the number 1 cause of death globally, accounting for 31% of all deaths worldwide. Environmental and behavioural risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol could be used as features for estimation models. Being able to estimate the probability of developing a CVD would be of great use to prevent heart attacks in high-risk people.
Kaggle has made a [Heart Failure dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data) publicly available, which we are going to use for this project. You can download the dataset now. This is a tabular dataset with 13 columns (12 features and 1 target variable) and 299 rows.
|    | Variable name             | Type      | Description                                                | Example |
|----|---------------------------|-----------|------------------------------------------------------------|---------|
| 1  | age                       | numerical | Age of the patient                                         | 25      |
| 2  | anaemia                   | boolean   | Decrease of red blood cells or haemoglobin                 | 0 or 1  |
| 3  | creatinine_phosphokinase  | numerical | Level of the CPK enzyme in the blood                       | 542     |
| 4  | diabetes                  | boolean   | If the patient has diabetes                                | 0 or 1  |
| 5  | ejection_fraction         | numerical | Percentage of blood leaving the heart on each contraction  | 45      |
| 6  | high_blood_pressure       | boolean   | If the patient has hypertension                            | 0 or 1  |
| 7  | platelets                 | numerical | Platelets in the blood                                     | 149000  |
| 8  | serum_creatinine          | numerical | Level of serum creatinine in the blood                     | 0.5     |
| 9  | serum_sodium              | numerical | Level of serum sodium in the blood                         | 137     |
| 10 | sex                       | boolean   | Woman or man                                               | 0 or 1  |
| 11 | smoking                   | boolean   | If the patient smokes                                      | 0 or 1  |
| 12 | time                      | numerical | Follow-up period (days)                                    | 4       |
| 13 | DEATH_EVENT [Target]      | boolean   | If the patient died during the follow-up period            | 0 or 1  |
Once you have the dataset, we can start the project in Azure.
## 2. Low code/No code training of a model in Azure ML Studio
### 2.1 Create an Azure ML workspace
To train a model in Azure ML, you first need to create an Azure ML workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. The workspace keeps a history of all training runs, including logs, metrics, output, and a snapshot of your scripts. You use this information to determine which training run produces the best model. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-workspace?WT.mc_id=academic-40229-cxa&ocid=AID3041109)
It is recommended to use the most up-to-date browser that's compatible with your operating system. The following browsers are supported:
To use Azure Machine Learning, create a workspace in your Azure subscription. You can then use this workspace to manage data, compute resources, code, models, and other artifacts related to your machine learning workloads.
> **_NOTE:_** Your Azure subscription will be charged a small amount for data storage as long as the Azure Machine Learning workspace exists in your subscription, so we recommend you delete the Azure Machine Learning workspace when you are no longer using it.
1. Sign into the [Azure portal](https://ms.portal.azure.com/) using the Microsoft credentials associated with your Azure subscription.
2. Select **Create a resource**
![workspace-3](img/workspace-3.PNG)
Fill in the settings as follows:
- Subscription: Your Azure subscription
- Resource group: Create or select a resource group
- Workspace name: Enter a unique name for your workspace
![workspace-4](img/workspace-4.PNG)
- Click on the "Review + create" button, and then on the "Create" button
3. Wait for your workspace to be created (this can take a few minutes). Then go to it in the portal. You can find it through the Azure Machine Learning service.
4. On the Overview page for your workspace, launch Azure Machine Learning studio (or open a new browser tab and navigate to https://ml.azure.com), and sign into Azure Machine Learning studio using your Microsoft account. If prompted, select your Azure directory and subscription, and your Azure Machine Learning workspace.
![workspace-5](img/workspace-5.PNG)
![workspace-6](img/workspace-6.PNG)
You can manage your workspace using the Azure portal, but for data scientists and Machine Learning operations engineers, Azure Machine Learning Studio provides a more focused user interface for managing workspace resources.
### 2.2 Compute Resources
There are some key factors to consider when creating a compute resource, and those choices can be critical decisions.
**Do you need CPU or GPU?**
A CPU (Central Processing Unit) is the electronic circuitry that executes the instructions comprising a computer program. A GPU (Graphics Processing Unit) is a specialized electronic circuit that can execute graphics-related code at a very high rate.
The main difference between CPU and GPU architecture is that a CPU is designed to handle a wide range of tasks quickly (as measured by CPU clock speed), but is limited in the concurrency of tasks that can be running. GPUs are designed for parallel computing and are therefore much better at deep learning tasks.
**Cluster Size**
Larger clusters are more expensive but will result in better responsiveness. Therefore, if you have time but not much money, you should start with a small cluster. Conversely, if you have money but not much time, you should start with a larger cluster.
**VM Size**
Depending on your time and budgetary constraints, you can vary the size of your RAM and disk, the number of cores and the clock speed. Increasing all those parameters will be more expensive, but will result in better performance.
**Dedicated or Low-Priority Instances?**

This is another consideration of time versus money: low-priority instances are cheaper than dedicated ones, but the computations may be interrupted.
#### 2.2.2 Creating a compute cluster
In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, go to Compute and you will be able to see the different compute resources we just discussed (i.e. compute instances, compute clusters, inference clusters and attached compute). For this project, we are going to need a compute cluster for model training. In the Studio, click on the "Compute" menu, then the "Compute cluster" tab, and click on the "+ New" button to create a compute cluster.
![22](img/cluster-1.PNG)
1. Choose your options: Dedicated vs Low priority, CPU or GPU, VM size and core number (you can keep the default settings for this project).
2. Click on the Next button.
![23](img/cluster-2.PNG)
3. Give the cluster a compute name
4. Choose your options: Minimum/Maximum number of nodes, Idle seconds before scale down, SSH access. Note that if the minimum number of nodes is 0, you will save money when the cluster is idle. Note that the higher the number of maximum nodes, the shorter the training will be. The recommended maximum number of nodes is 3.
5. Click on the "Create" button. This step may take a few minutes.
![29](img/cluster-3.PNG)
Awesome! Now that we have a compute cluster, we need to load the data into Azure Machine Learning Studio.
### 2.3 Loading the Dataset
1. In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, click on "Datasets" in the left menu, then click on the "+ Create dataset" button to create a dataset. Choose the "From local files" option and select the Kaggle dataset we downloaded earlier.
![24](img/dataset-1.PNG)
![26](img/dataset-3.PNG)
Great! Now that the dataset is in place and the compute cluster is created, we can start training the model!
### 2.4 Low code/No Code training with AutoML
Traditional machine learning model development is resource-intensive, requiring significant domain knowledge and time to produce and compare dozens of models.

Automated machine learning (AutoML) is the process of automating the time-consuming, iterative tasks of machine learning model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity, all while sustaining model quality. It reduces the time it takes to get production-ready ML models, with great ease and efficiency. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml?WT.mc_id=academic-40229-cxa&ocid=AID3041109)
1. In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, click on "Automated ML" in the left menu and select the dataset you just uploaded. Click Next.
![28](img/aml-2.PNG)
3. Choose "Classification" and click Finish. This step might take between 30 minutes and 1 hour, depending on your compute cluster size.
![30](img/aml-3.PNG)
![31](img/aml-4.PNG)
Here you can see a detailed description of the best model that AutoML generated. You can also explore the other models generated in the Models tab. Take a few minutes to explore them with the Explanations (preview) button. Once you have chosen the model you want to use (here we will choose the best model selected by AutoML), we will see how to deploy it.
## 3. Low code/No Code model deployment and endpoint consumption
### 3.1 Model deployment
The automated machine learning interface allows you to deploy the best model as a web service in a few steps. Deployment is the integration of the model so that it can make predictions on new data and identify potential areas of opportunity. For this project, deployment to a web service means that medical applications will be able to consume the model to make live predictions of their patients' risk of a heart attack.
In the best model description, click on the "Deploy" button.
![deploy-2](img/deploy-2.PNG)
16. Once it has been deployed, click on the Endpoint tab and then on the endpoint you just deployed. Here you can find all the details you need to know about the endpoint.
![deploy-3](img/deploy-3.PNG)
The `url` variable is the REST endpoint found in the consume tab, and the `api_key` variable is the primary authentication key, which is also found in the consume tab (when authentication is enabled).
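The consume tab also generates a ready-to-use Python script for you. As a rough sketch of what such a request looks like (the endpoint URL, the key and the payload schema below are illustrative placeholders, not the values generated for your own deployment), consider:

```python
import json
import urllib.request

# Placeholders: copy the real values from the "Consume" tab of your endpoint
url = "http://<your-endpoint>.azurecontainer.io/score"
api_key = "<your-primary-key>"

# One record with every feature left at 0/false, mirroring the auto-generated sample
data = {
    "data": [
        {
            "age": 0,
            "anaemia": False,
            "creatinine_phosphokinase": 0,
            "diabetes": False,
            "ejection_fraction": 0,
            "high_blood_pressure": False,
            "platelets": 0,
            "serum_creatinine": 0,
            "serum_sodium": 0,
            "sex": False,
            "smoking": False,
            "time": 0,
        }
    ]
}

body = str.encode(json.dumps(data))
headers = {"Content-Type": "application/json", "Authorization": "Bearer " + api_key}

request = urllib.request.Request(url, body, headers)
response = urllib.request.urlopen(request)
print(response.read())
```

Running a script like this against the deployed endpoint should print something like: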
```python
b'"{\\"result\\": [true]}"'
```
This means that the prediction of heart failure for the given data is true. This makes sense because, if you look more closely at the data automatically generated in the script, everything is set to 0 and false by default. You can change the data with the following input sample:
```python
data = {
    # ... (the two sample input records go here)
}
```

The script should return:

```python
b'"{\\"result\\": [true, false]}"'
```
Congratulations! You just consumed the model deployed and trained on Azure ML!
> **_NOTE:_** Once you are done with the project, don't forget to delete all the resources.
## 🚀 Challenge
Look more closely at the model explanations and details that AutoML generated for the top models. Try to understand why the best model is better than the other ones. What algorithms were compared? What are the differences between them? Why is the best one performing better in this case?
## Post-Lecture Quiz
2. A compute instance
3. A compute cluster
2. Which of the following tasks are supported by Automated ML?
1. Image generation
2. TRUE : Classification
3. Natural Language generation
## Review & Self Study
In this lesson, you learned how to train, deploy and consume a model to predict heart failure risk in a Low code/No code fashion in the cloud. If you have not done it yet, dive deeper into the model explanations that AutoML generated for the top models and try to understand why the best model is better than the others.
You can go further into Low code/No code AutoML by reading this [documentation](https://docs.microsoft.com/azure/machine-learning/tutorial-first-experiment-automated-ml?WT.mc_id=academic-40229-cxa&ocid=AID3041109).
## Assignment

## Instructions
We saw how to use the Azure ML platform to train, deploy and consume a model in a Low code/No code fashion. Now look around for some data that you could use to train another model, deploy it and consume it. You can look for datasets on [Kaggle](https://kaggle.com) and [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog?WT.mc_id=academic-40229-cxa&ocid=AID3041109).
## Rubric

# Data Science in the Cloud: The "Azure ML SDK" way
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/19-DataScience-Cloud.png)|
|:---:|
| Data Science In The Cloud: Azure ML SDK - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
Table of contents:
- [Data Science in the Cloud: The "Azure ML SDK" way](#data-science-in-the-cloud-the-azure-ml-sdk-way)
Key areas of the SDK include:
- Use automated machine learning, which accepts configuration parameters and training data. It automatically iterates through algorithms and hyperparameter settings to find the best model for running predictions.
- Deploy web services to convert your trained models into RESTful services that can be consumed in any application.
[Learn more about the Azure Machine Learning SDK](https://docs.microsoft.com/en-us/python/api/overview/azure/ml?WT.mc_id=academic-40229-cxa&ocid=AID3041109)
In the [previous lesson](../18-tbd/README.md), we saw how to train, deploy and consume a model in a Low code/No code fashion. We used the Heart Failure dataset to generate a heart failure prediction model. In this lesson, we are going to do the exact same thing, but using the Azure Machine Learning SDK.
Now that we have a Notebook, we can start training the model with the Azure ML SDK.
### 2.5 Training a model
First of all, if you ever have a doubt, refer to the [Azure ML SDK documentation](https://docs.microsoft.com/en-us/python/api/overview/azure/ml?WT.mc_id=academic-40229-cxa&ocid=AID3041109). It contains all the necessary information to understand the modules we are going to see in this lesson.
#### 2.5.1 Setup Workspace, experiment, compute cluster and dataset
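Before looking at the training itself, here is a minimal sketch of what this setup step can look like with the SDK. The experiment, cluster and dataset names below are illustrative assumptions (use the names you chose in your own workspace), and the workspace is loaded from the `config.json` file downloaded from the Azure portal:

```python
from azureml.core import Workspace, Experiment, Dataset
from azureml.core.compute import AmlCompute, ComputeTarget

# Load the workspace from the local config.json
ws = Workspace.from_config()

# Create (or reuse) an experiment to group the training runs
experiment = Experiment(ws, "aml-experiment")

# Provision a compute cluster, or reuse it if it already exists
cluster_name = "heart-failure-cluster"
try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
except Exception:
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_DS3_V2", max_nodes=3)
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)

# Load the dataset registered in the Studio and convert it to a pandas DataFrame
dataset = Dataset.get_by_name(ws, name="heart-failure-records")
df = dataset.to_pandas_dataframe()
```

With the DataFrame in hand, you can take a quick look at the data: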
```python
df.describe()
```
#### 2.5.2 AutoML Configuration and training
To set the AutoML configuration, use the [AutoMLConfig class](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig(class)?WT.mc_id=academic-40229-cxa&ocid=AID3041109).
As described in the doc, there are a lot of parameters you can play with. For this project, we will use the following parameters:
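The snippet below is only an illustrative sketch of such a configuration and of how it is submitted to the experiment; the exact parameter values are assumptions, not the definitive settings for this lesson:

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",              # we predict the boolean DEATH_EVENT column
    primary_metric="AUC_weighted",      # metric used to rank the candidate models
    training_data=dataset,              # the registered heart failure dataset
    label_column_name="DEATH_EVENT",    # target column
    compute_target=compute_target,      # the compute cluster created earlier
    experiment_timeout_minutes=30,      # hard stop for the whole experiment
    n_cross_validations=3,
)

# Submit the AutoML run to the experiment created earlier
remote_run = experiment.submit(automl_config, show_output=False)
```

You can then monitor the progress of the run with the `RunDetails` widget: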
```python
from azureml.widgets import RunDetails
RunDetails(remote_run).show()
```
### 3.1 Saving the best model
The `remote_run` variable is an object of type [AutoMLRun](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?WT.mc_id=academic-40229-cxa&ocid=AID3041109). This object contains the method `get_output()`, which returns the best run and the corresponding fitted model.
```python
best_run, fitted_model = remote_run.get_output()
```
You can see the parameters used for the best model by just printing the `fitted_model`, and see the properties of the best model by using the [get_properties()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py#azureml_core_Run_get_properties) method.
```python
best_run.get_properties()
```
Now register the model with the [register_model](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?view=azure-ml-py#register-model-model-name-none--description-none--tags-none--iteration-none--metric-none-) method.
```python
model_name = best_run.properties['model_name']
script_file_name = 'inference/score.py'
model = best_run.register_model(model_name = model_name)  # additional arguments omitted in this excerpt
```
### 3.2 Model Deployment
Once the best model is saved, we can deploy it with the [InferenceConfig](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model.inferenceconfig?view=azure-ml-py) class. InferenceConfig represents the configuration settings for a custom environment used for deployment. The [AciWebservice](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.webservice.aciwebservice?view=azure-ml-py) class represents a machine learning model deployed as a web service endpoint on Azure Container Instances. A deployed service is created from a model, script, and associated files. The resulting web service is a load-balanced HTTP endpoint with a REST API. You can send data to this API and receive the prediction returned by the model.
The model is deployed using the [deploy](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.model(class)?view=azure-ml-py#deploy-workspace--name--models--inference-config-none--deployment-config-none--deployment-target-none--overwrite-false--show-output-false-) method.
```python
from azureml.core.model import InferenceConfig, Model
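from azureml.core.webservice import AciWebservice

# NOTE: the lines below are an illustrative sketch of the deployment step, not the
# exact code from this lesson. The service name, CPU/memory sizing and the reuse of
# `ws`, `model`, `best_run` and `script_file_name` from previous snippets are assumptions.
inference_config = InferenceConfig(entry_script=script_file_name,
                                   environment=best_run.get_environment())

aci_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(workspace=ws,
                       name="heart-failure-service",
                       models=[model],
                       inference_config=inference_config,
                       deployment_config=aci_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)
print(service.state)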
```

Congratulations! You just consumed the model deployed and trained on Azure ML with the Azure ML SDK!
There are many other things you can do through the SDK; unfortunately, we cannot cover them all in this lesson. But good news: learning how to skim through the SDK documentation can take you a long way on your own. Have a look at the Azure ML SDK documentation and find the `Pipeline` class, which allows you to create pipelines. A Pipeline is a collection of steps that can be executed as a workflow.
**HINT:** Go to the [SDK documentation](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py) and type keywords like "Pipeline" in the search bar. You should find the `azureml.pipeline.core.Pipeline` class in the search results.
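To make the idea concrete, here is a minimal sketch of what a small pipeline could look like; the step names, script names and arguments are illustrative assumptions:

```python
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Two illustrative steps: each one runs a script on the compute cluster
prep_step = PythonScriptStep(name="prepare-data",
                             script_name="prep.py",
                             compute_target=compute_target,
                             source_directory=".")

train_step = PythonScriptStep(name="train-model",
                              script_name="train.py",
                              compute_target=compute_target,
                              source_directory=".")

# Collect the steps into a Pipeline and submit it like any other experiment run
pipeline = Pipeline(workspace=ws, steps=[prep_step, train_step])
pipeline_run = Experiment(ws, "heart-failure-pipeline").submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)
```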
## Post-Lecture Quiz
3. It can be used through a Graphical User Interface
## Review & Self Study
In this lesson, you learned how to train, deploy and consume a model to predict heart failure risk with the Azure ML SDK in the cloud. Check this [documentation](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py) for further information about the Azure ML SDK. Try to create your own model with the Azure ML SDK.
## Assignment

## Instructions
We saw how to use the Azure ML platform to train, deploy and consume a model with the Azure ML SDK. Now look around for some data that you could use to train another model, deploy it and consume it. You can look for datasets on [Kaggle](https://kaggle.com) and [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog?WT.mc_id=academic-40229-cxa&ocid=AID3041109).
## Rubric

"cell_type": "markdown",
"source": [
"## Create a Compute Cluster\n",
"You will need to create a [compute target](https://docs.microsoft.com/en-us/azure/machine-learning/concept-azure-machine-learning-architecture#compute-target) for your AutoML run."
"You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/concept-azure-machine-learning-architecture#compute-target) for your AutoML run."
],
"metadata": {}
},

# Data Science in the Real World
We're almost at the end of this learning journey!
We started with definitions of data science and ethics, explored various tools & techniques for data analysis, reviewed the data science lifecycle, and looked at scaling and automating data science workflows with cloud computing services.
And right now, you're probably wondering: "_How do these lessons translate to real-world contexts?_"
In this lesson, we'll talk about the real-world applications of data science and dive into a select few examples that explore data science in research, sustainability and digital humanities contexts. And we'll conclude with resources to help you continue the learning journey and explore some of these application ideas on your own.
## Where is Data Science Used Today?
Data Science technologies and techniques are finding a home in almost every industry today - thanks in no small part to the democratization of AI, allowing developers to integrate data insights and decision-making intelligence into user experiences and workflows.
Here are some examples of "applied" data science in the real world:
* [Google Flu Trends](https://www.wired.com/2015/10/can-learn-epic-failure-google-flu-trends/) used data science to correlate search terms with flu trends. While the approach had flaws, it raised awareness of the possibilities (and challenges) of data-driven healthcare predictions.
* [UPS Routing Predictions](https://www.technologyreview.com/2018/11/21/139000/how-ups-uses-ai-to-outsmart-bad-weather/) - explains how UPS uses data science and machine learning to predict optimal routes for delivery, taking into account weather conditions, traffic patterns, delivery deadlines and more.
* [NYC Taxicab Route Visualization](http://chriswhong.github.io/nyctaxi/) - data gathered using [Freedom Of Information Laws](https://chriswhong.com/open-data/foil_nyc_taxi/) helped visualize a day in the life of NYC cabs, helping us understand how they navigate the busy city, the money they make, and the duration of trips over each 24-hour period.
* [Uber Data Science Workbench](https://eng.uber.com/dsw/) - uses data (on pickup & dropoff locations, trip duration, preferred routes etc.) gathered from millions of uber trips *daily* to build a data analytics tool to help with pricing, safety, fraud detection and navigation decisions.
* [Sports Analytics](https://towardsdatascience.com/scope-of-analytics-in-sports-world-37ed09c39860) - focuses on _predictive analytics_ (team and player analysis - think [Moneyball](https://datasciencedegree.wisconsin.edu/blog/moneyball-proves-importance-big-data-big-ideas/) - and fan management) and _data visualization_ (team & fan dashboards, games etc.) with applications like talent scouting, sports gambling and inventory/venue management.
* [Data Science in Banking](https://data-flair.training/blogs/data-science-in-banking/) - highlights the value of data science in the finance industry with applications ranging from risk modeling and fraud detection, to customer segmentation, real-time prediction and recommender systems. Predictive analytics also drive critical measures like [credit scores](https://dzone.com/articles/using-big-data-and-predictive-analytics-for-credit).
* [Data Science in Healthcare](https://data-flair.training/blogs/data-science-in-healthcare/) - highlights applications like medical imaging (e.g., MRI, X-Ray, CT-Scan), genomics (DNA sequencing), drug development (risk assessment, success prediction), predictive analytics (patient care & supply logistics), disease tracking & prevention etc.
![Data Science Applications in The Real World](data-science-applications.png) Image Credit: [Data Flair: 6 Amazing Data Science Applications ](https://data-flair.training/blogs/data-science-applications/)
There are many other application domains to consider (see the image above as one example) - check out the [Review & Self Study](?id=review-amp-self-study) section for some relevant resources. For now, let's take a slightly deeper look at a few interesting examples in the following sections.
## Research: Gender Shades Study
Researchers are often the earliest members of the technical community to explore real-world applications for big data algorithms and applied AI. The focus is often on both _exploring opportunities_ to do good and _uncovering challenges_ that lead to potential harms or unintended consequences.
Let's talk about one example - the [Gender Shades](http://gendershades.org/overview.html) project from MIT, one of the earliest to explore data ethics topics like fairness and bias, to highlight the need for more transparency in algorithm design and AI, and demand more inclusive testing of products.
The project evaluated the accuracy of AI-powered _gender classification_ products (from companies like IBM, Microsoft and Face++) using a dataset of 1270 images (from African and European countries) as the benchmark. While overall accuracy of classification was high for all products, the study identified non-trivial differences in the error rates _between different groups of users_, with misgendering being higher for female subjects or those with darker skin.
The study had broader implications for facial analysis algorithms as a whole, highlighting the potential for individual and social harms when used in contexts like law enforcement or hiring. Many organizations have since created _responsible AI_ principles and practices to improve the fairness of AI systems.
**Want to learn about relevant research efforts in Microsoft?**
* Check out these [Microsoft Research Projects](https://www.microsoft.com/research/research-area/artificial-intelligence/?facet%5Btax%5D%5Bmsr-research-area%5D%5B%5D=13556&facet%5Btax%5D%5Bmsr-content-type%5D%5B%5D=msr-project)
* Explore student projects and coursework from the [Microsoft Research Data Science Summer School](https://www.microsoft.com/en-us/research/academic-program/data-science-summer-school/).
* Check out the [Fairlearn](https://fairlearn.org/) open-source, community-driven effort to improve fairness in AI systems.
## Digital Humanities: Poetics
Digital Humanities [has been defined](https://digitalhumanities.stanford.edu/about-dh-stanford) as "a collection of practices and approaches combining computational methods with humanistic inquiry". [Stanford projects](https://digitalhumanities.stanford.edu/projects) like _"rebooting history"_ and _"poetic thinking"_ illustrate the linkage between [Digital Humanities and Data Science](https://digitalhumanities.stanford.edu/digital-humanities-and-data-science) - emphasizing techniques like network analysis, information visualization, spatial and text analysis that can help us revisit historical and literary data sets to derive new insights and perspective.
*Want to explore and extend a project in this space?*
Check out ["Emily Dickinson and the Meter of Mood"](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671) - a great example from [Jen Looper](https://twitter.com/jenlooper) that asks how we can use data science to revisit familiar poetry and re-evaluate its meaning and the contributions of its author in new contexts. For instance, _can we predict the year in which a poem was authored by analyzing its tone or sentiment_ - and what does this tell us about the author's state of mind over the relevant period?
To answer that question, we follow the steps of our data science lifecycle:
* [`Data Acquisition`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#acquiring-the-dataset) - to collect a relevant dataset for analysis. Options include using an API (e.g., [Poetry DB API](https://poetrydb.org/index.html)) or scraping web pages (e.g., [Project Gutenberg](https://www.gutenberg.org/files/12242/12242-h/12242-h.htm)) using tools like [Scrapy](https://scrapy.org/).
* [`Data Cleaning`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#clean-the-data) - explains how text can be formatted, sanitized and simplified using basic tools like Visual Studio Code and Microsoft Excel.
* [`Data Analysis`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#working-with-the-data-in-a-notebook) - explains how we can now import the dataset into "Notebooks" for analysis using Python packages (like pandas, numpy and matplotlib) to organize and visualize the data.
* [`Sentiment Analysis`](https://gist.github.com/jlooper/ce4d102efd057137bc000db796bfd671#sentiment-analysis-using-cognitive-services) - explains how we can integrate cloud services like Text Analytics, using low-code tools like [Power Automate](https://flow.microsoft.com/en-us/) for automated data processing workflows.
Using this workflow, we can explore the seasonal impacts on the sentiment of the poems, which helps us fashion our own perspectives on the author. Try it out yourself - then extend the notebook to ask other questions or visualize the data in new ways!
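As a small taste of the acquisition and analysis steps, here is a minimal sketch that pulls Emily Dickinson's poems from the public PoetryDB API into a pandas DataFrame. The endpoint path and field names follow the public PoetryDB documentation and are assumptions on our part, not part of the original notebook:

```python
import pandas as pd
import requests

# Fetch all poems by Emily Dickinson from the public PoetryDB API
response = requests.get("https://poetrydb.org/author/Emily Dickinson")
poems = response.json()  # a list of {title, author, lines, linecount} records

# Flatten into a DataFrame for analysis in a notebook
df = pd.DataFrame(
    {
        "title": [p["title"] for p in poems],
        "linecount": [int(p["linecount"]) for p in poems],
        "text": ["\n".join(p["lines"]) for p in poems],
    }
)
print(df.describe())
print(df.head())
```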
## Sustainability: Planetary Data
The [2030 Agenda For Sustainable Development](https://sdgs.un.org/2030agenda) - adopted by all United Nations members in 2015 - identifies 17 goals including ones that focus on **Protecting the Planet** from degradation and the impact of climate change. The [Microsoft Sustainability](https://www.microsoft.com/en-us/sustainability) initiative supports these goals by exploring ways in which technology solutions can support and build more sustainable futures with a [focus on 4 goals](https://dev.to/azure/a-visual-guide-to-sustainable-software-engineering-53hh) - being carbon negative, water positive, zero waste, and bio-diverse by 2030.
Tackling these challenges in a scalable and timely manner requires cloud-scale thinking - and large-scale data. That's where the [Planetary Computer](https://planetarycomputer.microsoft.com/) initiative comes in. It consists of 4 components:
* [Data Catalog](https://planetarycomputer.microsoft.com/catalog) - with petabytes of data on Earth systems, hosted on Azure, available for free.
* [Planetary API](https://planetarycomputer.microsoft.com/docs/reference/stac/) - to help users search for relevant data across space and time.
* [Hub](https://planetarycomputer.microsoft.com/docs/overview/environment/) - a managed environment for scientists to process massive geospatial datasets.
* [Applications](https://planetarycomputer.microsoft.com/applications) - showcasing use cases and tools using this data, for sustainability insights.
Check out [the documentation](https://planetarycomputer.microsoft.com/docs/overview/about) for more details and explore applications like [Ecosystem Monitoring](https://analytics-lab.org/ecosystemmonitoring/) to get ideas for how you can use the data sets to derive useful insights or build applications that can motivate relevant behavioral changes for sustainability.
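If you want to poke at the Data Catalog programmatically, its STAC endpoint can be queried with the open-source `pystac-client` package. The sketch below is illustrative: the collection name, bounding box and date range are arbitrary example values, and the package choice is our assumption rather than part of the lesson:

```python
from pystac_client import Client

# Open the Planetary Computer STAC API (public endpoint)
catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# Search Sentinel-2 imagery over a small bounding box and time range (example values)
search = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-122.3, 47.5, -122.2, 47.6],
    datetime="2021-06-01/2021-09-01",
)

for item in search.get_items():
    print(item.id, item.datetime)
```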
**The Planetary Computer Project is currently in preview (as of Sep 2021)**
Please [request access](https://planetarycomputer.microsoft.com/account/request) to get started with your own exploration and connect with your peers in this space.
## Pre-Lecture Quiz
[Pre-lecture quiz]()
## 🚀 Challenge
## Post-Lecture Quiz
[Post-lecture quiz]()
## Review & Self Study
Want to explore more use cases? Here are a few relevant articles:
* [17 Data Science Applications and Examples](https://builtin.com/data-science/data-science-applications-examples) - Jul 2021
* [11 Breathtaking Data Science Applications in Real World](https://myblindbird.com/data-science-applications-real-world/) - May 2021
* [Data Science In The Real World](https://towardsdatascience.com/data-science-in-the-real-world/home) - Article Collection
* [Data Science In Education](https://data-flair.training/blogs/data-science-in-education/)
* [Data Science In Agriculture](https://data-flair.training/blogs/data-science-in-agriculture/)
* [Data Science in Finance](https://data-flair.training/blogs/data-science-in-finance/)
* [Data Science at the Movies](https://data-flair.training/blogs/data-science-at-movies/)
## Assignment
[Assignment Title](assignment.md)


### Topics
1. [Data Science in the Real World](20-Real-World-Examples/README.md)
### Credits

Azure Cloud Advocates at Microsoft are pleased to offer a 10-week, 20-lesson curriculum all about Data Science.
**Hearty thanks to our authors:** Jasmine Greenaway, Dmitry Soshnikov, Nitya Narasimhan, Jalen McGee, Jen Looper, Maud Levy, Tiffany Souterre, Christopher Harrison.
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](./sketchnotes/00-Title.png)|
|:---:|
| Data Science For Beginners - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
# Getting Started
> **Teachers**, we have [included some suggestions](for-teachers.md) on how to use this curriculum. We'd love your feedback [in our discussion forum](https://github.com/microsoft/Data-Science-For-Beginners/discussions)!
In addition, a low-stakes quiz before a class sets the intention of the student towards learning a topic.
## Each lesson includes:
- Optional sketchnote
- Optional supplemental video
- Pre-lesson warmup quiz
- Written lesson
- For project-based lessons, step-by-step guides on how to build the project
- Knowledge checks
- A challenge
- Supplemental reading
- Assignment
- Post-lesson quiz
> **A note about quizzes**: All quizzes are contained [in this app](https://red-water-0103e7a0f.azurestaticapps.net/), for 40 total quizzes of three questions each. They are linked from within the lessons, but the quiz app can be run locally; follow the instructions in the `quiz-app` folder. They are gradually being localized.
## Lessons
|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](./sketchnotes/00-Roadmap.png)|
|:---:|
| Data Science For Beginners: Roadmap - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
| Lesson Number | Topic | Lesson Grouping | Learning Objectives | Linked Lesson | Author |
| :-----------: | :----------------------------------------: | :--------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :---------------------------------------------------------------------: | :----: |
| 01 | Defining Data Science | [Introduction](1-Introduction/README.md) | Learn the basic concepts behind data science and how it's related to artificial intelligence, machine learning, and big data. | [lesson](1-Introduction/01-defining-data-science/README.md) [video](https://youtu.be/pqqsm5reGvs) | [Dmitry](http://soshnikov.com) |
| 17 | Data Science in the Cloud | [Cloud Data](5-Data-Science-In-Cloud/README.md) | This series of lessons introduces data science in the cloud and its benefits. | [lesson](5-Data-Science-In-Cloud/17-Introduction/README.md) | Tiffany and Maud |
| 18 | Data Science in the Cloud | [Cloud Data](5-Data-Science-In-Cloud/README.md) | Training models using Low Code tools |[lesson](5-Data-Science-In-Cloud/18-Low-Code/README.md) |Tiffany and Maud |
| 19 | Data Science in the Cloud | [Cloud Data](5-Data-Science-In-Cloud/README.md) | Deploying models with Azure Machine Learning Studio | [lesson](5-Data-Science-In-Cloud/19-Azure/README.md)|Tiffany and Maud |
| 20 | Data Science in the Wild | [In the Wild](6-Data-Science-In-Wild/README.md) | Data science driven projects in the real world | [lesson](6-Data-Science-In-Wild/20-Real-World-Examples/README.md) | [Nitya](https://twitter.com/nitya) |
## Offline access
You can run this documentation offline by using [Docsify](https://docsify.js.org/#/). Fork this repo, [install Docsify](https://docsify.js.org/#/quickstart) on your local machine, then in the root folder of this repo, type `docsify serve`. The website will be served on port 3000 on your localhost: `localhost:3000`.

6
package-lock.json generated

@ -1433,9 +1433,9 @@
"dev": true
},
"prismjs": {
"version": "1.24.1",
"resolved": "https://registry.npmjs.org/prismjs/-/prismjs-1.24.1.tgz",
"integrity": "sha512-mNPsedLuk90RVJioIky8ANZEwYm5w9LcvCXrxHlwf4fNVSn8jEipMybMkWUyyF0JhnC+C4VcOVSBuHRKs1L5Ow==",
"version": "1.25.0",
"resolved": "https://registry.npmjs.org/prismjs/-/prismjs-1.25.0.tgz",
"integrity": "sha512-WCjJHl1KEWbnkQom1+SzftbtXMKQoezOCYs5rECqMN+jP+apI7ftoflyqigqzopSO3hMhTEb0mFClA8lkolgEg==",
"dev": true
},
"process-nextick-args": {
