@ -0,0 +1,16 @@
---
name: Review Checklist
about: Reviewing curriculum lessons
title: '[Review]'
labels: ''
assignees: ''

---

# This lesson has been reviewed and the following issues have been resolved

- [ ] Typos
- [ ] Grammar errors
- [ ] Missing links
- [ ] Broken images
- [ ] Checked for completeness
- [ ] Quiz (if no quiz, assign to @paladique)
@ -1,8 +1,31 @@
# Assignment: Data Science Scenarios

In this first assignment, we ask you to think about some real-life process or problem in different problem domains, and how you can improve it using the Data Science process. Think about the following:

1. Which data can you collect?
1. How would you collect it?
1. How would you store the data? How large is the data likely to be?
1. Which insights might you be able to get from this data? Which decisions would we be able to take based on the data?

Try to think about 3 different problems/processes and describe each of the points above for each problem domain.

Here are some of the problem domains and problems that can get you started thinking:

1. How can you use data to improve the education process for children in schools?
1. How can you use data to control vaccination during the pandemic?
1. How can you use data to make sure you are being productive at work?

## Instructions

Fill in the following table (substitute the suggested problem domains with your own if needed):

| Problem Domain | Problem | Which data to collect | How to store the data | Which insights/decisions we can make |
|----------------|---------|-----------------------|-----------------------|--------------------------------------|
| Education      |         |                       |                       |                                      |
| Vaccination    |         |                       |                       |                                      |
| Productivity   |         |                       |                       |                                      |

## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | ---
One was able to identify reasonable data sources, ways of storing data and possible decisions/insights for all problem domains | Some of the aspects of the solution are not detailed, data storage is not discussed, at least 2 problem domains are described | Only parts of the data solution are described, only one problem domain is considered.
@ -0,0 +1,33 @@
# Assignment: Data Science Scenarios

In this first assignment, we ask you to think about some real-life process or problem in different problem domains, and how you can improve it using the Data Science process. Think about the following:

1. Which data can you collect?
1. How would you collect it?
1. How would you store the data? How large is the data likely to be?
1. Which insights might you be able to get from this data? Which decisions would we be able to take based on the data?

Try to think about 3 different problems/processes and describe each of the points above for each problem domain.

Here are some of the problem domains and problems that can get you started thinking:

1. How can you use data to improve the education process for children in schools?
1. How can you use data to control vaccination during the pandemic?
1. How can you use data to make sure you are being productive at work?

## Instructions

Fill in the following table (substitute the suggested problem domains with your own if needed):

| Problem Domain | Problem | Which data to collect | How to store the data | Which insights/decisions we can make |
|----------------|---------|-----------------------|-----------------------|--------------------------------------|
| Education | In university, we typically have low attendance at lectures, and we have the hypothesis that students who attend lectures on average do better during exams. We want to stimulate attendance and test the hypothesis. | We can track attendance through pictures taken by the security camera in class, or by tracking the bluetooth/wifi addresses of student mobile phones in class. Exam data is already available in the university database. | In case we track security camera images, we need to store a few (5-10) photographs during class (unstructured data), and then use AI to identify the faces of students (convert the data to structured form). | We can compute average attendance data for each student, and see if there is any correlation with exam grades. We will talk more about correlation in the [probability and statistics](../../04-stats-and-probability/README.md) section. In order to stimulate student attendance, we can publish the weekly attendance rating on the school portal, and draw prizes among those with the highest attendance. |
| Vaccination | | | | |
| Productivity | | | | |

> *We provide just one answer as an example, so that you can get an idea of what is expected in this assignment.*
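The attendance/grades correlation mentioned in the example answer can be checked with a few lines of Python. This is a minimal sketch: the attendance fractions and grades below are invented purely for illustration, and the correlation is computed from its definition to avoid extra dependencies.

```python
# Hypothetical per-student data: fraction of lectures attended and exam grade.
attendance = [0.9, 0.6, 0.8, 0.3, 0.95]
grades = [85, 70, 80, 55, 90]

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from the definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

r = pearson(attendance, grades)
print(r)  # a value close to 1 indicates a strong positive correlation
```

A value of `r` close to 1 would support the hypothesis; a value near 0 would suggest attendance and grades are unrelated in this sample.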
## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | ---
One was able to identify reasonable data sources, ways of storing data and possible decisions/insights for all problem domains | Some of the aspects of the solution are not detailed, data storage is not discussed, at least 2 problem domains are described | Only parts of the data solution are described, only one problem domain is considered.
@ -1,3 +0,0 @@
## Courses

## Articles
@ -0,0 +1,27 @@
[
    {
        "firstname": "Christophe",
        "age": 32
    },
    {
        "firstname": "Prema",
        "age": 20
    },
    {
        "firstname": "Arthur",
        "age": 15
    },
    {
        "firstname": "Zoe",
        "age": 7
    },
    {
        "firstname": "Keisha",
        "age": 84
    },
    {
        "firstname": "Jackie",
        "age": 45
    }
]
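As a sketch of how an array like this might be consumed in Python: the literal below simply mirrors the JSON document above (in practice you would load it from wherever the file is stored).

```python
import json

# The same records as the JSON document above, embedded for illustration.
people = json.loads("""
[
  {"firstname": "Christophe", "age": 32},
  {"firstname": "Prema", "age": 20},
  {"firstname": "Arthur", "age": 15},
  {"firstname": "Zoe", "age": 7},
  {"firstname": "Keisha", "age": 84},
  {"firstname": "Jackie", "age": 45}
]
""")

adults = [p["firstname"] for p in people if p["age"] >= 18]
average_age = sum(p["age"] for p in people) / len(people)
print(adults)       # ['Christophe', 'Prema', 'Keisha', 'Jackie']
print(average_age)
```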
@ -0,0 +1,23 @@
# Assignment for Data Processing in Python

In this assignment, we will ask you to elaborate on the code we have started developing in our challenges. The assignment consists of two parts:

## COVID-19 Spread Modelling

- [ ] Plot $R_t$ graphs for 5-6 different countries on one plot for comparison, or using several plots side-by-side
- [ ] See how the number of deaths and recoveries correlates with the number of infected cases.
- [ ] Find out how long a typical disease lasts by visually correlating the infection rate and the death rate and looking for anomalies. You may need to look at different countries to find that out.
- [ ] Calculate the fatality rate and how it changes over time. *You may want to take into account the length of the disease in days to shift one time series before doing calculations.*
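For the fatality-rate task above, one possible shape of the calculation is sketched here. The column names, the numbers, and the 2-day disease length are all assumptions for illustration; substitute the dataframe and the disease length you derive from your own analysis.

```python
import pandas as pd

# Toy daily cumulative counts; replace with the real COVID dataframe.
df = pd.DataFrame({
    "infected": [100, 200, 400, 800, 1000, 1100],
    "deaths":   [0,   2,   8,   20,  40,   55],
})

# Assume the disease lasts ~2 days, so compare today's deaths with the
# infection counts from 2 days earlier (shift moves the series forward).
disease_length = 2
fatality_rate = df["deaths"] / df["infected"].shift(disease_length)
print(fatality_rate)  # first `disease_length` values are NaN
```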
## COVID-19 Papers Analysis

- [ ] Build a co-occurrence matrix of different medications, and see which medications often occur together (i.e. are mentioned in one abstract). You can modify the code for building a co-occurrence matrix for medications and diagnoses.
- [ ] Visualize this matrix using a heatmap.
- [ ] As a stretch goal, visualize the co-occurrence of medications using a [chord diagram](https://en.wikipedia.org/wiki/Chord_diagram). [This library](https://pypi.org/project/chord/) may help you draw a chord diagram.
- [ ] As another stretch goal, extract dosages of different medications (such as **400mg** in *take 400mg of chloroquine daily*) using regular expressions, and build a dataframe that shows different dosages for different medications. **Note**: consider numeric values that are in close textual vicinity of the medicine name.
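The co-occurrence and dosage-extraction tasks above can be sketched as follows. The medication list and abstracts are invented stand-ins for the paper data used in the lesson, and the regex is a deliberately simple first attempt (it only catches the `<amount> of <medicine>` pattern).

```python
import re
from itertools import combinations

# Invented stand-ins for the medication list and paper abstracts.
meds = ["chloroquine", "remdesivir", "ibuprofen"]
abstracts = [
    "patients received chloroquine and remdesivir",
    "remdesivir trial results were inconclusive",
    "take 400mg of chloroquine daily with ibuprofen as needed",
    "chloroquine and remdesivir combination therapy",
]

# Co-occurrence matrix: cooc[i][j] counts abstracts mentioning both meds.
cooc = [[0] * len(meds) for _ in meds]
for text in abstracts:
    present = [i for i, m in enumerate(meds) if m in text]
    for i, j in combinations(present, 2):
        cooc[i][j] += 1
        cooc[j][i] += 1

# Dosage extraction: numeric mg values immediately before a medication name.
dosages = {}
for text in abstracts:
    for med in meds:
        match = re.search(r"(\d+\s?mg)\s+of\s+" + med, text)
        if match:
            dosages.setdefault(med, []).append(match.group(1))

print(cooc)     # [[0, 2, 1], [2, 0, 0], [1, 0, 0]]
print(dosages)  # {'chloroquine': ['400mg']}
```

From here, a heatmap of `cooc` is a single `matplotlib.pyplot.imshow(cooc)` or `seaborn.heatmap(cooc)` call.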
## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | ---
All tasks are complete, graphically illustrated and explained, including at least one of the two stretch goals | More than 5 tasks are complete, no stretch goals are attempted, or the results are not clear | Fewer than 5 (but more than 3) tasks are complete, visualizations do not help to demonstrate the point
@ -1,13 +1,16 @@
# Working with Data



> Photo by <a href="https://unsplash.com/@swimstaralex?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Alexander Sinn</a> on <a href="https://unsplash.com/s/photos/data?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

In these lessons, you will learn some of the ways that data can be managed, manipulated, and used in applications. You will learn about relational and non-relational databases and how data can be stored in them. You'll learn the fundamentals of working with Python to manage data, and you'll discover some of the many ways that you can work with Python to manage and mine data.

### Topics

1. [Relational databases](05-relational-databases/README.md)
2. [Non-relational databases](06-non-relational/README.md)
3. [Working with Python](07-python/README.md)
4. [Preparing data](08-data-preparation/README.md)

### Credits

These lessons were written with ❤️ by [Christopher Harrison](https://twitter.com/geektrainer), [Dmitry Soshnikov](https://twitter.com/shwars) and [Jasmine Greenaway](https://twitter.com/paladique)
@ -0,0 +1,20 @@
# Exploring and Assessing a Dataset

A client has approached your team for help in investigating a taxi customer's seasonal spending habits in New York City.

They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**

Your team is in the [Capturing](Readme.md#Capturing) stage of the Data Science Lifecycle and you are in charge of exploring the dataset. You have been provided a notebook and data from Azure Open Datasets to explore and assess whether the data can answer the client's question. You have decided to select a small sample of 1 summer month and 1 winter month in the year 2019.

## Instructions

In this directory is a [notebook](notebook.ipynb) that uses Python to load yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets) for the months of January and July 2019. These datasets have been joined together in a Pandas dataframe.

Your task is to identify the columns that are most likely required to answer this question, then reorganize the joined dataset so that these columns are displayed first.
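Reordering columns in pandas amounts to indexing the dataframe with the column list you want. A sketch with a made-up subset of columns (check the data dictionary linked from the notebook for the real column names, and pick the key columns your own analysis suggests):

```python
import pandas as pd

# Made-up subset of the yellow taxi columns, for illustration only.
df = pd.DataFrame({
    "vendorID": [1, 2],
    "fareAmount": [10.0, 8.5],
    "tpepPickupDateTime": ["2019-01-03", "2019-07-12"],
    "tipAmount": [2.5, 0.0],
})

# Put the columns most relevant to the tipping question first,
# then append every remaining column in its original order.
key_cols = ["tpepPickupDateTime", "tipAmount"]
df = df[key_cols + [c for c in df.columns if c not in key_cols]]
print(list(df.columns))  # ['tpepPickupDateTime', 'tipAmount', 'vendorID', 'fareAmount']
```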
Finally, write 3 questions that you would ask the client for more clarification and better understanding of the problem.

## Rubric

Exemplary | Adequate | Needs Improvement
--- | --- | ---
@ -0,0 +1,76 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "Copyright (c) Microsoft Corporation. All rights reserved.\r\n",
    "\r\n",
    "Licensed under the MIT License."
   ],
   "metadata": {}
  },
  {
   "cell_type": "markdown",
   "source": [
    "# Exploring NYC Taxi data in Winter and Summer\r\n",
    "\r\n",
    "Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to explore the columns that have been provided.\r\n"
   ],
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "!pip install pandas"
   ],
   "outputs": [],
   "metadata": {
    "scrolled": true
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "import pandas as pd\r\n",
    "import glob\r\n",
    "\r\n",
    "path = '../../data/Taxi/yellow_tripdata_2019-{}.csv'\r\n",
    "july_taxi = pd.read_csv(path.format('07'))\r\n",
    "january_taxi = pd.read_csv(path.format('01'))\r\n",
    "\r\n",
    "df = pd.concat([january_taxi, july_taxi])\r\n",
    "\r\n",
    "print(df)"
   ],
   "outputs": [],
   "metadata": {}
  }
 ],
 "metadata": {
  "kernelspec": {
   "name": "python3",
   "display_name": "Python 3.9.7 64-bit ('venv': venv)"
  },
  "language_info": {
   "mimetype": "text/x-python",
   "name": "python",
   "pygments_lexer": "ipython3",
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "version": "3.9.7",
   "nbconvert_exporter": "python",
   "file_extension": ".py"
  },
  "name": "04-nyc-taxi-join-weather-in-pandas",
  "notebookId": 1709144033725344,
  "interpreter": {
   "hash": "6b9b57232c4b57163d057191678da2030059e733b8becc68f245de5a75abe84e"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@ -0,0 +1,25 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "source": [
    "# print(pd.read_csv('../../data/Taxi/yellow_tripdata_2019-01.csv'))\r\n",
    "# all_files = glob.glob('../../data/Taxi/*.csv')\r\n",
    "\r\n",
    "# df = pd.concat((pd.read_csv(f) for f in all_files))\r\n",
    "# print(df)"
   ],
   "outputs": [],
   "metadata": {}
  }
 ],
 "metadata": {
  "orig_nbformat": 4,
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
@ -1,13 +1,15 @@
# The Data Science Lifecycle



> Photo by <a href="https://unsplash.com/@headwayio?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Headway</a> on <a href="https://unsplash.com/s/photos/communication?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>

In these lessons, you'll explore some of the aspects of the Data Science lifecycle, including analysis and communication around data.

### Topics

1. [Introduction](14-Introduction/README.md)
2. [Analyzing](15-Analyzing/README.md)
3. [Communication](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/4-Data-Science-Lifecycle/16-communication)

### Credits

These lessons were written with ❤️ by [Jalen McGee](https://twitter.com/JalenMCG)
@ -0,0 +1,11 @@
# Data Science project using Azure ML SDK

## Instructions

We saw how to use the Azure ML platform to train, deploy, and consume a model with the Azure ML SDK. Now look around for some data that you could use to train another model, deploy it, and consume it. You can look for datasets on [Kaggle](https://kaggle.com) and [Azure Open Datasets](https://azure.microsoft.com/services/open-datasets/catalog?WT.mc_id=academic-40229-cxa&ocid=AID3041109).

## Rubric

| Exemplary | Adequate | Needs Improvement |
|-----------|----------|-------------------|
| When doing the AutoML configuration, you went through the SDK documentation to see what parameters you could use. You ran a training on a dataset through AutoML using the Azure ML SDK, and you checked the model explanations. You deployed the best model and were able to consume it through the Azure ML SDK. | You ran a training on a dataset through AutoML using the Azure ML SDK, and you checked the model explanations. You deployed the best model and were able to consume it through the Azure ML SDK. | You ran a training on a dataset through AutoML using the Azure ML SDK. You deployed the best model and were able to consume it through the Azure ML SDK. |