@ -265,7 +265,7 @@ Notice that when a previous value is not available for forward-filling, the null
In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, `pandas` provides an easy means of detecting and removing duplicate entries.
In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, `pandas` provides an easy means of detecting and removing duplicate entries.
- **Identifying duplicates: `duplicated`**: You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an ealier one. Let's create another example `DataFrame` to see this in action.
- **Identifying duplicates: `duplicated`**: You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an earlier one. Let's create another example `DataFrame` to see this in action.
@ -21,7 +21,7 @@ An excellent library to create both simple and sophisticated plots and charts of
If you have a dataset and need to discover how much of a given item is included, one of the first tasks you have at hand will be to inspect its values.
If you have a dataset and need to discover how much of a given item is included, one of the first tasks you have at hand will be to inspect its values.
✅ There are very good 'cheat sheets' available for Matplotlib [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-1.png) and [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-2.png).
✅ There are very good 'cheat sheets' available for Matplotlib [here](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
## Build a line plot about bird wingspan values
## Build a line plot about bird wingspan values
@ -52,7 +52,7 @@ Let's start by plotting some of the numeric data using a basic line plot. Suppos
wingspan = birds['MaxWingspan']
wingspan = birds['MaxWingspan']
wingspan.plot()
wingspan.plot()
```
```


What do you notice immediately? There seems to be at least one outlier - that's quite a wingspan! A 2300 centimeter wingspan equals 23 meters - are there Pterodactyls roaming Minnesota? Let's investigate.
What do you notice immediately? There seems to be at least one outlier - that's quite a wingspan! A 2300 centimeter wingspan equals 23 meters - are there Pterodactyls roaming Minnesota? Let's investigate.
@ -72,7 +72,7 @@ plt.plot(x, y)
plt.show()
plt.show()
```
```


Even with the rotation of the labels set to 45 degrees, there are too many to read. Let's try a different strategy: label only those outliers and set the labels within the chart. You can use a scatter chart to make more room for the labeling:
Even with the rotation of the labels set to 45 degrees, there are too many to read. Let's try a different strategy: label only those outliers and set the labels within the chart. You can use a scatter chart to make more room for the labeling:
@ -94,7 +94,7 @@ What's going on here? You used `tick_params` to hide the bottom labels and then
What did you discover?
What did you discover?


## Filter your data
## Filter your data
Both the Bald Eagle and the Prairie Falcon, while probably very large birds, appear to be mislabeled, with an extra `0` added to their maximum wingspan. It's unlikely that you'll meet a Bald Eagle with a 25 meter wingspan, but if so, please let us know! Let's create a new dataframe without those two outliers:
Both the Bald Eagle and the Prairie Falcon, while probably very large birds, appear to be mislabeled, with an extra `0` added to their maximum wingspan. It's unlikely that you'll meet a Bald Eagle with a 25 meter wingspan, but if so, please let us know! Let's create a new dataframe without those two outliers:
@ -114,7 +114,7 @@ plt.show()
By filtering out outliers, your data is now more cohesive and understandable.
By filtering out outliers, your data is now more cohesive and understandable.


Now that we have a cleaner dataset at least in terms of wingspan, let's discover more about these birds.
Now that we have a cleaner dataset at least in terms of wingspan, let's discover more about these birds.
@ -140,13 +140,13 @@ birds.plot(x='Category',
title='Birds of Minnesota')
title='Birds of Minnesota')
```
```


This bar chart, however, is unreadable because there is too much non-grouped data. You need to select only the data that you want to plot, so let's look at the length of birds based on their category.
This bar chart, however, is unreadable because there is too much non-grouped data. You need to select only the data that you want to plot, so let's look at the length of birds based on their category.
Filter your data to include only the bird's category.
Filter your data to include only the bird's category.
✅ Notice that that you use Pandas to manage the data, and then let Matplotlib do the charting.
✅ Notice that you use Pandas to manage the data, and then let Matplotlib do the charting.
Since there are many categories, you can display this chart vertically and tweak its height to account for all the data:
Since there are many categories, you can display this chart vertically and tweak its height to account for all the data:


This bar chart shows a good view of the number of birds in each category. In a blink of an eye, you see that the largest number of birds in this region are in the Ducks/Geese/Waterfowl category. Minnesota is the 'land of 10,000 lakes' so this isn't surprising!
This bar chart shows a good view of the number of birds in each category. In a blink of an eye, you see that the largest number of birds in this region are in the Ducks/Geese/Waterfowl category. Minnesota is the 'land of 10,000 lakes' so this isn't surprising!
Nothing is surprising here: hummingbirds have the least MaxLength compared to Pelicans or Geese. It's good when data makes logical sense!
Nothing is surprising here: hummingbirds have the least MaxLength compared to Pelicans or Geese. It's good when data makes logical sense!
@ -189,7 +189,7 @@ plt.show()
```
```
In this plot, you can see the range per bird category of the Minimum Length and Maximum length. You can safely say that, given this data, the bigger the bird, the larger its length range. Fascinating!
In this plot, you can see the range per bird category of the Minimum Length and Maximum length. You can safely say that, given this data, the bigger the bird, the larger its length range. Fascinating!
@ -21,7 +21,7 @@ Una excelente librería para crear gráficos tanto simples como sofisticados de
Si tienes un conjunto de datos y necesitas descubrir qué cantidad de un elemento determinado está incluido, una de las primeras tareas que tienes que hacer será inspeccionar sus valores.
Si tienes un conjunto de datos y necesitas descubrir qué cantidad de un elemento determinado está incluido, una de las primeras tareas que tienes que hacer será inspeccionar sus valores.
✅ Hay muy buenas "hojas de trucos" disponibles para Matplotlib [aquí](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-1.png) y [aquí](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-2.png).
✅ Hay muy buenas "hojas de trucos" disponibles para Matplotlib [aquí](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
## Construir un gráfico de líneas sobre los valores de la envergadura de las aves
## Construir un gráfico de líneas sobre los valores de la envergadura de las aves
यदि आपके पास एक डेटासेट है और यह पता लगाने की आवश्यकता है कि किसी दिए गए आइटम में से कितना शामिल है, तो आपके पास सबसे पहले कार्यों में से एक इसके मूल्यों का निरीक्षण करना होगा।
यदि आपके पास एक डेटासेट है और यह पता लगाने की आवश्यकता है कि किसी दिए गए आइटम में से कितना शामिल है, तो आपके पास सबसे पहले कार्यों में से एक इसके मूल्यों का निरीक्षण करना होगा।
✅ माटप्लोटलिब के लिए बहुत अच्छी 'चीट शीट' उपलब्ध हैं [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-1.png) and [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-2.png).
✅ माटप्लोटलिब के लिए बहुत अच्छी 'चीट शीट' उपलब्ध हैं [here](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
## बर्ड विंगस्पैन मूल्यों के बारे में एक लाइन प्लॉट बनाएं
## बर्ड विंगस्पैन मूल्यों के बारे में एक लाइन प्लॉट बनाएं
데이터 세트가 있고 주어진 항목이 얼마나 포함되어 있는지 확인해야 하는 경우에, 가장 먼저 처리해야 하는 작업 중 하나는 해당 값을 검사하는 것입니다.
데이터 세트가 있고 주어진 항목이 얼마나 포함되어 있는지 확인해야 하는 경우에, 가장 먼저 처리해야 하는 작업 중 하나는 해당 값을 검사하는 것입니다.
✅ Matplotlib에 사용할 수 있는 매우 좋은 '치트 시트'가 있습니다. [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-1.png) and [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-2.png).
✅ Matplotlib에 사용할 수 있는 매우 좋은 '치트 시트'가 있습니다. [here](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
@ -21,7 +21,7 @@ Uma biblioteca excelente para criar tanto gráficos simples como sofisticados e
Se você tem um dataset e precisa descobrir quanto de um dado elemento está presente, uma das primeiras coisas que você precisará fazer é examinar seus valores.
Se você tem um dataset e precisa descobrir quanto de um dado elemento está presente, uma das primeiras coisas que você precisará fazer é examinar seus valores.
✅ Existem dicas ('cheat sheets') ótimas disponíveis para o Matplotlib [aqui](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-1.png) e [aqui](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-2.png).
✅ Existem dicas ('cheat sheets') ótimas disponíveis para o Matplotlib [aqui](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
## Construindo um gráfico de linhas sobre os valores de envergadura de aves
## Construindo um gráfico de linhas sobre os valores de envergadura de aves
In general, you can quickly look at the way data is distributed by using a scatter plot as we did in the previous lesson:
In general, you can quickly look at the way data is distributed by using a scatter plot as we did in the previous lesson:
```python
```python
@ -31,6 +40,8 @@ plt.xlabel('Max Length')
plt.show()
plt.show()
```
```

This gives an overview of the general distribution of body length per bird Order, but it is not the optimal way to display true distributions. That task is usually handled by creating a Histogram.
This gives an overview of the general distribution of body length per bird Order, but it is not the optimal way to display true distributions. That task is usually handled by creating a Histogram.
## Working with histograms
## Working with histograms
@ -40,7 +51,7 @@ Matplotlib offers very good ways to visualize data distribution using Histograms


As you can see, most of the 400+ birds in this dataset fall in the range of under 2000 for their Max Body Mass. Gain more insight into the data by changing the `bins` parameter to a higher number, something like 30:
As you can see, most of the 400+ birds in this dataset fall in the range of under 2000 for their Max Body Mass. Gain more insight into the data by changing the `bins` parameter to a higher number, something like 30:
@ -48,7 +59,7 @@ As you can see, most of the 400+ birds in this dataset fall in the range of unde


This chart shows the distribution in a bit more granular fashion. A chart less skewed to the left could be created by ensuring that you only select data within a given range:
This chart shows the distribution in a bit more granular fashion. A chart less skewed to the left could be created by ensuring that you only select data within a given range:
✅ Try some other filters and data points. To see the full distribution of the data, remove the `['MaxBodyMass']` filter to show labeled distributions.
✅ Try some other filters and data points. To see the full distribution of the data, remove the `['MaxBodyMass']` filter to show labeled distributions.
@ -76,7 +87,7 @@ hist = ax.hist2d(x, y)
```
```
There appears to be an expected correlation between these two elements along an expected axis, with one particularly strong point of convergence:
There appears to be an expected correlation between these two elements along an expected axis, with one particularly strong point of convergence:


Histograms work well by default for numeric data. What if you need to see distributions according to text data?
Histograms work well by default for numeric data. What if you need to see distributions according to text data?
## Explore the dataset for distributions using text data
## Explore the dataset for distributions using text data
@ -115,7 +126,7 @@ plt.gca().set(title='Conservation Status', ylabel='Max Body Mass')
plt.legend();
plt.legend();
```
```


There doesn't seem to be a good correlation between minimum wingspan and conservation status. Test other elements of the dataset using this method. You can try different filters as well. Do you find any correlation?
There doesn't seem to be a good correlation between minimum wingspan and conservation status. Test other elements of the dataset using this method. You can try different filters as well. Do you find any correlation?
Now, if you print out the mushrooms data, you can see that it has been grouped into categories according to the poisonous/edible class:
Now, if you print out the mushrooms data, you can see that it has been grouped into categories according to the poisonous/edible class:
@ -78,7 +84,7 @@ plt.show()
```
```
Voila, a pie chart showing the proportions of this data according to these two classes of mushrooms. It's quite important to get the order of the labels correct, especially here, so be sure to verify the order with which the label array is built!
Voila, a pie chart showing the proportions of this data according to these two classes of mushrooms. It's quite important to get the order of the labels correct, especially here, so be sure to verify the order with which the label array is built!


## Donuts!
## Donuts!
@ -108,7 +114,7 @@ plt.title('Mushroom Habitats')
plt.show()
plt.show()
```
```


This code draws a chart and a center circle, then adds that center circle in the chart. Edit the width of the center circle by changing `0.40` to another value.
This code draws a chart and a center circle, then adds that center circle in the chart. Edit the width of the center circle by changing `0.40` to another value.
@ -28,7 +28,7 @@ Defining the goals of the project will require deeper context into the problem o
Questions a data scientist may ask:
Questions a data scientist may ask:
- Has this problem been approached before? What was discovered?
- Has this problem been approached before? What was discovered?
- Is the purpose and goal understood by all involved?
- Is the purpose and goal understood by all involved?
- Where is there ambiguity and how to reduce it?
- Is there ambiguity and how to reduce it?
- What are the constraints?
- What are the constraints?
- What will the end result potentially look like?
- What will the end result potentially look like?
- How much resources (time, people, computational) are available?
- How much resources (time, people, computational) are available?
@ -62,16 +62,18 @@ Considerations of how and where the data is stored can influence the cost of its
Here’s some aspects of modern data storage systems that can affect these choices:
Here’s some aspects of modern data storage systems that can affect these choices:
**On premise vs off premise vs public or private cloud**
**On premise vs off premise vs public or private cloud**
On premise refers to hosting managing the data on your own equipment, like owning a server with hard drives that store the data, while off premise relies on equipment that you don’t own, such as a data center. The public cloud is a popular choice for storing data that requires no knowledge of how or where exactly the data is stored, where public refers to a unified underlying infrastructure that is shared by all who use the cloud. Some organizations have strict security policies that require that they have complete access to the equipment where the data is hosted and will rely on a private cloud that provides its own cloud services. You’ll learn more about data in the cloud in [later lessons](5-Data-Science-In-Cloud).
On premise refers to hosting managing the data on your own equipment, like owning a server with hard drives that store the data, while off premise relies on equipment that you don’t own, such as a data center. The public cloud is a popular choice for storing data that requires no knowledge of how or where exactly the data is stored, where public refers to a unified underlying infrastructure that is shared by all who use the cloud. Some organizations have strict security policies that require that they have complete access to the equipment where the data is hosted and will rely on a private cloud that provides its own cloud services. You’ll learn more about data in the cloud in [later lessons](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/5-Data-Science-In-Cloud).
**Cold vs hot data**
**Cold vs hot data**
When training your models, you may require more training data. If you’re content with your model, more data will arrive for a model to serve its purpose. In any case the cost of storing and accessing data will increase as you accumulate more of it. Separating rarely used data, known as cold data from frequently accessed hot data can be a cheaper data storage option through hardware or software services. If cold data needs to be accessed, it may take a little longer to retrieve in comparison to hot data.
When training your models, you may require more training data. If you’re content with your model, more data will arrive for a model to serve its purpose. In any case the cost of storing and accessing data will increase as you accumulate more of it. Separating rarely used data, known as cold data from frequently accessed hot data can be a cheaper data storage option through hardware or software services. If cold data needs to be accessed, it may take a little longer to retrieve in comparison to hot data.
### Managing Data
### Managing Data
As you work with data you may discover that some of the data needs to be cleaned using some of the techniques covered in the lesson focused on [data preparation](2-Working-With-Data\08-data-preparation) to build accurate models. When new data arrives, it will need some of the same applications to maintain consistency in quality. Some projects will involve use of an automated tool for cleansing, aggregation, and compression before the data is moved to its final location. Azure Data Factory is an example of one of these tools.
As you work with data you may discover that some of the data needs to be cleaned using some of the techniques covered in the lesson focused on [data preparation](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/2-Working-With-Data/08-data-preparation) to build accurate models. When new data arrives, it will need some of the same applications to maintain consistency in quality. Some projects will involve use of an automated tool for cleansing, aggregation, and compression before the data is moved to its final location. Azure Data Factory is an example of one of these tools.
### Securing the Data
### Securing the Data
One of the main goals of securing data is ensuring that those working it are in control of what is collected and in what context it is being used. Keeping data secure involves limiting access to only those who need it, adhering to local laws and regulations, as well as maintaining ethical standards, as covered in the [ethics lesson](1-Introduction\02-ethics).
One of the main goals of securing data is ensuring that those working it are in control of what is collected and in what context it is being used. Keeping data secure involves limiting access to only those who need it, adhering to local laws and regulations, as well as maintaining ethical standards, as covered in the [ethics lesson](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/1-Introduction/02-ethics).
Here’s some things that a team may do with security in mind:
Here’s some things that a team may do with security in mind:
A client has approached your team for help in investigating a taxi customer's seasonal spending habits in New York City.
They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
Your team is in the [Capturing](Readme.md#Capturing) stage of the Data Science Lifecycle and you are in charge of handling the the dataset. You have been provided a notebook and [data](../../data/taxi.csv) to explore.
In this directory is a [notebook](notebook.ipynb) that uses Python to load yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets).
You can also open the taxi data file in text editor or spreadsheet software like Excel.
## Instructions
- Assess whether or not the data in this dataset can help answer the question.
- Explore the [NYC Open Data catalog](https://data.cityofnewyork.us/browse?sortBy=most_accessed&utf8=%E2%9C%93). Identify an additional dataset that could potentially be helpful in answering the client's question.
- Write 3 questions that you would ask the client for more clarification and better understanding of the problem.
Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more information about the data.
@ -50,7 +50,7 @@ Azure ML provides all the tools developers and data scientists need for their ma
### 1.2 The Heart Failure Prediction Project:
### 1.2 The Heart Failure Prediction Project:
There is no doubt that making and building projects is the best to put your skills and knowledge to test. In this lesson, we are going to explore two different ways of building a data science project for the prediction of heart failure attacks in Azure ML Studio, through Low code/No code and through the Azure ML SDK as shown in the following schema:
There is no doubt that making and building projects is the best way to put your skills and knowledge to the test. In this lesson, we are going to explore two different ways of building a data science project for the prediction of heart failure attacks in Azure ML Studio, through Low code/No code and through the Azure ML SDK as shown in the following schema: