Merge branch 'microsoft:main' into main

pull/297/head
Francisco Imanol Suarez 4 years ago committed by GitHub
commit 9784988c31
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -265,7 +265,7 @@ Notice that when a previous value is not available for forward-filling, the null
In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, `pandas` provides an easy means of detecting and removing duplicate entries.
- **Identifying duplicates: `duplicated`**: You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an ealier one. Let's create another example `DataFrame` to see this in action.
- **Identifying duplicates: `duplicated`**: You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an earlier one. Let's create another example `DataFrame` to see this in action.
```python
example4 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
'numbers': [1, 2, 1, 3, 3]})
@ -290,7 +290,7 @@ example4.duplicated()
4 True
dtype: bool
```
- **Dropping duplicates: `drop_duplicates`: `drop_duplicates` simply returns a copy of the data for which all of the `duplicated` values are `False`:
- **Dropping duplicates: `drop_duplicates`:** simply returns a copy of the data for which all of the `duplicated` values are `False`:
```python
example4.drop_duplicates()
```
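As a quick check of the two methods above, the following minimal sketch rebuilds `example4` and confirms that `drop_duplicates` keeps exactly the rows where the `duplicated` mask is `False`. The names `mask`, `deduped`, and `same_rows` are introduced here for illustration only.

```python
import pandas as pd

example4 = pd.DataFrame({'letters': ['A', 'B'] * 2 + ['B'],
                         'numbers': [1, 2, 1, 3, 3]})

# True marks a row that repeats an earlier row in full
mask = example4.duplicated()

# drop_duplicates keeps the first occurrence of each distinct row
deduped = example4.drop_duplicates()

# Filtering with the inverted mask should select the same rows
same_rows = deduped.equals(example4[~mask])

print(mask.tolist())
print(deduped.index.tolist())
print(same_rows)
```

Note that the original index labels are preserved after dropping rows; call `reset_index(drop=True)` if a contiguous index is needed.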

@ -12,7 +12,7 @@
"\r\n",
"A client has been testing a [small form](index.html) to gather some basic data about their client-base. They have brought their findings to you to validate the data they have gathered. You can open the `index.html` page in a browser to take a look at the form.\r\n",
"\r\n",
"You have been provided a [dataset of csv records](../../data/form.csv)that contain entries from the form as well as some basic visualizations.The client pointed out that some of the visualizations look incorrect but they're unsure about how to resolve them. You can explore it in the [assignment notebook](assignment.ipynb).\r\n",
"You have been provided a [dataset of csv records](../../data/form.csv) that contain entries from the form as well as some basic visualizations.The client pointed out that some of the visualizations look incorrect but they're unsure about how to resolve them. You can explore it in the [assignment notebook](assignment.ipynb).\r\n",
"\r\n",
"## Instructions\r\n",
"\r\n",
@ -139,4 +139,4 @@
},
"nbformat": 4,
"nbformat_minor": 2
}
}

@ -21,7 +21,7 @@ An excellent library to create both simple and sophisticated plots and charts of
If you have a dataset and need to discover how much of a given item is included, one of the first tasks you have at hand will be to inspect its values.
✅ There are very good 'cheat sheets' available for Matplotlib [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-1.png) and [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-2.png).
✅ There are very good 'cheat sheets' available for Matplotlib [here](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
## Build a line plot about bird wingspan values
@ -52,7 +52,7 @@ Let's start by plotting some of the numeric data using a basic line plot. Suppos
wingspan = birds['MaxWingspan']
wingspan.plot()
```
![Max Wingspan](images/max-wingspan.png)
![Max Wingspan](images/max-wingspan-02.png)
What do you notice immediately? There seems to be at least one outlier - that's quite a wingspan! A 2300 centimeter wingspan equals 23 meters - are there Pterodactyls roaming Minnesota? Let's investigate.
@ -72,7 +72,7 @@ plt.plot(x, y)
plt.show()
```
![wingspan with labels](images/max-wingspan-labels.png)
![wingspan with labels](images/max-wingspan-labels-02.png)
Even with the rotation of the labels set to 45 degrees, there are too many to read. Let's try a different strategy: label only those outliers and set the labels within the chart. You can use a scatter chart to make more room for the labeling:
@ -94,7 +94,7 @@ What's going on here? You used `tick_params` to hide the bottom labels and then
What did you discover?
![outliers](images/labeled-wingspan.png)
![outliers](images/labeled-wingspan-02.png)
## Filter your data
Both the Bald Eagle and the Prairie Falcon, while probably very large birds, appear to be mislabeled, with an extra `0` added to their maximum wingspan. It's unlikely that you'll meet a Bald Eagle with a 25 meter wingspan, but if so, please let us know! Let's create a new dataframe without those two outliers:
@ -114,7 +114,7 @@ plt.show()
By filtering out outliers, your data is now more cohesive and understandable.
![scatterplot of wingspans](images/scatterplot-wingspan.png)
![scatterplot of wingspans](images/scatterplot-wingspan-02.png)
Now that we have a cleaner dataset at least in terms of wingspan, let's discover more about these birds.
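The filtering step described above can be sketched as follows. This is a minimal illustration, not the lesson's own code: the rows are invented (two of them carry the suspected extra `0`), and the 500 cm cutoff and the names `cutoff`, `outliers`, and `cleaned` are assumptions chosen for the example.

```python
import pandas as pd

# Invented sample rows; the first two mimic the mislabeled wingspans
birds = pd.DataFrame({
    'Name': ['Bald eagle', 'Prairie falcon', 'Snow goose', "Ross's goose"],
    'MaxWingspan': [2300.0, 1100.0, 165.0, 116.0],
})

# Assumed threshold in centimeters; pick one that suits the real data
cutoff = 500

# Rows above the cutoff are treated as suspect values
outliers = birds[birds['MaxWingspan'] > cutoff]

# Everything at or below the cutoff is kept for plotting
cleaned = birds[birds['MaxWingspan'] <= cutoff]

print(outliers['Name'].tolist())
print(cleaned['Name'].tolist())
```

Keeping the suspect rows in a separate `outliers` frame, rather than discarding them silently, makes it easier to review them or correct them later.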
@ -140,13 +140,13 @@ birds.plot(x='Category',
title='Birds of Minnesota')
```
![full data as a bar chart](images/full-data-bar.png)
![full data as a bar chart](images/full-data-bar-02.png)
This bar chart, however, is unreadable because there is too much non-grouped data. You need to select only the data that you want to plot, so let's look at the length of birds based on their category.
Filter your data to include only the bird's category.
✅ Notice that that you use Pandas to manage the data, and then let Matplotlib do the charting.
✅ Notice that you use Pandas to manage the data, and then let Matplotlib do the charting.
Since there are many categories, you can display this chart vertically and tweak its height to account for all the data:
@ -155,7 +155,7 @@ category_count = birds.value_counts(birds['Category'].values, sort=True)
plt.rcParams['figure.figsize'] = [6, 12]
category_count.plot.barh()
```
![category and length](images/category-counts.png)
![category and length](images/category-counts-02.png)
This bar chart shows a good view of the number of birds in each category. In a blink of an eye, you see that the largest number of birds in this region are in the Ducks/Geese/Waterfowl category. Minnesota is the 'land of 10,000 lakes' so this isn't surprising!
@ -171,7 +171,7 @@ plt.barh(y=birds['Category'], width=maxlength)
plt.rcParams['figure.figsize'] = [6, 12]
plt.show()
```
![comparing data](images/category-length.png)
![comparing data](images/category-length-02.png)
Nothing is surprising here: hummingbirds have the least MaxLength compared to Pelicans or Geese. It's good when data makes logical sense!
@ -189,7 +189,7 @@ plt.show()
```
In this plot, you can see the range per bird category of the Minimum Length and Maximum length. You can safely say that, given this data, the bigger the bird, the larger its length range. Fascinating!
![superimposed values](images/superimposed.png)
![superimposed values](images/superimposed-02.png)
## 🚀 Challenge

@ -21,7 +21,7 @@ Una excelente librería para crear gráficos tanto simples como sofisticados de
Si tienes un conjunto de datos y necesitas descubrir qué cantidad de un elemento determinado está incluido, una de las primeras tareas que tienes que hacer será inspeccionar sus valores.
✅ Hay muy buenas "hojas de trucos" disponibles para Matplotlib [aquí](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-1.png) y [aquí](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-2.png).
✅ Hay muy buenas "hojas de trucos" disponibles para Matplotlib [aquí](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
## Construir un gráfico de líneas sobre los valores de la envergadura de las aves
@ -203,4 +203,4 @@ Este conjunto de datos sobre aves ofrece una gran cantidad de información sobre
Esta primera lección has recibido alguna información sobre cómo utilizar Matplotlib para visualizar cantidades. Investiga sobre otras formas de trabajar con conjuntos de datos para su visualización. [Plotly](https://github.com/plotly/plotly.py) es otra forma que no cubriremos en estas lecciones, así que echa un vistazo a lo que puede ofrecer.
## Asignación
[Líneas, dispersiones y barras](assignment.es.md)
[Líneas, dispersiones y barras](assignment.es.md)

@ -21,7 +21,7 @@
यदि आपके पास एक डेटासेट है और यह पता लगाने की आवश्यकता है कि किसी दिए गए आइटम में से कितना शामिल है, तो आपके पास सबसे पहले कार्यों में से एक इसके मूल्यों का निरीक्षण करना होगा।
✅ माटप्लोटलिब के लिए बहुत अच्छी 'चीट शीट' उपलब्ध हैं [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-1.png) and [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-2.png).
✅ माटप्लोटलिब के लिए बहुत अच्छी 'चीट शीट' उपलब्ध हैं [here](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
## बर्ड विंगस्पैन मूल्यों के बारे में एक लाइन प्लॉट बनाएं

@ -21,7 +21,7 @@
데이터 세트가 있고 주어진 항목이 얼마나 포함되어 있는지 확인해야 하는 경우에, 가장 먼저 처리해야 하는 작업 중 하나는 해당 값을 검사하는 것입니다.
✅ Matplotlib에 사용할 수 있는 매우 좋은 '치트 시트'가 있습니다. [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-1.png) and [here](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-2.png).
✅ Matplotlib에 사용할 수 있는 매우 좋은 '치트 시트'가 있습니다. [here](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
## 새 날개 길이 값에 대한 선 그래프 작성하기

@ -21,7 +21,7 @@ Uma biblioteca excelente para criar tanto gráficos simples como sofisticados e
Se você tem um dataset e precisa descobrir quanto de um dado elemento está presente, uma das primeiras coisas que você precisará fazer é examinar seus valores.
✅ Existem dicas ('cheat sheets') ótimas disponíveis para o Matplotlib [aqui](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-1.png) e [aqui](https://github.com/matplotlib/cheatsheets/blob/master/cheatsheets-2.png).
✅ Existem dicas ('cheat sheets') ótimas disponíveis para o Matplotlib [aqui](https://matplotlib.org/cheatsheets/cheatsheets.pdf).
## Construindo um gráfico de linhas sobre os valores de envergadura de aves

@ -20,6 +20,15 @@ birds = pd.read_csv('../../data/birds.csv')
birds.head()
```
| | Name | ScientificName | Category | Order | Family | Genus | ConservationStatus | MinLength | MaxLength | MinBodyMass | MaxBodyMass | MinWingspan | MaxWingspan |
| ---: | :--------------------------- | :--------------------- | :-------------------- | :----------- | :------- | :---------- | :----------------- | --------: | --------: | ----------: | ----------: | ----------: | ----------: |
| 0 | Black-bellied whistling-duck | Dendrocygna autumnalis | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 47 | 56 | 652 | 1020 | 76 | 94 |
| 1 | Fulvous whistling-duck | Dendrocygna bicolor | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Dendrocygna | LC | 45 | 53 | 712 | 1050 | 85 | 93 |
| 2 | Snow goose | Anser caerulescens | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 79 | 2050 | 4050 | 135 | 165 |
| 3 | Ross's goose | Anser rossii | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 57.3 | 64 | 1066 | 1567 | 113 | 116 |
| 4 | Greater white-fronted goose | Anser albifrons | Ducks/Geese/Waterfowl | Anseriformes | Anatidae | Anser | LC | 64 | 81 | 1930 | 3310 | 130 | 165 |
In general, you can quickly look at the way data is distributed by using a scatter plot as we did in the previous lesson:
```python
@ -31,6 +40,8 @@ plt.xlabel('Max Length')
plt.show()
```
![max length per order](images/scatter-wb.png)
This gives an overview of the general distribution of body length per bird Order, but it is not the optimal way to display true distributions. That task is usually handled by creating a Histogram.
## Working with histograms
@ -40,7 +51,7 @@ Matplotlib offers very good ways to visualize data distribution using Histograms
birds['MaxBodyMass'].plot(kind = 'hist', bins = 10, figsize = (12,12))
plt.show()
```
![distribution over the entire dataset](images/dist1.png)
![distribution over the entire dataset](images/dist1-wb.png)
As you can see, most of the 400+ birds in this dataset fall in the range of under 2000 for their Max Body Mass. Gain more insight into the data by changing the `bins` parameter to a higher number, something like 30:
@ -48,7 +59,7 @@ As you can see, most of the 400+ birds in this dataset fall in the range of unde
birds['MaxBodyMass'].plot(kind = 'hist', bins = 30, figsize = (12,12))
plt.show()
```
![distribution over the entire dataset with larger bins param](images/dist2.png)
![distribution over the entire dataset with larger bins param](images/dist2-wb.png)
This chart shows the distribution in a bit more granular fashion. A chart less skewed to the left could be created by ensuring that you only select data within a given range:
@ -59,7 +70,7 @@ filteredBirds = birds[(birds['MaxBodyMass'] > 1) & (birds['MaxBodyMass'] < 60)]
filteredBirds['MaxBodyMass'].plot(kind = 'hist',bins = 40,figsize = (12,12))
plt.show()
```
![filtered histogram](images/dist3.png)
![filtered histogram](images/dist3-wb.png)
✅ Try some other filters and data points. To see the full distribution of the data, remove the `['MaxBodyMass']` filter to show labeled distributions.
@ -76,7 +87,7 @@ hist = ax.hist2d(x, y)
```
There appears to be an expected correlation between these two elements along an expected axis, with one particularly strong point of convergence:
![2D plot](images/2D.png)
![2D plot](images/2D-wb.png)
Histograms work well by default for numeric data. What if you need to see distributions according to text data?
## Explore the dataset for distributions using text data
@ -115,7 +126,7 @@ plt.gca().set(title='Conservation Status', ylabel='Max Body Mass')
plt.legend();
```
![wingspan and conservation collation](images/histogram-conservation.png)
![wingspan and conservation collation](images/histogram-conservation-wb.png)
There doesn't seem to be a good correlation between minimum wingspan and conservation status. Test other elements of the dataset using this method. You can try different filters as well. Do you find any correlation?
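The effect of the `bins` parameter and of the range filter used in the histograms above can also be inspected numerically, without drawing a chart. The sketch below is an illustration under stated assumptions: the `MaxBodyMass` values are invented rather than read from `birds.csv`, and `np.histogram` is used here only to expose the counts that a histogram plot would draw.

```python
import numpy as np
import pandas as pd

# Invented stand-in values; the real lesson loads birds.csv instead
birds = pd.DataFrame({'MaxBodyMass': [5, 12, 30, 45, 58, 90, 400, 1500, 9000]})

# Same kind of range filter as in the lesson: strictly between 1 and 60
filteredBirds = birds[(birds['MaxBodyMass'] > 1) & (birds['MaxBodyMass'] < 60)]

# More bins means narrower intervals over the same overall range
counts_10, edges_10 = np.histogram(birds['MaxBodyMass'], bins=10)
counts_30, edges_30 = np.histogram(birds['MaxBodyMass'], bins=30)

print(len(filteredBirds))
print(counts_10.sum(), counts_30.sum())
print(len(edges_10), len(edges_30))
```

Every value is counted once regardless of the bin count; only the width of each interval changes, which is why a higher `bins` value gives a more granular view.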

@ -57,6 +57,12 @@ Take this data and convert the 'class' column to a category:
cols = mushrooms.select_dtypes(["object"]).columns
mushrooms[cols] = mushrooms[cols].astype('category')
```
```python
edibleclass=mushrooms.groupby(['class']).count()
edibleclass
```
Now, if you print out the mushrooms data, you can see that it has been grouped into categories according to the poisonous/edible class:
@ -78,7 +84,7 @@ plt.show()
```
Voila, a pie chart showing the proportions of this data according to these two classes of mushrooms. It's quite important to get the order of the labels correct, especially here, so be sure to verify the order with which the label array is built!
![pie chart](images/pie1.png)
![pie chart](images/pie1-wb.png)
## Donuts!
@ -108,7 +114,7 @@ plt.title('Mushroom Habitats')
plt.show()
```
![donut chart](images/donut.png)
![donut chart](images/donut-wb.png)
This code draws a chart and a center circle, then adds that center circle in the chart. Edit the width of the center circle by changing `0.40` to another value.
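Regarding the earlier caution about label order in the pie chart, one way to avoid a mismatch is to derive the labels from the grouped result instead of typing them by hand. The following is a minimal sketch under assumptions: the `mushrooms` rows are invented, and the names `labels` and `sizes` are hypothetical helpers, not part of the lesson.

```python
import pandas as pd

# Invented rows standing in for the mushrooms dataset
mushrooms = pd.DataFrame({
    'class': ['Edible', 'Poisonous', 'Edible', 'Edible', 'Poisonous'],
    'habitat': ['Woods', 'Urban', 'Grasses', 'Woods', 'Paths'],
})

# Convert the 'class' column to a category, as in the lesson
mushrooms['class'] = mushrooms['class'].astype('category')

# Count rows per class; observed=True limits the result to classes present
edibleclass = mushrooms.groupby('class', observed=True).count()

# Labels and sizes come from the same grouped frame, so their order matches
labels = edibleclass.index.tolist()
sizes = edibleclass['habitat'].tolist()

print(labels)
print(sizes)
```

These two lists could then be passed to a pie chart call such as `plt.pie(sizes, labels=labels)`, so each wedge is guaranteed to carry the label of the class it counts.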

@ -28,7 +28,7 @@ Defining the goals of the project will require deeper context into the problem o
Questions a data scientist may ask:
- Has this problem been approached before? What was discovered?
- Is the purpose and goal understood by all involved?
- Where is there ambiguity and how to reduce it?
- Is there ambiguity and how to reduce it?
- What are the constraints?
- What will the end result potentially look like?
- How much resources (time, people, computational) are available?
@ -62,16 +62,18 @@ Considerations of how and where the data is stored can influence the cost of its
Here's some aspects of modern data storage systems that can affect these choices:
**On premise vs off premise vs public or private cloud**
On premise refers to hosting managing the data on your own equipment, like owning a server with hard drives that store the data, while off premise relies on equipment that you don't own, such as a data center. The public cloud is a popular choice for storing data that requires no knowledge of how or where exactly the data is stored, where public refers to a unified underlying infrastructure that is shared by all who use the cloud. Some organizations have strict security policies that require that they have complete access to the equipment where the data is hosted and will rely on a private cloud that provides its own cloud services. You'll learn more about data in the cloud in [later lessons](5-Data-Science-In-Cloud).
**Cold vs hot data**
On premise refers to hosting managing the data on your own equipment, like owning a server with hard drives that store the data, while off premise relies on equipment that you don't own, such as a data center. The public cloud is a popular choice for storing data that requires no knowledge of how or where exactly the data is stored, where public refers to a unified underlying infrastructure that is shared by all who use the cloud. Some organizations have strict security policies that require that they have complete access to the equipment where the data is hosted and will rely on a private cloud that provides its own cloud services. You'll learn more about data in the cloud in [later lessons](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/5-Data-Science-In-Cloud).
**Cold vs hot data**
When training your models, you may require more training data. If you're content with your model, more data will arrive for a model to serve its purpose. In any case the cost of storing and accessing data will increase as you accumulate more of it. Separating rarely used data, known as cold data from frequently accessed hot data can be a cheaper data storage option through hardware or software services. If cold data needs to be accessed, it may take a little longer to retrieve in comparison to hot data.
### Managing Data
As you work with data you may discover that some of the data needs to be cleaned using some of the techniques covered in the lesson focused on [data preparation](2-Working-With-Data\08-data-preparation) to build accurate models. When new data arrives, it will need some of the same applications to maintain consistency in quality. Some projects will involve use of an automated tool for cleansing, aggregation, and compression before the data is moved to its final location. Azure Data Factory is an example of one of these tools.
As you work with data you may discover that some of the data needs to be cleaned using some of the techniques covered in the lesson focused on [data preparation](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/2-Working-With-Data/08-data-preparation) to build accurate models. When new data arrives, it will need some of the same applications to maintain consistency in quality. Some projects will involve use of an automated tool for cleansing, aggregation, and compression before the data is moved to its final location. Azure Data Factory is an example of one of these tools.
### Securing the Data
One of the main goals of securing data is ensuring that those working it are in control of what is collected and in what context it is being used. Keeping data secure involves limiting access to only those who need it, adhering to local laws and regulations, as well as maintaining ethical standards, as covered in the [ethics lesson](1-Introduction\02-ethics).
One of the main goals of securing data is ensuring that those working it are in control of what is collected and in what context it is being used. Keeping data secure involves limiting access to only those who need it, adhering to local laws and regulations, as well as maintaining ethical standards, as covered in the [ethics lesson](https://github.com/microsoft/Data-Science-For-Beginners/tree/main/1-Introduction/02-ethics).
Here's some things that a team may do with security in mind:
- Confirm that all data is encrypted

@ -0,0 +1,23 @@
# Assessing a Dataset
A client has approached your team for help in investigating a taxi customer's seasonal spending habits in New York City.
They want to know: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
Your team is in the [Capturing](Readme.md#Capturing) stage of the Data Science Lifecycle and you are in charge of handling the dataset. You have been provided a notebook and [data](../../data/taxi.csv) to explore.
In this directory is a [notebook](notebook.ipynb) that uses Python to load yellow taxi trip data from the [NYC Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets).
You can also open the taxi data file in text editor or spreadsheet software like Excel.
## Instructions
- Assess whether or not the data in this dataset can help answer the question.
- Explore the [NYC Open Data catalog](https://data.cityofnewyork.us/browse?sortBy=most_accessed&utf8=%E2%9C%93). Identify an additional dataset that could potentially be helpful in answering the client's question.
- Write 3 questions that you would ask the client for more clarification and better understanding of the problem.
Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more information about the data.
## Rubric
Exemplary | Adequate | Needs Improvement
--- | --- | -- |

@ -50,7 +50,7 @@ Azure ML provides all the tools developers and data scientists need for their ma
### 1.2 The Heart Failure Prediction Project:
There is no doubt that making and building projects is the best to put your skills and knowledge to test. In this lesson, we are going to explore two different ways of building a data science project for the prediction of heart failure attacks in Azure ML Studio, through Low code/No code and through the Azure ML SDK as shown in the following schema:
There is no doubt that making and building projects is the best way to put your skills and knowledge to the test. In this lesson, we are going to explore two different ways of building a data science project for the prediction of heart failure attacks in Azure ML Studio, through Low code/No code and through the Azure ML SDK as shown in the following schema:
![project-schema](images/project-schema.PNG)
