Delete 2-Working-With-Data/08-python directory

Jasmine Greenaway 3 years ago committed by GitHub
parent 7826170f20
commit 6aa3bac052

@ -1,92 +0,0 @@
# Working with Data: Python and the Pandas Library
While databases offer very efficient ways to store data and query it using query languages, the most flexible way of processing data is writing your own program to manipulate it. In many cases a database query is the more effective approach, but sometimes you need more complex data processing that cannot easily be expressed in SQL.
Data processing can be programmed in any programming language, but certain languages are higher level with respect to working with data. Data scientists typically prefer one of the following languages:
* **[Python](https://www.python.org/)**, a general-purpose programming language, which is often considered one of the best options for beginners due to its simplicity. Python has many additional libraries that can help you solve practical problems, such as extracting data from a ZIP archive or converting a picture to grayscale. In addition to data science, Python is also often used for web development.
* **[R](https://www.r-project.org/)** is a traditional toolbox developed with statistical data processing in mind. It also contains a large repository of libraries (CRAN), making it a good choice for data processing. However, R is not a general-purpose programming language and is rarely used outside the data science domain.
* **[Julia](https://julialang.org/)** is another language developed specifically for data science. It is intended to give better performance than Python, making it a great tool for scientific experimentation.
In this lesson, we will focus on using Python for simple data processing. We will assume basic familiarity with the language. If you want a deeper tour of Python, you can refer to one of the following resources:
* [Learn Python in a Fun Way with Turtle Graphics and Fractals](https://github.com/shwars/pycourse) - a GitHub-based quick intro course on Python programming
* [Take your First Steps with Python](https://docs.microsoft.com/en-us/learn/paths/python-first-steps/?WT.mc_id=acad-31812-dmitryso) Learning Path on [Microsoft Learn](http://learn.microsoft.com/?WT.mc_id=acad-31812-dmitryso)
Data can come in many forms. In this lesson, we will consider three forms of data - **tabular data**, **text** and **images**.
We will focus on a few examples of data processing instead of giving you a full overview of all related libraries. This will give you the main idea of what's possible, and an understanding of where to find solutions to your problems when you need them.
> **Most useful advice**: when you need to perform a certain operation on data and do not know how to do it, try searching for it on the internet. [Stackoverflow](https://stackoverflow.com/) usually contains many useful Python code samples for typical tasks.
## Pre-Lecture Quiz
[Pre-lecture quiz]()
## Tabular Data and Dataframes
You have already met tabular data when we talked about relational databases. When you have a lot of data, contained in many different linked tables, it definitely makes sense to use SQL for working with it. However, there are many cases when we have a single table of data and need to gain some **understanding** or **insights** about it, such as the distribution or the correlation between values. In data science, we often need to perform some transformations of the original data, followed by visualization. Both of those steps can easily be done using Python.
Two of the most useful Python libraries for dealing with tabular data are:
* **[Pandas](https://pandas.pydata.org/)** allows you to manipulate so-called **Dataframes**, which are analogous to relational tables. You can have named columns, and perform different operations on rows, columns and dataframes in general.
* **[Numpy](https://numpy.org/)** is a library for working with **tensors**, i.e. multi-dimensional **arrays**. An array has values of the same underlying type; it is simpler than a dataframe, but it offers more mathematical operations and creates less overhead.
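As a minimal sketch of the difference: a numpy array is indexed purely by position, while a pandas series carries a named index on top of the values.

```python
import numpy as np
import pandas as pd

# A numpy array: homogeneous values, positional indexing, fast vectorized math
arr = np.array([1, 2, 3, 4])
print(arr * 2)   # [2 4 6 8]

# A pandas Series adds a named index on top of the values
s = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
print(s["b"])    # 2
```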
There are also a couple of other libraries you should know about:
* **[Matplotlib](https://matplotlib.org/)** is a library used for data visualization and plotting graphs
* **[SciPy](https://www.scipy.org/)** is a library with some additional scientific functions. We have already come across this library when talking about probability and statistics
Here is a piece of code that you would typically use to import those libraries at the beginning of your Python program:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import ... # you need to specify exact sub-packages that you need
```
Pandas is centered around the following basic concepts:
**Series** is a sequence of values, similar to a list or numpy array. The main difference is that a series also has an **index**, and when we operate on series (e.g., add them), the index is taken into account. The index can be as simple as an integer row number (it is the index used by default when creating a series from a list or array), or it can have a complex structure, such as a date interval.
> **Note**: There is some introductory Pandas code in the accompanying notebook [`notebook.ipynb`](notebook.ipynb). We only outline some of the examples here, and you are definitely welcome to check out the full notebook.
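For instance (a minimal sketch), adding two series aligns values by index label rather than by position, and labels present in only one of the series produce `NaN`:

```python
import pandas as pd

# Two series with overlapping but different indices
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])

# Addition aligns on index labels, not positions
print(a + b)
# x     NaN
# y    12.0
# z    23.0
# dtype: float64
```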
Consider an example: we want to analyze sales of our ice-cream spot. Let's generate a series of sales numbers (number of items sold each day) for some time period:
```python
start_date = "Jan 1, 2020"
end_date = "Mar 31, 2020"
idx = pd.date_range(start_date,end_date)
print(f"Length of index is {len(idx)}")
items_sold = pd.Series(np.random.randint(25,50,size=len(idx)),index=idx)
items_sold.plot()
```
![Time Series Plot](images/timeseries-1.png)
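A quick sanity check on such a series (a sketch, rebuilding the series the same way as above) is to look at its length and summary statistics:

```python
import numpy as np
import pandas as pd

# Rebuild the daily sales series: one random value per day in Q1 2020
idx = pd.date_range("Jan 1, 2020", "Mar 31, 2020")
items_sold = pd.Series(np.random.randint(25, 50, size=len(idx)), index=idx)

print(len(items_sold))        # 91 days in Jan-Mar 2020
print(items_sold.describe())  # count, mean, min, max, quartiles
```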
Now suppose that each week we organize a party for friends, and take an additional 10 packs of ice-cream for the party. We can create another series, indexed by week, to demonstrate this:
```python
additional_items = pd.Series(10,index=pd.date_range(start_date,end_date,freq="W"))
```
When we add the two series together, we get the total number:
```python
total_items = items_sold.add(additional_items,fill_value=0)
total_items.plot()
```
![Time Series Plot](images/timeseries-2.png)
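To look at the data at a coarser granularity, you can resample a date-indexed series, e.g. into weekly totals (a sketch, assuming a series built as above):

```python
import numpy as np
import pandas as pd

# A date-indexed series of daily totals, as in the example above
idx = pd.date_range("Jan 1, 2020", "Mar 31, 2020")
total_items = pd.Series(np.random.randint(25, 60, size=len(idx)), index=idx)

# Resample daily data to weekly frequency, summing the values in each week
weekly = total_items.resample("W").sum()
print(weekly.head())
```

Resampling preserves the overall total: the sum of the weekly values equals the sum of the daily ones.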
## 🚀 Challenge
The first problem we will focus on is modelling the epidemic spread of COVID-19. To do that, we will use data on the number of infected individuals in different countries, provided by the [Center for Systems Science and Engineering](https://systems.jhu.edu/) (CSSE) at [Johns Hopkins University](https://jhu.edu/). The dataset is available in [this GitHub Repository](https://github.com/CSSEGISandData/COVID-19).
Since we want to demonstrate how to deal with data, we invite you to open [`notebook-pandas.ipynb`](notebook-pandas.ipynb) and read it from top to bottom. You can also execute the cells and do some of the challenges that we have left for you along the way.
## Post-Lecture Quiz
[Post-lecture quiz]()
## Review & Self Study
* [Wes McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython](https://www.amazon.com/gp/product/1491957662)
## Assignment
[Assignment Title](assignment.md)

@ -1,287 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Using Python for Data Processing\r\n",
"\r\n",
"## Tabular Data\r\n",
"\r\n",
"We will use data on COVID-19 infected individuals, provided by the [Center for Systems Science and Engineering](https://systems.jhu.edu/) (CSSE) at [Johns Hopkins University](https://jhu.edu/). The dataset is available in [this GitHub Repository](https://github.com/CSSEGISandData/COVID-19)."
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 1,
"source": [
"import numpy as np\r\n",
"import pandas as pd"
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"We can load the most recent data directly from GitHub using `pd.read_csv`. If for some reason the data is not available, you can always use the copy available locally in the `data` folder - just uncomment the line below:"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 4,
"source": [
"base_url = \"https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/\" # loading from Internet\r\n",
"# base_url = \"../../data/COVID/\" # loading from disk\r\n",
"infected_dataset_url = base_url + \"time_series_covid19_confirmed_global.csv\"\r\n",
"recovered_dataset_url = base_url + \"time_series_covid19_recovered_global.csv\"\r\n",
"deaths_dataset_url = base_url + \"time_series_covid19_deaths_global.csv\"\r\n",
"countries_dataset_url = base_url + \"../UID_ISO_FIPS_LookUp_Table.csv\""
],
"outputs": [],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": 5,
"source": [
"infected = pd.read_csv(infected_dataset_url)\r\n",
"infected.head()"
],
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
" Province/State Country/Region Lat Long 1/22/20 1/23/20 \\\n",
"0 NaN Afghanistan 33.93911 67.709953 0 0 \n",
"1 NaN Albania 41.15330 20.168300 0 0 \n",
"2 NaN Algeria 28.03390 1.659600 0 0 \n",
"3 NaN Andorra 42.50630 1.521800 0 0 \n",
"4 NaN Angola -11.20270 17.873900 0 0 \n",
"\n",
" 1/24/20 1/25/20 1/26/20 1/27/20 ... 8/14/21 8/15/21 8/16/21 \\\n",
"0 0 0 0 0 ... 151770 151770 152142 \n",
"1 0 0 0 0 ... 135550 135947 136147 \n",
"2 0 0 0 0 ... 186655 187258 187968 \n",
"3 0 0 0 0 ... 14924 14924 14954 \n",
"4 0 0 0 0 ... 44534 44617 44739 \n",
"\n",
" 8/17/21 8/18/21 8/19/21 8/20/21 8/21/21 8/22/21 8/23/21 \n",
"0 152243 152363 152411 152448 152448 152448 152583 \n",
"1 136598 137075 137597 138132 138790 139324 139721 \n",
"2 188663 189384 190078 190656 191171 191583 192089 \n",
"3 14960 14976 14981 14988 14988 14988 15002 \n",
"4 44972 45175 45325 45583 45817 45945 46076 \n",
"\n",
"[5 rows x 584 columns]"
],
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Province/State</th>\n",
" <th>Country/Region</th>\n",
" <th>Lat</th>\n",
" <th>Long</th>\n",
" <th>1/22/20</th>\n",
" <th>1/23/20</th>\n",
" <th>1/24/20</th>\n",
" <th>1/25/20</th>\n",
" <th>1/26/20</th>\n",
" <th>1/27/20</th>\n",
" <th>...</th>\n",
" <th>8/14/21</th>\n",
" <th>8/15/21</th>\n",
" <th>8/16/21</th>\n",
" <th>8/17/21</th>\n",
" <th>8/18/21</th>\n",
" <th>8/19/21</th>\n",
" <th>8/20/21</th>\n",
" <th>8/21/21</th>\n",
" <th>8/22/21</th>\n",
" <th>8/23/21</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>NaN</td>\n",
" <td>Afghanistan</td>\n",
" <td>33.93911</td>\n",
" <td>67.709953</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>151770</td>\n",
" <td>151770</td>\n",
" <td>152142</td>\n",
" <td>152243</td>\n",
" <td>152363</td>\n",
" <td>152411</td>\n",
" <td>152448</td>\n",
" <td>152448</td>\n",
" <td>152448</td>\n",
" <td>152583</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>NaN</td>\n",
" <td>Albania</td>\n",
" <td>41.15330</td>\n",
" <td>20.168300</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>135550</td>\n",
" <td>135947</td>\n",
" <td>136147</td>\n",
" <td>136598</td>\n",
" <td>137075</td>\n",
" <td>137597</td>\n",
" <td>138132</td>\n",
" <td>138790</td>\n",
" <td>139324</td>\n",
" <td>139721</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>Algeria</td>\n",
" <td>28.03390</td>\n",
" <td>1.659600</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>186655</td>\n",
" <td>187258</td>\n",
" <td>187968</td>\n",
" <td>188663</td>\n",
" <td>189384</td>\n",
" <td>190078</td>\n",
" <td>190656</td>\n",
" <td>191171</td>\n",
" <td>191583</td>\n",
" <td>192089</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>Andorra</td>\n",
" <td>42.50630</td>\n",
" <td>1.521800</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>14924</td>\n",
" <td>14924</td>\n",
" <td>14954</td>\n",
" <td>14960</td>\n",
" <td>14976</td>\n",
" <td>14981</td>\n",
" <td>14988</td>\n",
" <td>14988</td>\n",
" <td>14988</td>\n",
" <td>15002</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>NaN</td>\n",
" <td>Angola</td>\n",
" <td>-11.20270</td>\n",
" <td>17.873900</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>44534</td>\n",
" <td>44617</td>\n",
" <td>44739</td>\n",
" <td>44972</td>\n",
" <td>45175</td>\n",
" <td>45325</td>\n",
" <td>45583</td>\n",
" <td>45817</td>\n",
" <td>45945</td>\n",
" <td>46076</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 584 columns</p>\n",
"</div>"
]
},
"metadata": {},
"execution_count": 5
}
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [],
"outputs": [],
"metadata": {}
}
],
"metadata": {
"orig_nbformat": 4,
"language_info": {
"name": "python",
"version": "3.8.8",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3.8.8 64-bit (conda)"
},
"interpreter": {
"hash": "86193a1ab0ba47eac1c69c1756090baa3b420b3eea7d4aafab8b85f8b312f0c5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
