analysis assignment

pull/117/head
Jasmine 4 years ago
parent cf4ada2020
commit 2aa3d8bd2d

@ -12,9 +12,10 @@ You can also open the taxi data file in text editor or spreadsheet software like
## Instructions
- Assess whether or not the data in this dataset can help answer the question.
- Explore the [NYC Open Data catalog](https://data.cityofnewyork.us/browse?sortBy=most_accessed&utf8=%E2%9C%93). Identify an additional dataset that could potentially be helpful in answering the client's question.
- Write 3 questions that you would ask the client for more clarification and better understanding of the problem.
Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more information about the data.
## Rubric

@ -3,7 +3,7 @@
{
"cell_type": "markdown",
"source": [
"# NYC Taxi data in Winter and Summer\r\n",
"\r\n",
"Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to learn more about the columns that have been provided.\r\n"
],
@ -13,6 +13,7 @@
"cell_type": "code", "cell_type": "code",
"execution_count": null, "execution_count": null,
"source": [ "source": [
"#Install the pandas library\r\n",
"!pip install pandas" "!pip install pandas"
], ],
"outputs": [], "outputs": [],
@ -25,10 +26,13 @@
"execution_count": 7, "execution_count": 7,
"source": [ "source": [
"import pandas as pd\r\n", "import pandas as pd\r\n",
"import glob\r\n",
"\r\n", "\r\n",
"path = '../../data/taxi.csv'\r\n", "path = '../../data/taxi.csv'\r\n",
"\r\n",
"#Load the csv file into a dataframe\r\n",
"df = pd.read_csv(path)\r\n", "df = pd.read_csv(path)\r\n",
"\r\n",
"#Print the dataframe\r\n",
"print(df)\r\n" "print(df)\r\n"
], ],
"outputs": [ "outputs": [

@ -28,10 +28,10 @@ In a few of the previous lessons, we have used Pandas to provide some descriptiv
## Sampling and Querying
Exploring everything in a large dataset can be very time-consuming, and it's a task that's usually left up to a computer. However, sampling is a helpful tool for understanding what's in the dataset and what it represents. With a sample, you can apply probability and statistics to come to some general conclusions about your data. While there's no defined rule on how much data you should sample, it's important to note that the more data you sample, the more precise a generalization you can make about the data.
Pandas has the [`sample()` function in its library](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) where you can pass an argument for how many random samples you'd like to receive and use.
General querying of the data can help you answer general questions and test theories you may have. In contrast to sampling, queries give you control and let you focus on specific parts of the data you have questions about.
The [`query()` function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html) in the Pandas library allows you to select columns and receive simple answers about the data through the rows retrieved.
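For example, here is a minimal sketch of how `sample()` and `query()` might be used on the taxi data loaded in the notebook above (the CSV path and column names come from the provided dataset; the $5 threshold is just an illustration):

```python
import pandas as pd

# Load the taxi trip data used in this lesson
df = pd.read_csv('../../data/taxi.csv')

# Draw 10 random rows to get a quick feel for what the data looks like
print(df.sample(10))

# Query for trips where the passenger tipped more than $5
big_tips = df.query('tip_amount > 5')
print(big_tips[['tpep_pickup_datetime', 'trip_distance', 'tip_amount']])
```

Note that `query()` takes a string expression evaluated against the column names, so `tip_amount > 5` refers directly to the `tip_amount` column of the dataframe.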
## Exploring with Visualizations
You don't have to wait until the data is thoroughly cleaned and analyzed to start creating visualizations. In fact, having a visual representation while exploring can help identify patterns, relationships, and problems in the data. Furthermore, visualizations provide a means of communication with those who are not involved with managing the data and can be an opportunity to share and clarify additional questions that were not addressed in the capture stage. Refer to the [section on Visualizations](3-Data-Visualization) to learn more about some popular ways to explore visually.
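As a rough illustration (assuming `matplotlib` is installed alongside Pandas, and using the `tip_amount` column from the taxi data), a quick histogram can surface skew and outliers before any formal analysis:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('../../data/taxi.csv')

# A quick histogram of tip amounts can reveal skew and outliers early on
df['tip_amount'].plot(kind='hist', bins=20, title='Distribution of tip amounts')
plt.xlabel('Tip amount (USD)')
plt.show()
```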
@ -44,10 +44,6 @@ All the topics in this lesson can help identify missing or inconsistent values,
## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/27)
## Review & Self Study
## Assignment
[Assignment Title](assignment.md)

@ -0,0 +1,141 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# NYC Taxi data in Winter and Summer\r\n",
"\r\n",
"Refer to the [Data dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) to learn more about the columns that have been provided.\r\n"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [
"#Install the pandas library\r\n",
"!pip install pandas"
],
"outputs": [],
"metadata": {
"scrolled": true
}
},
{
"cell_type": "code",
"execution_count": 7,
"source": [
"import pandas as pd\r\n",
"\r\n",
"path = '../../data/taxi.csv'\r\n",
"\r\n",
"#Load the csv file into a dataframe\r\n",
"df = pd.read_csv(path)\r\n",
"\r\n",
"#Print the dataframe\r\n",
"print(df)\r\n"
],
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": [
" VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count \\\n",
"0 2.0 2019-07-15 16:27:53 2019-07-15 16:44:21 3.0 \n",
"1 2.0 2019-07-17 20:26:35 2019-07-17 20:40:09 6.0 \n",
"2 2.0 2019-07-06 16:01:08 2019-07-06 16:10:25 1.0 \n",
"3 1.0 2019-07-18 22:32:23 2019-07-18 22:35:08 1.0 \n",
"4 2.0 2019-07-19 14:54:29 2019-07-19 15:19:08 1.0 \n",
".. ... ... ... ... \n",
"195 2.0 2019-01-18 08:42:15 2019-01-18 08:56:57 1.0 \n",
"196 1.0 2019-01-19 04:34:45 2019-01-19 04:43:44 1.0 \n",
"197 2.0 2019-01-05 10:37:39 2019-01-05 10:42:03 1.0 \n",
"198 2.0 2019-01-23 10:36:29 2019-01-23 10:44:34 2.0 \n",
"199 2.0 2019-01-30 06:55:58 2019-01-30 07:07:02 5.0 \n",
"\n",
" trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID \\\n",
"0 2.02 1.0 N 186 233 \n",
"1 1.59 1.0 N 141 161 \n",
"2 1.69 1.0 N 246 249 \n",
"3 0.90 1.0 N 229 141 \n",
"4 4.79 1.0 N 237 107 \n",
".. ... ... ... ... ... \n",
"195 1.18 1.0 N 43 237 \n",
"196 2.30 1.0 N 148 234 \n",
"197 0.83 1.0 N 237 263 \n",
"198 1.12 1.0 N 144 113 \n",
"199 2.41 1.0 N 209 107 \n",
"\n",
" payment_type fare_amount extra mta_tax tip_amount tolls_amount \\\n",
"0 1.0 12.0 1.0 0.5 4.08 0.0 \n",
"1 2.0 10.0 0.5 0.5 0.00 0.0 \n",
"2 2.0 8.5 0.0 0.5 0.00 0.0 \n",
"3 1.0 4.5 3.0 0.5 1.65 0.0 \n",
"4 1.0 19.5 0.0 0.5 5.70 0.0 \n",
".. ... ... ... ... ... ... \n",
"195 1.0 10.0 0.0 0.5 2.16 0.0 \n",
"196 1.0 9.5 0.5 0.5 2.15 0.0 \n",
"197 1.0 5.0 0.0 0.5 1.16 0.0 \n",
"198 2.0 7.0 0.0 0.5 0.00 0.0 \n",
"199 1.0 10.5 0.0 0.5 1.00 0.0 \n",
"\n",
" improvement_surcharge total_amount congestion_surcharge \n",
"0 0.3 20.38 2.5 \n",
"1 0.3 13.80 2.5 \n",
"2 0.3 11.80 2.5 \n",
"3 0.3 9.95 2.5 \n",
"4 0.3 28.50 2.5 \n",
".. ... ... ... \n",
"195 0.3 12.96 0.0 \n",
"196 0.3 12.95 0.0 \n",
"197 0.3 6.96 0.0 \n",
"198 0.3 7.80 0.0 \n",
"199 0.3 12.30 0.0 \n",
"\n",
"[200 rows x 18 columns]\n"
]
}
],
"metadata": {}
},
{
"cell_type": "markdown",
"source": [
"# Use the cells below to do your own Exploratory Data Analysis"
],
"metadata": {}
},
{
"cell_type": "code",
"execution_count": null,
"source": [],
"outputs": [],
"metadata": {}
}
],
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3.9.7 64-bit ('venv': venv)"
},
"language_info": {
"mimetype": "text/x-python",
"name": "python",
"pygments_lexer": "ipython3",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"version": "3.9.7",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"name": "04-nyc-taxi-join-weather-in-pandas",
"notebookId": 1709144033725344,
"interpreter": {
"hash": "6b9b57232c4b57163d057191678da2030059e733b8becc68f245de5a75abe84e"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

@ -1,16 +1,22 @@
# Exploring for answers
This is a continuation of the previous lesson's [assignment](../14-Introduction/assignment.md), where we briefly took a look at the dataset. Now we will take a deeper look at the data.
As a reminder, the question the client wants answered is: **Do yellow taxi passengers in New York City tip drivers more in the winter or summer?**
Your team is in the [Analyzing](Readme.md) stage of the Data Science Lifecycle, where you are responsible for doing exploratory data analysis on the dataset. You have been provided a notebook and a dataset that contains 200 taxi transactions from January and July 2019.
## Instructions
In this directory is a [notebook](notebook.ipynb) and data from the [Taxi & Limousine Commission](https://docs.microsoft.com/en-us/azure/open-datasets/dataset-taxi-yellow?tabs=azureml-opendatasets). Refer to the [dataset's dictionary](https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf) and [user guide](https://www1.nyc.gov/assets/tlc/downloads/pdf/trip_record_user_guide.pdf) for more information about the data.
Use some of the techniques in this lesson to do your own EDA in the notebook (add cells if you'd like) and answer the following questions (a small starter sketch follows below):
- What other influences in the data could affect the tip amount?
- What columns will most likely not be needed to answer the client's questions?
- Based on what has been provided so far, does the data seem to provide any evidence of seasonal tipping behavior?
Your task is to ___
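If you're unsure where to begin, here is a minimal, non-authoritative sketch of one way to compare tips across the two months in the provided sample (the CSV path is the one used in the provided notebook; the mean is just one possible summary statistic):

```python
import pandas as pd

# Load the 200-row sample used in the notebook
df = pd.read_csv('../../data/taxi.csv')

# Parse pickup timestamps so January (winter) and July (summer) trips can be separated
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['month'] = df['tpep_pickup_datetime'].dt.month

# Compare the average tip amount per month as a first, rough signal
print(df.groupby('month')['tip_amount'].mean())
```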
## Rubric
