diff --git a/1-Introduction/01-defining-data-science/README.md b/1-Introduction/01-defining-data-science/README.md
index 544f698b..2795302a 100644
--- a/1-Introduction/01-defining-data-science/README.md
+++ b/1-Introduction/01-defining-data-science/README.md
@@ -147,6 +147,7 @@ In this challenge, we will try to find concepts relevant to the field of Data Sc
 
 [Post-lecture quiz]()
 
-## Assignment
+## Assignments
 
-[Think About Data Science Scenarios](assignment.md)
+* **Task 1**: Modify the code above to find related concepts for the fields of **Big Data** and **Machine Learning**
+* **Task 2**: [Think About Data Science Scenarios](assignment.md)
diff --git a/1-Introduction/01-defining-data-science/solution/assignment.md b/1-Introduction/01-defining-data-science/solution/assignment.md
new file mode 100644
index 00000000..875a1c00
--- /dev/null
+++ b/1-Introduction/01-defining-data-science/solution/assignment.md
@@ -0,0 +1,33 @@
+# Assignment: Data Science Scenarios
+
+In this first assignment, we ask you to think about some real-life process or problem in different problem domains, and how you can improve it using the Data Science process. Think about the following:
+
+1. Which data can you collect?
+1. How would you collect it?
+1. How would you store the data? How large is the data likely to be?
+1. Which insights might you be able to get from this data? Which decisions would you be able to make based on the data?
+
+Try to think about 3 different problems/processes and describe each of the points above for each problem domain.
+
+Here are some problem domains and problems that can get you started:
+
+1. How can you use data to improve the education process for children in schools?
+1. How can you use data to control vaccination during the pandemic?
+1. How can you use data to make sure you are being productive at work?
+## Instructions
+
+Fill in the following table (substitute your own problem domains for the suggested ones if needed):
+
+| Problem Domain | Problem | Which data to collect | How to store the data | Which insights/decisions we can make |
+|----------------|---------|-----------------------|-----------------------|--------------------------------------|
+| Education | In university, we typically have low attendance at lectures, and we have the hypothesis that students who attend lectures do better on average during exams. We want to stimulate attendance and test the hypothesis. | We can track attendance through pictures taken by the security camera in class, or by tracking bluetooth/wifi addresses of student mobile phones in class. Exam data is already available in the university database. | If we track security camera images, we need to store a few (5-10) photographs during class (unstructured data), and then use AI to identify faces of students (convert data to structured form). | We can compute average attendance data for each student, and see if there is any correlation with exam grades. We will talk more about correlation in the [probability and statistics](../../04-stats-and-probability/README.md) section. In order to stimulate student attendance, we can publish the weekly attendance rating on the school portal, and draw prizes among those with the highest attendance. |
+| Vaccination | | | | |
+| Productivity | | | | |
+
+> *We provide just one answer as an example, so that you can get an idea of what is expected in this assignment.*
+
+## Rubric
+
+Exemplary | Adequate | Needs Improvement
+--- | --- | --- |
+One was able to identify reasonable data sources, ways of storing data and possible decisions/insights for all problem domains | Some aspects of the solution are not detailed, data storage is not discussed, at least 2 problem domains are described | Only parts of the data solution are described, only one problem domain is considered.
diff --git a/1-Introduction/01-defining-data-science/solution/notebook.ipynb b/1-Introduction/01-defining-data-science/solution/notebook.ipynb
index e69de29b..ac2c5524 100644
--- a/1-Introduction/01-defining-data-science/solution/notebook.ipynb
+++ b/1-Introduction/01-defining-data-science/solution/notebook.ipynb
@@ -0,0 +1,527 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "source": [
+    "# Challenge: Analyzing Text about Data Science\r\n",
+    "\r\n",
+    "> *In this notebook, we experiment with a different URL: the Wikipedia article on Machine Learning. You can see that, unlike Data Science, this article contains a lot of terms, thus making the analysis more problematic. We need to come up with another way to clean up the data after doing keyword extraction, to get rid of some frequent, but not meaningful word combinations.*\r\n",
+    "\r\n",
+    "In this example, let's do a simple exercise that covers all steps of a traditional data science process. You do not have to write any code; you can just click on the cells below to execute them and observe the result. As a challenge, you are encouraged to try this code out with different data. \r\n",
+    "\r\n",
+    "## Goal\r\n",
+    "\r\n",
+    "In this lesson, we have been discussing different concepts related to Data Science. Let's try to discover more related concepts by doing some **text mining**. We will start with a text about Data Science, extract keywords from it, and then try to visualize the result.\r\n",
+    "\r\n",
+    "As a text, I will use the page on Data Science from Wikipedia:"
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "source": [],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "source": [
+    "url = 'https://en.wikipedia.org/wiki/Data_science'\r\n",
+    "url = 'https://en.wikipedia.org/wiki/Machine_learning'"
+   ],
+   "outputs": [],
+   "metadata": {}
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Step 1: Getting the Data\r\n",
+    "\r\n",
+    "The first step in every data science process is getting the data. We will use the `requests` library to do that:"
+   ],
+   "metadata": {}
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "source": [
+    "import requests\r\n",
+    "\r\n",
+    "text = requests.get(url).content.decode('utf-8')\r\n",
+    "print(text[:1000])"
+   ],
+   "outputs": [
+    {
+     "output_type": "stream",
+     "name": "stdout",
+     "text": [
+      "\n",
+      "\n",
+      "\n",
+      "\n",
+      "Machine learning - Wikipedia\n",
+      "