diff --git a/.github/ISSUE_TEMPLATE/review_checklist.md b/.github/ISSUE_TEMPLATE/review_checklist.md new file mode 100644 index 00000000..f1f0124e --- /dev/null +++ b/.github/ISSUE_TEMPLATE/review_checklist.md @@ -0,0 +1,16 @@ +--- +name: Review Checklist +about: Reviewing curriculum lessons +title: '[Review]' +labels: '' +assignees: '' + +--- + +# This lesson has been reviewed and resolved of the following issues +- [ ] Typos +- [ ] Grammar errors +- [ ] Missing links +- [ ] Broken Images +- [ ] Checked for completeness +- [ ] Quiz (if no quiz assign to @paladique) diff --git a/.gitignore b/.gitignore index 70a91727..abcb05e8 100644 --- a/.gitignore +++ b/.gitignore @@ -307,6 +307,7 @@ paket-files/ # Python Tools for Visual Studio (PTVS) __pycache__/ *.pyc +venv/ # Cake - Uncomment if you are using it # tools/** @@ -350,3 +351,6 @@ MigrationBackup/ # Ionide (cross platform F# VS Code tools) working folder .ionide/ +4-Data-Science-Lifecycle/14-Introduction/README.md +.vscode/settings.json +Data/Taxi/* diff --git a/1-Introduction/01-defining-data-science/README.md b/1-Introduction/01-defining-data-science/README.md index 412ef753..873ad74f 100644 --- a/1-Introduction/01-defining-data-science/README.md +++ b/1-Introduction/01-defining-data-science/README.md @@ -1,27 +1,31 @@ # Defining Data Science -## Pre-Lecture Quiz +|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/01-Definitions.png)| +|:---:| +|Defining Data Science - _Sketchnote by [@nitya](https://twitter.com/nitya)_ | -[Pre-lecture quiz]() +--- -## What is Data? +[![Defining Data Science Video](images/video-def-ds.png)](https://youtu.be/pqqsm5reGvs) -In our everyday life, we are always surrounded by **data**. The text you are reading now is data, the list of phone numbers of your friends in your smartphone is data, as well as current time displayed on your watch. 
As human beings, we naturally operate with data, counting the amount of money we have, or writing letters to our friends. +## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/0) -However, data became much more important with the creation of **computers**. The main role of computers is to perform *computations*, but they need data to operate on. Thus, we need to understand how computers store and process data. +## What is Data? +In our everyday life, we are constantly surrounded by data. The text you are reading now is data, the list of phone numbers of your friends in your smartphone is data, as well as the current time displayed on your watch. As human beings, we naturally operate with data by counting the money we have or writing letters to our friends. -With the emergence of Internet, the role of computers as data handling devices increased. If you think of it, we now use computers more and more for data processing and communication, rather than actual computations. When we write an e-mail to a friend, or search some information on the Internet - we are essentially creating, storing, transmitting, and manipulating data. +However, data became much more critical with the creation of computers. The primary role of computers is to perform computations, but they need data to operate on. Thus, we need to understand how computers store and process data. +With the emergence of the Internet, the role of computers as data handling devices increased. If you think of it, we now use computers more and more for data processing and communication, rather than actual computations. When we write an e-mail to a friend or search for some information on the Internet - we are essentially creating, storing, transmitting, and manipulating data. > Can you remember the last time you have used computers to actually compute something? ## What is Data Science? 
-In [Wikipedia](https://en.wikipedia.org/wiki/Data_science), **Data Science** is defined as *scientific field that uses scientific methods to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains*. +In [Wikipedia](https://en.wikipedia.org/wiki/Data_science), **Data Science** is defined as *a scientific field that uses scientific methods to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains*. This definition highlights the following important aspects of data science: * The main goal of data science is to **extract knowledge** from data, in order words - to **understand** data, find some hidden relationships and build a **model**. -* Data science uses **scientific methods**, such as probability and statistics. In fact, when the term *data science* was first introduced, some people argued that data science is just a new fancy name for statistics. Nowadays it becomes evident that the field is much more broad. +* Data science uses **scientific methods**, such as probability and statistics. In fact, when the term *data science* was first introduced, some people argued that data science is just a new fancy name for statistics. Nowadays it has become evident that the field is much broader. * Obtained knowledge should be applied to produce some **actionable insights**. * We should be able to operate on both **structured** and **unstructured** data. We will come back to discuss different types of data later in the course. * **Application domain** is an important concept, and data scientist often needs at least some degree of expertise in the problem domain. 
@@ -36,7 +40,7 @@ One of the ways (attributed to [Jim Gray](https://en.wikipedia.org/wiki/Jim_Gray ## Other Related Fields -Since data is pervasive concept, data science itself is also a broad field, touching many other related disciplines. +Since data is a pervasive concept, data science itself is also a broad field, touching many other related disciplines.
Databases
@@ -63,21 +67,65 @@ Vast amounts of data are incomprehensible for a human being, but once we create ## Types of Data -As we have already mentioned - data is everywhere, we just need to capture it in the right way! It is useful to distinguish between **structured** and **unstructured** data - the former are typically represented in some well-structured form, often as a table or number of tables, while latter is just a collection of files. Sometimes we can also talk about **semistructured** data, that have some sort of a structure that may vary greatly. +As we have already mentioned - data is everywhere, we just need to capture it in the right way! It is useful to distinguish between **structured** and **unstructured** data. The former are typically represented in some well-structured form, often as a table or number of tables, while latter is just a collection of files. Sometimes we can also talk about **semistructured** data, that have some sort of a structure that may vary greatly. | Structured | Semi-structured | Unstructured | |----------- |-----------------|--------------| -| List of people with their phone numbers | Wikipedia pages with links | Text of Encyclopaedia Brittanica | +| List of people with their phone numbers | Wikipedia pages with links | Text of Encyclopaedia Britannica | | Temperature in all rooms of a building at every minute for the last 20 years | Collection of scientific papers in JSON format with authors, data of publication, and abstract | File share with corporate documents | -| Data for age and gender for all people entering the building | Internet pages | Raw video feed from surveillance camera | +| Data for age and gender of all people entering the building | Internet pages | Raw video feed from surveillance camera | ## Where to get Data There are many possible sources of data, and it will be impossible to list all of them! 
However, let's mention some of the typical places where you can get data: - + +* **Structured** + - **Internet of Things**, including data from different sensors, such as temperature or pressure sensors, provides a lot of useful data. For example, if an office building is equipped with IoT sensors, we can automatically control heating and lighting in order to minimize costs. + - **Surveys** that we ask users to complete after purchasing a product or visiting a web site. + - **Analysis of behavior** can, for example, help us understand how deeply a user goes into a site, and what the typical reason for leaving the site is. +* **Unstructured** + - **Texts** can be a rich source of insights, from an overall **sentiment score** to extracted keywords and even some semantic meaning. + - **Images** or **Video**. A video from a surveillance camera can be used to estimate traffic on the road and inform people about potential traffic jams. + - Web server **Logs** can be used to understand which pages of our site are most visited, and for how long. +* **Semi-structured** + - A **Social Network** graph can be a great source of data about user personality and potential effectiveness in spreading information around. + - When we have a bunch of photographs from a party, we can try to extract **Group Dynamics** data by building a graph of people taking pictures with each other. + +By knowing different possible sources of data, you can think about different scenarios where data science techniques can be applied to understand a situation better and to improve business processes. + ## What you can do with Data +In Data Science, we focus on the following steps of the data journey: +
+
1) Data Acquisition
+
+The first step is to collect the data. In many cases this is straightforward, such as data arriving in a database from a web application, but sometimes we need special techniques. For example, data from IoT sensors can be overwhelming, and it is good practice to use buffering endpoints such as IoT Hub to collect all the data before further processing. +
+
2) Data Storage
+
+Storing the data can be challenging, especially if we are talking about big data. When deciding how to store data, it makes sense to anticipate how you will want to query it later. There are several ways data can be stored: +
    +
  • A relational database stores a collection of tables and uses a special language called SQL to query them. Typically, tables are connected to each other according to some schema. In many cases we need to convert the data from its original form to fit the schema.
  • +
  • A NoSQL database, such as CosmosDB, does not enforce a schema on data and allows storing more complex data, for example, hierarchical JSON documents or graphs. However, NoSQL databases typically lack the rich querying capabilities of SQL and cannot enforce referential integrity between pieces of data.
  • +
  • Data Lake storage is used for large collections of data in raw form. Data lakes are often used with big data, where the data cannot fit on one machine and has to be stored and processed by a cluster. Parquet is a data format that is often used in conjunction with big data.
  • +
+
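The difference between the first two storage options can be sketched with nothing but the Python standard library. This is only an illustration under stated assumptions: the table names, columns, and sample values are hypothetical, SQLite stands in for a full relational database, and a plain JSON string stands in for a document that a NoSQL database would store.

```python
import json
import sqlite3

# Relational storage: data must fit a fixed schema, and tables are linked by keys.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE phones ("
    "person_id INTEGER NOT NULL REFERENCES people(id), number TEXT NOT NULL)"
)
conn.execute("INSERT INTO people VALUES (1, 'Alice')")
conn.executemany(
    "INSERT INTO phones VALUES (?, ?)", [(1, "555-0100"), (1, "555-0101")]
)

# SQL lets us combine both tables in a single query.
rows = conn.execute(
    "SELECT p.name, ph.number FROM people p "
    "JOIN phones ph ON ph.person_id = p.id ORDER BY ph.number"
).fetchall()

# Referential integrity: a phone number for a non-existent person is rejected.
try:
    conn.execute("INSERT INTO phones VALUES (99, '555-0199')")
    integrity_enforced = False
except sqlite3.IntegrityError:
    integrity_enforced = True

# Document-style storage: one hierarchical record, with no schema enforced.
document = json.dumps({"name": "Alice", "phones": ["555-0100", "555-0101"]})
restored = json.loads(document)
```

The relational version needed two tables and a join to represent one person with two phone numbers, while the document version keeps everything in one nested record but leaves consistency checks to the application.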
+
3) Data Processing
+
+This is the most exciting part of the data journey, which involves transforming the data from its original form into a form that can be used for visualization or model training. When dealing with unstructured data such as text or images, we may need to use some AI techniques to extract **features** from the data, thus converting it to structured form. +
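As a rough illustration of turning unstructured text into structured features, the sketch below counts how often each word from a small vocabulary occurs. The function name `word_count_features`, the vocabulary, and the sample sentence are all hypothetical; simple word counting stands in here for the more capable AI techniques mentioned above.

```python
import re
from collections import Counter


def word_count_features(text, vocabulary):
    """Turn free text into a fixed-length list of word counts."""
    # Lowercase the text and keep only runs of letters as words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # A Counter returns 0 for words that never occurred.
    return [counts[word] for word in vocabulary]


vocabulary = ["data", "science", "model"]
features = word_count_features(
    "Data science uses data to build a model.", vocabulary
)
```

Each text, whatever its length, becomes a row with the same number of columns, which is the structured form that visualization and model training expect.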
+
4) Visualization / Human Insights
+
+To understand the data, we often need to visualize it. With many different visualization techniques in our toolbox, we can find the right view to gain an insight. Often, a data scientist needs to "play with data", visualizing it many times and looking for relationships. We may also use techniques from statistics to test hypotheses or measure the correlation between different pieces of data. +
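One common way to quantify a relationship between two pieces of data is the Pearson correlation coefficient. The sketch below computes it by hand so that each step is visible; the function name `pearson_correlation` and the attendance and grade numbers are made up for illustration and do not come from any real dataset.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: lectures attended and exam grade for five students.
attendance = [2, 4, 6, 8, 10]
grades = [50, 60, 65, 80, 90]
r = pearson_correlation(attendance, grades)
```

A value of `r` close to 1 indicates a strong positive linear relationship in this sample. It does not by itself show that one quantity causes the other.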
+
5) Training predictive model
+
+Because the ultimate goal of data science is to be able to make decisions based on data, we may want to use Machine Learning techniques to build a predictive model that can solve our problem. +
+
+ +Of course, depending on the actual data, some steps might be missing (e.g., when we already have the data in the database, or when we do not need model training), or some steps might be repeated several times (such as data processing). ## Digitalization and Digital Transformation @@ -98,12 +146,20 @@ If we want to get even more complicated, we can plot the time taken for each mod In this challenge, we will try to find concepts relevant to the field of Data Science by looking at texts. We will take Wikipedia article on Data Science, download and process the text, and then build a word cloud like this one: ![Word Cloud for Data Science](images/ds_wordcloud.png) -## Post-Lecture Quiz -[Post-lecture quiz]() +Visit [`notebook.ipynb`](notebook.ipynb) to read through the code. You can also run the code, and see how it performs all data transformations in real time. + +> If you do not know how to run code in Jupyter Notebook, have a look at [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/). 
+ + + +## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/1) + +## Assignments -## Review & Self Study +* **Task 1**: Modify the code above to find out related concepts for the fields of **Big Data** and **Machine Learning** +* **Task 2**: [Think About Data Science Scenarios](assignment.md) -## Assignment +## Credits -[Assignment Title](assignment.md) +This lesson has been authored with ♥️ by [Dmitry Soshnikov](http://soshnikov.com) diff --git a/1-Introduction/01-defining-data-science/assignment.md b/1-Introduction/01-defining-data-science/assignment.md index b7af6412..d4b731d9 100644 --- a/1-Introduction/01-defining-data-science/assignment.md +++ b/1-Introduction/01-defining-data-science/assignment.md @@ -1,8 +1,31 @@ -# Title +# Assignment: Data Science Scenarios +In this first assignment, we ask you to think about some real-life process or problem in different problem domains, and how you can improve it using the Data Science process. Think about the following: + +1. Which data can you collect? +1. How would you collect it? +1. How would you store the data? How large the data is likely to be? +1. Which insights you might be able to get from this data? Which decisions we would be able to take based on the data? + +Try to think about 3 different problems/processes and describe each of the points above for each problem domain. + +Here are some of the problem domains and problems that can get you started thinking: + +1. How can you use data to improve education process for children in schools? +1. How can you use data to control vaccination during the pandemic? +1. How can you use data to make sure you are being productive at work? 
## Instructions +Fill in the following table (substitute suggested problem domains for your own ones if needed): + +| Problem Domain | Problem | Which data to collect | How to store the data | Which insights/decisions we can make | +|----------------|---------|-----------------------|-----------------------|--------------------------------------| +| Education | | | | | +| Vaccination | | | | | +| Productivity | | | | | + ## Rubric Exemplary | Adequate | Needs Improvement --- | --- | -- | +One was able to identify reasonable data sources, ways of storing data and possible decisions/insights for all problem domains | Some of the aspects of the solution are not detailed, data storage is not discussed, at least 2 problem domains are described | Only parts of the data solution are described, only one problem domain is considered. diff --git a/1-Introduction/01-defining-data-science/images/video-def-ds.png b/1-Introduction/01-defining-data-science/images/video-def-ds.png new file mode 100644 index 00000000..66c85207 Binary files /dev/null and b/1-Introduction/01-defining-data-science/images/video-def-ds.png differ diff --git a/1-Introduction/01-defining-data-science/solution/assignment.md b/1-Introduction/01-defining-data-science/solution/assignment.md new file mode 100644 index 00000000..875a1c00 --- /dev/null +++ b/1-Introduction/01-defining-data-science/solution/assignment.md @@ -0,0 +1,33 @@ +# Assignment: Data Science Scenarios + +In this first assignment, we ask you to think about some real-life process or problem in different problem domains, and how you can improve it using the Data Science process. Think about the following: + +1. Which data can you collect? +1. How would you collect it? +1. How would you store the data? How large the data is likely to be? +1. Which insights you might be able to get from this data? Which decisions we would be able to take based on the data? 
+ +Try to think about 3 different problems/processes and describe each of the points above for each problem domain. + +Here are some of the problem domains and problems that can get you started thinking: + +1. How can you use data to improve education process for children in schools? +1. How can you use data to control vaccination during the pandemic? +1. How can you use data to make sure you are being productive at work? +## Instructions + +Fill in the following table (substitute suggested problem domains for your own ones if needed): + +| Problem Domain | Problem | Which data to collect | How to store the data | Which insights/decisions we can make | +|----------------|---------|-----------------------|-----------------------|--------------------------------------| +| Education | In university, we typically have low attendance at lectures, and we have the hypothesis that students who attend lectures on average do better during exams. We want to stimulate attendance and test the hypothesis. | We can track attendance through pictures taken by the security camera in class, or by tracking bluetooth/wifi addresses of student mobile phones in class. Exam data is already available in the university database. | In case we track security camera images - we need to store a few (5-10) photographs during class (unstructured data), and then use AI to identify faces of students (convert data to structured form). | We can compute average attendance data for each student, and see if there is any correlation with exam grades. We will talk more about correlation in the [probability and statistics](../../04-stats-and-probability/README.md) section. In order to stimulate student attendance, we can publish the weekly attendance rating on the school portal, and draw prizes among those with the highest attendance. 
| +| Vaccination | | | | | +| Productivity | | | | | + +> *We provide just one answer as an example, so that you can get an idea of what is expected in this assignment.* + +## Rubric + +Exemplary | Adequate | Needs Improvement +--- | --- | -- | +One was able to identify reasonable data sources, ways of storing data and possible decisions/insights for all problem domains | Some of the aspects of the solution are not detailed, data storage is not discussed, at least 2 problem domains are described | Only parts of the data solution are described, only one problem domain is considered. diff --git a/1-Introduction/01-defining-data-science/solution/notebook.ipynb b/1-Introduction/01-defining-data-science/solution/notebook.ipynb index e69de29b..ac2c5524 100644 --- a/1-Introduction/01-defining-data-science/solution/notebook.ipynb +++ b/1-Introduction/01-defining-data-science/solution/notebook.ipynb @@ -0,0 +1,527 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "source": [ + "# Challenge: Analyzing Text about Data Science\r\n", + "\r\n", + "> *In this notebook, we experiment with using different URL - wikipedia article on Machine Learning. You can see that, unlike Data Science, this article contains a lot of terms, this making the analysis more problematic. We need to come up with another way to clean up the data after doing keyword extraction, to get rid of some frequent, but not meaningful word combinations.*\r\n", + "\r\n", + "In this example, let's do a simple exercise that covers all steps of a traditional data science process. You do not have to write any code, you can just click on the cells below to execute them and observe the result. As a challenge, you are encouraged to try this code out with different data. \r\n", + "\r\n", + "## Goal\r\n", + "\r\n", + "In this lesson, we have been discussing different concepts related to Data Science. Let's try to discover more related concepts by doing some **text mining**. 
We will start with a text about Data Science, extract keywords from it, and then try to visualize the result.\r\n", + "\r\n", + "As a text, I will use the page on Data Science from Wikipedia:" + ], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 2, + "source": [ + "url = 'https://en.wikipedia.org/wiki/Data_science'\r\n", + "url = 'https://en.wikipedia.org/wiki/Machine_learning'" + ], + "outputs": [], + "metadata": {} + }, + { + "cell_type": "markdown", + "source": [ + "## Step 1: Getting the Data\r\n", + "\r\n", + "First step in every data science process is getting the data. We will use `requests` library to do that:" + ], + "metadata": {} + }, + { + "cell_type": "code", + "execution_count": 3, + "source": [ + "import requests\r\n", + "\r\n", + "text = requests.get(url).content.decode('utf-8')\r\n", + "print(text[:1000])" + ], + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "\n", + "\n", + "\n", + "\n", + "Machine learning - Wikipedia\n", + " diff --git a/quiz-app/src/views/Home.vue b/quiz-app/src/views/Home.vue index 01ce25cc..4bf30f72 100644 --- a/quiz-app/src/views/Home.vue +++ b/quiz-app/src/views/Home.vue @@ -1,13 +1,17 @@ @@ -16,11 +20,32 @@ import messages from "@/assets/translations"; export default { name: "Home", + data() { + return { + locale: "", + }; + }, computed: { questions() { - return this.$t("quizzes"); + return messages; }, + currLocale() { + return this.$root.$i18n.locale; + } }, i18n: { messages }, + watch: { + locale(val) { + this.$root.$i18n.locale = val; + }, + }, + created() { + this.route = this.$route.params.id; + if (this.$route.query.loc) { + this.locale = this.$route.query.loc; + } else { + this.locale = "en"; + } + }, }; \ No newline at end of file diff --git a/sketchnotes/00-Roadmap.png b/sketchnotes/00-Roadmap.png new file mode 100644 index 00000000..3ca46064 Binary files /dev/null and 
b/sketchnotes/00-Roadmap.png differ diff --git a/sketchnotes/00-Title.png b/sketchnotes/00-Title.png new file mode 100644 index 00000000..013396b7 Binary files /dev/null and b/sketchnotes/00-Title.png differ diff --git a/sketchnotes/01-Definitions.png b/sketchnotes/01-Definitions.png new file mode 100644 index 00000000..16f08f28 Binary files /dev/null and b/sketchnotes/01-Definitions.png differ diff --git a/sketchnotes/02-Ethics.png b/sketchnotes/02-Ethics.png new file mode 100644 index 00000000..f227c05b Binary files /dev/null and b/sketchnotes/02-Ethics.png differ diff --git a/sketchnotes/03-DefiningData.png b/sketchnotes/03-DefiningData.png new file mode 100644 index 00000000..bd6127be Binary files /dev/null and b/sketchnotes/03-DefiningData.png differ diff --git a/sketchnotes/04-Statistics-Probability.png b/sketchnotes/04-Statistics-Probability.png new file mode 100644 index 00000000..5a7010be Binary files /dev/null and b/sketchnotes/04-Statistics-Probability.png differ diff --git a/sketchnotes/05-RelationalData.png b/sketchnotes/05-RelationalData.png new file mode 100644 index 00000000..9cb8c9bd Binary files /dev/null and b/sketchnotes/05-RelationalData.png differ diff --git a/sketchnotes/06-NoSQL.png b/sketchnotes/06-NoSQL.png new file mode 100644 index 00000000..184a741d Binary files /dev/null and b/sketchnotes/06-NoSQL.png differ diff --git a/sketchnotes/07-WorkWithPython.png b/sketchnotes/07-WorkWithPython.png new file mode 100644 index 00000000..29787765 Binary files /dev/null and b/sketchnotes/07-WorkWithPython.png differ diff --git a/sketchnotes/08-DataPreparation.png b/sketchnotes/08-DataPreparation.png new file mode 100644 index 00000000..b047ddf4 Binary files /dev/null and b/sketchnotes/08-DataPreparation.png differ diff --git a/sketchnotes/09-Visualizing-Quantities.png b/sketchnotes/09-Visualizing-Quantities.png new file mode 100644 index 00000000..27c9755b Binary files /dev/null and b/sketchnotes/09-Visualizing-Quantities.png differ diff --git 
a/sketchnotes/10-Visualizing-Distributions.png b/sketchnotes/10-Visualizing-Distributions.png new file mode 100644 index 00000000..44a815b9 Binary files /dev/null and b/sketchnotes/10-Visualizing-Distributions.png differ diff --git a/sketchnotes/11-Visualizing-Proportions.png b/sketchnotes/11-Visualizing-Proportions.png new file mode 100644 index 00000000..098c46ab Binary files /dev/null and b/sketchnotes/11-Visualizing-Proportions.png differ diff --git a/sketchnotes/12-Visualizing-Relationships.png b/sketchnotes/12-Visualizing-Relationships.png new file mode 100644 index 00000000..fc9fe6a5 Binary files /dev/null and b/sketchnotes/12-Visualizing-Relationships.png differ diff --git a/sketchnotes/13-MeaningfulViz.png b/sketchnotes/13-MeaningfulViz.png new file mode 100644 index 00000000..d37289df Binary files /dev/null and b/sketchnotes/13-MeaningfulViz.png differ diff --git a/sketchnotes/14-DataScience-Lifecycle.png b/sketchnotes/14-DataScience-Lifecycle.png new file mode 100644 index 00000000..0e2a16a3 Binary files /dev/null and b/sketchnotes/14-DataScience-Lifecycle.png differ diff --git a/sketchnotes/16-Communicating.png b/sketchnotes/16-Communicating.png new file mode 100644 index 00000000..d515b395 Binary files /dev/null and b/sketchnotes/16-Communicating.png differ diff --git a/sketchnotes/17-DataScience-Cloud.png b/sketchnotes/17-DataScience-Cloud.png new file mode 100644 index 00000000..8db543ab Binary files /dev/null and b/sketchnotes/17-DataScience-Cloud.png differ diff --git a/sketchnotes/18-DataScience-Cloud.png b/sketchnotes/18-DataScience-Cloud.png new file mode 100644 index 00000000..e55aeae7 Binary files /dev/null and b/sketchnotes/18-DataScience-Cloud.png differ diff --git a/sketchnotes/19-DataScience-Cloud.png b/sketchnotes/19-DataScience-Cloud.png new file mode 100644 index 00000000..b3b77696 Binary files /dev/null and b/sketchnotes/19-DataScience-Cloud.png differ