From e511c5c60a27910311f57da8cf8643b51caa173d Mon Sep 17 00:00:00 2001 From: Thoogend1 Date: Tue, 2 Nov 2021 15:18:43 +0100 Subject: [PATCH 01/13] Eerste opzet README vertaling --- .../translations/README.nl.md | 165 ++++++++++++++++++ 1 file changed, 165 insertions(+) create mode 100644 1-Introduction/01-defining-data-science/translations/README.nl.md diff --git a/1-Introduction/01-defining-data-science/translations/README.nl.md b/1-Introduction/01-defining-data-science/translations/README.nl.md new file mode 100644 index 00000000..f7257365 --- /dev/null +++ b/1-Introduction/01-defining-data-science/translations/README.nl.md @@ -0,0 +1,165 @@ +# Definitie van Data Science + +| ![ Sketchnote door [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/01-Definitions.png) | +| :----------------------------------------------------------------------------------------------------: | +| Defining Data Science - _Sketchnote by [@nitya](https://twitter.com/nitya)_ | + +--- + +[![Defining Data Science Video](images/video-def-ds.png)](https://youtu.be/beZ7Mb_oz9I) + +## [Starttoets data science](https://red-water-0103e7a0f.azurestaticapps.net/quiz/0) + +## Wat is Data? +In our everyday life, we are constantly surrounded by data. The text you are reading now is data. The list of phone numbers of your friends in your smartphone is data, as well as the current time displayed on your watch. As human beings, we naturally operate with data by counting the money we have or by writing letters to our friends. + +However, data became much more critical with the creation of computers. The primary role of computers is to perform computations, but they need data to operate on. Thus, we need to understand how computers store and process data. + +With the emergence of the Internet, the role of computers as data handling devices increased. If you think about it, we now use computers more and more for data processing and communication, rather than actual computations. 
When we write an e-mail to a friend or search for some information on the Internet - we are essentially creating, storing, transmitting, and manipulating data. +> Can you remember the last time you have used computers to actually compute something? + +## What is Data Science? + +In [Wikipedia](https://en.wikipedia.org/wiki/Data_science), **Data Science** is defined as *a scientific field that uses scientific methods to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains*. + +This definition highlights the following important aspects of data science: + +* The main goal of data science is to **extract knowledge** from data, in order words - to **understand** data, find some hidden relationships and build a **model**. +* Data science uses **scientific methods**, such as probability and statistics. In fact, when the term *data science* was first introduced, some people argued that data science was just a new fancy name for statistics. Nowadays it has become evident that the field is much broader. +* Obtained knowledge should be applied to produce some **actionable insights**, i.e. practical insights that you can apply to real business situations. +* We should be able to operate on both **structured** and **unstructured** data. We will come back to discuss different types of data later in the course. +* **Application domain** is an important concept, and data scientists often need at least some degree of expertise in the problem domain, for example: finance, medicine, marketing, etc. + +> Another important aspect of Data Science is that it studies how data can be gathered, stored and operated upon using computers. While statistics gives us mathematical foundations, data science applies mathematical concepts to actually draw insights from data. 
+ +One of the ways (attributed to [Jim Gray](https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist))) to look at the data science is to consider it to be a separate paradigm of science: +* **Empirical**, in which we rely mostly on observations and results of experiments +* **Theoretical**, where new concepts emerge from existing scientific knowledge +* **Computational**, where we discover new principles based on some computational experiments +* **Data-Driven**, based on discovering relationships and patterns in the data + +## Other Related Fields + +Since data is pervasive, data science itself is also a broad field, touching many other disciplines. + +
+
Databases
+
+A critical consideration is **how to store** the data, i.e. how to structure it in a way that allows faster processing. There are different types of databases that store structured and unstructured data, which we will consider in our course. +
+
Big Data
+
+Often we need to store and process very large quantities of data with a relatively simple structure. There are special approaches and tools to store that data in a distributed manner on a computer cluster, and process it efficiently. +
+
Machine Learning
+
+One way to understand data is to **build a model** that will be able to predict a desired outcome. Developing models from data is called **machine learning**. You may want to have a look at our Machine Learning for Beginners Curriculum to learn more about it. +
+
Artificial Intelligence
+
+An area of machine learning known as artificial intelligence (AI) also relies on data, and it involves building high complexity models that mimic human thought processes. AI methods often allow us to turn unstructured data (e.g. natural language) into structured insights. +
+
Visualization
+
+Vast amounts of data are incomprehensible for a human being, but once we create useful visualizations using that data, we can make more sense of the data, and draw some conclusions. Thus, it is important to know many ways to visualize information - something that we will cover in Section 3 of our course. Related fields also include **Infographics**, and **Human-Computer Interaction** in general. +
+
+ +## Types of Data + +As we have already mentioned, data is everywhere. We just need to capture it in the right way! It is useful to distinguish between **structured** and **unstructured** data. The former is typically represented in some well-structured form, often as a table or number of tables, while the latter is just a collection of files. Sometimes we can also talk about **semistructured** data, that have some sort of a structure that may vary greatly. + +| Structured | Semi-structured | Unstructured | +| ---------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | --------------------------------------- | +| List of people with their phone numbers | Wikipedia pages with links | Text of Encyclopaedia Britannica | +| Temperature in all rooms of a building at every minute for the last 20 years | Collection of scientific papers in JSON format with authors, data of publication, and abstract | File share with corporate documents | +| Data for age and gender of all people entering the building | Internet pages | Raw video feed from surveillance camera | + +## Where to get Data + +There are many possible sources of data, and it will be impossible to list all of them! However, let's mention some of the typical places where you can get data: + +* **Structured** + - **Internet of Things** (IoT), including data from different sensors, such as temperature or pressure sensors, provides a lot of useful data. For example, if an office building is equipped with IoT sensors, we can automatically control heating and lighting in order to minimize costs. + - **Surveys** that we ask users to complete after a purchase, or after visiting a web site. + - **Analysis of behavior** can, for example, help us understand how deeply a user goes into a site, and what is the typical reason for leaving the site. 
+* **Unstructured** + - **Texts** can be a rich source of insights, such as an overall **sentiment score**, or extracting keywords and semantic meaning. + - **Images** or **Video**. A video from a surveillance camera can be used to estimate traffic on the road, and inform people about potential traffic jams. + - Web server **Logs** can be used to understand which pages of our site are most often visited, and for how long. +* Semi-structured + - **Social Network** graphs can be great sources of data about user personalities and potential effectiveness in spreading information around. + - When we have a bunch of photographs from a party, we can try to extract **Group Dynamics** data by building a graph of people taking pictures with each other. + +By knowing different possible sources of data, you can try to think about different scenarios where data science techniques can be applied to know the situation better, and to improve business processes. + +## What you can do with Data + +In Data Science, we focus on the following steps of data journey: + +
+
1) Data Acquisition
+
+The first step is to collect the data. While in many cases it can be a straightforward process, like data coming to a database from a web application, sometimes we need to use special techniques. For example, data from IoT sensors can be overwhelming, and it is a good practice to use buffering endpoints such as IoT Hub to collect all the data before further processing. +
+
2) Data Storage
+
+Storing data can be challenging, especially if we are talking about big data. When deciding how to store data, it makes sense to anticipate the way you would to query the data in the future. There are several ways data can be stored: +
    +
  • A relational database stores a collection of tables, and uses a special language called SQL to query them. Typically, tables are organized into different groups called schemas. In many cases we need to convert the data from original form to fit the schema.
  • +
  • A NoSQL database, such as CosmosDB, does not enforce schemas on data, and allows storing more complex data, for example, hierarchical JSON documents or graphs. However, NoSQL databases do not have the rich querying capabilities of SQL, and cannot enforce referential integrity, i.e. rules on how the data is structured in tables and governing the relationships between tables.
  • +
  • Data Lake storage is used for large collections of data in raw, unstructured form. Data lakes are often used with big data, where all data cannot fit on one machine, and has to be stored and processed by a cluster of servers. Parquet is the data format that is often used in conjunction with big data.
  • +
+
+
3) Data Processing
+
+This is the most exciting part of the data journey, which involves converting the data from its original form into a form that can be used for visualization/model training. When dealing with unstructured data such as text or images, we may need to use some AI techniques to extract **features** from the data, thus converting it to structured form. +
+
4) Visualization / Human Insights
+
+Oftentimes, in order to understand the data, we need to visualize it. Having many different visualization techniques in our toolbox, we can find the right view to make an insight. Often, a data scientist needs to "play with data", visualizing it many times and looking for some relationships. Also, we may use statistical techniques to test a hypotheses or prove a correlation between different pieces of data. +
+
5) Training a predictive model
+
+Because the ultimate goal of data science is to be able to make decisions based on data, we may want to use the techniques of Machine Learning to build a predictive model. We can then use this to make predictions using new data sets with similar structures. +
+
+ +Of course, depending on the actual data, some steps might be missing (e.g., when we already have the data in the database, or when we do not need model training), or some steps might be repeated several times (such as data processing). + +## Digitalization and Digital Transformation + +In the last decade, many businesses started to understand the importance of data when making business decisions. To apply data science principles to running a business, one first needs to collect some data, i.e. translate business processes into digital form. This is known as **digitalization**. Applying data science techniques to this data to guide decisions can lead to significant increases in productivity (or even business pivot), called **digital transformation**. + +Let's consider an example. Suppose we have a data science course (like this one) which we deliver online to students, and we want to use data science to improve it. How can we do it? + +We can start by asking "What can be digitized?" The simplest way would be to measure the time it takes each student to complete each module, and to measure the obtained knowledge by giving a multiple-choice test at the end of each module. By averaging time-to-complete across all students, we can find out which modules cause the most difficulties for students, and work on simplifying them. + +> You may argue that this approach is not ideal, because modules can be of different lengths. It is probably more fair to divide the time by the length of the module (in number of characters), and compare those values instead. + +When we start analyzing results of multiple-choice tests, we can try to determine which concepts that students have difficulty understanding, and and use that information to improve the content. To do that, we need to design tests in such a way that each question maps to a certain concept or chunk of knowledge. 
+
+If we want to get even more complicated, we can plot the time taken for each module against the age category of students. We might find out that for some age categories it takes an inappropriately long time to complete the module, or that students drop out before completing it. This can help us provide age recommendations for the module, and minimize people's dissatisfaction from wrong expectations.
+
+## 🚀 Challenge
+
+In this challenge, we will try to find concepts relevant to the field of Data Science by looking at texts. We will take a Wikipedia article on Data Science, download and process the text, and then build a word cloud like this one:
+
+![Word Cloud for Data Science](images/ds_wordcloud.png)
+
+Visit [`notebook.ipynb`](notebook.ipynb) to read through the code. You can also run the code, and see how it performs all data transformations in real time.
+
+> If you do not know how to run code in a Jupyter Notebook, have a look at [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/).
+
+
+
+## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/1)
+
+## Assignments
+
+* **Task 1**: Modify the code above to find out related concepts for the fields of **Big Data** and **Machine Learning**
+* **Task 2**: [Think About Data Science Scenarios](assignment.md)
+
+## Credits
+
+This lesson has been authored with ♥️ by [Dmitry Soshnikov](http://soshnikov.com)
From 9b50293f5baa13a3a8ec702713ca58683945b2df Mon Sep 17 00:00:00 2001
From: Thoogend1
Date: Wed, 3 Nov 2021 09:02:30 +0100
Subject: [PATCH 02/13] Translated 'what is data' section

---
 .../translations/README.nl.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/1-Introduction/01-defining-data-science/translations/README.nl.md b/1-Introduction/01-defining-data-science/translations/README.nl.md
index f7257365..7808649f 100644
--- a/1-Introduction/01-defining-data-science/translations/README.nl.md
+++ b/1-Introduction/01-defining-data-science/translations/README.nl.md
@@ -1,24 +1,24 @@
 # Definitie van Data Science
-| ![ Sketchnote door [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/01-Definitions.png) |
+| ![ Sketchnote door [(@sketchthedocs)](https://sketchthedocs.dev) ](../../../sketchnotes/01-Definitions.png) |
 | :----------------------------------------------------------------------------------------------------: |
 | Defining Data Science - _Sketchnote by [@nitya](https://twitter.com/nitya)_ |
 ---
-[![Defining Data Science Video](images/video-def-ds.png)](https://youtu.be/beZ7Mb_oz9I)
+[![Defining Data Science Video](../images/video-def-ds.png)](https://youtu.be/beZ7Mb_oz9I)
 ## [Starttoets data science](https://red-water-0103e7a0f.azurestaticapps.net/quiz/0)
 ## Wat is Data?
-In our everyday life, we are constantly surrounded by data. The text you are reading now is data. The list of phone numbers of your friends in your smartphone is data, as well as the current time displayed on your watch.
As human beings, we naturally operate with data by counting the money we have or by writing letters to our friends.
+In ons dagelijks leven zijn we voortdurend omringd door data. De tekst die je nu leest is data. De lijst met telefoonnummers van je vrienden op je smartphone is data, evenals de huidige tijd die op je horloge wordt weergegeven. Als mens werken we van nature met data, denk aan het geld dat we moeten tellen of door berichten te schrijven aan onze vrienden.
-However, data became much more critical with the creation of computers. The primary role of computers is to perform computations, but they need data to operate on. Thus, we need to understand how computers store and process data.
+Gegevens werden echter veel belangrijker met de introductie van computers. De primaire rol van computers is om berekeningen uit te voeren, maar ze hebben gegevens nodig om mee te werken. We moeten dus begrijpen hoe computers gegevens opslaan en verwerken.
-With the emergence of the Internet, the role of computers as data handling devices increased. If you think about it, we now use computers more and more for data processing and communication, rather than actual computations. When we write an e-mail to a friend or search for some information on the Internet - we are essentially creating, storing, transmitting, and manipulating data.
-> Can you remember the last time you have used computers to actually compute something?
+Met de opkomst van het internet nam de rol van computers als gegevensverwerkingsapparatuur toe. Als je erover nadenkt, gebruiken we computers nu steeds meer voor gegevensverwerking en communicatie, in plaats van echte berekeningen. Wanneer we een e-mail schrijven naar een vriend of zoeken naar informatie op internet, creëren, bewaren, verzenden en manipuleren we in wezen gegevens.
+> Kan jij je herinneren wanneer jij voor het laatste echte berekeningen door een computer hebt laten uitvoeren?
-## What is Data Science?
+## Wat is Data Science?
In [Wikipedia](https://en.wikipedia.org/wiki/Data_science), **Data Science** is defined as *a scientific field that uses scientific methods to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains*.

From 4c10ab183b353d79f77eb35e7ab041b5c192bfb3 Mon Sep 17 00:00:00 2001
From: Timo Hoogendorp
Date: Mon, 22 Nov 2021 08:45:04 +0100
Subject: [PATCH 03/13] Link to 'working with data' fixed.

---
 .../01-defining-data-science/translations/README.pt-br.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/1-Introduction/01-defining-data-science/translations/README.pt-br.md b/1-Introduction/01-defining-data-science/translations/README.pt-br.md
index 4b2708c4..168b4d18 100644
--- a/1-Introduction/01-defining-data-science/translations/README.pt-br.md
+++ b/1-Introduction/01-defining-data-science/translations/README.pt-br.md
@@ -45,7 +45,7 @@ Já que dados são um conceito difundido, a ciência de dados em si também é u
Banco de Dados
-A coisa mais óbvia a considerar é **como armazenar** os dados, ex. como estruturá-los de uma forma que permite um processamento rápido. Existem diferentes tipos de banco de dados que armazenam dados estruturados e não estruturados, que nós vamos considerar nesse curso.
+A coisa mais óbvia a considerar é **como armazenar** os dados, ex. como estruturá-los de uma forma que permite um processamento rápido. Existem diferentes tipos de banco de dados que armazenam dados estruturados e não estruturados, que nós vamos considerar nesse curso.
Big Data
From 0001e1b5c8557e6cc01b197f791ee00c58f27e74 Mon Sep 17 00:00:00 2001
From: Timo Hoogendorp
Date: Mon, 22 Nov 2021 09:17:40 +0100
Subject: [PATCH 04/13] Fixed dead links

---
 .../01-defining-data-science/translations/README.es.md | 4 ++--
 .../01-defining-data-science/translations/README.pt-br.md | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/1-Introduction/01-defining-data-science/translations/README.es.md b/1-Introduction/01-defining-data-science/translations/README.es.md
index 32848e9c..a29ab89d 100644
--- a/1-Introduction/01-defining-data-science/translations/README.es.md
+++ b/1-Introduction/01-defining-data-science/translations/README.es.md
@@ -50,7 +50,7 @@ Dado que los datos son omnipresentes, la propia ciencia de los datos es también
Bases de datos
-Una consideración crítica es **cómo almacenar** los datos, es decir, cómo estructurarlos de forma que permitan un procesamiento más rápido. Hay diferentes tipos de bases de datos que almacenan datos estructurados y no estructurados, que consideraremos en nuestro curso.
+Una consideración crítica es **cómo almacenar** los datos, es decir, cómo estructurarlos de forma que permitan un procesamiento más rápido. Hay diferentes tipos de bases de datos que almacenan datos estructurados y no estructurados, que consideraremos en nuestro curso.
Big Data
@@ -66,7 +66,7 @@ Un área del Machine learning llamada inteligencia artificial (IA o AI, por sus
Visualización
-Cantidades muy grandes de datos son incomprensibles para un ser humano, pero una vez que creamos visualizaciones útiles con esos datos, podemos darles más sentido y sacar algunas conclusiones. Por ello, es importante conocer muchas formas de visualizar la información, algo que trataremos en la sección 3 de nuestro curso. Campos relacionados también incluyen la **Infografía**, y la **Interacción Persona-Ordenador** en general.
+Cantidades muy grandes de datos son incomprensibles para un ser humano, pero una vez que creamos visualizaciones útiles con esos datos, podemos darles más sentido y sacar algunas conclusiones. Por ello, es importante conocer muchas formas de visualizar la información, algo que trataremos en la sección 3 de nuestro curso. Campos relacionados también incluyen la **Infografía**, y la **Interacción Persona-Ordenador** en general.
diff --git a/1-Introduction/01-defining-data-science/translations/README.pt-br.md b/1-Introduction/01-defining-data-science/translations/README.pt-br.md
index 168b4d18..e03e8184 100644
--- a/1-Introduction/01-defining-data-science/translations/README.pt-br.md
+++ b/1-Introduction/01-defining-data-science/translations/README.pt-br.md
@@ -61,7 +61,7 @@ Como aprendizado de máquina, inteligência artificial também se baseia em dado
Visualização
-Vastas quantidades de dados são incompreensíveis para o ser humano, mas uma vez que criamos visualizações úteis - nós podemos começar a dar muito mais sentido aos dados, e desenhar algumas conclusões. Portanto, é importante conhecer várias formas de visualizar informação - algo que vamos cobrir na Seção 3 do nosso curso. Áreas relacionadas também incluem **Infográficos**, e **Interação Humano-Computador** no geral.
+Vastas quantidades de dados são incompreensíveis para o ser humano, mas uma vez que criamos visualizações úteis - nós podemos começar a dar muito mais sentido aos dados, e desenhar algumas conclusões. Portanto, é importante conhecer várias formas de visualizar informação - algo que vamos cobrir na Seção 3 do nosso curso. Áreas relacionadas também incluem **Infográficos**, e **Interação Humano-Computador** no geral.
From 9c6b5cbc65baa4a29a5ecc408e5b863cbd17be39 Mon Sep 17 00:00:00 2001
From: Timo Hoogendorp
Date: Mon, 22 Nov 2021 09:51:35 +0100
Subject: [PATCH 05/13] dutch translation of Defining Data science readme

---
 .../translations/README.nl.md | 155 +++++++++---------
 1 file changed, 77 insertions(+), 78 deletions(-)

diff --git a/1-Introduction/01-defining-data-science/translations/README.nl.md b/1-Introduction/01-defining-data-science/translations/README.nl.md
index 7808649f..7964af05 100644
--- a/1-Introduction/01-defining-data-science/translations/README.nl.md
+++ b/1-Introduction/01-defining-data-science/translations/README.nl.md
@@ -20,146 +20,145 @@ Met de opkomst van het internet nam de rol van computers als gegevensverwerkings
 ## Wat is Data Science?
-In [Wikipedia](https://en.wikipedia.org/wiki/Data_science), **Data Science** is defined as *a scientific field that uses scientific methods to extract knowledge and insights from structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains*.
+[Wikipedia](https://en.wikipedia.org/wiki/Data_science) definieert **Data Science** als *een interdisciplinair onderzoeksveld met betrekking tot wetenschappelijke methoden, processen en systemen om kennis en inzichten te onttrekken uit (zowel gestructureerde als ongestructureerde) data.*
-This definition highlights the following important aspects of data science:
+Deze definitie belicht de volgende belangrijke aspecten van data science:
-* The main goal of data science is to **extract knowledge** from data, in order words - to **understand** data, find some hidden relationships and build a **model**.
-* Data science uses **scientific methods**, such as probability and statistics. In fact, when the term *data science* was first introduced, some people argued that data science was just a new fancy name for statistics. Nowadays it has become evident that the field is much broader.
-* Obtained knowledge should be applied to produce some **actionable insights**, i.e. practical insights that you can apply to real business situations.
-* We should be able to operate on both **structured** and **unstructured** data. We will come back to discuss different types of data later in the course.
-* **Application domain** is an important concept, and data scientists often need at least some degree of expertise in the problem domain, for example: finance, medicine, marketing, etc.
+* Het belangrijkste doel van data science is om **kennis** uit gegevens te destilleren, in andere woorden - om data **te begrijpen**, verborgen relaties te vinden en een **model** te bouwen.
+* Data science maakt gebruik van **wetenschappelijke methoden**, zoals waarschijnlijkheid en statistiek. Toen de term *data science* voor het eerst werd geïntroduceerd, beweerden sommige mensen zelfs dat data science slechts een nieuwe mooie naam voor statistiek was. Tegenwoordig is duidelijk geworden dat het veld veel breder is.
+* Verkregen kennis moet worden toegepast om enkele **bruikbare inzichten** te produceren, d.w.z. praktische inzichten die je kunt toepassen op echte bedrijfssituaties.
+* We moeten in staat zijn om te werken met zowel **gestructureerde** als **ongestructureerde** data. We komen later in de cursus terug om verschillende soorten gegevens te bespreken.
+* **Toepassingsdomein** is een belangrijk begrip, en datawetenschappers hebben vaak minstens een zekere mate van expertise nodig in het probleemdomein, bijvoorbeeld: financiën, geneeskunde, marketing, enz.
-> Another important aspect of Data Science is that it studies how data can be gathered, stored and operated upon using computers. While statistics gives us mathematical foundations, data science applies mathematical concepts to actually draw insights from data.
+> Een ander belangrijk aspect van Data Science is dat het bestudeert hoe gegevens kunnen worden verzameld, opgeslagen en bediend met behulp van computers. Terwijl statistiek ons wiskundige grondslagen geeft, past data science wiskundige concepten toe om daadwerkelijk inzichten uit gegevens te halen.
-One of the ways (attributed to [Jim Gray](https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist))) to look at the data science is to consider it to be a separate paradigm of science:
-* **Empirical**, in which we rely mostly on observations and results of experiments
-* **Theoretical**, where new concepts emerge from existing scientific knowledge
-* **Computational**, where we discover new principles based on some computational experiments
-* **Data-Driven**, based on discovering relationships and patterns in the data
-## Other Related Fields
+Een van de manieren (toegeschreven aan [Jim Gray](https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist))) om naar de data science te kijken, is om het te beschouwen als een apart paradigma van de wetenschap:
+* **Empirisch**, waarbij we vooral vertrouwen op waarnemingen en resultaten van experimenten
+* **Theoretisch**, waar nieuwe concepten voortkomen uit bestaande wetenschappelijke kennis
+* **Computational**, waar we nieuwe principes ontdekken op basis van enkele computationele experimenten
+* **Data-Driven**, gebaseerd op het ontdekken van relaties en patronen in de data
-Since data is pervasive, data science itself is also a broad field, touching many other disciplines.
+## Andere gerelateerde vakgebieden
+
+Omdat data alomtegenwoordig is, is data science zelf ook een breed vakgebied, dat veel andere disciplines raakt.
Databases
-A critical consideration is **how to store** the data, i.e. how to structure it in a way that allows faster processing. There are different types of databases that store structured and unstructured data, which we will consider in our course. +Een kritische overweging is **hoe de gegevens op te slaan**, d.w.z. hoe deze te structureren op een manier die een snellere verwerking mogelijk maakt. Er zijn verschillende soorten databases die gestructureerde en ongestructureerde gegevens opslaan, welke we in onze cursus zullen overwegen.
Big Data
-Often we need to store and process very large quantities of data with a relatively simple structure. There are special approaches and tools to store that data in a distributed manner on a computer cluster, and process it efficiently.
+Vaak moeten we zeer grote hoeveelheden gegevens opslaan en verwerken met een relatief eenvoudige structuur. Er zijn speciale benaderingen en hulpmiddelen om die gegevens op een gedistribueerde manier op een computercluster op te slaan en efficiënt te verwerken.
-
Machine Learning
+
Machine learning
-One way to understand data is to **build a model** that will be able to predict a desired outcome. Developing models from data is called **machine learning**. You may want to have a look at our Machine Learning for Beginners Curriculum to learn more about it. +Een manier om gegevens te begrijpen is door **een model** te bouwen dat in staat zal zijn om een gewenste uitkomst te voorspellen. Het ontwikkelen van modellen op basis van data wordt **machine learning** genoemd. Misschien wilt u een kijkje nemen op onze Machine Learning for Beginners Curriculum om er meer over te weten te komen.
-
Artificial Intelligence
+
kunstmatige intelligentie
-An area of machine learning known as artificial intelligence (AI) also relies on data, and it involves building high complexity models that mimic human thought processes. AI methods often allow us to turn unstructured data (e.g. natural language) into structured insights. +Een gebied van machine learning dat bekend staat als Artificial Intelligence (AI) is ook afhankelijk van gegevens en betreft het bouwen van modellen met een hoge complexiteit die menselijke denkprocessen nabootsen. AI-methoden stellen ons vaak in staat om ongestructureerde data (bijvoorbeeld natuurlijke taal) om te zetten in gestructureerde inzichten.
-
Visualization
+
visualisatie
-Vast amounts of data are incomprehensible for a human being, but once we create useful visualizations using that data, we can make more sense of the data, and draw some conclusions. Thus, it is important to know many ways to visualize information - something that we will cover in Section 3 of our course. Related fields also include **Infographics**, and **Human-Computer Interaction** in general. +Enorme hoeveelheden gegevens zijn onbegrijpelijk voor een mens, maar zodra we nuttige visualisaties maken met behulp van die gegevens, kunnen we de gegevens beter begrijpen en enkele conclusies trekken. Het is dus belangrijk om veel manieren te kennen om informatie te visualiseren - iets dat we zullen behandelen in sectie 3 van onze cursus. Gerelateerde velden omvatten ook **Infographics** en **Mens-computerinteractie** in het algemeen.
-## Types of Data +## Typen van Data -As we have already mentioned, data is everywhere. We just need to capture it in the right way! It is useful to distinguish between **structured** and **unstructured** data. The former is typically represented in some well-structured form, often as a table or number of tables, while the latter is just a collection of files. Sometimes we can also talk about **semistructured** data, that have some sort of a structure that may vary greatly. +Zoals we al hebben vermeld, zijn gegevens overal te vinden. We moeten het gewoon op de juiste manier vastleggen! Het is handig om onderscheid te maken tussen **gestructureerde** en **ongestructureerde** data. De eerste wordt meestal weergegeven in een goed gestructureerde vorm, vaak als een tabel of een aantal tabellen, terwijl de laatste slechts een verzameling bestanden is. Soms kunnen we het ook hebben over **semigestructureerde** gegevens, die een soort structuur hebben die sterk kan variëren. -| Structured | Semi-structured | Unstructured | -| ---------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------- | --------------------------------------- | -| List of people with their phone numbers | Wikipedia pages with links | Text of Encyclopaedia Britannica | -| Temperature in all rooms of a building at every minute for the last 20 years | Collection of scientific papers in JSON format with authors, data of publication, and abstract | File share with corporate documents | -| Data for age and gender of all people entering the building | Internet pages | Raw video feed from surveillance camera | +| Gestructureerde | Semi-gestructureerde | Ongestructureerde | +| --------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------- | 
------------------------------------------ | +| Lijst van mensen met hun telefoonnummer | Wikipedia pagina's met links | Tekst van encyclopaedia Britannica | +| Temperatuur in alle kamers van een gebouw op elke minuut gedurende de laatste 20 jaar | Verzameling van wetenschappelijke artikelen in JSON-formaat met auteurs, publicatiegegevens en een abstract | Bestanden opslag met bedrijfsdocumenten | +| Gegevens van leeftijd en geslacht van alle mensen die het gebouw betreden | Internet pagina's | Onbewerkte videofeed van bewakingscamera's | -## Where to get Data +## Waar data vandaan te halen -There are many possible sources of data, and it will be impossible to list all of them! However, let's mention some of the typical places where you can get data: +Er zijn veel mogelijke gegevensbronnen en het zal onmogelijk zijn om ze allemaal op te sommen! Laten we echter enkele van de typische plaatsen noemen waar u gegevens kunt krijgen: -* **Structured** - - **Internet of Things** (IoT), including data from different sensors, such as temperature or pressure sensors, provides a lot of useful data. For example, if an office building is equipped with IoT sensors, we can automatically control heating and lighting in order to minimize costs. - - **Surveys** that we ask users to complete after a purchase, or after visiting a web site. - - **Analysis of behavior** can, for example, help us understand how deeply a user goes into a site, and what is the typical reason for leaving the site. -* **Unstructured** - - **Texts** can be a rich source of insights, such as an overall **sentiment score**, or extracting keywords and semantic meaning. - - **Images** or **Video**. A video from a surveillance camera can be used to estimate traffic on the road, and inform people about potential traffic jams. - - Web server **Logs** can be used to understand which pages of our site are most often visited, and for how long. 
-* Semi-structured - **Social Network** graphs can be great sources of data about user personalities and potential effectiveness in spreading information around. - When we have a bunch of photographs from a party, we can try to extract **Group Dynamics** data by building a graph of people taking pictures with each other. +* **Gestructureerd** + - **Internet of Things** (IoT), inclusief data van verschillende sensoren, zoals temperatuur- of druksensoren, leveren veel bruikbare data op. Als een kantoorgebouw bijvoorbeeld is uitgerust met IoT-sensoren, kunnen we automatisch verwarming en verlichting regelen om de kosten te minimaliseren. + - **Enquêtes** die we gebruikers vragen in te vullen na een aankoop of na een bezoek aan een website. + - **Analyse van gedrag** kan ons bijvoorbeeld helpen begrijpen hoe diep een gebruiker in een website gaat en wat de typische reden is om de site te verlaten. +* ** Ongestructureerd ** + - **Teksten** kunnen een rijke bron van inzichten zijn, zoals een algemene **sentimentscore**, of het extraheren van trefwoorden en semantische betekenis. + - **Afbeeldingen** of **Video**. Een video van een bewakingscamera kan worden gebruikt om het verkeer op de weg in te schatten en mensen te informeren over mogelijke files. + - Webserver **Logs** kunnen worden gebruikt om te begrijpen welke pagina's van onze site het vaakst worden bezocht en voor hoe lang. +* Semi-gestructureerd + - **Social Network** grafieken kunnen geweldige bronnen van gegevens zijn over gebruikerspersoonlijkheden en potentiële effectiviteit bij het verspreiden van informatie. + - Wanneer we een heleboel foto's van een feest hebben, kunnen we proberen **Group Dynamics**-gegevens te extraheren door een grafiek te maken van mensen die met elkaar foto's maken. -By knowing different possible sources of data, you can try to think about different scenarios where data science techniques can be applied to know the situation better, and to improve business processes. 
+Door verschillende mogelijke databronnen te kennen, kun je proberen na te denken over verschillende scenario's waarin data science technieken kunnen worden toegepast om de situatie beter te leren kennen en bedrijfsprocessen te verbeteren. -## What you can do with Data +## Wat je met Data kunt doen -In Data Science, we focus on the following steps of data journey: +In Data Science richten we ons op de volgende stappen van data journey:
-
1) Data Acquisition
+
1) Data-acquisitie
-The first step is to collect the data. While in many cases it can be a straightforward process, like data coming to a database from a web application, sometimes we need to use special techniques. For example, data from IoT sensors can be overwhelming, and it is a good practice to use buffering endpoints such as IoT Hub to collect all the data before further processing. +De eerste stap is het verzamelen van de gegevens. Hoewel het in veel gevallen een eenvoudig proces kan zijn, zoals gegevens die vanuit een webapplicatie naar een database komen, moeten we soms speciale technieken gebruiken. Gegevens van IoT-sensoren kunnen bijvoorbeeld overweldigend zijn en het is een goede gewoonte om bufferingseindpunten zoals IoT Hub te gebruiken om alle gegevens te verzamelen voordat ze verder worden verwerkt.
-
2) Data Storage
+
2) Gegevensopslag
-Storing data can be challenging, especially if we are talking about big data. When deciding how to store data, it makes sense to anticipate the way you would to query the data in the future. There are several ways data can be stored: +Het opslaan van gegevens kan een uitdaging zijn, vooral als we het hebben over big data. Wanneer u beslist hoe u gegevens wilt opslaan, is het logisch om te anticiperen op de manier waarop u de gegevens in de toekomst zou opvragen. Er zijn verschillende manieren waarop gegevens kunnen worden opgeslagen:
    -
  • A relational database stores a collection of tables, and uses a special language called SQL to query them. Typically, tables are organized into different groups called schemas. In many cases we need to convert the data from original form to fit the schema.
  • -
  • A NoSQL database, such as CosmosDB, does not enforce schemas on data, and allows storing more complex data, for example, hierarchical JSON documents or graphs. However, NoSQL databases do not have the rich querying capabilities of SQL, and cannot enforce referential integrity, i.e. rules on how the data is structured in tables and governing the relationships between tables.
  • -
  • Data Lake storage is used for large collections of data in raw, unstructured form. Data lakes are often used with big data, where all data cannot fit on one machine, and has to be stored and processed by a cluster of servers. Parquet is the data format that is often used in conjunction with big data.
  • +
  • Een relationele database slaat een verzameling tabellen op en gebruikt een speciale taal genaamd SQL om deze op te vragen. Tabellen zijn meestal georganiseerd in verschillende groepen die schema's worden genoemd. In veel gevallen moeten we de gegevens van de oorspronkelijke vorm converteren naar het schema.
  • +
  • Een NoSQL-database, zoals CosmosDB, dwingt geen schema's af op gegevens en maakt het opslaan van complexere gegevens mogelijk, bijvoorbeeld hiërarchische JSON-documenten of grafieken. NoSQL-databases hebben echter niet de uitgebreide querymogelijkheden van SQL en kunnen geen referentiële integriteit afdwingen, d.w.z. regels over hoe de gegevens in tabellen zijn gestructureerd en de relaties tussen tabellen regelen.
  • +
  • Data Lake opslag wordt gebruikt voor grote verzamelingen gegevens in ruwe, ongestructureerde vorm. Data lakes worden vaak gebruikt met big data, waarbij alle data niet op één machine past en moet worden opgeslagen en verwerkt door een cluster van servers. Parquet is het gegevensformaat dat vaak wordt gebruikt in combinatie met big data.
-
3) Data Processing
+
3) Gegevensverwerking
-This is the most exciting part of the data journey, which involves converting the data from its original form into a form that can be used for visualization/model training. When dealing with unstructured data such as text or images, we may need to use some AI techniques to extract **features** from the data, thus converting it to structured form. +Dit is het meest spannende deel van het gegevenstraject, waarbij de gegevens van de oorspronkelijke vorm worden omgezet in een vorm die kan worden gebruikt voor visualisatie / modeltraining. Bij het omgaan met ongestructureerde gegevens zoals tekst of afbeeldingen, moeten we mogelijk enkele AI-technieken gebruiken om **kenmerken** (features) uit de gegevens te destilleren en deze zo naar gestructureerde vorm te converteren.
-
4) Visualization / Human Insights
+
4) Visualisatie / Menselijke inzichten
-Oftentimes, in order to understand the data, we need to visualize it. Having many different visualization techniques in our toolbox, we can find the right view to make an insight. Often, a data scientist needs to "play with data", visualizing it many times and looking for some relationships. Also, we may use statistical techniques to test a hypotheses or prove a correlation between different pieces of data. +Vaak moeten we, om de gegevens te begrijpen, deze visualiseren. Met veel verschillende visualisatietechnieken in onze toolbox kunnen we de juiste weergave vinden om inzicht te krijgen. Vaak moet een data scientist "spelen met data", deze vele malen visualiseren en op zoek gaan naar wat relaties. Ook kunnen we statistische technieken gebruiken om een hypothese te testen of een correlatie tussen verschillende gegevens te bewijzen.
-
5) Training a predictive model
+
5) Het trainen van een voorspellend model
-Because the ultimate goal of data science is to be able to make decisions based on data, we may want to use the techniques of Machine Learning to build a predictive model. We can then use this to make predictions using new data sets with similar structures. +Omdat het uiteindelijke doel van data science is om beslissingen te kunnen nemen op basis van data, willen we misschien de technieken van Machine Learning gebruiken om een voorspellend model te bouwen. We kunnen dit vervolgens gebruiken om voorspellingen te doen met behulp van nieuwe datasets met vergelijkbare structuren.
-Of course, depending on the actual data, some steps might be missing (e.g., when we already have the data in the database, or when we do not need model training), or some steps might be repeated several times (such as data processing). - -## Digitalization and Digital Transformation - -In the last decade, many businesses started to understand the importance of data when making business decisions. To apply data science principles to running a business, one first needs to collect some data, i.e. translate business processes into digital form. This is known as **digitalization**. Applying data science techniques to this data to guide decisions can lead to significant increases in productivity (or even business pivot), called **digital transformation**. +Natuurlijk, afhankelijk van de werkelijke gegevens, kunnen sommige stappen ontbreken (bijvoorbeeld wanneer we de gegevens al in de database hebben opgeslagen of wanneer we geen modeltraining nodig hebben), of sommige stappen kunnen meerdere keren worden herhaald (zoals gegevensverwerking). -Let's consider an example. Suppose we have a data science course (like this one) which we deliver online to students, and we want to use data science to improve it. How can we do it? +## Digitalisering en digitale transformatie -We can start by asking "What can be digitized?" The simplest way would be to measure the time it takes each student to complete each module, and to measure the obtained knowledge by giving a multiple-choice test at the end of each module. By averaging time-to-complete across all students, we can find out which modules cause the most difficulties for students, and work on simplifying them. +In het afgelopen decennium begonnen veel bedrijven het belang van gegevens te begrijpen bij het nemen van zakelijke beslissingen. Om data science-principes toe te passen op het opereren van een bedrijf, moet men eerst wat gegevens verzamelen, d.w.z. bedrijfsprocessen vertalen naar digitale vorm. 
Dit staat bekend als **digitalisering**. Het toepassen van data science-technieken op deze gegevens om beslissingen te sturen, kan leiden tot aanzienlijke productiviteitsstijgingen (of zelfs zakelijke spil), **digitale transformatie** genoemd. -> You may argue that this approach is not ideal, because modules can be of different lengths. It is probably more fair to divide the time by the length of the module (in number of characters), and compare those values instead. +Laten we een voorbeeld nemen. Stel dat we een data science-cursus hebben (zoals deze) die we online aan studenten geven, en we willen data science gebruiken om het te verbeteren. Hoe kunnen we dat doen? -When we start analyzing results of multiple-choice tests, we can try to determine which concepts that students have difficulty understanding, and and use that information to improve the content. To do that, we need to design tests in such a way that each question maps to a certain concept or chunk of knowledge. +We kunnen beginnen met de vraag "Wat kan worden gedigitaliseerd?" De eenvoudigste manier zou zijn om de tijd te meten die elke student nodig heeft om elke module te voltooien en om de verkregen kennis te meten door aan het einde van elke module een meerkeuzetest te geven. Door het gemiddelde te nemen van de time-to-complete over alle studenten, kunnen we erachter komen welke modules de meeste problemen veroorzaken voor studenten en werken aan het vereenvoudigen ervan. -If we want to get even more complicated, we can plot the time taken for each module against the age category of students. We might find out that for some age categories it takes an inappropriately long time to complete the module, or that students drop out before completing it. This can help us provide age recommendations for the module, and minimize people's dissatisfaction from wrong expectations. +> Je zou kunnen stellen dat deze aanpak niet ideaal is, omdat modules van verschillende lengtes kunnen zijn. 
Het is waarschijnlijk eerlijker om de tijd te delen door de lengte van de module (in aantal tekens) en in plaats daarvan die waarden te vergelijken. -## 🚀 Challenge +Wanneer we beginnen met het analyseren van resultaten van meerkeuzetoetsen, kunnen we proberen te bepalen welke concepten studenten moeilijk kunnen begrijpen en die informatie gebruiken om de inhoud te verbeteren. Om dat te doen, moeten we tests zo ontwerpen dat elke vraag is toegewezen aan een bepaald concept of een deel van de kennis. -In this challenge, we will try to find concepts relevant to the field of Data Science by looking at texts. We will take a Wikipedia article on Data Science, download and process the text, and then build a word cloud like this one: +Als we het nog ingewikkelder willen maken, kunnen we de tijd die voor elke module nodig is, uitzetten tegen de leeftijdscategorie van studenten. We kunnen erachter komen dat het voor sommige leeftijdscategorieën ongepast lang duurt om de module te voltooien, of dat studenten afhaken voordat ze het voltooien. Dit kan ons helpen leeftijdsaanbevelingen voor de module te geven en de ontevredenheid van mensen over verkeerde verwachtingen te minimaliseren. -![Word Cloud for Data Science](images/ds_wordcloud.png) +## 🚀 Uitdaging -Visit [`notebook.ipynb`](notebook.ipynb) to read through the code. You can also run the code, and see how it performs all data transformations in real time. +In deze challenge proberen we concepten te vinden die relevant zijn voor het vakgebied Data Science door te kijken naar teksten. We nemen een Wikipedia-artikel over Data Science, downloaden en verwerken de tekst en bouwen vervolgens een woordwolk zoals deze: -> If you do not know how to run code in a Jupyter Notebook, have a look at [this article](https://soshnikov.com/education/how-to-execute-notebooks-from-github/). +![Word Cloud voor Data Science] (../images/ds_wordcloud.png) +Ga naar ['notebook.ipynb'](notebook.ipynb) om de code door te lezen. 
Je kunt de code ook uitvoeren en zien hoe alle gegevenstransformaties in realtime worden uitgevoerd. +> Als je niet weet hoe je code in een Jupyter Notebook moet uitvoeren, kijk dan eens naar [dit artikel] (https://soshnikov.com/education/how-to-execute-notebooks-from-github/). ## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/1) -## Assignments +## Opdrachten -* **Task 1**: Modify the code above to find out related concepts for the fields of **Big Data** and **Machine Learning** -* **Task 2**: [Think About Data Science Scenarios](assignment.md) +* **Taak 1**: Wijzig de bovenstaande code om gerelateerde concepten te achterhalen voor de velden **Big Data** en **Machine Learning** +* ** Taak 2 **: [Denk na over Data Science-scenario's] (assignment.md) ## Credits -This lesson has been authored with ♥️ by [Dmitry Soshnikov](http://soshnikov.com) +Deze les is geschreven met ♥️ door [Dmitry Soshnikov] (http://soshnikov.com) \ No newline at end of file From 9912a4ce8a2f90c3da4470b742eca94c13705d80 Mon Sep 17 00:00:00 2001 From: Timo Hoogendorp Date: Mon, 22 Nov 2021 09:57:38 +0100 Subject: [PATCH 06/13] cleanup --- .../01-defining-data-science/translations/README.nl.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/1-Introduction/01-defining-data-science/translations/README.nl.md b/1-Introduction/01-defining-data-science/translations/README.nl.md index 7964af05..791f080f 100644 --- a/1-Introduction/01-defining-data-science/translations/README.nl.md +++ b/1-Introduction/01-defining-data-science/translations/README.nl.md @@ -33,7 +33,7 @@ Deze definitie belicht de volgende belangrijke aspecten van data science: > Een ander belangrijk aspect van Data Science is dat het bestudeert hoe gegevens kunnen worden verzameld, opgeslagen en bediend met behulp van computers. Terwijl statistiek ons wiskundige grondslagen geeft, past data science wiskundige concepten toe om daadwerkelijk inzichten uit gegevens te halen. 
-Een van de manieren (toegeschreven aan [Jim Gray] (https://en.wikipedia.org/wiki/Jim_Gray_ (computer_scientist))) om naar de data science te kijken, is om het te beschouwen als een apart paradigma van de wetenschap: +Een van de manieren (toegeschreven aan [Jim Gray](https://en.wikipedia.org/wiki/Jim_Gray_(computer_scientist))) om naar de data science te kijken, is om het te beschouwen als een apart paradigma van de wetenschap: * **Empirisch**, waarbij we vooral vertrouwen op waarnemingen en resultaten van experimenten * **Theoretisch**, waar nieuwe concepten voortkomen uit bestaande wetenschappelijke kennis * **Computational**, waar we nieuwe principes ontdekken op basis van enkele computationele experimenten @@ -62,7 +62,7 @@ Een gebied van machine learning dat bekend staat als Artificial Intelligence (AI
visualisatie
-Enorme hoeveelheden gegevens zijn onbegrijpelijk voor een mens, maar zodra we nuttige visualisaties maken met behulp van die gegevens, kunnen we de gegevens beter begrijpen en enkele conclusies trekken. Het is dus belangrijk om veel manieren te kennen om informatie te visualiseren - iets dat we zullen behandelen in sectie 3 van onze cursus. Gerelateerde velden omvatten ook **Infographics** en **Mens-computerinteractie** in het algemeen. +Enorme hoeveelheden gegevens zijn onbegrijpelijk voor een mens, maar zodra we nuttige visualisaties maken met behulp van die gegevens, kunnen we de gegevens beter begrijpen en enkele conclusies trekken. Het is dus belangrijk om veel manieren te kennen om informatie te visualiseren - iets dat we zullen behandelen in Sectie 3 van onze cursus. Gerelateerde velden omvatten ook **Infographics** en **Mens-computerinteractie** in het algemeen.
@@ -146,11 +146,11 @@ Als we het nog ingewikkelder willen maken, kunnen we de tijd die voor elke modul In deze challenge proberen we concepten te vinden die relevant zijn voor het vakgebied Data Science door te kijken naar teksten. We nemen een Wikipedia-artikel over Data Science, downloaden en verwerken de tekst en bouwen vervolgens een woordwolk zoals deze: -![Word Cloud voor Data Science] (../images/ds_wordcloud.png) +![Word Cloud for Data Science](../images/ds_wordcloud.png) Ga naar ['notebook.ipynb'](notebook.ipynb) om de code door te lezen. Je kunt de code ook uitvoeren en zien hoe alle gegevenstransformaties in realtime worden uitgevoerd. -> Als je niet weet hoe je code in een Jupyter Notebook moet uitvoeren, kijk dan eens naar [dit artikel] (https://soshnikov.com/education/how-to-execute-notebooks-from-github/). +> Als je niet weet hoe je code in een Jupyter Notebook moet uitvoeren, kijk dan eens naar [dit artikel](https://soshnikov.com/education/how-to-execute-notebooks-from-github/). ## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/1) From bbbacb94bdc0d002cf496a8a774723b6b7bae191 Mon Sep 17 00:00:00 2001 From: Timo Hoogendorp Date: Mon, 22 Nov 2021 09:59:20 +0100 Subject: [PATCH 07/13] last formatting changes --- .../01-defining-data-science/translations/README.nl.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/1-Introduction/01-defining-data-science/translations/README.nl.md b/1-Introduction/01-defining-data-science/translations/README.nl.md index 791f080f..513eb83c 100644 --- a/1-Introduction/01-defining-data-science/translations/README.nl.md +++ b/1-Introduction/01-defining-data-science/translations/README.nl.md @@ -84,7 +84,7 @@ Er zijn veel mogelijke gegevensbronnen en het zal onmogelijk zijn om ze allemaal - **Internet of Things** (IoT), inclusief data van verschillende sensoren, zoals temperatuur- of druksensoren, leveren veel bruikbare data op. 
Als een kantoorgebouw bijvoorbeeld is uitgerust met IoT-sensoren, kunnen we automatisch verwarming en verlichting regelen om de kosten te minimaliseren. - **Enquêtes** die we gebruikers vragen in te vullen na een aankoop of na een bezoek aan een website. - **Analyse van gedrag** kan ons bijvoorbeeld helpen begrijpen hoe diep een gebruiker in een website gaat en wat de typische reden is om de site te verlaten. -* ** Ongestructureerd ** +* **Ongestructureerd** - **Teksten** kunnen een rijke bron van inzichten zijn, zoals een algemene **sentimentscore**, of het extraheren van trefwoorden en semantische betekenis. - **Afbeeldingen** of **Video**. Een video van een bewakingscamera kan worden gebruikt om het verkeer op de weg in te schatten en mensen te informeren over mogelijke files. - Webserver **Logs** kunnen worden gebruikt om te begrijpen welke pagina's van onze site het vaakst worden bezocht en voor hoe lang. @@ -157,7 +157,7 @@ Ga naar ['notebook.ipynb'](notebook.ipynb) om de code door te lezen. Je kunt de ## Opdrachten * **Taak 1**: Wijzig de bovenstaande code om gerelateerde concepten te achterhalen voor de velden **Big Data** en **Machine Learning** -* ** Taak 2 **: [Denk na over Data Science-scenario's] (assignment.md) +* **Taak 2**: [Denk na over Data Science-scenario's](assignment.md) ## Credits From e2b9d4882c0528aaa0dfd50dc645b6923ae8e44f Mon Sep 17 00:00:00 2001 From: Tauqeer Ahmad <68806440+TauqeerAhmad5201@users.noreply.github.com> Date: Fri, 3 Dec 2021 22:13:03 +0530 Subject: [PATCH 08/13] Updated Readme Added my socials in the readme. 
--- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index 5df47a7e..b6166d72 100644 --- a/README.md +++ b/README.md @@ -15,7 +15,7 @@ Azure Cloud Advocates at Microsoft are pleased to offer a 10-week, 20-lesson cur **Hearty thanks to our authors:** [Jasmine Greenaway](https://www.twitter.com/paladique), [Dmitry Soshnikov](http://soshnikov.com), [Nitya Narasimhan](https://twitter.com/nitya), [Jalen McGee](https://twitter.com/JalenMcG), [Jen Looper](https://twitter.com/jenlooper), [Maud Levy](https://twitter.com/maudstweets), [Tiffany Souterre](https://twitter.com/TiffanySouterre), [Christopher Harrison](https://www.twitter.com/geektrainer). **🙏 Special thanks 🙏 to our [Microsoft Student Ambassador](https://studentambassadors.microsoft.com/) authors, reviewers and content contributors,** notably Aaryan Arora, [Aditya Garg](https://github.com/AdityaGarg00), [Alondra Sanchez](https://www.linkedin.com/in/alondra-sanchez-molina/), [Ankita Singh](https://www.linkedin.com/in/ankitasingh007), [Anupam Mishra](https://www.linkedin.com/in/anupam--mishra/), [Arpita Das](https://www.linkedin.com/in/arpitadas01/), ChhailBihari Dubey, [Dibri Nsofor](https://www.linkedin.com/in/dibrinsofor), [Dishita Bhasin](https://www.linkedin.com/in/dishita-bhasin-7065281bb), [Majd Safi](https://www.linkedin.com/in/majd-s/), [Max Blum](https://www.linkedin.com/in/max-blum-6036a1186/), [Miguel Correa](https://www.linkedin.com/in/miguelmque/), [Mohamma Iftekher (Iftu) Ebne Jalal](https://twitter.com/iftu119), [Nawrin Tabassum](https://www.linkedin.com/in/nawrin-tabassum), [Raymond Wangsa Putra](https://www.linkedin.com/in/raymond-wp/), [Rohit Yadav](https://www.linkedin.com/in/rty2423), Samridhi Sharma, [Sanya Sinha](https://www.linkedin.com/mwlite/in/sanya-sinha-13aab1200), -[Sheena Narula](https://www.linkedin.com/in/sheena-narua-n/), Tauqeer Ahmad, Yogendrasingh Pawar +[Sheena Narula](https://www.linkedin.com/in/sheena-narua-n/), [Tauqeer 
Ahmad](https://www.linkedin.com/in/tauqeerahmad5201/), Yogendrasingh Pawar |![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](./sketchnotes/00-Title.png)| |:---:| From 646f007eddf174cf2eebee40e05a22c008bcc593 Mon Sep 17 00:00:00 2001 From: Jen Looper Date: Sun, 5 Dec 2021 20:25:28 -0500 Subject: [PATCH 09/13] editing English version - lesson 16 --- .../16-communication/README.md | 223 ++++++++++++++++++ .../translations/README.hi.md | 2 +- .../{ => translations}/README.ko.md | 0 3 files changed, 224 insertions(+), 1 deletion(-) create mode 100644 4-Data-Science-Lifecycle/16-communication/README.md rename 4-Data-Science-Lifecycle/16-communication/{ => translations}/README.ko.md (100%) diff --git a/4-Data-Science-Lifecycle/16-communication/README.md b/4-Data-Science-Lifecycle/16-communication/README.md new file mode 100644 index 00000000..a3cdaf43 --- /dev/null +++ b/4-Data-Science-Lifecycle/16-communication/README.md @@ -0,0 +1,223 @@ +# The Data Science Lifecycle: Communication + +|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev)](../../sketchnotes/16-Communicating.png)| +|:---:| +| Data Science Lifecycle: Communication - _Sketchnote by [@nitya](https://twitter.com/nitya)_ | + +## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/30) + +Test your knowledge of what's to come with the Pre-Lecture Quiz above! + +# Introduction + +### What is Communication? +Let’s start this lesson by defining what it means to communicate. **To communicate is to convey or exchange information.** Information can be ideas, thoughts, feelings, messages, covert signals, data – anything that a **_sender_** (someone sending information) wants a **_receiver_** (someone receiving information) to understand. In this lesson, we will refer to senders as communicators, and receivers as the audience. + +### Data Communication & Storytelling +We understand that when communicating, the aim is to convey or exchange information. 
But when communicating data, your aim shouldn't be to simply pass along numbers to your audience. Your aim should be to communicate a story that is informed by your data - effective data communication and storytelling go hand-in-hand. Your audience is more likely to remember a story you tell than a number you give. Later in this lesson, we will go over a few ways that you can use storytelling to communicate your data more effectively. + +### Types of Communication +Throughout this lesson, two different types of communication will be discussed: one-way communication and two-way communication. + +**One-way communication** happens when a sender sends information to a receiver, without any feedback or response. We see examples of one-way communication every day – in bulk/mass emails, when the news delivers the most recent stories, or even when a television commercial comes on and informs you about why their product is great. In each of these instances, the sender is not seeking an exchange of information. They are only seeking to convey or deliver information. + +**Two-way communication** happens when all involved parties act as both senders and receivers. A sender will begin by communicating to a receiver, and the receiver will provide feedback or a response. Two-way communication is what we traditionally think of when we talk about communication. We usually think of people engaged in a conversation - either in person, or over a phone call, social media, or text message. + +When communicating data, there will be cases where you will be using one-way communication (think about presenting at a conference, or to a large group where questions won’t be asked directly after) and there will be cases where you will use two-way communication (think about using data to persuade a few stakeholders for buy-in, or to convince a teammate that time and effort should be spent building something new). 
+ +# Effective Communication + +### Your Responsibilities as a communicator +When communicating, it is your job to make sure that your receiver(s) are taking away the information that you want them to take away. When you’re communicating data, you don’t just want your receivers to take away numbers, you want your receivers to take away a story that’s informed by your data. A good data communicator is a good storyteller. + +How do you tell a story with data? There are infinite ways – but below are 5 that we will talk about in this lesson. +1. Understand Your Audience, Your Channel, & Your Communication Method +2. Begin with the End in Mind +3. Approach it Like an Actual Story +4. Use Meaningful Words & Phrases +5. Use Emotion + +Each of these strategies is explained in greater detail below. + +### 1. Understand Your Audience, Your Channel & Your Communication Method +The way you communicate with family members is likely different than the way you communicate with your friends. You probably use different words and phrases that the people you’re speaking to are more likely to understand. You should take the same approach when communicating data. Think about who you’re communicating to. Think about their goals and the context that they have around the situation that you’re explaining to them. + +You can likely place the majority of your audience within a category. In a _Harvard Business Review_ article, “[How to Tell a Story with Data](http://blogs.hbr.org/2013/04/how-to-tell-a-story-with-data/),” Dell Executive Strategist Jim Stikeleather identifies five categories of audiences. 
+ + - **Novice**: first exposure to the subject, but doesn't want + oversimplification + - **Generalist**: aware of the topic, but looking for an overview + understanding and major themes + - **Managerial**: in-depth, actionable understanding of intricacies and + interrelationships with access to detail + - **Expert**: more exploration and discovery and less storytelling with + great detail + - **Executive**: only has time to glean the significance and conclusions of + weighted probabilities + +These categories can inform the way you present data to your audience. + +In addition to thinking about your audience's category, you should also consider the channel you're using to communicate with your audience. Your approach should be slightly different if you're writing a memo or email vs having a meeting or presenting at a conference. + +On top of understanding your audience, knowing how you will be communicating with them (using one-way communication or two-way) is also critical. + +If you are communicating with a majority Novice audience and you're using one-way communication, you must first educate the audience and give them proper context. Then you must present your data to them and tell them what your data means and why your data matters. In this instance, you may want to be laser-focused on driving clarity, because your audience will not be able to ask you any direct questions. + +If you are communicating with a majority Managerial audience and you're using two-way communication, you likely won't need to educate your audience or provide them with much context. You may be able to jump straight into discussing the data that you've collected and why it matters. In this scenario though, you should be focused on timing and controlling your presentation.
When using two-way communication (especially with a Managerial audience who is seeking an "actionable understanding of intricacies and interrelationships with access to detail") questions may pop up during your interaction that may take the discussion in a direction that doesn't relate to the story that you're trying to tell. When this happens, you can take action and move the discussion back on track with your story. + + +### 2. Begin With The End In Mind +Beginning with the end in mind means understanding your intended takeaways for your audience before you start communicating with them. Being thoughtful about what you want your audience to take away ahead of time can help you craft a story that your audience can follow. Beginning with the end in mind is appropriate for both one-way communication and two-way communication. + +How do you begin with the end in mind? Before communicating your data, write down your key takeaways. Then, every step of the way as you're preparing the story that you want to tell with your data, ask yourself, "How does this integrate into the story I'm telling?" + +Be Aware - While starting with the end in mind is ideal, you don't want to communicate only the data that supports your intended takeaways. Doing this is called Cherry-Picking, which happens when a communicator only communicates data that supports the point they are trying to make and ignores all other data. + +If all the data that you collected clearly supports your intended takeaways, great. But if there is data that you collected that doesn't support your takeaways, or even supports an argument against your key takeaways, you should communicate that data as well. If this happens, be upfront with your audience and let them know why you're choosing to stick with your story even though all the data doesn't necessarily support it. + + +### 3. Approach it Like an Actual Story +A traditional story happens in 5 phases.
You may have heard these phases expressed as Exposition, Rising Action, Climax, Falling Action, and Denouement. Or the easier-to-remember Context, Conflict, Climax, Closure, Conclusion. When communicating your data and your story, you can take a similar approach. + +You can begin with context, set the stage and make sure your audience is all on the same page. Then introduce the conflict. Why did you need to collect this data? What problems were you seeking to solve? After that, the climax. What is the data? What does the data mean? What solutions does the data tell us we need? Then you get to the closure, where you can reiterate the problem, and the proposed solution(s). Lastly, we come to the conclusion, where you can summarize your key takeaways and the next steps you recommend the team takes. + +### 4. Use Meaningful Words & Phrases +If you and I were working together on a product, and I said to you "Our users take a long time to onboard onto our platform," how long would you estimate that "long time" to be? An hour? A week? It's hard to know. What if I said that to an entire audience? Everyone in the audience may end up with a different idea of how long users take to onboard onto our platform. + +Instead, what if I said "Our users take, on average, 3 minutes to sign up and onboard onto our platform." + +That messaging is much clearer. When communicating data, it can be easy to think that everyone in your audience is thinking just like you. But that is not always the case. Driving clarity around your data and what it means is one of your responsibilities as a communicator. If the data or your story is not clear, your audience will have a hard time following, and it is less likely that they will understand your key takeaways. + +You can communicate data more clearly when you use meaningful words and phrases, instead of vague ones. Below are a few examples. + + - We had an *impressive* year!
+ - One person could think impressive means a 2% - 3% increase in revenue, and one person could think it means a 50% - 60% increase. + - Our users' success rates increased *dramatically*. + - How large of an increase is a dramatic increase? + - This undertaking will require *significant* effort. + - How much effort is significant? + +Using vague words could be useful as an introduction to more data that's coming, or as a summary of the story that you've just told. But consider ensuring that every part of your presentation is clear for your audience. + + +### 5. Use Emotion +Emotion is key in storytelling. It's even more important when you're telling a story with data. When you're communicating data, everything is focused on the takeaways you want your audience to have. When you evoke an emotion for an audience it helps them empathize, and makes them more likely to take action. Emotion also increases the likelihood that an audience will remember your message. + +You may have encountered this before with TV commercials. Some commercials are very somber, and use a sad emotion to connect with their audience and make the data that they're presenting really stand out. Other commercials are very upbeat and happy, and may make you associate their data with a happy feeling. + +How do you use emotion when communicating data? Below are a couple of ways. + + - Use Testimonials and Personal Stories + - When collecting data, try to collect both quantitative and qualitative data, and integrate both types of data when you're communicating. If your data is primarily quantitative, seek stories from individuals to learn more about their experience with whatever your data is telling you. + - Use Imagery + - Images help an audience see themselves in a situation. When you use + images, you can push an audience toward the emotion that you feel + they should have about your data. + - Use Color + - Different colors evoke different emotions.
Popular colors and the emotions they evoke are below. Be aware that colors could have different meanings in different cultures. + - Blue usually evokes emotions of peace and trust + - Green is usually related to nature and the environment + - Red usually evokes passion and excitement + - Yellow usually evokes optimism and happiness + +# Communication Case Study +Emerson is a Product Manager for a mobile app. Emerson has noticed that customers submit 42% more complaints and bug reports on the weekends. Emerson also noticed that customers who submit a complaint that goes unanswered after 48 hours are 32% more likely to give the app a rating of 1 or 2 in the app store. + +After doing research, Emerson has a couple of solutions that will address the issue. Emerson sets up a 30-minute meeting with the 3 company leads to communicate the data and the proposed solutions. + +During this meeting, Emerson's goal is to have the company leads understand that the 2 solutions below can improve the app's rating, which will likely translate into higher revenue. + +**Solution 1.** Hire customer service reps to work on weekends + +**Solution 2.** Purchase a new customer service ticketing system where customer service reps can easily identify which complaints have been in the queue the longest - so they can tell which to address most immediately. + +In the meeting, Emerson spends 5 minutes explaining why having a low rating on the app store is bad, 10 minutes explaining the research process and how the trends were identified, 10 minutes going through some of the recent customer complaints, and the last 5 minutes glossing over the 2 potential solutions. + +Was this an effective way for Emerson to communicate during this meeting? + +During the meeting, one company lead fixated on the 10 minutes of customer complaints that Emerson went through. After the meeting, these complaints were the only thing that this team lead remembered.
Another company lead primarily focused on Emerson describing the research process. The third company lead did remember the solutions proposed by Emerson but wasn't sure how those solutions could be implemented. + +In the situation above, you can see that there was a significant gap between what Emerson wanted the team leads to take away, and what they ended up taking away from the meeting. Below is another approach that Emerson could consider. + +How could Emerson improve this approach? +Context, Conflict, Climax, Closure, Conclusion +**Context** - Emerson could spend the first 5 minutes introducing the entire situation and making sure that the team leads understand how the problems affect metrics that are critical to the company, like revenue. + +It could be laid out this way: "Currently, our app's rating in the app store is a 2.5. Ratings in the app store are critical to App Store Optimization, which impacts how many users see our app in search, and how our app is viewed by prospective users. And of course, the number of users we have is tied directly to revenue." + +**Conflict** Emerson could then move to talk for the next 5 minutes or so on the conflict. + +It could go like this: "Users submit 42% more complaints and bug reports on the weekends. Customers who submit a complaint that goes unanswered after 48 hours are 32% less likely to give our app a rating over a 2 in the app store. Improving our app's rating in the app store to a 4 would improve our visibility by 20-30%, which I project would increase revenue by 10%." Of course, Emerson should be prepared to justify these numbers. + +**Climax** After laying the groundwork, Emerson could then move to the Climax for 5 or so minutes.
+ +Emerson could introduce the proposed solutions, lay out how those solutions will address the issues outlined, how those solutions could be implemented into existing workflows, how much the solutions cost, what the ROI of the solutions would be, and maybe even show some screenshots or wireframes of how the solutions would look if implemented. Emerson could also share testimonials from users who took over 48 hours to have their complaint addressed, and even a testimonial from a current customer service representative within the company who has comments on the current ticketing system. + +**Closure** Now Emerson can spend 5 minutes restating the problems faced by the company, revisit the proposed solutions, and review why those solutions are the right ones. + +**Conclusion** Because this is a meeting with a few stakeholders where two-way communication will be used, Emerson could then plan to leave 10 minutes for questions, to make sure that anything that was confusing to the team leads could be clarified before the meeting is over. + +If Emerson took this second approach, it is much more likely that the team leads would take away from the meeting exactly what Emerson intended for them to take away - that the way complaints and bugs are handled could be improved, and there are 2 solutions that could be put in place to make that improvement happen. This would be a much more effective way of communicating the data, and the story, that Emerson wants to communicate. + + +# Conclusion +### Summary of main points + - To communicate is to convey or exchange information. + - When communicating data, your aim shouldn't be to simply pass along numbers to your audience. Your aim should be to communicate a story that is informed by your data. + - There are 2 types of communication, One-Way Communication (information is communicated with no intention of a response) and Two-Way Communication (information is communicated back and forth).
+ - There are many strategies you can use to tell a story with your data; the 5 strategies we went over are: + - Understand Your Audience, Your Channel, & Your Communication Method + - Begin with the End in Mind + - Approach it Like an Actual Story + - Use Meaningful Words & Phrases + - Use Emotion + +### Recommended Resources for Self Study +[The Five C's of Storytelling - Articulate Persuasion](http://articulatepersuasion.com/the-five-cs-of-storytelling/) + +[1.4 Your Responsibilities as a Communicator - Business Communication for Success (umn.edu)](https://open.lib.umn.edu/businesscommunication/chapter/1-4-your-responsibilities-as-a-communicator/) + +[How to Tell a Story with Data (hbr.org)](https://hbr.org/2013/04/how-to-tell-a-story-with-data) + +[Two-Way Communication: 4 Tips for a More Engaged Workplace (yourthoughtpartner.com)](https://www.yourthoughtpartner.com/blog/bid/59576/4-steps-to-increase-employee-engagement-through-two-way-communication) + +[6 succinct steps to great data storytelling - BarnRaisers, LLC (barnraisersllc.com)](https://barnraisersllc.com/2021/05/02/6-succinct-steps-to-great-data-storytelling/) + +[How to Tell a Story With Data | Lucidchart Blog](https://www.lucidchart.com/blog/how-to-tell-a-story-with-data) + +[6 Cs of Effective Storytelling on Social Media | Cooler Insights](https://coolerinsights.com/2018/06/effective-storytelling-social-media/) + +[The Importance of Emotions In Presentations | Ethos3 - A Presentation Training and Design Agency](https://ethos3.com/2015/02/the-importance-of-emotions-in-presentations/) + +[Data storytelling: linking emotions and rational decisions (toucantoco.com)](https://www.toucantoco.com/en/blog/data-storytelling-dataviz) + +[Emotional Advertising: How Brands Use Feelings to Get People to Buy (hubspot.com)](https://blog.hubspot.com/marketing/emotions-in-advertising-examples) + +[Choosing Colors for Your Presentation Slides | Think Outside The
Slide](https://www.thinkoutsidetheslide.com/choosing-colors-for-your-presentation-slides/) + +[How To Present Data [10 Expert Tips] | ObservePoint](https://resources.observepoint.com/blog/10-tips-for-presenting-data) + +[Microsoft Word - Persuasive Instructions.doc (tpsnva.org)](https://www.tpsnva.org/teach/lq/016/persinstr.pdf) + +[The Power of Story for Your Data (thinkhdi.com)](https://www.thinkhdi.com/library/supportworld/2019/power-story-your-data.aspx) + +[Common Mistakes in Data Presentation (perceptualedge.com)](https://www.perceptualedge.com/articles/ie/data_presentation.pdf) + +[Infographic: Here are 15 Common Data Fallacies to Avoid (visualcapitalist.com)](https://www.visualcapitalist.com/here-are-15-common-data-fallacies-to-avoid/) + +[Cherry Picking: When People Ignore Evidence that They Dislike - Effectiviology](https://effectiviology.com/cherry-picking/#How_to_avoid_cherry_picking) + +[Tell Stories with Data: Communication in Data Science | by Sonali Verghese | Towards Data Science](https://towardsdatascience.com/tell-stories-with-data-communication-in-data-science-5266f7671d7) + +[1. Communicating Data - Communicating Data with Tableau [Book] (oreilly.com)](https://www.oreilly.com/library/view/communicating-data-with/9781449372019/ch01.html) + + + +## [Post-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/31) + +Review what you've just learned with the Post-Lecture Quiz above!
+ + +## Assignment + +[Market Research](assignment.md) \ No newline at end of file diff --git a/4-Data-Science-Lifecycle/16-communication/translations/README.hi.md b/4-Data-Science-Lifecycle/16-communication/translations/README.hi.md index 681d2923..a89904c4 100644 --- a/4-Data-Science-Lifecycle/16-communication/translations/README.hi.md +++ b/4-Data-Science-Lifecycle/16-communication/translations/README.hi.md @@ -1,6 +1,6 @@ # डेटा विज्ञान के जीवनचक्र: संचार -|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev)](https://github.com/Heril18/Data-Science-For-Beginners/raw/main/sketchnotes/16-Communicating.png)| +|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev)](../..//sketchnotes/16-Communicating.png)| |:---:| | डेटा विज्ञान के जीवनचक्र: संचार - _[@nitya](https://twitter.com/nitya) द्वारा स्केचनोट_| diff --git a/4-Data-Science-Lifecycle/16-communication/README.ko.md b/4-Data-Science-Lifecycle/16-communication/translations/README.ko.md similarity index 100% rename from 4-Data-Science-Lifecycle/16-communication/README.ko.md rename to 4-Data-Science-Lifecycle/16-communication/translations/README.ko.md From bf5592304719cf4f94dd93be6eeee154473f2c14 Mon Sep 17 00:00:00 2001 From: Jen Looper Date: Sun, 5 Dec 2021 20:29:44 -0500 Subject: [PATCH 10/13] removing problematic translations --- .../assignment.ko.md | 18 -- .../translations/README.ko.md | 227 ------------------ .../assignment.ko.md | 17 -- .../{ => translations}/assignment.hi.md | 0 .../{ => translations}/assignment.ko.md | 0 5 files changed, 262 deletions(-) delete mode 100644 3-Data-Visualization/11-visualization-proportions/assignment.ko.md delete mode 100644 3-Data-Visualization/11-visualization-proportions/translations/README.ko.md delete mode 100644 3-Data-Visualization/13-meaningful-visualizations/assignment.ko.md rename
3-Data-Visualization/13-meaningful-visualizations/{ => translations}/assignment.hi.md (100%) rename 4-Data-Science-Lifecycle/14-Introduction/{ => translations}/assignment.ko.md (100%) diff --git a/3-Data-Visualization/11-visualization-proportions/assignment.ko.md b/3-Data-Visualization/11-visualization-proportions/assignment.ko.md deleted file mode 100644 index 6f7ca60c..00000000 --- a/3-Data-Visualization/11-visualization-proportions/assignment.ko.md +++ /dev/null @@ -1,18 +0,0 @@ -# Try it in Excel -# ์—‘์…€๋กœ ํ•ด๋ณด์„ธ์š”. - -## Instructions -## ์ง€์‹œ์‚ฌํ•ญ - -Did you know you can create donut, pie, and waffle charts in Excel? Using a dataset of your choice, create these three charts right in an Excel spreadsheet. -์—‘์…€์—์„œ ๋„๋„›, ํŒŒ์ด, ์™€ํ”Œ ์ฐจํŠธ๋ฅผ ๋งŒ๋“ค ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ๊ณ  ์žˆ์—ˆ๋‚˜์š”? ์„ ํƒํ•œ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜์—ฌ Excel ์Šคํ”„๋ ˆ๋“œ์‹œํŠธ์— ์„ธ ๊ฐœ์˜ ์ฐจํŠธ๋ฅผ ์ง์ ‘ ๋งŒ๋“œ์‹ญ์‹œ์˜ค. - -## Rubric -## ๋ฃจ๋ธŒ๋ฆญ - -| Exemplary | Adequate | Needs Improvement | -| ------------------------------------------------------- | ------------------------------------------------- | ------------------------------------------------------ | -| An Excel spreadsheet is presented with all three charts | An Excel spreadsheet is presented with two charts | An Excel spreadsheet is presented with only one chart | -| ๋ชจ๋ฒ” | ์ถฉ๋ถ„ | ๊ฐœ์„  ํ•„์š” | -| ------------------------------------------------------- | ------------------------------------------------- | ------------------------------------------------------ | -| ์—‘์…€ ์Šคํ”„๋ ˆ๋“œ์‹œํŠธ๋Š” ์„ธ ์ฐจํŠธ์™€ ํ•จ๊ป˜ ์ œ์‹œ๋ฉ๋‹ˆ๋‹ค | ์—‘์…€ ์Šคํ”„๋ ˆ๋“œ์‹œํŠธ๋Š” ๋‘ ์ฐจํŠธ์™€ ํ•จ๊ป˜ ์ œ์‹œ๋ฉ๋‹ˆ๋‹ค | ์—‘์…€ ์Šคํ”„๋ ˆ๋“œ์‹œํŠธ๋Š” ์˜ค์ง ํ•˜๋‚˜์˜ ์ฐจํŠธ์™€ ํ•จ๊ป˜ ์ œ์‹œ๋ฉ๋‹ˆ๋‹ค | diff --git a/3-Data-Visualization/11-visualization-proportions/translations/README.ko.md b/3-Data-Visualization/11-visualization-proportions/translations/README.ko.md deleted file mode 100644 index da50922c..00000000 --- 
a/3-Data-Visualization/11-visualization-proportions/translations/README.ko.md +++ /dev/null @@ -1,227 +0,0 @@ -# Visualizing Proportions -# ๋น„์œจ ์‹œ๊ฐํ™” - -|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/11-Visualizing-Proportions.png)| -|:---:| -|Visualizing Proportions - _Sketchnote by [@nitya](https://twitter.com/nitya)_ | -|๋น„์œจ ์‹œ๊ฐํ™” - _์ œ์ž‘์ž : [@nitya](https://twitter.com/nitya)_ | - -In this lesson, you will use a different nature-focused dataset to visualize proportions, such as how many different types of fungi populate a given dataset about mushrooms. Let's explore these fascinating fungi using a dataset sourced from Audubon listing details about 23 species of gilled mushrooms in the Agaricus and Lepiota families. You will experiment with tasty visualizations such as: -์ด ๊ณผ์ •์—์„œ๋Š” ๋ฒ„์„ฏ์— ๋Œ€ํ•ด ์ฃผ์–ด์ง„ ๋ฐ์ดํ„ฐ์…‹์— ์–ผ๋งˆ๋‚˜ ๋งŽ์€ ์ข…๋ฅ˜์˜ ๊ณฐํŒก์ด๊ฐ€ ์ฑ„์›Œ์ ธ ์žˆ๋Š”์ง€์™€ ๊ฐ™์€ ๋‹ค๋ฅธ ์ž์—ฐ์— ์ดˆ์ ์„ ๋งž์ถ˜ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•˜์—ฌ ๋น„์œจ์„ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค. Agaricus์™€ Lepiota๊ณผ์— ์†ํ•˜๋Š” 23์ข…์˜ ๊ตฌ์ด๋ฒ„์„ฏ์— ๋Œ€ํ•œ ์„ธ๋ถ€ ์ •๋ณด๋ฅผ ๋‚˜์—ดํ•œ Audubon์˜ ๋ฐ์ดํ„ฐ์…‹์„ ์ด์šฉํ•˜์—ฌ ์ด ๋งค๋ ฅ์ ์ธ ๊ณฐํŒก์ด๋ฅผ ํƒํ—˜ํ•ด ๋ด…์‹œ๋‹ค. ๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ง›์žˆ๋Š” ์‹œ๊ฐํ™”๋ฅผ ์‹คํ—˜ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค. - -- Pie charts ๐Ÿฅง -- Donut charts ๐Ÿฉ -- Waffle charts ๐Ÿง‡ -- ์›ํ˜• ์ฐจํŠธ ๐Ÿฅง -- ๋„๋„› ์ฐจํŠธ ๐Ÿฉ -- ์™€ํ”Œ ์ฐจํŠธ ๐Ÿง‡ - -> ๐Ÿ’ก A very interesting project called [Charticulator](https://charticulator.com) by Microsoft Research offers a free drag and drop interface for data visualizations. In one of their tutorials they also use this mushroom dataset! So you can explore the data and learn the library at the same time: [Charticulator tutorial](https://charticulator.com/tutorials/tutorial4.html). 
-> ๐Ÿ’ก Microsoft Research์˜ [Charticulator](https://charticulator.com)๋ผ๋Š” ๋งค์šฐ ํฅ๋ฏธ๋กœ์šด ํ”„๋กœ์ ํŠธ๋Š” ๋ฐ์ดํ„ฐ ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•œ ๋ฌด๋ฃŒ ๋Œ์–ด ๋†“๊ธฐ(๋“œ๋ž˜๊ทธ ์•ค ๋“œ๋ž) ์ธํ„ฐํŽ˜์ด์Šค๋ฅผ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. ํŠœํ† ๋ฆฌ์–ผ ์ค‘ ํ•˜๋‚˜์—์„œ๋Š” ๋ฒ„์„ฏ ๋ฐ์ดํ„ฐ ์„ธํŠธ๋„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค! ๋”ฐ๋ผ์„œ ๋ฐ์ดํ„ฐ๋ฅผ ํƒ์ƒ‰ํ•˜๋ฉด์„œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ๋™์‹œ์— ํ•™์Šตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. [Charticulator tutorial](https://charticulator.com/tutorials/tutorial4.html). - -## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/20) -## [์‚ฌ์ „ ๊ฐ•์˜ ํ€ด์ฆˆ](https://red-water-0103e7a0f.azurestaticapps.net/quiz/20) - -## Get to know your mushrooms ๐Ÿ„ -## ๋ฒ„์„ฏ์„ ์•Œ์•„๊ฐ€์„ธ์š” ๐Ÿ„ - -Mushrooms are very interesting. Let's import a dataset to study them: -๋ฒ„์„ฏ์€ ๋งค์šฐ ํฅ๋ฏธ๋กญ์Šต๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ์…‹์„ ๊ฐ€์ ธ์™€ ์—ฐ๊ตฌํ•ด ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค. - -```python -import pandas as pd -import matplotlib.pyplot as plt -mushrooms = pd.read_csv('../../data/mushrooms.csv') -mushrooms.head() -``` -A table is printed out with some great data for analysis: -๋ถ„์„์„ ์œ„ํ•œ ๋ช‡ ๊ฐ€์ง€ ํ›Œ๋ฅญํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ํฌํ•จ๋œ ํ‘œ๊ฐ€ ์ธ์‡„๋ฉ๋‹ˆ๋‹ค: - - -| class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | -| --------- | --------- | ----------- | --------- | ------- | ------- | --------------- | ------------ | --------- | ---------- | ----------- | ---------- | ------------------------ | ------------------------ | ---------------------- | ---------------------- | --------- | ---------- | ----------- | --------- | ----------------- | ---------- | ------- | -| Poisonous | Convex | Smooth | Brown | Bruises | Pungent | Free | Close | Narrow | Black | Enlarging | Equal | Smooth | 
Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban | -| Edible | Convex | Smooth | Yellow | Bruises | Almond | Free | Close | Broad | Black | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Grasses | -| Edible | Bell | Smooth | White | Bruises | Anise | Free | Close | Broad | Brown | Enlarging | Club | Smooth | Smooth | White | White | Partial | White | One | Pendant | Brown | Numerous | Meadows | -| Poisonous | Convex | Scaly | White | Bruises | Pungent | Free | Close | Narrow | Brown | Enlarging | Equal | Smooth | Smooth | White | White | Partial | White | One | Pendant | Black | Scattered | Urban | - -Right away, you notice that all the data is textual. You will have to convert this data to be able to use it in a chart. Most of the data, in fact, is represented as an object: -๋ฐ”๋กœ, ๋ชจ๋“  ๋ฐ์ดํ„ฐ๊ฐ€ ํ…์ŠคํŠธ์ž„์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด ๋ฐ์ดํ„ฐ๋ฅผ ์ฐจํŠธ์—์„œ ์‚ฌ์šฉํ•˜๋ ค๋ฉด ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ™˜ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. 
์‹ค์ œ๋กœ ๋Œ€๋ถ€๋ถ„์˜ ๋ฐ์ดํ„ฐ๋Š” ๊ฐœ์ฒด๋กœ ํ‘œํ˜„๋ฉ๋‹ˆ๋‹ค: - -```python -print(mushrooms.select_dtypes(["object"]).columns) -``` - -The output is: -์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค: - -```output -Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor', - 'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color', - 'stalk-shape', 'stalk-root', 'stalk-surface-above-ring', - 'stalk-surface-below-ring', 'stalk-color-above-ring', - 'stalk-color-below-ring', 'veil-type', 'veil-color', 'ring-number', - 'ring-type', 'spore-print-color', 'population', 'habitat'], - dtype='object') -``` -Take this data and convert the 'class' column to a category: -์ด ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ ธ์™€์„œ 'ํด๋ž˜์Šค' ์—ด์„ ๋ฒ”์ฃผ๋กœ ๋ณ€ํ™˜ํ•ฉ๋‹ˆ๋‹ค: - -```python -cols = mushrooms.select_dtypes(["object"]).columns -mushrooms[cols] = mushrooms[cols].astype('category') -``` -Now, if you print out the mushrooms data, you can see that it has been grouped into categories according to the poisonous/edible class: -์ด์ œ ๋ฒ„์„ฏ ๋ฐ์ดํ„ฐ๋ฅผ ์ธ์‡„ํ•˜๋ฉด poisonous/editable ํด๋ž˜์Šค์— ๋”ฐ๋ผ ๋ฒ”์ฃผ๋กœ ๋ถ„๋ฅ˜๋˜์—ˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: - - -| | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | -| --------- | --------- | ----------- | --------- | ------- | ---- | --------------- | ------------ | --------- | ---------- | ----------- | --- | ------------------------ | ---------------------- | ---------------------- | --------- | ---------- | ----------- | --------- | ----------------- | ---------- | ------- | -| class | | | | | | | | | | | | | | | | | | | | | | -| Edible | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | ... 
| 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | 4208 | -| Poisonous | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | ... | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | 3916 | - -If you follow the order presented in this table to create your class category labels, you can build a pie chart: -์ด ํ‘œ์— ๋‚˜์™€ ์žˆ๋Š” ์ˆœ์„œ์— ๋”ฐ๋ผ ํด๋ž˜์Šค ๋ฒ”์ฃผ ๋ ˆ์ด๋ธ”์„ ๋งŒ๋“ค๋ฉด ํŒŒ์ด ์ฐจํŠธ๋ฅผ ์ž‘์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: - -## Pie! -## ํŒŒ์ด! - -```python -labels=['Edible','Poisonous'] -plt.pie(edibleclass['population'],labels=labels,autopct='%.1f %%') -plt.title('Edible?') -plt.show() -``` -Voila, a pie chart showing the proportions of this data according to these two classes of mushrooms. It's quite important to get the order of the labels correct, especially here, so be sure to verify the order with which the label array is built! - -![pie chart](images/pie1.png) -![ํŒŒ์ด ์ฐจํŠธ](images/pie1.png) - -## Donuts! -## ๋„๋„›! - -A somewhat more visually interesting pie chart is a donut chart, which is a pie chart with a hole in the middle. Let's look at our data using this method. -์ข€ ๋” ์‹œ๊ฐ์ ์œผ๋กœ ํฅ๋ฏธ๋กœ์šด ํŒŒ์ด ์ฐจํŠธ๋Š” ๋„๋„› ์ฐจํŠธ์ž…๋‹ˆ๋‹ค. ์ด๊ฒƒ์€ ๊ฐ€์šด๋ฐ์— ๊ตฌ๋ฉ์ด ์žˆ๋Š” ํŒŒ์ด ์ฐจํŠธ์ž…๋‹ˆ๋‹ค. ์ด ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์šฐ๋ฆฌ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ดํŽด๋ด…์‹œ๋‹ค. - -Take a look at the various habitats where mushrooms grow: -๋ฒ„์„ฏ์ด ์ž๋ผ๋Š” ๋‹ค์–‘ํ•œ ์„œ์‹์ง€๋ฅผ ์‚ดํŽด๋ณด์„ธ์š”: - -```python -habitat=mushrooms.groupby(['habitat']).count() -habitat -``` -Here, you are grouping your data by habitat. There are 7 listed, so use those as labels for your donut chart: -์—ฌ๊ธฐ์„œ๋Š” ์„œ์‹์ง€์— ๋”ฐ๋ผ ๋ฐ์ดํ„ฐ๋ฅผ ๊ทธ๋ฃนํ™”ํ•ฉ๋‹ˆ๋‹ค. 
7๊ฐœ๊ฐ€ ๋‚˜์—ด๋˜์–ด ์žˆ์œผ๋ฏ€๋กœ ๋„๋„› ์ฐจํŠธ์˜ ๋ ˆ์ด๋ธ”๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค: - -```python -labels=['Grasses','Leaves','Meadows','Paths','Urban','Waste','Wood'] - -plt.pie(habitat['class'], labels=labels, - autopct='%1.1f%%', pctdistance=0.85) - -center_circle = plt.Circle((0, 0), 0.40, fc='white') -fig = plt.gcf() - -fig.gca().add_artist(center_circle) - -plt.title('Mushroom Habitats') - -plt.show() -``` - -![donut chart](images/donut.png) -![๋„๋„› ์ฐจํŠธ](images/donut.png) - -This code draws a chart and a center circle, then adds that center circle in the chart. Edit the width of the center circle by changing `0.40` to another value. -์ด ์ฝ”๋“œ๋Š” ์ฐจํŠธ์™€ ์ค‘์‹ฌ ์›์„ ๊ทธ๋ฆฌ๊ณ , ์ฐจํŠธ์— ํ•ด๋‹น ์ค‘์‹ฌ ์›์„ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. 0.40์„ ๋‹ค๋ฅธ ๊ฐ’์œผ๋กœ ๋ณ€๊ฒฝํ•˜์—ฌ ์ค‘์‹ฌ ์›์˜ ๋„ˆ๋น„๋ฅผ ํŽธ์ง‘ํ•ฉ๋‹ˆ๋‹ค. - -Donut charts can be tweaked in several ways to change the labels. The labels in particular can be highlighted for readability. Learn more in the [docs](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html?highlight=donut). -๋„๋„› ์ฐจํŠธ๋Š” ๋ ˆ์ด๋ธ”์„ ๋ณ€๊ฒฝํ•˜๊ธฐ ์œ„ํ•ด ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์œผ๋กœ ์ˆ˜์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํŠนํžˆ ๋ผ๋ฒจ์€ ๊ฐ€๋…์„ฑ์„ ์œ„ํ•ด ๊ฐ•์กฐ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ž์„ธํ•œ ๋‚ด์šฉ์€ [https](https://matplotlib.org/stable/gallery/pie_and_polar_charts/pie_and_donut_labels.html?highlight=donut))์—์„œ ํ™•์ธํ•˜์‹ญ์‹œ์˜ค. - -Now that you know how to group your data and then display it as a pie or donut, you can explore other types of charts. Try a waffle chart, which is just a different way of exploring quantity. -## Waffles! -์ด์ œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ทธ๋ฃนํ™”ํ•œ ๋‹ค์Œ ํŒŒ์ด ๋˜๋Š” ๋„๋„›์œผ๋กœ ํ‘œ์‹œํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•Œ์•˜์œผ๋ฏ€๋กœ ๋‹ค๋ฅธ ์œ ํ˜•์˜ ์ฐจํŠธ๋ฅผ ์‚ดํŽด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์–‘์„ ํƒ๊ตฌํ•˜๋Š” ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์ธ ์™€ํ”Œ ์ฐจํŠธ๋ฅผ ์‹œ๋„ํ•ด ๋ณด์„ธ์š”. -## ์™€ํ”Œ! - -A 'waffle' type chart is a different way to visualize quantities as a 2D array of squares. 
Try visualizing the different quantities of mushroom cap colors in this dataset. To do this, you need to install a helper library called [PyWaffle](https://pypi.org/project/pywaffle/) and use Matplotlib: -'์™€ํ”Œ' ์œ ํ˜• ์ฐจํŠธ๋Š” ์–‘์„ 2D ์ •์‚ฌ๊ฐํ˜• ๋ฐฐ์—ด๋กœ ์‹œ๊ฐํ™”ํ•˜๋Š” ๋‹ค๋ฅธ ๋ฐฉ๋ฒ•์ž…๋‹ˆ๋‹ค. ์ด ๋ฐ์ดํ„ฐ์…‹์— ์žˆ๋Š” ๋ฒ„์„ฏ ๋จธ๋ฆฌ ์ƒ‰์ƒ์˜ ๋‹ค์–‘ํ•œ ์–‘์„ ์‹œ๊ฐํ™”ํ•ด ๋ณด์‹ญ์‹œ์˜ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ ค๋ฉด [PyWaffle](https://pypi.org/project/pywaffle/)์ด๋ผ๋Š” ๋„์šฐ๋ฏธ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์„ค์น˜ํ•˜๊ณ  Matplotlib์„ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค: - -```python -pip install pywaffle -``` - -Select a segment of your data to group: -๊ทธ๋ฃนํ™”ํ•  ๋ฐ์ดํ„ฐ์˜ ๋ถ€๋ถ„ ์„ ํƒ: - -```python -capcolor=mushrooms.groupby(['cap-color']).count() -capcolor -``` - -Create a waffle chart by creating labels and then grouping your data: -๋ ˆ์ด๋ธ”์„ ๋งŒ๋“  ๋‹ค์Œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ทธ๋ฃนํ™”ํ•˜์—ฌ ์™€ํ”Œ ์ฐจํŠธ๋ฅผ ๋งŒ๋“ญ๋‹ˆ๋‹ค: - -```python -import pandas as pd -import matplotlib.pyplot as plt -from pywaffle import Waffle - -data ={'color': ['brown', 'buff', 'cinnamon', 'green', 'pink', 'purple', 'red', 'white', 'yellow'], - 'amount': capcolor['class'] - } - -df = pd.DataFrame(data) - -fig = plt.figure( - FigureClass = Waffle, - rows = 100, - values = df.amount, - labels = list(df.color), - figsize = (30,30), - colors=["brown", "tan", "maroon", "green", "pink", "purple", "red", "whitesmoke", "yellow"], -) -``` - -Using a waffle chart, you can plainly see the proportions of cap colors of this mushrooms dataset. Interestingly, there are many green-capped mushrooms! -์™€ํ”Œ ์ฐจํŠธ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ด ๋ฒ„์„ฏ ๋ฐ์ดํ„ฐ์…‹์˜ ๋จธ๋ฆฌ ์ƒ‰์ƒ ๋น„์œจ์„ ์‰ฝ๊ฒŒ ์•Œ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ํฅ๋ฏธ๋กญ๊ฒŒ๋„, ๋…น์ƒ‰ ๋จธ๋ฆฌ๊ฐ€ ์žˆ๋Š” ๋ฒ„์„ฏ์ด ๋งŽ์ด ์žˆ๋‹ต๋‹ˆ๋‹ค! - -![waffle chart](images/waffle.png) -![์™€ํ”Œ ์ฐจํŠธ](images/waffle.png) - - -โœ… Pywaffle supports icons within the charts that use any icon available in [Font Awesome](https://fontawesome.com/). 
Do some experiments to create an even more interesting waffle chart using icons instead of squares. -โœ… Pywaffle์€ ์ฐจํŠธ ๋‚ด์—์„œ [Font Awesome](https://fontawesome.com/))์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ์•„์ด์ฝ˜์„ ์ง€์›ํ•ฉ๋‹ˆ๋‹ค. ์ •์‚ฌ๊ฐํ˜• ๋Œ€์‹  ์•„์ด์ฝ˜์„ ์‚ฌ์šฉํ•˜์—ฌ ํ›จ์”ฌ ๋” ํฅ๋ฏธ๋กœ์šด ์™€ํ”Œ ์ฐจํŠธ๋ฅผ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ๋ช‡ ๊ฐ€์ง€ ์‹คํ—˜์„ ํ•ด๋ณด์„ธ์š”. - -In this lesson, you learned three ways to visualize proportions. First, you need to group your data into categories and then decide which is the best way to display the data - pie, donut, or waffle. All are delicious and gratify the user with an instant snapshot of a dataset. -์ด ๊ณผ์ •์—์„œ๋Š” ๋น„์œจ์„ ์‹œ๊ฐํ™”ํ•˜๋Š” ์„ธ ๊ฐ€์ง€ ๋ฐฉ๋ฒ•์„ ๋ฐฐ์› ์Šต๋‹ˆ๋‹ค. ๋จผ์ € ๋ฐ์ดํ„ฐ๋ฅผ ๋ฒ”์ฃผ๋กœ ๋ถ„๋ฅ˜ํ•œ ๋‹ค์Œ ํŒŒ์ด, ๋„๋„› ๋˜๋Š” ์™€ํ”Œ ์ค‘ ์–ด๋–ค ๊ฒƒ์ด ๋ฐ์ดํ„ฐ๋ฅผ ํ‘œ์‹œํ•˜๋Š” ๊ฐ€์žฅ ์ข‹์€ ๋ฐฉ๋ฒ•์ธ์ง€ ๊ฒฐ์ •ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์ด ๋ชจ๋“  ๊ฒƒ์ด ๋ง›์žˆ๊ณ  ๋ฐ์ดํ„ฐ์…‹์˜ ์ฆ‰๊ฐ์ ์ธ ์Šค๋ƒ…์ƒท์œผ๋กœ ์‚ฌ์šฉ์ž๋ฅผ ๋งŒ์กฑ์‹œํ‚ต๋‹ˆ๋‹ค. - -## ๐Ÿš€ Challenge -## ๐Ÿš€ ๋„์ „ - -Try recreating these tasty charts in [Charticulator](https://charticulator.com). -## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/21) -[Charticulator](https://charticulator.com)์—์„œ ๋ง›์žˆ๋Š” ์ฐจํŠธ๋ฅผ ๋‹ค์‹œ ๋งŒ๋“ค์–ด ๋ณด์‹ญ์‹œ์˜ค. -## [๊ฐ•์˜ ํ›„ ํ€ด์ฆˆ](https://red-water-0103e7a0f.azurestaticapps.net/quiz/21) - -## Review & Self Study -## ๋ฆฌ๋ทฐ & ์…€ํ”„ ํ•™์Šต - -Sometimes it's not obvious when to use a pie, donut, or waffle chart. Here are some articles to read on this topic: -๋•Œ๋•Œ๋กœ ์–ธ์ œ ํŒŒ์ด, ๋„๋„›, ์™€ํ”Œ ์ฐจํŠธ๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•˜๋Š”์ง€ ๋ช…ํ™•ํ•˜์ง€ ์•Š๋‹ค. 
다음은 이 주제에 대해 읽을 몇 가지 기사입니다: - -https://www.beautiful.ai/blog/battle-of-the-charts-pie-chart-vs-donut-chart - -https://medium.com/@hypsypops/pie-chart-vs-donut-chart-showdown-in-the-ring-5d24fd86a9ce - -https://www.mit.edu/~mbarker/formula1/f1help/11-ch-c6.htm - -https://medium.datadriveninvestor.com/data-visualization-done-the-right-way-with-tableau-waffle-chart-fdf2a19be402 - -Do some research to find more information on this sticky decision. -## Assignment -이 까다로운 결정에 대한 더 많은 정보를 찾기 위해 조사를 하세요. -## 과제 - -[Try it in Excel](assignment.md) -[엑셀로 도전해보세요](assignment.md) diff --git a/3-Data-Visualization/13-meaningful-visualizations/assignment.ko.md b/3-Data-Visualization/13-meaningful-visualizations/assignment.ko.md deleted file mode 100644 index 88dc710e..00000000 --- a/3-Data-Visualization/13-meaningful-visualizations/assignment.ko.md +++ /dev/null @@ -1,17 +0,0 @@ -# Build your own custom vis -# 나만의 사용자 정의 보기 구축 - -## Instructions -## 지시사항 - -Using the code sample in this project to create a social network, mock up data of your own social interactions. You could map your usage of social media or make a diagram of your family members. Create an interesting web app that shows a unique visualization of a social network. -이 프로젝트의 코드 샘플을 사용하여 소셜 네트워크를 만들고, 소셜 상호 작용의 데이터를 모조해 보십시오. 소셜 미디어 사용을 지도화하거나 가족 구성원의 도표를 작성할 수 있습니다. 소셜 네트워크의 고유한 시각화를 보여주는 흥미로운 웹 앱을 만듭니다. 
-## Rubric -## ๋ฃจ๋ธŒ๋ฆญ - -Exemplary | Adequate | Needs Improvement ---- | --- | -- | -A GitHub repo is presented with code that runs properly (try deploying it as a static web app) and has an annotated README explaining the project | The repo does not run properly or is not documented well | The repo does not run properly and is not documented well -๋ชจ๋ฒ” | ์ถฉ๋ถ„ | ๊ฐœ์„  ํ•„์š” ---- | --- | -- | -GitHub repo๋Š” ์ ์ ˆํ•˜๊ฒŒ ์‹คํ–‰๋˜๋Š” ์ฝ”๋“œ์™€ ํ•จ๊ป˜ ์ œ์‹œ๋˜๋ฉฐ(์ •์  ์›น ์•ฑ์œผ๋กœ ๋ฐฐํฌํ•ด ๋ณด์‹ญ์‹œ์˜ค) ํ”„๋กœ์ ํŠธ๋ฅผ ์„ค๋ช…ํ•˜๋Š” ์ฃผ์„์ด ๋‹ฌ๋ฆฐ README๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. | repo๋Š” ์ œ๋Œ€๋กœ ์‹คํ–‰๋˜์ง€ ์•Š๊ฑฐ๋‚˜ ์ž˜ ๋ฌธ์„œํ™”๋˜์ง€ ์•Š์•˜์Šต๋‹ˆ๋‹ค. | repo๋Š” ์ œ๋Œ€๋กœ ์‹คํ–‰๋˜์ง€ ์•Š์œผ๋ฉฐ ์ž˜ ๋ฌธ์„œํ™”๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค. diff --git a/3-Data-Visualization/13-meaningful-visualizations/assignment.hi.md b/3-Data-Visualization/13-meaningful-visualizations/translations/assignment.hi.md similarity index 100% rename from 3-Data-Visualization/13-meaningful-visualizations/assignment.hi.md rename to 3-Data-Visualization/13-meaningful-visualizations/translations/assignment.hi.md diff --git a/4-Data-Science-Lifecycle/14-Introduction/assignment.ko.md b/4-Data-Science-Lifecycle/14-Introduction/translations/assignment.ko.md similarity index 100% rename from 4-Data-Science-Lifecycle/14-Introduction/assignment.ko.md rename to 4-Data-Science-Lifecycle/14-Introduction/translations/assignment.ko.md From 388e2773dfbf77ef8c530a9932ab0173eba69e12 Mon Sep 17 00:00:00 2001 From: Jen Looper Date: Sun, 5 Dec 2021 20:49:03 -0500 Subject: [PATCH 11/13] removing blank files, incomplete/incorrect translations --- .../02-ethics/translations/README.es.md | 0 .../translations/README.es.md | 0 .../translations/README.md | 1 + .../translations/README.es.md | 0 .../translations/assignment.ne.md | 0 .../07-python/translations/README.es.md | 0 .../07-python/translations/assignment.fr.md | 23 -- .../translations/README.es.md | 0 .../translations/README.md | 1 + 
2-Working-With-Data/translations/README.es.md | 0 .../translations/assignment.ko.md | 2 +- .../translations/README.ko.md | 232 ------------------ .../14-Introduction/translations/README.es.md | 0 .../14-Introduction/translations/README.ko.md | 211 ---------------- .../15-analyzing/translations/README.es.md | 0 .../translations/README.es.md | 0 .../translations/README.es.md | 0 .../17-Introduction/translations/README.es.md | 0 .../18-Low-Code/translations/README.es.md | 0 .../18-Low-Code/translations/README.md | 1 + .../19-Azure/translations/README.es.md | 0 21 files changed, 4 insertions(+), 467 deletions(-) delete mode 100644 1-Introduction/02-ethics/translations/README.es.md delete mode 100644 2-Working-With-Data/05-relational-databases/translations/README.es.md create mode 100644 2-Working-With-Data/05-relational-databases/translations/README.md delete mode 100644 2-Working-With-Data/06-non-relational/translations/README.es.md delete mode 100644 2-Working-With-Data/06-non-relational/translations/assignment.ne.md delete mode 100644 2-Working-With-Data/07-python/translations/README.es.md delete mode 100644 2-Working-With-Data/07-python/translations/assignment.fr.md delete mode 100644 2-Working-With-Data/08-data-preparation/translations/README.es.md create mode 100644 2-Working-With-Data/08-data-preparation/translations/README.md delete mode 100644 2-Working-With-Data/translations/README.es.md delete mode 100644 3-Data-Visualization/13-meaningful-visualizations/translations/README.ko.md delete mode 100644 4-Data-Science-Lifecycle/14-Introduction/translations/README.es.md delete mode 100644 4-Data-Science-Lifecycle/14-Introduction/translations/README.ko.md delete mode 100644 4-Data-Science-Lifecycle/15-analyzing/translations/README.es.md delete mode 100644 4-Data-Science-Lifecycle/16-communication/translations/README.es.md delete mode 100644 4-Data-Science-Lifecycle/translations/README.es.md delete mode 100644 
5-Data-Science-In-Cloud/17-Introduction/translations/README.es.md delete mode 100644 5-Data-Science-In-Cloud/18-Low-Code/translations/README.es.md create mode 100644 5-Data-Science-In-Cloud/18-Low-Code/translations/README.md delete mode 100644 5-Data-Science-In-Cloud/19-Azure/translations/README.es.md diff --git a/1-Introduction/02-ethics/translations/README.es.md b/1-Introduction/02-ethics/translations/README.es.md deleted file mode 100644 index e69de29b..00000000 diff --git a/2-Working-With-Data/05-relational-databases/translations/README.es.md b/2-Working-With-Data/05-relational-databases/translations/README.es.md deleted file mode 100644 index e69de29b..00000000 diff --git a/2-Working-With-Data/05-relational-databases/translations/README.md b/2-Working-With-Data/05-relational-databases/translations/README.md new file mode 100644 index 00000000..0e47ad51 --- /dev/null +++ b/2-Working-With-Data/05-relational-databases/translations/README.md @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/2-Working-With-Data/06-non-relational/translations/README.es.md b/2-Working-With-Data/06-non-relational/translations/README.es.md deleted file mode 100644 index e69de29b..00000000 diff --git a/2-Working-With-Data/06-non-relational/translations/assignment.ne.md b/2-Working-With-Data/06-non-relational/translations/assignment.ne.md deleted file mode 100644 index e69de29b..00000000 diff --git a/2-Working-With-Data/07-python/translations/README.es.md b/2-Working-With-Data/07-python/translations/README.es.md deleted file mode 100644 index e69de29b..00000000 diff --git a/2-Working-With-Data/07-python/translations/assignment.fr.md b/2-Working-With-Data/07-python/translations/assignment.fr.md deleted file mode 100644 index 1e423e53..00000000 --- a/2-Working-With-Data/07-python/translations/assignment.fr.md +++ /dev/null @@ -1,23 +0,0 @@ -# Tâche: Traitement Des Données en Python - -Dans cette tâche, développez le code que nous avons commencé dans nos défis. 
La tâche a deux sections: - -## Modélisation de la Propagation de COVID-19 - - - [ ] Placez les graphiques $R_t$ de 5-6 pays sur une graphe à comparer, ou mettez plusieurs graphiques côte à côte - - [ ] Voiez comment les nombres des morts et guérisons sont liés aux nombres des infectés - - [ ] Trouvez combien de temps une maladie typique dure en mettre en corrélation le taux de l'infection avex le taux du décès et cherchez les anomalies. Peut-être vous avez besoin de regarder les pays différents à le trouver. - - [ ] Calculez le taux du décès et comment il change au fil du temps. *Il se peut que vous voulez prendre en compte la durée de la maladie en jours alors que vous pouvez déplacer une série chronologique avant de faire les calculs* - -## Analyse des Articles COVID-19 - -- [ ] Build co-occurrence matrix of different medications, and see which medications often occur together (i.e. mentioned in one abstract). You can modify the code for building co-occurrence matrix for medications and diagnoses. -- [ ] Visualize this matrix using heatmap. -- [ ] As a stretch goal, visualize the co-occurrence of medications using [chord diagram](https://en.wikipedia.org/wiki/Chord_diagram). [This library](https://pypi.org/project/chord/) may help you draw a chord diagram. -- [ ] As another stretch goal, extract dosages of different medications (such as **400mg** in *take 400mg of chloroquine daily*) using regular expressions, and build dataframe that shows different dosages for different medications. **Note**: consider numeric values that are in close textual vicinity of the medicine name. 
- -## Rubric - -Exemplaire | Acceptable | A Besoin D’amélioration ---- | --- | -- | -Chaque tâche est complet, illustré, et expliqué, y compris au moins un des deux objectifs | Plus que 5 tâches sont complets, aucun des objectifs sont essayés, ou les résultats ne sont pas évidents | Moins que 5 (mais plus que 3) tâches sont complets, les illustrations n'expliquent pas l'objectif diff --git a/2-Working-With-Data/08-data-preparation/translations/README.es.md b/2-Working-With-Data/08-data-preparation/translations/README.es.md deleted file mode 100644 index e69de29b..00000000 diff --git a/2-Working-With-Data/08-data-preparation/translations/README.md b/2-Working-With-Data/08-data-preparation/translations/README.md new file mode 100644 index 00000000..0e47ad51 --- /dev/null +++ b/2-Working-With-Data/08-data-preparation/translations/README.md @@ -0,0 +1 @@ + \ No newline at end of file diff --git a/2-Working-With-Data/translations/README.es.md b/2-Working-With-Data/translations/README.es.md deleted file mode 100644 index e69de29b..00000000 diff --git a/3-Data-Visualization/09-visualization-quantities/translations/assignment.ko.md b/3-Data-Visualization/09-visualization-quantities/translations/assignment.ko.md index ae930712..e3620623 100644 --- a/3-Data-Visualization/09-visualization-quantities/translations/assignment.ko.md +++ b/3-Data-Visualization/09-visualization-quantities/translations/assignment.ko.md @@ -2,7 +2,7 @@ ## 지침 -이 강의에서는 선형 차트, 산점도 및 막대형 차트를 사용하여 이 데이터 셋에 대한 흥미로운 사실을 보여 주었습니다. 이 과제에서는 데이터셋을 자세히 조사하여 특정 유형의 새에 대한 사실을 발견하는 과정을 진행합니다. 예를 들어, 흰기러기(Snoew Geese)에 대한 모든 흥미로운 데이터를 시각화하는 노트북을 만드는 것이 있습니다. 
위에서 언급한 세 가지의 플롯을 사용하여 여러분의 노트북을 만들어보세요. +이 강의에서는 선형 차트, 산점도 및 막대형 차트를 사용하여 이 데이터 셋에 대한 흥미로운 사실을 보여 주었습니다. 이 과제에서는 데이터셋을 자세히 조사하여 특정 유형의 새에 대한 사실을 발견하는 과정을 진행합니다. 예를 들어, 흰기러기(Snow Geese) 에 대한 모든 흥미로운 데이터를 시각화하는 노트북을 만드는 것이 있습니다. 위에서 언급한 세 가지의 플롯을 사용하여 여러분의 노트북을 만들어보세요. ## 기준표 diff --git a/3-Data-Visualization/13-meaningful-visualizations/translations/README.ko.md b/3-Data-Visualization/13-meaningful-visualizations/translations/README.ko.md deleted file mode 100644 index ad82bcdc..00000000 --- a/3-Data-Visualization/13-meaningful-visualizations/translations/README.ko.md +++ /dev/null @@ -1,232 +0,0 @@ -# Making Meaningful Visualizations -# 의미 있는 시각화 만들기 - -|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/13-MeaningfulViz.png)| -|:---:| -| Meaningful Visualizations - _Sketchnote by [@nitya](https://twitter.com/nitya)_ | -| 의미 있는 시각화 -_제작자: [@nitya](https://twitter.com/nitya)_ | - -> "If you torture the data long enough, it will confess to anything" -- [Ronald Coase](https://en.wikiquote.org/wiki/Ronald_Coase) -> "데이터를 충분히 오래 고문하면, 그것은 무엇이든 자백할 것입니다." [Ronald Coase] (https://en.wikiquote.org/wiki/Ronald_Coase) - -One of the basic skills of a data scientist is the ability to create a meaningful data visualization that helps answer questions you might have. Prior to visualizing your data, you need to ensure that it has been cleaned and prepared, as you did in prior lessons. After that, you can start deciding how best to present the data. 
-๋ฐ์ดํ„ฐ ๊ณผํ•™์ž์˜ ๊ธฐ๋ณธ ๊ธฐ์ˆ  ์ค‘ ํ•˜๋‚˜๋Š” ์‚ฌ์šฉ์ž๊ฐ€ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋Š” ์งˆ๋ฌธ์— ๋Œ€๋‹ตํ•˜๋Š” ๋ฐ ๋„์›€์ด ๋˜๋Š” ์˜๋ฏธ ์žˆ๋Š” ๋ฐ์ดํ„ฐ ์‹œ๊ฐํ™”๋ฅผ ๋งŒ๋“œ๋Š” ๋Šฅ๋ ฅ์ž…๋‹ˆ๋‹ค. ๋ฐ์ดํ„ฐ๋ฅผ ์‹œ๊ฐํ™”ํ•˜๊ธฐ ์ „์— ์ด์ „ ํ•™์Šต์—์„œ์™€ ๊ฐ™์ด ๋ฐ์ดํ„ฐ๋ฅผ ์ •๋ฆฌํ•˜๊ณ  ์ค€๋น„ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฐ ๋‹ค์Œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์žฅ ์ž˜ ํ‘œ์‹œํ•  ๋ฐฉ๋ฒ•์„ ๊ฒฐ์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. - -In this lesson, you will review: -์ด ๊ณผ์ •์—์„œ๋Š” ๋‹ค์Œ์„ ๋ณต์Šตํ•ฉ๋‹ˆ๋‹ค: - -1. How to choose the right chart type -2. How to avoid deceptive charting -3. How to work with color -4. How to style your charts for readability -5. How to build animated or 3D charting solutions -6. How to build a creative visualization -1. ์˜ฌ๋ฐ”๋ฅธ ์ฐจํŠธ ์œ ํ˜•์„ ์„ ํƒํ•˜๋Š” ๋ฐฉ๋ฒ• -2. ๊ธฐ๋งŒ์ ์ธ ์ฐจํŠธ ์ž‘์„ฑ์„ ํ”ผํ•˜๋Š” ๋ฐฉ๋ฒ• -3. ์ปฌ๋Ÿฌ๋กœ ์ž‘์—…ํ•˜๋Š” ๋ฐฉ๋ฒ• -4. ๊ฐ€๋…์„ฑ์„ ์œ„ํ•ด ์ฐจํŠธ๋ฅผ ์Šคํƒ€์ผ๋งํ•˜๋Š” ๋ฐฉ๋ฒ• -5. ์• ๋‹ˆ๋ฉ”์ด์…˜ ๋˜๋Š” 3D ์ฐจํŠธ ์ž‘์„ฑ ์†”๋ฃจ์…˜์„ ๊ตฌ์ถ•ํ•˜๋Š” ๋ฐฉ๋ฒ• -6. ์ฐฝ์˜์  ์‹œ๊ฐํ™”๋ฅผ ๋งŒ๋“œ๋Š” ๋ฐฉ๋ฒ• - -## [Pre-Lecture Quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/24) -## [์‚ฌ์ „ ๊ฐ•์˜ ํ€ด์ฆˆ](https://red-water-0103e7a0f.azurestaticapps.net/quiz/24) - -## Choose the right chart type -## ์˜ฌ๋ฐ”๋ฅธ ์ฐจํŠธ ์œ ํ˜• ์„ ํƒ - -In previous lessons, you experimented with building all kinds of interesting data visualizations using Matplotlib and Seaborn for charting. In general, you can select the [right kind of chart](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/) for the question you are asking using this table: -์ด์ „ ๊ณผ์ •์—์„œ๋Š” ์ฐจํŠธ ์ž‘์„ฑ์„ ์œ„ํ•ด Matplotlib์™€ Seaborn์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋“  ์ข…๋ฅ˜์˜ ํฅ๋ฏธ๋กœ์šด ๋ฐ์ดํ„ฐ ์‹œ๊ฐํ™”๋ฅผ ๊ตฌ์ถ•ํ•˜๋Š” ์‹คํ—˜์„ ํ–ˆ์Šต๋‹ˆ๋‹ค. 
์ผ๋ฐ˜์ ์œผ๋กœ ๋‹ค์Œ ํ‘œ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ [์ ์ ˆํ•œ ์œ ํ˜•์˜ ์ฐจํŠธ](https://chartio.com/learn/charts/how-to-select-a-data-vizualization/)๋ฅผ ์„ ํƒํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: - -| You need to: | You should use: | -| -------------------------- | ------------------------------- | -| Show data trends over time | Line | -| Compare categories | Bar, Pie | -| Compare totals | Pie, Stacked Bar | -| Show relationships | Scatter, Line, Facet, Dual Line | -| Show distributions | Scatter, Histogram, Box | -| Show proportions | Pie, Donut, Waffle | - -> โœ… Depending on the makeup of your data, you might need to convert it from text to numeric to get a given chart to support it. -> โœ… ๋ฐ์ดํ„ฐ์˜ ๊ตฌ์„ฑ์— ๋”ฐ๋ผ ํ…์ŠคํŠธ์—์„œ ์ˆซ์ž๋กœ ๋ณ€ํ™˜ํ•ด์•ผ ๋ฐ์ดํ„ฐ๋ฅผ ์ง€์›ํ•  ์ˆ˜ ์žˆ๋Š” ์ฐจํŠธ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. - -## Avoid deception -## ์†์ž„์ˆ˜๋ฅผ ํ”ผํ•˜๋ผ - -Even if a data scientist is careful to choose the right chart for the right data, there are plenty of ways that data can be displayed in a way to prove a point, often at the cost of undermining the data itself. There are many examples of deceptive charts and infographics! -๋ฐ์ดํ„ฐ ๊ณผํ•™์ž๊ฐ€ ์˜ฌ๋ฐ”๋ฅธ ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•œ ์˜ฌ๋ฐ”๋ฅธ ์ฐจํŠธ๋ฅผ ์„ ํƒํ•˜๊ธฐ ์œ„ํ•ด ์ฃผ์˜๋ฅผ ๊ธฐ์šธ์ธ๋‹ค๊ณ  ํ•ด๋„, ๋ฐ์ดํ„ฐ ์ž์ฒด๋ฅผ ์†์ƒ์‹œํ‚ค๋Š” ๋Œ€๊ฐ€๋ฅผ ์น˜๋ฅด๊ณ ๋ผ๋„, ์š”์ ์„ ์ž…์ฆํ•˜๋Š” ๋ฐฉ๋ฒ•์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ํ‘œ์‹œํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์€ ์–ผ๋งˆ๋“ ์ง€ ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ๋งŒ์ ์ธ ์ฐจํŠธ์™€ ์ธํฌ๊ทธ๋ž˜ํ”ฝ์˜ ์˜ˆ๋Š” ๋งŽ์Šต๋‹ˆ๋‹ค! - -[![How Charts Lie by Alberto Cairo](./images/tornado.png)](https://www.youtube.com/watch?v=oX74Nge8Wkw "How charts lie") -[![์•Œ๋ฒ ๋ฅดํ†  ์นด์ด๋กœ์˜ ์ฐจํŠธ ๋ˆ•๊ธฐ](..images/tornado.png)](https://www.youtube.com/watch?v=oX74Nge8Wkw "์ฐจํŠธ ๋ˆ•๊ธฐ" - -> ๐ŸŽฅ Click the image above for a conference talk about deceptive charts -> ๐ŸŽฅ ์œ„ ์ด๋ฏธ์ง€๋ฅผ ํด๋ฆญํ•˜๋ฉด ๊ธฐ๋งŒ์ ์ธ ์ฐจํŠธ์— ๋Œ€ํ•œ ์ปจํผ๋Ÿฐ์Šค ํ† ํฌ๊ฐ€ ๋‚˜์˜ต๋‹ˆ๋‹ค. 
- -This chart reverses the X axis to show the opposite of the truth, based on date: -์ด ์ฐจํŠธ๋Š” ๋‚ ์งœ๋ฅผ ๊ธฐ์ค€์œผ๋กœ X์ถ•์„ ๋ฐ˜์ „์‹œ์ผœ ์ง„์‹ค์˜ ๋ฐ˜๋Œ€ ๋ฐฉํ–ฅ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค: - -![bad chart 1](images/bad-chart-1.png) -![๋‚˜์œ ์ฐจํŠธ 1](images/bad-chart-1.png) - -[This chart](https://media.firstcoastnews.com/assets/WTLV/images/170ae16f-4643-438f-b689-50d66ca6a8d8/170ae16f-4643-438f-b689-50d66ca6a8d8_1140x641.jpg) is even more deceptive, as the eye is drawn to the right to conclude that, over time, COVID cases have declined in the various counties. In fact, if you look closely at the dates, you find that they have been rearranged to give that deceptive downward trend. -[์ด ์ฐจํŠธ](https://media.firstcoastnews.com/assets/WTLV/images/170ae16f-4643-438f-b689-50d66ca6a8d8/170ae16f-4643-438f-b689-50d66ca6a8d8_1140x641.jpg)๋Š” ์‹œ๊ฐ„์ด ์ง€๋‚จ์— ๋”ฐ๋ผ ๋‹ค์–‘ํ•œ ์นด์šดํ‹ฐ์—์„œ COVID ์‚ฌ๋ก€๊ฐ€ ๊ฐ์†Œํ–ˆ๋‹ค๊ณ  ๊ฒฐ๋ก ์„ ๋‚ด๋ฆด ์ˆ˜ ์žˆ๋Š” ๊ถŒ๋ฆฌ์— ์ฃผ๋ชฉํ•˜๊ธฐ ๋•Œ๋ฌธ์— ํ›จ์”ฌ ๋” ๊ธฐ๋งŒ์ ์ด๋‹ค. ์‚ฌ์‹ค, ๋งŒ์•ฝ ์—ฌ๋Ÿฌ๋ถ„์ด ๋‚ ์งœ๋ฅผ ์ž์„ธํžˆ ๋ณธ๋‹ค๋ฉด, ์—ฌ๋Ÿฌ๋ถ„์€ ๊ทธ๊ฒƒ๋“ค์ด ๊ธฐ๋งŒ์ ์ธ ํ•˜ํ–ฅ ์ถ”์„ธ๋ฅผ ์ฃผ๊ธฐ ์œ„ํ•ด ์žฌ๋ฐฐ์—ด๋˜์—ˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค. 
- -![bad chart 2](images/bad-chart-2.jpg) -![๋‚˜์œ ์ฐจํŠธ 2](images/bad-chart-2.jpg) - -This notorious example uses color AND a flipped Y axis to deceive: instead of concluding that gun deaths spiked after the passage of gun-friendly legislation, in fact the eye is fooled to think that the opposite is true: -์ด ์•…๋ช… ๋†’์€ ์˜ˆ๋Š” ์ƒ‰๊น”๊ณผ ๋’ค์ง‘ํžŒ Y์ถ•์„ ์‚ฌ์šฉํ•˜์—ฌ ์†์ธ๋‹ค: ์ด๊ธฐ ์นœํ™”์ ์ธ ๋ฒ•์•ˆ์ด ํ†ต๊ณผ๋œ ํ›„ ์ด๊ธฐ ์‚ฌ๋ง๋ฅ ์ด ๊ธ‰์ฆํ–ˆ๋‹ค๊ณ  ๊ฒฐ๋ก ์ง“๋Š” ๋Œ€์‹ , ์‚ฌ์‹ค ๊ทธ ๋ฐ˜๋Œ€๋ผ๊ณ  ์ƒ๊ฐํ•˜๋Š” ๊ฒƒ์€ ๋ˆˆ์„ ์†์ธ๋‹ค: - -![bad chart 3](images/bad-chart-3.jpg) -![๋‚˜์œ ์ฐจํŠธ 3](images/bad-chart-3.jpg) - -This strange chart shows how proportion can be manipulated, to hilarious effect: -์ด ์ด์ƒํ•œ ์ฐจํŠธ๋Š” ๋น„์œจ์„ ์กฐ์ž‘ํ•˜์—ฌ ์šฐ์Šค๊ฝ์Šค๋Ÿฌ์šด ํšจ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค: - -![bad chart 4](images/bad-chart-4.jpg) -![๋‚˜์œ ์ฐจํŠธ 4](images/bad-chart-4.jpg) - -Comparing the incomparable is yet another shady trick. There is a [wonderful web site](https://tylervigen.com/spurious-correlations) all about 'spurious correlations' displaying 'facts' correlating things like the divorce rate in Maine and the consumption of margarine. A Reddit group also collects the [ugly uses](https://www.reddit.com/r/dataisugly/top/?t=all) of data. -๋น„๊ตํ•  ์ˆ˜ ์—†๋Š” ๊ฒƒ์„ ๋น„๊ตํ•˜๋Š” ๊ฒƒ์€ ๋˜ ๋‹ค๋ฅธ ์Œํ‰ํ•œ ์†์ž„์ˆ˜์ด๋‹ค. ๋ฉ”์ธ์ฃผ์˜ ์ดํ˜ผ์œจ๊ณผ ๋งˆ๊ฐ€๋ฆฐ ์†Œ๋น„์™€ ๊ฐ™์€ '๋น„๊ต์  ์ƒ๊ด€๊ด€๊ณ„'๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” [์ •๋ณด ์›น์‚ฌ์ดํŠธ](https://tylervigen.com/spurious-correlations)๊ฐ€ ์žˆ๋‹ค. Reddit ๊ทธ๋ฃน์€ ๋˜ํ•œ ๋ฐ์ดํ„ฐ์˜ [์‚ฌ์šฉ๋Ÿ‰](https://www.reddit.com/r/dataisugly/top/?t=all)์„ ์ˆ˜์ง‘ํ•ฉ๋‹ˆ๋‹ค. - -It's important to understand how easily the eye can be fooled by deceptive charts. Even if the data scientist's intention is good, the choice of a bad type of chart, such as a pie chart showing too many categories, can be deceptive. 
-๋ˆˆ์ด ์–ผ๋งˆ๋‚˜ ์‰ฝ๊ฒŒ ๊ธฐ๋งŒ์ ์ธ ๋„ํ‘œ์— ์†์•„ ๋„˜์–ด๊ฐˆ ์ˆ˜ ์žˆ๋Š”์ง€ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•˜๋‹ค. ๋ฐ์ดํ„ฐ ๊ณผํ•™์ž์˜ ์˜๋„๊ฐ€ ์ข‹๋”๋ผ๋„ ๋„ˆ๋ฌด ๋งŽ์€ ๋ฒ”์ฃผ๋ฅผ ๋ณด์—ฌ์ฃผ๋Š” ํŒŒ์ด ์ฐจํŠธ์™€ ๊ฐ™์€ ์ž˜๋ชป๋œ ์œ ํ˜•์˜ ์ฐจํŠธ๋ฅผ ์„ ํƒํ•˜๋Š” ๊ฒƒ์€ ๊ธฐ๋งŒ์ ์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. - -## Color -## ์ƒ‰์ƒ - -You saw in the 'Florida gun violence' chart above how color can provide an additional layer of meaning to charts, especially ones not designed using libraries such as Matplotlib and Seaborn which come with various vetted color libraries and palettes. If you are making a chart by hand, do a little study of [color theory](https://colormatters.com/color-and-design/basic-color-theory) -๋‹น์‹ ์€ 'ํ”Œ๋กœ๋ฆฌ๋‹ค ์ด๊ธฐ ํญ๋ ฅ' ์ฐจํŠธ์—์„œ ์ƒ‰์ƒ์ด ์ฐจํŠธ์— ์–ด๋–ป๊ฒŒ ์ถ”๊ฐ€์ ์ธ ์˜๋ฏธ ์ธต์„ ์ œ๊ณตํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ๋ณด์•˜์œผ๋ฉฐ, ํŠนํžˆ ๋‹ค์–‘ํ•œ ์ปฌ๋Ÿฌ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์™€ ํŒ”๋ ˆํŠธ๊ฐ€ ์ œ๊ณต๋˜๋Š” Matplotlib๊ณผ Seaborn๊ณผ ๊ฐ™์€ ๋„์„œ๊ด€์„ ์‚ฌ์šฉํ•˜์—ฌ ์„ค๊ณ„๋˜์ง€ ์•Š์€ ๊ฒƒ์„ ๋ณด์•˜์Šต๋‹ˆ๋‹ค. ๋งŒ์•ฝ ์—ฌ๋Ÿฌ๋ถ„์ด ์†์œผ๋กœ ์ฐจํŠธ๋ฅผ ๋งŒ๋“ค๊ณ  ์žˆ๋‹ค๋ฉด, [์ƒ‰๊น” ์ด๋ก ]์„ ์กฐ๊ธˆ ์—ฐ๊ตฌํ•ด ๋ณด์„ธ์š”. https://colormatters.com/color-and-design/basic-color-theory) - -> โœ… Be aware, when designing charts, that accessibility is an important aspect of visualization. Some of your users might be color blind - does your chart display well for users with visual impairments? -> โœ… ์ฐจํŠธ๋ฅผ ์„ค๊ณ„ํ•  ๋•Œ ์ ‘๊ทผ์„ฑ์€ ์‹œ๊ฐํ™”์˜ ์ค‘์š”ํ•œ ์ธก๋ฉด์ž„์„ ์œ ์˜ํ•ด์•ผ ํ•œ๋‹ค. ์ผ๋ถ€ ์‚ฌ์šฉ์ž๋Š” ์ƒ‰๋งน์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹น์‹ ์˜ ์ฐจํŠธ๋Š” ์‹œ๊ฐ ์žฅ์• ๊ฐ€ ์žˆ๋Š” ์‚ฌ์šฉ์ž๋“ค์—๊ฒŒ ์ž˜ ํ‘œ์‹œ๋ฉ๋‹ˆ๊นŒ? - -Be careful when choosing colors for your chart, as color can convey meaning you might not intend. The 'pink ladies' in the 'height' chart above convey a distinctly 'feminine' ascribed meaning that adds to the bizarreness of the chart itself. 
-์ƒ‰์ƒ์€ ์˜๋„ํ•˜์ง€ ์•Š์€ ์˜๋ฏธ๋ฅผ ์ „๋‹ฌํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ ์ฐจํŠธ์— ์‚ฌ์šฉํ•  ์ƒ‰์ƒ์„ ์„ ํƒํ•  ๋•Œ ์ฃผ์˜ํ•˜์‹ญ์‹œ์˜ค. ์œ„์˜ 'height' ์ฐจํŠธ์— ์žˆ๋Š” 'pink ladies'๋Š” ์ฐจํŠธ ์ž์ฒด์˜ ๊ธฐ๊ดดํ•จ์„ ๋”ํ•˜๋Š” ๋šœ๋ ทํ•˜๊ฒŒ '์—ฌ์„ฑ์ ์ธ' ์˜๋ฏธ๋ฅผ ์ „๋‹ฌํ•œ๋‹ค. - -While [color meaning](https://colormatters.com/color-symbolism/the-meanings-of-colors) might be different in different parts of the world, and tend to change in meaning according to their shade. Generally speaking, color meanings include: -๊ทธ๋Ÿฌ๋‚˜ [color languation](https://colormatters.com/color-symbolism/the-meanings-of-colors)์€ ์„ธ๊ณ„ ๊ฐ์ง€์—์„œ ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์œผ๋ฉฐ ์ƒ‰์กฐ์— ๋”ฐ๋ผ ์˜๋ฏธ๊ฐ€ ๋ณ€ํ•˜๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ ์ƒ‰์ƒ์˜ ์˜๋ฏธ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. - -| Color | Meaning | -| ------ | ------------------- | -| red | power | -| blue | trust, loyalty | -| yellow | happiness, caution | -| green | ecology, luck, envy | -| purple | happiness | -| orange | vibrance | - -If you are tasked with building a chart with custom colors, ensure that your charts are both accessible and the color you choose coincides with the meaning you are trying to convey. -์‚ฌ์šฉ์ž ์ง€์ • ์ƒ‰์„ ์‚ฌ์šฉํ•˜์—ฌ ์ฐจํŠธ๋ฅผ ์ž‘์„ฑํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ ์ฐจํŠธ์— ์•ก์„ธ์Šคํ•  ์ˆ˜ ์žˆ๊ณ  ์„ ํƒํ•œ ์ƒ‰์ƒ์ด ์ „๋‹ฌํ•˜๋ ค๋Š” ์˜๋ฏธ์™€ ์ผ์น˜ํ•˜๋Š”์ง€ ํ™•์ธํ•˜์‹ญ์‹œ์˜ค. - -## Styling your charts for readability -## ๊ฐ€๋…์„ฑ์„ ์œ„ํ•œ ์ฐจํŠธ ์Šคํƒ€์ผ๋ง - -Charts are not meaningful if they are not readable! Take a moment to consider styling the width and height of your chart to scale well with your data. If one variable (such as all 50 states) need to be displayed, show them vertically on the Y axis if possible so as to avoid a horizontally-scrolling chart. -์ฐจํŠธ๋ฅผ ์ฝ์„ ์ˆ˜ ์—†์œผ๋ฉด ์˜๋ฏธ๊ฐ€ ์—†์Šต๋‹ˆ๋‹ค! ๋ฐ์ดํ„ฐ์— ๋งž๊ฒŒ ์ž˜ ํ™•์žฅ๋˜๋„๋ก ์ฐจํŠธ์˜ ๋„ˆ๋น„์™€ ๋†’์ด๋ฅผ ์Šคํƒ€์ผ๋งํ•˜๋Š” ๊ฒƒ์„ ๊ณ ๋ คํ•ด ๋ณด์‹ญ์‹œ์˜ค. 
ํ•˜๋‚˜์˜ ๋ณ€์ˆ˜(์˜ˆ: 50๊ฐœ ์ƒํƒœ ๋ชจ๋‘)๋ฅผ ํ‘œ์‹œํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ ๊ฐ€๋กœ ์Šคํฌ๋กค ์ฐจํŠธ๋ฅผ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ๊ฐ€๋Šฅํ•˜๋ฉด Y์ถ•์— ์„ธ๋กœ๋กœ ํ‘œ์‹œํ•ฉ๋‹ˆ๋‹ค. - -Label your axes, provide a legend if necessary, and offer tooltips for better comprehension of data. -์ถ•์— ๋ ˆ์ด๋ธ”์„ ์ง€์ •ํ•˜๊ณ  ํ•„์š”ํ•œ ๊ฒฝ์šฐ ๋ฒ”๋ก€๋ฅผ ์ œ๊ณตํ•˜๋ฉฐ ๋ฐ์ดํ„ฐ๋ฅผ ๋” ์ž˜ ์ดํ•ดํ•  ์ˆ˜ ์žˆ๋„๋ก ํˆดํŒ์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. - -If your data is textual and verbose on the X axis, you can angle the text for better readability. [Matplotlib](https://matplotlib.org/stable/tutorials/toolkits/mplot3d.html) offers 3d plotting, if you data supports it. Sophisticated data visualizations can be produced using `mpl_toolkits.mplot3d`. -๋ฐ์ดํ„ฐ๊ฐ€ ํ…์ŠคํŠธ์ด๊ณ  X์ถ•์— ์ƒ์„ธํ•  ๊ฒฝ์šฐ ํ…์ŠคํŠธ๋ฅผ ๋” ์ž˜ ์ฝ์„ ์ˆ˜ ์žˆ๋„๋ก ๊ฐ๋„๋ฅผ ์ง€์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. [Matplotlib](https://matplotlib.org/stable/tutorials/toolkits/mplot3d.html)์€ ๋ฐ์ดํ„ฐ๊ฐ€ ์ง€์›ํ•˜๋Š” ๊ฒฝ์šฐ 3D ํ”Œ๋กœํŒ…์„ ์ œ๊ณตํ•ฉ๋‹ˆ๋‹ค. 'mpl_toolkits.mplot3d'๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ •๊ตํ•œ ๋ฐ์ดํ„ฐ ์‹œ๊ฐํ™”๋ฅผ ์ƒ์„ฑํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. - -![3d plots](images/3d.png) - -## Animation and 3D chart display -## ์• ๋‹ˆ๋ฉ”์ด์…˜ ๋ฐ 3D ์ฐจํŠธ ํ‘œ์‹œ - -Some of the best data visualizations today are animated. Shirley Wu has amazing ones done with D3, such as '[film flowers](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/)', where each flower is a visualization of a movie. Another example for the Guardian is 'bussed out', an interactive experience combining visualizations with Greensock and D3 plus a scrollytelling article format to show how NYC handles its homeless problem by bussing people out of the city. -์˜ค๋Š˜๋‚  ์ตœ๊ณ ์˜ ๋ฐ์ดํ„ฐ ์‹œ๊ฐํ™” ์ค‘ ์ผ๋ถ€๋Š” ์• ๋‹ˆ๋ฉ”์ด์…˜์ž…๋‹ˆ๋‹ค. Shirley Wu๋Š” D3๋กœ ๋†€๋ผ์šด ์ž‘์—…์„ ํ–ˆ์Šต๋‹ˆ๋‹ค. -์˜ˆ๋ฅผ ๋“ค์–ด '[ํ•„๋ฆ„ ํ”Œ๋ผ์›Œ](http://bl.ocks.org/sxywu/raw/d612c6c653fb8b4d7ff3d422be164a5d/)',์—์„œ๋Š” ๊ฐ๊ฐ์˜ ๊ฝƒ์ด ์˜ํ™”์˜ ์‹œ๊ฐํ™”์ด๋‹ค. 
๊ฐ€๋””์–ธ์˜ ๋˜ ๋‹ค๋ฅธ ์˜ˆ๋Š” '๋ฒ„์Šค๋“œ ์•„์›ƒ'์œผ๋กœ, Greensock ๋ฐ D3์™€ ์‹œ๊ฐํ™”๋ฅผ ๊ฒฐํ•ฉํ•œ ๋Œ€ํ™”ํ˜• ์ฒดํ—˜๊ณผ NYC๊ฐ€ ์‚ฌ๋žŒ๋“ค์„ ๋„์‹œ ๋ฐ–์œผ๋กœ ๋‚ด์ซ“์•„ ๋…ธ์ˆ™์ž ๋ฌธ์ œ๋ฅผ ์–ด๋–ป๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š” ๊ธฐ์‚ฌ ํ˜•์‹์„ ํฌํ•จํ•œ๋‹ค. - -![busing](images/busing.png) -![๋ฒ„์Šค ์ˆ˜์†ก](images/busing.png) - -> "Bussed Out: How America Moves its Homeless" from [the Guardian](https://www.theguardian.com/us-news/ng-interactive/2017/dec/20/bussed-out-america-moves-homeless-people-country-study). Visualizations by Nadieh Bremer & Shirley Wu -> [๊ฐ€๋””์–ธ](https://www.theguardian.com/us-news/ng-interactive/2017/dec/20/bussed-out-america-moves-homeless-people-country-study))์˜ "๋ฒ„์Šค๋“œ ์•„์›ƒ: ๋ฏธ๊ตญ์˜ ๋…ธ์ˆ™์ž ์ด๋™ ๋ฐฉ๋ฒ•" Nadieh Bremer & Shirley Wu์˜ ์‹œ๊ฐํ™” - -While this lesson is insufficient to go into depth to teach these powerful visualization libraries, try your hand at D3 in a Vue.js app using a library to display a visualization of the book "Dangerous Liaisons" as an animated social network. -์ด ๊ตํ›ˆ์€ ์ด๋Ÿฌํ•œ ๊ฐ•๋ ฅํ•œ ์‹œ๊ฐํ™” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ๊ฐ€๋ฅด์น˜๊ธฐ์— ์ถฉ๋ถ„ํ•˜์ง€ ์•Š์ง€๋งŒ, ์• ๋‹ˆ๋ฉ”์ด์…˜ ์†Œ์…œ ๋„คํŠธ์›Œํฌ๋กœ์„œ "์œ„ํ—˜ํ•œ ๊ด€๊ณ„"๋ผ๋Š” ์ฑ…์˜ ์‹œ๊ฐํ™”๋ฅผ ํ‘œ์‹œํ•˜๊ธฐ ์œ„ํ•ด ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ Vue.js ์•ฑ์˜ D3์—์„œ ์—ฌ๋Ÿฌ๋ถ„์˜ ์†์„ ์‚ฌ์šฉํ•ด ๋ณด์‹ญ์‹œ์˜ค. - -> "Les Liaisons Dangereuses" is an epistolary novel, or a novel presented as a series of letters. Written in 1782 by Choderlos de Laclos, it tells the story of the vicious, morally-bankrupt social maneuvers of two dueling protagonists of the French aristocracy in the late 18th century, the Vicomte de Valmont and the Marquise de Merteuil. Both meet their demise in the end but not without inflicting a great deal of social damage. The novel unfolds as a series of letters written to various people in their circles, plotting for revenge or simply to make trouble. 
Create a visualization of these letters to discover the major kingpins of the narrative, visually. -> "Les Liaisons Dangereuses"๋Š” ํŽธ์ง€ ์†Œ์„ค ๋˜๋Š” ์ผ๋ จ์˜ ํŽธ์ง€๋กœ ํ‘œํ˜„๋œ ์†Œ์„ค์ด๋‹ค. 1782๋…„ Choderlos de Laclos์— ์˜ํ•ด ์“ฐ์—ฌ์ง„ ์ด ์ฑ…์€ 18์„ธ๊ธฐ ํ›„๋ฐ˜ ํ”„๋ž‘์Šค ๊ท€์กฑ์˜ ๊ฒฐํˆฌ์ ์ธ ๋‘ ์ฃผ์ธ๊ณต์ธ Vicomte de Valmont์™€ Marquise de Merteuil์˜ ์ž”์ธํ•˜๊ณ  ๋„๋•์ ์œผ๋กœ ํƒ€๋ฝํ•œ ์‚ฌํšŒ์  ์ฑ…๋žต์— ๋Œ€ํ•œ ์ด์•ผ๊ธฐ๋ฅผ ๋“ค๋ ค์ค€๋‹ค. ๋‘˜ ๋‹ค ๊ฒฐ๊ตญ ๊ทธ๋“ค์˜ ์ฃฝ์Œ์„ ๋งž์ดํ•˜์ง€๋งŒ ํฐ ์‚ฌํšŒ์  ํ”ผํ•ด๋ฅผ ์ž…ํžˆ์ง€ ์•Š๊ณ ๋Š” ์•„๋‹ˆ๋‹ค. ์ด ์†Œ์„ค์€ ๋ณต์ˆ˜์˜ ์Œ๋ชจ๋ฅผ ๊พธ๋ฏธ๊ฑฐ๋‚˜ ๋‹จ์ˆœํžˆ ๋ฌธ์ œ๋ฅผ ์ผ์œผํ‚ค๊ธฐ ์œ„ํ•ด ์„œํด์— ์žˆ๋Š” ๋‹ค์–‘ํ•œ ์‚ฌ๋žŒ๋“ค์—๊ฒŒ ์“ด ์ผ๋ จ์˜ ํŽธ์ง€๋“ค๋กœ ์ „๊ฐœ๋œ๋‹ค. ์ด ๊ธ€์ž๋“ค์„ ์‹œ๊ฐํ™”ํ•ด์„œ ์ด์•ผ๊ธฐ์˜ ์ฃผ์š” ํ‚นํ•€์„ ์‹œ๊ฐ์ ์œผ๋กœ ๋ฐœ๊ฒฌํ•˜์„ธ์š”. - -You will complete a web app that will display an animated view of this social network. It uses a library that was built to create a [visual of a network](https://github.com/emiliorizzo/vue-d3-network) using Vue.js and D3. When the app is running, you can pull the nodes around on the screen to shuffle the data around. -์ด ์†Œ์…œ ๋„คํŠธ์›Œํฌ์˜ ์• ๋‹ˆ๋ฉ”์ด์…˜ ๋ณด๊ธฐ๋ฅผ ํ‘œ์‹œํ•˜๋Š” ์›น ์•ฑ์„ ์™„๋ฃŒํ•ฉ๋‹ˆ๋‹ค. Vue.js ๋ฐ D3๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ [๋„คํŠธ์›Œํฌ์˜ ์‹œ๊ฐ](https://github.com/emiliorizzo/vue-d3-network))์„ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ๊ตฌ์ถ•๋œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ์•ฑ์ด ์‹คํ–‰ ์ค‘์ผ ๋•Œ ํ™”๋ฉด์—์„œ ๋…ธ๋“œ๋ฅผ ๋‹น๊ฒจ ๋ฐ์ดํ„ฐ๋ฅผ ์ด๋ฆฌ์ €๋ฆฌ ์„ž์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. - -![liaisons](images/liaisons.png) -![๊ด€๊ณ„](images/liaisons.png) - -## Project: Build a chart to show a network using D3.js -## ํ”„๋กœ์ ํŠธ: D3.js๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋„คํŠธ์›Œํฌ๋ฅผ ํ‘œ์‹œํ•  ์ฐจํŠธ ์ž‘์„ฑ - -> This lesson folder includes a `solution` folder where you can find the completed project, for your reference. -> ์ด ๊ณผ์ • ํด๋”์—๋Š” ์™„๋ฃŒ๋œ ํ”„๋กœ์ ํŠธ๋ฅผ ์ฐธ์กฐํ•  ์ˆ˜ ์žˆ๋Š” '์†”๋ฃจ์…˜' ํด๋”๊ฐ€ ํฌํ•จ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. - -1. 
Follow the instructions in the README.md file in the starter folder's root. Make sure you have NPM and Node.js running on your machine before installing your project's dependencies. -1. ์Šคํƒ€ํ„ฐ ํด๋”์˜ ๋ฃจํŠธ์— ์žˆ๋Š” README.md ํŒŒ์ผ์˜ ์ง€์นจ์„ ๋”ฐ๋ฆ…๋‹ˆ๋‹ค. ํ”„๋กœ์ ํŠธ์˜ ์ข…์†์„ฑ์„ ์„ค์น˜ํ•˜๊ธฐ ์ „์— ์‹œ์Šคํ…œ์—์„œ NPM ๋ฐ Node.js๊ฐ€ ์‹คํ–‰ ์ค‘์ธ์ง€ ํ™•์ธํ•˜์‹ญ์‹œ์˜ค. - -2. Open the `starter/src` folder. You'll discover an `assets` folder where you can find a .json file with all the letters from the novel, numbered, with a 'to' and 'from' annotation. -2. 'starter/src' ํด๋”๋ฅผ ์—ฝ๋‹ˆ๋‹ค. ๋‹น์‹ ์€ ์†Œ์„ค์˜ ๋ชจ๋“  ๊ธ€์ž์™€ ๋ฒˆํ˜ธ๊ฐ€ ๋งค๊ฒจ์ง„ .json ํŒŒ์ผ์„ ์ฐพ์„ ์ˆ˜ ์žˆ๋Š” assets ํด๋”๋ฅผ ๋ฐœ๊ฒฌํ•  ๊ฒƒ์ด๋‹ค. - -3. Complete the code in `components/Nodes.vue` to enable the visualization. Look for the method called `createLinks()` and add the following nested loop. -3. `components/Nodes.vue`๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹œ๊ฐํ™”๋ฅผ ํ™œ์„ฑํ™”ํ•ฉ๋‹ˆ๋‹ค. createLinks()๋ผ๋Š” ๋ฉ”์„œ๋“œ๋ฅผ ์ฐพ์•„ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ค‘์ฒฉ ๋ฃจํ”„๋ฅผ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค. - -Loop through the .json object to capture the 'to' and 'from' data for the letters and build up the `links` object so that the visualization library can consume it: -.json ๊ฐ์ฒด๋ฅผ ๋ฃจํ”„ํ•˜์—ฌ ๋ฌธ์ž์— ๋Œ€ํ•œ 'to' ๋ฐ 'from' ๋ฐ์ดํ„ฐ๋ฅผ ์บก์ฒ˜ํ•˜๊ณ  ์‹œ๊ฐํ™” ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก 'links' ๊ฐ์ฒด๋ฅผ ๊ตฌ์ถ•ํ•ฉ๋‹ˆ๋‹ค. - -```javascript -//loop through letters - let f = 0; - let t = 0; - for (var i = 0; i < letters.length; i++) { - for (var j = 0; j < characters.length; j++) { - - if (characters[j] == letters[i].from) { - f = j; - } - if (characters[j] == letters[i].to) { - t = j; - } - } - this.links.push({ sid: f, tid: t }); - } - ``` - -Run your app from the terminal (npm run serve) and enjoy the visualization! -ํ„ฐ๋ฏธ๋„(npm run serve)์—์„œ ์•ฑ์„ ์‹คํ–‰ํ•˜๊ณ  ์‹œ๊ฐํ™”๋ฅผ ์ฆ๊ธฐ์‹ญ์‹œ์˜ค! - -## ๐Ÿš€ Challenge -## ๐Ÿš€ ๋„์ „ - -Take a tour of the internet to discover deceptive visualizations. 
How does the author fool the user, and is it intentional? Try correcting the visualizations to show how they should look.
-
-## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/25)
-
-## Review & Self Study
-
-Here are some articles to read about deceptive data visualization:
-
-https://gizmodo.com/how-to-lie-with-data-visualization-1563576606
-
-http://ixd.prattsi.org/2017/12/visual-lies-usability-in-deceptive-data-visualizations/
-
-Take a look at these interesting visualizations for historical assets and artifacts:
-
-https://handbook.pubpub.org/
-
-Look through this article on how animation can enhance your visualizations:
-
-https://medium.com/@EvanSinar/use-animation-to-supercharge-data-visualization-cd905a882ad4
-
-## Assignment
-
-[Build your own custom visualization](assignment.md)
diff --git a/4-Data-Science-Lifecycle/14-Introduction/translations/README.es.md b/4-Data-Science-Lifecycle/14-Introduction/translations/README.es.md
deleted file mode 100644
index e69de29b..00000000
diff --git a/4-Data-Science-Lifecycle/14-Introduction/translations/README.ko.md b/4-Data-Science-Lifecycle/14-Introduction/translations/README.ko.md
deleted file mode 100644
index 76945223..00000000
--- a/4-Data-Science-Lifecycle/14-Introduction/translations/README.ko.md
+++ /dev/null
@@ -1,211 +0,0 @@
-<<<<<<< HEAD
-# Introduction to the Data Science Lifecycle
-
-|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../sketchnotes/14-DataScience-Lifecycle.png)|
-|:---:|
-| Introduction to the Data Science Lifecycle - Image by [@nitya](https://twitter.com/nitya) |
-
-## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/26)
-
-At this point, you have probably realized that data science is a process. This process can be divided into five stages:
-
-- Data capture
-- Data processing
-- Data analysis
-- Communication
-- Maintenance
-
-
-This lesson focuses on three parts of the lifecycle: data capture, data processing, and maintenance.
-
-![Diagram of the data science lifecycle](./images/data-science-lifecycle.jpg)
-> Image by [Berkeley School of Information](https://ischoolonline.berkeley.edu/data-science/what-is-data-science/)
-
-## Data Capture
-
-The first stage of the lifecycle is very important because the following stages depend heavily on it. It is effectively two stages combined into one: collecting the data, and defining the purpose and the problems that need to be solved.
-Defining the goals of the project requires deeper context about the problem or question. First, we need to identify and bring in the people who need the problem solved. They may be business stakeholders or sponsors of the project, and they can help identify who will benefit from the project, what they need, and why. A well-defined goal must be measurable and quantifiable in order to define an acceptable result.
-
-Questions a data scientist might ask:
-- Has this problem been approached before? What was discovered?
-- Do all the people involved understand the purpose and the goal?
-- Where is there ambiguity, and how can it be reduced?
-- What are the constraints?
-- What will the final result potentially look like?
-- How many resources (time, people, computing) are available?
-
-Next comes identifying, collecting, and finally exploring the data needed to achieve these defined goals. At this acquisition step, data scientists must also evaluate the quantity and quality of the data.
This requires some data exploration to confirm that what has been acquired will help reach the desired result.
-
-Questions a data scientist might ask about the data:
-- What data is already available to me?
-- Who owns this data?
-- What are the privacy concerns?
-- Do I have enough to solve this problem?
-- Is the data of acceptable quality for this problem?
-- If I discover additional information through this data, should the goals be changed or redefined?
-
-## Data Processing
-
-The data processing stage of the lifecycle focuses on discovering patterns in the data as well as on modeling. Some techniques used in this stage require statistical methods to identify the patterns. With a large dataset this would typically be a tedious task for a person, so we rely on computers to do the heavy lifting and speed up the processing. This stage is also where data science and machine learning intersect. As you learned in the first lesson, machine learning is the process of building models to understand data. A model represents the relationships between variables in the data and helps predict outcomes.
-
-Techniques commonly used in this stage are covered in the ML for Beginners curriculum. Follow the links to learn more about them:
-
-- [Classification](https://github.com/microsoft/ML-For-Beginners/tree/main/4-Classification): Organizes data into categories for more efficient use.
-- [Clustering](https://github.com/microsoft/ML-For-Beginners/tree/main/5-Clustering): Groups data into similar clusters.
-- [Regression](https://github.com/microsoft/ML-For-Beginners/tree/main/2-Regression): Determines the relationships between variables in order to predict or forecast values.
-
-## Maintenance
-In the lifecycle diagram, you can see that maintenance sits between the data capture and data processing stages. Maintenance is the ongoing process of managing, storing, and securing data over the course of a project, and it must be considered throughout the entire project.
-
-### Storing Data
-Decisions about how and where data is stored can affect both the cost of storage and the speed at which the data can be accessed. A data scientist is unlikely to make these decisions alone, but may have to choose how to work with the data based on how it is stored.
-
-Here are some aspects of modern data storage systems that can affect these choices:
-
-**On-premises vs. off-premises vs. public or private cloud**
-On-premises means hosting and managing the data on your own equipment, such as owning a server with hard drives that store the data, while off-premises relies on equipment you do not own, such as a data center. The public cloud is a popular choice for storing data when you do not need to know exactly how or where the data is stored. Here, public refers to a unified underlying infrastructure shared by everyone who uses the cloud.
Some organizations have strict security policies that require complete access to the equipment where the data is hosted, and they rely on a private cloud that provides its own cloud services. You will learn more about data in the cloud in [the next lesson](5-Data-Science-In-Cloud).
-
-**Cold vs hot data**
-When training a model, you may need more training data. If you are satisfied with your model, more data will still arrive so that the model can serve its purpose. In either case, the cost of storing and accessing data increases as you accumulate more of it. Separating rarely accessed data, known as cold data, from frequently accessed hot data can be a cheaper storage option through hardware or software services. If cold data needs to be accessed, it may take a little longer to retrieve than hot data.
-
-### Managing Data
-As you work with data, you may find that some of it needs to be cleaned, using some of the techniques covered in the lesson on [data preparation](2-Working-With-Data\08-data-preparation), in order to build accurate models. When new data arrives, some of the same steps need to be applied again to keep quality consistent. Some projects involve an automated tool for cleansing, aggregating, and compressing data before it is moved to its final location. Azure Data Factory is one example of such a tool.
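
As a rough illustration of applying the same cleaning steps to every batch, here is a minimal JavaScript sketch. The function name `cleanBatch`, the record fields (`id`, `label`, `value`), and the sample batches are all hypothetical; they are not part of the lesson or of any Azure tool.

```javascript
// Hypothetical cleaning step, reused for every batch so that quality stays consistent.
function cleanBatch(records) {
  const seen = new Set();
  const cleaned = [];
  for (const record of records) {
    // Skip records with a missing id or a non-numeric value.
    if (record.id == null || !Number.isFinite(record.value)) continue;
    // Drop duplicates by id, keeping the first occurrence.
    if (seen.has(record.id)) continue;
    seen.add(record.id);
    // Normalize the label so " Paris " and "paris" compare equal.
    const label = String(record.label ?? "").trim().toLowerCase();
    cleaned.push({ id: record.id, label, value: record.value });
  }
  return cleaned;
}

// Hypothetical data: an initial batch and a batch that arrives later.
const initialBatch = [
  { id: 1, label: " Paris ", value: 3.5 },
  { id: 1, label: "Paris", value: 3.5 },
  { id: 2, label: "Lyon", value: NaN },
];
const newBatch = [{ id: 3, label: "NICE", value: 1.2 }];

const cleanedInitial = cleanBatch(initialBatch);
const cleanedNew = cleanBatch(newBatch);
console.log(cleanedInitial, cleanedNew);
```

Keeping the rules in one function means a batch that arrives later goes through exactly the same checks as the first one; a real project would likely hand this work to a dedicated pipeline tool instead.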
-
-### Securing Data
-One of the main goals of data security is to ensure that those working with the data are in control of what is collected and of the context in which it is used. Keeping data secure means limiting access to only those who need it, complying with local laws and regulations, and maintaining the ethical standards covered in the [ethics lesson](1-Introduction\02-ethics).
-
-Here are some things a team may do with security in mind:
-- Confirm that all data is encrypted.
-- Inform customers about how their data is used.
-- Remove data access from people who have left the project.
-- Allow only certain project members to alter the data.
-
-
-## 🚀 Challenge
-
-There are many versions of the data science lifecycle. The steps may differ in name and number, but they contain the same processes mentioned in this lesson.
-
-Explore the [Team Data Science Process lifecycle](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle) and the [Cross-industry standard process for data mining](https://www.datascience-pm.com/crisp-dm-2/). Name three similarities and differences between the two.
-
-|Team Data Science Process (TDSP)|Cross-industry standard process for data mining (CRISP-DM)|
-|--|--|
-|![Team Data Science Lifecycle](./images/tdsp-lifecycle2.png) | ![Data Science Process Alliance Image](./images/CRISP-DM.png) |
-| Image by [Microsoft](https://docs.microsoft.comazure/architecture/data-science-process/lifecycle) | Image by [Data Science Process Alliance](https://www.datascience-pm.com/crisp-dm-2/) |
-
-## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/27)
-
-## Review & Self Study
-
-Applying the data science lifecycle involves multiple roles and tasks, some of which may focus on a particular part of each stage. The Team Data Science Process provides a few resources that describe the types of roles and tasks someone may have in a project.
-
-* [Team Data Science Process roles and tasks](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/roles-tasks)
-* [Execute data science tasks: exploration, modeling, and deployment](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/execute-data-science-tasks)
-
-## Assignment
-[Dataset](assignment.md)
-=======
-# Introduction to the Data Science Lifecycle
-
-|![ Sketchnote by [(@sketchthedocs)](https://sketchthedocs.dev) ](../../../sketchnotes/14-DataScience-Lifecycle.png)|
-|:---:|
-| Introduction to the Data Science Lifecycle - Image by [@nitya](https://twitter.com/nitya) |
-
-## [Pre-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/26)
-
-At this point, you have probably realized that data science is a process.
This process can be divided into five stages:
-
-- Data capture
-- Data processing
-- Data analysis
-- Communication
-- Maintenance
-
-
-This lesson focuses on three parts of the lifecycle: data capture, data processing, and maintenance.
-
-![Diagram of the data science lifecycle](.././images/data-science-lifecycle.jpg)
-> Image by [Berkeley School of Information](https://ischoolonline.berkeley.edu/data-science/what-is-data-science/)
-
-## Data Capture
-
-The first stage of the lifecycle is very important because the following stages depend heavily on it. It is effectively two stages combined into one: collecting the data, and defining the purpose and the problems that need to be solved.
-Defining the goals of the project requires deeper context about the problem or question. First, we need to identify and bring in the people who need the problem solved. They may be business stakeholders or sponsors of the project, and they can help identify who will benefit from the project, what they need, and why. A well-defined goal must be measurable and quantifiable in order to define an acceptable result.
-
-Questions a data scientist might ask:
-- Has this problem been approached before? What was discovered?
-- Do all the people involved understand the purpose and the goal?
-- Where is there ambiguity, and how can it be reduced?
-- What are the constraints?
-- What will the final result potentially look like?
-- How many resources (time, people, computing) are available?
-
-Next comes identifying, collecting, and finally exploring the data needed to achieve these defined goals. At this acquisition step, data scientists must also evaluate the quantity and quality of the data. This requires some data exploration to confirm that what has been acquired will help reach the desired result.
-
-Questions a data scientist might ask about the data:
-- What data is already available to me?
-- Who owns this data?
-- What are the privacy concerns?
-- Do I have enough to solve this problem?
-- Is the data of acceptable quality for this problem?
-- If I discover additional information through this data, should the goals be changed or redefined?
-
-## Data Processing
-
-The data processing stage of the lifecycle focuses on discovering patterns in the data as well as on modeling. Some techniques used in this stage require statistical methods to identify the patterns. With a large dataset this would typically be a tedious task for a person, so we rely on computers to do the heavy lifting and speed up the processing. This stage is also where data science and machine learning intersect. As you learned in the first lesson, machine learning is the process of building models to understand data. A model represents the relationships between variables in the data and helps predict outcomes.
-
-Techniques commonly used in this stage are covered in the ML for Beginners curriculum.
Follow the links to learn more about them:
-
-- [Classification](https://github.com/microsoft/ML-For-Beginners/tree/main/4-Classification): Organizes data into categories for more efficient use.
-- [Clustering](https://github.com/microsoft/ML-For-Beginners/tree/main/5-Clustering): Groups data into similar clusters.
-- [Regression](https://github.com/microsoft/ML-For-Beginners/tree/main/2-Regression): Determines the relationships between variables in order to predict or forecast values.
-
-## Maintenance
-In the lifecycle diagram, you can see that maintenance sits between the data capture and data processing stages. Maintenance is the ongoing process of managing, storing, and securing data over the course of a project, and it must be considered throughout the entire project.
-
-### Storing Data
-Decisions about how and where data is stored can affect both the cost of storage and the speed at which the data can be accessed. A data scientist is unlikely to make these decisions alone, but may have to choose how to work with the data based on how it is stored.
-
-Here are some aspects of modern data storage systems that can affect these choices:
-
-**On-premises vs. off-premises vs. public or private cloud**
-On-premises means hosting and managing the data on your own equipment, such as owning a server with hard drives that store the data, while off-premises relies on equipment you do not own, such as a data center. The public cloud is a popular choice for storing data when you do not need to know exactly how or where the data is stored.
Here, public refers to a unified underlying infrastructure shared by everyone who uses the cloud. Some organizations have strict security policies that require complete access to the equipment where the data is hosted, and they rely on a private cloud that provides its own cloud services. You will learn more about data in the cloud in [the next lesson](5-Data-Science-In-Cloud).
-
-**Cold vs hot data**
-When training a model, you may need more training data. If you are satisfied with your model, more data will still arrive so that the model can serve its purpose. In either case, the cost of storing and accessing data increases as you accumulate more of it. Separating rarely accessed data, known as cold data, from frequently accessed hot data can be a cheaper storage option through hardware or software services. If cold data needs to be accessed, it may take a little longer to retrieve than hot data.
-
-### Managing Data
-As you work with data, you may find that some of it needs to be cleaned, using some of the techniques covered in the lesson on [data preparation](2-Working-With-Data\08-data-preparation), in order to build accurate models. When new data arrives, some of the same steps need to be applied again to keep quality consistent. Some projects involve an automated tool for cleansing, aggregating, and compressing data before it is moved to its final location. Azure Data Factory is one example of such a tool.
-
-### Securing Data
-One of the main goals of data security is to ensure that those working with the data are in control of what is collected and of the context in which it is used. Keeping data secure means limiting access to only those who need it, complying with local laws and regulations, and maintaining the ethical standards covered in the [ethics lesson](1-Introduction\02-ethics).
-
-Here are some things a team may do with security in mind:
-- Confirm that all data is encrypted.
-- Inform customers about how their data is used.
-- Remove data access from people who have left the project.
-- Allow only certain project members to alter the data.
-
-
-## 🚀 Challenge
-
-There are many versions of the data science lifecycle. The steps may differ in name and number, but they contain the same processes mentioned in this lesson.
-
-Explore the [Team Data Science Process lifecycle](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/lifecycle) and the [Cross-industry standard process for data mining](https://www.datascience-pm.com/crisp-dm-2/). Name three similarities and differences between the two.
-
-|Team Data Science Process (TDSP)|Cross-industry standard process for data mining (CRISP-DM)|
-|--|--|
-|![Team Data Science Lifecycle](.././images/tdsp-lifecycle2.png) | ![Data Science Process Alliance Image](.././images/CRISP-DM.png) |
-| Image by [Microsoft](https://docs.microsoft.comazure/architecture/data-science-process/lifecycle) | Image by [Data Science Process Alliance](https://www.datascience-pm.com/crisp-dm-2/) |
-
-## [Post-lecture quiz](https://red-water-0103e7a0f.azurestaticapps.net/quiz/27)
-
-## Review & Self Study
-
-Applying the data science lifecycle involves multiple roles and tasks, some of which may focus on a particular part of each stage. The Team Data Science Process provides a few resources that describe the types of roles and tasks someone may have in a project.
-
-* [Team Data Science Process roles and tasks](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/roles-tasks)
-* [Execute data science tasks: exploration, modeling, and deployment](https://docs.microsoft.com/en-us/azure/architecture/data-science-process/execute-data-science-tasks)
-
-## Assignment
-[Dataset](assignment.md)
->>>>>>> f226d9539b580b27eb72c07423c0e0a5fcf4d540
diff --git a/4-Data-Science-Lifecycle/15-analyzing/translations/README.es.md b/4-Data-Science-Lifecycle/15-analyzing/translations/README.es.md
deleted file mode 100644
index e69de29b..00000000
diff --git a/4-Data-Science-Lifecycle/16-communication/translations/README.es.md b/4-Data-Science-Lifecycle/16-communication/translations/README.es.md
deleted file mode 100644
index e69de29b..00000000
diff --git a/4-Data-Science-Lifecycle/translations/README.es.md b/4-Data-Science-Lifecycle/translations/README.es.md
deleted file mode 100644
index e69de29b..00000000
diff --git a/5-Data-Science-In-Cloud/17-Introduction/translations/README.es.md
b/5-Data-Science-In-Cloud/17-Introduction/translations/README.es.md
deleted file mode 100644
index e69de29b..00000000
diff --git a/5-Data-Science-In-Cloud/18-Low-Code/translations/README.es.md b/5-Data-Science-In-Cloud/18-Low-Code/translations/README.es.md
deleted file mode 100644
index e69de29b..00000000
diff --git a/5-Data-Science-In-Cloud/18-Low-Code/translations/README.md b/5-Data-Science-In-Cloud/18-Low-Code/translations/README.md
new file mode 100644
index 00000000..0e47ad51
--- /dev/null
+++ b/5-Data-Science-In-Cloud/18-Low-Code/translations/README.md
@@ -0,0 +1 @@
+ 
\ No newline at end of file
diff --git a/5-Data-Science-In-Cloud/19-Azure/translations/README.es.md b/5-Data-Science-In-Cloud/19-Azure/translations/README.es.md
deleted file mode 100644
index e69de29b..00000000

From 7f19e4bba684ee6e7c8ff2a7a1614f7d40c8a24c Mon Sep 17 00:00:00 2001
From: Frederick Legaspi
Date: Mon, 6 Dec 2021 10:09:41 -0500
Subject: [PATCH 12/13] Fix broken link

Fix broken link to 18-Low-Code/README.md.
Fix minor typo.
---
 5-Data-Science-In-Cloud/19-Azure/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/5-Data-Science-In-Cloud/19-Azure/README.md b/5-Data-Science-In-Cloud/19-Azure/README.md
index be76915d..30a4bac1 100644
--- a/5-Data-Science-In-Cloud/19-Azure/README.md
+++ b/5-Data-Science-In-Cloud/19-Azure/README.md
@@ -76,7 +76,7 @@ Let's create a compute instance to provision a jupyter notebook.
 Congratulations, you have just created a compute instance! We will use this compute instance to create a Notebook the [Creating Notebooks section](#23-creating-notebooks).
 
 ### 2.3 Loading the Dataset
-Refer the [previous lesson](../18-tbd/README.md) in the section **2.3 Loading the Dataset** if you have not uploaded the dataset yet.
+Refer the [previous lesson](../18-Low-Code/README.md) in the section **2.3 Loading the Dataset** if you have not uploaded the dataset yet.
 
 ### 2.4 Creating Notebooks
 
@@ -97,7 +97,7 @@ Now that we have a Notebook, we can start training the model with Azure ML SDK.
 
 ### 2.5 Training a model
 
-First of all, if you ever have a doubt, refer to the [Azure ML SDK documentation](https://docs.microsoft.com/python/api/overview/azure/ml?WT.mc_id=academic-40229-cxa&ocid=AID3041109). In contains all the necessary information to understand the modules we are going to see in this lesson.
+First of all, if you ever have a doubt, refer to the [Azure ML SDK documentation](https://docs.microsoft.com/python/api/overview/azure/ml?WT.mc_id=academic-40229-cxa&ocid=AID3041109). It contains all the necessary information to understand the modules we are going to see in this lesson.
 
 #### 2.5.1 Setup Workspace, experiment, compute cluster and dataset
 

From 39614b0317bb9d287548dbaec5ea2f3d5b6b80f6 Mon Sep 17 00:00:00 2001
From: Frederick Legaspi
Date: Mon, 6 Dec 2021 10:21:06 -0500
Subject: [PATCH 13/13] Fix typo

Fix minor typo
---
 6-Data-Science-In-Wild/20-Real-World-Examples/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/6-Data-Science-In-Wild/20-Real-World-Examples/README.md b/6-Data-Science-In-Wild/20-Real-World-Examples/README.md
index 098a389d..b3dcdd3d 100644
--- a/6-Data-Science-In-Wild/20-Real-World-Examples/README.md
+++ b/6-Data-Science-In-Wild/20-Real-World-Examples/README.md
@@ -26,7 +26,7 @@ Thanks to the democratization of AI, developers are now finding it easier to des
 
  * [Sports Analytics](https://towardsdatascience.com/scope-of-analytics-in-sports-world-37ed09c39860) - focuses on _predictive analytics_ (team and player analysis - think [Moneyball](https://datasciencedegree.wisconsin.edu/blog/moneyball-proves-importance-big-data-big-ideas/) - and fan management) and _data visualization_ (team & fan dashboards, games etc.) with applications like talent scouting, sports gambling and inventory/venue management.
 
- * [Data Science in Banking](https://data-flair.training/blogs/data-science-in-banking/) - highlights the value of data science in the finance industry with applications ranging from risk modeling and fraud detction, to customer segmentation, real-time prediction and recommender systems. Predictive analytics also drive critical measures like [credit scores](https://dzone.com/articles/using-big-data-and-predictive-analytics-for-credit).
+ * [Data Science in Banking](https://data-flair.training/blogs/data-science-in-banking/) - highlights the value of data science in the finance industry with applications ranging from risk modeling and fraud detection, to customer segmentation, real-time prediction and recommender systems. Predictive analytics also drive critical measures like [credit scores](https://dzone.com/articles/using-big-data-and-predictive-analytics-for-credit).
 
  * [Data Science in Healthcare](https://data-flair.training/blogs/data-science-in-healthcare/) - highlights applications like medical imaging (e.g., MRI, X-Ray, CT-Scan), genomics (DNA sequencing), drug development (risk assessment, success prediction), predictive analytics (patient care & supply logistics), disease tracking & prevention etc.