nosql types and descriptions

pull/61/head
Jasmine 4 years ago
parent c7ceb2b573
commit 86de6aa51d

@ -1,16 +1,18 @@
# Defining Data
Data are facts, information, observations and measurements that are used to make discoveries and to support informed decisions. A data point is a single unit of data within a dataset, which is a collection of data points. Datasets come in different formats and structures, which usually depend on the source, or where the data came from. For example, a company's monthly earnings might be in a spreadsheet, but hourly heart rate data from a smartwatch may be in [JSON](https://stackoverflow.com/a/383699) format. It's common for data scientists to work with different types of data within a dataset.
This lesson focuses on identifying and classifying data by its characteristics and its sources.
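To make the terms "data point" and "dataset" concrete, here is a minimal sketch that reads hourly heart rate data in JSON format. The field names (`hour`, `bpm`) and the values are made up for illustration and do not come from any real smartwatch export.

```python
import json

# Hypothetical smartwatch export: each JSON object is one data point.
raw = """
[
  {"hour": "08:00", "bpm": 62},
  {"hour": "09:00", "bpm": 71},
  {"hour": "10:00", "bpm": 88}
]
"""

dataset = json.loads(raw)   # the dataset: a collection of data points
first_point = dataset[0]    # a single data point

# A simple quantitative summary of the dataset.
average_bpm = sum(point["bpm"] for point in dataset) / len(dataset)

print(len(dataset), first_point["bpm"], round(average_bpm, 1))
```

The same readings could just as easily arrive as rows in a spreadsheet; the format usually follows from the source, not from the data itself.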
## Pre-Lecture Quiz
[Pre-lecture quiz]()
![Image of numerical data, also known as quantitative data](mika-baumeister-Wpnoqo2plFA-unsplash.jpg)
> Source: [Mika Baumeister](https://unsplash.com/@mbaumi) via [Unsplash](https://unsplash.com/photos/Wpnoqo2plFA)
## How Data is Described
**Raw data** are data that have come from their source in their initial state and have not been analyzed or organized. To make sense of what is happening in a dataset, it needs to be organized into a format that can be understood by humans as well as by the technology they may use to analyze it further. The structure of a dataset describes how it's organized and can be classified as structured, unstructured or semi-structured. The exact structure will vary depending on the source, but it will ultimately fit into one of these three categories.
### Quantitative Data
Quantitative data are numerical observations within a dataset and can typically be analyzed, measured and used mathematically. Some examples of quantitative data are: a country's population, a person's height or a company's quarterly earnings. With some additional analysis, quantitative data could be used to discover seasonal trends of the Air Quality Index (AQI) or estimate the probability of rush hour traffic on a typical work day.
@ -39,11 +41,23 @@ Examples of unstructured data: HTML, CSV files, JavaScript Object Notation (JSO
A data source is the initial location where the data was generated, or where it "lives", and will vary based on how and when it was collected. Data generated by its user(s) are known as primary data, while secondary data come from a source that has collected data for general use. For example, observations collected by a group of scientists in a rainforest would be considered primary data; if the scientists decide to share the observations with other scientists, the data would be considered secondary to those who use it.
Databases are a common source and rely on a database management system to host and maintain the data, where users use commands called queries to explore the data. Files as data sources can be audio, image, and video files as well as spreadsheets like Excel. Internet sources are a common location for hosting data, where databases as well as files can be found. Application programming interfaces, also known as APIs, allow programmers to create ways to share data with external users through the internet, while the process of web scraping extracts data from a web page. The [lessons in Working with Data](/2-Working-With-Data) focus on how to use various data sources.
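As a rough illustration of a database as a data source, the sketch below uses Python's built-in `sqlite3` module to create a small in-memory database and explore it with a query. The `earnings` table and its values are hypothetical and exist only for this example.

```python
import sqlite3

# An in-memory database stands in for a hosted database management system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE earnings (month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO earnings VALUES (?, ?)",
    [("Jan", 1200.0), ("Feb", 950.0), ("Mar", 1430.0)],  # made-up values
)

# A query is the command used to explore the data.
rows = conn.execute(
    "SELECT month, amount FROM earnings WHERE amount > 1000 ORDER BY month"
).fetchall()
conn.close()

print(rows)
```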
## Conclusion
In this lesson we have learned:
- What data is
- How data is described
- How data is classified and categorized
- Where data can be found
## 🚀 Challenge
Kaggle is an excellent source of open datasets. Use the [dataset search tool](https://www.kaggle.com/datasets) to find some datasets of interest and classify 3-5 of them with these criteria:
- Is the data quantitative or qualitative?
- Is the data structured, unstructured, or semi-structured?
## Post-Lecture Quiz
@ -51,6 +65,7 @@ Databases are a common source and rely on a database management system to host a
## Review & Self Study
- This Microsoft Learn unit, titled [Classify your Data](https://docs.microsoft.com/en-us/learn/modules/choose-storage-approach-in-azure/2-classify-data), has a detailed breakdown of structured, semi-structured, and unstructured data.
## Assignment

@ -4,40 +4,58 @@
Follow the prompts in this assignment to identify and classify the data with one of each of the following data types:
**Structure Types**: Structured, Semi-Structured, or Unstructured
**Value Types**: Qualitative or Quantitative
**Source Types**: Primary or Secondary
1. A company has been acquired and now has a parent company. The data scientists have received a spreadsheet of customer phone numbers from the parent company.
Structure Type:
Value Type:
Source Type:
---
2. A smartwatch has been collecting heart rate data from its wearer, and the raw data is in JSON format.
Structure Type:
Value Type:
Source Type:
---
3. A workplace survey of employee morale that is stored in a CSV file.
Structure Type:
Value Type:
Source Type:
---
4. Astrophysicists are accessing a database of galaxies that has been collected by a space probe. The data contains the number of planets within each galaxy.
Structure Type:
Value Type:
Source Type:
---
5. A personal finance app uses APIs to connect to a user's financial accounts in order to calculate their net worth. The user can see all of their transactions in a format of rows and columns that looks similar to a spreadsheet.
Structure Type:
Value Type:
Source Type:
## Rubric

Binary file not shown. Size: 383 KiB

@ -1,10 +1,10 @@
# Working with Data: Non-Relational Data
Data is not limited to relational databases. This lesson focuses on non-relational data and will cover the basics of spreadsheets and NoSQL.
## Spreadsheets
Spreadsheets are a popular way to store and explore data because they require less work to set up and get started. In this lesson you'll learn the basic components of a spreadsheet, as well as formulas and functions. The examples will be illustrated with Microsoft Excel, but most of the parts and topics will have similar names and steps in other spreadsheet software.
![An empty Microsoft Excel workbook with two worksheets](parts-of-spreadsheet.png)
@ -18,42 +18,41 @@ With these basic elements of an Excel workbook, we'll use and an example from [M
The spreadsheet file named "InventoryExample" is a formatted spreadsheet of items within an inventory that contains three worksheets, where the tabs are labeled "Inventory List", "Inventory Pick List" and "Bin Lookup". Row 4 of the Inventory List worksheet is the header, which describes the values of the cells in each column.
![A highlighted formula from an example inventory list in Microsoft Excel](formula-excel.png)
There are instances where a cell is dependent on the values of other cells to generate its value. The Inventory List spreadsheet keeps track of the cost of every item in its inventory, but what if we need to know the value of everything in the inventory? [**Formulas**](https://support.microsoft.com/en-us/office/overview-of-formulas-34519a4e-1e8d-4f4b-84d4-d642c4f63263) perform actions on cell data and are used to calculate the value of the inventory in this example. This spreadsheet uses a formula in the Inventory Value column to calculate the value of each item by multiplying the quantity under the QTY header by the cost under the COST header. Double clicking or highlighting a cell will show the formula. You'll notice that formulas start with an equals sign, followed by the calculation or operation.
![A highlighted function from an example inventory list in Microsoft Excel](function-excel.png)
We can use another formula to add all the values of Inventory Value together to get the total value. This could be calculated by adding each cell individually, but that can be a tedious task. Excel has [**functions**](https://support.microsoft.com/en-us/office/sum-function-043e1c7d-7726-4e80-8f32-07b23e057f89), or predefined formulas, to perform calculations on cell values. Functions require arguments, which are the values used to perform these calculations. When a function requires more than one argument, the arguments need to be listed in a particular order or the function may not calculate the correct value. This example uses the SUM function, with the values of the Inventory Value column as the argument, to generate the total listed under row 3, column B (also referred to as B3).
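For readers more comfortable with code than with cell references, the two spreadsheet ideas above can be sketched in Python. This is only an analogy, assuming three made-up inventory rows: the per-row multiplication plays the role of the formula in the Inventory Value column, and `sum` plays the role of Excel's SUM function.

```python
# Hypothetical inventory rows; quantities and costs are invented for illustration.
inventory = [
    {"name": "Item 1", "qty": 25, "cost": 51.00},
    {"name": "Item 2", "qty": 132, "cost": 93.00},
    {"name": "Item 3", "qty": 151, "cost": 57.00},
]

# "Formula": each row's inventory value depends on two other cells in that row.
for row in inventory:
    row["inventory_value"] = row["qty"] * row["cost"]

# "Function": a predefined calculation applied to a range of values (like SUM).
total_value = sum(row["inventory_value"] for row in inventory)

print(total_value)
```

As in the spreadsheet, changing a quantity or cost and re-running the calculation updates both the row value and the total.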
## NoSQL
NoSQL is an umbrella term for the different ways to store non-relational data and can be interpreted as "non-SQL", "non-relational" or "not only SQL". These types of database systems can be categorized into 4 types.
![Graphical representation of a key-value data store showing 4 unique numerical keys that are associated with 4 various values](kv-db.png)
> Source from [Michał Białecki Blog](https://www.michalbialecki.com/2018/03/18/azure-cosmos-db-key-value-database-cloud/)
[Key-value](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#keyvalue-data-stores) databases pair unique keys, which are unique identifiers, with values. These pairs are stored using a [hash table](https://www.hackerearth.com/practice/data-structures/hash-tables/basics-of-hash-tables/tutorial/) with an appropriate hashing function.
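A minimal sketch of the key-value idea, using a Python dictionary (which is itself backed by a hash table) as a stand-in for a key-value database. The `put` and `get` helpers and the keys are hypothetical names chosen for this example, not the API of any particular product.

```python
# A Python dict is backed by a hash table, so it models a key-value store well.
store = {}

def put(key, value):
    """Associate a unique key with a value, replacing any previous value."""
    store[key] = value

def get(key, default=None):
    """Look up a value by its key; return default if the key is absent."""
    return store.get(key, default)

put("user:1001", {"name": "Ada", "plan": "free"})   # made-up record
put("user:1002", {"name": "Grace", "plan": "pro"})

print(get("user:1001"))
print(get("user:9999"))
```

Note that the store can only look values up by key; it has no built-in way to ask "which users are on the pro plan?" without scanning every value.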
![Graphical representation of a graph data store showing the relationships between people, their interests and locations](graph-db.png)
> Source from [Microsoft](https://docs.microsoft.com/en-us/azure/cosmos-db/graph/graph-introduction#graph-database-by-example)
[Graph](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#graph-data-stores) databases describe relationships in data and are represented as a collection of nodes and edges. A node represents an entity, something that exists in the real world such as a student or bank statement. Edges represent the relationship between two entities. Each node and edge has properties that provide additional information about it.
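The node-and-edge model can be sketched with plain Python structures. The people, interests, relationship names and properties below are invented for illustration, and `neighbors` is a hypothetical helper, not a graph database query language.

```python
# Nodes: entities, each with its own properties.
nodes = {
    "alice": {"label": "person", "age": 30},
    "bob": {"label": "person", "age": 27},
    "hiking": {"label": "interest"},
}

# Edges: (source, relationship, target, properties of the relationship).
edges = [
    ("alice", "knows", "bob", {"since": 2015}),
    ("alice", "likes", "hiking", {}),
    ("bob", "likes", "hiking", {}),
]

def neighbors(node_id, relationship):
    """Return the ids of nodes reached from node_id via the given relationship."""
    return [
        target
        for source, rel, target, _props in edges
        if source == node_id and rel == relationship
    ]

print(neighbors("alice", "knows"))
print(neighbors("bob", "likes"))
```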
![Graphical representation of a columnar data store showing a customer database with two column families named Identity and Contact Info](columnar-db.png)
[Columnar](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#columnar-data-stores) data stores organize data into columns and rows like a relational data structure, but the columns are divided into groups called column families, where all the data under one column family is related and can be retrieved and changed as one unit.
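As a rough sketch of column families, the example below models a customer store with two families, Identity and Contact Info, as nested dictionaries. The row keys, column names and values are hypothetical, and `read_family` is an illustrative helper, not a real columnar database API.

```python
# Each row key maps to column families; each family groups related columns.
customers = {
    "001": {
        "identity": {"first_name": "Ada", "last_name": "Lovelace"},
        "contact_info": {"phone": "555-0100", "email": "ada@example.com"},
    },
    "002": {
        "identity": {"first_name": "Grace", "last_name": "Hopper"},
        "contact_info": {"phone": "555-0101", "email": "grace@example.com"},
    },
}

def read_family(row_key, family):
    """Retrieve one column family for one row as a single unit."""
    return customers[row_key][family]

# Only the contact columns are read; the identity family is not touched.
print(read_family("002", "contact_info"))
```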
### Document Data Stores with the Azure Cosmos DB Emulator
[Document](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data#document-data-stores) data stores build on the concept of a key-value data store and are made up of a series of fields and objects. This section will explore document databases with the Cosmos DB emulator.
#### The Cosmos DB Emulator
A Cosmos DB database fits the definition of "Not Only SQL" as you have choices on the type of non-relational database you'd like to use. The document database relies on SQL to query the data. The previous lesson on SQL covers the basics, and we'll be able to apply these same queries to a document database here. We'll be using the Cosmos DB emulator, which allows us to create and explore a document database locally on a computer.
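Before opening the emulator, it may help to see what "a series of fields and objects" looks like. The sketch below keeps a few made-up documents in a Python list and filters them the way a SQL `WHERE` clause would. It does not use the Cosmos DB SDK or the emulator; the document shapes and the `where` helper are assumptions for illustration only.

```python
# Hypothetical documents: each has an id (the key) plus fields and nested objects.
# Documents in the same collection do not need identical fields.
documents = [
    {"id": "1", "name": "Ada", "address": {"city": "Seattle"}, "interests": ["chess"]},
    {"id": "2", "name": "Grace", "address": {"city": "New York"}},
    {"id": "3", "name": "Alan", "address": {"city": "Seattle"}, "phone": "555-0102"},
]

def where(docs, predicate):
    """Return the documents for which predicate(document) is true."""
    return [doc for doc in docs if predicate(doc)]

# Comparable in spirit to a query such as:
#   SELECT * FROM c WHERE c.address.city = "Seattle"
in_seattle = where(documents, lambda doc: doc.get("address", {}).get("city") == "Seattle")

print([doc["name"] for doc in in_seattle])
```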
## Pre-Lecture Quiz
@ -61,11 +60,6 @@ https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-rela
[Pre-lecture quiz]()
## 🚀 Challenge
@ -75,6 +69,9 @@ https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-rela
## Review & Self Study
- There are some additional formatting and features added to this spreadsheet that this lesson does not cover. Microsoft has a [large library of documentation and videos](https://support.microsoft.com/excel) on Excel if you're interested in learning more.
- This architectural documentation details the characteristics of the different types of non-relational data: [Non-relational Data and NoSQL](https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data)
## Assignment

Binary file not shown. Size: 8.3 KiB

Binary file not shown. Size: 76 KiB

Binary file not shown. Size: 41 KiB

Binary file not shown. Size: 27 KiB

Binary file not shown. Size: 139 KiB

Binary file not shown. Size: 26 KiB

@ -1,6 +1,13 @@
# Introduction to the Data Science Lifecycle
At this point you've probably come to the realization that data science is a process. This process can be broken down into 5 stages:
- Capturing
- Processing
- Analysis
- Communication
- Maintenance
This lesson focuses on 3 parts of the life cycle: capturing, processing and maintenance.
@ -57,8 +64,6 @@ When training your models, you may require more training data. If youre conte
Below is an example of the cost of owning an Azure Storage Account
[screenshot of Azure cost calculator]
### Managing Data
As you work with data you may discover that some of the data needs to be cleaned, using some of the techniques covered in the lesson focused on [data preparation](2-Working-With-Data/08-data-preparation), to build accurate models. When new data arrives, the same techniques will need to be applied to it to keep the quality consistent. Some projects will involve use of an automated tool for cleansing, aggregation, and compression before the data is moved to its final location. Azure Data Factory is an example of one of these tools.
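One simple way to keep quality consistent is to put the cleaning steps in a single function and run every batch through it, old or new. The sketch below assumes hypothetical heart rate records with `hour` and `bpm` fields; it is not how Azure Data Factory works, only an illustration of reapplying the same steps.

```python
def clean(records):
    """Apply the same cleaning steps to any batch of records."""
    cleaned = []
    for record in records:
        if record.get("bpm") is None:      # drop records with missing readings
            continue
        cleaned.append({
            "hour": record["hour"].strip(),  # remove stray whitespace
            "bpm": int(record["bpm"]),       # normalize the value type
        })
    return cleaned

# Made-up batches: the initial data and data that arrives later.
initial = clean([{"hour": " 08:00", "bpm": "62"}, {"hour": "09:00", "bpm": None}])
new_batch = clean([{"hour": "10:00 ", "bpm": 88}])

print(initial + new_batch)
```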
