You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

9.9 KiB

Working with Data: Relational Databases

 Sketchnote by (@sketchthedocs)
Working With Data: Relational Databases - Sketchnote by @nitya

Chances are youve used a spreadsheet before to store information. You had rows and columns, where the rows contained the data, and the columns described the data (sometimes called metadata). A relational database is based on this same principle of rows and columns in tables, but it allows you to spread information across multiple tables. This makes it possible to work with more complex data, avoid duplication, and have more flexibility in exploring the data. Lets dive into the concepts of relational databases.

Pre-lecture quiz

It all starts with tables

At the heart of a relational database are tables. Similar to a spreadsheet, a table is a collection of rows and columns. The rows contain the data you want to work with, such as the name of a city or the amount of rainfall, while the columns describe the type of data stored.

Lets start by creating a table to store information about cities. We might begin with their name and country. You could organize this in a table like this:

City Country
Tokyo Japan
Atlanta United States
Auckland New Zealand

Notice how the column names city, country, and population describe the data being stored, and each row contains information about one city.

The shortcomings of a single table approach

The table above might look familiar to you. Now lets add more data to our growing database—annual rainfall (in millimeters). Well focus on the years 2018, 2019, and 2020. If we were to add this data for Tokyo, it might look like this:

City Country Year Amount
Tokyo Japan 2020 1690
Tokyo Japan 2019 1874
Tokyo Japan 2018 1445

What do you notice about this table? You might see that were repeating the name and country of the city multiple times. This could take up a lot of storage and is unnecessary since Tokyo only has one name and country.

Lets try another approach. Well add new columns for each year:

City Country 2018 2019 2020
Tokyo Japan 1445 1874 1690
Atlanta United States 1779 1111 1683
Auckland New Zealand 1386 942 1176

While this avoids repeating rows, it introduces other challenges. Wed need to change the table structure every time a new year is added. Additionally, as the data grows, having years as columns would make it harder to retrieve and calculate values.

This is why we need multiple tables and relationships. By splitting the data into separate tables, we can avoid duplication and gain more flexibility in working with the data.

The concepts of relationships

Lets revisit our data and decide how to divide it. We know we want to store the name and country of each city, so this will work best in one table.

City Country
Tokyo Japan
Atlanta United States
Auckland New Zealand

Before creating the next table, we need a way to reference each city. We need an identifier, ID, or (in database terms) a primary key. A primary key is a unique value used to identify a specific row in a table. While it could be based on an existing value (like the city name), its better to use a number or other identifier that wont change. If the ID changes, it would break the relationship. In most cases, the primary key or ID is an auto-generated number.

Primary key is often abbreviated as PK

cities

city_id City Country
1 Tokyo Japan
2 Atlanta United States
3 Auckland New Zealand

Youll notice we use the terms "id" and "primary key" interchangeably in this lesson. These concepts also apply to DataFrames, which youll explore later. While DataFrames dont use the term "primary key," they behave similarly.

With our cities table created, lets store the rainfall data. Instead of repeating the full city information, we can use the ID. The new table should also have an id column, as all tables should have a primary key.

rainfall

rainfall_id city_id Year Amount
1 1 2018 1445
2 1 2019 1874
3 1 2020 1690
4 2 2018 1779
5 2 2019 1111
6 2 2020 1683
7 3 2018 1386
8 3 2019 942
9 3 2020 1176

Notice the city_id column in the rainfall table. This column contains values that reference the IDs in the cities table. In relational database terms, this is called a foreign key—a primary key from another table. You can think of it as a reference or pointer. city_id 1 refers to Tokyo.

[!NOTE] Foreign key is often abbreviated as FK

Retrieving the data

With our data split into two tables, you might wonder how to retrieve it. If youre using a relational database like MySQL, SQL Server, or Oracle, you can use a language called Structured Query Language (SQL). SQL (sometimes pronounced "sequel") is a standard language for retrieving and modifying data in relational databases.

To retrieve data, you use the SELECT command. Essentially, you select the columns you want to see from the table theyre in. For example, to display just the names of the cities, you could use:

SELECT city
FROM cities;

-- Output:
-- Tokyo
-- Atlanta
-- Auckland

SELECT specifies the columns, and FROM specifies the table.

[NOTE] SQL syntax is case-insensitive, meaning select and SELECT are the same. However, depending on the database, column and table names might be case-sensitive. Its a best practice to treat everything in programming as case-sensitive. In SQL, its common to write keywords in uppercase.

The query above will display all cities. If you only want to display cities in New Zealand, you can use a filter. The SQL keyword for this is WHERE, which specifies a condition.

SELECT city
FROM cities
WHERE country = 'New Zealand';

-- Output:
-- Auckland

Joining data

So far, weve retrieved data from a single table. Now lets combine data from both cities and rainfall. This is done by joining the tables. You create a connection between the two tables by matching values in a column from each table.

In our example, well match the city_id column in rainfall with the city_id column in cities. This will link the rainfall data to its corresponding city. The type of join well use is called an inner join, which means rows without matches in the other table wont be displayed. In our case, every city has rainfall data, so all rows will be displayed.

Lets retrieve the rainfall data for 2019 for all cities.

Well do this step by step. First, join the tables by specifying the columns to connect—city_id.

SELECT cities.city
    rainfall.amount
FROM cities
    INNER JOIN rainfall ON cities.city_id = rainfall.city_id

Weve highlighted the columns we want and specified that were joining the tables by city_id. Now we can add a WHERE statement to filter for the year 2019.

SELECT cities.city
    rainfall.amount
FROM cities
    INNER JOIN rainfall ON cities.city_id = rainfall.city_id
WHERE rainfall.year = 2019

-- Output

-- city     | amount
-- -------- | ------
-- Tokyo    | 1874
-- Atlanta  | 1111
-- Auckland |  942

Summary

Relational databases are designed to divide information across multiple tables, which can then be combined for display and analysis. This approach provides flexibility for calculations and data manipulation. Youve learned the core concepts of relational databases and how to join two tables.

🚀 Challenge

There are many relational databases available online. Use the skills youve learned to explore the data.

Post-Lecture Quiz

Post-lecture quiz

Review & Self Study

Microsoft Learn offers several resources to deepen your understanding of SQL and relational database concepts:

Assignment

Assignment Title


Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we strive for accuracy, please note that automated translations may contain errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is recommended. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.