Introduction to relational databases
This is the first post in a series on SQL.
According to an IADSS article* SQL is the second most common skill listed in data science job postings – only python is more common. This makes SQL a core skill for any data scientist, data analyst, or data engineer. So what is SQL and why is it so important to data scientists? This post is meant to be a very basic, high-level overview of what SQL is and why it is important to have this skill if you are pursuing a data career.
Our focus on this blog is data science, so we will concentrate on querying databases – a frequent task of a data scientist is to retrieve data that is housed in a relational database.
What is SQL, and what is a relational database?
Structured Query Language, or SQL, is a programming language used to access and manipulate relational databases. The first step in understanding SQL is to understand what we mean by relational database.
A relational database stores information using data points that are related to one another. These data points are stored in tables.
Let’s say we are working at an online catalog company. For our business to function, at minimum we need to know the products being sold, who they are being sold to, and where they are being shipped. We could make one really large table with each order, the customer, the products, and the address. However, this would lead to a whole lot of duplicative information and we would eventually run out of storage space.
As you can see from the above example, we have customers that have multiple orders, products that are on multiple orders, and so on. Because all the information for each order is entered every time a new order is placed, we can also expect many data entry errors.
A relational database would be structured so that the customer, product, and address information was only entered once by using separate tables. This saves storage space and reduces data entry errors.
Now, when an order is placed, the orders table only needs to refer to the other tables. This is done with the ID columns – we’ll talk more about this in future posts, but for now it is important to understand that every customer, product, and address has a unique identifier in their respective table.
Our orders table will now reference those IDs to get all the information for each order.
Great, so now you’re probably wondering how you get that data from all the tables into one place so you can understand what is going on with your data. That is where SQL queries come in.
What’s a query?
Query is just a fancy way of saying ‘information request’ – if I query a database, I am asking it to retrieve specific information. In the orders example above, I might request all of the orders placed by a specific customer, or every order placed on a specific date, or any order that has a specific product.
To make these requests, we use SQL. If I want to find all the orders placed by a specific customer, I will need to ask the database to retrieve all the order IDs that reference the customer I am looking for. The next post will talk about the specific SQL query structure, but here we are going to talk at a high level on how to structure the request.
The first step is to decide which tables you need for your request. To get every order number placed by a specific customer, I will need the customer and orders tables.
The second step is to understand how those tables relate to each other. The orders table has a column for CustomerID and this column is also in the customer table. So, I want to ask for information from database that shares a CustomerID in both the tables.
Next we decide which columns from each table we would like to retrieve. Let’s say we are only interested in the order numbers for a specific customer. We might ask ‘may I have a list of order numbers for Mutt Barkley?’
To get this information, we will first instruct the database to find ‘Mutt Barkley’ in the Customer table and retrieve the CustomerID (2).
Next we instruct the database to find any row in the orders table that also has the CustomerID of 2.
Since we are only asking for a list of order numbers, we only need one column – the Order ID. This query will return ‘1’.
In the most basic terms, we asked the database to find the name ‘Mutt Barkely’ in the customer table and then match the CustomerID from that row to the CustomerID in the order table (this is called a join, more that later) and return any orderID where there was a match.
Written in SQL:
JOIN Customers on Customers.CustomerID = Orders.CustomerID
WHERE Customers.Name = ‘Mutt Barkley’
That is the most basic explanation – the next post in this series will break down what the SELECT, FROM, JOIN and WHERE statements do and how to write them.