Your SQL is probably WRONG

22 minute read

Published: September 26, 2023

Yes, the title is a bit click-baity, but I hope that this catches your attention for a big warning of a very subtle behavior of most SQL dialects. The main idea is that when you perform a left (or right) join that has clauses on the right (left) hand side table, it is very easy to mess up the integrity of the join and end up with results that are very different to what you are expecting. This behavior is very easy to miss and many people fall into this trap without even noticing, as it is basically a silent failure.

The objective

The main objective is simple: we have two clients, the clients table and the orders table. They contain respectively the information regarding the clients metadata and the clients records of orders. We are interested in obtaining the number of orders per client (displayed by their id), even if there are no entries in the orders . Then we will repeate the same exercise but for a particular day.

Setting up the tables

For this example I will assume that we have a running Postgres database, and I will proceed to create the tables from scratch and insert data manually to show the main point. In this section we will take care of the first part. Note that to keep things simple, I will be using simple incremental integer valued counters for the tables, which is something that I would most likely not do in a production environment.

Let’s start by creating a clients table:

CREATE TABLE clients (
    client_id SERIAL PRIMARY KEY,
    client_name VARCHAR(255)
);

This table contains only the client_id and their name. Let’s create now an orders table:

CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    product_id VARCHAR(255),
    ordered_by INT,
    ordered_at TIMESTAMP,
    quantity INT
);

In this table we have some information about the orders, including who made the order. This means there is a direct relationship with the clients table. We will insert some dummy data so we can work with these examples.

INSERT INTO clients (client_name) VALUES 
	('John Doe'),
	('Jane Smith'),
	('Sarah Johnson');
 
INSERT INTO orders (product_id, ordered_by, ordered_at, quantity) VALUES 
  ('ABC123', 1, '2023-09-25 10:00:00', 2),
  ('DEF456', 1, '2023-09-26 12:30:00', 3);

Counting orders

Now we go to the task at hand. The solution is simple: we join the tables and apply a group by. A careful reader would note that only one of the clients has orders, so if we do an inner join, we will lose the other clients. That is why we must do a left join and coalesce the sum of the quantities to have a 0 value for the other clients. The query would look like this:

select 
	client_id, 
	coalesce(sum(quantity), 0) as quantity
from
	clients as c 
left join 
	orders as o
on 
	c.client_id = o.ordered_by
group by
	client_id
order by
	client_id
;

All good, business as usual. We get our neat results:

client_id	quantity
1	5
2	0
3	0

The problem

The problem arises when we start trying to impose conditions over the join. Suppose that now we are interested not on all the orders but only on the orders that occurred during 2023-09-26. This means that we need to impose that condition somewhere in the query. The most natural place is to put it in the where clause, just like this:

select 
	client_id, 
	coalesce(sum(quantity), 0) as quantity
from
	clients as c 
left join 
	orders as o
on 
	c.client_id = o.ordered_by
where 
	date_trunc('day', o.ordered_at) = '2023-09-26'
group by
	client_id
order by
	client_id
    ;

but let’s look at the results:

client_id	quantity
1	3

We have lost the other clients for which we had zero orders! Why? The answer is subtle: when you tell postgres date_trunc('day', o.ordered_at) = '2023-09-26', you are implicitely telling it that the column o.ordered_at has to be not null and satisfy that condition AFTER the join, and then proceed with the aggregations. This means that it will join the tables, will then filter out the resulting joined rows that have the null field (for all the clients but John) and then it will sum the quantities of the orders.

The fix

The fix is quite simple here: you can put conditions like date_trunc('day', o.ordered_at) = '2023-09-26' in the join clause. In this particular case, it would look like

select 
	client_id, 
	coalesce(sum(quantity), 0) as quantity
from
	clients as c 
left join 
	orders as o
on 
	c.client_id = o.ordered_by
	and date_trunc('day', o.ordered_at) = '2023-09-26'
group by
	client_id
order by
	client_id;

which results in the following:

client_id	quantity
1	3
2	0
3	0

Voilà! To see what is going on here, think that the join will consist of all the joined rows that match on the client_id and ordered_by fields, as well as the having the ordered_at to be within the day of ‘2023-09-26’. We are looking for all the rows of orders that match with Sarah and Jane that happened on the 26th, which is none, hence we get each of them matched with no rows, and consequently, just null values.

Conclusion

While this seems like a toy example, this situation can be found very often disguised in much larger and complex queries. It is important to understand well the order of the statements in a SQL query, specially when we are doing joins that are not inner.

Share on

Twitter Facebook LinkedIn

Felipe Pérez

Your SQL is probably WRONG

The objective

Setting up the tables

Counting orders

The problem

The fix

Conclusion

Share on

Leave a Comment

You May Also Enjoy

Is overfitting… good?

PCA and supervised learning

Martingales 0

Extreme value theory III