2022-07-07
|~6 min read
|1099 words
NB: This note is based on my experience using Mode analytics to run analysis on an Amazon Redshift SQL database. If there are language distinctions, start by looking at your version of SQL.
A CTE is a common way to refer to a Common Table Expression. But what is a CTE? I like to think of them as abstracted subqueries.
To understand how CTEs might be useful, let’s look at the same query two different ways.
Before we can do that though, we need to set up the example.
Using Mode’s tutorial database, we’ll look at the dunder_mifflin_paper_sales
table.
Let’s start with what’s included in the table:
SELECT
column_name,
data_type
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
table_name = 'dunder_mifflin_paper_sales'
This returns the 60 columns and their data types. 1
Now, to demonstrate the power of CTEs, let’s say that we want to see the total quantity and percent of sales by account. We can imagine that if the list of orders looks like:
order_id | quantity | account_id |
---|---|---|
1 | 10 | a |
2 | 5 | a |
3 | 15 | b |
4 | 2 | c |
5 | 8 | c |
6 | 20 | c |
Then what we want is:
acct_qty | account_id | pct |
---|---|---|
15 | a | .25 |
15 | b | .25 |
30 | c | .5 |
There are likely many ways to achieve this, but one is through the use of CTEs.
Let’s start with just getting the quantities by account:
SELECT
d.account_id AS acct,
sum (d.quantity) AS acct_qty
FROM
tutorial.dunder_mifflin_paper_sales AS d
GROUP BY
1
And the total quantity would be:
SELECT
sum (d.quantity) AS qty
FROM
tutorial.dunder_mifflin_paper_sales AS d
But how do we bring them together and add the extra detail of the percent at the account level? One way would be to add a subquery. There are numerous problems with subqueries, however. Most obviously is the duplication of logic throughout the query. In many cases, we want to use the same piece of logic, but with a subquery, we’d be forced to rewrite it in multiple spots. CTEs might look more complicated, but once you get used to them, I think they make queries much easier to read and reason through.
Before we write out the full query, let’s look at some standard syntax for CTEs.
-- To get started with CTEs, we need to define them.
-- To do this, we use the keyword WITH
-- Then we provide the CTE a name.
-- Just as
with qty_by_account AS ()
-- if we want to do multiple CTEs, add a comma to separate
, total_qty as ()
-- then when you're done we are able to proceed with our standard query
And now, let’s put this into practice with a real example:
-- Defining our CTEs
WITH account_data AS (
SELECT
d.account_id AS acct,
sum (d.quantity) AS acct_qty
FROM
tutorial.dunder_mifflin_paper_sales AS d
GROUP BY
1
),
all_acct_qty AS (
SELECT
sum(acct_qty) AS qty
FROM
account_data
)
-- Using our CTEs in our standard query
SELECT
a.*,
t.qty,
1.0 * SUM(a.acct_qty) / sum(t.qty) AS pct
FROM
account_data AS a
CROSS JOIN all_acct_qty AS t
GROUP BY
a.acct,
a.acct_qty,
t.qty
ORDER BY
2 DESC
While this is slightly verbose, each piece is understandable on its own.
The first CTE, account_data
, gets a sum of the quantity of paper sold at the account level.
The second CTE, all_acct_qty
, summarizes further to get the full quantity of paper products sold by Dunder Mifflin in the data set.
In our SELECT
statement we then reference these table expressions as if they were tables.
SELECT
-- ...
FROM
account_data AS a
CROSS JOIN all_acct_qty AS t
CTEs provide a lovely way to simplify potentially complicated queries. They work by teasing out pieces into smaller queries which are then composed on their own. There’s definitely a learning curve to them, though I’d argue the hardest part of using them is that they force you to think about a problem differently. What used to be one bigger step now makes more sense as smaller ones. These smaller pieces are then composed.
It’s nice and elegant!
1 The full list of columns in our table.
column_name | data_type |
---|---|
order_id | text |
purchased_at | timestamp without time zone |
status | text |
cancelled_at | timestamp without time zone |
returned_at | text |
product_id | text |
product_name | text |
price | double precision |
discount | double precision |
shipping_cost | double precision |
quantity | double precision |
business_size | double precision |
payment_cycle | text |
account_id | text |
account_name | text |
account_manager | text |
days_to_close | double precision |
shipping_mode | text |
shipping_address | text |
shipping_city | text |
shipping_state | text |
shipping_zip | double precision |
shipping_region | text |
shipping_latitude | double precision |
shipping_longitude | double precision |
days_to_ship | double precision |
reviewed_at | timestamp without time zone |
rating | double precision |
index | double precision |
review | text |
order_id | text |
purchased_at | timestamp without time zone |
status | text |
cancelled_at | timestamp without time zone |
returned_at | text |
product_id | text |
product_name | text |
price | double precision |
discount | double precision |
shipping_cost | double precision |
quantity | double precision |
business_size | double precision |
payment_cycle | text |
account_id | text |
account_name | text |
account_manager | text |
days_to_close | double precision |
shipping_mode | text |
shipping_address | text |
shipping_city | text |
shipping_state | text |
shipping_zip | double precision |
shipping_region | text |
shipping_latitude | double precision |
shipping_longitude | double precision |
days_to_ship | double precision |
reviewed_at | timestamp without time zone |
rating | double precision |
index | double precision |
review | text |
Hi there and thanks for reading! My name's Stephen. I live in Chicago with my wife, Kate, and dog, Finn. Want more? See about and get in touch!