
Subquery Performance

Author: Sophia Tutorial

Determine the performance differences for two queries that return identical results, where one uses a subquery and one uses a JOIN.

what's covered
This tutorial explores the use of the EXPLAIN statement in two parts:

  1. EXPLAIN explained
  2. Subqueries vs JOIN

1. EXPLAIN explained

The use of subqueries can be a bit trickier, as we have to consider the performance of each individual part of the query. This can seem complex because the timing can and will differ each time you run a query; still, you can get a general idea of how efficient a query will be. In PostgreSQL, a query plan is created for every query that is run against the database. Before executing a query, the database looks at its structure and at the properties of the data and tables to create a good plan.

Understanding all of the details of a plan can be quite complex, so our focus is on the basics of what to look for within a plan. We can use the EXPLAIN command on a query to display those details.

Let us query the invoice table and see what the results tell us:

EXPLAIN SELECT *
FROM invoice;


Using just EXPLAIN, the database does not run the query; it creates an estimate of the execution plan based on the statistics that it has. This means that the plan can differ a bit from reality. In PostgreSQL, though, we can also execute the query. To do so, we use EXPLAIN ANALYZE at the beginning of the query instead of just EXPLAIN.

EXPLAIN ANALYZE SELECT *
FROM invoice;


Note that since the query is actually executed here, the timing will most likely differ each time you run it; running it a few more times yields slightly different planning and execution times on each run.

As such, we cannot simply depend on the planning and execution time when comparing statements. Each query plan consists of nodes, which can be nested. They are executed from the inside out, meaning that the innermost node is executed before an outer node. Each node has a set of associated statistics, such as the cost, the number of rows returned, and the number of loops performed (if any).

From the last execution above, we can see that the cost shows 0.00..10.12 and that the planner estimates 412 rows returned. The width shows the estimated size of each row in bytes, which in our case is 65. The cost field shows how expensive the node is; it is measured in arbitrary planner units, and the details can become quite technical. The most important item is at the left of the first line: the type of node. In this case, we have a “Seq Scan”.
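Putting those fields together, the plan line described above can be sketched as follows (the numbers match the estimates quoted in this tutorial; yours will differ):

```sql
EXPLAIN SELECT * FROM invoice;

-- Example output (estimates only; actual values vary by database):
--
--                         QUERY PLAN
-- ----------------------------------------------------------
--  Seq Scan on invoice  (cost=0.00..10.12 rows=412 width=65)
--
-- Reading the line left to right:
--   Seq Scan on invoice -> the node type and the table it scans
--   cost=0.00..10.12    -> estimated startup cost .. total cost
--   rows=412            -> estimated number of rows returned
--   width=65            -> estimated average row size in bytes
```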

There are quite a few nodes but the most common ones are the following:

  • Seq Scan – A sequential scan is performed over a table within the database. It can be very slow when we are retrieving many rows, so it is best to avoid these nodes on large tables.
  • Index Scan – A scan over an index can be much faster for finding data. Think of it like using the table of contents or the glossary of a book to pinpoint a specific page, rather than reading every page one at a time. These tend to be the fastest way to search for data.
  • Bitmap Index Scan and Bitmap Heap Scan – In terms of performance, the bitmap scans sit between the sequential scan and the index scan. They can occur when we are reading too many rows for a plain index scan to be efficient, but too few to justify a sequential scan.

Any time we query data through a primary key, a unique key, or another indexed column, the plan should show an index scan or a bitmap scan, and the results will generally come back much faster than with a sequential scan.
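To see the difference in practice, compare a filter on an indexed column with one on a non-indexed column. A minimal sketch, assuming invoice_id is the primary key of the invoice table and total is not indexed (as in the sample database used here):

```sql
-- Filtering on the primary key lets the planner use its index,
-- so the plan should show an Index Scan (or a bitmap scan):
EXPLAIN SELECT * FROM invoice WHERE invoice_id = 5;

-- Filtering on a non-indexed column falls back to a Seq Scan:
EXPLAIN SELECT * FROM invoice WHERE total > 20;
```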

2. Subqueries vs JOIN

With subqueries, the plan gets more complex, as the nested query adds its own nodes. For example, using a subquery similar to the one we used in the prior tutorial:

EXPLAIN
SELECT invoice_id, invoice_date, customer_id, total
FROM invoice
WHERE customer_id IN
    (SELECT customer_id FROM customer
     WHERE city LIKE '%A');


Here, we have a sequential scan on the customer table for the inner query, and another sequential scan on the invoice table for the outer query. However, this form may not always execute faster than the equivalent JOIN, as it depends on the data. For example:

EXPLAIN
SELECT invoice_id, invoice_date, invoice.customer_id, total
FROM invoice
INNER JOIN customer ON invoice.customer_id = customer.customer_id
WHERE city LIKE '%A';


The resulting plans are essentially identical in the type of join used and the cost associated with each node. As the data sets grow larger and the result sets diverge, it becomes easier to distinguish the two, especially since customer_id was used to join the tables. What if we joined on the billing address instead:

EXPLAIN
SELECT invoice.*
FROM invoice
WHERE billing_address IN (SELECT address FROM customer
                          WHERE country LIKE '%a');


EXPLAIN
SELECT invoice.*
FROM invoice
INNER JOIN customer ON customer.address = invoice.billing_address
WHERE country LIKE '%a';


Conceptually, the subquery ends up being slightly faster here, since it avoids the cost of the hash join.
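To check this on your own copy of the data, run both versions with EXPLAIN ANALYZE and compare the reported planning and execution times (remembering that they vary from run to run, so run each several times and look at the trend):

```sql
-- Subquery version:
EXPLAIN ANALYZE
SELECT invoice.*
FROM invoice
WHERE billing_address IN (SELECT address FROM customer
                          WHERE country LIKE '%a');

-- JOIN version:
EXPLAIN ANALYZE
SELECT invoice.*
FROM invoice
INNER JOIN customer ON customer.address = invoice.billing_address
WHERE country LIKE '%a';
```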

try it
Your turn! Open the SQL tool by clicking on the LAUNCH DATABASE button below. Then enter one of the examples above and see how it works. Next, try your own choices for which columns the query should return.

The performance of queries can be explored further using the EXPLAIN statement.