Subqueries can be trickier to reason about because we have to consider the performance of each individual part of the query. This can seem complex, since the numbers will differ each time you run a query, but you can still get a general idea of how efficient a query will be. In PostgreSQL, a query plan is created for every query that runs against the database. The planner looks at the structure of the query and the properties of the data and tables to build a good plan before the query is executed.
Understanding all of the details of a plan can be quite complex, so our focus is on the basics of what to look for within a plan. We can use the EXPLAIN command on a query to display those details.
Let us query the invoice table and see what the results tell us:
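For instance, assuming the sample database used in the earlier tutorials, prefixing a simple query with EXPLAIN might look like this (the exact cost and row numbers depend on your data and settings):

```sql
EXPLAIN SELECT * FROM invoice;

--                          QUERY PLAN
-- ------------------------------------------------------------
--  Seq Scan on invoice  (cost=0.00..10.12 rows=412 width=65)
```

The single line of output is the plan: one node, with its estimated cost, row count, and row width.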
Using just EXPLAIN, the database does not run the query; it produces an estimated execution plan based on the statistics it has, which means the plan can differ a bit from reality. In PostgreSQL, though, we can also actually execute the query: use EXPLAIN ANALYZE at the beginning of the query instead of just EXPLAIN.
Note that since this actually runs the query, the timings will most likely differ each time you run it. For example, running it two more times yields the following results:
As such, we cannot rely solely on the planning and execution times when comparing statements. Each query plan consists of nodes, which can be nested; they are executed from the inside out, meaning the innermost node runs before an outer node. Each node has a set of associated statistics, such as the cost, the number of rows returned, and the number of loops performed (if any). From the last execution above, the cost shows 0.00..10.12 and the estimate is that 412 rows are returned. The width shows the estimated width of each row in bytes, in our case 65. The cost field shows how expensive the node was; interpreting it precisely can be complex, as it is measured using settings that become quite technical. The important item is on the left of the first line: the type of node. In this case, we have a “Seq Scan”.
There are quite a few node types, but the most common ones are: Seq Scan (reading an entire table sequentially), Index Scan and Index Only Scan (lookups through an index), Bitmap Heap Scan, the three join strategies Nested Loop, Hash Join, and Merge Join, and Sort and Aggregate.
With subqueries, this can get more complex as you will have nested queries being used. For example, using a subquery similar to what we used in the prior tutorial:
SELECT invoice_id, invoice_date, customer_id, total
FROM invoice
WHERE customer_id IN
(SELECT customer_id FROM customer WHERE city LIKE '%A');
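Running this through EXPLAIN produces a nested plan along these lines, with one sequential scan feeding another (the costs, elided here, will vary with your data):

```sql
EXPLAIN SELECT invoice_id, invoice_date, customer_id, total
FROM invoice
WHERE customer_id IN
    (SELECT customer_id FROM customer WHERE city LIKE '%A');

--  Hash Semi Join  (cost=...)
--    Hash Cond: (invoice.customer_id = customer.customer_id)
--    ->  Seq Scan on invoice  (cost=...)
--    ->  Hash  (cost=...)
--          ->  Seq Scan on customer  (cost=...)
--                Filter: (city ~~ '%A'::text)
```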
Here, we have a sequential scan on the customer table for the inner query, and another sequential scan on the invoice table for the outer query. However, a subquery will not always execute faster than the alternative, as it depends on the data. For example, the same result can be produced with a join:
SELECT invoice_id, invoice_date, invoice.customer_id, total
FROM invoice
INNER JOIN customer ON invoice.customer_id = customer.customer_id
WHERE city LIKE '%A';
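EXPLAIN on the join version shows a very similar plan shape, only with a plain Hash Join at the top (again, costs elided since they depend on your data):

```sql
EXPLAIN SELECT invoice_id, invoice_date, invoice.customer_id, total
FROM invoice
INNER JOIN customer ON invoice.customer_id = customer.customer_id
WHERE city LIKE '%A';

--  Hash Join  (cost=...)
--    Hash Cond: (invoice.customer_id = customer.customer_id)
--    ->  Seq Scan on invoice  (cost=...)
--    ->  Hash  (cost=...)
--          ->  Seq Scan on customer  (cost=...)
--                Filter: (city ~~ '%A'::text)
```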
The results are basically identical: the same type of join, with similar costs for each node. As the data sets grow larger and the result sets start to differ, it becomes easier to distinguish the two plans, especially since customer_id, the column used to join the tables, is a key. What if we joined on the billing address instead:
SELECT invoice_id, invoice_date, customer_id, total
FROM invoice
WHERE billing_address IN (SELECT address FROM customer WHERE country LIKE '%a');
versus the join version:

SELECT invoice_id, invoice_date, invoice.customer_id, total
FROM invoice
INNER JOIN customer ON customer.address = invoice.billing_address
WHERE country LIKE '%a';
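Comparing the two is the same exercise as before; prefix each with EXPLAIN and look at the top node and its cost:

```sql
EXPLAIN SELECT invoice_id, invoice_date, customer_id, total
FROM invoice
WHERE billing_address IN (SELECT address FROM customer WHERE country LIKE '%a');

EXPLAIN SELECT invoice_id, invoice_date, invoice.customer_id, total
FROM invoice
INNER JOIN customer ON customer.address = invoice.billing_address
WHERE country LIKE '%a';
```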
Conceptually, the subquery version ends up slightly cheaper here, by roughly the cost of the hash join that the join version has to perform.