Writing Efficient SQL Queries

Many SQL tutorials do an excellent job teaching how to write queries to retrieve data from a database. However, only a few teach best practices, because their primary aim is to get the basic intuition behind SQL across to learners. This article provides tips on how to write efficient SQL queries.

The downsides of inefficient query-writing are not evident when dealing with small databases. As your database scales into tables with millions or billions of rows, writing efficient SQL queries becomes a vital skill. There are two significant reasons to learn to write SQL queries efficiently:

  • To minimize query response time - that is, to reduce the duration between when a query is issued and when a response is received.

  • To use server resources efficiently - that is, to retrieve data from your database with as few server resources as possible.

Before we jump right into the tips, two notes:

  • I talk about PostgreSQL in this writing, but if you're not a PostgreSQL user, read on; I'm sure you'll learn from it too.

  • Since the primary aims are minimizing query response time and using server resources efficiently, I should say this too: working on a database with a good design puts you one step ahead in achieving these goals. This writing isn't about database design, but if you're interested in learning database design and best practices, Caleb Curry has a complete course on YouTube. You could start with that.

Tips for Writing Efficient SQL Queries

1. Do not use select *

There is barely any beginner-level tutorial that does not teach the use of select *. As a beginner, I felt like a superhero using it to fetch all the data in a table, but over time I've learned not to use it (in fact, I get livid when I see it, just kidding). Here's why:

  • Unnecessary I/O and Network Traffic - There's usually no good reason to use this query, because the fields you need to work with are often only a subset of the table's fields. Running it triggers unnecessary I/O operations and wastes CPU on both the server and the client. Your client application may also need extra storage to hold the redundant data it pulled from the server, and you waste network bandwidth transferring data the client never needed.

  • Problem with Joins - When joining multiple tables, several of them may contain fields with the same name, e.g., id or name. If you use select * in the join query, some operations (e.g., ordering by a field that appears more than once) lead to an ambiguous column reference. An ambiguous column reference is simply PostgreSQL saying

Please help, I don't know which of these columns you are referring to

This can be worked around by aliasing the repeated column names, but it's best simply not to use select * (a fix is sketched after the error output below).

Consider a car race database with three tables: car_brand, car_model, and driver. The car_brand and car_model tables both contain the field name. Below, I join the three tables using a subquery and inner joins, and include an order by on the name column. Notice the error at the end.

Writings=# create table car_brand(brand_id int primary key,
Writings(# name varchar(15),
Writings(# car_class varchar(20) not null check (car_class in ('Basic','Premium','Luxurious')));
CREATE TABLE
Writings=# create table car_model(model_id int primary key,
Writings(# name varchar(20),
Writings(# brand_id int references car_brand(brand_id));
CREATE TABLE
Writings=# insert into car_brand(brand_id,name,car_class) values(1,'Toyota','Basic');
INSERT 0 1
Writings=# insert into car_brand(brand_id,name,car_class) values(2,'lexus','Premium');
INSERT 0 1
Writings=# insert into car_brand(brand_id,name,car_class) values(3,'BMW','Luxurious');
INSERT 0 1
Writings=# insert into car_model(model_id,name,brand_id) values (10,'Camry01',1);
INSERT 0 1
Writings=# insert into car_model(model_id,name,brand_id) values (100,'Lexus01',2);
INSERT 0 1
Writings=# insert into car_model(model_id,name,brand_id) values (1000,'BMW01',3);
INSERT 0 1


Writings=# create table driver(first_name varchar(10),
Writings(# brand_id int references car_brand(brand_id));
CREATE TABLE
Writings=# insert into driver(first_name,brand_id)values('John',1);
INSERT 0 1
Writings=# insert into driver(first_name,brand_id)values('Sam',2);
INSERT 0 1
Writings=# insert into driver(first_name,brand_id)values('Dan',3);
INSERT 0 1
Writings=# select a.*,
Writings-# b.*
Writings-# from driver as a
Writings-# inner join (select c.*,
Writings(# d.*
Writings(# from car_brand as c
Writings(# inner join car_model as d
Writings(# on c.brand_id = d.brand_id)
Writings-# as b
Writings-# on a.brand_id = b.brand_id
Writings-# order by b.Name;
ERROR:  column reference "Name" is ambiguous
LINE 10: on a.brand_id = b.brand_id
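
One way to fix this, sketched below against the same tables, is to drop select * and list only the columns you need, aliasing the duplicated name columns explicitly. (This is one possible rewrite, not the only one; the subquery isn't needed once the columns are explicit.)

-- List the columns explicitly and alias the duplicated "name" columns,
-- so there is nothing ambiguous left for PostgreSQL to complain about.
select a.first_name,
       c.name as brand_name,
       d.name as model_name
from driver as a
inner join car_brand as c on a.brand_id = c.brand_id
inner join car_model as d on c.brand_id = d.brand_id
order by brand_name;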

  • Breaks the Modularity Rule - Using select * breaks the modularity rule of reducing a system's complexity by splitting it into independent parts and hiding details behind an abstraction. The point here is that select * does not show the individual components of the query, so anyone reading it cannot tell exactly which columns it returns.

  • Problem with Views - A view is a logical table that exposes data from underlying tables through a SELECT statement. Views work like stored formulas: when you run a select statement often, a view lets you save that statement so you don't have to rewrite it every time. When a view exists and you make changes to the underlying table, some changes, such as renaming a column, will not lead to inaccurate data in the view, since PostgreSQL references the underlying objects in a view by object id. Because a rename does not alter the object id, the view keeps returning the same data.

    In the example below, a view test_view is created on a table person using select *. The first_name and last_name columns of the table are then renamed to first and last respectively. This change does not break the view definition.

Writings=# create table person (id int, first_name varchar(20), last_name varchar(20));
CREATE TABLE
Writings=# insert into person (id, first_name, last_name) values (1,'Sam','Daniel');
INSERT 0 1
Writings=# insert into person (id, first_name, last_name) values (2,'Tobi','Mike');
INSERT 0 1
Writings=# select id, first_name, last_name from person;
 id | first_name | last_name 
----+------------+-----------
  1 | Sam        | Daniel
  2 | Tobi       | Mike
(2 rows)

Writings=# create view test_view as select * from person;
CREATE VIEW
Writings=# select * from test_view;
 id | first_name | last_name 
----+------------+-----------
  1 | Sam        | Daniel
  2 | Tobi       | Mike
(2 rows)

Writings=# alter table person rename first_name to first;
ALTER TABLE
Writings=# alter table person rename last_name to last;
ALTER TABLE
Writings=# \d+ test_view;
                                   View "public.test_view"
   Column   |         Type          | Collation | Nullable | Default | Storage  | Description 
------------+-----------------------+-----------+----------+---------+----------+-------------
 id         | integer               |           |          |         | plain    | 
 first_name | character varying(20) |           |          |         | extended | 
 last_name  | character varying(20) |           |          |         | extended | 
View definition:
 SELECT person.id,
    person.first AS first_name,
    person.last AS last_name
   FROM person;

However, other changes are not handled so gracefully: a newly added column will not show up in the view, and dropping a column the view references is blocked unless you drop the view as well. Either way, a select * view quietly stops matching the table it was built on.

Using the same table, let's add a new column age and see whether the change is reflected in test_view. Spoiler: it's not.

Writings=# alter table person add column age int;
ALTER TABLE
Writings=# \d+ person;
                                                 Table "public.person"
 Column |         Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 id     | integer               |           |          |         | plain    |             |              | 
 first  | character varying(20) |           |          |         | extended |             |              | 
 last   | character varying(20) |           |          |         | extended |             |              | 
 age    | integer               |           |          |         | plain    |             |              | 
Access method: heap

Writings=# \d+ test_view;
                                   View "public.test_view"
   Column   |         Type          | Collation | Nullable | Default | Storage  | Description 
------------+-----------------------+-----------+----------+---------+----------+-------------
 id         | integer               |           |          |         | plain    | 
 first_name | character varying(20) |           |          |         | extended | 
 last_name  | character varying(20) |           |          |         | extended | 
View definition:
SELECT person.id,
    person.first AS first_name,
    person.last AS last_name
 FROM person;

In addition to not using select *, another good piece of advice is to use limit. When writing queries (especially ones with computed fields), we often want to check whether the logic behind our computation leads to the desired outcome. If you're not sure of the logic, use limit to test it on a small number of rows before applying it to the entire table.
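
For example, a minimal sketch of this habit (the sales table and its amount and tax_rate columns are hypothetical):

-- Trial-run a computed field on a handful of rows before touching the whole table.
select order_id,
       amount,
       amount * tax_rate as tax_due
from sales
limit 10;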

2. Use Indexes

Suppose you were reading a text in the morning and came across a chapter you liked. Returning to the text at night to find that chapter would require scanning through every page until you found that chapter again. Finding the chapter becomes easier and faster if the writer provides an index. In simple terms, an index is an object containing pointers to other objects' locations. To understand why using indexes is a critical way to improve query efficiency, let's see the path of a query in PostgreSQL.

Step 1: A connection manager establishes a connection from a client to the server.

Step 2: The parser checks the query for syntax errors. If no errors are detected, a query tree is created.

Step 3: The rewrite system takes the query tree created at the parser stage and looks for any rules (stored in the system catalogs) to apply to it, performing the transformations given in those rules. Now, to the stage where indexes matter.

Step 4: The planner/optimizer takes the (rewritten) query tree and creates a query plan that is passed to the executor. The planner creates all possible paths that lead to the result requested by the query. If there is an index on a field used to filter or sort the data, an additional path is added to the existing ones. The cost of each path is calculated, and the cheapest path is chosen.

Step 5: The executor recursively steps through the plan tree and retrieves rows in the way represented by the plan. The executor uses the storage system while scanning relations, performs sorts and joins, evaluates qualifications, and finally hands back the rows derived.
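
If you'd like to see which path the planner settled on for a particular query, plain explain (without analyze) prints the chosen plan and its estimated cost without executing the query. A minimal sketch, using the test_table we build in the next example:

-- Show the plan the planner picked, with its cost estimates, without running the query.
explain select name from test_table where id = 4587;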

Indexes in PostgreSQL are created using the statement format below:

create index index_name on table_name(field_name);

Let's see how creating an index on a column can improve query response time. We create a table test_table with two columns, id and name. The table has 6 million rows: half of the rows have name = 'Bob' and the other half 'Ada'. We then filter the table to return the name where id = 4587.

Writings=# create table test_table (id serial, name text); 
CREATE TABLE 
Writings=# insert into test_table (name) select 'Bob' from generate_series(1,3000000); 
INSERT 0 3000000 
Writings=# insert into test_table (name) select 'Ada' from generate_series(1,3000000); 
INSERT 0 3000000 
Writings=# analyze; 
ANALYZE 
Writings=# \timing 
Timing is on. 
Writings=# select name from test_table where id = 4587; 
 name 
------
 Bob
(1 row)

Time: 293.280 ms

Writings=# create index id_index on test_table(id); 
CREATE INDEX
Time: 4581.159 ms (00:04.581)
Writings=# select name from test_table where id = 4587; 
 name 
------
 Bob
(1 row)

Time: 1.456 ms

Before we created an index id_index on the id column, it took 293.280 ms to return the row with id = 4587. After the index was created, it took 1.456 ms to return the same data (a 99.5% reduction in query response time). Now, let's drop the index and see what happens when we run the query without an index. To do this, we use the explain analyze statement before the query.

Writings=# drop index id_index; 
DROP INDEX 
Time: 13.292 ms 
Writings=# explain analyze select name from test_table where id=2; 
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..58799.10 rows=1 width=4) (actual time=0.371..280.385 rows=1 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on test_table  (cost=0.00..57799.00 rows=1 width=4) (actual time=178.033..270.987 rows=0 loops=3)
         Filter: (id = 2)
         Rows Removed by Filter: 2000000
 Planning Time: 0.238 ms
 Execution Time: 280.407 ms
(8 rows)

Time: 283.613 ms

Without an index, the query optimizer has no choice but to do a parallel sequential scan on the table, which takes more time.

Writings=# create index id_index on test_table(id); 
CREATE INDEX 
Time: 3261.624 ms (00:03.262) 
Writings=# explain analyze select name from test_table where id=2; 
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Scan using id_index on test_table  (cost=0.43..8.45 rows=1 width=4) (actual time=0.265..0.266 rows=1 loops=1)
   Index Cond: (id = 2)
 Planning Time: 0.371 ms
 Execution Time: 0.281 ms
(4 rows)

Time: 1.307 ms 
Writings=#

The query response time is drastically reduced with an index because the query optimizer uses an index scan on the table. Primarily, the role of an index is to make retrieval of records easy, so you don't have to go through every record to find records that meet your search criteria.

In PostgreSQL, there are six index types: B-tree, Hash, GiST, SP-GiST, GIN, and BRIN. PostgreSQL creates a B-tree index by default, but that does not always guarantee a reduction in response time, as different index types suit different fields. Explaining them all would lengthen this writing; I have included links to learn more about the different index types in the further readings section.
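
For reference, a non-default index type is picked with a using clause. A quick sketch (the tables and columns here are hypothetical, purely to show the syntax):

-- BRIN suits very large tables whose values correlate with physical row order.
create index readings_time_brin on sensor_readings using brin (recorded_at);

-- GIN suits "contains" lookups on arrays, jsonb, or full-text documents.
create index documents_tags_gin on documents using gin (tags);

-- Hash suits simple equality lookups.
create index sessions_token_hash on sessions using hash (token);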

Some guidelines for Indexing

When creating indexes, these are some things to keep in mind.

  • Use indexes only when working with large tables - If you create an index on a field in a small table, the query optimizer may never use it, because the cost of scanning the entire table might be lower than the cost of using the index. The trade-off becomes unnecessary: the index will not be used but will still require updating whenever rows are inserted or updated, making it an extra load on the database. Let's create a table test_t similar to test_table but with one hundred rows, create an index on its id field, and use the explain analyze statement to see what path the query optimizer follows to return the data.

Writings=# create table test_t (id serial, name text); 
CREATE TABLE
Writings=# insert into test_t (name) select 'Waneri' from generate_series(1,50); 
INSERT 0 50 
Writings=# insert into test_t (name) select 'Wale' from generate_series(1,50); 
INSERT 0 50 
Writings=# create index id_test_t on test_t(id); 
CREATE INDEX 
Writings=# explain analyze select * from test_t where id =78; 
                                           QUERY PLAN
-------------------------------------------------------------------------------------------------
 Seq Scan on test_t  (cost=0.00..2.25 rows=1 width=10) (actual time=0.133..0.135 rows=1 loops=1)
   Filter: (id = 78)
   Rows Removed by Filter: 99
 Planning Time: 2.288 ms
 Execution Time: 0.158 ms
(5 rows)

Writings=#

The query optimizer ignores the index and performs a sequential scan on the table.

  • Use indexes on tables that do not undergo frequent bulk insert or update operations, because indexes also need to be updated when inserts or updates happen. You get to define 'frequent' based on the business scenario.

  • Finally, indexes occupy a significant amount of space (sometimes almost as much as the table itself, depending on the table's size); always ensure you use the appropriate index type and weigh its relevance against its cost. The cardinality of the field is a go-to factor when deciding on an index; the rule of thumb is to use a B-tree index for high-cardinality columns. A good way to deal with a field that has a skewed distribution is a partial index on the minority category in that field. A partial index indexes only the records that meet specified criteria. For instance, if a field gender in a table students contains 80% male and 20% female, we can speed up query response time by creating an index only on the records of females. Partial indexes are created with the statement below:

    create index index_name on table_name(field_name) where [ condition ];

    In the scenario above, it would be:
    create index gender_index on students(gender) where gender = 'female';

Since SQL is a declarative language, the final decision is left to the optimizer, which picks a path based on the available indexes, the underlying table's statistics, and the query issued. In the further readings section, I have put a link to other index types and their best use cases.

3. Vacuum and Vacuum Full

Have you ever thought about what happens when a record in a table is deleted or updated? Here's what happens: in the case of a deletion, the deleted row is not removed from the table but marked as deleted, so that when a select statement runs, it returns all specified records except those marked as deleted. In the case of an update, the row is first marked as deleted, and then an insert with the new values is made into the table.

The underlying table is affected by these dead records, since they occupy usable space in the database. Dead rows bloat the table, increase its size, and increase the time required for table scans. If you run frequent update and delete operations, you should know about Vacuum and Vacuum Full:

  • Vacuum

    Vacuum marks the space occupied by dead rows as reusable so that when an insert command runs, the new record occupies the space once held by a dead record. The Vacuum command does not remove the dead records; it only marks the space they occupy as reusable.

  • Vacuum Full

    Vacuum Full, on the other hand, removes the dead rows from the table, rebuilds indexes, shrinks the table, and writes this new copy of the table to disk. The old table is not deleted until the new copy has been completely written to disk; if an error occurs while writing the new copy, the Vacuum Full operation is rolled back, so live records are not lost. This ensures ACID compliance in the database. It is important to note that while a Vacuum Full runs, the underlying table is locked, and no commands can be run against it. PostgreSQL, by default, runs an Autovacuum daemon that handles routine vacuuming. The goal of Autovacuum is to run the Vacuum command automatically at appropriate intervals, preventing the need for a Vacuum Full operation that would require an exclusive lock on the underlying table. Autovacuum also lets you trigger vacuums based on the portion of the table occupied by dead records.

The PostgreSQL documentation shows how to tune the Autovacuum parameters to meet your business needs; I've added a link in the further readings section. If you think you're better off with manual vacuums, you can disable Autovacuum (though I'd advise tuning its parameters over disabling it). Vacuum Full tends to be very useful for restoring query efficiency after a spike in deletes or updates.
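
If you go the tuning route, autovacuum can also be adjusted per table through storage parameters. A minimal sketch (the table name is hypothetical, and the thresholds should be chosen for your own workload):

-- Make autovacuum kick in once dead rows exceed roughly 5% of the table plus 1000 rows.
alter table orders set (
    autovacuum_vacuum_scale_factor = 0.05,
    autovacuum_vacuum_threshold    = 1000
);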

The point is, whatever decision you make (putting in cost as a consideration), ensure dead records are not reducing the efficiency of your queries.
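
One easy way to keep an eye on dead records is PostgreSQL's built-in statistics view pg_stat_user_tables; a quick sketch:

-- Tables with the most dead rows, plus when they were last vacuumed.
select relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
from pg_stat_user_tables
order by n_dead_tup desc;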

Now let's see Vacuum in action. First, we create a table with 500,000 rows with Autovacuum disabled.

Writings=# create table vacuum_test (id int) with (autovacuum_enabled=off); 
CREATE TABLE 
Writings=# insert into vacuum_test select * from generate_series(1,500000); 
INSERT 0 500000 
Writings=# select pg_size_pretty(pg_relation_size('vacuum_test')); 
 pg_size_pretty 
----------------
 17 MB
(1 row)

The size of the table created is 17MB. Now let's update the entire table and see what happens to its size.

Writings=# update vacuum_test set id = id+1; 
UPDATE 500000 
Writings=# select pg_size_pretty(pg_relation_size('vacuum_test')); 
 pg_size_pretty 
----------------
 35 MB
(1 row)

The table doubles in size. Half of the table now contains dead records. Now, let us run the vacuum on the table and see what happens to its size.

Writings=# vacuum vacuum_test; 
VACUUM 
Writings=# select pg_size_pretty(pg_relation_size('vacuum_test')); 
 pg_size_pretty 
----------------
 35 MB
(1 row)

The table size does not change because the vacuum does not remove dead records; it only marks their space as reusable. Now, let's see what happens when an insert is made into the table.

Writings=# insert into vacuum_test select * from generate_series(1,500000); 
INSERT 0 500000 
Writings=# select pg_size_pretty(pg_relation_size('vacuum_test')); 
 pg_size_pretty 
----------------
 35 MB
(1 row)

The size remains the same because the new records occupy the space marked as reusable when we ran the vacuum command. Now let's update the entire table again; when this happens, the table size roughly doubles.

Writings=# update vacuum_test set id = id +1; 
UPDATE 1000000 
Writings=# select pg_size_pretty(pg_relation_size('vacuum_test')); 
 pg_size_pretty 
----------------
 69 MB
(1 row)

Now, let's run the vacuum full command on the table.

Writings=# vacuum full vacuum_test;
VACUUM 
Writings=# select pg_size_pretty(pg_relation_size('vacuum_test')); 
 pg_size_pretty 
----------------
 35 MB
(1 row)

The dead rows are removed, and the table shrinks back to its previous size (35 MB).

Some other tips

  • Use truncate instead of delete - When the task is emptying tables (especially large ones), truncate is a better option than delete because it requires fewer server resources and less logging overhead. It removes all rows without scanning the table, which makes it faster, and it reclaims disk space immediately, like the vacuum full command. Note, however, that a truncate cannot be rolled back once committed.

  • Dealing with string data - Use wildcards instead of regular expression (regex) functions whenever possible, and avoid applying functions in your filters unless they are truly necessary. Both tips are sketched below.
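
Here is a small sketch of both tips (the tables and columns are hypothetical):

-- Emptying a large table: truncate skips the row-by-row work that delete does.
truncate table staging_events;

-- A left-anchored wildcard pattern is cheaper to evaluate than a full regular
-- expression and is easier for the planner to optimize.
select customer_id, email
from customers
where email like 'sam%';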

If you've read up to this point, I hope you learnt something from this writing. Writing efficient queries (and code generally) comes with practice and continuous learning. Feel free to leave a comment about anything mentioned here. Gracias!

Further readings/Reference