This guide explains how to use CROSS JOIN phrases to override the optimizer's ordering of tables.
SQLite uses the “CROSS JOIN” phrase as a means to override the table reordering decisions of the query optimizer. The CROSS JOIN connector is rarely needed and should probably never be used prior to the final performance tuning phase of application development. Even then, SQLite usually gets the order of tables in a join right without any extra help. But on those rare occasions when SQLite gets it wrong, the CROSS JOIN connector is an invaluable way of tweaking the optimizer to do what you want.
Intended audience:
This document is intended to be used by Symbian platform licensees and third party application developers.
The SQLite query optimizer will attempt to reorder the tables in a join in order to find the most efficient way to evaluate the join. The optimizer usually does this job well, but occasionally it will make a bad choice. When that happens, it might be necessary to override the optimizer's choice by explicitly specifying the order of tables in the SELECT statement.
To illustrate the problem, consider the following schema:
CREATE TABLE node( id INTEGER PRIMARY KEY, name TEXT ); CREATE INDEX node_idx ON node(name); CREATE TABLE edge( orig INTEGER REFERENCES node, dest INTEGER REFERENCES node, PRIMARY KEY(orig, dest) ); CREATE INDEX edge_idx ON edge(dest,orig);
This schema defines a directed graph with the ability to store a name on each node of the graph. Similar designs (though usually more complicated) arise frequently in application development. Now consider a three-way join against the above schema:
SELECT e.* FROM edge AS e, node AS n1, node AS n2 WHERE n1.name = 'alice' AND n2.name = 'bob' AND e.orig = n1.id AND e.dest = n2.id;
This query asks for information about all edges that go from nodes labelled “alice” over to nodes labelled “bob”.
There are many ways that the optimizer might choose to implement this query, but they all boil down to two basic designs. The first option looks for edges between all pairs of nodes. The following pseudocode illustrates:
foreach n1 where n1.name='alice' do: foreach n2 where n2.name='bob' do: foreach e where e.orig=n1.id and e.dest=n2.id do: return e.* end end end
The second design is to loop over all 'alice' nodes and follow edges off of those nodes looking for nodes named 'bob'. (The roles of 'alice' and 'bob' might be reversed here without changing the fundamental character or the algorithm):
foreach n1 where n1.name='alice' do: foreach e where e.orig=n1.id do: foreach n2 where n2.id=e.dest and n2.name='bob' do: return e.* end end end
The first algorithm above corresponds to a join order of n1-n2-e. The second algorithm corresponds to a join order of n1-e-n2.
The question the optimizer has to answer is which of these two algorithms is likely to give the fastest result, and it turns out that the answer depends on the nature of the data stored in the database.
Let the number of alice nodes be M and the number of bob nodes be N. Consider two scenarios: In the first scenario, M and N are both 2 but there are thousands of edges on each node. In this case, the first algorithm is preferred. With the first algorithm, the inner loop checks for the existence of an edge between a pair of nodes and outputs the result if found. But because there are only 2 alice and bob nodes each, the inner loop only has to run 4 times and the query is very quick.
The second algorithm would take much longer here. The outer loop of the second algorithm only executes twice, but because there are a large number of edges leaving each 'alice' node, the middle loop has to iterate many thousands of times. So in the first scenario, we prefer to use the first algorithm.
Now consider the case where M and N are both 3500. But suppose each of these nodes is connected by only one or two edges. In this case, the second algorithm is preferred.
With the second algorithm, the outer loop still has to run 3500 times, but the middle loop only runs once or twice for each outer loop and the inner loop will only run once for each middle loop, if at all. So the total number of iterations of the inner loop is around 7000.
The first algorithm, on the other hand, has to run both its outer loop and its middle loop 3500 times each, resulting in 12 million iterations of the middle loop. Thus in the second scenario, second algorithm is nearly 2000 times faster than the first.
In this particular example, if you run ANALYZE on your database to collect statistics on the tables, the optimizer will be able to figure out the best algorithm to use. But if you do not want to run ANALYZE or if you do not want to waste database space storing the SQLITE_STAT1 statistics table that ANALYZE generates, you can manually override the decision of the optimizer by specifying a particular order for tables in a join. You do this by substituting the keyword phrase “CROSS JOIN” in place of commas in the FROM clause.
The “CROSS JOIN” phrase forces the table to the left to be used before the table to the right. For example, to force the first algorithm, write the query this way:
SELECT * FROM node AS n1 CROSS JOIN node AS n2 CROSS JOIN edge AS e WHERE n1.name = 'alice' AND n2.name = 'bob' AND e.orig = n1.id AND e.dest = n2.id;
And to force the second algorithm, write the query like this:
SELECT * FROM node AS n1 CROSS JOIN edge AS e CROSS JOIN node AS n2 WHERE n1.name = 'alice' AND n2.name = 'bob' AND e.orig = n1.id AND e.dest = n2.id;
The CROSS JOIN keyword phrase is perfectly valid SQL syntax according to the SQL standard, but it is syntax that is rarely if ever used in real-world SQL statements. Because it is so rarely used otherwise, SQLite has appropriated the phrase as a means to override the table reordering decisions of the query optimizer.
The CROSS JOIN connector is rarely needed and should probably never be used prior to the final performance tuning phase of application development. Even then, SQLite usually gets the order of tables in a join right without any extra help. But on those rare occasions when SQLite gets it wrong, the CROSS JOIN connector is an invaluable way of tweaking the optimizer to do what you want.