PostgreSQL, Trees and Recursive CTE
Common Table Expressions Over Paths, Loops
This post is a followon to my answer at SO#44620695, recursive path aggregation and CTE query for topdown tree postgres.
@klin’s excellent answer – there – inspired me to experiment with PostgreSQL, trees (paths), and recursive CTE!
Preliminary: What is a Recursive Common Table Expression?
[ Image source: Recursive CTEs Explained  local copy (pdf) ]
See also:
 PostgreSQL docs: WITH Queries (Common Table Expressions)
 Recursive CTEs Explained (pdf)
 Excellent (Michał Konarski): Advanced SQL  Common Table Expressions (pdf)
 his Recursive CTE subsection (pdf) is the best explanation I’ve seen
 Graph Algorithms in a Database  Recursive CTEs & Topological Sort with Postgres (pdf)
[ an excellent companion to this post ]  Finding Circular References in Postgres (pdf)
Preamble
My motivation is storing data in PostgreSQL, but visualizing those data in a graph. While the approach here has limitations (e.g. undirected edges; …), it may otherwise be useful in other contexts.
Here, I adapted @klin’s code to enable CTE without a dependence on the table id
, although I do use those to deal with the issue of circular references (“loops”) in the data, e.g.
a,b
b,a
that throw the CTE into a nonterminating loop.
To solve that, I employed the rather brilliant approach suggested by @ahorsewithnoname in SO#31739150 – see my comments in the script, below.
[ A similar approach is also suggested here: Finding Circular References in Postgres  local copy (pdf). ]
PSQL script (“tree with paths.sql”):
 File: /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql
 Adapted from: https://stackoverflow.com/questions/44620695/recursivepathaggregationandctequeryfortopdowntreepostgres
 See also: /mnt/Vancouver/FC/RDB  PostgreSQL/Recursive CTE  Graph Algorithms in a Database Recursive CTEs and Topological Sort with Postgres.pdf
 https://www.fusionbox.com/blog/detail/graphalgorithmsinadatabaserecursivectesandtopologicalsortwithpostgres/620/
 Run this script in psql, at the psql# prompt:
 \! cd /mnt/Vancouver/Programming/data/metabolism/practice/sql/
 \i /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql
\c practice
DROP TABLE tree;
CREATE TABLE tree (
 id int primary key
id SERIAL PRIMARY KEY
,s TEXT  s: source node
,t TEXT  t: target node
,UNIQUE(s, t)
);
INSERT INTO tree(s, t) VALUES
('a','b')
,('b','a')  << adding this 'back relation' breaks CTE_1 below, as it enters a loop and cannot terminate
,('b','c')
,('b','d')
,('c','e')
,('d','e')
,('e','f')
,('f','g')
,('g','h')
,('c','h');
SELECT * FROM tree;
 SELECT s,t FROM tree WHERE s='b';
 RECURSIVE QUERY 1 (CTE_1):
 WITH RECURSIVE nodes(src, path, tgt) AS (
 SELECT s, concat(s, '>', t), t FROM tree WHERE s = 'a'
  SELECT s, concat(s, '>', t), t FROM tree WHERE s = 'c'
 UNION ALL
 SELECT t.s, concat(path, '>', t), t.t FROM tree t
 JOIN nodes n ON n.tgt = t.s
 )
  SELECT * FROM nodes;
 SELECT path FROM nodes;
 RECURSIVE QUERY 2 (CTE_2):
 Deals with "loops" in Postgres data, per
 https://stackoverflow.com/questions/31739150/tofindinfiniterecursiveloopincte
 "With Postgres it's quite easy to prevent this by collecting all visited nodes in an array."
 EXPLAIN ANALYZE WITH RECURSIVE nodes(id, src, path, tgt) AS (
WITH RECURSIVE nodes(id, src, path, tgt) AS (
SELECT id, s, concat(s, '>', t), t
,array[id] as all_parent_ids
FROM tree WHERE s = 'a'
UNION ALL
SELECT t.id, t.s, concat(path, '>', t), t.t, all_parent_idst.id FROM tree t
JOIN nodes n ON n.tgt = t.s
AND t.id <> ALL (all_parent_ids)  this is the trick to exclude the endless loops
)
 SELECT * FROM nodes;
SELECT path FROM nodes;
Script execution / output (PSQL):
# \i tree_with_paths.sql
You are now connected to database "practice" as user "victoria".
DROP TABLE
CREATE TABLE
INSERT 0 10
id  s  t
++
1  a  b
2  b  a
3  b  c
4  b  d
5  c  e
6  d  e
7  e  f
8  f  g
9  g  h
10  c  h
path

a>b
a>b>a
a>b>c
a>b>d
a>b>c>e
a>b>d>e
a>b>c>h
a>b>d>e>f
a>b>c>e>f
a>b>c>e>f>g
a>b>d>e>f>g
a>b>d>e>f>g>h
a>b>c>e>f>g>h
For reference, here is what those data look like, via NetworkX
:
You can change the starting node (e.g. start at node “d”) in the SQL script – giving, e.g.:
# \i tree_with_paths.sql
...
path

d>e
d>e>f
d>e>f>g
d>e>f>g>h
Network visualization
I exported those data (at the PSQL prompt) to a CSV,
# \copy (SELECT s, t FROM tree) TO '/tmp/tree.csv' WITH CSV
COPY 9
# \! cat /tmp/tree.csv
a,b
b,c
b,d
c,e
d,e
e,f
f,g
g,h
c,h
… which I visualized (image above) in a Python 3.5 venv:
import networkx as nx import pylab as plt G = nx.read_edgelist("/tmp/tree.csv", delimiter=",") G.nodes() # ['b', 'a', 'd', 'f', 'c', 'h', 'g', 'e'] G.edges() # [('b', 'a'), ('b', 'd'), ('b', 'c'), ('d', 'e'), ('f', 'g'), # j w('f', 'e'), ('c', 'e'), ('c', 'h'), ('h', 'g')] G.number_of_nodes() # 8 G.number_of_edges() # 9 # There is a bug in Python or NetworkX: you may need to run this # command 2x, as you may get an error the first time: from networkx.drawing.nx_agraph import graphviz_layout nx.draw(G, pos=graphviz_layout(G), node_size=1200, node_color='lightblue', \ linewidths=0.25, font_size=10, font_weight='bold', with_labels=True) plt.show() nx.dijkstra_path(G, 'a', 'h') # ['a', 'b', 'c', 'h'] nx.dijkstra_path(G, 'a', 'f') # ['a', 'b', 'd', 'e', 'f']
Note that the dijkstra_path
returned from NetworkX
is one of several possible, whereas all paths are returned by the Postgres CTE in a visuallyappealing manner.
Conclusions
While this was a fun and practical exercise,
 recursive CTE;
 how to deal with circular references / loops in CTE;
 applying CTE to trees / paths embedded in PostgreSQL data;
… applied to a simple hierarchical tree structure, for larger data and more complex data structures, it is probably better to apply the right tool – e.g., Neo4j
for data structures involving highly relational data (where Cypher
queries are ideally applied).
Addendum
I was wondering about the “Big O” computational complexity of recursive CTE, but was unsure how to determine it.
I posted a question regarding the computational complexity of recursive CTE to dba.StackExchange: What’s the (BigO) computational complexity of a PostgreSQL recursive common table expression?.
A User there asked “Did you run query analyzer?”, so here is the EXPLAIN ANALYZE
output:
[practice]# \i tree_with_paths.sql You are now connected to database "practice" as user "victoria". DROP TABLE CREATE TABLE INSERT 0 10 QUERY PLAN  CTE Scan on nodes (cost=322.86..356.54 rows=1684 width=32) (actual time=0.027..0.120 rows=13 loops=1) CTE nodes > Recursive Union (cost=4.18..322.86 rows=1684 width=132) (actual time=0.025..0.110 rows=13 loops=1) > Bitmap Heap Scan on tree (cost=4.18..12.65 rows=4 width=132) (actual time=0.024..0.024 rows=1 loops=1) Recheck Cond: (s = 'a'::text) Heap Blocks: exact=1 > Bitmap Index Scan on tree_s_t_key (cost=0.00..4.18 rows=4 width=0) (actual time=0.005..0.005 rows=1 loops=1) Index Cond: (s = 'a'::text) > Hash Join (cost=1.30..27.65 rows=168 width=132) (actual time=0.009..0.012 rows=2 loops=6) Hash Cond: (t.s = n.tgt) Join Filter: (t.id <> ALL (n.all_parent_ids)) Rows Removed by Join Filter: 0 > Seq Scan on tree t (cost=0.00..18.50 rows=850 width=68) (actual time=0.001..0.002 rows=10 loops=6) > Hash (cost=0.80..0.80 rows=40 width=96) (actual time=0.001..0.001 rows=2 loops=6) Buckets: 1024 Batches: 1 Memory Usage: 9kB > WorkTable Scan on nodes n (cost=0.00..0.80 rows=40 width=96) (actual time=0.000..0.001 rows=2 loops=6) Planning time: 0.376 ms Execution time: 0.170 ms
This blog post, Reading a Postgres EXPLAIN ANALYZE Query Plan (local copy: pdf) explains how to interpret that output.
Update (20180617)
I ended up answering my own question on dba.StackExchange #209138, What’s the (Big O) computational complexity of a PostgreSQL recursive common table expression?.
Using the excellent article “A Gentle Introduction to Algorithm Complexity Analysis” as a guide (local copy: pdf), I believe that the worstcase complexity of the example in my question is as follows.
Given this recursive CTE (used above):
WITH RECURSIVE nodes(id, src, path, tgt) AS (
 Anchor member:
SELECT id, s, concat(s, '>', t), t,
array[id] as all_parent_ids
FROM tree WHERE s = 'a'
UNION ALL
 Recursive member:
SELECT t.id, t.s, concat(path, '>', t), t.t, all_parent_idst.id FROM tree t
JOIN nodes n ON n.tgt = t.s
AND t.id <> ALL (all_parent_ids)
)
 Invocation:
SELECT path FROM nodes;

At the Anchor member (query definition), the algorithm selects each row from the table; therefore, at this step the maximum number of iterations (
i
) and the maximum size (n
) is the number of rows in table;i < n
, if a starting point within the table is specified. 
The Recursive member selects each row from the table, starting from the position specified in the anchor member, so the maximum number of iterations here once again is:
i ≤ n
.
So, with the recursive CTE above I believe that the overall complexity is Θ(n)
.
Prior to asking that question on dba.StackExchange, I did a pretty thorough search for discussions of the time complexity of complex CTE queries in Postgres / PSQL – see my notes appended at the end of this blog post (local copy: pdf).
One of the few online articles discussing the computational complexity of recursive CTE is Graph Algorithms in a Database  Recursive CTEs & Topological Sort with Postgres (local copy: pdf).
However, in the “Warning: Here Be Dragons” subsection I believe that the author improperly conflates discussion of computational complexity regarding the two data structures: RDB vs. graphs: he suddenly moves the discussion from RDBMS and recursive CTE (the central discussion in that article), to graph structures.
My contention with that subdiscussion is as follows. I believe that the Postgres specification of the tree, above, explicitly encodes all the information regarding the entities (“nodes / vertices”), and the relations between them (“paths / edges”). So, traversing those data (once; iteratively) occurs in Θ(n)
time.
In the visualized tree (“graph”), the graph implicitly embeds the graph structure, which is simplified in the visual representation. Referring to the figure below (see also SO#5058406: What is the maximum number of edges in a directed graph with n nodes?, the maximum number of edges (in a directed edge graph) is (n)(n1)
; i.e. n²  n
or Θ(n²)
[E = O(V²)
in the cited article]:
While that is true, it’s an “apples vs. oranges” comparison of trees/graphs/paths specified in as RDB vs. a property graph.
As I stated further above, while embedding complex graphs in a RDB (arguably) is not a good idea, given that recursive CTE appear to be an efficient algorithmic approach to determining paths in trees/graphs, that is not one of the reasons.