This post is a follow-on to my answer at SO#44620695, recursive path aggregation and CTE query for top-down tree postgres.
@klin’s excellent answer – there – inspired me to experiment with PostgreSQL, trees (paths), and recursive CTE!
Preliminary: What is a Recursive Common Table Expression?
- PostgreSQL docs: WITH Queries (Common Table Expressions)
- Recursive CTEs Explained (pdf)
- Excellent (Michał Konarski): Advanced SQL - Common Table Expressions (pdf)
- his Recursive CTE subsection (pdf) is the best explanation I’ve seen
- Graph Algorithms in a Database - Recursive CTEs & Topological Sort with Postgres (pdf)
[ an excellent companion to this post ]
- Finding Circular References in Postgres (pdf)
My motivation is storing data in PostgreSQL, but visualizing those data in a graph. While the approach here has limitations (e.g. undirected edges; …), it may otherwise be useful in other contexts.
Here, I adapted @klin’s code to enable CTE without a dependence on the table
id, although I do use those to deal with the issue of circular references (“loops”) in the data, e.g.
that throw the CTE into a nonterminating loop.
PSQL script (“tree with paths.sql”):
-- File: /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql -- Adapted from: https://stackoverflow.com/questions/44620695/recursive-path-aggregation-and-cte-query-for-top-down-tree-postgres -- See also: /mnt/Vancouver/FC/RDB - PostgreSQL/Recursive CTE - Graph Algorithms in a Database Recursive CTEs and Topological Sort with Postgres.pdf -- https://www.fusionbox.com/blog/detail/graph-algorithms-in-a-database-recursive-ctes-and-topological-sort-with-postgres/620/ -- Run this script in psql, at the psql# prompt: -- \! cd /mnt/Vancouver/Programming/data/metabolism/practice/sql/ -- \i /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql \c practice DROP TABLE tree; CREATE TABLE tree ( -- id int primary key id SERIAL PRIMARY KEY ,s TEXT -- s: source node ,t TEXT -- t: target node ,UNIQUE(s, t) ); INSERT INTO tree(s, t) VALUES ('a','b') ,('b','a') -- << adding this 'back relation' breaks CTE_1 below, as it enters a loop and cannot terminate ,('b','c') ,('b','d') ,('c','e') ,('d','e') ,('e','f') ,('f','g') ,('g','h') ,('c','h'); SELECT * FROM tree; -- SELECT s,t FROM tree WHERE s='b'; -- RECURSIVE QUERY 1 (CTE_1): -- WITH RECURSIVE nodes(src, path, tgt) AS ( -- SELECT s, concat(s, '->', t), t FROM tree WHERE s = 'a' -- -- SELECT s, concat(s, '->', t), t FROM tree WHERE s = 'c' -- UNION ALL -- SELECT t.s, concat(path, '->', t), t.t FROM tree t -- JOIN nodes n ON n.tgt = t.s -- ) -- -- SELECT * FROM nodes; -- SELECT path FROM nodes; -- RECURSIVE QUERY 2 (CTE_2): -- Deals with "loops" in Postgres data, per -- https://stackoverflow.com/questions/31739150/to-find-infinite-recursive-loop-in-cte -- "With Postgres it's quite easy to prevent this by collecting all visited nodes in an array." -- EXPLAIN ANALYZE WITH RECURSIVE nodes(id, src, path, tgt) AS ( WITH RECURSIVE nodes(id, src, path, tgt) AS ( SELECT id, s, concat(s, '->', t), t ,array[id] as all_parent_ids FROM tree WHERE s = 'a' UNION ALL SELECT t.id, t.s, concat(path, '->', t), t.t, all_parent_ids||t.id FROM tree t JOIN nodes n ON n.tgt = t.s AND t.id <> ALL (all_parent_ids) -- this is the trick to exclude the endless loops ) -- SELECT * FROM nodes; SELECT path FROM nodes;
Script execution / output (PSQL):
# \i tree_with_paths.sql You are now connected to database "practice" as user "victoria". DROP TABLE CREATE TABLE INSERT 0 10 id | s | t ----+---+--- 1 | a | b 2 | b | a 3 | b | c 4 | b | d 5 | c | e 6 | d | e 7 | e | f 8 | f | g 9 | g | h 10 | c | h path --------------------- a->b a->b->a a->b->c a->b->d a->b->c->e a->b->d->e a->b->c->h a->b->d->e->f a->b->c->e->f a->b->c->e->f->g a->b->d->e->f->g a->b->d->e->f->g->h a->b->c->e->f->g->h
For reference, here is what those data look like, via
You can change the starting node (e.g. start at node “d”) in the SQL script – giving, e.g.:
# \i tree_with_paths.sql ... path --------------- d->e d->e->f d->e->f->g d->e->f->g->h
I exported those data (at the PSQL prompt) to a CSV,
# \copy (SELECT s, t FROM tree) TO '/tmp/tree.csv' WITH CSV COPY 9 # \! cat /tmp/tree.csv a,b b,c b,d c,e d,e e,f f,g g,h c,h
… which I visualized (image above) in a Python 3.5 venv:
import networkx as nx import pylab as plt G = nx.read_edgelist("/tmp/tree.csv", delimiter=",") G.nodes() # ['b', 'a', 'd', 'f', 'c', 'h', 'g', 'e'] G.edges() # [('b', 'a'), ('b', 'd'), ('b', 'c'), ('d', 'e'), ('f', 'g'), # j w('f', 'e'), ('c', 'e'), ('c', 'h'), ('h', 'g')] G.number_of_nodes() # 8 G.number_of_edges() # 9 # There is a bug in Python or NetworkX: you may need to run this # command 2x, as you may get an error the first time: from networkx.drawing.nx_agraph import graphviz_layout nx.draw(G, pos=graphviz_layout(G), node_size=1200, node_color='lightblue', \ linewidths=0.25, font_size=10, font_weight='bold', with_labels=True) plt.show() nx.dijkstra_path(G, 'a', 'h') # ['a', 'b', 'c', 'h'] nx.dijkstra_path(G, 'a', 'f') # ['a', 'b', 'd', 'e', 'f']
Note that the
dijkstra_path returned from
NetworkX is one of several possible, whereas all paths are returned by the Postgres CTE in a visually-appealing manner.
While this was a fun and practical exercise,
- recursive CTE;
- how to deal with circular references / loops in CTE;
- applying CTE to trees / paths embedded in PostgreSQL data;
… applied to a simple hierarchical tree structure, for larger data and more complex data structures, it is probably better to apply the right tool – e.g.,
Neo4j for data structures involving highly relational data (where
Cypher queries are ideally applied).
I was wondering about the “Big O” computational complexity of recursive CTE, but was unsure how to determine it.
I posted a question regarding the computational complexity of recursive CTE to dba.StackExchange: What’s the (Big-O) computational complexity of a PostgreSQL recursive common table expression?.
A User there asked “Did you run query analyzer?”, so here is the
EXPLAIN ANALYZE output:
[practice]# \i tree_with_paths.sql You are now connected to database "practice" as user "victoria". DROP TABLE CREATE TABLE INSERT 0 10 QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- CTE Scan on nodes (cost=322.86..356.54 rows=1684 width=32) (actual time=0.027..0.120 rows=13 loops=1) CTE nodes -> Recursive Union (cost=4.18..322.86 rows=1684 width=132) (actual time=0.025..0.110 rows=13 loops=1) -> Bitmap Heap Scan on tree (cost=4.18..12.65 rows=4 width=132) (actual time=0.024..0.024 rows=1 loops=1) Recheck Cond: (s = 'a'::text) Heap Blocks: exact=1 -> Bitmap Index Scan on tree_s_t_key (cost=0.00..4.18 rows=4 width=0) (actual time=0.005..0.005 rows=1 loops=1) Index Cond: (s = 'a'::text) -> Hash Join (cost=1.30..27.65 rows=168 width=132) (actual time=0.009..0.012 rows=2 loops=6) Hash Cond: (t.s = n.tgt) Join Filter: (t.id <> ALL (n.all_parent_ids)) Rows Removed by Join Filter: 0 -> Seq Scan on tree t (cost=0.00..18.50 rows=850 width=68) (actual time=0.001..0.002 rows=10 loops=6) -> Hash (cost=0.80..0.80 rows=40 width=96) (actual time=0.001..0.001 rows=2 loops=6) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> WorkTable Scan on nodes n (cost=0.00..0.80 rows=40 width=96) (actual time=0.000..0.001 rows=2 loops=6) Planning time: 0.376 ms Execution time: 0.170 ms
I ended up answering my own question on dba.StackExchange #209138, What’s the (Big O) computational complexity of a PostgreSQL recursive common table expression?.
Using the excellent article “A Gentle Introduction to Algorithm Complexity Analysis” as a guide (local copy: pdf), I believe that the worst-case complexity of the example in my question is as follows.
Given this recursive CTE (used above):
WITH RECURSIVE nodes(id, src, path, tgt) AS ( -- Anchor member: SELECT id, s, concat(s, '->', t), t, array[id] as all_parent_ids FROM tree WHERE s = 'a' UNION ALL -- Recursive member: SELECT t.id, t.s, concat(path, '->', t), t.t, all_parent_ids||t.id FROM tree t JOIN nodes n ON n.tgt = t.s AND t.id <> ALL (all_parent_ids) ) -- Invocation: SELECT path FROM nodes;
At the Anchor member (query definition), the algorithm selects each row from the table; therefore, at this step the maximum number of iterations (
i) and the maximum size (
n) is the number of rows in table;
i < n, if a starting point within the table is specified.
The Recursive member selects each row from the table, starting from the position specified in the anchor member, so the maximum number of iterations here once again is:
i ≤ n.
So, with the recursive CTE above I believe that the overall complexity is
Prior to asking that question on dba.StackExchange, I did a pretty thorough search for discussions of the time complexity of complex CTE queries in Postgres / PSQL – see my notes appended at the end of this blog post (local copy: pdf).
One of the few online articles discussing the computational complexity of recursive CTE is Graph Algorithms in a Database - Recursive CTEs & Topological Sort with Postgres (local copy: pdf).
However, in the “Warning: Here Be Dragons” subsection I believe that the author improperly conflates discussion of computational complexity regarding the two data structures: RDB vs. graphs: he suddenly moves the discussion from RDBMS and recursive CTE (the central discussion in that article), to graph structures.
My contention with that sub-discussion is as follows. I believe that the Postgres specification of the tree, above, explicitly encodes all the information regarding the entities (“nodes / vertices”), and the relations between them (“paths / edges”). So, traversing those data (once; iteratively) occurs in
In the visualized tree (“graph”), the graph implicitly embeds the graph structure, which is simplified in the visual representation. Referring to the figure below (see also SO#5058406: What is the maximum number of edges in a directed graph with n nodes?, the maximum number of edges (in a directed edge graph) is
n² - n or
E = O(V²) in the cited article]:
While that is true, it’s an “apples vs. oranges” comparison of trees/graphs/paths specified in as RDB vs. a property graph.
As I stated further above, while embedding complex graphs in a RDB (arguably) is not a good idea, given that recursive CTE appear to be an efficient algorithmic approach to determining paths in trees/graphs, that is not one of the reasons.