This post is a follow-on to my answer at SO#44620695, recursive path aggregation and CTE query for top-down tree postgres.

@klin’s excellent answer – there – inspired me to experiment with PostgreSQL, trees (paths), and recursive CTE!   

Preliminary: What is a Recursive Common Table Expression?

PostgreSQL - Recursive CTE
[ Image source: Recursive CTEs Explained  |  local copy (pdf) ]

See also:


Preamble

My motivation is storing data in PostgreSQL, but visualizing those data in a graph. While the approach here has limitations (e.g. undirected edges; …), it may otherwise be useful in other contexts.

Here, I adapted @klin’s code to enable CTE without a dependence on the table id, although I do use those to deal with the issue of circular references (“loops”) in the data, e.g.

a,b
b,a

that throw the CTE into a nonterminating loop.

To solve that, I employed the rather brilliant approach suggested by @a-horse-with-no-name in SO#31739150 – see my comments in the script, below.

[ A similar approach is also suggested here: Finding Circular References in Postgres  |  local copy (pdf). ]


PSQL script (“tree with paths.sql”):

--         File: /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql
-- Adapted from: https://stackoverflow.com/questions/44620695/recursive-path-aggregation-and-cte-query-for-top-down-tree-postgres
--     See also: /mnt/Vancouver/FC/RDB - PostgreSQL/Recursive CTE - Graph Algorithms in a Database Recursive CTEs and Topological Sort with Postgres.pdf
--               https://www.fusionbox.com/blog/detail/graph-algorithms-in-a-database-recursive-ctes-and-topological-sort-with-postgres/620/

-- Run this script in psql, at the psql# prompt:
--    \! cd /mnt/Vancouver/Programming/data/metabolism/practice/sql/
--    \i /mnt/Vancouver/Programming/data/metabolism/practice/sql/tree_with_paths.sql

\c practice

DROP TABLE tree;

CREATE TABLE tree (
  -- id int primary key
  id SERIAL PRIMARY KEY
  ,s TEXT    -- s: source node
  ,t TEXT    -- t: target node
  ,UNIQUE(s, t)
);

INSERT INTO tree(s, t) VALUES
  ('a','b')
  ,('b','a')     -- << adding this 'back relation' breaks CTE_1 below, as it enters a loop and cannot terminate
  ,('b','c')
  ,('b','d')
  ,('c','e')
  ,('d','e')
  ,('e','f')
  ,('f','g')
  ,('g','h')
  ,('c','h');

SELECT * FROM tree;
-- SELECT s,t FROM tree WHERE s='b';

-- RECURSIVE QUERY 1 (CTE_1):
-- WITH RECURSIVE nodes(src, path, tgt) AS (
--     SELECT s, concat(s, '->', t), t FROM tree WHERE s = 'a'
--     -- SELECT s, concat(s, '->', t), t FROM tree WHERE s = 'c'
-- UNION ALL
--     SELECT t.s, concat(path, '->', t), t.t FROM tree t
--     JOIN nodes n ON n.tgt = t.s
-- )
-- -- SELECT * FROM nodes;
-- SELECT path FROM nodes;


-- RECURSIVE QUERY 2 (CTE_2):
-- Deals with "loops" in Postgres data, per
-- https://stackoverflow.com/questions/31739150/to-find-infinite-recursive-loop-in-cte
-- "With Postgres it's quite easy to prevent this by collecting all visited nodes in an array."
-- EXPLAIN ANALYZE WITH RECURSIVE nodes(id, src, path, tgt) AS (
WITH RECURSIVE nodes(id, src, path, tgt) AS (
    SELECT id, s, concat(s, '->', t), t
    ,array[id] as all_parent_ids
     FROM tree WHERE s = 'a'
UNION ALL
    SELECT t.id, t.s, concat(path, '->', t), t.t, all_parent_ids||t.id FROM tree t
    JOIN nodes n ON n.tgt = t.s
    AND t.id <> ALL (all_parent_ids)    -- this is the trick to exclude the endless loops
)
-- SELECT * FROM nodes;
SELECT path FROM nodes;

Script execution / output (PSQL):

# \i tree_with_paths.sql
You are now connected to database "practice" as user "victoria".
DROP TABLE
CREATE TABLE
INSERT 0 10
 id | s | t 
----+---+---
  1 | a | b
  2 | b | a
  3 | b | c
  4 | b | d
  5 | c | e
  6 | d | e
  7 | e | f
  8 | f | g
  9 | g | h
 10 | c | h

        path         
---------------------
 a->b
 a->b->a
 a->b->c
 a->b->d
 a->b->c->e
 a->b->d->e
 a->b->c->h
 a->b->d->e->f
 a->b->c->e->f
 a->b->c->e->f->g
 a->b->d->e->f->g
 a->b->d->e->f->g->h
 a->b->c->e->f->g->h

For reference, here is what those data look like, via NetworkX:

NetworkX visualization

You can change the starting node (e.g. start at node “d”) in the SQL script – giving, e.g.:

# \i tree_with_paths.sql
...
     path      
---------------
 d->e
 d->e->f
 d->e->f->g
 d->e->f->g->h

Network visualization

I exported those data (at the PSQL prompt) to a CSV,

# \copy (SELECT s, t FROM tree) TO '/tmp/tree.csv' WITH CSV
  COPY 9

# \! cat /tmp/tree.csv
  a,b
  b,c
  b,d
  c,e
  d,e
  e,f
  f,g
  g,h
  c,h

… which I visualized (image above) in a Python 3.5 venv:

import networkx as nx
import pylab as plt
G = nx.read_edgelist("/tmp/tree.csv", delimiter=",")
G.nodes()
#   ['b', 'a', 'd', 'f', 'c', 'h', 'g', 'e']
G.edges()
#   [('b', 'a'), ('b', 'd'), ('b', 'c'), ('d', 'e'), ('f', 'g'),
#    j w('f', 'e'), ('c', 'e'), ('c', 'h'), ('h', 'g')]
G.number_of_nodes()
#   8
G.number_of_edges()
#   9
# There is a bug in Python or NetworkX: you may need to run this
# command 2x, as you may get an error the first time:
from networkx.drawing.nx_agraph import graphviz_layout
nx.draw(G, pos=graphviz_layout(G), node_size=1200, node_color='lightblue', \
  linewidths=0.25, font_size=10, font_weight='bold', with_labels=True)
plt.show()
nx.dijkstra_path(G, 'a', 'h')
#   ['a', 'b', 'c', 'h']
nx.dijkstra_path(G, 'a', 'f')
#   ['a', 'b', 'd', 'e', 'f']

Note that the dijkstra_path returned from NetworkX is one of several possible, whereas all paths are returned by the Postgres CTE in a visually-appealing manner.


Conclusions

While this was a fun and practical exercise,

  • recursive CTE;
  • how to deal with circular references / loops in CTE;
  • applying CTE to trees / paths embedded in PostgreSQL data;

… applied to a simple hierarchical tree structure, for larger data and more complex data structures, it is probably better to apply the right tool – e.g., Neo4j for data structures involving highly relational data (where Cypher queries are ideally applied).


Addendum

I was wondering about the “Big O” computational complexity of recursive CTE, but was unsure how to determine it.

I posted a question regarding the computational complexity of recursive CTE to dba.StackExchange: What’s the (Big-O) computational complexity of a PostgreSQL recursive common table expression?.

A User there asked “Did you run query analyzer?”, so here is the EXPLAIN ANALYZE output:

[practice]# \i tree_with_paths.sql 
You are now connected to database "practice" as user "victoria".
DROP TABLE
CREATE TABLE
INSERT 0 10
                                                            QUERY PLAN                                                             
-----------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on nodes  (cost=322.86..356.54 rows=1684 width=32) (actual time=0.027..0.120 rows=13 loops=1)
   CTE nodes
     ->  Recursive Union  (cost=4.18..322.86 rows=1684 width=132) (actual time=0.025..0.110 rows=13 loops=1)
           ->  Bitmap Heap Scan on tree  (cost=4.18..12.65 rows=4 width=132) (actual time=0.024..0.024 rows=1 loops=1)
                 Recheck Cond: (s = 'a'::text)
                 Heap Blocks: exact=1
                 ->  Bitmap Index Scan on tree_s_t_key  (cost=0.00..4.18 rows=4 width=0) (actual time=0.005..0.005 rows=1 loops=1)
                       Index Cond: (s = 'a'::text)
           ->  Hash Join  (cost=1.30..27.65 rows=168 width=132) (actual time=0.009..0.012 rows=2 loops=6)
                 Hash Cond: (t.s = n.tgt)
                 Join Filter: (t.id <> ALL (n.all_parent_ids))
                 Rows Removed by Join Filter: 0
                 ->  Seq Scan on tree t  (cost=0.00..18.50 rows=850 width=68) (actual time=0.001..0.002 rows=10 loops=6)
                 ->  Hash  (cost=0.80..0.80 rows=40 width=96) (actual time=0.001..0.001 rows=2 loops=6)
                       Buckets: 1024  Batches: 1  Memory Usage: 9kB
                       ->  WorkTable Scan on nodes n  (cost=0.00..0.80 rows=40 width=96) (actual time=0.000..0.001 rows=2 loops=6)
 Planning time: 0.376 ms
 Execution time: 0.170 ms

This blog post, Reading a Postgres EXPLAIN ANALYZE Query Plan (local copy: pdf) explains how to interpret that output.


Update (2018-06-17)

I ended up answering my own question on dba.StackExchange #209138, What’s the (Big O) computational complexity of a PostgreSQL recursive common table expression?.

Using the excellent article “A Gentle Introduction to Algorithm Complexity Analysis” as a guide (local copy: pdf), I believe that the worst-case complexity of the example in my question is as follows.

Given this recursive CTE (used above):

WITH RECURSIVE nodes(id, src, path, tgt) AS (
    -- Anchor member:
    SELECT id, s, concat(s, '->', t), t,
    array[id] as all_parent_ids
    FROM tree WHERE s = 'a'
UNION ALL
    -- Recursive member:
    SELECT t.id, t.s, concat(path, '->', t), t.t, all_parent_ids||t.id FROM tree t
    JOIN nodes n ON n.tgt = t.s
    AND t.id <> ALL (all_parent_ids)
)
-- Invocation:
SELECT path FROM nodes;
  1. At the Anchor member (query definition), the algorithm selects each row from the table; therefore, at this step the maximum number of iterations (i) and the maximum size (n) is the number of rows in table; i < n, if a starting point within the table is specified.

  2. The Recursive member selects each row from the table, starting from the position specified in the anchor member, so the maximum number of iterations here once again is: i ≤ n.

So, with the recursive CTE above I believe that the overall complexity is Θ(n).

Prior to asking that question on dba.StackExchange, I did a pretty thorough search for discussions of the time complexity of complex CTE queries in Postgres / PSQL – see my notes appended at the end of this blog post (local copy: pdf).

One of the few online articles discussing the computational complexity of recursive CTE is Graph Algorithms in a Database - Recursive CTEs & Topological Sort with Postgres (local copy: pdf).

However, in the “Warning: Here Be Dragons” subsection I believe that the author improperly conflates discussion of computational complexity regarding the two data structures: RDB vs. graphs: he suddenly moves the discussion from RDBMS and recursive CTE (the central discussion in that article), to graph structures.

My contention with that sub-discussion is as follows. I believe that the Postgres specification of the tree, above, explicitly encodes all the information regarding the entities (“nodes / vertices”), and the relations between them (“paths / edges”). So, traversing those data (once; iteratively) occurs in Θ(n) time.

In the visualized tree (“graph”), the graph implicitly embeds the graph structure, which is simplified in the visual representation. Referring to the figure below (see also SO#5058406: What is the maximum number of edges in a directed graph with n nodes?, the maximum number of edges (in a directed edge graph) is (n)(n-1); i.e. n² - n or Θ(n²) [E = O(V²) in the cited article]:

graphs_max_edges

While that is true, it’s an “apples vs. oranges” comparison of trees/graphs/paths specified in as RDB vs. a property graph.

As I stated further above, while embedding complex graphs in a RDB (arguably) is not a good idea, given that recursive CTE appear to be an efficient algorithmic approach to determining paths in trees/graphs, that is not one of the reasons.