KEGG pathways representation I: NetworkX

Source: Persagen.com
Author: Dr. Victoria A. Stuart, Ph.D.
Created: 2019-07-30
Summary: Visualization of data in Python using NetworkX
Related: KEGG Pathways Representation II: Cytoscape

Background

Hello!

I constantly search for satisfactory representations and renderings of relational data; for example, representing metabolic pathways as relational property graphs.

A little over a year ago (2018-04) I posted a description of my efforts to render the KEGG glycolytic and TCA cycle (Krebs cycle) pathways in Neo4j - see my 2018-04 research blog post, Creating A Metabolic Pathway In Neo4j, and my accompanying StackOverflow post.

However, while Neo4j offers an excellent, mature platform (including the Cypher graph query language), from my perspective there are limitations:

  • Neo4j is a proprietary (albeit open-sourced, on GitHub) platform;
  • the recent flurry of additions and extensions to Neo4j adds functionality but, where not relevant to my interests and needs, also adds unneeded complexity;
  • other issues: e.g., use of those data outside Neo4j; ...

    Hence, around that time I was becoming much less enthused about Neo4j, which I increasingly regarded as "bloatware."

    With PostgreSQL serving as a well-supported and highly functional RDBMS, I sought my own graph network visualization and analysis solutions that allow facile programmatic access to relational datastores and are amenable to downstream machine learning (ML) and natural language processing (NLP) applications. Additionally, I'd like to be able to access multidimensional data (e.g. tensor representations).

    Applications include knowledge graph construction, in silico modeling, metabolic flux balance analysis, etc.


    Holoviews / Bokeh

    In parallel to my NetworkX experiments (summarized in the following subsection), I briefly looked at SageMath directed graphs, and spent perhaps a week looking at PyViz/Holoviews.

    While Holoviews offers decent matplotlib graphs (also used by NetworkX, below), I was favorably impressed with the slick browser-based visualizations and interfaces provided by Bokeh, which permit mouseover displays of node and edge attributes, etc. - well summarized and illustrated in my 2018-11 blog post, "Interactive Data Visualization in Python With Bokeh."

    However, a deal-breaker for me once again was Bokeh's inability to natively show labeled nodes and edges in its HTML representations, where there appears to be a reliance on graph legends, with no permanent (non-mouseover) node/edge labels. The lack of permanent (displayed) labels on nodes/edges remains [2019-06] an acknowledged issue:

  • [StackOverflow] How To Add Permanent Name Labels (Not Interactive Ones) On Nodes For A Networkx Graph In Bokeh?
  • [StackOverflow] How To Add Edge Labels (Interactive Or Permanent Ones) For Networkx Graph In Bokeh?
  • [GitHub "open" issue] Permanent Labels On Networkx Graph

    While it appears that you can use Bokeh Labels to label nodes and edges, this seems like an unwieldy workaround. Likewise, while this example (Alaska Airline Routes) may prove me wrong, that figure appears to be a matplotlib graph [hv.extension('matplotlib')]: as I recall, my issue was the lack of labeled nodes and edges in the HTML plots (related GitHub issue).

    Lastly, while I found the Holoviews / Bokeh communities to be moderately active, with a reasonable level of available documentation, frustratingly the code in their examples is generally insufficient to replicate their results.


    NetworkX

    Having examined other options and discovered their limitations, I was pleased to find that NetworkX offered several attractive attributes:

  • Pythonic access
  • matplotlib graphics, including multidigraphs
  • graph analytics
  • flexible addition of node and edge attributes
  • excellent documentation and community support
  • active and well-maintained GitHub repository

    I recently (Jul 2019) spent a couple of weeks thoroughly investigating the NetworkX platform for my research needs.

    While I was pleased, overall, with my programming and modeling in NetworkX, there were again some limitations. Most significantly, NetworkX renders graphs through the construction of Python dictionaries, {(src, tgt): rel}, where the keys are (source node, target node) pairs and the values are the edges (relations). (Note that DICT data structures have unique keys!)

    Thus, if your data contain "duplicate" entries (e.g. node-rel-node triples that appear more than once), whether

  • appearing as identical reactions at different places in the pathways, or
  • differentiated only by the associated attribute data (not the node and edge labels),

    then, while the underlying data remain unperturbed, NetworkX silently drops what it infers to be "duplicate" relations when constructing the graphs, because of the constraint that DICT keys [the (src, tgt) pairs] must be unique.

    Unaddressed, this results in graphs that do not faithfully and accurately represent the underlying data.

    That issue was encountered in my first script (below).
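
    The silent loss of "duplicate" relations can be reproduced with a few lines. This is a minimal sketch on a hypothetical toy edge list (not my pathway data, and not code taken from either script); it assumes a plain nx.DiGraph, where a repeated (src, tgt) pair overwrites the attributes of the existing edge rather than adding a second one.

```python
import networkx as nx

# Hypothetical toy edge list (illustration only): the same (source, target)
# pair occurs twice, differing only in the attached attribute data.
edges = [
    ("glucose", "glucose-6-phosphate", {"enzyme": "hexokinase"}),
    ("glucose", "glucose-6-phosphate", {"enzyme": "glucokinase"}),
    ("glucose-6-phosphate", "fructose-6-phosphate", {"enzyme": "glucose-6-phosphate isomerase"}),
]

G = nx.DiGraph()
for src, tgt, attrs in edges:
    # A second add_edge() on an existing (src, tgt) pair updates that edge's
    # attribute dict in place; no warning or error is raised.
    G.add_edge(src, tgt, **attrs)

print(len(edges), "input rows ->", G.number_of_edges(), "edges in the graph")
print(G["glucose"]["glucose-6-phosphate"])

# An edge-label dict keyed by (src, tgt) collapses in the same way,
# because dict keys are unique: the later row replaces the earlier one.
labels = {(src, tgt): attrs["enzyme"] for src, tgt, attrs in edges}
print(labels)
```

    Three input rows yield two edges, and only the attributes of the later row survive on the repeated pair.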

    To remedy that issue, I needed to return to the approach described in my 2018-04 StackOverflow post; namely, the use of "tags" to uniquely identify every node and edge in a graph. This issue / approach is illustrated in these pages from my programming notebook:

    While the first script rather easily represents graphs as an edge adjacency framework (where the edges define the graph),

    the second script required that the relations themselves be considered as nodes, so that I could unambiguously specify each node-relation-node relationship:

    Although this provides a robust solution, I needed to restructure my graph input data, and I lost the facile embedding of edge attributes used in the first script.
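
    The relations-as-nodes idea can be sketched as follows. This is an illustration under stated assumptions, not the code from my second script: the reaction rows are a hypothetical toy example, and the "rxnN:enzyme" tag format is invented here purely to give each relation a unique identifier.

```python
import networkx as nx

# Hypothetical (substrate, enzyme, product) rows; both rows connect the
# same pair of compounds, so an edge-keyed graph would merge them.
reactions = [
    ("glucose", "hexokinase", "glucose-6-phosphate"),
    ("glucose", "glucokinase", "glucose-6-phosphate"),
]

G = nx.DiGraph()
for i, (substrate, enzyme, product) in enumerate(reactions, start=1):
    tag = f"rxn{i}:{enzyme}"  # assumed tag scheme: unique per relation
    G.add_node(substrate, kind="compound")
    G.add_node(product, kind="compound")
    G.add_node(tag, kind="enzyme", label=enzyme)
    G.add_edge(substrate, tag)  # compound -> relation node
    G.add_edge(tag, product)    # relation node -> compound

relation_nodes = [n for n, d in G.nodes(data=True) if d["kind"] == "enzyme"]
print(relation_nodes)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```

    Both relations survive because each one now has its own node; the cost is that the enzyme data live on nodes, so the edges themselves carry no labels or attributes.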

  • In the first script, where nodes represent KEGG compounds (i.e., biochemical metabolites) and edges represent enzymes, it is easy to separately add node and edge attributes - e.g. from Pandas dataframes.

  • In the second script, KEGG compounds and enzymes are both represented as nodes; hence we lose the ability to label edges and add edge-centric attributes (in a facile manner), and the source data preparation is (in my opinion) somewhat more convoluted.
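
    As a rough illustration of the first approach (compounds as nodes, enzymes as edges), node and edge attributes can be loaded from two separate Pandas dataframes. The tables and column names below are hypothetical, chosen only for this sketch; nx.from_pandas_edgelist and nx.set_node_attributes are the NetworkX helpers assumed here, and this is not the code from my script.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy tables; the column names are illustrative assumptions.
edges_df = pd.DataFrame({
    "source": ["glucose", "glucose-6-phosphate"],
    "target": ["glucose-6-phosphate", "fructose-6-phosphate"],
    "enzyme": ["hexokinase", "glucose-6-phosphate isomerase"],
    "ec": ["2.7.1.1", "5.3.1.9"],
})
nodes_df = pd.DataFrame({
    "compound": ["glucose", "glucose-6-phosphate", "fructose-6-phosphate"],
    "formula": ["C6H12O6", "C6H13O9P", "C6H13O9P"],
})

# Build the edges from one dataframe; edge_attr=True attaches every
# remaining column ("enzyme", "ec") as an edge attribute.
G = nx.from_pandas_edgelist(
    edges_df, source="source", target="target",
    edge_attr=True, create_using=nx.DiGraph,
)

# Attach node attributes from the second dataframe, passed as
# {node: {attribute: value}}.
nx.set_node_attributes(G, nodes_df.set_index("compound").to_dict("index"))

print(G.edges["glucose", "glucose-6-phosphate"]["enzyme"])
print(G.nodes["glucose"]["formula"])
```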


    Noting those observations, here are my two scripts, to which the Reader is referred for details. The code is fully commented, with embedded sample outputs.

  • Script 1: networkx_practice.py

  • Script 2: networkx_practice_2.py


    Script notes:

  • Code is in Python 3 (I run these in a Python 3.7 venv).

  • If you want to run these, you'll need to edit paths in those scripts (search for "Vancouver"). If I forgot to include a datafile, simply email me (info@Persagen.com).

  • Code is in a linear format, as I was just testing and evaluating; for production, you can wrap code sections into functions (def name() ...) and/or methods. Refer here for ideas.

  • I program in Vim (Neovim) in a widescreen terminal with textwidth=220. If you view the code wrapped to shorter lines, it's going to look pretty messy - YMMV.

  • While the actual code is reasonably compact, the scripts are thoroughly commented - mostly so that if / when I return to them, I can easily understand and follow what I was thinking and doing.


    Sample output (plots)

    From Script 1:

  • 2019.07.29-glycolysis+tca.pdf

    Note the multiple attributes on some of the edges, with which I addressed the "DICT" limitation of NetworkX.

    Although cluttered due to the spring layout, the second plot from that script illustrates the use / display of edge labels:

    From Script 2:

  • pdf: 2019.07.29-glycolysis+tca.pdf


What's next?

    Like Holoviews / Bokeh, NetworkX was both promising and frustrating. While sorting through the issues above, I began searching for additional solutions. For my research purposes, two stand out.

    GNU R

    The R programming language (with which I am acquainted) offers superb utilities for working with genomic data, e.g. via the Bioconductor project - which in turn includes the KEGGlincs package for explicitly recreating KEGG pathway maps and overlaying NIH LINCS transcriptional data.

    KEGGlincs can be used with Cytoscape, to visualize the graph (Cytoscape must be running, for the CyREST interface layer interaction):

    Pretty cool!


    Cytoscape

    Continued in my follow-on post, KEGG Pathways Representation II: Cytoscape


    Enjoy!

