Here, I use cancer as a paradigm for disease in general. Cancer is an enormously complex disorder, variations of which affect all aspects of cell biology. Hence, cancer serves as an excellent computational and biological model.

Carcinogenesis is a complex, multifactorial process that, despite decades of intensive research, continues to profoundly challenge our understanding of the origin, treatment and prevention of cancer and related morbidities. One approach to an improved understanding of cancer involves integrating knowledge from all areas of cell biology, including ageing, biochemistry and metabolism, molecular biology and molecular genetics/genomics, epigenetics, immunology, cellular signaling, pharmacology, etc.



At the molecular level, growing evidence suggests that cancer can be better understood through mutated or dysregulated pathways and networks rather than individual mutations; e.g.:

A June 2017 paper by Jonathan Pritchard and colleagues at Stanford University, An Expanded View of Complex Traits: From Polygenic to Omnigenic [discussion here and here; see also Theory Suggests That All Genes Affect Every Complex Trait; slides], concluded that complex (polygenic) traits may involve many more genes than previously suspected. In some cases – e.g. height, Crohn’s disease, rheumatoid arthritis, and schizophrenia – it appeared that almost the entire genome could be involved in some way. On this basis Pritchard and colleagues proposed that gene regulatory networks are sufficiently interconnected that all genes expressed in disease-relevant cells are liable to affect the functions of core disease-related genes, and that most heritability can be explained by effects on genes outside core pathways. This omnigenic model of complex traits holds that, in cell types relevant to a disease, essentially all genes contribute to the condition.



Although it does not cite Pritchard’s June 2017 paper, an August 2018 Nature Genetics paper, Gene Discovery and Polygenic Prediction from a Genome-Wide Association Study of Educational Attainment in 1.1 Million Individuals, likewise found that nearly 1,300 genetic variants were associated with educational attainment – providing some support for an omnigenic model.

While cancer was not mentioned in Pritchard’s omnigenics paper, an earlier publication by different authors – The Mini-driver Model of Polygenic Cancer Evolution (Nov 2015) – suggested that many mutations found in cancer might have relatively weak tumour-promoting effects (thus acting as “mini-driver” mutations), and that many cancers may follow (at least partly) a polygenic model. A recent meta-analysis of prostate cancer, Fine-Mapping of Prostate Cancer Susceptibility Loci in a Large Meta-analysis Identifies Candidate Causal Variants (Jun 2018), found evidence for multiple independent signals for cancer risk at 12 regions, and 99 risk signals overall. Thus, it seems likely that the omnigenic model will be relevant to cancer, as well.

The omnigenic model complicates the view that cancer may be easily explained by network- and pathway-based models: the omnigenic model predicts that virtually any variant with regulatory effects in a given tissue is likely to have (weak) effects on all diseases that are modulated through that tissue (“network pleiotropy”). A single variant may affect multiple traits because those traits are mediated through the same cell type(s) and hence regulated through the same network(s), and not because the traits are directly causally related [correlation does not imply causation]. Traits that share core genes or whose genes are close in the network will tend to have correlated effects. Conversely, traits that are mediated through the same tissue but have no overlap of core genes may show little or no correlation in effects even though many causal variants are shared.
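
The last two sentences above can be illustrated with a toy simulation. This is only a sketch under strong assumptions that are not taken from Pritchard's paper: every variant has a small, independent, normally distributed effect on every gene, and a trait's response to a variant is simply the sum of that variant's effects on the trait's core genes. The function names, gene indices, and sample sizes are all hypothetical.

```python
import math
import random


def simulate_variant_to_gene_effects(n_variants, n_genes, seed=0):
    """Toy dense network: each variant weakly perturbs every gene (assumption)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            for _ in range(n_variants)]


def trait_effects(variant_gene_effects, core_genes):
    """Effect of each variant on a trait = sum of its effects on the core genes."""
    return [sum(row[g] for g in core_genes) for row in variant_gene_effects]


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


effects = simulate_variant_to_gene_effects(n_variants=2000, n_genes=20, seed=1)

# Traits A and B share four of their five (hypothetical) core genes.
trait_a = trait_effects(effects, core_genes=[0, 1, 2, 3, 4])
trait_b = trait_effects(effects, core_genes=[1, 2, 3, 4, 5])
# Trait C acts through the same "tissue" (same variants) but disjoint core genes.
trait_c = trait_effects(effects, core_genes=[10, 11, 12, 13, 14])

r_shared = pearson(trait_a, trait_b)
r_disjoint = pearson(trait_a, trait_c)
print(f"shared core genes:   r = {r_shared:.2f}")
print(f"disjoint core genes: r = {r_disjoint:.2f}")
```

In this toy setting every variant has a nonzero effect on all three traits, yet the variant effects are strongly correlated only for the two traits whose core genes overlap; for the disjoint pair the correlation is close to zero. Real regulatory networks are sparse and structured, so this only illustrates the logic of the argument and is not a model of any actual tissue.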

Regardless of the genes and mechanisms involved, it is well established that carcinogenesis is a multistep, temporal process with pleiotropic effects. Knowledge graphs – easily capable of encapsulating this highly heterogeneous, highly relational data – are ideally suited to the study of carcinogenesis, as are machine learning models applied to those knowledge graphs. [Interestingly, Jonathan Pritchard is shown here whiteboarding what appears to be a temporal network.]
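
As a rough illustration of how a knowledge graph can hold heterogeneous, relational, and temporally ordered facts, here is a minimal sketch in plain Python. The entity names (`variant_1`, `gene_A`, `pathway_X`, `phenotype_P`), the predicates, and the `step` field are hypothetical placeholders, not real biological data, and a production system would more likely use a dedicated graph database or RDF store.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str
    step: int  # hypothetical ordinal position in a multistep process


class KnowledgeGraph:
    """A tiny in-memory store of (subject, predicate, object, step) facts."""

    def __init__(self):
        self._triples = []

    def add(self, subject, predicate, obj, step=0):
        self._triples.append(Triple(subject, predicate, obj, step))

    def objects(self, subject, predicate):
        """All objects linked to `subject` by `predicate`, in insertion order."""
        return [t.obj for t in self._triples
                if t.subject == subject and t.predicate == predicate]

    def timeline(self):
        """Facts ordered by step; ties keep insertion order (sorted is stable)."""
        return sorted(self._triples, key=lambda t: t.step)


kg = KnowledgeGraph()
kg.add("variant_1", "dysregulates", "gene_A", step=1)
kg.add("gene_A", "member_of", "pathway_X", step=1)
kg.add("variant_2", "dysregulates", "gene_B", step=2)
kg.add("gene_B", "member_of", "pathway_X", step=2)
kg.add("pathway_X", "contributes_to", "phenotype_P", step=3)

for fact in kg.timeline():
    print(fact.step, fact.subject, fact.predicate, fact.obj)
```

Storing the ordering on each fact is one simple design choice among several; it keeps different entity types (variants, genes, pathways, phenotypes) in one structure while still allowing the multistep sequence to be recovered.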

Aside / Update [2020-02-06].

The complexity and long-term temporal development of various cancers are characterized in the following cancer genomes papers.

  • Gerstung M et al. [PCAWG Consortium] (2020-02-06) “The Evolutionary History of 2,658 Cancers.” Nature. 578: 122–128. DOI:

    “Cancer develops through a process of somatic evolution. Sequencing data from a single biopsy represent a snapshot of this process that can reveal the timing of specific genomic aberrations and the changing influence of mutational processes. Here, by whole-genome sequencing analysis of 2,658 cancers as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), we reconstruct the life history and evolution of mutational processes and driver mutation sequences of 38 types of cancer.

    “Early oncogenesis is characterized by mutations in a constrained set of driver genes, and specific copy number gains, such as trisomy 7 in glioblastoma and isochromosome 17q in medulloblastoma. The mutational spectrum changes significantly throughout tumour evolution in 40% of samples. A nearly fourfold diversification of driver genes and increased genomic instability are features of later stages. Copy number alterations often occur in mitotic crises, and lead to simultaneous gains of chromosomal segments.

    “Timing analyses suggest that driver mutations often precede diagnosis by many years, if not decades. Together, these results determine the evolutionary trajectories of cancer, and highlight opportunities for early cancer detection.”

    • The Pan-Cancer Project is a collaboration involving more than 1300 scientists and clinicians from 37 countries. It involved analysis of more than 2600 genomes of 38 different tumour types, creating a huge resource of primary cancer genomes. This was the starting point for 16 working groups to study multiple aspects of cancer development, causation, progression, and classification.

    • YouTube: The Evolutionary History of Cancer

    • [media: 2020-02-06] Cancer mutations occur decades before diagnosis

  • Campbell PJ et al. [The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium] (2020-02-05) “Pan-Cancer Analysis of Whole Genomes.” Nature. 578: 82-93. DOI:

    “Cancer is driven by genetic change, and the advent of massively parallel sequencing has enabled systematic documentation of this variation at the whole-genome scale. Here we report the integrative analysis of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). We describe the generation of the PCAWG resource, facilitated by international data sharing using compute clouds.

    “On average, cancer genomes contained 4-5 driver mutations when combining coding and non-coding genomic elements; however, in around 5% of cases no drivers were identified, suggesting that cancer driver discovery is not yet complete. Chromothripsis, in which many clustered structural variants arise in a single catastrophic event, is frequently an early event in tumour evolution; in acral melanoma, for example, these events precede most somatic point mutations and affect several cancer-associated genes simultaneously.

    “Cancers with abnormal telomere maintenance often originate from tissues with low replicative activity and show several mechanisms of preventing telomere attrition to critical levels. Common and rare germline variants affect patterns of somatic mutation, including point mutations, structural variants and somatic retrotransposition.

    “A collection of papers from the PCAWG Consortium describes non-coding mutations that drive cancer beyond those in the TERT promoter; identifies new signatures of mutational processes that cause base substitutions, small insertions and deletions and structural variation; analyses timings and patterns of tumour evolution; describes the diverse transcriptional consequences of somatic mutation on splicing, expression levels, fusion genes and promoter activity; and evaluates a range of more-specialized features of cancer genomes.”

While The Complex Underpinnings of Genetic Background Effects (Sep 2018) does not mention “omnigenics” or cite Pritchard’s work (Jun 2017), it is clearly relevant to that earlier work.

“Genetic interactions between mutations and standing polymorphisms can cause mutations to show distinct phenotypic effects in different individuals. To characterize the genetic architecture of these so-called background effects, we genotype 1411 wild-type and mutant yeast cross progeny and measure their growth in 10 environments.

“Using these data, we map 1086 interactions between segregating loci and 7 different gene knockouts. Each knockout exhibits between 73 and 543 interactions, with 89% of all interactions involving higher-order epistasis between a knockout and multiple loci. Identified loci interact with as few as one knockout and as many as all seven knockouts. In mutants, loci interacting with fewer and more knockouts tend to show enhanced and reduced phenotypic effects, respectively. Cross-environment analysis reveals that most interactions between the knockouts and segregating loci also involve the environment. These results illustrate the complicated interactions between mutations, standing polymorphisms, and the environment that cause background effects.”

For an excellent review of genetic interactions, see Global Genetic Networks and the Genotype-to-Phenotype Relationship (Cell: Mar 2019). Again, despite alluding to similar genome-wide causal effects on phenotypes, this paper does not mention Pritchard’s omnigenics paper. From p. 96 (pdf p. 12; see the paper for citations): “Despite statistical evidence linking a remarkable number of candidate variants to a given disease, locus association alone is not predictive of disease risk and there remains a substantial disparity between the disease risk explained by the genetic loci discovered by GWAS and the estimated total heritable disease risk based on familial aggregation. Several reasons have been proposed to explain this so-called ‘missing heritability,’ including the existence of a large number of modifier loci, each having a relatively small effect or rare variants that cannot be easily detected using traditional approaches.”

My published work, Construction and Application of a Protein and Genetic Interaction Network (Yeast Interactome) (Mar 2009), was overlooked by Pritchard in his omnigenic paper (it is buried in the older biomedical literature, and a follow-up email to Dr. Pritchard went unanswered). Nevertheless, a decade ago I noted pleiotropic interactions among (yeast) transcription factors, and hence their potential effects on global cellular regulation.

In my Abstract, I stated:

    "We also report and briefly describe the complex associations among transcription factors that result in the regulation of thousands of genes through coordinated changes in expression of dozens of transcription factors. These cells are thus able to sensitively regulate cellular metabolism in response to changes in genetic or environmental conditions through relatively small changes in the expression of large numbers of genes, affecting the entire yeast metabolome."

In my paper, I later stated:

    "Analyses of these data additionally revealed that transcription factors and their target genes form highly complex, interconnected networks affecting all aspects of cellular metabolism in S. cerevisiae."


    "... the 168 transcription factors downloaded from YEASTRACT interact among one another in an extraordinarily complex way while directly regulating the expression of at least 5902 target genes (our yeast interactome: data not shown)."

I’ve uploaded the list of transcription factors and a PDF copy of the interactome showing the interactions among those transcription factors. Here is some of my personal correspondence with my coauthors at that time regarding those observations:

From: “Stuart, Greg (NIH/NIEHS) [V]” <stuart@…>
Date: Thu, 19 Apr 2007 16:33:38 -0400
To: “Copeland, Bill (NIH/NIEHS) [E]” <copelan1@…>, “_Mimi (Mimi at ARO” <micheline.strand@…>
Subject: Thoughts on the focus of the pos5 manuscript … [Thursday April 19, 2007]

Hi Bill & Mimi: Here is an update regarding the microarray analyses …

In the update that I emailed on Sunday, I mentioned that as an exercise I had tried “mapping” the glycolysis pathway in Cytoscape. I extended this, adding the pentose phosphate pathway (PPP), then adding the associated transcription factors. The results really drive home the fact that looking at simple groups of genes in Cytoscape is not terribly informative, but that adding the “first neighbors” or associated transcription factors immediately results in too much complexity, whatever you are examining. For example, the glycolysis pathway consists of 14 genes; including the first neighbors results in 225 nodes selected, and using Cytoscape and the list of (146 documented) transcription factors from the YEASTRACT database, there are at least 33 transcription factors associated with these 14 glycolysis genes. I’ve illustrated some of these interactions in the attached PDF file.

Rather than trying to “reduce the complexity” in the microarray expression datasets, we should adopt a paradigm shift, “embracing” the complexity that is observed as a main focus of the paper. This complexity is beautifully illustrated by mapping only the 146 documented S. cerevisiae transcription factors from YEASTRACT in Cytoscape, which shows that these transcription factors are highly interconnected with one another. As far as I know, the pleiotropic responses that we observe from disruption of single, non-essential genes (pos5; sod1; sod2) have really not been addressed previously, and/or certainly have not been illustrated as clearly and comprehensively as we can show using bioinformatic tools such as Cytoscape.

This approach would allow us to shift away from listing large numbers of genes and trying to explain why they are “related,” to summarizing the broad changes observed (e.g. GoMiner), with specific groups of genes identified through various bioinformatic tools (e.g. Cytoscape). More importantly, this approach allows us to illustrate the intersection of multiple processes, pathways, and the complexity of cellular metabolism, and also the ability of cells to quickly (indicated by the short time-frame of microarray treatments) and sensitively (single gene knockouts; change in carbon-source; application of external stressors e.g. H2O2) change their gene expression/metabolism to adapt to internal and external changes in the cellular milieu!

It is important, as Ben Van Houten suggests, to identify the top-scoring (most influential) genes (i.e. transcription factors) from the jActiveModules analyses for each microarray experiment. However, given the difficulty of displaying these genes in isolation in Cytoscape (let alone illustrating what else they interact with), I decided to simply list them with their expression values (colored according to level of expression if significantly differentially expressed) in an Excel table (also attached – please refer to the second worksheet in the Excel file). Interestingly, this table shows us that even though the overall expression profiles are very similar among certain experiments [(A ~ G ~ H); (C ~ D ~ E)] as shown by GeneSpring gene trees (“heat maps”) and the GoMiner summary tables, the transcription factors – which drive the expression profiles observed – differ among the 8 experiments (A – H)!

To try to identify these differences, I am planning to take the transcription factors from the jActiveModules analyses (in each experiment), identify what they link to (regulate) in Cytoscape, then filter these genes by expression level to reduce the complexity. I have done this for Experiment A (attached PDF). Once again illustrating the complexity of the data, when the first neighbors of the 23 Experiment A jAM10-identified transcription factors are selected, this results in a total of 2443 genes (not shown)! To reduce this complexity, I applied a “node filter” in Cytoscape, and found that if I selected nodes from among these 2443 that were more than 1.5-fold up-regulated (48 genes) or more than 2.5-fold down-regulated (56 genes), I would get a manageable list of genes (23 Experiment A transcription factors + 48 genes >1.5-fold up-regulated + 56 genes <-2.5-fold down-regulated = 127 genes total) that could be illustrated in Cytoscape (see the attached PDF). This has the benefit of including “important” genes (identified by jActiveModules) but focusing on those with the greatest changes in expression. By examining the resulting Cytoscape map, we can see (for example) that the Upc2p, Ino2p/Ino4p and Rox1p transcription factors regulate the changes in expression of most of the hypoxic and cell wall mannoproteins … I think that this is a far better and more focused application of Cytoscape in regard to organizing/understanding this data, particularly from a regulatory standpoint.
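
The selection-and-filter procedure described in the email above (seed transcription factors, expand to first neighbors, then keep only strongly up- or down-regulated genes) can be sketched outside Cytoscape in a few lines of Python. This is a minimal sketch, not the original analysis. The node names, edges, and fold-change values are invented toy data, and it assumes a signed fold-change convention in which down-regulation is negative, matching the ">1.5" and "<-2.5" thresholds quoted in the email.

```python
def first_neighbors(edges, seeds):
    """Return the seed nodes plus every node directly connected to a seed."""
    seeds = set(seeds)
    selected = set(seeds)
    for a, b in edges:
        if a in seeds:
            selected.add(b)
        if b in seeds:
            selected.add(a)
    return selected


def filter_by_fold_change(nodes, fold_change, up=1.5, down=-2.5):
    """Keep nodes whose signed fold change is above `up` or below `down`."""
    kept = set()
    for node in nodes:
        value = fold_change.get(node)
        if value is not None and (value > up or value < down):
            kept.add(node)
    return kept


# Hypothetical toy network: two transcription factors and four target genes.
edges = [("TF1", "g1"), ("TF1", "g2"), ("TF2", "g3"), ("g3", "g4"), ("TF2", "g2")]
fold_change = {"g1": 2.0, "g2": -1.2, "g3": -3.0, "g4": 5.0}
seeds = {"TF1", "TF2"}

neighborhood = first_neighbors(edges, seeds)
filtered_genes = filter_by_fold_change(neighborhood - seeds, fold_change)
final_nodes = seeds | filtered_genes
print(sorted(final_nodes))
```

Note that `g4` is excluded despite its large fold change because it is two steps away from any seed; only first neighbors are considered, as in the email. The seed transcription factors are always retained regardless of their own expression values, mirroring the "23 transcription factors + filtered genes" tally.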

From: “Stuart, Greg (NIH/NIEHS) [V]” <stuart@…>
To: “Stuart, Greg (NIH/NIEHS) [V]” <stuart@…>, <gstuart1@…>
Subject: pos5 microarray
Date: Fri, 04 May 2007 16:22:58 -0400

Transcription factors regulate cellular gene expression and hence cellular metabolism and biochemistry. In the yeast S. cerevisiae, and likely most other organisms, transcription factors form a highly interconnected, self-regulated network that forms a scaffold upon which cellular metabolism is regulated with extreme sensitivity.