### Technical Review

Biomedical Applications of Machine Learning

### Contents

While machine learning may be applied to all subdisciplines within the life sciences (biology, biomedicine, biotechnology; genetics, genomics, medicine, molecular genetics/genomics, pathology, pharmacology, proteomics; etc.), I focus here primarily on recent progress in the molecular genetics/genomics, biomedical, and clinical domains.

# BIOMEDICAL APPLICATIONS OF MACHINE LEARNING

## Reviews

Several recent reviews provide an excellent introduction and overview of many of the topics discussed in this technical review.

• “Network-based analytics plays an increasingly important role in precision oncology. Growing evidence in recent studies suggests that cancer can be better understood through mutated or dysregulated pathways or networks rather than individual mutations and that the efficacy of repositioned drugs can be inferred from disease modules in molecular networks.

• This article reviews network-based machine learning and graph theory algorithms for integrative analysis of personal genomic data and biomedical knowledge bases to identify tumor-specific molecular mechanisms, candidate targets and repositioned drugs for personalized treatment.

• The review focuses on the algorithmic design and mathematical formulation of these methods to facilitate applications and implementations of network-based analysis in the practice of precision oncology. We review the methods applied in three scenarios to integrate genomic data and network models in different analysis pipelines, and we examine three categories of network-based approaches for repositioning drugs in drug-disease-gene networks.

• In addition, we perform a comprehensive subnetwork/pathway analysis of mutations in 31 cancer genome projects in the Cancer Genome Atlas and present a detailed case study on ovarian cancer.

• Finally, we discuss interesting observations, potential pitfalls and future directions in network-based precision oncology.”


• Opportunities and Obstacles for Deep Learning in Biology and Medicine (Apr 2018; updated version) “… We examine applications of deep learning to a variety of biomedical problems - patient classification, fundamental biological processes and treatment of patients – and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. …”

• Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities (July 2018) [review article] describes the principles of data integration, discusses current methods and available implementations, provides examples of successful data integration in biology and medicine, and discusses current challenges in biomedical integrative methods along with the authors' perspective on the future development of the field.

• “New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. …“


• NeurIPS 2018 featured a Workshop, ML4H: Machine Learning for Health (Dec 2018), that provides an overview of recent research in this domain (arXiv: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018).

• A Primer on Deep Learning in Genomics (Nov 2018) “Deep learning methods are a class of machine learning techniques capable of identifying highly complex patterns in large datasets. Here, we provide a perspective and primer on deep learning applications for genome analysis. We discuss successful applications in the fields of regulatory genomics, variant calling and pathogenicity scores. We include general guidance for how to effectively use deep learning methods as well as a practical guide to tools and resources. This primer is accompanied by an interactive online tutorial.”
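The starting point in such primers is usually the input representation: DNA sequences are one-hot encoded before being fed to a network. A minimal numpy sketch (the A/C/G/T column ordering here is an assumption, not prescribed by any particular tool):

```python
import numpy as np

# Map each base to a one-hot row vector; columns follow an
# assumed A, C, G, T ordering.
BASES = "ACGT"

def one_hot_encode(seq):
    """Encode a DNA string as a (len(seq), 4) one-hot matrix."""
    idx = {b: i for i, b in enumerate(BASES)}
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        mat[pos, idx[base]] = 1.0
    return mat

encoded = one_hot_encode("GATTACA")  # shape (7, 4), one 1 per row
```

A matrix like this (or a batch of them) is the typical input to the convolutional models discussed in these reviews.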


• Deep Learning for Genomics: A Concise Overview (May 2018) “… we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep architecture, and remark on practical considerations of developing modern deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research, as well as pointing out potential opportunities and obstacles for future genomics applications.”

• Artificial Intelligence Used in Genome Analysis Studies (Apr 2018) “Next Generation Sequencing (NGS) or deep sequencing technology enables parallel reading of multiple individual DNA fragments, thereby enabling the identification of millions of base pairs in several hours. Recent research has clearly shown that machine learning technologies can efficiently analyse large sets of genomic data and help to identify novel gene functions and regulation regions. …”

• Deep Learning in Bioinformatics (Sep 2017) “… we review deep learning in bioinformatics, presenting examples of current research. To provide a useful and comprehensive perspective, we categorize research both by the bioinformatics domain (i.e. omics, biomedical imaging, biomedical signal processing) and deep learning architecture (i.e. deep neural networks, convolutional neural networks, recurrent neural networks, emergent architectures) and present brief descriptions of each study. Additionally, we discuss theoretical and practical issues of deep learning in bioinformatics and suggest future research directions. …”


• Machine Learning in Genomic Medicine: A Review of Computational Problems and Data Sets (Jan 2016) “… we provide an introduction to machine learning tasks that address important problems in genomic medicine. One of the goals of genomic medicine is to determine how variations in the DNA of individuals can affect the risk of different diseases, and to find causal explanations so that targeted therapies can be designed. Here we focus on how machine learning can help to model the relationship between DNA and the quantities of key molecules in the cell, with the premise that these quantities, which we refer to as cell variables, may be associated with disease risks. …”

• Deep Learning for Computational Biology (Jul 2016) “… In this review, we discuss applications of this new breed of analysis approaches in regulatory genomics and cellular imaging. We provide background of what deep learning is, and the settings in which it can be successfully applied to derive biological insights. In addition to presenting specific applications and providing tips for practical use, we also highlight possible pitfalls and limitations to guide computational biologists when and how to make the most use of this new technology.”


• Not a review, but rather a summary list of applications of ML to genomics-related research: [Quora] How is machine learning used in genomics?

• Not a review, but informative: Deep Learning Meets Genome Biology: An Interview With Brendan Frey About Realizing New Possibilities In Genomic Medicine (Apr 2016) [local copy]

## Approaches

Deep Learning in Bioinformatics: Introduction, Application, and Perspective in Big Data Era (Feb 2019) [code]  “Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated in vast majorities of analysis pipelines. In this review, we provide both the esoteric introduction of deep learning, and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems which are suitable to use deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoder, and the most recent state-of-the-art architectures. After that, we provide eight examples, covering five bioinformatics research directions and all the four kinds of data type, with the implementation written in Tensorflow and Keras. Finally, we discuss the common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at GitHub.”

• “… we provide eight examples, which cover five research directions, four data types, and a number of deep learning models that people will encounter in Bioinformatics. The five research directions are: sequence analysis, structure prediction and reconstruction, biomolecular property and function prediction, biomedical image processing and diagnosis, biomolecule interaction prediction and systems biology. The four data types are: structured data, 1D sequence data, 2D image or profiling data, graph data. The covered deep learning models are: deep fully connected neural networks, ConvNet, RNN, graph convolutional neural network, ResNet, GAN, VAE.”
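To make the ConvNet-for-sequence-analysis direction concrete: the first convolutional layer of such models effectively slides position-weight-matrix-like filters along a one-hot encoded sequence. A minimal numpy sketch of that scan (the "TATA" filter is invented for illustration; real models learn their filters):

```python
import numpy as np

def one_hot(seq, alphabet="ACGT"):
    """Encode a sequence as a (len(seq), 4) one-hot matrix."""
    idx = {b: i for i, b in enumerate(alphabet)}
    mat = np.zeros((len(seq), len(alphabet)))
    for pos, base in enumerate(seq):
        mat[pos, idx[base]] = 1.0
    return mat

def conv1d_scan(x, kernel):
    """Valid 1D convolution of a (L, 4) one-hot input with a (k, 4) filter."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel)
                     for i in range(x.shape[0] - k + 1)])

# A hypothetical filter that responds maximally to the motif "TATA":
# the score at each position counts matching bases.
tata_filter = one_hot("TATA")
scores = conv1d_scan(one_hot("GGTATACC"), tata_filter)
best = int(np.argmax(scores))  # position where the motif starts
```

A trained network would learn many such filters and stack nonlinearities on top, but the sliding-window computation is the same.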


## Association Studies

Genome-wide association studies (GWAS) have emerged as a rich source of genetic clues into disease biology, and they have revealed strong genetic correlations among many diseases and traits. Some of these genetic correlations may reflect causal relationships. Distinguishing Correlation from Causation Using Genome-Wide Association Studies (Nov 2018) [code] developed a method to quantify causal relationships between genetically correlated traits using GWAS summary association statistics. Their method quantified what part of the genetic component of $\small \text{trait 1}$ was also causal for $\small \text{trait 2}$, using mixed fourth moments $\small E(\alpha_1^2 \alpha_1 \alpha_2)$ and $\small E(\alpha_2^2 \alpha_1 \alpha_2)$ of the bivariate effect size distribution. If $\small \text{trait 1}$ was causal for $\small \text{trait 2}$, then SNPs affecting $\small \text{trait 1}$ (large $\small \alpha_1^2$) will have correlated effects on $\small \text{trait 2}$ (large $\small \alpha_1 \alpha_2$), but not vice versa. They validated this approach in extensive simulations. Across 52 traits (average $\small N = 331k$), they identified 30 putative genetically causal relationships, many novel, including an effect of LDL cholesterol on decreased bone mineral density. More broadly, they demonstrated that it was possible to distinguish between genetic correlation and causation using genetic association data.

• “GWAS have identified thousands of common genetic variants (SNPs) affecting disease risk and other complex traits. The same SNPs often affect multiple traits, resulting in a genetic correlation: genetic effect sizes are correlated across the genome, and so are the traits themselves. Some genetic correlations may result from causal relationships. For example, SNPs that cause higher triglyceride levels reliably confer increased risk of coronary artery disease. This causal inference approach, using genetic variants as instrumental variables, is known as Mendelian Randomization (MR). However, genetic variants often have pleiotropic (shared) effects on multiple traits even in the absence of a causal relationship, and pleiotropy is a challenge for MR, especially when it leads to a strong genetic correlation. Statistical methods have been used to account for certain kinds of pleiotropy; however, these approaches too are easily confounded by genetic correlations due to pleiotropy. Here, we develop a robust method to distinguish whether a genetic correlation results from pleiotropy or from causality.”


“Latent causal variable model. The latent causal variable (LCV) model features a latent variable $\small L$ that mediates the genetic correlation between the two traits (Figure 1a). More abstractly, $\small L$ represents the shared genetic component of both traits. $\small \text{Trait 1}$ is fully genetically causal for $\small \text{trait 2}$ if it is perfectly genetically correlated with $\small L$; “fully” means that the entire genetic component of $\small \text{trait 1}$ is causal for $\small \text{trait 2}$ (Figure 1b). More generally, $\small \text{trait 1}$ is partially genetically causal for $\small \text{trait 2}$ if the latent variable has a stronger genetic correlation with $\small \text{trait 1}$ than with $\small \text{trait 2}$; “partially” means that part of the genetic component of $\small \text{trait 1}$ is causal for $\small \text{trait 2}$. …”
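The asymmetry the mixed fourth moments exploit can be seen in a toy simulation. The sketch below is only illustrative: the parameter values are arbitrary, and the actual LCV method works on normalized GWAS summary statistics and handles sample overlap, which this omits. Note that the test requires a heavy-tailed (non-Gaussian) effect-size distribution, here a sparse spike-and-slab:

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps = 200_000

# Heavy-tailed (sparse) effect sizes on trait 1: ~10% of SNPs causal,
# scaled so that the total variance of alpha1 is 1.
causal = rng.random(n_snps) < 0.1
alpha1 = np.where(causal, rng.normal(0.0, np.sqrt(1 / 0.1), n_snps), 0.0)

# Trait 1 partially causal for trait 2 (hypothetical q = 0.5),
# plus independent pleiotropy-free effects on trait 2.
q = 0.5
alpha2 = q * alpha1 + rng.normal(0.0, np.sqrt(1 - q**2), n_snps)

# Mixed fourth moments E(alpha1^2 * alpha1*alpha2) vs E(alpha2^2 * alpha1*alpha2).
m1 = np.mean(alpha1**2 * alpha1 * alpha2)
m2 = np.mean(alpha2**2 * alpha1 * alpha2)
# Under trait1 -> trait2 causality, m1 exceeds m2: large-effect SNPs for
# trait 1 drag trait 2 along, but not vice versa.
```

Reversing the causal direction in the simulation flips the inequality, which is what lets the method distinguish causation from symmetric pleiotropy.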


• Biomedical association studies are increasingly done using clinical concepts, and in particular diagnostic codes from clinical data repositories, as phenotypes. Clinical concepts can be represented in a meaningful vector space using word embedding models. These embeddings allow for comparison between clinical concepts or for straightforward input to machine learning models. … Learning Contextual Hierarchical Structure of Medical Concepts with Poincaré Embeddings to Clarify Phenotypes (Nov 2018) [code here and here] applied Poincaré embeddings in a 2-dimensional hyperbolic space to a large-scale administrative claims database and showed performance comparable to 100-dimensional embeddings in a Euclidean space. They then examined disease relationships under different disease contexts to better understand potential phenotypes.
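The hyperbolic geometry behind this is simple to compute. A sketch of the distance function used in the Poincaré-ball model (the example points and their "concept" labels are invented for illustration):

```python
import numpy as np

def poincare_distance(u, v):
    """Distance between two points inside the unit ball (Poincaré model)."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u**2)) * (1.0 - np.sum(v**2))
    return np.arccosh(1.0 + 2.0 * sq / denom)

# Distances blow up near the boundary: there is exponentially more "room"
# there, which is why a 2-D hyperbolic space can stand in for a
# high-dimensional Euclidean one when embedding hierarchies.
root = np.array([0.05, 0.0])        # hypothetical general concept, near origin
leaf = np.array([0.90, 0.0])        # hypothetical specific code, near the rim
deep_leaf = np.array([0.99, 0.0])   # even more specific, closer to the rim
d1 = poincare_distance(root, leaf)
d2 = poincare_distance(root, deep_leaf)
```

Hierarchies embed naturally because general concepts sit near the origin (close to everything) while specific codes are pushed toward the boundary.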


Additional association studies:

• Predicting MicroRNA-Disease Associations using Network Topological Similarity Based on DeepWalk (Oct 2017) applied DeepWalk to the prediction of microRNA-disease associations, by calculating similarities within a miRNA-disease association network. This approach showed superior predictive performance for 22 complex diseases, with area under the ROC curve scores ranging from 0.805 to 0.937, using five-fold cross-validation. In addition, case studies on breast, lung and prostate cancer further justified the use of the method for the discovery of latent miRNA-disease pairs.
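A sketch of DeepWalk's first stage, truncated random walks over a toy miRNA-disease graph (the node names are invented; DeepWalk then feeds the walks as "sentences" to a skip-gram/word2vec model to learn node embeddings, which is omitted here):

```python
import random

# Hypothetical bipartite miRNA-disease association graph (adjacency lists).
graph = {
    "miR-21":  ["breast_cancer", "lung_cancer"],
    "miR-155": ["breast_cancer"],
    "breast_cancer": ["miR-21", "miR-155"],
    "lung_cancer":   ["miR-21"],
}

def deepwalk_walks(graph, walks_per_node=10, walk_length=6, seed=42):
    """Generate truncated random walks, one corpus 'sentence' per walk."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(graph)
        rng.shuffle(nodes)  # DeepWalk shuffles the start nodes each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                walk.append(rng.choice(graph[walk[-1]]))
            walks.append(walk)
    return walks

walks = deepwalk_walks(graph)
```

Nodes that co-occur frequently in these walks end up with similar embeddings, so miRNA-disease similarity falls out of the learned vector space.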

• Multi-view data (matched sets of measurements on the same subjects) have become increasingly common with technological advances in genomics and other fields. Often, the subjects are separated into known classes, and it is of interest to find associations between the views that are related to the class membership. … Joint Association and Classification Analysis of Multi-View Data (Nov 2018) proposed a framework for Joint Association and Classification Analysis of multi-view data (JACA). In addition to the joint learning framework, an advantage of their approach was its ability to use partial information: it could be applied in settings with missing class labels, and in settings with missing subsets of views. They applied JACA to colorectal cancer data from The Cancer Genome Atlas project, and quantified the association between RNAseq and miRNA views with respect to consensus molecular subtypes of colorectal cancer.

• “Identifying the subset of genetic alterations present in individual tumors that are essential and collectively sufficient for cancer initiation and progression would advance the development of effective personalized treatments. We present Cancer Rule Set Optimization (CRSO) for inferring the combinations of alterations, i.e., rules, that cooperate to drive tumor formation in individual patients. CRSO prioritizes driver combinations in each patient by integrating patient-specific passenger probabilities for individual alterations along with information about the recurrence of particular combinations throughout the population. We present examples in glioma, liver cancer and melanoma of significant differences in patient progression-free intervals based on rule assignments that would not be identifiable by consideration of individual alterations.”

## Cancer

Disease Related Knowledge Summarization Based on Deep Graph Search (Aug 2015) presented an approach to automatically construct disease-related knowledge summaries from the biomedical literature. First, Kullback-Leibler divergence with mutual information was used to extract disease-salient information. A deep search based on depth-first search (DFS) was then applied to find hidden (indirect) relations between biomedical entities. Finally, a random walk algorithm was exploited to filter out the weak relations. The experimental results showed that their approach achieved a precision of 60% and a recall of 61% on salient information extraction for carcinoma of the bladder.
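The "deep search" step amounts to a plain depth-first search for indirect paths between two entities in a relation graph. A minimal illustration (the entity graph below is invented, not the paper's data):

```python
def dfs_paths(graph, start, goal, max_depth=4):
    """Collect all simple paths from start to goal up to max_depth nodes;
    paths longer than one edge expose indirect (hidden) relations."""
    stack = [(start, [start])]
    paths = []
    while stack:
        node, path = stack.pop()
        if node == goal and len(path) > 1:
            paths.append(path)
            continue
        if len(path) > max_depth:
            continue
        for nbr in graph.get(node, []):
            if nbr not in path:  # keep paths simple (no cycles)
                stack.append((nbr, path + [nbr]))
    return paths

# Hypothetical entity graph: a disease linked to a drug only indirectly,
# through a shared gene.
graph = {
    "bladder_carcinoma": ["FGFR3", "smoking"],
    "FGFR3": ["erdafitinib"],
    "smoking": [],
}
paths = dfs_paths(graph, "bladder_carcinoma", "erdafitinib")
```

The multi-hop paths returned here are the candidate indirect relations that the paper's random-walk filter would then score and prune.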


REACH, described in Large-scale Automated Machine Reading Discovers New Cancer Driving-Mechanisms (2017) [code] by the Computational Language Understanding Lab (CLU Lab) at the University of Arizona, is a biomedical information extraction system which aims to read the scientific literature and extract cancer signaling pathways. REACH implements a fairly complete extraction pipeline, including recognition of biochemical entities (proteins, chemicals, etc.), grounding them to known knowledge bases such as UniProt, extraction of BioPAX-like interactions, e.g., phosphorylation, complex assembly, positive/negative regulations, and coreference resolution, for both entities and interactions. REACH was developed using CLU Lab’s open-domain information extraction framework, Odin, which is released within their processors repository on GitHub.


• REACH was evaluated in FamPlex: A Resource for Entity Recognition and Relationship Resolution of Human Protein Families and Complexes in Biomedical Text Mining (Jun 2018) [code], which stated:

• “In a task involving automated reading of ∼215,000 articles using the REACH event extraction software we found that grounding was disproportionately inaccurate for multi-protein families (e.g., AKT) and complexes with multiple subunits (e.g., NF-κB). To address this problem we constructed FamPlex, a manually curated resource defining protein families and complexes as they are commonly encountered in biomedical text. In FamPlex the gene-level constituents of families and complexes are defined in a flexible format allowing for multi-level, hierarchical membership. To create FamPlex, text strings corresponding to entities were identified empirically from literature and linked manually to uniform identifiers; these identifiers were also mapped to equivalent entries in multiple related databases. FamPlex also includes curated prefix and suffix patterns that improve named entity recognition and event extraction. Evaluation of REACH extractions on a test corpus of ∼54,000 articles showed that FamPlex significantly increased grounding accuracy for families and complexes (from 15 to 71%).”

• FamPlex provides a collection of resources for grounding biological entities from text and describing their hierarchical relationships, focusing on protein families, complexes, and their lexical synonyms.

Informatic approaches for including latent variables in knowledge discovery include the excellent multivariate information-based inductive causation (miic) algorithm, which learns causal networks from observational data in the presence of latent variables. miic – implemented as an R package and described in Learning Causal Networks with Latent Variables from Multivariate Information in Genomic Data (Oct 2017) – is an information-theoretic method which learns a large class of causal or non-causal graphical models from purely observational data, while including the effects of unobserved latent variables, commonly found in many datasets. Starting from a complete graph, the method iteratively removes dispensable edges by uncovering significant information contributions from indirect paths, and assesses edge-specific confidences from randomization of the available data. The remaining edges are then oriented based on the signature of causality in observational data. This approach can be applied to a wide range of datasets and provides new biological insights into regulatory networks from single-cell expression data, genomic alterations during tumor development, and co-evolving residues in protein structures. For example, applied to the development of breast cancer, miic network reconstruction highlighted the direct association between tetraploidization and TP53 mutations, in agreement with findings on actual tumors and their resistance to treatment, and consistent with the high incidence of tetraploid tumors in patients with BRCA1/2 germline mutations.
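The core operation in this kind of information-theoretic structure learning – removing an edge X–Y when some conditioning set Z explains away the dependence – can be sketched with plug-in (conditional) mutual information estimates on discrete data. This toy version (binary variables, no finite-size complexity correction, unlike the actual miic implementation) illustrates the idea on a chain X → Z → Y:

```python
import numpy as np

def mutual_info(x, y):
    """Plug-in estimate of I(X;Y) in bits for discrete arrays."""
    mi = 0.0
    for a in np.unique(x):
        for b in np.unique(y):
            pxy = np.mean((x == a) & (y == b))
            px, py = np.mean(x == a), np.mean(y == b)
            if pxy > 0:
                mi += pxy * np.log2(pxy / (px * py))
    return mi

def conditional_mutual_info(x, y, z):
    """I(X;Y|Z) as a p(z)-weighted average of stratum-wise MI."""
    return sum(np.mean(z == c) * mutual_info(x[z == c], y[z == c])
               for c in np.unique(z))

rng = np.random.default_rng(1)
n = 50_000
x = rng.integers(0, 2, n)
z = x ^ (rng.random(n) < 0.1).astype(int)   # Z is a noisy copy of X
y = z ^ (rng.random(n) < 0.1).astype(int)   # Y depends on X only through Z

mi_xy = mutual_info(x, y)                    # clearly positive
cmi_xy_z = conditional_mutual_info(x, y, z)  # near zero: edge X-Y dispensable
```

Because I(X;Y|Z) is essentially zero while I(X;Y) is not, a structure learner would delete the direct X–Y edge and keep X–Z and Z–Y, which is the edge-removal logic miic applies at scale.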


The potential benefits of applying machine learning methods to ‘omics data are becoming increasingly apparent, especially in clinical settings. However, the unique characteristics of these data are not always well suited to machine learning techniques. These data are often generated across different technologies in different labs, and frequently with high dimensionality. A Framework for Implementing Machine Learning on Omics Data (Nov 2018) presented a framework for combining ‘omics datasets, and for handling high dimensional data, making ‘omics research more accessible to machine learning applications. They demonstrated the success of this framework through integration and analysis of multi-analyte data for a set of 3,533 breast cancers. They then used this dataset to predict breast cancer patient survival for individuals at risk of an impending event, with higher accuracy and lower variance than methods trained on individual datasets. Their pipelines for dataset generation and transformation are freely available for noncommercial use at Cambridge Cancer Genomics.

• “Cambridge Cancer Genomics is using blood tests to guide smarter cancer therapy. Currently, cancer patients have to wait up to 6 months to know whether their chemotherapy is working. In the interim, patients suffer the side effects of such treatments. Using simple blood draws, CCG shortens the time required to know whether treatment is working, buying the clinician more time to alter treatment and reduce unnecessary side effects. In addition, we can identify relapse an average of 7 months earlier than standard practice. Over time, CCG will be able to better predict the best therapeutic strategy for cancer patients before they even begin treatment.”



• “It is a big challenge to identify patient-specific drug combinations based on cancer omics data. However, most conventional methods used for identifying personalized therapies require large sample sizes and focus on the population as a whole and we still lack a feasible mathematical framework to circumvent this key problem in the clinical application of precision medicine. This work presents a personalized drug controller method (PDC) to identify drug combinations of individual cancer patients, by exploring the transition state information from a disease state to a normal state, thus providing novel insights into tumor heterogeneity in complex patient ecosystems. We validate the effectiveness of PDC in terms of the accurate prediction of drug combinations on two benchmark cancer datasets. We can also discover personalized key control genes (KCGs) even if the KCGs are hidden in transcription profiles by exploring their network and structural characteristics. Furthermore, we provide computationally derived side effect signatures for drug combinations based on patient-specific KCGs to enhance patient stratification and prognostication. The experimental results strongly support that PDC is effective for personalized drug therapy and treatment, and can fill the gap between network control theory and the personalized drug discovery problem.”


• Motivation:  Cancer subtype classification has the potential to significantly improve disease prognosis and develop individualized patient management. Existing methods are limited by their ability to handle extremely high-dimensional data and by the influence of misleading, irrelevant factors, resulting in ambiguous and overlapping subtypes.

Results:  To address the above issues, we proposed a novel approach to disentangling and eliminating irrelevant factors by leveraging the power of deep learning. Specifically, we designed a deep-learning framework, referred to as DeepType, that performs joint supervised classification, unsupervised clustering and dimensionality reduction to learn cancer-relevant data representation with cluster structure. We applied DeepType to the METABRIC breast cancer dataset and compared its performance to alternative state-of-the-art methods. DeepType significantly outperformed the existing methods, identifying more robust subtypes while using fewer genes. The new approach provides a framework for the derivation of more accurate and robust molecular cancer subtypes by using increasingly complex, multi-source data.

Availability and implementation:  An open-source software package for the proposed method is freely available at acsu.buffalo.edu/~yijunsun/lab/DeepType.html.


### [Cancer] Random Walk Approaches

Random walk and diffusion based models were originally studied in physics to describe the movement of molecules [Diffusion Based Network Embedding (May 2018)]. The general random walk refers to a discrete stochastic process, while diffusion is defined in continuous space and time by a stochastic differential equation that incorporates Brownian motion (which is highly related to and modeled by random walk models), a diffusion coefficient, and a drift term. Random walks are often used as a model for diffusion. In network research, random walks are a basic building block of diffusion processes such as the spread of epidemics or the propagation of opinions. While random walks are a powerful tool for traversing networks, diffusion over networks generates more informative traces that not only consist of node sequences but also associated node information. Diffusion can be used in ranking tasks – finding the most important or relevant webpage (PageRank / Personalized PageRank) or research paper – as well as in local graph partitioning (finding subgraphs and cliques), etc. [See also Random Walks and Diffusion on Networks (Nov 2017), and Introduction to Diffusion and Random Walks (theory plus Python code snippets).]
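Since PageRank recurs throughout this section (DawnRank below adapts it directly), a minimal power-iteration sketch may help; the 3-node graph and the damping value of 0.85 (the conventional default) are illustrative assumptions:

```python
import numpy as np

def pagerank(A, damping=0.85, tol=1e-12):
    """PageRank by power iteration.

    A is a column-stochastic matrix: A[i, j] is the probability of stepping
    from node j to node i. With probability 1 - damping, the walker
    teleports to a uniformly random node.
    """
    n = A.shape[0]
    r = np.full(n, 1.0 / n)
    while True:
        r_next = damping * A @ r + (1.0 - damping) / n
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next

# Toy 3-node graph: nodes 0 and 1 both link to node 2; node 2 links to 0.
A = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])
ranks = pagerank(A)  # node 2, with two in-links, ranks highest
```

The stationary vector is the long-run fraction of time a teleporting random walker spends at each node, which is exactly the "importance" interpretation used in the network-medicine applications below.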

DawnRank: Discovering Personalized Driver Genes in Cancer (Jul 2014) [R code] discovered and ranked mutated genes in the cancer genomes of individual patients according to their potential to be driver mutations. DawnRank ranked genes according to their impact on the perturbation of downstream genes, i.e., a gene was ranked higher if it caused many downstream genes, directly or indirectly in the interaction network, to be differentially expressed. The framework effectively reflected the observation from previous works that mutations in genes with higher connectivity within the gene network are more likely to be impactful. DawnRank adopted the PageRank random walk approach to iteratively model this process. In each iteration a node in the network could – with some probability – stay at the same node or walk randomly to a downstream node, which symbolized the impact of a particular gene on its downstream neighbors. The ranked output described a gene’s overall impact.


### [Cancer] Diffusion Network Based Approaches

Comprehensive Molecular Characterization of Clear Cell Renal Cell Carcinoma (Jul 2013) [code] introduced TieDIE, which was used to connect frequently mutated genes involving the SWI/SNF  chromatin remodeling complex to a diverse set of gene expression changes characteristic of tumor development and progression. “… We next searched for causal regulatory interactions connecting ccRCC somatic mutations to these transcriptional hubs, using a bi-directional extension to HotNet (‘TieDIE’) and identified a chromatin-specific sub-network (Fig. 4a and Supplementary Figs 50-52). TieDIE defines a set of transcriptional targets, whose state in the tumour cells is proposed to be influenced by one or more of the significantly mutated genes. …”


Discovering Causal Pathways Linking Genomic Events to Transcriptional States using Tied Diffusion Through Interacting Events (TieDIE) (Nov 2013) [project  |  code  |  code: Cytoscape-based application] used a network diffusion approach (TieDIE) to connect genomic perturbations to gene expression changes characteristic of cancer subtypes. The method computed a subnetwork of protein-protein interactions, predicted transcription factor-to-target connections and curated interactions from the literature that connected genomic and transcriptomic perturbations. Because many transcription factors are not conventionally regarded as being druggable, approaches such as TieDIE that pinpoint influences upstream of these factors but still in neighborhoods proximal to key driving mutations may provide key starting points for identifying new drug targets.


HotNet2, described in Pan-Cancer Network Analysis Identifies Combinations of Rare Somatic Mutations Across Pathways and Protein Complexes (Dec 2014) [project  |  code  |  non-author code, updated for Python 3], is a well established diffusion based approach for the discovery of personalized cancer driver genes. HotNet2 overcame the limitations of existing single-gene, pathway and network approaches to find mutated subnetworks in cancer. They identified 16 significantly mutated subnetworks that comprised well-known cancer signaling pathways as well as subnetworks with less characterized roles in cancer. HotNet2 uses a directed heat diffusion model to simultaneously assess the significance of mutations in individual genes and the local topology of interactions among the encoded proteins, overcoming the limitations of pathway-based enrichment statistics and earlier network approaches.
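The "insulated" heat diffusion at the heart of HotNet2 has a closed form, $F = \beta (I - (1 - \beta) W)^{-1}$, where $W$ is a normalized walk matrix and $\beta$ the restart (insulation) parameter. A hedged numpy sketch (the 3-gene network, mutation scores, and $\beta$ value are invented for illustration):

```python
import numpy as np

def insulated_diffusion(W, beta):
    """Closed-form diffusion matrix F = beta * (I - (1 - beta) * W)^-1,
    where W is a column-normalized adjacency (walk) matrix."""
    n = W.shape[0]
    return beta * np.linalg.inv(np.eye(n) - (1.0 - beta) * W)

# Toy 3-gene interaction network (a path): gene0 - gene1 - gene2.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
W = adj / adj.sum(axis=0, keepdims=True)  # column-normalize
F = insulated_diffusion(W, beta=0.4)

# Heat from a mutation-score vector diffuses to network neighbors but
# stays concentrated near its source because of the insulation.
h = np.array([1.0, 0.0, 0.0])             # hypothetical: only gene0 mutated
heat = F @ h
```

Because each column of $F$ sums to 1, total heat is conserved; HotNet2 then thresholds the pairwise exchanged heat to extract "hot" subnetworks, a step this sketch omits.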


Incorporating Networks in a Probabilistic Graphical Model to Find Drivers For Complex Human Diseases (Oct 2017) proposed a graphical model based method, Conflux, that integrated genotype data with networks using diffusion-like methods. This Bayesian framework allowed Conflux to track the uncertainty in the gene list being associated with the disease, and consequently to rank the genes by confidence in the association. It also allowed the discovery of gene sets that were not fully supported by the network if they had enough support in the data. Using networks clearly improved gene detection compared to individual gene testing. Conflux showed consistently better performance than the state-of-the-art diffusion-based method HotNet2 and a variety of other network and variant aggregation methods.


## Cancer

• Towards Gene Expression Convolutions using Gene Interaction Graphs (Jun 2018) applied graph convolutional networks (GCNs) to the analysis of RNA-Seq gene expression data from The Cancer Genome Atlas (TCGA). They explored the use of GCNs with dropout and gene embeddings (to utilize the graph information). While the GCN approach in this exploratory work provided an advantage for particular tasks in a low-data regime, it was very dependent on the quality of the graph used.

• Joint Association and Classification Analysis of Multi-View Data (Nov 2018) proposed a framework for Joint Association and Classification Analysis of multi-view data (JACA ). In addition to the joint learning framework, an advantage of their approach was its ability to use partial information: it could be applied in settings with missing class labels, and in settings with missing subsets of views. They applied JACA to colorectal cancer data from The Cancer Genome Atlas project, and quantified the association between RNAseq and miRNA views with respect to consensus molecular subtypes of colorectal cancer.

• Spectral Clustering in Regression-Based Biological Networks (John Quackenbush: May 2019)

## Classification

An extension of Neuro-Symbolic Representation Learning on Biological Knowledge Graphs (by Alshahrani et al.) was described by Agibetov and Samwald in Fast and Scalable Learning of Neuro-Symbolic Representations of Biomedical Knowledge (Apr 2018). Agibetov and Samwald showed how to train fast log-linear neural embeddings (representations) of the entities, which were used as inputs for ML classifiers enabling important tasks such as biological link prediction. Classifiers were trained by concatenating learned entity embeddings to represent entity relations, and training classifiers on the concatenated embeddings to discern true relations from automatically generated negative examples. Their simple embedding methodology greatly improved on classification error compared to previously published state-of-the-art results, yielding increased F-measure and ROC AUC scores for the most difficult biological link prediction problem. Their embedding approach was also much more economical in terms of embedding dimensions (d=50 vs. d=512), and naturally encoded the directionality of the asymmetric biological relations, which could be controlled by the order in which the embeddings were concatenated.

• “As opposed to the approach taken by Alshahrani et al. we [Agibetov and Samwald] employ another neural embedding method which requires fewer parameters and is much faster to train. Specifically, we exploit the fact that the biological relations have well defined non-overlapping domain and ranges, and therefore the whole knowledge graph can be treated as an untyped directed graph, where there is no ambiguity in the semantics of any relation. To this end, we employ the neural embedding model from the StarSpace toolkit, which aims at learning entities, each of which is described by a set of discrete features (bag-of-features) coming from a fixed-length dictionary. …“


• In the following figure, results in Table 2 (from Alshahrani et al., 2017) serve as the baseline for Tables 3 and 4:

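The concatenation-based link-prediction recipe described above can be sketched as follows. Everything here is illustrative: random 4-dimensional "pretrained" embeddings stand in for the d=50 StarSpace embeddings, and reversed pairs stand in for the automatically generated negatives; only the overall shape of the pipeline (concatenate head and tail embeddings, train a binary classifier) follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy pretrained entity embeddings (d=4); the paper uses d=50 StarSpace vectors
emb = {e: rng.normal(size=4) for e in ["geneA", "geneB", "disease1", "disease2"]}

def pair_features(h, t):
    # concatenation order (head first, tail second) encodes the
    # directionality of the asymmetric relation
    return np.concatenate([emb[h], emb[t]])

# positives: true (gene, disease) links; negatives: reversed-direction pairs
pos = [("geneA", "disease1"), ("geneB", "disease2")]
neg = [("disease1", "geneA"), ("disease2", "geneB")]
X = np.stack([pair_features(h, t) for h, t in pos + neg])
y = np.array([1, 1, 0, 0])

# minimal logistic-regression classifier trained by gradient descent
w = np.zeros(X.shape[1])
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / len(y)

pred = (1.0 / (1.0 + np.exp(-X @ w)) > 0.5).astype(int)
```

Note that swapping head and tail changes the feature vector, which is exactly how the classifier can treat an asymmetric relation's two directions differently.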

Protein Classification using Machine Learning and Statistical Techniques: A Comparative Analysis (Jan 2019) implemented machine learning classification for feature selection, prediction, and for determining an appropriate classification technique for function prediction. Seven classification techniques [CRT, QUEST, CHAID, C5.0, ANN (Artificial Neural Network), SVM and Bayesian] were implemented on data for 4,368 proteins from 6 categories from UniProtKB. The protein data was high-dimensional sequence data containing a maximum of 48 features. SPSS was used to manipulate the high-dimensional sequential protein data with the different classification techniques. Different classification techniques gave different results for each model, showing that the data were imbalanced for classes C4, C5 and C6, affecting the performance of the models. Experimental results indicated that the C5.0 classification technique was best suited for protein feature classification and prediction, giving 95.56% accuracy and high precision and recall values. Features that can be selected for function prediction are also discussed.

• “C5.0: It is an extension of ID3 algorithm of decision tree. It produces a binary tree with multiple branch. It deals with all possible data including the missing features. It is discrete and continuous in nature. [Pang & Gong, 2009]”


Gene expression profiles have been widely used to characterize patterns of cellular responses to diseases. Distinguishing between Normal and Cancer Cells Using Autoencoder Node Saliency (Jan 2019) presented an autoencoder to capture nonlinear relationships recovered from gene expression profiles. The autoencoder is a nonlinear dimension reduction technique using an artificial neural network, which learns hidden representations of unlabeled data. They trained the autoencoder on a large collection of tumor samples from the National Cancer Institute Genomic Data Commons, obtaining a generalized and unsupervised latent representation. With the trained autoencoder, they generated latent representations of a small dataset, containing pairs of normal and cancer cells of various tumor types, demonstrating that the autoencoder effectively extracted distinct gene features for multiple learning tasks in the dataset.

[Figure caption, p. 3: “To compare different histograms, the unsupervised node saliency utilizes the normalized entropy difference (NED) defined as ...”]
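The overall recipe (train an autoencoder on expression profiles, then read off the latent representation) can be sketched with a tiny numpy network. This is a schematic stand-in, not the paper's architecture: a single 10 → 2 → 10 tanh autoencoder trained by plain gradient descent on synthetic "expression" data.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic "expression" data: 40 samples x 10 genes driven by 2 latent programs
Z = rng.normal(size=(40, 2))
X = np.tanh(Z @ rng.normal(size=(2, 10)))

# one-hidden-layer autoencoder: 10 -> 2 -> 10
W1 = rng.normal(scale=0.1, size=(10, 2))
W2 = rng.normal(scale=0.1, size=(2, 10))

def forward(X):
    H = np.tanh(X @ W1)   # latent representation (encoder)
    return H, H @ W2      # reconstruction (decoder)

H, Xh = forward(X)
err0 = np.mean((Xh - X) ** 2)          # reconstruction error before training

lr = 0.05
for _ in range(2000):
    H, Xh = forward(X)
    G = 2 * (Xh - X) / X.size                        # dMSE/dXh
    W2 -= lr * H.T @ G                               # decoder gradient step
    W1 -= lr * X.T @ ((G @ W2.T) * (1 - H ** 2))     # encoder step (tanh')

H, Xh = forward(X)
err1 = np.mean((Xh - X) ** 2)          # reconstruction error after training
```

`H` is the unsupervised latent representation analogous to what the paper extracts for its downstream node-saliency analysis.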

Classification:

• “… We introduce a novel methodology, inspired by topic modeling from the natural language domain, to derive expressive features from the high-dimensional gene expression (GE) data. There, a document is represented as a mixture over a relatively small number of topics, where each topic corresponds to a distribution over the words; here, to accommodate the heterogeneity of a patient’s cancer, we represent each patient (~document) as a mixture over cancer-topics, where each cancer-topic is a mixture over GE values (~words). This required some extensions to the standard LDA model e.g.: to accommodate the ‘real-valued’ expression values - leading to our novel ‘discretized’ Latent Dirichlet Allocation (dLDA) procedure. …”

• “Immune repertoire deep sequencing allows comprehensive characterization of antigen receptor-encoding genes in a lymphocyte population. We hypothesized that this method could enable a novel approach to diagnose disease by identifying antigen receptor sequence patterns associated with clinical phenotypes. In this study, we developed statistical classifiers of T-cell receptor (TCR) repertoires that distinguish tumor tissue from patient-matched healthy tissue of the same organ. The basis of both classifiers was a biophysicochemical motif in the complementarity determining region 3 (CDR3) of TCRβ chains. To develop each classifier, we extracted 4-mers from every TCRβ CDR3 and represented each 4-mer using biophysicochemical features of its amino acid sequence combined with quantification of 4-mer (or receptor) abundance. This representation was scored using a logistic regression model. Unlike typical logistic regression, the classifier is fitted and validated under the requirement that at least 1 positively labeled 4-mer appears in every tumor repertoire and no positively labeled 4-mers appear in healthy tissue repertoires. We applied our method to publicly available data in which tumor and adjacent healthy tissue were collected from each patient. Using a patient-holdout cross-validation, our method achieved classification accuracy of 93% and 94% for colorectal and breast cancer, respectively. The parameter values for each classifier revealed distinct biophysicochemical properties for tumor-associated 4-mers within each cancer type. We propose that such motifs might be used to develop novel immune-based cancer screening assays.”

• “… Multiple instance learning.  The problem of predicting repertoire-level labels from the 4-mers in each repertoire can be formally described as MIL in which the 4-mers are instances, the repertoires are bags, and the bag label is the tissue source of the repertoire (i.e., tumor or healthy). MIL relies on aggregating instance-level scores to assign a bag-level label. Thus, we need to aggregate the scores from all 4-mers in a repertoire into a single value that predicts whether the repertoire came from tumor or healthy tissue. …”
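The multiple-instance aggregation step can be sketched in plain Python: score every 4-mer with a logistic model over biophysicochemical features, and label the repertoire by its best-scoring instance. The amino-acid features, weights and sequences below are made-up toy values; the paper's actual classifier is fitted under the positive-4-mer constraints described above.

```python
import math

def kmers(seq, k=4):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# toy biophysicochemical features per amino acid (hypothetical values):
# (hydrophobicity-like, charge-like)
AA = {"A": (1.8, 0.0), "R": (-4.5, 1.0), "D": (-3.5, -1.0), "G": (-0.4, 0.0)}

def kmer_features(kmer):
    # represent a 4-mer by summed per-residue features
    h = sum(AA[a][0] for a in kmer)
    c = sum(AA[a][1] for a in kmer)
    return h, c

def kmer_score(kmer, w=(0.05, 1.5), b=-0.5):
    # instance-level logistic score for one 4-mer (weights are illustrative)
    h, c = kmer_features(kmer)
    z = w[0] * h + w[1] * c + b
    return 1.0 / (1.0 + math.exp(-z))

def repertoire_score(cdr3_seqs):
    # MIL aggregation: the bag (repertoire) is scored by its best-scoring
    # instance -- one positively labeled 4-mer suffices to call it tumor
    return max(kmer_score(km) for s in cdr3_seqs for km in kmers(s))
```

With these toy weights a repertoire containing a charged motif outscores one without it, illustrating how a single tumor-associated 4-mer can flip the bag label.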


## Clustering

Spectral Clustering in Regression-Based Biological Networks (John Quackenbush: May 2019).  “Biological networks often have complex structure consisting of meaningful clusters of nodes that are integral to understanding biological function. Community detection algorithms to identify the clustering, or community structure, of a network have been well established. These algorithms assume that data used in network construction is observed without error. However, oftentimes intermediary analyses such as regression are performed before constructing biological networks and the associated error is not propagated in community detection. In expression quantitative trait loci (eQTL) networks, one must first map eQTLs via linear regression in order to specify the matrix representation of the network. We study the effects of using estimates from regression models when applying the spectral clustering approach to community detection. We demonstrate the impacts on the affinity matrix and consider adjusted estimates of the affinity matrix for use in spectral clustering. We further provide a recommendation for selection of the tuning parameter in spectral clustering. We evaluate the proposed adjusted method for performing spectral clustering to detect gene clusters in eQTL data from the GTEx project and to assess the stability of communities in biological data.”
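The basic spectral-clustering step the paper builds on (eigendecompose the normalized Laplacian of the affinity matrix, then partition on the leading non-trivial eigenvector) can be sketched as follows; the two-community toy graph is illustrative, and the paper's contribution, propagating regression error into the affinity matrix, is not modeled here.

```python
import numpy as np

def spectral_bipartition(A):
    """Two-way spectral clustering of an affinity matrix A: partition nodes by
    the sign of the Fiedler vector (eigenvector of the second-smallest
    eigenvalue of the normalized graph Laplacian)."""
    d = A.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(A)) - Dinv @ A @ Dinv   # normalized Laplacian
    vals, vecs = np.linalg.eigh(L)         # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# two dense 3-node communities joined by one weak edge
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.1
labels = spectral_bipartition(A)
```

For k > 2 communities one would instead run k-means on the first k eigenvectors; the two-cluster sign trick keeps the sketch short.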

## Disease Classification

GRAM: Graph-based Attention Model for Healthcare Representation Learning (Apr 2017) [code] described a graph-based attention model (GRAM) which addressed two challenges of deep learning applied to healthcare-related KG: insufficient sample size, and interpretation (representations learned by deep learning methods should align with medical knowledge). GRAM supplemented electronic health records with the hierarchical information inherent in medical ontologies (represented as a directed acyclic graph). The ontological KG provided a multi-resolution view of medical concepts upon which GRAM analyzed health records via recurrent neural networks to learn accurate and interpretable representations of medical concepts.
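The core of GRAM, an attention-weighted combination of a medical code's embedding with the embeddings of its ontology ancestors, can be sketched as below. The two-code ontology, the embeddings and the dot-product attention scorer are toy assumptions (the paper scores attention with a small feed-forward network and learns everything end-to-end with the RNN).

```python
import numpy as np

rng = np.random.default_rng(2)

# toy ontology DAG, stored as code -> [code itself, then its ancestors]
ancestors = {"flu": ["flu", "resp_infection", "disease"],
             "copd": ["copd", "resp_chronic", "disease"]}
emb = {c: rng.normal(size=8)
       for c in ["flu", "copd", "resp_infection", "resp_chronic", "disease"]}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gram_representation(code, u):
    """GRAM-style representation: attention (here scored by a dot product with
    a context vector u) over the code and its ancestors, so rare leaf codes
    can borrow statistical strength from well-observed ancestors."""
    A = np.stack([emb[a] for a in ancestors[code]])
    att = softmax(A @ u)          # attention weights over code + ancestors
    return att @ A, att           # final representation, weights

u = rng.normal(size=8)
g, att = gram_representation("flu", u)
```

Because the weights are a softmax, the final representation always lies in the convex hull of the code's own embedding and its ancestors' embeddings, which is what makes the learned representations interpretable against the ontology.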


TieDIE can identify significantly implicated pathways connecting two sets of genes; for example, core asthma genes from DisGeNET and a severe asthma-related differential expression signature mapped onto the Reactome regulatory pathway network, shown in Fig. 3 in Navigating the Disease Landscape: Knowledge Representations for Contextualizing Molecular Signatures (Apr 2018) [supplemental material]. The diffusion model used by this method could take into account the magnitude of effects, and the direction and type of interactions. After applying this approach, the regulatory subnetwork from Reactome that was found to be significant for these two sets was visualized in Cytoscape (Fig. 3 in that paper). A subset of the network relating to the IL1RN  gene was identified, where it was possible to see that two interleukin genes (both involved in airway inflammation) were found to be particularly critical.


To date, 83 single nucleotide polymorphisms (SNPs) for type 2 diabetes have been identified using GWAS. However, standard statistical tests for single- and multi-locus analysis, such as logistic regression, have contributed little to understanding the genetic architecture of complex human diseases. …

… Logistic regression models capture linear interactions but neglect the non-linear epistatic interactions present within genetic data. There is an urgent need to detect epistatic interactions in complex diseases, as this may explain the remaining missing heritability in such diseases. Extracting Epistatic Interactions in Type 2 Diabetes Genome-Wide Data Using Stacked Autoencoder (Aug 2018) presented a novel framework based on deep learning algorithms that dealt with the non-linear epistatic interactions that exist in genome-wide association data. Logistic association analysis under an additive genetic model, adjusted for the genomic control inflation factor, was conducted to remove statistically improbable SNPs and minimize computational overhead.

Biomedical association studies are increasingly done using clinical concepts, and in particular diagnostic codes from clinical data repositories, as phenotypes. Clinical concepts can be represented in a meaningful vector space using word embedding models. These embeddings allow for comparison between clinical concepts, or for straightforward input to machine learning models. Using traditional approaches, good representations require high dimensionality, making downstream tasks such as visualization more difficult. Learning Contextual Hierarchical Structure of Medical Concepts with Poincaré Embeddings to Clarify Phenotypes (Nov 2018) [code here and here] applied Poincaré embeddings in a 2-dimensional hyperbolic space to a large-scale administrative claims database, showing performance comparable to 100-dimensional embeddings in a Euclidean space. They then examined disease relationships under different disease contexts to better understand potential phenotypes.
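The ingredient that makes such low-dimensional embeddings work is the Poincaré ball distance, which blows up near the boundary of the disk so that a 2-d hyperbolic embedding can hold tree-like hierarchies that would need many Euclidean dimensions. A self-contained sketch of the distance function (the points are arbitrary illustrative coordinates):

```python
import math

def poincare_distance(u, v):
    """Distance in the Poincare ball model of hyperbolic space:
    d(u, v) = arcosh(1 + 2 ||u - v||^2 / ((1 - ||u||^2) (1 - ||v||^2))).
    Points must lie strictly inside the unit ball."""
    nu = sum(x * x for x in u)
    nv = sum(x * x for x in v)
    duv = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * duv / ((1.0 - nu) * (1.0 - nv)))

# the same Euclidean separation (0.1) costs far more hyperbolic distance
# near the boundary than near the origin
center_pair = poincare_distance((0.0, 0.0), (0.1, 0.0))
edge_pair = poincare_distance((0.85, 0.0), (0.95, 0.0))
```

This boundary behavior is why general concepts tend to embed near the origin and fine-grained codes near the rim, exposing the hierarchy directly in 2-d.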


Identifying disease-associated missense mutations remains a challenge, especially in large-scale sequencing studies. An Interactome Perturbation Framework Prioritizes Damaging Missense Mutations for Developmental Disorders (Jun 2018) [media] established an experimentally and computationally integrated approach to investigate the functional impact of missense mutations in the context of the human interactome network and tested their approach by analyzing ~2,000 de novo missense mutations found in autism subjects (probands) and their unaffected siblings. Interaction-disrupting de novo missense mutations were more common in autism probands, principally affecting hub proteins, and disrupting a significantly higher fraction of hub interactions than in unaffected siblings. Moreover, they tended to disrupt interactions involving genes previously implicated in autism, providing complementary evidence that strengthened previously identified associations and enhanced the discovery of new ones. By analyzing de novo missense mutation data from six disorders, they demonstrated that their interactome perturbation approach offered a generalizable framework for identifying and prioritizing missense mutations that contributed to the risk of human disease.


• Previous studies have reported functional clustering in genes with de novo protein-truncating variants (dnPTVs) in ASD individuals. Here we assessed the network distance within the human interactome between genes harboring interaction-disrupting dnMis mutations (excluding genes with dnPTVs) and seven classes of known ASD-associated genes. These genes (Supplementary Table 4) include: fragile X mental retardation protein (FMRP) target genes, with transcripts bound by FMRP; genes encoding chromatin modifiers; genes expressed preferentially in embryos; genes encoding postsynaptic density proteins; 881 genes in the SFARI database; a high-quality SFARI subset (141 genes scored as syndromic, high confidence or strong candidate 58); and the latest set of 65 ASD genes discovered by de novo mutations. We found that in probands, proteins harboring interaction-disrupting dnMis mutations are significantly closer to proteins from all seven classes in comparison to proteins with non-disrupting dnMis mutations ( and Supplementary Note; Methods). In contrast, no significant differences were observed among unaffected siblings in any category. These findings demonstrate that disruptive dnMis mutations identified by our study in ASD probands are indeed closely related to known ASD genes and functional classes and that they may contribute to ASD etiology by disrupting common pathways shared with dnPTVs.

Non-synonymous mutations linked to complex diseases often have a global impact on a biological system, affecting large biomolecular networks and pathways. However, the magnitude of mutation-driven effects on the macromolecular network is yet to be fully explored. Multilayer View of Pathogenic SNVs in Human Interactome through In Silico Edgetic Profiling (Jul 2018) presented a systematic multi-level characterization of human mutations associated with genetic disorders by determining their individual and combined interaction-rewiring, ‘edgetic’ effects on the human interactome. Their in silico analysis highlighted the intrinsic differences and important similarities between pathogenic single-nucleotide variants (SNVs) and frameshift mutations. They showed that pathogenic SNVs were more likely to cause gene pleiotropy than pathogenic frameshift mutations, and were enriched on protein interaction interfaces. Functional profiling of SNVs indicated widespread disruption of protein-protein interactions and synergistic effects of SNVs. The coverage of their approach was several times greater than that of a recently published experimental study, with minimal overlap between the two, while the distributions of determined edgotypes between the two sets of profiled mutations were remarkably similar. Case studies revealed the central role of interaction-disrupting mutations in type 2 diabetes mellitus, and suggested the importance of studying mutations that abnormally strengthen protein interactions in cancer. With the advancement of next-generation sequencing technology that drives precision medicine, there is an increasing demand for understanding the changes in molecular mechanisms caused by patient-specific genetic variation. Current and future in silico edgotyping tools present a cheap and fast solution for dealing with the rapidly growing datasets of discovered mutations.


Disease:

Ontologies can be used for the interpretation of natural language. While medical synonym resources have been an important part of medical natural language processing (NLP), they suffer from low precision and low recall. To construct an anti-infective drug ontology, one needs to design and deploy a methodological step to carry out entity discovery and linking. Approach for Semi-automatic Construction of Anti-infective Drug Ontology Based on Entity Linking (Jan 2018) adopted an NLP approach to generate candidate entities, using an ontology to extract semantic relations. Six word-vector features and word-level features were selected for entity linking. The extraction of synonyms with a single feature and with different combinations of features was studied. Experiments showed that the selected features achieved a precision of 86.77%, a recall of 89.03% and an F1 score of 87.89%.


The task of entity relation extraction discovers new relation facts and enables broader applications of knowledge graph. Attention-Aware Path-Based Relation Extraction for Medical Knowledge Graph (Jan 2018) proposed a novel path attention-based model (GRU + ONE) to discover new entity relation facts. Instead of finding texts for relation extraction, the proposed method extracted path-only information for entity pairs from the knowledge graph. For each pair of entities, multiple paths could be extracted, with some of them being more useful than others for relation extraction. They employed an attention mechanism to assign different weights to different paths, which highlighted paths useful for entity relation extraction. On a large medical knowledge graph, the proposed method significantly improved the accuracy of extracted relation facts compared with state of the art relation extraction methods.
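The path-attention idea can be sketched as: encode each extracted path between an entity pair, score it against the candidate relation, and pool with softmax weights so that informative paths dominate. The relation vocabulary, the sum-of-relations path encoder (the paper encodes paths with a GRU) and the dot-product scorer below are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# toy relation embeddings for a small medical KG vocabulary
rel_emb = {r: rng.normal(size=6)
           for r in ["treats", "causes", "interacts_with", "associated_with"]}

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def encode_path(path):
    # toy path encoder: sum of relation embeddings along the path
    return np.sum([rel_emb[r] for r in path], axis=0)

def attended_pair_representation(paths, query):
    """Weight each KG path between an entity pair by attention against the
    candidate relation's embedding, then pool: useful paths get high weight,
    uninformative ones are suppressed."""
    P = np.stack([encode_path(p) for p in paths])
    att = softmax(P @ rel_emb[query])
    return att @ P, att

# three alternative paths between one entity pair
paths = [["treats"], ["interacts_with", "associated_with"], ["causes"]]
rep, att = attended_pair_representation(paths, "treats")
```

`rep` would then feed a classifier that decides whether the queried relation holds between the pair; the learned attention replaces the hand-picking of "good" paths.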


## Explainable Models

The use of constraint graphs for finding optimal Bayesian networks (graphs) is an intuitive method for incorporating rich prior knowledge into structure learning tasks (Finding the Optimal Bayesian Network Given a Constraint Graph (Jul 2017) [code]). The accompanying Jupyter notebook by these authors provides toy examples, including a three-layer network with:

• health symptoms (low energy; bloating; loss of appetite; vomiting; abdominal cramps) on the bottom layer;
• diseases (ovarian cancer; lactose intolerance; pregnancy) on the middle layer; and,
• genetic tests on the top layer

for three different genetic mutations (BRCA1, BRCA2, and LCT). The edges in this graph were constrained such that symptoms were explained by diseases, and diseases could be partially explained by genetic mutations. There were no edges from diseases to genetic conditions, and no edges from genetic conditions to symptoms. The authors defined a constraint graph which comprised three nodes; “symptoms”, “diseases”, and “genetic mutations”. There was a directed edge from genetic mutations to diseases, and a directed edge from diseases to symptoms. This specified that genetic mutations could be parents to diseases, and diseases to symptoms. Priors were simply programmed Python expressions from the constraint graph, and a resultant learned Bayesian network was returned. Relevant to this REVIEW, this approach could enable very interesting explorations of the KG, for example the discovery of latent/learned embeddings.
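The constraint graph from this toy example can be written directly as a mapping between node groups; a structure-learning search then only scores edges that the constraint admits. A minimal pure-Python sketch of that predicate (the authors' notebook uses the pomegranate library; this standalone version only illustrates the constraint logic):

```python
# layered constraint graph: allowed edge directions between node groups
constraint_graph = {"genetic mutations": ["diseases"],
                    "diseases": ["symptoms"]}

# assignment of each variable to its layer, following the toy example
layer = {"BRCA1": "genetic mutations", "BRCA2": "genetic mutations",
         "LCT": "genetic mutations",
         "ovarian cancer": "diseases", "lactose intolerance": "diseases",
         "pregnancy": "diseases",
         "low energy": "symptoms", "bloating": "symptoms",
         "loss of appetite": "symptoms", "vomiting": "symptoms",
         "abdominal cramps": "symptoms"}

def edge_allowed(parent, child):
    """True if parent -> child is permitted by the constraint graph, i.e. the
    structure search may consider it when learning the Bayesian network."""
    return layer[child] in constraint_graph.get(layer[parent], [])
```

Pruning the edge search space this way is what makes exact structure learning tractable on layered domain knowledge: mutations may parent diseases, diseases may parent symptoms, and nothing else is scored.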


In my Explainable (Interpretable) Models TECHNICAL REVIEW, I review those approaches. Notably relevant to this Biomedical Applications of Machine Learning TECHNICAL REVIEW are my summaries (in the former) of LIME  and the subsequent “anchors” version, each by Marco Ribeiro et al.

• LIME,  described in 'Why Should I Trust You?': Explaining the Predictions of Any Classifier (Aug 2016) [code  |  discussion], supports explaining individual predictions for text classifiers or classifiers that act on tables (numpy arrays of numerical or categorical data) or images.


• Anchors: High-Precision Model-Agnostic Explanations (2018) [code here and here] introduced a novel model-agnostic system that explained the behavior of complex models with high-precision rules called anchors, representing local, sufficient  conditions for predictions. An anchor explanation is a rule that sufficiently anchors  the prediction locally – such that changes to the rest of the feature values of the instance do not matter. …


In what is likely the strongest work in this domain, A Unified Approach to Interpreting Model Predictions (Scott Lundberg et al., Nov 2017) [code] describes a unified approach (SHAP: SHapley Additive exPlanations ) to explain the output of any machine learning model. [A Shapley value is a solution concept in cooperative game theory, introduced by Lloyd Shapley in 1953: to each cooperative game it assigns a unique distribution, among the players, of the total surplus generated by the coalition of all players, characterized by a collection of desirable properties.] SHAP,  which united six existing models including LIME,  assigned an importance value for a particular prediction to each feature, using game theory to guarantee a unique solution that was better aligned with human intuition than existing methods. SHAP is well discussed in these three blog posts, and the project’s very heavily (~4k) starred GitHub repository provides much additional information and examples.
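The Shapley value underlying SHAP can be computed exactly for small games by enumerating coalitions, which makes the definition concrete (SHAP itself uses model-specific approximations, since this brute force is exponential in the number of features). The three-player value function below is a toy example:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by coalition enumeration: player i's value is the
    weighted average marginal contribution value(S + {i}) - value(S) over all
    coalitions S not containing i.  SHAP applies this with 'players' = the
    features of a single prediction and 'value' = the model's expected output
    given a feature subset."""
    n = len(players)
    phi = {}
    for i in players:
        rest = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            for S in combinations(rest, k):
                wgt = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += wgt * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# toy "model": additive effects for a and b, plus an a-b interaction
def value(S):
    v = 0.0
    if "a" in S: v += 2.0
    if "b" in S: v += 1.0
    if "a" in S and "b" in S: v += 1.0   # interaction, split between a and b
    return v

phi = shapley_values(["a", "b", "c"], value)
```

The attributions sum exactly to the model output minus the baseline (the "local accuracy" property SHAP is built on), and the a-b interaction is split evenly between the two features.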


The SHAP approach was updated in a follow-on paper by the same authors, Consistent Individualized Feature Attribution for Tree Ensembles (Scott Lundberg et al., Feb 2018, updated Mar 2019) [code].

• “Interpreting predictions from tree ensemble methods such as gradient boosting machines and random forests is important, yet feature attribution for trees is often heuristics and not individualized for each prediction. Here we show that popular feature attribution methods are inconsistent, meaning they can lower a feature’s assigned importance when the true impact of that feature actually increases. This is a fundamental problem that casts doubt on any comparison between features. To address it we turn to recent applications of game theory and develop fast exact tree solutions for SHAP  (SHapley Additive exPlanation ) values, which are the unique consistent and locally accurate attribution values. We then extend SHAP  values to interaction effects and define SHAP  interaction values. We propose a rich visualization of individualized feature attributions that improves over classic attribution summaries and partial dependence plots, and a unique ‘supervised’ clustering (clustering based on feature attributions). We demonstrate better agreement with human intuition through a user study, exponential improvements in run time, improved clustering performance, and better identification of influential features. An implementation of our algorithm has also been merged into XGBoost  and LightGBM, see this GitHub repository for details.”

Those authors (Scott Lundberg et al.) applied their SHAP model, referred to as Prescience, to the clinical domain in their wonderful paper Explainable Machine-Learning Predictions for the Prevention of Hypoxaemia During Surgery (Oct 2018).

• “Although anaesthesiologists strive to avoid hypoxaemia during surgery, reliably predicting future intraoperative hypoxaemia is not possible at present. Here, we report the development and testing of a machine-learning-based system that predicts the risk of hypoxaemia and provides explanations of the risk factors in real time during general anaesthesia. The system, which was trained on minute-by-minute data from the electronic medical records of over 50,000 surgeries, improved the performance of anaesthesiologists by providing interpretable hypoxaemia risks and contributing factors. The explanations for the predictions are broadly consistent with the literature and with prior knowledge from anaesthesiologists. Our results suggest that if anaesthesiologists currently anticipate 15% of hypoxaemia events, with the assistance of this system they could anticipate 30%, a large portion of which may benefit from early intervention because they are associated with modifiable factors. The system can help improve the clinical understanding of hypoxaemia risk during anaesthesia care by providing general insights into the exact changes in risk induced by certain characteristics of the patient or procedure.”

Note here that I excised Fig. 4c -- refer to the paper for the full figure.


## Image Processing

• Graph refinement (the task of obtaining subgraphs of interest from over-complete graphs), can have many varied applications. In Graph Refinement Based Tree Extraction Using Mean-Field Networks and Graph Neural Networks (Nov 2018) Selvan et al. [Thomas Kipf | Max Welling] extracted tree structures from image data by first deriving a graph-based representation of the volumetric data and then posing tree extraction as a graph refinement task. They presented two methods to perform graph refinement: (1) mean-field approximation (MFA), to approximate the posterior density over the subgraphs from which the optimal subgraph of interest can be estimated …; and (2) they presented a supervised learning approach using graph neural networks (GNNs), which can be seen as generalisations of mean field networks. Subgraphs were obtained by jointly training a GNN based encoder-decoder pair, wherein the encoder learned useful edge embeddings from which the edge probabilities were predicted using a simple decoder. They discussed connections between the two classes of methods, and compared them for the task of extracting airways from 3D, low-dose, chest CT data.


## Latent Knowledge Discovery

Relation Path Feature Embedding Based Convolutional Neural Network Method for Drug Discovery (Apr 2019) proposed a relation path features embedding based convolutional neural network [CNN] model with attention mechanism for drug discovery from literature (PACNN ).  “First, we use predications from biomedical abstracts to construct a biomedical knowledge graph, and then apply a path ranking algorithm to extract drug-disease relation path features on the biomedical knowledge graph. After that, we use these drug-disease relation features to train a convolutional neural network model which combined with the attention mechanism. Finally, we employ the trained models to mine drugs for treating diseases.

• Knowledge graph construction.  In general, knowledge graph (KG) comprises of different nodes and edges. In this work, we firstly obtained the predications extracted by SemRep from the biomedical text. Then, a knowledge graph was constructed by the predications. Specifically, in the KG, let $\small E = \{e_1, e_2, \ldots, e_n\}$ denote the nodes and $\small R = \{r_1, r_2, \ldots, r_n\}$ denote the edges, where $\small e$ and $\small r$ represent entity and relation, respectively. The KG structure (like a tree structure) is shown in Fig.2, this is a two-level relation tree example of the KG.”
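The knowledge-graph construction and two-level relation tree described in the quote can be sketched directly: store each predication as an edge and enumerate length-2 relation paths from a node. The predications below are invented examples, not actual SemRep output:

```python
from collections import defaultdict

# toy SemRep-style predications: (subject, relation, object)
predications = [("aspirin", "TREATS", "headache"),
                ("aspirin", "INHIBITS", "COX1"),
                ("COX1", "ASSOCIATED_WITH", "inflammation")]

adj = defaultdict(list)   # node -> [(relation, neighbor), ...]
for s, r, o in predications:
    adj[s].append((r, o))

def two_level_paths(start):
    """Enumerate length-2 relation paths from a node, mirroring the two-level
    relation tree of the KG in the paper's Fig. 2; such paths are the features
    ranked and fed to the CNN."""
    return [(r1, m, r2, o) for r1, m in adj[start] for r2, o in adj[m]]
```

Longer paths follow the same pattern; the path-ranking step then selects which of these relation paths are informative drug-disease features.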

Latent Knowledge Discovery:

• “The study of human genes and diseases is very rewarding and can lead to improvements in healthcare, disease diagnostics and drug discovery. In this paper, we further our previous study on gene disease relationship specifically with the multifunctional genes. We investigate the multifunctional gene disease relationship based on the published molecular function annotations of genes from the Gene Ontology which is the most comprehensive source on gene functions.”

## Molecular Interactions

Identifying interactions between proteins is important to understanding underlying biological processes. Extracting protein-protein interactions (PPI) from raw text is often very difficult. Previous supervised learning methods have used handcrafted features on human-annotated data sets. Recent NLP/ML models address this challenge.

A Shortest Dependency Path Based Convolutional Neural Network for Protein-Protein Relation Extraction (2016) addressed protein-protein interaction (PPI) extraction using convolutional neural networks (CNN), proposing a shortest dependency path (sdp) based CNN (sdpCNN) model. The proposed method took only the sdp and word embeddings as input, and avoided bias from feature selection by using a CNN. Experimental results on the AIMed and BioInfer datasets demonstrated that this approach outperformed state-of-the-art kernel-based methods. sdpCNN extracted key features automatically, and the results verified that the use of pretrained word embeddings was crucial for this PPI task.
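
The key preprocessing step, recovering the shortest dependency path between two protein mentions, is a graph search over the sentence's dependency parse. A minimal sketch using breadth-first search over toy dependency edges (hypothetical, not the output of any particular parser):

```python
from collections import deque

def shortest_dependency_path(edges, start, goal):
    """BFS over an undirected dependency graph (token-to-token edges)
    to recover the shortest dependency path between two entity tokens."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Toy dependency edges for "ProteinA interacts with ProteinB in vitro".
edges = [("interacts", "ProteinA"), ("interacts", "with"),
         ("with", "ProteinB"), ("interacts", "vitro"), ("vitro", "in")]
sdp = shortest_dependency_path(edges, "ProteinA", "ProteinB")
# sdp == ["ProteinA", "interacts", "with", "ProteinB"]
```

The resulting token path (plus word embeddings) is what a model like sdpCNN consumes as input.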

• This later work, by different authors, is similar (also a shortest dependency path based approach) but uses a Bi-LSTM rather than a CNN: Feature Assisted Bi-Directional LSTM Model for Protein-Protein Interaction Identification from Biomedical Texts (Jul 2018). “Most of the existing systems model the PPI extraction task as a classification problem and are tailored to the handcrafted feature set including domain dependent features. In this paper, we present a novel method based on deep bidirectional long short-term memory (B-LSTM) technique that exploits word sequences and dependency path related information to identify PPI information from text. This model leverages joint modeling of proteins and relations in a single unified framework, which we name as Shortest Dependency Path B-LSTM (sdpLSTM) model. We perform experiments on two popular benchmark PPI datasets, namely AIMed and BioInfer. The evaluation shows the $\small F_1$ score values of 86.45% and 77.35% on AIMed and BioInfer, respectively. Comparisons with the existing systems show that our proposed approach attains state-of-the-art performance.”

• Note the improved $\small F_1$ scores reported in Table 8 of that paper for the sdpLSTM model (this paper) vs. the sdpCNN model (ref. 17 in that table).

Identifying Protein-Protein Interaction using Tree LSTM and Structured Attention (Jul 2018) proposed a novel tree recurrent neural network with structured attention architecture for PPI extraction. Their architecture achieved state-of-the-art results on the benchmark AIMed and BioInfer datasets; moreover, their models achieved significant improvement over previous best models without any explicit feature extraction. Experimental results showed that traditional recurrent networks had inferior performance compared to tree recurrent networks for the supervised PPI task.

• “… we propose a novel neural net architecture for identifying protein-protein interactions from biomedical text using a Tree LSTM with structured attention. We provide an in depth analysis of traversing the dependency tree of a sentence through a child sum tree LSTM and at the same time learn this structural information through a parent selection mechanism by modeling non-projective dependency trees.”

Neuro-Symbolic Representation Learning on Biological Knowledge Graphs (Sep 2017) [code] developed a method for feature learning on biological knowledge graphs. Their method combined symbolic methods (specifically, knowledge representation using symbolic logic and automated reasoning) with neural networks to generate vector representations (embeddings) of the nodes in those graphs, as well as the kinds of relations that exist to neighboring nodes. To learn those representations, they repeatedly performed random walks from each node in the KG, used the resulting walks as sentences within a corpus, and applied a word2vec skip-gram model (DeepWalk) to learn the embeddings for each node. Through the use of symbolic logic, these embeddings contain both explicit and implicit information. These embeddings were applied to the prediction of edges in the KG, applicable to tasks such as the prediction of biological function, finding candidate genes of diseases, and identifying protein-protein interactions or drug-target relations.
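
The random-walk corpus generation at the heart of DeepWalk can be sketched as follows. The toy graph is hypothetical; a real pipeline would feed the resulting walks, as sentences, to a skip-gram word2vec implementation:

```python
import random

def deepwalk_walks(graph, num_walks, walk_length, seed=0):
    """Generate truncated random walks from every node (DeepWalk-style).
    Each walk is later treated as a 'sentence' for skip-gram training."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in sorted(graph):
            walk = [node]
            while len(walk) < walk_length:
                neighbors = graph[walk[-1]]
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Toy KG neighborhood structure (undirected, as adjacency lists).
toy_graph = {"TP53": ["MDM2", "apoptosis"],
             "MDM2": ["TP53"],
             "apoptosis": ["TP53"]}
walks = deepwalk_walks(toy_graph, num_walks=2, walk_length=4)
```

Every node then receives an embedding determined by the neighborhoods these walks visit, which is what makes downstream edge prediction possible.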

Protein interactions constitute the fundamental building blocks of almost every life activity. Identifying protein communities from protein-protein interaction (PPI) networks is essential to understanding the principles of cellular organization and exploring the causes of various diseases. It is critical to integrate multiple data resources to identify reliable protein communities that have biological significance, and to improve the performance of community detection methods for large-scale PPI networks. Parallel Protein Community Detection in Large-scale PPI Networks Based on Multi-source Learning (Oct 2018) proposed a Multi-source Learning based Protein Community Detection (MLPCD) algorithm that integrates Gene Expression Data (GED), along with a parallel implementation of MLPCD using cloud computing technology. To effectively discover the biological functions of proteins that participate in different cellular processes, GED under different conditions was integrated with the original PPI network to reconstruct a Weighted-PPI (WPPI) network.

• To flexibly identify protein communities of different scales, the authors defined community modularity and functional cohesion measurements and detected protein communities from WPPI using an agglomerative method. In addition, the detected communities were compared with known protein complexes to evaluate the functional enrichment of protein function modules using Gene Ontology annotations. An implementation of a parallel version of the MLPCD algorithm on the Apache Spark platform enhanced the performance of the algorithm for large-scale realistic PPI networks. Experimental results indicated the superiority and advantages of the MLPCD algorithm over relevant algorithms in terms of accuracy and performance.

Comprehensive Evaluation of Deep Learning Architectures for Prediction of DNA/RNA Sequence Binding Specificities (Jan 2019) [code] presented a systematic exploration of deep learning architectures for predicting DNA- and RNA-binding specificities. For this purpose, they presented deepRAM, an end-to-end deep learning tool that provided an implementation of novel and previously proposed architectures; its fully automatic model selection procedure allowed them to perform a fair and unbiased comparison of deep learning architectures. They found that an architecture that used $\small k$-mer embedding to represent the sequence, with a convolutional layer and a recurrent layer, outperformed all other methods in terms of accuracy. Their work provided guidelines that will assist practitioners in choosing the best architecture for the task at hand, and provided some insights on the differences between the models learned by convolutional and recurrent networks. In particular, although recurrent networks improved model accuracy, this came at the expense of a loss in the interpretability of the features learned by the model.
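
The $\small k$-mer embedding representation mentioned above starts from a simple sliding-window tokenization of the sequence; a minimal sketch, where `k` and `stride` are free parameters:

```python
def kmer_tokens(seq, k=3, stride=1):
    """Decompose a DNA/RNA sequence into overlapping k-mers: the token
    units that an embedding layer would map to dense vectors."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

tokens = kmer_tokens("ACGTAC", k=3)
# tokens == ["ACG", "CGT", "GTA", "TAC"]

# Map each distinct k-mer to an integer id for an embedding lookup table.
vocab = {kmer: idx for idx, kmer in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]
```

In deepRAM-style architectures these integer ids index learned embedding vectors, which then feed the convolutional and recurrent layers.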

Exploring the Extracellular Regulation of the Tumor Angiogenic Interaction Network Using a Systems Biology Model (Mar 2019) constructed a tumor tissue-based model to better understand how the angiogenic network is regulated by opposing mediators at the extracellular level.  “… Our study provides mechanistic insights into this counterintuitive result and highlights the role of heparan sulfate proteoglycans in regulating the interactions between angiogenic factors. This work complements previous studies aimed at understanding formation of angiogenic complexes in tumor tissue and helps in the development of anti-cancer strategies targeting angiogenesis.”

• Model implementation and simulation.  “The model ODEs are generated using BioNetGen, a rule-based modeling framework. BioNetGen produces all possible molecular species and the corresponding ODEs by specifying a set of starting molecular species and defining reaction rules. Given 40 seed species and 127 reaction rules, the model produced by BioNetGen consists of 154 species. The set of 154 ODEs is implemented in MATLAB, which we used to generate the dynamic results, as well as steady state predictions (i.e., when the model outputs change less than 0.01%). The MATLAB model file is provided in Supplementary File S3 [zip file].”

• This paper used BioNetGen (Biological Network Generator), a rule-based modeling framework that enables the construction and analysis of comprehensive, large-scale models of signal transduction pathways and other biochemical systems.  |  Supplementary File 3 (MATLAB models zip file)

Molecular Interactions:

• When the Web Meets the Cell: Using Personalized PageRank for Analyzing Protein Interaction Networks (Feb 2011) demonstrated a technique, originating from PageRank, for analyzing large interaction networks. The method was fast, scalable and robust, and its capabilities were demonstrated on metabolic network data of the tuberculosis bacterium and the proteomics analysis of the blood of melanoma patients. A Perl script for computing the personalized PageRank in protein networks is available here [zip file; local copy].
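
Personalized PageRank itself is a short power iteration: a random walk that restarts at a chosen seed set with probability $\small 1 - \alpha$. A minimal dict-based sketch on a hypothetical toy network (not the paper's Perl implementation):

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Power iteration for personalized PageRank: a random walk that, with
    probability (1 - alpha), restarts at the seed node set."""
    nodes = list(adj)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - alpha) * restart[n] for n in nodes}
        for n in nodes:
            out = adj[n]
            if out:
                share = alpha * rank[n] / len(out)
                for m in out:
                    nxt[m] += share
            else:  # dangling node: send its mass back to the restart vector
                for m in nodes:
                    nxt[m] += alpha * rank[n] * restart[m]
        rank = nxt
    return rank

# Toy PPI network; scores concentrate near the seed protein "A".
ppi = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}
scores = personalized_pagerank(ppi, seeds={"A"})
```

Ranking nodes by these scores highlights the proteins most relevant to the seed set, which is the core operation behind this style of network analysis.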

• Predicting MicroRNA-Disease Associations using Network Topological Similarity Based on DeepWalk  (Oct 2017) applied DeepWalk to the prediction of microRNA-disease associations, by calculating similarities within a miRNA-disease association network. This approach showed superior predictive performance for 22 complex diseases, with area under the ROC curve scores ranging from 0.805 to 0.937, using five-fold cross-validation. In addition, case studies on breast, lung and prostate cancer further justified the use of the method for the discovery of latent miRNA-disease pairs.

• edge2vec: Learning Node Representation using Edge Semantics (Sep 2018) incorporated edge semantics to represent different edge-types in heterogeneous networks. edge2vec (aka heterogeneous node2vec or H-node2vec) was validated and evaluated using three medical domain problems on an ensemble of complex medical networks: medical entity classification, compound-gene binding prediction, and medical information search costs.

• Comparing Two Deep Learning Sequence-Based Models for Protein-Protein Interaction Prediction (Jan 2019) [code] compared two deep learning models, showing pitfalls to avoid while predicting PPIs through machine learning. Their best model accurately predicted >78% of human PPI under very strict conditions for training and testing. Their method should be applicable to other organisms and to predictions over whole proteomes.

• “… We approach the drug-drug interaction prediction problem as a link prediction problem and present two novel methods for drug-drug interaction prediction based on artificial neural networks and factor propagation over graph nodes: adjacency matrix factorization (AMF) and adjacency matrix factorization with propagation (AMFP). We conduct a retrospective analysis by training our models on a previous release of the DrugBank database with 1,141 drugs and 45,296 drug-drug interactions and evaluate the results on a later version of DrugBank with 1,440 drugs and 248,146 drug-drug interactions. Additionally, we perform a holdout analysis using DrugBank. We report an area under the receiver operating characteristic curve score of 0.807 and 0.990 for the retrospective and holdout analyses respectively. Finally, we create an ensemble-based classifier using AMF, AMFP, and existing link prediction methods and obtain an area under the receiver operating characteristic curve of 0.814 and 0.991 for the retrospective and the holdout analyses. We demonstrate that AMF and AMFP provide state of the art results compared to existing methods and that the ensemble-based classifier improves the performance by combining various predictors. These results suggest that AMF, AMFP, and the proposed ensemble-based classifier can provide important information during drug development and regarding drug prescription given only partial or noisy data. These methods can also be used to solve other link prediction problems. Drug embeddings (compressed representations) created when training our models using the interaction network have been made public.”
• “This study advances our understanding of inter- and intra-pathways higher order signaling in the cellular system and it leads to new discovery of multiple intracellular structures in signal transduction pathways in yeast Saccharomyces.  We present a new tensor decomposition algorithm in reconstructing the pathways based on higher correlations among genes that compose a cellular system. The higher order gene correlation (HOGC) analysis has the power to elucidate gene’s higher interaction dependencies which has been barely understood. Recent studies i.e. [24] have experimentally revealed that multiple signaling proteins, yet sometimes infinite, may assemble to meaningful structure to transmit a receptor activation information. In this paper we reveal 3-order genomic correlations among significant component of the cellular system. This is the first time such a systematic and computational model provided for analysis of higher order correlations among genes. We use new fast algorithm to formulate a genes by genes by genes decorrelated rank-1 sub-tensors (complexes) which can be associated with functionally independent pathways. Then we model higher order tensor decomposition, which is constructed by K tensors of genes by genes by genes. Each new tensor is constructed by an orthogonal projection of data signal onto a designated basis signal to keep common sub-tensors in both signals. Our model for decomposing tensor order-4 approximates series of tensors as linear components of decorrelated rank-1 sub-tensors over tensor of order-3 and rank-3 triplings among sub-tensors. The linear components represent intra-pathway in cell signaling and triplings implicate inter-pathways higher order signaling. Through structural studies of inter- and intra- higher order signaling pathways, we uncover different scenario that involves triple formation of signaling proteins into higher order signaling machines for transmission of receptor activation information to cellular responses.”
• “Long non-coding RNA, microRNA, and messenger RNA enable key regulations of various biological processes through a variety of diverse interaction mechanisms. Identifying the interactions and cross-talk between these heterogeneous RNA classes is essential in order to uncover the functional role of individual RNA transcripts, especially for unannotated and newly-discovered RNA sequences with no known interactions. Recently, sequence-based deep learning and network embedding methods are becoming promising approaches that can either predict RNA-RNA interactions from a sequence or infer missing interactions from patterns that may exist in the network topology. However, the majority of these methods have several limitations, e.g., the inability to perform inductive predictions, to distinguish the directionality of interactions, or to integrate various sequence, interaction, and annotation biological datasets.

“We proposed a novel deep learning-based framework, rna2rna , which learns from RNA sequences to produce a low-dimensional embedding that preserves the proximities in both the interactions topology and the functional affinity topology. In this proposed embedding space, we have designated a two-part “source and target contexts” to capture the targeting and receptive fields of each RNA transcript, while encapsulating the heterogenous cross-talk interactions between lncRNAs and miRNAs. From experimental results, our method exhibits superior performance in AUPR rates compared to state-of-art approaches at predicting missing interactions in different RNA-RNA interaction databases and was shown to accurately perform link predictions to novel RNA sequences not seen at training time, even without any prior information. Additional results suggest that our proposed framework may have successfully captured a manifold for heterogeneous RNA sequences to be used to discover novel functional annotations.”
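
To make the adjacency-matrix-factorization idea from the drug-drug interaction work above concrete, here is a deliberately tiny sketch: learn one embedding per drug so that a sigmoid of the dot product approximates the observed interaction matrix. This is plain matrix factorization with a log-loss gradient, not the published AMF/AMFP architecture, and the four-drug matrix is hypothetical:

```python
import math, random

def train_amf(adj, n, dim=4, lr=0.1, epochs=500, seed=0):
    """Toy adjacency-matrix-factorization for link prediction: learn one
    embedding per drug so that sigmoid(u_i . u_j) approximates adj[i][j].
    A sketch only; the published models add propagation and deeper layers."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n)]
    sig = lambda x: 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))
    for _ in range(epochs):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                pred = sig(sum(a * b for a, b in zip(U[i], U[j])))
                err = adj[i][j] - pred  # gradient of the log-loss
                for d in range(dim):
                    ui, uj = U[i][d], U[j][d]
                    U[i][d] += lr * err * uj
                    U[j][d] += lr * err * ui
    return U, sig

# 4 hypothetical drugs; drug 0 interacts with drugs 1 and 2, drug 3 with none.
A = [[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
U, sig = train_amf(A, n=4)
score = lambda i, j: sig(sum(a * b for a, b in zip(U[i], U[j])))
```

Scoring an unobserved pair with `score(i, j)` is the link-prediction step; in the retrospective setting described above, pairs absent from the older DrugBank release but present in the newer one are the positives to recover.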

Molecular Interactions:

• Excellent review: Global Genetic Networks and the Genotype-to-Phenotype Relationship (Cell: Mar 2019) [media (ScienceDaily.com)]

• “Genetic interactions identify combinations of genetic variants that impinge on phenotype. With whole-genome sequence information available for thousands of individuals within a species, a major outstanding issue concerns the interpretation of allelic combinations of genes underlying inherited traits. In this Review, we discuss how large-scale analyses in model systems have illuminated the general principles and phenotypic impact of genetic interactions. We focus on studies in budding yeast, including the mapping of a global genetic network. We emphasize how information gained from work in yeast translates to other systems, and how a global genetic network not only annotates gene function but also provides new insights into the genotype-to-phenotype relationship.”

• Note also their “Human GI Networks and Cancer” subsection, pp. 94-96 (pdf pp. 10-12).

• [Concluding paragraph]  “Ultimately, as has been done with yeast, sustained efforts to systematically map GIs [genetic interactions] in model human cell lines will generate a global genetic network for different human cells, providing a powerful data-driven resource that defines a functional wiring diagram for humans. Exploiting GIs to define functional modules and mapping their relationships should provide key knowledge for enabling systematic discovery of pathway-level GIs from human genotyping data, leading to a new level of understanding of human biology, enhancing our knowledge of the genotype to phenotype relationship.”
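
The quantitative definition commonly used in the yeast GI mapping discussed above is a one-liner: under the multiplicative model, the digenic interaction score is the deviation of the observed double-mutant fitness from the product of the single-mutant fitnesses, $\small \varepsilon = f_{ab} - f_a \cdot f_b$. A sketch with hypothetical fitness values:

```python
def interaction_score(f_a, f_b, f_ab):
    """Digenic genetic interaction score under the multiplicative model:
    epsilon = f_ab - f_a * f_b.  Strongly negative epsilon suggests a
    synthetic sick/lethal interaction; positive epsilon suggests
    suppression or epistasis."""
    return f_ab - f_a * f_b

# Hypothetical single- and double-mutant fitness values.
eps_neg = interaction_score(0.9, 0.8, 0.3)    # well below 0.72: negative GI
eps_zero = interaction_score(0.9, 0.8, 0.72)  # matches expectation: no GI
```

Systematic GI mapping amounts to measuring this score across millions of gene pairs and clustering the resulting profiles into functional modules.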

## Molecular Structure

Deeply Learning Molecular Structure-Property Relationships using Graph Attention Neural Network (Oct 2018) [code] applied graph attention networks (GAT) to the study of molecular structure-property relationships, which are key to molecular engineering for materials and drug discovery. The authors showed that GATs greatly improved deep learning performance for chemistry, distinguishing atoms in different environments and extracting important structural features determining target properties (such as molecular polarity, solubility, and energy). Interestingly, it identified two distinct parts of molecules; as a result, it could accurately predict molecular properties. Moreover, the resultant latent space was well-organized such that molecules with similar properties were closely located, which is critical for successful molecular engineering.

Understanding the three-dimensional (3D) structure of the genome is essential for elucidating vital biological processes and their links to human disease. To determine how the genome folds within the nucleus, chromosome conformation capture methods such as Hi-C have recently been employed. However, computational methods that exploit the resulting high-throughput, high-resolution data still suffer from important limitations. Inference of the three-dimensional chromatin structure and its temporal behavior (Nov 2018) explored the idea of manifold learning for 3D chromatin structure inference and presented a novel method, REcurrent Autoencoders for CHromatin 3D structure prediction (REACH-3D). Their framework employed autoencoders with recurrent neural units to reconstruct the chromatin structure. In comparison to existing methods, REACH-3D made no transfer function assumptions and permitted dynamic analysis. Evaluated on synthetic data, REACH-3D indicated high agreement with the ground truth. When tested on real experimental Hi-C data, REACH-3D faithfully recovered the expected biological properties and obtained the highest correlation coefficient with microscopy measurements. Lastly, when REACH-3D was applied to dynamic Hi-C data, it successfully modeled chromatin conformation during the cell cycle.

• Hi-C is a 3C-based technology that allows measurement of pairwise chromatin interaction frequencies within a cell population. Hi-C data can be thought of as a network where genomic regions are nodes and the normalized read counts mapped to two bins are weighted edges.
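
That network view can be made concrete in a few lines: threshold a (hypothetical, already-normalized) contact matrix into a weighted edge list, with genomic bins as nodes:

```python
def hic_to_edges(matrix, bins, min_count=1.0):
    """Convert a normalized Hi-C contact matrix into a weighted edge list:
    genomic bins are nodes, normalized read counts are edge weights."""
    edges = []
    n = len(bins)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle: Hi-C maps are symmetric
            if matrix[i][j] >= min_count:
                edges.append((bins[i], bins[j], matrix[i][j]))
    return edges

# Hypothetical 1 Mb bins and normalized contact counts.
bins = ["chr1:0-1Mb", "chr1:1-2Mb", "chr1:2-3Mb"]
contacts = [[0.0, 8.0, 0.5],
            [8.0, 0.0, 3.0],
            [0.5, 3.0, 0.0]]
edges = hic_to_edges(contacts, bins)
```

The `min_count` threshold is an illustrative filter for noisy low-count contacts; the resulting weighted graph is the input that structure-inference methods like REACH-3D start from.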

Predicting protein structure from sequence is a central challenge of biochemistry. Co-evolution methods show promise, but an explicit sequence-to-structure map remains elusive. Advances in deep learning that replace complex, human-designed pipelines with differentiable models optimized end-to-end suggest the potential benefits of similarly reformulating structure prediction. End-To-End Differentiable Learning of Protein Structure (Aug 2018) [code  |  media  |  discussion here (Hacker News), here (reddit) and here (reddit)  |  mentioned here (reddit)  |  author’s blog post  (local copy)] reported the first end-to-end differentiable model of protein structure. The model coupled local and global protein structure via geometric units that optimized global geometry without violating local covalent chemistry. They tested their model on two challenging tasks: predicting novel folds without co-evolutionary data, and predicting known folds without structural templates. In the first task the model achieved state-of-the-art accuracy, and in the second it came within 1-2 Å of competing methods; those methods, which use co-evolution and experimental templates, have been refined over many years, and it is likely that the differentiable approach has substantial room for further improvement, with applications ranging from drug discovery to protein design.

The Boltzmann distribution is a natural model for many systems, from brains to materials and biomolecules, but is often of limited utility for fitting data because Monte Carlo algorithms are unable to simulate it in the available time. This gap between the expressive capabilities and sampling practicalities of energy-based models is exemplified by the protein folding problem, since energy landscapes underlie contemporary knowledge of protein biophysics but computer simulations are often unable to fold all but the smallest proteins from first principles. Learning Protein Structure with a Differentiable Simulator (ICLR 2019) [OpenReview  |  discussion  |  mentioned] sought to bridge the gap between the expressive capacity of energy functions and the practical capabilities of their simulators via an unrolled Monte Carlo simulation as a model for data. They composed a neural energy function with a novel and efficient simulator based on Langevin dynamics to build an end-to-end-differentiable model of atomic protein structure given amino acid sequence information. They introduced techniques for stabilizing backpropagation under long roll-outs and demonstrated the model’s capacity to make multimodal predictions and to (sometimes) generalize to unobserved protein fold types when trained on a large corpus of protein structures.

De novo protein structure prediction from amino acid sequence is one of the most challenging problems in computational biology. As one of the most extensively explored mathematical models of protein folding, the Hydrophobic-Polar (HP) model enables thorough investigation of protein structure formation and evolution. Although the HP model discretizes the conformational space and simplifies the folding energy function, folding under it has been proven to be an NP-complete problem. FoldingZero: Protein Folding from Scratch in Hydrophobic-Polar Model (Dec 2018) proposed a novel protein folding framework, FoldingZero, which self-folds a de novo protein 2D HP structure from scratch based on deep reinforcement learning. … It is trained solely by reinforcement learning, which improves HPNet and R-UCT through iterative policy optimization. Without any supervision or domain knowledge, FoldingZero not only achieves comparable results, but also learns latent folding knowledge to stabilize the structure. Without exponential computation, FoldingZero shows promising potential for adoption in real-world protein property prediction.
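
The HP model's energy function is simple enough to state in code: every pair of H residues that are lattice neighbors but not chain neighbors contributes −1. A sketch for the 2D square lattice, with a toy sequence and conformation chosen for illustration:

```python
def hp_energy(sequence, conformation):
    """Energy of a 2D lattice HP conformation: -1 for every pair of H
    residues that are lattice neighbors but not consecutive in the chain."""
    assert len(sequence) == len(conformation)
    pos = {tuple(p): i for i, p in enumerate(conformation)}
    assert len(pos) == len(conformation), "self-avoiding walk violated"
    energy = 0
    for i, (x, y) in enumerate(conformation):
        if sequence[i] != "H":
            continue
        # Check only +x and +y neighbors so each contact is counted once.
        for nx, ny in ((x + 1, y), (x, y + 1)):
            j = pos.get((nx, ny))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy

# "HPPH" folded into a unit square: the two H termini form one contact.
e = hp_energy("HPPH", [(0, 0), (0, 1), (1, 1), (1, 0)])
# e == -1
```

Minimizing this energy over all self-avoiding walks is the NP-complete search problem that FoldingZero attacks with a learned policy rather than exhaustive enumeration.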

DeepMind brought together experts from the fields of structural biology, physics, and machine learning to apply cutting-edge techniques to predict the 3D structure of a protein based solely on its genetic sequence: AlphaFold: Using AI For Scientific Discovery (Dec 2018) [abstract;  discussion here and here]. AlphaFold builds on years of prior research in using vast genomic data to predict protein structure. The 3D protein models that AlphaFold generated were far more accurate than previous models – demonstrating significant progress on one of the core challenges in biology.

• CASP13: 13th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction; CASP13 Target List

ProteinNet: A Standardized Data Set for Machine Learning of Protein Structure (Mohammed AlQuraishi: Feb 2019) [data, code]  “We have created the ProteinNet series of data sets to provide a standardized mechanism for training and assessing data-driven models of protein sequence-structure relationships. ProteinNet integrates sequence, structure, and evolutionary information in programmatically accessible file formats tailored for machine learning frameworks. … Standardized data splits were also generated to emulate the difficulty of past CASP (Critical Assessment of protein Structure Prediction) experiments by resetting protein sequence and structure space to the historical states that preceded six prior CASPs. Utilizing sensitive evolution-based distance metrics to segregate distantly related proteins, we have additionally created validation sets distinct from the official CASP sets that faithfully mimic their difficulty. ProteinNet thus represents a comprehensive and accessible resource for training and assessing machine-learned models of protein structure.”

Ab initio protein docking represents a major challenge for optimizing a noisy and costly black box-like function in a high-dimensional space. Despite progress in this field, there is no docking method available for rigorous uncertainty quantification (UQ) of its solution quality (e.g. interface RMSD or iRMSD). Bayesian Active Learning for Optimization and Uncertainty Quantification in Protein Docking (Jan 2019) [code] introduced a novel algorithm, Bayesian Active Learning (BAL), for optimization and UQ of such black-box functions and flexible protein docking. “BAL directly models the posterior distribution of the global optimum (or native structures for protein docking) with active sampling and posterior estimation iteratively feeding each other. … Over a protein docking benchmark set and a CAPRI set including homology docking, we establish that BAL significantly improves against both starting points by rigid docking and refinements by particle swarm optimization, … To the best of our knowledge, this study represents the first uncertainty quantification solution for protein docking, with theoretical rigor and comprehensive assessment.”

Unified Rational Protein Engineering with Sequence-Only Deep Representation Learning (Mohammed AlQuraishi  |  George Church  |  …: Mar 2019) [code]

“Rational protein engineering requires a holistic understanding of protein function. Here, we apply deep learning to unlabelled amino acid sequences to distill the fundamental features of a protein into a statistical representation that is semantically rich and structurally, evolutionarily, and biophysically grounded. We show that the simplest models built on top of this unified representation (UniRep) are broadly applicable and generalize to unseen regions of sequence space. Our data-driven approach reaches near state-of-the-art or superior performance predicting stability of natural and de novo designed proteins as well as quantitative function of molecularly diverse mutants. UniRep further enables two orders of magnitude cost savings in a protein engineering task. We conclude UniRep is a versatile protein summary that can be applied across protein engineering informatics.”

## Pharmacology

DASPfind: New Efficient Method to Predict Drug-Target Interactions (Mar 2016) [code  |  online access] presented a computational method for finding reliable new interactions between drugs and proteins. When the single top-ranked predictions were considered (or when a drug had no or only a few known targets), DASPfind outperformed other state-of-the-art methods on six different drug-target interaction (DTI) datasets. The usefulness and practicality of DASPfind were demonstrated by predicting novel DTIs for the Ion Channel dataset. The validated predictions suggested that DASPfind can be used as an efficient method to identify correct DTIs, reducing the cost of experimental verification in the process of drug discovery.

• DASPfind relied on a graph interaction model to predict drug-target interactions via a heterogeneous graph consisting of three subgraphs (drug-drug similarity; protein-protein similarity; known drug-protein interactions) connected to each other. The algorithm for predicting new drug-protein interactions is based on all simple paths of particular lengths in the graph, utilizing similarity information within the subgraphs combined with information from the topology of the heterogeneous graph.
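
The simple-path idea can be sketched as follows. The exponential length penalty and the toy heterogeneous graph are illustrative assumptions, not DASPfind's exact scoring function:

```python
def simple_paths(adj, start, goal, max_len):
    """Enumerate all simple (cycle-free) paths of at most max_len edges
    between two nodes, via iterative depth-first search."""
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == goal and len(path) > 1:
            paths.append(path)
            continue
        if len(path) - 1 >= max_len:
            continue
        for nxt in adj.get(path[-1], ()):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths

def path_score(paths, alpha=0.5):
    """Score a drug-target pair by summing over its simple paths, with an
    exponential penalty on path length (a hypothetical heuristic)."""
    return sum(alpha ** (len(p) - 1) for p in paths)

# Toy heterogeneous graph: drug-drug similarity, drug-target interaction,
# and target-target similarity edges merged into one adjacency map.
adj = {"drug1": ["drug2"], "drug2": ["drug1", "targetA"],
       "targetA": ["drug2", "targetB"], "targetB": ["targetA"]}
paths = simple_paths(adj, "drug1", "targetB", max_len=3)
score = path_score(paths)
```

Here `drug1` has no known targets, yet it still receives a nonzero score for `targetB` through its similar neighbor `drug2`, which is exactly the situation where DASPfind is reported to do well.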

Deep Learning Applications for Predicting Pharmacological Properties of Drugs and Drug Repurposing Using Transcriptomic Data (Jul 2016) [code] demonstrated how deep neural networks (DNN) trained on large transcriptional response data sets could classify various drugs into therapeutic categories solely based on their transcriptional profiles. They used the perturbation samples of 678 drugs across A549, MCF‐7 and PC‐3 cell lines from the LINCS project and linked those to 12 therapeutic use categories derived from MeSH. To train the DNN, they utilized both gene-level transcriptomic data and transcriptomic data processed using a pathway activation scoring algorithm, for a pooled dataset of samples perturbed with different concentrations of the drug for 6 and 24 hours. In both gene- and pathway-level classification, the DNN convincingly outperformed the support vector machine (SVM) model on every multiclass classification problem; however, models based on pathway-level classification performed better. For the first time, they demonstrated that a deep neural net trained on transcriptomic data could recognize pharmacological properties of multiple drugs across different biological systems and conditions. They also proposed using deep neural net confusion matrices for drug repositioning.

Deep Mining Heterogeneous Networks of Biomedical Linked Data to Predict Novel Drug-Target Associations (Apr 2017) [code] proposed a similarity-based drug-target prediction method that enhanced existing association discovery methods by using a topology-based similarity measure. DeepWalk, a deep learning method, was used to calculate the similarities within Linked Tripartite Network (LTN), a heterogeneous network generated from biomedical linked datasets. The proposed method showed promising results for drug-target association prediction: 98.96% AUC ROC score with a 10-fold cross-validation and 99.25% AUC ROC score with a Monte Carlo cross-validation on LTN. Using DeepWalk, (i) this method outperformed other existing topology-based similarity computation methods; (ii) performance was better for tripartite than for bipartite networks; and (iii) similarity measures derived from network topology outperformed those derived from chemical structure (drugs) or genomic sequence (targets). The proposed methodology proved capable of providing a promising solution for drug-target prediction based on topological similarity within a heterogeneous network.

Most drug discovery efforts require expert knowledge and expensive biological experiments to identify physicochemical and physiological properties. While there is growing interest in using supervised machine learning to automatically identify those chemical molecular properties, progress in performance and accuracy has been limited by the amount of available training data. Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery (Aug 2017) [code; Python notebook] proposed a novel unsupervised molecular embedding method, providing a continuous feature vector for each molecule to support downstream tasks, e.g. solubility classification. In the proposed method, a multi-layered gated recurrent unit (GRU) network was used to map the input molecule into a continuous feature vector of fixed dimensionality, and another deep GRU network was then employed to decode the continuous vector back to the original molecule. As a result, the continuous encoding vector was expected to contain sufficient information to recover the original molecule and predict its chemical properties. The proposed embedding method could utilize almost unlimited molecule data for the training phase. With sufficient information encoded in the vector, the proposed method was also robust and task-insensitive.

The use of drug combinations, termed polypharmacy, is common for treating patients with complex diseases or co-existing conditions. However, a major consequence of polypharmacy is a much higher risk of adverse side effects for the patient. Polypharmacy side effects may emerge because of drug-drug interactions, in which activity of one drug may change favorably or unfavorably if taken with another drug. The knowledge of drug interactions is often limited because these complex relationships are rare, and are usually not observed in relatively small clinical testing. Discovering polypharmacy side effects thus remains an important challenge with significant implications for patient mortality and morbidity.

Modeling Polypharmacy Side Effects with Graph Convolutional Networks (Jul 2018) [project  (code/datasets);  discussion] – by Jure Leskovec and colleagues at Stanford University – presented Decagon, an approach for modeling polypharmacy side effects. The approach constructed a multimodal graph of protein-protein interactions, drug-protein target interactions and the polypharmacy side effects, which were represented as drug-drug interactions, where each side effect was an edge of a different type. … Unlike approaches limited to predicting simple drug-drug interaction values, Decagon could predict the exact side effect, if any, through which a given drug combination manifests clinically.

Decagon accurately predicted polypharmacy side effects, outperforming baselines by up to 69%. It automatically learned representations of side effects indicative of co-occurrence of polypharmacy in patients. Furthermore, Decagon modeled side effects with a strong molecular basis particularly well, while on predominantly non-molecular side effects it still achieved good performance because of effective sharing of model parameters across edge types.
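At the core of a multirelational graph encoder like Decagon's is relational message passing: each edge type gets its own weight matrix and messages are summed across relations. A minimal single-layer sketch with toy matrices (the real model adds normalization, multiple layers, and a per-side-effect tensor-factorization decoder):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy multirelational graph: 4 nodes, 2 edge types, one adjacency matrix each.
# Matrices and dimensions are illustrative only.
n_nodes, n_feat, n_hidden = 4, 5, 3
A = {  # edge type -> adjacency matrix
    "ppi": np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]], float),
    "drug_target": np.array([[0, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 1], [1, 0, 1, 0]], float),
}
H = rng.normal(size=(n_nodes, n_feat))                 # input node features
W = {r: rng.normal(size=(n_feat, n_hidden)) for r in A}  # per-relation weights

# One layer: sum relation-specific messages, then a ReLU nonlinearity.
H_next = np.maximum(0, sum(A[r] @ H @ W[r] for r in A))
```

Stacking such layers yields node embeddings; Decagon then scores each (drug, side effect, drug) triple with a relation-specific decoder over the two drug embeddings.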

Systematic Integration of Biomedical Knowledge Prioritizes Drugs for Repurposing (Sep 2017) [code;  see also first author Himmelstein’s GitHub repositories] described Project Rephetio, which systematically modeled drug efficacy based on 755 existing treatments. They first constructed Hetionet, an integrative network encoding knowledge from millions of biomedical studies. Hetionet v1.0 consisted of 47,031 nodes of 11 types and 2,250,197 relationships of 24 types. Data were integrated from 29 public resources to connect compounds, diseases, genes, anatomies, pathways, biological processes, molecular functions, cellular components, pharmacologic classes, side effects, and symptoms. Next, they identified network patterns that distinguish treatments from non-treatments. They then predicted the probability of treatment for 209,168 compound-disease pairs. Their predictions were validated on two external sets of treatments and provided pharmacological insights on epilepsy, suggesting they will help prioritize drug repurposing candidates.

• Himmelstein et al. used a computational approach to analyze 50,000 data points – including drugs, diseases, genes and symptoms – from 19 different public databases. This approach made it possible to create more than two million relationships among the data points, which could be used to develop models that predict which drugs currently in use by doctors might be best suited to treat any of 136 common diseases. For example, Himmelstein et al. identified specific drugs currently used to treat depression and alcoholism that could be repurposed to treat smoking addiction and epilepsy. These findings provide a new and powerful way to study drug repurposing.
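Rephetio quantifies its network patterns with metapath features such as the degree-weighted path count (DWPC), which downweights paths that pass through high-degree hub nodes. A toy sketch (the graph, damping exponent, and single metapath below are illustrative, not Hetionet's):

```python
# Toy graph: two compounds sharing a target gene linked to a disease.
GRAPH = {
    "aspirin":   ["PTGS2"],
    "celecoxib": ["PTGS2"],
    "PTGS2":     ["aspirin", "celecoxib", "pain"],
    "pain":      ["PTGS2"],
}

def dwpc_path_weight(graph, path_nodes, w=0.4):
    """Weight one path by the product of its nodes' degrees raised to -w."""
    weight = 1.0
    for node in path_nodes:
        weight *= len(graph[node]) ** -w
    return weight

def dwpc_feature(graph, src, dst, length=2, w=0.4):
    """Sum DWPC over all simple paths of a given length (one 'metapath')."""
    total = 0.0
    def walk(path):
        nonlocal total
        if len(path) - 1 == length:
            if path[-1] == dst:
                total += dwpc_path_weight(graph, path, w)
            return
        for nbr in graph[path[-1]]:
            if nbr not in path:
                walk(path + [nbr])
    walk([src])
    return total
```

In Rephetio, one such feature is computed per metapath type, and the resulting feature vector feeds a regularized logistic-regression classifier over compound-disease pairs.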

Safe Medicine Recommendation via Medical Knowledge Graph Embedding (Oct 2017) [code not available, but note the similarity of the embedding specification to those summarized in my RESCAL summary] proposed a framework, Safe Medicine Recommendation (SMR), that first constructed a high-quality heterogeneous graph by bridging electronic medical records and medical knowledge graphs (the ICD-9 ontology and DrugBank), then jointly embedded diseases, medicines, patients, and their corresponding relations into a shared lower-dimensional space. Finally, SMR used the embeddings to decompose medicine recommendation into a link prediction process that considers the patient’s diagnoses and adverse drug reactions.
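The link-prediction step can be illustrated with a translational (TransE-style) scoring function of the kind commonly used for such knowledge graph embeddings; whether SMR uses exactly this score is not stated here, and every name and vector below is invented for illustration:

```python
import numpy as np

# Illustrative embeddings: a (patient, recommended_medicine, drug) triple is
# plausible when head + relation is close to tail in the embedding space.
emb = {
    "patient_1": np.array([0.2, 0.5, -0.1]),
    "metformin": np.array([1.2, 0.0, 0.4]),
    "warfarin":  np.array([-0.9, 0.8, 1.1]),
}
rel = {"recommended_medicine": np.array([1.0, -0.5, 0.5])}

def score(head, relation, tail):
    # Higher (less negative) = more plausible, as in TransE.
    return -np.linalg.norm(emb[head] + rel[relation] - emb[tail])

candidates = ["metformin", "warfarin"]
best = max(candidates, key=lambda d: score("patient_1", "recommended_medicine", d))
```

Recommendation then reduces to ranking candidate medicines by this score, with adverse-drug-reaction relations pushing unsafe combinations apart in the same space.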

SemaTyP: A Knowledge Graph Based Literature Mining Method for Drug Discovery (May 2018) proposed a biomedical knowledge graph-based drug discovery method that discovers candidate drugs for diseases by mining the biomedical literature, illustrating heuristics useful for mining both direct and indirect biomedical relations. They first constructed a biomedical knowledge graph from the relations extracted from biomedical abstracts; a logistic regression model was then trained by learning the semantic types of paths of known drug therapies present in the knowledge graph; finally, the learned model was used to discover drug therapies for new diseases.

GrEDeL: A Knowledge Graph Embedding Based Method for Drug Discovery from Biomedical Literatures (2019) [code] proposed a biomedical knowledge graph embedding based RNN method, GrEDeL, which discovered potential drugs for diseases by mining published biomedical literature. GrEDeL first built a biomedical knowledge graph by exploiting the relations extracted from biomedical abstracts. The graph data were then converted into a low-dimensional space by leveraging knowledge graph embedding methods. An RNN model was then trained on the known drug therapies, represented by graph embeddings. Finally, GrEDeL used the learned model to discover candidate drugs for diseases of interest from the biomedical literature. Experimental results showed that their method could not only effectively discover new drugs by mining the literature, but also provide the corresponding mechanisms of action for the candidate drugs.

Drug Prioritization Using the Semantic Properties of a Knowledge Graph (Apr 2019) [code].  “Compounds that are candidates for drug repurposing can be ranked by leveraging knowledge available in the biomedical literature and databases. This knowledge, spread across a variety of sources, can be integrated within a knowledge graph, which thereby comprehensively describes known relationships between biomedical concepts, such as drugs, diseases, genes, etc. Our work uses the semantic information between drug and disease concepts as features, which are extracted from an existing knowledge graph that integrates 200 different biological knowledge sources. RepoDB, a standard drug repurposing database which describes drug-disease combinations that were approved or that failed in clinical trials, is used to train a random forest classifier. The 10-times repeated 10-fold cross-validation performance of the classifier achieves a mean area under the receiver operating characteristic curve (AUC) of 92.2%. We apply the classifier to prioritize 21 preclinical drug repurposing candidates that have been suggested for Autosomal Dominant Polycystic Kidney Disease (ADPKD). Mozavaptan, a vasopressin V2 receptor antagonist is predicted to be the drug most likely to be approved after a clinical trial, and belongs to the same drug class as tolvaptan, the only treatment for ADPKD that is currently approved. We conclude that semantic properties of concepts in a knowledge graph can be exploited to prioritize drug repurposing candidates for testing in clinical trials.”

## Pharmacology

• Neuro-Symbolic Representation Learning on Biological Knowledge Graphs (Sep 2017) … they repeatedly performed random walks from each node in the knowledge graph, using the resulting walks as sentences within a corpus, and applied a word2vec skip-gram model (DeepWalk) to learn the embeddings for each node. Through the use of symbolic logic, these embeddings contain both explicit and implicit information. These embeddings were applied to the prediction of edges in the KG, applicable to tasks such as the prediction of biological function, finding candidate genes of diseases, and identifying protein-protein interactions or drug-target relations.

• MedSim: A Novel Semantic Similarity Measure in Bio-medical Knowledge Graphs (Dec 2018) [dataset] presented MedSim, a novel semantic similarity method based on well-established public biomedical knowledge graphs and a large-scale corpus, to study the therapeutic substitution of antibiotics. MedSim constructed multi-dimensional medicine-specific feature vectors. On an evaluation dataset of 528 antibiotic pairs scored by doctors, MedSim demonstrated a statistically significant improvement over other semantic similarity methods. Furthermore, some promising case study applications of MedSim in drug substitution and drug abuse prevention were presented.

• Discovering Causal Pathways Linking Genomic Events to Transcriptional States using Tied Diffusion Through Interacting Events (Nov 2013) used a network diffusion approach (TieDIE) to connect genomic perturbations to gene expression changes characteristic of cancer subtypes. … Because many transcription factors are not conventionally regarded as being druggable, approaches such as TieDIE that pinpoint influences upstream of these factors but still in neighborhoods proximal to key driving mutations may provide key starting points for identifying new drug targets.

• DeepWalk was used in Deep Mining Heterogeneous Networks of Biomedical Linked Data to Predict Novel Drug-Target Associations (Apr 2017) [code].

• edge2vec: Learning Node Representation using Edge Semantics (Sep 2018) incorporated edge semantics to represent different edge-types in heterogeneous networks. edge2vec (aka heterogeneous node2vec / H-node2vec) was validated and evaluated using three medical domain problems on an ensemble of complex medical networks: medical entity classification, compound-gene binding prediction, and medical information search costs.
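The network diffusion underlying propagation methods such as TieDIE (third bullet above) can be sketched as a random-walk-with-restart iteration; the toy adjacency matrix and restart probability below are illustrative, and TieDIE itself additionally diffuses from both endpoints and intersects the results:

```python
import numpy as np

# Random-walk-with-restart diffusion: heat injected at seed nodes spreads
# along edges while a (1 - alpha) fraction restarts at the seeds each step.
def diffuse(W, seeds, alpha=0.5, iters=100):
    """W: column-normalized adjacency matrix; seeds: initial heat vector."""
    f = seeds.copy()
    for _ in range(iters):
        f = alpha * W @ f + (1 - alpha) * seeds
    return f

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], float)
W = A / A.sum(axis=0)              # column-normalize (column-stochastic)
seeds = np.array([1.0, 0, 0, 0])   # heat starts at the perturbed node
heat = diffuse(W, seeds)
```

Nodes near the seed retain the most heat, giving a ranked neighborhood of candidate intermediate regulators.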

## Prediction

Similar to the skip-gram approach applied to learning word embeddings in word2vec, DeepWalk: Online Learning of Social Representations (Jun 2014) [code] generalized language modeling and unsupervised feature learning from sequences of words to graphs, using local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. In other words, DeepWalk learned latent representations of vertices in a network in a continuous vector space via truncated random walks.
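A minimal sketch of the walk-generation half of DeepWalk; the resulting "sentences" would then be passed to a skip-gram model such as gensim's Word2Vec. The toy graph and walk parameters are illustrative:

```python
import random

# Toy graph as adjacency lists. DeepWalk treats each truncated random walk
# as a sentence and each node as a word.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a", "b", "d"],
    "d": ["c"],
}

def random_walk(graph, start, length, rng):
    """One truncated random walk: repeatedly hop to a uniform random neighbor."""
    walk = [start]
    while len(walk) < length:
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

def build_corpus(graph, walks_per_node=10, walk_length=5, seed=0):
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        for node in graph:
            corpus.append(random_walk(graph, node, walk_length, rng))
    return corpus
```

Feeding `build_corpus(GRAPH)` to a skip-gram model (e.g. `gensim.models.Word2Vec(corpus, sg=1)`) yields one embedding vector per node, with co-visited nodes placed close together.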

Although subsequently outperformed by LINE  and other models, the DeepWalk approach has proven to be very popular. For example, recent applications of DeepWalk in the biomedical domain include the following works.

edge2vec: Learning Node Representation using Edge Semantics (Sep 2018) proposed a model that incorporated edge semantics to represent different edge-types in heterogeneous networks. An edge-type transition matrix was optimized within an Expectation-Maximization framework as an extra criterion for a biased node random walk on networks, and a biased skip-gram model was then leveraged to learn node embeddings based on the random walks. edge2vec was validated and evaluated using three medical domain problems on an ensemble of complex medical networks (>10 node and edge types): medical entity classification, compound-gene binding prediction, and medical information search costs. By considering edge semantics, edge2vec significantly outperformed other state-of-the-art models on all three tasks.

• Note that in the paper’s results tables edge2vec is listed as heterogeneous node2vec (H-node2vec); those tables also show heterogeneous node2vec outperforming LINE.

Biomarkers of aging can be used to assess the health of individuals and to study aging and age-related diseases. Predicting Age from the Transcriptome of Human Dermal Fibroblasts (Dec 2018) [code; discussion; media] generated a large dataset of genome-wide RNA-seq profiles of human dermal fibroblasts from 133 people aged 1 to 94 years old to test whether signatures of aging were encoded within the transcriptome. They developed an ensemble machine learning method that predicted age to a median error of 4 years, outperforming previous methods used to predict age. The ensemble was further validated by testing it on ten progeria patients: their method was the only one that predicted accelerated aging in those patients.

Nontargeted DNA mutations produced by cellular repair of CRISPR-Cas9-generated double-strand breaks are not random, but depend on the DNA sequence at the targeted location. Predicting the Mutations Generated by Repair of Cas9-Induced Double-Strand Breaks (Nov 2018) [code; media] systematically studied the influence of flanking DNA sequences on repair outcomes by measuring the edits generated by >40,000 guide RNAs (gRNAs) in synthetic constructs. Experiments were conducted in a range of genetic backgrounds using alternative CRISPR-Cas9 reagents, generating data for >10^9 mutational outcomes. The most reproducible mutations were insertions of a single base, short deletions, or longer microhomology-mediated deletions. Each gRNA had an individual cell-line-dependent bias toward particular outcomes. The mutant sequences were used to derive a predictor of Cas9 editing outcomes, which may allow better design of gene editing experiments.
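One sequence determinant such predictors exploit is microhomology: short repeated motifs flanking the cut favor particular deletion outcomes during end-joining repair. A simplified detector (not the paper's trained model; the window size, coordinates, and example sequence are illustrative):

```python
def microhomologies(seq, cut, min_len=2, window=10):
    """Find short motifs repeated on both sides of a cut site.

    Returns (motif, left_start, right_start) tuples; such repeats bias
    repair toward the deletion that collapses the two copies into one.
    """
    left = seq[max(0, cut - window):cut]
    right = seq[cut:cut + window]
    hits = []
    for k in range(min_len, min(len(left), len(right)) + 1):
        for i in range(len(left) - k + 1):
            motif = left[i:i + k]
            j = right.find(motif)
            if j != -1:
                hits.append((motif, cut - len(left) + i, cut + j))
    return hits
```

For example, with `"GGACTGTTCCACTGAA"` cut between positions 7 and 8, the repeated `ACTG` on both flanks is reported, predicting a microhomology-mediated deletion spanning the cut.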

• “The initiation site of DNA replication is called the origin of replication (ORI), which is regulated by a set of regulatory proteins and plays important roles in the basic biochemical process during cell growth and division in all living organisms. … In this review, we summarize the current progress in computational prediction of eukaryotic ORIs including the collection of benchmark datasets, the application of machine learning-based techniques, the results obtained by these methods, and the construction of web servers. …”

• [note: student-authored paper (Stanford cs224d project)] Learning the Language of the Genome using RNNs (2016) [code]

• “We explore how deep recurrent neural network (RNN) architectures can be used to capture the structure within a genetic sequence. We first confirm that a character-level RNN can capture the non-random parts of DNA by comparing the perplexity obtained after training on a real genome to that obtained after training on a random sequence of nucleotides. We then train a bidirectional character-level RNN to predict whether a given genomic sequence will interact with a variety of transcription factors, DNase I hypersensitive sites, and histone marks. Because multiple biological objects can interact with a given sequence, we cast this latter problem as a multitask learning problem. We empirically show how a deep network can outperform a baseline model on a significant majority of binary labeling tasks.”

## Recommendation

In Learning a Health Knowledge Graph from Electronic Medical Records (Jul 2017), maximum likelihood estimation of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, a naive Bayes classifier, and a Bayesian network using noisy-OR gates. … A graph of disease-symptom relationships was elicited from the learned parameters, and the constructed knowledge graphs were evaluated and validated, with permission, against Google’s manually-constructed knowledge graph and against expert physician opinions. The noisy-OR model significantly outperformed the other tested models, producing a high quality knowledge graph reaching precision of 0.85 for a recall of 0.6 in the clinical evaluation.
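The noisy-OR gate composes per-disease causal strengths into a single symptom probability: each present disease independently fails to cause the symptom with probability (1 - p_i), and a small "leak" term covers unmodeled causes. A minimal sketch with illustrative (not learned) parameters:

```python
# Noisy-OR gate: P(symptom | active causes) = 1 - (1 - leak) * prod(1 - p_i).
# The probabilities here are invented; in the paper they are fit by maximum
# likelihood from EMR co-occurrence data.
def noisy_or(cause_probs, leak=0.01):
    """Probability the symptom is present given its active causes' strengths."""
    p_absent = 1 - leak
    for p in cause_probs:
        p_absent *= (1 - p)
    return 1 - p_absent

p = noisy_or([0.7, 0.4])   # two active diseases jointly explain the symptom
```

The learned p_i parameters directly give the weighted disease-symptom edges of the resulting knowledge graph, which is why this model's output is easy to inspect clinically.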

Exploiting Semantic Patterns Over Biomedical Knowledge Graphs for Predicting Treatment and Causative Relations (Jun 2018) first built a large knowledge graph of biomedical relations obtained from the National Library of Medicine (NLM)’s Unified Medical Language System Metathesaurus (UMLS). They then refrained from NLP approaches that looked at individual sentences to extract a potential relation, instead exploiting semantic path patterns over this graph to build models for specific predicates. Instead of looking at what a particular sentence conveys, they modeled their prediction problem at a global level, outputting probability estimates of whether a pair of entities participated in a particular relation. A different binary classification model was trained for each predicate. While the approach was demonstrated using the “TREATS” and “CAUSES” predicates drawn from the UMLS Metathesaurus SemMedDB (Semantic Medline Database), their method also generalized to other predicates (such as “DISRUPTS” and “PREVENTS”), and could also complement other lexical and syntactic pattern-based distant supervision approaches for relation extraction.

While code for Exploiting Semantic Patterns Over Biomedical Knowledge Graphs for Predicting Treatment and Causative Relations is not available, code for reasoning over SemMedDB is provided in the mediKanren repository on GitHub [media]. mediKanren is well described in the media report A ‘High-Speed Dr. House’ for Medical Breakthroughs (May 2018)  [local copy]. Notably (paraphrased here):

• “mediKanren is a reasoning engine for biomedical knowledge … It understands the arguments made in scientific research: $\small \text{X inhibits Y}$, $\small \text{Y causes Z}$, and can draw logical conclusions between them. … Crucially, mediKanren doesn’t just return results, but also provides an explanation for why they might work. … Users can follow mediKanren’s thought processes, seeing the chain of research papers and other clues the program followed to arrive at those conclusions (logical reasoning) …”

• “It’s fast enough so that you can do many queries in an hour-long consultation. And even if the search turns up nothing today, it could be automated to run every week – constantly querying the medical literature for new insights.”

• “mediKanren is programmed to find hidden connections in biomedical data by seeking out logical relationships between concepts. Type in two concepts and a predicate – for example, a disease and a possible new treatment option – and mediKanren will search through 97 million assertions taken from more than 20 million scientific papers. (Updates are adding more databases, including a full list of FDA-approved medications.) In 15 seconds the program highlights several possible connections, serving up links to the relevant papers on PubMed, the NIH’s vast research database. mediKanren may find tens of thousands of possible links to explore, but it is carefully designed to highlight the top results for users. ‘If people don’t find something they’re interested in within the first five results, they’ll give up,’ says Byrd.”
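The two-hop chaining described above ("X inhibits Y, Y causes Z") can be sketched as a join over extracted triples. mediKanren itself is a miniKanren-based relational engine over tens of millions of assertions, so this tiny Python stand-in with invented triples is purely illustrative:

```python
# Toy assertion store; each triple is (subject, predicate, object).
TRIPLES = [
    ("drug_X", "INHIBITS", "gene_Y"),
    ("gene_Y", "CAUSES", "disease_Z"),
    ("drug_Q", "INHIBITS", "gene_P"),
]

def may_treat(triples):
    """Chain INHIBITS with CAUSES via the shared middle entity,
    returning each hypothesis with its supporting explanation."""
    inhibits = {(s, o) for s, p, o in triples if p == "INHIBITS"}
    causes = {(s, o) for s, p, o in triples if p == "CAUSES"}
    return [
        (x, z, f"{x} INHIBITS {y}; {y} CAUSES {z}")
        for x, y in inhibits
        for y2, z in causes
        if y == y2
    ]
```

The explanation string mirrors mediKanren's key property: every conclusion is returned together with the chain of assertions that supports it.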

There are between 6,000 and 7,000 known rare diseases today. Identifying and diagnosing a patient with a rare disease is time consuming, cumbersome and cost intensive, and requires resources generally available only at large hospital centers. Furthermore, most medical doctors, especially general practitioners, will likely see at most one patient with any given rare disease. A cognitive assistant for differential diagnosis in rare disease would provide online knowledge on all rare diseases, help create a weighted list of diagnoses, and give access to the evidence base on which the list was created. Cognitive DDx Assistant in Rare Diseases (Jul 2018) was built on knowledge graph technology that incorporated data from ICD-10, DOID, MedDRA, PubMed, Wikipedia, Orphanet, the CDC and anonymized patient data. The final knowledge graph (comprising over 500,000 nodes) was tested with 101 published rare disease cases, delivering 79.5% accuracy in finding the diagnosis in the top 1% of nodes. A further learning step was taken to rank the correct result in the TOP 15 hits. With a reduced data pool, 51% of the 101 cases were tested, delivering the correct result in the TOP 3-13 (TOP 6 on average) for 74% of these cases. The results demonstrated that data curation is among the most critical aspects of delivering accurate results, and that knowledge graphs can deliver cognitive solutions for differential diagnosis in rare disease that can be applied in clinical practice.

Diagnosing an inherited disease often requires identifying the pattern of inheritance in a patient’s family. In an interesting approach, Explainable Genetic Inheritance Pattern Prediction (Dec 2018) [code] represented family trees with genetic patterns of inheritance using hypergraphs and latent state space models to provide explainable inheritance pattern predictions. Their approach, GraphLSSM, allowed for exact causal inference over a patient’s possible genotypes given their relatives’ phenotypes. By design, inference could be examined at a low level to provide explainable predictions. Furthermore, they made use of human intuition by providing a method to assign hypothetical evidence to any inherited gene alleles. Their analysis supported the application of latent state space models to improve patient care in cases of rare inherited diseases, where access to genetic specialists is limited.

Interpretable Graph Convolutional Neural Networks for Inference on Noisy Knowledge Graphs (Dec 2018) provided a new graph convolutional neural network (GCNN) formulation for link prediction on graph data that addressed common challenges for biomedical knowledge graphs (KGs). They introduced a regularized attention mechanism for GCNNs that not only improved performance on clean datasets, but also favorably accommodated noise in KGs, a pervasive issue in real-world applications. Further, they explored new visualization methods for interpretable modelling and illustrated how the learned representation could be exploited to automate dataset denoising. Results were demonstrated on a synthetic dataset (FB15k-237) and a large biomedical knowledge graph derived from a combination of noisy and clean data sources. Using these improvements, they visualized a learned model representation of the disease cystic fibrosis and demonstrated how to interrogate a neural network to show the potential of PPARG (PPAR-γ: peroxisome proliferator-activated receptor gamma) as a candidate therapeutic target for rheumatoid arthritis.

## Resources

• “Advances in machine learning, coupled with rapidly growing genome sequencing and molecular profiling datasets, are catalyzing progress in genomics. In particular, predictive machine learning models, which are mathematical functions trained to map input data to output values, have found widespread usage. Prominent examples include calling variants from whole-genome sequencing data, estimating CRISPR guide activity and predicting molecular phenotypes, including transcription factor binding, chromatin accessibility and splicing efficiency, from DNA sequence. Once trained, these models can be probed in silico to infer quantitative relationships between diverse genomic data modalities, enabling several key applications such as the interpretation of functional genetic variants and rational design of synthetic genes.

“Here, we present Kipoi (Greek for ‘gardens’, pronounced ‘kípi’), an open science initiative to foster sharing and reuse of trained models in genomics. Already, the Kipoi repository (Fig. 1, middle) offers more than 2,000 individual trained models from 22 distinct studies that cover key predictive tasks in genomics, including the prediction of chromatin accessibility, transcription factor binding, and alternative splicing from DNA sequence. Kipoi is accessible via GitHub and as web resource (https://kipoi.org), providing a browsable interface to explore and search models for specific tasks.”

## Representation Learning

• “In the field of artificial intelligence, a combination of scale in data and model capacity enabled by unsupervised learning has led to major advances in representation learning and statistical generation. In biology, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. Learning the natural distribution of evolutionary protein sequence variation is a logical step toward predictive and generative modeling for biology. To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million sequences spanning evolutionary diversity. The resulting model maps raw sequences to representations of biological properties without labels or prior domain knowledge. The learned representation space organizes sequences at multiple levels of biological granularity from the biochemical to proteomic levels. Learning recovers information about protein structure: secondary structure and residue-residue contacts can be extracted by linear projections from learned representations. With small amounts of labeled data, the ability to identify tertiary contacts is further improved. Learning on full sequence diversity rather than individual protein families increases recoverable information about secondary structure. We show the networks generalize by adapting them to variant activity prediction from sequences only, with results that are comparable to a state-of-the-art variant predictor that uses evolutionary and structurally derived features.”

## Visualization

“Data visualization has become very crucial in the post genomic era where the accumulation of genomic information is mounting exponentially. Visualizing protein interactions in the context of the interactome is an essential application, especially in association with disease phenotypes. Here, we describe Proteinarium, a multi-sample protein-protein interaction network visualization and analysis tool to identify clusters of samples with genomic data derived from protein-protein interactions. Proteinarium is a command-line tool written in Java with no external dependencies and it is freely available at https://github.com/Armanious/Proteinarium.”
