### Technical Review

# Biomedical Applications of Machine Learning

Two recent reviews provide an excellent introduction to, and overview of, many of the topics discussed in this technical review:

• “Network-based analytics plays an increasingly important role in precision oncology. Growing evidence in recent studies suggests that cancer can be better understood through mutated or dysregulated pathways or networks rather than individual mutations and that the efficacy of repositioned drugs can be inferred from disease modules in molecular networks.

• This article reviews network-based machine learning and graph theory algorithms for integrative analysis of personal genomic data and biomedical knowledge bases to identify tumor-specific molecular mechanisms, candidate targets and repositioned drugs for personalized treatment.

• The review focuses on the algorithmic design and mathematical formulation of these methods to facilitate applications and implementations of network-based analysis in the practice of precision oncology. We review the methods applied in three scenarios to integrate genomic data and network models in different analysis pipelines, and we examine three categories of network-based approaches for repositioning drugs in drug-disease-gene networks.

• In addition, we perform a comprehensive subnetwork/pathway analysis of mutations in 31 cancer genome projects in the Cancer Genome Atlas and present a detailed case study on ovarian cancer.

• Finally, we discuss interesting observations, potential pitfalls and future directions in network-based precision oncology.”


• “New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include a myriad of properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.”


NeurIPS 2018 (Dec 2018) also featured a workshop, ML4H: Machine Learning for Health (arXiv: Machine Learning for Health (ML4H) Workshop at NeurIPS 2018).

• Genome-wide association studies (GWAS) have emerged as a rich source of genetic clues into disease biology, and they have revealed strong genetic correlations among many diseases and traits. Some of these genetic correlations may reflect causal relationships. Distinguishing correlation from causation using genome-wide association studies (Nov 2018) [code] developed a method to quantify causal relationships between genetically correlated traits using GWAS summary association statistics. Their method quantified what part of the genetic component of $\small \text{trait 1}$ was also causal for $\small \text{trait 2}$, using mixed fourth moments $\small E(\alpha_1^2 \alpha_1 \alpha_2)$ and $\small E(\alpha_2^2 \alpha_1 \alpha_2)$ of the bivariate effect size distribution. If $\small \text{trait 1}$ was causal for $\small \text{trait 2}$, then SNPs affecting $\small \text{trait 1}$ (large $\small \alpha_1^2$) will have correlated effects on $\small \text{trait 2}$ (large $\small \alpha_1 \alpha_2$), but not vice versa. They validated this approach in extensive simulations. Across 52 traits (average $\small N = 331k$), they identified 30 putative genetically causal relationships, many of them novel, including an effect of LDL cholesterol on decreased bone mineral density. More broadly, they demonstrated that it is possible to distinguish between genetic correlation and causation using genetic association data.

• “Genome-wide association studies (GWAS) have identified thousands of common genetic variants (SNPs) affecting disease risk and other complex traits. The same SNPs often affect multiple traits, resulting in a genetic correlation: genetic effect sizes are correlated across the genome, and so are the traits themselves. Some genetic correlations may result from causal relationships. For example, SNPs that cause higher triglyceride levels reliably confer increased risk of coronary artery disease. This causal inference approach, using genetic variants as instrumental variables, is known as Mendelian Randomization (MR). However, genetic variants often have shared, or “pleiotropic”, effects on multiple traits even in the absence of a causal relationship, and pleiotropy is a challenge for MR, especially when it leads to a strong genetic correlation. Statistical methods have been used to account for certain kinds of pleiotropy; however, these approaches too are easily confounded by genetic correlations due to pleiotropy. Here, we develop a robust method to distinguish whether a genetic correlation results from pleiotropy or from causality.”


“Latent causal variable model. The latent causal variable (LCV) model features a latent variable $\small L$ that mediates the genetic correlation between the two traits (Figure 1a). More abstractly, $\small L$ represents the shared genetic component of both traits. $\small \text{Trait 1}$ is fully genetically causal for $\small \text{trait 2}$ if it is perfectly genetically correlated with $\small L$; “fully” means that the entire genetic component of $\small \text{trait 1}$ is causal for $\small \text{trait 2}$ (Figure 1b). More generally, $\small \text{trait 1}$ is partially genetically causal for $\small \text{trait 2}$ if the latent variable has a stronger genetic correlation with $\small \text{trait 1}$ than with $\small \text{trait 2}$; “partially” means that part of the genetic component of $\small \text{trait 1}$ is causal for $\small \text{trait 2}$. …”
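
The mixed-fourth-moment asymmetry behind this test can be illustrated with a small simulation. This is a minimal sketch, not the authors' implementation: the causal effect `q`, the heavy-tailed (Laplace) effect-size distribution, and the noise scale are all assumptions chosen for illustration; the test requires non-Gaussian effect sizes, which is why a leptokurtic distribution is used.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # number of simulated SNPs

# Hypothetical effect-size model: trait 1 is fully causal for trait 2
# with effect q, so per-SNP effects satisfy alpha2 = q*alpha1 + noise.
q = 0.5
alpha1 = rng.laplace(0.0, 1.0, n)          # heavy-tailed effects on trait 1
alpha2 = q * alpha1 + rng.normal(0.0, 1.0, n)

# Mixed fourth moments E(alpha1^2 * alpha1*alpha2) and E(alpha2^2 * alpha1*alpha2)
m1 = np.mean(alpha1**2 * alpha1 * alpha2)
m2 = np.mean(alpha2**2 * alpha1 * alpha2)

# SNPs with large alpha1^2 have correlated effects on trait 2, but not
# vice versa, so the first mixed moment dominates under causality.
print(m1 > m2)  # True
```

Reversing the causal direction in the simulation flips the inequality, which is the asymmetry the method exploits to orient the causal arrow.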


• “The potential benefits of applying machine learning methods to -omics data are becoming increasingly apparent, especially in clinical settings. However, the unique characteristics of these data are not always well suited to machine learning techniques. These data are often generated across different technologies in different labs, and frequently with high dimensionality. In this paper we present a framework for combining -omics data sets, and for handling high dimensional data, making -omics research more accessible to machine learning applications. We demonstrate the success of this framework through integration and analysis of multi-analyte data for a set of 3,533 breast cancers. We then use this data-set to predict breast cancer patient survival for individuals at risk of an impending event, with higher accuracy and lower variance than methods trained on individual data-sets. We hope that our pipelines for data-set generation and transformation will open up -omics data to machine learning researchers. We have made these freely available for noncommercial use at Cambridge Cancer Genomics.”
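
One generic way to make heterogeneous, high-dimensional -omics blocks comparable before combining them is per-block standardization followed by per-block dimensionality reduction. The sketch below uses random stand-in data and PCA via SVD; these are common default choices for illustration, not necessarily the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical multi-omics blocks for the same 100 patients:
# e.g. expression (2000 genes) and methylation (500 probes).
expr = rng.normal(size=(100, 2000))
meth = rng.normal(size=(100, 500))

def reduce_block(X, k):
    """Z-score features within a platform, then project onto the
    top-k principal components so no single high-dimensional block
    dominates the combined representation."""
    Xz = (X - X.mean(0)) / X.std(0)
    U, S, Vt = np.linalg.svd(Xz, full_matrices=False)
    return Xz @ Vt[:k].T

# Concatenate the reduced blocks into one matrix suitable for a
# downstream survival or subtype classifier.
combined = np.hstack([reduce_block(expr, 20), reduce_block(meth, 20)])
print(combined.shape)  # (100, 40)
```

Reducing each block separately before concatenation keeps the 2000-feature block from swamping the 500-feature one, which is one of the dimensionality issues the paragraph above alludes to.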

• “Cambridge Cancer Genomics is using blood tests to guide smarter cancer therapy. Currently, cancer patients have to wait up to 6 months to know whether their chemotherapy is working. In the interim, patients suffer the side effects of such treatments. Using simple blood draws, CCG shortens the time required to know whether treatment is working, buying the clinician more time to alter treatment and reduce unnecessary side effects. In addition, we can identify relapse an average of 7 months earlier than standard practice. Over time, CCG will be able to better predict the best therapeutic strategy for cancer patients before they even begin treatment.”


• “Understanding the three-dimensional (3D) structure of the genome is essential for elucidating vital biological processes and their links to human disease. To determine how the genome folds within the nucleus, chromosome conformation capture methods such as HiC have recently been employed. However, computational methods that exploit the resulting high-throughput, high-resolution data are still suffering from important limitations. In this work, we explore the idea of manifold learning for the 3D chromatin structure inference and present a novel method, REcurrent Autoencoders for CHromatin 3D structure prediction (REACH-3D). Our framework employs autoencoders with recurrent neural units to reconstruct the chromatin structure. In comparison to existing methods, REACH-3D makes no transfer function assumption and permits dynamic analysis. Evaluating REACH-3D on synthetic data indicated high agreement with the ground truth. When tested on real experimental HiC data, REACH-3D recovered most faithfully the expected biological properties and obtained the highest correlation coefficient with microscopy measurements. Last, REACH-3D was applied to dynamic HiC data, where it successfully modeled chromatin conformation during the cell cycle.”
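
For context, the "transfer function assumption" that REACH-3D avoids can be sketched as follows: classical pipelines convert Hi-C contact counts into spatial distances via an assumed power law and then embed the distances, e.g. with classical multidimensional scaling. The toy contact matrix, bin count, and exponent below are illustrative assumptions, not part of REACH-3D.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy symmetric Hi-C contact matrix for 50 genomic bins (counts >= 1).
n = 50
C = rng.poisson(5.0, size=(n, n)) + 1
C = (C + C.T) / 2

# Classical "transfer function" assumption: d_ij = c_ij^(-alpha),
# i.e. frequently contacting loci are spatially close (alpha is a
# tuning assumption that REACH-3D does away with).
alpha = 0.5
D = C ** -alpha
np.fill_diagonal(D, 0.0)

# Embed the distance matrix into 3D with classical MDS.
J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
B = -0.5 * J @ (D**2) @ J                # double-centered Gram matrix
w, V = np.linalg.eigh(B)
top = np.argsort(w)[::-1][:3]            # three largest eigenvalues
coords = V[:, top] * np.sqrt(np.maximum(w[top], 0.0))
print(coords.shape)  # (50, 3)
```

The choice of `alpha` directly changes the reconstructed geometry, which is why avoiding a fixed transfer function is presented as an advantage.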


• Protein interactions are the fundamental building blocks of almost every cellular activity. Identifying protein communities from Protein-Protein Interaction (PPI) networks is essential for understanding the principles of cellular organization and exploring the causes of various diseases. It is critical to integrate multiple data resources to identify reliable protein communities with biological significance, and to improve the performance of community detection methods on large-scale PPI networks. Parallel Protein Community Detection in Large-scale PPI Networks Based on Multi-source Learning (Oct 2018) proposed a Multi-source Learning based Protein Community Detection (MLPCD) algorithm that integrates Gene Expression Data (GED), together with a parallel implementation of MLPCD built on cloud computing technology. To effectively discover the biological functions of proteins that participate in different cellular processes, GED under different conditions was integrated with the original PPI network to reconstruct a Weighted-PPI (WPPI) network.

To flexibly identify protein communities of different scales, the authors defined community modularity and functional cohesion measurements and detected protein communities from the WPPI network using an agglomerative method. In addition, the detected communities were compared with known protein complexes to evaluate the functional enrichment of protein function modules using Gene Ontology annotations. A parallel implementation of the MLPCD algorithm on the Apache Spark platform enhanced its performance on large-scale realistic PPI networks. Experimental results indicated the advantages of the MLPCD algorithm over related algorithms in terms of accuracy and performance.
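
The WPPI construction step can be sketched as follows. This is a minimal illustration, not the MLPCD implementation: the edge list, the random expression profiles, and the use of absolute Pearson correlation as the co-expression weight are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical inputs: a PPI edge list and a gene-expression matrix
# (one expression profile per protein, measured across 12 conditions).
proteins = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
expr = {p: rng.normal(size=12) for p in proteins}

def coexpression(x, y):
    """Absolute Pearson correlation of two expression profiles."""
    return abs(float(np.corrcoef(x, y)[0, 1]))

# Reweight each PPI edge by co-expression to form the WPPI network;
# edges between co-expressed proteins are strengthened, supporting
# condition-specific community detection downstream.
wppi = {(u, v): coexpression(expr[u], expr[v]) for u, v in edges}
for (u, v), w in wppi.items():
    print(f"{u}-{v}: {w:.3f}")
```

An agglomerative community detector would then merge nodes greedily on this weighted graph, using the modularity and functional cohesion measures described above as merge criteria.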


• AlphaFold: Using AI For Scientific Discovery (Dec 2018) [abstract;  discussion here and here]

• “DeepMind has brought together experts from the fields of structural biology, physics, and machine learning to apply cutting-edge techniques to predict the 3D structure of a protein based solely on its genetic sequence. Our system, AlphaFold, which we have been working on for the past two years, builds on years of prior research in using vast genomic data to predict protein structure. The 3D models of proteins that AlphaFold generates are far more accurate than any that have come before - making significant progress on one of the core challenges in biology.”

• CASP13: 13th Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction; CASP13 Target List


## Mentioned Elsewhere in This Technical Review

• Graph refinement (the task of obtaining subgraphs of interest from over-complete graphs) can have many varied applications. In this paper, Thomas Kipf, Max Welling and colleagues extracted tree structures from image data by first deriving a graph-based representation of the volumetric data and then posing tree extraction as a graph refinement task. … Subgraphs were obtained by jointly training a GNN based encoder-decoder pair, wherein the encoder learned useful edge embeddings from which the edge probabilities were predicted using a simple decoder. They discussed connections between the two classes of methods, and compared them for the task of extracting airways from 3D, low-dose, chest CT data.
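
The encoder-decoder idea can be sketched with the simplest possible decoder: an inner product between node embeddings, squashed through a sigmoid to give an edge probability. This is a hypothetical sketch; the embeddings below are random stand-ins for what a trained GNN encoder would produce, and the paper's actual decoder may differ.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in node embeddings for a 6-node over-complete graph
# (in practice these come from a trained GNN encoder).
Z = rng.normal(scale=0.8, size=(6, 16))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Inner-product decoder: the probability that edge (i, j) belongs
# to the refined subgraph is sigmoid(z_i . z_j).
P = sigmoid(Z @ Z.T)

# Graph refinement: keep only edges whose predicted probability
# exceeds a threshold (0.5 here), discarding the rest.
refined = [(i, j) for i in range(6) for j in range(i + 1, 6) if P[i, j] > 0.5]
print(refined)
```

Training the encoder and decoder jointly against known subgraphs (e.g. annotated airway trees) makes the thresholded edge set approximate the structure of interest.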

• Joint Association and Classification Analysis of Multi-View Data (Nov 2018) [Summary] Multi-view data, that is, matched sets of measurements on the same subjects, have become increasingly common with technological advances in genomics and other fields. Often, the subjects are separated into known classes, and it is of interest to find associations between the views that are related to the class membership. … In this work we propose a framework for Joint Association and Classification Analysis of multi-view data (JACA). We support the methodology with theoretical guarantees for estimation consistency in high-dimensional settings, and numerical comparisons with existing methods. In addition to its joint learning framework, a distinct advantage of our approach is its ability to use partial information: it can be applied both in settings with missing class labels and in settings with missing subsets of views. We apply JACA to colorectal cancer data from The Cancer Genome Atlas project, and quantify the association between RNAseq and miRNA views with respect to consensus molecular subtypes of colorectal cancer.

• Predicting the Mutations Generated by Repair of Cas9-Induced Double-Strand Breaks (Nov 2018) [Summary] The DNA mutation produced by cellular repair of a CRISPR-Cas9-generated double-strand break determines its phenotypic effect. It is known that the mutational outcomes are not random, but depend on DNA sequence at the targeted location. Here we systematically study the influence of flanking DNA sequence on repair outcome by measuring the edits generated by >40,000 guide RNAs (gRNAs) in synthetic constructs. We performed the experiments in a range of genetic backgrounds and using alternative CRISPR-Cas9 reagents. In total, we gathered data for $\small >10^9$ mutational outcomes. The majority of reproducible mutations are insertions of a single base, short deletions or longer microhomology-mediated deletions. Each gRNA has an individual cell-line-dependent bias toward particular outcomes. We uncover sequence determinants of the mutations produced and use these to derive a predictor of Cas9 editing outcomes. Improved understanding of sequence repair will allow better design of gene editing experiments. [media]

• Cognitive DDx Assistant in Rare Diseases (Jul 2018)

“There are between 6,000 and 7,000 known rare diseases today. Identifying and diagnosing a patient with a rare disease is time-consuming, cumbersome and cost-intensive, and requires resources generally available only at large hospital centers. Furthermore, most medical doctors, especially general practitioners, will likely see only one patient with a rare disease, if any at all. A cognitive assistant for differential diagnosis in rare disease will provide knowledge on all rare diseases online, help create a list of weighted diagnoses, and give access to the evidence base on which the list was created. The system is built on knowledge graph technology that incorporates data from ICD-10, DOID, MedDRA, PubMed, Wikipedia, Orphanet, the CDC and anonymized patient data. The final knowledge graph comprised over 500,000 nodes.

“The solution was tested with 101 published cases for rare disease. The learning system improves over training sprints and delivers 79.5% accuracy in finding the diagnosis in the top 1% of nodes. A further learning step was taken to rank the correct result in the TOP 15 hits. With a reduced data pool, 51% of the 101 cases were tested delivering the correct result in the TOP 3-13 (TOP 6 on average) for 74% of these cases. The results show that data curation is among the most critical aspects to deliver accurate results. The knowledge graph technology demonstrates its power to deliver cognitive solutions for differential diagnosis in rare disease that can be applied in clinical practice.”

• Approach for Semi-automatic Construction of Anti-infective Drug Ontology Based on Entity Linking (Jan 2018) [Summary] Ontology can be used for the interpretation of natural language. To construct an anti-infective drug ontology, one needs to design and deploy a methodological step to carry out entity discovery and linking. Medical synonym resources have been an important part of medical natural language processing (NLP); however, they suffer from problems such as low precision and low recall. In this study, an NLP approach is adopted to generate candidate entities. Open ontology is analyzed to extract semantic relations. Six word-vector features and word-level features are selected to perform the entity linking. The extraction results of synonyms with a single feature and with different combinations of features are studied. Experiments show that our selected features achieved a precision of 86.77%, a recall of 89.03% and an F1 score of 87.89%. This paper finally presents the structure of the proposed ontology and its relevant statistical data.
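
As a quick sanity check, the reported F1 score follows from the stated precision and recall, since F1 is their harmonic mean:

```python
# F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R).
precision = 86.77
recall = 89.03
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 87.89, matching the reported score
```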

• “We present MedSim, a novel semantic similarity method based on public, well-established biomedical knowledge graphs (KGs) and a large-scale corpus, to study the therapeutic substitution of antibiotics. Besides the hierarchy and corpus of KGs, MedSim further interprets medicine characteristics by constructing multi-dimensional medicine-specific feature vectors. A dataset of 528 antibiotic pairs scored by doctors is applied for evaluation, and MedSim produced statistically significant improvement over other semantic similarity methods. Furthermore, some promising applications of MedSim in drug substitution and drug abuse prevention are presented in a case study.”
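
A common building block for such feature-vector-based similarity scores is cosine similarity. The sketch below uses made-up antibiotic feature vectors; the vectors, dimensions and drug names are purely illustrative and are not MedSim's actual features.

```python
import numpy as np

# Hypothetical multi-dimensional medicine feature vectors (in MedSim
# these would combine KG hierarchy, corpus and medicine-specific features).
amoxicillin = np.array([0.8, 0.1, 0.6, 0.3])
ampicillin  = np.array([0.7, 0.2, 0.5, 0.4])
vancomycin  = np.array([0.1, 0.9, 0.2, 0.8])

def cosine(u, v):
    """Cosine similarity: the angle-based score for comparing feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Closely related antibiotics should score higher than unrelated ones,
# which is the behavior a therapeutic-substitution score needs.
print(cosine(amoxicillin, ampicillin) > cosine(amoxicillin, vancomycin))  # True
```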