Technical Review
Last modified: 20190417
Copyright notice, citation: Copyright
Source © 2018present, Victoria A. Stuart
These Contents
 Knowledge Graphs
 Why Graphs?
 Comprehensive Reviews
 Knowledge Graph Construction and Completion
 Graph Structure
 Graph Neural Networks [Graph Convolutional Networks (GCN) ...]
 Graph Attention Networks
 Question Answering Over Knowledge Graphs
 Graph Representation Learning
 Graph Generation Through Generative Adversarial Networks
 Subgraphs: HigherOrder Organization of Complex Networks
 Entity Linking and Text Grounding
KNOWLEDGE GRAPHS
Why Graphs?
The introductory chapter in Probabilistic Graphical Models (Daphne Koller and Nir Friedman: 2009) succinctly states some key points, which I paraphrase here.
[Image source. Click image to open in new window.]
[Probabilistic] Graphical models use a graphbased representation as the basis for compactly encoding a complex distribution over a highdimensional space. There is a dual perspective that one can use to interpret the structure of this graph. … From one perspective, the graph is a compact representation of a set of independencies that hold in the distribution; these properties take the form $\small \mathbf{X}$ is independent of $\small \mathbf{Y}$ given $\small \mathbf{Z}$, denoted $\small (\mathbf{X}\perp \mathbf{Y}\ \vert\ \mathbf{Z})$, for some subsets of variables $\small \mathbf{X,Y,Z}$. … The other perspective is that the graph defines a skeleton for compactly representing a highdimensional distribution: rather than encode the probability of every possible assignment to all of the variables in our domain, we can partition the distribution into smaller factors, each over a much smaller space of possibilities. We can then define the overall joint distribution as a product of these factors.

This framework has many advantages. First, it often allows the distribution to be written down tractably, even in cases where the explicit representation of the joint distribution is astronomically large. Importantly, the type of representation provided by this framework is transparent, in that a human expert can understand and evaluate its semantics and properties. This property is important for constructing models that provide an accurate reflection of our understanding of a domain. Models that are opaque can easily give rise to unexplained, and even undesirable, answers.
Second, as we show, the same structure often also allows the distribution to be used effectively for inference – answering queries using the distribution as our model of the world. In particular, we provide algorithms for computing the posterior probability of some variables given evidence on others. …
Third, this framework facilitates the effective construction of these models, whether by a human expert or automatically, by learning from data a model that provides a good approximation to our past experience. …
These three components – representation, inference, and learning – are critical components in constructing an intelligent system.
A Review of Relational Machine Learning for Knowledge Graphs (Sep 2015) by Maximilian Nickel, Kevin Murphy, Volker Tresp and Evgeniy Gabrilovich includes the following subsection (paraphrased here).

Uses of Knowledge Graphs
"Knowledge graphs provide semantically structured information that is interpretable by computers  a property that is regarded as an important ingredient to build more intelligent machines [38D.B. Lenat and E.A. Feigenbaum (1991) "On the Thresholds of Knowledge." Artificial intelligence, 47(1): 185250.] [paper]. Consequently, knowledge graphs are already powering multiple 'Big Data' applications in a variety of commercial and scientific domains. A prime example is the integration of Google's Knowledge Graph, which currently stores 18 billion facts about 570 million entities, into the results of Google's search engine. The Google Knowledge Graph is used to identify and disambiguate entities in text, to enrich search results with semantically structured summaries, and to provide links to related entities in exploratory search.(Microsoft has a similar KB, called Satori, integrated with its Bing search engine.)
"Enhancing search results with semantic information from knowledge graphs can be seen as an important step to transform textbased search engines into semantically aware question answering services. Another prominent example demonstrating the value of knowledge graphs is IBM's question answering system Watson, which was able to beat human experts in the game of Jeopardy!. Among others, this system used YAGO, DBpedia, and Freebase as its sources of information. Repositories of structured knowledge are also an indispensable component of digital assistants such as Siri, Cortana, or Google Now.
"Knowledge graphs are also used in several specialized domains. For instance, Bio2RDF, Neurocommons, and LinkedLifeData are knowledge graphs that integrate multiple sources of biomedical information. These have been used for question answering and decision support in the life sciences."
In his introductory statements in Chapter 8, Graphical Models [local copy; from: Bishop CM (2006) Pattern Recognition and Machine Learning] Christopher Bishop adds these points regarding the useful properties of probabilistic graphical models.

They provide a simple way to visualize the structure of a probabilistic model and can be used to design and motivate new models.

Insights into the properties of the model, including conditional independence properties, can be obtained by inspection of the graph.

Complex computations, required to perform inference and learning in sophisticated models, can be expressed in terms of graphical manipulations, in which underlying mathematical expressions are carried along implicitly.
In a probabilistic graphical model, each node represents a random variable (or group of random variables), and the links (edges) express probabilistic relationships between these variables. The graph then captures the way in which the joint distribution over all of the random variables can be decomposed into a product of factors each depending only on a subset of the variables. Bayesian networks, also known as directed graphical models, in which the links of the graphs have a particular directionality indicated by arrows. The other major class of graphical models are Markov random fields, also known as undirected graphical models, in which the links do not carry arrows and have no directional significance. Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables.
Theo blog post An Introduction to Graph Theory and Network Analysis (with Python codes) also includes a Why Graphs? subsection:
 Graphs provide a better way of dealing with abstract concepts like relationships and interactions; they also offer an intuitively visual way of thinking about these concepts. Graphs also form a natural basis for analyzing relationships in a Social context. Graph databases have become common computational tools and alternatives to SQL and NoSQL databases.
 Graphs are used to model analytics workflows in the form of directed acyclic graphs (DAGs). Some neural network frameworks also use DAGs to model the various operations in different layers.
 Graph theory concepts are used to study and model social networks, fraud patterns, power consumption patterns, virality and influence in social media. social network analysis (SNA) is probably the best known application of graph theory for data science. [See graph signal processing.]
 Graphs are used in clustering algorithms; specifically, KMeans.
 System dynamics also uses some graph theory concepts; specifically, loops.
 Path optimization is a subset of the optimization problem that also uses graph concepts.
 From a computer science perspective, graphs offer computational efficiency; the Big O complexity for some algorithms is better for data arranged in the form of Graphs (compared to tabular data).
Lastly, a recent position paper (Relational Inductive Biases, Deep Learning, and Graph Networks (Battaglia et al., Oct 2018) [code/demos] – jointly from DeepMind, Google Brain, MIT, and the University of Edinburgh – makes the case for graph networks as a foundational building block of the next generation of AI. They argued that Graph Networks could effectively support two critical humanlike capabilities: relational reasoning (i.e. drawing logical conclusions of how different objects and things relate to one another), and combinatorial generalization (i.e. constructing new inferences, predictions, and behaviors from known building blocks), laying the foundation for more sophisticated, interpretable, and flexible patterns of reasoning.
Comprehensive Reviews
Overview of variants of graph neural networks:
[Image source. Click image to open in new window.]Graph Neural Networks: A Review of Methods and Applications (Dec 2018) [Summary]Lots of learning tasks require dealing with graph data which contains rich relation information among elements. Modeling physics system, learning molecular fingerprints, predicting protein interface, and classifying diseases require that a model to learn from graph inputs. In other domains such as learning from nonstructural data like texts and images, reasoning on extracted structures, like the dependency tree of sentences and the scene graph of images, is an important research topic which also needs graph reasoning models. Graph neural networks (GNNs) are connectionist models that capture the dependence of graphs via message passing between the nodes of graphs. Unlike standard neural networks, GNNs retain a state that can represent information from its neighborhood with an arbitrary depth. Although the primitive GNNs have been found difficult to train for a fixed point, recent advances in network architectures, optimization techniques, and parallel computation have enabled successful learning with them. In recent years, systems based on graph convolutional network (GCN) and gated graph neural network (GGNN) have demonstrated groundbreaking performance on many tasks mentioned above. In this survey, we provide a detailed review over existing graph neural network models, systematically categorize the applications, and propose four open problems for future research. [discussion]
[Image source. Click image to open in new window.]A Comprehensive Survey on Graph Neural Networks (Jan 2019) [Summary]Deep learning has revolutionized ML tasks ranging from image classification & video processing to speech recognition and natural language understanding. Data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications where data are generated from nonEuclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this survey, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the stateoftheart graph neural networks into different categories. With a focus on graph convolutional networks, we review alternative architectures that have recently been developed; these learning paradigms include graph attention networks, graph autoencoders, graph generative networks, and graph spatialtemporal networks. We further discuss the applications of graph neural networks across various domains and summarize the open source codes and benchmarks of the existing algorithms on different learning tasks. Finally, we propose potential research directions in this fastgrowing field.
[Image source. Click image to open in new window.]Knowledge Graph Construction and Completion
Knowledge graphs (relational property graphs) model information in the form of entities (nodes/vertices) and the relationships between them (edges). For a formal definition see the “Problem Definition” subsection in Discriminative Predicate Path Mining for Fact Checking in Knowledge Graphs, and for a general overview see Towards a Definition of Knowledge Graphs.
Knowledge graphs (KG) provide semantically structured information that is interpretable by computers – an important property to building more intelligent machines, as well as an important step to transforming textbased knowledge stores and search engines into semanticallyaware question answering services [A Review of Relational Machine Learning for Knowledge Graphs (Sep 2015)].
Knowledge graphs are applicable to a broad spectrum of problems ranging from biochemistry to recommender systems; for example, question answering, structured search, exploratory search, and digital assistants. Examples of the use of KG in the biomedical domain include:

BioGraph: Unsupervised Biomedical Knowledge Discovery via Automated Hypothesis Generation (2011) [code];
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]

KnowLife: A Versatile Approach for Constructing a Large Knowledge Graph for Biomedical Sciences (May 2015);
Background. Biomedical knowledge bases (KB’s) have become important assets in life sciences. Prior work on KB construction has three major limitations. First, most biomedical KBs are manually built and curated, and cannot keep up with the rate at which new findings are published. Second, for automatic information extraction (IE), the text genre of choice has been scientific publications, neglecting sources like health portals and online communities. Third, most prior work on IE has focused on the molecular level or chemogenomics only, like proteinprotein interactions or genedrug relationships, or solely address highly specific topics such as drug effects.
Results. We address these three limitations by a versatile and scalable approach to automatic KB construction. Using a small number of seed facts for distant supervision of patternbased extraction, we harvest a huge number of facts in an automated manner without requiring any explicit training.
We extend previous techniques for patternbased IE with confidence statistics, and we combine this recalloriented stage with logical reasoning for consistency constraint checking to achieve high precision. To our knowledge, this is the first method that uses consistency checking for biomedical relations. Our approach can be easily extended to incorporate additional relations and constraints.
We ran extensive experiments not only for scientific publications, but also for encyclopedic health portals and online communities, creating different KB’s based on different configurations. We assess the size and quality of each KB, in terms of number of facts and precision. The best configured KB, KnowLife, contains more than 500,000 facts at a precision of 93% for 13 relations covering genes, organs, diseases, symptoms, treatments, as well as environmental and lifestyle risk factors.Conclusion. KnowLife is a large knowledge base for health and life sciences, automatically constructed from different Web sources. As a unique feature, KnowLife is harvested from different text genres such as scientific publications, health portals, and online communities. Thus, it has the potential to serve as onestop portal for a wide range of relations and use cases. To showcase the breadth and usefulness, we make the KnowLife KB accessible through the KnowLife – OneStop Health Portal.
[Image source. Click image to open in new window.] 
[Hetionet] Systematic Integration of Biomedical Knowledge Prioritizes Drugs for Repurposing (Sep 2017); and,
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.] 
[Image source. Click image to open in new window.]
A noteworthy feature of knowledge graphs is the excellent performance of traversaltype queries across entities of diverse types within them. Such queries can be challenging to realize in relational databases because of the cost of computing statements (joins) across different tables. See:
 Why are Graph Databases More Efficient than RDMS on Connected Data;
 Comparison of Relational Databases and Graph Databases; and,
 search for “joins” in: The Definitive Guide to Graph Databases for the RDBMS Developer.
Cypher queries in Neo4j are easier to write and understand than complex SQL queries in relational database management systems (RDBMS), especially those involving multiple join statements. For example, see pp. 2223 in The Definitive Guide to Graph Databases for the RDBMS Developer).
Graphs represent a recent and exciting extension of machine learning; for a good reviews, see

The Knowledge Graph as the Default Data Model for Learning on Heterogeneous Knowledge (Apr 2017), and

Deep Learning on Graphs: A Survey (Dec 2018) [Summary]Deep learning has been shown successful in a number of domains, ranging from acoustics, images to natural language processing. However, applying deep learning to the ubiquitous graph data is nontrivial because of the unique characteristics of graphs. Recently, a significant amount of research efforts have been devoted to this area, greatly advancing graph analyzing techniques. In this survey, we comprehensively review different kinds of deep learning methods applied to graphs. We divide existing methods into three main categories: semisupervised methods including Graph Neural Networks and Graph Convolutional Networks, unsupervised methods including Graph Autoencoders, and recent advancements including Graph Recurrent Neural Networks and Graph Reinforcement Learning. We then provide a comprehensive overview of these methods in a systematic manner following their history of developments. We also analyze the differences of these methods and how to composite different architectures. Finally, we briefly outline their applications and discuss potential future directions..
Although a typical knowledge graph may contain millions of entities and billions of relational facts, it is usually far from complete. Determining the validity of information in a knowledge graph and filling in its missing parts are crucial tasks for many researchers and practitioners.
Traditional fact checking by experts and analysts cannot keep pace with the volume of newly created information. It is therefore important and necessary to enhance our ability to computationally determine whether statements of fact are true or false. This problem may be viewed as a link prediction (knowledge graph completion) task. Knowledge graph completion (KGC) aims at predicting relations between entities under supervision of the existing knowledge graph, which is an important supplement to relation extraction from plain texts. To address those challenges, a number of knowledge graph completion methods have been developed using link prediction and lowdimensional graph embeddings.
MultiTask Identification of Entities, Relations, and Coreference for Scientific Knowledge Graph Construction (Aug 2018) [data/code] introduced a joint multitask setup of identifying and classifying entities, relations, and coreference clusters in scientific articles. They created SciERC, a dataset that included annotations for all three tasks, and developed a unified framework called Scientific Information Extractor (SciIE ) with shared span representations. The multitask setup reduced cascading errors between tasks and leveraged crosssentence relations through coreference links. Their multitask model was better at predicting span boundaries and outperformed previous state of the art scientific IE systems on entity and relation extraction, without using any handengineered features or pipeline processing.
[Image source. Click image to open in new window.] SciIE was able to automatically organize the extracted information from a large collection of scientific articles into a knowledge graph. SciIE supported the construction of a scientific knowledge graph, which they used to analyze information in scientific literature. Their analysis showed the importance of coreference links in making a dense, useful graph. Experiments showed that their multitask model outperformed previous models in scientific information extraction without using any domainspecific features.
Additional Reading

Augmenting Compositional Models for Knowledge Base Completion Using Gradient Representations (Nov 2018)
“Neural models of Knowledge Base data have typically employed compositional representations of graph objects: entity and relation embeddings are systematically combined to evaluate the truth of a candidate Knowledge Base entry. Using a model inspired by Harmonic Grammar, we propose to tokenize triplet embeddings by subjecting them to a process of optimization with respect to learned wellformedness conditions on Knowledge Base triplets. The resulting model, known as Gradient Graphs, leads to sizable improvements when implemented as a companion to compositional models. Also, we show that the “supracompositional” triplet token embeddings it produces have interpretable properties that prove helpful in performing inference on the resulting triplet representations.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
Statistical Relational Learning
A Review of Relational Machine Learning for Knowledge Graphs (Sep 2015) – by Maximilian Nickel, Kevin Murphy, Volker Tresp and Evgeniy Gabrilovich – provides an excellent introduction to machine learning and knowledge graphs, with a focus on statistical relational learning. The authors demonstrated how statistical relational learning can be used in conjunction with machine reading and information extraction methods to automatically build KG.
[Image source. Click image to open in new window.]
 “knowledge graphs have found important applications in question answering, structured search, exploratory search, and digital assistants. We provided a review of state of the art statistical relational learning methods applied to very large knowledge graphs. We also demonstrated how statistical relational learning can be used in conjunction with machine reading and information extraction methods to automatically build such knowledge repositories. As a result, we showed how to create a truly massive, machineinterpretable ‘semantic memory’ of facts, which is already empowering numerous practical applications. …”
In statistical relational learning the representation of an object can contain its relationships to other objects. The data is in the form of a graph, consisting of nodes (entities) and labelled edges (relationships between entities). The main goals of statistical relational learning include prediction of missing edges, prediction of properties of nodes, and clustering nodes based on their connectivity patterns. These tasks arise in many settings such as analysis of social networks and biological pathways.

Statistical relational learning (sometimes called Relational Machine Learning, RML), is a subdiscipline of machine learning that is concerned with domain models that exhibit both uncertainty (which can be dealt with using statistical methods) and complex, relational structure. Typically, the knowledge representation formalisms developed in statistical relational learning use (a subset of) firstorder logic to describe relational properties of a domain in a general manner (universal quantification) and draw upon probabilistic graphical models (such as Bayesian networks or Markov networks) to model the uncertainty; some also build upon the methods of inductive logic programming.
The field of statistical relational learning is not strictly limited to learning aspects; it is equally concerned with reasoning (specifically, probabilistic inference) and knowledge representation. Therefore, alternative terms that reflect the main foci of the field include statistical relational learning and reasoning (emphasizing the importance of reasoning) and firstorder probabilistic languages (emphasizing the key properties of the languages with which models are represented).
 Statistical relational learning should not be confused with Semantic Role Labeling (sometimes also called shallow semantic parsing); unfortunately, both are abbreviated as SRL. Semantic role labeling is a process in NLP that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence. It consists of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles (i.e., automatically finding the semantic roles of each argument of each predicate in a sentence).
RESCAL
One of the first knowledge graph embedding methods was RESCAL by Maximilian Nickel, Volker Tresp and HansPeter Kriegel, which computed a threeway factorization of an adjacency tensor (i.e. the factorization of a threeway tensor) that represents the knowledge graph [A ThreeWay Model for Collective Learning on MultiRelational Data (2011); author’s code; nonauthor code here and here]. RESCAL may be alternatively interpreted as a compositional model, where pairs of entities are represented via the tensor product of their embeddings. Unlike other tensor approaches, RESCAL was able to perform collective learning via the latent components of the model and provide an efficient algorithm to compute the factorization. They substantiated their theoretical considerations regarding the collective learning capabilities of their model on a new dataset, and a dataset commonly used in entity resolution. RESCAL is a very powerful model that can capture complex relational patterns over multiple hops in a graph. The RESCAL algorithm only involved standard matrix operations and was implemented in Python/NumPy without any additional software in less than 120 lines of code.
[Image source. Click image to open in new window.]RESCAL was discussed in the following papers.

On Evaluating Embedding Models for Knowledge Base Completion (Oct 2018) explored whether recent embedding models work well for knowledge base completion tasks, arguing that the current evaluation protocols are more suited for question answering rather than knowledge base completion. They showed that when using an alternative evaluation protocol more suitable for knowledge base completion the performance of all models was unsatisfactory, indicating the need for more research into embedding models and evaluation protocols for knowledge base completion.
“Embedding models. We subsequently write $\small \mathbf{R}_k \in \mathbb{R}^{d \times d}$ (instead of $\small \mathbf{r}_k$) for the embedding of relation $\small k$ if it is best interpreted as a matrix. Then $\small d_R = d^2$; otherwise, $\small d_R = d$.
RESCAL is the most general bilinear model and uses the scoring function $\small s(i,k,j) = \mathbf{e}^T_i \mathbf{R}_k\mathbf{e}_j$.
TransE is a translationbased model inspired by word2vec and uses the score function $\small s(i,k,j) = \vert\vert \mathbf{e}_i + \mathbf{r}_k  \mathbf{e}_j \vert\vert$ (using either $\small l_1$ or $\small l_2$ norm).
DistMult is a fairly constrained factorization model with scoring function $\small s(i,k,j) = \mathbf{e}^T_i\ diag(\mathbf{r}_k)\mathbf{e}_j$.
ComplEx uses embeddings in the complex domain with scoring function $\small s(i,k,j) = Real(\mathbf{e}^T_i\ diag(\mathbf{r}_k)\mathbf{e}_j)$, where $\small Real(\cdot)$ extracts the real part of a complex number.
ANALOGY uses the scoring function $\small s(i,k,j) = \mathbf{e}^T_i \mathbf{R}_k \mathbf{e}_j$ of RESCAL, but constrains $\small \mathbf{R}_k \in \mathbb{R}^{d \times d}$ to a block diagonal matrix in which each block is either a real scalar or a $\small 2×2$ matrix of form $\small \begin{bmatrix} x & y \\ y & x \end{bmatrix}$ with $\small x,y \in \mathbb{R}$.
Unlike RESCAL, Analogy and ComplEx – DistMult and TransE are restricted models in that they cannot represent any given knowledge base. See Sec. 3.2 for more details.”

Knowledge Representation Learning: A Quantitative Review (Dec 2018) also discusses RESCAL …
“A typical model of matrix factorization in KRL is RESCAL, a collective tensor factorization model, which reduce the modeling of the structure of KGs into a tensor factorization operation. In RESCAL, the triples in KGs forms a large tensor $\small \mathbf{X}$ which $\small X_{hrt}$ is 1 when $\small (h,r,t)$ holds, otherwise 0. Tensor factorization aims to factorize $\small \mathbf{X}$ to entity embeddings and relation embeddings, so that $\small X_{hrt}$ is close to $\small \mathbf{hM_rt}$. … We can find that RESCAL is similar to the previous model LFM [Latent Factor Model]. The major difference is that RESCAL will optimize all values in X including the zero values while LFM focuses on the triples in KGs.”

Embedding Models for Episodic Knowledge Graphs (Dec 2018), which extended models for static knowledge graphs to temporal knowledge graphs – enabling them to store episodic data and to generalize to new facts (inductive learning). They generalized leading learning models for static knowledge graphs (i.e., TuckER, RESCAL, HolE, ComplEx, DistMult ) to temporal knowledge graphs, introducing a new tensor model with superior generalization performance, ConT .
[Image source. Click image to open in new window.]“Another model closely related to the semantic Tucker tensor decomposition is the RESCAL model, which has shown excellent performance in modelling knowledge graphs [KGs]. In RESCAL, subjects and objects have vector latent representations, while predicates have matrix latent representations. The indicator function of RESCAL for modeling semantic KGs takes the form

$\begin{align} \theta^{\ sem}_{s,p,o} = \sum^{\tilde{r}}_{r_1,r_2=1} a_{e_s,r_1} g_p (r_1,r_2) a_{eo,r_2} \end{align}$,
where $\small \theta^{\ sem}_{s,p,o}$ is the logit of $\small (e_s, e_p, e_o)$, and $\small g_p(r_1,r_2)$ represents the matrix latent representation for the predicate $\small e_p$. Then next two models, Tree and ConT, are novel generalizations of RESCAL to episodic tensors.”

TuckER: Tensor Factorization for Knowledge Graph Completion (Jan 2019) proposed TuckER, a relatively simple but powerful linear model based on Tucker decomposition of the binary tensor representation of knowledge graph triples. TuckER outperformed all previous state of the art models across standard link prediction datasets. They proved that TuckER was a fully expressive model, deriving the bound on its entity and relation embedding dimensionality for full expressiveness which was several orders of magnitude smaller than the bound of previous state of the art models ComplEx and SimplE. They further showed that several previously introduced linear models could be viewed as special cases of TuckER.
“Finally, we show that several previous state of the art linear models (RESCAL, DistMult, ComplEx, SimplE) are special cases of TuckER.”
“RESCAL. An early linear model, RESCAL, optimizes a scoring function containing a bilinear product between vector embeddings for each subject and object entity and a full rank matrix for each relation. Although a very expressive and powerful model, RESCAL is prone to overfitting due to its large number of parameters, which increases quadratically in the embedding dimension with the number of relations in a knowledge graph.
“DistMult. DistMult is a special case of RESCAL with a diagonal matrix per relation, so the number of parameters of DistMult grows linearly with respect to the embedding dimension, reducing overfitting. However, the linear transformation performed on subject entity embedding vectors in DistMult is limited to a stretch. …”
[Image source. Click image to open in new window.]
RESCAL could be hard to scale to very large knowledgegraphs because it had a quadratic runtime and memory complexity in regard to the embedding dimension. For this reason, Nickel et al. also explored compositional operators, which were more efficient than the tensor product. One outcome of this research direction was holographic embeddings of knowledge graphs (HolE), which used circular convolution as the compositional operator. Due to this, it could also be interpreted as a multirelational associative memory, where the relation embeddings stored which latent pairs (or pairs of prototypes) were true for a certain relation type. HolE retained the excellent performance of RESCAL while being far more scalable, as it only had a linear dependency on the embedding dimension. [Source: Maximilian Nickel, Knowledge Graph Embeddings.]
Ontology reasoning is the ephemeral name given to reasoning over ontologies and knowledge bases: deriving facts that are not explicitly expressed in an ontology or in knowledge base. Ontology reasoning describes problem settings where the rules for conducting reasoning are specified alongside with the actual information. In 2017 Hohenecker and Lukasiewicz (University of Oxford) presented a novel approach to ontology reasoning that was based on deep learning rather than logicbased formal reasoning [Deep Learning for Ontology Reasoning (May 2017)]. They introduced a new model for statistical relational learning built upon deep recursive neural networks which easily competed with existing logicbased reasoners on the task of ontology reasoning on “RDFox” (an inmemory RDF triple store, i.e. graph, being developed as a commercial product).
In their 2018 followon work, Ontology Reasoning with Deep Neural Networks (Sep 2018), Hohenecker and Lukasiewicz devised a novel model that was closely coupled to symbolic reasoning methods and thus able to learn how to effectively perform basic ontology reasoning. Tests showed that their model learned to perform precise reasoning on a number of diverse inference tasks that required comprehensive deductive proficiencies; furthermore, the suggested model suffered much less from different obstacles that prohibit symbolic reasoning.
[Image source. Click image to open in new window.]Knowledge Graph Embedding
In a radically different approach from the probabilistic methods employed in statistical relational learning, knowledge graph embedding (KGE) aims to represent entities and relations in a knowledge graph as points or vectors in a continuous vector space (Euclidean space) – simplifying manipulation, while preserving the inherent structure of the KG. For recent reviews see:

A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications (Feb 2018)
[Image source. Click image to open in new window.] 
Knowledge Graph Embedding: A Survey of Approaches and Applications (Sep 2017; pdf).
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
 An excellent followon article (different authors) to A Review of Relational Machine Learning for Knowledge Graphs (Sep 2015) is On Embeddings as an Alternative Paradigm for Relational Learning (Jul 2018), which systematically compared knowledge graph embedding (KGE) and logicbased statistical relational learning methods on standard relational classification and clustering tasks.
“The most important takeaway messages from this work are the following:

KGE seems to be suitable for curated data. KGEs seem to have difficulties in the case when only a small fraction of available information is necessary for the given prediction task. SRL methods do not have that issue as they cherry pick useful information during the learning phase. This observation explains why KGEs perform well on the knowledge base completion task where the complexity of reasoning is low, but the dataset is huge. This further indicates the target use cases for the respective approaches: KGEs target simple relational reasoning with huge amounts of data, while SRL targets complex reasoning tasks with not a lot of data.

KGE might be useful for relational clustering but needs further research. Although logicbased relational clustering algorithms achieve the overall best results, clustering instances in the embeddings space does not lack much behind. They certainly show potential for this task, however they rise many problem, such as the question of an appropriate distance/similarity measure.

Hyperparameters matter a lot. A major disadvantage of KGEs is their sensitivity to hyperparameters. Out experiments show there is no clear strategy for choosing the right dimension. Though this might be less of an issue for classification problems, it poses a major one for clustering where there is no option to tune the parameters.

Besides the above outlined point, some obvious (dis)advantages are worth reiterating here. A strong advantage of KGEs is their scalability, at the expense of their blackbox nature and limited reasoning capabilities. SRL methods are a direct opposite – they can capture very complex reasoning, are interpretable but currently of a limited scalability.”

A typical KG is represented as a graph built from symbolic triples – variously stated as (source, relation, target), (subject, predicate, object), or (head, relation, tail). KGE methods attempt to represent those symbols (nodes and edges) with their corresponding source, relation, and target vectors – amenable to mathematical processing.
[Image source (slide 7). Click image to open in new window.]The underlying concept of KGE is that in a knowledge graph each entity can be regarded as a point in a continuous vector space while relations can be modelled as translation vectors [Expeditious Generation of Knowledge Graph Embeddings (Mar 2018)]. The generated vector representations can be used by machine learning algorithms to accomplish specific tasks.
Restated, KGE aim to represent instances and their relationships as vectors and/or matrices in the Euclidean space [On Embeddings as an Alternative Paradigm for Relational Learning (Jul 2018)]. The hope is that the geometry of the embedding space will resemble the structure of the data by keeping the instances that participate in the same relationships close to one another in Euclidean space. This in turn allows one to apply standard propositional (logicbased) machine learning tools and retain their scalability, while at the same time preserving certain properties of structured relational data.
Euclidean space is a mathematical construct that encompasses the line, the plane, and threedimensional space as special cases. Its elements are called vectors. Vector space (also called a linear space) is a collection of objects called vectors, which may be added together and multiplied (“scaled”) by numbers, called scalars.
Vectors can be understood in various ways: as arrows, as quantities with magnitude and direction, as displacements, or as points. Euclidean vectors are an example of a vector space. They represent physical quantities such as forces: any two forces (of the same type) can be added to yield a third, and the multiplication of a force vector by a real multiplier is another force vector. In the same vein, but in a more geometric sense, vectors representing displacements in the plane or in threedimensional space also form vector spaces.
Vectors are abstract mathematical objects with particular properties, which in some cases can be visualized as arrows. However, vectors in vector spaces do not necessarily have to be arrowlike objects. Vector spaces are the subject of linear algebra and are well characterized by their dimension, which, roughly speaking, specifies the number of independent directions in the space.
A recent Microsoft/Google paper, Link Prediction using Embedded Knowledge Graphs (Apr 2018), discussed the differences between symbolic and vector space
Source in KG completion; instead of sampling branching paths in the symbolic space of the original knowledge graph, they performed short sequences of interactive lookup operations in the vector space of an embedded knowledge graph.
KGE [reviewed in An Overview of Embedding Models of Entities and Relationships for Knowledge Base Completion (Mar 2017; updated Feb 2018)] has proven to be very effective for the task of knowledge graph completion, where the goal is to identify missing links in the existing knowledge graph [A Review of Relational Machine Learning for Knowledge Graphs (Sep 2015)]. Accordingly, KGE approaches are the current (2018) dominating methodology for knowledge graph link prediction [On Link Prediction in Knowledge Bases: MaxK Criterion and Prediction Protocols (2018) (code)]. KGE – fundamentally based on distributed representations – has not only proved to be effective for KG link prediction, but has also helped to improve our understanding and engineering of knowledge representation. A strong advantage of KGE is their scalability (at the expense of their blackbox nature and limited reasoning capabilities): KGE has proven to be scalable to very large knowledge graphs.
TransE – by Antoine Bordes et al. at the Université de Technologie de Compiègne – CNRS, and Jason Weston and Oksana Yakhnenko at Google – is a well known KGE method which embedded entities and relations into the same space where the difference between the head and the tail was approximately the relation [Translating Embeddings for Modeling Multirelational Data (NeurIPS 2013)]. TransE adopted the principle of geometric translation, formally as $\small h + r ≈ t$. While this embedding permits very simple translationbased relational inference, it is too restrictive in dealing with $\small \text{1toN}$, $\small \text{Nto1}$ and $\small \text{NtoN}$ relations. Furthermore, the representation of these models is still not semantically interpretable, which may be a major flaw and harm the potential applications (as stated in KSR: A Semantic Representation of Knowledge Graph within a Novel Unsupervised Paradigm).

Since recent (late 2018  early 2019) approaches to triplet classification / confidence scoring (refer here) are based on this TransE paper, I will excerpt that section of the TransE paper.
“Given a training set $\small \mathcal{S}$ of triplets $\small (h,\ell,t)$ composed of two entities $\small h,t \in E$ (the set of entities) and a relationship $\small \ell \in L$ (the set of relationships), our model learns vector embeddings of the entities and the relationships. The embeddings take values in $\small \mathbb{R}^k$ ($\small k$ is a model hyperparameter) and are denoted with the same letters, in boldface characters. The basic idea behind our model is that the functional relation induced by the $\small \ell$labeled edges corresponds to a translation of the embeddings, i.e. we want that $\small \boldsymbol{h + \ell \approx t}$ when $\small (h,\ell,t)$ holds ($\small \mathbf{t}$ should be a nearest neighbor of $\small \boldsymbol{h + \ell}$), while $\small \boldsymbol{h + \ell}$ should be far away from $\small \boldsymbol{t}$ otherwise. Following an energybased framework, the energy of a triplet is equal to $\small d(\boldsymbol{h + \ell,t}$) for some dissimilarity measured, which we take to be either the $\small L_1$ or the $\small L_2$norm.
“To learn such embeddings, we minimize a marginbased ranking criterion over the training set:

$\begin{align} \mathcal{L} = \sum_{(h,\ell,t) \in S} \ \sum_{(h^{\prime},\ell,t^{\prime}) \in S^{\prime}_{(h,\ell,t)}} [\gamma + d(\mathbf{h+\ell,t})  d(\boldsymbol{h^{\prime} + \ell,t^{\prime}})]_{+} \end{align}$ [1]
where $\small [x]_+$ denotes the positive part of $\small x, \gamma > 0$ is a margin hyperparameter, and

$S^{\prime}_{(h,\ell,t)} = \{(h^{\prime},\ell,t) \vert h^{\prime} \in \textit{E}\} \cup \{(h,\ell,t^{\prime}) \vert t^{\prime} \in \textit{E}\}$. [2]
The set of corrupted triplets, constructed according to Equation [2], is composed of training triplets with either the head or tail replaced by a random entity (but not both at the same time). The loss function [1] favors lower values of the energy for training triplets than for corrupted triplets, and is thus a natural implementation of the intended criterion. Note that for a given entity, its embedding vector is the same when the entity appears as the head or as the tail of a triple.”

This approach was also used by Richard Socher and colleagues in Reasoning With Neural Tensor Networks for Knowledge Base Completion (NeurIPS 2013).

The energy based approach to relation classification originates with energy based machine learning models (cited in the TransE paper) [I recently posted some of those links on reddit]. For example, Restricted Boltzmann Machines (RBM) [local copy] – from Yoshua Bengio’s group at MILA – gives a good introduction.
Energybased models associate a scalar energy to each configuration of the variables of interest. Learning corresponds to modifying that energy function so that its shape has desirable properties. For example, we would like plausible or desirable configurations to have low energy. Energybased probabilistic models define a probability distribution through an energy function, as follows:

$p(x) = \large \frac{e^{E(x)}}{Z}$.
The normalizing factor $\small Z$ is called the partition function by analogy with physical systems.

$\small \begin{align} Z = \sum_x e^{E(x)} \end{align}$
An energybased model can be learnt by performing (stochastic) gradient descent on the empirical negative loglikelihood of the training data. As for the logistic regression we will first define the loglikelihood and then the loss function as being the negative loglikelihood.

$\small \begin{align} \mathcal{L}(\theta, \mathcal{D}) = \frac{1}{N} \sum_{x^{(i)} \in \mathcal{D}} \log\ p(x^{(i)}) \\ \\ \ell (\theta, \mathcal{D}) =  \mathcal{L} (\theta, \mathcal{D}) \end{align}$
using the stochastic gradient $\frac{\partial \log p(x^{(i)})}{\partial \theta}$, where $\small \theta$ are the parameters of the model.
Energy Based Triple Classification (Triple Confidence)
SCEF: A SupportConfidenceaware Embedding Framework for Knowledge Graph Refinement (Feb 2019) [no code]. “Knowledge graph (KG) refinement mainly aims at KG completion and correction (i.e., error detection). However, most conventional KG embedding models only focus on KG completion with an unreasonable assumption that all facts in KG hold without noises, ignoring error detection which also should be significant and essential for KG refinement. In this paper, we propose a novel supportconfidenceaware KG embedding framework (SCEF), which implements KG completion and correction simultaneously by learning knowledge representations with both triple support and triple confidence. Specifically, we build model energy function by incorporating conventional translationbased model with support and confidence. To make our triple supportconfidence more sufficient and robust, we not only consider the internal structural information in KG, studying the approximate relation entailment as triple confidence constraints, but also the external textual evidence, proposing two kinds of triple supports with entity types and descriptions respectively. Through extensive experiments on realworld datasets, we demonstrate SCEF’s effectiveness.”
[Image source. Click image to open in new window.]SCEF: A SupportConfidenceaware Embedding Framework for Knowledge Graph Refinement (above) is very similar to this one (different institutions / authors; cited as ref. 44). They basically cloned the following work, with an updated confidenceaware KG embedding energy function (Eqn. 1 above; Eqn 2 below).

Does William Shakespeare REALLY Write Hamlet? Knowledge Representation Learning with Confidence (May 2017, updated Feb 2018) [code]. “Knowledge graphs (KGs), which could provide essential relational information between entities, have been widely utilized in various knowledgedriven applications. Since the overall human knowledge is innumerable that still grows explosively and changes frequently, knowledge construction and update inevitably involve automatic mechanisms with less human supervision, which usually bring in plenty of noises and conflicts to KGs. However, most conventional knowledge representation learning methods assume that all triple facts in existing KGs share the same significance without any noises. To address this problem, we propose a novel confidenceaware knowledge representation learning framework (CKRL), which detects possible noises in KGs while learning knowledge representations with confidence simultaneously. Specifically, we introduce the triple confidence to conventional translationbased methods for knowledge representation learning. To make triple confidence more flexible and universal, we only utilize the internal structural information in KGs, and propose three kinds of triple confidences considering both local and global structural information. In experiments, We evaluate our models on knowledge graph noise detection, knowledge graph completion and triple classification. Experimental results demonstrate that our confidenceaware models achieve significant and consistent improvements on all tasks, which confirms the capability of CKRL modeling confidence with structural information in both KG noise detection and knowledge representation learning.”
[Image source. Click image to open in new window.]
A third paper (different authors from the preceding two papers) is Triple Trustworthiness Measurement for Knowledge Graph (Sep 2018; updated Feb 2018) [code]. “In this paper, we establish a knowledge graph triple trustworthiness measurement model that quantify their semantic correctness and the true degree of the facts expressed. The model is a crisscrossing neural network structure. It synthesizes the internal semantic information in the triples and the global inference information of the KG to achieve the trustworthiness measurement and fusion in the three levels of entity level, relationship level, and KG global level. We analyzed the validity of the model output confidence values, and conducted experiments in the realworld dataset FB15k (from Freebase) for the knowledge graph error detection task. The experimental results showed that compared with other models, our model achieved significant and consistent improvements.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
This paper cites [as ref. 43] Does William Shakespeare REALLY Write Hamlet? Knowledge Representation Learning with Confidence. “The concept of trustworthiness has been applied to knowledge graph related tasks to some extent. Reference [43] proposed a triple confidence awareness knowledge representation learning framework, which improved the knowledge representation effect. There were three kinds of triple credibility calculation methods using the internal structure information of the KG. is method used only the information provided by the relationship, ignoring the related entities. …”
“… There are no explicit labelled errors in the FB15k. Considering the experience that most errors in realworld KGs derive from them is understanding between similar entities, we use the methods described in [43] to generate fake triples as negative examples automatically where the picked entity should once appear at the same position. For example, …”

In the traditional methods it is almost impossible to exactly extract the specific semantics from geometric points; for the example of TransE, a representation of $\small \text{Table}$ as $\small (0.82, 0.51, …)$ hardly tells us anything meaningful – such as a $\small \text{Table}$ being $\small \text{furniture}$, not being an $\small \text{animal}$, etc. Without semantics, the gap between knowledge and language remains, limiting the task of knowledge applications and natural language understanding (NLU).
TransH (Jun 2014). “TransE is a promising method proposed recently, which is very efficient while achieving stateoftheart predictive performance. We discuss some mapping properties of relations which should be considered in embedding, such as reflexive, onetomany, manytoone, and manytomany. We note that TransE does not do well in dealing with these properties. Some complex models are capable of preserving these mapping properties but sacrifice efficiency in the process. To make a good tradeoff between model capacity and efficiency, in this paper we propose TransH which models a relation as a hyperplane together with a translation operation on it. In this way, we can well preserve the above mapping properties of relations with almost the same model complexity of TransE. Additionally, as a practical knowledge graph is often far from completed, how to construct negative examples to reduce false negative labels in training is very important. Utilizing the onetomany/manytoone mapping property of a relation, we propose a simple trick to reduce the possibility of false negative labeling. We conduct extensive experiments on link prediction, triplet classification and fact extraction on benchmark datasets like WordNet and Freebase. Experiments show TransH delivers significant improvements over TransE on predictive accuracy with comparable capability to scale up.”
[Image source. Click image to open in new window.]One effective solution is to consider two separate embedding spaces for entities and relations. Entities are then mapped into the relation space using relationspecific projections, such as those in TransR [Learning Entity and Relation Embeddings for Knowledge Graph Completion (Jan 2018)]. This mapping strategy, however, causes critical drawbacks. First, when the number of relations is large, the whole projection matrices are expensive to model. Second, treating each relation separately does not account for the latent structure in the relation space, leading to waste of resources. An example of such a latent structure is the correlation between relations “nationality” and “placeofbirth”, as the latter may infer about the former.
While the raw representation of KG as (head, relation, tail) triples is adequate to store known knowledge, relating distant entities requires expensive graph traversal, possibly through multiple paths. Thus knowledge graph completion calls for learning of a new representation that supports scalable reasoning. The most successful approach thus far is through embedding entities and relations into a continuous vector space, which naturally lends itself to simple algebraic manipulations. An effective solution is to consider two separate embedding spaces for entities and relations. Entities are then mapped into the relation space using relationspecific projections , such as those in TransR. However, this mapping strategy causes critical drawbacks. First, when the number of relations is large, the whole projection matrices are expensive to model. Second, treating each relation separately does not account for the latent structure in the relation space, leading to waste of resources. An example of such a latent structure is the correlation between relations ‘nationality’ and ‘placeofbirth,’ as the latter may infer knowledge about the former.
Knowledge Graph Embedding with Multiple Relation Projections (Jan 2018) proposed a new translationbased KGE method called TransF, which was inspired by TransR but did not suffer from those issues. Under TransF, projection matrices are members of a matrix space spanned by a fixed number of matrix bases. A relationspecific projection matrix is characterized by a relationspecific coordinate in the space. Put in other way, the relation projection tensor is factorized into product of a relation coordinate matrix and a basis tensor. Hence, TransF is much more efficient and robust than TransR.
[Image source. Click image to open in new window.]While KG structures used for representation learning and other purposes generally represent (subject, relation, object) relational triples, knowledge bases often contain a wide variety of data types beyond these direct links. Apart from relations to a fixed set of entities, KB often not only include numerical attributes (e.g. ages, dates, financial, etc.) but also textual attributes (e.g. names, descriptions, and titles/designations) and images (profile photos, flags, posters, etc.). These different types of data can play a crucial role as extra pieces of evidence for knowledge base completion. For example the textual descriptions and images might provide evidence for a person’s age, profession, and designation.
[Image source. Click image to open in new window.]Embedding Multimodal Relational Data for Knowledge Base Completion (Sep 2018) [code] proposed multimodal knowledge base embeddings (MKBE) that used different neural encoders for this variety of observed data, combining them with existing relational models to learn embeddings of the entities and multimodal data. Furthermore, those learned embeddings and different neural decoders were used to develop a novel multimodal imputation model to generate missing multimodal values (like text and images) from information in the knowledge base.
[Image source. Click image to open in new window.]Most previous research in KG completion has focused on the problem of inferring missing entities and missing relation types between entities. However, in addition to these many KG also suffer from missing entity types (i.e. the category labels for entities, such as “/music/artist”). Learning Entity Type Embeddings for Knowledge Graph Completion (Nov 2017; pdf) [code] addressed this issue, proposing a novel approach to entity type prediction.
[Image source. Click image to open in new window.]DOLORES: Deep Contextualized Knowledge Graph Embeddings (Nov 2018) introduced a new method, DOLORES, for learning knowledge graph embeddings that effectively captured contextual cues and dependencies among entities and relations. Short paths on knowledge graphs comprising of chains of entities and relations can encode valuable information regarding their contextual usage. They operationalized this notion by representing knowledge graphs not as a collection of triples but as a collection of entityrelation chains, and learned embeddings for entities and relations using deep neural models that captured such contextual usage. Their model was based on bidirectional LSTMs, and learned deep representations of entities and relations from constructed entityrelation chains. They showed that these representations could very easily be incorporated into existing models, significantly advancing the state of the art on several knowledge graph prediction tasks like link prediction, triple classification, and missing relation type prediction (in some cases by at least 9.5%).
[Image source. Click image to open in new window.]For a related paper by the same authors contemporaneously released on arXiv, see *MOHONE*: Modeling Higher Order Network Effects in Knowledge Graphs via Network Infused Embeddings.
Knowledge graph embedding aims at modeling entities and relations with lowdimensional vectors. Most previous methods require that all entities should be seen during training, which is unpractical for realworld knowledge graphs with new entities emerging on a daily basis. Recent efforts on this issue suggest training a neighborhood aggregator in conjunction with the conventional entity and relation embeddings, which may help embed new entities inductively via their existing neighbors. However, their neighborhood aggregators neglect the unordered and unequal natures of an entity’s neighbors. To this end, Logic Attention Based Neighborhood Aggregation for Inductive Knowledge Graph Embedding (Nov 2018) summarized the desired properties that may lead to effective neighborhood aggregators, and introduced a novel aggregator: Logic Attention Network (LAN), which addressed the properties by aggregating neighbors with both rulesbased and networkbased attention weights. By comparing with conventional aggregators on two knowledge graph completion tasks, they experimentally validated LAN’s superiority in terms of the desired properties.
[Image source. Click image to open in new window.]While the celebrated sequence to sequence learning (seq2seq) technique and its numerous variants achieve excellent performance on many tasks, many machine learning tasks have inputs that are naturally represented as graphs, and existing seq2seq models face a significant challenge in achieving accurate conversion from graph form to the appropriate sequence. To address this challenge, IBM Research Graph2Seq: Graph to Sequence Learning with Attentionbased Neural Networks (Dec 2018) [code] introduced an end to end, graph to sequence neural encoderdecoder model that mapped an input graph to a sequence of vectors, and used an attentionbased LSTM method to decode the target sequence from those vectors. Their method generated the node and graph embeddings using a graphbased neural network that incorporated edge directions in the node embeddings. They then introduced an attention mechanism that aligns node embeddings and the decoding sequence to better cope with large graphs. Their model achieved state of the art performance on bAbI, Shortest Path, and Natural Language Generation tasks, and significantly outperformed existing graph neural network, seq2seq, and Tree2Seq models. Using the proposed bidirectional node embedding aggregation strategy, the model rapidly converged to the optimal performance.
[Image source. Click image to open in new window.]Knowledge Representation Learning: A Quantitative Review (Dec 2018) [code] first reviewed existing approaches for knowledge representation learning (KRL) – including (summarized above) some limited discussion on RESCAL, TransE, TransH, TransR and other Trans* variants, HolE, ComplEx, KG2E, RESCAL, PRA, etc. They also introduced a new KRL approach, Imageembodied Knowledge Representation Learning (IKRL), to take visual information into consideration when learning representations of the KGs. Afterwards, they conducted extensive quantitative comparisons and analyses of several typical KRL methods on three evaluation tasks of knowledge acquisition: knowledge graph completion, triple classification, and relation extraction. They also reviewed realworld applications of KRL such as language modeling, question answering, information retrieval, and recommender systems.
 See also the XKE fork of the OpenKE GitHub repository.
Is a Single Embedding Enough? Learning Node Representations That Capture Multiple Social Contexts (Google AI: Mar 2019) [code here; see also here]. “Recent interest in graph embedding methods has focused on learning a single representation for each node in the graph. But can nodes really be best described by a single vector representation? In this work, we propose a method [Splitter ] for learning multiple representations of the nodes in a graph (e.g., the users of a social network). Based on a principled decomposition of the egonetwork, each representation encodes the role of the node in a different local community in which the nodes participate. These representations allow for improved reconstruction of the nuanced relationships that occur in the graph a phenomenon that we illustrate through stateoftheart results on link prediction tasks on a variety of graphs, reducing the error by up to 90%. In addition, we show that these embeddings allow for effective visual analysis of the learned community structure.”
[Image source. Click image to open in new window.]Iteratively Learning Embeddings and Rules for Knowledge Graph Reasoning (Mar 2019). “… we explore how embedding and rule learning can be combined together and complement each other’s difficulties with their advantages. We propose a novel framework IterE (iteratively learning embeddings and rules), in which rules are learned from embeddings with proper pruning strategy and embeddings are learned from existing triples and new triples inferred by rules. Evaluations on embedding qualities of IterE show that rules help improve the quality of sparse entity embeddings and their link prediction results. We also evaluate the efficiency of rule learning and quality of rules from IterE compared with AMIE^{+}, showing that IterE is capable of generating high quality rules more efficiently. Experiments show that iteratively learning embeddings and rules benefit each other during learning and prediction.”
Unsupervised Inductive WholeGraph Embedding by Preserving Graph Proximity (Apr 2019) introduced a novel approach to graph level representation learning, embeddding an entire graph into a vector space where the embeddings of two graphs preserved their graphgraph proximity. “Our approach, UGraphEmb, is a general framework that provides a novel means to performing graphlevel embedding in a completely unsupervised and inductive manner. The learned neural network can be considered as a function that receives any graph as input, either seen or unseen in the training set, and transforms it into an embedding. A novel graphlevel embedding generation mechanism called MultiScale Node Attention (MSNA), is proposed. Experiments on five real graph datasets show that UGraphEmb achieves competitive accuracy in the tasks of graph classification, similarity ranking, and graph visualization.”
[Image source. Click image to open in new window.]Probing the Effectiveness of Embedding Models for Knowledge Graph Completion
Knowledge bases (graphs) contribute to many artificial intelligence tasks, yet they are often incomplete. To add missing facts to a given knowledge base, various embedding models have been proposed. Perhaps surprisingly, relatively simple models with limited expressiveness often performed remarkably well under most commonly used evaluation protocols. On Evaluating Embedding Models for Knowledge Base Completion (Oct 2018) explored whether recent embedding models work well for knowledge base completion tasks and argue that the current evaluation protocols are more suited for question answering rather than knowledge base completion. They showed that when using an alternative evaluation protocol more suitable for knowledge base completion the performance of all models was unsatisfactory, indicating the need for more research into embedding models and evaluation protocols for knowledge base completion.
[Image source. Click image to open in new window.]“Current Evaluation Protocols. Most studies use the triple classification (TC) and the entity ranking (ER) protocols to assess model performance, where ER is arguably the most widely adopted protocol. We assume throughout that only true but no false triples are available (as is commonly the case), and that these triples are divided into training, validation, and test triples. The union of these three sets acts as a proxy of the entire KB, which is unknown due to incompleteness.”
“Triple classification (TC). The goal of TC is to test the model’s ability to discriminate between true and false triples [Socher et al., Reasoning With Neural Tensor Networks for Knowledge Base Completion (NeurIPS 2013)]. Since only true triples are available in practice, pseudonegative triples are generated by randomly replacing either the subject or the object of each test triple by another random entity that appears as a subject or object, respectively. Each resulting triple is then classified as positive or negative. In particular, triple $\small (i,k,j)$ is classified as positive if its score $\small s(i,k,j)$ exceeds a relationspecific decision threshold $\small \sigma_k$ (learned on validation data using the same procedure). Model performance is assessed by its accuracy, i.e., how many triples are classified correctly.”
[Image source. Click image to open in new window.][Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
“Entity ranking (ER). The goal of ER is to assess model performance in terms of ranking answers to certain questions. In particular, for each test triple $\small t = (i,k,j)$, two questions $\small q_s = (?,k,j)$ and $\small q_o = (i,k,?)$ are generated. For question $\small q_s$, all entities $\small i’ \in \mathcal{E}$ are ranked based on the score $\small s(i’,k,j)$ . To avoid misleading results, entities $\small i’$ that correspond to observed triples in the dataset (i.e., $\small (i’,k,j)$ in train/validate) are discarded to obtain a filtered ranking. The same process is applied for question $\small q_o$. Model performance is evaluated based on the recorded positions of the test triples in the filtered ranking. The intuition is that models that rank test triples (which are known to be true) higher are expected to be superior. Usually, the microaverage of $\small \text{filtered Hits@K}$ (i.e., the proportion of test triples ranking in the top$\small K$) and $\small \text{filtered MRR}$ (i.e., the mean reciprocal rank of the test triples) are reported. Figure 1a
Source provides a pictorial view of ER for a single relation. Given the score matrix of a relation $\small k$, where $\small s_{ij}$ is the score of triple $\small (i,k,j)$, a single test triple is shown in green, all candidate triples considered during the evaluation are shown in blue, and all triples observed in the training, validation and testing sets (not considered during evaluation) are shown in grey.”“Discussion. Regarding triple classification, Wang, Ye, and Gupta (Apr 2018) found that most models achieve an accuracy of at least 93%. This is due to the fact that negative triples with high score are rarely sampled as pseudonegative triples because of the large number of entities from which the single replacement entity is picked for a given test triple. This means that most classification tasks are ‘easy.’ Consequently, the accuracy of triple classification overestimates model performance for KBC tasks. This protocol is less adopted in recent work. We argue that ER also overestimates model performance for KBC. In particular, the protocol is more appropriate to evaluate question answering tasks. Since ER generates questions from true test triples, it only asks questions that are known to have an answer. The question itself leaks this information from the test data into the evaluation. …”
“Entitypair ranking (PR). To study the overestimation effect of current evaluation protocols, we used an evaluation protocol for KBC termed entitypair ranking (PR). PR is simple: for each relation $\small k$, we ask question $\small (?,k,?)$ . As before, we use the model to rank all answers, i.e., pairs of entities, and filter out training and validation data in the ranking so as to rank only triples not used during model training. In this way, any negative triples with a high score will appear at a top position, making it harder for true triples to rank high. Figure 1b
Source shows the contrast between the number of negative triples considered for entitypair and those considered for ER. Again, test triples are shown in green, candidate triples are shown in blue, and triples observed during training and validation are shown in grey. The number of candidates is much higher than those considered for ER. However, when answering the question $\small (?,k,?)$ with all possible entity pairs, all test triples for relation $\small k$ will be ranked simultaneously. Let $\small \mathcal{T}_k$ be the number of test triples in $\small k$ . ER needs to consider in total $\small 2 \cdot \mathcal{T}_k \cdot \mathcal{E}$ candidates for $\small k$, while PR needs to consider $\small \mathcal{E}^2$ candidates. Since all test triples in relation $\small k$ are considered at once, we do not rely on MRR for PR, but consider weighted $\small \text{MAP@K}$, i.e., weighted mean average precision in the top$\small K$ filtered results, and weighted Hits@$\small K$, i.e., weighted percentage of test triples in the top$\small K$ filtered results. …”
Analyzing Knowledge Graph Embedding Methods from a MultiEmbedding Interaction Perspective (Mar 2019). “… we propose a multiembedding interaction mechanism as a new approach to uniting and generalizing these models [Canonical decomposition/Parallel factorization (CP ), DistMult , and state of the art ComplEx ]. We derive them theoretically via this mechanism and provide empirical analyses and comparisons between them. We also propose a new multiembedding model based on quaternion algebra and show that it achieves promising results using popular benchmarks.”

“In this paper, we propose a multiembedding interaction mechanism as a new approach to uniting and generalizing the above models. In the proposed mechanism, each entity $\small e$ is represented by multiple embedding vectors $\small \{\mathbf{e}^{(1)}, \mathbf{e}^{(2)}, \dots\}$ and each relation $\small r$ is represented by multiple embedding vectors $\small \{\mathbf{r}^{(1)}, \mathbf{r}^{(2)}, \ldots\}$. In a triple $\small (h,t,r)$, all embedding vectors of $\small h$, $\small t$, and $\small r$ interact with each other by trilinear products to produce multiple interaction scores. These scores are then weighted summed by a weight vector $\small \mathbf{\omega}$ to produce the final matching score for the triple. We show that the above models are special cases of this mechanism. Therefore, it unifies those models and lets us compare them directly. The mechanism also enables us to develop new models by extending to additional embedding vectors.”

“Note that this model needs more extensive evaluation. One potential problem is its being prone to overfitting, as seen in the on train results, with H@10 at absolute 1.000. This might mean that better regularization methods may be needed. However, the general results suggest that extending to more embedding vectors for multiembedding interaction models is a promising approach.”
Additional Reading

“Awesome” lists:
 An open source project that should greatly assist KGE research efforts is OpenKE [code; included models], a framework for knowledge embedding. OpenKE provides fast and stable toolkits, including the most popular knowledge representation learning methods. The framework easy to extend and convenient to design new knowledge embedding models. Toolkits for classic models are provided.
[Image source. Click image to open in new window.]
A heavily modified fork of of the OpenKE GitHub repository is XKE (Explaining Knowledge Embedding models) …

XKE employed a pedagogical approach where – intuitively speaking – a noninterpretable but accurate model is run, and an interpretable model is learned from the output of the noninterpretable one. In their paper, Interpreting Embedding Models of Knowledge Bases: A Pedagogical Approach (ICML 2018) [code] the authors proposed two models:
 KGE models with predicted features (XKEPRED), and
 KGE models with observed features (XKETRUE).

XKEPRED treated the embedding model as a black box and assumed no other source of information for building the interpretable model. By changing the original classifier’s inputs and observing its outputs, the pedagogical approach constructed a training set for an interpretable classifier from which explanations were extracted.

XKETRUE is a variation of XKEPRED that assumed an external source of knowledge (regarded as ground truth of their relational domain) besides the embedding model, from which XKETRUE extracted interpretable features.

Skipping over additional detail (provided in the paper), the results in their Table 3
[Image source. Click image to open in new window.]
Source in that paper are most the interesting with respect to this SUBAIM: the application of those models to a Freebase dataset. Five examples were shown (ID #15), along with input triples (headrelationtail), their explanations (reasons: weighted rules and the bias term) with the interpretable classifier’s score (XKEPRED; XKETRUE), and with the labels predicted by the embedding model (trained as a binary classifier: 1 = True; 0 = False). While the results (discussed in Section 4.4) vary – a consequence of an earlystage research effort – the models performed remarkably well in explaining (through the explanations: weighted paths) the selected, embedded triples.
[Image source. Click image to open in new window.]


BioKEEN: A Library for Learning and Evaluating Biological Knowledge Graph Embeddings (Feb 2019) [bioRxiv; code: PyKEEN; BioKEEN; video]. BioKEEN (Biological KnowlEdge EmbeddiNgs) is a package for training and evaluating biological knowledge graph embeddings built on PyKEEN.
“Knowledge graph embeddings (KGEs) have received significant attention in other domains due to their ability to predict links and create dense representations for graphs’ nodes and edges. However, the software ecosystem for their application to bioinformatics remains limited and inaccessible for users without expertise in programming and machine learning. Therefore, we developed BioKEEN (Biological KnowlEdge EmbeddiNgs) and PyKEEN (Python KnowlEdge EmbeddiNgs) to facilitate their easy use through an interactive command line interface. Finally, we present a case study in which we used a novel biological pathway mapping resource to predict links that represent pathway crosstalks and hierarchies. Availability: BioKEEN and PyKEEN are open source Python packages publicly available under the MIT License at BioKEEN and PyKEEN as well as through PyPI.”

“PyKEEN (Python KnowlEdge EmbeddiNgs) is a package for training and evaluating knowledge graph embeddings. Currently, it provides implementations of 10 knowledge graph embeddings models, and can be run in training mode in which users provide their own set of hyperparameter values, or in hyperparameter optimization mode to find suitable hyperparameter values from set of user defined values. PyKEEN can also be run without having experience in programing by using its interactive command line interface that can be started with the command pykeen from a terminal.”

“BioKEEN (Biological KnowlEdge EmbeddiNgs) is a package for training and evaluating biological knowledge graph embeddings built on PyKEEN. Because we use PyKEEN as the underlying software package, implementations of 10 knowledge graph embedding models are currently available for BioKEEN. Furthermore, BioKEEN can be run in training mode in which users provide their own set of hyperparameter values, or in hyperparameter optimization mode to find suitable hyperparameter values from set of user defined values. Through the integration of the Bio2BEL software numerous biomedical databases are directly accessible within BioKEEN. BioKEEN can also be run without having experience in programing by using its interactive command line interface that can be started with the command ‘biokeen’ from a terminal.”


Constructing Graph Node Embeddings via Discrimination of Similarity Distributions (Oct 2018)
“The problem of unsupervised learning node embeddings in graphs is one of the important directions in modern network science. In this work we propose a novel framework, which is aimed to find embeddings by discriminating distributions of similarities (DDoS) between nodes in the graph. The general idea is implemented by maximizing the Earth Mover’s Distance between distributions of decoded similarities of similar and dissimilar nodes. The resulting algorithm generates embeddings which give a state of the art performance in the problem of link prediction in realworld graphs.”
 See also LowComplexity DataParallel Earth Mover’s Distance Approximations (Dec 2018)
“The Earth Mover’s Distance (EMD) is a stateofthe art metric for comparing probability distributions. The high distinguishability offered by the EMD comes at a high cost in computational complexity. Therefore, linearcomplexity approximation algorithms have been proposed to improve its scalability. However, these algorithms are either limited to vector spaces with only a few dimensions or require the probability distributions to populate the vector space sparsely. We propose novel approximation algorithms that overcome both of these limitations, yet still achieve linear time complexity. All our algorithms are data parallel, and therefore, we can take advantage of massively parallel computing engines, such as Graphics Processing Units (GPUs). The experiments on MNIST images show that the new algorithms can perform a billion distance computations in less than a minute using a single GPU. On the popular textbased 20 Newsgroups dataset, the new algorithms are four orders of magnitude faster than the stateoftheart FastEMD library and match its search accuracy.
“… We also propose dataparallel implementations that are suitable for GPU acceleration, and demonstrate a 30,000fold performance improvement with respect to a multithreaded implementation of WMD without sacrificing any accuracy.”
 See also LowComplexity DataParallel Earth Mover’s Distance Approximations (Dec 2018)

Embedding with Adaptive Similarities for Scalable Learning over Graphs (Dec 2018)
“Node embedding is the task of extracting informative and descriptive features over the nodes of a graph. The importance of node embeddings for graph analytics, as well as learning tasks such as node classification, link prediction and community detection, has led to increased interest on the problem leading to a number of recent advances. Much like PCA in the feature domain, node embedding is an inherently \emph{unsupervised} task; in lack of metadata used for validation, practical methods may require standardization and limiting the use of tunable hyperparameters. Finally, node embedding methods are faced with maintaining scalability in the face of largescale realworld graphs of everincreasing sizes.
“In the present work, we propose an adaptive node embedding framework that adjusts the embedding process to a given underlying graph, in a fully unsupervised manner. To achieve this, we adopt the notion of a tunable node similarity matrix that assigns weights on paths of different length. The design of the multilength similarities ensures that the resulting embeddings also inherit interpretable spectral properties. The proposed model is carefully studied, interpreted, and numerically evaluated using stochastic block models. Moreover, an algorithmic scheme is proposed for training the model parameters efficiently and in an unsupervised manner. We perform extensive node classification, link prediction, and clustering experiments on many real world graphs from various domains, and compare with stateoftheart scalable and unsupervised node embedding alternatives. The proposed method enjoys superior performance in many cases, while also yielding interpretable information on the underlying structure of the graph.”

Embedding Uncertain Knowledge Graphs (Nov 2018)
“Embedding models for deterministic Knowledge Graphs (KG) have been extensively studied, with the purpose of capturing latent semantic relations between entities and incorporating the structured knowledge into machine learning. However, there are many KGs that model uncertain knowledge, which typically model the inherent uncertainty of relations facts with a confidence score, and embedding such uncertain knowledge represents an unresolved challenge. The capturing of uncertain knowledge will benefit many knowledgedriven applications such as question answering and semantic search by providing more natural characterization of the knowledge.
“In this paper, we propose a novel uncertain KG embedding model UKGE, which aims to preserve both structural and uncertainty information of relation facts in the embedding space. Unlike previous models that characterize relation facts with binary classification techniques, UKGE learns embeddings according to the confidence scores of uncertain relation facts. To further enhance the precision of UKGE, we also introduce probabilistic soft logic to infer confidence scores for unseen relation facts during training. We propose and evaluate two variants of UKGE based on different learning objectives. Experiments are conducted on three realworld uncertain KGs via three tasks, i.e. confidence prediction, relation fact ranking, and relation fact classification. UKGE shows effectiveness in capturing uncertain knowledge by achieving promising results on these tasks, and consistently outperforms baselines on these tasks.”
[Image source. Click image to open in new window.] 
COSINE: Compressive Network Embedding on Largescale Information Networks (Dec 2018)
“There is recently a surge in approaches that learn lowdimensional embeddings of nodes in networks. As there are many largescale realworld networks, it’s inefficient for existing approaches to store amounts of parameters in memory and update them edge after edge. With the knowledge that nodes having similar neighborhood will be close to each other in embedding space, we propose COSINE (COmpresSIve NE ) algorithm which reduces the memory footprint and accelerates the training process by parameters sharing among similar nodes. COSINE applies graph partitioning algorithms to networks and builds parameter sharing dependency of nodes based on the result of partitioning. With parameters sharing among similar nodes, COSINE injects prior knowledge about higher structural information into training process which makes network embedding more efficient and effective. COSINE can be applied to any embedding lookup method and learn highquality embeddings with limited memory and shorter training time. We conduct experiments of multilabel classification and link prediction, where baselines and our model have the same memory usage. Experimental results show that COSINE gives baselines up to 23% increase on classification and up to 25% increase on link prediction. Moreover, time of all representation learning methods using COSINE decreases from 30% to 70%.”
Knowledge Graph Edge Semantics
Here I collate and summarize/paraphrase “edge related” discussion from elsewhere in this REVIEW.

A major problem in current graph neural network models such as graph attention networks (GAT) and graph convolutional networks (GCN) is that edge features – which contain important information about graph – are not incorporated. Adaptive Edge Features Guided Graph Attention Networks (Sep 2018) proposed a graph learning model, edge features guided graph attention networks (EGAT). Guided by the edge features, the attention mechanism on pairs of graph nodes not only depended on node contents but also adjusted automatically with respect to the properties of the edge connecting those nodes. Moreover, the edge features were adjusted by the attention function and fed to the next layer, which meant that the edge features were adaptive across network layers.

Exploring Graphstructured Passage Representation for Multihop Reading Comprehension with Graph Neural Networks (Sep 2018) introduced a new method for better connecting the global evidence present in knowledge graphs, to form more complex graphs as compared to directed acyclic graphs (DAG). After obtaining representation vectors for question and entity mentions in passages, an additive attention model treated all entity mention representations and the question representation as the memory and the query, respectively. Footnote 2 was of interest: “The concurrent unpublished work (Cao et al., 2018) also investigates the usage of graph convolution networks on WikiHop. Our work proposes a different model architecture, and focus more on the exploration and comparison of multiple edge types for building the graphstructured passage representation.”

While representation learning on networks generates reliable results in regard to node embeddings, it is limited to homogeneous networks in which all nodes and edges are of the same type. Addressing this challenge, edge2vec : Learning Node Representation using Edge Semantics (Sep 2018) [code] proposed a model that incorporated edge semantics to represent different edgetypes in heterogeneous networks. Their model generated an edge type transition matrix as an extra criterion of a biased node random walk on networks, and a biased skipgram model was then leveraged to learn node embeddings based on the random walks. During the process to generate the node random walk corpus, the heterogeneity of the network was considered. By considering edge semantics, edge2vec significantly outperformed other state of the art models on all three tasks [note that in their tables, edge2vec is listed as heterogeneous node2vec].

mvn2vec: Preservation and Collaboration in MultiView Network Embedding (Jan 2018) focused on characteristics that were specific and important in embedding multiview networks (a multiview network consists of multiple network views, where each view corresponds to a type of edge, and all views share the same set of nodes). With respect to edges, two comments are relevant, here:

[re: Collaboration:] In some datasets, edges between the same pair of nodes may be observed in different views due to shared latent reasons. For instance, if nodes in two views may complement (interact with) each other in various social media contexts, embedding them jointly may potentially yield better results than embedding them independently.

[re: Preservation:] On the other hand, it is possible for different network views to have different semantic meanings; it is also possible that a portion of nodes have completely disagreeing edges in different views since edges in different views are formed due to distinct latent reasons.

MultiView / MultiLayer / Temporal Graph Embedding
Metabolic pathways are intuitively modeled as graph structures, which can represent relationships between elements (here, genetic and biochemical entities) in complex high dimensional datasets. Describing dynamic (temporal) processes, metabolic graphs should also represent various states – e.g. metabolic states (enzyme isoforms and metabolite concentrations), health status (healthy; diabetic; breast cancer; etc.) – perhaps through different graph representations. Likewise, metadata encoded in a graph representing scientific experiments could reflect changes in experimental variables over time, or in response to different experimental conditions (controls vs. experiments). Thus, depending on the state, we could have graphs that at different times differ in the data represented in the nodes and/or the connections between them (different paths or altered strengths of those connections).
 Relevant to the embedding of temporal relations in graphs are the following:
 Learning Sequence Encoders for Temporal Knowledge Graph Completion (Sep 2018) [code] [Summary]“Research on link prediction in knowledge graphs has mainly focused on static multirelational data. In this work we consider temporal knowledge graphs where relations between entities may only hold for a time interval or a specific point in time. In line with previous work on static knowledge graphs, we propose to address this problem by learning latent entity and relation type representations. To incorporate temporal information, we utilize recurrent neural networks to learn timeaware representations of relation types which can be used in conjunction with existing latent factorization methods. The proposed approach is shown to be robust to common challenges in realworld KGs: the sparsity and heterogeneity of temporal expressions. Experiments show the benefits of our approach on four temporal KGs.”
Source: Learning Sequence Encoders for Temporal Knowledge Graph Completion
dyngraph2vec: Capturing Network Dynamics using Dynamic Graph Representation Learning (Sep 2018) [Summary]Learning graph representations is a fundamental task aimed at capturing various properties of graphs in vector space. The most recent methods learn such representations for static networks. However, real world networks evolve over time and have varying dynamics. Capturing such evolution is key to predicting the properties of unseen networks. To understand how the network dynamics affect the prediction performance, we propose an embedding approach which learns the structure of evolution in dynamic graphs and can predict unseen links with higher precision. Our model, dyngraph2vec, learns the temporal transitions in the network using a deep architecture composed of dense and recurrent layers. We motivate the need of capturing dynamics for prediction on a toy data set created using stochastic block models. We then demonstrate the efficacy of dyngraph2vec over existing stateoftheart methods on two real world data sets. We observe that learning dynamics can improve the quality of embedding and yield better performance in link prediction.
Source: dyngraph2vec: Capturing Network Dynamics using Dynamic Graph Representation Learning[In the NLP domain:] WordLevel Loss Extensions for Neural Temporal Relation Classification
[Image source. Click image to open in new window.]
Streaming Graph Neural Networks (Oct 2018)
“… graph neural networks, which extend the neural network models to graph data, have attracted increasing attention. Graph neural networks have been applied to advance many different graph related tasks such as reasoning dynamics of the physical system, graph classification, and node classification.
Most of the existing graph neural network models have been designed for static graphs, while many realworld graphs are inherently dynamic. For example, social networks are naturally evolving as new users joining and new relations being created. Current graph neural network models cannot utilize the dynamic information in dynamic graphs. However, the dynamic information has been proven to enhance the performance of many graph analytical tasks such as community detection and link prediction. Hence, it is necessary to design dedicated graph neural networks for dynamic graphs.
In this paper, we propose DGNN (Dynamic Graph Neural Network ) model, which can model the dynamic information as the graph evolving. In particular, the proposed framework can keep updating node information by capturing the sequential information of edges, the time intervals between edges and information propagation coherently.”
[Image source. Click image to open in new window.]
Temporal Graph Offset Reconstruction: Towards Temporally Robust Graph Representation Learning (Nov 2018) [code]
“Graphs are a commonly used construct for representing relationships between elements in complex high dimensional datasets. Many realworld phenomenon are dynamic in nature, meaning that any graph used to represent them is inherently temporal. However, many of the machine learning models designed to capture knowledge about the structure of these graphs ignore this rich temporal information when creating representations of the graph, resulting in models that do not perform well when used to make predictions about the future state of the graph – especially when the delta between time stamps is not small. In this work, we explore a novel training procedure and an associated unsupervised model which creates graph representations optimised to predict the future state of the graph. We make use of graph convolutional neural networks to encode the graph into a latent representation, which we then use to train our temporal offset reconstruction method, inspired by autoencoders, to predict a later time point – multiple time steps into the future. Using our method, we demonstrate superior performance for the task of future link prediction compared with nonetemporal stateoftheart baselines. We show our approach to be capable of outperforming nontemporal baselines by 38% on a real world dataset.”
[Image source. Click image to open in new window.]
“Whilst a lot of focus has recently been placed on finding methods for learning graph representations which are accurately able to make predictions about the current state of the graph, few works have investigated how these models perform for temporal graphs. This paper has explored the use of graph specific neural networks to learn vertex level representations which are accurately able to make predictions about future time points of evolving graphs. This is achieved by explicitly training the models to predict future time steps via the temporal offset reconstruction method.”
“For creating our temporally offset graph embeddings, we will explore the use of both nonprobabilistic and variational encoder models. These are related to the convolutional graph autoencoders of Kipf, 2016, however, we are exploring the creation of two models explicitly to reconstruct a future state of the graph, rather than just to capture the current graph. Below we detail the specifics of both the nonprobabilistic (TOGAE) and variational (TOGVAE) models used for temporal offset reconstruction.”
DynamicGEM: A Library for Dynamic Graph Embedding Methods (Nov 2018)
“DynamicGEM is an opensource Python library for learning node representations of dynamic graphs. It consists of stateoftheart algorithms for defining embeddings of nodes whose connections evolve over time. The library also contains the evaluation framework for four downstream tasks on the network: graph reconstruction, static and temporal link prediction, node classification, and temporal visualization. We have implemented various metrics to evaluate the stateoftheart methods, and examples of evolving networks from various domains. We have easytouse functions to call and evaluate the methods and have extensive usage documentation. Furthermore, DynamicGEM provides a template to add new algorithms with ease to facilitate further research on the topic.”
Cellular growth and development provides an excellent example: a totipotent fertilized egg or a pluripotent stem cell can give rise to any other cell type – blood, liver, heart, brain, etc. – each arising from differential expression of the identical genome encoded within these cells throughout their lineage; e.g.:
 DNA Methylation in Stem Cell Renewal and Multipotency;
 Epigenetic Regulation of Pluripotency and Differentiation;
 Germline and Pluripotent Stem Cells;
 Epigenetics of Reprogramming to Induced Pluripotency); and

DNA methylation: an epigenetic mark of cellular memory.
[Image source. Click image to open in new window.]
This cellular differentiation arises largely due to selective epigenetic modification of the genome during growth and development.
It would be exceptionally useful to be able to access multiple “views” of those types of graphs, depending on the data, metadata and embeddings.
Recent work in knowledge graph embedding such as TransE, TransH and TransR have shown competitive and promising results in relational learning. NonParametric Estimation of Multiple Embeddings for Link Prediction on Dynamic Knowledge Graphs (2017) [local copy] proposed a novel extension [ puTransE: Parallel Universe TransE ] of the translational embedding model to solve three main problems of the current models:

translational models are highly sensitive to hyperparameters such as margin and learning rate;

the translation principle only allows one spot in vector space for each “golden” triplet; thus, congestion of entities and relations in vector space may reduce precision;

current models are not able to handle dynamic data, especially the introduction of new unseen entities/relations or removal of triplets.
The approach proposed for puTransE proposed explicitly generated multiple embedding spaces via semantically and structurally aware triplet selection schemes, and nonparametrically estimated the energy score of a triplet. The intuition for the approach was that in every “parallel universe” embedding space, a constraint on triplets was imposed in terms of count and diversity such that each embedding space observed the original knowledge graph from a different view. The proposed puTransE approach was simple, robust and parallelizable. Experimental results showed that puTransE outperformed TransE and many other embedding methods for link prediction on knowledge graphs on both public benchmark dataset and a real world dynamic dataset.
[Image source. Click image to open in new window.]KSR : A Semantic Representation of Knowledge Graph within a Novel Unsupervised Paradigm (May 2018) proposed a semantic representation method for knowledge graph (KSR ) which imposed a twolevel hierarchical generative process that globally extracted many aspects and then locally assigned a specific category in each aspect for every triple. Notably, they introduced knowledge features (i.e. views for clustering) to describe some knowledge semantic aspects, such as being a university or not (University), geographical position (Location), etc.
 By gathering the information of clusters in each view/feature, the semantic representations were formed. The firstlevel process generated knowledge features (views) with different semantics such as University and Location, while the secondlevel process grouped the entities/relations/triples according to the corresponding semantic features (views). Finally, by summarizing the cluster identifications within each feature/view, KSR constructed the semantic representation of the knowledge elements. For example, for “Tsinghua University” a Yes category (cluster identification) is assigned to the University feature, while the Beijing category/cluster identification is assigned to the Location feature. By exploiting multiview clustering, this knowledge was semantically organized as $\small \color{Brown}{\text{Tsinghua University = (University:Yes, Location:Beijing)}}$.
Multiview Clustering with Graph Embedding for Connectome Analysis (2017) from the University of Illinois at Chicago addressed multiview clustering on graph instances with their model Multiview Clustering framework on graph instances with Graph Embedding (MCGE), in which they modeled multiview graph data as tensors and applied tensor factorization to learn the multiview graph embeddings thereby capturing the local structure of graphs. They built an iterative framework incorporating multiview graph embedding into the multiview clustering task on graph instances, jointly performing multiview clustering and multiview graph embedding simultaneously. The multiview clustering results were used for refining the multiview graph embedding, and the updated multiview graph embedding results further improved the multiview clustering.
[Image source. Click image to open in new window.]Extensive experiments on two brain network datasets (HIV and Bipolar) demonstrated the superior performance of the proposed MCGE approach in multiview connectome analysis for clinical investigation and application. In a simple twoview example, given fMRI and DTI brain networks of five subjects MCGE aimed to learn multiview graph embedding for each of them and cluster these subjects into different groups based on the obtained multiview graph embeddings. In the HIV and bipolar brain networks (their Figs. 4 and 5, respectively), differential brain region clustering maps of normal controls and patients could be compared. Each colorcoded node represented a brain region, and each edge indicated the correlation between two brain regions. Nodes of the normal brain network were well grouped into several clusters, while nodes in the HIV brain network were less coherent. Additionally, for the normal control the edges within each cluster were much more intense than the edges across different clusters.

This work was cited in earlier work (also in 2017) by the same authors, Multiview Graph Embedding with Hub Detection for Brain Network Analysis (Sep 2017). In that paper, they proposed incorporating the hub detection task into the multiview graph embedding framework so that the two tasks could benefit each other, via an autoweighted framework of Multiview Graph Embedding with Hub Detection (MVGEHD) for brain network analysis (again, HIV and bipolar brain networks). The MVGEHD framework learned a unified graph embedding across all the views while reducing the potential influence of the hubs on blurring the boundaries between node clusters in the graph, thus leading to a clear and discriminative node clustering structure for the graph.
[Image source. Click image to open in new window.]

Followon work (2018) by those authors, MultiView MultiGraph Embedding for Brain Network Clustering Analysis (Jun 2018), proposed Multiview Multigraph Embedding (M2E) by stacking multigraphs into multiple partiallysymmetric tensors and using tensor techniques to simultaneously leverage the dependencies and correlations among multiview and multigraph brain networks. Although there had been work on singlegraph embedding and multiview learning, there had been no embedding method available which enabled preserving multigraph structures (i.e. taking multiple graphs as input) on multiple views. The goal of M2E is to find lowdimensional representations from multiview multigraph data which reveal patterns and structures among the brain networks. Experiments (again) on real HIV and bipolar disorder brain network datasets demonstrated the superior performance of M2E on clustering brain networks by leveraging the multiview multigraph interactions.
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
Critique.
While these authors survey clustering accuracy of a number of baselines against M2E (“We compare the proposed M2E with eight other methods for multiview clustering on brain networks.”), in a rather suspect omission they do not include comparisons of this current work to their earlier MCGE or MVGEHD work (cited in this paper as Ma et al. 2017a and Ma et al. 2017b, respectively) among those models.

For many realworld systems, multiple types of relations are naturally represented by multiple networks. However, existing network embedding methods mainly focus on single network embedding and neglect the information shared among different networks. For example, in social networks relationships between people may include friendships, money transfers, colleagues, etc. One simple solution for multinetwork embedding is to summarize multiple networks into a single network and apply the singlenetwork embedding method on the integrated network. Deep Feature Learning of MultiNetwork Topology for Node Classification (Sep 2018) proposed a novel multiple network embedding method based on semisupervised autoencoder, named DeepMNE, which captured complex topological structures of multinetworks, taking the correlation among multinetworks into account.
The DeepMNE algorithm mainly contained two parts: obtaining topological information learning, and learning multinetworkbased features. After obtaining the integrated representations of multinetworks, they trained a machine learning model based on the outputs of DeepMNE to classify the nodes. Experimental results on two real world datasets (yeast and human gene networks) demonstrated the superior performance of the method over four state of the art algorithms on tasks such as accurate prediction of gene function.
[Image source. Click image to open in new window.]Existing approaches to learning node representations usually study networks with a single type of proximity between nodes, which defines a single view of a network. However, in reality multiple types of proximities between nodes usually exist, yielding networks with multiple views. An Attentionbased Collaboration Framework for MultiView Network Representation Learning (Sep 2017) [code] studied learning node representations for networks with multiple views, which aimed to infer robust node representations across different views. They proposed a multiview representation learning approach (MVE), which promoted the collaboration of different views, letting them vote for robust representations. During the voting process, an attention mechanism was introduced, which enabled each node to focus on the most informative views. Experimental results on realworld networks showed that the proposed approach outperformed state of the art approaches for network representation learning (node classification; link prediction) with a single or multiple views.
[Image source. Click image to open in new window.]mvn2vec: Preservation and Collaboration in MultiView Network Embedding (Jan 2018) focused on characteristics that were specific and important in embedding multiview networks. They identified two characteristics, “preservation” and “collaboration” and explored the feasibility of achieving better embedding quality by simultaneously modeling preservation and collaboration. As shown in their Fig. 1a, a multiview network consists of multiple network views, where each view corresponds to a type of edge, and all views share the same set of nodes.
[Image source. Click image to open in new window.]
This paper, and An Attentionbased Collaboration Framework for MultiView Network Representation Learning (Sep 2017) are conceptually similar (compare Fig. 1
Sources: arXiv:1709.06636 and arXiv:1801.06597 in each paper). Dr. Jiawei Han, University of Chicago at UrbanaChampaign, is also a coauthor on both papers.Re: “Collaboration ”: in some datasets, edges between the same pair of nodes may be observed in different views due to shared latent reasons. For instance, if nodes in two views may complement (interact with) each other in various social media contexts, embedding them jointly may potentially yield better results than embedding them independently.
Re: “Preservation ”: on the other hand, it is possible for different network views to have different semantic meanings; it is also possible that a portion of nodes have completely disagreeing edges in different views since edges in different views are formed due to distinct latent reasons. For example, professional relationships may not align well with friendships. If we embed the profession and the friendship views into the same embedding space, the embedding fails to preserve the unique information carried by the different network views. The authors refer to the need for preserving unique information carried by different views as “preservation.”
It is also possible for preservation and collaboration to coexist in the same multiview network.
Realworld social networks and digital platforms are comprised of individuals (nodes) that are linked to other individuals or entities through multiple types of relationships (links). Subnetworks of such a network based on each type of link correspond to distinct views of the underlying network. In realworld applications each node is typically linked to only a small subset of other nodes; hence, practical approaches to problems such as node labeling have to cope with the resulting sparse networks. While lowdimensional network embeddings offer a promising approach to this problem, most of the current network embedding methods focus primarily on single view networks.
MultiView Network Embedding Via Graph Factorization Clustering and CoRegularized MultiView Agreement (Nov 2018) introduced a novel multiview network embedding (MVNE) algorithm for constructing lowdimensional node embeddings from multiview networks. MVNE adapted and extended an approach to single view network embedding (SVNE ) using graph factorization clustering (GFC) to the multiview setting using an objective function that maximized the agreement between views based on both the local and global structure of the underlying multiview graph. Experiments with several benchmark realworld single view networks showed that GFCbased SVNE yielded network embeddings that were competitive with or superior to those produced by the state of the art single view network embedding methods when the embeddings are used for labeling unlabeled nodes in the networks. Experiments with several multiview networks showed that MVNE substantially outperformed the single view methods on integrated view and the state of the art multiview methods. Even when the goal was to predict labels of nodes within a single target view, MVNE outperformed its singleview counterpart, suggesting that MVNE was able to extract the information that was useful for labeling nodes in the target view from the all of the views.
[Image source. Click image to open in new window.]In a multiview graph embedding, each node is assigned a vector that incorporates data from all views of the graph. Simple methods to create multiview embeddings include combining multiple views of the graph into one graph using an $\small \text{AND/OR}$ aggregation of the edge sets and embedding the resulting single graph, or embedding each view independently and concatenating the different embeddings obtained for each node. More sophisticated algorithms have been developed based on matrix factorization, tensor factorization, and spectral embedding. Many of these algorithms focus on clustering multiview graphs, a specific application thereof. High clustering accuracy indicates a good embedding since relative similarity between nodes should be correctly reflected in the embedding. … . Ideally, graph embeddings should preserve the distances between the nodes in their respective node embeddings. [Discussion in this paragraph was drawn from the Introduction in MultiView Graph Embedding Using Randomized Shortest Paths, and the references cited therein.)
MultiView Graph Embedding Using Randomized Shortest Paths (Aug 2018) [datasets, experimental results] proposed a generalized distance on multiview graphs called the Common Randomized Shortest Path Dissimilarity (CRSP), based on the randomized shortest path (RSP) dissimilarity measure on singleview graphs. This algorithm generated a dissimilarity measure between nodes by minimizing the expected cost of a random walk between any two nodes across all views of a multiview graph, in doing so encoding both the local and global structure of the graph. This leads to more accurate graph embeddings, resulting in better visualization and high clustering accuracy.
[Image source. Click image to open in new window.]Unsupervised Multiview Nonlinear Graph Embedding (2018) addressed “multiview” graph embedding: given a graph with node features they aimed to learn a network embedding (for each node from its network features) and a content embedding (from its content features) simultaneously for each node in an unsupervised manner. Generally, network structure and node features are considered as two different “views” for a node in the graph; in this work the authors allowed the network and content “views” to reinforce one another, to obtain better graph embeddings. They proposed a simple, effective unsupervised MultiviEw nonlineaR Graph Embedding (MERGE) model. Inspired by structural deep network embedding, MERGE encoded the nonlinearity of the network/content by taking the network/content features as input, and then applying a deep autoencoder to learn a nonlinear network/content embedding for each node. MERGE also preserved the secondorder proximity by extending DeepWalk to the multiview setting: MERGE extended DeepWalk enabling one node’s embedding to interpret its “neighbor” node’s content embedding, enforcing crossinstancecrossview consistency. MERGE consistently outperformed the state of the art baselines over five public datasets including PubMed (their Table 2) with $\small \theta(\vert V \vert)$ complexity, with onethird the training data.
[Image source. Click image to open in new window.]EndtoEnd MultiView Networks for Text Classification (Apr 2017) [nonauthor code here, here and here; discussion here and here: the OP later deleted their question but the comments are apropos] from National Research Council Canada proposed a “multiview network” for text classification. Their MVN model was outperformed by Richard Socher/SalesForce’s BCN+Char+CoVe algorithm (see their Table 4 in Learned in Translation: Contextualized Word Vectors).
[Image source. Click image to open in new window.]Hypergraph Neural Networks (Sep 2018) presented a hypergraph neural network (HGNN) framework for data representation learning, which could encode correlated, highorder data in a hypergraph structure. Confronting the challenge of representation learning for complex data, they proposed incorporating those data in a hypergraph, which was more flexible for modeling complex data. A hyperedge convolution operation handled the data correlation during representation learning, permitting the efficient use of traditional hypergraph learning procedures to be used using hyperedge convolution operations. HGNN was thus able to learn the hidden layer representation (the highorder data structure).
While graphs are effective in modeling complex relationships, in many scenarios a single graph is rarely sufficient to succinctly represent all interactions, and hence multilayered graphs have become popular. Though this leads to richer representations, extending solutions from the singlegraph case is not straightforward. Consequently, there is a strong need for novel solutions to solve classical problems, such as node classification, in the multilayered case. Attention Models with Random Features for Multilayered Graph Embeddings (Oct 2018) considered the problem of semisupervised learning with multilayered graphs. Though deep network embeddings, e.g. DeepWalk, are widely adopted for community discovery, these authors argue that feature learning with random node attributes – using graph neural network – can be more effective. They proposed the use of attention models for effective feature learning, and developed two novel architectures, GrAMMESG and GrAMMEFusion, that exploited the interlayer dependencies for building multilayered graph embeddings. Empirical studies on several benchmark datasets demonstrated significant performance improvements in comparison to state of the art network embedding strategies. The results also showed that using simple random features was an effective choice, even in cases where explicit node attributes were not available.
[Image source. Click image to open in new window.]In very interesting work, Harada et al. Dual Convolutional Neural Network for Graph of Graphs Link Prediction (Oct 2018) proposed the use of graphs of graphs (GoG) for feature extraction from graphs. GoG consist of an external graph and internal graphs, where each node in the external graph has an internal graph structure. They proposed a dual CNN that (i) extracted node representations by combining the external and internal graph structures in an endtoend manner, and (ii) efficiently learned lowdimensional representations of the GoG nodes. Experiments on link prediction tasks using several chemical network datasets demonstrated the effectiveness of the proposed method.
[Image source. Click image to open in new window.]A Recurrent Graph Neural Network for MultiRelational Data (Nov 2018) [code] introduced a graph recurrent neural network (GRNN) for scalable semisupervised learning from multirelational data. Key aspects of the novel GRNN architecture were the use of multirelational graphs, the dynamic adaptation to the different relations via learnable weights, and the consideration of graphbased regularizers to promote smoothness and alleviate overparametrization. The goal was to design a powerful learning architecture able to (i) discover complex and highly nonlinear data associations, (ii) combine (and select) multiple types of relations, and (iii) scale gracefully with respect to the size of the graph. Numerical tests with real data sets corroborated the design goals and illustrated the performance gains relative to competing alternatives.
[Image source. Click image to open in new window.]Community detection is a challenging, yet crucial, problem while mining largescale graph structured data. Most existing approaches solve this problem by mapping nodes into a vector space and performing unsupervised learning with the resulting embeddings. In cases where multiple types of connectivity patterns exist for the set of nodes, commonly modeled as multilayer graphs, new strategies are required to model the interlayer dependencies in order to perform effective inferencing. Improved Community Detection using Deep Embeddings from Multilayer Graphs (Sep 2018) focused on learning embeddings for each node of a multilayer graph through neural modeling techniques, such that the complex dependencies could be concisely encoded into lowdimensional representations. Referred to as multilayer graph embeddings, these representations could be utilized for discovering community structure in a scalable fashion, even with a large number of layers. Furthermore, in order to ensure that the semantics that persisted over a longer range in the network were well modeled, they proposed to refine the multilayer embeddings via a proxy clustering loss and a graph modularity measure. Using realworld datasets, they demonstrated that this algorithm generated scalable and robust representations, and outperformed existing multilayer community detection approaches.
[Image source. Click image to open in new window.]Modeling a sequence of interactions between users and items (e.g., products, posts, or courses) is crucial in domains such as ecommerce, social networking, and education to predict future interactions. Representation learning presents an attractive solution to model the dynamic evolution of user and item properties, where each user/item can be embedded in a Euclidean space and its evolution can be modeled by dynamic changes in embedding. However, existing embedding methods either generate static embeddings, treat users and items independently, or are not scalable. Learning Dynamic Embeddings from Temporal Interactions (Dec 2018) by Jure Leskovec and colleagues presents JODIE, a coupled recurrent model to jointly learn the dynamic embeddings of users and items from a sequence of useritem interactions. JODIE has three components. First, the update component updates the user and item embedding from each interaction using their previous embeddings with the two mutuallyrecursive Recurrent Neural Networks. Second, a novel projection component is trained to forecast the embedding of users at any future time. Finally, the prediction component directly predicts the embedding of the item in a future interaction. For models that learn from a sequence of interactions, traditional training data batching cannot be done due to complex useruser dependencies. Therefore, they presented a novel batching algorithm called tBatch that generated timeconsistent batches of training data that can run in parallel, giving massive speedup. Six experiments were conducted: two prediction tasks (future interaction prediction, and state change prediction), and four real world datasets. JODIE outperforms six state oftheart algorithms in these tasks by up to 22.4%. Moreover, we show that JODIE is highly scalable and up to 9.2x faster than comparable models. As an additional experiment, we illustrate that JODIE can predict student dropout from courses five interactions in advance.
[Image source. Click image to open in new window.]Additional Reading
 Graph Signal Processing

MultiMultiView Learning: Multilingual and MultiRepresentation Entity Typing (Oct 2018) [Summary] “Knowledge bases (KBs) are paramount in NLP. We employ multiview learning for increasing accuracy and coverage of entity type information in KBs. We rely on two metaviews: language and representation. For language, we consider highresource and lowresource languages from Wikipedia. For representation, we consider representations based on the context distribution of the entity (i.e., on its embedding), on the entity’s name (i.e., on its surface form) and on its description in Wikipedia. The two metaviews language and representation can be freely combined: each pair of language and representation (e.g., German embedding, English description, Spanish name) is a distinct view. Our experiments on entity typing with finegrained classes demonstrate the effectiveness of multiview learning.”
[Click image to enlarge.]
Source: MultiMultiView Learning: Multilingual and MultiRepresentation Entity typing 
Spectral Multigraph Networks for Discovering and Fusing Relationships in Molecules (Nov 2018) [Summary]Spectral Graph Convolutional Networks (GCNs) are a generalization of convolutional networks to learning on graphstructured data. Applications of spectral GCNs have been successful, but limited to a few problems where the graph is fixed, such as shape correspondence and node classification. In this work, we address this limitation by revisiting a particular family of spectral graph networks, Chebyshev GCNs, showing its efficacy in solving graph classification tasks with a variable graph structure and size. Chebyshev GCNs restrict graphs to have at most one edge between any pair of nodes. To this end, we propose a novel multigraph network that learns from multirelational graphs. We model learned edges with abstract meaning and experiment with different ways to fuse the representations extracted from annotated and learned edges, achieving competitive results on a variety of chemical classification benchmarks.

Multilayered Graph Embedding with Graph Convolution Networks (Nov 2018)
“Recently, graph embedding emerges as an effective approach for graph analysis tasks such as node classification and link prediction. The goal of network embedding is to find low dimensional representation of graph nodes that preserves the graph structure. Since there might be signals on nodes as features, recent methods like Graph Convolutional Networks (GCNs) try to consider node signals besides the node relations. On the other hand, multilayered graph analysis has been received much attention. However, the recent methods for node embedding have not been explored in these networks. In this paper, we study the problem of node embedding in multilayered graphs and propose a deep method that embeds nodes using both relations (connections within and between layers of the graph) and nodes signals. We evaluate our method on node classification tasks. Experimental results demonstrate the superiority of the proposed method to other multilayered and singlelayered competitors and also proves the effect of using crosslayer edges.”
[Image source. Click image to open in new window.]

Embedding Models for Episodic Knowledge Graphs (Dec 2018), by Volker Tresp and colleagues, extended models for static knowledge graphs to temporal knowledge graphs – enabling them to store episodic data and to generalize to new facts (inductive learning). They generalized leading learning models for static knowledge graphs (i.e., TuckER, RESCAL, HolE, ComplEx, DistMult ) to temporal knowledge graphs. They introduced a new tensor model with superior generalization performance, ConT . Performance was analyzed on two different datasets: the Global Database of Events, Language, and Tone (GDELT), and the database for Integrated Conflict Early Warning System (ICEWS). They also argued that temporal knowledge graph embeddings might also be models for cognitive episodic memory (facts we remember and can recollect) and that a semantic memory (current facts we know) can be generated from episodic memory by a marginalization operation. They validated this episodictosemantic projection hypothesis with the ICEWS dataset.
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.] Embedding Models for Episodic Knowledge Graphs (Tresp et al., Dec 2018 – above) appears to build on (but does not cite, or vice versa) Tresp and Ma, The Tensor Memory Hypothesis (Aug 2017). “We discuss memory models which are based on tensor decompositions using latent representations of entities and events. We show how episodic memory and semantic memory can be realized and discuss how new memory traces can be generated from sensory input … We describe the first detailed mathematical models for the complete processing pipeline from sensory input and its semantic decoding, i.e., perception, to the formation of episodic and semantic memories and their declarative semantic decodings. …“
 Embedding Models for Episodic Knowledge Graphs (Tresp et al., Dec 2018 – above) appears to build on (but does not cite, or vice versa) Tresp and Ma, The Tensor Memory Hypothesis (Aug 2017). “We discuss memory models which are based on tensor decompositions using latent representations of entities and events. We show how episodic memory and semantic memory can be realized and discuss how new memory traces can be generated from sensory input … We describe the first detailed mathematical models for the complete processing pipeline from sensory input and its semantic decoding, i.e., perception, to the formation of episodic and semantic memories and their declarative semantic decodings. …“

Dynamic Network Embeddings: From Random Walks to Temporal Random Walks (2018)
“Networks evolve continuously over time with the addition, deletion, and changing of links and nodes. Although many networks contain this type of temporal information, the majority of research in network representation learning has focused on static snapshots of the graph and has largely ignored the temporal dynamics of the network. In this work, we describe a general framework for incorporating temporal information into network embedding methods. The framework gives rise to methods for learning timerespecting embeddings from continuoustime dynamic networks. Overall, the experiments demonstrate the effectiveness of the proposed framework and dynamic network embedding approach as it achieves an average gain of 11.9% across all methods and graphs. The results indicate that modeling temporal dependencies in graphs is important for learning appropriate and meaningful network representations.”
[Image source. Click image to open in new window.] 
Dynamic Graph Representation Learning via SelfAttention Networks (Dec 2018)
“Learning latent representations of nodes in graphs is an important and ubiquitous task with widespread applications such as link prediction, node classification, and graph visualization. Previous methods on graph representation learning mainly focus on static graphs, however, many realworld graphs are dynamic and evolve over time. In this paper, we present Dynamic SelfAttention Network (DySAT), a novel neural architecture that operates on dynamic graphs and learns node representations that capture both structural properties and temporal evolutionary patterns. Specifically, DySAT computes node representations by jointly employing selfattention layers along two dimensions: structural neighborhood and temporal dynamics. We conduct link prediction experiments on two classes of graphs: communication networks and bipartite rating networks. Our experimental results show that DySAT has a significant performance gain over several different stateoftheart graph embedding baselines.”

MultiGCN: Graph Convolutional Networks for MultiView Networks, with Applications to Global Poverty (Jan 2019)
“… we develop a graphbased convolutional network for learning on multiview networks. We show that this method outperforms stateoftheart semisupervised learning algorithms on three different prediction tasks using mobile phone datasets from three different developing countries. We also show that, while designed specifically for use in poverty research, the algorithm also outperforms existing benchmarks on a broader set of learning tasks on multiview networks, including node labelling in citation networks.”
[Image source. Click image to open in new window.] 
EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs (IBM Watson  MIT CSAIL: Feb 2019)
” Graph representation learning resurges as a trending research subject owing to the widespread use of deep learning for Euclidean data, which inspire various creative designs of neural networks in the nonEuclidean domain, particularly graphs. With the success of these graph neural networks (GNN) in the static setting, we approach further practical scenarios where the graph dynamically evolves. For this case, combining the GNN with a recurrent neural network (RNN, broadly speaking) is a natural idea. Existing approaches typically learn one single graph model for all the graphs, by using the RNN to capture the dynamism of the output node embeddings and to implicitly regulate the graph model. In this work, we propose a different approach, coined EvolveGCN, that uses the RNN to evolve the graph model itself over time. This model adaptation approach is model oriented rather than node oriented, and hence is advantageous in the flexibility on the input. For example, in the extreme case, the model can handle at a new time step, a completely new set of nodes whose historical information is unknown, because the dynamism has been carried over to the GNN parameters. We evaluate the proposed approach on tasks including node classification, edge classification, and link prediction. The experimental results indicate a generally higher performance of EvolveGCN compared with related approaches.”

SpatioTemporal Deep Graph Infomax (Apr 2019)
Hypergraphs
Here I collate and summarize/paraphrase hypergraphrelated discussion from elsewhere in this REVIEW.
Neural Segmental Hypergraphs for Overlapping Mention Recognition (Oct 2018) [code  supplementary material] proposed a novel segmental hypergraph representation to model overlapping entity mentions that are prevalent in many practical datasets. Their model was robust in handling both overlapping and nonoverlapping mentions, and was thus able to capture features and interactions that could not be captured by previous models while maintaining a low time complexity for inference. They also presented a theoretical analysis to formally assess how our representation is better than alternative representations reported in the literature in terms of representational power. Coupled with neural networks for feature learning, their model achieved the state of the art performance in three benchmark datasets [including GENIA] annotated with overlapping mentions.
[Image source. Click image to open in new window.]In a similar approach to Neural Segmental Hypergraphs for Overlapping Mention Recognition (Oct 2018; above), Learning to Recognize Discontiguous Entities (Oct 2018) [project page/code] focused on the study of recognizing discontiguous entities, that can be overlapping at the same time. They proposed a novel hypergraph representation to jointly encode discontiguous entities of unbounded length, which could overlap with one another. Empirical results showed that their model was able to achieve significantly better results when evaluated on standard data with many discontiguous entities.
[Image source. Click image to open in new window.]Hypergraph Neural Networks (Sep 2018) presented a hypergraph neural network (HGNN) framework for data representation learning, which could encode correlated, highorder data in a hypergraph structure. Confronting the challenge of representation learning for complex data, they proposed incorporating those data in a hypergraph, which was more flexible for modeling complex data. A hyperedge convolution operation handled the data correlation during representation learning, permitting the efficient use of traditional hypergraph learning procedures to be used using hyperedge convolution operations. The convolution in the spectral domain was conducted with the hypergraph Laplacian, and further approximated by truncated Chebyshev polynomials. HGNN was thus able to learn the hidden layer representation (the highorder data structure). HGNN, graph convolutional networks (GCN) and other methods were applied to citation network classification and visual object recognition tasks, demonstrating that the proposed HGNN method could outperform recent state of the art methods. HGNN was also superior than other methods when dealing with multimodal data.
[Image source. Click image to open in new window.]While GCNs inherently assume pairwise relationships in graphstructured data, in many realworld problems relationships extend beyond pairwise connections – hypergraphs naturally capture these complex relationships. HyperGCN: Hypergraph Convolutional Networks for SemiSupervised Classification (Sep 2018) explored the use of GCN for hypergraph based semisupervised learning (SSL). They proposed HyperGCN, a SSL method which used a layerwise propagation rule for CNN operating directly on hypergraphs (the first principled adaptation of GCN to hypergraphs). HyperGCN was able to encode both the hypergraph structure and hypernode features in an effective manner.
[Image source. Click image to open in new window.]“In conventional graphbased SSL problems, the loss function is defined as a weighted sum of the supervised loss over labeled data and a regulariser for the graph structure. $\small \mathcal{L} = \underbrace{\mathcal{L}_0}_\text{labeled data} + \underbrace{\lambda \mathcal{L}_{reg}}_\text{graph data}$. Here, $\small \mathcal{L}_0$ denotes the supervised loss w.r.t. the labeled part of the graph and $\small \mathcal{L}_{reg}$ is an explicit graphbased regulariser that smooths the label information over the graph (with $\small \lambda$ being the weighting factor). A popularly used graphbased regulariser is the graph Laplacian regulariser which relies on the prior assumption that connected nodes in the graph are likely to share the same label (the cluster assumption (Chapelle, Weston, and Schőlkopf 2003). This assumption might restrict modeling capacity, as graph edges need not encode node similarity, but could instead contain other information such as pointwise mutual semantic information (Zhuang and Ma 2018), or knowledge graph relationship information (Wang, Ye, and Gupta 2018).
“To avoid this restriction, recent works have encoded both the labeled and graph data using a convolutional neural network (Atwood and Towsley 2016; Kipf and Welling 2017). This allows the network to be trained on a supervised loss function $\small \mathcal{L} = \mathcal{L}_0$ for all graph nodes, thereby, avoiding an explicit graphbased regulariser in the loss function. Specifically, graph convolutional networks (GCNs) (Kipf and Welling 2017) naturally integrate the graph structure (e.g., citation networks) and the feature attributes of nodes (e.g., bagofwords features). GCNs have achieved stateofthe art performance on benchmark graphbased semisupervised node classification datasets. Even though GCNs are able to incorporate arbitrary relationships between nodes (and not just similarity), they are still limited by pairwise relationships in the graph.
“However, in many realworld problems, relationships go beyond pairwise associations. In many naturally occurring graphs, we observe complex relationships involving more than two nodes, e.g., coauthorship, cocitation, email communication, etc. Hypergraphs provide a flexible and natural modeling tool to model such complex relationships. A hypergraph is a generalisation of a simple graph in which an edge (a.k.a., hyperedge) can connect any number of nodes. While simple graph edges connect pairs of nodes (pairwise relationships), hyperedges can connect an arbitrary number of nodes (and thus capture complex relationships). For example, in a coauthorship network modeled as a hypergraph, each node represents a paper and each hyperedge represents an author and connects all papers coauthored by the author. Because of the obvious existence of such complex relationships in many realworld networks, the problem of learning with hypergraphs assumes significance (Zhou, Huang, and Schölkopf 2007; Hein et al. 2013 Zhang et al. 2017).
“In this work, we consider the problem of hypergraphbased semi supervised classification. …”
Most networkbased machine learning methods assume that the labels of two adjacent samples in the network are likely to be the same. However, assuming the pairwise relationship between samples is not complete. The information a group of samples that shows very similar pattern and tends to have similar labels is missed. The natural way overcoming the information loss of the above assumption is to represent the feature dataset of samples as the hypergraph. Unnormalized Hypergraph pLaplacian Based SemiSupervised Learning Methods (Nov 2018) were applied to the zoo dataset and the tiny version of 20 newsgroups dataset. Experiment results showed that the accuracy performance measures of these unnormalized hypergraph pLaplacian based semisupervised learning methods were significantly greater than the accuracy performance measure of the nonpLaplacian method (the current state of the art method hypergraph Laplacian based semisupervised learning method for classification problems).
[Image source. [Original source.] Click image to open in new window.]Diagnosing an inherited disease often requires identifying the pattern of inheritance in a patient’s family. In an interesting approach, Explainable Genetic Inheritance Pattern Prediction (Dec 2018) [code] represented family trees with genetic patterns of inheritance using hypergraphs and latent state space models to provide explainable inheritance pattern predictions. Their approach, GraphLSSM, allowed for exact causal inference over a patient’s possible genotypes given their relatives’ phenotypes. By design, inference could be examined at a low level to provide explainable predictions. Furthermore, they made use of human intuition by providing a method to assign hypothetical evidence to any inherited gene alleles. Their analysis supported the application of latent state space models to improve patient care in cases of rare inherited diseases, where access to genetic specialists is limited.
[Image source. Click image to open in new window.]
The code provided by these authors, Graphical Latent State Space Models (GraphLSSM), is a Python library that extends traditional generative time series models, such as hidden Markov models, linear dynamical systems and their extensions, to graphical models. With this framework, a user can do Bayesian inference over graphical structures (i.e., probabilistic graphical modeling). One use case is doing inference over a pedigree chart, where phenotypes (observations) are emitted based on each person’s genotype (latent state), and the genotypes of individuals are linked through their ancestor tree. [Visualization with Graphviz.]
[Image source. Click image to open in new window.]
Hypergraph Convolution and Hypergraph Attention (Jan 2019) introduced two end to end trainable operators (hypergraph convolution; hypergraph attention) to graph neural networks. Hypergraph attention further enhanced the capacity of representation learning (through hypergraph convolution) by leveraging an attention module. With the two operators, a graph neural network was readily extended to a more flexible model and applied to diverse applications where nonpairwise relationships were observed.
[Image source. Click image to open in new window.]SemiSupervised Graph Classification: A Hierarchical Graph Perspective (Apr 2019). “… A node of a graph usually represents a realworld entity, e.g., a user in a social network, or a protein in a proteinprotein interaction network. In this work, we consider a more challenging but practically useful setting, in which a node itself is a graph instance. … For example, in a social network, a group of people with shared interests forms a user group, whereas a number of user groups are interconnected via interactions or common members. We study the node classification problem in the hierarchical graph where a ‘node’ is a graph instance, e.g., a user group … we propose a novel supervised, selfattentive graph embedding method called SAGE , which embeds graph instances of arbitrary size into fixedlength vectors. …”
[Image source. Click image to open in new window.]Additional Reading

On a Hypergraph Probabilistic Graphical Model (Nov 2018)
“We propose a directed acyclic hypergraph framework for a probabilistic graphical model that we call Bayesian hypergraphs. The space of directed acyclic hypergraphs is much larger than the space of chain graphs. Hence Bayesian hypergraphs can model much finer factorizations than Bayesian networks or LWF chain graphs and provide simpler and more computationally efficient procedures for factorizations and interventions. Bayesian hypergraphs also allow a modeler to represent causal patterns of interaction such as NoisyOR graphically (without additional annotations). We introduce global, local and pairwise Markov properties of Bayesian hypergraphs and prove under which conditions they are equivalent. We define a projection operator, called shadow, that maps Bayesian hypergraphs to chain graphs, and show that the Markov properties of a Bayesian hypergraph are equivalent to those of its corresponding chain graph. We extend the causal interpretation of LWF chain graphs to Bayesian hypergraphs and provide corresponding formulas and a graphical criterion for intervention.”
[Image source. Click image to open in new window.]
Graph Signal Processing
This is a fascinating domain. For additional background, see my comprehensive Resources pages:
Graph classification has recently received a lot of attention from various fields of machine learning e.g. kernel methods, sequential modeling or graph embedding. All these approaches offer promising results with different respective strengths and weaknesses. However, most of them rely on complex mathematics and require heavy computational power to achieve their best performance. A Simple Baseline Algorithm for Graph Classification (Oct 2018) [datasets] proposed a simple and fast algorithm based on the spectral decomposition of the graph Laplacian to perform graph classification and get a first reference score for a dataset. This method obtained competitive results compared to state of the art algorithms.
[Image source. Click image to open in new window.][Image source. Click image to open in new window.]Datasets. We evaluated our model against some standard datasets from biology:
 Mutag (MT),
 Predictive Toxicology Challenge (PTC),
 Enzymes (EZ),
 Proteins Full (PF),
 Dobson and Doig (DD) and
 National Cancer Institute (NCI1).
All graphs represent chemical compounds. Nodes are molecular substructures (typically atoms) and edges represent connections between these substructures (chemical bound or spatial proximity). In MT, the compounds are either mutagenic and not mutagenic while in PTC, they are either carcinogens or not. EZ contains tertiary structures of proteins from the 6 Enzyme Commission top level classes. In DD, graphs represent secondary structures of proteins being either enzyme or not enzyme. PF is a subset of DD where the largest graphs have been removed. In NCI1, compounds have either an anticancer activity or not. Statistics about the graphs are presented in table 1.
Interpretability has emerged as a crucial aspect of machine learning, aimed at providing insights into the working of complex neural networks. However, existing solutions vary vastly based on the nature of the interpretability task, with each use case requiring substantial time and effort. MARGIN: Uncovering Deep Neural Networks using Graph Signal Analysis (Dec 2108) introduced MARGIN , a simple yet general approach to address a large set of interpretability tasks ranging from identifying prototypes to explaining image predictions. MARGIN exploited ideas rooted in graph signal analysis to determine influential nodes in a graph, which were defined as those nodes that maximally described a function defined on the graph. By carefully defining taskspecific graphs and functions, MARGIN outperformed existing approaches in a number of disparate interpretability challenges.
[Image source. Click image to open in new window.]Many networks exhibit rich, lowerorder connectivity patterns that can be captured at the level of individual nodes and edges. HigherOrder Organization of Complex Networks (Benson, Gleich and Leskovec: 2016) [project; slides here and here] developed a generalized framework for clustering networks on the basis of higherorder connectivity patterns, at the level of small network subgraphs. That project focused on finding higherorder organization of complex networks at the level of small network subgraphs (motifs), by clustering the nodes of a graph based on motifs instead of edges via a motifbased spectral clustering method.

“Graphs are a pervasive tool for modeling and analyzing network data throughout the sciences. Benson et al. developed an algorithmic framework for studying how complex networks are organized by higherorder connectivity patterns (see the Perspective by Pržulj and MalodDognin). Motifs in transportation networks reveal hubs and geographical elements not readily achievable by other methods. A motif previously suggested as important for neuronal networks is part of a “rich club” of subnetworks.” [Source: Science, Editors Summary]
[Image source. Click image to open in new window.]
[Image source; (see also). Click image to open in new window.]

The algorithm illustrated in Fig. 1C, above, efficiently identifies a cluster of nodes $\small S$ as follows:

Step 1: Given a network and a motif $\small M$ of interest, form the motif adjacency matrix $\small W_M$ whose entries $\small (i,j)$ are the cooccurrence counts of nodes $\small i$ and $\small j$ in the motif $\small M:(W _jM)_{ij}$ = number of instances of $\small M$ that contain nodes $\small i$ and $\small j$.

Step 2: Compute the spectral ordering $\small \sigma$ of the nodes from the normalized motif graph Laplacian matrix constructed via $\small W_M$.
 The normalized motif Laplacian matrix is $\small L_M = D^{1/2}(D  W_M)D^{1/2}$, where $\small D$ is a diagonal matrix with the rowsums of $\small W_M$ on the diagonal, $\small D_{ii} = \sum_j (W_M)_{ij}$ , and $\small D^{1/2}$ is the same matrix with the inverse square roots on the diagonal $\small D_{ii}^{1/2} = 1 / \sqrt{\sum_j (W_M)_{ij}}$. The spectral ordering $\small \sigma$ is the byvalue ordering of $\small D^{1/2}z$, where $\small z$ is the eigenvector corresponding to the second smallest eigenvalue of $\small L_M$, i.e., $\small \sigma_i$ is the index of $\small D^{1/2}z$ with the $\small i^{th}$ smallest value.
 The normalized motif Laplacian matrix is $\small L_M = D^{1/2}(D  W_M)D^{1/2}$, where $\small D$ is a diagonal matrix with the rowsums of $\small W_M$ on the diagonal, $\small D_{ii} = \sum_j (W_M)_{ij}$ , and $\small D^{1/2}$ is the same matrix with the inverse square roots on the diagonal $\small D_{ii}^{1/2} = 1 / \sqrt{\sum_j (W_M)_{ij}}$. The spectral ordering $\small \sigma$ is the byvalue ordering of $\small D^{1/2}z$, where $\small z$ is the eigenvector corresponding to the second smallest eigenvalue of $\small L_M$, i.e., $\small \sigma_i$ is the index of $\small D^{1/2}z$ with the $\small i^{th}$ smallest value.

Step 3: Find the prefix set of $\small \sigma$ with the smallest motif conductance (the argument of the minimum), formally, $\small S :=argmin_r \phi_M(S_r)$, where $\small S_r = \{\sigma_1, \ldots, \sigma_r \}$.
[Image source; (see also). Click image to open in new window.]


[Perspective] Network Analytics in the Age of Big Data (2016):
“We live in a complex world of interconnected entities. In all areas of human endeavor, from biology to medicine, economics, and climate science, we are flooded with largescale data sets. These data sets describe intricate realworld systems from different and complementary viewpoints, with entities being modeled as nodes and their connections as edges, comprising large networks. These networked data are a new and rich source of domainspecific information, but that information is currently largely hidden within the complicated wiring patterns. Deciphering these patterns is paramount, because computational analyses of large networks are often intractable, so that many questions we ask about the world cannot be answered exactly, even with unlimited computer power and time. Hence, the only hope is to answer these questions approximately (that is, heuristically) and prove how far the approximate answer is from the exact, unknown one, in the worst case. On page 163 of this issue, Benson et al. take an important step in that direction by providing a scalable heuristic framework for grouping entities based on their wiring patterns and using the discovered patterns for revealing the higherorder organizational principles of several realworld networked systems.” [Source]

Response. “HypergraphBased Spectral Clustering of HigherOrder Network Structures” [Tom Michoel & Bruno Nachtergaele] (July 2016):
 “The authors refer to our work T Michoel et al, Mol. Bio. Syst. 7, 2769 (2011) [local copy] where we introduced an algorithm for clustering networks on the basis of 3node network motifs, but appear to have missed our subsequent work where this algorithm was extended into a general spectral clustering algorithm for hypergraphs T Michoel and B Nachtergaele, Phys Rev E 86, 05611 (2012). As a special case, and similar to SNAP, this algorithm can be (and was) used to cluster signed, colored or weighted networks based on network motifs or subgraph patterns of arbitrary size and shape, including patterns of unequal size such as shortest paths. An implementation of the algorithm is available on GitHub.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
Michoel & Nachtergaele, 2012:
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]  “The authors refer to our work T Michoel et al, Mol. Bio. Syst. 7, 2769 (2011) [local copy] where we introduced an algorithm for clustering networks on the basis of 3node network motifs, but appear to have missed our subsequent work where this algorithm was extended into a general spectral clustering algorithm for hypergraphs T Michoel and B Nachtergaele, Phys Rev E 86, 05611 (2012). As a special case, and similar to SNAP, this algorithm can be (and was) used to cluster signed, colored or weighted networks based on network motifs or subgraph patterns of arbitrary size and shape, including patterns of unequal size such as shortest paths. An implementation of the algorithm is available on GitHub.”
Graph Signal Processing – Wavelets
“Wavelets are mathematical functions that cut up data into different frequency components, and then study each component with a resolution matched to its scale. They have advantages over traditional Fourier methods in analyzing physical situations where the signal contains discontinuities and sharp spikes. Wavelets were developed independently in the fields of mathematics, quantum physics, electrical engineering, and seismic geology. Interchanges between these fields during the last ten years have led to many new wavelet applications such as image compression, turbulence, human vision, radar, and earthquake prediction. This paper introduces wavelets to the interested technical person outside of the digital signal processing field. I describe the history of wavelets beginning with Fourier, compare wavelet transforms with Fourier transforms, state properties and other special aspects of wavelets, and Ønish with some interesting applications such as image compression, musical tones, and denoising noisy data.” [Source: An Introduction to Wavelets (1995).]

“A wavelet is a wavelike oscillation with an amplitude that begins at zero, increases, and then decreases back to zero. It can typically be visualized as a “brief oscillation” like one recorded by a seismograph or heart monitor. Generally, wavelets are intentionally crafted to have specific properties that make them useful for signal processing. Using a “reverse, shift, multiply and integrate” technique called convolution, wavelets can be combined with known portions of a damaged signal to extract information from the unknown portions.
“… As a mathematical tool, wavelets can be used to extract information from many different kinds of data, including  but not limited to  audio signals and images. Sets of wavelets are generally needed to analyze data fully. A set of “complementary” wavelets will decompose data without gaps or overlap so that the decomposition process is mathematically reversible. Thus, sets of complementary wavelets are useful in wavelet based compression/decompression algorithms where it is desirable to recover the original information with minimal loss.
“In formal terms, this representation is a wavelet series representation of a squareintegrable function with respect to either a complete, orthonormal set of basis functions, or an overcomplete set or frame of a vector space, for the Hilbert space of square integrable functions. This is accomplished through coherent states.
Recent work mentioned but not discussed on my Graph Signal Processing page includes work by Zou and Lerman [Encoding Robust Representation for Graph Generation (Sep 2018); code; OpenReview] at the University of Minnesota, which employed the graph wavelet approach of Wavelets on Graphs via Spectral Graph Theory (2011). While generative models like generative adversarial networks (GAN) and variational autoencoders (VAE) have recently been applied to graphs, they are difficult to train. This work proposed a graph generation model that used an adaptation of a scattering transform to graphs. The proposed model was composed of an encoder (a Gaussianized graph scattering transform) and a decoder (a simple fully connected network that is adapted to specific tasks, such as link prediction, signal generation on graphs and full graph and signal generation). Results demonstrated state of the art performance of the proposed system for both link prediction (Cora, Citeseer and PubMed citation data) and graph and signal generation.
[Click image to open in new window.]The application of CNN to structured signal classification (e.g. image classification) inspired the development of deep filter banks, referred to as scattering transforms. These transforms apply a cascade of wavelet transforms and complex modulus operators to extract features that are invariant to group operations and stable to deformations. Furthermore, ConvNets inspired recent advances in geometric deep learning, which aim to generalize these networks to graph data by applying notions from graph signal processing to learn deep graph filter cascades. Graph Classification with Geometric Scattering (Oct 2018) [OpenReview] further advanced those lines of research by proposing a geometric scattering transform using graph wavelets defined in terms of random walks on the graph. They demonstrated the utility of features extracted with this designed deep filter bank in graph classification, and showed its competitive performance relative to other methods, including graph kernel methods and geometric deep learning ones, on both social and biochemistry data.
“In supervised graph classification problems one is given a training database of graph/label pairs $\small \{ (G_i,y{_i})^{N}_{i=1} \subset \mathcal{G} \times \mathcal{Y} \}$ sampled from a set of potential graphs $\small \mathcal{G}$ and potential labels $\small \mathcal{Y}$. The goal is to use the training data to learn a model $\small f : \mathcal{G} → \mathcal{Y}$ that associates to any graph $\small G ∈ \mathcal{G}$ a label $\small y = f(G) \in \mathcal{Y}$. These types of databases arise in biochemistry, in which the graphs may be molecules and the labels some property of the molecule (e.g., its toxicity), as well as in various types of social network databases. Until recently, most approaches were kernel based methods, in which the model $\small f$ was selected from the reproducing kernel Hilbert space generated by a kernel that measures the similarity between two graphs [ … snip … ]
“In many of these algorithms, task based (i.e., dependent upon the labels $\small \mathcal{Y}$) graph filters are learned from the training data as part of the larger network architecture. These filters act on a characteristic signal $\small \mathbf{x}_G$ that is defined on the vertices of any graph $\small G$, e.g., $\small \mathbf{x}_G$ is the vector of degrees of each vertex (we remark there are also edge based algorithms, such as Gilmer et al. and references within, but these have largely been developed for and tested on databases not considered in Sec. 4).
“Here, we propose an alternative to these methods in the form of a Geometric Scattering Classifier (GSC) that leverages graphdependent (but not label dependent) scattering transforms to map each graph $\small G$ to the scattering features extracted from $\small \mathbf{x}_G$. Furthermore, inspired by transfer learning approaches, we apply the scattering cascade as frozen network layers on $\small \mathbf{x}_G$, followed by several fully connected classification layers (see Fig. 2). We note that while the formulation in Sec. 3 is phrased for a single signal $\small \mathbf{x}_G$, it naturally extends to multiple signals by concatenating their scattering features.

“… our evaluation results on graph classification show the potential of the produced scattering features to serve as universal representations of graphs. Indeed, classification with these features with relatively simple classifier models reaches high accuracy results on most commonly used graph classification datasets, and outperforms both traditional and recent deep learning feed forward methods in terms of average classification accuracy over multiple datasets. …
Finally, the geometric scattering features provide a new way for computing and considering global graph representations, independent of specific learning tasks. Therefore, they raise the possibility of embedding entire graphs in Euclidean space (albeit high dimensional) and computing meaningful distances between graphs with them, which can be used for both supervised and unsupervised learning, as well as exploratory analysis of graphstructured data.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
Graph Wavelet Neural Network (ICLR 2019) [nonauthor code; OpenReview; discussion] presented graph wavelet neural network (GWNN), a novel graph convolutional neural network (GCN) leveraging graph wavelet transforms to address the shortcomings of previous spectral graph CNN methods that depended on graph Fourier transforms. Different from graph Fourier transformations, graph wavelet transformations could be obtained via a fast algorithm without requiring matrix eigendecomposition with high computational cost. Moreover, graph wavelets were sparse and localized in the vertex domain, offering high efficiency and good interpretability for graph convolution. The proposed GWNN significantly outperformed previous spectral graph CNNs in the task of graphbased semisupervised classification on three benchmark datasets: Cora, Citeseer and PubMed.
[Image source. Click image to open in new window.]Community Detection over a Heterogeneous Population of NonAligned Networks (Apr 2019) [code  discussion (reddit)]
[Image source. Click image to open in new window.]
 “Clustering and community detection with multiple graphs have typically focused on aligned graphs, where there is a mapping between nodes across the graphs (e.g., multiview, multilayer, temporal graphs). However, there are numerous application areas with multiple graphs that are only partially aligned, or even unaligned. … we develop a joint stochastic blockmodel (Joint SBM) to estimate shared communities across sets of heterogeneous nonaligned graphs. We derive an efficient spectral clustering approach to learn the parameters of the joint SBM. We … show that the joint model is able to exploit crossgraph information to better estimate the communities compared to learning separate SBMs on each individual graph.”
Graph Signal Processing – see also:
 Learning Structural Node Embeddings via Diffusion Wavelets (2018) [project; code] by Jure Leskovec and colleagues at Stanford University presented GraphWave, an algorithm for computing structural signatures for nodes in a network using heat spectral wavelet diffusion patterns. Instead of training on handselected features, GraphWave learned these embeddings in an unsupervised way. They mathematically proved that nodes with similar network neighborhoods had similar GraphWave embeddings even though these nodes may reside in very different parts of the network. GraphWave outperformed existing state of the art baselines in every experiment, by as much as 137%.
Graph Signal Processing – Subgraph and Anomaly (Outlier) Detection
Although describing slightly older work, Benjamin Miller (MIT) and colleagues presented an interesting paper, Subgraph Detection Using Eigenvector L_{1} Norms (2010) [well explained in their slides; local copy], which cast the problem of subgraph detection in a [graph] signal processing context. Their focus was the detection of anomalies in unweighted, undirected graphs through L_{1} properties of the eigenvectors of the graph’s modularity matrix. This metric was observed to have relatively low variance for certain categories of randomlygenerated graphs, revealing the presence of an anomalous subgraph with reasonable reliability when the anomaly was not wellcorrelated with stronger portions of the background graph. They described the problem of detecting a subgraph embedded in a background as one of detecting a signal in noise. The background graph (a network in which there exists no anomaly) functioned as the “noise” in their system. They then defined the anomalous subgraph as the “signal,” with the objective of evaluating a binary hypothesis test: deciding between the null hypothesis H_{0} and the alternate hypothesis H_{1}.
[image source. click image to open in new window.] This work is continued in a followon paper (Miller et al.), A Spectral Framework for Anomalous Subgraph Detection (Aug 2015).
“A wide variety of application domains are concerned with data consisting of entities and their relationships or connections, formally represented as graphs. Within these diverse application areas, a common problem of interest is the detection of a subset of entities whose connectivity is anomalous with respect to the rest of the data. While the detection of such anomalous subgraphs has received a substantial amount of attention, no applicationagnostic framework exists for analysis of signal detectability in graphbased data. In this paper, we describe a framework that enables such analysis using the principal Eigenspace of a graph’s residuals matrix, commonly called the modularity matrix in community detection. Leveraging this analytical tool, we show that the framework has a natural power metric in the spectral norm of the anomalous subgraph’s adjacency matrix (signal power) and of the background graph’s residuals matrix (noise power).
“We propose several algorithms based on spectral properties of the residuals matrix, with more computationally expensive techniques providing greater detection power. Detection and identification performance are presented for a number of signal and noise models, including clusters and bipartite foregrounds embedded into simple random backgrounds as well as graphs with community structure and realistic degree distributions. The trends observed verify intuition gleaned from other signal processing areas, such as greater detection power when the signal is embedded within a less active portion of the background. We demonstrate the utility of the proposed techniques in detecting small, highly anomalous subgraphs in real graphs derived from Internet traffic and product copurchases.”
More recent work in this domain was presented in Spectral Embedding Norm: Looking Deep into the Spectrum of the Graph Laplacian (Oct 2018). The extraction of clusters from a dataset which includes multiple clusters and another significant portion of “background” samples is a task of practical importance. The traditional spectral clustering algorithm, relying on the leading $\small K$ eigenvectors to detect the $\small K$ clusters, fails in such cases. This paper proposed the spectral embedding norm which summed the squared values of the first $\small I$ (normalized) eigenvectors, where $\small I$ can be larger than $\small K$. They proved that the quantity could be used to separate clusters from the background under generic conditions motivated by applications such as anomaly detection. The performance of the algorithm was not sensitive to the choice of $\small I$. Experiments on synthetic and realworld datasets were presented.
[Image source. Click image to open in new window.]This is an impressive demonstration:
[Image source. Click image to open in new window.]Graph Signal Processing – Dimensionality Analysis
Although largescale datasets are frequently highdimensional, their data frequently possess structures that significantly decrease their intrinsic dimensionality (ID) due to the presence of clusters, points being located close to lowdimensional varieties or finegrained lumping. Estimating the Effective Dimension of Large Biological Datasets Using Fisher Separability Analysis (Jan 2019) [code] tested a dimensionality estimator that was based on analysing the separability properties of data points, on several benchmarks and real biological datasets. They showed that the introduced measure of ID had performance competitive with state of the art measures, being efficient across a wide range of dimensions and performing better in the case of noisy samples. Moreover, it allowed estimating the intrinsic dimension in situations where the intrinsic manifold assumption was not valid.

In signal processing of multidimensional signals, for example in computer vision, the intrinsic dimension of the signal describes how many variables are needed to represent the signal. For a signal of $\small N$ variables, its intrinsic dimension $\small M$ satisfies $\small 0 \leq M \leq N$. Usually the intrinsic dimension of a signal relates to variables defined in a Cartesian coordinate system. In general, however, it is also possible to describe the concept for nonCartesian coordinates, for example, using polar coordinates.

(p. 4) “… the finegrained lumping of the data point cloud can be identified by two interesting features: the histogram of empirical $\small p_\alpha$ distribution (probabilities of individual point nonseparability) and the profile of intrinsic dimensions $\small n_\alpha$ for a range of $\small \alpha$ values (e.g., $\small \alpha \in [0.6, \ldots, 1.0]$).”
The data for Figure 5 in Estimating the Effective Dimension of Large Biological Datasets Using Fisher Separability Analysis (above) was derived from Cell Type Atlas and Lineage Tree of a Whole Complex Animal by SingleCell Transcriptomics (May 2018), which mapped the transcriptome for essentially all cell types a planarian (flatworm): dozens of cell types including stem cells, progenitors, and terminally differentiated cells. They then applied a new computational algorithm, partitionbased graph abstraction (PAGA), which could predict a lineage tree for the whole animal in an unbiased way. Notably, their approach was applicable to other model and nonmodel organisms, assuming that their differentiation processes are sampled with sufficient time resolution.
[Image source. Click image to open in new window.]
Note also (different research group; computational/in silico approach) Inferring Regulatory Networks from Experimental Morphological Phenotypes: A Computational Method ReverseEngineers Planarian Regeneration (Jun 2015) [media: Planarian Regeneration Model Discovered by Artificial Intelligence]. “An artificial intelligence system has for the first time reverseengineered the regeneration mechanism of planaria  the small worms whose extraordinary power to regrow body parts has made them a research model in human regenerative medicine. The discovery by Tufts University biologists presents the first model of regeneration discovered by a nonhuman intelligence and the first comprehensive model of planarian regeneration, which had eluded human scientists for over 100 years. …”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
Additional Reading

Multilayer Graph Signal Clustering (Nov 2018)
“Multilayer graphs are commonly used to model relationships of different types between data points. In this paper, we propose a method for multilayer graph data clustering, which combines the different graph layers in the Riemann manifold of SemiPositive Definite (SPD) graph Laplacian matrices. The resulting combination can be seen as a lowdimensional representation of the original data points. In addition, we consider that data can also carry signal values and not only graph information. We thus propose new clustering solution for such hybrid data by training a neural network such that the transformed data points are orthonormal, and their distance on the aggregated graph is minimized. Experiments on synthetic and real data show that our method leads to a significant improvement with respect to stateoftheart clustering algorithms for graph data.”
[Image source. Click image to open in new window.]

Spectral Multigraph Networks for Discovering and Fusing Relationships in Molecules (Nov 2018)
“Spectral Graph Convolutional Networks (GCNs) are a generalization of convolutional networks to learning on graphstructured data. Applications of spectral GCNs have been successful, but limited to a few problems where the graph is fixed, such as shape correspondence and node classification. In this work, we address this limitation by revisiting a particular family of spectral graph networks, Chebyshev GCNs, showing its efficacy in solving graph classification tasks with a variable graph structure and size. Chebyshev GCNs restrict graphs to have at most one edge between any pair of nodes. To this end, we propose a novel multigraph network that learns from multirelational graphs. We model learned edges with abstract meaning and experiment with different ways to fuse the representations extracted from annotated and learned edges, achieving competitive results on a variety of chemical classification benchmarks.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.] 
Convolutional Neural Network Architectures for Signals Supported on Graphs (Dec 2018)
“Two architectures that generalize convolutional neural networks (CNNs) for the processing of signals supported on graphs are introduced. We start with the selection graph neural network (GNN), which replaces linear time invariant filters with linear shift invariant graph filters to generate convolutional features and reinterprets pooling as a possibly nonlinear subsampling stage where nearby nodes pool their information in a set of preselected sample nodes. A key component of the architecture is to remember the position of sampled nodes to permit computation of convolutional features at deeper layers. The second architecture, dubbed aggregation GNN, diffuses the signal through the graph and stores the sequence of diffused components observed by a designated node. This procedure effectively aggregates all components into a stream of information having temporal structure to which the convolution and pooling stages of regular CNNs can be applied. A multinode version of aggregation GNNs is further introduced for operation in large scale graphs. An important property of selection and aggregation GNNs is that they reduce to conventional CNNs when particularized to time signals reinterpreted as graph signals in a circulant graph. Comparative numerical analyses are performed in a source localization application over synthetic and realworld networks. Performance is also evaluated for an authorship attribution problem and text category classification. Multinode aggregation GNNs are consistently the best performing GNN architecture.”

“The graph Laplacian is a standard tool in data science, machine learning, and image processing. The corresponding matrix inherits the complex structure of the underlying network and is in certain applications densely populated. This makes computations, in particular matrixvector products, with the graph Laplacian a hard task. A typical application is the computation of a number of its eigenvalues and eigenvectors. Standard methods become infeasible as the number of nodes in the graph is too large. We propose the use of the fast summation based on the nonequispaced fast Fourier transform (NFFT ) to perform the dense matrixvector product with the graph Laplacian fast without ever forming the whole matrix. The enormous flexibility of the NFFT algorithm allows us to embed the accelerated multiplication into Lanczosbased eigenvalues routines or iterative linear system solvers and even consider other than the standard Gaussian kernels. We illustrate the feasibility of our approach on a number of test problems from image segmentation to semisupervised learning based on graphbased PDEs. In particular, we compare our approach with the Nyström method. Moreover, we present and test an enhanced, hybrid version of the Nyström method, which internally uses the NFFT.”
 code: “This algorithm is implemented as $\small \text{applications/fastsum}$ and $\small \text{matlab/fastsum}$ in C and Matlab within the NFFT3 software library. See also Keiner et al. (2008) Using NFFT3  A Software Library for Various Nonequispaced Fast Fourier Transforms (pp. 19:119:30). We use the default KaiserBessel window function. In Figure 1, we list the relevant control parameters of Algorithm 3.1 and regularization approach.”

Learning Graphs from Data: A Signal Representation Perspective (Dec 2018) [slides (excellent); local copy]
“The construction of a meaningful graph topology plays a crucial role in the effective representation, processing, analysis and visualization of structured data. When a natural choice of the graph is not readily available from the data sets, it is thus desirable to infer or learn a graph topology from the data. In this tutorial overview, we survey solutions to the problem of graph learning, including classical viewpoints from statistics and physics, and more recent approaches that adopt a graph signal processing (GSP) perspective. We further emphasize the conceptual similarities and differences between classical and GSPbased graph inference methods, and highlight the potential advantage of the latter in a number of theoretical and practical scenarios. We conclude with several open issues and challenges that are keys to the design of future signal processing and machine learning algorithms for learning graphs from data.”
[Image source. Click image to open in new window.]
[Image source. This is the source (sharper image here) for Fig. 8 in Learning Graphs from Data: A Signal Representation Perspective. Click image to open in new window.]

Guided Graph Spectral Embedding: Application to the C. elegans Connectome (Feb 2019)
 “Graph spectral analysis can yield meaningful embeddings of graphs by providing insight into distributed features not directly accessible in nodal domain. … In this work, we take inspiration from these constructions to define a new guided spectral embedding that combines maximizing energy concentration with minimizing modified embedded distance for a given importance weighting of the nodes. … Following the derivation of our new optimization criterion and its linear approximation, we exemplify the methodology on the C. elegans structural connectome. The results of our analyses confirm known observations on the nematode’s neural network in terms of functionality and importance of cells. Compared to Laplacian embedding, the guided approach, focused on a certain class of cells (sensory, inter and motoneurons), provides more biological insights, such as the distinction between somatic positions of cells, and their involvement in low or high order processing functions.”
Link Prediction
Link prediction, the discovery of relations such as (subject, relation, object) triples, is tremendously useful for knowledge graph {construction  embedding  completion}, fact checking, and knowledge discovery.
Embedding Entities and Relations for Learning and Inference in Knowledge Bases (Dec 2014; updated Aug 2015) [nonauthor code here and here] considered learning representations of entities and relations in KBs using a neuralembedding approach. They showed that most existing models, including NTN (Socher et al., 2013) and TransE (Bordes et al., 2013b), could be generalized under a unified learning framework, where entities are lowdimensional vectors learned from a neural network and relations are bilinear and/or linear mapping functions. Under this framework (DistMult ), they compared a variety of embedding models on the link prediction task, showing that a simple bilinear formulation could achieve [then] state of the art results for the task (achieving a top10 accuracy of 73.2% vs. 54.7% by TransE on Freebase). They also introduced a novel approach that utilized the learned relation embeddings to mine logical rules such as $\small \text{BornInCity(a,b) and CityInCountry(b,c) $\Rightarrow$ Nationality(a,c)}$. Embeddings learned from the bilinear objective were particularly good at capturing relational semantics; more interestingly, their embedding based rule extraction approach successfully outperformed [then] state of the art confidence based rule mining approach (Horn rules) that involved compositional reasoning.
[Image source. Click image to open in new window.]Knowledge graph completion deals with automatically understanding the structure of large knowledge graphs (labeled directed graphs), and predicting missing relationships (labeled edges). In statistical relational learning, the link prediction problem is key to automatically understand the structure of large knowledge bases. Complex Embeddings for Simple Link Prediction (Jun 2016) [code] – by Théo Trouillon (Xerox Research Centre Europe) and colleagues – proposed to solve this problem through latent factorization, but through the use of complex valued embeddings. The composition of complex embeddings could handle a large variety of binary relations, among them symmetric and antisymmetric relations. Compared to state of the art models such as Neural Tensor Networks and Holographic Embeddings, their approach – based on complex embeddings – was arguably simpler, as it only uses the Hermitian dot product (the complex counterpart of the standard dot product between real vectors). Their approach, ComplEx, was scalable to large datasets as it remained linear in time and space, consistently outperforming alternative approaches on standard link prediction benchmarks.
[Image source. Click image to open in new window.]State of the art (statistical relational learning / knowledge graph completion) embedding models propose different tradeoffs between modeling expressiveness, and time and space complexity. Trouillon et al. followed up their ComplEx paper (Jun 2016) with Knowledge Graph Completion via Complex Tensor Factorization (Feb 2017; updated Nov 2017) [code], in which they reconciled both expressiveness and complexity through the use of complexvalued embeddings, and explored the link between such complexvalued embeddings and unitary diagonalization. They corroborated their approach theoretically and showed that all real square matrices – thus all possible relation/adjacency matrices – are the real part of some unitarily diagonalizable matrix – opening the door to other applications of square matrices factorization. Their approach, based on complex embeddings, was arguably simple (as it only involves a Hermitian dot product, the complex counterpart of the standard dot product between real vectors), whereas other methods resorted to increasingly complicated composition functions to increase their expressiveness. The proposed complex embeddings were scalable to large data sets as they remain linear in both space and time, while consistently outperforming alternative approaches on standard link prediction benchmarks.
“This extended version adds proofs of existence of the proposed model in both single and multirelational settings, as well as proofs of the nonuniqueness of the complex embeddings for a given relation. Bounds on the rank of the proposed decomposition are also demonstrated and discussed. The learning algorithm is provided in more details, and more experiments are provided, especially regarding the training time of the models.”
Improving Knowledge Graph Embedding Using Simple Constraints (May 2018) [code] investigated the potential of using very simple constraints to improve KG embedding. They examined nonnegativity constraints on entity representations and approximate entailment constraints on relation representations – hence, ComplExNNE+AER – an extension of ComplEx.
[Image source. Click image to open in new window.]Nonnegativity constraints on entity representations in that work helped learn compact and interpretable representations for entities, whereas approximate entailment constraints on relation representations further encoded regularities of logical entailment between relations into their distributed representations. The constraints imposed prior beliefs upon the structure of the embedding space, without negative impacts on efficiency or scalability. For each relation, it assumed a score matrix whose sign matrix was partially observed. The entries corresponding to factual and nonfactual triples had the sign $\small 1$ and $\small 1$, respectively.
Notes:
Regarding nonnegativity constraints: for technical reasons the variables of linear programs must always take nonnegative values; for example, the linear inequalities x ≥ 0 and y ≥ 0 specify that you cannot produce a negative number of items.
Regarding approximate_entailment_constraints, these authors mean an ordered pair of relations such that the former approximately entails the latter – e.g., BornInCountry and Nationality – stating that a person born in a country is very likely, but not necessarily, to have a nationality of that country.]
Holographic embeddings are an approach to generating entity embeddings from a list of (head, tail, relation) triples like (‘Jeff’, ‘Amazon’, ‘employer’) and (‘Zuck’, ‘Palo Alto’, ‘location’). Embeddings can be used as lossy (but memoryefficient) inputs to other machine learning models, used directly in triple inference (i.e. knowledge base completion, or link prediction) by evaluating the likelihood of candidate triples, or to search for associated entities using $\small k$nearest neighbors. For example, a search from the embedding representing the entity ‘University of California, Berkeley’ yields the associated entities ‘UC Irvine’, ‘Stanford University’, ‘USC’, ‘UCLA’, and ‘UCSD’ (see the figure in this GitHub repo).
Learning embeddings of entities and relations is an efficient and versatile method to perform machine learning on relational data such as knowledge graphs. Holographic Embeddings of Knowledge Graphs (Dec 2017) [author’s code]) – by Maximilian Nickel, Lorenzo Rosasco and Tomaso Poggio at MIT – proposed holographic embeddings (HolE) to learn compositional vector space representations of entire knowledge graphs. The proposed method was related to holographic models of associative memory in that it employed circular correlation to create compositional representations. By using correlation as the compositional operator HolE could capture rich interactions but simultaneously remained efficient to compute, easy to train, and scalable to very large datasets. In extensive experiments they showed that holographic embeddings were able to outperform state of the art methods for link prediction in knowledge graphs and relational learning benchmark datasets.
[Image source. Click image to open in new window.] HolE is equivalent to ComplEx :; see: On the Equivalence of Holographic and Complex Embeddings for Link Prediction (Sep 2017).
What is a “holographic model?” Holographic Embeddings of Knowledge Graphs states “The proposed method is related to holographic models of associative memory in that it employs circular correlation to create compositional representations. By using correlation as the compositional operator HolE can capture rich interactions but simultaneously remains efficient to compute, easy to train, and scalable to very large datasets.”
[Image source. Click image to open in new window.]
The Wikipedia Crosscorrelation page and this Quora post, What is the need for circular correlation and its advantages over a linear correlation? provide a basic description of the concept of circular correlation and its Fourier transformation, which is used in that paper, and discussed in the ComplEx / HolE “comparison” paper.
 Expanding Holographic Embeddings for Knowledge Completion provides additional explanation.
These slides (starting at slide 7) describe holographic embeddings, Holographic Embeddings of Knowledge Graphs (2016; Maximilian Nickel, Lorenzo Rosasco and Tomaso Poggio) [local copy].
Excellent nonauthor code, discussion and examples for HolE are found in this GitHub repository.
Expanding Holographic Embeddings for Knowledge Completion (NeurIPS 2018) [video] proposed a new family of embeddings for knowledge graphs, that interpolated between a method with high model complexity and one (HolE) with low dimensionality and high training efficiency. This interpolation, termed HolEx, was achieved by concatenating several linearly perturbed copies of original HolE. They formally characterized the number of perturbed copies needed to provably recover the full entityentity or entityrelation interaction matrix, leveraging ideas from Haar wavelets and compressed sensing. In practice, using just a handful of Haarbased or random perturbation vectors resulted in a much stronger knowledge completion system. On the Freebase FB15k dataset, HolEx outperformed HolE by 14.7% on the HITS@10 metric, and the current pathbased state of the art method PTransE by 4% (absolute).
[Image source. Click image to open in new window.]Knowledge graph embeddings have received significant attention due to their excellent performance for tasks like link prediction and entity resolution. In Complex and Holographic Embeddings of Knowledge Graphs: A Comparison (Jul 2017) Théo Trouillon (Université Grenoble Alpes) and Maximilian Nickel (Facebook AI Research  MIT) provided a comparison of two state of the art knowledge graph embeddings: ComplEx and HolE. They briefly reviewed both models, discussing how their scoring functions were equivalent, and then analyzed the discrepancy of results reported in the original articles – showing experimentally that they are likely due to the use of different loss functions. They also discussed advantages and disadvantages of both models and under which conditions one would be preferable to the other.
Critique: although they do not label the models in their Table 1, from their discussion of the loss models, “Margin” refers to their HolE model, whereas “NegLL” refers to Trouillon et al.’s ComplEx model.
[Image source. Click image to open in new window.]
Though similar, the filtered MRR value (0.541) that they report for HolE on the FB15k dataset in this 2017 paper differs from the value (0.524) that they reported in their earlier 2016 and 2017 papers.
Likewise, the filtered MRR value (0.639) that they report for ComplEx on the FB15k dataset in this 2017 paper is similar but differs from the value (0.692) that they reported in their earlier 2016 and 2017 papers.
Analogical Inference for MultiRelational Embeddings (Jul 2017) [code] imposed analogical properties on embeddings. Each relation was represented as a normal matrix, which were constrained to satisfy commutativity properties. ANALOGY represented a novel framework for optimizing the latent representations with respect to the “analogical” properties of the embedded entities and relations for knowledge base completion. Analogical inference posited that if subsets of entities and relations were analogous in systems A and B, then the unobserved triples in B could be inferred by mirroring their counterparts in A. The proposed approach obtained state of the art results on two popular benchmark datasets, outperforming a large number of strong baselines in most cases.
[Image source. Click image to open in new window.]Related to analogical reasoning, note also that Beyond Word Embeddings: Learning Entity and Concept Representations from Large Scale Knowledge Bases described work that jointly learned concept vectors from a textual knowledge base (Wikipedia) and a graphical knowledge base (Probase), demonstrating superior results on analogical reasoning and concept categorization.
Knowledge Graph Embedding with Iterative Guidance from Soft Rules (Nov 2017) [code] presented a novel approach to knowledge graph embedding (KGE) combined with guidance from soft logic rules, called RuleGuided Embedding (RUGE), that provided state of the art results in KG link prediction (compared e.g. to ComplEx: their Table 3). This work builds on previous work by these authors (that provides additional information): Jointly Embedding Knowledge Graphs and Logical Rules (which introduced KALE, an approach that learned entity and relation embeddings by jointly modeling knowledge and logic), and Knowledge Base Completion using Embeddings and Rules (which employed an integer linear programming approach plus logic rules that leveraged prior knowledge in the KG to greatly reduce the solution space during inference, i.e. link prediction).
[Image source. Click image to open in new window.]PredPath, described in Discriminative Predicate Path Mining for Fact Checking in Knowledge Graphs (Shi and Weninger, Apr 2016) [code] presented a discriminative pathbased method for fact checking in KG that incorporated connectivity, type information, and predicate interactions. Given a statement in the form of a (subject, predicate, object) triple – for example, (Chicago, capitalOf, Illinois) – their approach mined discriminative paths that alternatively defined the generalized statement (U.S. city, predicate, U.S. state) and used the mined rules to evaluate the veracity of that statement.

An advantage of this fact checking procedure is its generality and contextdependency: just as humans learn unknown words, model generality means the predicate of a statement can be arbitrary, and is not required to be presented in the knowledge base. Moreover, once a prior knowledge is learned, it is associated with a certain type of entity pair relation and can be used for different tasks including general question answering or knowledge base completion. The notion of contextdependency allows the fact checker to discern different definitions of a predicate in different situations (association rule mining). For example, $\small \text{capitalOf}$ could define the capitals of US states, colloquialisms such as “Kansas City is the soccer capital of America”, or historical or timesensitive predicates such as “Calcutta was the capital of India,” depending on the context.

For followon work by those authors see ProjE, below.
Followon work by the PredPath authors, ProjE : Embedding Projection for Knowledge Graph Completion (Shi and Weninger, Nov 2016) [code] presented a shared variable neural network model that filled in missing information in a KG by learning joint embeddings of the KG entities and edges (collectively calculating the scores of all candidate triples). ProjE had a parameter size smaller than 11 out of 15 existing methods while performing 37% better than the current best method on standard datasets. They also showed, via a fact checking task, that ProjE was capable of accurately determining the veracity of many declarative statements.
[Image source. Click image to open in new window.] Relevant to the biomedical domain, both of those papers (PredPath; ProjE) evaluated their models on the SemMedDB dataset, a KG extracted from all of PubMed [SemMedDB: A PubMedScale Repository of Biomedical Semantic Predications (Dec 2012)]. Table 4 in the ProjE paper indicated that these models performed similarly to one another while outperforming the other evaluated models.
ProjE authors Shi and Weninger) also proposed a new KGC task, the OpenWorld Knowledge Graph Completion (Nov 2017) [code]. As a first attempt to solve this task they introduced an openworld KGC model called ConMask, which learned embeddings of an entity’s name and parts of its textdescription to connect unseen entities to the KG. To mitigate the presence of noisy text descriptions, ConMask used a relationshipdependent content masking to extract relevant snippets, and then trained a CNN to fuse the extracted snippets with entities in the KG. Experiments on large data sets showed that ConMask performed well in the openworld KGC task, even outperforming existing KGC models on the standard closedworld KGC task.
[Image source. Click image to open in new window.]Convolutional 2D Knowledge Graph Embeddings (Jul 2018) [code] introduced ConvE, a deeper, multilayer convolutional network model for link prediction. ConvE was highly parameter efficient, yielding the same performance as DistMult and RGCN [RGCN is discussed below] with 8x and 17x fewer parameters.
 Comment: due to it’s architecture [a two layer, convolutional network] – which are hard to analyze [from a comment in RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space (Feb 2019)] – ConvE is a “black box.”
[Image source. Click image to open in new window.]“Analysis of our model suggests that it is particularly effective at modelling nodes with high indegree – which are common in highlyconnected, complex knowledge graphs such as Freebase and YAGO3. In addition, it has been noted that the WN18 and FB15k datasets suffer from test set leakage, due to inverse relations from the training set being present in the test set – however, the extent of this issue has so far not been quantified. We find this problem to be severe: a simple rulebased model can achieve stateoftheart results on both WN18 and FB15k.”
ConvE was the inspiration – by different authors – for HypER (Hypernetwork Knowledge Graph Embeddings) (Aug 2018) [code]. HypER used a hypernetwork architecture to generate convolutional layer filters specific to each relation and apply those filters to the subject entity embeddings.
[Image source. Click image to open in new window.]Their model, HypER, simplified the entity and relation embedding interactions introduced by ConvE while outperforming all previous approaches to link prediction across all standard link prediction datasets. HypER used a hypernetwork to generate weights of convolutional filters for each relation. In this work, they generated relationspecific weights to process input entities, and also obtained multitask knowledge sharing across different relations in the knowledge graph.
Whereas ConvE had a global common set of 2dimensional convolutional filters (suggesting the presence of a 2D structure in word embeddings) that reshaped and concatenated subject entity and relation embeddings, HypER used a hypernetwork to generate 1dimensional relationspecific filters to process the unadjusted subject entity embeddings, thereby simplifying the interaction between subject entity and relation embeddings. Thus, instead of extracting a limited number of both entityrelated and relationrelated features from the dimensions around the concatenation point, HypER covered all subject entity embedding dimensions by sliding the relationdependent convolutional filters over the whole entity embedding.
 A hypernetwork is an approach of having one network generate weights for another network, which can be used to provide weightsharing across layers and to dynamically synthesize weights based on an input. [See also HyperNetworks (Google Brain: Dec 2016).]
Link Prediction using Embedded Knowledge Graphs (Nov 2016; updated Apr 2018) by Microsoft Research and Google Research addressed the task of knowledge base completion by performing a single, short sequence of interactive lookup operations on an embedded knowledge graph which had been trained through endtoend backpropagation to be an optimized and compressed version of the initial knowledge base. Their proposed model, Embedded Knowledge Graph Network (EKGN), achieved state of the art results on popular knowledge base completion benchmarks.
[Image source. Click image to open in new window.]In August 2018 A Capsule Networkbased Embedding Model for Knowledge Graph Completion and Search Personalization [code] introduced CapsE, a new model for knowledge graph completion. CapsE employed a capsule network to model (subject, relation, object) relationship triples. CapsE obtained state of the art link prediction results for knowledge graph completion on two benchmark datasets: WN18RR and FB15k237. Comparing Table 2 in that paper to Table 4 in On Link Prediction in Knowledge Bases: MaxK Criterion and Prediction Protocols (2018) appears to confirm CapsE as a state of the art model, the first to apply capsule_networks to knowledge graph completion and search personalization.
[Image source. Click image to open in new window.]Predicting Semantic Relations using Global Graph Properties (Aug 2018) [code] combined global and local properties of semantic graphs through the framework of MaxMargin Markov Graph Models (M3GM), a novel extension of Exponential Random Graph Model (ERGM) that scales to large multirelational graphs. They demonstrated how such global modeling improves performance on the local task of predicting semantic relations between synsets, yielding new state of the art results on the WN18RR dataset, a challenging version of WordNet link prediction in which “easy” reciprocal cases were removed. In addition, the M3GM model identified multirelational motifs that were characteristic of wellformed lexical semantic ontologies.
[Image source. Click image to open in new window.]OneShot Relational Learning for Knowledge Graphs (Aug 2018) [code] by the University of California (Santa Barbara) and IBM Research observed that longtail relations are common in KGs (in other words, they have very few instances) and those newly added relations often do not have many known triples for training. In this work, they aimed at predicting new facts under a challenging setting where only one training instance was available. They proposed a oneshot relational learning framework, which utilized the knowledge extracted by embedding models and learned a matching metric by considering both the learned embeddings and onehop graph structures. Empirically, their model yielded considerable performance improvements over existing embedding models, and also eliminated the need of retraining the embedding models when dealing with newly added relations. These authors also prepared two new datasets, for their work.
[Image source. Click image to open in new window.][paraphrased:] “Existing benchmarks for knowledge graph completion, such as and YAGO310 are small subsets of realworld KGs. These datasets consider the same set of relations during training and testing and often include sufficient training triples for every relation. To construct datasets for oneshot learning, we go back to the original KGs and select those relations that do not have too many triples as oneshot task relations. We refer the rest of the relations as background relations, since their triples provide important background knowledge for us to match entity pairs. Our first dataset is based on NELL, a system that continuously collects structured knowledge by reading webs. We take the latest dump and remove those inverse relations. We select the relations with less than 500 but more than 50 triples 3 as oneshot tasks. To show that our model is able to operate on largescale KGs, we follow the similar process to build another larger dataset based on Wikidata. The dataset statistics are shown in Table 1. Note that the WikiOne dataset is an order of magnitude larger than any other benchmark datasets in terms of the numbers of entities and triples.”
Most existing knowledge graph completion methods either focus on the positional relationship between entity pair and single relation (1hop path) in semantic space, or concentrate on the joint probability of random walks on multihop paths among entities. However, they do not fully consider the intrinsic relationships of all the links among entities. By observing that the single relation and multihop paths between the same entity pair generally contain shared/similar semantic information, Hierarchical Attention Networks for Knowledge Base Completion via Joint Adversarial Training (Oct 2018) proposed a novel method for KB completion, which captured the features shared by different data sources utilizing hierarchical attention networks (HAN) and adversarial training (AT). The joint adversarial training with a gradient reversal layer (GRL) reversed the backpropagation gradient to allow the feature extractor to extract the shared features between different data sources. The HANs automatically encoded the inputs into lowdimensional vectors and exploited two partial parametershared components: one for feature source discrimination, and the other for determining missing relations. By joint adversarial training (AT) the entire model, their method minimized the classification error of missing relations. The AT mechanism encouraged their model to extract features that were both discriminative for missing relation prediction, and shareable between single relation and multihop paths.
[Image source. Click image to open in new window.]Hierarchical Attention Networks for Document Classification also employed a hierarchical attention network.
My Table 1, below, summarizes MRR scores for link prediction for selected, bestperforming embedding models ca. mid2018. Around that time, ComplEx was employed by Facebook AI Research as the state of the art for comparison of various models in Canonical Tensor Decomposition for Knowledge Base Completion (Jun 2018) [code], which framed knowledge base completion as a 3rdorder binary tensor completion problem. “In this light, the Canonical Tensor Decomposition (CP) (Hitchcock, 1927) seems like a natural solution; however, current implementations of CP on standard Knowledge Base Completion benchmarks are lagging behind their competitors. In this work, we attempt to understand the limits of CP for knowledge base completion. First, we motivate and test a novel regularizer, based on tensor nuclear $\small p$norms. Then, we present a reformulation of the problem that makes it invariant to arbitrary choices in the inclusion of predicates or their reciprocals in the dataset. These two methods combined allow us to beat the current state of the art [Lacroix et al. 2018, Table 2 at bottom right in that image] on several datasets with a CP decomposition, and obtain even better results using the more advanced ComplEx model.”
[Image source. Click image to open in new window.]
Tensor factorization approaches have proved promising for link prediction problems. Proposed in 1927, Canonical Polyadic (CP) decomposition was among the first tensor factorization approaches. CP generally performs poorly for link prediction as it learns two independent embedding vectors for each entity
Source, whereas they are really tied. SimplE Embedding for Link Prediction in Knowledge Graphs (Oct 2018) [code] presented a simple enhancement of CP (SimplE ) to allow the two embeddings of each entity to be learned dependently. The complexity of SimplE grows linearly with the size of embeddings, which are interpretable, also allowing certain types of background knowledge to be incorporated into these embeddingsIncorporating Background Knowledge into the Embeddings. In SimplE, each element of the embedding vector of the entities can be considered as a feature of the entity and the corresponding element of a relation can be considered as a measure of how important that feature is to the relation. Such interpretability allows the embeddings learned through SimplE for an entity (or relation) to be potentially transferred to other domains. It also allows for incorporating observed features of entities into the embeddings by fixing one of the elements of the embedding vector of the observed value. Nickel et al. [30] show that incorporating such features helps reduce the size of the embeddings. through weight tying.
 SimplE Embedding for Link Prediction in Knowledge Graphs (Feb 2018; updated Oct 2018) – which was only evaluated on datasets WN18 (not WN18RR, which corrects flaws in WN18) and FB15k (not FB15k237, which corrects errors in FB15k) – did not cite Canonical Tensor Decomposition for Knowledge Base Completion (Jun 2018), variants of which outperformed SimplE (refer to these tables).
TuckER: Tensor Factorization for Knowledge Graph Completion (Jan 2019) [code; discussion] proposed TuckER, a relatively simple but powerful linear model based on Tucker decomposition of the binary tensor representation of knowledge graph triples. TuckER outperformed all previous state of the art models across standard link prediction datasets. TuckER was a fully expressive model, deriving the bound on its entity and relation embedding dimensionality for full expressiveness which was several orders of magnitude smaller than the bound of previous state of the art models ComplEx and SimplE. “Given that knowledge graphs contain several relation types (symmetric, asymmetric, transitive, etc.), it is important for a link prediction model to have enough expressive power to accurately represent all of them. We thus show that TuckER is fully expressive, i.e. given any ground truth over the triples, there exists an assignment of values to the entity and relation embeddings that accurately separates the true triples from false ones.” They further showed that “… several previous state of the art linear models (RESCAL, DistMult, ComplEx, SimplE) are special cases of TuckER.”

“The number of parameters of TuckER increases linearly with respect to entity and relation embedding dimensionality $\small d_e$ and $\small d_r$, as the number of entities and relations ($\small n_e$ and $\small n_r$ respectively) increases, since the number of parameters of core tensor $\small \mathcal{W}$ depends only on the entity and relation embedding dimensionality and not on the number of entities or relations. By having the core tensor $\small \mathcal{W}$, unlike simpler models such as DistMult, ComplEx and SimplE, TuckER does not encode all the learned knowledge into the embeddings; some is stored in the core tensor and shared between all entities and relations.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]

“As previously stated in Section 4 [TuckER: Tensor Factorization for Knowledge Graph Completion], a tensor factorization model is said to be fully expressive if for any ground truth over all entities and relations, there exist entity and relation embeddings that accurately separate the true triples from the false ones.
“As shown in (Trouillon et al., 2017), ComplEx is fully expressive with the bound on entity and relation embedding dimensionality of $\small d_e = d_r = n_e \cdot n_r$ for achieving full expressiveness. Similarly to ComplEx, Kazemi & Poole (2018) show that SimplE is fully expressive with entity and relation embeddings of size $\small d_e = d_r = min(n_e \cdot n_r, \gamma + 1)$, with $\small \gamma$ representing the number of true facts. The authors further prove other models are not fully expressive: DistMult, because it cannot model asymmetric relations; and transitive models such as TransE (Bordes et al., 2013) and its variants FTransE (Feng et al., 2016) and STransE (Nguyen et al., 2016), because of certain contradictions that they impose between different relation types.
“By Theorem 1, we establish the bound on entity and relation embedding dimensionality (i.e. rank of the decomposition) that guarantees full expressiveness of TuckER. …”
RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space (Feb 2019) [code]. “… we present a new approach for knowledge graph embedding called RotatE, which is able to model and infer various relation patterns including: symmetry/antisymmetry, inversion, and composition. Specifically, the RotatE model defines each relation as a rotation from the source entity to the target entity in the complex vector space. In addition, we propose a novel selfadversarial negative sampling technique for efficiently and effectively training the RotatE model. Experimental results on multiple benchmark knowledge graphs show that the proposed RotatE model is not only scalable, but also able to infer and model various relation patterns and significantly outperform existing stateoftheart models for link prediction.”
 “… In this paper, we propose such an approach called RotatE for knowledge graph embedding. Our motivation is from Euler’s identity $\small e^{i \theta} = cos \theta + i sin \theta$, which indicates that a unitary complex number can be regarded as a rotation in the complex plane. …“ I discussed this identity in my Oct 2018 blog post Complex Numbers, $\small \pi$, Fourier Transform, Eigenvectors.
Table 1. Link prediction results: comparison of various embedding models (filtered MRR metric^{1})  
General notes. 1. This list is current to approximately late August, 2018. 2. I did not exhaustively search the literature, only the papers mentioned in this subsection. 3. From those papers, I selected ("cherry picked") the data from some of the betterperforming models discussed in this subsection, and from those again selected those models that are discussed herein. I also only collected/reported the MRR scores, as I wanted an uncluttered comparison of those models. 4. References cited may not be the primary source (refer to the references cited therein).  
Table footnotes.  
^{ 1}Filtered MRR: filtered MRR.  
^{2} Datasets:
WN18RR corrects flaws in WN18:
"A popular relation prediction dataset for WordNet is the subset curated as WN18, containing 18 relations for about 41,000 synsets extracted from WordNet 3.0. It has been noted that this dataset suffers from considerable leakage: edges from reciprocal relations such as hypernym/hyponym appear in one direction in the training set and in the opposite direction in dev/test. This allows trivial rulebased baselines to achieve high performance. To alleviate this concern, Dettmers et al. (2018) released the WN18RR set, removing seven relations altogether. However, even this dataset retains four symmetric relation types: 'also see', 'derivationally related form', 'similar to', and 'verb group'. These symmetric relations can be exploited by defaulting to a simple rulebased predictor." [Source: Section 4.1 in Predicting Semantic Relations using Global Graph Properties; references therein.] FB15k237 (see also), which corrects errors in FB15k, contains about 14,541 entities with 237 different relations. "This dataset contains knowledge base relation triples and textual mentions of Freebase entity pairs. The knowledge base triples are a subset of the FB15k set. The textual mentions are derived from 200 million sentences from the ClueWeb12 corpus coupled with FACC1 Freebase entity mention annotations." YAGO310 is a subset of YAGO3 which consists of entities which have a minimum of 10 relations each. It has 123,182 entities and 37 relations. Most of the triples deal with descriptive attributes of people, such as citizenship, gender, and profession. YAGO37 is extracted from the core facts of YAGO3, containing 37 relations and 123,189 entities. This dataset was created by the RUGE authors: "The FB15k data set consists of 1,345 relations and 14,951 entities among them. The training set contains 483,142 triples, the validation set 50,000 triples, and the test set 59,071 triples. 454 rules are created for FB15k. The YAGO37 data set consists of 37 relations and 123,189 entities among them. The training set contains 989,132 triples, the validation set 50,000 triples, and the test set 50,000 triples. 16 rules are created for YAGO37. All triples are unique and we made sure that all entities/relations appearing in the validation or test sets were occurring in the training set."  
Dataset^{2}  
Model [ref]  FB15k  FB15k237  WN18  WN18RR  YAGO310  YAGO37 
ANALOGY [1]  0.723  0.211  0.942  0.391  0.257  
ANALOGY [7]  0.725  0.942  
CapsE [8]  0.538  0.391  
ComplEx [1]  0.716  0.206  0.942  0.390  0.266  
ComplEx [10]  0.692  0.247  0.941  0.440  0.340  
ComplEx [11]  0.69  0.240  0.941  0.444  
ComplEx [5]  0.692  0.941  
ComplEx [6]  0.639  0.941  
ComplExNNE+AER [3]  0.803  0.943  
ConvE [10]  0.657  0.325  0.943  0.430  0.440  
ConvE [4]  0.745  0.942  
ConvE [8]  0.316  0.460  
ConvE [9]  0.301  0.342  
ConvE [11]  0.745  0.301  0.942  0.342  
DistMult [10]  0.654  0.241  0.822  0.430  0.340  
DistMult [11]  0.35  0.241  0.83  0.425  
DistMult [3]  0.644  0.365  
DistMult [5]  0.654  0.822  
HolE [6]  0.541  0.938  
HolE [5]  0.524  0.938  
HypER [2]  0.762  0.341  0.951  0.463  
M3GM [12]  0.4983  
ProjE [1]  0.588  0.249  0.820  0.367  0.470  
RGCN [10]  0.696  0.248  0.814  
RGCN+ [9]  0.696  0.249  0.819  
RotatE [14]  0.797  0.338  0.949  0.476  
RUGE [3]  0.768  0.431  
TransE [1]  0.456  0.219  0.584  0.191  0.151  
TransE [3]  0.400  0.303  
TransE [5]  0.380  0.454  
TransE [8]  0.294  0.226  
TransE [9]  0.463  0.294  0.495  0.226  
TransE [12]  0.4659  
TransF [11]  0.564  0.286  0.856  0.505  
TuckER [13]  0.795  0.358  0.953  0.470 
From the table above we can see that there is considerable variation among the filtered MRR for link prediction by the various models on the specified datasets. Per the footnotes in that table – although there are fewer comparisons at present – we should probably assign greater performance to models that score well on the more challenging datasets:
Knowledge graph embedding has been an active research topic for knowledge base completion, with progressive improvement from the initial TransE, TransH, DistMult etc. to the current state of the art ConvE. ConvE uses 2D convolution over embeddings and multiple layers of nonlinear features to model knowledge graphs. The model can be efficiently trained and scalable to large knowledge graphs. However, there is no structure enforcement in the embedding space of ConvE. The recent graph convolutional network (GCN) provides another way of learning graph node embedding by successfully utilizing graph connectivity structure.
Endtoend StructureAware Convolutional Networks for Knowledge Base Completion (Nov 2018) [authors code here and here (PyTorch): both not yet available, 20190104] proposed a novel endtoend StructureAware Convolutional Networks (SACN ) that took the benefit of GCN and ConvE together. SACN consisted of an encoder of a weighted graph convolutional network (WGCN), and a decoder of a convolutional network called ConvTransE. WGCN utilized knowledge graph node structure, node attributes and relation types. It had learnable weights that collected adaptive amounts of information from neighboring graph nodes, resulting in more accurate embeddings of graph nodes. In addition, the node attributes were added as the nodes and were easily integrated into the WGCN. The decoder ConvTransE extended the state of the art ConvE to be translational between entities and relations while keeping the state of the art performance as ConvE. They demonstrated the effectiveness of their proposed SACN model on standard FB15k237 and WN18RR datasets, and presented about 10% relative improvement over the state of the art ConvE in terms of HITS@1, HITS@3 and HITS@10.
[Image source. Click image to open in new window.]In an interesting approach, Finding Streams in Knowledge Graphs to Support Fact Checking (Aug 2017) [code] viewed a knowledge graph as a “flow network” and knowledge as a fluid, abstract commodity. They showed that computational fact checking of a (subject, predicate, object) triple then amounts to finding a “knowledge stream” that emanates from the subject node and flows toward the object node through paths connecting them. Evaluations of their models (KS: Knowledge Stream and KLREL: Relational Knowledge Linker) revealed that this networkflow model was very effective in discerning true statements from false ones, outperforming existing algorithms on many test cases. Moreover, the model was expressive in its ability to automatically discover useful path patterns and relevant facts that may help human fact checkers corroborate or refute a claim.
[Image source. Click image to open in new window.]Predictive Network Representation Learning for Link Prediction (2017) [code] addressed structural link prediction, which infers missing links on a static network. Their proposed model, Predictive Network Representation Learning (PNRL), defined two learning objectives: observed structure preservation, and hidden link prediction. [Network representation learning models learn the latent representations of nodes, which can embed the rich structural information into the latent space. Most of these models are learned in an unsupervised manner.] To integrate the two objectives in a unified model, they developed an effective sampling strategy to select certain edges in a given network as assumed hidden links and regard the rest network structure as observed when training the model. By jointly optimizing the two objectives, the model not only enhanced the predictive ability of node representations, but also learned additional link prediction knowledge in the representation space.
[Image source. Click image to open in new window.]The use of drug combinations, termed polypharmacy, is common for treating patients with complex diseases or coexisting conditions. However, a major consequence of polypharmacy is a much higher risk of adverse side effects for the patient. Polypharmacy side effects may emerge because of drugdrug interactions, in which activity of one drug may change favorably or unfavorably if taken with another drug. The knowledge of drug interactions is often limited because these complex relationships are rare, and are usually not observed in relatively small clinical testing. Discovering polypharmacy side effects thus remains an important challenge with significant implications for patient mortality and morbidity.
Modeling Polypharmacy Side Effects with Graph Convolutional Networks (Jul 2018) [project (code/datasets); discussion] – by Jure Leskovec and colleagues at Stanford University – presented Decagon, an approach for modeling polypharmacy side effects. The approach constructed a multimodal graph of proteinprotein interactions, drugprotein target interactions and the polypharmacy side effects, which were represented as drugdrug interactions, where each side effect was an edge of a different type. Decagon was developed specifically to handle multimodal graphs with a large number of edge types. Their approach developed a new graph convolutional neural network for multirelational link prediction in multimodal networks. Unlike approaches limited to predicting simple drugdrug interaction values, Decagon could predict the exact side effect, if any, through which a given drug combination manifests clinically.
Decagon accurately predicted polypharmacy side effects, outperforming baselines by up to 69%. They found that it automatically learned representations of side effects indicative of cooccurrence of polypharmacy in patients. Furthermore, Decagon modeled particularly well polypharmacy side effects that had a strong molecular basis, while on predominantly nonmolecular side effects it achieved good performance because of effective sharing of model parameters across edge types. Decagon opens up opportunities to use large pharmacogenomic and patient population data to flag and prioritize polypharmacy side effects for followup analysis via formal pharmacological studies.
[Image source. Click image to open in new window.]Current KG completion models compel twothirds of a triple provided (e.g., subject and relation) to predict the remaining one. DSKG: A Deep Sequential Model for Knowledge Graph Completion (Oct 2018) [code] proposed a new model which used a KGspecific multilayer recurrent neutral network (RNN) to model triples in a KG as sequences. DSKG outperformed several state of the art KG completion models on the conventional entity prediction task for many evaluation metrics, based on two benchmark datasets (FB15k; WN18) and a more difficult dataset (FB15k237). Furthermore, their model was capable of predicting the triples, given only one entity.
[Image source. Click image to open in new window.]Many knowledge graph embedding methods operate on triples and are therefore implicitly limited by a very local view of the entire knowledge graph. MOHONE: Modeling Higher Order Network Effects in KnowledgeGraphs via Network Infused Embeddings (Nov 2018) presented a new framework, MOHONE, to effectively model higher order network effects in knowledgegraphs thus enabling one to capture varying degrees of network connectivity (from the local to the global). Their framework was generic, explicitly modeled the network scale, and captured two different aspects of similarity in networks: (a) shared local neighborhood and (b) structural rolebased similarity. They first introduced methods that learned network representations of entities in the knowledge graph, capturing these varied aspects of similarity. They then proposed a fast, efficient method to incorporate the information captured by these network representations into existing knowledge graph embeddings. Their method consistently and significantly improved the performance on link prediction of several different knowledgegraph embedding methods including TransE, TransD, DistMult, and ComplEx (by at least 4 points or 17% in some cases).
[Image source. Click image to open in new window.]For a related paper by the same authors contemporaneously released on arXiv, see *DOLORES*: Deep Contextualized Knowledge Graph Embeddings.
An interesting approach to KG construction involves the Biological Expression Language (BEL), which represents findings in the life sciences in a computable form. Biological statements in BEL are represented as (subject, predicate, object) triples, where the subject is a BEL term, the predicate is a biological relationship that connects the subject with the object (which can be a BEL term or a BEL statement). Hence, the knowledge captured by BEL statements can be represented as a graph. A BEL term represent the abundance of a biological entity, or a biological process such as a Gene Ontology entry or a disease. Each entity is described in an existing namespace (describing, for example, it’s source database, e.g. ChEBI), or a userdefined namespace. A fixed set of causal, correlative and other relationships link the entities: each statement can be associated with metadata that contextualizes it, for example qualifying it to be true only in specific tissues.

While BEL is an active research area with Tracks in the BioCreative VI Community Challengesp (see also) and preceding BioCreative Tracks, the approach appears to have gained little traction outside that community. For example, only a handful of BELrelated papers are present in PubMed that are not focused on the development of BEL, itself. For more information on BEL, see:

Training and Evaluation Corpora for the Extraction of Causal Relationships Encoded in Biological Expression Language (BEL) (2016), which described the corpus prepared for the BioCreative V BEL track, which also provides an excellent review of the Biological Expression Language:
[Image source. Click image to open in new window.] 
Saved by the BEL: Ringing in a Common Language for the Life Sciences (2012):
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]  BEL 2.0 Specification, home of the BEL Language Documentation v2.0;
 OpenBEL;

PyBEL Python ecosystem;

Navigating the Disease Landscape: Knowledge Representations for Contextualizing Molecular Signatures (Apr 2018): the “Contextualization by Knowledge Graph Representations” subsection, pp. 89, summarizes some biomedical applications of BEL:
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.] 
The Causal Biological Network database (Causal Biological Network Database: A Comprehensive Platform of Causal Biological Network Models Focused on the Pulmonary and Vascular Systems (2015)), “… a set of biological network models scripted in BEL that reflect causal signaling pathways across a wide range of biological processes, including cell fate, cell stress, cell proliferation, inflammation, tissue repair and angiogenesis in the pulmonary and cardiovascular context. This comprehensive collection of networks is now freely available to the scientific community in a centralized webbased repository, composed of over 120 manually curated and well annotated biological network models supported by over 80,000 unique pieces of evidence from the scientific literature.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.] 
See also (BEL):

Combining Relation Extraction with Function Detection for BEL Statement Extraction (Jan 2019) [Summary]The BioCreativeV community proposed a challenging task of automatic extraction of causal relation network in Biological Expression Language (BEL) from the biomedical literature. Previous studies on this task largely used models induced from other related tasks and then transformed intermediate structures to BEL statements, which left the given training corpus unexplored. To make full use of the BEL training corpus, in this work, we propose a deep learningbased approach to extract BEL statements. Specifically, we decompose the problem into two subtasks: entity relation extraction and entity function detection. First, two attentionbased bidirectional long shortterm memory networks models are used to extract entity relation and entity function, respectively. Then entity relation and their functions are combined into a BEL statement. In order to boost the overall performance, a strategy of threshold filtering is applied to improve the precision of identified entity functions. We evaluate our approach on the BioCreativeV Track 4 corpus with or without gold entities. The experimental results show that our method achieves the stateoftheart performance with an overall F1measure of 46.9% in stage 2 and 21.3% in stage 1, respectively.
[Image source. Click image to open in new window.] 
Hierarchical Sequence Labeling for Extracting BEL Statements from Biomedical Literature (Apr 2019). “… The BioCreative community has organized a shared task to evaluate the robustness of the causal relationship extraction algorithms in Biological Expression Language (BEL) from biomedical literature. We first map the sentencelevel BEL statements in the BCV training corpus to the corresponding text segments, thus generating hierarchically tagged training instances. A hierarchical sequence labeling model was afterwards induced from these training instances and applied to the test sentences in order to construct the BEL statements. The experimental results on extracting BEL statements from BioCreative V Track 4 test corpus show that our method achieves promising performance with an overall Fmeasure of 31.6%. Furthermore, it has the potential to be enhanced by adopting more advanced machine learning approaches.”


Additional Reading
Latent Knowledge Discovery
KG are ideallysuited for knowledge discovery. Notable examples in the biomedical domain include the following papers.
In Exploiting Semantic Patterns Over Biomedical Knowledge Graphs for Predicting Treatment and Causative Relations (Jun 2018), the authors first built a large knowledge graph of biomedical relations obtained from the National Library of Medicine (NLM)’s Unified Medical Language System Metathesaurus (UMLS). They then took a different approach to predicting potential relations between arbitrary pairs of biomedical entities. They refrained from NLP approaches that looked at individual sentences to extract a potential relation, instead exploiting semantic path patterns over this graph to build models for specific predicates. Instead of looking at what a particular sentence conveys, they modeled their prediction problem at a global level, outputting probability estimates of whether a pair of entities participated in a particular relation. A different binary classification model was trained for each predicate. While the approach was demonstrated using the “TREATS” and “CAUSES” predicates drawn from the UMLS Metathesaurus SemMedDB (Semantic Medline Database), their method also generalized to other predicates (such as “DISRUPTS” and “PREVENTS”), and could also complement other lexical and syntactic patternbased distant supervision approaches for relation extraction.
[Image source. Click image to open in new window.]MOLIERE : Automatic Biomedical Hypothesis Generation System (May 2017) [project; code] is a system that can identify connections within biomedical literature. MOLIERE utilized a multimodal/multirelational network of biomedical objects (papers; keywords; genes; proteins; diseases; diagnoses) extracted from PubMed. MOLIERE finds the shortest path between two query keywords in the KG, and extends this path to identify a significant set of related abstracts (which, due to the network construction process, share common topics). Topic modeling, performed on these documents using PLDA+, returns a set of plain text topics representing concepts that likely connect the queried keywords, supporting hypothesis generation (for example, on historical findings MOLIERE showed the implicit link between Venlafaxine and HTR1A, and the involvement of DDX3 on Wnt signaling).
[Image source. Click image to open in new window.][Image source. Click image to open in new window.]
A: abstract layer; K: keyword layer:
“Our knowledge base starts as XML files provided by MEDLINE, from which we extract each publication’s title, document ID, and abstract text. We first process these results with the SPECIALIST NLP toolset. The result is a corpus of text which has standardized spellings (for example ‘colour’ becomes ‘color’), no stop words (including medical specific stop words such as Not Otherwise Specified (NOS)), and other characteristics which improve later algorithms on this corpus. Then we use ToPMine [Scalable Topical Phrase Mining from Text Corpora (2014); nonauthor code] to identify multiword phrases from that corpus such as ‘asthma attack,’ allowing us to treat phrases as single tokens. Next, we send the corpus through fastText, the most recent word2vec implementation, which maps each unique token in the corpus to a vector.
“We can then fit a centroid to each publication and use the Fast Library for Approximate Nearest Neighbors (FLANN) to generate a nearest neighbors graph. The result is a network of MEDLINE papers, each of which are connected to other papers sharing a similar topic. This network, combined with the UMLS Metathesaurus and semantic network, constitutes our full knowledge base. … With our network, a researcher can query for the connections between two keywords. We find the shortest path between the two keywords in the knowledge network, and extend this path to identify a significant set of related abstracts. This subset contains many documents which, due to our network construction process, all share common topics. We perform topic modeling on these documents using PLDA+. The result is a set of plain text topics which represent different concepts which likely connect the two queried keywords. More information about the query process is detailed in Section 3.”
“In order to create edges between A and K, we used a simple metric of term frequencyinverse document frequency (TFIDF). UMLS provides not only a list of keywords, but all known synonyms for each keyword. For example, the keyword Color C0009393 has the American spelling, the British spelling, and the pluralization of both defined as synonyms. Therefore we used the raw text abstracts and titles (before running the SPECIALIST NLP tools) to calculate tfidf. In order to quickly count all occurrences of UMLS keywords across all synonyms, we implemented a simple parser. This was especially important because many keywords in UMLS are actually multiword phrases such as ‘Clustered Regularly Interspaced Short Palindromic Repeats’ (a.k.a. CRISPR) C3658200 . …“
For additional detail, see Section 3.1
Source.
The task of entity relation extraction discovers new relation facts and enables broader applications of knowledge graph. AttentionAware PathBased Relation Extraction for Medical Knowledge Graph (Jan 2018) proposed a novel path attentionbased model (GRU + ONE) to discover new entity relation facts. Instead of finding texts for relation extraction, the proposed method extracted pathonly information for entity pairs from the knowledge graph. For each pair of entities, multiple paths could be extracted, with some of them being more useful than others for relation extraction. They employed an attention mechanism to assign different weights to different paths, which highlighted paths useful for entity relation extraction. On a large medical knowledge graph, the proposed method significantly improved the accuracy of extracted relation facts compared with state of the art relation extraction methods.
[Image source. Click image to open in new window.]Distilled Wasserstein Learning for Word Embedding and Topic Modeling (Sep 2018) proposed a novel Wasserstein method with a distillation mechanism, yielding joint learning of word embeddings and topics. The proposed method – distilled Wasserstein learning (DWL) – was based on the fact that the Euclidean distance between word embeddings may be employed as the underlying distance in the Wasserstein topic model. The word distributions of topics, their optimal transports to the word distributions of documents, and the embeddings of words were learned in a unified framework. Evaluated on patient admission records, the proposed method embedded disease codes and procedures and learned the admission topics, obtaining superior performance on clinicallymeaningful disease network construction, mortality prediction as a function of admission codes, and procedure recommendation.
[Image source. Click image to open in new window.]Beyond Word Embeddings: Learning Entity and Concept Representations from Large Scale Knowledge Bases (Aug 2018) [code/data] described work that jointly learned concept vectors from a textual knowledge base (Wikipedia) and a graphical knowledge base (Probase), demonstrating superior results (their Table 1) on analogical reasoning and concept categorization. The authors employed the skipgram model to seamlessly learn from the knowledge in Wikipedia text and Probase concept graph, for unsupervised to argumenttype identification for neural semantic parsing.
[Image source. Click image to open in new window.]Recall that:
the QA4IE question answering framework likewise processed input documents, along with a knowledge base (the Wikipedia Ontology, to produce high quality relation triples, and
ANALOGY also employed analogical inference (for link prediction).
In Learning a Health Knowledge Graph from Electronic Medical Records (Jul 2017) [data; code not available, but note similarity of embedding specification to those summarized in my RESCAL], maximum likelihood estimation of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, naive Bayes classifier, and a Bayesian network using noisyOR gates. Logistic regression, widely used for binary classification, was chosen as an example of a wellestablished machine learning classifier with interpretable parameters which is frequently used for modeling binary outcomes. Naive Bayes was chosen as it provides a baseline of what can be inferred from simple pairwise cooccurrences. NoisyOR was chosen as an example of a probabilistic model that jointly models diseases and symptoms; similar models have successfully been used in previous medical diagnosis applications. Parameters for all three models were learned using maximum likelihood estimation. A graph of diseasesymptom relationships was elicited from the learned parameters, and the constructed knowledge graphs were evaluated and validated, with permission, against Google’s manuallyconstructed knowledge graph and against expert physician opinions. The noisyOR model significantly outperformed the other tested models, producing a high quality knowledge graph reaching precision of 0.85 for a recall of 0.6 in the clinical evaluation.
[Image source. Click image to open in new window.]
 Careful reading of that paper and its Appendix, which describes the assumptions associated with each of those three models, clarifies the approach. Diseases (and their effects: symptoms) were separately modeled by logistic regression and naive Bayes; additionally, in naive Bayes, all symptoms were assumed to be conditionally independent from one another. However, the noisyOR model jointly models all diseases and symptoms [wherein the central assumption of independence of effects (symptoms) is maintained]. NoisyOR is a conditional probability distribution that describes the causal mechanisms by which parent nodes (e.g. diseases) affect the states of children nodes (e.g. symptoms). For example, given that patients tend to present with few diseases, the presence of one disease typically lowers the probability of others. In the logistic regression and naive Bayes models, however, since each disease is modeled separately (resulting in an assumption of independence between diseases), the presence of one disease does not (for example) rule out or lessen the probability of another disease (i.e. diagnosis).
Author David Sontag also evaluates noisyor maximum likelihood estimation in Clinical Tagging with Joint Probabilistic Models (Sep 2016). See also Sontag’s paper, Unsupervised Learning of NoisyOr Bayesian Networks (Sep 2013).
[Image source. Click image to open in new window.]
Code for Learning a Health Knowledge Graph from Electronic Medical Records is not available.
The noisyOR model was also employed in a 2009 paper [different authors], Biomedical Discovery Acceleration, with Applications to Craniofacial Development:
“… the explosion of new results in the scientific literature, particularly in molecular biomedicine, is both a blessing and a curse to the bench researcher. Even knowledgeable and experienced scientists can benefit from computational tools that help navigate this vast and rapidly evolving terrain. In this paper, we describe a novel computational approach to this challenge, a knowledgebased system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data.
“Reading methods extract information from external resources, either by parsing structured data or using biomedical language processing to extract information from unstructured data, and track knowledge provenance. Reasoning methods enrich the knowledge that results from reading by, for example, noting two genes that are annotated to the same ontology term or database entry. Reasoning is also used to combine all sources into a knowledge network that represents the integration of all sorts of relationships between a pair of genes, and to calculate a combined reliability score. Reporting methods combine the knowledge network with a congruent network constructed from experimental data and visualize the combined network in a tool that facilitates the knowledgebased analysis of that data.
“An implementation of this approach, called the Hanalyzer, is demonstrated on a largescale gene expression array dataset relevant to craniofacial development. The use of the tool was critical in the creation of hypotheses regarding the roles of four genes never previously characterized as involved in craniofacial development; each of these hypotheses was validated by further experimental work.”

They also stated (paraphrased):
“One of the most popular methods to combine individual reliabilities is to assume independence of experts (naive Bayes assumption) and compute the integrated likelihood P for each relationship using the NoisyOR function … The NoisyOR function has the useful property that the probability of a relationship is high with at least one reliable assertion yet increases with additional support. This property is especially relevant in biology, where it is often difficult to identify false negatives; a given assertion is strengthened by additional information but unlike the case for estimating the reliability of an expert on the whole, an individual assertion is not penalized for lack of additional evidence. Moreover, since the experts are assumed to be independent, experts can be removed from or added to the analysis without excessive recomputation.”
Their approach is implemented in a tool, “Hanalyzer” [project; code (Sourceforge); [code (GitHub); video; Cytoscape plugin], which was used to reason over gene expression data and the biomedical data, creating hypotheses regarding the roles of genes never previously characterized as involved in craniofacial development.
Heterogeneous information networks (HINs), complex networks widelyseen in realworld applications, often encapsulate higherorder interactions that crucially reflect the complex nature among nodes and edges in realworld data. Modeling higherorder interactions in HIN facilitates the userguided clustering problem by providing an informative collection of signals. At the same time, network motifs have been used extensively to reveal higherorder interactions and network semantics in homogeneous networks; thus, it is natural to extend the use of motifs to HINs. HigherOrder Clustering in Heterogeneous Information Networks (Nov 2018) tackles the problem of userguided clustering in HIN by using motifs, highlighting the benefits of comprehensively modeling higherorder interactions instead of decomposing the complex relationships to pairwise interaction. They proposed the MoCHIN model which was applicable to arbitrary forms of HIN motifs, which is often necessary for the application scenario in HINs due to the rich and diverse semantics encapsulated in the heterogeneity. To overcome the curse of dimensionality (the tensor size grows exponentially as the number of nodes increased in their model), they proposed an efficient inference algorithm for MoCHIN. MoCHIN surpassed all baselines in three evaluation tasks under different metrics. The advantage of the model when the supervision was weak was also discussed in additional experiments.
[Image source. Click image to open in new window.]Network representation aims to represent the nodes in a network as continuous and compact vectors, and has attracted much attention in due to its ability to capture complex structure relationships inside networks. However, existing network representation methods are commonly designed for homogeneous information networks where all the nodes (entities) of a network are of the same type, e.g., papers in a citation network. Universal Network Representation for Heterogeneous Information Networks (Nov 2018) proposed a universal network representation approach (UNRA), that represented different types of nodes in heterogeneous information networks in a continuous and common vector space. UNRA was built on a mutually updated neural language module, which simultaneously captured interrelationship among homogeneous nodes and nodecontent correlation. Relationships between different types of nodes were also assembled and learned in a unified framework. UNRA achieved outstanding performance compared to six other state of the art algorithms in node representation, node classification, and network visualization. In node classification, UNRA achieved a 3% to 132% performance improvement in terms of accuracy.
[Image source. Click image to open in new window.]HAHE: Hierarchical Attentive Heterogeneous Information Network Embedding (Jan 2019) [“code will be released”] proposed a Hierarchical Attentive Heterogeneous information network Embedding (HAHE) model which is capable of learning personalized metapath preferences for each node as well as updating the embedding information for each new node efficiently with only its neighbor node information. The proposed HAHE model extracted the semantic relationships among nodes in the semantic space based on different metapaths and adopted a neighborhood attention layer to conduct weighted aggregations of neighborhood structure features for each node, enabling the embedding information of each new node to be updated efficiently. A metapath attention layer was also employed, to learn the personalized metapath preferences for each individual node.
[Image source. Click image to open in new window.]Network Lens: Node Classification in Topologically Heterogeneous Networks (Jan 2019) [code; local copy (zip)] studied the problem of identifying different behaviors occurring in different parts of a large heterogenous network. They zoomed into the network using lenses of different sizes to capture the local structure of the network. These network signatures were then weighted to provide a set of predicted labels for every node. They achieved a peak accuracy of ∼42% (random = 11%) on two networks with ∼100,000 and ∼1,000,000 nodes each. Further, they performed better than random even when the given node is connected to up to 5 different types of networks. Finally, they performed this analysis on homogeneous networks, showing that highly structured networks have high homogeneity.”
[Image source. Click image to open in new window.]Regarding natural language understanding and reasoning, a recent review/perspective paper (Relational Inductive Biases, Deep Learning, and Graph Networks (Battaglia et al., Jun 2018; updated Oct 2018) [code/demos; media; critique; discussion here and here] – jointly from DeepMind, Google Brain, MIT and the University of Edinburgh – described how graph networks provide a straightforward interface for manipulating structured knowledge and producing structured behaviors. The paper argued that their Graph Networks could effectively support two critical humanlike capabilities: relational reasoning (i.e. drawing logical conclusions of how different objects and things relate to one another), and combinatorial generalization (i.e. constructing new inferences, predictions, and behaviors from known building blocks). Their Graph Networks framework defined a class of functions for relational reasoning over graph structured representations. They discussed how Graph Networks could support relational reasoning and combinatorial generalization, laying the foundation for more sophisticated, interpretable, and flexible patterns of reasoning. While “natural language understanding” is not discussed, per se, it is likely that NLU will be integral to the relational reasoning paradigm that is described in that paper.
Basically this is a “the whole is greater than the sum of the parts ” perspective paper which posits that graphs are needed for humanlike, relational reasoning. They propose “graph networks,” that can be built as needed form smaller subgraphs, to address specific tasks. A structured representation captures a composition of known building blocks, with structured computations operating over the elements and their composition as a whole. * Relational reasoning* operates over structure representations combining entities and relations, using rules for how they can be composed.
The paper argues that Graph Networks can effectively support two critical humanlike capabilities:
 relational reasoning, i.e. drawing logical conclusions of how different objects and things relate to one another, and
 combinatorial generalization, i.e. constructing new inferences, predictions, and behaviors from known building blocks.
The main unit of computation in the Graph Network framework is the GN block – also referred to as “graphtograph” modules – which takes a graph as input, performs computations over the structure, and returns a graph as output… graph features (entities) are represented by (i) the graph’s nodes, (ii) relations by the edges, and (iii) systemlevel properties by global attributes. The Graph Network block takes a graph as an input, performs calculations from the edge, to the node, to the global attributes – generating a new graph as output.
[Image source. Click image to open in new window.]
A graph is thus a threetuple $\small G = (\mathbf{u}, \mathcal{V}, \mathcal{E})$ where $\small \mathbf{u}$ is a global attribute, $\small \mathcal{V}$ is a set of nodes (each represented by attributes), and $\small \mathcal{E}$ is a set of edges (also represented by attributes). As described (above), the framework is agnostic to specific representations and functional forms. However, GN may be used in deep learning architectures, allowing them to act as learnable graphtograph function approximators. The three design principles behind graph networks which support this are:
 flexible representations,
 configurable withinblock structure, and
 composable multiblock architectures.
One common design is the encodeprocessdecode configuration in which an input graph is transformed into a latent representation by an encoder GN, a shared core block is applied multiple times to process the encoding, and then a decoder GN produces the final output. [Discussion source.]
Graph networks are flexible in terms of attributes representations we might want to choose: tensors, sets, graphs, etc. – these will most likely depend on your problem. Regarding the output, we can use edges, nodes or the global attributes based on what we want to achieve. Having all these possibilities means that we have to spend more time on our model architectures when using GNs. Not like now when if it’s an image, use CNN; when it’s a sequence, use LSTM. Regarding the question of how the input data is to be represented as a graph, there are two possible options: (i) the input contains the relational structure or (ii) the relations must be inferred. [Discussion source.]
Reviews:
 Blogger Adrian Colyer provides (Sep 2018) an excellent overview/discussion.
 Review by Vitaly Kurin at yobiblog.
 See also this Oct 2018 blog post, describing a nonauthor project: Knowledge Graph Convolutional Networks: Machine Learning over Reasoned Knowledge [discussion].
Embedding Logical Queries on Knowledge Graphs (Dan Jurafsky  Jure Leskovec  Stanford University: Feb 2019) [code]. “Learning lowdimensional embeddings of knowledge graphs is a powerful approach used to predict unobserved or missing edges between entities. However, an open challenge in this area is developing techniques that can go beyond simple edge prediction and handle more complex logical queries, which might involve multiple unobserved edges, entities, and variables. For instance, given an incomplete biological knowledge graph, we might want to predict ‘What drugs are likely to target proteins involved with both diseases $\small X$ and $\small Y$? ’ – a query that requires reasoning about all possible proteins that might interact with diseases $\small X$ and $\small Y$. Here we introduce a framework to efficiently make predictions about conjunctive logical queries – a flexible but tractable subset of firstorder logic – on incomplete knowledge graphs. In our approach, we embed graph nodes in a lowdimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. By performing logical operations within a lowdimensional embedding space, our approach achieves a time complexity that is linear in the number of query variables, compared to the exponential complexity required by a naive enumerationbased approach. We demonstrate the utility of this framework in two application studies on realworld datasets with millions of relations: predicting logical relationships in a network of druggenedisease interactions and in a graphbased representation of social interactions derived from a popular web forum.”
[Image source. Click image to open in new window.] “Of course, there are limitations of our framework: for instance, it cannot handle logical negation or disjunction, and we also do not consider features on edges. Natural future directions include generalizing the space of logical queries – for example, by learning a geometric negation operator – and using graph neural networks to incorporate richer feature information on nodes and edges.”
Graph Structure
Here, I summarize work which leverages the representational and topological structure of knowledge graphs, that can be used to learn latent relationships or aid KG construction (e.g., through link prediction).
Network embedding methods can be divided into two categories (all discussed below – at times excerpted from Improved SemanticAware Network Embedding with FineGrained Word Alignment): methods that solely rely on structure (e.g. vertex information), such as
and methods that leverage both the structure the network and the semantic information associated with its vertices, such as
 TADW (TextAssociated DeepWalk),
 CENE (ContentEnhanced Network Embedding),
 CANE (ContextAware Network Embedding), and
 WANE (WordAlignmentbased Network Embedding).
Similar to the skipgram concept [Mikolov et al. (2013)] originally introduced for learning word embeddings, DeepWalk [DeepWalk: Online Learning of Social Representations (Jun 2014)] [code] generalized language modeling and unsupervised feature learning from sequences of words to graphs, using local information obtained from truncated random walks to learn latent representations by treating walks as the equivalent of sentences. In other words, DeepWalk learned latent representations of vertices in a network in a continuous vector space via truncated random walks.
[Image source. Click image to open in new window.][Image source. Click image to open in new window.][More on Zachary's Karate Club.]
Likewise, LINE [LINE : Largescale Information Network Embedding (Mar 2015)] [code] proposed a principled objective to explicitly capture firstorder and secondorder proximity information from the vertices of a network.

“This paper studies the problem of embedding very large information networks into lowdimensional vector spaces, which is useful in many tasks such as visualization, node classification, and link prediction. … In this paper, we propose a novel network embedding method called the LINE, which is suitable for arbitrary types of information networks: undirected, directed, and/or weighted. The method optimizes a carefully designed objective function that preserves both the local and global network structures. An edgesampling algorithm is proposed that addresses the limitation of the classical stochastic gradient descent and improves both the effectiveness and the efficiency of the inference.
“Empirical experiments prove the effectiveness of the LINE on a variety of realworld information networks, including language networks, social networks, and citation networks. The algorithm is very efficient, which is able to learn the embedding of a network with millions of vertices and billions of edges in a few hours on a typical single machine. …”
Subsequently, node2vec: Scalable Feature Learning for Networks (Jul 2016) [project, code] – by Aditya Grover and Jure Leskovec at Stanford University – introduced a biased random walk procedure to generate the neighborhood for a vertex, which inferred the node representations by maximizing the likelihood of preserving the local context information of vertices. “Here we propose node2vec, an algorithmic framework for learning continuous feature representations for nodes in networks. In node2vec, we learn a mapping of nodes to a lowdimensional space of features that maximizes the likelihood of preserving network neighborhoods of nodes. We define a flexible notion of a node’s network neighborhood and design a biased random walk procedure, which efficiently explores diverse neighborhoods. Our algorithm generalizes prior work which is based on rigid notions of network neighborhoods, and we argue that the added flexibility in exploring neighborhoods is the key to learning richer representations. “We demonstrate the efficacy of node2vec over existing stateoftheart techniques on multilabel classification and link prediction in several realworld networks from diverse domains. Taken together, our work represents a new way for efficiently learning stateoftheart taskindependent representations in complex networks.”
[Image source. Click image to open in new window.]
node2vec was later evaluated in Representation Learning on Graphs: A Reinforcement Learning Application (Jan 2019), which studied value function approximation in reinforcement learning (RL) problems with high dimensional state or action spaces via a generalized version of representation policy iteration (RPI). They considered the limitations of protovalue functions (PVFs) at accurately approximating the value function in low dimensions, and they highlighted the importance of features learning for an improved lowdimensional value function approximation. They then adopted different representation learning algorithms on graphs to learn the basis functions that best represented the value function, empirically showing that node2vec, an algorithm for scalable feature learning in networks, and the Variational Graph AutoEncoder (VGAE) constantly outperformed the commonly used smooth protovalue functions in lowdimensional feature space.

Under sampling strategies that satisfy the requirements of the previous point, node2vec outperformed the commonly used protovalue functions.

VGAE, which was more complex than *node2vec and required more training, led to minor performance improvement over node2vec.

Although struc2vec was also designed to capture structural similarities between nodes, it also preserved the local properties of the graph by considering neighborhoods of different sizes. Hence, struc2vec was able to accurately approximate the value function even in graphs that had symmetrical structure.

The policy learned via the GraphWave basis function led to very poor performances regardless of the dimension size. Because GraphWave aims at learning embeddings that are exclusively structural, they hypothesized that those embeddings fail to capture global network properties. Although GraphWave is a powerful model for capturing structural information in networks, it is not a good choice of basis function for approximating the value function on a graph.

These findings encourage the further study of representation learning on graphs for achieving efficient and accurate policy learning for reinforcement learning.

Learning Structural Node Embeddings via Diffusion Wavelets (2018) [project; code] by Jure Leskovec and colleagues at Stanford University presented GraphWave, an algorithm for computing structural signatures for nodes in a network using heat spectral wavelet diffusion patterns. Instead of training on handselected features, GraphWave learned these embeddings in an unsupervised way. They mathematically proved that nodes with similar network neighborhoods had similar GraphWave embeddings even though these nodes may reside in very different parts of the network. GraphWave outperformed existing state of the art baselines in every experiment, by as much as 137%.
[Image source. Click image to open in new window.]
Note also Adaptive Diffusions for Scalable Learning over Graphs (Jan 2019) [code], which introduced a disciplined, dataefficient approach to learning classspecific diffusion functions adapted to the underlying network topology. The novel learning approach leveraged the notion of landing probabilities of classspecific random walks, which could be computed efficiently, thereby ensuring scalability to large graphs. A robust version of the classifier facilitated learning even in noisy environments. Classification tests on real networks demonstrated that adapting the diffusion function to the given graph and observed labels significantly improved the performance over fixed diffusions – reaching, and many times surpassing, the classification accuracy of computationally heavier state of the art competing methods that relied on node embeddings and deep neural networks.
[Image source. Click image to open in new window.] 
This model is evaluated in Representation Learning on Graphs: A Reinforcement Learning Application (Jan 2019).
DeepWalk
The DeepWalk approach has proven to be very popular: for example, recent applications of DeepWalk in the biomedical domain include the following works.
Graph structure was leveraged by Alshahrani et al. [NeuroSymbolic Representation Learning on Biological Knowledge Graphs (Sep 2017) code], who developed a method for feature learning on biological knowledge graphs. Their method combined symbolic methods (specifically, knowledge representation using symbolic logic and automated reasoning) with neural networks to generate vector representations (embeddings) of the nodes in those graphs, as well as the kinds of relations that existed to neighboring nodes. To learn those representations, they repeatedly performed random walks from each node in the KG, using the resulting walks as sentences within a corpus, and applied a word2vec skipgram model (DeepWalk) to learn the embeddings for each node. Through the use of symbolic logic, these embeddings contain both explicit and implicit information. These embeddings were applied to the prediction of edges in the KG, applicable to tasks such as the prediction of biological function, finding candidate genes of diseases, and identifying proteinprotein interactions or drugtarget relations.
[Image source. Click image to open in new window.]An extension of Alshahrani’s work (NeuroSymbolic Representation Learning on Biological Knowledge Graphs) (by Alshahrani et al.) was described by Agibetov and Samwald in Fast and Scalable Learning of NeuroSymbolic Representations of Biomedical Knowledge (Apr 2018) [code]. Agibetov and Samwald showed how to train fast (less than one minute, vs. hours previously) loglinear neural embeddings (representations) of the entities, that were used as inputs for ML classifiers enabling important tasks such as biological link prediction. Classifiers were trained by concatenating learned entity embeddings to represent entity relations, and training classifiers on the concatenated embeddings to discern true relations from automatically generated negative examples. Their simple embedding methodology greatly improved on classification error compared to previously published state of the art results, yielding increased Fmeasure and ROC AUC scores for the most difficult biological link prediction problem. Their embedding approach was also much more economical in terms of embedding dimensions (d=50 vs. d=512), and naturally encoded the directionality of the asymmetric biological relations, that could be controlled by the order with which the embeddings were concatenated.
[Image source. Click image to open in new window.]“As opposed to the approach taken by Alshahrani et al. we employ another neural embedding method which requires fewer parameters and is much faster to train. Specifically, we exploit the fact that the biological relations have well defined nonoverlapping domain and ranges, and therefore the whole knowledge graph can be treated as an untyped directed graph, where there is no ambiguity in the semantics of any relation. To this end, we employ the neural embedding model from the StarSpace toolkit, which aims at learning entities, each of which is described by a set of discrete features (bagoffeatures) coming from a fixedlength dictionary. The model is trained by assigning a $\small d$dimensional vector to each of the discrete features in the set that we want to embed directly. Ultimately, the lookup matrix (the matrix of embeddings: latent vectors) is learned by minimizing the following loss function …”
In the following figure, results in Table 2 (from Table 1
Source in Alshahrani et al., 2017 serve as the baseline for Tables 3, 4:
Notably, in the work described by Agibetov and Samwald [Fast and Scalable Learning of NeuroSymbolic Representations of Biomedical Knowledge] in the preceding paragraph, there was no need for a sophisticated labeled DeepWalk, since all considered biological relations had clear nonoverlapping domain and range separations (and therefore the whole knowledge graph could be treated as an untyped directed graph, where there is no ambiguity in the semantics of any relation). Instead of using DeepWalk, they trained faster and more economical loglinear neural embeddings using Jason Weston’s (Facebook AI Research) StarSpace: Embed All the Things! [code]. StarSpace is a generalpurpose neural embedding model that aims at learning entities, each of which is described by a set of discrete features (bagoffeatures) coming from a fixedlength dictionary. StarSpace can solve a wide variety of problems including labeling tasks such as text classification, ranking tasks such as information retrieval/web search, collaborative filteringbased or contentbased recommendation, embedding of multirelational graphs, and learning word, sentence or document level embeddings. The classifiers used by Agibetov and Samwald were trained on concatenated embeddings of entities (nodes), which were obtained from the flattened graphs for each biomedical relations using StarSpace.
[Image source. Click image to open in new window.]Predicting MicroRNADisease Associations using Network Topological Similarity Based on DeepWalk (Oct 2017) applied DeepWalk to the prediction of microRNAdisease associations, by calculating similarities within a miRNAdisease association network. This approach showed superior predictive performance for 22 complex diseases, with area under the ROC (AUR) curve scores ranging from 0.805 to 0.937, using fivefold crossvalidation. In addition, case studies on breast, lung and prostate cancer further justified the use the method for the discovery of latent miRNAdisease pairs.
[Image source. Click image to open in new window.]Deep Mining Heterogeneous Networks of Biomedical Linked Data to Predict Novel DrugTarget Associations; (Apr 2017) [code] also used DeepWalk to predict novel drugtarget associations.
Deep Neural Networks for Learning Graph Representations (2016) [DNGR: code: MATLAB; nonauthor Python (Keras) implementation] provided an alternative deep learning approach to learning graph representations (especially word representations) via a random surfing model to directly capture graph structural information. Stacked denoising autoencoders (SDAE) were used to extract complex features and to model nonlinearities, generating informative representations. The learned representations could be regarded as input features that could be fed into other tasks, such as unsupervised and supervised classification. Note Fig. 3 in their paper (image below), which is reminiscent of Fig. 2
Source in in the struc2vec paper, and the comparison of each of those approaches in those figures to DeepWalk.
The wine dataset visualized in the image above is a dataset from the UCI Machine Learning Repository describing the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the 3 types of wines and comprises 178 instances.
struc2vec: Learning Node Representations from Structural Identity (Jul 2017) [code] is yet another vector spacebased model, which like DeepWalk and node2vec learned latent representations for the structural identity of nodes. While DeepWalk and node2vec failed to capture the notion of structural identity, struc2vec excelled on this task, even when the original network was subject to strong random noise (random edge removal). Fig. 3 in the struc2vec paper showed superior node representations for the toy Karate network (Zachary's Karate Club), compared to DeepWalk and node2vec.
[Image source. Click image to open in new window.] This model is evaluated in Representation Learning on Graphs: A Reinforcement Learning Application (Jan 2019).
Another recent improvement over node2vec, graph2vec: Learning Distributed Representations of Graphs (Jul 2017) [code  nonauthor code, discussion, image
Source] learned datadriven distributed representations of arbitrary sized graphs, rather than learning distributed representations of graph substructures. graph2vec’s embeddings were learned in an unsupervised manner and were task agnostic. Hence, those representations could be used for any downstream task such as graph classification, clustering and even seeding supervised representation learning approaches. Experiments on several benchmark and large realworld datasets showed that graph2vec achieved significant improvements in classification and clustering accuracies over substructure representation learning approaches. For example, compared to node2vec, graph2vec resulted in increased accuracy on five chemoinformatic and bioinformatic domains (see results in Subsection 5.1 and Table 2 in the graph2vec paper; datasets used there are described/available here, here and here). Comparisons of graph2vec to struc2vec are not available at this time; however, the approach and performance of struc2vec over node2vec are clearly evident in Figs. 2, 3 and 7 in the struc2vec paper.
Syntree2Vec – An Algorithm to Augment Syntactic Hierarchy into Word Embeddings (Aug 2018) [code] aimed to include syntactic knowledge – here, preserving the word order during sampling – into word embeddings via a graph based embedding algorithm inspired from node2vec. Using SyntaxNet for syntactic/dependency parsing, Syntree2Vec generated word embeddings by optimizing a graph based objective function that maximized the probability of preserving a network neighborhood. A semiguided second order random walk procedure then sampled network neighborhood of nodes. In an example of their graph creation (their Figs. 1, 2), the Authors parsed two similar sentences and concatenated the two graphs, that preserved the word order among the syntactic parse trees. Syntree2vec performed similarly to word2vec and node2vec (perplexity scores), albeit with training times that scaled with the data size (unlike word2vec; Fig. 3) for the data sizes evaluated (Table 1). [Data were drawn from a 2008 Wikipedia data dump.]
[Image source. Click image to open in new window.]
 SyntaxNet, introduced by Google in Globally Normalized TransitionBased Neural Networks (Mar 2016; updated Jun 2016) [code; nonauthor code], was further described by Google in SyntaxNet Models for the CoNLL 2017 Shared Task (Mar 2017), and discussed in the May 2016 Google AI blog post Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source.
[Image source. Click image to open in new window.]
Random walk algorithms have many limitations that arise from the use of random walks, e.g., the features resulting from these methods are unable to transfer to new nodes and graphs as they are tied to vertex identity. Learning Rolebased Graph Embeddings (Jul 2018) [code; discussion] introduced the Role2Vec framework which used the flexible notion of attributed random walks, and served as a basis for generalizing existing methods such as DeepWalk, node2vec, and many others that leverage random walks. Their proposed framework enabled these methods to be more widely applicable for both transductive and inductive learning as well as for use on graphs with attributes (if available). This was achieved by learning functions that generalized to new nodes and graphs. The proposed framework was effective with an average AUC improvement of 16.55% while requiring on average 853x less space than existing methods on a variety of graphs.
[Image source. Click image to open in new window.]The various graph structure algorithms discussed above generally ignore rich heterogeneous information associated with vertices (Improved SemanticAware Network Embedding with FineGrained Word Alignment). Models that learn semanticaware network embeddings include the following.

TextAssociated DeepWalk (TADW) integrated textual features into network representations with matrix factorization by leveraging the equivalence between DeepWalk and matrix factorization (Network Representation Learning with Rich Text Information (2015) [code]).
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]

CENE (ContentEnhanced Network Embedding) used bidirectional recurrent neural networks to abstract the semantic information associated with vertices, further demonstrating the advantages of employing textual information [A General Framework for Contentenhanced Network Representation Learning (Oct 2016)].
Textassociated DeepWalk (TADW) incorporated textual features into network embeddings through matrix factorization. “… This approach typically suffers from high computational cost and not scalable to largescale networks. Besides, contents in TADW are simply incorporated as unordered text features instead of being explicitly modeled. Therefore, deeper semantics contained in the contents cannot be well captured. Present work. We present a general framework for learning ContentEnhanced Network Embedding (CENE) that is capable of jointly leveraging the network structure and the contents. We consider textual contents in this study, however, our approach can be flexibly scaled to other modalities of content.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]

To capture the interaction between sentences of vertex pairs, ContextAware Network Embedding (CANE) employed a mutual attention mechanism to adaptively account for the textual information from neighboring vertices (CANE: ContextAware Network Embedding for Relation Modeling (Oct 2016  ACL2017) [code]).
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
Despite showing improvement over structureonly models, those semanticaware methods could not capture wordlevel alignment information – important for inferring the relationship between node pairs. For a wide range of applications, vertices in a network are typically accompanied by rich textual information such as user profiles, paper abstracts, etc. Improved SemanticAware Network Embedding with FineGrained Word Alignment (Aug 2018) focused on incorporating semantic features into network embeddings by matching important words between text sequences for all pairs of vertices. They introduced a wordbyword alignment framework, WordAlignmentbased Network Embedding (WANE), that measured the compatibility of embeddings between word pairs, and then adaptively accumulated those alignment features with an aggregation function. Evaluations on three real world benchmarks for downstream tasks including link prediction and multilabel vertex classification demonstrated that their model outperformed state of the art network embedding methods by a large margin. [” Our code will be released to encourage future research.” ]
[Image source. Click image to open in new window.]Graph algorithms are key tools in many fields of science and technology. Some of these algorithms depend on propagating information between distant nodes in a graph. Recently, there have been a number of deep learning architectures proposed to learn on undirected graphs. However, most of these architectures aggregate information in the local neighborhood of a node, and therefore they may not be capable of efficiently propagating longrange information. Learning LongRange Information in Undirected Graphs with Wave Networks (Oct 2018) [code] addresses this problem with a recently proposed architecture (wave) which propagates information back and forth across an undirected graph in waves of nonlinear computation. They compared wave to graph convolution – an architecture based on local aggregation – and find that wave learned three different graphbased tasks with greater efficiency and accuracy: (1) labeling a path connecting two nodes in a graph, (2) solving a maze presented as an image, and (3) computing voltages in a circuit. Wave could extrapolate from small training examples to much larger testing examples. The results showed that wave may be able to efficiently solve a wide range of problems that require longrange information propagation across undirected graphs.
[Image source. Click image to open in new window.]Additional Reading

PANDA: AdaPtive Noisy Data Augmentation for Regularization of Undirected Graphical Models (Oct 2018)
“We propose PANDA, an AdaPtive Noise Augmentation technique to regularize estimating and constructing undirected graphical models (UGMs). PANDA iteratively solves MLEs [maximum likelihood estimations] given noise augmented data in the regressionbased framework until convergence to achieve the designed regularization effects. … Our empirical results suggest the inferences achieve nominal or nearnominal coverage and are far more efficient compared to some existing postselection procedures. On the algorithm level, PANDA can be easily programmed in any standard software without resorting to complicated optimization techniques. We show the noninferior performance of PANDA in constructing graphs of different types in simulation studies and also apply PANDA to the autism spectrum disorder data to construct a mixednode graph.”
[Image source. . Click image to open in new window.]
[Image source. . Click image to open in new window.]“We apply PANDA to an autism spectrum disorder (ASD) data collected by the Dutch Association for Autism and the VU University Amsterdam. The dataset contains 28 variables of various types (10 continuous, 7 categorical, and 11 count variables) from 3521 participants; and is available in the R package mgm.
“Figure 6 presents a visualization of the estimated UGM via PANDA. … Figure 6 suggests that node “GCdtA” (Good Characteristics due to Autism) is connected with multiple nodes of different domains including “NoSC” (Number of Social Contacts), “NoI” (Number of Interests) and “IQ” (Intelligence Quotient), indicating that the uniquely positive traits of people with autism connects with various aspects of their lives (social environment, wellbeing, diagnostic measurements). PANDA is also able to detect expected relationship among the nodes, such as the strong positive relationship between the present age of a participant (“Age”) and the age when the participant was diagnosed with autism (“Agd”).
“In addition to the relationships among the variables, we can obtain some insights on the relative importance of those variables in the structure of the estimated graph. Figure 7
Source displays the standardized centrality measures (strength, closeness and betweenness) (Opsahl et al., 2010) for each node. The results suggest that some variables, such as “Good Characteristics due to Autism”, “Satisfaction: Work” and “No of Social Contacts” were identified as having relatively high centrality level while some other variables, such as “Openness about Diagnosis”, “Type of Work”, “Type of Housing” and “Gender”, had low values in the centrality measures, implying that those variables were not as important in the constitution of the network structure (Figure 6 also shows that these three nodes are not connected to the rest of the nodes).”
Graph Neural Networks [Graph Convolutional Networks (GCN) …]
Graph neural networks (GNN) have revolutionized the field of graph representation learning through effectively learned node embeddings, achieving state of the art results in tasks such as node classification and link prediction. However, current GNN methods are inherently flat and do not learn hierarchical representations of graphs – a limitation that is especially problematic for the task of graph classification, where the goal is to predict the label associated with an entire graph.
For a recent review see Graph Neural Networks: A Review of Methods and Applications (Dec 2018).
 “Lots of learning tasks require dealing with graph data which contains rich relation information among elements. Modeling physics system, learning molecular fingerprints, predicting protein interface, and classifying diseases require that a model to learn from graph inputs. In other domains such as learning from nonstructural data like texts and images, reasoning on extracted structures … is an important research topic which also needs graph reasoning models. … In recent years, systems based on graph convolutional network (GCN) and gated graph neural network (GGNN) have demonstrated groundbreaking performance on many tasks mentioned above. In this survey, we provide a detailed review over existing graph neural network models, systematically categorize the applications, and propose four open problems for future research.”
An important early paper that applied a neural network to a graph structure is Kipf and Welling’s SemiSupervised Classification with Graph Convolutional Networks (Sep 2016; updated Feb 2017), which introduced graph convolutional neural networks (GCN).
[Image source. Click image to open in new window.]“We present a scalable approach for semisupervised learning on graphstructured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs. We motivate the choice of our convolutional architecture via a localized firstorder approximation of spectral graph convolutions. Our model scales linearly in the number of graph edges and learns hidden layer representations that encode both local graph structure and features of nodes. In a number of experiments on citation networks and on a knowledge graph dataset we demonstrate that our approach outperforms related methods by a significant margin.”
Thomas Kipf provides an excellent introduction to GCNs in his blog post Graph Convolutional Networks [local copy]. When Work Matters: Transforming Classical Network Structures to Graph CNN also provides a recent review of GCN, as well as transformation of the basic structures of ResNet, Inception and DenseNet as GCN.
[Source]
Variational Graph AutoEncoders (Nov 2016) [code] by Thomas Kipf and Max Welling introduced the variational graph autoencoder (VGAE), a framework for unsupervised learning on graph structured data based on the variational autoencoder (VAE). This model made use of latent variables and was capable of learning interpretable latent representations for undirected graphs. They demonstrated their model using a graph convolutional network (GCN) encoder and a simple inner product decoder, achieving competitive results on a link prediction task in citation networks. In contrast to most existing models for unsupervised learning on graphstructured data and link prediction, VGAE could naturally incorporate node features, which significantly improved predictive performance on a number of benchmark datasets.
[Image source. Click image to open in new window.] This model is evaluated in Representation Learning on Graphs: A Reinforcement Learning Application (Jan 2019).
Kipf and Welling’s SemiSupervised Classification with Graph Convolutional Networks (Sep 2016; updated Feb 2017) built on work by Defferrard et al., Convolutional Neural Networks on Graphs with fast Localized Spectral Filtering (Jun 2016; updated Feb 2017) [code discussion.

“In this work, we are interested in generalizing convolutional neural networks (CNNs) from lowdimensional regular grids, where image, video and speech are represented, to highdimensional irregular domains, such as social networks, brain connectomes or words’ embedding, represented by graphs. We present a formulation of CNNs in the context of spectral graph theory, which provides the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs. Importantly, the proposed technique offers the same linear computational complexity and constant learning complexity as classical CNNs, while being universal to any graph structure. Experiments on MNIST and 20NEWS demonstrate the ability of this novel deep learning system to learn local, stationary, and compositional features on graphs.”
[Image source. Click image to open in new window.]
GCN provide a scalable approach to semisupervised learning on graph structured data based on an efficient variant of convolutional neural networks which operate directly on graphs. GCN have been used to explore graph structure; for example, Ying et al. (Pinterest) with William Hamilton and Jure Leskovec (Stanford University) employed GCN to investigate webscale recommender systems [Graph Convolutional Neural Networks for WebScale Recommender Systems (Jun 2018)]. Notably in that work:

“Here we describe a largescale deep recommendation engine that we developed and deployed at Pinterest. We develop a dataefficient Graph Convolutional Network (GCN) algorithm PinSage, which combines efficient random walks and graph convolutions to generate embeddings of nodes (i.e., items) that incorporate both graph structure as well as node feature information. Compared to prior GCN approaches, we develop a novel method based on highly efficient random walks to structure the convolutions and design a novel training strategy which relies on harderandharder training examples to improve robustness and convergence of the model. …
“The core idea behind GCNs is to learn how to iteratively aggregate feature information from local graph neighborhoods using neural networks (Fig. 1). Here a single ‘convolution’ operation transforms and aggregates feature information from a node’s onehop graph neighborhood, and by stacking multiple such convolutions information can be propagated across far reaches of a graph. Unlike purely contentbased deep models (e.g., recurrent neural networks), GCNs leverage both content information as well as graph structure. GCNbased methods have set a new standard on countless recommender system benchmarks.”
Like Graph Convolutional Neural Networks for WebScale Recommender Systems, above, Graph Convolutional Networks based Word Embeddings also employed GCN for learning node embeddings. They proposed WordGCN, a graph convolution based approach to learning word representations. WordGCN utilized multiple types of word relationships such as linear context, syntactic context, [hypernym, meronym and synonym in a principled manner, through a single model. Through extensive experiments on various intrinsic and extrinsic tasks, they evaluated the effectiveness of WordGCN and found that it outperformed existing word embedding approaches. They also found that WordGCN was able to capture multiple connotations of a word in the learned embeddings.
 “WordGCN is able to capture multiple senses of the same word. For instance, we find that for most embedding methods, closest word to “bank” are banking related terms like “citibank”, “investment” whereas WordGCN is also able to capture the other sense of “bank” i.e. “riverbank”. For the word “grave”, we find WordGCN captures its sense as a part of “graveyard” (topical similarity) and also captures its other meaning i.e. to be “serious” (functional similarity). This shows that WordGCN captures multiple, diverse connotations of a word as well as functional and topical similarities.”
[Image source. Click image to open in new window.]
 “We make WordGCN’s source code available to encourage reproducible research.” ← not provided!
Hierarchical Graph Representation Learning with Differentiable Pooling (Oct 2018) code  nonauthor PyTorch implementation  drug prediction fork] – also by Jure Leskovec and colleagues at Stanford University – presented DiffPool, a differentiable graph pooling module that could generate hierarchical representations of graphs, and can be combined with various graph neural network architectures in an endtoend fashion. DiffPool learned a differentiable soft cluster assignment for nodes at each layer of a deep GNN, mapping nodes to a set of clusters, which then formed the coarsened input for the next GNN layer. Experimental results showed that combining existing GNN methods with DiffPool yielded an average improvement of 510% accuracy on graph classification benchmarks compared to existing pooling approaches, achieving a new state of the art on four out of five benchmark data sets.
[Image source. Click image to open in new window.][Image source. Click image to open in new window.]Regarding Table 2, below:
“Differentiable Pooling on structure2vec (Discriminative Embeddings of Latent Variable Models for Structured Data; note that this is not struc2vec). *Diffpool can be applied to other GNN architectures besides GraphSAGE to capture hierarchical structure in the graph data. To further support answering Q1 [“How does diffpool compare to other pooling methods proposed for GNNs (e.g., using sort pooling or the set2set method)?”], we also applied diffpool on structure2vec (S2V). …”
Neural Graph Machines: Learning Neural Networks Using Graphs (Mar 2017) [published pdf; nonauthor code; critical review], from the University of Cambridge and Google Research, discussed and implemented various neural graphical models. That work generalised previous literature on graphaugmented training of neural networks, enabling it to be applied to multiple neural architectures (feedforward neural networks, CNN and LSTM) and a wide range of graphs. The new objective allowed the neural networks to harness both labeled and unlabeled data by allowing the network to train using labeled data as in the supervised setting, and biasing the network to learn similar hidden representations for neighboring nodes on a graph (in the same vein as label propagation). The proposed joint training approach convincingly outperformed many existing methods on a wide range of tasks (multilabel classification on social graphs, news categorization, document classification and semantic intent classification), with multiple forms of graph inputs (including graphs with and without nodelevel features) and using different types of neural networks.
[Image source. Click image to open in new window.]LargeScale Learnable Graph Convolutional Networks (Aug 2018) [code] proposed the Learnable Graph Convolutional Layer (LGCL), which automatically selected a fixed number of neighboring nodes for each feature based on value ranking in order to transform graph data into gridlike structures in 1D format, thereby enabling the use of regular convolutional operations on generic graphs. To enable model training on largescale graphs, they proposed a subgraph training method to reduce the excessive memory and computational resource requirements suffered by prior methods on graph convolutions on largescale graph data. Experiments on node classification tasks demonstrated consistently better performance on the Cora, Citeseer and Pubmed citation networks, and protein interaction network datasets. The proposed subgraph training strategy was also more efficient than prior approaches.
[Image source. Click image to open in new window.]
See also:
 Deep Graph Infomax (Sep 2018)
 Inductive Representation Learning on Large Graphs (Jun 2017, updated Sep 2018)
Uses of GCNs include:

Modeling Relational Data with Graph Convolutional Networks (Oct 2017) [code (Thomas Kipf); code] – by Thomas Kipf, Ivan Titov, Max Welling and others at the University of Amsterdam – introduced relational graph convolutional networks (RGCN) and applied them to two standard knowledge base completion tasks: link prediction (the recovery of missing facts, i.e. subjectpredicateobject triples), and entity classification (the recovery of missing entity attributes). RGCN are related to a recent class of neural networks operating on graphs, and are developed specifically to deal with the highly multirelational data characteristic of knowledge bases.
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]

FastGCN: Fast Learning with Graph Convolutional Networks via Importance Sampling (Jan 2018) [code] [different authors: IBM Research] extended the previous work, RGCN, above – which was originally designed to be learned with the presence of both training and test data; moreover, for which the recursive neighborhood expansion across layers posed time and memory challenges for training with large, dense graphs. To relax the requirement of simultaneous availability of test data the newer model, FastGCN, interpreted graph convolutions as integral transformations of embedding functions under probability measures. That interpretation allowed the use of Monte Carlo approaches to consistently estimate the integrals, that in turn led to a batched training scheme. FastGCN was efficient for training and generalized well for inference. FastGCN was discussed in A Scalable Deep Learning Approach for Massive Graphs, which stated:

“A challenge for GCN is that for a network with multiple layers, the neighborhood is quickly expanded, involving many vectors to be summed together, for even just a single node. Such a computation is prohibitively costly for graphs with a large number of nodes. where the neighborhood expands rapidly, making computation very costly. To resolve this challenge, the authors propose a novel approach (FastGCN), validated through mathematical proofs and experimental results, that suggests that it suffices to gather the information of only a handful of random entities in each neighborhood expansion. This substantial reduction in neighborhood size renders the same quality of prediction as state of the art deep neural networks but cuts training cost by orders of magnitude (e.g., 10x to 100x less computation and resource time), leading to appealing scalability.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
Related to FastGCN. Simplifying Graph Convolutional Networks (Cornell University: Feb 2019) [code; discussion: reddit]. “GCNs derive inspiration primarily from recent deep learning approaches, and as a result, may inherit unnecessary complexity and redundant computation. In this paper, we reduce this excess complexity through successively removing nonlinearities and collapsing weight matrices between consecutive layers. We theoretically analyze the resulting linear model and show that it corresponds to a fixed lowpass filter followed by a linear classifier. Notably, our experimental evaluation demonstrates that these simplifications do not negatively impact accuracy in many downstream applications. Moreover, the resulting model scales to larger datasets, is naturally interpretable, and yields up to two orders of magnitude speedup over FastGCN.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]



Disease Prediction using Graph Convolutional Networks: Application to Autism Spectrum Disorder and Alzheimer’s Disease (Jun 2018) used GCN to represent populations as a sparse graph where its nodes were associated with imagingbased feature vectors, while phenotypic information was integrated as edge weights.
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]

Towards Gene Expression Convolutions using Gene Interaction Graphs (Jun 2018) – a rather poorlywritten paper from Yoshua Bengio’s group which nonetheless shows highlevel interest in this domain – applied GCN to the analysis of RNASeq gene expression data from The Cancer Genome Atlas (TCGA). They explored the use of GCN with dropout, and gene embeddings (to utilize the graph information), with perhaps the main finding reported being that while the GCN approach provided an advantage for particular tasks in a low data regime, it was very dependent on the quality of the graph used. [That conclusion, at least, echoes my longstanding emphasis on data quality.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
Named Entity Disambiguation using Deep Learning on Graphs (Oct 2018) [dataset/code] addressed named entity disambiguation (NED) by comparing entities in short sentences with Wikidata graphs. Creating a context vector from graphs through deep learning is a challenging problem that has never been applied to NED. Their main contribution was to present an experimental study of recent neural techniques, as well as a discussion about which graph features were most important for the disambiguation task. In addition, a new dataset, WikidataDisamb, was created to allow a clean and scalable evaluation of NED with Wikidata entries, and to be used as a reference in future research. Results showed that a BiLSTM encoding of the graph triplets performed best, improving upon the baseline models and scoring a $\small F_1$ value of 91.6% on the WikidataDisamb test set.
[Image source. Click image to open in new window.]In Graph Capsule Convolutional Neural Networks (Aug 2018) [code: “Our GCAPSCNN code and data will be made available at Github.” – not available, 20181102)], the authors build on both GCN and Hinton’s capsules to tackle some of the basic weaknesses in GCN, proposing a Graph Capsule Network (GCAPSCNN) model designed to solve especially graph classification problem which current GCNN (graph CNN) models find challenging. Through extensive experiments, they showed that their proposed graph capsule network significantly outperformed existing state of the art deep learning methods and graph kernels on graph classification benchmark datasets – for example, the bioinformatics datasets summarized in their Table 1. Their graph capsule network model captured more local structure information than traditional GCNN, and could provide much richer representation of individual graph nodes (or for the whole graph), preserving more information in a local pool.
[Image source. Click image to open in new window.]
Capsule Neural Networks for Graph Classification using Explicit Tensorial Graph Representations (Feb 2019) [code]. “Graph classification is a significant problem in many scientific domains. It addresses tasks such as the classification of proteins and chemical compounds into categories according to their functions, or chemical and structural properties. In a supervised setting, this problem can be framed as learning the structure, features and relationships between features within a set of labelled graphs and being able to correctly predict the labels or categories of unseen graphs. A significant difficulty in this task arises when attempting to apply established classification algorithms due to the requirement for fixed size matrix or tensor representations of the graphs which may vary greatly in their numbers of nodes and edges. Building on prior work combining explicit tensor representations with a standard imagebased classifier, we propose a model to perform graph classification by extracting fixed size tensorial information from each graph in a given set, and using a Capsule Network to perform classification. The graphs we consider here are undirected and with categorical features on the nodes. Using standard benchmarking chemical and protein datasets, we demonstrate that our graph Capsule Network classification model using an explicit tensorial representation of the graphs is competitive with current state of the art graph kernels and graph neural network models despite only limited hyperparameter searching.”
Combining Neural Networks with Personalized PageRank for Classification on Graphs (Nov 2018) [code here (PyTorch) and here (TensorFlow); OpenReview; reddit]. “In this paper, we use the relationship between graph convolutional networks (GCN) and PageRank to derive an improved propagation scheme based on personalized PageRank. We utilize this propagation procedure to construct personalized propagation of neural predictions (PPNP) and its approximation, APPNP. Our model’s training time is on par or faster and its number of parameters on par or lower than previous models. It leverages a large, adjustable neighborhood for classification and can be combined with any neural network. We show that this model outperforms several recently proposed methods for semisupervised classification on multiple graphs in the most thorough study done so far for GCNlike models.”
[Image source. Click image to open in new window.]Interpretable Graph Convolutional Neural Networks for Inference on Noisy Knowledge Graphs (Dec 2018) provided a new graph convolutional neural networks (GCNNs) formulation for link prediction on graph data that addressed common challenges for biomedical knowledge graphs (KGs). They introduced a regularized attention mechanism to GCNNs that not only improved performance on clean datasets, but also favorably accommodated noise in KGs, a pervasive issue in realworld applications. Further, they explored new visualization methods for interpretable modelling and illustrated how the learned representation can be exploited to automate dataset denoising. Results were demonstrated on a synthetic dataset (FB15k237) and a large biomedical knowledge graph derived from a combination of noisy and clean data sources. Using these improvements, they visualized a learned model representation of the disease cystic fibrosis and demonstrated how to interrogate a neural network to show the potential of PPARG (PPARγ: peroxisome proliferatoractivated receptor gamma) as a candidate therapeutic target for rheumatoid arthritis.
[Image source. Click image to open in new window.]While GCNs inherently assume pairwise relationships in graphstructured data, in many realworld problems relationships extend beyond pairwise connections – hypergraphs naturally capture these complex relationships. HyperGCN: Hypergraph Convolutional Networks for SemiSupervised Classification (Sep 2018) explored the use of GCN for hypergraph based semisupervised learning (SSL). They proposed HyperGCN, a SSL method which used a layerwise propagation rule for CNN operating directly on hypergraphs (the first principled adaptation of GCN to hypergraphs). HyperGCN was able to encode both the hypergraph structure and hypernode features in an effective manner.
Recently, the deep learning community has given growing attention to neural architectures engineered to learn problems in relational domains. Typed Graph Networks (Jan 2019) [code] revisits the original Graph Neural Network Model (Scarselli et al., 2009) and show that it generalises many recent models. To illustrate the generality of the original model they presented a Graph Neural Network formalisation, which partitioned the vertices of a graph into a number of types. Each type represented an entity in the ontology of the problem one wanted to learn. This allowed – for instance – one to assign embeddings to edges, hyperedges, and any number of global attributes of the graph. As a companion to this paper they provided a Python/Tensorflow library to facilitate the development of such architectures, with which they instantiated the formalisation to reproduce a number of models proposed in the current literature.
[Image source. Click image to open in new window.] Ref. 10 in the figure above is the recent review/perspective paper Relational Inductive Biases, Deep Learning, and Graph Networks (Jun 2018; updated Oct 2018) [code/demos; media; critique; discussion here and here] – jointly from DeepMind, Google Brain, MIT and the University of Edinburgh – which described how graph networks provide a straightforward interface for manipulating structured knowledge and producing structured behaviors.
In addition to label scores (in node classification), it is also desirable to have confidence scores associated with them. Confidencebased Graph Convolutional Networks for SemiSupervised Learning (Jan 2019) [code] proposed ConfGCN, which estimated labels scores along with their confidences jointly in GCNbased setting. ConfGCN used these estimated confidences to determine the influence of one node on another during neighborhood aggregation, thereby acquiring anisotropic capabilities.
[Image source. Click image to open in new window.]A HigherOrder Graph Convolutional Layer (NeurIPS 2018) [code; discussion: reddit]. “Recent methods generalize convolutional layers from Euclidean domains to graphstructured data by approximating the eigenbasis of the graph Laplacian. The computationallyefficient and broadlyused Graph ConvNet of Kipf & Welling [11], oversimplifies the approximation, effectively rendering graph convolution as a neighborhoodaveraging operator. This simplification restricts the model from learning delta operators, the very premise of the graph Laplacian. In this work, we propose a new Graph Convolutional layer which mixes multiple powers of the adjacency matrix, allowing it to learn delta operators. Our layer exhibits the same memory footprint and computational complexity as a GCN. We illustrate the strength of our proposed layer on both synthetic graph datasets, and on several realworld citation graphs, setting the record stateoftheart on Pubmed.”
[Image source. Click image to open in new window.]A ContextAware Citation Recommendation Model with BERT and Graph Convolutional Networks (Mar 2019). “… we propose a deep learning based model and wellorganized dataset for contextaware paper citation recommendation. Our model comprises a document encoder and a context encoder, which uses Graph Convolutional Networks (GCN) layer and Bidirectional Encoder Representations from Transformers (BERT ), which is a pretrained model of textual data. By modifying the related PeerRead dataset, we propose a new dataset called FullTextPeerRead containing context sentences to cited references and paper metadata. To the best of our knowledge, This dataset is the first wellorganized dataset for contextaware paper recommendation. The results indicate that the proposed model with the proposed datasets can attain stateoftheart performance and achieve a more than 28% improvement in mean average precision (MAP) and recall@k.”
[Image source. Click image to open in new window.]DAGCN: Dual Attention Graph Convolutional Networks (Apr 2019) [code
Graph Neural Networks [GCN ...]:“… At present, the vast majority of GCNs use a neighborhood aggregation framework to learn a continuous and compact vector, then performing a pooling operation to generalize graph embedding for the classification task. These approaches have two disadvantages in the graph classification task: (1) when only the largest subgraph structure (khop neighbor) is used for neighborhood aggregation, a large amount of earlystage information is lost during the graph convolution step; (2) simple average/sum pooling or max pooling utilized, which loses the characteristics of each node and the topology between nodes. In this paper, we propose a novel framework called, dual attention graph convolutional networks (DAGCN ) to address these problems. DAGCN automatically learns the importance of neighbors at different hops using a novel attention graph convolution layer, and then employs a second attention component, a selfattention pooling layer, to generalize the graph representation from the various aspects of a matrix graph embedding. … experimental results show that our framework not only outperforms other baselines but also achieves a better rate of convergence.”
Additional Reading

Multitask Learning on Graph Neural Networks – Learning Multiple Graph Centrality Measures with a Unified Network (Sep 2018; image
Source)“… we show that GNN are capable of multitask learning, which can be naturally enforced by training the model to refine a single set of multidimensional embeddings $\small \in \mathbb{R}^d$ and decode them into multiple outputs by connecting MLPs at the end of the pipeline. We demonstrate the multitask learning capability of the model in the relevant relational problem of estimating network centrality measures, i.e. is vertex $\small v_1$ more central than vertex $\small v_2$ given centrality $\small c$? We then show that a GNN can be trained to develop a lingua franca of vertex embeddings from which all relevant information about any of the trained centrality measures can be decoded. The proposed model achieves 89% accuracy on a test dataset of random instances with up to 128 vertices and is shown to generalise to larger problem sizes. The model is also shown to obtain reasonable accuracy on a dataset of real world instances with up to 4k vertices, vastly surpassing the sizes of the largest instances with which the model was trained (n=128). Finally, we believe that our contributions attest to the potential of GNN in symbolic domains in general and in relational learning in particular.”

Deep Graph Convolutional Encoders for Structured Data to Text Generation (Oct 2018) [code]

“Most previous work on neural text generation from graphstructured data relies on standard sequencetosequence methods. These approaches linearise the input graph to be fed to a recurrent neural network. In this paper, we propose an alternative encoder based on graph convolutional networks that directly exploits the input structure. We report results on two graphtosequence datasets that empirically show the benefits of explicitly encoding the input graph structure.”

“We compared LSTM sequential encoders with a structured data encoder based on GCNs on the task of structured data to text generation. On two different tasks, WebNLG and SR11Deep, we show that explicitly encoding structural information with GCNs is beneficial with respect to sequential encoding.”


Accurate, Efficient and Scalable Graph Embedding (Oct 2018)
“The Graph Convolutional Network (GCN) model and its variants are powerful graph embedding tools for facilitating classification and clustering on graphs. However, a major challenge is to reduce the complexity of layered GCNs and make them parallelizable and scalable on very large graphs – state of the art techniques are unable to achieve scalability without losing accuracy and efficiency. In this paper, we propose novel parallelization techniques for graph samplingbased GCNs that achieve superior scalable performance on very large graphs without compromising accuracy. … We demonstrate that our parallel graph embedding outperforms state of the art methods in scalability (with respect to number of processors, graph size and GCN model size), efficiency and accuracy on several large datasets. On a 40core Xeon platform, our parallel training achieves 64× speedup in the sampling step and 25× speedup in the feature propagation step, compared to the serial implementation, resulting in a net speedup of 21×. Our scalable algorithm enables deeper GCN, as demonstrated by 1306× speedup on a 3layer GCN compared to Tensorflow implementation of state of the art.”

[Oct 2018 blog post] Knowledge Graph Convolutional Networks: Machine Learning over Reasoned Knowledge (discussion) – relevant to Relational Inductive Biases, Deep Learning, and Graph Networks.

Hierarchical Bipartite Graph Convolution Networks (Dec 2018)
“Recently, graph neural networks have been adopted in a wide variety of applications ranging from relational representations to modeling irregular data domains such as point clouds and social graphs. However, the space of graph neural network architectures remains highly fragmented impeding the development of optimized implementations similar to what is available for convolutional neural networks. In this work, we present BiGraphNet, a graph neural network architecture that generalizes many popular graph neural network models and enables new efficient operations similar to those supported by ConvNets. By explicitly separating the input and output nodes, BiGraphNet: (i) generalizes the graph convolution to support new efficient operations such as coarsened graph convolutions (similar to strided convolution in convnets), multiple input graphs convolution and graph expansions (unpooling) which can be used to implement various graph architectures such as graph autoencoders, and graph residual nets; and (ii) accelerates and scales the computations and memory requirements in hierarchical networks by performing computations only at specified output nodes.”
[Image source. Click image to open in new window.] 
ICML 2019 WorkshopLearning and Reasoning with GraphStructured Data (Spring 2019) [discussion (reddit)]
“Learning and reasoning with graphstructured representations is gaining increasing interest in both academia and industry, due to its fundamental advantages over more traditional unstructured methods in supporting interpretability, causality, and transferability / inductive generalization. Recently, there is a surge of new techniques in the context of deep learning, such as graph neural networks, for learning graph representations and performing reasoning and prediction, which have achieved impressive progress. However, it can still be a long way to go to obtain satisfactory results in longrange multistep reasoning, scalable learning with very large graphs, flexible modeling of graphs in combination with other dimensions such as temporal variation and other modalities such as language and vision. New advances in theoretical foundations, models and algorithms, as well as empirical discoveries and applications are therefore all highly desirable. The aims of this workshop are to bring together researchers to dive deeply into some of the most promising methods which are under active exploration today, discuss how we can design new and better benchmarks, identify impactful application domains, encourage discussion and foster collaboration.”
Graph Attention Networks
Graphstructured data arise naturally in many different application domains. By representing data as graphs we can capture entities (nodes) and their relationships (edges). Useful insights can be derived from graphstructured data, demonstrated by an evergrowing body of work focused on graph mining. However, in the realworld, graphs can be both large (with many complex patterns) and noisy, which poses problems for effective graph mining. An effective way to deal with this issue is to incorporate “attention” into graph mining solutions. An attention mechanism allows a method to focus on taskrelevant parts of the graph, helping it to make better decisions. Attention Models in Graphs: A Survey (Jul 2018) provides a recent focused survey of the literature on the emerging field of graph attention models.
[Image source. Click image to open in new window.]Unlike graph convolutional networks (GCNs) which use a fixed or learnable polynomial of a graph Laplacian or an adjacency matrix to aggregate (filter) node information, graph attention networks (GATs) aggregate node information by using an attention mechanism on graph neighborhoods (Adaptive Edge Features Guided Graph Attention Networks). The essential difference between GAT and GCN is stark: in GCN the weights for aggregating (filtering) neighbor nodes are defined by the graph structure, which is independent on node contents; in contrast, weights in GAT are a function of node contents, thanks to the attention mechanism. Results on graph node classification show that the adaptiveness of GAT makes it more effective to fuse information from node features and graph topological structures.
One major problem in current graph neural network models such as GAT and GCN is that edge features, which contain important information about graphs, are not incorporated (Adaptive Edge Features Guided Graph Attention Networks). In GAT, graph topological information is injected into the model by forcing the attention coefficient between two nodes to zero if they are not connected. Therefore, the edge information used in GAT is only the indication about whether there is an edge or not, i.e. connectivities. However, graph edges are often in possession of rich information like strengths, types, etc. Instead of being a binary indicator variable, edge features could be continuous, e.g. strengths, or multidimensional. Properly addressing this problem is likely to benefit many graph learning problems. Another problem of GAT and GCN is that each GAT or GCN layer filters node features based on the original input adjacency matrix. The original adjacency matrix is likely to be noisy and not optimal, which will restrict the effectiveness of the filtering operation.
Adaptive Edge Features Guided Graph Attention Networks (Sep 2018) proposed a graph learning model, edge features guided graph attention networks (EGAT). Guided by the edge features, the attention mechanism on pairs of graph nodes not only depended on node contents but also adjusted automatically with respect to the properties of the edge connecting those nodes. Moreover, the edge features were adjusted by the attention function and fed to the next layer, which meant that the edge features were adaptive across network layers. Experimental results on three citation networks (including PubMed) and a biological network (a chromatin interaction network: Bio1kbHic) showed that EGAT outperformed the current state of the art methods.
[Image source. Click image to open in new window.]Graph Attention Networks (Feb 2018) [code; nonauthor code here and here] are neural network architectures that operate on graph structured data, leveraging masked selfattentional layers to address the shortcomings of prior methods based on graph convolutions (GCN, above) or their approximations. The GAT models described in this paper (by Yoshua Bengio and colleagues) achieved or matched state of the art results across four established transductive and inductive graph benchmarks: the Cora, CiteSeerX and PubMed citation network datasets, as well as a proteinprotein interaction dataset (in which test graphs remained unseen during training). Note Figure 2 in that paper, showing a tSNE plot of the computed feature representations of a pretrained GAT model’s first hidden layer on the Cora dataset. GAT appears to offer better results than GCN (above) and GraphSAGE (below).
[Image source. Click image to open in new window.]Notes:
Regarding “masked attention,” refer to Attention Is All You Need, and Attention in Neural Networks and How to Use It.
The CORA dataset [GitHub] is a citation network consisting of 2,708 machine learning publications classified into one of seven classes; the citation network consists of 5,429 links.
Advantages of Graph Attention Networks (GAT )
Discussed in Graph Attention Networks, the graph attentional layers in GAT directly address several issues affecting prior approaches to modeling graph structured data with neural networks:

Computational efficiency: selfattentional layer operations can be parallelized across all edges, and the computation of output features can be parallelized across all nodes. The time complexity of a single GAT attention head computing $\small F’$ features may be expressed as $\small O(VFF’ + EF’)$, where $\small F$ is the number of input features, and $\small V$ and $\small E$ are the numbers of nodes and edges in the graph, respectively. This complexity is on par with the baseline methods such as graph convolutional networks (GCN).

As opposed to GCN, GAT stack graph layers in which nodes are able to attend over their neighborhood features, which enables (implicitly) specifying different weights to different nodes in a neighborhood, enabling a leap in model capacity without requiring any kind of costly matrix operation (such as inversion) or depending on knowing the graph structure upfront. Furthermore, analyzing the learned attentional weights may lead to benefits in interpretability, as was the case in the machine translation domain.

The attention mechanism is applied in a shared manner to all edges in the graph, and therefore it does not depend on upfront access to the global graph structure or (features of) all of its nodes (a limitation of many prior techniques). This has several desirable implications: the graph is not required to be undirected, and it makes the GAT technique directly applicable to inductive learning – including tasks where the model is evaluated on graphs that are completely unseen during training.

Unlike previous methods which sampled a fixedsize neighborhood of each node to keep its computational footprint consistent (therefore not allowing access to the entirety of the neighborhood while performing inference), GAT works with the entirety of the neighborhood (at the expense of a variable computational footprint, which is still onpar with methods like the GCN) and does not assume any ordering within it.

The authors were also able to produce a version of the GAT layer that leverages sparse matrix operations, reducing the storage complexity to linear in the number of nodes and edges and enabling the execution of GAT models on larger graph datasets.
Deeply Learning Molecular StructureProperty Relationships using Graph Attention Neural Network (Oct 2018) [code] applied GAT to the study of molecular structureproperty relationships, which are key to molecular engineering for materials and drug discovery. The authors showed that GAT could greatly improve performance of the deep learning for chemistry, distinguishing atoms in different environments and extracting important structural features determining target properties (such as molecular polarity, solubility, and energy). Interestingly, it identified two distinct parts of molecules; as a result, it could accurately predict molecular properties. Moreover, the resultant latent space was wellorganized such that molecules with similar properties were closely located, which is critical for successful molecular engineering.
[Image source. Click image to open in new window.]Aside: returning to the graph convolutional networks domain, Geometric Deep Learning Autonomously Learns Chemical Features That Outperform Those Engineered by Domain Experts (2018) [code not available but see the deepchem repository mentioned in the paper; discussion] explored the performance of deep learning methods in the context of drug discovery, comparing machine learned features against the domain expert engineered features that are mainstream in the pharmaceutical industry. [See also Geometric deep learning.]
[Image source. Click image to open in new window.][Image source. Click image to open in new window.]
In an innovative approach, Attentionbased Graph Neural Network for Semisupervised Learning (Mar 2018) [nonauthor code] proposed a novel graph neural network (AGNN) that removed all the intermediate fullyconnected layers, and replaced the propagation layers with attention mechanisms that respected the structure of the graph. The attention mechanism allowed them to learn a dynamic and adaptive local summary of the neighborhood to achieve more accurate predictions. Using this approach, the authors attained state of the art results on classification of a PubMed citation network (shown in Table 2 in that paper). Notably, as shown in Fig. 2, AGNN correctly classified three target nodes in the test set that were misclassified by a graph convolutional network (GCN), and shows how the local attention network weighted the contribution of its local neighborhood.
[Image source. Click image to open in new window.]There are many hyperparameters (nontrained parameters, such as random walk length) in graph embedding methods which have to be manually tuned for every graph. Watch Your Step: Learning Node Embeddings via Graph Attention (Sep 2018) [code] by the University of Southern California and Google AI replaced random walk hyperparameters with trainable parameters that they automatically learned via backpropagation. In particular, they learned a novel attention model on the power series of the transition matrix, which guided the random walk to optimize an upstream objective. Unlike previous approaches to attention models, the method that we propose utilized attention parameters exclusively on the data (e.g. on the random walk), and not used by the model for inference. They experimented on link prediction tasks, as they aimed to produce embeddings that bestpreserved the graph structure, generalizing to unseen information. They improved state of the art on a comprehensive suite of real world datasets including social, collaboration, and biological networks. Adding attention to random walks reduced the error by 20% to 45% on datasets evaluated. Further, the learned attention parameters were different for every graph, and the automatically found values agreed with the optimal choice of hyperparameters, if existing methods were manually tuned.
[Image source. Click image to open in new window.]
Graph convolutional networks (GCN) have been able to achieve state of the art results in the task of node classification; however, since GCN relies on the localized firstorder approximations of spectral graph convolutions, it is unable to capture higherorder interactions between nodes in the graph. Higherorder Graph Convolutional Networks (Sep 2018) proposed a motifbased
Source graph attention model called Motif Convolutional Networks (MCN), which generalized past approaches by using weighted multihop motif adjacency matrices to capture higherorder neighborhoods. A novel attention mechanism was used, allowing each individual node to select the most relevant neighborhood to apply its filter. Experiments showed that their proposed method was able to achieve state of the art results on the semisupervised node classification task.
Graph embedding methods represent nodes in a continuous vector space, preserving different types of relational information from the graph. There are many hyperparameters to these methods (e.g. the length of a random walk) which have to be manually tuned for every graph. Watch Your Step: Learning Node Embeddings via Graph Attention (Google AI: NeurIPS 2018 poster) [code; discussion] replaced previously fixed hyperparameters with trainable ones automatically learned via backpropagation. They proposed a novel attention model on the power series of the transition matrix, which guided the random walk to optimize an upstream objective. Unlike previous approaches to attention models, their method utilized attention parameters exclusively on the data (e.g. on the random walk), and were not used by the model for inference. Experiments on link prediction tasks aimed to produce embeddings that bestpreserved the graph structure, generalizing to unseen information. They improved state of the art results on a comprehensive suite of realworld graph datasets including social, collaboration, and biological networks (a proteinprotein interaction network), where their graph attention reduced the error up to 2040%.
[Image source. Click image to open in new window.]Additional Reading

Improving Robustness of Attention Models on Graphs (Nov 2018)
“Machine learning models that can exploit the inherent structure in data have gained prominence. In particular, there is a surge in deep learning solutions for graphstructured data, due to its widespread applicability in several fields. Graph attention networks (GAT), a recent addition to the broad class of feature learning models in graphs, utilizes the attention mechanism to efficiently learn continuous vector representations for semisupervised learning problems. In this paper, we perform a detailed analysis of GAT models, and present interesting insights into their behavior. In particular, we show that the models are vulnerable to adversaries (rogue nodes) and hence propose novel regularization strategies to improve the robustness of GAT models. Using benchmark datasets, we demonstrate performance improvements on semisupervised learning, using the proposed robust variant of GAT.”
[Image source. Click image to open in new window.]
Question Answering Over Knowledge Graphs
As with textual sources, knowledge graphs (which provide well structured relational information between entities, and allow inference over indirect facts) may be utilized for questioning answering. Recent neural networkbased approaches to knowledge graph based QA include the following works.
Attentionbased, sequencetosequence models; e.g., A Joint Model for Question Answering and Question Generation (Jun 2017), which used a sequencetosequence framework that encoded the document and generated a question given an answer, and an answer given a question;
[Image source. Click image to open in new window.]Simple Question Answering by Attentive Convolutional Neural Network (Oct 2016) [data here and here; discussion, links to nonauthor code] focused on answering singlerelation factoid questions over a KG (Freebase). This task, simple question answering (SimpleQA), was addressed via a twostep pipeline: entity linking and fact selection. In fact selection, we match the subject entity in a fact candidate with the entity mention in the question by a characterlevel convolutional neural network (charCNN), and match the predicate in that fact with the question by a wordlevel CNN (wordCNN).
[Image source. Click image to open in new window.]Neural NetworkBased Question Answering Over Knowledge Graphs on Word and Character Level (2017) [code: not maintained; dataset: missing (users requests ignored)], which trained a neural network for answering simple questions in an endtoend manner, leaving all decisions to the model. It learned to rank subjectpredicate pairs to enable the retrieval of relevant facts, given a question. The network contains a nested word/characterlevel question encoder that allows it to handle outofvocabulary and rare word problems, while still being able to exploit wordlevel semantics;
[Image source. Click image to open in new window.]Variational Reasoning for Question Answering with Knowledge Graph (Nov 2017) [data] provided a unified deep learning architecture and an endtoend variational learning algorithm that could simultaneously handle noise in questions, while learning multihop reasoning.
[Image source. Click image to open in new window.]Similarly, Richard Socher and colleagues at SalesForce employed a multihop reasoning approach to QA over incomplete knowledge graphs [MultiHop Knowledge Graph Reasoning with Reward Shaping (Sep 2018) [code]. A detailed analysis of their reinforcement learning approach (as opposed to knowledge graph embeddingbased models) indicated that access to a more accurate environment representation (reward shaping) and a more thorough exploration of the search space (action dropout) were important to performance.
[Image source. Click image to open in new window.]EARL: Joint Entity and Relation Linking for Question Answering Over Knowledge Graphs (Jun 2018) [code], which performed entity linking and relation linking as a joint task:
[Image source. Click image to open in new window.]“Many question answering systems over knowledge graphs rely on entity and relation linking components in order to connect the natural language input to the underlying knowledge graph. Traditionally, entity linking and relation linking have been performed either as dependent sequential tasks or as independent parallel tasks. In this paper, we propose a framework called EARL, which performs entity linking and relation linking as a joint task. EARL implements two different solution strategies for which we provide a comparative analysis in this paper: The first strategy is a formalisation of the joint entity and relation linking tasks as an instance of the Generalised Travelling Salesman Problem (GTSP). In order to be computationally feasible, we employ approximate GTSP solvers. The second strategy uses machine learning in order to exploit the connection density between nodes in the knowledge graph. It relies on three base features and reranking steps in order to predict entities and relations. We compare the strategies and evaluate them on a dataset with 5000 questions. Both strategies significantly outperform the current stateoftheart approaches for entity and relation linking.”
Andrew McCallum et al. (Go for a Walk and Arrive at the Answer: Reasoning Over Paths in Knowledge Bases using Reinforcement Learning (Nov 2017) [code] proposed the MINERVA algorithm, that addressed the difficult and practical task of answering questions where only one entity of the relation was known (versus finding paths connecting a given pair of entities). Since random walks are impractical in a setting with combinatorially many destinations from a start node, they presented a neural reinforcement learning approach that learned how to navigate the graph conditioned on the input query to find predictive paths. Empirically, this approach significantly outperformed prior methods on several datasets. The model was robust, and could learn long chainsofreasoning (~93% accuracy on 810 path length: note their Fig. 2); moreover it did not require pretraining or initial supervision.
[Image source. Click image to open in new window.]DeepPath : A Reinforcement Learning Method for Knowledge Graph Reasoning (Jul 2018) [code and NELL dataset] studied the problem of learning to reason in large scale knowledge graphs. They described a novel reinforcement learning framework for learning multihop relational paths, using a policybased agent with continuous states based on knowledge graph embeddings, which reasoned in a KG vector space by sampling the most promising relation to extend its path. In contrast to prior work, their approach included a reward function that took the accuracy, diversity, and efficiency into consideration. Their method outperformed a pathranking based algorithm and knowledge graph embedding methods on Freebase and NeverEnding Language Learning datasets.
[Image source. Click image to open in new window.]An implementation of a tumorbased knowledge graph is presented in Semantically Linking In silico Cancer Models (Dec 2014), which introduced a graphbased model for semantically linking computational cancer models via domain graphs to better understand and explore combinations of models spanning multiple biological scales.
“Multiscale models are commonplace in cancer modeling, where individual models acting on different biological scales are combined within a single, cohesive modeling framework. However, model composition gives rise to challenges in understanding interfaces and interactions between them. Based on specific domain expertise, typically these computational models are developed by separate research groups using different methodologies, programming languages, and parameters. This paper introduces a graphbased model for semantically linking computational cancer models via domain graphs that can help us better understand and explore combinations of models spanning multiple biological scales. We take the data model encoded by TumorML, an XMLbased markup language for storing cancer models in online repositories, and transpose its model description elements into a graphbased representation. By taking such an approach, we can link domain models, such as controlled vocabularies, taxonomic schemes, and ontologies, with cancer model descriptions to better understand and explore relationships between models. The union of these graphs creates a connected property graph that links cancer models by categorizations, by computational compatibility, and by semantic interoperability, yielding a framework in which opportunities for exploration and discovery of combinations of models become possible.”

Critique.
While I find their implementation to be very awkwardly conceptualized and described, it nonetheless provides examples of Cypherbased question answer queries that are relevant to this REVIEW (note that I implemented my KG in Neo4j, which uses the Cypher graph query language).
An interesting graphbased approach to question answering is the use of subgraphs. Work from the Allen Institute for AI, Answering Complex Questions Using Open Information Extraction (Apr 2017), used a recently proposed support graph optimization framework for QA to develop a new inference model for Open IE. Their model significantly outperformed a state of the art structured solver on complex questions of varying difficulty, while also removing the reliance on manually curated knowledge. Their system, TupleINF, used integer linear programming to solve QA over complex structures that required multifact reasoning. They viewed the QA task as searching for a graph (support graph search) that best connects the terms in the question (qterms) with an answer choice via the knowledge (see Fig. 1
Source in their paper for a simple illustrative example). The qterms, answer choices, and tuples fields formed the set of possible vertices, $\small \mathcal{V}$, of the support graph. Edges connecting qterms to tuple fields and tuple fields to answer choices formed the set of possible edges, $\small \mathcal{E}$. The support graph, $\small \mathcal{G}(\mathcal{V,E})$, was a subgraph of $\small \mathcal{G}(\mathcal{V,E)}$ where $\small \mathcal{V}$ and $\small \mathcal{E}$ denoted “active” nodes and edges, respectively. Similar to TableILP (below), they scored the support graph based on the weight of the active nodes and edges.
TupleINF was inspired by their earlier TableILP work Question Answering via Integer Programming over SemiStructured Knowledge (Apr 2016) [code].
Note that integer linear programming differs from inductive logic programming (unfortunately, both are often abbreviated “ILP”): a subfield of machine learning that learns models in the form of logic programming rules (Horn Clauses) that are comprehensible to humans. Inductive logic programming is used in Induction of NonMonotonic Logic Programs to Explain Boosted Tree Models Using LIME (Aug 2018).
The approach of using subgraphs and Integer Linear Programming (ILP) in QA was also described on slides 1525 in Reasoningdriven Question Answering by the University of Pennsylvania Ph.D. candidate Daniel Khashabi, at a Stanford NLP seminar (June 2018). The work summarized in that presentation is described in the UPenn/Allen Institute for Artificial Intelligence paper Question Answering as Global Reasoning over Semantic Abstractions (2018) [code], which introduced their SemanticILP system. They claimed SemanticILP to be the first system that reasons over a wide range of semantic abstractions of the text, which are derived using offtheshelf, generalpurpose, pretrained natural language modules such as semantic role labelers, coreference resolvers, and dependency parsers. Representing multiple abstractions as a family of graphs, they translated QA into a search for an optimal subgraph that satisfied certain global and local properties. SemanticILP outperformed baseline models BiDAF, an information retrieval solver, and TupleINF on QA tasks on 4th and 8th grade science exams and a biology exam (Tables 4 and 5 in their paper). Thus on two domains with very different characteristics, the graphbased SemanticILP outperformed traditional and neural network models.
[Image source. Click image to open in new window.] See also their Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences (2018), which described their reading comprehension challenge, MultiSentence Reading Comprehension (MultiRC) [project; Leaderboard/data; GitHub].
Most research in reading comprehension has focused on answering questions based on individual documents or even single paragraphs. Question Answering by Reasoning Across Documents with Graph Convolutional Networks (Aug 2018) by Cao et al. introduced a method which integrates and reasons relying on information spread within documents and across multiple documents. They framed this task as an inference problem on a graph, with mentions of entities as nodes and edges encoding relations between different mentions. Graph convolutional networks (GCN) were applied to these graphs and trained to perform multistep reasoning. Multihop reading comprehension focuses on one type of factoid question, where a system needs to properly integrate multiple pieces of evidence to correctly answer a question. Each step of the algorithm (also referred to as a hop) updated all node representations in parallel. In particular, a node was updated as a function of messages from its direct neighbours. At the end of the first step, every node was aware of every other node it connects directly to. Taking this idea recursively, each further step of the algorithm allowed a node to indirectly interact with nodes already known to their neighbours. After $\small L$ layers of RGCN, information has been propagated through paths connecting up to $\small L$ nodes. In this work, words in supporting documents and in the query using a model of contextualized word representation (ELMo). ELMo encodings were used to produce a query representation, as well as a set of representations. Their EntityGCN method achieved state of the art results on the WikiHop dataset.
[Image source. Click image to open in new window.] twitter (Nicola De Cao: 20180831): “Happy to share our new work: Question Answering by Reasoning Across Documents with Graph Convolutional Networks. Currently the best model on the WikiHop Leaderboard [→ EntityGCN; later surpassed by Socher’s CFC model]! … We used a CodaLab worksheet to run the validation and test which is public and available here. Unfortunately, the code is not commented but it is useful for reproducibility. I aim to create a GitHub repository soon as well.”
As stated in their paper, Question Answering by Reasoning Across Documents with Graph Convolutional Networks,

"The methods reported in Welbl et al. (2017)  'Constructing Datasets for Multihop Reading Comprehension Across Documents'  approach the task by merely concatenating all documents into a single long text and training a standard reading comprehension model, namely, BiDAF and FastQA. Instead, we frame question answering as an inference problem on a graph representing the document collection. Nodes in this graph correspond to named entities in the document whereas edges encode simple relations between them (e.g., crossdocument and withindocument coreferences). We assume that the chains of reasoning can be captured by propagating information, first extracted by a deep contextualized word representation such as ELMo, along edges in this graph using a graph convolutional network (GCN). Our *EntityGCN* model substantially outperforms previous systems. Our ablation studies confirm that we benefit both from using contextualized word embeddings and from modeling relations between entities."
[Image source. Click image to open in new window.]
Related to the work described in Question Answering by Reasoning Across Documents with Graph Convolutional Networks, Exploring Graphstructured Passage Representation for Multihop Reading Comprehension with Graph Neural Networks (Sep 2018) also employed a multihop reading comprehension approach on the WikiHop dataset. Previous work approximated global evidence with local coreference information, encoding coreference chains with directed acyclic graph (DAG)styled GRU layers within a GatedAttention Reader. However, coreference is limited in providing information for rich inference. In this present work, the authors introduced a new method for better connecting global evidence, which formed more complex graphs compared to DAGs. To perform evidence integration on their graphs, they investigated two recent graph neural networks: graph convolutional networks (GCN), and graph recurrent networks (GRN). After obtaining the representation vectors for question and entity mentions in passages, an additive attention model (Bahdanau et al., 2015) was adopted, treating all entity mention representations and the question representation as the memory and the query, respectively. Word embeddings were initialized from the 300dimensional pretrained GloVe word embeddings.
[Image source. Click image to open in new window.]That paper, Question Answering by Reasoning Across Documents with Graph Convolutional Networks, included two footnotes of interest (2; 7), which stated:

“The concurrent unpublished work (Cao et al., 2018) also investigates the usage of graph convolution networks on WikiHop. Our work proposes a different model architecture, and focus more on the exploration and comparison of multiple edge types for building the graphstructured passage representation.

“At submission time we observe a recent short arXiv paper (Cao et al., 2018), available on August 28th, showing an accuracy of 67.6 using ELMo (Peters et al., 2018), which is the only result better than ours. ELMo has achieved dramatic performance gains of 3+ points over a broad range of tasks. Our main contribution is studying an evidence integration approach, which is orthogonal to the contribution of ELMo on largescale training. For more fair comparison with existing work, we did not adopt ELMo, but we will conduct experiments with ELMo as well.”
Exploiting Explicit Paths for Multihop Reading Comprehension (Nov 2018) focused on the task of multihop reading comprehension where a system was required to reason over a chain of multiple facts, distributed across multiple passages, to answer a question. Inspired by graphbased reasoning, they presented a pathbased reasoning approach for textual reading comprehension, which operated by generating potential paths across multiple passages, extracting implicit relations along this path, and composing them to encode each path. The proposed model achieved a 2.3% gain on the WikiHop Dev set as compared to previous state of the art, and was also able to explain its reasoning through explicit paths of sentences.
[Image source. Click image to open in new window.]Lastly, in a variation of graph convolutional networks (GCN), Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text (Sep 2018) [code] by Ruslan Salakhutdinov and colleagues at Carnegie Mellon University looked at the combination of a knowledge base/graph and entitylinked text – appropriate when an incomplete KB is available with a large text corpus. They built on recent advances in graph representation learning, proposing their novel GRAFTNet model for extracting answers from a questionspecific subgraph containing text and KB entities and relations.
To answer a question posed in natural language, GRAFTNet considered a heterogeneous graph constructed from text (sentences retrieved from Wikipedia) and KB facts (constructed from entities around question seeds in FreeBase), leveraging the rich relational structure between the two information sources. Embeddings were propagated in the graph for a fixed number of layers, and the final node representations were used to classify answers. The model included an attention mechanism, computed using the question and relation embeddings. They showed that GRAFTNet was competitive with the state of the art when tested using either KB or text alone, vastly outperforming existing methods in the combined setting.
[image source. click image to open in new window.]Additional Reading

Learning to Rank Query Graphs for Complex Question Answering over Knowledge Graphs (Nov 2018) [code]
“In this paper, we conduct an empirical investigation of neural query graph ranking approaches for the task of complex question answering over knowledge graphs. We experiment with six different ranking models and propose a novel selfattention based slot matching model which exploits the inherent structure of query graphs, our logical form of choice. Our proposed model generally outperforms the other models on two QA datasets over the DBpedia knowledge graph, evaluated in different settings. In addition, we show that transfer learning from the larger of those QA datasets to the smaller dataset yields substantial improvements, effectively offsetting the general lack of training data.”

Graph based Question Answering System (Dec 2018)
“In today’s digital age in the dawning era of big data analytics it is not the information but the linking of information through entities and actions which defines the discourse. Any textual data either available on the Internet off offline (like newspaper data, Wikipedia dump, etc) is basically connect information which cannot be treated isolated for its wholesome semantics. There is a need for an automated retrieval process with proper information extraction to structure the data for relevant and fast text analytics. The first big challenge is the conversion of unstructured textual data to structured data. Unlike other databases, graph databases handle relationships and connections elegantly. Our project aims at developing a graphbased information extraction and retrieval system.”
Graph Representation Learning
Representation learning, variously mentioned above, is a class of machine learning techniques that autonomously learn to derive latent features from raw data. Representation learning of knowledge graphs (graph representation learning) aims to project both entities and relations into a continuous lowdimensional space [Representation Learning of Knowledge Graphs with Entity Descriptions (2016)].
Exploring the Semantic Content of Unsupervised Graph Embeddings: an Empirical Study (Jun 2018) reviews graph embedding techniques, demonstrating that topological features are approximated by the embedding space, allowing key insight into how graph embeddings create good representations.
Machine learning on graphs is an important and ubiquitous task with applications ranging from drug design to friendship recommendation in social networks. The primary challenge in this domain is finding a way to represent, or encode, graph structure so that it can be easily exploited by machine learning models. Traditionally, machine learning approaches relied on userdefined heuristics to extract features encoding structural information about a graph (e.g., degree statistics or kernel functions). However, recent years have seen a surge in approaches that automatically learn to encode graph structure into lowdimensional embeddings, using techniques based on deep learning and nonlinear dimensionality reduction. Representation Learning on Graphs: Methods and Applications (presented by coauthor Jure Leskovec as a keynote speech at the 4th IEEE BigData Conference (Boston, MA: December 2017; slides) provides a superb overview of machine learning on graphs and graph representation learning. The primary challenge in this domain is finding a way to represent, or encode, graph structure so that it can be easily exploited by machine learning models. [GraphSAGE is discussed on pp. 1213].
[Image source. Click image to open in new window.]Inductive Representation Learning on Large Graphs (Jun 2017, updated Sep 2018) (also by Jure Leskovec, Stanford University) [project; code; applied code; slides] introduced GraphSAGE, a general, inductive framework that leveraged node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, they learned a function that generated embeddings by sampling and aggregating features from a node’s local neighborhood. The authors classified the category of unseen nodes in evolving information graphs based on citation and Reddit post data, and – relevant to the biomedical domain – showed that their algorithm generalized to completely unseen graphs using a multigraph dataset of proteinprotein interactions.
[Image source. Click image to open in new window.]
See also:
 LargeScale Learnable Graph Convolutional Networks (Aug 2018).
 Inductive Representation Learning on Large Graphs (Jun 2017, updated Sep 2018)
 Deep Graph Infomax (Sep 2018)
Work by authors (Representation Learning on Graphs with Jumping Knowledge Networks (Jun 2018)) at MIT (with National Institute of Informatics, Tokyo coauthors) explored a “jumping knowledge networks” architecture that for each node flexibly leveraged different neighborhood ranges to enable better structureaware representation. In a number of experiments on social, bioinformatics and citation networks they demonstrated state of the art performance. Furthermore, combining the jumping knowledge framework with models like Graph Convolutional Networks (GCN), GraphSAGE, and Graph Attention Networks (GAT) consistently improved the performance of those models.
[Image source. Click image to open in new window.]Followon work by those MIT authors (How Powerful are Graph Neural Networks? (Oct 2018) – with Stanford University coauthors including Jure Leskovec – characterized the discriminative power of popular GNN variants such as graph convolutional networks (GCN) and GraphSAGE, showing that they cannot learn to distinguish certain simple graph structures. They then developed a simple architecture that was provably the most expressive among the class of GNN and was as powerful as the WeisfeilerLehman graph isomorphism test, demonstrating that their model (GIN: Graph Isomorphism Network) achieved state of the art performance. Their results confirmed that the most powerful GNN (their GIN model) had high representational power, as it almost perfectly fit the training data – whereas the less powerful GNN variants often severely underfit the training data.
[Image source. Click image to open in new window.]Bioinformatics datasets (shown in Table 1 in the figure, above).
 MUTAG is a dataset of 188 mutagenic aromatic and heteroaromatic nitro compounds with 7 discrete labels.
 PROTEINS is a dataset where nodes are secondary structure elements (SSEs) and there is an edge between two nodes if they are neighbors in the aminoacid sequence or in 3D space. It has 3 discrete labels, representing helix, sheet or turn.
 PTC is a dataset of 344 chemical compounds that reports the carcinogenicity for male and female rats and it has 19 discrete labels.
 NCI1 is a dataset made publicly available by the National Cancer Institute (NCI) and is a subset of balanced datasets of chemical compounds screened for ability to suppress or inhibit the growth of a panel of human tumor cell lines, having 37 discrete labels.
Relevant to How Powerful are Graph Neural Networks?, above, A Simple Yet Effective Baseline for NonAttribute Graph Classification (Nov 2018) sought to better understand the increasingly complex methods necessary for the application of graph kernels or graph neural networks to graph classification and representation learning on graphs. As a first step, they developed a simple yet meaningful graph representation which they called LDP, and explored its effectiveness in graph classification. Interestingly, this simple representation achieved similar performance as the state of the art graph kernels and graph neural networks for nonattributed graph classification. Its performance on classifying attributed graphs was slightly weaker as it did not incorporate attributes; however, given its simplicity and efficiency it still serves as an effective baseline for attributed graph classification. They also provided a simple connection with the graph neural networks. Caveats included that these observations were only for the task of graph classification, while existing methods are often designed for a broader scope including node embedding and link prediction. The results are also likely biased due to the limited amount of benchmark datasets available. Nevertheless, the good performance of our simple baseline calls for the development of new, more comprehensive benchmark datasets so as to better evaluate and analyze different graph learning methods. Furthermore, given the computational efficiency of our graph summary, we believe that it is a good candidate as a baseline method for future graph classification (or even other graph learning) studies.
Relevant to How Powerful are Graph Neural Networks? (cited as [XHLJ18] in this paper):
“The theoretical understanding of the properties and limitations of GNNs is somewhat limited, although there exists very recent work starting to address this issue [XHLJ18].”
“Interestingly a recent paper [XHLJ18] explored the expressiveness of different pooling strategies and concludes that sum pooling is more powerful than mean and max pooling.”
“Interestingly, in a recent paper [XHLJ18] that explores the discriminative power of graph neural networks showed that GNNs are at most as powerful as the WL test in distinguishing graph structures. By choosing right aggregation function, they developed the graph isomorphism network (GIN), that can achieve the same discriminative power as the WL test. However, in experiments it is observed that GIN outperforms the WL kernel on social network datasets. One explanation the authors provide is that WLtests are onehot encodings and thus cannot capture the ‘similarity’ between subtrees (while they can capture that whether they are ‘the same’ or not). In contrast, a GNN satisfying certain criteria (see Theorem 3 of [XHLJ18]) generalizes the WL test by learning to embed the subtrees to a feature space continuously. This enables GNNs to not only discriminate different structures, but also learn to map similar graph structures to similar embeddings and capture dependencies between graph structures.”
Otherwise relevant:
“As we can see, LDP is still quite good on MUTAG, PTC, and PROTEIN datasets that do not have rich label information. For graphs with rich labels such as ENZYME, DD, and NCI1 (see details in Appendix), LDP and its variant is much worse than the other kernels that can better utilize label information. This also suggests that MUTAG, PTC, and PROTEIN are not sufficient to evaluate different methods.
“Most graph kernels aim to capture graph topology and graph similarity in the hope of improving classification accuracy. Our experiment suggests that this is not yet wellreflected on current benchmark datasets for nonattribute graphs. This calls for the construction of better and more comprehensive benchmark datasets for understanding and analyzing graph representations. On the other hand, we emphasize that in scenarios where both graph nodes and edges have certain associated features, proper handling labels can significantly improve the classification accuracy, as is shown clearly for the NCI1 dataset. Graph kernels have been rather effective in incorporating node/edge attributes. It will be an interesting question to see how to incorporate attributes in a more effective yet still simple manner in our graph representation. Also, although there are large scale chemical graphs datasets available, a benchmark dataset that contains many large graphs is still missing. We plan to create such benchmark dataset for future use. In general, while not addressed in this paper, we note that understanding the power and limitation of various graph representations, as well as the types of datasets with respect to which they are most effective, are crucial, yet challenging and remain largely open.”
SimGNN: A Neural Network Approach to Fast Graph Similarity Computation (WSDM 2019) [nonauthor code here; author code here; discussion] proposed a novel neural network based approach to graph similarity search called SimGNN, which combined two strategies. First, they designed a learnable embedding function that mapped every graph into an embedding vector, which provided a global summary of a graph. A novel attention mechanism was proposed to emphasize the important nodes with respect to a specific similarity metric. Second, they designed a pairwise node comparison method to supplement the graphlevel embeddings with finegrained nodelevel information. Their model achieved better generalization on unseen graphs, and in the worst case ran in quadratic time with respect to the number of nodes in two graphs.
[Image source. Click image to open in new window.]Graph representation learning methods can be classified into two categories: generative models that learn the underlying connectivity distribution in the graph, and discriminative models that predict the probability of edge existence between a pair of vertices. GraphGAN: Graph Representation Learning with Generative Adversarial Nets (Nov 2017) [code] unified these two classes of methods in an adversarial approach in which the generative model and discriminative model played a gametheoretical minimax game for node classification and link prediction and recommendation. For a given vertex (node), the generative model tried to fit its underlying true connectivity distribution over all other vertices and produced “fake” samples to fool the discriminative model, while the discriminative model tried to detect whether the sampled vertex was from ground truth or generated by the generative model. With the competition between these two models, both of them could alternately and iteratively boost their performance. It has been demonstrated (GraphGAN) that graph representation learning with generative adversarial nets achieves substantial gains over state of the art baselines in a variety of applications including link prediction, node classification, and recommendation.
[Image source. Click image to open in new window.]In a adversarial approach similar to GraphGAN (different authors), KBGAN: Adversarial Learning For Knowledge Graph Embeddings (Nov 2017; updated Apr 2018) [code] introduced KBGAN, an adversarial learning framework to improve the performances of a wide range of existing knowledge graph embedding models. Because KG typically only contain positive facts, sampling useful negative training examples is a nontrivial task. Replacing the head or tail entity of a fact with a uniformly randomly selected entity is a conventional method for generating negative facts, but the majority of the generated negative facts can be easily discriminated from positive facts, and will contribute little towards the training. Inspired by generative adversarial networks (GANs), these authors used one knowledge graph embedding model as a negative sample generator to assist the training of the desired model, which acted as the discriminator in the GANs. This framework was independent of the concrete form of generator and discriminator, and therefore could utilize a wide variety of knowledge graph embedding models as building blocks. They adversarially trained two translationbased models, TransE and TransD, each with assistance from one of the two probabilitybased models, DistMult and ComplEx. The performance of KBGAN was evaluated on the link prediction task using three knowledge base completion datasets – FB15k237, WN18 and WN18RR – showing that adversarial training substantially improved the performances of target embedding models under various settings.
[Image source. Click image to open in new window.]
NSCaching: Simple and Efficient Negative Sampling for Knowledge Graph Embedding (Jan 2019)
“Knowledge Graph (KG) embedding is a fundamental problem in data mining research with many realworld applications. It aims to encode the entities and relations in the graph into low dimensional vector space, which can be used for subsequent algorithms. Negative sampling, which samples negative triplets from nonobserved ones in the training data, is an important step in KG embedding. Recently, generative adversarial network (GAN), has been introduced in negative sampling. By sampling negative triplets with large scores, these methods avoid the problem of vanishing gradient and thus obtain better performance. However, using GAN makes the original model more complex and hard to train, where reinforcement learning must be used. In this paper, motivated by the observation that negative triplets with large scores are important but rare, we propose to directly keep track of them with the cache [NSCaching]. However, how to sample from and update the cache are two important questions. We carefully design the solutions, which are not only efficient but also achieve a good balance between exploration and exploitation. In this way, our method acts as a ‘distilled’ version of previous GAbased methods, which does not waste training time on additional parameters to fit the full distribution of negative triplets. The extensive experiments show that our method can gain significant improvement in various KG embedding models, and outperform the stateoftheart negative sampling methods based on GAN.”
[Image source. Click image to open in new window.]
GRAM: Graphbased Attention Model for Healthcare Representation Learning (Apr 2107) [code] presented a graphbased attention model (GRAM), which addressed two challenges of deep learning applied to healthcarerelated knowledge graphs: data insufficiency (insufficient sample size), and interpretation (representations learned by deep learning methods should align with medical knowledge). GRAM supplemented electronic health records with the hierarchical information inherent in medical ontologies (represented as a directed acyclic graph). The ontological KG provided a multiresolution view of medical concepts upon which GRAM analyzed health records via recurrent neural networks to learn accurate and interpretable representations of medical concepts. GRAM chose a weighted average of ancestors of a medical concept, and trained the entire process with a predictive model in an endtoend fashion.
[Image source. Click image to open in new window.]A new approach to learning graph embeddings that relies on structural measures of node similarities for generation of training data was presented in Learning Graph Embeddings from WordNetbased Similarity Measures (Aug 2018) [code; datasets]. Their path2vec model learned node embeddings that were able to approximate a given measure, such as the shortest path distance (node similarities, based on network structure) or any other. In contrast to existing approaches, path2vec did not rely on random walks or autoencoder architectures (link prediction) and provided a way to supply additional information for the target measure, not only the graph structure itself. Thus, in applications one could replace pathbased measures with these node embeddings, gaining significant speedup in yielding distances and nearest neighbors.
path2vec could be trained on arbitrary graph measures and was not restricted to the shortest path or to only treestructured graphs. Evaluations of the proposed model on semantic similarity and word sense disambiguation tasks (using WordNet as the source of gold similarities) showed that path2vec yielded state of the art results. Notably, the model was computationally efficient, orders of magnitude faster than the direct computation of graph distances: in the WordNet semantic similarity task the learned embeddings trained 3 or 4 orders of magnitude faster than other embedding approaches, depending on the similarity measures used.
[Image source. Click image to open in new window.]
Ruslan Salakhutdinov and colleagues proposed a probabilistic graphical model that leveraged a topic model to design a word sense disambiguation (WSD) system that scaled linearly with the number of words in the context [Knowledgebased Word Sense Disambiguation using Topic Models (Jan 2018)]. Their logistic normal topic model – a variant of Latent Dirichlet Allocation in which the topic proportions for a document were replaced by WordNet synsets
Source (sets of synonyms) – incorporated semantic information about synsets as its priors. They used a nonuniform prior for the synset distribution over words to model the frequency of words within a synset, and they modeled the relationships between synsets by using a logisticnormal prior for drawing the synset proportions of the document. This made their model similar to the correlated topic model, with the difference that their priors were fixed, not learned. In particular, the values of those priors were determined using the knowledge from WordNet. They illustrated the benefit of using the whole document as the context for disambiguation. The proposed model, WSDTM, outperformed state of the art knowledgebased WSD systems.
[Image source. Click image to open in new window.]
Excerpted from the paper:
Consider an excerpt from the SensEval 2 dataset shown in Figure 3
Source:"Medical scientists are starting to uncover a handful of genes which if damaged unleash the chaotic growth of cells that characterizes cancer. Scientists say the discovery of these genes in recent months is painting a new and startling picture of how cancer develops. An emerging understanding of the genes is expected to produce an array of new strategies for future cancer treatment and prevention. That is for the future. Already scientists are developing tests based on the newly identified genes that for the first time can predict whether an otherwise healthy individual is likely to get cancer. "It's a superexciting set of discoveries," says Bert Vogelstein, a Johns Hopkins University researcher."
Highlighted [bolded] words clearly indicate that the domain of the document is biology. While most of these words are monosemous, let’s consider disambiguating the word “cell”, which is highly polysemous, having 7 possible [WordNet] senses as shown in Figure 5
Source.S: (n) cell (any small compartment) “the cells of a honeycomb”
S: (n) cell ((biology) the basic structural and functional unit of all organisms; they may exist as independent units of life (as in monads) or may form colonies or tissues as in higher plants and animals)
S: (n) cell, electric cell (a device that delivers an electric current as the result of a chemical reaction)
S: (n) cell, cadre (a small unit serving as part of or as the nucleus of a larger political movement)
S: (n) cellular telephone, cellular phone, cellphone, cell, mobile phone (a handheld mobile radiotelephone for use in an area divided into small sections, each with its own shortrange transmitter/receiver)
S: (n) cell, cubicle (small room in which a monk or nun lives)
S: (n) cell, jail cell, prison cell (a room where a prisoner is kept)As shown in Table 3
Source, the correct sense of “cell” (“cell#2”) has the highest similarity with senses of three monosemous words “scientist”, “researcher” and “protein”. The word “cell” occurs 21 times in the document, and several times the other words in the sentence are not adequate to disambiguate it. Since our method uses the whole document as the context, words such as “scientists”, “researchers” and “protein” help in disambiguating “cell”, which is not possible otherwise.
 Mapping Text to Knowledge Graph Entities using MultiSense LSTMs [“Python code will be released at” code] also employed WordNet synsets
Source for word sense disambiguation (e.g., their Table 5
Source).
While representation learning on networks generates reliable results in regard to node embeddings, it is limited to homogeneous networks in which all nodes and edges are of the same type. Addressing this challenge, edge2vec: Learning Node Representation using Edge Semantics (Sep 2018) [code] proposed a model that incorporated edge semantics to represent different edgetypes in heterogeneous networks. An edge type transition matrix was optimized from an ExpectationMaximization framework as an extra criterion of a biased node random walk on networks, and a biased skipgram model was then leveraged to learn node embeddings based on the random walks. The novelty of their work was the generation of the edge type transition matrix so that during the process to generate the node random walk corpus, the heterogeneity of the network was considered. edge2vec was validated and evaluated using three medical domain problems on an ensemble of complex medical networks (>10 node and edge types): medical entity classification, compoundgene binding prediction, and medical information search costs. By considering edge semantics, edge2vec significantly outperformed other state of the art models on all three tasks [note that in their tables, edge2vec is listed as heterogeneous node2vec (Hnode2vec)].
[Image source. Click image to open in new window.]Deep Graphs (Jun 2018) , which applied recurrent neural networks (RNN) to graphs and networks, relies on the notion that many graph algorithms can be expressed as iterative vertex updates, where the features of each vertex (or edge) are computed as a function of the neighboring vertex and edge features. Therefore, training and deployment are both efficient, since the cost is $\small \theta(\vert E \vert + \vert V \vert)$, where $\small \vert E \vert$ and $\small \vert V \vert$ are the sets of edges and vertices respectively. The vertex features can be used to perform multiple tasks simultaneously, e.g. in a multitask learning setting, and the proposed algorithm is able to learn the functions that generate these features and the functions that perform these tasks, jointly. Furthermore, it is able to learn representations for graph vertices and edges that can, subsequently, be used by other algorithms to perform new tasks, such as in transfer learning. Finally, apart from the graph structure, this algorithm is also able to use attributes of the vertices and edges of the graph, rendering it even more powerful in cases where such attributes are available.
The authors claimed that Deep Graphs was able to outperform all competing methods for vertex classification tasks by a significant amount (they achieved an AUC of 0.98 while the best competing method achieved 0.67). Furthermore, Deep Graphs could learn to compute PageRank and HITS scores by only applying a vertex update equation 6 times, as opposed to hundreds or thousands of iterations which the original algorithms require. Fig. 1 in that paper shows the node, edge and embedded features that may be be computed/inferred.
[Image source. Click image to open in new window.]The use of constraint graphs for finding optimal Bayesian networks (graphs) is an intuitive method for incorporating rich prior knowledge into the structure learning tasks (Finding the Optimal Bayesian Network Given a Constraint Graph (Jul 2017) [code]. The accompanying Jupyter notebook by these authors provides toy examples, including a threelayer network with:
 health symptoms (low energy; bloating; loss of appetite; vomiting; abdominal cramps) on the bottom layer;
 diseases (ovarian cancer; lactose intolerance; pregnancy) on the middle layer; and,
 genetic tests on the top layer
for three different genetic mutations (BRCA1, BRCA2, and LCT). The edges in this graph were constrained such that symptoms were explained by diseases, and diseases could be partially explained by genetic mutations. There were no edges from diseases to genetic conditions, and no edges from genetic conditions to symptoms. The authors defined a constraint graph which comprised three nodes; “symptoms”, “diseases”, and “genetic mutations”. There was a directed edge from genetic mutations to diseases, and a directed edge from diseases to symptoms. This specified that genetic mutations could be parents to diseases, and diseases to symptoms. Priors were simply programmed Python expressions from the constraint graph, and a resultant learned Bayesian network was returned. Relevant to this REVIEW, this approach could enable very interesting explorations of the KG, for example the discovery of latent/learned embeddings.
[Image source. Click image to open in new window.]While RNN are a cornerstone in learning latent representations from long text sequences, a purely convolutional and deconvolutional autoencoding framework may be employed, as described in Deconvolutional Paragraph Representation Learning (Sep 2017) [code here and here]. That paper addressed the issue that the quality of sentences during RNNbased decoding (reconstruction) decreased with the length of the text. Compared to RNN, their convolutionaldeconvolutional autoencoder framework (CNNDCNN ) was better at reconstructing and correcting long paragraphs (note Table 1 in their paper, showing paragraphs reconstructed by LSTM and CNN, as well as the vastly superior BLEU / ROUGE scores in Table 2). There is also additional NLPrelated LSTM vs. CNN discussion in this Hacker News thread.
[Image source. Click image to open in new window.]Deep Graph Infomax (Sep 2018) [code] by Yoshua Bengio and colleagues presented Deep Graph Infomax (DGI ), a general approach to learning node representations within graphstructured data in an unsupervised manner. DGI relied on maximizing mutual information between patch representations and corresponding highlevel summaries of graphs – both derived using established graph convolutional network (GCN) architectures. Through the learned patch representations, they were able to obtain node embeddings representative of the global structural properties of the graph. Summarized subgraphs centered around nodes of interest could be reused for downstream nodewise learning tasks. In contrast to most prior approaches to graph representation learning, DGI did not rely on random walks, and was readily applicable to both transductive and inductive learning. This enabled competitive performance across a variety of both transductive and inductive node classification tasks, at times even outperforming relevant supervised architectures.
[Image source. Click image to open in new window.]
Followon (Hjelm and colleagues):

SpatioTemporal Deep Graph Infomax (Apr 2019). “Spatiotemporal graphs such as traffic networks or gene regulatory systems present challenges for the existing deep learning methods due to the complexity of structural changes over time. To address these issues, we introduce SpatioTemporal Deep Graph Infomax (STDGI) – a fully unsupervised node representation learning approach based on mutual information maximization that exploits both the temporal and spatial dynamics of the graph. Our model tackles the challenging task of nodelevel regression by training embeddings to maximize the mutual information between patches of the graph, at any given time step, and between features of the central nodes of patches, in the future. We demonstrate through experiments and qualitative studies that the learned representations can successfully encode relevant information about the input graph and improve the predictive performance of spatiotemporal autoregressive forecasting models.”
[Image source. Click image to open in new window.]


See also:
 LargeScale Learnable Graph Convolutional Networks (Aug 2018).
 Inductive Representation Learning on Large Graphs (Jun 2017, updated Sep 2018)
Representing a graph as a vector is a challenging task; ideally, the representation should be easily computable and conducive to efficient comparisons among graphs, tailored to the particular data and analytical task at hand. SGR: SelfSupervised Spectral Graph Representation Learning (Nov 2018) developed SGR, the first method for learning graph representations in a selfsupervised manner. Grounded on spectral graph analysis, SGR facilitated selfsupervised representation learning across a variety of application domains, and performed competitively to state of the art methods without retraining.
[Image source. Click image to open in new window.]In Towards the Latent Transcriptome (Oct 2018) Yoshua Bengio and colleagues employed representation to learn continuous embeddings for kmers from raw RNAseq data, in a referencefree fashion. Their model captured information both of DNA sequence similarity, as well as DNA sequence abundance in the embedding latent space. They confirmed the quality of those vectors by comparing them to known gene substructures, finding that the latent space recovered exon information from the raw RNASeq data from acute myeloid leukemia patients. Furthermore they showed that this latent space allowed the detection of genomic abnormalities such as translocations as well as patientspecific mutations, making this representation space useful for visualization as well as analysis.
[Image source. Click image to open in new window.]Learning Vertex Representations for Bipartite Networks (Jan 2019) [code] addressed the research gap of learning vertex representations for bipartite networks. They presented BiNE (Bipartite Network Embedding), which accounted for two special properties of bipartite networks: longtail distribution of vertex degrees, and implicit connectivity relations between vertices of the same type. They designed a biased random walk generator to generate vertex sequences that preserved the longtail distribution of vertices; proposed a new optimization framework by simultaneously modeling the explicit relations (i.e., observed links) and implicit relations (i.e., unobserved but transitive links); and explored the theoretical foundations of BiNE to shed light on how it works, proving that BiNE could be interpreted as factorizing multiple matrices.
[Image source. Click image to open in new window.]Analyzing and Improving Representations with the Soft Nearest Neighbor Loss (Geoffrey Hinton  Google Brain: Feb 2019) [discussion. “We explore and expand the Soft Nearest Neighbor Loss to measure the entanglement of class manifolds in representation space: i.e., how close pairs of points from the same class are relative to pairs of points from different classes. … Surprisingly, we find that maximizing the entanglement of representations of different classes in the hidden layers is beneficial for discrimination in the final layer, possibly because it encourages representations to identify classindependent similarity structures. …”

“In this paper, we use the labels of the data points to illuminate the class similarity structure of the internal representations learned by discriminative training.”

“Because entangled representations are encouraged to spread out data further inactivation space (see Figure 2), they represent outlier data more consistently apart from real data (see Figure 22). This in turn means outlier data is easily rejected by observing that it is supported by fewer neighbors from the predicted class, as captured by our improved uncertainty estimates.”
A Theoretical Analysis of Contrastive Unsupervised Representation Learning (Mar 2019) [authors’ discussion: Contrastive Unsupervised Learning of Semantic Representations: A Theoretical Framework (local copy)]

“Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the wellknown word2vec embedding algorithm: leveraging availability of pairs of semantically ‘similar’ data points and ‘negative samples,’ the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task that is comprised of a subset of the same set of latent classes. Our generalization bound also shows that learned representations can reduce (labeled) sample complexity on downstream tasks. We conduct controlled experiments in both the text and image domains to support the theory.”

“Semantic representations (aka semantic embeddings) of complicated data types (e.g. images, text, video) have become central in machine learning, and also crop up in machine translation, language models, GANs, domain transfer, etc. These involve learning a representation function $\small f$ such that for any data point $\small x$ its representation $\small f(x)$ is ‘high level’ (retains semantic information while discarding low level details, such as color of individual pixels in an image) and ‘compact’ (low dimensional). The test of a good representation is that it should greatly simplify solving new classification tasks, by allowing them to be solved via linear classifiers (or other lowcomplexity classifiers) using small amounts of labeled data. …”
[Image source. Click image to open in new window.]
Additional Reading

NeurIPS 2018 Workshop: Relational Representation Learning; accepted papers

TNE: A Latent Model for Representation Learning on Networks (Oct 2018) [project/code]
“Network representation learning (NRL) methods aim to map each vertex into a low dimensional space by preserving the local and global structure of a given network, and in recent years they have received a significant attention thanks to their success in several challenging problems. Although various approaches have been proposed to compute node embeddings, many successful methods benefit from random walks in order to transform a given network into a collection of sequences of nodes and then they target to learn the representation of nodes by predicting the context of each vertex within the sequence. In this paper, we introduce a general framework to enhance the embeddings of nodes acquired by means of the random walkbased approaches. Similar to the notion of topical word embeddings in NLP, the proposed method assigns each vertex to a topic with the favor of various statistical models and community detection methods, and then generates the enhanced community representations. We evaluate our method on two downstream tasks: node classification and link prediction. The experimental results demonstrate that the incorporation of vertex and topic embeddings outperform widelyknown baseline NRL methods.”
[Image source Click image to open in new window.] More on Zachary's Karate Club network.
[Image source. Click image to open in new window.] 
Node Representation Learning for Directed Graphs (Oct 2018) proposed a novel approach for learning node representations in directed graphs, which maintained separate views or embedding spaces for the two distinct node roles induced by the directionality of the edges. In order to achieve this, they proposed a novel alternating random walk strategy to generate training samples from the directed graph while preserving the role information. These samples were then trained using SkipGram with Negative Sampling (SGNS) with nodes retaining their source/target semantics. They conducted extensive experimental evaluation to showcase their effectiveness on several realworld datasets on link prediction, multilabel classification and graph reconstruction tasks, showing that the embeddings were robust, generalizable and well performing across multiple kinds of tasks and networks – consistently outperforming all randomwalk based neural embedding methods for link prediction and graph reconstruction tasks. In addition to providing a theoretical interpretation of their method they also showed that they were considerably more robust than other directed graph approaches.

“We propose a novel scalable approach for learning Node Embeddings Respecting Directionality (*NERD for directed and (un)weighted graphs. NERD aims at learning representations that maximize the likelihood of preserving node neighborhoods. But unlike the previous methods, it identifies the existence of two different types of node neighborhoods; one in its source role and the other in its target role. We propose a novel alternating random walk strategy to sample such node neighborhoods while preserving their respective role information.”

“We presented a novel approach, NERD, for embedding directed graphs while preserving the role semantics of the nodes. We propose an alternating random walk strategy to sample node neighborhoods from a directed graph. The runtime and space complexities of NERD are both linear in the input size, which makes it suitable for large scale directed graphs. In addition to providing advantages of using two embedding representations of nodes in a directed graph, we revisit the evaluation strategies that have been used in the previous works while evaluating directed networks. To this extent, we chart out a clear evaluation strategy for link prediction and graph reconstruction tasks. We show that the embeddings from NERD are indeed robust, generalizable and well performing across multiple types of tasks and networks. …”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.] 

Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation (Oct 2018) [code]
It is widely believed that learning good representations is one of the main reasons for the success of deep neural networks. Although highly intuitive, there is a lack of theory and systematic approach quantitatively characterizing what representations do deep neural networks learn. In this paper the authors moved toward a theory and a better understanding of the representations. Specifically, they studied a simpler problem: “How similar are the representations learned by two networks with identical architecture but trained from different initializations?” – developing a rigorous theory based on the neuron activation subspace match model. The theory gave a complete characterization of the structure of neuron activation subspace matches, where the core concepts were “maximum match” and “simple match,” which respectively describe the overall and the finest similarity between sets of neurons in two networks. They also proposed efficient algorithms to find the maximum match and simple matches. Surprisingly, experimental results on their algorithms suggested that representations learned by the same convolutional layers of networks trained from different initializations were not as similar as prevalently expected, at least in terms of subspace match.

Representation Learning by Reconstructing Neighborhoods (Nov 2018) [code]
“Since its introduction, unsupervised representation learning has attracted a lot of attention from the research community, as it is demonstrated to be highly effective and easytoapply in tasks such as dimension reduction, clustering, visualization, information retrieval, and semisupervised learning. In this work, we propose a novel unsupervised representation learning framework called neighborencoder, in which domain knowledge can be easily incorporated into the learning process without modifying the general encoderdecoder architecture of the classic autoencoder. In contrast to autoencoder, which reconstructs the input data itself, neighborencoder reconstructs the input data’s neighbors. As the proposed representation learning problem is essentially a neighbor reconstruction problem, domain knowledge can be easily incorporated in the form of an appropriate definition of similarity between objects. Based on that observation, our framework can leverage any offtheshelf similarity search algorithms or side information to find the neighbor of an input object. Applications of other algorithms (e.g., association rule mining) in our framework are also possible, given that the appropriate definition of neighbor can vary in different contexts. We have demonstrated the effectiveness of our framework in many diverse domains, including images, text, and time series, and for various data mining tasks including classification, clustering, and visualization. Experimental results show that neighborencoder not only outperforms autoencoder in most of the scenarios we consider, but also achieves the stateoftheart performance on text document clustering.”
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.] 
Recent Advances in AutoencoderBased Representation Learning (Dec 2018) [ETH Zurich  Google AI  Google Brain]
“Learning useful representations with little or no supervision is a key challenge in artificial intelligence. We provide an indepth review of recent advances in representation learning with a focus on autoencoderbased models. To organize these results we make use of metapriors believed useful for downstream tasks, such as disentanglement and hierarchical organization of features. In particular, we uncover three main mechanisms to enforce such properties, namely (i) regularizing the (approximate or aggregate) posterior distribution, (ii) factorizing the encoding and decoding distribution, or (iii) introducing a structured prior distribution. While there are some promising results, implicit or explicit supervision remains a key enabler and all current methods use strong inductive biases and modeling assumptions. Finally, we provide an analysis of autoencoderbased representation learning through the lens of ratedistortion theory and identify a clear tradeoff between the amount of prior knowledge available about the downstream tasks, and how useful the representation is for this task.”
“Conclusions. Learning useful representations with little or no supervision is a key challenge towards applying artificial intelligence to the vast amounts of unlabelled data collected in the world. We provide an indepth review of recent advances in representation learning with a focus on autoencoderbased models. …”
[Image source. Click image to open in new window.]
[Image source. MMD: (p. 5) Maximum mean discrepancy. Click image to open in new window.] 
Representation Learning for Spatial Graphs (Dec 2018)
“Recently, the topic of graph representation learning has received plenty of attention. Existing approaches usually focus on structural properties only and thus they are not sufficient for those spatial graphs where the nodes are associated with some spatial information. In this paper, we present the first deep learning approach called s2vec for learning spatial graph representations, which is based on denoising autoencoders framework. We evaluate the learned representations on real datasets and the results verified the effectiveness of s2vec when used for spatial clustering.”
[Image source. Click image to open in new window.] 
Knowledge Representation Learning: A Quantitative Review (Dec 2018).

A Survey on Graph Kernels (Mar 2019)
 “Graph kernels have become an established and widelyused technique for solving classification tasks on graphs. This survey gives a comprehensive overview of techniques for kernelbased graph classification developed in the past 15 years. … In an extensive experimental evaluation, we study the classification accuracy of a large suite of graph kernels on established benchmarks as well as new datasets. We compare the performance of popular kernels with several baseline methods and study the effect of applying a Gaussian RBF kernel to the metric induced by a graph kernel. In doing so, we find that simple baselines become competitive after this transformation on some datasets. Moreover, we study the extent to which existing graph kernels agree in their predictions (and prediction errors) and obtain a datadriven categorization of kernels as result. Finally, based on our experimental results, we derive a practitioner’s guide to kernelbased graph classification.”
Hyperbolic Embeddings
The slides in Hyperbolic Geometry, Möbius Transformations, and Geometric Optimization [local copy] provide a good overview of hyperbolic spaces, etc. Interestingly, compare slides 5 and 6 (an Escher woodcut viewed from the outside and the “inside”, respectively), which is reminiscent of an internal view (structure) of a highly interconnected knowledge graph.
[Image source. Click image to open in new window.]
The introductory and review materials in Gradient Descent in Hyperbolic Space also provide some understanding of the hyperboloid model of hyperbolic space, and how to compute distances between points.

For additional background, see these dated but useful resources:
A hyperbolic space has the property that powerlaw degree distributions, strong clustering and hierarchical community structure emerge naturally when random graphs are embedded in hyperbolic space. It is therefore logical to exploit the structure of the hyperbolic space for useful embeddings of complex network. Neural Embeddings of Graphs in Hyperbolic Space (May 2017) [code] proposed learning neural embeddings of graphs in hyperbolic space, providing experimental evidence that embedding graphs in their natural geometry significantly improved performance on downstream tasks for several realworld public datasets.
[Image source. Click image to open in new window.]
 More on Zachary's Karate Club network.
Embeddings of treelike graphs in hyperbolic space were recently shown to surpass their Euclidean in performance by a large margin. Inspired by those results, Skipgram Word Embeddings in Hyperbolic Space (Aug 2018) [code] presented an algorithm (Minkowski) for learning word embeddings in hyperbolic space from free text. Minkowski was based on code from fastText. Each word in the vocabulary was represented by a point on the hyperboloid model in Minkowski space. The embeddings were then optimized by negative sampling to minimize the hyperbolic distance of cooccurring words. An objective function based on the hyperbolic distance was derived and included in the skipgram architecture from word2vec. The hyperbolic word embeddings were then evaluated on word similarity and analogy benchmarks. The results demonstrated the potential of hyperbolic word embeddings, particularly in low dimensions, although without clear superiority over their Euclidean counterparts.
[Image source. Click image to open in new window.]
In mathematical physics, Minkowski space (or Minkowski spacetime) is a combining of threedimensional Euclidean space and time into a fourdimensional manifold where the spacetime interval between any two events is independent of the inertial frame of reference in which they are recorded.
“Note that the resulting vectors can not be treated as vectors in Euclidean space. In particular, you can not form document vectors by averaging, and the similarity of two vectors needs to be measured using the hyperbolic distance. Furthermore, these vectors can not be used in downstream tasks that are not specifically designed to work with the hyperboloid model of hyperbolic space.” [GitHub]
“To understand the hyperboloid model of hyperbolic space and how to compute distances between points, please see Gradient Descent in Hyperbolic Space.”
Maximilian Nickel and Douwe Kiela at Facebook AI Research recently introduced a new approach to learning hierarchical representations of symbolic data by embedding them into hyperbolic space (more precisely, into an $\small n$dimensional Poincaré ball), allowing them to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. Described in Poincaré Embeddings for Learning Hierarchical Representations (May 2107) [code nonauthor code; project], Poincaré embeddings significantly outperformed Euclidean embeddings on data with latent hierarchies in terms of representation capacity and generalization ability.
[Image source. Click image to open in new window.]Experiments showed that Poincaré embeddings provided important advantages over Euclidean embeddings on hierarchical data:
 Poincaré embeddings enabled very parsimonious representations that enabled the learning of highquality embeddings of largescale taxonomies.
 Excellent link prediction results indicated that hyperbolic geometry could introduce an important structural bias for the embedding of complex symbolic data.
 State of the art results for predicting lexical entailment suggested that the hierarchy in the embedding space corresponded well to the underlying semantics of the data.
 Aside: in Complex and Holographic Embeddings of Knowledge Graphs: A Comparison, Théo Trouillon and Maximilian Nickel provided a comparison of two state of the art knowledge graph embeddings: ComplEx and HolE. HolE was introduced in Holographic Embeddings of Knowledge Graphs.
Learning Continuous Hierarchies in the Lorentz Model of Hyperbolic Geometry (Maximilian Nickel & Douwe Kiela, Jul 2018) [code]. “We are concerned with the discovery of hierarchical relationships from largescale unstructured similarity scores. For this purpose, we study different models of hyperbolic space and find that learning embeddings in the Lorentz model is substantially more efficient than in the Poincaréball model. We show that the proposed approach allows us to learn highquality embeddings of large taxonomies which yield improvements over Poincaré embeddings, especially in low dimensions. Lastly, we apply our model to discover hierarchies in two realworld datasets: we show that an embedding in hyperbolic space can reveal important aspects of a company’s organizational structure as well as reveal historical relationships between language families.”
[Image source. Click image to open in new window.]Hyperbolic embeddings seek to embed structured, discrete objects such as knowledge graphs into a continuous representation that can be used with modern machine learning methods. Hyperbolic embeddings can preserve graph distances and complex relationships in very few dimensions, particularly for hierarchical graphs. Many graphs, such as complex networks including the Internet and social networks, are known to have hyperbolic structure and thus befit hyperbolic embeddings [see Representation Tradeoffs for Hyperbolic Embeddings (Apr 2018) and references therein]. For both NLP and graph based tasks, embeddings have been learned in highdimensional Euclidean spaces. However, recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but negatively curved, hyperbolic space.
Representation Tradeoffs for Hyperbolic Embeddings (Apr 2018) [project: provides an excellent summary of hyperbolic embeddings as well as entity embeddings with 64bit precision] found that – given a tree – they could give a combinatorial construction that embedded the tree in hyperbolic space with arbitrarily low distortion without using optimization. On WordNet, their combinatorial embedding obtained a meanaverageprecision of 0.989 with only two dimensions
Source, while Nickel et al.’s recent construction (Poincaré Embeddings for Learning Hierarchical Representation) obtained 0.87 using 200 dimensions
Source.
Based on information at Semantic Scholar and the figures in both papers, HyBed : Hyperbolic Neural Graph Embedding (2018) [OpenReview] is a followon article by the authors of Neural Embeddings of Graphs in Hyperbolic Space. The HyBed paper again posits that recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but a negatively curved hyperbolic space. They proposed learning neural embeddings of graphs in hyperbolic space, providing experimental evidence that hyperbolic embeddings significantly outperformed Euclidean embeddings on vertex classification tasks for several realworld public datasets.
[Image source. Click image to open in new window.]
Caveats:
This paper was (rejected for ICLR 2018) and improperly attributed at Semantic Scholar to Nicholas I. Kolkin (it does not appear among his few publications on Google Scholar, dblp, etc.).
From the Abstract, figures, GitHub repository, etc. this “paper” actually appears to be a minimally updated version of this earlier paper, Neural Embeddings of Graphs in Hyperbolic Space (discussed above), likely by those same authors.
Many wellestablished recommender systems are based on representation learning in Euclidean space. In these models, matching functions such as the Euclidean distance or inner product are typically used for computing similarity scores between user and item embeddings. In an extension of representation learning from the Euclidean to the hyperbolic domain, Hyperbolic Recommender Systems (Sep 2018) investigated the notion of learning user and item representations in hyperbolic space, arguing that hyperbolic space is more suitable for learning useritem embeddings in the recommendation domain. Unlike Euclidean spaces, hyperbolic spaces are intrinsically equipped to handle hierarchical structure, encouraged by its property of exponentially increasing distances away from origin. Their proposed system (HyperBPR) outperformed its Euclidean counterparts, achieving state of the art performance on multiple benchmark datasets.
[Image source. Click image to open in new window.]Biomedical association studies are increasingly done using clinical concepts, and in particular diagnostic codes from clinical data repositories as phenotypes. Clinical concepts can be represented in a meaningful, vector space using word embedding models. These embeddings allow for comparison between clinical concepts or for straightforward input to machine learning models. Using traditional approaches, good representations require high dimensionality, making downstream tasks such as visualization more difficult. Learning Contextual Hierarchical Structure of Medical Concepts with Poincairé Embeddings to Clarify Phenotypes (Nov 2018) [code here and here] applied Poincaré embeddings in a 2dimensional hyperbolic space to a largescale administrative claims database and show performance comparable to 100dimensional embeddings in a Euclidean space. They then examined disease relationships under different disease contexts to better understand potential phenotypes.
[Image source. Click image to open in new window.]Poincaré GloVe: Hyperbolic Word Embeddings (Oct 2018)
 “Words are not created equal. In fact, they form an aristocratic graph with a latent hierarchical structure that the next generation of unsupervised learned word embeddings should reveal. In this paper, driven by the notion of deltahyperbolicity or treelikeliness of a space, we propose to embed words in a Cartesian product of hyperbolic spaces which we theoretically connect with the Gaussian word embeddings and their Fisher distance. We adapt the wellknown GloVe algorithm to learn unsupervised word embeddings in this type of Riemannian manifolds. We explain how concepts from the Euclidean space such as parallel transport (used to solve analogy tasks) generalize to this new type of geometry. Moreover, we show that our embeddings exhibit hierarchical and hypernymy detection capabilities. We back up our findings with extensive experiments in which we outperform strong and popular baselines on the tasks of similarity, analogy and hypernymy detection.”
Additional Reading
See also the latter part of my blog post Holographic Universe, Quantum Knowledge Graph?**

Gradient descent in hyperbolic space (Aug 2018)
“Gradient descent generalises naturally to Riemannian manifolds, and to hyperbolic $\small n$space, in particular. Namely, having calculated the gradient at the point on the manifold representing the model parameters, the updated point is obtained by travelling along the geodesic passing in the direction of the gradient. Some recent works employing optimisation in hyperbolic space have not attempted this procedure, however, employing instead various approximations to avoid a calculation that was considered to be too complicated. In this tutorial, we demonstrate that in the hyperboloid model of hyperbolic space, the necessary calculations to perform gradient descent are in fact straightforward. The advantages of the approach are then both illustrated and quantified for the optimisation problem of computing the Fréchet mean (i.e. barycentre) of points in hyperbolic space.”

Hyperspherical Embedding of Graphs and Networks in Communicability Spaces [Oct 2014) [image 1
Source  image 2]
Source
"Let $\small G$ be a simple connected graph with $\small n$ nodes and let $\small f_{\alpha k} (A)$ be a communicability function of the adjacency matrix $\small A$ , which is expressible by the following Taylor series expansion: $\small \begin{align} \sum_{k=0}^{\infty} \alpha_k A^k\ \end{align}$. We prove here that if $\small f_{\alpha k} (A)$ is positive semidefinite then the function $\small \eta_{p,q} = (f_{\alpha k} (A)_{pp} + f_{\alpha k} (A)_{qq}  f_{\alpha k} (A)_{pq})^{\frac{1}{2}}$ is a Euclidean distance between the corresponding nodes of the graph. Then, we prove that if $\small f_{\alpha k} (A)$ is positive definite, the communicability distance induces an embedding of the graph into a hyperdimensional sphere (hypersphere) such as the distances between the nodes are given by $\small \eta_{p,q}$. ..."
"... We have proved here the existence of hyper spheres of radius $\small R$ in which the nodes of a network can be embedded, such that the distances between pairs of nodes are given by the communicability distance. ..."
 Followon (same author: Prof. Ernesto Estrada): Machine Learning Analysis of Complex Networks in Hyperspherical Space (2018). This is cool! (Radial histogram of the contribution of each of the 22 human diseases considered to the clusters of genes obtained in this work.) [image]
Source
 Followon (same author: Prof. Ernesto Estrada): Machine Learning Analysis of Complex Networks in Hyperspherical Space (2018). This is cool! (Radial histogram of the contribution of each of the 22 human diseases considered to the clusters of genes obtained in this work.) [image]

Hyperspherical Prototype Networks (Jan 2019) [OpenReview (earlier draft)] [Summary]This paper introduces hyperspherical prototype networks, which unify regression and classification by prototypes on hyperspherical output spaces. Rather than defining prototypes as the mean output vector over training examples per class, we propose hyperspheres as output spaces to define class prototypes a priori with large margin separation. By doing so, we do not require any prototype updating, we can handle any training size, and the output dimensionality is no longer constrained to the number of classes. Furthermore, hyperspherical prototype networks generalize to regression, by optimizing outputs as an interpolation between two prototypes on the hypersphere. Since both tasks are now defined by the same loss function, they can be jointly optimized for multitask problems. Experimental evaluation shows the benefits of hyperspherical prototype networks for classification, regression, and their combination.
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.] 
Hyperspherical Variational AutoEncoders (Thomas Kipf and colleagues: 2018) [code here (TensorFlow) and here (PyTorch); authors’ blog post] [image]
[From left to right, plots of learned latent space representations of the AutoEncoder, the NVAE, the NVAE with a scaled down KL divergence, and the SVAE. Image source. Click image to open in new window.]
Source 
Hyperbolic Disk Embeddings for Directed Acyclic Graphs (Feb 2019) [“GitHub URL will be here in camera ready version. For review, see the anonymized version in supplemental material.”]. “Obtaining continuous representations of structural data such as directed acyclic graphs (DAGs) has gained attention in machine learning and artificial intelligence. However, embedding complex DAGs in which both ancestors and descendants of nodes are exponentially increasing is difficult. Tackling in this problem, we develop Disk Embeddings, which is a framework for embedding DAGs into quasimetric spaces. Existing stateoftheart methods, Order Embeddings and Hyperbolic Entailment Cones, are instances of Disk Embedding in Euclidean space and spheres respectively. Furthermore, we propose a novel method Hyperbolic Disk Embeddings to handle exponential growth of relations. The results of our experiments show that our Disk Embedding models outperform existing methods especially in complex DAGs other than trees.”
Hyperbolic Image Embeddings (Apr 2019) [code]
“Computer vision tasks such as image classification, image retrieval and fewshot learning are currently dominated by Euclidean and spherical embeddings, so that the final decisions about class belongings or the degree of similarity are made using linear hyperplanes, Euclidean distances, or spherical geodesic distances (cosine similarity). In this work, we demonstrate that in many practical scenarios hyperbolic embeddings provide a better alternative.”
[Image source. Click image to open in new window.]
Graph Generation Through Generative Adversarial Networks
Related to graph representation learning (above) is the building of graphs through generative adversarial networks (GANs) or sequence models (e.g., RNN) [discussion]. Generative models for realworld graphs have important applications in many domains, including modeling physical and social interactions, discovering new chemical and molecular structures, and constructing knowledge graphs.
DeepMind’s Learning Deep Generative Models of Graphs (Mar 2018) introduced the DeepGMG approach to learning generative models over arbitrary graphs, which could capture both their structure and attributes. Their approach was capable of generating arbitrary graphs by sequentially adding nodes and edges), opening new directions for moving away from restrictions of vector and sequencelike knowledge representations, toward more expressive and flexible relational data structures. There are two major distinctions between the proposed graph model and typical sequence models like LSTMs: model architecture (this model is graphstructured with structured memory, while the LSTM stores information in flat vectors), and graph generating grammar (this model uses the generic graph generating decision sequence for generation, while the LSTM uses a linearization of the graphs, which typically encodes knowledge specific to a domain). For example, for molecules SMILES strings provide a way to linearize a molecules chemical structure (“graph”) into a string, addressing practical issues like the ordering of nodes. [The Simplified MolecularInput LineEntry System (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings.]
NetGAN: Generating Graphs via Random Walks (Jun 2018) [code] posed the problem of graph generation as learning the distribution of biased random walks over the input graph. The proposed model was based on a stochastic neural network that generated discrete output samples and was trained using the Wasserstein GAN objective. NetGAN was able to produce graphs that exhibited wellknown network patterns without explicitly specifying them in the model definition. At the same time, NetGAN exhibited strong generalization properties, highlighted by its competitive link prediction performance despite not being trained specifically for this task.
[Image source. Click image to open in new window.]NetGAN (above) focused on learning an implicit generative model from a single large graph (e.g. a large social network), whereas GraphRNN (below) learned from many small graphs, such as molecules. Since NetGAN had a single training graph, it formulated the problem as learning a distribution of random walks over the graph, which also solved scaling challenges. [Source: NetGAN author Aleksandar Bojchevski on reddit.]
Modeling and generating graphs is fundamental for studying networks in biology and other sciences; however, modeling complex distributions over graphs and then efficiently sampling from these distributions is challenging due to the nonunique, highdimensional nature of graphs and the complex, nonlocal dependencies that exist between edges in a given graph. GraphRNN: Generating Realistic Graphs with Deep Autoregressive Models (Jun 2018) [slides; code] described a deep autoregressive model that addressed those challenges and approximated any distribution of graphs with minimal assumptions about their structure. GraphRNN learned to generate graphs by training on a representative set of graphs and decomposes the graph generation process into an iterative sequence of node and edge formations, conditioned on the preceding graph structure.
[Image source. Click image to open in new window.]Subgraphs: HigherOrder Organization of Complex Networks
See also my Graph Signal Processing  Subgraph and Anomaly (Outlier) Detection subsection.
Analogous to clustering (dimensionality reduction) approaches applied to textual sources, content within knowledge graphs may also be classified and clustered. Many networks exhibit rich, lowerorder connectivity patterns that can be captured at the level of individual nodes and edges. HigherOrder Organization of Complex Networks (Benson, Gleich and Leskovec: 2016) developed a generalized framework for clustering networks on the basis of higherorder connectivity patterns, at the level of small network subgraphs. That project focused on finding higherorder organization of complex networks at the level of small network subgraphs (motifs), by clustering the nodes of a graph based on motifs instead of edges via a motifbased spectral clustering method. The approach is well explained and illustrated on their project page, for example using a bifan motif motif to find clusters in networks such as C. elegans (explained in more detail in their Fig. 2, below), or coherent feedforward loop motifs in the S. cerevisiae transcriptional regulation network.
 See also [same paper but additional graph signal processing related discussion].
 Science: Perspective  Response [notification of uncited, related earlier work]
 Slides here (local copy) and here (local copy)  Media
 See also *MotifNet*: A MotifBased Graph Convolutional Network for Directed Graphs (Feb 2018), a graph CNN for directed graphs that used convolutionlike anisotropic graph filters that were based on local subgraph structures (motifs).
Like the clustering of the content within knowledge graphs (above), portions of graphs may also be classified, as described below in MotifNet, a graph CNNbased approach. Broadly speaking, there are two classes of graph CNN formulations:

spatial approaches, which generalize subgraph regions by constructing a local system of weights on the graph), and

spectral approaches, which use the analogy between the eigenfunctions of the graph Laplacian and the classical Fourier transform, and define a convolutionlike operation in the spectral domain).
However, one of the key drawbacks of spectral CNNs is the explicit assumption of an undirected graph, leading to a symmetric Laplacian matrix with orthogonal eigendecomposition. MotifNet: A MotifBased Graph Convolutional Network for Directed Graphs (Feb 2018) introduced MotifNet, a graph CNN for directed graphs that used convolutionlike anisotropic graph filters that were based on local subgraph structures (motifs). An attention mechanism allowed MotifNet to generalize some standard graph CNN models without significantly increasing the model complexity. The approach was tested on Bojchevski and Günnemann’s (2017) directed CORA citation network, achieving superior classification accuracy compared to ChebNet.
 See also HigherOrder Organization of Complex Networks (Jul 2016), which developed a generalized framework for clustering networks on the basis of higherorder connectivity patterns, at the level of small network subgraphs. That project focused on finding higherorder organization of complex networks at the level of small network subgraphs (motifs), by clustering the nodes of a graph based on motifs instead of edges via a motifbased spectral clustering method.

ChebNet
[Image source. Click image to open in new window.]
Source refers to the Chebyshev polynomials [see also] described by Thomas Kipf and Max Welling in their 2017 paper, SemiSupervised Classification with Graph Convolutional Networks [discussed here].
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
Note that the 2018 MotifNet paper does not cite earlier work (different authors) – Motifbased Convolutional Neural Network on Graphs (Nov 2017; updated Feb 2018) – perhaps due to this statement in that 2017 paper: “We exclude spectral methods as they have been shown inferior to GCN.
Motifbased Convolutional Neural Network on Graphs (Nov 2017; updated 2018) [code] introduced a generalization of CNNs to graphs with irregular linkage structures, especially heterogeneous graphs with typed nodes and schemas. They proposed a novel spatial convolution operation to model the key properties of local connectivity and translation invariance, using highorder connection patterns or motifs (MotifCNN). Like the subsequent MotifNet paper, MotifCNN employed an attention model to combine the features extracted from multiple patterns, thus effectively capturing highorder structural and feature information. Experiments on semisupervised node classification on realworld social networks and multiple representative heterogeneous graph datasets indicated significant gains (621%) over existing graph CNNs and other state of the art techniques.
[Image source. Click image to open in new window.]For more background on graphs and convolutions, spectral approaches and the graph Laplacian, etc. see my Graph signal processing page.
Higherorder connectivity patterns such as small induced subgraphs called graphlets (network motifs) are vital to understand the important components (modules/functional units) governing the configuration and behavior of complex networks. Existing work in higherorder clustering has focused on simple homogeneous graphs with a single node/edge type. However, heterogeneous graphs consisting of nodes and edges of different types are seemingly ubiquitous in the realworld. Higherorder Spectral Clustering for Heterogeneous Graphs (Oct 2018) introduced the notion of typedgraphlet that explicitly captured the rich (typed) connectivity patterns in heterogeneous networks. Using typedgraphlets as a basis, we develop a general principled framework for higherorder clustering in heterogeneous networks. Experiments demonstrated the effectiveness of the framework quantitatively for three important applications including (i) clustering, (ii) link prediction, and (iii) graph compression. In particular, the approach achieved a mean improvement of 43x over all methods and graphs for clustering while achieving a 18.7% and 20.8% improvement for link prediction and graph compression, respectively.
[Image source. Click image to open in new window.] See also gl2vec: Learning Feature Representation Using Graphlets for Directed Networks (Dec 2018)
Additional Reading

Weisfeiler and Leman Go Neural: Higherorder Graph Neural Networks (Oct 2018) [code]
“In recent years, graph neural networks (GNN) have emerged as a powerful neural architecture to learn vector representations of nodes and graphs in a supervised, endtoend fashion. … The following work investigates GNN from a theoretical point of view and relates them to the 1dimensional WeisfeilerLeman graph isomorphism heuristic (1WL). We show that GNN have the same expressiveness as the 1 WL in terms of distinguishing nonisomorphic (sub)graphs. Hence, both algorithms also have the same shortcomings. Based on this, we propose a generalization of GNN, socalled kdimensional GNN (k GNN), which can take higherorder graph structures at multiple scales into account. These higherorder structures play an essential role in the characterization of social networks and molecule graphs. Our experimental evaluation confirms our theoretical findings as well as confirms that higherorder information is useful in the task of graph classification and regression.

Graph refinement, or the task of obtaining subgraphs of interest from overcomplete graphs, can have many varied applications. In Graph Refinement Based Tree Extraction Using MeanField Networks and Graph Neural Networks (Nov 2018) Selvan et al. [Thomas Kipf  Max Welling] extracted tree structures from image data by first deriving a graphbased representation of the volumetric data and then posing tree extraction as a graph refinement task. They presented two methods to perform graph refinement: (1) meanfield approximation (MFA), to approximate the posterior density over the subgraphs from which the optimal subgraph of interest can be estimated …; and (2) they presented a supervised learning approach using graph neural networks (GNNs), which can be seen as generalisations of mean field networks. Subgraphs were obtained by jointly training a GNN based encoderdecoder pair, wherein the encoder learned useful edge embeddings from which the edge probabilities were predicted using a simple decoder. They discussed connections between the two classes of methods, and compared them for the task of extracting airways from 3D, lowdose, chest CT data.
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]

gl2vec: Learning Feature Representation Using Graphlets for Directed Networks (Dec 2018)
“Learning network representations has a variety of applications, such as network classification. Most existing work in this area focuses on static undirected networks and do not account for presence of directed edges or temporarily changes. Furthermore, most work focuses on node representations that do poorly on tasks like network classification. In this paper, we propose a novel, flexible and scalable network embedding methodology, gl2vec, for network classification in both static and temporal directed networks. gl2vec constructs vectors for feature representation using static or temporal network graphlet distributions and a null model for comparing them against random graphs. We argue that gl2vec can be used to classify and compare networks of varying sizes and time period with high accuracy. We demonstrate the efficacy and usability of gl2vec over existing stateoftheart methods on network classification tasks such as network type classification and subgraph identification in several realworld static and temporal directed networks. Experimental results further show that gl2vec, concatenated with a wide range of stateoftheart methods, improves classification accuracy by up to 10% in realworld applications such as detecting departments for subgraphs in an email network or identifying mobile users given their app switching behaviors represented as static or temporal directed networks.”
[Image source. Click image to open in new window.]
See also:
 Higherorder Spectral Clustering for Heterogeneous Graphs (Oct 2018)
 Stochastic Graphlet Embedding (Feb 2017; updated Dec 2018)
 Hierarchical Stochastic Graphlet Embedding for Graphbased Pattern Recognition (Jul 2018)


Stochastic Graphlet Embedding (Feb 2017; updated Dec 2018) [code; published; discussion]
“Graphbased methods are known to be successful in many machine learning and pattern classification tasks. These methods consider semistructured data as graphs where nodes correspond to primitives (parts, interest points, segments, etc.) and edges characterize the relationships between these primitives. However, these nonvectorial graph data cannot be straightforwardly plugged into offtheshelf machine learning algorithms without a preliminary step of – explicit/implicit – graph vectorization and embedding. This embedding process should be resilient to intraclass graph variations while being highly discriminant. In this paper, we propose a novel highorder stochastic graphlet embedding (SGE) that maps graphs into vector spaces. Our main contribution includes a new stochastic search procedure that efficiently parses a given graph and extracts/samples unlimitedly highorder graphlets. We consider these graphlets, with increasing orders, to model local primitives as well as their increasingly complex interactions. In order to build our graph representation, we measure the distribution of these graphlets into a given graph, using particular hash functions that efficiently assign sampled graphlets into isomorphic sets with a very low probability of collision. When combined with maximum margin classifiers, these graphletbased representations have positive impact on the performance of pattern comparison and recognition as corroborated through extensive experiments using standard benchmark databases.”
[Image source. Click image to open in new window.]
 Hierarchical Stochastic Graphlet Embedding for Graphbased Pattern Recognition (Jul 2018) [discussion]
“Despite being very successful within the pattern recognition and machine learning community, graphbased methods are often unusable with many machine learning tools. This is because of the incompatibility of most of the mathematical operations in graph domain. Graph embedding has been proposed as a way to tackle these difficulties, which maps graphs to a vector space and makes the standard machine learning techniques applicable for them. However, it is well known that graph embedding techniques usually suffer from the loss of structural information. In this paper, given a graph, we consider its hierarchical structure for mapping it into a vector space. The hierarchical structure is constructed by topologically clustering the graph nodes, and considering each cluster as a node in the upper hierarchical level. Once this hierarchical structure of graph is constructed, we consider its various configurations of its parts, and use stochastic graphlet embedding (SGE) for mapping them into vector space. Broadly speaking, SGE produces a distribution of uniformly sampled low to high order graphlets as a way to embed graphs into the vector space. In what follows, the coarsetofine structure of a graph hierarchy and the statistics fetched through the distribution of low to high order stochastic graphlets complements each other and include important structural information with varied contexts. Altogether, these two techniques substantially cope with the usual information loss involved in graph embedding techniques, and it is not a surprise that we obtain more robust vector space embedding of graphs. This fact has been corroborated through a detailed experimental evaluation on various benchmark graph datasets, where we outperform the stateoftheart methods.”
 This is a followon paper (same authors) to Stochastic Graphlet Embedding.
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.]
[Image source. Click image to open in new window.] 
MixedOrder Spectral Clustering for Networks (Dec 2018)
“Clustering is fundamental for gaining insights from complex networks, and spectral clustering (SC) is a popular approach. Conventional SC focuses on secondorder structures (e.g., edges connecting two nodes) without direct consideration of higherorder structures (e.g., triangles and cliques). This has motivated SC extensions that directly consider higherorder structures. However, both approaches are limited to considering a single order. This paper proposes a new MixedOrder Spectral Clustering (MOSC) approach to model both secondorder and thirdorder structures simultaneously, with two MOSC methods developed based on Graph Laplacian (GL) and Random Walks (RW). MOSCGL combines edge and triangle adjacency matrices, with theoretical performance guarantee. MOSCRW combines firstorder and secondorder random walks for a probabilistic interpretation. We automatically determine the mixing parameter based on cut criteria or triangle density, and construct new structureaware error metrics for performance evaluation. Experiments on realworld networks show 1) the superior performance of two MOSC methods over existing SC methods, 2) the effectiveness of the mixing parameter determination strategy, and 3) insights offered by the structureaware error metrics.”
Entity Linking and Text Grounding
Mapping Text to Knowledge Graphs, and Vice Versa
[External Knowledge Lookup]
Entity Linking
It is very useful to be able to link entities in unstructured documents to the same entities in a knowledge graph (entity linking) – as well as the complementary task of linking entities in a KG to external textual sources (text grounding). For automated processing of scientific publications to extract useful information about molecular mechanisms it is critical that genes, proteins and other entities be correctly associated with uniform identifiers. Accurate entity linking/grounding is essential for resolving relationships among mined information, curated interaction databases, and biological datasets. Entity linking can help human end users navigate biomedical literature, and improve many other Natural Language Processing (NLP) tasks such as proteinprotein interaction event extraction.
Entity Linking for Biomedical Literature (May 2015) [code] proposed a novel unsupervised collective inference approach to linking entities from unstructured full texts of biomedical literature to 300 ontologies. For example, from the sentence

"In resting cells, p5065 heterodimers (referred herein as NFκB) are sequestered in the cytoplasm by association with members of another family of proteins called IκB."
an entity linking system will identify three prominent mentions

$\small \text{p5065 heterodimers}$,
$\small \text{NFκB}$, and
$\small \text{IκB}$
and in some knowledge base link the first two to

$\small \text{nuclear factor kappalightchainenhancer of activated B cells}$,
and the third to

$\small \text{nuclear factor of kappa light polypeptide gene enhancer in Bcells inhibitor}$.
Without using any manual annotation, their approach significantly outperformed the state of the art supervised entity linking method (9% absolute gain in linking accuracy).
[Image source. Click image to open in new window.]More recently, Unsupervised Medical Entity Recognition and Linking in Chinese Online Medical Text (Apr 2018) (note Figs. 1, 3) addressed issues of context and noise – which affect the accuracy of linking– via an approach that employs multiple entity knowledge (relevant to increased accuracy) and extracting the relevant context (to reduce noise).
[Image source. Click image to open in new window.]Odin, described in Description of the Odin Event Extraction Framework and Rule Language (Sep 2015) [project; code] and used by REACH, is a domainindependent platform for developing rulebased event extraction models for the detection of molecular interactions from research articles . Odin aims to be powerful (the rule language allows the modeling of complex syntactic structures) and robust (to recover from syntactic parsing errors, syntactic patterns can be freely mixed with surface, tokenbased patterns) while remaining simple (some domain grammars can be up and running in minutes), and fast (Odin processes over 100 sentences/second in a realworld domain with over 200 rules). Examples in the biomedical domain are provided; see also their followon paper, Odin’s Runes: A Rule Language for Information Extraction (2016).
REACH, described in Largescale Automated Machine Reading Discovers New Cancer DrivingMechanisms (2017) [code] by the Computational Language Understanding Lab (CLU Lab) at the University of Arizona, is a biomedical information extraction system which aims to read the scientific literature and extract cancer signaling pathways. REACH implements a fairly complete extraction pipeline, including recognition of biochemical entities (proteins, chemicals, etc.), grounding them to known knowledge bases such as UniProt, extraction of BioPAXlike interactions, e.g., phosphorylation, complex assembly, positive/negative regulations, and coreference resolution, for both entities and interactions. REACH was developed using CLU Lab’s opendomain information extraction framework, Odin, which is released within their processors repository on GitHub.
[Image source. Click image to open in new window.]REACH was evaluated in FamPlex: A Resource for Entity Recognition and Relationship Resolution of Human Protein Families and Complexes in Biomedical Text Mining (Jun 2018) [code], which stated:
 “In a task involving automated reading of ∼215,000 articles using the REACH event extraction software we found that grounding was disproportionately inaccurate for multiprotein families (e.g., AKT) and complexes with multiple subunits (e.g.NFκB). To address this problem we constructed FamPlex, a manually curated resource defining protein families and complexes as they are commonly encountered in biomedical text. In FamPlex the genelevel constituents of families and complexes are defined in a flexible format allowing for multilevel, hierarchical membership. To create FamPlex, text strings corresponding to entities were identified empirically from literature and linked manually to uniform identifiers; these identifiers were also mapped to equivalent entries in multiple related databases. FamPlex also includes curated prefix and suffix patterns that improve named entity recognition and event extraction. Evaluation of REACH extractions on a test corpus of ∼54,000 articles showed that FamPlex significantly increased grounding accuracy for families and complexes (from 15 to 71%).”
FamPlex: A Resource for Entity Recognition and Relationship Resolution of Human Protein Families and Complexes in Biomedical Text Mining (Jun 2018) [code] provided a collection of resources for grounding biological entities from text and describing their hierarchical relationships. FamPlex focused on protein families, complexes, and their lexical synonyms.
[Image source. Click image to open in new window.]Figure 3 above (from the FamPlex paper) demonstrates the relative benefits of using FamPlex, particularly for grounding multiprotein families and complexes with multiple subunits. Likewise, their Table 3, below, demonstrates bioentity grounding accuracies in REACH without and with FamPlex.
[Image source. Click image to open in new window.] “Use of FamPlex in text mining improves grounding and relationship resolution for protein families and complexes in two event extraction systems. Following the manual curation of FamPlex identifiers and associated synonyms and the integration of FamPlex into REACH and INDRA, we performed a second evaluation on a random sample of 300 named entities drawn from the results of processing the test corpus (Table 3). The frequency of entity types was comparable between the training and test samples, with proteins/genes and families/complexes accounting for roughly threequarters of all entities. Improvements in grounding were substantial for both classes, with grounding accuracy for families and complexes rising from 15 to 71% (Fig. 3b; Table 3). Grounding accuracy for proteins and genes increased from 79 to 90%, an improvement attributable to the curation of synonyms for frequently occurring proteins. With the incorporation of FamPlex, the overall percentage of unique entity strings grounded to protein family or complex identifiers doubled relative to the training corpus, with REACH grounding 2080 of 57,088 unique entities to a FamPlex, InterPro or Pfam identifier (3.6%).” [Source: p. 9, FamPlex: A Resource for Entity Recognition and Relationship Resolution of Human Protein Families and Complexes in Biomedical Text Mining.]
Strong Baselines for Simple Question Answering over Knowledge Graphs with and without Neural Networks (Jun 2018) [code] examined the problem of question answering over knowledge graphs, focusing on simple questions that can be answered by the lookup of a single fact. Adopting a straightforward decomposition of the problem into entity detection, entity linking, relation prediction, and evidence combination, they explored simple yet strong baselines. On the popular SimpleQuestions dataset, they found that basic LSTMs and GRUs plus a few heuristics yielded accuracies that approached the state of the art, and techniques that did not use neural networks also performed reasonably well. These results suggested that gains from sophisticated deep learning techniques proposed in the literature are quite modest and that some previous models exhibit unnecessary complexity.
[Image source. Click image to open in new window.]See also:

Simple Question Answering by Attentive Convolutional Neural Network (Oct 2016) focused on answering singlerelation factoid questions over a KG (Freebase). This task, simple question answering (SimpleQA), was addressed via a twostep pipeline: entity linking and fact selection.
[Image source. Click image to open in new window.] 
Variational Reasoning for Question Answering with Knowledge Graph provided a unified deep learning architecture and an endtoend variational learning algorithm that could simultaneously handle noise in questions, while learning multihop reasoning.

EARL: Joint Entity and Relation Linking for Question Answering Over Knowledge Graphs (Jun 2018), which performed entity linking and relation linking as a joint task.
Entity linking repositories on GitHub include:

diseasenormalization: biomedical entity linking (normalization) project;

BioXrefEntityLinker: automatic external reference linking of biomedical entities from information extraction events;

wikilinks/nel: fast, accurate and highly modular framework for linking entities in documents (Python) [docs]; and,

berkeleyentity: Berkeley Entity Resolution System; jointly solves the problems of named entity recognition, coreference resolution, and entity linking via a featurerich discriminative model.
Text Grounding
In NLP, the ability to associate text with an entity of a knowledge graph benefits tasks such as question answering and information retrieval, as well as any task that involves some form of mapping a definition to a term. Further, matching text to graphs can be invaluable in providing solutions to domainspecific challenges, for example medical concept normalisation (disambiguation) and identification of adverse drug reactions.
Augmenting KG with external resources such as textual knowledge stores and ontologies is an obvious approach to providing explanatory material to KG. This is particularly important when machine learning approaches are employed in the biomedical and clinical sciences domains, where high precision is imperative for supporting, not distracting practitioners (What Do We Need to Build Explainable AI Systems for the Medical Domain?), and it is crucial to underpin machine output with reasons that are verifiable by humans.
Mapping Text to Knowledge Graph Entities using MultiSense LSTMs (Aug 2018) [GitHub) “placeholder”  link on GitHub to code on Bitbucket  code (Bitbucket)] addressed the problem of mapping natural language text to knowledge base entities. This paper detailed a model for efficiently mapping unrestricted text at the level of phrases and sentences to the entities of a knowledge base (KB) – a task also referred to as text grounding or normalisation. The model aimed at characterising short focused texts, such as definitions or tweets; for example, given a biomedical KB a tweet of the form “Can’t sleep, too tired to think straight” would be mapped to the entity “Insomnia.” The process of texttoentity mapping was treated as a transformation from a textual vector space, to a KB vector space created from a graph and populated by vectors representing entities. The mapping process was approached as a composition of a phrase or a sentence into a point in a multidimensional entity space obtained from a knowledge graph. The compositional model was an LSTM equipped with a dynamic disambiguation mechanism on the input word embeddings (a MultiSense LSTM or MSLSTM), addressing polysemy issues. Further, the knowledge base space was prepared by collecting random walks from a graph enhanced with textual features, which acted as a set of semantic bridges between text and knowledge base entities. The ideas of this work were demonstrated on largescale texttoentity mapping and entity classification tasks, with state of the art results.
[Image source. Click image to open in new window.]
 Ruslan Salakhutdinov and colleagues also used WordNet synsets
Source for word sense disambiguation (Knowledgebased Word Sense Disambiguation using Topic Models).
Augmenting EndtoEnd Dialog Systems with Commonsense Knowledge (Feb 2018) likewise integrated a large commonsense knowledge base (ConceptNet, in one example) into endtoend conversational models. Although very confusingly/poorly described, the basic approach was as follows.

Concepts (keywords) were derived from each message (inputted sentence) using simple ngram matching, where every ngram in the concept was considered as a potential concept. If the ngram was a key in a dictionary of the concepts derived from ConceptNet, the corresponding value or assertion, a (conceptrelationconcept) relation triple, was added to the assertions dataset as external knowledge.

The authors employed a TriLSTM encoder: LSTM were used to encode message, response and commonsense assertions, where LSTM weights for message and response were tied. A component of the TriLSTM encoder (Fig. 3) served as the memory module, encoding all commonsense assertions. Their experiments suggested that their knowledgeaugmented models were superior to their knowledgefree counterparts.
[Image source. Click image to open in new window.]Critique: this was a clever, but complicated, approach. It is mentioned here for its use of knowledge graphs as an external memory source in LSTMbased approaches. The approach was prone to noise due to uncertainties inherent in ConceptNet 5 where many of the assertions were considered “false,” “nonsense” or “vague” by human evaluators:
False: 34/535 (6.4%);Nonsense: 50/535 (9.4%); Vague: 15/535 (2.8%); Don’t Know: 19/535 (3.6%); Sometimes: 117/535 (21.9%); True: 300/535 (56.1%); Total: 535.
Source: Table 6.2 (p. 173) in Speer and Havasi (2012) “ConceptNet 5: A Large Semantic Network for Relational Knowledge” [Chapter 6 in Theory and Applications of Natural Language Processing (local copy)].
Also, this work is conceptually identical to Yoshua Bengio’s earlier and technically superior work (A Neural Knowledge Language Model [OpenReview], which is not cited. In that work, Bengio and colleagues proposed a neural knowledge language model which combined the symbolic knowledge provided by a knowledge graph with the expressive power of RNN language models. Like the “assertions” in Augmenting EndtoEnd Dialog Systems with Commonsense Knowledge, in Bengio’s work facts were represented as (subject, relationship, object) triples, a global vocabulary of frequent terms was maintained, and string matching was employed to align words in the topic description with their corresponding facts in the topic knowledge. The input consisting of a word and a fact went into LSTM; the LSTM output together with the knowledge context generated a fact key. Using the fact key, the fact embedding was retrieved from the topic knowledge memory.
Microsoft Research Asia also recently used ConceptNet as an external knowledge base, as described in Improving Question Answering by CommonsenseBased PreTraining (Oct 2018). They provided a simple and effective method that leveraged an external commonsense knowledge base (ConceptNet) to answer questions that required commonsense knowledge. . They pretrained direct and indirect relational functions between concepts, showing that those pretrained functions could be easily added to existing neural networks. Each concept was represented based on its internal information (i.e. the words it contained) and external context (i.e. neighbors in the knowledge graph), as indicated in their Fig. 6. To verify the effectiveness of their approach, they used two multiplechoice question answering tasks that required commonsense reasoning. Given a question and optionally a supporting passage, both tasks were to predict the correct answer from a set of candidate answers. Results showed that incorporating commonsensebased functions improved the state of the art on two question answering tasks that required commonsense reasoning. Further analysis showed that their system discovered and leveraged useful evidence from an external commonsense knowledge base, which was missing existing neural network models, helping the model derive the correct answer.
[Image source. Click image to open in new window.] [Image source. Click image to open in new window.]Machine reading comprehension (MRC) requires reasoning about both the knowledge involved in a document and knowledge about the world. However, existing datasets are typically dominated by questions that can be well solved by context matching, which fail to test this capability. Contemporaneously published with Improving Question Answering by CommonsenseBased PreTraining (above), Microsoft Research Asia also recently published Knowledge Based Machine Reading Comprehension (Sep 2018), which addressed knowledgebased MRC and built a new dataset consisting of 40,047 questionanswer pairs. The annotation of this dataset (KBMRC) was designed so that successfully answering the questions required understanding and the knowledge included in a document. Their framework consisted of a question answering model and a question generation model, both of which took the knowledge extracted from the document as well as relevant facts from an external knowledge base such as Freebase, ProBase, Reverb, or NELL. Results showed that incorporating side information from the external KB improved the accuracy of the baseline question answer system (BiDAF), while error analysis showed the advantages and limitations of their approach.
[Image source. Click image to open in new window.]In an approach similar to Augmenting EndtoEnd Dialog Systems with Commonsense Knowledge, A Knowledge Enhanced Generative Conversational Service Agent (Sep 2017) described a system that consisted of a knowledge extractor, a knowledge encoder, and a knowledge enhanced seq2seq model including a query encoder and a decoder for response generation. The knowledge extractor was designed to obtain dialogrelated knowledge from the web in a given dialog history including the latest query. In that phase, keywords from the dialog history by a keyword extraction component (RAKE) were used to capture external related knowledge from the web (the topK most relevant results from/ranked by a search engine). A convolutional neural network (CNN) based knowledge encoder then extracted features of the obtained knowledge. In the knowledge enhanced seq2seq model, the encoder adopted a LSTM to project the dialog history and the user input into a realvalued vector, followed by the decoder which generated response based on the dialog history vector and the knowledge features.
[Image source. Click image to open in new window.]Critique: this is a moderately clever idea, albeit prone to noise (acknowledged in their Conclusions
Source), plus other deficiencies.
Most textual entailment models focus on lexical gaps between the premise text and the hypothesis, but rarely on knowledge gaps. Bridging Knowledge Gaps in Neural Entailment via Symbolic Models focused on filling these knowledge gaps in the Science Entailment task, by leveraging an external structured knowledge base (KB) of science facts. Their architecture (NSnet
Source) combined standard neural entailment models with a knowledge lookup module. To facilitate this lookup, they proposed a factlevel decomposition of the hypothesis, and verifying the resulting subfacts against both the textual premise and the structured KB. NSnet learned to aggregate predictions from these heterogeneous data formats. On the SciTail dataset, NSnet outperformed
Source a simpler combination of the two predictions by 3% and the base entailment model by 5%.
In web searches entityseeking queries often trigger a special question answering (QA) system. For example, the QA system may use a parser to reinterpret the question as a structured query, execute that query on a knowledge graph (KG), and return entity responses. However, QA systems based on precise parsing tend to be brittle: minor variations in the question may dramatically change the output. Moreover, KG coverage is patchy, and a large corpus may have supporting information – but in an unstructured, unreliable form.
Addressing those issues, Neural Architecture for Question Answering Using a Knowledge Graph and Web Corpus (Sep 2018) proposed a more robust approach that degraded gracefully on a “structure ramp.” Their system, AQQUCN, accepted a broad spectrum of queries ranging from wellformed questions to short keyword sequences. To address query ambiguity, AQQUCN aggregated signals from the KG and the corpora to directly rank KG entities (rather than committing to one semantic interpretation of the query). AQQUCN modeled the ideal interpretation as an unobservable or latent variable. Interpretations and candidate entity responses were scored as pairs by combining signals from multiple convolutional networks that operate collectively on the query, KG and corpus. On four public query workloads (over 8,000 queries in different query formats) they saw 516% absolute improvement in mean average precision (MAP) compared to the entity ranking performance of recent systems. Their system was also competitive at entity set retrieval, almost doubling $\small F_1$ scores for challenging short queries.
[Image source. Click image to open in new window.]
AQQUCN, by researchers at IIT Bombay, extends previous work, Aqqu [More Accurate Question Answering on Freebase (2015)] by another research group (at the University of Freiburg). Unlike Aqqu and other systems, the end goal of AQQUCN was not to “compile” the input into a structured query to execute on the KG (because broken/missing input syntax can make this attempt fail in brittle ways), but rather to directly create a ranking over entities using a KG and a corpus (using interpretations as latent variables).
 AQQUCN is (apparently) a portmanteau of Aqqu + CN (convolutional networks).
 “Our code with relevant data will be made available (CSAW, 2018).”
Side Information
In addition to relation instances, KBs often contain other relevant side information, such as aliases of relations (e.g., $\small \text{founded}$ and $\small \text{cofounded}$ are aliases for the relation $\small \text{founderOfCompany)}$. Relation extraction models usually ignore such readily available side information. RESIDE: Improving DistantlySupervised Neural Relation Extraction Using Side Information (Vashishth et al.: Dec 2018) [code] proposed RESIDE , a distantlysupervised neural relation extraction method which utilized additional side information from KBs for improved relation extraction. It used entity type and relation alias information for imposing soft constraints while predicting relations [e.g., the relation $\small \text{person/place of birth}$ can only occur between a $\small \text{person}$ and a $\small \text{location}$]. RESIDE employed graph convolution networks (GCNs) to encode syntactic information from text, and improved performance even when only limited amounts of side information were available.
[Image source. Click image to open in new window.] Overview. RESIDE first encodes each sentence in the bag by concatenating embeddings (denoted by $\small \oplus$) from a bidirectional GRU (BiGRU) and Syntactic GCN for each token, followed by word attention. Then, sentence embedding is concatenated with relation alias information, which comes from the Side Information Acquisition Section (Figure 2), before computing attention over sentences. Finally, bag representation with entity type information is fed to a softmax classifier. Please see Section 5 for more details.
The authors of RESIDE: Improving DistantlySupervised Neural Relation Extraction Using Side Information (Vashishth et al.: Dec 2018) also published the following paper (CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information (Feb 2019) – that although appears on arXiv two months later is cited in the former paper.

RESIDE, above, employed a bidirectional GRU and a graph convolution network (GCN) to encode syntactic information from text, followed by word attention. Then, sentence embedding was concatenated with relation alias information, which came from the side information, before computing attention over sentences. Finally, bag representation with entity type information was fed to a softmax classifier.

CESI
, below, first acquired side information for noun and relation phrases of Open KB triples. CESI then learned embeddings for those NP and relation phrases while utilizing the side information obtained in previous step. Lastly, CESI performed clustering over the learned embeddings, to canonicalize the NP and relation phrases. 
Although each approach performed well, unfortunately the two models were not compared.
Open Information Extraction (OpenIE) methods extract (noun phrase, relation phrase, noun phrase) triples from text, resulting in the construction of large Open Knowledge Bases (Open KBs). The noun phrases (NPs) and relation phrases in such Open KBs are not canonicalized, leading to the storage of redundant and ambiguous facts. CESI: Canonicalizing Open Knowledge Bases using Embeddings and Side Information (Vashishth et al.: Feb 2019) [code] proposed Canonicalization using Embeddings and Side Information (CESI) – a novel approach which performed canonicalization over learned embeddings of Open KBs. *CESI * extended recent advances in KB embedding by incorporating relevant NP and relation phrase side information in a principled manner.
 Overview of CESI. CESI first acquires side information of noun and relation phrases of Open KB triples. In the second step, it learns embeddings of these NPs and relation phrases while utilizing the side information obtained in previous step. In the third step, CESI performs clustering over the learned embeddings to canonicalize NP and relation phrases. See Section 3 for more details.
Additionally:
Global Encoding for Abstractive Summarization (Jun 2018) developed a model with an encoder that employed a BiLSTM decoder that generated summary words. They fed their encoder output at each time step into a convolutional gated unit, that with a selfattention mechanism allowed the encoder output at each time step to become new representation vector, with further connection to the global sourceside information. …
Knowledge Based Machine Reading Comprehension (Sep 2018), which addressed knowledgebased machine reading comprehension, was designed so that successfully answering the questions required understanding and the knowledge included in a document. Their framework consisted of a question answering model and a question generation model, both of which took the knowledge extracted from the document as well as relevant facts from an external knowledge base … Results showed that incorporating side information from the external KB improved the accuracy of the baseline question answer system (BiDAF) …
Representation Learning by Reconstructing Neighborhoods (Nov 2018) proposed an unsupervised representation learning framework (neighborencoder) in which domain knowledge can be easily incorporated into the learning process without modifying the general encoderdecoder architecture of the classic autoencoder. In contrast to autoencoder, which reconstructs the input data itself, neighborencoder reconstructed the input data’s neighbors. As the proposed representation learning problem was essentially a neighbor reconstruction problem, domain knowledge could be easily incorporated in the form of an appropriate definition of similarity between objects. Based on that observation, their framework could leverage off the shelf similarity search algorithms or side information to find the neighbor of an input object. …
In some circumstances it may be necessary to search a KG for subgraphs and topological regions of interest, for which graph representation learning) and other approaches may be applied. For example, Searching for a Single Community in a Graph (May 2018) used “side information” (knowledge about some of the nodes from the target group) to identify nodes in the target more reliably and with lower computation.
Additional Reading

Improving Natural Language Inference Using External Knowledge in the Science Questions Domain

Combining neural and knowledgebased approaches to Named Entity Recognition in Polish (Nov 2018)
“Named entity recognition (NER) is one of the tasks in natural language processing that can greatly benefit from the use of external knowledge sources. We propose a named entity recognition framework composed of knowledgebased feature extractors and a deep learning model including contextual word embeddings, long shortterm memory (LSTM) layers and conditional random fields (CRF) inference layer. We use an entity linking module to integrate our system with Wikipedia. The combination of effective neural architecture and external resources allows us to obtain stateoftheart results on recognition of Polish proper names. We evaluate our model on data from PolEval 2018 NER challenge on which it outperforms other methods, reducing the error rate by 22.4% compared to the winning solution. Our work shows that combining neural NER model and entity linking model with a knowledge base is more effective in recognizing named entities than using NER model alone.”

HCqa: Hybrid and Complex Question Answering on Textual Corpus and Knowledge Graph (Nov 2018) [code]
“Question Answering (QA) systems provide easy access to the vast amount of knowledge without having to know the underlying complex structure of the knowledge. The research community has provided ad hoc solutions to the key QA tasks, including named entity recognition and disambiguation, relation extraction and query building. Furthermore, some have integrated and composed these components to implement many tasks automatically and efficiently. However, in general, the existing solutions are limited to simple and short questions and still do not address complex questions composed of several subquestions. Exploiting the answer to complex questions is further challenged if it requires integrating knowledge from unstructured data sources, i.e., textual corpus, as well as structured data sources, i.e., knowledge graphs. In this paper, an approach (HCqa) is introduced for dealing with complex questions requiring federating knowledge from a hybrid of heterogeneous data sources (structured and unstructured). We contribute in developing (i) a decomposition mechanism which extracts subquestions from potentially long and complex input questions, (ii) a novel comprehensive schema, first of its kind, for extracting and annotating relations, and (iii) an approach for executing and aggregating the answers of subquestions. The evaluation of HCqa showed a superior accuracy in the fundamental tasks, such as relation extraction, as well as the federation task.”
[Image source. Click image to open in new window.]
Text Generation from Knowledge Graphs
Text Generation from Knowledge Graphs with Graph Transformers (Allen Institute for Artificial Intelligence: Apr 2019) [code] addressed the problem of generating coherent multisentence texts from the output of an information extraction system, and in particular a knowledge graph. “We introduce a novel graph transforming encoder [GraphWriter ] which can leverage the relational structure of such knowledge graphs without imposing linearization or hierarchical constraints. Incorporated into an encoderdecoder setup, we provide an endtoend trainable system for graphtotext generation that we apply to the domain of scientific text. Automatic and human evaluations show that our technique produces more informative texts which exhibit better document structure than competitive encoderdecoder methods.”
[Image source. Click image to open in new window.]