Last modified: 2020-06-28 12:28:17 -0700 (PST)

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See Reasoning.

Activation functions

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. An activation function – for example, ReLU or sigmoid – takes in the weighted sum of all of the inputs from the previous layer, then generates and passes an output value (typically nonlinear) to the next layer; i.e., they take a single number and perform a fixed mathematical operation on it. These functions should be non-linear, to encode complex patterns of the data. The most popular activation functions are sigmoid, tanh and ReLU. ReLU is the most popular activation function in deep neural networks.

activation functions

[Image source. Click image to open in new window.]

Here is a simple artificial neural network:


[ANN showing bias units. Click image to open in new window.]

Each of the circles above represent individual nodes (different layers may contain different numbers of nodes). In the feedforward pass, each node is acted upon (numerically modified) by the activation function  $\small \phi$, shown in the images below [and later adjusted, in response to the error, during the backpropagation step – neither shown nor discussed here.] We calculate each of the layer $\small 2$ node activations based on the sum of the input values ($\small \mathbf{x}_1, \mathbf{x}_2, \dots$) times the weight matrix $\small \mathbf{W}_{ij}$, plus the bias term $\small \mathbf{b}$. For example, the value of node $\small \mathbf{a}{(2) \atop 3}$ in the image, above, representing unit $\small 3$ in layer $\small 2$), is calculated from the values of each of the input nodes $\small \mathbf{x}_i$ in the preceding layer, the weight matrices $\small \mathbf{W}_{ij}$, and the bias unit $\small \mathbf{b}$ ($\small \mathbf{x}_0 = +1$), giving the layer $\small 2$ node activations $\small \mathbf{a}{(2) \atop i}$.

The activation function is shown in greater detail in the following two images; as mentioned, it acts on each node. The activation values on each of the hidden units/nodes in layer $\small 3$ ($\small \mathbf{a}{(3) \atop i}$) are derived from the sigmoid function ($\small \phi(\cdot)$) applied to the linear combination (sum, $\small \sum$) of the inputs to those nodes (the layer $\small 2$ activations multiplied by the weight matrix, plus the bias unit). The activations are derived using exactly the same logic as for the first hidden layer, except that the input values are the activation values $\small \mathbf{a}$ from the preceding layer, rather than the $\small \mathbf{x}$ values. …


[image source; $\small \text{output} = \text{activation}(\sum(\text{weight} * \text{input} + \text{bias})$  (click image to open in new window)]


[Image source. Click image to open in new window.]

This provides a very superficial overview. To fully understand the basics of machine learning, you need to study/know at least a bit of linear algebra (matrix mathematics: transformations, multiplications, … – often used in various parametrizations such as weight matrices, …), partial derivatives (needed for the chain rule, used in backpropagation, …), etc. In my opinion, Andrew Ng’s Machine Learning course is essential preparation, providing a thorough understanding of the basics. Dr. Ng. also teaches the Stanford University cs229 Machine Learning course, which provides online materials.

For example, the summations, $\small \sum(\cdot)$, variously shown above are normally and very neatly accomplished in one step by multiplying the inputs by a transposed weight matrix (that also includes the bias term). Given a neuron that takes as input $\small \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and a bias term $\small \mathbf{b} = 1$, the output would be $\small \mathbf{h}_{W,b}(\mathbf{x}) = \phi(\sum_{i=1}^{3}\mathbf{W}_i\mathbf{x}_i + b) = \phi(\mathbf{W}^T\mathbf{x})$. Note that last expression, $\small \phi(\mathbf{W}^T\mathbf{x})$: the transposed weight matrix (including the bias) is multiplied by the input giving the output in one operation!

Activation (+ Other ML) Resources

Ad hoc (ad-hoc) retrieval

See Document relevance ranking.

Adjacency matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Degree Matrix.

Adversarial attacks (adversarial challenges)

For some time now, it has been recognized that some machine learning models appear to be particularly sensitive to adversarial challenges – reviewed in Wild Patterns: Ten Years After the Rise of Adversarial Machine Learning (Dec 2017, updated Jul 2018), and the subject of Google’s Unrestricted Adversarial Examples Challenge [paperleaderboard].


Adversarial examples can be generated through a variety of means, including by making small modifications to the input pixels, but also using spatial transformations, or simple guess-and-check to find misclassified inputs.  [Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Adversarial examples are inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake; they’re like optical illusions for machines. Adversarial examples have the potential to be dangerous. For example, attackers could target autonomous vehicles by using stickers or paint to create an adversarial stop sign that the vehicle would interpret as a ‘yield’ or other sign, as discussed in Practical Black-Box Attacks against Deep Learning Systems using Adversarial Examples (Mar 2017).


[Image source. Click image to open in new window.]

Textual models are also very sensitive to adversarial attacks, which I (and others) noted appear to be particularly sensitive to context; even making subtle changes to the question (changing case, sentence capitalization, character level changes, etc.) can give profoundly different answers. For example, note the following examples from Semantically Equivalent Adversarial Rules for Debugging NLP Models (Ribeiro et al., 2018).


[Image source. Click image to open in new window.]

In the Question Answering and Reading Comprehension subsection of my TECHNICAL REVIEW  I mentioned the Bi-Directional Attention Flow  (BiDAF ) QA platform, which was trained on v.1.1 of the Stanford Question Answering Dataset (SQuAD1.1). BiDAF  is a multi-stage hierarchical process that represents context at different levels of granularity and uses a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization, while SQuAD is a reading comprehension dataset.

To quickly and crudely evaluate the utility of the BiDAF platform, I used their online demo, which was trained on the SQuAD1.1 dataset. I queried one of my PubMed abstracts, Stuart et al. (2009) Transcriptional Response to Mitochondrial NADH Kinase Deficiency in Saccharomyces cerevisiae. That abstract, my questions, and the BiDAF-provided answers are shown here verbatim (unquoted text; punctuation and character case as shown):

While those PRELIMINARY RESULTS were encouraging, I obtained substantially different (often incorrect) answers in response to the formating of otherwise identical questions, as well as queries on entities not present in the text:

At first glance, the issues I identified in my BiDAF /SQuAD tests, above, were suggestive of differences in one or more of the various embedding, attention flow, modeling, and output layers in the BiDAF model (see, e.g., Section 2 in the BiDAF paper) that did not transfer well to the biomedical abstract and questions posed. As well, it is noted (Appendix B in The Natural Language Decathlon: Multitask Learning as Question Answering paper that data in SQuAD is lowercased. Note also that Fig. 3
in that “decaNLP” paper shows which component of their MQAN system chooses to output answers. Appendix A in the decaNLP paper also discusses aspects of SQuAD (paraphrased here) that may be relevant to my observations:

    "Early success on the SQuAD dataset exploited the fact that all answers can be found verbatim in the context. State-of-the-art models point to start and end tokens in the document. This allowed deterministic answer extraction to overtake sequential token generation. This quirk of the dataset does not hold for question answering in general, so recent models for SQuAD are not necessarily general question answering models. ..."

Without looking into this further, I am not sure if shallow-trained models are more sensitive to linguistic adversarial attacks than the more recent, pretrained contextual language models – however, that is my intuition.

As you can imagine, the susceptibility of machine learning models to adversarial challenges / attacks is of intense interest within the ML/NLP research communities – a greater understanding of which will undoubtedly yield more robust, secure, and transparent / interpretable models.

  • For example, in Matrix Capsules with EM Routing (ICLR 2018): Geoff Hinton states  “A capsule is a group of neurons whose outputs represent different properties of the same entity. … Capsules also show far more resistance to white box adversarial attacks than our baseline convolutional neural network. … We found that our model is significantly less vulnerable to both general and targeted adversarial attacks.”  Section 6 (Adversarial Robustness, pp. 7+) in that paper discusses the vulnerability of neural networks to adversarial challenges.


[Image source. Click image to open in new window.]

Adversarial attacks – see also:

Affine transformation

In geometry, an affine transformation, affine map or an affinity (from the Latin, affinis, “connected with”) is a function between affine spaces which preserves points, straight lines and planes. Also, sets of parallel lines remain parallel after an affine transformation. An affine transformation does not necessarily preserve angles between lines or distances between points, though it does preserve ratios of distances between points lying on a straight line.

Examples of affine transformations include translation, scaling, homothety, similarity transformation, reflection, rotation, shear mapping, and compositions of them in any combination and sequence.

If $\small X$ and $\small Y$ are affine spaces, then every affine transformation $\small f \colon X \to Y$ is of the form $\small x \mapsto Mx+b$, where $\small M$ is a linear transformation on the space $\small X$, $\small x$ is a vector in $\small X$, and $\small b$ is a vector in $\small Y$. Unlike a purely linear transformation, an affine map need not preserve the zero point in a linear space. Thus, every linear transformation is affine, but not every affine transformation is linear.

All Euclidean spaces are affine, but there are affine spaces that are non-Euclidean. In affine coordinates, which include Cartesian coordinates in Euclidean spaces, each output coordinate of an affine map is a linear function (in the sense of calculus) of all input coordinates. Another way to deal with affine transformations systematically is to select a point as the origin; then, any affine transformation is equivalent to a linear transformation (of position vectors) followed by a translation.


[Image source. An image of a fern-like fractal that exhibits affine self-similarity. Each of the leaves of the fern is related to each other leaf by an affine transformation. For instance, the red leaf can be transformed into both the dark blue leaf and the light blue leaf by a combination of reflection, rotation, scaling, and translation. Click image to open in new window.]

Matrix affine transformations

To represent affine transformations with matrices, we can use homogeneous coordinates. This means representing a 2-vector $\small (x,y)$ as a 3-vector $\small (x,y,1)$, and similarly for higher dimensions. Using this system, translation can be expressed with matrix multiplication. The functional form

    $\small x' = x + t_x$;
    $\small y' = y + t_y$


    $\small \begin{bmatrix} x' \\\ y' \\\ 1 ènd{bmatrix} = \begin{bmatrix} 1 & 0 & t_{x} \\\ 0 & 1 & t_{y} \\\ 0 & 0 & 1 ènd{bmatrix} \begin{bmatrix} x \\\ y \\\ 1 ènd{bmatrix}$.

All ordinary linear transformations are included in the set of affine transformations, and can be described as a simplified form of affine transformations. Therefore, any linear transformation can also be represented by a general transformation matrix. The latter is obtained by expanding the corresponding linear transformation matrix by one row and column, filling the extra space with zeros except for the lower-right corner, which must be set to $\small 1$. For example, the counter-clockwise rotation matrix from above becomes:

    $\small \begin{bmatrix} \cos \theta & -\sin \theta & 0 \\\ sin \theta & \cos \theta & 0 \\\ 0 & 0 & 1 ènd{bmatrix}$

Using transformation matrices containing homogeneous coordinates, translations become linearly independent, and thus can be seamlessly intermixed with all other types of transformations. The reason is that the real plane is mapped to the $\small w = 1$ plane in real projective space, and so translation in real Euclidean space can be represented as a shear in real projective space. Although a translation is a non-linear transformation in a 2-D or 3-D Euclidean space described by Cartesian coordinates (i.e. it can’t be combined with other transformations while preserving commutativity and other properties), it becomes, in a 3-D or 4-D projective space described by homogeneous coordinates, a simple linear transformation (a shear).

For an example of shear mapping in the eigenspace, refer here.

More affine transformations can be obtained by composition of two or more affine transformations. … [continued here.]

Example: 2-D Transformation

When a transformation takes place on a 2D plane, it is called 2D transformation.

We can represent a point, $\small \mathbf{p} = (x,y)$, in the plane as a row vector $\small \begin{bmatrix} x & y ènd{bmatrix}$ or as a column vector $\small \begin{bmatrix} x \\ y ènd{bmatrix}$.

We can represent a 2-D transformation $\small \mathbf{M}$ by a matrix $\small \mathbf{M} = \begin{bmatrix} a & b \\ c & d ènd{bmatrix}$.

If $\small \mathbf{p}$ is a column vector ($\small [2 \times 1]$), then we need to place $\small \mathbf{M}$ on the left:

… $\small [\color{Blue}2 \times \color{Magenta}2] [\color{Magenta}2 \times \color{Blue}1] = [\color{Blue}2 \times \color{Blue}1]$; we cannot multiply $\small [2 \times 1] [2 \times 1]$, due to the mismatched matrix dimensions.

    $\small \begin{bmatrix} x' \\\ y' ènd{bmatrix} = \begin{bmatrix} a & b \\\ c & d ènd{bmatrix} \begin{bmatrix} x \\\ y ènd{bmatrix} = \begin{bmatrix} ax + by \\\ cx + dy ènd{bmatrix}$

    $\small \begin{bmatrix} x' \\\ y' ènd{bmatrix} = \begin{bmatrix} 1 & 2 \\\ 3 & 4 ènd{bmatrix} \begin{bmatrix} 3 \\\ 2 ènd{bmatrix} = \begin{bmatrix} 1(3) + 2(2) \\\ 3(3) + 4(2) ènd{bmatrix} = \begin{bmatrix} 7 \\\ 17 ènd{bmatrix}$

If $\small \mathbf{p}$ is a row vector ($\small [1 \times 2]$), then we need to place $\small \mathbf{M}^T$ on the right ($\small [1 \times 2][2 \times 2]^\color{Magenta}T$):

$\small \mathbf{M} = \begin{bmatrix} a & b \\ c & d ènd{bmatrix}$,

$\small \mathbf{M}^T = \begin{bmatrix} a & c \\ b & d ènd{bmatrix}$.

    $\small \begin{bmatrix} x' & y' ènd{bmatrix} = \begin{bmatrix} x & y ènd{bmatrix} \begin{bmatrix} a & c \\\ b & d ènd{bmatrix} = \begin{bmatrix} xa + yb & xc + yd ènd{bmatrix}$

    $\small \begin{bmatrix} x' & y' ènd{bmatrix} = \begin{bmatrix} 3 & 2 ènd{bmatrix} \begin{bmatrix} 1 & 3 \\\ 2 & 4 ènd{bmatrix}$ $\small = \begin{bmatrix} 3(1) + 2(2) & 3(3) + 2(4) ènd{bmatrix} = \begin{bmatrix} 7 & 17 ènd{bmatrix}$

which is basically the same result as above (same point, same transformation!).

[If we did not do that matrix transform, $\small \mathbf{M}^T$, we would have obtained $\small \begin{bmatrix} 9 & 14 ènd{bmatrix}$).]

octave:>> p=[3,2]
p =
  3   2

octave:>> q=[3;2]
q =

octave:>> M=[1,2;3,4]
M =
  1   2
  3   4

octave:>> M'
ans =
  1   3
  2   4

octave:>> p*M'
ans =
    7   17

octave:>> M*q
ans =



See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.


In semantics, an antonym is a word having the opposite meaning as another word.

Examples: “happy/unhappy”; “heavy/light”; “long/short”; “up/down”; “dead/alive”; “parent/child”; …

See also Linguistics | Semantics.

Oppositeness is a logical category. There are three types:

  • Complementary pairs are antonyms in which the presence of one quality or state signifies the absence of the other and vice versa: single/ married, not pregnant/pregnant, … There are no intermediate states: you can’t be “a little pregnant”, or “kinda married”.

  • Gradable pairs are antonyms which allow for a natural, gradual transition between two poles: “good/bad”, or “hot/cold”. It is possible to be a “little cold” or “very cold”, etc.

  • Relational opposites are antonyms which share the same semantic features – only the focus (or direction) is reversed: “tie/untie”, “buy/sell”, “give/receive”, “teacher/pupil”, “father/son”.

Some concepts lack logical opposites that can be described in terms of any special word; colors are a good example: the logical opposite of “red” is “not red”. Such concepts may form relational antonyms, however, through symbolic systems of thinking. For instance, in the Cold War paradigm, the relational opposite of “American” is “Russian”; in current US politics, the relational opposite of “Democrat” is “Republican”. These are cultural relational opposites.

Architectures; Algorithms; Frameworks;
Libraries; Models; Platforms

I tend to differentiate architectures (the programmatic implementation of a thematic class of machine learning algorithms and procedures, the “basic code”) from models (that represent various implementations of those architectures). Examples of major architectures include:

Frameworks and libraries. Differentiating the meaning and use of these two terms is a terminological rabbit hole – exemplified in the StackOverflow threads What is the difference between a framework and a library? and Framework vs. Toolkit vs. Library. While the two terms tend to be used interchangeably (for example, TensorFlow, PyTorch, etc. are often included in lists of machine learning libraries), I personally prefer the use of the term framework for programming environments including:

  • Caffe
  • TensorFlow
  • PyTorch
  • Theano
  • Torch
  • etc.

Deep learning frameworks simplify deep learning model development and training by providing high-level primitives for complex and error-prone mathematical transformations, like gradient descent, back-propagation, and inference. Deep learning frameworks vary in their level of functionality. Some of them, such as Theano or TensorFlow, allow you to define a neural network of arbitrary complexity from the most basic building blocks. These types of frameworks might even be called languages. Other “frameworks” [← there’s an example of a library being referred to as a framework!], such as Keras, are drivers or wrappers aimed at boosting developer productivity but are limited in their functionality due to the higher level of abstraction. [Sources: Hands-On AI Part 5: Select a Deep Learning Framework]

Like my use of the term “framework,” I prefer to reserve the use of the term “library” for the libraries provided in programming languages, such as the NumPy package/library in the Python programming language. By this definition, Keras (which uses TensorFlow as a backend) is likewise a library.

The simplicity of Python, a general purpose programming language, has attracted many developers to build libraries for machine learning and data science (and, because of all these libraries, Python is almost popular as R for data science). A popular machine learning library for Python includes NumPy.

Overall, I agree with a comment by @Trunk in SO #148747:

“A software framework is not just a library. It also includes a particular design architecture which users must work within – a sort way in which different items in the library are connected together to provide a generic solution for a particular application type. Frameworks thus have the advantage of needing only configuration by the user and validation of pre-written tests to become operational. With libraries alone users need to design an architecture and create tests before implementation. So more work is needed with just the libraries. But libraries are flexible and potentially more efficient.”


Why Libraries are better than Frameworks  (cited in SO #3057526 and the source for the image, above) also defines the term platform:

“A platform is some lowest level on which to build an application, examples are OS’s or virtual machines, e.g.: Linux, OS X, Java, or .NET. These platforms are great; they allow application developers to focus on the specific application logic, reusing all the ‘plumbing’ like sockets and file systems.”

The ML Cheatsheet also provides some definitions:

  • Algorithm: a method, or series of instructions, devised to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

  • Model: a structure that stores a generalized, internal representation of a dataset (weights and biases) for description or prediction. Models are created/learned when you train an algorithm on a dataset (i.e., the output is a model).

  • Model vs. Algorithm:

    • Model: the representation learned from a dataset and used to make future predictions.
    • Algorithm: the process for learning it.

      Model = Algorithm(Training Data)


In logic, mathematics, and computer science, the arity of a function or operation is the number of arguments or operands that the function takes. The arity of a relation (or predicate) is the dimension of the domain in the corresponding Cartesian product. (A function of arity $\small n$ thus has arity $\small n+1$ considered as a relation.) The term springs from words like unary, binary, ternary, etc. Unary functions or predicates may be also called “monadic”; similarly, binary functions may be called “dyadic”.

In mathematics, arity may also be named rank, but this word can have many other meanings in mathematics. In logic and philosophy, arity is also called adicity and degree. In linguistics, arity is usually named valency.

In computer programming, there is often a syntactical distinction between operators and functions; syntactical operators usually have arity 0, 1, or 2 (the ternary operator ?: is also common). Functions vary widely in the number of arguments, though large numbers can become unwieldy. Some programming languages also offer support for variadic functions, i.e., functions syntactically accepting a variable number of arguments.

Attention mechanisms

In mid-2017, the dominant sequence transduction models were based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The inherently sequential nature of RNNs precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

An attention mechanism  [local copy] takes into account the input from several time steps to make one prediction. It distributes attention over several hidden states, assigning different weights (importance) to those inputs. An attention mechanism allows a method to focus on task-relevant parts of the input {sequence | graph}, helping it to make better decisions.

Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence. It has been shown to be very useful in machine reading, abstractive summarization, or image description generation.

In June 2017 Vaswani et al. at Google proposed a new simple network architecture, TransformerTransformer, that was based solely on attention mechanisms – entirely dispensing with recurrence and convolutions, and allowing significantly more parallelization [Attention Is All You Need (Jun 2017; updated Dec 2017);  code]. Transformer has been shown to perform strongly on machine translation, document generation, syntactic parsing and other tasks. Experiments on two machine translation tasks showed these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Transformer also generalized well to other tasks; for example, it was successfully applied to English constituency parsing, both with large and limited training data.

In July 2018, nearly a year after they introduced their original “Attention Is All You NeedTransformer architecture (Jun 2017; updated Dec 2017), Google Brain/DeepMind released an updated Universal Transformer version, discussed in the Google AI blog post Moving Beyond Translation with the Universal Transformer [Aug 2018;  discussion]:


Attention mechanisms, like Google’s Transformer, have become an integral part of sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. Self-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the same sequence. It has been shown to be very useful in machine reading, abstractive summarization, or image description generation.

Excerpted from Tensorized Self-Attention: Efficiently Modeling Pairwise and Global Dependencies Together (May 2018, updated Mar 2019) [appears (2019-03-27
) as “Tao Shen et al. ‘18 300D Reinforced Self-Attention Network” in Stanford SNLI Leaderboard]:

Attention mechanisms can be categorized into two classes according to the type of dependency each aims to model.

  • The first category is $\small \text{token2token}$ self-attention that captures syntactic dependency between every two tokens in a sequence. An efficient dot-product compatibility function is usually deployed to measure this pairwise dependency. In contrast, an additive compatibility function captures the dependency by multilayer perceptron (MLP), and can usually achieve better performance. Its expressive power can be further improved if expanded to multiple dimensions. This multi-dimensional self-attention empirically surpasses the dot-product one, but suffers from expensive computation and memory, which grow linearly with the number of features and quadratically with the sequence length. Hence, it is not scalable to long sequences in practice.

  • The second category is $\small \text{source2token}$ self-attention aiming to capture global dependency, i.e., the importance of each token to the entire sequence for a specific task. Its time and space complexities grow linearly, rather than quadratically, with the sequence length. Hence, it is empirically efficient in terms of memory and computation even if expanded to multiple dimensions, i.e., using a vector of feature-wise scores instead of a scalar for the global dependency. But, it is hard to reach state of the art performance on NLP tasks due to the lack of pairwise and local dependencies.

In this paper, we propose a novel attention mechanism called multi-mask tensorized self-attention (MTSA), for context fusion. In MTSA,

    (1) the pairwise dependency is captured by an efficient dot-product based $\small \text{token2toke}$n self-attention, while the global dependency is modeled by a feature-wise multi-dim $\small \text{source2token}$ self-attention, so they can work jointly to encode rich contextual features;
    (2) self-attention alignment scores are tensorized for more expressive power in that each pair of tokens has one score for each feature, but no tensor computation is required other than simple and efficient matrix multiplications when implemented;
    (3) the tensors above are computed in multiple subspaces (i.e., in a multi-head fashion) rather than in the original input space, so the required memory and computation can be distributed to multiple subspaces; and
    (4) a distinct positional mask is applied to each head in order to encode rich structural information such as the sequential order and relative position of tokens.

Attention – see also:

Graph attention models

While useful insights can be derived from graph-structured data, in the real-world graphs can be both large (with many complex patterns) and noisy, which poses problems for effective graph mining. An effective way to deal with this issue is to incorporate self-attentional mechanisms into graph mining solutions. An attention mechanism allows a method to focus on task-relevant parts of the graph, helping it to make better decisions. Attention Models in Graphs: A Survey (Jul 2018) provides a recent focused survey of the literature on the emerging field of graph attention models.


[Image source. Click image to open in new window.]

Unlike graph convolutional networks (GCNs) which use a fixed or learnable polynomial of a graph Laplacian or an adjacency matrix to aggregate (filter) node information, graph attention networks (GATs) aggregate node information by using an attention mechanism on graph neighborhoods. The essential difference between GAT and GCN is stark: in GCN the weights for aggregating (filtering) neighbor nodes are defined by the graph structure, which is independent on node contents; in contrast, weights in GAT are a function of node contents, thanks to the attention mechanism. Results on graph node classification show that the adaptiveness of GAT makes it more effective to fuse information from node features and graph topological structures.

Attention Models in Graphs: A Survey (Jul 2018) provides a recent focused survey of the literature on the emerging field of graph attention models.

See also: Attention? Attention!  [local copy]


[Image source. Click image to open in new window.]


An autoencoder is a neural network that is trained to copy its input to its output. An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction. Recently, the autoencoder concept has become more widely used for learning generative models of data. Some of the most powerful AI in the 2010s have involved sparse autoencoders stacked inside of deep neural networks.

An autoencoder learns to compress data from the input layer into a short code, and then uncompress that code into something that closely matches the original data. This forces the autoencoder to engage in dimensionality reduction, for example by learning how to ignore noise. Some architectures use stacked sparse autoencoder layers for image recognition. The first autoencoder might learn to encode easy features like corners, the second to analyze the first layer’s output and then encode less local features like the tip of a nose, the third might encode a whole nose, etc., until the final autoencoder encodes the whole image into a code that matches (for example) the concept of “cat.” An alternative use is as a generative model: for example, if a system is manually fed the codes it has learned for “cat” and “flying”, it may attempt to generate an image of a flying cat, even if it has never seen a flying cat before.

Architecturally, the simplest form of an autoencoder is a feedforward, non-recurrent neural network very similar to the many single layer perceptrons which makes a multilayer perceptron (MLP) – having an input layer, an output layer and one or more hidden layers connecting them – but with the output layer having the same number of nodes as the input layer, and with the purpose of reconstructing its own inputs (instead of predicting the target value $\small Y$ given inputs $\small X$. Therefore, autoencoders are unsupervised learning models.


[Image source. Click image to open in new window.]

An autoencoder always consists of two parts, the encoder and the decoder, which can be defined as transitions $\small \phi$ and $\small \psi$, such that:

    $\small \phi : {X} \rightarrow {F}$
    $\small \psi : {F} \rightarrow {X}$
    $\small \phi, \psi = { \underset{\phi, \psi} {\operatorname{arg\,min}} }\ \|X - (\psi \circ \phi )X\|^2$

In the simplest case, where there is one hidden layer, the encoder stage of an autoencoder takes the input $\small \mathbf{x} \in \mathbb{R}^d = {X}$ and maps it to $\small \mathbf{z} \in \mathbb{R}^p = {F}$:

    $\small \displaystyle \mathbf{z} = \sigma (\mathbf{Wx} + \mathbf{b})$

This image $\small \mathbf{z}$ is usually referred to as code, latent variables, or latent representation. Here, $\small \sigma$ is an element-wise activation function such as a sigmoid function or a rectified linear unit. $\small \mathbf{W}$ is a weight matrix and $\small \mathbf{b}$ is a bias vector. After that, the decoder stage of the autoencoder maps $\small \mathbf{z}$ to the reconstruction $\small \mathbf{x’}$ of the same shape as $\small \mathbf{x}$:

    $\small \mathbf{x'} = \sigma '(\mathbf{W'z} +\mathbf{b'})$

where $\small \mathbf{\sigma’}$, $\small \mathbf {W’}$, and $\small \mathbf{b’}$ for the decoder may differ in general from the corresponding $\small \mathbf{\sigma}$, $\small \mathbf{W}$, and $\small \mathbf{b}$ for the encoder, depending on the design of the autoencoder.

Autoencoders are also trained to minimise reconstruction errors (such as squared errors):

    $\small {L} (\mathbf{x}, \mathbf{x'}) = \|\mathbf{x} - \mathbf{x'} \|^2 = \|\mathbf{x} - \sigma '(\mathbf{W'} (\sigma (\mathbf{Wx} + \mathbf{b})) + \mathbf{b'})\|^2$

where $\small \mathbf{x}$ is usually averaged over some input training set.

If the feature space $\small {F}$ has lower dimensionality than the input space $\small {X}$, then the feature vector $\small \phi(x)$ can be regarded as a compressed representation of the input $\small x$. If the hidden layers are larger than the input layer, an autoencoder can potentially learn the identity function and become useless. However, experimental results have shown that autoencoders might still learn useful features in these cases.

Source: Wikipedia: autoencoder.

See also:

Automatic differentiation

In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program.


[Image source. Click image to open in new window.]

Automatic differentiation is not:

These classical methods run into problems: symbolic differentiation leads to inefficient code (unless done carefully) and faces the difficulty of converting a computer program into a single expression, while numerical differentiation can introduce round-off errors in the discretization process and cancellation. Both classical methods have problems with calculating higher derivatives, where the complexity and errors increase. Finally, both classical methods are slow at computing the partial derivatives of a function with respect to many inputs, as is needed for gradient-based optimization algorithms. Automatic differentiation solves all of these problems, at the expense of introducing more software dependencies.

Chain rule; Forward and reverse accumulation

Fundamental to AD is the decomposition of differentials provided by the chain rule. For the simple composition

    $\small y = f(g(h(x))) = f(g(h(w_0))) = f(g(w_1)) = f(w_2) = w_3$

the chain rule gives

    $\large \frac{dy}{dx} = \frac {dy}{dw_2} \frac{dw_2}{dw_1} \frac{dw_1}{dx}$

Usually, two distinct modes of AD are presented, forward accumulation (or forward mode) and reverse accumulation (or reverse mode). Forward accumulation specifies that one traverses the chain rule from inside to outside (that is, first compute $\small dw_{1}/dx$ and then $\small dw_{2}/dw_{1}$ and finally $\small dy/dw_{2})$, while reverse accumulation has the traversal from outside to inside (first compute $\small dy/dw_{2}$ and then $\small dw_{2}/dw_{1}$ and at last $\small dw_{1}/dx$). More succinctly,

    forward accumulation computes the recursive relation:

    $\large \frac{dw_i}{dx} = \frac{dw_i}{dw_{i-1}} \frac{dw_{i-1}}{dx}$ with $\small w_3 = y$, and,

    reverse accumulation computes the recursive relation:

    $\large \frac{dy}{dw_i} = \frac{dy}{dw_{i+1}} \frac{dw_{i+1}}{dw_i}$ with $\small w_0 = x$.

Generally, both forward and reverse accumulation are specific manifestations of applying the operator of program composition, with the appropriate one of the two mappings $\small (w,y)$ being fixed.

See also:

Autoregression (Autocorrelation)

Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. It is a very simple idea that can result in accurate forecasts on a range of time series problems.

A regression model, such as linear regression, models an output value based on a linear combination of input values.

For example:

    $\small \hat{y} = \mathbf{b}_0 + \mathbf{b}_1 \mathbf{X}_1$

Where $\small \hat{y}$ is the prediction, $\small \mathbf{b}_0 $ and $\small \mathbf{b}_1$ are coefficients found by optimizing the model on training data, and $\small \mathbf{X}_1$ is an input value.

This technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables. For example, we can predict the value for the next time step (t+1) given the observations at the last two time steps (t-1 and t-2). As a regression model, this would look as follows:

    $\small \mathbf{X}_{(t+1)} = \mathbf{b}_0 + \mathbf{b}_1 \mathbf{X}_{(t-1)} + \mathbf{b}_2 \mathbf{X}_{(t-2)}$

Because the regression model uses data from the same input variable at previous time steps, it is referred to as an autoregression (regression of self).

Autoregressive, feed-forward models

Instead of making predictions from a state that depends on the entire history, an autoregressive model directly predicts $\small y_t$ using only the $\small k$ most recent inputs, $\small x_{t-k+1}, \ldots, x_t$. This corresponds to a strong conditional independence assumption. In particular, a feed-forward model assumes the target only depends on the $\small k$ most recent inputs. Google’s WaveNet nicely illustrates this general principle.


[Image source. Click image to open in new window.]

In contrast to an RNN, the limited context of a feed-forward model means that it cannot capture patterns that extend more than $\small k$ steps. However, using techniques like dilated-convolutions, one can make $\small k$ quite large.

Source for the preceding subsection: When Recurrent Models Don’t Need to be Recurrent.


An autoregression model makes an assumption that the observations at previous time steps are useful to predict the value at the next time step. This relationship between variables is called correlation.

If both variables change in the same direction (e.g. go up together or down together), this is called a positive correlation. If the variables move in opposite directions as values change (e.g. one goes up and one goes down), then this is called negative correlation.

We can use statistical measures to calculate the correlation between the output variable and values at previous time steps at various different lags. The stronger the correlation between the output variable and a specific lagged variable, the more weight that autoregression model can put on that variable when modeling.

Again, because the correlation is calculated between the variable and itself at previous time steps, it is called an autocorrelation. It is also called serial correlation because of the sequenced structure of time series data.

The correlation statistics can also help to choose which lag variables will be useful in a model and which will not.

Interestingly, if all lag variables show low or no correlation with the output variable, then it suggests that the time series problem may not be predictable. This can be very useful when getting started on a new dataset.


See Proofs.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


Backpropagation operations in neural networks can be divided into two steps: feedforward, and backpropagation. In the feedforward step, an input pattern is applied to the input layer and its effect propagates, layer by layer, through the network until an output value is produced. The network’s actual output value is then compared to the expected output (i.e., the desired output value), and an error signal is computed for each of the output nodes. Since all the hidden nodes to some degree have contributed to the errors evident in the output layer, the output error signals are transmitted backwards from the output layer to each node in the hidden layer that immediately contributed to the output layer. This process is then repeated, layer by layer, until each node in the network has received an error signal that describes its relative contribution to the overall error.

Once the error signal for each node has been determined, the errors are then used by the nodes to update the values for each connection weights until the network converges to a state that allows all the training patterns to be encoded. The Backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function is then considered to be a solution to the learning problem. [Source: Backpropagation]

In neural networks, if the estimated output $\small \hat{y}_i$ is far away from the actual output $\small y_i$ (high error), we update the biases and weights based on the error. This weight and bias updating process is known as back propagation. Backpropagation is an algorithm to efficiently calculate the gradients in a neural network (or more generally, a feedforward computational graph).

The actual output  $\small \mathbf{y}_i$ (observed from the network) is compared to the estimated output  $\small \mathbf{\hat{y}}_i$ (the desired/predicted/expected output). The error is the mean squared error,

$\ \ \ \ \small \begin{align} \frac{1}{n} \sum_{i=1}^n(y_i-\hat{y}_i)^2 ènd{align}$

Backpropagation algorithms work by applying the (partial differentiation) chain rule determining the loss (or error) at the output, and then propagating it back into the network. The weights are updated to minimize the error resulting from each neuron. The first step in minimizing the error is to determine the gradient (derivatives) of each node with respect to the final output.

The gradient is the vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient, a slope (i.e. partial derivative), is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.

Gradient descent is technique to minimize loss by computing the gradients of loss with respect to the model’s parameters, conditioned on the training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss.

$\ \ \ \ \theta_j := \theta_j - α\frac{\partial}{\partial\theta_i}J(\theta_0,\theta_1)$   (for $\small j=0$ and $\small j=1$)  [← gradient descent]

where $\small α$ (alpha, the learning rate) determines how quickly we move toward the minimum.

See also:

In computer science, a beam search is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set. Beam search is an optimization of best-first search that reduces its memory requirements. Best-first search is a graph search which orders all partial solutions (states) according to some heuristic. But in beam search, only a predetermined number of best partial solutions are kept as candidates. It is thus a greedy algorithm.

Beam search uses breadth-first search to build its search tree. At each level of the tree, it generates all successors of the states at the current level, sorting them in increasing order of heuristic cost. However, it only stores a predetermined number, $\small \beta$, of best states at each level (called the beam width). Only those states are expanded next. The greater the beam width, the fewer states are pruned. With an infinite beam width, no states are pruned and beam search is identical to breadth-first search. The beam width bounds the memory required to perform the search. Since a goal state could potentially be pruned, beam search sacrifices completeness (the guarantee that an algorithm will terminate with a solution, if one exists). Beam search is not optimal (that is, there is no guarantee that it will find the best solution).

In general, beam search returns the first solution found. Beam search for machine translation is a different case: once reaching the configured maximum search depth (i.e. translation length), the algorithm will evaluate the solutions found during search at various depths and return the best one (the one with the highest probability).

The beam width can either be fixed or variable. One approach that uses a variable beam width starts with the width at a minimum. If no solution is found, the beam is widened and the procedure is repeated.

A beam search is most often used to maintain tractability in large systems with insufficient amount of memory to store the entire search tree. For example, it is used in many machine translation systems. To select the best translation, each part is processed, and many different ways of translating the words appear. The top best translations according to their sentence structures are kept, and the rest are discarded. The translator then evaluates the translations according to a given criterion, choosing the translation which best keeps the goals. The first use of a beam search was in the Harpy Speech Recognition System, CMU 1976.

See also:


Bias terms ($\small \textbf{b}$ or $\small \textbf{w}_0$) are additional constants attached to the neurons and added to the weighted input before the activation function is applied. Bias terms help models represent patterns that do not necessarily pass through the origin. For example, if all your features were 0, would your output also be zero? Is it possible there is some base value upon which your features have an effect? Bias terms typically accompany weights and must also be learned by your model. [Source: ML Cheatsheet]

Bias is the term $\small \textbf{b}$ in the following formula: $\small y^{\ \prime} = \textbf{b} + \mathbf{w}_1\mathbf{x}_1 + \mathbf{w}_2\mathbf{x}_2 + \ldots + \mathbf{w}_n\mathbf{x}_n$.

Bidirectional LSTM (Bi-LSTM)

See Bidirectional LSTM (Bi-LSTM)  [LSTM subsection]

Bifan motif

A bifan is a network motif in which two outputs link to the same two inputs. A bifan is a type of regulatory motif (or “network motif”), and a motif can be thought of as a small recurring subgraph embedded in a network.


[Image source. Click image to open in new window.]

Many networks exhibit rich, lower-order connectivity patterns that can be captured at the level of individual nodes and edges. Higher-Order Organization of Complex Networks (Benson et al., Science: 2016), by Jure Leskovec and colleagues, developed a generalized framework for clustering networks on the basis of higher-order connectivity patterns, at the level of small network subgraphs. That project focused on finding higher-order organization of complex networks at the level of small network subgraphs (motifs), by clustering the nodes of a graph based on motifs instead of edges via a motif-based spectral clustering method. The approach is well explained and illustrated on their project page, for example using a bifan motif to find clusters in networks such as C. elegans (explained in more detail in Fig. 2 in the paper), or coherent feedforward loop motifs in the S. cerevisiae transcriptional regulation network.

The SNAP project page provides an example:

    "In many cases, we know that particular motifs are important for a given network. For example, the bifan motif is known to occur frequently in neuronal networks. We can use this motif to find clusters in networks such as C. elegans, pictured here."

    [Image source. Click image to open in new window.]

Binary operator ($\small \circ$)

In mathematics, a binary operation on a set is a calculation that combines two elements of the set (called operands) to produce another element of the set. More formally, a binary operation is an operation of arity of two whose two domains and one codomain are the same set. Examples include the familiar elementary arithmetic operations of addition, subtraction, multiplication and division. Other examples are readily found in different areas of mathematics, such as vector addition, matrix multiplication and conjugation in groups. More precisely, a binary operation on a set $\small S$ is a map which sends elements of the Cartesian product $\small S \times S$ to $\small S$:

    $\small f \colon S \times S \rightarrow S$.

Because the result of performing the operation on a pair of elements of $\small S$ is again an element of $\small S$, the operation is called a closed binary operation on $\small S$ (or sometimes expressed as having the property of closure). If $\small f$ is not a function, but is instead a partial function, it is called a partial binary operation. For instance, division of real numbers is a partial binary operation, because one can’t divide by zero: $\small a/0$ is not defined for any real $\small a$. Note however that both in algebra and model theory the binary operations considered are defined on all of $\small S \times S$.

Sometimes, especially in computer science, the term is used for any binary function. Binary operations are the keystone of algebraic structures studied in abstract algebra: they are essential in the definitions of groups, monoids, semigroups, rings, and more. Most generally, a magma is a set together with some binary operation defined on it.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


BioNLP is natural language processing, applied in the biomedical domain. See also: Natural language processing.


In technological domains, blockchains can be used to ensure proper attribution of ownership and provenance of { data | information }: assets, ideas, proprietorship, etc.

This Wikipedia article, Blockchain, provides an introduction.

Jesús Cedeño – an interested Reader of this Glossary (and coincidentally a fellow former Biochemist!) – mentioned to me [2020-06-26] a well-written, interesting overview of blockchain technology that he has authored:

Jesús’ article provides a more concise and friendly – yet thorough – introduction to blockchain technology … well worth reading! 

    Jesus Cedeño

    What is Blockchain Technology? author Jesús Cedeño
    [Image source.]

Additional reading

Boltzmann distribution

“The Boltzmann distribution is a foundational model for relating local interactions to global system behaviors, but can be challenging to fit to data. Given an energy function $\small U_\theta [\mathbf{x}]$, the probability of a system configuration $\small \textbf{x}$ scales exponentially with energy as

    $ p_\theta (\mathbf{x}) = \frac{1}{Z} e^{(-U_\theta [\mathbf{x}])}$.

“Importantly, simple energy functions $\small U_\theta [\mathbf{x}]$ consisting of weak interactions can collectively encode complex system behavior, such as the structures of materials, macromolecules, tissues, or when endowed with latent variables, images, sound, and text (Ackley et al., 1985 ; Salakhutdinov & Larochelle, 2010). Unfortunately, generating samples $\small \mathbf{x} \sim p_\theta (\mathbf{x})$ and learning model parameters $\small \hat{\mathbf{\theta}}$ with the Boltzmann distribution can require Monte Carlo simulations that frequently are unable to produce a useful result in available time. These difficulties have driven a shift towards generative models that are easier to learn and sample from, such as directed latent variable models (VAEs) and autoregressive models (Goodfellow et al., 2016).

“The protein folding problem provides a prime example of both the power of energy-based models at describing complex relationships in data as well as the challenge of generating samples from them. Decades of research in biochemistry and biophysics support an energy landscape theory of protein folding (Dill et al., 2017), in which the folds that natural protein sequences adopt are the minima of a free energy surface. Without the availability of external information such as coevolutionary information or homologous structures to constrain the energy function, however, contemporary simulations are challenged to generate globally favorable low-energy structures in available time.”

Source: Learning Protein Structure with a Differentiable Simulator (ICLR 2019) [OpenReviewdiscussionmentioned] and references therein.

See also:

Boolean network

A Boolean network consists of a discrete set of boolean variables each of which has a Boolean function (possibly different for each variable) assigned to it which takes inputs from a subset of those variables and output that determines the state of the variable it is assigned to. This set of functions in effect determines a topology (connectivity) on the set of variables, which then become nodes in a network. Usually, the dynamics of the system is taken as a discrete time series where the state of the entire network at time t+1 is determined by evaluating each variable’s function on the state of the network at time t. This may be done synchronously or asynchronously.

Boolean networks have been used in biology to model regulatory networks. Although Boolean networks are a crude simplification of genetic reality where genes are not simple binary switches, there are several cases where they correctly capture the correct pattern of expressed and suppressed genes. The seemingly mathematical easy (synchronous) model was only fully understood in the mid 2000s.

Boolean networks were used (e.g.) in Ordinary Differential Equations and Boolean Networks in Application to Modelling of 6-Mercaptopurine Metabolism (Apr 2017).

See also:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Capsule networks

In 2011 Geoff Hinton – building on his contemplation of the subject, spanning decades (Understanding Hinton’s Capsule Networks. Part I: Intuition  [local copy]) – introduced his thoughts on “capsules” in his paper Transforming Auto-Encoders. A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. The use of capsules introduced a simple and elegant new building block that could be used in deep learning to better model hierarchical relationships inside of internal knowledge representation of a neural network.


[Image source. Click image to open in new window.]

Capsule networks were introduced by Sabour, Frosst and Hinton in Dynamic Routing Between Capsules (Nov 2017). In capsule networks, dynamic routing iteratively computes weights over inputs by the inner product between inputs and a weighted sum of inputs. Varying with the inputs, the weighted sum detects informative inputs; therefore it can be interpreted as a dynamic weight vector from the perspective of self-attention. “We expect the dynamic weight vector to give rise to flexibility in self-attention since it can adapt to given sentences even after training.”


[Image source. Click image to open in new window.]

See also Hinton et al., Matrix Capsules with EM Routing (ICLR 2018):

“A capsule is a group of neurons whose outputs represent different properties of the same entity. Each layer in a capsule network contains many capsules. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 matrix which could learn to represent the relationship between that entity and the viewer (the pose). A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by trainable viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated for each image using the Expectation-Maximization [EM] algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The transformation matrices are trained discriminatively by backpropagating through the unrolled iterations of EM between each pair of adjacent capsule layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state-of-the-art. Capsules also show far more resistance to white box adversarial attacks than our baseline convolutional neural network.”


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source"The capsule network is much better than other models at telling that images in top and bottom rows belong to the same classes, only the view angle is different. The latest papers decreased the error rate by a whopping 45%."  [Quotation source (local copy)].  Click image to open in new window.]

Additional reading

Chain rule

In calculus, the chain rule is a formula for computing the derivative of the composition of two or more functions. That is, if $\small f$ and $\small g$ are functions, then the chain rule expresses the derivative of their composition $\small f \circ g$ (the function which maps $\small x$ to $\small f(g(x))$) in terms of the derivatives of $\small f$ and $\small g$ and the product of functions as follows:

    $\small (f\circ g)'=(f'\circ g) \cdot g'$.

This may equivalently be expressed in terms of the variable. Let $\small F = f \circ g$, or equivalently, $\small F(x) = f(g(x))$ for all $\small x$. Then one can also write

    $\small F'(x)=f'(g(x))g'(x)\ $.

The chain rule may be written in Leibniz’s notation in the following way. If a variable $\small z$ depends on the variable $\small y$, which itself depends on the variable $\small x$, so that $\small y$ and $\small z$ are therefore dependent variables, then $\small z$, via the intermediate variable of $\small y$, depends on $\small x$ as well. The chain rule then states,

    $\large \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}\ $.

The two versions of the chain rule are related; if $\small z = f(y)$ and $\small y = g(x)$, then

    $\large \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} \small = f'(y)g'(x) = f'(g(x))g'(x)\ $.

In integration, the counterpart to the chain rule is the substitution rule.


See also:

Circulating Tumor Cells (CTCs)

Circulating tumor cells (CTCs) are cells that have shed into the vasculature or lymphatics from a primary tumor and are carried around the body in the blood circulation. CTCs constitute seeds for the subsequent growth of additional tumors (metastases) in distant organs, a mechanism that is responsible for the vast majority of cancer-related deaths. The detection and analysis of CTCs can assist early patient prognoses and determine appropriate tailored treatments.

Circulating Tumor Cells (CTC) are rare tumor cells that have been shed from primary or secondary tumors. CTCs have been investigated as diagnostic, prognostic and predictive biomarkers in many types of cancer. Prior studies have demonstrated that enumeration of CTC is a robust independent prognostic factor of progression free and overall survival in patients with early and metastatic breast cancer. CTC studies have accumulated a high level of clinical validity, especially in breast, lung, prostate and colorectal cancers. CTC, as well as other circulating tumor markers, have the appealing advantages over tissue biopsy of

  • ease of collection,
  • serial evaluation, and
  • interrogation of the entire tumor burden instead of just a limited part of the tumor.

Advances have been recently made in phenotyping and genotyping of CTC, which should provide insights into the predictive role of CTC for sensitivity or resistance to therapies. In addition, CTC phenotypic marker changes during the course of treatment may serve as pharmacodynamic monitoring tools. Therefore, CTC may be considered “liquid biopsies,” providing prognostic and predictive clinical information as well as additional understanding of tumor heterogeneity.


Nanoscale Blood Diagnostics

Related to the topic of circulating tumor cells are methods for screening health status on nanoscale sampling of peripheral blood (with applications in diagnosis, treatment, precision medicine, etc). Examples include:

  • Theranostic Nanoparticles for Tracking and Monitoring Disease State (Nov 2017)   [Summary]The development of novel nanoparticles consisting of both diagnostic and therapeutic components has increased over the past decade. These “theranostic” nanoparticles have been tailored toward one or more types of imaging modalities and have been developed for optical imaging, magnetic resonance imaging, ultrasound, computed tomography, and nuclear imaging comprising both single-photon computed tomography and positron emission tomography. In this review, we focus on state-of-the-art theranostic nanoparticles that are capable of both delivering therapy and self-reporting/tracking disease through imaging. We discuss challenges and the opportunity to rapidly adjust treatment for individualized medicine.   [media]

  • Direct electrical quantification of glucose and asparagine from bodily fluids using nanopores (Oct 2018)   [Summary]Crucial steps in the miniaturisation of biosensors are the conversion of a biological signal into an electrical current as well as the direct sampling of bodily fluids. Here we show that protein sensors in combination with a nanopore, acting as an electrical transducer, can accurately quantify metabolites in real time directly from nanoliter amounts of blood and other bodily fluids. Incorporation of the nanopore into portable electronic devices will allow developing sensitive, continuous, and non-invasive sensors for metabolites for point-of-care and home diagnostics.   [media]

  • The Human In Vivo Biomolecule Corona onto PEGylated Liposomes: A Proof‐of‐Concept Clinical Study (Nov 2018)  [Summary]The self‐assembled layered adsorption of proteins onto nanoparticle (NP) surfaces, once in contact with biological fluids, is termed the “protein corona” and it is gradually seen as a determinant factor for the overall biological behavior of NPs. Here, the previously unreported in vivo protein corona formed in human systemic circulation is described. The human‐derived protein corona formed onto PEGylated doxorubicin‐encapsulated liposomes (Caelyx) is thoroughly characterized following the recovery of liposomes from the blood circulation of ovarian carcinoma patients. In agreement with previous investigations in mice, the in vivo corona is found to be molecularly richer in comparison to its counterpart ex vivo corona. The intravenously infused liposomes are able to scavenge the blood pool and surface‐capture low‐molecular‐weight, low‐abundance plasma proteins that cannot be detected by conventional plasma proteomic analysis. This study describes the previously elusive or postulated formation of protein corona around nanoparticles in vivo in humans and illustrates that it can potentially be used as a novel tool to analyze the blood circulation proteome.  [media]

Classification vs. Regression vs. Clustering

The introductory chapter in the excellent textbook Bishop CM (2006) Pattern Recognition and Machine Learning gives a clarification (p. 3):

    "Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems. Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the yield in a chemical manufacturing process in which the inputs consist of the concentrations of reactants, the temperature, and the pressure.

    "In other pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.

    "Finally, the technique of reinforcement learning (Sutton and Barto, 1998) is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward. ..."

Regression and classification are supervised learning approaches that map an input to an output based on example input-output pairs, while clustering is a unsupervised learning approach.

  • Regression predicts continuous valued output. The Regression analysis is the statistical model which is used to predict the numeric data instead of labels. It can also identify the distribution trends based on the available data or historic data. Predicting a person’s income from their age, education is example of regression task.

    If the prediction value tends to be a continuous value, then it is a regression problem. Example: given the location, construction date, number of rooms, acreage, construction type, etc. as features, and predicting the expected cost of a house.

  • Classification predicts discrete number of values. In classification the data is categorized under different labels according to some parameters and then the labels are predicted for the data. Classifying email as either spam or not spam is an example of classification problem.

    If the prediction value tends to be categorical (e.g., binary: yes|no; positive|negative), then it is a classification problem. Example : Given a sentence, predict whether it is negative or a positive review.

  • Clustering is the task of partitioning the dataset into groups, called clusters. The goal is to split up the data in such a way that points within single cluster are very similar and points in different clusters are different. It determines grouping among unlabeled data.

    Grouping a set of points to given number of clusters. Example: given 3, 4, 8, 9 setting the number of clusters as two clusters, then the clustering algorithm might divide the given set into cluster 1 (3, 4) and cluster 2 (8, 9).



See Proofs.


See Classification vs. Regression vs. Clustering.

Computational linguistics

Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions.

See also Linguistics | Semantics.


See Proofs.

Constraint graphs

In constraint satisfaction research of artificial intelligence and operations research, constraint graphs and hypergraphs are used to represent relations among constraints in a constraint satisfaction problem. A constraint graph is a special case of a factor graph, which allows for the existence of free variables.

This YouTube video shows how to draw constraints on a graph (graph constraints), which explains the concept pretty well:

The basic premise is:

  • Let $\small y$ be your variable of interest
  • Plot your (here, linear) equations
  • We are interested in the region bounded by the equations ($\small y \leq x…$ or $\small y \geq x…$):


[Image source. Click image to open in new window.]

The use of constraint graphs for finding optimal Bayesian networks (graphs) is an intuitive method for incorporating rich prior knowledge into the structure learning tasks. Finding the Optimal Bayesian Network Given a Constraint Graph (Jul 2017) [code provides an example of the use of constraint graphs. Their accompanying Jupyter notebook provides toy examples.


[Image source. Click image to open in new window.]

See also:

Content words; Function words

In NLP parts of speech (POS) content words are words that name objects of reality and their qualities. They signify:

  • actual living things: dog, cat, etc.
  • family members: mother, father, sister, etc.
  • natural phenomena: snow, sun, etc.
  • common actions: do, make, come, eat, etc.
  • characteristics: young, cold, dark, etc.
  • etc.

Content words consist mostly of nouns, lexical verbs and adjectives, but certain adverbs can also be content words.

Content words contrast with function words, which are words that have very little substantive meaning and primarily denote grammatical relationships between content words, such as:

  • prepositions: in, out, under, etc.
  • pronouns: I, you, he, who, etc.
  • conjunctions: and, but, till, as, etc.
  • etc.

Content words and function words are used in the following papers.

  • Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation, which proposed a new model with a mechanism called Self-Adaptive Control of Temperature (SACT) to control the softness of attention by means of an “attention temperature.” … the model could learn a soft distribution of attention weights which was more uniform for generating function words, and a hard distribution which is sparser for generating content words. …

  • Google Brain also employed content words and function words in their improved language model, The Importance of Generation Order in Language Modeling. “This paper studies the influence of token generation order on model quality via a novel two-pass language model that produces partially-filled sentence “templates” and then fills in missing tokens. … We find the most effective strategy generates function words in the first pass followed by content words in the second.”

Convolutional neural networks (CNN)

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks, most commonly applied to analyzing visual imagery.

CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. They are also known as shift invariant or space invariant artificial neural networks, based on their shared-weights architecture and translation invariance characteristics.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.

CNNs have applications in image and video recognition, recommender systems and natural language processing.


A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers and normalization layers.



[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

The description of the process in neural networks as a convolution is by convention: mathematically, it is a cross-correlation rather than a convolution. This only has significance for the indices in the matrix, and thus which weights are placed at which index.

  • Convolutional layers. Convolutional layers apply a convolution operation to the input, passing the result to the next layer. The convolution emulates the response of an individual neuron to visual stimuli.

    Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features as well as classify data, it is not practical to apply this architecture to images. A very high number of neurons would be necessary, even in a shallow (opposite of deep) architecture, due to the very large input sizes associated with images, where each pixel is a relevant variable. For instance, a fully connected layer for a (small) image of size $\small [100 \times 100]$ has 10,000 weights for each neuron in the second layer. The convolution operation brings a solution to this problem as it reduces the number of free parameters, allowing the network to be deeper with fewer parameters. For instance, regardless of image size, tiling regions of size $\small [5 \times 5]$, each with the same shared weights, requires only 25 learnable parameters. In this way, it resolves the vanishing or exploding gradients problem in training traditional multi-layer neural networks with many layers by using backpropagation.


    [Image source. Click image to open in new window.]


    [Image source. Click image to open in new window.]


    [Image sources: here, and here. Click image to open in new window.]


    [Image source. Click image to open in new window.]

  • Pooling layers. Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters at one layer into a single neuron in the next layer. For example, max pooling uses the maximum value from each of a cluster of neurons at the prior layer. Another example is average pooling, which uses the average value from each of a cluster of neurons at the prior layer.

  • Fully connected layers. Fully connected layers connect every neuron in one layer to every neuron in another layer. It is in principle the same as the traditional multi-layer perceptron neural network (MLP).

  • Receptive fields. In neural networks, each neuron receives input from some number of locations in the previous layer. In a fully connected layer, each neuron receives input from every element of the previous layer. In a convolutional layer, neurons receive input from only a restricted subarea of the previous layer. Typically the subarea is of a square shape (e.g., size $\small [5 \times 5]$). The input area of a neuron is called its receptive field. So, in a fully connected layer, the receptive field is the entire previous layer. In a convolutional layer, the receptive area is smaller than the entire previous layer.


    [Image source. Click image to open in new window.]

  • Weights. Each neuron in a neural network computes an output value by applying some function to the input values coming from the receptive field in the previous layer. The function that is applied to the input values is specified by a vector of weights and a bias (typically real numbers). Learning in a neural network progresses by making incremental adjustments to the biases and weights. The vector of weights and the bias are called a filter and represents some feature of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons share the same filter. This reduces memory footprint because a single bias and a single vector of weights is used across all receptive fields sharing that filter, rather than each receptive field having its own bias and vector of weights.

CNN - additional reading

There is a plethora of material on the web covering CNNs. Here are some good introductions:

See also:

Coreference resolution

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for a lot of higher level NLP tasks that involve natural language understanding such as document summarization, question answering, and information extraction.


[Image source. Click image to open in new window.]


See Proofs.


In linguistics, a corpus (plural corpora), or text corpus, is a large and structured collection of texts – nowadays, usually electronically processed and stored. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language domain, and natural language processing.

Although last updated Nov 2016, Corposaurus provides a directory of biomedical corpora, with links to papers when possible.

Cost (loss) function

The cost function is used to define and measure the error of the model. The hypothesis is usually presented as

$\ \ \ \ \small h_\theta(x) = \theta_0 + \theta_1x$

The theta values ($\small \theta$) are the parameters.

The cost function is given by:

$\ \ \ \ \small \begin{align} J(\theta_0\theta_1) = \frac{1}{2m} \sum_{i=1}^m(h_0(x^{(i)})-y^{(i)})^2 ènd{align}$  [← squared error cost function]


  • $\small h(x)$ is the prediction
  • $\small y$ is the actual value
  • $\small m$ is the number of rows in the training set

Cost is a synonym for loss: a measure of how far a model’s predictions are from its label (phrased more pessimistically, a measure of how bad the model is). To determine this value, a model must define a loss function. For example, linear regression models typically use mean squared error for a loss function, while logistic regression models use log loss (also known as cross entropy loss or logistic loss). [Cross entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.]

$\ \ \ \ \small \begin{align} MSE = \frac{1}{n} \sum_{i=1}^n(Y_i-\hat{Y}_i)^2 ènd{align}$   [← mean squared error]


$\ \ \ \ \small \begin{align} MSE = \frac{1}{n} \sum_{i=1}^n(y_i-(mx_i + b))^2 ènd{align}$


$\ \ \ \ \small \begin{align} MSE = \frac{1}{2m} \sum_{i=1}^m(h_0(x^{(i)})-y^{(i)})^2 ènd{align}$


  • $\small n$ is the total number of observations (data points)
  • $\small \frac{1}{N} \sum_{i=1}^n$ is the mean
  • $\small y_i$ is the actual value of an observation
  • $\small mx_i+b$ is our prediction

Unfortunately for logistic regression we can’t (or at least shouldn’t) use the MSE cost function, as we did for linear regression (above). For an explanation why see Chapter 3 of Michael Neilson’s deep learning book. In short, it’s because our prediction function is non-linear, due to sigmoid transform. Squaring this prediction, as we do in MSE, results in a non-convex function with many local minima: if our cost function has many local minima, gradient descent may not find the optimal global minimum.

Instead of mean squared error (MSE), we use a cost function called cross entropy, also known as log loss. Cross entropy loss can be divided into two separate cost functions: one for $\small y=1$ and one for $\small y=0$.

$\ \ \ \ \small \begin{align} J(\theta) = \frac{1}{m} \sum_{i=1}^m Cost(h_0(x^{(i)}),y^{(i)}) ènd{align}$

$\ \ \ \ \small \begin{cases} Cost(h_0(x),y) = -log(h_0(x)) & \text{if y=1} \\ Cost(h_0(x),y) = -log(1-h_0(x)) & \text{if y=0} ènd{cases}$

The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for $\small y=1$ and $\small y=0$. These smooth monotonic functions (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost:

[Image from Andrew Ng’s slides on logistic regression]

The key thing to note is the cost function penalizes confident and wrong predictions more than it rewards confident and right predictions! The corollary is increasing prediction accuracy (closer to 0 or 1) has diminishing returns on reducing cost due to the logistic nature of our cost function.

Compressing the two logistic regression equations, above, into a single equation:

$\ \ \ \ \small \begin{align} J(\theta) = -\frac{1}{m} \sum_{i=1}^m [y^{(i)}log(h_0(x^{(i)})) + (1-y^{(i)})log(1-h_\theta(x^{(i)}))] ènd{align}$

Multiplying by $\small y$ and $\small (1-y)$ in the above equation is a sneaky trick that let’s us use the same equation to solve for both $\small y=1$ and $\small y=0$ cases. If $\small y=0$, the first side cancels out. If $\small y=1$, the second side cancels out. In both cases we only perform the operation we need to perform.

Vectorized cost function:

$\ \ \ \ \small h = g(X\theta)$
$\ \ \ \ \small \begin{align} J(\theta) = \frac{1}{m} \cdot(-y^Tlog(h) - (1-y)^Tlog(1-h)) ènd{align}$

Cost function resources:

Cross entropy loss

Cross entropy loss (or log loss) measures the performance of a classification model whose output is a probability value between 0 and 1.

See also:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See directed acyclic graph.


See Reasoning.

Deep learning

In recent years, a machine learning method called “deep learning” has gained huge attraction, obtaining astonishing results in broad applications such as pattern recognition, speech recognition, computer vision, and natural language processing. Deep learning learns data representations through the use of multi-layered artificial neural networks as opposed to task-specific algorithms. For reviews on deep learning, see:


See Proofs.

Degree matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Adjacency Matrix.

Dependency parsing

See Syntactic/Dependency parsing.

Directed acyclic graph (DAG)

DAG.png   [Source]

In mathematics and computer science, a directed acyclic graph (DAG), is a finite directed graph with no directed cycles. That is, it consists of finitely many vertices and edges, with each edge directed from one vertex to another, such that there is no way to start at any vertex $\small v$ and follow a consistently-directed sequence of edges that eventually loops back to $\small v$ again. Equivalently, a DAG is a directed graph that has a topological ordering, a sequence of the vertices such that every edge is directed from earlier to later in the sequence.

  • acyclic = “non-circular:” moving from node to node by following the edges, you will never encounter the same node for the second time

DAG can model many different kinds of information. A spreadsheet can be modeled as a DAG, with a vertex for each cell and an edge whenever the formula in one cell uses the value from another; a topological ordering of this DAG can be used to update all cell values when the spreadsheet is changed. Similarly, topological orderings of DAG can be used to order the compilation operations in a makefile. …

In computer science and mathematics, a directed acyclic graph (DAG) is a graph that is directed and without cycles connecting the other edges. This means that it is impossible to traverse the entire graph starting at one edge. The edges of the directed graph only go one way. The graph is a topological sorting, where each node is in a certain order.

In graph theory, a graph is a series of vertexes connected by edges. In a directed graph, the edges are connected so that each edge only goes one way. A directed acyclic graph means that the graph is not cyclic, or that it is impossible to start at one point in the graph and traverse the entire graph. Each edge is directed from an earlier edge to a later edge. This is also known as a topological ordering of a graph.

A spreadsheet may be represented as a directed acyclic graph, with each cell a vertex and an edge connected a cell when a formula references another cell. Other applications include scheduling, circuit design and Bayesian networks.

In mathematics – and more specifically, in graph theory – a tree is an undirected graph in which any two vertices are connected by exactly one path.


Additional Reading

Document relevance ranking

Document relevance ranking, also known as ad hoc retrieval (Harman, 2005), is the task of ranking documents from a large collection using only the query and the text of each document only. Restated, ad hoc retrieval is a standard retrieval task in which the user specifies their information need through a query which initiates a search (executed by the information system) for documents which are likely to be relevant to the user.

This contrasts with standard information retrieval (IR) systems that rely on text based signals in conjunction with network structure and/or user feedback. Text-based ranking is particularly important when:

  • click-logs do not exist or are small, and
  • the network structure of the collection is non-existent or not informative for query-focused relevance.

Examples include various domains in digital libraries, e.g. patents or scientific literature, enterprise search, and personal search.

Source, references: Deep Relevance Ranking Using Enhanced Document-Query Interactions, which explored several new models for document relevance ranking, building upon the Deep Relevance Matching Model (DRMM) of Guo et al. (2016). …

See also: Introduction to Ad-hoc Retrieval.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Eigenvalues, Eigenvectors


Embeddings (word embeddings);
Language models


Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP), where words or phrases from a vocabulary are mapped to vectors of real numbers. Sebastian Ruder provides a good overview; see also this excellent post, Introduction to Word Embeddings.

Word embeddings are widely used in predictive NLP modeling, particularly in deep learning applications (Word Embeddings: A Natural Language Processing Crash Course). Word embeddings enable the identification of similarities between words and phrases, on a large scale, based on their context. These word vectors can capture semantic and lexical properties of words, even allowing some relationships to be captured algebraically; e.g. (analogous to 12 - 9 + 2 = 5),

    vBerlin - vGermany + vFrance ~ vParis
    vking - vman + vwoman ~ vqueen.

Aside: addition and multiplication are commutative; subtraction and division are not commutative.

12 + 9 + 2 = 23
12 + (9 + 2) = (12 + 9) + 2 = 23   (← addition and multiplication are also associative)
12 - 9 + 2 = 5
12 - (9 + 2) = 1
(12 - 9) + 2 = 5
12 - (9 - 2) = 5
12 - 9 - 2 = 1
12 - 2 - 9 = 1

Vectors, like scalars, can be added together. However, adding vectors and scalars is very different. The direction a vector has affects the final magnitude of the addition of the two vectors (see Adding Vectors). A negative sign in front of a vector means that its direction is changed by 180º degrees (see Subtracting Vectors). The direction and magnitude completely change when adding and subtracting vectors. Thus adding and subtracting vectors does not give the same resultant vector. See also Subtraction of Vectors, and vector spaces - linear algebra.

Associative property.  Addition: a + (b + c) = (a + b) + c. Multiplication: a(bc) = (ab)c.
Commutative property.  Addition: a + b = b + a. Multiplication: ab = ba.
Distributive property.  a(b + c) = ab + ac (“multiplication distributes over addition”).
Algebraic order of operations.  Operations are executed in this order: parentheses; exponents; multiplication and division; addition and subtraction. 6 + (4 ÷ 2)² × 8 = 38.


The original work for generating word embeddings was presented by Bengio et al. in 2003 (A Neural Probabilistic Language Model (which builds on his 2001 (NeurIPS 2000) “feature vectors” paper A Neural Probabilistic Language Model), who trained them in a neural language model together with the model’s parameters.

Despite the assertion by Sebastian Ruder in An Overview of Word Embeddings and their Connection to Distributional Semantic Models that Bengio coined the phrase “word embeddings” in his 2003 paper, the term “embedding” does not appear in that paper. The Abstract does state the concept, however: “We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.”. The correct attribution is likely Bengio’s similarly-named 2006 paper Neural Probabilistic Language Models, which states (bottom of p. 162): “Based on our discussion in the introduction, it makes sense to force the word embedding to be shared across all nodes.” The full reference is: Bengio Y. et al. (2006) Neural Probabilistic Language Models. StudFuzz 194:137-186.

Collobert and Weston demonstrated the power of pretrained word embeddings as a highly effective tool when used in downstream tasks in their 2008 paper A Unified Architecture for Natural Language Processing, while also announcing a neural network architecture upon which many current approaches are built. It was Mikolov et al. (2013), however, who popularized word embedding through the introduction of word2vec, a toolkit enabling the training and use of pretrained embeddings (Efficient Estimation of Word Representations in Vector Space).

Likewise – I’m being rather critical here – the 2008 Collobert and Weston paper, above, mentions “embedding” [but not “word embedding”, and cites Bengio’s 2001 (NeurIPS 2000) paper], while Mikolov’s 2013 paper does not mention “embedding” and cites Bengio’s 2003 paper.


Conceptually, word embeddings involve a mathematical embedding from a sparse, highly dimensional space with one dimension per word (a dimensionality proportional to the size of the vocabulary) into a dense, continuous vector space with a much lower dimensionality, perhaps 200 to 500 dimensions [Tomas Mikolov et al., Efficient Estimation of Word Representations in Vector Space (Sep 2013) – the inaugural “word2vec” paper, illustrated below].


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

In word2vec a distributed representation of a word is used. Take a vector with several hundred dimensions (say 1000). Each word is represented by a distribution of weights across those elements. So instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words. If the dimensions in a hypothetical word vector are labeled (there are no such pre-assigned labels in the algorithm of course), it might look a bit like this:


[Image source. Click image to open in new window.]

Such a vector comes to represent in some abstract way the “meaning” of a word. [Excerpted from The amazing power of word vectors  (local copy).]

Mikolov et al. updated their continuous skip-gram (word2vec ) method in Distributed Representations of Words and Phrases and their Compositionality (Oct 2013), which improved both the quality of the vectors and the training speed.


[Image source. Click image to open in new window.]

Skip-Gram with Negative Sampling (SGNS)

The blog post Word2Vec Tutorial Part 2 - Negative Sampling [local copy;  note also commented word2vec C code] – which summarizes the approach very well – mentions a computational challenge associated with the skip-gram model for word2vec:

“When you read Word2Vec Tutorial – The Skip-Gram Model you may have noticed something – it’s a huge neural network! In the example I gave, we had word vectors with 300 components, and a vocabulary of 10,000 words. Recall that the neural network had two weight matrices-a hidden layer and output layer. Both of these layers would have a weight matrix with 300 x 10,000 = 3 million weights each! Running gradient descent on a neural network that large is going to be slow. And to make matters worse, you need a huge amount of training data in order to tune that many weights and avoid over-fitting. millions of weights times billions of training samples means that training this model is going to be a beast.

“The authors of word2vec addressed these issues in their second paper [Distributed Representations of Words and Phrases and their Compositionality (Oct 2013)]. There are three innovations in this second paper:

  • Treating common word pairs or phrases as single “words” in their model.
  • Subsampling frequent words to decrease the number of training examples.
  • Modifying the optimization objective with a technique they called “Negative Sampling”, which causes each training sample to update only a small percentage of the model’s weights.

“It’s worth noting that subsampling frequent words and applying Negative Sampling not only reduced the compute burden of the training process, but also improved the quality of their resulting word vectors as well.”

From Section 2.2 in Distributed Representations of Words and Phrases and their Compositionality (Oct 2013) [heavily paraphrased here]:

“Negative Sampling. An alternative to the hierarchical softmax is Noise Contrastive Estimation (NCE) … NCE posits that a good model should be able to differentiate data from noise by means of logistic regression. … While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. We define negative sampling (NEG) by the objective … Thus the task is to distinguish the target word from draws from the noise distribution using logistic regression … The main difference between negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while negative sampling uses only samples. …”

For additional background and discussion on word embedding see:

Word embeddings are a particularly striking example of learning a representation, i.e. representation learning (Bengio et al., Representation Learning: A Review and New PerspectivesDeep Learning, NLP, and RepresentationsAn introduction to representation learning). [See also, this Glossary, Representation learning.]

GloVe: Global Vectors for word representation

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.


[Image source. Click image to open in new window.]


  • Wikipedia2Vec : An Optimized Implementation for Learning Embeddings from Wikipedia (Dec 2018) [project/code;  discussion: reddit  |  Hacker Newsembedding projector], by Studio Ousia + Japanese academic collaborators, presented Wikipedia2Vec, an open source tool for learning embeddings of words and entities from Wikipedia. This tool enabled users to easily obtain high-quality embeddings of words and entities from a Wikipedia dump with a single command. The learned embeddings could be used as features in downstream natural language processing (NLP) models. The tool can be installed via PyPI [Python]. The source code, documentation, and pretrained embeddings for 12 major languages can be obtained at GitHub.

    • Author comment (Ikuya Yamada, Hacker News, in response to “I’m a little confused, its just word2vec pretrained on content of wikipedia?” → “ Unlike word2vec, this tool learns embeddings of entities (i.e., entries in Wikipedia) as well as words. And although the model implemented in this tool is based on Word2vec’s skip-gram model, it is extended using two submodels (Wikipedia link graph model and anchor context model). Please refer to the documentation for details.”


  • Is “embedding” an action or a thing? Both. People talk about embedding words in a vector space (action) and about producing word embeddings (things). Common to both is the notion of embedding as a mapping from discrete objects to vectors. Creating or applying that mapping is an action, but the mapping itself is a thing.

  • Are embeddings high-dimensional or low-dimensional? It depends. A 300-dimensional vector space of words and phrases, for instance, is often called low-dimensional (and dense) when compared to the millions of words and phrases it can contain. But mathematically it is high-dimensional, displaying many properties that are dramatically different from what our human intuition has learned about 2- and 3-dimensional spaces.

  • Is an embedding the same as an embedding layer? No. An embedding layer is a part of neural network, but an embedding is a more general concept.

  • Source

Conceptualization of multidimensional word embeddings


[Image source. Click image to open in new window.]

In the figure above we are imagining that each dimension captures a clearly defined meaning. For example, if you imagine that the first dimension represents the meaning or concept of “animal”, then the remaining dimensions represent each word’s values for the other features (domesticated; pet; fluffy; …).  [Adapted from Introduction to Word Vectors.]

Shallow vs. Deep Models (Embeddings)

Multilayer, pretrained, contextual language models – including ELMo, ULMFiT, OpenAI GPT, and BERT – are deeper than shallow-trained methods such as support vector machines (SVM), logistic regression, and shallow word embedding approaches such as word2vec and GloVe.

Succinctly stated by Sebastian Ruder in NLP’s ImageNet Moment Has Arrived  (From Shallow to Deep Pre-Training subsection):

    "Word2vec and related methods are shallow approaches that trade expressivity for efficiency. Using word embeddings is like initializing a computer vision model with pretrained representations that only encode edges: they will be helpful for many tasks, but they fail to capture higher-level information that might be even more useful.

    "A model initialized with word embeddings needs to learn from scratch not only to disambiguate words, but also to derive meaning from a sequence of words. This is the core aspect of language understanding, and it requires modeling complex language phenomena such as compositionality, polysemy, anaphora, long-term dependencies, agreement, negation, and many more. It should thus come as no surprise that NLP models initialized with these shallow representations still require a huge number of examples to achieve good performance.

    "At the core of the recent advances of ULMFiT [Ruder is a coauthor], ELMo, and the OpenAI Transformer is one key paradigm shift: going from just initializing the first layer of our models to pretraining the entire model with hierarchical representations. If learning word vectors is like only learning edges, these approaches are like learning the full hierarchy of features, from edges to shapes to high-level semantic concepts."

Language Models

A recent and particularly exciting advance in NLP is the development of pretrained language models such as

  • ELMo  (released in February 2018, by Allen NLP),
  • ULMFiT  (May 2018, by and Aylien Ltd.),
  • GPT  (June 2018, by OpenAI),
    • GPT-2  (February 2019, by OpenAI),
  • BERT  (October 2018, by Google AI Language),
    • BioBERT  (Jan 2019, by Korea University),
    • SciBERT (Mar 2019, by Allen Institute for Artificial Intelligence),
    • XLNet (Jun 2019, by Carnegie Mellon University | Google Brain), and
  • UniLM  (May 2019, by Microsoft Research) [image]

Prior to 2018 almost all state of the art NLP solutions were highly specialized task-specific architectures. In contrast, in 2018 several task-agnostic architectures were published that achieved state of the art results across a wide range of competitive tasks, for the first time suggesting generalizable language understanding. All of these systems built on the language modeling objective: training a model to predict a word given its surrounding context. Because training examples were built from unlabeled corpora, much training data was available. ELMo trained representations with stacked bidirectional LSTMs, but still employed task-specific architectures on top of them. OpenAI GPT and BERT did away with this and instead trained task-agnostic Transformer (attentional) stacks that were fine-tuned together with a single dense layer for each downstream task. The latter mainly improved upon the former by joint conditioning on both preceding and following contexts. Critically, all systems allowed for contextualized word representations: they mapped each word occurrence to a vector, specifically considering the surrounding context. Much of their success was attributed to the ability to better disambiguate polysemous words in a given sentence. This representation approach is easily applicable for many NLP tasks, where inputs are usually sentences and context information is thus available. [Source: Section 2.2 in Learning Taxonomies of Concepts and not Words using Contextualized Word Representations: A Position Paper.]

Briefly summarizing the architectures associated with those models:

  • ELMo uses a shallow concatenation of independently trained left-to-right and right-to-left multilayer LSTMs – two (stacked) Bi-LSTM layers. [image]

  • OpenAI GPT employed Google’s Transformer architecture (a seq2seq based self-attention mechanism in place of RNNs (e.g., the stacked Bi-LSTMs used in ELMo).

    • The model architectures are different: ELMo uses a shallow concatenation of independently trained left-to-right and right-to-left multi-layer LSTMs (Bi-LSTMs), while GPT is a multi-layer transformer decoder.

    • The use of contextualized embeddings in downstream tasks are different: ELMo feeds embeddings into models customized for specific tasks as additional features, while GPT fine-tunes the same base model for all end tasks.

    • Generative pre-trained LM + task-specific fine-tuning has been proved to work in ULMFiT, where the fine-tuning happens in all layers gradually. ULMFiT focuses on training techniques for stabilizing the fine-tuning process.

  • BERT employed a deeply bidirectional, unsupervised language representation, pretrained using only a plain text corpus: Wikipedia.

    • The architecture employed by BERT is a bidirectional Transformer encoder, which demonstrates training efficiency and superior performance in capturing long-distance dependencies compared to a RNN architecture.

    • The bidirectional encoder is a standout feature that differentiates BERT from:

      • OpenAI GPT, a left-to-right Transformer; and,

      • ELMo, a concatenation of independently trained left-to-right and right-to-left LSTMs.

    • Relation between GPT and BERT.  Both models use virtually the same architecture. In fact, GPT  and BERT-Base  even use the same number of layers and dimensions. The only difference is that BERT  is bidirectional since it tries to fill in individual words given their context, whereas GPT  uses masked self-attention heads.”  [source]

Those papers demonstrated that pretrained language models can achieve state of the art results on a wide range of NLP tasks. [However, note (Kaiming He et al.) Rethinking ImageNet Pre-Training (Nov 2018).]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Character based LM:


[Image source. Click image to open in new window.]

fastText  (FastText ), a library for efficient learning of word representations and sentence classification, is an extension to word2vec. The model is described in Enriching Word Vectors with Subword Information (Jul 2016, updated Jun 2017).  “In this paper, we propose to learn representations for character $\small n$-grams, and to represent words as the sum of then-gram vectors.”  [See Section 3.2 in that paper.]

FastText, including character based representations, is well-described in the following blog posts.

  • FastText : Under the Hood  [local copy]

    • “Each word is represented as a bag of character $\small n$-grams in addition to the word itself, so for example, for the word matter, with $\small n = 3$, the fastText  representations for the character $\small n$-grams is $\small \text{<ma, mat, att, tte, ter, er>}$.  $\small <$ and $\small >$ are added as boundary symbols to distinguish the $\small n$-gram of a word from a word itself, so for example, if the word $\small \text{mat}$ is part of the vocabulary, it is represented as $\small \text{<mat>}$. This helps preserve the meaning of shorter words that may show up as $\small n$-grams of other words. Inherently, this also allows you to capture meaning for suffixes/prefixes.”

  • This two-part post [local copy]:

See also:


See Variational autoencoder.

Energy based models (EBM);
Energy based classification

TransE (NeurIPS 2013) – by Antoine Bordes et al. at the Université de Technologie de Compiègne – CNRS, and Jason Weston and Oksana Yakhnenko at Google – is a well known knowledge graph embedding method which embedded triples (entities and relations) into the same space where the difference between the head and the tail was approximately the relation. TransE adopted the principle of geometric translation, formally as $\small h + r ≈ t$. …

Since recent (late 2018 - early 2019) approaches to triplet classification / confidence scoring (refer here) are based on this TransE  paper, I will excerpt that section of the paper.

“Given a training set $\small {S}$ of triplets $\small (h,L,t)$  composed of two entities $\small h,t \in E$  (the set of entities) and a relationship $\small L \in L$ (the set of relationships), our model learns vector embeddings of the entities and the relationships. The embeddings take values in $\small \mathbb{R}^k$ ($\small k$ is a model hyperparameter) and are denoted with the same letters, in boldface characters. The basic idea behind our model is that the functional relation induced by the $\small L$-labeled edges corresponds to a translation of the embeddings, i.e. we want that $\small \boldsymbol{h + L \approx t}$ when $\small (h,L,t)$ holds ($\small \mathbf{t}$ should be a nearest neighbor of $\small \boldsymbol{h + L}$), while $\small \boldsymbol{h + L}$ should be far away from $\small \boldsymbol{t}$ otherwise. Following an energy-based framework, the energy of a triplet is equal to $\small d(\boldsymbol{h + L,t}$) for some dissimilarity measured, which we take to be either the $\small L_1$ or the $\small L_2$-norm.

“To learn such embeddings, we minimize a margin-based ranking criterion over the training set:

    $\begin{align} {L} = \sum_{(h,L,t) \in S} \ \sum_{(h^{\prime},L,t^{\prime}) \in S^{\prime}_{(h,L,t)}} [\gamma + d(\mathbf{h+L,t}) - d(\boldsymbol{h^{\prime} + L,t^{\prime}})]_{+} ènd{align}$      [1]

where $\small [x]_+$ denotes the positive part of $\small x, \gamma > 0$ is a margin hyperparameter, and

    $S^{\prime}_{(h,L,t)} = \{(h^{\prime},L,t) \vert h^{\prime} \in \textit{E}\} \cup \{(h,L,t^{\prime}) \vert t^{\prime} \in \textit{E}\}$.      [2]

The set of corrupted triplets, constructed according to Equation [2], is composed of training triplets with either the head or tail replaced by a random entity (but not both at the same time). The loss function [1] favors lower values of the energy for training triplets than for corrupted triplets, and is thus a natural implementation of the intended criterion. Note that for a given entity, its embedding vector is the same when the entity appears as the head or as the tail of a triple.”

This approach was also used by Richard Socher and colleagues in Reasoning With Neural Tensor Networks for Knowledge Base Completion (NeurIPS 2013).

The energy based approach to relation classification originates with energy based machine learning models (cited in the TransE  paper)  [I recently posted some of those links on reddit]. For example, Restricted Boltzmann Machines (RBM)  [local copy] – from Yoshua Bengio’s group at MILA – gives a good introduction (as follows).

Energy-based models associate a scalar energy to each configuration of the variables of interest. Learning corresponds to modifying that energy function so that its shape has desirable properties. For example, we would like plausible or desirable configurations to have low energy. Energy-based probabilistic models define a probability distribution through an energy function, as follows:

    $p(x) = \large \frac{e^{-E(x)}}{Z}$.

The normalizing factor $\small Z$ is called the partition function by analogy with physical systems.

    $\small \begin{align} Z = \sum_x e^{-E(x)} ènd{align}$

An energy-based model can be learnt by performing (stochastic) gradient descent on the empirical negative log-likelihood of the training data. As for the logistic regression we will first define the log-likelihood and then the loss function as being the negative log-likelihood.

    $\small \begin{align} {L}(\theta, {D}) = \frac{1}{N} \sum_{x^{(i)} \in {D}} \log\ p(x^{(i)})\\ \\ L (\theta, {D}) = - {L} (\theta, {D}) ènd{align}$

using the stochastic gradient $-\frac{\partial \log p(x^{(i)})}{\partial \theta}$, where $\small \theta$ are the parameters of the model.

Energy based triple classification (triple confidence)

Energy based triple classification is used to evaluate triple confidence / trustworthiness in the following works.

See also:

Euclidean space; Non-Euclidean space

In geometry, Euclidean space encompasses the two-dimensional Euclidean plane, the three-dimensional space of Euclidean geometry, and certain other spaces. It is named after the Ancient Greek mathematician Euclid of Alexandria. The term “Euclidean” distinguishes these spaces from other types of spaces considered in modern geometry. Euclidean spaces also generalize to higher dimensions.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Euclidean space is a mathematical construct that encompasses the line, the plane, and three-dimensional space as special cases. Its elements are called vectors. Vector space (also called linear space) is a collection of objects called vectors, which may be added together and multiplied (“scaled”) by numbers, called scalars.

Vectors can be understood in various ways:

  • as arrows,
  • as quantities with magnitude and direction
  • as displacements, or
  • as points.

Euclidean vectors are an example of a vector space. They represent physical quantities such as forces: any two forces (of the same type) can be added to yield a third, and the multiplication of a force vector by a real multiplier is another force vector. In the same vein, but in a more geometric sense, vectors representing displacements in the plane or in three-dimensional space also form vector spaces.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Vectors are abstract mathematical objects with particular properties, which in some cases can be visualized as arrows. However, vectors in vector spaces do not necessarily have to be arrow-like objects. Vector spaces are the subject of linear algebra and are well characterized by their dimension, which, roughly speaking, specifies the number of independent directions in the space.

Non-Euclidean space

A Euclidean space has some number of real-valued dimensions and “dense” points: there is a notion of an “average” of two points, and Euclidean distance is based on the locations of points in such a space.


[Image source. Click image to open in new window.]

A non-Euclidean space is based on the properties of points, but not their “location” in a space. A way to describe the differences between these geometries is to consider two straight lines indefinitely extended in a two-dimensional plane that are both perpendicular to a third line:

  • In Euclidean geometry, the lines remain at a constant distance from each other (meaning that a line drawn perpendicular to one line at any point will intersect the other line and the length of the line segment joining the points of intersection remains constant) and are known as parallels.

  • In hyperbolic geometry, they “curve away” from each other, increasing in distance as one moves further from the points of intersection with the common perpendicular; these lines are often called ultraparallels.

  • In elliptic geometry, the lines “curve toward” each other and intersect.


[Image source. Click image to open in new window.]


[Image source (slide 14). Click image to open in new window.]

Euler’s formula

See my blog post, Complex Numbers, $\small \pi$, Fourier Transform, Eigenvectors.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Fourier transform

Refer here  [Graph Signal Processing: Background].


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.

Functional genomics

The standard interpretation for functional genomics is described on Wikipedia:

Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic and transcriptomic projects (such as genome sequencing projects and RNA sequencing) to describe gene (and protein) functions and interactions. Unlike structural genomics, functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression and protein-protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional “gene-by-gene” approach.

However, I broaden my use use of that term, in the sense of

How is the information contained in our genome expressed, and what are the functional consequences of that the expression of that information?

My use of the term “functional genomics” thus spans genomics, molecular genetics, biochemistry, and bioinformatics. I am particularly fascinated by how the information in our genome is encoded and manifested.

Individual variations in our genetic / epigenetic makeup determine who we are, and how we respond to both

  • extrinsic factors:
    • the environment (environmental stress: radiation, heat, famine, anoxia, toxins, pollutants, chemicals, …)
    • pathogens (bacterial, viral)


  • intrinsic factors:
    • metabolism (e.g. different functional isotypes of proteins, that affect how we process chemicals and drugs, relevant e.g. to toxicology and cancer chemotherapy …)
    • mutation (spontaneous: ageing; induced: environmental in nature but affecting the individual)

Function words

See Content words; Function words.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Game theory

Game theory is the study of mathematical models of strategic interaction between rational decision-makers. It has applications in all fields of social science, as well as in logic and computer science. Originally, it addressed zero-sum games, in which one person’s gains result in losses for the other participants. Today, game theory applies to a wide range of behavioral relations, and is now an umbrella term for the science of logical decision making in humans, animals, and computers.

Modern game theory began with the idea regarding the existence of mixed-strategy equilibria in two-person zero-sum games and its proof by John von Neumann. Von Neumann’s original proof used the Brouwer fixed-point theorem on continuous mappings into compact convex sets, which became a standard method in game theory and mathematical economics. His paper was followed by the 1944 book Theory of Games and Economic Behavior, co-written with Oskar Morgenstern, which considered cooperative games of several players. The second edition of this book provided an axiomatic theory of expected utility, which allowed mathematical statisticians and economists to treat decision-making under uncertainty.

Game theory was developed extensively in the 1950s by many scholars. It was later explicitly applied to biology in the 1970s, although similar developments go back at least as far as the 1930s. Game theory has been widely recognized as an important tool in many fields. As of 2014, with the Nobel Memorial Prize in Economic Sciences going to game theorist Jean Tirole, eleven game theorists have won the economics Nobel Prize. John Maynard Smith was awarded the Crafoord Prize for his application of game theory to biology.


Image sources:  leftright.  [Click image to open in new window.]


See also:

Gated units / Gated recurrent units (GRU)

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Sep 2014) by Kyunghyun Cho et al. [Yoshua Bengio] described a RNN encoder-decoder for statistical machine translation, that introduced a new type of hidden unit ($\small {f}$ in the equation, below) that was motivated by the LSTM unit but was much simpler to compute and implement.

    Recurrent Neural Networks. A recurrent neural network (RNN) is a neural network that consists of a hidden state $\small \mathbf{h}$ and an optional output $\small \mathbf{y}$ which operates on a variable-length sequence $\small \mathbf{x} = (x_1, \ldots, x_T)$. At each time step $\small t$, the hidden state $\small \mathbf{h_{\langle t \rangle}}$ of the RNN is updated by

      $\small \mathbf{h_{\langle t \rangle}} = f (\mathbf{h_{\langle t-1 \rangle}}, x_t)$,
    where $\small {f}$ is a non-linear activation function. $\small {f}$ may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).

    Hidden Unit that Adaptively Remembers and Forgets. ... we also propose a new type of hidden unit ($\small f$ in the equation, above) that has been motivated by the LSTM unit but is much simpler to compute and implement. [The LSTM unit has a memory cell and four gating units that adaptively control the information flow inside the unit, compared to only two gating units in the proposed hidden unit.]

    This figure shows the graphical depiction of the proposed hidden unit:

    [Image source. Click image to open in new window.]

    "In this formulation [see Section 2.3 in Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation for details], when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation.

    "On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013). As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active. ..."

That gated hidden unit is the gated recurrent unit (GRU).

In a highly-cited follow-on paper, Neural Machine Translation by Jointly Learning to Align and Translate (Sep 2014; updated May 2016), Dzmitry Bahdanau, KyungHyun Cho and Yoshua Bengio employed their “gated hidden unit” (i.e. GRU) in a neural machine translation task. Their model consisted of a forward and backward pair of RNN (BiRNN) for the encoder, and a decoder that emulated searching through a source sentence while decoding a translation. From Appendix A in that paper:

“For the activation function $\small f$ of an RNN, we use the gated hidden unit recently proposed by Cho et al. (2014a). The gated hidden unit is an alternative to the conventional simple units such as an element-wise $\small \text{tanh}$. This gated unit is similar to a long short-term memory (LSTM) unit proposed earlier by Hochreiter and Schmidhuber (1997), sharing with it the ability to better model and learn long-term dependencies. This is made possible by having computation paths in the unfolded RNN for which the product of derivatives is close to 1. These paths allow gradients to flow backward easily without suffering too much from the vanishing effect. It is therefore possible to use LSTM units instead of the gated hidden unit described here, as was done in a similar context by Sutskever et al. (2014).”


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

GRUs have fewer parameters than LSTM, as they lack an output gate. A GRU has two gates (an update gate and reset gate), while a RNN has three gates (update, forget and output gates). The GRU update gate decides on how much of information from the past should be let through, while the reset gate decides on how much of information from the past should be discarded. What motivates this? Although RNNs can theoretically capture long-term dependencies, they are actually very hard to train to do this. GRUs are designed to have more persistent memory, thereby making it easier for RNNs to capture long-term dependencies. Even though computationally a GRU is more efficient than an LSTM network, due to the reduction of gates it still comes second to LSTM network in terms of performance. Therefore, GRUs are often used when we need to train faster, and we don’t have much computational power.

  • Recurrent models feature flexibility and expressivity that come at a cost. Empirical experience shows that RNNs are often more delicate to tune and more brittle to train than standard feedforward architectures. Recurrent architectures can also introduce significant computational burden compared with feedforward implementations. In response to these shortcomings, a growing line of empirical research demonstrates that replacing recurrent models by feedforward models is effective in important applications including translation, speech synthesis, and language modeling [John Miller and Moritz Hardt, When Recurrent Models Don’t Need To Be Recurrent (May 2018)].

  • In a superb companion blog post, “When Recurrent Models Don’t Need To Be Recurrent” coauthor John Miller discusses two very interesting papers that discuss the limitations of LSTMs and the effectiveness of GRUs for language modeling:

See also:

Gaussian functions

In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the form:

$\large {f(x) = ae^{-{\frac {(x-b)^{2}}{2c^{2}}}} }$

for arbitrary real constants $\small a$, $\small b$ and $\small c$. It is named after the mathematician Carl Friedrich Gauss. The graph of a Gaussian is a characteristic symmetric “bell curve” shape. The parameter $\small a$ is the height of the curve’s peak, $\small b$ is the position of the center of the peak and $\small c$ (the standard deviation, sometimes called the Gaussian RMS width) controls the width of the “bell”.

Gaussian functions are often used to represent the probability density function of a normally distributed random variable with expected value $\small \mu = b$ and variance $\small \sigma^2 = c^2$. In this case, the Gaussian is of the form:

$\large g(x)=\frac{1}{\sigma {\sqrt{2\pi }}} e^{ -{\frac{1}{2}} \left ({\frac{x-\mu }{\sigma }} \right)^{2} }$.

Gaussian functions are widely used in statistics to describe the normal distributions, in signal processing to define Gaussian filters, in image processing where two-dimensional Gaussians are used for Gaussian blurs, and in mathematics to solve heat equations and diffusion equations and to define the Weierstrass transform.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

See also:

Gaussian processes

Gaussian Processes (GP) are a generic supervised learning method designed to solve regression and probabilistic classification problems.

The advantages of Gaussian processes are:

  • The prediction interpolates the observations (at least for regular kernels).
  • The prediction is probabilistic (Gaussian) so that one can compute empirical confidence intervals and decide based on those if one should refit (online fitting, adaptive fitting) the prediction in some region of interest.
  • Versatile: different kernels can be specified. Common kernelsKernels (also called “covariance functions” in the context of GPs) are a crucial ingredient of GPs which determine the shape of prior and posterior of the GP. They encode the assumptions on the function being learned by defining the “similarity” of two datapoints combined with the assumption that similar datapoints should have similar target values. Two categories of kernels can be distinguished: stationary kernels depend only on the distance of two datapoints and not on their absolute values  k(xi,j) = k(di,j))  and are thus invariant to translations in the input space, while non-stationary kernels depend also on the specific values of the datapoints. Stationary kernels can further be subdivided into isotropic and anisotropic kernels, where isotropic kernels are also invariant to rotations in the input space. are provided, but it is also possible to specify custom kernels.

The disadvantages of Gaussian processes include:

  • They are not sparse, i.e., they use the whole samples/features information to perform the prediction.
  • They lose efficiency in high dimensional spaces - namely when the number of features exceeds a few dozens.

In probability theory and statistics, a Gaussian process is a stochastic process (a collection of random variables indexed by time or space), such that every finite collection of those random variables has a multivariate normal distribution, i.e. every finite linear combination of them is normally distributed. The distribution of a Gaussian process is the joint distribution of all those (infinitely many) random variables, and as such, it is a distribution over functions with a continuous domain, e.g. time or space.

A machine-learning algorithm that involves a Gaussian process uses lazy learning and a measure of the similarity between points (the kernel function) to predict the value for an unseen point from training data. The prediction is not just an estimate for that point, but also has uncertainty information - it is a one-dimensional Gaussian distribution (which is the marginal distribution at that point).

For some kernel functions, matrix algebra can be used to calculate the predictions using the technique of kriging [image]
. When a parameterised kernel is used, optimisation software is typically used to fit a Gaussian process model.

Gaussian processes are useful in statistical modelling, benefiting from properties inherited from the normal. For example, if a random process is modelled as a Gaussian process, the distributions of various derived quantities can be obtained explicitly. Such quantities include the average value of the process over a range of times and the error in estimating the average using sample values at a small set of times.

See also:

Generative adversarial networks

Generative adversarial networks (GANs) are a class of artificial intelligence algorithms used in unsupervised machine learning, implemented by a system of two neural networks contesting with each other in a zero-sum game framework. They were introduced by Goodfellow et al. in 2014 (Generative Adversarial Nets).

One network generates candidates (generative) and the other evaluates them (discriminative). Typically, the generative network learns to map from a latent space to a particular data distribution of interest, while the discriminative network discriminates between instances from the true data distribution and candidates produced by the generator. The generative network’s training objective is to increase the error rate of the discriminative network (i.e., “fool” the discriminator network by producing novel synthesised instances that appear to have come from the true data distribution).

In practice, a known dataset serves as the initial training data for the discriminator. Training the discriminator involves presenting it with samples from the dataset, until it reaches some level of accuracy. Typically the generator is seeded with a randomized input that is sampled from a predefined latent space (e.g. a multivariate normal distribution). Thereafter, samples synthesized by the generator are evaluated by the discriminator. Backpropagation is applied in both networks so that the generator (for example) produces better images, while the discriminator becomes more skilled at flagging synthetic images. The generator is typically a deconvolutional neural network, and the discriminator is a convolutional neural network. [GANs can generate photographs that look authentic to human observers.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

See also

Geometric deep learning

In the last decade, deep learning (DL) approaches (e.g. convolutional neural networks and recurrent neural networks) have achieved unprecedented performance on a broad range of problems coming from a variety of different fields (e.g. computer vision and speech recognition). Despite the results obtained, research so far on DL techniques has focused mainly on data defined on Euclidean domains (i.e. grids). However, in fields such as biology, physics, network science, recommender systems, computer graphics etc. one may have to deal with data defined on non-Euclidean domains (i.e. graphs and manifolds). Until recently, the adoption of DL in these fields has been lagging, primarily as the non-Euclidean nature of data makes the definition of basic operations (such as convolution) rather elusive. Geometric deep learning deals with the extension of DL techniques to graph/manifold structured data.

The Geometric Deep Learning website represents a collection of materials in the field of geometric deep learning. They collect workshops, tutorials, publications and code, that several different researchers have produced over the last several years. Their goal is to provide a general picture of this new and emerging field, which is rapidly developing in the scientific community thanks to its broad applicability.

See also:


[Image source. Click image to open in new window.]

GNU Octave

See this page.

Google AI

Google AI, formerly Google Research. Google collects AI-based services across the company into “ is a collection of products and teams across Alphabet with a focus on AI.”

Gradient; Gradient descent

See Backpropagation.

Graph adjacency matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Degree Matrix.

Graph databases

Data often exists as relationships between different objects. While relational databases (RDBMS) store highly structured data, they do not store the relationships between the data. Unlike other databases, graph databases store relationships and connections as first-class entities, and excel at managing highly connected data and complex queries.

... continued here ...

Graph degree matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Adjacency Matrix.

Graph Laplacian (Laplacian matrix)

“The Laplace operator in its various manifestations is the most beautiful and central object in all of mathematics. Probability theory, mathematical physics, Fourier analysis, partial differential equations, the theory of Lie groups, and differential geometry all revolve around this sun, and its light even penetrates such obscure regions as number theory and algebraic geometry.”
[Nelson E (1967) Tensor Analysis, p. 100]

Refer here  [Graph Signal Processing: Background].

Graph signal processing

In applications such as social, energy, transportation, sensor, and neuronal networks, high-dimensional data naturally reside on the vertices of weighted graphs. The emerging field of signal processing on graphs merges algebraic and spectral graph theoretic concepts with computational harmonic analysis to process such signals on graphs. The weight associated with each edge in the graph often represents the similarity between the two vertices it connects. The connectivities and edge weights are either dictated by the physics of the problem at hand or inferred from the data. For instance, the edge weight may be inversely proportional to the physical distance between nodes in the network. The data on these graphs can be visualized as a finite collection of samples, with one sample at each vertex in the graph. Collectively, we refer to these samples as a graph signal.

A signal or function $\small f : {V} \to \mathbf{R}$ defined on the vertices of the graph may be represented as a vector $\small f \in \mathbf{R}^N$, where the $\small i^{th}$ component of the vector $\small f$ represents the function value at the $\small i^{th}$ vertex in $\small {V}$. The following figure is an example of a graph signal.


[Image source. Click image to open in new window.]

... continued here ...

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


A heuristic technique (often simply called a heuristic ) is any approach to problem solving or self-discovery that employs a practical method – not guaranteed to be optimal, perfect, logical, or rational, but instead sufficient for reaching an immediate goal. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. Heuristics can be mental shortcuts that ease the cognitive load of making a decision. Examples that employ heuristics include using a rule of thumb, an educated guess, an intuitive judgment, a guesstimate, profiling, or common sense.


In semantics, a holonym is the opposite of a meronym.


  • “finger” is a part of a “hand”, thus “hand” is the holonym of “finger”
  • “family” is a holonym of “child”, “mother” or “father”.
  • “forest” is a holonym of “tree” (forests contain trees), while “tree” is a holonym of “bark”
  • “body” is a holonym of “arm”, which is a holonym of “elbow”

See also Linguistics | Semantics.


In semantics, a homonym is a word having a different meaning, but the same form as another word.


  • “bear” (verb) vs. “bear” (noun, animal)
  • “bank” (institution) vs. “bank” (river bank)

It is possible to distinguish homophones (same sound) and homographs (same spelling)

Homonymy and synonymy have complementary notions:

  • homonyms: the same form, different meanings
  • synonyms: the same meaning, different forms

Homonyms are words that have the same form but different meanings. There are two major types of homonyms, based upon whether the meanings of the word are historically connected or result from coincidence.

  • Coincidental homonyms are the result of such historical accidents as phonetic convergence of two originally different forms or the borrowing of a new word which happens to be identical to an old word. There is usually no natural link between the two meanings: the “bill” of a bird vs the “bill” one has to pay; or the “bark” of a dog vs the “bark” of a tree.

  • The second type of homonym, the polysemous homonym, results when multiple meanings develop historically from the same word. The process by which a word acquires new meanings is called polysemy. Unlike coincidental homonyms, polysemous homonyms usually preserve some perceptible semantic link marking the development of one meaning out of the other, as in the “leg” of a chair and the “leg” of a person; or the face of a person vs. the face of a clock.

Sometimes it is impossible to tell whether two words of identical form are true homonyms (historically unrelated) or polysemous homonyms (historically related), such as ice “skate” vs. “skate”, the fish: “skate” (fish, from Old English “skata”) vs. ice “skate” (from Dutch “schaat”); “dear/deer” are historically related (“darling”, vs. German “Tier”: animal).

Since polysemy is so difficult to separate from true homonymy, dictionaries usually order entries according to (i) the first recorded appearance of word, or (ii) the frequency of meaning use. This is a problem for lexicographers, the people who study words and write dictionaries.

There are universal tendencies in the directionality of polysemy. Studies of polysemy in a wide variety of languages generally find the following directions in meaning shift:

  • body part to part of object (“hands”, “face”, “lip”, “elbow”, “belly”, “vein of gold” or “vein of a leaf”); for example, “appendix.

  • animal to human, for personality traits (“shrew”, “bear”, “wolf”, “fox”, “quiet as a fish”); for example, “My dog is a real Einstein”.

  • space to time (“long”, “short”, “plural”)

  • spatial to sound (“melt”, “rush”)

  • sound to color (“loud”, “clashing”, “mellow”)

  • Physical, visible attribute to emotional or mental, invisible quality (“crushed”, “big head”, “green with envy”, “yellow coward”, “sharp/dull”, “spark”)

Directionality in polysemy seems to be logically motivated: concrete meanings give rise to abstract ones (“sharp knife” → “sharp mind”); mundane gives rise to the technical (“wood chip” → “computer chip”).

See also Linguistics | Semantics.

Horn clauses

See also:

Relevant logic notation:

  • $\small èxists$ : there exists at least one
  • $\small \forall$ : for all
  • $\small \neg$ : not (logical not)
  • $\small \lor$ : or (logical or)
  • $\small \land$ : and (logical and)
  • $\small \sim$ : contraposition (an inference that says that a conditional statement is logically equivalent to its contrapositive. The contrapositive of $\small P \rightarrow Q$ is thus $\small \neg Q \rightarrow \neg P$). In symbolic logic, the tilde (~) is used to indicate the negation of a statement, so $\small \sim p èquiv \lnot p$.
  • $\small \rightarrow$ : implies
  • $\small \leftarrow$ : is implied by

  • In Prolog, the symbol $\small \text{ :- }$, sometimes called a turnstile , is pronounced “ if “.

In mathematical logic and logic programming, a Horn clause is a logical formula of a particular rule-like form which gives it useful properties for use in logic programming, formal specification, and model theory. Horn clauses are named for the logician Alfred Horn, who first pointed out their significance in 1951.

A Horn clause is a clause (a disjunction of literals) with at most one positive, i.e. unnegated, literal. Conversely, a disjunction of literals with at most one negated literal is called a dual-Horn clause. A Horn clause with exactly one positive literal is a definite clause; a definite clause with no negative literals is sometimes called a fact; and a Horn clause without a positive literal is sometimes called a goal clause (note that the empty clause consisting of no literals is a goal clause). …

The following is an example of a (definite) Horn clause:

    $\small \neg p \lor \neg q \lor \ldots \lor \neg t \lor u$

Such a formula can be rewritten in the following form, which is more common in logic programming and similar fields:

    $\small (p \land q \land \ldots \land t) \rightarrow u$

Horn clauses play a basic role in constructive logic and computational logic. They are important in automated theorem proving by first-order resolution, because the resolvent of two Horn clauses is itself a Horn clause, and the resolvent of a goal clause and a definite clause is a goal clause. These properties of Horn clauses can lead to greater efficiencies in proving a theorem (represented as the negation of a goal clause). …

Horn clauses are also the basis of logic programming, where it is common to write definite clauses in the form of an implication:

    $\small (p \land q \land \ldots \land t) \rightarrow u$

In fact, the resolution of a goal clause with a definite clause to produce a new goal clause is the basis of the SLD resolution inference rule, used to implement logic programming in the programming language Prolog.

In logic programming a definite clause behaves as a goal-reduction procedure. For example, the Horn clause written above behaves as the procedure:

    to show $\small u$, show $\small p$ and show $\small q$ and ... and show $\small t$ .

To emphasize this reverse use of the clause, it is often written in the reverse form:

    $\small u \leftarrow (p \land q \land \ldots \land t)$

In Prolog this is written as:

    $\small u \text{ :- } p, q, \dots, t$.

In logic programming and datalog, computation and query evaluation are performed by representing the negation of a problem to be solved as a goal clause. For example, the problem of solving the existentially quantified conjunction of positive literals:

    $\small èxists \mathit{X} (p \land q \land \ldots \land t)$

is represented by negating the problem (denying that it has a solution), and representing it in the logically equivalent form of a goal clause:

    $\small \forall \mathit{X} (false \leftarrow p \land q \land \ldots \land t)$

In Prolog this is written as:

    $\small \text{ :- } p, q, \ldots, t$.

Solving the problem amounts to deriving a contradiction, which is represented by the empty clause (or “false”). The solution of the problem is a substitution of terms for the variables in the goal clause, which can be extracted from the proof of contradiction. Used in this way, goal clauses are similar to conjunctive queries in relational databases, and Horn clause logic is equivalent in computational power to a universal Turing machine.

The Prolog notation is actually ambiguous, and the term “goal clause” is sometimes also used ambiguously. The variables in a goal clause can be read as universally or existentially quantified, and deriving “false” can be interpreted either as deriving a contradiction or as deriving a successful solution of the problem to be solved. …

Notation: The clause $\small \sim p \lor \sim q \lor r$ is written in Prolog in the form $\small r \text{ :- } p, q.$  .

See also:

Hyperbolic embeddings

Hyperbolic embeddings seek to embed structured, discrete objects such as knowledge graphs into a continuous representation that can be used with modern machine learning methods. Hyperbolic embeddings can preserve graph distances and complex relationships in very few dimensions, particularly for hierarchical graphs. Many graphs, such as complex networks including the Internet and social networks, are known to have hyperbolic structure and thus befit hyperbolic embeddings (see Representation Tradeoffs for Hyperbolic Embeddings and references therein). For both natural language processing and graph based tasks, embeddings have been learned in high-dimensional Euclidean spaces. However, recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but negatively curved, hyperbolic space.

For additional background, see these dated but useful resources:

A hyperbolic space has the property that power-law degree distributions, strong clustering and hierarchical community structure emerge naturally when random graphs are embedded in hyperbolic space. It is therefore logical to exploit the structure of the hyperbolic space for useful embeddings of complex network. Neural Embeddings of Graphs in Hyperbolic Space (May 2017) proposed learning neural embeddings of graphs in hyperbolic space, providing experimental evidence that embedding graphs in their natural geometry significantly improved performance on downstream tasks for several real-world public datasets.


[Image source. Click image to open in new window.]

  • Hybed: Hyperbolic Neural Graph Embedding  [Semantic ScholarOpenReview] again posits that recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but a negatively curved hyperbolic space. They proposed learning neural embeddings of graphs in hyperbolic space, providing experimental evidence that hyperbolic embeddings significantly outperformed Euclidean embeddings on vertex classification tasks for several real-world public datasets.

Many well-established recommender systems are based on representation learning in Euclidean space. In these models, matching functions such as the Euclidean distance or inner product are typically used for computing similarity scores between user and item embeddings. In an extension of representation learning from the Euclidean to the hyperbolic domain, Hyperbolic Recommender Systems investigated the notion of learning user and item representations in Hyperbolic space, arguing that hyperbolic space is more suitable for learning user-item embeddings in the recommendation domain. Unlike Euclidean spaces, hyperbolic spaces are intrinsically equipped to handle hierarchical structure, encouraged by its property of exponentially increasing distances away from origin. Their proposed system (HyperBPR) outperformed its Euclidean counterparts, achieving state of the art performance on multiple benchmark datasets.

Representation Tradeoffs for Hyperbolic Embeddings provides an excellent summary as well as entity embeddings with 64-bit precision] found that – given a tree – they could give a combinatorial construction that embedded the tree in hyperbolic space with arbitrarily low distortion without using optimization. On WordNet, their combinatorial embedding obtained a mean-average-precision of 0.989 with only two dimensions, while Nickel et al.’s recent construction (Poincaré Embeddings for Learning Hierarchical Representation) obtained 0.87 using 200 dimensions.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Maximilian Nickel and Douwe Kiela at Facebook AI Research recently introduced a new approach to learning hierarchical representations of symbolic data by embedding them into hyperbolic space (more precisely, into an n-dimensional Poincaré ball) (Poincaré Embeddings for Learning Hierarchical Representations;  [project]), allowing them to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. The Poincaré embeddings significantly outperformed Euclidean embeddings on data with latent hierarchies in terms of representation capacity and generalization ability.

Experiments showed that Poincaré embeddings provided important advantages over Euclidean embeddings on hierarchical data:

  1. Poincaré embeddings enabled very parsimonious representations that enabled the learning of high-quality embeddings of large-scale taxonomies.
  2. Excellent link prediction results indicated that hyperbolic geometry could introduce an important structural bias for the embedding of complex symbolic data.
  3. State of the art results for predicting lexical entailment suggested that the hierarchy in the embedding space corresponded well to the underlying semantics of the data.


[Image source. Click image to open in new window.]

Embeddings of tree-like graphs in hyperbolic space were recently shown to surpass their Euclidean counterparts in performance by a large margin. Inspired by those results, Skip-gram Word Embeddings in Hyperbolic Space presented an algorithm for learning word embeddings in hyperbolic space from free text. An objective function based on the hyperbolic distance was derived and included in the skip-gram architecture from word2vec. The results demonstrated the potential of hyperbolic word embeddings, particularly in low dimensions, although without clear superiority over their Euclidean counterparts.

See also Embeddings (word embeddings)


In mathematics, a hypergraph is a generalization of a graph in which an edge can join any number of vertices. Formally, a hypergraph is a pair where is a set of elements called nodes or vertices, and is a set of non-empty subsets of called hyperedges or edges.


[Image source. Click image to open in new window.]


[Click image to open in new window]

Research into hypergraph development is technically challenging due to a lack of programmatic solutions and published work in this area. Nevertheless, for relevant background on hypergraphs consult the following:


Hypernym (superordinate word)

In semantics, superordinate words (hypernyms) are words whose meaning includes meaning of another word. If $\small X$ is superordinate to $\small Y$ then $\small Y$ is subordinate to $\small X$.


  • “color” is a hypernym of “red”, “black”, “green”, …
  • “animal” is a hypernym of “horse”, “tiger”, …
  • “animal” is a hypernym of “mammal” (mammals are animals), which is a hypernym of “mammal” is a hypernym of “dog” (dogs are mammals)
  • “plant” is a hypernym of “flower”, which is a hypernym of “tulip”
  • “red” is a hypernym of “scarlet”, “vermilion”, “carmine” and “crimson”

Hypernyms vs. Hyponyms

Hypernymy is a relation between words (or sentences) where the semantics of one word (the hyponym) are contained within that of another word (the hypernym). A simple form of this relation is the is-a relation; e.g., cat is an animal. Another example is a hyponym in a type-of relationship with its hypernym; for example, “pigeon, crow, eagle and seagull” are all hyponyms of “bird” (their hypernym); “bird” in turn is a hyponym of “animal;” “spoon” is a hyponym of “cutlery” (its hypernym); etc.

See also Linguistics | Semantics.


See Parameters.

Hyponym (subordinate word)

In semantics, subordinate words (hyponyms) are words whose meaning is included in the meaning of another word.


  • “red” is a hyponym of “color” (red is subordinate to color)
  • “pony” is a hyponym of “horse”, which is a hyponym of “animal”
  • “tulip” is a hyponym of “flower”; …
  • “dog” is a hyponym of “mammal” (dogs are among the various animals which are mammals); “mammal” is a hyponym of “animal”
  • “tulip” is a hyponym of “flower”, which is a hyponym of “plant”
  • “scarlet”, “vermilion”, “carmine” and “crimson” are hyponyms of “red”

This is not a whole/part relationship, so “page” is not a subordinate term of “book”.

Hypernyms vs. Hyponyms

Hypernymy is a relation between words (or sentences) where the semantics of one word (the hyponym) are contained within that of another word (the hypernym). A simple form of this relation is the is-a relation; e.g., cat is an animal. Another example is a hyponym in a type-of relationship with its hypernym; for example, “pigeon, crow, eagle and seagull” are all hyponyms of “bird” (their hypernym); “bird” in turn is a hyponym of “animal;” “spoon” is a hyponym of “cutlery” (its hypernym); etc.

See also Linguistics | Semantics.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See Proofs.

Identity Matrix

See here  [Graph Signal Processing: Background]


See Reasoning.

Inductive logic programming

See also:

Inductive logic programming is a subfield of symbolic artificial intelligence which uses logic programming as a uniform representation for examples, background knowledge and hypotheses. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an inductive logic programming system will derive a hypothesised logic program which entails all the positive and none of the negative examples.

    Schema ($\small \Rightarrow$ : implies):

    positive examples + negative examples + background knowledge $\small \Rightarrow$ hypothesis .

Inductive logic programming is particularly useful in bioinformatics and natural language processing. Gordon Plotkin and Ehud Shapiro laid the initial theoretical foundation for inductive machine learning in a logical setting. Shapiro built their first implementation (Model Inference System) in 1981: a Prolog program that inductively inferred logic programs from positive and negative examples. The term “Inductive Logic Programming” was first introduced in a paper by Stephen Muggleton in 1991. Muggleton also founded the annual international conference on Inductive Logic Programming, introduced the theoretical ideas of Predicate Invention, Inverse resolution, and Inverse entailment. …

In inductive logic programming the learned model is in the form of logic programming rules (Horn clauses) that are comprehensible to humans. It allows the background knowledge to be incrementally extended without requiring the entire model to be re-learned. Meanwhile, the comprehensibility of symbolic rules makes it easier for users to understand and verify induced models and even edit them. The inductive logic programming learning problem can be regarded as a search problem for a set of clauses that deduce the training examples. The search is performed either top down or bottom-up. A bottom-up approach builds most-specific clauses from the training examples and searches the hypothesis space by using generalization. This approach is not applicable to large-scale datasets, nor it can incorporate Negation-As-Failure into the hypotheses. In contrast, top-down approach starts with the most general clauses and then specializes them. A top-down algorithm guided by heuristics is better suited for large-scale and/or noisy datasets. [Source.]

Note that inductive logic programming differs from integer linear programming (unfortunately both are often abbreviated “ILP”).

Inductive logic programming was used in Induction of Non-Monotonic Logic Programs to Explain Boosted Tree Models Using LIME (Aug 2018).

See also:


See Reasoning.

Information extraction

Information extraction (IE) is the process of extracting structured information (e.g. events, binary relations, etc.) from text and data, so that it can be used for another purpose, such as an information retrieval system (e.g. a search engine). IE creates structured information from unstructured text. See also: Relation extraction.

IE vs. IR
      [Image source]

Information overload

The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time (Content-Driven, Unsupervised Clustering of News Articles Through Multiscale Graph Partitioning). Even within the more focused health, technical and scientific domains we face a continuous onslaught of new information and knowledge from which we must filter out the non-relevant information, seeking to retain (or hoping to find again) knowledge that is relevant to us.

Information overload is characterized by the difficulty of understanding an issue and effectively making decisions when one has too much information about that issue. In our infocentric world, we have an increasing dependency on relevant, accurate information that is buried in the avalanche of continuously generated information.

Coincident with information overload is the phenomenon of attention overload: we have limited attention and we’re not always sure where to direct it. It can be difficult to limit how much information we consume when there’s always something new waiting for a click; before we know it, an abundance of messy and complex information has infiltrated our minds. If our processing strategies don’t keep pace, our online explorations create strained confusion instead of informed clarity. Hence, More information is not necessarily better.

When Choice is Demotivating: Can One Desire Too Much of a Good Thing? [pdf] discussed findings from 3 experimental studies that starkly challenged the implicit assumption that having more choices is more intrinsically motivating than having fewer choices. Those experiments, which were conducted in both field and laboratory settings, showed that people are more likely to make purchases or undertake optional classroom essay assignments when offered a limited array of 6 choices, rather than a more extensive array of 24 or 30 choices. Moreover, participants reported greater subsequent satisfaction with their selections and wrote better essays when their original set of options had been limited.

Information overload is a long-standing issue: in her 2010 book Too Much To Know: Managing Scholarly Information before the Modern Age, Harvard Department of History Professor Ann Blair argued that the early modern methods of selecting, summarizing, sorting, and storing text (the 4S’s) are at the root of the techniques we use today in information management.

For additional discussion, see the blog post Information Overload, Fake News, and Invisible Gorillas.

Information retrieval

Information retrieval (IR) employs highly scalable statistics-based techniques to index and search large volumes of text efficiently. Information retrieval is based on a query – you specify what information you need, and it is returned in human understandable form.

See also Document relevance ranking.

Integer linear programming

An integer linear program has the same form as a linear program; the only difference is that the feasible solutions are restricted to be integer vectors, i.e. $\small x \in \mathbb{Z}^n$ instead of $\small x \in \mathbb{R}^n$.

An integer programming problem is a mathematical optimization or feasibility program in which some or all of the variables are restricted to be integers. In many settings the term refers to integer linear programming (ILP), in which the objective function and the constraints (other than the integer constraints) are linear.

Integer programming is NP-complete. In particular, the special case of 0-1 integer linear programming, in which unknowns are binary, and only the restrictions must be satisfied, is one of Karp’s 21 NP-complete problems.

Note that integer linear programming differs from inductive logic programming (both, unfortunately, are often abbreviated “ILP”). Inductive logic programming is a subfield of machine learning that learns models in the form of logic programming rules (Horn clauses) that are comprehensible to humans. Inductive logic programming was used in Induction of Non-Monotonic Logic Programs to Explain Boosted Tree Models Using LIME.

Integer linear programming was used in the following papers:

  • An interesting graph-based approach to question answering is the use of subgraphs. Work from the Allen Institute for AI, Answering Complex Questions Using Open Information Extraction (Apr 2017), used a recently proposed support graph optimization framework for QA to develop a new inference model for Open IE. Their model, TupleINF, significantly outperformed a state of the art structured solver on complex questions of varying difficulty, while also removing the reliance on manually curated knowledge. TupleINF used integer linear programming to solve QA over complex structures that require multi-fact reasoning. …

    • TupleINF was inspired by their earlier TableILP work, Question Answering via Integer Programming over Semi-Structured Knowledge (Apr 2016), which proposed a structured inference system for this task formulated as an integer linear program that answers natural language questions using a semi-structured knowledge base derived from text, including questions requiring multi-step inference and a combination of multiple facts.

    • The approach of using subgraphs and integer linear programming in QA was also described on slides 15-25 in Reasoning-driven Question Answering by the University of Pennsylvania Ph.D. candidate Daniel Khashabi, at a Stanford NLP seminar (June 2018). The work summarized in that presentation is described in the UPenn/Allen Institute for Artificial Intelligence paper Question Answering as Global Reasoning over Semantic Abstractions (2018), which introduced their SemanticILP system. They claimed SemanticILP to be the first system that reasons over a wide range of semantic abstractions of the text, which are derived using off-the-shelf, general-purpose, pretrained natural language modules such as semantic role labelers, coreference resolvers, and dependency parsers. Representing multiple abstractions as a family of graphs, they translated QA into a search for an optimal subgraph that satisfied certain global and local properties. SemanticILP outperformed baseline models BiDAF, an information retrieval solver, and TupleINF on QA tasks on 4th and 8th grade science exams and a biology exam (Tables 4 and 5 in their paper). Thus on two domains with very different characteristics, the graph-based SemanticILP outperformed traditional and neural network models.

  • Knowledge Graph Embedding with Iterative Guidance from Soft Rules (Nov 2017) [code] presented a novel approach for knowledge graph embedding combined with guidance from soft logic rules, called Rule-Guided Embedding (RUGE). RUGE built on previous work by these authors that provided additional information:

  • Unsupervised Pseudo-Labeling for Extractive Summarization on Electronic Health Records (Nov 2018)

See also:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


GitHub: jax:  Composable transformations of Python + NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.

  • Google GitHub repository, but “not an official Google product.”

JAX is Autograd and XLA, brought together for high-performance machine learning research. With its updated version of Autograd, JAX can automatically differentiate native Python and NumPy functions. It can differentiate through loops, branches, recursion, and closures, and it can take derivatives of derivatives of derivatives. It supports reverse-mode differentiation (a.k.a. backpropagation) via grad as well as forward-mode differentiation, and the two can be composed arbitrarily to any order.

What’s new is that JAX uses XLA to compile and run your NumPy programs on GPUs and TPUs. Compilation happens under the hood by default, with library calls getting just-in-time compiled and executed. But JAX also lets you just-in-time compile your own Python functions into XLA-optimized kernels using a one-function API, jit. Compilation and automatic differentiation can be composed arbitrarily, so you can express sophisticated algorithms and get maximal performance without leaving Python.

Dig a little deeper, and you’ll see that JAX is really an extensible system for composable function transformations. …

import jax.numpy as np
from jax import grad, jit, vmap
from functools import partial

def predict(params, inputs):
  for W, b in params:
    outputs =, W) + b
    inputs = np.tanh(outputs)
  return outputs

def logprob_fun(params, inputs, targets):
  preds = predict(params, inputs)
  return np.sum((preds - targets)**2)

grad_fun = jit(grad(logprob_fun))  # compiled gradient evaluation function

See also:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Kirchhoff’s Theorem

In the mathematical field of graph theory, Kirchhoff’s theorem or Kirchhoff’s matrix tree theorem (named after Gustav Kirchhoff) is a theorem about the number of spanning trees in a graph, showing that this number can be computed in polynomial time as the determinant of a matrix derived from the graph. It is a generalization of Cayley’s formula which provides the number of spanning trees in a complete graph.


[Image source. Click image to open in new window.]

Kirchhoff’s theorem relies on the notion of the Laplacian matrix of a graph that is equal to the difference between the graph’s degree matrix (a diagonal matrix with vertex degrees on the diagonals) and its adjacency matrix (a $\small (0,1)$-matrix with 1’s at places corresponding to entries where the vertices are adjacent and 0’s otherwise).

For a given connected graph $\small {G}$ with $\small n$ labeled vertices, let $\small \lambda_1, \lambda_2, ldots, \lambda_{n-1}$ be the non-zero eigenvalues of its Laplacian matrix. Then the number of spanning trees of $\small {G}$ is

    $\small \begin{align} t({G) = \frac{1}{n}} \lambda_1, \lambda_2, \cdots, \lambda_{n-1} ènd{align}$.

Equivalently the number of spanning trees is equal to any cofactor of the Laplacian matrix of $\small {G}$.

Kmer (K-mer)

The term $\small k$-mer typically refers to all the possible substrings of length $\small k$ that are contained in a string. In computational genomics, $\small k$-mers refer to all the possible subsequences (of length $\small k$) from a read obtained through DNA sequencing.

The amount of $\small k$-mers possible given a string of length $\small L$ is $\small L-k+1$, while the number of possible $\small k$-mers given $\small n$ possibilities (4 in the case of DNA, e.g. ACTG) is $\small n^{k}$.

$\small K$-mers are typically used during sequence assembly, but can also be used in sequence alignment. In the context of the human genome, $\small k$-mers of various lengths have been used to explain variability in mutation rates.


Here are some examples showing the possible $\small k$-mers (given a specified $\small k$ value) from DNA sequences:



Typically we extract $\small k$-mers from genomic assemblies or read data sets by running a $\small k$-length window across all of the reads and sequences – e.g. given a sequence of length 16, you could extract 11 $\small k$-mers of length six from it like so:


becomes the following set of 6-mers:


$\small k$-mers are most useful when they’re long, because then they’re specific. That is, if you have a 31-mer taken from a human genome, it’s pretty unlikely that another genome has that exact 31-mer in it. (You can calculate the probability if you assume genomes are random: there are 431 possible 31-mers, and 431 = 4,611,686,018,427,387,904.) [Source.]

For a broader coverage of this topic, see $\small n$-gram.

Knowledge discovery in databases

Knowledge discovery in databases (KDD) is the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data. KDD (“data mining”) is the process of discovering useful knowledge from a collection of data. This widely used data mining technique is a process that includes data preparation and selection, data cleansing, incorporating prior knowledge on data sets and interpreting accurate solutions from the observed results.

Knowledge graphs

Knowledge graphs (relational property graphs) model information in the form of entities (nodes/vertices) and the relationships between them (edges). For a formal definition see

and for a general overview see

A noteworthy feature of knowledge graphs is the excellent performance of traversal-type queries across entities of diverse types within them. Such queries can be challenging to realize in relational databases because of the cost of computing statements (joins) across different tables. See:

Cypher queries in Neo4j are easier to write and understand than complex SQL queries in relational database management systems (RDBMS), especially those involving multiple join statements. For example, see pp. 22-23 in The Definitive Guide to Graph Databases for the RDBMS Developer).

Knowledge graphs (KG) provide semantically structured information that is interpretable by computers – an important property to building more intelligent machines, as well as an important step to transforming text-based knowledge stores and search engines into semantically-aware question answering services (A Review of Relational Machine Learning for Knowledge Graphs). Knowledge graphs are applicable to a broad spectrum of problems ranging from biochemistry to recommender systems; for example, question answering, structured search, exploratory search, and digital assistants.

Knowledge graph structure (embeddings); machine learning

Graphs represent a recent and exciting extension of machine learning; for a good reviews, see

A Review of Relational Machine Learning for Knowledge Graphs also provides an excellent introduction to machine learning and knowledge graphs, with a focus on statistical relational learning.

In a radically different approach from the probabilistic methods employed in statistical relational learning, knowledge graph embedding (KGE) aims to represent entities and relations in a knowledge graph as points or vectors in a continuous vector space (Euclidean space) – simplifying manipulation, while preserving the inherent structure of the KG.

KG are generally represented by symbolic triples – (source, relation, target), also described as (subject, predicate, object) or (head, relation, tail) – whereas KGE methods attempt to represent those symbols (nodes and edges) with their corresponding source, relation, and target vectors – amenable to mathematical processing.

The underlying concept of KGE is that in a knowledge graph each entity can be regarded as a point in a continuous vector space, while relations can be modelled as translation vectors (Expeditious Generation of Knowledge Graph Embeddings). The generated vector representations can be used by machine learning algorithms to accomplish a specific tasks.

Restated in On Embeddings as an Alternative Paradigm for Relational Learning (Jul 2018), KGE aim to represent instances and their relationships as vectors and/or matrices in the Euclidean space. The hope is that the geometry of the embedding space would resemble the structure of the data (for example) by keeping the instances participating in the same relationships close in the Euclidean space. This in turn allows one to apply standard propositional (logic-based) machine learning tools and retain their scalability, while at the same time preserving certain properties of structured relational data.

KGE has proven to be very effective for the tasks of link prediction and knowledge graph completion, where the goal is to identify missing links in the existing knowledge graph (A Review of Relational Machine Learning for Knowledge Graphs  |  An Overview of Embedding Models of Entities and Relationships for Knowledge Base Completion).

KGE approaches are the current (2018) dominating methodology for knowledge graph link prediction (On Link Prediction in Knowledge Bases: Max-K Criterion and Prediction Protocols). This methodology, fundamentally based on distributed representations, has not only proved to be effective for KG link prediction but has also helped to improve our understanding and engineering of knowledge representation. A strong advantage of KGE methods is their scalability, at the expense of their black-box nature and limited reasoning capabilities; KGE has proven to be scalable to very large knowledge graphs.

Knowledge graphs: additional references

For recent reviews on knowledge graph embedding (KGE), see

Examples of the use of KG in the biomedical domain include:

See also Graph databases.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Language models

See Embeddings (word embeddings).


See Graph Laplacian (Laplacian matrix)

Latent Dirichlet allocation (LDA)

In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model.

LDA was first presented as a graphical model for topic discovery by Blei, Ng and Jordan in 2003. Essentially the same model was proposed independently by Pritchard, Stephens and Donnelly in the study of population genetics in 2000. The corresponding papers had 1,998 and 2,016 citations, respectively, by August 2017. Topics


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

In LDA, each document may be viewed as a mixture of various topics where each document is considered to have a set of topics that are assigned to it via LDA. This is identical to probabilistic latent semantic analysis (pLSA), except that in LDA the topic distribution is assumed to have a sparse Dirichlet prior. The sparse Dirichlet priors encode the intuition that documents cover only a small set of topics and that topics use only a small set of words frequently. In practice, this results in a better disambiguation of words and a more precise assignment of documents to topics. LDA is a generalization of the pLSA model, which is equivalent to LDA under a uniform Dirichlet prior distribution.

For example, an LDA model might have topics that can be classified as “CAT_related” and “DOG_related.” A topic has probabilities of generating various words, such as milk, meow, and kitten, which can be classified and interpreted by the viewer as “CAT_related”. Naturally, the word cat itself will have high probability given this topic. The “DOG_related” topic likewise has probabilities of generating each word: puppy, bark, and bone might have high probability. Words without special relevance, such as “the” (see function words), will have roughly even probability between classes (or can be placed into a separate category). A topic is neither semantically nor epistemologically strongly defined. It is identified on the basis of automatic detection of the likelihood of term co-occurrence. A lexical word may occur in several topics with a different probability, however, with a different typical set of neighboring words in each topic.

Each document is assumed to be characterized by a particular set of topics. This is similar to the standard bag of words model assumption, and makes the individual words exchangeable.


[Image source. Click image to open in new window.]

  • Each topic is a distribution over words
  • Each document is a mixture of corpus-wide topics
  • Each word is drawn from one of those topics


[Image source. Click image to open in new window.]

LDA – implementations

LDAvis: A Method for Visualizing and Interpreting Topics (2014) presented LDAvis, a web-based interactive visualization of topics estimated using latent Dirichlet allocation (LDA) that was built using a combination of [GNU] R and D3 [d3.js].


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

LDAvis was ported from R to Python as pyLDAvis (Python LDA visualization), a Python library for interactive topic model visualization using latent Dirichlet allocation (LDA). The accompanying Jupyter notebook demonstrates a fabulous interactive, customizable user interface. That interface was also embedded in the webpage for the blog post [Introducing Our Hybrid lda2vec Algorithmcode], which discussed real-world applications of word vectors.

moody2016representing-blog-word2vec.gif moody2016representing-blog-LDA.gif moody2016representing-blog-word2vec+LDA.gif

[Image source. Click image to open in new window.]

  • “At its heart, word2vec predicts locally: given a word it guesses neighboring words. At Stitch Fix, this text is typically a client comment about an item in a fix. In this example, word2vec predicts the other words in a sentence given the central pivot word ‘awesome’ and repeats this process for every pair of words in a moving window. Ultimately this yields wonderful word vectors that have surprisingly powerful representations.
  • LDA on the other hand predicts globally: it learns a document vector that predicts words inside of that document. And so in one of our client’s comments about their fix: it predicts all of the words using a single document-wide vector and doesn’t capture more local word-to-word correlations.
  • lda2vec predicts globally and locally at the same time by predicting the given word using both nearby words and global document themes.The hope is that more data and more features helps us better predict neighboring words. Having a local word feature helps predict words inside of a sentence. Having a document vector captures long-range themes beyond the scale of a few words and instead arcing over thousands of words.
  • [Source: Introducing our Hybrid lda2vec Algorithm.]
  • The accompanying paper is Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec (May 2016) [code] [image]


See Proofs.


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.


Linguistics is the scientific study of language and its structure, including the study of morphology, syntax, phonetics, and semantics. Specific branches of linguistics include sociolinguistics, dialectology, psycholinguistics, computational linguistics, historical-comparative linguistics, and applied linguistics, etc.

See also:

Logic notation

Logic notation  (see also):

  • $\small èxists$ : there exists at least one
  • $\small \forall$ : for all
  • $\small \neg$ : not (negation: logical not)
  • $\small \lor$ : or (disjunction: logical or)
  • $\small \land$ : and (conjunction: logical and)
  • $\small \sim$ : contraposition (an inference that says that a conditional statement is logically equivalent to its contrapositive. The contrapositive of $\small P \rightarrow Q$ is thus $\small \neg Q \rightarrow \neg P$). In symbolic logic, the tilde (~) is used to indicate the negation of a statement, so $\small \sim p èquiv \lnot p$.
  • $\small \rightarrow$ : implies (implication)
  • $\small \Rightarrow$ : implies
  • $\small \leftarrow$ : is implied by
  • $\small \longleftrightarrow$ : biconditional

  • In Prolog, the symbol $\small \text{ :- }$, sometimes called a turnstile , is pronounced “ if “.


[Image source. Click image to open in new window.]

$\small p \rightarrow q$ ($\small \text{p implies q}$)


  • A necessary condition of a statement must be satisfied for the statement to be true.
    Formally, a statement $\small P\ $ is a necessary condition of a statement $\small Q$ if $\small Q$ implies $\small P$ ($\small Q \Rightarrow P$).

  • A sufficient condition is one that, if satisfied, assures the statement’s truth.
    Formally, a statement $\small P\ $ is a sufficient condition of a statement $\small Q$ if $\small P$ implies $\small Q$ ($\small P \Rightarrow Q$).

This is a simple matter answered by the truth table of $\small \Rightarrow$:

    $\small \begin{array}{ c | c || c | } P & Q & P\Rightarrow Q \\ \hline \text T & \text T & \text T \\ \text T & \text F & \text F \\ \text F & \text T & \text T \\ \text F & \text F & \text T ènd{array}$

This shows that when $\small P$ is false, the implication is true. Note that this is the definition of the table, there is no need to prove it. This is how $\small \Rightarrow$ is defined to work.

For example:

    $\small \text{If it is raining then there are clouds in the sky.}$

In this case $\small P = \text{It is raining}$, and $\small Q = \text{There are clouds in the sky}$.

Note that $\small P$ is sufficient to conclude $\small Q$, and $\small Q$ is necessary for $\small P$.

    $\small \text{There is no rain without clouds, and if there are no clouds then there cannot be any rain.}$

However, note that $\small P$ is not necessary for $\small Q$. There could be clouds without any rain.

In this answer, $\small Q$ is necessary for $\small P$, and $\small P$ is sufficient for $\small Q$.

We can reformulate the example, above, as shown in the following truth tables.

Truth tables

Truth table for the conjunction ($\small \land$;  $\small \text{“and”}$) of two propositions:

    $\small \begin{array}{ c | c || c | } p & q & p \land q \\ \text{(e.g. rain)} & \text{(e.g. clouds)} & \text{(e.g. rain and clouds)} \\ \hline \text T & \text T & \text T \\ \text T & \text F & \text F \\ \text F & \text T & \text F \\ \text F & \text F & \text F ènd{array}$

Truth table for the disjunction ($\small \lor$;  $\small \text{“or”}$) of two propositions:

    $\small \begin{array}{ c | c || c | } p & q & p \lor q \\ \text{(e.g. rain)} & \text{(e.g. clouds)} & \text{(e.g. rain or clouds)} \\ \hline \text T & \text T & \text T \\ \text T & \text F & \text T \\ \text F & \text T & \text T \\ \text F & \text F & \text F ènd{array}$

In the disjunction table, above, if you have no rain (rain = F) and no clouds (clouds = F), then having one or the other of rain or clouds is False (“rain or clouds” = “rain and no clouds,” or “clouds and no rain”).

Truth table for the implication ($\small \rightarrow$) of two propositions:

    $\small \begin{array}{ c | c || c | } p & q & p \rightarrow q \\ \text{(e.g. rain)} & \text{(e.g. clouds)} & \text{(e.g. rain implies clouds)} \\ \hline \text T & \text T & \text T \\ \text T & \text F & \text F \\ \text F & \text T & \text T \\ \text F & \text F & \text T ènd{array}$

The implication table above is trickier. $\small p \rightarrow q$ states that “rain implies clouds,” i.e. that if rain is True then clouds is also True. That implication [row 1].

In the second row, if there is rain but no clouds, then that implication is False (as the implication is that “if you have rain, you have clouds”).

In the third row, the absence of rain (rain = F) means we cannot negate the implication that rain implies clouds, as we don’t have enough information to counter that implication. Thus, the implication $\small p \rightarrow q$ remains valid (True).

In the fourth row, a similar argument holds: the absence of rain (regardless of whether or not clouds are present) means we lack sufficient information to counter the implication $\small p \rightarrow q$, which thus remains valid (True). If we have no rain, then under the stated implication we cannot make any assertions about the presence or absence of clouds.

Truth table for the biconditional ($\small \longleftrightarrow$) relationship of two propositions:

    $\small \begin{array}{ c | c || c | } & & p \longleftrightarrow q & \\ p & q & \text{(rain and clouds), or} \\ \text{(rain)} & \text{(clouds)} & \text{(no rain and no clouds)} & (p \rightarrow q) \land (q \rightarrow p)\\ \hline \text T & \text T & \text T & \text T \\ \text T & \text F & \text F & \text F \\ \text F & \text T & \text F & \text F \\ \text F & \text F & \text T & \text T ènd{array}$

The truth values of $\small p \longleftrightarrow q$ are equivalent to $\small (p \rightarrow q) \land (q \rightarrow p)$.

Truth table for the contraposition ($\small \sim$ or $\small \neg$) of two propositions:

    $\small \begin{array}{ c | c || c | } p & q & \neg p & \neg q & \neg q \rightarrow \neg p & p \rightarrow q \\ \hline \text T & \text T & \text F & \text F & \text T & \text T \\ \text T & \text F & \text F & \text T & \text F & \text F \\ \text F & \text T & \text T & \text F & \text T & \text T \\ \text F & \text F & \text T & \text T & \text T & \text T ènd{array}$

i.e. “not True” is “False;” “not False” is “True.”

In the first row: “not q” (F) implies “not p” (F), which is True.
In the second row: “not q” (T) implies not p” (F) which is False.
In the third row: “not q” (F) implies not p” (T) which is True.
In the fourth row: “not q” (T) implies not p” (T) which is True.

This is the same as the implication truth table, further above (which I added as the last column in the table immediately above).

See also:

Long short-term memory (LSTM RNN)

Although theoretically powerful, vanilla RNNs cannot learn from long-sequences due to a problem known as vanishing or exploding gradients. A powerful solution is Long Short-Term Memory (LSTM), introduced in 1997 by Hochreiter & Schmidhuber). Why “long short-term”? As described in that paper, “[LSTM] can learn to bridge time intervals in excess of 1000 steps even in the case of noisy, incompressible input sequences, without loss of short time lag capabilities.” A RNN relies on the past, without scope; long-term past inputs are forgotten. LSTM provide a short-term memory function: a LSTM block has mechanisms to enable “memorizing” information for an extended number of time steps.

LSTM introduces one more vector called “memory” $\small \mathbf{c}_t$, which, together with the state $\small \mathbf{h}_t$, specify the dynamic as: $\small (\mathbf{h}_t, \mathbf{c}_t) = LSTM(\mathbf{h}_{t-1}, \mathbf{c}_{t-1}, \mathbf{x}_t)$. In most implementations, this is decomposed further as:

$\ \ \ \ \ \ \small \mathbf{c}_t = \mathbf{f}_t ∗ \mathbf{c}_{t-1} + \mathbf{i}_t ∗ \mathbf{\widetilde{c}}_t$

$\ \ \ \ \ \ \small \mathbf{h}_t = \mathbf{o}_t ∗ tanh(\mathbf{c}_t)$


  • $\small \mathbf{\widetilde{c}}_t$ is a candidate memory computed from the input
  • $\small \mathbf{f}_t, \mathbf{i}_t, \mathbf{o}_t \in (\textbf{0},\textbf{1})$ are gates
  • $∗$ denotes point-wise multiplication

  • $\small \mathbf{f}_t$ determines how much the previous memory is maintained
  • $\small \mathbf{i}_t$ controls how much new information is stored into memory
  • $\small \mathbf{o}_t$ controls how much memory is read out

The candidate memory and the gates are typically parameterized (sometimes written as “parametrized”) as:

$\ \ \ \ \ \ \small \mathbf{\widetilde{c}}_t = tanh(W_c\mathbf{h}_{t-1} + V_c\mathbf{x_t} + \mathbf{b_c})$

$\ \ \ \ \ \ \small \begin{bmatrix} f_t \\ i_t \\ o_t ènd{bmatrix} = sigm \left( \begin{bmatrix} W_f \\ W_i \\ W_o ènd{bmatrix} \mathbf{h}_{t-1} + \begin{bmatrix} V_f \\ V_i \\ V_o ènd{bmatrix} \mathbf{x}_t + \begin{bmatrix} \mathbf{b}_f \\ \mathbf{b}_i \\ \mathbf{b}_o ènd{bmatrix} \right)$

where $\small (W_{c,f,i,o}, V_{c,f,i,o}, \mathbf{b}_{c,f,i,o})$ are learnable parameters.


[Image source (click image to open in new window)]


[Image source (click image to open in new window)]

In the transformations above, the memory cell $\small \mathbf{c}_t$ stores the “long-term” memory in the vector form. In other words, the information accumulatively captured and encoded until time step $\small t$ is stored in $\small \mathbf{c}_t$ and is only passed along the same layer over different time steps. Given the inputs $\small \mathbf{c}_t$ and $\small \mathbf{h}_t$, the input gate $\small \mathbf{i}_t$ and forget gate $\small \mathbf{f}_t$ will help the memory cell to decide how to overwrite or keep the memory information. The output gate $\small \mathbf{o}_t$ further lets the LSTM block decide how to retrieve the memory information to generate the current state $\small \mathbf{h}_t$ that is passed to both the next layer of the current time step and the next time step of the current layer. Such decisions are made using the hidden-layer parameters $\small \mathbf{W}$ and $\small \mathbf{b}$ (with various subscripts): these parameters will be inferred during the training phase


  • Sources for the above:

  • Symbols in bold font, above, are vectors (e.g. $\small \mathbf{y}_t$ is an output vector – these are the network’s parameters (i.e. “memory”).

  • That paper glosses over glosses over some of the terms (I chose it for the clarity of the “vanilla” (basic) RNN and LSTM structures and descriptions.

  • Bias terms ($\small \textbf{b}$ or $\small \textbf{w}_0$) are additional constants attached to the neurons and added to the weighted input before the activation function is applied. Bias terms help models represent patterns that do not necessarily pass through the origin. For example, if all your features were 0, would your output also be zero? Is it possible there is some base value upon which your features have an effect? Bias terms typically accompany weights and must also be learned by your model. [Source: ML Cheatsheet]

    Bias is the term $\small \textbf{b}$ in the following formula: $\small y^{\ \prime} = \textbf{b} + \mathbf{w}_1\mathbf{x}_1 + \mathbf{w}_2\mathbf{x}_2 + \ldots + \mathbf{w}_n\mathbf{x}_n$.

  • $\small W$, $\small U$ and $\small V$ are matrices of weights, updated during training. $\small W$ are the (shared – see note above) weights passed from one time step to the next; $\small U$ are the “skip connections” from the inputs to all hidden layers, and $\small V$ are the skip connections from the hidden layers to the outputs. The image below, taken from Alex Graves paper, illustrates those connections.

  • Early versions of simple RNN used the sigmoid activation function (see Recent Advances in Recurrent Neural Networks for a description plus other forms), which is applied component-wise:

    $\ \ \ \ \ \ \ \ \ \ \sigma(x) = \frac{1}{1 + {e}^{(-x)}}$

    The “sigmoid” is a common choice, which takes a real-value and squashes it to the range [0,1]. This activation function is normally used in the output layer, where a cross-entropy loss function is used for training a classification model.

  • In a LSTM, The input gate is a sigmoid function and have a range of [0,1]. Because the equation of the cell state is a summation between the previous cell state, sigmoid function alone will only add memory and not be able to remove/forget memory. If you can only add a float number between [0,1], that number will never be zero/turned off/forget. This is why the input modulation gate has an tanh activation function. Tanh has a range of [-1, 1] and allows the cell state to forget memory. [Source: Long Short-Term Memory (LSTM): Concept]

Bidirectional RNN

… A major issue with all of the above networks [RNN; LSTM; GRU] is that they learn representations from previous time steps. Sometimes, you might have to learn representations from future time steps to better understand the context and eliminate ambiguity. Take the following examples, “He said, Teddy bears are on sale” and “He said, Teddy Roosevelt was a great President”. In those sentences, when we are looking at the word “Teddy” and the previous two words “He said”, we might not be able to understand if the sentence refers to the President or Teddy bears. To resolve this ambiguity, we need to look ahead. This is what Bidirectional RNNs accomplish.

The repeating module in a Bidirectional RNN could be a conventional RNN, LSTM or GRU. The structure and the connections of a bidirectional RNN are represented in the figure below. There are two type of connections, one going forward in time, which helps us learn from previous representations and another going backwards in time, which helps us learn from future representations.


[Image source. Click image to open in new window.]

Forward propagation is done in two steps:

  • We move from left to right, starting with the initial time step we compute the values until we reach the final time step
  • We move from right to left, starting with the final time step we compute the values until we reach the initial time step

Source: Introduction to Sequence Models – RNN, Bidirectional RNN, LSTM, GRU  [local copy]

See also:

Bidirectional LSTM (Bi-LSTM)

LSTM in its core, preserves information from inputs that has already passed through it using the hidden state. Unidirectional LSTM only preserves information of the past because the only inputs it has seen are from the past. Using bidirectional will run your inputs in two ways, one from past to future and one from future to past and what differs this approach from unidirectional is that in the LSTM that runs backwards you preserve information from the future and using the two hidden states combined you are able in any point in time to preserve information from both past and future.

A bidirectional LSTM (BiLSTM; Bi-LSTM) layer learns bidirectional long-term dependencies between time steps of time series or sequence data. These dependencies can be useful for when you want the network to learn from the complete time series at each time step.

Bidirectional LSTMs (Bi-LSTMs) are an extension of traditional LSTMs that can improve model performance on sequence classification problems. In problems where all timesteps of the input sequence are available, Bidirectional LSTMs train two instead of one LSTMs on the input sequence. The first LSTM operates on the input sequence as is, and the second LSTM operates on a reversed copy of the input sequence. This can provide additional context to the network, resulting in faster and even fuller learning on the problem.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

  • Note that $\small \mathbf{x}$ stands for the input sequence, $\small \mathbf{h}$ for the output sequence from forward or backward runs (defined by the arrow pointing right or left, respectively), $\small \mathbf{y}$ for the concatenated output sequence where $\small \sigma$ concatenates the forward and backward output elements. [Source]

See also:


See Long short-term memory (LSTM RNN)

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Machine learning

Machine learning (ML) is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to “learn” (e.g., progressively improve performance on a specific task) from data, without being explicitly programmed.

Machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders, and computer vision.

Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine learning is sometimes conflated with data mining, where the latter subfield focuses more on exploratory data analysis and is known as unsupervised learning.

Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics. These analytical models allow researchers, data scientists, engineers, and analysts to “produce reliable, repeatable decisions and results” and uncover “hidden insights” through learning from historical relationships and trends in the data.

Regarding data analysis, machine learning, and latent knowledge discovery: Kevin Murphy (Google AI) states on p. 1 of his book, Machine Learning: A Probabilistic Perspective (2012), that

    "[The] deluge of data calls for automated methods of data analysis, which is what machine learning provides. In particular, we define machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty ..."


[Image source (apparently based on this blog). Click image to open in new window.]

Machine Learning Tasks

Machine learning tasks are typically classified into several broad categories:

  • Supervised learning: The computer is presented with example inputs and their desired outputs, given by a “teacher”, and the goal is to learn a general rule that maps inputs to outputs. As special cases, the input signal can be only partially available, or restricted to special feedback.

  • Semi-supervised learning: The computer is given only an incomplete training signal: a training set with some (often many) of the target outputs missing.

  • Active learning: The computer can only obtain training labels for a limited set of instances (based on a budget), and also has to optimize its choice of objects to acquire labels for. When used interactively, these can be presented to the user for labeling.

  • Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

  • Reinforcement learning: Data (in form of rewards and punishments) are given only as feedback to the program’s actions in a dynamic environment, such as driving a vehicle or playing a game against an opponent.

To fully understand the basics of machine learning, you need to study at least a bit of linear algebra (matrix mathematics: transformations, multiplications, …), partial derivatives (needed for the chain rule, used in backpropagation, …), etc. If you are at all serious about machine learning, Andrew Ng’s Machine Learning course is a must, if only to understand the basics.

Here are my old notes from that course (2013; some of the images are missing, but those omissions are trivial as these notes are very comprehensive):

As well, here is Chris Olah’s Backpropagation blog post, with my handwritten partial differentiation calculations.

For a more recent overview of the recent machine learning domain (focused on mid-2015 through mid-2017) see my comprehensive Machine Learning Notes.

Matrix operations

See this page.

Memory (neural memory)

See my blog post How do Neural Networks "Remember"?.


In semantics, a meronym denotes a part or a member of something.


  • “page” is a meronym of “book”
  • “wheel”, “engine”, “door”, etc. are meronyms of “car”
  • “finger” is a meronym of “hand”
  • “bark” is a meronym of “tree” (bark is part of what makes up a tree), while “tree” is a meronym of “forest”
  • “elbow” is a meronym of “arm”, which is a meronym of “body”

See also Linguistics | Semantics.


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.


In linguistics, a morpheme is a meaningful morphological unit of a language that cannot be further divided; e.g., “incoming” consists of the morphemes “in”, “come” and “-ing”. A morpheme may or may not stand alone, whereas a word, by definition, is freestanding. Another example: “dogs” consists of two morphemes and one syllable: “dog”, and “-s”, a plural marker on nouns. Note that a morpheme like “-s” can just be a single phoneme and does not have to be a whole syllable.

See also:


Morphology is a subdiscipline of linguistics that studies word structure. During morphological processing we are basically considering words in a text separately and trying to identify morphological classes to which these words belong. A widespread morphological task is lemmatizing (lemmatization) and stemming, which are used in many web search engines. In this case all morphological variations of a given word (known as word forms) are collapsed to one lemma or stem. [Source]

Morphology concerns the structure and meaning of words. Some words, such as “send”, are atomic (monomorphemic), while others – such as “sends”, “sending” and “resend” – appear to be constructed from several atoms or morphemes. These morphemes (bits of words) occur in a lots of other words: “thinks”, “thinking”, “reprogram”, “rethink”, etc. There is a syntax to the way morphemes can combine – the affixes mentioned so far can all combine with verbs to make verbs; others, such as “able”, combine with verbs to make adjectives: e.g., “programmable”. Sometimes the meaning of a word is a regular, productive combination of the meanings of its morphemes, such as “unreprogramability”. Frequently, however, it is not (or isn’t completely) obvious; e.g. “react”, or “establishment”. [Source; Introduction to Linguistics for Natural Language Processing]

See also Linguistics | Semantics.

Multiple instance learning [Multi-instance learning]

In machine learning, multiple instance learning (MIL) is a type of supervised learning. Instead of receiving a set of instances which are individually labeled, the learner receives a set of labeled bags, each containing many instances. In the simple case of multiple-instance binary classification, a bag may be labeled negative if all the instances in it are negative. On the other hand, a bag is labeled positive if there is at least one instance in it which is positive. From a collection of labeled bags, the learner tries to either (i) induce a concept that will label individual instances correctly or (ii) learn how to label bags without inducing the concept.

A simple example for MIL was given in Babenko (2008). Imagine several people, and each of them has a key chain that contains few keys. Some of these people are able to enter a certain room, and some aren’t. The task is then to predict whether a certain key or a certain key chain can get you into that room. To solve this problem we need to find the exact key that is common for all the “positive” key chains. If we can correctly identify this key, we can also correctly classify an entire key chain - positive if it contains the required key, or negative if it doesn’t.


[Image source. Click image to open in new window.]


[Image source.  The source cited [59] is Maron & Lozano-Pérez, 1998.  Click image to open in new window.]

Papers that employ multi-instance learning include:

  • Generating a Tolerogenic Cell Therapy Knowledge Graph from Literature (Nov 2017) [code] employed supervised multi-instance learning for the construction of a biomedical KG. Multi-instance learning organizes instances of objects into bags, which consist simply of sets of instances with a common property. All instances are negative if the bag label is negative, or at least one instance is positive if the bag label is positive. Therefore, there is no need to manually label the relations in the documents.

  • Distant Learning for Entity Linking with Automatic Noise Detection (May 2019).  “… we show how we can learn to link mentions without having any labeled examples, only a knowledge base and a collection of unannotated texts from the corresponding domain. In order to achieve this, we frame the task as a multi-instance learning problem and rely on surface matching to create initial noisy labels … we introduce a noise detection component in our model: it lets the model detect and disregard examples which are likely to be noisy. Our method, jointly learning to detect noise and link entities, greatly outperforms the surface matching baseline and for a subset of entity categories even approaches the performance of supervised learning.”

Machine learning: Multiple instance learning

Depending on the type and variation in training data, machine learning can be roughly categorized into three frameworks: supervised learning, unsupervised learning, and reinforcement learning. Multiple instance learning (MIL) falls under the supervised learning framework, where every training instance has a label, either discrete or real valued. MIL deals with problems with incomplete knowledge of labels in training sets. More precisely, in multiple-instance learning, the training set consists of labeled “bags”, each of which is a collection of unlabeled instances. A bag is positively labeled if at least one instance in it is positive, and is negatively labeled if all instances in it are negative. The goal of the MIL is to predict the labels of new, unseen bags.

Biomedical applications

Generating a Tolerogenic Cell TherapyTolerogenic therapy aims to induce immune tolerance where there is pathological or undesirable activation of the normal immune response. Knowledge Graph from Literature (Nov 2017) [code] employed supervised multi-instance learning to the construction of a biomedical KG. MIL organizes instances of objects into bags, which consist simply of sets of instances with a common property. All instances are negative if the bag label is negative, or at least one instance is positive if the bag label is positive. Therefore, there is no need to manually label the relations in the documents.

Their objective in this paper was to automatically generate a KG for tolerogenic (immune) cell therapy from PubMed abstracts. They developed a ML-based system (see Fig. 2
and Fig. 3
) to annotate each PubMed abstract, generating the possible combinations of cell-cytokine bioentity pairs [e.g. $\small \text{(macrophage, CXCL2)}$] co-occurring in the same sentence, thus identifying meaningful relations between cells and cytokines. The extracted relations were used to generate a knowledge graph, where each edge was supported by one or more documents.


[Image source. Click image to open in new window.]

Multiple instance learning was also used in Biophysicochemical Motifs in T-cell Receptor Sequences Distinguish Repertoires from Tumor-Infiltrating Lymphocyte and Adjacent Healthy Tissue (Apr 2019).

See also:

Mutual information

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the “amount of information” (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable. The concept of mutual information is intricately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected “amount of information” held in a random variable.


[Image source. Click image to open in new window.]

Venn diagram showing additive and subtractive relationships various information measures associated with correlated variables $\small X$ and $\small Y$. The area contained by both circles is the joint entropy $\small H(X,Y)$. The circle on the left (red and violet) is the individual entropy $\small H(X)$, with the red being the conditional entropy $\small H(X|Y)$. The circle on the right (blue and violet) is $\small H(Y)$, with the blue being $\small H(Y|X)$. The violet is the mutual information $\small I(X;Y)$.

Not limited to real-valued random variables like the correlation coefficient, MI is more general and determines how similar the joint distribution $\small p(x,y)$ is to the products of factored marginal distribution $\small p(x) \cdot p(y)$. MI is the expected value of the pointwise mutual information (PMI).

Mutual Information: Definition

Formally, the mutual information of two discrete random variables $\small X$ and $\small Y$ can be defined as:

    $\small \begin{align} \operatorname {I} (X;Y)=\sum _{y\in { {Y}}}\sum _{x\in { {X}}}{p(x,y)\log {\left({\frac {p(x,y)}{p(x)\,p(y)}}\right)}} ènd{align}$,

where $\small p(x,y)$ is the joint probability function of $\small X$ and $\small Y$, and $\small p(x)$ and $\small p(y)$ are the marginal probability distribution functions of $\small X$ and $\small Y$, respectively.

In the case of continuous random variables, the summation is replaced by a definite double integral:

    $\small \begin{align} \operatorname {I} (X;Y)=\int _{ {Y}}\int _{ {X}}{p(x,y)\log {\left({\frac {p(x,y)}{p(x)\,p(y)}}\right)}}\;dx\,dy ènd{align}$,

where $\small p(x,y)$ is now the joint probability density function of $\small X$ and $\small Y$, and $\small p(x)$ and $\small p(y)$ are the marginal probability density functions of $\small X$ and $\small Y$, respectively.

If the log base 2 is used, the units of mutual information are bits.

See also:

Pointwise mutual information

Pointwise mutual information (PMI), or point mutual information, is a measure of association used in information theory and statistics. In contrast to mutual information (MI, above, which builds upon PMI), PMI refers to single events, whereas MI refers to the average of all possible events.

The PMI of a pair of outcomes $\small x and $\small y belonging to discrete random variables $\small X and $\small Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically:

    $ \operatorname {pmi} (x;y)èquiv \log {\frac {p(x,y)}{p(x)p(y)}}=\log {\frac {p(x|y)}{p(x)}}=\log {\frac {p(y|x)}{p(y)}}$.

The mutual information (MI) of the random variables $\small X$ and $\small Y$ is the expected value of the PMI over all possible outcomes (with respect to the joint distribution $\small p(x,y))$.

The measure is symmetric: $\small \operatorname {pmi} (x;y) = \operatorname {pmi} (y;x)$. It can take positive or negative values, but is zero if $\small X$ and $\small Y$ are independent. Note that even though PMI may be negative or positive, its expected outcome over all joint events (MI) is positive. PMI maximizes when $\small X$ and $\small Y$ are perfectly associated, i.e. $\small p(x|y)$ or $\small p(y|x)=1$, yielding the following bounds:

    $\small -\infty \leq \operatorname {pmi} (x;y)\leq \min \left[-\log p(x),-\log p(y) \right]$.

Finally, $\small \operatorname {pmi} (x;y)$ will increase if $\small p(x|y)$ is fixed but $\small p(x)$ decreases.

Pointwise mutual information: Example

x y p(xy)
0 0 0.1
0 1 0.7
1 0 0.15
1 1 0.05

Using this table we can marginalize to get the following additional table for the individual distributions:

p(x) p(y)
0 0.8 0.25
1 0.2 0.75

With this example, we can compute four values for $\small pmi(x;y)$. Using base-2 logarithms:

    pmi(x=0; y=0) = -1
    pmi(x=0; y=1) = 0.222392
    pmi(x=1; y=0) = 1.584963
    pmi(x=1; y=1) = -1.584963

For reference, the mutual information $\small \operatorname {I} (X;Y)$ would then be 0.2141709.

PMI is discussed in:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Named entities;
Named entity recognition (NER)

The difference between an entity and a named entity is essentially the same difference between nouns and proper nouns. An entity can be nominal – a common thing like a city – whereas a named entity is more like a proper noun, such as a name (Paris). In other words, a named entity is something that deserves to have a name. A human is an entity, but if we give human a name, this produces a named entity. Named entities may consist of more than one word.

IE vs. IR

Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names.

NER (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER has been structured as taking an unannotated block of text, such as

    Jim bought 300 shares of Acme Corp. in 2006.

and producing an annotated block of text that highlights the names of entities,

    [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.

State-of-the-art NER systems for English produce near-human performance.

See also [NLTK] Named Entity Recognition, which includes a short description of the “IOB format” (or sometimes “BIO Format”), developed for NP chunking. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O.

An example of this scheme is shown here:

    [Tag Representation of Chunk Structures. Image source. Click image to open in new window.]

IOB tags have become the standard way to represent chunk structures in files. Here is how the information in 2.5 would appear in a file:

      We PRP B-NP
      saw VBD O
      the DT B-NP
      yellow JJ I-NP
      dog NN I-NP

In this representation there is one token per line, each with its part-of-speech tag
Source: Image source: Section 8.2 (Tagsets for English) in
Jurafsky D & Martin JH (2000) Speech and Language processing (p. 295)
and chunk tag.

Named entity disambiguation
Named entity normalization

Polysemy, words or phrases with different but related meanings poses a challenge to NLP; for example, “Washington” could refer to the location “Washington, DC” or the person “George Washington”; “ACE” could represent “angiotensin converting enzyme” or “acetylcholinesterase”. In polysemy, the fact that multiple entities might have the same name is common for named entities. The task of addressing the polysemy problem for named entities is called named entity disambiguation. Named entity disambiguation/normalization is the task of mapping of a named entity or type in to a unique identifier or concept (e.g., disease names may be mapped to the National Library of Medicine’s Medical Subject Headings disease terms). For more on polysemy, see Analyzing Polysemous Concepts from a Clinical Perspective: Application to Auditing Concept Categorization in the UMLS.

See also word sense disambiguation.

Natural language inference

Natural language inference (NLI), also known as “recognizing textual entailment” (RTE), is the task of identifying the relationship (entailment, contradiction, and neutral) that holds between a premise $\small p$ (e.g. a piece of text) and a hypothesis $\small h$. The most popular dataset for this task, the Stanford Natural Language Inference (SNLI Corpus*, contains 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral, supporting the task of NLI. A newer, Multi-Genre Natural Language Inference (MultiNLI corpus) is also available: a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The MultiNLI corpus is modeled on the SNLI corpus, but differs in that covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation.

Natural language processing

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.


[Image source (slide 8). Click image to open in new window.]

Natural language understanding

Natural language understanding (NLU) is a subtopic of natural language processing that deals with machine reading comprehension (see the image, above). Natural language understanding is considered an AI-hard problem. There is considerable commercial interest in the field because of its application to news-gathering, text categorization, voice-activation, archiving, and large-scale content analysis.


Neo4j is a graph database management system. Neo4j is an ACID-compliant transactional database with native graph storage and processing. Neo4j is accessible using the Cypher Query Language.

... continued here ...

Ngram (n-gram)

In the fields of computational linguistics and probability, an $\small n$-gram is a contiguous sequence of $\small n$ items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The $\small n$-grams typically are collected from a text or speech corpus. When the items are words, $\small n$-grams may also be called shingles.

Using Latin numerical prefixes, an $\small n$-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. English cardinal numbers are sometimes used, e.g., “four-gram”, “five-gram”, and so on.

In computational biology, a polymer or oligomer of a known size is called a $\small k$-mer instead of an $\small n$-gram, with specific names using Greek numerical prefixes such as “monomer”, “dimer”, “trimer”, “tetramer”, “pentamer”, etc., or English cardinal numbers, “one-mer”, “two-mer”, “three-mer”, etc.


Sample sequence
1-gram sequence
2-gram sequence
3-gram sequence
Vernacular name
Order of resulting Markov model
Protein sequencing amino acid ... Cys-Gly-Leu-Ser-Trp ... ..., Cys, Gly, Leu, Ser, Trp, ... ..., Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, ... ..., Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, ...
DNA sequencing base pair ...AGCTTCGA... ..., A, G, C, T, T, C, G, A, ... ..., AG, GC, CT, TT, TC, CG, GA, ... ..., AGC, GCT, CTT, TTC, TCG, CGA, ...
Computational linguistics character ...to_be_or_not_to_be... ..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ... ..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ... ..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ...
Computational linguistics word ... to be or not to be ... ..., to, be, or, not, to, be, ... ..., to be, be or, or not, not to, to be, ... ..., to be or, be or not, or not to, not to be, ...

The table above shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences. Here are further examples:

  • 3-grams
    • ceramics collectables collectibles
    • ceramics collectables fine
    • ceramics collected by
  • 4-grams
    • serve as the incoming
    • serve as the incubator
    • serve as the independent

See also kmer.


See also Probabilistic programming:

  • The final portion of An Introduction to Statistical Relational Learning (part 2) lists various probabilistic programming solutions (languages; platforms), and mentions this:

    RockIt  from University of Mannheim has an online and rest based interface and uses only Conjunctive Normal Forms (CNF), i.e. And-Or format in its formulas.” RockIt is a query engine for Markov logic.

The noisy-OR model is a probabilistic model that jointly models parameters (variables). Noisy-OR is a conditional probability distribution that describes the causal mechanisms by which parent nodes (e.g. diseases) affect the states of children nodes (e.g. symptoms). The Noisy-OR function has the useful property that the probability of a relationship is high with at least one reliable assertion yet increases with additional support.


[Image source. Click image to open in new window.]

Noisy-OR is used in the following papers:

  • Learning a Health Knowledge Graph from Electronic Medical Records (Jul 2017), in which maximum likelihood estimation of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, naive Bayes classifier, and a Bayesian network using noisy-OR gates. … Noisy-OR was chosen as a probabilistic model that jointly models diseases and symptoms; similar models have successfully been used in previous medical diagnosis applications. … The noisy-OR model significantly outperformed the other tested models, producing a high quality knowledge graph reaching precision of 0.85 for a recall of 0.6 in the clinical evaluation.

    Diseases (and their effects: symptoms) were separately modeled by logistic regression and naive Bayes; additionally, in naive Bayes, all symptoms were assumed to be conditionally independent from one another. However, the noisy-OR model jointly models all diseases and symptoms [wherein the central assumption of independence of effects (symptoms) is maintained]. Noisy-OR is a conditional probability distribution that describes the causal mechanisms by which parent nodes (e.g. diseases) affect the states of children nodes (e.g. symptoms). For example, given that patients tend to present with few diseases, the presence of one disease typically lowers the probability of others. In the logistic regression and naive Bayes models, however, since each disease is modeled separately (resulting in an assumption of independence between diseases), the presence of one disease does not (for example) rule out or lessen the probability of another disease (i.e. diagnosis).

  • The noisy-OR model was also employed in Biomedical Discovery Acceleration, with Applications to Craniofacial Development (2009; different authors). “… the explosion of new results in the scientific literature, particularly in molecular biomedicine, is both a blessing and a curse to the bench researcher. … In this paper, we describe a novel computational approach to this challenge, a knowledge-based system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data.” They also stated (paraphrased):

    • “One of the most popular methods to combine individual reliabilities is to assume independence of experts (naive Bayes assumption) and compute the integrated likelihood $\small P$ for each relationship using the Noisy-OR function … The Noisy-OR function has the useful property that the probability of a relationship is high with at least one reliable assertion yet increases with additional support. This property is especially relevant in biology, where it is often difficult to identify false negatives; a given assertion is strengthened by additional information but unlike the case for estimating the reliability of an expert on the whole, an individual assertion is not penalized for lack of additional evidence. Moreover, since the experts are assumed to be independent, experts can be removed from or added to the analysis without excessive re-computation.”


    [Image source. Click image to open in new window.]

  • On a Hypergraph Probabilistic Graphical Model (Nov 2018)

    • “We propose a directed acyclic hypergraph framework for a probabilistic graphical model that we call Bayesian hypergraphs. … Bayesian hypergraphs also allow a modeler to represent causal patterns of interaction such as Noisy-OR graphically (without additional annotations). …”

Non-negative matrix factorization

Non-negative matrix factorization (NMF or NNMF; nonnegative matrix factorization; non-negative matrix approximation) is a group of algorithms in multivariate analysis and linear algebra where a matrix $\small \mathbf{V}$ is factorized into (usually) two matrices $\small \mathbf{W}$ and $\small \mathbf{H}$, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically.

Non-negative matrix factorization is a tool for dimensionality reduction of datasets in which the values are constrained to be non-negative.

Let matrix $\small \mathbf{V}$ be the product of the matrices $\small \mathbf{W}$ and $\small \mathbf{H}$,

    $\small \mathbf{V} = \mathbf{W} \mathbf{H} \,$.

Matrix multiplication can be implemented as computing the column vectors of $\small \mathbf{V}$ as linear combinations of the column vectors in $\small \mathbf{W}$ using coefficients supplied by columns of $\small \mathbf{H}$. That is, each column of $\small \mathbf{V}$ can be computed as follows:

    $\small \mathbf{v}_{i} = \mathbf{W} \mathbf{h}_{i}$,

where $\small \mathbf{vi}$ is the $\small i^{th}$ column vector of the product matrix $\small \mathbf{V}$ and $\small \mathbf{h_i}$ is the $\small i^{th}$ column vector of the matrix $\small \mathbf{H}$.

When multiplying matrices, the dimensions of the factor matrices may be significantly lower than those of the product matrix and it is this property that forms the basis of NMF. NMF generates factors with significantly reduced dimensions compared to the original matrix. For example, if $\small \mathbf{V}$ is an $\small m × n$ matrix, $\small \mathbf{W}$ is an $\small m × p$ matrix, and $\small \mathbf{H}$ is a $\small p × n$ matrix then $\small p$ can be significantly less than both $\small m$ and $\small n$.

NMF finds applications in such fields as astronomy, computer vision, document clustering, chemometrics, audio signal processing, recommender systems, and bioinformatics.


[Image sources: top rowbottom leftbottom right  (local copy). Click image to open in new window.]

See also:


In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space - except for the zero vector, which is assigned a length of zero. A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero vector).

A norm must also satisfy certain properties pertaining to scalability and additivity which are given in the formal definition below.

A simple example is two dimensional Euclidean space $\small \mathbb{R}^2$ equipped with the “Euclidean norm” (see below). Elements in this vector space (e.g., (3, 7)) are usually drawn as arrows in a 2-dimensional Cartesian coordinate system starting at the origin (0, 0). The Euclidean norm assigns to each vector the length of its arrow. Because of this, the Euclidean norm is often known as the magnitude.

A vector space on which a norm is defined is called a normed vector space. Similarly, a vector space with a seminorm is called a seminormed vector space. It is often possible to supply a norm for a given vector space in more than one way.

Double bars (or sometimes even single bars) tend to denote a norm in Mathematics. Most likely, the double bars here are denoting the Euclidean norm. This is just the length of the vector. So for example, the vector $\small \begin{bmatrix} 1 & 2 & 3 ènd{bmatrix}$ has length

    $\small \Vert \begin{matrix} 1 & 2 & 3 ènd{matrix} \Vert = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} = 3.7417$

and the vector

    $\small \Vert \begin{matrix} 3 & -1 & 2 ènd{matrix} \Vert = \sqrt{3^2 + (-1)^2 + 2^2} = \sqrt{14} = 3.7417$

Notice that $\small \mathbf{Ax}$ is just a vector, so $\small \Vert \mathbf{Ax} \Vert$ is just the length of the vector. $\small \Vert \mathbf{x} \Vert$ is just the length of $\small \mathbf{x}$. So here you are looking for scaling of $\small \mathbf{x}$ under transformation by $\small \mathbf{A}$ to be between $\small m$ and $\small M$.

Source for example, above: What does double vertical-line means in linear algebra?

See also:


For good summaries of notations used in machine learning, consult:

Noun phrase

A noun phrase or nominal phrase (abbreviated NP) is a phrase that has a noun (or indefinite pronoun) as its head or performs the same grammatical function as such a phrase. Noun phrases are very common cross-linguistically, and they may be the most frequently occurring phrase type.

Noun phrases often function as verb subjects and objects, as predicative expressions, and as the complements of prepositions. Noun phrases can be embedded inside each other; for instance, the noun phrase some of his constituents contains the shorter noun phrase his constituents.

Some examples of noun phrases are underlined in the sentences below. The head noun appears in bold.

  • The election-year politics are annoying for many people.
  • Almost every sentence contains at least one noun phrase.
  • Current economic weakness may be a result of high energy prices</u>.

Those five beautiful shiny Arkansas Black apples look delicious” is a noun phrase of which apples is the head. To test, a single pronoun can replace the whole noun phrase, as in “They look delicious”.

Noun phrases can be identified by the possibility of pronoun substitution, as is illustrated in the examples below.

    a. This sentence contains two noun phrases.
    b. It contains them.

    a. The subject noun phrase that is present in this sentence is long.
    b. It is long.

    a. Noun phrases can be embedded in other noun phrases.
    b. They can be embedded in them.

A string of words that can be replaced by a single pronoun without rendering the sentence grammatically unacceptable is a noun phrase.

See also NLTK: Noun Phrase Chunking.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Obscuro seu abbreviatio [Obscure abbreviations]

  • a.s.: Almost Sure Convergence


    [Click image to open in new window.]


    [Image source, discussion. Click image to open in new window.]

    [See also.]  Here is an intuitive explanation. In terms of the sample space of events $\small \Omega$, an event $\small E$ happens almost surely if $\small Pr(E) = 1$, whereas an event happens surely if $\small E = \Omega$. An example: suppose we are independently flipping a fair coin infinitely many times. The event

      $\small \text{I will get heads infinitely often}$

    [which you need to read as “I will get an infinite number of heads, ignoring any tails that I get” rather than “I will only get heads”] is an almost sure event (because it is possible get only a finite number of heads … but how likely is that? Rigorous proof uses the Borel Cantelli lemma, if you are interested).

    In contrast,

      $\small \text{I will get heads $\small \mathbf{or}$ tails on my 16th flip}$

    must happen. This is a sure event.

  • i.i.d.: Independent and Identically Distributed random variables

    • In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (i.i.d. or iid or IID) if each random variable has the same probability distribution as the others and all are mutually independent. The annotation IID is particularly common in statistics, where observations in a sample are often assumed to be effectively IID for the purposes of statistical inference. The assumption (or requirement) that observations be IID tends to simplify the underlying mathematics of many statistical methods …

      Often the IID assumption arises in the context of sequences of random variables. Then “independent and identically distributed” in part implies that an element in the sequence is independent of the random variables that came before it. In this way, an IID sequence is different from a Markov sequence, where the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first order Markov sequence). An IID sequence does not imply the probabilities for all elements of the sample space or event space must be the same; for example, repeated throws of loaded dice will produce a sequence that is IID, despite the outcomes being biased.

  • i.f.f.: If and only if

    • In logic and related fields such as mathematics and philosophy, if and only if (shortened iff ) is a biconditional logical connective between statements. In that it is biconditional (a statement of material equivalence), the connective can be likened to the standard material conditional (“only if”, equal to “if … then”) combined with its reverse (“if”); hence the name. The result is that the truth of either one of the connected statements requires the truth of the other (i.e. either both statements are true, or both are false). It is controversial whether the connective thus defined is properly rendered by the English “if and only if”, with its pre-existing meaning.

      In writing, phrases commonly used, with debatable propriety, as alternatives to $\small \textit{P “if and only if” Q}$ include:

      • $\small \textit{Q is necessary and sufficient for P}$,
      • $\small \textit{P is equivalent (or materially equivalent) to Q}$  (compare material implication),
      • $\small \textit{P precisely if Q}$,
      • $\small \textit{P precisely (or exactly) when Q}$,
      • $\small \textit{P exactly in case Q}$, and
      • $\small \textit{P just in case Q}$.

      Some authors regard “iff” as unsuitable in formal writing; others use it freely.

      In logical formulae, logical symbols are used instead of these phrases; see the discussion of notation.

      Notation. The corresponding logical symbols are $\leftrightarrow$, $\Leftrightarrow$, and $èquiv$ and sometimes $\text{iff}$ …

Common Latin and Non-English Abbreviations Used in Research

A.D. (Anno Domini). Used to date years by reckoning the date of Christ’s birth, as opposed to B.C., the years “Before Christ.” Literally, Anno Domini means “In the year of the Lord.” Remember two important notes! Anno Domini does not mean “After Death ” (if it did, there would be a thirty-three year gap between 1 BC and the crucifixion thirty-three years later). Also note the politically correct tendency is to use the abbreviation CE (Common Era) and BCE (Before Common Era). These abbreviations are an attempt to avoid the religious connotations of the Latin abbreviation. In spite of the name change, BCE and CE still divide history according to the life of Christ, but CE and BCE may be less offensive (or at least less ethnocentric) to a non-Christian audience.

c. (circa). Used by historians to show that a date is approximate. Literally, the word means “around,” and it is sometimes abbreviated “ca.” Usage: Shortly after Henry IV seized the throne from Richard II, Geoffrey Chaucer died (c.1400 A.D.), perhaps due to old age.

cf. (confere). A Latin imperative suggesting the reader should compare and contrast one statement or idea with another one. Literally, “compare.” Researchers often follow the abbreviation with a reference to an author or page number, suggesting the reader look for similarities and differences between what a previous citation has said with the subsequent source listed. Usage: Some scholars think Hitler’s Mein Kampf used genocidal ideas found in earlier anti-Semitic literature him (Smith 42), but others argue Hitler himself was the primary originator (cf. Jones 98).

e.g. (exempli gratia). “For example.” Literally, “free as an example.” Usage: “We have numerous problems to deal with before reforming welfare policies, e.g., the trade deficit, Medicare, and social security.”

et al. [et alia (neuter plural)]. It can also be an abbreviation for et alii (masculine plural), or et aliae (feminine plural). This phrase means “and others.” [Source]

etc. (et cetera). “And so on.” This is the one Latin abbreviation most students already know, and the one they tend to overuse. Do note that, since et already means and, it is redundant to write, “and etc.” Literally, the Latin phrase means “and other things.” Usage: The problems of the Balkan Republics are numerous, including insufficient electric power, poor highways, rampant unemployment, hostile neighbors, etc.

et pass. (et passim). And also found throughout the subsequent pages or sections. Literally, “And in the following.” The abbreviation typically appears after a citation of a single page, suggesting the reader look at that page first and then skim the material following for further discussion. Usage: For further discussion of this important issue, see Smith 42 et passim.

ib./ ibid. (ibidem). “In the same passage or page quoted above.” Literally, “In the same place.” Usage: “One physicist compared the behavior of quarks to bowling pins (Jones 35). He also indicated that the ‘Charm’ quark was like a ‘bowling ball’ (ibid.) due to the way it. . . .”

i.e. (id est). “That is more precisely.” Literally, “it is.” Commonly used to refine a general statement or provide additional information. Usage: “Jerry’s girlfriend always managed to turn the conversation toward children, i.e., the possibility of having children together; i.e., the possibility of having legitimate children together; i.e., toward the subject of marriage.”

inter alia. “among other things”; e.g., “… the study includes, inter alia, computers, aircraft, and pharmaceuticals …“  “Your grocery list may include bread, milk, and cereal, inter alia.”  Origin: Latin.

op. cit. Op. cit. is an abbreviation of the Latin phrase opere citato, meaning “in the work cited.” It is used in an endnote or footnote to refer the reader to a previously cited work, standing in for repetition of the full title of the work.

Ph.D. (Philosophiae Doctor). “Doctor (or Doctorate) of Philosophy.” It can refer to the individual as a title, or to the degree itself. Note that it is redundant to write, “Dr. McGillicutty is a Ph. D.” unless the writer seeks to distinguish him from a medical doctor such as an M.D. Usage: “Joe Bob McGillicutty, Ph. D., is on the committee.” Or, “McGillicutty earned his Ph. D. in art history.”

sic. Indicates a misspelling or error in a quoted source, in order to verify to the reader that the researcher did not create a typographical error, but instead exactly reproduces the way the word or statement appeared in the original material. Literally, “yes” or “even thus” in Latin. Usage: There are, according to the writings of seven-year old Andrew, “Manee wayes of riting words” [sic].

vs. (versus. “Turned against.)” Often used in abbreviations for legal trials–though “v.” is more common. Usage: “In the case of Roe v. Wade, the Supreme Court eventually decided that abortion was a medical right.” Don’t confuse the term “vs.” with “v.s.” (see below). And don’t confuse the word versus with verses.

w.r.t. (WRT): “with regard to”  [sic: the correct grammar is “in regard to”]  |  “with reference to”  |  “with respect to”. See this, and this.

Less Common Foreign Abbreviations

a.v. (ad valorem). “In proportion to the value of [something else].” Literally, “To the value.” Usage: “The monetary worth of the dollar is figured a.v. the price of gold.”

i.a. (in absentia). “In absence.” Usage: “With further evidence i.a., it is impossible to provide a definitive answer.” Or more commonly, “The criminal who had fled the country was tried and found guilty of murder, i.a.”

MS. (manuscriptum). A document, particularly an ancient or historical manuscript, that was not printed, but rather drawn or written. Literally, “By hand.” The term is capitalized when attached to a specific document’s title, and the plural form is MSS. In British usage, only the final letter typically has a period. Usage: “MS. Vercilli was found in Northern Italy, and it appears to be written in an Anglo-Saxon dialect.”

P.S. (post scriptum). The abbreviation indicates a last-minute addition to a letter or document. Literally, “After what has been written.” Usage: “That’s all for now. Take care. Love, John. P.S. Don’t forget to write me back!”

R.S.V.P. (Repondez S’il Vous-Plait). “Please send a response confirming whether or not you will accept the invitation.” The abbreviation is French rather than Latin. Literally, “Respond if it pleases you.” Note that it is redundant to write, “Please RSVP,” since the phrase itself implies “please.” Usage: “You are cordially invited to a wine-and-cheese reception at the Bradson’s House. RSVP by Thursday afternoon.”

S.P.Q.R. (Senatus Populusque Romani). The abbreviation was used in Roman times as a part of official government documentation. Today, the phrase is used to refer generally (and sometimes pompously or ironically) to the power, glory, and bureaucracy of a major nation. Literally, “The Senate and the People of Rome.” Usage: “The S.P.Q.R. has spoken, and now American soldiers must obey the call to arms.”

s.p.s. (sine prole supersite). “Without surviving issue.” The phrase is used in inheritance laws to indicate that an individual has no children or legal inheritors. Usage: “Since Mrs. Clayton died s.p.s., her six million dollar estate will revert to the City of Portland.”

t.i.d. (ter in die). “Three times a day.” Used by older pharmacies and doctors to indicate that a medication should be taken three times a day. Usage: “Aspirin, t.i.d.; call if headaches continue.”

viz. (videlicit). “More appropriately or accurately; namely.” The abbreviation is often used interchangeably with i.e. Literally, “As it befits or is pleasing to him.” Usage: “He was a minor Duke in the House of Lords, viz. the Duke of Rochester.”

vide. (“Look” or “see”). This phrase refers the reader back up to a previous statement or definition within the body of the paper. The must common uses are “vide 63” (which means “see page sixty-three”), v.s. vide supra (“see earlier” or “look above on this page”) and v.i. vide infra (“See below” or “Look below”). Don’t confuse v.s. (vide supra) with v. or vs. (versus). Usage: “For the definition of the Latin word videlicit, vide supra.”

N.B.: (Nota Bene). The Latin imperative means “Take notice of this very carefully,” that is, pay special attention to this part because it is unusually important, tricky, or confusing. Usage: All assignments are due at the beginning of class. N. B.: I lock the door to the classroom once lecture begins.


An ontology is a representation / specification of a conceptualization of a domain of knowledge, characterizing the classes and relations that exist in the domain. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. Commonly, ontologies are represented as graph structure that represents a taxonomy.

Ordinary differential equations (ODE)

In mathematics, an ordinary differential equation (ODE) is a differential equation containing one or more functions of one independent variable and the derivatives of those functions. The term ordinary is used in contrast with the term partial differential equation [see my Partial Derivatives entry for a few examples] which may be with respect to more than one independent variable.

See also:


Ordinary Differential Equations and Boolean Networks in Application to Modelling of 6-Mercaptopurine Metabolism (Apr 2017)


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Computational Modeling of the Metabolic States Regulated by the Kinase Akt  (Nov 2012).  “… We compared two metabolic states generated by the specific variation of the fluxes regulated by the activity of the PI3K/Akt/mTOR  pathway. One state represented the metabolism of a growing cancer cell characterized by aerobic glycolysis and cellular biosynthesis (condition H), while the other (condition L) represented the same metabolic network with a reduced glycolytic rate, a reduced lactic acid production, but a higher MPM, as reported in literature in relation to a lower activity of PI3K/Akt/mTOR  (DeBerardinis et al., 2008). Some steps of the metabolic network that link glycolysis and PPP, namely those catalyzed by the G6PDH and TKL enzymes, revealed their importance for the cancer metabolic state. Results from our model may provide insight and assist in the selection of drug targets in anticancer treatments. …”  |  “The model is available in BioModels database (Li et al., 2010a) with identifier MODEL1210150000.”

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See Proofs.


Parameters are variables of a model that the machine learning system trains on its own. For example, weights are parameters whose values the ML system gradually learns through successive training iterations.

Contrast with hyperparameters, the “knobs” that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.

Partial Derivatives

See my note, here:


[Click image to open in new window.]


$\small \frac{\partial}{\partial x}2x^3 = 6x^2$

$\small \frac{\partial}{\partial x}(2x^3 + 3c) = 6x^2 + 0 = 6x^2$

$\small \frac{\partial}{\partial x}3c = 0$

$\small \frac{\partial}{\partial x}c = 0$

$\small \frac{\partial}{\partial x}(2x^3 + 5z^4) = 6x^2$

$\small \frac{\partial}{\partial z}(2x^3 + 5z^4) = 20z^3$


Parts of speech (POS)

In traditional grammar, a part of speech (abbreviated form: PoS or POS) is a category of words (or, more generally, of lexical items) which have similar grammatical properties. Words that are assigned to the same part of speech generally display similar behavior in terms of syntax – they play similar roles within the grammatical structure of sentences – and sometimes in terms of morphology, in that they undergo inflection for similar properties. Commonly listed English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, and sometimes numeral, article or determiner.

  • Noun (names): a word or lexical item denoting any abstract entity (abstract noun: e.g. home) or concrete entity (concrete noun: e.g. house); a person (police officer, Michael), place (coastline, London), thing (necktie, television), idea (happiness), or quality (bravery). Nouns can also be classified as count nouns or non-count nouns; some can belong to either category. The most common part of speech; nouns are called naming words.

  • Pronoun (replace or again placed): a substitute for a noun or noun phrase (them, he). Pronouns make sentences shorter and clearer since they replace nouns.

  • Adjective (describes, limits): a modifier of a noun or pronoun (big, brave). Adjectives make the meaning of another word (noun) more precise.

  • Verb (states action or being): a word denoting an action (walk), occurrence (happen), or state of being (be). Without a verb a group of words cannot be a clause or sentence.

  • Adverb (describes, limits): a modifier of an adjective, verb, or another adverb (very, quite). Adverbs make language more precise.

  • Preposition (relates): a word that relates words to each other in a phrase or sentence and aids in syntactic context (in, of). Prepositions show the relationship between a noun or a pronoun with another word in the sentence.

  • Conjunction (connects): a syntactic connector; links words, phrases, or clauses (and, but). Conjunctions connect words or group of words

  • Interjection (expresses feelings and emotions): an emotional greeting or exclamation (Woot, Hurray). Interjections express strong feelings and emotions.

  • Article (describes, limits): a grammatical marker of definiteness (the) or indefiniteness (a, an). The article is not always listed among the parts of speech. It is considered by some grammarians to be a type of adjective[13] or sometimes the term ‘determiner’ (a broader class) is used.

In the English language, words can be considered as the smallest elements that have distinctive meanings. Based on their use and functions, words are categorized into several types or parts of speech. This article will offer definitions and examples for the 8 major parts of speech in English grammar: noun, pronoun, verb, adverb, adjective, conjunction, preposition, and interjection.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

POS: functional classification

Linguists recognize that the above list of eight or nine word classes is drastically simplified. For example, “adverb” is to some extent a catch-all class that includes words with many different functions. Some have even argued that the most basic of category distinctions, that of nouns and verbs, is unfounded, or not applicable to certain languages.

Modern linguists have proposed many different schemes whereby the words of English or other languages are placed into more specific categories and subcategories based on a more precise understanding of their grammatical functions.

Common lexical categories defined by function may include the following (not all of them will necessarily be applicable in a given language):

  • Categories that will usually be open classes:
    • adjectives
    • adverbs
    • nouns
    • verbs (except auxiliary verbs)
    • interjections
  • Categories that will usually be closed classes:
    • auxiliary verbs
    • clitics
    • coverbs
    • conjunctions
    • determiners (articles, quantifiers, demonstrative adjectives, and possessive adjectives)
    • particles
    • measure words or classifiers
    • adpositions (prepositions, postpositions, and circumpositions)
    • preverbs
    • pronouns
    • contractions
    • cardinal numbers

Within a given category, subgroups of words may be identified based on more precise grammatical properties. For example, verbs may be specified according to the number and type of objects or other complements which they take. This is called subcategorizaton.

Many modern descriptions of grammar include not only lexical categories or word classes, but also phrasal categories, used to classify phrases, in the sense of groups of words that form units having specific grammatical functions. Phrasal categories may include noun phrases (NP), verb phrases (VP) and so on. Lexical and phrasal categories together are called syntactic categories.


[Image source. Click image to open in new window.]


[Image source: Section 8.2 (Tagsets for English) in Jurafsky D & Martin JH (2000)
  Speech and Language processing (p. 295). Click image to open in new window.]

Performance measures (metrics)

See this link, which discusses:

  • ROC curves (receiver operating characteristic curve)
  • Accuracy, precision, recall, $\small F_1$ score
    • true positives, true negatives
    • false positives, false negatives
    • confusion matrix
    • accuracy
    • misclassification rate
    • precision
    • recall (sensitivity; true positive rate)
    • false positive rate
    • specificity
    • prevalence
    • $\small F_1$ score
  • Performance metrics relevant to ML, NLP:
    • MRR (mean reciprocal rank)
    • filtered MRR
    • perplexity
    • BLEU (BiLingual Evaluation Understudy) score
    • ROUGE score


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.

Pointer mechanisms; Pointer-generators

For a more comprehensive summary, see my Pointer (Pointer-Generator) Mechanisms TECHNICAL REVIEW, briefly excerpted here. For a recent review, see Neural Abstractive Text Summarization with Sequence-to-Sequence Models (Dec 2018) provides an excellent review, that includes pointer-generator mechanisms.

Pointer networks were introduced by Oriol Vinyals et al. in Pointer Networks (Jun 2015; updated Jan 2017).

  • “We introduce a new neural architecture to learn the conditional probability of an output sequence with elements that are discrete tokens corresponding to positions in an input sequence. Such problems cannot be trivially addressed by existent approaches such as sequence-to-sequence and Neural Turing Machines, because the number of target classes in each step of the output depends on the length of the input, which is variable. Problems such as sorting variable sized sequences, and various combinatorial optimization problems belong to this class. Our model solves the problem of variable size output dictionaries using a recently proposed mechanism of neural attention. It differs from the previous attention attempts in that, instead of using attention to blend hidden units of an encoder to a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output. We call this architecture a Pointer Net (Ptr-Net). …”


[Image source. Click image to open in new window.]

While pointers (pointer networks; pointer mechanisms) were an important advance – providing, for example, one potential solution for rare and out of vocabulary (OOV) words – they suffered from two limitations. First, they were unable to select words that did not exist in the input sequence. Second, there was no option to choose whether to point or not: it always points.

Pointer-generator architectures can copy words from source texts via a pointer , and generate novel words from a vocabulary via a generator . With the pointing/copying mechanism factual information can be reproduced accurately, and out of vocabulary words can also be taken care of in the summaries.

Pointer / pointer-generator mechanisms have been used in the following works (itemized and paraphrased here; see Pointer (Pointer-Generator) Mechanisms for full descriptions).

Nallapati et al. [Caglar Gulcehre] Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond (Aug 2016) modeled abstractive text summarization using attentional encoder-decoder recurrent neural networks, achieving state of the art performance on two different corpora. They proposed several novel models that address critical problems in summarization that are not adequately modeled by the basic architecture, such as modeling key-words, capturing the hierarchy of sentence-to-word structure, and emitting words that are rare or unseen at training time. …


[Image source. Click image to open in new window.]

Gulcehre et al. [Yoshua Bengio], Pointing the Unknown Words (Aug 2016) proposed a novel way to deal with rare and unseen words for the neural network models using attention. Their model used two softmax layers to predict the next word in conditional language models: one predicted the location of a word in the source sentence, and the other predicted a word in the shortlist vocabulary. At each time-step, the decision of which softmax layer to use choose was made adaptively by an MLP conditioned on the context. …

Merity et al. [Richard Socher | MetaMind/Salesforce] Pointer Sentinel Mixture Models (Sep 2016) introduced the pointer sentinel mixture architecture for neural sequence models, which had the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. …


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. See et al. [Christopher Manning] Get To The Point: Summarization with Pointer-Generator Networks (Apr 2017) proposed a novel architecture that augmented the standard sequence-to-sequence attentional model in two orthogonal ways. First, they used a hybrid pointer-generator network that could copy words from the source text via pointing, which aided accurate reproduction of information while retaining the ability to produce novel words through the generator. Second, they used coverage to keep track of what had been summarized, which discouraged repetition.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Attentional, RNN-based encoder-decoder models for abstractive summarization have achieved good performance on short input and output sequences. For longer documents and summaries however these models often include repetitive and incoherent phrases. Paulus et al. [Richard Socher] A Deep Reinforced Model for Abstractive Summarization (Nov 2017) introduced a neural network model with a novel intra-attention that attended over the input and continuously generated output separately, and a new training method that combined standard supervised word prediction and reinforcement learning. Models trained only with supervised learning often exhibited ‘exposure bias’ – they assumed ground truth was provided at each step during training. However, when standard word prediction was combined with the global sequence prediction training of reinforcement learning, the resulting summaries became more readable.

Follow-on work (2018) by Socher and colleagues, Improving Abstraction in Text Summarization (Aug 2018), proposed two techniques to improve the level of abstraction of generated summaries. First, they decomposed the decoder into a contextual network that retrieved relevant parts of the source document, and a pretrained language model that incorporated prior knowledge about language generation. The decoder generated tokens by interpolating between selecting words from the source document via a pointer network as well as selecting words from a fixed output vocabulary. The contextual network had the sole responsibility of extracting and compacting the source document, whereas the language model was responsible for the generation of concise paraphrases. Second, they proposed a novelty metric that was optimized directly through policy learning (a reinforcement learning reward) to encourage the generation of novel phrases (summary abstraction).

The Natural Language Decathlon: Multitask Learning as Question Answering (Jun 2018), by Richard Socher and colleagues at SalesForce, who presented a NLP challenge spanning 10 tasks as well as a new Multitask Question Answering Network (MQAN) model that jointly learned all tasks in decaNLP without any task-specific modules or parameters in the multitask setting. The multitask question answering network (MQAN) was designed for decaNLP, and made use of two self-attention and multi-head decoder attention layers and a multi-pointer-generator decoder to multitask across all tasks in decaNLP.


[Image source. Click image to open in new window.]

Pointwise mutual information

See Mutual information.


Polysemy describes words or phrases with different but related meanings poses a challenge to NLP; for example, “Washington” could refer to the location “Washington, DC” or the person “George Washington”; “ACE” could represent “angiotensin converting enzyme” or “acetylcholinesterase”. In polysemy, the fact that multiple entities might have the same name is common for named entities. The task of addressing the polysemy problem for named entities is called named entity disambiguation.

See also: Named entity disambiguation;   Word sense disambiguation.

See also Linguistics | Semantics.


A portmanteau or portmanteau word is a linguistic blend of words, in which parts of multiple words or their phones (sounds) are combined into a new word, as in “smog”, coined by blending “smoke” and “fog” – or “motel”, from “motor” and “hotel”. In linguistics, a portmanteau is defined as a single morph that represents two or more morphemes.

The definition of portmanteau overlaps with the grammatical term contraction, but contractions are formed from words that would otherwise appear together in sequence, such as “do” and “not” to make “don’t”, whereas a portmanteau word is formed by combining two or more existing words that all relate to a singular concept. A portmanteau also differs from a compound, which does not involve the truncation of parts of the stems of the blended words. For instance, “starfish” is a compound, not a portmanteau, of “star” and “fish”; whereas a hypothetical portmanteau of “star” and “fish” might be “stish”. A portmanteau of “coyote” and “wolf” is “coywolf”.

Postulate (Axiom)

See Proofs.


Pragmatics is a subfield of linguistics and semiotics that studies the ways in which context contributes to meaning. Pragmatics encompasses speech act theory, conversational implicature, talk in interaction and other approaches to language behavior in philosophy, sociology, linguistics and anthropology. Unlike semantics, which examines meaning that is conventional or “coded” in a given language, pragmatics studies how the transmission of meaning depends not only on structural and linguistic knowledge (e.g., grammar, lexicon, etc.) of the speaker and listener, but also on the context of the utterance, any pre-existing knowledge about those involved, the inferred intent of the speaker, and other factors. In this respect, pragmatics explains how language users are able to overcome apparent ambiguity, since meaning relies on the manner, place, time, etc. of an utterance.

  • The ability to understand another speaker’s intended meaning is called pragmatic competence.

Pragmatics is:

  • the study of the practical aspects of human action and thought.
  • the study of the use of linguistic signs, words and sentences, in actual situations.[1]

Pragmatics outlines the study of meaning in the interactional context. It looks beyond the literal meaning of an utterance and considers how meaning is constructed as well as focusing on implied meanings. It considers language as an instrument of interaction, what people mean when they use language and how we communicate and understand each other.

Pragmatics considers:

  • the negotiation of meaning between speaker and listener.
  • the context of the utterance.
  • the meaning potential of an utterance.

What would happen to language if Pragmatics did not exist? Pragmatics acts as the basis for all language interactions and contact. It is a key feature to the understanding of language and the responses that follow this. Therefore, without the function of Pragmatics, there would be very little understanding of intention and meaning.


See also Linguistics | Semantics.

Precision medicine (deprecates "personalized medicine")

What is the difference between precision medicine and personalized medicine? What about pharmacogenomics?

“There is a lot of overlap between the terms ‘precision medicine’ and ‘personalized medicine.’ According to the National Research Council, ‘personalized medicine’ is an older term with a meaning similar to ‘precision medicine.’ However, there was concern that the word ‘personalized’ could be misinterpreted to imply that treatments and preventions are being developed uniquely for each individual; in precision medicine, the focus is on identifying which approaches will be effective for which patients based on genetic, environmental, and lifestyle factors. The Council therefore preferred the term ‘precision medicine’ to ‘personalized medicine.’ However, some people still use the two terms interchangeably.

Pharmacogenomics is a part of precision medicine. Pharmacogenomics is the study of how genes affect a person’s response to particular drugs. This relatively new field combines pharmacology (the science of drugs) and genomics (the study of genes and their functions) to develop effective, safe medications and doses that are tailored to variations in a person’s genes.”

Pretrained models

Pretrained models are models or model components (such as word or language embeddings) that have been already been trained. Sometimes, pretrained embeddings are fed into a neural network. [Contrast this with the more common situation where the model trains the embeddings de novo, rather than relying on the pretrained embeddings.] The basic steps are:

  1. You have machine learning model $\small m$.

  2. Pretraining: You have a dataset $\small A$ on which you train $\small m$.

  3. You have a dataset $\small B$. Before you start training the model, you initialize some of the parameters of $\small m$ with the model, which was trained on $\small A$.

  4. Fine-tuning: you train $\small m$ on $\small B$.

Pretrained models: word embeddings & language models

Collobert and Weston demonstrated the power of pretrained word embeddings as a highly effective tool when used in downstream tasks in their 2008 paper A Unified Architecture for Natural Language Processing, while also announcing a neural network architecture upon which many current approaches are built. It was Mikolov et al. (2013), however, who popularized word embedding through the introduction of word2vec, a toolkit enabling the training and use of pretrained embeddings (Efficient Estimation of Word Representations in Vector Space).

A particularly exciting recent advance in NLP is the development of pretrained language models such as ELMo (released in February 2018 by Allen NLP), ULMFiT (May 2018 by and Aylien Ltd.), and OpenAI Transformer (June 2018 by OpenAI). Those papers demonstrated that pretrained language models can achieve state of the art results on a wide range of NLP tasks. Language models provide an improved understanding of the semantic content and latent (hidden) relationships in documents.

For an introductory discussion of the potential impact of those models. e.g. on transfer learning (below), see:

See also:

Pretrained models: transfer learning

Note also the links provided immediately above.

Transfer Learning (a subfield of which is domain adaptation) is the reuse of a pretrained model on a new problem. In transfer learning, the knowledge of an already trained machine learning model is applied to a different but related problem. For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the knowledge that the model gained during its training to recognize other objects like sunglasses. With transfer learning, we basically try to exploit what has been learned in one task to improve generalization in another. We transfer the weights that a network has learned at Task A to a new Task B. The general idea is to use knowledge, that a model has learned from a task where a lot of labeled training data is available, in a new task where we don’t have a lot of data. Instead of starting the learning process from scratch, you start from patterns that have been learned from solving a related task.

Although multitask learning and transfer learning have similarities, they are not the same. Transfer learning only aims at achieving high performance in the target task by transferring knowledge from the source task, while multitask learning tries to learn the target and the source tasks simultaneously.

Whereas in transfer learning we mainly care about doing well on our target task or domain, in multitask learning the objective is to do well on all available tasks (An Overview of Multi-Task Learning in Deep Neural Networks). Alternatively, we can also use the knowledge acquired by learning from related tasks to do well on a target. Crucially, in contrast to transfer learning, some labeled data is usually assumed for each task. In addition, models are trained jointly on source and target task data, which is not the case for all transfer learning scenarios. However, even if target data is not available during training, insights about tasks that are beneficial for multitask learning can still inform transfer learning decisions.</font></a>

Probabilistic graphical models

Complex systems generally involve a significant degree of uncertainty; therefore, to obtain meaningful conclusions we need to reason not only about what is possible, but what is probable. Probabilistic models provide a formal framework for considering multiple possible outcomes and their likelihood (probability). Probabilistic models allow tractable approximations of complex systems.

In a probabilistic graphical model (PGM) the conditional dependencies between the random variables are specified via a graph. Each node represents a random variable (or group of random variables), and the links (edges) express probabilistic relationships between these variables. The graph then captures the way in which the joint distribution over all of the random variables can be decomposed into a product of factors each depending only on a subset of the variables. Graphical models provide a flexible framework for modeling large collections of variables with complex interactions, with wide domain of application including machine learning, computer vision, speech and computational biology.


[Image source. Click image to open in new window.]

The two major classes of PGM are Bayesian networks and Markov random fields. Bayesian networks are directed graphical models in which the links of the graphs have a particular directionality indicated by arrows.  The links in Markov random fields, also known as undirected graphical models, do not carry arrows and have no directional significance. Directed graphs are useful for expressing causal relationships between random variables, whereas undirected graphs are better suited to expressing soft constraints between random variables.


Left: Markov network. Right: Bayesian network.  [Image source. Click image to open in new window.]

Probabilistic graphical models (PGMs) can be used to describe directed and undirected relationships between variables. In the genomics setting, each variable (e.g., genes, proteins) is a node in the network and viewed as a random variable, which is subject to uncertainty. The links in the network convey a relevant measure of association, e.g., correlation (undirected) or causality (directed). The network structure can be decomposed into small regions and translated into a product of conditional probabilities, which represents the joint probability distribution.

Undirected, Markov Networks portray symmetric relationships. A link in this model is present if the linked nodes are associated after controlling for the influence of other nodes in the graph (conditional association).

In a directed graph, an edge $\small A \to B$ implies that independent variable $\small A$ (parent node) is upstream of the dependent variable $\small B$ (child node) in the underlying causal process. Furthermore, the directed edge implies a causal effect of $\small A$ after the influence of the remaining nodes upstream of $\small B$ (ancestors of $\small B$) have been controlled for or removed.

Bayesian Networks (BNs) are directed acyclic graphs (DAGs), which contain no cycles, and thereby prohibit feedback in the model. Chain graphs contain a mixture of directed and undirected edges.

A fundamental challenge is to infer graphical models from data (Mathematical and Statistical Modeling in Cancer Systems Biology, and references therein). There are two distinct and difficult learning tasks: parameter estimation and structural learning. Parameter estimation is for the parameters of the conditional probabilities for a given network structure, and can be carried out using maximum likelihood approaches (Koller and Friedman, 2009). In structural learning, the aim is to identify the most likely network topology that came from the observational data. Structural learning is especially challenging because the number of possible network topologies is super-exponential with the number of nodes (Chickering et al., 1994). As a result, enumeration of all possible network topologies is impossible even for small problems, and machine learning and optimization techniques must be utilized (Koller and Friedman, 2009).

PGMs have been applied to investigate a number of different cancers and data types (Mathematical and Statistical Modeling in Cancer Systems Biology and references therein). Applications involve prediction and classification tasks, which have direct clinical relevance – for example, providing a probabilistic guide for a clinician to diagnose and treat different cancers. Another use of PGMs is to sort out the underlying mutations which put individuals at high risk. In other work, Markov networks were used to predict breast cancer survival after patients received different forms of treatments, e.g., combinations of chemotherapy, radiotherapy, and hormonal therapy. BNs were used to integrate clinical and microarray data for the classification of breast cancer patients into good and poor prognosis groups, and radiological decision support in distinguishing malignant and benign tumors.


Although it does not affect this (purely illustrative) example,
note that in the table next to Gene 3 "Gene 2" should be "Gene 3."
[Image source. Click image to open in new window.]

Nir Friedman – coauthor of Probabilistic Graphical Models (Koller and Friedman, 2009) – provided a concise overview of PGM applied to biological systems in Inferring Cellular Networks Using Probabilistic Graphical Models (2004), while Blair et al. summarized applications of statistical modeling (including PGM) to cancer in Mathematical and Statistical Modeling in Cancer Systems Biology (Jun 2012).

Todd Stephenson An Introduction to Bayesian Network Theory and Usage (2000) provides an excellent introduction and an example (fraudulent transactions) that enables the understanding of the more complicated, real world example (see Appendix C) in Modeling and Predicting the Occurrence of Brain Metastasis from Lung Cancer by Bayesian Network: A Case Study of Taiwan (2014).

Needham et al. (2006) Inference in Bayesian Networks (Nature Biotechnology provide another excellent example of Bayesian network modeling and inference – a simple cell-signaling pathway consisting of a stimulant, an extracellular signal, an inhibitor of the signal, a G protein-coupled receptor, a G protein and the cellular response. The stimulant induces production of the extracellular signal.


[Image source. Click image to open in new window.]

Additional Reading

Probabilistic programming

A probabilistic programming language (PPL) is a programming language designed to describe probabilistic models and then perform inference in those models. PPLs are closely related to graphical models and Bayesian networks, but are more expressive and flexible. Probabilistic programming represents an attempt to unify general purpose programming with probabilistic modeling.

Probabilistic reasoning is a foundational technology of machine learning. Probabilistic reasoning has been used for predicting stock prices, recommending movies, diagnosing computers, detecting cyber intrusions and image detection.

A probabilistic relational programming language (PRPL) is a PPL specially designed to describe and infer with probabilistic relational models (PRMs). A PRM is usually developed with a set of algorithms for reducing, inference about and discovery of concerned distributions, which are embedded into the corresponding PRPL.

An Introduction to Statistical Relational Learning (part 2)  [local copy] discusses probabilistic programming, and mentions the Alchemy  system, from the University of Washington.


[Image source. Click image to open in new window.]

Alchemy is a software package providing a series of algorithms for statistical relational learning and probabilistic logic inference, based on the Markov logic representation. Alchemy allows you to easily develop a wide range of AI applications, including:

  • collective classification;
  • link prediction;
  • entity resolution;
  • social network modeling; and,
  • information extraction.

Alchemy Tutorial  [local copy]

See also:

Probabilistic soft logic

Probabilistic soft logic (PSL) is a statistical relational learning (SRL; a subdiscipline of artificial intelligence and machine learning that is concerned with domain models that exhibit both uncertainty and complex, relational structure) framework for collective, probabilistic reasoning in relational domains. PSL uses first order logic rules as a template language for graphical models over random variables with soft truth values from the interval [0,1] – for reasoning probabilistically over continuously valued random variables.

In recent years there has been a rise in the approaches that combine graphical models and first order logic to allow the development of complex probabilistic models with relational structures. A notable example of such approaches is Markov logic networks (MLNs). Like MLNs, PSL is a modelling language (with an accompanying implementation on GitHub) for learning and predicting in relational domains. Unlike MLNs, PSL uses soft truth values for predicates in an interval between [0,1]. This allows for the underlying inference to be solved quickly as a convex optimization problem. This is useful in problems such as collective classification, link prediction, social network modelling, and object identification/entity resolution/record linkage.

As such, PSL is a machine learning framework for developing probabilistic models. PSL models are easy to use and fast. You can define models using a straightforward logical syntax and solve them with fast convex optimization. PSL has produced state of the art results in many areas spanning natural language processing, social-network analysis, knowledge graphs, recommender system, and computational biology.

The PSL framework is available as an Apache-licensed, open source project on GitHub with an active user group for support. PSL was used by Pujara et al. (2013a,b) for imposing constraints on knowledge graph construction.

Applications of PSL in the literature

A well-known project that employs ontologies and constraints is the Never Ending Language Learning project, NELL  [Never-Ending Learning (May 2018);  project (down mid-Feb 2019);  forum;  slides here and here], by William CohenDr. William A. Cohen, formerly at Carnegie Mellon University, presently Director of Research Engineering at Google AI. and colleagues at Carnegie Mellon University. NELL is a semantic machine learning system developed by a research team at Carnegie Mellon University. NELL processes a large-scale corpus of web sites and exploits a coupled process which learns text patterns corresponding type and relation assertions, as well as applies them to extract new entities and relations.

NELL employs coupling constraints (i.e., interactional limitations or logical constraints, described in Section 5.2 (“NELL’s Coupling Constraints,” pp. 108-109) in the Never-Ending Learning paper), including noun phrase classification, ontology-based coupling rules, and Horn clause couplings. Interestingly, that paper includes this statement (Section 6.2, p. 109):

    "The KI [knowledge integrator] integrates the incoming proposals for KB updates. For efficiency, the KI considers only moderate-confidence candidate beliefs, and re-assesses confidence using a limited radius subgraph of the full graph of consistency constraints and beliefs. As an example, for each new relational triple that the KI asserts, it checks that the entities in the relational triple have a category type consistent with the relation, but does not consider using new triples as a trigger to update beliefs about these argument types during the same iteration. Over multiple iterations, the effects of constraints propagate more widely through this graph of beliefs and constraints.

    "In Pujara et al. [35] a more effective algorithm is proposed [PSL] for the joint inference problem faced by NELL's KI [Knowledge Integrator]; we believe it will be helpful to upgrade NELL's KI in the future to use this approach."

Knowledge graph identification (KGI) is the task of removing noise, inferring missing information, and determining which candidate facts should be included into a KG. The reference in the NELL paper, above, to Pujara et al. refers to this paper: Knowledge Graph Identification [Pujara et al. (William CohenDr. William A. Cohen, formerly at Carnegie Mellon University, presently Director of Research Engineering at Google AI.), 2013a] [code here and here], which formulated a KGI approach that supported reasoning about multiple, uncertain extractor sources in the presence of ontological constraints. They solved KGI efficiently with convex optimization using probabilistic soft logic (PSL), for which their paper provides a good background.


[Image source. Click image to open in new window.]

Pujara et al. (William CohenDr. William A. Cohen, formerly at Carnegie Mellon University, presently Director of Research Engineering at Google AI.; 2013b) also published Ontology-Aware Partitioning for Knowledge Graph Identification [code here and here – same repositories as the preceding paper], describing the use of ontologies in knowledge graph completion (KGC) and fact checking. They again employed probabilistic soft logic, allowing them to apply global constraints instead of relying only on local features. PSL models were expressed using a set of universally-quantified logical rules that, when combined with data such as the noisy extractions and ontological information, defined a probability distribution over possible knowledge graphs. In the case of KGC, they introduced rules to relate uncertain extractions to the true relations and labels in the knowledge graph, pooled those facts across co-referent entities, and constrained relations and labels with rules that used ontological information such as domain and range constraints, and mutual exclusion relationships.


[Image source. Click image to open in new window.]

See also:

Lise Getoor is a coauthor on Kimming et al. (2012) and Bröcheler et al. (2010), as well as Pujara 2013a,b, above.]


See also:

Relevant logic notation:

  • $\small èxists$ : there exists at least one
  • $\small \forall$ : for all
  • $\small \neg$ : not (logical not)
  • $\small \lor$ : or (logical or)
  • $\small \land$ : and (logical and)
  • $\small \sim$ : is similar to
  • $\small \rightarrow$ : implies
  • $\small \leftarrow$ : is implied by

  • In Prolog, the symbol $\small \text{ :- }$, sometimes called a turnstile , is pronounced “ if “.

Prolog Turing-complete logic programming language associated with artificial intelligence and computational linguistics. Prolog has its roots in first-order logic, a formal logic, and unlike many other programming languages, Prolog is intended primarily as a declarative programming language: the program logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations. Prolog is well-suited for specific tasks that benefit from rule-based logical queries such as searching databases, voice control systems, and filling templates.

The following is excerpted from Quick-‘n’-Dirty Prolog Tutorial  [local copy].

Background: Prolog, whose name is from the phrase “PROgramming in LOGic,” is a special purpose language. At the heart of Prolog is a logical inference engine that, based on the data supplied by the programmer, generates answers to questions posed by the user. Unlike procedural languages (e.g., Java), Prolog is a language in which statements are logical expressions.

gprolog  (GNU Prolog) is an opensource implementation of Prolog.

Prolog Program Components: Technically, you don’t create a Prolog program; you create a Prolog database. In the database are two types of clauses: facts and rules.

A fact is a single piece of information. To represent the fact that the sky is blue, you could use the clause $\small \text{blue(sky).}$ . Similarly, $\small \text{mammal(rabbit).}$ says that rabbits are mammals. Those are both examples of single argument, or unary predicates. Facts can have multiple arguments; for example, $\small \text{plays(john,hockey).}$ indicates that john plays hockey.

  • Note that Prolog constants are in lower case, variables are in upper case, domains are not defined explicitly, and entries always end with a period.

Rules are used to generate new information from facts, other rules, and even themselves. Rules have the form $\small \text{head :- body.}$ , where the head and the body are clauses that typically use variables instead of constants (because rules are used to define general relationships).

For example, consider this rule: $\small \text{grandparent(X,Z) :- parent(X,Y), parent(Y,Z).}$ . The rule says that $\small \text{X}$ is the grandparent of $\small \text{Z}$ when $\small \text{X}$ is a parent of $\small \text{Y}$ and $\small \text{Y}$ is a parent of $\small \text{Z}$.

  • The comma represents a logical $\small \text{AND}$, the parent() clauses on either side are called subgoals, and $\small \text{X}$, $\small \text{Y}$, & $\small \text{Z}$ are variables. Remember, variables are upper case.

Rules may be recursive. Consider: $\small \text{ancestor(X,Y) :- parent(Z,Y) , ancestor(X,Z).}$ . This says that $\small \text{X}$ is an ancestor of $\small \text{Y}$ when $\small \text{Y}$ has a parent $\small \text{Z}$ and $\small \text{X}$ is an ancestor of $\small \text{Z}$. If that’s confusing, plug in your name for $\small \text{Y}$, one of your parent’s names for $\small \text{Z}$ (your mom, for example), and one of your mom’s parent’s names for $\small \text{X}$.

Notice that we don’t explain to Prolog how to work with these rules. That’s its problem, not ours. We just have to supply a logically complete database, and Prolog will take care of deciding which rules to use, and when, to answer a given question.

This brings us to the third type of clause: The query. When you want to ask Prolog a question, you supply a query . You can think of a query as being a rule without a body. Prolog will take that head and will attempt to find out if it is true or false. If the query has variables, Prolog will attempt to find all possible values that can be used in place of the variables to make the query true. Better yet, it will tell you what they are.

[ … SNIP! … ]

Connecting Prolog to Predicate Logic

You no doubt remember that in logic, we like to represent predicates with capital letters (e.g., $\small E(x,y)$ represents $\small \text{x eats y}$ (where $\small E$ represents $\small \text{eats}$).

    Consider this implication: $\small (P(x,y) \land P(y,z)) \rightarrow G(x,z)$.

    This says: If $\small P(x,y)$ is true and $\small P(x,z)$ is also true, then $\small G(x,z)$ is true.

This is exactly what our grandparent rule says in Prolog! The Prolog syntax is just backwards. Think of the $\small \text{ :- }$ symbol in Prolog as representing the word “if”, and the connection becomes clear.

Internally, Prolog represents rules in a form known as a Horn clause. A Horn clause is a disjunction of predicates in which at most one of the predicates is not negated. It sounds odd, but it matches well with Prolog’s needs. Consider the grandparent clause again:

    $\small \text{grandparent(X,Z) :- parent(X,Y), parent(Y,Z)}$.

which can be rewritten in logical notation, including quantification:

    $\small \forall x \forall y \forall z ((P(x,y) \land P(y,z)) \rightarrow G(x,z))$

We know that $\small \textbf{p $\rightarrow$ q}$ is logically equivalent to $\small \neg p \lor q$, and so the above is equivalent to:

    $\small \forall x \forall y \forall z (\neg (P(x,y) \land P(y,z)) \lor G(x,z))$

And De Morgan’s Laws tell us that we can replace the $\small \text{AND}$ with an $\small \text{OR}$ as the negation is pulled through, producing a Horn clause:

    $\small \forall x \forall y \forall z (\neg P(x,y) \lor \neg P(y,z) \lor G(x,z))$

Here’s why having a Horn clause is important: Prolog has facts in its database about parents. By employing a simplified version of the resolution rule of inference, Prolog can use a fact to remove a predicate from the Horn clause and get closer to an answer. For example, for the query

    $\small \text{grandparent(hank,X)}$

the resulting Horn clause would be

    $\small \text{not parent(hank,A) or not parent(A,B) or grandparent(hank,B)}$.

The database tells Prolog that $\small \text{parent(hank,ben)}$ is a fact, and Prolog can resolve that fact with the Horn clause if $\small \text{A = ben}$. That assignment is called unification  [local copycode], and the result is the Horn clause $\small \text{not parent(ben,B) or grandparent(hank,B)}$. Finding that ben $\small \text{carl}$ is the parent of carl $\small \text{ben}$ lets Prolog use resolution and unification once more to conclude that $\small \text{grandparent(hank,carl)}$ is true.

  • Note: there is an error on p. 5 (“Connecting Prolog to Predicate Logic”) in the source document (indicated/corrected above).

The following is excerpted from Horn clauses.

The following is an example of a (definite) Horn clause:

    $\small \neg p \lor \neg q \lor \ldots \lor \neg t \lor u$

Such a formula can be rewritten in the following form, which is more common in logic programming and similar fields:

    $\small (p \land q \land \ldots \land t) \rightarrow u$

In logic programming a definite clause behaves as a goal-reduction procedure. For example, the Horn clause written above behaves as the procedure:

    to show $\small u$, show $\small p$ and show $\small q$ and ... and show $\small t$ .

To emphasize this reverse use of the clause, it is often written in the reverse form:

    $\small u \leftarrow (p \land q \land \ldots \land t)$

In Prolog this is written as:

    $\small u \text{ :- } p, q, \dots, t$.

Prolog – see also:

Prolog – graph related:


See Proofs.


Here I summarize/collate several web sources.

Axiom: see Postulate.


  • An assertion that is then proved. It is often used like an informal lemma. [3]


  • This is an educated prediction that one makes based on their experience. The difference between a conjecture and a lemma/theorem/corollary is that it is usually an open research problem that either has no answer, or some partial answer.

    Conjectures are usually only considered important if they are authored by someone well known in their respective area of mathematics. Once it is proved or disproved, it ceases to be a conjecture and either becomes a fact (backed by a theorem) or there is some interesting counterexample to demonstrate how it is wrong. [1]

    Example: The Poincaré’ conjecture was a famous statement that remained an open research problem in topology for roughly a century. The claim was that every simply connected, compact 3-manifold was homeomorphic to the 3-sphere S3. This statement however is no longer a conjecture since it was famously proven by Grigori Perelman in 2003. [1]

  • A statement that is unproved, but is believed to be true (Collatz conjecture, Goldbach conjecture, twin prime conjecture). [3]


  • This is usually a result that is a direct consequence of a major theorem. Often times a theorem lends itself to other smaller results or special cases which can be shown by simpler methods once a theorem is proven. [1]

    Example: a consequence to the Hopf-Rinow theorem is that compact manifolds are geodesic complete. [1]

  • A result in which the (usually short) proof relies heavily on a given theorem (we often say that “this is a corollary of Theorem A”). [3]


  • This is an assignment of language and syntax to some property of a set, function, or other object. A definition is not something you prove, it is something someone assigns. Often you will want to prove that something satisfies a definition. [1]

    Example: we call a mapping $\small f : X \rightarrow Y$ injective if whenever $\small f(x) = f(y)$ then $\small x = y$. [1]

  • A precise and unambiguous description of the meaning of a mathematical term. It characterizes the meaning of a word by giving all the properties and only those properties that must be true. [3]



  • This is a property that one can derive or prove which is usually technical in nature and is not of primary importance to the overall body of knowledge one is trying to develop. Usually lemmas are there as precursors to larger results that one wants to obtain, or introduce a new technique or tool that one can use over and over again. [1]

    Example: in a Hausdorff space, compact subsets can be separated by disjoint open subsets. [1]

  • A minor result whose sole purpose is to help in proving a theorem. It is a stepping stone on the path to proving a theorem. Very occasionally lemmas can take on a life of their own (Zorn’s lemma, Urysohn’s lemma, Burnside’s lemma, Sperner’s lemma). [3]


  • A statement that can be shown, using a given set of axioms and definitions, to be both true and false. Paradoxes are often used to show the inconsistencies in a flawed theory (Russell’s paradox). The term paradox is often used informally to describe a surprising or counterintuitive result that follows from a given set of rules (Banach-Tarski paradox, Alabama paradox, Gabriel’s horn). [3]

Postulate (Axiom)

  • I would appreciate community input on this, but I haven’t seen this word used in any of the texts/papers I read. I would assume that this is synonymous with proposition. [1]

    I know Postulate is a synonym of axiom. Very used word in italian, but more in physics than mathematics. See Wikipedia:Axiom [1]

  • An axiom or postulate is a statement that is taken to be true, to serve as a premise or starting point for further reasoning and arguments. The word comes from the Greek axíōma (ἀξίωμα), “that which is thought worthy or fit” or “that which commends itself as evident.” [2]

  • A statement that is assumed to be true without proof. These are the basic building blocks from which all theorems are proved (Euclid’s five postulates, Zermelo-Fraenkel axioms, Peano axioms). [3]


  • In mathematics, a proof is an inferential argument for a mathematical statement. In the argument, other previously established statements, such as theorems, can be used. In principle, a proof can be traced back to self-evident or assumed statements, known as axioms, along with accepted rules of inference</a. Axioms may be treated as conditions that must be met before the statement applies. [4]

    Proofs are examples of exhaustive deductive reasoning or inductive reasoning and are distinguished from empirical arguments or non-exhaustive inductive reasoning (or “reasonable expectation”). A proof must demonstrate that a statement is always true (occasionally by listing all possible cases and showing that it holds in each), rather than enumerate many confirmatory cases. An unproved proposition that is believed to be true is known as a conjecture. [4]

    Proofs employ logic but usually include some amount of natural language which usually admits some ambiguity. In fact, the vast majority of proofs in written mathematics can be considered as applications of rigorous informal logic. Purely formal proofs, written in symbolic language instead of natural language, are considered in proof theory. The distinction between formal and informal proofs has led to much examination of current and historical mathematical practice, quasi-empiricism in mathematics, and so-called folk mathematics (in both senses of that term). The philosophy of mathematics is concerned with the role of language and logic in proofs, and mathematics as a language. [4]

  • Concept of scientific proof.

    While the phrase “scientific proof” is often used in the popular media, many scientists have argued that there is really no such thing. For example, Karl Popper once wrote that “In the empirical sciences, which alone can furnish us with information about the world we live in, proofs do not occur, if we mean by ‘proof’ an argument which establishes once and for ever the truth of a theory”. [5] Albert Einstein said:

    “The scientific theorist is not to be envied. For Nature, or more precisely experiment, is an inexorable and not very friendly judge of his work. It never says ‘Yes ’ to a theory. In the most favorable cases it says ‘Maybe ,’ and in the great majority of cases simply ‘No .’ If an experiment agrees with a theory it means for the latter ‘Maybe ,’ and if it does not agree it means ‘No. ’ Probably every theory will someday experience its ‘No’  – most theories, soon after conception.” [5]


  • This is a property that one can derive easily or directly from a given definition of an object. [1]

    Example: the identity element in a group is unique. [1]

  • A proved and often interesting result, but generally less important than a theorem. [3]


  • This is a property of major importance that one can derive which usually has far-sweeping consequences for the area of math one is studying. Theorems don’t necessarily need the support of propositions or lemmas, but they often do require other smaller results to support their evidence. [1]

    Example: every manifold has a simply connected covering space. [1]

  • A mathematical statement that is proved using rigorous mathematical reasoning. In a mathematical paper, the term theorem is often reserved for the most important results. [3]

“From a logical point of view, there is no difference between a lemma, proposition, theorem, or corollary – they are all claims waiting to be proved. However, we use these terms to suggest different levels of importance and difficulty.

  • “A lemma is an easily proved claim which is helpful for proving other propositions and theorems, but is usually not particularly interesting in its own right.

  • “A proposition is a statement which is interesting in its own right, while a theorem is a more important statement than a proposition which says something definitive on the subject, and often takes more effort to prove than a proposition or lemma.

  • “A corollary is a quick consequence of a proposition or theorem that was proven recently.”

Source: Footnote 4, p. 25 in Terence Tao, “Analysis I”, 3rd ed. (2015).

Proofs: Sources

[1] Terminology: Difference between Lemma, Theorem, Definition, Hypothesis, Postulate and a Proposition
[2] Axiom
[3] What is the difference between a theorem, a lemma, and a corollary?  [local copy]
[4] Mathematical proof
[5] Concept of scientific proof

See also: Scientific method.


See Proofs.


Provenance is a reference to literature from which a statement or its supporting evidence were derived.

Provenance (from the French provenir, “to come from/forth”) is the chronology of the ownership, custody or location of a historical object. The term was originally mostly used in relation to works of art but is now used in similar senses in a wide range of fields, including archaeology, paleontology, archives, manuscripts, printed books and science and computing. The primary purpose of tracing the provenance of an object or entity is normally to provide contextual and circumstantial evidence for its original production or discovery, by establishing, as far as practicable, its later history, especially the sequences of its formal ownership, custody and places of storage. The practice has a particular value in helping authenticate objects. Comparative techniques, expert opinions and the results of scientific tests may also be used to these ends, but establishing provenance is essentially a matter of documentation. The term dates to the 1780s in English. Provenance is conceptually comparable to the legal term chain of custody.

In a technological solution, blockchains can be used to ensure data provenance.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Quadratic formula

For $\small ax^2 + bx + c = 0$, the values of $\small x$ which are the solutions of the equation are given by:

$\small x = \dfrac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

For the quadratic formula to work, you must have your equation arranged in the form “$\small \text{(quadratic)} = 0$”. Also, the “$\small 2a$” in the denominator of the formula is underneath everything above, not just the square root.


Solve $\small x^2 + 3x - 4 = 0$

This quadratic happens to factor as

$\small x^2 + 3x - 4 = (x + 4)(x - 1) = 0$

so the solutions are $\small x = -4$ and $\small x = 1$. However, using the quadratic formula with $\small a = 1$, $\small b = 3$, and $\small c = -4$:

$\small x = \dfrac{-(3) \pm \sqrt{(3)^2 - 4(1)(-4)}}{2(1)}$ $\small = \dfrac{-3 \pm \sqrt{9 + 16\,}}{2} = \dfrac{-3 \pm \sqrt{25}}{2}$ $\small = -4,\, 1$

The solution is $\small x = -4$ and $\small x = 1$.

Question answering

Question answering (QA), a computer science discipline within the fields of information retrieval and natural language processing (NLP), is concerned with building systems that automatically answer questions posed by humans in a natural language. A QA implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, QA systems can pull answers from an unstructured collection of natural language documents.

Some examples of natural language document collections used for QA systems include:

  • a local collection of reference texts
  • internal organization documents and web pages
  • compiled newswire reports
  • a set of Wikipedia pages
  • a subset of World Wide Web pages

Question answering (QA), the identification of short accurate answers to users questions presented in natural language, has numerous applications in the biomedical and clinical sciences including directed search, interactive learning and discovery, clinical decision support, and recommendation. Due to the large size of the biomedical literature and a lack of efficient searching strategies, researchers and medical practitioners often struggle to obtain available information available that is necessary for their needs. Moreover, even the most sophisticated search engines are not intelligent enough to interpret clinicians questions. Thus, there is an urgent need for information retrieval systems that accept queries in natural language and return accurate answers quickly and efficiently.

Question answering (a natural language understanding problem) and reading comprehension (the task of answering a natural language question about a paragraph) are of considerable interest in NLP motivating, for example, the Human-Computer Question Answering Competition (in the NeurIPS 2017 Competition Track), and the BioASQ Challenge in the BioNLP domain. Unlike generic text summarization, reading comprehension systems facilitate the answering of targeted questions about specific documents, efficiently extracting facts and insights (How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks).

The Stanford Question Answering Dataset (SQuAD: developed at Stanford University) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment or span of text from the corresponding reading passage (or, the question might be unanswerable). There has been a rapid progress on the SQuAD dataset, and early in 2018 engineered systems started achieving and surpassing human level accuracy on the SQuAD1.1 Leaderboard  (discussion: AI Outperforms Humans in Question Answering: Review of three winning SQuAD systems). SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written by crowdworkers to look adversarially similar to answerable ones (ACL 2018 Best Short Paper: Know What You Don’t Know: Unanswerable Questions for SQuAD;  [project/code]). To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Real coordinate space

In mathematics, real coordinate space of $\small n$ dimensions, written $\small \mathbf{R}^n$ (also written $\small \mathbb{R}^n$ with blackboard bold) is a coordinate space that allows several ($\small n$) real variables to be treated as a single variable. With various numbers of dimensions (sometimes unspecified), $\small \mathbb{R}^n$ is used in many areas of pure and applied mathematics, as well as in physics. With component-wise addition and scalar multiplication, it is the prototypical real vector space and is a frequently used representation of Euclidean $\small n$-space. Due to the latter fact, geometric metaphors are widely used for $\small \mathbb{R}^n$, namely a plane for $\small \mathbb{R}^2$ and three-dimensional space for $\small \mathbb{R}^3$.


[Image source. Click image to open in new window.]

Definition and uses

For any natural number $\small n$, the set $\small \mathbb{R}^n$ consists of all $\small n$-tuples of real numbers ($\small \mathbb{R}$). It is called (the) “$\small n$-dimensional real space”. Depending on its construction from $\small n$ instances of the set $\small \mathbb{R}$, it inherits some of the latter’s structure, notably:

  • When defined as the direct sum of vector spaces, addition and scalar multiplication are defined on $\small \mathbb{R}^n$ [refer here];

  • $\small \mathbb{R}^n$ is a topological space [refer here].

An element of $\small \mathbb{R}^n$ is written

    $\small \mathbf {x} = (x_1, x_2, \ldots, x_n)$

where each $\small x_i$ is a real number.

For each $\small n$ there exists only one $\small \mathbb{R}^n$, the real $\small n$-space.

Purely mathematical uses of $\small \mathbb{R}^n$ can be roughly classified as follows, although these uses overlap.

[ … snip … ]

Matrix notation

In standard matrix notation, each element of $\small \mathbb{R}^n$ is typically written as a column vector

    $\small \mathbf{x} = \begin{bmatrix} x_1 \\\ x_2 \\\ \vdots \\\ x_n ènd{bmatrix}$

and sometimes as a row vector:

    $\small \mathbf {x} = \begin{bmatrix} x_1 & x_2 & \dots & x_n ènd{bmatrix}$.

The coordinate space $\small \mathbb{R}^n$ may then be interpreted as the space of all $\small [n × 1]$ column vectors, or all $\small [1 × n]$ row vectors with the ordinary matrix operations of addition and scalar multiplication.

[ … snip … ]

Miscellaneous notes:

  • The superscripts in $\small \mathbb{R}^\color{Brown}{\mathbf{n}}$, $\small \mathbb{R}^\color{Brown}{\mathbf{k}}$ and $\small \mathbb{R}^\color{Brown}{\mathbf{p}}$ refer to the real coordinate space (“dimensions”) of these matrices of matrices of real numbers (respectively) in input, hidden and output layers.

  • Similarly (elsewhere), $\small \mathbb{R}^{m \times n}$ indicates a matrix of real coordinate space (“dimensions”) $\small m\times n$.

  • $\small \in$ denotes “element of” (in set membership: “member of”).

  • In mathematics, the set of real numbers ($\small \mathbb{R}$) are the values of a continuous quantity that can represent a distance along a line; they include rational numbers ($\small \mathbb{Q}$), integers ($\small \mathbb{Z}$), and natural numbers ($\small \mathbb{N}$).

  • Elements of $\small \mathbb{R}^n$ are vectors. In other words, we can consider each element of $\small \mathbb{R}^n$ (the tuple of $n$ real numbers) to be a vector. $\small \mathbb{R}^n$ is more abstract than polynomials; for example,

    $\ \ \ \ \ \ \small a = \begin{bmatrix} 1 \\ 2 \\ 3 ènd{bmatrix} \in \mathbb{R}^3$

is an example of a triplet of numbers in three-dimensional real coordinate space. Adding two vectors $\small a, b ∈ \mathbb{R}^n$ component wise results in another vector: $\small a + b = c \in \mathbb{R}^n$ . Moreover, multiplying $\small a \in \mathbb{R}^n$ by $\small \lambda \in \mathbb{R}$ results in a scaled vector $\small \lambda a \in \mathbb{R}^n$. Linear algebra focuses on the similarities between these vector concepts; we can add them together, and multiply them by scalars. We largely focus on vectors in $\small \mathbb{R}^n$ since most algorithms in linear algebra are formulated in $\small \mathbb{R}^n$. Recall that in machine learning, we often consider data to be represented as vectors in $\small \mathbb{R}^n$. [Source: Linear Algebra]

  • What is “Real coordinate space”?

    • “In my experience the expression ‘real coordinate space’ emphasizes that we are not working over the complex numbers, i.e. the space is $\small \mathbb{R}^n$, not $\small \mathbb{C}^n$. You can use Cartesian coordinates (and a whole bunch of other coordinate systems) on these spaces. The spaces $\small \mathbb{R}^n$ are called Euclidean spaces, so they are the same as real coordinate spaces.”

  • What’s the difference between the Cartesian coordinate system, the Euclidean space, and the real coordinate space?

    • Euclidean space is a way to define a geometric system. It is based on the following assumptions that cannot be proven, or postulated:

    • A straight line can be drawn joining any two points.
    • Any straight line segment can be extended indefinitely in a straight line.
    • Given any straight line segment, a circle can be drawn having the segment as radius and one endpoint as center.
    • All right angles are congruent.

    There are a number of coordinate systems that help “locate” or identify objects or elements in the space. A Cartesian coordinate system, is defined by a 3-tuple {x,y,z} based on 3 axes that are each perpendicular to one another, and any position can be defined by these 3 coordinates.

    Real coordinate space merely extends the Cartesian system to multiple dimensions. For example, there can be $\small n$-tuples represented by matrices, where the properties of Cartesian coordinates hold.


[Polar coordinate system.]  [Image source. Click image to open in new window.]

Reasoning [inference]

Inferences are steps in reasoning, moving from premises to conclusions. That is, inference is the reasoning involved in drawing a conclusion or making a logical judgment on the basis of circumstantial evidence and prior conclusions rather than on the basis of direct observation. Inference may be classified into three kinds:

  • deduction (deductive inference; deductive reasoning),
  • induction (inductive inference; inductive reasoning), and
  • abduction (abductive inference; inductive reasoning).

Deduction is a systematic method of reasoning (inference) for deriving conclusions from facts and direct observation.

Induction is a systematic method of reasoning (inference) in which the premises are viewed as supplying some evidence for the truth of the conclusion.


[Image source. Click image to open in new window.]



[Image source. Click image to open in new window.]


It is sometimes argued that in deduction the particular is inferred from the general:

  • All organisms have RNA
  • (This fruit fly is an organism.)
  • Therefore, this fruit fly has RNA.

And it is sometimes said that in induction the general is inferred from the particular:

  • A red-eyed fruit fly has RNA
  • A white-eyed fruit fly has RNA
  • A Hawaiian fruit fly has RNA
  • Therefore, all fruit flies have RNA.


Because deduction rhymes with reduction, you can easily remember that in deduction, you start with a set of possibilities and reduce it until a smaller subset remains.

  • For example, a murder mystery is an exercise in deduction. Typically, the detective begins with a set of possible suspects – for example, the butler, the maid, the business partner, and the widow. By the end of the story, he or she has reduced this set to only one person – for example, “The victim died in the bathtub but was moved to the bed. But, neither woman could have lifted the body, nor could the butler with his war wound. Therefore, the business partner must have committed the crime.”

Induction begins with the same two letters as the word increase, which can help you remember that in induction, you start with a limited number of observations and increase that number by generalizing.

  • For example, suppose you spend the weekend in a small town and the first five people you meet are friendly, so you inductively conclude the following: “Everybody here is so nice.” In other words, you started with a small set of examples and you increased it to include a larger set.


Examples of Inductive Reasoning

  • Jennifer leaves for school at 7:00 a.m. Jennifer is always on time. Jennifer assumes, then, that she will always be on time if she leaves at 7:00 a.m.
  • The cost of goods was $\small \$$1.00. The cost of labor to manufacture the time was $\small \$$0.50. The sales price of the item was $\small \$$5.00; so, the item always provides a good profit.
  • Every windstorm in this area comes from the north. I can see a big cloud of dust caused by a windstorm in the distance; so, a new windstorm is coming from the north.
  • Bob is showing a big diamond ring to his friend Larry. Bob has told Larry that he is going to marry Joan. Bob has bought the diamond ring to give to Joan.
  • The chair in the living room is red. The chair in the dining room is red. The chair in the bedroom is red. All chairs in the house are red.
  • Every time you eat peanuts, your throat swells up and you can’t breath. So, you are allergic to peanuts.
  • All cats that you have observed purr. Therefore, every cat must purr.
  • Two-thirds of the students at this college receive student aid. Therefore, two-thirds of all college students receive student aid.
  • All of the girls in the class were blond, therefore all girls in this neighborhood are blond.
  • Michael just moved here from Chicago. Michael has red hair, therefore people from Chicago have red hair.
  • The children in that house yell loudly when they play in their bedroom. I can hear children yelling in that house, therefore the children must be playing in their bedroom.
  • All chickens that we have seen have been brown; so, all chickens are brown.
  • All cars in this town drive on the right side of the street. Therefore, all cars in all towns drive on the right side of the street.
  • John is an excellent swimmer. John’s family has a swimming pool. John’s sister Mary must also be an excellent swimmer.
  • All basketball players in your school are tall, so all basketball players must be tall.
  • All brown dogs in the park are small dogs. Therefore, all small dogs are brown.
  • All children in the day care center like to play with Legos. All children, therefore, enjoy playing with Legos.
  • Ray is a football player. All football players weigh more than 170 pounds. Ray weighs more than 170 pounds.
  • All observed houses on the South Street are falling apart. Sherry lives on South Street. Her house is falling apart.


Abductive reasoning (abduction; abductive inference; retroduction) is a form of logical inference which starts with an observation or set of observations then seeks to find the simplest and most likely explanation for the observations. This process, unlike deductive reasoning, yields a plausible conclusion but does not positively verify it. Abductive conclusions are thus qualified as having a remnant of uncertainty or doubt, which is expressed in retreat terms such as “best available” or “most likely”

Inference is used, for example, in Horn clausesinductive logic programminginteger linear programming, and probabilistic programming (including Prolog).

Deductive reasoning (deductive logic; logical deduction) is the process of reasoning from one or more statements (premises) to reach a logically certain conclusion. Deductive reasoning goes in the same direction as that of the conditionals, and links premises with conclusions. If all premises are true, the terms are clear, and the rules of deductive logic are followed, then the conclusion reached is necessarily true.

Deductive reasoning (“top-down logic”) contrasts with inductive reasoning (“bottom-up logic”) in the following way.

  • In deductive reasoning, a conclusion is reached reductively by applying general rules which hold over the entirety of a closed domain of discourse, narrowing the range under consideration until only the conclusion(s) is left.

  • In inductive reasoning, the conclusion is reached by generalizing or extrapolating from specific cases to general rules, i.e., there is epistemic uncertainty. However, the inductive reasoning mentioned here is not the same as induction used in mathematical proofs - mathematical induction is actually a form of deductive reasoning.

Deductive reasoning differs from abductive reasoning by the direction of the reasoning relative to the conditionals. Deductive reasoning goes in the same direction as that of the conditionals, whereas abductive reasoning goes in the opposite direction to that of the conditionals.

Inference means any transparent logical means of yielding conclusions from premises, with transparent caveats associated with the assumptions of the inference method.

Deductive reasoning is logically the strongest inference: conclusions are derived from premises that are assumed to be true via logical entailment, which means that the conclusions are guaranteed to be true if the premises are true. While deduction gives watertight strength (assuming no errors in application), it does not alone allow us to draw general conclusions from a finite set of particular facts.

Inductive reasoning is about obtaining more general conclusions from a set of more specific premises, which means that the conclusion is consistent with the premises but only one of an infinite number of possible conclusions that are also consistent with those premises. Induction works using deduction under the hood, by the creative proposition of the general conclusion and then lots of deduction from the conclusion to yield predictions that are strongly entailed by the general conclusion; when sustained efforts to find a prediction entailed by the general conclusion that doesn’t match observation fail, we have ourselves a reliable (but never true) general theory.

Abduction is a form of probabilistic reasoning (and thus assumes a known sample space of propositions) and effectively yields probability measures for conclusions assuming a major premise is true with probability 1 and minor premises with probabilities less than 1. In abductive reasoning, a major premise is assumed to be true, but minor premises are doubtful, in a probability sense, giving a conclusion that is probabilistic. The caveats are the presence of all the conditions needed to have probability spaces, and, in practice, nicely shaped and stationary distributions as well.


See also:

Recognizing textual entailment

See Natural language Inference.

Recurrent neural networks (RNN)


[Folded, unfolded RNN (click image to open in new window)]

At each time step $\small t$, a RNN reads an input vector $\small \mathbf{x}_t$ into a hidden state vector $\small \mathbf{h}_t$ and predicts an output vector $\small \mathbf{y}_t$ [shown as $\small \mathbf{o}_t$ in the diagram above]. The state dynamic can be abstracted as a recurrent relation: $\small \mathbf{h}_t = RNN(\mathbf{h}_{t-1}, \mathbf{x}_t)$. The vanilla RNN is parameterized as follows:

    $\small \mathbf{h}_t = \sigma (W_h \mathbf{h}_{t-1} + V\mathbf{x}_t + \mathbf{b}_h)$

    $\small \mathbf{y}_t = W_y \mathbf{h}_t + \mathbf{b}_y$

where $\small (W_h, W_y, V, \mathbf{b}_h, \mathbf{b}_y)$ are learnable parameters, and $\small \sigma$ is a point-wise nonlinear function.

Andrej Karpathy provides a great example of a RNN, that should make it easier to visualize the working and use of a RNN:


[image source (click image to open in new window)]

One last thing to note – the weights of the connections between time steps are shared i.e. there isn’t a different set of weights for each time step. This is discussed here, Why are the weights of RNN/LSTM networks shared across time?:

  • “The ‘shared weights’ perspective comes from thinking about RNNs as feedforward networks unrolled across time. If the weights were different at each moment in time, this would just be a feedforward network [ANN, above]. But, I suppose another way to think about it would be as an RNN whose weights are a time-varying function (and that could let you keep the ability to process variable length sequences). If you did this, the number of parameters would grow linearly with the number of time steps. That would be a big explosion of parameters for sequences of any appreciable length. …”


[image sourcediscussed here. (Click image to open in new window.)]

Alex Graves shows a variation of a RNN, above, that better illustrates some additional key concepts (Generating Sequences with Recurrent Neural Networks).

  • “Fig. 1 illustrates the basic recurrent neural network prediction architecture used in this paper. An input vector sequence $\small \mathbf{x} = (x_1, \dots, x_T)$ is passed through weighted connections to a stack of $\small N$ recurrently connected hidden layers to compute first the hidden vector sequences $\small \mathbf{h}^n = (h{n \atop 1}, \dots, h{n \atop T})$ and then the output vector sequence $\small \mathbf{y} = (y_1, \dots, y_T)$. Each output vector $\small y_t$ is used to parameterise a predictive distribution $\small \text{Pr}(x_{t+1} \vert y_t)$ over the possible next inputs $\small x_{t+1}$. The first element $\small x_1$ of every input sequence is always a null vector whose entries are all zero; the network therefore emits a prediction for $\small x_2$, the first real input, with no prior information. The network is “deep” in both space and time, in the sense that every piece of information passing either vertically or horizontally through the computation graph will be acted on by multiple successive weight matrices and nonlinearities.

    “Note the ‘skip connections’ from the inputs to all hidden layers, and from all hidden layers to the outputs. These make it easier to train deep networks, by reducing the number of processing steps between the bottom of the network and the top, and thereby mitigating the “vanishing gradient” problem. In the special case that $\small N = 1$ the architecture reduces to an ordinary, single layer next step prediction RNN.”


[image source (click image to open in new window)]


[image source (click image to open in new window)]

The following material is excerpted from the main body in A Beginner’s Guide to LSTMs.

“In the case of feedforward networks, input examples are fed to the network and transformed into an output; with supervised learning, the output would be a label, a name applied to the input. That is, they map raw data to categories, recognizing patterns that may signal, for example, that an input image should be labeled “cat” or “elephant.” A feedforward network is trained, for example, on labeled images until it minimizes the error it makes when guessing their categories. The trained set of parameters (or weights) are collectively known as a model. A feedforward network has no notion of temporal order (“time”): the only input it considers is the current example upon which it is working.

“RNN recognize patterns in sequences of data such as text, genomes, handwriting, the spoken word, numerical times series data emanating from sensors, stock markets and government agencies. RNNs are even applicable to images, which can be decomposed into a series of patches and treated as a sequence. RNN take as their input not just the current input example they see, but also what they have perceived previously in time. Since these algorithms take time and sequence into account, they have a temporal dimension. The decision a recurrent net reached at time step $\small t-1$ affects the decision it will reach one moment later at time step $\small t$. RNN therefore have two sources of input, the present and the immediate past, which combine to determine how they respond to new data.

“Recurrent networks are distinguished from feedforward networks by a feedback loop connected to their past decisions, ingesting their own outputs moment after moment as input. It is often said that recurrent networks have memory. That sequential information is preserved in the recurrent network’s hidden state, which manages to span many time steps as it cascades forward to affect the processing of each new example. Mathematically, the process of carrying the memory forward is:

    $\small \mathbf{h}_t = \phi(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1})$

“The hidden state at time step $\small t$ is $\small h_t$. It is a function of the input at the same time step $\small x_t$, modified by a weight matrix $\small W$ (like the one we used for feedforward nets) added to the hidden state of the previous time step $\small h_{t-1}$ multiplied by its own hidden-state-to-hidden-state matrix $\small U$, otherwise known as a transition matrix and similar to a Markov chain. The weight matrices are filters that determine how much importance to accord to both the present input and the past hidden state. The error they generate will return via backpropagation and be used to adjust their weights until error can’t go any lower.

“The sum of the weight input and hidden state is squashed by the function $\small \phi$ – either a logistic sigmoid function or tanh, depending – which is a standard tool for condensing very large or very small values into a logistic space, as well as making gradients workable for backpropagation. Because this feedback loop occurs at every time step in the series, each hidden state contains traces not only of the previous hidden state, but also of all those that preceded $\small h_{t-1}$ for as long as memory can persist.”

The most basic or “vanilla” RNN consists of the following set of equations (indexed by time-step $t$), that appear various forms in the literature and the web:

    $\small \mathbf{h}_t = \sigma (\mathbf{U}_h \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1})$

    $\small \mathbf{y}_t = \mathbf{O} \mathbf{h}_t$


  • $\small \mathbf{x}_t \in \mathbb{R}^n$ is the input of the RNN;

  • $\small \mathbf{h}_t \in \mathbb{R}^k$ is called the hidden state of the RNN, and acts as a memory of the current state the network. When starting a sequence, it is set to the all zero vector ($\small \mathbf{h}_{-1} = 0$).

  • $\small \mathbf{y}_t \in \mathbb{R}^p$ is the output of the RNN;

  • The logistic function (here, the sigmoid function; see Recent Advances in Recurrent Neural Networks for other forms), applied component-wise:

$\ \ \ \ \ \ \small \sigma(x) = \frac{1}{1 + {e}^{(-x)}}$

  • $\small \mathbf{U}_{h}, \mathbf{W}_{h}, \mathbf{O}$ are the network’s parameters. [Note that “$\small \mathbf{O}$”, here, is “$\small \mathbf{V}$” elsewhere in this post; i.e. the weight matrices of the “skip connections from the hidden layers to the outputs.]

The output of such a neural network depends on both the input $\small \mathbf{x}_t$ and the hidden state $\small \mathbf{h}_t$.


  • Source for the above: An Intrinsic Difference Between Vanilla RNNs and GRU Models plus my own notes (below).

  • Undefined in that paper, I believe that I correctly interpret/describe these:

    • The superscripts in $\small \mathbb{R}^\color{Brown}{\mathbf{n}}$, $\small \mathbb{R}^\color{Brown}{\mathbf{k}}$ and $\small \mathbb{R}^\color{Brown}{\mathbf{p}}$ refer to the real coordinate space (“dimensions”) of these matrices of matrices of real numbers (respectively) in input, hidden and output layers.

      Similarly (elsewhere), $\small \mathbb{R}^{m \times n}$ indicates a matrix of dimensions $\small m\times n$.

  • $\small \in$ denotes “element of” (in set membership: “member of”).

  • In mathematics, the set of real numbers ($\small \mathbb{R}$) are the values of a continuous quantity that can represent a distance along a line; they include rational numbers ($\small \mathbb{Q}$), integers ($\small \mathbb{Z}$), and natural numbers ($\small \mathbb{N}$).

  • Elements of $\small \mathbb{R}^n$ are vectors. In other words, we can consider each element of $\small \mathbb{R}^n$ (the tuple of $n$ real numbers) to be a vector. $\small \mathbb{R}^n$ is more abstract than polynomials; for example,

    $\ \ \ \ \ \ \small a = \begin{bmatrix} 1 \\ 2 \\ 3 ènd{bmatrix} \in \mathbb{R}^3$

    is an example of a triplet of numbers in three-dimensional real coordinate space. Adding two vectors $\small a, b ∈ \mathbb{R}^n$ component wise results in another vector: $\small a + b = c \in \mathbb{R}^n$ . Moreover, multiplying $\small a \in \mathbb{R}^n$ by $\small \lambda \in \mathbb{R}$ results in a scaled vector $\small \lambda a \in \mathbb{R}^n$. Linear algebra focuses on the similarities between these vector concepts; we can add them together, and multiply them by scalars. We largely focus on vectors in $\small \mathbb{R}^n$ since most algorithms in linear algebra are formulated in $\small \mathbb{R}^n$. Recall that in machine learning, we often consider data to be represented as vectors in $\small \mathbb{R}^n$. [Source: Linear Algebra]

  • In mathematics, the real coordinate space of $n$ dimensions, written $\small \mathbb{R}^n$ is a coordinate space that allows several ($\small n$) real variables to be treated as a single variable. With various numbers of dimensions (sometimes unspecified), $\small \mathbb{R}^n$ is used in many areas of pure and applied mathematics, as well as in physics. With component-wise addition and scalar multiplication, it is the prototypical real vector space and is a frequently used representation of Euclidean $\small n$-space. An element of $\small \mathbb{R}^n$ is written $\small x = (x_1, x_2, \ldots, x_n)$, where each $\small x_i$ is a real number.

Reduced row echelon

  • See discussion, here.


See Classification vs. Regression vs. Clustering.

Reinforcement learning

Reinforcement learning (RL) is a branch of machine learning in which an agent learns from interacting with an environment; see:

Essentially, a RL framework allows an agent to learn from trial and error; the agent receives a reward by acting in the environment, and its goal is learning to select the actions that maximize the expected cumulative reward over time. Deep learning can be combined with RL to learn useful representations for problems with high dimensional raw data input.

Relation extraction

Relation extraction (RE) is a subproblem of IE that addresses the extraction of labeled relations between two named entities. Dependency parsing and phrase structure parsing may be combined for relation extraction. To minimize cascading errors, accurate sentence chunking (splitting) is required, prior to the dependency parsing step. See also: Information extraction;   Dependency parsing.

Representation learning

Representation learning is a set of techniques that learn a feature: a transformation of the raw data input to a representation that can be effectively exploited in machine learning tasks. While traditional unsupervised learning techniques are staples of machine learning, representation learning has emerged as an alternative approach to feature extraction (An Introduction to Representation Learning). In representation learning, features are extracted from unlabeled data by training a neural network on a secondary, supervised learning task. Word2vec is a good example of representation learning, simultaneously learning several language concepts:

  • the meanings of words;
  • how words are combined to form concepts (i.e., syntax); and,
  • how concepts relate to the task at hand.

See also:

Rounding numbers (digits)

See Rounding  [Significant figures (digits)]

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


The term syntax refers to grammatical structure whereas the term semantics refers to the meaning of the vocabulary symbols arranged with that structure. Grammatical (syntactically valid) does not imply sensible (semantically valid), however. For example, the grammatical sentence “cows flow supremely” is grammatically ok (subject verb adverb) in English, but makes no sense. [Source: What do “syntax” and “semantics” mean and how are they different?]

Semantics is used as a technical term for the meaning of words and sentences. Semantics and its understanding as a study of meaning covers most complex tasks like: finding synonyms, word sense disambiguation, constructing question-answering systems, translating from one natural language to another, and populating knowledge bases. Basically, one needs to complete morphological and syntactical analysis before trying to solve any semantic problem. [Source]

Semantics is about the manner in which lexical meaning is combined morphologically and syntactically to form the meaning of a sentence. Mostly, this is regular, productive and rule-governed; e.g. the meaning of “John gave Mary a do “can be represented as (SOME (X) (DOG X) & (PAST-TIME (GIVE (JOHN, MARY, X)))), but sometimes it is idiomatic as in the meaning of “John kicked the bucket”, which can be (PAST-TIME (DIE (JOHN))). (To make this notation useful we also need to know the meaning of these capitalised words and brackets too.) Because the meaning of a sentence is usually a productive combination of the meaning of its words, syntactic information is important for interpretation - it helps us work out what goes with what – but other information, such as punctuation or intonation, pronoun reference, etc, can also play a crucial part. [Source; Introduction to Linguistics for Natural Language Processing]


See also:

Semantic parsing

[See also: Syntactic/Dependency parsing.] Semantic dependencies are understood in terms of predicates and their arguments. The predicate of a sentence mostly corresponds to the main verb and any auxiliaries that accompany the main verb; the arguments of that predicate (e.g. the subject and object noun phrase) are outside the predicate. The arguments of a predicate are semantically dependent on that predicate. Often, semantic dependencies overlap with and point in the same direction as syntactic dependencies. At times, however, semantic dependencies can point in the opposite direction of syntactic dependencies, or they can be entirely independent of syntactic dependencies.

Semantic processing of sentence structure uses statistics or grammar rules to produce an electronic representation that delivers logical components (for example, a ‘noun phrase’), their roles (for example, the ‘subject’) and dependencies. Semantic parsing can thus be understood as extracting the precise meaning of an utterance.

Applications of semantic parsing include question answering. An example of the application of semantic parsing in biomedical question-answering is indicated in Fig. 3 in: Titov & Klementiev (2011) [pdf]:

Semantic parsing: biomedical QA
      (click image for full-size)

Semantic role labeling

Semantic role labeling (sometimes called shallow semantic parsing) is a process in natural language processing that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence – such as that of an agent, goal, or result. It consists of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles (i.e., automatically finding the semantic roles of each argument of each predicate in a sentence).

For example, given a sentence like “Mary sold the book to John”, the task would be to recognize the verb “to sell” as representing the predicate, “Mary” as representing the seller (agent), “the book” as representing the goods (theme), and “John” as representing the recipient. This is an important step towards making sense of the meaning of a sentence.

For a recent, highly performant semantic role labeling model see Linguistically-Informed Self-Attention for Semantic Role Labeling (2018) [code] by Andrew McCallum and colleagues. Linguistically-Informed Self-Attention  (LISA) is a neural network model that combined multi-head self-attention with multi-task learning across dependency parsing, part of speech tagging, predicate detection and semantic role labeling (SRL).


Google’s seq2seq library [GitHubblogdocs] provides a general-purpose encoder-decoder framework that can be used for machine translation, text summarization, conversational modeling, image captioning, and more.

Sequence-to-sequence (seq2seq) models constitute a common framework for solving sequential problems. In seq2seq models, the input is a sequence of certain data units and the output is also a sequence of data units. Seq2seq models are common in various applications ranging from

  • machine translation, where the input is a sentence (sequence of words) from one language (English) and the output is the translation to another language (French),
  • news headline generation, where the input is a news article (sequence of words) or the first two or three sentences of it and the output is the headline of the article,
  • text summarization, where the input is a complete article (sequence of words) and the output is a short summary of it (sequence of words),
  • speech-to-text applications, where the input is an audio recording of a speech (sequence of audibles pieces) and the output is the speech text (sequence of words), and
  • image captioning, where the input is an image (sequence of different layers of image) and the output is a textual caption explaining the image (sequence of words).

In recent years, the general framework for solving these problems uses deep neural networks that comprise two main components: an encoder which reads the sequence of input data and a decoder which uses the output generated by the encoder to produce the sequence of final outputs. The encoder and decoder are usually implemented by recurrent neural networks.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

See also:

Significant figures (digits)

The significant figures (also known as the significant digits) of a number are digits that carry meaning contributing to its measurement resolution. This includes all digits except:

  • All leading zeros;

  • Trailing zeros when they are merely placeholders to indicate the scale of the number (exact rules are explained at identifying significant figures); and

  • Spurious digits introduced, for example, by calculations carried out to greater precision than that of the original data, or measurements reported to a greater precision than the equipment supports.

Concise rules

  • All non-zero digits are significant: 1, 2, 3, 4, 5, 6, 7, 8, 9.

    • 91 has two significant figures (9 and 1), while 123.45 has five significant figures (1, 2, 3, 4 and 5).

  • Zeros between non-zero digits are significant: 102, 2005, 50009.

    • 101.1203 has seven significant figures: 1, 0, 1, 1, 2, 0 and 3.

  • Leading zeros are never significant: 0.02, 001.887, 0.000515.

    • 0.00052 has two significant figures: 5 and 2.

  • In a number with a decimal point, trailing zeros (those to the right of the last non-zero digit) are significant: 2.02000, 5.400, 57.5400.

    • 12.2300 has six significant figures: 1, 2, 2, 3, 0 and 0. The number 0.000122300 still has only six significant figures (the zeros before the 1 are not significant). In addition, 120.00 has five significant figures since it has three trailing zeros.

  • In a number without a decimal point, trailing zeros may or may not be significant. More information through additional graphical symbols or explicit information on errors is needed to clarify the significance of trailing zeros.

Scientific notation

In most cases, the same rules apply to numbers expressed in scientific notation. However, in the normalized form of that notation, placeholder leading and trailing digits do not occur, so all digits are significant. For example, 0.00012 (two significant figures) becomes 1.2×10-4, and 0.00122300 (six significant figures) becomes 1.22300×10-3. In particular, the potential ambiguity about the significance of trailing zeros is eliminated. For example, 1300 to four significant figures is written as 1.300×103, while 1300 to two significant figures is written as 1.3×103.

Rounding numbers

Numbers are often rounded to avoid reporting insignificant figures. For example, it would create false precision to express a measurement as 12.34500 kg (which has seven significant figures) if the scales only measured to the nearest gram and gave a reading of 12.345 kg (which has five significant figures). Numbers can also be rounded merely for simplicity rather than to indicate a given precision of measurement, for example, to make them faster to pronounce in news broadcasts.

To round to $\small n$ significant figures:

  • Identify the significant figures before rounding. These are the n consecutive digits beginning with the first non-zero digit.

  • If the digit immediately to the right of the last significant figure is greater than 5 or is a 5 followed by other non-zero digits, add 1 to the last significant figure. For example, 1.2459 as the result of a calculation or measurement that only allows for 3 significant figures should be written 1.25.

    • If the number you are rounding is followed by 5, 6, 7, 8, or 9, round the number up. Example: 38 rounded to the nearest ten is 401

    • If the number you are rounding is followed by 0, 1, 2, 3, or 4, round the number down. Example: 33 rounded to the nearest ten is 30

      1.44 rounded to 2 sig. fig. is 1.4
      1.45 rounded to 2 sig. fig. is 1.4
      1.46 rounded to 2 sig. fig. is 1.5

      1.54 rounded to 2 sig. fig. is 1.5
      1.55 rounded to 2 sig. fig. is 1.6
      1.56 rounded to 2 sig. fig. is 1.6

  • If the digit immediately to the right of the last significant figure is a 5 not followed by any other digits or followed only by zeros, rounding requires a tie-breaking rule. For example, to round 1.25 to 2 significant figures:

    • Round half away from zero (also known as “5/4”) rounds up to 1.3. This is the default rounding method implied in many disciplines if not specified.

    • Round half to even, which rounds to the nearest even number, rounds down to 1.2 in this case. The same strategy applied to 1.35 would instead round up to 1.4.

      • Victoria: this is my preference: for numbers ending in 5, if the preceding number is even, round down; if the number preceding a trailing 5 is odd, round up. E.g.

          1.05 rounds to 1.0
          1.15 rounds to 1.2
          1.25 rounds to 1.2
          1.35 rounds to 1.4
          1.45 rounds to 1.4
          1.55 rounds to 1.6
          1.65 rounds to 1.6
          1.75 rounds to 1.8
          1.85 rounds to 1.8
          1.95 rounds to 2.0

  • Replace non-significant figures in front of the decimal point by zeros.

  • Drop all the digits after the decimal point to the right of the significant figures (do not replace them with zeros).

Singular value decomposition

In linear algebra, the singular-value decomposition (SVD) is a factorization of a real or complex matrix. It is the generalization of the eigendecomposition of a positive semidefinite normal matrix (for example, a symmetric matrix with positive eigenvalues) to any $\small m\times n$ matrix via an extension of the polar decomposition. It has many useful applications in signal processing and statistics.

Formally, the singular-value decomposition of an $\small m\times n$ real or complex matrix $\small \mathbf{M}$ is a factorization of the form $\small \mathbf {U \Sigma V^{*}}$, where $\small \mathbf{U}$ is an $\small m\times m$ real or complex unitary matrix, $\small \mathbf{\Sigma}$ is an $\small m\times n$ rectangular diagonal matrix with non-negative real numbers on the diagonal, and $\small \mathbf{V}$ is an $\small n\times n$ real or complex unitary matrix. The diagonal entries $\small \sigma_i$ of $\small \mathbf{\Sigma}$ are known as the singular values of $\small \mathbf{M}$. The columns of $\small \mathbf{U}$ and the columns of $\small \mathbf{V}$ are called the left-singular vectors and right-singular vectors of $\small \mathbf{M}$, respectively.

The singular-value decomposition can be computed using the following observations:

  • The left-singular vectors of M are a set of orthonormal eigenvectors of $\small MM^∗$.
  • The right-singular vectors of M are a set of orthonormal eigenvectors of $\small M^{∗}M$.
  • The non-zero singular values of M (found on the diagonal entries of $\small \Sigma$) are the square roots of the non-zero eigenvalues of both $\small M^{∗}M$ and $\small MM^∗$.

Applications that employ the SVD include computing the pseudoinverse, least squares fitting of data, multivariable control, matrix approximation, and determining the rank, range and null space of a matrix.

    Visualisation of the matrix multiplications in singular value decomposition.
    [Image source. Click image to open in new window.]


    Visualization of the SVD of a 2D, real shearing matrix $\small M$. First, we see the unit disc in blue together with the two canonical unit vectors. We then see the action of $\small M$, which distorts the disk to an ellipse. The SVD decomposes $\small M$ into three simple transformations: an initial rotation $\small V^∗$, a scaling $\small \Sigma$ along the coordinate axes, and a final rotation $\small U$. The lengths $\small \sigma_1$ and $\small \sigma_2$ of the semi-axes of the ellipse are the singular values of $\small M$, namely $\small \Sigma_{1,1}$ and $\small \Sigma_{2,2}$.
    Image source. Click image to open in new window.]
    Upper left: The unit disc with the two canonical unit vectors.
    Upper right: Unit disc transformed with $\small M$ and singular values $\small \sigma_1$ and $\small \sigma_2$ indicated.
    Lower left: The action of $\small V^∗$ on the unit disc. This is just a rotation.
    Lower right: The action of $\small \Sigma V^∗$ on the unit disc. $\small \Sigma $ scales in vertically and horizontally.
    In this special case, the singular values are $\small \phi$ and $\small 1/\phi$ where $\small \phi$ is the golden ratio. $\small V^∗$ is a (counter clockwise) rotation by an angle $\small α$ where $\small α$ satisfies $\small tan(α) = -\phi$.  $\small U$ is a rotation by an angle $\small \beta$ with $\small tan(\beta) = \phi - 1$.
    [Image source. Click image to open in new window.]

The singular value decomposition of a matrix $\small A$ is the factorization of $\small A$ into the product of three matrices $\small A = UDV^T$ where the columns of $\small U$ and $\small V$ are orthonormal and the matrix $\small D$ is diagonal with positive real entries. The SVD is useful in many tasks. Here we mention some examples. First, in many applications, the data matrix $\small A$ is close to a matrix of low rank and it is useful to find a low rank matrix which is a good approximation to the data matrix . We will show that from the singular value decomposition of $\small A$ , we can get the matrix $\small B$ of rank $\small k$ which best approximates $\small A$; in fact we can do this for every $\small k$. Also, singular value decomposition is defined for all matrices (rectangular or square) unlike the more commonly used spectral decomposition in linear algebra. The reader familiar with eigenvectors and eigenvalues (we do not assume familiarity here) will also realize that we need conditions on the matrix to ensure orthogonality of eigenvectors. In contrast, the columns of $\small V$ in the singular value decomposition, called the right singular vectors of $\small A$, always form an orthogonal set with no assumptions on $\small A$. The columns of $\small U$ are called the left singular vectors and they also form an orthogonal set. A simple consequence of the orthogonality is that for a square and invertible matrix $\small A$, the inverse of $\small A$ is $\small VD^{-1}U^T$.  [Source]


[Image source. Click image to open in new window.]

See also:

Skip-gram model

See Word embedding (figures).

Skip-Gram with Negative Sampling

Softmax | softmax layer

A Softmax function is a type of squashing function. Squashing functions limit the output of the function into the range 0 to 1. This allows the output to be interpreted directly as a probability. Similarly, softmax functions are multi-class sigmoids, meaning they are used in determining probability of multiple classes at once. Since the outputs of a softmax function can be interpreted as a probability (i.e., they must sum to 1), a softmax layer is typically the final layer used in neural network functions. It is important to note that a softmax layer must have the same number of nodes as the output later.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

See also:

Statistical relational learning

Statistical relational learning (sometimes called relational machine learning) is a subdiscipline of machine learning that is concerned with domain models that exhibit both uncertainty (which can be dealt with using statistical methods) and complex, relational structure. Typically, the knowledge representation formalisms developed in statistical relational learning use (a subset of) first-order logic to describe relational properties of a domain in a general manner (universal quantification) and draw upon probabilistic graphical models (such as Bayesian networks or Markov networks) to model the uncertainty; some also build upon the methods of inductive logic programming.

The field of statistical relational learning is not strictly limited to learning aspects; it is equally concerned with reasoning (specifically probabilistic inference) and knowledge representation. Therefore, alternative terms that reflect the main foci of the field include statistical relational learning and reasoning (emphasizing the importance of reasoning) and *first-order probabilistic languages (emphasizing the key properties of the languages with which models are represented).

See also:


[Image source. Click image to open in new window.]


A synonym is a word having the same or nearly the same meaning as another word.

Synonyms are words with similar meanings. They are listed in a special type of dictionary called a thesaurus. A regular dictionary lists words according to form, usually in alphabetical order; a thesaurus lists words according to meaning. Synonyms usually differ in at least one semantic feature. Sometimes the feature is objective (denotative), referring to some actual, real world difference in the referents: walk, lumber, stroll, meander, lurch, stagger, stride, mince. Sometimes the feature is subjective (connotative), referring to how the speaker feels about the referent rather than any real difference in the referent itself: die, pass away, give up the ghost, kick the bucket, croak. There tend to be very few absolute synonyms in a language. Example: sofa and couch are nearly complete synonyms, yet they differ in their collocability in at least one way: one may say couch potato, but not *sofa potato.

One special type of partial synonym is called a paronym. Paronyms are words with associated meanings which also have great similarities in form: proscribe/ prescribe, industrial/ industrious, except/accept, affect/effect. Many errors in speech and writing are due to mix-ups involving paronyms.

Examples: “gearbox/transmission”; “choice/selection”; “complex/complicated”; “pretty/attractive”; “sick/ill”; …

Synonymy and homonymy have complementary notions:

  • synonyms: the same meaning, different forms
  • homonyms: the same form, different meanings

See also Linguistics | Semantics.

Syntactic/Dependency parsing

[See also Semantic parsing.] A syntactic parser (i.e., dependency parser) analyzes the grammatical (syntactic) structure of a sentence, establishing binary relationships between “head” words and words which modify those heads; for example, a verb is linked to its dependents (arguments/modifiers). Collectively, these relations form a tree or tree-like graph.

The basic idea is that syntactic structure consists of lexical items, linked by binary asymmetric relations called dependencies. The sentence is an organized whole, the constituent elements of which are words. Every word that belongs to a sentence ceases by itself to be isolated, as in the dictionary. Between the word and its neighbors, the mind perceives connections, the totality of which forms the structure of the sentence.

The structural connections establish dependency relations between the words. The dependencies are all binary relations: a grammatical relation holds between a governor (also known as a regent or a head) and a dependent.

Thus, in the sentence “Winehouse performed …”, “performed” is the governor and “Winehouse” is the dependent (subordinate).

Among other tasks, dependency parse trees may be applied to basic relation extraction. Stanford dependencies provide a representation of grammatical relations between words in a sentence. They have been designed to be easily understood and effectively used by people who want to extract textual relations. Stanford dependencies (SD) are triplets: name of the relation, the governor, and the dependent.

For example, the sentence “Winehouse effortlessly performed her song Rehab.” yields the following dependency paths:

  nsubj(performed-3, Winehouse-1)
  advmod(performed-3, effortlessly-2)
  poss(Rehab-6, her-4)
  nn(Rehab-6, song-5)
  dobj(performed-3, Rehab-6)

In this example, the shortest path between “Winehouse” and “Rehab” is:

  Winehouse nsubj performed dobj Rehab.

and an extracted relation (triple) would be (Winehouse; performed; Rehab)

Winehouse dependency parse
      [graphs above per Stanford CoreNLP online demo]


The term syntax refers to grammatical structure whereas the term semantics refers to the meaning of the vocabulary symbols arranged with that structure. Grammatical (syntactically valid) does not imply sensible (semantically valid), however. For example, the grammatical sentence “cows flow supremely” is grammatically ok (subject verb adverb) in English, but makes no sense. [Source: What do “syntax” and “semantics” mean and how are they different?]

Roughly speaking, the syntax of a language comprises the patterns into which its words can be validly arranged to form sentences. The combination of morphology and syntax is sometimes called the grammar of a language. Syntax as part of grammar is a description of how words grouped and connected to each other in a sentence. There is a good definition of syntax for programming languages: “… syntax usually entails the transformation of a linear sequence of tokens (a token is akin to an individual word or punctuation mark in a natural language) into a hierarchical syntax tree.” The same definition also can be used for natural language. Challenges to syntactic processing are parts of speech (POS) tagging, chunking or detecting syntactic categories (verb, noun phrases), and sentence assembling (constructing syntax trees). [Source]

Syntax concerns the way in which words can be combined to form (grammatical) sentences; e.g. “revolutionary new ideas appear infrequently” is grammatical in English, “colourless green ideas sleep furiously” is grammatical but nonsensical, while “ideas green furiously colourless sleep” is also ungrammatical. Words combine syntactically in certain orders in a way which mirrors the meaning conveyed; e.g. John loves Mary means something different from Mary loves John. The ambiguity of “John gave her dog biscuits” stems from whether we treat “her” as an independent pronoun and “dog biscuits” as a compound noun, or whether we treat “her” as a demonstrative pronoun modifying “dog”. We can illustrate the difference in terms of possible ways of bracketing the sentence: (john (gave (her) (dog biscuits))), vs. (john (gave (her dog) (biscuits))). [Source; Introduction to Linguistics for Natural Language Processing]

See also Linguistics | Semantics.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z



[Image sources: top  |  bottom. For relational learning with tensor decomposition, given $\small I$ subjects, $\small J$ predicates and $\small K$ objects, the target tensor is thus of shape ${Y} \in \mathbb{R}^{I \times J \times K}$.  Click image to open in new window.]

A typical KG is represented as a graph built from symbolic triples – variously stated as $\small (source, relation, target)$,  $\small (subject, predicate, object)$, or $\small (head, relation, tail)$. KGE methods attempt to represent those symbols with their corresponding source, relation, and target vectors, which are amenable to mathematical processing (e.g., the linear algebra approaches common in machine learning). The hope is that the geometry of the embedding space will resemble the structure of the data by keeping the instances that participate in the same relationships close to one another in Euclidean space. This in turn allows one to apply standard propositional (logic-based) machine learning tools and retain their scalability, while at the same time preserving certain properties of structured relational data. The generated vector representations can be used by machine learning algorithms to accomplish specific tasks.


[Image source. Click image to open in new window.]

The approach described in Section III (Statistical Relational Learning for Knowledge Graphs) in A Review of Relational Machine Learning for Knowledge Graphs employs tensors, which are simply a mathematical object analogous to but more general than a vector, represented by an array of components that are functions of the coordinates of a space. Tensors are particularly useful for representing and processing multi-view / multi-layer / temporal graph embeddings.


[Representation of a 3-dimensional tensor.  Image source. Click image to open in new window.]


[Image sourceMachine learning: generalization via tensor decomposition.  $\small \text{(s,p,o): (subject, predicate, object)}$ triple.  That image is in turn taken from Fig. 4
in A Review of Relational Machine Learning for Knowledge Graphs: see Section IV (Latent Feature Models), pp. 6-10 for details.  Essentially, the relationships may be modeled/represented as tensors (vectors), amenable to matrix factorization methods.  For relational learning with tensor decomposition, given $\small I$ subjects, $\small J$ predicates and $\small K$ objects, the target tensor is thus of shape ${Y} \in \mathbb{R}^{I \times J \times K}$.  Click image to open in new window.]

The ability to encode multidimensional feature representations with each node (and edge) is fundamentally important to graph representation learningstatistical relational learninggraph convolutional networksgraph signal processing and other aspects of graph processing,


[Image source. Click image to open in new window.]

… as well as the word embeddings such as word2vec and recent pretrained, multilayer, contextual language models.


[Image source. Click image to open in new window.]
In the figure above we are imagining that each dimension captures a clearly defined meaning. For example, if you imagine that the first dimension represents the meaning or concept of "animal", then the remaining dimensions represent each word's values for the other features (domesticated; pet; fluffy; ...).  (Adapted from Introduction to Word Vectors.)
In word2vec a distributed representation of a word is used. Take a vector with 1000 dimensions, where each word is represented by a distribution of weights across those elements. Instead of a one-to-one mapping between an element in the vector and a word, the representation of a word is spread across all of the elements in the vector, and each element in the vector contributes to the definition of many words. If the dimensions in a hypothetical word vector are labeled (there are no such pre-assigned labels in the algorithm, of course), it might look a bit like this
. Such a vector comes to represent in some abstract way the "meaning" of a word. [Excerpted from The amazing power of word vectors  (local copy).]

The first of the two images above represents a signal on a node, which is the foundational concept in graph signal processing – discussed here and illustrated here. An important early paper that applied a neural network to a graph structure is Kipf and Welling’s Semi-Supervised Classification with Graph Convolutional Networks (Sep 2016; updated Feb 2017;  code variously available on GitHub, elsewhere), which introduced graph convolutional neural networks (GCN). Kipf and Welling’s GCNs built on work by Defferrard et al., Convolutional Neural Networks on Graphs with fast Localized Spectral Filtering (Jun 2016; updated Feb 2017), a formulation of convolutional neural networks in the context of spectral graph theory, which provided the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs.

Those papers illustrate the link between tensor representation of graphs, various forms of graph representation learning (e.g. statistical relational learning; graph convolutional networks; …) and graph signal processing. This observation is echoed in Dual Graph Convolutional Networks for Graph-Based Semi-Supervised Classification (2018), which stated the following.

Tensor factorization approaches have proved promising for such link prediction problems. Proposed in 1927, Canonical Polyadic (CP) decomposition was among the first tensor factorization approaches. CP generally performs poorly for link prediction as it learns two independent embedding vectors for each entity
, whereas they are really tied. SimplE  Embedding for Link Prediction in Knowledge Graphs (Oct 2018) [code] presented a simple enhancement of CP (SimplE ) to allow the two embeddings of each entity to be learned dependently. The complexity of SimplE grows linearly with the size of embeddings, which are interpretable, also allowing certain types of background knowledge to be incorporated into these embeddingsIncorporating Background Knowledge into the Embeddings. In SimplE, each element of the embedding vector of the entities can be considered as a feature of the entity and the corresponding element of a relation can be considered as a measure of how important that feature is to the relation. Such interpretability allows the embeddings learned through SimplE for an entity (or relation) to be potentially transferred to other domains. It also allows for incorporating observed features of entities into the embeddings by fixing one of the elements of the embedding vector of the observed value. Nickel et al. [30] show that incorporating such features helps reduce the size of the embeddings. through weight tying.


[Click image to open in new window.]

Canonical Tensor Decomposition for Knowledge Base Completion (Jun 2018) [code] framed knowledge base completion as a 3rd-order binary tensor completion problem. Based on the more challenging datasets (WN18RR: corrects flaws in WN18;  FB15k-237: corrects flaws in FB15k;  YAGO3-10) in the tables above, ProjE,   ComplEx, and CapsE represent leading link prediction models.

  • Dr. Tamara G. Kolda on Tensor Decomposition.  Cited here (reddit), Dr. Tamara G. Kolda, Sandia National Laboratories, gives the SIAM Invited Address on “Tensor Decomposition: A Mathematical Tool for Data Analysis” on January 11 at the 2018 Joint Mathematics Meetings. This lecture provides a good introduction to, and uses of, tensors in data analysis.


TuckER: Tensor Factorization for Knowledge Graph Completion (Jan 2019) [code] proposed TuckER, a relatively simple but powerful linear model based on Tucker decomposition of the binary tensor representation of knowledge graph triples. TuckER outperformed all previous state of the art models across standard link prediction datasets. They proved that TuckER was a fully expressive model, deriving the bound on its entity and relation embedding dimensionality for full expressiveness which was several orders of magnitude smaller than the bound of previous state of the art models ComplEx and SimplE. They further showed that several previously introduced linear models could be viewed as special cases of TuckER.


[Image source. Click image to open in new window.]

Text mining

Text mining (TM) comprises the discovery and extraction of knowledge from free text, and can extend to the generation of new hypotheses by joining the extracted information from several publications. Text mining solutions can achieve different objectives, depending on the tasks they then have to address. Primarily, we can distinguish four different categories of purposes for text-mining solutions: information retrieval, information extraction, building knowledge bases, and knowledge discovery. These categories are illustrated in the following figure [image source]:

text mining
      (click image for full-size)


See Proofs.

Topological data analysis

In applied mathematics, topological data analysis (TDA) is an approach to the analysis of datasets using techniques from topology. Extraction of information from datasets that are high-dimensional, incomplete and noisy is generally challenging. TDA provides a general framework to analyze such data in a manner that is insensitive to the particular metric chosen and provides dimensionality reduction and robustness to noise. Beyond this, it inherits functoriality, a fundamental concept of modern mathematics, from its topological nature, which allows it to adapt to new mathematical tools.

The initial motivation is to study the shape of data. TDA has combined algebraic topology and other tools from pure mathematics to allow mathematically rigorous study of “shape”. The main tool is persistent homology, an adaptation of homology to point cloud data. Persistent homology has been applied to many types of data across many fields. Moreover, its mathematical foundation is also of theoretical importance. The unique features of TDA make it a promising bridge between topology and geometry.


[Image sources:  top  |  bottom.  Click image to open in new window.]

See also:

Transfer learning

See Transfer learning (a subsection in Pretrained models).



A troponym denotes a particular way to do an entry’s referent. Troponymy is a transitive relation.


  • to “trim” and to “slice” are troponyms of to “cut”
  • to “slide” and to “spin” are troponyms of to “move”
  • to “snack” and to “nibble” are troponyms of to “eat”

See also Linguistics | Semantics.

Truth tables

See Logic notation (subsection).

$\small t$-SNE

$\small t$-distributed Symmetric Neighbour Embedding ($\small t$-SNE) is an exceptionally useful method for clustering and visualizing high dimensional datasets and models. For background, see:

$\small t$-SNE-CUDA: GPU-Accelerated $\small t$-SNE and Its Applications to Modern Data (Jul 2018) [code] presented $\small t$-SNE-CUDA, a GPU-accelerated implementation of $\small t$-SNE for visualizing datasets and models that significantly outperformed current implementations with 50-700x speedups on the CIFAR-10 and MNIST datasets.


[Image source. Click image to open in new window.]

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Variational autoencoder

In neural networks, a variational autoencoder (VAE) consists of an encoder, a decoder, and a loss function. In probability models, the VAR refers to approximate inference in a latent Gaussian model where the approximate posterior and model likelihood are parametrized by neural nets (the inference and generative networks).


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

VAE, representation learning and disentanglement

Variational Autoencoders Pursue PCA Directions (by Accident) (Rolinek et al., Dec 2018) [discussion: reddit]

  • “The variational autoencoder (VAE) is a powerful architecture capable of representation learning and generative modeling. When it comes to learning interpretable (disentangled) representations, VAE and its variants show unparalleled performance. However, the reasons for this are unclear, since a very particular alignment of the latent embedding is needed but the design of the VAE does not encourage it in any explicit way. We address this matter and offer the following explanation: the diagonal approximation in the encoder together with the inherent stochasticity force local orthogonality of the decoder. The local behavior of promoting both reconstruction and orthogonality matches closely how the PCA embedding is chosen. Alongside providing an intuitive understanding, we justify the statement with full theoretical analysis as well as with experiments.”

  • “Recently, unsupervised learning of interpretable latent representations has received a lot of attention. Interpretability of the latent code is an intuitively clear concept. For instance, when representing faces one latent variable would solely correspond to the gender of the person, another to skin tone, yet another to hair color and so forth. Once such a representation is found it allows for interpretable latent code manipulation, which is desirable in a variety of applications; recently, for example, in reinforcement learning.

    “The term disentanglement offers a more formal approach. A representation is considered disentangled if each latent component encodes precisely one “aspect” (a generative factor) of the data. …”

  • “The success of VAE-based architectures on disentanglement tasks comes with a certain surprise. One surprising aspect is that VAEs have been challenged on both of its own design functionalities, as generative models and as log-likelihood optimizers. Yet, no such claims are made in terms of disentanglement. Another surprise stems from the fact that disentanglement requires the following feature: the representative low-dimensional manifold must be aligned well with the coordinate axes. However, the design of the VAE does not suggest any such mechanism. On the contrary, the idealized log-likelihood objective is, for example, invariant to rotational changes in the alignment.”

  • 2.2. Disentanglement. In the context of learning interpretable representations it is useful to assume that the data originates from a process with some generating factors. For instance, for images of faces this could be face azimuth, skin brightness, hair length, and so on. Disentangled representations can then be defined as ones in which individual latent variables are sensitive to changes in individual generating factors, while being relatively insensitive to other changes. Although quantifying disentanglement is nontrivial, several metrics have been proposed.

    “Note also, that disentanglement is impossible without first learning a sufficiently expressive latent representation capable of good reconstruction. …”

    “One important point to make is that disentanglement is sensitive to rotations of the latent embedding. …”

  • 2.3. PCA and Latent Representations. Let us examine more closely how PCA chooses the alignment of the latent embedding and why it matters. …”

  • 3.1. The problem with log-likelihood. The message from Example 1 and from the discussion about disentanglement is clear: latent space rotation matters. …”

  • [Discussion]  “… Our insights show that VAEs make use of the differences in variance to form the representation in the latent space. This does not directly encourage factorized latent representations, see also Suppl. 9.2. With this in mind, it makes perfect sense that recent improvements of (β-)VAE incorporate additional terms promoting precisely independence.

    “It is also unsatisfying that VAEs promote orthogonality somewhat indirectly. It would seem that designing architectures allowing explicit control over this feature would be beneficial.”

VAE: see also:

Aside: like the VAE, another common generative framework is the generative adversarial network [GAN: Goodfellow et al. (2014), Generative Adversarial Nets]. In a GAN a generator and an auxiliary adversarial discriminator are trained together. On the other hand, in VAE, an encoder and a decoder (or generator) are both trained according to Bayesian models. Both frameworks contain two components (generator and discriminator or encoder and decoder), where each of them requires training.

Vector space

Refer here (Euclidean space).

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See Word embedding.

Word mover’s distance

The word mover’s distance (WMD) – introduced in From Word Embeddings To Document Distances (2015) [codetutorialtutorial: Python implementation with WMD paper coauthor Matt Kusner] – is a novel distance function between text documents, based on results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences.

The WMD distance measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to “travel” to reach the embedded words of another document. Although two documents may not share any words in common, WMD can still measure the semantical similarity by considering their word embeddings, while other bag-of-words or term frequency-inverse document frequency (TF-IDF) methods only measure the similarity by the appearance of words.

The WMD metric has no hyperparameters and is straightforward to implement. Furthermore, on eight real world document classification data sets, in comparison with seven state-of-the-art baselines, the WMD metric demonstrated unprecedented low $\small k$-nearest neighbor document classification error rates.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Word sense disambiguation (WSD)

In computational linguistics, word-sense disambiguation (WSD) is an open problem of natural language processing and ontology. WSD is the task of identifying which sense/meaning of a word is used in a sentence, when the word has multiple meanings. WSD is basically solution to the ambiguity which arises due to different meaning of words in different contexts. For example, consider the two sentences:

     “The bank will not be accepting cash on Saturdays. “
     “The river overflowed the bank.”

The word “bank” in the first sentence refers to the commercial (finance) banks, while in second sentence, it refers to the river bank.

[For more examples, consult WordNet; e.g., the WordNet search for “cell” gives 7 meanings.]

The solution to this problem impacts other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, inference, et cetera.

See also Named entity disambiguation.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Zachary’s karate club network

Zachary’s karate club is a well-known social network of a university karate club described in the paper An Information Flow Model for Conflict and Fission in Small Groups by Wayne W. Zachary [local copy]. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002.

A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting pairwise links between members who interacted outside the club. During the study a conflict arose between the administrator “John A” and instructor “Mr. Hi” (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate. Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.


[Image source. Click image to open in new window.]

Zachary’s karate club appears in these papers (of personal interest):

See also:


The following is excerpted from Norms and Rating Scales  (downloaded from SFU Kin304w).

“A standard score is a quantification of a measured score (observation) in comparison to a distribution of the score from some normative (comparison) sample. The data are assumed to be normally distributed. Numerically, a standard score indicates how many standard deviations an observation is above or below the mean of the comparison distribution. Standard scores are also called z-scores. To calculate a z-score, the difference between the score ($\small X$) and mean of the distribution ($\small \bar{X}$) is divided by the standard deviation of the sample ($\small s$).

    $ Z = \frac{(X - \bar{X})}{s}$

“An advantage of the z-score is that a single number quantifies the measured value and its location with respect to the normative distribution. Thus, a z-score of -0.5 tells you that the measured score is half a standard deviation below the mean of the distribution, therefore a little below average. Whereas a z-score of +2.5 tells you that the measured value is quite large and uncommon relative to the norm.

“Z-scores are dimensionless because the standard deviation is divided into the difference between the observation and the mean. …

“A warning here is that the norms used for the two variables must be comparable. Ideally the two distributions would be from the same sample data. If you have distributions from different samples then the comparison becomes invalid if the samples are intrinsically different. Therefore, it is not a recommended tactic to gather arbitrary normative distributions, just to have norms for each test in your battery. It is much better to have a homogeneous set of norms based upon the same sample. What that sample is depends on the use you have for it. …

Composite Scores

“Because the z-score has no units it is possible to combine z-scores for different variables into a single composite score. If you have a test battery you want to administer then you can rate each test against its own norm using z-scores. But, if you want a single composite rating of the performance of a subject in all the tests, this too can be achieved using z-scores. Because the z-scores have no units, you can average the z-scores for all the tests for each subject to come up with an overall rating.

“Table 2-2.1 shows the results of calculating z-scores for a subject in four fitness tests: sum of 5 skinfolds, grip strength, vertical jump, and shuttle run . Each test has different measurement units; thus z-scores help us equate the variables by comparing to normative distributions and giving unitless scores. There are two sets of z-scores in table 2-2.1 based upon the same data of an individual. If we simply calculate z-scores for the measured values we will get the z-scores shown as profile A (also shown in Figure 2-2.1). An overall rating of the subject on the four tests can then be simply achieved by averaging the four z-scores.


[Image source. Click image to open in new window.]

“Thus the overall rating for profile A is a z-score of +0.7. However, if our inference is the bigger the z-score the better, this is incorrect for two of the scores. For grip strength and vertical jump it is true that higher scores are better or represent greater fitness. However, the opposite is true for skinfolds and shuttle runs. A fatter person will have higher skinfolds, and a less fit person will have a slower or larger time for the shuttle run. We therefore need to reverse the rating for these two variables so as to show that a lower score is preferable.

“In our example, this was achieved by simply reversing the sign of the z-score. This holds true if the scores are normally distributed. In calculating z-scores, in order to get a reverse rating you can simply change the z-score equation around to get a reverse z-score as it were. Our example profile B shows the result of correcting the z-scores for skinfolds and shuttle run. The composite overall rating is now easier to interpret. The overall rating in profile B of -0.65 gives a truer indication of the situation. The overall rating is less than average (negative z-score), based upon the person being above average in grip strength, but below average for skinfolds, vertical jump, and shuttle run performance.”

Composite z-scores were used in Finding melanoma drugs through a probabilistic knowledge graph (Feb 2017):

“There are often multiple probabilities per interaction, and more than one interaction per interaction type. This is because the interaction may have been recorded in multiple databases, based on different experimental methods. To provide a single probability score for each interaction of a source and target, the interactions are combined. A single probability is generated per identified interaction by taking the geometric mean of the probabilities for that interaction. However, this method is undesirable when combining multiple interaction records of the same type. We instead combine the interaction records using a form of probabilistic voting using composite Z-scores. This is done to model that multiple experiments that produce the same results reinforce each other, and should therefore give a higher overall probability than would be indicated by taking their mean or even by Bayes theorem. We do this by converting each probability into a Z-score (aka standard score) using the quantile function ($\small Q()$), summing the values, and applying the cumulative distribution function ($\small CDF()$) to compute the corresponding probability:

    $\small P(x_{1...n)} = CDF \left( \begin{align} \sum_{i=1}^{n} Q(P(x_i)) ènd{align} \right) $

“These composite Z-scores, which we transform back into probabilities, are frequently used to combine multiple indicators of the same underlying phenomena, as in Moller et al., 1998). However, it has a drawback. One concern is that the strategy does not account for multiple databases recording the same non-independent experiment. This can possibly inflating the probabilities of interactions described by experiments that are published in more than one database.”

See also:

Zero-shot {training | learning | transfer | …}

Zero-shot {training | learning | transfer | recognition | domain adaptation |…} is learning a classifier (multiclass classification, not binary) for a category using zero training examples. Zero-shot learning is being able to solve a task despite not having received any training examples of that task. Simply stated, zero shot learning is a form of extending supervised learning to a setting of solving for example a classification problem when not enough labeled examples are available for all classes. Zero-shot learning aims to recognize objects whose instances may not have been seen during training.

The alternative and less-explored paradigm for zero-shot learning is to use explicit knowledge bases or knowledge graphs. In this paradigm, one explicitly represents the knowledge as rules or relationships between objects. These relationships can then be used to learn zero-shot classifiers for new categories. The simplest example would be to learn visual classifiers of compositional categories. [Source]

See also: