Last modified: 2018-12-11

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Activation functions

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. An activation function – for example, ReLU or sigmoid – takes in the weighted sum of all of the inputs from the previous layer, then generates and passes an output value (typically nonlinear) to the next layer; i.e., they take a single number and perform a fixed mathematical operation on it. These functions should be non-linear, to encode complex patterns of the data. The most popular activation functions are sigmoid, tanh and ReLU. ReLU is the most popular activation function in deep neural networks.

activation functions

[Image source. Click image to open in new window.]

Here is a simple artificial neural network:


[ANN showing bias units. Click image to open in new window.]

Each of the circles above represent individual nodes (different layers may contain different numbers of nodes). In the feedforward pass, each node is acted upon (numerically modified) by the activation function  $\small \phi$, shown in the images below [and later adjusted, in response to the error, during the backpropagation step – neither shown nor discussed here.] We calculate each of the layer $\small 2$ node activations based on the sum of the input values ($\small \mathbf{x}_1, \mathbf{x}_2, \dots$) times the weight matrix $\small \mathbf{W}_{ij}$, plus the bias term $\small \mathbf{b}$. For example, the value of node $\small \mathbf{a}{(2) \atop 3}$ in the image, above, representing unit $\small 3$ in layer $\small 2$), is calculated from the values of each of the input nodes $\small \mathbf{x}_i$ in the preceding layer, the weight matrices $\small \mathbf{W}_{ij}$, and the bias unit $\small \mathbf{b}$ ($\small \mathbf{x}_0 = +1$), giving the layer $\small 2$ node activations $\small \mathbf{a}{(2) \atop i}$.

The activation function is shown in greater detail in the following two images; as mentioned, it acts on each node. The activation values on each of the hidden units/nodes in layer $\small 3$ ($\small \mathbf{a}{(3) \atop i}$) are derived from the sigmoid function ($\small \phi(\cdot)$) applied to the linear combination (sum, $\small \sum$) of the inputs to those nodes (the layer $\small 2$ activations multiplied by the weight matrix, plus the bias unit). The activations are derived using exactly the same logic as for the first hidden layer, except that the input values are the activation values $\small \mathbf{a}$ from the preceding layer, rather than the $\small \mathbf{x}$ values. …


[image source; $\small \text{output} = \text{activation}(\sum(\text{weight} * \text{input} + \text{bias})$  (click image to open in new window)]


[Image source. Click image to open in new window.]

This provides a very superficial overview. To fully understand the basics of machine learning, you need to study/know at least a bit of linear algebra (matrix mathematics: transformations, multiplications, … – often used in various parametrizations such as weight matrices, …), partial derivatives (needed for the chain rule, used in backpropagation, …), etc. In my opinion, Andrew Ng’s Machine Learning course is essential preparation, providing a thorough understanding of the basics. Dr. Ng. also teaches the Stanford University cs229 Machine Learning course, which provides online materials.

For example, the summations, $\small \sum(\cdot)$, variously shown above are normally and very neatly accomplished in one step by multiplying the inputs by a transposed weight matrix (that also includes the bias term). Given a neuron that takes as input $\small \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and a bias term $\small \mathbf{b} = 1$, the output would be $\small \mathbf{h}_{W,b}(\mathbf{x}) = \phi(\sum_{i=1}^{3}\mathbf{W}_i\mathbf{x}_i + b) = \phi(\mathbf{W}^T\mathbf{x})$. Note that last expression, $\small \phi(\mathbf{W}^T\mathbf{x})$: the transposed weight matrix (including the bias) is multiplied by the input giving the output in one operation!

Activation (+ Other ML) Resources

Ad hoc (ad-hoc) retrieval

See Document relevance ranking.

Adjacency matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Degree Matrix.

Affine transformation

In geometry, an affine transformation, affine map or an affinity (from the Latin, affinis, “connected with”) is a function between affine spaces which preserves points, straight lines and planes. Also, sets of parallel lines remain parallel after an affine transformation. An affine transformation does not necessarily preserve angles between lines or distances between points, though it does preserve ratios of distances between points lying on a straight line.

Examples of affine transformations include translation, scaling, homothety, similarity transformation, reflection, rotation, shear mapping, and compositions of them in any combination and sequence.

If $\small X$ and $\small Y$ are affine spaces, then every affine transformation $\small f \colon X \to Y$ is of the form $\small x \mapsto Mx+b$, where $\small M$ is a linear transformation on the space $\small X$, $\small x$ is a vector in $\small X$, and $\small b$ is a vector in $\small Y$. Unlike a purely linear transformation, an affine map need not preserve the zero point in a linear space. Thus, every linear transformation is affine, but not every affine transformation is linear.

All Euclidean spaces are affine, but there are affine spaces that are non-Euclidean. In affine coordinates, which include Cartesian coordinates in Euclidean spaces, each output coordinate of an affine map is a linear function (in the sense of calculus) of all input coordinates. Another way to deal with affine transformations systematically is to select a point as the origin; then, any affine transformation is equivalent to a linear transformation (of position vectors) followed by a translation.


[Image source. An image of a fern-like fractal that exhibits affine self-similarity. Each of the leaves of the fern is related to each other leaf by an affine transformation. For instance, the red leaf can be transformed into both the dark blue leaf and the light blue leaf by a combination of reflection, rotation, scaling, and translation. Click image to open in new window.]

Matrix affine transformations

To represent affine transformations with matrices, we can use homogeneous coordinates. This means representing a 2-vector $\small (x,y)$ as a 3-vector $\small (x,y,1)$, and similarly for higher dimensions. Using this system, translation can be expressed with matrix multiplication. The functional form

    $\small x' = x + t_x$;
    $\small y' = y + t_y$


    $\small \begin{bmatrix} x' \\\ y' \\\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & t_{x} \\\ 0 & 1 & t_{y} \\\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\\ y \\\ 1 \end{bmatrix}$.

All ordinary linear transformations are included in the set of affine transformations, and can be described as a simplified form of affine transformations. Therefore, any linear transformation can also be represented by a general transformation matrix. The latter is obtained by expanding the corresponding linear transformation matrix by one row and column, filling the extra space with zeros except for the lower-right corner, which must be set to $\small 1$. For example, the counter-clockwise rotation matrix from above becomes:

    $\small \begin{bmatrix} \cos \theta & -\sin \theta & 0 \\\ sin \theta & \cos \theta & 0 \\\ 0 & 0 & 1 \end{bmatrix}$

Using transformation matrices containing homogeneous coordinates, translations become linearly independent, and thus can be seamlessly intermixed with all other types of transformations. The reason is that the real plane is mapped to the $\small w = 1$ plane in real projective space, and so translation in real Euclidean space can be represented as a shear in real projective space. Although a translation is a non-linear transformation in a 2-D or 3-D Euclidean space described by Cartesian coordinates (i.e. it can’t be combined with other transformations while preserving commutativity and other properties), it becomes, in a 3-D or 4-D projective space described by homogeneous coordinates, a simple linear transformation (a shear).

For an example of shear mapping in the eigenspace, refer here.

More affine transformations can be obtained by composition of two or more affine transformations. … [continued here.]

Example: 2-D Transformation

When a transformation takes place on a 2D plane, it is called 2D transformation.

We can represent a point, $\small \mathbf{p} = (x,y)$, in the plane as a row vector $\small \begin{bmatrix} x & y \end{bmatrix}$ or as a column vector $\small \begin{bmatrix} x \\ y \end{bmatrix}$.

We can represent a 2-D transformation $\small \mathbf{M}$ by a matrix $\small \mathbf{M} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$.

If $\small \mathbf{p}$ is a column vector ($\small [2 \times 1]$), then we need to place $\small \mathbf{M}$ on the left:

… $\small [\color{Blue}2 \times \color{Magenta}2] [\color{Magenta}2 \times \color{Blue}1] = [\color{Blue}2 \times \color{Blue}1]$; we cannot multiply $\small [2 \times 1] [2 \times 1]$, due to the mismatched matrix dimensions.

    $\small \begin{bmatrix} x' \\\ y' \end{bmatrix} = \begin{bmatrix} a & b \\\ c & d \end{bmatrix} \begin{bmatrix} x \\\ y \end{bmatrix} = \begin{bmatrix} ax + by \\\ cx + dy \end{bmatrix}$

    $\small \begin{bmatrix} x' \\\ y' \end{bmatrix} = \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} \begin{bmatrix} 3 \\\ 2 \end{bmatrix} = \begin{bmatrix} 1(3) + 2(2) \\\ 3(3) + 4(2) \end{bmatrix} = \begin{bmatrix} 7 \\\ 17 \end{bmatrix}$

If $\small \mathbf{p}$ is a row vector ($\small [1 \times 2]$), then we need to place $\small \mathbf{M}^T$ on the right ($\small [1 \times 2][2 \times 2]^\color{Magenta}T$):

$\small \mathbf{M} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$,

$\small \mathbf{M}^T = \begin{bmatrix} a & c \\ b & d \end{bmatrix}$.

    $\small \begin{bmatrix} x' & y' \end{bmatrix} = \begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} a & c \\\ b & d \end{bmatrix} = \begin{bmatrix} xa + yb & xc + yd \end{bmatrix}$

    $\small \begin{bmatrix} x' & y' \end{bmatrix} = \begin{bmatrix} 3 & 2 \end{bmatrix} \begin{bmatrix} 1 & 3 \\\ 2 & 4 \end{bmatrix}$ $\small = \begin{bmatrix} 3(1) + 2(2) & 3(3) + 2(4) \end{bmatrix} = \begin{bmatrix} 7 & 17 \end{bmatrix}$

which is basically the same result as above (same point, same transformation!).

[If we did not do that matrix transform, $\small \mathbf{M}^T$, we would have obtained $\small \begin{bmatrix} 9 & 14 \end{bmatrix}$).]

octave:>> p=[3,2]
p =
  3   2

octave:>> q=[3;2]
q =

octave:>> M=[1,2;3,4]
M =
  1   2
  3   4

octave:>> M'
ans =
  1   3
  2   4

octave:>> p*M'
ans =
    7   17

octave:>> M*q
ans =



See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.


In semantics, an antonym is a word having the opposite meaning as another word.

Examples: “happy/unhappy”; “heavy/light”; “long/short”; “up/down”; “dead/alive”; “parent/child”; …

See also Linguistics | Semantics.

Oppositeness is a logical category. There are three types:

  • Complementary pairs are antonyms in which the presence of one quality or state signifies the absence of the other and vice versa: single/ married, not pregnant/pregnant, … There are no intermediate states: you can’t be “a little pregnant”, or “kinda married”.

  • Gradable pairs are antonyms which allow for a natural, gradual transition between two poles: “good/bad”, or “hot/cold”. It is possible to be a “little cold” or “very cold”, etc.

  • Relational opposites are antonyms which share the same semantic features – only the focus (or direction) is reversed: “tie/untie”, “buy/sell”, “give/receive”, “teacher/pupil”, “father/son”.

Some concepts lack logical opposites that can be described in terms of any special word; colors are a good example: the logical opposite of “red” is “not red”. Such concepts may form relational antonyms, however, through symbolic systems of thinking. For instance, in the Cold War paradigm, the relational opposite of “American” is “Russian”; in current US politics, the relational opposite of “Democrat” is “Republican”. These are cultural relational opposites.

Architectures; Algorithms; Frameworks;
Libraries; Models; Platforms

I tend to differentiate architectures (the programmatic implementation of a thematic class of machine learning algorithms and procedures, the “basic code”) from models (that represent various implementations of those architectures). Examples of major architectures include:

Frameworks and libraries. Differentiating the meaning and use of these two terms is a terminological rabbit hole – exemplified in the StackOverflow threads What is the difference between a framework and a library? and Framework vs. Toolkit vs. Library. While the two terms tend to be used interchangeably (for example, TensorFlow, PyTorch, etc. are often included in lists of machine learning libraries), I personally prefer the use of the term framework for programming environments including:

  • Caffe
  • TensorFlow
  • PyTorch
  • Theano
  • Torch
  • etc.

Deep learning frameworks simplify deep learning model development and training by providing high-level primitives for complex and error-prone mathematical transformations, like gradient descent, back-propagation, and inference. Deep learning frameworks vary in their level of functionality. Some of them, such as Theano or TensorFlow, allow you to define a neural network of arbitrary complexity from the most basic building blocks. These types of frameworks might even be called languages. Other “frameworks” [← there’s an example of a library being referred to as a framework!], such as Keras, are drivers or wrappers aimed at boosting developer productivity but are limited in their functionality due to the higher level of abstraction. [Sources: Hands-On AI Part 5: Select a Deep Learning Framework]

Like my use of the term “framework,” I prefer to reserve the use of the term “library” for the libraries provided in programming languages, such as the NumPy package/library in the Python programming language. By this definition, Keras (which uses TensorFlow as a backend) is likewise a library.

The simplicity of Python, a general purpose programming language, has attracted many developers to build libraries for machine learning and data science (and, because of all these libraries, Python is almost popular as R for data science). A popular machine learning library for Python includes NumPy.

Overall, I agree with a comment by @Trunk in SO #148747:

“A software framework is not just a library. It also includes a particular design architecture which users must work within – a sort way in which different items in the library are connected together to provide a generic solution for a particular application type. Frameworks thus have the advantage of needing only configuration by the user and validation of pre-written tests to become operational. With libraries alone users need to design an architecture and create tests before implementation. So more work is needed with just the libraries. But libraries are flexible and potentially more efficient.”


Why Libraries are better than Frameworks  (cited in SO #3057526 and the source for the image, above) also defines the term platform:

“A platform is some lowest level on which to build an application, examples are OS’s or virtual machines, e.g.: Linux, OS X, Java, or .NET. These platforms are great; they allow application developers to focus on the specific application logic, reusing all the ‘plumbing’ like sockets and file systems.”

The ML Cheatsheet also provides some definitions:

  • Algorithm: a method, or series of instructions, devised to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

  • Model: a structure that stores a generalized, internal representation of a dataset (weights and biases) for description or prediction. Models are created/learned when you train an algorithm on a dataset (i.e., the output is a model).

  • Model vs. Algorithm:

    • Model: the representation learned from a dataset and used to make future predictions.
    • Algorithm: the process for learning it.

      Model = Algorithm(Training Data)


In logic, mathematics, and computer science, the arity of a function or operation is the number of arguments or operands that the function takes. The arity of a relation (or predicate) is the dimension of the domain in the corresponding Cartesian product. (A function of arity $\small n$ thus has arity $\small n+1$ considered as a relation.) The term springs from words like unary, binary, ternary, etc. Unary functions or predicates may be also called “monadic”; similarly, binary functions may be called “dyadic”.

In mathematics, arity may also be named rank, but this word can have many other meanings in mathematics. In logic and philosophy, arity is also called adicity and degree. In linguistics, arity is usually named valency.

In computer programming, there is often a syntactical distinction between operators and functions; syntactical operators usually have arity 0, 1, or 2 (the ternary operator ?: is also common). Functions vary widely in the number of arguments, though large numbers can become unwieldy. Some programming languages also offer support for variadic functions, i.e., functions syntactically accepting a variable number of arguments.


An autoencoder is a neural network that is trained to copy its input to its output. An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction. Recently, the autoencoder concept has become more widely used for learning generative models of data. Some of the most powerful AI in the 2010s have involved sparse autoencoders stacked inside of deep neural networks.

An autoencoder learns to compress data from the input layer into a short code, and then uncompress that code into something that closely matches the original data. This forces the autoencoder to engage in dimensionality reduction, for example by learning how to ignore noise. Some architectures use stacked sparse autoencoder layers for image recognition. The first autoencoder might learn to encode easy features like corners, the second to analyze the first layer’s output and then encode less local features like the tip of a nose, the third might encode a whole nose, etc., until the final autoencoder encodes the whole image into a code that matches (for example) the concept of “cat.” An alternative use is as a generative model: for example, if a system is manually fed the codes it has learned for “cat” and “flying”, it may attempt to generate an image of a flying cat, even if it has never seen a flying cat before.

Architecturally, the simplest form of an autoencoder is a feedforward, non-recurrent neural network very similar to the many single layer perceptrons which makes a multilayer perceptron (MLP) – having an input layer, an output layer and one or more hidden layers connecting them – but with the output layer having the same number of nodes as the input layer, and with the purpose of reconstructing its own inputs (instead of predicting the target value $\small Y$ given inputs $\small X$. Therefore, autoencoders are unsupervised learning models.


[Image source. Click image to open in new window.]

An autoencoder always consists of two parts, the encoder and the decoder, which can be defined as transitions $\small \phi$ and $\small \psi$, such that:

    $\small \phi : \mathcal{X} \rightarrow \mathcal{F}$
    $\small \psi : \mathcal{F} \rightarrow \mathcal{X}$
    $\small \phi, \psi = { \underset{\phi, \psi} {\operatorname{arg\,min}} }\ \|X - (\psi \circ \phi )X\|^2$

In the simplest case, where there is one hidden layer, the encoder stage of an autoencoder takes the input $\small \mathbf{x} \in \mathbb{R}^d = \mathcal{X}$ and maps it to $\small \mathbf{z} \in \mathbb{R}^p = \mathcal{F}$:

    $\small \displaystyle \mathbf{z} = \sigma (\mathbf{Wx} + \mathbf{b})$

This image $\small \mathbf{z}$ is usually referred to as code, latent variables, or latent representation. Here, $\small \sigma$ is an element-wise activation function such as a sigmoid function or a rectified linear unit. $\small \mathbf{W}$ is a weight matrix and $\small \mathbf{b}$ is a bias vector. After that, the decoder stage of the autoencoder maps $\small \mathbf{z}$ to the reconstruction $\small \mathbf{x’}$ of the same shape as $\small \mathbf{x}$:

    $\small \mathbf{x'} = \sigma '(\mathbf{W'z} +\mathbf{b'})$

where $\small \mathbf{\sigma’}$, $\small \mathbf {W’}$, and $\small \mathbf{b’}$ for the decoder may differ in general from the corresponding $\small \mathbf{\sigma}$, $\small \mathbf{W}$, and $\small \mathbf{b}$ for the encoder, depending on the design of the autoencoder.

Autoencoders are also trained to minimise reconstruction errors (such as squared errors):

    $\small \mathcal{L} (\mathbf{x}, \mathbf{x'}) = \|\mathbf{x} - \mathbf{x'} \|^2 = \|\mathbf{x} - \sigma '(\mathbf{W'} (\sigma (\mathbf{Wx} + \mathbf{b})) + \mathbf{b'})\|^2$

where $\small \mathbf{x}$ is usually averaged over some input training set.

If the feature space $\small \mathcal{F}$ has lower dimensionality than the input space $\small \mathcal{X}$, then the feature vector $\small \phi(x)$ can be regarded as a compressed representation of the input $\small x$. If the hidden layers are larger than the input layer, an autoencoder can potentially learn the identity function and become useless. However, experimental results have shown that autoencoders might still learn useful features in these cases.

Source: Wikipedia: autoencoder.

See also:

Automatic differentiation

In mathematics and computer algebra, automatic differentiation (AD), also called algorithmic differentiation or computational differentiation, is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. AD exploits the fact that every computer program, no matter how complicated, executes a sequence of elementary arithmetic operations (addition, subtraction, multiplication, division, etc.) and elementary functions (exp, log, sin, cos, etc.). By applying the chain rule repeatedly to these operations, derivatives of arbitrary order can be computed automatically, accurately to working precision, and using at most a small constant factor more arithmetic operations than the original program.


[Image source. Click image to open in new window.]

Automatic differentiation is not:

These classical methods run into problems: symbolic differentiation leads to inefficient code (unless done carefully) and faces the difficulty of converting a computer program into a single expression, while numerical differentiation can introduce round-off errors in the discretization process and cancellation. Both classical methods have problems with calculating higher derivatives, where the complexity and errors increase. Finally, both classical methods are slow at computing the partial derivatives of a function with respect to many inputs, as is needed for gradient-based optimization algorithms. Automatic differentiation solves all of these problems, at the expense of introducing more software dependencies.

Chain rule; Forward and reverse accumulation

Fundamental to AD is the decomposition of differentials provided by the chain rule. For the simple composition

    $\small y = f(g(h(x))) = f(g(h(w_0))) = f(g(w_1)) = f(w_2) = w_3$

the chain rule gives

    $\large \frac{dy}{dx} = \frac {dy}{dw_2} \frac{dw_2}{dw_1} \frac{dw_1}{dx}$

Usually, two distinct modes of AD are presented, forward accumulation (or forward mode) and reverse accumulation (or reverse mode). Forward accumulation specifies that one traverses the chain rule from inside to outside (that is, first compute $\small dw_{1}/dx$ and then $\small dw_{2}/dw_{1}$ and finally $\small dy/dw_{2})$, while reverse accumulation has the traversal from outside to inside (first compute $\small dy/dw_{2}$ and then $\small dw_{2}/dw_{1}$ and at last $\small dw_{1}/dx$). More succinctly,

    forward accumulation computes the recursive relation:

    $\large \frac{dw_i}{dx} = \frac{dw_i}{dw_{i-1}} \frac{dw_{i-1}}{dx}$ with $\small w_3 = y$, and,

    reverse accumulation computes the recursive relation:

    $\large \frac{dy}{dw_i} = \frac{dy}{dw_{i+1}} \frac{dw_{i+1}}{dw_i}$ with $\small w_0 = x$.

Generally, both forward and reverse accumulation are specific manifestations of applying the operator of program composition, with the appropriate one of the two mappings $\small (w,y)$ being fixed.

See also:

Autoregression (Autocorrelation)

Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. It is a very simple idea that can result in accurate forecasts on a range of time series problems.

A regression model, such as linear regression, models an output value based on a linear combination of input values.

For example:

    $\small \hat{y} = \mathbf{b}_0 + \mathbf{b}_1 \mathbf{X}_1$

Where $\small \hat{y}$ is the prediction, $\small \mathbf{b}_0 $ and $\small \mathbf{b}_1$ are coefficients found by optimizing the model on training data, and $\small \mathbf{X}_1$ is an input value.

This technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables. For example, we can predict the value for the next time step (t+1) given the observations at the last two time steps (t-1 and t-2). As a regression model, this would look as follows:

    $\small \mathbf{X}_{(t+1)} = \mathbf{b}_0 + \mathbf{b}_1 \mathbf{X}_{(t-1)} + \mathbf{b}_2 \mathbf{X}_{(t-2)}$

Because the regression model uses data from the same input variable at previous time steps, it is referred to as an autoregression (regression of self).

Autoregressive, feed-forward models

Instead of making predictions from a state that depends on the entire history, an autoregressive model directly predicts $\small y_t$ using only the $\small k$ most recent inputs, $\small x_{t-k+1}, \ldots, x_t$. This corresponds to a strong conditional independence assumption. In particular, a feed-forward model assumes the target only depends on the $\small k$ most recent inputs. Google’s WaveNet nicely illustrates this general principle.


[Image source. Click image to open in new window.]

In contrast to an RNN, the limited context of a feed-forward model means that it cannot capture patterns that extend more than $\small k$ steps. However, using techniques like dilated-convolutions, one can make $\small k$ quite large.

Source for the preceding subsection: When Recurrent Models Don’t Need to be Recurrent.


An autoregression model makes an assumption that the observations at previous time steps are useful to predict the value at the next time step. This relationship between variables is called correlation.

If both variables change in the same direction (e.g. go up together or down together), this is called a positive correlation. If the variables move in opposite directions as values change (e.g. one goes up and one goes down), then this is called negative correlation.

We can use statistical measures to calculate the correlation between the output variable and values at previous time steps at various different lags. The stronger the correlation between the output variable and a specific lagged variable, the more weight that autoregression model can put on that variable when modeling.

Again, because the correlation is calculated between the variable and itself at previous time steps, it is called an autocorrelation. It is also called serial correlation because of the sequenced structure of time series data.

The correlation statistics can also help to choose which lag variables will be useful in a model and which will not.

Interestingly, if all lag variables show low or no correlation with the output variable, then it suggests that the time series problem may not be predictable. This can be very useful when getting started on a new dataset.


See Proofs.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


Backpropagation operations in neural networks can be divided into two steps: feedforward, and backpropagation. In the feedforward step, an input pattern is applied to the input layer and its effect propagates, layer by layer, through the network until an output value is produced. The network’s actual output value is then compared to the expected output (i.e., the desired output value), and an error signal is computed for each of the output nodes. Since all the hidden nodes to some degree have contributed to the errors evident in the output layer, the output error signals are transmitted backwards from the output layer to each node in the hidden layer that immediately contributed to the output layer. This process is then repeated, layer by layer, until each node in the network has received an error signal that describes its relative contribution to the overall error.

Once the error signal for each node has been determined, the errors are then used by the nodes to update the values for each connection weights until the network converges to a state that allows all the training patterns to be encoded. The Backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function is then considered to be a solution to the learning problem. [Source: Backpropagation]

In neural networks, if the estimated output $\small \hat{y}_i$ is far away from the actual output $\small y_i$ (high error), we update the biases and weights based on the error. This weight and bias updating process is known as back propagation. Backpropagation is an algorithm to efficiently calculate the gradients in a neural network (or more generally, a feedforward computational graph).

The actual output  $\small \mathbf{y}_i$ (observed from the network) is compared to the estimated output  $\small \mathbf{\hat{y}}_i$ (the desired/predicted/expected output). The error is the mean squared error,

$\ \ \ \ \small \begin{align} \frac{1}{n} \sum_{i=1}^n(y_i-\hat{y}_i)^2 \end{align}$

Backpropagation algorithms work by applying the (partial differentiation) chain rule determining the loss (or error) at the output, and then propagating it back into the network. The weights are updated to minimize the error resulting from each neuron. The first step in minimizing the error is to determine the gradient (derivatives) of each node with respect to the final output.

The gradient is the vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient, a slope (i.e. partial derivative), is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.

Gradient descent is technique to minimize loss by computing the gradients of loss with respect to the model’s parameters, conditioned on the training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss.

$\ \ \ \ \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_i}J(\theta_0,\theta_1)$   (for $\small j=0$ and $\small j=1$)  [← gradient descent]

where $\small \alpha$ (alpha, the learning rate) determines how quickly we move toward the minimum.

See also:

In computer science, a beam search is a heuristic search algorithm that explores a graph by expanding the most promising node in a limited set. Beam search is an optimization of best-first search that reduces its memory requirements. Best-first search is a graph search which orders all partial solutions (states) according to some heuristic. But in beam search, only a predetermined number of best partial solutions are kept as candidates. It is thus a greedy algorithm.

Beam search uses breadth-first search to build its search tree. At each level of the tree, it generates all successors of the states at the current level, sorting them in increasing order of heuristic cost. However, it only stores a predetermined number, $\small \beta$, of best states at each level (called the beam width). Only those states are expanded next. The greater the beam width, the fewer states are pruned. With an infinite beam width, no states are pruned and beam search is identical to breadth-first search. The beam width bounds the memory required to perform the search. Since a goal state could potentially be pruned, beam search sacrifices completeness (the guarantee that an algorithm will terminate with a solution, if one exists). Beam search is not optimal (that is, there is no guarantee that it will find the best solution).

In general, beam search returns the first solution found. Beam search for machine translation is a different case: once reaching the configured maximum search depth (i.e. translation length), the algorithm will evaluate the solutions found during search at various depths and return the best one (the one with the highest probability).

The beam width can either be fixed or variable. One approach that uses a variable beam width starts with the width at a minimum. If no solution is found, the beam is widened and the procedure is repeated.

A beam search is most often used to maintain tractability in large systems with insufficient amount of memory to store the entire search tree. For example, it is used in many machine translation systems. To select the best translation, each part is processed, and many different ways of translating the words appear. The top best translations according to their sentence structures are kept, and the rest are discarded. The translator then evaluates the translations according to a given criterion, choosing the translation which best keeps the goals. The first use of a beam search was in the Harpy Speech Recognition System, CMU 1976.

See also:


Bias terms ($\small \textbf{b}$ or $\small \textbf{w}_0$) are additional constants attached to the neurons and added to the weighted input before the activation function is applied. Bias terms help models represent patterns that do not necessarily pass through the origin. For example, if all your features were 0, would your output also be zero? Is it possible there is some base value upon which your features have an effect? Bias terms typically accompany weights and must also be learned by your model. [Source: ML Cheatsheet]

Bias is the term $\small \textbf{b}$ in the following formula: $\small y^{\ \prime} = \textbf{b} + \mathbf{w}_1\mathbf{x}_1 + \mathbf{w}_2\mathbf{x}_2 + \ldots + \mathbf{w}_n\mathbf{x}_n$.

Bidirectional LSTM (Bi-LSTM)

See Bidirectional LSTM (Bi-LSTM)  [LSTM subsection]


BioNLP is natural language processing, applied in the biomedical domain. See also: Natural language processing.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Capsule networks

In 2011 Geoff Hinton – building on his contemplation of the subject, spanning decades (Understanding Hinton’s Capsule Networks. Part I: Intuition  [local copy]) – introduced his thoughts on “capsules” in his paper Transforming Auto-Encoders. A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. The use of capsules introduced a simple and elegant new building block that could be used in deep learning to better model hierarchical relationships inside of internal knowledge representation of a neural network.


[Image source. Click image to open in new window.]

Capsule networks were introduced by Sabour, Frosst and Hinton in Dynamic Routing Between Capsules (Nov 2017). In capsule networks, dynamic routing iteratively computes weights over inputs by the inner product between inputs and a weighted sum of inputs. Varying with the inputs, the weighted sum detects informative inputs; therefore it can be interpreted as a dynamic weight vector from the perspective of self-attention. “We expect the dynamic weight vector to give rise to flexibility in self-attention since it can adapt to given sentences even after training.”


[Image source. Click image to open in new window.]

See also Hinton et al., Matrix Capsules with EM Routing (ICLR 2018):

“A capsule is a group of neurons whose outputs represent different properties of the same entity. Each layer in a capsule network contains many capsules. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 matrix which could learn to represent the relationship between that entity and the viewer (the pose). A capsule in one layer votes for the pose matrix of many different capsules in the layer above by multiplying its own pose matrix by trainable viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated for each image using the Expectation-Maximization [EM] algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The transformation matrices are trained discriminatively by backpropagating through the unrolled iterations of EM between each pair of adjacent capsule layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state-of-the-art. Capsules also show far more resistance to white box adversarial attacks than our baseline convolutional neural network.”


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source"The capsule network is much better than other models at telling that images in top and bottom rows belong to the same classes, only the view angle is different. The latest papers decreased the error rate by a whopping 45%."  [Quotation source (local copy)].  Click image to open in new window.]

Additional reading

Chain rule

In calculus, the chain rule is a formula for computing the derivative of the composition of two or more functions. That is, if $\small f$ and $\small g$ are functions, then the chain rule expresses the derivative of their composition $\small f \circ g$ (the function which maps $\small x$ to $\small f(g(x))$) in terms of the derivatives of $\small f$ and $\small g$ and the product of functions as follows:

    $\small (f\circ g)'=(f'\circ g) \cdot g'$.

This may equivalently be expressed in terms of the variable. Let $\small F = f \circ g$, or equivalently, $\small F(x) = f(g(x))$ for all $\small x$. Then one can also write

    $\small F'(x)=f'(g(x))g'(x)\ $.

The chain rule may be written in Leibniz’s notation in the following way. If a variable $\small z$ depends on the variable $\small y$, which itself depends on the variable $\small x$, so that $\small y$ and $\small z$ are therefore dependent variables, then $\small z$, via the intermediate variable of $\small y$, depends on $\small x$ as well. The chain rule then states,

    $\large \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}\ $.

The two versions of the chain rule are related; if $\small z = f(y)$ and $\small y = g(x)$, then

    $\large \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} \small = f'(y)g'(x) = f'(g(x))g'(x)\ $.

In integration, the counterpart to the chain rule is the substitution rule.


See also:

Circulating Tumor Cells (CTCs)

Circulating tumor cells (CTCs) are cells that have shed into the vasculature or lymphatics from a primary tumor and are carried around the body in the blood circulation. CTCs constitute seeds for the subsequent growth of additional tumors (metastases) in distant organs, a mechanism that is responsible for the vast majority of cancer-related deaths. The detection and analysis of CTCs can assist early patient prognoses and determine appropriate tailored treatments.

Circulating Tumor Cells (CTC) are rare tumor cells that have been shed from primary or secondary tumors. CTCs have been investigated as diagnostic, prognostic and predictive biomarkers in many types of cancer. Prior studies have demonstrated that enumeration of CTC is a robust independent prognostic factor of progression free and overall survival in patients with early and metastatic breast cancer. CTC studies have accumulated a high level of clinical validity, especially in breast, lung, prostate and colorectal cancers. CTC, as well as other circulating tumor markers, have the appealing advantages over tissue biopsy of

  • ease of collection,
  • serial evaluation, and
  • interrogation of the entire tumor burden instead of just a limited part of the tumor.

Advances have been recently made in phenotyping and genotyping of CTC, which should provide insights into the predictive role of CTC for sensitivity or resistance to therapies. In addition, CTC phenotypic marker changes during the course of treatment may serve as pharmacodynamic monitoring tools. Therefore, CTC may be considered “liquid biopsies,” providing prognostic and predictive clinical information as well as additional understanding of tumor heterogeneity.


Nanoscale Blood Diagnostics

Related to the topic of circulating tumor cells are methods for screening health status on nanoscale sampling of peripheral blood (with applications in diagnosis, treatment, personalized medicine, etc). Examples include:

  • Theranostic Nanoparticles for Tracking and Monitoring Disease State (Nov 2017)   [Summary]The development of novel nanoparticles consisting of both diagnostic and therapeutic components has increased over the past decade. These “theranostic” nanoparticles have been tailored toward one or more types of imaging modalities and have been developed for optical imaging, magnetic resonance imaging, ultrasound, computed tomography, and nuclear imaging comprising both single-photon computed tomography and positron emission tomography. In this review, we focus on state-of-the-art theranostic nanoparticles that are capable of both delivering therapy and self-reporting/tracking disease through imaging. We discuss challenges and the opportunity to rapidly adjust treatment for individualized medicine.   [media]

  • Direct electrical quantification of glucose and asparagine from bodily fluids using nanopores (Oct 2018)   [Summary]Crucial steps in the miniaturisation of biosensors are the conversion of a biological signal into an electrical current as well as the direct sampling of bodily fluids. Here we show that protein sensors in combination with a nanopore, acting as an electrical transducer, can accurately quantify metabolites in real time directly from nanoliter amounts of blood and other bodily fluids. Incorporation of the nanopore into portable electronic devices will allow developing sensitive, continuous, and non-invasive sensors for metabolites for point-of-care and home diagnostics.   [media]

  • The Human In Vivo Biomolecule Corona onto PEGylated Liposomes: A Proof‐of‐Concept Clinical Study (Nov 2018)  [Summary]The self‐assembled layered adsorption of proteins onto nanoparticle (NP) surfaces, once in contact with biological fluids, is termed the “protein corona” and it is gradually seen as a determinant factor for the overall biological behavior of NPs. Here, the previously unreported in vivo protein corona formed in human systemic circulation is described. The human‐derived protein corona formed onto PEGylated doxorubicin‐encapsulated liposomes (Caelyx) is thoroughly characterized following the recovery of liposomes from the blood circulation of ovarian carcinoma patients. In agreement with previous investigations in mice, the in vivo corona is found to be molecularly richer in comparison to its counterpart ex vivo corona. The intravenously infused liposomes are able to scavenge the blood pool and surface‐capture low‐molecular‐weight, low‐abundance plasma proteins that cannot be detected by conventional plasma proteomic analysis. This study describes the previously elusive or postulated formation of protein corona around nanoparticles in vivo in humans and illustrates that it can potentially be used as a novel tool to analyze the blood circulation proteome.  [media]


See Proofs.

Computational linguistics

Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions.

See also Linguistics | Semantics.


See Proofs.

Constraint graphs

In constraint satisfaction research of artificial intelligence and operations research, constraint graphs and hypergraphs are used to represent relations among constraints in a constraint satisfaction problem. A constraint graph is a special case of a factor graph, which allows for the existence of free variables.

This YouTube video shows how to draw constraints on a graph (graph constraints), which explains the concept pretty well:

The basic premise is:

  • Let $\small y$ be your variable of interest
  • Plot your (here, linear) equations
  • We are interested in the region bounded by the equations ($\small y \leq x…$ or $\small y \geq x…$):


[Image source. Click image to open in new window.]

The use of constraint graphs for finding optimal Bayesian networks (graphs) is an intuitive method for incorporating rich prior knowledge into the structure learning tasks. Finding the Optimal Bayesian Network Given a Constraint Graph (Jul 2017) [code provides an example of the use of constraint graphs. Their accompanying Jupyter notebook provides toy examples.


[Image source. Click image to open in new window.]

See also:

Content words; Function words

In NLP parts of speech (POS) content words are words that name objects of reality and their qualities. They signify:

  • actual living things: dog, cat, etc.
  • family members: mother, father, sister, etc.
  • natural phenomena: snow, sun, etc.
  • common actions: do, make, come, eat, etc.
  • characteristics: young, cold, dark, etc.
  • etc.

Content words consist mostly of nouns, lexical verbs and adjectives, but certain adverbs can also be content words.

Content words contrast with function words, which are words that have very little substantive meaning and primarily denote grammatical relationships between content words, such as:

  • prepositions: in, out, under, etc.
  • pronouns: I, you, he, who, etc.
  • conjunctions: and, but, till, as, etc.
  • etc.

Content words and function words are used in the following papers.

  • Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation, which proposed a new model with a mechanism called Self-Adaptive Control of Temperature (SACT) to control the softness of attention by means of an “attention temperature.” … the model could learn a soft distribution of attention weights which was more uniform for generating function words, and a hard distribution which is sparser for generating content words. …

  • Google Brain also employed content words and function words in their improved language model, The Importance of Generation Order in Language Modeling. “This paper studies the influence of token generation order on model quality via a novel two-pass language model that produces partially-filled sentence “templates” and then fills in missing tokens. … We find the most effective strategy generates function words in the first pass followed by content words in the second.”

Convolutional neural networks (CNN)

In deep learning, a convolutional neural network (CNN, or ConvNet) is a class of deep, feed-forward artificial neural networks, most commonly applied to analyzing visual imagery.

CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. They are also known as shift invariant or space invariant artificial neural networks, based on their shared-weights architecture and translation invariance characteristics.

CNNs use relatively little pre-processing compared to other image classification algorithms. This means that the network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.

CNNs have applications in image and video recognition, recommender systems and natural language processing.


A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, pooling layers, fully connected layers and normalization layers.



[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

The description of the process in neural networks as a convolution is by convention: mathematically, it is a cross-correlation rather than a convolution. This only has significance for the indices in the matrix, and thus which weights are placed at which index.

  • Convolutional layers. Convolutional layers apply a convolution operation to the input, passing the result to the next layer. The convolution emulates the response of an individual neuron to visual stimuli.

    Each convolutional neuron processes data only for its receptive field. Although fully connected feedforward neural networks can be used to learn features as well as classify data, it is not practical to apply this architecture to images. A very high number of neurons would be necessary, even in a shallow (opposite of deep) architecture, due to the very large input sizes associated with images, where each pixel is a relevant variable. For instance, a fully connected layer for a (small) image of size $\small [100 \times 100]$ has 10,000 weights for each neuron in the second layer. The convolution operation brings a solution to this problem as it reduces the number of free parameters, allowing the network to be deeper with fewer parameters. For instance, regardless of image size, tiling regions of size $\small [5 \times 5]$, each with the same shared weights, requires only 25 learnable parameters. In this way, it resolves the vanishing or exploding gradients problem in training traditional multi-layer neural networks with many layers by using backpropagation.


    [Image source. Click image to open in new window.]


    [Image source. Click image to open in new window.]


    [Image sources: here, and here. Click image to open in new window.]


    [Image source. Click image to open in new window.]

  • Pooling layers. Convolutional networks may include local or global pooling layers, which combine the outputs of neuron clusters at one layer into a single neuron in the next layer. For example, max pooling uses the maximum value from each of a cluster of neurons at the prior layer. Another example is average pooling, which uses the average value from each of a cluster of neurons at the prior layer.

  • Fully connected layers. Fully connected layers connect every neuron in one layer to every neuron in another layer. It is in principle the same as the traditional multi-layer perceptron neural network (MLP).

  • Receptive fields. In neural networks, each neuron receives input from some number of locations in the previous layer. In a fully connected layer, each neuron receives input from every element of the previous layer. In a convolutional layer, neurons receive input from only a restricted subarea of the previous layer. Typically the subarea is of a square shape (e.g., size $\small [5 \times 5]$). The input area of a neuron is called its receptive field. So, in a fully connected layer, the receptive field is the entire previous layer. In a convolutional layer, the receptive area is smaller than the entire previous layer.


    [Image source. Click image to open in new window.]

  • Weights. Each neuron in a neural network computes an output value by applying some function to the input values coming from the receptive field in the previous layer. The function that is applied to the input values is specified by a vector of weights and a bias (typically real numbers). Learning in a neural network progresses by making incremental adjustments to the biases and weights. The vector of weights and the bias are called a filter and represents some feature of the input (e.g., a particular shape). A distinguishing feature of CNNs is that many neurons share the same filter. This reduces memory footprint because a single bias and a single vector of weights is used across all receptive fields sharing that filter, rather than each receptive field having its own bias and vector of weights.

CNN - additional reading

There is a plethora of material on the web covering CNNs. Here are some good introductions:

See also:

Coreference resolution

Coreference resolution is the task of finding all expressions that refer to the same entity in a text. It is an important step for a lot of higher level NLP tasks that involve natural language understanding such as document summarization, question answering, and information extraction.


[Image source. Click image to open in new window.]


See Proofs.


In linguistics, a corpus (plural corpora), or text corpus, is a large and structured collection of texts – nowadays, usually electronically processed and stored. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language domain, and natural language processing.

Cost (loss) function

The cost function is used to define and measure the error of the model. The hypothesis is usually presented as

$\ \ \ \ \small h_\theta(x) = \theta_0 + \theta_1x$

The theta values ($\small \theta$) are the parameters.

The cost function is given by:

$\ \ \ \ \small \begin{align} J(\theta_0\theta_1) = \frac{1}{2m} \sum_{i=1}^m(h_0(x^{(i)})-y^{(i)})^2 \end{align}$  [← squared error cost function]


  • $\small h(x)$ is the prediction
  • $\small y$ is the actual value
  • $\small m$ is the number of rows in the training set

Cost is a synonym for loss: a measure of how far a model’s predictions are from its label (phrased more pessimistically, a measure of how bad the model is). To determine this value, a model must define a loss function. For example, linear regression models typically use mean squared error for a loss function, while logistic regression models use log loss (also known as cross entropy loss or logistic loss). [Cross entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.]

$\ \ \ \ \small \begin{align} MSE = \frac{1}{n} \sum_{i=1}^n(Y_i-\hat{Y}_i)^2 \end{align}$   [← mean squared error]


$\ \ \ \ \small \begin{align} MSE = \frac{1}{n} \sum_{i=1}^n(y_i-(mx_i + b))^2 \end{align}$


$\ \ \ \ \small \begin{align} MSE = \frac{1}{2m} \sum_{i=1}^m(h_0(x^{(i)})-y^{(i)})^2 \end{align}$


  • $\small n$ is the total number of observations (data points)
  • $\small \frac{1}{N} \sum_{i=1}^n$ is the mean
  • $\small y_i$ is the actual value of an observation
  • $\small mx_i+b$ is our prediction

Unfortunately for logistic regression we can’t (or at least shouldn’t) use the MSE cost function, as we did for linear regression (above). For an explanation why see Chapter 3 of Michael Neilson’s deep learning book. In short, it’s because our prediction function is non-linear, due to sigmoid transform. Squaring this prediction, as we do in MSE, results in a non-convex function with many local minima: if our cost function has many local minima, gradient descent may not find the optimal global minimum.

Instead of mean squared error (MSE), we use a cost function called cross entropy, also known as log loss. Cross entropy loss can be divided into two separate cost functions: one for $\small y=1$ and one for $\small y=0$.

$\ \ \ \ \small \begin{align} J(\theta) = \frac{1}{m} \sum_{i=1}^m Cost(h_0(x^{(i)}),y^{(i)}) \end{align}$

$\ \ \ \ \small \begin{cases} Cost(h_0(x),y) = -log(h_0(x)) & \text{if y=1} \\ Cost(h_0(x),y) = -log(1-h_0(x)) & \text{if y=0} \end{cases}$

The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for $\small y=1$ and $\small y=0$. These smooth monotonic functions (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost:

[Image from Andrew Ng’s slides on logistic regression]

The key thing to note is the cost function penalizes confident and wrong predictions more than it rewards confident and right predictions! The corollary is increasing prediction accuracy (closer to 0 or 1) has diminishing returns on reducing cost due to the logistic nature of our cost function.

Compressing the two logistic regression equations, above, into a single equation:

$\ \ \ \ \small \begin{align} J(\theta) = -\frac{1}{m} \sum_{i=1}^m [y^{(i)}log(h_0(x^{(i)})) + (1-y^{(i)})log(1-h_\theta(x^{(i)}))] \end{align}$

Multiplying by $\small y$ and $\small (1-y)$ in the above equation is a sneaky trick that let’s us use the same equation to solve for both $\small y=1$ and $\small y=0$ cases. If $\small y=0$, the first side cancels out. If $\small y=1$, the second side cancels out. In both cases we only perform the operation we need to perform.

Vectorized cost function:

$\ \ \ \ \small h = g(X\theta)$
$\ \ \ \ \small \begin{align} J(\theta) = \frac{1}{m} \cdot(-y^Tlog(h) - (1-y)^Tlog(1-h)) \end{align}$

Cost function resources:

Cross entropy loss

Cross entropy loss (or log loss) measures the performance of a classification model whose output is a probability value between 0 and 1.

See also:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See directed acyclic graph.

Deep learning

In recent years, a machine learning method called “deep learning” has gained huge attraction, obtaining astonishing results in broad applications such as pattern recognition, speech recognition, computer vision, and natural language processing. Deep learning learns data representations through the use of multi-layered artificial neural networks as opposed to task-specific algorithms. For reviews on deep learning, see:


See Proofs.

Degree matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Adjacency Matrix.

Dependency parsing

See Syntactic/Dependency parsing.

Directed acyclic graph (DAG)

DAG.png   [Source]

In mathematics and computer science, a directed acyclic graph (DAG), is a finite directed graph with no directed cycles. That is, it consists of finitely many vertices and edges, with each edge directed from one vertex to another, such that there is no way to start at any vertex $\small v$ and follow a consistently-directed sequence of edges that eventually loops back to $\small v$ again. Equivalently, a DAG is a directed graph that has a topological ordering, a sequence of the vertices such that every edge is directed from earlier to later in the sequence.

  • acyclic = “non-circular:” moving from node to node by following the edges, you will never encounter the same node for the second time

DAG can model many different kinds of information. A spreadsheet can be modeled as a DAG, with a vertex for each cell and an edge whenever the formula in one cell uses the value from another; a topological ordering of this DAG can be used to update all cell values when the spreadsheet is changed. Similarly, topological orderings of DAG can be used to order the compilation operations in a makefile. …

In computer science and mathematics, a directed acyclic graph (DAG) is a graph that is directed and without cycles connecting the other edges. This means that it is impossible to traverse the entire graph starting at one edge. The edges of the directed graph only go one way. The graph is a topological sorting, where each node is in a certain order.

In graph theory, a graph is a series of vertexes connected by edges. In a directed graph, the edges are connected so that each edge only goes one way. A directed acyclic graph means that the graph is not cyclic, or that it is impossible to start at one point in the graph and traverse the entire graph. Each edge is directed from an earlier edge to a later edge. This is also known as a topological ordering of a graph.

A spreadsheet may be represented as a directed acyclic graph, with each cell a vertex and an edge connected a cell when a formula references another cell. Other applications include scheduling, circuit design and Bayesian networks.

In mathematics – and more specifically, in graph theory – a tree is an undirected graph in which any two vertices are connected by exactly one path.


Document relevance ranking

Document relevance ranking, also known as ad hoc retrieval (Harman, 2005), is the task of ranking documents from a large collection using only the query and the text of each document only. Restated, ad hoc retrieval is a standard retrieval task in which the user specifies their information need through a query which initiates a search (executed by the information system) for documents which are likely to be relevant to the user.

This contrasts with standard information retrieval (IR) systems that rely on text based signals in conjunction with network structure and/or user feedback. Text-based ranking is particularly important when:

  • click-logs do not exist or are small, and
  • the network structure of the collection is non-existent or not informative for query-focused relevance.

Examples include various domains in digital libraries, e.g. patents or scientific literature, enterprise search, and personal search.

Source, references: Deep Relevance Ranking Using Enhanced Document-Query Interactions, which explored several new models for document relevance ranking, building upon the Deep Relevance Matching Model (DRMM) of Guo et al. (2016). …

See also: Introduction to Ad-hoc Retrieval.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Eigenvalues, Eigenvectors


Embeddings (word embeddings);
Language models

Embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP), where words or phrases from a vocabulary are mapped to vectors of real numbers (see also Word Embeddings). Conceptually it involves a mathematical embedding from a sparse, highly dimensional space with one dimension per word (a dimensionality proportional to the size of the vocabulary) into a dense, continuous vector space with a much lower dimensionality, perhaps 200 to 500 dimensions [Mikolov (2013) Efficient Estimation of Word Representations in Vector Space].

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from a vocabulary are mapped to vectors of real numbers (Sebastian Ruder provides a good overview; see also this excellent post, Introduction to Word Embeddings). Conceptually it involves a mathematical embedding from a sparse, highly dimensional space with one dimension per word (a dimensionality proportional to the size of the vocabulary) into a dense, continuous vector space with a much lower dimensionality, perhaps 200 to 500 dimensions [Mikolov et al., Efficient Estimation of Word Representations in Vector Space (Sep 2013) – the “word2vec” paper].


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Skip-Gram with Negative Sampling (SGNS)

The blog post Word2Vec Tutorial Part 2 - Negative Sampling mentions a computational challenge associated with the skip-gram model for word2vec:

“When you read Word2Vec Tutorial – The Skip-Gram Model you may have noticed something – it’s a huge neural network! In the example I gave, we had word vectors with 300 components, and a vocabulary of 10,000 words. Recall that the neural network had two weight matrices-a hidden layer and output layer. Both of these layers would have a weight matrix with 300 x 10,000 = 3 million weights each! Running gradient descent on a neural network that large is going to be slow. And to make matters worse, you need a huge amount of training data in order to tune that many weights and avoid over-fitting. millions of weights times billions of training samples means that training this model is going to be a beast.

“The authors of word2vec addressed these issues in their second paper [Distributed Representations of Words and Phrases and their Compositionality (Oct 2013)]. There are three innovations in this second paper:

  • Treating common word pairs or phrases as single “words” in their model.
  • Subsampling frequent words to decrease the number of training examples.
  • Modifying the optimization objective with a technique they called “Negative Sampling”, which causes each training sample to update only a small percentage of the model’s weights.

“It’s worth noting that subsampling frequent words and applying Negative Sampling not only reduced the compute burden of the training process, but also improved the quality of their resulting word vectors as well.”

From Section 2.2 in Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality (Oct 2013) [heavily paraphrased here]:

“Negative Sampling. An alternative to the hierarchical softmax is Noise Contrastive Estimation (NCE) … NCE posits that a good model should be able to differentiate data from noise by means of logistic regression. … While NCE can be shown to approximately maximize the log probability of the softmax, the Skip-gram model is only concerned with learning high-quality vector representations, so we are free to simplify NCE as long as the vector representations retain their quality. We define negative sampling (NEG) by the objective … Thus the task is to distinguish the target word from draws from the noise distribution using logistic regression … The main difference between negative sampling and NCE is that NCE needs both samples and the numerical probabilities of the noise distribution, while negative sampling uses only samples. …”

For additional background and discussion on word embedding see:

Word embeddings are widely used in predictive NLP modeling, particularly in deep learning applications (Word Embeddings: A Natural Language Processing Crash Course). Word embeddings enable the identification of similarities between words and phrases, on a large scale, based on their context. These word vectors can capture semantic and lexical properties of words, even allowing some relationships to be captured algebraically; e.g. (analogous to 12 - 9 + 2 = 5),

    vBerlin - vGermany + vFrance ~ vParis
    vking - vman + vwoman ~ vqueen.

Aside: addition and multiplication are commutative; subtraction and division are not commutative.

12 + 9 + 2 = 23
12 + (9 + 2) = (12 + 9) + 2 = 23   (← addition and multiplication are also associative)
12 - 9 + 2 = 5
12 - (9 + 2) = 1
(12 - 9) + 2 = 5
12 - (9 - 2) = 5
12 - 9 - 2 = 1
12 - 2 - 9 = 1

Vectors, like scalars, can be added together. However, adding vectors and scalars is very different. The direction a vector has affects the final magnitude of the addition of the two vectors (see Adding Vectors). A negative sign in front of a vector means that its direction is changed by 180º degrees (see Subtracting Vectors). The direction and magnitude completely change when adding and subtracting vectors. Thus adding and subtracting vectors does not give the same resultant vector. See also Subtraction of Vectors, and vector spaces - linear algebra.

Associative property.  Addition: a + (b + c) = (a + b) + c. Multiplication: a(bc) = (ab)c.
Commutative property.  Addition: a + b = b + a. Multiplication: ab = ba.
Distributive property.  a(b + c) = ab + ac (“multiplication distributes over addition”).
Algebraic order of operations.  Operations are executed in this order: parentheses; exponents; multiplication and division; addition and subtraction. 6 + (4 ÷ 2)² × 8 = 38.

The original work for generating word embeddings was presented by Bengio et al. in 2003 (A Neural Probabilistic Language Model (which builds on his 2001 (NIPS 2000) “feature vectors” paper A Neural Probabilistic Language Model), who trained them in a neural language model together with the model’s parameters.

Despite the assertion by Sebastian Ruder in An Overview of Word Embeddings and their Connection to Distributional Semantic Models that Bengio coined the phrase “word embeddings” in his 2003 paper, the term “embedding” does not appear in that paper. The Abstract does state the concept, however: “We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.”. The correct attribution is likely Bengio’s similarly-named 2006 paper Neural Probabilistic Language Models, which states (bottom of p. 162): “Based on our discussion in the introduction, it makes sense to force the word embedding to be shared across all nodes.” The full reference is: Bengio Y. et al. (2006) Neural Probabilistic Language Models. StudFuzz 194:137-186.

Collobert and Weston demonstrated the power of pretrained word embeddings as a highly effective tool when used in downstream tasks in their 2008 paper A Unified Architecture for Natural Language Processing, while also announcing a neural network architecture upon which many current approaches are built. It was Mikolov et al. (2013), however, who popularized word embedding through the introduction of word2vec, a toolkit enabling the training and use of pretrained embeddings (Efficient Estimation of Word Representations in Vector Space).

Likewise – I’m being rather critical here – the 2008 Collobert and Weston paper, above, mentions “embedding” [but not “word embedding”, and cites Bengio’s 2001 (NIPS 2000) paper], while Mikolov’s 2013 paper does not mention “embedding” and cites Bengio’s 2003 paper.

Word embeddings are a particularly striking example of learning a representation, i.e. representation learning (Bengio et al., Representation Learning: A Review and New PerspectivesDeep Learning, NLP, and RepresentationsAn introduction to representation learning). [See also, this Glossary, Representation learning.]

See also:


See Variational autoencoder.

Euclidean space; Non-Euclidean space

In geometry, Euclidean space encompasses the two-dimensional Euclidean plane, the three-dimensional space of Euclidean geometry, and certain other spaces. It is named after the Ancient Greek mathematician Euclid of Alexandria. The term “Euclidean” distinguishes these spaces from other types of spaces considered in modern geometry. Euclidean spaces also generalize to higher dimensions.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Euclidean space is a mathematical construct that encompasses the line, the plane, and three-dimensional space as special cases. Its elements are called vectors. Vector space (also called linear space) is a collection of objects called vectors, which may be added together and multiplied (“scaled”) by numbers, called scalars.

Vectors can be understood in various ways:

  • as arrows,
  • as quantities with magnitude and direction
  • as displacements, or
  • as points.

Euclidean vectors are an example of a vector space. They represent physical quantities such as forces: any two forces (of the same type) can be added to yield a third, and the multiplication of a force vector by a real multiplier is another force vector. In the same vein, but in a more geometric sense, vectors representing displacements in the plane or in three-dimensional space also form vector spaces.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Vectors are abstract mathematical objects with particular properties, which in some cases can be visualized as arrows. However, vectors in vector spaces do not necessarily have to be arrow-like objects. Vector spaces are the subject of linear algebra and are well characterized by their dimension, which, roughly speaking, specifies the number of independent directions in the space.

A Euclidean space has some number of real-valued dimensions and “dense” points: there is a notion of an “average” of two points, and Euclidean distance is based on the locations of points in such a space.

A non-Euclidean space is based on the properties of points, but not their “location” in a space. A way to describe the differences between these geometries is to consider two straight lines indefinitely extended in a two-dimensional plane that are both perpendicular to a third line:

  • In Euclidean geometry, the lines remain at a constant distance from each other (meaning that a line drawn perpendicular to one line at any point will intersect the other line and the length of the line segment joining the points of intersection remains constant) and are known as parallels.

  • In hyperbolic geometry, they “curve away” from each other, increasing in distance as one moves further from the points of intersection with the common perpendicular; these lines are often called ultraparallels.

  • In elliptic geometry, the lines “curve toward” each other and intersect.


[Image source. Click image to open in new window.]


[Image source (slide 14). Click image to open in new window.]

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Fourier transform

Refer here  [Graph Signal Processing: Background].


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.

Functional genomics

The standard interpretation for functional genomics is described on Wikipedia:

Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic and transcriptomic projects (such as genome sequencing projects and RNA sequencing) to describe gene (and protein) functions and interactions. Unlike structural genomics, functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression and protein-protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional “gene-by-gene” approach.

However, I broaden my use use of that term, in the sense of

How is the information contained in our genome expressed, and what are the functional consequences of that the expression of that information?

My use of the term “functional genomics” thus spans genomics, molecular genetics, biochemistry, and bioinformatics. I am particularly fascinated by how the information in our genome is encoded and manifested.

Individual variations in our genetic / epigenetic makeup determine who we are, and how we respond to both

  • extrinsic factors:
    • the environment (environmental stress: radiation, heat, famine, anoxia, toxins, pollutants, chemicals, …)
    • pathogens (bacterial, viral)


  • intrinsic factors:
    • metabolism (e.g. different functional isotypes of proteins, that affect how we process chemicals and drugs, relevant e.g. to toxicology and cancer chemotherapy …)
    • mutation (spontaneous: ageing; induced: environmental in nature but affecting the individual)

Function words

See Content words; Function words.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Gated units / Gated recurrent units (GRU)

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation (Sep 2014) by Kyunghyun Cho et al. [Yoshua Bengio] described a RNN encoder-decoder for statistical machine translation, that introduced a new type of hidden unit ($\small \mathcal{f}$ in the equation, below) that was motivated by the LSTM unit but was much simpler to compute and implement.

    Recurrent Neural Networks. A recurrent neural network (RNN) is a neural network that consists of a hidden state $\small \mathbf{h}$ and an optional output $\small \mathbf{y}$ which operates on a variable-length sequence $\small \mathbf{x} = (x_1, \ldots, x_T)$. At each time step $\small t$, the hidden state $\small \mathbf{h_{\langle t \rangle}}$ of the RNN is updated by

      $\small \mathbf{h_{\langle t \rangle}} = f (\mathbf{h_{\langle t-1 \rangle}}, x_t)$,
    where $\small \mathcal{f}$ is a non-linear activation function. $\small \mathcal{f}$ may be as simple as an element-wise logistic sigmoid function and as complex as a long short-term memory (LSTM) unit (Hochreiter and Schmidhuber, 1997).

    Hidden Unit that Adaptively Remembers and Forgets. ... we also propose a new type of hidden unit ($\small f$ in the equation, above) that has been motivated by the LSTM unit but is much simpler to compute and implement. [The LSTM unit has a memory cell and four gating units that adaptively control the information flow inside the unit, compared to only two gating units in the proposed hidden unit.]

    This figure shows the graphical depiction of the proposed hidden unit:

    [Image source. Click image to open in new window.]

    "In this formulation [see Section 2.3 in Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation for details], when the reset gate is close to 0, the hidden state is forced to ignore the previous hidden state and reset with the current input only. This effectively allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation.

    "On the other hand, the update gate controls how much information from the previous hidden state will carry over to the current hidden state. This acts similarly to the memory cell in the LSTM network and helps the RNN to remember long-term information. Furthermore, this may be considered an adaptive variant of a leaky-integration unit (Bengio et al., 2013). As each hidden unit has separate reset and update gates, each hidden unit will learn to capture dependencies over different time scales. Those units that learn to capture short-term dependencies will tend to have reset gates that are frequently active, but those that capture longer-term dependencies will have update gates that are mostly active. ..."

That gated hidden unit is the gated recurrent unit (GRU).

In a highly-cited follow-on paper, Neural Machine Translation by Jointly Learning to Align and Translate (Sep 2014; updated May 2016), Dzmitry Bahdanau, KyungHyun Cho and Yoshua Bengio employed their “gated hidden unit” (i.e. GRU) in a neural machine translation task. Their model consisted of a forward and backward pair of RNN (BiRNN) for the encoder, and a decoder that emulated searching through a source sentence while decoding a translation. From Appendix A in that paper:

“For the activation function $\small f$ of an RNN, we use the gated hidden unit recently proposed by Cho et al. (2014a). The gated hidden unit is an alternative to the conventional simple units such as an element-wise $\small \text{tanh}$. This gated unit is similar to a long short-term memory (LSTM) unit proposed earlier by Hochreiter and Schmidhuber (1997), sharing with it the ability to better model and learn long-term dependencies. This is made possible by having computation paths in the unfolded RNN for which the product of derivatives is close to 1. These paths allow gradients to flow backward easily without suffering too much from the vanishing effect. It is therefore possible to use LSTM units instead of the gated hidden unit described here, as was done in a similar context by Sutskever et al. (2014).”


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

GRUs have fewer parameters than LSTM, as they lack an output gate. A GRU has two gates (an update gate and reset gate), while a RNN has three gates (update, forget and output gates). The GRU update gate decides on how much of information from the past should be let through, while the reset gate decides on how much of information from the past should be discarded. What motivates this? Although RNNs can theoretically capture long-term dependencies, they are actually very hard to train to do this. GRUs are designed to have more persistent memory, thereby making it easier for RNNs to capture long-term dependencies. Even though computationally a GRU is more efficient than an LSTM network, due to the reduction of gates it still comes second to LSTM network in terms of performance. Therefore, GRUs are often used when we need to train faster, and we don’t have much computational power.

  • Recurrent models feature flexibility and expressivity that come at a cost. Empirical experience shows that RNNs are often more delicate to tune and more brittle to train than standard feedforward architectures. Recurrent architectures can also introduce significant computational burden compared with feedforward implementations. In response to these shortcomings, a growing line of empirical research demonstrates that replacing recurrent models by feedforward models is effective in important applications including translation, speech synthesis, and language modeling [John Miller and Moritz Hardt, When Recurrent Models Don’t Need To Be Recurrent (May 2018)].

  • In a superb companion blog post, “When Recurrent Models Don’t Need To Be Recurrent” coauthor John Miller discusses two very interesting papers that discuss the limitations of LSTMs and the effectiveness of GRUs for language modeling:

See also:

Gaussian function

In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the form:

$\large {f(x) = ae^{-{\frac {(x-b)^{2}}{2c^{2}}}} }$

for arbitrary real constants $\small a$, $\small b$ and $\small c$. It is named after the mathematician Carl Friedrich Gauss. The graph of a Gaussian is a characteristic symmetric “bell curve” shape. The parameter $\small a$ is the height of the curve’s peak, $\small b$ is the position of the center of the peak and $\small c$ (the standard deviation, sometimes called the Gaussian RMS width) controls the width of the “bell”.

Gaussian functions are often used to represent the probability density function of a normally distributed random variable with expected value $\small \mu = b$ and variance $\small \sigma^2 = c^2$. In this case, the Gaussian is of the form:

$\large g(x)=\frac{1}{\sigma {\sqrt{2\pi }}} e^{ -{\frac{1}{2}} \left ({\frac{x-\mu }{\sigma }} \right)^{2} }$.

Gaussian functions are widely used in statistics to describe the normal distributions, in signal processing to define Gaussian filters, in image processing where two-dimensional Gaussians are used for Gaussian blurs, and in mathematics to solve heat equations and diffusion equations and to define the Weierstrass transform.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Generative adversarial networks

Generative adversarial networks (GANs) are a class of artificial intelligence algorithms used in unsupervised machine learning, implemented by a system of two neural networks contesting with each other in a zero-sum game framework. They were introduced by Goodfellow et al. in 2014 (Generative Adversarial Nets).

One network generates candidates (generative) and the other evaluates them (discriminative). Typically, the generative network learns to map from a latent space to a particular data distribution of interest, while the discriminative network discriminates between instances from the true data distribution and candidates produced by the generator. The generative network’s training objective is to increase the error rate of the discriminative network (i.e., “fool” the discriminator network by producing novel synthesised instances that appear to have come from the true data distribution).

In practice, a known dataset serves as the initial training data for the discriminator. Training the discriminator involves presenting it with samples from the dataset, until it reaches some level of accuracy. Typically the generator is seeded with a randomized input that is sampled from a predefined latent space (e.g. a multivariate normal distribution). Thereafter, samples synthesized by the generator are evaluated by the discriminator. Backpropagation is applied in both networks so that the generator (for example) produces better images, while the discriminator becomes more skilled at flagging synthetic images. The generator is typically a deconvolutional neural network, and the discriminator is a convolutional neural network. [GANs can generate photographs that look authentic to human observers.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

See also

Geometric deep learning

In the last decade, deep learning (DL) approaches (e.g. convolutional neural networks and recurrent neural networks) have achieved unprecedented performance on a broad range of problems coming from a variety of different fields (e.g. computer vision and speech recognition). Despite the results obtained, research so far on DL techniques has focused mainly on data defined on Euclidean domains (i.e. grids). However, in fields such as biology, physics, network science, recommender systems, computer graphics etc. one may have to deal with data defined on non-Euclidean domains (i.e. graphs and manifolds). Until recently, the adoption of DL in these fields has been lagging, primarily as the non-Euclidean nature of data makes the definition of basic operations (such as convolution) rather elusive. Geometric deep learning deals with the extension of DL techniques to graph/manifold structured data.

The Geometric Deep Learning website represents a collection of materials in the field of geometric deep learning. They collect workshops, tutorials, publications and code, that several different researchers have produced over the last several years. Their goal is to provide a general picture of this new and emerging field, which is rapidly developing in the scientific community thanks to its broad applicability.

See also:

GNU Octave

See this page.

Google AI

Google AI, formerly Google Research. Google collects AI-based services across the company into “ is a collection of products and teams across Alphabet with a focus on AI.”

Gradient; Gradient descent

See Backpropagation.

Graph adjacency matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Degree Matrix.

Graph databases

Data often exists as relationships between different objects. While relational databases (RDBMS) store highly structured data, they do not store the relationships between the data. Unlike other databases, graph databases store relationships and connections as first-class entities, and excel at managing highly connected data and complex queries.

... continued here ...

Graph degree matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Adjacency Matrix.

Graph Laplacian (Laplacian matrix)

“The Laplace operator in its various manifestations is the most beautiful and central object in all of mathematics. Probability theory, mathematical physics, Fourier analysis, partial differential equations, the theory of Lie groups, and differential geometry all revolve around this sun, and its light even penetrates such obscure regions as number theory and algebraic geometry.”
[Nelson E (1967) Tensor Analysis, p. 100]

Refer here  [Graph Signal Processing: Background].

Graph signal processing

In applications such as social, energy, transportation, sensor, and neuronal networks, high-dimensional data naturally reside on the vertices of weighted graphs. The emerging field of signal processing on graphs merges algebraic and spectral graph theoretic concepts with computational harmonic analysis to process such signals on graphs. The weight associated with each edge in the graph often represents the similarity between the two vertices it connects. The connectivities and edge weights are either dictated by the physics of the problem at hand or inferred from the data. For instance, the edge weight may be inversely proportional to the physical distance between nodes in the network. The data on these graphs can be visualized as a finite collection of samples, with one sample at each vertex in the graph. Collectively, we refer to these samples as a graph signal.

A signal or function $\small f : \mathcal{V} \to \mathbf{R}$ defined on the vertices of the graph may be represented as a vector $\small f \in \mathbf{R}^N$, where the $\small i^{th}$ component of the vector $\small f$ represents the function value at the $\small i^{th}$ vertex in $\small \mathcal{V}$. The following figure is an example of a graph signal.


[Image source. Click image to open in new window.]

... continued here ...

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


In semantics, a holonym is the opposite of a meronym.


  • “finger” is a part of a “hand”, thus “hand” is the holonym of “finger”
  • “family” is a holonym of “child”, “mother” or “father”.
  • “forest” is a holonym of “tree” (forests contain trees), while “tree” is a holonym of “bark”
  • “body” is a holonym of “arm”, which is a holonym of “elbow”

See also Linguistics | Semantics.


In semantics, a homonym is a word having a different meaning, but the same form as another word.


  • “bear” (verb) vs. “bear” (noun, animal)
  • “bank” (institution) vs. “bank” (river bank)

It is possible to distinguish homophones (same sound) and homographs (same spelling)

Homonymy and synonymy have complementary notions:

  • homonyms: the same form, different meanings
  • synonyms: the same meaning, different forms

Homonyms are words that have the same form but different meanings. There are two major types of homonyms, based upon whether the meanings of the word are historically connected or result from coincidence.

  • Coincidental homonyms are the result of such historical accidents as phonetic convergence of two originally different forms or the borrowing of a new word which happens to be identical to an old word. There is usually no natural link between the two meanings: the “bill” of a bird vs the “bill” one has to pay; or the “bark” of a dog vs the “bark” of a tree.

  • The second type of homonym, the polysemous homonym, results when multiple meanings develop historically from the same word. The process by which a word acquires new meanings is called polysemy. Unlike coincidental homonyms, polysemous homonyms usually preserve some perceptible semantic link marking the development of one meaning out of the other, as in the “leg” of a chair and the “leg” of a person; or the face of a person vs. the face of a clock.

Sometimes it is impossible to tell whether two words of identical form are true homonyms (historically unrelated) or polysemous homonyms (historically related), such as ice “skate” vs. “skate”, the fish: “skate” (fish, from Old English “skata”) vs. ice “skate” (from Dutch “schaat”); “dear/deer” are historically related (“darling”, vs. German “Tier”: animal).

Since polysemy is so difficult to separate from true homonymy, dictionaries usually order entries according to (i) the first recorded appearance of word, or (ii) the frequency of meaning use. This is a problem for lexicographers, the people who study words and write dictionaries.

There are universal tendencies in the directionality of polysemy. Studies of polysemy in a wide variety of languages generally find the following directions in meaning shift:

  • body part to part of object (“hands”, “face”, “lip”, “elbow”, “belly”, “vein of gold” or “vein of a leaf”); for example, “appendix.

  • animal to human, for personality traits (“shrew”, “bear”, “wolf”, “fox”, “quiet as a fish”); for example, “My dog is a real Einstein”.

  • space to time (“long”, “short”, “plural”)

  • spatial to sound (“melt”, “rush”)

  • sound to color (“loud”, “clashing”, “mellow”)

  • Physical, visible attribute to emotional or mental, invisible quality (“crushed”, “big head”, “green with envy”, “yellow coward”, “sharp/dull”, “spark”)

Directionality in polysemy seems to be logically motivated: concrete meanings give rise to abstract ones (“sharp knife” → “sharp mind”); mundane gives rise to the technical (“wood chip” → “computer chip”).

See also Linguistics | Semantics.

Horn clauses

See also:

Relevant logic notation:

  • $\small \exists$ : there exists at least one
  • $\small \forall$ : for all
  • $\small \neg$ : not (logical not)
  • $\small \lor$ : or (logical or)
  • $\small \land$ : and (logical and)
  • $\small \sim$ : contraposition (an inference that says that a conditional statement is logically equivalent to its contrapositive. The contrapositive of $\small P \rightarrow Q$ is thus $\small \neg Q \rightarrow \neg P$). In symbolic logic, the tilde (~) is used to indicate the negation of a statement, so $\small \sim p \equiv \lnot p$.
  • $\small \rightarrow$ : implies
  • $\small \leftarrow$ : is implied by

  • In Prolog, the symbol $\small \text{ :- }$, sometimes called a turnstile , is pronounced “ if “.

In mathematical logic and logic programming, a Horn clause is a logical formula of a particular rule-like form which gives it useful properties for use in logic programming, formal specification, and model theory. Horn clauses are named for the logician Alfred Horn, who first pointed out their significance in 1951.

A Horn clause is a clause (a disjunction of literals) with at most one positive, i.e. unnegated, literal. Conversely, a disjunction of literals with at most one negated literal is called a dual-Horn clause. A Horn clause with exactly one positive literal is a definite clause; a definite clause with no negative literals is sometimes called a fact; and a Horn clause without a positive literal is sometimes called a goal clause (note that the empty clause consisting of no literals is a goal clause). …

The following is an example of a (definite) Horn clause:

    $\small \neg p \lor \neg q \lor \ldots \lor \neg t \lor u$

Such a formula can be rewritten in the following form, which is more common in logic programming and similar fields:

    $\small (p \land q \land \ldots \land t) \rightarrow u$

Horn clauses play a basic role in constructive logic and computational logic. They are important in automated theorem proving by first-order resolution, because the resolvent of two Horn clauses is itself a Horn clause, and the resolvent of a goal clause and a definite clause is a goal clause. These properties of Horn clauses can lead to greater efficiencies in proving a theorem (represented as the negation of a goal clause). …

Horn clauses are also the basis of logic programming, where it is common to write definite clauses in the form of an implication:

    $\small (p \land q \land \ldots \land t) \rightarrow u$

In fact, the resolution of a goal clause with a definite clause to produce a new goal clause is the basis of the SLD resolution inference rule, used to implement logic programming in the programming language Prolog.

In logic programming a definite clause behaves as a goal-reduction procedure. For example, the Horn clause written above behaves as the procedure:

    to show $\small u$, show $\small p$ and show $\small q$ and ... and show $\small t$ .

To emphasize this reverse use of the clause, it is often written in the reverse form:

    $\small u \leftarrow (p \land q \land \ldots \land t)$

In Prolog this is written as:

    $\small u \text{ :- } p, q, \dots, t$.

In logic programming and datalog, computation and query evaluation are performed by representing the negation of a problem to be solved as a goal clause. For example, the problem of solving the existentially quantified conjunction of positive literals:

    $\small \exists \mathit{X} (p \land q \land \ldots \land t)$

is represented by negating the problem (denying that it has a solution), and representing it in the logically equivalent form of a goal clause:

    $\small \forall \mathit{X} (false \leftarrow p \land q \land \ldots \land t)$

In Prolog this is written as:

    $\small \text{ :- } p, q, \ldots, t$.

Solving the problem amounts to deriving a contradiction, which is represented by the empty clause (or “false”). The solution of the problem is a substitution of terms for the variables in the goal clause, which can be extracted from the proof of contradiction. Used in this way, goal clauses are similar to conjunctive queries in relational databases, and Horn clause logic is equivalent in computational power to a universal Turing machine.

The Prolog notation is actually ambiguous, and the term “goal clause” is sometimes also used ambiguously. The variables in a goal clause can be read as universally or existentially quantified, and deriving “false” can be interpreted either as deriving a contradiction or as deriving a successful solution of the problem to be solved. …

Notation: The clause $\small \sim p \lor \sim q \lor r$ is written in Prolog in the form $\small r \text{ :- } p, q.$  .

See also:

Hyperbolic embeddings

Hyperbolic embeddings seek to embed structured, discrete objects such as knowledge graphs into a continuous representation that can be used with modern machine learning methods. Hyperbolic embeddings can preserve graph distances and complex relationships in very few dimensions, particularly for hierarchical graphs. Many graphs, such as complex networks including the Internet and social networks, are known to have hyperbolic structure and thus befit hyperbolic embeddings (see Representation Tradeoffs for Hyperbolic Embeddings and references therein). For both natural language processing and graph based tasks, embeddings have been learned in high-dimensional Euclidean spaces. However, recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but negatively curved, hyperbolic space.

For additional background, see these dated but useful resources:

A hyperbolic space has the property that power-law degree distributions, strong clustering and hierarchical community structure emerge naturally when random graphs are embedded in hyperbolic space. It is therefore logical to exploit the structure of the hyperbolic space for useful embeddings of complex network. Neural Embeddings of Graphs in Hyperbolic Space (May 2017) proposed learning neural embeddings of graphs in hyperbolic space, providing experimental evidence that embedding graphs in their natural geometry significantly improved performance on downstream tasks for several real-world public datasets.


[Image source. Click image to open in new window.]

  • Hybed: Hyperbolic Neural Graph Embedding  [Semantic ScholarOpenReview] again posits that recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but a negatively curved hyperbolic space. They proposed learning neural embeddings of graphs in hyperbolic space, providing experimental evidence that hyperbolic embeddings significantly outperformed Euclidean embeddings on vertex classification tasks for several real-world public datasets.

Many well-established recommender systems are based on representation learning in Euclidean space. In these models, matching functions such as the Euclidean distance or inner product are typically used for computing similarity scores between user and item embeddings. In an extension of representation learning from the Euclidean to the hyperbolic domain, Hyperbolic Recommender Systems investigated the notion of learning user and item representations in Hyperbolic space, arguing that hyperbolic space is more suitable for learning user-item embeddings in the recommendation domain. Unlike Euclidean spaces, hyperbolic spaces are intrinsically equipped to handle hierarchical structure, encouraged by its property of exponentially increasing distances away from origin. Their proposed system (HyperBPR) outperformed its Euclidean counterparts, achieving state of the art performance on multiple benchmark datasets.

Representation Tradeoffs for Hyperbolic Embeddings provides an excellent summary as well as entity embeddings with 64-bit precision] found that – given a tree – they could give a combinatorial construction that embedded the tree in hyperbolic space with arbitrarily low distortion without using optimization. On WordNet, their combinatorial embedding obtained a mean-average-precision of 0.989 with only two dimensions, while Nickel et al.’s recent construction (Poincaré Embeddings for Learning Hierarchical Representation) obtained 0.87 using 200 dimensions.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Maximilian Nickel and Douwe Kiela at Facebook AI Research recently introduced a new approach to learning hierarchical representations of symbolic data by embedding them into hyperbolic space (more precisely, into an n-dimensional Poincaré ball) (Poincaré Embeddings for Learning Hierarchical Representations;  [project]), allowing them to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. The Poincaré embeddings significantly outperformed Euclidean embeddings on data with latent hierarchies in terms of representation capacity and generalization ability.

Experiments showed that Poincaré embeddings provided important advantages over Euclidean embeddings on hierarchical data:

  1. Poincaré embeddings enabled very parsimonious representations that enabled the learning of high-quality embeddings of large-scale taxonomies.
  2. Excellent link prediction results indicated that hyperbolic geometry could introduce an important structural bias for the embedding of complex symbolic data.
  3. State of the art results for predicting lexical entailment suggested that the hierarchy in the embedding space corresponded well to the underlying semantics of the data.


[Image source. Click image to open in new window.]

Embeddings of tree-like graphs in hyperbolic space were recently shown to surpass their Euclidean counterparts in performance by a large margin. Inspired by those results, Skip-gram Word Embeddings in Hyperbolic Space presented an algorithm for learning word embeddings in hyperbolic space from free text. An objective function based on the hyperbolic distance was derived and included in the skip-gram architecture from word2vec. The results demonstrated the potential of hyperbolic word embeddings, particularly in low dimensions, although without clear superiority over their Euclidean counterparts.

See also Embeddings (word embeddings)


In mathematics, a hypergraph is a generalization of a graph in which an edge can join any number of vertices. Formally, a hypergraph is a pair where is a set of elements called nodes or vertices, and is a set of non-empty subsets of called hyperedges or edges.


[Click image to open in new window]

Research into hypergraph development is technically challenging due to a lack of programmatic solutions and published work in this area. Nevertheless, for relevant background on hypergraphs consult the following:

Hypernym (superordinate word)

In semantics, superordinate words (hypernyms) are words whose meaning includes meaning of another word. If $\small X$ is superordinate to $\small Y$ then $\small Y$ is subordinate to $\small X$.


  • “color” is a hypernym of “red”, “black”, “green”, …
  • “animal” is a hypernym of “horse”, “tiger”, …
  • “animal” is a hypernym of “mammal” (mammals are animals), which is a hypernym of “mammal” is a hypernym of “dog” (dogs are mammals)
  • “plant” is a hypernym of “flower”, which is a hypernym of “tulip”
  • “red” is a hypernym of “scarlet”, “vermilion”, “carmine” and “crimson”

Hypernyms vs. Hyponyms

Hypernymy is a relation between words (or sentences) where the semantics of one word (the hyponym) are contained within that of another word (the hypernym). A simple form of this relation is the is-a relation; e.g., cat is an animal. Another example is a hyponym in a type-of relationship with its hypernym; for example, “pigeon, crow, eagle and seagull” are all hyponyms of “bird” (their hypernym); “bird” in turn is a hyponym of “animal;” “spoon” is a hyponym of “cutlery” (its hypernym); etc.

See also Linguistics | Semantics.


See Parameters.

Hyponym (subordinate word)

In semantics, subordinate words (hyponyms) are words whose meaning is included in the meaning of another word.


  • “red” is a hyponym of “color” (red is subordinate to color)
  • “pony” is a hyponym of “horse”, which is a hyponym of “animal”
  • “tulip” is a hyponym of “flower”; …
  • “dog” is a hyponym of “mammal” (dogs are among the various animals which are mammals); “mammal” is a hyponym of “animal”
  • “tulip” is a hyponym of “flower”, which is a hyponym of “plant”
  • “scarlet”, “vermilion”, “carmine” and “crimson” are hyponyms of “red”

This is not a whole/part relationship, so “page” is not a subordinate term of “book”.

Hypernyms vs. Hyponyms

Hypernymy is a relation between words (or sentences) where the semantics of one word (the hyponym) are contained within that of another word (the hypernym). A simple form of this relation is the is-a relation; e.g., cat is an animal. Another example is a hyponym in a type-of relationship with its hypernym; for example, “pigeon, crow, eagle and seagull” are all hyponyms of “bird” (their hypernym); “bird” in turn is a hyponym of “animal;” “spoon” is a hyponym of “cutlery” (its hypernym); etc.

See also Linguistics | Semantics.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See Proofs.

Identity Matrix

See here  [Graph Signal Processing: Background]

Inductive logic programming

See also:

Inductive logic programming is a subfield of symbolic artificial intelligence which uses logic programming as a uniform representation for examples, background knowledge and hypotheses. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an inductive logic programming system will derive a hypothesised logic program which entails all the positive and none of the negative examples.

    Schema ($\small \Rightarrow$ : implies):

    positive examples + negative examples + background knowledge $\small \Rightarrow$ hypothesis .

Inductive logic programming is particularly useful in bioinformatics and natural language processing. Gordon Plotkin and Ehud Shapiro laid the initial theoretical foundation for inductive machine learning in a logical setting. Shapiro built their first implementation (Model Inference System) in 1981: a Prolog program that inductively inferred logic programs from positive and negative examples. The term “Inductive Logic Programming” was first introduced in a paper by Stephen Muggleton in 1991. Muggleton also founded the annual international conference on Inductive Logic Programming, introduced the theoretical ideas of Predicate Invention, Inverse resolution, and Inverse entailment. …

In inductive logic programming the learned model is in the form of logic programming rules (Horn clauses) that are comprehensible to humans. It allows the background knowledge to be incrementally extended without requiring the entire model to be re-learned. Meanwhile, the comprehensibility of symbolic rules makes it easier for users to understand and verify induced models and even edit them. The inductive logic programming learning problem can be regarded as a search problem for a set of clauses that deduce the training examples. The search is performed either top down or bottom-up. A bottom-up approach builds most-specific clauses from the training examples and searches the hypothesis space by using generalization. This approach is not applicable to large-scale datasets, nor it can incorporate Negation-As-Failure into the hypotheses. In contrast, top-down approach starts with the most general clauses and then specializes them. A top-down algorithm guided by heuristics is better suited for large-scale and/or noisy datasets. [Source.]

Note that inductive logic programming differs from integer linear programming (unfortunately both are often abbreviated “ILP”).

Inductive logic programming was used in Induction of Non-Monotonic Logic Programs to Explain Boosted Tree Models Using LIME (Aug 2018).

See also:

Information extraction

Information extraction (IE) is the process of extracting structured information (e.g. events, binary relations, etc.) from text and data, so that it can be used for another purpose, such as an information retrieval system (e.g. a search engine). IE creates structured information from unstructured text. See also: Relation extraction.

IE vs. IR
      [Image source]

Information overload

The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time (Content-Driven, Unsupervised Clustering of News Articles Through Multiscale Graph Partitioning). Even within the more focused health, technical and scientific domains we face a continuous onslaught of new information and knowledge from which we must filter out the non-relevant information, seeking to retain (or hoping to find again) knowledge that is relevant to us.

Information overload is characterized by the difficulty of understanding an issue and effectively making decisions when one has too much information about that issue. In our infocentric world, we have an increasing dependency on relevant, accurate information that is buried in the avalanche of continuously generated information.

Coincident with information overload is the phenomenon of attention overload: we have limited attention and we’re not always sure where to direct it. It can be difficult to limit how much information we consume when there’s always something new waiting for a click; before we know it, an abundance of messy and complex information has infiltrated our minds. If our processing strategies don’t keep pace, our online explorations create strained confusion instead of informed clarity. Hence, More information is not necessarily better.

When Choice is Demotivating: Can One Desire Too Much of a Good Thing? [pdf] discussed findings from 3 experimental studies that starkly challenged the implicit assumption that having more choices is more intrinsically motivating than having fewer choices. Those experiments, which were conducted in both field and laboratory settings, showed that people are more likely to make purchases or undertake optional classroom essay assignments when offered a limited array of 6 choices, rather than a more extensive array of 24 or 30 choices. Moreover, participants reported greater subsequent satisfaction with their selections and wrote better essays when their original set of options had been limited.

Information overload is a long-standing issue: in her 2010 book Too Much To Know: Managing Scholarly Information before the Modern Age, Harvard Department of History Professor Ann Blair argued that the early modern methods of selecting, summarizing, sorting, and storing text (the 4S’s) are at the root of the techniques we use today in information management.

For additional discussion, see the blog post Information Overload, Fake News, and Invisible Gorillas.

Information retrieval

Information retrieval (IR) employs highly scalable statistics-based techniques to index and search large volumes of text efficiently. Information retrieval is based on a query – you specify what information you need, and it is returned in human understandable form.

See also Document relevance ranking.

Integer linear programming

An integer linear program has the same form as a linear program; the only difference is that the feasible solutions are restricted to be integer vectors, i.e. $\small x \in \mathbb{Z}^n$ instead of $\small x \in \mathbb{R}^n$.

An integer programming problem is a mathematical optimization or feasibility program in which some or all of the variables are restricted to be integers. In many settings the term refers to integer linear programming (ILP), in which the objective function and the constraints (other than the integer constraints) are linear.

Integer programming is NP-complete. In particular, the special case of 0-1 integer linear programming, in which unknowns are binary, and only the restrictions must be satisfied, is one of Karp’s 21 NP-complete problems.

Note that integer linear programming differs from inductive logic programming (both, unfortunately, are often abbreviated “ILP”). Inductive logic programming is a subfield of machine learning that learns models in the form of logic programming rules (Horn clauses) that are comprehensible to humans. Inductive logic programming was used in Induction of Non-Monotonic Logic Programs to Explain Boosted Tree Models Using LIME.

Integer linear programming was used in the following papers:

  • An interesting graph-based approach to question answering is the use of subgraphs. Work from the Allen Institute for AI, Answering Complex Questions Using Open Information Extraction (Apr 2017), used a recently proposed support graph optimization framework for QA to develop a new inference model for Open IE. Their model, TupleINF, significantly outperformed a state of the art structured solver on complex questions of varying difficulty, while also removing the reliance on manually curated knowledge. TupleINF used integer linear programming to solve QA over complex structures that require multi-fact reasoning. …

    • TupleINF was inspired by their earlier TableILP work, Question Answering via Integer Programming over Semi-Structured Knowledge (Apr 2016), which proposed a structured inference system for this task formulated as an integer linear program that answers natural language questions using a semi-structured knowledge base derived from text, including questions requiring multi-step inference and a combination of multiple facts.

    • The approach of using subgraphs and integer linear programming in QA was also described on slides 15-25 in Reasoning-driven Question Answering by the University of Pennsylvania Ph.D. candidate Daniel Khashabi, at a Stanford NLP seminar (June 2018). The work summarized in that presentation is described in the UPenn/Allen Institute for Artificial Intelligence paper Question Answering as Global Reasoning over Semantic Abstractions (2018), which introduced their SemanticILP system. They claimed SemanticILP to be the first system that reasons over a wide range of semantic abstractions of the text, which are derived using off-the-shelf, general-purpose, pretrained natural language modules such as semantic role labelers, coreference resolvers, and dependency parsers. Representing multiple abstractions as a family of graphs, they translated QA into a search for an optimal subgraph that satisfied certain global and local properties. SemanticILP outperformed baseline models BiDAF, an information retrieval solver, and TupleINF on QA tasks on 4th and 8th grade science exams and a biology exam (Tables 4 and 5 in their paper). Thus on two domains with very different characteristics, the graph-based SemanticILP outperformed traditional and neural network models.

  • Knowledge Graph Embedding with Iterative Guidance from Soft Rules (Nov 2017) [code] presented a novel approach for knowledge graph embedding combined with guidance from soft logic rules, called Rule-Guided Embedding (RUGE). RUGE built on previous work by these authors that provided additional information:

  • Unsupervised Pseudo-Labeling for Extractive Summarization on Electronic Health Records (Nov 2018)

See also:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Kirchhoff’s Theorem

In the mathematical field of graph theory, Kirchhoff’s theorem or Kirchhoff’s matrix tree theorem (named after Gustav Kirchhoff) is a theorem about the number of spanning trees in a graph, showing that this number can be computed in polynomial time as the determinant of a matrix derived from the graph. It is a generalization of Cayley’s formula which provides the number of spanning trees in a complete graph.


[Image source. Click image to open in new window.]

Kirchhoff’s theorem relies on the notion of the Laplacian matrix of a graph that is equal to the difference between the graph’s degree matrix (a diagonal matrix with vertex degrees on the diagonals) and its adjacency matrix (a $\small (0,1)$-matrix with 1’s at places corresponding to entries where the vertices are adjacent and 0’s otherwise).

For a given connected graph $\small \mathcal{G}$ with $\small n$ labeled vertices, let $\small \lambda_1, \lambda_2, ldots, \lambda_{n-1}$ be the non-zero eigenvalues of its Laplacian matrix. Then the number of spanning trees of $\small \mathcal{G}$ is

    $\small \begin{align} t(\mathcal{G) = \frac{1}{n}} \lambda_1, \lambda_2, \cdots, \lambda_{n-1} \end{align}$.

Equivalently the number of spanning trees is equal to any cofactor of the Laplacian matrix of $\small \mathcal{G}$.

Kmer (K-mer)

The term $\small k$-mer typically refers to all the possible substrings of length $\small k$ that are contained in a string. In computational genomics, $\small k$-mers refer to all the possible subsequences (of length $\small k$) from a read obtained through DNA sequencing.

The amount of $\small k$-mers possible given a string of length $\small L$ is $\small L-k+1$, while the number of possible $\small k$-mers given $\small n$ possibilities (4 in the case of DNA, e.g. ACTG) is $\small n^{k}$.

$\small K$-mers are typically used during sequence assembly, but can also be used in sequence alignment. In the context of the human genome, $\small k$-mers of various lengths have been used to explain variability in mutation rates.


Here are some examples showing the possible $\small k$-mers (given a specified $\small k$ value) from DNA sequences:



Typically we extract $\small k$-mers from genomic assemblies or read data sets by running a $\small k$-length window across all of the reads and sequences – e.g. given a sequence of length 16, you could extract 11 $\small k$-mers of length six from it like so:


becomes the following set of 6-mers:


$\small k$-mers are most useful when they’re long, because then they’re specific. That is, if you have a 31-mer taken from a human genome, it’s pretty unlikely that another genome has that exact 31-mer in it. (You can calculate the probability if you assume genomes are random: there are 431 possible 31-mers, and 431 = 4,611,686,018,427,387,904.) [Source.]

For a broader coverage of this topic, see $\small n$-gram.

Knowledge discovery in databases

Knowledge discovery in databases (KDD) is the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data. KDD (“data mining”) is the process of discovering useful knowledge from a collection of data. This widely used data mining technique is a process that includes data preparation and selection, data cleansing, incorporating prior knowledge on data sets and interpreting accurate solutions from the observed results.

Knowledge graphs

Knowledge graphs (relational property graphs) model information in the form of entities (nodes/vertices) and the relationships between them (edges). For a formal definition see

and for a general overview see

A noteworthy feature of knowledge graphs is the excellent performance of traversal-type queries across entities of diverse types within them. Such queries can be challenging to realize in relational databases because of the cost of computing statements (joins) across different tables. See:

Cypher queries in Neo4j are easier to write and understand than complex SQL queries in relational database management systems (RDBMS), especially those involving multiple join statements. For example, see pp. 22-23 in The Definitive Guide to Graph Databases for the RDBMS Developer).

Knowledge graphs (KG) provide semantically structured information that is interpretable by computers – an important property to building more intelligent machines, as well as an important step to transforming text-based knowledge stores and search engines into semantically-aware question answering services (A Review of Relational Machine Learning for Knowledge Graphs). Knowledge graphs are applicable to a broad spectrum of problems ranging from biochemistry to recommender systems; for example, question answering, structured search, exploratory search, and digital assistants.

Knowledge graph structure (embeddings); machine learning

Graphs represent a recent and exciting extension of machine learning; for a good review, see The Knowledge Graph as the Default Data Model for Learning on Heterogeneous Knowledge).

A Review of Relational Machine Learning for Knowledge Graphs also provides an excellent introduction to machine learning and knowledge graphs, with a focus on statistical relational learning.

In a radically different approach from the probabilistic methods employed in statistical relational learning, knowledge graph embedding (KGE) aims to represent entities and relations in a knowledge graph as points or vectors in a continuous vector space (Euclidean space) – simplifying manipulation, while preserving the inherent structure of the KG.

KG are generally represented by symbolic triples – (source, relation, target), also described as (subject, predicate, object) or (head, relation, tail) – whereas KGE methods attempt to represent those symbols (nodes and edges) with their corresponding source, relation, and target vectors – amenable to mathematical processing.

The underlying concept of KGE is that in a knowledge graph each entity can be regarded as a point in a continuous vector space, while relations can be modelled as translation vectors (Expeditious Generation of Knowledge Graph Embeddings). The generated vector representations can be used by machine learning algorithms to accomplish a specific tasks.

Restated in On Embeddings as Alternative Paradigm for Relational Learning, KGE aim to represent instances and their relationships as vectors and/or matrices in the Euclidean space. The hope is that the geometry of the embedding space would resemble the structure of the data (for example) by keeping the instances participating in the same relationships close in the Euclidean space. This in turn allows one to apply standard propositional (logic-based) machine learning tools and retain their scalability, while at the same time preserving certain properties of structured relational data.

KGE has proven to be very effective for the tasks of link prediction and knowledge graph completion, where the goal is to identify missing links in the existing knowledge graph (A Review of Relational Machine Learning for Knowledge Graphs  |  An Overview of Embedding Models of Entities and Relationships for Knowledge Base Completion).

KGE approaches are the current (2018) dominating methodology for knowledge graph link prediction (On Link Prediction in Knowledge Bases: Max-K Criterion and Prediction Protocols). This methodology, fundamentally based on distributed representations, has not only proved to be effective for KG link prediction but has also helped to improve our understanding and engineering of knowledge representation. A strong advantage of KGE methods is their scalability, at the expense of their black-box nature and limited reasoning capabilities; KGE has proven to be scalable to very large knowledge graphs.

Knowledge graphs: additional references

For recent reviews on knowledge graph embedding (KGE), see

Examples of the use of KG in the biomedical domain include:

See also Graph databases.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Language models

See Embeddings (word embeddings).


See Graph Laplacian (Laplacian matrix)

Latent Dirichlet allocation (LDA)

In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. LDA is an example of a topic model.

LDA was first presented as a graphical model for topic discovery by Blei, Ng and Jordan in 2003. Essentially the same model was proposed independently by Pritchard, Stephens and Donnelly in the study of population genetics in 2000. The corresponding papers had 1,998 and 2,016 citations, respectively, by August 2017. Topics


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

In LDA, each document may be viewed as a mixture of various topics where each document is considered to have a set of topics that are assigned to it via LDA. This is identical to probabilistic latent semantic analysis (pLSA), except that in LDA the topic distribution is assumed to have a sparse Dirichlet prior. The sparse Dirichlet priors encode the intuition that documents cover only a small set of topics and that topics use only a small set of words frequently. In practice, this results in a better disambiguation of words and a more precise assignment of documents to topics. LDA is a generalization of the pLSA model, which is equivalent to LDA under a uniform Dirichlet prior distribution.

For example, an LDA model might have topics that can be classified as “CAT_related” and “DOG_related.” A topic has probabilities of generating various words, such as milk, meow, and kitten, which can be classified and interpreted by the viewer as “CAT_related”. Naturally, the word cat itself will have high probability given this topic. The “DOG_related” topic likewise has probabilities of generating each word: puppy, bark, and bone might have high probability. Words without special relevance, such as “the” (see function words), will have roughly even probability between classes (or can be placed into a separate category). A topic is neither semantically nor epistemologically strongly defined. It is identified on the basis of automatic detection of the likelihood of term co-occurrence. A lexical word may occur in several topics with a different probability, however, with a different typical set of neighboring words in each topic.

Each document is assumed to be characterized by a particular set of topics. This is similar to the standard bag of words model assumption, and makes the individual words exchangeable.


[Image source. Click image to open in new window.]

  • Each topic is a distribution over words
  • Each document is a mixture of corpus-wide topics
  • Each word is drawn from one of those topics


[Image source. Click image to open in new window.]


See Proofs.


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.


Linguistics is the scientific study of language and its structure, including the study of morphology, syntax, phonetics, and semantics. Specific branches of linguistics include sociolinguistics, dialectology, psycholinguistics, computational linguistics, historical-comparative linguistics, and applied linguistics, etc.

See also:

Logic notation

Logic notation  (see also):

  • $\small \exists$ : there exists at least one
  • $\small \forall$ : for all
  • $\small \neg$ : not (negation: logical not)
  • $\small \lor$ : or (disjunction: logical or)
  • $\small \land$ : and (conjunction: logical and)
  • $\small \sim$ : contraposition (an inference that says that a conditional statement is logically equivalent to its contrapositive. The contrapositive of $\small P \rightarrow Q$ is thus $\small \neg Q \rightarrow \neg P$). In symbolic logic, the tilde (~) is used to indicate the negation of a statement, so $\small \sim p \equiv \lnot p$.
  • $\small \rightarrow$ : implies (implication)
  • $\small \Rightarrow$ : implies
  • $\small \leftarrow$ : is implied by
  • $\small \longleftrightarrow$ : biconditional

  • In Prolog, the symbol $\small \text{ :- }$, sometimes called a turnstile , is pronounced “ if “.


[Image source. Click image to open in new window.]

$\small p \rightarrow q$ ($\small \text{p implies q}$)


  • A necessary condition of a statement must be satisfied for the statement to be true.
    Formally, a statement $\small P\ $ is a necessary condition of a statement $\small Q$ if $\small Q$ implies $\small P$ ($\small Q \Rightarrow P$).

  • A sufficient condition is one that, if satisfied, assures the statement’s truth.
    Formally, a statement $\small P\ $ is a sufficient condition of a statement $\small Q$ if $\small P$ implies $\small Q$ ($\small P \Rightarrow Q$).

This is a simple matter answered by the truth table of $\small \Rightarrow$:

    $\small \begin{array}{ c | c || c | } P & Q & P\Rightarrow Q \\ \hline \text T & \text T & \text T \\ \text T & \text F & \text F \\ \text F & \text T & \text T \\ \text F & \text F & \text T \end{array}$

This shows that when $\small P$ is false, the implication is true. Note that this is the definition of the table, there is no need to prove it. This is how $\small \Rightarrow$ is defined to work.

For example:

    $\small \text{If it is raining then there are clouds in the sky.}$

In this case $\small P = \text{It is raining}$, and $\small Q = \text{There are clouds in the sky}$.

Note that $\small P$ is sufficient to conclude $\small Q$, and $\small Q$ is necessary for $\small P$.

    $\small \text{There is no rain without clouds, and if there are no clouds then there cannot be any rain.}$

However, note that $\small P$ is not necessary for $\small Q$. There could be clouds without any rain.

In this answer, $\small Q$ is necessary for $\small P$, and $\small P$ is sufficient for $\small Q$.

We can reformulate the example, above, as shown in the following truth tables.

Truth tables

Truth table for the conjunction ($\small \land$;  $\small \text{“and”}$) of two propositions:

    $\small \begin{array}{ c | c || c | } p & q & p \land q \\ \text{(e.g. rain)} & \text{(e.g. clouds)} & \text{(e.g. rain and clouds)} \\ \hline \text T & \text T & \text T \\ \text T & \text F & \text F \\ \text F & \text T & \text F \\ \text F & \text F & \text F \end{array}$

Truth table for the disjunction ($\small \lor$;  $\small \text{“or”}$) of two propositions:

    $\small \begin{array}{ c | c || c | } p & q & p \lor q \\ \text{(e.g. rain)} & \text{(e.g. clouds)} & \text{(e.g. rain or clouds)} \\ \hline \text T & \text T & \text T \\ \text T & \text F & \text T \\ \text F & \text T & \text T \\ \text F & \text F & \text F \end{array}$

In the disjunction table, above, if you have no rain (rain = F) and no clouds (clouds = F), then having one or the other of rain or clouds is False (“rain or clouds” = “rain and no clouds,” or “clouds and no rain”).

Truth table for the implication ($\small \rightarrow$) of two propositions:

    $\small \begin{array}{ c | c || c | } p & q & p \rightarrow q \\ \text{(e.g. rain)} & \text{(e.g. clouds)} & \text{(e.g. rain implies clouds)} \\ \hline \text T & \text T & \text T \\ \text T & \text F & \text F \\ \text F & \text T & \text T \\ \text F & \text F & \text T \end{array}$

The implication table above is trickier. $\small p \rightarrow q$ states that “rain implies clouds,” i.e. that if rain is True then clouds is also True. That implication [row 1].

In the second row, if there is rain but no clouds, then that implication is False (as the implication is that “if you have rain, you have clouds”).

In the third row, the absence of rain (rain = F) means we cannot negate the implication that rain implies clouds, as we don’t have enough information to counter that implication. Thus, the implication $\small p \rightarrow q$ remains valid (True).

In the fourth row, a similar argument holds: the absence of rain (regardless of whether or not clouds are present) means we lack sufficient information to counter the implication $\small p \rightarrow q$, which thus remains valid (True). If we have no rain, then under the stated implication we cannot make any assertions about the presence or absence of clouds.

Truth table for the biconditional ($\small \longleftrightarrow$) relationship of two propositions:

    $\small \begin{array}{ c | c || c | } & & p \longleftrightarrow q & \\ p & q & \text{(rain and clouds), or} \\ \text{(rain)} & \text{(clouds)} & \text{(no rain and no clouds)} & (p \rightarrow q) \land (q \rightarrow p)\\ \hline \text T & \text T & \text T & \text T \\ \text T & \text F & \text F & \text F \\ \text F & \text T & \text F & \text F \\ \text F & \text F & \text T & \text T \end{array}$

The truth values of $\small p \longleftrightarrow q$ are equivalent to $\small (p \rightarrow q) \land (q \rightarrow p)$.

Truth table for the contraposition ($\small \sim$ or $\small \neg$) of two propositions:

    $\small \begin{array}{ c | c || c | } p & q & \neg p & \neg q & \neg q \rightarrow \neg p & p \rightarrow q \\ \hline \text T & \text T & \text F & \text F & \text T & \text T \\ \text T & \text F & \text F & \text T & \text F & \text F \\ \text F & \text T & \text T & \text F & \text T & \text T \\ \text F & \text F & \text T & \text T & \text T & \text T \end{array}$

i.e. “not True” is “False;” “not False” is “True.”

In the first row: “not q” (F) implies “not p” (F), which is True.
In the second row: “not q” (T) implies not p” (F) which is False.
In the third row: “not q” (F) implies not p” (T) which is True.
In the fourth row: “not q” (T) implies not p” (T) which is True.

This is the same as the implication truth table, further above (which I added as the last column in the table immediately above).

See also:

Long short-term memory (LSTM RNN)

Although theoretically powerful, vanilla RNNs cannot learn from long-sequences due to a problem known as vanishing or exploding gradients. A powerful solution is Long Short-Term Memory (LSTM), introduced in 1997 by Hochreiter & Schmidhuber). Why “long short-term”? As described in that paper, “[LSTM] can learn to bridge time intervals in excess of 1000 steps even in the case of noisy, incompressible input sequences, without loss of short time lag capabilities.” A RNN relies on the past, without scope; long-term past inputs are forgotten. LSTM provide a short-term memory function: a LSTM block has mechanisms to enable “memorizing” information for an extended number of time steps.

LSTM introduces one more vector called “memory” $\small \mathbf{c}_t$, which, together with the state $\small \mathbf{h}_t$, specify the dynamic as: $\small (\mathbf{h}_t, \mathbf{c}_t) = LSTM(\mathbf{h}_{t-1}, \mathbf{c}_{t-1}, \mathbf{x}_t)$. In most implementations, this is decomposed further as:

$\ \ \ \ \ \ \small \mathbf{c}_t = \mathbf{f}_t ∗ \mathbf{c}_{t-1} + \mathbf{i}_t ∗ \mathbf{\widetilde{c}}_t$

$\ \ \ \ \ \ \small \mathbf{h}_t = \mathbf{o}_t ∗ tanh(\mathbf{c}_t)$


  • $\small \mathbf{\widetilde{c}}_t$ is a candidate memory computed from the input
  • $\small \mathbf{f}_t, \mathbf{i}_t, \mathbf{o}_t \in (\textbf{0},\textbf{1})$ are gates
  • $∗$ denotes point-wise multiplication

  • $\small \mathbf{f}_t$ determines how much the previous memory is maintained
  • $\small \mathbf{i}_t$ controls how much new information is stored into memory
  • $\small \mathbf{o}_t$ controls how much memory is read out

The candidate memory and the gates are typically parameterized (sometimes written as “parametrized”) as:

$\ \ \ \ \ \ \small \mathbf{\widetilde{c}}_t = tanh(W_c\mathbf{h}_{t-1} + V_c\mathbf{x_t} + \mathbf{b_c})$

$\ \ \ \ \ \ \small \begin{bmatrix} f_t \\ i_t \\ o_t \end{bmatrix} = sigm \left( \begin{bmatrix} W_f \\ W_i \\ W_o \end{bmatrix} \mathbf{h}_{t-1} + \begin{bmatrix} V_f \\ V_i \\ V_o \end{bmatrix} \mathbf{x}_t + \begin{bmatrix} \mathbf{b}_f \\ \mathbf{b}_i \\ \mathbf{b}_o \end{bmatrix} \right)$

where $\small (W_{c,f,i,o}, V_{c,f,i,o}, \mathbf{b}_{c,f,i,o})$ are learnable parameters.


[Image source (click image to open in new window)]


[Image source (click image to open in new window)]

In the transformations above, the memory cell $\small \mathbf{c}_t$ stores the “long-term” memory in the vector form. In other words, the information accumulatively captured and encoded until time step $\small t$ is stored in $\small \mathbf{c}_t$ and is only passed along the same layer over different time steps. Given the inputs $\small \mathbf{c}_t$ and $\small \mathbf{h}_t$, the input gate $\small \mathbf{i}_t$ and forget gate $\small \mathbf{f}_t$ will help the memory cell to decide how to overwrite or keep the memory information. The output gate $\small \mathbf{o}_t$ further lets the LSTM block decide how to retrieve the memory information to generate the current state $\small \mathbf{h}_t$ that is passed to both the next layer of the current time step and the next time step of the current layer. Such decisions are made using the hidden-layer parameters $\small \mathbf{W}$ and $\small \mathbf{b}$ (with various subscripts): these parameters will be inferred during the training phase


  • Sources for the above:

  • Symbols in bold font, above, are vectors (e.g. $\small \mathbf{y}_t$ is an output vector – these are the network’s parameters (i.e. “memory”).

  • That paper glosses over glosses over some of the terms (I chose it for the clarity of the “vanilla” (basic) RNN and LSTM structures and descriptions.

  • Bias terms ($\small \textbf{b}$ or $\small \textbf{w}_0$) are additional constants attached to the neurons and added to the weighted input before the activation function is applied. Bias terms help models represent patterns that do not necessarily pass through the origin. For example, if all your features were 0, would your output also be zero? Is it possible there is some base value upon which your features have an effect? Bias terms typically accompany weights and must also be learned by your model. [Source: ML Cheatsheet]

    Bias is the term $\small \textbf{b}$ in the following formula: $\small y^{\ \prime} = \textbf{b} + \mathbf{w}_1\mathbf{x}_1 + \mathbf{w}_2\mathbf{x}_2 + \ldots + \mathbf{w}_n\mathbf{x}_n$.

  • $\small W$, $\small U$ and $\small V$ are matrices of weights, updated during training. $\small W$ are the (shared – see note above) weights passed from one time step to the next; $\small U$ are the “skip connections” from the inputs to all hidden layers, and $\small V$ are the skip connections from the hidden layers to the outputs. The image below, taken from Alex Graves paper, illustrates those connections.

  • Early versions of simple RNN used the sigmoid activation function (see Recent Advances in Recurrent Neural Networks for a description plus other forms), which is applied component-wise:

    $\ \ \ \ \ \ \ \ \ \ \sigma(x) = \frac{1}{1 + \mathcal{e}^{(-x)}}$

    The “sigmoid” is a common choice, which takes a real-value and squashes it to the range [0,1]. This activation function is normally used in the output layer, where a cross-entropy loss function is used for training a classification model.

  • In a LSTM, The input gate is a sigmoid function and have a range of [0,1]. Because the equation of the cell state is a summation between the previous cell state, sigmoid function alone will only add memory and not be able to remove/forget memory. If you can only add a float number between [0,1], that number will never be zero/turned off/forget. This is why the input modulation gate has an tanh activation function. Tanh has a range of [-1, 1] and allows the cell state to forget memory. [Source: Long Short-Term Memory (LSTM): Concept]

Bidirectional RNN

… A major issue with all of the above networks [RNN; LSTM; GRU] is that they learn representations from previous time steps. Sometimes, you might have to learn representations from future time steps to better understand the context and eliminate ambiguity. Take the following examples, “He said, Teddy bears are on sale” and “He said, Teddy Roosevelt was a great President”. In those sentences, when we are looking at the word “Teddy” and the previous two words “He said”, we might not be able to understand if the sentence refers to the President or Teddy bears. To resolve this ambiguity, we need to look ahead. This is what Bidirectional RNNs accomplish.

The repeating module in a Bidirectional RNN could be a conventional RNN, LSTM or GRU. The structure and the connections of a bidirectional RNN are represented in the figure below. There are two type of connections, one going forward in time, which helps us learn from previous representations and another going backwards in time, which helps us learn from future representations.


[Image source. Click image to open in new window.]

Forward propagation is done in two steps:

  • We move from left to right, starting with the initial time step we compute the values until we reach the final time step
  • We move from right to left, starting with the final time step we compute the values until we reach the initial time step

Source: Introduction to Sequence Models – RNN, Bidirectional RNN, LSTM, GRU  [local copy]

See also:

Bidirectional LSTM (Bi-LSTM)

LSTM in its core, preserves information from inputs that has already passed through it using the hidden state. Unidirectional LSTM only preserves information of the past because the only inputs it has seen are from the past. Using bidirectional will run your inputs in two ways, one from past to future and one from future to past and what differs this approach from unidirectional is that in the LSTM that runs backwards you preserve information from the future and using the two hidden states combined you are able in any point in time to preserve information from both past and future.

A bidirectional LSTM (BiLSTM; Bi-LSTM) layer learns bidirectional long-term dependencies between time steps of time series or sequence data. These dependencies can be useful for when you want the network to learn from the complete time series at each time step.

Bidirectional LSTMs (Bi-LSTMs) are an extension of traditional LSTMs that can improve model performance on sequence classification problems. In problems where all timesteps of the input sequence are available, Bidirectional LSTMs train two instead of one LSTMs on the input sequence. The first LSTM operates on the input sequence as is, and the second LSTM operates on a reversed copy of the input sequence. This can provide additional context to the network, resulting in faster and even fuller learning on the problem.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

  • Note that $\small \mathbf{x}$ stands for the input sequence, $\small \mathbf{h}$ for the output sequence from forward or backward runs (defined by the arrow pointing right or left, respectively), $\small \mathbf{y}$ for the concatenated output sequence where $\small \sigma$ concatenates the forward and backward output elements. [Source]

See also:


See Long short-term memory (LSTM RNN)

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Machine learning

Machine learning (ML) is a field of artificial intelligence that uses statistical techniques to give computer systems the ability to “learn” (e.g., progressively improve performance on a specific task) from data, without being explicitly programmed.

Machine learning explores the study and construction of algorithms that can learn from and make predictions on data – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions, through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders, and computer vision.

Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field. Machine learning is sometimes conflated with data mining, where the latter subfield focuses more on exploratory data analysis and is known as unsupervised learning.

Within the field of data analytics, machine learning is a method used to devise complex models and algorithms that lend themselves to prediction; in commercial use, this is known as predictive analytics. These analytical models allow researchers, data scientists, engineers, and analysts to “produce reliable, repeatable decisions and results” and uncover “hidden insights” through learning from historical relationships and trends in the data.

Regarding data analysis, machine learning, and latent knowledge discovery: Kevin Murphy (Google AI) states on p. 1 of his book, Machine Learning: A Probabilistic Perspective (2012), that

    "[The] deluge of data calls for automated methods of data analysis, which is what machine learning provides. In particular, we define machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty ..."

Machine Learning Tasks

Machine learning tasks are typically classified into several broad categories:

  • Supervised learning: The computer is presented with example inputs and their desired outputs, given by a “teacher”, and the goal is to learn a general rule that maps inputs to outputs. As special cases, the input signal can be only partially available, or restricted to special feedback.

  • Semi-supervised learning: The computer is given only an incomplete training signal: a training set with some (often many) of the target outputs missing.

  • Active learning: The computer can only obtain training labels for a limited set of instances (based on a budget), and also has to optimize its choice of objects to acquire labels for. When used interactively, these can be presented to the user for labeling.

  • Unsupervised learning: No labels are given to the learning algorithm, leaving it on its own to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

  • Reinforcement learning: Data (in form of rewards and punishments) are given only as feedback to the program’s actions in a dynamic environment, such as driving a vehicle or playing a game against an opponent.

To fully understand the basics of machine learning, you need to study at least a bit of linear algebra (matrix mathematics: transformations, multiplications, …), partial derivatives (needed for the chain rule, used in backpropagation, …), etc. If you are at all serious about machine learning, Andrew Ng’s Machine Learning course is a must, if only to understand the basics.

Here are my old notes from that course (2013; some of the images are missing, but those omissions are trivial as these notes are very comprehensive):

As well, here is Chris Olah’s Backpropagation blog post, with my handwritten partial differentiation calculations.

For a more recent overview of the recent machine learning domain (focused on mid-2015 through mid-2017) see my comprehensive Machine Learning Notes.

Matrix operations

See this page.

Memory (neural memory)

See my blog post How do Neural Networks "Remember"?.


In semantics, a meronym denotes a part or a member of something.


  • “page” is a meronym of “book”
  • “wheel”, “engine”, “door”, etc. are meronyms of “car”
  • “finger” is a meronym of “hand”
  • “bark” is a meronym of “tree” (bark is part of what makes up a tree), while “tree” is a meronym of “forest”
  • “elbow” is a meronym of “arm”, which is a meronym of “body”

See also Linguistics | Semantics.


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.


In linguistics, a morpheme is a meaningful morphological unit of a language that cannot be further divided; e.g., “incoming” consists of the morphemes “in”, “come” and “-ing”. A morpheme may or may not stand alone, whereas a word, by definition, is freestanding. Another example: “dogs” consists of two morphemes and one syllable: “dog”, and “-s”, a plural marker on nouns. Note that a morpheme like “-s” can just be a single phoneme and does not have to be a whole syllable.

See also:


Morphology is a subdiscipline of linguistics that studies word structure. During morphological processing we are basically considering words in a text separately and trying to identify morphological classes to which these words belong. A widespread morphological task is lemmatizing (lemmatization) and stemming, which are used in many web search engines. In this case all morphological variations of a given word (known as word forms) are collapsed to one lemma or stem. [Source]

Morphology concerns the structure and meaning of words. Some words, such as “send”, are atomic (monomorphemic), while others – such as “sends”, “sending” and “resend” – appear to be constructed from several atoms or morphemes. These morphemes (bits of words) occur in a lots of other words: “thinks”, “thinking”, “reprogram”, “rethink”, etc. There is a syntax to the way morphemes can combine – the affixes mentioned so far can all combine with verbs to make verbs; others, such as “able”, combine with verbs to make adjectives: e.g., “programmable”. Sometimes the meaning of a word is a regular, productive combination of the meanings of its morphemes, such as “unreprogramability”. Frequently, however, it is not (or isn’t completely) obvious; e.g. “react”, or “establishment”. [Source; Introduction to Linguistics for Natural Language Processing]

See also Linguistics | Semantics.

Mutual information

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the “amount of information” (in units such as shannons, commonly called bits) obtained about one random variable through observing the other random variable. The concept of mutual information is intricately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected “amount of information” held in a random variable.


[Image source. Click image to open in new window.]

Venn diagram showing additive and subtractive relationships various information measures associated with correlated variables $\small X$ and $\small Y$. The area contained by both circles is the joint entropy $\small H(X,Y)$. The circle on the left (red and violet) is the individual entropy $\small H(X)$, with the red being the conditional entropy $\small H(X|Y)$. The circle on the right (blue and violet) is $\small H(Y)$, with the blue being $\small H(Y|X)$. The violet is the mutual information $\small I(X;Y)$.

Not limited to real-valued random variables like the correlation coefficient, MI is more general and determines how similar the joint distribution $\small p(x,y)$ is to the products of factored marginal distribution $\small p(x) \cdot p(y)$. MI is the expected value of the pointwise mutual information (PMI).

Mutual Information: Definition

Formally, the mutual information of two discrete random variables $\small X$ and $\small Y$ can be defined as:

    $\small \begin{align} \operatorname {I} (X;Y)=\sum _{y\in {\mathcal {Y}}}\sum _{x\in {\mathcal {X}}}{p(x,y)\log {\left({\frac {p(x,y)}{p(x)\,p(y)}}\right)}} \end{align}$,

where $\small p(x,y)$ is the joint probability function of $\small X$ and $\small Y$, and $\small p(x)$ and $\small p(y)$ are the marginal probability distribution functions of $\small X$ and $\small Y$, respectively.

In the case of continuous random variables, the summation is replaced by a definite double integral:

    $\small \begin{align} \operatorname {I} (X;Y)=\int _{\mathcal {Y}}\int _{\mathcal {X}}{p(x,y)\log {\left({\frac {p(x,y)}{p(x)\,p(y)}}\right)}}\;dx\,dy \end{align}$,

where $\small p(x,y)$ is now the joint probability density function of $\small X$ and $\small Y$, and $\small p(x)$ and $\small p(y)$ are the marginal probability density functions of $\small X$ and $\small Y$, respectively.

If the log base 2 is used, the units of mutual information are bits.

See also:

Pointwise mutual information

Pointwise mutual information (PMI), or point mutual information, is a measure of association used in information theory and statistics. In contrast to mutual information (MI, above, which builds upon PMI), PMI refers to single events, whereas MI refers to the average of all possible events.

The PMI of a pair of outcomes $\small x and $\small y belonging to discrete random variables $\small X and $\small Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically:

    $ \operatorname {pmi} (x;y)\equiv \log {\frac {p(x,y)}{p(x)p(y)}}=\log {\frac {p(x|y)}{p(x)}}=\log {\frac {p(y|x)}{p(y)}}$.

The mutual information (MI) of the random variables $\small X$ and $\small Y$ is the expected value of the PMI over all possible outcomes (with respect to the joint distribution $\small p(x,y))$.

The measure is symmetric: $\small \operatorname {pmi} (x;y) = \operatorname {pmi} (y;x)$. It can take positive or negative values, but is zero if $\small X$ and $\small Y$ are independent. Note that even though PMI may be negative or positive, its expected outcome over all joint events (MI) is positive. PMI maximizes when $\small X$ and $\small Y$ are perfectly associated, i.e. $\small p(x|y)$ or $\small p(y|x)=1$, yielding the following bounds:

    $\small -\infty \leq \operatorname {pmi} (x;y)\leq \min \left[-\log p(x),-\log p(y) \right]$.

Finally, $\small \operatorname {pmi} (x;y)$ will increase if $\small p(x|y)$ is fixed but $\small p(x)$ decreases.

Pointwise mutual information: Example

x y p(xy)
0 0 0.1
0 1 0.7
1 0 0.15
1 1 0.05

Using this table we can marginalize to get the following additional table for the individual distributions:

p(x) p(y)
0 0.8 0.25
1 0.2 0.75

With this example, we can compute four values for $\small pmi(x;y)$. Using base-2 logarithms:

    pmi(x=0; y=0) = -1
    pmi(x=0; y=1) = 0.222392
    pmi(x=1; y=0) = 1.584963
    pmi(x=1; y=1) = -1.584963

For reference, the mutual information $\small \operatorname {I} (X;Y)$ would then be 0.2141709.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Named entities;
Named entity recognition (NER)

The difference between an entity and a named entity is essentially the same difference between nouns and proper nouns. An entity can be nominal – a common thing like a city – whereas a named entity is more like a proper noun, such as a name (Paris). In other words, a named entity is something that deserves to have a name. A human is an entity, but if we give human a name, this produces a named entity. Named entities may consist of more than one word.

IE vs. IR

Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names.

NER (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER has been structured as taking an unannotated block of text, such as

    Jim bought 300 shares of Acme Corp. in 2006.

and producing an annotated block of text that highlights the names of entities,

    [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.

State-of-the-art NER systems for English produce near-human performance.

See also [NLTK] Named Entity Recognition, which includes a short description of the “IOB format” (or sometimes “BIO Format”), developed for NP chunking. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O.

An example of this scheme is shown here:

    [Tag Representation of Chunk Structures. Image source. Click image to open in new window.]

IOB tags have become the standard way to represent chunk structures in files. Here is how the information in 2.5 would appear in a file:

      We PRP B-NP
      saw VBD O
      the DT B-NP
      yellow JJ I-NP
      dog NN I-NP

In this representation there is one token per line, each with its part-of-speech tag
Source: Image source: Section 8.2 (Tagsets for English) in
Jurafsky D & Martin JH (2000) Speech and Language processing (p. 295)
and chunk tag.

Named entity disambiguation
Named entity normalization

Polysemy, words or phrases with different but related meanings poses a challenge to NLP; for example, “Washington” could refer to the location “Washington, DC” or the person “George Washington”; “ACE” could represent “angiotensin converting enzyme” or “acetylcholinesterase”. In polysemy, the fact that multiple entities might have the same name is common for named entities. The task of addressing the polysemy problem for named entities is called named entity disambiguation. Named entity disambiguation/normalization is the task of mapping of a named entity or type in to a unique identifier or concept (e.g., disease names may be mapped to the National Library of Medicine’s Medical Subject Headings disease terms). For more on polysemy, see Analyzing Polysemous Concepts from a Clinical Perspective: Application to Auditing Concept Categorization in the UMLS.

See also word sense disambiguation.

Ngram (n-gram)

In the fields of computational linguistics and probability, an $\small n$-gram is a contiguous sequence of $\small n$ items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The $\small n$-grams typically are collected from a text or speech corpus. When the items are words, $\small n$-grams may also be called shingles.

Using Latin numerical prefixes, an $\small n$-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. English cardinal numbers are sometimes used, e.g., “four-gram”, “five-gram”, and so on.

In computational biology, a polymer or oligomer of a known size is called a $\small k$-mer instead of an $\small n$-gram, with specific names using Greek numerical prefixes such as “monomer”, “dimer”, “trimer”, “tetramer”, “pentamer”, etc., or English cardinal numbers, “one-mer”, “two-mer”, “three-mer”, etc.


Sample sequence
1-gram sequence
2-gram sequence
3-gram sequence
Vernacular name
Order of resulting Markov model
Protein sequencing amino acid ... Cys-Gly-Leu-Ser-Trp ... ..., Cys, Gly, Leu, Ser, Trp, ... ..., Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, ... ..., Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, ...
DNA sequencing base pair ...AGCTTCGA... ..., A, G, C, T, T, C, G, A, ... ..., AG, GC, CT, TT, TC, CG, GA, ... ..., AGC, GCT, CTT, TTC, TCG, CGA, ...
Computational linguistics character ...to_be_or_not_to_be... ..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ... ..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ... ..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ...
Computational linguistics word ... to be or not to be ... ..., to, be, or, not, to, be, ... ..., to be, be or, or not, not to, to be, ... ..., to be or, be or not, or not to, not to be, ...

The table above shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences. Here are further examples:

  • 3-grams
    • ceramics collectables collectibles
    • ceramics collectables fine
    • ceramics collected by
  • 4-grams
    • serve as the incoming
    • serve as the incubator
    • serve as the independent

See also kmer.

Natural language processing

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.


[Image source (slide 8). Click image to open in new window.]

Natural language understanding

Natural language understanding (NLU) or natural language interpretation (NLI) is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension (see the image, above). Natural language understanding is considered an AI-hard problem. There is considerable commercial interest in the field because of its application to news-gathering, text categorization, voice-activation, archiving, and large-scale content analysis.


Neo4j is a graph database management system. Neo4j is an ACID-compliant transactional database with native graph storage and processing. Neo4j is accessible using the Cypher Query Language.

... continued here ...


See also Probabilistic programming:

  • The final portion of An Introduction to Statistical Relational Learning (part 2) lists various probabilistic programming solutions (languages; platforms), and mentions this:

    RockIt  from University of Mannheim has an online and rest based interface and uses only Conjunctive Normal Forms (CNF), i.e. And-Or format in its formulas.” RockIt is a query engine for Markov logic.

The noisy-OR model is an example of a probabilistic model that jointly models parameters (variables). Noisy-OR is a conditional probability distribution that describes the causal mechanisms by which parent nodes (e.g. diseases) affect the states of children nodes (e.g. symptoms). The Noisy-OR function has the useful property that the probability of a relationship is high with at least one reliable assertion yet increases with additional support.


[Image source. Click image to open in new window.]

Noisy-OR is used in the following papers:

  • Learning a Health Knowledge Graph from Electronic Medical Records, in which maximum likelihood estimation of three probabilistic models was used to automatically construct knowledge graphs: logistic regression, naive Bayes classifier, and a Bayesian network using noisy-OR gates. … Noisy-OR was chosen as an example of a probabilistic model that jointly models diseases and symptoms; similar models have successfully been used in previous medical diagnosis applications. … The noisy-OR model significantly outperformed the other tested models, producing a high quality knowledge graph reaching precision of 0.85 for a recall of 0.6 in the clinical evaluation.

    Diseases (and their effects: symptoms) were separately modeled by logistic regression and naive Bayes; additionally, in naive Bayes, all symptoms were assumed to be conditionally independent from one another. However, the noisy-OR model jointly models all diseases and symptoms [wherein the central assumption of independence of effects (symptoms) is maintained]. Noisy-OR is a conditional probability distribution that describes the causal mechanisms by which parent nodes (e.g. diseases) affect the states of children nodes (e.g. symptoms). For example, given that patients tend to present with few diseases, the presence of one disease typically lowers the probability of others. In the logistic regression and naive Bayes models, however, since each disease is modeled separately (resulting in an assumption of independence between diseases), the presence of one disease does not (for example) rule out or lessen the probability of another disease (i.e. diagnosis).

  • The noisy-OR model was also employed in Biomedical Discovery Acceleration, with Applications to Craniofacial Development (2009; different authors). “… the explosion of new results in the scientific literature, particularly in molecular biomedicine, is both a blessing and a curse to the bench researcher. … In this paper, we describe a novel computational approach to this challenge, a knowledge-based system that combines reading, reasoning, and reporting methods to facilitate analysis of experimental data.” They also stated (paraphrased):

    • “One of the most popular methods to combine individual reliabilities is to assume independence of experts (naive Bayes assumption) and compute the integrated likelihood $\small P$ for each relationship using the Noisy-OR function … The Noisy-OR function has the useful property that the probability of a relationship is high with at least one reliable assertion yet increases with additional support. This property is especially relevant in biology, where it is often difficult to identify false negatives; a given assertion is strengthened by additional information but unlike the case for estimating the reliability of an expert on the whole, an individual assertion is not penalized for lack of additional evidence. Moreover, since the experts are assumed to be independent, experts can be removed from or added to the analysis without excessive re-computation.”


    [Image source. Click image to open in new window.]

  • On a Hypergraph Probabilistic Graphical Model (Nov 2018)

    • “We propose a directed acyclic hypergraph framework for a probabilistic graphical model that we call Bayesian hypergraphs. … Bayesian hypergraphs also allow a modeler to represent causal patterns of interaction such as Noisy-OR graphically (without additional annotations). …”


In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space - except for the zero vector, which is assigned a length of zero. A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero vector).

A norm must also satisfy certain properties pertaining to scalability and additivity which are given in the formal definition below.

A simple example is two dimensional Euclidean space $\small \mathbb{R}^2$ equipped with the “Euclidean norm” (see below). Elements in this vector space (e.g., (3, 7)) are usually drawn as arrows in a 2-dimensional Cartesian coordinate system starting at the origin (0, 0). The Euclidean norm assigns to each vector the length of its arrow. Because of this, the Euclidean norm is often known as the magnitude.

A vector space on which a norm is defined is called a normed vector space. Similarly, a vector space with a seminorm is called a seminormed vector space. It is often possible to supply a norm for a given vector space in more than one way.

Double bars (or sometimes even single bars) tend to denote a norm in Mathematics. Most likely, the double bars here are denoting the Euclidean norm. This is just the length of the vector. So for example, the vector $\small \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$ has length

    $\small \Vert \begin{matrix} 1 & 2 & 3 \end{matrix} \Vert = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} = 3.7417$

and the vector

    $\small \Vert \begin{matrix} 3 & -1 & 2 \end{matrix} \Vert = \sqrt{3^2 + (-1)^2 + 2^2} = \sqrt{14} = 3.7417$

Notice that $\small \mathbf{Ax}$ is just a vector, so $\small \Vert \mathbf{Ax} \Vert$ is just the length of the vector. $\small \Vert \mathbf{x} \Vert$ is just the length of $\small \mathbf{x}$. So here you are looking for scaling of $\small \mathbf{x}$ under transformation by $\small \mathbf{A}$ to be between $\small m$ and $\small M$.

Source for example, above: What does double vertical-line means in linear algebra?

See also:


For good summaries of notations used in machine learning, consult:

Noun phrase

A noun phrase or nominal phrase (abbreviated NP) is a phrase that has a noun (or indefinite pronoun) as its head or performs the same grammatical function as such a phrase. Noun phrases are very common cross-linguistically, and they may be the most frequently occurring phrase type.

Noun phrases often function as verb subjects and objects, as predicative expressions, and as the complements of prepositions. Noun phrases can be embedded inside each other; for instance, the noun phrase some of his constituents contains the shorter noun phrase his constituents.

Some examples of noun phrases are underlined in the sentences below. The head noun appears in bold.

  • The election-year politics are annoying for many people.
  • Almost every sentence contains at least one noun phrase.
  • Current economic weakness may be a result of high energy prices</u>.

Those five beautiful shiny Arkansas Black apples look delicious” is a noun phrase of which apples is the head. To test, a single pronoun can replace the whole noun phrase, as in “They look delicious”.

Noun phrases can be identified by the possibility of pronoun substitution, as is illustrated in the examples below.

    a. This sentence contains two noun phrases.
    b. It contains them.

    a. The subject noun phrase that is present in this sentence is long.
    b. It is long.

    a. Noun phrases can be embedded in other noun phrases.
    b. They can be embedded in them.

A string of words that can be replaced by a single pronoun without rendering the sentence grammatically unacceptable is a noun phrase.

See also NLTK: Noun Phrase Chunking.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Obscuro seu abbreviatio [Obscure abbreviations]

  • a.s.: Almost Sure Convergence


    [Click image to open in new window.]


    [Image source, discussion. Click image to open in new window.]

    [See also.]  Here is an intuitive explanation. In terms of the sample space of events $\small \Omega$, an event $\small E$ happens almost surely if $\small Pr(E) = 1$, whereas an event happens surely if $\small E = \Omega$. An example: suppose we are independently flipping a fair coin infinitely many times. The event

      $\small \text{I will get heads infinitely often}$

    [which you need to read as “I will get an infinite number of heads, ignoring any tails that I get” rather than “I will only get heads”] is an almost sure event (because it is possible get only a finite number of heads … but how likely is that? Rigorous proof uses the Borel Cantelli lemma, if you are interested).

    In contrast,

      $\small \text{I will get heads $\small \mathbf{or}$ tails on my 16th flip}$

    must happen. This is a sure event.

  • i.i.d.: Independent and Identically Distributed random variables

    • In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (i.i.d. or iid or IID) if each random variable has the same probability distribution as the others and all are mutually independent. The annotation IID is particularly common in statistics, where observations in a sample are often assumed to be effectively IID for the purposes of statistical inference. The assumption (or requirement) that observations be IID tends to simplify the underlying mathematics of many statistical methods …

      Often the IID assumption arises in the context of sequences of random variables. Then “independent and identically distributed” in part implies that an element in the sequence is independent of the random variables that came before it. In this way, an IID sequence is different from a Markov sequence, where the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first order Markov sequence). An IID sequence does not imply the probabilities for all elements of the sample space or event space must be the same; for example, repeated throws of loaded dice will produce a sequence that is IID, despite the outcomes being biased.

  • i.f.f.: If and only if

    • In logic and related fields such as mathematics and philosophy, if and only if (shortened iff ) is a biconditional logical connective between statements. In that it is biconditional (a statement of material equivalence), the connective can be likened to the standard material conditional (“only if”, equal to “if … then”) combined with its reverse (“if”); hence the name. The result is that the truth of either one of the connected statements requires the truth of the other (i.e. either both statements are true, or both are false). It is controversial whether the connective thus defined is properly rendered by the English “if and only if”, with its pre-existing meaning.

      In writing, phrases commonly used, with debatable propriety, as alternatives to $\small \textit{P “if and only if” Q}$ include:

      • $\small \textit{Q is necessary and sufficient for P}$,
      • $\small \textit{P is equivalent (or materially equivalent) to Q}$  (compare material implication),
      • $\small \textit{P precisely if Q}$,
      • $\small \textit{P precisely (or exactly) when Q}$,
      • $\small \textit{P exactly in case Q}$, and
      • $\small \textit{P just in case Q}$.

      Some authors regard “iff” as unsuitable in formal writing; others use it freely.

      In logical formulae, logical symbols are used instead of these phrases; see the discussion of notation.

      Notation. The corresponding logical symbols are $\leftrightarrow$, $\Leftrightarrow$, and $\equiv$ and sometimes $\text{iff}$ …

Common Latin and Non-English Abbreviations Used in Research

A.D. (Anno Domini). Used to date years by reckoning the date of Christ’s birth, as opposed to B.C., the years “Before Christ.” Literally, Anno Domini means “In the year of the Lord.” Remember two important notes! Anno Domini does not mean “After Death ” (if it did, there would be a thirty-three year gap between 1 BC and the crucifixion thirty-three years later). Also note the politically correct tendency is to use the abbreviation CE (Common Era) and BCE (Before Common Era). These abbreviations are an attempt to avoid the religious connotations of the Latin abbreviation. In spite of the name change, BCE and CE still divide history according to the life of Christ, but CE and BCE may be less offensive (or at least less ethnocentric) to a non-Christian audience.

c. (circa). Used by historians to show that a date is approximate. Literally, the word means “around,” and it is sometimes abbreviated “ca.” Usage: Shortly after Henry IV seized the throne from Richard II, Geoffrey Chaucer died (c.1400 A.D.), perhaps due to old age.

cf. (confere). A Latin imperative suggesting the reader should compare and contrast one statement or idea with another one. Literally, “compare.” Researchers often follow the abbreviation with a reference to an author or page number, suggesting the reader look for similarities and differences between what a previous citation has said with the subsequent source listed. Usage: Some scholars think Hitler’s Mein Kampf used genocidal ideas found in earlier anti-Semitic literature him (Smith 42), but others argue Hitler himself was the primary originator (cf. Jones 98).

e.g. (exempli gratia). “For example.” Literally, “free as an example.” Usage: “We have numerous problems to deal with before reforming welfare policies, e.g., the trade deficit, Medicare, and social security.”

et al. [et alia (neuter plural)]. It can also be an abbreviation for et alii (masculine plural), or et aliae (feminine plural). This phrase means “and others.” [Source]

etc. (et cetera). “And so on.” This is the one Latin abbreviation most students already know, and the one they tend to overuse. Do note that, since et already means and, it is redundant to write, “and etc.” Literally, the Latin phrase means “and other things.” Usage: The problems of the Balkan Republics are numerous, including insufficient electric power, poor highways, rampant unemployment, hostile neighbors, etc.

et pass. (et passim). And also found throughout the subsequent pages or sections. Literally, “And in the following.” The abbreviation typically appears after a citation of a single page, suggesting the reader look at that page first and then skim the material following for further discussion. Usage: For further discussion of this important issue, see Smith 42 et passim.

ib./ ibid. (ibidem). “In the same passage or page quoted above.” Literally, “In the same place.” Usage: “One physicist compared the behavior of quarks to bowling pins (Jones 35). He also indicated that the ‘Charm’ quark was like a ‘bowling ball’ (ibid.) due to the way it. . . .”

i.e. (id est). “That is more precisely.” Literally, “it is.” Commonly used to refine a general statement or provide additional information. Usage: “Jerry’s girlfriend always managed to turn the conversation toward children, i.e., the possibility of having children together; i.e., the possibility of having legitimate children together; i.e., toward the subject of marriage.”

sic. Indicates a misspelling or error in a quoted source, in order to verify to the reader that the researcher did not create a typographical error, but instead exactly reproduces the way the word or statement appeared in the original material. Literally, “yes” or “even thus” in Latin. Usage: There are, according to the writings of seven-year old Andrew, “Manee wayes of riting words” [sic].

Ph.D. (Philosophiae Doctor). “Doctor (or Doctorate) of Philosophy.” It can refer to the individual as a title, or to the degree itself. Note that it is redundant to write, “Dr. McGillicutty is a Ph. D.” unless the writer seeks to distinguish him from a medical doctor such as an M.D. Usage: “Joe Bob McGillicutty, Ph. D., is on the committee.” Or, “McGillicutty earned his Ph. D. in art history.”

vs. (versus. “Turned against.)” Often used in abbreviations for legal trials–though “v.” is more common. Usage: “In the case of Roe v. Wade, the Supreme Court eventually decided that abortion was a medical right.” Don’t confuse the term “vs.” with “v.s.” (see below). And don’t confuse the word versus with verses.

w.r.t. (WRT): “with regard to”  |  “with reference to”  |  “with respect to”. See this, and this.

Less Common Foreign Abbreviations

a.v. (ad valorem). “In proportion to the value of [something else].” Literally, “To the value.” Usage: “The monetary worth of the dollar is figured a.v. the price of gold.”

i.a. (in absentia). “In absence.” Usage: “With further evidence i.a., it is impossible to provide a definitive answer.” Or more commonly, “The criminal who had fled the country was tried and found guilty of murder, i.a.”

MS. (manuscriptum). A document, particularly an ancient or historical manuscript, that was not printed, but rather drawn or written. Literally, “By hand.” The term is capitalized when attached to a specific document’s title, and the plural form is MSS. In British usage, only the final letter typically has a period. Usage: “MS. Vercilli was found in Northern Italy, and it appears to be written in an Anglo-Saxon dialect.”

P.S. (post scriptum). The abbreviation indicates a last-minute addition to a letter or document. Literally, “After what has been written.” Usage: “That’s all for now. Take care. Love, John. P.S. Don’t forget to write me back!”

R.S.V.P. (Repondez S’il Vous-Plait). “Please send a response confirming whether or not you will accept the invitation.” The abbreviation is French rather than Latin. Literally, “Respond if it pleases you.” Note that it is redundant to write, “Please RSVP,” since the phrase itself implies “please.” Usage: “You are cordially invited to a wine-and-cheese reception at the Bradson’s House. RSVP by Thursday afternoon.”

S.P.Q.R. (Senatus Populusque Romani). The abbreviation was used in Roman times as a part of official government documentation. Today, the phrase is used to refer generally (and sometimes pompously or ironically) to the power, glory, and bureaucracy of a major nation. Literally, “The Senate and the People of Rome.” Usage: “The S.P.Q.R. has spoken, and now American soldiers must obey the call to arms.”

s.p.s. (sine prole supersite). “Without surviving issue.” The phrase is used in inheritance laws to indicate that an individual has no children or legal inheritors. Usage: “Since Mrs. Clayton died s.p.s., her six million dollar estate will revert to the City of Portland.”

t.i.d. (ter in die). “Three times a day.” Used by older pharmacies and doctors to indicate that a medication should be taken three times a day. Usage: “Aspirin, t.i.d.; call if headaches continue.”

viz. (videlicit). “More appropriately or accurately; namely.” The abbreviation is often used interchangeably with i.e. Literally, “As it befits or is pleasing to him.” Usage: “He was a minor Duke in the House of Lords, viz. the Duke of Rochester.”

vide. (“Look” or “see”). This phrase refers the reader back up to a previous statement or definition within the body of the paper. The must common uses are “vide 63” (which means “see page sixty-three”), v.s. vide supra (“see earlier” or “look above on this page”) and v.i. vide infra (“See below” or “Look below”). Don’t confuse v.s. (vide supra) with v. or vs. (versus). Usage: “For the definition of the Latin word videlicit, vide supra.”

N.B.: (Nota Bene). The Latin imperative means “Take notice of this very carefully,” that is, pay special attention to this part because it is unusually important, tricky, or confusing. Usage: All assignments are due at the beginning of class. N. B.: I lock the door to the classroom once lecture begins.


An ontology is a representation / specification of a conceptualization of a domain of knowledge, characterizing the classes and relations that exist in the domain. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. Commonly, ontologies are represented as graph structure that represents a taxonomy.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See Proofs.


Parameters are variables of a model that the machine learning system trains on its own. For example, weights are parameters whose values the ML system gradually learns through successive training iterations.

Contrast with hyperparameters, the “knobs” that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.

Partial Derivatives

See my note, here:


[Click image to open in new window.]


$\small \frac{\partial}{\partial x}2x^3 = 6x^2$

$\small \frac{\partial}{\partial x}(2x^3 + 3c) = 6x^2 + 0 = 6x^2$

$\small \frac{\partial}{\partial x}3c = 0$

$\small \frac{\partial}{\partial x}c = 0$

$\small \frac{\partial}{\partial x}(2x^3 + 5z^4) = 6x^2$

$\small \frac{\partial}{\partial z}(2x^3 + 5z^4) = 20z^3$


Parts of speech (POS)

In traditional grammar, a part of speech (abbreviated form: PoS or POS) is a category of words (or, more generally, of lexical items) which have similar grammatical properties. Words that are assigned to the same part of speech generally display similar behavior in terms of syntax – they play similar roles within the grammatical structure of sentences – and sometimes in terms of morphology, in that they undergo inflection for similar properties. Commonly listed English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, and sometimes numeral, article or determiner.

  • Noun (names): a word or lexical item denoting any abstract entity (abstract noun: e.g. home) or concrete entity (concrete noun: e.g. house); a person (police officer, Michael), place (coastline, London), thing (necktie, television), idea (happiness), or quality (bravery). Nouns can also be classified as count nouns or non-count nouns; some can belong to either category. The most common part of speech; nouns are called naming words.

  • Pronoun (replace or again placed): a substitute for a noun or noun phrase (them, he). Pronouns make sentences shorter and clearer since they replace nouns.

  • Adjective (describes, limits): a modifier of a noun or pronoun (big, brave). Adjectives make the meaning of another word (noun) more precise.

  • Verb (states action or being): a word denoting an action (walk), occurrence (happen), or state of being (be). Without a verb a group of words cannot be a clause or sentence.

  • Adverb (describes, limits): a modifier of an adjective, verb, or another adverb (very, quite). Adverbs make language more precise.

  • Preposition (relates): a word that relates words to each other in a phrase or sentence and aids in syntactic context (in, of). Prepositions show the relationship between a noun or a pronoun with another word in the sentence.

  • Conjunction (connects): a syntactic connector; links words, phrases, or clauses (and, but). Conjunctions connect words or group of words

  • Interjection (expresses feelings and emotions): an emotional greeting or exclamation (Woot, Hurray). Interjections express strong feelings and emotions.

  • Article (describes, limits): a grammatical marker of definiteness (the) or indefiniteness (a, an). The article is not always listed among the parts of speech. It is considered by some grammarians to be a type of adjective[13] or sometimes the term ‘determiner’ (a broader class) is used.

In the English language, words can be considered as the smallest elements that have distinctive meanings. Based on their use and functions, words are categorized into several types or parts of speech. This article will offer definitions and examples for the 8 major parts of speech in English grammar: noun, pronoun, verb, adverb, adjective, conjunction, preposition, and interjection.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

POS: functional classification

Linguists recognize that the above list of eight or nine word classes is drastically simplified. For example, “adverb” is to some extent a catch-all class that includes words with many different functions. Some have even argued that the most basic of category distinctions, that of nouns and verbs, is unfounded, or not applicable to certain languages.

Modern linguists have proposed many different schemes whereby the words of English or other languages are placed into more specific categories and subcategories based on a more precise understanding of their grammatical functions.

Common lexical categories defined by function may include the following (not all of them will necessarily be applicable in a given language):

  • Categories that will usually be open classes:
    • adjectives
    • adverbs
    • nouns
    • verbs (except auxiliary verbs)
    • interjections
  • Categories that will usually be closed classes:
    • auxiliary verbs
    • clitics
    • coverbs
    • conjunctions
    • determiners (articles, quantifiers, demonstrative adjectives, and possessive adjectives)
    • particles
    • measure words or classifiers
    • adpositions (prepositions, postpositions, and circumpositions)
    • preverbs
    • pronouns
    • contractions
    • cardinal numbers

Within a given category, subgroups of words may be identified based on more precise grammatical properties. For example, verbs may be specified according to the number and type of objects or other complements which they take. This is called subcategorizaton.

Many modern descriptions of grammar include not only lexical categories or word classes, but also phrasal categories, used to classify phrases, in the sense of groups of words that form units having specific grammatical functions. Phrasal categories may include noun phrases (NP), verb phrases (VP) and so on. Lexical and phrasal categories together are called syntactic categories.


[Image source. Click image to open in new window.]


[Image source: Section 8.2 (Tagsets for English) in Jurafsky D & Martin JH (2000)
  Speech and Language processing (p. 295). Click image to open in new window.]

Performance measures (metrics)

See this link, which discusses:

  • ROC curves (receiver operating characteristic curve)
  • Accuracy, precision, recall, $\small F_1$ score
    • true positives, true negatives
    • false positives, false negatives
    • confusion matrix
    • accuracy
    • misclassification rate
    • precision
    • recall (sensitivity; true positive rate)
    • false positive rate
    • specificity
    • prevalence
    • $\small F_1$ score
  • Performance metrics relevant to ML, NLP:
    • MRR (mean reciprocal rank)
    • filtered MRR
    • perplexity
    • BLEU (BiLingual Evaluation Understudy) score
    • ROUGE score


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.

Pointer mechanisms; Pointer-generators

Pointer-generator mechanisms were introduced by See et al. [Christopher Manning] in Get To The Point: Summarization with Pointer-Generator Networks (Apr 2017). Pointer-generator architectures can copy words from source texts via a pointer , and generate novel words from a vocabulary via a generator . With the pointing/copying mechanism factual information can be reproduced accurately, and out of vocabulary words can also be taken care of in the summaries [Neural Abstractive Text Summarization with Sequence-to-Sequence Models (Dec 2018)].


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Pointwise mutual information

See Mutual information.


Polysemy describes words or phrases with different but related meanings poses a challenge to NLP; for example, “Washington” could refer to the location “Washington, DC” or the person “George Washington”; “ACE” could represent “angiotensin converting enzyme” or “acetylcholinesterase”. In polysemy, the fact that multiple entities might have the same name is common for named entities. The task of addressing the polysemy problem for named entities is called named entity disambiguation.

See also: Named entity disambiguation;   Word sense disambiguation.

See also Linguistics | Semantics.


A portmanteau or portmanteau word is a linguistic blend of words, in which parts of multiple words or their phones (sounds) are combined into a new word, as in “smog”, coined by blending “smoke” and “fog” – or “motel”, from “motor” and “hotel”. In linguistics, a portmanteau is defined as a single morph that represents two or more morphemes.

The definition of portmanteau overlaps with the grammatical term contraction, but contractions are formed from words that would otherwise appear together in sequence, such as “do” and “not” to make “don’t”, whereas a portmanteau word is formed by combining two or more existing words that all relate to a singular concept. A portmanteau also differs from a compound, which does not involve the truncation of parts of the stems of the blended words. For instance, “starfish” is a compound, not a portmanteau, of “star” and “fish”; whereas a hypothetical portmanteau of “star” and “fish” might be “stish”. A portmanteau of “coyote” and “wolf” is “coywolf”.

Postulate (Axiom)

See Proofs.


Pragmatics is a subfield of linguistics and semiotics that studies the ways in which context contributes to meaning. Pragmatics encompasses speech act theory, conversational implicature, talk in interaction and other approaches to language behavior in philosophy, sociology, linguistics and anthropology. Unlike semantics, which examines meaning that is conventional or “coded” in a given language, pragmatics studies how the transmission of meaning depends not only on structural and linguistic knowledge (e.g., grammar, lexicon, etc.) of the speaker and listener, but also on the context of the utterance, any pre-existing knowledge about those involved, the inferred intent of the speaker, and other factors. In this respect, pragmatics explains how language users are able to overcome apparent ambiguity, since meaning relies on the manner, place, time, etc. of an utterance.

  • The ability to understand another speaker’s intended meaning is called pragmatic competence.

Pragmatics is:

  • the study of the practical aspects of human action and thought.
  • the study of the use of linguistic signs, words and sentences, in actual situations.[1]

Pragmatics outlines the study of meaning in the interactional context. It looks beyond the literal meaning of an utterance and considers how meaning is constructed as well as focusing on implied meanings. It considers language as an instrument of interaction, what people mean when they use language and how we communicate and understand each other.

Pragmatics considers:

  • the negotiation of meaning between speaker and listener.
  • the context of the utterance.
  • the meaning potential of an utterance.

What would happen to language if Pragmatics did not exist? Pragmatics acts as the basis for all language interactions and contact. It is a key feature to the understanding of language and the responses that follow this. Therefore, without the function of Pragmatics, there would be very little understanding of intention and meaning.


See also Linguistics | Semantics.

Pretrained models

Pretrained models are models or model components (such as word or language embeddings) that have been already been trained. Sometimes, pretrained embeddings are fed into a neural network. [Contrast this with the more common situation where the model trains the embeddings de novo, rather than relying on the pretrained embeddings.] The basic steps are:

  1. You have machine learning model $\small m$.

  2. Pretraining: You have a dataset $\small A$ on which you train $\small m$.

  3. You have a dataset $\small B$. Before you start training the model, you initialize some of the parameters of $\small m$ with the model, which was trained on $\small A$.

  4. Fine-tuning: you train $\small m$ on $\small B$.

Pretrained models: word embeddings & language models

Collobert and Weston demonstrated the power of pretrained word embeddings as a highly effective tool when used in downstream tasks in their 2008 paper A Unified Architecture for Natural Language Processing, while also announcing a neural network architecture upon which many current approaches are built. It was Mikolov et al. (2013), however, who popularized word embedding through the introduction of word2vec, a toolkit enabling the training and use of pretrained embeddings (Efficient Estimation of Word Representations in Vector Space).

A particularly exciting recent advance in NLP is the development of pretrained language models such as ELMo (released in February 2018 by Allen NLP), ULMFiT (May 2018 by and Aylien Ltd.), and OpenAI Transformer (June 2018 by OpenAI). Those papers demonstrated that pretrained language models can achieve state of the art results on a wide range of NLP tasks. Language models provide an improved understanding of the semantic content and latent (hidden) relationships in documents.

For an introductory discussion of the potential impact of those models. e.g. on transfer learning (below), see:

See also:

Pretrained models: transfer learning

Note also the links provided immediately above.

Transfer Learning (a subfield of which is domain adaptation) is the reuse of a pretrained model on a new problem. In transfer learning, the knowledge of an already trained machine learning model is applied to a different but related problem. For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the knowledge that the model gained during its training to recognize other objects like sunglasses. With transfer learning, we basically try to exploit what has been learned in one task to improve generalization in another. We transfer the weights that a network has learned at Task A to a new Task B. The general idea is to use knowledge, that a model has learned from a task where a lot of labeled training data is available, in a new task where we don’t have a lot of data. Instead of starting the learning process from scratch, you start from patterns that have been learned from solving a related task.

Although multitask learning and transfer learning have similarities, they are not the same. Transfer learning only aims at achieving high performance in the target task by transferring knowledge from the source task, while multitask learning tries to learn the target and the source tasks simultaneously.

Whereas in transfer learning we mainly care about doing well on our target task or domain, in multitask learning the objective is to do well on all available tasks (An Overview of Multi-Task Learning in Deep Neural Networks). Alternatively, we can also use the knowledge acquired by learning from related tasks to do well on a target. Crucially, in contrast to transfer learning, some labeled data is usually assumed for each task. In addition, models are trained jointly on source and target task data, which is not the case for all transfer learning scenarios. However, even if target data is not available during training, insights about tasks that are beneficial for multitask learning can still inform transfer learning decisions.

Probabilistic programming

A probabilistic programming language (PPL) is a programming language designed to describe probabilistic models and then perform inference in those models. PPLs are closely related to graphical models and Bayesian networks, but are more expressive and flexible. Probabilistic programming represents an attempt to unify general purpose programming with probabilistic modeling.

Probabilistic reasoning is a foundational technology of machine learning. Probabilistic reasoning has been used for predicting stock prices, recommending movies, diagnosing computers, detecting cyber intrusions and image detection.

A probabilistic relational programming language (PRPL) is a PPL specially designed to describe and infer with probabilistic relational models (PRMs). A PRM is usually developed with a set of algorithms for reducing, inference about and discovery of concerned distributions, which are embedded into the corresponding PRPL.

An Introduction to Statistical Relational Learning (part 2)  [local copy] discusses probabilistic programming, and mentions the Alchemy  system, from the University of Washington.


[Image source. Click image to open in new window.]

Alchemy is a software package providing a series of algorithms for statistical relational learning and probabilistic logic inference, based on the Markov logic representation. Alchemy allows you to easily develop a wide range of AI applications, including:

  • collective classification;
  • link prediction;
  • entity resolution;
  • social network modeling; and,
  • information extraction.

Alchemy Tutorial  [local copy]

See also:


See also:

Relevant logic notation:

  • $\small \exists$ : there exists at least one
  • $\small \forall$ : for all
  • $\small \neg$ : not (logical not)
  • $\small \lor$ : or (logical or)
  • $\small \land$ : and (logical and)
  • $\small \sim$ : is similar to
  • $\small \rightarrow$ : implies
  • $\small \leftarrow$ : is implied by

  • In Prolog, the symbol $\small \text{ :- }$, sometimes called a turnstile , is pronounced “ if “.

Prolog Turing-complete logic programming language associated with artificial intelligence and computational linguistics. Prolog has its roots in first-order logic, a formal logic, and unlike many other programming languages, Prolog is intended primarily as a declarative programming language: the program logic is expressed in terms of relations, represented as facts and rules. A computation is initiated by running a query over these relations. Prolog is well-suited for specific tasks that benefit from rule-based logical queries such as searching databases, voice control systems, and filling templates.

The following is excerpted from Quick-‘n’-Dirty Prolog Tutorial  [local copy].

Background: Prolog, whose name is from the phrase “PROgramming in LOGic,” is a special purpose language. At the heart of Prolog is a logical inference engine that, based on the data supplied by the programmer, generates answers to questions posed by the user. Unlike procedural languages (e.g., Java), Prolog is a language in which statements are logical expressions.

gprolog  (GNU Prolog) is an opensource implementation of Prolog.

Prolog Program Components: Technically, you don’t create a Prolog program; you create a Prolog database. In the database are two types of clauses: facts and rules.

A fact is a single piece of information. To represent the fact that the sky is blue, you could use the clause $\small \text{blue(sky).}$ . Similarly, $\small \text{mammal(rabbit).}$ says that rabbits are mammals. Those are both examples of single argument, or unary predicates. Facts can have multiple arguments; for example, $\small \text{plays(john,hockey).}$ indicates that john plays hockey.

  • Note that Prolog constants are in lower case, variables are in upper case, domains are not defined explicitly, and entries always end with a period.

Rules are used to generate new information from facts, other rules, and even themselves. Rules have the form $\small \text{head :- body.}$ , where the head and the body are clauses that typically use variables instead of constants (because rules are used to define general relationships).

For example, consider this rule: $\small \text{grandparent(X,Z) :- parent(X,Y), parent(Y,Z).}$ . The rule says that $\small \text{X}$ is the grandparent of $\small \text{Z}$ when $\small \text{X}$ is a parent of $\small \text{Y}$ and $\small \text{Y}$ is a parent of $\small \text{Z}$.

  • The comma represents a logical $\small \text{AND}$, the parent() clauses on either side are called subgoals, and $\small \text{X}$, $\small \text{Y}$, & $\small \text{Z}$ are variables. Remember, variables are upper case.

Rules may be recursive. Consider: $\small \text{ancestor(X,Y) :- parent(Z,Y) , ancestor(X,Z).}$ . This says that $\small \text{X}$ is an ancestor of $\small \text{Y}$ when $\small \text{Y}$ has a parent $\small \text{Z}$ and $\small \text{X}$ is an ancestor of $\small \text{Z}$. If that’s confusing, plug in your name for $\small \text{Y}$, one of your parent’s names for $\small \text{Z}$ (your mom, for example), and one of your mom’s parent’s names for $\small \text{X}$.

Notice that we don’t explain to Prolog how to work with these rules. That’s its problem, not ours. We just have to supply a logically complete database, and Prolog will take care of deciding which rules to use, and when, to answer a given question.

This brings us to the third type of clause: The query. When you want to ask Prolog a question, you supply a query . You can think of a query as being a rule without a body. Prolog will take that head and will attempt to find out if it is true or false. If the query has variables, Prolog will attempt to find all possible values that can be used in place of the variables to make the query true. Better yet, it will tell you what they are.

[ … SNIP! … ]

Connecting Prolog to Predicate Logic

You no doubt remember that in logic, we like to represent predicates with capital letters (e.g., $\small E(x,y)$ represents $\small \text{x eats y}$ (where $\small E$ represents $\small \text{eats}$).

    Consider this implication: $\small (P(x,y) \land P(y,z)) \rightarrow G(x,z)$.

    This says: If $\small P(x,y)$ is true and $\small P(x,z)$ is also true, then $\small G(x,z)$ is true.

This is exactly what our grandparent rule says in Prolog! The Prolog syntax is just backwards. Think of the $\small \text{ :- }$ symbol in Prolog as representing the word “if”, and the connection becomes clear.

Internally, Prolog represents rules in a form known as a Horn clause. A Horn clause is a disjunction of predicates in which at most one of the predicates is not negated. It sounds odd, but it matches well with Prolog’s needs. Consider the grandparent clause again:

    $\small \text{grandparent(X,Z) :- parent(X,Y), parent(Y,Z)}$.

which can be rewritten in logical notation, including quantification:

    $\small \forall x \forall y \forall z ((P(x,y) \land P(y,z)) \rightarrow G(x,z))$

We know that $\small \textbf{p $\rightarrow$ q}$ is logically equivalent to $\small \neg p \lor q$, and so the above is equivalent to:

    $\small \forall x \forall y \forall z (\neg (P(x,y) \land P(y,z)) \lor G(x,z))$

And De Morgan’s Laws tell us that we can replace the $\small \text{AND}$ with an $\small \text{OR}$ as the negation is pulled through, producing a Horn clause:

    $\small \forall x \forall y \forall z (\neg P(x,y) \lor \neg P(y,z) \lor G(x,z))$

Here’s why having a Horn clause is important: Prolog has facts in its database about parents. By employing a simplified version of the resolution rule of inference, Prolog can use a fact to remove a predicate from the Horn clause and get closer to an answer. For example, for the query

    $\small \text{grandparent(hank,X)}$

the resulting Horn clause would be

    $\small \text{not parent(hank,A) or not parent(A,B) or grandparent(hank,B)}$.

The database tells Prolog that $\small \text{parent(hank,ben)}$ is a fact, and Prolog can resolve that fact with the Horn clause if $\small \text{A = ben}$. That assignment is called unification  [local copycode], and the result is the Horn clause $\small \text{not parent(ben,B) or grandparent(hank,B)}$. Finding that ben $\small \text{carl}$ is the parent of carl $\small \text{ben}$ lets Prolog use resolution and unification once more to conclude that $\small \text{grandparent(hank,carl)}$ is true.

  • Note: there is an error on p. 5 (“Connecting Prolog to Predicate Logic”) in the source document (indicated/corrected above).

The following is excerpted from Horn clauses.

The following is an example of a (definite) Horn clause:

    $\small \neg p \lor \neg q \lor \ldots \lor \neg t \lor u$

Such a formula can be rewritten in the following form, which is more common in logic programming and similar fields:

    $\small (p \land q \land \ldots \land t) \rightarrow u$

In logic programming a definite clause behaves as a goal-reduction procedure. For example, the Horn clause written above behaves as the procedure:

    to show $\small u$, show $\small p$ and show $\small q$ and ... and show $\small t$ .

To emphasize this reverse use of the clause, it is often written in the reverse form:

    $\small u \leftarrow (p \land q \land \ldots \land t)$

In Prolog this is written as:

    $\small u \text{ :- } p, q, \dots, t$.

Prolog – see also:

Prolog – graph related:


See Proofs.


Here I summarize/collate several web sources.

Axiom: see Postulate.


  • An assertion that is then proved. It is often used like an informal lemma. [3]


  • This is an educated prediction that one makes based on their experience. The difference between a conjecture and a lemma/theorem/corollary is that it is usually an open research problem that either has no answer, or some partial answer.

    Conjectures are usually only considered important if they are authored by someone well known in their respective area of mathematics. Once it is proved or disproved, it ceases to be a conjecture and either becomes a fact (backed by a theorem) or there is some interesting counterexample to demonstrate how it is wrong. [1]

    Example: The Poincaré’ conjecture was a famous statement that remained an open research problem in topology for roughly a century. The claim was that every simply connected, compact 3-manifold was homeomorphic to the 3-sphere S3. This statement however is no longer a conjecture since it was famously proven by Grigori Perelman in 2003. [1]

  • A statement that is unproved, but is believed to be true (Collatz conjecture, Goldbach conjecture, twin prime conjecture). [3]


  • This is usually a result that is a direct consequence of a major theorem. Often times a theorem lends itself to other smaller results or special cases which can be shown by simpler methods once a theorem is proven. [1]

    Example: a consequence to the Hopf-Rinow theorem is that compact manifolds are geodesic complete. [1]

  • A result in which the (usually short) proof relies heavily on a given theorem (we often say that “this is a corollary of Theorem A”). [3]


  • This is an assignment of language and syntax to some property of a set, function, or other object. A definition is not something you prove, it is something someone assigns. Often you will want to prove that something satisfies a definition. [1]

    Example: we call a mapping $\small f : X \rightarrow Y$ injective if whenever $\small f(x) = f(y)$ then $\small x = y$. [1]

  • A precise and unambiguous description of the meaning of a mathematical term. It characterizes the meaning of a word by giving all the properties and only those properties that must be true. [3]



  • This is a property that one can derive or prove which is usually technical in nature and is not of primary importance to the overall body of knowledge one is trying to develop. Usually lemmas are there as precursors to larger results that one wants to obtain, or introduce a new technique or tool that one can use over and over again. [1]

    Example: in a Hausdorff space, compact subsets can be separated by disjoint open subsets. [1]

  • A minor result whose sole purpose is to help in proving a theorem. It is a stepping stone on the path to proving a theorem. Very occasionally lemmas can take on a life of their own (Zorn’s lemma, Urysohn’s lemma, Burnside’s lemma, Sperner’s lemma). [3]


  • A statement that can be shown, using a given set of axioms and definitions, to be both true and false. Paradoxes are often used to show the inconsistencies in a flawed theory (Russell’s paradox). The term paradox is often used informally to describe a surprising or counterintuitive result that follows from a given set of rules (Banach-Tarski paradox, Alabama paradox, Gabriel’s horn). [3]

Postulate (Axiom)

  • I would appreciate community input on this, but I haven’t seen this word used in any of the texts/papers I read. I would assume that this is synonymous with proposition. [1]

    I know Postulate is a synonym of axiom. Very used word in italian, but more in physics than mathematics. See Wikipedia:Axiom [1]

  • An axiom or postulate is a statement that is taken to be true, to serve as a premise or starting point for further reasoning and arguments. The word comes from the Greek axíōma (ἀξίωμα), “that which is thought worthy or fit” or “that which commends itself as evident.” [2]

  • A statement that is assumed to be true without proof. These are the basic building blocks from which all theorems are proved (Euclid’s five postulates, Zermelo-Fraenkel axioms, Peano axioms). [3]


  • In mathematics, a proof is an inferential argument for a mathematical statement. In the argument, other previously established statements, such as theorems, can be used. In principle, a proof can be traced back to self-evident or assumed statements, known as axioms, along with accepted rules of inference. Axioms may be treated as conditions that must be met before the statement applies. [4]

    Proofs are examples of exhaustive deductive reasoning or inductive reasoning and are distinguished from empirical arguments or non-exhaustive inductive reasoning (or “reasonable expectation”). A proof must demonstrate that a statement is always true (occasionally by listing all possible cases and showing that it holds in each), rather than enumerate many confirmatory cases. An unproved proposition that is believed to be true is known as a conjecture. [4]

    Proofs employ logic but usually include some amount of natural language which usually admits some ambiguity. In fact, the vast majority of proofs in written mathematics can be considered as applications of rigorous informal logic. Purely formal proofs, written in symbolic language instead of natural language, are considered in proof theory. The distinction between formal and informal proofs has led to much examination of current and historical mathematical practice, quasi-empiricism in mathematics, and so-called folk mathematics (in both senses of that term). The philosophy of mathematics is concerned with the role of language and logic in proofs, and mathematics as a language. [4]

  • Concept of scientific proof.

    While the phrase “scientific proof” is often used in the popular media, many scientists have argued that there is really no such thing. For example, Karl Popper once wrote that “In the empirical sciences, which alone can furnish us with information about the world we live in, proofs do not occur, if we mean by ‘proof’ an argument which establishes once and for ever the truth of a theory”. [5] Albert Einstein said:

    “The scientific theorist is not to be envied. For Nature, or more precisely experiment, is an inexorable and not very friendly judge of his work. It never says ‘Yes ’ to a theory. In the most favorable cases it says ‘Maybe ,’ and in the great majority of cases simply ‘No .’ If an experiment agrees with a theory it means for the latter ‘Maybe ,’ and if it does not agree it means ‘No. ’ Probably every theory will someday experience its ‘No’  – most theories, soon after conception.” [5]


  • This is a property that one can derive easily or directly from a given definition of an object. [1]

    Example: the identity element in a group is unique. [1]

  • A proved and often interesting result, but generally less important than a theorem. [3]


  • This is a property of major importance that one can derive which usually has far-sweeping consequences for the area of math one is studying. Theorems don’t necessarily need the support of propositions or lemmas, but they often do require other smaller results to support their evidence. [1]

    Example: every manifold has a simply connected covering space. [1]

  • A mathematical statement that is proved using rigorous mathematical reasoning. In a mathematical paper, the term theorem is often reserved for the most important results. [3]

“From a logical point of view, there is no difference between a lemma, proposition, theorem, or corollary – they are all claims waiting to be proved. However, we use these terms to suggest different levels of importance and difficulty.

  • “A lemma is an easily proved claim which is helpful for proving other propositions and theorems, but is usually not particularly interesting in its own right.

  • “A proposition is a statement which is interesting in its own right, while a theorem is a more important statement than a proposition which says something definitive on the subject, and often takes more effort to prove than a proposition or lemma.

  • “A corollary is a quick consequence of a proposition or theorem that was proven recently.”

Source: Footnote 4, p. 25 in Terence Tao, “Analysis I”, 3rd ed. (2015).

Proofs: Sources

[1] Terminology: Difference between Lemma, Theorem, Definition, Hypothesis, Postulate and a Proposition
[2] Axiom
[3] What is the difference between a theorem, a lemma, and a corollary?  [local copy]
[4] Mathematical proof
[5] Concept of scientific proof

See also: Scientific method.


See Proofs.


Provenance is a reference to literature from which a statement or its supporting evidence were derived.

Provenance (from the French provenir, “to come from/forth”) is the chronology of the ownership, custody or location of a historical object. The term was originally mostly used in relation to works of art but is now used in similar senses in a wide range of fields, including archaeology, paleontology, archives, manuscripts, printed books and science and computing. The primary purpose of tracing the provenance of an object or entity is normally to provide contextual and circumstantial evidence for its original production or discovery, by establishing, as far as practicable, its later history, especially the sequences of its formal ownership, custody and places of storage. The practice has a particular value in helping authenticate objects. Comparative techniques, expert opinions and the results of scientific tests may also be used to these ends, but establishing provenance is essentially a matter of documentation. The term dates to the 1780s in English. Provenance is conceptually comparable to the legal term chain of custody.

In a technological solution, blockchains can be used to ensure data provenance:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Quadratic formula

For $\small ax^2 + bx + c = 0$, the values of $\small x$ which are the solutions of the equation are given by:

$\small x = \dfrac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

For the quadratic formula to work, you must have your equation arranged in the form “$\small \text{(quadratic)} = 0$”. Also, the “$\small 2a$” in the denominator of the formula is underneath everything above, not just the square root.


Solve $\small x^2 + 3x - 4 = 0$

This quadratic happens to factor as

$\small x^2 + 3x - 4 = (x + 4)(x - 1) = 0$

so the solutions are $\small x = -4$ and $\small x = 1$. However, using the quadratic formula with $\small a = 1$, $\small b = 3$, and $\small c = -4$:

$\small x = \dfrac{-(3) \pm \sqrt{(3)^2 - 4(1)(-4)}}{2(1)}$ $\small = \dfrac{-3 \pm \sqrt{9 + 16\,}}{2} = \dfrac{-3 \pm \sqrt{25}}{2}$ $\small = -4,\, 1$

The solution is $\small x = -4$ and $\small x = 1$.

Question answering

Question answering (QA), a computer science discipline within the fields of information retrieval and natural language processing (NLP), is concerned with building systems that automatically answer questions posed by humans in a natural language. A QA implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, QA systems can pull answers from an unstructured collection of natural language documents.

Some examples of natural language document collections used for QA systems include:

  • a local collection of reference texts
  • internal organization documents and web pages
  • compiled newswire reports
  • a set of Wikipedia pages
  • a subset of World Wide Web pages

Question answering (QA), the identification of short accurate answers to users questions presented in natural language, has numerous applications in the biomedical and clinical sciences including directed search, interactive learning and discovery, clinical decision support, and recommendation. Due to the large size of the biomedical literature and a lack of efficient searching strategies, researchers and medical practitioners often struggle to obtain available information available that is necessary for their needs. Moreover, even the most sophisticated search engines are not intelligent enough to interpret clinicians questions. Thus, there is an urgent need for information retrieval systems that accept queries in natural language and return accurate answers quickly and efficiently.

Question answering (a natural language understanding problem) and reading comprehension (the task of answering a natural language question about a paragraph) are of considerable interest in NLP motivating, for example, the Human-Computer Question Answering Competition (in the NIPS 2017 Competition Track), and the BioASQ Challenge in the BioNLP domain. Unlike generic text summarization, reading comprehension systems facilitate the answering of targeted questions about specific documents, efficiently extracting facts and insights (How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks).

The Stanford Question Answering Dataset (SQuAD: developed at Stanford University) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment or span of text from the corresponding reading passage (or, the question might be unanswerable). There has been a rapid progress on the SQuAD dataset, and early in 2018 engineered systems started achieving and surpassing human level accuracy on the SQuAD1.1 Leaderboard  (discussion: AI Outperforms Humans in Question Answering: Review of three winning SQuAD systems). SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written by crowdworkers to look adversarially similar to answerable ones (ACL 2018 Best Short Paper: Know What You Don’t Know: Unanswerable Questions for SQuAD;  [project/code]). To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Real coordinate space

In mathematics, real coordinate space of $\small n$ dimensions, written $\small \mathbf{R}^n$ (also written $\small \mathbb{R}^n$ with blackboard bold) is a coordinate space that allows several ($\small n$) real variables to be treated as a single variable. With various numbers of dimensions (sometimes unspecified), $\small \mathbb{R}^n$ is used in many areas of pure and applied mathematics, as well as in physics. With component-wise addition and scalar multiplication, it is the prototypical real vector space and is a frequently used representation of Euclidean $\small n$-space. Due to the latter fact, geometric metaphors are widely used for $\small \mathbb{R}^n$, namely a plane for $\small \mathbb{R}^2$ and three-dimensional space for $\small \mathbb{R}^3$.


[Image source. Click image to open in new window.]

Definition and uses

For any natural number $\small n$, the set $\small \mathbb{R}^n$ consists of all $\small n$-tuples of real numbers ($\small \mathbb{R}$). It is called (the) “$\small n$-dimensional real space”. Depending on its construction from $\small n$ instances of the set $\small \mathbb{R}$, it inherits some of the latter’s structure, notably:

  • When defined as the direct sum of vector spaces, addition and scalar multiplication are defined on $\small \mathbb{R}^n$ [refer here];

  • $\small \mathbb{R}^n$ is a topological space [refer here].

An element of $\small \mathbb{R}^n$ is written

    $\small \mathbf {x} = (x_1, x_2, \ldots, x_n)$

where each $\small x_i$ is a real number.

For each $\small n$ there exists only one $\small \mathbb{R}^n$, the real $\small n$-space.

Purely mathematical uses of $\small \mathbb{R}^n$ can be roughly classified as follows, although these uses overlap.

[ … snip … ]

Matrix notation

In standard matrix notation, each element of $\small \mathbb{R}^n$ is typically written as a column vector

    $\small \mathbf{x} = \begin{bmatrix} x_1 \\\ x_2 \\\ \vdots \\\ x_n \end{bmatrix}$

and sometimes as a row vector:

    $\small \mathbf {x} = \begin{bmatrix} x_1 & x_2 & \dots & x_n \end{bmatrix}$.

The coordinate space $\small \mathbb{R}^n$ may then be interpreted as the space of all $\small [n × 1]$ column vectors, or all $\small [1 × n]$ row vectors with the ordinary matrix operations of addition and scalar multiplication.

[ … snip … ]

Miscellaneous notes:

  • The superscripts in $\small \mathbb{R}^\color{Brown}{\mathbf{n}}$, $\small \mathbb{R}^\color{Brown}{\mathbf{k}}$ and $\small \mathbb{R}^\color{Brown}{\mathbf{p}}$ refer to the real coordinate space (“dimensions”) of these matrices of matrices of real numbers (respectively) in input, hidden and output layers.

  • Similarly (elsewhere), $\small \mathbb{R}^{m \times n}$ indicates a matrix of real coordinate space (“dimensions”) $\small m\times n$.

  • $\small \in$ denotes “element of” (in set membership: “member of”).

  • In mathematics, the set of real numbers ($\small \mathbb{R}$) are the values of a continuous quantity that can represent a distance along a line; they include rational numbers ($\small \mathbb{Q}$), integers ($\small \mathbb{Z}$), and natural numbers ($\small \mathbb{N}$).

  • Elements of $\small \mathbb{R}^n$ are vectors. In other words, we can consider each element of $\small \mathbb{R}^n$ (the tuple of $n$ real numbers) to be a vector. $\small \mathbb{R}^n$ is more abstract than polynomials; for example,

    $\ \ \ \ \ \ \small a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3$

is an example of a triplet of numbers in three-dimensional real coordinate space. Adding two vectors $\small a, b ∈ \mathbb{R}^n$ component wise results in another vector: $\small a + b = c \in \mathbb{R}^n$ . Moreover, multiplying $\small a \in \mathbb{R}^n$ by $\small \lambda \in \mathbb{R}$ results in a scaled vector $\small \lambda a \in \mathbb{R}^n$. Linear algebra focuses on the similarities between these vector concepts; we can add them together, and multiply them by scalars. We largely focus on vectors in $\small \mathbb{R}^n$ since most algorithms in linear algebra are formulated in $\small \mathbb{R}^n$. Recall that in machine learning, we often consider data to be represented as vectors in $\small \mathbb{R}^n$. [Source: Linear Algebra]

  • What is “Real coordinate space”?

    • “In my experience the expression ‘real coordinate space’ emphasizes that we are not working over the complex numbers, i.e. the space is $\small \mathbb{R}^n$, not $\small \mathbb{C}^n$. You can use Cartesian coordinates (and a whole bunch of other coordinate systems) on these spaces. The spaces $\small \mathbb{R}^n$ are called Euclidean spaces, so they are the same as real coordinate spaces.”

  • What’s the difference between the Cartesian coordinate system, the Euclidean space, and the real coordinate space?

    • Euclidean space is a way to define a geometric system. It is based on the following assumptions that cannot be proven, or postulated:

    • A straight line can be drawn joining any two points.
    • Any straight line segment can be extended indefinitely in a straight line.
    • Given any straight line segment, a circle can be drawn having the segment as radius and one endpoint as center.
    • All right angles are congruent.

    There are a number of coordinate systems that help “locate” or identify objects or elements in the space. A Cartesian coordinate system, is defined by a 3-tuple {x,y,z} based on 3 axes that are each perpendicular to one another, and any position can be defined by these 3 coordinates.

    Real coordinate space merely extends the Cartesian system to multiple dimensions. For example, there can be $\small n$-tuples represented by matrices, where the properties of Cartesian coordinates hold.

Recurrent neural networks (RNN)


[Folded, unfolded RNN (click image to open in new window)]

At each time step $\small t$, a RNN reads an input vector $\small \mathbf{x}_t$ into a hidden state vector $\small \mathbf{h}_t$ and predicts an output vector $\small \mathbf{y}_t$ [shown as $\small \mathbf{o}_t$ in the diagram above]. The state dynamic can be abstracted as a recurrent relation: $\small \mathbf{h}_t = RNN(\mathbf{h}_{t-1}, \mathbf{x}_t)$. The vanilla RNN is parameterized as follows:

    $\small \mathbf{h}_t = \sigma (W_h \mathbf{h}_{t-1} + V\mathbf{x}_t + \mathbf{b}_h)$

    $\small \mathbf{y}_t = W_y \mathbf{h}_t + \mathbf{b}_y$

where $\small (W_h, W_y, V, \mathbf{b}_h, \mathbf{b}_y)$ are learnable parameters, and $\small \sigma$ is a point-wise nonlinear function.

Andrej Karpathy provides a great example of a RNN, that should make it easier to visualize the working and use of a RNN:


[image source (click image to open in new window)]

One last thing to note – the weights of the connections between time steps are shared i.e. there isn’t a different set of weights for each time step. This is discussed here, Why are the weights of RNN/LSTM networks shared across time?:

  • “The ‘shared weights’ perspective comes from thinking about RNNs as feedforward networks unrolled across time. If the weights were different at each moment in time, this would just be a feedforward network [ANN, above]. But, I suppose another way to think about it would be as an RNN whose weights are a time-varying function (and that could let you keep the ability to process variable length sequences). If you did this, the number of parameters would grow linearly with the number of time steps. That would be a big explosion of parameters for sequences of any appreciable length. …”


[image sourcediscussed here. (Click image to open in new window.)]

Alex Graves shows a variation of a RNN, above, that better illustrates some additional key concepts (Generating Sequences with Recurrent Neural Networks).

  • “Fig. 1 illustrates the basic recurrent neural network prediction architecture used in this paper. An input vector sequence $\small \mathbf{x} = (x_1, \dots, x_T)$ is passed through weighted connections to a stack of $\small N$ recurrently connected hidden layers to compute first the hidden vector sequences $\small \mathbf{h}^n = (h{n \atop 1}, \dots, h{n \atop T})$ and then the output vector sequence $\small \mathbf{y} = (y_1, \dots, y_T)$. Each output vector $\small y_t$ is used to parameterise a predictive distribution $\small \text{Pr}(x_{t+1} \vert y_t)$ over the possible next inputs $\small x_{t+1}$. The first element $\small x_1$ of every input sequence is always a null vector whose entries are all zero; the network therefore emits a prediction for $\small x_2$, the first real input, with no prior information. The network is “deep” in both space and time, in the sense that every piece of information passing either vertically or horizontally through the computation graph will be acted on by multiple successive weight matrices and nonlinearities.

    “Note the ‘skip connections’ from the inputs to all hidden layers, and from all hidden layers to the outputs. These make it easier to train deep networks, by reducing the number of processing steps between the bottom of the network and the top, and thereby mitigating the “vanishing gradient” problem. In the special case that $\small N = 1$ the architecture reduces to an ordinary, single layer next step prediction RNN.”


[image source (click image to open in new window)]


[image source (click image to open in new window)]

The following material is excerpted from the main body in A Beginner’s Guide to LSTMs.

“In the case of feedforward networks, input examples are fed to the network and transformed into an output; with supervised learning, the output would be a label, a name applied to the input. That is, they map raw data to categories, recognizing patterns that may signal, for example, that an input image should be labeled “cat” or “elephant.” A feedforward network is trained, for example, on labeled images until it minimizes the error it makes when guessing their categories. The trained set of parameters (or weights) are collectively known as a model. A feedforward network has no notion of temporal order (“time”): the only input it considers is the current example upon which it is working.

“RNN recognize patterns in sequences of data such as text, genomes, handwriting, the spoken word, numerical times series data emanating from sensors, stock markets and government agencies. RNNs are even applicable to images, which can be decomposed into a series of patches and treated as a sequence. RNN take as their input not just the current input example they see, but also what they have perceived previously in time. Since these algorithms take time and sequence into account, they have a temporal dimension. The decision a recurrent net reached at time step $\small t-1$ affects the decision it will reach one moment later at time step $\small t$. RNN therefore have two sources of input, the present and the immediate past, which combine to determine how they respond to new data.

“Recurrent networks are distinguished from feedforward networks by a feedback loop connected to their past decisions, ingesting their own outputs moment after moment as input. It is often said that recurrent networks have memory. That sequential information is preserved in the recurrent network’s hidden state, which manages to span many time steps as it cascades forward to affect the processing of each new example. Mathematically, the process of carrying the memory forward is:

    $\small \mathbf{h}_t = \phi(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1})$

“The hidden state at time step $\small t$ is $\small h_t$. It is a function of the input at the same time step $\small x_t$, modified by a weight matrix $\small W$ (like the one we used for feedforward nets) added to the hidden state of the previous time step $\small h_{t-1}$ multiplied by its own hidden-state-to-hidden-state matrix $\small U$, otherwise known as a transition matrix and similar to a Markov chain. The weight matrices are filters that determine how much importance to accord to both the present input and the past hidden state. The error they generate will return via backpropagation and be used to adjust their weights until error can’t go any lower.

“The sum of the weight input and hidden state is squashed by the function $\small \phi$ – either a logistic sigmoid function or tanh, depending – which is a standard tool for condensing very large or very small values into a logistic space, as well as making gradients workable for backpropagation. Because this feedback loop occurs at every time step in the series, each hidden state contains traces not only of the previous hidden state, but also of all those that preceded $\small h_{t-1}$ for as long as memory can persist.”

The most basic or “vanilla” RNN consists of the following set of equations (indexed by time-step $t$), that appear various forms in the literature and the web:

    $\small \mathbf{h}_t = \sigma (\mathbf{U}_h \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1})$

    $\small \mathbf{y}_t = \mathbf{O} \mathbf{h}_t$


  • $\small \mathbf{x}_t \in \mathbb{R}^n$ is the input of the RNN;

  • $\small \mathbf{h}_t \in \mathbb{R}^k$ is called the hidden state of the RNN, and acts as a memory of the current state the network. When starting a sequence, it is set to the all zero vector ($\small \mathbf{h}_{-1} = 0$).

  • $\small \mathbf{y}_t \in \mathbb{R}^p$ is the output of the RNN;

  • The logistic function (here, the sigmoid function; see Recent Advances in Recurrent Neural Networks for other forms), applied component-wise:

$\ \ \ \ \ \ \small \sigma(x) = \frac{1}{1 + \mathcal{e}^{(-x)}}$

  • $\small \mathbf{U}_{h}, \mathbf{W}_{h}, \mathbf{O}$ are the network’s parameters. [Note that “$\small \mathbf{O}$”, here, is “$\small \mathbf{V}$” elsewhere in this post; i.e. the weight matrices of the “skip connections from the hidden layers to the outputs.]

The output of such a neural network depends on both the input $\small \mathbf{x}_t$ and the hidden state $\small \mathbf{h}_t$.


  • Source for the above: An Intrinsic Difference Between Vanilla RNNs and GRU Models plus my own notes (below).

  • Undefined in that paper, I believe that I correctly interpret/describe these:

    • The superscripts in $\small \mathbb{R}^\color{Brown}{\mathbf{n}}$, $\small \mathbb{R}^\color{Brown}{\mathbf{k}}$ and $\small \mathbb{R}^\color{Brown}{\mathbf{p}}$ refer to the real coordinate space (“dimensions”) of these matrices of matrices of real numbers (respectively) in input, hidden and output layers.

      Similarly (elsewhere), $\small \mathbb{R}^{m \times n}$ indicates a matrix of dimensions $\small m\times n$.

  • $\small \in$ denotes “element of” (in set membership: “member of”).

  • In mathematics, the set of real numbers ($\small \mathbb{R}$) are the values of a continuous quantity that can represent a distance along a line; they include rational numbers ($\small \mathbb{Q}$), integers ($\small \mathbb{Z}$), and natural numbers ($\small \mathbb{N}$).

  • Elements of $\small \mathbb{R}^n$ are vectors. In other words, we can consider each element of $\small \mathbb{R}^n$ (the tuple of $n$ real numbers) to be a vector. $\small \mathbb{R}^n$ is more abstract than polynomials; for example,

    $\ \ \ \ \ \ \small a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3$

    is an example of a triplet of numbers in three-dimensional real coordinate space. Adding two vectors $\small a, b ∈ \mathbb{R}^n$ component wise results in another vector: $\small a + b = c \in \mathbb{R}^n$ . Moreover, multiplying $\small a \in \mathbb{R}^n$ by $\small \lambda \in \mathbb{R}$ results in a scaled vector $\small \lambda a \in \mathbb{R}^n$. Linear algebra focuses on the similarities between these vector concepts; we can add them together, and multiply them by scalars. We largely focus on vectors in $\small \mathbb{R}^n$ since most algorithms in linear algebra are formulated in $\small \mathbb{R}^n$. Recall that in machine learning, we often consider data to be represented as vectors in $\small \mathbb{R}^n$. [Source: Linear Algebra]

  • In mathematics, the real coordinate space of $n$ dimensions, written $\small \mathbb{R}^n$ is a coordinate space that allows several ($\small n$) real variables to be treated as a single variable. With various numbers of dimensions (sometimes unspecified), $\small \mathbb{R}^n$ is used in many areas of pure and applied mathematics, as well as in physics. With component-wise addition and scalar multiplication, it is the prototypical real vector space and is a frequently used representation of Euclidean $\small n$-space. An element of $\small \mathbb{R}^n$ is written $\small x = (x_1, x_2, \ldots, x_n)$, where each $\small x_i$ is a real number.

Reduced row echelon

  • See discussion, here.

Reinforcement learning

Reinforcement learning (RL) is a branch of machine learning in which an agent learns from interacting with an environment; see:

Essentially, a RL framework allows an agent to learn from trial and error; the agent receives a reward by acting in the environment, and its goal is learning to select the actions that maximize the expected cumulative reward over time. Deep learning can be combined with RL to learn useful representations for problems with high dimensional raw data input.

Relation extraction

Relation extraction (RE) is a subproblem of IE that addresses the extraction of labeled relations between two named entities. Dependency parsing and phrase structure parsing may be combined for relation extraction. To minimize cascading errors, accurate sentence chunking (splitting) is required, prior to the dependency parsing step. See also: Information extraction;   Dependency parsing.

Representation learning

Representation learning is a set of techniques that learn a feature: a transformation of the raw data input to a representation that can be effectively exploited in machine learning tasks. While traditional unsupervised learning techniques are staples of machine learning, representation learning has emerged as an alternative approach to feature extraction (An Introduction to Representation Learning). In representation learning, features are extracted from unlabeled data by training a neural network on a secondary, supervised learning task. Word2vec is a good example of representation learning, simultaneously learning several language concepts:

  • the meanings of words;
  • how words are combined to form concepts (i.e., syntax); and,
  • how concepts relate to the task at hand.

See also:

Rounding numbers (digits)

See Rounding  [Significant figures (digits)]

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


The term syntax refers to grammatical structure whereas the term semantics refers to the meaning of the vocabulary symbols arranged with that structure. Grammatical (syntactically valid) does not imply sensible (semantically valid), however. For example, the grammatical sentence “cows flow supremely” is grammatically ok (subject verb adverb) in English, but makes no sense. [Source: What do “syntax” and “semantics” mean and how are they different?]

Semantics is used as a technical term for the meaning of words and sentences. Semantics and its understanding as a study of meaning covers most complex tasks like: finding synonyms, word sense disambiguation, constructing question-answering systems, translating from one natural language to another, and populating knowledge bases. Basically, one needs to complete morphological and syntactical analysis before trying to solve any semantic problem. [Source]

Semantics is about the manner in which lexical meaning is combined morphologically and syntactically to form the meaning of a sentence. Mostly, this is regular, productive and rule-governed; e.g. the meaning of “John gave Mary a do “can be represented as (SOME (X) (DOG X) & (PAST-TIME (GIVE (JOHN, MARY, X)))), but sometimes it is idiomatic as in the meaning of “John kicked the bucket”, which can be (PAST-TIME (DIE (JOHN))). (To make this notation useful we also need to know the meaning of these capitalised words and brackets too.) Because the meaning of a sentence is usually a productive combination of the meaning of its words, syntactic information is important for interpretation - it helps us work out what goes with what – but other information, such as punctuation or intonation, pronoun reference, etc, can also play a crucial part. [Source; Introduction to Linguistics for Natural Language Processing]


See also:

Semantic parsing

[See also: Syntactic/Dependency parsing.] Semantic dependencies are understood in terms of predicates and their arguments. The predicate of a sentence mostly corresponds to the main verb and any auxiliaries that accompany the main verb; the arguments of that predicate (e.g. the subject and object noun phrase) are outside the predicate. The arguments of a predicate are semantically dependent on that predicate. Often, semantic dependencies overlap with and point in the same direction as syntactic dependencies. At times, however, semantic dependencies can point in the opposite direction of syntactic dependencies, or they can be entirely independent of syntactic dependencies.

Semantic processing of sentence structure uses statistics or grammar rules to produce an electronic representation that delivers logical components (for example, a ‘noun phrase’), their roles (for example, the ‘subject’) and dependencies. Semantic parsing can thus be understood as extracting the precise meaning of an utterance.

Applications of semantic parsing include question answering. An example of the application of semantic parsing in biomedical question-answering is indicated in Fig. 3 in: Titov & Klementiev (2011) [pdf]:

Semantic parsing: biomedical QA
      (click image for full-size)

Semantic role labeling

Semantic role labeling (sometimes called shallow semantic parsing) is a process in natural language processing that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence. It consists of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles (i.e., automatically finding the semantic roles of each argument of each predicate in a sentence).

Significant figures (digits)

The significant figures (also known as the significant digits) of a number are digits that carry meaning contributing to its measurement resolution. This includes all digits except:

  • All leading zeros;

  • Trailing zeros when they are merely placeholders to indicate the scale of the number (exact rules are explained at identifying significant figures); and

  • Spurious digits introduced, for example, by calculations carried out to greater precision than that of the original data, or measurements reported to a greater precision than the equipment supports.

Concise rules

  • All non-zero digits are significant: 1, 2, 3, 4, 5, 6, 7, 8, 9.

    • 91 has two significant figures (9 and 1), while 123.45 has five significant figures (1, 2, 3, 4 and 5).

  • Zeros between non-zero digits are significant: 102, 2005, 50009.

    • 101.1203 has seven significant figures: 1, 0, 1, 1, 2, 0 and 3.

  • Leading zeros are never significant: 0.02, 001.887, 0.000515.

    • 0.00052 has two significant figures: 5 and 2.

  • In a number with a decimal point, trailing zeros (those to the right of the last non-zero digit) are significant: 2.02000, 5.400, 57.5400.

    • 12.2300 has six significant figures: 1, 2, 2, 3, 0 and 0. The number 0.000122300 still has only six significant figures (the zeros before the 1 are not significant). In addition, 120.00 has five significant figures since it has three trailing zeros.

  • In a number without a decimal point, trailing zeros may or may not be significant. More information through additional graphical symbols or explicit information on errors is needed to clarify the significance of trailing zeros.

Scientific notation

In most cases, the same rules apply to numbers expressed in scientific notation. However, in the normalized form of that notation, placeholder leading and trailing digits do not occur, so all digits are significant. For example, 0.00012 (two significant figures) becomes 1.2×10-4, and 0.00122300 (six significant figures) becomes 1.22300×10-3. In particular, the potential ambiguity about the significance of trailing zeros is eliminated. For example, 1300 to four significant figures is written as 1.300×103, while 1300 to two significant figures is written as 1.3×103.

Rounding numbers

Numbers are often rounded to avoid reporting insignificant figures. For example, it would create false precision to express a measurement as 12.34500 kg (which has seven significant figures) if the scales only measured to the nearest gram and gave a reading of 12.345 kg (which has five significant figures). Numbers can also be rounded merely for simplicity rather than to indicate a given precision of measurement, for example, to make them faster to pronounce in news broadcasts.

To round to $\small n$ significant figures:

  • Identify the significant figures before rounding. These are the n consecutive digits beginning with the first non-zero digit.

  • If the digit immediately to the right of the last significant figure is greater than 5 or is a 5 followed by other non-zero digits, add 1 to the last significant figure. For example, 1.2459 as the result of a calculation or measurement that only allows for 3 significant figures should be written 1.25.

    • If the number you are rounding is followed by 5, 6, 7, 8, or 9, round the number up. Example: 38 rounded to the nearest ten is 401

    • If the number you are rounding is followed by 0, 1, 2, 3, or 4, round the number down. Example: 33 rounded to the nearest ten is 30

      1.44 rounded to 2 sig. fig. is 1.4
      1.45 rounded to 2 sig. fig. is 1.4
      1.46 rounded to 2 sig. fig. is 1.5

      1.54 rounded to 2 sig. fig. is 1.5
      1.55 rounded to 2 sig. fig. is 1.6
      1.56 rounded to 2 sig. fig. is 1.6

  • If the digit immediately to the right of the last significant figure is a 5 not followed by any other digits or followed only by zeros, rounding requires a tie-breaking rule. For example, to round 1.25 to 2 significant figures:

    • Round half away from zero (also known as “5/4”) rounds up to 1.3. This is the default rounding method implied in many disciplines if not specified.

    • Round half to even, which rounds to the nearest even number, rounds down to 1.2 in this case. The same strategy applied to 1.35 would instead round up to 1.4.

      • Victoria: this is my preference: for numbers ending in 5, if the preceding number is even, round down; if the number preceding a trailing 5 is odd, round up. E.g.

          1.05 rounds to 1.0
          1.15 rounds to 1.2
          1.25 rounds to 1.2
          1.35 rounds to 1.4
          1.45 rounds to 1.4
          1.55 rounds to 1.6
          1.65 rounds to 1.6
          1.75 rounds to 1.8
          1.85 rounds to 1.8
          1.95 rounds to 2.0

  • Replace non-significant figures in front of the decimal point by zeros.

  • Drop all the digits after the decimal point to the right of the significant figures (do not replace them with zeros).

Skip-gram model

See Word embedding (figures).

Skip-Gram with Negative Sampling

Statistical relational learning

Statistical relational learning (sometimes called relational machine learning) is a subdiscipline of machine learning that is concerned with domain models that exhibit both uncertainty (which can be dealt with using statistical methods) and complex, relational structure. Typically, the knowledge representation formalisms developed in statistical relational learning use (a subset of) first-order logic to describe relational properties of a domain in a general manner (universal quantification) and draw upon probabilistic graphical models (such as Bayesian networks or Markov networks) to model the uncertainty; some also build upon the methods of inductive logic programming.

The field of statistical relational learning is not strictly limited to learning aspects; it is equally concerned with reasoning (specifically probabilistic inference) and knowledge representation. Therefore, alternative terms that reflect the main foci of the field include statistical relational learning and reasoning (emphasizing the importance of reasoning) and *first-order probabilistic languages (emphasizing the key properties of the languages with which models are represented).

See also:


[Image source. Click image to open in new window.]


A synonym is a word having the same or nearly the same meaning as another word.

Synonyms are words with similar meanings. They are listed in a special type of dictionary called a thesaurus. A regular dictionary lists words according to form, usually in alphabetical order; a thesaurus lists words according to meaning. Synonyms usually differ in at least one semantic feature. Sometimes the feature is objective (denotative), referring to some actual, real world difference in the referents: walk, lumber, stroll, meander, lurch, stagger, stride, mince. Sometimes the feature is subjective (connotative), referring to how the speaker feels about the referent rather than any real difference in the referent itself: die, pass away, give up the ghost, kick the bucket, croak. There tend to be very few absolute synonyms in a language. Example: sofa and couch are nearly complete synonyms, yet they differ in their collocability in at least one way: one may say couch potato, but not *sofa potato.

One special type of partial synonym is called a paronym. Paronyms are words with associated meanings which also have great similarities in form: proscribe/ prescribe, industrial/ industrious, except/accept, affect/effect. Many errors in speech and writing are due to mix-ups involving paronyms.

Examples: “gearbox/transmission”; “choice/selection”; “complex/complicated”; “pretty/attractive”; “sick/ill”; …

Synonymy and homonymy have complementary notions:

  • synonyms: the same meaning, different forms
  • homonyms: the same form, different meanings

See also Linguistics | Semantics.

Syntactic/Dependency parsing

[See also Semantic parsing.] A syntactic parser (i.e., dependency parser) analyzes the grammatical (syntactic) structure of a sentence, establishing binary relationships between “head” words and words which modify those heads; for example, a verb is linked to its dependents (arguments/modifiers). Collectively, these relations form a tree or tree-like graph.

The basic idea is that syntactic structure consists of lexical items, linked by binary asymmetric relations called dependencies. The sentence is an organized whole, the constituent elements of which are words. Every word that belongs to a sentence ceases by itself to be isolated, as in the dictionary. Between the word and its neighbors, the mind perceives connections, the totality of which forms the structure of the sentence.

The structural connections establish dependency relations between the words. The dependencies are all binary relations: a grammatical relation holds between a governor (also known as a regent or a head) and a dependent.

Thus, in the sentence “Winehouse performed …”, “performed” is the governor and “Winehouse” is the dependent (subordinate).

Among other tasks, dependency parse trees may be applied to basic relation extraction. Stanford dependencies provide a representation of grammatical relations between words in a sentence. They have been designed to be easily understood and effectively used by people who want to extract textual relations. Stanford dependencies (SD) are triplets: name of the relation, the governor, and the dependent.

For example, the sentence “Winehouse effortlessly performed her song Rehab.” yields the following dependency paths:

  nsubj(performed-3, Winehouse-1)
  advmod(performed-3, effortlessly-2)
  poss(Rehab-6, her-4)
  nn(Rehab-6, song-5)
  dobj(performed-3, Rehab-6)

In this example, the shortest path between “Winehouse” and “Rehab” is:

  Winehouse nsubj performed dobj Rehab.

and an extracted relation (triple) would be (Winehouse; performed; Rehab)

Winehouse dependency parse
      [graphs above per Stanford CoreNLP online demo]


The term syntax refers to grammatical structure whereas the term semantics refers to the meaning of the vocabulary symbols arranged with that structure. Grammatical (syntactically valid) does not imply sensible (semantically valid), however. For example, the grammatical sentence “cows flow supremely” is grammatically ok (subject verb adverb) in English, but makes no sense. [Source: What do “syntax” and “semantics” mean and how are they different?]

Roughly speaking, the syntax of a language comprises the patterns into which its words can be validly arranged to form sentences. The combination of morphology and syntax is sometimes called the grammar of a language. Syntax as part of grammar is a description of how words grouped and connected to each other in a sentence. There is a good definition of syntax for programming languages: “… syntax usually entails the transformation of a linear sequence of tokens (a token is akin to an individual word or punctuation mark in a natural language) into a hierarchical syntax tree.” The same definition also can be used for natural language. Challenges to syntactic processing are parts of speech (POS) tagging, chunking or detecting syntactic categories (verb, noun phrases), and sentence assembling (constructing syntax trees). [Source]

Syntax concerns the way in which words can be combined to form (grammatical) sentences; e.g. “revolutionary new ideas appear infrequently” is grammatical in English, “colourless green ideas sleep furiously” is grammatical but nonsensical, while “ideas green furiously colourless sleep” is also ungrammatical. Words combine syntactically in certain orders in a way which mirrors the meaning conveyed; e.g. John loves Mary means something different from Mary loves John. The ambiguity of “John gave her dog biscuits” stems from whether we treat “her” as an independent pronoun and “dog biscuits” as a compound noun, or whether we treat “her” as a demonstrative pronoun modifying “dog”. We can illustrate the difference in terms of possible ways of bracketing the sentence: (john (gave (her) (dog biscuits))), vs. (john (gave (her dog) (biscuits))). [Source; Introduction to Linguistics for Natural Language Processing]

See also Linguistics | Semantics.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Text mining

Text mining (TM) comprises the discovery and extraction of knowledge from free text, and can extend to the generation of new hypotheses by joining the extracted information from several publications. Text mining solutions can achieve different objectives, depending on the tasks they then have to address. Primarily, we can distinguish four different categories of purposes for text-mining solutions: information retrieval, information extraction, building knowledge bases, and knowledge discovery. These categories are illustrated in the following figure [image source]:

text mining
      (click image for full-size)


See Proofs.

Transfer learning

See Transfer learning (a subsection in Pretrained models).



A troponym denotes a particular way to do an entry’s referent. Troponymy is a transitive relation.


  • to “trim” and to “slice” are troponyms of to “cut”
  • to “slide” and to “spin” are troponyms of to “move”
  • to “snack” and to “nibble” are troponyms of to “eat”

See also Linguistics | Semantics.

Truth tables

See Logic notation (subsection).

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Variational autoencoder

In neural networks, a variational autoencoder (VAE) consists of an encoder, a decoder, and a loss function. In probability models, the VAR refers to approximate inference in a latent Gaussian model where the approximate posterior and model likelihood are parametrized by neural nets (the inference and generative networks).


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

See also:

Aside: like the VAE, another common generative framework is the generative adversarial network [GAN: Goodfellow et al. (2014), Generative Adversarial Nets]. In a GAN a generator and an auxiliary adversarial discriminator are trained together. On the other hand, in VAE, an encoder and a decoder (or generator) are both trained according to Bayesian models. Both frameworks contain two components (generator and discriminator or encoder and decoder), where each of them requires training.

Vector space

Refer here (Euclidean space).

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See Word embedding.

Word mover’s distance

The word mover’s distance (WMD) – introduced in From Word Embeddings To Document Distances (2015) [codetutorialtutorial: Python implementation with WMD paper coauthor Matt Kusner)] – is a novel distance function between text documents, based on results in word embeddings that learn semantically meaningful representations for words from local co-occurrences in sentences.

The WMD distance measures the dissimilarity between two text documents as the minimum cumulative distance that the embedded words of one document need to “travel” to reach the embedded words of another document. Although two documents may not share any words in common, WMD can still measure the semantical similarity by considering their word embeddings, while other bag-of-words or term frequency-inverse document frequency (TF-IDF) methods only measure the similarity by the appearance of words.

The WMD metric has no hyperparameters and is straightforward to implement. Furthermore, on eight real world document classification data sets, in comparison with seven state-of-the-art baselines, the WMD metric demonstrated unprecedented low $\small k$-nearest neighbor document classification error rates.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Word sense disambiguation (WSD)

In computational linguistics, word-sense disambiguation (WSD) is an open problem of natural language processing and ontology. WSD is the task of identifying which sense/meaning of a word is used in a sentence, when the word has multiple meanings. WSD is basically solution to the ambiguity which arises due to different meaning of words in different contexts. For example, consider the two sentences:

     “The bank will not be accepting cash on Saturdays. “
     “The river overflowed the bank.”

The word “bank” in the first sentence refers to the commercial (finance) banks, while in second sentence, it refers to the river bank.

[For more examples, consult WordNet; e.g., the WordNet search for “cell” gives 7 meanings.]

The solution to this problem impacts other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, inference, et cetera.

See also Named entity disambiguation.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Zachary’s karate club network

Zachary’s karate club is a well-known social network of a university karate club described in the paper An Information Flow Model for Conflict and Fission in Small Groups by Wayne W. Zachary [local copy]. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002.

A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting pairwise links between members who interacted outside the club. During the study a conflict arose between the administrator “John A” and instructor “Mr. Hi” (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate. Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.


[Image source. Click image to open in new window.]

Zachary’s karate club appears in these papers (of personal interest):

See also: