A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Activation functions

In artificial neural networks, the activation function of a node defines the output of that node given an input or set of inputs. An activation function – for example, ReLU or sigmoid – takes in the weighted sum of all of the inputs from the previous layer, then generates and passes an output value (typically nonlinear) to the next layer; i.e., they take a single number and perform a fixed mathematical operation on it. These functions should be non-linear, to encode complex patterns of the data. The most popular activation functions are sigmoid, tanh and ReLU. ReLU is the most popular activation function in deep neural networks.

activation functions

[Image source. Click image to open in new window.]

Here is a simple artificial neural network:


[ANN showing bias units. Click image to open in new window.]

Each of the circles above represent individual nodes (different layers may contain different numbers of nodes). In the feedforward pass, each node is acted upon (numerically modified) by the activation function  $\small \phi$, shown in the images below [and later adjusted, in response to the error, during the backpropagation step – neither shown nor discussed here.] We calculate each of the layer $\small 2$ node activations based on the sum of the input values ($\small \mathbf{x}_1, \mathbf{x}_2, \dots$) times the weight matrix $\small \mathbf{W}_{ij}$, plus the bias term $\small \mathbf{b}$. For example, the value of node $\small \mathbf{a}{(2) \atop 3}$ in the image, above, representing unit $\small 3$ in layer $\small 2$), is calculated from the values of each of the input nodes $\small \mathbf{x}_i$ in the preceding layer, the weight matrices $\small \mathbf{W}_{ij}$, and the bias unit $\small \mathbf{b}$ ($\small \mathbf{x}_0 = +1$), giving the layer $\small 2$ node activations $\small \mathbf{a}{(2) \atop i}$.

The activation function is shown in greater detail in the following two images; as mentioned, it acts on each node. The activation values on each of the hidden units/nodes in layer $\small 3$ ($\small \mathbf{a}{(3) \atop i}$) are derived from the sigmoid function ($\small \phi(\cdot)$) applied to the linear combination (sum, $\small \sum$) of the inputs to those nodes (the layer $\small 2$ activations multiplied by the weight matrix, plus the bias unit). The activations are derived using exactly the same logic as for the first hidden layer, except that the input values are the activation values $\small \mathbf{a}$ from the preceding layer, rather than the $\small \mathbf{x}$ values. …


[image source; $\small \text{output} = \text{activation}(\sum(\text{weight} * \text{input} + \text{bias})$  (click image to open in new window)]


[Image source. Click image to open in new window.]

This provides a very superficial overview. To fully understand the basics of machine learning, you need to study/know at least a bit of linear algebra (matrix mathematics: transformations, multiplications, … – often used in various parametrizations such as weight matrices, …), partial derivatives (needed for the chain rule, used in backpropagation, …), etc. In my opinion, Andrew Ng’s Machine Learning course is essential preparation, providing a thorough understanding of the basics. Dr. Ng. also teaches the Stanford University cs229 Machine Learning course, which provides online materials.

For example, the summations, $\small \sum(\cdot)$, variously shown above are normally and very neatly accomplished in one step by multiplying the inputs by a transposed weight matrix (that also includes the bias term). Given a neuron that takes as input $\small \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and a bias term $\small \mathbf{b} = 1$, the output would be $\small \mathbf{h}_{W,b}(\mathbf{x}) = \phi(\sum_{i=1}^{3}\mathbf{W}_i\mathbf{x}_i + b) = \phi(\mathbf{W}^T\mathbf{x})$. Note that last expression, $\small \phi(\mathbf{W}^T\mathbf{x})$: the transposed weight matrix (including the bias) is multiplied by the input giving the output in one operation!

Activation (+ Other ML) Resources

Ad hoc (ad-hoc) retrieval

See Document relevance ranking.

Adjacency matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Degree Matrix.

Affine transformation

In geometry, an affine transformation, affine map or an affinity (from the Latin, affinis, “connected with”) is a function between affine spaces which preserves points, straight lines and planes. Also, sets of parallel lines remain parallel after an affine transformation. An affine transformation does not necessarily preserve angles between lines or distances between points, though it does preserve ratios of distances between points lying on a straight line.

Examples of affine transformations include translation, scaling, homothety, similarity transformation, reflection, rotation, shear mapping, and compositions of them in any combination and sequence.

If $\small X$ and $\small Y$ are affine spaces, then every affine transformation $\small f \colon X \to Y$ is of the form $\small x \mapsto Mx+b$, where $\small M$ is a linear transformation on the space $\small X$, $\small x$ is a vector in $\small X$, and $\small b$ is a vector in $\small Y$. Unlike a purely linear transformation, an affine map need not preserve the zero point in a linear space. Thus, every linear transformation is affine, but not every affine transformation is linear.

All Euclidean spaces are affine, but there are affine spaces that are non-Euclidean. In affine coordinates, which include Cartesian coordinates in Euclidean spaces, each output coordinate of an affine map is a linear function (in the sense of calculus) of all input coordinates. Another way to deal with affine transformations systematically is to select a point as the origin; then, any affine transformation is equivalent to a linear transformation (of position vectors) followed by a translation.


[Image source. An image of a fern-like fractal that exhibits affine self-similarity. Each of the leaves of the fern is related to each other leaf by an affine transformation. For instance, the red leaf can be transformed into both the dark blue leaf and the light blue leaf by a combination of reflection, rotation, scaling, and translation. Click image to open in new window.]

Matrix affine transformations

To represent affine transformations with matrices, we can use homogeneous coordinates. This means representing a 2-vector $\small (x,y)$ as a 3-vector $\small (x,y,1)$, and similarly for higher dimensions. Using this system, translation can be expressed with matrix multiplication. The functional form

    $\small x' = x + t_x$;
    $\small y' = y + t_y$


    $\small \begin{bmatrix} x' \\\ y' \\\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & t_{x} \\\ 0 & 1 & t_{y} \\\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\\ y \\\ 1 \end{bmatrix}$.

All ordinary linear transformations are included in the set of affine transformations, and can be described as a simplified form of affine transformations. Therefore, any linear transformation can also be represented by a general transformation matrix. The latter is obtained by expanding the corresponding linear transformation matrix by one row and column, filling the extra space with zeros except for the lower-right corner, which must be set to $\small 1$. For example, the counter-clockwise rotation matrix from above becomes:

    $\small \begin{bmatrix} \cos \theta & -\sin \theta & 0 \\\ sin \theta & \cos \theta & 0 \\\ 0 & 0 & 1 \end{bmatrix}$

Using transformation matrices containing homogeneous coordinates, translations become linearly independent, and thus can be seamlessly intermixed with all other types of transformations. The reason is that the real plane is mapped to the $\small w = 1$ plane in real projective space, and so translation in real Euclidean space can be represented as a shear in real projective space. Although a translation is a non-linear transformation in a 2-D or 3-D Euclidean space described by Cartesian coordinates (i.e. it can’t be combined with other transformations while preserving commutativity and other properties), it becomes, in a 3-D or 4-D projective space described by homogeneous coordinates, a simple linear transformation (a shear).

For an example of shear mapping in the eigenspace, refer here.

More affine transformations can be obtained by composition of two or more affine transformations. … [continued here.]

Example: 2-D Transformation

When a transformation takes place on a 2D plane, it is called 2D transformation.

We can represent a point, $\small \mathbf{p} = (x,y)$, in the plane as a row vector $\small \begin{bmatrix} x & y \end{bmatrix}$ or as a column vector $\small \begin{bmatrix} x \\ y \end{bmatrix}$.

We can represent a 2-D transformation $\small \mathbf{M}$ by a matrix $\small \mathbf{M} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$.

If $\small \mathbf{p}$ is a column vector ($\small [2 \times 1]$), then we need to place $\small \mathbf{M}$ on the left:

… $\small [\color{Blue}2 \times \color{Magenta}2] [\color{Magenta}2 \times \color{Blue}1] = [\color{Blue}2 \times \color{Blue}1]$; we cannot multiply $\small [2 \times 1] [2 \times 1]$, due to the mismatched matrix dimensions.

    $\small \begin{bmatrix} x' \\\ y' \end{bmatrix} = \begin{bmatrix} a & b \\\ c & d \end{bmatrix} \begin{bmatrix} x \\\ y \end{bmatrix} = \begin{bmatrix} ax + by \\\ cx + dy \end{bmatrix}$

    $\small \begin{bmatrix} x' \\\ y' \end{bmatrix} = \begin{bmatrix} 1 & 2 \\\ 3 & 4 \end{bmatrix} \begin{bmatrix} 3 \\\ 2 \end{bmatrix} = \begin{bmatrix} 1(3) + 2(2) \\\ 3(3) + 4(2) \end{bmatrix} = \begin{bmatrix} 7 \\\ 17 \end{bmatrix}$

If $\small \mathbf{p}$ is a row vector ($\small [1 \times 2]$), then we need to place $\small \mathbf{M}^T$ on the right ($\small [1 \times 2][2 \times 2]^\color{Magenta}T$):

$\small \mathbf{M} = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$,

$\small \mathbf{M}^T = \begin{bmatrix} a & c \\ b & d \end{bmatrix}$.

    $\small \begin{bmatrix} x' & y' \end{bmatrix} = \begin{bmatrix} x & y \end{bmatrix} \begin{bmatrix} a & c \\\ b & d \end{bmatrix} = \begin{bmatrix} xa + yb & xc + yd \end{bmatrix}$

    $\small \begin{bmatrix} x' & y' \end{bmatrix} = \begin{bmatrix} 3 & 2 \end{bmatrix} \begin{bmatrix} 1 & 3 \\\ 2 & 4 \end{bmatrix}$ $\small = \begin{bmatrix} 3(1) + 2(2) & 3(3) + 2(4) \end{bmatrix} = \begin{bmatrix} 7 & 17 \end{bmatrix}$

which is basically the same result as above (same point, same transformation!).

[If we did not do that matrix transform, $\small \mathbf{M}^T$, we would have obtained $\small \begin{bmatrix} 9 & 14 \end{bmatrix}$).]

octave:>> p=[3,2]
p =
  3   2

octave:>> q=[3;2]
q =

octave:>> M=[1,2;3,4]
M =
  1   2
  3   4

octave:>> M'
ans =
  1   3
  2   4

octave:>> p*M'
ans =
    7   17

octave:>> M*q
ans =



See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.


In semantics, an antonym is a word having the opposite meaning as another word.

Examples: “happy/unhappy”; “heavy/light”; “long/short”; “up/down”; “dead/alive”; “parent/child”; …

See also Linguistics | Semantics.

Oppositeness is a logical category. There are three types:

  • Complementary pairs are antonyms in which the presence of one quality or state signifies the absence of the other and vice versa: single/ married, not pregnant/pregnant, … There are no intermediate states: you can’t be “a little pregnant”, or “kinda married”.

  • Gradable pairs are antonyms which allow for a natural, gradual transition between two poles: “good/bad”, or “hot/cold”. It is possible to be a “little cold” or “very cold”, etc.

  • Relational opposites are antonyms which share the same semantic features – only the focus (or direction) is reversed: “tie/untie”, “buy/sell”, “give/receive”, “teacher/pupil”, “father/son”.

Some concepts lack logical opposites that can be described in terms of any special word; colors are a good example: the logical opposite of “red” is “not red”. Such concepts may form relational antonyms, however, through symbolic systems of thinking. For instance, in the Cold War paradigm, the relational opposite of “American” is “Russian”; in current US politics, the relational opposite of “Democrat” is “Republican”. These are cultural relational opposites.

Architectures; Algorithms; Frameworks;
Libraries; Models; Platforms

I tend to differentiate architectures (the programmatic implementation of a thematic class of machine learning algorithms and procedures, the “basic code”) from models (that represent various implementations of those architectures). Examples of major architectures include:

Frameworks and libraries. Differentiating the meaning and use of these two terms is a terminological rabbit hole – exemplified in the StackOverflow threads What is the difference between a framework and a library? and Framework vs. Toolkit vs. Library. While the two terms tend to be used interchangeably (for example, TensorFlow, PyTorch, etc. are often included in lists of machine learning libraries), I personally prefer the use of the term framework for programming environments including:

  • Caffe
  • TensorFlow
  • PyTorch
  • Theano
  • Torch
  • etc.

Deep learning frameworks simplify deep learning model development and training by providing high-level primitives for complex and error-prone mathematical transformations, like gradient descent, back-propagation, and inference. Deep learning frameworks vary in their level of functionality. Some of them, such as Theano or TensorFlow, allow you to define a neural network of arbitrary complexity from the most basic building blocks. These types of frameworks might even be called languages. Other “frameworks” [← there’s an example of a library being referred to as a framework!], such as Keras, are drivers or wrappers aimed at boosting developer productivity but are limited in their functionality due to the higher level of abstraction. [Sources: Hands-On AI Part 5: Select a Deep Learning Framework]

Like my use of the term “framework,” I prefer to reserve the use of the term “library” for the libraries provided in programming languages, such as the NumPy package/library in the Python programming language. By this definition, Keras (which uses TensorFlow as a backend) is likewise a library.

The simplicity of Python, a general purpose programming language, has attracted many developers to build libraries for machine learning and data science (and, because of all these libraries, Python is almost popular as R for data science). A popular machine learning library for Python includes NumPy.

Overall, I agree with a comment by @Trunk in SO #148747:

“A software framework is not just a library. It also includes a particular design architecture which users must work within – a sort way in which different items in the library are connected together to provide a generic solution for a particular application type. Frameworks thus have the advantage of needing only configuration by the user and validation of pre-written tests to become operational. With libraries alone users need to design an architecture and create tests before implementation. So more work is needed with just the libraries. But libraries are flexible and potentially more efficient.”


Why Libraries are better than Frameworks  (cited in SO #3057526 and the source for the image, above) also defines the term platform:

“A platform is some lowest level on which to build an application, examples are OS’s or virtual machines, e.g.: Linux, OS X, Java, or .NET. These platforms are great; they allow application developers to focus on the specific application logic, reusing all the ‘plumbing’ like sockets and file systems.”

The ML Cheatsheet also provides some definitions:

  • Algorithm: a method, or series of instructions, devised to generate a machine learning model. Examples include linear regression, decision trees, support vector machines, and neural networks.

  • Model: a structure that stores a generalized, internal representation of a dataset (weights and biases) for description or prediction. Models are created/learned when you train an algorithm on a dataset (i.e., the output is a model).

  • Model vs. Algorithm:

    • Model: the representation learned from a dataset and used to make future predictions.
    • Algorithm: the process for learning it.

      Model = Algorithm(Training Data)

Autoregression (Autocorrelation)

Autoregression is a time series model that uses observations from previous time steps as input to a regression equation to predict the value at the next time step. It is a very simple idea that can result in accurate forecasts on a range of time series problems.

A regression model, such as linear regression, models an output value based on a linear combination of input values.

For example:

    $\small \hat{y} = \mathbf{b}_0 + \mathbf{b}_1 \mathbf{X}_1$

Where $\small \hat{y}$ is the prediction, $\small \mathbf{b}_0 $ and $\small \mathbf{b}_1$ are coefficients found by optimizing the model on training data, and $\small \mathbf{X}_1$ is an input value.

This technique can be used on time series where input variables are taken as observations at previous time steps, called lag variables. For example, we can predict the value for the next time step (t+1) given the observations at the last two time steps (t-1 and t-2). As a regression model, this would look as follows:

    $\small \mathbf{X}_{(t+1)} = \mathbf{b}_0 + \mathbf{b}_1 \mathbf{X}_{(t-1)} + \mathbf{b}_2 \mathbf{X}_{(t-2)}$

Because the regression model uses data from the same input variable at previous time steps, it is referred to as an autoregression (regression of self).

Autoregressive, feed-forward models

Instead of making predictions from a state that depends on the entire history, an autoregressive model directly predicts $\small y_t$ using only the $\small k$ most recent inputs, $\small x_{t−k+1}, \ldots, x_t$. This corresponds to a strong conditional independence assumption. In particular, a feed-forward model assumes the target only depends on the $\small k$ most recent inputs. Google’s WaveNet nicely illustrates this general principle.


[Image source. Click image to open in new window.]

In contrast to an RNN, the limited context of a feed-forward model means that it cannot capture patterns that extend more than $\small k$ steps. However, using techniques like dilated-convolutions, one can make $\small k$ quite large.

Source for the preceding subsection: When Recurrent Models Don’t Need to be Recurrent.


An autoregression model makes an assumption that the observations at previous time steps are useful to predict the value at the next time step. This relationship between variables is called correlation.

If both variables change in the same direction (e.g. go up together or down together), this is called a positive correlation. If the variables move in opposite directions as values change (e.g. one goes up and one goes down), then this is called negative correlation.

We can use statistical measures to calculate the correlation between the output variable and values at previous time steps at various different lags. The stronger the correlation between the output variable and a specific lagged variable, the more weight that autoregression model can put on that variable when modeling.

Again, because the correlation is calculated between the variable and itself at previous time steps, it is called an autocorrelation. It is also called serial correlation because of the sequenced structure of time series data.

The correlation statistics can also help to choose which lag variables will be useful in a model and which will not.

Interestingly, if all lag variables show low or no correlation with the output variable, then it suggests that the time series problem may not be predictable. This can be very useful when getting started on a new dataset.


See Proofs.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


Backpropagation operations in neural networks can be divided into two steps: feedforward, and backpropagation. In the feedforward step, an input pattern is applied to the input layer and its effect propagates, layer by layer, through the network until an output value is produced. The network’s actual output value is then compared to the expected output (i.e., the desired output value), and an error signal is computed for each of the output nodes. Since all the hidden nodes to some degree have contributed to the errors evident in the output layer, the output error signals are transmitted backwards from the output layer to each node in the hidden layer that immediately contributed to the output layer. This process is then repeated, layer by layer, until each node in the network has received an error signal that describes its relative contribution to the overall error.

Once the error signal for each node has been determined, the errors are then used by the nodes to update the values for each connection weights until the network converges to a state that allows all the training patterns to be encoded. The Backpropagation algorithm looks for the minimum value of the error function in weight space using a technique called the delta rule or gradient descent. The weights that minimize the error function is then considered to be a solution to the learning problem. [Source: Backpropagation]

In neural networks, if the estimated output $\small \hat{y}_i$ is far away from the actual output $\small y_i$ (high error), we update the biases and weights based on the error. This weight and bias updating process is known as back propagation. Backpropagation is an algorithm to efficiently calculate the gradients in a neural network (or more generally, a feedforward computational graph).

The actual output  $\small \mathbf{y}_i$ (observed from the network) is compared to the estimated output  $\small \mathbf{\hat{y}}_i$ (the desired/predicted/expected output). The error is the mean squared error,

$\ \ \ \ \small \begin{align} \frac{1}{n} \sum_{i=1}^n(y_i-\hat{y}_i)^2 \end{align}$

Backpropagation algorithms work by applying the (partial differentiation) chain rule determining the loss (or error) at the output, and then propagating it back into the network. The weights are updated to minimize the error resulting from each neuron. The first step in minimizing the error is to determine the gradient (derivatives) of each node with respect to the final output.

The gradient is the vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient, a slope (i.e. partial derivative), is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.

Gradient descent is technique to minimize loss by computing the gradients of loss with respect to the model’s parameters, conditioned on the training data. Informally, gradient descent iteratively adjusts parameters, gradually finding the best combination of weights and bias to minimize loss.

$\ \ \ \ \theta_j := \theta_j - \alpha\frac{\partial}{\partial\theta_i}J(\theta_0,\theta_1)$   (for $\small j=0$ and $\small j=1$)  [← gradient descent]

where $\small \alpha$ (alpha, the learning rate) determines how quickly we move toward the minimum.

See also Cost Function, Gradient Descent and Univariate Linear Regression


Bias terms ($\small \textbf{b}$ or $\small \textbf{w}_0$) are additional constants attached to the neurons and added to the weighted input before the activation function is applied. Bias terms help models represent patterns that do not necessarily pass through the origin. For example, if all your features were 0, would your output also be zero? Is it possible there is some base value upon which your features have an effect? Bias terms typically accompany weights and must also be learned by your model. [Source: ML Cheatsheet]

Bias is the term $\small \textbf{b}$ in the following formula: $\small y^{\ \prime} = \textbf{b} + \mathbf{w}_1\mathbf{x}_1 + \mathbf{w}_2\mathbf{x}_2 + \ldots + \mathbf{w}_n\mathbf{x}_n$.


BioNLP is natural language processing, applied in the biomedical domain. See also: Natural language processing.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See Proofs.

Computational linguistics

Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions.

See also Linguistics | Semantics.


See Proofs.

Content words; Function words

In NLP parts of speech (POS) content words are words that name objects of reality and their qualities. They signify:

  • actual living things: dog, cat, etc.
  • family members: mother, father, sister, etc.
  • natural phenomena: snow, sun, etc.
  • common actions: do, make, come, eat, etc.
  • characteristics: young, cold, dark, etc.
  • etc.

Content words consist mostly of nouns, lexical verbs and adjectives, but certain adverbs can also be content words.

Content words contrast with function words, which are words that have very little substantive meaning and primarily denote grammatical relationships between content words, such as:

  • prepositions: in, out, under, etc.
  • pronouns: I, you, he, who, etc.
  • conjunctions: and, but, till, as, etc.
  • etc.

Content words and function words are used in the following papers.

  • Learning When to Concentrate or Divert Attention: Self-Adaptive Attention Temperature for Neural Machine Translation, which proposed a new model with a mechanism called Self-Adaptive Control of Temperature (SACT) to control the softness of attention by means of an “attention temperature.” … the model could learn a soft distribution of attention weights which was more uniform for generating function words, and a hard distribution which is sparser for generating content words. …

  • Google Brain also employed content words and function words in their improved language model, The Importance of Generation Order in Language Modeling. “This paper studies the influence of token generation order on model quality via a novel two-pass language model that produces partially-filled sentence “templates” and then fills in missing tokens. … We find the most effective strategy generates function words in the first pass followed by content words in the second.”


See Proofs.


In linguistics, a corpus (plural corpora), or text corpus, is a large and structured collection of texts – nowadays, usually electronically processed and stored. In corpus linguistics, corpora are used to for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language domain, and natural language processing.

Cost (loss) function

The cost function is used to define and measure the error of the model. The hypothesis is usually presented as

$\ \ \ \ \small h_\theta(x) = \theta_0 + \theta_1x$

The theta values ($\small \theta$) are the parameters.

The cost function is given by:

$\ \ \ \ \small \begin{align} J(\theta_0\theta_1) = \frac{1}{2m} \sum_{i=1}^m(h_0(x^{(i)})-y^{(i)})^2 \end{align}$  [← squared error cost function]


  • $\small h(x)$ is the prediction
  • $\small y$ is the actual value
  • $\small m$ is the number of rows in the training set

Cost is a synonym for loss: a measure of how far a model’s predictions are from its label (phrased more pessimistically, a measure of how bad the model is). To determine this value, a model must define a loss function. For example, linear regression models typically use mean squared error for a loss function, while logistic regression models use log loss (also known as cross entropy loss or logistic loss). [Cross entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.]

$\ \ \ \ \small \begin{align} MSE = \frac{1}{n} \sum_{i=1}^n(Y_i-\hat{Y}_i)^2 \end{align}$   [← mean squared error]


$\ \ \ \ \small \begin{align} MSE = \frac{1}{n} \sum_{i=1}^n(y_i-(mx_i + b))^2 \end{align}$


$\ \ \ \ \small \begin{align} MSE = \frac{1}{2m} \sum_{i=1}^m(h_0(x^{(i)})-y^{(i)})^2 \end{align}$


  • $\small n$ is the total number of observations (data points)
  • $\small \frac{1}{N} \sum_{i=1}^n$ is the mean
  • $\small y_i$ is the actual value of an observation
  • $\small mx_i+b$ is our prediction

Unfortunately for logistic regression we can’t (or at least shouldn’t) use the MSE cost function, as we did for linear regression (above). For an explanation why see Chapter 3 of Michael Neilson’s deep learning book. In short, it’s because our prediction function is non-linear, due to sigmoid transform. Squaring this prediction, as we do in MSE, results in a non-convex function with many local minima: if our cost function has many local minima, gradient descent may not find the optimal global minimum.

Instead of mean squared error (MSE), we use a cost function called cross entropy, also known as log loss. Cross entropy loss can be divided into two separate cost functions: one for $\small y=1$ and one for $\small y=0$.

$\ \ \ \ \small \begin{align} J(\theta) = \frac{1}{m} \sum_{i=1}^m Cost(h_0(x^{(i)}),y^{(i)}) \end{align}$

$\ \ \ \ \small \begin{cases} Cost(h_0(x),y) = -log(h_0(x)) & \text{if y=1} \\ Cost(h_0(x),y) = -log(1-h_0(x)) & \text{if y=0} \end{cases}$

The benefits of taking the logarithm reveal themselves when you look at the cost function graphs for $\small y=1$ and $\small y=0$. These smooth monotonic functions (always increasing or always decreasing) make it easy to calculate the gradient and minimize cost:

[Image from Andrew Ng’s slides on logistic regression]

The key thing to note is the cost function penalizes confident and wrong predictions more than it rewards confident and right predictions! The corollary is increasing prediction accuracy (closer to 0 or 1) has diminishing returns on reducing cost due to the logistic nature of our cost function.

Compressing the two logistic regression equations, above, into a single equation:

$\ \ \ \ \small \begin{align} J(\theta) = -\frac{1}{m} \sum_{i=1}^m [y^{(i)}log(h_0(x^{(i)})) + (1-y^{(i)})log(1-h_\theta(x^{(i)}))] \end{align}$

Multiplying by $\small y$ and $\small (1-y)$ in the above equation is a sneaky trick that let’s us use the same equation to solve for both $\small y=1$ and $\small y=0$ cases. If $\small y=0$, the first side cancels out. If $\small y=1$, the second side cancels out. In both cases we only perform the operation we need to perform.

Vectorized cost function:

$\ \ \ \ \small h = g(X\theta)$
$\ \ \ \ \small \begin{align} J(\theta) = \frac{1}{m} \cdot(-y^Tlog(h) - (1-y)^Tlog(1-h)) \end{align}$

Cost function resources:

Cross entropy loss

Cross entropy loss (or log loss) measures the performance of a classification model whose output is a probability value between 0 and 1.

See also:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See directed acyclic graph.

Deep learning

In recent years, a machine learning method called “deep learning” has gained huge attraction, obtaining astonishing results in broad applications such as pattern recognition, speech recognition, computer vision, and natural language processing. Deep learning learns data representations through the use of multi-layered artificial neural networks as opposed to task-specific algorithms. For reviews on deep learning, see:


See Proofs.

Degree matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Adjacency Matrix.

Directed acyclic graph (DAG)

DAG.png   [Source]

In mathematics and computer science, a directed acyclic graph (DAG), is a finite directed graph with no directed cycles. That is, it consists of finitely many vertices and edges, with each edge directed from one vertex to another, such that there is no way to start at any vertex $\small v$ and follow a consistently-directed sequence of edges that eventually loops back to $\small v$ again. Equivalently, a DAG is a directed graph that has a topological ordering, a sequence of the vertices such that every edge is directed from earlier to later in the sequence.

  • acyclic = “non-circular:” moving from node to node by following the edges, you will never encounter the same node for the second time

DAG can model many different kinds of information. A spreadsheet can be modeled as a DAG, with a vertex for each cell and an edge whenever the formula in one cell uses the value from another; a topological ordering of this DAG can be used to update all cell values when the spreadsheet is changed. Similarly, topological orderings of DAG can be used to order the compilation operations in a makefile. …

In computer science and mathematics, a directed acyclic graph (DAG) is a graph that is directed and without cycles connecting the other edges. This means that it is impossible to traverse the entire graph starting at one edge. The edges of the directed graph only go one way. The graph is a topological sorting, where each node is in a certain order.

In graph theory, a graph is a series of vertexes connected by edges. In a directed graph, the edges are connected so that each edge only goes one way. A directed acyclic graph means that the graph is not cyclic, or that it is impossible to start at one point in the graph and traverse the entire graph. Each edge is directed from an earlier edge to a later edge. This is also known as a topological ordering of a graph.

A spreadsheet may be represented as a directed acyclic graph, with each cell a vertex and an edge connected a cell when a formula references another cell. Other applications include scheduling, circuit design and Bayesian networks.

In mathematics – and more specifically, in graph theory – a tree is an undirected graph in which any two vertices are connected by exactly one path.


Document relevance ranking

Document relevance ranking, also known as ad hoc retrieval (Harman, 2005), is the task of ranking documents from a large collection using only the query and the text of each document only. Restated, ad hoc retrieval is a standard retrieval task in which the user specifies their information need through a query which initiates a search (executed by the information system) for documents which are likely to be relevant to the user.

This contrasts with standard information retrieval (IR) systems that rely on text based signals in conjunction with network structure and/or user feedback. Text-based ranking is particularly important when:

  • click-logs do not exist or are small, and
  • the network structure of the collection is non-existent or not informative for query-focused relevance.

Examples include various domains in digital libraries, e.g. patents or scientific literature, enterprise search, and personal search.

Source, references: Deep Relevance Ranking Using Enhanced Document-Query Interactions, which explored several new models for document relevance ranking, building upon the Deep Relevance Matching Model (DRMM) of Guo et al. (2016). …

See also: Introduction to Ad-hoc Retrieval.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Eigenvalues, Eigenvectors


Embeddings (word embeddings);
Language models

Embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP), where words or phrases from a vocabulary are mapped to vectors of real numbers (see also Word Embeddings). Conceptually it involves a mathematical embedding from a sparse, highly dimensional space with one dimension per word (a dimensionality proportional to the size of the vocabulary) into a dense, continuous vector space with a much lower dimensionality, perhaps 200 to 500 dimensions [Mikolov (2013) Efficient Estimation of Word Representations in Vector Space].

Word embeddings are widely used in predictive NLP modeling, particularly in deep learning applications (Word Embeddings: A Natural Language Processing Crash Course). Word embeddings enable the identification of similarities between words and phrases, on a large scale, based on their context. These word vectors can capture semantic and lexical properties of words, even allowing some relationships to be captured algebraically; e.g. (analogous to 12 - 9 + 2 = 5),

    vBerlin - vGermany + vFrance ~ vParis
    vking - vman + vwoman ~ vqueen.

Aside: addition and multiplication are commutative; subtraction and division are not commutative.

12 + 9 + 2 = 23
12 + (9 + 2) = (12 + 9) + 2 = 23   (← addition and multiplication are also associative)
12 - 9 + 2 = 5
12 - (9 + 2) = 1
(12 - 9) + 2 = 5
12 - (9 - 2) = 5
12 - 9 - 2 = 1
12 - 2 - 9 = 1

Vectors, like scalars, can be added together. However, adding vectors and scalars is very different. The direction a vector has affects the final magnitude of the addition of the two vectors (see Adding Vectors). A negative sign in front of a vector means that its direction is changed by 180º degrees (see Subtracting Vectors). The direction and magnitude completely change when adding and subtracting vectors. Thus adding and subtracting vectors does not give the same resultant vector. See also Subtraction of Vectors, and vector spaces - linear algebra.

Associative property.  Addition: a + (b + c) = (a + b) + c. Multiplication: a(bc) = (ab)c.
Commutative property.  Addition: a + b = b + a. Multiplication: ab = ba.
Distributive property.  a(b + c) = ab + ac (“multiplication distributes over addition”).
Algebraic order of operations.  Operations are executed in this order: parentheses; exponents; multiplication and division; addition and subtraction. 6 + (4 ÷ 2)² × 8 = 38.

The original work for generating word embeddings was presented by Bengio et al. in 2003 (A Neural Probabilistic Language Model (which builds on his 2001 (NIPS 2000) “feature vectors” paper A Neural Probabilistic Language Model), who trained them in a neural language model together with the model’s parameters.

Despite the assertion by Sebastian Ruder in An Overview of Word Embeddings and their Connection to Distributional Semantic Models that Bengio coined the phrase “word embeddings” in his 2003 paper, the term “embedding” does not appear in that paper. The Abstract does state the concept, however: “We propose to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.”. The correct attribution is likely Bengio’s similarly-named 2006 paper Neural Probabilistic Language Models, which states (bottom of p. 162): “Based on our discussion in the introduction, it makes sense to force the word embedding to be shared across all nodes.” The full reference is: Bengio Y. et al. (2006) Neural Probabilistic Language Models. StudFuzz 194:137-186.

Collobert and Weston demonstrated the power of pretrained word embeddings as a highly effective tool when used in downstream tasks in their 2008 paper A Unified Architecture for Natural Language Processing, while also announcing a neural network architecture upon which many current approaches are built. It was Mikolov et al. (2013), however, who popularized word embedding through the introduction of word2vec, a toolkit enabling the training and use of pretrained embeddings (Efficient Estimation of Word Representations in Vector Space).

Likewise – I’m being rather critical here – the 2008 Collobert and Weston paper, above, mentions “embedding” [but not “word embedding”, and cites Bengio’s 2001 (NIPS 2000) paper], while Mikolov’s 2013 paper does not mention “embedding” and cites Bengio’s 2003 paper.

Word embeddings are a particularly striking example of learning a representation, i.e. representation learning (Bengio et al., Representation Learning: A Review and New PerspectivesDeep Learning, NLP, and RepresentationsAn introduction to representation learning). [See also, this Glossary, Representation learning.]

See also:

Euclidean space; Non-Euclidean space

In geometry, Euclidean space encompasses the two-dimensional Euclidean plane, the three-dimensional space of Euclidean geometry, and certain other spaces. It is named after the Ancient Greek mathematician Euclid of Alexandria. The term “Euclidean” distinguishes these spaces from other types of spaces considered in modern geometry. Euclidean spaces also generalize to higher dimensions.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Euclidean space is a mathematical construct that encompasses the line, the plane, and three-dimensional space as special cases. Its elements are called vectors. Vector space (also called linear space) is a collection of objects called vectors, which may be added together and multiplied (“scaled”) by numbers, called scalars.

Vectors can be understood in various ways: as arrows, as quantities with magnitude and direction, as displacements, or as points. Euclidean vectors are an example of a vector space. They represent physical quantities such as forces: any two forces (of the same type) can be added to yield a third, and the multiplication of a force vector by a real multiplier is another force vector. In the same vein, but in a more geometric sense, vectors representing displacements in the plane or in three-dimensional space also form vector spaces.

Vectors are abstract mathematical objects with particular properties, which in some cases can be visualized as arrows. However, vectors in vector spaces do not necessarily have to be arrow-like objects. Vector spaces are the subject of linear algebra and are well characterized by their dimension, which, roughly speaking, specifies the number of independent directions in the space.

A Euclidean space has some number of real-valued dimensions and “dense” points: there is a notion of an “average” of two points, and Euclidean distance is based on the locations of points in such a space.

A non-Euclidean space is based on the properties of points, but not their “location” in a space. A way to describe the differences between these geometries is to consider two straight lines indefinitely extended in a two-dimensional plane that are both perpendicular to a third line:

  • In Euclidean geometry, the lines remain at a constant distance from each other (meaning that a line drawn perpendicular to one line at any point will intersect the other line and the length of the line segment joining the points of intersection remains constant) and are known as parallels.

  • In hyperbolic geometry, they “curve away” from each other, increasing in distance as one moves further from the points of intersection with the common perpendicular; these lines are often called ultraparallels.

  • In elliptic geometry, the lines “curve toward” each other and intersect.


[Image source. Click image to open in new window.]


[Image source (slide 14). Click image to open in new window.]

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Fourier transform

Refer here  [Graph Signal Processing: Background].


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.

Functional genomics

The standard interpretation for functional genomics is described on Wikipedia:

Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic and transcriptomic projects (such as genome sequencing projects and RNA sequencing) to describe gene (and protein) functions and interactions. Unlike structural genomics, functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression and protein-protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. Functional genomics attempts to answer questions about the function of DNA at the levels of genes, RNA transcripts, and protein products. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional “gene-by-gene” approach.

However, I broaden my use use of that term, in the sense of

How is the information contained in our genome expressed, and what are the functional consequences of that the expression of that information?

My use of the term “functional genomics” thus spans genomics, molecular genetics, biochemistry, and bioinformatics. I am particularly fascinated by how the information in our genome is encoded and manifested.

Individual variations in our genetic / epigenetic makeup determine who we are, and how we respond to both

  • extrinsic factors:
    • the environment (environmental stress: radiation, heat, famine, anoxia, toxins, pollutants, chemicals, …)
    • pathogens (bacterial, viral)


  • intrinsic factors:
    • metabolism (e.g. different functional isotypes of proteins, that affect how we process chemicals and drugs, relevant e.g. to toxicology and cancer chemotherapy …)
    • mutation (spontaneous: ageing; induced: environmental in nature but affecting the individual)

Function words

See Content words; Function words.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Gaussian function

In mathematics, a Gaussian function, often simply referred to as a Gaussian, is a function of the form:

$\large {f(x) = ae^{-{\frac {(x-b)^{2}}{2c^{2}}}} }$

for arbitrary real constants $\small a$, $\small b$ and $\small c$. It is named after the mathematician Carl Friedrich Gauss. The graph of a Gaussian is a characteristic symmetric “bell curve” shape. The parameter $\small a$ is the height of the curve’s peak, $\small b$ is the position of the center of the peak and $\small c$ (the standard deviation, sometimes called the Gaussian RMS width) controls the width of the “bell”.

Gaussian functions are often used to represent the probability density function of a normally distributed random variable with expected value $\small \mu = b$ and variance $\small \sigma^2 = c^2$. In this case, the Gaussian is of the form:

$\large g(x)=\frac{1}{\sigma {\sqrt{2\pi }}} e^{ -{\frac{1}{2}} \left ({\frac{x-\mu }{\sigma }} \right)^{2} }$.

Gaussian functions are widely used in statistics to describe the normal distributions, in signal processing to define Gaussian filters, in image processing where two-dimensional Gaussians are used for Gaussian blurs, and in mathematics to solve heat equations and diffusion equations and to define the Weierstrass transform.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

Geometric deep learning

In the last decade, deep learning (DL) approaches (e.g. convolutional neural networks and recurrent neural networks) have achieved unprecedented performance on a broad range of problems coming from a variety of different fields (e.g. computer vision and speech recognition). Despite the results obtained, research so far on DL techniques has focused mainly on data defined on Euclidean domains (i.e. grids). However, in fields such as biology, physics, network science, recommender systems, computer graphics etc. one may have to deal with data defined on non-Euclidean domains (i.e. graphs and manifolds). Until recently, the adoption of DL in these fields has been lagging, primarily as the non-Euclidean nature of data makes the definition of basic operations (such as convolution) rather elusive. Geometric deep learning deals with the extension of DL techniques to graph/manifold structured data.

The Geometric Deep Learning website represents a collection of materials in the field of geometric deep learning. They collect workshops, tutorials, publications and code, that several different researchers have produced over the last several years. Their goal is to provide a general picture of this new and emerging field, which is rapidly developing in the scientific community thanks to its broad applicability.

See also:

GNU Octave

See this page.

Google AI

Google AI, formerly Google Research. Google collects AI-based services across the company into “ is a collection of products and teams across Alphabet with a focus on AI.”

Gradient; Gradient descent

See Backpropagation.

Graph adjacency matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Degree Matrix.

Graph databases

Data often exists as relationships between different objects. While relational databases (RDBMS) store highly structured data, they do not store the relationships between the data. Unlike other databases, graph databases store relationships and connections as first-class entities, and excel at managing highly connected data and complex queries.

... continued here ...

Graph degree matrix

Refer here  [Graph Signal Processing: Background].

See also: Graph Adjacency Matrix.

Graph Laplacian (Laplacian matrix)

“The Laplace operator in its various manifestations is the most beautiful and central object in all of mathematics. Probability theory, mathematical physics, Fourier analysis, partial differential equations, the theory of Lie groups, and differential geometry all revolve around this sun, and its light even penetrates such obscure regions as number theory and algebraic geometry.”
[Nelson E (1967) Tensor Analysis, p. 100]

Refer here  [Graph Signal Processing: Background].

Graph signal processing

In applications such as social, energy, transportation, sensor, and neuronal networks, high-dimensional data naturally reside on the vertices of weighted graphs. The emerging field of signal processing on graphs merges algebraic and spectral graph theoretic concepts with computational harmonic analysis to process such signals on graphs. The weight associated with each edge in the graph often represents the similarity between the two vertices it connects. The connectivities and edge weights are either dictated by the physics of the problem at hand or inferred from the data. For instance, the edge weight may be inversely proportional to the physical distance between nodes in the network. The data on these graphs can be visualized as a finite collection of samples, with one sample at each vertex in the graph. Collectively, we refer to these samples as a graph signal.

A signal or function $\small f : \mathcal{V} \to \mathbf{R}$ defined on the vertices of the graph may be represented as a vector $\small f \in \mathbf{R}^N$, where the $\small i^{th}$ component of the vector $\small f$ represents the function value at the $\small i^{th}$ vertex in $\small \mathcal{V}$. The following figure is an example of a graph signal.


[Image source. Click image to open in new window.]

... continued here ...

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


In semantics, a holonym is the opposite of a meronym.


  • “finger” is a part of a “hand”, thus “hand” is the holonym of “finger”
  • “family” is a holonym of “child”, “mother” or “father”.
  • “forest” is a holonym of “tree” (forests contain trees), while “tree” is a holonym of “bark”
  • “body” is a holonym of “arm”, which is a holonym of “elbow”

See also Linguistics | Semantics.


In semantics, a homonym is a word having a different meaning, but the same form as another word.


  • “bear” (verb) vs. “bear” (noun, animal)
  • “bank” (institution) vs. “bank” (river bank)

It is possible to distinguish homophones (same sound) and homographs (same spelling)

Homonymy and synonymy have complementary notions:

  • homonyms: the same form, different meanings
  • synonyms: the same meaning, different forms

Homonyms are words that have the same form but different meanings. There are two major types of homonyms, based upon whether the meanings of the word are historically connected or result from coincidence.

  • Coincidental homonyms are the result of such historical accidents as phonetic convergence of two originally different forms or the borrowing of a new word which happens to be identical to an old word. There is usually no natural link between the two meanings: the “bill” of a bird vs the “bill” one has to pay; or the “bark” of a dog vs the “bark” of a tree.

  • The second type of homonym, the polysemous homonym, results when multiple meanings develop historically from the same word. The process by which a word acquires new meanings is called polysemy. Unlike coincidental homonyms, polysemous homonyms usually preserve some perceptible semantic link marking the development of one meaning out of the other, as in the “leg” of a chair and the “leg” of a person; or the face of a person vs. the face of a clock.

Sometimes it is impossible to tell whether two words of identical form are true homonyms (historically unrelated) or polysemous homonyms (historically related), such as ice “skate” vs. “skate”, the fish: “skate” (fish, from Old English “skata”) vs. ice “skate” (from Dutch “schaat”); “dear/deer” are historically related (“darling”, vs. German “Tier”: animal).

Since polysemy is so difficult to separate from true homonymy, dictionaries usually order entries according to (i) the first recorded appearance of word, or (ii) the frequency of meaning use. This is a problem for lexicographers, the people who study words and write dictionaries.

There are universal tendencies in the directionality of polysemy. Studies of polysemy in a wide variety of languages generally find the following directions in meaning shift:

  • body part to part of object (“hands”, “face”, “lip”, “elbow”, “belly”, “vein of gold” or “vein of a leaf”); for example, “appendix.

  • animal to human, for personality traits (“shrew”, “bear”, “wolf”, “fox”, “quiet as a fish”); for example, “My dog is a real Einstein”.

  • space to time (“long”, “short”, “plural”)

  • spatial to sound (“melt”, “rush”)

  • sound to color (“loud”, “clashing”, “mellow”)

  • Physical, visible attribute to emotional or mental, invisible quality (“crushed”, “big head”, “green with envy”, “yellow coward”, “sharp/dull”, “spark”)

Directionality in polysemy seems to be logically motivated: concrete meanings give rise to abstract ones (“sharp knife” → “sharp mind”); mundane gives rise to the technical (“wood chip” → “computer chip”).

See also Linguistics | Semantics.

Hyperbolic embeddings

Hyperbolic embeddings seek to embed structured, discrete objects such as knowledge graphs into a continuous representation that can be used with modern machine learning methods. Hyperbolic embeddings can preserve graph distances and complex relationships in very few dimensions, particularly for hierarchical graphs. Many graphs, such as complex networks including the Internet and social networks, are known to have hyperbolic structure and thus befit hyperbolic embeddings (see Representation Tradeoffs for Hyperbolic Embeddings and references therein). For both natural language processing and graph based tasks, embeddings have been learned in high-dimensional Euclidean spaces. However, recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but negatively curved, hyperbolic space.

For additional background, see these dated but useful resources:

A hyperbolic space has the property that power-law degree distributions, strong clustering and hierarchical community structure emerge naturally when random graphs are embedded in hyperbolic space. It is therefore logical to exploit the structure of the hyperbolic space for useful embeddings of complex network. Neural Embeddings of Graphs in Hyperbolic Space (May 2017) proposed learning neural embeddings of graphs in hyperbolic space, providing experimental evidence that embedding graphs in their natural geometry significantly improved performance on downstream tasks for several real-world public datasets.


[Image source. Click image to open in new window.]

  • Hybed: Hyperbolic Neural Graph Embedding  [Semantic ScholarOpenReview] again posits that recent work has shown that the appropriate isometric space for embedding complex networks is not the flat Euclidean space, but a negatively curved hyperbolic space. They proposed learning neural embeddings of graphs in hyperbolic space, providing experimental evidence that hyperbolic embeddings significantly outperformed Euclidean embeddings on vertex classification tasks for several real-world public datasets.

Many well-established recommender systems are based on representation learning in Euclidean space. In these models, matching functions such as the Euclidean distance or inner product are typically used for computing similarity scores between user and item embeddings. In an extension of representation learning from the Euclidean to the hyperbolic domain, Hyperbolic Recommender Systems investigated the notion of learning user and item representations in Hyperbolic space, arguing that hyperbolic space is more suitable for learning user-item embeddings in the recommendation domain. Unlike Euclidean spaces, hyperbolic spaces are intrinsically equipped to handle hierarchical structure, encouraged by its property of exponentially increasing distances away from origin. Their proposed system (HyperBPR) outperformed its Euclidean counterparts, achieving state of the art performance on multiple benchmark datasets.

Representation Tradeoffs for Hyperbolic Embeddings  [project: provides an excellent summary as well as entity embeddings with 64-bit precision] found that – given a tree – they could give a combinatorial construction that embedded the tree in hyperbolic space with arbitrarily low distortion without using optimization. On WordNet, their combinatorial embedding obtained a mean-average-precision of 0.989 with only two dimensions, while Nickel et al.’s recent construction (Poincaré Embeddings for Learning Hierarchical Representation) obtained 0.87 using 200 dimensions.


[Image source. Click image to open in new window.]

Maximilian Nickel and Douwe Kiela at Facebook AI Research recently introduced a new approach to learning hierarchical representations of symbolic data by embedding them into hyperbolic space (more precisely, into an n-dimensional Poincaré ball) (Poincaré Embeddings for Learning Hierarchical Representations;  [project]), allowing them to learn parsimonious representations of symbolic data by simultaneously capturing hierarchy and similarity. The Poincaré embeddings significantly outperformed Euclidean embeddings on data with latent hierarchies in terms of representation capacity and generalization ability.

Experiments showed that Poincaré embeddings provided important advantages over Euclidean embeddings on hierarchical data:

  1. Poincaré embeddings enabled very parsimonious representations that enabled the learning of high-quality embeddings of large-scale taxonomies.
  2. Excellent link prediction results indicated that hyperbolic geometry could introduce an important structural bias for the embedding of complex symbolic data.
  3. State of the art results for predicting lexical entailment suggested that the hierarchy in the embedding space corresponded well to the underlying semantics of the data.


[Image source. Click image to open in new window.]

Embeddings of tree-like graphs in hyperbolic space were recently shown to surpass their Euclidean counterparts in performance by a large margin. Inspired by those results, Skip-gram Word Embeddings in Hyperbolic Space presented an algorithm for learning word embeddings in hyperbolic space from free text. An objective function based on the hyperbolic distance was derived and included in the skip-gram architecture from word2vec. The results demonstrated the potential of hyperbolic word embeddings, particularly in low dimensions, although without clear superiority over their Euclidean counterparts.

See also Embeddings (word embeddings)


In mathematics, a hypergraph is a generalization of a graph in which an edge can join any number of vertices. Formally, a hypergraph is a pair where is a set of elements called nodes or vertices, and is a set of non-empty subsets of called hyperedges or edges.


[Click image to open in new window]

Research into hypergraph development is technically challenging due to a lack of programmatic solutions and published work in this area. Nevertheless, for relevant background on hypergraphs consult the following:

Hypernym (superordinate word)

In semantics, superordinate words (hypernyms) are words whose meaning includes meaning of another word. If $\small X$ is superordinate to $\small Y$ then $\small Y$ is subordinate to $\small X$.


  • “color” is a hypernym of “red”, “black”, “green”, …
  • “animal” is a hypernym of “horse”, “tiger”, …
  • “animal” is a hypernym of “mammal” (mammals are animals), which is a hypernym of “mammal” is a hypernym of “dog” (dogs are mammals)
  • “plant” is a hypernym of “flower”, which is a hypernym of “tulip”
  • “red” is a hypernym of “scarlet”, “vermilion”, “carmine” and “crimson”

Hypernyms vs. Hyponyms

Hypernymy is a relation between words (or sentences) where the semantics of one word (the hyponym) are contained within that of another word (the hypernym). A simple form of this relation is the is-a relation; e.g., cat is an animal. Another example is a hyponym in a type-of relationship with its hypernym; for example, “pigeon, crow, eagle and seagull” are all hyponyms of “bird” (their hypernym); “bird” in turn is a hyponym of “animal;” “spoon” is a hyponym of “cutlery” (its hypernym); etc.

See also Linguistics | Semantics.


See Parameters.

Hyponym (subordinate word)

In semantics, subordinate words (hyponyms) are words whose meaning is included in the meaning of another word.


  • “red” is a hyponym of “color” (red is subordinate to color)
  • “pony” is a hyponym of “horse”, which is a hyponym of “animal”
  • “tulip” is a hyponym of “flower”; …
  • “dog” is a hyponym of “mammal” (dogs are among the various animals which are mammals); “mammal” is a hyponym of “animal”
  • “tulip” is a hyponym of “flower”, which is a hyponym of “plant”
  • “scarlet”, “vermilion”, “carmine” and “crimson” are hyponyms of “red”

This is not a whole/part relationship, so “page” is not a subordinate term of “book”.

Hypernyms vs. Hyponyms

Hypernymy is a relation between words (or sentences) where the semantics of one word (the hyponym) are contained within that of another word (the hypernym). A simple form of this relation is the is-a relation; e.g., cat is an animal. Another example is a hyponym in a type-of relationship with its hypernym; for example, “pigeon, crow, eagle and seagull” are all hyponyms of “bird” (their hypernym); “bird” in turn is a hyponym of “animal;” “spoon” is a hyponym of “cutlery” (its hypernym); etc.

See also Linguistics | Semantics.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See Proofs.

Identity Matrix

See here  [Graph Signal Processing: Background]

Information extraction

Information extraction (IE) is the process of extracting structured information (e.g. events, binary relations, etc.) from text and data, so that it can be used for another purpose, such as an information retrieval system (e.g. a search engine). IE creates structured information from unstructured text. See also: Relation extraction. [image source]

IE vs. IR

Information overload

The explosion in the amount of news and journalistic content being generated across the globe, coupled with extended and instantaneous access to information through online media, makes it difficult and time-consuming to monitor news developments and opinion formation in real time (Content-Driven, Unsupervised Clustering of News Articles Through Multiscale Graph Partitioning). Even within the more focused health, technical and scientific domains we face a continuous onslaught of new information and knowledge from which we must filter out the non-relevant information, seeking to retain (or hoping to find again) knowledge that is relevant to us.

Information overload is characterized by the difficulty of understanding an issue and effectively making decisions when one has too much information about that issue. In our infocentric world, we have an increasing dependency on relevant, accurate information that is buried in the avalanche of continuously generated information.

Coincident with information overload is the phenomenon of attention overload: we have limited attention and we’re not always sure where to direct it. It can be difficult to limit how much information we consume when there’s always something new waiting for a click; before we know it, an abundance of messy and complex information has infiltrated our minds. If our processing strategies don’t keep pace, our online explorations create strained confusion instead of informed clarity. Hence, More information is not necessarily better.

When Choice is Demotivating: Can One Desire Too Much of a Good Thing? [pdf] discussed findings from 3 experimental studies that starkly challenged the implicit assumption that having more choices is more intrinsically motivating than having fewer choices. Those experiments, which were conducted in both field and laboratory settings, showed that people are more likely to make purchases or undertake optional classroom essay assignments when offered a limited array of 6 choices, rather than a more extensive array of 24 or 30 choices. Moreover, participants reported greater subsequent satisfaction with their selections and wrote better essays when their original set of options had been limited.

Information overload is a long-standing issue: in her 2010 book Too Much To Know: Managing Scholarly Information before the Modern Age, Harvard Department of History Professor Ann Blair argued that the early modern methods of selecting, summarizing, sorting, and storing text (the 4S’s) are at the root of the techniques we use today in information management.

For additional discussion, see the blog post Information Overload, Fake News, and Invisible Gorillas.

Information retrieval

Information retrieval (IR) employs highly scalable statistics-based techniques to index and search large volumes of text efficiently. Information retrieval is based on a query – you specify what information you need, and it is returned in human understandable form.

See also Document relevance ranking.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Kirchhoff’s Theorem

In the mathematical field of graph theory, Kirchhoff’s theorem or Kirchhoff’s matrix tree theorem (named after Gustav Kirchhoff) is a theorem about the number of spanning trees in a graph, showing that this number can be computed in polynomial time as the determinant of a matrix derived from the graph. It is a generalization of Cayley’s formula which provides the number of spanning trees in a complete graph.


[Image source. Click image to open in new window.]

Kirchhoff’s theorem relies on the notion of the Laplacian matrix of a graph that is equal to the difference between the graph’s degree matrix (a diagonal matrix with vertex degrees on the diagonals) and its adjacency matrix (a $\small (0,1)$-matrix with 1’s at places corresponding to entries where the vertices are adjacent and 0’s otherwise).

For a given connected graph $\small \mathcal{G}$ with $\small n$ labeled vertices, let $\small \lambda_1, \lambda_2, ldots, \lambda_{n-1}$ be the non-zero eigenvalues of its Laplacian matrix. Then the number of spanning trees of $\small \mathcal{G}$ is

    $\small \begin{align} t(\mathcal{G) = \frac{1}{n}} \lambda_1, \lambda_2, \cdots, \lambda_{n-1} \end{align}$.

Equivalently the number of spanning trees is equal to any cofactor of the Laplacian matrix of $\small \mathcal{G}$.

Kmer (K-mer)

The term $\small k$-mer typically refers to all the possible substrings of length $\small k$ that are contained in a string. In computational genomics, $\small k$-mers refer to all the possible subsequences (of length $\small k$) from a read obtained through DNA sequencing.

The amount of $\small k$-mers possible given a string of length $\small L$ is $\small L-k+1$, while the number of possible $\small k$-mers given $\small n$ possibilities (4 in the case of DNA, e.g. ACTG) is $\small n^{k}$.

$\small K$-mers are typically used during sequence assembly, but can also be used in sequence alignment. In the context of the human genome, $\small k$-mers of various lengths have been used to explain variability in mutation rates.


Here are some examples showing the possible $\small k$-mers (given a specified $\small k$ value) from DNA sequences:



Typically we extract $\small k$-mers from genomic assemblies or read data sets by running a $\small k$-length window across all of the reads and sequences – e.g. given a sequence of length 16, you could extract 11 $\small k$-mers of length six from it like so:


becomes the following set of 6-mers:


$\small k$-mers are most useful when they’re long, because then they’re specific. That is, if you have a 31-mer taken from a human genome, it’s pretty unlikely that another genome has that exact 31-mer in it. (You can calculate the probability if you assume genomes are random: there are 431 possible 31-mers, and 431 = 4,611,686,018,427,387,904.) [Source.]

For a broader coverage of this topic, see $\small n$-gram.

Knowledge discovery in databases

Knowledge discovery in databases (KDD) is the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data. KDD (“data mining”) is the process of discovering useful knowledge from a collection of data. This widely used data mining technique is a process that includes data preparation and selection, data cleansing, incorporating prior knowledge on data sets and interpreting accurate solutions from the observed results.

Knowledge graphs

Knowledge graphs (relational property graphs) model information in the form of entities (nodes/vertices) and the relationships between them (edges). For a formal definition see

and for a general overview see

A noteworthy feature of knowledge graphs is the excellent performance of traversal-type queries across entities of diverse types within them. Such queries can be challenging to realize in relational databases because of the cost of computing statements (joins) across different tables. See:

Cypher queries in Neo4j are easier to write and understand than complex SQL queries in relational database management systems (RDBMS), especially those involving multiple join statements. For example, see pp. 22-23 in The Definitive Guide to Graph Databases for the RDBMS Developer).

Knowledge graphs (KG) provide semantically structured information that is interpretable by computers – an important property to building more intelligent machines, as well as an important step to transforming text-based knowledge stores and search engines into semantically-aware question answering services (A Review of Relational Machine Learning for Knowledge Graphs). Knowledge graphs are applicable to a broad spectrum of problems ranging from biochemistry to recommender systems; for example, question answering, structured search, exploratory search, and digital assistants.

Knowledge graph structure (embeddings); machine learning

Graphs represent a recent and exciting extension of machine learning; for a good review, see The Knowledge Graph as the Default Data Model for Learning on Heterogeneous Knowledge).

A Review of Relational Machine Learning for Knowledge Graphs also provides an excellent introduction to machine learning and knowledge graphs, with a focus on statistical relational learning.

In a radically different approach from the probabilistic methods employed in statistical relational learning, knowledge graph embedding (KGE) aims to represent entities and relations in a knowledge graph as points or vectors in a continuous vector space (Euclidean space) – simplifying manipulation, while preserving the inherent structure of the KG.

KG are generally represented by symbolic triples – (source, relation, target), also described as (subject, predicate, object) or (head, relation, tail) – whereas KGE methods attempt to represent those symbols (nodes and edges) with their corresponding source, relation, and target vectors – amenable to mathematical processing.

The underlying concept of KGE is that in a knowledge graph each entity can be regarded as a point in a continuous vector space, while relations can be modelled as translation vectors (Expeditious Generation of Knowledge Graph Embeddings). The generated vector representations can be used by machine learning algorithms to accomplish a specific tasks.

Restated in On Embeddings as Alternative Paradigm for Relational Learning, KGE aim to represent instances and their relationships as vectors and/or matrices in the Euclidean space. The hope is that the geometry of the embedding space would resemble the structure of the data (for example) by keeping the instances participating in the same relationships close in the Euclidean space. This in turn allows one to apply standard propositional (logic-based) machine learning tools and retain their scalability, while at the same time preserving certain properties of structured relational data.

KGE has proven to be very effective for the tasks of link prediction and knowledge graph completion, where the goal is to identify missing links in the existing knowledge graph (A Review of Relational Machine Learning for Knowledge Graphs  |  An Overview of Embedding Models of Entities and Relationships for Knowledge Base Completion).

KGE approaches are the current (2018) dominating methodology for knowledge graph link prediction (On Link Prediction in Knowledge Bases: Max-K Criterion and Prediction Protocols). This methodology, fundamentally based on distributed representations, has not only proved to be effective for KG link prediction but has also helped to improve our understanding and engineering of knowledge representation. A strong advantage of KGE methods is their scalability, at the expense of their black-box nature and limited reasoning capabilities; KGE has proven to be scalable to very large knowledge graphs.

Knowledge graphs: additional references

For recent reviews on knowledge graph embedding (KGE), see

Examples of the use of KG in the biomedical domain include:

See also Graph databases.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Language models

See Embeddings (word embeddings).


See Graph Laplacian (Laplacian matrix)


See Proofs.


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.


Linguistics is the scientific study of language and its structure, including the study of morphology, syntax, phonetics, and semantics. Specific branches of linguistics include sociolinguistics, dialectology, psycholinguistics, computational linguistics, historical-comparative linguistics, and applied linguistics, etc.

See also:

Long short-term memory (LSTM RNN)

Although theoretically powerful, vanilla RNNs cannot learn from long-sequences due to a problem known as vanishing or exploding gradients. A powerful solution is Long Short-Term Memory (LSTM), introduced in 1997 by Hochreiter & Schmidhuber). Why “long short-term”? As described in that paper, “[LSTM] can learn to bridge time intervals in excess of 1000 steps even in the case of noisy, incompressible input sequences, without loss of short time lag capabilities.” A RNN relies on the past, without scope; long-term past inputs are forgotten. LSTM provide a short-term memory function: a LSTM block has mechanisms to enable “memorizing” information for an extended number of time steps.

LSTM introduces one more vector called “memory” $\small \mathbf{c}_t$, which, together with the state $\small \mathbf{h}_t$, specify the dynamic as: $\small (\mathbf{h}_t, \mathbf{c}_t) = LSTM(\mathbf{h}_{t-1}, \mathbf{c}_{t-1}, \mathbf{x}_t)$. In most implementations, this is decomposed further as:

$\ \ \ \ \ \ \small \mathbf{c}_t = \mathbf{f}_t ∗ \mathbf{c}_{t-1} + \mathbf{i}_t ∗ \mathbf{\widetilde{c}}_t$

$\ \ \ \ \ \ \small \mathbf{h}_t = \mathbf{o}_t ∗ tanh(\mathbf{c}_t)$


  • $\small \mathbf{\widetilde{c}}_t$ is a candidate memory computed from the input
  • $\small \mathbf{f}_t, \mathbf{i}_t, \mathbf{o}_t \in (\textbf{0},\textbf{1})$ are gates
  • $∗$ denotes point-wise multiplication

  • $\small \mathbf{f}_t$ determines how much the previous memory is maintained
  • $\small \mathbf{i}_t$ controls how much new information is stored into memory
  • $\small \mathbf{o}_t$ controls how much memory is read out

The candidate memory and the gates are typically parameterized (sometimes written as “parametrized”) as:

$\ \ \ \ \ \ \small \mathbf{\widetilde{c}}_t = tanh(W_c\mathbf{h}_{t-1} + V_c\mathbf{x_t} + \mathbf{b_c})$

$\ \ \ \ \ \ \small \begin{bmatrix} f_t \\ i_t \\ o_t \end{bmatrix} = sigm \left( \begin{bmatrix} W_f \\ W_i \\ W_o \end{bmatrix} \mathbf{h}_{t-1} + \begin{bmatrix} V_f \\ V_i \\ V_o \end{bmatrix} \mathbf{x}_t + \begin{bmatrix} \mathbf{b}_f \\ \mathbf{b}_i \\ \mathbf{b}_o \end{bmatrix} \right)$

where $\small (W_{c,f,i,o}, V_{c,f,i,o}, \mathbf{b}_{c,f,i,o})$ are learnable parameters.


[Image source (click image to open in new window)]


[Image source (click image to open in new window)]

In the transformations above, the memory cell $\small \mathbf{c}_t$ stores the “long-term” memory in the vector form. In other words, the information accumulatively captured and encoded until time step $\small t$ is stored in $\small \mathbf{c}_t$ and is only passed along the same layer over different time steps. Given the inputs $\small \mathbf{c}_t$ and $\small \mathbf{h}_t$, the input gate $\small \mathbf{i}_t$ and forget gate $\small \mathbf{f}_t$ will help the memory cell to decide how to overwrite or keep the memory information. The output gate $\small \mathbf{o}_t$ further lets the LSTM block decide how to retrieve the memory information to generate the current state $\small \mathbf{h}_t$ that is passed to both the next layer of the current time step and the next time step of the current layer. Such decisions are made using the hidden-layer parameters $\small \mathbf{W}$ and $\small \mathbf{b}$ (with various subscripts): these parameters will be inferred during the training phase


  • Sources for the above:

  • Symbols in bold font, above, are vectors (e.g. $\small \mathbf{y}_t$ is an output vector – these are the network’s parameters (i.e. “memory”).

  • That paper glosses over glosses over some of the terms (I chose it for the clarity of the “vanilla” (basic) RNN and LSTM structures and descriptions.

  • Bias terms ($\small \textbf{b}$ or $\small \textbf{w}_0$) are additional constants attached to the neurons and added to the weighted input before the activation function is applied. Bias terms help models represent patterns that do not necessarily pass through the origin. For example, if all your features were 0, would your output also be zero? Is it possible there is some base value upon which your features have an effect? Bias terms typically accompany weights and must also be learned by your model. [Source: ML Cheatsheet]

    Bias is the term $\small \textbf{b}$ in the following formula: $\small y^{\ \prime} = \textbf{b} + \mathbf{w}_1\mathbf{x}_1 + \mathbf{w}_2\mathbf{x}_2 + \ldots + \mathbf{w}_n\mathbf{x}_n$.

  • $\small W$, $\small U$ and $\small V$ are matrices of weights, updated during training. $\small W$ are the (shared – see note above) weights passed from one time step to the next; $\small U$ are the “skip connections” from the inputs to all hidden layers, and $\small V$ are the skip connections from the hidden layers to the outputs. The image below, taken from Alex Graves paper, illustrates those connections.

  • Early versions of simple RNN used the sigmoid activation function (see Recent Advances in Recurrent Neural Networks for a description plus other forms), which is applied component-wise:

    $\ \ \ \ \ \ \ \ \ \ \sigma(x) = \frac{1}{1 + \mathcal{e}^{(-x)}}$

    The “sigmoid” is a common choice, which takes a real-value and squashes it to the range [0,1]. This activation function is normally used in the output layer, where a cross-entropy loss function is used for training a classification model.

  • In a LSTM, The input gate is a sigmoid function and have a range of [0,1]. Because the equation of the cell state is a summation between the previous cell state, sigmoid function alone will only add memory and not be able to remove/forget memory. If you can only add a float number between [0,1], that number will never be zero/turned off/forget. This is why the input modulation gate has an tanh activation function. Tanh has a range of [-1, 1] and allows the cell state to forget memory. [Source: Long Short-Term Memory (LSTM): Concept]

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Matrix operations

See this page.

Memory (neural memory)

See my blog post How do Neural Networks "Remember"?.


In semantics, a meronym denotes a part or a member of something.


  • “page” is a meronym of “book”
  • “wheel”, “engine”, “door”, etc. are meronyms of “car”
  • “finger” is a meronym of “hand”
  • “bark” is a meronym of “tree” (bark is part of what makes up a tree), while “tree” is a meronym of “forest”
  • “elbow” is a meronym of “arm”, which is a meronym of “body”

See also Linguistics | Semantics.


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.


In linguistics, a morpheme is a meaningful morphological unit of a language that cannot be further divided; e.g., “incoming” consists of the morphemes “in”, “come” and “-ing”. A morpheme may or may not stand alone, whereas a word, by definition, is freestanding. Another example: “dogs” consists of two morphemes and one syllable: “dog”, and “-s”, a plural marker on nouns. Note that a morpheme like “-s” can just be a single phoneme and does not have to be a whole syllable.

See also:


Morphology is a subdiscipline of linguistics that studies word structure. During morphological processing we are basically considering words in a text separately and trying to identify morphological classes to which these words belong. A widespread morphological task is lemmatizing (lemmatization) and stemming, which are used in many web search engines. In this case all morphological variations of a given word (known as word forms) are collapsed to one lemma or stem. [Source]

Morphology concerns the structure and meaning of words. Some words, such as “send”, are atomic (monomorphemic), while others – such as “sends”, “sending” and “resend” – appear to be constructed from several atoms or morphemes. These morphemes (bits of words) occur in a lots of other words: “thinks”, “thinking”, “reprogram”, “rethink”, etc. There is a syntax to the way morphemes can combine – the affixes mentioned so far can all combine with verbs to make verbs; others, such as “able”, combine with verbs to make adjectives: e.g., “programmable”. Sometimes the meaning of a word is a regular, productive combination of the meanings of its morphemes, such as “unreprogramability”. Frequently, however, it is not (or isn’t completely) obvious; e.g. “react”, or “establishment”. [Source; Introduction to Linguistics for Natural Language Processing]

See also Linguistics | Semantics.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Named entities;
Named entity recognition (NER)

The difference between an entity and a named entity is essentially the same difference between nouns and proper nouns. An entity can be nominal – a common thing like a city – whereas a named entity is more like a proper noun, such as a name (Paris). In other words, a named entity is something that deserves to have a name. A human is an entity, but if we give human a name, this produces a named entity. Named entities may consist of more than one word.

IE vs. IR

Named Entity Recognition (NER) labels sequences of words in a text which are the names of things, such as person and company names, or gene and protein names.

NER (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities in text into pre-defined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Most research on NER has been structured as taking an unannotated block of text, such as

    Jim bought 300 shares of Acme Corp. in 2006.

and producing an annotated block of text that highlights the names of entities,

    [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time.

In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.

State-of-the-art NER systems for English produce near-human performance.

See also [NLTK] Named Entity Recognition, which includes a short description of the “IOB format” (or sometimes “BIO Format”), developed for NP chunking. In this scheme, each token is tagged with one of three special chunk tags, I (inside), O (outside), or B (begin). A token is tagged as B if it marks the beginning of a chunk. Subsequent tokens within the chunk are tagged I. All other tokens are tagged O. The B and I tags are suffixed with the chunk type, e.g. B-NP, I-NP. Of course, it is not necessary to specify a chunk type for tokens that appear outside a chunk, so these are just labeled O.

An example of this scheme is shown here:

    [Tag Representation of Chunk Structures. Image source. Click image to open in new window.]

IOB tags have become the standard way to represent chunk structures in files. Here is how the information in 2.5 would appear in a file:

      We PRP B-NP
      saw VBD O
      the DT B-NP
      yellow JJ I-NP
      dog NN I-NP

In this representation there is one token per line, each with its part-of-speech tag
Source: Image source: Section 8.2 (Tagsets for English) in
Jurafsky D & Martin JH (2000) Speech and Language processing (p. 295)
and chunk tag.

Named entity disambiguation
Named entity normalization

Polysemy, words or phrases with different but related meanings poses a challenge to NLP; for example, “Washington” could refer to the location “Washington, DC” or the person “George Washington”; “ACE” could represent “angiotensin converting enzyme” or “acetylcholinesterase”. In polysemy, the fact that multiple entities might have the same name is common for named entities. The task of addressing the polysemy problem for named entities is called named entity disambiguation. Named entity disambiguation/normalization is the task of mapping of a named entity or type in to a unique identifier or concept (e.g., disease names may be mapped to the National Library of Medicine’s Medical Subject Headings disease terms). For more on polysemy, see Analyzing Polysemous Concepts from a Clinical Perspective: Application to Auditing Concept Categorization in the UMLS.

See also word sense disambiguation.

Ngram (n-gram)

In the fields of computational linguistics and probability, an $\small n$-gram is a contiguous sequence of $\small n$ items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The $\small n$-grams typically are collected from a text or speech corpus. When the items are words, $\small n$-grams may also be called shingles.

Using Latin numerical prefixes, an $\small n$-gram of size 1 is referred to as a “unigram”; size 2 is a “bigram” (or, less commonly, a “digram”); size 3 is a “trigram”. English cardinal numbers are sometimes used, e.g., “four-gram”, “five-gram”, and so on.

In computational biology, a polymer or oligomer of a known size is called a $\small k$-mer instead of an $\small n$-gram, with specific names using Greek numerical prefixes such as “monomer”, “dimer”, “trimer”, “tetramer”, “pentamer”, etc., or English cardinal numbers, “one-mer”, “two-mer”, “three-mer”, etc.


Sample sequence
1-gram sequence
2-gram sequence
3-gram sequence
Vernacular name
Order of resulting Markov model
Protein sequencing amino acid ... Cys-Gly-Leu-Ser-Trp ... ..., Cys, Gly, Leu, Ser, Trp, ... ..., Cys-Gly, Gly-Leu, Leu-Ser, Ser-Trp, ... ..., Cys-Gly-Leu, Gly-Leu-Ser, Leu-Ser-Trp, ...
DNA sequencing base pair ...AGCTTCGA... ..., A, G, C, T, T, C, G, A, ... ..., AG, GC, CT, TT, TC, CG, GA, ... ..., AGC, GCT, CTT, TTC, TCG, CGA, ...
Computational linguistics character ...to_be_or_not_to_be... ..., t, o, _, b, e, _, o, r, _, n, o, t, _, t, o, _, b, e, ... ..., to, o_, _b, be, e_, _o, or, r_, _n, no, ot, t_, _t, to, o_, _b, be, ... ..., to_, o_b, _be, be_, e_o, _or, or_, r_n, _no, not, ot_, t_t, _to, to_, o_b, _be, ...
Computational linguistics word ... to be or not to be ... ..., to, be, or, not, to, be, ... ..., to be, be or, or not, not to, to be, ... ..., to be or, be or not, or not to, not to be, ...

The table above shows several example sequences and the corresponding 1-gram, 2-gram and 3-gram sequences. Here are further examples:

  • 3-grams
    • ceramics collectables collectibles
    • ceramics collectables fine
    • ceramics collected by
  • 4-grams
    • serve as the incoming
    • serve as the incubator
    • serve as the independent

See also kmer.

Natural language processing

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.


[Image source (slide 8). Click image to open in new window.]

Natural language understanding

Natural language understanding (NLU) or natural language interpretation (NLI) is a subtopic of natural language processing in artificial intelligence that deals with machine reading comprehension (see the image, above). Natural language understanding is considered an AI-hard problem. There is considerable commercial interest in the field because of its application to news-gathering, text categorization, voice-activation, archiving, and large-scale content analysis.


Neo4j is a graph database management system. Neo4j is an ACID-compliant transactional database with native graph storage and processing. Neo4j is accessible using the Cypher Query Language.

... continued here ...


In linear algebra, functional analysis, and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to each vector in a vector space - except for the zero vector, which is assigned a length of zero. A seminorm, on the other hand, is allowed to assign zero length to some non-zero vectors (in addition to the zero vector).

A norm must also satisfy certain properties pertaining to scalability and additivity which are given in the formal definition below.

A simple example is two dimensional Euclidean space $\small \mathbb{R}^2$ equipped with the “Euclidean norm” (see below). Elements in this vector space (e.g., (3, 7)) are usually drawn as arrows in a 2-dimensional Cartesian coordinate system starting at the origin (0, 0). The Euclidean norm assigns to each vector the length of its arrow. Because of this, the Euclidean norm is often known as the magnitude.

A vector space on which a norm is defined is called a normed vector space. Similarly, a vector space with a seminorm is called a seminormed vector space. It is often possible to supply a norm for a given vector space in more than one way.

Double bars (or sometimes even single bars) tend to denote a norm in Mathematics. Most likely, the double bars here are denoting the Euclidean norm. This is just the length of the vector. So for example, the vector $\small \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}$ has length

    $\small \Vert \begin{matrix} 1 & 2 & 3 \end{matrix} \Vert = \sqrt{1^2 + 2^2 + 3^2} = \sqrt{14} = 3.7417$

and the vector

    $\small \Vert \begin{matrix} 3 & -1 & 2 \end{matrix} \Vert = \sqrt{3^2 + (-1)^2 + 2^2} = \sqrt{14} = 3.7417$

Notice that $\small \mathbf{Ax}$ is just a vector, so $\small \Vert \mathbf{Ax} \Vert$ is just the length of the vector. $\small \Vert \mathbf{x} \Vert$ is just the length of $\small \mathbf{x}$. So here you are looking for scaling of $\small \mathbf{x}$ under transformation by $\small \mathbf{A}$ to be between $\small m$ and $\small M$.

Source for example, above: What does double vertical-line means in linear algebra?

See also:


For good summaries of notations used in machine learning, consult:

Miscellaneous notes:

  • The superscripts in $\small \mathbb{R}^\color{Brown}{\mathbf{n}}$, $\small \mathbb{R}^\color{Brown}{\mathbf{k}}$ and $\small \mathbb{R}^\color{Brown}{\mathbf{p}}$ refer to the dimensions of matrices of real numbers (respectively) in input, hidden and output layers.

  • Similarly (elsewhere), $\small \mathbb{R}^{m \times n}$ indicates a matrix of dimensions $\small m\times n$.

  • $\small \in$ denotes “element of” (in set membership: “member of”).

  • In mathematics, the set of real numbers ($\small \mathbb{R}$) are the values of a continuous quantity that can represent a distance along a line; they include rational numbers ($\small \mathbb{Q}$), integers ($\small \mathbb{Z}$), and natural numbers ($\small \mathbb{N}$).

  • Elements of $\small \mathbb{R}^n$ are vectors. In other words, we can consider each element of $\small \mathbb{R}^n$ (the tuple of $n$ real numbers) to be a vector. $\small \mathbb{R}^n$ is more abstract than polynomials; for example,

    $\ \ \ \ \ \ \small a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3$

is an example of a triplet of numbers. Adding two vectors $\small a, b ∈ \mathbb{R}^n$ component wise results in another vector: $\small a + b = c \in \mathbb{R}^n$ . Moreover, multiplying $\small a \in \mathbb{R}^n$ by $\small \lambda \in \mathbb{R}$ results in a scaled vector $\small \lambda a \in \mathbb{R}^n$. Linear algebra focuses on the similarities between these vector concepts; we can add them together, and multiply them by scalars. We largely focus on vectors in $\small \mathbb{R}^n$ since most algorithms in linear algebra are formulated in $\small \mathbb{R}^n$. Recall that in machine learning, we often consider data to be represented as vectors in $\small \mathbb{R}^n$. [Source: Linear Algebra]

  • In mathematics, the real coordinate space of $n$ dimensions, written $\small \mathbb{R}^n$ is a coordinate space that allows several ($\small n$) real variables to be treated as a single variable. With various numbers of dimensions (sometimes unspecified), $\small \mathbb{R}^n$ is used in many areas of pure and applied mathematics, as well as in physics. With component-wise addition and scalar multiplication, it is the prototypical real vector space and is a frequently used representation of Euclidean $\small n$-space. An element of $\small \mathbb{R}^n$ is written $\small x = (x_1, x_2, \ldots, x_n)$, where each $\small x_i$ is a real number.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Obscuro seu abbreviatio [Obscure abbreviations]

  • a.s.: Almost Sure Convergence


    [Click image to open in new window.]


    [Image source, discussion. Click image to open in new window.]

    [See also.]  Here is an intuitive explanation. In terms of the sample space of events $\small \Omega$, an event $\small E$ happens almost surely if $\small Pr(E) = 1$, whereas an event happens surely if $\small E = \Omega$. An example: suppose we are independently flipping a fair coin infinitely many times. The event

      $\small \text{I will get heads infinitely often}$

    [which you need to read as “I will get an infinite number of heads, ignoring any tails that I get” rather than “I will only get heads”] is an almost sure event (because it is possible get only a finite number of heads … but how likely is that? Rigorous proof uses the Borel Cantelli lemma, if you are interested).

    In contrast,

      $\small \text{I will get heads $\small \mathbf{or}$ tails on my 16th flip}$

    must happen. This is a sure event.

  • i.i.d.: Independent and Identically Distributed random variables

    • In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (i.i.d. or iid or IID) if each random variable has the same probability distribution as the others and all are mutually independent. The annotation IID is particularly common in statistics, where observations in a sample are often assumed to be effectively IID for the purposes of statistical inference. The assumption (or requirement) that observations be IID tends to simplify the underlying mathematics of many statistical methods …

      Often the IID assumption arises in the context of sequences of random variables. Then “independent and identically distributed” in part implies that an element in the sequence is independent of the random variables that came before it. In this way, an IID sequence is different from a Markov sequence, where the probability distribution for the nth random variable is a function of the previous random variable in the sequence (for a first order Markov sequence). An IID sequence does not imply the probabilities for all elements of the sample space or event space must be the same; for example, repeated throws of loaded dice will produce a sequence that is IID, despite the outcomes being biased.

  • i.f.f.: If and only if

    • In logic and related fields such as mathematics and philosophy, if and only if (shortened iff ) is a biconditional logical connective between statements. In that it is biconditional (a statement of material equivalence), the connective can be likened to the standard material conditional (“only if”, equal to “if … then”) combined with its reverse (“if”); hence the name. The result is that the truth of either one of the connected statements requires the truth of the other (i.e. either both statements are true, or both are false). It is controversial whether the connective thus defined is properly rendered by the English “if and only if”, with its pre-existing meaning.

      In writing, phrases commonly used, with debatable propriety, as alternatives to $\small \textit{P “if and only if” Q}$ include:

      • $\small \textit{Q is necessary and sufficient for P}$,
      • $\small \textit{P is equivalent (or materially equivalent) to Q}$  (compare material implication),
      • $\small \textit{P precisely if Q}$,
      • $\small \textit{P precisely (or exactly) when Q}$,
      • $\small \textit{P exactly in case Q}$, and
      • $\small \textit{P just in case Q}$.

      Some authors regard “iff” as unsuitable in formal writing; others use it freely.

      In logical formulae, logical symbols are used instead of these phrases; see the discussion of notation.

      Notation. The corresponding logical symbols are $\leftrightarrow$, $\Leftrightarrow$, and $\equiv$ and sometimes $\text{iff}$ …

Common Latin and Non-English Abbreviations Used in Research

A.D. (Anno Domini). Used to date years by reckoning the date of Christ’s birth, as opposed to B.C., the years “Before Christ.” Literally, Anno Domini means “In the year of the Lord.” Remember two important notes! Anno Domini does not mean “After Death ” (if it did, there would be a thirty-three year gap between 1 BC and the crucifixion thirty-three years later). Also note the politically correct tendency is to use the abbreviation CE (Common Era) and BCE (Before Common Era). These abbreviations are an attempt to avoid the religious connotations of the Latin abbreviation. In spite of the name change, BCE and CE still divide history according to the life of Christ, but CE and BCE may be less offensive (or at least less ethnocentric) to a non-Christian audience.

c. (circa). Used by historians to show that a date is approximate. Literally, the word means “around,” and it is sometimes abbreviated “ca.” Usage: Shortly after Henry IV seized the throne from Richard II, Geoffrey Chaucer died (c.1400 A.D.), perhaps due to old age.

cf. (confere). A Latin imperative suggesting the reader should compare and contrast one statement or idea with another one. Literally, “compare.” Researchers often follow the abbreviation with a reference to an author or page number, suggesting the reader look for similarities and differences between what a previous citation has said with the subsequent source listed. Usage: Some scholars think Hitler’s Mein Kampf used genocidal ideas found in earlier anti-Semitic literature him (Smith 42), but others argue Hitler himself was the primary originator (cf. Jones 98).

e.g. (exempli gratia). “For example.” Literally, “free as an example.” Usage: “We have numerous problems to deal with before reforming welfare policies, e.g., the trade deficit, Medicare, and social security.”

et al. [et alia (neuter plural)]. It can also be an abbreviation for et alii (masculine plural), or et aliae (feminine plural). This phrase means “and others.” [Source]

etc. (et cetera). “And so on.” This is the one Latin abbreviation most students already know, and the one they tend to overuse. Do note that, since et already means and, it is redundant to write, “and etc.” Literally, the Latin phrase means “and other things.” Usage: The problems of the Balkan Republics are numerous, including insufficient electric power, poor highways, rampant unemployment, hostile neighbors, etc.

et pass. (et passim). And also found throughout the subsequent pages or sections. Literally, “And in the following.” The abbreviation typically appears after a citation of a single page, suggesting the reader look at that page first and then skim the material following for further discussion. Usage: For further discussion of this important issue, see Smith 42 et passim.

ib./ ibid. (ibidem). “In the same passage or page quoted above.” Literally, “In the same place.” Usage: “One physicist compared the behavior of quarks to bowling pins (Jones 35). He also indicated that the ‘Charm’ quark was like a ‘bowling ball’ (ibid.) due to the way it. . . .”

i.e. (id est). “That is more precisely.” Literally, “it is.” Commonly used to refine a general statement or provide additional information. Usage: “Jerry’s girlfriend always managed to turn the conversation toward children, i.e., the possibility of having children together; i.e., the possibility of having legitimate children together; i.e., toward the subject of marriage.”

sic. Indicates a misspelling or error in a quoted source, in order to verify to the reader that the researcher did not create a typographical error, but instead exactly reproduces the way the word or statement appeared in the original material. Literally, “yes” or “even thus” in Latin. Usage: There are, according to the writings of seven-year old Andrew, “Manee wayes of riting words” [sic].

Ph.D. (Philosophiae Doctor). “Doctor (or Doctorate) of Philosophy.” It can refer to the individual as a title, or to the degree itself. Note that it is redundant to write, “Dr. McGillicutty is a Ph. D.” unless the writer seeks to distinguish him from a medical doctor such as an M.D. Usage: “Joe Bob McGillicutty, Ph. D., is on the committee.” Or, “McGillicutty earned his Ph. D. in art history.”

vs. (versus. “Turned against.)” Often used in abbreviations for legal trials–though “v.” is more common. Usage: “In the case of Roe v. Wade, the Supreme Court eventually decided that abortion was a medical right.” Don’t confuse the term “vs.” with “v.s.” (see below). And don’t confuse the word versus with verses.

Less Common Foreign Abbreviations

a.v. (ad valorem). “In proportion to the value of [something else].” Literally, “To the value.” Usage: “The monetary worth of the dollar is figured a.v. the price of gold.”

i.a. (in absentia). “In absence.” Usage: “With further evidence i.a., it is impossible to provide a definitive answer.” Or more commonly, “The criminal who had fled the country was tried and found guilty of murder, i.a.”

MS. (manuscriptum). A document, particularly an ancient or historical manuscript, that was not printed, but rather drawn or written. Literally, “By hand.” The term is capitalized when attached to a specific document’s title, and the plural form is MSS. In British usage, only the final letter typically has a period. Usage: “MS. Vercilli was found in Northern Italy, and it appears to be written in an Anglo-Saxon dialect.”

P.S. (post scriptum). The abbreviation indicates a last-minute addition to a letter or document. Literally, “After what has been written.” Usage: “That’s all for now. Take care. Love, John. P.S. Don’t forget to write me back!”

R.S.V.P. (Repondez S’il Vous-Plait). “Please send a response confirming whether or not you will accept the invitation.” The abbreviation is French rather than Latin. Literally, “Respond if it pleases you.” Note that it is redundant to write, “Please RSVP,” since the phrase itself implies “please.” Usage: “You are cordially invited to a wine-and-cheese reception at the Bradson’s House. RSVP by Thursday afternoon.”

S.P.Q.R. (Senatus Populusque Romani). The abbreviation was used in Roman times as a part of official government documentation. Today, the phrase is used to refer generally (and sometimes pompously or ironically) to the power, glory, and bureaucracy of a major nation. Literally, “The Senate and the People of Rome.” Usage: “The S.P.Q.R. has spoken, and now American soldiers must obey the call to arms.”

s.p.s. (sine prole supersite). “Without surviving issue.” The phrase is used in inheritance laws to indicate that an individual has no children or legal inheritors. Usage: “Since Mrs. Clayton died s.p.s., her six million dollar estate will revert to the City of Portland.”

t.i.d. (ter in die). “Three times a day.” Used by older pharmacies and doctors to indicate that a medication should be taken three times a day. Usage: “Aspirin, t.i.d.; call if headaches continue.”

viz. (videlicit). “More appropriately or accurately; namely.” The abbreviation is often used interchangeably with i.e. Literally, “As it befits or is pleasing to him.” Usage: “He was a minor Duke in the House of Lords, viz. the Duke of Rochester.”

vide. (“Look” or “see”). This phrase refers the reader back up to a previous statement or definition within the body of the paper. The must common uses are “vide 63” (which means “see page sixty-three”), v.s. vide supra (“see earlier” or “look above on this page”) and v.i. vide infra (“See below” or “Look below”). Don’t confuse v.s. (vide supra) with v. or vs. (versus). Usage: “For the definition of the Latin word videlicit, vide supra.”

N.B.: (Nota Bene). The Latin imperative means “Take notice of this very carefully,” that is, pay special attention to this part because it is unusually important, tricky, or confusing. Usage: All assignments are due at the beginning of class. N. B.: I lock the door to the classroom once lecture begins.


An ontology is a representation / specification of a conceptualization of a domain of knowledge, characterizing the classes and relations that exist in the domain. That is, an ontology is a description (like a formal specification of a program) of the concepts and relationships that can exist for an agent or a community of agents. Commonly, ontologies are represented as graph structure that represents a taxonomy.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


See Proofs.


Parameters are variables of a model that the machine learning system trains on its own. For example, weights are parameters whose values the ML system gradually learns through successive training iterations.

Contrast with hyperparameters, the “knobs” that you tweak during successive runs of training a model. For example, learning rate is a hyperparameter.

Partial Derivatives

See my note, here:


[Click image to open in new window.]


$\small \frac{\partial}{\partial x}2x^3 = 6x^2$

$\small \frac{\partial}{\partial x}(2x^3 + 3c) = 6x^2 + 0 = 6x^2$

$\small \frac{\partial}{\partial x}3c = 0$

$\small \frac{\partial}{\partial x}c = 0$

$\small \frac{\partial}{\partial x}(2x^3 + 5z^4) = 6x^2$

$\small \frac{\partial}{\partial z}(2x^3 + 5z^4) = 20z^3$


Parts of speech (POS)

In traditional grammar, a part of speech (abbreviated form: PoS or POS) is a category of words (or, more generally, of lexical items) which have similar grammatical properties. Words that are assigned to the same part of speech generally display similar behavior in terms of syntax – they play similar roles within the grammatical structure of sentences – and sometimes in terms of morphology, in that they undergo inflection for similar properties. Commonly listed English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, and sometimes numeral, article or determiner.

  • Noun (names): a word or lexical item denoting any abstract entity (abstract noun: e.g. home) or concrete entity (concrete noun: e.g. house); a person (police officer, Michael), place (coastline, London), thing (necktie, television), idea (happiness), or quality (bravery). Nouns can also be classified as count nouns or non-count nouns; some can belong to either category. The most common part of speech; nouns are called naming words.

  • Pronoun (replace or again placed): a substitute for a noun or noun phrase (them, he). Pronouns make sentences shorter and clearer since they replace nouns.

  • Adjective (describes, limits): a modifier of a noun or pronoun (big, brave). Adjectives make the meaning of another word (noun) more precise.

  • Verb (states action or being): a word denoting an action (walk), occurrence (happen), or state of being (be). Without a verb a group of words cannot be a clause or sentence.

  • Adverb (describes, limits): a modifier of an adjective, verb, or another adverb (very, quite). Adverbs make language more precise.

  • Preposition (relates): a word that relates words to each other in a phrase or sentence and aids in syntactic context (in, of). Prepositions show the relationship between a noun or a pronoun with another word in the sentence.

  • Conjunction (connects): a syntactic connector; links words, phrases, or clauses (and, but). Conjunctions connect words or group of words

  • Interjection (expresses feelings and emotions): an emotional greeting or exclamation (Woot, Hurray). Interjections express strong feelings and emotions.

  • Article (describes, limits): a grammatical marker of definiteness (the) or indefiniteness (a, an). The article is not always listed among the parts of speech. It is considered by some grammarians to be a type of adjective[13] or sometimes the term ‘determiner’ (a broader class) is used.

In the English language, words can be considered as the smallest elements that have distinctive meanings. Based on their use and functions, words are categorized into several types or parts of speech. This article will offer definitions and examples for the 8 major parts of speech in English grammar: noun, pronoun, verb, adverb, adjective, conjunction, preposition, and interjection.


[Image source. Click image to open in new window.]


[Image source. Click image to open in new window.]

POS: functional classification

Linguists recognize that the above list of eight or nine word classes is drastically simplified. For example, “adverb” is to some extent a catch-all class that includes words with many different functions. Some have even argued that the most basic of category distinctions, that of nouns and verbs, is unfounded, or not applicable to certain languages.

Modern linguists have proposed many different schemes whereby the words of English or other languages are placed into more specific categories and subcategories based on a more precise understanding of their grammatical functions.

Common lexical categories defined by function may include the following (not all of them will necessarily be applicable in a given language):

  • Categories that will usually be open classes:
    • adjectives
    • adverbs
    • nouns
    • verbs (except auxiliary verbs)
    • interjections
  • Categories that will usually be closed classes:
    • auxiliary verbs
    • clitics
    • coverbs
    • conjunctions
    • determiners (articles, quantifiers, demonstrative adjectives, and possessive adjectives)
    • particles
    • measure words or classifiers
    • adpositions (prepositions, postpositions, and circumpositions)
    • preverbs
    • pronouns
    • contractions
    • cardinal numbers

Within a given category, subgroups of words may be identified based on more precise grammatical properties. For example, verbs may be specified according to the number and type of objects or other complements which they take. This is called subcategorizaton.

Many modern descriptions of grammar include not only lexical categories or word classes, but also phrasal categories, used to classify phrases, in the sense of groups of words that form units having specific grammatical functions. Phrasal categories may include noun phrases (NP), verb phrases (VP) and so on. Lexical and phrasal categories together are called syntactic categories.


[Image source. Click image to open in new window.]


[Image source: Section 8.2 (Tagsets for English) in Jurafsky D & Martin JH (2000)
  Speech and Language processing (p. 295). Click image to open in new window.]

Performance measures (metrics)

See this link, which discusses:

  • ROC curves (receiver operating characteristic curve)
  • Accuracy, precision, recall, $\small F_1$ score
    • true positives, true negatives
    • false positives, false negatives
    • confusion matrix
    • accuracy
    • misclassification rate
    • precision
    • recall (sensitivity; true positive rate)
    • false positive rate
    • specificity
    • prevalence
    • $\small F_1$ score
  • Performance metrics relevant to ML, NLP:
    • MRR (mean reciprocal rank)
    • filtered MRR
    • perplexity
    • BLEU (BiLingual Evaluation Understudy) score
    • ROUGE score


See Architectures; Algorithms; Frameworks; Libraries; Models; Platforms.


Polysemy describes words or phrases with different but related meanings poses a challenge to NLP; for example, “Washington” could refer to the location “Washington, DC” or the person “George Washington”; “ACE” could represent “angiotensin converting enzyme” or “acetylcholinesterase”. In polysemy, the fact that multiple entities might have the same name is common for named entities. The task of addressing the polysemy problem for named entities is called named entity disambiguation.

See also: Named entity disambiguation;   Word sense disambiguation.

See also Linguistics | Semantics.


A portmanteau or portmanteau word is a linguistic blend of words, in which parts of multiple words or their phones (sounds) are combined into a new word, as in “smog”, coined by blending “smoke” and “fog” – or “motel”, from “motor” and “hotel”. In linguistics, a portmanteau is defined as a single morph that represents two or more morphemes.

The definition of portmanteau overlaps with the grammatical term contraction, but contractions are formed from words that would otherwise appear together in sequence, such as “do” and “not” to make “don’t”, whereas a portmanteau word is formed by combining two or more existing words that all relate to a singular concept. A portmanteau also differs from a compound, which does not involve the truncation of parts of the stems of the blended words. For instance, “starfish” is a compound, not a portmanteau, of “star” and “fish”; whereas a hypothetical portmanteau of “star” and “fish” might be “stish”. A portmanteau of “coyote” and “wolf” is “coywolf”.

Postulate (Axiom)

See Proofs.


Pragmatics is a subfield of linguistics and semiotics that studies the ways in which context contributes to meaning. Pragmatics encompasses speech act theory, conversational implicature, talk in interaction and other approaches to language behavior in philosophy, sociology, linguistics and anthropology. Unlike semantics, which examines meaning that is conventional or “coded” in a given language, pragmatics studies how the transmission of meaning depends not only on structural and linguistic knowledge (e.g., grammar, lexicon, etc.) of the speaker and listener, but also on the context of the utterance, any pre-existing knowledge about those involved, the inferred intent of the speaker, and other factors. In this respect, pragmatics explains how language users are able to overcome apparent ambiguity, since meaning relies on the manner, place, time, etc. of an utterance.

  • The ability to understand another speaker’s intended meaning is called pragmatic competence.

Pragmatics is:

  • the study of the practical aspects of human action and thought.
  • the study of the use of linguistic signs, words and sentences, in actual situations.[1]

Pragmatics outlines the study of meaning in the interactional context. It looks beyond the literal meaning of an utterance and considers how meaning is constructed as well as focusing on implied meanings. It considers language as an instrument of interaction, what people mean when they use language and how we communicate and understand each other.

Pragmatics considers:

  • the negotiation of meaning between speaker and listener.
  • the context of the utterance.
  • the meaning potential of an utterance.

What would happen to language if Pragmatics did not exist? Pragmatics acts as the basis for all language interactions and contact. It is a key feature to the understanding of language and the responses that follow this. Therefore, without the function of Pragmatics, there would be very little understanding of intention and meaning.


See also Linguistics | Semantics.

Pretrained models

Pretrained models are models or model components (such as word or language embeddings) that have been already been trained. Sometimes, pretrained embeddings are fed into a neural network. [Contrast this with the more common situation where the model trains the embeddings de novo, rather than relying on the pretrained embeddings.] The basic steps are:

  1. You have machine learning model $\small m$.

  2. Pretraining: You have a dataset $\small A$ on which you train $\small m$.

  3. You have a dataset $\small B$. Before you start training the model, you initialize some of the parameters of $\small m$ with the model, which was trained on $\small A$.

  4. Fine-tuning: you train $\small m$ on $\small B$.

Pretrained models: word embeddings & language models

Collobert and Weston demonstrated the power of pretrained word embeddings as a highly effective tool when used in downstream tasks in their 2008 paper A Unified Architecture for Natural Language Processing, while also announcing a neural network architecture upon which many current approaches are built. It was Mikolov et al. (2013), however, who popularized word embedding through the introduction of word2vec, a toolkit enabling the training and use of pretrained embeddings (Efficient Estimation of Word Representations in Vector Space).

A particularly exciting recent advance in NLP is the development of pretrained language models such as ELMo (released in February 2018 by Allen NLP), ULMFiT (May 2018 by and Aylien Ltd.), and OpenAI Transformer (June 2018 by OpenAI). Those papers demonstrated that pretrained language models can achieve state of the art results on a wide range of NLP tasks. Language models provide an improved understanding of the semantic content and latent (hidden) relationships in documents.

For an introductory discussion of the potential impact of those models. e.g. on transfer learning (below), see:

Pretrained models: transfer learning

Note also the links provided immediately above.

Transfer Learning (a subfield of which is domain adaptation) is the reuse of a pretrained model on a new problem. In transfer learning, the knowledge of an already trained machine learning model is applied to a different but related problem. For example, if you trained a simple classifier to predict whether an image contains a backpack, you could use the knowledge that the model gained during its training to recognize other objects like sunglasses. With transfer learning, we basically try to exploit what has been learned in one task to improve generalization in another. We transfer the weights that a network has learned at Task A to a new Task B. The general idea is to use knowledge, that a model has learned from a task where a lot of labeled training data is available, in a new task where we don’t have a lot of data. Instead of starting the learning process from scratch, you start from patterns that have been learned from solving a related task.

Although multitask learning and transfer learning have similarities, they are not the same. Transfer learning only aims at achieving high performance in the target task by transferring knowledge from the source task, while multitask learning tries to learn the target and the source tasks simultaneously.

Whereas in transfer learning we mainly care about doing well on our target task or domain, in multitask learning the objective is to do well on all available tasks (An Overview of Multi-Task Learning in Deep Neural Networks). Alternatively, we can also use the knowledge acquired by learning from related tasks to do well on a target. Crucially, in contrast to transfer learning, some labeled data is usually assumed for each task. In addition, models are trained jointly on source and target task data, which is not the case for all transfer learning scenarios. However, even if target data is not available during training, insights about tasks that are beneficial for multitask learning can still inform transfer learning decisions.


See Proofs.


Here I summarize/collate several web sources.

Axiom: see Postulate.


  • An assertion that is then proved. It is often used like an informal lemma. [3]


  • This is an educated prediction that one makes based on their experience. The difference between a conjecture and a lemma/theorem/corollary is that it is usually an open research problem that either has no answer, or some partial answer.

    Conjectures are usually only considered important if they are authored by someone well known in their respective area of mathematics. Once it is proved or disproved, it ceases to be a conjecture and either becomes a fact (backed by a theorem) or there is some interesting counterexample to demonstrate how it is wrong. [1]

    Example: The Poincaré’ conjecture was a famous statement that remained an open research problem in topology for roughly a century. The claim was that every simply connected, compact 3-manifold was homeomorphic to the 3-sphere S3. This statement however is no longer a conjecture since it was famously proven by Grigori Perelman in 2003. [1]

  • A statement that is unproved, but is believed to be true (Collatz conjecture, Goldbach conjecture, twin prime conjecture). [3]


  • This is usually a result that is a direct consequence of a major theorem. Often times a theorem lends itself to other smaller results or special cases which can be shown by simpler methods once a theorem is proven. [1]

    Example: a consequence to the Hopf-Rinow theorem is that compact manifolds are geodesic complete. [1]

  • A result in which the (usually short) proof relies heavily on a given theorem (we often say that “this is a corollary of Theorem A”). [3]


  • This is an assignment of language and syntax to some property of a set, function, or other object. A definition is not something you prove, it is something someone assigns. Often you will want to prove that something satisfies a definition. [1]

    Example: we call a mapping $\small f : X \rightarrow Y$ injective if whenever $\small f(x) = f(y)$ then $\small x = y$. [1]

  • A precise and unambiguous description of the meaning of a mathematical term. It characterizes the meaning of a word by giving all the properties and only those properties that must be true. [3]



  • This is a property that one can derive or prove which is usually technical in nature and is not of primary importance to the overall body of knowledge one is trying to develop. Usually lemmas are there as precursors to larger results that one wants to obtain, or introduce a new technique or tool that one can use over and over again. [1]

    Example: in a Hausdorff space, compact subsets can be separated by disjoint open subsets. [1]

  • A minor result whose sole purpose is to help in proving a theorem. It is a stepping stone on the path to proving a theorem. Very occasionally lemmas can take on a life of their own (Zorn’s lemma, Urysohn’s lemma, Burnside’s lemma, Sperner’s lemma). [3]


  • A statement that can be shown, using a given set of axioms and definitions, to be both true and false. Paradoxes are often used to show the inconsistencies in a flawed theory (Russell’s paradox). The term paradox is often used informally to describe a surprising or counterintuitive result that follows from a given set of rules (Banach-Tarski paradox, Alabama paradox, Gabriel’s horn). [3]

Postulate (Axiom)

  • I would appreciate community input on this, but I haven’t seen this word used in any of the texts/papers I read. I would assume that this is synonymous with proposition. [1]

    I know Postulate is a synonym of axiom. Very used word in italian, but more in physics than mathematics. See Wikipedia:Axiom [1]

  • An axiom or postulate is a statement that is taken to be true, to serve as a premise or starting point for further reasoning and arguments. The word comes from the Greek axíōma (ἀξίωμα), “that which is thought worthy or fit” or “that which commends itself as evident.” [2]

  • A statement that is assumed to be true without proof. These are the basic building blocks from which all theorems are proved (Euclid’s five postulates, Zermelo-Fraenkel axioms, Peano axioms). [3]


  • In mathematics, a proof is an inferential argument for a mathematical statement. In the argument, other previously established statements, such as theorems, can be used. In principle, a proof can be traced back to self-evident or assumed statements, known as axioms, along with accepted rules of inference. Axioms may be treated as conditions that must be met before the statement applies. [4]

    Proofs are examples of exhaustive deductive reasoning or inductive reasoning and are distinguished from empirical arguments or non-exhaustive inductive reasoning (or “reasonable expectation”). A proof must demonstrate that a statement is always true (occasionally by listing all possible cases and showing that it holds in each), rather than enumerate many confirmatory cases. An unproved proposition that is believed to be true is known as a conjecture. [4]

    Proofs employ logic but usually include some amount of natural language which usually admits some ambiguity. In fact, the vast majority of proofs in written mathematics can be considered as applications of rigorous informal logic. Purely formal proofs, written in symbolic language instead of natural language, are considered in proof theory. The distinction between formal and informal proofs has led to much examination of current and historical mathematical practice, quasi-empiricism in mathematics, and so-called folk mathematics (in both senses of that term). The philosophy of mathematics is concerned with the role of language and logic in proofs, and mathematics as a language. [4]

  • Concept of scientific proof.

    While the phrase “scientific proof” is often used in the popular media, many scientists have argued that there is really no such thing. For example, Karl Popper once wrote that “In the empirical sciences, which alone can furnish us with information about the world we live in, proofs do not occur, if we mean by ‘proof’ an argument which establishes once and for ever the truth of a theory”. [5] Albert Einstein said:

    “The scientific theorist is not to be envied. For Nature, or more precisely experiment, is an inexorable and not very friendly judge of his work. It never says ‘Yes ’ to a theory. In the most favorable cases it says ‘Maybe ,’ and in the great majority of cases simply ‘No .’ If an experiment agrees with a theory it means for the latter ‘Maybe ,’ and if it does not agree it means ‘No. ’ Probably every theory will someday experience its ‘No’  – most theories, soon after conception.” [5]


  • This is a property that one can derive easily or directly from a given definition of an object. [1]

    Example: the identity element in a group is unique. [1]

  • A proved and often interesting result, but generally less important than a theorem. [3]


  • This is a property of major importance that one can derive which usually has far-sweeping consequences for the area of math one is studying. Theorems don’t necessarily need the support of propositions or lemmas, but they often do require other smaller results to support their evidence. [1]

    Example: every manifold has a simply connected covering space. [1]

  • A mathematical statement that is proved using rigorous mathematical reasoning. In a mathematical paper, the term theorem is often reserved for the most important results. [3]

“From a logical point of view, there is no difference between a lemma, proposition, theorem, or corollary – they are all claims waiting to be proved. However, we use these terms to suggest different levels of importance and difficulty.

  • “A lemma is an easily proved claim which is helpful for proving other propositions and theorems, but is usually not particularly interesting in its own right.

  • “A proposition is a statement which is interesting in its own right, while a theorem is a more important statement than a proposition which says something definitive on the subject, and often takes more effort to prove than a proposition or lemma.

  • “A corollary is a quick consequence of a proposition or theorem that was proven recently.”

Source: Footnote 4, p. 25 in Terence Tao, “Analysis I”, 3rd ed. (2015).

Proofs: Sources

[1] Terminology: Difference between Lemma, Theorem, Definition, Hypothesis, Postulate and a Proposition
[2] Axiom
[3] What is the difference between a theorem, a lemma, and a corollary?  [local copy]
[4] Mathematical proof
[5] Concept of scientific proof

See also: Scientific method.


See Proofs.


Provenance is a reference to literature from which a statement or its supporting evidence were derived.

Provenance (from the French provenir, “to come from/forth”) is the chronology of the ownership, custody or location of a historical object. The term was originally mostly used in relation to works of art but is now used in similar senses in a wide range of fields, including archaeology, paleontology, archives, manuscripts, printed books and science and computing. The primary purpose of tracing the provenance of an object or entity is normally to provide contextual and circumstantial evidence for its original production or discovery, by establishing, as far as practicable, its later history, especially the sequences of its formal ownership, custody and places of storage. The practice has a particular value in helping authenticate objects. Comparative techniques, expert opinions and the results of scientific tests may also be used to these ends, but establishing provenance is essentially a matter of documentation. The term dates to the 1780s in English. Provenance is conceptually comparable to the legal term chain of custody.

In a technological solution, blockchains can be used to ensure data provenance:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Quadratic formula

For $\small ax^2 + bx + c = 0$, the values of $\small x$ which are the solutions of the equation are given by:

$\small x = \dfrac{-b \pm \sqrt{b^2 - 4ac}}{2a}$

For the quadratic formula to work, you must have your equation arranged in the form “$\small \text{(quadratic)} = 0$”. Also, the “$\small 2a$” in the denominator of the formula is underneath everything above, not just the square root.


Solve $\small x^2 + 3x - 4 = 0$

This quadratic happens to factor as

$\small x^2 + 3x - 4 = (x + 4)(x - 1) = 0$

so the solutions are $\small x = -4$ and $\small x = 1$. However, using the quadratic formula with $\small a = 1$, $\small b = 3$, and $\small c = -4$:

$\small x = \dfrac{-(3) \pm \sqrt{(3)^2 - 4(1)(-4)}}{2(1)}$ $\small = \dfrac{-3 \pm \sqrt{9 + 16\,}}{2} = \dfrac{-3 \pm \sqrt{25}}{2}$ $\small = -4,\, 1$

The solution is $\small x = -4$ and $\small x = 1$.

Question answering

Question answering (QA), a computer science discipline within the fields of information retrieval and natural language processing (NLP), is concerned with building systems that automatically answer questions posed by humans in a natural language. A QA implementation, usually a computer program, may construct its answers by querying a structured database of knowledge or information, usually a knowledge base. More commonly, QA systems can pull answers from an unstructured collection of natural language documents.

Some examples of natural language document collections used for QA systems include:

  • a local collection of reference texts
  • internal organization documents and web pages
  • compiled newswire reports
  • a set of Wikipedia pages
  • a subset of World Wide Web pages

Question answering (QA), the identification of short accurate answers to users questions presented in natural language, has numerous applications in the biomedical and clinical sciences including directed search, interactive learning and discovery, clinical decision support, and recommendation. Due to the large size of the biomedical literature and a lack of efficient searching strategies, researchers and medical practitioners often struggle to obtain available information available that is necessary for their needs. Moreover, even the most sophisticated search engines are not intelligent enough to interpret clinicians questions. Thus, there is an urgent need for information retrieval systems that accept queries in natural language and return accurate answers quickly and efficiently.

Question answering (a natural language understanding problem) and reading comprehension (the task of answering a natural language question about a paragraph) are of considerable interest in NLP motivating, for example, the Human-Computer Question Answering Competition (in the NIPS 2017 Competition Track), and the BioASQ Challenge in the BioNLP domain. Unlike generic text summarization, reading comprehension systems facilitate the answering of targeted questions about specific documents, efficiently extracting facts and insights (How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks).

The Stanford Question Answering Dataset (SQuAD: developed at Stanford University) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment or span of text from the corresponding reading passage (or, the question might be unanswerable). There has been a rapid progress on the SQuAD dataset, and early in 2018 engineered systems started achieving and surpassing human level accuracy on the SQuAD1.1 Leaderboard  (discussion: AI Outperforms Humans in Question Answering: Review of three winning SQuAD systems). SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 new, unanswerable questions written by crowdworkers to look adversarially similar to answerable ones (ACL 2018 Best Short Paper: Know What You Don’t Know: Unanswerable Questions for SQuAD;  [project/code]). To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Recurrent neural networks


[Folded, unfolded RNN (click image to open in new window)]

At each time step $\small t$, a RNN reads an input vector $\small \mathbf{x}_t$ into a hidden state vector $\small \mathbf{h}_t$ and predicts an output vector $\small \mathbf{y}_t$ [shown as $\small \mathbf{o}_t$ in the diagram above]. The state dynamic can be abstracted as a recurrent relation: $\small \mathbf{h}_t = RNN(\mathbf{h}_{t-1}, \mathbf{x}_t)$. The vanilla RNN is parameterized as follows:

    $\small \mathbf{h}_t = \sigma (W_h \mathbf{h}_{t-1} + V\mathbf{x}_t + \mathbf{b}_h)$

    $\small \mathbf{y}_t = W_y \mathbf{h}_t + \mathbf{b}_y$

where $\small (W_h, W_y, V, \mathbf{b}_h, \mathbf{b}_y)$ are learnable parameters, and $\small \sigma$ is a point-wise nonlinear function.

Andrej Karpathy provides a great example of a RNN, that should make it easier to visualize the working and use of a RNN:


[image source (click image to open in new window)]

One last thing to note – the weights of the connections between time steps are shared i.e. there isn’t a different set of weights for each time step. This is discussed here, Why are the weights of RNN/LSTM networks shared across time?:

  • “The ‘shared weights’ perspective comes from thinking about RNNs as feedforward networks unrolled across time. If the weights were different at each moment in time, this would just be a feedforward network [ANN, above]. But, I suppose another way to think about it would be as an RNN whose weights are a time-varying function (and that could let you keep the ability to process variable length sequences). If you did this, the number of parameters would grow linearly with the number of time steps. That would be a big explosion of parameters for sequences of any appreciable length. …”


[image sourcediscussed here. (Click image to open in new window.)]

Alex Graves shows a variation of a RNN, above, that better illustrates some additional key concepts (Generating Sequences with Recurrent Neural Networks).

  • “Fig. 1 illustrates the basic recurrent neural network prediction architecture used in this paper. An input vector sequence $\small \mathbf{x} = (x_1, \dots, x_T)$ is passed through weighted connections to a stack of $\small N$ recurrently connected hidden layers to compute first the hidden vector sequences $\small \mathbf{h}^n = (h{n \atop 1}, \dots, h{n \atop T})$ and then the output vector sequence $\small \mathbf{y} = (y_1, \dots, y_T)$. Each output vector $\small y_t$ is used to parameterise a predictive distribution $\small \text{Pr}(x_{t+1} \vert y_t)$ over the possible next inputs $\small x_{t+1}$. The first element $\small x_1$ of every input sequence is always a null vector whose entries are all zero; the network therefore emits a prediction for $\small x_2$, the first real input, with no prior information. The network is “deep” in both space and time, in the sense that every piece of information passing either vertically or horizontally through the computation graph will be acted on by multiple successive weight matrices and nonlinearities.

    “Note the ‘skip connections’ from the inputs to all hidden layers, and from all hidden layers to the outputs. These make it easier to train deep networks, by reducing the number of processing steps between the bottom of the network and the top, and thereby mitigating the “vanishing gradient” problem. In the special case that $\small N = 1$ the architecture reduces to an ordinary, single layer next step prediction RNN.”


[image source (click image to open in new window)]


[image source (click image to open in new window)]

The following material is excerpted from the main body in A Beginner’s Guide to LSTMs.

“In the case of feedforward networks, input examples are fed to the network and transformed into an output; with supervised learning, the output would be a label, a name applied to the input. That is, they map raw data to categories, recognizing patterns that may signal, for example, that an input image should be labeled “cat” or “elephant.” A feedforward network is trained, for example, on labeled images until it minimizes the error it makes when guessing their categories. The trained set of parameters (or weights) are collectively known as a model. A feedforward network has no notion of temporal order (“time”): the only input it considers is the current example upon which it is working.

“RNN recognize patterns in sequences of data such as text, genomes, handwriting, the spoken word, numerical times series data emanating from sensors, stock markets and government agencies. RNNs are even applicable to images, which can be decomposed into a series of patches and treated as a sequence. RNN take as their input not just the current input example they see, but also what they have perceived previously in time. Since these algorithms take time and sequence into account, they have a temporal dimension. The decision a recurrent net reached at time step $\small t-1$ affects the decision it will reach one moment later at time step $\small t$. RNN therefore have two sources of input, the present and the immediate past, which combine to determine how they respond to new data.

“Recurrent networks are distinguished from feedforward networks by a feedback loop connected to their past decisions, ingesting their own outputs moment after moment as input. It is often said that recurrent networks have memory. That sequential information is preserved in the recurrent network’s hidden state, which manages to span many time steps as it cascades forward to affect the processing of each new example. Mathematically, the process of carrying the memory forward is:

    $\small \mathbf{h}_t = \phi(\mathbf{W}\mathbf{x}_t + \mathbf{U}\mathbf{h}_{t-1})$

“The hidden state at time step $\small t$ is $\small h_t$. It is a function of the input at the same time step $\small x_t$, modified by a weight matrix $\small W$ (like the one we used for feedforward nets) added to the hidden state of the previous time step $\small h_{t-1}$ multiplied by its own hidden-state-to-hidden-state matrix $\small U$, otherwise known as a transition matrix and similar to a Markov chain. The weight matrices are filters that determine how much importance to accord to both the present input and the past hidden state. The error they generate will return via backpropagation and be used to adjust their weights until error can’t go any lower.

“The sum of the weight input and hidden state is squashed by the function $\small \phi$ – either a logistic sigmoid function or tanh, depending – which is a standard tool for condensing very large or very small values into a logistic space, as well as making gradients workable for backpropagation. Because this feedback loop occurs at every time step in the series, each hidden state contains traces not only of the previous hidden state, but also of all those that preceded $\small h_{t-1}$ for as long as memory can persist.”

The most basic or “vanilla” RNN consists of the following set of equations (indexed by time-step $t$), that appear various forms in the literature and the web:

    $\small \mathbf{h}_t = \sigma (\mathbf{U}_h \mathbf{x}_t + \mathbf{W}_h \mathbf{h}_{t-1})$

    $\small \mathbf{y}_t = \mathbf{O} \mathbf{h}_t$


  • $\small \mathbf{x}_t \in \mathbb{R}^n$ is the input of the RNN;

  • $\small \mathbf{h}_t \in \mathbb{R}^k$ is called the hidden state of the RNN, and acts as a memory of the current state the network. When starting a sequence, it is set to the all zero vector ($\small \mathbf{h}_{-1} = 0$).

  • $\small \mathbf{y}_t \in \mathbb{R}^p$ is the output of the RNN;

  • The logistic function (here, the sigmoid function; see Recent Advances in Recurrent Neural Networks for other forms), applied component-wise:

$\ \ \ \ \ \ \small \sigma(x) = \frac{1}{1 + \mathcal{e}^{(-x)}}$

  • $\small \mathbf{U}_{h}, \mathbf{W}_{h}, \mathbf{O}$ are the network’s parameters. [Note that “$\small \mathbf{O}$”, here, is “$\small \mathbf{V}$” elsewhere in this post; i.e. the weight matrices of the “skip connections from the hidden layers to the outputs.]

The output of such a neural network depends on both the input $\small \mathbf{x}_t$ and the hidden state $\small \mathbf{h}_t$.


  • Source for the above: An Intrinsic Difference Between Vanilla RNNs and GRU Models plus my own notes (below).

  • Undefined in that paper, I believe that I correctly interpret/describe these:

    • The superscripts in $\small \mathbb{R}^\color{Brown}{\mathbf{n}}$, $\small \mathbb{R}^\color{Brown}{\mathbf{k}}$ and $\small \mathbb{R}^\color{Brown}{\mathbf{p}}$ refer to the dimensions of matrices of real numbers (respectively) in input, hidden and output layers.

      Similarly (elsewhere), $\small \mathbb{R}^{m \times n}$ indicates a matrix of dimensions $\small m\times n$.

  • $\small \in$ denotes “element of” (in set membership: “member of”).

  • In mathematics, the set of real numbers ($\small \mathbb{R}$) are the values of a continuous quantity that can represent a distance along a line; they include rational numbers ($\small \mathbb{Q}$), integers ($\small \mathbb{Z}$), and natural numbers ($\small \mathbb{N}$).

  • Elements of $\small \mathbb{R}^n$ are vectors. In other words, we can consider each element of $\small \mathbb{R}^n$ (the tuple of $n$ real numbers) to be a vector. $\small \mathbb{R}^n$ is more abstract than polynomials; for example,

    $\ \ \ \ \ \ \small a = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3$

    is an example of a triplet of numbers. Adding two vectors $\small a, b ∈ \mathbb{R}^n$ component wise results in another vector: $\small a + b = c \in \mathbb{R}^n$ . Moreover, multiplying $\small a \in \mathbb{R}^n$ by $\small \lambda \in \mathbb{R}$ results in a scaled vector $\small \lambda a \in \mathbb{R}^n$. Linear algebra focuses on the similarities between these vector concepts; we can add them together, and multiply them by scalars. We largely focus on vectors in $\small \mathbb{R}^n$ since most algorithms in linear algebra are formulated in $\small \mathbb{R}^n$. Recall that in machine learning, we often consider data to be represented as vectors in $\small \mathbb{R}^n$. [Source: Linear Algebra]

  • In mathematics, the real coordinate space of $n$ dimensions, written $\small \mathbb{R}^n$ is a coordinate space that allows several ($\small n$) real variables to be treated as a single variable. With various numbers of dimensions (sometimes unspecified), $\small \mathbb{R}^n$ is used in many areas of pure and applied mathematics, as well as in physics. With component-wise addition and scalar multiplication, it is the prototypical real vector space and is a frequently used representation of Euclidean $\small n$-space. An element of $\small \mathbb{R}^n$ is written $\small x = (x_1, x_2, \ldots, x_n)$, where each $\small x_i$ is a real number.

Reduced row echelon

  • See discussion, here.

Reinforcement learning

Reinforcement learning (RL) is a branch of machine learning in which an agent learns from interacting with an environment; see:

Essentially, a RL framework allows an agent to learn from trial and error; the agent receives a reward by acting in the environment, and its goal is learning to select the actions that maximize the expected cumulative reward over time. Deep learning can be combined with RL to learn useful representations for problems with high dimensional raw data input.

Relation extraction

Relation extraction (RE) is a subproblem of IE that addresses the extraction of labeled relations between two named entities. Dependency parsing and phrase structure parsing may be combined for relation extraction. To minimize cascading errors, accurate sentence chunking (splitting) is required, prior to the dependency parsing step. See also: Information extraction;   Dependency parsing.

Representation learning

Representation learning is a set of techniques that learn a feature: a transformation of the raw data input to a representation that can be effectively exploited in machine learning tasks. While traditional unsupervised learning techniques are staples of machine learning, representation learning has emerged as an alternative approach to feature extraction (An Introduction to Representation Learning). In representation learning, features are extracted from unlabeled data by training a neural network on a secondary, supervised learning task. Word2vec is a good example of representation learning, simultaneously learning several language concepts:

  • the meanings of words;
  • how words are combined to form concepts (i.e., syntax); and,
  • how concepts relate to the task at hand.

See also:

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z


The term syntax refers to grammatical structure whereas the term semantics refers to the meaning of the vocabulary symbols arranged with that structure. Grammatical (syntactically valid) does not imply sensible (semantically valid), however. For example, the grammatical sentence “cows flow supremely” is grammatically ok (subject verb adverb) in English, but makes no sense. [Source: What do “syntax” and “semantics” mean and how are they different?]

Semantics is used as a technical term for the meaning of words and sentences. Semantics and its understanding as a study of meaning covers most complex tasks like: finding synonyms, word sense disambiguation, constructing question-answering systems, translating from one natural language to another, and populating knowledge bases. Basically, one needs to complete morphological and syntactical analysis before trying to solve any semantic problem. [Source]

Semantics is about the manner in which lexical meaning is combined morphologically and syntactically to form the meaning of a sentence. Mostly, this is regular, productive and rule-governed; e.g. the meaning of “John gave Mary a do “can be represented as (SOME (X) (DOG X) & (PAST-TIME (GIVE (JOHN, MARY, X)))), but sometimes it is idiomatic as in the meaning of “John kicked the bucket”, which can be (PAST-TIME (DIE (JOHN))). (To make this notation useful we also need to know the meaning of these capitalised words and brackets too.) Because the meaning of a sentence is usually a productive combination of the meaning of its words, syntactic information is important for interpretation - it helps us work out what goes with what – but other information, such as punctuation or intonation, pronoun reference, etc, can also play a crucial part. [Source; Introduction to Linguistics for Natural Language Processing]


See also:

Semantic parsing

[See also: Syntactic/Dependency parsing.] Semantic dependencies are understood in terms of predicates and their arguments. The predicate of a sentence mostly corresponds to the main verb and any auxiliaries that accompany the main verb; the arguments of that predicate (e.g. the subject and object noun phrases) are outside the predicate. The arguments of a predicate are semantically dependent on that predicate. Often, semantic dependencies overlap with and point in the same direction as syntactic dependencies. At times, however, semantic dependencies can point in the opposite direction of syntactic dependencies, or they can be entirely independent of syntactic dependencies.

Semantic processing of sentence structure uses statistics or grammar rules to produce an electronic representation that delivers logical components (for example, a ‘noun phrase’), their roles (for example, the ‘subject’) and dependencies. Semantic parsing can thus be understood as extracting the precise meaning of an utterance.

Applications of semantic parsing include question answering. An example of the application of semantic parsing in biomedical question-answering is indicated in Fig. 3 in: Titov & Klementiev (2011) [pdf]:

Semantic parsing: biomedical QA
      (click image for full-size)

Semantic role labeling

Semantic role labeling (sometimes called shallow semantic parsing) is a process in natural language processing that assigns labels to words or phrases in a sentence that indicate their semantic role in the sentence. It consists of the detection of the semantic arguments associated with the predicate or verb of a sentence and their classification into their specific roles (i.e., automatically finding the semantic roles of each argument of each predicate in a sentence).

Statistical relational learning

Statistical relational learning (sometimes called relational machine learning) is a subdiscipline of machine learning that is concerned with domain models that exhibit both uncertainty (which can be dealt with using statistical methods) and complex, relational structure. Typically, the knowledge representation formalisms developed in statistical relational learning use (a subset of) first-order logic to describe relational properties of a domain in a general manner (universal quantification) and draw upon probabilistic graphical models (such as Bayesian networks or Markov networks) to model the uncertainty; some also build upon the methods of inductive logic programming.

The field of statistical relational learning is not strictly limited to learning aspects; it is equally concerned with reasoning (specifically probabilistic inference) and knowledge representation. Therefore, alternative terms that reflect the main foci of the field include statistical relational learning and reasoning (emphasizing the importance of reasoning) and *first-order probabilistic languages (emphasizing the key properties of the languages with which models are represented).


A synonym is a word having the same or nearly the same meaning as another word.

Synonyms are words with similar meanings. They are listed in a special type of dictionary called a thesaurus. A regular dictionary lists words according to form, usually in alphabetical order; a thesaurus lists words according to meaning. Synonyms usually differ in at least one semantic feature. Sometimes the feature is objective (denotative), referring to some actual, real world difference in the referents: walk, lumber, stroll, meander, lurch, stagger, stride, mince. Sometimes the feature is subjective (connotative), referring to how the speaker feels about the referent rather than any real difference in the referent itself: die, pass away, give up the ghost, kick the bucket, croak. There tend to be very few absolute synonyms in a language. Example: sofa and couch are nearly complete synonyms, yet they differ in their collocability in at least one way: one may say couch potato, but not *sofa potato.

One special type of partial synonym is called a paronym. Paronyms are words with associated meanings which also have great similarities in form: proscribe/ prescribe, industrial/ industrious, except/accept, affect/effect. Many errors in speech and writing are due to mix-ups involving paronyms.

Examples: “gearbox/transmission”; “choice/selection”; “complex/complicated”; “pretty/attractive”; “sick/ill”; …

Synonymy and homonymy have complementary notions:

  • synonyms: the same meaning, different forms
  • homonyms: the same form, different meanings

See also Linguistics | Semantics.

Syntactic/Dependency parsing

[See also Semantic parsing.] A syntactic parser (i.e., dependency parser) analyzes the grammatical (syntactic) structure of a sentence, establishing binary relationships between “head” words and words which modify those heads; for example, a verb is linked to its dependents (arguments/modifiers). Collectively, these relations form a tree or tree-like graph.

The basic idea is that syntactic structure consists of lexical items, linked by binary asymmetric relations called dependencies. The sentence is an organized whole, the constituent elements of which are words. Every word that belongs to a sentence ceases by itself to be isolated, as in the dictionary. Between the word and its neighbors, the mind perceives connections, the totality of which forms the structure of the sentence.

The structural connections establish dependency relations between the words. The dependencies are all binary relations: a grammatical relation holds between a governor (also known as a regent or a head) and a dependent.

Thus, in the sentence “Winehouse performed …”, “performed” is the governor and “Winehouse” is the dependent (subordinate).

Among other tasks, dependency parse trees may be applied to basic relation extraction. Stanford dependencies provide a representation of grammatical relations between words in a sentence. They have been designed to be easily understood and effectively used by people who want to extract textual relations. Stanford dependencies (SD) are triplets: name of the relation, the governor, and the dependent.

For example, the sentence “Winehouse effortlessly performed her song Rehab.” yields the following dependency paths:

  nsubj(performed-3, Winehouse-1)
  advmod(performed-3, effortlessly-2)
  poss(Rehab-6, her-4)
  nn(Rehab-6, song-5)
  dobj(performed-3, Rehab-6)

In this example, the shortest path between “Winehouse” and “Rehab” is:

  Winehouse nsubj performed dobj Rehab.

and an extracted relation (triple) would be (Winehouse; performed; Rehab)

Winehouse dependency parse
      [graphs above per Stanford CoreNLP online demo]


The term syntax refers to grammatical structure whereas the term semantics refers to the meaning of the vocabulary symbols arranged with that structure. Grammatical (syntactically valid) does not imply sensible (semantically valid), however. For example, the grammatical sentence “cows flow supremely” is grammatically ok (subject verb adverb) in English, but makes no sense. [Source: What do “syntax” and “semantics” mean and how are they different?]

Roughly speaking, the syntax of a language comprises the patterns into which its words can be validly arranged to form sentences. The combination of morphology and syntax is sometimes called the grammar of a language. Syntax as part of grammar is a description of how words grouped and connected to each other in a sentence. There is a good definition of syntax for programming languages: “… syntax usually entails the transformation of a linear sequence of tokens (a token is akin to an individual word or punctuation mark in a natural language) into a hierarchical syntax tree.” The same definition also can be used for natural language. Challenges to syntactic processing are parts of speech (POS) tagging, chunking or detecting syntactic categories (verb, noun phrases), and sentence assembling (constructing syntax trees). [Source]

Syntax concerns the way in which words can be combined to form (grammatical) sentences; e.g. “revolutionary new ideas appear infrequently” is grammatical in English, “colourless green ideas sleep furiously” is grammatical but nonsensical, while “ideas green furiously colourless sleep” is also ungrammatical. Words combine syntactically in certain orders in a way which mirrors the meaning conveyed; e.g. John loves Mary means something different from Mary loves John. The ambiguity of “John gave her dog biscuits” stems from whether we treat “her” as an independent pronoun and “dog biscuits” as a compound noun, or whether we treat “her” as a demonstrative pronoun modifying “dog”. We can illustrate the difference in terms of possible ways of bracketing the sentence: (john (gave (her) (dog biscuits))), vs. (john (gave (her dog) (biscuits))). [Source; Introduction to Linguistics for Natural Language Processing]

See also Linguistics | Semantics.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Text mining

Text mining (TM) comprises the discovery and extraction of knowledge from free text, and can extend to the generation of new hypotheses by joining the extracted information from several publications. Text mining solutions can achieve different objectives, depending on the tasks they then have to address. Primarily, we can distinguish four different categories of purposes for text-mining solutions: information retrieval, information extraction, building knowledge bases, and knowledge discovery. These categories are illustrated in the following figure [image source]:

text mining
      (click image for full-size)


See Proofs.

Transfer learning

See Transfer learning (a subsection in Pretrained models).



A troponym denotes a particular way to do an entry’s referent. Troponymy is a transitive relation.


  • to “trim” and to “slice” are troponyms of to “cut”
  • to “slide” and to “spin” are troponyms of to “move”
  • to “snack” and to “nibble” are troponyms of to “eat”

See also Linguistics | Semantics.

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Variational autoencoders

In neural networks, a variational autoencoder (VAE) consists of an encoder, a decoder, and a loss function. In probability models, the VAR refers to approximate inference in a latent Gaussian model where the approximate posterior and model likelihood are parametrized by neural nets (the inference and generative networks).

See also:

Aside: like the VAE, another common generative framework (model) is the generative adversarial network (GAN) [Goodfellow IJ et al. (2014) Generative Adversarial Nets]. In a GAN, a generator and an auxiliary adversarial discriminator are trained together. On the other hand, in VAE, an encoder and a decoder (or generator) are both trained according to Bayesian models. Both frameworks contain two components (generator and discriminator or encoder and decoder), where each of them requires training.

Vector space

Refer here (Euclidean space).

A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R   S   T   U   V   W   X   Y   Z

Word sense disambiguation

In computational linguistics, word-sense disambiguation (WSD) is an open problem of natural language processing and ontology. WSD is the task of identifying which sense/meaning of a word is used in a sentence, when the word has multiple meanings. WSD is basically solution to the ambiguity which arises due to different meaning of words in different contexts. For example, consider the two sentences:

     “The bank will not be accepting cash on Saturdays. “
     “The river overflowed the bank.”

The word “bank” in the first sentence refers to the commercial (finance) banks, while in second sentence, it refers to the river bank.

[For more examples, consult WordNet; e.g., the WordNet search for “cell” gives 7 meanings.]

The solution to this problem impacts other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, inference, et cetera.

See also Named entity disambiguation.