# Machine Learning Glossary

## Persagen.com

This file:  Persagen.com/files/ml_glossary.html

• ACTIVATION FUNCTIONS: [ Source; image ]  |  Victoria: edit these: GitHub author is an amateur so these definitions/examples may be sl. 'suspect'!

| Function | Description | Definition |
| --- | --- | --- |
| Identity | Don't transform the incoming data. That's what you would expect at input layers. | $x$ |
| Sigmoid | The de facto standard activation before Relu; smoothly maps the incoming activation into a range from zero to one (the logistic function). | $s(x) = \large \frac{1}{1 + \mathcal{e}^{-x}}$ |
| Relu | Fast non-linear function that has proven to be effective in deep networks. | $max(0, x)$ |
| Softmax | Smooth activation function where the outgoing activations sum up to one. It's commonly used for output layers in classification because the outgoing activations can be interpreted as probabilities. | $\large \frac{\mathcal{e}^x}{\sum{\mathcal{e}^x}}$ |
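As a minimal sketch (assuming NumPy; these are standard textbook formulas, not taken from the GitHub source above), the four activations can be written as:

```python
import numpy as np

def identity(x):
    # Pass the input through unchanged (what you'd expect at input layers).
    return x

def sigmoid(x):
    # Smoothly map activations into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zero out negative activations; fast and non-linear.
    return np.maximum(0.0, x)

def softmax(x):
    # Shift by the max for numerical stability; outputs sum to one.
    e = np.exp(x - np.max(x))
    return e / e.sum()
```

Note the max-shift inside `softmax`: it changes nothing mathematically but avoids overflow (see the numeric-stability discussion under CROSS ENTROPY below).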

• AFFINE: e.g., affine activation functions. Wikipedia: In geometry, an affine transformation, affine map or an affinity (from the Latin, affinis, "connected with") is a function between affine spaces which preserves points, straight lines and planes. Also, sets of parallel lines remain parallel after an affine transformation. An affine transformation does not necessarily preserve angles between lines or distances between points, though it does preserve ratios of distances between points lying on a straight line.

Examples of affine transformations include translation, scaling, homothety, similarity transformation, reflection, rotation, shear mapping, and compositions of them in any combination and sequence.

If $X$ and $Y$ are affine spaces, then every affine transformation $f : X \rightarrow Y$ is of the form $x \mapsto Mx + b$, where $M$ is a linear transformation on $X$ and $b$ is a vector in $Y$. Unlike a purely linear transformation, an affine map need not preserve the zero point in a linear space. Thus, every linear transformation is affine, but not every affine transformation is linear.

All Euclidean spaces are affine, but there are affine spaces that are non-Euclidean. A classic illustration is an image of a fern-like fractal that exhibits affine self-similarity: each of the leaves of the fern is related to each other leaf by an affine transformation. For instance, the red leaf can be transformed into both the small dark blue leaf and the large light blue leaf by a combination of reflection, rotation, scaling, and translation.

• What is the difference between linear and affine function?
• $f(x) = 2x$ is linear and affine. $f(x) = 2x + 3$ is affine but not linear.
• A quick definition for linearity would be "$f(x)$ is linear if $f(\alpha x_1 + \beta x_2) = \alpha f(x_1) + \beta f(x_2)$".
• A linear function fixes the origin, whereas an affine function need not do so. An affine function is the composition of a linear function with a translation, so while the linear part fixes the origin, the translation can map it somewhere else. As an example, linear functions $\mathbb{R^2} \rightarrow \mathbb{R^2}$ preserve the vector space structure (so in particular they must fix the origin). While affine functions don't preserve the origin, they do preserve some of the other geometry of the space, such as the collection of straight lines. If you choose a basis for vector spaces $V$ and $W$, and consider functions $f:V \rightarrow W$, then $f$ is linear if $f(v) = Av$ for some matrix $A$ (of the appropriate size), and $f$ is affine if $f(v) = Av + b$ for some matrix $A$ and vector $b \in W$.
• Q. Affine functions preserve the distance between two points?
A. An affine transformation does not necessarily preserve angles between lines or distances between points, though it does preserve ratios of distances between points lying on a straight line.
• An affine function is the composition of a linear function followed by a translation. $ax$ is linear; $ax + b$ is affine. See Modern Basic Pure Mathematics: C. Sidney
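The linear-versus-affine distinction above can be checked numerically. In this sketch (assuming NumPy; the matrix $A$ and offset $b$ are arbitrary illustrative values), the affine map fails the linearity test and does not fix the origin:

```python
import numpy as np

A = np.array([[2.0, 0.0], [0.0, 2.0]])  # linear part
b = np.array([3.0, -1.0])               # translation

def linear(v):
    return A @ v

def affine(v):
    return A @ v + b

v1, v2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# Linearity: f(v1 + v2) == f(v1) + f(v2). The affine map fails this
# because the translation b gets counted twice on the right-hand side.
print(np.allclose(linear(v1 + v2), linear(v1) + linear(v2)))  # True
print(np.allclose(affine(v1 + v2), affine(v1) + affine(v2)))  # False

# The affine map does not fix the origin; it sends 0 to b:
print(affine(np.zeros(2)))
```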

• Why must a nonlinear activation function be used in a backpropagation neural network? [StackOverflow]
• Q. " I've been reading some things on neural networks and I understand the general principle of a single layer neural network. I understand the need for additional layers, but why are nonlinear activation functions used?"
• A. The purpose of the activation function is to introduce non-linearity into the network. Non-linear means that the output cannot be reproduced from a linear combination of the inputs (which is not the same as output that renders to a straight line -- the word for this is affine). Another way to think of it: without a non-linear activation function in the network, a NN, no matter how many layers it has, would behave just like a single-layer perceptron, because summing these layers just gives you another linear function (see the definition just above). ...
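The "layers collapse without a non-linearity" claim can be verified directly: composing two weight matrices with no activation in between is exactly one linear map with the combined matrix. A minimal NumPy sketch with arbitrary random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # first "layer"
W2 = rng.normal(size=(2, 4))  # second "layer"
x = rng.normal(size=3)

# Two layers applied in sequence, with no nonlinearity between them...
two_layer = W2 @ (W1 @ x)
# ...are exactly one linear layer with the combined weight matrix W2 @ W1.
one_layer = (W2 @ W1) @ x
print(np.allclose(two_layer, one_layer))  # True
```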

• ARGMAX: arguments of the maxima (abbreviated $arg max$ or $argmax$) are the points of the domain of some function at which the function values are maximized. In contrast to global maxima, referring to the largest OUTPUTS of a function, $argmax$ refers to the INPUTS, or ARGUMENTS, at which the function outputs are as large as possible.

• ARGMIN: ($argmin$ | $arg min$) -- the argument of the minimum -- are the points '$x$' for which $f(x)$ attains its smallest value.
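A quick NumPy illustration of the inputs-versus-outputs distinction for argmax and argmin:

```python
import numpy as np

scores = np.array([0.1, 0.7, 0.2])

print(np.argmax(scores))  # 1   -- the INDEX (argument) at which the max occurs
print(np.max(scores))     # 0.7 -- the largest OUTPUT value itself
print(np.argmin(scores))  # 0   -- the index at which the minimum occurs
```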

• BIAS, VARIANCE:

• Error due to Bias: The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict. Of course you only have one model so talking about expected or average prediction values might seem a little strange. However, imagine you could repeat the whole model building process more than once: each time you gather new data and run a new analysis creating a new model. Due to randomness in the underlying data sets, the resulting models will have a range of predictions. Bias measures how far off in general these models' predictions are from the correct value.

• Error due to Variance: The error due to variance is taken as the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model.

• Essentially, bias is how removed a model's predictions are from correctness, while variance is the degree to which these predictions vary between model iterations. [Figures: a graphical illustration of bias and variance, and of bias and variance contributing to total error, from Understanding the Bias-Variance Tradeoff, by Scott Fortmann-Roe.]
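The "repeat the whole model-building process" thought experiment above can be sketched numerically: fit an intentionally underpowered straight line to many fresh noisy samples of a quadratic, then measure the bias and variance of the predictions at one point. Assumes NumPy; the quadratic target and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: x ** 2   # the function we are trying to learn
x0 = 1.5                    # point at which we measure bias and variance

preds = []
for _ in range(200):        # repeat the entire model-building process
    x = rng.uniform(-2, 2, size=30)
    y = true_f(x) + rng.normal(scale=0.5, size=30)
    coeffs = np.polyfit(x, y, deg=1)       # underfit: a straight line
    preds.append(np.polyval(coeffs, x0))

preds = np.array(preds)
bias = preds.mean() - true_f(x0)  # how far off the models are on average
variance = preds.var()            # how much predictions vary across models
print(bias, variance)
```

The straight-line model is systematically wrong about a quadratic (large bias) but its predictions are fairly stable across datasets (modest variance); swapping in a high-degree polynomial would flip that trade-off.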

• CLASSIFICATION: When the target variable ($y$) that we're trying to predict is continuous, such as housing prices, we call the learning problem a regression problem. When $y$ can take on only a small number of discrete values (such as: if given the living area, we want to predict if a dwelling is a house or an apartment), we call it a classification problem. [Source (bottom p. 2)]

• Victoria: note however that regression can have categorical outcome variables; e.g.; logistic regression (binary: {0, 1}), or multinomial logistic regression ...

• [comprehensive review:] Sokolova M & Lapalme G. (2009) A systematic analysis of performance measures for classification tasks. pdf
• This paper presents a systematic analysis of twenty four performance measures used in the complete spectrum of Machine Learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of data. Then the analysis concentrates on the type of changes to a confusion matrix that do not change a measure, therefore, preserve a classifier's evaluation (measure invariance). The result is the measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications where invariance properties of measures lead to a more reliable evaluation of classifiers. Text classification supplements the discussion with several case studies.

• COST FUNCTIONS: [ Source; image ]  |  Victoria: edit these: GitHub author is an amateur so these definitions/examples may be sl. 'suspect'!

| Function | Description | Definition |
| --- | --- | --- |
| Squared error | The most common cost function. The difference is squared to always be positive and to penalize large errors more strongly. | $\large \frac{(pred - target)^2}{2}$ |
| Cross entropy | Logistic cost function useful for classification tasks. Commonly used in conjunction with Softmax output layers. | $\small -[target * log(pred) + (1 - target) * log(1 - pred)]$ |
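A minimal sketch of these two cost functions (assuming NumPy; the sample predictions and targets are illustrative):

```python
import numpy as np

def squared_error(pred, target):
    # Squaring keeps the cost positive and penalizes large errors more strongly.
    return (pred - target) ** 2 / 2.0

def cross_entropy(pred, target):
    # Binary cross-entropy; pred must lie strictly in (0, 1).
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred))

print(squared_error(0.8, 1.0))   # small error -> small cost
print(cross_entropy(0.8, 1.0))   # equals -log(0.8)
print(cross_entropy(0.01, 1.0))  # a confidently wrong prediction costs far more
```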

• CROSS ENTROPY [CROSS-ENTROPY]: One very common, very nice cost function is "cross-entropy." Surprisingly, cross-entropy arises from thinking about information compressing codes in information theory but it winds up being an important idea in lots of areas, from gambling to machine learning. It's defined:

\begin{align} H_{y^\prime}(y) = -\sum_i y_i^{\prime}\log(y_i) \end{align}

where $y$ is our predicted probability distribution, and $y^{\prime}$ is the true distribution (the one-hot vector we'll input). In some rough sense, the cross-entropy is measuring how inefficient our predictions are for describing the truth. [Source: Training (TensorFlow tutorial)]

• Cross entropy [Wikipedia]
• Cross-Entropy [Chris Olah]
• Softmax classifier [Andrej Karpathy, Stanford cs231n]:

... It turns out that the SVM is one of two commonly seen classifiers. The other popular choice is the Softmax classifier, which has a different loss function. If you've heard of the binary Logistic Regression classifier before, the Softmax classifier is its generalization to multiple classes. Unlike the SVM which treats the outputs $f(x_i,W)$ as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping $f(x_i; W) = W x_i$ stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:

\begin{align} L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \hspace{0.15in} \text{or equivalently} \hspace{0.15in} L_i = -f_{y_i} + \log\sum_j e^{f_j} \end{align}

where we are using the notation $f_j$ to mean the $j^{th}$ element of the vector of class scores $f$. As before, the full loss for the dataset is the mean of $L_i$ over all training examples together with a regularization term $R(W)$. The function $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$ is called the softmax function: It takes a vector of arbitrary real-valued scores (in $z$) and squashes it to a vector of values between zero and one that sum to one. The full cross-entropy loss that involves the softmax function might look scary if you're seeing it for the first time but it is relatively easy to motivate.

Information theory view. The cross-entropy between a "true" distribution $p$ and an estimated distribution $q$ is defined as:

\begin{align} H(p,q) = - \sum_x p(x) \log q(x) \end{align}

The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ( $q = e^{f_{y_i}} / \sum_j e^{f_j}$ as seen above) and the "true" distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. $p = [0, \ldots 1, \ldots, 0]$ contains a single 1 at the $y_{i}^{th}$ position). Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as $H(p,q) = H(p) + D_{KL}(p||q)$, and the entropy of the delta function $p$ is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance). In other words, the cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.
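The identity $H(p,q) = H(p) + D_{KL}(p\|q)$, with $H(p) = 0$ for a one-hot $p$, can be checked numerically (assuming NumPy; the distributions are illustrative):

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])  # "true" one-hot distribution
q = np.array([0.2, 0.7, 0.1])  # model's estimated class probabilities

# By convention 0 * log(0) = 0, so restrict the sums to entries where p > 0.
mask = p > 0
H_pq = -np.sum(p[mask] * np.log(q[mask]))           # cross-entropy H(p, q)
H_p = -np.sum(p[mask] * np.log(p[mask]))            # entropy of the delta: 0
D_kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))  # KL divergence

print(H_pq)                          # -log(0.7): the loss on the correct class
print(np.isclose(H_pq, H_p + D_kl))  # True: H(p,q) = H(p) + KL(p||q)
```

Since $H(p)$ is zero here, minimizing the cross-entropy and minimizing the KL divergence are the same objective, as the text states.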

Probabilistic interpretation. Looking at the expression, we see that

\begin{align} P(y_i \mid x_i; W) = \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \end{align}

can be interpreted as the (normalized) probability assigned to the correct label $y_i$ given the image $x_i$ and parameterized by $W$. To see this, remember that the Softmax classifier interprets the scores inside the output vector $f$ as the unnormalized log probabilities. Exponentiating these quantities therefore gives the (unnormalized) probabilities, and the division performs the normalization so that the probabilities sum to one. In the probabilistic interpretation, we are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE). A nice feature of this view is that we can now also interpret the regularization term $R(W)$ in the full loss function as coming from a Gaussian prior over the weight matrix $W$, where instead of MLE we are performing the Maximum a posteriori (MAP) estimation. We mention these interpretations to help your intuitions, but the full details of this derivation are beyond the scope of this class.

Practical issues: Numeric stability. When you're writing code for computing the Softmax function in practice, the intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick. Notice that if we multiply the top and bottom of the fraction by a constant $C$ and push it into the sum, we get the following (mathematically equivalent) expression:

$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \ = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} \ = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$

We are free to choose the value of $C$. This will not change any of the results, but we can use this value to improve the numerical stability of the computation. A common choice for $C$ is to set $\log C = -\max_j f_j$. This simply states that we should shift the values inside the vector $f$ so that the highest value is zero. In code:

    import numpy as np

    f = np.array([123, 456, 789]) # example with 3 classes and each having large scores
    p = np.exp(f) / np.sum(np.exp(f)) # Bad: numeric problem, potential blowup

    # instead: first shift the values of f so that the highest number is 0:
    f -= np.max(f) # f becomes [-666, -333, 0]
    p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer


Possibly confusing naming conventions. To be precise, the SVM classifier uses the hinge loss, or also sometimes called the max-margin loss. The Softmax classifier uses the cross-entropy loss. The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it doesn't make sense to talk about the "softmax loss", since softmax is just the squashing function, but it is a relatively commonly used shorthand.

• ERGODIC: relating to or denoting systems or processes with the property that, given sufficient time, they include or impinge on all points in a given space and can be represented statistically by a reasonably large selection of points. More here [Google]

• GAUSSIAN DISTRIBUTION: also called a Normal distribution ( $\mathcal{N}$ )

• KULLBACK-LEIBLER DIVERGENCE: In probability theory and information theory, the Kullback–Leibler divergence -- also called discrimination information (the name preferred by Kullback), information divergence, information gain, relative entropy, KLIC, or KL divergence -- is a measure of the difference between two probability distributions $P$ and $Q$. It is not symmetric in $P$ and $Q$. In applications, $P$ typically represents the "true" distribution of data, observations, or a precisely calculated theoretical distribution, while $Q$ typically represents a theory, model, description, or approximation of $P$. Specifically, the Kullback–Leibler divergence of $Q$ from $P$, denoted $D_{KL}(P\|Q)$, is a measure of the information gained when one revises one's beliefs from the prior probability distribution $Q$ to the posterior probability distribution $P$. In other words, it is the amount of information lost when $Q$ is used to approximate $P$.

• L1, L2 NORM

• See my notes, here.

• LOGISTIC REGRESSION (binary labels): $y^{(i)} \in \{0,1\}$. In logistic regression, we had a training set $\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}$ of $m$ labeled examples, where the input features are $x^{(i)} \in \Re^{n}$. With logistic regression, we were in the binary classification setting, so the labels were $y^{(i)} \in \{0,1\}$.

Our hypothesis took the form:

$h_\theta(x) = \Large\frac{1}{1 + \mathcal{e}^{(-\theta^\top x)}}$

where $g(z) = \Large\frac{1}{1 + \mathcal{e}^{-z}}$

is called the logistic (or sigmoid) function, and the model parameters $\theta$ were trained to minimize the cost function

\begin{align} J(\theta) = -\left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right] \end{align}

• Logistic regression is a probabilistic, linear classifier. It is parametrized by a weight matrix $W$ and a bias vector $b$. Classification is done by projecting an input vector onto a set of hyperplanes, each of which corresponds to a class. The distance from the input to a hyperplane reflects the probability that the input is a member of the corresponding class.
• The downside of logistic regression is that it can only separate classes that are linearly separable (i.e., separable by a hyperplane).
• Parenthetically, the derivative ( $g'(z)$ ) of the logistic function is specified by the logistic function, itself: $g'(z) = g(z)(\ 1 - g(z)\ )$ !
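The self-referential derivative $g'(z) = g(z)(1 - g(z))$ can be confirmed against a central-difference estimate (NumPy sketch; the evaluation point is arbitrary):

```python
import numpy as np

def g(z):
    # Logistic (sigmoid) function.
    return 1.0 / (1.0 + np.exp(-z))

z = 0.7
h = 1e-6
# Numerical derivative via a central difference...
numeric = (g(z + h) - g(z - h)) / (2 * h)
# ...matches the closed form g'(z) = g(z) * (1 - g(z)).
closed_form = g(z) * (1 - g(z))
print(np.isclose(numeric, closed_form))  # True
```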

• LOSS FUNCTIONS (advice): If you're trying to predict some real numbers, then MSE (mean squared error) is a loss metric that is commonly used. Cross-entropy is usually used when you are calculating class probabilities (you need a different metric than when doing regression) ... [Source: what loss to use for variational auto encoder with real numbers? (reddit)]

• MEMOIZATION: In computing, memoization is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again. ... Memoization effectively refers to remembering ("memoization" → "memorandum" → to be remembered) results of method calls based on the method inputs and then returning the remembered result rather than computing the result again. You can think of it as a cache for method results.

• More here:
• Memoization [Wikipedia]
• What is memoization and how can I use it in Python? [StackOverflow]
• man page:

SYNOPSIS

    # This is the documentation for Memoize 1.01
    use Memoize;
    memoize('slow_function');
    slow_function(arguments);    # Is faster than it was before

This is normally all you need to know. However, many options are available:

    memoize(function, options...); ...
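In Python, the same idea is available in the standard library via `functools.lru_cache`; a minimal sketch using the classic Fibonacci example:

```python
import functools

@functools.lru_cache(maxsize=None)
def fib(n):
    # Without memoization this recursion is exponential in n; the cache
    # stores each fib(n) the first time it is computed and returns the
    # remembered result on every later call with the same input.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(100))  # answers instantly; the naive recursion would never finish
```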

• ONE-HOT: ... The corresponding labels in MNIST are numbers between 0 and 9, describing which digit a given image is of. For the purposes of this tutorial, we're going to want our labels as "one-hot vectors". A one-hot vector is a vector which is 0 in most dimensions, and 1 in a single dimension. In this case, the $n^{th}$ digit will be represented as a vector which is 1 in the $n^{th}$ dimension. For example, 3 would be [0,0,0,1,0,0,0,0,0,0]. ... [Source: [TensorFlow] MNIST For ML Beginners]

• PARAMETERS: Let's say we decide to approximate $y$ as a linear function of $x$:

$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2}$.

Here, the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$.
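A sketch tying the one-hot and parameters entries above together (assuming NumPy; `one_hot` and `h_theta` are hypothetical helper names, not from any library):

```python
import numpy as np

def one_hot(label, num_classes=10):
    # A vector that is 1 at position `label` and 0 everywhere else.
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

def h_theta(theta, x):
    # Linear hypothesis h_theta(x) = theta_0 + theta_1*x_1 + theta_2*x_2,
    # with theta[0] as the intercept term.
    return theta[0] + theta[1] * x[0] + theta[2] * x[1]

print(one_hot(3))                            # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
print(h_theta([1.0, 2.0, 3.0], [0.5, 0.5]))  # 1.0 + 1.0 + 1.5 = 3.5
```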

• PARAMETRIC | NON-PARAMETRIC: [re: feature selection; underfitting; overfitting; ...] Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta_i$'s), which are fit to the data. Once we've fit the $\theta_i$'s and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis $h$ grows linearly with the size of the training set. [Source (bottom p. 15)]

• PERPLEXITY: a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models; a low perplexity indicates the probability distribution is good at predicting the sample. Perplexity is the de facto standard for evaluating language models; it measures how surprised a network (e.g. an RNN) is to see the next character in a sequence. Perplexity is just the cross entropy between the empirical distribution (the distribution of things that actually appear) and the predicted distribution (what your model assigns), divided by the number of words and then exponentiated, after throwing out unseen words.
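As a sketch of that computation (assuming NumPy; the per-word probabilities are illustrative):

```python
import numpy as np

# Probabilities the model assigned to each word that actually occurred.
word_probs = np.array([0.2, 0.1, 0.5, 0.25])

# Perplexity = exp of the average negative log-probability per word.
cross_entropy = -np.mean(np.log(word_probs))
perplexity = np.exp(cross_entropy)
print(perplexity)
# A uniform model over a V-word vocabulary has perplexity V, so lower is better.
```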

• REGRESSION: When the target variable ($y$) that we're trying to predict is continuous, such as housing prices, we call the learning problem a regression problem. When $y$ can take on only a small number of discrete values (such as: if given the living area, we want to predict if a dwelling is a house or an apartment), we call it a classification problem. [Source (bottom p. 2)]

• Softmax regression vs multinomial logistic regression: Is there a difference?  >>  "If you have regularization ($L1$ or $L2$ penalties) then you can prove that it doesn't matter that softmax is technically overparametrized.  ...  ...  Even if softmax is overparametrized, they have the same number of degrees of freedom, which I think is what counts for the statistical properties of the estimator."

• SOFTMAX REGRESSION allows us to handle $y^{(i)} \in \{1,\ldots,K\}$ where $K$ is the number of classes. The $softmax$ function (aka normalized exponential) is a generalization of the logistic function that "squashes" a $K$-dimensional vector of arbitrary real values into a $K$-dimensional vector of values in the range $(0, 1)$ that sum to one.

• basically, softmax regression is the same as multinomial logistic regression

• In the softmax regression setting, we are interested in multi-class classification (as opposed to only binary classification), and so the label $y$ can take on $K$ different values, rather than only two. Thus, in our training set $\{ (x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)}) \}$, we now have that $y^{(i)} \in \{1, 2, \ldots, K\}$. (Note that our convention will be to index the classes starting from 1, rather than from 0.) For example, in the MNIST digit recognition task, we would have $K = 10$ different classes.

Given a test input $x$, we want our hypothesis to estimate the probability that $P(y=k | x)$ for each value of $k = 1, \ldots, K$. I.e., we want to estimate the probability of the class label taking on each of the $K$ different possible values. Thus, our hypothesis will output a $K$-dimensional vector (whose elements sum to 1) giving us our K estimated probabilities. Concretely, our hypothesis $h_{\theta}(x)$ takes the form: ...
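A sketch of such a hypothesis (assuming NumPy; the parameter matrix here is random and purely illustrative, and classes are 0-indexed in the code even though the text indexes them from 1):

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 10, 784                               # e.g. MNIST: 10 classes, 784 features
theta = rng.normal(scale=0.01, size=(K, n))  # hypothetical parameter matrix

def h(x):
    # Softmax hypothesis: map K raw scores to K probabilities summing to 1.
    scores = theta @ x
    scores -= scores.max()  # numeric-stability shift (see CROSS ENTROPY above)
    e = np.exp(scores)
    return e / e.sum()

x = rng.normal(size=n)
probs = h(x)
print(probs.sum())     # the K estimated probabilities sum to 1
print(probs.argmax())  # the predicted class label (0-indexed here)
```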

• Continued here: Softmax Regression ...