| Function | Description | Definition |
|----------|-------------|------------|
| Identity | Doesn't transform the incoming data. That's what you would expect at input layers. | $x$ |
| Sigmoid | The de facto standard activation before Relu; smoothly maps the incoming activation into a range from zero to one via the logistic function. | $s(x) = \large \frac{1}{1 + e^{-x}}$ |
| Relu | Fast non-linear function that has proven effective in deep networks. | $\max(0, x)$ |
| Softmax | Smooth activation function where the outgoing activations sum to one. Commonly used for output layers in classification because the outgoing activations can be interpreted as probabilities. | $\large \frac{e^{x_i}}{\sum_j e^{x_j}}$ |
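The functions in the table can be sketched in NumPy roughly as follows (the function names are mine):

```python
import numpy as np

def identity(x):
    # Pass the input through unchanged, as at input layers.
    return x

def sigmoid(x):
    # Smoothly squash activations into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # Zero out negative activations, keep positive ones.
    return np.maximum(0, x)

def softmax(x):
    # Shift by the max for numerical stability, then normalize
    # the exponentials so the outputs sum to one.
    e = np.exp(x - np.max(x))
    return e / e.sum()
```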
Examples of affine transformations include translation, scaling, homothety, similarity transformation, reflection, rotation, shear mapping, and compositions of them in any combination and sequence.
If $X$ and $Y$ are affine spaces, then every affine transformation $f : X \rightarrow Y$ is of the form $x \mapsto Mx + b$, where $M$ is a linear transformation on $X$ and $b$ is a vector in $Y$. Unlike a purely linear transformation, an affine map need not preserve the zero point in a linear space. Thus, every linear transformation is affine, but not every affine transformation is linear.
All Euclidean spaces are affine, but there are affine spaces that are non-Euclidean.
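A minimal numerical sketch of the form $x \mapsto Mx + b$, showing that an affine map need not preserve the zero point (the specific $M$ and $b$ are my choices for illustration):

```python
import numpy as np

# Linear part M: a 90-degree rotation; translation b shifts every point.
M = np.array([[0.0, -1.0],
              [1.0,  0.0]])
b = np.array([2.0, 0.0])

def affine(x):
    # x -> Mx + b
    return M @ x + b

# A purely linear map would send the zero vector to zero;
# the affine map moves it to b instead.
print(affine(np.zeros(2)))   # -> [2. 0.]
```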
Graphical illustration of bias and variance.
From Understanding the Bias-Variance Tradeoff, by Scott Fortmann-Roe.
Bias and variance contributing to total error.
From Understanding the Bias-Variance Tradeoff, by Scott Fortmann-Roe.
| Function | Description | Definition |
|----------|-------------|------------|
| Squared error | The most common cost function. The difference is squared so that it is always positive and penalizes large errors more strongly. | $\large \frac{(pred - target)^2}{2}$ |
| Cross entropy | Logistic cost function useful for classification tasks. Commonly used in conjunction with Softmax output layers. | $\small -[target \cdot \log(pred) + (1 - target) \cdot \log(1 - pred)]$ |
... It turns out that the SVM is one of two commonly seen classifiers. The other popular choice is the Softmax classifier, which has a different loss function. If you've heard of the binary Logistic Regression classifier before, the Softmax classifier is its generalization to multiple classes. Unlike the SVM which treats the outputs $f(x_i,W)$ as (uncalibrated and possibly difficult to interpret) scores for each class, the Softmax classifier gives a slightly more intuitive output (normalized class probabilities) and also has a probabilistic interpretation that we will describe shortly. In the Softmax classifier, the function mapping $f(x_i; W) = W x_i$ stays unchanged, but we now interpret these scores as the unnormalized log probabilities for each class and replace the hinge loss with a cross-entropy loss that has the form:
$ L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right) \hspace{0.15in} \text{or equivalently} \hspace{0.15in} L_i = -f_{y_i} + \log\sum_j e^{f_j} $ where we are using the notation $f_j$ to mean the $j^{th}$ element of the vector of class scores $f$. As before, the full loss for the dataset is the mean of $L_i$ over all training examples together with a regularization term $R(W)$. The function $f_j(z) = \frac{e^{z_j}}{\sum_k e^{z_k}}$ is called the softmax function: It takes a vector of arbitrary real-valued scores (in $z$) and squashes it to a vector of values between zero and one that sum to one. The full cross-entropy loss that involves the softmax function might look scary if you're seeing it for the first time but it is relatively easy to motivate.
Information theory view. The cross-entropy between a "true" distribution $p$ and an estimated distribution $q$ is defined as:
$\begin{align} H(p,q) = - \sum_x p(x) \log q(x) \end{align}$The Softmax classifier is hence minimizing the cross-entropy between the estimated class probabilities ( $q = e^{f_{y_i}} / \sum_j e^{f_j}$ as seen above) and the "true" distribution, which in this interpretation is the distribution where all probability mass is on the correct class (i.e. $p = [0, \ldots 1, \ldots, 0]$ contains a single 1 at the $y_{i}^{th}$ position). Moreover, since the cross-entropy can be written in terms of entropy and the Kullback-Leibler divergence as $H(p,q) = H(p) + D_{KL}(p||q)$, and the entropy of the delta function $p$ is zero, this is also equivalent to minimizing the KL divergence between the two distributions (a measure of distance). In other words, the cross-entropy objective wants the predicted distribution to have all of its mass on the correct answer.
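A quick numerical check of this view (the scores and class index are arbitrary examples): with a one-hot "true" distribution $p$, the full cross-entropy $H(p,q)$ collapses to the negative log probability of the correct class, which is exactly the loss $L_i$ above.

```python
import numpy as np

scores = np.array([1.0, 2.0, 0.5])            # example class scores f
q = np.exp(scores) / np.sum(np.exp(scores))   # estimated distribution (softmax)
y = 1                                         # index of the correct class
p = np.zeros(3)
p[y] = 1.0                                    # "true" one-hot distribution

# Full cross-entropy H(p, q) = -sum_x p(x) log q(x) ...
H = -np.sum(p * np.log(q))
# ... equals the negative log probability of the correct class.
assert np.isclose(H, -np.log(q[y]))
```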
Probabilistic interpretation. Looking at the expression, we see that
$ P(y_i \mid x_i; W) = \large \frac{e^{f_{y_i}}}{\sum_j e^{f_j}} $ can be interpreted as the (normalized) probability assigned to the correct label $y_i$ given the image $x_i$ and parameterized by $W$. To see this, remember that the Softmax classifier interprets the scores inside the output vector $f$ as the unnormalized log probabilities. Exponentiating these quantities therefore gives the (unnormalized) probabilities, and the division performs the normalization so that the probabilities sum to one. In the probabilistic interpretation, we are therefore minimizing the negative log likelihood of the correct class, which can be interpreted as performing Maximum Likelihood Estimation (MLE). A nice feature of this view is that we can now also interpret the regularization term $R(W)$ in the full loss function as coming from a Gaussian prior over the weight matrix $W$, where instead of MLE we are performing the Maximum a posteriori (MAP) estimation. We mention these interpretations to help your intuitions, but the full details of this derivation are beyond the scope of this class.
Practical issues: Numeric stability. When you're writing code for computing the Softmax function in practice, the intermediate terms $e^{f_{y_i}}$ and $\sum_j e^{f_j}$ may be very large due to the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick. Notice that if we multiply the top and bottom of the fraction by a constant $C$ and push it into the sum, we get the following (mathematically equivalent) expression:
$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} \ = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} \ = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$We are free to choose the value of $C$. This will not change any of the results, but we can use this value to improve the numerical stability of the computation. A common choice for $C$ is to set $\log C = -\max_j f_j$. This simply states that we should shift the values inside the vector $f$ so that the highest value is zero. In code:
```python
f = np.array([123, 456, 789]) # example with 3 classes and each having large scores
p = np.exp(f) / np.sum(np.exp(f)) # Bad: Numeric problem, potential blowup

# instead: first shift the values of f so that the highest number is 0:
f -= np.max(f) # f becomes [-666, -333, 0]
p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer
```
Possibly confusing naming conventions. To be precise, the SVM classifier uses the hinge loss, or also sometimes called the max-margin loss. The Softmax classifier uses the cross-entropy loss. The Softmax classifier gets its name from the softmax function, which is used to squash the raw class scores into normalized positive values that sum to one, so that the cross-entropy loss can be applied. In particular, note that technically it doesn't make sense to talk about the "softmax loss", since softmax is just the squashing function, but it is a relatively commonly used shorthand.
Our hypothesis took the form:
$h_\theta(x) = \Large\frac{1}{1 + e^{-\theta^\top x}}$
where $g(z) = \Large\frac{1}{1 + e^{-z}}$
is called the logistic (or sigmoid) function, and the model parameters $\theta$ were trained to minimize the cost function
$ \begin{align} J(\theta) = -\left[ \sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1-y^{(i)}) \log (1-h_\theta(x^{(i)})) \right] \end{align} $
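The cost function above can be sketched directly in NumPy (the variable names and shapes are my assumptions: `X` is an $m \times n$ design matrix, `y` a vector of binary labels):

```python
import numpy as np

def sigmoid(z):
    # The logistic function g(z).
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # J(theta): negative log likelihood summed over the m training examples,
    # matching the formula above.
    h = sigmoid(X @ theta)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```

With `theta` at zero every prediction is 0.5, so the cost is $m \log 2$ regardless of the labels, which is a handy sanity check.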
SYNOPSIS
```perl
# This is the documentation for Memoize 1.01
use Memoize;
memoize('slow_function');
slow_function(arguments);       # Is faster than it was before

memoize(function, options...);
```
...
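Memoize is a Perl module, but the same idea can be sketched in Python with the standard library's `functools.lru_cache` (the function body here is a stand-in for an expensive computation):

```python
from functools import lru_cache

@lru_cache(maxsize=None)   # cache results keyed by the arguments
def slow_function(n):
    # Stand-in for an expensive computation.
    return sum(i * i for i in range(n))

slow_function(10_000)   # computed once
slow_function(10_000)   # subsequent calls with the same argument hit the cache
```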
$h_{\theta}(x) = \theta_{0} + \theta_{1}x_{1} + \theta_{2}x_{2}$.
Here, the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$.
Given a test input $x$, we want our hypothesis to estimate $P(y=k \mid x)$ for each value of $k = 1, \ldots, K$. That is, we want to estimate the probability of the class label taking on each of the $K$ different possible values. Thus, our hypothesis will output a $K$-dimensional vector (whose elements sum to 1) giving us our $K$ estimated probabilities. Concretely, our hypothesis $h_{\theta}(x)$ takes the form: ...
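A minimal NumPy sketch of such a hypothesis (the parameter shape, function name, and example values are my assumptions, not from the text):

```python
import numpy as np

def h(theta, x):
    # theta: (K, n) parameter matrix, x: (n,) input vector.
    # Returns a K-vector of estimated probabilities P(y = k | x) summing to 1.
    scores = theta @ x
    e = np.exp(scores - np.max(scores))   # shift for numerical stability
    return e / e.sum()

theta = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])            # K = 3 classes, n = 2 features
probs = h(theta, np.array([0.5, -0.5]))
print(probs.sum())                        # -> 1.0
```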
This distinction is elaborated in much more detail by Baroni et al. [pdf], but in a nutshell: Count-based methods compute the statistics of how often some word co-occurs with its neighbor words in a large text corpus, and then map these count-statistics down to a small, dense vector for each word. Predictive models directly try to predict a word from its neighbors in terms of learned small, dense embedding vectors (considered parameters of the model).
Word2vec is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavors, the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model (Chapter 3.1 and 3.2 in Mikolov et al.: arXiv:1301.3781). Algorithmically, these models are similar, except that CBOW predicts target words (e.g. 'mat') from source context words ('the cat sits on the'), while the skip-gram does the inverse and predicts source context-words from the target words. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information (by treating an entire context as one observation). For the most part, this turns out to be a useful thing for smaller datasets. However, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets. We will focus on the skip-gram model in the rest of this tutorial.
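The difference in how the two models frame training examples can be made concrete with the sentence from the text (the window size of 2 is my choice): skip-gram emits one observation per (target, context-word) pair, while CBOW treats the whole context around a target as a single observation.

```python
words = "the cat sits on the mat".split()
window = 2

skipgram_pairs = []   # (target, context_word): one example per pair
cbow_examples = []    # (context_words, target): whole context as one observation
for i, target in enumerate(words):
    lo, hi = max(0, i - window), min(len(words), i + window + 1)
    context = [words[j] for j in range(lo, hi) if j != i]
    skipgram_pairs.extend((target, c) for c in context)
    cbow_examples.append((tuple(context), target))

# Skip-gram yields many more observations than CBOW from the same text.
print(len(skipgram_pairs), len(cbow_examples))   # -> 18 6
```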