# L1 and L2 Normalization / Regularization [L1 and L2 Norm]

• related:  Andrej Karpathy, @karpathy (twitter, Mar 11, 2016):
"not-widely-enough-known-protip: Do not use $L2$ loss (regression) in neural
nets unless you absolutely have to. Softmax is likely to work better."

## Improving the Way Neural Networks Learn

[source]

The techniques we'll develop in this chapter include: a better choice of cost function, known as the cross-entropy cost function; four so-called "regularization" methods ($L1$ and $L2$ regularization, dropout, and artificial expansion of the training data), which make our networks better at generalizing beyond the training data; a better method for initializing the weights in the network; and a set of heuristics to help choose good hyper-parameters for the network.
...

### Regularization

Increasing the amount of training data is one way of reducing overfitting. Are there other ways we can reduce the extent to which overfitting occurs? One possible approach is to reduce the size of our network. However, large networks have the potential to be more powerful than small networks, and so this is an option we'd only adopt reluctantly.

Fortunately, there are other techniques which can reduce overfitting, even when we have a fixed network and fixed training data. These are known as regularization techniques. In this section I describe one of the most commonly used regularization techniques, a technique sometimes known as weight decay or $L2$ regularization. The idea of $L2$ regularization is to add an extra term to the cost function, a term called the regularization term. Here's the regularized cross-entropy:

$C = -\frac{1}{n} \sum\limits_{xj} \left[ y_j \ln a^{L}_{j} + (1-y_j) \ln(1-a^{L}_{j}) \right] + \frac{\lambda}{2n} \sum\limits_w w^2$

The first term is just the usual expression for the cross-entropy. But we've added a second term, namely the sum of the squares of all the weights in the network. This is scaled by a factor $\lambda/2n$, where $\lambda>0$ is known as the regularization parameter, and $n$ is, as usual, the size of our training set. I'll discuss later how $\lambda$ is chosen. It's also worth noting that the regularization term doesn't include the biases. I'll also come back to that below.
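As a rough sketch, the regularized cross-entropy above could be computed like this (the function name, array shapes, and single-example form are illustrative, not from the text):

```python
import numpy as np

def cross_entropy_l2(a, y, weights, lam, n):
    """Regularized cross-entropy: C = C0 + (lambda/2n) * sum of squared weights.

    a, y     -- output activations and targets for one training example
    weights  -- list of weight matrices for the whole network
    lam      -- regularization parameter lambda
    n        -- size of the training set
    """
    # Unregularized cross-entropy term for this example
    # (the 1/n average over the training set is taken elsewhere)
    c0 = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))
    # L2 regularization term: sum of squares of all weights, biases excluded
    reg = (lam / (2 * n)) * sum(np.sum(w ** 2) for w in weights)
    return c0 + reg
```

Note that, as in the text, only the weight matrices enter the regularization term; the biases are deliberately left out.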

Of course, it's possible to regularize other cost functions, such as the quadratic cost. This can be done in a similar way:

$C = \frac{1}{2n} \sum\limits_x \|y - a^L\|^2 + \frac{\lambda}{2n} \sum\limits_w w^2$

In both cases we can write the regularized cost function as

$C = C_0 + \frac{\lambda}{2n} \sum\limits_w w^2$

where $C_0$ is the original, unregularized cost function.

Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function. Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. The relative importance of the two elements of the compromise depends on the value of $\lambda$: when $\lambda$ is small we prefer to minimize the original cost function, but when $\lambda$ is large we prefer small weights.

Now, it's really not at all obvious why making this kind of compromise should help reduce overfitting! But it turns out that it does. We'll address the question of why it helps in the next section. But first, let's work through an example showing that regularization really does reduce overfitting. To construct such an example, we first need to figure out how to apply our stochastic gradient descent learning algorithm in a regularized neural network.
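As a hedged preview of that application: the gradient of the $\frac{\lambda}{2n}\sum_w w^2$ term with respect to a weight is $\frac{\lambda}{n}w$, so a plain gradient-descent step on the regularized cost rescales (decays) each weight before the usual gradient step. A minimal sketch, with illustrative names:

```python
import numpy as np

def sgd_step_l2(w, grad_c0, eta, lam, n):
    """One gradient-descent step on C = C0 + (lambda/2n) * sum w^2.

    The regularization term contributes (lambda/n) * w to dC/dw, so:
        w -> (1 - eta*lambda/n) * w - eta * grad_c0
    i.e. the weight first shrinks ("decays"), then takes the usual step.
    """
    return (1 - eta * lam / n) * w - eta * grad_c0
```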
...

## Regularization

[source]

...
There are several ways of controlling the capacity of Neural Networks to prevent overfitting:

$L2$ regularization is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight $w$ in the network, we add the term $\frac{1}{2}\lambda w^2$ to the objective, where $\lambda$ is the regularization strength. It is common to see the factor of $\frac{1}{2}$ in front because then the gradient of this term with respect to the parameter $w$ is simply $\lambda w$ instead of $2\lambda w$.

The $L2$ regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot. Lastly, notice that during the gradient descent parameter update, using the $L2$ regularization ultimately means that every weight is decayed linearly towards zero: `W += -lambda * W`.
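A minimal numeric check of the two claims above (the gradient of $\frac{1}{2}\lambda w^2$ is $\lambda w$, and this term shows up in the update as linear weight decay), using illustrative values:

```python
import numpy as np

def l2_penalty_grad(w, lam):
    """Gradient of the 0.5 * lambda * w**2 penalty is simply lambda * w."""
    return lam * w

# Central-difference check that d/dw [0.5 * lam * w**2] = lam * w
w, lam, h = 3.0, 0.1, 1e-6
numeric = (0.5 * lam * (w + h) ** 2 - 0.5 * lam * (w - h) ** 2) / (2 * h)
assert abs(numeric - l2_penalty_grad(w, lam)) < 1e-6

# In the parameter update this appears as linear decay towards zero:
W = np.array([1.0, -2.0])
W += -lam * W  # every weight shrinks by a fixed fraction
```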

$L1$ regularization is another relatively common form of regularization, where for each weight $w$ we add the term $\lambda |w|$ to the objective. It is possible to combine the $L1$ regularization with the $L2$ regularization: $\lambda_1 |w| + \lambda_2 w^2$ (this is called Elastic net regularization).
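The combined penalty can be sketched in one line (the function name and values are illustrative, not from the text):

```python
import numpy as np

def elastic_net_penalty(w, lam1, lam2):
    """Elastic net penalty: lam1*|w| + lam2*w**2, summed over all weights."""
    return lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)
```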

The $L1$ regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with $L1$ regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the "noisy" inputs. In comparison, final weight vectors from $L2$ regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, $L2$ regularization can be expected to give superior performance over $L1$.
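Why $L1$ produces exact zeros while $L2$ only shrinks can be seen in a hypothetical one-dimensional problem with closed-form minimizers (a sketch, not from the text): minimizing $\frac{1}{2}(w-a)^2 + \lambda|w|$ gives the soft-threshold solution, which is exactly zero whenever $|a| \le \lambda$, whereas the $L2$ analogue only rescales $a$:

```python
def soft_threshold(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w).

    Exactly zero whenever |a| <= lam -- this is the source of L1 sparsity.
    """
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0

def l2_shrink(a, lam):
    """Minimizer of 0.5*(w - a)**2 + 0.5*lam*w**2.

    Shrinks a towards zero but never reaches it for nonzero a.
    """
    return a / (1 + lam)
```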

Max norm constraints. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $\vec{w}$ of every neuron to satisfy $\|\vec{w}\|_2 < c$. Typical values of $c$ are on the order of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot "explode" even when the learning rates are set too high because the updates are always bounded.
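A sketch of the clamping step, assuming each row of `W` holds one neuron's incoming weight vector (the function name and layout are illustrative):

```python
import numpy as np

def max_norm_project(W, c=3.0):
    """Clamp each neuron's incoming weight vector (row of W) to L2 norm <= c.

    Applied after the usual parameter update (projected gradient descent).
    """
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    # Scale down only the rows whose norm exceeds c; leave the rest untouched.
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale
```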

Dropout is an extremely effective, simple, and recently introduced regularization technique by Srivastava et al. in Dropout: A Simple Way to Prevent Neural Networks from Overfitting (pdf) that complements the other methods ($L1$, $L2$, maxnorm). While training, dropout is implemented by keeping a neuron active only with some probability $p$ (a hyperparameter), and setting it to zero otherwise.
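A minimal sketch of the "inverted dropout" variant at training time (names and the seeding scheme are illustrative): each activation in `h` is kept with probability `p`, and the surviving ones are divided by `p` so the expected activation is unchanged, which removes the need for any rescaling at test time.

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True, seed=0):
    """Inverted dropout: keep each unit with probability p during training.

    Dividing the mask by p keeps E[output] == input, so the test-time
    forward pass is just the identity (no rescaling needed).
    """
    if not train:
        return h
    rng = np.random.default_rng(seed)
    mask = (rng.random(h.shape) < p) / p  # entries are 0 or 1/p
    return h * mask
```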
...

## Blogs

• Mathematical foundation for Noise, Bias and Variance in Neural Networks

• "... In the previous post titled "Algorithms to Improve Neural Network Accuracy", we learnt about the overfitting problem in Neural Nets and how to break it using L1/L2 regularizers, weight penalties/decay and constraints. Also in the post titled "Committee of Intelligent Machines", we learnt how to stack different models to improve accuracy and generalize the prediction better.

"In this post, I would like to introduce the fundamental math on understanding Noise, Bias and Variance during Neural Net training and also use Noise as a regularizer for generalizing the Neural Nets. ...

"... In conclusion, by adding Gaussian noise to the input signal, we can regularize the Models more efficiently and prevent them from overfitting."