Fortunately, there are other techniques which can reduce overfitting, even when we have a fixed network and fixed training data. These are known as regularization techniques. In this section I describe one of the most commonly used regularization techniques, a technique sometimes known as weight decay or $L2$ regularization. The idea of $L2$ regularization is to add an extra term to the cost function, a term called the regularization term. Here's the regularized cross-entropy:
$$C = -\frac{1}{n} \sum_{xj} \left[ y_j \ln a^L_j + (1-y_j) \ln(1-a^L_j) \right] + \frac{\lambda}{2n} \sum_w w^2.$$
Of course, it's possible to regularize other cost functions, such as the quadratic cost. This can be done in a similar way:
$$C = \frac{1}{2n} \sum_x \|y - a^L\|^2 + \frac{\lambda}{2n} \sum_w w^2.$$
Intuitively, the effect of regularization is to make it so the network prefers to learn small weights, all other things being equal. Large weights will only be allowed if they considerably improve the first part of the cost function. Put another way, regularization can be viewed as a way of compromising between finding small weights and minimizing the original cost function. The relative importance of the two elements of the compromise depends on the value of $\lambda $: when $\lambda $ is small we prefer to minimize the original cost function, but when $\lambda $ is large we prefer small weights.
Now, it's really not at all obvious why making this kind of compromise
should help reduce overfitting! But it turns out that it does. We'll
address the question of why it helps in the next section. But first,
let's work through an example showing that regularization really does
reduce overfitting.
To construct such an example, we first need to figure out how to apply
our stochastic gradient descent learning algorithm in a regularized
neural network.
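Before working through the full example, the resulting update rule can be sketched in a few lines. This is an illustrative sketch only (the names `sgd_step_l2`, `eta`, `lmbda`, and `n` are mine, not from any particular implementation), assuming the regularization term $\frac{\lambda}{2n}\sum_w w^2$, which multiplies each weight by a decay factor $(1 - \eta\lambda/n)$ on every step:

```python
import numpy as np

def sgd_step_l2(w, grad_w, eta=0.5, lmbda=0.1, n=1000):
    """One SGD step with L2 regularization (weight decay).

    w      : current weights
    grad_w : gradient of the *unregularized* cost, averaged over the mini-batch
    eta    : learning rate; lmbda : regularization strength; n : training-set size

    Update rule: w <- (1 - eta*lmbda/n) * w - eta * grad_w
    """
    return (1 - eta * lmbda / n) * w - eta * grad_w

w = np.array([1.0, -2.0])
grad = np.zeros(2)            # with zero cost gradient, the weights simply decay
w_new = sgd_step_l2(w, grad)  # each weight scaled by (1 - 0.5*0.1/1000) = 0.99995
```

Note that the gradient term is unchanged; the only difference from plain SGD is the multiplicative shrinkage of the weights, which is why this is called weight decay.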
...
...
There are several ways of controlling the capacity of Neural Networks to prevent overfitting:
$L2$ regularization is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight $w$ in the network, we add the term $\frac{1}{2}\lambda {w}^{2}$ to the objective, where $\lambda $ is the regularization strength. It is common to see the factor of $\frac{1}{2}$ in front because then the gradient of this term with respect to the parameter $w$ is simply $\lambda w$ instead of $2\lambda w$.
The $L2$ regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors. As we discussed in the Linear Classification section, due to multiplicative interactions between weights and inputs this has the appealing property of encouraging the network to use all of its inputs a little rather than some of its inputs a lot. Lastly, notice that during the gradient descent parameter update, using the $L2$ regularization ultimately means that every weight is decayed linearly: `W += -lambda * W` towards zero.
$L1$ regularization is another relatively common form of regularization, where for each weight $w$ we add the term $\lambda |w|$ to the objective. It is possible to combine the $L1$ regularization with the $L2$ regularization: $\lambda_1 |w| + \lambda_2 w^2$ (this is called Elastic net regularization).
The $L1$ regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e. very close to exactly zero). In other words, neurons with $L1$ regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the "noisy" inputs. In comparison, final weight vectors from $L2$ regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, $L2$ regularization can be expected to give superior performance over $L1$.
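The sparsity contrast between the two penalties can be made concrete in one dimension. This is an illustrative sketch under my own assumptions (a quadratic loss $\frac{1}{2}(w-t)^2$ with target $t$; the helper names are hypothetical): the $L2$ minimizer only shrinks toward zero, while the $L1$ minimizer is soft-thresholded to exactly zero whenever $|t| \le \lambda$:

```python
import numpy as np

def l2_solution(t, lmbda):
    # argmin_w 0.5*(w - t)^2 + 0.5*lmbda*w^2  ->  w = t/(1+lmbda): shrinks, never exactly zero
    return t / (1 + lmbda)

def l1_solution(t, lmbda):
    # argmin_w 0.5*(w - t)^2 + lmbda*|w|  ->  soft-thresholding: exact zeros for small |t|
    return np.sign(t) * np.maximum(np.abs(t) - lmbda, 0.0)

t = np.array([3.0, 0.4, -0.2])
print(l2_solution(t, 0.5))  # every entry shrunk a little, none exactly zero
print(l1_solution(t, 0.5))  # large entry survives; small entries are exactly zero
```

The small entries (0.4 and -0.2) fall below the $L1$ threshold and vanish entirely, while under $L2$ they merely shrink; this is the sparse-versus-diffuse behavior described above.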
Max norm constraints. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector for every neuron and use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $\vec{w}$ of every neuron to satisfy $\|\vec{w}\|_2 < c$. Typical values of $c$ are on the order of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot "explode" even when the learning rates are set too high because the updates are always bounded.
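The projection step can be sketched as follows. This is a minimal illustration (the helper name `clamp_max_norm` is my own, and I assume weights are stored with one row of incoming weights per neuron): after the usual gradient update, any row whose norm exceeds $c$ is rescaled back onto the ball of radius $c$:

```python
import numpy as np

def clamp_max_norm(W, c=3.0):
    """Project each neuron's incoming weight vector (a row of W) into the L2 ball of radius c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))  # rows with norm <= c are untouched
    return W * scale

W = np.array([[3.0, 4.0],    # norm 5  -> rescaled down to norm 3
              [0.5, 0.5]])   # norm ~0.71 -> left unchanged
W_clamped = clamp_max_norm(W, c=3.0)
```

Because the clamping happens after every update, the weight norms stay bounded no matter how large an individual gradient step is.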
Dropout is an extremely effective, simple regularization technique recently introduced by Srivastava et al. in Dropout: A Simple Way to Prevent Neural Networks from Overfitting (pdf) that complements the other methods ($L1$, $L2$, max norm).
While training, dropout is implemented by only keeping a neuron active with some probability $p$ (a hyperparameter), or setting it to zero otherwise.
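A minimal sketch of the training-time operation for one layer's activations, using the common "inverted dropout" variant (an assumption on my part; the function name is hypothetical): dividing by the keep probability $p$ at train time means the activations need no rescaling at test time.

```python
import numpy as np

def dropout_forward(a, p=0.5, train=True, rng=None):
    """Inverted dropout. `p` is the probability of KEEPING a unit, as in the text."""
    if not train:
        return a                          # test time: all units active, no rescaling needed
    if rng is None:
        rng = np.random.default_rng()
    mask = (rng.random(a.shape) < p) / p  # drop units with prob 1-p, rescale survivors by 1/p
    return a * mask

a = np.ones((4, 5))
out = dropout_forward(a, p=0.5, rng=np.random.default_rng(0))
# each entry of `out` is either 0.0 (dropped) or 2.0 (kept, rescaled by 1/p)
```

The $1/p$ rescaling keeps the expected value of each activation the same as without dropout, which is what lets the test-time forward pass run unmodified.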
...
"In this post, I would like to introduce the fundamental math on understanding Noise, Bias and Variance during Neural Net training and also use Noise as a regularizer for generalizing the Neural Nets. ...
"... In conclusion, by adding Gaussian noise to the input signal, we can regularize the Models more efficiently and prevent them from overfitting."