# MACHINE LEARNING - IMPLEMENTATION NOTES

{ Approach | Algorithms | Advice ... }

## Persagen.com

This file:  Persagen.com/files/ml-implementation_notes.html

### Preface

3. These notes are less-formatted than my "parent" ML Notes   [<< large file; opens in new tab]. Basically I placed content here
(reddit; ...) that I thought might be useful one day, without cluttering up that file.
4. Enjoy!

### General

• Google's "Wide & Deep Learning" is useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values), such as recommender systems, search, and ranking problems.
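
A minimal sketch of how such a wide & deep model might be wired up with the TensorFlow estimator API; the feature names, bucket sizes, and layer widths below are hypothetical placeholders, not values from the paper.

import tensorflow as tf

# Hypothetical sparse/dense features for a recommender-style problem.
user_id = tf.feature_column.categorical_column_with_hash_bucket("user_id", hash_bucket_size=10000)
item_id = tf.feature_column.categorical_column_with_hash_bucket("item_id", hash_bucket_size=10000)
age = tf.feature_column.numeric_column("age")

# Wide part: crossed sparse columns memorize specific feature co-occurrences.
wide_columns = [tf.feature_column.crossed_column(["user_id", "item_id"], hash_bucket_size=100000)]

# Deep part: embeddings of the sparse features (plus dense features) generalize.
deep_columns = [tf.feature_column.embedding_column(user_id, dimension=16),
                tf.feature_column.embedding_column(item_id, dimension=16),
                age]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    dnn_feature_columns=deep_columns,
    dnn_hidden_units=[64, 32])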

## ACTIVATION FUNCTIONS

### Activation Functions - Blogs

• Activation functions [reddit]
What is the consensus on the performance of activation functions? Has anyone actually tried to use different nonlinearities in a major ImageNet network and show results?
RELU vs. LRELU vs. PRELU vs. ELU?  [... significant discussion ...]

• An Intuitive Explanation of Convolutional Neural Networks  |  a simplified/well-written, informative intro to CNN  |  reddit

• I'm writing a tutorial/article series for implementing Neural nets and would love feedback!

• The one thing that nearly every intro to NN article suffers from: the "sigmoid" function. << use ReLU.

• What's wrong with using sigmoid? :-)

• Kills gradients, not zero-centered, (relatively) slow to calculate: Commonly used activation functions [ Stanford cs231n  |  Andrej Karpathy ]

• Is there any research on using distinct activation functions in the same network? [reddit]: DCGANs use ReLU and tanh. ReLU for nonlinearity in the convolutional layers, and tanh to normalise the representation in the generator network.

• 'Related':

• IRNN by Le et al. (arXiv:1504.00941): The authors propose using ReLU as the activation function instead of tanh, and initializing the recurrent weight matrix to the identity matrix. This means that at the beginning, the state is propagated to the next time step without any loss of information.

• NPRNN by Talathi et al. (arXiv:1511.03771): A similar approach to IRNN, but they propose using a positive-definite matrix with largest eigenvalue 1 as the initialization for the recurrent weights. This should increase the stability of the model compared to IRNN.

• Neural Network Evolution Playground with Backprop NEAT [reddit]; links to related blog article:
Neural Network Evolution Playground with Backprop NEAT: This demo will attempt to use a genetic algorithm to produce efficient, but atypical neural network structures to classify datasets borrowed from TensorFlow Playground.  |  demo  |  GitHub

...
In typical neural network-based classification problems, the data scientist would design and put together some pre-defined neural network, based on human heuristic, and the actual machine learning bit of the task would be to solve for the set of weights in the network, using some variants of stochastic gradient descent and the back propagation algorithm to calculate the weight gradients, in order to get the network to fit some training data under some regularisation constraints. The TensorFlow Playground demo captured the essence of this sort of task, but I've been thinking if machine learning can also be used effectively to design the actual neural network used for a given task as well. What if we can automate the process of discovering neural network architectures?

I decided to experiment with this idea by creating this demo. Rather than go with the conventional approach of organising many layers of neurons with uniform activation functions, we will try to abandon the idea of layers altogether, so each neuron can potentially connect to any other neuron in our network. Also, rather than sticking with neurons that use a uniform activation function, such as sigmoids or Relu's, we will allow many types of neurons with many types of activation functions, such as sigmoid, tanh, Relu, sine, Gaussian, abs, square, and even addition and multiplicative gates.

The genetic algorithm called $\small NEAT$ will be used to evolve our neural nets from a very simple one at the beginning to more complex ones over many generations. The weights of the neural nets will be solved via back propagation. The awesome recurrent.js library made by Karpathy, makes it possible to build computational graph representation of arbitrary neural networks with arbitrary activation functions. I implemented the $\small NEAT$ algorithm to generate representations of neural nets that $\small recurrent.js$ can process, so that the library can be used to forward pass through the neural nets that $\small NEAT$ has discovered, and also to backprop the neural nets to optimise for their weights.
...

• More on the $\small NEAT$ algorithm:

• Explanation of NEAT and HyperNEAT? [reddit]

• Mentioned here: Neural Network Evolution Playground with Backprop $\small NEAT$ (the demo quoted above).

• More on $\small NEAT$ here [reddit].

• ReLU (activation functions; pros / cons; ...) are discussed in this "tips" file:

• ReLU vs. Leaky ReLU are discussed in this [reddit] post: What are the advantages of ReLU over the Leaky ReLU (in FFNN)?

• Sigmoid function question
I am using the sigmoid function for forward propagation:

import numpy as np

def nonlin(x, deriv=False):
    if deriv == True:
        return x*(1-x)
    return 1/(1+np.exp(-x))

but my input values can be very large as well as zero, so the output values are either 1 or 0, respectively. Should I map my input values to a smaller domain? use a different function?

• Try scaling your inputs so that the mean is 0 and variance is 1 or something like that. Also is it supposed to be returning the derivative if deriv is true?

• Your code doesn't seem right. Maybe you meant

def nonlin(x, deriv=False):
    s = 1/(1+np.exp(-x))
    return s*(1-s) if deriv else s

• You have discovered the twin curses of Numerical Instability and Vanishing Gradient! Yeah, if your inputs are very large or very small, you probably don't want a sigmoid activation function (or anything that looks remotely sigmoid if you squint). The typical suggestions are:

Rectified Linear Unit:

def ReLU(x):
    return max(0, x)

Leaky ReLU:

def LReLU(x):
    return max(0.01 * x, x)

Exponential Linear Unit:

def ELU(x):
    return np.exp(x) - 1 if x < 0 else x

• Using tanh activation functions for the first hidden layer and ReLUs for subsequent hidden layers
I have been using tanh activation functions for my first hidden layer and ReLUs for the subsequent hidden layers for a while now, based on my observations that it seems to work slightly better than other variations. Presented a paper last week with this setup: Beat Tracking with a Cepstroid Invariant Neural Network (though the machine learning is rather simple by the standards of this forum). Anyway, a researcher from Apple came by my poster and it became apparent that they are doing exactly this too.

Is this a common setup? I mean it could be motivated in various ways, e.g.:

• Less risk of vanishing gradients, as opposed to when tanh units are used in subsequent layers

• The sparsity induced by ReLUs could arguably be more relevant at later layers of processing (i.e. nice to use smooth activations for edges but more relevant to disentangle factors of variation for high-level representations).

• Interesting. Could you give some numbers on the difference in training and val score? Also am I reading your paper correctly that you have ≤ 3 layers and ≤ 25 neurons? I don't think vanishing gradient should be a problem at all with such a network size.

• Yes they are small, hence my "rather simple by the standards of this forum"-remark. Biggest was the input layer of the CINN at around 1300, but that network is not relevant for the discussion.

The difference was a few percent of the error when I tested for another study of perceptual features (not yet published). I got the impression from a few runs of pure tanh or ReLUs during development in this study that the same applied. It wasn't really a focus of mine until I heard that they were doing the same at a tech company without me ever reading about it in any literature. Some comparisons could be interesting to do in the future yes! (For a paper more devoted to the actual machine learning than for beat tracking).

• I have not heard of it, and have little intuition into why it might be effective, save for that it could help ward against neurons dying. A good experiment could be to look at what proportions tend to die, saturation of tanhs, magnitudes of activations for various architectures around this theme.

• Yes, we know that there are some benefits to tanh units (one being to avoid "dead" neurons as you say), and we know that there are some benefits to ReLUs. It feels intuitive to me that the benefits of ReLUs are most important for later layers. For example, sparsity could be more important when the representations you are processing are more complex (e.g. your processing may get more accurate if you allow for more shades of "edge-strength" than that of "two-eye-shaped-thingies-close-together-strength").

And thanks for the tips. I have been running some simple experiments but nothing published yet.
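
A minimal Keras sketch of the setup discussed in this thread (tanh for the first hidden layer, ReLU for the later ones); the layer sizes are illustrative only, not those of the paper.

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(25, input_dim=40, activation='tanh'))   # first hidden layer: tanh
model.add(Dense(25, activation='relu'))                 # subsequent hidden layers: ReLU
model.add(Dense(25, activation='relu'))
model.add(Dense(1, activation='sigmoid'))               # binary output (e.g. beat / no beat)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])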

• What are the advantages of ReLU over sigmoid function in deep neural network?
The state of the art for non-linearities is to use ReLU instead of the sigmoid function in deep neural networks. What are the advantages? I know that training a network when ReLU is used would be faster, and it is more biologically inspired. What are the other advantages? (That is, are there any disadvantages of using sigmoid?)

• Two additional major benefits of ReLUs are sparsity and a reduced likelihood of vanishing gradient. But first recall the definition of a ReLU is $\small h = max(0,a)$, where $\small a = Wx + b$. One major benefit is the reduced likelihood of the gradient to vanish. This arises when a > 0. In this regime the gradient has a constant value. In contrast, the gradient of sigmoids becomes increasingly small as the absolute value of x increases. The constant gradient of ReLUs results in faster learning.

The other benefit of ReLUs is sparsity. Sparsity arises when a ≤ 0.

The more such units that exist in a layer the more sparse the resulting representation. Sigmoids on the other hand are always likely to generate some non-zero value resulting in dense representations. Sparse representations seem to be more beneficial than dense representations.

• When you say the gradient, you mean with respect to weights or the input x?

• With respect to the weights. Gradient-based learning algorithms always take the gradient with respect to the parameters of the learner, i.e. the weights and biases in a NN.
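
A quick NumPy illustration of the saturation point made in this thread: the local gradient of the sigmoid shrinks towards zero as the pre-activation moves away from zero, while the ReLU gradient is a constant 1 wherever $\small a > 0$ (the values below are arbitrary).

import numpy as np

a = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])   # pre-activations a = Wx + b (arbitrary values)

s = 1.0 / (1.0 + np.exp(-a))                  # sigmoid
d_sigmoid = s * (1.0 - s)                     # local gradient of the sigmoid
d_relu = (a > 0).astype(float)                # local gradient of the ReLU

print(d_sigmoid)   # approx. [4.5e-05, 0.105, 0.25, 0.105, 4.5e-05] -> vanishes in the tails
print(d_relu)      # [0., 0., 0., 1., 1.]     -> constant 1 for a > 0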

• What Activation Function to use for Convolutional Layers?
• ReLUs are the best default choice; they should outperform sigmoid if you are just doing standard CNN stuff. If you use ReLUs in your case and it fails to converge, you should check your implementation.

• Note this (excellent answers): What is the "dying ReLU" problem in neural networks? [DataScience:StackExchange]

### Activation Functions - Instruction

• Commonly used activation functions [ Stanford cs231n  |  Andrej Karpathy ]

Every activation function (or non-linearity) takes a single number and performs a certain fixed mathematical operation on it. There are several activation functions you may encounter in practice:

Sigmoid. The sigmoid non-linearity has the mathematical form $\sigma(x) = 1 / (1 + e^{-x})$ and is shown in the image above on the left. As alluded to in the previous section, it takes a real-valued number and "squashes" it into range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1). In practice, the sigmoid non-linearity has recently fallen out of favor and it is rarely ever used. It has two major drawbacks:

• Sigmoids saturate and kill gradients. A very undesirable property of the sigmoid neuron is that when the neuron's activation saturates at either tail of 0 or 1, the gradient at these regions is almost zero. Recall that during backpropagation, this (local) gradient will be multiplied to the gradient of this gate's output for the whole objective. Therefore, if the local gradient is very small, it will effectively "kill" the gradient and almost no signal will flow through the neuron to its weights and recursively to its data. Additionally, one must pay extra caution when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large then most neurons would become saturated and the network will barely learn.

• Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g., $x > 0$ elementwise in $f = w^Tx + b$), then the gradient on the weights $w$ will, during backpropagation, become either all positive or all negative (depending on the gradient of the whole expression $f$). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.

Tanh. The tanh non-linearity is shown on the image above on the right. It squashes a real-valued number to the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid nonlinearity. Also note that the tanh neuron is simply a scaled sigmoid neuron, in particular the following holds: $\tanh(x) = 2 \sigma(2x) -1$.

ReLU. The Rectified Linear Unit has become very popular in the last few years. It computes the function $f(x)=max(0,x)$. In other words, the activation is simply thresholded at zero (see image above on the left). There are several pros and cons to using the ReLUs:

(+) It was found to greatly accelerate (e.g. a factor of 6 in Krizhevsky et al.) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.

(+) Compared to tanh/sigmoid neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero.

(-) Unfortunately, ReLU units can be fragile during training and can "die". For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any datapoint again. If this happens, then the gradient flowing through the unit will forever be zero from that point on. That is, the ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be "dead" (i.e. neurons that never activate across the entire training dataset) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

Leaky ReLU. Leaky ReLUs are one attempt to fix the "dying ReLU" problem. Instead of the function being zero when $x < 0$, a leaky ReLU will instead have a small negative slope (of 0.01, or so). That is, the function computes $f(x) = \mathbb{1}(x < 0)(\alpha x) + \mathbb{1}(x \ge 0)(x)$ where $\alpha$ is a small constant. Some people report success with this form of activation function, but the results are not always consistent. The slope in the negative region can also be made into a parameter of each neuron, as seen in PReLU neurons, introduced in "Delving Deep into Rectifiers", by He et al. (2015). However, the consistency of the benefit across tasks is presently unclear.

Maxout. Other types of units have been proposed that do not have the functional form $f(w^Tx + b)$ where a non-linearity is applied on the dot product between the weights and the data. One relatively popular choice is the Maxout neuron (introduced recently by Goodfellow et al.) that generalizes the ReLU and its leaky version. The Maxout neuron computes the function $\max(w_1^Tx+b_1, w_2^Tx + b_2)$. Notice that both ReLU and Leaky ReLU are a special case of this form (for example, for ReLU we have $w_1, b_1 = 0$). The Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU). However, unlike the ReLU neurons it doubles the number of parameters for every single neuron, leading to a high total number of parameters.

This concludes our discussion of the most common types of neurons and their activation functions. As a last comment, it is very rare to mix and match different types of neurons in the same network, even though there is no fundamental problem with doing so.

TLDR: "What neuron type should I use?" Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of "dead" units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout.

### Activation Functions - Papers

• Cai B (2016) DehazeNet: An End-to-End System for Single Image Haze Removal. arXiv:1601.07661
• Single image haze removal is a challenging ill-posed problem. Existing methods use various constraints/priors to get plausible dehazing solutions. The key to achieve haze removal is to estimate a medium transmission map for an input hazy image. In this paper, we propose a trainable end-to-end system called DehazeNet, for medium transmission estimation. DehazeNet takes a hazy image as input, and outputs its medium transmission map that is subsequently used to recover a haze-free image via atmospheric scattering model. DehazeNet adopts Convolutional Neural Networks (CNN) based deep architecture, whose layers are specially designed to embody the established assumptions/priors in image dehazing. Specifically, layers of Maxout units are used for feature extraction, which can generate almost all haze-relevant features. We also propose a novel nonlinear activation function in DehazeNet, called Bilateral Rectified Linear Unit (BReLU), which is able to improve the quality of recovered haze-free image. We establish connections between components of the proposed DehazeNet and those used in existing methods. Experiments on benchmark images show that DehazeNet achieves superior performance over existing methods, yet keeps efficient and easy to use.

• Courbariaux M [Bengio Y] (2016) Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830  |  fast MNIST MLP  |  GitHub  |  reddit: Best work in Deep Learning. Deserves a perfect 1/1.  |  GitXiv  |  binary_net by chainer [GitHub]

• We introduce a method to train Binarized Neural Networks (BNN) - neural networks with binary weights and activations at run-time and when computing the parameters' gradient at train-time. We conduct two sets of experiments, each based on a different framework, namely Torch7 and Theano, where we train BNNs on MNIST, CIFAR-10 and SVHN, and achieve nearly state-of-the-art results. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to a great increase in power-efficiency. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available.

• Godfrey LB (2016) A continuum among logarithmic, linear, and exponential functions, and its potential to improve generalization in neural networks. arXiv:1602.01321  |  reddit
• We present the soft exponential activation function for artificial neural networks that continuously interpolates between logarithmic, linear, and exponential functions. This activation function is simple, differentiable, and parameterized so that it can be trained as the rest of the network is trained. We hypothesize that soft exponential has the potential to improve neural network learning, as it can exactly calculate many natural operations that typical neural networks can only approximate, including addition, multiplication, inner product, distance, polynomials, and sinusoids.

• Goodfellow IJ [Courville A; Bengio Y] (2013) An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv:1312.6211  |  Can deep learning models "forget" their training? << "Yes. It's called catastrophic forgetting."

• 'Maxout' activation function:  Goodfellow IJ [Courville A; Bengio Y] (2013) Maxout networks. arXiv:1302.4389  |  webpage  |  cited here [Andrej Karpathy's cs231n CNN course]  |  reddit
• We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.

• Gulcehre C [Bengio Y] (2016) Noisy Activation Functions. arXiv:1603.00391
• Common activation functions used in NN can lead to training difficulties due to the saturation behavior of the activation function, which may hide dependencies which are not visible to first order (using only gradients). Gating mechanisms ... are good examples of this. We propose to exploit the injection of appropriate noise so that some gradients may sometimes flow, even if the noiseless application of the activation function would yield zero gradient. Large noise will dominate the noise-free gradient and allow stochastic gradient descent to be more exploratory. By adding noise only to the problematic parts of the activation function we allow the optimization procedure to explore the boundary between the degenerate (saturating) and the well-behaved parts of the activation function. We also establish connections to simulated annealing, when the amount of noise is annealed down, making it easier to optimize hard objective functions. We find experimentally that replacing such saturating activation functions by noisy variants helps training in many contexts, yielding state-of-the-art results on several datasets, especially when training seems to be the most difficult, e.g., when curriculum learning is necessary to obtain good results.

• He K (2015) Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv:1502.01852  |  reddit

• Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. Second, we derive a robust initialization method that particularly considers the rectifier nonlinearities. This method enables us to train extremely deep rectified models directly from scratch and to investigate deeper or wider network architectures. Based on our PReLU networks (PReLU-nets), we achieve 4.94% top-5 test error on the ImageNet 2012 classification dataset. This is a 26% relative improvement over the ILSVRC 2014 winner (GoogLeNet, 6.66%). To our knowledge, our result is the first to surpass human-level performance (5.1%, Russakovsky et al.) on this visual recognition challenge.

• Hendrycks D (2016) Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. arXiv:1606.08415  |  GitHub  |  reddit
• We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU nonlinearity is the expected transformation of a stochastic regularizer which randomly applies the identity or zero map, combining the intuitions of dropout and zoneout while respecting neuron values. This connection suggests a new probabilistic understanding of nonlinearities. We perform an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and find performance improvements across all tasks.
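
For reference, the GELU in this paper is $\small x \cdot \Phi(x)$ (the input scaled by the standard normal CDF); a NumPy sketch using the tanh approximation the authors give:

import numpy as np

def gelu(x):
    # GELU(x) = x * Phi(x), tanh approximation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

print(gelu(np.array([-2.0, -0.5, 0.0, 0.5, 2.0])))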

• Layer-sequential unit-variance (LSUV) initialization for CNN: Implementation of the Layer-sequential unit-variance neural network initialization described in the paper "All you need is a good init" [GitXiv: links to arXiv:1511.06422]:

• Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one. Experiment with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produces networks with test accuracy better or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)). Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets and the state-of-the-art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.
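
A rough sketch of the two LSUV steps described above, for a small fully-connected ReLU net. This is a simplified reading of the abstract, not the authors' code; the layer shapes, tolerance, and batch size are arbitrary.

import numpy as np

def orthonormal(shape, rng):
    # Step 1: pre-initialize with an orthonormal matrix (via SVD of a Gaussian draw).
    a = rng.randn(*shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == shape else vt

def lsuv_init(layer_shapes, x, rng, tol=0.05, max_iter=10):
    # Step 2: go from the first to the last layer, rescaling each weight matrix
    # until the variance of that layer's output is close to 1 on the batch x.
    weights, h = [], x
    for shape in layer_shapes:
        W = orthonormal(shape, rng)
        for _ in range(max_iter):
            v = h.dot(W).var()
            if abs(v - 1.0) < tol:
                break
            W /= np.sqrt(v)
        weights.append(W)
        h = np.maximum(h.dot(W), 0.0)   # ReLU between layers (illustrative)
    return weights

rng = np.random.RandomState(0)
x = rng.randn(256, 100)                       # a data batch used for the variance estimates
ws = lsuv_init([(100, 64), (64, 64)], x, rng)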

• Mhaskar H (2016) Deep vs. shallow networks: An approximation theory perspective. arXiv:1608.03287  |  reddit

• The paper briefly reviews several recent results on hierarchical architectures for learning from examples, that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better in function approximation problems than shallow, one-hidden layer architectures. The paper announces new results for a non-smooth activation function - the ReLU function - used in present-day neural networks, as well as for the Gaussian networks. We propose a new definition of relative dimension to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.

• reddit: The hierarchical softmax is probably the most popular application of binary trees in neural networks, but they are widely used in recursive neural networks (an old concept, but recently popularised by Richard Socher).

• Rister R [Stanford | Student Member, IEEE] (2016) Piecewise convexity of artificial neural networks. arXiv:1607.04917
• Although artificial neural networks have shown great promise in applications ranging from computer vision to speech recognition, there remains considerable practical and theoretical difficulty in optimizing their parameters. The seemingly unreasonable success of gradient descent methods in minimizing these non-convex functions remains poorly understood. In this work we offer some theoretical guarantees concerning networks with continuous piecewise affine activation functions, which have in recent years become the norm. We prove three main results. Firstly, that the network is piecewise convex as a function of the input data. Secondly, that the network, considered as a function of the parameters in a single layer, all others held constant, is again piecewise convex. Finally, that the network as a function of all its parameters is piecewise multi-convex, a generalization of biconvexity. Accordingly, we show that any point to which gradient descent converges is a local minimum of some piece. Thus gradient descent converges to non-minima only at the boundaries of pieces. These results might offer some insights into the effectiveness of gradient descent methods in optimizing this class of networks.

• Scardapane S (2016) Learning activation functions from data using cubic spline interpolation. arXiv:1605.05509  |  GitXiv
• Neural networks require a careful design in order to perform properly on a given task. In particular, selecting a good activation function (possibly in a data-dependent fashion) is a crucial step, which remains an open problem in the research community. Despite a large amount of investigations, most current implementations simply select one fixed function from a small set of candidates, which is not adapted during training, and is shared among all neurons throughout the different layers. However, neither two of these assumptions can be supposed optimal in practice. In this paper, we present a principled way to have data-dependent adaptation of the activation functions, which is performed independently for each neuron. This is achieved by leveraging over past and present advances on cubic spline interpolation, allowing for local adaptation of the functions around their regions of use. The resulting algorithm is relatively cheap to implement, and overfitting is counterbalanced by the inclusion of a novel damping criterion, which penalizes unwanted oscillations from a predefined shape. Experimental results validate the proposal over two well-known benchmarks.

• Spring R [Shrivastava A] (2016) Scalable and Sustainable Deep Learning via Randomized Hashing. arXiv:1602.08194  |  very DNN; hashing  |  reddit
• Current deep learning architectures are growing larger in order to learn from enormous datasets. These architectures require giant matrix multiplication operations to train millions or billions of parameters during forward and back propagation steps. These operations are very expensive from a computational and energy standpoint. We present a novel technique to reduce the amount of computation needed to train and test deep networks drastically. Our approach combines recent ideas from adaptive dropouts and randomized hashing for maximum inner product search to select only the nodes with the highest activation efficiently. Our new algorithm for training deep networks reduces the overall computational cost, of both feed-forward pass and backpropagation, by operating on significantly fewer nodes. As a consequence, our algorithm only requires 5% of computations (multiplications) compared to traditional algorithms, without any loss in the accuracy. Furthermore, due to very sparse gradient updates, our algorithm is ideally suited for asynchronous training leading to near linear speedup with increasing parallelism. We demonstrate the scalability and sustainability (energy efficiency) of our proposed algorithm via rigorous experimental evaluations.

• Trottier L (2016) Parametric Exponential Linear Unit for Deep Convolutional Neural Networks. arXiv:1605.0933  |  reddit

• The activation function of Deep Neural Networks (DNNs) has undergone many changes during the last decades. Since the advent of the well-known non-saturated Rectified Linear Unit (ReLU), many have tried to further improve the performance of the networks with more elaborate functions. Examples are the Leaky ReLU (LReLU) to remove zero gradients and Exponential Linear Unit (ELU) to reduce bias shift. In this paper, we introduce the Parametric ELU (PELU), an adaptive activation function that allows the DNNs to adopt different non-linear behaviors throughout the training phase. We contribute in three ways: (1) we show that PELU increases the network flexibility to counter vanishing gradient, (2) we provide a gradient-based optimization framework to learn the parameters of the function, and (3) we conduct several experiments on MNIST, CIFAR-10/100 and ImageNet with different network architectures, such as NiN, Overfeat, All-CNN, ResNet and Vgg, to demonstrate the general applicability of the approach. Our proposed PELU has shown relative error improvements of 4.45% and 5.68% on CIFAR-10 and 100, and as much as 7.28% with only 0.0003% parameter increase on ImageNet, along with faster convergence rate in almost all test scenarios. We also observed that Vgg using PELU tended to prefer activations saturating close to zero, as in ReLU, except at last layer, which saturated near -2. These results suggest that varying the shape of the activations during training along with the other parameters helps to control vanishing gradients and bias shift, thus facilitating learning.

• Srivastava RK [Schmidhuber J] (2015). Highway networks. arXiv:1505.00387  |  GitXiv

• There is plenty of theoretical and empirical evidence that depth of neural networks is a crucial ingredient for their success. However, network training becomes more difficult with increasing depth and training of very deep networks remains an open problem. In this extended abstract, we introduce a new architecture designed to ease gradient-based training of very deep networks. We refer to networks with this architecture as highway networks, since they allow unimpeded information flow across several layers on "information highways". The architecture is characterized by the use of gating units which learn to regulate the flow of information through a network. Highway networks with hundreds of layers can be trained directly using stochastic gradient descent and with a variety of activation functions, opening up the possibility of studying extremely deep and efficient architectures.

• implemented here: Recurrent Highway & Multiplicative Integration -- Implementation in TensorFlow  |  GitHub

• Wan L [Zeiler M; LeCun Y; Fergus R] (ICML 2013) Regularization of neural networks using DropConnect. pdf |  GitHub  |  GitXiv

• We introduce DropConnect, a generalization of Dropout (Hinton et al., 2012), for regularizing large fully-connected layers within neural networks. When training with Dropout, a randomly selected subset of activations are set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit thus receives input from a random subset of units in the previous layer. We derive a bound on the generalization performance of both Dropout and DropConnect. We then evaluate DropConnect on a range of datasets, comparing to Dropout, and show state-of-the-art results on several image recognition benchmarks by aggregating multiple DropConnect-trained models.

• Implementations:

• Wu X (2015) A Lightened CNN for Deep Face Representation. arXiv:1511.02683  |  GitXiv
• Convolution neural network (CNN) has significantly pushed forward the development of face recognition techniques. To achieve ultimate accuracy, CNN models tend to be deeper or multiple local facial patch ensemble, which result in a waste of time and space. To alleviate this issue, this paper studies a lightened CNN framework to learn a compact embedding for face representation. First, we introduce the concept of maxout in the fully connected layer to the convolution layer, which leads to a new activation function, named Max-Feature-Map (MFM). Compared with widely used ReLU, MFM can simultaneously capture compact representation and competitive information. Then, one shallow CNN model is constructed by 4 convolution layers and totally contains about 4M parameters; and the other is constructed by reducing the kernel size of convolution layers and adding Network in Network (NIN) layers between convolution layers based on the previous one. These models are trained on the CASIA-WebFace dataset and evaluated on the LFW and YTF datasets. Experimental results show that the proposed models achieve state-of-the-art results. At the same time, a reduction of computational cost is reached by over 9 times in comparison with the released VGG model.

• Xu B (2015) Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853.  |  reddit

• Xu H (2015) Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering. arXiv:1511.05234.
• We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses the question to choose relevant regions for computing the answer, a process of which constitutes a single "hop" in the network. We propose a novel spatial attention architecture that aligns words with image patches in the first hop, and obtain improved results by adding a second attention hop which considers the whole question to choose visual evidence based on the results of the first hop. To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the attention weights. We evaluate our model on two published visual question answering datasets, DAQUAR [1] and VQA [2], and obtain improved results compared to a strong deep baseline model (iBOWIMG) which concatenates image and question features to predict the answer [3].

## BATCH NORMALIZATION

• Comment on batch normalization in Hacker News comments on TensorFlow Playground:

There's one recent advance in particular that isn't in this demo, and that is Batch Normalization. If you've played around with it a bit, I'm sure you have seen that deeper layers are hard to train. You see the dashed lines representing signal in the network become weaker and weaker as the network gets deeper. BatchNorm works wonders with this. It takes statistics from the minibatch of training examples, and tries to normalize it so that the next layer gets input more similar to what it expects, even if the previous layer has changed. In practice you get a much better signal, so the network can learn a lot more efficiently.

Without BatchNorm, more than two hidden layers is tedious and error-prone to train. With it, you can train 10-12 layers easily. (With another recent advance, residual nets, you can train hundreds!)
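
For concreteness, the per-feature transform batch normalization applies to a minibatch at training time (a bare NumPy sketch; the running statistics used at test time are an additional detail handled by real implementations):

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the minibatch, then apply a learnable scale and shift.
    mu = x.mean(axis=0)                      # per-feature minibatch mean
    var = x.var(axis=0)                      # per-feature minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 100) * 5.0 + 3.0     # a badly scaled minibatch of 32 samples
y = batch_norm(x, gamma=np.ones(100), beta=np.zeros(100))
print(y.mean(axis=0)[:3], y.var(axis=0)[:3]) # close to 0 and 1 for each feature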

• u/enematurret: Depth & Batch Normalization (and also Dropout)
I've been doing some kind of extensive testing on how deep nets can get with/without BN and also indirectly with/without dropout, and got some interesting results so far.

BTW, I've been mainly testing fully-connected nets, from 50 to 100 neurons per hidden layer only. With pre-activation BN (dot -> BN -> ReLU), I can train networks with up to ~21 hidden layers effectively (~98% val acc). When I get to around 25 layers it breaks completely (stuck at 10% error forever). However, with post-activation BN networks with up to around 35 layers converge normally, and start breaking when they have ~50 hidden layers. Interestingly, dropout also has a really weird factor in all of this. For 10 hidden layers, a dropout rate of 0.5 in the last layer only makes the validation accuracy go 2% up. However, with 20 hidden layers the same rate makes the network break - while no dropout at all makes it converge normally.

Are there any studies and experiments on such things that you all are aware of? I'd like to compare and check what's been done and what hasn't before spending more time on this.

On a side-note, I'm testing this on Keras, using net = BatchNormalization(mode=1)(net) to apply BN. I'm not 100% sure that's the right way to do it, but that seems to be the correct way from what I read in the documentation (at least for fully-connected nets).

• u/Nimitz14: I don't have anything to add, but just want to mention that that's quite a low number of hidden neurons, and that the behaviour you will get during training depends very strongly on exactly the sort of data you have.

• u/enematurret: I do agree, but that's how many neurons were used in the Highway Neural Networks paper. I'll try playing around with more. All my experiments are on the MNIST dataset, BTW.  >>  Just tested with 25 hidden layers, 250 neurons per layer. Breaks with pre-activation BN, converges with post.

• u/PM_ME_YOUR_GRADIENTS: I might be wrong but aren't you using the wrong Keras mode? $mode=1$ implies that you are doing sample-wise normalization. That means, you look at all the features and take the mean across them and subtract that from the incoming input sample. That is not what "true" BN is if I am not mistaken. What you want is mode=0 so you take the mean over all inputs for each feature and try to ensure that everything going into the ReLU is mean=0, var=1. Again, not an expert, so take it with a grain of salt.

EDIT: Here's an old StackOverflow question that reinforces my point: Where do I call the BatchNormalization function in Keras? I am not sure if the API has changed since then, but given how well thought out the project is, I doubt it has broken backward compatibility.

• u/enematurret: I'm pretty sure you're right. Just re-read the Keras docs and also did a few tests. $mode=0$ truly makes the column-wise means be 0 (in Theano format, where columns represent features and rows represent samples), whereas mode=1 doesn't. That's not true for the variance, though. Some columns indeed get their variance normalized to 1, but several others have it as low as 0.07 for some reason. All betas and gammas are 1 and 0, so they're not affecting the final means and variances. I'll test some more when I have the time, but it seems that mode=0 changes a lot of the previous results I mentioned, including 25-layered nets breaking with pre-activation BN and not with post, which agrees a bit more with the research I guess. Thanks for the input.

• u/PM_ME_YOUR_GRADIENTS: "That's not true for the variance, though. Some columns indeed get their variance normalized to 1, but several others have it as low as 0.07 for some reason. All betas and gammas are 1 and 0, so they're not affecting the final means and variances."  <<  Aah. There is another gotcha in Keras' BN. With $mode=0$, you get a running variance and mean. So, it might not be exactly 1. If you want the true variance of the batch you're testing, try $mode=2$. That should fix it. I'd love to hear if it did; could you report back your findings if you can?

EDIT: I did some back-of-the-envelope math. It seems that even if the inputs are normalized, the normalization of each feature is at the mercy of the weights. Say $X$ is the output of the previous layer (of size num-of-dps x nodes). $mode=1$ will make sure that the rows are unit-normalized. However, for any given node in the next layer, the input to the $i^{th}$ activation is $X(w_i)$. If $w_i$ are poorly scaled, this quantity will not be normal at all. The fact that performance got worse as you increased the number of layers then makes sense, since the gradients might be exploding, giving rise to weird asymmetries in the weights and, in turn, destroying the normality of the input to the activation.
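
For reference, the pre- vs. post-activation orderings compared in this thread look roughly like this in 1.x-era Keras (per the exchange above, mode=0, i.e. feature-wise statistics, is the standard choice for dense layers; layer sizes are placeholders):

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.normalization import BatchNormalization

# Pre-activation BN:  dot product -> BN -> ReLU
pre = Sequential()
pre.add(Dense(50, input_dim=784))
pre.add(BatchNormalization(mode=0))
pre.add(Activation('relu'))

# Post-activation BN: dot product -> ReLU -> BN
post = Sequential()
post.add(Dense(50, input_dim=784))
post.add(Activation('relu'))
post.add(BatchNormalization(mode=0))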

• Noob question: why should we normalize test data with mean and std from training data?

• To normalize the predictive role of different features, even if they have vastly different variances and means.

• Nah. It's only really required for things like Neural Networks where it keeps the gradient descent of features in the space where gradient descent does best, and for Linear/Logistic Regression where it also isn't really required, but makes the weights interpretable as feature importance/contribution to the prediction.
For things like Random Forest, which are based on decision trees, they'll find a split anywhere, it doesn't matter how the features are scaled.
For stuff like Nearest Neighbours, it can be important, or it can hurt. This is because normalisation is like saying all features are equally important, which isn't necessarily true. It could be the case that you've got spatial information in a rectangular space, and so normalising is favouring the small axis of that rectangle over the other axis. I'd compare normalisation in this case to a uniform prior. It's often okay, but only if you don't know better.

• [OP] Not sure if I fully understand what you mean. Why can't we use mean and std from test data?
• Because the classifier doesn't "know" what those numbers should be ahead of time. That way the computations on the data are the same as what was done to the training data.

• Well, since both sets are samples from the same distribution, they should ideally have similar means and variances. They obviously won't be identical though, and in this case it makes sense to use the means and variances from the training data, since it's what the model was trained on. The model approximates a mapping from data standardized by the training data's mean and variance, so using the test data's mean and variance would give you inaccurate results.

• And honestly, if your train and test data have significantly different mean and variance, then the basic assumption of learning (that the training data represents the desired underlying distribution) is probably invalid.
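
The advice above in scikit-learn terms: fit the scaler on the training data only, then apply that same transform to the test data (the arrays here are toy placeholders).

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.randn(800, 20) * 3.0 + 1.0   # toy data
X_test = np.random.randn(200, 20) * 3.0 + 1.0

scaler = StandardScaler().fit(X_train)           # mean/std estimated from training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)            # test data scaled with the *training* statistics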

I'm training a deep network and am experimenting with using batch normalization. From reading the Ioffe and Szegedy paper, it seems that it's a solution for the saturation problem experienced by activation functions like the sigmoid.
My question is does using batch normalization make sense for Rectified Linear Units (ReLUs), which already are one solution to the saturation problem? I seem to be getting slightly better results using batch normalization with ReLUs, but wanted to understand if combining the two makes sense conceptually and whether other researchers have used them together.

• Correct; addressing the saturation problem is part of the motivation for batch normalization: you wouldn't want all your activations in the long tail of a sigmoid, since over there the gradient gets killed in backprop. This reasoning carries over to ReLU: you wouldn't want your activations below zero, as your gradient is zero over there. Moreover, you wouldn't want your activations in the linear part, because with all activations in the linear part, a neural network is just a linear approximator.

• In both cases (using a saturating nonlinearity or ReLU), I think Batch Normalization does help speed up the training, because it helps scale back all the activations into the same range of values, thus helping reduce the range of values the gradient of the cost function takes across different dimensions, and thus avoiding the need for smaller, carefully selected learning rates.

• The batch norm layer, as the name goes, performs normalization on the incoming training batch irrespective of activation functions. Its main goal was to deal with the non-zero-centeredness of the data, which is an unwanted characteristic as it slows down convergence to the minima (it makes most of your gradients either positive or negative, leading to more gradient steps than required).
• The original batch norm paper (that you seem to have read) applied it to a network with ReLUs, so I don't really get your question. Yes it makes sense (because solving the saturation problem is not its main goal), yes people do it, and yes it helps.

• Thanks. I see that they used both sigmoids and ReLUs in the paper. They do motivate their paper at least in part by discussing the saturation problem and saying that the optimizer would be less likely to get stuck in the saturated regime if the input distribution remains stable.

## CLASSIFICATION - DATA

• Choosing a modeling strategy
I am having trouble setting up a problem and was looking for help. I will present a simplified version: The dataset has 10,000 customers. The target variable, Y, has two states: 1 if the product is owned, 0 if it is not. There are 1,000 customers who own (Y=1) and 9,000 who do not (Y=0). There are also a set of predictors (X). The question is, what is the best way to predict which of the 9,000 who do not own the product are most like the 1,000 who do, and therefore may be most likely to purchase it? I would like to do some form of logistic regression because it is simple and interpretable, however I can't figure out how to set it up. Say I use the whole data set and regress Y against X. Then who out of the Y=0 people should be targeted? Those with the highest probability of being Y=1 but are actually not? My best guess now is a nearest neighbor search, where I take an average of the Y=1 customers and rank the Y=0 customers based on distance to that vector. But I feel as if there is a better way. I'm sure I am missing something obvious here. Any ideas? Thanks!

• KNN could be good for clustering to see if anything stands out. I'd be tempted to throw a decision tree at the problem too to see if anything shakes out there (and still crazy easy to interpret) depending on how many variables etc.

• Whatever model you choose, make sure to leave some of the training data (of both kinds) out to verify it predicts those well without being trained on them.
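
A minimal scikit-learn sketch of the logistic-regression approach discussed above: fit on all customers, hold out a validation split, then rank the current non-owners by their predicted probability of ownership. The data below is a random placeholder standing in for the real predictors.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: predictors for 10,000 customers; y: 1 = owns the product, 0 = does not.
X = np.random.randn(10000, 5)                        # placeholder predictors
y = (np.random.rand(10000) < 0.1).astype(int)        # roughly 1,000 owners

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("validation accuracy:", clf.score(X_val, y_val))

# Rank the non-owners by how "owner-like" the model thinks they are.
non_owners = np.where(y == 0)[0]
scores = clf.predict_proba(X[non_owners])[:, 1]
targets = non_owners[np.argsort(-scores)]            # most likely prospects first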

• Fraud detection:
Does anyone have any good tutorials, videos, books, talks, etc. on fraud detection (i.e. "Someone is buying something at your online store, how do you detect they are a fraudster.")? What algorithms are used, what are the best practices etc? A binary classifier built from historical data?
• Fraud detection is hard because:
• Data is imbalanced; on insurance claims data for example, you would have < 0.5% of claims being fraudulent
• Historical data is also somewhat censored (I know I'm using the wrong word but can't think of the correct terminology atm); because your historical data which you see is generally only the "obvious" ones and not the ones you've missed, in other words the "investigated" sample that you have is the ones you think are highly suspicious but you wouldn't know the ones which you haven't manually sampled.
In insurance claims, we would only look at 1-2% being assessed manually (i.e. have a definitive knowledge that is it fraudulent/non-fraudulent) whilst the rest of the 98-99% of claims are rather unknown. I think in general this situation would be the same for most fraud databases, though perhaps not as extreme.

• Recommendations:

• Feature engineering is really important. Encoding heuristics as features will send you really far.
• Ensemble modelling will help you a lot. GBMs are commonly employed in my experience.
• Oversampling can be used with the investigated vs non-investigated datasets, e.g. we would sample data sets so it's something like 10% fraudulent, 20% non-fraudulent + investigated, 70% not investigated.

• Good luck! This is a difficult problem where domain knowledge is actually key to success and failure and there is no free lunch here. The type of model you use is actually highly irrelevant. You can use naive Bayes, or logistic regression (see this paper [pdf]) that will do well. Depending again on how good your feature engineering is, kmeans, dbscan can also be extremely useful for generating features and even models to detect fraud.
• No one has answered yet, so I'll chime in, and I'm sure someone else can give a better answer. I'm in eCom and have thought about this very question. From studying ML just in the past 6 months, my best answer would be a Naive Bayes Classifier. You are actually looking for the probability of the order or person being fraud. Much like the probability of an email being spam, but with fewer data points. The success of the classifier will depend on what data you have and whether you have enough fraud to classify. This technique is shown in so many ML books and videos, so it shouldn't be too hard to find info. I've not coded a classifier to look at this, but in theory it should be pretty good, as fraudsters use common tricks and target things they can easily resell. Good luck!

• If a binary classifier (neural network model) achieves 99% training accuracy with 65% validation accuracy, what to do next?
Generally I'm considering:
1. Add regularizer (dropout, L2, etc), which helps to bring accuracy to 75%.
2. Shrink model size. Which does not help much.
Any other general advice? BTW, if the model could achieve 99% training accuracy, does it mean that with proper configuration the validation accuracy could also be very high?
• You're massively overfitting. The very first rule to avoid overfitting is to find a larger training data set (not fancy tricks). With a small data set there is no way to avoid overfitting, no matter what tricks you use. Note that depending on your classification problem, the data set you need could be much larger than you think. Even if you manage to find some combination of tricks that brings up validation accuracy, you're really just fooling yourself, as you'd then be in effect just overfitting on your validation data set too (by selecting among models that give better results on it).
• Even more regularization.
• Accuracy is a deceiving metric when we do not know the distribution of positive and negative samples. Validation vs. local accuracy is also deceiving when we do not know if the train data is different from the test data (perhaps collected on a different date, different distribution, whatever). We also do not know if there is a temporal effect in the data, data size, what your validation strategy is, and a ton of other (possibly) relevant stuff. But... since you are using neural networks, the answer is always: add more hidden layers.
• Let's say you have a train set of 100 samples. 99 are negative (labeled 0) and 1 is positive (labeled 1). Your classifier always predicts 0. You now have 99% accuracy with a dumb or overfitted classifier. That's the problem with an uneven distribution (imbalanced classes) together with a poorly chosen metric.
• Looks like a 1000 samples total (so 600 for training), which is pretty low for a complex model like a NN (overfit-danger galore). Stratified sampling may be better than random sampling (stratified keeps distributions between train and test the same). Edit: Can you try the same data set/validation strategy with a simple Logistic Regression and report results?
• You probably have too many features. Look into either feature selection or a preprocessing step to get a lower dimensional representation of the feature space.
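
Picking up the two suggestions above (stratified splits and a simple logistic-regression baseline), a minimal scikit-learn sketch; the data, fold count, and names here are placeholder assumptions, not the OP's setup:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.random.randn(1000, 20)              # placeholder features
y = np.random.randint(0, 2, size=1000)     # placeholder binary labels

# Stratified folds keep the class distribution the same in every train/validation split.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
baseline = LogisticRegression(max_iter=1000)

scores = cross_val_score(baseline, X, y, cv=cv, scoring='accuracy')
print('baseline accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))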

• Keras model stuck at 94.2% accuracy :-( Help?
Basically I am doing the Amazon Employee Access Challenge on Kaggle and I trained a neural net in Keras using the following:
cnn = keras.models.Sequential()

earlyStop = keras.callbacks.EarlyStopping(monitor='loss', patience=0, verbose=0, mode='auto')

sgd = keras.optimizers.SGD(lr=0.003)
cnn.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=["accuracy"])
cnn.fit(X_train, Y_train, batch_size=32, nb_epoch=10, show_accuracy=True, verbose=0, callbacks=[earlyStop])

No matter how many layers I add or how much I change the dropout/learning rate/etc., I ALWAYS get 94.21% (the same when I rebuilt the model in skflow using far fewer layers). Any idea what's going wrong here and why it plateaus after 3 epochs?
Images: learning chart  |  easier to read layer diagram

• So it looks like you have exactly 2 classes. I have a feeling the classes are not balanced, and the test set is. If your training data is in fact unbalanced, that would explain why you always get 94.21% - it is simply always classifying it as the majority class.
• Yep. To expand on this, OP has said below their test accuracy is 50%. Classic behaviour for a model which is simply saying "every example is class 1". It means the model isn't learning anything. Considering there are people on Kaggle leaderboard with good results, it is probably a data setup problem at OP's end.
• Pass a validation set to fit() (e.g., via validation_data), or create your own validation split and feed it in. It's useful to see train_loss/acc alongside val_loss/acc, so you can spot overfitting.
• A plateau after 3 epochs means you probably want to adjust the learning rate.
• Take a look at the predictions you got wrong, does that data look ok?
• Start tweaking with the hyper-parameters I guess? If I were you, I would:
• replace optimizer=sgd with optimizer='adam', just to see what happens
• Increase the batch-size a bit
• Increase Dropout fraction
• ...
• Unfortunately same results. I just don't understand why it is so crappy with this model. I run it locally on the training data and get 94%, and then when I run against their test data and submit, I get 50% ... it must be overfitting, right?

• Stacked LSTM for binary classification - Keras
I am trying to implement a stacked LSTM for a time series binary classification problem in Keras, but am getting stuck. Can anyone help me debug my problem? Stacked LSTM for sequence classification - Keras - Error [on hold]

• The error is saying your input to the model is the wrong shape. It looks like you're attempting a multiclass log loss instead of plain log loss, maybe? In other words, your model is set up to use simple log loss, but your y_train_smote_reshaped shape suggests a multiclass log loss scheme. Since your output is binary, either is appropriate, but I often find better results with multiclass log loss even in binary situations. Essentially, your problem is that the output layer has 1 output while your Y data has 2 expected outputs; this mismatch is the error.

Try changing these lines:

to:

Something to note: you'll need to ensure that each sample output Y in y_train_smote_reshaped has the property sum(Y) == 1 (since you're using softmax and multiclass log loss). This is typically a one-hot encoded vector (the true category). Try rmsprop, adam, and nadam for the optimizer.
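
A hedged sketch of the change being described (a 2-unit softmax output with one-hot targets and categorical log loss, instead of a single sigmoid output). Layer sizes, shapes, and data here are illustrative placeholders, not the OP's actual model (Keras 2 style):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import to_categorical

timesteps, n_features = 50, 8                       # placeholder shapes
X_train = np.random.randn(100, timesteps, n_features)
y_train = np.random.randint(0, 2, size=100)         # stand-in for y_train_smote_reshaped

model = Sequential()
model.add(LSTM(32, return_sequences=True, input_shape=(timesteps, n_features)))
model.add(LSTM(32))
model.add(Dense(2, activation='softmax'))           # 2 output units to match 2-column Y
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])

Y_onehot = to_categorical(y_train, num_classes=2)   # guarantees sum(Y) == 1 per sample
model.fit(X_train, Y_onehot, epochs=10, batch_size=32, verbose=0)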

• Optimize features in order to reclassify item.
I have a (large) dataset of items in a store, each with a range of features (price, size, weight, color, ratings, customer demand, etc). I also know how well each of these items are selling.
There is a lot of information out there on how to create a classifier that can predict how much a new item will sell based on its features [example: an item with a low price and high demand will probably sell a lot, while an item with no demand will never sell]. However, my goal is not to introduce new (successful) items into my store, but to optimize the current items.
If I have an item with some fixed properties that sells poorly, how can I find which features I need to change (and by how much) for the item to sell better? The typical classifier will be able to predict that an item won't sell, but how can you make it tell you "why" it won't sell, and what you need to change?
• A random forest could do this, or even plain old regression
• Multivariate regression will work well for your task. You can look at the coefficients to see which features most influence whether a product sells. If your 'sell' variable is binary (i.e., 1 = it sold, 0 = did not sell) then you'll have to use logistic regression. If 'sell' is a number (i.e., how many you sold) then standard linear regression will work. I don't know whether you are familiar with statistical packages, but the links above are examples in R (which is good and free).
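
A minimal sketch of the coefficient-inspection idea above, in Python rather than R; the feature columns, target, and coefficients are illustrative placeholders:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Illustrative item features and a numeric sales target (placeholder data).
df = pd.DataFrame({
    'price':  np.random.uniform(5, 50, 200),
    'weight': np.random.uniform(0.1, 2.0, 200),
    'rating': np.random.uniform(1, 5, 200),
})
units_sold = 100 - 1.5 * df['price'] + 20 * df['rating'] + np.random.randn(200) * 5

reg = LinearRegression().fit(df, units_sold)

# Coefficients indicate how a one-unit change in each feature moves predicted sales
# (standardize the features first if you want the magnitudes to be comparable).
for name, coef in zip(df.columns, reg.coef_):
    print(name, round(coef, 2))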

• Question about what supervised learning approach to use?
I have a project in which I have 4 known input variables providing 1 known output variable and I wanted to create an equation that would take these 4 known input variables and be able to predict an unknown output. I have a strong programming background and I was planning on using Python, though I pick up on any language quickly. Would anyone be able to point me in the right direction?
• What are the variable types? Continuous or binary (0s and 1s)? Based on the limited information, it sounds like your best bet is linear or logistic regression. It's fairly simple in R, which is a free, commonly used language for statistics.
• First, you have to be clear about whether you're facing a regression or classification problem. Then look into scikit-learn, which is an easy to use ML library for Python that should provide several alternatives for your problem.

• Recalibrating Neural Networks?
I am creating a prediction model for daily water consumption for households using neural networks in R. The input parameters would be: past 2 days' consumption, past 2 days' temperature (min and max), and the forecast day's temperature. Suppose I want to create a model that will train on the past 3 years of water consumption data (once) and then be used for daily water predictions for the next 100 years. I was thinking of setting a max error threshold at 10% for 5 consecutive days, and when this is violated the model will retrain itself using the past 3 months of data and then use this for further predictions, which should hopefully have better accuracy than before. The reason being that people's water consumption patterns are likely to change over the next few years, due to water conservation methods and awareness plans, or just people caring more about water, a precious resource. Does this idea of retraining when the error crosses a threshold make sense? Or does it only make sense when I have very little past data (1-2 months), and after a couple of years it will never need to retrain itself?
• First of all, why use a neural network? This is a pretty straightforward regression problem, so a NN seems like overkill. First you should check that there is some correlation in your data: do the chosen parameters depend on each other, and how much? Then you could try to fit some function to your data, even if it is simple least-squares regression. Least-squares regression will perform poorly for longer prediction horizons; it could be accurate for a few days, but I doubt it. This problem is periodic in nature, and the swings in water consumption during the year will likely break the LS regression model. There are much more powerful models for time series prediction (ARIMA -- autoregressive integrated moving average -- springs to mind). A good starting point could be this wiki page [Wikipedia: Time Series - Prediction & Forecasting].
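
A minimal ARIMA sketch for the time-series angle mentioned above, using statsmodels (0.12+). The order (2, 1, 2) and the synthetic daily series are assumptions; real consumption data would need order selection and likely seasonal terms:

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic daily water-consumption series with a yearly cycle (placeholder data).
idx = pd.date_range('2014-01-01', periods=3 * 365, freq='D')
consumption = pd.Series(
    200 + 30 * np.sin(2 * np.pi * np.arange(len(idx)) / 365) + np.random.randn(len(idx)) * 5,
    index=idx)

# Fit a simple ARIMA(2, 1, 2); order selection (AIC, ACF/PACF plots) is omitted here.
model = ARIMA(consumption, order=(2, 1, 2)).fit()
print(model.forecast(steps=7))   # next week's predicted consumption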

• Some Classes Sometimes Having Zero Probability Across A Testing Set
Every once in a while I come across this problem. I train a model that maps a probability of input x belonging to let's say 30 potential classes (y-labels). Sometimes, one or two classes come out with essentially zero average probability in the testing set. And it's not consistent. If I retrain the model, that problem typically goes away (or other classes may become "omitted"). I feel like this is a common machine learning problem. How/where can I read more about this problem?
• If that is a problem, you are probably using the wrong metric. Try cross-entropy.
• In a situation like this you need to punish wrong classifications more; do that with log loss, a.k.a. cross-entropy. It sounds like you are going by accuracy, which is not a good idea.

• What loss to use for variational autoencoder with real numbers?
• If you're trying to predict some real numbers, then MSE (mean squared error) is a commonly used loss metric. Cross-entropy is usually used when you are calculating class probabilities; you will need a different metric when doing regression, which is what it sounds like you are doing.

• Note that MSE is a royal pain to optimize well in neural architectures, especially in VAEs. At very least you will want some kind of batch norm.
If you have bounds on the range of values you expect, you can also normalize to [0,1], sigmoid the outputs, and use binary cross-entropy (you will have "targets" of 0.2, 0.8, 0.1, etc.). This is related to the "Dark Knowledge" approximation work from Hinton et al., but I think people used this trick before that for Bernoulli-Bernoulli RBMs on color image patches as well.
Even just bucketing and using a softmax per output (as in pixelRNN) can work extremely well.
Both of these work much better in my experience than MSE, which is basically a Gaussian distribution on the output with fixed variance. In practice Gaussian or GMM outputs seem hard to optimize in all but the smallest-dimensional cases. If you do go for MSE, be sure to use PCA on your data first to "gaussianize" it (learn the basis on the train set, though) - this usually improves results a lot. The semi-supervised learning paper from Kingma et al. [arXiv:1406.5298] uses this "trick" on SVHN and I found that it helped a lot in both optimization and overall performance. Also consider a Huber loss instead of straight MSE - it seems a little easier to train in my limited tests.
Real valued MNIST is just horrible to train on with a VAE (the distribution of pixel values really is bimodal, and the binarized version basically looks the same) - so if that is the goal/test case definitely consider a different dataset to test on! I did some stuff with VAE on real valued MNIST during IFT6266H15, see this post for some samples. I think something was wrong with BN during sampling in the feedforward case though - the speckle noise is really weird.

• +1 for bucketing and softmax. One possible problem with this is that it ignores the fact that your outputs are ordinal altogether and adds a lot of params. This usually isn't a problem if you have enough data, but otherwise you might look into the "discretized logistic likelihood" used in arXiv:1606.04934 (section 5.3.3).
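
For concreteness, a minimal sketch of the "normalize to [0,1], sigmoid the outputs, binary cross-entropy" suggestion above, shown for a plain autoencoder-style output layer rather than a full VAE (Keras; the data and layer sizes are arbitrary placeholders):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

x = np.random.rand(256, 64).astype('float32')         # placeholder real-valued data
x = (x - x.min()) / (x.max() - x.min() + 1e-8)         # normalize to [0, 1]; these become soft targets

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=64))
model.add(Dense(64, activation='sigmoid'))             # outputs in (0, 1), matching the soft targets
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(x, x, epochs=5, batch_size=32, verbose=0)    # reconstruct the normalized values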

## CLASSIFICATION | CLUSTERING - IMAGES

• How to localize objects given only class labels
Hi, I have a dataset of images with two classes (defective and non-defective). I have trained a CNN-based classifier which gives good accuracy. Now I would like to localize the defect in an image (if defective). For example, I have a metal case which has defects like scratches, dents, etc. I collected samples in which I have defective images and non-defective images, and trained a classifier which tells me which images are defective. Can I localize the defects for further inspection? Can this be done without annotating the defective samples with bounding boxes of defects?
• You may want to look at concepts like deconvolution and saliency maps, and papers like Zeiler & Fergus [arXiv:1311.2901 (2014)] and Yosinski [arXiv:1506.06579 (2015)]. The overlapping concept goes as follows: you can visualize the parts of images that CNNs use most, or the parts of images that would influence the decision (defective, non-defective) the most. In your case, I'd expect that the defects themselves will light up under these visualization approaches.
• You could train an autoencoder network on your non-defective samples. When you run a defective sample through the trained autoencoder it should not be able to reproduce the defects very well, and therefore there will be a large difference between input and output at the location of the defect.
• You could feed several cropped image patches to the classifier and select the one (or multiple up to a threshold) that maximally activates the defective decision.
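
A rough sketch of the cropped-patch idea above. The trained classifier `model`, patch size, stride, and the assumption that column 1 of model.predict() is P(defective) are all placeholders:

import numpy as np

# Slide a window over the image, score each crop with the trained classifier,
# and keep a heatmap of "defective" probabilities; the hottest region localizes the defect.
def defect_heatmap(model, image, patch=64, stride=32):
    h, w = image.shape[:2]
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i in range(heat.shape[0]):
        for j in range(heat.shape[1]):
            crop = image[i * stride:i * stride + patch, j * stride:j * stride + patch]
            heat[i, j] = model.predict(crop[np.newaxis, ...])[0, 1]   # assumed P(defective)
    return heat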

## CLASSIFICATION | CLUSTERING - TEXT

• Decision Trees for Text
I am considering implementing a decision tree for classifying text examples into binary categories (is or is not an example of). I have implemented a small example (weather dataset) and think that I have a good grasp of the concepts of entropy and information gain, however I am a little hazy as to how one would preprocess text so that the attributes are suitable for decision trees. Obviously some counting of the text will be required, but how would one represent the documents? Any suggestions or links would be appreciated.
• sklearn's CountVectorizer is a good place to start. It takes a list of sentences, tokenises them (splits on e.g. whitespace), then counts the frequency of each term, then turns this into a matrix (sentences by terms).
I'd recommend using binary=True at first to get absence/presence of terms, unless you really believe that word frequency is useful additional information. Start with as few assumptions as possible.
You might also do things like add Part-of-Speech tags (see NLTK or spaCy) as a later step. Definitely build short trees when you're starting (e.g., max_depth=3) and visualise the trees - you'll see which features the DecisionTreeClassifier thinks are most useful at the root of the tree, and if they don't make sense, it'll quickly help you debug the features you're using. Remember - Garbage In, Garbage Out.
If you need to clean your text beforehand, I've got lots of notes on Python cleaning processes.
• Going along with the short trees part - the default settings in sklearn decision trees will overfit the training data pretty heavily. Especially if you have a small data set you'll get high training accuracy but low testing accuracy. So you'll want to limit the depth either explicitly or via min_samples_split, min_samples_leaf, etc.
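
Putting the CountVectorizer + short-tree suggestions above into a minimal sketch; the toy sentences and labels are placeholders:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = ["cheap pills buy now", "meeting moved to friday",
        "win money fast", "lunch at noon?"]          # toy documents
labels = [1, 0, 1, 0]                                # 1 = is an example of the category

# binary=True records presence/absence of terms rather than raw counts.
vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)

# Keep the tree short at first so it is easy to inspect and doesn't overfit.
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X, labels)

print(sorted(vec.vocabulary_))                       # the feature columns (vocabulary)
print(tree.predict(vec.transform(["buy cheap pills"])))
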
• A bag of words model would work well in this case -- here, your documents become sparse feature vectors that can be either binary ("does a word occur in a document or not"), absolute counts / term frequencies ("how often does a word occur in a document") or term-frequency-inverse document frequencies ("how often does a word occur in a particular document inversely normalized by how often a word occurs throughout all documents -- the more often the lower the tf-idf"). Which vectorizing approach works best is a bit empirical, e.g., the binary counts may work better for smaller datasets with shorter documents etc. (but that's something you have to figure out via cross-validation).
I've written an intro to text classification here: http://arxiv.org/abs/1410.5329 (ignore the naive bayes parts; the feature representation parts should also work for decision trees).
Another tip: Instead of using a single decision tree, try a random forest if computationally feasible (i.e., if your dataset is not too large and it can be trained in reasonable time).
• I have some example Python code for the random forest training here, if useful, and here [Jupyter notebooks]. I've explained a bit more about the tf-idf computation. Lastly, a little write-up on growing a decision tree.
• Latent Dirichlet Allocation (LDA) is another bag-of-words based technique worth knowing about. With this you're trying to figure out, for a set of documents, what are the major topics, and what are the topic compositions of each document. You don't really need to understand the math that well to have fun with it.
There's a really good description. There's a good Python implementation of LDA, here. What these other guys said is also good.
• Yeah, I have looked into LDA and implemented the example from the Gensim tutorial, which is good, but like you say the maths of it defeated me somewhat, so I figured it might be a good idea to start with a simpler approach that I can code myself and work from there.
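
For reference, a minimal gensim LDA sketch along the lines of the tutorial mentioned above; the pre-tokenized toy documents and topic count are placeholders:

from gensim import corpora
from gensim.models import LdaModel

texts = [["rain", "cold", "wind", "umbrella"],
         ["sun", "warm", "beach"],
         ["wind", "rain", "storm", "cold"]]          # pre-tokenized toy documents

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]      # bag-of-words representation

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.show_topics(num_topics=2, num_words=4, formatted=False):
    print(topic_id, [w for w, _ in words])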

• Does ML apply for this type of problem?
Say I want to create structured data from text descriptions of apartments. For example, after running through a bunch of these listings, I'd expect to find categories like pets allowed, outdoor space, and security deposit, because those phrases appeared in so many of the listings. (And I'd expect these phrases to be found even if some of the listings have typos like "secury deposit" or "pets alow".) I assume this is the kind of problem I'd solve with ML, but I'm just getting started (experienced programmer, though). Any pointers? For specific tools, JavaScript / Node would be my preference, though I'm also asking for the general terms I need to research.
• This sounds like a classic text document clustering problem. Search around for "unsupervised learning" or "clustering". For the text analysis part, look up how to do "n-gram" analysis in your language of choice. I don't know if JS/Node has any clustering packages floating around, but R and Python both have robust libraries for clustering and text analytics. Your basic approach will be to find similarity clusters in the listings, then analyze each cluster to determine what relates the members of each cluster. You'll likely find clusters based on particular groups of terms that tend to appear together. Your concern about typos can be solved by using a good stemmer or lemmatizer.
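
A small sketch of the clustering pipeline described above, in Python for illustration (the listings, n-gram settings, and cluster count are placeholders; character n-grams are one way to be somewhat robust to the typos mentioned):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

listings = ["pets allowed, large outdoor space",
            "security deposit required, no pets",
            "secury deposit, pets alow",             # typos partially absorbed by char n-grams
            "garden and outdoor space, cats ok"]

vec = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5))
X = vec.fit_transform(listings)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)    # inspect each cluster to see which phrases group together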

## CLUSTERING

• [Q] Clustering on a self-similarity matrix
I have a sparse item-item matrix. Some 1's sprinkled about showing that two items have a high degree of similarity. The rest are NaN.
1 N N
N 1 N
N N N

We also have rules about the data that help us impute many of these NaN values with a 0 (there is no similarity between those items):

1 0 0
0 1 N
N N 0

For completion, let's impute the remaining NaN values with row/column means:

1       0       0
0       1       0.25
0.33    0.33    0

... something like that. My question is this: Does imputing the 0's in step 2 poison the ability to cluster? My concern is this: If I apply a rule, thus filling in many 0's, I worry that they may begin to form their own cluster. Is this a valid concern? Similar discussions are taking place on forums: here and here where people are clustering on similarity (or distance) matrices. But nothing really addressing the idea of imputing chunks of known dissimilarities by known rules. Edit: I see that Affinity Propagation in sklearn, here has an option for "precomputed" affinity. Could this be useful?

• BTW, another way to go when similarity matrix is given is spectral clustering, nice tutorial here. When it comes to your side problem of similarities /dissimilarities, you could consider random walk based clustering + setting probabilities of transitions according to your problem (making sure probability of transition between two dissimilar entities is 0 while it is non-zero for NaNs and similar entities)
• ... you could consider random walk based clustering ...
• [OP] Yes! I was just downloading the Markov Graph Clustering package. I think what I have is more of a graph-theory problem. Thank you

• Regarding your concern, points with more 0s are likely to be considered less similar to each other, and thus will be put in different clusters. Moreover, I would avoid giving different values to the remaining NaN entries, since you don't have reason to believe that their similarities differ. Just pick a constant between 0 and 1 which encodes your a priori knowledge of how similar two points are on average.
• [OP] I solved it! And you were totally right. Imputing threw the final algorithm off. I used Markov Graph Clustering and it worked like a charm! Now I know how to go from a similarity matrix to clusters :D :D
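
For reference, a minimal sketch of spectral clustering on a precomputed similarity matrix, as suggested above; the tiny matrix and cluster count are illustrative:

import numpy as np
from sklearn.cluster import SpectralClustering

# Imputed item-item similarity matrix (symmetric, values in [0, 1]); toy example.
S = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.9],
              [0.0, 0.1, 0.9, 1.0]])

# affinity='precomputed' tells sklearn to treat S directly as the affinity matrix.
labels = SpectralClustering(n_clusters=2, affinity='precomputed',
                            random_state=0).fit_predict(S)
print(labels)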

• [fintech] Look alike machine learning algorithm?
Hi, all! I am seeking advice on how to tackle a business project we currently have in our company. We are a growing fintech company with business in 8 countries. What we would like to do is base our decision on which country to enter next entirely on data, meaning that we would like to build an algorithm that outputs the country to which we should expand next. The problem is that it is a small-sample problem, and applying something like regression would not really be a sufficient solution in my opinion. Plus, we have 8 countries which we consider to be good and 0 countries which we can tell from our experience to be bad. We have many good ideas on the feature engineering side, but we struggle with choosing the right algorithm. The only thing that came to my mind was to try to make clusters and hope that some new country appears in the same cluster that the good countries are in. Do you have any suggestions or ideas, dear redditers? :)

• I would advise you to try hierarchical clustering. You should experiment a bit with the distance measures and with normalizing and/or weighting the features. Run the algorithm and print the result as a dendrogram or something similar, which is useful for analysis. Focus on getting the "good" countries grouped (as in, consider a cluster "good" if the majority of the countries in it are "good"). Then, start from a cluster that consists of all of your good countries (and some, but not all, unknown ones) and move towards single countries in the dendrogram to find a cluster which consists of all of your good countries and a minimum of unknown countries. Then focus your more detailed analysis on those countries. Hopefully this may help, at least with the general direction.

• Yeah, thanks!

• No problem! As I said I think the majority of the effort will be in experimenting how each feature affects distance/clustering, although I leave that to your team :) Don't forget to try out combinations of your features as additional features
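
A small sketch of the hierarchical-clustering/dendrogram workflow suggested above; the country names, features, and linkage method are placeholders:

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

countries = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'X1', 'X2']   # 8 "good" + candidates
features = np.random.rand(len(countries), 5)    # placeholder engineered features (normalize first)

Z = linkage(features, method='ward')            # experiment with other linkages/distances too
dendrogram(Z, labels=countries)
plt.show()                                      # look for clusters dominated by "good" countries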

## CNN - LAYERS, FILTERS

• Say you have an [80x80] image; how does one determine the filter size and the number of filters in the first conv layer?
• Rules of thumb + trial and error (cross-validation):
• Convolution neural networks - Understanding layer input sizes/shapes
I am unsure about how input shape of each hidden convolutional layers should be.
Here is an example:
My input is an [8x8] image and my first convolution layer has a kernel (filter) size of [3x3]; should the next layer's input size be [6x6], assuming no zero padding? Then if I do max pooling of size [2x2] right after the first convolution, should the next layer be [3x3]?
I was told that at the end of the network I should have a 1x1 pixel (hyper pixel) but I am unsure why we should have this.
• My input is an image [8x8] and my first convolution layer has a window kernel size of [3x3] should the next layers input size look like [6x6]? assuming no zero padding.
Yup, $H \times W$ conv on $N \times M$ input ⇒ $(N-H+1) \times (M-W+1)$ output

Then if I do maxpooling of size [2x2] right after the first convolution the next layer should look like $3 \times 3$?
Only if it's non-overlapping, i.e., with stride 2 (aka $[2 \times 2]/s2$)

I was told that at the end of the network I should have a [1x1] pixel (hyper pixel) but I am unsure why we should have this.
I don't know how to put it into better words, but just do what makes the most sense. There isn't really a 'one-size-fits-all' approach.

• Victoria: illustrated in  this pdf
• Intuition of 1X1 convolutions as final layer of a convnet instead of FC (and possible pseudo/theano code) please?
Hi, I am not able to understand how a [1x1] convolution can replace a FC Layer in the top layer of a ConvNet. Can anyone please help?
• I'm assuming you have read network in network [arXiv:1312.4400] and perhaps this explanation?
• Yep they are good resources for understanding the 1x1 convs, as far as I'm aware the Network in Network paper was the first to really leverage them in a way that drew attention(maybe?).
• Is there any conclusive work on whether or not fully convolutional networks are superior?
I'm referring to networks where the final feature map is [1x1] and directly connected to the softmax. Also is there any comparison in such networks regarding pooling versus 2 stride conv?
• I think that [1x1] convs on [1x1] feature maps is exactly like a FC. A CNN with pooling and one with stride = 2 are not exactly the same, as pooling chooses the best (likely max) feature, whereas with a stride of 2 it would just ignore 3/4 of the features.
Edit: note that convolutional layers can be applied to different input sizes; that's not possible with FC layers.
• I think that [1x1] convs on [1x1] feature maps is exactly like a FC. A convolution layer with an [NxN] filter on an [NxN] input (with "valid" border mode) producing a [1x1] output is exactly the same as a fully-connected layer.
• I'm also curious what is the difference on ConvLayer + MP vs strided Conv in general.
• This paper arXiv:1412.6806 is the most thorough examination of the topic I know.
• Partially connected per block instead of fully connected layers?
When speaking of neural networks, I mostly read about conv layers and fully connected layers. Conv layers are just FC layers with local connections and weight sharing. For unstructured data, weight sharing doesn't make sense, but can we still use local connections to reduce the memory and computational cost? FC layers are prohibitively expensive, as the number of operations grows as the square of the number of neurons. It seems natural to me to try to do it by block, to force many 0s in the weight matrix and reduce the computational cost. Why don't I see papers about that topic? Am I just ignorant? If yes, can you provide me some links or keywords?
• I've seen local connectivity in neural networks applied to graphs, so it seems to be used when the data is structured in a way that can be taken advantage of. Local connectivity is less general than full connectivity, so you better have a good reason for using it. There are several ways to induce sparsity, most commonly L1 weight regularisation. Once you have a sparse connectivity matrix (or even otherwise) you can use various heuristics to trim connections and hence reduce the computation.
• Yes, but let's say I have a $1024 \times 1024$ layer and I replace it by four $1024 \times 1024$ layers with 1/8th of connections, with randomised order after the end of each layer. It will still be able to link all inputs together as you have several layers.
• Intuitively, I'd rather let the network learn which connections to lose, rather than impose anything from the start. But I also haven't come across papers doing what you seem to be suggesting, so maybe it's worth trying out and seeing if it works?
• Have a look at "DeepFace" [pdf] by Taigman et.al where they propose a "locally connected" layer for face recognition: "like a convolutional layer they apply a filter bank, but every location in the feature map learns a different set of filters." Assuming faces are aligned in the image, it seems like a waste to use the same set of high-level filters for describing eyes/noses/mouths/etc. So for this application it makes sense to use such a layer.
• Quick question: how are convnet-filters applied on previous layers?
How are the filters of a convnet applied on previous layers, i.e. how many feature maps do I have after stacking convnets of n1, n2, n3, ... number of filters? Will n2 filters of layer 2 be applied to each feature map resulting from layer 1? Example: Input has 1 channel (grey scale), followed by 3 convnets with 8, 16, 32 filters respectively. So the first convnet layer yields 1x8 feature maps, the 2nd convnet layer 1x8x16 feature maps and the 3rd convnet layer 1x8x16x32 features maps?
• 3 convnets with 8, 16, 32 filters:
layer: 8 feature maps
layer: 16 feature maps
layer: 32 feature maps
They aren't multiplied.
• How exactly does this result come up? If you have 8 feature maps from layer 1 and 16 kernels, which kernels get applied to which feature map?
• Let's say the kernels in layer 2 are 3x3. Each unit/kernel in layer 2 receives input from every unit in its receptive field in layer 1, i.e., 3 x 3 x 8 = 72 inputs. The kernels are connected to every feature map in the previous layer.
• The key point missing is that the kernels are not actually [3x3], they are [N x 3x3], where N is the number of feature maps in the previous layer.
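
A quick way to confirm this feature-map and kernel-shape arithmetic is to print a small model summary; a Keras 2 sketch with an arbitrary 28x28 single-channel input:

from keras.models import Sequential
from keras.layers import Conv2D

model = Sequential()
model.add(Conv2D(8,  (3, 3), input_shape=(28, 28, 1)))   # kernels are [1 x 3x3]  -> 8 maps
model.add(Conv2D(16, (3, 3)))                             # kernels are [8 x 3x3]  -> 16 maps
model.add(Conv2D(32, (3, 3)))                             # kernels are [16 x 3x3] -> 32 maps
model.summary()
# Parameter counts: 8*(3*3*1+1)=80, 16*(3*3*8+1)=1168, 32*(3*3*16+1)=4640,
# i.e., each layer has n_filters feature maps; the counts are not multiplied together.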

## CNN - MISCELLANEOUS

• Best practices when doing deep learning for computer vision projects?
I'm trying to train an image classifier on ~2 million labeled images. The images are scenic, with 20 distinct labels. I'm quite familiar with the concepts behind state-of-the-art techniques in image classification (CNNs and the like) but lack the practical experience. I have some questions:

1. How do we handle different image sizes? As I understand it, the first layer of the network will receive the input. Do we have to rescale all the train/test images to the same size first?

2. I'm not sure about the correctness of the labelling of all 2 million images. What is a good way to check the labels with acceptable accuracy? Do people usually just use a pre-trained model and compare the labels? Or is manual checking / Mechanical Turk the best way? Is there anything else I should check before using the data to train a model?

3. I've heard good things about Caffe for vision research projects. But how easy is it to productionize? How does it compare to TensorFlow?

• I'm not an expert, but here's what I've done or seen done in these situations:

1. How do we handle different image sizes? As I understand it, the first layer of the network will receive the input. Do we have to rescale all the train/test images to the same size first?

• The most common approach is to rescale then crop. If you have a 640×480 image and your network accepts $120×120$ input images, you would rescale to $160×120$, then choose a $120×120$ crop from that. The crop is usually taken from the center of the image, since images tend to be centered on the subject.
Another approach is to rescale, compute your classification for every $120×120$ window of the image (this is referred to as a "fully convolutional network" in the literature), then take the average. For example, if you have a $640×480$ input image, first rescale it to $160×120$, then use a $120×120$ sliding window to get a $40×1×CLASSES$ output. Take the mean over the first two dimensions to get your final classification.

3. I've heard good things about Caffe for vision research projects. But how easy is it to productionize? How does it compare to TensorFlow?

• My experience is limited to Theano, TensorFlow, and a tiny bit of Torch. TensorFlow has TensorFlow Serving, which is meant for model evaluation (as opposed to training) in production.
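
A minimal sketch of the rescale-then-center-crop step described in the first answer above (PIL; the 160x120 → 120x120 sizes follow the example, and 'example.jpg' is a placeholder path):

from PIL import Image

def rescale_and_center_crop(path, scale_to=(160, 120), crop=120):
    img = Image.open(path).resize(scale_to)                  # e.g. 640x480 -> 160x120
    w, h = img.size
    left, top = (w - crop) // 2, (h - crop) // 2
    return img.crop((left, top, left + crop, top + crop))    # central 120x120 crop

# patch = rescale_and_center_crop('example.jpg')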

• Is there a better option than the sliding window method for object detection on images? [reddit]:

• Proposals + Fast R-CNN is kind of the standard now: Ren S (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv:1506.01497  |  GitHub

• You Only Look Once [YOLO] for multiple objects, at the Darknet site [Redmon J (2015) You only look once: Unified, real-time object detection. arXiv:1506.02640  |  YOLO website]; CNNs more generally.

• How important is the LRN layer with GoogLeNet? The original network description (v1 paper) does not include this, but the Caffe version seems to include it (link). However, the v3 (link) and v4 (link: pdf) network descriptions also seem not to have it?

• Unimportant, it's fallen out of favor.

• Per-image whitening of RGB images: is it useful during CNN training, or useless?
I'm training a CNN and doing some data augmentation and pre-processing before feeding the network a batch of images. I've seen that image whitening is sometimes used (e.g., in the TensorFlow CIFAR-10 tutorial) and sometimes not (e.g., in Inception training). Karpathy's CNN course contains some info on image whitening: he says that image whitening and PCA are not used in CNNs. So why is it used in the cifar10_inputs.py file?
The UFLDL wiki says that image whitening is useful only when working on grayscale images and not on colour images. ... So, is it better not to do image whitening when working with color images? An additional question: some models need input rescaled to [0,1], others to [-1,1]. Can someone explain this to me (or provide a link)? Thank you.
• Relevant section from UFLDL: "For large images, PCA/ZCA based whitening methods are impractical as the covariance matrix is too large. For these cases, we defer to 1/f-whitening methods. (more details to come)." AFAIK, this is why people pretty much stopped doing it when things moved on from CIFAR-10/MNIST. And with recent stuff like batch normalization and better random initializations, I don't think it has much benefit any more, even when computationally feasible.
• You can do local image whitening. In my experience, it helps a lot. Simple algorithm: For each pixel, take pixels in a radius and average them. Then take the average covariance between the center pixel and the other pixels. Then the whitened value of the pixel in the center is related to this covariance (I use an inverted exponential formula). It's not "exact" whitening, but achieves very similar results. OP: I have some Python example code if you need it.
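
For reference, a NumPy sketch of simple per-image standardization (an approximation of the per-image whitening used in the TensorFlow CIFAR-10 input pipeline: subtract the per-image mean and divide by an adjusted standard deviation), not the local-radius variant described just above:

import numpy as np

def per_image_standardization(img):
    img = img.astype(np.float32)
    mean = img.mean()
    # adjusted std avoids division by ~0 for nearly uniform images
    adjusted_std = max(img.std(), 1.0 / np.sqrt(img.size))
    return (img - mean) / adjusted_std

# x_white = per_image_standardization(x)   # x: an HxWxC image array
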
• Viewing attention outputs over an image?
My attention model computes probabilities over the filter outputs of the last convolutional layer. After multiple convolution and max pooling layers, the original image starts out as 224x224 and ends up as 14x14. As such, the attention model outputs 196 probabilities. I'm wondering if there's an easy way to upscale the probabilities to the original image so I can view the attention outputs, or would you simply have to undo all the layers? What do most researchers do?
• "I'm wondering if there's an easy way to upscale the probabilities to the original image so I can view the attention outputs" << Yes, it's called "resizing an image"... really, just get the 14x14 output of the attention layer and upscale to the original input size.
• People do this because it's easy and the result looks good, but the receptive field of each position in the 14x14 layer is almost certainly quite a bit bigger than the upscaled image implies unless your network has some pretty aggressive pooling, which means the result you get is misleadingly precise.
Better options include:
1. Compute the true receptive field for the attended location.
2. Generate a saliency map for the attended location as in arXiv:1312.6034
3. Use one of the interpretable forms of visualization like guided backprop arXiv:1412.6806
• What are the preprocessing steps required before feeding images into a CNN?
I need to identify very specific features in an image, like the neckline of a dress, as well as the color and other specific features. The images are also of various sizes. I was thinking of resizing all images to same size, perhaps highlighting the edges. What other things do I need to do?
• Generally the only things you need to do are resize all the input images to have the same dimensions and perform mean subtraction. In the past people have also used whitening methods such as PCA/ZCA, but this does not seem to be necessary.
Before you go using a convnet to solve this problem you will need to make sure you have a lot of labelled training data. If you don't have a lot of training data then you can try transfer learning with a pretrained network. In this case you should start with whatever preprocessing was performed when training the original network, and then see what improvements you can make from there.
• And approximately, what should be the minimum size of my training data?
• I guess there is no straight answer. This will depend on the number of free parameters you need to train in your network and on your task.
• If you have much less than the one million images in ImageNet then you might consider reusing an existing pretrained ImageNet model, and just training the last FC [fully-connected] layer or two on your own data.
• [u/alexmlamb] Divide by 3 because you have 3 channels.
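
A hedged sketch of the transfer-learning suggestion above (Keras applications API; the VGG16 base, head sizes, and 20-class output are assumptions, not a prescription):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

# Load an ImageNet-pretrained base and freeze it; only the new head is trained.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
out = Dense(20, activation='softmax')(x)        # e.g. 20 classes of your own
model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(...) on your own labelled images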

## CNN - POOLING

• Convolutional Neural Network (CNN) - is the pooling layer essential?:
u/hansolav91: I am building a convolutional neural network to classify human activities. The input is acceleration data from two sensors, and I'm feeding the network raw acceleration data. The input format is 6x100, where the rows are the different axes of the two sensors and 100 is the number of data points (100 Hz, so 1 second). Since each convolutional layer will produce feature maps of size 6x100 or 1x94 (using a kernel size of 6x6 with shape setting "same" or "valid"), the output of the max-pooling layer will be very small. I am therefore wondering if the pooling layer is essential or if it doesn't matter. One of the benefits of pooling layers is of course complexity, but I can't grasp the other reasons. Do you have any thoughts?
• u/benanne: It sounds like you actually want a 1D convolution with 6 input channels. Convolving over the channel axis seems like a strange thing to do. Then you can also use 1D pooling regions. But to answer your question more generally, no, pooling layers are not crucial. You can also use convolutions with strides larger than 1, as demonstrated in this paper: arXiv:1412.6806.
• u/hansolav91: Yeah. I don't have much experience with CNNs other than image recognition, that's why I used 6x6. But yeah, it does sound better using a 1D conv! I will look into the paper. Thanks!
• u/Dr_Vlad: Regarding the 1D/2D convolution over the channel axis, 2D is not so strange I think. A 2D kernel will capture information across different channels, i.e. how they simultaneously relate to each other. For example, let's say the net learns a 2D filter that is activated when the first 3 channels all contain peaks and the other 3 channels have 0 amplitude - such filter represents how all the channels relate to each other. If you use 1D filters you have to hope that this relationship between channels will be captured by the net at later layers, possibly requiring more parameters/layers than a net using 2D convs.
I hope what I wrote is clear :-S
A long time ago I applied a very shallow CNN (conv → dense → dense) with 2D convolution to EEG data (the size of the 1st-layer conv filters was nr_of_channels by X), and this net consistently outperformed the same architecture with 1D convs. But again, this is most likely a matter of not having enough learning capacity (i.e., layers/parameters) in the 1D CNN to capture relationships between channels efficiently.
• u/benanne: In the case of EEG, there is some kind of topological structure among the channels, i.e. you can meaningfully order them and adjacent channels are approximately "equidistant", according to some broad notion of distance. So then the prior imposed by a convolution makes sense: you may want to detect the same pattern in different groups of adjacent channels.
But in this case, the 6 channels are just x y and z for two different sensors. Their ordering is fairly arbitrary, and not meaningful. That implies that said prior doesn't make sense here. What could make sense is to have locally connected layers, but without shared parameters. But since the number of channels is so small (only 6), I can't think of a good reason not to use a fully connected structure in the channel dimension.
I realise now that "sounds like a strange thing to do" is not a very helpful way of phrasing things :) I hope this clarifies my line of thinking!
• One reason is that max pooling operations give you some translation invariance. BTW, why aren't you using 1D kernels? These are time series... 2D does not make much sense here
• u/hansolav91: 1D kernels do actually make more sense now! Thanks!
• In Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction they mention that pooling was essential to build robust features. Usually pooling helps make features more invariant and reduces dimensionality, if this is not important to you, try without and see how it goes. From what I've understood pooling can also be replaced with strided convolutions.
• u/hansolav91: Yeah, that's what I thought it was. I am not really sure yet whether robustness will play an important part; I will have to try out both. I will also look into strided convolutions. Thanks!

• Is gradual pooling no longer the preferred architecture? [Oct 2016]
It used to be that ConvNets would use 2x2 pooling every few layers: the number of channels would grow, while the dimensions of the image would decrease all the way to 4x4 or less. However, now, with the Xception architecture, I'm seeing lots of pooling in the beginning (299x299 -> 18x18), no pooling in most of the network, and lots of pooling at the end (18x18 -> 1x1). Is there a justification for this?

• Everyone trying to find the "best" convnet is wasting their time, in my opinion. Tune your hyperparameters, do a random search over a reasonably large architecture family, and call it a day. Many of these things don't generalize across datasets the way you would hope. Sure, most of them will do something reasonable, but go solve problems! Go try new and difficult things! Find problems existing techniques don't work on and figure out why. This herd mentality is dangerous. In many cases you can get many very different looking neural nets to work well.

• Yeah, that's basically the preferred architecture right now, pooling isn't really as needed anymore to lower dimensions.

• Riedmiller's team scoffed at max pooling years ago: Springenberg JT [Riedmiller M] (2014) Striving for simplicity: The all convolutional net. arXiv:1412.6806

• Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.

• Xception uses pooling extensively: only 1 out of ~70 of its convolutions has stride = 2.
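
For reference, the two downsampling options being compared in this thread (a 2x2 max pool vs. a stride-2 convolution, as in arXiv:1412.6806), in minimal Keras form; filter counts and input size are arbitrary:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D

# Option A: conv followed by 2x2 max pooling
a = Sequential([Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(32, 32, 3)),
                MaxPooling2D((2, 2))])

# Option B: all-convolutional downsampling: stride-2 conv, no pooling layer
b = Sequential([Conv2D(32, (3, 3), strides=2, padding='same', activation='relu',
                       input_shape=(32, 32, 3))])

print(a.output_shape, b.output_shape)   # both halve the spatial dimensions: (None, 16, 16, 32)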

## COST FUNCTIONS

• What cost function when multi good answers?
When we have one good answer, we use a one-hot vector, with a single 1. We define the cost function as the cross-entropy of the softmax vector with the good answer. What do we do when we have 2 or more good answers? Do we define the last layer as sigmoid and use the cross-entropy of the sigmoid vector with the good-answer vector containing several 1s? If there are several things present at the same time in my input, what is best:
• Like ImageNet, do we say what the most important thing is, use softmax during training, and consider the other stuff "irrelevant noise"?
• Use it as several training examples, with a single different good answer for each of the objects in the data?
• List all the stuff inside and use a sigmoid during training?
• Yes, you can just use a many-hot vector rather than one-hot, and take the "cross-entropy," -dot product(many-hot, log(output with probabilistic interpretation)). This will push your output vector to assign equal probability to each of the correct answers.
There's an important decision here - if you use a target vector with several 1s, you are learning as if each good answer is a separate thing to update on - if an example has more good answers, it matters more, because each good answer is the thing that matters. If you use a target vector that is normalized so that its contents sum to 1, then you're learning as if the examples are what matter, no matter the number of answers.
If you don't care about assigning high value to all the good answers, and instead just want any good answer, then cross entropy is not what you want - you need something linear for that. The obvious thing to do is to try to minimize -log(sum of 'probabilities' for all the good answers).
• List all the stuff inside and use a sigmoid during training?
• This is the most sensible thing, unless there really is truly one right answer, in which case you should just use it as a target.
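
A minimal sketch of the sigmoid/many-hot option discussed above (Keras; sizes and data are illustrative placeholders):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

n_classes = 5
X = np.random.randn(200, 20)
Y = np.random.randint(0, 2, size=(200, n_classes))   # many-hot targets: several 1s allowed per row

model = Sequential()
model.add(Dense(64, activation='relu', input_dim=20))
model.add(Dense(n_classes, activation='sigmoid'))     # independent per-class probabilities
model.compile(optimizer='adam', loss='binary_crossentropy')  # cross-entropy per output
model.fit(X, Y, epochs=5, batch_size=32, verbose=0)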

## CROSS-VALIDATION

• How to calculate accuracy in cross-validation?
I have a classification problem consisting of two classes. I have around 10,000 data points and 20 features. I'm doing nested 10-fold cross-validation. I am unsure about calculating the accuracy. I see two possibilities for calculating the balanced accuracy:

1. Calculating the balanced accuracy for each test fold. This will give 10 balanced accuracy values. Then I can take the mean and standard deviation.

2. Collecting the predicted labels from the test folds. In the end I have a vector of true labels and a vector of predicted labels. From this I can calculate the confusion matrix and the balanced accuracy.
Currently I'm doing the second one. Is this ok? If yes, how can I calculate the standard deviation? Based on the confusion matrix I'm also calculating other measures such as precision, recall etc.

• Your #1 leads to a more biased estimate, so you want to do #2. See this paper [pdf] for details.

• Anyone can correct me if I'm wrong, but I believe the two things you are describing are identical. This is an analogy, but #2 would be like taking the mean of 100 numbers, whereas #1 would be like grouping the 100 numbers into blocks of 10, taking the average of each, and then averaging those averages.
Also, I don't know what standard deviation you are referring to. I don't believe it's common (or helpful) to report the standard deviation of accuracy scores on each fold of an n-fold cross-val. And with method 2 there isn't really any logical standard deviation to report.
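
The two options can be written down directly (scikit-learn; placeholder data, and the inner model-selection loop of nested CV is omitted), which also makes clear what each standard deviation would refer to:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X = np.random.randn(10000, 20)
y = np.random.randint(0, 2, size=10000)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Option 1: balanced accuracy per fold -> mean and std over the 10 folds
fold_scores = []
for train, test in cv.split(X, y):
    clf.fit(X[train], y[train])
    fold_scores.append(balanced_accuracy_score(y[test], clf.predict(X[test])))
print(np.mean(fold_scores), np.std(fold_scores))

# Option 2: pool all out-of-fold predictions, then compute one balanced accuracy
pred = cross_val_predict(clf, X, y, cv=cv)
print(balanced_accuracy_score(y, pred))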

• RNN language models: No cross validation?
I always see language model datasets having a train set and a dev set (validation). What are the reasons this is done, rather than doing n-fold cross-validation on the data?

• n-fold cross-validation is a more efficient way to use your data. Since it is usually easy to obtain lots of data for language modeling, there is not much need to be efficient with the validation data. This could be the reason why you don't see it.

• There is also the issue of time; to perform n-fold cross-validation one has to train the model n times. When data is abundant, it is more time-efficient to train and test only once (while sacrificing a small part of the dataset) than to run the same model n times over.

PAPERS:

• Arlot S & Celisse A (2010) A survey of cross-validation procedures for model selection [pdf; 40 pp]  |  reddit

## DEBUGGING

• [Question] Computing output error in neural network

• Debugging machine learning

...
For instance, my general debugging strategy involves steps like the following:

• First, ensure that your optimizer isn't the problem. You can do this by adding "cheating" features -- a feature that correlates perfectly with the label. Make sure you can successfully overfit the training data. If not, this is probably either an optimizer problem or a too-small-sample problem.

• Remove all the features except the cheating feature and make sure you can overfit then. Assuming that works, add features back in incrementally (usually at an exponential rate). If at some point things stop working, then you probably have too many features or too little data.

• Remove the cheating features and make your hypothesis class much bigger; e.g., by adding lots of quadratic features. Make sure you can overfit. If you can't overfit, maybe you need a better hypothesis class.

• Cut the amount of training data in half. We usually see test accuracy asymptote as the training data size increases, so if cutting the training data in half has a huge effect, you're not yet asymptoted and you might do better to get some more data.
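
A tiny sketch of the "cheating feature" check from the first step above (scikit-learn; the model and data are placeholders):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.randn(500, 10)
y = np.random.randint(0, 2, size=500)

# Append a feature that is literally the label. Any working optimizer/model should
# now be able to (over)fit the training data almost perfectly.
X_cheat = np.hstack([X, y.reshape(-1, 1)])

clf = LogisticRegression(max_iter=1000).fit(X_cheat, y)
print("train accuracy with cheating feature:", clf.score(X_cheat, y))   # expect ~1.0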

The problem is that this normal breakdown of error terms comes from theory land, and, well, sometimes theory misses out on some stuff because of a particular abstraction that has been taken. Typically this abstraction has to do with the fact that the overall goal has already been broken down into an iid/PAC style learning problem, and so you end up unable to see some types of error because the abstraction hides them.

In an effort to try to understand this better, I tried to make a flow chart of sorts that encompasses all the various types of error I could think of that can sneak into a machine learning system. This is shown below:

I've tried to give some reasonable names to the steps (the left part of the box) and then give a grounded example in the context of ad placement (because it's easy to think about). I'll walk through the steps (1-11) and try to say something about what sort of error can arise at that step.

1. In the first step, we take our real world goal of increasing revenue for our company and decide to solve it by improving our ad displays. This immediately upper bounds how much increased revenue we can hope for because, well, maybe ads are the wrong thing to target. Maybe I would do better by building a better product. This is sort of a "business" decision, but it's perhaps the most important question you can ask: am I even going after the right things?

2. Once you have a real world mechanism (better ad placement) you need to turn it into a learning problem (or not). In this case, we've decided that the way we're going to do this is by trying to predict clickthrough, and then use those predictions to place better ads. Is clickthrough a good thing to use to predict increased revenue? This itself is an active research area. But once you decide that you're going to predict clickthrough, you suffer some loss because of a mismatch between that prediction task and the goal of better ad placement.

3. Now you have to collect some data. You might do this by logging interactions with a currently deployed system. This introduces all sorts of biases because the data you're collecting is not from the final system you want to deploy (the one you're building now), and you will pay for this in terms of distribution drift.

4. You cannot possibly log everything that the current system is doing, so you have to only log a subset of things. Perhaps you log queries, ads, and clicks. This now hides any information that you didn't log, for instance time of day or day of week might be relevant, user information might be relevant, etc. Again, this upper bounds your best possible revenue.

5. You then usually pick a data representation, for instance quadratic terms between a bag of words on the query side and a bag of words on the ad side, paired with a +/- on whether the user clicked or not. We're now getting into the position where we can start using theory words, but this basically limits the best possible Bayes error. If you included more information, or represented it better, you might be able to get a lower Bayes error.

6. You also have to choose a hypothesis class. I might choose decision trees. This is where my approximation error comes from.

7. We have to pick some training data. The real world is basically never i.i.d., so any data we select is going to have some bias. It might not be identically distributed with the test data (because things change month to month, for instance). It might not be independent (because things don't change much second to second). You will pay for this.

8. You now train your model on this data, probably tuning hyperparameters too. This is your usual estimation error.

9. We now pick some test data on which to measure performance. Of course, this test data will only tell you how well your system will do in the future if it is actually representative of future data. In practice, it won't be, typically at least because of concept drift over time.

10. After we make predictions on this test data, we have to choose some method for evaluating success. We might use accuracy, f measure, area under the ROC curve, etc. The degree to which these measures correlate with what we really care about (ad revenue) is going to affect how well we're able to capture the overall task. If the measure anti-correlates, for instance, we'll head downhill rather than uphill.

(Minor note: although I put these in a specific order, that's not a prescriptive order, and many can be swapped. Also, of course there are lots of cycles and dependencies here as one continues to improve systems.)

Some of these things are active research areas. Things like sample selection bias/domain adaptation/covariate shift have to do with mismatch of train/test data. For instance, if I can overfit train but generalization is horrible, I'll often randomly shuffle train/test into a new split and see if generalization is better. If it is, there's probably an adaptation problem.

When people develop new evaluation metrics (like BLEU for machine translation), they try to look at things like #10 (correlation with some goal, perhaps not exactly the end goal). And standard theory and debugging (per above) cover some of this too.

## ERROR

• [Question] Computing output error in neural network

• Softmax outputs a normalized probability distribution, i.e., the outputs sum to 1, and training with a cross-entropy loss forces the network to match the given target distribution (usually a one-hot encoded target vector in the simplest case).
When your network is supposed to solve a regression problem, the usual thing is to use a linear last layer and train by minimizing a squared-error loss function: $(\text{target} - \text{output})^2 = (\text{output} - \text{target})^2$.

• The reason you see some sources using (target - output value) and some using (output value - target) is that it ultimately doesn't matter. We only care about knowing how different our network's output is from the target we would like the output to be. Notice that the network's loss function, such as the standard mean squared error (MSE), is a symmetric distance measure, meaning basically that it's as far travelling from London to Paris as it is going from Paris to London. The loss function is what you minimize to get your network's outputs as close to the targets as possible, so a short distance is good. Your (softmax and cross-entropy) "layer" follows the same principle, although the math might look a little more complicated.

## DEPTH

• 7-Layer Deep Fully Connected Network not giving sufficient accuracy (regression)
I am trying to fit some numerical data with a 7-layer deep (each layer 128 wide) neural network but am only achieving 0.15 MSE with 10M data points. The activation function is ReLU and I'm using weight decay. Any suggestions on how to improve the performance? I've already tried widening and deepening the network and seen no improvement. [u/metakone]

• What kind of accuracy do you get with other techniques? Have you tried normal Regression, k-NN or SVMs? (If speed is an issue, try subsampling.)

• [u/metakone]: I thought that a deep-nn would give me the best performance; especially because I have a lot of data. Is that incorrect?

• Sure, but you don't know how "solvable" a problem is - how much signal there is in the input. If regression gets 0.45 MSE loss, 0.15 for the neural net is actually very good. If regression is 0.15 too, you know there really is a problem somewhere.

• This is where my mind goes. Do you know that the variables you're looking at explain the data? It's possible there are complicating factors that keep you from seeing the trends, or that the trends you're looking for are simply not strongly predicted by the data. In my experience it's best to start with a small subset of data and do EDA to make sure that there is some basis for the expectation that your problem is solvable given the data.

• There is much more to ML than just Neural Networks. In fact, they're one of the more complex and computationally hard approaches, and there are tons of hyperparameters you need to choose. Which is why it's a good idea to try something simpler and easier to use first: SVMs and Random Forests should be your go-to tools before diving into NNs just because they're hyped. Yes, NNs will sometimes give you better results. But more often than not, the improvements (if there are any) are negligible, especially when compared to the time it takes you to fiddle with hyperparameters. If nothing else, they will give you a good baseline so you know if your results are actually good or not.

With that said, on 10M datapoints SVMs are kind of tricky because the most common implementations have a hard time scaling to that amount of data. Which is why I suggested subsampling. RFs on the other hand might still work out fine.

• Try to make it shallower. Seriously. There may be some optimization difficulties with really deep FC networks. Also, use Batch Normalization / dropout if you aren't already.

• [u/metakone] I've tried 2-layer also but didn't get better performance. In fact, the convergence was slower with 2-layer vs 7-layer. Also, I'm not sure dropout applies for me because I won't have missing features (it's not image data). I'll try batch normalization.

• How high-dimensional is the data? How does linear regression perform? Have you tried a triangular shape for the net - i.e. not (128, 128, 128, 128, 1), but (128, 64, 32, 16, 4, 1)? Is the input normalized? If some variables have e.g. a lognormal distribution, that might fuck things up.

• [u/metakone] I thought the triangular shape might be derived automatically because the unused weights will decay. I tried it anyways just now with no significant change in accuracy. The data is normalized to a mean of 0, std of 1. Input dimensions = 7; output dimensions = 5

• From your replies you don't know much about NN and you're assuming a lot of stuff. No one is going to be able to fully educate you in a reddit post, you need to go back to basics and learn your shit to ever have a hope of making this work.

• Try not using weight decay? Regularization should only be added after you're already getting good results on the training data.

• What's your training error like? You have 5 outputs; do you use different weights on all layers for each output? You might consider sharing some if it makes sense. If computational resources are an issue, do some cross-validation on ~1M points using a few layers. Try other activation functions (tanh, softsign), regularization strength (dropout, L2), and hidden layer size (it's OK to keep the same size everywhere, I think). Then add layers if it improves performance.

• Start with a simple model with fewer and smaller layers, then increase its size to see how performance varies with complexity. Also consider treating the number and size of layers like hyperparameters to tune with some automatic method.

• Use fewer layers. DON'T use ReLUs - that assumes your function is non-smooth, and it can severely degrade regression performance.

• What is the importance of depth in recurrent neural networks?

• This paper has a good investigation on adding depth to recurrent functions.

• Have a look at these:

• There was a detailed discussion here (reddit) about how depth (and skip connections) function in RNNs.

## FEATURES-RELATED

• Learning using one very strong feature and many weaker ones?
Given a learning problem (mine is regression, but classification is equally applicable) where one feature is a very strong predictor, while the others are weak, how does one make use of the weaker features? I've noticed that every time I attempt to add other, weaker features to the strong one, regardless of regressor type (gradient boosting, random forests, regression tree, linear regression, SVR), the results just get worse on the test set. Does anyone know of any literature about combining heterogeneous strength features for regression/classification?
• Several options: train a model on only the weaker features and store out-of-fold predictions for the train set - now you have another very strong feature. Or randomly drop out the stronger feature for certain samples, forcing the model to look at the weaker features (a small sketch follows below). Or use a classifier that does not always exploit the strongest feature, like Extremely Randomized Trees.
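A minimal sketch of the "randomly drop the strong feature" idea (the column index and drop fraction are arbitrary, illustrative choices):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))      # column 0 plays the role of the strong feature
strong_col = 0

# Blank out the strong feature for ~30% of training rows so the model is
# forced to learn something from the weaker columns as well.
X_train = X.copy()
mask = rng.random(len(X_train)) < 0.3
X_train[mask, strong_col] = 0.0      # or the column mean, if 0 is not a "neutral" value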

• Quick question: What does label and feature of sample mean?
Label of sample | Features of sample | Can you ELI5 and give an example?
• Basically what you said. Though it is less common to preprocess data to retrieve features. Artificial neural networks are all the hype because they can learn these features themselves from raw data. They also do a better job at finding good features in most domains than humans ever could.
• The label is the name of some category. If you're building a machine learning system to distinguish fruits coming down a conveyor belt, labels for training samples might be "apple", "orange", "banana". The features are any kind of information you can extract about each sample. In our example, you might have one feature for colour, another for weight, another for length, and another for width. Maybe you would have some measure of concavity or linearity or ball-ness.
Machine learning tries to learn how to guess a label when all we have are some features. Usually it does this by looking at a bunch of training samples where we know the labels ahead of time, so we can learn what the features for each category look like and how the categories' features differ from one another.
If all you know about a fruit is its colour, then a red fruit is likely an apple, and a yellow one is probably a banana.

• Li J (2016) Feature Selection: A Data Perspective. arXiv:1601.07996
• Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing high-dimensional data for data mining and machine learning problems. The objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and preparing clean, understandable data. ... In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. ... we also present an open-source feature selection repository that consists of most of the popular feature selection algorithms.
• scikit-feature feature selection package:
The feature selection repository is designed to collect some widely used feature selection algorithms that have been developed in feature selection research, to serve as a platform for facilitating their application, comparison and joint study. The feature selection repository also effectively assists researchers to achieve more reliable evaluation in the process of developing new feature selection algorithms. We developed the open-source feature selection repository scikit-feature in one of the most popular programming languages - Python. It contains more than 40 popular feature selection algorithms, including most traditional feature selection algorithms and some structural and streaming feature selection algorithms. It is built upon the widely used machine learning package scikit-learn and the scientific computing packages NumPy and SciPy.

• Non-Mathematical Feature Engineering techniques for Data Science  |  reddit
This is the first sentence in a Google-internal document I read about how to apply ML. And rightly so. In my limited experience working as a server/analytics guy, data (and how to store/process it) has always been the source of most consideration and impact on the overall pipeline. Ask any Kaggle winner, and they will always say that the biggest gains usually come from being smart about representing data, rather than using some sort of complex algorithm. Even the CRISP data mining process has not one, but two stages dedicated solely to data understanding and preparation. So what is Feature Engineering?

Simply put, it is the art/science of representing data in the best way possible. ...

• Python Scikit Learn: For each classification that I do, how do I get to know which features in my text helped it to pick that class?
After I have trained my classifier, I can get the most important features for each class using the feature_importances_ attribute of the RandomForest classifier. But for each prediction I perform, I want to know the most important features that helped it pick that class. How do I do that?
• By definition, the feature importances from the Random Forest tell you the features that contributed most to classifying each record. I suppose we should ask the question: Why do you want the feature importances for individual records instead of the entire data set? That may help us answer your question.
• Basically, I am hoping to get a multi-class + multi-label model. The class the model predicts will form my multi-class part, and the most important features for each individual record will form the multi-label part. Each of my individual records just consists of unlabeled text. I want to identify its class, and then based on its most informative features, I am hoping to get some labels for the text of my record. I hope I could explain my requirement clearly.
• It sounds like you're using a one-vs-rest classification scheme for multiclass classification, which I believe sklearn does by default. If you want to do that by hand, then you need to make several copies of your data set: one copy for each class. Process each copy such that it's a binary classification task: 1 if it's the particular class the data set is for, 0 if it's any of the other classes. Then fit a model on each of those data sets, and the feature importances will tell you the features that played the most important role in differentiating that class from the rest of the classes.
• Yes, sorry, I am trying 2-3 approaches, one of which is using a one-vs-rest classifier. However, the feature importances only give me the most important features found while training on the data set. I want to know the most important features in my test data, at prediction time, found while making each prediction.
• I haven't tried this yet, but it seems to answer your question: Interpreting random forests. Summary: "There is a very straightforward way to make random forest predictions more interpretable, leading to a similar level of interpretability as linear models - not in the static but dynamic sense. Every prediction can be trivially presented as a sum of feature contributions, showing how the features lead to a particular prediction."
• First remark: to classify text documents, you probably use Bag of Words features, possibly with TF-IDF re-weighting. In that case the features are very high-dimensional and sparse. In my experience, linear models such as logistic regression are always at least as accurate and much, much faster to train than random forests on this kind of data; I would therefore recommend them instead.
To interpret the decision of a linear model, you just have to consider the product of the feature values (e.g. the non-zero TF-IDF weights of the words that make up your documents) by the matching weights of the linear model. Those weights are stored in the coef_ attribute of a fitted linear model in scikit-learn.
At this time, scikit-learn random forests do not expose a way to introspect which features are most relevant for the classification of an individual sample. The current feature importances only summarize the relative importance of features for the aggregate classification of all the samples in the training set.
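A hedged sketch of that linear-model interpretation on a toy text problem (documents and labels are made up; get_feature_names_out() is the newer scikit-learn spelling, older versions use get_feature_names()):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap pills buy now", "meeting agenda for tomorrow", "buy cheap meds", "lunch tomorrow?"]
labels = [1, 0, 1, 0]                       # toy spam/ham labels

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

# Per-document explanation: contribution of each term = tf-idf value * model weight.
doc_idx = 0
row = X[doc_idx].toarray().ravel()
contrib = row * clf.coef_.ravel()
terms = np.array(vec.get_feature_names_out())
top = np.argsort(contrib)[::-1][:3]         # three terms pushing hardest toward class 1
print(list(zip(terms[top], contrib[top])))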

• Removing features greatly increases performance of NN model?
• I'm doing a multi-regression problem with real-world messy data (i.e. we can't 100% trust all the features to be accurate). I'm using a NN that minimizes cross-entropy error, and I've found that removing some features greatly improves performance. E.g., with feature A in the dataset, the model seemingly gets stuck at a local minimum and doesn't reach above 7-8% accuracy (consistently over many runs), whereas removing the feature allows the network to achieve 65-70% accuracy.
• This is a phenomenon I've noticed across a variety of models (for example linear regression, even with $L1$ regularization) and problems. Random Forest is the only algorithm that comes to mind as being truly robust to this problem in practice.
• I understand how simple linear regression could get confused by useless features (by either not being predictive of the target variable or by being highly correlated with other features), but I was under the impression that a more complex model like a NN, or even something simple like $L1$ regularization, would be more robust to useless features, and learn to ignore said feature, at least theoretically. I also understand the benefit of removing these types of features, even in complex models, the idea being that you shouldn't make the problem harder than it needs to be, and shouldn't make the model learn that it's redundant or useless if you know it is. What I don't get is why this happens to begin with.
• What are the properties of a feature that would consistently degrade performance when present? For example, would a feature that is conditionally dependent on another variable that has bad or erratic data cause this?
• This sounds like it could possibly be a problem with the scale of feature A. If the max/min of your features varies considerably, this can cause the type of issue you are describing. The fact that RF still performs well supports this too as RFs are invariant to scale. Try subtracting the mean and dividing by the standard deviation (i.e. z-scoring/scaling) for each variable and see if it helps.
• D'oh - this helped. Guess I was overthinking it, thanks! Are neural networks not invariant to scale? I would think having a squashing function as an activation function would make scale of variables more or less unimportant after the input layer. Is this wrong?
• I don't know any learning algorithm except the tree-based ones that would be scale invariant. However, some optimization algorithms are more robust than others. In any case, gradient descent, for example, is pretty sensitive to having features roughly normally distributed and centered around 0 (i.e., standardized). The reason is that your weights won't update "equally" otherwise. I think it becomes clear if you think about the derivative of your cost function with respect to a weight:

$\text{gradient} = \sum_i \big[ (\text{target}_i - \text{activation}_i) \cdot \text{feature value}_i \big]$

• Is there any theoretical justification why zero mean / unit variance is superior to [0,1] scaling? Or is it just dependent on the individual case?
• The problem with 0-1 scaling (in the case of gradient descent) is that you don't center around zero but around 0.5. You then don't have any negative values, and you get the problem of unequal weight updates that I mentioned above. I mean, it still works well in practice, but it may take a tad longer to train compared to standardized features.
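A small numpy illustration of the two scalings being compared (lognormal toy data; sklearn's StandardScaler / MinMaxScaler perform the same per-column arithmetic):

import numpy as np

X = np.random.lognormal(mean=3.0, sigma=1.0, size=(1000, 5))   # skewed, large-scale features

# z-scoring: zero mean, unit variance per column
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# [0, 1] min-max scaling: bounded, but not centered at zero (no negative values)
X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))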

• houses, metrics
I have two buckets of houses - ones that sold in <= X days, and ones that sold in > X days. Given that I know the various parameters of each (price, location, # bedrooms, date built, etc.), what techniques are there to predict which category a new house would fall into?
As the title says, really - the houses have a number of metrics associated with them, but they are related, so I think a naive Bayes would fail (e.g. price/location/num bedrooms/construction date/amount of land are all correlated to some degree). My aim is to represent those houses as objects like:
{
    'bedrooms': 4,
    'location': 'AreaOfCity',
    'construction_date': '1950-01-01',
    'days_to_sale': 44,
    'day_listed_for_sale': '2016-02-15',
    'price': 600000,
    ...
}

Then to pass in a new house, and determine whether it looks most like houses in the 'sold in <= X' category or the 'sold in > X' category, ideally with a % prob of being in the first and a % prob of being in the second.
• This looks like a classic classification problem. Your biggest challenge is going to be encoding your features into a form that most classification algorithms can use. After doing that I imagine something like XGBoost or random forests or similar would do well on your dataset.
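A hedged sketch of that pipeline with pandas + scikit-learn (the toy rows and the X = 60-day threshold are illustrative; real records would include all the fields above):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy frame standing in for the house records above (column names are illustrative).
df = pd.DataFrame([
    {"bedrooms": 4, "location": "AreaOfCity", "price": 600000, "days_to_sale": 44},
    {"bedrooms": 2, "location": "OtherArea",  "price": 350000, "days_to_sale": 120},
    {"bedrooms": 3, "location": "AreaOfCity", "price": 500000, "days_to_sale": 30},
])

X_DAYS = 60
y = (df["days_to_sale"] <= X_DAYS).astype(int)          # 1 = sold within X days
X = pd.get_dummies(df.drop(columns="days_to_sale"))     # one-hot encode the categoricals

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

new_house = pd.get_dummies(pd.DataFrame([{"bedrooms": 3, "location": "AreaOfCity", "price": 450000}]))
new_house = new_house.reindex(columns=X.columns, fill_value=0)   # align columns with training data
print(clf.predict_proba(new_house))                     # [P(sold in > X days), P(sold in <= X days)]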

• Iandola FN (2016) SqueezeNet: AlexNet level accuracy with 50x fewer parameters and <1MB model size. arXiv:1602.07360  |  ultra-compact DNN  |  GitHub  |  GitXiv
• We propose a small DNN architecture called SqueezeNet, that achieves [AlexNet](https://github.com/BVLC/caffe/tree/master/models/bvlc_alexnet)-level accuracy on [ImageNet](http://image-net.org/) with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to <1MB (461x smaller than AlexNet).

• Determining size and number of hidden layers
What is a good method for figuring out how large to make each hidden layer within a neural network, and how many hidden layers to include overall in a model?
• Same as any other knob on any other ML thing. Try a bunch of options. Use a validation set. If someone has done something similar to the task you're trying to train on, start with their architecture and tweak from there.
• You can just learn it! "Learning the Architecture of Deep Neural Networks" arXiv:1511.05497

• Srinivas S (2015) Learning the Architecture of Deep Neural Networks. arXiv:1511.05497
• Deep neural networks with millions of parameters are at the heart of many state-of-the-art machine learning models today. However, recent works have shown that models with a much smaller number of parameters can also perform just as well. In this work, we introduce the problem of architecture-learning, i.e., learning the architecture of a neural network along with the weights. We introduce a new trainable parameter called tri-state ReLU, which helps in eliminating unnecessary neurons. We also propose a smooth regularizer which encourages the total number of neurons after elimination to be small. The resulting objective is differentiable and simple to optimize. We experimentally validate our method on both small and large networks, and show that it can learn models with a considerably small number of parameters without affecting prediction accuracy.

• Which model to use for multiclass classification with vectors?
Hi, I am working on a classification problem, but I have trouble with finding the right algorithm. My data consists of integer vectors, which contain 60 values. Each vector belongs to a class, there are 12 classes altogether. I want an algorithm where I can dump my training vectors and class labels, and later input new vectors, and the algorithm would tell which class they are.
• Have you tried multinomial logistic regression? Mathematically, it's equivalent to a single neural network dense layer that outputs to 12 output units with a softmax activation. You'd just have to encode your class labels as one-hots (i.e., a column vector of size 12×1 that's all 0s except a 1 at the true label). I've recently fallen in love with Keras, so, to give you a quick example from it (probably not working code... I've just woken up and am sleepy):

from keras.layers import Input, Dense   # Keras functional API

vec_in = Input((60,))
vec_out = Dense(12, activation='softmax')(vec_in)

... then train with a categorical cross-entropy loss; in this context it is equivalent to the negative log probability of the true class. Check out keras.io for more. Note: you can probably do this with almost any algorithm from scikit-learn as well. What have you tried?
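Continuing the snippet above, a hedged sketch of the compile/fit step (the dummy data and training settings are illustrative):

import numpy as np
from keras.models import Model
from keras.utils import to_categorical

model = Model(vec_in, vec_out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Dummy stand-ins for the real 60-value integer vectors and their 12 class labels.
X = np.random.randint(0, 10, size=(500, 60)).astype('float32')
y = to_categorical(np.random.randint(0, 12, size=500), num_classes=12)   # one-hot labels
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.1)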
• All the models I'm aware of can do this. Have you tried anything yourself? We can't do your work for you. A tip I can give you, though: try random forests first, because they have few parameters to tune and perform quite well out of the box.
• Yes, I have tried several models :D, I just wanted to ask you guys whether there is something special for this. I tried random forest, but it really underperforms (at least after tuning the number of parameters).
• How many trees did you grow? (Random Forest performance generally improves, then plateaus, as the number of trees grows.) Are these features maybe hand-engineered? Are these text-based features? I've never had a case where random forests underperform, really - but it's not impossible.
And no, this should be straightforward. You have a small number of features, which is good. SVM is always a reference (though if you use scikit-learn, try LinearSVC first; it is essentially the same). Naïve Bayes has always worked pretty well for me - nothing great, but OK (be sure to use the proper probability distribution: Gaussian, Bernoulli or multinomial; for you I'd guess Gaussian, but that's only a guess!). If these don't work well you should see if you can improve the features, whether the classes are correct, and whether your training set is valid (consistent with the classes? does each class have enough training examples, and so on?). Then you should try ANNs, even fancy things like convnets; they performed well for me, at least for word2vec-based features with a small number of training examples (2k). You can fiddle much more with ANNs than with any other model, so be sure to give them a shot. For the simple cases these things turn out to be kind of equivalent anyway (see generalized linear models).

• [very good!]  Approaching (Almost) Any Machine Learning Problem

• An average data scientist deals with loads of data daily. Some say over 60-70% of the time is spent in data cleaning, munging and bringing data to a suitable format such that machine learning models can be applied to that data. This post focuses on the second part, i.e., applying machine learning models, including the preprocessing steps. The pipelines discussed in this post come as a result of over a hundred machine learning competitions that I've taken part in. It must be noted that the discussion here is very general but very useful; there are also much more complicated methods in existence, practised by professionals. ...

• Victoria: great article  [local copy (pdf)]! Touches on data preparation but mainly/most significantly a rational approach to designing, applying and optimizing ML algorithms.

• Cites:  Thakur A (2015) AutoCompete: A Framework for Machine Learning Competition. arXiv:1507.02188  |  pdf

• In this paper, we propose AutoCompete, a highly automated machine learning framework for tackling machine learning competitions. This framework has been learned by us, validated and improved over a period of more than two years by participating in online machine learning competitions. It aims at minimizing human interference required to build a first useful predictive model and to assess the practical difficulty of a given machine learning challenge. The proposed system helps in identifying data types, choosing a machine learning model, tuning hyper-parameters, avoiding over-fitting and optimization for a provided evaluation metric. We also observe that the proposed system produces better (or comparable) results with less runtime as compared to other approaches.

• Basic Linear Regression problem
Hello, I am a total beginner in Machine Learning ... I have the following question. As we know, the basic linear regression estimator is $\small \hat{\beta} = (X^T X)^{-1} X^T Y$. I have tried this for $\small y = 2x + 5$. For example (the first column is the ones vector; imagine that we give the numbers $\small 1, 2, 3, 4$ for $x$ and get $\small 7, 9, 11, 13$ for the $y$ values). [x-posted here (StackExchange)]

$X = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \quad y = \begin{bmatrix} 7 \\ 9 \\ 11 \\ 13 \end{bmatrix}$

When I solve this, it gives $\small 2$ and $\small 5$, which are the parameters of $\small y = 2x + 5$. There is no problem up to here. My question starts when I use this for, let's say, $\small y = X_1 + X_2 + 5$. When I use it for this equation, I cannot get the parameters $1, 1, 5$ as a solution. The inputs I use are as follows (I give $\small 1, 2, 3, 4$ for $\small X_1$ and $\small 2, 3, 4, 5$ for $\small X_2$ and get $\small 8, 10, 12, 14$ for $\small Y$; again, the first column is the ones vector):

$X = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 2 & 3 \\ 1 & 3 & 4 \\ 1 & 4 & 5 \end{bmatrix}, \quad y = \begin{bmatrix} 8 \\ 10 \\ 12 \\ 14 \end{bmatrix}$

• OK, have you figured it out? If not, spoiler warning. Your method for the second example is fine; the problem lies with your dataset. If you look at $X$, you can see that column 3 can be expressed as the sum of columns 1 and 2. The columns are not linearly independent, and this breaks the linear regression ($X^T X$ becomes singular, so the closed-form formula cannot invert it).

Why is this bad? Because in terms of finding the coefficients of the function, there are now infinitely many solutions. For example, your dataset suggests that the function $y = 2 X_1 + 6$ is an equally good solution; but so is $y = X_1 + X_2 + 5$, and so is $y = 2 X_2 + 4$, and so is $y = -X_1 + 3 X_2 + 3$.

Now if you add an example, such as $X_1 = 10$, $X_2 = 10$, $Y = 25$, then the method will find the coefficients that you are looking for. How to avoid this in the future? Look at the rank of the matrix $X$: if it drops below the number of coefficients that you are looking for (3 in this case), then you know you need another (linearly independent) example. I hope this was helpful.
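A small numpy check of the rank argument above (the added example X1=10, X2=10, y=25 comes from the answer; lstsq is used because the $(X^T X)^{-1}$ formula fails outright on the rank-deficient matrix):

import numpy as np

X = np.array([[1, 1, 2],
              [1, 2, 3],
              [1, 3, 4],
              [1, 4, 5]], dtype=float)
y = np.array([8, 10, 12, 14], dtype=float)

print(np.linalg.matrix_rank(X))        # 2 < 3 coefficients -> columns are linearly dependent

# lstsq still returns *a* solution (the minimum-norm one), just not necessarily (5, 1, 1):
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)

# Adding the example that breaks the dependency (X1=10, X2=10, y=25) restores rank 3:
X2 = np.vstack([X, [1, 10, 10]])
y2 = np.append(y, 25)
print(np.linalg.matrix_rank(X2))       # 3
beta2, *_ = np.linalg.lstsq(X2, y2, rcond=None)
print(np.round(beta2, 6))              # [5. 1. 1.]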

• How to know which algorithm/approach to use and when?
Hello, I had a question regarding picking the right algorithm or approach for the job (specifically supervised learning, although I'd assume the question could abstractly apply to others too). Given that there are so many types of algorithms to solve problems with datasets, how do you typically know which approach is best? I understand that they all have their pros and cons and it's just a matter of understanding their strengths and weaknesses, but sometimes I do not even recognize what is viable and what isn't until I've already started coding an implementation. For example, I was recently trying to use a simple linear regression algorithm on a problem I was working on, but halfway through I realized that neural networks would work better. I feel that had I recognized and made the connection that a neural network would be better suited, I would have saved myself a lot of time. Part of my issue is that I do not consciously take into consideration each approach that I have learned. I was thinking maybe I could make a master list of the approaches I've learned and use it as a cheat sheet, but that doesn't really solve my issue; it acts too much as a crutch for me.
Also, leading off of this is the other part of my question: if I pick an algorithm less suited for the task, it may not be as efficient or optimized, but given enough tuning and tweaking would it be able to work as well as a more suitable algorithm? (Obviously two drastically different approaches would not work very similarly regardless of tweaking and tuning, but if I've already started one approach to the problem and I realize partway through that there's a better way to do this, what should I do...? (As I'm typing this I realize that this is very situational, but any tips would be helpful nonetheless.))

TIPS, TRICKS:

General Advice - CNN - Images:

• [Help] How to determine how many images needed for my CNN dataset?
Hi - for what it's worth, I'm pretty new to CNNs. As the title suggests, I'm not sure how many images are needed for my CNN dataset. I plan to use GoogLeNet pretrained on ImageNet and fine-tune it for head-pose classification (8 classes, each spanning 45 degrees, plus 1 background class). I read "if your dataset is big", but the thing is I don't know how to determine whether my dataset is big or small. Is 50,000 images big? Any input, critique, or advice is most appreciated. Thanks in advance!

• The general idea is to divide your dataset into training and test subsets and plot accuracy for each as a function of increasing training set size. If you are overfitting you'll see accuracy leveling off with increasing training set size, with training accuracy notably above test accuracy. If you are underfitting you won't see such a difference between training and test accuracy, and accuracy overall will be worse.

Overfitting means either your dataset is too small (too easy to fit) or your model is too large (it's memorizing the dataset rather than abstracting it). Regularization may help.

Underfitting means your model is too small/simple for the function you are trying to learn.

So ... try it and see. Keep increasing dataset size until you are no longer overfitting. If you are still overfitting and can't add more data, then try adding regularization (although that may hurt performance on other categories).
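A minimal sketch of that train-vs-test learning curve, with a scikit-learn stand-in for the real model and data (names and sizes are illustrative; substitute your CNN and image dataset):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data/model; in practice X, y are your images/labels and the model is your CNN.
X = np.random.rand(5000, 20)
y = (X[:, 0] + 0.1 * np.random.randn(5000) > 0.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

for n in [250, 500, 1000, 2000, 4000]:
    clf = LogisticRegression().fit(X_train[:n], y_train[:n])
    tr = clf.score(X_train[:n], y_train[:n])
    te = clf.score(X_test, y_test)
    print(f"n={n:5d}  train acc={tr:.3f}  test acc={te:.3f}")
    # Large persistent gap (train >> test): overfitting -> more data / regularization.
    # Both low and close together: underfitting -> bigger model or better features.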

• Thx, I have better picture now about what to do. I was hoping there was a specific magic number where a dataset can be considered big.

• Calculate how many parameters are in your network and consider how noisy your data is. You want your data (with augmentations) to outnumber your # of parameters, first of all. Then, the number of examples per class depends on how precise you want to be: if you want to differentiate breeds of similar dogs, you want more data than, say, a dog-vs-car classifier. For head poses the model needs to be robust against facial appearance - there was a Kaggle competition a while back for regressing facial keypoints using a CNN; see how big their dataset was to give you an idea.

Transfer learning is also a great way of training a network on a small dataset. Take a popular ImageNet-trained network (e.g. from Caffe's Model Zoo), replace the fully connected layers, and make sure your dataset is larger than the # of new parameters in the FCs.

• My general rule of thumb for CNNs is to have at least 1000 training examples per class to have any hope of creating a generalizable model. That's a bare minimum though; more is always better.

I have a question about a few models I've seen but haven't understood completely. If you're classifying sequences of varying lengths then it seems reasonable to (1) embed sequence elements in a vector space (2) perform convolutions to capture local structure (3) feed the sequence of convolutional activations to an RNN to capture longer-range interactions. Are there any good papers or tutorials which discuss these kinds of models in detail? Specifically I'm hoping to answer some of the following:

1. Are the convolutions typically 1D (applied along only one embedding dimension) or 2D (mixing all the components of embedding vectors)?

2. Is there any reason to preserve convolutional activations which overlap only part of the sequence (i.e. border_mode="same")?

3. Is it useful to add explicit start and stop symbols to each sequence (which would then have their own embedded representations)?

4. If using convolutions of multiple widths with a "valid" border mode then the number of timesteps from each convolution size will differ. How can these activation sequences of differing lengths be used as inputs to the same RNN layer? Should each conv size have its own downstream RNN, with a merge of outputs afterward?

5. Purely an implementation detail, but how can convolutional RNNs be implemented in Keras when the convolution classes don't support masking (and thus seem to require that every input be of the same maximal length)?

• Sorry to not contribute anything productive, but can someone explain to me the difference between convolutional networks and recurrent convolutional neural networks?

• Regular convolutional networks take a fixed input size, like other normal machine learning approaches, and produce a fixed output size. For example, they solve the problem: is this picture a cat?

Recurrent neural networks can take a fixed input size or a sequence of inputs, and output a fixed output size or a sequence. The same holds for recurrent convolutional neural networks: they can take in a sequence of images and output a single output (or a sequence of outputs), for example to classify a variable-sized video; or they can take a single image and output a variable-sized sequence of text, for example to annotate images.

• Thanks for the clear explanation! I'm still trying to completely understand it...So hypothetically, if I want to train for example a deep learning system which can recognize gestures in a video, recurrent convolutional neural networks are the way to go, right? Basically the convolutional part will be able to find the pattern within each frame and the recurrent part will be able to figure out the temporal pattern in between the frames of the video?

• Correct :) If you want to recognize one gesture, you would have a single classification layer at the end (probably softmax); if you want to consistently keep spewing out recognized gestures, you would have sequence-in, sequence-out.

• Not a tutorial, but this paper feeds 1D convolutions (with multiple filter widths too) into an RNN and gets great results, which sounds similar to what you want to do: Kim et al. (2016). I believe their code is online so you can see how they do it (but written in Torch). Another paper with just a CNN is: Zhang et al. (2016) [arXiv:1509.01626], which might help with figuring that part out -- it should be pretty clear how to combine CNNs and RNNs (in the way you are describing) once you understand both. For CNNs generally, I would check out cs231n, which should answer your question about the overlap-type of convolutions.

• ResNet vs. Highway Networks, when to use which?
For what data and problems is it better to use highway networks instead of a non-trainable skip connection like those used by ResNet architectures?

• Try a middle ground. Scalar gated (not tensor gated) skip connections. For me, these have worked very well. I think there was even a paper about them recently? (might be mistaken on that though).

• ResNets are better for images and Highway Networks for other modalities, I think. Greff et al.'s recent Unrolled Iterative Estimation paper [arXiv:1612.07771] should provide you the right pointers for your problem   (<< large file; opens in new tab).

• Quick explanation of Gradient Boosting?
I understand the general concept of basic ensemble methods like random forests but don't get how Gradient Boosting provides a different approach. I know GBM is a whole family of different algorithms, but can anyone give me a quick rundown on the idea of "Gradient Boosting"? I particularly don't understand the sequential addition of models to the ensemble.

• It's pretty easy:

• You start with some predictions - could be all zero, could be some other function you provide (like a best initial guess).

• Then, you specify a loss function - could be squared error, cross-entropy, whatever you like - and take its derivative analytically.

• You then take the current predictions and fit a new model to the derivative of the cost function with respect to the current value of f(x) (NOT with respect to x). You can use a linear function, a decision tree, whatever you want.

• Now, you take a step kinda like you're doing gradient descent - but instead of changing the PARAMS like new_params = old_params - learning_rate * derivative, it's new_function = old_function - learning_rate * derivative.

• This is the hardest part to understand; the derivative in this context means "for a given x, we have f(x) currently, but it'd be better if f(x) were slightly different: f(x) plus some adjustment g(x). We'll try g(x) = - learning_rate * d_cost(f(x)) / d_f(x)".

• It is about making the cost smaller by making small adjustments to the prediction function values directly, as opposed to making small adjustments to some coefficients which indirectly produce the prediction function values. So, interestingly, when you predict later, you are running each sample through its history of function gradient steps, then summing them all up to arrive at the final prediction. That's it in a nutshell.

• You fit a model.
You compute the residuals.
You fit a model to the residuals.
Repeat.

• You don't fit the residuals; you fit the gradient of the cost function with respect to the previous model's value at each sample. This is a key component of gradient boosting; and it is not the same as fitting the residuals, except in the case of squared error.
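A minimal from-scratch sketch of that loop, using squared error so the negative gradient coincides with the residuals (dataset and hyperparameters are made up):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

learning_rate, n_rounds = 0.1, 100
F = np.zeros_like(y)                      # initial prediction f_0(x) = 0
trees = []
for _ in range(n_rounds):
    # Negative gradient of 0.5*(y - F)^2 w.r.t. F is simply the residual (y - F).
    # For another loss, substitute its derivative here; the rest of the loop is unchanged.
    neg_grad = y - F
    tree = DecisionTreeRegressor(max_depth=2).fit(X, neg_grad)
    F += learning_rate * tree.predict(X)  # new_function = old_function + lr * fitted step
    trees.append(tree)

def predict(X_new):
    # Sum each sample through its history of function gradient steps.
    return sum(learning_rate * t.predict(X_new) for t in trees)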

• XGBoost: A Scalable Tree Boosting System: Scalable, Portable and Distributed Gradient Boosting (GBDT, GBRT or GBM) Library, for Python, R, Java, Scala, C++ and more. Runs on single machine, Hadoop, Spark, Flink and DataFlow.

• Wang Y (2016) Random Bits Forest: a Strong Classifier/Regressor for Big Data. pdf  |  reddit  |  reddit

• Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N > 1000). The algorithm also performed highly in testing with an independent data set, a real psoriasis genome-wide association study (GWAS).

• Batch Size:

• Variable batch-size in mini-batch gradient descent
It's generally recommended to use some fixed mini-batch size for mini-batch gradient descent, say from 32 to 256. Has anyone tried a dynamic mini-batch size - in other words, selecting some random mini-batch size for each iteration?
• Not random, maybe, but steadily increasing it over time works (a small sketch follows below). It has a similar effect to lowering the learning rate over time, i.e., it's a stabilizing factor later on in learning when you're close to convergence. Larger batch sizes mean less variance per batch, which in turn means less "noisy" gradients.
• I think increasing the batch size from small to large might make sense.
• Yes, this is a well-known idea from way back for reducing the variance in your gradients once you are close to a minimum.
• There's some theory on this: arXiv:1104.2373
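A bare-bones sketch of the growing-batch-size idea (no actual model; the growth rule and numbers are illustrative):

import numpy as np

n_samples = 100_000
batch_size, max_batch = 32, 256

for epoch in range(20):
    idx = np.random.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        batch = idx[start:start + batch_size]
        # ... compute gradients on `batch` and update the weights here ...
    # Grow the batch size over time: larger batches -> lower-variance ("less noisy")
    # gradients, similar in effect to decaying the learning rate near convergence.
    batch_size = min(batch_size * 2, max_batch)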

• Learning Rate:

• What is your heuristic to decrease the learning rate of a CNN?
When you train a convolutional neural network and use stochastic gradient descent as its optimizer, you usually have three hyperparameters to tune: the learning rate [ α ], the momentum, and the weight decay. I know a few heuristics for decreasing the learning rate. The first is to tune a fourth hyperparameter (a learning-rate decay) which decreases the learning rate at each time step, such as each minibatch or each epoch. The second is to set different values for the learning rate (also called regimes) that you instantiate manually. I would like to know if you use different heuristics based, for instance, on the deceleration of the loss or the train/test accuracy.
• Not sure how other people do it, but here's my strategy:
1. Start with 1.0, run it and see if it diverges. (It probably will.) If so, divide by 10 and start over.
2. Once you have the largest learning rate of the form 10^N that converges, run that for a while. When it stops getting better, decrease the learning rate by a factor of 10. (A sketch of this, plus a plateau-based variant, follows below.)
• I think those optimization methods (RMSprop and so on) compute a kind of per-weight (and per-bias) learning rate based on some heuristic. The thing is, their memory footprint is higher than simple momentum SGD. Also, I've seen various graphs where those methods converged faster, but the best accuracy was obtained using SGD with momentum. And finally, you have to understand them thoroughly if you want to fine-tune a CNN; I'm not sure whether dividing the initial learning rate by a factor of 10 for the convolutional layers only would be sufficient when using Adam, for instance.
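Two hedged sketches of those heuristics using Keras callbacks (the schedule constants and patience value are illustrative):

from keras.callbacks import LearningRateScheduler, ReduceLROnPlateau

# Heuristic 1 - fixed "regimes": start at 0.01 and divide by 10 every 30 epochs.
def step_decay(epoch):
    return 0.01 * (0.1 ** (epoch // 30))

schedule_cb = LearningRateScheduler(step_decay)

# Heuristic 2 - watch the validation loss and divide the LR by 10 once it stops improving.
plateau_cb = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, verbose=1)

# Use one or the other (combining them would make them fight each other):
# model.fit(X_train, y_train, validation_data=(X_val, y_val), callbacks=[plateau_cb], ...)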
• Learning rate schedule and regularization for embedding layer?
With SGD training, on every mini-batch the embedding matrix only receives a partial gradient (for the words that appear in the batch), so most word embeddings get updated only every several iterations. This is very different from other parameters, like conv filters, that get updated much more frequently, e.g. every iteration. There are two questions:
(1) is it better to use different learning rate schedule for embedding layer that is different from other frequently-updated layers?
(2) is it better to regularize over only used words every mini-batch, or is it better to regularize over the whole embedding matrix every mini-batch?

## IMAGE CLASSIFICATION

• Any off the shelf framework for multi label image classification?
Has anyone solved this problem? Using Caffe/ Torch or any other framework?
• And would I have to give my input layer format differently? And I may be asking too much, but could you guide me towards one such code?
• Check out examples in Scikit Flow (https://github.com/tensorflow/skflow/tree/master/examples). There are examples for this. Hope this helps.
• https://github.com/jetpacapp/DeepBeliefSDK and https://github.com/dmlc/mxnet/tree/master/example/cpp/image-classification [←←  Victoria: dead link, 2017-Sep-12; see perhaps this?] provide pretty easy ways to do this.
• related: How do I do multilabel image classification in TensorFlow?
I saw something like multi output. Is it possible to do multilabel classification in TensorFlow? Using skflow?
• You can have multiple outputs with sigmoid activation (one per label) and compute binary cross entropy or any other suitable cost function for each output, then sum/average over all outputs.
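A minimal Keras sketch of that recipe - one sigmoid output per label trained with binary cross-entropy (layer sizes are illustrative; for images you would put your convolutional stack in front of the final Dense layer):

from keras.models import Sequential
from keras.layers import Dense

N_FEATURES, N_LABELS = 128, 5            # illustrative sizes

model = Sequential([
    Dense(64, activation='relu', input_shape=(N_FEATURES,)),
    Dense(N_LABELS, activation='sigmoid'),   # one independent sigmoid output per label
])
# Binary cross-entropy is computed per output and averaged, so several labels can be "on" at once.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# y_train is a {0,1} matrix of shape (n_samples, N_LABELS), e.g. [1, 0, 1, 0, 0] per row.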

• Best practices when doing deep learning for computer vision projects?
I'm trying to train an image classifier on ~2 million labeled images. The images are scenic, with 20 distinct labels. I'm quite familiar with the concepts behind the state-of-the-art techniques in image classification (CNNs and the like) but lack the practical experience. I have some questions:

1. How do we handle different image sizes? As I understand it, the first layer of the network will receive the input. Do we have to rescale all the train/test images to the same size first?

2. I'm not sure about the correctness of the labelling of all 2 million images. What is a good way to check the labels with acceptable accuracy? Do people usually just use a pre-trained model and compare the labels? Or is manual checking / Mechanical Turk the best way? Is there anything else I should check before using the data to train a model?

3. I've heard good things about Caffe for vision research projects. But how easy is it to productionize? How does it compare to TensorFlow?

• I'm not an expert, but here's what I've done or seen done in these situations:

1. How do we handle different image sizes? As I know, the first layer of the network will receive the input. Do we have to rescale all the train/test image to the same size first?

• The most common approach is to rescale then crop. If you have a 640×480 image and your network accepts 120×120 input images, you would rescale to 160×120, then choose a 120×120 crop from that. The crop is usually taken from the center of the image, since images tend to be centered on the subject.
Another approach is to rescale, compute your classification for every 120×120 window of the image (this is referred to as a "fully convolutional network" in the literature), then take the average. For example, if you have a 640×480 input image, first rescale it to 160×120, then use a 120×120 sliding window to get a 40×1×CLASSES output. Take the mean over the first two dimensions to get your final classification.
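A minimal rescale-then-center-crop sketch of the first approach, using Pillow (the 640×480 → 160×120 → 120×120 numbers follow the example above; the function name and target size are illustrative):

from PIL import Image

def rescale_and_center_crop(path, target=120):
    img = Image.open(path)
    w, h = img.size
    scale = target / min(w, h)                        # shorter side -> target
    img = img.resize((round(w * scale), round(h * scale)))
    w, h = img.size
    left, top = (w - target) // 2, (h - target) // 2  # central crop
    return img.crop((left, top, left + target, top + target))

# e.g. a 640x480 image is rescaled to 160x120, then the central 120x120 patch is kept:
# patch = rescale_and_center_crop("some_image.jpg")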

3. I've heard good things about Caffe for vision research projects. But how easy is it to be productionized? How does it compare to Tensorflow?

• My experience is limited to Theano, TensorFlow, and a tiny bit of Torch. TensorFlow has TensorFlow Serving, which is meant for model evaluation (as opposed to training) in production.

• Defining image similarity using pre-trained convnet activations
I recently implemented code to find similarity between images using a pretrained convnet (Inception v3). I noticed that the $L2$ distances between activation values (at the next-to-last layer) of the images are tough to interpret. E.g. the distance between an American Staffordshire Terrier and a running Maltese was 18.68863, and the distance between the same terrier image and a flower was 19.86305. Shouldn't these values be further apart? I know I'm giving equal weight to each neuron when calculating $L2$, and a trained softmax layer on top would look at these neurons in a weighted way - but is there a way to better use these pre-trained nets for similarity finding?
• We usually remove the mean and divide by the $L2$ norm for each neuron to put each point on a unit hypersphere. We then use cosine distance, but since the norms are always 1, you just have to do a dot product between the vectors you want to compare (see the sketch below). This way, you can find similar items very fast! That's what they do for OpenFace, for example.
• If you want to go further, you can take a look at the hubness/orphan problem, for example
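A minimal numpy sketch of the normalize-then-dot-product recipe above (random arrays stand in for the real next-to-last-layer activations):

import numpy as np

# feats: next-to-last-layer activations for the image collection (n_images x n_units);
# q: activations of the query image (n_units,).
feats = np.random.rand(1000, 2048)
q = np.random.rand(2048)

mean = feats.mean(axis=0)                                        # per-neuron mean over the collection
feats_n = (feats - mean) / np.linalg.norm(feats - mean, axis=1, keepdims=True)
q_n = (q - mean) / np.linalg.norm(q - mean)                      # same normalization for the query

# On the unit hypersphere, cosine similarity reduces to a dot product:
sims = feats_n @ q_n
print(np.argsort(sims)[::-1][:5])                                # indices of the 5 most similar images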
• related: Andrej Karpathy, @karpathy (twitter, Mar 11, 2016): "not-widely-enough-known-protip: Do not use $L2$ loss (regression) in neural nets unless you absolutely have to. Softmax likely to work better."

• How many times should I iterate over my data when training CNNs?
I have half a million labeled images and am trying to build a deep CNN architecture for classification. How many times should I feed each image to the network? For example, with a batch size of 128 (is that too big?) and 8,000 steps I will iterate over the data approximately twice. Is that approach correct?
• You should have a training and a validation set. You train on the training set and then validate on the validation set at the end of each epoch; 0.90/0.10 is a good train/validation ratio. Then you observe the training loss and the validation loss. If the training loss continues to go down at some point but your validation loss goes up, it means you have overfit, and you have probably reached the end of training - at least with that learning rate. Sometimes you can divide the learning rate by 10 and the validation loss will start to go down again. So just keep training until you overfit.
• Don't forget to also have a small 5% test set that you set aside to double check that your validation score is indeed realistic.
• You will probably need a lot more updates than that, but it really depends on the problem and the dataset. Note that it doesn't really make sense to measure this using the size of the dataset: if your dataset were 10 times larger, you wouldn't need the same number of passes to get to convergence, as that would take 10 times as long and there would be 10 times as many weight updates. But you probably wouldn't be able to reduce the number of passes by a factor of 10 either! There is more data to learn from, so the learning problem has become a bit harder (but not 10x harder). My intuition is that there is no linear relationship between these things.
At any rate, 8000 steps is probably way too little for any modern neural net to converge. The number of update steps is a better measure for this, but it's not going to be consistent between different datasets (and different update rules) either. At least I can say that 8000 is probably the wrong order of magnitude :-)
The only correct answer is to try things out. Start training, and see where the curve starts to flatline. At that point, it can be worth it to reduce the learning rate by a factor of 10 (or some other factor) and continue training, if you really want to get the best performance you can get. You can do that a few more times. When the curve finally stops going down, you'll know how many steps you need.
Anecdotally, I've worked with some large datasets where one pass through the dataset was enough to get decent convergence. I think I ended up doing 3-4 passes to get the most out of it.
• Thanks for such a well structured answer, it really provides more insight as I am still an amateur. One more thing I would like to ask. Currently I am using GradientDescent optimizer (Tensorflow) with a decaying learning rate. Is there a rule of thumb as to what range of values the initial learning rate should take? Is 0.5 too much? Is 0.01 too little? Right now I am decaying every 1000 iterations, multiplying by 0.95. Is that a good starting point?
• Usually you start with 0.01, and then either divide by 10 once the loss plateaus for a while, or you can do what you are doing and decay it exponentially by a factor d (0.95 in your case).
• The right range for an initial learning rate depends on the type of network. Something on the order of 0.1 might be about right, but you need to experiment and see what works (if the rate is too high, the loss will soon start increasing rather than decreasing). Try increasing or decreasing the initial learning rate by factors of 10 to rapidly explore what works.
Another alternative is to use an adaptive weight-update method such as AdaDelta (or Adam, AdaGrad, RMSProp) rather than simple gradient descent, which avoids having to manually select the learning rate. These methods have their own hyperparameters, but often a single value will work for a large variety of nets; e.g. for AdaDelta the default 0.99 EMA decay rate works very well.
• What do you mean by '8000 steps'? Batch size is the number of training samples you pass over on each iteration. Different batch sizes can yield different speeds of convergence (depending on the optimization algorithm). An epoch is one full run through your training sample, i.e. 500000/128 ≈ 3906 batches (at a batch size of 128). So you usually talk about the number of epochs needed to make your CNN converge. But nobody can tell you how long this takes; it can be 10 epochs, 100, 1000.
• By 8000 steps I mean:
Feed images [0:128] (step 1)
Feed images [128:256] (step 2)
... (step 8000), etc.
• In that case, as I said, you'd rather talk about the number of epochs.
• As many as it takes.

## IMAGE SEARCH

• Computer Vision + TensorFlow: How do I train my model to find visually similar images?
I need to give it an image; it should return visually similar images ...
• In a NN, similar images have similar activations in the upper hidden layers. To handle large image collections with less coding, as you asked, you can download a NN pretrained on ImageNet. Then you take the hidden-layer activation vector of your input image and do a k-nearest-neighbour search against the activation vectors of all the other images in your dataset to get the most similar images. You have to choose which hidden layer to use based on your goal: the lower layers will give you images that look similar pixel-wise, and the higher layers will give you images that are semantically similar.
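A hedged sketch of that pipeline, assuming Keras' bundled ImageNet-pretrained InceptionV3 and scikit-learn's NearestNeighbors (the file paths are placeholders):

import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing import image
from sklearn.neighbors import NearestNeighbors

# Pretrained ImageNet network with the classifier head removed; global average pooling
# turns the last conv block into a single feature vector per image.
net = InceptionV3(weights='imagenet', include_top=False, pooling='avg')

def embed(path):
    img = image.load_img(path, target_size=(299, 299))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return net.predict(x)[0]

# paths = [...list of image files in your collection...]
# gallery = np.stack([embed(p) for p in paths])
# knn = NearestNeighbors(n_neighbors=5, metric='cosine').fit(gallery)
# dists, idx = knn.kneighbors(embed("query.jpg").reshape(1, -1))   # 5 most similar images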

## MEMORY-RELATED: GARBAGE COLLECTION ...

• Jupyter notebook: try the following (here 'tr' and 'df' are variables assigned to input CSV - see /mnt/Vancouver/Programming/Jupyter/notebooks/CSV data files.ipynb):
# FREE RAM MEMORY:
# https://stackoverflow.com/questions/15455048/releasing-memory-in-python
# https://docs.python.org/3/library/concurrent.futures.html
import concurrent.futures

with concurrent.futures.ProcessPoolExecutor(max_workers=1) as executor:
    future = executor.submit(pow, 323, 1235)
    # print(future.result())

# https://stackoverflow.com/questions/15455048/releasing-memory-in-python
import gc
gc.collect()

%xdel tr
%xdel df

#%reset -f
# https://ipython.org/ipython-doc/3/interactive/magics.html
# file:///mnt/Vancouver/shortcuts/files/ml-implementation_notes.html#Memory-GC
%reset -f array

• This might be a bug in Jupyter, but in Firefox, when I restart the (Python) kernel in Jupyter, it frees RAM. The code above had no effect!

• Pandas does not free memory? [StackOverflow]
• Releasing memory of huge numpy array in IPython [StackOverflow]
• How to clear variables in ipython? [StackOverflow]
• Jupyter Line magics:

%reset: resets the namespace by removing all names defined by the user, if called without arguments, or by removing some types of objects, such as everything currently in IPython's In[] and Out[] containers (see the parameters for details).

Parameters:

-f : force reset without asking for confirmation.

-s : 'Soft' reset: Only clears your namespace, leaving history intact. References to objects may be kept. By default (without this option), we do a 'hard' reset, giving you a new session and removing all references to objects from the current session.

in : reset input history

out : reset output history

dhist : reset directory history

array : reset only variables that are NumPy arrays

Examples:

In [6]: a = 1

In [7]: a
Out[7]: 1

In [8]: 'a' in _ip.user_ns
Out[8]: True

In [9]: %reset -f

In [1]: 'a' in _ip.user_ns
Out[1]: False

In [2]: %reset -f in
Flushing input history

In [3]: %reset -f dhist in
Flushing directory history
Flushing input history

Note:  Calling this magic from clients that do not implement standard input, such as the ipython notebook interface, will reset the namespace without confirmation.

%reset_selective:  resets the namespace by removing names defined by the user; input/Output history are left around in case you need them.
%reset_selective [-f] regex
No action is taken if regex is not included.

Options:

-f : force reset without asking for confirmation.

We first fully reset the namespace so your output looks identical to this example for pedagogical reasons; in practice you do not need a full reset:

In [1]: %reset -f

Now, with a clean namespace we can make a few variables and use %reset_selective to only delete names that match our regexp:

In [2]: a=1; b=2; c=3; b1m=4; b2m=5; b3m=6; b4m=7; b2s=8

In [3]: who_ls
Out[3]: ['a', 'b', 'b1m', 'b2m', 'b2s', 'b3m', 'b4m', 'c']

In [4]: %reset_selective -f b[2-3]m

In [5]: who_ls
Out[5]: ['a', 'b', 'b1m', 'b2s', 'b4m', 'c']

In [6]: %reset_selective -f d

In [7]: who_ls
Out[7]: ['a', 'b', 'b1m', 'b2s', 'b4m', 'c']

In [8]: %reset_selective -f c

In [9]: who_ls
Out[9]: ['a', 'b', 'b1m', 'b2s', 'b4m']

In [10]: %reset_selective -f b

In [11]: who_ls
Out[11]: ['a']

Note:  Calling this magic from clients that do not implement standard input, such as the iPython notebook interface, will reset the namespace without confirmation.

## INITIALIZATION; PARAMETERS; ...

Blogs:

• How to Best tweak hyperparameters for CNN accuracy?
I am currently building a CNN on some text data. My dataset seems to work out fine, since I get the same or slightly better accuracy compared to the well-known IMDB dataset. The goal is basically sentiment classification of short paragraphs (binary classes, that is). However, when I train the network for 5 epochs I get about 51% accuracy. I tried tweaking some stuff (embedding dimensions, learning rate, number of layers, training set size, vocab size, etc.), but the only thing that ramps up my validation accuracy is basically the number of epochs. For 15 epochs, for example, I get approximately 90% accuracy, but I fear that this might just be overfitting the network (although I do use dropout). Basically I am just jiggling stuff around, not really knowing what I am doing thus far. How would a sane person go about tweaking the network in a scientific manner? How do you guys do it? Any tips regarding:
• how many epochs is enough
• which algorithm to use
• tips for regularization
• dimensions for embedding/layers/etc
• general procedure to ramp up the classification accuracy without over fitting?
• It turns out that picking reasonable ranges and scalings (linear vs. log), sampling parameter combinations randomly, and then choosing the model with the best validation error works really well and is super easy to implement (a minimal random-search sketch follows this list). This is especially true if your models train quickly and you don't mind training 100 different models. The News on Auto-tuning
• This is another interesting option. [Optunity is a library containing various optimizers for hyperparameter tuning. ...]
• Victoria: Theano logistic regression example!
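A minimal random-search sketch in the spirit of the first answer above - log-uniform sampling for the learning rate and a placeholder for the real train-and-evaluate call (all parameter names and ranges are illustrative):

import numpy as np

rng = np.random.default_rng(0)

def sample_config():
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),   # log-uniform between 1e-5 and 1e-1
        "embedding_dim": int(rng.choice([64, 128, 256])),
        "dropout": rng.uniform(0.0, 0.5),
        "n_layers": int(rng.integers(1, 4)),
    }

best = (None, -np.inf)
for _ in range(100):                      # evaluate 100 random configs
    cfg = sample_config()
    # val_acc = train_and_evaluate(cfg)   # your training routine (hypothetical placeholder)
    val_acc = rng.random()                # stand-in so the sketch runs end to end
    if val_acc > best[1]:
        best = (cfg, val_acc)
print(best)                               # config with the best validation score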

• What are the current options for automatic hyperparameter fitting?
• You should check out Hyperband [writeup | demo].
• There's a library called TPOT that tries to fit hyper parameters with a genetic algorithm. More theoretically: I've heard Hinton talk about Monte Carlo search for hyperparams and recently I looked into Markov chain Monte Carlo (MCMC).
• The library is called TPOT and it outputs the pipeline in Python (using sklearn). From personal experience it seems to be pretty good at helping (~5% boost in accuracy) find good pipelines for classification problems, but I've had it actually make things worse with regression problems, and to be honest I've never had a good result there. It could just be that the data really needs domain knowledge to create a good pipeline, but after running 100 generations over a week the results were rather poor.
• TPOT is more focused on full-on automated machine learning than just optimizing hyperparameters of a model. It conducts both random search and genetic algorithm search, however in the author's supporting publication they found "that guided search did not outperform random search in this case" [arXiv:1601.07925]
• TPOT - Automatically creates and optimizes machine learning pipelines using genetic programming [automating biomedical data science through tree-based pipeline optimization] (abstract below).
• Olson RS (2016) Automating biomedical data science through tree-based pipeline optimization. arXiv:1601.07925  |  GitXiv
• Over the past decade, data science and machine learning has grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning---pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators---such as synthetic feature constructors---that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.
1. For Keras, there is https://github.com/maxpumperla/hyperas which is built on top of HyperOpt, widely used in general ML (esp: Kaggle)
2. For Torch there is https://github.com/Element-Research/hypero; however, this had some issues when I last used it, so do check.
There are other libraries on GitHub which do Bayesian optimization for parameter search: HyperOpt, Spearmint. I don't know if there is code for the Method of Auxiliary Coordinates (MAC), but according to its author it is a much superior technique to Bayesian optimization [arXiv:1212.5921].
• Very recently there is Optomatic [GitHub], of which I am the author. I originally wrote it for my own projects, and after a few months decided to release it to GitHub for others too. The basic idea is similar to hyperopt, i.e. use a central MongoDB server as a job scheduler and to store the results of your hyperparameter searches (this way you can run in parallel across a lot of different computers and not need to manually collect and collate each set of results). Optomatic is extensible through parameter-filters, which choose to ignore a particular set of parameters (e.g. if it's too close to one tried before), and parameter generators, which pump out new sets of params to test. There are also some plotting facilities.
There is Optunity, but it doesn't currently support discrete/categorical parameters. There is SMAC, which I haven't tried; Spearmint (which later turned into Whetlab and was bought out by Twitter); and also metaopt.
• The easiest way to get started is random search, followed by genetic algorithm. Avoid grid search, which is inefficient and relatively ineffective. Bayesian optimization is popular. There was also a paper, with Bengio as a coauthor if I remember correctly, that recommended using Gaussian processes or simplexes to guide hyperparameter tuning.
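As a concrete illustration of the sequential/Bayesian approach (HyperOpt's TPE, mentioned above), a minimal sketch might look like the following; the objective function and search space are illustrative assumptions, not a recipe from the thread:

from hyperopt import fmin, tpe, hp, STATUS_OK

def objective(params):
    # stand-in for "train a model with these hyperparameters and return its validation loss";
    # in practice you would call your own training routine here
    val_loss = (params['lr'] - 0.01) ** 2 + (params['dropout'] - 0.3) ** 2
    return {'loss': val_loss, 'status': STATUS_OK}

space = {'lr': hp.loguniform('lr', -10, -2),          # learning rate, sampled on a log scale
         'dropout': hp.uniform('dropout', 0.0, 0.7)}

best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)   # the best hyperparameter values found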

• Scanning hyperspace: how to tune machine learning models  |  reddit  |  hyperparameter optimization ...
• This actually applies the "three-way split" rule: the initial split yields the training and test datasets using the train_test_split() function. Then k-fold cross-validation is performed on the training set, which creates its own train and validation set(s), otherwise known as pseudo-test set(s); the latter is done internally by the GridSearchCV() function. In this way the test set is kept aside throughout the training and optimisation process and only used to report the final accuracy. (A minimal sketch of this pattern appears after this list.)
• Couldn't you use an evolutionary algorithm instead of Grid search? If you really wanted to find good hyperparameters you would have to iteratively do Grid search to refine your choices, assuming you have few enough parameters that grid search is practical. Whereas an EA, while more complicated, could probably find you near-optimal solutions in a few dozen generations.
• [cmachinarium << blog author] This is the first of a series of blog posts that aims to make the case for the value of tuning machine learning algorithms. We are definitely eager to cover more advanced optimisation topics in the future, including the heuristics for hyperparameter tuning that you mention. We think that grid search still remains the most intuitive technique, particularly for a novice audience.
• I certainly agree. When I have to do hyperparameter selection I go for grid search first since it's easiest. But I am sometimes confronted with many parameters in the more complex ML models. Additionally, there is a lot of computation time that has to be dedicated to training the model itself, so I have to decide how much time I dedicate to running it over and over in order to find good hyperparameters. Do you know of any techniques that are a bit "smarter" than grid search, but not as complicated to build as an EA?
• Bayesian optimization; look at hyperopt and spearmint. They should also be cleverer than genetic algorithms.
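A minimal sketch of the three-way split pattern described above (the dataset, model, and parameter grid are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20)    # stand-in dataset

# outer split: hold the test set out of all tuning
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# inner k-fold cross-validation over the training set only, via GridSearchCV
param_grid = {'C': [0.1, 1, 10], 'gamma': [1e-3, 1e-2, 1e-1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# the held-out test set is touched exactly once, to report the final accuracy
print(search.best_params_, search.score(X_test, y_test))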

• What does "debugging" a deep net look like?
I've heard people say that researchers spend more time debugging deep neural nets than training them. If you're a practitioner using a toolkit like TensorFlow or Lasagne, you can probably assume the code for the gradients, optimizers, etc is mostly correct. So then what does it mean to debug a neural network when you're using a toolkit like this? What are common bugs and debugging techniques? Presumably it's more than just tuning hyperparameters?
• A lot of neural net "bugs" are related to initialisation: if you don't initialise the net properly, the training will not converge.
Another bug I've run into is nets behaving very differently with/without dropout. This was because I accidentally applied dropout in the wrong position (never use dropout directly before a pooling layer for example).
It's always good to monitor the activations, weights and gradients of the different layers, to ensure that their magnitudes are in a healthy range (you don't want gradients that are a billion times smaller than the weights, for example). You don't have to do this all the time, but it can be a helpful diagnostic tool.
Manually inspecting validation set examples for which the net performs very poorly can also be extremely enlightening, and can reveal issues that you hadn't even noticed or thought of.
• Could you please describe some of the problems you observed when applying dropout directly before a pooling layer? The reason I'm curious is that this is actually a key idea in Bayesian Convolutional Neural Networks with Bernoulli Approximate Variational Inference arXiv:1506.02158, where the authors implement a Bayesian CNN by applying dropout after convolutional layers, but before pooling (effectively dropping out entire kernels, see the last two paragraphs of Section 5). An important difference, however, is that they use Monte Carlo dropout (sampling multiple forward passes with different dropout masks), and confirm empirically that this works much better than using standard dropout between convolution and pooling (which agrees with your observation that standard dropout before pooling causes problems).
• If you use MC dropout it's not a problem. The issue with applying 'regular' dropout before pooling is that at test time, there is no good single-pass approximation (i.e. you can't just halve the weights).
• Training some networks may require a fair amount of hand-holding. From my personal experience, here are some common problems:
• as /u/benanne mentioned, bad initialisation. Fix: Xavier/Glorot initialization.
• people like me often forget to switch off dropout during inference/testing.
• badly conditioned activations (mean >> 0). Fix: batch normalization or exponential linear units.
• as /u/siblbombs mentioned, NaNs. Fix: evaluate the network layer by layer (layer0, layer0 → layer1, layer0 → layer1 → layer2, etc.) to localize where NaNs appear. Have NaN-guards in place to kill gradients if they're NaN (before throwing an error), or you lose some of those precious training iterations, depending on how often you back up your parameters. (A minimal NaN-guard sketch appears after this list.)
• unbalanced datasets. Real world datasets may contain 100 samples of class 1 for every sample of class 2. When that happens, you may want to balance your loss function (or your dataset).
• For me it becomes a lot of playing "Why do I have a NaN somewhere, after a bunch of epochs?" Debugging the code can be things like did I potentially divide by or take the square root of 0, are my weights getting big enough that I somehow overflow a float32, did I do something else that is numerically unstable? Depending on your model you can do more advanced 'debugging', when I was implementing neural queues/stacks I modified my model to take explicit inputs for push/pop actions so that I could verify if the rest of the model learned assuming push/pop was functioning correctly.
• I think getting NaNs after several epochs (30+) is probably one of the most annoying things. Lures you into a false sense of security!
• Is there a general way to avoid the NaN issue? When I get NaNs, I usually restart training with a heuristic that checks the gradients for NaNs or Infs at each update and shrink the weights when they appear, but this is slow and perhaps not entirely principled. Is there any way to do better?
• I'm not sure myself, I come across them usually with RNNs, once it happens I generally try more aggressive gradient clipping or clipping of values in general. This also assumes you haven't done something dumb in your model (div/0, etc).
Blocks has an interesting (potential) solution in RemoveNotFinite which tries to recover when it happens in training, but if you have moved into some weird parameter space that consistently produces nans/infs this might not solve the issue.
The traditional theano approach is to use nanguardmode which spits out the apply node which has a nan, but I'm not super comfortable with working with the optimized graph and usually have a hard time determining what part of my model the specific apply node is in.
• Debugging visually means you can see the error go down. You might also want to watch the weights to make sure they have a roughly normal distribution. Karpathy has some nice visualizations, as does this page.
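A minimal NumPy-style sketch of the NaN-guard idea mentioned above; the parameter/gradient representation and the plain-SGD update are illustrative assumptions:

import numpy as np

def guarded_sgd_update(params, grads, lr=0.01):
    # params and grads are parallel lists of NumPy arrays
    for p, g in zip(params, grads):
        if not np.all(np.isfinite(g)):   # NaN or Inf detected in this gradient
            continue                     # kill this gradient (optionally log a warning) instead of corrupting the weights
        p -= lr * g                      # otherwise take the usual SGD step
    return params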

• Simple regression example in Keras (2 inputs, 1 output) cannot be trained?
[surreal_tournament]I am using the latest Keras (pip installed from git) to do regression on Bukin function No. 6. I sample 1000 random (x, y) input pairs, where x in [-15, -5] and y in [-3, 3] to generate the training set. I have played around with the activation functions, the number of hidden layers, the number of neurons in each hidden layer, rescaling the inputs to [0, 1], and still can't get a good model fit after many attempts. For example, sometimes the y predictions are very small (0 point something), and sometimes they are in the hundreds range.
I defined the model in the following way (following tutorials and forum posts):

from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation

model = Sequential()
# 2 inputs, 10 neurons in 1 hidden layer, with tanh activation and dropout
# (the Dense/Dropout lines below are reconstructed from this description; the dropout rate is assumed)
model.add(Dense(10, input_dim=2, activation='tanh'))
model.add(Dropout(0.5))
# 1 output, linear activation
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='rmsprop')

I am then training it with early stopping, like so:

from keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(monitor='val_loss', patience=2)
model.fit(X_train, y_train,
          nb_epoch=100, batch_size=50,
          validation_data=(X_test, y_test),
          callbacks=[early_stopping])

Any ideas about what I'm overlooking? Does my model setup even make sense? Thanks.

• [adagradlace] I did similar experiments to approximate simple functions with keras and it worked quite well. Try the following:
• more training data (10k, 100k or more)
• more hidden units, more layers
• ReLU as the non-linear function
• Xavier or He initialisation
• e.g., something like this:

model = Sequential()
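# (The rest of this snippet is missing from the post; a hypothetical continuation, consistent with
#  the advice above -- more units, ReLU, He initialisation -- might look like this. Layer sizes are
#  illustrative assumptions, not the poster's actual values.)
model.add(Dense(64, input_dim=2, init='he_normal', activation='relu'))
model.add(Dense(64, init='he_normal', activation='relu'))
model.add(Dense(1, activation='linear'))
model.compile(loss='mse', optimizer='rmsprop')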

• [surreal_tournament] Thank you for the reply. With your proposed network setup I get an MSE of approx. 50 on the validation after 20 epochs of training. Should I normalize the targets to [0, 1]? Some years ago I was training libFANN networks with tanh activation and only needed a single hidden layer with 20 neurons to get a decent accuracy on similarly smooth functions. I wonder how.
• [adagradlace] I tried it now and got down to MSE 0.45 after around 100 epochs with 7500 training examples, without any normalisation. Try more epochs and when the loss reaches a plateau, reduce the learning rate. You could also try a different optimizer.

• Hyperparameter tuning for deep nets?
What are some strategies for tuning a convnet that might take 2 hours for one epoch on the training data? I want to see if I can push the accuracy score any higher, but it feels like the task is too large for something like grid search or random search.
• It's kind of complicated to implement but freeze-thaw Bayesian optimization is sort of an ideal approach for neural networks.
• Agreed! To add to this, if you aren't doing something like this "manually" you might want to consider it. It's a pretty great way to boost efficiency when you have these really long running experiments and limited resources. Don't be afraid to kill partially trained models if they aren't looking promising.
• Bayesian optimization to the rescue
• If you don't have the resources to run many jobs in parallel (cluster/AWS) then Bayesian hyperparameter optimization (see e.g. Spearmint) is probably the only way. But most people I know just do random search in parallel, and take the best validation performer.
• Thanks - do you know of any good resources on reading up on Bayesian hyperparameter optimization? I see that there's a way to use Spearmint with Torch, but I also want to understand the methods.
• Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle and Ryan Prescott Adams. Advances in Neural Information Processing Systems, 2012
• Multi-Task Bayesian Optimization. Kevin Swersky, Jasper Snoek and Ryan Prescott Adams. Advances in Neural Information Processing Systems, 2013
• Input Warping for Bayesian Optimization of Non-stationary Functions. Jasper Snoek, Kevin Swersky, Richard Zemel and Ryan Prescott Adams. International Conference on Machine Learning, 2014
• Bayesian Optimization and Semiparametric Models with Applications to Assistive Technology. Jasper Snoek, PhD Thesis, University of Toronto, 2013
• Bayesian Optimization with Unknown Constraints Michael Gelbart, Jasper Snoek and Ryan Prescott Adams Uncertainty in Artificial Intelligence, 2014
• http://fastml.com/blog/categories/hyperparams/
• You can try using Optunity
• Related reddit post: Hyperparameter Selection

Papers:

• Ithapu VK (2015) On the interplay of network structure and gradient convergence in deep learning. arXiv:1511.05297  |  local copy
• The regularization and output consistency behavior of dropout and layer-wise pretraining for learning deep networks have been fairly well studied. However, our understanding of how the asymptotic convergence of backpropagation in deep architectures is related to the structural properties of the network and other design choices (like denoising and dropout rate) is less clear at this time. An interesting question one may ask is whether the network architecture and input data statistics may guide the choices of learning parameters and vice versa. In this work, we explore the association between such structural, distributional and learnability aspects vis-a-vis their interaction with parameter convergence rates. We present a framework to address these questions based on the backpropagation convergence for general nonconvex objectives using first-order information. This analysis suggests an interesting relationship between feature denoising and dropout. Building upon the results, we obtain a setup that provides systematic guidance regarding the choice of learning parameters and network sizes that achieve a certain level of convergence (in the optimization sense) often mediated by statistical attributes of the inputs. Our results are supported by a set of experiments we conducted as well as independent empirical observations reported by other groups in recent papers.

• Discussion - Conclusions

## MEMORY-RELATED: LARGE DATA FILES

### Machine Learning - Large Datasets, RAM Memory [Out-of-Memory] Issues ... - SOLUTIONS

• See also my XML/xsltproc-related comments (Apr 2015), further below; includes Pythonic solution and my bash script! :-)

Mini-batches

• How to PCA large data sets? I'm running out of memory?

Taking PCA over a subset of your data can work, though as you said choosing a "representative subsample" is hard.

There is an $IncrementalPCA$ in sklearn master that will do a minibatch computation. If you keep all the components the results are exact - smaller results ($n_{components} < n_{features}$) will have differences due to the computation of SVD then slicing in sklearn's batch PCA vs slicing each minibatch in the other version. I suppose you could keep all the components til the end, then slice, but then this is wasted computation. Not sure what the best option is here.

It would be great to add the randomized SVD solver for each minibatch to make it even faster for small numbers of $n_{components}$. I was hoping to tackle it during the PyCon sprints in a few weeks but PRs are always welcome :)

Full disclosure, I wrote the IncrementalPCA that is in sklearn right now.
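A minimal sketch of that minibatch IncrementalPCA workflow (array sizes and chunk size are illustrative assumptions):

import numpy as np
from sklearn.decomposition import IncrementalPCA

X = np.random.rand(100000, 200)             # stand-in for data too large to decompose in one shot
ipca = IncrementalPCA(n_components=50)

for start in range(0, X.shape[0], 1000):    # feed the data chunk by chunk
    ipca.partial_fit(X[start:start + 1000])

X_reduced = ipca.transform(X[:1000])        # project (a chunk of) data onto the learned components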

• Minibatch learning for large-scale data, using scikit-learn

... What we see from the above is that our situation points us towards Stochastic Gradient Descent (SGD) regression or classification. Why SGD? The problem with standard (usually gradient-descent-based) regression/classification implementations, support vector machines (SVMs), random forests etc is that they do not effectively scale to the data size we are talking, because of the need to load all the data into memory at once and/or nonlinear computation time.

SGD, however, can deal with large data sets effectively by breaking up the data into chunks and processing them sequentially, as we will see shortly; this is often called minibatch learning. The fact that we only need to load one chunk into memory at a time makes it useful for large-scale data, and the fact that it can work iteratively allows it to be used for online learning as well. SGD can be used for regression or classification with any regularization scheme (ridge, lasso, etc) and any loss function (squared loss, logistic loss, etc).

... The key feature of sklearn's $SGDRegressor$ and $SGDClassifier$ classes that we're interested in is the partial_fit() method; this is what supports minibatch learning. Whereas other estimators need to receive the entire training data in one go, there is no such necessity with the SGD estimators. One can, for instance, break up a data set of a million rows into a thousand chunks, then successively execute partial_fit() on each chunk. Each time one chunk is complete, it can be thrown out of memory and the next one loaded in, so memory needs are limited to the size of one chunk, not the entire data set.

(It's worth mentioning that the SGD estimators are not the only ones in sklearn that support minibatch learning; a variety of others are listed here (http://scikit-learn.org/stable/modules/scaling_strategies.html#incremental-learning). One can use this approach with any of them.)

Finally, the use of a generator in Python makes this easy to implement.

Below is a piece of simplified Python code for instructional purposes showing how to do this. It uses a generator called 'batcherator' to yield chunks one at a time, to be iteratively trained on using partial_fit() as described above.

from sklearn.linear_model import SGDRegressor

def iter_minibatches(chunksize):
    # Provide chunks one by one
    chunkstartmarker = 0
    while chunkstartmarker < numtrainingpoints:
        chunkrows = range(chunkstartmarker, chunkstartmarker + chunksize)
        X_chunk, y_chunk = getrows(chunkrows)
        yield X_chunk, y_chunk
        chunkstartmarker += chunksize

def main():
    batcherator = iter_minibatches(chunksize=1000)
    model = SGDRegressor()

    # Train model
    for X_chunk, y_chunk in batcherator:
        model.partial_fit(X_chunk, y_chunk)

    # Now make predictions with trained model
    y_predicted = model.predict(X_test)

We haven't said anything about the getrows() function in the code above, since it pretty much depends on the specifics of where the data resides. Common situations might involve the data being stored on disk, stored in distributed fashion, obtained from an interface etc.

Also, while this simplistic code calls $SGDRegressor$ with default arguments, this may not be the best thing to do. It is best to carry out careful cross-validation to determine the best hyperparameters to use, especially for regularization. There is a bunch more practical info on using sklearn's SGD estimators here.

Hopefully this post, and the links within, give you enough info to get started. Happy large-scale learning!

• How does Torch 7 load very large datasets that do not fit in memory?

A1. [up vote 5] Have a look at the $imagenet-multiGPU.torch$ full-stack sample code: it contains a data loader ($dataset.lua$) able to sample a batch of images at a time, which avoids pre-loading everything into memory:

(see train.lua for more details)

• Learning on huge datasets

A2. [up vote 6] Instead of using just one subset, you could use multiple subsets as in mini-batch learning (e.g. stochastic gradient descent). This way you would still make use of all your data.

• A Full Hardware Guide to Deep Learning

... In deep learning, the same memory is read repeatedly for every mini-batch before it is sent to the GPU (the memory is just overwritten), but it depends on the mini-batch size if its memory can be stored in the cache. For a mini-batch size of 128, we have 0.4MB and 1.5 MB for MNIST and CIFAR, respectively, which will fit into most CPU caches; for ImageNet, we have more than 85 MB (${4\times 128\times 244^2\times 3\times 1024^{-2}}$) for a mini-batch, which is much too large even for the largest cache ($L3$ caches are limited to a few MB).

Because data sets in general are too large to fit into the cache, new data need to be read from the RAM for each new mini-batch - so there will be a constant need to access the RAM either way.

RAM memory addresses stay in the cache (the CPU can perform fast lookups in the cache which point to the exact location of the data in RAM), but this is only true if your whole data set fits into your RAM, otherwise the memory addresses will change and there will be no speed up from caching (one might be able to prevent that when one uses pinned memory, but as you shall see later, it does not matter anyway).

Other pieces of deep learning code - like variables and function calls - will benefit from the cache, but these are generally few in number and fit easily into the small and fast $L1$ cache of almost any CPU.

From this reasoning it is sensible to conclude that CPU cache size should not really matter, and further analysis in the next sections is consistent with this conclusion.

• [Keras] Working with large datasets like ImageNet

Keras models absolutely do support batch training. The CIFAR10 example offers an example of this. What's more, you can use the image preprocessing module (data augmentation and normalization) on batches as well. Here's a quick example:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    featurewise_center=True,             # set input mean to 0 over the dataset
    samplewise_center=False,             # set each sample mean to 0
    featurewise_std_normalization=True,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its std
    zca_whitening=False,                 # apply ZCA whitening
    rotation_range=20,                   # randomly rotate images in the range (degrees, 0 to 180)
    width_shift_range=0.2,               # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.2,              # randomly shift images vertically (fraction of total height)
    horizontal_flip=True,                # randomly flip images
    vertical_flip=False)                 # randomly flip images

datagen.fit(X_sample)  # let's say X_sample is a small-ish but statistically representative sample of your data

# let's say you have an ImageNet generator that yields ~10k samples at a time.
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in ImageNet():  # these are chunks of ~10k pictures
        for X_batch, Y_batch in datagen.flow(X_train, Y_train, batch_size=32):  # these are chunks of 32 samples
            loss = model.train(X_batch, Y_batch)

# Alternatively, without data augmentation / normalization:
for e in range(nb_epoch):
    print("epoch %d" % e)
    for X_train, Y_train in ImageNet():  # these are chunks of ~10k pictures
        model.fit(X_train, Y_train, batch_size=32, nb_epoch=1)

• If you have a huge dataset as an HDF5 file, you can use $keras.utils.io_utils.HDF5Matrix$ to load the dataset. It will only read one batch at a time into memory, but there are some limitations (e.g., you cannot read shuffled data from the file, only sequentially). A workaround would be to shuffle the data before you store it to disk (but you would still read the same batches after a full epoch).

Here is a short example of how to do this. It assumes you have all of your samples in the same HDF5 file, with features and targets in HDF5 datasets named 'features' and 'targets':

from keras.utils.io_utils import HDF5Matrix

def load_data(datapath, train_start, test_start, n_training_examples, n_test_examples):
    # 'normalize_data' is a user-supplied normalizer function (not shown in the original post)
    X_train = HDF5Matrix(datapath, 'features', train_start, train_start+n_training_examples, normalizer=normalize_data)
    y_train = HDF5Matrix(datapath, 'targets', train_start, train_start+n_training_examples)
    X_test = HDF5Matrix(datapath, 'features', test_start, test_start+n_test_examples, normalizer=normalize_data)
    y_test = HDF5Matrix(datapath, 'targets', test_start, test_start+n_test_examples)
    return X_train, y_train, X_test, y_test

The returned variables here are not real Numpy arrays, but they implement the same interface so everything works transparently in Keras (as long as you don't try to read shuffled indices).
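A hypothetical usage of the helper above (the file name, index arguments, and the compiled Keras model are illustrative assumptions):

X_train, y_train, X_test, y_test = load_data('data.h5', 0, 50000, 40000, 10000)
model.fit(X_train, y_train, batch_size=32,
          shuffle=False,                      # avoid requesting shuffled reads from the HDF5 file
          validation_data=(X_test, y_test))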

• In general smaller batches will give better results, however larger batches will make training faster. There is a compromise to strike, somewhere in between. I generally use 16 or 32, unless there is very little data in which case I go full stochastic (batch_size = 1).

• Introduction to Deep Learning Q&A

Q: Can you please talk about the advantage by using batch processing? I am using Caffe, and it seems the batch processing has no benefits with CPUs.

A: Batch processing is very popular on GPUs because you can tune the batch size to correspond with the amount of memory (RAM) on the GPU. When the batch size fits into GPU memory, the training computations are quite efficient on GPUs.

• Mini-batches too big for memory (Theano)

I'm a beginner with Theano and my problem is that my mini-batches are too big. I'm using a convolutional neural network similar to the deep learning tutorial. The images I'm using to train are so big that I can only use a mini-batch size = 2. I really want to experiment with bigger sizes, so I'm trying to come up with code that only updates the weights after, say, 100 samples (so 50 mini-batch iterations). The problem is that if I update after the 50th mini-batch, only the last mini-batch will affect the updates. ...

A1. [up vote 1] I don't think there is a good reason to do this... The primary motivation to use mini-batches is so that the computation can make efficient use of the GPU (or multicore) architecture. If you are computing the gradients in batches of two because of memory constraints, you might as well update the model parameters every two examples as well. That being said, if you really want to only update the model parameters every 100 examples, you could keep a running average (or sum) of the gradients computed until you reach 100 examples and then apply that averaged gradient as an update. I think that maintaining the average is just as much computational overhead (and even more memory overhead) as updating the model parameters every batch though. Again though, I think this is a bad idea compared to just updating the model after every batch.

> Thank you for your answer. This has solved my problem indeed. I do think there is a good reason to do this however. The mini-batch size can sometimes greatly influence the training speed and even the final accuracy.

>> Glad that worked... You might have been having hyperparameter issues with the small mini-batch size though... if you are using a batch size of 2 instead of 100 you might need to divide the learning rate by a similar factor (or maybe a bit less...); so, your learning rate for the 2 example batches might have been much too high...

>>> Batch learning is not just for performance reasons. The bigger the batch is, the better the generalization. Without a batch (or batch size = 1), you will update the weights after each step and for the next learning step, you use the new weights. That can perform much different than if you use the same weights for a whole batch.
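A minimal sketch of the gradient-accumulation idea from this exchange; the parameter representation and the compute_gradients() helper are illustrative assumptions, not from the thread:

def train_with_accumulation(params, minibatches, compute_gradients, lr=0.01, samples_per_update=100):
    # params: list of NumPy arrays; compute_gradients(params, X, y): user-supplied function
    # returning gradients with the same shapes as params (both are assumptions)
    accum, seen = None, 0
    for X_mini, y_mini in minibatches:                    # tiny batches that fit in memory
        grads = compute_gradients(params, X_mini, y_mini)
        accum = grads if accum is None else [a + g for a, g in zip(accum, grads)]
        seen += len(X_mini)
        if seen >= samples_per_update:                    # apply the averaged gradient every ~100 samples
            params = [p - lr * (a / seen) for p, a in zip(params, accum)]
            accum, seen = None, 0
    return params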

• How does Theano handle the batch size when it is not a divisor of the train sample?

If the batch size is not a divisor of the training sample, how does Theano handle the last batch during an epoch? Does it skip the last batch, or go through until it reaches the end?

A. Depends on what you mean by "Theano"! Theano itself shouldn't know anything about the batch size. Conventional use of Theano would not use the batch size within the computation at all; it just determines how much data to pass into the computation for a batch.

At best the batch size should play no explicit part in the computation -- it's just the size of one of the dimensions of your data tensor(s). At worst the batch size might determine the number of iterations of a scan, but this would be strange: normally one wouldn't scan over the batch, and even if you did, you still shouldn't be specifying the batch size within the computation itself; scan would just dynamically iterate over however many entries there are in the batch dimension.

It really depends on your code -- how are you using the batch size with respect to Theano code? If you have some particular code you're concerned about, post it here and we can comment more specifically. If you're using a library that builds on top of Theano, e.g. Lasagne, Blocks, or Keras, then the answer may differ; again post some code and we can comment.

train_set_x, train_set_y = datasets[0]

n_train_batches = train_set_x.get_value(borrow=True).shape[0] / batch_size

train_model = theano.function(
    inputs=[index],
    outputs=cost,
    givens={
        x: train_set_x[index * batch_size: (index + 1) * batch_size],
        y: train_set_y[index * batch_size: (index + 1) * batch_size]
    }
)

[Q.] Take a look at this code (above) from the Theano tutorial, as an example. Say the training set is 11,900 rows and the batch size is 2,000 rows. 2,000 is not a divisor of 11,900, so n_train_batches = 5.95; Python (integer division) rounds it down to 5. What happens to the other 0.95*2000 examples at the end of the training set?

[A.] The problem there is that it's doing integer division so any remainder will not be used. I guess that in the tutorial code they are either using data sizes that are always an exact multiple of the batch size or they've kept the code simple because it's just for tutorial purposes.

If the data size is not an exact multiple of the batch size then one should use:

batch_count = (data_count - 1) / batch_size + 1
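Plugging in the numbers from the question above (using // for explicit integer division):

batch_count = (11900 - 1) // 2000 + 1   # = 6; the final (6th) batch holds the remaining 1,900 rows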

Theano behaves just like numpy when it comes to slicing:

import numpy
a = numpy.arange(10)
print a[8:12]

will print

[8 9]

In this example, the batch size is 4 but the final batch actually has size 2. Theano does the same with slicing and there's no problem with having different batch sizes as long as the computation doesn't build the nominal batch size in anywhere. So as long as the number of batches is corrected, the code you pasted will include all the data but the final batch will be smaller than the others.

• Vowpal Wabbit tutorial for the Uninitiated
• cited here: What is the most efficient way of training data using least memory?

A1. [up vote 4; accepted] I believe the term for this type of learning is out-of-core learning. One suggestion is vowpal wabbit, which has a convenient R library, as well as libraries for many other languages.

> I'm having dependency issues with boost while installing it. Do you have any idea why I get this? bit.ly/L939DO - madCode

>> @madCode I've never actually used vowpal wabbit, so I can't help you install it. I've heard their mailing list is excellent, and I'm sure you can find help there for setting it up. - Zach Jul 17 '12

>>> Hey Zach: it worked fine. I got it installed and it even gives me predictions. Thanks :-) - madCode Jul 17 '12 at 20:38

A2. [up vote 1] I heartily second Zach's suggestion. vowpal wabbit is an excellent option, and you'd be surprised by its speed. A 200k by 10k data-set is not considered large by vowpal wabbit's norms. vowpal_wabbit (available in source form via https://github.com/JohnLangford/vowpal_wabbit; an older version is available as a standard package in Ubuntu universe) is a fast online linear + bilinear learner, with very flexible input. You may mix binary and numeric-valued features. There's no need to number the features, as variable names will work "as is". It has a ton of options, algorithms, reductions, loss-functions, and all-in-all great flexibility. You may join the mailing list (find it via github) and ask any question. The community is very knowledgeable and supportive.

• Vowpal Wabbit tutorial for the Uninitiated

Whenever I have a classification task with lots of data and lots of features, I love throwing Vowpal Wabbit (VW) at the problem. Unfortunately, I find the array of command-line options in VW very intimidating. The github wiki is really good, but the information you need to be productive is scattered all over the place. This is my attempt to put everything you need in one place.

Note most of this is directly cribbed from the wiki (https://github.com/JohnLangford/vowpal_wabbit/wiki). So always check there for the latest instructions.

Also I have only covered the arguments and options I use the most often. Check the complete reference (https://github.com/JohnLangford/vowpal_wabbit/wiki/Command-line-arguments) on the wiki for more information.

Installation: Vowpal Wabbit is being developed rapidly enough that it is worth it to just install directly from the github repo.

git clone https://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit
make

The only dependency, short of a working C++ compiler, is the Boost program options library, which is fairly straightforward to get:

sudo apt-get install libboost-program-options-dev

[ ... SNIP! ... ]

Miscellaneous:

Papers:

PANDAS

Ruby

• Processing large CSV files with Ruby
Processing large files is a memory-intensive operation and could cause servers to run out of RAM and swap to disk. Let's look at a few ways to process CSV files with Ruby and measure memory consumption and speed.

### MEMORY-RELATED -- OLDER [Apr 2015]

XML - XSLT PROCESSING OF LARGE XML FILES (xsltproc ...)
/mnt/Vancouver/Programming/ML/Machine Learning - Large Datasets, RAM Memory.txt

[vstuart@vstuart01-centos PMC]$ date
Fri Apr 10 09:16:25 PDT 2015

[vstuart@vstuart01-centos PMC]$ pwd
/projects/btl/vstuart/Annotation/PMC

[vstuart@vstuart01-centos PMC]$ l
total 26G
-rw-r--r-- 1 vstuart users    0 Apr  7 14:46 out-pattern_matches
-rw-r--r-- 1 vstuart users 4.6K Feb 26 15:39 patterns-1.6
-rw-r--r-- 1 vstuart users 4.6K Mar 30 09:52 patterns-1.7
-rw-r----- 1 vstuart users  38M Mar 20 15:11 pmc_result2.xml
-rw-r--r-- 1 vstuart users  26G Mar 25 12:03 pmc_result.xml
-rw-r--r-- 1 vstuart users 7.8K Apr  1 11:53 pm.pmc_stylesheet-1.1.xsl
rwxrwxrwx  1 vstuart users   43 Apr  9 11:42 read_lines.sh
-rw-r--r-- 1 vstuart users  13M Apr  9 14:42 set2
-rw-r--r-- 1 vstuart users    0 Apr  9 13:18 temp
-rw-r--r-- 1 vstuart users 1.9K Feb 13 16:14 victoria-1.2.xsl
-rwxr-xr-x 1 vstuart users  545 Mar 25 09:08 xsltproc2.sh
-rwxr-xr-x 1 vstuart users  174 Mar 25 09:01 xsltproc.sh

I am able to process the smaller pmc_result XML file (38 MB) OK, but the large one (26 GB) fails:

xsltproc -o set2 pm.pmc_stylesheet-1.1.xsl pmc_result2.xml && echo "OK" || echo "NOK"    ## OK (passed: ran)
xsltproc -o set1 pm.pmc_stylesheet-1.1.xsl pmc_result.xml && echo "OK" || echo "NOK"     ## NOK (failed)

My "read_lines.sh" script (a good general solution for programmatically working with large files) reads the XML file one line at a time and gives an identical output (set2), BUT it also fails on the larger file:

xsltproc -o set2 pm.pmc_stylesheet-1.1.xsl < ./read_lines.sh pmc_result2.xml && echo "OK" || echo "NOK"    ## OK
xsltproc -o set1 pm.pmc_stylesheet-1.1.xsl < ./read_lines.sh pmc_result.xml && echo "OK" || echo "NOK"     ## NOK

The problem with the last approach (read_lines.sh) is that xsltproc reads the entire XML file into (computer) RAM memory as a DOM tree structure that consumes MULTIPLES (perhaps up to 10-fold or so) of the XML file size - a well-known and documented limitation of XML files and XSLT 1.0. (xsltproc -- included in most Linux distributions -- is only XSLT 1.0-compliant; XSLT 2.0-compliant solutions, such as SAX (Saxon), do exist.)

• READING LARGE FILES LINE-BY-LINE: HOW TO READ A LARGE FILE, LINE BY LINE, IN PYTHON

• How to read large file, line by line in python [StackOverflow]
• supporting Python docs
• additional info: 10 tips to Improve Performance of Shell Scripts  >> I used this (specifically, item #5) as the basis of my read_lines.sh script
• see also: How to pass each line of a text file as an argument to a command?  >> again, use a "while read" loop

• This is echoed in Solution 1, here. Let's start with the straightforward pythonic way to read a sequence of records from a file:

def get_data(input_filename, delimiter = ','):
    with open(input_filename, 'r+b') as f:
        for record in f:                  # traverse sequentially through the file
            x = record.split(delimiter)   # parsing logic goes here (binary, text, JSON, markup, etc.)
            yield x                       # emit a stream of things (e.g., words in the line of a text file,
                                          # or fields in the row of a CSV file)

Here we exploit Python's lazy evaluation and iterable comprehension, slurping a sequence of records sequentially (i.e., line after line) from the file on disk. By reading binary data, we can handle any arbitrary data type. However, you'll need some knowledge about how to split the stream into records; since we assume text data above, the easy thing is to split on whitespace (i.e., a record is a word) or commas (i.e., a record is a field in a CSV file). You might also want to parse the records further, into fields, words, etc. This solution avoids reading the whole input into memory, is readable, and simple enough there's really no point to wrapping it for reuse.
Depending on your system/context, you can probably emit a stream of data at 100+ MB/s or better throughput. This is the straightforward pythonic way to do it. Yay, Python!

The solution from Part #1 traverses the data source sequentially, yielding a predictable/ordered/biased stream of records. When you are feeding a machine learning model, this is not good. If the data source is small, you can use UNIX tools to pre-sort the data ... but what if you can't afford to pre-sort everything?

• Also echoed here: Reading a file bigger than current RAM memory? [reddit]

If it is a text file with line breaks, you can read it line by line like this:

with open("file.txt") as f:
    for line in f:
        process_line(line)

It won't slurp the whole file into memory like read() or readlines() does. If for some reason you want to read the file in fixed-sized chunks, then try this:

with open("file.txt") as f:
    for chunk in iter(lambda: f.read(128), ""):
        process_chunk(chunk)

You can make a generator out of it:

def read_in_chunks(f, size=128):
    while True:
        chunk = f.read(size)
        if not chunk:
            break
        yield chunk

And you can use it like so:

with open("file.txt") as f:
    for chunk in read_in_chunks(f):
        process_chunk(chunk)

This should cover most scenarios. BTW I wrote a blog post about this.

## MISSING DATA, VALUES (NaN)

## MODEL SIZES

• How to choose the number of hidden layers and nodes in a feedforward neural network?
Is there a standard and accepted method for selecting the number of layers, and the number of nodes in each layer, in a FF NN? I'm interested in automated ways of building neural networks.
• VICTORIA: three excellent, highly-upvoted answers! :-D

• Model sizes for sequence to sequence learning?
I have trained a lot of sequence-to-sequence models in the last 6 months, with all kinds of recurrent encoder-decoder set-ups. I have rarely experienced the clean results reported in the most famous papers. I am maxing out a 4GB Nvidia GPU with between 20-80 million parameters, using either very large, single layer LSTMs or deeper layers with a smaller hidden dimension. And yet, getting a model to fit to even 20-30 thousand sequences is challenging. When sequences are < 30 time-steps, it is not difficult, but training longer sequences (50+ time-steps) with batch processing seems to plateau too easily. What size of model, in others' experience, is required for truly hardcore sequence-to-sequence learning? Have the big guns in the field been parallelizing billions of parameters across multiple GPUs? Or am I just so far lacking that mysterious winning touch?
• Gradients are the dominant memory cost in sequence to sequence models; try lowering your batch size and also truncating gradients. If you have a "long tail" on the length of sequences, throwing out the longest ones can also help. This means you will have to wait much longer to see results - and these models are normally trained for 3-4 days on 12GB to 24GB GPUs. Depending on the length of your sequences, attention might be very important (>50-ish timesteps you start to have long-range dependency problems with the regular seq2seq/enc-dec). All of this assumes you have other things right: orthogonal init of hidden-to-hidden, initialize the forget gate bias to 1 in the LSTM, make the initial hidden state in the decoder a function of the context (such as the mean) as well, and make sure you aren't having bugs/optimization problems with masking and so on. Summed loss over timesteps also optimizes very differently from the mean over timesteps, and I have cases (even today!) where one works and the other does not.
Beam search is crucial. A bug in beam search can cost ~2 BLEU points, which is quite a lot these days.
• "Summed loss over timesteps also optimizes very differently from the mean over timesteps, and I have cases (even today!) where one works and the other does not." I don't understand - shouldn't these result in exactly the same weight updates, given the right learning rates? (Not to say that choosing those learning rates is always easy, of course.)
• If you are using a pseudo-second order optimizer like Adadelta, Adam, or RMSProp, it gets very weird, since the true learning rate is kind of independent of the "learning rate" you set: updates are per parameter, based on some rules/statistics. For SGD it is true, but as an example my problem has variably ~700-1200 timesteps, which makes finding the right learning rate hard - without varying the learning rate per minibatch you are taking different size steps depending on the length of the batch, effectively giving different importance to different training samples.
• [OriolVinyals] You shouldn't change your learning rate per minibatch. In the extreme, if minibatches are of size 1 (i.e. 1 sequence), unrolled 100 steps, and one minibatch has a sequence of length 1 (filling 1% of the batch), vs. another which has a sequence of length 50 (filling 50%), the gradient for the length-50 one has more importance and that is the correct thing to do. If you divide the gradient by 50 for the 50% minibatch, and by 1 for the 1% minibatch, you are really not optimizing the same objective as in our seq2seq paper.
• May I ask, why does the gradient of a longer sequence have more importance than the gradient of a shorter one?
• [OriolVinyals] As an extreme example, imagine you are training a unigram language model. All you have to do is count the number of times a word (e.g. "California") occurs in the training data, divided by the total number of words in your corpus. Do you count differently if the word is contained in a long vs. short sentence? More counts come from longer sentences, and if you want to maximize the probability of words given context, then that's the correct thing to do. Perplexity is a per-word metric, not per-sentence. Of course, it may (or may not) correlate with other metrics, but that's a whole different debate.
• If you're training with mini-batches, the summed losses over a longer sequence might look more impressive than the mean loss, considered against losses for a shorter sequence. The mean loss would have less regard for sequence length. That's how I interpret it anyway.
• Impressive is a good way to put it - my losses went from -3000-ish NLL (oh wow, so fancy) to about -3 (small numbers are not hype enough). Good news is my model is still broken, so the numbers don't really matter :)
• I don't know what you're working with, but I had a problem of meaningless losses (randomly moving between 3000-5000) in a Torch model I built, and it turned out to be a problem with nn.MaskZeroCriterion (a loss function decorator that masks zero padding) which was fixed about two weeks ago.
• It is my own code on top of Theano - and more than likely the bug I am facing is of my own making. But I will find it, eventually...

## OPTIMIZATION

• See also the Gradient Descent | Optimization subsection of my "ML Notes" file [<< large file; opens in new tab].
Blogs - Optimization:

• CNN - Stanford cs231n [Andrej Karpathy]: Backpropagation, Intuitions  |  subreddit

• Auto-sklearn / automated data science/ML: Contest Winner: Winning the AutoML Challenge with Auto-sklearn [KDNuggets.com]: This post is the first place prize recipient in the recent KDnuggets blog contest. Auto-sklearn is an open-source Python tool that automatically determines effective machine learning pipelines for classification and regression datasets. It is built around the successful scikit-learn library and won the recent AutoML challenge. ...  |  pdf [local copy]

• How do you know when you've reached a critical point when optimizing a DNN?
So I've been trying out some different optimizers (SGD, NAG, Adadelta, etc.) and there's one thing I can't quite figure out: How do you know when you've reached a critical point (be it a local min, local max or saddle point, idc)? Just any critical point in general. Is it just when the gradient becomes zero or approaches zero?
• In practice, "the critical point" is when your test set error becomes lower than the baseline method's error. However, you need to make sure you limit yourself to just checking whether you hit the critical point by manual inspection. In this situation it is considered normal ML science. Once you start to automate this procedure, such that you have your program checking for "critical points", you're then testing on the test set.
• Surprisingly, this is kinda the right answer. Mathematically speaking, you are correct in saying the critical point of a smooth function is the point where the gradient of the function is equal to zero. Numerically checking the norm of the gradient, however, is unreliable:
• The norm of the gradient requires you to compute the full gradient over the whole dataset. This takes forever.
• Even if you could compute the norm of the gradient, it can be small when moving through a region where the graph of the function is flat. This makes the norm small but nonzero, and it is hard to distinguish between rounding errors and pathological curvature. It's often the case that the norm of the gradient is small, you think you've converged, and there's a sudden steep dropoff as you exit this regime.
So basically the norm of the gradient doesn't help much in practice.
• Yeah, I was worried about that. How "off" would the gradient over a minibatch be (compared to computing it over the whole dataset)? Assume for some reason I can deal with pathological curvature being mistaken for a critical point.
• I would consider the second problem to be way more serious than the first - you can characterize how "off" you are using something like the central limit theorem quite precisely (remember: the full gradient is just the average of the gradient of the samples) and usually if your minibatch isn't too big it isn't that big a deal.
• Sweet, it seems like that might work for what I'm doing. Thanks!
• You can split x% (usually 10-30%) of your training set off and call it a validation set. This set you don't use for training; instead you observe the validation error on it over training. You stop once you reach the minimum of the validation error, e.g. by saving a network snapshot when a new minimum is reached, or by rerunning the training procedure with the same random seed up to the best epoch according to the validation error. That way you don't touch the test set. The method is called early stopping.
• Your procedure is only part of the process and surely suboptimal alone for the typical machine learning objective, that is, to report results higher than the baseline on the test set. Normally you would observe the test set after early stopping on the validation set and modify your hyperparameters accordingly, rerunning the entire process, until your test set error is below the baseline. Then you will release the software, network design, etc. with said hyperparameters included as obvious settings. In many cases you will not even obfuscate this and will include 10 versions of your method in result tables and blog posts, showing that one of them miraculously (and not by chance) does better than the others, and the baseline, by some small margin. You can even write a whole section about "exploring your hyperparameters" to highlight your proper scientific method. This is standard ML science practice even at the top research laboratories and companies. Be warned, however: if at any point you start automating this process you are now doing poor ML science. In the extreme case, if you brute-force all the hyperparameters (a la SARM) you are going to be accused of fraud.
• This is suboptimal. After you early stop on validation, you should begin training from the start on the training set with the validation set included, for the number of epochs your early stopping took (or something fancier, but that is enough), and only then observe your test set accuracy.

• Advice for applying ML - Andrew Ng [pdf]

• How well do hyperparameters generalize to more complex problems?
Michael Nielsen's ebook on deep learning [Neural Networks and Deep Learning] suggests that to find good hyperparameters it is better to start with a simple problem and find good hyperparameters to solve that, before attempting to optimize the hyperparameters for the full problem. For example, if I was attempting MNIST then I might begin by using all the data for the images with numbers 0-4 on them; I would find a good learning rate, batch size, and number of training epochs. Once I've found good hyperparameters, I can use these as a basis for the full MNIST problem. Does this approach work well for most parameters? I imagine that the optimal learning rate and batch size for the MNIST 0-4 problem would perform well for the MNIST 0-9 problem; however, since the size of the output layer has doubled, the size of the other layers (or the number of other layers) probably needs to be larger. So which hyperparameters do generalize better to more complex problems? I'm considering the number of training epochs, batch size, learning rate, number of hidden layers, layer sizes, and regularization parameters. Also, are there any rules of thumb, such as: when doubling the size of the output layer, should the hidden layers double in size (or increase by a factor of sqrt(2))?
• I'll preface this by first saying that this is not my area of expertise. But it is my understanding that the questions you ask are where the field is at right now. There are no real "rules of thumb" in the way I think you might be looking for them, or that you might expect. There are trade-offs either way: over-parameterization and complexity increase accuracy locally, but tend to decrease accuracy non-locally.
There is a good discussion of a response to your question (in the context of the exact domain you mention -- detecting numbers) available in the course materials located here: CME 250: Introduction to Machine Learning [ICME], and expanded upon in a very readable book recently published by several Stanford professors: An Introduction to Statistical Learning [with Applications in R].
• Just from experience, your intuition is pretty good. The learning rate, for example, should be within an order of magnitude on the large problem, so you have a smaller search space. I've found training time is probably proportional to data size and complexity. Batch size probably doesn't matter much. I think the remark about needing a bigger network is probably a bit off though. It makes sense intuitively, but in practice most CNNs at least have a high capacity and are strongly regularised to achieve accuracy. This is well demonstrated by the fact that trained networks can be compressed massively, to fit onto smartphones for example. This is actually a broader concept in science I have found - there are many processes where the optimal point for accuracy is "above x" rather than "exactly x". The classic example is medical imaging, where radiation dose is proportional to scan quality, until it crosses a threshold where the human eye can't tell the difference. Ideally we want that exact dose, so we aren't over-irradiating people. We typically image somewhere between ten and fifty percent above that dose, because we can't tell the difference and hitting an exact target is hard. Same principle I think. Unless you are actively and concertedly optimising to minimise capacity and reach maximum accuracy, you will overshoot capacity. Which is why we have dropout :)
• Choosing appropriate hyperparameters is a completely open problem, and most people will start from hyperparameters (including the model itself) that are known to work on a similar problem, then modify. Modifications are mainly based on intuition. I would echo Andrej Karpathy's advice in this general situation - it is good to see if you can overfit a model on a small subset of your training data. If it can't produce some kind of results then something definitely needs changing.
• In general it doesn't generalize. It is still good to start simple and small, however.
• Hyperparameters are problem-specific, not model-specific. They almost never generalize across problems, unless the problems are very similar (same amount of data of a similar type). In particular, having more or less training data (even if it's the same type of data) will completely invalidate your regularization hyperparameters. Having noisy labels will do that too, etc. The only real way to pick them is by doing grid search. You don't know what works until you try.
• Thanks, that's very helpful. I'm just getting started on NNs so I'm using a simple grid search, but what's your opinion on methods such as Spearmint, genetic algorithms, or "Bayesian freeze-thaw"? I've seen these three mentioned on this subreddit as more efficient ways of tuning hyperparameters.
• In my experience grid search is enough to tune your parameters. The simplest stuff is often the best. You can try smarter things, but it will be harder to set up and may not be much better. But if you want to try it, I've had a good experience with hyperopt and with hyperas (a hyperopt-based lib for tuning Keras nets).

• I was training a CNN and this huge accuracy drop happened -- any idea why?
• You may have started getting undefined gradients and NaN outputs for some classes. In general it is useless to guess what is going on (as I just did); if you want people to help you, you need to track the gradients, the loss function and the weights. Only then can what is going on in a network be visualized.

• Microsoft's New Neural Net Shows Artificial Intelligence Is About to Get Way Smarter [Wired] | >150 layers, hyperparameter optimization (Microsoft); FPGAs (Microsoft; Intel)

• Model evaluation, model selection, and algorithm selection in machine learning:

• Excellent: Peeking inside Convnets | reddit | [June 2016]: "There has been similar work on visualizing convolutional networks by e.g. Zeiler and Fergus [arXiv:1311.2901] and lately by Yosinski, Nguyen et al. In a recent work by Nguyen [arXiv:1602.03616], they manage to visualize features very well, based on a technique they called "mean-image initialization". Since I started writing this blog post, they've also published a new paper [arXiv:1605.09304, above; reddit; GitHub] using Generative Adversarial Networks as priors for the visualizations, which leads to far, far better visualizations than the ones I've shown above. If you are interested, do take a look at their paper or the code they've released!"

• Related: Yosinski J (2015) Understanding neural networks through deep visualization. arXiv:1506.06579 | GitHub

• Recent years have produced great advances in training large, deep neural networks (DNNs), including notable successes in training convolutional neural networks (convnets) to recognize natural images. However, our understanding of how these models work, especially what computations they perform at intermediate layers, has lagged behind. Progress in the field will be further accelerated by the development of better tools for visualizing and interpreting neural nets. We introduce two such tools here. The first is a tool that visualizes the activations produced on each layer of a trained convnet as it processes an image or video (e.g. a live webcam stream). We have found that looking at live activations that change in response to user input helps build valuable intuitions about how convnets work. The second tool enables visualizing features at each layer of a DNN via regularized optimization in image space. Because previous versions of this idea produced less recognizable images, here we introduce several new regularization methods that combine to produce qualitatively clearer, more interpretable visualizations. Both tools are open source and work on a pre-trained convnet with minimal setup.

• Related: Understanding Neural Networks Through Deep Visualization [blog post by arXiv:1506.06579 authors Jason Yosinski, Jeff Clune, Anh Nguyen et al.] | reddit

• Code for synthesizing images via deep generator networks [reddit]

• Shouldn't test accuracy typically be lower than train accuracy? Recently, some 3D convolution-based models I've been training have had test (or val; it seems most people say "test" when the set is actually validating the model) accuracy materially above the train accuracy. I would have guessed that test accuracy should trail train accuracy basically every time. I have checked that my test data is not in my training data, which, if it were, should have caused the two to be more similar. Have I just gotten lucky?

• There's no law that says that test accuracy has to be lower than training accuracy. It usually happens that way, but it depends on the sets.
Which brings me to the point: try doing the test/train split again (using a different random-number seed), and train a new model.

• It can happen if your validation set is a couple of orders of magnitude smaller than the training set. In that case the validation set may not cover the whole manifold of the training data, only some localized part of it, so by accident your classifier works better on that part than it does on average over the whole dataset. On the training set this averages out to a lower accuracy, but on the validation set it does not. Solution: before training, make the validation set comparable in size to the training set, sample it randomly, and during training validate on a small random subset of that large validation set.

• Are you leaving dropout on when computing the training error?

• Both train and test accuracy had keep_probability: 0.8. Don't I want it on for training but not for testing?

• If you train with dropout, then your train error will be inflated, so that could be the reason.

• Yes, you want to turn dropout off for testing.

• StitchFix.com blog entries are excellent; e.g.:

• GA - Genetic Algorithms: Data-Driven Fashion Design: ... we draw inspiration from the field of genetic algorithms, which have been shown to achieve efficient search across a variety of similar problems. The proposed design system can be shaped to mimic the processes that underlie these stochastic optimization methods. In the remainder of this blog post we describe a design system being explored at Stitch Fix that blends elements of genetic algorithms with the judgments of our expert human fashion designers.

• TensorFlow ConvNets on a Budget with Bayesian Optimization [reddit] | SigOpt for ML: TensorFlow ConvNets on a Budget with Bayesian Optimization | commercial (SigOpt), but a good description of SGD and of tuning CNN hyperparameters in their blog post, plus good discussion with SigOpt staff in the reddit thread (with links to code on GitHub and a free (temporary; then $99/mo for individuals) SigOpt account) ...
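
• Hedged sketch of the plain grid search recommended in the thread above; `train_and_validate` is a hypothetical user-supplied function that trains a model with the given hyperparameters and returns a validation score.

```python
# Minimal hand-rolled grid search over a few hyperparameters, in the spirit of
# the "simplest stuff is often the best" advice above.
from itertools import product

grid = {
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size":    [32, 64, 128],
    "dropout":       [0.2, 0.5],
}

def grid_search(train_and_validate, grid):
    best_score, best_params = float("-inf"), None
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = train_and_validate(**params)   # e.g. validation accuracy
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```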

• When to stop iteration of neural network training?
On validation set, Loss starts to drop or no lower loss for several iterations, or accuracy starts to drop or no higher accuracy for several iterations? Which one is better and why?

• It really depends on the settings of the problem. If the classes are balanced, then accuracy is a good measure. If the classes are unbalanced, the ROC curves and Precision-Recall curves are another good measure. Looking at all three is good for analysis. I usually like to look at the confusion matrix (especially if it's a simple single label problem), it can give you an understanding of where your classifier is falling short. Loss going down is usually a good indication that your network is learning the problem but it is highly dependent on your loss function and what it means in the context of your problem. So basically, it all depends on your problem and what metrics are important for that particular problem.

• These are all good points but I would also note that, for most ML models, the only thing explicitly being optimized for is the loss. While using accuracy for early stopping may work, I'd argue that it's less theoretically founded than using loss.

• [u/trevinstein] I always stop at a point where validation set and training set losses start to diverge.

• This may be a bad idea as the problem of "dying ReLU" may lead to "not so accurate" prediction. What is the "dying ReLU" problem in neural networks?

• [u/trevinstein] Do people still use plain ReLUs? I have always found leaky or parametric variations of it to be superior. Lately I have been using ELUs.
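
• A minimal sketch of the early-stopping-on-validation-loss heuristic discussed in the thread above (stop when the validation loss has not improved for `patience` epochs); `train_one_epoch` and `validation_loss` are hypothetical user-supplied callables.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=200, patience=10):
    best_loss, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_loss = validation_loss()
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch   # checkpoint the model here in practice
        elif epoch - best_epoch >= patience:
            break                                     # no improvement for `patience` epochs
    return best_epoch, best_loss
```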

Papers - Optimization:

• Arjovsky M [Bengio Y] (2015) Unitary Evolution Recurrent Neural Networks. arXiv:1511.06464  |  GitHub  |  reddit ("uRNN outperforms LSTMs" - extensive discussion!)
• Recurrent neural networks (RNNs) are notoriously difficult to train. When the eigenvalues of the hidden to hidden weight matrix deviate from absolute value 1, optimization becomes difficult due to the well studied issue of vanishing and exploding gradients, especially when trying to learn long-term dependencies. To circumvent this problem, we propose a new architecture that learns a unitary weight matrix, with eigenvalues of absolute value exactly 1. The challenge we address is that of parametrizing unitary matrices in a way that does not require expensive computations (such as eigendecomposition) after each weight update. We construct an expressive unitary weight matrix by composing several structured matrices that act as building blocks with parameters to be learned. Optimization with this parameterization becomes feasible only when considering hidden states in the complex domain. We demonstrate the potential of this architecture by achieving state of the art results in several hard tasks involving very long-term dependencies.

• Barone AVM (ACL 2016) Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. arXiv:1608.02996  |  GitHub  |  GitXiv

• Current approaches to learning vector representations of text that are compatible between different languages usually require some amount of parallel text, aligned at word, sentence or at least document level. We hypothesize however, that different natural languages share enough semantic structure that it should be possible, in principle, to learn compatible vector representations just by analyzing the monolingual distribution of words. In order to evaluate this hypothesis, we propose a scheme to map word vectors trained on a source language to vectors semantically compatible with word vectors trained on a target language using an adversarial autoencoder. We present preliminary qualitative results and discuss possible future developments of this technique, such as applications to cross-lingual sentence representations.

• Author (AVM Barone) on reddit (u/AnvaMiba): [1608.02996] Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders:

• In this preliminary work I try to learn a transformation of word embeddings from one language (e.g. English) to another language (e.g. Italian) without using any parallel dataset.

My hypothesis is that this should be possible because languages are assumed to have a hidden vector-like "concept" space (of which word embeddings are a crude approximation, although it may make more sense to consider sentence or document embeddings) and if different languages are used to talk about similar themes, the stochastic processes that generate these latent representations should be near isomorphic.

So my general idea is to use generative adversarial networks (GANs) to learn to match word embedding distributions: instead of transforming from Gaussian noise to images, as it is usually done in GAN papers, I transform from English embeddings to Italian embeddings.

Unfortunately this basic setup doesn't work since training ends up in the pathological state where the generator collapses everything into a single output vector, a known problem of GANs which I think becomes even worse in my case since I use point-mass probability distributions instead of truly continuous ones.

Hence I use adversarial autoencoders (AAEs): I add a decoder that tries to reconstruct English embeddings from the artificial Italian embeddings produced by the generator, using cosine dissimilarity as a reconstruction loss.

Using a few tricks to aid optimization (a ResNet leaky relu discriminator with batch normalization to increase the magnitude of the gradient being backpropagated to the generator) I managed to make the model learn.

Qualitatively, it approximately learns some frequent mappings, but overall it is not competitive with cross-lingual embedding approaches that make use of parallel resources. I don't know if it is just a matter of architecture/hyperparameters or if I have already hit a fundamental limit of how much semantic transfer can be done by using only monolingual data.

Comments, suggestions, criticism are welcome. Also, if you are at ACL 2016 in Berlin, I will present this work as a poster today (Aug 11) in the REPL4NLP workshop.
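
• A rough PyTorch sketch (my reconstruction, not the author's released code) of the setup described above: a generator maps English word embeddings into the Italian embedding space, a discriminator tries to distinguish generated from real Italian embeddings, and a decoder reconstructs the English embedding under a cosine-dissimilarity loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 300
G   = nn.Linear(dim, dim)                       # English -> "Italian" mapping
Dec = nn.Linear(dim, dim)                       # "Italian" -> English reconstruction
D   = nn.Sequential(nn.Linear(dim, 512), nn.LeakyReLU(0.2), nn.Linear(512, 1))

opt_gd = torch.optim.Adam(list(G.parameters()) + list(Dec.parameters()), lr=1e-4)
opt_d  = torch.optim.Adam(D.parameters(), lr=1e-4)
bce    = nn.BCEWithLogitsLoss()

def train_step(en_batch, it_batch):             # batches of word embeddings
    fake_it = G(en_batch)

    # 1) discriminator: real Italian vs. generated "Italian" embeddings
    d_loss = bce(D(it_batch), torch.ones(len(it_batch), 1)) + \
             bce(D(fake_it.detach()), torch.zeros(len(en_batch), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) generator + decoder: fool the discriminator and reconstruct English
    adv_loss = bce(D(fake_it), torch.ones(len(en_batch), 1))
    rec_loss = (1 - F.cosine_similarity(Dec(fake_it), en_batch)).mean()
    g_loss = adv_loss + rec_loss
    opt_gd.zero_grad(); g_loss.backward(); opt_gd.step()
    return d_loss.item(), g_loss.item()
```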

• Baydin AG (2015) Automatic differentiation in machine learning: a survey. arXiv:1502.05767

• Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD) is a technique for calculating derivatives of numeric functions expressed as computer programs efficiently and accurately, used in fields such as computational fluid dynamics, nuclear engineering, and atmospheric sciences. Despite its advantages and use in other fields, machine learning practitioners have been little influenced by AD and make scant use of available tools. We survey the intersection of AD and machine learning, cover applications where AD has the potential to make a big impact, and report on some recent developments in the adoption of this technique. We aim to dispel some misconceptions that we contend have impeded the use of AD within the machine learning community.

• Cited by Andrej Karpathy in his Stanford cs231n [Spring 2016] 'CNN For Visual Recognition' course in his excellent backpropagation/optimization lecture notes.
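
• A toy forward-mode automatic-differentiation sketch using dual numbers, just to illustrate the idea surveyed by Baydin et al.; production AD systems (reverse mode in particular) are considerably more involved.

```python
import math

class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot            # value and derivative part
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def sin(x):                                      # lift math.sin to dual numbers
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)

def derivative(f, x):
    return f(Dual(x, 1.0)).dot                   # seed dx/dx = 1

# d/dx [ x*sin(x) + 3x ] at x = 2.0
print(derivative(lambda x: x * sin(x) + 3 * x, 2.0))
```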

• Belanger D [McCallum A] (2015) Structured Prediction Energy Networks. arXiv:1511.06350

• We introduce structured prediction energy networks (SPENs), a flexible framework for structured prediction. A deep architecture is used to define an energy function of candidate labels, and then predictions are produced by using back-propagation to iteratively optimize the energy with respect to the labels. This deep architecture captures dependencies between labels that would lead to intractable graphical models, and performs structure learning by automatically learning discriminative features of the structured output. One natural application of our technique is multi-label classification, which traditionally has required strict prior assumptions about the interactions between labels to ensure tractable learning and prediction. We are able to apply SPENs to multi-label problems with substantially larger label sets than previous applications of structured prediction, while modeling high-order interactions using minimal structural assumptions. Overall, deep learning provides remarkable tools for learning features of the inputs to a prediction problem, and this work extends these techniques to learning features of structured outputs. Our experiments provide impressive performance on a variety of benchmark multi-label classification tasks, demonstrate that our technique can be used to provide interpretable structure learning, and illuminate fundamental trade-offs between feed-forward and iterative structured prediction.

• Cited in this blog post: A quick comment on structured input vs structured output learning:

... The observation is that these two problems are essentially the same thing. That is, if you know how to do the structured input problem, then the structured output problem is essentially the same thing, as far as the learning problem goes. That is, if you can put structure in f(x) for structured input, you can just as well put structure in s(x,y) for structured output. Or, by example, if you can predict the fluency of an English sentence x as a structured input problem, you can predict the translation quality of a French/English sentence pair x,y in a structured output problem. This doesn't solve the argmax problem -- you have to do that separately -- but the underlying learning problem is essentially identical.

You see similar ideas being reborn these days with papers like David Belanger's ICML paper this year on energy networks. With this framework of think-of-structured-input-and-structured-output-as-the-same, basically what they're doing is building a structured score function that uses both the input and output simultaneously, and throwing these through a deep network. ...
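
• A minimal sketch of SPEN-style inference: given a fixed energy network E(x, y), relax the labels y to [0, 1] and minimize the energy by gradient descent with respect to the labels. The toy energy network below is a made-up stand-in, not the architecture from the paper.

```python
import torch
import torch.nn as nn

class ToyEnergy(nn.Module):
    def __init__(self, x_dim, y_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim + y_dim, hidden),
                                 nn.Softplus(), nn.Linear(hidden, 1))
    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def spen_predict(energy, x, y_dim, steps=50, lr=0.1):
    logits = torch.zeros(x.size(0), y_dim, requires_grad=True)
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        y = torch.sigmoid(logits)               # keep the relaxed labels in [0, 1]
        loss = energy(x, y).sum()               # lower energy = better labeling
        opt.zero_grad(); loss.backward(); opt.step()
    return (torch.sigmoid(logits) > 0.5).float()
```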

• Bengio Y (2007) Greedy layer-wise training of deep networks.  pdf
• Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

• highly regarded in the ML community: Boyd S & Vandenberghe L (2004) Convex Optimization.  [pdf; Cambridge University Press; 730 pp]
• Convex optimization problems arise frequently in many different fields. This book provides a comprehensive introduction to the subject, and shows in detail how such problems can be solved numerically with great efficiency. The focus of the book is on recognizing convex optimization problems and then finding the most appropriate technique for solving them. It contains many worked examples and homework exercises and will appeal to students, researchers and practitioners in fields such as engineering, computer science, mathematics, statistics, finance, and economics.

• Brochu E [De Freitas N] (2010) A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv:1012.2599  |  GitHub  |  GitXiv

• We present a tutorial on Bayesian optimization, a method of finding the maximum of expensive cost functions. Bayesian optimization employs the Bayesian technique of setting a prior over the objective function and combining it with evidence to get a posterior function. This permits a utility-based selection of the next observation to make on the objective function, which must take into account both exploration (sampling from areas of high uncertainty) and exploitation (sampling areas likely to offer improvement over the current best observation). We also present two detailed extensions of Bayesian optimization, with experiments -- active user modelling with preferences, and hierarchical reinforcement learning -- and a discussion of the pros and cons of Bayesian optimization based on our experiences.
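
• Hedged sketch of a single Bayesian-optimization step with a Gaussian-process surrogate and the expected-improvement acquisition function, roughly in the spirit of the tutorial (maximization form; candidate points are sampled uniformly at random rather than optimized).

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def expected_improvement(mu, sigma, best, xi=0.01):
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - best - xi) / sigma
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

def propose_next(X_obs, y_obs, bounds, n_candidates=1000):
    # Fit the GP surrogate to the observations made so far.
    gp = GaussianProcessRegressor(normalize_y=True).fit(X_obs, y_obs)
    lo, hi = bounds
    candidates = np.random.uniform(lo, hi, size=(n_candidates, X_obs.shape[1]))
    mu, sigma = gp.predict(candidates, return_std=True)
    ei = expected_improvement(mu, sigma, y_obs.max())
    return candidates[np.argmax(ei)]            # next configuration to evaluate
```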

• Related? Jamshidi P (2016) An Uncertainty-Aware Approach to Optimal Configuration of Stream Processing Systems. arXiv:1606.06543  |  GitHub  |  GitXiv

• Finding optimal configurations for Stream Processing Systems (SPS) is a challenging problem due to the large number of parameters that can influence their performance and the lack of analytical models to anticipate the effect of a change. To tackle this issue, we consider tuning methods where an experimenter is given a limited budget of experiments and needs to carefully allocate this budget to find optimal configurations. We propose in this setting Bayesian Optimization for Configuration Optimization (BO4CO), an auto-tuning algorithm that leverages Gaussian Processes (GPs) to iteratively capture posterior distributions of the configuration spaces and sequentially drive the experimentation. Validation based on Apache Storm demonstrates that our approach locates optimal configurations within a limited experimental budget, with an improvement of SPS performance typically of at least an order of magnitude compared to existing configuration algorithms.

• Chen T (2016) Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174  |  reddit  |  GitHub
• We propose a systematic approach to reduce the memory consumption of deep neural network training. Specifically, we design an algorithm that costs O(sqrt(n)) memory to train a n layer network, with only the computational cost of an extra forward pass per mini-batch. As many of the state-of-the-art models hit the upper bound of the GPU memory, our algorithm allows deeper and more complex models to be explored, and helps advance the innovations in deep learning research. We focus on reducing the memory cost to store the intermediate feature maps and gradients during training. Computation graph analysis is used for automatic in-place operation and memory sharing optimizations. We show that it is possible to trade computation for memory - giving a more memory efficient training algorithm with a little extra computation cost. In the extreme case, our analysis also shows that the memory consumption can be reduced to O(log n) with as little as O(n log n) extra cost for forward computation. Our experiments show that we can reduce the memory cost of a 1,000-layer deep residual network from 48G to 7G with only 30 percent additional running time cost on ImageNet problems. Similarly, significant memory cost reduction is observed in training complex recurrent neural networks on very long sequences.
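
• The same compute-for-memory trade-off is available off the shelf as gradient checkpointing in PyTorch; a minimal usage sketch (not the authors' MXNet implementation):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of simple blocks, standing in for a very deep network.
model = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU())
                        for _ in range(100)])
x = torch.randn(32, 512, requires_grad=True)

# Split the stack into ~sqrt(n) segments; activations inside each segment are
# recomputed during the backward pass instead of being stored.
out = checkpoint_sequential(model, 10, x)
out.sum().backward()
```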

• Collobert R [Weston J; Bottou L] (2011) Natural language processing (almost) from scratch. [pdf]  |  the "SENNA" (Semantic/syntactic Extraction using a Neural Network Architecture) system  |  reddit

• We propose a unified neural network architecture and learning algorithm that can be applied to various natural language processing tasks including part-of-speech tagging, chunking, named entity recognition, and semantic role labeling. This versatility is achieved by trying to avoid task-specific engineering and therefore disregarding a lot of prior knowledge. Instead of exploiting man-made input features carefully optimized for each task, our system learns internal representations on the basis of vast amounts of mostly unlabeled training data. This work is then used as a basis for building a freely available tagging system with good performance and minimal computational requirements.

• Discussed here: Having spent much of last week looking at non-goal driven dialogue systems trained end-to-end, today it's time to turn our attention to some of the building blocks of natural language processing that a chatbot can take advantage of if you're capturing intent (for example, to initiate actions) or generating your own responses.

Collobert et al. describe four standard NLP tasks, each of which has well-established benchmarks in the community. As of 2011, the state of the art in these tasks used researcher-discovered, task-specific intermediate representations (features) based on a large body of linguistic knowledge. The authors set out to build a system that could excel across multiple benchmarks without needing task-specific representations or engineering. ... And yes, the neural networks once again turn out to learn representations as good as, or better than, those human experts can design by hand. Though a little bit of human guidance adds the icing on the cake.

(1) Part-of-speech tagging (POS): labels each word with a tag that indicates its syntactic role in a sentence [noun, verb, adverb, ...];

(2) Chunking: labels phrases or segments within a sentence with tags that indicate their syntactic role; e.g. noun phrase (NP), verb phrase (VP), ...;

(3) Named-entity recognition (NER): labels recognised entities within the sentence, e.g. as a person, location, date, time, company, ...;

(4) Semantic-role labeling (SRL): "gives a semantic role to a syntactic constituent of a sentence."

... A typical SRL system may involve several stages: producing a parse tree, identifying which parse tree nodes represent the arguments of a given verb, and then classifying them to compute the final labels. Koomen et al. (2005) [pdf] achieved a 77.92% F1 score. SENNA achieves 75.49%, but 10x faster and using an order-of-magnitude less RAM. POS is the simplest of these tasks, and SRL the most complex. The more complex the task, the more feature engineering has traditionally been required to perform well in it. ...

Doing away with hand-engineered features

"All the NLP tasks above can be seen as tasks assigning labels to words. The traditional NLP approach is: extract from the sentence a rich set of hand-designed features which are then fed to a standard classification algorithm, e.g. a Support Vector Machine (SVM), often with a linear kernel. The choice of features is a completely empirical process, mainly based first on linguistic intuition, and then trial and error, and the feature selection is task dependent, implying additional research for each new NLP task. Complex tasks like SRL then require a large number of possibly complex features (e.g., extracted from a parse tree) which can impact the computational cost which might be important for large-scale applications or applications requiring real-time response. Instead, we advocate a radically different approach: as input we will try to pre-process our features as little as possible and then use a multilayer neural network (NN) architecture, trained in an end-to-end fashion."

Network Architecture

Collobert et al. experimented with two different network architectures: one using a sliding window approach to combining words, and one using a convolutional network [CNN] layer. In both cases, these combining layers are fed by word vectors. This paper pre-dates the work by Mikolov et al. on word vectors, and instead looks up a one-hot word vector representation in a series of lookup tables (one for each word feature) and concatenates the results to give the final vector representation. The lookup table feature vectors are in turn trained by backpropagation. I found the paper a little light on the details of these features and their training (there are various snippets of information scattered across the 47 pages). However, the authors say " Ideally, we would like semantically similar words to be close in the embedding space represented by the word lookup table: by continuity of the neural network function, tags produced on semantically similar sentences would be similar." This is precisely the property of the word vectors introduced by Mikolov et al., and in GloVe, ...
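
• A rough sketch of the sliding-window tagger idea: look up an embedding for each word in a fixed window around the target word, concatenate, and feed an MLP that scores the tag of the centre word. Dimensions and details here are illustrative rather than the paper's.

```python
import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    def __init__(self, vocab_size, n_tags, emb_dim=50, window=5, hidden=300):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)     # the trainable lookup table
        self.mlp = nn.Sequential(nn.Linear(window * emb_dim, hidden),
                                 nn.Tanh(), nn.Linear(hidden, n_tags))
    def forward(self, window_ids):                       # (batch, window) word indices
        e = self.emb(window_ids)                         # (batch, window, emb_dim)
        return self.mlp(e.flatten(start_dim=1))          # tag scores for the centre word
```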

...
This scoring system is called 'sentence-level log-likelihood.'
...
The final optimised version of the system is called SENNA ...

• Follow-on blog post, re: Zhou J & Xu W (2015) End-To-End Learning of Semantic Role Labeling Using Recurrent Neural Networks [LSTM; pdf]

Collobert's 2011 paper that we looked at yesterday [above] represented a turning point in NLP in which they achieved state of the art performance on part-of-speech tagging (POS), chunking, and named entity recognition (NER) using a neural network in place of expert crafted systems and algorithms. For the semantic role labeling (SRL) task though, Collobert et al. had to resort to including parsing features. With today's paper, that final hold-out task also falls to the power of neural networks, and the authors (from Baidu research) achieve state-of-the-art performance taking only original text as input features. They out-perform previous state-of-the-art systems, that were based on parsing results and feature engineering, and which relied heavily on linguistic knowledge from experts.

Zhou & Xu's solution uses an 8 layer bi-directional RNN (an LSTM to be precise). Using an LSTM architecture enables cells to store and access information over long periods of time. We saw last week the technique of processing sequences both forward and backward, and combining the results in some way (e.g. concatenation):

"In this work, we utilize the bi-directional information in another way. First a standard LSTM processes the sequence in (a) forward direction. The output of this LSTM layer is taken by the next LSTM layer as input, processed in (the) reverse direction. These two standard LSTM layers compose a pair of LSTMs. Then we stack LSTM layers pair after pair to obtain the deep LSTM model. We call this topology a deep bi-directional LSTM (DB-LSTM) network. Our experiments show that this architecture is critical to achieve good performance." ...

• Dean J [Ng AY | Google] (2012) Large scale distributed deep networks.  |  reddit  |  notes
• Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, achieving state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly-sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.

• Donahue J (2015) Long-term recurrent convolutional networks for visual recognition and description. arXiv:1411.4389  |  reddit  |  Problems reproducing LSTM classification results? [reddit]
• Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are "doubly deep" in that they can be compositional in spatial and temporal "layers". Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they directly can map variable-length inputs (e.g., video frames) to variable length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.

• Dong C (2014) Image super-resolution using deep convolutional networks. arXiv:1501.00092  |  GitXiv
• We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end mapping between the low/high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) that takes the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based SR methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component separately, our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art restoration quality, and achieves fast speed for practical on-line usage. We explore different network structures and parameter settings to achieve trade-offs between performance and speed. Moreover, we extend our network to cope with three color channels simultaneously, and show better overall reconstruction quality.

• Feurer M (NIPS 2015) Efficient and robust automated machine learning. pdf

• The success of machine learning in a broad range of applications has led to an ever-growing demand for machine learning systems that can be used off the shelf by non-experts. To be effective in practice, such systems need to automatically choose a good algorithm and feature preprocessing steps for a new dataset at hand, and also set their respective hyperparameters. Recent work has started to tackle this automated machine learning (AutoML) problem with the help of efficient Bayesian optimization methods. In this work we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters). This system, which we dub AUTO-SKLEARN, improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization. Our system won the first phase of the ongoing ChaLearn AutoML challenge, and our comprehensive analysis on over 100 diverse datasets shows that it substantially outperforms the previous state of the art in AutoML. We also demonstrate the performance gains due to each of our contributions and derive insights into the effectiveness of the individual components of AUTO-SKLEARN.

• Contest Winner: Winning the AutoML Challenge with Auto-sklearn [KDNuggets.com]

• Fu J (2016) DrMAD: Distilling Reverse-Mode Automatic Differentiation for Optimizing Hyperparameters of Deep Neural Networks. arXiv:1601.00917  |  GitHub  |  GitXiv  |  discussion, critique: reddit
• The performance of deep neural networks is well-known to be sensitive to the setting of their hyperparameters. Recent advances in reverse-mode automatic differentiation allow for optimizing hyperparameters with gradients. The standard way of computing these gradients involves a forward and backward pass of computations. However, the backward pass usually needs to consume unaffordable memory to store all the intermediate variables to exactly reverse the forward training procedure. In this work we propose a simple but effective method, DrMAD, to distill the knowledge of the forward pass into a shortcut path, through which we approximately reverse the training trajectory. Experiments on several image benchmark datasets show that DrMAD is at least 45 times faster and consumes 100 times less memory compared to state-of-the-art methods for optimizing hyperparameters with minimal compromise to its effectiveness. To the best of our knowledge, DrMAD is the first research attempt to make it practical to automatically tune thousands of hyperparameters of deep neural networks. The code can be downloaded from GitHub.

• Gennaro C (2016) Large Scale Deep Convolutional Neural Network Features Search with Lucene. arXiv:1603.09687
• In this work, we propose an approach to index Deep Convolutional Neural Network Features to support efficient content-based retrieval on large image databases. To this aim, we have converted these features into a textual form, to index them into an inverted index by means of Lucene. In this way, we were able to set up a robust retrieval system that combines full-text search with content-based image retrieval capabilities. We evaluated different strategies of textual representation in order to optimize the index occupation and the query response time. In order to show that our approach is able to handle large datasets, we have developed a web-based prototype that provides an interface for combined textual and visual searching on a dataset of about 100 million images.

• Goodfellow IJ [Vinyals O; Saxe AM | Google] (2014) Qualitatively characterizing neural network optimization problems. arXiv:1412.6544
• Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.

• 'Maxout' activation function:  Goodfellow IJ [Courville A; Bengio Y] (2013) Maxout networks. arXiv:1302.4389  |  webpage  |  cited here [Andrej Karpathy's cs231n CNN course]  |  reddit
• We consider the problem of designing models to leverage a recently introduced approximate model averaging technique called dropout. We define a simple new model called maxout (so named because its output is the max of a set of inputs, and because it is a natural companion to dropout) designed to both facilitate optimization by dropout and improve the accuracy of dropout's fast approximate model averaging technique. We empirically verify that the model successfully accomplishes both of these tasks. We use maxout and dropout to demonstrate state of the art classification performance on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN.
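
• A tiny sketch of a maxout unit: a linear layer produces k "pieces" per output unit and the activation is the maximum over each group of k.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    def __init__(self, in_features, out_features, k=2):
        super().__init__()
        self.k, self.out_features = k, out_features
        self.linear = nn.Linear(in_features, out_features * k)
    def forward(self, x):
        # Reshape to (..., out_features, k) and take the max over the k pieces.
        z = self.linear(x).view(*x.shape[:-1], self.out_features, self.k)
        return z.max(dim=-1).values
```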

• Güçlütürk Y (2016) Convolutional Sketch Inversion. arXiv:1606.03073  |  MIT Technology Review  |  reddit
• In this paper, we use deep neural networks for inverting face sketches to synthesize photorealistic face images. We first construct a semi-simulated dataset containing a very large number of computer-generated face sketches with different styles and corresponding face images by expanding existing unconstrained face data sets. We then train models achieving state-of-the-art results on both computer-generated sketches and hand-drawn sketches by leveraging recent advances in deep learning such as batch normalization, deep residual learning, perceptual losses and stochastic optimization in combination with our new dataset. We finally demonstrate potential applications of our models in fine arts and forensic arts. In contrast to existing patch-based approaches, our deep-neural-network-based approach can be used for synthesizing photorealistic face images by inverting face sketches in the wild.

• Gulcehre C [Yoshua Bengio] (2016) Mollifying Networks. arXiv:1608.04980  |  reddit: "This paper [v.1] seems pretty unpolished."  |  reddit
• The optimization of deep neural networks can be more challenging than traditional convex optimization problems due to the highly non-convex nature of the loss function, e.g. it can involve pathological landscapes such as saddle-surfaces that can be difficult to escape for algorithms based on simple gradient descent. In this paper, we attack the problem of optimization of highly non-convex neural networks by starting with a smoothed -- or mollified -- objective function that gradually has a more non-convex energy landscape during the training. Our proposition is inspired by the recent studies in continuation methods: similar to curriculum methods, we begin learning an easier (possibly convex) objective function and let it evolve during the training, until it eventually goes back to being the original, difficult to optimize, objective function. The complexity of the mollified networks is controlled by a single hyperparameter which is annealed during the training. We show improvements on various difficult optimization tasks and establish a relationship with recent works on continuation methods for neural networks and mollifiers.

• Gulcehre C [Bengio Y] (2016) Noisy Activation Functions. arXiv:1603.00391
• Common activation functions used in NNs can lead to training difficulties due to the saturation behavior of the activation function, which may hide dependencies that are not visible to first order (using only gradients). Gating mechanisms ... are good examples of this. We propose to exploit the injection of appropriate noise so that some gradients may sometimes flow, even if the noiseless application of the activation function would yield zero gradient. Large noise will dominate the noise-free gradient and allow stochastic gradient descent to be more exploratory. By adding noise only to the problematic parts of the activation function we allow the optimization procedure to explore the boundary between the degenerate (saturating) and the well-behaved parts of the activation function. We also establish connections to simulated annealing, when the amount of noise is annealed down, making it easier to optimize hard objective functions. We find experimentally that replacing such saturating activation functions by noisy variants helps training in many contexts, yielding state-of-the-art results on several datasets, especially when training seems to be the most difficult, e.g., when curriculum learning is necessary to obtain good results.

• Henaff M [Yann LeCun] (2016) [RNNs store different types of information in their hidden states] Orthogonal RNNs and Long-Memory Tasks. arXiv:1602.06662

• In this work, we carefully analyze two synthetic datasets originally outlined in (Hochreiter & Schmidhuber, 1997) which are used to evaluate the ability of RNNs to store information over many time steps. We explicitly construct RNN solutions to these problems, and using these constructions, illuminate both the problems themselves and the way in which RNNs store different types of information in their hidden states. These constructions furthermore explain the success of recent methods that specify unitary initializations or constraints on the transition matrices.

• Conclusion: In this work, we analyzed two standard synthetic long-term memory problems and provided explicit RNN solutions for them. We found that the (fixed length T) copy problem can be solved with an RNN with a transition matrix that is a T+S root of the identity matrix I , and whose eigenvalues are well distributed on the unit circle, and we remarked that random orthogonal matrices almost satisfy this description. We also saw that the addition problem can be solved with I as a transition matrix. We showed that correspondingly, initializing with I allows a linear-transition RNN to easily be optimized for solving the addition task, and initializing with random orthogonal matrix allows easy optimization for the copy task; but that flipping these leads to poor results. Finally, we showed how one can use l2 pooling to allow the model to make the decision between the two regimes.

• discussion here: reddit:

• I really didn't get this paper. Like, what's happening? Maybe someone with a lot more experience can share their understanding.

• In recent papers, many have been using copy tasks and addition tasks to show that their models have better memory and reasoning. In this paper, the authors write an explicit closed-form solution to these tasks, and clearly explain why the recent models worked. It also implies that these tasks aren't very useful for making general claims, so people should probably stop using them.

• They develop two highly engineered RNNs, one of which solves the copy task very well and the other the addition task. But when the tasks are swapped, both struggle.

How does this expose the limitations of the tasks? The tasks were primarily designed to see if LSTM could overcome vanishing gradients and learn over long time lags between events. That is, a randomly initialized RNN has to learn this from scratch. So isn't it fair to say that these tasks are like sanity checks (or a fundamental requirement) for any new RNN architecture?
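
• A small sketch of the two recurrent-weight initializations discussed above: the identity matrix (as in IRNN, and suited to the addition task) and a random orthogonal matrix (close to what works for the copy task), the latter obtained from a QR decomposition of a Gaussian matrix.

```python
import numpy as np

def identity_init(n):
    return np.eye(n)

def random_orthogonal_init(n):
    q, r = np.linalg.qr(np.random.randn(n, n))
    return q * np.sign(np.diag(r))              # fix column signs so Q is uniformly distributed

W = random_orthogonal_init(128)
print(np.allclose(W @ W.T, np.eye(128)))        # orthogonal: eigenvalues lie on the unit circle
```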

• Ida Y (2016) Controlling Exploration Improves Training For Deep Neural Networks.  |  arXiv:1605.09593
• Stochastic optimization methods are widely used for training deep neural networks. However, it is still a challenging research problem to achieve effective training by using stochastic optimization methods. This is due to the difficulty of finding good parameters on a loss function that has many saddle points. In this paper, we propose a stochastic optimization method called STDProp for effective training of deep neural networks. Its key idea is to effectively explore parameters on a complex surface of a loss function. We additionally develop a momentum version of STDProp. While our approaches are easy to implement and memory-efficient, they are more effective than other practical stochastic optimization methods for deep neural networks.

• Ilievski I (2016) Hyperparameter Optimization of Deep Neural Networks Using Non-Probabilistic RBF Surrogate Model. arXiv:1607.08316
• Recently, Bayesian optimization has been successfully applied for optimizing hyperparameters of deep neural networks, significantly outperforming the expert-set hyperparameter values. The methods approximate and minimize the validation error as a function of hyperparameter values through probabilistic models like Gaussian processes. However, probabilistic models that require a prior distribution of the errors may not be adequate for approximating very complex error functions of deep neural networks. In this work, we propose to employ radial basis functions as the surrogate of the error functions for optimizing both continuous and integer hyperparameters. The proposed non-probabilistic algorithm, called Hyperparameter Optimization using RBF and DYCORS (HORD), searches the surrogate for the most promising hyperparameter values while providing a good balance between exploration and exploitation. Extensive evaluations demonstrate HORD significantly outperforms the well-established Bayesian optimization methods such as Spearmint and TPE, both in terms of finding a near optimal solution with fewer expensive function evaluations, and in terms of a final validation error. Further, HORD performs equally well in low- and high-dimensional hyperparameter spaces, and by avoiding expensive covariance computation can also scale to a high number of observations.

• Im DJ (2016) Generating images with recurrent adversarial networks. arXiv:1602.05110  |  reddit  |  GitXiv
• Gatys et al. (2015) showed that optimizing pixels to match features in a convolutional network with respect to reference image features is a way to render images of high visual quality. We show that unrolling this gradient-based optimization yields a recurrent computation that creates images by incrementally adding onto a visual "canvas". We propose a recurrent generative model inspired by this view, and show that it can be trained using adversarial training to generate very good image samples. We also propose a way to quantitatively compare adversarial networks by having the generators and discriminators of these networks compete against each other.

• Ithapu VK (2015) On the interplay of network structure and gradient convergence in deep learning. arXiv:1511.05297  |  local copy

• The regularization and output consistency behavior of dropout and layer-wise pretraining for learning deep networks have been fairly well studied. However, our understanding of how the asymptotic convergence of backpropagation in deep architectures is related to the structural properties of the network and other design choices (like denoising and dropout rate) is less clear at this time. An interesting question one may ask is whether the network architecture and input data statistics may guide the choices of learning parameters and vice versa. In this work, we explore the association between such structural, distributional and learnability aspects vis-a-vis their interaction with parameter convergence rates. We present a framework to address these questions based on the backpropagation convergence for general nonconvex objectives using first-order information. This analysis suggests an interesting relationship between feature denoising and dropout. Building upon the results, we obtain a setup that provides systematic guidance regarding the choice of learning parameters and network sizes that achieve a certain level of convergence (in the optimization sense) often mediated by statistical attributes of the inputs. Our results are supported by a set of experiments we conducted as well as independent empirical observations reported by other groups in recent papers.

• Discussion - Conclusions

• Johnson J [Karpathy A; Fei-Fei L] (2015) DenseCap: Fully Convolutional Localization Networks for Dense Captioning. arXiv:1511.07571  |  reddit

• We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and Image Captioning when one predicted region covers the full image. To address the localization and description task jointly we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external regions proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state of the art approaches in both generation and retrieval settings.

• very cool: DenseCap: Fully Convolutional Localization Networks for Dense Captioning  |  Stanford: Andrej Karpathy; Fei-Fei Li  |  reddit

• Johnson J [Fei-Fei L; Stanford] (2016) Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv:1603.08155  |  reddit

• We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al in real-time. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.

• mentioned here [reddit]

• Kang Z (2016) Top-n recommender system via matrix completion. arXiv:1601.04800  |  GitHub  |  GitXiv
• Top-N recommender systems have been investigated widely both in industry and academia. However, the recommendation quality is far from satisfactory. In this paper, we propose a simple yet promising algorithm. We fill the user-item matrix based on a low-rank assumption and simultaneously keep the original information. To do that, a nonconvex rank relaxation rather than the nuclear norm is adopted to provide a better rank approximation and an efficient optimization strategy is designed. A comprehensive set of experiments on real datasets demonstrates that our method pushes the accuracy of Top-N recommendation to a new level.

• Kingma D (2016) Adam: A Method for Stochastic Optimization. arXiv:1412.6980

• We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.

• Mentioned in this What is a good reference for RMSprop method? reddit thread
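
• A minimal NumPy sketch of the Adam update rule (bias-corrected first- and second-moment estimates) for a single parameter vector; hyperparameter defaults follow the paper.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2           # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)                   # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```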

• Klein A (2016) Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. arXiv:1605.07079
• Bayesian optimization has become a successful tool for hyperparameter optimization of machine learning algorithms, such as support vector machines or deep neural networks. But it is still costly if each evaluation of the objective requires training and validating the algorithm being optimized, which, for large datasets, often takes hours, days, or even weeks. To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size, which is learned during the optimization process and allows exploration of preliminary configurations on small subsets, by extrapolating to the full dataset. We construct a Bayesian optimization procedure, dubbed FABOLAS, which models loss and training time as a function of dataset size and automatically trades off high information gain about the global optimum against computational cost. Experiments optimizing support vector machines and deep neural networks show that FABOLAS often finds high-quality solutions 10 to 100 times faster than other state-of-the-art Bayesian optimization methods.

• Laina I (2016) Deeper Depth Prediction with Fully Convolutional Residual Networks. arXiv:1606.00373  |  GitXiv
• This paper addresses the problem of estimating the depth map of a scene given a single RGB image. To model the ambiguous mapping between monocular images and depth maps, we leverage on deep learning capabilities and present a fully convolutional architecture encompassing residual learning. The proposed model is deeper than the current state of the art, but contains fewer parameters and requires less training data, while still outperforming all current CNN approaches aimed at the same task. We further present a novel way to efficiently learn feature map up-sampling within the network. For optimization we introduce the reverse Huber loss, particularly suited for the task at hand and driven by the value distributions commonly present in depth maps. The predictions are given by a single architecture, trained end-to-end, that does not rely on post-processing techniques, such as CRFs or other additional refinement steps.
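
• A hedged sketch of the reverse Huber ("berHu") loss as it is commonly defined: L1 for small residuals and scaled L2 for large ones, with the threshold c set from the batch (e.g. 20% of the maximum absolute residual); this is my paraphrase of the loss, not the authors' code.

```python
import torch

def berhu_loss(pred, target):
    r = (pred - target).abs()
    c = (0.2 * r.max()).clamp(min=1e-6).detach()   # batch-dependent threshold
    l2 = (r ** 2 + c ** 2) / (2 * c)               # quadratic branch for large residuals
    return torch.where(r <= c, r, l2).mean()
```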

• LeCun YA (2012) Efficient backprop [ pdf: 44 pp ]

• The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.

• Keras (GitHub) [re: this paper] "LeCun has long argued that the result obtained with stochastic learning is almost always better, thanks to the random noise it introduces."

• Levy O [Goldberg Y] (2015). Improving distributional similarity with lessons learned from word embeddings. pdf  |  critical review of VSM, word2vec models: parameterization, implementation ...
• Recent trends suggest that neural network-inspired word embedding models outperform traditional count-based distributional models on word similarity and analogy detection tasks. We reveal that much of the performance gains of word embeddings are due to certain system design choices and hyperparameter optimizations, rather than the embedding algorithms themselves. Furthermore, we show that these modifications can be transferred to traditional distributional models, yielding similar gains. In contrast to prior reports, we observe mostly local or insignificant performance differences between the methods, with no global advantage to any single approach over the others.

• Li L (2016) Efficient Hyperparameter Optimization and Infinitely Many Armed Bandits. arXiv:1603.06560
• Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. While current methods offer efficiencies by adaptively choosing new configurations to train, an alternative strategy is to adaptively allocate resources across the selected configurations. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinitely many armed bandit problem where allocation of additional resources to an arm corresponds to training a configuration on larger subsets of the data. We introduce Hyperband for this framework and analyze its theoretical properties, providing several desirable guarantees. We compare Hyperband with state-of-the-art Bayesian optimization methods and a random search baseline on a comprehensive benchmark including 117 datasets. Our results on this benchmark demonstrate that while Bayesian optimization methods do not outperform random search trained for twice as long, Hyperband in favorable settings offers valuable speedups.

• Kevin Jamieson (2015) Non-stochastic Best Arm Identification and Hyperparameter Optimization. arXiv:1502.07943
• Motivated by the task of hyperparameter optimization, we introduce the non-stochastic best-arm identification problem. Within the multi-armed bandit literature, the cumulative regret objective enjoys algorithms and analyses for both the non-stochastic and stochastic settings while to the best of our knowledge, the best-arm identification framework has only been considered in the stochastic setting. We introduce the non-stochastic setting under this framework, identify a known algorithm that is well-suited for this setting, and analyze its behavior. Next, by leveraging the iterative nature of standard machine learning algorithms, we cast hyperparameter optimization as an instance of non-stochastic best-arm identification, and empirically evaluate our proposed algorithm on this task. Our empirical results show that, by allocating more resources to promising hyperparameter settings, we typically achieve comparable test accuracies an order of magnitude faster than baseline methods.

• Related blog post: Embracing the Random: "... The comparison to Hyperband, on the other hand, is striking. On average, Hyperband finds a decent solution in a fraction of the time of all of the other methods. It also finds the best solution overall in all three cases. On SVHN, it finds the best solution in a fifth of the time of the other methods. And, again, the protocol is just 7 lines of python code. Hyperband is just a first step, and it might not be the ideal solution for your particular workload. But I think these plots nicely illustrate how simple but effective enhancements of random search can go a very long way."  |  image

• Blog post (Hyperband implementation): Hyperparameter optimization by efficient configuration evaluation
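• A minimal sketch of successive halving, the subroutine Hyperband repeats with different trade-offs between the number of configurations and the budget per configuration. `sample_config` and `train_and_eval(config, budget)` are hypothetical placeholders for the user's search space and training routine (budget could be epochs or subset size), not anything from the papers above:

```python
import math, random

def successive_halving(sample_config, train_and_eval, n=81, min_budget=1, eta=3):
    configs = [sample_config() for _ in range(n)]
    budget = min_budget
    while len(configs) > 1:
        losses = [train_and_eval(c, budget) for c in configs]
        k = max(1, len(configs) // eta)        # keep the best 1/eta of the configs
        ranked = sorted(zip(losses, configs), key=lambda pair: pair[0])
        configs = [c for _, c in ranked[:k]]
        budget *= eta                          # give the survivors more budget
    return configs[0]

# toy usage: configurations are learning rates; more budget = less evaluation noise
best_lr = successive_halving(
    sample_config=lambda: 10 ** random.uniform(-4, 0),
    train_and_eval=lambda lr, b: abs(math.log10(lr) + 2) + random.gauss(0, 1.0 / b),
)
```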

• Li C (2016) Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks. arXiv:1604.04382  |  reddit
• This paper proposes Markovian Generative Adversarial Networks (MGANs), a method for training generative neural networks for efficient texture synthesis. While deep neural network approaches have recently demonstrated remarkable results in terms of synthesis quality, they still come at considerable computational costs (minutes of run-time for low-res images). Our paper addresses this efficiency issue. Instead of a numerical deconvolution in previous work, we precompute a feed-forward, strided convolutional network that captures the feature statistics of Markovian patches and is able to directly generate outputs of arbitrary dimensions. Such network can directly decode brown noise to realistic texture, or photos to artistic paintings. With adversarial training, we obtain quality comparable to recent neural texture synthesis methods. As no optimization is required any longer at generation time, our run-time performance (0.25M pixel images at 25Hz) surpasses previous neural texture synthesizers by a significant margin (at least 500 times faster). We apply this idea to texture synthesis, style transfer, and video stylization.

• Liu Q (2016) Part-of-Speech Relevance Weights for Learning Word Embeddings. arXiv:1603.07695
• This paper proposes a model to learn word embeddings with weighted contexts based on part-of-speech (POS) relevance weights. POS is a fundamental element in natural language. However, state-of-the-art word embedding models fail to consider it. This paper proposes to use position-dependent POS relevance weighting matrices to model the inherent syntactic relationship among words within a context window. We utilize the POS relevance weights to model each word-context pair during the word embedding training process. The model proposed in this paper jointly optimizes word vectors and the POS relevance matrices. Experiments conducted on popular word analogy and word similarity tasks all demonstrated the effectiveness of the proposed method.

• Maclaurin D (2015) Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv:1502.03492  |  GitXiv

• Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.

• Mentioned here: Why is the learning process only changing the weights but not the structure?: "... Gradient descent (ie deep learning) has been very successful over the past few years and is the current vanguard of a lot of ML approaches, this is one of the few papers that even attempts to tackle optimizing hyperparameters as a learning objective but has some significant downsides. ..."
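• A toy sketch of the hypergradient idea behind this paper: differentiate the validation loss through a short, unrolled SGD run with respect to a hyperparameter (here the log step size). The paper's contribution is doing this memory-efficiently by exactly reversing SGD-with-momentum; this naive JAX unrolling only illustrates the concept, and all data and names are illustrative:

```python
import jax
import jax.numpy as jnp

def train_loss(w, X, y):
    return jnp.mean((X @ w - y) ** 2)

def val_loss_after_training(log_lr, w0, Xtr, ytr, Xval, yval, steps=20):
    lr = jnp.exp(log_lr)
    w = w0
    for _ in range(steps):                                # unrolled SGD on the training set
        w = w - lr * jax.grad(train_loss)(w, Xtr, ytr)
    return jnp.mean((Xval @ w - yval) ** 2)               # validation loss

hypergrad = jax.grad(val_loss_after_training)             # d(val loss) / d(log learning rate)

# toy usage
true_w = jnp.array([1.0, -2.0, 0.5])
Xtr = jax.random.normal(jax.random.PRNGKey(0), (32, 3)); ytr = Xtr @ true_w
Xval = jax.random.normal(jax.random.PRNGKey(1), (16, 3)); yval = Xval @ true_w
g = hypergrad(jnp.log(0.1), jnp.zeros(3), Xtr, ytr, Xval, yval)
```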

• Marblestone AH (2016) Towards an integration of deep learning and neuroscience. pdf  |  reddit
• Neuroscience has focused on the detailed implementation of computation, studying neural codes, dynamics and circuits. In machine learning, however, artificial neural networks tend to eschew precisely designed codes, dynamics or circuits in favor of brute force optimization of a cost function, often using simple and relatively uniform initial architectures. Two recent developments have emerged within machine learning that create an opportunity to connect these seemingly divergent perspectives. First, structured architectures are used, including dedicated systems for attention, recursion and various forms of short- and long-term memory storage. Second, cost functions and training procedures have become more complex and are varied across layers and over time. Here we think about the brain in terms of these ideas. We hypothesize that (1) the brain optimizes cost functions, (2) these cost functions are diverse and differ across brain locations and over development, and (3) optimization operates within a pre-structured architecture matched to the computational problems posed by behavior. Such a heterogeneously optimized system, enabled by a series of interacting cost functions, serves to make learning data-efficient and precisely targeted to the needs of the organism. We suggest directions by which neuroscience could seek to refine and test these hypotheses.

• Contents:
1. Introduction
2. The brain can optimize cost functions
3. The cost functions are diverse across brain areas and time
4. Optimization occurs in the context of specialized structures
5. Machine learning inspired neuroscience
6. Neuroscience inspired machine learning
7. Conclusions

• ['older;' RNN] Mao J (2014) Explain images with multimodal recurrent neural networks. arXiv:1410.1090.  |  GitHub  |  reddit
• In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

• McMahan HB (2016) Federated Learning of Deep Networks using Model Averaging. arXiv:1602.05629  |  reddit
• Modern mobile devices have access to a wealth of data suitable for learning models, which in turn can greatly improve the user experience on the device. For example, language models can improve speech recognition and text entry, and image models can automatically select good photos. However, this rich data is often privacy sensitive, large in quantity, or both, which may preclude logging to the data-center and training there using conventional approaches. We advocate an alternative that leaves the training data distributed on the mobile devices, and learns a shared model by aggregating locally-computed updates. We term this decentralized approach Federated Learning.
We present a practical method for the federated learning of deep networks that proves robust to the unbalanced and non-IID data distributions that naturally arise. This method allows high-quality models to be trained in relatively few rounds of communication, the principal constraint for federated learning. The key insight is that despite the non-convex loss functions we optimize, parameter averaging over updates from multiple clients produces surprisingly good results, for example decreasing the communication needed to train an LSTM language model by two orders of magnitude.
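• A minimal sketch of the model-averaging idea ("parameter averaging over updates from multiple clients"): each client runs a few local training steps, and the server averages the resulting parameters weighted by client dataset size. The linear-regression clients and update routine here are hypothetical stand-ins, not the paper's setup:

```python
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """A few local gradient-descent epochs of linear regression on one client's data."""
    w = w.copy()
    for _ in range(epochs):
        w -= lr * (2 * X.T @ (X @ w - y) / len(y))
    return w

def federated_round(w_global, client_data):
    updates, sizes = [], []
    for X, y in client_data:                  # in practice only a sampled subset of clients
        updates.append(local_update(w_global, X, y))
        sizes.append(len(y))
    weights = np.array(sizes) / sum(sizes)
    return sum(wt * upd for wt, upd in zip(weights, updates))   # size-weighted average

# toy usage: three clients holding shifted (non-IID) slices of a linear problem
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
clients = [(X, X @ true_w) for X in (rng.normal(size=(20, 2)) + i for i in range(3))]
w = np.zeros(2)
for _ in range(10):                           # communication rounds
    w = federated_round(w, clients)
```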

• Mnih V [Alex Graves; Timothy P. Lillicrap; David Silver; Koray Kavukcuoglu | Google DeepMind] (2016) Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783

• We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. The best performing method, an asynchronous variant of actor-critic, surpasses the current state-of-the-art on the Atari domain while training for half the time on a single multi-core CPU instead of a GPU. Furthermore, we show that asynchronous actor-critic succeeds on a wide variety of continuous motor control problems as well as on a new task involving finding rewards in random 3D mazes using a visual input.

• Stabilising deep reinforcement learning often requires lots of memory and computation because it uses experience replay: the idea of storing and sampling from an agent's historical experience conducting a task. Here, the DeepMind team show that asynchronously running multiple agents in parallel on multiple instances of the environment is instead more effective and less computationally intensive. Check out the video for a demo application of an agent navigating a 3D video game environment.

• "Related:" Mnih A [Rezende DJ | Google] (2016) Variational inference for Monte Carlo objectives. arXiv:1602.06725  |  reddit  |  Expectation of an unbiased estimator (under variational inference setting) [StackExchange:CrossValidated]

• Recent progress in deep latent variable models has largely been driven by the development of flexible and scalable variational inference methods. Variational training of this type involves maximizing a lower bound on the log-likelihood, using samples from the variational posterior to compute the required gradients. Recently, Burda et al. (2016) have derived a tighter lower bound using a multi-sample importance sampling estimate of the likelihood and showed that optimizing it yields models that use more of their capacity and achieve higher likelihoods. This development showed the importance of such multi-sample objectives and explained the success of several related approaches.

We extend the multi-sample approach to discrete latent variables and analyze the difficulty encountered when estimating the gradients involved. We then develop the first unbiased gradient estimator designed for importance-sampled objectives and evaluate it at training generative and structured output prediction models. The resulting estimator, which is based on low-variance per-sample learning signals, is both simpler and more effective than the NVIL estimator proposed for the single-sample variational objective, and is competitive with the currently used biased estimators.

• Neelakantan A [Le QV; Sutskever I | Google Brain] (2015) Adding Gradient Noise Improves Learning for Very Deep Networks. arXiv:1511.06807

• Deep feedforward and recurrent networks have achieved impressive results in many perception and language processing applications. This success is partially attributed to architectural innovations such as convolutional and long short-term memory networks. The main motivation for these architectural innovations is that they capture better domain knowledge, and importantly are easier to optimize than more basic architectures. Recently, more complex architectures such as Neural Turing Machines and Memory Networks have been proposed for tasks including question answering and general computation, creating a new set of optimization challenges. In this paper, we discuss a low-overhead and easy-to-implement technique of adding gradient noise which we find to be surprisingly effective when training these very deep architectures. The technique not only helps to avoid overfitting, but also can result in lower training loss. This method alone allows a fully-connected 20-layer deep network to be trained with standard gradient descent, even starting from a poor initialization. We see consistent improvements for many complex models, including a 72% relative reduction in error rate over a carefully-tuned baseline on a challenging question-answering task, and a doubling of the number of accurate binary multiplication models learned across 7,000 random restarts. We encourage further application of this technique to additional complex modern architectures.

• Make your Stochastic Gradient Descent more Stochastic: [Jun 2016] Results in Deep Learning never cease to surprise me. One recent ICLR 2016 paper from Google Brain team suggests a simple 1-line code change to improve your parameter estimation across the board - by adding a Gaussian noise to the computed gradients. Typical SGD updates parameters by taking a step in the direction of the gradient (simplified):

$\Theta_{t+1} \leftarrow \Theta_t + \alpha_t \nabla\Theta$

Instead of doing that, the suggestion is to add a small random noise to the update:

$\Theta_{t+1} \leftarrow \Theta_t + \alpha_t \nabla\Theta + N(0, \sigma_t^2)$

Further, $\sigma_t$ is prescribed to be:

$\large \sigma_t^2 = \frac{\eta}{(1+t)^{0.55}}$

and $\eta$ is one of $\{0.01, 0.3, 1.0\}$!  ...
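• A minimal sketch of the trick described above, assuming standard SGD minimization: add zero-mean Gaussian noise with annealed variance $\sigma_t^2 = \eta/(1+t)^{0.55}$ to every gradient before the update. The toy quadratic objective is an illustrative stand-in:

```python
import numpy as np

def noisy_sgd_step(theta, grad, t, lr=0.01, eta=0.3):
    sigma = np.sqrt(eta / (1 + t) ** 0.55)                      # annealed noise scale
    noisy_grad = grad + np.random.normal(0.0, sigma, size=grad.shape)
    return theta - lr * noisy_grad                              # standard descent step

theta = np.zeros(5)
for t in range(1, 1001):
    grad = 2 * (theta - 1.0)                                    # gradient of a toy quadratic
    theta = noisy_sgd_step(theta, grad, t, eta=0.3)
```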

• related: Gradient Noise Injection Is Not So Strange After All

Yesterday, I wrote about a gradient noise injection result at ICLR 2016, and noted the authors of the paper, despite detailed experimentation, were very wishy washy in their explanation of why it works. Fortunately, my Twitter friends, particularly Tim Vieira and Shubhendu Trivedi, grounded this much better than the authors themselves! Shubhendu pointed out Rong Ge (of MSR) and friends tried this in the context of Tensor Decomposition in 2015 (at some point I should write about connection between backprop and matrix factorization). Algorithm 1. in that paper is pretty much the update equation of the recent ICLR paper (modulo actual values of the constants).
Ge, R., Huang, F., Jin, C. and Yuan, Y., 2015. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition. arXiv:1503.02101.

• Olson RS (2016) Automating biomedical data science through tree-based pipeline optimization. arXiv:1601.07925  |  GitHub  |  GitXiv
• Over the past decade, data science and machine learning has grown from a mysterious art form to a staple tool across a variety of fields in academia, business, and government. In this paper, we introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning---pipeline design. We implement a Tree-based Pipeline Optimization Tool (TPOT) and demonstrate its effectiveness on a series of simulated and real-world genetic data sets. In particular, we show that TPOT can build machine learning pipelines that achieve competitive classification accuracy and discover novel pipeline operators---such as synthetic feature constructors---that significantly improve classification accuracy on these data sets. We also highlight the current challenges to pipeline optimization, such as the tendency to produce pipelines that overfit the data, and suggest future research paths to overcome these challenges. As such, this work represents an early step toward fully automating machine learning pipeline design.

• Pedregosa F (2016) Hyperparameter optimization with approximate gradient. arXiv:1602.02355  |  GitHub  |  GitXiv
• Most models in machine learning contain at least one hyperparameter to control for model complexity. Choosing an appropriate set of hyperparameters is both crucial in terms of model accuracy and computationally challenging. In this work we propose an algorithm for the optimization of continuous hyperparameters using inexact gradient information. An advantage of this method is that hyperparameters can be updated before model parameters have fully converged. We also give sufficient conditions for the global convergence of this method, based on regularity conditions of the involved functions and summability of errors. Finally, we validate the empirical performance of this method on the estimation of regularization constants of $\small \mathcal{L}_2$-regularized logistic regression and kernel Ridge regression. Empirical benchmarks indicate that our approach is highly competitive with respect to state of the art methods.

• Prakash A (2016) Highway Networks for Visual Question Answering. pdf
• We propose a version of highway network designed for the task of Visual Question Answering. We take inspiration from recent success of Residual Layer Network and Highway Network in learning deep representation of images and fine grained localization of objects. We propose variation in gating mechanism to allow incorporation of word embedding in the information highway. The gate parameters are influenced by the words in the question, which steers the network towards localized feature learning. This achieves the same effect as soft attention via recurrence but allows for faster training using optimized feed-forward techniques. We are able to obtain state-of-the-art results on VQA dataset for Open Ended and Multiple Choice tasks with current model ["among published results, User jw2yang has higher accuracy than us on VQA Challenge leaderboard"].

• Raghu M [Google Brain] (2016) On the expressive power of deep neural networks. arXiv:1606.05336

• We study the expressivity of deep neural networks with random weights. We provide several results, both theoretical and experimental, precisely characterizing their functional properties in terms of the depth and width of the network. In doing so, we illustrate inherent connections between the length of a latent trajectory, local neuron transitions, and network activation patterns. The latter, a notion defined in this paper, is further studied using properties of hyperplane arrangements, which also help precisely characterize the effect of the neural network on the input space. We further show dualities between changes to the latent state and changes to the network weights, and between the number of achievable activation patterns and the number of achievable labellings over input data. We see that the depth of the network affects all of these quantities exponentially, while the width appears at most as a base. These results also suggest that the remaining depth of a neural network is an important determinant of expressivity, supported by experiments on MNIST and CIFAR-10.

• Conclusion. ... That exponentially large changes in output can be induced by small changes in the input may help explain the prevalence of adversarial examples Szegedy et al. [2013] [arXiv:1312.6199] in deep networks. Additionally, understanding this structure may inspire new optimization schemes that account for the differing leverage of weights at different layers. The greater leverage stemming from training earlier weights may also motivate novel ways to adapt pre-trained networks to new tasks, beyond simply retraining the top layer (least expressive) weights.

• Ribeiro MT (2016) "Why Should I Trust You?": Explaining the Predictions of Any Classifier. arXiv:1602.04938  |  GitHub  |  GitXiv  |  reddit

• Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.

• Discussed here (by arXiv:1602.04938 authors Ribeiro et al.): Introduction to Local Interpretable Model-Agnostic Explanations (LIME): A technique to explain the predictions of any machine learning classifier.
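• A hedged illustration of the core LIME idea (this is not the authors' `lime` package): perturb the instance of interest, query the black-box model, weight the perturbations by proximity, and fit an interpretable linear surrogate locally. The black-box model, the toy data, and the use of ridge regression (the paper favours a sparse model over an interpretable representation) are all stand-ins:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

# hypothetical black-box classifier trained on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] * X[:, 1] > 0).astype(int)
black_box = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def explain_locally(model, x, n_samples=1000, scale=0.5, kernel_width=0.75):
    Z = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))          # local perturbations
    preds = model.predict_proba(Z)[:, 1]                                  # black-box P(class 1)
    weights = np.exp(-np.sum((Z - x) ** 2, axis=1) / kernel_width ** 2)   # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(Z, preds, sample_weight=weights)
    return surrogate.coef_                                                # local feature weights

local_importances = explain_locally(black_box, X[0])
```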

• Salimans T (2016) Weight Normalization: Simple Reparameterization to Accelerate Training of DNN. arXiv:1602.07868  |  RNN, LSTM; Atari, DQN, reinforcement learning (not DeepMind)  |  reddit includes discussion of DQN & batch normalization (problems, latter re: former)  |  GitXiv
• We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of stochastic gradient descent. Our reparameterization is inspired by batch normalization but does not introduce any dependencies between the examples in a minibatch. This means that our method can also be applied successfully to recurrent models such as LSTMs and to noise-sensitive applications such as deep reinforcement learning or generative models, for which batch normalization is less well suited. Although our method is much simpler, it still provides much of the speed-up of full batch normalization. In addition, the computational overhead of our method is lower, permitting more optimization steps to be taken in the same amount of time. We demonstrate the usefulness of our method on applications in supervised image recognition, generative modelling, and deep reinforcement learning.
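• A minimal NumPy sketch of the reparameterization described above: each weight vector $w$ is written as $w = g \, v / \|v\|$, decoupling its length $g$ from its direction $v$, and gradients w.r.t. $w$ are mapped back to gradients w.r.t. $g$ and $v$ (the paper's gradient decomposition, written out by hand here rather than taken from any released code):

```python
import numpy as np

def weight_norm(v, g):
    return g * v / np.linalg.norm(v)

def grads_wrt_g_v(grad_w, v, g):
    """Map a gradient w.r.t. w = g * v / ||v|| to gradients w.r.t. g and v."""
    norm = np.linalg.norm(v)
    grad_g = grad_w @ (v / norm)
    grad_v = (g / norm) * grad_w - (g * grad_g / norm ** 2) * v
    return grad_g, grad_v

# toy usage
v, g = np.array([3.0, 4.0]), 2.0
w = weight_norm(v, g)                       # length 2, direction [0.6, 0.8]
grad_g, grad_v = grads_wrt_g_v(np.array([0.1, -0.2]), v, g)
```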

• Saxena A [Ng AY] (2008) 3-D depth reconstruction from a single still image. pdf

• Given a single photo of a room and a large database of furniture CAD models, our goal is to reconstruct a scene that is as similar as possible to the scene depicted in the photograph, and composed of objects drawn from the database. We present a completely automatic system to address this IM2CAD problem that produces high quality results on challenging imagery from real estate web sites. Our approach iteratively optimizes the placement and scale of objects in the room to best match scene renderings to the input photo, using image comparison metrics trained using deep convolutional neural nets. By operating jointly on the full scene at once, we account for inter-object occlusions.

• Scardapane S (2016) Group Sparse Regularization for Deep Neural Networks. arXiv:1607.00485  |  reddit
• In this paper, we consider the joint task of simultaneously optimizing (i) the weights of a deep neural network, (ii) the number of neurons for each hidden layer, and (iii) the subset of active input features (i.e., feature selection). While these problems are generally dealt with separately, we present a simple regularized formulation allowing to solve all three of them in parallel, using standard optimization routines. Specifically, we extend the group Lasso penalty (originated in the linear regression literature) in order to impose group-level sparsity on the network's connections, where each group is defined as the set of outgoing weights from a unit. Depending on the specific case, the weights can be related to an input variable, to a hidden neuron, or to a bias unit, thus performing simultaneously all the aforementioned tasks in order to obtain a compact network. We perform an extensive experimental evaluation, by comparing with classical weight decay and Lasso penalties. We show that a sparse version of the group Lasso penalty is able to achieve competitive performances, while at the same time resulting in extremely compact networks with a smaller number of input features. We evaluate both on a toy dataset for handwritten digit recognition, and on multiple realistic large-scale classification problems.
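• A minimal sketch of the group-sparse penalty described above: treat the outgoing weights of each unit as one group and penalize the sum of the groups' Euclidean norms (scaled by group size, as is conventional for the group Lasso), so whole neurons or input features can be driven to zero together. Layer shapes are illustrative only:

```python
import numpy as np

def group_lasso_penalty(W):
    """W: (n_in, n_out) weight matrix; each row = outgoing weights of one unit."""
    group_norms = np.linalg.norm(W, axis=1)               # one norm per unit
    return np.sqrt(W.shape[1]) * np.sum(group_norms)      # scale by sqrt(group size)

W1 = np.random.randn(20, 50)    # input layer: groups correspond to input features
W2 = np.random.randn(50, 10)    # hidden layer: groups correspond to hidden neurons
penalty = group_lasso_penalty(W1) + group_lasso_penalty(W2)   # add to the training loss
```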
• Serban IV [Yoshua Bengio; Aaron Courville] (2016) Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation. arXiv:1606.00776
• We introduce the multiresolution recurrent neural network, which extends the sequence-to-sequence framework to model natural language generation as two parallel discrete stochastic processes: a sequence of high-level coarse tokens, and a sequence of natural language tokens. There are many ways to estimate or learn the high-level coarse tokens, but we argue that a simple extraction procedure is sufficient to capture a wealth of high-level discourse semantics. Such procedure allows training the multiresolution recurrent neural network by maximizing the exact joint log-likelihood over both sequences. In contrast to the standard log-likelihood objective w.r.t. natural language tokens (word perplexity), optimizing the joint log-likelihood biases the model towards modeling high-level abstractions. We apply the proposed model to the task of dialogue response generation in two challenging domains: the Ubuntu technical support domain, and Twitter conversations. On Ubuntu, the model outperforms competing approaches by a substantial margin, achieving state-of-the-art results according to both automatic evaluation metrics and a human evaluation study. On Twitter, the model appears to generate more relevant and on-topic responses according to automatic evaluation metrics. Finally, our experiments demonstrate that the proposed model is more adept at overcoming the sparsity of natural language and is better able to capture long-term structure.

• Shahriari B [de Freitas N] (2015). Unbounded Bayesian Optimization via Regularization. arXiv:1508.03666  |  pdf [local copy] |  [pdf:JMLR.org]
• Bayesian optimization has recently emerged as a powerful and flexible tool in machine learning for hyperparameter tuning and more generally for the efficient global optimization of expensive black box functions. The established practice requires a user-defined bounded domain, which is assumed to contain the global optimizer. However, when little is known about the probed objective function, it can be difficult to prescribe such a domain. In this work, we modify the standard Bayesian optimization framework in a principled way to allow for unconstrained exploration of the search space. We introduce a new alternative method and compare it to a volume doubling baseline on two common synthetic benchmarking test functions. Finally, we apply our proposed methods on the task of tuning the stochastic gradient descent optimizer for both a multi-layered perceptron and a convolutional neural network on the MNIST dataset.

• Shen T [MIT CSAIL] (2016) Making Dependency Labeling Simple, Fast and Accurate. pdf  |  GitHub
• This work addresses the task of dependency labeling - assigning labels to an (unlabeled) dependency tree. We employ and extend a feature representation learning approach, optimizing it for both high speed and accuracy. We apply our labeling model on top of state-of-the-art parsers and evaluate its performance on standard benchmarks including the CoNLL-2009 and the English PTB datasets. Our model processes over 1,700 English sentences per second, which is 30 times faster than the sparse-feature method. It improves labeling accuracy over the outputs of top parsers, achieving the best LAS on 5 out of 7 datasets.

• Shi Z (2016) Empirical study of PROXTONE and PROXTONE$+$ for Fast Learning of Large Scale Sparse Models. arXiv:1604.05024
• PROXTONE is a novel and fast method for optimization of large scale non-smooth convex problem (Shi 2015). In this work, we try to use PROXTONE method in solving large scale non-smooth non-convex problems, for example training of sparse deep neural network (sparse DNN) or sparse convolutional neural network (sparse CNN) for embedded or mobile device. PROXTONE converges much faster than first order methods, while first order method is easy in deriving and controlling the sparseness of the solutions. Thus in some applications, in order to train sparse models fast, we propose to combine the merits of both methods, that is we use PROXTONE in the first several epochs to reach the neighborhood of an optimal solution, and then use the first order method to explore the possibility of sparsity in the following training. We call such method PROXTONE plus (PROXTONE$+$). Both PROXTONE and PROXTONE$+$ are tested in our experiments, and which demonstrate both methods improved convergence speed twice as fast at least on diverse sparse model learning problems, and at the same time reduce the size to 0.5% for DNN models. The source of all the algorithms is available upon request.

• Sidiropoulos ND (2016) Tensor Decomposition for Signal Processing and Machine Learning. arXiv:1607.01668  |  reddit
• Tensors or multi-way arrays are functions of three or more indices $(i,j,k, ...)$ -- similar to matrices (two-way arrays), which are functions of two indices $(r,c)$ for $(row,column)$. Tensors have a rich history, stretching over almost a century, and touching upon numerous disciplines; but they have only recently become ubiquitous in signal and data analytics at the confluence of signal processing, statistics, data mining and machine learning. This overview article aims to provide a good starting point for researchers and practitioners interested in learning about and working with tensors. As such, it focuses on fundamentals and motivation (using various application examples), aiming to strike an appropriate balance of breadth and depth that will enable someone having taken first graduate courses in matrix algebra and probability to get started doing research and/or developing tensor algorithms and software. Some background in applied optimization is useful but not strictly required. The material covered includes tensor rank and rank decomposition; basic tensor factorization models and their relationships and properties (including fairly good coverage of identifiability); broad coverage of algorithms ranging from alternating optimization to stochastic gradient; statistical performance analysis; and applications ranging from source separation to collaborative filtering, mixture and topic modeling, classification, and multilinear subspace learning.

• Srinivas S (2015) Learning the Architecture of Deep Neural Networks. arXiv:1511.05497

• Deep neural networks with millions of parameters are at the heart of many state of the art machine learning models today. However, recent works have shown that models with a much smaller number of parameters can also perform just as well. In this work, we introduce the problem of architecture-learning, i.e., learning the architecture of a neural network along with weights. We introduce a new trainable parameter called tri-state ReLU, which helps in eliminating unnecessary neurons. We also propose a smooth regularizer which encourages the total number of neurons after elimination to be small. The resulting objective is differentiable and simple to optimize. We experimentally validate our method on both small and large networks, and show that it can learn models with a considerably small number of parameters without affecting prediction accuracy.

• [multilayered LSTM:] Sutskever I [Vinyals O; Le QV | Google] (2014) [multilayered LSTM; NLP; VSM; translation] Sequence to Sequence Learning With Neural Networks  |  blog: post
• Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. Although DNNs work well whenever large labeled training sets are available, they cannot be used to map sequences to sequences. In this paper, we present a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure. Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set, where the LSTM's BLEU score was penalized on out-of-vocabulary words. Additionally, the LSTM did not have difficulty on long sentences. For comparison, a phrase-based SMT system achieves a BLEU score of 33.3 on the same dataset. When we used the LSTM to rerank the 1000 hypotheses produced by the aforementioned SMT system, its BLEU score increases to 36.5, which is close to the previous state of the art. The LSTM also learned sensible phrase and sentence representations that are sensitive to word order and are relatively invariant to the active and the passive voice. Finally, we found that reversing the order of the words in all source sentences (but not target sentences) improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
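• A tiny sketch of the paper's source-reversal trick: reverse the token order of each source sentence (but not the target) before feeding it to the encoder, which introduces many short-term source-target dependencies and, per the abstract, markedly improved the LSTM's performance. The token IDs below are illustrative:

```python
def reverse_sources(pairs):
    """pairs: list of (source_tokens, target_tokens); only the source is reversed."""
    return [(list(reversed(src)), tgt) for src, tgt in pairs]

pairs = [([12, 7, 93, 4], [55, 2, 19]), ([8, 31], [6, 40, 11])]
training_pairs = reverse_sources(pairs)     # sources reversed, targets untouched
```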

• Tabacof P (2015) Exploring the space of adversarial images. arXiv:1510.05328  |  GitXiv
• Adversarial examples have raised questions regarding the robustness and security of deep neural networks. In this work we formalize the problem of adversarial images given a pretrained classifier, showing that even in the linear case the resulting optimization problem is nonconvex. We generate adversarial images using shallow and deep classifiers on the MNIST and ImageNet datasets. We probe the pixel space of adversarial images using noise of varying intensity and distribution. We bring novel visualizations that showcase the phenomenon and its high variability. We show that adversarial images appear in large regions in the pixel space, but that, for the same task, a shallow classifier seems more robust to adversarial images than a deep convolutional network.

• Trottier L (2016) Parametric Exponential Linear Unit for Deep Convolutional Neural Networks. arXiv:1605.0933  |  reddit

• The activation function of Deep Neural Networks (DNNs) has undergone many changes during the last decades. Since the advent of the well-known non-saturated Rectified Linear Unit (ReLU), many have tried to further improve the performance of the networks with more elaborate functions. Examples are the Leaky ReLU (LReLU) to remove zero gradients and Exponential Linear Unit (ELU) to reduce bias shift. In this paper, we introduce the Parametric ELU (PELU), an adaptive activation function that allows the DNNs to adopt different non-linear behaviors throughout the training phase. We contribute in three ways: (1) we show that PELU increases the network flexibility to counter vanishing gradient, (2) we provide a gradient-based optimization framework to learn the parameters of the function, and (3) we conduct several experiments on MNIST, CIFAR-10/100 and ImageNet with different network architectures, such as NiN, Overfeat, All-CNN, ResNet and Vgg, to demonstrate the general applicability of the approach. Our proposed PELU has shown relative error improvements of 4.45% and 5.68% on CIFAR-10 and 100, and as much as 7.28% with only 0.0003% parameter increase on ImageNet, along with faster convergence rate in almost all test scenarios. We also observed that Vgg using PELU tended to prefer activations saturating close to zero, as in ReLU, except at last layer, which saturated near -2. These results suggest that varying the shape of the activations during training along with the other parameters helps to control vanishing gradients and bias shift, thus facilitating learning.

• Ulyanov D (2016) Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. arXiv:1603.03417  |  GitHub  |  mentioned here: reddit

• Gatys et al. recently demonstrated that deep networks can generate beautiful textures and stylized images from a single texture example. However, their methods require a slow and memory-consuming optimization process. We propose here an alternative approach that moves the computational burden to a learning stage. Given a single example of a texture, our approach trains compact feed-forward convolutional networks to generate multiple samples of the same texture of arbitrary size and to transfer artistic style from a given image to any other image. The resulting networks are remarkably light-weight and can generate textures of quality comparable to Gatys et al., but hundreds of times faster. More generally, our approach highlights the power and flexibility of generative feed-forward models trained with complex and expressive loss functions.

• 4.1. Speed and memory. We compare quantitatively the speed of our method and of the iterative optimization of Gatys et al., 2015a by measuring how much time it takes for the latter and for our generator network to reach a given value of the loss $L_T(x; x_0)$. Figure 6 shows that iterative optimization requires about 10 seconds to generate a sample x that has a loss comparable to the output x = g(z) of our generator network. Since an evaluation of the latter requires ~20 ms, we achieve a 500x speed-up, which is sufficient for real-time applications such as video processing. There are two reasons for this significant difference: the generator network is much smaller than the VGG-19 model evaluated at each iteration of (Gatys et al., 2015a), and our method requires a single network evaluation. By avoiding backpropagation, our method also uses significantly less memory (170 MB to generate a 256 x 256 sample, vs. 1100 MB for (Gatys et al., 2015a)).

• In this reddit thread, Texture Networks: Feed-forward Synthesis of Textures and Stylized Images, the Gatys et al. and the Ulyanov et al. papers are discussed in relation to one another:

• speed is awesome, but the quality is quite inferior to Gatys et al.

• Generally agreed, although there are a couple of samples where theirs does appear better to me -- trees in Fig. 1, roofing shingles in Fig. 11. It looks like the textures generated in this paper are much more homogenous than their sources, particularly conspicuous on the rock textures.

• Oh yeah the textures are nice, I only meant the style transfer experiments. Compared to the stuff they're putting out now at deepart.io, these are pretty bad.

• The textures are ok (compared to Gatys; they are much better than the things that came before Gatys, clearly)-- like I said, some are better. Some aren't. The homogeneity is bad on most of the textures they show.
But yeah, I have to agree that the results of the style transfer experiments are worse. It's quite an achievement to be running 500x faster when deployed, though, which gives a lot of room to improve the method's results while remaining very fast.

• [follow-on to preceding paper (Ulyanov D 2016 arXiv:1603.03417)]:
Ulyanov D (2016) Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022  |  reddit

• In this paper we revisit the fast stylization method introduced in Ulyanov et al. (2016) [arXiv:1603.03417]. We show how a small change in the stylization architecture results in a significant qualitative improvement in the generated images. The change is limited to swapping batch normalization with instance normalization, and to apply the latter both at training and testing times. The resulting method can be used to train high-performance architectures for real-time image generation. The code will be made available here [GitHub]

• The recent work of Gatys et al. (2016) introduced a method for transferring a style from an image onto another one, as demonstrated in fig. 1. The stylized image matches simultaneously selected statistics of the style image and of the content image. Both style and content statistics are obtained from a deep convolutional network pre-trained for image classification. The style statistics are extracted from shallower layers and averaged across spatial locations whereas the content statistics are extracted from deeper layers and preserve spatial information. In this manner, the style statistics capture the "texture" of the style image whereas the content statistics capture the "structure" of the content image.

Although the method of Gatys et al. produces remarkably good results, it is computationally inefficient. The stylized image is, in fact, obtained by iterative optimization until it matches the desired statistics. In practice, it takes several minutes to stylize an image of size $\small 512 \times 512$. Two recent works, Ulyanov et al. (2016) and Johnson et al. (2016), sought to address this problem by learning equivalent feed-forward generator networks that can generate the stylized image in a single pass. These two methods differ mainly by the details of the generator architecture and produce results of a comparable quality; however, neither achieved as good results as the slower optimization-based method of Gatys et al.

In this paper we revisit the method for feed-forward stylization of Ulyanov et al. (2016) and show that a small change in a generator architecture leads to much improved results. The results are in fact of comparable quality as the slow optimization method of Gatys et al. but can be obtained in real time on standard GPU hardware. The key idea (section 2) is to replace batch normalization layers in the generator architecture with instance normalization layers, and to keep them at test time (as opposed to freeze and simplify them out as done for batch normalization). Intuitively, the normalization process allows to remove instance-specific contrast information from the content image, which simplifies generation. In practice, this results in vastly improved images (section 3).
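• A minimal NumPy sketch contrasting batch normalization with the instance normalization described above, for feature maps of shape (N, C, H, W): batch norm shares statistics over (N, H, W) per channel, while instance norm computes them over (H, W) per channel and per sample, which is what removes instance-specific contrast. Learnable scale/shift parameters are omitted for brevity:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    mean = x.mean(axis=(0, 2, 3), keepdims=True)    # statistics shared across the batch
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    mean = x.mean(axis=(2, 3), keepdims=True)       # statistics per (sample, channel)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.rand(4, 3, 32, 32)                    # (N, C, H, W)
y = instance_norm(x)
```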

• Vinyals O [Le Q | Google] (2015) A neural conversational model. arXiv:1506.05869  |  blog:commentary
• Conversational modeling is an important task in natural language understanding and machine intelligence. Although previous approaches exist, they are often restricted to specific domains (e.g., booking an airline ticket) and require hand-crafted rules. In this paper, we present a simple approach for this task which uses the recently proposed sequence to sequence framework. Our model converses by predicting the next sentence given the previous sentence or sentences in a conversation. The strength of our model is that it can be trained end-to-end and thus requires much fewer hand-crafted rules. We find that this straightforward model can generate simple conversations given a large conversational training dataset. Our preliminary results suggest that, despite optimizing the wrong objective function, the model is able to converse well. It is able to extract knowledge from both a domain specific dataset, and from a large, noisy, and general domain dataset of movie subtitles. On a domain-specific IT helpdesk dataset, the model can find a solution to a technical problem via conversations. On a noisy open-domain movie transcript dataset, the model can perform simple forms of common sense reasoning. As expected, we also find that the lack of consistency is a common failure mode of our model.

• Wang B (2016) Improving and Scaling Trans-dimensional Random Field Language Models. arXiv:1603.09170  |  [ Language modeling, random fields, stochastic approximation ]
• The dominant language models (LMs) such as n-gram and neural network (NN) models represent sentence probabilities in terms of conditionals. In contrast, a new trans-dimensional random field (TRF) LM has been recently introduced to show superior performances, where the whole sentence is modeled as a random field. In this paper, we further develop the TRF LMs with two technical improvements, which are a new method of exploiting Hessian information in parameter optimization to further enhance the convergence of the training algorithm and an enabling method for training the TRF LMs on large corpus which may contain rare very long sentences. Experiments show that the TRF LMs can scale to using training data of up to 32 million words, consistently achieve 10% relative perplexity reductions over 5-gram LMs, and perform as good as NN LMs but with much faster speed in calculating sentence probabilities. Moreover, we examine how the TRF models can be interpolated with the NN models, and obtain 12.1% and 17.9% relative error rate reductions over 6-gram LMs for English and Chinese speech recognition respectively through log-linear combination.

• Wen Y (ECCV 2016) A Discriminative Feature Learning Approach for Deep Face Recognition.  |  pdf  |  reddit
• Convolutional neural networks (CNNs) have been widely used in computer vision community, significantly improving the state-of-the-art. In most of the available CNNs, the softmax loss function is used as the supervision signal to train the deep model. In order to enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for face recognition task. Specifically, the center loss simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in the CNNs. With the joint supervision of softmax loss and center loss, we can train a robust CNNs to obtain the deep features with the two key learning objectives, inter-class dispension [sic: presumably "dispersion"] and intra-class compactness as much as possible, which are very essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve the state-of-the-art accuracy on several important face recognition benchmarks, Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and MegaFace Challenge. Especially, our new approach achieves the best results on MegaFace (the largest public domain face benchmark) under the protocol of small training set (contains under 500,000 images and under 20,000 persons), significantly improving the previous results and setting new state-of-the-art for both face recognition and face verification tasks.
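• A minimal sketch of the center-loss idea described above: keep one center per class in feature space, penalize the squared distance of each deep feature to its class center, and nudge the centers toward the current mini-batch means (a simplified version of the paper's center update; the full objective adds this term, scaled by a weight, to the usual softmax loss). Shapes and the update details are illustrative:

```python
import numpy as np

def center_loss(features, labels, centers):
    diffs = features - centers[labels]                   # (batch, dim)
    return 0.5 * np.sum(diffs ** 2) / len(labels)

def update_centers(features, labels, centers, alpha=0.5):
    for c in np.unique(labels):
        batch_mean = features[labels == c].mean(axis=0)
        centers[c] += alpha * (batch_mean - centers[c])  # move the center toward the batch mean
    return centers

features = np.random.randn(8, 128)                       # deep features from the CNN
labels = np.random.randint(0, 4, size=8)
centers = np.zeros((4, 128))
loss = center_loss(features, labels, centers)
centers = update_centers(features, labels, centers)
```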

• Wiseman S (2016) Sequence-to-Sequence Learning as Beam-Search Optimization. arXiv:1606.02960
• Sequence-to-Sequence (seq2seq) modeling has rapidly become an important general-purpose NLP tool that has proven effective for many text-generation and sequence-labeling tasks. Seq2seq builds on deep neural language modeling and inherits its remarkable accuracy in estimating local, next-word distributions. In this work, we introduce a model and beam-search training scheme, based on the work of Daume III and Marcu (2005), that extends seq2seq to learn global sequence scores. This structured approach avoids classical biases associated with local training and unifies the training loss with the test-time usage, while preserving the proven model architecture of seq2seq and its efficient training approach. We show that our system outperforms a highly-optimized attention-based seq2seq system and other baselines on three different sequence to sequence tasks: word ordering, parsing, and machine translation.

• Zhang K (2016) Residual Networks of Residual Networks: Multilevel Residual Networks. arXiv:1608.02908  |  reddit
• Residual networks family with hundreds or even thousands of layers dominate major image recognition tasks, but building a network by simply stacking residual blocks inevitably limits its optimization ability. This paper proposes a novel residual-network architecture, Residual networks of Residual networks (RoR), to dig the optimization ability of residual networks. RoR substitutes optimizing residual mapping of residual mapping for optimizing original residual mapping, in particular, adding level-wise shortcut connections upon original residual networks, to promote the learning capability of residual networks. More importantly, RoR can be applied to various kinds of residual networks (Pre-ResNets and WRN) and significantly boost their performance. Our experiments demonstrate the effectiveness and versatility of RoR, where it achieves the best performance in all residual-network-like structures. Our RoR-3-WRN58-4 models achieve new state-of-the-art results on CIFAR-10, CIFAR-100 and SVHN, with test errors 3.77%, 19.73% and 1.59% respectively. These results outperform 1001-layer Pre-ResNets by 18.4% on CIFAR-10 and 13.1% on CIFAR-100.

• Zhang Y (2015) Sensitivity Analysis/Practitioners' Guide: CNN, Sentence Classification. arXiv 1510.03820  |  NLP,VSM, sentence classification; parameter optimization

• Convolutional Neural Networks (CNNs) have recently achieved remarkably strong performance on the practically important task of sentence classification (Kim 2014, Kalchbrenner 2014, Johnson 2014). However, these models require practitioners to specify an exact model architecture and set accompanying hyperparameters, including the filter region size, regularization parameters, and so on. It is currently unknown how sensitive model performance is to changes in these configurations for the task of sentence classification. We thus conduct a sensitivity analysis of one-layer CNNs to explore the effect of architecture components on model performance; our aim is to distinguish between important and comparatively inconsequential design decisions for sentence classification. We focus on one-layer CNNs (to the exclusion of more complex models) due to their comparative simplicity and strong empirical performance, which makes it a modern standard baseline method akin to Support Vector Machines (SVMs) and logistic regression. We derive practical advice from our extensive empirical results for those interested in getting the most out of CNNs for sentence classification in real world settings.

• Zilly JG [Jürgen Schmidhuber] (2016) Recurrent Highway Networks. arXiv:1607.03474  |  reddit

• Many sequential processing tasks require complex nonlinear transition functions from one step to the next. However, recurrent neural networks with such 'deep' transition functions remain difficult to train, even when using Long Short-Term Memory networks. We introduce a novel theoretical analysis of recurrent networks based on Geršgorin's circle theorem that illuminates several modeling and optimization issues and improves our understanding of the LSTM cell. Based on this analysis we propose Recurrent Highway Networks (RHN), which are long not only in time but also in space, generalizing LSTMs to larger step-to-step depths. Experiments indicate that the proposed architecture results in complex but efficient models, beating previous models for character prediction on the Hutter Prize dataset with less than half of the parameters.

• implemented here: Recurrent Highway & Multiplicative Integration -- Implementation in TensorFlow  |  GitHub

## OVERFITTING

• Can more data lower SVM performance?
I'm using LIBSVM [a library for SVM] on a dataset for binary classification, with around 150 features. When I train with ~150 examples, I get ~76% accuracy; when I up the number of examples to 500+, I get an overall accuracy of around 72%. Originally I used 10-fold cross validation. I did suspect overfitting given the low number of examples compared to features, but I tried again with a completely separate (~100 example) test set and it still performed the same - 76% and 72%. Basically, what I'm asking is: should adding more training data to a dataset always match or increase the accuracy of the model, or can adding more data actually lower precision? (Assuming everything else is accounted for - optimal parameters for each set, etc.)
• You have to think about what overfitting actually means. If you perfectly fit (overfit?) a small sample, but the sample has the same distribution of features and labels as your test set, you still get perfect accuracy on test. So that isn't overfitting, even though your model fits the training data exactly.
To be able to overfit, which it sounds like you are, your training data must have a different distribution than your test set.
So, with that mental model in mind, for sure adding more data can change your results. The distribution of data changes.
Conceptually, this means that the bias in your smaller training set just happened to be beneficial for your test results (which should happen 50% of the time!) Remember that this is a falsely inflated result, and probably doesn't represent better generalisation.
As a general rule, having as many variables as data points (or within an order of magnitude) is a very small data setting, and is more susceptible to data heterogeneity. You should see less of this if you increase the number of observations by a few orders of magnitude! Easier said than done.
The other thing you can do is some form of variable selection to reduce the chance of overfitting. Lasso regularisation and PCA are popular. Stepwise selection methods and p-value ranking with univariate linear models are used in some fields, but controversial in others, and can be a bit more tricky to get right.
As an example, if you used PCA to capture 95% of the variation in the data, you might end up with under ten variables (principal components) and you should be less likely to overfit (a minimal sketch of this follows below).
With a dataset that size, those feature selection methods should be pretty quick. Lasso and PCA shouldn't take more than a few seconds.
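A minimal scikit-learn sketch of that PCA idea, where PCA(n_components=0.95) keeps just enough components to explain 95% of the variance; the random data and the logistic-regression classifier are stand-ins for the poster's actual setup:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# stand-in for the ~150-feature, few-hundred-example dataset described above
X = np.random.randn(500, 150)
y = np.random.randint(0, 2, 500)

# keep enough principal components to explain 95% of the variance, then fit a simple classifier
model = make_pipeline(PCA(n_components=0.95), LogisticRegression())
scores = cross_val_score(model, X, y, cv=10)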

• "Should adding more training data to a dataset always match or increase the accuracy of the model?" Yes, if the data is IID [independent and identically distributed]. Use more folds and compute the standard deviation.
• Feature engineering and over-fitting
I was wondering when feature engineering becomes over-fitting. Traditionally we always refer to over-fitting in the context of dimensionality, but there should also be a concept for the feature definitions themselves. All the work that goes into hand-crafting a feature could lead to over-fitting. Secondly, how do we use cross-validation in this scenario, since we can't really look at only part of the data to design the features and then test on the rest? Any thoughts?

• Over-fitting refers to almost "remembering" the exact data points, rather than learning an intelligent representation of the data. With a neural network of 10,000 hidden units, I can definitely overfit a training set of 10,000 samples: simply put, every hidden neuron can correspond to one input sample.
Feature engineering concerns the expansion of your input space. Say you have input vectors of 20 features. That is a point in $R^{20}$. Now you engineer another 10 features. The input vectors are now in $R^{30}$.
Though you didn't necessarily add input samples, it is easier for a smaller network to overfit your data. Concretely, a neural network with 1000 hidden neurons might not overfit the original 10,000-sample dataset, but it could possibly overfit the new dataset with exactly the same number of samples.

How come the second dataset is easier to overfit?

Your data space has become less dense. The original 20-D data vectors have a certain density, density being the number of data points per given volume of the data space. In the new 30-D data space, the data is less dense.

A usual metaphor: 100 people randomly spread on an area of $30 \times 30$ meters might look like a full area to you. But 100 people randomly spread in a skyscraper of the same surface area, might look like an empty skyscraper to you. Conclusion: adding the extra dimension (height of the building in this case) makes the data points less dense in the data space.

Now why does data density matter?

For densely packed data vectors, you force the neural network to learn an intelligent representation of the data. The data points are close together, so it needs to learn highly non-linear methods to separate the data points. For loosely packed data vectors, the neural network can learn a few non-linear methods that suffice for a good performance.

The same metaphor: let's say we want to classify men versus women. The same situation occurs: first, we spread them on an area of $30 \times 30$ meters; second, we spread them in a skyscraper of surface area $30 \times 30$ meters. Say 100 people are on an area of $30 \times 30$ meters. You have to distinguish them in an intelligent manner, and you'll have a hard time coming up with an algorithm. The second case is easier: we spread the 100 people randomly in the skyscraper. Now, you might as well remember that floor 14 has more men, floor 81 has more women, and so on. This is an easy task: you remember the genders on the different floors, and you are done.
With this explanation and metaphor, you see both the use and the caveat of feature engineering. Feature engineering makes the data easier to represent for any given algorithm. In that perspective, you "help" the algorithm with its task. Though if you help it too much, it will simply overfit the data and not learn at all.

• How do we deal with overfitting?
1. Occam's razor. The solution with the fewest assumptions is generally correct. Prefer simpler models over more complicated ones. Generally, the fewer parameters you have in the model, the better.
2. Out-of-time validation sample. If you have data that deals with events that happened over a certain period of time (e.g. fraud events in bank transactions), use an out-of-time sample to test your model. An out-of-time sample means a data sample beyond the period of the development dataset (the sample used to train your model). This will ensure your model works in situations it has not seen before and is not overfitted to the types of events it was trained on.
3. Cross-validation. This gives you a good idea of the goodness of fit to expect from the model when it sees data it has not seen before. Cross-validation involves partitioning a dataset into complementary subsets, performing the development or training of the model on one subset (the training set), and validating the results on the other subset (the testing set). This process is repeated multiple times using different partitions, and in the end the validation results are averaged over all the rounds of cross-validation.
4. Regularization. Regularization tries to enforce Occam's razor on the model by penalizing some sources of overfitting. One of the common techniques is LASSO, where you penalize your model if the sum of the absolute values of the regression coefficients gets too high.
5. What story are the predictors telling? One of the key steps is to look at the model and try to find justification for the variables that are being picked up and for the signs of the coefficients.
6. If a model seems too good to be true, it probably has issues - be skeptical.

• re: LASSO (item 4, above): Is stepwise regression still controversial?
... I'm doing some work with genomics-style data, and my (very senior, world-respected) professor has suggested I use stepwise regression to select my variables of interest. It is an exploratory paper, so there are only a few observations and very many variables (2-3 orders of magnitude difference). I am not a statistician, but I have seen in many places that stepwise regression has been bagged as data dredging. Yet it seems to be in widespread use in genome-wide association studies ...
• As you mentioned, stepwise regression is very unpopular in the stats community. The LASSO has mostly replaced variable selection methods for linear models. The LASSO is widely used in genomics, so I would start there instead of using a stepwise selection technique.
• Also look at the elastic net. It's a generalization of LASSO that uses both the $L1$ and $L2$ norms (i.e. the LASSO and ridge penalties). You could also look at combining variables using multivariate techniques like PCA. As for huge numbers of hypothesis tests, look at the FDR criterion: it's a correction that controls the expected proportion of false positives among your significant results at alpha% (rather than allowing a much larger proportion). (A small LASSO / elastic net sketch follows at the end of this thread.)
• related: Andrej Karpathy, @karpathy (twitter, Mar 11, 2016): "not-widely-enough-known-protip: Do not use $L2$ loss (regression) in neural nets unless you absolutely have to. Softmax likely to work better."
• There's a huge literature on sparse regression with genetic applications/motivation. That's probably the most covered topic in stats in the past decade or so. Nobody recommends doing stepwise selection.
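A small sketch of LASSO / elastic net variable selection as a replacement for stepwise regression in a p >> n setting, assuming scikit-learn; the simulated matrix is just a stand-in for the genomics-style data:

import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV

# few observations, very many variables (p >> n), as in the question above
X = np.random.randn(100, 5000)
y = X[:, :3].sum(axis=1) + 0.1 * np.random.randn(100)

lasso = LassoCV(cv=5).fit(X, y)                      # L1 penalty; strength chosen by cross-validation
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)    # mix of L1 (LASSO) and L2 (ridge) penalties
selected = np.flatnonzero(lasso.coef_)               # variables the LASSO kept (nonzero coefficients)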

• Less training data giving better results? (NLP/Text Generation)
I am doing text generation with LSTM networks using Keras, and I have noticed that when I train on only the first 200k characters of my source novel, I get better results with much less training time (though more epochs go by in that shorter time). When training with very large amounts of data, I almost never get spelling errors, but the generated text tends to repeat the same statements over and over. Could my net be too small, or do I just need to wait for more epochs to go by when using the larger dataset?
My network has 3 hidden LSTM layers, fully connected, with 512 nodes each. There are about 70 inputs and 70 outputs, so 5 layers total including the input and output layers. I am using one-hot encoding to pass in characters, and the output is the probability of a given character coming next. The way I generate text is by passing in a seed character and letting the net output the probabilities of the next character. I then sample from this with a softmax function, and pass in that character.

• If it learns faster on only the beginning of your novels, it either means it's overfitting on the smaller dataset, or the beginning of the novels is easier to learn than later parts. Try to report train and validation loss, to find out if it's overfitting or not. Also, I would try to run it until convergence.
• How do you evaluate it? If you do this kind of generation, you're probably happy with overfitting, which is easier when there is less data. If your evaluation requires generalization, then more data will be better.

• What happened to DropOut?
When Hinton plugged DropOut bigtime a few years ago, it seemed like a good solution to the overfitting problem, and like a new standard that everyone would use from then on. Some variants on the theme came about after that. But these days you hardly see it mentioned in any of the papers with impressive results. What happened? These days it seems overfitting is hardly a problem, although nets are huge and often larger than the dataset. Do you think the DropOut patent has something to do with it? Also I guess the VAE is a more elegant solution to the same problem?

• [u/ogrisel] I think dropout is still very useful when fitting complex supervised models on tasks with few labeled samples. However training deep supervised vision models is nowadays done with residual convolutional architectures that make intensive use of batch normalization. The slight stochastic rescaling and shifting of BN also serve as a stochastic regularizer and can make dropout regularization redundant. Furthermore dropout never really helped when inserted between convolution layers and was most useful between fully connected layers. Today's most performant vision models don't use fully connected layers anymore (they use convolutional blocks till the end and then some parameterless global averaging layer).

For RNNs, using dropout between the recurrent units never really worked (apparently the noise introduced in the recurrent layer destroys the gradients too much and prevents learning from occurring). Dropout is still useful between RNN layers as far as I know. It probably also depends on the amount of labeled samples. Dropout can also be useful on the input embedding layer of RNNs trained on word-level or character-level data, or in any model using categorical inputs via an embedding.

VAE is an unsupervised model (it can also be used in semi-supervised mode): generally the main problem of unsupervised models is not overfitting (caused by the lack of labeled data) because they work on unlabeled data which is often quite abundant. Their problem is more about underfitting due to computational constraints or optimization issues (or both).

Dropout does actually work quite well between recurrent units if you tie the dropout masks across time (see also this paper, where they provide a Bayesian background story for this simple trick, which I don't really buy to be honest: arXiv:1512.05287  [<< large file; opens in new tab]).

The same actually holds for dropout between convolutional layers: if you tie the masks across spatial dimensions it regularises pretty well (too strongly in fact, in my experience). But since conv. layers have relatively few parameters anyway there's not so much of a need, and of course the regularising side-effects of batchnorm also help.

The reason dropout doesn't work between conv. layers if you don't do this, is because the weight gradients of the conv. layers are averaged across the spatial dimensions, which tend to contain many very highly correlated activations. This ends up canceling out the effect of dropout if the mask is different at every spatial location. You're basically Monte-Carlo averaging over all possible masks, which is precisely what you don't want to do with dropout during training :-)

I imagine the story is much the same for RNNs, instead of the "gradient destruction" explanation which I don't really buy either. Although admittedly I haven't thought about this for RNNs nearly as much as I have for convnets.

• [u/ogrisel] Thanks. When you say "tie the mask between convolution layers", do you mean use fixed Bernoulli masks for consecutive layers that have the same dimensions? When the dimensions change from one layer to the next (e.g. via convolution strides or max pooling), I assume you need to apply the consistent strides / pooling patterns to the Bernoulli mask as well. Is that right? Do you need to also use a fixed mask across channels of a specific conv layer, or can you use different Bernoulli masks for different channels of the same layer? Are there implementations of those aspects in common DL frameworks such as theano, tensorflow or torch (or higher level libs like Keras and lasagne)? Do you have a good reference for the use of dropout in convnets?

• [u/benanne] "Thanks. When you say "tie the mask between convolution layers", you mean use a fixed Bernoulli maps for consecutive layers that have the same dimensions?" No, only within layers across the spatial dimensions. Not across channels, not across the batch, not across different layers :-) In Lasagne you can do this easily with the "shared_axes" parameter of lasagne.layers.DropoutLayer. As for references, I don't think I've ever seen dropout used in convolutional layers in papers. I'm mainly talking about my own experience trying to train convnets on relatively small datasets. Here's another relevant thread from this subreddit: reddit.

• [u/ogrisel] Thanks. For convnets trained on relatively small datasets, what do you think of Fractional Max-pooling? arXiv:1412.6071

• [u/benanne] I think it's a very interesting approach but pretty tough to implement efficiently, which explains its limited adoption so far. I haven't tried it myself.

• [u/cbeak] "No, only within layers across the spatial dimensions." << Isn't that the same as DropConnect, i.e. where you randomly set weights to 0 (in this case in the kernel)?

• [u/benanne] No, you're still zeroing out activations rather than weights, but you're zeroing out the same subset of activations for every spatial position. If your activation tensors are of shape (batch_size, num_channels, height, width), then regular dropout would involve sampling dropout masks of that same shape. Dropout with spatially tied masks can instead be implemented by sampling a mask of (batch_size, num_channels, 1, 1) and then relying on broadcasting for the spatial dimensions when multiplying it with the activations. Don't really see the link with dropconnect to be honest. On a related note I don't think dropconnect is usually worth it, it's pretty hard to implement efficiently (you need a different mask on the weight matrix for each example in a batch, which destroys batch parallelism) and in the paper they were only able to demonstrate improvements over dropout when ensembling several models together. You might want to search this subreddit for "dropconnect", this has been discussed several times in the past :)
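A minimal numpy sketch of the spatially tied dropout described above: one Bernoulli mask per (example, channel), broadcast over height and width. The inverted-dropout rescaling is an assumption on my part; in Lasagne the same effect comes from DropoutLayer's shared_axes parameter.

import numpy as np

def spatially_tied_dropout(activations, p=0.5, rng=np.random):
    # activations: (batch_size, num_channels, height, width)
    keep = 1.0 - p
    mask = (rng.uniform(size=activations.shape[:2]) < keep).astype(activations.dtype)
    mask = mask[:, :, None, None] / keep      # inverted-dropout scaling; broadcasts over height, width
    return activations * mask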

## PARAMETERIZATION [PARAMETRIZATION | HYPERPARAMETERS]

• See my Machine Learning - Parameterization notes.  [<< large file; opens in new tab]

## PRETRAINING

• Is layer-wise pre-training obsolete?
I see some papers use backpropagation directly for deep networks. So my question is: what is the mainstream approach? Is layer-wise pre-training obsolete?

• I think you would do it if you had a ton of unlabeled data and only a few labeled training samples, although ladder networks would be a better choice in that scenario.

• Mostly yes: ReLUs and similar activation functions help avoid vanishing gradients during learning.

• It seems like it. Personally, I haven't had much success with it either. ReLUs, as others have mentioned, and Batch Normalization take care of problems like vanishing gradients. In Goodfellow's/Bengio's/Courville's new book they mention that on average, layer-wise pretraining hurts the training, but in some cases, it improves the results a lot. Might be worth it to give it a shot. A related approach that is used in the VGG-net is training a NN with fewer layers first and then initializing the weights of a deeper net with the learned weights of the smaller net.

• You don't generally see layer-by-layer pre-training anymore. As said in other comments, ReLUs have outmoded it by and large. However when using RNNs, you often see CNNs being pre-trained on other datasets. The CNN is then frozen during the training of the recurrent portion of the model. e.g. Show and Tell and Sequence to Sequence.

• Echoed here:  What is the current state-of-the-art in pre-training neural networks?

• The state of the art is not doing pre-training at all, at least not with the data on which you want to apply your algorithm. ReLUs made it possible [pdf]. Residual learning [arXiv:1512.03385] and highway networks [arXiv:1505.00387] have also enabled deeper architectures without the need for pre-training.

However, what people sometimes do is to pre-train on other (larger) datasets to learn features and then re-train (performing the needed architectural modifications) on your dataset (normally with lower learning rate). But I think this is usually referred to as transfer learning. For an example on this, check the Keras blog: (jump to title "Using the bottleneck features of a pre-trained network: 90% accuracy in a minute")
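A rough sketch of that transfer-learning recipe, assuming Keras with keras.applications available; the input shape, the two-class head and the low learning rate are arbitrary choices for illustration, not the blog post's exact code:

from keras.applications import VGG16
from keras.models import Model
from keras.layers import GlobalAveragePooling2D, Dense
from keras.optimizers import SGD

# pre-trained convolutional base; freeze it so only the new head is trained at first
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

x = GlobalAveragePooling2D()(base.output)
predictions = Dense(2, activation='softmax')(x)     # hypothetical 2-class problem
model = Model(base.input, predictions)

# low learning rate so later fine-tuning does not wreck the pre-trained weights
model.compile(optimizer=SGD(lr=1e-4, momentum=0.9), loss='categorical_crossentropy')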

• ... don't think about 'attention' as a specific NN architecture. Try to dissect what is intrinsically beneficial about it. Although specific attention mechanisms may go out of style, the underlying benefits of the idea will not. For example, take layer-wise pre-training. The original method of stacking RBMs and Autoencoders is no longer used, but if you think about the intuitions behind the idea and what it would look like to train the whole stack simultaneously, you get something like a Ladder Network.

• "Original method of stacking RBMs or Autoencoders is no longer used". Do you mean there's another better method for "later wise pretraining" currently or you're trying to say that layerwise pre training in general is not used due to some innovations like ReLU etc? Could you elaborate? p>
• Layer-wise training isn't used in general. Discussed here and recently here.

• Semi-Supervised Learning with Ladder Networks  |  GitHub  |  GitXiv

• echoed here: Need some help understanding Neural Net Activation and Loss functions [reddit]: "It's also worth mentioning that layerwise pretraining has been pretty much phased out, and is not useful anymore in most cases."

## PROBABILITY (REGRESSION)

• [u/gabegabe6] Iris classification with Keras
Here you can see how I did it [link no longer exists (2017-Sep-12)]: https://gist.github.com/gaborvecsei/49029cc8133805e38f826028b9ba715b#file-keras-md . My problem is that the accuracy is around 73%, and that's pretty bad. How can I make it better? (I just started learning Deep Learning.) How could I make it much better?

• There are three classes and you are using sigmoid output with binary cross-entropy loss.

• Use softmax activation for the final layer. Use categorical_crossentropy as objective. Play with parameters.

• Binary cross-entropy is for when you have two classes; this dataset has 3. So you should use softmax with the categorical_crossentropy loss function, as mentioned by other redditors.

• u/gabegabe6: Thanks. When I use categorical_crossentropy, I get this:

Exception: You are passing a target array of shape (120, 1) while using as loss categorical_crossentropy. categorical_crossentropy expects targets to be binary matrices (1s and 0s) of shape (samples, classes).

• Yes, the targets need to have shape (120, 3). But you also need to one-hot encode the labels. This tutorial might help you.

• u/gabegabe6: I know, I tried that, but then this is what I get:

Exception: A target array with shape (2402, 2) was passed for an output of shape (None, 1) while using as loss categorical_crossentropy. This loss expects targets to have the same shape as the output.

train_y looks like this:

array([[ 1.,  0.],
[ 1.,  0.],
[ 1.,  0.],
...,
[ 0.,  1.],
[ 0.,  1.],
[ 0.,  1.]])

... With this model I just get a warning:

model = Sequential()

... Okay, so with this and 3000 epochs I get 94% accuracy. :)
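Putting the thread's suggestions together (softmax output, categorical_crossentropy, one-hot targets), a minimal sketch; the layer sizes, optimizer and epoch count are arbitrary, and this is not the OP's actual gist:

from sklearn.datasets import load_iris
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import np_utils

iris = load_iris()
X = iris.data                                   # (150, 4)
y = np_utils.to_categorical(iris.target, 3)     # one-hot targets of shape (150, 3)

model = Sequential()
model.add(Dense(16, input_dim=4, activation='relu'))
model.add(Dense(3, activation='softmax'))       # one output unit per class
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, nb_epoch=200, batch_size=16, verbose=0)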

• Small Project on Evolutionary Algorithms
I have a small university project concerning evolutionary algorithms in which I will work on the NCAA dataset, that's part of the current Kaggle competition. The idea is:
1. Train some (simple?) learning algorithms on the dataset / subsets of the data;
2. Create an ensemble predictor by weighing the different trained models, so we get a prediction based on all the base models;
3. Repeat a lot: Use evolutionary methods to find good weights for step 2. ...
EDIT: The task is predicting the probability of which basketball team wins a given matchup, so I'm looking for regression models.
EDIT2: After looking into this some more, I realised that what I want to do is called bagging. I am probably going to first try a big bag of regression trees (which I guess would resemble a random forest). Secondly, I will try to combine different base models, so I will try to find good weights for a combination of a regression tree (or multiple ones?), an SVM, maybe MARS, maybe an ANN. If you have any more suggestions on this, I'd be very happy to hear them :)
• If I understand correctly you have to predict some kind of probability, right? That means you need to do regression (as opposed to classification). Regression trees should work fairly well. They should be able to handle both qualitative and quantitative data. Maybe MARS ...
If you want to use something like a neural network, it deals fairly naturally with numerical data (although you may want to normalize). You can use one-hot encoding for nominal data (if you have one variable that can take on N classes, create N corresponding input nodes and turn all off except for the one that corresponds to the variable's current value). For (bounded) ordinal data you can use a progressive encoding, so if there are N possible values, create N-1 nodes, and if the current value is the third, turn on the first two nodes. (So if you have Temperature which can be Cold, Warm, or Hot, then make 2 nodes and turn them all off for Cold, turn the first on for Warm and turn both on for Hot.) These encodings might also work for other algorithms.
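A small numpy sketch of the two encodings described above; the category orderings are arbitrary examples:

import numpy as np

def one_hot(value, categories):
    # nominal data: one node per category, exactly one turned on
    vec = np.zeros(len(categories))
    vec[categories.index(value)] = 1.0
    return vec

def progressive(value, ordered_levels):
    # bounded ordinal data: N levels -> N-1 nodes, turn on the first k-1 nodes for the k-th level
    vec = np.zeros(len(ordered_levels) - 1)
    vec[:ordered_levels.index(value)] = 1.0
    return vec

one_hot('Warm', ['Cold', 'Warm', 'Hot'])      # -> [0., 1., 0.]
progressive('Hot', ['Cold', 'Warm', 'Hot'])   # -> [1., 1.]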

## REGULARIZATION

Blogs:

• How common is L1/L2 norm regularization in deep learning today?
How do they compare to dropout?
• They are pretty common. $L2$/$L1$ penalize large weights and pull the learning toward smaller weights. Dropout drops nodes randomly from your network and forces it not to rely on only a small set of weights; this is very helpful for large networks. So dropout is like turning your network into an ensemble of smaller neural networks and averaging the results. Here is a great source about regularization techniques in deep learning: Overfitting and regularization
• Sometimes when I want to train a large model on not so large dataset, I see that using $l1, l2$ regularization is more helpful than dropout to achieve smaller loss. Sometimes using both of them together helps.
• You can use both. $L2$ regularization sometimes has Bayesian/generalization properties. Dropout prevents co-adaptation of features and has ensemble properties. $L2$ is more common and often more efficient if you don't care about sparsity, and $L1/L2$ is a bit tedious to tune.
• Weight decay, which is ubiquitous in conv nets, is $L2$ regularization. However, in my experience its impact is negligible. Some papers suggest using $L1$ for weight compression, but in my experience $L1$ considerably slows down convergence.
• From two projects with relatively shallow nets (1-3 hidden layers), dropout was much more impactful than $L2$. $L2$ helped mostly in one situation where I was getting $NaN$, and a tiny amount of $L2$ helped to reduce that. Haven't tried $L1$ in NNs.
• That's actually a great point. Anecdotally, a little bit of $L2$ regularisation can stabilise training with a cross entropy objective, where you can get sudden divergence quite late into the training process because the model becomes too confident on certain examples (just one is enough to ruin it). Dropout by itself has no such effect in my experience.
Another interesting case is AlexNet, where supposedly the weight decay actually improves convergence (i.e. lower training error) and not just generalisation. I took this to mean that there are some interesting interactions with the optimisation algorithm as well.
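A minimal Keras sketch combining a small $L2$ weight decay with dropout, as discussed above. The layer sizes, the 1e-4 decay and the 0.5 dropout rate are arbitrary; W_regularizer is the Keras 1.x keyword (newer versions call it kernel_regularizer).

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.regularizers import l2

model = Sequential()
model.add(Dense(512, input_dim=100, activation='relu', W_regularizer=l2(1e-4)))   # L2 weight decay on this layer
model.add(Dropout(0.5))                                                            # dropout between fully connected layers
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')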
Papers:

• Han S (2016) [Stanford University | NVIDIA | Baidu Research] DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow. arXiv:1607.04381  |  Compressing and regularizing deep neural networks [OReilly.com: Nov 2016]
• Modern deep neural networks have a large number of parameters, making them very powerful machine learning systems. A critical issue for training such large networks on large-scale datasets is to prevent overfitting while at the same time providing enough model capacity. We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks. In the first D step, we train a dense network to learn which connections are important. In the S step, we regularize the network by pruning the unimportant connections and retrain the network given the sparsity constraint. In the final D step, we increase the model capacity by freeing the sparsity constraint, re-initializing the pruned parameters, and retraining the whole dense network. Experiments show that DSD training can improve the performance of a wide range of CNNs, RNNs and LSTMs on the tasks of image classification, caption generation and speech recognition. On the ImageNet dataset, DSD improved the absolute accuracy of AlexNet, GoogleNet, VGG-16, ResNet-50, ResNet-152 and SqueezeNet by a geo-mean of 2.1 points (Top-1) and 1.4 points (Top-5). On the WSJ'92 and WSJ'93 datasets, DSD improved DeepSpeech-2 WER by 0.53 and 1.08 points. On the Flickr-8K dataset, DSD improved the NeuralTalk BLEU score by 2.0 points. The DSD training flow produces the same model architecture and doesn't incur any inference overhead.

## REINFORCEMENT LEARNING { Q LEARNING | DQN }

• A Potentially Ignorant Question
I have several thousand data points, each made up of about 15-30 individual floating point numbers. ... I would like to be able to predict the top 5 most likely to be correct data points given the current data point. In other words, I will have a current data point and need 5 new data points returned from my list of data points, each of which should be a good match to be next in the sequence given the current data point, preferably with the amount of confidence for each.
The idea is that I will have several thousand songs and I will extract some features out of each (bpm, genre, etc). These features will be stored in memory. Given the current song, I want to predict the best song to play next. However, I will not start with any probabilities of which will be next. Instead, I want to start by playing random songs and having a user rate the transition to the next song each time. Over time, it would learn user preferences. It's important to note that the reinforcement is from users rating the transition, but the decision is made just off of extracted features.
Ideally, I would not store a pair for each combination of songs. I would like to have some black box that, given the current song and a list of n-1 other songs, scans through and calculates some score for each, selecting the 5 best next songs. In other words, I don't want to store state explicitly between any 2 pairs of songs, but rather have the state in the black box itself.
• What you're describing is well-studied in reinforcement learning. You're ultimately trying to solve a deterministic action Markov decision process. The environment is not fully observed, since the user might be in a (unobservable) particular mood which changes how they rate a given song trajectory. I'd suggest you read up on Q learning, to start with.
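A toy sketch of that suggestion: Q-learning with a linear function approximator over song-transition features, where the reward is the user's rating of the transition. The feature map, learning rate and discount factor here are invented for illustration, not part of the original discussion.

import numpy as np

def phi(current, candidate):
    # hypothetical feature map for the transition current -> candidate
    return np.concatenate([current, candidate, np.abs(current - candidate)])

class TransitionQ(object):
    def __init__(self, n_features, alpha=0.01, gamma=0.9):
        self.w = np.zeros(3 * n_features)
        self.alpha, self.gamma = alpha, gamma

    def q(self, current, candidate):
        return self.w.dot(phi(current, candidate))

    def top5(self, current, library):
        scores = [self.q(current, song) for song in library]
        return np.argsort(scores)[-5:][::-1]          # indices of the 5 best-scoring next songs

    def update(self, current, chosen, reward, library):
        # one-step Q-learning: move Q(current, chosen) toward reward + gamma * max Q(chosen, .)
        target = reward + self.gamma * max(self.q(chosen, s) for s in library)
        td_error = target - self.q(current, chosen)
        self.w += self.alpha * td_error * phi(current, chosen)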

## ReLU [Rectified Linear Unit]

• Is truncated normal with low std (0.01-0.1) a better initialization than random normal for CNN filters?
How about biases? I've seen people initialize them as constants and others as random variables
• reddit: The current basic recommendation [arXiv:1502.01852] with ReLU units is to initialize using

np.random.randn(n) * sqrt(2.0/n)

• It's somewhat surprising to me that orthogonal initialisation hasn't become standard by now. For very deep nets (and especially for RNNs, but I can't really speak from experience there) it really makes a difference.
• Is there a simple way to do this for convolutional filters? (There probably is, and it's probably in Lasagne, right? :P)
• Yep, right here. Basically you need an orthogonal matrix of size (input_channels * height * width, output_channels) and then you simply reshape it. Multiplying by sqrt(2) to compensate for ReLUs is sometimes done, but it's usually unnecessary unless the net is very deep.
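A rough numpy sketch of that recipe (the SVD route and the optional gain factor are assumptions on my part; lasagne.init.Orthogonal does the equivalent):

import numpy as np

def orthogonal_conv_init(n_out, n_in, h, w, gain=1.0):
    # orthogonal matrix of size (input_channels * height * width, output_channels), then reshaped
    flat_shape = (n_in * h * w, n_out)
    a = np.random.randn(*flat_shape)
    u, _, v = np.linalg.svd(a, full_matrices=False)
    q = u if u.shape == flat_shape else v          # pick the SVD factor with the right shape
    return gain * q.T.reshape(n_out, n_in, h, w)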
• n is the number of filters?
• It's the fan-in. If your weight shape is [N x C x H x W] (N = #feature maps in this layer, C = #feature maps in previous layer), then it's

np.random.randn(N,C,H,W) * sqrt(2.0/(C*H*W))

H & W are the Height and Width of the convolution kernel.

• So you initialize your bias as +0.25?
• If you are using ReLU, this is a sensible initialization strategy.

## RNN [RECURRENT NEURAL NETS] | LSTM [LONG SHORT-TERM MEMORY]

• In LSTM's does each input connect to all of the hidden nodes, or just for one hidden node? [reddit]
I've seen two scenarios: an LSTM with each input connected separately to a hidden node [image], and an LSTM with each input connected to all of the hidden layers [image]! So, which is the correct way of connecting an LSTM?
• LSTMs with connections between its inputs and all layers are usually said to have "skip connections." There's no correct way. They're just different.
Adding skip connections will increase training time (especially if your inputs are large). Maybe this is worth it if performance goes up, but this will depend on the application. For example if you have a small dataset maybe this will actually hurt your performance by leading to worse overfitting / increased generalization error.
• Each circle in the first image is just a "zoomed out" view of the second. If not mentioned explicitly, assume that everything is fully-connected.
• The original LSTM paper [pdf] suggests that the network is fully connected, so I guess it is just for simplifying the drawing they removed (most of) the connections.

• How can I Improve my NLP Task?
I know this is very unconventional (nonsense): I have a feed-forward neural network that predicts the next character based on the last $n$ characters. I tried values of 4, 5, 7 and 12 for $n$. A character is represented as a 28-dimensional vector, where the first 27 dimensions are one-hot (the 27th is space) and the 28th is shift; shift + space encodes a newline.
I took Pride and Prejudice as training data. KL divergence is my error measure and RMSProp with first and second order momentum and standard parameters is my weight update algorithm.
I tried batch sizes of 40, 100, 400 and 1000. After every step, I output 100 characters of generated text.
The only network architecture that didn't end up always putting out the same character, or only spaces, was one with sine activations in all hidden layers and sigmoid in the output layer. The first layer after the input was a wide layer with more nodes than input nodes, followed by 4-5 layers decreasing in size.
With a test training set consisting only of "LOLOLOLOLOLO...", it converged after 3 training steps with n = 4 and batch size 40, but diverges after that and never converges again. With the Pride and Prejudice set, it puts out total gibberish, no words whatsoever. The classification error is decreasing, but plateaus after ~100 training steps. It does model some probabilities: for example, a vowel is always followed by a consonant, there are never more than 3 consonants in a row, word length is realistic, and shift is low most of the time, but it hasn't learned that it should only be high after a space character. Also, I think the words sound somewhat English.
My intention behind the sine is that the bias is far easier to train because of the periodicity of the sine function. I already trained some sine networks on MNIST and image regression, and got very good results with both of them.
Is there any work on Sliding-Window Feed-Forward Neural Networks used for character-level text synthesis?

• it converged after 3 training steps with n = 4 and batch size 40, but diverges after that and never converges again << It sounds like either the weights are initialised poorly or your learning rate is too high... Which is something I'm guessing you've checked already, because it sounds like you know what you're doing. At any rate, I don't know what a good method for initialising weights in a network with sinusoidal activation functions is, but in ReLU and tanh networks, you should be using He and Xavier initialisation, respectively. But if all the weirdness you describe is a result of you writing the training algorithm yourself, then good luck getting any useful advice from /r/machinelearning.

• Thanks for the tips :) I actually didn't try to adapt the learning rate, because it's already at 0.001, which is standard for RMSProp. I'll try that. I think I have to explore some more options regarding weight initialization. At the moment all I do is initialize them with a quadratic distribution and normalize them so that every weight+bias vector's L2 norm is exactly $2\pi$. Yes, I wrote it myself.
EDIT: as I think of it, it may be better to normalize them so that the activation of the sine units ranges from $-\pi$ to $\pi$; to do that, I have to normalize them using the L1 norm, not the L2 norm.

• I actually didn't try to adapt the learning rate, because it's already at 0.001, which is standard for RMSProp << I missed that bit. Obviously not a learning rate issue then.

all I do is initialize them with a quadratic distribution and normalize them so that every weight+bias vector's L2 norm is exactly $2\pi$ << Assuming your network has many layers and there isn't a bug elsewhere, this is probably what's causing the issue. Standard initialisation techniques start the bias terms off at zero, and ensure that the average value of the weights is zero, with their variance set depending on your activation function and the width of the layer they appear in.

• Instead of manually sliding a window around you want to use an RNN, check out CharRNN and one of the many blog posts about it for more info.

## SEMI-SUPERVISED LEARNING (SSL)

• Semi-supervised learning: drowning in training data
I am currently developing an activity recognition system (detecting walking, standing, sitting and lying) using two accelerometer sensors. Raw acceleration data as input to a CNN is now producing decent results. What I am doing now is personalizing the models for each subject using SSL [semi-supervised learning]: first, a static model using the labelled data, then using SSL to retrain the model with parts of the subject's unlabelled data. The problem is that the newly labelled data does not influence the result (it seems that the amount of new data is so small that it does not impact the overall amount of data). Do you guys have any idea how I could weight the newly labelled data, so that the new data will influence the model? If I reduce the amount of training data, the starting accuracy will improve, but not the overall accuracy. TL;DR: how do I make newly labelled data influence the model when the amount of training data is large?
• Not sure I understand, but here is an attempt:
• Undersampling. Pick 100% of your new data (which makes n rows) and n other rows from the "old data". You now have a more balanced set.
• Stacking it. Of course, you lose a lot, too. If you feel like it, you can repeat the process several times, with 90% of the "new data" and the same amount of the "old data". This will produce several distinct models worth stacking.
• Boosting. You had a previous model, right? Why not add the "old" prediction to your variables, so that the new model can learn to adapt it? Because it should be easier than a complete retraining, the undersampling technique could work better here.
Does that make sense? I'm asking because I'm really unsure about my understanding of your problem :-)
• Thanks! I think you understood the problem. Undersampling is something I have tried, and the increased accuracy is noticeable, but the final accuracy is lower than the accuracy achieved when using the entire dataset. Have not tried stacking it. Do you mean, say, training n classifiers with different subsets of the labelled data and feeding their predictions into a new classifier? Boosting is also something I could look into! I have also thought about oversampling the predicted labels (e.g. when it predicts "walking" it adds n copies of that instance) so those instances have more weight, but I want to do this in a more "algorithmic fashion".
• You can achieve the same result as adding n "copies" by just weighting the samples differently during training. Remember that you have full control of the loss function at any given point, so you're free to give samples uneven weight there. SGD is of course all about minimizing the loss function, so even small changes to how you define it can sometimes greatly impact your results. Also, are you training a "shared" model first and then trying to retrain that on new data to create a specialized model? Or are you starting from scratch each time with the combined data? The former may have some challenges: if the pre-trained network has "settled" too far into a good solution, it may be harder to train it away from that solution.
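A minimal sketch of that per-sample weighting idea, assuming Keras (whose fit() accepts a sample_weight array); the toy data, tiny model and the 5x factor are stand-ins for illustration:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

X = np.random.randn(1000, 20)
y = (X[:, 0] > 0).astype(int)
is_new = np.zeros(1000, dtype=bool)
is_new[-50:] = True                    # pretend the last 50 rows came from self-labelling

weights = np.ones(1000)
weights[is_new] = 5.0                  # up-weight the newly labelled samples in the loss

model = Sequential()
model.add(Dense(1, input_dim=20, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='sgd')
model.fit(X, y, sample_weight=weights, nb_epoch=5, batch_size=32, verbose=0)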
• Oversampling may work, but I have never seen it work in real-life cases, which means it is very dependent on the project. I would not invest too much effort in it to start with. There are less brutal techniques, like SMOTE, but again it may or may not work.
I think you understood what I meant by stacking, which, now that I think about it, is what we usually call bagging (rows, not columns). You pick rows randomly, train a model, and see what it does. The key here is that stacking bagged models reduces variability. Each model may overfit the training set because you don't have enough data, but once you start stacking them, you may reduce this problem. It won't solve everything though, so make sure your models don't overfit like hell. :)
Boosting is the most promising idea, I guess. But that is pure intuition, so let's try and see what actually works :)
Oh, and by the way, you can combine everything. Using random forest at some point should be considered.
• Do you have a baseline? I don't think KNN would be affected by this, and I really like this example [Jupyter notebook]. One way to improve this is to treat your unlabeled data as unlabeled (rather than relabeling it with a "weak model"), but penalize the entropy of the prediction in the unlabeled case, as in this paper: arXiv:1406.5298. In this way, your classification power is only coming from the labeled data, but you are telling the model that if it is going to be wrong (the unlabeled case), it should be wrong confidently in one class. This kind of sets up the other, labeled cost piece to shift the confident but wrong distribution to be over the correct class. With this cost I have not seen the issue you mention, but it is more complicated to think about. Alternatively, you could massively upweight the correct data (maybe * n_samples_your_model_labeled?) at the end of training, so you end up with "strong classes" (ground truth) and "weak classes" (estimates from the model).
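A conceptual numpy sketch of that entropy penalty; the weighting term lambda_u and the way labeled and unlabeled batches would be mixed are assumptions, and the cited paper's full model is more involved:

import numpy as np

def semi_supervised_loss(p_labeled, y_onehot, p_unlabeled, lambda_u=0.1):
    eps = 1e-8
    # supervised term: ordinary cross-entropy on the labeled batch
    ce = -np.mean(np.sum(y_onehot * np.log(p_labeled + eps), axis=1))
    # unsupervised term: penalize high-entropy (unconfident) predictions on unlabeled data
    entropy = -np.mean(np.sum(p_unlabeled * np.log(p_unlabeled + eps), axis=1))
    return ce + lambda_u * entropy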

## SUPERVISED LEARNING

• labeled temporal spatial data for classification: best neural network architecture for temporal-spatial data
I want to implement a supervised approach using neural networks. I have my data in the following 3D numpy array shape: (samples, temporal data dimension, spatial data dimension). What is the best suited neural network for this problem?
• Recurrent Convolutional Neural Network.
• Seems the vanilla single-directional or bi-directional RNN architecture can only reliably handle a relatively small number of time steps, though. Proper scaling helps (e.g. not being too far zoomed in), but feels hacky. And if information from multiple very different timescales is highly relevant at the same time, it seems a vanilla RNN will have a bad time.
• Depends on the spatial data dimension. Anything in double digits would require (stacked) LSTMs. Reducing that dimension by extracting local features using CNNs is a good approach.

## TENSORFLOW

• Simple neural net implemented "from scratch" and equivalent TensorFlow version (Exercise)

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x))
• $x$ is a numpy array, so $np.exp(x)$ is element-wise exponentiation, divided by $np.sum(...)$ scalar. Should be ok, no?
• Only if $x$ is a vector. If $x$ is an array, you have to sum for each sample individually. Also, you're evaluating the exponential twice.
• He's doing backprop sample by sample which means it's okay.
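A version addressing both comments above (per-row sums and a single exp()), with the usual max-subtraction for numerical stability; this is a sketch, not the original author's code:

import numpy as np

def softmax(x):
    x = np.atleast_2d(x)
    e = np.exp(x - x.max(axis=1, keepdims=True))   # exp computed once; max-shift avoids overflow
    return e / e.sum(axis=1, keepdims=True)        # normalise each sample (row) separately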

• Training ensembles in TensorFlow: I was wondering if anyone had experience training ensembles in TensorFlow? Essentially I have an architecture consisting of a number of networks of heterogeneous architecture combined as a product-of-experts, and a cost which is minimised over the ensemble as a whole, not over individual members. I've done the classification tutorials and understand how to train a single net, but I can't find how to reuse components and train an ensemble. If anyone has experience with how to do this, or suggestions of what I might do, it'd be very helpful.
• If you know how to train a single net, you know how to train multiple nets. Or just use dropout.
• Just define all your models and then combine their outputs using tf.mul

## THEANO

• Theano resources [Theano; AWWS, GPU tricks; LSTM code]
I've recently switched (from my own custom NN code) to Theano, and it's been a marvel to have the benefits of a flexible optimized graph, but I often find it's a bit of a black box. It can do a lot of very nice things, but I always worry that they might cost me in performance. E.g., is it more efficient to concatenate vectors in order to do fewer, larger matrix multiplications (e.g. in the units of an LSTM or a GRU), or does the loss in memory locality, or shuffling stuff around in memory, outweigh the gains of larger matrix multiplications? I still haven't figured out how to efficiently implement attention with minibatches.
I realize I can try to profile and test this stuff (and I do) but I am wondering where people normally go to get this kind of info on theano? Does anyone have any nice resources on theano's performance? Is the mailing list the best place to ask such questions?
• "It can do a lot of very nice things, but I always worry that they might cost me in performance. E.g., is it more efficient to concatenate vectors in order to do fewer, larger matrix multiplications (e.g. in the units of an LSTM or a GRU), or does the loss in memory locality outweigh the gains of larger matrix multiplications?" << You really have to profile, especially with RNN/scan, but it's generally faster to dot & slice than to do many smaller dots. After some time you gain a kind of intuition for this stuff and know what to try when.

For reference, this is the LSTM step function I'm currently using:

def lstm_step(x, h_tm1, c_tm1):
    # project the input and the previous hidden state, then slice out the gates
    x = T.dot(x, W) + T.dot(h_tm1, U) + b
    I = inner_activation(x[:, :output_dim])                             # input gate
    f = inner_activation(x[:, output_dim:2*output_dim])                 # forget gate
    c = f * c_tm1 + I * activation(x[:, 2*output_dim:3*output_dim])     # new cell state
    o = inner_activation(x[:, 3*output_dim:])                           # output gate
    h = o * activation(c)
    return h, c

T.dot(x, W) + b  can be pulled outside the scan (see the sketch after this list). Doing so may lead to three different results:
• needs more memory
• faster + needs more memory
• significantly slower + needs more memory
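A sketch of that rearrangement: the non-recurrent projection is computed once outside the scan, and the step function keeps only the recurrent dot. Here X is the full (time, batch, input_dim) input, and W, U, b, output_dim and the activation functions are the same symbols as in the step function above.

X_proj = T.dot(X, W) + b                  # computed once for all time steps, outside scan

def lstm_step(x_proj, h_tm1, c_tm1):
    p = x_proj + T.dot(h_tm1, U)          # only the recurrent multiplication remains inside the loop
    I = inner_activation(p[:, :output_dim])
    f = inner_activation(p[:, output_dim:2*output_dim])
    c = f * c_tm1 + I * activation(p[:, 2*output_dim:3*output_dim])
    o = inner_activation(p[:, 3*output_dim:])
    h = o * activation(c)
    return h, c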
• Thanks, that's already helpful. Taking the non-recurrent multiplication out of the scan is a really obvious trick I missed. I understand the memory usage, but why would it ever be significantly slower? So you don't have any particular place you went to learn this stuff? Do you just profile using profile=True in the function call (and then f.profile.print_summary)?
• Wow, cuDNN 4 + CNMeM seems to have given me a free 2x speedup!
• Don't forget about the cuDNN flags, see my post here
• Taking the non-recurrent multiplication out of the scan is a really obvious trick I missed. I understand the memory usage, but why would it ever be significantly slower?
GPUs are weird... I can only speculate that inside the scan it might be able to perform the two dots in parallel (= higher utilization), whereas in the out-of-the-scan case there's a dependency on the result, which means scan (resp. CUDA) has to wait for it before it can even begin the loop.
So you don't have any particular place you went to learn this stuff?
Well, you can look at Theano's source code (for low-level stuff) and other people's implementations.
Do you just profile using profile=True in the function call (and then f.profile.print_summary)?
No, I'm only interested in total throughput, so I just time the function call, try different things and look at total GPU usage. Detailed profiling doesn't really help much more (IMO), except when you have some obviously stupid ops. debugprint()'ing the graph comes in handy sometimes.

## TIME SERIES [e.g. AUDIO | VIDEO]

• Machine learning in time series?
I'm rather new to the field, so can anyone suggest any basic guides on applications of machine learning to time series/signals, as in acoustics? I am interested in data mining, clustering and detection of signal elements defined by morphological features.
• I think the best technique that works with time series such as audio or video is the [Hidden Markov Model](https://en.wikipedia.org/wiki/Hidden_Markov_model). You can train separate models to recognize different actions/events (in the video case) or sound patterns (in the audio case). There are a couple of sophisticated implementations for this in different languages. The one I used recently was an [implementation for MATLAB](http://www.cs.ubc.ca/%7Emurphyk/Software/HMM/hmm.html), developed by Kevin Murphy himself. MATLAB's own implementation is also neat but I don't think it works with continuous data, which is usually the case for audio/video.
• In the literature "time-series" and "acoustics" are usually very different things (of course, you can convert "acoustics" into low D time series using MFCC, as in fig 7 of [a]) For time series classification and clustering, the state of the art is still using the raw data, and either the Euclidean or DTW distance [b, c]. This is true, in spite of many claims to the contrary (source: The 36 million experiments of [c], and my own few tens of millions of experiments). I know nothing about acoustics.
• [alexmlamb] Can you be more specific? RNNs, convnets, and HMMs are all models that have been used with time series, but for different sorts of tasks.
• I'm not the OP, but in which situations would an HMM be advantageous over an RNN? I have a (very) weak understanding of these models but from what I've read it seems like RNN's almost always perform better on time series data.
• [u/alexmlamb] So here's a hypothetical scenario: let's say that you train an RNN over variables running from $t = 1$ to $t = N$. Now let's say that you want to compute $p(t[0:10] | t[10:50])$. How could you do this efficiently? To my knowledge you can't, without training a separate network that runs in the opposite direction or drawing lots of samples - either solution has issues. However, this type of inference can be done with HMMs. Also, HMMs just have a probabilistic setup that's distinct from RNNs. So RNNs can be interpreted as modeling the product of conditionals $p(y1) * p(y2 | y1) * p(y3 | y1, y2)$, whereas HMMs model $p(y1 | h1), p(y2 | h2), p(y3 | h3)$ and $p(h2 | h1)$, and $p(h3 | h2)$. That's not really an advantage, but you could consider a situation where that probabilistic model is a better fit for what you're trying to do. But in general RNNs perform better because they're much stronger models in terms of representational power.
• Somewhat related to : https://www.reddit.com/r/MLQuestions/comments/484k0a/machine_learning_deep_and_wide_data_how_to_start

## TRAINING | VALIDATION | TEST SETS

• Help: How do I know when to stop training?
Hi. I've been experimenting with NNs lately, and I come up with dozens of questions every day. I have this product-title -----> product-category dataset I'm playing with. I feed the NN vectorized titles and their target categories. The orange line is the smoothed training loss, and the blue line is the eval loss. IMHO, according to that graph, overfitting is still not an issue. This other graph represents the evolution of the model's accuracy over time. Accuracy seems to still be improving, though more slowly than before. Should I let the model train some more? And how should I go about making that decision myself?

• You can store the weights every time you reach a minimum error in the validation set.
• So, just save a checkpoint and keep going?
• That would be the idea, and you save only the one with the minimum validation error ever, replacing the previous sub-optimal one.
• Thanks, I appreciate the insight. What about the number of iterations? I understand that's very related to the architecture of the network, but most examples I see run for 500-1000 steps, while I'm sure I'll hit 1 million iterations in no time looking for that error minimum. Is that reasonable and a regular occurrence?
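A minimal sketch of the save-best-checkpoint idea using Keras callbacks; the monitored quantity, patience and filename are arbitrary, and model and the data splits are assumed to already exist:

from keras.callbacks import ModelCheckpoint, EarlyStopping

callbacks = [
    ModelCheckpoint('best_weights.h5', monitor='val_loss', save_best_only=True),  # keep only the best weights so far
    EarlyStopping(monitor='val_loss', patience=10),                               # optionally stop once val loss stalls
]
model.fit(X_train, y_train, validation_data=(X_val, y_val),
          nb_epoch=1000, batch_size=128, callbacks=callbacks)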

• Is the difference between train loss and test loss a valid regularization term?
... or am I implicitly training in the test data?

• You are implicitly training on the test data. In general you should not be doing anything with test data except for testing on it, once, to get the final results that you report. If a different test set would cause you to select a different model -- which it certainly would in this case -- then you do not have an independent test set and your results are not valid. Of course you can always split your training data into training and validation sets, and incorporate validation set error into your objective. It's not obvious that the objective you proposed would have better properties than just training on the combined (larger) dataset, though.

• You would be implicitly training on the test if the test data went through the model (it would have to) to calculate the loss and update the weights. You should also note that the loss over any test example has _nothing_ to do with the input features from the training example, which would be bad news generally - you would only really want to measure global statistics about the loss over the whole test set, not any individuals.

You could compare the train loss with a _prior_ over the test loss (maybe coming from a known model, or a known imbalance/performance?) though I don't really know what good it would do. You can do this for almost any interpretation of a layer generally - this is basically what the variational autoencoder does (constrain a layer to fit a diagonal Gaussian prior). This prior is quite different from something that changes as the model updates, though.

• So your objective function would be something like a sum of squared errors over your training set plus some multiplicative factor of the difference between training loss and regularization set loss, right? How would you choose the multiplicative factor?

• Like any other hyperparameter of the model.

• To add to what people are saying: if your main objective loss is $L_o$, then your new loss is
$L = L_o(\text{train}) + \beta \, (L_o(\text{train}) - L_o(\text{test}))^2$
$\;\; = L_o(\text{train}) + \beta \, L_o(\text{train})^2 + \beta \, L_o(\text{test})^2 - 2\beta \, L_o(\text{train}) \, L_o(\text{test})$
$\;\; = L_o(\text{train}) \, (1 - 2\beta \, L_o(\text{test})) + \beta \, L_o(\text{train})^2 + \beta \, L_o(\text{test})^2$.
For reasonably small beta, this is effectively just jointly minimising training and testing error, with a bias towards minimising training error. You are effectively optimising against the test set.

• Question regarding understanding of Neural Networks
Let us assume we have an MLP with dimensions (5, 4, 3). We have 100,000 inputs, where the input data is the same all the time. The output is (1, 0, 0) for 70,000 of these, (0, 1, 0) for 20,000, and (0, 0, 1) for the rest. After training on the inputs in random order, what output will I get from the neural net if I feed in the same dataset as the one it was trained on?
(a) Something close to (1, 0, 0).
(b) Something close to (0.7, 0.2, 0.1)
(c) Something else
If the answer is not (b), how would I try to achieve result (b), which resembles the probabilities you see in the test data?

• Assuming you use the categorical cross-entropy loss on a softmax output layer, you will get close to (b). I was not 100% sure about this, so I decided to quickly test it using Python and Keras. If you want to check it out yourself, here is the code to do so:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import RMSprop
from keras.utils import np_utils

def generate_data():
    # identical inputs; labels 0/1/2 in a 70/20/10 split, one-hot encoded
    x = np.zeros((100000, 10))
    y = np.concatenate((np.array([0]*70000), np.array([1]*20000), np.array([2]*10000)))
    np.random.shuffle(y)
    y = np_utils.to_categorical(y, 3)
    return x, y

def generate_model():
    # the original post omitted the layer definitions; a single softmax layer suffices here
    model = Sequential()
    model.add(Dense(3, input_dim=10, activation='softmax'))
    rms = RMSprop()
    model.compile(loss='categorical_crossentropy', optimizer=rms)
    return model

def train_model(model, x, y, nb_epochs=10, batch=512):
    model.fit(x, y, nb_epoch=nb_epochs, batch_size=batch, verbose=2)
    return model

x, y = generate_data()
model = generate_model()
model = train_model(model, x, y)
model.predict(x[:1])

The output as expected:

array([[ 0.68932748,  0.20502673,  0.1056458 ]], dtype=float32)

• Thank you very much. What is the cost function in this example? I cannot see it from the code, since I am not used to Keras. I thought that in general you do not get something so close to probabilities.

• It's the crossentropy loss function for more than 2 categories, it's in the compile line:

loss='categorical_crossentropy'

• It depends on your choice of loss function. If you use cross-entropy loss (which is minimized when the predicted probability distribution is close to the underlying probability distribution), it will get you close to (b).

• All neural networks must be evaluated against a loss before backpropagation can update the weights; this evaluation is often called the "cost" function. Without knowing how the cost is evaluated, it would be impossible to know the outcome. For example, in TensorFlow the cost function for a regression problem is often defined like this:

with tf.name_scope("cost") as scope:
cost = tf.reduce_mean((y_-Hypothesis)**2)

Here the cost is the mean of the squared differences; "reduce_mean" averages the squared errors over the batch, and an optimizer is then used to make this cost as small as possible. There are many different possible cost functions, and each might lead to different results.

• Partitioning data for dimension reduction and classification pipeline
Let's say I want to test the performance of my dimension reduction + classification pipeline. To do this, I will use k-fold cross validation (divide data in training and testing sets). I know that performing dimension reduction on the complete dataset before creating the k folds is bad due to overfitting. To avoid this, the k folds for training and testing are created first. My question is the following: how should my dimension reduction + classification pipeline learn? I see two options:

1. Take my training data, divide it in two (how many samples go to each is to be determined). Use one subset to learn the dimension reduction mapping. Then, pass the other set through the learned mapping and use the reduced features to learn the classifier. Now, none of the steps have overfitted.

2. Take my training data and apply my dimension reduction to it, i.e., use the same data both to learn the mapping and to produce the reduced features. Use the reduced data to learn the classifier.

I tend to prefer approach (1), given that no overfitting occurs. With method (2), I would run into issues when I want to use my dimension reduction + classifier pipeline on new data. Is approach (1) the correct one? Is there another way to do this? I'm not making assumptions about whether the dimension reduction is supervised or not.

• Generally you would want to split the data into train/test for each fold, then run the entire process on the training data (dimension reduction, train the classifier, ...). You then apply the transforms fitted on the training data to the test data to evaluate them (a Pipeline sketch follows this thread).

• [OP] Right, that would be approach (2) above, but if my dimension reduction, for example, extracts the features that are most correlated with the labels, then the classifier would be overfit, right? I'd be using the same data to train the dimension reduction and to produce the reduced features for the classifier. Wouldn't it make sense to split this training data as in approach (1)?

• The purpose of k-fold is to evaluate the approach on unseen data, mimicking what it would look like in deployment. If you use the test data at any point in the training pipeline, the evaluation doesn't apply to the real world, since you won't have that test data when you are training the actual system. Usually you do dimension reduction when it isn't feasible to pass all the features to the classifier, so it isn't necessarily overfitting; it is learning from the most relevant information.
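• [A minimal sketch, assuming scikit-learn, of the approach described above: putting the dimension reduction inside a Pipeline means it is re-fit on the training split of every fold and only applied (not re-fit) to that fold's test split, so nothing leaks from the test data. The dataset here is random and purely illustrative.]

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.random.randn(200, 50)               # illustrative data: 200 samples, 50 features
y = np.random.randint(0, 2, size=200)      # illustrative binary labels

pipe = Pipeline([
    ('reduce', PCA(n_components=10)),       # fitted on the training split of each fold
    ('clf', LogisticRegression()),          # trained on the reduced training features
])

scores = cross_val_score(pipe, X, y, cv=5)  # PCA + classifier re-fit inside every fold
print(scores.mean())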

• What to do with small set of labeled data and large set of unlabeled data?
We have a set of, say, 10K labeled images (two classes), and an unlabeled set that is maybe 10X larger (or even 100X; it doesn't really matter for this discussion). What I'm wondering is: can I train a NN on the initial labeled set of 10K images, then use that model to label the larger unlabeled set, and then use that larger (self-)labeled set to train the model again? Will this result in a better model? If so, does anyone have links to literature on this approach, and what is it called? Is this an example of semi-supervised learning?

• Hmm, what would you expect the model to learn from being re-trained on its previous outputs? Seems like you're just training to reinforce earlier decisions without adding anything new for it to learn from.

• Semi-supervised learning incorporates both unsupervised and supervised objectives into the training. In the example you've provided you only ever do supervised training, so I would say that this is not an example of semi-supervised learning.
Two ways to incorporate unsupervised learning are to add an autoencoder or adversarial component to your architecture. Basically you can then train the model on both the labeled and unlabeled examples, and afterwards use it to predict labels for the unlabeled examples.
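• [A rough sketch, not from the thread, of one way to add an autoencoder objective in Keras: a shared encoder feeds both a reconstruction head (usable on unlabeled data) and a classification head (usable on labeled data). The 784-dimensional input and layer sizes are illustrative assumptions.]

from keras.layers import Input, Dense
from keras.models import Model

inp = Input(shape=(784,))                                   # flattened image (assumed size)
h = Dense(128, activation='relu')(inp)                      # shared encoder
recon = Dense(784, activation='sigmoid', name='recon')(h)   # unsupervised (autoencoder) head
clf = Dense(2, activation='softmax', name='clf')(h)         # supervised head (2 classes)

model = Model(inp, [clf, recon])   # Keras 2 API; Keras 1.x uses Model(input=..., output=...)
model.compile(optimizer='rmsprop',
              loss={'clf': 'categorical_crossentropy', 'recon': 'binary_crossentropy'},
              loss_weights={'clf': 1.0, 'recon': 0.5})

# Training idea: fit on labeled batches with both targets, and on unlabeled
# batches with the reconstruction target only (e.g. by zeroing the 'clf'
# sample weights), so the unlabeled images still shape the shared encoder.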

• Can you recommend a good article or paper on this technique?

• Released today: arXiv:1606.03498 ["Improved Techniques for Training GANs"]

• This one is easy to implement and effective. Another LeCun et al. gem IMO [Victoria: GEM: generalized eigenvectors for multiclass?]

• I don't see any improvement coming from the additional training. Quite the contrary: I can see the second round of training imposing, and possibly reinforcing, some arbitrary structure onto the model. You also end up with a circularity problem, because in the second round you are treating the model's own labels as correct simply because the model assigned them.

• Your algorithm doesn't work because you can't feed weak guesses back in as ground truth and bootstrap your way up from nothing. All that does is turn initial uncertain guesses into final certain mistakes.
Is this an example of semi-supervised learning?
What you want is semi-supervised learning, yes. If you are able to get labels for that unlabeled set on demand (perhaps you don't have time to label 1 million images yourself, but labeling a few would take you just a few seconds), then you could also do 'active learning' and use the model trained on the 10k images to select which of the unlabeled images would be most helpful to label (for example, if your binary class has an even 50:50 base rate, then one active-learning algorithm would be to run the unlabeled images through the model and get the classification probability; the images that come out near 50%/50% are maximally difficult for your NN and would be the most informative ones to label).
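• [A minimal sketch, assumed rather than taken from the thread, of that uncertainty-based active-learning step: score the unlabeled pool with the model trained on the 10k labeled images and hand the examples whose predicted probability is closest to 50% to a human labeler.]

import numpy as np

def select_most_uncertain(model, x_unlabeled, n_to_label=100):
    # predicted probability of the positive class for every unlabeled image;
    # assumes a single sigmoid output (for a 2-column softmax use predict(...)[:, 1])
    p = model.predict(x_unlabeled).ravel()
    uncertainty = np.abs(p - 0.5)            # 0 = maximally uncertain (50/50)
    idx = np.argsort(uncertainty)[:n_to_label]
    return idx                               # indices of the images worth labeling next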

## WEIGHT INITIALIZATION

• [Convnets, Keras] Training loss is stuck at the initial value when I start with a higher learning rate and slowly decrease it down to a learning rate for which the training loss decreases - Why? [reddit]:

• Layer-sequential unit-variance (LSUV) initialization for CNN: Implementation of the Layer-sequential unit-variance neural network initialization described in the paper "All you need is a good init" [GitXiv: links to arXiv:1511.06422]:

• Layer-sequential unit-variance (LSUV) initialization - a simple method for weight initialization for deep net learning - is proposed. The method consists of two steps. First, pre-initialize the weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one. Experiments with different activation functions (maxout, ReLU-family, tanh) show that the proposed initialization leads to learning of very deep nets that (i) produce networks with test accuracy better than or equal to standard methods and (ii) is at least as fast as the complex schemes proposed specifically for very deep nets such as FitNets (Romero et al. (2015)) and Highway (Srivastava et al. (2015)). Performance is evaluated on GoogLeNet, CaffeNet, FitNets and Residual nets, and the state of the art, or very close to it, is achieved on the MNIST, CIFAR-10/100 and ImageNet datasets.
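• [A rough numpy sketch of the two LSUV steps for a toy fully-connected net; the real implementation is in the linked GitHub repo / arXiv:1511.06422, and the ReLU forward pass here is an illustrative assumption.]

import numpy as np

def orthonormal(shape):
    # Step 1: orthonormal pre-initialization via QR decomposition
    n = max(shape)
    q, _ = np.linalg.qr(np.random.randn(n, n))
    return q[:shape[0], :shape[1]]

def lsuv_init(weights, x_batch, tol=0.01, max_iter=10):
    # Step 2: walk the layers from first to last, rescaling each weight
    # matrix until the variance of that layer's output on a data batch is ~1
    h = x_batch
    for W in weights:
        for _ in range(max_iter):
            v = (h @ W).var()
            if abs(v - 1.0) < tol:
                break
            W /= np.sqrt(v)
        h = np.maximum(h @ W, 0.0)   # propagate through an (assumed) ReLU
    return weights

# illustrative usage on a 10-20-3 toy net with a random data batch
weights = [orthonormal((10, 20)), orthonormal((20, 3))]
weights = lsuv_init(weights, np.random.randn(512, 10))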

• Maclaurin D (2015) Gradient-based Hyperparameter Optimization through Reversible Learning. arXiv:1502.03492
• Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.

• Mentioned here: Why is the learning process only changing the weights but not the structure?: "... Gradient descent (ie deep learning) has been very successful over the past few years and is the current vanguard of a lot of ML approaches, this is one of the few papers that even attempts to tackle optimizing hyperparameters as a learning objective but has some significant downsides. ..."