# VICTORIA's MACHINE LEARNING NOTES

## Persagen.com

• This file is:  Persagen.com/files/ml.html

• TIP!  Most internal links (Persagen.com) to this rather large page open it in a single, separate, reused tab.
If you browse this site much, leaving that tab open will noticeably speed up your browsing.

### Preface

1. Disclaimer: any added thoughts / comments are my own personal opinion.

2. Content is notably focused on mid-2015 through mid-2017, when I was most assiduously following the machine learning and related literature.
That has abated somewhat as I refocus on my first passions: functional genomics, information processing, and knowledge discovery.

However, I will continue to add new content frequently, focusing on material of particular interest to me (i.e. less broad, more aligned with my research interests and vision).

3. Double-click the videos to watch them full-screen.

4. The images "mouse-over" to a larger size. If they exceed your screen size, use your arrow keys to scroll L-R, up-down.

5. This is a large file: ~31,000 HTML lines [2017-Sep-26] -- all hand-coded (HTML / JS / CSS) ...
Once loaded, you should be fine.

If it helps, compared to Firefox (my preferred browser) and Chrome, Opera appears to be blazingly fast!

6. Enjoy!

## GENERATIVE ADVERSARIAL NETWORKS (GAN)

• Generative Adversarial Networks (GANs) are neural networks in which generative models are estimated via an adversarial process. They can be used to produce high-quality samples of natural images. Wikipedia

• GANs were introduced by Ian Goodfellow et al. (2014) [arXiv:1406.2661]

Source: arXiv:1610.09585

• Yann LeCun: What are some recent and potentially upcoming breakthroughs in deep learning? [Quora.com; July 2016]

• The most important one, in my opinion, is adversarial training (also called GAN for Generative Adversarial Networks). This is an idea that was originally proposed by Ian Goodfellow when he was a student with Yoshua Bengio at the University of Montreal (he since moved to Google Brain and recently to OpenAI). This, and the variations that are now being proposed, is the most interesting idea in the last 10 years in ML, in my opinion.

The idea is to simultaneously train two neural nets. The first one, called the Discriminator - let's denote it $\small D(Y)$ - takes an input (e.g. an image) and outputs a scalar that indicates whether the image $\small Y$ looks "natural" or not. In one instance of adversarial training, $\small D(Y)$ can be seen as some sort of energy function that takes a low value (e.g. close to 0) when $\small Y$ is a real sample (e.g. an image from a database) and a positive value when it is not (e.g. if it's a noisy or strange looking image). The second network is called the generator, denoted $\small G(Z)$, where $\small Z$ is generally a vector randomly sampled from a simple distribution (e.g. Gaussian).

The role of the generator is to produce images so as to train the $\small D(Y)$ function to take the right shape (low values for real images, higher values for everything else). During training $\small D$ is shown a real image, and adjusts its parameters to make its output lower. Then $\small D$ is shown an image produced from $\small G$ and adjusts its parameters to make its output $\small D(G(Z))$ larger (following the gradient of some predefined objective function). But $\small G(Z)$ will train itself to produce images so as to fool $\small D$ into thinking they are real. It does this by getting the gradient of $\small D$ with respect to $\small Y$ for each sample it produces. In other words, it's trying to minimize the output of $\small D$ while $\small D$ is trying to maximize it. Hence the name adversarial training.

The original formulation uses a considerably more complicated probabilistic framework, but that's the gist of it.
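A minimal sketch of that alternating procedure in PyTorch (my own placeholder architectures, optimizers and data shapes; not LeCun's or Goodfellow's exact setup):

```python
# Minimal GAN training-loop sketch (PyTorch). Architectures/hyperparameters are illustrative only.
import torch
import torch.nn as nn

z_dim, x_dim = 64, 784                      # e.g. flattened 28x28 images scaled to [-1, 1]
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, x_dim), nn.Tanh())
D = nn.Sequential(nn.Linear(x_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_batch):
    b = real_batch.size(0)
    ones, zeros = torch.ones(b, 1), torch.zeros(b, 1)

    # 1) Update D: push D(real) toward "real", D(G(z)) toward "fake".
    z = torch.randn(b, z_dim)
    fake = G(z).detach()                    # do not backprop into G on this step
    loss_D = bce(D(real_batch), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2) Update G: use the gradient of D w.r.t. its input to make D(G(z)) look "real".
    z = torch.randn(b, z_dim)
    loss_G = bce(D(G(z)), ones)             # non-saturating generator loss
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```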

Why is that so interesting? It allows us to train a discriminator as an unsupervised "density estimator", i.e. a contrast function that gives us a low value for data and higher output for everything else. This discriminator has to develop a good internal representation of the data to solve this problem properly. It can then be used as a feature extractor for a classifier, for example.

But perhaps more interestingly, the generator can be seen as parameterizing the complicated surface of real data: give it a vector $\small Z$, and it maps it to a point on the data manifold. There are papers where people do amazing things with this, like generating pictures of bedrooms, doing arithmetic on faces in the $\small Z$ vector space: [man with glasses] - [man without glasses] + [woman without glasses] = [woman with glasses].
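The face-arithmetic example is just vector arithmetic in the latent space. A toy sketch (the latent vectors and the trained generator `G` are hypothetical placeholders, e.g. averages of the codes of several samples carrying each attribute):

```python
import numpy as np

# Hypothetical latent codes; in practice each would be averaged over several samples.
z_man_glasses = np.random.randn(100)
z_man_plain   = np.random.randn(100)
z_woman_plain = np.random.randn(100)

# [man with glasses] - [man without glasses] + [woman without glasses]
z_new = z_man_glasses - z_man_plain + z_woman_plain
# image = G(z_new)   # decoding z_new with the trained generator should yield a woman with glasses
```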

There has been a series of interesting papers from FAIR on the topic:

• Denton E (2016) "Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks." pdf
• In this paper we introduce a generative model capable of producing high quality samples of natural images. Our approach uses a cascade of convolutional networks (convnets) within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion. At each level of the pyramid a separate generative convnet model is trained using the Generative Adversarial Nets (GAN) approach. Samples drawn from our model are of significantly higher quality than existing models. In a quantitative assessment by human evaluators our CIFAR10 samples were mistaken for real images around 40% of the time, compared to 10% for samples drawn from a GAN baseline model. We also show samples from models trained on the higher resolution images of the LSUN scene dataset.

• Radford A (2015 | ICLR 2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434

• In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.

• Mathieu M [LeCun Y] (2015) Deep multi-scale video prediction beyond mean square error. arXiv:1511.05440  |  GitHub  |  GitXiv

• Learning to predict future images from a video sequence involves the construction of an internal representation that models the image evolution accurately, and therefore, to some degree, its content and dynamics. This is why pixel-space video prediction may be viewed as a promising avenue for unsupervised feature learning. In addition, while optical flow has been a very studied problem in computer vision for a long time, future frame prediction is rarely approached. Still, many vision applications could benefit from the knowledge of the next frames of videos, that does not require the complexity of tracking every pixel trajectories. In this work, we train a convolutional network to generate future frames given an input sequence. To deal with the inherently blurry predictions obtained from the standard Mean Squared Error (MSE) loss function, we propose three different and complementary feature learning strategies: a multi-scale architecture, an adversarial training method, and an image gradient difference loss function. We compare our predictions to different published results based on recurrent neural networks on the UCF101 dataset

This last one is on video prediction with adversarial training. It solves a really important issue, which is that when you train a neural net (or any other model) to predict the future, and when there are several possible futures, a network trained the traditional way (e.g. with least squares) will predict the average of all the possible futures. In the case of video, it will produce a blurry mess. Adversarial training lets the system produce whatever it wants, as long as it's within the set that the discriminator likes. This solves the "blurriness" problem when predicting under uncertainty.

It seems like a rather technical issue, but I really think it opens the door to an entire world of possibilities.
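The "average of all possible futures" point can be made precise: under a squared-error loss the optimal prediction is the conditional mean of the future frame, which is blurry whenever several sharp futures are possible:

$$
\hat{y}^{\ast} \;=\; \arg\min_{\hat{y}} \; \mathbb{E}_{y \sim p(y \mid \text{past})}\big[\, \lVert y - \hat{y} \rVert_2^2 \,\big] \;=\; \mathbb{E}\big[\, y \mid \text{past} \,\big]
$$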

• Counterpoint: some critique of GANs in this recent (Sept 16, 2016) reddit thread [Notes on "Energy Based Generative Adversarial Networks" review of recent arXiv paper (Zhao, Mathieu and LeCun, 2016)]

• AI-ML News Aug-Sep 2016

• The majority of machine learning models we talk about in the real world are discriminative insofar as they model the dependence of an unobserved variable y on an observed variable x to predict y from x. As such, they are used for supervised classification or regression tasks. Generative models, on the other hand, are fully probabilistic models of all variables from which randomly generated observable data points can be obtained. They're all the rage at the moment because they have been shown to synthesise artificial content (text, images, video, sound) that looks and sounds real from small, unlabeled datasets. Here's a range of research and resources that help us understand how they work, why they're fascinating, and their use cases:

• A UCLA undergrad walks us through how generative adversarial networks work.

• The Twitter Cortex VX team (Magic Pony Technology) has had a busy summer publishing three papers: super-resolution of images and video on a single K2 GPU using a convolutional neural network architecture (arXiv:1609.05158), super-resolution of a 4x downsampled image using a generative adversarial network (arXiv:1609.04802) and an extended discussion of their work (arXiv.org:1609.07009).

• Google DeepMind publish WaveNet, a generative model for raw audio (paper arXiv:1609.03499). This work uses a generative convolutional neural network architecture that operates directly on the raw audio waveform to model the conditional probability distribution of each audio sample given the samples that precede it. The network samples audio at 16,000 times a second and each predicted sample is fed back through the network to predict the next sample. The results on text-to-speech tasks are impressive!

• Shakir Mohamed, Research Scientist at Google DeepMind, presents his work on building machines that imagine and reason at this summer's Deep Learning Summer School. He is also co-author on a paper, "Unsupervised Learning of 3D Structure from Images" arXiv:1607.00662, which uses generative models to infer 3D representations given a 2D image.

• Researchers in Edinburgh publish the Neural Photo Editor [arXiv:1609.07093], a novel interface for exploring the learned latent space of generative models and for making specific semantic changes to natural images. The method allows a user to produce said changes in the output image by use of a "contextual paintbrush" that indirectly modifies the latent vector.

• Adversarial Autoencoders [GitXiv]  |  Our method, named "adversarial autoencoder", uses the recently proposed generative adversarial networks (GAN) in order to match the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior. Matching the aggregated posterior to the prior ensures that there are no "holes" in the prior, and generating from any part of prior space results in meaningful samples.

• A path to unsupervised learning through adversarial networks  |  reddit

• Building a 1D Generative Adversarial Network in TensorFlow   [reddit: good discussion; refer below (arXiv:1406.2661) for links to related blog posts, implementations in TF in Jupyter notebooks]

• Deep Learning Research Review Week 1: Generative Adversarial Nets

• Fantastic GANs and where to find them [Mar 2017]   [good summary]  |  reddit

• Generative Adversarial Networks [Andrej Karpathy (Stanford)]

• Generative Adversarial Networks (GANs) in 50 lines of code (PyTorch)  |  code [GitHub]

• Generative Adversarial Networks Explained  |  GitHub

• There have been a lot of advances in image classification, mostly thanks to the convolutional neural network. It turns out, these same networks can be turned around and applied to image generation as well. If we've got a bunch of images, how can we generate more like them? A recent method, Generative Adversarial Networks [Ian J. Goodfellow; Aaron Courville; Yoshua Bengio; arXiv:1406.2661], attempts to train an image generator by simultaneously training a discriminator to challenge it to improve. To gain some intuition, think of a back-and-forth situation between a bank and a money counterfeiter. At the beginning, the fakes are easy to spot. However, as the counterfeiter keeps trying different kinds of techniques, some may get past the check. The counterfeiter then can improve his fakes towards the areas that got past the bank's security checks. But the bank doesn't give up. It also keeps learning how to tell the fakes apart from real money. After a long period of back-and-forth, the competition has led the money counterfeiter to create perfect replicas.

Now, take that same situation, but let the money forger have a spy in the bank that reports back how the bank is telling fakes apart from real money. Every time the bank comes up with a new strategy to tell apart fakes, such as using ultraviolet light, the counterfeiter knows exactly what to do to bypass it, such as replacing the material with ultraviolet marked cloth. The second situation is essentially what a generative adversarial network does. The bank is known as a discriminator network, and in the case of images, is a convolutional neural network that assigns a probability that an image is real and not fake. The counterfeiter is known as the generative network, and is a special kind of convolutional network that uses transpose convolutions, sometimes known as a deconvolutional network. This generative network takes in some 101 parameters of noise (sometimes known as the code), and outputs an image accordingly. [ ... snip ... ]
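A rough sketch of such a generative ("deconvolutional") network in PyTorch; the layer sizes, 100-dimensional code, and 32x32 output below are illustrative assumptions, not the article's exact architecture:

```python
import torch
import torch.nn as nn

# DCGAN-style generator sketch: a noise "code" vector is upsampled to an image
# with transpose convolutions. All sizes are placeholders.
class Generator(nn.Module):
    def __init__(self, z_dim=100, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(True),   # 1x1  -> 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1),   nn.BatchNorm2d(128), nn.ReLU(True),   # 4x4  -> 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1),    nn.BatchNorm2d(64),  nn.ReLU(True),   # 8x8  -> 16x16
            nn.ConvTranspose2d(64, channels, 4, 2, 1), nn.Tanh(),                          # 16x16 -> 32x32
        )

    def forward(self, z):                              # z: (batch, z_dim)
        return self.net(z.view(z.size(0), -1, 1, 1))

img = Generator()(torch.randn(8, 100))                 # -> (8, 3, 32, 32)
```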

• Follow-on article:  Variational Autoencoders Explained

• In my previous post about generative adversarial networks, I went over a simple method for training a network that could generate realistic-looking images. However, there were a couple of downsides to using a plain GAN. First, the images are generated off some arbitrary noise. If you wanted to generate a picture with specific features, there's no way of determining which initial noise values would produce that picture, other than searching over the entire distribution. Second, a generative adversarial model only discriminates between "real" and "fake" images. There are no constraints that an image of a cat has to look like a cat. This leads to results where there's no actual object in a generated image, but the style just looks like a picture. In this post, I'll go over the variational autoencoder, a type of network that solves these two problems. [ ... snip ... ]
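For reference, the VAE objective the post builds on is the evidence lower bound: a reconstruction term plus a KL penalty that keeps the encoder's codes close to the prior, which is what makes the latent space searchable:

$$
\mathcal{L}(\theta, \phi; x) \;=\; \mathbb{E}_{q_{\phi}(z \mid x)}\big[ \log p_{\theta}(x \mid z) \big] \;-\; D_{\mathrm{KL}}\big( q_{\phi}(z \mid x) \,\Vert\, p(z) \big)
$$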

• Follow-on article:  What is DRAW (Deep Recurrent Attentive Writer)?  |  GitHub  |  reddit

• A few weeks ago I made a post on variational autoencoders, and how they can be applied to image generation. In this post, we'll be taking a look at DRAW: a model based off of the VAE that generates images using a sequence of modifications rather than all at once. [ ... snip ... ]

• Gregor K [Google DeepMind] (2015) DRAW: A recurrent neural network for image generation. arXiv:1502.04623

• This paper introduces the Deep Recurrent Attentive Writer (DRAW) neural network architecture for image generation. DRAW networks combine a novel spatial attention mechanism that mimics the foveation of the human eye, with a sequential variational auto-encoding framework that allows for the iterative construction of complex images. The system substantially improves on the state of the art for generative models on MNIST, and, when trained on the Street View House Numbers dataset, it generates images that cannot be distinguished from real data with the naked eye.

• Generative Adversarial Networks vs Variational Autoencoders, who will win? [reddit]
It seems these days that for every GAN paper there's a complementary VAE version of that paper. Here are a few examples:

The two approaches seem to be fundamentally completely different ways of attacking the same problems. Is there something to take away from all this, or will we just keep seeing papers going back and forth between the two?

• Generating images with recurrent adversarial networks (GRAN): Adversarial generative RNN incrementally adding onto a visual "canvas"

• Generative models with constraints [reddit]

• How to train generative adversarial networks  [reddit]

• We have had a few posts here before, and the notes of the talk at NIPS are useful, and there are some other great "how to" resources. There is also the Torch blogpost on the topic. I have generated images that look good. I have toyed with tons of GAN variants. But I still feel like I have no idea how to train a GAN in any sort of rigorous manner. To me it seems like the most important factor in generating visually understandable images is some strange and exploratory architecture balancing act, which appears completely arbitrary. I train one GAN to equilibrium and end up with trash images, but change a tiny thing in the architecture and reach a very similar equilibrium (ie the G and D training curves look the same) but have nice images. This suggests to me that what we are actually doing is (over)fitting models based on human visual interpretation, not model metrics. This is very "human in the loop" and not theoretically satisfying. And more frustratingly, it is not discussed even in the few articles and blogposts on GAN hacks and methods.

I wasn't at NIPS so I don't know if the workshop covered more than the git. Mode collapse and "stability" don't seem to explain my experiences. It kind of feels more like model hacking. I know my colleagues feel similar frustrations, like we are all just randomly wandering around image space until we stumble upon a training method that lands us close to the image manifold and then it magically starts working. I really want to be told I am just missing something that should be obvious. So, is there any rigorous way to train a GAN? Where I can look at my curves and my gradients and my activations and just know I will end up with quality images?

• Image Super-Resolution Background Help  [reddit]

• Is Maximum Likelihood Useful for Representation Learning?   [inFERENCe blog: May 2017]  |  reddit

• A few weeks ago at the DALI Theory of GANs workshop we had a great discussion about what GANs are even useful for. Pretty much everybody agreed that generating random images from a model is not really our goal. We either want to use GANs to train conditional probabilistic models (like we do for image super-resolution or speech synthesis, or something along those lines), or as a means of unsupervised representation learning. Indeed, many papers examine the latent space representations that GANs learn.

But the elephant in the room is that nobody really agrees on what unsupervised representation learning really means, why any GAN variant should be any better or worse at it than others, or whether GANs or VAEs are better for it. So I thought I'd write a post to address this, focussing now on maximum likelihood learning and variational autoencoders, but many of these things hold true for variants of GANs as well.

[ ... SNIP! ... ]

• [adversarial images]  Magic AI: these are the optical illusions that trick, fool, and flummox computers

• MNIST Generative Adversarial Model in Keras   [reblogged at KDNuggets.com]  |  GitHub

• Reddit discussion (GANs): Recent Progress in Generative Modeling - Ilya Sutskever @ OpenAI; .. mentions (e.g.):

• Stability of Generative Adversarial Networks  |  reddit

• Generative Adversarial Networks (arXiv:1406.2661, and the related DCGAN: arXiv:1511.06434) are a relatively new type of neural network architecture which pits two sub-networks against each other in order to learn very realistic generative models of high-dimensional data (mostly used for image synthesis, though extensions to sound, text, and other media have been constructed). When one generates something like an image, the pixels themselves are not individually so responsible for the realism of the image as the correlations and relative arrangements of pixels, and so hand-crafted loss functions to measure reconstruction accuracy have a tendency towards blurriness or other perceptual artefacts. But when you allow two networks to compete with each other: one to make a plausible fake, the other to distinguish forgeries from reality, then the networks (in principle) eliminate the noticeable differences from reality one-by-one. A blurry image can be detected, so it eliminates blurriness, and so on. The results of this approach are visually quite impressive, but because the two sub-networks are trained against different, opposed target functions, there is a wealth of new instabilities and problems that can crop up compared to training traditional neural networks which have a single unified loss function.

In this post, I'm going to look at extremely simple adversarial networks - ones with only one parameter for each sub-network - so that we can exhaustively explore the parameter space, and hopefully get some intuition for how these adversarial networks behave when trained.

[ ... snip ... ]
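In the same spirit, here is a self-contained toy version of a one-parameter-per-network adversarial game (an illustrative sketch under my own assumptions, not the post's exact code); whether the two scalars settle at an equilibrium or keep cycling is exactly the kind of behaviour the post explores:

```python
import numpy as np

# Toy adversarial game: one scalar parameter per network, trained by simultaneous
# gradient steps on the (non-saturating) GAN objective.
rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

d, g = 0.1, -1.0            # discriminator and generator parameters (one scalar each)
lr, batch = 0.05, 256
real_mean = 2.0             # "real" data: x ~ N(2, 1)

for step in range(2000):
    x_real = rng.normal(real_mean, 1.0, batch)
    x_fake = g + rng.normal(0.0, 1.0, batch)       # generator: shift a unit Gaussian by g

    # Discriminator D(x) = sigmoid(d * x): ascend V = E[log D(real)] + E[log(1 - D(fake))]
    grad_d = np.mean(x_real * (1 - sigmoid(d * x_real))) - np.mean(x_fake * sigmoid(d * x_fake))
    d += lr * grad_d

    # Generator: non-saturating loss, ascend E[log D(fake)] with respect to g
    x_fake = g + rng.normal(0.0, 1.0, batch)
    grad_g = np.mean(d * (1 - sigmoid(d * x_fake)))
    g += lr * grad_g

# The dynamics may converge toward (d, g) = (0, real_mean) or oscillate around it.
print(f"final d = {d:.3f}, g = {g:.3f}  (real mean = {real_mean})")
```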

• The Tiny Changes That Can Cause AI To Fail  [BBC: Apr 2017]

• Variational Autoencoders (VAE) vs Generative Adversarial Networks (GAN)? [reddit]: VAEs can be used with discrete inputs, while GANs can be used with discrete latent variables. However, assuming both are continuous, is there any reason to prefer one over the other?

• They're similar in some respects, different in others -- it all depends on what you're trying to do. VAEs, being autoencoders, learn to map from an input to a low dimensional space, while only recently have people started figuring out how to do that with GANs (Vanilla GANs don't map from the input to the latent space directly.) GANs are promising and have recently shown some awesome empirical results, but are generally known to be trickier to train (though that too looks like it's being improved upon) and are a relatively "new," albeit extremely "hot" area of research. The adversarial objective is apparently a pretty powerful one, and is suitable for sticking onto the end of the VAE.

• There are also Adversarial AEs: arXiv.org:1511.05644

• The biggest advantage of VAEs is the nice probabilistic formulation they come with as a result of maximizing a lower bound on the log-likelihood. The advantage of GANs at the moment is that they are better at generating visual features (which really boils down to the adversarial loss being better than a mean-squared loss).

• [u/alexmlamb]

VAE:

• Usually easier to train and get working. Relatively easy to implement and robust to hyperparameter choices.

• Tractable likelihood.

• Has an explicit inference network, so it lets us do reconstruction.

• If the distribution $\small p(x|z)$ makes conditional independence assumptions, then it might have the "blurring" effect on images and "low-pass" effect on audio.

GAN:

• Much higher visual fidelity in generated samples.

• [good author!]  Variational Inference with Implicit Probabilistic Models: Part 1  |  iPython notebook  |  reddit

• Update (Feb 2017):  Variational Inference using Implicit Models

In January 2017 I wrote a series of blog posts on adversarial algorithms for variational inference. I eventually turned this into a paper on arXiv [arXiv:1702.08235]. Also make sure you read Adversarial Variational Bayes by Mescheder et al. (2017) [arXiv:1701.04722], who propose the algorithm I described in Part II but have done a much better job with the experiments and the paper in general.

The blog posts are available here:

Part I (you are here): Inference of single, global variable (Bayesian logistic regression)
Part II: Amortised Inference via the Prior-Contrastive Method (Explaining Away Demo)
Part III: Amortised Inference via a Joint-Contrastive Method (ALI, BiGAN)
Part IV: Using Denoisers instead of Discriminators

• What's in your bag of tricks for training GANs? [reddit]:  Inspired by this discussion, I imagine it'd be helpful to gather some collective wisdom. If this goes somewhere, we could summarize to a Github FAQ for future reference.

• Goodfellow I (2017) NIPS 2016 Tutorial: Generative Adversarial Networks. arXiv:1701.00160  |  reddit  |  KDNuggets.com

• [v4] This report summarizes the tutorial presented by the author at NIPS 2016 on generative adversarial networks (GANs). The tutorial describes: (1) Why generative modeling is a topic worth studying, (2) how generative models work, and how GANs compare to other generative models, (3) the details of how GANs work, (4) research frontiers in GANs, and (5) state-of-the-art image models that combine GANs with other methods. Finally, the tutorial contains three exercises for readers to complete, and the solutions to these exercises.

• Abadi M [David G. Andersen | Google Brain] (2016) Learning to Protect Communications with Adversarial Neural Cryptography. arXiv:1610.06918

• Arjovsky M [Bottou L | Facebook AI Research] (2017) Wasserstein GAN. arXiv:1701.07875  |  GitHub  |  reddit  |  Read-through: Wasserstein GAN  [reddit]  |  reddit: search (Wasserstein GAN)  |  Wasserstein GAN and the Kantorovich-Rubinstein Duality  |  GitHub  |  reddit

• We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to other distances between distributions.

• Related:  Gulrajani I [Aaron Courville | MILA] (2017) Improved Training of Wasserstein GANs. arXiv:1704.00028  |  GitHub  |  GitXiv  |  GitHub  |  reddit  |  Why Does BEGAN Produce Far Better Images than WGAN?  [reddit: May 2017]

• Generative Adversarial Networks (GANs) are powerful generative models, but suffer from training instability. The recently proposed Wasserstein GAN (WGAN) makes significant progress toward stable training of GANs, but can still generate low-quality samples or fail to converge in some settings. We find that these training failures are often due to the use of weight clipping in WGAN to enforce a Lipschitz constraint on the critic, which can lead to pathological behavior. We propose an alternative method for enforcing the Lipschitz constraint: instead of clipping weights, penalize the norm of the gradient of the critic with respect to its input. Our proposed method converges faster and generates higher-quality samples than WGAN with weight clipping. Finally, our method enables very stable GAN training: for the first time, we can train a wide variety of GAN architectures with almost no hyperparameter tuning, including 101-layer ResNets and language models over discrete data.
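A hedged sketch of the gradient-penalty idea described above, in PyTorch (assumes the critic takes flattened `(batch, features)` tensors; λ = 10 is the commonly used default):

```python
import torch

def gradient_penalty(critic, real, fake, lambda_gp=10.0):
    # real, fake: flattened (batch, features) tensors; `critic` returns one score per sample.
    eps = torch.rand(real.size(0), 1, device=real.device)          # random interpolation weights
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True) # points between real and fake
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores.sum(), inputs=x_hat,
                                create_graph=True)[0]
    grad_norm = grads.norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()             # penalize deviation from unit norm

# Critic loss (sketch): critic(fake).mean() - critic(real).mean() + gradient_penalty(critic, real, fake)
```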

• Related:  Berthelot D [Google] (2017) BEGAN: Boundary Equilibrium Generative Adversarial Networks.  |  arXiv:1703.10717  |  GitHub  |  GitHub  |  GitXiv  |  reddit  |  Why Does BEGAN Produce Far Better Images than WGAN?  [reddit: May 2017]

• We propose a new equilibrium enforcing method paired with a loss derived from the Wasserstein distance for training auto-encoder based Generative Adversarial Networks. This method balances the generator and discriminator during training. Additionally, it provides a new approximate convergence measure, fast and stable training and high visual quality. We also derive a way of controlling the trade-off between image diversity and visual quality. We focus on the image generation task, setting a new milestone in visual quality, even at higher resolutions. This is achieved while using a relatively simple model architecture and a standard training procedure.

• Che T [Bengio Y | MILA | UdeM] (2016) Mode Regularized Generative Adversarial Networks. arXiv:1612.02136
• Although Generative Adversarial Networks achieve state-of-the-art results on a variety of generative tasks, they are regarded as highly unstable and prone to miss modes. We argue that these bad behaviors of GANs are due to the very particular functional shape of the trained discriminators in high dimensional spaces, which can easily make training stuck or push probability mass in the wrong direction, towards that of higher concentration than that of the data generating distribution. We introduce several ways of regularizing the objective, which can dramatically stabilize the training of GAN models. We also show that our regularizers can help the fair distribution of probability mass across the modes of the data generating distribution, during the early phases of training and thus providing a unified solution to the missing modes problem.

• Chen X (2016) InfoGAN: Interpretable Representation Learning By Information Maximizing Generative Adversarial Nets. arXiv:1606.03657  |  NIPS Proceedings  |  reddit  |  OpenAI  |  GitHub  |  critiqued here  |  reddit

• This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner. InfoGAN is a generative adversarial network that also maximizes the mutual information between a small subset of the latent variables and the observation. We derive a lower bound to the mutual information objective that can be optimized efficiently, and show that our training procedure can be interpreted as a variation of the Wake-Sleep algorithm. Specifically, InfoGAN successfully disentangles writing styles from digit shapes on the MNIST dataset, pose from lighting of 3D rendered images, and background digits from the central digit on the SVHN dataset. It also discovers visual concepts that include hair styles, presence/absence of eyeglasses, and emotions on the CelebA face dataset. Experiments show that InfoGAN learns interpretable representations that are competitive with representations learned by existing fully supervised methods.
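Schematically, the InfoGAN objective augments the usual minimax game with a mutual-information term between the structured latent code c and the generated sample (replaced in practice by the variational lower bound the abstract mentions):

$$
\min_{G}\max_{D} \; V_{I}(D, G) \;=\; V(D, G) \;-\; \lambda \, I\big(c ;\, G(z, c)\big)
$$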

• [non-author] Implementation:  Adversarial Neural Cryptography in Theano: Last week I read Abadi and Andersen's recent paper [2], Learning to Protect Communications with Adversarial Neural Cryptography. I thought the idea seemed pretty cool and that it wouldn't be too tricky to implement, and would also serve as an ideal project to learn a bit more Theano. This post describes the paper, my implementation [GitHub], and the results. [ ... snip ... ]  |  reddit

• What is this type of GAN called? What is it used for? I saw it at NIPS but can't find it.  [reddit]

• [review]  Creswell A (2017) Generative Adversarial Networks: An Overview. arXiv:1710.07035

• [v1] Generative adversarial networks (GANs) provide a way to learn deep representations without extensively annotated training data. They achieve this through deriving backpropagation signals through a competitive process involving a pair of networks. The representations that can be learned by GANs may be used in a variety of applications, including image synthesis, semantic image editing, style transfer, image super-resolution and classification. The aim of this review paper is to provide an overview of GANs for the signal processing community, drawing on familiar analogies and concepts where possible. In addition to identifying different methods for training and constructing GANs, we also point to remaining challenges in their theory and application.

• Donahue J [UC Berkeley] (2016) Adversarial Feature Learning. arXiv:1605.09782  |  reddit  |  KDNuggets.com
• The ability of the Generative Adversarial Networks (GANs) framework to learn generative models mapping from simple latent distributions to arbitrarily complex data distributions has been demonstrated empirically, with compelling results showing generators learn to "linearize semantics" in the latent space of such models. Intuitively, such latent spaces may serve as useful feature representations for auxiliary problems where semantics are relevant. However, in their existing form, GANs have no means of learning the inverse mapping -- projecting data back into the latent space. We propose Bidirectional Generative Adversarial Networks (BiGANs) as a means of learning this inverse mapping, and demonstrate that the resulting learned feature representation is useful for auxiliary supervised discrimination tasks, competitive with contemporary approaches to unsupervised and self-supervised feature learning.

• Durugkar I [UMass - Amherst] (2017) Generative Multi-Adversarial Networks. pdf  |  reddit

• Generative adversarial networks (GANs) are a framework for producing a generative model by way of a two-player minimax game. In this paper, we propose the Generative Multi-Adversarial Network (GMAN), a framework that extends GANs to multiple discriminators. In previous work, the successful training of GANs requires modifying the minimax objective to accelerate training early on. In contrast, GMAN can be reliably trained with the original, untampered objective. We explore a number of design perspectives with the discriminator role ranging from formidable adversary to forgiving teacher. Image generation tasks comparing the proposed framework to standard GANs demonstrate GMAN produces higher quality samples in a fraction of the iterations when measured by a pairwise GAM-type metric.

• TL;DR: GANs with multiple discriminators accelerate training to more robust performance.

• Ghosh A (2016) Contextual RNN-GANs for Abstract Reasoning Diagram Generation. arXiv:1609.09444  |  GitHub  |  reddit  |  Hacker News

• Understanding, predicting, and generating object motions and transformations is a core problem in artificial intelligence. Modeling sequences of evolving images may provide better representations and models of motion and may ultimately be used for forecasting, simulation, or video generation. Diagrammatic Abstract Reasoning is an avenue in which diagrams evolve in complex patterns and one needs to infer the underlying pattern sequence and generate the next image in the sequence. For this, we develop a novel Contextual Generative Adversarial Network based on Recurrent Neural Networks (Context-RNN-GANs), where both the generator and the discriminator modules are based on contextual history (modeled as RNNs) and the adversarial discriminator guides the generator to produce realistic images for the particular time step in the image sequence. We evaluate the Context-RNN-GAN model (and its variants) on a novel dataset of Diagrammatic Abstract Reasoning, where it performs competitively with 10th-grade human performance but there is still scope for interesting improvements as compared to college-grade human performance. We also evaluate our model on a standard video next-frame prediction task, achieving improved performance over comparable state-of-the-art.

• Project page:  Contextual RNN-GANs for Abstract Reasoning Diagram Generation

Motivation:

• Understanding, predicting, and generating object motions and transformations is a core problem in artificial intelligence.
• Modeling sequences of evolving images may provide better representations and models of motion and may ultimately be used for forecasting, simulation, or video generation.
• Diagrammatic Abstract Reasoning is an avenue in which diagrams evolve in complex patterns and one needs to infer the underlying pattern sequence and generate the next image in the sequence.

An Example with an Explanation:

An explanation of the ground truth is that the dashed line first goes to the left, then to the right, and then on both sides, and also changes from single to double, hence the ground truth should have double dashed lines on both sides. On the corners, the number of slanted lines increases by one after every two images, hence the ground truth should have four slanted lines on both corners.

• Goodfellow IJ [Courville A; Bengio Y] (2014) Generative Adversarial Networks. arXiv:1406.2661  |  Generative Adversarial Networks ["demo"]  |  blog: implementation

• We propose a new framework for estimating generative models via an adversarial process, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.
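For reference, the two-player value function behind that minimax game is:

$$
\min_{G}\max_{D} \; V(D, G) \;=\; \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[ \log D(x) \big] \;+\; \mathbb{E}_{z \sim p_{z}(z)}\big[ \log\big(1 - D(G(z))\big) \big]
$$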

• Related:

• Implementations:

• Graese A (2016) Assessing Threat of Adversarial Examples on Deep Neural Networks. arXiv:1610.04256
• Deep neural networks are facing a potential security threat from adversarial examples, inputs that look normal but cause an incorrect classification by the deep neural network. For example, the proposed threat could result in hand-written digits on a scanned check being incorrectly classified but looking normal when humans see them. This research assesses the extent to which adversarial examples pose a security threat, when one considers the normal image acquisition process. This process is mimicked by simulating the transformations that normally occur in acquiring the image in a real world application, such as using a scanner to acquire digits for a check amount or using a camera in an autonomous car. These small transformations negate the effect of the carefully crafted perturbations of adversarial examples, resulting in a correct classification by the deep neural network. Thus just acquiring the image decreases the potential impact of the proposed security threat. We also show that the already widely used process of averaging over multiple crops neutralizes most adversarial examples. Normal preprocessing, such as text binarization, almost completely neutralizes adversarial examples. This is the first paper to show that for text driven classification, adversarial examples are an academic curiosity, not a security threat.

• Heusel M [Sepp Hochreiter] (2017) GANs Trained by a Two Time-Scale Update Rule Converge to a Nash Equilibrium. arXiv:1706.08500  |  GitHub  |  reddit  [excellent discussion]

• Generative Adversarial Networks (GANs) excel at creating realistic images with complex models for which maximum likelihood is infeasible. However, the convergence of GAN training has still not been proved. We propose a two time-scale update rule (TTUR) for training GANs with stochastic gradient descent that has an individual learning rate for both the discriminator and the generator. We prove that the TTUR converges under mild assumptions to a stationary Nash equilibrium. The convergence carries over to the popular Adam optimization, for which we prove that it follows the dynamics of a heavy ball with friction and thus prefers flat minima in the objective landscape. For the evaluation of the performance of GANs at image generation, we introduce the "Fréchet Inception Distance" (FID) which captures the similarity of generated images to real ones better than the Inception Score. In experiments, TTUR improves learning for DCGANs, improved Wasserstein GANs, and BEGANs, outperforming conventional GAN training on CelebA, Billion Word Benchmark, and LSUN bedrooms.
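For reference, the Fréchet Inception Distance compares Gaussians fitted to the Inception-feature statistics (mean and covariance) of real versus generated images:

$$
\mathrm{FID} \;=\; \lVert \mu_r - \mu_g \rVert_2^2 \;+\; \mathrm{Tr}\Big( \Sigma_r + \Sigma_g - 2\big(\Sigma_r \Sigma_g\big)^{1/2} \Big)
$$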

• Huang X [Cornell University] (2016) Stacked Generative Adversarial Networks. arXiv:1612.04357  |  GitHub  |  GitXiv

• In this paper, we propose a novel generative model named Stacked Generative Adversarial Networks (SGAN), which is trained to invert the hierarchical representations of a bottom-up discriminative network. Our model consists of a top-down stack of GANs, each learned to generate lower-level representations conditioned on higher-level representations. A representation discriminator is introduced at each feature hierarchy to encourage the representation manifold of the generator to align with that of the bottom-up discriminative network, leveraging the powerful discriminative representations to guide the generative model. In addition, we introduce a conditional loss that encourages the use of conditional information from the layer above, and a novel entropy loss that maximizes a variational lower bound on the conditional entropy of generator outputs. We first train each stack independently, and then train the whole model end-to-end. Unlike the original GAN that uses a single noise vector to represent all the variations, our SGAN decomposes variations into multiple levels and gradually resolves uncertainties in the top-down generative process. Based on visual inspection, Inception scores and visual Turing test, we demonstrate that SGAN is able to generate images of much higher quality than GANs without stacking.

• Im DJ (2016) Generating images with recurrent adversarial networks. arXiv:1602.05110  |  reddit  |  GitXiv
• Gatys et al. (2015) showed that optimizing pixels to match features in a convolutional network with respect to reference image features is a way to render images of high visual quality. We show that unrolling this gradient-based optimization yields a recurrent computation that creates images by incrementally adding onto a visual "canvas". We propose a recurrent generative model inspired by this view, and show that it can be trained using adversarial training to generate very good image samples. We also propose a way to quantitatively compare adversarial networks by having the generators and discriminators of these networks compete against each other.

• Isola P [Berkeley AI Research (BAIR) Laboratory - UC Berkeley] (2016) Image-to-Image Translation with Conditional Adversarial Networks. arXiv:1611.07004  |  GitHub  |  GitXiv  |  project page  |  reddit  |  reddit  |  reddit

• We investigate conditional adversarial networks as a general-purpose solution to image-to-image translation problems. These networks not only learn the mapping from input image to output image, but also learn a loss function to train this mapping. This makes it possible to apply the same generic approach to problems that traditionally would require very different loss formulations. We demonstrate that this approach is effective at synthesizing photos from label maps, reconstructing objects from edge maps, and colorizing images, among other tasks. As a community, we no longer hand-engineer our mapping functions, and this work suggests we can achieve reasonable results without hand-engineering our loss functions either.

• demo  |  reddit

• Jia R (2017) Adversarial Examples for Evaluating Reading Comprehension Systems. arXiv:1707.07328

• Discussed here:  The Morning Paper   [local copy]

• Jonathan Ho [Stanford] (2016) Generative Adversarial Imitation Learning. arXiv:1606.03476  |  reddit  |  OpenAI  |  GitHub
• Consider learning a policy from example expert behavior, without interaction with the expert or access to reinforcement signal. One approach is to recover the expert's cost function with inverse reinforcement learning, then extract a policy from that cost function with reinforcement learning. This approach is indirect and can be slow. We propose a new general framework for directly extracting a policy from data, as if it were obtained by reinforcement learning following inverse reinforcement learning. We show that a certain instantiation of our framework draws an analogy between imitation learning and generative adversarial networks, from which we derive a model-free imitation learning algorithm that obtains significant performance gains over existing model-free methods in imitating complex behaviors in large, high-dimensional environments.

• Kim T (2017) Learning to Discover Cross-Domain Relations with Generative Adversarial Networks. arXiv:1703.05192  |  GitHub  [PyTorch]  |  GitHub  [TensorFlow]  |  reddit  |  GitXiv  |  reddit  |  DiscoGAN in PyTorch: implementation of "Learning to Discover Cross-Domain Relations with Generative Adversarial Networks"  [reddit]

• While humans easily recognize relations between data from different domains without any supervision, learning to automatically discover them is in general very challenging and needs many ground-truth pairs that illustrate the relations. To avoid costly pairing, we address the task of discovering cross-domain relations given unpaired data. We propose a method based on generative adversarial networks that learns to discover relations between different domains (DiscoGAN). Using the discovered relations, our proposed network successfully transfers style from one domain to another while preserving key attributes such as orientation and face identity.

• Amazing results!   Zhu J-Y (2017) Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. arXiv:1703.10593  |  GitHub  [reddit]  |  Understanding and Implementing CycleGAN in TensorFlow  [GitHub: blog]  |  GitHub  |  GitHub  [PyTorch]  |  GitXiv  |  project page  |  reddit  |  YouTube

• Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G:X→Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F:Y→X and introduce a cycle consistency loss to push F(G(X))≈X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.
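Schematically, the cycle-consistency term that couples the two mappings is an L1 reconstruction penalty in both directions, added to the two adversarial losses:

$$
\mathcal{L}_{\mathrm{cyc}}(G, F) \;=\; \mathbb{E}_{x \sim p(X)}\big[ \lVert F(G(x)) - x \rVert_1 \big] \;+\; \mathbb{E}_{y \sim p(Y)}\big[ \lVert G(F(y)) - y \rVert_1 \big]
$$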

• reddit:  "This work seems very similar to DiscoGAN (arXiv:1703.05192) but it is not mentioned in your paper. How is this different?"

• Kim T [Yoshua Bengio] (2016) Deep Directed Generative Models with Energy-Based Probability Estimation. arXiv:1606.03439

• Training energy-based probabilistic models is confronted with apparently intractable sums, whose Monte Carlo estimation requires sampling from the estimated probability distribution in the inner loop of training. This can be approximately achieved by Markov chain Monte Carlo methods, but may still face a formidable obstacle that is the difficulty of mixing between modes with sharp concentrations of probability. Whereas an MCMC process is usually derived from a given energy function based on mathematical considerations and requires an arbitrarily long time to obtain good and varied samples, we propose to train a deep directed generative model (not a Markov chain) so that its sampling distribution approximately matches the energy function that is being trained. Inspired by generative adversarial networks, the proposed framework involves training of two models that represent dual views of the estimated probability distribution: the energy function (mapping an input configuration to a scalar energy value) and the generator (mapping a noise vector to a generated configuration), both represented by deep neural networks.

• Similar approaches:

• Xie J (2016) Cooperative Training of Descriptor and Generator Networks. arXiv:1609.09408

• This paper studies the cooperative training of two probabilistic models of signals such as images. Both models are parametrized by convolutional neural networks (ConvNets). The first network is a descriptor network, which is an exponential family model or an energy-based model, whose feature statistics or energy function are defined by a bottom-up ConvNet, which maps the observed signal to the feature statistics. The second network is a generator network, which is a non-linear version of factor analysis. It is defined by a top-down ConvNet, which maps the latent factors to the observed signal. The maximum likelihood training algorithms of both the descriptor net and the generator net are in the form of alternating back-propagation, and both algorithms involve Langevin sampling. In the training of the descriptor net, the Langevin sampling is used to sample synthesized examples from the model. In the training of the generator net, the Langevin sampling is used to sample the latent factors from the posterior distribution. The Langevin sampling in both algorithms can be time consuming. We observe that the two training algorithms can cooperate with each other by jumpstarting each other's Langevin sampling, and they can be naturally and seamlessly interwoven into a CoopNets algorithm that can train both nets simultaneously.

• Sukhbaatar S [Fergus R; Facebook AI Research] (2016) Learning Multiagent Communication with Backpropagation. arXiv:1605.07736  |  GitHub  |  GitXiv  |  project page

• Many tasks in AI require the collaboration of multiple agents. Typically, the communication protocol between agents is manually specified and not altered during training. In this paper we explore a simple neural model, called CommNet, that uses continuous communication for fully cooperative tasks. The model consists of multiple agents and the communication between them is learned alongside their policy. We apply this model to a diverse set of tasks, demonstrating the ability of the agents to learn to communicate amongst themselves, yielding improved performance over non-communicative agents and baselines. In some cases, it is possible to interpret the language devised by the agents, revealing simple but effective strategies for solving the task at hand.

• Krotov D (2017) Dense Associative Memory is Robust to Adversarial Inputs. arXiv:1701.00939  |  reddit

• Deep neural networks (DNN) trained in a supervised way suffer from two known problems. First, the minima of the objective function used in learning correspond to data points (also known as rubbish examples or fooling images) that lack semantic similarity with the training data. Second, a clean input can be changed by a small, and often imperceptible for human vision, perturbation, so that the resulting deformed input is misclassified by the network. These findings emphasize the differences between the ways DNN and humans classify patterns, and raise a question of designing learning algorithms that more accurately mimic human perception compared to the existing methods.

Our paper examines these questions within the framework of Dense Associative Memory (DAM) models. These models are defined by the energy function, with higher order (higher than quadratic) interactions between the neurons. We show that in the limit when the power of the interaction vertex in the energy function is sufficiently large, these models have the following three properties. First, the minima of the objective function are free from rubbish images, so that each minimum is a semantically meaningful pattern. Second, artificial patterns poised precisely at the decision boundary look ambiguous to human subjects and share aspects of both classes that are separated by that decision boundary. Third, adversarial images constructed by models with small power of the interaction vertex, which are equivalent to DNN with rectified linear units (ReLU), fail to transfer to and fool the models with higher order interactions. This opens up a possibility to use higher order models for detecting and stopping malicious adversarial attacks. The presented results suggest that DAM with higher order energy functions are closer to human visual perception than DNN with ReLUs.

• Kurakin A [Ian Goodfellow; Samy Bengio | OpenAI | Google Brain] (2016) Adversarial examples in the physical world. arXiv:1607.02533 | reddit

• Most existing machine learning classifiers are highly vulnerable to adversarial examples. An adversarial example is a sample of input data which has been modified very slightly in a way that is intended to cause a machine learning classifier to misclassify it. In many cases, these modifications can be so subtle that a human observer does not even notice the modification at all, yet the classifier still makes a mistake. Adversarial examples pose security concerns because they could be used to perform an attack on machine learning systems, even if the adversary has no access to the underlying model. Up to now, all previous work have assumed a threat model in which the adversary can feed data directly into the machine learning classifier. This is not always the case for systems operating in the physical world, for example those which are using signals from cameras and other sensors as an input. This paper shows that even in such physical world scenarios, machine learning systems are vulnerable to adversarial examples. We demonstrate this by feeding adversarial images obtained from cell-phone camera to an ImageNet Inception classifier and measuring the classification accuracy of the system. We find that a large fraction of adversarial examples are classified incorrectly even when perceived through the camera.

• Related: Machine Vision's Achilles Heel Revealed By Google Brain Researchers [MIT Technology Review] By some measures machine vision is better than human vision. But now researchers have found a class of "adversarial images" that easily fool them (July 2016)

"These modified pictures are called adversarial images and they are a significant threat. "An adversarial example for the face recognition domain might consist of very subtle markings applied to a person's face, so that a human observer would recognize their identity correctly, but a machine learning system would recognize them as being a different person," say Alexey Kurakin and Samy Bengio at Google Brain and Ian Goodfellow from Open AI, a non-profit AI research company."

• "Follow-on:"  Kurakin A [Ian Goodfellow, Samy Bengio] (2016) Adversarial Machine Learning at Scale. arXiv:1611.01236
• Adversarial examples are malicious inputs designed to fool machine learning models. They often transfer from one model to another, allowing attackers to mount black box attacks without knowledge of the target model's parameters. Adversarial training is the process of explicitly training a model on adversarial examples, in order to make it more robust to attack or to reduce its test error on clean inputs. So far, adversarial training has primarily been applied to small problems. In this research, we apply adversarial training to ImageNet. Our contributions include: (1) recommendations for how to successfully scale adversarial training to large models and datasets, (2) the observation that adversarial training confers robustness to single-step attack methods, (3) the finding that multi-step attack methods are somewhat less transferable than single-step attack methods, so single-step attacks are the best for mounting black-box attacks, and (4) resolution of a "label leaking" effect that causes adversarially trained models to perform better on adversarial examples than on clean examples, because the adversarial example construction process uses the true label and the model can learn to exploit regularities in the construction process.

• Kwak H (2016) Generating Images Part by Part with Composite Generative Adversarial Networks. arXiv:1607.05387
• Image generation remains a fundamental problem in artificial intelligence in general and deep learning in specific. The generative adversarial network (GAN) was successful in generating high quality samples of natural images. We propose a model called composite generative adversarial network, that reveals the complex structure of images with multiple generators in which each generator generates some part of the image. Those parts are combined by alpha blending process to create a new single image. It can generate, for example, background and face sequentially with two generators, after training on face dataset. Training was done in an unsupervised way without any labels about what each generator should generate. We found possibilities of learning the structure by using this generative model empirically.

• Ledig C [Twitter | Magic Pony] (2016) Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. arXiv:1609.04802  |  deep residual network ...  |  GitHub [code]  |  GitHub [curated list: super-resolution resources, benchmarks]  |  GitXiv  |  reddit

• Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.
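The SRGAN objective described above is a weighted sum of a content term computed in VGG feature space and an adversarial term. A minimal sketch of the generator loss, under the assumptions that we are in PyTorch with a pretrained feature extractor `vgg_features`, a discriminator `D` returning one logit per image, a super-resolved batch `sr`, and the ground-truth high-resolution batch `hr` (these names are mine, not the authors'):

```python
# Sketch of an SRGAN-style perceptual loss for the generator: VGG-feature MSE
# (content loss) plus a small adversarial term that rewards fooling the discriminator.
import torch
import torch.nn.functional as F

def perceptual_loss(vgg_features, D, sr, hr, adv_weight=1e-3):
    content = F.mse_loss(vgg_features(sr), vgg_features(hr))        # content loss in feature space
    adversarial = -torch.log(torch.sigmoid(D(sr)) + 1e-8).mean()    # -log D(super-resolved image)
    return content + adv_weight * adversarial
```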

• Twitter pays up to $150M for Magic Pony Technology, which uses neural networks to improve images [June 2016] | Twitter beefs up machine learning chops, buys Magic Pony [June 2016] • Excellent machine learning blogger [inFERENCe.vc] : Ferenc Huszár [@fhuszar]: I am a machine learning researcher, I did my PhD in Cambridge with Carl Rasmussen, Máté Lengyel and Zoubin Ghahramani. I'm interested in probabilistic inference, generative models, unsupervised learning and applying deep learning to these problems. ... I work at Twitter Cortex on unsupervised learning for visual data. ... I ended up at Twitter when they acquired our startup, Magic Pony Technology, and as it turns out, it is a brilliant place to work. • Active reddit discussion: [1609.04802] Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network (Twitter Cortex / Magic Pony) • SUPER-RESOLUTION -- Related work: • Goodfellow IJ [Courville A; Bengio Y] (2014) Generative Adversarial Networks. arXiv:1406.2661 • Johnson J [Fei-Fei L; Stanford] (2016) Perceptual Losses for Real-Time Style Transfer and Super-Resolution arXiv:1603.08155 • Kim J (2015) Deeply-Recursive Convolutional Network for Image Super-Resolution. arXiv:1511.04491 • Ulyanov D (2016) Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. arXiv:1603.03417 • Romano Y [Google Research] (2016) RAISR: Rapid and Accurate Image Super Resolution. arXiv:1606.01299 • Sajjadi MS (2016) EnhanceNet: Single Image Super-Resolution through Automated Texture Synthesis. arXiv:1612.07919 | reddit • Single image super-resolution is the task of inferring a high-resolution image from a single low-resolution input. Traditionally, the performance of algorithms for this task is measured using pixel-wise reconstruction measures such as peak signal-to-noise ratio (PSNR) which have been shown to correlate poorly with the human perception of image quality. As a result, algorithms minimizing these metrics tend to produce oversmoothed images that lack high-frequency textures and do not look natural despite yielding high PSNR values. We propose a novel combination of automated texture synthesis with a perceptual loss focusing on creating realistic textures rather than optimizing for a pixel-accurate reproduction of ground truth images during training. By using feed-forward fully convolutional neural networks in an adversarial training setting, we achieve a significant boost in image quality at high magnification ratios. Extensive experiments on a number of datasets show the effectiveness of our approach, yielding state-of-the-art results in both quantitative and qualitative benchmarks. • Dahl R (2017) Pixel Recursive Super Resolution. arXiv:1702.00783 | reddit | reddit | Google Brain super-resolution image tech makes "zoom, enhance!" real | "Enhancing" Images Using Autoencoders and Adversarial Networks | Faces from Noise: Super Enhancing 8x8 Images with EnhanceGAN • We present a pixel recursive super resolution model that synthesizes realistic details into images while enhancing their resolution. A low resolution image may correspond to multiple plausible high resolution images, thus modeling the super resolution process with a pixel independent conditional model often results in averaging different details--hence blurry edges. By contrast, our model is able to represent a multimodal conditional distribution by properly modeling the statistical dependencies among the high resolution image pixels, conditioned on a low resolution input. 
We employ a PixelCNN architecture to define a strong prior over natural images and jointly optimize this prior with a deep conditioning convolutional network. Human evaluations indicate that samples from our proposed model look more photo realistic than a strong L2 regression baseline. • Lai W-S (2017) [University of California, Merced | Virginia Tech | University of Illinois, Urbana-Champaign] Deep Laplacian Pyramid Networks for Fast and Accurate Super-Resolution. pdf | GitHub | GitXiv | project page | reddit • Convolutional neural networks have recently demonstrated high-quality reconstruction for single-image super-resolution. In this paper, we propose the Laplacian Pyramid Super-Resolution Network (LapSRN) to progressively reconstruct the sub-band residuals of high-resolution images. At each pyramid level, our model takes coarse-resolution feature maps as input, predicts the high-frequency residuals, and uses transposed convolutions for upsampling to the finer level. Our method does not require the bicubic interpolation as the pre-processing step and thus dramatically reduces the computational complexity. We train the proposed LapSRN with deep supervision using a robust Charbonnier loss function and achieve high-quality reconstruction. Furthermore, our network generates multi-scale predictions in one feed-forward pass through the progressive reconstruction, thereby facilitates resource-aware applications. Extensive quantitative and qualitative evaluations on benchmark datasets show that the proposed algorithm performs favorably against the state-of-the-art methods in terms of speed and accuracy. • Lim B [Seoul National University] (2017) Enhanced Deep Residual Networks for Single Image Super-Resolution. arXiv:1707.02921 | GitHub | reddit • Recent research on super-resolution has progressed with the development of deep convolutional neural networks (DCNN). In particular, residual learning techniques exhibit improved performance. In this paper, we develop an enhanced deep super-resolution network (EDSR) with performance exceeding those of current state-of-the-art SR methods. The significant performance improvement of our model is due to optimization by removing unnecessary modules in conventional residual networks. The performance is further improved by expanding the model size while we stabilize the training procedure. We also propose a new multi-scale deep super-resolution system (MDSR) and training method, which can reconstruct high-resolution images of different upscaling factors in a single model. The proposed methods show superior performance over the state-of-the-art methods on benchmark datasets and prove its excellence by winning the NTIRE2017 Super-Resolution Challenge. • Li C (2016) Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks. arXiv:1604.04382 | reddit • This paper proposes Markovian Generative Adversarial Networks (MGANs), a method for training generative neural networks for efficient texture synthesis. While deep neural network approaches have recently demonstrated remarkable results in terms of synthesis quality, they still come at considerable computational costs (minutes of run-time for low-res images). Our paper addresses this efficiency issue. Instead of a numerical deconvolution in previous work, we precompute a feed-forward, strided convolutional network that captures the feature statistics of Markovian patches and is able to directly generate outputs of arbitrary dimensions. 
Such network can directly decode brown noise to realistic texture, or photos to artistic paintings. With adversarial training, we obtain quality comparable to recent neural texture synthesis methods. As no optimization is required any longer at generation time, our run-time performance (0.25M pixel images at 25Hz) surpasses previous neural texture synthesizers by a significant margin (at least 500 times faster). We apply this idea to texture synthesis, style transfer, and video stylization. • Lin Y-C (2017) Tactics of Adversarial Attack on Deep Reinforcement Learning Agents. pdf | project page | reddit • We introduce two tactics to attack agents trained by deep reinforcement learning algorithms using adversarial examples, namely the strategically-timed attack and the enchanting attack. In the strategically-timed attack, the adversary aims at minimizing the agent's reward by only attacking the agent at a small subset of time steps in an episode. Limiting the attack activity to this subset helps prevent detection of the attack by the agent. We propose a novel method to determine when an adversarial example should be crafted and applied. In the enchanting attack, the adversary aims at luring the agent to a designated target state. This is achieved by combining a generative model and a planning algorithm: while the generative model predicts the future states, the planning algorithm generates a preferred sequence of actions for luring the agent. A sequence of adversarial examples is then crafted to lure the agent to take the preferred sequence of actions. We apply the two tactics to the agents trained by the state-of-the-art deep reinforcement learning algorithm including DQN and A3C. In 5 Atari games, our strategically-timed attack reduces as much reward as the uniform attack (i.e., attacking at every time step) does by attacking the agent 4 times less often. Our enchanting attack lures the agent toward designated target states with a more than 70% success rate. Videos are available at http://yclin.me/adversarial_attack_RL/.   • Liu MY (2016) Coupled Generative Adversarial Networks. arXiv:1606.07536 reddit • We propose the coupled generative adversarial network (CoGAN) framework for generating pairs of corresponding images in two different domains. It consists of a pair of generative adversarial networks, each responsible for generating images in one domain. We show that by enforcing a simple weight-sharing constraint, the CoGAN learns to generate pairs of corresponding images without existence of any pairs of corresponding images in the two domains in the training set. In other words, the CoGAN learns a joint distribution of images in the two domains from images drawn separately from the marginal distributions of the individual domains. This is in contrast to the existing multi-modal generative models, which require corresponding images for training. We apply the CoGAN to several pair image generation tasks. For each task, the CoGAN learns to generate convincing pairs of corresponding images. We further demonstrate the applications of the CoGAN framework for the domain adaptation and cross-domain image generation tasks. • Mao X (2016) Least Squares Generative Adversarial Networks. arXiv:1611.04076 | reddit • Unsupervised learning with generative adversarial networks (GANs) has proven hugely successful. Regular GANs hypothesize the discriminator as a classifier with the sigmoid cross entropy loss function. 
This loss function, however, may lead to the vanishing gradient problem during the learning process. To overcome this problem, here we propose the Least Squares Generative Adversarial Networks (LSGANs), which adopt the least squares loss function for the discriminator. We show that minimizing the objective function of LSGAN amounts to minimizing the Pearson $\small \chi^2$ divergence. There are two benefits of LSGANs over regular GANs. First, LSGANs are able to generate higher quality images than regular GANs. Second, LSGANs perform more stably during the learning process. We evaluate the LSGANs on five scene datasets, and the experimental results demonstrate that the images generated by LSGANs look more realistic than the ones generated by regular GANs. We also conduct two comparison experiments between LSGANs and regular GANs to illustrate the stability of LSGANs.

• Mescheder L [MPI Tübingen | Microsoft Research Cambridge] (2017) Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks. arXiv:1701.04722 | good discussion: reddit | Twitter

• Variational Autoencoders (VAEs) are expressive latent variable models that can be used to learn complex probability distributions from training data. However, the quality of the resulting model crucially relies on the expressiveness of the inference model used during training. We introduce Adversarial Variational Bayes (AVB), a technique for training Variational Autoencoders with arbitrarily expressive inference models. We achieve this by introducing an auxiliary discriminative network that allows us to rephrase the maximum-likelihood problem as a two-player game, hence establishing a principled connection between VAEs and Generative Adversarial Networks (GANs). We show that in the nonparametric limit our method yields an exact maximum-likelihood assignment for the parameters of the generative model, as well as the exact posterior distribution over the latent variables given an observation. Contrary to competing approaches which combine VAEs with GANs, our approach has a clear theoretical justification, retains most advantages of standard Variational Autoencoders and is easy to implement.

• Mescheder L [MPI Tübingen | Microsoft Research Cambridge] (2017) The Numerics of GANs. arXiv:1705.10461 | reddit | inFERENCe [Ferenc Huszár: Oct 2017; local copy (pdf)]

• [v1] In this paper, we analyze the numerics of common algorithms for training Generative Adversarial Networks (GANs). Using the formalism of smooth two-player games, we analyze the associated gradient vector field of GAN training objectives. Our findings suggest that the convergence of current algorithms suffers due to two factors: i) the presence of eigenvalues of the Jacobian of the gradient vector field with zero real part, and ii) eigenvalues with a large imaginary part. Using these findings, we design a new algorithm that overcomes some of these limitations and has better convergence properties. Experimentally, we demonstrate its superiority on training common GAN architectures and show convergence on GAN architectures that are known to be notoriously hard to train.

• Metzen JH (2017) Universal Adversarial Perturbations Against Semantic Image Segmentation. arXiv:1704.05712

• While deep learning is remarkably successful on perceptual tasks, it has also been shown to be vulnerable to adversarial perturbations of the input. These perturbations denote noise added to the input that was generated specifically to fool the system while being quasi-imperceptible for humans.
More severely, there even exist universal perturbations that are input-agnostic but fool the network on the majority of inputs. While recent work has focused on image classification, this work proposes attacks against semantic image segmentation: we present an approach for generating (universal) adversarial perturbations that make the network yield a desired target segmentation as output. We show empirically that there exist barely perceptible universal noise patterns which result in nearly the same predicted segmentation for arbitrary inputs. Furthermore, we also show the existence of universal noise which removes a target class (e.g., all pedestrians) from the segmentation while leaving the segmentation mostly unchanged otherwise. • Mohamed S [Google DeepMind] (2016) Learning in Implicit Generative Models. arXiv:1610.03483 | reddit • Generative adversarial networks (GANs) provide an algorithmic framework for constructing generative models with several appealing properties: they do not require a likelihood function to be specified, only a generating procedure; they provide samples that are sharp and compelling; and they allow us to harness our knowledge of building highly accurate neural network classifiers. Here, we develop our understanding of GANs with the aim of forming a rich view of this growing area of machine learning -- to build connections to the diverse set of statistical thinking on this topic, of which much can be gained by a mutual exchange of ideas. We frame GANs within the wider landscape of algorithms for learning in implicit generative models--models that only specify a stochastic procedure with which to generate data--and relate these ideas to modelling problems in related fields, such as econometrics and approximate Bayesian computation. We develop likelihood-free inference methods and highlight hypothesis testing as a principle for learning in implicit generative models, using which we are able to derive the objective function used by GANs, and many other related objectives. The testing viewpoint directs our focus to the general problem of density ratio estimation. There are four approaches for density ratio estimation, one of which is a solution using classifiers to distinguish real from generated data. Other approaches such as divergence minimisation and moment matching have also been explored in the GAN literature, and we synthesise these views to form an understanding in terms of the relationships between them and the wider literature, highlighting avenues for future exploration and cross-pollination. • Molano JM (2016) Adversarial Ladder Networks. arXiv:1611.02320 • The use of unsupervised data in addition to supervised data in training discriminative neural networks has improved the performance of this classification scheme. However, the best results were achieved with a training process that is divided in two parts: first an unsupervised pre-training step is done for initializing the weights of the network and after these weights are refined with the use of supervised data. On the other hand adversarial noise has improved the results of classical supervised learning. Recently, a new neural network topology called Ladder Network, where the key idea is based in some properties of hierarchical latent variable models, has been proposed as a technique to train a neural network using supervised and unsupervised data at the same time with what is called semi-supervised learning. This technique has reached state of the art classification. 
In this work we add adversarial noise to the ladder network and get state of the art classification, with several important conclusions on how adversarial noise can help in addition with new possible lines of investigation. We also propose an alternative to add adversarial noise to unsupervised data. • Moosavi-Dezfooli S-M [Ecole Polytechnique Federale de Lausanne, Switzerland | Universite de Lyon, France] (2016) Universal adversarial perturbations. arXiv:1610.08401 | GitHub | GitXiv | | Hacker News | reddit • Given a state-of-the-art deep neural network classifier, we show the existence of a universal (image-agnostic) and very small perturbation vector that causes natural images to be misclassified with high probability. We propose a systematic algorithm for computing universal perturbations, and show that state-of-the-art deep neural networks are highly vulnerable to such perturbations, albeit being quasi-imperceptible to the human eye. We further empirically analyze these universal perturbations and show, in particular, that they generalize very well across neural networks. The surprising existence of universal perturbations reveals important geometric correlations among the high-dimensional decision boundary of classifiers. It further outlines potential security breaches with the existence of single directions in the input space that adversaries can possibly exploit to break a classifier on most natural images. • Conclusions. We showed the existence of small universal perturbations that can fool state-of-the-art classifiers on natural images. We proposed an iterative algorithm to generate universal perturbations, and highlighted several properties of such perturbations. In particular, we showed that universal perturbations generalize well across different classification models, resulting in doubly-universal perturbations (image-agnostic, network-agnostic). We further explained the existence of such perturbations with the correlation between different regions of the decision boundary. This provides insights on the geometry of the decision boundaries of deep neural networks, and contributes to a better understanding of such systems. A theoretical analysis of the geometric correlations between different parts of the decision boundary will be the subject of future research. • Nguyen A [Bengio Y | University of Wyoming | Montreal Institute for Learning Algorithms | ...] (2016) Plug & Play Generative Networks: Conditional Iterative Generation of Images in Latent Space. pdf | GitHub | GitXiv | project page | reddit • Generating high-resolution, photo-realistic images has been a long-standing goal in machine learning. Recently, Nguyen et al. (2016) showed one interesting way to synthesize novel images by performing gradient ascent in the latent space of a generator network to maximize the activations of one or multiple neurons in a separate classifier network. In this paper we extend this method by introducing an additional prior on the latent code, improving both sample quality and sample diversity, leading to a state-of-the-art generative model that produces high quality images at higher resolutions [227x227] than previous generative models, and does so for all 1000 ImageNet categories. In addition, we provide a unified probabilistic interpretation of related activation maximization methods and call the general class of models "Plug and Play Generative Networks". 
PPGNs are composed of (1) a generator network G that is capable of drawing a wide range of image types and (2) a replaceable "condition" network C that tells the generator what to draw. We demonstrate the generation of images conditioned on a class (when C is an ImageNet or MIT Places classification network) and also conditioned on a caption (when C is an image captioning network). Our method also improves the state of the art of Multifaceted Feature Visualization, which generates the set of synthetic inputs that activate a neuron in order to better understand how deep neural networks operate. Finally, we show that our model performs reasonably well at the task of image inpainting. While image models are used in this paper, the approach is modality-agnostic and can be applied to many types of data.   • Nguyen A (2015) Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. arXiv:1605.09304 | GitHub | reddit • Deep neural networks (DNNs) have demonstrated state-of-the-art results on many pattern recognition tasks, especially vision classification problems. Understanding the inner workings of such computational brains is both fascinating basic science that is interesting in its own right - similar to why we study the human brain - and will enable researchers to further improve DNNs. One path to understanding how a neural network functions internally is to study what each of its neurons has learned to detect. One such method is called activation maximization (AM), which synthesizes an input (e.g. an image) that highly activates a neuron. Here we dramatically improve the qualitative state of the art of activation maximization by harnessing a powerful, learned prior: a deep generator network (DGN). The algorithm (1) generates qualitatively state-of-the-art synthetic images that look almost real, (2) reveals the features learned by each neuron in an interpretable way, (3) generalizes well to new datasets and somewhat well to different network architectures without requiring the prior to be relearned, and (4) can be considered as a high-quality generative method (in this case, by generating novel, creative, interesting, recognizable images). • Mentioned here: Peeking inside Convnets [June 2016]: "There has been similar work on visualizing convolutional networks by e.g. Zeiler and Fergus and lately by Yosinski, Nugyen et al. In a recent work by Nguyen, they manage to visualize features very well, based on a technique they called "mean-image initialization". Since I started writing this blog post, they've also published a new paper [arXiv:1605.09304, above] using Generative Adversarial Networks as priors for the visualizations, which lead to far far better visualizations than the ones I've showed above. If you are interested, do take a look at their paper or the code they've released!" • Odena A [Chris Olah; Shlens, J] (2016) Conditional Image Synthesis With Auxiliary Classifier GANs. arXiv:1610.09585 | GitHub | GitXiv • Synthesizing high resolution photorealistic images has been a long-standing challenge in machine learning. In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128x128 resolution image samples exhibiting global coherence. 
We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128x128 samples are more than twice as discriminable as artificially resized 32x32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data. Source: arXiv:1610.09585 Source: twitter: Chris Olah • Implementation [non-author]: GitHub | reddit | reddit • Papernot N [Goodfellow I | OpenAI | U.S. ARL | Pennsylvania State University] (2017) Practical Black-Box Attacks against Machine Learning. pdf • Machine learning (ML) models, e.g., deep neural networks (DNNs), are vulnerable to adversarial examples: malicious inputs modified to yield erroneous model outputs, while appearing unmodified to human observers. Potential attacks include having malicious content like malware identified as legitimate or controlling vehicle behavior. Yet, all existing adversarial example attacks require knowledge of either the model internals or its training data. We introduce the first practical demonstration of an attacker controlling a remotely hosted DNN with no such knowledge. Indeed, the only capability of our black-box adversary is to observe labels given by the DNN to chosen inputs. Our attack strategy consists in training a local model to substitute for the target DNN, using inputs synthetically generated by an adversary and labeled by the target DNN. We use the local substitute to craft adversarial examples, and find that they are misclassified by the targeted DNN. To perform a real-world and properly-blinded evaluation, we attack a DNN hosted by MetaMind, an online deep learning API. We find that their DNN misclassifies 84.24% of the adversarial examples crafted with our substitute. We demonstrate the general applicability of our strategy to many ML techniques by conducting the same attack against models hosted by Amazon and Google, using logistic regression substitutes. They yield adversarial examples misclassified by Amazon and Google at rates of 96.19% and 88.94%. We also find that this black-box attack strategy is capable of evading defense strategies previously found to make adversarial example crafting harder. • Pan (2017) SalGAN: Visual Saliency Prediction with Generative Adversarial Networks. arXiv:1701.01081 | GitHub | GitXiv | project page | reddit • We introduce SalGAN, a deep convolutional neural network for visual saliency prediction trained with adversarial examples. The first stage of the network consists of a generator model whose weights are learned by back-propagation computed from a binary cross entropy (BCE) loss over downsampled versions of the saliency maps. The resulting prediction is processed by a discriminator network trained to solve a binary classification task between the saliency maps generated by the generative stage and the ground truth ones. Our experiments show how adversarial training allows reaching state-of-the-art performance across different metrics when combined with a widely-used loss function like BCE. • Pfau D [Oriol Vinyals | Google DeepMind] (2016) Connecting Generative Adversarial Networks and Actor-Critic Methods. 
arXiv:1610.01945 | reddit • Both generative adversarial networks (GAN) in unsupervised learning and actor-critic methods in reinforcement learning (RL) have gained a reputation for being difficult to optimize. Practitioners in both fields have amassed a large number of strategies to mitigate these instabilities and improve training. Here we show that GANs can be viewed as actor-critic methods in an environment where the actor cannot affect the reward. We review the strategies for stabilizing training for each class of models, both those that generalize between the two and those that are particular to that model. We also review a number of extensions to GANs and RL algorithms with even more complicated information flow. We hope that by highlighting this formal connection we will encourage both GAN and RL communities to develop general, scalable, and stable algorithms for multilevel optimization with deep networks, and to draw inspiration across communities. • [DCGANs] Radford A (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434 | GitHub | GitXiv | reddit • In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations. • Related: DCGAN in Tensorflow: Tensorflow implementation of Deep Convolutional Generative Adversarial Networks [arXiv:1511.06434], which is a stabilize Generative Adversarial Networks. The referenced torch code can be found here. | demo • Related - implementation (faces): srez [GitHub] | reddit | Hacker News [srez author comments here] • Image super-resolution through deep learning. This project uses deep learning to upscale 16x16 images by a 4x factor. The resulting 64x64 images display sharp features that are plausible based on the dataset that was used to train the neural net. Here's an random, non cherry-picked, example of what this network can do. From left to right, the first column is the 16x16 input image, the second one is what you would get from a standard bicubic interpolation, the third is the output generated by the neural net, and on the right is the ground truth. As you can see, the network is able to produce a very plausible reconstruction of the original face. As the dataset is mainly composed of well-illuminated faces looking straight ahead, the reconstruction is poorer when the face is at an angle, poorly illuminated, or partially occluded by eyeglasses or hands. This particular example was produced after training the network for 4 hours on a GTX 1080 GPU, equivalent to 130,000 batches or about 10 epochs. ... 
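For the Radford et al. DCGAN entry above, the paper's architectural guidelines (strided transposed convolutions, batch normalization, ReLU hidden activations, tanh output, no fully connected hidden layers) can be summarized in a short sketch. This is my own illustration in PyTorch for 64x64 RGB outputs, not the referenced TensorFlow or torch implementations:

```python
# Sketch of a DCGAN-style generator: latent vector z of shape (N, z_dim, 1, 1)
# is upsampled to a 64x64 RGB image via strided transposed convolutions.
import torch.nn as nn

def dcgan_generator(z_dim=100, feat=64):
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, feat * 8, 4, 1, 0, bias=False),    # 1x1  -> 4x4
        nn.BatchNorm2d(feat * 8), nn.ReLU(True),
        nn.ConvTranspose2d(feat * 8, feat * 4, 4, 2, 1, bias=False), # 4x4  -> 8x8
        nn.BatchNorm2d(feat * 4), nn.ReLU(True),
        nn.ConvTranspose2d(feat * 4, feat * 2, 4, 2, 1, bias=False), # 8x8  -> 16x16
        nn.BatchNorm2d(feat * 2), nn.ReLU(True),
        nn.ConvTranspose2d(feat * 2, feat, 4, 2, 1, bias=False),     # 16x16 -> 32x32
        nn.BatchNorm2d(feat), nn.ReLU(True),
        nn.ConvTranspose2d(feat, 3, 4, 2, 1, bias=False),            # 32x32 -> 64x64
        nn.Tanh(),
    )
```

The discriminator mirrors this with strided convolutions and LeakyReLU activations, again without fully connected hidden layers.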
• Related: I wrote an article about learning image generation with DRAW & DCGAN, feedback welcome [ ← reddit: links to: Generating Martian Terrain with Neural Networks] • Raghu M [Google Brain] (2016) On the expressive power of deep neural networks. arXiv:1606.05336 • We propose a new approach to the problem of neural network expressivity, which seeks to characterize how structural properties of a neural network family affect the functions it is able to compute. Our approach is based on an interrelated set of measures of expressivity, unified by the novel notion of trajectory length, which measures how the output of a network changes as the input sweeps along a one-dimensional path. Our findings can be summarized as follows: (1) The complexity of the computed function grows exponentially with depth. (2) All weights are not equal: trained networks are more sensitive to their lower (initial) layer weights. (3) Regularizing on trajectory length (trajectory regularization) is a simpler alternative to batch normalization, with the same performance. • Rajeswar S [Courville A | MILA, Universite de Montreal (UdeM)] (2017) Adversarial Generation of Natural Language. arXiv:1705.10929 | reddit • Schawinski K (2017) Generative Adversarial Networks recover features in astrophysical images of galaxies beyond the deconvolution limit. arXiv:1702.00403 | journal | pdf | Neural networks promise sharpest ever images [ScienceDaily.com: Feb 2017] • Observations of astrophysical objects such as galaxies are limited by various sources of random and systematic noise from the sky background, the optical system of the telescope and the detector used to record the data. Conventional deconvolution techniques are limited in their ability to recover features in imaging data by the Shannon-Nyquist sampling theorem. Here, we train a generative adversarial network (GAN) on a sample of 4,550 images of nearby galaxies at 0.01 < z < 0.02 from the Sloan Digital Sky Survey and conduct 10× cross-validation to evaluate the results. We present a method using a GAN trained on galaxy images that can recover features from artificially degraded images with worse seeing and higher noise than the original with a performance that far exceeds simple deconvolution. The ability to better recover detailed features such as galaxy morphology from low signal to noise and low angular resolution imaging data significantly increases our ability to study existing data sets of astrophysical objects as well as future observations with observatories such as the Large Synoptic Sky Telescope (LSST) and the Hubble and James Webb space telescopes. The frames here show an example of an original galaxy image (left), the same image deliberately degraded (second from left), the image after recovery with the neural net (second from right), and the image processed with deconvolution, the best existing technique (right). • Reed S [Scott Reed now (2016) at Google DeepMind] (2016) Generative Adversarial Text to Image Synthesis. arXiv:1605.05396 | GitHub | GitXiv | reddit • Automatic synthesis of realistic images from text would be interesting and useful, but current AI systems are still far from this goal. However, in recent years generic and powerful recurrent neural network architectures have been developed to learn discriminative text feature representations. Meanwhile, deep convolutional generative adversarial networks (GANs) have begun to generate highly compelling images of specific categories such as faces, album covers, room interiors etc. 
In this work, we develop a novel deep architecture and GAN formulation to effectively bridge these advances in text and image modeling, translating visual concepts from characters to pixels. We demonstrate the capability of our model to generate plausible images of birds and flowers from detailed text descriptions. • Adapted, implemented here [GitHub]: Text To Image Synthesis Using Thought Vectors: This is an experimental tensorflow implementation of synthesizing images from captions using Skip Thought Vectors [arXiv:1506.06726]. The images are synthesized using the GAN-CLS Algorithm from the paper Generative Adversarial Text-to-Image Synthesis [arXiv:1605.05396]. This implementation is built on top of the excellent DCGAN in Tensorflow. The following is the model architecture. The blue bars represent the Skip Thought Vectors for the captions. | reddit | Hacker News • Reed S [U Michigan - Ann Arbor | Max Planck Institute for Informatics | Scott Reed now (2016) at Google DeepMind] (NIPS 2016) Learning What and Where to Draw. pdf | GitHub | GitXiv • Generative Adversarial Networks (GANs) have recently demonstrated the capability to synthesize compelling real-world images, such as room interiors, album covers, manga, faces, birds, and flowers. While existing models can synthesize images based on global constraints such as a class label or caption, they do not provide control over pose or object location. We propose a new model, the Generative Adversarial What-Where Network (GAWWN), that synthesizes images given instructions describing what content to draw in which location. We show high-quality 128 x 128 image synthesis on the Caltech-UCSD Birds dataset, conditioned on both informal text descriptions and also object location. Our system exposes control over both the bounding box around the bird and its constituent parts. By modeling the conditional distributions over part locations, our system also enables conditioning on arbitrary subsets of parts (e.g. only the beak and tail), yielding an efficient interface for picking part locations. • Romano Y [Google Research] (2016) RAISR: Rapid and Accurate Image Super Resolution. arXiv:1606.01299 | Google Research Blog [Nov 2016] | reddit | reddit | Hacker News | Is Google's AI-Driven Image-Resizing Algorithm Dishonest? [Slashdot.org, via reddit] • Given an image, we wish to produce an image of larger size with significantly more pixels and higher image quality. This is generally known as the Single Image Super-Resolution (SISR) problem. The idea is that with sufficient training data (corresponding pairs of low and high resolution images) we can learn set of filters (i.e. a mapping) that when applied to given image that is not in the training set, will produce a higher resolution version of it, where the learning is preferably low complexity. In our proposed approach, the run-time is more than one to two orders of magnitude faster than the best competing methods currently available, while producing results comparable or better than state-of-the-art.] A closely related topic is image sharpening and contrast enhancement, i.e., improving the visual quality of a blurry image by amplifying the underlying details (a wide range of frequencies). Our approach additionally includes an extremely efficient way to produce an image that is significantly sharper than the input blurry one, without introducing artifacts such as halos and noise amplification. 
We illustrate how this effective sharpening algorithm, in addition to being of independent interest, can be used as a pre-processing step to induce the learning of more effective upscaling filters with built-in sharpening and contrast enhancement effect. • Salimans T [Ian Goodfellow | Google DeepMind] (NIPS 2016) Improved Techniques for Training GANs. arXiv:1606.03498 | NIPS Proceedings • We present a variety of new architectural features and training procedures that we apply to the generative adversarial networks (GANs) framework. We focus on two applications of GANs: semi-supervised learning, and the generation of images that humans find visually realistic. Unlike most work on generative models, our primary goal is not to train a model that assigns high likelihood to test data, nor do we require the model to be able to learn well without using any labels. Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. The generated images are of high quality as confirmed by a visual Turing test: our model generates MNIST samples that humans cannot distinguish from real data, and CIFAR-10 samples that yield a human error rate of 21.3%. We also present ImageNet samples with unprecedented resolution and show that our methods enable the model to learn recognizable features of ImageNet classes. • Blog post: Understanding Minibatch Discrimination in GANs • Code: • Commentary - reddit: • See also this thread • Sønderby CK [Twitter Cortex, London, UK] (2016) Amortised MAP Inference for Image Super-resolution. arXiv:1610.04490 • Image Super-resolution (SR) is an underdetermined inverse problem, where a large number of plausible high-resolution images can explain the same downsampled image. Most current single image SR methods use empirical risk minimisation, often with a pixel-wise mean squared error (MSE) loss. However, the outputs from such methods tend to be blurry, over-smoothed and generally appear implausible. A more desirable approach would employ Maximum a Posteriori (MAP) inference, preferring solutions that always have a high probability under the image prior, and thus appear more plausible. Direct MAP estimation for SR is non-trivial, as it requires us to build a model for the image prior from samples. Furthermore, MAP inference is often performed via optimisation-based iterative algorithms which don't compare well with the efficiency of neural-network-based alternatives. Here we introduce new methods for amortised MAP inference whereby we calculate the MAP estimate directly using a convolutional neural network. We first introduce a novel neural network architecture that performs a projection to the affine subspace of valid SR solutions ensuring that the high resolution output of the network is always consistent with the low resolution input. We show that, using this architecture, the amortised MAP inference problem reduces to minimising the cross-entropy between two distributions, similar to training generative models. We propose three methods to solve this optimisation problem: (1) Generative Adversarial Networks (GAN) (2) denoiser-guided SR which backpropagates gradient-estimates from denoising to train the network, and (3) a baseline method using a maximum-likelihood-trained image prior. Our experiments show that the GAN based approach performs best on real image data, achieving particularly good results in photo-realistic texture SR. • Discussions: • Tabacof P (2015) Exploring the space of adversarial images. 
arXiv:1510.05328 | GitHub | GitXiv | GitXiv | project page • Adversarial examples have raised questions regarding the robustness and security of deep neural networks. In this work we formalize the problem of adversarial images given a pretrained classifier, showing that even in the linear case the resulting optimization problem is nonconvex. We generate adversarial images using shallow and deep classifiers on the MNIST and ImageNet datasets. We probe the pixel space of adversarial images using noise of varying intensity and distribution. We bring novel visualizations that showcase the phenomenon and its high variability. We show that adversarial images appear in large regions in the pixel space, but that, for the same task, a shallow classifier seems more robust to adversarial images than a deep convolutional network. • summarized here [reddit] • Taigman, Y [Facebook AI Research] (2016) Unsupervised Cross-Domain Image Generation. arXiv:1611.02200 | GitHub | reddit • We study the problem of transferring a sample in one domain to an analog sample in another domain. Given two related domains, S and T, we would like to learn a generative function G that maps an input sample from S to the domain T, such that the output of a given function f, which accepts inputs in either domains, would remain unchanged. Other than the function f, the training data is unsupervised and consist of a set of samples from each domain. The Domain Transfer Network (DTN) we present employs a compound loss function that includes a multiclass GAN loss, an f-constancy component, and a regularizing component that encourages G to map samples from T to themselves. We apply our method to visual domains including digits and face images and demonstrate its ability to generate convincing novel images of previously unseen entities, while preserving their identity. • Tanay T (2016) A Boundary Tilting Perspective on the Phenomenon of Adversarial Examples. arXiv:1608.07690 • Deep neural networks have been shown to suffer from a surprising weakness: their classification outputs can be changed by small, non-random perturbations of their inputs. This adversarial example phenomenon has been explained as originating from deep networks being "too linear" (Goodfellow et al., 2014). We show here that the linear explanation of adversarial examples presents a number of limitations: the formal argument is not convincing, linear classifiers do not always suffer from the phenomenon, and when they do their adversarial examples are different from the ones affecting deep networks. We propose a new perspective on the phenomenon. We argue that adversarial examples exist when the classification boundary lies close to the submanifold of sampled data, and present a mathematical analysis of this new perspective in the linear case. We define the notion of adversarial strength and show that it can be reduced to the deviation angle between the classifier considered and the nearest centroid classifier. Then, we show that the adversarial strength can be made arbitrarily high independently of the classification performance due to a mechanism that we call boundary tilting. This result leads us to defining a new taxonomy of adversarial examples. Finally, we show that the adversarial strength observed in practice is directly dependent on the level of regularisation used and the strongest adversarial examples, symptomatic of overfitting, can be avoided by using a proper level of regularisation. 
• Tolstikhin T [Max Planck Institute for Intelligent Systems | Google Brain] (2017) AdaGAN: Boosting Generative Models. arXiv:1701.02386 | reddit • Generative Adversarial Networks (GAN) (Goodfellow et al., 2014) are an effective method for training generative models of complex data such as natural images. However, they are notoriously hard to train and can suffer from the problem of missing modes where the model is not able to produce examples in certain regions of the space. We propose an iterative procedure, called AdaGAN, where at every step we add a new component into a mixture model by running a GAN algorithm on a reweighted sample. This is inspired by boosting algorithms, where many potentially weak individual predictors are greedily aggregated to form a strong composite predictor. We prove that such an incremental procedure leads to convergence to the true distribution in a finite number of steps if each step is optimal, and convergence at an exponential rate otherwise. We also illustrate experimentally that this procedure addresses the problem of missing modes. • Tramèr F [Nicolas Papernot; Ian Goodfellow | Stanford University | Google Brain] (2017) The Space of Transferable Adversarial Examples. arXiv:1704.03453 | reddit • Adversarial examples are maliciously perturbed inputs designed to mislead machine learning (ML) models at test-time. Adversarial examples are known to transfer across models: a same perturbed input is often misclassified by different models despite being generated to mislead a specific architecture. This phenomenon enables simple yet powerful black-box attacks against deployed ML systems. In this work, we propose novel methods for estimating the previously unknown dimensionality of the space of adversarial inputs. We find that adversarial examples span a contiguous subspace of large dimensionality and that a significant fraction of this space is shared between different models, thus enabling transferability. The dimensionality of the transferred adversarial subspace implies that the decision boundaries learned by different models are eerily close in the input domain, when moving away from data points in adversarial directions. A first quantitative analysis of the similarity of different models' decision boundaries reveals that these boundaries are actually close in arbitrary directions, whether adversarial or benign. We conclude with a formal study of the limits of transferability. We show (1) sufficient conditions on the data distribution that imply transferability for simple model classes and (2) examples of tasks for which transferability fails to hold. This suggests the existence of defenses making models robust to transferability attacks---even when the model is not robust to its own adversarial examples. • Uehara M [University of Tokyo] (2016) Generative Adversarial Nets from a Density Ratio Estimation Perspective. arXiv:1610.02920 | reddit | reddit • Generative adversarial networks (GANs) are successful deep generative models. GANs are based on a two-player minimax game. However, the objective function derived in the original motivation is changed to obtain stronger gradients when learning the generator. We propose a novel algorithm that repeats the density ratio estimation and f-divergence minimization. Our algorithm offers a new perspective toward the understanding of GANs and is able to make use of multiple viewpoints obtained in the research of density ratio estimation, e.g. what divergence is stable and relative density ratio is useful. 
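The boosting-style loop in the AdaGAN entry above can be summarized in a few lines. This is a conceptual sketch of my own, not the authors' code: `train_gan` and `reweight` are hypothetical placeholders, and the mixing weight `beta` is held fixed here for simplicity (the paper chooses it adaptively):

```python
# Conceptual AdaGAN-style loop: repeatedly train a GAN on a reweighted sample and
# add it to a mixture, downweighting the existing components by (1 - beta).
def adagan(data, train_gan, reweight, num_steps, beta=0.5):
    mixture = []                                  # list of (weight, generator) pairs
    weights = [1.0 / len(data)] * len(data)       # start from the uniform weighting
    for step in range(num_steps):
        g = train_gan(data, weights)              # fit a new GAN to the reweighted sample
        if not mixture:
            mixture.append((1.0, g))
        else:
            mixture = [(w * (1.0 - beta), gen) for w, gen in mixture]
            mixture.append((beta, g))
        weights = reweight(data, mixture)         # upweight examples the mixture still misses
    return mixture
```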
• Vondrick C (NIPS 2016 | MIT CSAIL) Generating Videos with Scene Dynamics. pdf | project website | GIFs and source code released for GAN paper 'Generating Videos with Scene Dynamics' [reddit] | reddit • We capitalize on large amounts of unlabeled video in order to learn a model of scene dynamics for both video recognition tasks (e.g. action classification) and video generation tasks (e.g. future prediction). We propose a generative adversarial network for video with a spatio-temporal convolutional architecture that untangles the scene's foreground from the background. Experiments suggest this model can generate tiny videos up to a second at full frame rate better than simple baselines, and we show its utility at predicting plausible futures of static images. Moreover, experiments and visualizations show the model internally learns useful features for recognizing actions with minimal supervision, suggesting scene dynamics are a promising signal for representation learning. We believe generative video models can impact many applications in video understanding and simulation. • See also [this author]: Vondrick C [MIT CSAIL] (2016) Anticipating Visual Representations from Unlabeled Video. more here. • White T (2016) Sampling Generative Networks: Notes on a Few Effective Techniques. arXiv:1609.04468 | GitHub | GitXiv | @dribnet [ << Tom White: Twitter] • We introduce several techniques for sampling and visualizing the latent spaces of generative models. Replacing linear interpolation with spherical linear interpolation prevents diverging from a models prior distribution and produces sharper samples. J Diagrams and MINE grids are introduced as visualizations of manifolds created by analogies and nearest neighbors. We demonstrate two new techniques for deriving attribute vectors - bias-correct vectors with data replication and synthetic vectors with data augmentation. Most techniques are intended to be independent of model type and examples are shown on both Variational Autoencoders and Generative Adversarial Networks. • Wu L [Microsoft Research Asia] (2017) Adversarial Neural Machine Translation. arXiv:1704.06933 • In this paper, we study a new learning paradigm for Neural Machine Translation (NMT). Instead of maximizing the likelihood of the human translation as in previous works, we minimize the distinction between human translation and the translation given by a NMT model. To achieve this goal, inspired by the recent success of generative adversarial networks (GANs), we employ an adversarial training architecture and name it as Adversarial-NMT. In Adversarial-NMT, the training of the NMT model is assisted by an adversary, which is an elaborately designed Convolutional Neural Network (CNN). The goal of the adversary is to differentiate the translation result generated by the NMT model from that by human. The goal of the NMT model is to produce high quality translations so as to cheat the adversary. A policy gradient method is leveraged to co-train the NMT model and the adversary. Experimental results on English → French and German → English translation tasks show that Adversarial-NMT can achieve significantly better translation quality than several strong baselines. • Yu L (2016) SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. 
arXiv:1609.05473 | GitHub | GitXiv • As a new way of training generative models, Generative Adversarial Nets (GAN) that uses a discriminative model to guide the training of the generative model has enjoyed considerable success in generating real-valued data. However, it has limitations when the goal is for generating sequences of discrete tokens. A major reason lies in that the discrete outputs from the generative model make it difficult to pass the gradient update from the discriminative model to the generative model. Also, the discriminative model can only assess a complete sequence, while for a partially generated sequence, it is non-trivial to balance its current score and the future one once the entire sequence has been generated. In this paper, we propose a sequence generation framework, called SeqGAN, to solve the problems. Modeling the data generator as a stochastic policy in reinforcement learning (RL), SeqGAN bypasses the generator differentiation problem by directly performing gradient policy update. The RL reward signal comes from the GAN discriminator judged on a complete sequence, and is passed back to the intermediate state-action steps using Monte Carlo search. Extensive experiments on synthetic data and real-world tasks demonstrate significant improvements over strong baselines. • reddit: "Very excited about this paper! Finally a promising way to apply GANs to sequences. The code was really easy to get to work, too, and looks eminently readable." • Zhao J [Yann LeCun | NYU | Facebook Artificial Intelligence Research] (2016) Energy-based Generative Adversarial Network). arXiv:1609.03126 | GitHub | GitXiv | reddit • We introduce the "Energy-based Generative Adversarial Network" (EBGAN) model which views the discriminator in GAN framework as an energy function that associates low energies with the regions near the data manifold and higher energies everywhere else. Similar to the probabilistic GANs, a generator is trained to produce contrastive samples with minimal energies, while the energy function is trained to assign high energies to those generated samples. Viewing the discriminator as an energy function allows to use a wide variety of architectures and loss functionals in addition to the usual binary discriminant network. Among them, an instantiation of EBGANs is to use an auto-encoder architecture alongside the energy being the reconstruction error. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images. • blog: Are Energy-Based GANs any more energy-based than normal GANs? | reddit • As I was reading the paper, my excitement has disappeared somewhat. Instead of a nice theoretical framework I was hoping to see, the authors' choices looked a bit arbitrary to me, loosely motivated by intuition. In this post I'm trying to explain how I think about energy-based GANs (EBGANs). I'm only really going to touch on very big-picture details instead of covering all details of the paper. • Summary of this note: • I introduce a unifying framework to think about GAN-type methods. 
This includes the original GAN and energy-based EBGANs as special cases • I show different special cases of this general algorithm • Using an example I show that intuition is not enough and details matter: a special case of this algorithm-family that has pathological behaviour • I finally cast the EBGAN objective in this framework • I think that the choices the authors made are rather arbitrary, I don't see why this particular algorithm should do well, and I find it hard to make predictions about how it's going to work compared to other variants. • Zhu J-J (2017) Generative Adversarial Active Learning. arXiv:1702.07956 • We propose a new active learning approach using Generative Adversarial Networks (GAN). Different from regular active learning, we adaptively synthesize training instances for querying to increase learning speed. Our approach outperforms random generation using GAN alone in active learning experiments. We demonstrate the effectiveness of the proposed algorithm in various datasets when compared to other algorithms. To the best of our knowledge, this is the first active learning work using GAN. • [cool!] Zhu JY (ECCV 2016) Generative Visual Manipulation on the Natural Image Manifold. pdf | "iGAN" • Realistic image manipulation is challenging because it requires modifying the image appearance in a user-controlled way, while preserving the realism of the result. Unless the user has considerable artistic skill, it is easy to "fall off" the manifold of natural images while editing. In this paper, we propose to learn the natural image manifold directly from data using a generative adversarial neural network. We then define a class of image editing operations, and constrain their output to lie on that learned manifold at all times. The model automatically adjusts the output, keeping all edits as realistic as possible. All our manipulations are expressed in terms of constrained optimization and are applied in near-real time. We evaluate our algorithm on the task of realistic photo manipulation of shape and color. The presented method can further be used for changing one image to look like the other, as well as generating novel imagery from scratch based on users' scribbles. • Project page | code | YouTube • iGAN: Interactive Image Generation powered by GAN [GitHub] | GitXiv • reddit
## AI: ARTIFICIAL INTELLIGENCE - CONSCIOUSNESS
[AI-Consciousness] Papers: • Bengio Y [Université de Montréal, MILA] (2017) The Consciousness Prior. arXiv:1709.08568 | reddit | What is Yoshua Bengio's new "Consciousness Prior" paper about? [Quora: Sep 2017] • [v1] A new prior is proposed for representation learning, which can be combined with other priors in order to help disentangling abstract factors from each other. It is inspired by the phenomenon of consciousness seen as the formation of a low-dimensional combination of a few concepts constituting a conscious thought, i.e., consciousness as awareness at a particular time instant. This provides a powerful constraint on the representation in that such low-dimensional thought vectors can correspond to statements about reality which are true, highly probable, or very useful for taking decisions. The fact that a few elements of the current state can be combined into such a predictive or useful statement is a strong constraint and deviates considerably from the maximum likelihood approaches to modelling data and how states unfold in the future based on an agent's actions. Instead of making predictions in the sensory (e.g.
pixel) space, the consciousness prior allows the agent to make predictions in the abstract space, with only a few dimensions of that space being involved in each of these predictions. The consciousness prior also makes it natural to map conscious states to natural language utterances or to express classical AI knowledge in the form of facts and rules, although the conscious states may be richer than what can be expressed easily in the form of a sentence, a fact or a rule. • Could We Build a Machine with Consciousness? [MIT Technology Review: Oct 2017] • Dehaene S (2017) What is consciousness, and could machines have it? Science: Oct 2017 [pdf] • The controversial question of whether machines may ever be conscious must be based on a careful consideration of how consciousness arises in the only physical system that undoubtedly possesses it: the human brain. We suggest that the word "consciousness" conflates two different types of information-processing computations in the brain: the selection of information for global broadcasting, thus making it flexibly available for computation and report (C1, consciousness in the first sense), and the self-monitoring of those computations, leading to a subjective sense of certainty or error (C2, consciousness in the second sense). We argue that despite their recent successes, current machines are still mostly implementing computations that reflect unconscious processing (C0) in the human brain. We review the psychological and neural science of unconscious (C0) and conscious computations (C1 and C2) and outline how they may inspire novel machine architectures.
## AI: ARTIFICIAL INTELLIGENCE - GENERAL
[AI-General] Blogs: [AI-General] Competitions: • Alexa Prize [Amazon]: $2.5 Million to Advance Conversational Artificial Intelligence | MIT Technology Review: Amazon is offering $2.5 million to fund research aimed at enabling its personal assistant software to hold a 20-minute conversation | reddit | Hacker News • The way humans interact with machines is at an inflection point and conversational artificial intelligence (AI) is at the center of the transformation. Alexa, the voice service that powers Amazon Echo, enables customers to interact with the world around them in a more intuitive way using only their voice. The Alexa Prize is an annual competition for university students dedicated to accelerating the field of conversational AI. The inaugural competition is focused on creating a socialbot, a new Alexa skill that converses coherently and engagingly with humans on popular topics and news events. Participating teams will advance several areas of conversational AI including knowledge acquisition, natural language understanding, natural language generation, context modeling, commonsense reasoning and dialog planning. Through the innovative work of students, Alexa customers will have novel, engaging conversations. And, the immediate feedback from Alexa customers will help students improve their algorithms much faster than previously possible. [AI-General] Education: • [Stanford] cs221: Artificial Intelligence: Principles and Techniques: What do web search, speech recognition, face recognition, machine translation, autonomous driving, and automatic scheduling have in common? These are all complex real-world problems, and the goal of artificial intelligence (AI) is to tackle these with rigorous mathematical tools. In this course, you will learn the foundational principles that drive these applications and practice implementing some of these systems.
Specific topics include machine learning, search, game playing, Markov decision processes, constraint satisfaction, graphical models, and logic. The main goal of the course is to equip you with the tools to tackle new AI problems you might encounter in life. [AI-General] Papers: • Baroni M [Mikolov T | Facebook Artificial Intelligence Research] (2017) CommAI: Evaluating the first steps towards a useful general AI. arXiv:1701.08954 • With machine learning successfully applied to new daunting problems almost every day, general AI starts looking like an attainable goal. However, most current research focuses instead on important but narrow applications, such as image classification or machine translation. We believe this to be largely due to the lack of objective ways to measure progress towards broad machine intelligence. In order to fill this gap, we propose here a set of concrete desiderata for general AI, together with a platform to test machines on how well they satisfy such desiderata, while keeping all further complexities to a minimum. • Palm RB [Technical University of Denmark | DeepMind] (2017) Recurrent Relational Networks for Complex Relational Reasoning. arXiv:1711.08028 • [v1] Humans possess an ability to abstractly reason about objects and their interactions, an ability not shared with state-of-the-art deep learning models. Relational networks, introduced by Santoro et al. (2017), add the capacity for relational reasoning to deep neural networks, but are limited in the complexity of the reasoning tasks they can address. We introduce recurrent relational networks which increase the suite of solvable tasks to those that require an order of magnitude more steps of relational reasoning. We use recurrent relational networks to solve Sudoku puzzles and achieve state-of-the-art results by solving 96.6% of the hardest Sudoku puzzles, where relational networks fail to solve any. We also apply our model to the BaBi textual QA dataset solving 19/20 tasks which is competitive with state-of-the-art sparse differentiable neural computers. The recurrent relational network is a general purpose module that can augment any neural network model with the capacity to do many-step relational reasoning. • Schmidhuber J (2015) On learning to think: Algorithmic information theory for novel combinations of reinforcement learning controllers and recurrent neural world models. arXiv:1511.09249 | [arXiv admin note: substantial text overlap with arXiv:1404.7828] | reddit • This paper addresses the general problem of reinforcement learning (RL) in partially observable environments. In 2013, our large RL recurrent neural networks (RNNs) learned from scratch to drive simulated cars from high-dimensional video input. However, real brains are more powerful in many ways. In particular, they learn a predictive model of their initially unknown environment, and somehow use it for abstract (e.g., hierarchical) planning and reasoning. Guided by algorithmic information theory, we describe RNN-based AIs (RNNAIs) designed to do the same. Such an RNNAI can be trained on never-ending sequences of tasks, some of them provided by the user, others invented by the RNNAI itself in a curious, playful fashion, to improve its RNN-based world model. Unlike our previous model-building RNN-based RL machines dating back to 1990, the RNNAI learns to actively query its model for abstract reasoning and planning and decision making, essentially "learning to think." 
The basic ideas of this report can be applied to many other cases where one RNN-like system exploits the algorithmic information content of another. They are taken from a grant proposal submitted in Fall 2014, and also explain concepts such as "mirror neurons." Experimental results will be described in separate papers. • More here. [AI-General] Reports: [AI-General] Research: • Google DeepMind: • DeepMind and Blizzard to release StarCraft II as an AI research environment [Google DeepMind blog: Nov 2016] • Today at BlizzCon 2016 in Anaheim, California, we announced our collaboration with Blizzard Entertainment to open up StarCraft II to AI and Machine Learning researchers around the world. For almost 20 years, the StarCraft game series has been widely recognised as the pinnacle of 1v1 competitive video games, and among the best PC games of all time. The original StarCraft was an early pioneer in eSports, played at the highest level by elite professional players since the late 90s, and remains incredibly competitive to this day. The StarCraft series' longevity in competitive gaming is a testament to Blizzard's design, and their continual effort to balance and refine their games over the years. StarCraft II continues the series' renowned eSports tradition, and has been the focus of our work with Blizzard. DeepMind is on a scientific mission to push the boundaries of AI, developing programs that can learn to solve any complex problem without needing to be told how. Games are the perfect environment in which to do this, allowing us to develop and test smarter, more flexible AI algorithms quickly and efficiently, and also providing instant feedback on how we're doing through scores. [ ... SNIP! ... ] • StarCraft Will Become the Next Big Playground for AI [MIT Technology Review: Nov 2016] Artificial intelligence will require key advances in order to play a video game filled with planning, guesswork, and bluffing. • DeepMind and Blizzard to release StarCraft II as an AI research environment [reddit] • DeepMind announces collaboration with Blizzard [reddit] • Hacker News
## ANIMATION
[Animation] Papers: • Holden D (2017) Phase-Functioned Neural Networks for Character Control. pdf | project page | blog: TechCrunch • We present a real-time character control mechanism using a novel neural network architecture called a Phase-Functioned Neural Network. In this network structure, the weights are computed via a cyclic function which uses the phase as an input. Along with the phase, our system takes as input user controls, the previous state of the character, the geometry of the scene, and automatically produces high quality motions that achieve the desired user control. The entire network is trained in an end-to-end fashion on a large dataset composed of locomotion such as walking, running, jumping, and climbing movements fitted into virtual environments. Our system can therefore automatically produce motions where the character adapts to different geometric environments such as walking and running over rough terrain, climbing over large rocks, jumping over obstacles, and crouching under low ceilings. Our network architecture produces higher quality results than time-series autoregressive models such as LSTMs as it deals explicitly with the latent variable of motion relating to the phase. Once trained, our system is also extremely fast and compact, requiring only milliseconds of execution time and a few megabytes of memory, even when trained on gigabytes of motion data. Our work is most appropriate for controlling characters in interactive scenes such as computer games and virtual reality systems. • Keywords: locomotion; human motion; character animation; character control
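• Aside: the core mechanism in the entry above is that the network's weights are a cyclic function of the phase. A minimal NumPy sketch of that weight-blending idea (my own illustration with toy sizes, not the authors' full character-control network):

```python
import numpy as np

def phase_weights(control_weights, phase):
    """Blend cyclic control weight matrices with a uniform Catmull-Rom spline.

    control_weights: list of 4 arrays of identical shape, one control point
        per quarter of the phase cycle.
    phase: scalar in [0, 1), the character's position in its gait cycle.
    Returns the blended weight array to use for this frame's forward pass.
    """
    k = len(control_weights)                      # 4 control points
    x = phase * k                                 # position along the cycle
    i1 = int(np.floor(x)) % k                     # segment start index
    i0, i2, i3 = (i1 - 1) % k, (i1 + 1) % k, (i1 + 2) % k
    t = x - np.floor(x)                           # local parameter in [0, 1)
    P0, P1, P2, P3 = (control_weights[i] for i in (i0, i1, i2, i3))
    # Uniform Catmull-Rom spline, evaluated entry-wise over the weight arrays.
    return 0.5 * (2.0 * P1
                  + (-P0 + P2) * t
                  + (2.0 * P0 - 5.0 * P1 + 4.0 * P2 - P3) * t ** 2
                  + (-P0 + 3.0 * P1 - 3.0 * P2 + P3) * t ** 3)

# Example: blend the first-layer weights of a toy 32 -> 16 network at phase 0.37.
rng = np.random.default_rng(1)
controls = [rng.standard_normal((16, 32)) * 0.1 for _ in range(4)]
W = phase_weights(controls, phase=0.37)
hidden = np.maximum(W @ rng.standard_normal(32), 0.0)   # one ReLU layer
```

Because the control indices wrap around (modulo 4), the blended weights vary smoothly and periodically with the phase, which is what lets one compact network cover an entire locomotion cycle.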
## APPLIED ML
[Applied ML] Biology: [Applied ML - Biology] Blogs: • A.I. Versus M.D.: What happens when diagnosis is automated? | local copy [pdf] | reddit • Deep Learning in drug discovery [reddit]; 2 papers: • Aliper A (2016) Deep learning applications for predicting pharmacological properties of drugs and drug repurposing using transcriptomic data. pdf • Deep learning is rapidly advancing many areas of science and technology with multiple success stories in image, text, voice and video recognition, robotics, and autonomous driving. In this paper we demonstrate how deep neural networks (DNN) trained on large transcriptional response data sets can classify various drugs to therapeutic categories solely based on their transcriptional profiles. We used the perturbation samples of 678 drugs across A549, MCF-7, and PC-3 cell lines from the LINCS Project and linked those to 12 therapeutic use categories derived from MeSH. To train the DNN, we utilized both gene level transcriptomic data and transcriptomic data processed using a pathway activation scoring algorithm, for a pooled data set of samples perturbed with different concentrations of the drug for 6 and 24 hours. In both pathway and gene level classification, DNN achieved high classification accuracy and convincingly outperformed the support vector machine (SVM) model on every multiclass classification problem, however, models based on pathway level data performed significantly better. For the first time we demonstrate a deep learning neural net trained on transcriptomic data to recognize pharmacological properties of multiple drugs across different biological systems and conditions. We also propose using deep neural net confusion matrices for drug repositioning. This work is a proof of principle for applying deep learning to drug discovery and development. • Artificial intelligence achieves near-human performance in diagnosing breast cancer: [ScienceDaily.com] Pathologists have been largely diagnosing disease the same way for the past 100 years, by manually reviewing images under a microscope. But new work suggests that computers can help doctors improve accuracy and significantly change the way cancer and other diseases are diagnosed. • Putin E (2017) Deep biomarkers of human aging: Application of deep neural networks to biomarker development. pdf • One of the major impediments in human aging research is the absence of a comprehensive and actionable set of biomarkers that may be targeted and measured to track the effectiveness of therapeutic interventions. In this study, we designed a modular ensemble of 21 deep neural networks (DNNs) of varying depth, structure and optimization to predict human chronological age using a basic blood test. To train the DNNs, we used over 60,000 samples from common blood biochemistry and cell count tests from routine health exams performed by a single laboratory and linked to chronological age and sex. The best performing DNN in the ensemble demonstrated 81.5% epsilon-accuracy r = 0.90 with R2 = 0.80 and MAE = 6.07 years in predicting chronological age within a 10 year frame, while the entire ensemble achieved 83.5% epsilon-accuracy r = 0.91 with R2 = 0.82 and MAE = 5.55 years.
The ensemble also identified the 5 most important markers for predicting human chronological age: albumin, glucose, alkaline phosphatase, urea and erythrocytes. To allow for public testing and evaluate real-life performance of the predictor, we developed an online system available at http://aging.ai. The ensemble approach may facilitate integration of multi-modal data linked to chronological age and sex that may lead to simple, minimally invasive, and affordable methods of tracking integrated biomarkers of aging in humans and performing cross-species feature importance analysis. • Diagnosing Heart Diseases with Deep Neural Networks [reddit] | blog post | GitHub | winning team's summary | Detecting heart arrhythmias using machine learning and Apple Watch data [blog post] • Is there a comprehensive list of all of the famous problems solved successfully using Deep Learning? [reddit] • Machine learning researchers team up with Chinese botanists on flower-recognition project: ... At least 250,000 species of flowers exist and even experienced botanists have trouble identifying them all. Now there's a way thanks to the rising power and sophistication of image recognition and the ease of taking pictures with your smartphone. It's called the Smart Flower Recognition System but it might never have happened were it not for a chance encounter last year between Microsoft researchers and botanists at the Institute of Botany, Chinese Academy of Sciences (IBCAS). ... [Phys.org] • Neural Networks for Genomics: When we first set out to apply deep learning to genomics, we asked ourselves what the current state of the art is. What problems are researchers working on and what approaches are they using? This post contains a summary of what we found - an overview of popular network architectures in genomics, the types of data used to train deep models, and the outcomes predicted or inferred. | reddit • Recognizing & Localizing Endangered Right Whales c. Extremely Deep NN | summary of reddit threads | discussed here: How to create the bounding boxes in an image? [reddit] • Intro to Neptune - Machine Learning Platform [Sep 2016] | Neptune | local copy [pdf] | reddit • In January 2016, deepsense.io won the Right Whale Recognition contest on kaggle. The competition's goal was to automate the right whale recognition process using a dataset of aerial photographs of individual whales. The terms and conditions for the competition stated that to collect the prize, the winning team had to provide source code and a description of how to recreate the winning solution. A fair request, but as it turned out, the winning solution's authors spent about three weeks recreating all of the steps that led them to the winning machine learning model. • ... The deepsense.io research team performed around 1000 experiments to find the competition-winning solution. Knowing all that, it becomes clear why recreating the solution was such a difficult and time consuming task. ... deepsense.io decided to build Neptune - a brand new machine learning platform that organizes data science processes. This platform relieves data scientists of the manual tasks related to managing their experiments. It helps with monitoring long-running experiments and supports team collaboration. All these features are accessible through the powerful Neptune Web UI and useful for scripting CLI. • Neptune is already used in all machine learning projects at deepsense.io. 
Every week, our data scientists execute around 1001 experiments using this machine learning platform. Thanks to that, the machine learning team can focus on data science and stop worrying about process management. [Applied ML - Biology] Papers: • Angermueller C (2016) Deep learning for computational biology. pdf [review] • Technological advances in genomics and imaging have led to an explosion of molecular and cellular profiling data from large numbers of samples. This rapid increase in biological data dimension and acquisition rate is challenging conventional analysis strategies. Modern machine learning methods, such as deep learning, promise to leverage very large data sets for finding hidden structure within them, and for making accurate predictions. In this review, we discuss applications of this new breed of analysis approaches in regulatory genomics and cellular imaging. We provide background of what deep learning is, and the settings in which it can be successfully applied to derive biological insights. In addition to presenting specific applications and providing tips for practical use, we also highlight possible pitfalls and limitations to guide computational biologists when and how to make the most use of this new technology. • Asgari E (2015) ProtVec: A Continuous Distributed Representation of Biological Sequences. arXiv:1503.05140 • We introduce a new representation and feature extraction method for biological sequences. Named bio-vectors (BioVec) to refer to biological sequences in general with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences, this representation can be widely used in applications of deep learning in proteomics and genomics. In the present paper, we focus on protein-vectors that can be utilized in a wide array of bioinformatics investigations such as family classification, protein visualization, structure prediction, disordered protein identification, and protein-protein interaction prediction. In this method, we adopt artificial neural network approaches and represent a protein sequence with a single dense n-dimensional vector. To evaluate this method, we apply it in classification of 324,018 protein sequences obtained from Swiss-Prot belonging to 7,027 protein families, where an average family classification accuracy of 93%+-0.06% is obtained, outperforming existing family classification methods. In addition, we use ProtVec representation to predict disordered proteins from structured proteins. Two databases of disordered sequences are used: the DisProt database as well as a database featuring the disordered regions of nucleoporins rich with phenylalanine-glycine repeats (FG-Nups). Using support vector machine classifiers, FG-Nup sequences are distinguished from structured protein sequences found in Protein Data Bank (PDB) with a 99.8% accuracy, and unstructured DisProt sequences are differentiated from structured DisProt sequences with 100.0% accuracy. These results indicate that by only providing sequence data for various proteins into this model, accurate information about protein structure can be determined. • Buggenthin F (2017) Prospective identification of hematopoietic lineage choice by deep learning | Deep Learning predicts hematopoietic stem cell development [ScienceDaily.com: Feb 2017] • Differentiation alters molecular properties of stem and progenitor cells, leading to changes in their shape and movement characteristics. 
We present a deep neural network that prospectively predicts lineage choice in differentiating primary hematopoietic progenitors using image patches from brightfield microscopy and cellular movement. Surprisingly, lineage choice can be detected up to three generations before conventional molecular markers are observable. Our approach allows identification of cells with differentially expressed lineage-specifying genes without molecular labeling. • Collij LE (2017) Application of Machine Learning to Arterial Spin Labeling in Mild Cognitive Impairment and Alzheimer Disease. pdf | Artificial intelligence may aid in Alzheimer's diagnosis [ScienceDaily.com] Purpose: To investigate whether multivariate pattern recognition analysis of arterial spin labeling (ASL) perfusion maps can be used for classification and single-subject prediction of patients with Alzheimer disease (AD) and mild cognitive impairment (MCI) and subjects with subjective cognitive decline (SCD) after using the W score method to remove confounding effects of sex and age. Materials and Methods: Pseudocontinuous 3.0-T ASL images were acquired in 100 patients with probable AD; 60 patients with MCI, of whom 12 remained stable, 12 were converted to a diagnosis of AD, and 36 had no follow-up; 100 subjects with SCD; and 26 healthy control subjects. The AD, MCI, and SCD groups were divided into a sex- and age-matched training set (n = 130) and an independent prediction set (n = 130). Standardized perfusion scores adjusted for age and sex (W scores) were computed per voxel for each participant. Training of a support vector machine classifier was performed with diagnostic status and perfusion maps. Discrimination maps were extracted and used for single-subject classification in the prediction set. Prediction performance was assessed with receiver operating characteristic (ROC) analysis to generate an area under the ROC curve (AUC) and sensitivity and specificity distribution. Results: Single-subject diagnosis in the prediction set by using the discrimination maps yielded excellent performance for AD versus SCD (AUC, 0.96; P < 0.01), good performance for AD versus MCI (AUC, 0.89; P < 0.01), and poor performance for MCI versus SCD (AUC, 0.63; P = 0.06). Application of the AD versus SCD discrimination map for prediction of MCI subgroups resulted in good performance for patients with MCI diagnosis converted to AD versus subjects with SCD (AUC, 0.84; P < 0.01) and fair performance for patients with MCI diagnosis converted to AD versus those with stable MCI (AUC, 0.71; P > 0.05). Conclusion: With automated methods, age- and sex-adjusted ASL perfusion maps can be used to classify and predict diagnosis of AD, conversion of MCI to AD, stable MCI, and SCD with good to excellent accuracy and AUC values. • Correia FB (2016) Prediction of Cancer using Network Topological Features. pdf • Several data mining methods have been applied to explore biological data and understand the mechanisms that regulate genetic and metabolic diseases. The underlying hypothesis is that the identification of signatures can help the clinical identification of diseased tissues. Under this principle many different methodologies have been tested mostly using unsupervised methods. A common trend consists in combining the information obtained from gene expression and protein-protein interaction networks analyses or, more recently, building series of complex networks to model system dynamics. 
Despite the positive results that these works present, they typically fail to generalize out of sample datasets. In this paper we describe a supervised classification approach, with a new methodology for extracting the network topology dynamics embedded in a disease system, to improve the capacity of cancer prediction, using exclusively the topological properties of biological networks as features. Four microarrays datasets were used, for testing and validation, three from breast cancer experiments and one from a liver cancer experiment. The obtained results corroborate the potential of the proposed methodology to predict a certain type of cancer and the necessity of applying different classification models to different types of cancer. • Deming L (2016) Genetic Architect: Discovering Genomic Structure with Learned Neural Architectures. arXiv:1605.07156 • Each human genome is a 3 billion base pair set of encoding instructions. Decoding the genome using deep learning fundamentally differs from most tasks, as we do not know the full structure of the data and therefore cannot design architectures to suit it. As such, architectures that fit the structure of genomics should be learned not prescribed. Here, we develop a novel search algorithm, applicable across domains, that discovers an optimal architecture which simultaneously learns general genomic patterns and identifies the most important sequence motifs in predicting functional genomic outcomes. The architectures we find using this algorithm succeed at using only RNA expression data to predict gene regulatory structure, learn human-interpretable visualizations of key sequence motifs, and surpass state-of-the-art results on benchmark genomics challenges. • Ghosh S (2016) Designing Domain Specific Word Embeddings: Applications to Disease Surveillance. arXiv:1603.00106 | word2vec; Dis2Vec • We motivate a disease vocabulary driven word2vec model (Dis2Vec) which we use to model diseases and constituent attributes as word embeddings from the HealthMap news corpus. We use these word embeddings to create disease taxonomies and evaluate our model accuracy against human annotated taxonomies. We compare our accuracies against several state-of-the art word2vec methods. Our results demonstrate that Dis2Vec outperforms traditional distributed vector representations in its ability to faithfully capture disease attributes and accurately forecast outbreaks. • Guan MY [Geoffrey E. Hinton | Google Brain] (2017) Who Said What: Modeling Individual Labelers Improves Classification. arXiv:1703.08774... Here we show that our approach leads to improvements in computer-aided diagnosis of diabetic retinopathy. • Guo W (2017) DeepMetabolism: A Deep Learning System To Predict Phenotype From Genome Sequencing. arXiv:1705.03094 • Life science is entering a new era of petabyte-level sequencing data. Converting such big data to biological insights represents a huge challenge for computational analysis. To this end, we developed DeepMetabolism, a biology-guided deep learning system to predict cell phenotypes from transcriptomics data. By integrating unsupervised pre-training with supervised training, DeepMetabolism is able to predict phenotypes with high accuracy (PCC > 0.92), high speed (<30 min for >100 GB data using a single GPU), and high robustness (tolerate up to 75% noise). We envision DeepMetabolism to bridge the gap between genotype and phenotype and to serve as a springboard for applications in synthetic biology and precision medicine. 
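• Aside: the "unsupervised pre-training + supervised training" recipe in the DeepMetabolism entry above is a common pattern for data where unlabeled samples are plentiful and labels are scarce. A minimal PyTorch sketch of that recipe (toy layer sizes and random stand-in data, not the authors' architecture):

```python
import torch
import torch.nn as nn

n_genes, n_hidden, n_phenotypes = 2000, 128, 5   # toy sizes

encoder = nn.Sequential(nn.Linear(n_genes, n_hidden), nn.ReLU())
decoder = nn.Linear(n_hidden, n_genes)           # used only during pre-training
head = nn.Linear(n_hidden, n_phenotypes)         # supervised prediction head

# Stand-ins for real data: many unlabeled profiles, few labeled ones.
x_unlabeled = torch.randn(512, n_genes)
x_labeled = torch.randn(64, n_genes)
y_labeled = torch.randn(64, n_phenotypes)

# Stage 1: unsupervised pre-training (autoencoder reconstruction loss).
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(decoder(encoder(x_unlabeled)), x_unlabeled)
    loss.backward()
    opt.step()

# Stage 2: supervised fine-tuning of the pre-trained encoder plus a new head.
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()), lr=1e-4)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(head(encoder(x_labeled)), y_labeled)
    loss.backward()
    opt.step()
```

The pre-training stage only consumes unlabeled profiles, so the scarce labeled examples are spent on fine-tuning rather than on learning a representation from scratch.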
• Havaei M [Courville A; Bengio Y] (2015) Brain Tumor Segmentation with Deep Neural Networks. arXiv:1505.03540 • In this paper, we present a fully automatic brain tumor segmentation method based on Deep Neural Networks (DNNs). The proposed networks are tailored to glioblastomas (both low and high grade) pictured in MR images. By their very nature, these tumors can appear anywhere in the brain and have almost any kind of shape, size, and contrast. These reasons motivate our exploration of a machine learning solution that exploits a flexible, high capacity DNN while being extremely efficient. Here, we give a description of different model choices that we've found to be necessary for obtaining competitive performance. We explore in particular different architectures based on Convolutional Neural Networks (CNN), i.e. DNNs specifically adapted to image data. We present a novel CNN architecture which differs from those traditionally used in computer vision. Our CNN exploits both local features as well as more global contextual features simultaneously. Also, different from most traditional uses of CNNs, our networks use a final layer that is a convolutional implementation of a fully connected layer which allows a 40 fold speed up. We also describe a 2-phase training procedure that allows us to tackle difficulties related to the imbalance of tumor labels. Finally, we explore a cascade architecture in which the output of a basic CNN is treated as an additional source of information for a subsequent CNN. Results reported on the 2013 BRATS test dataset reveal that our architecture improves over the currently published state-of-the-art while being over 30 times faster. • Hosseini-Asl E (2016) Alzheimer's Disease Diagnostics by Adaptation of 3D Convolutional Network. arXiv:1607.00455 • Early diagnosis, playing an important role in preventing progress and treating the Alzheimer's disease (AD), is based on classification of features extracted from brain images. The features have to accurately capture main AD-related variations of anatomical brain structures, such as, e.g., ventricles size, hippocampus shape, cortical thickness, and brain volume. This paper proposed to predict the AD with a deep 3D convolutional neural network (3D-CNN), which can learn generic features capturing AD biomarkers and adapt to different domain datasets. The 3D-CNN is built upon a 3D convolutional autoencoder, which is pre-trained to capture anatomical shape variations in structural brain MRI scans. Fully connected upper layers of the 3D-CNN are then fine-tuned for each task-specific AD classification. Experiments on the CADDementia MRI dataset with no skull-stripping preprocessing have shown our 3D-CNN outperforms several conventional classifiers by accuracy. Abilities of the 3D-CNN to generalize the features learnt and adapt to other domains have been validated on the ADNI dataset. • Hua L (2016) A Shortest Dependency Path Based Convolutional Neural Network for Protein-Protein Relation Extraction. journal | pdf • The state-of-the-art methods for protein-protein interaction (PPI) extraction are primarily based on kernel methods, and their performances strongly depend on the handcraft features. In this paper, we tackle PPI extraction by using convolutional neural networks (CNN) and propose a shortest dependency path based CNN (sdpCNN) model. The proposed method only takes the sdp and word embedding as input and could avoid bias from feature selection by using CNN. 
We performed experiments on standard Aimed and BioInfer datasets, and the experimental results demonstrated that our approach outperformed state-of-the-art kernel based methods. In particular, by tracking the sdpCNN model, we find that sdpCNN could extract key features automatically and it is verified that pretrained word embedding is crucial in the PPI task. • Kadurin A (2017) The cornucopia of meaningful leads: Applying deep adversarial autoencoders for new molecule development in oncology. pdf [journal] | reddit • Recent advances in deep learning and specifically in generative adversarial networks have demonstrated surprising results in generating new images and videos upon request even using natural language as input. In this paper we present the first application of generative adversarial autoencoders (AAE) for generating novel molecular fingerprints with a defined set of parameters. We developed a 7-layer AAE architecture with the latent middle layer serving as a discriminator. As an input and output the AAE uses a vector of binary fingerprints and concentration of the molecule. In the latent layer we also introduced a neuron responsible for growth inhibition percentage, which when negative indicates the reduction in the number of tumor cells after the treatment. To train the AAE we used the NCI-60 cell line assay data for 6252 compounds profiled on the MCF-7 cell line. The output of the AAE was used to screen 72 million compounds in PubChem and select candidate molecules with potential anti-cancer properties. This approach is a proof of concept of an artificially-intelligent drug discovery engine, where AAEs are used to generate new molecular fingerprints with the desired molecular properties. • Kang J (2017) Improving drug discovery with high-content phenotypic screens by systematic selection of reporter cell lines. pdf | How artificial intelligence techniques are aiding the hunt for new drugs [Phys.org: Jan 2017; local copy] • High-content, image-based screens enable the identification of compounds that induce cellular responses similar to those of known drugs but through different chemical structures or targets. A central challenge in designing phenotypic screens is choosing suitable imaging biomarkers. Here we present a method for systematically identifying optimal reporter cell lines for annotating compound libraries (ORACLs), whose phenotypic profiles most accurately classify a training set of known drugs. We generate a library of fluorescently tagged reporter cell lines, and let analytical criteria determine which among them (the ORACL) best classifies compounds into multiple, diverse drug classes. We demonstrate that an ORACL can functionally annotate large compound libraries across diverse drug classes in a single-pass screen and confirm high prediction accuracy by means of orthogonal, secondary validation assays. Our approach will increase the efficiency, scale and accuracy of phenotypic screens by maximizing their discriminatory power. • Leale G (2016) Inferring unknown biological function by integration of GO annotations and gene expression data. arXiv:1608.03672 • Characterizing genes with semantic information is an important process regarding the description of gene products. Although the complete genomes of many organisms have already been sequenced, the biological functions of all of their genes are still unknown. Since experimentally studying the functions of those genes, one by one, would be unfeasible, new computational methods for gene functions inference are needed.
We present here a novel computational approach for inferring biological function for a set of genes with previously unknown function, given a set of genes with well-known information. This approach is based on the premise that genes with similar behaviour should be grouped together. This is known as the guilt-by-association principle. Thus, it is possible to take advantage of clustering techniques to obtain groups of unknown genes that are co-clustered with genes that have well-known semantic information (GO annotations). Meaningful knowledge to infer unknown semantic information can therefore be provided by these well-known genes. We provide a method to explore the potential function of new genes according to those currently annotated. The results obtained indicate that the proposed approach could be a useful and effective tool when used by biologists to guide the inference of biological functions for recently discovered genes. Our work sets an important landmark in the field of identifying unknown gene functions through clustering, using an external source of biological input. A simple web interface to this proposal can be found at gamma-AM. • Lee B (2016) deepTarget: End-to-end Learning Framework for microRNA Target Prediction using Deep Recurrent Neural Networks. arXiv:1603.09123 • MicroRNAs (miRNAs) are short sequences of ribonucleic acids that control the expression of target messenger RNAs (mRNAs) by binding them. Robust prediction of miRNA-mRNA pairs is of utmost importance in deciphering gene regulations but has been challenging because of high false positive rates, despite a deluge of computational tools that normally require laborious manual feature extraction. This paper presents an end-to-end machine learning framework for miRNA target prediction. Leveraged by deep recurrent neural networks-based auto-encoding and sequence-sequence interaction learning, our approach not only delivers an unprecedented level of accuracy but also eliminates the need for manual feature extraction. The performance gap between the proposed method and existing alternatives is substantial (over 25% increase in F-measure), and deepTarget delivers a quantum leap in the long-standing challenge of robust miRNA target prediction. • Leimar O (2017) Genes as cues of relatedness and social evolution in heterogeneous environments. pdf | Scientists uncover route for finding out what makes individuals nice or nasty [ScienceDaily.com] Abstract. There are many situations where relatives interact while at the same time there is genetic polymorphism in traits influencing survival and reproduction. Examples include cheater-cooperator polymorphism and polymorphic microbial pathogens. Environmental heterogeneity, favoring different traits in nearby habitats, with dispersal between them, is one general reason to expect polymorphism. Currently, there is no formal framework of social evolution that encompasses genetic polymorphism. We develop such a framework, thus integrating theories of social evolution into the evolutionary ecology of heterogeneous environments. We allow for adaptively maintained genetic polymorphism by applying the concept of genetic cues. We analyze a model of social evolution in a two-habitat situation with limited dispersal between habitats, in which the average relatedness at the time of helping and other benefits of helping can differ between habitats. 
An important result from the analysis is that alleles at a polymorphic locus play the role of genetic cues, in the sense that the presence of a cue allele contains statistical information for an organism about its current environment, including information about relatedness. We show that epistatic modifiers of the cue polymorphism can evolve to make optimal use of the information in the genetic cue, in analogy with a Bayesian decision maker. Another important result is that the genetic linkage between a cue locus and modifier loci influences the evolutionary interest of modifiers, with tighter linkage leading to greater divergence between social traits induced by different cue alleles, and this can be understood in terms of genetic conflict. Author Summary. The theory of kin selection explains the evolution of helping when relatives interact. It can be used when individuals in a social group have different sexes, ages or phenotypic qualities, but the theory has not been worked out for situations where there is genetic polymorphism in helping. That kind of polymorphism, for instance cheater-cooperator polymorphism in microbes, has attracted much interest. We include these phenomena into a general framework of social evolution. Our framework is built on the idea of genetic cues, which means that an individual uses its genotype at a polymorphic locus as a statistical predictor of the current social conditions, including the expected relatedness in a social group. We allow for multilocus determination of the phenotype, in the form of modifiers of the effects of the alleles at a cue locus, and we find that there can be genetic conflicts between modifier loci that are tightly linked versus unlinked to the cue locus. • See also (below): Rutledge RB (2016) The social contingency of momentary subjective well-being. pdf | New equation reveals how other people's fortunes affect our happiness • Liu S (2016) Makeup like a superstar: Deep Localized Makeup Transfer Network. arXiv:1604.07102 • In this paper, we propose a novel Deep Localized Makeup Transfer Network to automatically recommend the most suitable makeup for a female and synthesize the makeup on her face. Given a before-makeup face, her most suitable makeup is determined automatically. Then, both the before-makeup and the reference faces are fed into the proposed Deep Transfer Network to generate the after-makeup face. Our end-to-end makeup transfer network has several nice properties: (1) complete functionality: including foundation, lip gloss, and eye shadow transfer; (2) cosmetic specific: different cosmetics are transferred in different manners; (3) localized: different cosmetics are applied on different facial regions; (4) producing natural-looking results without obvious artifacts; (5) controllable makeup lightness: various results from light makeup to heavy makeup can be generated. Qualitative and quantitative experiments show that our network performs much better than the methods of [Guo and Sim, 2009] and two variants of NeuralStyle [Gatys et al., 2015a]. • Lobo D (2015) Inferring regulatory networks from experimental morphological phenotypes: A computational method reverse-engineers planarian regeneration.
| pdf | journal:PLoS Comp Biol | news release: Planarian Regeneration Model Discovered by Artificial Intelligence • Transformative applications in biomedicine require the discovery of complex regulatory networks that explain the development and regeneration of anatomical structures, and reveal what external signals will trigger desired changes of large-scale pattern. Despite recent advances in bioinformatics, extracting mechanistic pathway models from experimental morphological data is a key open challenge that has resisted automation. The fundamental difficulty of manually predicting emergent behavior of even simple networks has limited the models invented by human scientists to pathway diagrams that show necessary subunit interactions but do not reveal the dynamics that are sufficient for complex, self-regulating pattern to emerge. To finally bridge the gap between high-resolution genetic data and the ability to understand and control patterning, it is critical to develop computational tools to efficiently extract regulatory pathways from the resultant experimental shape phenotypes. For example, planarian regeneration has been studied for over a century, but despite increasing insight into the pathways that control its stem cells, no constructive, mechanistic model has yet been found by human scientists that explains more than one or two key features of its remarkable ability to regenerate its correct anatomical pattern after drastic perturbations. We present a method to infer the molecular products, topology, and spatial and temporal non-linear dynamics of regulatory networks recapitulating in silico the rich dataset of morphological phenotypes resulting from genetic, surgical, and pharmacological experiments. We demonstrated our approach by inferring complete regulatory networks explaining the outcomes of the main functional regeneration experiments in the planarian literature; By analyzing all the datasets together, our system inferred the first systems-biology comprehensive dynamical model explaining patterning in planarian regeneration. This method provides an automated, highly generalizable framework for identifying the underlying control mechanisms responsible for the dynamic regulation of growth and form. • Lobo D (2017) Discovering novel phenotypes with automatically inferred dynamic models: a partial melanocyte conversion in Xenopus. journal: paper | pdf | blog: PhysOrg [Mar 2017] • Progress in regenerative medicine requires reverse-engineering cellular control networks to infer perturbations with desired systems-level outcomes. Such dynamic models allow phenotypic predictions for novel perturbations to be rapidly assessed in silico. Here, we analyzed a Xenopus model of conversion of melanocytes to a metastatic-like phenotype only previously observed in an all-or-none manner. Prior in vivo genetic and pharmacological experiments showed that individual animals either fully convert or remain normal, at some characteristic frequency after a given perturbation. We developed a Machine Learning method which inferred a model explaining this complex, stochastic all-or-none dataset. We then used this model to ask how a new phenotype could be generated: animals in which only some of the melanocytes converted. Systematically performing in silico perturbations, the model predicted that a combination of altanserin (5HTR2 inhibitor), reserpine (VMAT inhibitor), and VP16-XlCreb1 (constitutively active CREB) would break the all-or-none concordance. 
Remarkably, applying the predicted combination of three reagents in vivo revealed precisely the expected novel outcome, resulting in partial conversion of melanocytes within individuals. This work demonstrates the capability of automated analysis of dynamic models of signaling networks to discover novel phenotypes and predictively identify specific manipulations that can reach them. • Lupolova N (2016) Support vector machine applied to predict the zoonotic potential of E. coli O157 cattle isolates. pdf | journal:PNAS | Computers learn to spot deadly bacteria [Phys.org] • Sequence analyses of pathogen genomes facilitate the tracking of disease outbreaks and allow relationships between strains to be reconstructed and virulence factors to be identified. However, these methods are generally used after an outbreak has happened. Here, we show that support vector machine analysis of bovine E. coli O157 isolate sequences can be applied to predict their zoonotic potential, identifying cattle strains more likely to be a serious threat to human health. Notably, only a minor subset (less than 10%) of bovine E. coli O157 isolates analyzed in our datasets were predicted to have the potential to cause human disease; this is despite the fact that the majority are within previously defined pathogenic lineages I or I/II and encode key virulence factors. The predictive capacity was retained when tested across datasets. The major differences between human and bovine E. coli O157 isolates were due to the relative abundances of hundreds of predicted prophage proteins. This finding has profound implications for public health management of disease because interventions in cattle, such a vaccination, can be targeted at herds carrying strains of high zoonotic potential. Machine-learning approaches should be applied broadly to further our understanding of pathogen biology. • Milletari F (2016) V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. arXiv:1606.04797 | reddit • Convolutional Neural Networks (CNNs) have been recently employed to solve problems from both the computer vision and medical image analysis fields. Despite their popularity, most approaches are only able to process 2D images while most medical data used in clinical practice consists of 3D volumes. In this work we propose an approach to 3D image segmentation based on a volumetric, fully convolutional, neural network. Our CNN is trained end-to-end on MRI volumes depicting prostate, and learns to predict segmentation for the whole volume at once. We introduce a novel objective function, that we optimise during training, based on Dice coefficient. In this way we can deal with situations where there is a strong imbalance between the number of foreground and background voxels. To cope with the limited number of annotated volumes available for training, we augment the data applying random non-linear transformations and histogram matching. We show in our experimental evaluation that our approach achieves good performances on challenging test data while requiring only a fraction of the processing time needed by other previous methods. • Putin E (2017) Deep biomarkers of human aging: Application of deep neural networks to biomarker development. [pdf] • One of the major impediments in human aging research is the absence of a comprehensive and actionable set of biomarkers that may be targeted and measured to track the effectiveness of therapeutic interventions. 
In this study, we designed a modular ensemble of 21 deep neural networks (DNNs) of varying depth, structure and optimization to predict human chronological age using a basic blood test. To train the DNNs, we used over 60,000 samples from common blood biochemistry and cell count tests from routine health exams performed by a single laboratory and linked to chronological age and sex. The best performing DNN in the ensemble demonstrated 81.5 % epsilon-accuracy r = 0.90 with R2 = 0.80 and MAE = 6.07 years in predicting chronological age within a 10 year frame, while the entire ensemble achieved 83.5% epsilon-accuracy r = 0.91 with R2 = 0.82 and MAE = 5.55 years. The ensemble also identified the 5 most important markers for predicting human chronological age: albumin, glucose, alkaline phosphatase, urea and erythrocytes. To allow for public testing and evaluate real-life performance of the predictor, we developed an online system available at http://aging.ai. The ensemble approach may facilitate integration of multi-modal data linked to chronological age and sex that may lead to simple, minimally invasive, and affordable methods of tracking integrated biomarkers of aging in humans and performing cross-species feature importance analysis. • Romero A (2016) [Yoshua Bengio] Diet Networks: Thin Parameters for Fat Genomics. pdf | arXiv:1611.09340 | reddit • Learning tasks such as those involving genomic data often poses a serious challenge: the number of input features can be orders of magnitude larger than the number of training examples, making it difficult to avoid overfitting, even when using the known regularization techniques. We focus here on tasks in which the input is a description of the genetic variation specific to a patient, the single nucleotide polymorphisms (SNPs), yielding millions of ternary inputs. Improving the ability of deep learning to handle such datasets could have an important impact in medical research, more specifically in precision medicine, where high-dimensional data regarding a particular patient is used to make predictions of interest. Even though the amount of data for such tasks is increasing, this mismatch between the number of examples and the number of inputs remains a concern. Naive implementations of classifier neural networks involve a huge number of free parameters in their first layer (number of input features times number of hidden units): each input feature is associated with as many parameters as there are hidden units. We propose a novel neural network parametrization which considerably reduces the number of free parameters. It is based on the idea that we can first learn or provide a distributed representation for each input feature (e.g. for each position in the genome where variations are observed in data), and then learn (with another neural network called the parameter prediction network) how to map a feature's distributed representation (based on the feature's identity not its value) to the vector of parameters specific to that feature in the classifier neural network (the weights which link the value of the feature to each of the hidden units). This approach views the problem of producing the parameters associated with each feature as a multi-task learning problem. We show experimentally on a population stratification task of interest to medical studies that the proposed approach can significantly reduce both the number of parameters and the error rate of the classifier. 
TL;DR: Drastically reducing the number of parameters, when the number of input features is orders of magnitude larger than the number of training examples, such as in genomics. • Conclusion. In this paper, we proposed Diet Networks, a novel network parametrization which considerably reduces the number of free parameters when the input is very high dimensional. We showed how using the parameter prediction networks yielded better generalization in terms of misclassification error. Notably, when using pre-computed feature embeddings that maximally reduce the number of free parameters, we were able to obtain our best results. We validated our approach on the publicly available 1000 genomes dataset, addressing the relevant task of ancestry prediction based on SNP data. This work demonstrated the potential of deep learning models to tackle domain-specific tasks, where there is a mismatch between the number of samples and their high dimensionality. Given the high accuracy achieved in the ancestry prediction task, we believe that deep learning techniques can improve standard practices in the analysis of human polymorphism data. We expect that these techniques will allow us to tackle the more challenging problem of conducting genetic association studies. Hence, we expect to further develop our method to conduct population-aware analyses of SNP data in disease cohorts. The increased power of deep learning methods to identify the genetic basis of common diseases could lead to better patient risk prediction and will improve our overall understanding of disease etiology. • Ronneberger O (2015) U-net: Convolutional networks for biomedical image segmentation. arXiv:1505.04597 | GitHub | GitXiv • There is broad consensus that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available here. • Rutledge RB (2016) The social contingency of momentary subjective well-being. pdf • Although social comparison is a known determinant of overall life satisfaction, it is not clear how it affects moment-to-moment variation in subjective emotional state. Using a novel social decision task combined with computational modelling, we show that a participant's subjective emotional state reflects not only the impact of rewards they themselves receive, but also the rewards received by a social partner. Unequal outcomes, whether advantageous or disadvantageous, reduce average momentary happiness. Furthermore, the relative impacts of advantageous and disadvantageous inequality on momentary happiness at the individual level predict a subject's generosity in a separate dictator game.
These findings demonstrate a powerful social influence upon subjective emotional state, where emotional reactivity to inequality is strongly predictive of altruism in an independent task domain.

• Related blog post: New equation reveals how other people's fortunes affect our happiness [ScienceDaily.com]: A new equation, showing how our happiness depends not only on what happens to us but also how this compares to other people, has been developed by researchers. The team developed an equation to predict happiness in 2014, highlighting the importance of expectations, and the new updated equation also takes into account other people's fortunes.

• See also (above): Leimar O (2016) Genes as cues of relatedness and social evolution in heterogeneous environments. pdf | Scientists uncover route for finding out what makes individuals nice or nasty [ScienceDaily.com]

• Santerre JW (2016) Machine Learning for Antimicrobial Resistance. arXiv:1607.01224

• Biological datasets amenable to applied machine learning are more available today than ever before, yet they lack adequate representation in the Data-for-Good community. Here we present a work-in-progress case study performing analysis on antimicrobial resistance (AMR) using standard ensemble machine learning techniques and note the successes and pitfalls such work entails. Broadly, applied machine learning (AML) techniques are well suited to AMR, with classification accuracies ranging from mid-90% to low-80% depending on sample size. Additionally, these techniques prove successful at identifying gene regions known to be associated with the AMR phenotype. We believe that the extensive amount of biological data available, the plethora of problems presented, and the global impact of such work merits the consideration of the Data-for-Good community.

• Sarraf S (2016) Classification of Alzheimer's Disease Structural MRI Data by Deep Learning Convolutional Neural Networks. arXiv:1607.06583

• Recently, machine learning techniques, especially predictive modeling and pattern recognition in biomedical sciences from drug delivery systems to medical imaging, have become one of the important methods assisting researchers to gain a deeper understanding of entire issues and to solve complex medical problems. Deep learning is a powerful machine learning algorithm in classification while extracting low- to high-level features. In this paper, we used a convolutional neural network to classify Alzheimer's brains from normal healthy brains. The importance of classifying this kind of medical data is to potentially develop a predictive model or system in order to recognize the type of disease from normal subjects or to estimate the stage of the disease. Classification of clinical data such as Alzheimer's disease has always been challenging, and the most problematic part has always been selecting the most discriminative features. Using a Convolutional Neural Network (CNN) and the famous LeNet-5 architecture, we successfully classified structural MRI data of Alzheimer's subjects from normal controls, where the accuracy of test data on trained data reached 98.84%. This experiment suggests that the shift- and scale-invariant features extracted by CNN followed by deep learning classification is the most powerful method to distinguish clinical data from healthy data in fMRI. This approach also enables us to expand our methodology to predict more complicated systems.

• Spranger M (ACL-IJCNLP 2016) Extracting Biological Pathway Models From NLP Event Representations. arXiv:1608.03764
• This paper describes an open-source software system for the automatic conversion of NLP event representations to system biology structured data interchange formats such as SBML and BioPAX. It is part of a larger effort to make results of the NLP community available for system biology pathway modelers.

• Spranger M (ACL 2016) Measuring the State of the Art of Automated Pathway Curation Using Graph Algorithms - A Case Study of the mTOR Pathway. arXiv:1608.03767

• This paper evaluates the difference between human pathway curation and current NLP systems. We propose graph analysis methods for quantifying the gap between human curated pathway maps and the output of state-of-the-art automatic NLP systems. Evaluation is performed on the popular mTOR pathway. Based on analyzing where current systems perform well and where they fail, we identify possible avenues for progress.

• Wan C (2016) A New Hierarchical Redundancy Eliminated Tree Augmented Naive Bayes Classifier for Coping with Gene Ontology-based Features. arXiv:1607.01690

• The Tree Augmented Naive Bayes classifier is a type of probabilistic graphical model that can represent some feature dependencies. In this work, we propose a Hierarchical Redundancy Eliminated Tree Augmented Naive Bayes (HRE-TAN) algorithm, which considers removing the hierarchical redundancy during the classifier learning process, when coping with data containing hierarchically structured features. The experiments showed that HRE-TAN obtains significantly better predictive performance than the conventional Tree Augmented Naive Bayes classifier, and enhanced the robustness against imbalanced class distributions, in aging-related gene datasets with Gene Ontology terms used as features.

• Wang S (2016) Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model. arXiv:1609.00680

• Protein contact prediction from sequence is an important problem. Recently exciting progress has been made, but the predicted contacts for proteins without many sequence homologs is still of low quality and not extremely useful for de novo structure prediction. This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual neural networks. This deep neural network allows us to model very complex relationships between sequence and contact map as well as long-range interdependency between contacts. Our method greatly outperforms existing contact prediction methods and leads to much more accurate contact-assisted protein folding. Tested on three datasets of 579 proteins, the average top L long-range prediction accuracy obtained by our method, the representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints can yield correct folds (i.e., TMscore>0.6) for 203 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 proteins, respectively. Further, our contact-assisted models have much better quality than template-based models. Using our predicted contacts as restraints, we can (ab initio) fold 208 of the 398 membrane proteins with TMscore>0.5.
By contrast, when the training proteins of our method are used as templates, homology modeling can only do so for 10 of them. One interesting finding is that even if we do not train our prediction models with any membrane proteins, our method still works well on membrane protein prediction.

• Availability: ContactMap

• Werner E (2016) Stem Cell Networks. arXiv:1607.04502  [relevant, e.g., to cancer modeling ...]

• We present a general computational theory of stem cell networks and their developmental dynamics. Stem cell networks are special cases of developmental control networks. Our theory generates a natural classification of all possible stem cell networks based on their network architecture. Each stem cell network has a unique topology and semantics and developmental dynamics that result in distinct phenotypes. We show that the ideal growth dynamics of multicellular systems generated by stem cell networks have mathematical properties related to the coefficients of Pascal's Triangle. The relationship to cancer stem cells and their control networks is indicated. The theory lays the foundation for a new research paradigm for understanding and investigating stem cells. The theory of stem cell networks implies that new methods for generating and controlling stem cells will become possible.

[Applied ML] Business:

• 6 Ways Companies Can Leverage Machine Learning Algorithms:

• Price Optimization
• Improving Customer Engagement and Maximizing Profits
• Launching Targeted Promotions
• Predicting Equipment Failure
• Detecting and Preventing Fraud
• Streamlining Talent Acquisition

• Rönnqvist S (2016) Bank distress in the news: Describing events through deep learning. arXiv:1603.05670

• While many models are purposed for detecting the occurrence of significant events in financial systems, the task of providing qualitative detail on the developments is not usually as well automated. We present a deep learning approach for detecting relevant discussion in text and extracting natural language descriptions of events. Supervised by only a small set of event information, comprising entity names and dates, the model is leveraged by unsupervised learning of semantic vector representations on extensive text data. We demonstrate applicability to the study of financial risk based on news (6.6M articles), particularly bank distress and government interventions (243 events), where indices can signal the level of bank-stress-related reporting at the entity level, or aggregated at national or European level, while being coupled with explanations. Thus, we exemplify how text, as timely, widely available and descriptive data, can serve as a useful complementary source of information for financial and systemic risk analytics.

• How a Japanese cucumber farmer is using deep learning and TensorFlow | reddit

• Insurance: Saving 80% in 90 Seconds? When Tech Makes Insurance 5x Cheaper.

• A report by IBNR Weekly, one of the most respected publications in insurance, alarmed many industry-insiders: "In an attempt to better understand Lemonade's "killer" pricing, we "applied" for renters insurance through the Lemonade and Bungalow websites... the pricing was dramatically different as Bungalow's annual price was ~5.6x Lemonade." (IBNR Weekly, September 29, 2016) Incumbents find the idea of a 560% price gap unsettling. Understandably. Beyond self-preservation jitters, some raised concerns about Lemonade's 'killer prices' looking a lot like 'suicidal prices.'
After all, they reasoned, insurance companies pay out in claims over 40% of the fees they collect. So if Lemonade charges 80% less (same as saying others charge 5x more), Lemonade will be paying out in claims more than it receives in premiums! Lemonade must be recklessly naïve or worse, they surmised, and insolvency just a matter of time. I get it. That's why I'm writing this post. [ ... snip ... ]

• ... So how can Lemonade be 81% cheaper? You guessed it: by building an insurance company, from the ground up, powered by A.I. and behavioral economics. Not brokers and bureaucracy. It's not just a marketing slogan: our acquisition costs are already 10x lower than legacy carriers. This allows us to do away with punitive minimum premiums, and allows renters to save a fortune on insurance. Simple. 80% cheaper is achievable without being reckless or naïve. Indeed it has so been achieved. It is available 24/7, courtesy of a delightful bot who will fashion you a $5 policy in 90 seconds and with zero paperwork. Mystery solved.

[Applied ML] Clinical:

[Applied ML - Clinical] BLOGS:

• Can ML say whether or not someone is ill?

• Computers trounce pathologists in predicting lung cancer type, severity: [Aug 2016 | Michael Snyder] Automating the analysis of slides of lung cancer tissue samples increases the accuracy of tumor classification and patient prognoses, according to a new study. ...

• Weight loss: GitHub  |  Discussion: reddit  |  Hacker News

• Patel TA (2016) Correlating mammographic and pathologic findings in clinical decision support using natural language processing and data mining methods.  pdf.

"Researchers have developed an artificial intelligence (AI) software that reliably interprets mammograms, assisting doctors with a quick and accurate prediction of breast cancer risk. The AI computer software intuitively translates patient charts into diagnostic information at 30 times human speed and with 99 percent accuracy."

• BACKGROUND: A key challenge to mining electronic health records for mammography research is the preponderance of unstructured narrative text, which strikingly limits usable output. The imaging characteristics of breast cancer subtypes have been described previously, but without standardization of parameters for data mining.

METHODS: The authors searched the enterprise-wide data warehouse at the Houston Methodist Hospital, the Methodist Environment for Translational Enhancement and Outcomes Research (METEOR), for patients with Breast Imaging Reporting and Data System (BI-RADS) category 5 mammogram readings performed between January 2006 and May 2015 and an available pathology report. The authors developed natural language processing (NLP) software algorithms to automatically extract mammographic and pathologic findings from free text mammogram and pathology reports. The correlation between mammographic imaging features and breast cancer subtype was analyzed using one-way analysis of variance and the Fisher exact test.

RESULTS: The NLP algorithm was able to obtain key characteristics for 543 patients who met the inclusion criteria. Patients with estrogen receptor-positive tumors were more likely to have spiculated margins (P = 0.0008), and those with tumors that overexpressed human epidermal growth factor receptor 2 (HER2) were more likely to have heterogeneous and pleomorphic calcifications (P = 0.0078 and P = 0.0002, respectively).

CONCLUSIONS: Mammographic imaging characteristics, obtained from an automated text search and the extraction of mammogram reports using NLP techniques, correlated with pathologic breast cancer subtype. The results of the current study validate previously reported trends assessed by manual data collection. Furthermore, NLP provides an automated means with which to scale up data extraction and analysis for clinical decision support.

[Applied ML - Clinical] COMPETITIONS:

• 2nd place solution for the 2017 national datascience bowl

• This document describes my part of the 2nd prize solution to the Data Science Bowl 2017 hosted by Kaggle.com. I teamed up with Daniel Hammack. His part of the solution is described here. The goal of the challenge was to predict the development of lung cancer in a patient given a set of CT images. Detailed descriptions of the challenge can be found on the Kaggle competition page and this blog post by Elias Vansteenkiste. My solution (and that of Daniel) was mainly based on nodule detectors with a 3D convolutional neural network architecture. I worked on a Windows 64 system using the Keras library in combination with the just-released Windows version of TensorFlow.

[ ... SNIP! ... ]

• 3D convnets for lung cancer estimation (2nd place national datascience bowl)   [reddit: May 2017]:

• [u/dhammack:]  If you are interested in my half of the solution, a technical writeup is available here: here  [pdf]   [local copy  (pdf)].  Also, the code I wrote is in that repository:  |  GitHub  |  GitHub.io

[Applied ML - Clinical] PAPERS:

• Aczon M (2017) Dynamic Mortality Risk Predictions in Pediatric Critical Care Using Recurrent Neural Networks. arXiv:1701.06675  |  In Hospital ICUs, AI Could Predict Which Patients Are Likely to Die [IEEE Spectrum: Mar 2017]

• Viewing the trajectory of a patient as a dynamical system, a recurrent neural network was developed to learn the course of patient encounters in the Pediatric Intensive Care Unit (PICU) of a major tertiary care center. Data extracted from Electronic Medical Records (EMR) of about 12000 patients who were admitted to the PICU over a period of more than 10 years were leveraged. The RNN model ingests a sequence of measurements which include physiologic observations, laboratory results, administered drugs and interventions, and generates temporally dynamic predictions for in-ICU mortality at user-specified times. The RNN's ICU mortality predictions offer significant improvements over those from two clinically-used scores and static machine learning algorithms.

• Chen R (2016) Identifying Metastases in Sentinel Lymph Nodes with Deep Convolutional Neural Networks. arXiv:1608.01658 reddit

• Metastatic presence in lymph nodes is one of the most important prognostic variables of breast cancer. The current diagnostic procedure for manually reviewing sentinel lymph nodes, however, is very time-consuming and subjective. Pathologists have to manually scan an entire digital whole-slide image (WSI) for regions of metastasis that are sometimes only detectable under high resolution or entirely hidden from the human visual cortex. From October 2015 to April 2016, the International Symposium on Biomedical Imaging (ISBI) held the Camelyon Grand Challenge 2016 to crowd-source ideas and algorithms for automatic detection of lymph node metastasis. Using a generalizable stain normalization technique and the Proscia Pathology Cloud computing platform, we trained a deep convolutional neural network on millions of tissue and tumor image tiles to perform slide-based evaluation on our testing set of whole-slide images, with a sensitivity of 0.96, specificity of 0.89, and AUC score of 0.90. Our results indicate that our platform can automatically scan any WSI for metastatic regions without institutional calibration to respective stain profiles.

• Choi E (2015) Doctor AI: Predicting Clinical Events via RNN. arXiv:1511.05942

• Leveraging large historical data in electronic health record (EHR), we developed Doctor AI, a generic predictive model that covers observed medical conditions and medication uses. Doctor AI is a temporal model using recurrent neural networks (RNN) and was developed and applied to longitudinal time stamped EHR data from 260K patients over 8 years. Encounter records (e.g. diagnosis codes, medication codes or procedure codes) were input to RNN to predict (all) the diagnosis and medication categories for a subsequent visit. Doctor AI assesses the history of patients to make multilabel predictions (one label for each diagnosis or medication category). Based on separate blind test set evaluation, Doctor AI can perform differential diagnosis with up to 79% recall@30, significantly higher than several baselines. Moreover, we demonstrate great generalizability of Doctor AI by adapting the resulting models from one institution to another without losing substantial accuracy.

• Dooling D (2016) Personalized Prognostic Models for Oncology: A Machine Learning Approach. arXiv:1606.07369

• We have applied a little-known data transformation to subsets of the Surveillance, Epidemiology, and End Results (SEER) publicly available data of the National Cancer Institute (NCI) to make it suitable input to standard machine learning classifiers. This transformation properly treats the right-censored data in the SEER data and the resulting Random Forest and Multi-Layer Perceptron models predict full survival curves. Treating the 6-, 12-, and 60-month points of the resulting survival curves as 3 binary classifiers, the 18 resulting classifiers have AUC values ranging from 0.765 to 0.885. Further evidence that the models have generalized well from the training data is provided by the extremely high levels of agreement between the random forest and neural network models' predictions on the 6-, 12-, and 60-month binary classifiers.

• Halpern Y [David Sontag] (2016) Clinical Tagging with Joint Probabilistic Models. arXiv:1608.00686

• We describe a method for parameter estimation in bipartite probabilistic graphical models for joint prediction of clinical conditions from the electronic medical record. The method does not rely on the availability of gold-standard labels, but rather uses noisy labels, called anchors, for learning. We provide a likelihood-based objective and a moments-based initialization that are effective at learning the model parameters. The learned model is evaluated in a task of assigning a heldout clinical condition to patients based on retrospective analysis of the records, and outperforms baselines which do not account for the noisiness in the labels or do not model the conditions jointly.

• Hristov BG, Singh M [Princeton University] (2017) Network-based coverage of mutational profiles reveals cancer genes. arXiv:1704.08544

• A central goal in cancer genomics is to identify the somatic alterations that underpin tumor initiation and progression. This task is challenging as the mutational profiles of cancer genomes exhibit vast heterogeneity, with many alterations observed within each individual, few shared somatically mutated genes across individuals, and important roles in cancer for both frequently and infrequently mutated genes. While commonly mutated cancer genes are readily identifiable, those that are rarely mutated across samples are difficult to distinguish from the large numbers of other infrequently mutated genes. Here, we introduce a method that considers per-individual mutational profiles within the context of protein-protein interaction networks in order to identify small connected subnetworks of genes that, while not individually frequently mutated, comprise pathways that are perturbed across (i.e., "cover") a large fraction of the individuals. We devise a simple yet intuitive objective function that balances identifying a small subset of genes with covering a large fraction of individuals. We show how to solve this problem optimally using integer linear programming and also give a fast heuristic algorithm that works well in practice. We perform a large-scale evaluation of our resulting method, nCOP, on 6,038 TCGA tumor samples across 24 different cancer types. We demonstrate that our approach nCOP is more effective in identifying cancer genes than both methods that do not utilize any network information as well as state-of-the-art network-based methods that aggregate mutational information across individuals. Overall, our work demonstrates the power of combining per-individual mutational information with interaction networks in order to uncover genes functionally relevant in cancers, and in particular those genes that are less frequently mutated.

• "... We also compare nCOP to HOTNET2 and Muffinn, two recent network-based approaches that aggregate mutational information. ..."

• Min W (2016) Network-regularized Sparse Logistic Regression Models for Clinical Risk Prediction and Biomarker Discovery. arXiv:1609.06480

• Molecular profiling data (e.g., gene expression) has been used for clinical risk prediction and biomarker discovery. However, it is necessary to integrate other prior knowledge like biological pathways or gene interaction networks to improve the predictive ability and biological interpretability of biomarkers. Here, we first introduce a general regularized Logistic Regression (LR) framework with regularized term $\lambda \|{w}\|_2 + \eta{w}^T{M}{w}$, which can reduce to different penalties, including Lasso, elastic net, and network-regularized terms with different ${M}$. This framework can be easily solved in a unified manner by a cyclic coordinate descent algorithm which can avoid inverse matrix operation and accelerate the computing speed. However, if those estimated ${w}_i$ and ${w}_j$ have opposite signs, then the traditional network-regularized penalty may not perform well. To address it, we introduce a novel network-regularized sparse LR model with a new penalty $\lambda \|{w}\|_1 + \eta|{w}|^T{M}|{w}|$ to consider the difference between the absolute values of the coefficients. And we develop two efficient algorithms to solve it. Finally, we test our methods and compare them with the related ones using simulated and real data to show their efficiency.
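
• Since the two penalties above are easy to mis-read, a small sketch may help. This is only my own illustration of evaluating the regularizers (variable names are made up); the paper's cyclic coordinate descent solver and the construction of ${M}$ from a pathway or interaction network are not shown:

```python
import numpy as np

def net_regularized_penalties(w, M, lam, eta):
    """Evaluate the two regularizers discussed above for a weight vector w
    and a symmetric network matrix M (e.g. a graph Laplacian).
    Illustrative only; not the authors' solver."""
    w = np.asarray(w, dtype=float)
    # Standard network-regularized penalty: lambda * ||w||_2 + eta * w^T M w
    p_l2_net = lam * np.linalg.norm(w, 2) + eta * (w @ M @ w)
    # Proposed sign-insensitive variant: lambda * ||w||_1 + eta * |w|^T M |w|
    a = np.abs(w)
    p_l1_abs_net = lam * np.linalg.norm(w, 1) + eta * (a @ M @ a)
    return p_l2_net, p_l1_abs_net

# Toy example: two networked features whose coefficients have opposite signs.
M = np.array([[1.0, -1.0], [-1.0, 1.0]])   # Laplacian of a single edge
w = np.array([0.8, -0.8])
print(net_regularized_penalties(w, M, lam=0.1, eta=1.0))
# The classical penalty heavily punishes the opposite signs (w^T M w = 2.56),
# whereas the |w| variant does not (|w|^T M |w| = 0), which is the point of the new penalty.
```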

• Natarajan S (2017) Markov logic networks for adverse drug event extraction from text. pdf

• Adverse drug events (ADEs) are a major concern and point of emphasis for the medical profession, government, and society. A diverse set of techniques from epidemiology, statistics, and computer science are being proposed and studied for ADE discovery from observational health data (e.g., EHR and claims data), social network data (e.g., Google and Twitter posts), and other information sources. Methodologies are needed for evaluating, quantitatively measuring, and comparing the ability of these various approaches to accurately discover ADEs. This work is motivated by the observation that text sources such as the Medline/Medinfo library provide a wealth of information on human health. Unfortunately, ADEs often result from unexpected interactions, and the connection between conditions and drugs is not explicit in these sources. Thus, in this work we address the question of whether we can quantitatively estimate relationships between drugs and conditions from the medical literature. This paper proposes and studies a state-of-the-art NLP-based extraction of ADEs from text.

Keywords: Natural Language Processing; Adverse Drug Event Extraction; Markov Logic Networks; Statistical Relational Learning

• Nguyen P (2016) Deepr: A Convolutional Net for Medical Records. arXiv:1607.07519

• Feature engineering remains a major bottleneck when creating predictive systems from electronic medical records. At present, an important missing element is detecting predictive regular clinical motifs from irregular episodic records. We present Deepr (short for Deep record), a new end-to-end deep learning system that learns to extract features from medical records and predicts future risk automatically. Deepr transforms a record into a sequence of discrete elements separated by coded time gaps and hospital transfers. On top of the sequence is a convolutional neural net that detects and combines predictive local clinical motifs to stratify the risk. Deepr permits transparent inspection and visualization of its inner working. We validate Deepr on hospital data to predict unplanned readmission after discharge. Deepr achieves superior accuracy compared to traditional techniques, detects meaningful clinical motifs, and uncovers the underlying structure of the disease and intervention space.

• Park HS (2017) Automated Detection of P. falciparum Using Machine Learning Algorithms with Quantitative Phase Images of Unstained Cells. PLoS ONE [pdf]  |  Phys.org

• Malaria detection through microscopic examination of stained blood smears is a diagnostic challenge that heavily relies on the expertise of trained microscopists. This paper presents an automated analysis method for detection and staging of red blood cells infected by the malaria parasite Plasmodium falciparum at trophozoite or schizont stage. Unlike previous efforts in this area, this study uses quantitative phase images of unstained cells. Erythrocytes are automatically segmented using thresholds of optical phase and refocused to enable quantitative comparison of phase images. Refocused images are analyzed to extract 23 morphological descriptors based on the phase information. While all individual descriptors are highly statistically different between infected and uninfected cells, each descriptor does not enable separation of populations at a level satisfactory for clinical utility. To improve the diagnostic capacity, we applied various machine learning techniques, including linear discriminant classification (LDC), logistic regression (LR), and k-nearest neighbor classification (NNC), to formulate algorithms that combine all of the calculated physical parameters to distinguish cells more effectively. Results show that LDC provides the highest accuracy of up to 99.7% in detecting schizont stage infected cells compared to uninfected RBCs. NNC showed slightly better accuracy (99.5%) than either LDC (99.0%) or LR (99.1%) for discriminating late trophozoites from uninfected RBCs. However, for early trophozoites, LDC produced the best accuracy of 98%. Discrimination of infection stage was less accurate, producing high specificity (99.8%) but only 45.0%-66.8% sensitivity with early trophozoites most often mistaken for late trophozoite or schizont stage and late trophozoite and schizont stage most often confused for each other. Overall, this methodology points to a significant clinical potential of using quantitative phase imaging to detect and stage malaria infection without staining or expert analysis.
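
• As a minimal sketch of the classifier-comparison step described above (LDC, LR, and k-NN over a table of per-cell morphological descriptors), using scikit-learn and synthetic data in place of the paper's 23 phase-image descriptors:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the per-cell feature table: 23 morphological descriptors,
# binary label (infected vs. uninfected). The real study extracts these descriptors
# from refocused quantitative phase images.
X, y = make_classification(n_samples=2000, n_features=23, n_informative=10, random_state=0)

classifiers = {
    "LDC (linear discriminant)": LinearDiscriminantAnalysis(),
    "LR  (logistic regression)": LogisticRegression(max_iter=1000),
    "NNC (k-nearest neighbors)": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```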

• Pham T (2016) [LSTM] DeepCare: Deep Dynamic Memory Model for Predictive Medicine. arXiv:1602.00357

• Razavian N [Sontag, D] (2015) Temporal CNN for Diagnosis from Lab Tests. arXiv:1511.07938

• Richfield O (2016) Learning Schizophrenia Imaging Genetics Data Via Multiple Kernel Canonical Correlation Analysis. arXiv:1609.04699

• Kernel and Multiple Kernel Canonical Correlation Analysis (CCA) are employed to classify schizophrenic and healthy patients based on their SNPs, DNA Methylation and fMRI data. Kernel and Multiple Kernel CCA are popular methods for finding nonlinear correlations between high-dimensional datasets. Data was gathered from 183 patients, 79 with schizophrenia and 104 healthy controls. Kernel and Multiple Kernel CCA represent new avenues for studying schizophrenia, because, to our knowledge, these methods have not been used on these data before. Classification is performed via k-means clustering on the kernel matrix outputs of the Kernel and Multiple Kernel CCA algorithm. Accuracies of the Kernel and Multiple Kernel CCA classification are compared to that of the regularized linear CCA algorithm classification, and are found to be significantly more accurate. Both algorithms demonstrate maximal accuracies when the combination of DNA methylation and fMRI data are used, and experience lower accuracies when the SNP data are incorporated.

• Shin HC (2015) Interleaved Text/Image Deep Mining on a Large-Scale Radiology Database for Automated Image Interpretation. arXiv:1505.00670. pdf:final published version [JMLR]

• Despite tremendous progress in computer vision, there has not been an attempt to apply machine learning on very large-scale medical image databases. We present an interleaved text/image deep learning system to extract and mine the semantic interactions of radiology images and reports from a national research hospital's Picture Archiving and Communication System. With natural language processing, we mine a collection of ~216K representative two-dimensional images selected by clinicians for diagnostic reference and match the images with their descriptions in an automated manner. We then employ a weakly supervised approach using all of our available data to build models for generating approximate interpretations of patient images. Finally, we demonstrate a more strictly supervised approach to detect the presence and absence of a number of frequent disease types, providing more specific interpretations of patient scans. A relatively small amount of data is used for this part, due to the challenge in gathering quality labels from large raw text data. Our work shows the feasibility of large-scale learning and prediction in electronic patient records available in most modern clinical institutions. It also demonstrates the trade-offs to consider in designing machine learning systems for analyzing large medical data.

• Xie H (2016) Comparison among dimensionality reduction techniques based on Random Projection for cancer classification. http://arxiv.org/abs/1608.07019

• Random Projection (RP) technique has been widely applied in many scenarios because it can reduce high-dimensional features into low-dimensional space within short time and meet the need of real-time analysis of massive data. There is an urgent need of dimensionality reduction with fast increase of big genomics data. However, the performance of RP is usually lower. We attempt to improve classification accuracy of RP through combining other reduction dimension methods such as Principle Component Analysis (PCA), Linear Discriminant Analysis (LDA), and Feature Selection (FS). We compared classification accuracy and running time of different combination methods on three microarray datasets and a simulation dataset. Experimental results show a remarkable improvement of 14.77% in classification accuracy of FS followed by RP compared to RP on BC-TCGA dataset. LDA followed by RP also helps RP to yield a more discriminative subspace with an increase of 13.65% on classification accuracy on the same dataset. FS followed by RP outperforms other combination methods in classification accuracy on most of the datasets.
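
• A minimal sketch of the kind of combination compared above (feature selection or dimensionality reduction followed by Random Projection), using scikit-learn on synthetic data; dataset names, dimensionalities and the exact pipeline details in the paper will differ:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.random_projection import GaussianRandomProjection

# Synthetic stand-in for a high-dimensional expression matrix (few samples, many features).
X, y = make_classification(n_samples=300, n_features=5000, n_informative=50, random_state=0)

pipelines = {
    "RP only":   make_pipeline(GaussianRandomProjection(n_components=100, random_state=0),
                               LogisticRegression(max_iter=2000)),
    "FS -> RP":  make_pipeline(SelectKBest(f_classif, k=1000),
                               GaussianRandomProjection(n_components=100, random_state=0),
                               LogisticRegression(max_iter=2000)),
    "PCA -> RP": make_pipeline(PCA(n_components=200),
                               GaussianRandomProjection(n_components=100, random_state=0),
                               LogisticRegression(max_iter=2000)),
}
for name, pipe in pipelines.items():
    print(name, cross_val_score(pipe, X, y, cv=3).mean())
```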

• Yu KH [Michael Snyder | Stanford] (2016) Predicting non-small cell lung cancer prognosis by fully automated microscopic pathology image features. pdf   [Nature]

• Lung cancer is the most prevalent cancer worldwide, and histopathological assessment is indispensable for its diagnosis. However, human evaluation of pathology slides cannot accurately predict patients' prognoses. In this study, we obtain 2,186 haematoxylin and eosin stained histopathology whole-slide images of lung adenocarcinoma and squamous cell carcinoma patients from The Cancer Genome Atlas (TCGA), and 294 additional images from Stanford Tissue Microarray (TMA) Database. We extract 9,879 quantitative image features and use regularized machine-learning methods to select the top features and to distinguish shorter-term survivors from longer-term survivors with stage I adenocarcinoma (P<0.003) or squamous cell carcinoma (P=0.023) in the TCGA data set. We validate the survival prediction framework with the TMA cohort (P<0.036 for both tumour types). Our results suggest that automatically derived image features can predict the prognosis of lung cancer patients and thereby contribute to precision oncology. Our methods are extensible to histopathology images of other organs.

• Computers trounce pathologists in predicting lung cancer type, severity [ScienceDaily.com]

[Applied ML] Commercial

[Applied ML] Computer Science:

• AI learns to write its own code by stealing from other programs

• Balog M [University of Cambridge | Microsoft Research] (2016; ICLR 2017) DeepCoder: Learning to Write Programs. arXiv:1611.01989  |  published paper  |  The Language Driven Approach for Deep Learning Training  |  Stop saying DeepCoder steals code from StackOverflow

• We develop a first line of attack for solving programming competition-style problems from input-output examples using deep learning. The approach is to train a neural network to predict properties of the program that generated the outputs from the inputs. We use the neural network's predictions to augment search techniques from the programming languages community, including enumerative search and an SMT-based solver. Empirically, we show that our approach leads to an order of magnitude speedup over the strong non-augmented baselines and a Recurrent Neural Network approach, and that we are able to solve problems of difficulty comparable to the simplest problems on programming competition websites.

[Applied ML] Miscellaneous:

[Applied ML - Physics] PAPERS:

• Tompson J [Google Brain | New York University | Google DeepMind] (2016) Accelerating Eulerian Fluid Simulation With Convolutional Networks. arXiv:1607.03597  |  GitHub  |  project page  |  reddit  |  reddit  |  reddit  |  YouTube

• Efficient simulation of the Navier-Stokes equations for fluid flow is a long standing problem in applied mathematics, for which state-of-the-art methods require large compute resources. In this work, we propose a data-driven approach that leverages the approximation power of deep-learning with the precision of standard solvers to obtain fast and highly realistic simulations. Our method solves the incompressible Euler equations using the standard operator splitting method, in which a large sparse linear system with many free parameters must be solved. We use a Convolutional Network with a highly tailored architecture, trained using a novel unsupervised learning framework to solve the linear system. We present real-time 2D and 3D simulations that outperform recently proposed data-driven methods; the obtained results are realistic and show good generalization properties.

## APPROACH | ALGORITHMS - ADVICE

• Activation functions: see my notes!

• Baldassi C (2016) Unreasonable Effectiveness of Learning Neural Nets: Accessible States and Robust Ensembles. arXiv:1605.06444

• In artificial neural networks, learning from data is a computationally demanding task in which a large number of connection weights are iteratively tuned through stochastic-gradient-based heuristic processes over a cost-function. It is not well understood how learning occurs in these systems, in particular how they avoid getting trapped in configurations with poor computational performance. Here we study the difficult case of networks with discrete weights, where the optimization landscape is very rough even for simple architectures, and provide theoretical and numerical evidence of the existence of rare - but extremely dense and accessible - regions of configurations in the network weight space. We define a novel measure, which we call the "robust ensemble" (RE), which suppresses trapping by isolated configurations and amplifies the role of these dense regions. We analytically compute the RE in some exactly solvable models, and also provide a general algorithmic scheme which is straightforward to implement: define a cost-function given by a sum of a finite number of replicas of the original cost-function, with a constraint centering the replicas around a driving assignment. To illustrate this, we derive several powerful new algorithms, ranging from Markov Chains to message passing to gradient descent processes, where the algorithms target the robust dense states, resulting in substantial improvements in performance. The weak dependence on the number of precision bits of the weights leads us to conjecture that very similar reasoning applies to more conventional neural networks. Analogous algorithmic schemes can also be applied to other optimization problems.
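
• A compact way to write the replicated cost function described above (the notation is mine; see the paper for the precise coupling term and the discrete-weight case):

$$E_{\text{RE}}(\tilde{w}, w^1, \dots, w^R) \;=\; \sum_{a=1}^{R} E(w^a) \;+\; \gamma \sum_{a=1}^{R} \| w^a - \tilde{w} \|^2$$

i.e. the sum of $\small R$ replicas of the original cost $\small E$, each elastically coupled to a common driving (center) assignment $\small \tilde{w}$; the coupling suppresses isolated low-error configurations and biases the search toward the wide, dense regions the paper identifies.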

• Klein A (2016) Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets. arXiv:1605.07079

• Bayesian optimization has become a successful tool for hyperparameter optimization of machine learning algorithms, such as support vector machines or deep neural networks. But it is still costly if each evaluation of the objective requires training and validating the algorithm being optimized, which, for large datasets, often takes hours, days, or even weeks. To accelerate hyperparameter optimization, we propose a generative model for the validation error as a function of training set size, which is learned during the optimization process and allows exploration of preliminary configurations on small subsets, by extrapolating to the full dataset. We construct a Bayesian optimization procedure, dubbed FABOLAS, which models loss and training time as a function of dataset size and automatically trades off high information gain about the global optimum against computational cost. Experiments optimizing support vector machines and deep neural networks show that FABOLAS often finds high-quality solutions 10 to 100 times faster than other state-of-the-art Bayesian optimization methods.

• Li J (2016) Feature Selection: A Data Perspective. arXiv:1601.07996

• Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing high-dimensional data for data mining and machine learning problems. The objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and preparing clean, understandable data. ... In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. ... we also present an open-source feature selection repository that consists of most of the popular feature selection algorithms.

• Scardapane S (2016) Learning activation functions from data using cubic spline interpolation. arXiv:1605.05509  |  GitXiv

• Neural networks require a careful design in order to perform properly on a given task. In particular, selecting a good activation function (possibly in a data-dependent fashion) is a crucial step, which remains an open problem in the research community. Despite a large amount of investigations, most current implementations simply select one fixed function from a small set of candidates, which is not adapted during training, and is shared among all neurons throughout the different layers. However, neither two of these assumptions can be supposed optimal in practice. In this paper, we present a principled way to have data-dependent adaptation of the activation functions, which is performed independently for each neuron. This is achieved by leveraging over past and present advances on cubic spline interpolation, allowing for local adaptation of the functions around their regions of use. The resulting algorithm is relatively cheap to implement, and overfitting is counterbalanced by the inclusion of a novel damping criterion, which penalizes unwanted oscillations from a predefined shape. Experimental results validate the proposal over two well-known benchmarks.
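
• To make the idea concrete, here is a minimal illustration of evaluating a per-neuron activation defined by cubic-spline interpolation over a small set of control points; the paper additionally makes the control-point values trainable per neuron and adds the damping penalty toward a reference shape, neither of which is shown here:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Fixed knot locations spanning the neuron's typical pre-activation range.
knots = np.linspace(-3.0, 3.0, 15)
# Control-point values initialised from a reference shape (tanh here); in the
# paper these values would be per-neuron trainable parameters.
values = np.tanh(knots)

activation = CubicSpline(knots, values)   # smooth, locally adaptable activation function

x = np.array([-2.5, -0.3, 0.0, 1.7])
print(activation(x))   # spline output for some pre-activations
print(np.tanh(x))      # reference shape, for comparison
```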

• scikit-feature feature selection package: The feature selection repository is designed to collect some widely used feature selection algorithms that have been developed in feature selection research, to serve as a platform for facilitating their application, comparison and joint study. The feature selection repository also effectively assists researchers to achieve more reliable evaluation in the process of developing new feature selection algorithms. We develop the open-source feature selection repository scikit-feature in one of the most popular programming languages, Python. It contains more than 40 popular feature selection algorithms, including most traditional feature selection algorithms and some structural and streaming feature selection algorithms. It is built upon the widely used machine learning package scikit-learn and two scientific computing packages, NumPy and SciPy.

## ARCHITECTURES - BIOMIMETIC

[Architectures - Biomimetic] Papers:

• Costa RP [Nando de Freitas | Google DeepMind] (2017) Cortical microcircuits as gated-recurrent neural networks. arXiv:1711.02448

• [v1] Cortical circuits exhibit intricate recurrent architectures that are remarkably similar across different brain areas. Such stereotyped structure suggests the existence of common computational principles. However, such principles have remained largely elusive. Inspired by gated-memory networks, namely long short-term memory networks (LSTMs), we introduce a recurrent neural network in which information is gated through inhibitory cells that are subtractive (subLSTM). We propose a natural mapping of subLSTMs onto known canonical excitatory-inhibitory cortical microcircuits. Our empirical evaluation across sequential image classification and language modelling tasks shows that subLSTM units can achieve similar performance to LSTM units. These results suggest that cortical circuits can be optimised to solve complex contextual problems and proposes a novel view on their computational function. Overall our work provides a step towards unifying recurrent networks as used in machine learning with their biological counterparts.
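
• A rough NumPy sketch of a subtractive-gated recurrent cell in the spirit described above, assuming the subLSTM form in which the input and output gates subtract from (rather than multiply) the candidate signal; the exact equations, initialisation and the paper's variants may differ, so treat this as an illustration of the gating idea only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sublstm_step(x, h_prev, c_prev, W, U, b):
    """One step of a subtractive-gated (subLSTM-style) cell.
    W, U, b hold 4 stacked blocks: input gate i, forget gate f, candidate z, output gate o.
    Assumed formulation (illustrative, not a verbatim transcription of the paper):
        c_t = f * c_{t-1} + z - i
        h_t = sigmoid(c_t) - o
    """
    n = h_prev.shape[0]
    pre = W @ x + U @ h_prev + b
    i = sigmoid(pre[0*n:1*n])   # inhibitory "input" gate (subtractive)
    f = sigmoid(pre[1*n:2*n])   # forget gate (still multiplicative on the memory)
    z = sigmoid(pre[2*n:3*n])   # candidate input
    o = sigmoid(pre[3*n:4*n])   # inhibitory output gate (subtractive)
    c = f * c_prev + z - i
    h = sigmoid(c) - o
    return h, c

# Tiny smoke test with random weights.
rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = sublstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h.shape, c.shape)
```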

## ARCHITECTURES - CAPSULE NETWORKS

[Architectures - Capsule Networks] Blogs:

• Short, layperson's description:  Google A.I. researchers develop alternative architecture for neural networks

• Capsule Networks Explained  |  reddit

• ELI5: Capsule networks. How are they unique and how are they better than CNN?  [reddit: Nov 2017]

• Lot of misconceptions around capsules network?  [reddit: Nov 2017]

• Understanding Hinton's Capsule Networks. Part I: Intuition

[Architectures - Capsule Networks] Papers:

• Sabour S [Hinton GE | Google Brain, Toronto] (2017) Dynamic Routing Between Capsules. arXiv:1710.09829  |  reddit  |  reddit

• [v1] A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.
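
• For intuition, a small NumPy sketch of the routing-by-agreement loop described in the abstract; the prediction vectors û are assumed given, and the convolutional capsule layers, reconstruction regularizer and margin loss from the paper are not shown:

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Squashing non-linearity: short vectors shrink toward 0, long vectors toward unit length."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, n_iters=3):
    """Routing by agreement.
    u_hat: shape (n_lower, n_upper, dim), the predictions û_{j|i} = W_ij u_i that
           lower-level capsule i makes for upper-level capsule j.
    Returns the upper-level capsule output vectors v_j, shape (n_upper, dim)."""
    n_lower, n_upper, _ = u_hat.shape
    b = np.zeros((n_lower, n_upper))                            # routing logits
    for _ in range(n_iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)    # coupling coefficients (softmax over upper capsules)
        s = np.einsum("ij,ijd->jd", c, u_hat)                   # weighted sum of predictions per upper capsule
        v = squash(s)                                           # upper-level capsule outputs
        b = b + np.einsum("ijd,jd->ij", u_hat, v)               # raise logits where predictions agree with outputs
    return v

# Toy example: 6 lower-level capsules voting for 3 upper-level capsules of dimension 8.
rng = np.random.default_rng(0)
v = dynamic_routing(rng.normal(size=(6, 3, 8)))
print(v.shape)   # (3, 8)
```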

• Google Researchers Have a New Alternative to Traditional Neural Networks  [MIT Technology Review: Nov 2017]:

• Dynamic routing between capsules  [The Morning Paper: Nov 13, 2017]
• Matrix capsules with EM routing  [The Morning Paper: Nov 14, 2017]

• Google's AI Wizard Unveils a New Twist on Neural Networks  [Wired: Nov 01, 2017]:

• Related [same group]: (ICLR 2018) Matrix Capsules With EM Routing  [pdf]  |  OpenReview

• A capsule is a group of neurons whose outputs represent different properties of the same entity. We describe a version of capsules in which each capsule has a logistic unit to represent the presence of an entity and a 4x4 pose matrix which could learn to represent the relationship between that entity and the viewer. A capsule in one layer votes for the pose matrices of many different capsules in the layer above by multiplying its own pose matrix by viewpoint-invariant transformation matrices that could learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are iteratively updated using the EM algorithm such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The whole system is trained discriminatively by unrolling 3 iterations of EM between each pair of adjacent layers. On the smallNORB benchmark, capsules reduce the number of test errors by 45% compared to the state-of-the-art. Capsules are also far more resistant to white box adversarial attack than our baseline convolutional neural network.

TL;DR: Capsule networks with learned pose matrices and EM routing improves state of the art classification on smallNORB, improves generalizability to new view points, and white box adversarial robustness.

## ARCHITECTURES - GENERAL

[Architectures - General] Blogs:

• Complex neural networks made easy by Chainer: A define-by-run approach allows for flexibility and simplicity when building deep learning networks.  |  local copy [pdf]  |  reddit

• Chainer is an open source framework designed for efficient research into and development of deep learning algorithms. In this post, we briefly introduce Chainer with a few examples and compare with other frameworks such as Caffe, Theano, Torch, and Tensorflow. Most existing frameworks construct a computational graph in advance of training. This approach is fairly straightforward, especially for implementing fixed and layer-wise neural networks like convolutional neural networks. However, state-of-the-art performance and new applications are now coming from more complex networks, such as recurrent or stochastic neural networks. Though existing frameworks can be used for these kinds of complex networks, it sometimes requires (dirty) hacks that can reduce development efficiency and maintainability of the code.

Chainer's approach is unique: building the computational graph "on-the-fly" during training. This allows users to change the graph at each iteration or for each sample, depending on conditions. It is also easy to debug and refactor Chainer-based code with a standard debugger and profiler, since Chainer provides an imperative API in plain Python and NumPy. This gives much greater flexibility in the implementation of complex neural networks, which leads in turn to faster iteration, and greater ability to quickly realize cutting-edge deep learning algorithms. Below, I describe how Chainer actually works and what kind of benefits users can get from it.

...
Chainer's design: Define-by-Run.

To train a neural network, three steps are needed: (1) build a computational graph from network definition, (2) input training data and compute the loss function, and (3) update the parameters using an optimizer and repeat until convergence. Usually, DL frameworks complete step one in advance of step two. We call this approach define-and-run.

This is straightforward but not optimal for complex neural networks since the graph must be fixed before training. Therefore, when implementing recurrent neural networks, for example, users are forced to exploit special tricks (such as the scan() function in Theano) which make it harder to debug and maintain the code. Instead, Chainer uses a unique approach called define-by-run, which combines steps one and two into a single step.

The computational graph is not given before training but obtained in the course of training. Since forward computation directly corresponds to the computational graph and backpropagation through it, any modifications to the computational graph can be done in the forward computation at each iteration and even for each sample.

As a simple example, let's see what happens using a two-layer perceptron for MNIST digit classification.
[ ... SNIP! ... ]
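
• The post's own example is snipped above; as an illustrative stand-in (my sketch, not the article's code), a two-layer perceptron in Chainer might look roughly like this. Define-by-run means the graph is built simply by executing the forward pass:

```python
import chainer
import chainer.functions as F
import chainer.links as L

class MLP(chainer.Chain):
    """Two-layer perceptron; the computational graph is created on-the-fly each
    time __call__ runs, so it can change per iteration or even per sample."""
    def __init__(self, n_units=100, n_out=10):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_units)   # input size inferred at first call
            self.l2 = L.Linear(n_units, n_out)

    def __call__(self, x):
        h = F.relu(self.l1(x))
        return self.l2(h)

model = L.Classifier(MLP())             # wraps a softmax cross-entropy loss around the MLP
optimizer = chainer.optimizers.SGD(lr=0.01)
optimizer.setup(model)
# A training step would then be: loss = model(x_batch, t_batch); model.cleargrads();
# loss.backward(); optimizer.update()
```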

• [CNN architectures] Is there any logic behind the design of architectures? [reddit]

• Neural Network Architectures:

• (June 2016)  Brief history of image processing NN [comments at:  reddit  |  Hacker News]:

• ... plus additional content (updated references -- Victoria )

LeNet5
Dan Ciresan Net
AlexNet
Overfeat
VGG
Network-in-network
ResNet
Inception V4
SqueezeNet
ENet

• LeNet5 (LeNet-5):

• It is the year 1994, and this is one of the very first convolutional neural networks, and what propelled the field of Deep Learning. This pioneering work by Yann LeCun was named LeNet5 after many previous successful iterations since the year 1988!

• The LeNet5 architecture was fundamental, in particular the insight that image features are distributed across the entire image, and convolutions with learnable parameters are an effective way to extract similar features at multiple locations with few parameters. At the time there was no GPU to help training, and even CPUs were slow. Therefore being able to save parameters and computation was a key advantage. This is in contrast to using each pixel as a separate input to a large multi-layer neural network: LeNet5 explained that individual pixels should not be used as features in the first layer, because images are highly spatially correlated, and using individual pixels of the image as separate input features would not take advantage of these correlations.

• LeNet5 features can be summarized as:

• convolutional neural networks use a sequence of 3 layers: convolution, pooling, non-linearity → This may be the key feature of Deep Learning for images since this paper!  (a minimal code sketch follows this list)
• use convolution to extract spatial features
• subsample using spatial average of maps
• non-linearity in the form of tanh or sigmoids
• multi-layer neural network (MLP) as final classifier
• sparse connection matrix between layers to avoid large computational cost
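
• A minimal sketch of a LeNet-5-style network (matching the list above) in Keras; layer sizes follow the commonly cited description and are illustrative rather than a faithful reproduction of the 1998 paper:

```python
from tensorflow import keras
from tensorflow.keras import layers

# LeNet-5-style stack: convolution -> average-pooling subsampling -> tanh
# non-linearity, repeated, with an MLP as the final classifier.
model = keras.Sequential([
    layers.Conv2D(6, kernel_size=5, activation="tanh", input_shape=(32, 32, 1)),
    layers.AveragePooling2D(pool_size=2),
    layers.Conv2D(16, kernel_size=5, activation="tanh"),
    layers.AveragePooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```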

• LeCun Y (1998) Gradient-based learning applied to document recognition. pdf

• Multilayer neural networks trained with the back-propagation algorithm constitute the best example of a successful gradient-based learning technique. Given an appropriate network architecture, gradient-based learning algorithms can be used to synthesize a complex decision surface that can classify high-dimensional patterns, such as handwritten characters, with minimal preprocessing. This paper reviews various methods applied to handwritten character recognition and compares them on a standard handwritten digit recognition task. Convolutional neural networks, which are specifically designed to deal with the variability of two dimensional (2-D) shapes, are shown to outperform all other techniques. Real-life document recognition systems are composed of multiple modules including field extraction, segmentation, recognition, and language modeling. A new learning paradigm, called graph transformer networks (GTN's), allows such multimodule systems to be trained globally using gradient-based methods so as to minimize an overall performance measure. Two systems for online handwriting recognition are described. Experiments demonstrate the advantage of global training, and the flexibility of graph transformer networks. A graph transformer network for reading a bank check is also described. It uses convolutional neural network character recognizers combined with global training techniques to provide record accuracy on business and personal checks. It is deployed commercially and reads several million checks per day.

• Dan Ciresan Net:

• In 2010 Dan Claudiu Ciresan and Jürgen Schmidhuber published one of the very first implementations of GPU neural nets. This implementation had both forward and backward passes implemented on an NVIDIA GTX 280 graphics processor, for neural networks of up to 9 layers.

• Good old on-line back-propagation for plain multi-layer perceptrons yields a very low 1.35% error rate on the famous MNIST handwritten digits benchmark. All we need to achieve this best result so far are many hidden layers, many neurons per layer, numerous deformed training images, and graphics cards to greatly speed up learning.

• Ciresan DC [Schmidhuber J] (2010) Deep, big, simple neural nets for handwritten digit recognition. arXiv:1003.0358

• AlexNet:


• In 2012, Alex Krizhevsky released AlexNet [pdf], which was a deeper and much wider version of the LeNet and won the difficult ImageNet competition (ILSVRC 2012) by a large margin. AlexNet scaled the insights of LeNet into a much larger neural network that could be used to learn much more complex objects and object hierarchies. The contributions of this work were:

• use of rectified linear units (ReLU) as non-linearities
• use of dropout technique to selectively ignore single neurons during training, a way to avoid overfitting of the model
• overlapping max pooling, avoiding the averaging effects of average pooling
• use of GPUs NVIDIA GTX 580 to reduce training time

• At the time GPUs offered a much larger number of cores than CPUs, and allowed 11x faster training time, which in turn allowed the use of larger datasets and also bigger images.

• Krizhevsky A [Sutskever I; Hinton GE] (2012) ImageNet classification with deep convolutional neural networks. more here.

• Overfeat:

• In December 2013 the NYU lab from Yann LeCun came up with Overfeat, which is a derivative of AlexNet. The article also proposed learning bounding boxes, which later gave rise to many other papers on the same topic. I believe it is better to learn to segment objects rather than learn artificial bounding boxes.

• Sermanet P [LeCun Y] (2013) Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv:1312.6229

• We present an integrated framework for using Convolutional Networks for classification, localization and detection. We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries. Bounding boxes are then accumulated rather than suppressed in order to increase detection confidence. We show that different tasks can be learned simultaneously using a single shared network. This integrated framework is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 (ILSVRC2013) and obtained very competitive results for the detection and classifications tasks. In post-competition work, we establish a new state of the art for the detection task. Finally, we release a feature extractor from our best model called OverFeat.

• VGG networks:

• The VGG networks from Oxford were the first to use much smaller 3×3 filters in each convolutional layer and also combined them as a sequence of convolutions. ... the great advantage of VGG was the insight that multiple 3×3 convolutions in sequence can emulate the effect of larger receptive fields, for example 5×5 and 7×7 (see the parameter count below). ... VGG networks use multiple 3x3 convolutional layers to represent complex features. ... This obviously amounts to a massive number of parameters, and also learning power. But training of these networks was difficult (they had to be split into smaller networks with layers added one by one; there was a lack of strong ways to regularize the model, or to somehow restrict the massive search space promoted by the large number of parameters). VGG used large feature sizes in many layers and thus inference was quite costly at run-time. Reducing the number of features, as done in Inception bottlenecks, saves some of the computational cost.
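
• The receptive-field / parameter trade-off mentioned above can be made concrete with a quick count (assuming $\small C$ input and $\small C$ output channels, ignoring biases):

$$2 \times (3 \times 3 \times C \times C) = 18C^2 \;<\; 5 \times 5 \times C \times C = 25C^2, \qquad 3 \times (3 \times 3 \times C \times C) = 27C^2 \;<\; 7 \times 7 \times C \times C = 49C^2$$

so two (three) stacked 3×3 convolutions cover the same effective 5×5 (7×7) receptive field with fewer parameters, and gain an extra non-linearity between layers.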

• Simonyan K (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556

• In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3x3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks, respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

• Very good discussion!  Is VGG common in newer research, or is ResNet the new standard for pretrained networks?  [reddit: May 2017]

• I've looked at some benchmarks, and it seems like Resnet-50 is both faster and more accurate than both VGGs. Is there any reason to use VGG, other than the simple architecture?

• u/SalimSerhan: My experience has been that, when it comes to finetuning, VGG16 and VGG19 converge quicker and are in general easier to train than Resnet-50.

• u/RadonGaming: Personally, in the research we're undertaking we're choosing VGG as it is very simple and straightforward to implement. Many pre-trained examples exist, and it just works.

• u/Britefury: I have seen pre-trained ResNets used for object detection and localisation, or fine tuned for classification. AFAICT, people stick to VGG for semantic segmentation. I have tried both VGG-16 and ResNet-50 for segmentation and have obtained better results from VGG-16.

• ResNets performed best for this simple VQA baseline
• AlexNet / VGG perform well for word similarity tasks
• GoogLeNet-v3 performed best for machine translation with images

• ResNets provide better accuracy in nearly all domains, e.g. in visual and speech recognition

• Network-in-network:

• Network-in-network (NiN) had the great and simple insight of using 1x1 convolutions to provide more combinational power to the features of a convolutional layer. The NiN architecture used spatial MLP layers after each convolution, in order to better combine features before the next layer. Again, one could think that 1x1 convolutions are against the original principles of LeNet, but really they help to combine convolutional features in a better way, which is not possible by simply stacking more convolutional layers. This is different from using raw pixels as input to the next layer. Here 1×1 convolutions are used to spatially combine features across feature maps after convolution, so they effectively use very few parameters, shared across all pixels of these features! The power of MLP can greatly increase the effectiveness of individual convolutional features by combining them into more complex groups. This idea was later used in most recent architectures, such as ResNet and Inception and their derivatives. NiN also used an average pooling layer as part of the last classifier, another practice that would become common. This was done to average the response of the network over multiple areas of the input image before classification.

• Lin M (2013) Network in network. arXiv:1312.4400

• We propose a novel deep network structure called "Network In Network" (NIN) to enhance model discriminability for local patches within the receptive field. The conventional convolutional layer uses linear filters followed by a nonlinear activation function to scan the input. Instead, we build micro neural networks with more complex structures to abstract the data within the receptive field. We instantiate the micro neural network with a multilayer perceptron, which is a potent function approximator. The feature maps are obtained by sliding the micro networks over the input in a similar manner as CNN; they are then fed into the next layer. Deep NIN can be implemented by stacking multiple of the above described structure. With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers. We demonstrated the state-of-the-art classification performances with NIN on CIFAR-10 and CIFAR-100, and reasonable performances on SVHN and MNIST datasets.

• Network in Network (NiN), is it still useful? [reddit: Jan 2016]  "I think the main value of this paper was that it caused the collective realisation that convolutions with 1x1 receptive field are useful and incredibly versatile. ..."

• Aside -- [1x1] Convolutions:

• Can someone explain the 1x1 convolutions in inception module?  >>  A bank of N 1x1 filters in an intermediate layer is actually a bank of N Cx1x1 filters, where C is the number of filters in the previous layer. It's basically taking N weighted sums of the previous set of featuremaps (i.e. the output of the previous layer). If N < C, then you're reducing dimensionality.
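A minimal numpy illustration of that point (my own sketch, not from the linked thread): a 1x1 convolution is just a per-pixel linear map across channels, so with N < C output filters it reduces the channel dimensionality while leaving the spatial dimensions untouched:

```python
import numpy as np

# Hedged sketch: a bank of N 1x1 filters applied to C feature maps is a set of
# N weighted sums across channels, computed independently at every pixel.
C, H, W, N = 256, 32, 32, 64                 # N < C  ->  dimensionality reduction
feature_maps = np.random.randn(C, H, W)      # output of the previous layer
filters_1x1  = np.random.randn(N, C)         # each "1x1" filter is really C x 1 x 1

# out[n, h, w] = sum_c filters_1x1[n, c] * feature_maps[c, h, w]
out = np.einsum('nc,chw->nhw', filters_1x1, feature_maps)
print(out.shape)                             # (64, 32, 32): 4x fewer feature maps
```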

• ... Strategy 2. Make the network smaller by replacing 3x3 filters with 1x1 filters

This strategy reduces the number of parameters 10x by replacing a bunch of 3x3 filters with 1x1 filters. At first this seemed really confusing to me. By moving 1x1 filters across an image I would think that each filter has less information to look at and would thus perform more poorly, however that doesn't seem to be the case! Typically a larger 3x3 convolution filter captures spatial information of pixels close to each other. On the other hand, 1x1 convolutional filters zero in on a single pixel and capture relationships amongst its channels as opposed to neighboring pixels. If you are looking to learn more about the use of 1x1 filters check out this blog post.

• Well-explained here:  The One by One [1x1] Convolution: Counter-intuitively Useful

• 1D Convolution in Neural Networks [StackExchange]

• 1x1 Convolutions - Why use them? [reddit]

• Christian Szegedy (Google) began a quest aimed at reducing the computational burden of deep neural networks, and devised GoogLeNet, the first Inception architecture. By now (Fall 2014) deep learning models were becoming extremely useful in categorizing the content of images and video frames. ... He and his team came up with the Inception module, which at first glance is basically the parallel combination of 1×1, 3×3, and 5×5 convolutional filters. The great insight of the Inception module was the use of 1×1 convolutional blocks (NiN) to reduce the number of features before the expensive parallel blocks (commonly referred to as "bottlenecks"). ... GoogLeNet used a stem without Inception modules as its initial layers, and an average pooling plus softmax classifier similar to NiN. This classifier also requires an extremely low number of operations compared to AlexNet and VGG, which contributed to a very efficient network design:  [Canziani A (2016) An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678].


• Szegedy C [Google] (2014) Going deeper with convolutions. arXiv:1409.4842   [pdf]  |  GitXiv

• We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC 2014 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
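A hedged Keras sketch of such a module (my own simplification for illustration; the branch widths below are assumed to follow the paper's "inception (3a)" block, so treat them as an assumption): parallel 1×1, 3×3 and 5×5 branches, with 1×1 bottleneck convolutions in front of the expensive branches, concatenated along the channel axis:

```python
from keras.layers import Concatenate, Conv2D, Input, MaxPooling2D
from keras.models import Model

# Hedged sketch of an Inception-style module: parallel 1x1, 3x3 and 5x5 branches,
# each expensive branch preceded by a 1x1 "bottleneck", outputs concatenated.
def inception_block(x, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
    b1 = Conv2D(c1, 1, padding='same', activation='relu')(x)
    b3 = Conv2D(c3_reduce, 1, padding='same', activation='relu')(x)
    b3 = Conv2D(c3, 3, padding='same', activation='relu')(b3)
    b5 = Conv2D(c5_reduce, 1, padding='same', activation='relu')(x)
    b5 = Conv2D(c5, 5, padding='same', activation='relu')(b5)
    bp = MaxPooling2D(3, strides=1, padding='same')(x)
    bp = Conv2D(pool_proj, 1, padding='same', activation='relu')(bp)
    return Concatenate()([b1, b3, b5, bp])

inp = Input(shape=(28, 28, 192))
out = inception_block(inp, 64, 96, 128, 16, 32, 32)   # assumed "3a"-style widths
Model(inp, out).summary()                             # 64 + 128 + 32 + 32 = 256 output channels
```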

• cool!  CaffeJS Webcam Implementation!

• GoogLeNet. The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much. There are also several followup versions to the GoogLeNet, most recently Inception-v4.

• The network under examination is the GoogLeNet architecture, trained to classify images into one of 1000 categories of the ImageNet dataset. It consists of a set of layers that apply a sequence of transformations to the input image. The parameters of these transformations were determined during the training process by a variant of gradient descent algorithm. The internal image representations may seem obscure, but it is possible to visualize and interpret them. In this notebook we are going to present a few tricks that allow to make these visualizations both efficient to generate and even beautiful. Impatient readers can start with exploring the full galleries of images generated by the method described here for GoogLeNet and VGG16 architectures.

• Bottleneck layer:

• Inspired by NiN, the bottleneck layer of Inception reduces the number of features (and thus operations) at each layer, so that inference time can be kept low. Before passing data to the expensive convolution modules, the number of features is reduced by, e.g., a factor of 4. This led to large savings in computational cost, and to the success of this architecture. Let's examine this in detail. Say you have 256 features coming in and 256 coming out, and the Inception layer only performs 3x3 convolutions. That is 256x256 x 3x3 convolutions that have to be performed (~589,000 multiply-accumulate, or MAC, operations per output position). That may be more than the computational budget we have, say, to run this layer in 0.5 milliseconds on a Google server. Instead, we decide to reduce the number of features that have to be convolved, say to 64 (256/4). In that case, we first perform a 256 → 64 1×1 convolution, then 64-feature convolutions on all the Inception branches, and then a 1×1 convolution from 64 back to 256 features. The operations are now:

256 ×   64 × 1×1 = 16,000
64 ×    64 × 3×3 = 36,000
64 ×  256 × 1×1 = 16,000

... for a total of about 70,000, vs. the ~600,000 we had before: almost 10x fewer operations! And although we are doing fewer operations, we are not losing generality in this layer. In fact, bottleneck layers have been proven to perform at state-of-the-art on the ImageNet dataset, for example, and were later used in architectures such as ResNet. The reason for the success is that the input features are correlated, and thus redundancy can be removed by combining them appropriately with the 1x1 convolutions. Then, after convolution with a smaller number of features, they can be expanded again into meaningful combinations for the next layer.
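The same arithmetic in a few lines of Python (a sketch of the counts quoted above, per output position and ignoring biases); the exact ratio works out to roughly 8.5x, which the text above rounds up to "almost 10x":

```python
# Hedged sketch of the bottleneck arithmetic above: MACs per output position.
def macs(c_in, c_out, k):
    return c_in * c_out * k * k

direct = macs(256, 256, 3)                    # 589,824  (the "~589,000" above)
bottleneck = (macs(256, 64, 1)                #  16,384  (256 -> 64, 1x1)
              + macs(64, 64, 3)               #  36,864  (64 -> 64,  3x3)
              + macs(64, 256, 1))             #  16,384  (64 -> 256, 1x1)

print(direct, bottleneck, round(direct / bottleneck, 1))   # 589824 69632 8.5
```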

• Inception-v2  |  Inception-v3:

[Inception-v4 further below.]

[Click for larger image]

• [Christian Szegedy | Google]  In February 2015 batch-normalized Inception was introduced as Inception-v2.  Batch normalization computes the mean and standard deviation of all feature maps at the output of a layer, and normalizes their responses with these values. This corresponds to "whitening" the data, making all the neural maps have responses in the same range and with zero mean. This helps training, as the next layer does not have to learn offsets in the input data and can focus on how to best combine features. In December 2015 they released a new version of the Inception modules and the corresponding architecture. This article better explains the original GoogLeNet architecture, giving a lot more detail on the design choices. The original ideas were:

• maximize information flow into the network, by carefully constructing networks that balance depth and width. Before each pooling step, increase the number of feature maps.
• when depth is increased, the number of features, or width of the layer, is also increased systematically
• use width increases at each layer to increase the combination of features before the next layer
• use only 3x3 convolutions, where possible, given that 5x5 and 7x7 filters can be decomposed into multiple 3x3 convolutions
• filters can also be decomposed by flattened convolutions into more complex modules:
Jin J (2014) Flattened convolutional neural networks for feedforward acceleration. arXiv:1412.5474
• Inception modules can also decrease the size of the data by providing pooling while performing the Inception computation; this is basically identical to performing a convolution with strides in parallel with a simple pooling layer
• Inception still uses a pooling layer plus softmax as the final classifier

• Training an Inception-V3-based image classifier with your own dataset [GitHub]  |  reddit

• Improving Inception and Image Classification in TensorFlow  [Google research Blog: Aug 2016] "... In order to allow people to immediately begin experimenting, we are also releasing a pre-trained instance of the new Inception-ResNet-v2, as part of the TF-Slim Image Model Library. ..."  |  reddit

• Ioffe S [Szegedy C | Google] (2015) [Inception network] Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167  |  reddit  |  reddit

• [v3] Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch.

BATCH NORMALIZATION allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.

• Andrej Karpathy (Stanford) cs231n  [ local copy  (pdf)]: "A recently developed technique by Ioffe and Szegedy called Batch Normalization [arXiv:1502.03167] alleviates a lot of headaches with properly initializing neural networks by explicitly forcing the activations throughout a network to take on a unit Gaussian distribution at the beginning of the training. The core observation is that this is possible because normalization is a simple differentiable operation. In the implementation, applying this technique usually amounts to insert the BatchNorm layer immediately after fully connected layers (or convolutional layers, as we'll soon see), and before non-linearities. We do not expand on this technique here because it is well described in the linked paper, but note that it has become a very common practice to use Batch Normalization in neural networks. In practice networks that use Batch Normalization are significantly more robust to bad initialization. Additionally, batch normalization can be interpreted as doing preprocessing at every layer of the network, but integrated into the network itself in a differentiable manner. Neat!"
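A minimal numpy sketch of the training-time forward pass just described (my own illustration, assuming a fully connected layer; for convolutional layers the statistics are taken per feature map, over the batch and spatial positions):

```python
import numpy as np

# Hedged sketch of a BatchNorm forward pass at training time.
# x: mini-batch of activations, shape (batch, features); gamma/beta are learned.
def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu  = x.mean(axis=0)                     # per-feature mean over the mini-batch
    var = x.var(axis=0)                      # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # zero mean, unit variance ("whitening")
    return gamma * x_hat + beta              # learned scale and shift

x = np.random.randn(128, 100) * 5.0 + 3.0    # badly scaled activations
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(round(y.mean(), 3), round(y.std(), 3)) # ~0.0 and ~1.0
```

(At test time the batch statistics are replaced by running averages accumulated during training, and the layer is typically inserted after the fully connected or convolutional layer and before the non-linearity, as the cs231n notes describe.)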

• Batch Normalization: What the Hey? [Lab41]: this blog post is devoted to explaining the more confusing portions of batch normalization. What follows are a few concepts that you may find interesting or may not have fully understood when reading over Ioffe and Szegedy's paper

• Batch Normalization before or after ReLU?   [reddit: Apr 2017]  conflicting advice ...

• Szegedy C (2015) Rethinking the inception architecture for computer vision. arXiv:1512.00567

• Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks in ways that aim at utilizing the added computation as efficiently as possible by suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and using less than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error on the validation set (3.6% error on the test set) and 17.3% top-1 error on the validation set.

• Train your own image classifier with Inception in TensorFlow [Google Research Blog  |  schematic]  |  arXiv:1512.00567 [Google]  |  Inception in TensorFlow [GitHub]

• Notes on the TensorFlow Implementation of Inception v3  |  reddit

• The official TensorFlow repository has a working implementation of the Inception v3 architecture. Inception v3 is the 2015 iteration of Google's Inception architecture for image recognition. If you are familiar with deep learning then you most definitely know all about it. If you aren't, but keep up with tech news, then you probably best know it as 'that learning algorithm that trained itself to recognize pictures of cats.' And if you still have no idea what I'm talking about (or you think I'm talking about that Leonardo DiCaprio movie), then congrats on climbing out of your cave, and welcome to the world of machine learning!

Inception is a really great architecture and it's the result of multiple cycles of trial and error. I frequently find that it achieves the best performance for image recognition among other models.

The implementation of it was written by the same people who wrote TensorFlow, and so it seems to be well-written and makes use of a lot of TensorFlow tricks and techniques. I thought I'd study the code to see how they do things, and learn how to utilize TensorFlow better. In this blog post I'm sharing some of my notes.

First of all, the Inception code uses TF-Slim, which seems to be a kind of abstraction library over TensorFlow that makes writing convolutional nets easier and more compact. As far as I can tell, TF-Slim hasn't been used for any major projects aside from Inception. But it's very ideal for inception, because the inception architecture is 'deep' and has many layers. Looking at the Readme file on that page is recommended.

Let's dive into the code. slim/inception_model.py contains the code for the actual inception model itself. ...

[ ... SNIP! ... ]

• ResNet:

• The revolution then came in December 2015, at about the same time as Inception-v3. ResNet has a simple idea: feed the output of two successive convolutional layers AND also bypass the input to the next layers! This is similar to older ideas like this one [pdf]. But here they bypass TWO layers, and the idea is applied at large scale. Bypassing after 2 layers is a key intuition, as bypassing a single layer did not give much improvement. The 2 layers can be thought of as a small classifier, or a Network-In-Network! This was also the very first time that networks of over a hundred, even 1000, layers were trained. ResNet with a large number of layers started to use a bottleneck layer similar to the Inception bottleneck:

This layer reduces the number of features at each layer by first using a 1x1 convolution with a smaller output (usually 1/4 of the input), then a 3x3 layer, and then again a 1x1 convolution back to a larger number of features. As in the case of Inception modules, this keeps the computation low while providing a rich combination of features. See the "bottleneck layer" section after "GoogLeNet and Inception". ResNet uses a fairly simple initial layer at the input (stem): a 7x7 conv layer followed by a pool of 2. Contrast this with the more complex and less intuitive stems of Inception-v3 and -v4. ResNet also uses a pooling layer plus softmax as the final classifier. Additional insights about the ResNet architecture are appearing every day:

• ResNet can be seen as both parallel and serial modules, by thinking of the input as going to many modules in parallel, while the outputs of each module connect in series
• ResNet can also be thought of as multiple ensembles of parallel or serial modules:
Veit A (2016) Residual Networks are Exponential Ensembles of Relatively Shallow Networks. arXiv:1605.06431
• It has been found that ResNet usually operates on blocks of relatively low depth (~20-30 layers), which act in parallel, rather than flowing serially through the entire length of the network.
• When the output of ResNet is fed back to the input, as in an RNN, the network can be seen as a more biologically plausible model of the cortex:
Liao Q (2016) Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex. arXiv:1604.03640

• He K (2015) Deep residual learning for image recognition. arXiv:1512.03385  |  Winner: ILSVRC 2015

• Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers -- 8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
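A minimal numpy sketch of the residual idea described above (my own illustration, using fully connected layers in place of 3x3 convolutions and omitting batch normalization): the block computes a residual F(x) with two layers and adds the unchanged input back in, y = F(x) + x, so the shortcut "bypasses" the two layers:

```python
import numpy as np

# Hedged sketch of a basic residual block: y = relu(F(x) + x),
# where F(x) = W2 . relu(W1 . x) is the two-layer residual branch.
def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    out = relu(x @ w1)           # first layer of the residual branch
    out = out @ w2               # second layer
    return relu(out + x)         # identity shortcut bypasses both layers

dim = 64
x  = np.random.randn(dim)
w1 = np.random.randn(dim, dim) * 0.01
w2 = np.random.randn(dim, dim) * 0.01
y  = residual_block(x, w1, w2)              # with near-zero weights F(x) ~ 0, so y ~ relu(x):
print(np.allclose(y, relu(x), atol=0.1))    # the block starts out close to an identity mapping
```

That near-identity behaviour at initialization is one intuition for why very deep residual networks remain trainable where plain stacked networks do not.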

• Inception-v4:

• [Christian Szegedy | Google]. This Inception module after the stem (image) is rather similar to Inception-v3; they also combined the Inception module with the ResNet module:

This time, though, the solution is, in my opinion, less elegant and more complex, and also full of less transparent heuristics. It is hard to understand the choices, and it is also hard for the authors to justify them. In this regard, the prize for a clean and simple network that can be easily understood and modified now goes to ResNet.

• GitHub  |  GitHub [Keras]  |  GitXiv  |  Google Research Blog


• Szegedy C (2016) Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv:1602.07261   [pdf]

• Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there are any benefit in combining the Inception architecture with residual connections. Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin. We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly. We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual and one Inception-v4, we achieve 3.08 percent top-5 error on the test set of the ImageNet classification (CLS) challenge.

• [1602.07261] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning  [reddit]

• Very deep CNN have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture that has been shown to achieve very good performance at relatively low computational cost. ...

• Pretrained model: Google released pre-trained Inception-v4 model  [reddit: Nov 2016]  |  GitHub

• GitHub  |  Inception V4 Implementation in Keras Including Pre-Trained Weights! [reddit]

• SqueezeNet:

• SqueezeNet was recently released. It is a re-hash of many concepts from ResNet and Inception, and shows that, after all, a better architecture design can deliver small network sizes and parameter counts without needing complex compression algorithms.

• Iandola FN [Song Han; ... | UC Berkeley | Stanford] (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv:1602.07360  |  GitHub [SqueezeNet-Deep-Compression]  |  GitHub [SqueezeNet]  |  GitXiv  |  keras-squeezenet [GitHub]  |  reddit

• Recent research on deep neural networks has focused primarily on improving accuracy. For a given accuracy level, it is typically possible to identify multiple DNN architectures that achieve that accuracy level. With equivalent accuracy, smaller DNN architectures offer at least three advantages: (1) Smaller DNNs require less communication across servers during distributed training. (2) Smaller DNNs require less bandwidth to export a new model from the cloud to an autonomous car. (3) Smaller DNNs are more feasible to deploy on FPGAs and other hardware with limited memory. To provide all of these advantages, we propose a small DNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to less than 0.5MB (510x smaller than AlexNet). The SqueezeNet architecture is available for download here [GitHub].

• Discussed here [Lab41.org]:  Lab41 Reading Group: SqueezeNet  [local copy (pdf)]:

• "The next paper [arXiv:1602.07360] from our reading group is by Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally and Kurt Keutzer. This paper introduces a small CNN architecture called "SqueezeNet" that achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. As you may have noticed with one of our recent posts [arXiv:1510.00149] we're really interested in learning more about the compression of neural network architectures and this paper really stood out.

"It's no secret that much of deep learning is tied up in the hell that is parameter tuning. This paper makes a case for increased study into the area of convolutional neural network design in order to drastically reduce the number of parameters you have to deal with. Unlike our previous post on "deep compression", this paper proposes making a network smaller by starting with a smarter design versus using a clever compression scheme. The authors outline 4 main strategies for reducing parameter size while maximizing accuracy. I'll walk you through them now."

• "Update" (Aug 2016; different research group): [1608.04493] Dynamic Network Surgery for Efficient DNNs (Reduces AlexNet by 17.7x):

• Guo Y [Intel Labs China] (2016) Dynamic Network Surgery for Efficient DNNs. arXiv:1608.04493

• Deep learning has become a ubiquitous technology to improve machine intelligence. However, most of the existing deep models are structurally very complex, making them difficult to deploy on mobile platforms with limited computational power. In this paper, we propose a novel network compression method called dynamic network surgery, which can remarkably reduce the network complexity by making on-the-fly connection pruning. Unlike the previous methods which accomplish this task in a greedy way, we properly incorporate connection splicing into the whole process to avoid incorrect pruning and make it a continual network maintenance. The effectiveness of our method is proved with experiments. Without any accuracy loss, our method can efficiently compress the number of parameters in LeNet-5 and AlexNet by a factor of 108x and 17.7x respectively, proving that it outperforms the recent pruning method by considerable margins. Code will be made publicly available.

• Mentioned here: reddit

• Interesting. Though this paper shows you can get AlexNet-like accuracy with 50x fewer parameters with a better designed network: arXiv:1602.07360

• The two approaches are roughly orthogonal to each other. The paper you linked goes on to compress their small network using Deep Compression [arXiv:1510.00149], getting a further reduction in model size. Deep Compression also relies on removing unnecessary weights, and the technique used is the main comparison point in OP's paper [arXiv:1506.02626], with this paper claiming better results. Presumably this weight-removal process can be swapped out to give combined improvements.

• Related [same research group, i.e. Song Han]:

• Han S[Stanford] (2015) Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149  |  Pruning [summary: this file]

• ENet:

• Our team set out to combine all the features of the recent architectures into a very efficient and lightweight network that uses very few parameters and little computation to achieve state-of-the-art results. This network architecture is dubbed ENet [arXiv:1606.02147], and was designed by Adam Paszke [GitHub]. We have used it to perform pixel-wise labeling and scene-parsing. Here are some videos of ENet in action. These videos are not part of the training dataset. ENet is an encoder plus decoder network. The encoder is a regular CNN designed for categorization, while the decoder is an upsampling network designed to propagate the categories back into the original image size for segmentation. This work used only neural networks, and no other algorithm, to perform image segmentation. ENet was designed to use the minimum number of resources possible from the start. As such, it achieves such a small footprint that encoder and decoder together occupy only 0.7 MB at fp16 precision. Even at this small size, ENet is similar to or above other pure neural network solutions in segmentation accuracy.

• Paszke A (2016) ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv:1606.02147  |  GitHub  |  GitXiv  |  project page  |  reddit  |  reddit

• The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18× faster, requires 75× less FLOPs, has 79× less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.

• Discussion here [reddit]  Importance of first layer In ConvNets

• Xception:

• Xception improves on the inception module and architecture with a simple and more elegant architecture that is as effective as ResNet and Inception V4. The Xception module is presented here:

• This network can be anyone's favorite given the simplicity and elegance of the architecture, presented here:

• The architecture has 36 convolutional stages, making it close in similarity to a ResNet-34. But the model and code are as simple as ResNet and much more comprehensible than Inception V4. A Torch7 implementation of this network is available here.  An implementation in Keras/TF is available here.

• Chollet F [François Chollet | Keras developer | Google] (2016) Xception: Deep Learning with Depthwise Separable Convolutions. arXiv:1610.02357  |  GitHub  |  GitXiv  |  Keras: docs  |  tutorial: TensorFlow

• We present an interpretation of Inception modules in convolutional neural networks as being an intermediate step in-between regular convolution and the recently introduced "separable convolution" operation. In this light, a separable convolution can be understood as an Inception module with a maximally large number of towers. This observation leads us to propose a novel deep convolutional neural network architecture inspired by Inception, where Inception modules have been replaced with separable convolutions. We show that this architecture, dubbed Xception, slightly outperforms Inception V3 on the ImageNet dataset (which Inception V3 was designed for), and significantly outperforms Inception V3 on a larger image classification dataset comprising 350 million images and 17,000 classes. Since the Xception architecture has the same number of parameters as Inception V3, the performance gains are not due to increased capacity but rather to a more efficient use of model parameters.
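The parameter savings behind a depthwise separable convolution can be sketched with simple arithmetic (my own illustration, not figures from the paper): a regular k×k convolution mixes space and channels in one step, while the separable version does a per-channel k×k "depthwise" step followed by a 1×1 "pointwise" channel mix:

```python
# Hedged sketch: parameter counts (biases ignored) for a regular 3x3 convolution
# vs. a depthwise separable one (per-channel 3x3 filter, then a 1x1 channel mix).
def regular_conv_params(k, c_in, c_out):
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    depthwise = k * k * c_in       # one k x k filter per input channel
    pointwise = c_in * c_out       # 1x1 convolution to recombine channels
    return depthwise + pointwise

c_in, c_out, k = 256, 256, 3
print(regular_conv_params(k, c_in, c_out))     # 589,824
print(separable_conv_params(k, c_in, c_out))   # 67,840  (~8.7x fewer parameters)
```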

• Mentioned here:  Is gradual pooling no longer the preferred architecture?

• ... It used to be that ConvNets would use 2x2 pooling every few layers: the number of channels would grow, while the dimensions of the image would decrease all the way to 4x4 or less. However, now, with the Xception architecture, I'm seeing lots of pooling in the beginning (299x299 -> 18x18), no pooling in most of the network, and lots of pooling at the end (18x18 -> 1x1).

• ... Xception uses pooling extensively: only 1 out of ~70 of its convolutions has stride = 2.

• ILSVRC 2017:

• Mishkin D (2016) Systematic evaluation of CNN advances on the ImageNet. arXiv:1606.02228

• The paper systematically studies the impact of a range of recent advances in CNN architectures and learning methods on the object categorization (ILSVRC) problem. The evaluation tests the influence of the following choices of the architecture: non-linearity (ReLU, ELU, maxout, compatibility with batch normalization), pooling variants (stochastic, max, average, mixed), network width, classifier design (convolutional, fully-connected, SPP), image pre-processing, and of learning parameters: learning rate, batch size, cleanliness of the data, etc. The performance gains of the proposed modifications are first tested individually and then in combination. The sum of individual gains is bigger than the observed improvement when all modifications are introduced, but the "deficit" is small suggesting independence of their benefits. We show that the use of 128x128 pixel images is sufficient to make qualitative conclusions about optimal network structure that hold for the full size Caffe and VGG nets. The results are obtained an order of magnitude faster than with the standard 224 pixel images.

They found that it is advantageous to:

• use ELU non-linearity without batchnorm or ReLU with it.
• apply a learned colorspace transformation of RGB.
• use the linear learning rate decay policy.
• use a sum of the average and max pooling layers.
• use mini-batch size around 128 or 256. If this is too big for your GPU, decrease the learning rate proportionally to the batch size.
• use fully-connected layers as convolutional and average the predictions for the final decision.
• when investing in increasing training set size, check whether a plateau has been reached.
• cleanliness of the data is more important than its size.
• if you cannot increase the input image size, reduce the stride in the subsequent layers; it has roughly the same effect.
• if your network has a complex and highly optimized architecture, like e.g. GoogLeNet, be careful with modifications.

• Neural Network Evolution Playground with Backprop NEAT [reddit]  |  links to: Neural Network Evolution Playground with Backprop NEAT: This demo will attempt to use a genetic algorithm to produce efficient, but atypical neural network structures to classify datasets borrowed from TensorFlow Playground.  |  demo  |  GitHub

...
In typical neural network-based classification problems, the data scientist would design and put together some pre-defined neural network, based on human heuristic, and the actual machine learning bit of the task would be to solve for the set of weights in the network, using some variants of stochastic gradient descent and the back propagation algorithm to calculate the weight gradients, in order to get the network to fit some training data under some regularisation constraints. The TensorFlow Playground demo captured the essence of this sort of task, but I've been thinking if machine learning can also be used effectively to design the actual neural network used for a given task as well. What if we can automate the process of discovering neural network architectures?

I decided to experiment with this idea by creating this demo. Rather than go with the conventional approach of organising many layers of neurons with uniform activation functions, we will try to abandon the idea of layers altogether, so each neuron can potentially connect to any other neuron in our network. Also, rather than sticking with neurons that use a uniform activation function, such as sigmoids or Relu's, we will allow many types of neurons with many types of activation functions, such as sigmoid, tanh, Relu, sine, Gaussian, abs, square, and even addition and multiplicative gates.

The genetic algorithm called $\small NEAT$ will be used to evolve our neural nets from a very simple one at the beginning to more complex ones over many generations. The weights of the neural nets will be solved via back propagation. The awesome recurrent.js library made by Karpathy, makes it possible to build computational graph representation of arbitrary neural networks with arbitrary activation functions. I implemented the $\small NEAT$ algorithm to generate representations of neural nets that $\small recurrent.js$ can process, so that the library can be used to forward pass through the neural nets that $\small NEAT$ has discovered, and also to backprop the neural nets to optimise for their weights.
...

• Explanation of NEAT and HyperNEAT? [reddit]

• NNPACK - acceleration package for neural networks on multi-core CPUs   (reddit)  |  GitHub

• NNPACK is an acceleration package for neural network computations. NNPACK aims to provide high-performance implementations of convnet layers for multi-core CPUs. NNPACK is not intended to be directly used by machine learning researchers; instead it provides low-level performance primitives to be leveraged by higher-level frameworks, such as Torch, Caffe, Tensorflow, Theano, and Mocha.jl.

[Architectures - General] Frameworks:

• Deep Learning Frameworks: A Survey of TensorFlow, Torch, Theano, Caffe, Neon, and the IBM Machine Learning Stack  |  local copy:  pdf

• Neubig G (2017) DyNet: The Dynamic Neural Network Toolkit. arXiv:1701.03980  |  GitHub  |  GitXiv  |  benchmarks  |  reddit
• We describe DyNet, a toolkit for implementing neural network models based on dynamic declaration of network structure. In the static declaration strategy that is used in toolkits like Theano, CNTK, and TensorFlow, the user first defines a computation graph (a symbolic representation of the computation), and then examples are fed into an engine that executes this computation and computes its derivatives. In DyNet's dynamic declaration strategy, computation graph construction is mostly transparent, being implicitly constructed by executing procedural code that computes the network outputs, and the user is free to use different network structures for each input. Dynamic declaration thus facilitates the implementation of more complicated network architectures, and DyNet is specifically designed to allow users to implement their models in a way that is idiomatic in their preferred programming language (C++ or Python). One challenge with dynamic declaration is that because the symbolic computation graph is defined anew for every training example, its construction must have low overhead. To achieve this, DyNet has an optimized C++ backend and lightweight graph representation. Experiments show that DyNet's speeds are faster than or comparable with static declaration toolkits, and significantly faster than Chainer, another dynamic declaration toolkit. DyNet is released open-source under the Apache 2.0 license and available at this http URL.

[Architectures - General] Papers:

• Adolf R (2016) Fathom: Reference Workloads for Modern Deep Learning Methods. arXiv:1608.06581  |  reddit

• Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting edge models from across the deep learning community.

Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook's AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.

• Amos B [Carnegie Mellon University] (2016) Input Convex Neural Networks. arXiv:1609.07152  |  GitHub  |  GitXiv  |  reddit

• This paper presents the input convex neural network architecture. These are scalar-valued (potentially deep) neural networks with constraints on the network parameters such that the output of the network is a convex function of (some of) the inputs. The networks allow for efficient inference via optimization over some inputs to the network given others, and can be applied to settings including structured prediction, data imputation, reinforcement learning, and others. In this paper we lay the basic groundwork for these models, proposing methods for inference, optimization and learning, and analyze their representational power. We show that many existing neural network architectures can be made input-convex with only minor modification, and develop specialized optimization algorithms tailored to this setting. Finally, we highlight the performance of the methods on multi-label prediction, image completion, and reinforcement learning problems, where we show improvement over the existing state of the art in many cases.

• Ba J & Rich Caruana [UofT | Microsoft Research] (2013) Do deep nets really need to be deep? arXiv:1312.6184  |  pdf

• Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this extended abstract, we show that shallow feed-forward networks can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. Moreover, in some cases the shallow neural nets can learn these deep functions using a total number of parameters similar to the original deep model. We evaluate our method on the TIMIT phoneme recognition task and are able to train shallow fully-connected nets that perform similarly to complex, well-engineered, deep convolutional architectures. Our success in training shallow neural nets to mimic deeper models suggests that there probably exist better algorithms for training shallow feed-forward nets than those currently available.

• Link to paper on using a larger neural net to train a smaller one?  [reddit]: "Yeah, it's the work by Rich Caruana: arXiv:1312.6184. If I understand correctly, they train a CNN on some labeled dataset. Then, they pass a larger set of data (larger, but similar to the original one) through the CNN, and train a smaller network to regress the log probabilities using $\small l2$ loss."
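A toy numpy sketch of that mimic-training recipe (my own illustration, with both "networks" reduced to linear maps for brevity): the student is fit to the teacher's logits with an L2 loss on an unlabeled transfer set, rather than to the original hard labels:

```python
import numpy as np

# Hedged sketch of mimic training: regress the teacher's logits with an L2 loss.
rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 50))          # unlabeled "transfer" set
teacher_W = rng.standard_normal((50, 10))    # stand-in for a trained deep net
teacher_logits = X @ teacher_W               # targets for the student

student_W = np.zeros((50, 10))
lr = 0.1
for _ in range(300):                         # plain gradient descent on the L2 loss
    residual = X @ student_W - teacher_logits
    student_W -= lr * (X.T @ residual) / len(X)

mimic_error = np.mean((X @ student_W - teacher_logits) ** 2)
print(round(mimic_error, 6))                 # ~0: the student reproduces the teacher's logits
```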

• Barone AVM (2016) Low-rank passthrough neural networks. arXiv:1603.03116  |  [1603.03116] Low-rank passthrough neural networks [reddit]  |  GitXiv

• Deep learning consists in training neural networks to perform computations that sequentially unfold in many steps over a time dimension or an intrinsic depth dimension. Effective learning in this setting is usually accomplished by specialized network architectures that are designed to mitigate the vanishing gradient problem of naive deep networks. Many of these architectures, such as LSTMs, GRUs, Highway Networks and Deep Residual Network, are based on a single structural principle: the state passthrough. We observe that these architectures, hereby characterized as Passthrough Networks, in addition to the mitigation of the vanishing gradient problem, enable the decoupling of the network state size from the number of parameters of the network, a possibility that is exploited in some recent works but not thoroughly explored. In this work we propose simple, yet effective, low-rank and low-rank plus diagonal matrix parametrizations [Victoria: sic: "Parametrization (or parameterization; also parameterisation, parametrisation) is the process of deciding and defining the parameters necessary for a complete or relevant specification of a model or geometric object."] for Passthrough Networks which exploit this decoupling property, reducing the data complexity and memory requirements of the network while preserving its memory capacity. We present competitive experimental results on synthetic tasks and a state of the art result on sequential randomly-permuted MNIST classification, a hard task on natural data.

• Canny J (NIPS 2013) BIDMach: Large-scale learning with zero memory allocation. pdf  |  GitHub  |  GitXiv

• This paper describes recent work on the BIDMach toolkit for large-scale machine learning. BIDMach has demonstrated single-node performance that exceeds that of published cluster systems for many common machine-learning tasks. BIDMach makes full use of both CPU and GPU acceleration (through a sister library BIDMat), and requires only modest hardware (commodity GPUs). One of the challenges of reaching this level of performance is the allocation barrier. While it is simple and expedient to allocate and recycle matrix (or graph) objects in expressions, this approach is too slow to match the arithmetic throughput possible on either GPUs or CPUs. In this paper we describe a caching approach that allows code with complex matrix (graph) expressions to run at massive scale, i.e. multi-terabyte data, with zero memory allocation after initial start-up. We present a number of new benchmarks that leverage this approach.

• Canziani A (2016) An Analysis of Deep Neural Network Models for Practical Applications. arXiv:1605.07678  |  reddit

• Since the emergence of Deep Neural Networks (DNNs) as a prominent technique in the field of computer vision, the ImageNet classification challenge has played a major role in advancing the state-of-the-art. While accuracy figures have steadily increased, the resource utilization of winning models has not been properly taken into account. In this work, we present a comprehensive analysis of important metrics in practical applications: accuracy, memory footprint, parameters, operations count, inference time and power consumption. Key findings are: (1) fully connected layers are largely inefficient for smaller batches of images; (2) accuracy and inference time are in a hyperbolic relationship; (3) energy constraints are an upper bound on the maximum achievable accuracy and model complexity; (4) the number of operations is a reliable estimate of the inference time. We believe our analysis provides a compelling set of information that help design and engineer efficient DNNs.

• Duvenaud DK [Cambridge | MIT | Harvard] (2014) Avoiding pathologies in very deep networks. arXiv:1402.5836  |  GitHub

• Choosing appropriate architectures and regularization strategies for deep networks is crucial to good predictive performance. To shed light on this problem, we analyze the analogous problem of constructing useful priors on compositions of functions. Specifically, we study the deep Gaussian process, a type of infinitely-wide, deep neural network. We show that in standard architectures, the representational capacity of the network tends to capture fewer degrees of freedom as the number of layers increases, retaining only a single degree of freedom in the limit. We propose an alternate network architecture which does not suffer from this pathology. We also examine deep covariance functions, obtained by composing infinitely many feature transforms. Lastly, we characterize the class of models obtained by performing dropout on Gaussian processes.

• Feng M (2015) Distributed Deep Learning for Question Answering. arXiv:1511.01158

• Fortunato M (2017) [Alex Graves | Vlad Mnih | Demis Hassabis | Google DeepMind] Noisy Networks for Exploration. arXiv:1706.10295  |  reddit

• We introduce NoisyNet, a deep reinforcement learning agent with parametric noise added to its weights, and show that the induced stochasticity of the agent's policy can be used to aid efficient exploration. The parameters of the noise are learned with gradient descent along with the remaining network weights. NoisyNet is straightforward to implement and adds little computational overhead. We find that replacing the conventional exploration heuristics for A3C, DQN and dueling agents (entropy reward and ϵ-greedy respectively) with NoisyNet yields substantially higher scores for a wide range of Atari games, in some cases advancing the agent from sub to super-human performance.
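A tiny numpy sketch of the core idea (my own illustration, using the independent-Gaussian variant; the paper also describes a cheaper factorised-noise version): every weight is parameterized as μ + σ·ε, with μ and σ learned and ε resampled each forward pass, so the scale of the exploration noise is itself adjusted by gradient descent:

```python
import numpy as np

# Hedged sketch of a NoisyNet-style linear layer: w = mu + sigma * eps,
# where mu and sigma are learnable and eps is fresh Gaussian noise per forward pass.
rng = np.random.default_rng(0)

class NoisyLinear:
    def __init__(self, n_in, n_out, sigma0=0.5):
        self.mu    = rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)
        self.sigma = np.full((n_in, n_out), sigma0 / np.sqrt(n_in))  # learned in practice

    def __call__(self, x):
        eps = rng.standard_normal(self.mu.shape)   # resampled noise -> stochastic policy
        return x @ (self.mu + self.sigma * eps)

layer = NoisyLinear(4, 2)
print(layer(np.ones(4)), layer(np.ones(4)))        # two different (noisy) outputs
```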

• Related:  reddit:  "OpenAI submitted a similar paper a few weeks ago: "Parameter Space Noise for Exploration" arXiv:1706.01905. Difference is that OpenAI version scales the noise based on variance it causes in action space (good) and the parameter for amount_of_noise_used is not learned (bad)."

• Plappert M (2017) [OpenAI] Parameter Space Noise for Exploration. arXiv:1706.01905

• Deep reinforcement learning (RL) methods generally engage in exploratory behavior through noise injection in the action space. An alternative is to add noise directly to the agent's parameters, which can lead to more consistent exploration and a richer set of behaviors. Methods such as evolutionary strategies use parameter perturbations, but discard all temporal structure in the process and require significantly more samples. Combining parameter noise with traditional RL methods allows to combine the best of both worlds. We demonstrate that both off- and on-policy methods benefit from this approach through experimental comparison of DQN, DDPG, and TRPO on high-dimensional discrete action environments as well as continuous control tasks. Our results show that RL with parameter noise learns more efficiently than traditional RL with action space noise and evolutionary strategies individually.

• Ha D [Andrew Dai; Quoc V. Le | Google Brain] (2016) HyperNetworks, arXiv:1609.09106  |  GitHub  |  GitXiv  |  reddit  |  reddit  |  blog: post

• This work explores hypernetworks: an approach of using a small network, also known as a hypernetwork, to generate the weights for a larger network. Hypernetworks provide an abstraction that is similar to what is found in nature: the relationship between a genotype - the hypernetwork - and a phenotype - the main network. Though they are also reminiscent of HyperNEAT in evolution, our hypernetworks are trained end-to-end with backpropagation and thus are usually faster. The focus of this work is to make hypernetworks useful for deep convolutional networks and long recurrent networks, where hypernetworks can be viewed as relaxed form of weight-sharing across layers. Our main result is that hypernetworks can generate non-shared weights for LSTM and achieve state-of-art results on a variety of language modeling tasks with Character-Level Penn Treebank and Hutter Prize Wikipedia datasets, challenging the weight-sharing paradigm for recurrent networks. Our results also show that hypernetworks applied to convolutional networks still achieve respectable results for image recognition tasks compared to state-of-the-art baseline models while requiring fewer learnable parameters.
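A toy numpy sketch of the genotype/phenotype relationship described above (my own illustration): a small hypernetwork maps a short per-layer embedding z to the full weight matrix of a layer in the main network, so the main network's weights are generated rather than stored directly:

```python
import numpy as np

# Hedged sketch of a hypernetwork: a linear map from a small embedding z
# to the (flattened) weight matrix of a layer in the main network.
rng = np.random.default_rng(0)
d_z, n_in, n_out = 4, 64, 64

H = rng.standard_normal((d_z, n_in * n_out)) * 0.01    # the hypernetwork's own weights

def generate_weights(z):
    return (z @ H).reshape(n_in, n_out)                # "phenotype" from "genotype" z

z1, z2 = rng.standard_normal(d_z), rng.standard_normal(d_z)
W1, W2 = generate_weights(z1), generate_weights(z2)    # relaxed weight sharing: layers
                                                       # share H but differ through z
x = rng.standard_normal(n_in)
h = np.tanh(x @ W1) @ W2                               # main network uses generated weights
print(W1.shape, h.shape)                               # (64, 64) (64,)
```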

• Hyper Networks: In this post, I will talk about our recent paper [arXiv:1609.09106] called "HyperNetworks." I worked on this paper as a Google Brain Resident ...  |  Includes code! :-D

• Han S [Stanford] (2015) Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv:1510.00149  |  GitHub  |  GitXiv  |  reddit  |  Pruning [summary: this file]

• Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources. To address this limitation, we introduce "deep compression", a three stage pipeline: pruning, trained quantization and Huffman coding, that work together to reduce the storage requirement of neural networks by 35x to 49x without affecting their accuracy. Our method first prunes the network by learning only the important connections. Next, we quantize the weights to enforce weight sharing, finally, we apply Huffman coding. After the first two steps we retrain the network to fine tune the remaining connections and the quantized centroids. Pruning, reduces the number of connections by 9x to 13x; Quantization then reduces the number of bits that represent each connection from 32 to 5. On the ImageNet dataset, our method reduced the storage required by AlexNet by 35x, from 240MB to 6.9MB, without loss of accuracy. Our method reduced the size of VGG-16 by 49x from 552MB to 11.3MB, again with no loss of accuracy. This allows fitting the model into on-chip SRAM cache rather than off-chip DRAM memory. Our compression method also facilitates the use of complex neural networks in mobile applications where application size and download bandwidth are constrained. Benchmarked on CPU, GPU and mobile GPU, compressed network has 3x to 4x layerwise speedup and 3x to 7x better energy efficiency.
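A hedged numpy sketch of just the pruning stage (my own illustration; the retraining, weight-sharing quantization, and Huffman-coding stages are omitted): connections whose weights fall below a magnitude threshold are removed, leaving a sparse layer:

```python
import numpy as np

# Hedged sketch of magnitude pruning: drop the smallest-magnitude connections.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)) * 0.1        # a dense fully connected layer

threshold = np.quantile(np.abs(W), 0.90)         # prune the smallest 90% of weights
mask = np.abs(W) >= threshold                    # the "important" connections survive
W_pruned = W * mask

print(f"kept {mask.mean():.0%} of the connections")   # ~10%
# Deep Compression then retrains the surviving weights, quantizes them into
# shared clusters, and Huffman-codes the result to reach the reported 35x-49x.
```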

• Lab41 Reading Group: Deep Compression: The next paper from our reading group is by Song Han, Huizi Mao, and William J. Dally. It won the best paper award at ICLR 2016. It details three methods of compressing a neural network in order to reduce the size of the network on disk, improve performance, and decrease run time. ...  |  pdf  [local copy]

• [2017-Sep-15]  Song Han graduated from Stanford University advised by Prof. Bill Dally. His research focuses on energy-efficient deep learning, at the intersection between machine learning and computer architecture. He proposed Deep Compression that can compress deep neural networks by an order of magnitude without losing the prediction accuracy. He designed EIE: Efficient Inference Engine, a hardware architecture that can perform inference directly on the compressed sparse model, which saves memory bandwidth and results in significant speedup and energy saving. His work has been featured by TheNextPlatform, TechEmergence, Embedded Vision and O'Reilly. He led research efforts in model compression and hardware acceleration for deep learning that won the Best Paper Award at ICLR'16 and the Best Paper Award at FPGA'17. Before joining Stanford, Song graduated from Tsinghua University.

I will join MIT EECS as an assistant professor starting summer 2018. I'm actively looking for students interested in deep learning and computer architecture. You are welcome to apply to MIT this fall. If you are interested in summer research in 2018, drop me an email with your CV, publications, and research proposal, preferably before April 2018.

[Nov 2016]  Song Han  [Stanford department page]  is a fifth year PhD student with Prof. Bill Dally at Stanford University. His research focuses on energy-efficient deep learning, at the intersection between machine learning and computer architecture. Song proposed Deep Compression that can compress state-of-the-art CNNs by 9x-49x and compressed SqueezeNet to only 470KB, which fits fully in on-chip SRAM. He proposed a DSD training flow that improved the accuracy of a wide range of neural networks. He designed EIE: Efficient Inference Engine, a hardware architecture that does inference directly on the compressed sparse neural network model, which is 13x faster and 3000x more energy efficient than a GPU. His work has been covered by TheNextPlatform, TechEmergence, Embedded Vision and O'Reilly. His work received the Best Paper Award at ICLR'16.  |  Deep Compression, DSD Training and EIE: Deep Neural Network Model Compression, Regularization and Hardware Acceleration  [Microsoft Research Talks: Jun 2016].

Neural networks are both computationally intensive and memory intensive, making them difficult to deploy on mobile phones and embedded systems with limited hardware resources. To address this limitation, this talk first introduces "Deep Compression", which can compress deep neural networks by 9x-49x without loss of prediction accuracy [1][2][5]. The talk then describes DSD, the "Dense-Sparse-Dense" training method that regularizes CNNs/RNNs/LSTMs to improve the prediction accuracy of a wide range of neural networks given the same model size [3]. Finally, the talk discusses EIE, the "Efficient Inference Engine", which works directly on the deep-compressed DNN model and accelerates inference by exploiting weight sparsity, activation sparsity and weight sharing, and which is 13x faster and 3000x more energy efficient than a TitanX GPU [4].

References:
[1] Han et al. Learning both Weights and Connections for Efficient Neural Networks (NIPS'15)
[2] Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding (ICLR'16, best paper award)
[3] Han et al. DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training (submitted to NIPS'16)
[4] Han et al. EIE: Efficient Inference Engine on Compressed Deep Neural Network (ISCA'16)
[5] Iandola, Han et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size (submitted to ECCV'16)

• Han S [Song Han et al. | Stanford University] (2016) DSD: Regularizing deep neural networks with dense-sparse-dense training flow. arXiv:1607.04381  |  Compressing and regularizing deep neural networks  [<< must-read!  |  OReilly.com: Nov 2016]

• Modern deep neural networks have a large number of parameters, making them very hard to train. We propose DSD, a dense-sparse-dense training flow, for regularizing deep neural networks and achieving better optimization performance. In the first D (Dense) step, we train a dense network to learn connection weights and importance. In the S (Sparse) step, we regularize the network by pruning the unimportant connections with small weights and retraining the network given the sparsity constraint. In the final D (re-Dense) step, we increase the model capacity by removing the sparsity constraint, re-initialize the pruned parameters from zero and retrain the whole dense network. Experiments show that DSD training can improve the performance for a wide range of CNNs, RNNs and LSTMs on the tasks of image classification, caption generation and speech recognition. On ImageNet, DSD improved the Top1 accuracy of GoogLeNet by 1.1%, VGG-16 by 4.3%, ResNet-18 by 1.2% and ResNet-50 by 1.1%, respectively. On the WSJ'93 dataset, DSD improved DeepSpeech and DeepSpeech2 WER by 2.0% and 1.1%. On the Flickr-8K dataset, DSD improved the NeuralTalk BLEU score by over 1.7. DSD is easy to use in practice: at training time, DSD incurs only one extra hyper-parameter: the sparsity ratio in the S step. At testing time, DSD doesn't change the network architecture or incur any inference overhead. The consistent and significant performance gain of DSD experiments shows the inadequacy of the current training methods for finding the best local optimum, while DSD effectively achieves superior optimization performance for finding a better solution. DSD models are available to download at GitHub.
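
• A compact PyTorch sketch (mine; a toy MLP and random data, not the authors' setup, and the 50% sparsity ratio is arbitrary) of the D-S-D schedule: train dense, prune the smallest weights in each tensor and retrain under that mask, then drop the mask and retrain the full dense network.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.1)
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))
loss_fn = nn.CrossEntropyLoss()

def train(steps, masks=None):
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(net(X), y).backward()
        opt.step()
        if masks is not None:                      # S step: keep pruned weights at zero
            with torch.no_grad():
                for p, m in zip(net.parameters(), masks):
                    p.mul_(m)

train(100)                                         # D: train a dense network
with torch.no_grad():                              # prune the smallest 50% per tensor
    masks = [(p.abs() >= p.abs().flatten().quantile(0.5)).float() for p in net.parameters()]
    for p, m in zip(net.parameters(), masks):
        p.mul_(m)
train(100, masks)                                  # S: retrain under the sparsity constraint
train(100)                                         # re-D: remove the constraint, retrain dense
```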

• Han S (2016) EIE: efficient inference engine on compressed deep neural network. arXiv:1602.01528  |  reddit

• State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware helps the computation, fetching weights from DRAM is two orders of magnitude more expensive than ALU operations, and dominates the required power.

Previously proposed 'Deep Compression' makes it possible to fit large DNNs (AlexNet and VGGNet) fully in on-chip SRAM. This compression is achieved by pruning the redundant connections and having multiple connections share the same weight. We propose an energy efficient inference engine (EIE) that performs inference on this compressed network model and accelerates the resulting sparse matrix-vector multiplication with weight sharing. Going from DRAM to SRAM gives EIE 120x energy saving; Exploiting sparsity saves 10x; Weight sharing gives 8x; Skipping zero activations from ReLU saves another 3x. Evaluated on nine DNN benchmarks, EIE is 189x and 13x faster when compared to CPU and GPU implementations of the same DNN without compression. EIE has a processing power of 102GOPS/s working directly on a compressed network, corresponding to 3TOPS/s on an uncompressed network, and processes FC layers of AlexNet at 1.88x10^4 frames/sec with a power dissipation of only 600mW. It is 24,000x and 3,400x more energy efficient than a CPU and GPU respectively. Compared with DaDianNao, EIE has 2.9x, 19x and 3x better throughput, energy efficiency and area efficiency.

• Talk [slides: pdf; Jun 2017]:  EIE: Efficient Inference Engine on Compressed Deep Neural Network

• Hinton G (2012) Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. pdf

• Huang G [Cornell University] (2016) Snapshot Ensembles: Train 1, Get M For Free. pdf  |  reddit

• Ensembles of neural networks are known to give far more robust and accurate predictions than any of their individual member networks. However, training multiple deep networks for model averaging is computationally expensive. In this paper, we propose a method to achieve the seemingly contradictory goal of obtaining ensembles of multiple neural networks at no additional training cost. We achieve this goal by letting a single neural network converge into several local minima along its optimization path and saving the model parameters. To obtain repeated rapid convergence, we leverage recent work on cyclic learning rate schedules. The resulting technique, which we refer to as Snapshot Ensembling, is surprisingly simple, yet effective. We show in a series of experiments that our approach is compatible with diverse network architectures and learning tasks. It consistently yields significantly lower error rates than state-of-the-art single models at no additional training cost, and almost matches the results of (far more expensive) independently trained network ensembles. On CIFAR-10 and CIFAR-100 our DenseNet Snapshot Ensembles obtain error rates of 3.4% and 17.4% respectively.
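
• A small PyTorch sketch (mine; toy model and data) of the mechanism: a cyclic learning rate (here a cosine-shaped schedule, one common choice) restarts every cycle and drives the network into a new local minimum, a snapshot is saved at the end of each cycle, and predictions average the snapshots' softmax outputs.

```python
import math
import torch, torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.2)
X, y = torch.randn(512, 20), torch.randint(0, 2, (512,))
loss_fn = nn.CrossEntropyLoss()

n_cycles, steps_per_cycle, lr_max = 3, 200, 0.2
snapshots = []
for c in range(n_cycles):
    for t in range(steps_per_cycle):
        # Cyclic (cosine-annealed) learning rate: restarts at lr_max each cycle.
        lr = 0.5 * lr_max * (1 + math.cos(math.pi * t / steps_per_cycle))
        for g in opt.param_groups:
            g["lr"] = lr
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    # Save a snapshot at the end of each cycle (one member of the ensemble).
    snapshots.append({k: v.clone() for k, v in model.state_dict().items()})

with torch.no_grad():                               # ensemble: average softmax outputs
    probs = torch.zeros(X.shape[0], 2)
    for state in snapshots:
        model.load_state_dict(state)
        probs += torch.softmax(model(X), dim=1) / len(snapshots)
```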

• Hubara I [Courbariaux M; Bengio Y] (2016) Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. arXiv:1609.07061  |  reddit  |  GitHub [Theano]  |  GitHub [Torch]

• We introduce a method to train Quantized Neural Networks (QNNs) -- neural networks with extremely low precision (e.g., 1-bit) weights and activations, at run-time. At train-time the quantized weights and activations are used for computing the parameter gradients. During the forward pass, QNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations. As a result, power consumption is expected to be drastically reduced. We trained QNNs over the MNIST, CIFAR-10, SVHN and ImageNet datasets. The resulting QNNs achieve prediction accuracy comparable to their 32-bit counterparts. For example, our quantized version of AlexNet with 1-bit weights and 2-bit activations achieves 51% top-1 accuracy. Moreover, we quantize the parameter gradients to 6-bits as well, which enables gradient computation using only bit-wise operations. Quantized recurrent neural networks were tested over the Penn Treebank dataset, and achieved accuracy comparable to their 32-bit counterparts using only 4-bits. Last but not least, we programmed a binary matrix multiplication GPU kernel with which it is possible to run our MNIST QNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The QNN code is available online.

• Iandola FN (2016) SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. arXiv:1602.07360  |  ultra-compact DNN  |  GitHub  |  GitXiv  |  mentioned here: reddit

• We propose a small DNN architecture called SqueezeNet, that achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters. Additionally, with model compression techniques we are able to compress SqueezeNet to <0.5MB (510x smaller than AlexNet).
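
• Much of the parameter saving comes from SqueezeNet's "Fire" module; a rough PyTorch sketch of that idea (mine; the channel counts below are only illustrative): a 1x1 "squeeze" layer cuts the channel count before parallel 1x1 and 3x3 "expand" layers whose outputs are concatenated.

```python
import torch, torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)          # 1x1 "squeeze"
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1) # 1x1 "expand"
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)

x = torch.randn(1, 96, 56, 56)
out = Fire(96, 16, 64, 64)(x)      # -> shape (1, 128, 56, 56)
```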

• Discussion here [this file].

• Related:  Wu C (2017) A Compact DNN: Approaching GoogLeNet-Level Accuracy of Classification and Domain Adaptation. arXiv:1703.04071

• Recently, DNN model compression based on network architecture design, e.g., SqueezeNet, attracted a lot of attention. No accuracy drop on image classification is observed on these extremely compact networks, compared to well-known models. An emerging question, however, is whether these model compression techniques hurt DNN's learning ability other than classifying images on a single dataset. Our preliminary experiment shows that these compression methods could degrade domain adaptation (DA) ability, though the classification performance is preserved. Therefore, we propose a new compact network architecture and unsupervised DA method in this paper. The DNN is built on a new basic module Conv-M which provides more diverse feature extractors without significantly increasing parameters. The unified framework of our DA method will simultaneously learn invariance across domains, reduce divergence of feature representations, and adapt label prediction. Our DNN has 4.1M parameters, which is only 6.7% of AlexNet or 59% of GoogLeNet. Experiments show that our DNN obtains GoogLeNet-level accuracy both on classification and DA, and our DA method slightly outperforms previous competitive ones. Put together, our DA strategy based on our DNN achieves state-of-the-art results on sixteen of the eighteen DA tasks on the popular Office-31 and Office-Caltech datasets.

• Ithapu VK (2015) On the interplay of network structure and gradient convergence in deep learning. arXiv:1511.05297

• The regularization and output consistency behavior of dropout and layer-wise pretraining for learning deep networks have been fairly well studied. However, our understanding of how the asymptotic convergence of backpropagation in deep architectures is related to the structural properties of the network and other design choices (like denoising and dropout rate) is less clear at this time. An interesting question one may ask is whether the network architecture and input data statistics may guide the choices of learning parameters and vice versa. In this work, we explore the association between such structural, distributional and learnability aspects vis-a-vis their interaction with parameter convergence rates. We present a framework to address these questions based on the backpropagation convergence for general nonconvex objectives using first-order information. This analysis suggests an interesting relationship between feature denoising and dropout. Building upon the results, we obtain a setup that provides systematic guidance regarding the choice of learning parameters and network sizes that achieve a certain level of convergence (in the optimization sense) often mediated by statistical attributes of the inputs. Our results are supported by a set of experiments we conducted as well as independent empirical observations reported by other groups in recent papers.

• Discussion - Conclusions

• Jégou S [Yoshua Bengio] (2016) The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. arXiv:1611.09326
• State-of-the-art approaches for semantic image segmentation are built on Convolutional Neural Networks (CNNs). The typical segmentation architecture is composed of (a) a downsampling path responsible for extracting coarse semantic features, followed by (b) an upsampling path trained to recover the input image resolution at the output of the model and, optionally, (c) a post-processing module (e.g. Conditional Random Fields) to refine the model predictions.

Recently, a new CNN architecture, Densely Connected Convolutional Networks (DenseNets), has shown excellent results on image classification tasks. The idea of DenseNets is based on the observation that if each layer is directly connected to every other layer in a feed-forward fashion then the network will be more accurate and easier to train.

In this paper, we extend DenseNets to deal with the problem of semantic segmentation. We achieve state-of-the-art results on urban scene benchmark datasets such as CamVid and Gatech, without any further post-processing module or pretraining. Moreover, due to smart construction of the model, our approach has far fewer parameters than the currently published best entries for these datasets.

• Kanhabua N (2016) Learning Dynamic Classes of Events using Stacked Multilayer Perceptron Networks. arXiv:1606.07219

• People often use a web search engine to find information about events of interest, for example, sport competitions, political elections, festivals and entertainment news. In this paper, we study a problem of detecting event-related queries, which is the first step before selecting a suitable time-aware retrieval model. In general, event-related information needs can be observed in query streams through various temporal patterns of user search behavior, e.g., spiky peaks for popular events, and periodicities for repetitive events. However, it is also common that users search for non-popular events, which may not exhibit temporal variations in query streams, e.g., past events recently occurred, historical events triggered by anniversaries or similar events, and future events anticipated to happen. To address the challenge of detecting dynamic classes of events, we propose a novel deep learning model to classify a given query into a predetermined set of multiple event types. Our proposed model, a Stacked Multilayer Perceptron (S-MLP) network, consists of multilayer perceptrons used as the basic learning unit. We assemble stacked units to further learn complex relationships between neurons in successive layers. To evaluate our proposed model, we conduct experiments using real-world queries and a set of manually created ground truth. Preliminary results have shown that our proposed deep learning model outperforms the state-of-the-art classification models significantly.

• Related:

• Bosc T (2016) Learning to Learn Neural Networks. arXiv:1610.06072

• Meta-learning consists in learning learning algorithms. We use a Long Short Term Memory (LSTM) based network to learn to compute on-line updates of the parameters of another neural network. These parameters are stored in the cell state of the LSTM. Our framework allows us to compare learned algorithms to hand-made algorithms within the traditional train and test methodology. In an experiment, we learn a learning algorithm for a one-hidden layer Multi-Layer Perceptron (MLP) on non-linearly separable datasets. The learned algorithm is able to update parameters of both layers and generalise well on similar datasets.

• Ludwig (2016) Deep Embedding for Spatial Role Labeling. arXiv:1603.08474

• Kardan N [University of Central Florida] (2016) Fitted Learning: Models with Awareness of their Limits. arXiv:1609.02226  |  reddit

• Though deep learning has pushed the boundaries of classification forward, in recent years hints of the limits of standard classification have begun to emerge. Problems such as fooling, adding new classes over time, and the need to retrain learning models only for small changes to the original problem all point to a potential shortcoming in the classic classification regime, where a comprehensive a priori knowledge of the possible classes or concepts is critical. Without such knowledge, classifiers misjudge the limits of their knowledge and overgeneralization therefore becomes a serious obstacle to consistent performance. In response to these challenges, this paper extends the classic regime by reframing classification instead with the assumption that concepts present in the training set are only a sample of the hypothetical final set of concepts. To bring learning models into this new paradigm, a novel elaboration of standard architectures called the competitive overcomplete output layer (COOL) neural network is introduced. Experiments demonstrate the effectiveness of COOL by applying it to fooling, separable concept learning, one-class neural networks, and standard classification benchmarks. The results suggest that, unlike conventional classifiers, the amount of generalization in COOL networks can be tuned to match the problem.

• Kuen J (2016) DelugeNets: Deep Networks with Massive and Flexible Cross-layer Information Inflows. arXiv:1611.05552  |  GitHub  |  GitXiv  |  reddit

• Deluge Networks (DelugeNets) are deep neural networks which efficiently facilitate massive cross-layer information inflows from preceding layers to succeeding layers. The connections between layers in DelugeNets are established through cross-layer depthwise convolutional layers with learnable filters, acting as a flexible yet efficient selection mechanism. DelugeNets can propagate information across many layers with greater flexibility and utilize network parameters more effectively compared to ResNets, whilst being more efficient than DenseNets. Remarkably, a DelugeNet model with a model complexity of just 4.31 GigaFLOPs and 20.2M network parameters achieves classification errors of 3.76% and 19.02% on the CIFAR-10 and CIFAR-100 datasets, respectively. Moreover, DelugeNet-122 performs competitively to ResNet-200 on the ImageNet dataset, despite costing merely half of the computations needed by the latter.

• Kurach K [Sutskever I] (2015) Neural random-access machines. [memory tape; curriculum learning; entropy; ...] arXiv:1511.06392  |  reddit

• In this paper, we propose and investigate a new neural network architecture called Neural Random Access Machine. It can manipulate and dereference pointers to an external variable-size random-access memory. The model is trained from pure input-output examples using backpropagation.

We evaluate the new model on a number of simple algorithmic tasks whose solutions require pointer manipulation and dereferencing. Our results show that the proposed model can learn to solve algorithmic tasks of such type and is capable of operating on simple data structures like linked-lists and binary trees. For easier tasks, the learned solutions generalize to sequences of arbitrary length. Moreover, memory access during inference can be done in a constant time under some assumptions.

• See also:  Karol Kurach, Marcin Andrychowicz & Ilya Sutskever. Neural Random-Access Machines  [ local copy  (html) ]

• Laha A [IBM Research India] (2016) An Empirical Evaluation of various Deep Learning Architectures for Bi-Sequence Classification Tasks. arXiv:1607.04853
• Several tasks in argumentation mining and debating, question-answering, and natural language inference involve classifying a sequence in the context of another sequence (referred to as bi-sequence classification). For several single sequence classification tasks, the current state-of-the-art approaches are based on recurrent and convolutional neural networks. On the other hand, for bi-sequence classification problems, there is not much understanding as to the best deep learning architecture. In this paper, we attempt to get an understanding of this category of problems by extensive empirical evaluation of 20 different deep learning architectures (specifically on different ways of handling context) for various problems originating in natural language processing like debating, textual entailment and question-answering. Following the empirical evaluation, we offer our insights and conclusions regarding the architectures we have considered. We also establish the first deep learning baselines for three argumentation mining tasks.

• LeCun Y (2006) A tutorial on energy-based learning. pdf  |  reddit
• Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods. Probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations. Since EBMs have no requirement for proper normalization, this problem is naturally circumvented. EBMs can be viewed as a form of non-probabilistic factor graphs, and they provide considerably more flexibility in the design of architectures and training criteria than probabilistic approaches.

• Li Y (2015) Convergent Learning: Do different neural networks learn the same representations? arXiv:1511.07543  |  GitXiv
• Recent success in training deep neural networks has prompted active investigation into the features learned on their intermediate layers. Such research is difficult because it requires making sense of non-linear computations performed by millions of parameters, but valuable because it increases our ability to understand current models and create improved versions of them. In this paper we investigate the extent to which neural networks exhibit what we call convergent learning, which is when the representations learned by multiple nets converge to a set of features which are either individually similar between networks or where subsets of features span similar low-dimensional spaces. We propose a specific method of probing representations: training multiple networks and then comparing and contrasting their individual, learned representations at the level of neurons or groups of neurons. We begin research into this question using three techniques to approximately align different neural networks on a feature level: a bipartite matching approach that makes one-to-one assignments between neurons, a sparse prediction approach that finds one-to-many mappings, and a spectral clustering approach that finds many-to-many mappings. This initial investigation reveals a few previously unknown properties of neural networks, and we argue that future research into the question of convergent learning will yield many more. The insights described here include (1) that some features are learned reliably in multiple networks, yet other features are not consistently learned; (2) that units learn to span low-dimensional subspaces and, while these subspaces are common to multiple networks, the specific basis vectors learned are not; (3) that the representation codes show evidence of being a mix between a local code and slightly, but not fully, distributed codes across multiple units.
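
• A small sketch (mine; random activations stand in for real ones, and scipy is assumed to be available) of the one-to-one alignment technique: correlate every unit in one network with every unit in the other over the same inputs, then solve a bipartite matching that maximizes total correlation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
# Stand-ins for the activations of one layer in two independently trained nets,
# recorded over the same 1000 inputs (one column per unit).
acts1 = rng.normal(size=(1000, 64))
acts2 = rng.normal(size=(1000, 64))

a = (acts1 - acts1.mean(0)) / acts1.std(0)
b = (acts2 - acts2.mean(0)) / acts2.std(0)
corr = a.T @ b / len(a)                      # unit-by-unit correlation matrix

rows, cols = linear_sum_assignment(-corr)    # one-to-one matching, max total correlation
print(corr[rows, cols].mean())               # average agreement of matched units
```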

• Liang S [University of Illinois Urbana-Champaign] (2016) Why Deep Neural Networks? arXiv:1610.04161
• Recently there has been much interest in understanding why deep neural networks are preferred to shallow networks. In this paper, we show that, for a large class of piecewise smooth functions, the number of neurons needed by a shallow network to approximate a function is exponentially larger than the corresponding number of neurons needed by a deep network for a given degree of function approximation. First, we consider univariate functions on a bounded interval and require a neural network to achieve an approximation error of $\small ε$ uniformly over the interval. We show that shallow networks (i.e., networks whose depth does not depend on $\small ε$) require $\small Ω(poly(1/ε))$ neurons while deep networks (i.e., networks whose depth grows with $\small 1/ε$) require $\small O(polylog(1/ε))$ neurons. We then extend these results to certain classes of important multivariate functions. Our results are derived for neural networks which use a combination of rectifier linear units (ReLUs) and binary step units, two of the most popular types of activation functions. Our analysis builds on a simple observation: the multiplication of two bits can be represented by a ReLU.
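
• A one-line worked instance of that closing observation (my note, not from the paper): for bits $\small x, y \in \{0,1\}$ we have $\small xy = \mathrm{ReLU}(x + y - 1)$, since $\small x + y - 1$ equals 1 only when both bits are 1, and is at most 0 otherwise.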

• Mhaskar H (2016) Deep vs. shallow networks: An approximation theory perspective. arXiv:1608.03287  |  reddit

• The paper briefly reviews several recent results on hierarchical architectures for learning from examples, that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better in function approximation problems than shallow, one-hidden layer architectures. The paper announces new results for a non-smooth activation function - the ReLU function - used in present-day neural networks, as well as for the Gaussian networks. We propose a new definition of relative dimension to encapsulate different notions of sparsity of a function class that can possibly be exploited by deep networks but not by shallow ones to drastically reduce the complexity required for approximation and learning.

• reddit: The hierarchical softmax is probably the most popular application of binary trees in neural networks, but they are widely used in recursive neural networks (an old concept, but recently popularised by Richard Socher).

• Raghu M [Google Brain] (2016) On the expressive power of deep neural networks. arXiv:1606.05336

• Saxena S (2016) Convolutional Neural Fabrics. arXiv:1606.02492  |  reddit

• Despite the success of CNNs, selecting the optimal architecture for a given task remains an open problem. Instead of aiming to select a single optimal architecture, we propose a "fabric" that embeds an exponentially large number of architectures. The fabric consists of a 3D trellis that connects response maps at different layers, scales, and channels with a sparse homogeneous local connectivity pattern. The only hyper-parameters of a fabric are the number of channels and layers. While individual architectures can be recovered as paths, the fabric can in addition ensemble all embedded architectures together, sharing their weights where their paths overlap. Parameters can be learned using standard methods based on back-propagation, at a cost that scales linearly in the fabric size. We present benchmark results competitive with the state of the art for image classification on MNIST and CIFAR10, and for semantic segmentation on the Part Labels dataset.

• Shafiee MJ [U Waterloo] (2016) Deep Learning with Darwin: Evolutionary Synthesis of Deep Neural Networks. arXiv:1606.04393
• Taking inspiration from biological evolution, we explore the idea of "Can deep neural networks evolve naturally over successive generations into highly efficient deep neural networks?" by introducing the notion of synthesizing new highly efficient, yet powerful deep neural networks over successive generations via an evolutionary process from ancestor deep neural networks. The architectural traits of ancestor deep neural networks are encoded using synaptic probability models, which can be viewed as the 'DNA' of these networks. New descendant networks with differing network architectures are synthesized based on these synaptic probability models from the ancestor networks and computational environmental factor models, in a random manner to mimic heredity, natural selection, and random mutation. These offspring networks are then trained into fully functional networks, like one would train a newborn, and have more efficient, more diverse network architectures than their ancestor networks, while achieving powerful modeling capabilities. Experimental results for the task of visual saliency demonstrated that the synthesized 'evolved' offspring networks can achieve state-of-the-art performance while having network architectures that are significantly more efficient (with a staggering ∼48-fold decrease in synapses by the fourth generation) compared to the original ancestor network.

• Srinivas S (2015) Learning the Architecture of Deep Neural Networks. arXiv:1511.05497  |  DNN, features, pruning neurons (nodes)

• Deep neural networks with millions of parameters are at the heart of many state of the art machine learning models today. However, recent works have shown that models with much smaller number of parameters can also perform just as well. In this work, we introduce the problem of architecture-learning, i.e., learning the architecture of a neural network along with weights. We introduce a new trainable parameter called tri-state ReLU, which helps in eliminating unnecessary neurons. We also propose a smooth regularizer which encourages the total number of neurons after elimination to be small. The resulting objective is differentiable and simple to optimize. We experimentally validate our method on both small and large networks, and show that it can learn models with a considerably smaller number of parameters without affecting prediction accuracy.

• More on pruning here:

• current state of the art in neural network pruning? (reddit)

• It's not done, generally. If your goal is to regularize with pruning then there are more principled methods. If your goal is to reduce computation then you're out of luck, as you'd be turning regular cache-friendly operations into irregular ones; better bets are to binarize/evaluate in lower precision or to use model compression to obtain a more compact network that mimics the more powerful network.

• So things like "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding" are not used in practice?

• This is a research paper that is less than 7 months old. It usually takes a lot longer for a technique to become commonplace, and the benefit really has to be quite strong relative to the engineering investment required in getting it to work.

• Szegedy C [Sutskever I; Goodfellow I] (2013) Intriguing properties of neural networks. arXiv:1312.6199

• DNN are highly expressive models that have recently achieved state of the art performance on speech and visual recognition tasks. While their expressiveness is the reason they succeed, it also causes them to learn uninterpretable solutions that could have counter-intuitive properties. In this paper we report two such properties.
First, we find that there is no distinction between individual high level units and random linear combinations of high level units, according to various methods of unit analysis. It suggests that it is the space, rather than the individual units, that contains the semantic information in the high layers of neural networks.
Second, we find that deep neural networks learn input-output mappings that are fairly discontinuous to a significant extent. We can cause the network to misclassify an image by applying a certain imperceptible perturbation, which is found by maximizing the network's prediction error. In addition, the specific nature of these perturbations is not a random artifact of learning: the same perturbation can cause a different network, that was trained on a different subset of the dataset, to misclassify the same input.

• How to interpret - natural basis vector associated with hidden unit? [reddit]

$\small x' = \arg\max_{x} \langle \phi(x), e_i \rangle$ means "find the input (image) $\small x$ that results in maximal output of unit $\small i$"

• Zagoruyko S (2017) DiracNets: Training Very Deep Neural Networks Without Skip-Connections. arXiv:1706.00388  |  GitHub  |  reddit

• Deep neural networks with skip-connections, such as ResNet, show excellent performance in various image classification benchmarks. It has been observed, though, that the initial motivation behind them - training deeper networks - does not actually hold true, and that the benefits come from increased capacity rather than from depth. Motivated by this, and inspired from ResNet, we propose a simple Dirac weight parameterization, which allows us to train very deep plain networks without skip-connections, and achieve nearly the same performance. This parameterization has a minor computational cost at training time and no cost at all at inference. We're able to achieve 95.5% accuracy on CIFAR-10 with a 34-layer deep plain network, surpassing 1001-layer deep ResNet, and approaching Wide ResNet. Our parameterization also mostly eliminates the need of careful initialization in residual and non-residual networks. The code and models for our experiments are available here.
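
• A rough PyTorch sketch (mine; the paper's exact formulation also normalizes the learned weights) of the Dirac parameterization: the effective filter is $\small a \cdot I + b \cdot W$, where $\small I$ is the identity (Dirac delta) kernel, so at initialization the layer already behaves like a skip connection without having one explicitly.

```python
import torch, torch.nn as nn, torch.nn.functional as F

class DiracConv2d(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(channels, channels, kernel_size, kernel_size))
        self.a = nn.Parameter(torch.ones(1))             # scale of the identity part
        self.b = nn.Parameter(torch.full((1,), 0.1))     # scale of the learned part
        delta = torch.zeros_like(self.W)
        nn.init.dirac_(delta)                            # identity (Dirac delta) kernel
        self.register_buffer("delta", delta)

    def forward(self, x):
        weight = self.a * self.delta + self.b * self.W   # Dirac-parameterized filter
        return F.conv2d(x, weight, padding=self.W.shape[-1] // 2)

x = torch.randn(1, 16, 8, 8)
y = F.relu(DiracConv2d(16)(x))
```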

• Zamir AR [Stanford University; UC Berkeley] (2016) Feedback Networks. arXiv:1612.09508  |  LSTMs → Highway (Residual) Networks → LSTMs  |  reddit

• Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one of such successive representations. However, an alternative that can achieve the same goal is a feedback based approach in which the representation is formed in an iterative manner based on a feedback received from previous iteration's output.

We establish that a feedback based approach has several fundamental advantages over feedforward: it enables making early predictions at the query time, its output naturally conforms to a hierarchical structure in the label space (e.g. a taxonomy), and it provides a new basis for Curriculum Learning. We observe that feedback networks develop a considerably different representation compared to feedforward counterparts, in line with the aforementioned advantages. We put forth a general feedback based learning architecture with the endpoint results on par or better than existing feedforward networks with the addition of the above advantages. We also investigate several mechanisms in feedback architectures (e.g. skip connections in time) and design choices (e.g. feedback length). We hope this study offers new perspectives in quest for more natural and practical learning models.

• Zhang X [LeCun Y] (2015) Universum Prescription: Regularization using Unlabeled Data. arXiv:1511.03719

• This paper shows that simply prescribing "none of the above" labels to unlabeled data has a beneficial regularization effect on supervised learning. We call it universum prescription by the fact that the prescribed labels cannot be one of the supervised labels. In spite of its simplicity, universum prescription obtained competitive results in training deep convolutional networks for CIFAR-10, CIFAR-100, STL-10 and ImageNet datasets. A qualitative justification of these approaches using Rademacher complexity is presented. The effect of a regularization parameter -- probability of sampling from unlabeled data -- is also studied empirically.

• Mentioned here?  Is Classless Data Useful For Training Neural Nets [reddit]: Yes; arXiv:1511.03719. The idea is basically that you add them to your train set under a "none-of-the-above" class and this acts as a regularizer.

• Zhou Z-H [Nanjing University, China] (2017) Deep Forest: Towards An Alternative to Deep Neural Networks. arXiv:1702.08835  |  reddit

• In this paper, we propose gcForest, a decision tree ensemble approach with performance highly competitive to deep neural networks. In contrast to deep neural networks which require great effort in hyper-parameter tuning, gcForest is much easier to train. Actually, even when gcForest is applied to different data from different domains, excellent performance can be achieved with almost the same hyper-parameter settings. The training process of gcForest is efficient and scalable. In our experiments its training time running on a PC is comparable to that of deep neural networks running with GPU facilities, and the efficiency advantage may be more apparent because gcForest is naturally apt to parallel implementation. Furthermore, in contrast to deep neural networks which require large-scale training data, gcForest can work well even when there is only small-scale training data. Moreover, as a tree-based approach, gcForest should be easier for theoretical analysis than deep neural networks.

## ARCHITECTURES - RBM | DBN | MLP | ENERGY-BASED MODELS ...

• Victoria: See early papers (late 1990s) by Yoshua Bengio and Geoffrey Hinton ... Bengio (LISA Lab: UdeM) developed the widely regarded Theano ML platform, which employs a powerful symbolic programming approach. This tutorial introduces Energy-Based Models (EBM).

[Architectures:RBM|DBN|MLP|EBM] Blogs:

• Dreaming of names with RBMs  |  GitHub  |  reddit:blog author  |  reddit

• A classic problem in natural language processing is named entity recognition. Given a text, we have to identify the proper nouns. But what about the generative mirror image of this problem - i.e. named entity generation? What if we ask a model to dream up new names of people, places and things?

I wrote some code to do this using restricted Boltzmann machines, a nifty (if passé) variety of generative neural network. It turns out they come up with some funny stuff! For example, if we train an RBM on GitHub repository names, it can come up with new ones like ... If you want to learn about how I got there, read on. In this post, I'll give a brief overview of restricted Boltzmann machines and how I applied them to this problem, and try to give some intuition about what's going on in the brain of one of these models.

My code is available here on GitHub. Feel free to play with it (with the caveat that it's more of a research notebook than a polished library).

• Foundations: Mean Field Boltzmann Machines 1987

• A friend from grad school pointed out a great foundational paper on Boltzmann Machines. It is a 1987 paper from complex systems theory,

Peterson C (1987) A Mean Field Theory Learning Algorithm for Neural Networks  [pdf],

just a couple years after Hinton's seminal 1985 paper,

Ackley DH [Hinton GE] (1985) A learning algorithm for Boltzmann machines  [pdf].

What I really like is seeing how the foundations of deep learning arose from statistical physics and theoretical chemistry. My top 11 favorite takeaways are: ...

• Improving RBMs with physical chemistry

• "... In this post, I am going to discuss a recent advanced in RBM theory based on ideas from theoretical condensed matter physics and physical chemistry ..."  |  includes discussion of energy-based models ...

• On Cheap Learning: Partition Functions and RBMs: "Why does deep and cheap learning work so well?" This is the question posed by a recent article. Deep Learning seems to require knowing the Partition Function - at least in old-fashioned Restricted Boltzmann Machines (RBMs). Here, I will discuss some aspects of this paper, in the context of RBMs. ...

• Why does Deep Learning work?  |  "DNN, convex landscapes; SGD, spin funnels, saddle points"  |  reddit

• Why Deep Learning Works II: the Renormalization Group

[Architectures:RBM|DBN|MLP|EBM] Instruction:

[Architectures:RBM|DBN|MLP|EBM] Papers:

• Belanger D [McCallum A] (2015) Structured Prediction Energy Networks. arXiv:1511.06350

• We introduce structured prediction energy networks (SPENs), a flexible framework for structured prediction. A deep architecture is used to define an energy function of candidate labels, and then predictions are produced by using back-propagation to iteratively optimize the energy with respect to the labels. This deep architecture captures dependencies between labels that would lead to intractable graphical models, and performs structure learning by automatically learning discriminative features of the structured output. One natural application of our technique is multi-label classification, which traditionally has required strict prior assumptions about the interactions between labels to ensure tractable learning and prediction. We are able to apply SPENs to multi-label problems with substantially larger label sets than previous applications of structured prediction, while modeling high-order interactions using minimal structural assumptions. Overall, deep learning provides remarkable tools for learning features of the inputs to a prediction problem, and this work extends these techniques to learning features of structured outputs. Our experiments provide impressive performance on a variety of benchmark multi-label classification tasks, demonstrate that our technique can be used to provide interpretable structure learning, and illuminate fundamental trade-offs between feed-forward and iterative structured prediction.
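
• A minimal PyTorch sketch (mine; the energy network here is untrained and its architecture is made up) of the inference step described above: relax the multi-label output to [0, 1] and run gradient descent on the labels to minimize the learned energy. Training would then shape this energy so that its minimizers match the true labels.

```python
import torch, torch.nn as nn

torch.manual_seed(0)
N_FEATURES, N_LABELS = 10, 5

# Hypothetical energy network: scores an (input features, candidate labels) pair.
energy_net = nn.Sequential(nn.Linear(N_FEATURES + N_LABELS, 32), nn.Softplus(), nn.Linear(32, 1))

def predict(x, steps=50, lr=0.1):
    logits = torch.zeros(N_LABELS, requires_grad=True)    # relaxed labels, via sigmoid
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        y = torch.sigmoid(logits)
        energy = energy_net(torch.cat([x, y])).sum()
        opt.zero_grad()
        energy.backward()                                 # back-propagate w.r.t. the labels
        opt.step()
    return (torch.sigmoid(logits) > 0.5).int()

x = torch.randn(N_FEATURES)
print(predict(x))
```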

• Cited in this blog post: A quick comment on structured input vs structured output learning:

... The observation is that these two problems are essentially the same thing. That is, if you know how to do the structured input problem, then the structured output problem is essentially the same thing, as far as the learning problem goes. That is, if you can put structure in f(x) for structured input, you can just as well put structure in s(x,y) for structured output. Or, by example, if you can predict the fluency of an English sentence x as a structured input problem, you can predict the translation quality of a French/English sentence pair x,y in a structured output problem. This doesn't solve the argmax problem -- you have to do that separately -- but the underlying learning problem is essentially identical.

You see similar ideas being reborn these days with papers like David Belanger's ICML paper this year on energy networks. With this framework of think-of-structured-input-and-structured-output-as-the-same, basically what they're doing is building a structured score function that uses both the input and output simultaneously, and throwing these through a deep network. ...

• Follow-on paper describes limitations/solutions to their 2015 paper, above:  Belanger D [McCallum A] (2017) arXiv:1703.05667

• Structured Prediction Energy Networks (Belanger and McCallum, 2016) (SPENs) are a simple, yet expressive family of structured prediction models. An energy function over candidate structured outputs is given by a deep network, and predictions are formed by gradient-based optimization. This paper presents end-to-end learning for SPENs, where the energy function is discriminatively trained by back-propagating through gradient-based prediction. In our experience, the approach is substantially more accurate than the structured SVM method of Belanger and McCallum (2016), as it allows us to use more sophisticated non-convex energies. We provide a collection of techniques for improving the speed, accuracy, and memory requirements of end-to-end SPENs, and demonstrate the power of our method on 7-Scenes image denoising and CoNLL-2005 semantic role labeling tasks. In both, inexact minimization of non-convex SPEN energies is superior to baseline methods that use simplistic energy functions that can be minimized exactly.

• Bengio Y (2007) Greedy layer-wise training of deep networks. pdf

• Complexity theory of circuits strongly suggests that deep architectures can be much more efficient (sometimes exponentially) than shallow architectures, in terms of computational elements required to represent some functions. Deep multi-layer neural networks have many levels of non-linearities allowing them to compactly represent highly non-linear and highly-varying functions. However, until recently it was not clear how to train such deep networks, since gradient-based optimization starting from random initialization appears to often get stuck in poor solutions. Hinton et al. recently introduced a greedy layer-wise unsupervised learning algorithm for Deep Belief Networks (DBN), a generative model with many layers of hidden causal variables. In the context of the above optimization problem, we study this algorithm empirically and explore variants to better understand its success and extend it to cases where the inputs are continuous or where the structure of the input distribution is not revealing enough about the variable to be predicted in a supervised task. Our experiments also confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

• Choromanska A [LeCun Y] (2014) The loss surfaces of multilayer networks. arXiv:1412.0233

• We study the connection between the highly non-convex loss function of a simple model of the fully-connected feed-forward neural network and the Hamiltonian of the spherical spin-glass model under the assumptions of: i) variable independence, ii) redundancy in network parametrization [Victoria: sic: "Parametrization (or parameterization; also parameterisation, parametrisation) is the process of deciding and defining the parameters necessary for a complete or relevant specification of a model or geometric object."], and iii) uniformity. These assumptions enable us to explain the complexity of the fully decoupled neural network through the prism of the results from random matrix theory. We show that for large-size decoupled networks the lowest critical values of the random loss function form a layered structure and they are located in a well-defined band lower-bounded by the global minimum. The number of local minima outside that band diminishes exponentially with the size of the network. We empirically verify that the mathematical model exhibits similar behavior as the computer simulations, despite the presence of high dependencies in real networks. We conjecture that both simulated annealing and SGD converge to the band of low critical points, and that all critical points found there are local minima of high quality measured by the test error. This emphasizes a major difference between large- and small-size networks where for the latter poor quality local minima have non-zero probability of being recovered. Finally, we prove that recovering the global minimum becomes harder as the network size increases and that it is in practice irrelevant as global minimum often leads to overfitting.

• This blog post [reblogged here], that mentions spin-glass models [ Why, theoretically, do deep neural nets scale so well with training data?: Quora.com], cites the LeCun [arXiv:1412.0233] research paper, above.

• That Quora.com post also mentions:
• Goodfellow IJ [Vinyals O; Saxe AM | Google] (2015) Qualitatively characterizing neural network optimization problems. arXiv:1412.6544

• Related?  Sohl-Dickstein J [Stanford University | UC-Berkeley] (2015) Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585  |  Some Thoughts about "Deep Unsupervised Learning using Nonequilibrium Thermodynamics"

• A central problem in machine learning involves modeling complex data-sets using highly flexible families of probability distributions in which learning, sampling, inference, and evaluation are still analytically or computationally tractable. Here, we develop an approach that simultaneously achieves both flexibility and tractability. The essential idea, inspired by non-equilibrium statistical physics, is to systematically and slowly destroy structure in a data distribution through an iterative forward diffusion process. We then learn a reverse diffusion process that restores structure in data, yielding a highly flexible and tractable generative model of the data. This approach allows us to rapidly learn, sample from, and evaluate probabilities in deep generative models with thousands of layers or time steps, as well as to compute conditional and posterior probabilities under the learned model. We additionally release an open source reference implementation of the algorithm.

• Courbariaux M [Bengio Y] (2016) Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830  [fast MNIST MLP]  |  NIPS Proceedings  |  GitHub  |  GitXiv  |  reddit: Best work in Deep Learning. Deserves a perfect 1/1.  |  GitXiv  |  binary_net by chainer [GitHub]

• We introduce a method to train Binarized Neural Networks (BNN) - neural networks with binary weights and activations at run-time and when computing the parameters' gradient at train-time. We conduct two sets of experiments, each based on a different framework, namely Torch7 and Theano, where we train BNNs on MNIST, CIFAR-10 and SVHN, and achieve nearly state-of-the-art results. During the forward pass, BNNs drastically reduce memory size and accesses, and replace most arithmetic operations with bit-wise operations, which might lead to a great increase in power-efficiency. Last but not least, we wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy. The code for training and running our BNNs is available.
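
• The training trick that makes this possible is worth a sketch (mine, simplified: a full BNN binarizes both weights and activations and keeps real-valued weights for the parameter update): binarize with sign() in the forward pass, and pass the gradient straight through (clipped to |x| <= 1) in the backward pass.

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        # sign(x) with sign(0) := +1, so outputs are exactly {-1, +1}
        return torch.where(x >= 0, torch.ones_like(x), -torch.ones_like(x))

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()   # straight-through estimator (hard tanh)

x = torch.randn(5, requires_grad=True)
y = BinarizeSTE.apply(x)
y.sum().backward()
print(y, x.grad)
```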

• Related:  Hubara I [Courbariaux M; Bengio Y] (2016) Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations. arXiv:1609.07061

• Related:  Accelerating Neural Networks with Binary Arithmetic  |  local copy [pdf]

• At Nervana we are deeply interested in algorithmic and hardware improvements for speeding up neural networks. One particularly exciting area of research is in low precision arithmetic. In this blog post, we highlight one particular class of low precision networks named binarized neural networks (BNNs), the fundamental concepts underlying this class, and introduce a Neon CPU and GPU implementation. BNNs achieve accuracy comparable to that of standard neural networks on a variety of datasets.

• BNNs use binary weights and activations for all computations. Floating point arithmetic underlies all computations in deep learning, including computing gradients, applying parameter updates, and calculating activations. These 32 bit floating point multiplications, however, are very expensive. In BNNs, floating point multiplications are supplanted with bitwise XNORs and left and right bit shifts. This is extremely attractive from a hardware perspective: binary operations can be implemented computationally efficiently at a low power cost.
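
• A tiny numpy check (mine) of why that hardware view works: with weights and activations in {-1, +1} stored as bits {0, 1}, a dot product reduces to an XNOR plus a popcount.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
a_bits = rng.integers(0, 2, n)          # activations, bit-encoded (+1 -> 1, -1 -> 0)
w_bits = rng.integers(0, 2, n)          # weights, bit-encoded

# dot(a, w) = (#matching bits) - (#differing bits) = n - 2 * popcount(a XOR w)
dot_bitwise = n - 2 * np.bitwise_xor(a_bits, w_bits).sum()

a_pm, w_pm = 2 * a_bits - 1, 2 * w_bits - 1
assert dot_bitwise == a_pm @ w_pm       # agrees with the ordinary +/-1 dot product
```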

• Do K (2016) Outlier Detection on Mixed-Type Data: An Energy-based Approach. arXiv:1608.04830
• Outlier detection amounts to finding data points that differ significantly from the norm. Classic outlier detection methods are largely designed for a single data type, such as continuous or discrete. However, real world data is increasingly heterogeneous, where a data point can have both discrete and continuous attributes. Handling mixed-type data in a disciplined way remains a great challenge. In this paper, we propose a new unsupervised outlier detection method for mixed-type data based on Mixed-variate Restricted Boltzmann Machine (Mv.RBM). The Mv.RBM is a principled probabilistic method that models data density. We propose to use free-energy derived from Mv.RBM as the outlier score to detect outliers as those data points lying in low density regions. The method is fast to learn and compute, and is scalable to massive datasets. At the same time, the outlier score is identical to the data negative log-density up to an additive constant. We evaluate the proposed method on synthetic and real-world datasets and demonstrate that (a) proper handling of mixed types is necessary in outlier detection, and (b) the free-energy of Mv.RBM is a powerful and efficient outlier scoring method, which is highly competitive against the state of the art.

• Eryilmaz SB (2016) Training a Probabilistic Graphical Model with Resistive Switching Electronic Synapses. arXiv:1609.08686
• Current large scale implementations of deep learning and data mining require thousands of processors, massive amounts of off-chip memory, and consume gigajoules of energy. Emerging memory technologies such as nanoscale two-terminal resistive switching memory devices offer a compact, scalable and low power alternative that permits on-chip co-located processing and memory in fine-grain distributed parallel architecture. Here we report the first use of resistive switching memory devices for implementing and training a Restricted Boltzmann Machine (RBM), a generative probabilistic graphical model as a key component for unsupervised learning in deep networks. We experimentally demonstrate a 45-synapse RBM realized with 90 resistive switching phase change memory (PCM) elements trained with a bio-inspired variant of the Contrastive Divergence (CD) algorithm, implementing Hebbian and anti-Hebbian weight updates. The resistive PCM devices show a two-fold to ten-fold reduction in error rate in a missing pixel pattern completion task trained over 30 epochs, compared to the untrained case. Measured programming energy consumption is 6.1 nJ per epoch with the resistive switching PCM devices, a factor of ~150 times lower than conventional processor-memory systems. We analyze and discuss the dependence of learning performance on cycle-to-cycle variations as well as number of gradual levels in the PCM analog memory devices.

• Finn C [UC - Berkeley] (2016) A Connection between Generative Adversarial Networks, Inverse Reinforcement Learning, and Energy-Based Models. https://arxiv.org/abs/1611.03852
• Generative adversarial networks (GANs) are a recently proposed class of generative models in which a generator is trained to optimize a cost function that is being simultaneously learned by a discriminator. While the idea of learning cost functions is relatively new to the field of generative modeling, learning costs has long been studied in control and reinforcement learning (RL) domains, typically for imitation learning from demonstrations. In these fields, learning the cost function underlying observed behavior is known as inverse reinforcement learning (IRL) or inverse optimal control. While at first the connection between cost learning in RL and cost learning in generative modeling may appear to be a superficial one, we show in this paper that certain IRL methods are in fact mathematically equivalent to GANs. In particular, we demonstrate an equivalence between a sample-based algorithm for maximum entropy IRL and a GAN in which the generator's density can be evaluated and is provided as an additional input to the discriminator. Interestingly, maximum entropy IRL is a special case of an energy-based model. We discuss the interpretation of GANs as an algorithm for training energy-based models, and relate this interpretation to other recent work that seeks to connect GANs and EBMs. By formally highlighting the connection between GANs, IRL, and EBMs, we hope that researchers in all three communities can better identify and apply transferable ideas from one domain to another, particularly for developing more stable and scalable algorithms: a major challenge in all three domains.

• Hinton GE (2010) A practical guide to training restricted Boltzmann machines. pdf
• Restricted Boltzmann machines (RBMs) have been used as generative models of many different types of data. RBMs are usually trained using the contrastive divergence learning procedure. This requires a certain amount of practical experience to decide how to set the values of numerical meta-parameters. Over the last few years, the machine learning group at the University of Toronto has acquired considerable expertise at training RBMs and this guide is an attempt to share this expertise with other machine learning researchers.

• Hinton G & Salakhutdinov R (2010) Discovering binary codes for documents by learning deep generative models  |  [ pdf]  |  NLM, VSM, tf-idf; 'semantic hashing;' autoencoders, RBM; LSA

• We describe a deep generative model in which the lowest layer represents the word-count vector of a document and the top layer represents a learned binary code for that document. The top two layers of the generative model form an undirected associative memory and the remaining layers form a belief net with directed, top-down connections. We present efficient learning and inference procedures for this type of generative model and show that it allows more accurate and much faster retrieval than latent semantic analysis. By using our method as a filter for a much slower method called TF-IDF we achieve higher accuracy than TF-IDF alone and save several orders of magnitude in retrieval time. By using short binary codes as addresses, we can perform retrieval on very large document sets in a time that is independent of the size of the document set using only one word of memory to describe each document.

• Hinton GE (2006) Reducing the dimensionality of data with neural networks.  pdf
• High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

• Hinton GE (2002) Training products of experts by minimizing contrastive divergence. pdf
• It is possible to combine multiple latent-variable models of the same data by multiplying their probability distributions together and then renormalizing. This way of combining individual "expert" models makes it hard to generate samples from the combined model but easy to infer the values of the latent variables of each expert, because the combination rule ensures that the latent variables of different experts are conditionally independent when given the data. A product of experts (PoE) is therefore an interesting candidate for a perceptual system in which rapid inference is vital and generation is unnecessary. Training a PoE by maximizing the likelihood of the data is difficult because it is hard even to approximate the derivatives of the renormalization term in the combination rule. Fortunately, a PoE can be trained using a different objective function called "contrastive divergence" whose derivatives with regard to the parameters can be approximated accurately and efficiently. Examples are presented of contrastive divergence learning using several types of expert on several types of data.
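For reference, a minimal sketch of a single CD-1 parameter update for a Bernoulli-Bernoulli RBM (my own toy code, not from the paper; the learning rate and sampling details are illustrative):

```python
# One step of contrastive divergence (CD-1) for a small binary RBM.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.05):
    # positive phase: hidden probabilities / samples given the data
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one Gibbs step back to the visibles and up again
    pv1 = sigmoid(h0 @ W.T + a)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + b)
    # CD-1 approximation to the log-likelihood gradient
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)
    return W, a, b

# toy usage: 6 visible units, 4 hidden units, a batch of 8 binary vectors
W = rng.normal(scale=0.01, size=(6, 4)); a = np.zeros(6); b = np.zeros(4)
batch = rng.integers(0, 2, size=(8, 6)).astype(float)
W, a, b = cd1_update(batch, W, a, b)
```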

• Hopfield JJ (1982) Neural networks and physical systems with emergent collective computational abilities. pdf
• Computational properties of use to biological organisms or to the construction of computers can emerge as collective properties of systems having a large number of simple equivalent components (or neurons). The physical meaning of content-addressable memory is described by an appropriate phase space flow of the state of a system. A model of such a system is given, based on aspects of neurobiology but readily adapted to integrated circuits. The collective properties of this model produce a content-addressable memory which correctly yields an entire memory from any subpart of sufficient size. The algorithm for the time evolution of the state of the system is based on asynchronous parallel processing. Additional emergent collective properties include some capacity for generalization, familiarity recognition, categorization, error correction, and time sequence retention. The collective properties are only weakly sensitive to details of the modeling or the failure of individual devices.
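A minimal sketch of the model in modern notation (Hebbian outer-product storage plus asynchronous sign updates, which descend the energy $\small E(s) = -\frac{1}{2} s^{\top} W s$); the pattern and the corruption below are illustrative:

```python
# Toy Hopfield network: store +/-1 patterns, recall from a corrupted cue.
import numpy as np

def store(patterns):
    # patterns: (p, n) array of +/-1 vectors; Hebbian outer-product rule
    p, n = patterns.shape
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)          # no self-connections
    return W

def recall(W, s, steps=100, seed=0):
    rng = np.random.default_rng(seed)
    s = s.copy()
    for _ in range(steps):
        i = rng.integers(len(s))      # asynchronous update: one unit at a time
        s[i] = 1.0 if W[i] @ s >= 0 else -1.0
    return s

x = np.array([1, -1, 1, 1, -1, -1, 1, -1], dtype=float)
W = store(x[None, :])
cue = x.copy(); cue[:2] *= -1         # flip two bits (a "subpart" of the memory)
print(recall(W, cue))                 # converges back to x (or its mirror -x)
```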

• Kim T [Yoshua Bengio] (2016) Deep Directed Generative Models with Energy-Based Probability Estimation. arXiv:1606.03439

• Training energy-based probabilistic models is confronted with apparently intractable sums, whose Monte Carlo estimation requires sampling from the estimated probability distribution in the inner loop of training. This can be approximately achieved by Markov chain Monte Carlo methods, but may still face a formidable obstacle that is the difficulty of mixing between modes with sharp concentrations of probability. Whereas an MCMC process is usually derived from a given energy function based on mathematical considerations and requires an arbitrarily long time to obtain good and varied samples, we propose to train a deep directed generative model (not a Markov chain) so that its sampling distribution approximately matches the energy function that is being trained. Inspired by generative adversarial networks, the proposed framework involves training of two models that represent dual views of the estimated probability distribution: the energy function (mapping an input configuration to a scalar energy value) and the generator (mapping a noise vector to a generated configuration), both represented by deep neural networks.

• More here.

• LeCun Y (2006) [Energy-Based Models (EBM)] A tutorial on energy-based learning. pdf [59 pp]
• Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods. Probabilistic models must be properly normalized, which sometimes requires evaluating intractable integrals over the space of all possible variable configurations. Since EBMs have no requirement for proper normalization, this problem is naturally circumvented. EBMs can be viewed as a form of non-probabilistic factor graphs, and they provide considerably more flexibility in the design of architectures and training criteria than probabilistic approaches.

• LeCun Y (2005) Loss Functions for Discriminative Training of Energy-Based Models. pdf
• Probabilistic graphical models associate a probability to each configuration of the relevant variables. Energy-based models (EBM) associate an energy to those configurations, eliminating the need for proper normalization of probability distributions. Making a decision (an inference) with an EBM consists in comparing the energies associated with various configurations of the variable to be predicted, and choosing the one with the smallest energy. Such systems must be trained discriminatively to associate low energies to the desired configurations and higher energies to undesired configurations. A wide variety of loss functions can be used for this purpose. We give sufficient conditions that a loss function should satisfy so that its minimization will cause the system to approach the desired behavior. We give many specific examples of suitable loss functions, and show an application to object recognition in images.

• Mocanu DC (2016) A topological insight into restricted Boltzmann machines. arXiv:1604.05978
• Restricted Boltzmann Machines (RBMs) and models derived from them have been successfully used as basic building blocks in deep artificial neural networks for automatic feature extraction and unsupervised weight initialization, but also as density estimators. Thus, their generative and discriminative capabilities, but also their computational time, are instrumental to a wide range of applications. Our main contribution is to look at RBMs from a topological perspective, bringing insights from network science. Firstly, here we show that RBMs and Gaussian RBMs (GRBMs) are bipartite graphs which naturally have a small-world topology. Secondly, we demonstrate both on synthetic and real-world datasets that by constraining RBMs and GRBMs to a scale-free topology (while still considering local neighborhoods and data distribution), we reduce the number of weights that need to be computed by a few orders of magnitude, at virtually no loss in generative performance. Thirdly, we show that, for a fixed number of weights, our proposed sparse models (which by design have a higher number of hidden neurons) achieve better generative capabilities than standard fully connected RBMs and GRBMs (which by design have a smaller number of hidden neurons), at no additional computational costs.

• Ngiam J [Ng AY] (2011) Learning deep energy models.  pdf
• Deep generative models with multiple hidden layers have been shown to be able to learn meaningful and compact representations of data. In this work we propose deep energy models, which use deep feedforward neural networks to model the energy landscapes that define probabilistic models. We are able to efficiently train all layers of our model simultaneously, allowing the lower layers of the model to adapt to the training of the higher layers, and thereby producing better generative models. We evaluate the generative performance of our models on natural images and demonstrate that this joint training of multiple layers yields qualitative and quantitative improvements over greedy layerwise training. We further generalize our models beyond the commonly used sigmoidal neural networks and show how a deep extension of the product of Student-t distributions model achieves good generative performance. Finally, we introduce a discriminative extension of our model and demonstrate that it outperforms other fully-connected models on object recognition on the NORB dataset.
• Their model is the basis of the work described in: Zhai S (2016) Deep Structured Energy Based Models for Anomaly Detection. arXiv:1605.07717  |  [see also: reddit]

• Odense S [Roderick Edwards | UVic?!] (2016) Universal Approximation Results for the Temporal Restricted Boltzmann Machine and the Recurrent Temporal Restricted Boltzmann Machine. JMLR [pdf]
• The Restricted Boltzmann Machine (RBM) has proved to be a powerful tool in machine learning, both on its own and as the building block for Deep Belief Networks (multi-layer generative graphical models). The RBM and Deep Belief Network have been shown to be universal approximators for probability distributions on binary vectors. In this paper we prove several similar universal approximation results for two variations of the Restricted Boltzmann Machine with time dependence, the Temporal Restricted Boltzmann Machine (TRBM) and the Recurrent Temporal Restricted Boltzmann Machine (RTRBM). We show that the TRBM is a universal approximator for Markov chains and generalize the theorem to sequences with longer time dependence. We then prove that the RTRBM is a universal approximator for stochastic processes with finite time dependence. We conclude with a discussion on efficiency and how the constructions developed could explain some previous experimental results.

• Scellier B [Bengio Y] (2016) Towards a Biologically Plausible Backprop. arXiv:1602.05179  |  reddit  |  reddit

• We introduce Equilibrium Propagation, a learning framework for energy-based models. It involves only one kind of neural computation, performed in both the first phase (when the prediction is made) and the second phase of training (after the target or prediction error is revealed). Although this algorithm computes the gradient of an objective function just like Backpropagation, it does not need a special computation or circuit for the second phase, where errors are implicitly propagated. Equilibrium Propagation shares similarities with Contrastive Hebbian Learning and Contrastive Divergence while solving the theoretical issues of both algorithms: our algorithm computes the gradient of a well defined objective function. Because the objective function is defined in terms of local perturbations, the second phase of Equilibrium Propagation corresponds to only nudging the prediction (fixed point, or stationary distribution) towards a configuration that reduces prediction error. In the case of a recurrent multi-layer supervised network, the output units are slightly nudged towards their target in the second phase, and the perturbation introduced at the output layer propagates backward in the hidden layers. We show that the signal 'back-propagated' during this second phase corresponds to the propagation of error derivatives and encodes the gradient of the objective function, when the synaptic update corresponds to a standard form of spike-timing dependent plasticity. This work makes it more plausible that a mechanism similar to Backpropagation could be implemented by brains, since leaky integrator neural computation performs both inference and error back-propagation in our model. The only local difference between the two phases is whether synaptic changes are allowed or not.

• Vincent P (2011) A connection between score matching and denoising autoencoders.  pdf

• Denoising autoencoders have been previously shown to be competitive alternatives to Restricted Boltzmann Machines for unsupervised pre-training of each layer of a deep architecture. We show that a simple denoising autoencoder training criterion is equivalent to matching the score (with respect to the data) of a specific energy based model to that of a non-parametric Parzen density estimator of the data. This yields several useful insights. It defines a proper probabilistic model for the denoising autoencoder technique which makes it in principle possible to sample from them or to rank examples by their energy. It suggests a different way to apply score matching that is related to learning to denoise and does not require computing second derivatives. It justifies the use of tied weights between the encoder and decoder, and suggests ways to extend the success of denoising autoencoders to a larger family of energy-based models.

• Multiply-cited in: Zhai S (2016) Deep Structured Energy Based Models for Anomaly Detection. arXiv:1605.07717:

• "For example, it is shown that properly regularized autoencoders (Vincent et al., 2010; Rifai et al., 2011) are able to effectively characterize the data distribution and learn useful representations, which are not achieved by shallow methods such as PCA or K-Means."

• "Similarly to (Vincent, 2011), we are able to train a DSEBM in the same way as that of a deep denoising autoencoder (DAE) Vincent et al. (2010), which only requires standard stochastic gradient descent (SGD)."

• "One particularly interesting variant of autoencoders is DAEs (Vincent et al., 2010), which learn to construct the inputs given their randomly corrupted versions:

$\sum_{i=1}^N E_\epsilon \left\lVert x_i - f(x_i + \epsilon; \theta) \right\rVert_{2}^2 ,$

where $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ is an isotropic Gaussian noise [Victoria: $I$: identity matrix]. DAEs are easy to train with standard stochastic gradient descent (SGD) and perform significantly better than unregularized autoencoders. While RBM and DAEs are typically considered as two alternative unsupervised deep models, it is recently shown they are closely related to each other. In particular, (Vincent, 2011) shows that training an RBM with score matching (SM) (Hyvärinen, 2005) is equivalent to a one-layer DAE. SM is an alternative method to MLE, which is especially suitable for estimating non-normalized density functions such as EBM. Instead of trying to directly maximize the probability of training instances, SM minimizes the following objective function: ..."
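A minimal training sketch matching the quoted DAE objective (a hypothetical two-layer PyTorch model; layer sizes, noise level $\sigma$, and optimizer settings are illustrative, not taken from any of the cited papers):

```python
# Denoising autoencoder: reconstruct x from x + eps, eps ~ N(0, sigma^2 I).
import torch
import torch.nn as nn

d, h, sigma = 50, 16, 0.3
dae = nn.Sequential(nn.Linear(d, h), nn.Sigmoid(), nn.Linear(h, d))
opt = torch.optim.SGD(dae.parameters(), lr=0.1)

x = torch.rand(256, d)                # stand-in training batch
for step in range(100):
    eps = sigma * torch.randn_like(x)                      # isotropic Gaussian corruption
    loss = ((x - dae(x + eps)) ** 2).sum(dim=1).mean()     # ||x_i - f(x_i + eps)||^2
    opt.zero_grad(); loss.backward(); opt.step()           # plain SGD, as the quote notes
```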

• Vincent P [Bengio Y] (2008) Extracting and composing robust features with denoising autoencoders.  pdf
• Previous work has shown that the difficulties in learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. We introduce and motivate a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern. This approach can be used to train autoencoders, and these denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning and information theoretic perspective or from a generative model perspective. Comparative experiments clearly show the surprising advantage of corrupting the input of autoencoders on a pattern classification benchmark suite.

• Zhai S (2016) Deep Structured Energy Based Models for Anomaly Detection. arXiv:1605.07717  |  reddit

• In this paper, we attack the anomaly detection problem by directly modeling the data distribution with deep architectures. We propose deep structured energy based models (DSEBMs), where the energy function is the output of a deterministic deep neural network with structure. We develop novel model architectures to integrate EBMs with different types of data such as static data, sequential data, and spatial data, and apply appropriate model architectures to adapt to the data structure. Our training algorithm is built upon the recent development of score matching (Hyvarinen, 2005), which connects an EBM with a regularized autoencoder, eliminating the need for complicated sampling methods. A statistically sound decision criterion can be derived for anomaly detection purposes from the perspective of the energy landscape of the data distribution. We investigate two decision criteria for performing anomaly detection: the energy score and the reconstruction error. Extensive empirical studies on benchmark tasks demonstrate that our proposed model consistently matches or outperforms all the competing methods.

• Zhao J [Yann LeCun] (2016) Energy-based Generative Adversarial Network. arXiv:1609.03126  |  reddit

• We introduce the "Energy-based Generative Adversarial Network" (EBGAN) model which views the discriminator in the GAN framework as an energy function that associates low energies with the regions near the data manifold and higher energies everywhere else. Similar to the probabilistic GANs, a generator is trained to produce contrastive samples with minimal energies, while the energy function is trained to assign high energies to those generated samples. Viewing the discriminator as an energy function allows the use of a wide variety of architectures and loss functionals in addition to the usual binary discriminant network. Among them, one instantiation of EBGAN uses an auto-encoder architecture, with the energy being the reconstruction error. We show that this form of EBGAN exhibits more stable behavior than regular GANs during training. We also show that a single-scale architecture can be trained to generate high-resolution images.
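In equations (notation mine, restating the description above; $\small m$ is a positive margin and $\small [\,\cdot\,]^+ = \max(0, \cdot)$), the energy/discriminator $\small D$ and the generator $\small G$ are trained with

$\small \mathcal{L}_D(x, z) = D(x) + \left[ m - D(G(z)) \right]^+ , \qquad \mathcal{L}_G(z) = D(G(z)) ,$

and in the auto-encoder instantiation the energy is the reconstruction error, $\small D(x) = \lVert \mathrm{Dec}(\mathrm{Enc}(x)) - x \rVert$.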

• More here.

## ART - CNN

[Art:CNN] Blogs:

• A Computer Algorithm Does the Work of 85 Artists: Watch Starry Starwars: A Clip of 'Star Wars: Episode V' in the Art Style of Vincent Van Gogh YouTube
• While it took over £50,000 and a large collection of artists to hand-paint each frame of the Van Gogh movie Loving Vincent, recent advances in neural algorithms of artistic style (Gatys et al.) allow one to capture his art style on a computer.

Although the work of Gatys et al. worked well on images, the naive method for extending it to movies does not produce great results. The video below shows a new method of rendering movies in a given art style using optical flow to move the textures with the objects in the scene. A technical report will appear soon on arXiv. Now that the code is written, the movie is generated with little human input.

For a demonstration of our method, watch a clip of Star Wars: Episode V in the Art Style of Vincent Van Gogh.

Of course, it should be noted that the main value of art is in its ability to communicate insight into the nature of the human experience. The neural algorithms of artistic style that we use here are nowhere close to capturing this essential, human component of art. We merely capture the style of Van Gogh's brush strokes, much like the artists who take the frames of a movie and paint them in the style of Van Gogh.

This work is powered by DeepMovie, which uses a combination of advances in neural networks and optical flow algorithms to render videos in complex artistic styles.
The method was developed by Alexander G. Anderson and collaborators in the Redwood Center for Theoretical Neuroscience at UC Berkeley. Gatys et al.: arXiv:1508.06576.

• The art of neural networks (Mike Tyka: TEDxTUM) [reddit]  |  YouTube [16:07]

• Understanding Deep Dreams

[Art:CNN] GitXiv:

• A Neural Algorithm of Artistic Style: Here we introduce an artificial system based on a Deep Neural Network that creates artistic images of high perceptual quality. The system uses neural representations to separate and recombine content and style of arbitrary images, providing a neural algorithm for the creation of artistic images. Moreover, in light of the striking similarities between performance-optimised artificial neural networks and biological vision, this work offers a path forward to an algorithmic understanding of how humans create and perceive artistic imagery.

• Neural Doodle: This paper introduces a novel concept to augment such generative architectures with semantic annotations, either by manually authoring pixel labels or using existing solutions for semantic segmentation. The result is a content-aware generative algorithm that offers meaningful control over the outcome.  |  reddit  |  article (Neuro Doodle author)

• Faster neural doodle [reddit]: This is my approach to neural doodle. It does not use the patch-based idea and is more like the original Artistic Style algorithm by L. Gatys. It takes several minutes to draw a picture.

• Painting-to-3D Model Alignment Via Discriminative Visual Elements [Mathieu Aubry, Josef Sivic]: This paper describes a technique that can reliably align arbitrary 2D depictions of an architectural site, including drawings, paintings and historical photographs, with a 3D model of the site. ...

• Robot Art Raises Questions about Human Creativity [MIT Technology Review]

[Art:CNN] Papers:

• Champandard AJ (2016) Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artwork.  |  arXiv:1603.01768  |  GitXiv
• CNNs have proven highly effective at image synthesis and style transfer. ... This paper introduces a novel concept to augment such generative architectures with semantic annotations, either by manually authoring pixel labels or using existing solutions for semantic segmentation. The result is a content-aware generative algorithm that offers meaningful control over the outcome. Thus, we increase the quality of images generated by avoiding common glitches, make the results look significantly more plausible, and extend the functional range of these algorithms - whether for portraits or landscapes, etc. Applications include semantic style transfer and turning doodles with few colors into masterful paintings!

• Dumoulin V [Google Brain] (2016) A Learned Representation For Artistic Style. arXiv:1610.07629  |  reddit  |  reddit

• The diversity of painting styles represents a rich visual vocabulary for the construction of an image. The degree to which one may learn and parsimoniously capture this visual vocabulary measures our understanding of the higher level features of paintings, if not images in general. In this work we investigate the construction of a single, scalable deep network that can parsimoniously capture the artistic style of a diversity of paintings. We demonstrate that such a network generalizes across a diversity of artistic styles by reducing a painting to a point in an embedding space. Importantly, this model permits a user to explore new painting styles by arbitrarily combining the styles learned from individual paintings. We hope that this work provides a useful step towards building rich models of paintings and offers a window on to the structure of the learned representation of artistic style.

• Related blog post [Google DeepMind: Oct 2016]:  Supercharging Style Transfer

• See also:  Gatys LA (2015b) A neural algorithm of artistic style. arXiv:1508.06576

• Gatys L (2015a) Texture synthesis using convolutional neural networks. arXiv:1505.07376

• Here we introduce a new model of natural textures based on the feature spaces of convolutional neural networks optimised for object recognition. Samples from the model are of high perceptual quality demonstrating the generative power of neural networks trained in a purely discriminative fashion. Within the model, textures are represented by the correlations between feature maps in several layers of the network. We show that across layers the texture representations increasingly capture the statistical properties of natural images while making object information more and more explicit. The model provides a new tool to generate stimuli for neuroscience and might offer insights into the deep representations learned by convolutional neural networks.

• In this reddit thread, Texture Networks: Feed-forward Synthesis of Textures and Stylized Images, the Gatys et al. and the Ulyanov et al. papers are discussed in relation to one another:

• speed is awesome, but the quality is quite inferior to Gatys et al.

• Generally agreed, although there are a couple of samples where theirs does appear better to me -- trees in Fig. 1, roofing shingles in Fig. 11. It looks like the textures generated in this paper are much more homogenous than their sources, particularly conspicuous on the rock textures.

• Oh yeah the textures are nice, I only meant the style transfer experiments. Compared to the stuff they're putting out now at deepart.io, these are pretty bad.

• The textures are ok (compared to Gatys; they are much better than the things that came before Gatys, clearly)-- like I said, some are better. Some aren't. The homogeneity is bad on most of the textures they show.

But yeah, I have to agree that the results of the style transfer experiments are worse. It's quite an achievement to be running 500x faster when deployed, though, which gives a lot of room to improve the method's results while remaining very fast.

• Implementation (non-author?):  GitHub  |  reddit

• Gatys LA (2015b). A neural algorithm of artistic style. arXiv:1508.06576  |  Replicating Neural Style [reddit]
• In fine art, especially painting, humans have mastered the skill to create unique visual experiences through composing a complex interplay between the content and style of an image. Thus far the algorithmic basis of this process is unknown and there exists no artificial system with similar capabilities. However, in other key areas of visual perception such as object and face recognition near-human performance was recently demonstrated by a class of biologically inspired vision models called Deep Neural Networks. Here we introduce an artificial system based on a Deep Neural Network that creates artistic images of high perceptual quality. The system uses neural representations to separate and recombine content and style of arbitrary images, providing a neural algorithm for the creation of artistic images. Moreover, in light of the striking similarities between performance-optimised artificial neural networks and biological vision, our work offers a path forward to an algorithmic understanding of how humans create and perceive artistic imagery.

• Related: Novak R (2016) Improving the Neural Algorithm of Artistic Style. arXiv:1605.04603  |  reddit

• In this work we investigate different avenues of improving the Neural Algorithm of Artistic Style (Gatys, Ecker & Bethge, arXiv:1508.06576). While showing great results when transferring homogeneous and repetitive patterns, the original style representation often fails to capture more complex properties, like having separate styles of foreground and background. This leads to visual artifacts and undesirable textures appearing in unexpected regions when performing style transfer. We tackle this issue with a variety of approaches, mostly by modifying the style representation in order for it to capture more information and impose a tighter constraint on the style transfer result. In our experiments, we subjectively evaluate our best method as producing from barely noticeable to significant improvements in the quality of style transfer.

• Mentioned here [Google Research Blog: Oct 2016]:  Supercharging Style Transfer  |  related paper:  Dumoulin V [Google Brain] (2016) A Learned Representation For Artistic Style. arXiv:1610.07629

• Implementation: Neural-Style-Transfer: Implementation of Neural Style Transfer from the paper "A Neural Algorithm of Artistic Style" (arXiv:1508.06576) in Keras 2.0+.
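The style representation at the heart of the Gatys et al. papers is the set of Gram matrices of convolutional feature maps. A minimal sketch (my own toy code; the VGG features and the exact normalization constants of the paper's loss are assumed to come from elsewhere):

```python
# Gram-matrix "style" statistics for one convolutional layer.
import numpy as np

def gram_matrix(features):
    # features: (C, H, W) activations from one layer of a pre-trained CNN
    C, H, W = features.shape
    F = features.reshape(C, H * W)
    return F @ F.T / (H * W)          # (C, C) channel-by-channel correlations

def style_loss(gram_style, gram_generated):
    # squared Frobenius distance between the two Gram matrices
    return np.sum((gram_style - gram_generated) ** 2)

# toy usage with random stand-ins for feature maps
rng = np.random.default_rng(0)
g_style = gram_matrix(rng.normal(size=(8, 5, 5)))
g_gen = gram_matrix(rng.normal(size=(8, 5, 5)))
print(style_loss(g_style, g_gen))
```

Because the Gram matrix discards spatial positions, it captures "texture" statistics of the style image, while the content loss (a direct feature-map match at deeper layers) preserves spatial structure.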

• Gatys LA [University of Tubingen, Germany] (CVPR 2016) Image style transfer using convolutional neural networks. pdf

• Rendering the semantic content of an image in different styles is a difficult image processing task. Arguably, a major limiting factor for previous approaches has been the lack of image representations that explicitly represent semantic information and, thus, allow to separate image content from style. Here we use image representations derived from Convolutional Neural Networks optimised for object recognition, which make high level image information explicit. We introduce A Neural Algorithm of Artistic Style that can separate and recombine the image content and style of natural images. The algorithm allows us to produce new images of high perceptual quality that combine the content of an arbitrary photograph with the appearance of numerous well-known artworks. Our results provide new insights into the deep image representations learned by Convolutional Neural Networks and demonstrate their potential for high level image synthesis and manipulation.

• Related:  Li Y (2017) Demystifying Neural Style Transfer. arXiv:1701.01036  |  reddit

• Neural Style Transfer has recently demonstrated very exciting results which have attracted attention in both academia and industry. Despite the amazing results, the principle of neural style transfer, especially why the Gram matrices could represent style, remains unclear. In this paper, we propose a novel interpretation of neural style transfer by treating it as a domain adaptation problem. Specifically, we theoretically show that matching the Gram matrices of feature maps is equivalent to minimizing the Maximum Mean Discrepancy (MMD) with the second order polynomial kernel. Thus, we argue that the essence of neural style transfer is to match the feature distributions between the style images and the generated images. To further support our standpoint, we experiment with several other distribution alignment methods, and achieve appealing results. We believe this novel interpretation connects these two important research fields, and could enlighten future research.

• Implementation [commercial]:  Combinart.io/  |  GitHub

• Gatys LA (2016) Preserving Color in Neural Artistic Style Transfer. arXiv:1606.05897
• This note presents an extension to the neural artistic style transfer algorithm (Gatys et al.). The original algorithm transforms an image to have the style of another given image. For example, a photograph can be transformed to have the style of a famous painting. Here we address a potential shortcoming of the original method: the algorithm transfers the colors of the original painting, which can alter the appearance of the scene in undesirable ways. We describe simple linear methods for transferring style while preserving colors.

• Amazing!  Luan F [Cornell University | Adobe] (2017) Deep Photo Style Transfer. arXiv:1703.07511  |  GitHub  |  reddit
• This paper introduces a deep-learning approach to photographic style transfer that handles a large variety of image content while faithfully transferring the reference style. Our approach builds upon recent work on painterly transfer that separates style from the content of an image by considering different layers of a neural network. However, as is, this approach is not suitable for photorealistic style transfer. Even when both the input and reference images are photographs, the output still exhibits distortions reminiscent of a painting. Our contribution is to constrain the transformation from the input to the output to be locally affine in colorspace, and to express this constraint as a custom CNN layer through which we can backpropagate. We show that this approach successfully suppresses distortion and yields satisfying photorealistic style transfers in a broad variety of scenarios, including transfer of the time of day, weather, season, and artistic edits.

• Ruder M (2016) Artistic style transfer for videos. arXiv:1604.08610  |  reddit  |  YouTube  |  GitHub

• In the past, manually re-drawing an image in a certain artistic style required a professional artist and a long time. Doing this for a video sequence single-handed was beyond imagination. Nowadays computers provide new possibilities. We present an approach that transfers the style from one image (for example, a painting) to a whole video sequence. We make use of recent advances in style transfer in still images and propose new initializations and loss functions applicable to videos. This allows us to generate consistent and stable stylized video sequences, even in cases with large motion and strong occlusion. We show that the proposed method clearly outperforms simpler baselines both qualitatively and quantitatively.

• Ulyanov D (2016) Texture Networks: Feed-forward Synthesis of Textures and Stylized Images. arXiv:1603.03417  |  GitHub  |  mentioned here: reddit

• Gatys et al. recently demonstrated that deep networks can generate beautiful textures and stylized images from a single texture example. However, their method requires a slow and memory-consuming optimization process. We propose here an alternative approach that moves the computational burden to a learning stage. Given a single example of a texture, our approach trains compact feed-forward convolutional networks to generate multiple samples of the same texture of arbitrary size and to transfer artistic style from a given image to any other image. The resulting networks are remarkably light-weight and can generate textures of quality comparable to Gatys et al., but hundreds of times faster. More generally, our approach highlights the power and flexibility of generative feed-forward models trained with complex and expressive loss functions.

• 4.1. Speed and memory. We compare quantitatively the speed of our method and of the iterative optimization of Gatys et al., 2015a by measuring how much time it takes for the latter and for our generator network to reach a given value of the loss $\small \mathcal{L}_T(x; x_0)$. Figure 6 shows that iterative optimization requires about 10 seconds to generate a sample $\small x$ that has a loss comparable to the output $\small x = g(z)$ of our generator network. Since an evaluation of the latter requires ~20 ms, we achieve a 500x speed-up, which is sufficient for real-time applications such as video processing. There are two reasons for this significant difference: the generator network is much smaller than the VGG-19 model evaluated at each iteration of (Gatys et al., 2015a), and our method requires a single network evaluation. By avoiding backpropagation, our method also uses significantly less memory (170 MB to generate a 256 × 256 sample, vs 1100 MB for Gatys et al., 2015a).

• The reddit discussion comparing this paper with Gatys et al. (Texture Networks: Feed-forward Synthesis of Textures and Stylized Images) is quoted above, under Gatys (2015a).

• Texture networks: feed-forward synthesis of textures and stylized images [The Morning Paper]

• [follow-on to preceding paper (Ulyanov D 2016 arXiv:1603.03417)]:

Ulyanov D (2016) Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022  |  reddit

• In this paper we revisit the fast stylization method introduced in Ulyanov et al. (2016) [arXiv:1603.03417]. We show how a small change in the stylization architecture results in a significant qualitative improvement in the generated images. The change is limited to swapping batch normalization with instance normalization, and to applying the latter both at training and testing times. The resulting method can be used to train high-performance architectures for real-time image generation. The code will be made available here [GitHub]

• The recent work of Gatys et al. (2016) introduced a method for transferring a style from an image onto another one, as demonstrated in fig. 1. The stylized image matches simultaneously selected statistics of the style image and of the content image. Both style and content statistics are obtained from a deep convolutional network pre-trained for image classification. The style statistics are extracted from shallower layers and averaged across spatial locations whereas the content statistics are extracted from deeper layers and preserve spatial information. In this manner, the style statistics capture the "texture" of the style image whereas the content statistics capture the "structure" of the content image.

Although the method of Gatys et al. produces remarkably good results, it is computationally inefficient. The stylized image is, in fact, obtained by iterative optimization until it matches the desired statistics. In practice, it takes several minutes to stylize an image of size $\small 512 \times 512$. Two recent works, Ulyanov et al. (2016) and Johnson et al. (2016), sought to address this problem by learning equivalent feed-forward generator networks that can generate the stylized image in a single pass. These two methods differ mainly in the details of the generator architecture and produce results of a comparable quality; however, neither achieved results as good as the slower optimization-based method of Gatys et al.

In this paper we revisit the method for feed-forward stylization of Ulyanov et al. (2016) and show that a small change in the generator architecture leads to much improved results. The results are in fact of comparable quality to the slow optimization method of Gatys et al., but can be obtained in real time on standard GPU hardware. The key idea (section 2) is to replace batch normalization layers in the generator architecture with instance normalization layers, and to keep them at test time (as opposed to freezing and simplifying them out, as is done for batch normalization). Intuitively, the normalization process removes instance-specific contrast information from the content image, which simplifies generation. In practice, this results in vastly improved images (section 3).
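A minimal sketch of the difference the paper exploits, assuming NCHW tensors and omitting the learned scale/shift parameters:

```python
# Batch vs. instance normalization: the only change is which axes the
# mean/variance are computed over (and instance norm is kept at test time).
import numpy as np

def batch_norm(x, eps=1e-5):
    # x: (N, C, H, W); statistics shared across the whole batch, per channel
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    # x: (N, C, H, W); statistics per (sample, channel), over H and W only
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(size=(4, 3, 8, 8))
print(instance_norm(x).shape)         # (4, 3, 8, 8)
```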

## ATTENTION - MEMORY; READING COMPREHENSION

Blogs:

• Attention and Memory in Deep Learning and NLP [Denny Britz]

• Interpretability via attentional and memory-based interfaces, using TensorFlow  |  GitHub

• This post will serve as a gentle introduction to attentional and memory-based interfaces in deep neural architectures using TensorFlow. Incorporation of attention mechanisms is very simple and can improve transparency and interpretability in our complex models. We will conclude with extensions and caveats of the interfaces. The intended audience for this notebook is developers and researchers who have some basic understanding of TensorFlow and fundamental deep learning concepts.

• Suggestions: beginner tutorials for NN with Attention? [reddit]

• Teaching Machines to Read and Comprehend: Teaching machines to read natural language documents remains an elusive challenge. In this work we define a new methodology that resolves this bottleneck and provides large scale supervised reading comprehension data. This allows us to develop a class of attention based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure.

Papers:

• Ba J [Geoffrey Hinton; Volodymyr Mnih; Catalin Ionescu | University of Toronto | Google Brain | Google DeepMind] (2016) Using Fast Weights to Attend to the Recent Past. arXiv:1610.06258  |  NIPS Proceedings  |  GitHub  |  GitXiv  |  reddit  |  reddit  |  discussion [ShortScience.org]  |  blog: post

• Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These "fast weights" can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.

• TensorFlow implementation [non-author]:  GitHub  |  reddit

• Bojarski M [NVIDIA Corp | Google Brain Robotics] (2016) VisualBackProp: Efficient visualization of CNNs. arXiv:1611.05418  |  GitHub  |  GitXiv  |  reddit

• This paper proposes a new method, that we call VisualBackProp, for visualizing which sets of pixels of the input image contribute most to the predictions made by the convolutional neural network (CNN). The method heavily hinges on exploring the intuition that the feature maps contain less and less irrelevant information to the prediction decision when moving deeper into the network. The technique we propose was developed as a debugging tool for CNN-based systems for steering self-driving cars and is therefore required to run in real-time, i.e. it was designed to require less computations than a forward propagation. This makes the presented visualization method a valuable debugging tool which can be easily used during both training and inference. We furthermore justify our approach with theoretical arguments and theoretically confirm that the proposed method identifies sets of input pixels, rather than individual pixels, that collaboratively contribute to the prediction. Our theoretical findings stand in agreement with the experimental results. The empirical evaluation shows the plausibility of the proposed approach on the road video data as well as in other applications and reveals that it compares favorably to the layer-wise relevance propagation approach, i.e. it obtains similar visualization results and simultaneously achieves order of magnitude speed-ups.

• de Brébisson A [Pascal Vincent | MILA (UdeM) | CIFAR] (2016) A Cheap Linear Attention Mechanism with Fast Lookups and Fixed-Size Representations. arXiv:1609.05866  |  reddit

• The softmax content-based attention mechanism has proven to be very beneficial in many applications of recurrent neural networks. Nevertheless it suffers from two major computational limitations. First, its computations for an attention lookup scale linearly in the size of the attended sequence. Second, it does not encode the sequence into a fixed-size representation but instead requires memorizing all the hidden states. These two limitations restrict the use of the softmax attention mechanism to relatively small-scale applications with short sequences and few lookups per sequence. In this work we introduce a family of linear attention mechanisms designed to overcome the two limitations listed above. We show that removing the softmax non-linearity from the traditional attention formulation yields constant-time attention lookups and fixed-size representations of the attended sequences. These properties make these linear attention mechanisms particularly suitable for large-scale applications with extreme query loads, real-time requirements and memory constraints. Early experiments on a question answering task show that these linear mechanisms yield significantly better accuracy results than no attention, but obviously worse than their softmax alternative.
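A toy numpy illustration (not the authors' formulation) of why dropping the softmax yields fixed-size representations and constant-time lookups: with a purely linear score, the whole attended sequence collapses into a single $\small d_k \times d_v$ matrix, independent of its length.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dk, dv = 100, 8, 16
K, V = rng.normal(size=(T, dk)), rng.normal(size=(T, dv))
q = rng.normal(size=dk)

# softmax attention: O(T) work per lookup, and all of K, V must be stored
w = np.exp(q @ K.T); w /= w.sum()
softmax_out = w @ V

# linear attention: precompute a fixed-size summary once, then each lookup
# costs only O(dk * dv) regardless of sequence length T
S = K.T @ V                            # (dk, dv) summary of the sequence
linear_out = q @ S                     # equals sum_i (q . k_i) v_i
assert np.allclose(linear_out, (K @ q) @ V)
```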

• Cui Y (2016) Attention-over-Attention Neural Networks for Reading Comprehension. arXiv:1607.04423  |  reddit

• Cloze-style queries are representative problems in reading comprehension. Over the past few months, we have seen much progress in utilizing neural network approaches to solve Cloze-style questions. In this paper, we present a novel model called attention-over-attention reader for the Cloze-style reading comprehension task. Our model aims to place another attention mechanism over the document-level attention, and induces "attended attention" for final predictions. Unlike previous works, our neural network model requires fewer pre-defined hyper-parameters and uses an elegant architecture for modeling. Experimental results show that the proposed attention-over-attention model significantly outperforms various state-of-the-art systems by a large margin in public datasets, such as CNN and Children's Book Test datasets.

• dos Santos C (2016) Attentive Pooling Networks. arXiv:1602.03609.  |  IBM Watson: 2-way attention mechanism; CNN, RNN  |  reddit

• García-Gavilanes R (2016) Memory Remains: Understanding Collective Memory in the Digital Age. arXiv:1609.02621

• Recently developed information communication technologies, particularly the Internet, have affected how we, both as individuals and as a society, create, store, and recall information. Internet also provides us with a great opportunity to study memory using transactional large scale data, in a quantitative framework similar to the practice in statistical physics. In this project, we make use of online data by analysing viewership statistics of Wikipedia articles on aircraft crashes. We study the relation between recent events and past events and particularly focus on understanding memory triggering patterns. We devise a quantitative model that explains the flow of viewership from a current event to past events based on similarity in time, geography, topic, and the hyperlink structure of Wikipedia articles. We show that on average the secondary flow of attention to past events generated by such remembering processes is larger than the primary attention flow to the current event. We are the first to report these cascading effects.

• Hermann KM [Grefenstette E | Google DeepMind] (2016) Teaching machines to read and comprehend [pdf]

• Teaching machines to read natural language documents remains an elusive challenge. Machine reading systems can be tested on their ability to answer questions posed on the contents of documents that they have seen, but until now large scale training and test datasets have been missing for this type of evaluation. In this work we define a new methodology that resolves this bottleneck and provides large scale supervised reading comprehension data. This allows us to develop a class of attention based deep neural networks that learn to read real documents and answer complex questions with minimal prior knowledge of language structure.

• Iyyer M (2016) The Amazing Mysteries of the Gutter: Drawing Inferences Between Panels in Comic Book Narratives. arXiv:1611.05118  |  GitHub  |  GitXiv  |  AI Machine Attempts to Understand Comic Books ... and Fails  [MIT Technology Review]  |  reddit

• Visual narrative is often a combination of explicit information and judicious omissions, relying on the viewer to supply missing details. In comics, most movements in time and space are hidden in the "gutters" between panels. To follow the story, readers logically connect panels together by inferring unseen actions through a process called "closure". While computers can now describe the content of natural images, in this paper we examine whether they can understand the closure-driven narratives conveyed by stylized artwork and dialogue in comic book panels. We collect a dataset, COMICS, that consists of over 1.2 million panels (120 GB) paired with automatic textbox transcriptions. An in-depth analysis of COMICS demonstrates that neither text nor image alone can tell a comic book story, so a computer must understand both modalities to keep up with the plot. We introduce three cloze-style tasks that ask models to predict narrative and character-centric aspects of a panel given n preceding panels as context. Various deep neural architectures underperform human baselines on these tasks, suggesting that COMICS contains fundamental challenges for both vision and language.

• Jaitly N [Quoc V. Le; Oriol Vinyals; Ilya Sutskever; Samy Bengio | Google Brain | Google DeepMind | Open AI] (2015) A Neural Transducer. arXiv:1511.04868 [[attention models]]  |  reddit

• Sequence-to-sequence models have achieved impressive results on various tasks. However, they are unsuitable for tasks that require incremental predictions to be made as more data arrives or tasks that have long input sequences and output sequences. This is because they generate an output sequence conditioned on an entire input sequence. In this paper, we present a Neural Transducer that can make incremental predictions as more input arrives, without redoing the entire computation. Unlike sequence-to-sequence models, the Neural Transducer computes the next-step distribution conditioned on the partially observed input sequence and the partially generated sequence. At each time step, the transducer can decide to emit zero to many output symbols. The data can be processed using an encoder and presented as input to the transducer. The discrete decision to emit a symbol at every time step makes it difficult to learn with conventional backpropagation. It is however possible to train the transducer by using a dynamic programming algorithm to generate target discrete decisions. Our experiments show that the Neural Transducer works well in settings where it is required to produce output predictions as data come in. We also find that the Neural Transducer performs well for long sequences even when attention mechanisms are not used.

• Conclusion. We have introduced a new model that uses partial conditioning on inputs to generate output sequences. This allows the model to produce output as input arrives. This is useful for speech recognition systems and will also be crucial for future generations of online speech translation systems. Further, it can be useful for performing transduction over long sequences - something that is possibly difficult for sequence-to-sequence models. We applied the model to a toy task of addition, and to a phone recognition task, and showed that it can produce results comparable to the state of the art from sequence-to-sequence models.

• Related: Luo Y [Navdeep Jaitly; Ilya Sutskever | Google Brain | Open AI] (2016) Learning Online Alignments with Continuous Rewards Policy Gradient. arXiv:1608.01281  |  reddit

• Sequence-to-sequence models with soft attention had significant success in machine translation, speech recognition, and question answering. Though capable and easy to use, they require that the entirety of the input sequence is available at the beginning of inference, an assumption that is not valid for instantaneous translation and speech recognition. To address this problem, we present a new method for solving sequence-to-sequence problems using hard online alignments instead of soft offline alignments. The online alignments model is able to start producing outputs without the need to first process the entire input sequence. A highly accurate online sequence-to-sequence model is useful because it can be used to build an accurate voice-based instantaneous translator. Our model uses hard binary stochastic decisions to select the timesteps at which outputs will be produced. The model is trained to produce these stochastic decisions using a standard policy gradient method. In our experiments, we show that this model achieves encouraging performance on TIMIT and Wall Street Journal (WSJ) speech recognition datasets.

• Related: Binary Stochastic Neurons in Tensorflow

• In this post, I introduce and discuss binary stochastic neurons, implement trainable binary stochastic neurons in TensorFlow, and conduct several simple experiments on the MNIST dataset to get a feel for their behavior. Binary stochastic neurons offer two advantages over real-valued neurons: they can act as a regularizer and they enable conditional computation by enabling a network to make yes/no decisions. Conditional computation opens the door to new and exciting neural network architectures, such as the choice of experts architecture and hierarchical multiscale neural networks, which I plan to discuss in future posts.

The binary stochastic neuron

A binary stochastic neuron is a neuron with a noisy output: some proportion $p$ of the time it outputs 1, otherwise 0. An easy way to turn a real-valued input, $a$, into this proportion, $p$, is to set $p = sigm(a)$, where $sigm$ is the logistic sigmoid, $\small sigm(x) = \large \frac{1}{1 + e^{-x}}$. Thus, we define the binary stochastic neuron, $\text{BSN}$, as:

$\large BSN(a) = \textbf{1}_{\large z\ \lt\ sigm(a)}$

where $\textbf{1}_{x}$ is the indicator function on the truth value of $x$ and $z \sim U[0,1]$.
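A minimal numpy sketch of this definition (forward pass only; training such units needs a gradient estimator such as straight-through, which the post goes on to discuss):

```python
import numpy as np

rng = np.random.default_rng(0)
sigm = lambda x: 1.0 / (1.0 + np.exp(-x))

def bsn(a):
    # BSN(a) = 1[z < sigm(a)], with z ~ U[0, 1]: outputs 1 with probability sigm(a)
    z = rng.random(np.shape(a))
    return (z < sigm(a)).astype(float)

print(bsn(np.array([-2.0, 0.0, 2.0])))    # stochastic 0/1 outputs
```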

Advantages of the binary stochastic neuron

1. A binary stochastic neuron is a noisy modification of the logistic sigmoid: instead of outputting $p$, it outputs 1 with probability $p$ and 0 otherwise. Noise generally serves as a regularizer (see, e.g., Srivastava et al. (2014) and Neelakantan et al. (2015)), and so we might expect the same from binary stochastic neurons as compared to the logistic neurons. Indeed, this is the claimed "unpublished result" from the end of Hinton et al.'s Coursera Lecture 9c, which I test empirically in this post. Unfortunately, the results below show that binary stochastic neurons do not work so well as regularizers on the MNIST dataset, though they may serve as viable regularizers in other cases.

2. More importantly, by enabling networks to make binary decisions, the binary stochastic neuron allows for conditional computation. This opens the door to some interesting new architectures. For example, instead of a mixture of experts architecture, which weights the outputs of several "expert" sub-networks and requires that all subnetworks be computed, we could use a choice of experts architecture, which conditionally uses expert sub-networks as needed. This architecture is implicitly proposed in Bengio et al. (2013), wherein the experiments use a choice of expert units architecture (i.e., a gated architecture where gates must be 1 or 0). Another example, proposed in Bengio et al. (2013) and implemented by Chung et al. (2016), is the Hierarchical Multiscale Recurrent Neural Network (HM-RNN) architecture, which achieves great results on language modelling tasks. Both of these architectures will be explored in future posts.

...
[ ... snip ... ]

• Kadlec R [IBM Watson] (2016) Text Understanding with the Attention Sum Reader Network. arXiv:1603.01547v1

• Several large cloze-style context-QA datasets have been introduced recently: the CNN and Daily Mail news data, & the Children's Book Test. Thanks to the size of these datasets, the associated text comprehension task is well suited for deep-learning techniques that currently seem to outperform all alternative approaches. We present a new, simple model that uses attention to directly pick the answer from the context as opposed to computing the answer using a blended representation of words in the document as is usual in similar models. This makes the model particularly suitable for question-answering problems where the answer is a single word from the document. Our model outperforms models previously proposed for these tasks by a large margin.
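
• A rough sketch of the attention-sum idea as I read it from that abstract (the dot-product scoring and variable names are my assumptions, not necessarily the paper's exact formulation): attention over document positions is computed against the query, the probability mass of repeated occurrences of each candidate word is summed, and the highest-scoring candidate is picked directly as the answer:

```python
import numpy as np

def attention_sum_answer(doc_embs, query_emb, doc_tokens, candidates):
    """doc_embs: (T, d) contextual embeddings of the document tokens,
    query_emb: (d,) query representation, doc_tokens: list of T tokens."""
    scores = doc_embs @ query_emb                      # dot-product attention scores
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                               # softmax over positions
    # Sum attention mass over every occurrence of each candidate word.
    totals = {c: sum(p for tok, p in zip(doc_tokens, probs) if tok == c)
              for c in candidates}
    return max(totals, key=totals.get)                 # pick the answer directly

doc_tokens = ["mary", "went", "to", "the", "garden", "mary", "fell"]
doc_embs = np.random.randn(len(doc_tokens), 16)
query_emb = np.random.randn(16)
print(attention_sum_answer(doc_embs, query_emb, doc_tokens, {"mary", "garden"}))
```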

• Ribeiro MT (2016) "Why Should I Trust You?": Explaining the Predictions of Any Classifier. arXiv:1602.04938

• Sordoni A [Yoshua Bengio] (2016) Iterative Alternating Neural Attention for Machine Reading. arXiv:1606.02245  |  RNN c. bidirectional gated recurrent units (GRU); VSM, GloVe, ...  |  reddit

• We propose a novel neural attention architecture to tackle machine comprehension tasks, such as answering Cloze-style queries with respect to a document. Unlike previous models, we do not collapse the query into a single vector, instead we deploy an iterative alternating attention mechanism that allows a fine-grained exploration of both the query and the document. Our model outperforms state-of-the-art baselines in standard machine comprehension benchmarks such as CNN news articles and the Children's Book Test (CBT) dataset.

• SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage. With 107,785 question-answer pairs on 536 articles, SQuAD is significantly larger than previous reading comprehension datasets.

• Rajpurkar P [Stanford] (2016) SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv:1606.05250  |  reddit

• We present a new reading comprehension dataset, SQuAD, consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset in both manual and automatic ways to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We built a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at GitHub.

• Seo M [Allen Institute for Artificial Intelligence] (2016) Bidirectional Attention Flow for Machine Comprehension. arXiv:1611.01603

• Machine Comprehension (MC), answering a query about a given context, requires modeling complex interactions between the context and the query. Recently, attention mechanisms have been successfully extended to MC. Typically these methods use attention to summarize the context (or query) into a single vector, couple attentions temporally, and often form a uni-directional attention. In this paper we introduce the Bi-directional Attention Flow (BiDAF) network, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to achieve a query-aware context representation without early summarization. Our experimental evaluations show that our model achieves the state-of-the-art results in Stanford Question Answering Dataset (SQuAD) and CNN/DailyMail Cloze Test.

• Related:

• Wiese G [Mariana Neves] (2017) Neural Domain Adaptation for Biomedical Question Answering. arXiv:1706.03610

• Factoid question answering (QA) has recently benefited from the development of deep learning (DL) systems. Neural network models outperform traditional approaches in domains where large datasets exist, such as SQuAD (ca. 100,000 questions) for Wikipedia articles. However, these systems have not yet been applied to QA in more specific domains, such as biomedicine, because datasets are generally too small to train a DL system from scratch. For example, the BioASQ dataset for biomedical QA comprises less than 900 factoid (single answer) and list (multiple answers) QA instances. In this work, we adapt a neural QA system trained on a large open-domain dataset (SQuAD, source) to a biomedical dataset (BioASQ, target) by employing various transfer learning techniques. Our network architecture is based on a state-of-the-art QA system, extended with biomedical word embeddings and a novel mechanism to answer list questions. In contrast to existing biomedical QA systems, our system does not rely on domain-specific ontologies, parsers or entity taggers, which are expensive to create. Despite this fact, our systems achieve state-of-the-art results on factoid questions and competitive results on list questions.

• Tan C [Beihang University, Beijing | Microsoft Research, Beijing, China] (2017) S-Net: From Answer Extraction to Answer Generation for Machine Reading Comprehension. arXiv:1706.04815

• Approach. Following the overview in Figure 1, our approach consists of two parts: evidence extraction and answer synthesis. The two parts are trained in two stages. The evidence extraction part aims to extract evidence snippets related to the question and passage. The answer synthesis part aims to generate the answer based on the extracted evidence snippets. We propose a multi-task learning framework for the evidence extraction shown in Figure 2, and use the sequence-to-sequence model with additional features of the start and end positions of the evidence snippet for the answer synthesis shown in Figure 3. We use Gated Recurrent Unit (GRU) (Cho et al., 2014) instead of basic RNN. ...
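
• For reference, a minimal NumPy sketch of one common GRU formulation (Cho et al., 2014); biases are omitted, the variable names are mine, and gate conventions vary slightly between implementations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step (one common formulation; bias terms omitted for brevity)."""
    z = sigmoid(Wz @ x + Uz @ h_prev)               # update gate
    r = sigmoid(Wr @ x + Ur @ h_prev)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev))   # candidate state
    return z * h_prev + (1.0 - z) * h_tilde         # interpolate old and new state

rng = np.random.default_rng(0)
Wz, Wr, Wh = (rng.normal(size=(3, 4)) for _ in range(3))
Uz, Ur, Uh = (rng.normal(size=(3, 3)) for _ in range(3))
h = gru_step(rng.normal(size=4), np.zeros(3), Wz, Uz, Wr, Ur, Wh, Uh)
print(h)  # next hidden state, shape (3,)
```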

• Trischler A (2016) Natural Language Comprehension with the EpiReader. arXiv:1606.02270

• We present the EpiReader, a novel model for machine comprehension of text. Machine comprehension of unstructured, real-world text is a major research goal for natural language processing. Current tests of machine comprehension pose questions whose answers can be inferred from some supporting text, and evaluate a model's response to the questions. The EpiReader is an end-to-end neural model comprising two components: the first component proposes a small set of candidate answers after comparing a question to its supporting text, and the second component formulates hypotheses using the proposed candidates and the question, then reranks the hypotheses based on their estimated concordance with the supporting text. We present experiments demonstrating that the EpiReader sets a new state-of-the-art on the CNN and Children's Book Test machine comprehension benchmarks, outperforming previous neural models by a significant margin.

• Vaswani A [Google Brain | Google Research | UofT] (2017) Attention Is All You Need. arXiv:1706.03762  |  GitHub  [non-author implementation]  |  GitHub  |  GitHub  [Part of Google's Tensor2Tensor library]  |  reddit

• The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
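
• For reference, the core operation of the Transformer is scaled dot-product attention, $\small \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$. A minimal NumPy sketch (single head, shapes illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (n_q, d_v) attended values

Q = np.random.randn(4, 8); K = np.random.randn(6, 8); V = np.random.randn(6, 16)
print(scaled_dot_product_attention(Q, K, V).shape)      # (4, 16)
```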

• Weston J [Mikolov T] (2015) Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv:1502.05698

• One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. We believe many existing learning systems can currently not solve them, and hence our aim is to classify these tasks into skill sets, so that researchers can identify (and then rectify) the failings of their systems. We also extend and improve the recently introduced Memory Networks model, and show it is able to solve some, but not all, of the tasks.

• related:

• GitHub: Trains two RNNs based upon a story and a question. The resulting merged vector is then queried to answer a range of bAbI tasks. The results are comparable to those for an LSTM model provided in Weston et al.

• GitHub: Trains a memory network on the bAbI dataset. ... Reaches 98.6% accuracy on task 'single_supporting_fact_10k' after 120 epochs. Time per epoch: 3s on CPU (core i7).

• Yang Z [Smola AJ | Carnegie Mellon University | Microsoft Research] (2016) Hierarchical Attention Networks for Document Classification. pdf  |  reddit

• We propose a hierarchical attention network for document classification. Our model has two distinctive characteristics: (i) it has a hierarchical structure that mirrors the hierarchical structure of documents; (ii) it has two levels of attention mechanisms applied at the word- and sentence-level, enabling it to attend differentially to more and less important content when constructing the document representation. Experiments conducted on six large scale text classification tasks demonstrate that the proposed architecture outperforms previous methods by a substantial margin. Visualization of the attention layers illustrates that the model selects qualitatively informative words and sentences.

• Used here: Pappas N (2017) Multilingual Hierarchical Attention Networks for Document Classification. arXiv:1707.00896  |  GitHub  |  GitXiv  |  project page

• [v4] Hierarchical attention networks have recently achieved remarkable performance for document classification in a given language. However, when multilingual document collections are considered, training such models separately for each language entails linear parameter growth and lack of cross-language transfer. Learning a single multilingual model with fewer parameters is therefore a challenging but potentially beneficial objective. To this end, we propose multilingual hierarchical attention networks for learning document structures, with shared encoders and/or shared attention mechanisms across languages, using multi-task learning and an aligned semantic space as input. We evaluate the proposed models on multilingual document classification with disjoint label sets, on a large dataset which we provide, with 600k news documents in 8 languages, and 5k labels. The multilingual models outperform monolingual ones in low-resource as well as full-resource settings, and use fewer parameters, thus confirming their computational efficiency and the utility of cross-language transfer.

• Zagoruyko S (2016) Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer. arXiv:1612.03928  |  GitHub  |  reddit

• Attention plays a critical role in human visual experience. Furthermore, it has recently been demonstrated that attention can also play an important role in the context of applying artificial neural networks to a variety of tasks from fields such as computer vision and NLP. In this work we show that, by properly defining attention for convolutional neural networks, we can actually use this type of information in order to significantly improve the performance of a student CNN network by forcing it to mimic the attention maps of a powerful teacher network. To that end, we propose several novel methods of transferring attention, showing consistent improvement across a variety of datasets and convolutional neural network architectures.

## AUDIO | CONVERSATION | SPEECH | SOUND

[Audio|Conversation|Speech|Sound] Blogs:

• Artificial Intelligence cleans podcast episodes from 'ahem' sounds  |  GitHub  |  slides [local copy]
• Do you know why you can't hear the ugly ahem sounds on the podcast Data Science at Home? Because we remove them. Actually not us. A neural network does. The ahem detector is a deep convolutional neural network trained on transformed audio signals to recognize ahem sounds. The network has been trained to detect such signals on the episodes of Data Science at Home ...

• Combining CNN and RNN for spoken language identification  |  GitHub
• [By Hrayr Harutyunyan and Hrant Khachatrian (June 2016)] Last year Hrayr used convolutional networks to identify spoken language from short audio recordings for a TopCoder contest and got 95% accuracy. After the end of the contest we decided to try recurrent neural networks and their combinations with CNNs on the same task. The best combination allowed us to reach 99.24% accuracy, and an ensemble of 33 models reached 99.67%. This work became Hrayr's bachelor's thesis. ...
• Final remarks. The number of hyperparameters in these CNN+RNN mixtures is huge. Because of the limited hardware we covered only a very small fraction of possible configurations. The organizers of the original contest did not publicly release the dataset. Nevertheless we release the full source code on GitHub. We couldn't find many Theano/Lasagne implementations of CNN+RNN networks on GitHub, and we hope these scripts will partially fill that gap. This work was part of Hrayr's bachelor's thesis, which is available on academia.edu (the text is in Armenian).

[Audio|Conversation|Speech|Sound] GitHub

[Audio|Conversation|Speech|Sound] Papers:

• Amodei D [Baidu Research - Silicon Valley AI Lab] (2015) Deep speech 2: End-to-end speech recognition in English and Mandarin. arXiv:1512.02595

• We show that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech--two vastly different languages. Because it replaces entire pipelines of hand-engineered components with neural networks, end-to-end learning allows us to handle a diverse variety of speech including noisy environments, accents and different languages. Key to our approach is our application of HPC techniques, resulting in a 7x speedup over our previous system. Because of this efficiency, experiments that previously took weeks now run in days. This enables us to iterate more quickly to identify superior architectures and algorithms. As a result, in several cases, our system is competitive with the transcription of human workers when benchmarked on standard datasets. Finally, using a technique called Batch Dispatch with GPUs in the data center, we show that our system can be inexpensively deployed in an online setting, delivering low latency when serving users at scale.

• [Non-author?] Implementation:  GitHub  |  reddit

• Arik SO [Baidu Silicon Valley Artificial Intelligence Lab] (2017) Deep Voice: Real-time Neural Text-to-Speech. arXiv:1702.07825  |  reddit  |  Baidu's Artificial Intelligence Lab Unveils Synthetic Speech System: The Chinese search giant's Deep Voice system learns to talk in just a few hours with little or no human interference
• We present Deep Voice, a production-quality text-to-speech system constructed entirely from deep neural networks. Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The system comprises five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For the segmentation model, we propose a novel way of performing phoneme boundary detection with deep neural networks using connectionist temporal classification (CTC) loss. For the audio synthesis model, we implement a variant of WaveNet that requires fewer parameters and trains faster than the original. By using a neural network for each component, our system is simpler and more flexible than traditional text-to-speech systems, where each component requires laborious feature engineering and extensive domain expertise. Finally, we show that inference with our system can be performed faster than real time and describe optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.

• Assael YM (2016) [Nando de Freitas | Google DeepMind] LipNet: Sentence-level Lipreading. pdf  |  arXiv:1611.01599  |  reddit  |  reddit [some critique]  |  OpenReview

• Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first lipreading model to operate at sentence-level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 95.2% accuracy, outperforming experienced human lipreaders and the previous 86.4% state-of-the-art accuracy.

TL;DR: LipNet is the first lipreading model to operate at sentence-level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model.



• Related:

• Chung, JS [Oriol Vinyals | University of Oxford | Google DeepMind] (2016) Lip Reading Sentences in the Wild. arXiv:1611.05358  |  reddit  |  reddit  |  Google DeepMind AI destroys human expert in lip reading competition [TechRepublic.com]

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focused on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem - unconstrained natural language sentences, and in the wild videos. Our key contributions are: (1) a 'Watch, Listen, Attend and Spell' (WLAS) network that learns to transcribe videos of mouth motion to characters; (2) a curriculum learning strategy to accelerate training and to reduce overfitting; (3) a 'Lip Reading Sentences' (LRS) dataset for visual speech recognition, consisting of over 100,000 natural sentences from British television. The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin. This lip reading performance beats a professional lip reader on videos from BBC television, and we also demonstrate that visual information helps to improve speech recognition performance even when the audio is available.



• Aytar Y [MIT] (2016) SoundNet: Learning Sound Representations from Unlabeled Video. pdf  |  Phys.org
• We learn rich natural sound representations by capitalizing on large amounts of unlabeled sound data collected in the wild. We leverage the natural synchronization between vision and sound to learn an acoustic representation using two-million unlabeled videos. Unlabeled video has the advantage that it can be economically acquired at massive scales, yet contains useful signals about natural sound. We propose a student-teacher training procedure which transfers discriminative visual knowledge from well established visual recognition models into the sound modality using unlabeled video as a bridge. Our sound representation yields significant performance improvements over the state-of-the-art results on standard benchmarks for acoustic scene/object classification. Visualizations suggest some high-level semantics automatically emerge in the sound network, even though it is trained without ground truth labels.

• Blaauw M (2017) A Neural Parametric Singing Synthesizer. arXiv:1704.03809  |  audio samples  |  reddit  |  twitter

• We present a new model for singing synthesis based on a modified version of the WaveNet architecture. Instead of modeling raw waveform, we model features produced by a parametric vocoder that separates the influence of pitch and timbre. This allows conveniently modifying pitch to match any target melody, facilitates training on more modest dataset sizes, and significantly reduces training and generation times. Our model makes frame-wise predictions using mixture density outputs rather than categorical outputs in order to reduce the required parameter count. As we found overfitting to be an issue with the relatively small datasets used in our experiments, we propose a method to regularize the model and make the autoregressive generation process more robust to prediction errors. Using a simple multi-stream architecture, harmonic, aperiodic and voiced/unvoiced components can all be predicted in a coherent manner. We compare our method to existing parametric statistical and state-of-the-art concatenative methods using quantitative metrics and a listening test. While naive implementations of the autoregressive generation algorithm tend to be inefficient, using a smart algorithm we can greatly speed up the process and obtain a system that's competitive in both speed and quality.

• Collobert R [Facebook AI Research] (2016) Wav2Letter: an End-to-End ConvNet-based Speech Recognition System. arXiv:1609.03193  |  reddit
• This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criterion for training from sequence annotation without alignment that is on par with CTC while being simpler. We show competitive results in word error rate on the Librispeech corpus with MFCC features, and promising results from raw waveform.

• Dean J [Ng AY | Google] (2012) Large scale distributed deep networks.  (pdf)  |  reddit  |  notes

• Recent work in unsupervised feature learning and deep learning has shown that being able to train large models can dramatically improve performance. In this paper, we consider the problem of training a deep network with billions of parameters using tens of thousands of CPU cores. We have developed a software framework called DistBelief that can utilize computing clusters with thousands of machines to train large models. Within this framework, we have developed two algorithms for large-scale distributed training: (i) Downpour SGD, an asynchronous stochastic gradient descent procedure supporting a large number of model replicas, and (ii) Sandblaster, a framework that supports a variety of distributed batch optimization procedures, including a distributed implementation of L-BFGS. Downpour SGD and Sandblaster L-BFGS both increase the scale and speed of deep network training. We have successfully used our system to train a deep network 30x larger than previously reported in the literature, achieving state-of-the-art performance on ImageNet, a visual object recognition task with 16 million images and 21k categories. We show that these same techniques dramatically accelerate the training of a more modestly-sized deep network for a commercial speech recognition service. Although we focus on and report performance of these methods as applied to training large neural networks, the underlying algorithms are applicable to any gradient-based machine learning algorithm.
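
• A toy, single-process sketch of the Downpour SGD idea described above (my own simplification on a least-squares problem; the real DistBelief runs many asynchronous model replicas against a sharded parameter server, so gradients arrive stale and out of order):

```python
import numpy as np

class ParameterServer:
    """Holds the global parameters; replicas push gradients and pull params."""
    def __init__(self, dim, lr=0.1):
        self.w, self.lr = np.zeros(dim), lr
    def pull(self):
        return self.w.copy()
    def push(self, grad):
        self.w -= self.lr * grad                 # apply the (possibly stale) gradient

def replica_step(server, X, y):
    """One model replica: pull params, compute a local gradient, push it."""
    w = server.pull()
    grad = 2 * X.T @ (X @ w - y) / len(y)        # gradient of a least-squares loss
    server.push(grad)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
server = ParameterServer(dim=5)
for shard in np.array_split(np.arange(100), 4): # 4 "replicas", each on its own data shard
    replica_step(server, X[shard], y[shard])
print(server.w)
```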

• "Distributed training" - related:

• Scardapane S (2016) A Framework for Parallel and Distributed Training of Neural Networks. arXiv:1610.07448  |  reddit

• The aim of this paper is to develop a general framework for training neural networks (NNs) in a distributed environment, where training data is partitioned over a set of agents that communicate with each other through a sparse, possibly time-varying, connectivity pattern. In such distributed scenario, the training problem can be formulated as the (regularized) optimization of a non-convex social cost function, given by the sum of local (non-convex) costs, where each agent contributes with a single error term defined with respect to its local dataset. To devise a flexible and efficient solution, we customize a recently proposed framework for non-convex optimization over networks, which hinges on a (primal) convexification-decomposition technique to handle non-convexity, and a dynamic consensus procedure to diffuse information among the agents. Several typical choices for the training criterion (e.g., squared loss, cross entropy, etc.) and regularization (e.g., $\small ℓ2$ norm, sparsity inducing penalties, etc.) are included in the framework and explored along the paper. Convergence to a stationary solution of the social non-convex problem is guaranteed under mild assumptions. Additionally, we show a principled way allowing each agent to exploit a multi-core architecture (e.g., a local cloud) in order to parallelize its local optimization step, resulting in strategies that are both distributed (across the agents) and parallel (inside each agent) in nature. A comprehensive set of experimental results validate the proposed approach.

• Feng M (2015) Distributed Deep Learning for Question Answering. arXiv:1511.01158

• Diamos G [Baidu] (2016) Persistent RNNs: Stashing Recurrent Weights On-Chip. pdf  |  reddit
• This paper introduces a new technique for mapping Deep Recurrent Neural Networks (RNN) efficiently onto GPUs. We show how it is possible to achieve substantially higher computational throughput at low mini-batch sizes than direct implementations of RNNs based on matrix multiplications. The key to our approach is the use of persistent computational kernels that exploit the GPU's inverted memory hierarchy to reuse network weights over multiple timesteps. Our initial implementation sustains 2.8 TFLOP/s at a minibatch size of 4 on an NVIDIA TitanX GPU. This provides a 16x reduction in activation memory footprint, enables model training with 12x more parameters on the same hardware, allows us to strongly scale RNN training to 128 GPUs, and allows us to efficiently explore end-to-end speech recognition models with over 100 layers.

• Ephrat A (2017) Vid2speech: Speech Reconstruction from Silent Video. arXiv:1701.00495  |  GitHub  |  GitXiv  |  published paper [pdf]  |  project page (demos)  |  reddit
• Speechreading is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. The proposed CNN generates sound features for each frame based on its neighboring frames. Waveforms are then synthesized from the learned speech features to produce intelligible speech. We show that by leveraging the automatic feature learning capabilities of a CNN, we can obtain state-of-the-art word intelligibility on the GRID dataset, and show promising results for learning out-of-vocabulary (OOV) words.

• Graves A [Google DeepMind] (2014) Towards End-To-End Speech Recognition with Recurrent Neural Networks. pdf

• This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function. A modification to the objective function is introduced that trains the network to minimise the expectation of an arbitrary transcription loss function. This allows a direct optimisation of the word error rate, even in the absence of a lexicon or language model. The system achieves a word error rate of 27.3% on the Wall Street Journal corpus with no prior linguistic information, 21.9% with only a lexicon of allowed words, and 8.2% with a trigram language model. Combining the network with a baseline system further reduces the error rate to 6.7%.

• Implementation (non-author):  ctc-beam-search [GitHub]: This is a Python implementation of Alex Graves' paper "Towards End-to-End Speech Recognition with Recurrent Neural Networks." Note: The code is slightly inefficient, but it's a simple straightforward implementation, and a good place to start.  |  reddit

• Hershey S [Google] (2016) CNN Architectures for Large-Scale Audio Classification. arXiv:1609.09430
• Convolutional Neural Networks (CNNs) have proven very effective in image classification and have shown promise for audio classification. We apply various CNN architectures to audio and investigate their ability to classify videos with a very large data set of 70M training videos (5.24 million hours) with 30,871 labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet, VGG, Inception, and ResNet. We explore the effects of training with different sized subsets of the training videos. Additionally we report the effect of training using different subsets of the labels. While our dataset contains video-level labels, we are also interested in Acoustic Event Detection (AED) and train a classifier on embeddings learned from the video-level task on Audio Set [5]. We find that derivatives of image classification networks do well on our audio classification task, that increasing the number of labels we train on provides some improved performance over subsets of labels, that performance of models improves as we increase training set size, and that a model using embeddings learned from the video-level task does much better than a baseline on the Audio Set classification task.

• Dataset.  The YouTube-100M data set consists of 100 million YouTube videos: 70M training videos, 10M evaluation videos, and a pool of 20M videos that we use for validation. Videos average 4.6 minutes each for a total of 5.4M training hours. Each of these videos is labeled with 1 or more topic identifiers (from Knowledge Graph [22]) from a set of 30,871 labels. There are, on average, around 5 labels per video. The labels are assigned automatically based on a combination of metadata (title, description, comments, etc.), context, and image content for each video. The labels apply to the entire video and range from very generic (e.g. "Song") to very specific (e.g. "Cormorant"). Table 1 shows a few examples.

Being machine generated, the labels are not 100% accurate and of the 30K labels, some are clearly acoustically relevant ("Trumpet") and others are less so ("Web Page"). Videos are often annotated with similar labels with varying degrees of specificity. For example, videos labeled with "Trumpet" will tend to be labeled "Entertainment" as well, but no hierarchy is enforced.

• Conclusions.  The results in Section 4.1 show that state-of-the-art image networks are capable of excellent results on audio classification when compared to a simple fully connected network or earlier image classification architectures. In Section 4.2 we saw results showing that training on larger label set vocabularies can improve performance, albeit modestly, when evaluating on smaller label sets. In Section 4.3 we saw that increasing the number of videos up to 7M improves performance for the best-performing ResNet-50 architecture. We note that regularization could have reduced the gap between the models trained on smaller datasets and the 7M and 70M datasets. In Section 4.4 we see a significant increase over our baseline when training a model for AED with ResNet embeddings on the Audio Set dataset.

In addition to these quantified results, we can subjectively examine the performance of the model on segments of video. Fig. 2 shows the results of running our best classifier over a video and overlaying the frame-by-frame results of the 16 classifier outputs with the greatest peak values across the entire video. The different sound sources present at different points in the video are clearly distinguished.

• Hinton G (2012) Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. pdf

• Jaques N (2016) [Google Brain | MIT | University of Cambridge | Max Planck Institute for Intelligent Systems] Tuning Recurrent Neural Networks with Reinforcement Learning. arXiv:1611.02796

• Mikolov T (2011) RNNLM - Recurrent neural network language modeling toolkit [pdf]:
• We present a freely available open-source toolkit for training recurrent neural network based language models. It can be easily used to improve existing speech recognition and machine translation systems. Also, it can be used as a baseline for future research of advanced language modeling techniques. In the paper, we discuss optimal parameter selection and different modes of functionality. The toolkit, example scripts and basic setups are freely available at GitHub.

• Shao L [Denny Britz; Ray Kurzweil | Google Research | Google Brain] (2017) Generating Long and Diverse Responses with Neural Conversation Models. arXiv:1701.03185
• Building general-purpose conversation agents is a very challenging task, but necessary on the road toward intelligent agents that can interact with humans in natural language. Neural conversation models -- purely data-driven systems trained end-to-end on dialogue corpora -- have shown great promise recently, yet they often produce short and generic responses. This work presents new training and decoding methods that improve the quality, coherence, and diversity of long responses generated using sequence-to-sequence models. Our approach adds self-attention to the decoder to maintain coherence in longer responses, and we propose a practical approach, called the glimpse-model, for scaling to large datasets. We introduce a stochastic beam-search algorithm with segment-by-segment reranking which lets us inject diversity earlier in the generation process. We trained on a combined data set of over 2.3B conversation messages mined from the web. In human evaluation studies, our method produces longer responses overall, with a higher proportion rated as acceptable and excellent as length increases, compared to baseline sequence-to-sequence models with explicit length-promotion. A back-off strategy produces better responses overall, in the full spectrum of lengths.

• Sotelo J (ICLR 2017) [Aaron Courville | Yoshua Bengio | MILA (UdeM)] Char2Wav: End-to-End Speech Synthesis. pdf  |  GitHub  |  GitXiv  |  project page  |  project page  |  demo

• We present Char2Wav, an end-to-end model for speech synthesis. Char2Wav has two components: a reader and a neural vocoder. The reader is an encoder-decoder model with attention. The encoder is a bidirectional recurrent neural network that accepts text or phonemes as inputs, while the decoder is a recurrent neural network (RNN) with attention that produces vocoder acoustic features. Neural vocoder refers to a conditional extension of SampleRNN which generates raw waveform samples from intermediate representations. Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text.

• van den Oord A [Oriol Vinyals; Alex Graves; Nal Kalchbrenner; Koray Kavukcuoglu | Google DeepMind] (2016) WaveNet: A Generative Model For Raw Audio. arXiv:1609.03499  |  pdf

• This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Chinese. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
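
• For orientation: "fully probabilistic and autoregressive" here means the joint distribution over a waveform $\small \mathbf{x} = \{x_1, \ldots, x_T\}$ is factorized as $\small p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$, with each conditional modelled by the network (via stacks of dilated causal convolutions), so samples are generated one timestep at a time.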

• Google DeepMind blog post: WaveNet: A Generative Model for Raw Audio: This post presents WaveNet, a deep generative model of raw audio waveforms. We show that WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%. We also demonstrate that the same network can be used to synthesize other audio signals such as music, and present some striking samples of automatically generated piano pieces. ...

• Loved on reddit!

• Implementations:

• Very impressive!   Wang Y [Samy Bengio | Google] (2017) Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model. arXiv:1703.10135  |  reddit audio samples  [GitHub: 'demo' folder]  |  audio samples

• A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module. Building these components often requires extensive domain expertise and may contain brittle design choices. In this paper, we present Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters. Given <text, audio> pairs, the model can be trained completely from scratch with random initialization. We present several key techniques to make the sequence-to-sequence framework perform well for this challenging task. Tacotron achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness. In addition, since Tacotron generates speech at the frame level, it's substantially faster than sample-level autoregressive methods.

• Wen TH [Cambridge University] (2015) Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. arXiv:1508.01745  |  reddit  |  reddit
• Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact both on usability and perceived quality. Most NLG systems in common use employ rules and heuristics and tend to generate rigid and stylised responses without the natural variation of human language. They are also not easily scaled to systems covering multiple domains and languages. This paper presents a statistical language generator based on a semantically controlled Long Short-term Memory (LSTM) structure. The LSTM generator can learn from unaligned data by jointly optimising sentence planning and surface realisation using a simple cross entropy training criterion, and language variation can be easily achieved by sampling from output candidates. With fewer heuristics, an objective evaluation in two differing test domains showed the proposed method improved performance compared to previous methods. Human judges scored the LSTM system higher on informativeness and naturalness and overall preferred it to the other systems.

• Xiong W [Microsoft Research] (2016) Achieving Human Parity in Conversational Speech Recognition. arXiv:1610.05256 [Technical Report MSR-TR-2016-71]  |  reddit

• Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state of the art, and edges past the human benchmark, achieving error rates of 5.8% and 11.0%, respectively. The key to our system's performance is the use of various convolutional and LSTM acoustic model architectures, combined with a novel spatial smoothing method and lattice-free MMI acoustic training, multiple recurrent neural network language modeling approaches, and a systematic use of system combination.

• Discussed here:  The Morning Paper

• Xiong W [Microsoft Research] (2016) The Microsoft 2016 Conversational Speech Recognition System. arXiv:1609.03528  |  reddit  |  blog: review

• We describe Microsoft's conversational speech recognition system, in which we combine recent developments in neural-network-based acoustic and language modeling to advance the state of the art on the Switchboard recognition task. Inspired by machine learning ensemble techniques, the system uses a range of convolutional and recurrent neural networks. I-vector modeling and lattice-free MMI training provide significant gains for all acoustic model architectures. Language model rescoring with multiple forward and backward running RNNLMs, and word posterior-based system combination provide a 20% boost. The best single system uses a ResNet architecture acoustic model with RNNLM rescoring, and achieves a word error rate of 6.9% on the NIST 2000 Switchboard task. The combined system has an error rate of 6.2%, representing an improvement over previously reported results on this benchmark task.

• MS blog:  Microsoft researchers achieve speech recognition milestone

• reddit:  Question: When, and at what word error rate, will speech recognition make human transcribers mostly redundant  |  [1609.03528] The Microsoft 2016 Conversational Speech Recognition System (word error rate of 6.9% on the NIST 2000)

• ZDnet.com [Sep 2016]:  Microsoft's newest milestone? World's lowest error rate in speech recognition

• Microsoft leapfrogs IBM to claim a significant test result in the quest for machines to understand speech better than humans. ...

The previous lowest error rate was 6.9 percent, achieved by IBM's Watson team, which beat their own record of eight percent set last year. ... However, these days with more research funds being funneled into deep neural networks, tech giants are boasting error rates of well below 10 percent, but not quite at a level that exceeds human-level accuracy, which IBM estimates to be at about four percent.

Google CEO Sundar Pichai last year boasted its deep neural networks helped it achieve an error rate of eight percent in speech recognition systems that power voice search and Android.

More recently, Apple's senior director of Siri, Alex Acero, a former Microsoft Research member, said error rates for speech recognition have been "cut by a factor of two in all languages", with greater gains in some languages, again thanks to its work on deep neural networks.

• First Computer to Match Humans in Ordinary Speech Recognition [MIT Technology Review] A team at Microsoft Research has trained a deep-learning machine to recognize ordinary speech as well as humans can.

• Saon G [IBM Watson] (2015) The IBM 2015 English conversational telephone speech recognition system. arXiv:1505.05899

• We describe the latest improvements to the IBM English conversational telephone speech recognition system. Some of the techniques that were found beneficial are: maxout networks with annealed dropout rates; networks with a very large number of outputs trained on 2000 hours of data; joint modeling of partially unfolded recurrent neural networks and convolutional nets by combining the bottleneck and output layers and retraining the resulting model; and lastly, sophisticated language model rescoring with exponential and neural network LMs. These techniques result in an 8.0% word error rate on the Switchboard part of the Hub5-2000 evaluation test set which is 23% relative better than our previous best published result.

• 2016 Update:   Saon G (2016). The IBM 2016 English conversational telephone speech recognition system. arXiv:1604.08242

• We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation testset. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional long short-term memory nets which operate on FMLLR and i-vector features. On the language modeling side, we use an updated model "M" and hierarchical neural network LMs.

• Recent Advances in Conversational Speech Recognition [IBM Watson Blog: April 28, 2016]

• 2017 Update:  Saon G (2017) English Conversational Telephone Speech Recognition by Humans and Machines. arXiv:1703.02136

• One of the most difficult speech recognition tasks is accurate recognition of human to human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues - what IS human performance, and how far down can we still drive speech recognition error rates? A recent paper by Microsoft suggests that we have already achieved human performance. In trying to verify this statement, we performed an independent set of human performance measurements on two conversational tasks and found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the word error rate of our own English conversational telephone LVCSR system to the level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation, which - at least at the writing of this paper - is a new performance milestone (albeit not at what we measure to be human performance!). On the acoustic side, we use a score fusion of three models: one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning and a third residual net (ResNet) with 25 convolutional layers and time-dilated convolutions. On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models.

• Reaching new records in speech recognition [IBM Watson Blog: Mar 07, 2017]

• Zweig G [Microsoft Research] (2016) Advances in All-Neural Speech Recognition. arXiv:1609.05935
• This paper advances the design of CTC-based all-neural (or end-to-end) speech recognizers. We propose a novel symbol inventory, and a novel iterated-CTC method in which a second system is used to transform a noisy initial output into a cleaner version. We present a number of stabilization and initialization methods we have found useful in training these networks. We evaluate our system on the commonly used NIST 2000 conversational telephony test set, and significantly exceed the previously published performance of similar systems, both with and without the use of an external language model and decoding technology.

## AUTOENCODERS - GENERAL

[Autoencoders:General] Blogs:

• Adversarial Autoencoders (vs Moment Matching Autoencoders?)
• relevant paper: Makhzani A [Goodfellow I] (2015) Adversarial Autoencoders. arXiv:1511.05644

• Adversarial Autoencoders [GitXiv]  |  Our method, named "adversarial autoencoder", uses the recently proposed generative adversarial networks (GAN) in order to match the aggregated posterior of the hidden code vector of the autoencoder with an arbitrary prior. Matching the aggregated posterior to the prior ensures that there are no "holes" in the prior, and generating from any part of prior space results in meaningful samples.

• Autoencoding Blade Runner: Reconstructing films with artificial neural networks  |  reddit

• In this blog I detail the work I have been doing over the past year in getting artificial neural networks to reconstruct films - by training them to reconstruct individual frames from films, and then getting them to reconstruct every frame in a given film and resequencing it. The type of neural network used is an autoencoder. An autoencoder is a type of neural net with a very small bottleneck: it encodes a data sample into a much smaller representation (in this case a 200 digit number), then reconstructs the data sample to the best of its ability. The reconstructions are in no way perfect, but the project was more of a creative exploration of both the capacity and limitations of this approach. This work was done as the dissertation project for my research masters (MSc) in Creative Computing at Goldsmiths. (A toy sketch of the bottleneck idea follows this entry.)

• Mentioned here: "A researcher at Goldsmiths in London trains a variational autoencoder deep learning model on all frames from the Blade Runner movie and then asks the network to reconstruct the video in its original sequence as well as other videos the network wasn't trained on."
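
• A bare-bones sketch of that bottleneck idea (a toy linear autoencoder in NumPy, my own illustration; it has nothing to do with the actual network used in the project):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 64))          # toy "frames": 256 samples, 64 features

d_in, d_code = X.shape[1], 8            # small bottleneck: 64 -> 8 -> 64
W_enc = rng.normal(scale=0.1, size=(d_in, d_code))
W_dec = rng.normal(scale=0.1, size=(d_code, d_in))

lr = 1e-2
for step in range(500):
    code = X @ W_enc                    # encode to the low-dimensional bottleneck
    X_hat = code @ W_dec                # decode / reconstruct
    err = X_hat - X                     # gradient of the sum-squared error
    W_dec -= lr * code.T @ err / len(X)
    W_enc -= lr * X.T @ (err @ W_dec.T) / len(X)

print(np.mean((X - (X @ W_enc) @ W_dec) ** 2))  # reconstruction MSE after training
```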

• Autoencoders [tutorial]

• [autoencoder] Convolutional Shape Encoder

• Deep Autoencoders [reddit]

• GitHub: shape-encoder:
• relevant paper: Kaiser Ł. & Sutskever I. (2015) Neural GPUs learn algorithms. arXiv:1511.08228

• Denoising Convolutional Autoencoder [GitHub]: implementation of a denoising convolutional autoencoder built on Torch, trained on still images from Stanley Kubrick's 2001: A Space Odyssey. This code is a modified version of the denoising autoencoder by Kaixhin [GitHub]. It does not have any pooling layers, works on RGB images of 3x96x96 dimensions, and is trained with small, stochastic minibatches, randomly resampling from the full dataset to create a unique training dataset each epoch (as a way of getting around my 4GB GPU limitations). Trained using an NVIDIA 970 GTX.  |  reddit

• Hybrid Collaborative Filtering with Neural Networks (Collaborative Filtering with Sparse Denoising Autoencoders): Collaborative Filtering aims at exploiting the feedback of users to provide personalised recommendations. Such algorithms look for latent variables in a large sparse matrix of ratings. They can be enhanced by adding side information to tackle the well-known cold start problem. While Neural Networks have tremendous success in image and speech recognition, they have received less attention in Collaborative Filtering.

• Introducing Variational Autoencoders (in Prose and Code)  |  reddit

• Rank-ordered Autoencoders: A new method for unsupervised learning of sparse representations with autoencoders is proposed and implemented by ordering the output of the hidden units by their activation value and progressively reconstructing the input in this order. This can be done efficiently in parallel with the use of cumulative sums and sorting only slightly increasing the computational costs.  |  reddit  |  arXiv:1605.01749

• Stacked Denoising Autoencoders  |  Yoshua Bengio's Theano group at UdeM  |  mentioned here: Deep Autoencoders [reddit]

• Variational Autoencoders (VAE) vs Generative Adversarial Networks (GAN)? [reddit]: VAEs can be used with discrete inputs, while GANs can be used with discrete latent variables. However, assuming both are continuous, is there any reason to prefer one over the other?

• They're similar in some respects, different in others -- it all depends on what you're trying to do. VAEs, being autoencoders, learn to map from an input to a low dimensional space, while only recently have people started figuring out how to do that with GANs (Vanilla GANs don't map from the input to the latent space directly.) GANs are promising and have recently shown some awesome empirical results, but are generally known to be trickier to train (though that too looks like it's being improved upon) and are a relatively "new," albeit extremely "hot" area of research. The adversarial objective is apparently a pretty powerful one, and is suitable for sticking onto the end of the VAE.

• There are also Adversarial AEs: arXiv:1511.05644
• The biggest advantage of VAEs is the nice probabilistic formulation they come with as a result of maximizing a lower bound on the log-likelihood (see the note on the ELBO after the list below). The advantage of GANs at the moment is that they are better at generating visual features (which really boils down to the adversarial loss being better than a mean-squared loss).
• [u/alexmlamb]

• VAEs: usually easier to train and get working. Relatively easy to implement and robust to hyperparameter choices.

• VAEs: tractable likelihood.

• VAEs: have an explicit inference network, so they let us do reconstruction.

• VAEs: if the distribution $\small p(x|z)$ makes conditional independence assumptions, then it might have the "blurring" effect on images and "low-pass" effect on audio.

• GANs: much higher visual fidelity in generated samples.
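
• For reference, the lower bound mentioned above (the evidence lower bound, ELBO) for an encoder $\small q(z|x)$, decoder $\small p(x|z)$ and prior $\small p(z)$ is $\small \log p(x) \ge \mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{KL}\big(q(z|x)\,\|\,p(z)\big)$, i.e. a reconstruction term plus a KL term that pulls the approximate posterior toward the prior.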

• What's wrong with autoencoders?: "... So, an optimal single layer autoencoder will discard an image's high-frequency information. This explains why autoencoders invariably produce blurry outputs. Whenever you optimize for reducing the sum-squared error in pixel intensities, you will find the same effect. The high-frequency information in images is just not interesting from a sum-squared error point of view. The upshot of this is, if you want to produce sharper results from an autoencoder, you'll need to find a different objective function to minimize."  |  reddit

• What can we not do with ML these days?  |  Unsupervised learning is an area that we are not very good at and that has enormous potential if we do get good at it.  |  Autoencoders are a great example of the power of unsupervised learning. ML practitioners are severely limited by our ingenuity on how to represent features.

• Which unsupervised learning method produces the best features for semi-supervised object recognition? Are variational autoencoders currently the best? [reddit]
There is a lot of talk about variational autoencoders applied to images lately. However, I wonder if they actually learn the best features for semi-supervised object recognition? Have there been any comparisons to simple (convolutional) autoencoders, denoising autoencoders and other unsupervised learning methods?
EDIT: To clarify, I'm asking about fully unsupervised feature learning methods that are then used to extract features to be used in supervised learning.

• It's the semi-supervised Ladder AFAIK: arXiv:1507.02672

• Using our new techniques, we achieve state-of-the-art results in semi-supervised classification on MNIST, CIFAR-10 and SVHN. ← Surprisingly, it's GANs: arXiv:1606.03498

• [u/alexmlamb] It's also interesting that on SVHN, Dumoulin and I beat DCGAN using ALI [Adversarially Learned Inference] (like 22% error to 19%). The "Improved GAN techniques" paper gets a much better SVHN result by doing things like co-training the discriminator to do classification. It would be interesting to see if "Improved GAN techniques" + ALI got even better semi-supervised results. arXiv:1606.00704

• [Dumoulin V; Lamb A; Courville A] (2016) Adversarially Learned Inference. arXiv:1606.00704  |  Adversarially Learned Inference  |  reddit  |  reddit:mention

We introduce the adversarially learned inference (ALI) model, which jointly learns a generation network and an inference network using an adversarial process. The generation network maps samples from stochastic latent variables to the data space while the inference network maps training examples in data space to the space of latent variables. An adversarial game is cast between these two networks and a discriminative network that is trained to distinguish between joint latent/data-space samples from the generative network and joint samples from the inference network. We illustrate the ability of the model to learn mutually coherent inference and generation networks through the inspections of model samples and reconstructions and confirm the usefulness of the learned representations by obtaining a performance competitive with other recent approaches on the semi-supervised SVHN task.

• Project web page  |  reddit  |  reddit

[Autoencoders:General] Papers:

• Barone AVM (ACL 2016) Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. arXiv:1608.02996  |  GitHub  |  GitXiv

• Current approaches to learning vector representations of text that are compatible between different languages usually require some amount of parallel text, aligned at word, sentence or at least document level. We hypothesize however, that different natural languages share enough semantic structure that it should be possible, in principle, to learn compatible vector representations just by analyzing the monolingual distribution of words. In order to evaluate this hypothesis, we propose a scheme to map word vectors trained on a source language to vectors semantically compatible with word vectors trained on a target language using an adversarial autoencoder. We present preliminary qualitative results and discuss possible future developments of this technique, such as applications to cross-lingual sentence representations.

• Author AVM Barone (u/AnvaMiba) on reddit: [1608.02996] Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders:

• In this preliminary work I try to learn a transformation of word embeddings from one language (e.g. English) to another language (e.g. Italian) without using any parallel dataset.

My hypothesis is that this should be possible because languages are assumed to have a hidden vector-like "concept" space (of which word embeddings are a crude approximation, although it may make more sense to consider sentence or document embeddings) and if different languages are used to talk about similar themes, the stochastic processes that generate these latent representations should be near isomorphic.

So my general idea is to use generative adversarial networks (GANs) to learn to match word embedding distributions: instead of transforming from Gaussian noise to images, as it is usually done in GAN papers, I transform from English embeddings to Italian embeddings.

Unfortunately this basic setup doesn't work since training ends up in the pathological state where the generator collapses everything into a single output vector, a known problem of GANs which I think becomes even worse in my case since I use point-mass probability distributions instead of truly continuous ones.

Hence I use adversarial autoencoders (AAEs): I add a decoder that tries to reconstruct English embeddings from the artificial Italian embeddings produced by the generator, using cosine dissimilarity as a reconstruction loss.

Using a few tricks to aid optimization (a ResNet leaky relu discriminator with batch normalization to increase the magnitude of the gradient being backpropagated to the generator) I managed to make the model learn.

Qualitatively, it approximately learns some frequent mappings, but overall it is not competitive with cross-lingual embedding approaches that make use of parallel resources. I don't know if it is just a matter of architecture/hyperparameters or if I have already hit a fundamental limit of how much semantic transfer can be done by using only monolingual data.

Comments, suggestions, criticism are welcome. Also, if you are at ACL 2016 in Berlin, I will present this work as a poster today (Aug 11) in the REPL4NLP workshop.
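• A rough sketch of the setup Barone describes (my own paraphrase in code, run on random stand-in "embeddings"; the actual model uses a ResNet-style leaky-ReLU discriminator with batch normalization): a generator maps source-language embeddings into the target embedding space, a discriminator tries to tell mapped embeddings from real target embeddings, and a decoder reconstructs the source embedding under a cosine-dissimilarity loss.

```python
# Adversarial-autoencoder style cross-lingual mapping (illustrative sketch, PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 300
src = torch.randn(1000, d)   # stand-in for English word embeddings
tgt = torch.randn(1000, d)   # stand-in for Italian word embeddings

gen = nn.Linear(d, d, bias=False)                      # generator: source -> target space
dec = nn.Linear(d, d, bias=False)                      # decoder: mapped vector -> source space
disc = nn.Sequential(nn.Linear(d, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

opt_g = torch.optim.Adam(list(gen.parameters()) + list(dec.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):                                # toy training loop
    x = src[torch.randint(0, 1000, (64,))]
    y = tgt[torch.randint(0, 1000, (64,))]

    # 1) Discriminator: real target embeddings vs. mapped source embeddings.
    fake = gen(x).detach()
    d_loss = bce(disc(y), torch.ones(64, 1)) + bce(disc(fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator + decoder: fool the discriminator and reconstruct the source embedding.
    mapped = gen(x)
    adv = bce(disc(mapped), torch.ones(64, 1))
    recon = (1 - F.cosine_similarity(dec(mapped), x, dim=1)).mean()  # cosine dissimilarity
    g_loss = adv + recon
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```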

• Author (AVM Barone) on reddit (u/AnvaMiba): Are DenseNets and ResNets specific to vision problems? Do these ideas work for fully connected networks?: "I have used a ResNet for a NLP task (arXiv:1603.04259)."

• Bornschein J [Bengio Y] (2015) Bidirectional Helmholtz Machines. arXiv:1506.03877  |  GitXiv

• Efficient unsupervised training and inference in deep generative models remains a challenging problem. One basic approach, called Helmholtz machine, involves training a top-down directed generative model together with a bottom-up auxiliary model used for approximate inference. Recent results indicate that better generative models can be obtained with better approximate inference procedures. Instead of improving the inference procedure, we here propose a new model which guarantees that the top-down and bottom-up distributions can efficiently invert each other. We achieve this by interpreting both the top-down and the bottom-up directed models as approximate inference distributions and by defining the model distribution to be the geometric mean of these two. We present a lower-bound for the likelihood of this model and we show that optimizing this bound regularizes the model so that the Bhattacharyya distance between the bottom-up and top-down approximate distributions is minimized. This approach results in state of the art generative models which prefer significantly deeper architectures while it allows for orders of magnitude more efficient approximate inference.

• Mentioned here: Minimal Gate Unit for Recurrent Neural Networks [reddit]

• Bowman SR [Vilnis L; Vinyals O; Dai AM; Jozefowicz R; Bengio S | Stanford | UMass | Google Brain] (2015) Generating sentences from a continuous space. arXiv:1511.06349  |  [pdf]

• The standard recurrent neural network language model (RNNLM) generates sentences one word at a time and does not work from an explicit global sentence representation. In this work, we introduce and study an RNN-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences. This factorization allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features. Samples from the prior over these sentence representations remarkably produce diverse and well-formed sentences through simple deterministic decoding. By examining paths through this latent space, we are able to generate coherent novel sentences that interpolate between known sentences. We present techniques for solving the difficult learning problem presented by this model, demonstrate its effectiveness in imputing missing words, explore many interesting properties of the model's latent sentence space, and present negative results on the use of the model in language modeling.

• Related:  Iyyer M (2014) Generating Sentences from Semantic Vector Space Representations   [pdf]

• Chen M [Bengio Y] (ICML 2014) Marginalized Denoising Auto-encoders for Nonlinear Representations. pdf  |  GitHub  |  GitXiv
• Denoising auto-encoders (DAEs) have been successfully used to learn new representations for a wide range of machine learning tasks. During training, DAEs make many passes over the training dataset and reconstruct it from partial corruption generated from a pre-specified corrupting distribution. This process learns robust representation, though at the expense of requiring many training epochs, in which the data is explicitly corrupted. In this paper we present the marginalized Denoising Auto-encoder (mDAE), which (approximately) marginalizes out the corruption during training. Effectively, the mDAE takes into account infinitely many corrupted copies of the training data in every epoch, and therefore is able to match or outperform the DAE with much fewer training epochs. We analyze our proposed algorithm and show that it can be understood as a classic auto-encoder with a special form of regularization. In empirical evaluations we show that it attains 1-2 order-of-magnitude speedup in training time over other competing approaches.

• Chen X [Ilya Sutskever | UC Berkeley | OpenAI | MIT] (2016) Variational Lossy Autoencoder. arXiv:1611.02731
• Representation learning seeks to expose certain aspects of observed data in a learned representation that's amenable to downstream tasks like classification. For instance, a good representation for 2D images might be one that describes only global structure and discards information about detailed texture. In this paper, we present a simple but principled method to learn such global representations by combining Variational Autoencoder (VAE) with neural autoregressive models such as RNN, MADE and PixelRNN/CNN. Our proposed VAE model allows us to have control over what the global latent code can learn and, by designing the architecture accordingly, we can force the global latent code to discard irrelevant information such as texture in 2D images, and hence the code only "autoencodes" data in a lossy fashion. In addition, by leveraging autoregressive models as both prior distribution p(z) and decoding distribution p(x|z), we can greatly improve generative modeling performance of VAEs, achieving new state-of-the-art results on MNIST, OMNIGLOT and Caltech-101 Silhouettes density estimation tasks.

• Cheung B [UC Berkeley | Nervana] (2014) Discovering hidden factors of variation in deep networks. arXiv:1412.6583  |  GitHub  |  GitXiv
• Deep learning has enjoyed a great deal of success because of its ability to learn useful features for tasks such as classification. But there has been less exploration in learning the factors of variation apart from the classification signal. By augmenting autoencoders with simple regularization terms during training, we demonstrate that standard deep architectures can discover and explicitly represent factors of variation beyond those relevant for categorization. We introduce a cross-covariance penalty (XCov) as a method to disentangle factors like handwriting style for digits and subject identity in faces. We demonstrate this on the MNIST handwritten digit database, the Toronto Faces Database (TFD) and the Multi-PIE dataset by generating manipulated instances of the data. Furthermore, we demonstrate these deep networks can extrapolate 'hidden' variation in the supervised signal.

• Dai AM & Le QV (2015) Semi-supervised sequence learning  |  Google: RNN, LSTM, autoencoder, seq2seq, pretraining, CIFAR-10, IMDB  |  pdf  |  reddit

• Doersch C (2016) Tutorial on Variational Autoencoders. arXiv:1606.05908  |  GitHub  |  GitXiv  |  reddit

• In just three years, Variational Autoencoders (VAEs) have emerged as one of the most popular approaches to unsupervised learning of complicated distributions. VAEs are appealing because they are built on top of standard function approximators (neural networks), and can be trained with stochastic gradient descent. VAEs have already shown promise in generating many kinds of complicated data, including handwritten digits, faces, house numbers, CIFAR images, physical models of scenes, segmentation, and predicting the future from static images. This tutorial introduces the intuitions behind VAEs, explains the mathematics behind them, and describes some empirical behavior. No prior knowledge of variational Bayesian methods is assumed.

• Variational auto-encoders do not train complex generative models [reddit]  |  blog: post

• Gregor K [Google DeepMind] (2016) Towards Conceptual Compression. arXiv:1604.08772  |  [1604.08772] Towards Conceptual Compression
• We introduce a simple recurrent variational auto-encoder architecture that significantly improves image modeling. The system represents the state-of-the-art in latent variable models for both the ImageNet and Omniglot datasets. We show that it naturally separates global conceptual information from lower level details, thus addressing one of the fundamentally desired properties of unsupervised learning. Furthermore, the possibility of restricting ourselves to storing only global information about an image allows us to achieve high quality 'conceptual compression'.

• Goodfellow IJ [Courville A; Bengio Y] (2014) Generative Adversarial Networks. arXiv:1406.2661

• Graves A [Google DeepMind] (2016) Stochastic Backpropagation through Mixture Density Distributions. arXiv:1607.05690  |  reddit  |  reddit  |  reddit

• The ability to backpropagate stochastic gradients through continuous latent distributions has been crucial to the emergence of variational autoencoders and stochastic gradient variational Bayes. The key ingredient is an unbiased and low-variance way of estimating gradients with respect to distribution parameters from gradients evaluated at distribution samples. The "reparameterization trick" provides a class of transforms yielding such estimators for many continuous distributions, including the Gaussian and other members of the location-scale family. However the trick does not readily extend to mixture density models, due to the difficulty of reparameterizing the discrete distribution over mixture weights. This report describes an alternative transform, applicable to any continuous multivariate distribution with a differentiable density function from which samples can be drawn, and uses it to derive an unbiased estimator for mixture density weight derivatives. Combined with the reparameterization trick applied to the individual mixture components, this estimator makes it straightforward to train variational autoencoders with mixture-distributed latent variables, or to perform stochastic variational inference with a mixture density variational posterior.
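• For reference, the "reparameterization trick" in its basic Gaussian form, contrasted with the likelihood-ratio (score-function / REINFORCE) estimator it usually replaces. This is a generic illustration of the two gradient estimators, not the mixture-density estimator derived in the paper.

```python
# Two estimators of d/d(mu, log_sigma) of E_{z ~ N(mu, sigma^2)}[f(z)], here with f(z) = z**2.
import torch

mu = torch.tensor(0.5, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
n = 100000

# (a) Reparameterization: z = mu + sigma * eps, eps ~ N(0, 1); the gradient flows through z.
eps = torch.randn(n)
z = mu + torch.exp(log_sigma) * eps
loss_rep = (z ** 2).mean()
g_rep = torch.autograd.grad(loss_rep, [mu, log_sigma])

# (b) Score function (REINFORCE): grad = E[f(z) * d log q(z)/d theta]; typically higher variance.
with torch.no_grad():
    z2 = mu + torch.exp(log_sigma) * torch.randn(n)
logq = torch.distributions.Normal(mu, torch.exp(log_sigma)).log_prob(z2)
loss_sf = (z2 ** 2 * logq).mean()
g_sf = torch.autograd.grad(loss_sf, [mu, log_sigma])

print('reparameterization:', [g.item() for g in g_rep])   # close to the analytic (2*mu, 2*sigma^2)
print('score function:    ', [g.item() for g in g_sf])    # same expectation, noisier estimate
```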

• "Reparameterization trick" also mentioned here:

• Johnson MJ [Harvard | Twitter] (2016) Composing graphical models with neural networks for structured representations and fast inference. arXiv:1603.06277

• We propose a general modeling and inference framework that composes probabilistic graphical models with deep learning methods and combines their respective strengths. ... All components of these models are learned simultaneously with a single objective, giving a scalable algorithm that leverages stochastic variational inference, natural gradients, graphical model message passing, and the reparameterization trick. ...

• Naesseth CA [David M. Blei] (2016) Rejection Sampling Variational Inference. arXiv:1610.05683  |  reddit

• Variational inference using the reparameterization trick has enabled large-scale approximate Bayesian inference in complex probabilistic models, leveraging stochastic optimization to sidestep intractable expectations. The reparameterization trick is applicable when we can simulate a random variable by applying a (differentiable) deterministic function on an auxiliary random variable whose distribution is fixed. For many distributions of interest (such as the gamma or Dirichlet), simulation of random variables relies on rejection sampling. The discontinuity introduced by the accept--reject step means that standard reparameterization tricks are not applicable. We propose a new method that lets us leverage reparameterization gradients even when variables are outputs of a rejection sampling algorithm. Our approach enables reparameterization on a larger class of variational distributions. In several studies of real and synthetic data, we show that the variance of the estimator of the gradient is significantly lower than other state-of-the-art methods. This leads to faster convergence of stochastic optimization variational inference.

• Tokui S [University of Tokyo] (2016) Reparameterization trick for discrete variables. arXiv:1611.01239

• Low-variance gradient estimation is crucial for learning directed graphical models parameterized by neural networks, where the reparameterization trick is widely used for those with continuous variables. While this technique gives low-variance gradient estimates, it has not been directly applicable to discrete variables, the sampling of which inherently requires discontinuous operations. We argue that the discontinuity can be bypassed by marginalizing out the variable of interest, which results in a new reparameterization trick for discrete variables. This reparameterization greatly reduces the variance, which is understood by regarding the method as an application of common random numbers to the estimation. The resulting estimator is theoretically guaranteed to have a variance not larger than that of the likelihood-ratio method with the optimal input-dependent baseline. We give empirical results for variational learning of sigmoid belief networks.

• Hinton G (2010) Discovering binary codes for documents by learning deep generative models. pdf

• Hinton GE (2006) Reducing the dimensionality of data with neural networks.  pdf
• High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such "autoencoder" networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.
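• The basic idea in code (a minimal modern sketch; the 2006 paper relied on layer-wise RBM pretraining before the fine-tuning step shown here): a multilayer network with a small central layer, trained to reconstruct its input, gives a nonlinear alternative to PCA.

```python
# Deep autoencoder with a small central code layer (illustrative sketch, PyTorch).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                        nn.Linear(256, 30))            # small central layer: the low-dimensional code
decoder = nn.Sequential(nn.Linear(30, 256), nn.ReLU(),
                        nn.Linear(256, 784))

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
x = torch.rand(128, 784)                                # stand-in for a batch of flattened images

for step in range(200):
    code = encoder(x)
    recon = decoder(code)
    loss = nn.functional.mse_loss(recon, x)             # reconstruct the high-dimensional input
    opt.zero_grad(); loss.backward(); opt.step()

# 'code' is the learned 30-dimensional representation, analogous to 30 principal components.
```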

• Kingma DP (2016) Improving Variational Inference with Inverse Autoregressive Flow. arXiv:1606.04934  |  reddit  |  reddit

• The framework of normalizing flows provides a general strategy for flexible variational inference of posteriors over latent variables. We propose a new type of normalizing flow, inverse autoregressive flow (IAF), that, in contrast to earlier published flows, scales well to high-dimensional latent spaces. The proposed flow consists of a chain of invertible transformations, where each transformation is based on an autoregressive neural network. In experiments, we show that IAF significantly improves upon diagonal Gaussian approximate posteriors. In addition, we demonstrate that a novel type of variational autoencoder, coupled with IAF, is competitive with neural autoregressive models in terms of attained log-likelihood on natural images, while allowing significantly faster synthesis.

• Konda K (2014) Zero-bias autoencoders and the benefits of co-adapting features. arXiv:1402.3337  |  GitHub  |  GitXiv
• Regularized training of an autoencoder typically results in hidden unit biases that take on large negative values. We show that negative biases are a natural result of using a hidden layer whose responsibility is to both represent the input data and act as a selection mechanism that ensures sparsity of the representation. We then show that negative biases impede the learning of data distributions whose intrinsic dimensionality is high. We also propose a new activation function that decouples the two roles of the hidden layer and that allows us to learn representations on data with very high intrinsic dimensionality, where standard autoencoders typically fail. Since the decoupled activation function acts like an implicit regularizer, the model can be trained by minimizing the reconstruction error of training data, without requiring any additional regularization.

• Lamb A [Courville A] (2016) Discriminative Regularization for Generative Models. arXiv:1602.03220  |  GitXiv
• We explore the question of whether the representations learned by classifiers can be used to enhance the quality of generative models. Our conjecture is that labels correspond to characteristics of natural data which are most salient to humans: identity in faces, objects in images, and utterances in speech. We propose to take advantage of this by using the representations from discriminative classifiers to augment the objective function corresponding to a generative model. In particular we enhance the objective function of the variational autoencoder, a popular generative model, with a discriminative regularization term. We show that enhancing the objective function in this way leads to samples that are clearer and have higher visual quality than the samples from the standard variational autoencoders.

• Pu Y [Duke University | Nokia Bell Labs] (2016) Variational Autoencoder for Deep Learning of Images, Labels and Captions. arXiv:1609.08976
• A novel variational autoencoder is developed to model images, as well as associated labels or captions. The Deep Generative Deconvolutional Network (DGDN) is used as a decoder of the latent image features, and a deep Convolutional Neural Network (CNN) is used as an image encoder; the CNN is used to approximate a distribution for the latent DGDN features/code. The latent code is also linked to generative models for labels (Bayesian support vector machine) or captions (recurrent neural network). When predicting a label/caption for a new image at test, averaging is performed across the distribution of latent codes; this is computationally efficient as a consequence of the learned CNN-based encoder. Since the framework is capable of modeling the image in the presence/absence of associated labels/captions, a new semi-supervised setting is manifested for CNN learning with images; the framework even allows unsupervised CNN learning, based on images alone.

• Rolfe JT [D-Wave Systems] (2016) Discrete Variational Autoencoders. arXiv:1609.02200  |  reddit
• Probabilistic models with discrete latent variables naturally capture datasets composed of discrete classes. However, they are difficult to train efficiently, since backpropagation through discrete variables is generally not possible. We introduce a novel class of probabilistic models, comprising an undirected discrete component and a directed hierarchical continuous component, that can be trained efficiently using the variational autoencoder framework. The discrete component captures the distribution over the disconnected smooth manifolds induced by the continuous component. As a result, this class of models efficiently learns both the class of objects in an image, and their specific realization in pixels, from unsupervised data; and outperforms state-of-the-art methods on the permutation-invariant MNIST, OMNIGLOT, and Caltech-101 Silhouettes datasets.

• Suzuki M [University of Tokyo] (2016) Joint Multimodal Learning with Deep Generative Models. arXiv:1611.01891

• We investigate deep generative models that can exchange multiple modalities bi-directionally, e.g., generating images from corresponding texts and vice versa. Recently, some studies handle multiple modalities on deep generative models, such as variational autoencoders (VAEs). However, these models typically assume that modalities are forced to have a conditioned relation, i.e., we can only generate modalities in one direction. To achieve our objective, we should extract a joint representation that captures high-level concepts among all modalities and through which we can exchange them bi-directionally. As described herein, we propose a joint multimodal variational autoencoder (JMVAE), in which all modalities are independently conditioned on joint representation. In other words, it models a joint distribution of modalities. Furthermore, to be able to generate missing modalities from the remaining modalities properly, we develop an additional method, JMVAE-kl, that is trained by reducing the divergence between JMVAE's encoder and prepared networks of respective modalities. Our experiments show that our proposed method can obtain appropriate joint representation from multiple modalities and that it can generate and reconstruct them more properly than conventional VAEs. We further demonstrate that JMVAE can generate multiple modalities bi-directionally.

• Discussion (Nathan Benaich, Nov 2016):

• Joint multimodal learning with deep generative models, University of Tokyo. Machine learning models in production today are typically trained and operate on data of a single modality, i.e. only text or only images. However, information in the real world is represented through various modalities. Here, the authors present a joint multimodal variational autoencoder  -  a generative model that can extract a joint representation that captures high-level concepts among all modalities it is trained on (e.g. text and images). With this model, the authors show that we can exchange this representation bi-directionally, that is to say the model can generate and reconstruct images from corresponding text and vice versa.

• Related:  Upchurch P (2016) Deep Feature Interpolation for Image Content Changes. arXiv:1611.05507

• Vincent P (2011) A connection between score matching and denoising autoencoders. more here.

• Vincent P [Bengio Y] (2008) Extracting and composing robust features with denoising autoencoders. pdf
• Previous work has shown that the difficulties in learning deep generative or discriminative models can be overcome by an initial unsupervised learning step that maps inputs to useful intermediate representations. We introduce and motivate a new training principle for unsupervised learning of a representation based on the idea of making the learned representations robust to partial corruption of the input pattern. This approach can be used to train autoencoders, and these denoising autoencoders can be stacked to initialize deep architectures. The algorithm can be motivated from a manifold learning and information theoretic perspective or from a generative model perspective. Comparative experiments clearly show the surprising advantage of corrupting the input of autoencoders on a pattern classification benchmark suite.
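• The training principle in miniature (illustrative sketch; masking noise is only one of several corruption processes used in this line of work): corrupt the input, then ask the network to reconstruct the clean version.

```python
# Denoising autoencoder objective: reconstruct the clean x from a corrupted copy (sketch, PyTorch).
import torch
import torch.nn as nn

ae = nn.Sequential(nn.Linear(784, 256), nn.Sigmoid(), nn.Linear(256, 784))
x = torch.rand(128, 784)

mask = (torch.rand_like(x) > 0.3).float()      # masking noise: zero out ~30% of the input units
x_corrupted = x * mask

recon = ae(x_corrupted)
loss = nn.functional.mse_loss(recon, x)        # the target is the *uncorrupted* input
loss.backward()
```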

## AUTOENCODERS - LADDER NETWORKS { UNSUPERVISED LEARNING IN DNN }

• reddit:

• Introduction to Semi-Supervised Learning with Ladder Networks

• [identically-named but different post:] Introduction to Semi-Supervised Learning with Ladder Networks

• Ladder networks combine supervised learning with unsupervised learning in deep neural networks. Traditionally, unsupervised learning was used only for pre-training the network, followed by normal supervised learning. In the case of ladder networks, the network is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. The ladder network achieves state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels. (A greatly simplified code sketch follows the Key Aspects list below.)

• Key Aspects:

• Compatibility with supervised methods. It can be added to existing feedforward neural networks. The unsupervised part focuses on relevant details found by supervised learning. It can also be extended to recurrent neural networks.

• Scalability resulting from local learning. In addition to a supervised learning target on the top layer, the model has local unsupervised learning targets on every layer, making it suitable for very deep neural networks.

• Computational efficiency. Adding a decoder (part of the ladder network) approximately triples the computation during training but not necessarily the training time since the same result can be achieved faster through the better utilization of the available information.
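• A heavily simplified sketch of the idea (my own reduction to two layers; a real ladder network adds carefully designed "combinator" functions and batch normalization at every layer): a noisy encoder pass provides the supervised prediction, a decoder denoises each layer's representation toward its clean-pass counterpart, and the supervised and denoising costs are summed.

```python
# Ladder-network-style cost: supervised cross-entropy + per-layer denoising costs (simplified sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

enc1, enc2 = nn.Linear(784, 256), nn.Linear(256, 10)
dec2, dec1 = nn.Linear(10, 256), nn.Linear(256, 784)     # decoder mirrors the encoder

def forward(x, noise_std=0.3):
    # Clean encoder pass: provides the denoising targets (no gradient through the targets).
    with torch.no_grad():
        h1_clean = torch.relu(enc1(x))
    # Noisy encoder pass: used for the supervised prediction during training.
    h1 = torch.relu(enc1(x + noise_std * torch.randn_like(x)))
    h1 = h1 + noise_std * torch.randn_like(h1)
    h2 = enc2(h1)
    # Decoder pass; a real ladder network combines each decoder layer with the corresponding
    # noisy encoder layer through a learned combinator (the lateral / skip connection).
    d1 = torch.relu(dec2(h2))
    d0 = dec1(d1)
    denoise_cost = F.mse_loss(d1, h1_clean) + F.mse_loss(d0, x)
    return h2, denoise_cost

x, y = torch.rand(64, 784), torch.randint(0, 10, (64,))
logits, denoise_cost = forward(x)
loss = F.cross_entropy(logits, y) + 0.1 * denoise_cost    # supervised + unsupervised terms
loss.backward()
```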

• Rasmus A (2015) Semi-Supervised Learning with Ladder Networks. arXiv:1507.02672  |  GitHub  |  reddit  |  reddit:mention

• We combine supervised learning with unsupervised learning in deep neural networks. The proposed model is trained to simultaneously minimize the sum of supervised and unsupervised cost functions by backpropagation, avoiding the need for layer-wise pre-training. Our work builds on the Ladder network proposed by Valpola (2014), which we extend by combining the model with supervision. We show that the resulting model reaches state-of-the-art performance in semi-supervised MNIST and CIFAR-10 classification, in addition to permutation-invariant MNIST classification with all labels.

• Our approach follows Valpola (2015), who proposed a Ladder network where the auxiliary task is to denoise representations at every level of the model. The model structure is an autoencoder with skip connections from the encoder to decoder and the learning task is similar to that in denoising autoencoders but applied to every layer, not just the inputs. The skip connections relieve the pressure to represent details in the higher layers of the model because, through the skip connections, the decoder can recover any details discarded by the encoder. Previously, the Ladder network has only been demonstrated in unsupervised learning (Valpola, 2015; Rasmus et al., 2015a) but we now combine it with supervised learning.

• Victoria: Those publication dates are actually: Valpola (2014); Rasmus et al., 2014.

• mentioned here ([reddit]): Missed connections: anyone remember a paper that uses semi-supervised learning to solve MNIST with only one labeled example in each class?: You might mean semi-supervised learning with ladder Networks? i.e.: this paper | "Wait, it was ladder networks that did that? Cool!"

• reviewed here  |  local copy

• Pezeshki M [Courville A; Bengio Y] (2015) Deconstructing the Ladder Network Architecture. arXiv:1511.06430.  |  analyzes the contribution of all individual components in depth and proposes a potentially better variant  |  suppl. mat.
• The Ladder Network is a recent new approach to semi-supervised learning that turned out to be very successful. While showing impressive performance, the Ladder Network has many components intertwined, whose contributions are not obvious in such a complex architecture. This paper presents an extensive experimental investigation of variants of the Ladder Network in which we replaced or removed individual components to learn about their relative importance. For semi-supervised tasks, we conclude that the most important contribution is made by the lateral connections, followed by the application of noise, and the choice of what we refer to as the 'combinator function'. As the number of labeled training examples increases, the lateral connections and the reconstruction criterion become less important, with most of the generalization improvement coming from the injection of noise in each layer. Finally, we introduce a combinator function that reduces test error rates on Permutation-Invariant MNIST to 0.57% for the supervised setting, and to 0.97% and 1.0% for semi-supervised settings with 1000 and 100 labeled examples, respectively.

• Sønderby CK (2016) Ladder Variational Autoencoders. arXiv:1602.02282  |  GitHub  |  reddit
• Variational Autoencoders are powerful models for unsupervised learning. However deep models with several layers of dependent stochastic variables are difficult to train which limits the improvements obtained using these highly expressive models. We propose a new inference model, the Ladder Variational Autoencoder, that recursively corrects the generative distribution by a data dependent approximate likelihood in a process resembling the recently proposed Ladder Network. We show that this model provides state of the art predictive log-likelihood and tighter log-likelihood lower bound compared to the purely bottom-up inference in layered Variational Autoencoders and other generative models. We provide a detailed analysis of the learned hierarchical latent representation and show that our new inference model is qualitatively different and utilizes a deeper more distributed hierarchy of latent variables. Finally, we observe that batch normalization and deterministic warm-up (gradually turning on the KL-term) are crucial for training variational models with many stochastic layers.

• Valpola H (2014) [ladder networks] From neural PCA to deep unsupervised learning. arXiv:1411.7783  |  introduces ladder networks

• A network supporting deep unsupervised learning is presented. The network is an autoencoder with lateral shortcut connections from the encoder to decoder at each level of the hierarchy. The lateral shortcut connections allow the higher levels of the hierarchy to focus on abstract invariant features. While standard autoencoders are analogous to latent variable models with a single layer of stochastic variables, the proposed network is analogous to hierarchical latent variables models. Learning combines denoising autoencoder and denoising sources separation frameworks. Each layer of the network contributes to the cost function a term which measures the distance of the representations produced by the encoder and the decoder. Since training signals originate from all levels of the network, all layers can learn efficiently even in deep networks. The speedup offered by cost terms from higher levels of the hierarchy and the ability to learn invariant features are demonstrated in experiments.

• The model structure is called a ladder network because two vertical paths are connected by horizontal lateral connections at regular intervals. ... The experiments presented in Section 4 demonstrate that the higher levels of a ladder network can discard information and focus on invariant representations and that the training targets on higher layers speed up learning. ... In order to support efficient unsupervised learning in deep ladder networks, a new type of cost function was proposed. The key aspect is that each layer of the network contributes its own terms to the cost function.

• Vincent P (2011) A connection between score matching and denoising autoencoders. more here

## BACKPROPAGATION

• 'error backpropagation' -- the backward propagation of error gradients through the network -- is often discussed in terms of "gradient flow"

• Seminal paper (introduced backprop):  Rumelhart DE [Hinton GE; Williams RJ] (1986) Learning representations by back-propagating errors.  pdf  |  reddit

[Backpropagation] Blogs:

[Backpropagation] Instruction:

[Backpropagation] Papers:

• Baldi P (2015) A theory of local learning, the learning channel, and the optimality of backpropagation. Neural Networks. pdf  |  reddit

• In a physical neural system, where storage and processing are intimately intertwined, the rules for adjusting the synaptic weights can only depend on variables that are available locally, such as the activity of the pre- and post-synaptic neurons, resulting in local learning rules. A systematic framework for studying the space of local learning rules is obtained by first specifying the nature of the local variables, and then the functional form that ties them together into each learning rule. Such a framework enables also the systematic discovery of new learning rules and exploration of relationships between learning rules and group symmetries. We study polynomial local learning rules stratified by their degree and analyze their behavior and capabilities in both linear and non-linear units and networks. Stacking local learning rules in deep feedforward networks leads to deep local learning. While deep local learning can learn interesting representations, it cannot learn complex input-output functions, even when targets are available for the top layer. Learning complex input-output functions requires local deep learning where target information is communicated to the deep layers through a backward learning channel. The nature of the communicated information about the targets and the structure of the learning channel partition the space of learning algorithms. For any learning algorithm, the capacity of the learning channel can be defined as the number of bits provided about the error gradient per weight, divided by the number of required operations per weight. We estimate the capacity associated with several learning algorithms and show that backpropagation outperforms them by simultaneously maximizing the information rate and minimizing the computational cost. This result is also shown to be true for recurrent networks, by unfolding them in time. The theory clarifies the concept of Hebbian learning, establishes the power and limitations of local learning rules, introduces the learning channel which enables a formal analysis of the optimality of backpropagation, and explains the sparsity of the space of learning rules discovered so far.

• Baydin AG (2015) Automatic differentiation in machine learning: a survey. arXiv:1502.05767

• Derivatives, mostly in the form of gradients and Hessians, are ubiquitous in machine learning. Automatic differentiation (AD) is a technique for calculating derivatives of numeric functions expressed as computer programs efficiently and accurately, used in fields such as computational fluid dynamics, nuclear engineering, and atmospheric sciences. Despite its advantages and use in other fields, machine learning practitioners have been little influenced by AD and make scant use of available tools. We survey the intersection of AD and machine learning, cover applications where AD has the potential to make a big impact, and report on some recent developments in the adoption of this technique. We aim to dispel some misconceptions that we contend have impeded the use of AD within the machine learning community.

• Cited by Andrej Karpathy in his Stanford cs231n [Spring 2016] 'CNN For Visual Recognition' course in his excellent backpropagation/optimization lecture notes.  |  local copy  |
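• The core mechanism the survey covers, reverse-mode automatic differentiation, fits in a few lines. This is a deliberately tiny scalar-only sketch of my own; real AD systems handle tensors, control flow, and far more operators, and traverse the graph in topological order rather than recursing per path.

```python
# Minimal scalar reverse-mode automatic differentiation (the mechanism behind backprop).
class Var:
    def __init__(self, value, parents=()):
        # parents: tuples of (parent Var, local derivative d(self)/d(parent))
        self.value, self.parents, self.grad = value, parents, 0.0

    def __add__(self, other):
        return Var(self.value + other.value, parents=((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=((self, other.value), (other, self.value)))

    def backward(self, seed=1.0):
        # Accumulate d(output)/d(self), then push to parents via the chain rule.
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

x, y = Var(2.0), Var(3.0)
z = x * y + x          # z = x*y + x
z.backward()
print(x.grad, y.grad)  # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```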

• Bojarski M [NVIDIA Corp | Google Brain Robotics] (2016) VisualBackProp: Efficient visualization of CNNs. arXiv:1611.05418

• Demyanov S (2015) Invariant backpropagation: How to train a transformation-invariant neural network. arXiv:1502.04434   GitXiv

• In many classification problems a classifier should be robust to small variations in the input vector. This is a desired property not only for particular transformations, such as translation and rotation in image classification problems, but also for all others for which the change is small enough to retain the object perceptually indistinguishable. We propose two extensions of the backpropagation algorithm that train a neural network to be robust to variations in the feature vector. While the first of them enforces robustness of the loss function to all variations, the second method trains the predictions to be robust to a particular variation which changes the loss function the most. The second method demonstrates better results, but is slightly slower. We analytically compare the proposed algorithm with the two most similar approaches (Tangent BP and Adversarial Training), and propose their fast versions. In the experimental part we perform comparison of all algorithms in terms of classification accuracy and robustness to noise on MNIST and CIFAR-10 datasets. Additionally we analyze how the performance of the proposed algorithm depends on the dataset size and data augmentation.

• Gruslys A [Alex Graves | Google DeepMind] (2016) Memory-Efficient Backpropagation Through Time. arXiv:1606.03401  |  reddit  |  reddit-mentioned GitHub

• We propose a novel approach to reduce memory consumption of the backpropagation through time (BPTT) algorithm when training recurrent neural networks (RNNs). Our approach uses dynamic programming to balance a trade-off between caching of intermediate results and recomputation. The algorithm is capable of tightly fitting within almost any user-set memory budget while finding an optimal execution policy minimizing the computational cost. Computational devices have limited memory capacity and maximizing a computational performance given a fixed memory budget is a practical use-case. We provide asymptotic computational upper bounds for various regimes. The algorithm is particularly effective for long sequences. For sequences of length 1000, our algorithm saves 95% of memory usage while using only one third more time per iteration than the standard BPTT.
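• The underlying store-vs-recompute trade-off (not the paper's dynamic-programming policy for choosing what to cache) is what gradient-checkpointing utilities expose. A small PyTorch example using torch.utils.checkpoint (a real utility; exact keyword arguments vary across versions):

```python
# Trading compute for memory: skip storing a segment's activations; recompute them in the backward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

segment1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512), nn.ReLU())
segment2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

x = torch.randn(64, 512, requires_grad=True)

# Activations inside segment1 are *not* stored; they are recomputed during backward().
h = checkpoint(segment1, x)
out = segment2(h)
out.sum().backward()   # one extra forward pass through segment1, in exchange for less activation memory
```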

• Guo J. (2013) [BPTT] BackPropagation Through Time. [pdf]  |  reddit

• Han T [UCLA] (2016) Alternating Back-Propagation for Generator Network. pdf  |  project page

• This paper proposes an alternating back-propagation algorithm for learning the generator network model. The model is a non-linear generalization of factor analysis. In this model, the mapping from the latent factors to the observed vector is parametrized by a convolutional neural network. The alternating back-propagation algorithm iterates between the following two steps: (1) Inferential back-propagation, which infers the latent factors by Langevin dynamics or gradient descent. (2) Learning back-propagation, which updates the parameters given the inferred latent factors by gradient descent. The gradient computations in both steps are powered by back-propagation, and they share most of their code in common. We show that the alternating back-propagation algorithm can learn realistic generator models of natural images, video sequences, and sounds. Moreover, it can also be used to learn from incomplete or indirect training data.

• Han T (2016) Learning Generative ConvNet with Continuous Latent Factors by Alternating Back-Propagation. arXiv:1606.08571  |  reddit

• This paper proposes an alternating back-propagation algorithm for learning the generator network model. The model is a non-linear generalization of factor analysis. In this model, the mapping from the latent factors to the observed vector is parametrized by a convolutional neural network. The alternating back-propagation algorithm iterates between the following two steps: (1) Inferential back-propagation, which infers the latent factors by Langevin dynamics or gradient descent. (2) Learning back-propagation, which updates the parameters given the inferred latent factors by gradient descent. The gradient computations in both steps are powered by back-propagation, and they share most of their code in common. We show that the alternating back-propagation algorithm can learn realistic generator models of natural images, video sequences, and sounds. Moreover, it can also be used to learn from incomplete or indirect training data.

• The generator network is a fundamental representation of knowledge, and it has the following properties:

1. Analysis: The model disentangles the variations in the observed data vectors into independent variations of latent factors.
2. Synthesis: The model can easily synthesize new signals by sampling the factors from the known prior distribution and transforming the factors into the signal.
3. Embedding: The model embeds the high-dimensional non-Euclidean manifold formed by the observed data vectors into the low-dimensional Euclidean space of the latent factors, so that linear interpolation in the low-dimensional factor space results in non-linear interpolation in the data space.

• LeCun YA (2012) Efficient backprop [ pdf: 48 pp ]

• The convergence of back-propagation learning is analyzed so as to explain common phenomenon observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most "classical" second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.

• Keras (GitHub) [re: this paper] "LeCun has long argued that the result obtained with stochastic learning is almost always better, thanks to the random noise it introduces."

• Lillicrap TP (2014) Random feedback weights support learning in deep neural networks. arXiv:1411.0247

• The brain processes information through many layers of neurons. This deep architecture is representationally powerful, but it complicates learning by making it hard to identify the responsible neurons when a mistake is made. In machine learning, the backpropagation algorithm assigns blame to a neuron by computing exactly how it contributed to an error. To do this, it multiplies error signals by matrices consisting of all the synaptic weights on the neuron's axon and farther downstream. This operation requires a precisely choreographed transport of synaptic weight information, which is thought to be impossible in the brain. Here we present a surprisingly simple algorithm for deep learning, which assigns blame by multiplying error signals by random synaptic weights. We show that a network can learn to extract useful information from signals sent through these random feedback connections. In essence, the network learns to learn. We demonstrate that this new mechanism performs as quickly and accurately as backpropagation on a variety of problems and describe the principles which underlie its function. Our demonstration provides a plausible basis for how a neuron can be adapted using error signals generated at distal locations in the brain, and thus dispels long-held assumptions about the algorithmic constraints on learning in neural circuits.

• See also:  Liao Q [Tomaso Poggio] (2015) How Important is Weight Symmetry in Backpropagation? arXiv:1510.05067

• Gradient backpropagation (BP) requires symmetric feedforward and feedback connections -- the same weights must be used for forward and backward passes. This "weight transport problem" (Grossberg 1987) is thought to be one of the main reasons to doubt BP's biologically plausibility. Using 15 different classification datasets, we systematically investigate to what extent BP really depends on weight symmetry. In a study that turned out to be surprisingly similar in spirit to Lillicrap et al.'s demonstration (Lillicrap et al. 2014) but orthogonal in its results, our experiments indicate that: (1) the magnitudes of feedback weights do not matter to performance (2) the signs of feedback weights do matter -- the more concordant signs between feedforward and their corresponding feedback connections, the better (3) with feedback weights having random magnitudes and 100% concordant signs, we were able to achieve the same or even better performance than SGD. (4) some normalizations/stabilizations are indispensable for such asymmetric BP to work, namely Batch Normalization (BN) (Ioffe and Szegedy 2015) and/or a "Batch Manhattan" (BM) update rule.

• Related follow-on:  Nøkland A (2016) Direct Feedback Alignment Provides Learning in Deep Neural Networks. arXiv:1609.01596  |  GitHub  |  GitXiv  |  reddit  |  reddit  |  reddit  |  reddit

• Artificial neural networks are most commonly trained with the back-propagation algorithm, where the gradient for learning is provided by back-propagating the error, layer by layer, from the output layer to the hidden layers. A recently discovered method called feedback-alignment shows that the weights used for propagating the error backward don't have to be symmetric with the weights used for propagation the activation forward. In fact, random feedback weights work evenly well, because the network learns how to make the feedback useful. In this work, the feedback alignment principle is used for training hidden layers more independently from the rest of the network, and from a zero initial condition. The error is propagated through fixed random feedback connections directly from the output layer to each hidden layer. This simple method is able to achieve zero training error even in convolutional networks and very deep networks, completely without error back-propagation. The method is a step towards biologically plausible machine learning because the error signal is almost local, and no symmetric or reciprocal weights are required. Experiments show that the test performance on MNIST and CIFAR is almost as good as those obtained with back-propagation for fully connected networks. If combined with dropout, the method achieves 1.45% error on the permutation invariant MNIST task.

• Conclusion. "... The method was able to fit the training set on all experiments performed on MNIST, CIFAR-10 and CIFAR-100. The performance on the test sets lags a little behind back-propagation. Most importantly, this work suggests that the restriction enforced by back-propagation and feedback-alignment, that the backward pass have to visit every neuron from the forward pass, can be discarded. Learning is possible even when the feedback path is disconnected from the forward path. ..."
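• A minimal NumPy sketch of direct feedback alignment on a toy regression problem (my own illustration; the paper's experiments use deeper classification networks): the output error is sent to the hidden layer through a fixed random matrix instead of the transposed forward weights.

```python
# Direct feedback alignment (DFA) vs. backprop: the hidden layer's teaching signal is
# error @ B with B fixed and random, rather than error @ W2.T.  Toy regression sketch.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 20))
Y = rng.standard_normal((256, 5))

W1 = rng.standard_normal((20, 64)) * 0.1
W2 = rng.standard_normal((64, 5)) * 0.1
B = rng.standard_normal((5, 64)) * 0.1     # fixed random feedback matrix (never trained)
lr = 0.05

for step in range(1000):
    h = np.tanh(X @ W1)                    # forward pass
    y_hat = h @ W2
    e = y_hat - Y                          # output error

    dW2 = h.T @ e                          # same as backprop for the output layer
    dh = (e @ B) * (1 - h ** 2)            # DFA: random projection of the error, not e @ W2.T
    dW1 = X.T @ dh

    W1 -= lr * dW1 / len(X)
    W2 -= lr * dW2 / len(X)

print(float(np.mean((np.tanh(X @ W1) @ W2 - Y) ** 2)))   # training MSE, lower than at initialization
```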

• Mentioned here:   |  Random synaptic feedback weights support error backpropagation for deep learning [reddit]

• Nøkland A (2016) Direct Feedback Alignment Provides Learning in Deep Neural Networks. arXiv:1609.01596

• Scellier B [Bengio Y] (2016) Equilibrium Propagation: Bridging the Gap Between Energy-Based Models and Backpropagation. arXiv:1602.05179

• We introduce Equilibrium Propagation, a learning framework for energy-based models. It involves only one kind of neural computation, performed in both the first phase (when the prediction is made) and the second phase of training (after the target or prediction error is revealed). Although this algorithm computes the gradient of an objective function just like Backpropagation, it does not need a special computation or circuit for the second phase, where errors are implicitly propagated. Equilibrium Propagation shares similarities with Contrastive Hebbian Learning and Contrastive Divergence while solving the theoretical issues of both algorithms: our algorithm computes the gradient of a well defined objective function. Because the objective function is defined in terms of local perturbations, the second phase of Equilibrium Propagation corresponds to only nudging the prediction (fixed point, or stationary distribution) towards a configuration that reduces prediction error. In the case of a recurrent multi-layer supervised network, the output units are slightly nudged towards their target in the second phase, and the perturbation introduced at the output layer propagates backward in the hidden layers. We show that the signal 'back-propagated' during this second phase corresponds to the propagation of error derivatives and encodes the gradient of the objective function, when the synaptic update corresponds to a standard form of spike-timing dependent plasticity. This work makes it more plausible that a mechanism similar to Backpropagation could be implemented by brains, since leaky integrator neural computation performs both inference and error back-propagation in our model. The only local difference between the two phases is whether synaptic changes are allowed or not.

• From an earlier version:  "This work follows Bengio and Fischer (2015) in which theoretical foundations were laid to show how iterative inference can backpropagate error signals. ..."

• Sun X (2017) meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting. arXiv:1706.06197  |  "critiqued (kinda):"  reddit

• We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-k elements (in terms of magnitude) are kept. As a result, only k rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction (k divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we can update only 1-4% of the weights at each back propagation pass. This does not result in a larger number of training iterations. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given.
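• The core trick is easy to state in code. Below is a generic illustration of top-k gradient sparsification applied to one linear layer's backward pass (my own sketch, not the authors' implementation):

```python
# meProp-style sparsified backward pass for y = x @ W: keep only the top-k entries
# (by magnitude) of the gradient flowing into the layer, zero the rest.
import numpy as np

def sparsified_linear_backward(x, W, grad_y, k):
    # Keep the k largest-magnitude components of grad_y per example.
    idx = np.argsort(-np.abs(grad_y), axis=1)[:, :k]
    mask = np.zeros_like(grad_y)
    np.put_along_axis(mask, idx, 1.0, axis=1)
    grad_y_sparse = grad_y * mask

    grad_W = x.T @ grad_y_sparse        # each example contributes gradient to only k output columns
    grad_x = grad_y_sparse @ W.T
    return grad_W, grad_x

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 100))
W = rng.standard_normal((100, 50))
grad_y = rng.standard_normal((32, 50))
grad_W, grad_x = sparsified_linear_backward(x, W, grad_y, k=5)   # ~10% of the gradient kept
```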

## BATCH NORMALIZATION

Blogs:

Papers:

• Seminal paper:  Sergey Ioffe & Christian Szegedy [Google] (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167: ... Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. ...
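• The training-time transform itself is short. A sketch of the forward pass only; a full implementation also tracks running statistics for use at test time and implements the corresponding backward pass:

```python
# Batch normalization, training-time forward pass for a (batch, features) activation matrix.
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature to zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale (gamma) and shift (beta)

x = np.random.randn(64, 100) * 3.0 + 5.0
out = batch_norm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])   # ~0 and ~1 per feature
```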

• Ba JL [Geoffrey E. Hinton] (2016) Layer Normalization. arXiv:1607.06450: ... A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of training cases to compute a mean and variance which are then used to normalize the summed input to that neuron on each training case. This significantly reduces the training time in feed-forward neural networks. However, the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks. In this paper, we transpose batch normalization into layer normalization ... Empirically, we show that layer normalization can substantially reduce the training time compared with previously published techniques.

• Barone AVM (ACL 2016) Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. arXiv:1608.02996

• Author (AVM Barone) on reddit: ... Using a few tricks to aid optimization (a ResNet leaky relu discriminator with batch normalization to increase the magnitude of the gradient being backpropagated to the generator) I managed to make the model learn. ...

• Cooijmans T [Aaron Courville] (2016) Recurrent Batch Normalization. arXiv:1603.09025  |  GitXiv  |  reddit: OriolVinyals: "Good to see finally someone figured out how to make these two work."

• We propose a reparameterization of LSTM that brings the benefits of batch normalization to recurrent neural networks. Whereas previous works only apply batch normalization to the input-to-hidden transformation of RNNs, we demonstrate that it is both possible and beneficial to batch-normalize the hidden-to-hidden transition, thereby reducing internal covariate shift between time steps. We evaluate our proposal on various sequential problems such as sequence classification, language modeling and question answering. Our empirical results show that our batch-normalized LSTM consistently leads to faster convergence and improved generalization.

• Güçlütürk Y (2016) Convolutional Sketch Inversion. arXiv:1606.03073: In this paper, we use deep neural networks for inverting face sketches to synthesize photorealistic face images. ... We then train models achieving state-of-the-art results on both computer-generated sketches and hand-drawn sketches by leveraging recent advances in deep learning such as batch normalization, deep residual learning, perceptual losses and stochastic optimization in combination with our new dataset.

• He K [Microsoft Research] (2015) Deep Residual Learning for Image Recognition. arXiv:1512.03385: ... The Residual Network developed by Kaiming He et al. was the winner of ILSVRC 2015. It features special skip connections and a heavy use of batch normalization. The architecture is also missing fully connected layers at the end of the network. ...

• Laurent C [Bengio Y] (2015) Batch Normalized Recurrent Neural Networks. arXiv:1510.01378: RNN ... are computationally expensive to train and difficult to parallelize. Recent work has shown that normalizing intermediate representations of neural networks can significantly improve convergence rates in feedforward neural networks. In particular, batch normalization, which uses mini-batch statistics to standardize features, was shown to significantly reduce training time. ... we show that applying batch normalization to the hidden-to-hidden transitions of our RNNs doesn't help the training procedure ... batch normalization can lead to a faster convergence of the training criterion but doesn't seem to improve the generalization performance on both our language modelling and speech recognition tasks. ... applying batch normalization to RNNs turns out to be more challenging than applying it to feedforward networks, but certain variants of it can still be beneficial.

• Liao Q (2015) How Important is Weight Symmetry in Backpropagation? arXiv:1510.05067: Gradient backpropagation (BP) requires symmetric feedforward and feedback connections -- the same weights must be used for forward and backward passes. ... we systematically investigate to what extent BP really depends on weight symmetry. ... our experiments indicate that: ... (4) some normalizations/stabilizations are indispensable for such asymmetric BP to work, namely Batch Normalization (BN) (Ioffe and Szegedy 2015) and/or a "Batch Manhattan" (BM) update rule.

• Mishkin D (2016) Systematic evaluation of CNN advances on the ImageNet. arXiv:1606.02228.  The paper systematically studies the impact of a range of recent advances in CNN architectures and learning methods on the object categorization (ILSVRC) problem. The evaluation tests the influence of the following choices of the architecture: non-linearity (ReLU, ELU, maxout, compatibility with batch normalization), pooling variants (stochastic, max, average, mixed), network width, classifier design (convolutional, fully-connected, SPP), image pre-processing, and of learning parameters: learning rate, batch size, cleanliness of the data, etc. ...

• ResNets & Batch Normalization  [detailed summary (this file): blogs + papers]

• Salimans T (2016) Weight Normalization: Simple Reparameterization to Accelerate Training of DNN. arXiv:1602.07868
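
• The reparameterization itself is one line: a weight vector w is written as w = g · v / ‖v‖, so its length g and its direction v are optimized separately. A minimal sketch (my own, not the paper's code):

```python
import numpy as np

def weight_norm(v, g):
    # Weight normalization: w = g * v / ||v||, decoupling length from direction.
    # v (a vector) and g (a scalar) are the parameters actually trained.
    return g * v / np.linalg.norm(v)

v = np.array([3.0, 4.0])
g = 2.0
w = weight_norm(v, g)   # -> [1.2, 1.6], and ||w|| == g
```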

• Shah A (2016) Deep Residual Networks with Exponential Linear Unit. arXiv:1604.04112.  ... we propose the use of exponential linear unit instead of the combination of ReLU and Batch Normalization in Residual Networks. We show that this not only speeds up learning in Residual Networks but also improves the accuracy as the depth increases. It improves the test error on almost all data sets, like CIFAR-10 and CIFAR-100.

• Ulyanov D (2016) Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv:1607.08022 In this paper we revisit the fast stylization method introduced in Ulyanov et al. (2016) [arXiv:1603.03417]. We show how a small change in the stylization architecture results in a significant qualitative improvement in the generated images. The change is limited to swapping batch normalization with instance normalization, and to apply the latter both at training and testing times. ... The key idea (section 2) is to replace batch normalization layers in the generator architecture with instance normalization layers, and to keep them at test time (as opposed to freeze and simplify them out as done for batch normalization). Intuitively, the normalization process allows to remove instance-specific contrast information from the content image, which simplifies generation. In practice, this results in vastly improved images.
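
• A minimal NumPy sketch of the difference (mine): batch normalization pools statistics over the whole mini-batch (and space), whereas instance normalization computes them per sample and per channel, which is what removes instance-specific contrast -- and it is kept at test time.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # x: (batch, channels, height, width); statistics pooled over batch + space.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    # Statistics per sample and per channel (spatial dimensions only).
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```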

## CLASSIFICATION | CLUSTERING

[Classification | Clustering] Blogs:

[Classification | Clustering] Papers:

• Amid E [Adobe Research | UC - Santa Cruz] (2016) t-Exponential Triplet Embedding. arXiv:1611.09957

• Given a set of relative similarities between objects in the form of triplets "object I is more similar to object j than to object k", we consider the problem of finding an embedding of these objects in a metric space. This problem is generally referred to as triplet embedding. Our main focus in this paper is the case where a subset of triplets are corrupted by noise, such that the order of objects in a triple is reversed. In a crowdsourcing application, for instance, this noise may arise due to varying skill levels or different opinions of the human evaluators. As we show, all existing triplet embedding methods fail to handle even low levels of noise. Inspired by recent advances in robust binary classification and ranking, we introduce a new technique, called t-Exponential Triplet Embedding (t-ETE), that produces high-quality embeddings even in the presence of significant amount of noise in the triplets. By an extensive set of experiments on both synthetic and real-world datasets, we show that our method outperforms all the other methods, giving rise to new insights on real-world data, which have been impossible to observe using the previous techniques.

• Benson AR (2016) Higher-order Organization of Complex Networks. pdf  |  Supplemental Materials  |  blog:Phys.org  |  GitHub  |  project page ["SNAP"]  |  code

• Networks are a fundamental tool for understanding and modeling complex systems in physics, biology, neuroscience, engineering, and social science. Many networks are known to exhibit rich, lower-order connectivity patterns that can be captured at the level of individual nodes and edges. However, higher-order organization of complex networks - at the level of small network subgraphs - remains largely unknown. Here, we develop a generalized framework for clustering networks on the basis of higher-order connectivity patterns. This framework provides mathematical guarantees on the optimality of obtained clusters and scales to networks with billions of edges. The framework reveals higher-order organization in a number of networks, including information propagation units in neuronal networks and hub structure in transportation networks. Results show that networks exhibit rich higher-order organizational structures that are exposed by clustering based on higher-order connectivity patterns.

• Victoria: excellent, interesting paper!

• Dogan U (JMLR 2016) A unified view on multi-class support vector classification. pdf  |  reddit

• A unified view on multi-class support vector machines (SVMs) is presented, covering most prominent variants including the one-vs-all approach and the algorithms proposed by Weston & Watkins, Crammer & Singer, Lee, Lin, & Wahba, and Liu & Yuan. The unification leads to a template for the quadratic training problems and new multi-class SVM formulations. Within our framework, we provide a comparative analysis of the various notions of multi-class margin and margin-based loss. In particular, we demonstrate limitations of the loss function considered, for instance, in the Crammer & Singer machine.

We analyze Fisher consistency of multi-class loss functions and universal consistency of the various machines. On the one hand, we give examples of SVMs that are, in a particular hyperparameter regime, universally consistent without being based on a Fisher consistent loss. These include the canonical extension of SVMs to multiple classes as proposed by Weston & Watkins and Vapnik as well as the one-vs-all approach. On the other hand, it is demonstrated that machines based on Fisher consistent loss functions can fail to identify proper decision boundaries in low-dimensional feature spaces.

We compared the performance of nine different multi-class SVMs in a thorough empirical study. Our results suggest to use the Weston & Watkins SVM, which can be trained comparatively fast and gives good accuracies on benchmark functions. If training time is a major concern, the one-vs-all approach is the method of choice.

• Er MJ (2016) An Online Universal Classifier for Binary, Multi-class and Multi-label Classification. arXiv:1609.00843  |  GitHub  |  GitXiv  |  reddit

• Classification involves the learning of the mapping function that associates input samples to corresponding target label. There are two major categories of classification problems: Single-label classification and Multi-label classification. Traditional binary and multi-class classifications are sub-categories of single-label classification. Several classifiers are developed for binary, multi-class and multi-label classification problems, but there are no classifiers available in the literature capable of performing all three types of classification. In this paper, a novel online universal classifier capable of performing all the three types of classification is proposed. Being a high speed online classifier, the proposed technique can be applied to streaming data applications. The performance of the developed classifier is evaluated using datasets from binary, multi-class and multi-label problems. The results obtained are compared with state-of-the-art techniques from each of the classification types.

• [zero-shot learning]  Fu Y (2016) Semi-supervised Vocabulary-informed Learning. arXiv:1604.07093

• Guan MY [Geoffrey E. Hinton | Google Brain] (2017) Who Said What: Modeling Individual Labelers Improves Classification. arXiv:1703.08774  |  reddit

• Data are often labeled by many different experts with each expert only labeling a small fraction of the data and each data point being labeled by several experts. This reduces the workload on individual experts and also gives a better estimate of the unobserved ground truth. When experts disagree, the standard approaches are to treat the majority opinion as the correct label or to model the correct label as a distribution. These approaches, however, do not make any use of potentially valuable information about which expert produced which label. To make use of this extra information, we propose modeling the experts individually and then learning averaging weights for combining them, possibly in sample-specific ways. This allows us to give more weight to more reliable experts and take advantage of the unique strengths of individual experts at classifying certain types of data. Here we show that our approach leads to improvements in computer-aided diagnosis of diabetic retinopathy. We also show that our method performs better than competing algorithms by Welinder and Perona, and by Mnih and Hinton. Our work offers an innovative approach for dealing with the myriad real-world settings that use expert opinions to define labels for training.

• Liu W (2016) Large-Margin Softmax Loss for Convolutional Neural Networks. arXiv:1612.02295

• Cross-entropy loss together with softmax is arguably one of the most common used supervision components in convolutional neural networks (CNNs). Despite its simplicity, popularity and excellent performance, the component does not explicitly encourage discriminative learning of features. In this paper, we propose a generalized large-margin softmax (L-Softmax) loss which explicitly encourages intra-class compactness and inter-class separability between learned features. Moreover, L-Softmax not only can adjust the desired margin but also can avoid overfitting. We also show that the L-Softmax loss can be optimized by typical stochastic gradient descent. Extensive experiments on four benchmark datasets demonstrate that the deeply-learned features with L-softmax loss become more discriminative, hence significantly boosting the performance on a variety of visual classification and verification tasks.

• Concluding Remarks. We proposed the Large-Margin Softmax loss for the convolutional neural networks. The large-margin softmax loss defines a flexible learning task with adjustable margin. We can set the parameter m to control the margin. With larger m, the decision margin between classes also becomes larger. More appealingly, the Large-Margin Softmax loss has very clear intuition and geometric interpretation. The extensive experimental results on several benchmark datasets show clear advantages over current state-of-the-art CNNs and all the compared baselines.
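
• As I recall the formulation (learned scale/bias details omitted -- check the paper), the margin m enters through the angle $\small \theta$ between a feature $\small x_i$ and the weight vector of its true class:

$$L_i = -\log \frac{e^{\|W_{y_i}\|\,\|x_i\|\,\psi(\theta_{y_i})}}{e^{\|W_{y_i}\|\,\|x_i\|\,\psi(\theta_{y_i})} + \sum_{j \ne y_i} e^{\|W_j\|\,\|x_i\|\cos\theta_j}}, \qquad \psi(\theta) = \cos(m\theta)\ \text{for}\ \theta \in \left[0, \tfrac{\pi}{m}\right],$$

with $\small \psi$ extended monotonically beyond $\small \pi/m$; setting $\small m = 1$ recovers the ordinary softmax loss.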

• Mitra B (2016) Dual Embedding Space Model for Document Ranking. arXiv:1602.01137  |  word2vec; search result document clustering

• Ribeiro MT (2016) "Why Should I Trust You?": Explaining the Predictions of Any Classifier. arXiv:1602.04938  |  GitHub  |  GitXiv  |  reddit  |  reddit

• [v3] Despite widespread adoption, machine learning models remain mostly black boxes. Understanding the reasons behind predictions is, however, quite important in assessing trust, which is fundamental if one plans to take action based on a prediction, or when choosing whether to deploy a new model. Such understanding also provides insights into the model, which can be used to transform an untrustworthy model or prediction into a trustworthy one. In this work, we propose LIME, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction. We also propose a method to explain models by presenting representative individual predictions and their explanations in a non-redundant way, framing the task as a submodular optimization problem. We demonstrate the flexibility of these methods by explaining different models for text (e.g. random forests) and image classification (e.g. neural networks). We show the utility of explanations via novel experiments, both simulated and with human subjects, on various scenarios that require trust: deciding if one should trust a prediction, choosing between models, improving an untrustworthy classifier, and identifying why a classifier should not be trusted.

• Discussed here by Ribeiro et al.:  Introduction to Local Interpretable Model-Agnostic Explanations (LIME): A technique to explain the predictions of any machine learning classifier.

• Also discussed here:  Why should I trust you? Explaining the predictions of any classifier [The Morning Paper]
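
• A bare-bones NumPy sketch of the LIME recipe (my own, not the authors' implementation): sample perturbations around the instance, query the black-box model, weight each perturbation by its proximity to the instance, and fit a weighted linear model whose coefficients serve as the local explanation. Function and parameter names here are illustrative only.

```python
import numpy as np

def explain_locally(predict_fn, x, n_samples=500, kernel_width=0.75, l2=1.0, seed=0):
    # predict_fn maps an (n, d) array to the probability of the class being explained.
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    Z = x + rng.normal(scale=0.3, size=(n_samples, d))       # 1. perturb locally
    y = predict_fn(Z)                                         # 2. query the black box
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / (kernel_width ** 2))            # 3. proximity weights
    Zb = np.hstack([Z, np.ones((n_samples, 1))])              # add an intercept column
    A = Zb.T @ (w[:, None] * Zb) + l2 * np.eye(d + 1)         # 4. weighted ridge fit
    coef = np.linalg.solve(A, Zb.T @ (w * y))
    return coef[:-1]                                          # one weight per feature

# Toy usage: explain a logistic "black box" around the point [0.5, 0.2, 0.1].
black_box = lambda Z: 1.0 / (1.0 + np.exp(-(Z @ np.array([2.0, -1.0, 0.0]))))
print(explain_locally(black_box, np.array([0.5, 0.2, 0.1])))
```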

• Related:  Lundberg S [University of Washington] (2016) An unexpected unity among methods for interpreting model predictions. arXiv:1611.07478

• Understanding why a model made a certain prediction is crucial in many data science fields. Interpretable predictions engender appropriate trust and provide insight into how the model may be improved. However, with large modern datasets the best accuracy is often achieved by complex models even experts struggle to interpret, which creates a tension between accuracy and interpretability. Recently, several methods have been proposed for interpreting predictions from complex models by estimating the importance of input features. Here, we present how a model-agnostic additive representation of the importance of input features unifies current methods. This representation is optimal, in the sense that it is the only set of additive values that satisfies important properties. We show how we can leverage these properties to create novel visual explanations of model predictions. The thread of unity that this representation weaves through the literature indicates that there are common principles to be learned about the interpretation of model predictions that apply in many scenarios.

• [very good!]  Rippel O [MIT | Facebook AI Research | UC Berkeley] (2015) Metric Learning with Adaptive Density Discrimination. arXiv:1511.05939 ["We introduce  Magnet Loss ..."] |  GitHub  |  GitXiv

• Distance metric learning (DML) approaches learn a transformation to a representation space where distance is in correspondence with a predefined notion of similarity. While such models offer a number of compelling benefits, it has been difficult for these to compete with modern classification algorithms in performance and even in feature extraction. In this work, we propose a novel approach explicitly designed to address a number of subtle yet important issues which have stymied earlier DML algorithms. It maintains an explicit model of the distributions of the different classes in representation space. It then employs this knowledge to adaptively assess similarity, and achieve local discrimination by penalizing class distribution overlap.

We demonstrate the effectiveness of this idea on several tasks. Our approach achieves state-of-the-art classification results on a number of fine-grained visual recognition datasets, surpassing the standard softmax classifier and outperforming triplet loss by a relative margin of 30-40%. In terms of computational performance, it alleviates training inefficiencies in the traditional triplet loss, reaching the same error in 5-30 times fewer iterations. Beyond classification, we further validate the saliency of the learnt representations via their attribute concentration and hierarchy recovery properties, achieving 10-25% relative gains on the softmax classifier and 25-50% on triplet loss in these tasks.

• Rodriguez A (2015) Clustering by fast search and find of density peaks. pdf  |  GitXiv

• Cluster analysis is aimed at classifying elements into categories on the basis of their similarity. Its applications range from astronomy to bioinformatics, bibliometrics, and pattern recognition. We propose an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. This idea forms the basis of a clustering procedure in which the number of clusters arises intuitively, outliers are automatically spotted and excluded from the analysis, and clusters are recognized regardless of their shape and of the dimensionality of the space in which they are embedded. We demonstrate the power of the algorithm on several test cases.

• Victoria: I remember reading this (2015): very neat clustering method; glad to see the code [GitHub]
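
• The core of the method fits in a few lines; here is a minimal NumPy sketch (mine, not the authors' code): compute each point's local density rho and its distance delta to the nearest point of higher density, take the points with the largest rho·delta as cluster centers, then assign every other point to the cluster of its nearest higher-density neighbor.

```python
import numpy as np

def density_peaks(X, d_c, n_clusters):
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    rho = (D < d_c).sum(axis=1) - 1                              # local density (self excluded)
    delta = np.empty(len(X))
    for i in range(len(X)):
        higher = np.where(rho > rho[i])[0]
        delta[i] = D[i, higher].min() if len(higher) else D[i].max()
    centers = np.argsort(rho * delta)[-n_clusters:]              # high density AND isolated
    labels = -np.ones(len(X), dtype=int)
    labels[centers] = np.arange(n_clusters)
    # Assign in order of decreasing density, following nearest higher-density neighbors.
    for i in np.argsort(-rho):
        if labels[i] == -1:
            higher = np.where(rho > rho[i])[0]
            j = higher[np.argmin(D[i, higher])] if len(higher) else centers[np.argmin(D[i, centers])]
            labels[i] = labels[j]
    return labels, centers
```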

• Romano S [Verspoor K] (2015) Adjusting for Chance Clustering Comparison Measures. arXiv:1512.01286.  |  journal  |  pdf
• Adjusted for chance measures are widely used to compare partitions/clusterings of the same data set. In particular, the Adjusted Rand Index (ARI) based on pair-counting, and the Adjusted Mutual Information (AMI) based on Shannon information theory are very popular in the clustering community. Nonetheless it is an open problem as to what are the best application scenarios for each measure and guidelines in the literature for their usage are sparse, with the result that users often resort to using both. Generalized Information Theoretic (IT) measures based on the Tsallis entropy have been shown to link pair-counting and Shannon IT measures. In this paper, we aim to bridge the gap between adjustment of measures based on pair-counting and measures based on information theory. We solve the key technical challenge of analytically computing the expected value and variance of generalized IT measures. This allows us to propose adjustments of generalized IT measures, which reduce to well known adjusted clustering comparison measures as special cases. Using the theory of generalized IT measures, we are able to propose the following guidelines for using ARI and AMI as external validation indices: ARI should be used when the reference clustering has large equal sized clusters; AMI should be used when the reference clustering is unbalanced and there exist small clusters.
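
• Both indices are a one-liner in scikit-learn, so following the paper's guideline (ARI when the reference clustering has large, equal-sized clusters; AMI when it is unbalanced with small clusters) is mostly a matter of choosing which call to report:

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

reference = [0, 0, 0, 1, 1, 1, 2, 2]   # ground-truth partition
predicted = [0, 0, 1, 1, 1, 1, 2, 2]   # candidate clustering

print(adjusted_rand_score(reference, predicted))         # pair-counting, chance-adjusted
print(adjusted_mutual_info_score(reference, predicted))  # information-theoretic, chance-adjusted
```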

• Santra T (2016) A Bayesian non-parametric method for clustering high-dimensional binary data. arXiv:1603.02494
• In many real life problems, objects are described by large number of binary features. For instance, documents are characterized by presence or absence of certain keywords; cancer patients are characterized by presence or absence of certain mutations etc. In such cases, grouping together similar objects/profiles based on such high dimensional binary features is desirable, but challenging. Here, I present a Bayesian non parametric algorithm for clustering high dimensional binary data. It uses a Dirichlet Process (DP) mixture model and simulated annealing to not only cluster binary data, but also find optimal number of clusters in the data. The performance of the algorithm was evaluated and compared with other algorithms using simulated datasets. It outperformed all other clustering methods that were tested in the simulation studies. It was also used to cluster real datasets arising from document analysis, handwritten image analysis and cancer research. It successfully divided a set of documents based on their topics, hand written images based on different styles of writing digits and identified tissue and mutation specificity of chemotherapy treatments.

• [comprehensive review:] Sokolova M & Lapalme G. (2009) A systematic analysis of performance measures for classification tasks. pdf  |  reddit
• This paper presents a systematic analysis of twenty four performance measures used in the complete spectrum of Machine Learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of data. Then the analysis concentrates on the type of changes to a confusion matrix that do not change a measure, therefore, preserve a classifier's evaluation (measure invariance). The result is the measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications where invariance properties of measures lead to a more reliable evaluation of classifiers. Text classification supplements the discussion with several case studies.
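
• For reference, the most common of those measures fall straight out of the confusion matrix; a small NumPy sketch (mine):

```python
import numpy as np

def per_class_metrics(cm):
    # cm[i, j] = number of items of true class i predicted as class j.
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)          # column-wise: predicted-positive totals
    recall = tp / cm.sum(axis=1)             # row-wise: actual-positive totals
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

cm = np.array([[50,  2,  3],
               [10, 40,  5],
               [ 4,  6, 30]])
print(per_class_metrics(cm))
```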

• Venkatesan R (2016) A novel progressive learning technique for multi-class classification. arXiv:1609.00085  |  reddit  |  reddit

• In this paper, a progressive learning technique for multi-class classification is proposed. This newly developed learning technique is independent of the number of class constraints and it can learn new classes while still retaining the knowledge of previous classes. Whenever a new class (non-native to the knowledge learnt thus far) is encountered, the neural network structure gets remodeled automatically by facilitating new neurons and interconnections, and the parameters are calculated in such a way that it retains the knowledge learnt thus far. This technique is suitable for real-world applications where the number of classes is often unknown and online learning from real-time data is required. The consistency and the complexity of the progressive learning technique are analyzed. Several standard datasets are used to evaluate the performance of the developed technique. A comparative study shows that the developed technique is superior.

• Follow-on paper:   Venkatesan R (2016) A novel online multi-label classifier for high-speed streaming data applications. arXiv:1609.00086  |  reddit
• In this paper, a high-speed online neural network classifier based on extreme learning machines for multi-label classification is proposed. In multi-label classification, each of the input data sample belongs to one or more than one of the target labels. The traditional binary and multi-class classification where each sample belongs to only one target class forms the subset of multi-label classification. Multi-label classification problems are far more complex than binary and multi-class classification problems, as both the number of target labels and each of the target labels corresponding to each of the input samples are to be identified. The proposed work exploits the high-speed nature of the extreme learning machines to achieve real-time multi-label classification of streaming data. A new threshold-based online sequential learning algorithm is proposed for high speed and streaming data classification of multi-label problems. The proposed method is experimented with six different datasets from different application domains such as multimedia, text, and biology. The hamming loss, accuracy, training time and testing time of the proposed technique is compared with nine different state-of-the-art methods. Experimental studies shows that the proposed technique outperforms the existing multi-label classifiers in terms of performance and speed.

• Ver Steeg G [University of Southern California] (2014) Maximally informative hierarchical representations of high-dimensional data. arXiv:1410.7404  |  reddit
• We consider a set of probabilistic functions of some input variables as a representation of the inputs. We present bounds on how informative a representation is about input data. We extend these bounds to hierarchical representations so that we can quantify the contribution of each layer towards capturing the information in the original data. The special form of these bounds leads to a simple, bottom-up optimization procedure to construct hierarchical representations that are also maximally informative about the data. This optimization has linear computational complexity and constant sample complexity in the number of variables. These results establish a new approach to unsupervised learning of deep representations that is both principled and practical. We demonstrate the usefulness of the approach on both synthetic and real-world data.

• Wainberg M [Brendan J. Frey] (2016) Are Random Forests Truly the Best Classifiers? pdf:publisher [JMLR]  |  pdf  [local copy]

• The JMLR study Do we need hundreds of classifiers to solve real world classification problems? benchmarks 179 classifiers in 17 families on 121 data sets from the UCI repository and claims that "the random forest is clearly the best family of classifier". In this response, we show that the study's results are biased by the lack of a held-out test set and the exclusion of trials with errors. Further, the study's own statistical tests indicate that random forests do not have significantly higher percent accuracy than support vector machines and neural networks, calling into question the conclusion that random forests are the best classifiers.

• Response to: Fernández-Delgado M (2015) Do we need hundreds of classifiers to solve real world classification problems?  pdf

• Wang Y (2016) Random Bits Forest: a Strong Classifier/Regressor for Big Data. pdf  |  reddit  |  reddit

• Efficiency, memory consumption, and robustness are common problems with many popular methods for data analysis. As a solution, we present Random Bits Forest (RBF), a classification and regression algorithm that integrates neural networks (for depth), boosting (for width), and random forests (for prediction accuracy). Through a gradient boosting scheme, it first generates and selects ~10,000 small, 3-layer random neural networks. These networks are then fed into a modified random forest algorithm to obtain predictions. Testing with datasets from the UCI (University of California, Irvine) Machine Learning Repository shows that RBF outperforms other popular methods in both accuracy and robustness, especially with large datasets (N  > 1000). The algorithm also performed highly in testing with an independent data set, a real psoriasis genome-wide association study (GWAS).

• Zoran D [Google DeepMind] (2017) Learning Deep Nearest Neighbor Representations Using Differentiable Boundary Trees. arXiv:1702.08833  |  reddit

• Nearest neighbor (kNN) methods have been gaining popularity in recent years in light of advances in hardware and efficiency of algorithms. There is a plethora of methods to choose from today, each with their own advantages and disadvantages. One requirement shared between all kNN based methods is the need for a good representation and distance measure between samples. We introduce a new method called differentiable boundary tree which allows for learning deep kNN representations. We build on the recently proposed boundary tree algorithm which allows for efficient nearest neighbor classification, regression and retrieval. By modelling traversals in the tree as stochastic events, we are able to form a differentiable cost function which is associated with the tree's predictions. Using a deep neural network to transform the data and back-propagating through the tree allows us to learn good representations for kNN methods. We demonstrate that our method is able to learn suitable representations allowing for very efficient trees with a clearly interpretable structure.

## CNN - GENERAL

• CNN architectures [Victoria; pdf]

• Convolutions:

[CNN:General] Blogs:

[CNN:General] Instruction:

[CNN:General:Instruction] PAPERS:

• Dumoulin V (2016) A guide to convolution arithmetic for deep learning. arXiv:1603.07285  |  reddit  |  GitXiv

• We introduce a guide to help deep learning practitioners understand and manipulate convolutional neural network architectures. The guide clarifies the relationship between various properties (input shape, kernel shape, zero padding, strides and output shape) of convolutional, pooling and transposed convolutional layers, as well as the relationship between convolutional and transposed convolutional layers. Relationships are derived for various cases, and are illustrated in order to make them intuitive.

• Victoria: Meh! Not very useful -- ignore!
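
• The one relationship worth remembering, though (square input of size i, kernel k, padding p, stride s); a trivial helper of my own:

```python
import math

def conv_output_size(i, k, p=0, s=1):
    # Spatial output size of a convolution / pooling layer.
    return math.floor((i + 2 * p - k) / s) + 1

print(conv_output_size(32, 3, p=1, s=1))   # 32 -- "same" padding for a 3x3 kernel
print(conv_output_size(32, 3, p=0, s=2))   # 15 -- unpadded, stride-2 convolution
```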

• Li Z (2016) Learning without Forgetting. arXiv:1606.09282  |  reddit

• When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to commonly used feature extraction and fine-tuning adaption techniques and performs similarly to multitask learning that uses original task data we assume unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning as standard practice for improved new task performance.

• reddit: Transfer learning across CNNs. This takes the simultaneous-training approach of having two output layers, 1 for each separate task. But one only has the dataset for the second task. How does training on the second task still preserve performance on the first task? Before training on the second task, one runs the original CNN through the entire second dataset and record all the outputs of the original CNN on the new data. Now when training on the second dataset, one trains against the second dataset as usual but also against the hardwired probabilities/classifications. This forces the CNN to preserve the first task's performance (at least, to the extent that the second task's images are comparable to the first task's images) but without having to keep around the original dataset.
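
• A sketch of the combined objective as the comment above describes it (my own NumPy illustration; the paper uses a knowledge-distillation term with a temperature, and the exact weighting differs): cross-entropy on the new task plus a term that keeps the old-task head close to the responses recorded before fine-tuning.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def lwf_loss(new_logits, new_labels, old_logits, recorded_old_probs, lam=1.0, T=2.0):
    # New-task cross-entropy.
    p_new = softmax(new_logits)
    ce = -np.log(p_new[np.arange(len(new_labels)), new_labels]).mean()
    # "Don't forget" term: soft cross-entropy between the recorded old-task
    # responses and the current old-task outputs, both softened by T.
    p_old = softmax(old_logits, T)
    q_old = softmax(np.log(recorded_old_probs + 1e-12), T)
    distill = -(q_old * np.log(p_old + 1e-12)).sum(axis=1).mean()
    return ce + lam * distill
```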

• Mairal J (2016) End-to-End Kernel Learning with Supervised Convolutional Kernel Networks. arXiv:1605.06265  |  reddit

• In this paper, we propose a new image representation based on a multilayer kernel machine that performs end-to-end learning. Unlike traditional kernel methods, where the kernel is handcrafted or adapted to data in an unsupervised manner, we learn how to shape the kernel for a supervised prediction problem. We proceed by generalizing convolutional kernel networks, which originally provide unsupervised image representations, and we derive backpropagation rules to optimize model parameters. As a result, we obtain a new type of convolutional neural network with the following properties: (i) at each layer, learning filters is equivalent to optimizing a linear subspace in a reproducing kernel Hilbert space (RKHS), where we project data, (ii) the network may be learned with supervision or without, (iii) the model comes with a natural regularization function (the norm in the RKHS). We show that our method achieves reasonably competitive performance on some standard "deep learning" image classification datasets such as CIFAR-10 and SVHN, and also state-of-the-art results for image super-resolution, demonstrating the applicability of our approach to a large variety of image-related tasks.

• Trottier L (2016) Parametric Exponential Linear Unit for Deep Convolutional Neural Networks. arXiv:1605.0933  |  reddit

• The activation function is an important component in Convolutional Neural Networks (CNNs). For instance, recent breakthroughs in Deep Learning can be attributed to the Rectified Linear Unit (ReLU). Another recently proposed activation function, the Exponential Linear Unit (ELU), has the supplementary property of reducing bias shift without explicitly centering the values at zero. In this paper, we show that learning a parameterization of ELU improves its performance. We analyzed our proposed Parametric ELU (PELU) in the context of vanishing gradients and provide a gradient-based optimization framework. We conducted several experiments on CIFAR-10/100 and ImageNet with different network architectures, such as NiN, Overfeat, All-CNN and ResNet. Our results show that our PELU has relative error improvements over ELU of 4.45% and 5.68% on CIFAR-10 and 100, and as much as 7.28% with only 0.0003% parameter increase on ImageNet. We also observed that Vgg using PELU tended to prefer activations saturating closer to zero, as in ReLU, except at the last layer, which saturated near -2. Finally, other presented results suggest that varying the shape of the activations during training along with the other parameters helps controlling vanishing gradients and bias shift, thus facilitating learning.
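
• For comparison, ELU and (as I understand the paper's parameterization -- treat this as a sketch, not the authors' definition) PELU:

```python
import numpy as np

def elu(x, alpha=1.0):
    # ELU: identity for x > 0, smooth saturation toward -alpha below zero.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def pelu(x, a, b):
    # Parametric ELU: the positive-side slope (a / b) and the negative-side
    # shape (a, b > 0) are both learned along with the network weights.
    return np.where(x >= 0, (a / b) * x, a * (np.exp(x / b) - 1.0))
```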

• Zeiler MD (2013) Visualizing and understanding convolutional networks. pdf  |  arXiv:1311.2901  |  GitHub [Keras; non-author]

• Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we address both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. We also perform an ablation study to discover the performance contribution from different model layers. This enables us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

• CNN [Karpathy, Stanford Spring 2016 cs231n Lecture 9 slides  |  local copy] >> "Case Studies" >> ZF Net. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller.

• Zhang R (2016) Colorful Image Colorization. arXiv:1603.08511  |  GitXiv  |  reddit  |  Photoshop of the Future May Be Able to Auto-Colorize a B&W Photo

• Given a grayscale photograph as input, this paper attacks the problem of hallucinating a plausible color version of the photograph. This problem is clearly underconstrained, so previous approaches have either relied on significant user interaction or resulted in desaturated colorizations. We propose a fully automatic approach that produces vibrant and realistic colorizations. We embrace the underlying uncertainty of the problem by posing it as a classification task and explore using class-rebalancing at training time to increase the diversity of colors in the result. The system is implemented as a feed-forward operation in a CNN at test time and is trained on over a million color images. We evaluate our algorithm using a "colorization Turing test", asking human subjects to choose between a generated and ground truth color image. Our method successfully fools humans 20% of the time, significantly higher than previous methods.

[CNN:General] Papers:

• Aghajanyan A (2017) Convolution Aware Initialization. arXiv:1702.06295  |  reddit

• Initialization of parameters in deep neural networks has been shown to have a big impact on the performance of the networks (Mishkin & Matas, 2015). The initialization scheme devised by He et al, allowed convolution activations to carry a constrained mean which allowed deep networks to be trained effectively (He et al., 2015a). Orthogonal initializations and more generally orthogonal matrices in standard recurrent networks have been proved to eradicate the vanishing and exploding gradient problem (Pascanu et al., 2012). Majority of current initialization schemes do not take fully into account the intrinsic structure of the convolution operator. Using the duality of the Fourier transform and the convolution operator, Convolution Aware Initialization builds orthogonal filters in the Fourier space, and using the inverse Fourier transform represents them in the standard space. With Convolution Aware Initialization we noticed not only higher accuracy and lower loss, but faster convergence. We achieve new state of the art on the CIFAR10 dataset, and achieve close to state of the art on various other tasks.

• Almahairi A [Tim Cooijmans; Aaron Courville | UMontreal] (2016) Dynamic Capacity Networks. pdf

• We introduce the Dynamic Capacity Network (DCN), a neural network that can adaptively assign its capacity across different portions of the input data. This is achieved by combining modules of two types: low-capacity subnetworks and high-capacity sub-networks. The low-capacity sub-networks are applied across most of the input, but also provide a guide to select a few portions of the input on which to apply the high-capacity sub-networks. The selection is made using a novel gradient-based attention mechanism, that efficiently identifies input regions for which the DCN's output is most sensitive and to which we should devote more capacity. We focus our empirical evaluation on the Cluttered MNIST and SVHN image datasets. Our findings indicate that DCNs are able to drastically reduce the number of computations, compared to traditional convolutional neural networks, while maintaining similar or even better performance.

• Cai B (2016) DehazeNet: An End-to-End System for Single Image Haze Removal. arXiv:1601.07661

• Single image haze removal is a challenging ill-posed problem. Existing methods use various constraints/priors to get plausible dehazing solutions. The key to achieve haze removal is to estimate a medium transmission map for an input hazy image. In this paper, we propose a trainable end-to-end system called DehazeNet, for medium transmission estimation. DehazeNet takes a hazy image as input, and outputs its medium transmission map that is subsequently used to recover a haze-free image via atmospheric scattering model. DehazeNet adopts Convolutional Neural Networks (CNN) based deep architecture, whose layers are specially designed to embody the established assumptions/priors in image dehazing. Specifically, layers of Maxout units are used for feature extraction, which can generate almost all haze-relevant features. We also propose a novel nonlinear activation function in DehazeNet, called Bilateral Rectified Linear Unit (BReLU), which is able to improve the quality of recovered haze-free image. We establish connections between components of the proposed DehazeNet and those used in existing methods. Experiments on benchmark images show that DehazeNet achieves superior performance over existing methods, yet keeps efficient and easy to use.

• Caterini AL (2016) A Geometric Framework for Convolutional Neural Networks. arXiv:1608.04374

• In this paper, a geometric framework for neural networks is proposed. This framework uses the inner product space structure underlying the parameter set to perform gradient descent not in a component-based form, but in a coordinate-free manner. Convolutional neural networks are described in this framework in a compact form, with the gradients of standard -- and higher-order -- loss functions calculated for each layer of the network. This approach can be applied to other network structures and provides a basis on which to create new networks.

• Choi K (2016) Automatic tagging using deep convolutional neural networks. arXiv:1606.00298  |  GitHub  |  GitXiv: Music auto-tagging models and trained weights in Keras/Theano

• We present a content-based automatic music tagging algorithm using fully convolutional neural networks (FCNs). We evaluate different architectures consisting of 2D convolutional layers and subsampling layers only. In the experiments, we measure the AUC-ROC scores of the architectures with different complexities and input types using the MagnaTagATune dataset, where a 4-layer architecture shows state-of-the-art performance with mel-spectrogram input. Furthermore, we evaluated the performances of the architectures with varying the number of layers on a larger dataset (Million Song Dataset), and found that deeper models outperformed the 4-layer architecture. The experiments show that mel-spectrogram is an effective time-frequency representation for automatic tagging and that more complex models benefit from more training data.

• Dai J [MSRA: Microsoft Research Asia] (2017) Deformable Convolutional Networks. arXiv:1703.06211  |  GitHub  |  GitHub  [official implementation  |  reddit ]  |  GitXiv  |  blog post (non-author)  |  reddit

• Convolutional neural networks (CNNs) are inherently limited to model geometric transformations due to the fixed geometric structures in its building modules. In this work, we introduce two new modules to enhance the transformation modeling capacity of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained end-to-end by standard back-propagation, giving rise to deformable convolutional networks. Extensive experiments validate the effectiveness of our approach on sophisticated vision tasks of object detection and semantic segmentation. The code would be released.

• Defferrard M (2016) Convolutional neural networks on graphs with fast localized spectral filtering. arXiv:1606.09375  |  GitHub  |  GitXiv  |  reddit

• In this work, we are interested in generalizing convolutional neural networks (CNNs) from low-dimensional regular grids, where image, video and speech are represented, to high-dimensional irregular domains, such as social networks, brain connectomes or words' embedding, represented by graphs. We present a formulation of CNNs in the context of spectral graph theory, which provides the necessary mathematical background and efficient numerical schemes to design fast localized convolutional filters on graphs. Importantly, the proposed technique offers the same linear computational complexity and constant learning complexity as classical CNNs, while being universal to any graph structure. Experiments on MNIST and 20NEWS demonstrate the ability of this novel deep learning system to learn local, stationary, and compositional features on graphs.
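
• The "fast localized" filtering boils down to a Chebyshev recurrence on the rescaled graph Laplacian, with no eigendecomposition; a minimal NumPy sketch (mine) of one filter applied to a signal x on the graph:

```python
import numpy as np

def chebyshev_filter(L, x, theta, lmax=2.0):
    # y = sum_k theta[k] * T_k(L_tilde) x, where L_tilde = 2 L / lmax - I and
    # T_k are Chebyshev polynomials computed by their three-term recurrence.
    # lmax is (an upper bound on) the largest eigenvalue of L; 2.0 works for
    # the normalized Laplacian.
    n = L.shape[0]
    L_tilde = (2.0 / lmax) * L - np.eye(n)
    Tx = [x, L_tilde @ x]
    for _ in range(2, len(theta)):
        Tx.append(2.0 * L_tilde @ Tx[-1] - Tx[-2])
    return sum(t * Txk for t, Txk in zip(theta, Tx))
```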

• Ge Z (2015) Fine-Grained Classification via Mixture of Deep Convolutional Neural Networks. arXiv:1511.09209  |  reddit  |  GitXiv  |  GitXiv

• We present a novel deep convolutional neural network (DCNN) system for fine-grained image classification, called a mixture of DCNNs (MixDCNN). The fine-grained image classification problem is characterised by large intra-class variations and small inter-class variations. To overcome these problems our proposed MixDCNN system partitions images into K subsets of similar images and learns an expert DCNN for each subset. The output from each of the K DCNNs is combined to form a single classification decision. In contrast to previous techniques, we provide a formulation to perform joint end-to-end training of the K DCNNs simultaneously. Extensive experiments, on three datasets using two network structures (AlexNet and GoogLeNet), show that the proposed MixDCNN system consistently outperforms other methods. It provides a relative improvement of 12.7% and achieves state-of-the-art results on two datasets.

• Gu J (2015) Recent advances in convolutional neural networks. arXiv:1512.07108  |  KDNuggets.com

• In the last few years, deep learning has led to very good performance on a variety of problems, such as visual recognition, speech recognition and natural language processing. Among different types of deep neural networks, convolutional neural networks have been most extensively studied. Due to the lack of training data and computing power in early days, it is hard to train a large high-capacity convolutional neural network without overfitting. After the rapid growth in the amount of the annotated data and the recent improvements in the strengths of graphics processor units (GPUs), the research on convolutional neural networks has been emerged swiftly and achieved state-of-the-art results on various tasks. In this paper, we provide a broad survey of the recent advances in convolutional neural networks. Besides, we also introduce some applications of convolutional neural networks in computer vision.

• "DenseNet "Huang G (2016) Densely Connected Convolutional Networks. arXiv:1608.06993  |  GitHub [PyTorch  |  reddit]  |  \\ GitHub  |  [reddit]  |  GitHub  |  GitXiv  |  GitHub [Keras; non-author]

• Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. Code and models are available at GitHub.

• Discussions - reddit:

• Notes on the Implementation of DenseNet in TensorFlow
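
• A toy dense block on feature vectors (my own sketch; the real thing uses convolutions, batch normalization and a growth rate k): each layer takes the concatenation of everything before it, and its output is appended for everything after it.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dense_block(x, weight_list):
    # With L layers, every layer sees every earlier feature map: the
    # L(L+1)/2 direct connections described in the abstract.
    features = [x]
    for W in weight_list:
        inp = np.concatenate(features, axis=-1)
        features.append(relu(inp @ W))
    return np.concatenate(features, axis=-1)

# Input width 8, three layers, growth rate k = 4 new features per layer.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))
k = 4
weights = [rng.standard_normal((8 + i * k, k)) * 0.1 for i in range(3)]
out = dense_block(x, weights)   # shape (1, 8 + 3*4)
```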

• Jégou S [Yoshua Bengio] (2016) The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. arXiv:1611.09326

• Kuen J (2016) DelugeNets: Deep Networks with Massive and Flexible Cross-layer Information Inflows. arXiv:1611.05552

• Kim Y (2015) Character-aware neural language models. arXiv:1508.06615  |  GitHub

• We describe a simple neural language model that relies only on character-level inputs. Predictions are still made at the word-level. Our model employs a convolutional neural network (CNN) and a highway network over characters, whose output is given to a long short-term memory (LSTM) recurrent neural network language model (RNN-LM). On the English Penn Treebank the model is on par with the existing state-of-the-art despite having 60% fewer parameters. On languages with rich morphology (Arabic, Czech, French, German, Spanish, Russian), the model outperforms word-level/morpheme-level LSTM baselines, again with fewer parameters. The results suggest that on many languages, character inputs are sufficient for language modeling. Analysis of word representations obtained from the character composition part of the model reveals that the model is able to encode, from characters only, both semantic and orthographic information.

• reddit:

• Miyamoto Y & Cho K [Kyunghyun Cho | New York University] (2016) Gated Word-Character Recurrent Language Model. arXiv:1606.01700

• We introduce a recurrent neural network language model (RNN-LM) with long short-term memory (LSTM) units that utilizes both character-level and word-level inputs. Our model has a gate that adaptively finds the optimal mixture of the character-level and word-level inputs. The gate creates the final vector representation of a word by combining two distinct representations of the word. The character-level inputs are converted into vector representations of words using a bidirectional LSTM. The word-level inputs are projected into another high-dimensional space by a word lookup table. The final vector representations of words are used in the LSTM language model which predicts the next word given all the preceding words. Our model with the gating mechanism effectively utilizes the character-level inputs for rare and out-of-vocabulary words and outperforms word-level language models on several English corpora.
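
• Schematically (a sketch of the mixing step only; the gate's exact parameterization is in the paper), the gate g decides, per word, how much of the character-level vector to blend into the word-level one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_word_char_embedding(x_word, x_char, v, b):
    # x_word, x_char: (n_words, d); v: (d,); b: scalar.
    g = sigmoid(x_word @ v + b)                    # scalar gate per word
    return (1.0 - g)[:, None] * x_word + g[:, None] * x_char
```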

• Krähenbühl P (2015) Data-dependent Initializations of Convolutional Neural Networks. arXiv:1511.06856  |  reddit  |  GitXiv

• Convolutional Neural Networks spread through computer vision like a wildfire, impacting almost all visual tasks imaginable. Despite this, few researchers dare to train their models from scratch. Most work builds on one of a handful of ImageNet pre-trained models, and fine-tunes or adapts these for specific tasks. This is in large part due to the difficulty of properly initializing these networks from scratch. A small miscalibration of the initial weights leads to vanishing or exploding gradients, as well as poor convergence properties. In this work we present a fast and simple data-dependent initialization procedure, that sets the weights of a network such that all units in the network train at roughly the same rate, avoiding vanishing or exploding gradients. Our initialization matches the current state-of-the-art unsupervised or self-supervised pre-training methods on standard computer vision tasks, such as image classification and object detection, while being roughly three orders of magnitude faster. When combined with pre-training methods, our initialization significantly outperforms prior work, narrowing the gap between supervised and unsupervised pre-training.

• Krizhevsky A (2014) One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997

• I present a new way to parallelize the training of convolutional neural networks across multiple GPUs. The method scales significantly better than all alternatives when applied to modern convolutional neural networks.

• seminal paper / WINNER: ILSVRC 2012:  Krizhevsky A [Ilya Sutskever; Geoffrey E. Hinton] (2012) ImageNet Classification with Deep Convolutional Neural Networks. pdf

• We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called "dropout" that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.

• blog: commentary

• Cited often in Andrej Karpathy (Stanford)'s cs231n: Convolutional Neural Networks for Visual Recognition (Spring 2016) class, esp.   this lecture   [local copy  (pdf)]

• Li H (2016) Pruning Filters for Efficient ConvNets. arXiv:1608.08710  |  reddit

• The success of CNNs in various applications is accompanied by a significant increase in the computation and parameter storage costs. Recent efforts toward reducing these overheads involve pruning and compressing the weights of various layers without hurting original accuracy. However, magnitude-based pruning of weights reduces a significant number of parameters from the fully connected layers and may not adequately reduce the computation costs in the convolutional layers due to irregular sparsity in the pruned networks. We present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly. In contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications. We show that even simple filter pruning techniques can reduce inference costs for VGG-16 by up to 34% and ResNet-110 by up to 38% on CIFAR10 while regaining close to the original accuracy by retraining the networks.
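
• The pruning criterion itself is simple enough to sketch in a few lines of NumPy (mine, not the authors' code): score each filter by the sum of its absolute weights and keep only the strongest; the corresponding feature maps, and the matching channels in the next layer, are then dropped as well.

```python
import numpy as np

def prune_filters(W, keep_ratio=0.7):
    # W: conv weights with shape (out_channels, in_channels, kh, kw).
    scores = np.abs(W).sum(axis=(1, 2, 3))            # L1 norm per output filter
    n_keep = max(1, int(round(keep_ratio * W.shape[0])))
    keep = np.sort(np.argsort(scores)[-n_keep:])      # indices of the kept filters
    return W[keep], keep
```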

• Niepert M (2016) Learning Convolutional Neural Networks for Graphs. arXiv:1605.05273

• Numerous important problems can be framed as learning from graph data. We propose a framework for learning convolutional neural networks for arbitrary graphs. These graphs may be undirected, directed, and with both discrete and continuous node and edge attributes. Analogous to image-based convolutional networks that operate on locally connected regions of the input, we present a general approach to extracting locally-connected regions from graphs. Using established benchmark data sets, we demonstrate that the learned feature representations are competitive with state of the art graph kernels and that their computation is highly efficient.

• Shao J (CVPR 2016) Slicing convolutional neural network for crowd video understanding. pdf  |  GitHub  |  GitXiv

• Learning and capturing both appearance and dynamic representations are pivotal for crowd video understanding. Convolutional Neural Networks (CNNs) have shown its remarkable potential in learning appearance representations from images. However, the learning of dynamic representation, and how it can be effectively combined with appearance features for video analysis, remains an open problem. In this study, we propose a novel spatio-temporal CNN, named Slicing CNN (S-CNN), based on the decomposition of 3D feature maps into 2D spatio- and 2D temporal-slices representations. The decomposition brings unique advantages:

(1) the model is capable of capturing dynamics of different semantic units such as groups and objects,
(2) it learns separated appearance and dynamic representations while keeping proper interactions between them, and
(3) it exploits the selectiveness of spatial filters to discard irrelevant background clutter for crowd understanding.

We demonstrate the effectiveness of the proposed S-CNN model on the WWW crowd video dataset for attribute recognition and observe significant performance improvements to the state-of-the-art methods (62.55% from 51.84%).

• Springenberg JT [Martin Riedmiller | University of Freiburg] (2014) Striving for simplicity: The all convolutional net. arXiv:1412.6806

• Most modern convolutional neural networks (CNNs) used for object recognition are built using the same principles: Alternating convolution and max-pooling layers followed by a small number of fully connected layers. We re-evaluate the state of the art for object recognition from small images with convolutional networks, questioning the necessity of different components in the pipeline. We find that max-pooling can simply be replaced by a convolutional layer with increased stride without loss in accuracy on several image recognition benchmarks. Following this finding -- and building on other recent work for finding simple network structures -- we propose a new architecture that consists solely of convolutional layers and yields competitive or state of the art performance on several object recognition datasets (CIFAR-10, CIFAR-100, ImageNet). To analyze the network we introduce a new variant of the "deconvolution approach" for visualizing features learned by CNNs, which can be applied to a broader range of network structures than existing approaches.

• Mentions [reddit]:
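
• The "max-pooling as strided convolution" swap, shape-wise, in a toy NumPy sketch (mine): both halve the spatial resolution, but the convolutional version has learnable weights.

```python
import numpy as np

def downsample_maxpool(x, k=2):
    # 2x2 max pooling with stride 2 on an (H, W) feature map.
    H, W = x.shape[0] // k * k, x.shape[1] // k * k
    return x[:H, :W].reshape(H // k, k, W // k, k).max(axis=(1, 3))

def downsample_strided_conv(x, w, stride=2):
    # A learned kh x kw filter applied with stride 2: the all-convolutional
    # replacement for the pooling layer above.
    kh, kw = w.shape
    H = (x.shape[0] - kh) // stride + 1
    W = (x.shape[1] - kw) // stride + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(x[stride*i:stride*i+kh, stride*j:stride*j+kw] * w)
    return out
```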

• Srinivas S (2016) Taxonomy of Deep CNN for Computer Vision  [pdf]  |  excellent!

• Stabinger S (2016) 25 years of CNNs: Can we compare to human abstraction capabilities? arXiv:1607.08366

• We try to determine the progress made by convolutional neural networks over the past 25 years in classifying images into abstract classes. For this purpose we compare the performance of LeNet to that of GoogLeNet at classifying randomly generated images which are differentiated by an abstract property (e.g., one class contains two objects of the same size, the other class two objects of different sizes). Our results show that there is still work to do in order to solve vision problems humans are able to solve without much difficulty.

• Victoria: short (8 pp.) paper; rudimentary figures; ... Students? [ i.e., student review paper? - looks like an undergrad project; could be useful, however ... :-/ ]

• Urban G (2016) Do Deep Convolutional Nets Really Need to be Deep (Or Even Convolutional)? arXiv:1603.05691  |  reddit

• Yes, they do. This paper provides the first empirical demonstration that deep convolutional models really need to be both deep and convolutional, even when trained with methods such as distillation that allow small or shallow models of high accuracy to be trained. Although previous research showed that shallow feed-forward nets sometimes can learn the complex functions previously learned by deep nets while using the same number of parameters as the deep models they mimic, in this paper we demonstrate that the same methods cannot be used to train accurate models on CIFAR-10 unless the student models contain multiple layers of convolution. Although the student models do not have to be as deep as the teacher model they mimic, the students need multiple convolutional layers to learn functions of comparable accuracy as the deep convolutional teacher.

• Veličković P [University of Cambridge] (2016) X-CNN: Cross-modal Convolutional Neural Networks for Sparse Datasets. arXiv:1610.00163

• In this paper we propose cross-modal convolutional neural networks (X-CNNs), a novel biologically inspired type of CNN architectures, treating gradient descent-specialised CNNs as individual units of processing in a larger-scale network topology, while allowing for unconstrained information flow and/or weight sharing between analogous hidden layers of the network -- thus generalising the already well-established concept of neural network ensembles (where information typically may flow only between the output layers of the individual networks). The constituent networks are individually designed to learn the output function on their own subset of the input data, after which cross-connections between them are introduced after each pooling operation to periodically allow for information exchange between them. This injection of knowledge into a model (by prior partition of the input data through domain knowledge or unsupervised methods) is expected to yield greatest returns in sparse data environments, which are typically less suitable for training CNNs. For evaluation purposes, we have compared a standard four-layer CNN as well as a sophisticated FitNet4 architecture against their cross-modal variants on the CIFAR-10 and CIFAR-100 datasets with differing percentages of the training data being removed, and find that at lower levels of data availability, the X-CNNs significantly outperform their baselines (typically providing a 2-6% benefit, depending on the dataset size and whether data augmentation is used), while still maintaining an edge on all of the full dataset tests.

• reddit: This paper makes a small modification to a standard CNN architecture to improve results with small datasets. This shows a good improvement on CIFAR10/100 when trained with <40% of the data (usually 2-6%). I wish the authors shared their code though. They specifically mention using Keras, so it can't be more than a few hundred lines of python.
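
A schematic sketch (tf.keras functional API) of the cross-connection idea as I read it: two small CNN branches each see a subset of the input channels, and after a pooling stage each branch receives a 1x1 projection of the other. The channel split, layer sizes, and names here are illustrative, not the paper's configuration.

```python
# Two-branch CNN with cross-connections after pooling (X-CNN-style sketch).
from tensorflow.keras import layers, Input, Model

a_in = Input(shape=(32, 32, 1))   # e.g. a luminance channel
b_in = Input(shape=(32, 32, 2))   # e.g. chrominance channels

a = layers.Conv2D(32, 3, padding="same", activation="relu")(a_in)
b = layers.Conv2D(32, 3, padding="same", activation="relu")(b_in)
a = layers.MaxPooling2D()(a)
b = layers.MaxPooling2D()(b)

# Cross-connections: each branch is concatenated with a projection of the other.
a_to_b = layers.Conv2D(16, 1, activation="relu")(a)
b_to_a = layers.Conv2D(16, 1, activation="relu")(b)
a = layers.Concatenate()([a, b_to_a])
b = layers.Concatenate()([b, a_to_b])

merged = layers.Concatenate()([layers.Flatten()(a), layers.Flatten()(b)])
out = layers.Dense(10, activation="softmax")(merged)
model = Model([a_in, b_in], out)
```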

• Vondrick C [MIT CSAIL] (2017) Anticipating Visual Representations from Unlabeled Video. pdf

• Anticipating actions and objects before they start or appear is a difficult problem in computer vision with several real-world applications. This task is challenging partly because it requires leveraging extensive knowledge of the world that is difficult to write down. We believe that a promising resource for efficiently learning this knowledge is through readily available unlabeled video. We present a framework that capitalizes on temporal structure in unlabeled video to learn to anticipate human actions and objects. The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. Visual representations are a promising prediction target because they encode images at a higher semantic level than pixels yet are automatic to compute. We then apply recognition algorithms on our predicted representation to anticipate objects and actions. We experimentally validate this idea on two datasets, anticipating actions one second in the future and objects five seconds in the future.

• Media:

• Similar:  Lotter W [Harvard] (2016) Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. arXiv:1605.08104  |  project page  |  GitHub  |  GitXiv
• While great strides have been made in using deep learning algorithms to solve supervised learning tasks, the problem of unsupervised learning - leveraging unlabeled examples to learn about the structure of a domain - remains a difficult unsolved challenge. Here, we explore prediction of future frames in a video sequence as an unsupervised learning rule for learning about the structure of the visual world. We describe a predictive neural network ("PredNet") architecture that is inspired by the concept of "predictive coding" from the neuroscience literature. These networks learn to predict future frames in a video sequence, with each layer in the network making local predictions and only forwarding deviations from those predictions to subsequent network layers. We show that these networks are able to robustly learn to predict the movement of synthetic (rendered) objects, and that in doing so, the networks learn internal representations that are useful for decoding latent object parameters (e.g. pose) that support object recognition with fewer training views. We also show that these networks can scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene, and generalizing across video datasets. These results suggest that prediction represents a powerful framework for unsupervised learning, allowing for implicit learning of object and scene structure.

• Wang M (2016) Factorized Convolutional Neural Networks. arXiv:1608.04337  |  reddit

• Deep convolutional neural networks achieve remarkable visual recognition performance, at the cost of high computational complexity. In this paper, we have a new design of efficient convolutional layers based on three schemes. The 3D convolution operation in a convolutional layer can be considered as performing spatial convolution in each channel and linear projection across channels simultaneously. By unravelling them and arranging the spatial convolution sequentially, the proposed layer is composed of a single intra-channel convolution, of which the computation is negligible, and a linear channel projection. A topological subdivisioning is adopted to reduce the connection between the input channels and output channels. Additionally, we also introduce a spatial "bottleneck" structure that utilizes a convolution-projection-deconvolution pipeline to take advantage of the correlation between adjacent pixels in the input. Our experiments demonstrate that the proposed layers remarkably outperform the standard convolutional layers with regard to accuracy/complexity ratio. Our models achieve similar accuracy to VGG, ResNet-50, ResNet-101 while requiring 42, 4.5, 6.5 times less computation respectively.

• reddit: This is not a particularly novel idea, so I get the feeling that some references are missing. It's been available in TensorFlow as tf.nn.separable_conv2d() for a while, and this presentation also discusses it (slide 26 and onwards): pdf  |  YouTube. The results seem to be pretty solid though, and the interaction with residual connections probably makes it more practical. It's a nice way to further increase depth and nonlinearity while keeping the computational cost and risk of overfitting at reasonable levels.

• In traditional convolution layers, the convolution is tied up with cross-channel pooling: for each output channel, a convolution is applied to each input channel and the results are summed together. This leads to the unfortunate situation where the network may often be repeatedly applying similar filters to each input channel. Storing these filters wastes memory, and applying them repeatedly wastes computation. It's possible to instead split the computation into a convolution stage, where multiple filters are applied to each input channel, and a cross-channel pooling stage where the output channels each use whichever intermediate results are of use to them. This allows the filters to be shared across multiple output channels. This reduces their number substantially, allowing for more efficient networks, or larger networks with a similar computational cost.

• See also blog post: Factorized convolutional neural networks, AKA separable convolutions
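
A minimal sketch (PyTorch) of the factorization discussed above: a standard KxK convolution is split into a per-channel ("depthwise") spatial convolution followed by a 1x1 cross-channel projection. TensorFlow exposes the same idea as tf.nn.separable_conv2d.

```python
# Depthwise-separable convolution: spatial filtering per channel, then a 1x1
# linear projection across channels.
import torch.nn as nn

def separable_conv(in_ch, out_ch, k=3):
    return nn.Sequential(
        # Spatial filtering, one filter per input channel (groups=in_ch).
        nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=k // 2, groups=in_ch),
        # Cross-channel linear projection.
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )
```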

• Xu B (2015) Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853.  |  reddit

• Yu J (2016) UnitBox: An Advanced Object Detection Network. arXiv:1608.01471  |  reddit

• In present object detection systems, the deep convolutional neural networks (CNNs) are utilized to predict bounding boxes of object candidates, and have gained performance advantages over the traditional region proposal methods. However, existing deep CNN methods assume the object bounds to be four independent variables, which could be regressed by the $\small \ell_2$ loss separately. Such an oversimplified assumption is contrary to the well-received observation that those variables are correlated, resulting in less accurate localization. To address the issue, we first introduce a novel Intersection over Union ($\small IoU$) loss function for bounding box prediction, which regresses the four bounds of a predicted box as a whole unit. By taking advantage of the $\small IoU$ loss and deep fully convolutional networks, UnitBox is introduced, which performs accurate and efficient localization, is robust to objects of varied shapes and scales, and converges fast. We apply UnitBox to the face detection task and achieve the best performance among all published methods on the FDDB benchmark.
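
A hedged sketch (PyTorch) of an IoU-style bounding-box loss in the spirit of UnitBox: the four coordinates are treated jointly through their overlap rather than regressed independently with an $\small \ell_2$ loss. (The paper parameterizes boxes per pixel; here boxes are simply (x1, y1, x2, y2).)

```python
# -ln(IoU) loss over batches of predicted and target boxes in (x1, y1, x2, y2) form.
import torch

def iou_loss(pred, target, eps=1e-6):
    # Intersection rectangle.
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
    area_t = (target[:, 2] - target[:, 0]).clamp(min=0) * (target[:, 3] - target[:, 1]).clamp(min=0)
    union = area_p + area_t - inter

    iou = inter / (union + eps)
    return (-torch.log(iou + eps)).mean()   # the paper uses -ln(IoU)
```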

• Similar (and also published Aug 2016):  Wan S (2016) Bootstrapping Face Detection with Hard Negative Examples. arXiv:1608.02236  |  reddit

• Recently, significant performance improvement in face detection was made possible by deeply trained convolutional networks. In this report, a novel approach for training a state-of-the-art face detector is described. The key is to exploit the idea of hard negative mining and iteratively update the Faster R-CNN based face detector with the hard negatives harvested from a large set of background examples. We demonstrate that our face detector outperforms state-of-the-art detectors on the FDDB dataset, which is the de facto standard for evaluating face detection algorithms.
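
A rough sketch (plain Python) of the hard-negative-mining loop described above: score a large pool of background regions with the current detector, keep the highest-scoring false positives, add them to the training set, and retrain. The `score_fn` and `train_fn` callables are placeholders standing in for the detector-specific scoring and retraining code.

```python
# Iterative bootstrapping with hard negatives (sketch, not the authors' pipeline).
def bootstrap_hard_negatives(detector, score_fn, train_fn, background_regions,
                             labeled_data, rounds=3, k=5000):
    """score_fn(detector, regions) -> list of detection scores;
    train_fn(detector, dataset) -> retrained detector. Both are user-supplied."""
    train_set = list(labeled_data)
    for _ in range(rounds):
        scores = score_fn(detector, background_regions)
        ranked = sorted(zip(background_regions, scores), key=lambda rs: rs[1], reverse=True)
        hard_negatives = [region for region, _ in ranked[:k]]    # confident false positives
        train_set += [(region, 0) for region in hard_negatives]  # label 0 = "not a face"
        detector = train_fn(detector, train_set)
    return detector
```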

• Zhang Y [Stanford] (2016) Convexified Convolutional Neural Networks. arXiv:1609.01000  |  reddit  |  reddit

• We describe the class of convexified convolutional neural networks (CCNNs), which capture the parameter sharing of convolutional neural networks in a convex manner. By representing the nonlinear convolutional filters as vectors in a reproducing kernel Hilbert space, the CNN parameters can be represented as a low-rank matrix, which can be relaxed to obtain a convex optimization problem. For learning two-layer convolutional neural networks, we prove that the generalization error obtained by a convexified CNN converges to that of the best possible CNN. For learning deeper networks, we train CCNNs in a layer-wise manner. Empirically, CCNNs achieve performance competitive with CNNs trained by backpropagation, SVMs, fully-connected neural networks, stacked denoising auto-encoders, and other baseline methods.

• Zhang Y (2015) Sensitivity Analysis/Practitioners' Guide: CNN, Sentence Classification. arXiv:1510.03820  |  NLP, VSM, sentence classification; parameter optimization

• Convolutional Neural Networks (CNNs) have recently achieved remarkably strong performance on the practically important task of sentence classification (Kim 2014, Kalchbrenner 2014, Johnson 2014). However, these models require practitioners to specify an exact model architecture and set accompanying hyperparameters, including the filter region size, regularization parameters, and so on. It is currently unknown how sensitive model performance is to changes in these configurations for the task of sentence classification. We thus conduct a sensitivity analysis of one-layer CNNs to explore the effect of architecture components on model performance; our aim is to distinguish between important and comparatively inconsequential design decisions for sentence classification. We focus on one-layer CNNs (to the exclusion of more complex models) due to their comparative simplicity and strong empirical performance, which makes them a modern standard baseline method akin to Support Vector Machines (SVMs) and logistic regression. We derive practical advice from our extensive empirical results for those interested in getting the most out of CNNs for sentence classification in real world settings.
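
A minimal sketch (PyTorch) of the one-layer CNN analyzed in the paper (Kim 2014 style): word embeddings, parallel 1-D convolutions over several filter region sizes, max-over-time pooling, dropout, then a softmax classifier. The sizes below are illustrative defaults, not the paper's tuned settings.

```python
# One-layer CNN for sentence classification (sketch).
import torch
import torch.nn as nn

class OneLayerTextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, region_sizes=(3, 4, 5),
                 n_filters=100, n_classes=2, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, n_filters, kernel_size=r) for r in region_sizes)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(n_filters * len(region_sizes), n_classes)

    def forward(self, tokens):                     # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)     # (batch, embed_dim, seq_len)
        # Max-over-time pooling per filter region size.
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(self.dropout(torch.cat(feats, dim=1)))
```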

• Zhou B (2015) Learning Deep Features for Discriminative Localization. arXiv:1512.04150  |  GitXiv: Generic localizable deep representation using a global average pooling layer

• In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that can be applied to a variety of tasks. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014, which is remarkably close to the 34.2% top-5 error achieved by a fully supervised CNN approach. We demonstrate that our network is able to localize the discriminative image regions on a variety of tasks despite not being trained for them.
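
A hedged sketch (PyTorch) of the class-activation-mapping idea described above: with global average pooling just before the classifier, a class's localization map is simply a weighted sum of the last convolutional feature maps, using that class's weights from the final linear layer.

```python
# Class activation map (CAM) from global-average-pooling classifier weights.
import torch

def class_activation_map(feature_maps, fc_weight, class_idx):
    # feature_maps: (C, H, W) activations of the last conv layer for one image
    # fc_weight:    (num_classes, C) weights of the linear layer after global average pooling
    w = fc_weight[class_idx]                           # (C,)
    cam = torch.einsum("c,chw->hw", w, feature_maps)   # weighted sum over channels
    cam = cam - cam.min()
    return cam / (cam.max() + 1e-6)                    # normalized to [0, 1] for display
```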

Visualizations:

## CNN - IMAGE CLASSIFICATION

[CNN-Image Classification] Blogs:

• Building powerful image classification models using very little data  [Keras | by Keras author François Chollet]  |  reddit

• In this tutorial, we will present a few simple yet effective methods that you can use to build a powerful image classifier, using only very few training examples -- just a few hundred or thousand pictures from each class you want to be able to recognize.

• "Training a small convnet from scratch: 80% accuracy in 40 lines of code"
• "Using the bottleneck features of a pre-trained network: 90% accuracy in a minute"

• What a Deep Neural Network thinks about your #selfie [Andrej Karpathy]

• Convolutional Neural Networks are great: they recognize things, places and people in your personal photos, signs, people and lights in self-driving cars, crops, forests and traffic in aerial imagery, various anomalies in medical images and all kinds of other useful things. But once in a while these powerful visual recognition models can also be warped for distraction, fun and amusement. In this fun experiment we're going to do just that: We'll take a powerful, 140-million-parameter state-of-the-art Convolutional Neural Network, feed it 2 million selfies from the internet, and train it to classify good selfies from bad ones. Just because it's easy and because we can. And in the process we might learn how to take better selfies :) ...

[CNN-Image Classification] ILSVRC [ImageNet Large Scale Visual Recognition Competition]:

• Kloss A (2015) Object Detection Using Deep Learning - Learning where to search using visual attention. Doctoral dissertation, Universität Tübingen, Tübingen, Germany. [p. 12, Sect. 1.3.1]

• A good means of judging how close the computer vision community has come to solving the problems of object detection is to look at the results of challenges like the PASCAL Visual Object Challenge (VOC) and later the ImageNet Large Scale Visual Recognition Challenge (ILSVRC)[17].

Pascal VOC started out in 2005 and was held annually until 2012. Among other tasks, it always featured classification as well as localisation tasks. In 2009, the organizers published a paper that presented the challenge and the contributions of this year in more details [18]. Back then, most of the submissions employed the bag-of-visual-words technique based on hand-crafted features like SIFT [19] and HOG [20], where feature vectors are computed at keypoint locations in the image and then a kind of histogram over the feature vectors is used to classify the image content.

From 2010 on, the ILSVRC was also held annually and became the main benchmark for object detection when the PASCAL VOC ended in 2012. It uses images from the huge ImageNet dataset (more than 14 million images) that include up to 1000 different classes for the classification task and 200 classes for detection.

In 2012, the challenge saw a big improvement in performance, when Krizhevsky et al. entered a deep convolutional neural network (CNN) for the first time. They were able to bring down the top-5 classification error from 25.2% (for the second best entry) to 15.3% and the localisation error from 50% to 34.3%. Since then, the ILSVRC has been dominated by convolutional neural networks of increasing depth. In 2014, it seems that the classification task has become relatively easy with the winning entry [22] achieving a top-5 error of 6.7%. The localisation error has dropped to 25% [23] and the best detection mean average precision (mAP) has risen from 22.58% in 2013 to 43.93% in 2014.

• ILSVRC2017 [ImageNet Large Scale Visual Recognition Challenge 2017]

• ILSVRC 2016 [ImageNet Large Scale Visual Recognition Challenge 2016] - Results finally available

• TLDR:

• No big new technologies or revolutionary architectures
• Everyone uses deep learning
• None of the big companies care anymore (no Google, MSRA, Facebook, Baidu, ...)
• Almost all competitors are from Asian organizations

Seems to me like ImageNet is mostly dead.

• From the ILSVRC 2017 team: This year, 85 teams had 344 entries -- more than a 56% increase from last year! Object detection mean average precision (mAP) is up to 66.3% from 62.1% last year. Localization error is down to 7.7% compared to 9.0% last year. Classification error is down to 3.0% from 3.6% last year. Object detection from video (VID) mAP is up to 80.8% from 67.8% last year, and is 55.9% with the new metric with tracking information. The two scene challenges, Scene classification and Scene parsing, were very successful too. Scene classification top-5 error is 9.0% among 92 submissions from 28 teams. Scene parsing average of mIoU and pixel accuracy is 57.2% among 80 submissions from 23 teams.

• ILSVRC 2014:

Convolutional Neural Networks (CNNs / ConvNets)
[Andrej Karpathy, Stanford Spring 2016 cs231n, Lecture 9]  |  local copy

...
CASE STUDIES

There are several architectures in the field of Convolutional Networks that have a name. The most common are:

• LeNet The first successful applications of Convolutional Networks were developed by Yann LeCun in the 1990s. Of these, the best known is the LeNet architecture that was used to read zip codes, digits, etc.

• AlexNet The first work that popularized Convolutional Networks in Computer Vision was the AlexNet, developed by Alex Krizhevsky, Ilya Sutskever and Geoff Hinton. The AlexNet was submitted to the ImageNet ILSVRC challenge in 2012 and significantly outperformed the second runner-up (top 5 error of 16% compared to runner-up with 26% error). The Network had a very similar architecture to LeNet, but was deeper, bigger, and featured Convolutional Layers stacked on top of each other (previously it was common to only have a single CONV layer always immediately followed by a POOL layer).

• ZF Net. The ILSVRC 2013 winner was a Convolutional Network from Matthew Zeiler and Rob Fergus. It became known as the ZFNet (short for Zeiler & Fergus Net). It was an improvement on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller.

• GoogLeNet. The ILSVRC 2014 winner was a Convolutional Network from Szegedy et al. from Google. Its main contribution was the development of an Inception Module that dramatically reduced the number of parameters in the network (4M, compared to AlexNet with 60M). Additionally, this paper uses Average Pooling instead of Fully Connected layers at the top of the ConvNet, eliminating a large amount of parameters that do not seem to matter much. There are also several followup versions to the GoogLeNet, most recently Inception-v4.

• VGGNet. The runner-up in ILSVRC 2014 was the network from Karen Simonyan and Andrew Zisserman that became known as the VGGNet. Its main contribution was in showing that the depth of the network is a critical component for good performance. Their final best network contains 16 CONV/FC layers and, appealingly, features an extremely homogeneous architecture that only performs 3x3 convolutions and 2x2 pooling from the beginning to the end. Their pretrained model is available for plug and play use in Caffe. A downside of the VGGNet is that it is more expensive to evaluate and uses a lot more memory and parameters (140M). Most of these parameters are in the first fully connected layer, and it was since found that these FC layers can be removed with no performance downgrade, significantly reducing the number of necessary parameters.

• ResNet. The Residual Network developed by Kaiming He et al. was the winner of ILSVRC 2015. It features special skip connections and heavy use of batch normalization (a minimal residual-block sketch follows this list). The architecture also omits fully connected layers at the end of the network. The reader is also referred to Kaiming's presentation (video/slides), and to some recent experiments that reproduce these networks in Torch. ResNets are currently by far the state-of-the-art Convolutional Neural Network models and are the default choice for using ConvNets in practice (as of May 2016). In particular, also see more recent developments that tweak the original architecture: Kaiming He et al., Identity Mappings in Deep Residual Networks (published March 2016).

• DenseNet (August 2016) - Recently published by Gao Huang (and others), the Densely Connected Convolutional Network [arXiv:1608.06993] has each layer directly connected to every other layer in a feed-forward fashion. The DenseNet has been shown to obtain significant improvements over previous state-of-the-art architectures on five highly competitive object recognition benchmark tasks. Check out the Torch implementation here [GitHub].
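
As referenced in the ResNet entry above, here is a minimal residual-block sketch (PyTorch): two 3x3 convolutions with batch normalization, plus an identity "skip" connection added back before the final nonlinearity.

```python
# Basic residual block (sketch of the ResNet building block).
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)     # skip connection
```
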
VGGNET IN DETAIL.

Let's break down the VGGNet in more detail as a case study.  |  local copy

[ ... SNIP! ... ]

• Mishkin D (2016) Systematic evaluation of CNN advances on the ImageNet. arXiv:1606.02228  |  GitHub  |  reddit

• The paper systematically studies the impact of a range of recent advances in CNN architectures and learning methods on the object categorization (ILSVRC) problem. The evaluation tests the influence of the following choices of the architecture: non-linearity (ReLU, ELU, maxout, compatibility with batch normalization), pooling variants (stochastic, max, average, mixed), network width, classifier design (convolutional, fully-connected, SPP), image pre-processing, and of learning parameters: learning rate, batch size, cleanliness of the data, etc.

The performance gains of the proposed modifications are first tested individually and then in combination. The sum of individual gains is bigger than the observed improvement when all modifications are introduced, but the "deficit" is small suggesting independence of their benefits. We show that the use of 128x128 pixel images is sufficient to make qualitative conclusions about optimal network structure that hold for the full size Caffe and VGG nets. The results are obtained an order of magnitude faster than with the standard 224 pixel images.

[CNN-Image Classification] Papers:

• Krizhevsky A [Hinton G] (2009) [CNN technical report] Learning multiple layers of features from tiny images. refer here.

• Lin K (2016) Learning Compact Binary Descriptors with Unsupervised Deep Neural Networks. pdf  |  GitHub

• In this paper, we propose a new unsupervised deep learning approach called DeepBit to learn compact binary descriptor for efficient visual object matching. Unlike most existing binary descriptors which were designed with random projections or linear hash functions, we develop a deep neural network to learn binary descriptors in an unsupervised manner. We enforce three criteria on binary codes which are learned at the top layer of our network: (1) minimal loss quantization, (2) evenly distributed codes and (3) uncorrelated bits. Then, we learn the parameters of the networks with a back-propagation technique. Experimental results on three different visual analysis tasks including image matching, image retrieval, and object recognition clearly demonstrate the effectiveness of the proposed approach.

• Razavian S (2014) CNN features off-the-shelf: an astounding baseline for recognition. arXiv:1403.6382  |  reddit

• Recent results indicate that the generic descriptors extracted from the convolutional neural networks are very powerful. This paper adds to the mounting evidence that this is indeed the case. We report on a series of experiments conducted for different recognition tasks using the publicly available code and model of the OverFeat network which was trained to perform object classification on ILSVRC13. We use features extracted from the OverFeat network as a generic image representation to tackle the diverse range of recognition tasks of object image classification, scene recognition, fine grained recognition, attribute detection and image retrieval applied to a diverse set of datasets. We selected these tasks and datasets as they gradually move further away from the original task and data the OverFeat network was trained to solve. Astonishingly, we report consistent superior results compared to the highly tuned state-of-the-art systems in all the visual classification tasks on various datasets. For instance retrieval, it consistently outperforms low-memory-footprint methods except for the sculptures dataset. The results are achieved using a linear SVM classifier (or $\small L_2$ distance in case of retrieval) applied to a feature representation of size 4096 extracted from a layer in the net. The representations are further modified using simple augmentation techniques, e.g. jittering. The results strongly suggest that features obtained from deep learning with convolutional nets should be the primary candidate in most visual recognition tasks.
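
A hedged sketch (PyTorch + scikit-learn) of the "off-the-shelf" recipe: take the 4096-d activations of a pretrained network's penultimate layer as a generic image representation and train a linear SVM on top. The paper used OverFeat; a pretrained AlexNet from torchvision is used here purely as a stand-in.

```python
# Off-the-shelf CNN features + linear SVM (sketch).
import torch
from torchvision import models
from sklearn.svm import LinearSVC

net = models.alexnet(weights="IMAGENET1K_V1").eval()
feature_extractor = torch.nn.Sequential(
    net.features, net.avgpool, torch.nn.Flatten(),
    *list(net.classifier.children())[:-1],     # stop before the final 1000-way layer
)

def features(images):                          # images: (N, 3, 224, 224), ImageNet-normalized
    with torch.no_grad():
        return feature_extractor(images).numpy()   # (N, 4096)

# clf = LinearSVC(C=1.0).fit(features(train_images), train_labels)
```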

• Ren S (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv:1506.01497  |  GitHub

• State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features -- using the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

• Is there a better option than the sliding window method for object detection on images? [reddit]:

• Proposals + Fast R-CNN is kind of the standard now: Ren S (2015) Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv:1506.01497  |  GitHub

• You Only Look Once [YOLO] for multiple-object detection, at the Darknet site [Redmon J (2015) You only look once: Unified, real-time object detection. arXiv:1506.02640  |  YOLO website]

• Girshick R (2013) Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524

• Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also compare R-CNN to OverFeat, a recently proposed sliding-window detector based on a similar CNN architecture. We find that R-CNN outperforms OverFeat by a large margin on the 200-class ILSVRC2013 detection dataset.

• Girshick R (2015) Fast R-CNN. arXiv:1504.08083

• This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training and testing speed while also increasing detection accuracy. Fast R-CNN trains the very deep VGG16 network 9x faster than R-CNN, is 213x faster at test-time, and achieves a higher mAP on PASCAL VOC 2012. Compared to SPPnet, Fast R-CNN trains VGG16 3x faster, tests 10x faster, and is more accurate. Fast R-CNN is implemented in Python and C++ (using Caffe) and is available under the open-source MIT License here.

• Song HO [Stanford; MIT] (2015) Deep Metric Learning via Lifted Structured Feature Embedding. arXiv:1511.06452.  |  GitXiv

• Learning the distance metric between pairs of examples is of great importance for learning and visual recognition. With the remarkable success from the state of the art convolutional neural networks, recent works have shown promising results on discriminatively training the networks to learn semantic feature embeddings where similar examples are mapped close to each other and dissimilar examples are mapped farther apart. In this paper, we describe an algorithm for taking full advantage of the training batches in the neural network training by lifting the vector of pairwise distances within the batch to the matrix of pairwise distances. This step enables the algorithm to learn the state of the art feature embedding by optimizing a novel structured prediction objective on the lifted problem. Additionally, we collected Online Products dataset: 120k images of 23k classes of online products for metric learning. Our experiments on the CUB-200-2011, CARS196, and Online Products datasets demonstrate significant improvement over existing deep feature embedding methods on all experimented embedding sizes with the GoogLeNet network.

## CNN - OBJECT DETECTION: IMAGE ANALYSES; SEGMENTATION; DEBLURRING/SHARPENING; GENERATION ...

• CNN and image segmentation: see 'Image Segmentation', below; includes stacked models:

• stacked RF & deep CNN: Richmond DL (2015)  |  more detail, below

• stacked CRF + CNN: Zheng (2015).  |  more detail, below

• somewhat similar: Ioannou Y (2016), below: hybrid model between two state of the art classifiers: decision forests (DF) and CNN

[CNN:Image Processing] Blogs:

• Are Elephants Bigger than Butterflies? Reasoning about Sizes of Objects [GitXiv]

• Can NeuralTalk2 classify images into multiple captions, with their probabilities?  |  GitHub [Andrej Karpathy]  |  demo page [Stanford]  |  Hacker News

• GitHub:  "... RNN captions your images. Much faster/ better than the original NeuralTalk: this implementation is batched, uses Torch, runs on a GPU, and supports CNN finetuning. All of these together result in quite a large increase in training speed for the Language Model (~100x), but overall not as much because we also have to forward a VGGNet. However, overall very good models can be trained in 2-3 days, and they show a much better performance. ..."

• CRF as RNN: Semantic Image Segmentation Live Demo [University of Oxford Torr Vision Group]

• Excellent:  Peeking inside Convnets  |  reddit  |  [June 2016]: "There has been similar work on visualizing convolutional networks by e.g. Zeiler and Fergus [arXiv:1311.2901] and lately by Yosinski, Nguyen et al. In a recent work by Nguyen [arXiv:1602.03616], they manage to visualize features very well, based on a technique they called "mean-image initialization". Since I started writing this blog post, they've also published a new paper [arXiv:1605.09304, above; reddit; GitHub] using Generative Adversarial Networks as priors for the visualizations, which leads to far far better visualizations than the ones I've showed above. If you are interested, do take a look at their paper or the code they've released!"

• Related: Yosinski J (2015) Understanding neural networks through deep visualization. arXiv:1506.06579  |  GitHub

• Recent years have produced great advances in training large, deep neural networks (DNNs), including notable successes in training convolutional neural networks (convnets) to recognize natural images. However, our understanding of how these models work, especially what computations they perform at intermediate layers, has lagged behind. Progress in the field will be further accelerated by the development of better tools for visualizing and interpreting neural nets. We introduce two such tools here. The first is a tool that visualizes the activations produced on each layer of a trained convnet as it processes an image or video (e.g. a live webcam stream). We have found that looking at live activations that change in response to user input helps build valuable intuitions about how convnets work. The second tool enables visualizing features at each layer of a DNN via regularized optimization in image space. Because previous versions of this idea produced less recognizable images, here we introduce several new regularization methods that combine to produce qualitatively clearer, more interpretable visualizations. Both tools are open source and work on a pre-trained convnet with minimal setup.

• Related: Understanding Neural Networks Through Deep Visualization [blog post by arXiv:1506.06579 authors Jason Yosinski, Jeff Clune, Anh Nguyen et al.]  |  reddit

• Code for synthesizing images via deep generator networks [reddit]

• DeepDreaming with TensorFlow: This notebook demonstrates a number of Convolutional Neural Network image generation techniques implemented with TensorFlow for fun and science:

• visualize individual feature channels and their combinations to explore the space of patterns learned by the neural network (see GoogLeNet and VGG16 galleries)

• embed TensorBoard graph visualizations into Jupyter notebooks

• produce high-resolution images with tiled computation (example)

• use Laplacian Pyramid Gradient Normalization to produce smooth and colorful visuals at low cost

• generate DeepDream-like images with TensorFlow (DogSlugs included)

• The network under examination is the GoogLeNet architecture, trained to classify images into one of 1000 categories of the ImageNet dataset. It consists of a set of layers that apply a sequence of transformations to the input image. The parameters of these transformations were determined during the training process by a variant of gradient descent algorithm. The internal image representations may seem obscure, but it is possible to visualize and interpret them. In this notebook we are going to present a few tricks that allow to make these visualizations both efficient to generate and even beautiful. Impatient readers can start with exploring the full galleries of images generated by the method described here for GoogLeNet and VGG16 architectures.

• To Get Truly Smart, AI Might Need to Play More Video Games

• The latest computer games can be fantastically realistic ... these lifelike virtual worlds might have some educational value, too - especially for fledgling AI algorithms. ... cutting-edge AI algorithms need to feed on huge quantities of data in order to learn to perform a task. ... Gaidon and colleagues used a popular game development engine, called Unity, to generate virtual scenes for training deep-learning algorithms - a very large type of simulated neural network - to recognize objects and situations in real images. Unity is widely used to make 3-D video games, and many common objects are available to developers to use in their creations.

• What I learned from competing against a ConvNet on ImageNet [Andrej Karpathy]  |  see also 'Section 6.4: Human accuracy on large-scale image classification' in arXiv:1409.0575

[CNN:Image Processing] Papers:

• Ali Eslami SM [GE Hinton | Google DeepMind] (2016) Attend, Infer, Repeat: Fast Scene Understanding with Generative Models. arXiv:1603.08575  |  NIPS Proceedings  |  reddit  |  reddit

• We present a framework for efficient inference in structured image models that explicitly reason about objects. We achieve this by performing probabilistic inference using a recurrent neural network that attends to scene elements and processes them one at a time. Crucially, the model itself learns to choose the appropriate number of inference steps. We use this scheme to learn to perform inference in partially specified 2D models (variable-sized variational auto-encoders) and fully specified 3D models (probabilistic renderers). We show that such models learn to identify multiple objects - counting, locating and classifying the elements of a scene - without any supervision, e.g., decomposing 3D images with various numbers of objects in a single forward pass of a neural network. We further show that the networks produce accurate inferences when compared to supervised counterparts, and that their structure leads to improved generalization.

• Related: Graves A [Google DeepMind] (2016) Adaptive Computation Time for Recurrent Neural Networks. arXiv:1603.08983  |  reddit

• This paper introduces Adaptive Computation Time (ACT), an algorithm that allows recurrent neural networks to learn how many computational steps to take between receiving an input and emitting an output. ACT requires minimal changes to the network architecture, is deterministic and differentiable, and does not add any noise to the parameter gradients. Experimental results are provided for four synthetic problems: determining the parity of binary vectors, applying sequences of binary logic operations, adding sequences of integers, and sorting sequences of real numbers. Overall performance is dramatically improved by the use of ACT, which successfully adapts the number of computational steps to the requirements of the problem. When applied to character-level language modelling on the Hutter prize Wikipedia dataset, ACT yields intriguing insight into the structure of the data, with more computation allocated to harder-to-predict transitions, such as spaces between words and ends of sentences. This suggests that ACT or other adaptive computation methods could be used to infer segment boundaries in sequence data.

• reddit: Funny how both this paper and this one posted the day before both attack the problem of varying number of computation steps.

• Badrinarayanan V (2015) SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561  |  GitHub  |  GitXiv  |  project page  |  reddit

• We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN and also with the well known DeepLab-LargeFOV, DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. We show that SegNet provides good performance with competitive inference time and more efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo here.
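
A hedged sketch (PyTorch) of the SegNet decoder trick described above: the encoder keeps the max-pooling indices and the decoder reuses them for non-linear upsampling, so no upsampling filters need to be learned.

```python
# Max-pooling with stored indices, and index-based unpooling in the decoder.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 64, 32, 32)          # an encoder feature map
pooled, indices = pool(x)               # (1, 64, 16, 16) plus the argmax locations
upsampled = unpool(pooled, indices)     # (1, 64, 32, 32): sparse map with values placed
                                        # back where the maxima came from; a trainable
                                        # convolution then densifies it
```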

• Courbariaux M [Bengio Y] (2016) Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1. arXiv:1602.02830

• Dai J (2016 | NIPS 2016) R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv:1605.06409  |  GitHub  |  GitHub  [py-R-FCN]  |  GitXiv  |  GitHub  [R-FCN]  |  reddit

• We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt fully convolutional image classifier backbones, such as the latest Residual Networks (ResNets), for object detection. We show competitive results on the PASCAL VOC datasets (e.g., 83.6% mAP on the 2007 set) with the 101-layer ResNet. Meanwhile, our result is achieved at a test-time speed of 170ms per image, 2.5-20x faster than the Faster R-CNN counterpart. Code will be made publicly available.

• Das A [Facebook AI Research] (2016) Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? arXiv:1606.03556

• We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). We find that depending on the implementation used, machine-generated attention maps are either negatively correlated with human attention or have positive correlation worse than task-independent saliency. Overall, our experiments paint a bleak picture for the current generation of attention models in VQA.

• Dosovitskiy A (2015) Inverting Visual Representations with Convolutional Networks. arXiv:1506.02753  |  reddit: " There is some code available with pretrained models"

• Feature representations, both hand-designed and learned ones, are often hard to analyze and interpret, even when they are extracted from visual data. We propose a new approach to study image representations by inverting them with an up-convolutional neural network. We apply the method to shallow representations (HOG, SIFT, LBP), as well as to deep networks. For shallow representations our approach provides significantly better reconstructions than existing methods, revealing that there is surprisingly rich information contained in these features when combined with a strong prior. Inverting a deep network trained on ImageNet provides several insights into the properties of the feature representation learned by the network. Most strikingly, the colors and the rough contours of an image can be reconstructed from activations in higher network layers and even from the predicted class probabilities.

• Fawzi A (2015) Analysis of classifiers' robustness to adversarial perturbations. arXiv:1502.02590

• The goal of this paper is to analyze an intriguing phenomenon recently discovered in deep networks, namely their instability to adversarial perturbations (Szegedy et al., 2014). We provide a theoretical framework for analyzing the robustness of classifiers to adversarial perturbations, and show fundamental upper bounds on the robustness of classifiers. Specifically, we establish a general upper bound on the robustness of classifiers to adversarial perturbations, and then illustrate the obtained upper bound on the families of linear and quadratic classifiers. In both cases, our upper bound depends on a distinguishability measure that captures the notion of difficulty of the classification task. Our results for both classes imply that in tasks involving small distinguishability, no classifier in the considered set will be robust to adversarial perturbations, even if a good accuracy is achieved. Our theoretical framework moreover suggests that the phenomenon of adversarial instability is due to the low flexibility of classifiers, compared to the difficulty of the classification task (captured by the distinguishability).

Moreover, we show the existence of a clear distinction between the robustness of a classifier to random noise and its robustness to adversarial perturbations. Specifically, the former is shown to be larger than the latter by a factor that is proportional to $\sqrt{d}$ (with $d$ being the signal dimension) for linear classifiers. This result gives a theoretical explanation for the discrepancy between the two robustness properties in high dimensional problems, which was empirically observed in the context of neural networks. To the best of our knowledge, our results provide the first theoretical work that addresses the phenomenon of adversarial instability recently observed for deep networks. Our analysis is complemented by experimental results on controlled and real-world data.

• See also: Fawzi A (2016) Robustness of classifiers: from adversarial to random noise. arXiv:1608.08967  [immediately following]

• Fawzi A (2016) Robustness of classifiers: from adversarial to random noise. arXiv:1608.08967

• Several recent works have shown that state-of-the-art classifiers are vulnerable to worst-case (i.e., adversarial) perturbations of the datapoints. On the other hand, it has been empirically observed that these same classifiers are relatively robust to random noise. In this paper, we propose to study a semi-random noise regime that generalizes both the random and worst-case noise regimes. We propose the first quantitative analysis of the robustness of nonlinear classifiers in this general noise regime. We establish precise theoretical bounds on the robustness of classifiers in this general regime, which depend on the curvature of the classifier's decision boundary. Our bounds confirm and quantify the empirical observations that classifiers satisfying curvature constraints are robust to random noise. Moreover, we quantify the robustness of classifiers in terms of the subspace dimension in the semi-random noise regime, and show that our bounds remarkably interpolate between the worst-case and random noise regimes. We perform experiments and show that the derived bounds provide very accurate estimates when applied to various state-of-the-art deep neural networks and datasets. This result suggests bounds on the curvature of the classifiers' decision boundaries that we support experimentally, and more generally offers important insights onto the geometry of high dimensional classification problems.

• Gidaris S [France: Paris] (2016) Attend refine repeat: Active box proposal generation via in-out localization. arXiv:1606.04446  |  GitHub  |  GitXiv

• The problem of computing category agnostic bounding box proposals is utilized as a core component in many computer vision tasks and thus has lately attracted a lot of attention. In this work we propose a new approach to tackle this problem that is based on an active strategy for generating box proposals that starts from a set of seed boxes, which are uniformly distributed on the image, and then progressively moves its attention on the promising image areas where it is more likely to discover well localized bounding box proposals. We call our approach AttractioNet and a core component of it is a CNN-based category agnostic object location refinement module that is capable of yielding accurate and robust bounding box predictions regardless of the object category.

We extensively evaluate our AttractioNet approach on several image datasets (i.e. COCO, PASCAL, ImageNet detection and NYU-Depth V2 datasets) reporting on all of them state-of-the-art results that surpass the previous work in the field by a significant margin and also providing strong empirical evidence that our approach is capable to generalize to unseen categories. Furthermore, we evaluate our AttractioNet proposals in the context of the object detection task using a VGG16-Net based detector and the achieved detection performance on COCO manages to significantly surpass all other VGG16-Net based detectors while even being competitive with a heavily tuned ResNet-101 based detector. Code as well as box proposals computed for several datasets are available at GitHub.

• Goodfellow IJ [Christian Szegedy] (2014) Explaining and harnessing adversarial examples. arXiv:1412.6572 [Google]  |  adversarial examples  |  Adversary examples - do we actually care?   [reddit]  |  [reddit discussion, including author Ian Goodfellow]   Goodfellow says Neural Nets are Linear?!!

• Several machine learning models, including neural networks, consistently misclassify adversarial examples -- inputs formed by applying small but intentionally worst-case perturbations to examples from the dataset, such that the perturbed input results in the model outputting an incorrect answer with high confidence. Early attempts at explaining this phenomenon focused on nonlinearity and overfitting. We argue instead that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature. This explanation is supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets. Moreover, this view yields a simple and fast method of generating adversarial examples. Using this approach to provide examples for adversarial training, we reduce the test set error of a maxout network on the MNIST dataset.
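
A minimal sketch (PyTorch) of the fast gradient sign method (FGSM) proposed in this paper: perturb the input in the direction of the sign of the loss gradient, with step size epsilon.

```python
# Fast gradient sign method for generating adversarial examples.
import torch
import torch.nn.functional as F

def fgsm(model, x, y, epsilon=0.01):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()      # worst-case perturbation of size epsilon
    return x_adv.clamp(0.0, 1.0).detach()    # keep the result a valid image
```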

• Held D (2016) Learning to Track at 100 FPS with Deep Regression Networks. arXiv:1604.01802  |  reddit

• Machine learning techniques are often used in computer vision due to their ability to leverage large amounts of training data to improve performance. Unfortunately, most generic object trackers are still trained from scratch online and do not benefit from the large number of videos that are readily available for offline training. We propose a method for offline training of neural networks that can track novel objects at test-time at 100 fps. Our tracker is significantly faster than previous methods that use neural networks for tracking, which are typically very slow to run and not practical for real-time applications. Our tracker uses a simple feed-forward network with no online training required. The tracker learns a generic relationship between object motion and appearance and can be used to track novel objects that do not appear in the training set. We test our network on a standard tracking benchmark to demonstrate our tracker's state-of-the-art performance. Further, our performance improves as we add more videos to our offline training set. To the best of our knowledge, our tracker is the first neural-network tracker that learns to track generic objects at 100 fps.

• Huang J [Kevin Murphy | Google Research] (2016) Speed/accuracy trade-offs for modern convolutional object detectors. arXiv:1611.10012  |  Google Research's COCO detection winning solution  [reddit]

• In this paper, we study the trade-off between accuracy and speed when building an object detection system based on convolutional neural networks. We consider three main families of detectors -- Faster R-CNN, R-FCN and SSD -- which we view as "meta-architectures". Each of these can be combined with different kinds of feature extractors, such as VGG, Inception or ResNet. In addition, we can vary other parameters, such as the image resolution, and the number of box proposals. We develop a unified framework (in Tensorflow) that enables us to perform a fair comparison between all of these variants. We analyze the performance of many different previously published model combinations, as well as some novel ones, and thus identify a set of models which achieve different points on the speed-accuracy tradeoff curve, ranging from fast models, suitable for use on a mobile phone, to a much slower model that achieves a new state of the art on the COCO detection challenge.

• Update (Jun 2017): pretrained models released!  Supercharge your Computer Vision models with the TensorFlow Object Detection API  |  GitHub  |  reddit

• Huh M (2016) What makes ImageNet good for transfer learning? arXiv:1608.08614

• The tremendous success of ImageNet-trained deep features on a wide range of transfer tasks begs the question: what are the properties of the ImageNet dataset that are critical for learning good, general-purpose features? This work provides an empirical investigation of various facets of this question: Is more pre-training data always better? How does feature quality depend on the number of training examples per class? Does adding more object classes improve performance? For the same data budget, how should the data be split into classes? Is fine-grained recognition necessary for learning good features? Given the same number of training classes, is it better to have coarse classes or fine-grained classes? Which is better: more classes or more examples per class? To answer these and related questions, we pre-trained CNN features on various subsets of the ImageNet dataset and evaluated transfer performance on PASCAL detection, PASCAL action classification, and SUN scene classification tasks. Our overall findings suggest that most changes in the choice of pre-training data long thought to be critical do not significantly affect transfer performance.

• Im DJ (2016) Generating images with recurrent adversarial networks. arXiv:1602.05110  |  reddit  |  GitXiv  |  Generative Adversarial Networks Battling During Training reddit

• Gatys et al. (2015) showed that optimizing pixels to match features in a convolutional network with respect to reference image features is a way to render images of high visual quality. We show that unrolling this gradient-based optimization yields a recurrent computation that creates images by incrementally adding onto a visual "canvas". We propose a recurrent generative model inspired by this view, and show that it can be trained using adversarial training to generate very good image samples. We also propose a way to quantitatively compare adversarial networks by having the generators and discriminators of these networks compete against each other.

• Ioannou Y (2016) Decision Forests, Convolutional Networks and the Models in-Between. Microsoft Research Technical Report 2015-58. arXiv:1603.01250  |  decision forests, CNN classifiers, conditional networks

• Johnson J [Fei-Fei L; Stanford] (2016) Perceptual Losses for Real-Time Style Transfer and Super-Resolution. arXiv:1603.08155  |  GitHub  |  GitXiv  |  reddit  |  reddit  |  reddit

• We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al in real-time. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.
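
• The core idea here, measuring reconstruction error in the feature space of a fixed pretrained network rather than per pixel, can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the paper's exact configuration: the choice of VGG-16 layer (relu2_2) and the omission of the style and total-variation terms are simplifying assumptions.

```python
# Illustrative perceptual (feature reconstruction) loss in PyTorch.
# Not the paper's exact setup; relu2_2 of VGG-16 is an example layer choice,
# and inputs are assumed to be ImageNet-normalized image tensors.
import torch
import torch.nn as nn
from torchvision import models

class PerceptualLoss(nn.Module):
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(pretrained=True).features[:9].eval()  # up to relu2_2
        for p in vgg.parameters():
            p.requires_grad = False   # the loss network stays fixed
        self.vgg = vgg
        self.mse = nn.MSELoss()

    def forward(self, output_img, target_img):
        # compare feature maps instead of raw pixels
        return self.mse(self.vgg(output_img), self.vgg(target_img))

# usage: loss = PerceptualLoss()(generated_image, ground_truth_image)
```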

• mentioned here [reddit]

• Kim J (2015) Deeply-Recursive Convolutional Network for Image Super-Resolution. arXiv:1511.04491  |  CNN, amazing! super-high image resolution deblurring, sharpening  |  BB-8 Image Super-Resolved [reddit; image]  |  related: reddit  |  Super-Resolution Experimental Results 1 (CVPR 16) [YouTube: 01:52]   [reddit]  |  Super-Resolution Experimental Results 2 (CVPR 16) - [YouTube: 01:31]   [reddit]

• We propose an image super-resolution method (SR) using a deeply-recursive convolutional network (DRCN). Our network has a very deep recursive layer (up to 16 recursions). Increasing recursion depth can improve performance without introducing new parameters for additional convolutions. Albeit advantages, learning a DRCN is very hard with a standard gradient descent method due to exploding/vanishing gradients. To ease the difficulty of training, we propose two extensions: recursive-supervision and skip-connection. Our method outperforms previous methods by a large margin.
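
• The central trick, reusing one set of convolution weights across many recursions so that effective depth grows without adding parameters, can be illustrated as follows. This is a hedged sketch of that weight sharing only; the paper's full model also uses recursive supervision and a skip connection, which are omitted here.

```python
# Sketch of weight sharing across recursions (depth without extra parameters).
# Layer sizes and the single-channel (luminance) input are illustrative assumptions;
# recursive supervision and the skip connection from the paper are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecursiveConv(nn.Module):
    def __init__(self, channels=64, recursions=16):
        super().__init__()
        self.embed = nn.Conv2d(1, channels, 3, padding=1)           # embedding net
        self.shared = nn.Conv2d(channels, channels, 3, padding=1)   # ONE conv, reused
        self.reconstruct = nn.Conv2d(channels, 1, 3, padding=1)
        self.recursions = recursions

    def forward(self, x):
        h = F.relu(self.embed(x))
        for _ in range(self.recursions):    # same weights applied at every recursion
            h = F.relu(self.shared(h))
        return self.reconstruct(h)
```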

• Project page  |  CVPR 2016 paper [pdf]  |  YouTube 1  |  YouTube 2


• Laina I (2016) Deeper Depth Prediction with Fully Convolutional Residual Networks. arXiv:1606.00373  |  GitHub  |  GitXiv  |  reddit

• This paper addresses the problem of estimating the depth map of a scene given a single RGB image. To model the ambiguous mapping between monocular images and depth maps, we leverage on deep learning capabilities and present a fully convolutional architecture encompassing residual learning. The proposed model is deeper than the current state of the art, but contains fewer parameters and requires less training data, while still outperforming all current CNN approaches aimed at the same task. We further present a novel way to efficiently learn feature map up-sampling within the network. For optimization we introduce the reverse Huber loss, particularly suited for the task at hand and driven by the value distributions commonly present in depth maps. The predictions are given by a single architecture, trained end-to-end, that does not rely on post-processing techniques, such as CRFs or other additional refinement steps.
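
• The reverse Huber (berHu) loss mentioned above behaves like an L1 loss for small residuals and a scaled L2 loss for large ones. A minimal sketch follows; setting the threshold c to 20% of the largest absolute residual in the batch follows the commonly cited formulation and should be treated as an assumption about the exact setup.

```python
# Sketch of the reverse Huber (berHu) loss: L1 below a threshold c, quadratic above it.
# The choice c = 20% of the per-batch maximum residual is an assumption here.
import torch

def berhu_loss(pred, target, eps=1e-6):
    diff = torch.abs(pred - target)
    c = 0.2 * diff.max().detach() + eps            # threshold (assumed rule)
    l2 = (diff ** 2 + c ** 2) / (2.0 * c)          # quadratic branch
    return torch.where(diff <= c, diff, l2).mean()
```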

• Lake BM (2015) Deep Neural Nets Predict Category Typicality Ratings for Images [pdf]  |  images, 'typicality' (psychology: predicting human typicality ratings from raw naturalistic images; typicality ratings reflect the graded structure of concepts: people rate a Golden Retriever as a more typical "dog" than a hairless Chihuahua and a goldfish as a more typical "fish" than a shark.)

• Lee PS (2016) Viziometrics: Analyzing visual information in the scientific literature. arXiv:1605.04951  |  The First Visual Search Engine for Scientific Diagrams MIT Technology Review

• Scientific results are communicated visually in the literature through diagrams, visualizations, and photographs. These information-dense objects have been largely ignored in bibliometrics and scientometrics studies when compared to citations and text. In this paper, we use techniques from computer vision and machine learning to classify more than 8 million figures from PubMed into 5 figure types and study the resulting patterns of visual information as they relate to impact. We find that the distribution of figures and figure types in the literature has remained relatively constant over time, but can vary widely across field and topic. Remarkably, we find a significant correlation between scientific impact and the use of visual information, where higher impact papers tend to include more diagrams, and to a lesser extent more plots and photographs. To explore these results and other ways of extracting this visual information, we have built a visual browser to illustrate the concept and explore design alternatives for supporting viziometric analysis and organizing visual information. We use these results to articulate a new research agenda -- viziometrics -- to study the organization and presentation of visual information in the scientific literature.

• Luo Y (2015) Foveation-based Mechanisms Alleviate Adversarial Examples. arXiv:1511.06292  |  reddit

• We show that adversarial examples, i.e., the visually imperceptible perturbations that result in Convolutional Neural Networks (CNNs) fail, can be alleviated with a mechanism based on foveations - applying the CNN in different image regions. To see this, first, we report results in ImageNet that lead to a revision of the hypothesis that adversarial perturbations are a consequence of CNNs acting as a linear classifier: CNNs act locally linearly to changes in the image regions with objects recognized by the CNN, and in other regions the CNN may act non-linearly. Then, we corroborate that when the neural responses are linear, applying the foveation mechanism to the adversarial example tends to significantly reduce the effect of the perturbation. This is because, hypothetically, the CNNs for ImageNet are robust to changes of scale and translation of the object produced by the foveation, but this property does not generalize to transformations of the perturbation. As a result, the accuracy after a foveation is almost the same as the accuracy of the CNN without the adversarial perturbation, even if the adversarial perturbation is calculated taking into account a foveation.

• Mansimov E (2015) Generating Images from Captions with Attention. arXiv:1511.02793   ["text2image:" very impressive]  |  GitHub  |  GitXiv  |  reddit

• Motivated by the recent progress in generative models, we introduce a model that generates images from natural language descriptions. The proposed model iteratively draws patches on a canvas, while attending to the relevant words in the description. After training on Microsoft COCO, we compare our model with several baseline generative models on image generation and retrieval tasks. We demonstrate that our model produces higher quality samples than other approaches and generates images with novel scene compositions corresponding to previously unseen captions in the dataset.

• Noh H (2015) Image QA using CNN with Dynamic Parameter Prediction. arXiv:1511.05756  |  website  |  GitHub

• We tackle image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on questions. For the adaptive parameter prediction, we employ a separate parameter prediction network, which consists of gated recurrent unit (GRU) taking a question as its input and a fully-connected layer generating a set of candidate weights as its output. However, it is challenging to construct a parameter prediction network for a large number of parameters in the fully-connected dynamic parameter layer of the CNN. We reduce the complexity of this problem by incorporating a hashing technique, where the candidate weights given by the parameter prediction network are selected using a predefined hash function to determine individual weights in the dynamic parameter layer. The proposed network -- joint network with the CNN for ImageQA and the parameter prediction network -- is trained end-to-end through back-propagation, where its weights are initialized using a pre-trained CNN and GRU. The proposed algorithm illustrates the state-of-the-art performance on all available public ImageQA benchmarks.
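
• The hashing idea can be made concrete with a small sketch: the question branch predicts a modest pool of candidate weights, and a fixed, predefined hash maps every position of the much larger dynamic fully-connected weight matrix to one of those candidates. The shapes and the random hash below are illustrative assumptions, not the paper's configuration.

```python
# Sketch of the hashing trick for a dynamic fully-connected layer: a small pool of
# predicted candidate weights is expanded into a large weight matrix via a fixed hash.
# Dimensions and the hash function are illustrative assumptions.
import numpy as np

rng = np.random.RandomState(0)
in_dim, out_dim, n_candidates = 2048, 1000, 256

# fixed, predefined hash: one candidate index per weight position
hash_idx = rng.randint(0, n_candidates, size=(out_dim, in_dim))

def dynamic_fc(features, candidate_weights):
    """features: (in_dim,); candidate_weights: (n_candidates,) from the question branch."""
    W = candidate_weights[hash_idx]   # expand the pool into the full weight matrix
    return W @ features

candidates = rng.randn(n_candidates)   # stand-in for the predicted candidate weights
out = dynamic_fc(rng.randn(in_dim), candidates)
print(out.shape)                       # (1000,)
```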

• Paszke A (2016) ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv:1606.02147

• Radford A (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv:1511.06434  |  GitHub  |  reddit  |  reddit  |  reddit

• Ramanathan V [Fei-Fei L] (2015) Detecting events and key actors in multi-person videos. arXiv:1511.02917

• Multi-person event recognition is a challenging task, often with many people active in the scene but only a small subset contributing to an actual event. In this paper, we propose a model which learns to detect events in such videos while automatically "attending" to the people responsible for the event. Our model does not use explicit annotations regarding who or where those people are during training and testing. In particular, we track people in videos and use a recurrent neural network (RNN) to represent the track features. We learn time-varying attention weights to combine these features at each time-instant. The attended features are then processed using another RNN for event detection/classification. Since most video datasets with multiple people are restricted to a small number of videos, we also collected a new basketball dataset comprising 257 basketball games with 14K event annotations corresponding to 11 event classes. Our model outperforms state-of-the-art methods for both event classification and detection on this new dataset. Additionally, we show that the attention mechanism is able to consistently localize the relevant players.
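
• The attention step can be sketched generically: score each player's track feature at a given time instant, softmax over players, and take the weighted sum. The sketch below shows only this mechanism; the paper's actual parameterization (RNN track features feeding a second event RNN) is summarized in the abstract above and not reproduced here.

```python
# Generic sketch of attending over per-player track features at one time step:
# score each player, softmax over players, weighted sum. Feature size is illustrative.
import torch
import torch.nn.functional as F

def attend_over_players(track_feats, scorer):
    """track_feats: (num_players, feat_dim); scorer: module mapping feat_dim -> 1."""
    scores = scorer(track_feats).squeeze(-1)                      # (num_players,)
    weights = F.softmax(scores, dim=0)                            # attention over players
    attended = (weights.unsqueeze(-1) * track_feats).sum(dim=0)   # (feat_dim,)
    return attended, weights

scorer = torch.nn.Linear(128, 1)
feats = torch.randn(5, 128)            # 5 tracked players, 128-d features
attended, w = attend_over_players(feats, scorer)
```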

• YOLO:   Redmon J (2015) You only look once: Unified, real-time object detection. arXiv:1506.02640  |  high performance! image recognition/processing  |  YOLO website  |  mentioned here (reddit)  |  GitHub  |  GitXiv  |  Why is the last layer in YOLO a fully connected layer? [reddit]

• We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is far less likely to predict false detections where nothing exists. Finally, YOLO learns very general representations of objects. It outperforms all other detection methods, including DPM and R-CNN, by a wide margin when generalizing from natural images to artwork on both the Picasso Dataset and the People-Art Dataset.
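
• To make the "detection as regression" framing concrete, the sketch below decodes a YOLO-style output tensor: an S x S grid where each cell predicts B boxes (x, y, w, h, confidence) plus C class probabilities, with S = 7, B = 2, C = 20 as in the original model. The decoding is schematic (the in-cell layout, coordinate conversion, and non-maximum suppression are simplified or only noted in comments), so treat it as an illustration rather than a reference implementation.

```python
# Schematic decoding of a YOLO (v1)-style output tensor of shape S x S x (B*5 + C).
# Layout assumed here: B boxes first, then class probabilities; thresholds are arbitrary.
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)   # stand-in for one network output

detections = []
for i in range(S):
    for j in range(S):
        cell = pred[i, j]
        class_probs = cell[B * 5:]                       # (C,)
        for b in range(B):
            x, y, w, h, conf = cell[b * 5: b * 5 + 5]
            cls = int(np.argmax(class_probs))
            score = conf * class_probs[cls]              # class-specific confidence
            if score > 0.2:                              # arbitrary threshold
                detections.append((i, j, x, y, w, h, cls, float(score)))
# a real pipeline converts (x, y, w, h) to image coordinates and applies NMS
```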

• Implementation:  He Y (2016) Object Detection with YOLO on Artwork Dataset. GitHub: unpublished paper [pdf; local copy]  |  Objects-Detection-with-YOLO-on-Artwork-Dataset [GitHub]

• I design a small object detection network, which is simplified from the YOLO (You Only Look Once) network. YOLO is a fast and elegant network that can extract meta features, predict bounding boxes and assign scores to bounding boxes. Compared with RCNN, it doesn't have a complex pipeline, which made it easier for me to implement. Starting from an ImageNet pretrained model, I train my YOLO on the PASCAL VOC2007 training dataset, and validate my YOLO on the PASCAL VOC2007 validation dataset. Finally, I evaluate my YOLO on an artwork dataset (Picasso dataset). With the best parameters, I got 40% precision and 35% recall.

• Shafiee MJ [U Waterloo] (2017) Fast YOLO: A Fast You Only Look Once System for Real-time Embedded Object Detection in Video. arXiv:1709.05943

• [v1] Object detection is considered one of the most challenging problems in this field of computer vision, as it involves the combination of object classification and object localization within a scene. Recently, deep neural networks (DNNs) have been demonstrated to achieve superior object detection performance compared to other approaches, with YOLOv2 (an improved You Only Look Once model) being one of the state-of-the-art in DNN-based object detection methods in terms of both speed and accuracy. Although YOLOv2 can achieve real-time performance on a powerful GPU, it still remains very challenging for leveraging this approach for real-time object detection in video on embedded computing devices with limited computational power and limited memory. In this paper, we propose a new framework called Fast YOLO, a fast You Only Look Once framework which accelerates YOLOv2 to be able to perform object detection in video on embedded devices in a real-time manner. First, we leverage the evolutionary deep intelligence framework to evolve the YOLOv2 network architecture and produce an optimized architecture (referred to as O-YOLOv2 here) that has 2.8X fewer parameters with just a ~2% IOU drop. To further reduce power consumption on embedded devices while maintaining performance, a motion-adaptive inference method is introduced into the proposed Fast YOLO framework to reduce the frequency of deep inference with O-YOLOv2 based on temporal motion characteristics. Experimental results show that the proposed Fast YOLO framework can reduce the number of deep inferences by an average of 38.13%, and an average speedup of ~3.3X for object detection in video compared to the original YOLOv2, leading Fast YOLO to run an average of ~18FPS on a Nvidia Jetson TX1 embedded system.

• Similar:  Liu W [Szegedy C | Google] (2015) SSD: Single Shot MultiBox Detector. arXiv:1512.02325  |  GitHub  |  GitXiv  |  slides  [local copy]

• We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stage and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, MS COCO, and ILSVRC datasets confirm that SSD has comparable accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. Compared to other single stage methods, SSD has much better accuracy, even with a smaller input image size. For $\small 300 × 300$ input, SSD achieves 72.1% mAP on VOC2007 test at 58 FPS on a Nvidia Titan X and for $\small 500 × 500$ input, SSD achieves 75.1% mAP, outperforming a comparable state of the art Faster R-CNN model. Code is available here [GitHub].
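
• The "default boxes over different aspect ratios and scales per feature map location" can be sketched directly: for every cell of a feature map, emit one box per aspect ratio, with width s·sqrt(ar) and height s/sqrt(ar). The scale, aspect ratios, and single feature map below are illustrative assumptions; SSD tiles several feature maps of different resolutions.

```python
# Sketch of SSD-style default (anchor) boxes for one feature map: at every location,
# one box per aspect ratio at a fixed scale. Coordinates are normalized to [0, 1];
# the scale and aspect ratios are illustrative, not the paper's exact settings.
import numpy as np

def default_boxes(fm_size=8, scale=0.2, aspect_ratios=(1.0, 2.0, 0.5)):
    boxes = []
    for i in range(fm_size):
        for j in range(fm_size):
            cx, cy = (j + 0.5) / fm_size, (i + 0.5) / fm_size   # box center
            for ar in aspect_ratios:
                w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)
                boxes.append([cx, cy, w, h])
    return np.array(boxes)     # (fm_size * fm_size * len(aspect_ratios), 4)

print(default_boxes().shape)   # (192, 4)
```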

• Similar:  Teradeep general object classifier [reddit]  |  YouTube [Apr 2015]: This is what a state-of-art neural network for robotic vision can do: 1000 categories, 10 M images with full caption  |  GitHub

• Problems with You Only Look Once: Real-Time Object Detection: "... my question is this: what am I doing wrong that's causing my YOLO training to output the same exact thing regardless of my test input? I'm so confused..."

• There is a mailing list here; you might have better luck there: Darknet [Google Groups]  →  here is the thread

• Related:  You Only Look Twice -  Multi-Scale Object Detection in Satellite Imagery With Convolutional Neural Networks (Part I)  |  reddit

• Gupta A [University of Oxford] (CVPR 2016) Synthetic Data for Text Localisation in Natural Images. arXiv:1604.06646  |  GitHub  |  project page: VGG SynthText in the Wild

• In this paper we introduce a new method for text detection in natural images. The method comprises two contributions: First, a fast and scalable engine to generate synthetic images of text in clutter. This engine overlays synthetic text to existing background images in a natural way, accounting for the local 3D scene geometry. Second, we use the synthetic images to train a Fully-Convolutional Regression Network (FCRN) which efficiently performs text detection and bounding-box regression at all locations and multiple scales in an image. We discuss the relation of FCRN to the recently-introduced YOLO detector, as well as other end-to-end object detection systems based on deep learning. The resulting detection network significantly outperforms current methods for text detection in natural images, achieving an F-measure of 84.2% on the standard ICDAR 2013 benchmark. Furthermore, it can process 15 images per second on a GPU.

• Conclusion. We have developed a new CNN architecture for generating text proposals in images. It would not have been possible to train this architecture on the available annotated datasets, as they contain far too few samples, but we have shown that training images of sufficient verisimilitude can be generated synthetically, and that the CNN trained only on these images exceeds the state-of-the-art performance for both detection and end-to-end text spotting on real images.

• Dataset: SynthText in the Wild Dataset  [zip file, 41 GB].  This is a synthetically generated dataset, in which word instances are placed in natural scene images, while taking into account the scene layout. The dataset consists of 800 thousand images with approximately 8 million synthetic word instances. Each text instance is annotated with its text-string, word-level and character-level bounding-boxes.

• This reddit thread, Why is the last layer in YOLO a fully connected layer?, discusses bounding boxes in YOLO and this [arXiv:1604.06646] paper ...

• Related:  Redmon J (2016) YOLO9000: Better, Faster, Stronger. arXiv:1612.08242  |  pjreddie.com/yolo9000/  |  GitHub  |  reddit  |  reddit

• We introduce YOLO9000, a state-of-the-art, real-time object detection system that can detect over 9000 object categories. First we propose various improvements to the YOLO detection method, both novel and drawn from prior work. The improved model, YOLOv2, is state-of-the-art on standard detection tasks like PASCAL VOC and COCO. At 67 FPS, YOLOv2 gets 76.8 mAP on VOC 2007. At 40 FPS, YOLOv2 gets 78.6 mAP, outperforming state-of-the-art methods like Faster RCNN with ResNet and SSD while still running significantly faster. Finally we propose a method to jointly train on object detection and classification. Using this method we train YOLO9000 simultaneously on the COCO detection dataset and the ImageNet classification dataset. Our joint training allows YOLO9000 to predict detections for object classes that don't have labelled detection data. We validate our approach on the ImageNet detection task. YOLO9000 gets 19.7 mAP on the ImageNet detection validation set despite only having detection data for 44 of the 200 classes. On the 156 classes not in COCO, YOLO9000 gets 16.0 mAP. But YOLO can detect more than just 200 classes; it predicts detections for more than 9000 different object categories. And it still runs in real-time.


• See also (image processing architectures, speed tests ...):  Huang J [Kevin Murphy | Google Research] (2016) Speed/accuracy trade-offs for modern convolutional object detectors. arXiv:1611.10012

• Related:  Ghosh T (2017) QuickNet: Maximizing Efficiency and Efficacy in Deep Architectures. arXiv:1701.02291  |  reddit

• We present QuickNet, a fast and accurate network architecture that is both faster and significantly more accurate than other fast deep architectures like SqueezeNet. Furthermore, it uses less parameters than previous networks, making it more memory efficient. We do this by making two major modifications to the reference Darknet model (Redmon et al, 2015): 1) The use of depthwise separable convolutions and 2) The use of parametric rectified linear units. We make the observation that parametric rectified linear units are computationally equivalent to leaky rectified linear units at test time and the observation that separable convolutions can be interpreted as a compressed Inception network (Chollet, 2016). Using these observations, we derive a network architecture, which we call QuickNet, that is both faster and more accurate than previous models. Our architecture provides at least four major advantages: (1) A smaller model size, which is more tenable on memory constrained systems; (2) A significantly faster network which is more tenable on computationally constrained systems; (3) A high accuracy of 95.7 percent on the CIFAR-10 Dataset which outperforms all but one result published so far, although we note that our works are orthogonal approaches and can be combined; (4) Orthogonality to previous model compression approaches allowing for further speed gains to be realized.
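
• The two modifications named in the abstract, depthwise separable convolutions and parametric ReLUs, can be sketched as a single PyTorch block. This is not the QuickNet architecture, just an illustration of the ingredients; channel sizes are arbitrary.

```python
# Sketch of a depthwise separable convolution (depthwise conv followed by a 1x1
# pointwise conv) with a PReLU activation. Channel sizes are illustrative;
# this block is not the QuickNet model itself.
import torch.nn as nn

class SeparableConvPReLU(nn.Module):
    def __init__(self, in_ch=32, out_ch=64):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.PReLU(out_ch)   # parametric ReLU; behaves like a leaky ReLU at test time

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))
```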

• Tang P (2016) Deep FisherNet for Object Classification. arXiv:1608.00182  |  reddit
• Despite the great success of convolutional neural networks (CNN) for the image classification task on datasets like Cifar and ImageNet, CNN's representation power is still somewhat limited in dealing with object images that have large variation in size and clutter, where Fisher Vector (FV) has shown to be an effective encoding strategy. FV encodes an image by aggregating local descriptors with a universal generative Gaussian Mixture Model (GMM). FV however has limited learning capability and its parameters are mostly fixed after constructing the codebook. To combine together the best of the two worlds, we propose in this paper a neural network structure with FV layer being part of an end-to-end trainable system that is differentiable; we name our network FisherNet that i