tl;dr: Training a network to classify images (with a single label) is modeled as a sequential decision problem where the actions are salient locations in the image and tentative labels. The state (the full image) is partially observed through a fixed-size subimage around each location. The policy takes the full history into account, compressed into a hidden vector via an RNN. REINFORCE is used to compute the policy gradient.
Although the paper targets several applications, to fix ideas, say we want to classify images with one label. These can vary in size but the number of parameters of the model will not change. Taking inspiration from how humans process images, the proposed model iteratively selects points in the image and focuses on local patches around them at different resolutions. The problem of choosing the locations and related classification labels is cast as a reinforcement learning problem.
Attention as a sequential decision problem
One begins by fixing one image x and selecting a number T of timesteps. At each timestep t=1,…,T:
- We are in some (observed) state st∈S: it consists of a location lt in the image (a pair of coordinates) and a corresponding glimpse xt of x. This glimpse is a concatenation of multiple subimages of x taken at different resolutions, centered at location lt, then resampled to the same size. How many there are (k), at what resolutions (ρ1,…,ρk) and to what fixed size (w) they are resampled are all hyperparameters (see the sketch after this list).1 The set of all past states is the history: s1:t−1:={s1,…,st−1}.
- We take an action at=(lt,yt)∈A, consisting of the new location lt in the (same) image x to look at and the current guess yt as to what the label of x is. As is typical in image classification with neural networks, yt is a vector of “probabilities” coming from a softmax layer. Analogously, the location lt is sampled from a distribution parametrized by the last layer of a network. Actions are taken according to a policy πtθ:St→P(A), where St=S×⋯×S (t copies) and P(A) is the set of all probability measures over A. The policy is implemented as a neural network, with θ collecting all of its internal parameters. The crucial point of the paper is that the network takes the whole history as input, compressed into a hidden state vector, i.e. the policy is implemented with a recurrent network. Because parameters are shared across all timesteps, we drop the superindex t and denote the policy’s output at timestep t by πθ(at|s1:t).
- We obtain a scalar reward rt∈{0,1}. The reward is 0 at every timestep but the last one (T), where it is 0 if yT predicts the wrong class and 1 if it predicts the right one.2
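To make the glimpse construction concrete, here is a minimal NumPy sketch of such a sensor. The function name `extract_glimpse`, the default hyperparameters and the nearest-neighbour resampling are illustrative choices of mine, and the border handling is simplified compared to the paper.

```python
import numpy as np

def extract_glimpse(image, loc, k=3, w=8, scale=2):
    """Concatenate k square patches centred at `loc`, of side w, w*scale, ...,
    each resampled (here: nearest neighbour) to a fixed w x w window."""
    H, W = image.shape
    patches = []
    for i in range(k):
        half = (w * scale**i) // 2
        # Clip the window to the image borders (a real implementation would pad).
        r0, r1 = max(0, loc[0] - half), min(H, loc[0] + half)
        c0, c1 = max(0, loc[1] - half), min(W, loc[1] + half)
        patch = image[r0:r1, c0:c1]
        # Nearest-neighbour resampling down to w x w.
        rows = np.linspace(0, patch.shape[0] - 1, w).astype(int)
        cols = np.linspace(0, patch.shape[1] - 1, w).astype(int)
        patches.append(patch[np.ix_(rows, cols)])
    return np.stack(patches)  # shape (k, w, w): the glimpse x_t

glimpse = extract_glimpse(np.random.rand(100, 100), loc=(40, 60))  # (3, 8, 8)
```

Only the location changes from step to step; the stacked patches are what the network sees of the image at timestep t.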

The model at timestep t.
Note that the policy πθ(at|s1:t) has two “heads”: a labeling network fy, which outputs a probability for the current glimpse belonging to each class, and a location network fl. Only the output of the latter directly influences the next state. This matters when computing the distribution over trajectories τ=(s1,a1,…,sT,aT) induced by the policy:
$$p_\theta(\tau) := p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_{1:t})\, p(s_{t+1} \mid s_t, a_t) = p(s_1) \prod_{t=1}^{T} p(l_t \mid f_l(s_{1:t}; \theta))\, p(s_{t+1} \mid s_t, l_t).$$
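As a rough PyTorch sketch of this two-headed architecture (the layer sizes, the GRU cell as the recurrent core and the fixed-variance Gaussian over locations are assumptions for illustration, not necessarily the paper’s exact choices):

```python
import torch
import torch.nn as nn

class RecurrentAttentionPolicy(nn.Module):
    """pi_theta: a recurrent core that compresses the history s_{1:t} into a
    hidden vector h, plus a labelling head f_y and a location head f_l."""

    def __init__(self, glimpse_dim, hidden_dim=256, n_classes=10, sigma=0.1):
        super().__init__()
        self.core = nn.GRUCell(glimpse_dim, hidden_dim)  # assumed recurrent core
        self.f_y = nn.Linear(hidden_dim, n_classes)      # labelling head
        self.f_l = nn.Linear(hidden_dim, 2)              # location head (mean of a Gaussian)
        self.sigma = sigma                               # fixed std of the location distribution

    def forward(self, glimpse, h):
        h = self.core(glimpse, h)                        # new hidden state (compressed history)
        class_logits = self.f_y(h)                       # y_t = softmax(class_logits)
        loc_mean = torch.tanh(self.f_l(h))               # keep the mean inside [-1, 1]^2
        loc_dist = torch.distributions.Normal(loc_mean, self.sigma)
        l_next = loc_dist.sample()                       # only this output feeds the next state
        log_prob = loc_dist.log_prob(l_next).sum(-1)     # log-probability of the sampled location
        return class_logits, l_next, log_prob, h
```

The class probabilities never enter the transition to the next state, which is why only the location factors appear in the trajectory distribution above.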
The goal is to maximise the total expected reward
$$J(\theta) := \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\sum_{t=1}^{T} r(s_t, a_t)\right].$$
The algorithm used is basically the policy gradient method with the REINFORCE rule:3
Algorithm 1:
- Initialise πθ with some random set of parameters.
- For n=1…N, pick some input image xn with label yn.
- Sample some random initial location l0.
- Run the policy (the recurrent network) πθ for T timesteps, creating new locations lt and labels yt. At the end collect the reward rT∈{0,1}.
- Compute the gradient of the expected reward, ∇θJ(θn).
- Update θn+1←θn+αn∇θJ(θn).
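A compact sketch of steps 2–6, assuming the policy and glimpse sketches above; `glimpse_to_features` is a hypothetical encoder mapping a raw glimpse and its location to the core’s input vector, and `reinforce_loss` is the estimator derived below.

```python
import torch

def run_episode(policy, image, T=6, hidden_dim=256):
    """Steps 3-4: sample l_0, then roll the recurrent policy out for T steps."""
    h = torch.zeros(1, hidden_dim)
    loc = torch.rand(1, 2) * 2 - 1                 # random initial location l_0 in [-1, 1]^2
    log_probs = []
    for _ in range(T):
        g = glimpse_to_features(image, loc)        # hypothetical glimpse encoder
        class_logits, loc, log_p, h = policy(g, h)
        log_probs.append(log_p)
    return class_logits, torch.stack(log_probs)    # only the final guess y_T is scored

def training_step(policy, optimizer, image, label):
    """Steps 2, 5 and 6 for a single image (no minibatching, for brevity)."""
    class_logits, log_probs = run_episode(policy, image)
    reward = (class_logits.argmax(dim=-1) == label).float()   # r_T in {0, 1}
    loss = reinforce_loss(log_probs, reward)       # surrogate whose gradient is -grad J
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```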
The difficulty lies in step 5, because the objective is an expectation over trajectories whose gradient cannot be computed analytically. One solution is to rewrite the gradient of the expectation as another expectation using a simple but clever substitution:
$$\nabla_\theta J(\theta) = \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, \mathrm{d}\tau = \int p_\theta(\tau)\, \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\, r(\tau)\, \mathrm{d}\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, \mathrm{d}\tau,$$
and this is
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\right].$$
In order to compute this integral we can now use Monte Carlo sampling:
$$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{m=1}^{M} \nabla_\theta \log p_\theta(\tau^m)\, r(\tau^m),$$
and after rewriting logpθ(τ) as a sum of logarithms and discarding the terms which do not depend on θ we obtain:
$$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a^m_t \mid s^m_{1:t})\, r^m,$$
where rm=rmT is the final reward (recall that in this application rt=0 for all t<T). In order to reduce the variance of this estimator it is standard to subtract a baseline estimate $b = \mathbb{E}_{\pi_\theta}[r_T]$ of the expected reward, thus arriving at the expression
$$\nabla_\theta J(\theta) \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a^m_t \mid s^m_{1:t})\, (r^m - b).$$
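In an automatic differentiation framework this estimator is usually implemented as a “surrogate” loss whose gradient coincides with the sample average above. A minimal sketch, where the baseline b is estimated by the mean reward of the M sampled episodes (one simple choice among many; learned baselines are also common):

```python
import torch

def reinforce_loss(log_probs, reward, baseline=0.0):
    """Surrogate for one episode: its gradient is
    -sum_t grad log pi(a_t | s_{1:t}) (r - b).  `log_probs` has shape (T, ...),
    `reward` is a tensor holding r_T."""
    advantage = (reward - baseline)
    return -(log_probs * advantage.detach()).sum()   # (r - b) is treated as a constant

def reinforce_loss_batch(log_probs_batch, rewards):
    """Average over M sampled episodes, with b estimated by the batch mean reward."""
    b = rewards.mean()
    losses = [reinforce_loss(lp, r, b) for lp, r in zip(log_probs_batch, rewards)]
    return torch.stack(losses).mean()
```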
There is a vast literature on the Monte Carlo approximation of policy gradients, as well as on techniques for variance reduction.4
Hybrid learning
Because in classification problems the labels are known at training time, one can provide the network with a better signal than just the reward at the end of the whole process. In this case the authors
optimize the cross entropy loss to train the [labeling] network fy and backpropagate the gradients through the core and glimpse networks. The location network fl is always trained with REINFORCE.
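A sketch of that hybrid objective, combining a cross-entropy term on the labelling head with the REINFORCE term from above for the location head; the equal weighting of the two terms is an assumption of mine, not something taken from the paper:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(class_logits, label, log_probs, baseline=0.0):
    """Cross entropy trains f_y (and, through backprop, the core and glimpse
    pathway); the REINFORCE term is the only signal for the sampled locations."""
    ce = F.cross_entropy(class_logits, label)                 # supervised signal for f_y
    reward = (class_logits.argmax(dim=-1) == label).float()   # r_T in {0, 1}
    reinforce = -(log_probs * (reward - baseline).detach()).sum()
    return ce + reinforce                                     # equal weighting: an assumption
```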
Results for image classification
An image is worth a thousand words:

- It would be nice to try letting a CNN capture the relevant features for us, instead of fixing the resolutions. I’m sure this has been tried since 2014.
- I wonder: wouldn’t it make more sense to let rt∈[0,1], for instance by using the cross entropy to the one-hot vector encoding the correct class?
- See e.g. Sutton and Barto, Reinforcement Learning: An Introduction (2018), §13.3.
- Again, see Sutton and Barto, Reinforcement Learning: An Introduction (2018).