feature reconstructions at each layer (which also should be positive), we pass the
reconstructed signal through a relu non-linearity.¹
Filtering: The convnet uses learned filters to convolve the feature maps from
the previous layer. To approximately invert this, the deconvnet uses transposed
versions of the same filters (as in other autoencoder models, such as RBMs), but
applied to the rectified maps, not the output of the layer beneath. In practice
this means flipping each filter vertically and horizontally.
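As a concrete illustration, the following minimal NumPy/SciPy sketch (our own, not the paper's implementation) shows this filtering step for a single 2D feature map: the forward pass correlates the map with a filter f, and the reconstruction path correlates the rectified map with f flipped vertically and horizontally.

```python
import numpy as np
from scipy.signal import correlate2d

def flip_filter(f):
    """Flip a 2D filter vertically and horizontally, giving the
    transposed filter used on the deconvnet's reconstruction path."""
    return f[::-1, ::-1]

f = np.random.randn(3, 3)           # a learned filter (illustrative values)
x = np.random.randn(8, 8)           # a feature map from the layer below
y = correlate2d(x, f, mode='same')  # forward filtering in the convnet
# Deconvnet filtering step: rectify, then apply the flipped filter.
r = correlate2d(np.maximum(y, 0), flip_filter(f), mode='same')
```

Note this is a reconstruction, not an exact inverse: the flipped filter is the adjoint of the forward correlation, so it only approximately undoes the filtering.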
Note that we do not use any contrast normalization operations in this
reconstruction path. Projecting down from higher layers uses the switch settings
generated by the max pooling in the convnet on the way up. As these switch
settings are peculiar to a given input image, the reconstruction obtained from a
single activation thus resembles a small piece of the original input image, with
structures weighted according to their contribution toward the feature activation.
Since the model is trained discriminatively, these reconstructions implicitly show which
parts of the input image are discriminative. Note that these projections are not
samples from the model, since there is no generative process involved. The whole
procedure is similar to backpropping a single strong activation (rather than the
usual gradients), i.e. computing $\partial h / \partial X_n$, where $h$ is the element of the feature map
with the strong activation and $X_n$ is the input image. However, it differs in
that (i) the relu is imposed independently and (ii) contrast normalization
operations are not used. A general shortcoming of our approach is that it only
visualizes a single activation, not the joint activity present in a layer. Neverthe-
less, as we show in Fig. 6, these visualizations are accurate representations of
the input pattern that stimulates the given feature map in the model: when the
parts of the original input image corresponding to the pattern are occluded, we
see a distinct drop in activity within the feature map.
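To make the whole projection concrete, here is a single-channel sketch of one deconvnet layer under our own simplifying assumptions (2x2 pooling, one feature map, zero padding); the names max_pool_with_switches, unpool and project_down are hypothetical, not from the paper's code.

```python
import numpy as np
from scipy.signal import correlate2d

def max_pool_with_switches(x, k=2):
    """k x k max pooling that also records the argmax ("switch") locations."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k, 2), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            patch = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(np.argmax(patch), patch.shape)
            pooled[i, j] = patch[r, c]
            switches[i, j] = (i*k + r, j*k + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Place each pooled value back at its recorded switch location."""
    x = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            x[r, c] = pooled[i, j]
    return x

def project_down(activation, switches, shape, filt):
    """One deconvnet layer: unpool -> relu -> filter with the flipped filter."""
    x = unpool(activation, switches, shape)
    x = np.maximum(x, 0)                             # relu on the reconstruction
    return correlate2d(x, filt[::-1, ::-1], mode='same')
```

To visualize a single activation, every other entry of the pooled map would be set to zero before calling project_down, mirroring the procedure described above.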
3 Training Details
We now describe the large convnet model that will be visualized in Section 4.
The architecture, shown in Fig. 3, is similar to that used by Krizhevsky et al. [18]
for ImageNet classification. One difference is that the sparse connections used
in Krizhevsky’s layers 3,4,5 (due to the model being split across 2 GPUs) are
replaced with dense connections in our model. Other important differences re-
lating to layers 1 and 2 were made following inspection of the visualizations in
Fig. 5, as described in Section 4.1.
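For illustration only, the sparse-versus-dense distinction can be expressed with the groups argument of a modern convolution layer (PyTorch here; the channel sizes are illustrative, not the paper's exact configuration):

```python
import torch.nn as nn

# Krizhevsky et al. split layers 3-5 across 2 GPUs, which is equivalent to a
# grouped convolution (groups=2): each output map sees only half the input maps.
sparse_conv = nn.Conv2d(256, 384, kernel_size=3, padding=1, groups=2)

# The model visualized here uses dense connections instead:
# every output map sees all input maps.
dense_conv = nn.Conv2d(256, 384, kernel_size=3, padding=1)
```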
The model was trained on the ImageNet 2012 training set (1.3 million images,
spread over 1000 different classes) [6]. Each RGB image was preprocessed by resizing
the smallest dimension to 256, cropping the center 256x256 region, subtracting
the per-pixel mean (across all images) and then using 10 different sub-crops
of size 224x224 (the four corners plus the center, each with and without a horizontal flip).
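A minimal sketch of this preprocessing pipeline, assuming a precomputed 256x256x3 per-pixel mean image (mean_image) and using PIL/NumPy; the helper name preprocess and the exact rounding choices are ours:

```python
import numpy as np
from PIL import Image

def preprocess(path, mean_image):
    """Resize the shorter side to 256, centre-crop to 256x256, subtract the
    per-pixel mean, and return the 10 sub-crops of size 224x224."""
    img = Image.open(path).convert('RGB')
    w, h = img.size
    s = 256 / min(w, h)
    img = img.resize((round(w * s), round(h * s)))
    w, h = img.size
    left, top = (w - 256) // 2, (h - 256) // 2
    img = img.crop((left, top, left + 256, top + 256))
    x = np.asarray(img, dtype=np.float32) - mean_image  # mean_image: 256x256x3

    # Four corners + centre (offsets in a 256x256 image for 224x224 crops),
    # each with and without a horizontal flip.
    offsets = [(0, 0), (0, 32), (32, 0), (32, 32), (16, 16)]
    crops = [x[r:r+224, c:c+224] for r, c in offsets]
    crops += [crop[:, ::-1] for crop in crops]
    return np.stack(crops)  # shape (10, 224, 224, 3)
```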
Stochastic gradient descent with a mini-batch size of 128 was used to update
the parameters, starting with a learning rate of $10^{-2}$, in conjunction with a
momentum term of 0.9. We
¹ We also tried rectifying using the binary mask imposed by the feed-forward relu
operation, but the resulting visualizations were significantly less clear.