feature reconstructions at each layer (which also should be positive), we pass the
reconstructed signal through a relu non-linearity.¹
Filtering: The convnet uses learned filters to convolve the feature maps from
the previous layer. To approximately invert this, the deconvnet uses transposed
versions of the same filters (as in other autoencoder models, such as RBMs), but
applied to the rectified maps, not the output of the layer beneath. In practice
this means flipping each filter vertically and horizontally.
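As a concrete illustration, the following minimal NumPy/SciPy sketch (our own, not the paper's implementation) shows this filtering step for a single 2D feature map: the forward pass correlates the map with a filter f, and the reconstruction path correlates the rectified map with f flipped vertically and horizontally.

```python
import numpy as np
from scipy.signal import correlate2d

def flip_filter(f):
    """Flip a 2D filter vertically and horizontally, giving the
    transposed filter used on the deconvnet's reconstruction path."""
    return f[::-1, ::-1]

f = np.random.randn(3, 3)           # a learned filter (illustrative values)
x = np.random.randn(8, 8)           # a feature map from the layer below
y = correlate2d(x, f, mode='same')  # forward filtering in the convnet
# Deconvnet filtering step: rectify, then apply the flipped filter.
r = correlate2d(np.maximum(y, 0), flip_filter(f), mode='same')
```

Note this is a reconstruction, not an exact inverse: the flipped filter is the adjoint of the forward correlation, so it only approximately undoes the filtering.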
Note that we do not use any contrast normalization operations in this
reconstruction path. Projecting down from higher layers uses the switch settings
generated by the max pooling in the convnet on the way up. As these switch
settings are peculiar to a given input image, the reconstruction obtained from a
single activation thus resembles a small piece of the original input image, with
structures weighted according to their contribution toward the feature activation.
Since the model is trained discriminatively, these reconstructions implicitly show which
parts of the input image are discriminative. Note that these projections are not
samples from the model, since there is no generative process involved. The whole
procedure is similar to backpropping a single strong activation (rather than the
usual gradients), i.e. computing $\partial h / \partial X_n$, where $h$ is the element of the feature map
with the strong activation and $X_n$ is the input image. However, it differs in
that (i) the relu is imposed independently and (ii) contrast normalization
operations are not used. A general shortcoming of our approach is that it only
visualizes a single activation, not the joint activity present in a layer. Neverthe-
less, as we show in Fig. 6, these visualizations are accurate representations of
the input pattern that stimulates the given feature map in the model: when the
parts of the original input image corresponding to the pattern are occluded, we
see a distinct drop in activity within the feature map.
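To make the whole projection concrete, here is a single-channel sketch of one deconvnet layer under our own simplifying assumptions (2x2 pooling, one feature map, zero padding); the names max_pool_with_switches, unpool and project_down are hypothetical, not from the paper's code.

```python
import numpy as np
from scipy.signal import correlate2d

def max_pool_with_switches(x, k=2):
    """k x k max pooling that also records the argmax ("switch") locations."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k, 2), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            patch = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(np.argmax(patch), patch.shape)
            pooled[i, j] = patch[r, c]
            switches[i, j] = (i*k + r, j*k + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Place each pooled value back at its recorded switch location."""
    x = np.zeros(shape)
    for i in range(pooled.shape[0]):
        for j in range(pooled.shape[1]):
            r, c = switches[i, j]
            x[r, c] = pooled[i, j]
    return x

def project_down(activation, switches, shape, filt):
    """One deconvnet layer: unpool -> relu -> filter with the flipped filter."""
    x = unpool(activation, switches, shape)
    x = np.maximum(x, 0)                             # relu on the reconstruction
    return correlate2d(x, filt[::-1, ::-1], mode='same')
```

To visualize a single activation, every other entry of the pooled map would be set to zero before calling project_down, mirroring the procedure described above.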
3 Training Details
We now describe the large convnet model that will be visualized in Section 4.
The architecture, shown in Fig. 3, is similar to that used by Krizhevsky et al. [18]
for ImageNet classification. One difference is that the sparse connections used
in Krizhevsky’s layers 3,4,5 (due to the model being split across 2 GPUs) are
replaced with dense connections in our model. Other important differences re-
lating to layers 1 and 2 were made following inspection of the visualizations in
Fig. 5, as described in Section 4.1.
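For illustration only, the sparse-versus-dense distinction can be expressed with the groups argument of a modern convolution layer (PyTorch here; the channel sizes are illustrative, not the paper's exact configuration):

```python
import torch.nn as nn

# Krizhevsky et al. split layers 3-5 across 2 GPUs, which is equivalent to a
# grouped convolution (groups=2): each output map sees only half the input maps.
sparse_conv = nn.Conv2d(256, 384, kernel_size=3, padding=1, groups=2)

# The model visualized here uses dense connections instead:
# every output map sees all input maps.
dense_conv = nn.Conv2d(256, 384, kernel_size=3, padding=1)
```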
The model was trained on the ImageNet 2012 training set (1.3 million images,
spread over 1000 different classes) [6]. Each RGB image was preprocessed by resizing
the smallest dimension to 256, cropping the center 256x256 region, subtracting
the per-pixel mean (across all images) and then using 10 different sub-crops
of size 224x224 (the four corners plus the center, each with and without a horizontal flip).
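A minimal sketch of this preprocessing pipeline, assuming a precomputed 256x256x3 per-pixel mean image (mean_image) and using PIL/NumPy; the helper name preprocess and the exact rounding choices are ours:

```python
import numpy as np
from PIL import Image

def preprocess(path, mean_image):
    """Resize the shorter side to 256, centre-crop to 256x256, subtract the
    per-pixel mean, and return the 10 sub-crops of size 224x224."""
    img = Image.open(path).convert('RGB')
    w, h = img.size
    s = 256 / min(w, h)
    img = img.resize((round(w * s), round(h * s)))
    w, h = img.size
    left, top = (w - 256) // 2, (h - 256) // 2
    img = img.crop((left, top, left + 256, top + 256))
    x = np.asarray(img, dtype=np.float32) - mean_image  # mean_image: 256x256x3

    # Four corners + centre (offsets in a 256x256 image for 224x224 crops),
    # each with and without a horizontal flip.
    offsets = [(0, 0), (0, 32), (32, 0), (32, 32), (16, 16)]
    crops = [x[r:r+224, c:c+224] for r, c in offsets]
    crops += [crop[:, ::-1] for crop in crops]
    return np.stack(crops)  # shape (10, 224, 224, 3)
```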
Stochastic gradient descent with a mini-batch size of 128 was used to update
the parameters, starting with a learning rate of $10^{-2}$, in conjunction with a
momentum term of 0.9. We
¹ We also tried rectifying using the binary mask imposed by the feed-forward relu
operation, but the resulting visualizations were significantly less clear.