Visualizing and Understanding 2014 - Chinese Translation

Resource description:

ZFNet deep convolutional network paper
Visualizing and Understanding
Convolutional Networks
Matthew D. Zeiler and Rob Fergus
Dept. of Computer Science,
New York University, USA
{zeiler,fergus}@cs.nyu.edu
Abstract. Large Convolutional Network models have recently demon-
strated impressive classification performance on the ImageNet bench-
mark Krizhevsky et al. [18]. However there is no clear understanding of
why they perform so well, or how they might be improved. In this paper
we explore both issues. We introduce a novel visualization technique that
gives insight into the function of intermediate feature layers and the oper-
ation of the classifier. Used in a diagnostic role, these visualizations allow
us to find model architectures that outperform Krizhevsky et al. on the
ImageNet classification benchmark. We also perform an ablation study
to discover the performance contribution from different model layers. We
show our ImageNet model generalizes well to other datasets: when the
softmax classifier is retrained, it convincingly beats the current state-of-
the-art results on Caltech-101 and Caltech-256 datasets.
1 Introduction
Since their introduction by LeCun et al. [20] in the early 1990’s, Convolutional
Networks (convnets) have demonstrated excellent performance at tasks such as
hand-written digit classification and face detection. In the last 18 months, sev-
eral papers have shown that they can also deliver outstanding performance on
more challenging visual classification tasks. Ciresan et al. [4] demonstrate state-of-
the-art performance on NORB and CIFAR-10 datasets. Most notably, Krizhevsky
et al. [18] show record beating performance on the ImageNet 2012 classification
benchmark, with their convnet model achieving an error rate of 16.4%, compared
to the 2nd place result of 26.1%. Following on from this work, Girshick et al. [10]
have shown leading detection performance on the PASCAL VOC dataset. Sev-
eral factors are responsible for this dramatic improvement in performance: (i) the
availability of much larger training sets, with millions of labeled examples; (ii)
powerful GPU implementations, making the training of very large models practi-
cal and (iii) better model regularization strategies, such as Dropout [14].
Despite this encouraging progress, there is still little insight into the internal
operation and behavior of these complex models, or how they achieve such good
performance. From a scientific standpoint, this is deeply unsatisfactory. With-
out clear understanding of how and why they work, the development of better
models is reduced to trial-and-error. In this paper we introduce a visualization
technique that reveals the input stimuli that excite individual feature maps at
any layer in the model. It also allows us to observe the evolution of features
during training and to diagnose potential problems with the model. The visu-
alization technique we propose uses a multi-layered Deconvolutional Network
(deconvnet), as proposed by Zeiler et al. [29], to project the feature activations
back to the input pixel space. We also perform a sensitivity analysis of the clas-
sifier output by occluding portions of the input image, revealing which parts of
the scene are important for classification.
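As an illustration of such an occlusion sweep, here is a minimal Python sketch; it assumes a trained classifier `net` returning class logits and a preprocessed image tensor `img`, both hypothetical names rather than anything from the paper's code, and the patch size and stride are placeholder values.
```python
import torch

def occlusion_sensitivity(net, img, target_class, patch=32, stride=16, fill=0.5):
    """Slide a gray square over the image and record the target-class probability."""
    net.eval()
    _, H, W = img.shape                                   # img: (3, H, W), preprocessed
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    with torch.no_grad():
        for i, y in enumerate(range(0, H - patch + 1, stride)):
            for j, x in enumerate(range(0, W - patch + 1, stride)):
                occluded = img.clone()
                occluded[:, y:y + patch, x:x + patch] = fill      # gray occluder
                logits = net(occluded.unsqueeze(0))
                heatmap[i, j] = torch.softmax(logits, dim=1)[0, target_class]
    return heatmap    # low values mark image regions the classifier depends on
```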
Using these tools, we start with the architecture of Krizhevsky et al. [18] and
explore different architectures, discovering ones that outperform their results
on ImageNet. We then explore the generalization ability of the model to other
datasets, just retraining the softmax classifier on top. As such, this is a form
of supervised pre-training, which contrasts with the unsupervised pre-training
methods popularized by Hinton et al. [13] and others [1,26].
1.1 Related Work
Visualization: Visualizing features to gain intuition about the network is com-
mon practice, but mostly limited to the 1st layer where projections to pixel
space are possible. In higher layers alternate methods must be used. [8] find the
optimal stimulus for each unit by performing gradient descent in image space
to maximize the unit’s activation. This requires a careful initialization and does
not give any information about the unit’s invariances. Motivated by the latter’s
short-coming, [19] (extending an idea by [2]) show how the Hessian of a given
unit may be computed numerically around the optimal response, giving some
insight into invariances. The problem is that for higher layers, the invariances are
extremely complex so are poorly captured by a simple quadratic approximation.
Our approach, by contrast, provides a non-parametric view of invariance, show-
ing which patterns from the training set activate the feature map. Our approach
is similar to contemporary work by Simonyan et al. [23] who demonstrate how
saliency maps can be obtained from a convnet by projecting back from the fully
connected layers of the network, instead of the convolutional features that we
use. Girshick et al. [10] show visualizations that identify patches within a dataset
that are responsible for strong activations at higher layers in the model. Our vi-
sualizations differ in that they are not just crops of input images, but rather
top-down projections that reveal structures within each patch that stimulate a
particular feature map.
Feature Generalization: Our demonstration of the generalization ability of
convnet features is also explored in concurrent work by Donahue et al. [7] and
Girshick et al. [10]. They use the convnet features to obtain state-of-the-art
performance on Caltech-101 and the Sun scenes dataset in the former case, and
for object detection on the PASCAL VOC dataset, in the latter.
2 Approach
We use standard fully supervised convnet models throughout the paper, as de-
fined by LeCun et al. [20] and Krizhevsky et al. [18]. These models map a color
2D input image x_i, via a series of layers, to a probability vector ŷ_i over the C
different classes. Each layer consists of (i) convolution of the previous layer output
(or, in the case of the 1st layer, the input image) with a set of learned filters; (ii)
passing the responses through a rectified linear function (relu(x)=max(x, 0));
(iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a
local contrast operation that normalizes the responses across feature maps. For
more details of these operations, see [18] and [16]. The top few layers of the net-
work are conventional fully-connected networks and the final layer is a softmax
classifier. Fig. 3 shows the model used in many of our experiments.
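As a rough illustration of this layer recipe, a hedged PyTorch sketch of one convolutional stage plus the fully-connected top follows; the channel counts, kernel sizes and the use of LocalResponseNorm as a stand-in for the cross-map contrast operation are placeholder assumptions, not the exact Fig. 3 configuration.
```python
import torch.nn as nn

# One convolutional stage: learned filters -> relu -> (optional) max pooling
# -> (optional) normalization of responses across feature maps.
# All sizes below are placeholders, not the values from Fig. 3.
stage = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=96, kernel_size=7, stride=2),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.LocalResponseNorm(size=5),      # stand-in for the local contrast operation
)

# Top of the network: conventional fully-connected layers and a C-way classifier
# (the softmax itself is usually folded into the training loss).
C = 1000
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(96 * 13 * 13, 4096),     # placeholder flattened size
    nn.ReLU(),
    nn.Linear(4096, C),
)
```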
We train these models using a large set of N labeled images {x, y}, where label
y_i is a discrete variable indicating the true class. A cross-entropy loss function,
suitable for image classification, is used to compare ŷ_i and y_i. The parameters
of the network (filters in the convolutional layers, weight matrices in the fully-
connected layers and biases) are trained by back-propagating the derivative of
the loss with respect to the parameters throughout the network, and updating
the parameters via stochastic gradient descent. Details of training are given in
Section 3.
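A minimal sketch of such a training step, assuming a PyTorch module `model` and a mini-batch of images `x` with integer class labels `y` (hypothetical names); the learning rate and momentum below are the values quoted in Section 3.
```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()      # cross-entropy between predicted y_hat and true y
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

def train_step(x, y):
    optimizer.zero_grad()
    y_hat = model(x)                   # forward pass through conv, relu, pool, fc layers
    loss = criterion(y_hat, y)         # y holds the true class index for each image
    loss.backward()                    # back-propagate d(loss)/d(parameters)
    optimizer.step()                   # stochastic gradient descent update
    return loss.item()
```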
2.1 Visualization with a Deconvnet
Understanding the operation of a convnet requires interpreting the feature activ-
ity in intermediate layers. We present a novel way to map these activities back to the
input pixel space, showing what input pattern originally caused a given activation
in the feature maps. We perform this mapping with a Deconvolutional Network
(deconvnet) Zeiler et al. [29]. A deconvnet can be thought of as a convnet model
that uses the same components (filtering, pooling) but in reverse, so instead of
mapping pixels to features does the opposite. In Zeiler et al. [29], deconvnets were
proposed as a way of performing unsupervised learning. Here, they are not used
in any learning capacity, just as a probe of an already trained convnet.
To examine a convnet, a deconvnet is attached to each of its layers, as illus-
trated in Fig. 1(top), providing a continuous path back to image pixels. To start,
an input image is presented to the convnet and features computed throughout
the layers. To examine a given convnet activation, we set all other activations in
the layer to zero and pass the feature maps as input to the attached deconvnet
layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct
the activity in the layer beneath that gave rise to the chosen activation. This is
then repeated until input pixel space is reached.
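A schematic sketch of this top-down pass, assuming PyTorch; the `Layer` record and its fields are illustrative stand-ins for whatever a real implementation stores on the forward pass (filter weights, pooling switches and layer geometry), not the authors' code.
```python
from collections import namedtuple
import torch
import torch.nn.functional as F

# One record per convnet layer: the learned filters plus what the deconvnet needs
# from the forward pass (the max-pooling "switches" and the layer geometry).
Layer = namedtuple("Layer", "weight conv_stride conv_padding switches pool_kernel pool_stride")

def project_to_pixels(layers, feature_maps, map_idx):
    """Project one chosen feature map at the top of `layers` back to pixel space."""
    probe = torch.zeros_like(feature_maps)          # zero all other activations in the layer
    probe[:, map_idx] = feature_maps[:, map_idx]
    for layer in reversed(layers):
        if layer.switches is not None:              # (i) unpool using the recorded switches
            probe = F.max_unpool2d(probe, layer.switches,
                                   kernel_size=layer.pool_kernel, stride=layer.pool_stride)
        probe = F.relu(probe)                       # (ii) rectify
        probe = F.conv_transpose2d(probe, layer.weight,        # (iii) filter (transposed weights)
                                   stride=layer.conv_stride, padding=layer.conv_padding)
    return probe                                    # approximate reconstruction in pixel space
```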
Unpooling: In the convnet, the max pooling operation is non-invertible, how-
ever we can obtain an approximate inverse by recording the locations of the
maxima within each pooling region in a set of switch variables. In the decon-
vnet, the unpooling operation uses these switches to place the reconstructions
from the layer above into appropriate locations, preserving the structure of the
stimulus. See Fig. 1(bottom) for an illustration of the procedure.
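In a modern framework the switches fall out of the pooling call itself; a small sketch, assuming PyTorch, where the `return_indices=True` option plays the role of the switch variables:
```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=3, stride=2, return_indices=True)   # records max locations
unpool = nn.MaxUnpool2d(kernel_size=3, stride=2)

feature_maps = torch.randn(1, 96, 55, 55)          # placeholder activations
pooled, switches = pool(feature_maps)              # forward pass keeps the "switches"

# The reconstruction from the layer above is placed back at the recorded max
# locations; every other position in each pooling region stays zero.
reconstruction = unpool(pooled, switches, output_size=feature_maps.shape)
```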
Rectification: The convnet uses relu non-linearities, which rectify the fea-
ture maps thus ensuring the feature maps are always positive. To obtain valid
feature reconstructions at each layer (which also should be positive), we pass the
reconstructed signal through a relu non-linearity. (Footnote 1: We also tried rectifying
using the binary mask imposed by the feed-forward relu operation, but the resulting
visualizations were significantly less clear.)
Filtering: The convnet uses learned filters to convolve the feature maps from
the previous layer. To approximately invert this, the deconvnet uses transposed
versions of the same filters (as other autoencoder models, such as RBMs), but
applied to the rectified maps, not the output of the layer beneath. In practice
this means flipping each filter vertically and horizontally.
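The flipping remark can be checked numerically; the sketch below is a hedged illustration, not the authors' implementation, showing that for stride 1 a transposed convolution with the learned weights matches an ordinary correlation with the vertically and horizontally flipped filters under full zero padding.
```python
import torch
import torch.nn.functional as F

weight = torch.randn(96, 3, 7, 7)      # learned forward-conv filters: (maps above, maps below, 7x7)
x = torch.randn(1, 96, 13, 13)         # rectified feature maps from the layer above

# Deconvnet filtering step: apply transposed versions of the same filters.
recon_a = F.conv_transpose2d(x, weight, stride=1)

# Equivalently: flip each filter vertically and horizontally, swap the channel axes,
# and correlate with full zero padding (kernel size minus one).
flipped = torch.flip(weight, dims=[2, 3])
recon_b = F.conv2d(x, flipped.transpose(0, 1), stride=1, padding=6)

print(torch.allclose(recon_a, recon_b, atol=1e-3))   # True
```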
Note that we do not use any contrast normalization operations in this
reconstruction path. Projecting down from higher layers uses the switch settings
generated by the max pooling in the convnet on the way up. As these switch
settings are peculiar to a given input image, the reconstruction obtained from a
single activation thus resembles a small piece of the original input image, with
structures weighted according to their contribution toward the feature acti-
vation. Since the model is trained discriminatively, they implicitly show which
parts of the input image are discriminative. Note that these projections are not
samples from the model, since there is no generative process involved. The whole
procedure is similar to backpropping a single strong activation (rather than the
usual gradients), i.e. computing ∂h/∂X_n, where h is the element of the feature map
with the strong activation and X_n is the input image. However, it differs in
that (i) the relu is imposed independently and (ii) contrast normalization
operations are not used. A general shortcoming of our approach is that it only
visualizes a single activation, not the joint activity present in a layer. Neverthe-
less, as we show in Fig. 6, these visualizations are accurate representations of
the input pattern that stimulates the given feature map in the model: when the
parts of the original input image corresponding to the pattern are occluded, we
see a distinct drop in activity within the feature map.
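For comparison, a short autograd sketch of the "backpropping a single strong activation" variant, i.e. computing ∂h/∂X_n; `net_up_to_layer` is an assumed helper that runs the convnet forward up to the layer of interest.
```python
import torch

def single_activation_gradient(net_up_to_layer, x, map_idx):
    """Return dh/dX_n, where h is the strongest element of one chosen feature map."""
    x = x.clone().requires_grad_(True)        # X_n: the input image, shape (1, 3, H, W)
    fmap = net_up_to_layer(x)[0, map_idx]     # chosen feature map at the chosen layer
    h = fmap.flatten().max()                  # the single strong activation h
    h.backward()                              # ordinary backprop of h alone
    return x.grad                             # gradient image, same shape as the input
```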
3 Training Details
We now describe the large convnet model that will be visualized in Section 4.
The architecture, shown in Fig. 3, is similar to that used by Krizhevsky et al. [18]
for ImageNet classification. One difference is that the sparse connections used
in Krizhevsky’s layers 3,4,5 (due to the model being split across 2 GPUs) are
replaced with dense connections in our model. Other important differences re-
lating to layers 1 and 2 were made following inspection of the visualizations in
Fig. 5, as described in Section 4.1.
The model was trained on the ImageNet 2012 training set (1.3 million images,
spread over 1000 different classes) [6]. Each RGB image was preprocessed by resiz-
ing the smallest dimension to 256, cropping the center 256x256 region, subtract-
ing the per-pixel mean (across all images) and then using 10 different sub-crops
of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient
descent with a mini-batch size of 128 was used to update the parameters, starting
with a learning rate of 10^-2, in conjunction with a momentum term of 0.9. We
anneal the learning rate throughout training manually when the validation error
plateaus. Dropout [14] is used in the fully connected layers (6 and 7) with a rate
of 0.5. All weights are initialized to 10^-2 and biases are set to 0.
Fig. 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet
will reconstruct an approximate version of the convnet features from the layer beneath.
Bottom: An illustration of the unpooling operation in the deconvnet, using switches
which record the location of the local max in each pooling region (colored zones) during
pooling in the convnet. The black/white bars are negative/positive activations within
the feature map.
Visualization of the first layer filters during training reveals that a few of
them dominate. To combat this, we renormalize each filter in the convolutional
layers whose RMS value exceeds a fixed radius of 10^-1 to this fixed radius. This
is crucial, especially in the first layer of the model, where the input images are
roughly in the [-128,128] range. As in Krizhevsky et al. [18], we produce multiple
different crops and flips of each training example to boost training set size. We
stopped training after 70 epochs, which took around 12 days on a single GTX580
GPU, using an implementation based on [18].
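A hedged numpy sketch of the preprocessing and of the filter renormalization described above; the per-pixel mean image is assumed to be precomputed over the training set, and the crop bookkeeping is illustrative rather than a copy of the authors' pipeline.
```python
import numpy as np

def ten_crops(img, mean_image, size=224):
    """img: a 256x256x3 array (smallest side already resized to 256, center-cropped).

    Subtract the per-pixel mean, then take the four corner crops and the center crop,
    each with and without a horizontal flip: 10 sub-crops of size 224x224."""
    img = img.astype(np.float32) - mean_image
    H, W, _ = img.shape
    offsets = [(0, 0), (0, W - size), (H - size, 0), (H - size, W - size),
               ((H - size) // 2, (W - size) // 2)]
    crops = [img[y:y + size, x:x + size] for y, x in offsets]
    crops += [c[:, ::-1] for c in crops]                     # horizontal flips
    return np.stack(crops)                                   # shape (10, 224, 224, 3)

def renormalize_filters(filters, radius=1e-1):
    """Rescale any filter whose RMS value exceeds `radius` back to that radius."""
    out = filters.copy()
    for i, f in enumerate(out):
        rms = np.sqrt(np.mean(f ** 2))
        if rms > radius:
            out[i] = f * (radius / rms)
    return out
```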
4 Convnet Visualization
Using the model described in Section 3, we now use the deconvnet to visualize
the feature activations on the ImageNet validation set.
Feature Visualization: Fig. 2 shows feature visualizations from our model
once training is complete. For a given feature map, we show the top 9 acti-
vations, each projected separately down to pixel space, revealing the different
Resource file list:

Visualizing and Understanding2014_中文译文.zip contains about 2 files:
  1. Visualizing and Understanding.pdf 2.25MB
  2. Visualizing and Understanding2014_中文译文.docx 1.81MB