  • 基于最新的ISO 27001标准中文版与中英文对照版,两篇文章均为PDF版本
  • C++ Primer, Fourth Edition ……C++国际经典书籍,本版为中英文对照版,一段英文紧接一段译文,非常实用!CHM格式,关键术语可直接跳转到书中相关页码。个人认为是由浅入深学习C++最合适的书!
  • 这是一套个人简历中英文对照Excel模板,是一份参考价值较高的资料,感兴趣的可以下载查看。
  • ISO27001中英文对照版2013 ISO27001-2013中文版本,属于标准类,有助于相关人员工作查阅。前言:ISO(国际标准化组织)和IEC(国际电工委员会)是制定国际标准的专门机构。国家机构是ISO或IEC的成员,他们通过...
  • 该资源为ISO 14229-1汽车诊断协议的中英文对照版本,为方便自己与大家日后查阅而分享。
  • ISO 14229-1的中英文对照版本PDF,既方便阅读,又方便对照。
  • 这是一套提单栏目中英文对照详解Excel模板,是一份参考价值较高的资料,感兴趣的可以下载查看。
  • 1972避碰规则中英文对照版。1. 本规则条款适用于在公海和连接于公海而可供海船航行的一切水域中的一切船舶。2. 本规则条款不妨碍有关主管机关为连接于公海而可供海船航行的任何港外锚地、港口、江河、湖泊...
  • java_ee_api_中英文对照版.zip(解包大小 6.6 MB):Java EE javax 包的API中英文对照文档。
  • Serial Programming HOWTO 中英文对照版。网上已有一个翻译但比较生涩,这是自己重新翻译的版本,对Linux串口编程感兴趣的可以看看。
  • j2ee中英文对照版api

  • 《项目管理知识体系指南》(PMBOK® 指南)是美国项目管理协会(PMI)就其制定的项目管理知识体系PMBOK(Project Management Body of Knowledge)出版的指导性文件。30多年来,《项目管理知识体系指南》(PMBOK® 指南...
  • WTL9.1-ReadMe-中英文对照版 Windows Template Library - WTL Version 9.1 (build 5270) 2015-09-27 Windows模板库 - WTL Version 9.1 (build 5270) 2015-09-27 (水平有限,不足之处,欢迎指正交流:ybmj@vip.163....
  • C++6.0经典教材 C++Primer_4中文(中英文对照_序言英文)
  • J2EE-API-7和J2EE-API-6中英文对照版,专为Java企业级开发提供参考文档,这两种文档收集不易,希望大家多多支持。
  • After Effect 中英文对照大全

    一、菜单命令

    1、File(文件)

    New 新建
    New Project 新建项目
    New Folder 新建文件夹
    Open Project 打开项目
    Open Recent Projects 打开最近项目
    Close 关闭
    Save 保存
    Save As 另存为
    Save a Copy 保存副本
    Revert 恢复
    Import 导入
    File 文件
    Multiple Files 多个文件
    Placeholder 输入占位符
    Solid 实色
    Import Recent Footage 导入最近镜头
    Export 输出
    Find 查找
    Find Next 再次查找
    Add Footage to Comp 添加镜头到合成
    New Comp From Selection 用选定素材建立合成
    Consolidate All Footage 整理镜头
    Remove Unused Footage 删除未用镜头
    Reduce Project 简化项目
    Collect Files 文件打包
    Watch Folder 监视文件夹
    Run Script 运行脚本
    Create Proxy 建立代理
    Still 静态图片
    Movie 影片
    Set Proxy 设置代理
    File 文件
    None 无
    Interpret Footage 解释镜头
    Main 常规
    Proxy 代理
    Remember Interpretation 保存解释
    Apply Interpretation 应用解释
    Replace Footage 替换镜头
    File 文件
    Placeholder 占位符
    Solid 实色
    Reload Footage 重载镜头
    Reveal in Explorer 显示所在文件夹
    Project Settings 项目设置
    Print 打印
    Exit 退出

    2、Edit(编辑)

    Undo Copy 撤消
    Redo Copy 重做
    History 历史记录
    Cut 剪切
    Copy 复制
    Paste 粘贴
    Clear 清除
    Duplicate 副本
    Split Layer 分割图层
    Lift Work Area 抽出工作区域
    Extract Work Area 挤压工作区域
    Select All 选择全部
    Deselect All 全部取消
    Label 标签
    Purge 清空
    All 全部
  • 一本久负盛名、业界无可替代的C++经典著作,其原版销量超过450000册。此版本为PDF文字版而非影印版,清晰度很好。
  • JSTL入门中英文对照版:IBM文档,Christen制作的CHM格式,分享是一种美德 :)
  • jQuery API (中英文对照版) ---------------------------------- jQuery由美国人John Resig创建,至今已吸引了来自世界各地的众多javascript高手加入其team,包括来自德国的Jörn Zaefferer,罗马尼亚的Stefan ...

    图像分类经典论文翻译汇总:[翻译汇总]

    翻译pdf文件下载:[下载地址]

    此版为中英文对照版,纯中文版请移步:[SENet纯中文版]

    Squeeze-and-Excitation Networks

    挤压和激励网络

    Jie Hu*
    Momenta
    hujie@momenta.ai

    Li Shen*
    University of Oxford
    lishen@robots.ox.ac.uk

    Gang Sun*
    Momenta
    sungang@momenta.ai

    Abstract

    Convolutional neural networks are built upon the convolution operation, which extracts informative features by fusing spatial and channel-wise information together within local receptive fields. In order to boost the representational power of a network, much existing work has shown the benefits of enhancing spatial encoding. In this work, we focus on channels and propose a novel architectural unit, which we term the “Squeeze-and-Excitation”(SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels. We demonstrate that by stacking these blocks together, we can construct SENet architectures that generalise extremely well across challenging datasets. Crucially, we find that SE blocks produce significant performance improvements for existing state-of-the-art deep architectures at slight computational cost. SENets formed the foundation of our ILSVRC 2017 classification submission which won first place and significantly reduced the top-5 error to 2.251%, achieving a 25% relative improvement over the winning entry of 2016.

    摘要

    卷积神经网络建立在卷积运算的基础上,通过融合局部感受野内的空间信息和通道信息来提取信息特征。为了提高网络的表示能力,许多现有的工作已经表明增强空间编码的好处。在这项工作中,我们专注于通道,并提出了一种新颖的架构单元,我们称之为“Squeeze-and-Excitation”(SE)块,它通过显式地建模通道之间的相互依赖关系,自适应地重新校准通道式的特征响应。我们证明,通过将这些块堆叠在一起,可以构建在具有挑战性的数据集上泛化性能极好的SENet架构。关键的是,我们发现SE块以微小的计算成本为现有最先进的深层架构带来了显著的性能改进。SENets是我们ILSVRC 2017分类提交的基础,它赢得了第一名,并将top-5错误率显著降低到2.251%,相对于2016年的获胜方案取得了25%的相对改进。

    1. Introduction

    Convolutional neural networks (CNNs) have proven to be effective models for tackling a variety of visual tasks [19, 23, 29, 41]. For each convolutional layer, a set of filters are learned to express local spatial connectivity patterns along input channels. In other words, convolutional filters are expected to be informative combinations by fusing spatial and channel-wise information together, while restricted in local receptive fields. By stacking a series of convolutional layers interleaved with non-linearities and downsampling, CNNs are capable of capturing hierarchical patterns with global receptive fields as powerful image descriptions. Recent work has demonstrated the performance of networks can be improved by explicitly embedding learning mechanisms that help capture spatial correlations without requiring additional supervision. One such approach was popularised by the Inception architectures [14, 39], which showed that the network can achieve competitive accuracy by embedding multi-scale processes in its modules. More recent work has sought to better model spatial dependence [1, 27] and incorporate spatial attention [17].

    1. 引言

    卷积神经网络(CNNs)已被证明是解决各种视觉任务的有效模型[19,23,29,41]。对于每个卷积层,沿着输入通道学习一组滤波器来表达局部空间连接模式。换句话说,卷积滤波器被期望在局部感受野的限制下,通过融合空间信息和通道信息形成有信息量的组合。通过叠加一系列与非线性和下采样交织的卷积层,CNN能够捕获具有全局感受野的分层模式,作为强大的图像描述。最近的工作已经证明,网络的性能可以通过显式地嵌入学习机制来改善,这种学习机制有助于捕捉空间相关性而不需要额外的监督。Inception架构[14,39]推广了一种这样的方法,表明网络可以通过在其模块中嵌入多尺度处理来取得有竞争力的准确度。更近期的工作则寻求更好地建模空间依赖[1,27],并结合空间注意力[17]。

    In contrast to these methods, we investigate a different aspect of architectural design —— the channel relationship, by introducing a new architectural unit, which we term the Squeeze-and-Excitation (SE) block. Our goal is to improve the representational power of a network by explicitly modelling the interdependencies between the channels of its convolutional features. To achieve this, we propose a mechanism that allows the network to perform feature recalibration, through which it can learn to use global information to selectively emphasise informative features and suppress less useful ones.

    与这些方法相反,我们通过引入一个新的架构单元(我们称之为“Squeeze-and-Excitation”(SE)块),研究了架构设计的一个不同方面——通道关系。我们的目标是通过显式地建模卷积特征通道之间的相互依赖性来提高网络的表示能力。为了达到这个目的,我们提出了一种机制,使网络能够执行特征重新校准,通过这种机制,网络可以学习使用全局信息来选择性地强调有信息量的特征并抑制不太有用的特征。

    The basic structure of the SE building block is illustrated in Fig. 1. For any given transformation $F_{tr}: \mathbf{X} \rightarrow \mathbf{U}$, $\mathbf{X} \in \mathbb{R}^{W' \times H' \times C'}$, $\mathbf{U} \in \mathbb{R}^{W \times H \times C}$ (e.g. a convolution or a set of convolutions), we can construct a corresponding SE block to perform feature recalibration as follows. The features $\mathbf{U}$ are first passed through a squeeze operation, which aggregates the feature maps across spatial dimensions $W \times H$ to produce a channel descriptor. This descriptor embeds the global distribution of channel-wise feature responses, enabling information from the global receptive field of the network to be leveraged by its lower layers. This is followed by an excitation operation, in which sample-specific activations, learned for each channel by a self-gating mechanism based on channel dependence, govern the excitation of each channel. The feature maps $\mathbf{U}$ are then reweighted to generate the output of the SE block which can then be fed directly into subsequent layers.

    Figure 1. A Squeeze-and-Excitation block.

    SE构建块的基本结构如图1所示。对于任何给定的变换$F_{tr}: \mathbf{X} \rightarrow \mathbf{U}$,$\mathbf{X} \in \mathbb{R}^{W' \times H' \times C'}$,$\mathbf{U} \in \mathbb{R}^{W \times H \times C}$(例如卷积或一组卷积),我们可以构造一个相应的SE块来执行特征重新校准,如下所示。特征$\mathbf{U}$首先通过squeeze操作,该操作跨越空间维度$W \times H$聚合特征映射来产生通道描述符。这个描述符嵌入了通道特征响应的全局分布,使来自网络全局感受野的信息能够被其较低层利用。这之后是一个excitation操作,其中通过基于通道依赖性的自门机制为每个通道学习特定采样的激活,控制每个通道的激励。然后特征映射$\mathbf{U}$被重新加权以生成SE块的输出,然后可以将其直接输入到随后的层中。


    An SE network can be generated by simply stacking a collection of SE building blocks. SE blocks can also be used as a drop-in replacement for the original block at any depth in the architecture. However, while the template for the building block is generic, as we show in Sec. 6.3, the role it performs at different depths adapts to the needs of the network. In the early layers, it learns to excite informative features in a class agnostic manner, bolstering the quality of the shared lower level representations. In later layers, the SE block becomes increasingly specialised, and responds to different inputs in a highly class-specific manner. Consequently, the benefits of feature recalibration conducted by SE blocks can be accumulated through the entire network.

    SE网络可以通过简单地堆叠SE构建块的集合来生成。SE块也可以用作架构中任意深度的原始块的直接替换。然而,虽然构建块的模板是通用的,正如我们6.3节中展示的那样,但它在不同深度的作用适应于网络的需求。在前面的层中,它学习以类不可知的方式激发信息特征,增强共享的较低层表示的质量。在后面的层中,SE块越来越专业化,并以高度类特定的方式响应不同的输入。因此,SE块进行特征重新校准的好处可以通过整个网络进行累积。

    The development of new CNN architectures is a challenging engineering task, typically involving the selection of many new hyperparameters and layer configurations. By contrast, the design of the SE block outlined above is simple, and can be used directly with existing state-of-the-art architectures whose convolutional layers can be strengthened by direct replacement with their SE counterparts. Moreover, as shown in Sec. 4, SE blocks are computationally lightweight and impose only a slight increase in model complexity and computational burden. To support these claims, we develop several SENets, namely SE-ResNet, SE-Inception, SE-ResNeXt and SE-Inception-ResNet and provide an extensive evaluation of SENets on the ImageNet 2012 dataset [30]. Further, to demonstrate the general applicability of SE blocks, we also present results beyond ImageNet, indicating that the proposed approach is not restricted to a specific dataset or a task.

    CNN架构的开发是一项具有挑战性的工程任务,通常涉及许多新的超参数和层配置的选择。相比之下,上面概述的SE块的设计是简单的,并且可以直接与现有的最先进架构一起使用,其卷积层可以通过直接替换为对应的SE模块来得到加强。另外,如第4节所示,SE块在计算上是轻量级的,只会稍微增加模型复杂性和计算负担。为了支持这些论断,我们开发了几个SENets,即SE-ResNet、SE-Inception、SE-ResNeXt和SE-Inception-ResNet,并在ImageNet 2012数据集[30]上对SENets进行了广泛的评估。此外,为了证明SE块的一般适用性,我们还给出了ImageNet之外的结果,表明所提出的方法不受限于特定的数据集或任务。

    Using SENets, we won the first place in the ILSVRC 2017 classification competition. Our top performing model ensemble achieves a 2.251% top-5 error on the test set. This represents a 25% relative improvement in comparison to the winner entry of the previous year (with a top-5 error of 2.991%). Our models and related materials have been made available to the research community.

    使用SENets,我们赢得了ILSVRC 2017分类竞赛的第一名。我们表现最好的模型集成在测试集上达到了2.251%的top-5错误率。与前一年的获胜方案(2.991%的top-5错误率)相比,这表示25%的相对改进。我们的模型和相关材料已经提供给研究界。

    2. Related Work

    Deep architectures. A wide range of work has shown that restructuring the architecture of a convolutional neural network in a manner that eases the learning of deep features can yield substantial improvements in performance. VGGNets [35] and Inception models [39] demonstrated the benefits that could be attained with an increased depth, significantly outperforming previous approaches on ILSVRC 2014. Batch normalization (BN) [14] improved gradient propagation through deep networks by inserting units to regulate layer inputs stabilising the learning process, which enables further experimentation with a greater depth. He et al. [9, 10] showed that it was effective to train deeper networks by restructuring the architecture to learn residual functions through the use of identity-based skip connections which ease the flow of information across units. More recently, reformulations of the connections between network layers [5, 12] have been shown to further improve the learning and representational properties of deep networks.

    2. 相关工作

    深层架构。大量的工作已经表明,以易于学习深度特征的方式重构卷积神经网络的架构可以大大提高性能。VGGNets[35]和Inception模型[39]证明了增加深度可以获得的好处,在ILSVRC 2014上明显超过了之前的方法。批标准化(BN)[14]通过插入单元来调节层输入、稳定学习过程,改善了深度网络中的梯度传播,这使得可以在更大的深度上进行进一步的实验。He等人[9,10]表明,通过重构架构以学习残差函数来训练更深层次的网络是有效的,其使用的基于恒等映射的跳跃连接简化了跨单元的信息流动。最近,对网络层间连接的重新表述[5,12]已被证明可以进一步改善深度网络的学习和表征属性。

    An alternative line of research has explored ways to tune the functional form of the modular components of a network. Grouped convolutions can be used to increase cardinality (the size of the set of transformations) [13, 43] to learn richer representations. Multi-branch convolutions can be interpreted as a generalisation of this concept, enabling more flexible compositions of convolutional operators [14, 38, 39, 40]. Cross-channel correlations are typically mapped as new combinations of features, either independently of spatial structure [6, 18] or jointly by using standard convolutional filters [22] with 1×1 convolutions, while much of this work has concentrated on the objective of reducing model and computational complexity. This approach reflects an assumption that channel relationships can be formulated as a composition of instance-agnostic functions with local receptive fields. In contrast, we claim that providing the network with a mechanism to explicitly model dynamic, non-linear dependencies between channels using global information can ease the learning process, and significantly enhance the representational power of the network.

    另一条研究路线探索了调整网络模块化组件函数形式的方法。分组卷积可以用来增加基数(变换集合的大小)[13,43],以学习更丰富的表示。多分支卷积可以解释为这个概念的推广,使得卷积算子可以更灵活地组合[14,38,39,40]。跨通道相关性通常被映射为新的特征组合,或者独立于空间结构[6,18],或者通过使用带有1×1卷积的标准卷积滤波器[22]来联合映射,不过这类工作大多集中在降低模型和计算复杂度的目标上。这种方法反映了一个假设,即通道关系可以被表述为具有局部感受野的、与实例无关的函数的组合。相比之下,我们认为为网络提供一种利用全局信息显式建模通道之间动态、非线性依赖关系的机制,可以简化学习过程,并显著增强网络的表示能力。

    Attention and gating mechanisms. Attention can be viewed, broadly, as a tool to bias the allocation of available processing resources towards the most informative components of an input signal. The development and understanding of such mechanisms has been a longstanding area of research in the neuroscience community [15, 16, 28] and has seen significant interest in recent years as a powerful addition to deep neural networks [20, 25]. Attention has been shown to improve performance across a range of tasks, from localisation and understanding in images [3, 17] to sequence-based models [2, 24]. It is typically implemented in combination with a gating function (e.g. a softmax or sigmoid) and sequential techniques [11, 37]. Recent work has shown its applicability to tasks such as image captioning [4, 44] and lip reading [7], in which it is exploited to efficiently aggregate multi-modal data. In these applications, it is typically used on top of one or more layers representing higher-level abstractions for adaptation between modalities. Highway networks [36] employ a gating mechanism to regulate the shortcut connection, enabling the learning of very deep architectures. Wang et al. [42] introduce a powerful trunk-and-mask attention mechanism using an hourglass module [27], inspired by its success in semantic segmentation. This high capacity unit is inserted into deep residual networks between intermediate stages. In contrast, our proposed SE-block is a lightweight gating mechanism, specialised to model channel-wise relationships in a computationally efficient manner and designed to enhance the representational power of modules throughout the network.

    注意力和门机制。从广义上讲,可以将注意力视为一种工具,将可用处理资源的分配偏向于输入信号中信息最丰富的部分。这种机制的发展和理解一直是神经科学界的一个长期研究领域[15,16,28],并且近年来作为深度神经网络的一个强大补充引起了极大的兴趣[20,25]。注意力已经被证明可以改善一系列任务的性能,从图像的定位和理解[3,17]到基于序列的模型[2,24]。它通常结合门函数(例如softmax或sigmoid)和序列技术来实现[11,37]。最近的研究表明,它适用于图像描述[4,44]和唇读[7]等任务,其中利用它来有效地汇聚多模态数据。在这些应用中,它通常用在表示较高级别抽象的一个或多个层之上,以用于模态之间的适应。Highway网络[36]采用门机制来调节捷径连接,使得可以学习非常深的架构。Wang等人[42]受其在语义分割中的成功启发,引入了一个使用沙漏模块[27]的强大的trunk-and-mask注意力机制。这个高容量的单元被插入到深度残差网络的中间阶段之间。相比之下,我们提出的SE块是一个轻量级的门机制,专门用于以计算高效的方式对通道关系进行建模,并旨在增强整个网络中模块的表示能力。

    3. Squeeze-and-Excitation Blocks

    The Squeeze-and-Excitation block is a computational unit which can be constructed for any given transformation $F_{tr}: \mathbf{X} \rightarrow \mathbf{U}$, $\mathbf{X} \in \mathbb{R}^{W' \times H' \times C'}$, $\mathbf{U} \in \mathbb{R}^{W \times H \times C}$. For simplicity of exposition, in the notation that follows we take $F_{tr}$ to be a standard convolutional operator. Let $\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_C]$ denote the learned set of filter kernels, where $\mathbf{v}_c$ refers to the parameters of the $c$-th filter. We can then write the outputs of $F_{tr}$ as $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_C]$ where

    $$\mathbf{u}_c = \mathbf{v}_c \ast \mathbf{X} = \sum_{s=1}^{C'} \mathbf{v}_c^s \ast \mathbf{x}^s.$$

    Here $\ast$ denotes convolution, $\mathbf{v}_c = [\mathbf{v}_c^1, \mathbf{v}_c^2, \ldots, \mathbf{v}_c^{C'}]$ and $\mathbf{X} = [\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^{C'}]$ (to simplify the notation, bias terms are omitted). Here $\mathbf{v}_c^s$ is a 2D spatial kernel, and therefore represents a single channel of $\mathbf{v}_c$ which acts on the corresponding channel of $\mathbf{X}$. Since the output is produced by a summation through all channels, the channel dependencies are implicitly embedded in $\mathbf{v}_c$, but these dependencies are entangled with the spatial correlation captured by the filters. Our goal is to ensure that the network is able to increase its sensitivity to informative features so that they can be exploited by subsequent transformations, and to suppress less useful ones. We propose to achieve this by explicitly modelling channel interdependencies to recalibrate filter responses in two steps, squeeze and excitation, before they are fed into next transformation. A diagram of an SE building block is shown in Fig. 1.

    3. Squeeze-and-Excitation块

    Squeeze-and-Excitation块是一个计算单元,可以为任何给定的变换 $F_{tr}: \mathbf{X} \rightarrow \mathbf{U}$,$\mathbf{X} \in \mathbb{R}^{W' \times H' \times C'}$,$\mathbf{U} \in \mathbb{R}^{W \times H \times C}$ 构建。为了简化说明,在接下来的表示中,我们将$F_{tr}$看作一个标准的卷积算子。$\mathbf{V} = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_C]$表示学习到的一组滤波器核,$\mathbf{v}_c$指的是第$c$个滤波器的参数。然后我们可以将$F_{tr}$的输出写作$\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_C]$,其中

    $$\mathbf{u}_c = \mathbf{v}_c \ast \mathbf{X} = \sum_{s=1}^{C'} \mathbf{v}_c^s \ast \mathbf{x}^s.$$

    这里$\ast$表示卷积,$\mathbf{v}_c = [\mathbf{v}_c^1, \mathbf{v}_c^2, \ldots, \mathbf{v}_c^{C'}]$,$\mathbf{X} = [\mathbf{x}^1, \mathbf{x}^2, \ldots, \mathbf{x}^{C'}]$(为了简洁表示,忽略偏置项)。这里$\mathbf{v}_c^s$是2D空间核,因此表示$\mathbf{v}_c$的一个单通道,作用于$\mathbf{X}$的对应通道。由于输出是通过所有通道的求和来产生的,所以通道依赖性被隐式地嵌入到$\mathbf{v}_c$中,但是这些依赖性与滤波器捕获的空间相关性纠缠在一起。我们的目标是确保网络能够提高对信息特征的敏感度,以便后续变换可以利用这些特征,并抑制不太有用的特征。我们建议通过显式地建模通道相互依赖性,在特征进入下一个变换之前分squeeze和excitation两步重新校准滤波器响应来实现这一点。SE构建块的示意图如图1所示。

    3.1. Squeeze: Global Information Embedding

    In order to tackle the issue of exploiting channel dependencies, we first consider the signal to each channel in the output features. Each of the learned filters operate with a local receptive field and consequently each unit of the transformation output UU is unable to exploit contextual information outside of this region. This is an issue that becomes more severe in the lower layers of the network whose receptive field sizes are small.

    3.1. Squeeze:全局信息嵌入

    为了解决利用通道依赖性的问题,我们首先考虑输出特征中每个通道的信号。每个学习到的滤波器都对局部感受野进行操作,因此变换输出UU的每个单元都无法利用该区域之外的上下文信息。在网络较低的层次上其感受野尺寸很小,这个问题变得更严重。

    To mitigate this problem, we propose to squeeze global spatial information into a channel descriptor. This is achieved by using global average pooling to generate channel-wise statistics. Formally, a statistic $\mathbf{z} \in \mathbb{R}^C$ is generated by shrinking $\mathbf{U}$ through spatial dimensions $W \times H$, where the $c$-th element of $\mathbf{z}$ is calculated by:

    $$z_c = F_{sq}(\mathbf{u}_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j).$$

    为了减轻这个问题,我们提出将全局空间信息压缩成一个通道描述符。这是通过使用全局平均池化生成按通道的统计量来实现的。形式上,统计量$\mathbf{z} \in \mathbb{R}^C$是通过在空间维度$W \times H$上收缩$\mathbf{U}$生成的,其中$\mathbf{z}$的第$c$个元素通过下式计算:

    $$z_c = F_{sq}(\mathbf{u}_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i, j).$$

    Discussion. The transformation output $\mathbf{U}$ can be interpreted as a collection of the local descriptors whose statistics are expressive for the whole image. Exploiting such information is prevalent in feature engineering work [31, 34, 45]. We opt for the simplest, global average pooling, while more sophisticated aggregation strategies could be employed here as well.

    讨论。变换输出$\mathbf{U}$可以被解释为局部描述子的集合,这些描述子的统计信息对于整个图像来说是有表现力的。特征工程工作中普遍利用这类信息[31,34,45]。我们选择最简单的全局平均池化,同时这里也可以采用更复杂的汇聚策略。
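    下面给出squeeze(全局平均池化)一步的一个最小代码示意(仅为笔者补充的草图:采用PyTorch属于假设,论文并未指定实现框架;函数名squeeze为自拟):

```python
import torch

def squeeze(u: torch.Tensor) -> torch.Tensor:
    """对形状为 (N, C, H, W) 的特征U做全局平均池化,得到通道描述符z,形状 (N, C)。

    对应上式 z_c = 1/(W*H) * sum_{i,j} u_c(i, j)。
    """
    return u.mean(dim=(2, 3))

# 示例:批量为8、通道数C=64、空间尺寸56x56的特征图
u = torch.randn(8, 64, 56, 56)
z = squeeze(u)
assert z.shape == (8, 64)   # 每个通道得到一个标量统计量
```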

    3.2. Excitation: Adaptive Recalibration

    To make use of the information aggregated in the squeeze operation, we follow it with a second operation which aims to fully capture channel-wise dependencies. To fulfil this objective, the function must meet two criteria: first, it must be flexible (in particular, it must be capable of learning a nonlinear interaction between channels) and second, it must learn a non-mutually-exclusive relationship as multiple channels are allowed to be emphasised opposed to one-hot activation. To meet these criteria, we opt to employ a simple gating mechanism with a sigmoid activation:

    $$\mathbf{s} = F_{ex}(\mathbf{z}, \mathbf{W}) = \sigma(g(\mathbf{z}, \mathbf{W})) = \sigma(\mathbf{W}_2\, \delta(\mathbf{W}_1 \mathbf{z}))$$

    where $\delta$ refers to the ReLU [26] function, $\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$. To limit model complexity and aid generalisation, we parameterise the gating mechanism by forming a bottleneck with two fully-connected (FC) layers around the non-linearity, i.e. a dimensionality-reduction layer with parameters $\mathbf{W}_1$ with reduction ratio $r$ (we set it to be 16, and this parameter choice is discussed in Sec. 6.3), a ReLU and then a dimensionality-increasing layer with parameters $\mathbf{W}_2$. The final output of the block is obtained by rescaling the transformation output $\mathbf{U}$ with the activations:

    $$\tilde{\mathbf{x}}_c = F_{scale}(\mathbf{u}_c, s_c) = s_c \cdot \mathbf{u}_c$$

    where $\tilde{\mathbf{X}} = [\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_C]$ and $F_{scale}(\mathbf{u}_c, s_c)$ refers to channel-wise multiplication between the feature map $\mathbf{u}_c \in \mathbb{R}^{W \times H}$ and the scalar $s_c$.

    3.2. Excitation:自适应重新校准

    为了利用压缩操作中汇聚的信息,我们接下来通过第二个操作来全面捕获通道依赖性。为了实现这个目标,这个函数必须满足两个标准:第一,它必须是灵活的(特别是它必须能够学习通道之间的非线性交互);第二,它必须学习一种非互斥的关系,因为与独热激活相反,这里允许同时强调多个通道。为了满足这些标准,我们选择采用一个简单的、带sigmoid激活的门机制:

    $$\mathbf{s} = F_{ex}(\mathbf{z}, \mathbf{W}) = \sigma(g(\mathbf{z}, \mathbf{W})) = \sigma(\mathbf{W}_2\, \delta(\mathbf{W}_1 \mathbf{z}))$$

    其中$\delta$是指ReLU[26]函数,$\mathbf{W}_1 \in \mathbb{R}^{\frac{C}{r} \times C}$,$\mathbf{W}_2 \in \mathbb{R}^{C \times \frac{C}{r}}$。为了限制模型复杂度和辅助泛化,我们通过在非线性两侧形成由两个全连接(FC)层构成的瓶颈来参数化门机制,即一个参数为$\mathbf{W}_1$、降维比例为$r$的降维层(我们将其设置为16,这个参数选择在6.3节中讨论),一个ReLU,然后是一个参数为$\mathbf{W}_2$的升维层。块的最终输出通过用这些激活对变换输出$\mathbf{U}$进行重新缩放得到:

    $$\tilde{\mathbf{x}}_c = F_{scale}(\mathbf{u}_c, s_c) = s_c \cdot \mathbf{u}_c$$

    其中$\tilde{\mathbf{X}} = [\tilde{\mathbf{x}}_1, \tilde{\mathbf{x}}_2, \ldots, \tilde{\mathbf{x}}_C]$,$F_{scale}(\mathbf{u}_c, s_c)$指的是特征映射$\mathbf{u}_c \in \mathbb{R}^{W \times H}$和标量$s_c$之间的对应通道乘积。

    Discussion. The activations act as channel weights adapted to the input-specific descriptor $\mathbf{z}$. In this regard, SE blocks intrinsically introduce dynamics conditioned on the input, helping to boost feature discriminability.

    讨论。激活作为适应特定输入描述符$\mathbf{z}$的通道权重。在这方面,SE块本质上引入了以输入为条件的动态特性,有助于提高特征辨别力。
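    结合3.1节的squeeze和本节的excitation,下面给出一个完整SE块的最小PyTorch示意(仅为笔者的参考草图:使用PyTorch以及类名SEBlock均为假设,具体层的组织方式以论文与官方实现为准),即“全局平均池化 → FC降维(W1) → ReLU → FC升维(W2) → sigmoid → 按通道缩放”:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation块:squeeze -> excitation -> 按通道重新缩放。"""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)  # W1:降维,比例r
        self.fc2 = nn.Linear(channels // reduction, channels)  # W2:升维

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        n, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                                  # squeeze:全局平均池化 -> (N, C)
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))    # excitation:sigmoid(W2·ReLU(W1·z))
        return u * s.view(n, c, 1, 1)                           # scale:按通道乘以激活s_c

# 用法示例:作用在 (N, C, H, W) 的特征图上,输出形状不变
x = torch.randn(4, 256, 28, 28)
out = SEBlock(256)(x)
assert out.shape == x.shape
```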

    3.3. Exemplars: SE-Inception and SE-ResNet

    The flexibility of the SE block means that it can be directly applied to transformations beyond standard convolutions. To illustrate this point, we develop SENets by integrating SE blocks into two popular network families of architectures, Inception and ResNet. SE blocks are constructed for the Inception network by taking the transformation $F_{tr}$ to be an entire Inception module (see Fig. 2). By making this change for each such module in the architecture, we construct an SE-Inception network.

    Figure 2. The schema of the original Inception module (left) and the SE-Inception module (right).

    3.3. 示例:SE-Inception和SE-ResNet

    SE块的灵活性意味着它可以直接应用于标准卷积之外的变换。为了说明这一点,我们通过将SE块集成到两个流行的网络架构系列Inception和ResNet中来开发SENets。通过将变换$F_{tr}$看作一个整体的Inception模块(参见图2),为Inception网络构建SE块。通过对架构中的每个这样的模块进行更改,我们构建了一个SE-Inception网络。

    2。最初的Inception模块架构()SE-Inception模块架构()

    Residual networks and their variants have shown to be highly effective at learning deep representations. We develop a series of SE blocks that integrate with ResNet [9], ResNeXt [43] and Inception-ResNet [38] respectively. Fig. 3 depicts the schema of an SE-ResNet module. Here, the SE block transformation $F_{tr}$ is taken to be the non-identity branch of a residual module. Squeeze and excitation both act before summation with the identity branch.

    Figure 3. The schema of the original Residual module (left) and the SE-ResNet module (right).

    残差网络及其变种已经证明在学习深度表示方面非常有效。我们开发了一系列的SE块,分别与ResNet[9]、ResNeXt[43]和Inception-ResNet[38]集成。图3描述了SE-ResNet模块的架构。在这里,SE块变换$F_{tr}$被取为残差模块的非恒等分支。压缩和激励都在与恒等分支相加之前起作用。

    3 最初的Residual模块架构()SE-ResNet模块架构()

    4. Model and Computational Complexity

    An SENet is constructed by stacking a set of SE blocks. In practice, it is generated by replacing each original block (i.e. residual block) with its corresponding SE counterpart (i.e. SE-residual block). We describe the architecture of SE-ResNet-50 and SE-ResNeXt-50 in Table 1.

    Table 1. (Left) ResNet-50. (Middle) SE-ResNet-50. (Right) SE-ResNeXt-50 with a 32×4d template. The shapes and operations with specific parameter settings of a residual building block are listed inside the brackets and the number of stacked blocks in a stage is presented outside. The inner brackets following by fc indicates the output dimension of the two fully connected layers in a SE-module.

    4. 模型和计算复杂度

    SENet通过堆叠一组SE块来构建。实际上,它是通过用原始块的SE对应部分(即SE残差块)替换每个原始块(即残差块)而产生的。我们在表1中描述了SE-ResNet-50SE-ResNeXt-50的架构。

    1()ResNet-50()SE-ResNet-50()具有32×4d32×4d模板的SE-ResNeXt-50。在括号内列出了残差构建块特定参数设置的形状和操作,并且在外部呈现了一个阶段中堆叠块的数量。fc后面的内括号表示SE模块中两个全连接层的输出维度。

    For the proposed SE block to be viable in practice, it must provide an acceptable model complexity and computational overhead which is important for scalability. To illustrate the cost of the module, we take the comparison between ResNet-50 and SE-ResNet-50 as an example, where the accuracy of SE-ResNet-50 is obviously superior to ResNet-50 and approaching a deeper ResNet-101 network (shown in Table 2). ResNet-50 requires ∼3.86 GFLOPs in a single forward pass for a 224×224 pixel input image. Each SE block makes use of a global average pooling operation in the squeeze phase and two small fully connected layers in the excitation phase, followed by an inexpensive channel-wise scaling operation. In aggregate, SE-ResNet-50 requires ∼3.87 GFLOPs, corresponding to only a 0.26% relative increase over the original ResNet-50.

    Table 2. Single-crop error rates (%) on the ImageNet validation set and complexity comparisons. The original column refers to the results reported in the original papers. To enable a fair comparison, we re-train the baseline models and report the scores in the re-implementation column. The SENet column refers to the corresponding architectures in which SE blocks have been added. The numbers in brackets denote the performance improvement over the re-implemented baselines. † indicates that the model has been evaluated on the non-blacklisted subset of the validation set (this is discussed in more detail in [38]), which may slightly improve results.

    为了使提出的SE块在实践中可行,它必须提供可接受的模型复杂度和计算开销,这对于可扩展性是重要的。为了说明模块的成本,我们以ResNet-50和SE-ResNet-50之间的比较为例,其中SE-ResNet-50的精确度明显优于ResNet-50,并接近更深的ResNet-101网络(如表2所示)。对于224×224像素的输入图像,ResNet-50单次前向传播需要约3.86 GFLOPs。每个SE块在squeeze阶段利用一个全局平均池化操作,在excitation阶段利用两个小的全连接层,接下来是廉价的按通道缩放操作。总的来说,SE-ResNet-50需要约3.87 GFLOPs,相对于原始的ResNet-50只相对增加了0.26%。

    2ImageNet验证集上的单裁剪图像错误率(%)和复杂度比较。original列是指原始论文中报告的结果。为了进行公平比较,我们重新训练了基准模型,并在re-implementation列中报告分数。SENet列是指已添加SE块后对应的架构。括号内的数字表示与重新实现的基准数据相比的性能改善。†表示该模型已经在验证集的非黑名单子集上进行了评估(在[38]中有更详细的讨论),这可能稍微改善结果。

    In practice, with a training mini-batch of 256 images, a single pass forwards and backwards through ResNet-50 takes 190 ms, compared to 209 ms for SE-ResNet-50 (both timings are performed on a server with 8 NVIDIA Titan X GPUs). We argue that it is a reasonable overhead as global pooling and small inner-product operations are less optimised in existing GPU libraries. Moreover, due to its importance for embedded device applications, we also benchmark CPU inference time for each model: for a 224×224 pixel input image, ResNet-50 takes 164 ms, compared to 167 ms for SE-ResNet-50. The small additional computational overhead required by the SE block is justified by its contribution to model performance (discussed in detail in Sec. 6).

    在实践中,训练的小批量数据大小为256张图像,ResNet-50的一次前向和反向传播花费190 ms,而SE-ResNet-50则花费209 ms(两个时间都是在具有8块NVIDIA Titan X GPU的服务器上测得的)。我们认为这是一个合理的开销,因为在现有的GPU库中,全局池化和小型内积操作的优化程度较低。此外,由于其对嵌入式设备应用的重要性,我们还对每个模型的CPU推断时间进行了基准测试:对于224×224像素的输入图像,ResNet-50花费了164 ms,相比之下SE-ResNet-50花费了167 ms。SE块所需的少量额外计算开销,对于其对模型性能的贡献来说是合理的(在第6节中详细讨论)。

    Next, we consider the additional parameters introduced by the proposed block. All additional parameters are contained in the two fully connected layers of the gating mechanism, which constitute a small fraction of the total network capacity. More precisely, the number of additional parameters introduced is given by:

    $$\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^2$$

    where $r$ denotes the reduction ratio (we set $r$ to 16 in all our experiments), $S$ refers to the number of stages (where each stage refers to the collection of blocks operating on feature maps of a common spatial dimension), $C_s$ denotes the dimension of the output channels for stage $s$ and $N_s$ refers to the repeated block number. In total, SE-ResNet-50 introduces ∼2.5 million additional parameters beyond the ∼25 million parameters required by ResNet-50, corresponding to a 10% increase in the total number of parameters. The majority of these additional parameters come from the last stage of the network, where excitation is performed across the greatest channel dimensions. However, we found that the comparatively expensive final stage of SE blocks could be removed at a marginal cost in performance (<0.1% top-1 error on ImageNet dataset) to reduce the relative parameter increase to 4%, which may prove useful in cases where parameter usage is a key consideration.

    接下来,我们考虑所提出的块引入的附加参数。所有附加参数都包含在门机制的两个全连接层中,构成网络总容量的一小部分。更确切地说,引入的附加参数的数量由下式给出:

    $$\frac{2}{r} \sum_{s=1}^{S} N_s \cdot C_s^2$$

    其中$r$表示减少比率(我们在所有的实验中将$r$设置为16),$S$指的是阶段数量(每个阶段是指在相同空间维度的特征映射上运行的块的集合),$C_s$表示阶段$s$的输出通道的维度,$N_s$表示重复块的数量。总的来说,SE-ResNet-50在ResNet-50所要求的约2500万参数之外引入了约250万附加参数,相当于参数总数增加了10%。这些附加参数中的大部分来自于网络的最后阶段,其中excitation在最大的通道维度上执行。然而,我们发现SE块相对昂贵的最后阶段可以被移除,而性能损失很小(在ImageNet数据集上<0.1%的top-1错误率),从而将相对参数增加减少到4%,这在参数使用是关键考虑的情况下可能是有用的。
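    下面用一小段代码核对上式给出的附加参数量(2/r·ΣN_s·C_s²,忽略偏置;其中各阶段的块数N_s与输出通道数C_s取自标准ResNet-50的设计,属于笔者的假设性示例):

```python
# 减少比率r与ResNet-50各阶段的(块数N_s, 输出通道数C_s)
r = 16
stages = [(3, 256), (4, 512), (6, 1024), (3, 2048)]

extra = (2 / r) * sum(n * c ** 2 for n, c in stages)
print(f"SE附加参数约 {extra / 1e6:.2f} M")            # 约2.51M,相对ResNet-50的约2500万参数增加约10%

# 去掉最后(通道数最大、开销最高)阶段的SE块:
extra_wo_last = (2 / r) * sum(n * c ** 2 for n, c in stages[:-1])
print(f"去掉最后阶段后约 {extra_wo_last / 1e6:.2f} M")  # 约0.94M,相对增加降到约4%
```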

    5. Implementation

    During training, we follow standard practice and perform data augmentation with random-size cropping [39] to 224×224 pixels (299×299 for Inception-ResNet-v2 [38] and SE-Inception-ResNet-v2) and random horizontal flipping. Input images are normalised through mean channel subtraction. In addition, we adopt the data balancing strategy described in [32] for mini-batch sampling to compensate for the uneven distribution of classes. The networks are trained on our distributed learning system ROCS which is capable of handling efficient parallel training of large networks. Optimisation is performed using synchronous SGD with momentum 0.9 and a mini-batch size of 1024 (split into sub-batches of 32 images per GPU across 4 servers, each containing 8 GPUs). The initial learning rate is set to 0.6 and decreased by a factor of 10 every 30 epochs. All models are trained for 100 epochs from scratch, using the weight initialisation strategy described in [8].

    5. 实现

    在训练过程中,我们遵循标准的做法,使用随机大小裁剪[39]到224×224像素(Inception-ResNet-v2[38]和SE-Inception-ResNet-v2为299×299)和随机水平翻转进行数据增强。输入图像通过减去各通道均值进行归一化。另外,我们采用[32]中描述的数据均衡策略进行小批量采样,以补偿类别的不均匀分布。网络在我们的分布式学习系统“ROCS”上进行训练,该系统能够处理大型网络的高效并行训练。使用同步SGD进行优化,动量为0.9,小批量数据的大小为1024(在4个服务器的每个GPU上分成32张图像的子批量,每个服务器包含8个GPU)。初始学习率设为0.6,每30个迭代周期减少10倍。所有模型都使用[8]中描述的权重初始化策略从零开始训练100个迭代周期。
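    下面给出上述学习率策略(初始0.6,每30个周期除以10,共训练100个周期,动量0.9的同步SGD)的一个最小示意;其中model只是占位对象,使用PyTorch的SGD也只是笔者的假设——作者实际是在其分布式系统ROCS上训练的:

```python
import torch

def lr_at_epoch(epoch: int, base_lr: float = 0.6) -> float:
    """初始学习率0.6,每30个迭代周期衰减为原来的1/10。"""
    return base_lr * (0.1 ** (epoch // 30))

model = torch.nn.Linear(10, 10)   # 占位模型,实际应为SENet
optimizer = torch.optim.SGD(model.parameters(), lr=0.6, momentum=0.9)

for epoch in range(100):          # 从零开始训练100个周期
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
    # ……此处省略以小批量大小1024进行一个周期训练的代码……
```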

    6. Experiments

    In this section we conduct extensive experiments on the ImageNet 2012 dataset [30] for the purposes: first, to explore the impact of the proposed SE block for the basic networks with different depths and second, to investigate its capacity of integrating with current state-of-the-art network architectures, which aim to a fair comparison between SENets and non-SENets rather than pushing the performance. Next, we present the results and details of the models for ILSVRC 2017 classification task. Furthermore, we perform experiments on the Places365-Challenge scene classification dataset [48] to investigate how well SENets are able to generalise to other datasets. Finally, we investigate the role of excitation and give some analysis based on experimental phenomena.

    6. 实验

    在这一部分,我们在ImageNet 2012数据集上进行了大量的实验[30],其目的是:首先探索提出的SE块对不同深度基础网络的影响;其次,调查它与最先进的网络架构集成后的能力,旨在公平比较SENets和非SENets,而不是推动性能。接下来,我们将介绍ILSVRC 2017分类任务模型的结果和详细信息。此外,我们在Places365-Challenge场景分类数据集[48]上进行了实验,以研究SENets是否能够很好地泛化到其它数据集。最后,我们研究激励的作用,并根据实验现象给出了一些分析。

    6.1. ImageNet Classification

    The ImageNet 2012 dataset is comprised of 1.28 million training images and 50K validation images from 1000 classes. We train networks on the training set and report the top-1 and the top-5 errors using centre crop evaluations on the validation set, where 224×224 pixels are cropped from each image whose shorter edge is first resized to 256 (299×299 from each image whose shorter edge is first resized to 352 for Inception-ResNet-v2 and SE-Inception-ResNet-v2).

    6.1. ImageNet分类

    ImageNet 2012数据集包含来自1000个类别的128万张训练图像和5万张验证图像。我们在训练集上训练网络,并在验证集上使用中心裁剪评估来报告top-1和top-5错误率:每张图像的短边首先被缩放到256,然后从中裁剪出224×224个像素(对于Inception-ResNet-v2和SE-Inception-ResNet-v2,每幅图像的短边首先被缩放到352,然后裁剪出299×299个像素)。
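    上述中心裁剪评估流程(短边缩放到256,再裁剪224×224)可以用如下torchvision变换来示意(仅为笔者的假设性写法;文中只提到按通道减去均值,具体均值数值论文未给出,这里使用常见的ImageNet均值,标准差取1表示只做减均值):

```python
from torchvision import transforms

eval_transform = transforms.Compose([
    transforms.Resize(256),        # 短边缩放到256
    transforms.CenterCrop(224),    # 中心裁剪224x224
    transforms.ToTensor(),
    # 仅做按通道减均值(均值为常用的ImageNet数值,属于假设)
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[1.0, 1.0, 1.0]),
])
```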

    Network depth. We first compare the SE-ResNet against a collection of standard ResNet architectures. Each ResNet and its corresponding SE-ResNet are trained with identical optimisation schemes. The performance of the different networks on the validation set is shown in Table 2, which shows that SE blocks consistently improve performance across different depths with an extremely small increase in computational complexity.

    网络深度。我们首先将SE-ResNet与一系列标准ResNet架构进行比较。每个ResNet及其相应的SE-ResNet都使用相同的优化方案进行训练。验证集上不同网络的性能如表2所示,可以看到SE块在不同深度的网络上都能始终如一地提高性能,而计算复杂度的增加极小。

    Remarkably, SE-ResNet-50 achieves a single-crop top-5 validation error of 6.62%, exceeding ResNet-50 (7.48%) by 0.86% and approaching the performance achieved by the much deeper ResNet-101 network (6.52% top-5 error) with only half of the computational overhead (3.87 GFLOPs vs. 7.58 GFLOPs). This pattern is repeated at greater depth, where SE-ResNet-101 (6.07% top-5 error) not only matches, but outperforms the deeper ResNet-152 network (6.34% top-5 error) by 0.27%. Fig. 4 depicts the training and validation curves of SE-ResNets and ResNets, respectively. While it should be noted that the SE blocks themselves add depth, they do so in an extremely computationally efficient manner and yield good returns even at the point at which extending the depth of the base architecture achieves diminishing returns. Moreover, we see that the performance improvements are consistent through training across a range of different depths, suggesting that the improvements induced by SE blocks can be used in combination with adding more depth to the base architecture.

    Figure 4. Training curves on ImageNet. (Left): ResNet-50 and SE-ResNet-50; (Right): ResNet-152 and SE-ResNet-152.

    值得注意的是,SE-ResNet-50实现了6.62%的单裁剪图像top-5验证错误率,超过了ResNet-50(7.48%)0.86%,并且只用一半的计算开销(3.87 GFLOPs vs. 7.58 GFLOPs)就接近了更深的ResNet-101网络(6.52%的top-5错误率)所达到的性能。这种模式在更大的深度上重复出现:SE-ResNet-101(6.07%的top-5错误率)不仅与更深的ResNet-152网络(6.34%的top-5错误率)相当,而且超过了它0.27%。图4分别描绘了SE-ResNets和ResNets的训练和验证曲线。虽然应该注意到SE块本身增加了深度,但是它们的计算效率极高,即使在扩展基础架构的深度达到收益递减的点上也能产生良好的回报。而且,我们看到在一系列不同深度的训练中性能改进是一致的,这表明SE块带来的改进可以与增加基础架构的深度结合使用。

    4ImageNet上的训练曲线。()ResNet-50SE-ResNet-50()ResNet-152SE-ResNet-152

    Integration with modern architectures. We next investigate the effect of combining SE blocks with another two state-of-the-art architectures, Inception-ResNet-v2 [38] and ResNeXt [43]. The Inception architecture constructs modules of convolutions as multibranch combinations of factorised filters, reflecting the Inception hypothesis [6] that spatial correlations and cross-channel correlations can be mapped independently. In contrast, the ResNeXt architecture asserts that richer representations can be obtained by aggregating combinations of sparsely connected (in the channel dimension) convolutional features. Both approaches introduce prior-structured correlations in modules. We construct SENet equivalents of these networks, SE-Inception-ResNet-v2 and SE-ResNeXt (the configuration of SE-ResNeXt-50 (32×4d) is given in Table 1). Like previous experiments, the same optimisation scheme is used for both the original networks and their SENet counterparts.

    与现代架构集成。接下来我们将研究SE块与另外两种最先进的架构Inception-ResNet-v2[38]和ResNeXt[43]结合的效果。Inception架构将卷积模块构造为分解滤波器的多分支组合,反映了Inception假设[6],即空间相关性和跨通道相关性可以独立映射。相比之下,ResNeXt架构断言,可以通过聚合稀疏连接(在通道维度上)的卷积特征的组合来获得更丰富的表示。两种方法都在模块中引入了预先结构化的相关性。我们构造了这些网络的SENet等价物,SE-Inception-ResNet-v2和SE-ResNeXt(表1给出了SE-ResNeXt-50(32×4d)的配置)。像前面的实验一样,原始网络和它们对应的SENet网络都使用相同的优化方案。

    The results given in Table 2 illustrate the significant performance improvement induced by SE blocks when introduced into both architectures. In particular, SE-ResNeXt-50 has a top-5 error of 5.49% which is superior to both its direct counterpart ResNeXt-50 (5.90% top-5 error) as well as the deeper ResNeXt-101 (5.57% top-5 error), a model which has almost double the number of parameters and computational overhead. As for the experiments of Inception-ResNet-v2, we conjecture the difference of cropping strategy might lead to the gap between their reported result and our re-implemented one, as their original image size has not been clarified in [38] while we crop the 299×299 region from a relative larger image (where the shorter edge is resized to 352). SE-Inception-ResNet-v2 (4.79% top-5 error) outperforms our reimplemented Inception-ResNet-v2 (5.21% top-5 error) by 0.42% (a relative improvement of 8.1%) as well as the reported result in [38]. The optimisation curves for each network are depicted in Fig. 5, illustrating the consistency of the improvement yielded by SE blocks throughout the training process.

    Figure 5. Training curves on ImageNet. (Left): ResNeXt-50 and SE-ResNeXt-50; (Right): Inception-ResNet-v2 and SE-Inception-ResNet-v2.

    2中给出的结果说明在将SE块引入到两种架构中会引起显著的性能改善。尤其是SE-ResNeXt-50top-5错误率是5.49%5.49%,优于于它直接对应的ResNeXt-505.90%5.90%top-5错误率)以及更深的ResNeXt-1015.57%5.57%top-5错误率),这个模型几乎有两倍的参数和计算开销。对于Inception-ResNet-v2的实验,我们猜测可能是裁剪策略的差异导致了其报告结果与我们重新实现的结果之间的差距,因为它们的原始图像大小尚未在[38]中澄清,而我们从相对较大的图像(其中较短边被归一化为352)中裁剪出299×299299×299大小的区域。SE-Inception-ResNet-v24.79%4.79%top-5错误率)比我们重新实现的Inception-ResNet-v25.21%5.21%top-5错误率)要低0.42%0.42%(相对改进了8.1%8.1%)也优于[38]中报告的结果。每个网络的优化曲线如图5所示,说明了在整个训练过程中SE块产生了一致的改进。

    5ImageNet的训练曲线。(): ResNeXt-50SE-ResNeXt-50()Inception-ResNet-v2SE-Inception-ResNet-v2

    Finally, we assess the effect of SE blocks when operating on a non-residual network by conducting experiments with the BN-Inception architecture [14] which provides good performance at a lower model complexity. The results of the comparison are shown in Table 2 and the training curves are shown in Fig. 6, exhibiting the same phenomena that emerged in the residual architectures. In particular, SE-BN-Inception achieves a lower top-5 error of 7.14% in comparison to BN-Inception whose error rate is 7.89%. These experiments demonstrate that improvements induced by SE blocks can be used in combination with a wide range of architectures. Moreover, this result holds for both residual and non-residual foundations.

    Figure 6. Training curves of BN-Inception and SE-BN-Inception on ImageNet.

    最后,我们通过对BN-Inception架构[14]进行实验来评估SE块在非残差网络上的效果,该架构以较低的模型复杂度提供了良好的性能。比较结果如表2所示,训练曲线如图6所示,表现出与残差架构中出现的相同的现象。尤其是,与错误率为7.89%的BN-Inception相比,SE-BN-Inception获得了7.14%的更低的top-5错误率。这些实验表明SE块带来的改进可以与多种架构结合使用。而且,这个结果对残差和非残差基础都成立。

    6BN-InceptionSE-BN-InceptionImageNet上的训练曲线。

    Results on ILSVRC 2017 Classification Competition. ILSVRC [30] is an annual computer vision competition which has proved to be a fertile ground for model developments in image classification. The training and validation data of the ILSVRC 2017 classification task are drawn from the ImageNet 2012 dataset, while the test set consists of an additional unlabelled 100K images. For the purposes of the competition, the top-5 error metric is used to rank entries.

    ILSVRC 2017分类竞赛的结果。ILSVRC[30]是一个年度计算机视觉竞赛,已被证明是图像分类模型发展的沃土。ILSVRC 2017分类任务的训练和验证数据来自ImageNet 2012数据集,而测试集包含额外的未标注的10万张图像。为了竞赛的目的,使用top-5错误率指标对参赛作品进行排名。

    SENets formed the foundation of our submission to the challenge where we won first place. Our winning entry comprised a small ensemble of SENets that employed a standard multi-scale and multi-crop fusion strategy to obtain a 2.251% top-5 error on the test set. This result represents a 25% relative improvement on the winning entry of 2016 (2.99% top-5 error). One of our high-performing networks is constructed by integrating SE blocks with a modified ResNeXt [43] (details of the modifications are provided in Appendix A). We compare the proposed architecture with the state-of-the-art models on the ImageNet validation set in Table 3. Our model achieves a top-1 error of 18.68% and a top-5 error of 4.47% using a 224×224 centre crop evaluation on each image (where the shorter edge is first resized to 256). To enable a fair comparison with previous models, we also provide a 320×320 centre crop evaluation, obtaining the lowest error rate under both the top-1 (17.28%) and the top-5 (3.79%) error metrics.

    Table 3. Single-crop error rates of state-of-the-art CNNs on ImageNet validation set. The size of test crop is 224×224 and 320×320/299×299 as in [10]. Our proposed model, SENet, shows a significant performance improvement on prior work.

    SENets是我们在挑战赛中赢得第一名的提交方案的基础。我们的获胜方案由一小组SENets集成组成,采用标准的多尺度和多裁剪图像融合策略,在测试集上获得了2.251%的top-5错误率。这个结果相对于2016年的获胜方案(2.99%的top-5错误率)改进了25%。我们的高性能网络之一是将SE块与修改后的ResNeXt[43]集成在一起构建的(附录A提供了这些修改的细节)。在表3中,我们将提出的架构与最新的模型在ImageNet验证集上进行了比较。我们的模型在对每张图像使用224×224中心裁剪评估(短边首先缩放到256)时取得了18.68%的top-1错误率和4.47%的top-5错误率。为了与以前的模型进行公平的比较,我们也提供了320×320的中心裁剪评估,在top-1(17.28%)和top-5(3.79%)错误率指标下都获得了最低的错误率。

    3。最新的CNNsImageNet验证集上单裁剪图像的错误率。测试的裁剪图像大小是224×224224×224[10]中的320×320320×320/299×299299×299。与前面的工作相比,我们提出的模型SENet表现出了显著的改进。

    6.2. Scene Classification

    Large portions of the ImageNet dataset consist of images dominated by single objects. To evaluate our proposed model in more diverse scenarios, we also evaluate it on the Places365-Challenge dataset [48] for scene classification. This dataset comprises 8 million training images and 36, 500 validation images across 365 categories. Relative to classification, the task of scene understanding can provide a better assessment of the ability of a model to generalise well and handle abstraction, since it requires the capture of more complex data associations and robustness to a greater level of appearance variation.

    6.2. 场景分类

    ImageNet数据集的大部分由单个对象支配的图像组成。为了在更多不同的场景下评估我们提出的模型,我们还在Places365-Challenge数据集[48]上对场景分类进行评估。该数据集包含800万张训练图像和365个类别的36500张验证图像。相对于分类,场景理解的任务可以更好地评估模型泛化和处理抽象的能力,因为它需要捕获更复杂的数据关联以及对更大程度外观变化的鲁棒性。

    We use ResNet-152 as a strong baseline to assess the effectiveness of SE blocks and follow the evaluation protocol in [33]. Table 4 shows the results of training a ResNet-152 model and a SE-ResNet-152 for the given task. Specifically, SE-ResNet-152 (11.01% top-5 error) achieves a lower validation error than ResNet-152 (11.61% top-5 error), providing evidence that SE blocks can perform well on different datasets. This SENet surpasses the previous state-of-the-art model Places-365-CNN [33] which has a top-5 error of 11.48% on this task.

    Table 4. Single-crop error rates (%) on the Places365 validation set.

    我们使用ResNet-152作为强大的基线来评估SE块的有效性,并遵循[33]中的评估协议。表4显示了针对给定任务训练ResNet-152模型和SE-ResNet-152的结果。具体而言,SE-ResNet-152(11.01%的top-5错误率)取得了比ResNet-152(11.61%的top-5错误率)更低的验证错误率,证明了SE块可以在不同的数据集上表现良好。这个SENet超过了先前最先进的模型Places-365-CNN[33],它在这个任务上有11.48%的top-5错误率。

    4Places365验证集上的单裁剪图像错误率(%)

    6.3. Analysis and Discussion

    Reduction ratio. The reduction ratio $r$ introduced in Eqn. (5) is an important hyperparameter which allows us to vary the capacity and computational cost of the SE blocks in the model. To investigate this relationship, we conduct experiments based on the SE-ResNet-50 architecture for a range of different $r$ values. The comparison in Table 5 reveals that performance does not improve monotonically with increased capacity. This is likely to be a result of enabling the SE block to overfit the channel interdependencies of the training set. In particular, we found that setting $r = 16$ achieved a good tradeoff between accuracy and complexity and consequently, we used this value for all experiments.

    Table 5. Single-crop error rates (%) on the ImageNet validation set and corresponding model sizes for the SE-ResNet-50 architecture at different reduction ratios $r$. Here original refers to ResNet-50.

    6.3. 分析和讨论

    减少比率。公式(5)中引入的减少比率$r$是一个重要的超参数,它允许我们改变模型中SE块的容量和计算成本。为了研究这种关系,我们基于SE-ResNet-50架构进行了一系列不同$r$值的实验。表5中的比较表明,性能并没有随着容量的增加而单调上升。这可能是SE块过拟合了训练集的通道相互依赖性的结果。尤其是我们发现设置$r=16$在精度和复杂度之间取得了很好的平衡,因此我们将这个值用于所有的实验。

    5 ImageNet验证集上单裁剪图像的错误率(%)SE-ResNet-50架构在不同减少比率rr下的模型大小。这里original指的是ResNet-50

    The role of Excitation. While SE blocks have been empirically shown to improve network performance, we would also like to understand how the self-gating excitation mechanism operates in practice. To provide a clearer picture of the behaviour of SE blocks, in this section we study example activations from the SE-ResNet-50 model and examine their distribution with respect to different classes at different blocks. Specifically, we sample four classes from the ImageNet dataset that exhibit semantic and appearance diversity, namely goldfish, pug, plane and cliff (example images from these classes are shown in Fig. 7). We then draw fifty samples for each class from the validation set and compute the average activations for fifty uniformly sampled channels in the last SE block in each stage (immediately prior to downsampling) and plot their distribution in Fig. 8. For reference, we also plot the distribution of average activations across all 1000 classes.

    Figure 7. Example images from the four classes of ImageNet.

    Figure 8. Activations induced by Excitation in the different modules of SE-ResNet-50 on ImageNet. The module is named as SE stageID blockID.

    激励的作用。虽然SE块从经验上显示出其可以改善网络性能,但我们也想了解自门激励机制在实践中是如何运作的。为了更清楚地描述SE块的行为,本节我们研究SE-ResNet-50模型的样本激活,并考察它们在不同块不同类别下的分布情况。具体而言,我们从ImageNet数据集中抽取了四个类,这些类表现出语义和外观多样性,即金鱼,哈巴狗,刨和悬崖(图7中显示了这些类别的示例图像)。然后,我们从验证集中为每个类抽取50个样本,并计算每个阶段最后的SE块中50个均匀采样通道的平均激活(紧接在下采样之前),并在图8中绘制它们的分布。作为参考,我们也绘制所有1000个类的平均激活分布。

    7ImageNet中四个类别的示例图像。

    8SE-ResNet-50不同模块在ImageNet上由Excitation引起的激活。模块名为“SE stageID blockID”。

    We make the following three observations about the role of Excitation in SENets. First, the distribution across different classes is nearly identical in lower layers, e.g. SE_2_3. This suggests that the importance of feature channels is likely to be shared by different classes in the early stages of the network. Interestingly however, the second observation is that at greater depth, the value of each channel becomes much more class-specific as different classes exhibit different preferences to the discriminative value of features e.g. SE_4_6 and SE_5_1. The two observations are consistent with findings in previous work [21, 46], namely that lower layer features are typically more general (i.e. class agnostic in the context of classification) while higher layer features have greater specificity. As a result, representation learning benefits from the recalibration induced by SE blocks which adaptively facilitates feature extraction and specialisation to the extent that it is needed. Finally, we observe a somewhat different phenomena in the last stage of the network. SE_5_2 exhibits an interesting tendency towards a saturated state in which most of the activations are close to 1 and the remainder are close to 0. At the point at which all activations take the value 1, this block would become a standard residual block. At the end of the network in the SE_5_3 (which is immediately followed by global pooling prior before classifiers), a similar pattern emerges over different classes, up to a slight change in scale (which could be tuned by the classifiers). This suggests that SE_5_2 and SE_5_3 are less important than previous blocks in providing recalibration to the network. This finding is consistent with the result of the empirical investigation in Sec. 4 which demonstrated that the overall parameter count could be significantly reduced by removing the SE blocks for the last stage with only a marginal loss of performance (<0.1% top-1 error).

    我们对SENetsExcitation的作用提出以下三点看法。首先,不同类别的分布在较低层中几乎相同,例如,SE_2_3。这表明在网络的最初阶段特征通道的重要性很可能由不同的类别共享。然而有趣的是,第二个观察结果是在更大的深度,每个通道的值变得更具类别特定性,因为不同类别对特征的判别性值具有不同的偏好。SE_4_6SE_5_1。这两个观察结果与以前的研究结果一致[21,46],即低层特征通常更普遍(即分类中不可知的类别),而高层特征具有更高的特异性。因此,表示学习从SE块引起的重新校准中受益,其自适应地促进特征提取和专业化到所需要的程度。最后,我们在网络的最后阶段观察到一个有些不同的现象。SE_5_2呈现出朝向饱和状态的有趣趋势,其中大部分激活接近于1,其余激活接近于0。在所有激活值取1的点处,该块将成为标准残差块。在网络的末端SE_5_3中(在分类器之前紧接着是全局池化),类似的模式出现在不同的类别上,尺度上只有轻微的变化(可以通过分类器来调整)。这表明,SE_5_2SE_5_3在为网络提供重新校准方面比前面的块更不重要。这一发现与第四节实证研究的结果是一致的,这表明,通过删除最后一个阶段的SE块,总体参数数量可以显著减少,性能只有一点损失(<0.1%0.1%top-1错误率)。

    7. Conclusion

    In this paper we proposed the SE block, a novel architectural unit designed to improve the representational capacity of a network by enabling it to perform dynamic channel-wise feature recalibration. Extensive experiments demonstrate the effectiveness of SENets which achieve state-of-the-art performance on multiple datasets. In addition, they provide some insight into the limitations of previous architectures in modelling channel-wise feature dependencies, which we hope may prove useful for other tasks requiring strong discriminative features. Finally, the feature importance induced by SE blocks may be helpful to related fields such as network pruning for compression.

    7. 结论

    在本文中,我们提出了SE块,这是一种新颖的架构单元,旨在通过使网络能够执行动态的按通道特征重新校准来提高网络的表示能力。大量实验证明了SENets的有效性,其在多个数据集上取得了最先进的性能。此外,它们还提供了一些关于以前的架构在建模按通道特征依赖性方面的局限性的洞察,我们希望这些洞察对其它需要强判别性特征的任务可能是有用的。最后,由SE块引起的特征重要性可能有助于相关领域,例如用于压缩的网络剪枝。

    Acknowledgements

    We would like to thank Professor Andrew Zisserman for his helpful comments and Samuel Albanie for his discussions and writing edit for the paper. We would like to thank Chao Li for his contributions in the memory optimisation of the training system. Li Shen is supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract number 2014-14071600010. The views and conclusions contained herein are those of the author and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purpose notwithstanding any copyright annotation thereon.

    致谢

    我们要感谢Andrew Zisserman教授的有益评论,并感谢Samuel Albanie的讨论并校订论文。我们要感谢Chao Li在训练系统内存优化方面的贡献。Li Shen由国家情报总监(ODNI),先期研究计划中心(IARPA)资助,合同号为2014-14071600010。本文包含的观点和结论属于作者的观点和结论,不应理解为ODNIIARPA或美国政府明示或暗示的官方政策或认可。尽管有任何版权注释,美国政府有权为政府目的复制和分发重印。

    References

    参考文献

    [1] S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In CVPR, 2016.

    [2] T. Bluche. Joint line segmentation and transcription for end-to-end handwritten paragraph recognition. In NIPS, 2016.

    [3] C.Cao, X.Liu, Y.Yang, Y.Yu, J.Wang, Z.Wang, Y.Huang, L. Wang, C. Huang, W. Xu, D. Ramanan, and T. S. Huang. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In ICCV, 2015.

    [4] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In CVPR, 2017.

    [5] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng. Dual path networks. arXiv:1707.01629, 2017.

    [6] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.

    [7] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman. Lip reading sentences in the wild. In CVPR, 2017.

    [8] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

    [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

    [10] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In ECCV, 2016.

    [11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 1997.

    [12] G. Huang, Z. Liu, K. Q. Weinberger, and L. Maaten. Densely connected convolutional networks. In CVPR, 2017.

    [13] Y. Ioannou, D. Robertson, R. Cipolla, and A. Criminisi. Deep roots: Improving CNN efficiency with hierarchical filter groups. In CVPR, 2017.

    [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.

    [15] L. Itti and C. Koch. Computational modelling of visual attention. Nature reviews neuroscience, 2001.

    [16] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE TPAMI, 1998.

    [17] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.

    [18] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.

    [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

    [20] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order boltzmann machine. In NIPS, 2010.

    [21] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

    [22] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400, 2013.

    [23] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.

    [24] A. Miech, I. Laptev, and J. Sivic. Learnable pooling with context gating for video classification. arXiv:1706.06905, 2017.

    [25] V. Mnih, N. Heess, A. Graves, and K. Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.

    [26] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, 2010.

    [27] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.

    [28] B. A. Olshausen, C. H. Anderson, and D. C. V. Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience, 1993.

    [29] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.

    [30] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.

    [31] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the fisher vector: Theory and practice. RR-8209, INRIA, 2013.

    [32] L. Shen, Z. Lin, and Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, 2016.

[33] L. Shen, Z. Lin, G. Sun, and J. Hu. Places401 and places365 models. https://github.com/lishen-shirley/Places2-CNNs, 2016.

    [34] L. Shen, G. Sun, Q. Huang, S. Wang, Z. Lin, and E. Wu. Multi-level discriminative dictionary learning with application to large scale image classification. IEEE TIP, 2015.

    [35] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

    [36] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In NIPS, 2015.

    [37] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS, 2014.

[38] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv:1602.07261, 2016.

    [39] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.

[40] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.

    [41] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.

    [42] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. In CVPR, 2017.

    [43] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.

    [44] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.

    [45] J. Yang, K. Yu, Y. Gong, and T. Huang. Linear spatial pyramid matching using sparse coding for image classification. In CVPR, 2009.

    [46] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In NIPS, 2014.

    [47] X. Zhang, Z. Li, C. C. Loy, and D. Lin. Polynet: A pursuit of structural diversity in very deep networks. In CVPR, 2017.

    [48] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba. Places: A 10 million image database for scene recognition. IEEE TPAMI, 2017.

    A. ILSVRC 2017 Classification Competition Entry Details

The SENet in Table 3 is constructed by integrating SE blocks to a modified version of the 64×4d ResNeXt-152 that extends the original ResNeXt-101 [43] by following the block stacking of ResNet-152 [9]. Further differences to the design and training (beyond the use of SE blocks) were as follows: (a) The number of first 1×1 convolutional channels for each bottleneck building block was halved to reduce the computation cost of the network with a minimal decrease in performance. (b) The first 7×7 convolutional layer was replaced with three consecutive 3×3 convolutional layers. (c) The down-sampling projection 1×1 with stride-2 convolution was replaced with a 3×3 stride-2 convolution to preserve information. (d) A dropout layer (with a drop ratio of 0.2) was inserted before the classifier layer to prevent overfitting. (e) Label-smoothing regularisation (as introduced in [40]) was used during training. (f) The parameters of all BN layers were frozen for the last few training epochs to ensure consistency between training and testing. (g) Training was performed with 8 servers (64 GPUs) in parallel to enable a large batch size (2048) and an initial learning rate of 1.0.

A. ILSVRC 2017分类竞赛参赛模型细节

表3中的SENet是通过将SE块集成到64×4d ResNeXt-152的一个修改版本中构建的,该版本遵循ResNet-152[9]的块堆叠方式对原始ResNeXt-101[43]进行了扩展。除SE块之外,设计和训练上的更多差异如下:(a)每个瓶颈构建块中第一个1×1卷积的通道数减半,以在性能下降最小的情况下降低网络的计算成本。(b)第一个7×7卷积层被三个连续的3×3卷积层取代。(c)下采样投影中步长为2的1×1卷积被替换为步长为2的3×3卷积以保留信息。(d)在分类器层之前插入一个丢弃层(丢弃比率为0.2)以防止过拟合。(e)训练期间使用标签平滑正则化(如[40]中所介绍的)。(f)在最后几个训练迭代周期,所有BN层的参数都被冻结,以确保训练和测试之间的一致性。(g)训练使用8台服务器(64个GPU)并行进行,以实现大批量(2048)和1.0的初始学习率。
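下面给出上文 (e) 中标签平滑正则化的一个最小示意(假设使用 PyTorch;函数名 label_smoothing_ce 与 eps=0.1 的取值均为本文为说明而设的假设,并非 SENet 原作者的实现):

```python
import torch
import torch.nn.functional as F

def label_smoothing_ce(logits, target, eps=0.1):
    """标签平滑交叉熵:软标签 = (1 - eps) * one-hot + eps / 类别数。"""
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_probs, eps / num_classes)            # 非目标类的均匀分量
    smooth.scatter_(1, target.unsqueeze(1), 1.0 - eps + eps / num_classes)  # 目标类分量
    return -(smooth * log_probs).sum(dim=1).mean()

# 用法示意(batch=4,1000 类,数值均为占位)
logits = torch.randn(4, 1000)
target = torch.randint(0, 1000, (4,))
loss = label_smoothing_ce(logits, target, eps=0.1)
```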

     

• 此版为中英文对照版,纯中文版请移步:[ZFNet纯中文版] Visualizing and Understanding Convolutional Networks 可视化和理解卷积网络 Matthew D. Zeiler Rob Fergus Dept....

    图像分类经典论文翻译汇总:[翻译汇总]

    翻译pdf文件下载:[下载地址]

此版为中英文对照版,纯中文版请移步:[ZFNet纯中文版]

    Visualizing and Understanding Convolutional Networks

    可视化和理解卷积网络

    Matthew D. Zeiler

    Rob Fergus

    Dept. of Computer Science, New York University, USA(美国纽约大学计算机科学系)

    {zeiler,fergus}@cs.nyu.edu

    Abstract

    Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark Krizhevsky et al. [18]. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

    摘要

大型卷积网络模型最近在ImageNet基准测试上表现出了令人印象深刻的分类性能Krizhevsky等[18]。然而,人们还没有明确地理解它们为什么表现如此之好,或者如何改进它们。在本文中,我们将探讨这两个问题。我们介绍了一种新的可视化技术,可以深入了解中间特征层的功能和分类器的操作。作为诊断的手段,这些可视化技术使我们能够找到在ImageNet分类基准上优于Krizhevsky等人的模型架构。我们还进行了消融研究,以发现不同模型层对性能的贡献。我们的研究表明,我们的ImageNet模型能很好地泛化到其他数据集:当softmax分类器被重新训练时,它令人信服地击败了Caltech-101和Caltech-256数据集上当前最先进的结果。

    1 Introduction

    Since their introduction by LeCun et al. [20] in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last 18 months, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. Ciresan et al. [4] demonstrate state-of-the-art performance on NORB and CIFAR-10 datasets. Most notably, Krizhevsky et al. [18] show record beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Following on from this work, Girshick et al. [10] have shown leading detection performance on the PASCAL VOC dataset. Several factors are responsible for this dramatic improvement in performance: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) better model regularization strategies, such as Dropout [14].

    1 引言

    20世纪90年代早期LeCun[20]提出卷积网络以来,卷积网络(convnets)在手写数字分类和人脸检测等任务中表现出色。在过去的18个月中,有几篇论文表明,他们还可以在更具挑战性的视觉分类任务中具有更出色的表现。Ciresan[4]表明其在NORBCIFAR-10数据集上最好的性能。最值得注意的是,Krizhevsky[18]ImageNet 2012分类基准测试中获得了创纪录的表现,他们的卷积模型实现了16.4%的错误率,而第二名的结果是26.1%。基于这项研究工作,Girshick[10]研究报道了PASCAL VOC数据集上最佳的检测性能。有几个因素导致这种性能的显着提高:(i)具有数百万个标记样本的更大规模的训练集的可用性;ii)强大的GPU实现,使非常大的模型的训练成为现实;iii)更好的模型正则化策略,例如Dropout [14]

    Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by Zeiler et al. [29], to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.

    尽管取得了令人鼓舞的进展,但对这些复杂模型的内部操作和行为,或者它们如何实现如此良好的性能,仍然了解甚少。从科学的角度来看,这是非常令人不满意的。如果没有清楚地了解它们如何以及为何起作用,那么更好的模型的开发过程将被简化为试错。在本文中,我们介绍了一种可视化技术,该技术揭示了激发模型中任何层的单个特征映射的输入激励。它还允许我们在训练期间观察特征的演变并诊断模型的潜在问题。我们提出的可视化技术使用Zeiler[29]提出的多层反卷积网络(deconvnet),即将特征激活投影回输入像素空间。我们还通过遮挡输入图像的部分来进行分类器输出的灵敏度分析,从而揭示图像的哪些部分对于分类是重要的。

    Using these tools, we start with the architecture of Krizhevsky et al. [18] and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by Hinton et al. [13] and others [1,26].

    使用这些工具,我们从Krizhevsky[18]的架构开始,探索不同的架构,发现在ImageNet上超越其结果的架构。然后,我们探索模型对其他数据集的泛化能力,只需重新训练softmax分类器。因此,这是一种受监督的预训练形式,这不同于Hinton[13]和其他人[1,26]推广的无监督预训练方法。

    1.1 Related Work

    Visualization: Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers alternate methods must be used. [8] find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit’s activation. This requires a careful initialization and does not give any information about the unit’s invariances. Motivated by the latter’s short-coming, [19] (extending an idea by [2]) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. Our approach is similar to contemporary work by Simonyan et al. [23] who demonstrate how saliency maps can be obtained from a convnet by projecting back from the fully connected layers of the network, instead of the convolutional features that we use. Girshick et al. [10] show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

    1.1 相关工作

可视化:可视化特征以获得关于网络的直觉是常见的做法,但主要局限于可以投影到像素空间的第一层。在较高层中,必须使用其它方法。[8]通过在图像空间中执行梯度下降来找到每个单元的最佳刺激,以最大化该单元的激活。这需要谨慎的初始化,并且不提供有关单元不变性的任何信息。受后者缺点的启发,[19](扩展了[2]的一个想法)展示了如何围绕最优响应以数值方式计算给定单元的Hessian矩阵,从而对不变性有所了解。问题是对于更高层,不变性非常复杂,因此很难通过简单的二次近似来捕获。相反,我们的方法提供了不变性的非参数视图,显示了训练集中的哪些模式激活了特征图。我们的方法类似于Simonyan等[23]的同期工作,他们展示了如何通过从网络的全连接层(而不是我们使用的卷积特征)投影回来而获得显著性图。Girshick等[10]展示了识别数据集中图像块的可视化,这些图像块与模型较高层的强激活相对应。我们的可视化的不同之处在于它们不仅仅是输入图像的裁剪,而是自上而下的投影,揭示每个图像块中刺激特定特征图的结构。

    Feature Generalization: Our demonstration of the generalization ability of convnet features is also explored in concurrent work by Donahue et al. [7] and Girshick et al. [10]. They use the convnet features to obtain state-of-the-art performance on Caltech-101 and the Sun scenes dataset in the former case, and for object detection on the PASCAL VOC dataset, in the latter.

    特征泛化:在Donahue[7]Girshick[10]的同期工作中也探讨了我们研究的卷积特征的泛化能力。他们使用卷积特征在前一个研究中获得Caltech-101Sun场景数据集的最佳性能,后者研究是在PASCAL VOC数据集上进行对象检测。

    2 Approach

We use standard fully supervised convnet models throughout the paper, as defined by LeCun et al. [20] and Krizhevsky et al. [18]. These models map a color 2D input image xi, via a series of layers, to a probability vector ŷi over the C different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function (relu(x) = max(x, 0)); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see [18] and [16]. The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.

    2 方法

    根据LeCunKrizhevsky等的定义,我们在整篇论文中使用标准的完全监督的卷积模型。这些模型通过一系列层将彩色2D输入图像xi映射到C个不同类别上的概率向量yi。每层包括:(i)前一层输出(或在第一层的情况下,输入图像)与一组学习过滤器的卷积; ii)通过整流线性函数(relu(x) = max(x0))传递响应;(iii[可选地]在局部邻域上的最大池化和(iv[可选地]局部对比操作,其对特征映射之间的响应进行归一化。有关这些操作的更多详细信息,请参见[18][16]。网络的前几层是传统的全连接网络,最后一层是softmax分类器。图3显示了我们许多实验中使用的模型。

We train these models using a large set of N labeled images {x, y}, where label yi is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare ŷi and yi. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Details of training are given in Section 3.

我们使用大量的N个标记图像{x, y}来训练这些模型,其中标签yi是指示真实类别的离散变量。适用于图像分类的交叉熵损失函数用于比较ŷi和yi。网络的参数(卷积层中的滤波器、全连接层中的权重矩阵和偏置)通过在整个网络中反向传播损失对参数的导数来训练,并通过随机梯度下降来更新。训练的细节见第3节。
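为帮助理解上段描述的"交叉熵损失 + 随机梯度下降更新参数"的训练方式,下面给出一个最小训练步骤的示意(假设使用 PyTorch;其中的占位模型、批大小等均为假设,并非论文中的真实 8 层网络):

```python
import torch
import torch.nn as nn

# 占位的小型卷积模型,仅用于演示训练步骤
model = nn.Sequential(
    nn.Conv2d(3, 16, 7, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1000))
criterion = nn.CrossEntropyLoss()                             # 交叉熵损失
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

x = torch.randn(8, 3, 224, 224)                               # 一个小批量的图像(占位)
y = torch.randint(0, 1000, (8,))                              # 对应的真实类别标签

optimizer.zero_grad()
loss = criterion(model(x), y)                                 # 比较预测 ŷ 与标签 y
loss.backward()                                               # 反向传播损失对参数的导数
optimizer.step()                                              # 随机梯度下降更新参数
```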

    2.1 Visualization with a Deconvnet

    Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) Zeiler et al. [29]. A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite. In Zeiler et al. [29], deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.

    2.1 通过反卷积可视化

理解卷积网络的操作需要解释中间层的特征活动。我们提出了一种新颖的方法,将这些活动映射回输入像素空间,显示最初在特征图中引起给定激活的输入模式。我们使用反卷积网络(deconvnet)Zeiler等[29]实现此映射。反卷积网络可以被认为是一个使用相同组件(滤波、池化)但方向相反的卷积模型,即不是将像素映射到特征,而是将特征映射到像素。在Zeiler等[29]中,反卷积网络是作为一种进行无监督学习的方式提出的。在这里,它们不用于任何学习,仅作为对已经训练好的卷积网络的探测。

    To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1(top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.

    Fig. 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet. The black/white bars are negative/positive activations within the feature map.

    如图1(上图)所示,为了检查一个卷积网络,网络的每个层都附有一个反卷积网络,提供了一条返回图像像素的连续路径。首先,将输入图像呈现给卷积网络并通过所有层计算特征。为了检查给定卷积网络的激活,我们将图层中的所有其他激活设置为零,并将特征图作为输入传递给附加的反卷积网络层。然后我们依次(i)反池化,(ii)纠正和(iii)过滤以重建下面的层中的活动,从而产生所选择的激活。 然后重复这一过程,直到达到输入像素空间。

    1.上图:反卷积层(左)与卷积层(右)相连。反卷积网络将从下面的层重建一个近似版本的卷积网络特征。下图:反卷积网络中使用switch反池化操作的示意图,switch记录卷积网络池化时每个池化区域(彩色区域)中局部最大值的位置。黑/白条在特征图中表示负/正激活。

    Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.

    反池化:在卷积网络中,最大池化操作是不可逆的,但是我们可以通过在一组切换变量中记录每个池化区域内的最大值的位置来获得近似逆。在反卷积网络中,反池化操作使用这些切换将来自上层的重建放置到适当的位置,从而保留激活的结构。有关步骤的插图,请参见图1(底部)。
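下面用 PyTorch 的 MaxPool2d(return_indices=True) 与 MaxUnpool2d 粗略示意"记录 switch 再反池化"的过程(仅为概念示意,特征尺寸、通道数与池化参数均为假设值,并非论文的具体实现):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=3, stride=2, return_indices=True)  # 记录每个池化区域最大值的位置(switch)
unpool = nn.MaxUnpool2d(kernel_size=3, stride=2)

feat = torch.randn(1, 96, 55, 55)             # 某一层的特征图(占位)
pooled, switches = pool(feat)                 # 前向:池化并保存 switch
# 反卷积网络中的反池化:利用 switch 把上层的重建值放回原位置,其余位置为 0
reconstructed = unpool(pooled, switches, output_size=feat.size())
```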

Rectification: The convnet uses relu non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.

纠正:卷积网络使用relu非线性来纠正特征图,从而确保特征图始终为正。为了在每一层获得有效的特征重建(也应该为正),我们将重建的信号通过relu非线性。

    Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To approximately invert this, the deconvnet uses transposed versions of the same filters (as other autoencoder models, such as RBMs), but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.

    滤波:卷积网络使用学习到的过滤器来卷积前一层的特征图。为了近似反转这一点,反卷积网络使用相同滤波器的转置版本(如其他自动编码器模型,例如RBM),但应用于纠正的映射图,而不是层下面的输出。实际上,这意味着垂直和水平翻转每个过滤器。

Note that we do not use any contrast normalization operations when in this reconstruction path. Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved. The whole procedure is similar to backpropping a single strong activation (rather than the usual gradients), i.e. computing ∂h/∂Xn, where h is the element of the feature map with the strong activation and Xn is the input image. However, it differs in that (i) the relu is imposed independently and (ii) contrast normalization operations are not used. A general shortcoming of our approach is that it only visualizes a single activation, not the joint activity present in a layer. Nevertheless, as we show in Fig. 6, these visualizations are accurate representations of the input pattern that stimulates the given feature map in the model: when the parts of the original input image corresponding to the pattern are occluded, we see a distinct drop in activity within the feature map.

请注意,在此重建路径中,我们没有使用任何对比度归一化操作。从较高层向下投影使用在前向传播过程中由卷积网络中的最大池化生成的switch设置。由于这些switch设置是给定输入图像所特有的,因此从单次激活获得的重建类似于原始输入图像的一小块,其结构根据它们对特征激活的贡献而加权。由于模型是判别式训练的,因此它们隐含地表明输入图像的哪些部分具有判别性。请注意,这些投影不是来自模型的样本,因为其中不涉及生成过程。整个过程类似于反向传播单个强激活(而不是通常的梯度),即计算∂h/∂Xn,其中h是特征图中具有强激活的元素,而Xn是输入图像。然而,它的不同之处在于(i)独立地施加relu,(ii)不使用对比度归一化操作。我们方法的一个总体缺点是它只能可视化单个激活,而不是一层中存在的联合激活。然而,正如我们在图6中所示,这些可视化是刺激模型中给定特征图的输入模式的精确表示:当对应于该模式的原始输入图像部分被遮挡时,我们会看到特征图中激活的明显下降。
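上面描述的"反池化—纠正—用同一组滤波器的转置版本滤波"可以用转置卷积粗略示意如下(假设使用 PyTorch;滤波器 W、步长与填充均为示意性假设,数值上并非论文模型):

```python
import torch
import torch.nn.functional as F

# 正向:用一组滤波器 W 对输入做卷积并通过 relu(形状与步长仅为示意)
W = torch.randn(96, 3, 7, 7)                       # 96 个 7x7 滤波器,作用于 3 通道输入
x = torch.randn(1, 3, 224, 224)
feat = F.relu(F.conv2d(x, W, stride=2, padding=1))

# 反向投影:先用 relu 纠正,再用同一组滤波器的转置版本"滤波"回输入像素空间
projected = F.conv_transpose2d(F.relu(feat), W, stride=2, padding=1, output_padding=1)
print(projected.shape)                             # torch.Size([1, 3, 224, 224])
```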

    3 Training Details

    We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by Krizhevsky et al. [18] for ImageNet classification. One difference is that the sparse connections used in Krizhevsky’s layers 3,4,5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 5, as described in Section 4.1.

    3 训练细节

我们现在描述将在第4节中被可视化的大型卷积网络模型。图3中所示的架构类似于Krizhevsky等[18]用于ImageNet分类的架构。一个区别是Krizhevsky的3、4、5层使用的稀疏连接(由于模型被分到2个GPU上)在我们的模型中被密集连接替换。与第1层和第2层有关的其他重要改动,是在检查了图5中的可视化之后作出的,详见4.1节。

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes) [6]. Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256×256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224×224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 10⁻², in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout [14] is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 10⁻² and biases are set to 0.

该模型在ImageNet 2012训练集上进行了训练(130万张图像,分布在1000个不同的类别中)[6]。每个RGB图像都经过预处理,方法是将最小边调整为256,裁剪中心256×256区域,减去逐像素均值(在所有图像上),然后取10个不同的裁剪块,尺寸为224×224(原图像及水平翻转的四个角+中心)。使用小批量大小为128的随机梯度下降来更新参数,从10⁻²的学习率开始,结合动量项0.9。当验证错误达到平稳时,我们在整个训练过程中手动降低学习率。Dropout [14]用于全连接层(6和7层),dropout比率为0.5。所有权重都初始化为10⁻²,偏置设置为0。

Visualization of the first layer filters during training reveals that a few of them dominate. To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of 10⁻¹ to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128, 128] range. As in Krizhevsky et al. [18], we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on [18].

在训练期间可视化第一层过滤器显示其中一些过滤器占主导地位。为了解决这个问题,我们将其RMS值超过固定半径10⁻¹的卷积层中的每个滤波器重新归一化到该固定半径。这一点至关重要,特别是在模型的第一层,输入图像大致在[-128, 128]范围内。如在Krizhevsky等[18]中,我们生成了多种不同的裁剪块和每个训练样例的翻转,以提高训练集的大小。我们在70个epochs之后停止了训练,基于[18]的实现在一个GTX580 GPU上花了大约12天。
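关于"把 RMS 超过固定半径 10⁻¹ 的滤波器逐个重新归一化到该半径"这一步,下面是一个示意性实现(假设使用 PyTorch;函数名 renormalize_filters 为本文自拟,并非论文代码):

```python
import torch

def renormalize_filters(weight, radius=0.1):
    """把 RMS 超过 radius 的滤波器逐个缩放回 radius,其余滤波器保持不变。"""
    with torch.no_grad():
        w = weight.view(weight.size(0), -1)            # 每行对应一个滤波器
        rms = w.pow(2).mean(dim=1).sqrt()              # 每个滤波器的均方根
        scale = torch.where(rms > radius, radius / rms, torch.ones_like(rms))
        w.mul_(scale.unsqueeze(1))                     # 原地缩放,等价于修改 weight
    return weight

# 用法示意:对第一层 7x7 卷积核做重新归一化(数值仅为占位)
first_layer_w = torch.randn(96, 3, 7, 7) * 0.5
renormalize_filters(first_layer_w, radius=0.1)
```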

    4 Convnet Visualization

    Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.

    4 卷积网络可视化

    使用第3节中描述的模型,我们现在使用反卷积网络可视化ImageNet验证集上的特征激活。

    Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. For a given feature map, we show the top 9 activations, each projected separately down to pixel space, revealing the different structures that excite that map and showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations which solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.

    Fig. 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Our reconstructions are not samples from the model: they are reconstructed patterns from the validation set that cause high activations in a given feature map. For each feature map we also show the corresponding image patches. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, cols 1). Best viewed in electronic form. The compression artifacts are a consequence of the 30Mb submission limit, not the reconstruction algorithm itself.

特征可视化:图2所示为训练完成后我们模型的特征可视化。对于给定的特征图,我们显示前9个激活,每个激活分别投影到像素空间,揭示激发该特征图的不同结构,并显示其对输入变形的不变性。除了这些可视化外,我们还会显示相应的图像块。图像块比可视化具有更大的变化,因为可视化只关注每个图像块内的判别性结构。例如,在第5层第1行第2列中,这些图像块似乎没有什么共同之处,但可视化显示该特征图聚焦于背景中的草,而不是前景对象。

    2.完全训练模型中的特征可视化。对于2-5层,我们在验证数据的特征映射的随机子集中显示前9个激活,使用我们的反卷积网络方法投影到像素空间。我们的重建不是来自模型的样本:它们是来自验证集的重建模式,其导致给定特征图中的高激活。对于每个特征图,我们还会显示相应的图像块。注意:(i)每个特征图内的强分组,(ii)较高层的较大不变性和(iii)图像的辨别部分的放大,例如,狗的眼睛和鼻子(第4层第1行第1列)。电子版观看效果最佳。由于30Mb的提交限制而使用了压缩算法,而不是重建算法本身。

    The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2,C4)). Layer 4 shows significant variation, and is more class-specific: dog faces (R1,C1); bird’s legs (R4,C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1,C11) and dogs (R4).

    每层的投影显示了网络中特征的分层特性。 2层响应角落和其他边缘/颜色连接。 3层具有更复杂的不变性,捕获相似的纹理(例如网格图案(第1行,第1列);文本(R2C4))。 4层显示出显着的变化,并且更具有特定类别:狗脸(R1C1; 鸟的腿(R4C2)。 5层显示具有显着姿势变化的整个对象,例如, 键盘(R1C11)和狗(R4)。

Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

    训练期间的特征演变:图4显示了在投射回像素空间的给定特征图内的最强激活(跨越所有训练示例)的训练期间的进展。 外观突然跳跃是由最强激活源自的图像变化引起的。 可以看到模型的较低层在几个时期内收敛。 然而,上层仅在相当多的时期(40-50)之后发展,证明需要让模型训练直到完全收敛。

    4.1 Architecture Selection

    While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al. ’s architecture (Fig. 5(a) & (c)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 5(b) & (d). More importantly, it also improves the classification performance as shown in Section 5.1.

    4.1 框架选择

虽然训练好的模型的可视化可以帮助深入了解其操作,但它也可以帮助我们在一开始就选择好的架构。通过可视化Krizhevsky等架构(图5(a)和(c))的第一层和第二层,各种问题都很明显。第一层滤波器是极高和极低频信息的混合,几乎没有涵盖中频信息。另外,第二层可视化呈现出由第一层卷积中使用的大步幅4引起的混叠伪影。为了解决这些问题,我们(i)将第一层滤波器尺寸从11x11缩小到7x7,并且(ii)将卷积的步幅由4改为2。如图5(b)和(d)所示,这种新架构在第1层和第2层特征中保留了更多信息。更重要的是,如第5.1节所示,它还提高了分类性能。

    4.2 Occlusion Sensitivity

    With image classification approaches, a natural question is if the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 6 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 6 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.

    Fig. 4. Evolution of a randomly chosen subset of model features through training. Each layer’s features are displayed in a different block. Within each block, we show a randomly chosen subset of features at epochs [1,2,5,10,20,30,40,64]. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic form.

    Fig. 5. (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features from Krizhevsky et al. [18]. (c): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer “dead” features. (d): Visualizations of 2nd layer features from Krizhevsky et al. [18]. (e): Visualizations of our 2nd layer features. These are cleaner, with no aliasing artifacts that are visible in (d).

    Fig. 6. Three test examples where we systematically cover up different portions of the scene with a gray square (1st column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) changes. (b): for each position of the gray scale, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The first row example shows the strongest feature to be the dog’s face. When this is covered-up the activity in the feature map decreases (blue area in (b)). (d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog’s face is obscured, the probability for “pomeranian” drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row, for most locations it is “pomeranian”, but if the dog’s face is obscured but not the ball, then it predicts “tennis ball”. In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps.

    Fig. 7. Caltech-256 classification performance as the number of training images per class is varied. Using only 6 training examples per class with our pre-trained feature extractor, we surpass best reported result by Bo et al. [3].

    4.2 遮挡敏感度

使用图像分类方法,一个自然的问题是模型是否真正识别了图像中对象的位置,或者只是使用了周围的上下文信息。图6通过用灰色方块系统地遮挡输入图像的不同部分并观察分类器的输出,尝试回答这个问题。这些示例清楚地表明模型能够定位场景中的对象,因为当对象被遮挡时,正确类别的概率会显著下降。图6还示出了来自顶部卷积层的最强特征图的可视化,以及该特征图中的激活(在空间位置上求和)作为遮挡物位置的函数。当遮挡物覆盖可视化中出现的图像区域时,我们会看到特征图中激活的明显下降。这表明可视化真实地对应于激活该特征图的图像结构,从而也验证了图4和图2中所示的其他可视化。
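下面的函数粗略示意了这种遮挡敏感度实验:用灰色方块在图像上滑动遮挡,并记录正确类别的输出概率(假设使用 PyTorch;patch、stride 等参数为示意性假设,并非论文中使用的具体数值,model 假定输出未归一化的类别分数):

```python
import torch

def occlusion_map(model, image, label, patch=32, stride=16, gray=0.0):
    """用灰色方块系统地遮挡图像的不同区域,记录正确类别概率的变化。"""
    model.eval()
    _, H, W = image.shape                       # 假设 image 为 (C, H, W) 张量
    heat = []
    with torch.no_grad():
        for top in range(0, H - patch + 1, stride):
            row = []
            for left in range(0, W - patch + 1, stride):
                occluded = image.clone()
                occluded[:, top:top + patch, left:left + patch] = gray   # 放置灰色方块
                prob = torch.softmax(model(occluded.unsqueeze(0)), dim=1)[0, label]
                row.append(prob.item())
            heat.append(row)
    return torch.tensor(heat)   # 值越低,说明被遮挡的区域对该类别越重要
```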

    4.通过训练随机选择的模型特征子集的演变。每个图层的特征都显示在不同的块中。在每个块内,我们在epoch[1,2,5,10,20,30,40,64]随机选择特征子集。可视化显示给定特征图的最强激活(在所有训练示例中),使用我们的反卷积方法向下投影到像素空间。人工增强色彩对比度,最好以电子形式观看。

    5.a):第一层特征没有特征尺度削减。请注意,一个特征占主导地位。(b):Krizhevsky[18]的第一层特征。(c):我们的第一层特征。较小的步长(2 vs 4)和卷积核尺寸(7x7 vs 11x11)导致更多特色和更少的特征。(d):Krizhevsky[18]的第二层特征的可视化。(e):我们的第二层特征的可视化。它们更干净,没有(d)中可见的混叠伪影。

    6.三个测试示例,我们系统地用灰色方块(第1列)覆盖场景的不同部分,并查看顶部(第5层)特征如何映射((b)和(c))和分类器输出((d)&(e))如何变化。(b):对于灰度区域的每个位置,我们在一个第5层特征图(在未被遮挡的图像中具有最强响应的那个)中记录总激活。(c):向下投影到输入图像(黑色方块)中的此特征地图的可视化,以及来自其他图像的该地图的可视化。第一行示例显示了最强的特征是狗的脸。当掩盖它时,特征图中的激活降低((b)中的蓝色区域)。(d):正确类概率的映射,作为灰色方块位置的函数。例如。当狗的脸被遮挡时,“博美犬”的概率显着下降。(e):最可能的标签作为遮挡位置的函数。例如。在第1排,对于大多数位置,它是博美犬,但如果狗的脸被遮挡而不是球,那么它预测网球。在第二个示例中,汽车上的文本是第5层中最强的特征,但分类器对车轮最敏感。第3个示例包含多个对象。第5层中最强的特征是挑选出了面部,但是分类器对狗敏感((d)中的蓝色区域),因为它使用多个特征映射。

    7. Caltech-256分类性能随着每个类别训练图像数量的变化而变化。使用每个类别仅用6个训练样例预训练的特征提取器,其结果超过Bo[3]的最佳报告结果。

    5 Experiments

    5. 实验

    5.1 ImageNet 2012

    This dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories. Table 1 shows our results on this dataset.

Table 1. ImageNet 2012/2013 classification error rates. The * indicates models that were trained on both ImageNet 2011 and 2012 training sets.

    5.1 ImageNet 2012

    该数据集由1.3M/50k/100k训练/验证/测试样例组成,分布在1000个类别中。表1显示了我们在此数据集上的结果。

    1. ImageNet 2012/2013分类错误率。*表示在ImageNet 20112012训练集上都经过训练的模型。

    Using the exact architecture specified in Krizhevsky et al. [18], we attempt to replicate their result on the validation set. We achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.

    使用Krizhevsky[18]指出的确切架构,我们尝试在验证集上复现他们的结果。我们达到了他们在ImageNet 2012验证集上报告的0.1%的错误率。

    Next we analyze the performance of our model with the architectural changes outlined in Section 4.1 (7×7 filters in layer 1 and stride 2 convolutions in layers 1 & 2). This model, shown in Fig. 3, significantly outperforms the architecture of Krizhevsky et al. [18], beating their single model result by 1.7% (test top-5). When we combine multiple models, we obtain a test error of 14.8%, an improvement of 1.6%. This result is close to that produced by the data-augmentation approaches of Howard [15], which could easily be combined with our architecture. However, our model is some way short of the winner of the 2013 Imagenet classification competition [28].

    Fig. 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6·6·256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.

接下来,我们分析了按第4.1节所述修改架构(第1层使用7×7过滤器,第1层和第2层使用步长为2的卷积)后模型的性能。如图3所示,该模型明显优于Krizhevsky等[18]的架构,以1.7%的优势(测试集top-5)击败了他们的单模型结果。当我们组合多个模型时,我们获得了14.8%的测试误差,提高了1.6%。这个结果接近于Howard[15]的数据增强方法所产生的结果,该方法可以很容易地与我们的架构相结合。然而,我们的模型与2013年ImageNet分类竞赛的获胜模型[28]仍有一定差距。

    3.我们8层卷积模型的架构。图像(具有3个颜色通道)224×224大小的裁剪作为输入。用96个不同的第一层滤波器(红色)对其进行卷积,每个滤波器的尺寸为7×7,步长为2。然后得到的特征图:(i)通过整流的线性函数(图中未显示),(ii)池化(在3×3区域内取最大值,步长为2)和(iii)在特征图上进行对比度标准化,得到96个不同的55×55个元素特征映射。在2,3,4,5层中重复类似的操作。最后两层为全连接,将顶部卷积层的特征以向量形式(6·6·256=9216维)作为其输入。最后一层是C个类别的softmax函数,C是类别的数量。所有卷积核和特征图都是方形的。

Varying ImageNet Model Sizes: In Table 2, we first explore the architecture of Krizhevsky et al. [18] by adjusting the size of layers, or removing them entirely. In each case, the model is trained from scratch with the revised architecture. Removing the fully connected layers (6,7) only gives a slight increase in error (in the following, we refer to top-5 validation error). This is surprising, given that they contain the majority of model parameters. Removing two of the middle convolutional layers also makes a relatively small difference to the error rate. However, removing both the middle convolution layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse. This would suggest that the overall depth of the model is important for obtaining good performance. We then modify our model, shown in Fig. 3. Changing the size of the fully connected layers makes little difference to performance (same for model of Krizhevsky et al. [18]). However, increasing the size of the middle convolution layers does give a useful gain in performance. But increasing these, while also enlarging the fully connected layers results in over-fitting.

    Table 2. ImageNet 2012 classification error rates with various architectural changes to the model of Krizhevsky et al. [18] and our model (see Fig. 3)

改变ImageNet模型尺寸:在表2中,我们首先通过调整层的大小或将其完全删除的方式,探索了Krizhevsky等[18]的架构。在每种情况下,修改架构后的模型都是从头开始训练的。删除全连接层(6、7层)只会略微增加错误率(下文中错误率均指top-5验证错误率)。这是令人惊讶的,因为这两层包含大多数的模型参数。移除两个中间卷积层对错误率的影响也相对较小。然而,同时去除中间卷积层和全连接层得到的仅有4层的模型,其性能显著变差。这表明模型的整体深度对于获得良好的性能至关重要。然后我们修改我们的模型(如图3所示)。改变全连接层的大小对性能几乎没有影响(对Krizhevsky等[18]的模型也是如此)。但是,增加中间卷积层的大小确实可以提高性能。不过在增大中间卷积层的同时也增大全连接层,则会导致过拟合。

    2.Krizhevsky[18]模型和我们的模型(见图3)经过不同改变的模型在ImageNet 2012上的分类错误率

    5.2 Feature Generalization

    The experiments above show the importance of the convolutional part of our ImageNet model in obtaining state-of-the-art performance. This is supported by the visualizations of Fig. 2 which show the complex invariances learned in the convolutional layers. We now explore the ability of these feature extraction layers to generalize to other datasets, namely Caltech-101 [9], Caltech-256 [11] and PASCAL VOC 2012. To do this, we keep layers 1-7 of our ImageNet-trained model fixed and train a new softmax classifier on top (for the appropriate number of classes) using the training images of the new dataset. Since the softmax contains relatively few parameters, it can be trained quickly from a relatively small number of examples, as is the case for certain datasets.

    5.2 特征泛化

上面的实验表明了我们ImageNet模型的卷积部分在获得最先进性能方面的重要性。图2的可视化可以佐证这一点,其显示了卷积层中学习到的复杂不变性。我们现在探索这些特征提取层泛化到其他数据集的能力,即Caltech-101[9]、Caltech-256[11]和PASCAL VOC 2012。为此,我们保持ImageNet训练的模型的1-7层固定,并使用新数据集的训练图像在顶端训练一个新的softmax分类器(输出为相应的类别数)。由于softmax包含的参数相对较少,因此可以从相对少量的样例中快速训练,某些数据集正是这种情况。
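下面示意"固定1-7层、只在顶端重新训练softmax分类器"的做法(假设使用 PyTorch;其中的占位特征提取器、输入尺寸与类别数仅为说明用,并非论文的真实模型):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 占位的"预训练特征提取器"(真实场景中应为已在 ImageNet 上训练好的 1-7 层)
pretrained = nn.Sequential(
    nn.Conv2d(3, 8, 3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())
for p in pretrained.parameters():
    p.requires_grad = False                    # 固定预训练层的参数

new_softmax = nn.Linear(8, 101)                # 针对新数据集(如 Caltech-101)的分类器
optimizer = torch.optim.SGD(new_softmax.parameters(), lr=1e-2)

x = torch.randn(4, 3, 64, 64)                  # 新数据集的一个小批量(占位)
y = torch.randint(0, 101, (4,))

optimizer.zero_grad()
logits = new_softmax(pretrained(x))            # 特征固定,只训练顶端分类器
loss = F.cross_entropy(logits, y)
loss.backward()
optimizer.step()
```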

    The experiments compare our feature representation, obtained from ImageNet, with the hand-crafted features used by other methods. In both our approach and existing ones the Caltech/PASCAL training data is only used to train the classifier. As they are of similar complexity (ours: softmax, others: linear SVM), the feature representation is crucial to performance. It is important to note that both representations were built using images beyond the Caltech and PASCAL training sets. For example, the hyper-parameters in HOG descriptors were determined through systematic experiments on a pedestrian dataset [5].

实验将我们从ImageNet获得的特征表示与其他方法使用的手工设计特征进行了比较。在我们的方法和现有方法中,Caltech/PASCAL训练数据仅用于训练分类器。由于二者的分类器具有相似的复杂性(我们的:softmax,其他:线性SVM),因此特征表示对性能至关重要。值得注意的是,两种表示都是使用Caltech和PASCAL训练集之外的图像构建的。例如,HOG描述符中的超参数是通过对行人数据集的系统实验来确定的[5]。

    We also try a second strategy of training a model from scratch, i.e. resetting layers 1-7 to random values and train them, as well as the softmax, on the training images of the PASCAL/Caltech dataset.

    我们还尝试了从头开始训练模型的第二种策略,即将1-7层重置为随机值,并与softmax一同在PASCAL / Caltech数据集的训练图像上进行训练。

One complication is that some of the Caltech datasets have some images that are also in the ImageNet training data. Using normalized correlation, we identified these few “overlap” images and removed them from our Imagenet training set and then retrained our Imagenet models, so avoiding the possibility of train/test contamination.

一个复杂的问题是,一些Caltech数据集中的图像也存在于ImageNet训练数据中。使用归一化相关性,我们识别出这些少量的"重叠"图像,并将它们从我们的Imagenet训练集中移除,然后重新训练我们的Imagenet模型,从而避免了训练/测试污染的可能性。

    Caltech-101: We follow the procedure of [9] and randomly select 15 or 30 images per class for training and test on up to 50 images per class reporting the average of the per-class accuracies in Table 3, using 5 train/test folds. Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from [3] by 2.2%. Our result agrees with the recently published result of Donahue et al. [7], who obtain 86.1% accuracy (30 imgs/class). The convnet model trained from scratch however does terribly, only achieving 46.5%, showing the impossibility of training a large convnet on such a small dataset.

    Table 3. Caltech-101 classification accuracy for our convnet models, against two leading alternate approaches

Caltech-101:我们按照[9]的步骤,每个类别随机选择15或30张图像进行训练,并且每个类别在最多50张图像上进行测试,使用5次训练/测试划分,表3报告了每类准确度的平均值。30张图像/类别的训练需要17分钟。预训练模型以2.2%的优势击败了[3]中报告的30图像/类别的最佳结果。我们的结果与最近公布的Donahue等[7]的结果一致,后者获得86.1%的准确率(30图像/类别)。然而,从头开始训练的卷积网络模型表现非常糟糕,只达到了46.5%,表明在如此小的数据集上训练大型卷积网络几乎不可行。

    3.我们的卷积网络模型与两种领先的类似方法在Caltech-101上的分类准确度比较

    Caltech-256: We follow the procedure of [11], selecting 15, 30, 45, or 60 training images per class, reporting the average of the per-class accuracies in Table 4. Our ImageNet-pretrained model beats the current state-of-the-art results obtained by Bo et al. [3] by a significant margin: 74.2% vs 55.2% for 60 training images/class. However, as with Caltech-101, the model trained from scratch does poorly. In Fig. 7, we explore the “one-shot learning” [9] regime. With our pretrained model, just 6 Caltech-256 training images are needed to beat the leading method using 10 times as many images. This shows the power of the ImageNet feature extractor.

    Table 4. Caltech 256 classification accuracies

Caltech-256:我们按照[11]的步骤,每个类别选择15、30、45或60张训练图像,表4中报告了每类准确度的平均值。我们的ImageNet预训练模型远远胜过Bo等[3]取得的目前最好的结果:60训练图像/类别时为74.2% vs 55.2%。然而,与Caltech-101一样,从头开始训练的模型表现很差。在图7中,我们探索了"一次性学习"[9]方式。使用我们预训练的模型,只需要6张Caltech-256训练图像就可以击败使用10倍之多图像的领先方法。这显示了ImageNet特征提取器的强大功能。

    4. Caltech 256分类准确率

    PASCAL 2012: We used the standard training and validation images to train a 20-way softmax on top of the ImageNet-pretrained convnet. This is not ideal, as PASCAL images can contain multiple objects and our model just provides a single exclusive prediction for each image. Table 5 shows the results on the test set, comparing to the leading methods: the top 2 entries in the competition and concurrent work from Oquab et al. [21] who use a convnet with a more appropriate classifier. The PASCAL and ImageNet images are quite different in nature, the former being full scenes unlike the latter. This may explain our mean performance being 3.2% lower than the leading competition result [27], however we do beat them on 5 classes, sometimes by large margins.

Table 5. PASCAL 2012 classification results, comparing our Imagenet-pretrained convnet against the leading two methods and the recent approach of Oquab et al. [21]

PASCAL 2012:我们使用标准的训练和验证图像,在ImageNet预训练的卷积网络之上训练一个20类的softmax。这并不理想,因为PASCAL图像可能包含多个对象,而我们的模型只为每个图像提供一个唯一的预测。表5显示了测试集上的结果,并与领先方法进行比较:竞赛中的前2名,以及Oquab等[21]的同期工作,后者使用了带有更合适分类器的卷积网络。PASCAL和ImageNet图像在本质上差别很大,前者是完整的场景,而后者不是。这可以解释我们的平均性能比领先的竞赛结果[27]低3.2%,但我们确实在5个类别上击败了它们,有时优势很大。

    5. PASCAL 2012分类结果,我们的Imagenet预训练卷积网络与领先的两种方法和Oquab[21]近期的方法进行比较

    5.3 Feature Analysis

    We explore how discriminative the features in each layer of our Imagenet-pretrained model are. We do this by varying the number of layers retained from the ImageNet model and place either a linear SVM or softmax classifier on top. Table 6 shows results on Caltech-101 and Caltech-256. For both datasets, a steady improvement can be seen as we ascend the model, with best results being obtained by using all layers. This supports the premise that as the feature hierarchies become deeper, they learn increasingly powerful features.

    Table 6. Analysis of the discriminative information contained in each layer of feature maps within our ImageNet-pretrained convnet. We train either a linear SVM or softmax on features from different layers (as indicated in brackets) from the convnet. Higher layers generally produce more discriminative features.

    5.3特征分析

我们探讨了Imagenet预训练模型每一层特征的判别能力。我们通过改变从ImageNet模型中保留的层数,并在其上放置线性SVM或softmax分类器来实现这一点。表6显示了在Caltech-101和Caltech-256数据集上的结果。对于这两个数据集,随着所用层数的加深可以看到稳定的提升,使用所有层时获得最佳结果。这支持了这样一个前提:随着特征层次结构变得更深,它们会学习到越来越强大的特征。

    6.我们ImageNet预训练卷积网络中每层特征映射中包含判别信息的分析。我们对卷积网络不同层(如括号中所示)的特征上训练线性SVMsoftmax分类器。较高层通常产生更多的辨别特征。

    6 Discussion

We explored large convolutional neural network models, trained for image classification, in a number of ways. First, we presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers. We also show how these visualizations can be used to identify problems with the model and so obtain better results, for example improving on Krizhevsky et al. ’s [18] impressive ImageNet 2012 result. We then demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context. An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model’s performance.

    6讨论

我们以多种方式探索了这些为图像分类而训练的大型卷积神经网络模型。首先,我们提出了一种可视化模型内部激活的新方法。这表明这些特征远非随机的、不可解释的模式;相反,随着层次的提升,它们显示出许多直观上令人满意的属性,例如组合性、不断增强的不变性和类别区分度。我们还展示了如何使用这些可视化来识别模型的问题,从而获得更好的结果,例如改进Krizhevsky等[18]令人印象深刻的ImageNet 2012结果。然后,我们通过一系列遮挡实验证明,该模型虽然是为分类而训练的,但对图像中的局部结构非常敏感,并不只是利用宽泛的场景上下文。对该模型的消融研究表明,网络具有一定的最小深度(而非任何单独的部分)对模型的性能至关重要。

    Finally, we showed how the ImageNet trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that we can beat the best reported results, in the latter case by a significant margin. Our convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias [25], although it was still within 3.2% of the best reported result, despite no tuning for the task. For example, our performance might improve if a different loss function was used that permitted multiple objects per image. This would naturally enable the networks to tackle the object detection as well.

最后,我们展示了ImageNet训练的模型如何很好地泛化到其他数据集。对于Caltech-101和Caltech-256,数据集足够相似,我们击败了已报告的最佳结果,在后一个数据集上优势显著。我们的卷积网络模型在PASCAL数据上的泛化稍差,可能是因为存在数据集偏差[25],不过尽管没有针对该任务进行调整,它仍然在最佳报告结果的3.2%之内。例如,如果使用允许每个图像有多个对象的不同损失函数,我们的性能可能会提高。这自然也会使网络能够处理目标检测问题。

    Acknowledgments

    The authors would like to thank Yann LeCun for helpful discussions and acknowledge support from NSERC, NSF grant #1116923 and Microsoft Research.

    致谢

    作者们感谢Yann LeCun的富有帮助的讨论,感谢NSERCNSF1116923资助和微软研究院的支持。

    References

    参考文献

    1. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: NIPS, pp. 153–160 (2007)

    2. Berkes, P., Wiskott, L.: On the analysis and interpretation of inhomogeneous quadratic forms as receptive fields. Neural Computation (2006)

    3. Bo, L., Ren, X., Fox, D.: Multipath sparse coding using hierarchical matching pursuit. In: CVPR (2013)

    4. Ciresan, D.C., Meier, J., Schmidhuber, J.: Multi-column deep neural networks for image classification. In: CVPR (2012)

    5. Dalal, N., Triggs, B.: Histograms of oriented gradients for pedestrian detection. In: CVPR (2005)

    6. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: CVPR 2009 (2009)

    7. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv:1310.1531 (2013)

    8. Erhan, D., Bengio, Y., Courville, A., Vincent, P.: Visualizing higher-layer features of a deep network. Technical report, University of Montreal (2009)

    9. Fei-fei, L., Fergus, R., Perona, P.: One-shot learning of object categories. IEEE Trans. PAMI (2006)

    10. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv:1311.2524 (2014)

    11. Griffin, G., Holub, A., Perona, P.: The caltech 256. Caltech Technical Report (2006)

    12. Gunji, N., Higuchi, T., Yasumoto, K., Muraoka, H., Ushiku, Y., Harada, T., Kuniyoshi, Y.: Classification entry. Imagenet Competition (2012)

    13. Hinton, G.E., Osindero, S., Teh, Y.: A fast learning algorithm for deep belief nets. Neural Computation 18, 1527–1554 (2006)

    14. Hinton, G.E., Srivastave, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. In: arXiv:1207.0580 (2012)

    15. Howard, A.G.: Some improvements on deep convolutional neural network based image classification. arXiv 1312.5402 (2013)

    16. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.:What is the best multi-stage architecture for object recognition? In: ICCV (2009)

17. Jianchao, Y., Kai, Y., Yihong, G., Thomas, H.: Linear spatial pyramid matching using sparse coding for image classification. In: CVPR (2009)

    18. Krizhevsky, A., Sutskever, I., Hinton, G.: Imagenet classification with deep convolutional neural networks. In: NIPS (2012)

    19. Le, Q.V., Ngiam, J., Chen, Z., Chia, D., Koh, P., Ng, A.Y.: Tiled convolutional neural networks. In: NIPS (2010)

    20. LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Comput. 1(4), 541–551 (1989)

    21. Oquab, M., Bottou, L., Laptev, I., Sivic, J.: Learning and transferring mid-level image representations using convolutional neural networks. In: CVPR (2014)

    22. Sande, K., Uijlings, J., Snoek, C., Smeulders, A.: Hybrid coding for selective search. In: PASCAL VOC Classification Challenge 2012 (2012)

    23. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv 1312.6034v1 (2013)

    24. Sohn, K., Jung, D., Lee, H., Hero III, A.: Efficient learning of sparse, distributed, convolutional feature representations for object recognition. In: ICCV (2011)

    25. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: CVPR (2011)

    26. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.A.: Extracting and composing robust features with denoising autoencoders. In: ICML, pp. 1096–1103 (2008)

    27. Yan, S., Dong, J., Chen, Q., Song, Z., Pan, Y., Xia, W., Huang, Z., Hua, Y., Shen, S.: Generalized hierarchical matching for sub-category aware object classification. In: PASCAL VOC Classification Challenge 2012 (2012)

28. Zeiler, M.: Clarifai (2013), http://www.image-net.org/challenges/LSVRC/2013/results.php

    29. Zeiler, M., Taylor, G., Fergus, R.: Adaptive deconvolutional networks for mid and high level feature learning. In: ICCV (2011)

     

     

  • AAL模版 中英文对照


    来源:http://52brain.com/thread-17336-1-1.html

    Brodmann分区是一个根据细胞结构将大脑皮层划分为一系列解剖区域的系统。神经解剖学中所谓细胞结构(Cytoarchitecture),是指在染色的脑组织中观察到的神经元的组织方式。Brodmann分区最早由德国神经科医生科比尼安·布洛德曼(Korbinian Brodmann)提出。他的分区系统包括每个半球的52个区域。其中一些区域今天已经被细分,例如23区被分为23a和23b区等。从物种间差异来讲,同一分区号码在不同的物种间并不一定代表相似的区域。具体的分区名称见http://zh.wikipedia.org/wiki/%E5%B8%83%E7%BD%97%E5%BE%B7%E6%9B%BC%E5%88%86%E5%8C%BA%E7%B3%BB%E7%BB%9F

AAL全称是Anatomical Automatic Labeling,AAL分区是由 Montreal Neurological Institute (MNI)机构提供的。AAL模板一共有116个区域,但只有90个属于大脑,剩余26个属于小脑结构,研究得较少。

    116个分区名称见 http://wenku.baidu.com/view/3f6d558502d276a200292e67.html  具体如下表:

    如上摘自:http://hi.baidu.com/wangjq_17/item/0051a6115a3b272eb83180f0  谢之

    感谢:http://wenku.baidu.com/view/3f6d558502d276a200292e67.html

    表中左边的小脑具体命名,源自:

    http://cercor.oxfordjournals.org/content/21/1/233/T1.expansion.html

     

中文名称 | Mricro编号 | Mricro命名 | Mricro不明信息
中央前回 | 1 | Precentral_L | 2001
中央前回 | 2 | Precentral_R | 2002
背外侧额上回 | 3 | Frontal_Sup_L | 2101
背外侧额上回 | 4 | Frontal_Sup_R | 2102
眶部额上回 | 5 | Frontal_Sup_Orb_L | 2111
眶部额上回 | 6 | Frontal_Sup_Orb_R | 2112
额中回 | 7 | Frontal_Mid_L | 2201
额中回 | 8 | Frontal_Mid_R | 2202
眶部额中回 | 9 | Frontal_Mid_Orb_L | 2211
眶部额中回 | 10 | Frontal_Mid_Orb_R | 2212
岛盖部额下回 | 11 | Frontal_Inf_Oper_L | 2301
岛盖部额下回 | 12 | Frontal_Inf_Oper_R | 2302
三角部额下回 | 13 | Frontal_Inf_Tri_L | 2311
三角部额下回 | 14 | Frontal_Inf_Tri_R | 2312
眶部额下回 | 15 | Frontal_Inf_Orb_L | 2321
眶部额下回 | 16 | Frontal_Inf_Orb_R | 2322
中央沟盖 | 17 | Rolandic_Oper_L | 2331
中央沟盖(Rolandic operculum) | 18 | Rolandic_Oper_R | 2332
补充运动区 | 19 | Supp_Motor_Area_L | 2401
补充运动区 | 20 | Supp_Motor_Area_R | 2402
嗅皮质 | 21 | Olfactory_L | 2501
嗅皮质 | 22 | Olfactory_R | 2502
内侧额上回 | 23 | Frontal_Sup_Medial_L | 2601
内侧额上回 | 24 | Frontal_Sup_Medial_R | 2602
眶内额上回 | 25 | Frontal_Mid_Orb_L | 2611
眶内额上回 | 26 | Frontal_Mid_Orb_R | 2612
回直肌 | 27 | Rectus_L | 2701
回直肌 | 28 | Rectus_R | 2702
脑岛 | 29 | Insula_L | 3001
脑岛 | 30 | Insula_R | 3002
前扣带和旁扣带脑回 | 31 | Cingulum_Ant_L | 4001
前扣带和旁扣带脑回 | 32 | Cingulum_Ant_R | 4002
内侧和旁扣带脑回 | 33 | Cingulum_Mid_L | 4011
内侧和旁扣带脑回 | 34 | Cingulum_Mid_R | 4012
后扣带回 | 35 | Cingulum_Post_L | 4021
后扣带回 | 36 | Cingulum_Post_R | 4022
海马 | 37 | Hippocampus_L | 4101
海马 | 38 | Hippocampus_R | 4102
海马旁回 | 39 | ParaHippocampal_L | 4111
海马旁回 | 40 | ParaHippocampal_R | 4112
杏仁核 | 41 | Amygdala_L | 4201
杏仁核 | 42 | Amygdala_R | 4202
距状裂周围皮层 | 43 | Calcarine_L | 5001
距状裂周围皮层 | 44 | Calcarine_R | 5002
楔叶 | 45 | Cuneus_L | 5011
楔叶 | 46 | Cuneus_R | 5012
舌回 | 47 | Lingual_L | 5021
舌回 | 48 | Lingual_R | 5022
枕上回 | 49 | Occipital_Sup_L | 5101
枕上回 | 50 | Occipital_Sup_R | 5102
枕中回 | 51 | Occipital_Mid_L | 5201
枕中回 | 52 | Occipital_Mid_R | 5202
枕下回 | 53 | Occipital_Inf_L | 5301
枕下回 | 54 | Occipital_Inf_R | 5302
梭状回 | 55 | Fusiform_L | 5401
梭状回 | 56 | Fusiform_R | 5402
中央后回 | 57 | Postcentral_L | 6001
中央后回 | 58 | Postcentral_R | 6002
顶上回 | 59 | Parietal_Sup_L | 6101
顶上回 | 60 | Parietal_Sup_R | 6102
顶下缘角回 | 61 | Parietal_Inf_L | 6201
顶下缘角回 | 62 | Parietal_Inf_R | 6202
缘上回 | 63 | SupraMarginal_L | 6211
缘上回 | 64 | SupraMarginal_R | 6212
角回 | 65 | Angular_L | 6221
角回 | 66 | Angular_R | 6222
楔前叶 | 67 | Precuneus_L | 6301
楔前叶 | 68 | Precuneus_R | 6302
中央旁小叶 | 69 | Paracentral_Lobule_L | 6401
中央旁小叶 | 70 | Paracentral_Lobule_R | 6402
尾状核 | 71 | Caudate_L | 7001
尾状核 | 72 | Caudate_R | 7002
豆状壳核 | 73 | Putamen_L | 7011
豆状壳核 | 74 | Putamen_R | 7012
豆状苍白球 | 75 | Pallidum_L | 7021
豆状苍白球 | 76 | Pallidum_R | 7022
丘脑 | 77 | Thalamus_L | 7101
丘脑 | 78 | Thalamus_R | 7102
颞横回 | 79 | Heschl_L | 8101
颞横回 | 80 | Heschl_R | 8102
颞上回 | 81 | Temporal_Sup_L | 8111
颞上回 | 82 | Temporal_Sup_R | 8112
颞极:颞上回 | 83 | Temporal_Pole_Sup_L | 8121
颞极:颞上回 | 84 | Temporal_Pole_Sup_R | 8122
颞中回 | 85 | Temporal_Mid_L | 8201
颞中回 | 86 | Temporal_Mid_R | 8202
颞极:颞中回 | 87 | Temporal_Pole_Mid_L | 8211
颞极:颞中回 | 88 | Temporal_Pole_Mid_R | 8212
颞下回 | 89 | Temporal_Inf_L | 8301
颞下回 | 90 | Temporal_Inf_R | 8302
Cerebellum_Superior | 91 | Cerebelum_Crus1_L | 9001
Cerebellum_Superior | 92 | Cerebelum_Crus1_R | 9002
Cerebellum_Inferior | 93 | Cerebelum_Crus2_L | 9011
Cerebellum_Inferior | 94 | Cerebelum_Crus2_R | 9012
Cerebellum_Superior | 95 | Cerebelum_3_L | 9021
Cerebellum_Superior | 96 | Cerebelum_3_R | 9022
Cerebellum_Superior | 97 | Cerebelum_4_5_L | 9031
Cerebellum_Superior | 98 | Cerebelum_4_5_R | 9032
Cerebellum_Superior | 99 | Cerebelum_6_L | 9041
Cerebellum_Superior | 100 | Cerebelum_6_R | 9042
Cerebellum_Inferior | 101 | Cerebelum_7b_L | 9051
Cerebellum_Inferior | 102 | Cerebelum_7b_R | 9052
Cerebellum_Inferior | 103 | Cerebelum_8_L | 9061
Cerebellum_Inferior | 104 | Cerebelum_8_R | 9062
Cerebellum_Inferior | 105 | Cerebelum_9_L | 9071
Cerebellum_Inferior | 106 | Cerebelum_9_R | 9072
Cerebellum_Inferior | 107 | Cerebelum_10_L | 9081
Cerebellum_Inferior | 108 | Cerebelum_10_R | 9082
Vermis | 109 | Vermis_1_2 | 9100
Vermis | 110 | Vermis_3 | 9110
Vermis | 111 | Vermis_4_5 | 9120
Vermis | 112 | Vermis_6 | 9130
Vermis | 113 | Vermis_7 | 9140
Vermis | 114 | Vermis_8 | 9150
Vermis | 115 | Vermis_9 | 9160
Vermis | 116 | Vermis_10 | 9170
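如果需要在代码中按序号查询上表,可以把它整理成一个简单的查找表,例如(Python 示意,只列出前几项,函数名与字典名为本文自拟,完整 116 项可按上表同样格式补全):

```python
# 按 AAL 序号(1-116)查询 Mricro 命名、中文名称与编码的简单查找表(仅示意前几项)
aal_regions = {
    1: ("Precentral_L", "中央前回", 2001),
    2: ("Precentral_R", "中央前回", 2002),
    3: ("Frontal_Sup_L", "背外侧额上回", 2101),
    4: ("Frontal_Sup_R", "背外侧额上回", 2102),
    90: ("Temporal_Inf_R", "颞下回", 8302),
}

def lookup_aal(index):
    """返回 (Mricro 命名, 中文名称, 编码);序号不存在时返回 None。"""
    return aal_regions.get(index)

print(lookup_aal(3))   # ('Frontal_Sup_L', '背外侧额上回', 2101)
```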

     

    转载于:https://www.cnblogs.com/minks/p/5480818.html

  • CMMI v1.3 中英文对照版

    几度春秋几度红,红色总是惹眼的,软件过程改进的这片天在中国的陆地上已经泛起微红,那点点火光也在思步网的一...是一种大家服务潜藏于脑海的意识,是一股不到最后绝不说“不”的韧劲——对,就是有着这样意识和韧劲
  • 奥巴马就职演说中英文对照版

美利坚合众国的同胞们:今天我站在这里,为眼前的一切所折服,为你们的信任而感动,对我们先辈的付出铭感于怀。感谢布什总统对这个国家的勤恳服务,也感谢他在整个交接过程中所表现出的宽容与风度。 历史上曾有四十...
  • 27001 中英文对照

    2010-06-24 16:47:58
    ISO 27001 中英文对照,版本ISO/IEC 27001:2005
• 本书第2版扩展了动画、交互式图形以及SVG编程等内容。交互式的在线示例让你很容易在Web浏览器中实验SVG的特性。本书还为经验丰富的设计师准备了6个附录,解释了XML标记和CSS样式等基本概念,因此即使你没有网页设计...
• 该jdk版本为1.8,注释是经过翻译的中英文双语版本,翻译的不一定尽善尽美,欢迎提出意见建议.以下是简单例子. /** *Increases the capacity to ensure that it can hold at least the number of elements specified by...
• 1.本人只上传清晰文字,如果是翻译书籍尽量为大家找齐中英文对照的版本 2.上传需要资源分都是CSDN逼的,没有资源分我也没有办法下载大家的好资源
  • C++ Primer, Fourth Edition By Stanley B....本资源是 chm 格式,中英文对照阅读版本,大小 1.7 MB。 另,《C++Primer 第4-习题解答(完整)+源码》下载地址:http://download.csdn.net/source/3316842。
