• Salient Object Detection
    2021-10-10 13:25:42

    (CVPR’17) Learning to Detect Salient Objects with Image-level Supervision

    Deep Neural Networks (DNNs) have substantially improved the state of the art in salient object detection. However, training DNNs requires costly pixel-level annotations. In this paper, we leverage the observation that image-level tags provide important cues about foreground salient objects, and develop a weakly supervised learning method for saliency detection using image-level tags only. The Foreground Inference Network (FIN) is introduced for this challenging task. In the first stage of our training method, FIN is jointly trained with a fully convolutional network (FCN) for image-level tag prediction. A global smooth pooling layer is proposed, enabling the FCN to assign object category tags to the corresponding object regions, while FIN captures all potential foreground regions with the predicted saliency maps. In the second stage, FIN is fine-tuned with its predicted saliency maps as ground truth. To refine the ground truth, an iterative Conditional Random Field is developed to enforce spatial label consistency and further boost performance. Our method reduces annotation effort and allows the use of existing large-scale training sets with image-level tags. Our model runs at 60 FPS, outperforms unsupervised methods by a large margin, and achieves comparable or even superior performance to fully supervised counterparts.
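    The global smooth pooling idea (letting an image-level tag loss supervise a spatial score map) can be illustrated with log-sum-exp (LSE) pooling, a standard smooth interpolation between global average and global max pooling. A minimal NumPy sketch, with the caveat that the paper's actual layer is a different construction and `lse_pool`/`r` are my names:

```python
import numpy as np

def lse_pool(score_map, r=5.0):
    """Log-sum-exp (smooth) pooling over a per-class score map.

    Interpolates between global average pooling (r -> 0) and global
    max pooling (r -> inf), so an image-level tag loss can still push
    gradient toward the most responsive spatial locations.
    """
    s = np.asarray(score_map, dtype=float)
    m = s.max()                        # subtract max for numerical stability
    return m + np.log(np.mean(np.exp(r * (s - m)))) / r

scores = np.array([[0.1, 0.2],
                   [0.9, 0.3]])        # toy per-pixel class scores
pooled = lse_pool(scores, r=5.0)       # lies between the mean and the max
```

The smoothing parameter decides how much credit the peak response gets relative to the whole map, which is what lets a tag loss localize rather than merely classify.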


    (ICCV’17) Supervision by Fusion: Towards Unsupervised Learning of Deep Salient Object Detector

    In light of the powerful learning capability of deep neural networks (DNNs), deep (convolutional) models have been built in recent years to address the task of salient object detection. Although training such deep saliency models can significantly improve the detection performance, it requires large-scale manual supervision in the form of pixel-level human annotation, which is highly labor-intensive and time-consuming. To address this problem, this paper makes the earliest effort to train a deep salient object detector without using any human annotation. The key insight is “supervision by fusion”, i.e., generating useful supervisory signals from the fusion process of weak but fast unsupervised saliency models. Based on this insight, we combine an intra-image fusion stream and an inter-image fusion stream in the proposed framework to generate the learning curriculum and pseudo ground truth for supervising the training of the deep salient object detector. Comprehensive experiments on four benchmark datasets demonstrate that our method can approach the same network trained with full supervision (within a 2-5% performance gap) and, more encouragingly, even outperform a number of fully supervised state-of-the-art approaches.
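    The "supervision by fusion" idea can be sketched as follows: several weak unsupervised saliency maps are fused into a pseudo ground truth, with each map weighted by its agreement with the others so that outliers contribute less. This NumPy sketch uses an assumed agreement weighting; the paper's actual fusion streams and learning curriculum are more involved, and `fuse_weak_maps` is a hypothetical helper:

```python
import numpy as np

def fuse_weak_maps(maps, eps=1e-8):
    """Fuse weak unsupervised saliency maps into one pseudo label.

    Each map is weighted by its agreement with the mean of the other
    maps, so an outlier prediction contributes less to the fused map.
    """
    maps = np.asarray(maps, dtype=float)           # (n, H, W), values in [0, 1]
    n = maps.shape[0]
    weights = np.empty(n)
    for i in range(n):
        others = np.delete(maps, i, axis=0).mean(axis=0)
        weights[i] = 1.0 - np.abs(maps[i] - others).mean()   # agreement score
    weights /= weights.sum() + eps
    return np.tensordot(weights, maps, axes=1)     # weighted average, (H, W)

# three toy 2x2 saliency maps; the third disagrees with the other two
m1 = np.array([[0.9, 0.8], [0.1, 0.0]])
m2 = np.array([[0.8, 0.9], [0.0, 0.1]])
m3 = np.array([[0.1, 0.0], [0.9, 0.8]])
fused = fuse_weak_maps([m1, m2, m3])
pseudo_gt = (fused > 0.5).astype(float)            # binarized pseudo label
```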


    (AAAI’18) Weakly Supervised Salient Object Detection Using Image Labels

    Deep learning based salient object detection has recently achieved great success, with performance that greatly outperforms unsupervised methods. However, annotating per-pixel saliency masks is a tedious and inefficient procedure. In this paper, we note that superior salient object detection can be obtained by iteratively mining and correcting the labeling ambiguity in saliency maps from traditional unsupervised methods. We propose to use the combination of a coarse salient object activation map from the classification network and saliency maps generated by unsupervised methods as pixel-level annotation, and develop a simple yet very effective algorithm to train fully convolutional networks for salient object detection supervised by these noisy annotations. Our algorithm alternates between exploiting a graphical model and training a fully convolutional network for model updating. The graphical model corrects the internal labeling ambiguity through spatial consistency and structure preservation, while the fully convolutional network helps to correct the cross-image semantic ambiguity and simultaneously updates the coarse activation map for the next iteration. Experimental results demonstrate that our proposed method greatly outperforms all state-of-the-art unsupervised saliency detection methods and is comparable to the current best strongly supervised methods trained with thousands of pixel-level saliency map annotations on all public benchmarks.


    (CVPR’18) Deep Unsupervised Saliency Detection: A Multiple Noisy Labeling Perspective

    The success of current deep saliency detection methods heavily depends on the availability of large-scale supervision in the form of per-pixel labeling. Such supervision, while labor-intensive and not always possible, tends to hinder the generalization ability of the learned models. By contrast, traditional unsupervised saliency detection methods based on handcrafted features, even though they have been surpassed by deep supervised methods, are generally dataset-independent and can be applied in the wild. This raises a natural question: “Is it possible to learn saliency maps without using labeled data while improving the generalization ability?” To this end, we present a novel perspective on unsupervised saliency detection: learning from multiple noisy labels generated by “weak” and “noisy” unsupervised handcrafted saliency methods. Our end-to-end deep learning framework for unsupervised saliency detection consists of a latent saliency prediction module and a noise modeling module that work collaboratively and are optimized jointly. Explicit noise modeling enables us to deal with noisy saliency maps in a probabilistic way. Extensive experimental results on various benchmark datasets show that our model not only outperforms all unsupervised saliency methods by a large margin but also achieves comparable performance with recent state-of-the-art supervised deep saliency methods.

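    The noise modeling module can be understood through a per-pixel Gaussian noise model: each unsupervised method's map is treated as the latent saliency prediction plus method-specific noise, which gives a negative-log-likelihood style loss. A sketch under that assumption (function and variable names are mine; in the paper the noise module is learned jointly with the predictor):

```python
import numpy as np

def noisy_label_loss(pred, noisy_maps, noise_var):
    """Negative-log-likelihood style loss for multiple noisy labels.

    Each noisy map y_k is modeled as pred plus zero-mean Gaussian
    noise with a per-method variance, so the loss trades off fitting
    each map against the noise level estimated for its source.
    """
    pred = np.asarray(pred, dtype=float)
    total = 0.0
    for y, var in zip(noisy_maps, noise_var):
        sq = (np.asarray(y, dtype=float) - pred) ** 2
        total += np.mean(sq / (2.0 * var) + 0.5 * np.log(var))
    return total / len(noisy_maps)

pred = np.array([[0.8, 0.2]])
clean = np.array([[0.8, 0.2]])         # a weak map that happens to agree
noisy = np.array([[0.2, 0.8]])         # a weak map that disagrees
# assigning a larger variance to the unreliable source lowers the loss
# without dragging the prediction toward it
loss_adaptive = noisy_label_loss(pred, [clean, noisy], [0.01, 0.5])
loss_rigid = noisy_label_loss(pred, [clean, noisy], [0.01, 0.01])
```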

    (CVPR’19) Multi-source weak supervision for saliency detection

    The high cost of pixel-level annotations makes it appealing to train saliency detection models with weak supervision. However, a single weak supervision source usually does not contain enough information to train a well-performing model. To this end, we propose a unified framework to train saliency detection models with diverse weak supervision sources. In this paper, we use category labels, captions, and unlabelled data for training, yet other supervision sources can also be plugged into this flexible framework. We design a classification network (CNet) and a caption generation network (PNet), which learn to predict object categories and generate captions, respectively, while highlighting the most important regions for the corresponding tasks. An attention transfer loss is designed to transmit supervision signals between networks, such that a network designed to be trained with one supervision source can benefit from another. An attention coherence loss is defined on unlabelled data to encourage the networks to detect generally salient regions instead of task-specific regions. We use CNet and PNet to generate pixel-level pseudo labels to train a saliency prediction network (SNet). During testing, we only need SNet to predict saliency maps. Experiments demonstrate that the performance of our method compares favourably against unsupervised and weakly supervised methods and even some supervised methods.


    (CVPR’20) Weakly-Supervised Salient Object Detection via Scribble Annotations

    Compared with laborious pixel-wise dense labeling, it is much easier to label data with scribbles, which costs only 1-2 seconds per image. However, using scribble labels to learn salient object detection has not been explored. In this paper, we propose a weakly-supervised salient object detection model to learn saliency from such annotations. In doing so, we first relabel an existing large-scale salient object detection dataset with scribbles, namely the S-DUTS dataset. Since object structure and detail information is not identified by scribbles, directly training with scribble labels leads to saliency maps with poor boundary localization. To mitigate this problem, we propose an auxiliary edge detection task to localize object edges explicitly, and a gated structure-aware loss to place constraints on the scope of structure to be recovered. Moreover, we design a scribble boosting scheme to iteratively consolidate our scribble annotations, which are then employed as supervision to learn high-quality saliency maps. As existing saliency evaluation metrics neglect to measure the structure alignment of the predictions, saliency map rankings under them may not comply with human perception. We present a new metric, termed the saliency structure measure, to measure the structure alignment of the predicted saliency maps, which is more consistent with human perception. Extensive experiments on six benchmark datasets demonstrate that our method not only outperforms existing weakly-supervised/unsupervised methods, but is also on par with several fully-supervised state-of-the-art models.
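    Training from scribbles typically starts from a partial cross-entropy that is evaluated only on the scribbled pixels and ignores everything unlabeled. A minimal NumPy sketch of that baseline term (the paper adds the edge task, the gated structure-aware loss, and scribble boosting on top; all names here are mine):

```python
import numpy as np

def partial_bce(pred, scribble):
    """Binary cross-entropy evaluated only on scribbled pixels.

    scribble: 1 = foreground stroke, 0 = background stroke,
              -1 = unlabeled (ignored), so the sparse annotation
    supervises the prediction without guessing unmarked regions.
    """
    p = np.clip(np.asarray(pred, dtype=float), 1e-7, 1 - 1e-7)
    s = np.asarray(scribble)
    mask = s >= 0                       # only pixels touched by a stroke
    y = s[mask].astype(float)
    p = p[mask]
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

pred = np.array([[0.9, 0.5],
                 [0.5, 0.1]])
scribble = np.array([[1, -1],
                     [-1, 0]])          # one fg pixel, one bg pixel labeled
loss = partial_bce(pred, scribble)      # unlabeled 0.5 predictions ignored
```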


    (AAAI’21) Structure-Consistent Weakly Supervised Salient Object Detection with Local Saliency Coherence

    Sparse labels have been attracting much attention in recent years. However, the performance gap between weakly supervised and fully supervised salient object detection methods is huge, and most previous weakly supervised works adopt complex training methods with many bells and whistles. In this work, we propose a one-round end-to-end training approach for weakly supervised salient object detection via scribble annotations, without pre/post-processing operations or extra supervision data. Since scribble labels fail to offer detailed salient regions, we propose a local coherence loss to propagate the labels to unlabeled regions based on image features and pixel distance, so as to predict integral salient regions with complete object structures. We design a saliency structure consistency loss as a self-consistency mechanism to ensure that consistent saliency maps are predicted for different scales of the same input image, which can be viewed as a regularization technique to enhance the model's generalization ability. Additionally, we design an aggregation module (AGGM) to better integrate high-level features, low-level features and global context information for the decoder to aggregate. Extensive experiments show that our method achieves a new state-of-the-art performance on six benchmarks (e.g. for the ECSSD dataset: F_\beta = 0.8995, E_\xi = 0.9079 and MAE = 0.0489), with an average gain of 4.60% for F-measure, 2.05% for E-measure and 1.88% for MAE over the previous best method on this task.

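    The local coherence loss can be sketched directly: for neighboring pixel pairs, the saliency difference is penalized in proportion to a kernel over color and spatial distance, so labels propagate within uniform regions but are free to change at image edges. A simplified NumPy version over a 4-neighborhood on a single-channel image (the paper uses RGB features and a larger neighborhood; `sigma_c`/`sigma_d` are my placeholders):

```python
import numpy as np

def local_coherence_loss(sal, img, sigma_c=0.1, sigma_d=1.0):
    """Penalize saliency differences between nearby, similar pixels.

    For each 4-neighborhood pair (i, j), |s_i - s_j| is weighted by a
    kernel that is large when the pixels are close in color and in
    position, so scribble labels can propagate across uniform regions
    while saliency is still free to change at image edges.
    """
    sal = np.asarray(sal, dtype=float)
    img = np.asarray(img, dtype=float)
    h, w = sal.shape
    loss, count = 0.0, 0
    for di, dj in ((0, 1), (1, 0)):            # right and down neighbors
        a, b = sal[: h - di, : w - dj], sal[di:, dj:]
        ca, cb = img[: h - di, : w - dj], img[di:, dj:]
        k = np.exp(-((ca - cb) ** 2) / (2 * sigma_c ** 2)
                   - (di * di + dj * dj) / (2 * sigma_d ** 2))
        loss += np.sum(k * np.abs(a - b))
        count += a.size
    return loss / count

sal = np.array([[1.0, 0.0],
                [1.0, 0.0]])            # saliency jumps between the columns
img_flat = np.zeros((2, 2))             # no image edge to justify the jump
img_edge = np.array([[0.0, 1.0],
                     [0.0, 1.0]])       # image edge aligned with the jump
```

A saliency jump across a uniform image region is penalized, while the same jump aligned with an image edge is nearly free, which is exactly the propagation behavior the loss is after.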

  • To address the imprecise background extraction and weak background suppression of traditional background-prior methods, a salient object detection method driven by global contrast and background priors is proposed. The image is first segmented into a set of perceptually uniform superpixels; a global saliency map is then obtained from global color contrast, and the foreground is computed...
  • To address this problem, a salient object detection algorithm based on contrast-optimized manifold ranking is proposed. Background priors are obtained from image boundary information, an algorithm is designed that measures prior quality with three indicators (saliency expectation, local contrast, and global contrast), and, according to the prior quality, a weighted addition is designed to replace the simple...
  • Traditional salient object detection models often use hand-crafted features to formulate contrast and various prior knowledge, and then combine them artificially. In this work, we propose a novel end-to-end deep hierarchical saliency network (DHSNet) based on convolutional neural networks for detecting salient...

    (CVPR’15) Visual Saliency Based on Multiscale Deep Features

    Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this paper, we discover that a high-quality visual saliency model can be learned from multiscale features extracted using deep convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for feature extraction at three different scales. We then propose a refinement method to enhance the spatial coherence of our saliency results. Finally, aggregating multiple saliency maps computed for different levels of image segmentation can further boost the performance, yielding saliency maps better than those generated from a single segmentation. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks, improving the F-Measure by 5.0% and 13.2% respectively on the MSRA-B dataset and our new dataset (HKU-IS), and lowering the mean absolute error by 5.7% and 35.1% respectively on these two datasets.


    (CVPR’15) Deep Networks for Saliency Detection via Local Estimation and Global Search

    This paper presents a saliency detection algorithm by integrating both local estimation and global search. In the local estimation stage, we detect local saliency by using a deep neural network (DNN-L) which learns local patch features to determine the saliency value of each pixel. The estimated local saliency maps are further refined by exploring high-level object concepts. In the global search stage, the local saliency map together with global contrast and geometric information is used as global features to describe a set of object candidate regions. Another deep neural network (DNN-G) is trained to predict the saliency score of each object region based on the global features. The final saliency map is generated by a weighted sum of salient object regions. Our method presents two interesting insights. First, local features learned by a supervised scheme can effectively capture local contrast, texture and shape information for saliency detection. Second, the complex relationship between different global saliency cues can be captured by deep networks and exploited in a principled rather than heuristic manner. Quantitative and qualitative experiments on several benchmark data sets demonstrate that our algorithm performs favorably against the state-of-the-art methods.


    (CVPR’16) Deep Contrast Learning for Salient Object Detection

    Salient object detection has recently witnessed substantial progress due to powerful features extracted using deep convolutional neural networks (CNNs). However, existing CNN-based methods operate at the patch level instead of the pixel level. Resulting saliency maps are typically blurry, especially near the boundary of salient objects. Furthermore, image patches are treated as independent samples even when they are overlapping, giving rise to significant redundancy in computation and storage. In this paper, we propose an end-to-end deep contrast network to overcome the aforementioned limitations. Our deep network consists of two complementary components, a pixel-level fully convolutional stream and a segment-wise spatial pooling stream. The first stream directly produces a saliency map with pixel-level accuracy from an input image. The second stream extracts segment-wise features very efficiently, and better models saliency discontinuities along object boundaries. Finally, a fully connected CRF model can be optionally incorporated to improve spatial coherence and contour localization in the fused result from these two streams. Experimental results demonstrate that our deep model significantly improves the state of the art.


    (CVPR’16) DHSNet: Deep Hierarchical Saliency Network for Salient Object Detection

    Traditional salient object detection models often use hand-crafted features to formulate contrast and various prior knowledge, and then combine them artificially. In this work, we propose a novel end-to-end deep hierarchical saliency network (DHSNet) based on convolutional neural networks for detecting salient objects. DHSNet first makes a coarse global prediction by automatically learning various global structured saliency cues, including global contrast, objectness, compactness, and their optimal combination. Then a novel hierarchical recurrent convolutional neural network (HRCNN) is adopted to further hierarchically and progressively refine the details of the saliency maps step by step by integrating local context information. The whole architecture works in a global-to-local and coarse-to-fine manner. DHSNet is directly trained using whole images and the corresponding ground truth saliency masks. At test time, saliency maps can be generated by directly and efficiently feeding test images forward through the network, without relying on any other techniques. Evaluations on four benchmark datasets and comparisons with 11 other state-of-the-art algorithms demonstrate that DHSNet not only shows significant superiority in terms of performance, but also achieves a real-time speed of 23 FPS on modern GPUs.


    (CVPR’16) Deep Saliency with Encoded Low level Distance Map and High Level Features

    Recent advances in saliency detection have utilized deep learning to obtain high level features to detect salient regions in a scene. These advances have demonstrated superior results over previous works that utilize hand-crafted low level features for saliency detection. In this paper, we demonstrate that hand-crafted features can provide complementary information to enhance performance of saliency detection that utilizes only high level features. Our method utilizes both high level and low level features for saliency detection under a unified deep learning framework. The high level features are extracted using the VGG-net, and the low level features are compared with other parts of an image to form a low level distance map. The low level distance map is then encoded using a convolutional neural network (CNN) with multiple 1 × 1 convolutional and ReLU layers. We concatenate the encoded low level distance map and the high level features, and connect them to a fully connected neural network classifier to evaluate the saliency of a query region. Our experiments show that our method can further improve the performance of state-of-the-art deep learning-based saliency detection methods.


    (CVPR’17) Non-Local Deep Features for Salient Object Detection

    Saliency detection aims to highlight the most relevant objects in an image. Methods using conventional models struggle whenever salient objects are pictured on top of a cluttered background, while deep neural nets suffer from excess complexity and slow evaluation speeds. In this paper, we propose a simplified convolutional neural network which combines local and global information through a multiresolution 4 × 5 grid structure. Instead of enforcing spatial coherence with a CRF or superpixels as is usually the case, we implemented a loss function inspired by the Mumford-Shah functional which penalizes errors on the boundary. We trained our model on the MSRA-B dataset, and tested it on six different saliency benchmark datasets. Results show that our method is on par with the state-of-the-art while reducing computation time by a factor of 18 to 100, enabling near real-time, high-performance saliency detection.
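    The boundary-aware term can be sketched as comparing the contours of the predicted and ground-truth maps, here via finite-difference gradient magnitudes and a Dice-style overlap. This is my simplification of the Mumford-Shah-inspired loss, not the paper's exact formulation:

```python
import numpy as np

def contour(mask):
    """Soft contour map via finite-difference gradient magnitude."""
    m = np.asarray(mask, dtype=float)
    gy = np.abs(np.diff(m, axis=0, prepend=m[:1]))
    gx = np.abs(np.diff(m, axis=1, prepend=m[:, :1]))
    return np.sqrt(gx ** 2 + gy ** 2)

def boundary_loss(pred, gt, eps=1e-8):
    """Dice-style penalty between predicted and ground-truth contours."""
    cp, cg = contour(pred), contour(gt)
    inter = np.sum(cp * cg)
    return 1.0 - 2.0 * inter / (np.sum(cp ** 2) + np.sum(cg ** 2) + eps)

gt = np.zeros((6, 6))
gt[1:3, 1:3] = 1.0                      # small square object
shifted = np.roll(gt, 2, axis=1)        # same shape, misplaced boundary
```

A region-overlap loss would penalize the shifted prediction only in proportion to the displaced area; the contour comparison punishes the misplaced boundary directly.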


    (CVPR’17) Deeply Supervised Salient Object Detection with Short Connections

    Recent progress on salient object detection is substantial, benefiting mostly from the explosive development of Convolutional Neural Networks (CNNs). Semantic segmentation and salient object detection algorithms developed lately have been mostly based on Fully Convolutional Neural Networks (FCNs). There is still large room for improvement over generic FCN models that do not explicitly deal with the scale-space problem. The Holistically-Nested Edge Detector (HED) provides a skip-layer structure with deep supervision for edge and boundary detection, but the performance gain of HED on saliency detection is not obvious. In this paper, we propose a new salient object detection method by introducing short connections to the skip-layer structures within the HED architecture. Our framework takes full advantage of multi-level and multi-scale features extracted from FCNs, providing more advanced representations at each layer, a property that is critically needed to perform segment detection. Our method produces state-of-the-art results on 5 widely tested salient object detection benchmarks, with advantages in terms of efficiency (0.08 seconds per image), effectiveness, and simplicity over the existing algorithms. Beyond that, we conduct an exhaustive analysis on the role of training data on performance. Our experimental results provide a more reasonable and powerful training set for future research and fair comparisons.


    (ICCV’17) A Stagewise Refinement Model for Detecting Salient Objects in Images

    Deep convolutional neural networks (CNNs) have been successfully applied to a wide variety of problems in computer vision, including salient object detection. To detect and segment salient objects accurately, it is necessary to extract and combine high-level semantic features with low-level fine details simultaneously. This happens to be a challenge for CNNs, as repeated subsampling operations such as pooling and strided convolution lead to a significant decrease in the initial image resolution, which results in loss of spatial details and finer structures. To remedy this problem, here we propose to augment feedforward neural networks with a novel pyramid pooling module and a multi-stage refinement mechanism for saliency detection. First, our deep feedforward net is used to generate a coarse prediction map in which much of the detailed structure is lost. Then, refinement nets are integrated with local context information to refine the preceding saliency maps generated in the master branch in a stagewise manner. Further, a pyramid pooling module is applied for different-region-based global context aggregation. Empirical evaluations over six benchmark datasets show that our proposed method compares favorably against the state-of-the-art approaches.


    (ICCV’17) Amulet: Aggregating Multi-level Convolutional Features for Salient Object Detection

    Fully convolutional neural networks (FCNs) have shown outstanding performance in many dense labeling problems. One key pillar of these successes is mining relevant information from features in convolutional layers. However, how to better aggregate multi-level convolutional feature maps for salient object detection is underexplored. In this work, we present Amulet, a generic aggregating multi-level convolutional feature framework for salient object detection. Our framework first integrates multi-level feature maps into multiple resolutions, which simultaneously incorporate coarse semantics and fine details. Then it adaptively learns to combine these feature maps at each resolution and predict saliency maps with the combined features. Finally, the predicted results are efficiently fused to generate the final saliency map. In addition, to achieve accurate boundary inference and semantic enhancement, edge-aware feature maps in low-level layers and the predicted results of low resolution features are recursively embedded into the learning framework. By aggregating multi-level convolutional features in this efficient and flexible manner, the proposed saliency model provides accurate salient object labeling. Comprehensive experiments demonstrate that our method performs favorably against state-of-the-art approaches on nearly all compared evaluation metrics.
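    Amulet's data flow, resizing feature maps from several levels to a shared resolution and then combining them, can be sketched in a few lines. Nearest-neighbor resizing and fixed weights stand in for the learned interpolation and adaptive combination (all names here are illustrative):

```python
import numpy as np

def resize_nn(fmap, size):
    """Nearest-neighbor resize of a 2-D feature map to (size, size)."""
    h, w = fmap.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return fmap[np.ix_(rows, cols)]

def aggregate_levels(features, size, weights):
    """Resize every level to one resolution and combine them.

    Amulet learns the combination adaptively; the fixed weights here
    are placeholders meant only to show the data flow.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * resize_nn(f, size) for wi, f in zip(w, features))

coarse = np.array([[1.0, 0.0],
                   [0.0, 1.0]])                       # deep, low-res semantics
fine = np.arange(16, dtype=float).reshape(4, 4) / 15  # shallow, high-res details
fused = aggregate_levels([coarse, fine], size=4, weights=[0.5, 0.5])
```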


    (ICCV’17) Learning Uncertain Convolutional Features for Accurate Saliency Detection

    Deep convolutional neural networks (CNNs) have delivered superior performance in many computer vision tasks. In this paper, we propose a novel deep fully convolutional network model for accurate salient object detection. The key contribution of this work is to learn deep uncertain convolutional features (UCF), which encourage the robustness and accuracy of saliency detection. We achieve this via introducing a reformulated dropout (R-dropout) after specific convolutional layers to construct an uncertain ensemble of internal feature units. In addition, we propose an effective hybrid upsampling method to reduce the checkerboard artifacts of deconvolution operators in our decoder network. The proposed methods can also be applied to other deep convolutional networks. Compared with existing saliency detection methods, the proposed UCF model is able to incorporate uncertainties for more accurate object boundary inference. Extensive experiments demonstrate that our proposed saliency model performs favorably against state-of-the-art approaches. The uncertain feature learning mechanism as well as the upsampling method can significantly improve performance on other pixel-wise vision tasks.


    (CVPR’18) Detect Globally, Refine Locally: A Novel Approach to Saliency Detection

    Effective integration of contextual information is crucial for salient object detection. To achieve this, most existing methods based on a ’skip’ architecture mainly focus on how to integrate hierarchical features of Convolutional Neural Networks (CNNs). They simply apply concatenation or element-wise operations to incorporate high-level semantic cues and low-level detailed information. However, this can degrade the quality of predictions because cluttered and noisy information can also be passed through. To address this problem, we propose a global Recurrent Localization Network (RLN) which exploits contextual information via a weighted response map in order to localize salient objects more accurately. In particular, a recurrent module is employed to progressively refine the inner structure of the CNN over multiple time steps. Moreover, to effectively recover object boundaries, we propose a local Boundary Refinement Network (BRN) to adaptively learn the local contextual information for each spatial position. The learned propagation coefficients can be used to optimally capture relations between each pixel and its neighbors. Experiments on five challenging datasets show that our approach performs favorably against all existing methods in terms of the popular evaluation metrics.


    (CVPR’18) PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection

    Contexts play an important role in the saliency detection task. However, given a context region, not all contextual information is helpful for the final task. In this paper, we propose a novel pixel-wise contextual attention network, i.e., the PiCANet, to learn to selectively attend to informative context locations for each pixel. Specifically, for each pixel, it can generate an attention map in which each attention weight corresponds to the contextual relevance at each context location. An attended contextual feature can then be constructed by selectively aggregating the contextual information. We formulate the proposed PiCANet in both global and local forms to attend to global and local contexts, respectively. Both models are fully differentiable and can be embedded into CNNs for joint training. We also incorporate the proposed models with the U-Net architecture to detect salient objects. Extensive experiments show that the proposed PiCANets can consistently improve saliency detection performance. The global and local PiCANets facilitate learning global contrast and homogeneousness, respectively. As a result, our saliency model can detect salient objects more accurately and uniformly, thus performing favorably against the state-of-the-art methods.
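    The core of pixel-wise contextual attention is that each pixel softly selects which context locations to aggregate. A toy NumPy sketch with dot-product relevance scores (PiCANet produces its attention maps with learned convolutional/recurrent layers, so the scoring function here is only a stand-in):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pixel_context_attention(query, context):
    """Attend one pixel's feature over a set of context features.

    query:   (d,) feature of the pixel
    context: (n, d) features at n context locations
    A softmax over dot-product relevance scores yields per-location
    attention weights; the attended feature is their weighted sum.
    """
    scores = context @ query            # relevance of each context location
    attn = softmax(scores)
    return attn @ context, attn         # attended feature (d,), weights (n,)

query = np.array([1.0, 0.0])
context = np.array([[1.0, 0.0],         # informative location
                    [0.0, 1.0],         # uninformative location
                    [0.9, 0.1]])
attended, attn = pixel_context_attention(query, context)
```

Restricting `context` to a window around the pixel gives the local form; taking it over the whole map gives the global form.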


    (CVPR’18) A Bi-directional Message Passing Model for Salient Object Detection

    Recent progress on salient object detection has benefited from Fully Convolutional Neural Networks (FCNs). The saliency cues contained in multi-level convolutional features are complementary for detecting salient objects. How to integrate multi-level features becomes an open problem in saliency detection. In this paper, we propose a novel bi-directional message passing model to integrate multi-level features for salient object detection. At first, we adopt a Multi-scale Context-aware Feature Extraction Module (MCFEM) for multi-level feature maps to capture rich context information. Then a bi-directional structure is designed to pass messages between multi-level features, and a gate function is exploited to control the message passing rate. We use the features after message passing, which simultaneously encode semantic information and spatial details, to predict saliency maps. Finally, the predicted results are efficiently combined to generate the final saliency map. Quantitative and qualitative experiments on five benchmark datasets demonstrate that our proposed model performs favorably against the state-of-the-art methods under different evaluation metrics.
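    The gate function controlling the message-passing rate can be sketched as a sigmoid that scales how much of one level's features flows into another. A minimal NumPy illustration (in the paper the gates are predicted from the features themselves; the explicit scalar logits here are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_pass(sender, receiver, gate_logit):
    """Pass a gated message from one feature level to another.

    The sigmoid gate in [0, 1] controls how much of the sender's
    information is mixed into the receiver, so an unhelpful level
    can be suppressed rather than blindly concatenated.
    """
    return receiver + sigmoid(gate_logit) * sender

low = np.array([0.2, 0.8])              # detail-rich shallow features
high = np.array([1.0, 1.0])             # semantic deep features
# bi-directional exchange, one gate per direction
low_updated = gated_pass(high, low, gate_logit=2.0)    # mostly open gate
high_updated = gated_pass(low, high, gate_logit=-2.0)  # mostly closed gate
```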


    (CVPR’18) Progressive Attention Guided Recurrent Network for Salient Object Detection

    Effective convolutional features play an important role in saliency estimation but how to learn powerful features for saliency is still a challenging task. FCN-based methods directly apply multi-level convolutional features without distinction, which leads to sub-optimal results due to the distraction from redundant details. In this paper, we propose a novel attention guided network which selectively integrates multi-level contextual information in a progressive manner. Attentive features generated by our network can alleviate distraction of background thus achieve better performance. On the other hand, it is observed that most of existing algorithms conduct salient object detection by exploiting side-output features of the backbone feature extraction network. However, shallower layers of backbone network lack the ability to obtain global semantic information, which limits the effective feature learning. To address the problem, we introduce multi-path recurrent feedback to enhance our proposed progressive attention driven framework. Through multi-path recurrent connections, global semantic information from the top convolutional layer is transferred to shallower layers, which intrinsically refines the entire network. Experimental results on six benchmark datasets demonstrate that our algorithm performs favorably against the state-of-the-art approaches.


    (ECCV’18) Contour Knowledge Transfer for Salient Object Detection

    In recent years, deep Convolutional Neural Networks (CNNs) have broken all records in salient object detection. However, training such a deep model requires a large amount of manual annotations. Our goal is to overcome this limitation by automatically converting an existing deep contour detection model into a salient object detection model without using any manual salient object masks. For this purpose, we have created a deep network architecture, namely the Contour-to-Saliency Network (C2S-Net), by grafting a new branch onto a well-trained contour detection network. Therefore, our C2S-Net has two branches for performing two different tasks: (1) predicting contours with the original contour branch, and (2) estimating per-pixel saliency score of each image with the newly added saliency branch. To bridge the gap between these two tasks, we further propose a contour-to-saliency transferring method to automatically generate salient object masks which can be used to train the saliency branch from outputs of the contour branch. Finally, we introduce a novel alternating training pipeline to gradually update the network parameters. In this scheme, the contour branch generates saliency masks for training the saliency branch, while the saliency branch, in turn, feeds back saliency knowledge in the form of saliency-aware contour labels, for fine-tuning the contour branch. The proposed method achieves state-of-the-art performance on five well-known benchmarks, outperforming existing fully supervised methods while also maintaining high efficiency.


    (ECCV’18) Reverse Attention for Salient Object Detection

    Benefiting from the rapid development of deep learning techniques, salient object detection has achieved remarkable progress recently. However, there still exist the following two major challenges that hinder its application in embedded devices: low-resolution output and heavy model weight. To this end, this paper presents an accurate yet compact deep network for efficient salient object detection. More specifically, given a coarse saliency prediction from the deepest layer, we first employ residual learning to learn side-output residual features for saliency refinement, which can be achieved with very limited convolutional parameters while keeping accuracy. Second, we further propose reverse attention to guide such side-output residual learning in a top-down manner. By erasing the current predicted salient regions from side-output features, the network can eventually explore the missing object parts and details, which results in high resolution and accuracy. Experiments on six benchmark datasets demonstrate that the proposed approach compares favorably against state-of-the-art methods, with advantages in terms of simplicity, efficiency (45 FPS) and model size (81 MB).


    (CVPR’19) Attentive Feedback Network for Boundary-Aware Salient Object Detection

    Recent deep-learning-based salient object detection methods achieve gratifying performance built upon Fully Convolutional Neural Networks (FCNs). However, most of them suffer from the boundary challenge. The state-of-the-art methods employ feature aggregation techniques and can precisely locate the salient object, but they often fail to segment out the entire object with fine boundaries, especially thin raised narrow stripes, so there is still large room for improvement over FCN-based models. In this paper, we design Attentive Feedback Modules (AFMs) to better explore the structure of objects. A Boundary-Enhanced Loss (BEL) is further employed for learning exquisite boundaries. Our proposed deep model produces satisfying results on object boundaries and achieves state-of-the-art performance on five widely tested salient object detection benchmarks. The network is fully convolutional, runs at a speed of 26 FPS, and does not need any post-processing.


    (CVPR’19) Salient Object Detection With Pyramid Attention and Salient Edges

    This paper presents a new method for detecting salient objects in images using convolutional neural networks (CNNs). The proposed network, named PAGE-Net, offers two key contributions. The first is the exploitation of an essential pyramid attention structure for salient object detection. This enables the network to concentrate more on salient regions while considering multi-scale saliency information. Such a stacked attention design provides a powerful tool to efficiently improve the representation ability of the corresponding network layer with an enlarged receptive field. The second contribution lies in the emphasis on the importance of salient edges. Salient edge information offers a strong cue to better segment salient objects and refine object boundaries. To this end, our model is equipped with a salient edge detection module, which is learned for precise salient boundary estimation. This encourages better edge-preserving salient object segmentation. Exhaustive experiments confirm that the proposed pyramid attention and salient edges are effective for salient object detection. We show that our deep saliency model outperforms state-of-the-art approaches on several benchmarks with a fast processing speed (25 fps on one GPU).


    (CVPR’19) Pyramid Feature Attention Network for Saliency detection

    Saliency detection is one of the basic challenges in computer vision. How to extract effective features is a critical point for saliency detection. Recent methods mainly integrate multi-scale convolutional features indiscriminately. However, not all features are useful for saliency detection, and some even cause interference. To solve this problem, we propose the Pyramid Feature Attention network to focus on effective high-level context features and low-level spatial structural features. First, we design a Context-aware Pyramid Feature Extraction (CPFE) module for multi-scale high-level feature maps to capture rich context features. Second, we adopt channel-wise attention (CA) after the CPFE feature maps and spatial attention (SA) after the low-level feature maps, then fuse the outputs of CA and SA together. Finally, we propose an edge preservation loss to guide the network to learn more detailed information in boundary localization. Extensive evaluations on five benchmark datasets demonstrate that the proposed method outperforms the state-of-the-art approaches under different evaluation metrics.


    (CVPR’19) BASNet: Boundary-Aware Salient Object Detection

    Deep Convolutional Neural Networks have been adopted for salient object detection and have achieved state-of-the-art performance. Most previous works, however, focus on region accuracy rather than boundary quality. In this paper, we propose a predict-refine architecture, BASNet, and a new hybrid loss for Boundary-Aware Salient object detection. Specifically, the architecture is composed of a densely supervised Encoder-Decoder network and a residual refinement module, which are respectively in charge of saliency prediction and saliency map refinement. The hybrid loss guides the network to learn the transformation between the input image and the ground truth in a three-level hierarchy (pixel-, patch- and map-level) by fusing Binary Cross Entropy (BCE), Structural SIMilarity (SSIM) and Intersection-over-Union (IoU) losses. Equipped with the hybrid loss, the proposed predict-refine architecture is able to effectively segment the salient object regions and accurately predict the fine structures with clear boundaries. Experimental results on six public datasets show that our method outperforms the state-of-the-art methods in terms of both regional and boundary evaluation measures. Our method runs at over 25 fps on a single GPU. The code is available at: https://github.com/NathanUA/BASNet.
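
    As a rough illustration of the hybrid loss idea, the sketch below combines a pixel-level BCE term, a simplified single-window SSIM term, and a map-level soft IoU term in plain NumPy; the function names and the global (rather than local-window) SSIM are simplifying assumptions, not the paper's exact implementation.

```python
import numpy as np

def bce_loss(pred, gt, eps=1e-7):
    # Pixel-level: binary cross entropy averaged over all pixels.
    pred = np.clip(pred, eps, 1 - eps)
    return -np.mean(gt * np.log(pred) + (1 - gt) * np.log(1 - pred))

def ssim_loss(pred, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    # Patch-level: simplified single-window SSIM; the paper uses local
    # sliding windows, but one global window keeps the sketch short.
    mx, my = pred.mean(), gt.mean()
    vx, vy = pred.var(), gt.var()
    cov = ((pred - mx) * (gt - my)).mean()
    ssim = ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
    return 1 - ssim

def iou_loss(pred, gt, eps=1e-7):
    # Map-level: soft intersection-over-union on the whole saliency map.
    inter = (pred * gt).sum()
    union = pred.sum() + gt.sum() - inter
    return 1 - (inter + eps) / (union + eps)

def hybrid_loss(pred, gt):
    # Equal-weight sum of the three levels, as in the hybrid loss idea.
    return bce_loss(pred, gt) + ssim_loss(pred, gt) + iou_loss(pred, gt)
```

    A perfect prediction drives all three terms (and hence their sum) toward zero, while an inverted prediction is penalized heavily by every level.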


    (CVPR’19) Cascaded Partial Decoder for Fast and Accurate Salient Object Detection

    Existing state-of-the-art salient object detection networks rely on aggregating multi-level features of pre-trained convolutional neural networks (CNNs). Compared to high-level features, low-level features contribute less to performance but cost more computation because of their larger spatial resolutions. In this paper, we propose a novel Cascaded Partial Decoder (CPD) framework for fast and accurate salient object detection. On the one hand, the framework constructs a partial decoder which discards the larger-resolution features of shallower layers for acceleration. On the other hand, we observe that integrating features of deeper layers obtains a relatively precise saliency map. Therefore, we directly utilize the generated saliency map to refine the features of the backbone network. This strategy efficiently suppresses distractors in the features and significantly improves their representation ability. Experiments conducted on five benchmark datasets show that the proposed model not only achieves state-of-the-art performance but also runs much faster than existing models. Besides, the proposed framework can be further applied to improve existing multi-level feature aggregation models and significantly improve their efficiency and accuracy.


    (CVPR’19) A Simple Pooling-Based Design for Real-Time Salient Object Detection

    We solve the problem of salient object detection by investigating how to expand the role of pooling in convolutional neural networks. Based on the U-shape architecture, we first build a global guidance module (GGM) upon the bottom-up pathway, aiming at providing layers at different feature levels the location information of potential salient objects. We further design a feature aggregation module (FAM) to make the coarse-level semantic information well fused with the fine-level features from the top-down pathway. By adding FAMs after the fusion operations in the top-down pathway, coarse-level features from the GGM can be seamlessly merged with features at various scales. These two pooling-based modules allow the high-level semantic features to be progressively refined, yielding detail-enriched saliency maps. Experimental results show that our proposed approach can more accurately locate the salient objects with sharpened details and hence substantially improve the performance compared to previous state-of-the-art methods. Our approach is fast as well and can run at a speed of more than 30 FPS when processing a 300×400 image. Code can be found at http://mmcheng.net/poolnet/.


    (ICCV’19) Stacked Cross Refinement Network for Edge-Aware Salient Object Detection

    Salient object detection is a fundamental computer vision task. The majority of existing algorithms focus on aggregating multi-level features of pre-trained convolutional neural networks. Moreover, some researchers attempt to utilize edge information for auxiliary training. However, existing edge-aware models design unidirectional frameworks which only use edge features to improve the segmentation features. Motivated by the logical interrelations between binary segmentation and edge maps, we propose a novel Stacked Cross Refinement Network (SCRN) for salient object detection in this paper. Our framework aims to simultaneously refine multi-level features of salient object detection and edge detection by stacking Cross Refinement Units (CRUs). According to the logical interrelations, the CRU designs two direction-specific integration operations and bidirectionally passes messages between the two tasks. Incorporating the refined edge-preserving features with the typical U-Net, our model detects salient objects accurately. Extensive experiments conducted on six benchmark datasets demonstrate that our method outperforms existing state-of-the-art algorithms in both accuracy and efficiency. Besides, the attribute-based performance on the SOC dataset shows that the proposed model ranks first in the majority of challenging scenes. Code can be found at https://github.com/wuzhe71/SCAN.


    (ICCV’19) Selectivity or Invariance: Boundary-aware Salient Object Detection

    Typically, a salient object detection (SOD) model faces opposite requirements in processing object interiors and boundaries. The features of interiors should be invariant to strong appearance change so as to pop-out the salient object as a whole, while the features of boundaries should be selective to slight appearance change to distinguish salient objects and background. To address this selectivity-invariance dilemma, we propose a novel boundary-aware network with successive dilation for image-based SOD. In this network, the feature selectivity at boundaries is enhanced by incorporating a boundary localization stream, while the feature invariance at interiors is guaranteed with a complex interior perception stream. Moreover, a transition compensation stream is adopted to amend the probable failures in transitional regions between interiors and boundaries. In particular, an integrated successive dilation module is proposed to enhance the feature invariance at interiors and transitional regions. Extensive experiments on six datasets show that the proposed approach outperforms 16 state-of-the-art methods.


    (ICCV’19) EGNet: Edge Guidance Network for Salient Object Detection

    Fully convolutional neural networks (FCNs) have shown their advantages in the salient object detection task. However, most existing FCN-based methods still suffer from coarse object boundaries. In this paper, to solve this problem, we focus on the complementarity between salient edge information and salient object information. Accordingly, we present an edge guidance network (EGNet) for salient object detection with three steps to simultaneously model these two kinds of complementary information in a single network. In the first step, we extract the salient object features in a progressive fusion manner. In the second step, we integrate the local edge information and global location information to obtain the salient edge features. Finally, to sufficiently leverage these complementary features, we couple the same salient edge features with salient object features at various resolutions. Benefiting from the rich edge information and location information in salient edge features, the fused features can help locate salient objects, especially their boundaries, more accurately. Experimental results demonstrate that the proposed method performs favorably against the state-of-the-art methods on six widely used datasets without any pre-processing or post-processing. The source code is available at http://mmcheng.net/egnet/.


    (ICCV’19) Employing Deep Part-Object Relationships for Salient Object Detection

    Although Convolutional Neural Network (CNN) based methods have been successful in detecting salient objects, their underlying mechanism, which decides the salient intensity of each image part separately, cannot avoid inconsistency among parts within the same salient object. This ultimately results in an incomplete shape of the detected salient object. To solve this problem, we dig into part-object relationships and make an unprecedented attempt to employ these relationships, endowed by the Capsule Network (CapsNet), for salient object detection. The entire salient object detection system is built directly on a Two-Stream Part-Object Assignment Network (TSPOANet) consisting of three algorithmic steps. In the first step, the learned deep feature maps of the input image are transformed to a group of primary capsules. In the second step, we feed the primary capsules into two identical streams, within each of which low-level capsules (parts) will be assigned to their familiar high-level capsules (object) via a locally connected routing. In the final step, the two streams are integrated in the form of a fully connected layer, where the relevant parts can be clustered together to form a complete salient object. Experimental results demonstrate the superiority of the proposed salient object detection network over the state-of-the-art methods.


    (CVPR’20) Interactive Two-Stream Decoder for Accurate and Fast Saliency Detection

    Recently, contour information has largely improved the performance of saliency detection. However, the discussion on the correlation between saliency and contour remains scarce. In this paper, we first analyze such correlation and then propose an interactive two-stream decoder to explore multiple cues, including saliency, contour and their correlation. Specifically, our decoder consists of two branches, a saliency branch and a contour branch. Each branch is assigned to learn distinctive features for predicting the corresponding map. Meanwhile, the intermediate connections are forced to learn the correlation by interactively transmitting the features from each branch to the other. In addition, we develop an adaptive contour loss to automatically discriminate hard examples during the learning process. Extensive experiments on six benchmarks well demonstrate that our network achieves competitive performance with a fast speed of around 50 FPS. Moreover, our VGG-based model only contains 17.08 million parameters, which is significantly smaller than other VGG-based approaches. Code has been made available at: https://github.com/moothes/ITSD-pytorch.


    (CVPR’20) Multi-scale Interactive Network for Salient Object Detection

    Deep-learning-based salient object detection methods have achieved great progress. However, the variable scale and unknown category of salient objects remain great challenges. These are closely related to the utilization of multi-level and multi-scale features. In this paper, we propose aggregate interaction modules to integrate the features from adjacent levels, in which less noise is introduced because only small up-/down-sampling rates are used. To obtain more efficient multi-scale features from the integrated features, self-interaction modules are embedded in each decoder unit. Besides, the class imbalance issue caused by the scale variation weakens the effect of the binary cross entropy loss and results in spatial inconsistency of the predictions. Therefore, we exploit a consistency-enhanced loss to highlight the fore-/back-ground difference and preserve the intra-class consistency. Experimental results on five benchmark datasets demonstrate that the proposed method, without any post-processing, performs favorably against 23 state-of-the-art approaches. The source code will be publicly available at https://github.com/lartpang/MINet.


    (CVPR’20) Label Decoupling Framework for Salient Object Detection

    To get more accurate saliency maps, recent methods mainly focus on aggregating multi-level features from fully convolutional networks (FCNs) and introducing edge information as auxiliary supervision. Though remarkable progress has been achieved, we observe that the closer a pixel is to the edge, the more difficult it is to predict, because edge pixels have a very imbalanced distribution. To address this problem, we propose a label decoupling framework (LDF) which consists of a label decoupling (LD) procedure and a feature interaction network (FIN). LD explicitly decomposes the original saliency map into a body map and a detail map, where the body map concentrates on the center areas of objects and the detail map focuses on regions around edges. The detail map works better because it involves many more pixels than traditional edge supervision. Different from the saliency map, the body map discards edge pixels and only pays attention to center areas. This successfully avoids the distraction from edge pixels during training. Therefore, we employ two branches in FIN to deal with the body map and detail map respectively. Feature interaction (FI) is designed to fuse the two complementary branches to predict the saliency map, which is then used to refine the two branches again. This iterative refinement is helpful for learning better representations and more precise saliency maps. Comprehensive experiments on six benchmark datasets demonstrate that LDF outperforms state-of-the-art approaches on different evaluation metrics.
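
    The label decoupling step can be sketched with a distance transform: foreground pixels far from the boundary form the body map, and what remains concentrates near edges. The normalization below is an illustrative assumption; the paper's exact transform may differ.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def decouple_label(gt):
    """Split a binary mask into a body map (center-weighted) and a
    detail map (edge-weighted), following the label-decoupling idea."""
    gt = gt.astype(np.float32)
    # Distance of each foreground pixel to the nearest background pixel:
    # large in object centers, small near the object boundary.
    dist = distance_transform_edt(gt)
    body = dist / dist.max() if dist.max() > 0 else dist  # emphasize centers
    detail = gt - body                                    # what remains hugs the edges
    return body, detail
```

    For a square mask, the body map peaks at the object center while the detail map is largest just inside the boundary, matching the body/detail split described above.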


    (ECCV’20) Suppress and Balance: A Simple Gated Network for Salient Object Detection

    Most salient object detection approaches use U-Net or feature pyramid networks (FPN) as their basic structures. These methods ignore two key problems when the encoder exchanges information with the decoder: one is the lack of interference control between them, and the other is the failure to consider the disparity of the contributions of different encoder blocks. In this work, we propose a simple gated network (GateNet) to solve both issues at once. With the help of multi-level gate units, the valuable context information from the encoder can be optimally transmitted to the decoder. We design a novel gated dual-branch structure to build cooperation among different levels of features and improve the discriminability of the whole network. Through the dual-branch design, more details of the saliency map can be further restored. In addition, we adopt atrous spatial pyramid pooling based on the proposed “Fold” operation (Fold-ASPP) to accurately localize salient objects of various scales. Extensive experiments on five challenging datasets demonstrate that the proposed model performs favorably against most state-of-the-art methods under different evaluation metrics.


    (ECCV’20) Highly Efficient Salient Object Detection with 100K Parameters

    Salient object detection models often demand a considerable amount of computation to make a precise prediction for each pixel, making them hardly applicable to low-power devices. In this paper, we aim to relieve the contradiction between computation cost and model performance by improving the network efficiency to a higher degree. We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stage multi-scale features, while reducing the representation redundancy with a novel dynamic weight decay scheme. The dynamic weight decay scheme stably boosts the sparsity of parameters during training and supports a learnable number of channels for each scale in gOctConv, allowing an 80% reduction in parameters with a negligible performance drop. Utilizing gOctConv, we build an extremely lightweight model, namely CSNet, which achieves comparable performance with only about 0.2% of the parameters (100K) of large models on popular salient object detection benchmarks.


  • Study notes – a survey of salient object detection in the deep learning era. The survey cites 182 references across 16 pages of main text, making it an exemplary survey of the field. What follows are reading notes on that paper.



    1 Introduction


    2 Deep-Learning-Based SOD Models


    2.1 Typical SOD Network Architectures

    Capsule networks are a new type of network proposed by Hinton et al.; Y. Liu, Q. Qi et al. applied capsule networks to SOD (TSPOANet).

    2.2 SOD by Supervision Level


    2.3 SOD by Learning Paradigm


    2.4 Object-Level vs. Instance-Level SOD


    3 SOD Datasets




    In this part the authors introduce the seven commonly used evaluation metrics: Precision-Recall (PR), F-measure, Mean Absolute Error (MAE), weighted Fβ measure (Fbw), Structural measure (S-measure), Enhanced-alignment measure (E-measure), and Salient Object Ranking (SOR). See the paper for the detailed formulas and computation.
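
    Two of these metrics are simple enough to sketch directly. The sketch below assumes saliency maps and ground truth are normalized to [0, 1], and uses the conventional β² = 0.3 weighting for the F-measure; max F-measure additionally sweeps the threshold and keeps the best score.

```python
import numpy as np

def mae(pred, gt):
    # Mean Absolute Error between a saliency map and a binary mask in [0, 1].
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, threshold=0.5, beta2=0.3, eps=1e-7):
    # F-measure at one binarization threshold; SOD papers conventionally
    # set beta^2 = 0.3 to weight precision more heavily than recall.
    binary = (pred >= threshold).astype(np.float32)
    tp = (binary * gt).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
```

    A prediction identical to the ground truth yields MAE 0 and F-measure ≈ 1; an inverted prediction yields F-measure 0.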


    Overview of benchmark results: 47 SOD models are evaluated on six datasets using three metrics: max F-measure, S-measure, and MAE.

    Input perturbation analysis: this analysis is mainly reflected in Table 8. Four kinds of perturbation are considered: Gaussian blur, Gaussian noise, rotation, and grayscale conversion, with the first three each tested at two parameter settings. Overall, non-deep methods are more robust than deep ones, mainly thanks to the robustness of handcrafted superpixel-level features. Looking at each perturbation individually, the performance of non-deep models is essentially unaffected by rotation but very sensitive to strong Gaussian noise. Deep methods are sensitive to Gaussian blur and strong Gaussian noise, mainly because these two perturbations affect the receptive fields of shallow layers.
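
    The four perturbation types can be reproduced roughly as follows; the parameter values (blur sigma, noise strength, rotation angle) are illustrative assumptions, not the settings used in the survey's Table 8.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, rotate

def perturb(img, kind, rng=None):
    """Apply one of four input perturbations to an HxWx3 image in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng(0)
    if kind == "blur":
        # Gaussian blur on the spatial axes only.
        return gaussian_filter(img, sigma=(2, 2, 0))
    if kind == "noise":
        # Additive Gaussian noise, clipped back to the valid range.
        return np.clip(img + rng.normal(0.0, 0.1, img.shape), 0.0, 1.0)
    if kind == "rotation":
        # In-plane rotation without changing the output shape.
        return rotate(img, angle=15, axes=(0, 1), reshape=False, mode="nearest")
    if kind == "gray":
        # Grayscale conversion with standard luma weights, kept 3-channel.
        g = img @ np.array([0.299, 0.587, 0.114])
        return np.repeat(g[..., None], 3, axis=-1)
    raise ValueError(f"unknown perturbation: {kind}")
```

    Running a trained model on the original and the perturbed copies and comparing the metric drop gives exactly the kind of robustness table described above.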
    Since ECSSD has the fewest images (1,000) of the six datasets the authors selected, 1,000 images are randomly sampled from each dataset for the cross-dataset study, 800 for training and 200 for validation. The table below shows some of the analysis results. Reading a column gives the performance of all models on one dataset and reflects how difficult that dataset is: comparing the 'Mean others' entries in the last row, SOC is the hardest and MSRA10K the easiest.


    When designing SOD models, one should think from four angles: the feature set, the loss function, the network topology, and dynamic inference structures. The biggest advantage of deep models is that they can extract features far richer than traditional methods, yet how the features extracted at different layers are fused directly affects the prediction quality of an SOD model. The survey lists the currently common fusion strategies: multi-stream/multi-resolution fusion, combined top-down and bottom-up fusion, side-output fusion, and fusion with features from related research areas (fixation prediction, semantic segmentation), among others. For loss design, researchers either write SOD evaluation metrics into the loss function or directly use mean IoU. The network topology directly affects training difficulty and parameter count; many experiments show that networks built on a ResNet backbone usually outperform VGG-based ones. From the topology-design angle, AutoML will be a promising research direction. Dynamic inference structures mainly aim to reduce network parameters while preserving performance as much as possible; they can be understood as selectively activating part of the network's output features or implementing early exits, whereas static methods mainly include kernel factorization and network pruning.
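
    The top-down fusion strategy mentioned above can be sketched as a minimal FPN-style loop: upsample the running deep feature map and add the next, shallower level. Real models insert lateral convolutions between levels; this bare version only shows the data flow.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fuse(features):
    """Fuse multi-level features from deep (small, semantic) to shallow
    (large, detailed): upsample the running map and add the next level.
    features: list ordered deepest-first, each (C, H, W) with H and W
    doubling from one level to the next."""
    fused = features[0]
    for f in features[1:]:
        fused = upsample2x(fused) + f
    return fused
```

    With three all-ones levels of sizes 2×2, 4×4 and 8×8, the result is an 8×8 map of threes: each level contributes once along the top-down path.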

  • Introduction: Recent deep-learning-based salient object detection methods have achieved excellent performance. However, most existing methods are designed for low-resolution inputs, and such models perform poorly on high-resolution images because of the network's sampling depth and receptive…

    The human visual system can quickly and accurately locate objects or regions of interest in complex scenes, an ability known as the selective attention mechanism. Salient Object Detection (SOD) simulates this mechanism and aims to segment the most visually attractive objects or regions in a given image. Most existing SOD methods perform well only within a specific range of input resolutions (e.g., 224×224 or 384×384).





    Figure 1: Comparison of different architectures. (a) Input image; (b) ground-truth label; (c) result of feeding the high-resolution image directly into a CNN; (d) result of downsampling the input and feeding it into Swin-FPN; (e) result of our method.


    To this end, we rethink the dual-branch architecture and design a novel single-stage deep network, the Pyramid Grafting Network (PGNet), to address the high-resolution saliency problem.





    On the other hand, if larger low-resolution datasets are mixed in, their low-quality edges introduce new noise into the training data and hurt the model's performance in high-resolution scenes. We therefore propose a large-scale ultra-high-resolution saliency detection dataset, UHRSD (Ultra High-Resolution for Saliency Detection).




    Figure 2: Left: histogram comparison of edge-pixel counts; right: histogram comparison of image diagonal lengths.


    Figure 3: Left: UHRSD samples and annotations; right: samples and annotations from other high-resolution datasets.




    Figure 4: Left: UHRSD samples and annotations; right: samples and annotations from low-resolution datasets.

    We propose the Cross-Model Grafting Module (CMGM) to fuse the features extracted by the two different backbones. Compared with simple feature fusion, CMGM uses an attention mechanism so that the global semantics in the features extracted by the Transformer guide the combination of the rich detail features extracted by ResNet.

    Concretely, CMGM flattens the C×H×W feature map extracted by ResNet into a C×HW matrix, and does the same for the feature map extracted by the Swin Transformer. Inspired by the multi-head attention mechanism, layer normalization and linear projections then produce three new features Q, K and V, and matrix multiplication yields the attention output, as shown in the following formula:


    The result is then linearly projected, reshaped back to C×H×W, and passed through a convolutional layer, with two shortcut connections as shown in the figure. Besides producing the grafted features, CMGM also produces a Cross Attention Matrix (CAM), whose generation can be expressed as:
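
    A minimal single-head, projection-free NumPy sketch of the cross-attention step described above; taking the query from the ResNet features and the key/value from the Transformer features, and omitting the layer normalization, linear projections, and shortcut connections, are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_model_graft(f_res, f_trans):
    """Single-head, projection-free cross attention between flattened
    CNN features (query) and Transformer features (key/value).
    f_res, f_trans: (C, H, W) feature maps from the two backbones."""
    c, h, w = f_res.shape
    q = f_res.reshape(c, h * w).T            # (HW, C) queries from ResNet
    k = f_trans.reshape(c, h * w).T          # (HW, C) keys from the Transformer
    v = k                                    # values share the Transformer features
    attn = softmax(q @ k.T / np.sqrt(c))     # (HW, HW) cross-attention matrix
    grafted = (attn @ v).T.reshape(c, h, w)  # Transformer semantics re-weighted
    return grafted, attn                     # attn plays the role of the CAM
```

    The returned attention matrix is the quantity that the attention-guided supervision described below compares against a matrix derived from the ground-truth label.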



    To better graft the global semantic information of the Transformer features onto the ResNet branch, we design an Attention Guided Loss (AGL) to assist this process. We argue that the cross attention matrix produced by CMGM should be similar to the attention matrix generated from the ground-truth label, since salient features should have higher similarity, i.e., higher activations in the cross attention matrix.
