2019-04-27 14:15:59 weixin_38498050 阅读数 122



Deep Joint Task Learning for Generic Object Extraction

Highly Efficient Forward and Backward Propagation of Convolutional Neural Networks for Pixelwise Classification

Segmentation from Natural Language Expressions

Semantic Object Parsing with Graph LSTM

Fine Hand Segmentation using Convolutional Neural Networks

Feedback Neural Network for Weakly Supervised Geo-Semantic Segmentation

FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics

A deep learning model integrating FCNNs and CRFs for brain tumor segmentation

Texture segmentation with Fully Convolutional Networks

Fast LIDAR-based Road Detection Using Convolutional Neural Networks


Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs

Annotating Object Instances with a Polygon-RNN

Semantic Segmentation via Structured Patch Prediction, Context CRF and Guidance CRF

Nighttime sky/cloud image segmentation

Distantly Supervised Road Segmentation

Superpixel clustering with deep features for unsupervised road segmentation

Learning to Segment Human by Watching YouTube

W-Net: A Deep Model for Fully Unsupervised Image Segmentation


End-to-end detection-segmentation network with ROI convolution


U-Net: Convolutional Networks for Biomedical Image Segmentation

DeepUNet: A Deep Fully Convolutional Network for Pixel-level Sea-Land Segmentation


TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation

Foreground Object Segmentation

Pixel Objectness

A Deep Convolutional Neural Network for Background Subtraction

Semantic Segmentation

Fully Convolutional Networks for Semantic Segmentation

From Image-level to Pixel-level Labeling with Convolutional Networks

Feedforward semantic segmentation with zoom-out features


Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Weakly- and Semi-Supervised Learning of a DCNN for Semantic Image Segmentation

DeepLab v2

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

DeepLabv2 (ResNet-101)


DeepLab v3

Rethinking Atrous Convolution for Semantic Image Segmentation


Conditional Random Fields as Recurrent Neural Networks


BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation

Efficient piecewise training of deep structured models for semantic segmentation


Learning Deconvolution Network for Semantic Segmentation


SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

SegNet: Pixel-Wise Semantic Labelling Using Deep Networks

Getting Started with SegNet


ParseNet: Looking Wider to See Better


Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation

Semantic Image Segmentation via Deep Parsing Network

Multi-Scale Context Aggregation by Dilated Convolutions

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Object Segmentation on SpaceNet via Multi-task Network Cascades (MNC)

Learning Transferrable Knowledge for Semantic Segmentation with Deep Convolutional Neural Network

Combining the Best of Convolutional Layers and Recurrent Layers: A Hybrid Network for Semantic Segmentation

Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation


ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation

Laplacian Reconstruction and Refinement for Semantic Segmentation

Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation

Natural Scene Image Segmentation Based on Multi-Layer Feature Extraction

Convolutional Random Walk Networks for Semantic Image Segmentation


ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Fully Convolutional Networks for Dense Semantic Labelling of High-Resolution Aerial Imagery

Deep Learning Markov Random Field for Semantic Segmentation

Region-based semantic segmentation with end-to-end training

Built-in Foreground/Background Prior for Weakly-Supervised Semantic Segmentation


PixelNet: Towards a General Pixel-level Architecture

Exploiting Depth from Single Monocular Images for Object Detection and Semantic Segmentation

  • intro: IEEE T. Image Processing
  • intro: propose an RGB-D semantic segmentation method which applies a multi-task training scheme: semantic label prediction and depth value regression
  • arxiv: https://arxiv.org/abs/1610.01706

PixelNet: Representation of the pixels, by the pixels, and for the pixels

Semantic Segmentation of Earth Observation Data Using Multimodal and Multi-scale Deep Networks

Deep Structured Features for Semantic Segmentation

CNN-aware Binary Map for General Semantic Segmentation

Efficient Convolutional Neural Network with Binary Quantization Layer

Mixed context networks for semantic segmentation

High-Resolution Semantic Labeling with Convolutional Neural Networks

Gated Feedback Refinement Network for Dense Image Labeling


RefineNet: Multi-Path Refinement Networks with Identity Mappings for High-Resolution Semantic Segmentation

RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation

Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes

Semantic Segmentation using Adversarial Networks

Improving Fully Convolution Network for Semantic Segmentation

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation

Training Bit Fully Convolutional Network for Fast Semantic Segmentation

Classification With an Edge: Improving Semantic Image Segmentation with Boundary Detection

  • intro: “an end-to-end trainable deep convolutional neural network (DCNN) for semantic segmentation with built-in awareness of semantically meaningful boundaries.”
  • arxiv: https://arxiv.org/abs/1612.01337

Diverse Sampling for Self-Supervised Learning of Semantic Segmentation

Mining Pixels: Weakly Supervised Semantic Segmentation Using Image Labels

FCNs in the Wild: Pixel-level Adversarial and Constraint-based Adaptation

Understanding Convolution for Semantic Segmentation

Label Refinement Network for Coarse-to-Fine Semantic Segmentation


Predicting Deeper into the Future of Semantic Segmentation

Guided Perturbations: Self Corrective Behavior in Convolutional Neural Networks

Not All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer Cascade

Large Kernel Matters – Improve Semantic Segmentation by Global Convolutional Network


Loss Max-Pooling for Semantic Image Segmentation

Reformulating Level Sets as Deep Recurrent Neural Network Approach to Semantic Segmentation


A Review on Deep Learning Techniques Applied to Semantic Segmentation


Joint Semantic and Motion Segmentation for dynamic scenes using Deep Convolutional Networks


ICNet for Real-Time Semantic Segmentation on High-Resolution Images


Feature Forwarding: Exploiting Encoder Representations for Efficient Semantic Segmentation

LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation

Pixel Deconvolutional Networks

Incorporating Network Built-in Priors in Weakly-supervised Semantic Segmentation

Deep Semantic Segmentation for Automated Driving: Taxonomy, Roadmap and Challenges

Semantic Segmentation with Reverse Attention

Stacked Deconvolutional Network for Semantic Segmentation


Learning Dilation Factors for Semantic Segmentation of Street Scenes

A Self-aware Sampling Scheme to Efficiently Train Fully Convolutional Networks for Semantic Segmentation


One-Shot Learning for Semantic Segmentation

An Adaptive Sampling Scheme to Efficiently Train Fully Convolutional Networks for Semantic Segmentation


Semantic Segmentation from Limited Training Data


Unsupervised Domain Adaptation for Semantic Segmentation with GANs


Neuron-level Selective Context Aggregation for Scene Segmentation


Road Extraction by Deep Residual U-Net


Mix-and-Match Tuning for Self-Supervised Semantic Segmentation

Error Correction for Dense Semantic Image Labeling


Semantic Segmentation via Highly Fused Convolutional Network with Multiple Soft Cost Functions


Instance Segmentation

Simultaneous Detection and Segmentation

Convolutional Feature Masking for Joint Object and Stuff Segmentation

Proposal-free Network for Instance-level Object Segmentation

Hypercolumns for object segmentation and fine-grained localization

SDS using hypercolumns

Learning to decompose for object detection and instance segmentation

Recurrent Instance Segmentation

Instance-sensitive Fully Convolutional Networks

Amodal Instance Segmentation

Bridging Category-level and Instance-level Semantic Image Segmentation

Bottom-up Instance Segmentation using Deep Higher-Order CRFs

DeepCut: Object Segmentation from Bounding Box Annotations using Convolutional Neural Networks

End-to-End Instance Segmentation and Counting with Recurrent Attention


Translation-aware Fully Convolutional Instance Segmentation

Fully Convolutional Instance-aware Semantic Segmentation

InstanceCut: from Edges to Instances with MultiCut

Deep Watershed Transform for Instance Segmentation

Object Detection Free Instance Segmentation With Labeling Transformations

Shape-aware Instance Segmentation

Interpretable Structure-Evolving LSTM

  • intro: CMU & Sun Yat-sen University & National University of Singapore & Adobe Research
  • intro: CVPR 2017 spotlight paper
  • arxiv: https://arxiv.org/abs/1703.03055

Mask R-CNN

Semantic Instance Segmentation via Deep Metric Learning


Pose2Instance: Harnessing Keypoints for Person Instance Segmentation


Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Instance-Level Salient Object Segmentation

Semantic Instance Segmentation with a Discriminative Loss Function

SceneCut: Joint Geometric and Object Segmentation for Indoor Scenes


S4 Net: Single Stage Salient-Instance Segmentation

Deep Extreme Cut: From Extreme Points to Object Segmentation


Learning to Segment Every Thing

Recurrent Neural Networks for Semantic Instance Segmentation


MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features


Recurrent Pixel Embedding for Instance Grouping

Specific Segmentation

A CNN Cascade for Landmark Guided Semantic Part Segmentation

End-to-end semantic face segmentation with conditional random fields as convolutional, recurrent and adversarial networks

Face Parsing via Recurrent Propagation

Face Parsing via a Fully-Convolutional Continuous CRF Neural Network


Boundary-sensitive Network for Portrait Segmentation


Segment Proposal

Learning to Segment Object Candidates

Learning to Refine Object Segments

FastMask: Segment Object Multi-scale Candidates in One Shot

Scene Labeling / Scene Parsing

Indoor Semantic Segmentation using depth information

Recurrent Convolutional Neural Networks for Scene Parsing

Learning hierarchical features for scene labeling

Multi-modal unsupervised feature learning for rgb-d scene labeling

Scene Labeling with LSTM Recurrent Neural Networks

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models

“Semantic Segmentation for Scene Understanding: Algorithms and Implementations” tutorial

Semantic Understanding of Scenes through the ADE20K Dataset

Learning Deep Representations for Scene Labeling with Guided Supervision

Learning Deep Representations for Scene Labeling with Semantic Context Guided Supervision

Spatial As Deep: Spatial CNN for Traffic Scene Understanding


Multi-Path Feedback Recurrent Neural Network for Scene Parsing

Scene Labeling using Recurrent Neural Networks with Explicit Long Range Contextual Dependency


Pyramid Scene Parsing Network

Open Vocabulary Scene Parsing


Deep Contextual Recurrent Residual Networks for Scene Labeling


Fast Scene Understanding for Autonomous Driving

  • intro: Published at “Deep Learning for Vehicle Perception”, workshop at the IEEE Symposium on Intelligent Vehicles 2017
  • arxiv: https://arxiv.org/abs/1708.02550

FoveaNet: Perspective-aware Urban Scene Parsing


BlitzNet: A Real-Time Deep Network for Scene Understanding

Semantic Foggy Scene Understanding with Synthetic Data


Restricted Deformable Convolution based Road Scene Semantic Segmentation Using Surround View Cameras



MIT Scene Parsing Benchmark

Semantic Understanding of Urban Street Scenes: Benchmark Suite



Large-scale Scene Understanding Challenge

Places2 Challenge


Human Parsing

Human Parsing with Contextualized Convolutional Neural Network

Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing

Cross-domain Human Parsing via Adversarial Feature and Label Adaptation

Video Object Segmentation

Fast object segmentation in unconstrained video

Recurrent Fully Convolutional Networks for Video Segmentation

Object Detection, Tracking, and Motion Segmentation for Object-level Video Segmentation

Clockwork Convnets for Video Semantic Segmentation

STFCN: Spatio-Temporal FCN for Semantic Video Segmentation

One-Shot Video Object Segmentation

Video Object Segmentation Without Temporal Information


Convolutional Gated Recurrent Networks for Video Segmentation

Learning Video Object Segmentation from Static Images

Semantic Video Segmentation by Gated Recurrent Flow Propagation

FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos

Unsupervised learning from video to detect foreground objects in single images


Semantically-Guided Video Object Segmentation


Learning Video Object Segmentation with Visual Memory


Flow-free Video Object Segmentation


Online Adaptation of Convolutional Neural Networks for Video Object Segmentation


Video Object Segmentation using Tracked Object Proposals

Video Object Segmentation with Re-identification

Pixel-Level Matching for Video Object Segmentation using Convolutional Neural Networks

SegFlow: Joint Learning for Video Object Segmentation and Optical Flow

Video Semantic Object Segmentation by Self-Adaptation of DCNN


Learning to Segment Moving Objects


Instance Embedding Transfer to Unsupervised Video Object Segmentation

Panoptic Segmentation


DAVIS: Densely Annotated VIdeo Segmentation

DAVIS Challenge on Video Object Segmentation 2017



TF Image Segmentation: Image Segmentation framework

KittiSeg: A Kitti Road Segmentation model implemented in tensorflow.

Semantic Segmentation Architectures Implemented in PyTorch

PyTorch for Semantic Segmentation


3D Segmentation

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

DA-RNN: Semantic Mapping with Data Associated Recurrent Neural Networks


SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud

SEGCloud: Semantic Segmentation of 3D Point Clouds


Segmentation Results: VOC2012 BETA: Competition “comp6” (train on own data)



Deep Learning for Natural Image Segmentation Priors


Image Segmentation Using DIGITS 5


Image Segmentation with Tensorflow using CNNs and Conditional Random Fields

Fully Convolutional Networks (FCNs) for Image Segmentation

Image segmentation with Neural Net

A 2017 Guide to Semantic Segmentation with Deep Learning



Deep learning for image segmentation

2018-12-21 11:38:12 sinat_33487968 阅读数 188

Review of Deep Learning Algorithms for Image Semantic Segmentation

Examples of the COCO dataset for stuff segmentation. Source: http://cocodataset.org/

Deep learning algorithms have solved several computer vision tasks with an increasing level of difficulty. In my previous blog posts, I have detailed the well-known ones: image classification and object detection. The image semantic segmentation challenge consists of classifying each pixel of an image (or just several of them) into an instance, each instance (or category) corresponding to an object or a part of the image (road, sky, …). This task is part of the concept of scene understanding: how can a deep learning model better learn the global context of a visual content?

The object detection task has exceeded the image classification task in terms of complexity. It consists of creating bounding boxes around the objects contained in an image and classifying each one of them. Most object detection models use anchor boxes and proposals to detect bounding boxes around objects. Unfortunately, only a few models take the entire context of an image into account; the others classify only a small part of the visual information and thus cannot provide a full comprehension of a scene.

In order to understand a scene, each piece of visual information has to be associated with an entity while considering the spatial information. Several other challenges have emerged to really understand the actions in an image or a video: keypoint detection, action recognition, video captioning, visual question answering and so on. A better comprehension of the environment will help in many fields. For example, an autonomous car needs to delimit the roadsides with high precision in order to move by itself. In robotics, production machines should understand how to grab, turn and put together two different pieces, which requires delimiting the exact shape of the objects.

In this blog post, the architectures of a few previous state-of-the-art models on image semantic segmentation challenges are detailed. Note that researchers test their algorithms on different datasets (PASCAL VOC, PASCAL Context, COCO, Cityscapes), which have changed over the years, and use different evaluation metrics. The cited performances therefore cannot be directly compared per se. Moreover, the results depend on the pretrained backbone network; the results published in this post correspond to the best scores published in each paper with respect to its test dataset.

Datasets and Metrics

PASCAL Visual Object Classes (PASCAL VOC)

The PASCAL VOC dataset (2012) is well known and commonly used for object detection and segmentation. More than 11k images compose the train and validation datasets, while 10k images are dedicated to the test dataset.

The segmentation challenge is evaluated using the mean Intersection over Union (mIoU) metric. The Intersection over Union (IoU) is a metric also used in object detection to evaluate the relevance of predicted locations. The IoU is the ratio between the area of overlap and the area of union between the ground-truth and the predicted areas. The mIoU is the average of the IoU of the segmented objects over all the images of the test dataset.
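To make the metric concrete, here is a minimal NumPy sketch of the IoU and mIoU computation on toy label maps (two classes; function and variable names are my own, not from any benchmark toolkit):

```python
import numpy as np

def iou_per_class(pred, gt, cls):
    """IoU for one class: |intersection| / |union| of the two binary masks."""
    p, g = (pred == cls), (gt == cls)
    union = np.logical_or(p, g).sum()
    if union == 0:
        return np.nan  # class absent from both masks: ignored in the average
    return np.logical_and(p, g).sum() / union

def mean_iou(pred, gt, num_classes):
    """mIoU: average the per-class IoU, skipping classes absent everywhere."""
    ious = [iou_per_class(pred, gt, c) for c in range(num_classes)]
    return float(np.nanmean(ious))

# Toy 4x4 label maps with classes {0: background, 1: object}
gt   = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]])
pred = np.array([[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]])
print(mean_iou(pred, gt, num_classes=2))  # (11/12 + 4/5) / 2 ≈ 0.858
```

In real benchmarks the per-class intersections and unions are accumulated over the whole test set before averaging, but the principle is the one shown here.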

Examples of the 2012 PASCAL VOC dataset for image segmentation. Source: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html


The PASCAL-Context dataset (2014) is an extension of the 2010 PASCAL VOC dataset. It contains around 10k images for training, 10k for validation and 10k for testing. The specificity of this release is that the entire scene is segmented, providing more than 400 categories. Note that the images were annotated over three months by six in-house annotators.

The official evaluation metric of the PASCAL-Context challenge is the mIoU. Several other metrics, such as the pixel accuracy (pixAcc), are published by researchers. Here, the performances will be compared using the mIoU only.

Example of the PASCAL-Context dataset. Source: https://cs.stanford.edu/~roozbeh/pascal-context/

Common Objects in COntext (COCO)

There are two COCO challenges (in 2017 and 2018) for image semantic segmentation (“object detection” and “stuff segmentation”). The “object detection” task consists of segmenting and categorizing objects into 80 categories. The “stuff segmentation” task uses data with large segmented regions of the images (sky, wall, grass) that contain almost the entire visual information. In this blog post, only the results of the “object detection” task will be compared, because too few of the quoted research papers have published results on the “stuff segmentation” task.

The COCO dataset for object segmentation is composed of more than 200k images with over 500k segmented object instances. It contains a training dataset, a validation dataset, a test dataset for researchers (test-dev) and a test dataset for the challenge (test-challenge). The annotations of both test datasets are not available. These datasets contain 80 categories, and only the corresponding objects are segmented. This challenge uses the same metrics as the object detection challenge: the Average Precision (AP) and the Average Recall (AR), both based on the Intersection over Union (IoU).

Details about the IoU and AP metrics are available in my previous blog post. Like the AP, the Average Recall is computed over multiple IoU thresholds spanning a range of overlap values. For a fixed IoU threshold, the objects whose predicted / ground-truth overlap reaches that threshold are kept, and the Recall metric is computed for the detected objects. The final AR metric is the average of the Recalls computed over all the IoU threshold values. Basically, the AP and AR metrics for segmentation work the same way as for object detection, except that the IoU is computed pixel-wise on a non-rectangular shape.
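The AR computation described above can be sketched as follows, assuming we already have the best mask IoU achieved for each ground-truth object (a simplification of the full COCO matching procedure; names are illustrative):

```python
import numpy as np

def average_recall(matched_ious, num_gt, thresholds=np.linspace(0.5, 0.95, 10)):
    """AR: recall averaged over a range of IoU thresholds (COCO-style 0.50:0.05:0.95).
    `matched_ious` holds the best mask IoU achieved for each ground-truth object."""
    ious = np.asarray(matched_ious)
    # Recall at a threshold t = fraction of ground-truth objects matched with IoU >= t
    recalls = [(ious >= t).sum() / num_gt for t in thresholds]
    return float(np.mean(recalls))

# Three ground-truth objects; the model's best-matching masks reach these IoUs
print(average_recall([0.95, 0.72, 0.40], num_gt=3))  # 0.5
```

The object at IoU 0.95 counts at all ten thresholds, the one at 0.72 only up to 0.70, and the one at 0.40 never, which averages out to a recall of 0.5.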

Example of the COCO dataset for object segmentation. Source: http://cocodataset.org/


The Cityscapes dataset was released in 2016 and consists of complex segmented urban scenes from 50 cities. It is composed of 23.5k images for training and validation (fine and coarse annotations) and 1.5k images for testing (fine annotations only). The images are fully segmented, like those of the PASCAL-Context dataset, with 29 classes (grouped into 8 super-categories: flat, human, vehicle, construction, object, nature, sky, void). It is often used to evaluate semantic segmentation models because of its complexity, and it is also well known for its similarity to real urban scenes in autonomous driving applications. The performances of semantic segmentation models are computed using the mIoU metric, as for the PASCAL datasets.

Examples of the Cityscapes dataset. Top: coarse annotations. Bottom: fine annotation. Source: https://www.cityscapes-dataset.com/

Fully Convolutional Network (FCN)

J. Long et al. (2015) were the first to develop a Fully Convolutional Network (FCN), containing only convolutional layers, trained end-to-end for image segmentation.

The FCN takes an image of arbitrary size and produces a segmented image of the same size. The authors start by modifying well-known architectures (AlexNet, VGG16, GoogLeNet) to accept inputs of any size, replacing all the fully connected layers with convolutional layers. Since the network produces several feature maps with small sizes and dense representations, an upsampling step is necessary to create an output with the same size as the input. Basically, it consists of a convolutional layer with a fractional stride (smaller than 1). It is commonly called deconvolution because it creates an output larger than its input. This way, the network is trained using a pixel-wise loss. Moreover, the authors added skip connections to combine high-level feature map representations with the more specific and dense ones at the top of the network.
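The "deconvolution" (transposed convolution) upsampling can be sketched in NumPy as scattering a kernel into the output at a given stride; in the FCN this kernel is learned, here it is fixed to bilinear-like weights for illustration:

```python
import numpy as np

def transposed_conv2d(x, kernel, stride=2):
    """Transposed convolution as used by the FCN for upsampling: each input
    pixel scatters a weighted copy of `kernel` into a larger output map."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros((stride * (h - 1) + kh, stride * (w - 1) + kw))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh, j * stride:j * stride + kw] += x[i, j] * kernel
    return out

# A 2x2 coarse score map upsampled with a fixed bilinear-like 3x3 kernel
kernel = np.array([[0.25, 0.5, 0.25],
                   [0.5,  1.0, 0.5],
                   [0.25, 0.5, 0.25]])
coarse = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
print(transposed_conv2d(coarse, kernel).shape)  # (5, 5)
```

With stride 2, a 2x2 input grows to 5x5, and overlapping kernel copies blend neighbouring coarse scores, which is why the operation behaves like learnable interpolation.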

The authors reached a 62.2% mIoU score on the 2012 PASCAL VOC segmentation challenge using models pretrained on the 2012 ImageNet dataset. On the 2012 PASCAL VOC object detection challenge, the benchmark model Faster R-CNN reached 78.8% mAP. Even if we cannot directly compare the two results (different models, different datasets and different challenges), it seems that the semantic segmentation task is harder to solve than the object detection task.

Architecture of the FCN. Note that the skip connections are not drawn here. Source: J. Long et al. (2015)


W. Liu et al. (2015) have published a paper describing improvements to the FCN model of J. Long et al. (2015). According to the authors, the FCN model loses the global context of the image in its deep layers as the generated feature maps specialise. The ParseNet is an end-to-end convolutional network that predicts values for all the pixels at the same time; it avoids taking regions as input in order to keep the global information. The authors use a module taking feature maps as input. In a first step, the feature maps are reduced into a single global feature vector with a pooling layer. This context vector is normalised using the L2 Euclidean norm and then unpooled (the output is an expanded version of the input) to produce new feature maps with the same sizes as the initial ones. In a second step, the entire initial feature maps are normalised using the L2 Euclidean norm. The last step concatenates the feature maps generated by the two previous steps. The normalisation is helpful to scale the values of the concatenated feature maps, and it leads to better performances. Basically, the ParseNet is an FCN with this module replacing convolutional layers. It obtained a 40.4% mIoU score on the PASCAL-Context challenge and a 69.8% mIoU score on the 2012 PASCAL VOC segmentation challenge.
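The three-step module described above can be sketched in NumPy as follows (a simplification: the paper normalises per channel with learned scales, while this sketch applies a plain whole-tensor L2 norm):

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale a tensor to unit L2 norm (simplified, whole-tensor version)."""
    return x / (np.linalg.norm(x) + eps)

def parsenet_module(feats):
    """ParseNet-style global context module (sketch): global-average-pool the
    feature maps into one context vector, L2-normalise it, "unpool" it back to
    the spatial size, and concatenate with the L2-normalised original maps."""
    c, h, w = feats.shape
    context = feats.mean(axis=(1, 2))            # global average pooling -> (c,)
    context = l2_normalize(context)
    # Unpooling here means tiling the vector back over the spatial grid
    unpooled = np.broadcast_to(context[:, None, None], (c, h, w))
    local = l2_normalize(feats)
    return np.concatenate([local, unpooled], axis=0)  # (2c, h, w)

feats = np.random.rand(4, 8, 8)
print(parsenet_module(feats).shape)  # (8, 8, 8)
```

Every spatial position now carries both its local descriptor and the same global context vector, which is how the module reinjects image-level information.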

Comparison between the segmentation of the FCN and the ParseNet and architecture of the ParseNet module. Source: W. Liu et al. (2015)

Convolutional and Deconvolutional Networks

H. Noh et al. (2015) have released an end-to-end model composed of two linked parts. The first part is a convolutional network with a VGG16 architecture. It takes as input an instance proposal, for example a bounding box generated by an object detection model. The proposal is processed and transformed by the convolutional network to generate a vector of features. The second part is a deconvolutional network that takes the vector of features as input and generates a map of pixel-wise probabilities of belonging to each class. The deconvolutional network uses unpooling targeting the maximum activations to keep the location of the information in the maps. It also uses deconvolution, associating a single input activation with multiple feature map outputs. The deconvolution expands the feature maps while keeping the information dense.
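The unpooling with stored maximum locations ("switches") can be sketched as a pooling pass that records argmax positions and an unpooling pass that puts values back exactly there (a minimal single-channel version; names are my own):

```python
import numpy as np

def max_pool_with_switches(x, k=2):
    """2x2 max pooling that also records the argmax position of each window."""
    h, w = x.shape
    pooled = np.zeros((h // k, w // k))
    switches = np.zeros((h // k, w // k), dtype=int)
    for i in range(h // k):
        for j in range(w // k):
            window = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            switches[i, j] = window.argmax()   # flat index inside the window
            pooled[i, j] = window.max()
    return pooled, switches

def max_unpool(pooled, switches, k=2):
    """Unpooling as in the deconvolution network: each pooled value goes back
    to its recorded maximum location; the rest of the window stays zero."""
    h, w = pooled.shape
    out = np.zeros((h * k, w * k))
    for i in range(h):
        for j in range(w):
            di, dj = divmod(switches[i, j], k)
            out[i * k + di, j * k + dj] = pooled[i, j]
    return out

x = np.array([[1., 5., 2., 0.],
              [3., 4., 1., 6.],
              [7., 0., 3., 3.],
              [2., 8., 4., 1.]])
p, s = max_pool_with_switches(x)
print(max_unpool(p, s))
```

This is why the reconstruction is sparse but spatially precise: only the positions that won the max pooling receive a value back.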

Comparison of the convolutional network layers (pooling and convolution) with the deconvolutional network layers (unpooling and deconvolution). Source: H. Noh et al. (2015)

The authors have analysed deconvolution feature maps and they have noted that the low-level ones are specific to the shape while the higher-level ones help to classify the proposal. Finally, when all the proposals of an image are processed by the entire network, the maps are concatenated to obtain the fully segmented image. This network has obtained a 72.5% mIoU on the 2012 PASCAL VOC segmentation challenge.

Architecture of the full network. The convolution network is based on the VGG16 architecture. The deconvolution network uses unpooling and deconvolution layers. Source: H. Noh et al. (2015)


O. Ronneberger et al. (2015) have extended the FCN of J. Long et al. (2015) for biological microscopy images. The authors created a network called U-net, composed of two parts: a contracting part to compute features and an expanding part to spatially localise patterns in the image. The downsampling or contracting part has an FCN-like architecture extracting features with 3x3 convolutions. The upsampling or expanding part uses up-convolution (or deconvolution), reducing the number of feature maps while increasing their height and width. Cropped feature maps from the downsampling part of the network are copied into the upsampling part to avoid losing pattern information. Finally, a 1x1 convolution processes the feature maps to generate a segmentation map and thus categorise each pixel of the input image. Since then, the U-net architecture has been widely extended in recent works (FPN, PSPNet, DeepLabv3 and so on). Note that it does not use any fully connected layer. As a consequence, the number of parameters of the model is reduced and it can be trained with a small labelled dataset (using appropriate data augmentation). For example, the authors used a public dataset with 30 images for training during their experiments.
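The characteristic crop-and-concatenate skip connection can be sketched as follows (channel-first shapes; the sizes are illustrative, not the paper's exact ones):

```python
import numpy as np

def center_crop(feats, th, tw):
    """Crop feature maps (c, h, w) to the target spatial size, keeping the centre."""
    _, h, w = feats.shape
    dh, dw = (h - th) // 2, (w - tw) // 2
    return feats[:, dh:dh + th, dw:dw + tw]

def unet_skip_concat(encoder_feats, decoder_feats):
    """U-Net skip connection (sketch): crop the encoder maps to the (smaller)
    decoder size, then concatenate along the channel axis."""
    _, h, w = decoder_feats.shape
    cropped = center_crop(encoder_feats, h, w)
    return np.concatenate([cropped, decoder_feats], axis=0)

enc = np.random.rand(64, 56, 56)   # contracting-path features
dec = np.random.rand(64, 52, 52)   # upsampled expanding-path features
print(unet_skip_concat(enc, dec).shape)  # (128, 52, 52)
```

The crop is needed because U-net's unpadded 3x3 convolutions shrink the maps, so the encoder maps are slightly larger than the decoder maps they are merged with.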

Architecture of the U-net for a given input image. The blue boxes correspond to feature maps blocks with their denoted shapes. The white boxes correspond to the copied and cropped feature maps. Source: O. Ronneberger et al. (2015)

Feature Pyramid Network (FPN)

The Feature Pyramid Network (FPN) was developed by T.-Y. Lin et al. (2016) and is used in object detection and image segmentation frameworks. Its architecture is composed of a bottom-up pathway, a top-down pathway and lateral connections, in order to join low-resolution and high-resolution features. The bottom-up pathway takes an image of arbitrary size as input. It is processed with convolutional layers and downsampled by pooling layers. Note that each bunch of feature maps with the same size is called a stage; the output of the last layer of each stage provides the features for the corresponding pyramid level. The top-down pathway consists of upsampling the last feature maps with unpooling while enhancing them with feature maps from the same stage of the bottom-up pathway using lateral connections. These connections merge the feature maps of the bottom-up pathway, processed with a 1x1 convolution (to reduce their dimensions), with the feature maps of the top-down pathway.

Detail of a top-down block process with the lateral connection and the sum of the feature maps. Source: T.-Y. Lin et al (2016)

The merged feature maps are then processed by a 3x3 convolution to produce the output of the stage. Finally, each stage of the top-down pathway generates a prediction to detect an object. For image segmentation, the authors use two Multi-Layer Perceptrons (MLPs) to generate two masks of different sizes over the objects. It works similarly to Region Proposal Networks with anchor boxes (R-CNN, R. Girshick et al. (2014); Fast R-CNN, R. Girshick et al. (2015); Faster R-CNN, S. Ren et al. (2016); and so on). This method is efficient because it better propagates low-level information through the network. The FPN based on the DeepMask (P. O. Pinheiro et al. (2015)) and SharpMask (P. O. Pinheiro et al. (2016)) frameworks achieved a 48.1% Average Recall (AR) score on the 2016 COCO segmentation challenge.
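One top-down merge step can be sketched as follows, with a plain weight matrix standing in for the learned 1x1 lateral convolution and nearest-neighbour upsampling as a stand-in for the paper's interpolation (all names and sizes are illustrative):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of (c, h, w) feature maps."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(top_down, bottom_up, w_lateral):
    """FPN top-down step (sketch): upsample the coarser top-down maps and add
    the bottom-up maps after a 1x1 convolution. `w_lateral` is a per-pixel
    channel mix playing the role of the learned 1x1 lateral convolution."""
    lateral = np.einsum('oc,chw->ohw', w_lateral, bottom_up)  # 1x1 conv: 16 -> 8 channels
    return upsample2x(top_down) + lateral                     # element-wise sum

top = np.random.rand(8, 4, 4)       # coarse pyramid level, 8 channels
bottom = np.random.rand(16, 8, 8)   # finer bottom-up stage, 16 channels
w = np.random.rand(8, 16)           # 1x1 conv weights aligning channel counts
print(fpn_merge(top, bottom, w).shape)  # (8, 8, 8)
```

Repeating this step down the pyramid is what gives every level both high-resolution detail (from the lateral input) and strong semantics (from the top-down input).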

Comparison of architectures. (a): The image is scaled to several sizes and each one is processed with convolutions to provide predictions, which is computationally expensive. (b): The image has a single scale processed by a CNN with convolution and pooling layers. (c): Each step of the CNN is used to provide a prediction. (d): Architecture of the FPN, with the bottom-up part on the left and the top-down part on the right. Source: T.-Y. Lin et al (2016)

Pyramid Scene Parsing Network (PSPNet)

H. Zhao et al. (2016) have developed the Pyramid Scene Parsing Network (PSPNet) to better learn the global context representation of a scene. Patterns are extracted from the input image using a feature extractor (a ResNet, K. He et al. (2015)) with a dilated network strategy¹. The feature maps feed a Pyramid Pooling Module that distinguishes patterns at different scales. They are pooled at four different scales, each one corresponding to a pyramid level, and processed by a 1x1 convolutional layer to reduce their dimensions. This way, each pyramid level analyses sub-regions of the image at different locations. The outputs of the pyramid levels are upsampled and concatenated with the initial feature maps, which thus contain both the local and the global context information. Then, they are processed by a convolutional layer to generate the pixel-wise predictions. The best PSPNet, with a ResNet pretrained on the COCO dataset, reached an 85.4% mIoU score on the 2012 PASCAL VOC segmentation challenge.
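The Pyramid Pooling Module can be sketched as follows (the paper's 1x1 dimension-reduction convolutions after pooling are omitted for brevity, and nearest-neighbour upsampling stands in for bilinear interpolation):

```python
import numpy as np

def adaptive_avg_pool(x, bins):
    """Average-pool (c, h, w) maps into a (c, bins, bins) grid."""
    c, h, w = x.shape
    out = np.zeros((c, bins, bins))
    for i in range(bins):
        for j in range(bins):
            hs, he = i * h // bins, (i + 1) * h // bins
            ws, we = j * w // bins, (j + 1) * w // bins
            out[:, i, j] = x[:, hs:he, ws:we].mean(axis=(1, 2))
    return out

def pyramid_pooling(feats, levels=(1, 2, 3, 6)):
    """PSPNet pyramid pooling (sketch): pool at several grid sizes, upsample
    each pooled map back to the input size, concatenate with the originals."""
    c, h, w = feats.shape
    outs = [feats]
    for b in levels:
        pooled = adaptive_avg_pool(feats, b)
        up = pooled.repeat(h // b, axis=1).repeat(w // b, axis=2)
        outs.append(up)
    return np.concatenate(outs, axis=0)

feats = np.random.rand(4, 6, 6)
print(pyramid_pooling(feats).shape)  # (20, 6, 6)
```

The 1x1 bin is a global average (whole-image context), while the 6x6 bins summarise small sub-regions, so the concatenated output mixes context at four granularities.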

PSPNet architecture. The input image (a) is processed by a CNN to generate feature maps (b). They feed a Pyramid Pooling Module (c) and a final convolutional layer generates the pixel-wise predictions. Source: H. Zhao et al. (2016)

Mask R-CNN

K. He et al. (2017) have released the Mask R-CNN model beating all previous benchmarks on many COCO challenges². I have already provided details about Mask R-CNN for object detection in my previous blog post. As a reminder, the Faster R-CNN (S. Ren et al. (2015)) architecture for object detection uses a Region Proposal Network (RPN) to propose bounding box candidates. The RPN extracts Regions of Interest (RoI), and a RoIPool layer computes features from these proposals in order to infer the bounding box coordinates and the class of the object. The Mask R-CNN is a Faster R-CNN with 3 output branches: the first one computes the bounding box coordinates, the second one computes the associated class and the last one computes the binary mask³ to segment the object. The binary mask has a fixed size and it is generated by a FCN for a given RoI. It also uses a RoIAlign layer instead of a RoIPool to avoid misalignments due to the quantization of the RoI coordinates. The particularity of the Mask R-CNN model is its multi-task loss combining the losses of the bounding box coordinates, the predicted class and the segmentation mask. The model tries to solve complementary tasks, leading to better performance on each individual task. The best Mask R-CNN uses a ResNeXt (S. Xie et al. (2016)) to extract features and a FPN architecture. It has obtained a 37.1% AP score on the 2016 COCO segmentation challenge and a 41.8% AP score on the 2017 COCO segmentation challenge.
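The multi-task loss idea can be sketched in a few lines. This is an illustrative simplification (hypothetical names and toy numbers, not the paper's exact formulation): the total per-RoI loss sums a classification term, a box term and a binary mask term, and the mask term only reads the mask channel of the ground-truth class, so classes do not compete across mask channels:

```python
import math

def bce(p, y):
    """Binary cross-entropy for one predicted probability p, target y in {0, 1}."""
    eps = 1e-12
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def multi_task_loss(cls_loss, box_loss, mask_probs, mask_target, gt_class):
    """mask_probs: dict mapping class -> flat list of per-pixel probabilities."""
    probs = mask_probs[gt_class]          # only the ground-truth class mask counts
    mask_loss = sum(bce(p, y) for p, y in zip(probs, mask_target)) / len(probs)
    return cls_loss + box_loss + mask_loss

loss = multi_task_loss(
    cls_loss=0.2,
    box_loss=0.1,
    mask_probs={"person": [0.9, 0.8, 0.1], "car": [0.5, 0.5, 0.5]},
    mask_target=[1, 1, 0],
    gt_class="person",
)
```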

Mask R-CNN architecture. The first layer is a RPN extracting the RoI. The second layer processes the RoI to generate feature maps. They are directly used to compute the bounding box coordinates and the predicted class. The feature maps are also processed by an FCN (third layer) to generate the binary mask. Source: K. He et al. (2017)

DeepLab, DeepLabv3 and DeepLabv3+


Inspired by the FPN model of T.-Y. Lin et al. (2016), L.-C. Chen et al. (2017) have released DeepLab, combining atrous convolution, spatial pyramid pooling and fully connected CRFs. The model presented in this paper is also called DeepLabv2 because it is an adjustment of the initial DeepLab model (details about the initial one will not be provided to avoid redundancy). According to the authors, consecutive max-pooling and striding reduce the resolution of the feature maps in deep neural networks. They have introduced the atrous convolution, which is basically the dilated convolution of H. Zhao et al. (2016). It consists of filters targeting sparse pixels with a fixed rate. For example, if the rate is equal to 2, the filter targets one pixel out of two in the input; if the rate is equal to 1, the atrous convolution is a basic convolution. Atrous convolution makes it possible to capture objects at multiple scales. When used without max-pooling, it increases the resolution of the final output without increasing the number of weights.
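The rate parameter described above can be demonstrated with a minimal 1-D atrous convolution in pure Python (illustrative only; function name is hypothetical). With rate 1 it reduces to an ordinary convolution; with rate r the filter taps every r-th input sample, enlarging the receptive field without adding weights:

```python
def atrous_conv1d(x, w, rate=1):
    """Valid-mode 1-D convolution with dilation `rate` (no kernel flipping)."""
    k = len(w)
    span = (k - 1) * rate + 1                 # effective receptive field
    return [sum(w[j] * x[i + j * rate] for j in range(k))
            for i in range(len(x) - span + 1)]

x = [1, 2, 3, 4, 5, 6]
w = [1, 0, -1]
standard = atrous_conv1d(x, w, rate=1)   # taps x[i], x[i+1], x[i+2]
dilated = atrous_conv1d(x, w, rate=2)    # taps x[i], x[i+2], x[i+4]
# same three weights, but the rate-2 filter spans 5 input samples instead of 3
```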

Extraction patterns comparison between standard convolution on a low resolution input (top) and atrous convolution with a rate of 2 on a high resolution input (bottom). Source: L.-C. Chen et al. (2017)

The Atrous Spatial Pyramid Pooling consists of applying several atrous convolutions to the same input with different rates to detect spatial patterns. The feature maps are processed in separate branches and concatenated using bilinear interpolation to recover the original size of the input. The output feeds a fully connected Conditional Random Field (CRF) (Krähenbühl and V. Koltun (2012)) computing edges between the features and long-term dependencies to produce the semantic segmentation.
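The parallel-branch idea can be sketched in 1-D (illustrative pure Python with hypothetical names): the same input is filtered at several dilation rates and the per-rate responses are then fused. Here they are simply summed after trimming to a common length; DeepLab instead fuses full-size padded maps:

```python
def atrous_conv1d(x, w, rate):
    """Valid-mode 1-D convolution with dilation `rate`."""
    k = len(w)
    span = (k - 1) * rate + 1
    return [sum(w[j] * x[i + j * rate] for j in range(k))
            for i in range(len(x) - span + 1)]

def aspp_1d(x, w, rates=(1, 2, 3)):
    """Run the same filter at several rates in parallel and fuse the branches."""
    branches = [atrous_conv1d(x, w, r) for r in rates]
    n = min(len(b) for b in branches)          # trim to the shortest branch
    return [sum(b[i] for b in branches) for i in range(n)]

x = [float(v) for v in range(10)]
out = aspp_1d(x, [1.0, 1.0, 1.0])   # each branch sums 3 taps at its own rate
```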

Atrous Spatial Pyramid Pooling (ASPP) exploiting multiple scales of objects to classify the pixel in the center. Source: L.-C. Chen et al. (2017)

The best DeepLab using a ResNet-101 as backbone has reached a 79.7% mIoU score on the 2012 PASCAL VOC challenge, a 45.7% mIoU score on the PASCAL-Context challenge and a 70.4% mIoU score on the Cityscapes challenge.

DeepLab framework. Source: L.-C. Chen et al. (2017)


L.-C. Chen et al. (2017) have revisited the DeepLab framework to create DeepLabv3 combining cascaded and parallel modules of atrous convolutions. The authors have modified the ResNet architecture to keep high resolution feature maps in deep blocks using atrous convolutions.

Cascaded modules in the ResNet architecture. Source: L.-C. Chen et al. (2017)

The parallel atrous convolution modules are grouped in the Atrous Spatial Pyramid Pooling (ASPP). A 1x1 convolution and batch normalisation are added in the ASPP. All the outputs are concatenated and processed by another 1x1 convolution to create the final output with logits for each pixel.

Atrous Spatial Pyramid Pooling in the Deeplabv3 framework. Source: L.-C. Chen et al. (2017)

The best DeepLabv3 model, with a ResNet-101 pretrained on the ImageNet and JFT-300M datasets, has reached a 86.9% mIoU score on the 2012 PASCAL VOC challenge. It also achieved a 81.3% mIoU score on the Cityscapes challenge with a model trained only on the associated training dataset.


L.-C. Chen et al. (2018) have finally released the Deeplabv3+ framework using an encoder-decoder structure. The authors have introduced the atrous separable convolution composed of a depthwise convolution (spatial convolution for each channel of the input) and pointwise convolution (1x1 convolution with the depthwise convolution as input).
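The parameter saving that motivates the depthwise + pointwise factorization described above is simple arithmetic. The sketch below compares a standard k x k convolution with the separable version (illustrative helper names; biases ignored for clarity):

```python
# Parameter-count comparison for depthwise separable convolutions.

def standard_params(k, c_in, c_out):
    """A standard k x k convolution: one k x k x c_in filter per output channel."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Depthwise k x k (one spatial filter per channel) + 1 x 1 pointwise mixing."""
    depthwise = k * k * c_in        # one k x k filter per input channel
    pointwise = c_in * c_out        # 1 x 1 convolution mixing across channels
    return depthwise + pointwise

k, c_in, c_out = 3, 256, 256
full = standard_params(k, c_in, c_out)      # 589824
sep = separable_params(k, c_in, c_out)      # 67840
ratio = full / sep                          # roughly 8.7x fewer parameters
```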

Combination of depthwise convolution (a) and pointwise convolution (b) to create an atrous separable convolution (with a rate of 2). Source: L.-C. Chen et al. (2018)

They have used the DeepLabv3 framework as encoder. The best-performing model has a modified Xception (F. Chollet (2017)) backbone with more layers, atrous depthwise separable convolutions instead of max pooling, and batch normalization. The outputs of the ASPP are processed by a 1x1 convolution and upsampled by a factor of 4. The outputs of the encoder backbone CNN are also processed by another 1x1 convolution and concatenated with the previous ones. The feature maps feed two 3x3 convolutional layers and the outputs are upsampled by a factor of 4 to create the final segmented image.

DeepLabv3+ framework: an encoder with a backbone CNN and an ASPP produces feature representations to feed a decoder with 3x3 convolutions producing the final predicted image. Source: L.-C. Chen et al. (2018)

The best DeepLabv3+ pretrained on the COCO and the JFT datasets has obtained a 89.0% mIoU score on the 2012 PASCAL VOC challenge. The model trained on the Cityscapes dataset has reached a 82.1% mIoU score for the associated challenge.

Path Aggregation Network (PANet)

S. Liu et al. (2018) have recently released the Path Aggregation Network (PANet). This network is based on the Mask R-CNN and the FPN frameworks while enhancing information propagation. The feature extractor of the network uses a FPN architecture with a new augmented bottom-up pathway improving the propagation of low-layer features. Each stage of this third pathway takes as input the feature maps of the previous stage and processes them with a 3x3 convolutional layer. The output is added to the same stage feature maps of the top-down pathway using lateral connection and these feature maps feed the next stage.
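One step of the augmented bottom-up pathway can be sketched as follows. This is an illustrative toy (hypothetical names; plain stride-2 average pooling stands in for the paper's 3x3 stride-2 convolution):

```python
# One step of a PANet-style bottom-up pathway: downsample the previous
# stage's map and add it to the same-resolution top-down map via a lateral
# connection; the result feeds the next stage.

def downsample2x(fmap):
    """Stride-2 2x2 average pooling, standing in for the 3x3 stride-2 conv."""
    h, w = len(fmap), len(fmap[0])
    return [[(fmap[2 * i][2 * j] + fmap[2 * i][2 * j + 1] +
              fmap[2 * i + 1][2 * j] + fmap[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(w // 2)]
            for i in range(h // 2)]

def bottom_up_step(prev_stage, topdown_same_res):
    """Downsample the previous stage and add the lateral top-down map."""
    down = downsample2x(prev_stage)
    return [[a + b for a, b in zip(ra, rb)]
            for ra, rb in zip(down, topdown_same_res)]

prev = [[1.0, 1.0, 3.0, 3.0],
        [1.0, 1.0, 3.0, 3.0],
        [5.0, 5.0, 7.0, 7.0],
        [5.0, 5.0, 7.0, 7.0]]
td = [[0.5, 0.5], [0.5, 0.5]]
nxt = bottom_up_step(prev, td)   # 2x2 map feeding the next stage
```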

Lateral connection between the top-down pathway and the augmented bottom-up pathway. Source: S. Liu et al. (2018)

The feature maps of the augmented bottom-up pathway are pooled with a RoIAlign layer to extract proposals from all feature levels. An adaptive feature pooling layer processes the feature maps of each stage with a fully connected layer and concatenates all the outputs.

Adaptive feature pooling layer. Source: S. Liu et al. (2018)

The output of the adaptive feature pooling layer feeds three branches, similarly to the Mask R-CNN. The first two branches use a fully connected layer to generate the predictions of the bounding box coordinates and the associated object class. The third branch processes the RoI with a FCN to predict a binary pixel-wise mask for the detected object. The authors have added a path processing the output of a convolutional layer of the FCN with a fully connected layer to improve the localisation of the predicted pixels. Finally, the output of this parallel path is reshaped and concatenated with the output of the FCN to generate the binary mask.

Branch of the PANet predicting the binary mask using a FCN and a new path with a fully connected layer. Source: https://arxiv.org/pdf/1803.01534.pdf

The PANet has achieved a 42.0% AP score on the 2016 COCO segmentation challenge using a ResNeXt as feature extractor. It also reached a 46.7% AP score on the 2017 COCO segmentation challenge using an ensemble of seven feature extractors: ResNet (K. He et al. (2015)), ResNeXt (S. Xie et al. (2016)) and SENet (J. Hu et al. (2017)).

PANet architecture. (a): Feature extractor using the FPN architecture. (b): The new augmented bottom-up pathway added to the FPN architecture. (c): The adaptive feature pooling layer. (d): The two branches predicting the bounding box coordinates and the object class. (e): The branch predicting the binary mask of the object. The dashed lines correspond to links between low-level and high-level patterns; the red one in the FPN consists of more than 100 layers, while the green one is a shortcut in the PANet consisting of fewer than 10 layers. Source: S. Liu et al. (2018)

Context Encoding Network (EncNet)

H. Zhang et al. (2018) have created the Context Encoding Network (EncNet), capturing global information in an image to improve scene segmentation. The model starts by using a basic feature extractor (ResNet) and feeds the feature maps into a Context Encoding Module inspired by the Encoding Layer of H. Zhang et al. (2016). Basically, it learns visual centers and smoothing factors to create an embedding taking into account the contextual information while highlighting class-dependent feature maps. On top of the module, scaling factors for the contextual information are learnt with a feature-map attention layer (fully connected layer). In parallel, a Semantic Encoding Loss (SE-Loss), a binary cross-entropy loss, regularizes the training of the module by detecting the presence of object classes (unlike the pixel-wise loss). The outputs of the Context Encoding Module are reshaped and processed by a dilated convolution strategy while minimizing two SE-Losses and a final pixel-wise loss. The best EncNet has reached 52.6% mIoU and 81.2% pixAcc scores on the PASCAL-Context challenge. It has also achieved a 85.9% mIoU score on the 2012 PASCAL VOC segmentation challenge.

Dilated convolution strategy. In blue, the convolutional filter with D the dilation rate. The SE-Losses (Semantic Encoding Loss) are applied after the third and the fourth stages to detect object classes. A final Seg-Loss (pixel-wise loss) is applied to improve the segmentation. Source: H. Zhang et al. (2018)
Architecture of the EncNet. A feature extractor generates feature maps taken as input by a Context Encoding Module. The module is trained with regularisation using the Semantic Encoding Loss. The outputs of the module are processed by a dilated convolution strategy to produce the final segmentation. Source: H. Zhang et al. (2018)


Image semantic segmentation is a challenge recently tackled by end-to-end deep neural networks. One of the main issues across all these architectures is taking into account the global visual context of the input to improve the prediction of the segmentation. The state-of-the-art models use architectures trying to link different parts of the image in order to understand the relations between the objects.

Overview of the scores of the models over the 2012 PASCAL VOC dataset (mIoU), the PASCAL-Context dataset (mIoU), the 2016 / 2017 COCO datasets (AP and AR) and the Cityscapes dataset (mIoU)

The pixel-wise prediction over an entire image allows a better comprehension of the environment with high precision. Scene understanding is also approached with keypoint detection, action recognition, video captioning or visual question answering. In my opinion, the segmentation task combined with these other tasks using a multi-task loss should help improve the global context understanding of a scene.

Finally, I would like to thank Long Do Cao for helping me with all my posts; you should check his profile if you’re looking for a great senior data scientist ;).

2017-12-22 19:48:37 Asun0204 · 6332 views

Before deep learning took over computer vision, semantic segmentation was usually done with TextonForest and Random Forest based classifiers. As in image classification, convolutional neural networks (CNNs) have since seen many successes on segmentation problems.

Patch classification was a popular early deep learning approach: each pixel is classified using an image patch that contains it, and this result is used as the pixel's prediction. The reason for this approach was that deep classification networks required fixed-size input images.

In 2014, Fully Convolutional Networks (FCNs) popularized CNNs for pixel-wise prediction. An FCN has no fully connected layers, so the input image can be of any size, and it is faster than the patch classification method above. Almost all subsequent state-of-the-art semantic segmentation research adopted this approach.



The second type of architecture uses so-called dilated/atrous convolutions in place of pooling layers, increasing the receptive field without reducing the spatial dimensions, thereby also preserving localization information.

Conditional Random Field (CRF) postprocessing is commonly used to improve segmentation results. A CRF is a graphical model that smooths the segmentation based on the underlying image intensities: pixels with similar intensities are assumed to belong to the same class. CRF postprocessing can improve scores by about 1-2%.


This section summarizes some representative papers since the FCN. All of these architectures are benchmarked on the VOC2012 evaluation server.


Fully Convolutional Networks for Semantic Segmentation



  1. Popularized the use of end-to-end convolutional networks for semantic segmentation
  2. Re-purposed ImageNet-pretrained models for semantic segmentation
  3. Upsampled using deconvolution layers
  4. Introduced skip connections to improve the coarseness of upsampling


A fully connected layer can be viewed as a convolution with a kernel covering the whole input region. This is equivalent to evaluating the original classification network on overlapping input patches, but it is much more efficient because computation is shared over the overlapping regions. Although this observation is not unique to this paper (overfeat, this post), it significantly improved the state of the art on VOC2012.

After convolutionalizing the fully connected layers of an ImageNet-pretrained network such as VGG, the feature maps still need to be upsampled, because pooling operations in the CNN shrink them. Instead of simple bilinear interpolation, deconvolution layers can learn the interpolation. This layer is also known as upconvolution, full convolution, transposed convolution, or fractionally-strided convolution.
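The learned upsampling can be sketched in 1-D (illustrative pure Python, hypothetical names): a stride-s transposed convolution is equivalent to inserting s-1 zeros between input samples and convolving, so the interpolation weights are learned rather than fixed as in bilinear upsampling:

```python
def transposed_conv1d(x, w, stride=2):
    """Stride-s transposed convolution: each input scatters kernel w into the output."""
    k = len(w)
    out = [0.0] * ((len(x) - 1) * stride + k)
    for i, v in enumerate(x):
        for j in range(k):
            out[i * stride + j] += v * w[j]
    return out

x = [1.0, 2.0, 3.0]
w = [0.5, 1.0, 0.5]           # a learnable kernel; this one mimics linear interpolation
up = transposed_conv1d(x, w)  # length (3-1)*2 + 3 = 7
# interior values interpolate between the inputs: 1.0, 1.5, 2.0, 2.5, 3.0
```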

However, upsampling (even with deconvolution layers) produces coarse segmentation maps, because too much information is lost during pooling. Therefore, shortcut/skip connections are introduced from higher-resolution feature maps, similar to the grey connecting lines in the middle of the U-Net architecture diagram.


Score Comment Source
62.2 - leaderboard
67.2 More momentum. Not described in paper leaderboard


  • A major contribution, but somewhat far behind the current state of the art


SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation



  1. Transfers max-pooling indices to the decoder to improve segmentation resolution




Score Comment Source
59.9 - leaderboard


  • FCN and SegNet were among the first encoder-decoder architectures
  • SegNet's benchmark results are not strong

Dilated Convolutions

Multi-Scale Context Aggregation by Dilated Convolutions



  1. Uses dilated convolutions, a convolutional layer for dense prediction (labeling the object class of every pixel, which requires not only object locations but also object boundaries, as in image segmentation, semantic segmentation, edge detection, etc.)
  2. Proposes multi-scale context aggregation to improve dense prediction



Dilated convolution layers (also called atrous convolution in DeepLab) increase the receptive field without reducing the spatial dimensions.

The last two pooling layers of the pretrained model (VGG here) are removed, and the subsequent convolutional layers are replaced with dilated convolutions. In particular, the convolutions between pool-3 and pool-4 have dilation 2, and those after pool-4 have dilation 4. With this module (called the frontend module in the paper), dense prediction is achieved without increasing the number of parameters.

Another module (called the context module in the paper) is trained separately, taking the output of the frontend module as input. This module is a cascade of dilated convolutions with different dilations, so it aggregates multi-scale context and improves the frontend's predictions.
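The receptive-field effect of cascading dilated 3x3 convolutions is simple arithmetic (illustrative sketch with a hypothetical helper name): each stride-1 layer with dilation d adds (k-1)*d to the receptive field, so a growing dilation schedule enlarges the field much faster than stacking ordinary convolutions, while the parameter count per layer stays the same:

```python
def receptive_field(dilations, k=3):
    """Receptive field of a stack of stride-1 k x k convs with given dilations."""
    rf = 1
    for d in dilations:
        rf += (k - 1) * d
    return rf

plain = receptive_field([1] * 5)            # five ordinary 3x3 layers
dilated = receptive_field([1, 1, 2, 4, 8])  # a context-module-style schedule
# same number of 3x3 layers and weights, three times the receptive field
```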


Score Comment Source
71.3 frontend reported in the paper
73.5 frontend + context reported in the paper
74.7 frontend + context + CRF reported in the paper
75.3 frontend + context + CRF-RNN reported in the paper


  • Note that the predicted segmentation map is one eighth the size of the original image. This is the case for almost all methods; the final segmentation map is then obtained by interpolation.

DeepLab (v1 & v2)

v1 : Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs


v2 : DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs



  1. Uses atrous (dilated) convolutions
  2. Proposes atrous spatial pyramid pooling (ASPP)
  3. Uses fully connected CRFs






Score Comment Source
79.7 ResNet-101 + atrous Convolutions + ASPP + CRF leaderboard


RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation



  1. Encoder-decoder architecture with well-designed decoder blocks
  2. All components designed with residual connections


The dilated convolution approach is not without drawbacks. Dilated convolutions are computationally expensive and require a lot of memory, since they must be applied to a large number of high-resolution feature maps. This hinders the computation of high-resolution predictions; DeepLab's predictions, for example, are 1/8 the size of the original input.

This paper therefore proposes an encoder-decoder architecture. The encoder part consists of ResNet-101 blocks. The decoder has RefineNet blocks, which concatenate/fuse high-resolution features from the encoder and low-resolution features from the previous RefineNet block.

Each RefineNet block has a component that fuses multi-resolution features by upsampling the lower-resolution ones, and a component that captures context with repeated 5 x 5 stride-1 pooling layers. Each component uses residual connections following the identity mapping design.


Score Comment Source
84.2 Uses CRF, Multiscale inputs, COCO pretraining leaderboard


Pyramid Scene Parsing Network



  1. Proposes a pyramid pooling module to aggregate context
  2. Uses an auxiliary loss






Score Comment Source
85.4 MSCOCO pretraining, multi scale input, no CRF leaderboard
82.6 no MSCOCO pretraining, multi scale input, no CRF reported in the paper

Large Kernel Matters

Large Kernel Matters – Improve Semantic Segmentation by Global Convolutional Network



  1. Proposes an encoder-decoder architecture with very large kernels



Another reason for adopting large kernels is that although very deep networks such as ResNet have a large theoretical receptive field, experiments (https://arxiv.org/abs/1412.6856) show that the network tends to gather information from a much smaller region (the valid/effective receptive field).

Larger kernels are computationally more expensive and require many parameters. Therefore, the k x k convolution is approximated by the sum of two separable branches: a 1 x k convolution followed by a k x 1 convolution, and a k x 1 convolution followed by a 1 x k convolution. The paper calls this module the Global Convolutional Network (GCN).
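The saving behind this factorization is again parameter arithmetic (illustrative sketch, hypothetical function names; channel counts kept at C on both sides for simplicity): the separable form grows linearly rather than quadratically in k:

```python
# Parameter-count comparison for the GCN large-kernel factorization.

def full_kernel_params(k, c):
    """A dense k x k convolution with c input and c output channels."""
    return k * k * c * c

def gcn_params(k, c):
    """Two symmetric branches: (1 x k then k x 1) and (k x 1 then 1 x k)."""
    one_branch = k * c * c + k * c * c    # a 1 x k conv followed by a k x 1 conv
    return 2 * one_branch

k, c = 15, 256
full = full_kernel_params(k, c)   # 14745600
gcn = gcn_params(k, c)            # 3932160, i.e. k*k vs 4*k weights per channel pair
```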



Score Comment Source
82.2 - reported in the paper
83.6 Improved training, not described in the paper leaderboard

DeepLab v3

Rethinking Atrous Convolution for Semantic Image Segmentation



  1. Improved atrous spatial pyramid pooling (ASPP)
  2. A module of cascaded atrous convolutions


As in DeepLab v2 and Dilated Convolutions, the ResNet model is modified to use dilated convolutions. The improved ASPP involves the concatenation of image-level features, a 1x1 convolution, and three 3x3 atrous convolutions with different rates. Batch normalization is used after each parallel convolutional layer.



Both models perform better than DeepLab v2. The authors note that batch normalization and a better way of encoding multi-scale context are the main reasons for the improvement.


Score Comment Source
85.7 used ASPP (no cascaded modules) leaderboard



2017-11-16 17:24:10 aitazhixin · 4962 views

机器之心 (Synced): by 路雪, July 14, 2017

The last two pooling layers are removed from the pretrained classification network (VGG here), and the subsequent convolutional layers all use dilated convolutions. In particular, the convolutions between pool-3 and pool-4 use dilation 2, and those after pool-4 use dilation 4. With this module (called the frontend module in the paper), dense prediction is achieved without adding parameters. Another module (called the context module in the paper) is trained separately, taking the frontend module's output as input. It is a cascade of dilated convolutions with different dilation rates, so it aggregates multi-scale context and improves the predictions obtained from the frontend module.

v2: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

DeepLab v2 pipeline diagram

2019-01-04 09:22:57 Z199448Y · 383 views


  • The difficulty of semantic segmentation: classify each pixel into an instance, then map each instance (classification result) to an entity (person, road, etc.).
  • Challenges in truly understanding actions in images or video: keypoint detection, action recognition, video captioning, visual question answering, etc.
  • Common datasets:

PASCAL VOC: train/val 11k; test 10; image segmentation models are evaluated with mean Intersection over Union (mIoU)

PASCAL-Context: train 10k; val 10k; test 10k


Cityscapes: complex urban scene segmentation from 50 cities; train/val 23.5k; test 1.5k

  • Results of some networks:

FCN: with an ImageNet-pretrained model, mIoU = 62.2% on the 2012 PASCAL VOC

ParseNet: mIoU = 40.4% on PASCAL-Context; mIoU = 69.8% on the 2012 PASCAL VOC

Convolution and deconvolution: mIoU = 72.5% on the 2012 PASCAL VOC



Pyramid Scene Parsing Network (PSPNet): with a COCO-pretrained ResNet, mIoU = 85.4% on the 2012 PASCAL VOC

Mask R-CNN: the best Mask R-CNN uses ResNeXt to extract features and an FPN architecture; AP = 37.1% on the 2016 COCO and AP = 41.8% on the 2017 COCO


  • DeepLab: atrous convolutions, spatial pyramid pooling, fully connected CRFs

With ResNet-101 as backbone, DeepLab reaches mIoU = 79.7% on the 2012 PASCAL VOC, 45.7% on PASCAL-Context, and 70.4% on Cityscapes

  • DeepLabv3: cascaded and parallel modules of atrous convolutions (Atrous Spatial Pyramid Pooling, ASPP)

The best DeepLabv3, with a ResNet-101 pretrained on ImageNet and JFT-300M, reaches mIoU = 86.9% on the 2012 PASCAL VOC and 81.3% on Cityscapes

  • DeepLabv3+: DeepLabv3 combined with an encoder-decoder structure, introducing atrous separable convolutions composed of a depthwise convolution (convolving each input channel separately) and a pointwise convolution (a 1x1 convolution applied to the depthwise output).


The best DeepLabv3+, pretrained on COCO and JFT, reaches mIoU = 89.0% on the 2012 PASCAL VOC and 82.1% on Cityscapes

  • Path Aggregation Network (PANet): based on the Mask R-CNN and FPN frameworks while enhancing information propagation; feature extraction uses an improved FPN architecture with an added bottom-up augmented path that improves the propagation of low-layer features


  • Context Encoding Network (EncNet): captures global information in an image to improve scene segmentation

mIoU = 52.6% and pixAcc = 81.2% on PASCAL-Context; mIoU = 85.9% on the 2012 PASCAL VOC