
A Collection of Deep Learning Methods for Semantic Segmentation

Papers

Deep Joint Task Learning for Generic Object Extraction

Highly Efficient Forward and Backward Propagation of Convolutional Neural Networks for Pixelwise Classification

Segmentation from Natural Language Expressions

Semantic Object Parsing with Graph LSTM

Fine Hand Segmentation using Convolutional Neural Networks

Feedback Neural Network for Weakly Supervised Geo-Semantic Segmentation

FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics

A deep learning model integrating FCNNs and CRFs for brain tumor segmentation

Texture segmentation with Fully Convolutional Networks

Fast LIDAR-based Road Detection Using Convolutional Neural Networks

https://arxiv.org/abs/1703.03613

Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs

Annotating Object Instances with a Polygon-RNN

Semantic Segmentation via Structured Patch Prediction, Context CRF and Guidance CRF

Nighttime sky/cloud image segmentation

Distantly Supervised Road Segmentation

Superpixel clustering with deep features for unsupervised road segmentation

Learning to Segment Human by Watching YouTube

W-Net: A Deep Model for Fully Unsupervised Image Segmentation

https://arxiv.org/abs/1711.08506

End-to-end detection-segmentation network with ROI convolution

U-Net

U-Net: Convolutional Networks for Biomedical Image Segmentation

DeepUNet: A Deep Fully Convolutional Network for Pixel-level Sea-Land Segmentation

https://arxiv.org/abs/1709.00201

TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation

Foreground Object Segmentation

Pixel Objectness

A Deep Convolutional Neural Network for Background Subtraction

Semantic Segmentation

Fully Convolutional Networks for Semantic Segmentation

From Image-level to Pixel-level Labeling with Convolutional Networks

Feedforward semantic segmentation with zoom-out features

DeepLab

Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Weakly- and Semi-Supervised Learning of a DCNN for Semantic Image Segmentation

DeepLab v2

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

DeepLabv2 (ResNet-101)

http://liangchiehchen.com/projects/DeepLabv2_resnet.html

DeepLab v3

Rethinking Atrous Convolution for Semantic Image Segmentation

CRF-RNN

Conditional Random Fields as Recurrent Neural Networks

BoxSup

BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation

Efficient piecewise training of deep structured models for semantic segmentation

DeconvNet

Learning Deconvolution Network for Semantic Segmentation

SegNet

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

SegNet: Pixel-Wise Semantic Labelling Using a Deep Networks

Getting Started with SegNet

ParseNet

ParseNet: Looking Wider to See Better

DecoupledNet

Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation

Semantic Image Segmentation via Deep Parsing Network

Multi-Scale Context Aggregation by Dilated Convolutions

Instance-aware Semantic Segmentation via Multi-task Network Cascades

Object Segmentation on SpaceNet via Multi-task Network Cascades (MNC)

Learning Transferrable Knowledge for Semantic Segmentation with Deep Convolutional Neural Network

Combining the Best of Convolutional Layers and Recurrent Layers: A Hybrid Network for Semantic Segmentation

Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation

ScribbleSup

ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation

Laplacian Reconstruction and Refinement for Semantic Segmentation

Laplacian Pyramid Reconstruction and Refinement for Semantic Segmentation

Natural Scene Image Segmentation Based on Multi-Layer Feature Extraction

Convolutional Random Walk Networks for Semantic Image Segmentation

ENet

ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Fully Convolutional Networks for Dense Semantic Labelling of High-Resolution Aerial Imagery

Deep Learning Markov Random Field for Semantic Segmentation

Region-based semantic segmentation with end-to-end training

Built-in Foreground/Background Prior for Weakly-Supervised Semantic Segmentation

PixelNet

PixelNet: Towards a General Pixel-level Architecture

Exploiting Depth from Single Monocular Images for Object Detection and Semantic Segmentation

  • intro: IEEE T. Image Processing
  • intro: propose an RGB-D semantic segmentation method which applies a multi-task training scheme: semantic label prediction and depth value regression
  • arxiv: https://arxiv.org/abs/1610.01706

PixelNet: Representation of the pixels, by the pixels, and for the pixels

Semantic Segmentation of Earth Observation Data Using Multimodal and Multi-scale Deep Networks

Deep Structured Features for Semantic Segmentation

CNN-aware Binary Map for General Semantic Segmentation

Efficient Convolutional Neural Network with Binary Quantization Layer

Mixed context networks for semantic segmentation

High-Resolution Semantic Labeling with Convolutional Neural Networks

Gated Feedback Refinement Network for Dense Image Labeling

RefineNet

RefineNet: Multi-Path Refinement Networks with Identity Mappings for High-Resolution Semantic Segmentation

RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation

Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes

Semantic Segmentation using Adversarial Networks

Improving Fully Convolution Network for Semantic Segmentation

The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation

Training Bit Fully Convolutional Network for Fast Semantic Segmentation

Classification With an Edge: Improving Semantic Image Segmentation with Boundary Detection

  • intro: “an end-to-end trainable deep convolutional neural network (DCNN) for semantic segmentation with built-in awareness of semantically meaningful boundaries.”
  • arxiv: https://arxiv.org/abs/1612.01337

Diverse Sampling for Self-Supervised Learning of Semantic Segmentation

Mining Pixels: Weakly Supervised Semantic Segmentation Using Image Labels

FCNs in the Wild: Pixel-level Adversarial and Constraint-based Adaptation

Understanding Convolution for Semantic Segmentation

Label Refinement Network for Coarse-to-Fine Semantic Segmentation

https://www.arxiv.org/abs/1703.00551

Predicting Deeper into the Future of Semantic Segmentation

Guided Perturbations: Self Corrective Behavior in Convolutional Neural Networks

Not All Pixels Are Equal: Difficulty-aware Semantic Segmentation via Deep Layer Cascade

Large Kernel Matters – Improve Semantic Segmentation by Global Convolutional Network

https://arxiv.org/abs/1703.02719

Loss Max-Pooling for Semantic Image Segmentation

Reformulating Level Sets as Deep Recurrent Neural Network Approach to Semantic Segmentation

https://arxiv.org/abs/1704.03593

A Review on Deep Learning Techniques Applied to Semantic Segmentation

https://arxiv.org/abs/1704.06857

Joint Semantic and Motion Segmentation for dynamic scenes using Deep Convolutional Networks

ICNet

ICNet for Real-Time Semantic Segmentation on High-Resolution Images

LinkNet

Feature Forwarding: Exploiting Encoder Representations for Efficient Semantic Segmentation

LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation

Pixel Deconvolutional Networks

Incorporating Network Built-in Priors in Weakly-supervised Semantic Segmentation

Deep Semantic Segmentation for Automated Driving: Taxonomy, Roadmap and Challenges

Semantic Segmentation with Reverse Attention

Stacked Deconvolutional Network for Semantic Segmentation

https://arxiv.org/abs/1708.04943

Learning Dilation Factors for Semantic Segmentation of Street Scenes

A Self-aware Sampling Scheme to Efficiently Train Fully Convolutional Networks for Semantic Segmentation

https://arxiv.org/abs/1709.02764

One-Shot Learning for Semantic Segmentation

An Adaptive Sampling Scheme to Efficiently Train Fully Convolutional Networks for Semantic Segmentation

https://arxiv.org/abs/1709.02764

Semantic Segmentation from Limited Training Data

https://arxiv.org/abs/1709.07665

Unsupervised Domain Adaptation for Semantic Segmentation with GANs

https://arxiv.org/abs/1711.06969

Neuron-level Selective Context Aggregation for Scene Segmentation

https://arxiv.org/abs/1711.08278

Road Extraction by Deep Residual U-Net

https://arxiv.org/abs/1711.10684

Mix-and-Match Tuning for Self-Supervised Semantic Segmentation

Error Correction for Dense Semantic Image Labeling

https://arxiv.org/abs/1712.03812

Semantic Segmentation via Highly Fused Convolutional Network with Multiple Soft Cost Functions

https://arxiv.org/abs/1801.01317

Instance Segmentation

Simultaneous Detection and Segmentation

Convolutional Feature Masking for Joint Object and Stuff Segmentation

Proposal-free Network for Instance-level Object Segmentation

Hypercolumns for object segmentation and fine-grained localization

SDS using hypercolumns

Learning to decompose for object detection and instance segmentation

Recurrent Instance Segmentation

Instance-sensitive Fully Convolutional Networks

Amodal Instance Segmentation

Bridging Category-level and Instance-level Semantic Image Segmentation

Bottom-up Instance Segmentation using Deep Higher-Order CRFs

DeepCut: Object Segmentation from Bounding Box Annotations using Convolutional Neural Networks

End-to-End Instance Segmentation and Counting with Recurrent Attention

TA-FCN / FCIS

Translation-aware Fully Convolutional Instance Segmentation

Fully Convolutional Instance-aware Semantic Segmentation

InstanceCut: from Edges to Instances with MultiCut

Deep Watershed Transform for Instance Segmentation

Object Detection Free Instance Segmentation With Labeling Transformations

Shape-aware Instance Segmentation

Interpretable Structure-Evolving LSTM

  • intro: CMU & Sun Yat-sen University & National University of Singapore & Adobe Research
  • intro: CVPR 2017 spotlight paper
  • arxiv: https://arxiv.org/abs/1703.03055

Mask R-CNN

Semantic Instance Segmentation via Deep Metric Learning

https://arxiv.org/abs/1703.10277

Pose2Instance: Harnessing Keypoints for Person Instance Segmentation

https://arxiv.org/abs/1704.01152

Pixelwise Instance Segmentation with a Dynamically Instantiated Network

Instance-Level Salient Object Segmentation

Semantic Instance Segmentation with a Discriminative Loss Function

SceneCut: Joint Geometric and Object Segmentation for Indoor Scenes

https://arxiv.org/abs/1709.07158

S4 Net: Single Stage Salient-Instance Segmentation

Deep Extreme Cut: From Extreme Points to Object Segmentation

https://arxiv.org/abs/1711.09081

Learning to Segment Every Thing

Recurrent Neural Networks for Semantic Instance Segmentation

MaskLab

MaskLab: Instance Segmentation by Refining Object Detection with Semantic and Direction Features

https://arxiv.org/abs/1712.04837

Recurrent Pixel Embedding for Instance Grouping

Specific Segmentation

A CNN Cascade for Landmark Guided Semantic Part Segmentation

End-to-end semantic face segmentation with conditional random fields as convolutional, recurrent and adversarial networks

Face Parsing via Recurrent Propagation

Face Parsing via a Fully-Convolutional Continuous CRF Neural Network

https://arxiv.org/abs/1708.03736

Boundary-sensitive Network for Portrait Segmentation

https://arxiv.org/abs/1712.08675

Segment Proposal

Learning to Segment Object Candidates

Learning to Refine Object Segments

FastMask: Segment Object Multi-scale Candidates in One Shot

Scene Labeling / Scene Parsing

Indoor Semantic Segmentation using depth information

Recurrent Convolutional Neural Networks for Scene Parsing

Learning hierarchical features for scene labeling

Multi-modal unsupervised feature learning for rgb-d scene labeling

Scene Labeling with LSTM Recurrent Neural Networks

Attend, Infer, Repeat: Fast Scene Understanding with Generative Models

“Semantic Segmentation for Scene Understanding: Algorithms and Implementations” tutorial

Semantic Understanding of Scenes through the ADE20K Dataset

Learning Deep Representations for Scene Labeling with Guided Supervision

Learning Deep Representations for Scene Labeling with Semantic Context Guided Supervision

Spatial As Deep: Spatial CNN for Traffic Scene Understanding

MPF-RNN

Multi-Path Feedback Recurrent Neural Network for Scene Parsing

Scene Labeling using Recurrent Neural Networks with Explicit Long Range Contextual Dependency

PSPNet

Pyramid Scene Parsing Network

Open Vocabulary Scene Parsing

https://arxiv.org/abs/1703.08769

Deep Contextual Recurrent Residual Networks for Scene Labeling

https://arxiv.org/abs/1704.03594

Fast Scene Understanding for Autonomous Driving

  • intro: Published at “Deep Learning for Vehicle Perception”, workshop at the IEEE Symposium on Intelligent Vehicles 2017
  • arxiv: https://arxiv.org/abs/1708.02550

FoveaNet: Perspective-aware Urban Scene Parsing

https://arxiv.org/abs/1708.02421

BlitzNet: A Real-Time Deep Network for Scene Understanding

Semantic Foggy Scene Understanding with Synthetic Data

https://arxiv.org/abs/1708.07819

Restricted Deformable Convolution based Road Scene Semantic Segmentation Using Surround View Cameras

https://arxiv.org/abs/1801.00708

Benchmarks

MIT Scene Parsing Benchmark

Semantic Understanding of Urban Street Scenes: Benchmark Suite

https://www.cityscapes-dataset.com/benchmarks/

Challenges

Large-scale Scene Understanding Challenge

Places2 Challenge

http://places2.csail.mit.edu/challenge.html

Human Parsing

Human Parsing with Contextualized Convolutional Neural Network

Look into Person: Self-supervised Structure-sensitive Learning and A New Benchmark for Human Parsing

Cross-domain Human Parsing via Adversarial Feature and Label Adaptation

Video Object Segmentation

Fast object segmentation in unconstrained video

Recurrent Fully Convolutional Networks for Video Segmentation

Object Detection, Tracking, and Motion Segmentation for Object-level Video Segmentation

Clockwork Convnets for Video Semantic Segmentation

STFCN: Spatio-Temporal FCN for Semantic Video Segmentation

One-Shot Video Object Segmentation

Video Object Segmentation Without Temporal Information

https://arxiv.org/abs/1709.06031

Convolutional Gated Recurrent Networks for Video Segmentation

Learning Video Object Segmentation from Static Images

Semantic Video Segmentation by Gated Recurrent Flow Propagation

FusionSeg: Learning to combine motion and appearance for fully automatic segmention of generic objects in videos

Unsupervised learning from video to detect foreground objects in single images

https://arxiv.org/abs/1703.10901

Semantically-Guided Video Object Segmentation

https://arxiv.org/abs/1704.01926

Learning Video Object Segmentation with Visual Memory

https://arxiv.org/abs/1704.05737

Flow-free Video Object Segmentation

https://arxiv.org/abs/1706.09544

Online Adaptation of Convolutional Neural Networks for Video Object Segmentation

https://arxiv.org/abs/1706.09364

Video Object Segmentation using Tracked Object Proposals

Video Object Segmentation with Re-identification

Pixel-Level Matching for Video Object Segmentation using Convolutional Neural Networks

SegFlow: Joint Learning for Video Object Segmentation and Optical Flow

Video Semantic Object Segmentation by Self-Adaptation of DCNN

https://arxiv.org/abs/1711.08180

Learning to Segment Moving Objects

https://arxiv.org/abs/1712.01127

Instance Embedding Transfer to Unsupervised Video Object Segmentation

Panoptic Segmentation

Challenge

DAVIS: Densely Annotated VIdeo Segmentation

DAVIS Challenge on Video Object Segmentation 2017

http://davischallenge.org/challenge2017/publications.html

Projects

TF Image Segmentation: Image Segmentation framework

KittiSeg: A Kitti Road Segmentation model implemented in tensorflow.

Semantic Segmentation Architectures Implemented in PyTorch

PyTorch for Semantic Segmentation

https://github.com/ZijunDeng/pytorch-semantic-segmentation

3D Segmentation

PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

DA-RNN: Semantic Mapping with Data Associated Recurrent Neural Networks

https://arxiv.org/abs/1703.03098

SqueezeSeg: Convolutional Neural Nets with Recurrent CRF for Real-Time Road-Object Segmentation from 3D LiDAR Point Cloud

SEGCloud: Semantic Segmentation of 3D Point Clouds

Leaderboard

Segmentation Results: VOC2012 BETA: Competition “comp6” (train on own data)

http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=6

Blogs

Deep Learning for Natural Image Segmentation Priors

http://cs.brown.edu/courses/csci2951-t/finals/ghope/

Image Segmentation Using DIGITS 5

https://devblogs.nvidia.com/parallelforall/image-segmentation-using-digits-5/

Image Segmentation with Tensorflow using CNNs and Conditional Random Fields
http://warmspringwinds.github.io/tensorflow/tf-slim/2016/12/18/image-segmentation-with-tensorflow-using-cnns-and-conditional-random-fields/

Fully Convolutional Networks (FCNs) for Image Segmentation

Image segmentation with Neural Net

A 2017 Guide to Semantic Segmentation with Deep Learning

http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review

Talks

Deep learning for image segmentation


Review of Deep Learning Algorithms for Image Semantic Segmentation

Examples of the COCO dataset for stuff segmentation. Source: http://cocodataset.org/

Deep learning algorithms have solved several computer vision tasks of increasing difficulty. In my previous blog posts, I have detailed the well-known ones: image classification and object detection. The image semantic segmentation challenge consists of classifying each pixel of an image (or just some of them) into an instance, where each instance (or category) corresponds to an object or a part of the image (road, sky, …). This task is part of the concept of scene understanding: how can a deep learning model better learn the global context of a visual content?

The object detection task has exceeded the image classification task in terms of complexity. It consists of creating bounding boxes around the objects contained in an image and classifying each one of them. Most object detection models use anchor boxes and proposals to detect bounding boxes around objects. Unfortunately, only a few models take the entire context of an image into account, and they classify only a small part of the information; thus, they cannot provide a full comprehension of a scene.

In order to understand a scene, each piece of visual information has to be associated with an entity while considering the spatial information. Several other challenges have emerged to really understand the actions in an image or a video: keypoint detection, action recognition, video captioning, visual question answering and so on. A better comprehension of the environment will help in many fields. For example, an autonomous car needs to delineate the roadsides with high precision in order to move by itself. In robotics, production machines should understand how to grab, turn and put together two different pieces, which requires delineating the exact shape of the objects.

In this blog post, the architectures of a few previous state-of-the-art models for image semantic segmentation challenges are detailed. Note that researchers test their algorithms on different datasets (PASCAL VOC, PASCAL-Context, COCO, Cityscapes), which change from year to year, and use different evaluation metrics. Thus the cited performances cannot be compared directly. Moreover, the results depend on the pretrained backbone network; the scores reported in this post correspond to the best scores published in each paper on its respective test dataset.

Datasets and Metrics

PASCAL Visual Object Classes (PASCAL VOC)

The PASCAL VOC dataset (2012) is well known and commonly used for object detection and segmentation. More than 11k images compose the train and validation sets, while 10k images are dedicated to the test set.

The segmentation challenge is evaluated using the mean Intersection over Union (mIoU) metric. The Intersection over Union (IoU) is a metric also used in object detection to evaluate the relevance of the predicted locations. The IoU is the ratio between the area of overlap and the area of union of the ground-truth and predicted regions. The mIoU is the average of the IoU of the segmented objects over all the images of the test dataset.
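To make the metric concrete, here is a minimal NumPy sketch (my own illustration, not the official evaluation code) of the IoU and mIoU for a single predicted label map; benchmark servers typically accumulate per-class intersections and unions over the whole test set before averaging.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Per-image mIoU sketch: pred and gt are integer label maps of the same shape."""
    ious = []
    valid = gt != ignore_index
    for c in range(num_classes):
        pred_c = (pred == c) & valid
        gt_c = (gt == c) & valid
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:              # class absent from both maps: skip it
            continue
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)  # IoU = overlap / union for this class
    return float(np.mean(ious))    # average over the classes that appear
```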

Examples of the 2012 PASCAL VOC dataset for image segmentation. Source: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html

PASCAL-Context

The PASCAL-Context dataset (2014) is an extension of the 2010 PASCAL VOC dataset. It contains around 10k images for training, 10k for validation and 10k for testing. The specificity of this release is that the entire scene is segmented, providing more than 400 categories. Note that the images were annotated over three months by six in-house annotators.

The official evaluation metric of the PASCAL-Context challenge is the mIoU. Several other metrics are reported by researchers, such as the pixel accuracy (pixAcc). Here, performances will be compared only with the mIoU.

Example of the PASCAL-Context dataset. Source: https://cs.stanford.edu/~roozbeh/pascal-context/

Common Objects in COntext (COCO)

There are two COCO challenges (in 2017 and 2018) for image semantic segmentation (“object detection” and “stuff segmentation”). The “object detection” task consists of segmenting and categorizing objects into 80 categories. The “stuff segmentation” task uses data with large segmented parts of the images (sky, wall, grass), which contain almost the entire visual information. In this blog post, only the results of the “object detection” task will be compared, because too few of the quoted research papers have published results on the “stuff segmentation” task.

The COCO dataset for object segmentation is composed of more than 200k images with over 500k object instances segmented. It contains a training dataset, a validation dataset, a test dataset for researchers (test-dev) and a test dataset for the challenge (test-challenge). The annotations of both test datasets are not available. These datasets contain 80 categories and only the corresponding objects are segmented. This challenge uses the same metrics as the object detection challenge: the Average Precision (AP) and the Average Recall (AR), both based on the Intersection over Union (IoU).

Details about the IoU and AP metrics are available in my previous blog post. Like the AP, the Average Recall is computed over multiple IoU thresholds spanning a range of overlap values. For a fixed IoU threshold, the predicted objects whose overlap with the ground truth exceeds the threshold are kept, and the Recall is computed for the detected objects. The final AR is the average of the Recalls computed over the whole IoU range. Basically, the AP and AR metrics for segmentation work the same way as in object detection, except that the IoU is computed pixel-wise over non-rectangular shapes.
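As a toy illustration of the AR computation (an assumption-laden sketch, not the official COCO evaluation, which also caps the number of detections per image and averages over categories), suppose we already know, for each ground-truth object, the IoU of its best-matching predicted mask:

```python
import numpy as np

def average_recall(best_ious, thresholds=np.arange(0.5, 1.0, 0.05)):
    """COCO-style AR sketch: best_ious[i] is the IoU between ground-truth object i
    and its best-matching predicted mask. Recall is computed at each IoU threshold
    (0.5 to 0.95 in steps of 0.05) and averaged over the thresholds."""
    best_ious = np.asarray(best_ious, dtype=float)
    recalls = [(best_ious >= t).mean() for t in thresholds]
    return float(np.mean(recalls))

# Example: three ground-truth objects matched with IoUs 0.9, 0.6 and 0.4
print(average_recall([0.9, 0.6, 0.4]))
```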

Example of the COCO dataset for object segmentation. Source: http://cocodataset.org/

Cityscapes

The Cityscapes dataset was released in 2016 and consists of complex segmented urban scenes from 50 cities. It is composed of 23.5k images for training and validation (fine and coarse annotations) and 1.5k images for testing (fine annotations only). The images are fully segmented, like the PASCAL-Context dataset, with 29 classes (grouped into 8 super-categories: flat, human, vehicle, construction, object, nature, sky, void). It is often used to evaluate semantic segmentation models because of its complexity, and it is also well known for its similarity to real urban scenes in autonomous driving applications. The performance of semantic segmentation models is computed using the mIoU metric, as for the PASCAL datasets.

Examples of the Cityscapes dataset. Top: coarse annotations. Bottom: fine annotation. Source: https://www.cityscapes-dataset.com/

Fully Convolutional Network (FCN)

J. Long et al. (2015) were the first to develop a Fully Convolutional Network (FCN), containing only convolutional layers, trained end-to-end for image segmentation.

The FCN takes an image of arbitrary size and produces a segmented image of the same size. The authors start by modifying well-known architectures (AlexNet, VGG16, GoogLeNet) to accept inputs of any size, replacing all the fully connected layers with convolutional layers. Since the network produces several small feature maps with dense representations, upsampling is necessary to create an output of the same size as the input. Basically, it is a convolutional layer with a fractional stride (smaller than 1); it is commonly called deconvolution because it creates an output larger than its input. This way, the network is trained using a pixel-wise loss. Moreover, the authors added skip connections in the network to combine high-level feature map representations with more specific and dense ones from earlier in the network.
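As a rough PyTorch sketch of these ideas (my own toy example, not the authors' architecture: the channel sizes, the single skip connection and the layer names are assumptions), the head below scores a coarse feature map with a 1x1 convolution, upsamples it with a learnable transposed convolution, adds scores computed from a higher-resolution feature map, and finally upsamples to the input resolution for a pixel-wise loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFCNHead(nn.Module):
    """Minimal FCN-style head: 1x1 convolutions replace the classifier, a
    transposed convolution upsamples coarse scores, and a skip connection
    adds scores from a higher-resolution feature map."""
    def __init__(self, c_high=512, c_skip=256, num_classes=21):
        super().__init__()
        self.score_high = nn.Conv2d(c_high, num_classes, kernel_size=1)
        self.score_skip = nn.Conv2d(c_skip, num_classes, kernel_size=1)
        # "Deconvolution": a learnable stride-2 upsampling.
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4,
                                      stride=2, padding=1)

    def forward(self, feat_high, feat_skip, out_size):
        x = self.up2(self.score_high(feat_high))   # coarse scores, upsampled x2
        x = x + self.score_skip(feat_skip)         # fuse with finer features
        # Final upsampling back to the input resolution.
        return F.interpolate(x, size=out_size, mode='bilinear', align_corners=False)

# feat_high: 1/32 resolution, feat_skip: 1/16 resolution (e.g. from a VGG16 backbone)
scores = TinyFCNHead()(torch.randn(1, 512, 7, 7), torch.randn(1, 256, 14, 14), (224, 224))
print(scores.shape)  # torch.Size([1, 21, 224, 224])
```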

The authors reached a 62.2% mIoU score on the 2012 PASCAL VOC segmentation challenge using models pretrained on the 2012 ImageNet dataset. For the 2012 PASCAL VOC object detection challenge, the benchmark model, Faster R-CNN, reached 78.8% mAP. Even if the two results cannot be compared directly (different models, metrics and challenges), it seems that the semantic segmentation task is harder to solve than object detection.

Architecture of the FCN. Note that the skip connections are not drawn here. Source: J. Long et al. (2015)

ParseNet

W. Liu et al. (2015) have published a paper describing improvements to the FCN model of J. Long et al. (2015). According to the authors, the FCN model loses the global context of the image in its deep layers by specializing the generated feature maps. ParseNet is an end-to-end convolutional network that predicts values for all the pixels at the same time; it avoids taking regions as input in order to keep the global information. The authors use a module that takes feature maps as input. The first step uses a model to generate feature maps, which are reduced to a single global feature vector with a pooling layer. This context vector is normalised using the L2 Euclidean norm and unpooled (the output is an expanded version of the input) to produce new feature maps with the same size as the initial ones. The second step normalises the entire initial feature maps using the L2 Euclidean norm. The last step concatenates the feature maps generated by the two previous steps. The normalisation helps to scale the values of the concatenated feature maps and leads to better performance. Basically, ParseNet is an FCN with this module replacing convolutional layers. It obtained a 40.4% mIoU score on the PASCAL-Context challenge and a 69.8% mIoU score on the 2012 PASCAL VOC segmentation challenge.

Comparison between the segmentation of the FCN and the ParseNet and architecture of the ParseNet module. Source: W. Liu et al. (2015)
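A minimal PyTorch sketch of that global context module might look as follows (my own illustration; the learnable scaling applied after the L2 normalisation in the paper is omitted here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContextModule(nn.Module):
    """ParseNet-style module: global average pooling produces a context vector,
    both the vector and the original feature maps are L2-normalised, the vector
    is 'unpooled' (broadcast back to the spatial size) and the two are
    concatenated along the channel dimension."""
    def forward(self, x):                       # x: (N, C, H, W)
        n, c, h, w = x.shape
        ctx = F.adaptive_avg_pool2d(x, 1)       # global pooling -> (N, C, 1, 1)
        ctx = F.normalize(ctx, p=2, dim=1)      # L2-normalise the context vector
        ctx = ctx.expand(n, c, h, w)            # unpool: broadcast to the input size
        x = F.normalize(x, p=2, dim=1)          # L2-normalise the original features
        return torch.cat([x, ctx], dim=1)       # (N, 2C, H, W)

out = GlobalContextModule()(torch.randn(2, 256, 32, 32))
print(out.shape)  # torch.Size([2, 512, 32, 32])
```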

Convolutional and Deconvolutional Networks

H. Noh et al. (2015) have released an end-to-end model composed of two linked parts. The first part is a convolutional network with a VGG16 architecture. It takes as input an instance proposal, for example a bounding box generated by an object detection model. The proposal is processed and transformed by the convolutional network to generate a vector of features. The second part is a deconvolutional network that takes the vector of features as input and generates a map of pixel-wise probabilities of belonging to each class. The deconvolutional network uses unpooling, targeting the maximum activations, to keep the location of the information in the maps. It also uses deconvolution, associating a single input activation with multiple feature map outputs. The deconvolution expands the feature maps while keeping the information dense.

Comparison of the convolutional network layers (pooling and convolution) with the deconvolutional network layers (unpooling and deconvolution). Source: H. Noh et al. (2015)

The authors have analysed deconvolution feature maps and they have noted that the low-level ones are specific to the shape while the higher-level ones help to classify the proposal. Finally, when all the proposals of an image are processed by the entire network, the maps are concatenated to obtain the fully segmented image. This network has obtained a 72.5% mIoU on the 2012 PASCAL VOC segmentation challenge.

Architecture of the full network. The convolution network is based on the VGG16 architecture. The deconvolution network uses unpooling and deconvolution layers. Source: H. Noh et al. (2015)
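The unpooling/deconvolution pair described above can be sketched in a few lines of PyTorch (a toy illustration with assumed sizes, not the paper's network):

```python
import torch
import torch.nn as nn

# Max pooling that remembers which locations held the maxima ...
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# ... and the matching unpooling layer that puts values back at those locations.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
deconv = nn.ConvTranspose2d(16, 16, kernel_size=3, padding=1)

x = torch.randn(1, 16, 8, 8)
pooled, indices = pool(x)           # (1, 16, 4, 4) plus the argmax locations
restored = unpool(pooled, indices)  # (1, 16, 8, 8), sparse: non-max entries are zero
dense = deconv(restored)            # deconvolution densifies the sparse maps
print(pooled.shape, restored.shape, dense.shape)
```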

U-Net

O. Ronneberger et al. (2015) have extended the FCN of J. Long et al. (2015) for biological microscopy images. The authors created a network called U-Net composed of two parts: a contracting part to compute features and an expanding part to spatially localise patterns in the image. The downsampling or contracting part has an FCN-like architecture, extracting features with 3x3 convolutions. The upsampling or expanding part uses up-convolution (or deconvolution), reducing the number of feature maps while increasing their height and width. Cropped feature maps from the downsampling part of the network are copied into the upsampling part to avoid losing pattern information. Finally, a 1x1 convolution processes the feature maps to generate a segmentation map and thus categorise each pixel of the input image. Since then, the U-Net architecture has been widely extended in recent works (FPN, PSPNet, DeepLabv3 and so on). Note that it does not use any fully connected layer. As a consequence, the number of parameters of the model is reduced and it can be trained with a small labelled dataset (using appropriate data augmentation). For example, the authors used a public dataset with 30 images for training during their experiments.

Architecture of the U-net for a given input image. The blue boxes correspond to feature maps blocks with their denoted shapes. The white boxes correspond to the copied and cropped feature maps. Source: O. Ronneberger et al. (2015)
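Here is a small PyTorch sketch of one expanding step of a U-Net-style decoder (my own simplified illustration; the channel sizes are assumptions, and the unpadded 3x3 convolutions follow the original paper, which is why the encoder features must be cropped and the output is slightly smaller):

```python
import torch
import torch.nn as nn

def center_crop(enc_feat, target_hw):
    """Crop an encoder feature map to the spatial size of the decoder feature map."""
    _, _, h, w = enc_feat.shape
    th, tw = target_hw
    top, left = (h - th) // 2, (w - tw) // 2
    return enc_feat[:, :, top:top + th, left:left + tw]

class UpBlock(nn.Module):
    """One U-Net expanding step: up-convolution halves the channels, the cropped
    encoder features are concatenated, then two 3x3 convolutions."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, 3), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3), nn.ReLU(inplace=True),
        )

    def forward(self, x, enc_feat):
        x = self.up(x)
        skip = center_crop(enc_feat, x.shape[-2:])
        return self.conv(torch.cat([skip, x], dim=1))

out = UpBlock(1024, 512)(torch.randn(1, 1024, 28, 28), torch.randn(1, 512, 64, 64))
print(out.shape)  # torch.Size([1, 512, 52, 52])
```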

Feature Pyramid Network (FPN)

The Feature Pyramid Network (FPN) was developed by T.-Y. Lin et al. (2016) and is used in object detection and image segmentation frameworks. Its architecture is composed of a bottom-up pathway, a top-down pathway and lateral connections in order to join low-resolution and high-resolution features. The bottom-up pathway takes an image of arbitrary size as input. It is processed with convolutional layers and downsampled by pooling layers. Note that each bunch of feature maps with the same size is called a stage; the output of the last layer of each stage provides the features used for the corresponding pyramid level. The top-down pathway consists of upsampling the last feature maps with unpooling while enhancing them with feature maps from the same stage of the bottom-up pathway using lateral connections. These connections consist of merging the feature maps of the bottom-up pathway, processed with a 1x1 convolution (to reduce their dimensions), with the feature maps of the top-down pathway.

Detail of a top-down block process with the lateral connection and the sum of the feature maps. Source: T.-Y. Lin et al (2016)

The merged feature maps are then processed by a 3x3 convolution to produce the output of the stage. Finally, each stage of the top-down pathway generates a prediction to detect an object. For image segmentation, the authors use two Multi-Layer Perceptrons (MLPs) to generate two masks of different sizes over the objects. It works similarly to Region Proposal Networks with anchor boxes (R-CNN, R. Girshick et al. (2014); Fast R-CNN, R. Girshick et al. (2015); Faster R-CNN, S. Ren et al. (2016); and so on). This method is efficient because it better propagates low-level information into the network. The FPN based on the DeepMask (P. O. Pinheiro et al. (2015)) and SharpMask (P. O. Pinheiro et al. (2016)) frameworks achieved a 48.1% Average Recall (AR) score on the 2016 COCO segmentation challenge.

Comparison of architectures. (a): The image is scaled to several sizes and each one is processed with convolutions to provide predictions, which is computationally expensive. (b): The image has a single scale processed by a CNN with convolution and pooling layers. (c): Each stage of the CNN is used to provide a prediction. (d): Architecture of the FPN with the bottom-up pathway on the left and the top-down pathway on the right. Source: T.-Y. Lin et al (2016)
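A compact PyTorch sketch of the top-down pathway with lateral connections (an illustration with assumed ResNet-style channel widths, not the reference implementation) could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNTopDown(nn.Module):
    """Each backbone stage output (C2..C5) is reduced to a common width by a 1x1
    lateral convolution, merged by element-wise sum with the upsampled coarser
    level, then smoothed by a 3x3 convolution to produce the pyramid P2..P5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                      # feats = [C2, C3, C4, C5]
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 1, 0, -1):  # merge from coarse to fine
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode='nearest')
        return [s(l) for s, l in zip(self.smooth, laterals)]

c2, c3, c4, c5 = (torch.randn(1, c, s, s) for c, s in
                  [(256, 64), (512, 32), (1024, 16), (2048, 8)])
print([p.shape for p in FPNTopDown()([c2, c3, c4, c5])])
```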

Pyramid Scene Parsing Network (PSPNet)

H. Zhao et al. (2016) have developed the Pyramid Scene Parsing Network (PSPNet) to better learn the global context representation of a scene. Patterns are extracted from the input image using a feature extractor (ResNet, K. He et al. (2015)) with a dilated network strategy. The feature maps feed a Pyramid Pooling Module that distinguishes patterns at different scales. They are pooled at four different scales, each corresponding to a pyramid level, and processed by a 1x1 convolutional layer to reduce their dimensions. This way, each pyramid level analyses sub-regions of the image at different locations. The outputs of the pyramid levels are upsampled and concatenated with the initial feature maps so that they finally contain both the local and the global context information. Then they are processed by a convolutional layer to generate the pixel-wise predictions. The best PSPNet, with a pretrained ResNet (using the COCO dataset), reached an 85.4% mIoU score on the 2012 PASCAL VOC segmentation challenge.

PSPNet architecture. The input image (a) is processed by a CNN to generate feature maps (b). They feed a Pyramid Pooling Module (c) and a final convolutional layer generates the pixel-wise predictions. Source: H. Zhao et al. (2016)
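The Pyramid Pooling Module can be sketched as follows (my own simplified PyTorch illustration; the paper additionally applies batch normalization and ReLU after the 1x1 convolutions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """The feature map is pooled to 1x1, 2x2, 3x3 and 6x6 grids, each pooled map
    is reduced by a 1x1 convolution, upsampled back to the input size and
    concatenated with the original features."""
    def __init__(self, in_ch=2048, bins=(1, 2, 3, 6)):
        super().__init__()
        self.bins = bins
        self.reduce = nn.ModuleList(
            [nn.Conv2d(in_ch, in_ch // len(bins), kernel_size=1) for _ in bins])

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [x]
        for bin_size, conv in zip(self.bins, self.reduce):
            p = F.adaptive_avg_pool2d(x, bin_size)   # pool to bin_size x bin_size
            p = F.interpolate(conv(p), size=(h, w), mode='bilinear', align_corners=False)
            outs.append(p)
        return torch.cat(outs, dim=1)                # local + global context

print(PyramidPooling()(torch.randn(1, 2048, 60, 60)).shape)  # (1, 4096, 60, 60)
```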

Mask R-CNN

K. He et al. (2017) have released the Mask R-CNN model, beating all previous benchmarks on many COCO challenges. I have already provided details about Mask R-CNN for object detection in my previous blog post. As a reminder, the Faster R-CNN (S. Ren et al. (2015)) architecture for object detection uses a Region Proposal Network (RPN) to propose bounding box candidates. The RPN extracts Regions of Interest (RoIs), and a RoIPool layer computes features from these proposals in order to infer the bounding box coordinates and the class of the object. Mask R-CNN is a Faster R-CNN with 3 output branches: the first one computes the bounding box coordinates, the second one computes the associated class and the last one computes a binary mask to segment the object. The binary mask has a fixed size and is generated by an FCN for a given RoI. It also uses a RoIAlign layer instead of RoIPool to avoid misalignments due to the quantization of the RoI coordinates. The particularity of the Mask R-CNN model is its multi-task loss, combining the losses of the bounding box coordinates, the predicted class and the segmentation mask. The model tries to solve complementary tasks, leading to better performance on each individual task. The best Mask R-CNN uses a ResNeXt (S. Xie et al. (2016)) to extract features and an FPN architecture. It obtained a 37.1% AP score on the 2016 COCO segmentation challenge and a 41.8% AP score on the 2017 COCO segmentation challenge.

Mask R-CNN architecture. The first layer is an RPN extracting the RoIs. The second layer processes the RoIs to generate feature maps. They are directly used to compute the bounding box coordinates and the predicted class. The feature maps are also processed by an FCN (third layer) to generate the binary mask. Source: K. He et al. (2017)
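To illustrate the multi-task loss only (a heavily simplified sketch with assumed shapes, not the actual Mask R-CNN implementation, which weights and normalises the terms per RoI and handles background RoIs separately):

```python
import torch
import torch.nn.functional as F

def mask_rcnn_style_loss(cls_logits, cls_targets, box_preds, box_targets,
                         mask_logits, mask_targets):
    """Toy multi-task loss for a batch of RoIs: classification + box regression
    + per-pixel binary mask loss, simply summed."""
    loss_cls = F.cross_entropy(cls_logits, cls_targets)
    loss_box = F.smooth_l1_loss(box_preds, box_targets)
    # The mask branch predicts one mask per class; the loss is a binary
    # cross-entropy on the mask of the ground-truth class only.
    idx = torch.arange(mask_logits.shape[0])
    loss_mask = F.binary_cross_entropy_with_logits(
        mask_logits[idx, cls_targets], mask_targets)
    return loss_cls + loss_box + loss_mask

n, num_classes, m = 8, 81, 28
loss = mask_rcnn_style_loss(
    torch.randn(n, num_classes), torch.randint(0, num_classes, (n,)),
    torch.randn(n, 4), torch.randn(n, 4),
    torch.randn(n, num_classes, m, m), torch.rand(n, m, m))
print(loss.item())
```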

DeepLab, DeepLabv3 and DeepLabv3+

DeepLab

Inspired by the FPN model of T.-Y. Lin et al. (2016), L.-C. Chen et al. (2017) have released DeepLab, combining atrous convolution, spatial pyramid pooling and fully connected CRFs. The model presented in this paper is also called DeepLabv2 because it is an adjustment of the initial DeepLab model (details about the initial one are not provided here to avoid redundancy). According to the authors, consecutive max-pooling and striding reduce the resolution of the feature maps in deep neural networks. They introduced the atrous convolution, which is basically the dilated convolution of H. Zhao et al. (2016). It consists of filters targeting sparse pixels with a fixed rate. For example, if the rate is equal to 2, the filter targets one pixel out of two in the input; if the rate is equal to 1, the atrous convolution is a standard convolution. Atrous convolution makes it possible to capture multiple scales of objects. When used without max-pooling, it increases the resolution of the final output without increasing the number of weights.

Extraction patterns comparison between standard convolution on a low resolution input (top) and atrous convolution with a rate of 2 on a high resolution input (bottom). Source: L.-C. Chen et al. (2017)
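A two-line PyTorch check makes the effect of the rate visible (dilation is the PyTorch name for the rate; this is just an illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 65, 65)

# rate 1: an ordinary 3x3 convolution, receptive field 3x3
conv_r1 = nn.Conv2d(1, 1, kernel_size=3, dilation=1, padding=1)
# rate 2: same number of weights (9), but the taps are spread out,
# so the receptive field grows to 5x5 without any downsampling
conv_r2 = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2)

print(conv_r1(x).shape, conv_r2(x).shape)  # both (1, 1, 65, 65): resolution is kept
```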

Atrous Spatial Pyramid Pooling consists of applying several atrous convolutions with different rates to the same input to detect spatial patterns. The feature maps are processed in separate branches and merged, using bilinear interpolation to recover the original size of the input. The output feeds a fully connected Conditional Random Field (CRF) (P. Krähenbühl and V. Koltun (2012)), which models relations between the features and long-term dependencies to produce the semantic segmentation.

Atrous Spatial Pyramid Pooling (ASPP) exploiting multiple scales of objects to classify the pixel in the center. Source: L.-C. Chen et al. (2017)
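A minimal sketch of ASPP in PyTorch (my own illustration: the fusion by summation follows the DeepLabv2-style variant, and the channel sizes and rates are assumptions; DeepLabv3, described below, instead concatenates the branches and adds a 1x1 branch and image-level pooling):

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Parallel 3x3 atrous convolutions with different rates look at the same
    feature map at several effective scales; their outputs are summed."""
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
             for r in rates])

    def forward(self, x):
        return sum(branch(x) for branch in self.branches)

print(ASPP()(torch.randn(1, 2048, 33, 33)).shape)  # (1, 256, 33, 33)
```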

The best DeepLab using a ResNet-101 as backbone has reached a 79.7% mIoU score on the 2012 PASCAL VOC challenge, a 45.7% mIoU score on the PASCAL-Context challenge and a 70.4% mIoU score on the Cityscapes challenge.

DeepLab framework. Source: L.-C. Chen et al. (2017)

DeepLabv3

L.-C. Chen et al. (2017) have revisited the DeepLab framework to create DeepLabv3 combining cascaded and parallel modules of atrous convolutions. The authors have modified the ResNet architecture to keep high resolution feature maps in deep blocks using atrous convolutions.

Cascaded modules in the ResNet architecture. Source: L.-C. Chen et al. (2017)

The parallel atrous convolution modules are grouped in the Atrous Spatial Pyramid Pooling (ASPP). A 1x1 convolution and batch normalisation are added in the ASPP. All the outputs are concatenated and processed by another 1x1 convolution to create the final output with logits for each pixel.

Atrous Spatial Pyramid Pooling in the Deeplabv3 framework. Source: L.-C. Chen et al. (2017)

The best DeepLabv3 model, with a ResNet-101 pretrained on the ImageNet and JFT-300M datasets, reached an 86.9% mIoU score on the 2012 PASCAL VOC challenge. It also achieved an 81.3% mIoU score on the Cityscapes challenge with a model trained only on the associated training dataset.

DeepLabv3+

L.-C. Chen et al. (2018) have finally released the Deeplabv3+ framework using an encoder-decoder structure. The authors have introduced the atrous separable convolution composed of a depthwise convolution (spatial convolution for each channel of the input) and pointwise convolution (1x1 convolution with the depthwise convolution as input).

Combination of depthwise convolution (a) and pointwise convolution (b) to create the atrous separable convolution (with a rate of 2). Source: L.-C. Chen et al. (2018)
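In PyTorch, the atrous separable convolution can be sketched with a grouped (depthwise) convolution followed by a 1x1 (pointwise) convolution; this is an illustrative toy module, not the DeepLabv3+ implementation:

```python
import torch
import torch.nn as nn

class AtrousSeparableConv(nn.Module):
    """A depthwise atrous 3x3 convolution (one spatial filter per input channel,
    groups=in_ch) followed by a pointwise 1x1 convolution that mixes channels."""
    def __init__(self, in_ch, out_ch, rate=2):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=rate,
                                   dilation=rate, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

print(AtrousSeparableConv(256, 256)(torch.randn(1, 256, 64, 64)).shape)
```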

They use the DeepLabv3 framework as the encoder. The best-performing model has a modified Xception (F. Chollet (2017)) backbone with more layers, atrous depthwise separable convolutions instead of max pooling, and batch normalization. The outputs of the ASPP are processed by a 1x1 convolution and upsampled by a factor of 4. The outputs of the encoder backbone CNN are also processed by another 1x1 convolution and concatenated with the previous ones. The feature maps feed two 3x3 convolutional layers and the outputs are upsampled by a factor of 4 to create the final segmented image.

DeepLabv3+ framework: an encoder with a backbone CNN and an ASPP produces feature representations to feed a decoder with 3x3 convolutions producing the final predicted image. Source: L.-C. Chen et al. (2018)

The best DeepLabv3+, pretrained on the COCO and JFT datasets, obtained an 89.0% mIoU score on the 2012 PASCAL VOC challenge. The model trained on the Cityscapes dataset reached an 82.1% mIoU score on the associated challenge.

Path Aggregation Network (PANet)

S. Liu et al. (2018) have recently released the Path Aggregation Network (PANet). This network is based on the Mask R-CNN and FPN frameworks while enhancing information propagation. The feature extractor of the network uses an FPN architecture with a new augmented bottom-up pathway improving the propagation of low-level features. Each stage of this third pathway takes as input the feature maps of the previous stage and processes them with a 3x3 convolutional layer. The output is added to the feature maps of the same stage of the top-down pathway using a lateral connection, and these feature maps feed the next stage.

Lateral connection between the top-down pathway and the augmented bottom-up pathway. Source: S. Liu et al. (2018)

The feature maps of the augmented bottom-up pathway are pooled with a RoIAlign layer to extract proposals from all feature levels. An adaptive feature pooling layer processes the feature maps of each stage with a fully connected layer and concatenates all the outputs.

Adaptive feature pooling layer. Source: S. Liu et al. (2018)

The output of the adaptive feature pooling layer feeds three branches, similarly to Mask R-CNN. The first two branches use a fully connected layer to generate the predictions of the bounding box coordinates and the associated object class. The third branch processes the RoI with an FCN to predict a binary pixel-wise mask for the detected object. The authors added a path that processes the output of a convolutional layer of the FCN with a fully connected layer to improve the localisation of the predicted pixels. Finally, the output of this parallel path is reshaped and concatenated with the output of the FCN to generate the binary mask.

Branch of the PANet predicting the binary mask using a FCN and a new path with a fully connected layer. Source: https://arxiv.org/pdf/1803.01534.pdf

The PANet achieved a 42.0% AP score on the 2016 COCO segmentation challenge using a ResNeXt as feature extractor. It also reached a 46.7% AP score on the 2017 COCO segmentation challenge using an ensemble of seven feature extractors: ResNet (K. He et al. (2015)), ResNeXt (S. Xie et al. (2016)) and SENet (J. Hu et al. (2017)).

PANet architecture. (a): Feature extractor using the FPN architecture. (b): The new augmented bottom-up pathway added to the FPN architecture. (c): The adaptive feature pooling layer. (d): The two branches predicting the bounding box coordinates and the object class. (e): The branch predicting the binary mask of the object. The dashed lines correspond to links between low-level and high-level patterns; the red one, in the FPN, consists of more than 100 layers, while the green one is a shortcut in the PANet consisting of fewer than 10 layers. Source: S. Liu et al. (2018)

Context Encoding Network (EncNet)

H. Zhang et al. (2018) have created the Context Encoding Network (EncNet), capturing global information in an image to improve scene segmentation. The model starts by using a basic feature extractor (ResNet) and feeds the feature maps into a Context Encoding Module inspired by the Encoding Layer of H. Zhang et al. (2016). Basically, it learns visual centers and smoothing factors to create an embedding that takes the contextual information into account while highlighting class-dependent feature maps. On top of the module, scaling factors for the contextual information are learnt with a feature-map attention layer (a fully connected layer). In parallel, a Semantic Encoding Loss (SE-loss), a binary cross-entropy loss, regularizes the training of the module by detecting the presence of object classes (unlike the pixel-wise loss). The outputs of the Context Encoding Module are reshaped and processed by a dilated convolution strategy while minimizing two SE-losses and a final pixel-wise loss. The best EncNet reached 52.6% mIoU and 81.2% pixAcc scores on the PASCAL-Context challenge. It also achieved an 85.9% mIoU score on the 2012 PASCAL VOC segmentation challenge.

Dilated convolution strategy. In blue, the convolutional filter with dilation rate D. The SE-losses (Semantic Encoding Loss) are applied after the third and fourth stages to detect object classes. A final Seg-loss (pixel-wise loss) is applied to improve the segmentation. Source: H. Zhang et al. (2018)
Architecture of the EncNet. A feature extractor generates feature maps taken as input by a Context Encoding Module. The module is trained with regularisation using the Semantic Encoding Loss. The outputs of the module are processed by a dilated convolution strategy to produce the final segmentation. Source: H. Zhang et al. (2018)

Conclusion

Image semantic segmentation is a challenge recently tackled by end-to-end deep neural networks. One of the main issues shared by all the architectures is how to take the global visual context of the input into account to improve the prediction of the segmentation. The state-of-the-art models use architectures that try to link different parts of the image in order to understand the relations between the objects.

Overview of the scores of the models over the 2012 PASCAL VOC dataset (mIoU), the PASCAL-Context dataset (mIoU), the 2016 / 2017 COCO datasets (AP and AR) and the Cityscapes dataset (mIoU)

The pixel-wise prediction over an entire image allows a better comprehension of the environment with high precision. Scene understanding is also approached with keypoint detection, action recognition, video captioning and visual question answering. In my opinion, combining the segmentation task with these other tasks through a multi-task loss should help to improve the global context understanding of a scene.

Finally, I would like to thank Long Do Cao for helping me with all my posts; you should check his profile if you're looking for a great senior data scientist ;).


Overview

Most semantic segmentation research is based on natural, i.e. real-world, images. Although these results cannot be applied directly to medical images, the research is more mature and offers many useful lessons.

This post first explains what the semantic segmentation problem is, gives an overview of the approaches, and finally summarizes some interesting papers.

In follow-up posts, I will explain why medical images differ from natural images and look at how these methods can be applied to datasets representative of medical imaging.

What is semantic segmentation?

Semantic segmentation is understanding an image at the pixel level: we want to assign every pixel in the image to an object class, as shown in the figure below:

Besides recognizing the motorcycle and the person riding it, we also have to delineate the boundary of each class. Therefore, unlike classification, the model needs to make dense pixel-wise predictions.

VOC2012 and MSCOCO are the most important datasets for semantic segmentation.

How do the approaches differ?

Before deep learning took over computer vision, people generally used TextonForest and Random Forest based classifiers for semantic segmentation. As with image classification, convolutional neural networks (CNNs) have had enormous success on segmentation problems.

One popular early deep learning approach was patch classification: each pixel is classified using an image patch that contains it, and the result is used as that pixel's label. The main reason for this was that classification networks required fixed-size input images.

In 2014, Fully Convolutional Networks (FCNs) popularized CNNs for dense pixel prediction. An FCN has no fully connected layers, so it accepts images of any size and is faster than the patch classification approach above. Almost all subsequent state-of-the-art work on semantic segmentation adopted this paradigm.

Apart from fully connected layers, another problem with using CNNs for segmentation is the pooling layers. Pooling increases the receptive field and aggregates context, but it also discards location information. Semantic segmentation, however, requires an exact per-pixel class map, so this location information must be preserved. Two different classes of architectures address this problem.

The first is the encoder-decoder architecture. The encoder gradually reduces the spatial dimensions with pooling layers, and the decoder recovers the object details and spatial dimensions. There are usually direct (shortcut) connections between encoder and decoder to help the decoder recover object details. U-Net is a popular architecture in this class.

The second class of architectures uses so-called dilated/atrous convolutions in place of pooling layers, increasing the receptive field without reducing the spatial dimensions, and thus preserving location information.

Conditional Random Field (CRF) post-processing is usually used to improve the segmentation. CRFs are graphical models that "smooth" the segmentation based on the underlying image intensities, on the principle that pixels of the same class tend to have similar intensities. CRF post-processing can improve scores by roughly 1-2%.

Summaries of some interesting papers

This section summarizes some representative papers since the FCN. All of these architectures are benchmarked on the VOC2012 evaluation server.

FCN

Fully Convolutional Networks for Semantic Segmentation

Submitted: 14 November 2014

Main contributions

  1. Popularized the use of end-to-end convolutional networks for semantic segmentation
  2. Re-purposed ImageNet pretrained models for segmentation
  3. Used deconvolution layers for upsampling
  4. Used skip connections to improve the coarseness of the upsampling

Explanation

The key observation is that the fully connected layers of a classification network can be viewed as convolutions whose kernels cover the entire input region. This is equivalent to evaluating the original classification network on overlapping input patches, but it is much more efficient because computation is shared across the overlapping regions. Although this observation is not unique to this paper (OverFeat, this post), it significantly improved the state of the art on VOC2012.

After convolutionalizing the fully connected layers of an ImageNet-pretrained network such as VGG, the feature maps still need to be upsampled, because the pooling operations in the CNN have reduced their size. Instead of simple bilinear interpolation, deconvolution layers are used, because they can learn the interpolation. This layer is also called upconvolution, full convolution, transposed convolution, or fractionally-strided convolution.

However, the segmentation maps produced by upsampling (even with deconvolution layers) are coarse, because too much information is lost during pooling. Therefore shortcut/skip connections, like the grey connections in the middle of the U-Net architecture diagram, are introduced from higher-resolution feature maps.

Benchmarks (VOC2012)

Score Comment Source
62.2 - leaderboard
67.2 More momentum. Not described in paper leaderboard

Comments

  • A major contribution, but the state of the art has since moved well beyond it

SegNet

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Submitted: 2 November 2015

Main contributions

  1. Transfers the max-pooling indices to the decoder to improve segmentation resolution

Explanation

In the FCN, the deconvolution layers and the few shortcut connections still produce coarse segmentation maps, so more shortcut connections are introduced. However, instead of copying the encoder features as in the FCN, SegNet copies the max-pooling indices. This makes SegNet more memory-efficient than the FCN.

Benchmarks (VOC2012)

Score Comment Source
59.9 - leaderboard

Comments

  • FCN and SegNet were among the first encoder-decoder architectures
  • SegNet's benchmark score is not good

Dilated Convolutions

Multi-Scale Context Aggregation by Dilated Convolutions

Submitted: 23 November 2015

Main contributions

  1. Used dilated convolutions, a convolutional layer suited for dense prediction (labelling the object class of every pixel in an image, which requires not only locating the objects but also delineating their boundaries, as in image segmentation, semantic segmentation, edge detection and so on)
  2. Proposed multi-scale context aggregation to improve dense prediction

Explanation

Pooling enlarges the receptive field, which helps classification networks. But it is not ideal for segmentation, because it also reduces the resolution. The authors therefore propose the dilated convolution layer, illustrated below:

The dilated convolution layer (called atrous convolution in DeepLab) increases the receptive field without shrinking the spatial dimensions.

The last two pooling layers of the pretrained model (VGG here) are removed, and the subsequent convolution layers are replaced with dilated convolutions. In particular, the convolutions between pool-3 and pool-4 use dilation 2, and those after pool-4 use dilation 4. With this module (called the frontend module in the paper), dense prediction is achieved without increasing the number of parameters.

A second module (called the context module in the paper) is trained separately, taking the output of the frontend module as its input. It is a cascade of dilated convolution layers with different dilation rates, so it aggregates multi-scale context and improves the frontend's predictions.

Benchmarks (VOC2012)

Score Comment Source
71.3 frontend reported in the paper
73.5 frontend + context reported in the paper
74.7 frontend + context + CRF reported in the paper
75.3 frontend + context + CRF-RNN reported in the paper

Comments

  • Note that the predicted segmentation map is one-eighth the size of the original image. Almost all methods share this property; the final segmentation map is obtained by interpolation.

DeepLab(v1 & v2)

v1 : Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

Submitted: 22 December 2014

v2 : DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

Submitted: 2 June 2016

Main contributions

  1. Used atrous/dilated convolutions
  2. Proposed atrous spatial pyramid pooling (ASPP)
  3. Used fully connected CRFs

Explanation

Atrous convolution increases the receptive field without increasing the number of parameters. The network is modified in the same way as in the dilated convolutions paper above.

Multi-scale processing is achieved by passing multiple rescaled versions of the original image through parallel CNN branches (image pyramid) and/or by using multiple parallel atrous convolution layers with different sampling rates (ASPP).

Structured prediction is done by a fully connected CRF. The CRF is trained/tuned separately as a post-processing step.

Benchmarks(VOC2012)

Score Comment Source
79.7 ResNet-101 + atrous Convolutions + ASPP + CRF leaderboard

RefineNet

RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation

Submitted: 20 November 2016

Main contributions

  1. An encoder-decoder architecture with well-designed decoder blocks
  2. All components follow a residual connection design

Explanation

The dilated-convolution approach is not without drawbacks. Dilated convolutions are computationally expensive and require a lot of memory, because they have to be applied to a large number of high-resolution feature maps. This hampers the computation of high-resolution predictions; DeepLab's predictions, for example, are 1/8 the size of the original input.

This paper therefore proposes an encoder-decoder architecture. The encoder is made of ResNet-101 blocks. The decoder has RefineNet blocks, which concatenate/fuse high-resolution features from the encoder with low-resolution features from the previous RefineNet block.

Each RefineNet block has one component that fuses multi-resolution features by upsampling the lower-resolution features, and another component that captures context with repeated 5x5, stride-1 pooling layers. Each component follows the identity-mapping idea and uses residual connections.

Benchmarks(VOC2012)

Score Comment Source
84.2 Uses CRF, Multiscale inputs, COCO pretraining leaderboard

PSPNet

Pyramid Scene Parsing Network

Submitted: 4 December 2016

Main contributions

  1. Proposed a pyramid pooling module to aggregate context
  2. Used an auxiliary loss

Explanation

The global scene category matters because it provides clues about the distribution of segmentation classes. The pyramid pooling module captures this information by applying pooling layers with large kernels.

As in the dilated convolutions paper, ResNet is modified with dilated convolutions and a pyramid pooling module is added. The module pools the feature maps in parallel at several scales (as large as the whole input, half of it, and small portions of it), the pooled maps are upsampled to a common size, and they are finally concatenated with the feature maps coming from the ResNet.

An auxiliary loss, in addition to the loss on the main branch, is applied after the fourth stage of the ResNet (i.e. the input to the pyramid pooling module). This idea is also known elsewhere as intermediate supervision.

Benchmarks(VOC2012)

Score Comment Source
85.4 MSCOCO pretraining, multi scale input, no CRF leaderboard
82.6 no MSCOCO pretraining, multi scale input, no CRF reported in the paper

Large Kernel Matters

Large Kernel Matters – Improve Semantic Segmentation by Global Convolutional Network

Submitted: 8 March 2017

Main contributions

  1. Proposed an encoder-decoder architecture with very large kernels

Explanation

Semantic segmentation requires both segmenting and classifying the segmented objects. Since fully connected layers cannot be used in a segmentation architecture, convolutions with very large kernels are used instead.

Another reason for adopting large kernels is that, although very deep networks such as ResNet have a large theoretical receptive field, experiments (https://arxiv.org/abs/1412.6856) show that the network tends to gather information from a much smaller region (the effective/valid receptive field).

Larger kernels are computationally more expensive and have many parameters. Therefore, the k x k convolution is approximated with the sum of a 1 x k + k x 1 branch and a k x 1 + 1 x k branch. The paper calls this module the Global Convolutional Network (GCN).

As for the architecture, a ResNet (without dilated convolutions) forms the encoder, while GCN modules and deconvolutions form the decoder. A simple residual block called Boundary Refinement (BR) is also used.
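A minimal PyTorch sketch of this large-kernel factorization (my own illustration; k = 15 and the channel sizes are assumptions, and the Boundary Refinement block is not shown):

```python
import torch
import torch.nn as nn

class GCNBlock(nn.Module):
    """A large k x k convolution is approximated by two cheaper branches,
    1xk followed by kx1 and kx1 followed by 1xk, whose outputs are summed."""
    def __init__(self, in_ch, out_ch, k=15):
        super().__init__()
        p = k // 2
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(1, k), padding=(0, p)),
            nn.Conv2d(out_ch, out_ch, kernel_size=(k, 1), padding=(p, 0)))
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=(k, 1), padding=(p, 0)),
            nn.Conv2d(out_ch, out_ch, kernel_size=(1, k), padding=(0, p)))

    def forward(self, x):
        return self.branch_a(x) + self.branch_b(x)

print(GCNBlock(2048, 21)(torch.randn(1, 2048, 16, 16)).shape)  # (1, 21, 16, 16)
```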

Benchmarks(VOC2012)

Score Comment Source
82.2 - reported in the paper
83.6 Improved training, not described in the paper leaderboard

DeepLab v3

Rethinking Atrous Convolution for Semantic Image Segmentation

Submitted: 17 June 2017

Main contributions

  1. An improved atrous spatial pyramid pooling (ASPP)
  2. A module of cascaded atrous convolutions

Explanation

As in DeepLab v2 and the dilated convolutions paper, the ResNet model is modified so that it uses atrous convolutions. The improved ASPP involves the concatenation of image-level features, a 1x1 convolution and three 3x3 atrous convolutions with different rates. Batch normalization is used after each of the parallel convolution layers.

The cascaded module is a ResNet block, except that its convolution layers use atrous convolutions with different rates. It is similar to the context module in the dilated convolutions paper, but it is applied directly to intermediate feature maps rather than to belief maps (a belief map is the final CNN feature map with as many channels as there are classes).

The two proposed models are evaluated independently, without any of the other performance-improving tricks. They perform similarly; the ASPP variant is slightly better, and no CRF is used.

Both models outperform the best model of DeepLab v2. The authors attribute the improvement mainly to batch normalization and a better way of encoding multi-scale context.

Benchmarks(VOC2012)

Score Comment Source
85.7 used ASPP (no cascaded modules) leaderboard

References

http://blog.qure.ai/notes/semantic-segmentation-deep-learning-review


机器之心, by 路雪, 14 July 2017

What is semantic segmentation?

  Semantic segmentation means recognizing an image at the pixel level, i.e. labelling the object class of every pixel in the image. As in the figure below:

  Left: input image. Right: its semantic segmentation.

  Besides recognizing the bike and the person riding it, we also have to delineate the boundary of each object. Therefore, unlike image classification, semantic segmentation requires the model to make dense pixel-level predictions.

  VOC2012 and MSCOCO are the most important datasets in the semantic segmentation field.

  What are the different approaches?

  Before deep learning was applied to computer vision, people used TextonForest and random forest classifiers for semantic segmentation. Convolutional neural networks (CNNs) have not only helped image recognition but have also greatly advanced the field of semantic segmentation.

  The first popular deep learning method for semantic segmentation was patch classification, in which every pixel is classified independently using the image patch around it. The main reason for using patch classification was that classification networks contain fully connected layers and therefore require fixed-size images.

  In 2014, Long et al. from UC Berkeley proposed Fully Convolutional Networks (FCNs), which let CNNs make dense pixel predictions without any fully connected layers and popularized CNNs for this task. This approach can produce segmentation maps for images of any size and is much faster than patch classification. Almost all subsequent state-of-the-art methods in semantic segmentation adopted this model.

  Apart from fully connected layers, another big problem with using CNNs for semantic segmentation is the pooling layers. Pooling layers enlarge the receptive field and aggregate context, but at the cost of discarding location information. Semantic segmentation, however, requires the class map to align exactly with the image, so location information must be preserved. This article introduces two different architectures that address this problem.

  The first is the encoder-decoder architecture. The encoder gradually reduces the spatial dimensions with pooling layers, and the decoder gradually recovers the object details and spatial dimensions. There are usually shortcut connections between encoder and decoder, which help the decoder recover object details better. U-Net is the most commonly used architecture of this kind.

  U-Net: an encoder-decoder architecture

  The second approach uses dilated/atrous convolutions and removes the pooling layers.

  Dilated/atrous convolution; rate = 1 corresponds to a standard convolution

  Conditional Random Field (CRF) post-processing is usually used to improve the segmentation. A CRF is a graphical model that "smooths" the segmentation based on the underlying image intensities; it works on the principle that pixels with similar intensities tend to belong to the same class. CRFs can raise scores by 1-2%.

  Illustration of CRFs. (b) A unary classifier provides the segmentation input to the CRF. (c, d, e) are variants of the CRF, of which (e) is the widely used one.

  Below, I summarize several papers, tracing how segmentation architectures have evolved since the FCN. All of these architectures are benchmarked on the VOC2012 evaluation server.

  Paper overview

  The following papers are introduced in chronological order:

  1. FCN

  2. SegNet

  3. Dilated Convolutions

  4. DeepLab (v1 & v2)

  5. RefineNet

  6. PSPNet

  7. Large Kernel Matters

  8. DeepLabv3

  For each paper, I list its main contributions and add a brief explanation. I also show its benchmark score (mean IoU) on the VOC2012 test dataset.

  FCN

  Fully Convolutional Networks for Semantic Segmentation

  Submitted 14 November 2014

  arXiv link: https://arxiv.org/abs/1411.4038

  Main contributions:

  Popularized the use of end-to-end convolutional networks for semantic segmentation

  Re-purposed ImageNet pretrained networks for semantic segmentation

  Used deconvolution layers for upsampling

  Used skip connections to improve the granularity of the upsampling

  Explanation:

  The key point of this paper is that the fully connected layers of a classification network can be viewed as convolutions whose kernels cover the entire input region. This is equivalent to evaluating the original classification network on overlapping input patches, but it is much more efficient because computation is shared across the overlapping regions. Although this observation is not unique to the paper, it significantly improved the best results on the VOC2012 dataset.

  Fully connected layers as convolutions

  After convolutionalizing the fully connected layers of an ImageNet-pretrained network such as VGG, the feature maps still need to be upsampled because of the pooling operations in the CNN. Instead of simple bilinear interpolation, deconvolution layers learn the interpolation. Deconvolution layers are also known as upconvolution, full convolution, transposed convolution or fractionally-strided convolution.

  However, because pooling loses information, upsampling (even with deconvolution layers) produces rather coarse segmentation maps. Shortcut/skip connections from higher-resolution feature maps are therefore introduced to reduce this coarseness.

  VOC2012 benchmark scores:

  Personal comments:

  This was an important contribution, but the state of the art has since advanced considerably.

  SegNet

  SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

  Submitted 2 November 2015

  arXiv link: https://arxiv.org/abs/1511.00561

  Main contributions:

  Transfers the max-pooling indices to the decoder, improving the segmentation resolution.

  Explanation:

  In the FCN, despite the deconvolution layers and a few skip connections, the output segmentation maps are still rather coarse, so more skip connections are introduced. However, instead of copying the encoder features as in the FCN, SegNet copies the max-pooling indices. This makes SegNet more memory-efficient than the FCN.

  SegNet architecture

  Personal comments:

  FCN and SegNet were among the earliest encoder-decoder architectures.

  SegNet's benchmark score is not good enough to warrant further use.

  Dilated Convolutions

  Multi-Scale Context Aggregation by Dilated Convolutions

  Submitted 23 November 2015

  arXiv link: https://arxiv.org/abs/1511.07122

  Main contributions:

  Uses dilated convolutions, a convolutional layer suited for dense prediction.

  Proposes a "context module" that uses dilated convolutions for multi-scale context aggregation.

  Explanation:

  Pooling enlarges the receptive field and therefore helps classification networks, but it also lowers the resolution, which is not ideal for semantic segmentation. The authors therefore use dilated convolution layers, which work as illustrated here:

  Dilated/atrous convolution

  The dilated convolution layer (which DeepLab calls atrous convolution) lets the receptive field grow exponentially while the spatial dimensions are preserved.

  The last two pooling layers of the pretrained classification network (VGG here) are removed, and the subsequent convolution layers all use dilated convolutions. In particular, the convolutions between pool-3 and pool-4 use dilation 2, and those after pool-4 use dilation 4. With this module (the frontend module in the paper), dense prediction is achieved without adding any parameters. Another module (the context module in the paper) is trained separately, taking the output of the frontend module as its input. It is a cascade of dilated convolutions with different dilation rates, so it aggregates multi-scale context and improves the predictions obtained from the frontend module.

  Personal comments:

  The predicted segmentation map is 1/8 the size of the image. Almost all methods share this property; the final segmentation map is usually obtained by interpolation.

  DeepLab (v1 & v2)

  v1: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs

  Submitted 22 December 2014

  arXiv link: https://arxiv.org/abs/1412.7062

  v2: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

  Submitted 2 June 2016

  arXiv link: https://arxiv.org/abs/1606.00915

  Main contributions:

  Uses atrous/dilated convolutions.

  Proposes atrous spatial pyramid pooling (ASPP).

  Uses fully connected CRFs.

  Explanation:

  Atrous/dilated convolutions enlarge the receptive field without increasing the number of parameters. The segmentation network is modified and improved as described in the dilated convolutions paper above.

  Multi-scale processing is achieved either by passing multiple rescaled versions of the original image through parallel CNN branches (image pyramid) or by using multiple parallel atrous convolution layers with different sampling rates (ASPP).

  Structured prediction is performed by a fully connected CRF. The CRF is trained/tuned separately as a post-processing step.

  DeepLab v2 pipeline

  RefineNet

  RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation

  Submitted 20 November 2016

  arXiv link: https://arxiv.org/abs/1611.06612

  Main contributions:

  An encoder-decoder architecture with carefully designed decoder modules.

  All components follow a residual connection design.

  Explanation:

  The dilated/atrous convolution approach also has drawbacks. Because dilated convolutions operate on a large number of high-resolution feature maps, they are computationally expensive and use a lot of memory. This hampers the computation of high-resolution predictions; DeepLab's predictions, for example, are 1/8 the size of the original input image.

  The paper therefore proposes an encoder-decoder architecture. The encoder is a ResNet-101; the decoder consists of RefineNet modules, which fuse high-resolution features from the encoder with low-resolution features from the previous RefineNet module.

  RefineNet architecture

  Each RefineNet module has two components: one fuses multi-resolution features by upsampling the lower-resolution features, and the other captures context with repeated 5x5, stride-1 pooling layers. These components follow the identity-mapping idea and use residual connection designs.

  RefineNet module

  PSPNet

  Pyramid Scene Parsing Network

  Submitted 4 December 2016

  arXiv link: https://arxiv.org/abs/1612.01105

  Main contributions:

  Proposes a pyramid pooling module for context aggregation.

  Uses an auxiliary loss.

  Explanation:

  The global scene category matters because it provides clues about the distribution of segmentation classes. The pyramid pooling module captures this information by applying pooling layers with large kernels. As in the dilated convolutions paper above, PSPNet also modifies ResNet with dilated convolutions and adds a pyramid pooling module. This module concatenates the ResNet feature maps with the upsampled outputs of parallel pooling layers whose kernels cover the whole image, half of the image and small portions of it.

  An auxiliary loss, in addition to the loss on the main branch, is applied after the fourth stage of the ResNet (i.e. the input to the pyramid pooling module). This idea is also referred to elsewhere as intermediate supervision.

  PSPNet architecture

  Large Kernel Matters

  Large Kernel Matters -- Improve Semantic Segmentation by Global Convolutional Network

  Submitted 8 March 2017

  arXiv link: https://arxiv.org/abs/1703.02719

  Main contributions:

  Proposes an encoder-decoder architecture with very large convolution kernels.

  Explanation:

  Semantic segmentation requires not only segmenting but also classifying the segmented objects. Since fully connected layers cannot be used in a segmentation architecture, convolutions with very large kernels are used in their place.

  Another reason for using large kernels is that, although deeper networks such as ResNet have large theoretical receptive fields, studies show that such networks tend to gather information from a much smaller region (the effective receptive field). Large kernels are computationally expensive and have many parameters, so the k x k convolution is approximated by the sum of a 1 x k + k x 1 branch and a k x 1 + 1 x k branch. The paper calls this module the Global Convolutional Network (GCN).

  As for the architecture, a ResNet (without dilated convolutions) forms the encoder, while GCN modules and deconvolutions form the decoder. A simple residual block called Boundary Refinement (BR) is also used.

  GCN architecture

  VOC2012 test scores:

  DeepLabv3

  Rethinking Atrous Convolution for Semantic Image Segmentation

  Submitted 17 June 2017

  arXiv link: https://arxiv.org/abs/1706.05587

  Main contributions:

  Improves atrous spatial pyramid pooling (ASPP).

  Adds a module of cascaded atrous convolutions.

  Explanation:

  As in DeepLab v2 and the dilated convolutions paper, this work modifies the ResNet model with atrous/dilated convolutions. The improved ASPP involves the concatenation of image-level features, a 1x1 convolution and three 3x3 atrous convolutions with different rates. Batch normalization is used after each parallel convolution layer.

  The cascaded module is a ResNet block whose convolution layers use atrous convolutions with different rates. It is similar to the context module in the dilated convolutions paper, but it is applied directly to intermediate feature maps rather than to belief maps (a belief map is the final CNN feature map whose number of channels equals the number of classes).

  The paper evaluates the two proposed models separately. Their performance on the validation set is similar, with the ASPP variant slightly better, and no CRF is used. Both models outperform the best model of DeepLab v2. The authors attribute the improvement to batch normalization and to a better way of encoding multi-scale context.

  DeepLabv3 ASPP


Reference: the AI研习社 WeChat public account

  • The difficulty of semantic segmentation: classify every pixel into an instance, then map each instance (classification result) to an entity (person, road, etc.).
  • Challenges that arise in truly understanding the actions in an image or video: keypoint detection, action recognition, video captioning, visual question answering, and so on.
  • Commonly used datasets:

PASCAL VOC: train/val 11k images; test 10k images; image segmentation models are evaluated with the mean Intersection over Union (mIoU)

PASCAL-Context: train 10k; val 10k; test 10k

COCO

Cityscapes: complex segmented urban scenes from 50 cities; train/val 23.5k; test 1.5k

  • Results of some networks:

FCN: using ImageNet pretrained models, mIoU = 62.2% on the 2012 PASCAL VOC

ParseNet: mIoU = 40.4% on PASCAL-Context, mIoU = 69.8% on the 2012 PASCAL VOC

Convolutional and deconvolutional networks: mIoU = 72.5% on the 2012 PASCAL VOC

U-Net: extends the FCN model to biological microscopy images; the architecture has since been widely extended (FPN, PSPNet, DeepLabv3, etc.)

FPN: based on the DeepMask and SharpMask frameworks, AR = 48.1% on COCO

Pyramid Scene Parsing Network (PSPNet): with a ResNet pretrained on COCO, mIoU = 85.4% on the 2012 PASCAL VOC

Mask R-CNN: the best Mask R-CNN uses ResNeXt to extract features and an FPN architecture; AP = 37.1% on the 2016 COCO and AP = 41.8% on the 2017 COCO

DeepLab, DeepLabv3, DeepLabv3+

  • DeepLab: atrous convolutions, spatial pyramid pooling, fully connected CRFs

With a ResNet-101 backbone, DeepLab reaches mIoU = 79.7% on the 2012 PASCAL VOC, mIoU = 45.7% on PASCAL-Context and mIoU = 70.4% on Cityscapes

  • DeepLabv3: cascaded and parallel modules of atrous convolutions (atrous spatial pyramid pooling, ASPP)

The best DeepLabv3, with a ResNet-101 pretrained on ImageNet and JFT-300M, reaches mIoU = 86.9% on the 2012 PASCAL VOC and mIoU = 81.3% on Cityscapes

  • DeepLabv3+: DeepLabv3 combined with an encoder-decoder framework; it introduces the atrous separable convolution, composed of a depthwise convolution (a spatial convolution applied to each input channel) and a pointwise convolution (a 1x1 convolution taking the depthwise convolution's output as input).

DeepLabv3+ framework: an encoder with a backbone CNN and an ASPP produces feature representations, and a decoder with 3x3 convolutions receives them and produces the final predicted image.

The best DeepLabv3+, pretrained on COCO and JFT, reaches mIoU = 89.0% on the 2012 PASCAL VOC and mIoU = 82.1% on Cityscapes

  • Path Aggregation Network (PANet): based on the Mask R-CNN and FPN frameworks while enhancing information propagation. Feature extraction uses an improved FPN architecture with an added bottom-up augmentation path that improves the propagation of low-level features.

With ResNeXt as the feature extractor, PANet achieved a 42.0% average precision score on the 2016 COCO challenge, and 46.7% on the 2017 COCO challenge using an ensemble of seven feature extractors.

  • Context Encoding Network (EncNet): captures the global information in an image to improve scene segmentation performance.

mIoU = 52.6% and pixAcc = 81.2% on PASCAL-Context; mIoU = 85.9% on the 2012 PASCAL VOC

Summary:

One of the main concerns across these architectures is taking the global visual context of the input image into account to improve the segmentation predictions. The state-of-the-art model architectures try to connect different parts of the image in order to understand the relations between objects.

 

 

 
