Deep Learning in Object Recognition, Detection, and Segmentation

Found. Trends Signal Process. Pub Date: 2016-03-01. DOI: 10.1561/2000000071

Xiaogang Wang

Pages: 217-382

Citations: 53

Abstract

As a major breakthrough in artificial intelligence, deep learning has achieved impressive success in solving grand challenges in many fields, including speech recognition, natural language processing, computer vision, image and video processing, and multimedia. This article provides a historical overview of deep learning and focuses on its applications in object recognition, detection, and segmentation, which are key challenges of computer vision and have numerous applications to images and videos. The research topics discussed under object recognition include image classification on ImageNet, face recognition, and video classification. The detection part covers general object detection on ImageNet, pedestrian detection, face landmark detection (face alignment), and human landmark detection (pose estimation). On the segmentation side, the article discusses the most recent progress on scene labeling, semantic segmentation, face parsing, human parsing, and saliency detection. Object recognition is treated as whole-image classification, while detection and segmentation are pixelwise classification tasks. Their fundamental differences are discussed in this article. Fully convolutional neural networks and highly efficient forward and backward propagation algorithms specially designed for pixelwise classification tasks are introduced. The application domains covered are also highly diverse. Human and face images have regular structures, while general object and scene images have much more complex variations in geometric structure and layout. Videos add the temporal dimension. These domains therefore need to be processed with different deep models. All the selected application domains have received tremendous attention in the computer vision and multimedia communities. Through concrete examples of these applications, we explain the key points that make deep learning outperform conventional computer vision systems:

1. Unlike traditional pattern recognition systems, which rely heavily on manually designed features, deep learning automatically learns hierarchical feature representations from massive training data and disentangles hidden factors of the input data through multi-level nonlinear mappings.
2. Unlike existing pattern recognition systems, which design or train their key components sequentially, deep learning is able to jointly optimize all the components and create synergy through close interactions among them.
3. While most machine learning models can be approximated by neural networks with shallow structures, for some tasks the expressive power of deep models increases exponentially as their architectures become deeper. Deep models are especially good at learning global contextual feature representations, owing to their deep structures.
4. Benefiting from the large learning capacity of deep models, some classical computer vision challenges can be recast as high-dimensional data transform problems and solved from new perspectives.

Finally, some open questions and future work regarding deep learning in object recognition, detection, and segmentation are discussed.
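
The abstract's distinction between whole-image classification and pixelwise classification can be illustrated with a toy sketch. The code below is not taken from the article: the feature map, the weights, and the function names (`classify_image`, `classify_pixels`) are hypothetical, and the linear classifier stands in for a trained deep model. It only shows that applying the same classifier at every location (the idea behind a 1x1 convolution in a fully convolutional network) yields a label map, whereas pooling first yields a single label.

```python
# Illustrative sketch only: toy features and weights, not the article's models.

def scores(feature, weights):
    """Linear class scores for one C-dimensional feature vector."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def classify_image(feature_map, weights):
    """Whole-image classification: average-pool all locations, predict one label."""
    cells = [cell for row in feature_map for cell in row]
    channels = len(cells[0])
    pooled = [sum(cell[c] for cell in cells) / len(cells) for c in range(channels)]
    return argmax(scores(pooled, weights))

def classify_pixels(feature_map, weights):
    """Pixelwise classification: apply the same classifier at every location
    (equivalent to a 1x1 convolution), giving one label per location."""
    return [[argmax(scores(cell, weights)) for cell in row] for row in feature_map]

# Hypothetical 2x3 feature map with 2 channels per location.
feature_map = [
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
]
# Hypothetical weights for 2 classes: class 0 reads channel 0, class 1 reads channel 1.
weights = [[1.0, 0.0], [0.0, 1.0]]

image_label = classify_image(feature_map, weights)  # one label for the whole image
label_map = classify_pixels(feature_map, weights)   # one label per location
print(image_label)
print(label_map)
```

In this toy input the whole-image prediction reports only the dominant class, while the pixelwise output also recovers the minority region in the right-hand column, which is the kind of spatial detail that detection and segmentation need.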


