IEEE Transactions on Image Processing: Latest Articles (Page 10)

Multi-Task Deep Relative Attribute Learning for Visual Urban Perception.

IF 10.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing

Pub Date : 2019-08-07 DOI: 10.1109/TIP.2019.2932502

Weiqing Min, Shuhuan Mei, Linhu Liu, Yi Wang, Shuqiang Jiang

Visual urban perception aims to quantify perceptual attributes (e.g., safe and depressing attributes) of the physical urban environment from crowd-sourced street-view images and their pairwise comparisons. It has been receiving increasing attention in computer vision for various applications, such as perceptive attribute learning and urban scene understanding. Most existing methods adopt either (i) a regression model trained using image features and ranked scores converted from pairwise comparisons for perceptual attribute prediction or (ii) a pairwise ranking algorithm to independently learn each perceptual attribute. However, the former fails to directly exploit pairwise comparisons, while the latter ignores the relationships among different attributes. To address both limitations, we propose a Multi-Task Deep Relative Attribute Learning Network (MTDRALN) to learn all the relative attributes simultaneously via multi-task Siamese networks, where each Siamese network predicts one relative attribute. Combined with deep relative attribute learning, we utilize structured sparsity to exploit the prior from natural attribute grouping, where all the attributes are divided in advance into different groups based on semantic relatedness. As a result, MTDRALN is capable of learning all the perceptual attributes simultaneously via multi-task learning. Besides the ranking sub-network, MTDRALN further introduces a classification sub-network, and the losses from these two sub-networks jointly constrain the parameters of the deep network so that it learns more discriminative visual features for relative attribute learning. In addition, our network can be trained end to end, so that deep feature learning and multi-task relative attribute learning reinforce each other. Extensive experiments on the large-scale Place Pulse 2.0 dataset validate the advantage of our proposed network. Our qualitative results, along with visualization of saliency maps, also show that the proposed network is able to learn effective features for perceptual attributes.
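
The abstract describes Siamese networks that learn from pairwise comparisons, one ranking output per attribute. As a rough illustration of that idea, the sketch below computes a pairwise logistic ranking loss summed over several attributes. The linear scorer, the loss form, the function names, and the number of attributes are all assumptions made for this example; the abstract does not specify the MTDRALN loss, the structured-sparsity term, or the classification sub-network, and none of them are reproduced here.

```python
import numpy as np

def pairwise_ranking_loss(score_a, score_b, label):
    """Logistic ranking loss for one comparison (assumed loss form).

    label is +1 if image A was judged higher on the attribute, -1 if image B was.
    """
    margin = label * (score_a - score_b)
    # log(1 + exp(-margin)), computed in a numerically stable way
    return np.logaddexp(0.0, -margin)

def multi_task_loss(feat_a, feat_b, weights, labels):
    """Sum the ranking loss over all attributes.

    feat_a, feat_b: feature vectors of the two images, shape (d,)
    weights: one hypothetical linear scorer per attribute, shape (num_attributes, d)
    labels: +1 / -1 per attribute, shape (num_attributes,)
    """
    scores_a = weights @ feat_a  # the same weights score both images (Siamese sharing)
    scores_b = weights @ feat_b
    return float(np.sum(pairwise_ranking_loss(scores_a, scores_b, labels)))

rng = np.random.default_rng(0)
feat_a, feat_b = rng.normal(size=8), rng.normal(size=8)
weights = rng.normal(size=(6, 8))              # hypothetical: six attributes, 8-D features
labels = np.sign(weights @ (feat_a - feat_b))  # comparisons that agree with the scorer
agree = multi_task_loss(feat_a, feat_b, weights, labels)
disagree = multi_task_loss(feat_a, feat_b, weights, -labels)
```

With labels that agree with the scorer, every margin is positive and the loss is lower than with the flipped labels, which is the signal a ranking sub-network would be trained on.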

Citations: 0

Morphology-based Noise Reduction: Structural Variation and Thresholding in the Bitonic Filter.

IF 10.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing

Pub Date : 2019-08-07 DOI: 10.1109/TIP.2019.2932572

Graham Treece

The bitonic filter was recently developed to embody the novel concept of signal bitonicity (one local extremum within a set range) to differentiate from noise, by use of data ranking and linear operators. For processing images, the spatial extent was locally constrained to a fixed circular mask. Since structure in natural images varies, a novel structurally varying bitonic filter is presented, which locally adapts the mask, without following patterns in the noise. This new filter includes novel robust structurally varying morphological operations, with efficient implementations, and a novel formulation of non-iterative directional Gaussian filtering. Data thresholds are also integrated with the morphological operations, increasing noise reduction for low noise, and enabling a multi-resolution framework for high noise levels. The structurally varying bitonic filter is presented without presuming prior knowledge of morphological filtering, and compared to high-performance linear noise-reduction filters, to set this novel concept in context. These are tested over a wide range of noise levels, on a fairly broad set of images. The new filter is a considerable improvement on the fixed-mask bitonic, outperforms anisotropic diffusion and image-guided filtering in all but extremely low noise, non-local means at all noise levels, but not the block-matching 3D filter, though results are promising for very high noise. The structurally varying bitonic tends to have less characteristic residual noise in regions of smooth signal, and very good preservation of signal edges, though with some loss of small scale detail when compared to the block-matching 3D filter. The efficient implementation means that processing time, though slower than the fixed-mask bitonic filter, remains competitive.

Citations: 0

Sparse Representation based Video Quality Assessment for Synthesized 3D Videos.

IF 10.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing

Pub Date : 2019-07-29 DOI: 10.1109/TIP.2019.2929433

Yun Zhang, Huan Zhang, Mei Yu, Sam Kwong, Yo-Sung Ho

Temporal flicker distortion is one of the most annoying noises in synthesized virtual view videos when they are rendered from compressed multi-view video plus depth in a three-dimensional (3D) video system. To assess the quality of synthesized view video and further optimize the compression techniques in 3D video systems, an objective video quality assessment that can accurately measure flicker distortion is highly needed. In this paper, we propose a full-reference, sparse-representation-based video quality assessment method for synthesized 3D videos. Firstly, a synthesized video, treated as 3D volume data with spatial (X-Y) and temporal (T) domains, is reformed and decomposed into a number of spatially neighboring temporal layers, i.e., X-T or Y-T planes. Gradient features in the temporal layers of the synthesized video and strong edges of the depth maps are used as key features for detecting the location of flicker distortions. Secondly, dictionary learning and sparse representation for the temporal layers are derived and applied to effectively represent the temporal flicker distortion. Thirdly, a rank pooling method is used to pool all the temporal layer scores and obtain the score for the flicker distortion. Finally, the temporal flicker distortion measurement is combined with a conventional spatial distortion measurement to assess the quality of synthesized 3D videos. Experimental results on a synthesized video quality database demonstrate that our proposed method is significantly superior to other state-of-the-art methods, especially on the view synthesis distortions induced by depth videos.
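
The first step of the method, treating the video as a volume and slicing it into X-T and Y-T planes, can be sketched with plain array operations. The example below assumes a `(T, Y, X)` axis order and uses a simple finite difference along time as a stand-in for the gradient features; the function names and the toy clip are illustrative assumptions, and the dictionary learning, sparse representation, and rank pooling stages are not shown.

```python
import numpy as np

def temporal_layers(video):
    """Split a (T, Y, X) video volume into X-T and Y-T planes.

    Returns (xt_planes, yt_planes):
      xt_planes[y] is the T x X slice at image row y,
      yt_planes[x] is the T x Y slice at image column x.
    """
    video = np.asarray(video)
    xt_planes = np.transpose(video, (1, 0, 2))  # shape (Y, T, X)
    yt_planes = np.transpose(video, (2, 0, 1))  # shape (X, T, Y)
    return xt_planes, yt_planes

def temporal_gradient_magnitude(plane):
    """Absolute finite difference along the time axis (axis 0) of one plane."""
    return np.abs(np.diff(plane.astype(float), axis=0))

# Toy clip: 4 frames of 3 x 5 pixels, static except for one flickering pixel.
video = np.zeros((4, 3, 5))
video[1, 2, 4] = 1.0
video[3, 2, 4] = 1.0
xt, yt = temporal_layers(video)
flicker = temporal_gradient_magnitude(xt[2])  # X-T plane of image row y = 2
```

In the X-T plane the flickering pixel shows up as a column of large temporal differences, while static pixels contribute zeros.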

Citations: 0

Enhanced Fuzzy-based Local Information Algorithm for Sonar Image Segmentation.

IF 10.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing

Pub Date : 2019-07-29 DOI: 10.1109/TIP.2019.2930148

Avi Abu, Roee Diamant

The recent boost in undersea operations has led to the development of high-resolution sonar systems mounted on autonomous vehicles. These vehicles are used to scan the seafloor in search of different objects such as sunken ships, archaeological sites, and submerged mines. An important part of the detection operation is the segmentation of sonar images, where the object's highlight and shadow are distinguished from the seabed background. In this work, we focus on the automatic segmentation of sonar images. We present our enhanced fuzzy-based with kernel metric (EnFK) algorithm for the segmentation of sonar images which, in an attempt to improve segmentation accuracy, introduces two new fuzzy terms of local spatial and statistical information. Our algorithm includes a preliminary de-noising algorithm which, together with the original image, feeds into the segmentation procedure to avoid becoming trapped in local minima and to improve convergence. The result is a segmentation procedure that specifically suits the intensity inhomogeneity and the complex seabed texture of sonar images. We tested our approach using simulated images, real sonar images, and sonar images that we created in two different sea experiments, using multibeam sonar and synthetic aperture sonar. The results show accurate segmentation performance that far exceeds state-of-the-art results.
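
For orientation, the sketch below shows plain fuzzy c-means clustering on pixel intensities, the family of methods that fuzzy-based segmentation builds on. It is not the EnFK algorithm: the kernel metric, the two local spatial and statistical terms, and the de-noising stage are omitted, and the three-cluster toy data (shadow, background, highlight) and all function names are assumptions for illustration.

```python
import numpy as np

def memberships_for(x, centers, m, eps=1e-9):
    """Fuzzy membership of every value in every cluster; each column sums to 1."""
    dist = np.abs(x[None, :] - centers[:, None]) + eps  # shape (clusters, pixels)
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def fuzzy_c_means(values, n_clusters=3, m=2.0, n_iter=50):
    """Plain fuzzy c-means on a 1-D array of pixel intensities."""
    x = np.asarray(values, dtype=float).ravel()
    # Spread the initial centers evenly over the intensity range.
    centers = np.linspace(x.min(), x.max(), n_clusters)
    for _ in range(n_iter):
        weights = memberships_for(x, centers, m) ** m
        centers = (weights @ x) / weights.sum(axis=1)  # membership-weighted means
    return centers, memberships_for(x, centers, m)

# Toy intensities: dark shadow, mid-grey seabed, bright highlight.
rng = np.random.default_rng(0)
pixels = np.concatenate([
    rng.normal(0.1, 0.02, 200),
    rng.normal(0.5, 0.02, 200),
    rng.normal(0.9, 0.02, 200),
])
centers, memberships = fuzzy_c_means(pixels)
labels = memberships.argmax(axis=0)  # hard labels from the fuzzy memberships
```

Because each pixel is clustered independently here, noise and intensity inhomogeneity affect the labels directly, which is the weakness that local spatial terms are meant to address.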

Citations: 0

Homologous Component Analysis for Domain Adaptation.

IF 10.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing

Pub Date : 2019-07-29 DOI: 10.1109/TIP.2019.2929421

Youfa Liu, Weiping Tu, Bo Du, Lefei Zhang, Dacheng Tao

Domain adaptation approaches based on the covariate shift assumption usually utilize only one common transformation to align marginal distributions while preserving conditional distributions. However, a single common transformation may cause loss of useful information, such as variances and neighborhood relationships in both the source and target domains. To address this problem, we propose a novel method called homologous component analysis (HCA), in which we seek two totally different but homologous transformations to align distributions with side information while preserving conditional distributions. As it is hard to find a closed-form solution to the corresponding optimization problem, we solve it by means of the alternating direction minimizing method (ADMM) in the context of Stiefel manifolds. We also provide a generalization error bound for domain adaptation in the semi-supervised case, and two transformations can help decrease this upper bound more than a single common transformation does. Extensive experiments on synthetic and real data show the effectiveness of the proposed method by comparing its classification accuracy with that of state-of-the-art methods, and numerical evidence on chordal distance and Frobenius distance shows that the resulting optimal transformations are different.

Citations: 0

Unambiguous Scene Text Segmentation with Referring Expression Comprehension.

IF 10.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing

Pub Date : 2019-07-26 DOI: 10.1109/TIP.2019.2930176

Xuejian Rong, Chucai Yi, Yingli Tian

Text instances provide valuable information for the understanding and interpretation of natural scenes. The rich, precise high-level semantics embodied in the text could be beneficial for understanding the world around us, and empower a wide range of real-world applications. While most recent visual phrase grounding approaches focus on general objects, this paper explores extracting designated texts and predicting an unambiguous scene text segmentation mask, i.e., scene text segmentation from natural language descriptions (referring expressions) like "orange text on a little boy in black swinging a bat". The solution of this novel problem enables accurate segmentation of scene text instances from the complex background. In our proposed framework, a unified deep network jointly models visual and linguistic information by encoding both region-level and pixel-level visual features of natural scene images into spatial feature maps, and then decodes them into a saliency response map of text instances. To conduct quantitative evaluations, we establish a new scene text referring expression segmentation dataset: COCO-CharRef. Experimental results demonstrate the effectiveness of the proposed framework on the text instance segmentation task. By combining image-based visual features with language-based textual explanations, our framework outperforms baselines that are derived from state-of-the-art text localization and natural language object retrieval methods on the COCO-CharRef dataset.

Citations: 0

Hyperspectral Image Denoising via Matrix Factorization and Deep Prior Regularization.

IF 10.6 | Zone 1, Computer Science | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing

Pub Date : 2019-07-19 DOI: 10.1109/TIP.2019.2928627

Baihong Lin, Xiaoming Tao, Jianhua Lu

Deep learning has been successfully introduced for 2D-image denoising, but it is still unsatisfactory for hyperspectral image (HSI) denoising due to the unacceptable computational complexity of the end-to-end training process and the difficulty of building a universal 3D-image training dataset. In this paper, instead of developing an end-to-end deep learning denoising network, we propose a hyperspectral image denoising framework for the removal of mixed Gaussian impulse noise, in which the denoising problem is modeled as a convolutional neural network (CNN) constrained non-negative matrix factorization problem. Using proximal alternating linearized minimization, the optimization can be divided into three steps: the update of the spectral matrix, the update of the abundance matrix, and the estimation of the sparse noise. Then, we design the CNN architecture and propose two training schemes, which allow the CNN to be trained with a 2D-image dataset. Compared with state-of-the-art denoising methods, the proposed method has relatively good performance on the removal of Gaussian and mixed Gaussian impulse noise. More importantly, the proposed model needs to be trained only once on a 2D-image dataset, yet can be used to denoise HSIs with different numbers of channel bands.
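
To make the factorization model concrete, the sketch below unfolds a hyperspectral cube into a bands-by-pixels matrix, models it as a spectral matrix times an abundance matrix, and estimates impulse-like outliers by soft-thresholding the residual. The soft-threshold step assumes an l1 penalty on the sparse term, which the abstract does not state; the CNN prior and the two factor updates are omitted, and all names, shapes, and the threshold value are hypothetical.

```python
import numpy as np

def unfold(cube):
    """Reshape an (H, W, B) hyperspectral cube into a (B, H*W) matrix."""
    h, w, b = cube.shape
    return cube.reshape(h * w, b).T

def soft_threshold(residual, tau):
    """Proximal operator of tau * ||.||_1: shrink every entry toward zero by tau."""
    return np.sign(residual) * np.maximum(np.abs(residual) - tau, 0.0)

# Toy data: rank-2 non-negative factors plus a single impulse.
rng = np.random.default_rng(0)
spectral = rng.uniform(size=(10, 2))      # B x r, hypothetical spectral matrix
abundance = rng.uniform(size=(2, 6 * 5))  # r x (H*W), hypothetical abundance matrix
clean = spectral @ abundance
noisy = clean.copy()
noisy[3, 7] += 5.0                        # impulse in band 3, pixel 7
cube = noisy.T.reshape(6, 5, 10)          # stored as (H, W, B)

Y = unfold(cube)
# With the factors held fixed, the sparse-noise estimate is the thresholded residual.
sparse = soft_threshold(Y - spectral @ abundance, tau=0.5)
```

Because the factor shapes depend only on the number of bands and the chosen rank, a prior applied to the abundance maps can in principle stay the same when the band count changes, which fits the abstract's remark about HSIs with different numbers of bands.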

Citations: 0

Weighted Guided Image Filtering with Steering Kernel.

IF 10.6 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing

Pub Date : 2019-07-19 DOI: 10.1109/TIP.2019.2928631

Zhonggui Sun, Bo Han, Jie Li, Jin Zhang, Xinbo Gao

Because of its local nature, the guided image filter (GIF) generally suffers from halo artifacts near edges. To remedy this deficiency, a weighted guided image filter (WGIF) was recently proposed that incorporates an edge-aware weighting into the filtering process. It combines the advantages of local and global operations and achieves better edge preservation. However, edge direction, a vital property of the guidance image, is not fully considered in these guided filters. To overcome this drawback, we propose a novel version of GIF that exploits edge direction more fully. In particular, we use the steering kernel to adaptively learn the direction and incorporate the learned results into the filtering process to improve the filter's behavior. Theoretical analysis shows that the proposed method achieves stronger performance, preserving edges and reducing halo artifacts effectively. Similar conclusions are reached in thorough experiments on edge-aware smoothing, detail enhancement, denoising, and dehazing.
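
For readers unfamiliar with the baseline, the following is a minimal 1D sketch of the plain guided filter that WGIF and the proposed method build on. It fits a local linear model q = a * guide + b in each window and averages the coefficients. It does not include the edge-aware weighting or the steering kernel described in the abstract. The function names and the parameters `r` and `eps` are illustrative choices.

```python
def box_mean(x, r):
    """Mean of x over the window [i - r, i + r], clipped at the borders."""
    n = len(x)
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v)
    out = []
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out


def guided_filter_1d(guide, src, r=2, eps=1e-3):
    """Plain guided filter on 1D lists; eps regularizes the local slope a."""
    mean_g = box_mean(guide, r)
    mean_s = box_mean(src, r)
    corr_gg = box_mean([g * g for g in guide], r)
    corr_gs = box_mean([g * s for g, s in zip(guide, src)], r)
    a, b = [], []
    for mg, ms, cgg, cgs in zip(mean_g, mean_s, corr_gg, corr_gs):
        var_g = cgg - mg * mg          # local variance of the guide
        cov_gs = cgs - mg * ms         # local covariance of guide and source
        a_k = cov_gs / (var_g + eps)
        a.append(a_k)
        b.append(ms - a_k * mg)
    mean_a = box_mean(a, r)            # average coefficients over windows
    mean_b = box_mean(b, r)
    return [ma * g + mb for ma, mb, g in zip(mean_a, mean_b, guide)]


# Self-guided filtering of a step edge: flat regions and the edge are kept.
step = [0.0] * 8 + [1.0] * 8
smoothed = guided_filter_1d(step, step, r=2, eps=1e-4)
```

A larger `eps` pushes the slope `a` toward zero and smooths more strongly. Weighted variants adapt this regularization per pixel instead of using a single constant.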

Citations: 0

Re-Caption: Saliency-Enhanced Image Captioning through Two-Phase Learning.

IF 10.6 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing

Pub Date : 2019-07-17 DOI: 10.1109/TIP.2019.2928144

Lian Zhou, Yuejie Zhang, Yugang Jiang, Tao Zhang, Weiguo Fan

Visual and semantic saliency are important in image captioning. However, single-phase image captioning benefits little from limited saliency without a saliency predictor. In this paper, a novel saliency-enhanced re-captioning framework based on two-phase learning is proposed to enhance single-phase image captioning. In the framework, visual saliency and semantic saliency are distilled from the first-phase model and fused with the second-phase model for model self-boosting. The visual saliency mechanism can generate a saliency map and a saliency mask for an image without learning a saliency map predictor. The semantic saliency mechanism sheds some light on the properties of words tagged as nouns in a caption. In addition, another type of saliency, sample saliency, is proposed to explicitly compute the saliency degree of each sample, which makes image captioning more robust. How to combine these three types of saliency for a further performance boost is also examined. Our framework can treat an image captioning model as a saliency extractor, which may benefit other captioning models and related tasks. Experimental results on both the Flickr30k and MSCOCO datasets show that the saliency-enhanced models obtain promising performance gains.

Citations: 0

Learning Modality-Specific Representations for Visible-Infrared Person Re-Identification.

IF 10.6 Zone 1 (Computer Science) Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

IEEE Transactions on Image Processing

Pub Date : 2019-07-17 DOI: 10.1109/TIP.2019.2928126

Zhanxiang Feng, Jianhuang Lai, Xiaohua Xie

Traditional person re-identification (re-id) methods perform poorly under changing illumination. This situation can be addressed by using dual cameras that capture visible images in a bright environment and infrared images in a dark environment. Yet this scheme must solve the visible-infrared matching problem, which is largely under-studied. Matching pedestrians across heterogeneous modalities is extremely challenging because of their different visual characteristics. In this paper, we propose a novel framework that employs modality-specific networks to tackle the heterogeneous matching problem. The proposed framework exploits modality-related information and extracts modality-specific representations (MSR) by constructing an individual network for each modality. In addition, a cross-modality Euclidean constraint is introduced to narrow the gap between the different networks. We also integrate modality-shared layers into the modality-specific networks to extract shareable information, and use a modality-shared identity loss to facilitate the extraction of modality-invariant features. A modality-specific discriminant metric is then learned for each domain to strengthen the discriminative power of MSR. Finally, we use a view classifier to learn view information. The experiments demonstrate that MSR effectively improves the performance of deep networks on VI-REID and remarkably outperforms state-of-the-art methods.
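
One plausible reading of the cross-modality Euclidean constraint is a penalty on the squared Euclidean distance between paired features produced by the visible and infrared networks. The abstract does not say exactly what the constraint acts on, so the sketch below is an assumption rather than the authors' loss. The name `euclidean_constraint` and the pairing of features by index are hypothetical.

```python
def euclidean_constraint(visible_feats, infrared_feats):
    """Mean squared Euclidean distance between index-paired feature vectors."""
    if len(visible_feats) != len(infrared_feats) or not visible_feats:
        raise ValueError("expected two non-empty, equally sized feature lists")
    total = 0.0
    for v, t in zip(visible_feats, infrared_feats):
        total += sum((a - b) ** 2 for a, b in zip(v, t))
    return total / len(visible_feats)


# Two paired samples with 2-D features: the first pair matches, the second
# differs by one unit in one coordinate.
visible = [[1.0, 0.0], [0.0, 1.0]]
infrared = [[1.0, 0.0], [0.0, 0.0]]
penalty = euclidean_constraint(visible, infrared)
```

Such a term would typically be added to the identity loss with a weighting factor, so that the two modality-specific networks are pulled together without collapsing into one shared network.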

Citations: 0