Pub Date: 2019-08-07, DOI: 10.1109/TIP.2019.2932502
Weiqing Min, Shuhuan Mei, Linhu Liu, Yi Wang, Shuqiang Jiang
Visual urban perception aims to quantify perceptual attributes (e.g., safe and depressing attributes) of the physical urban environment from crowd-sourced street-view images and their pairwise comparisons. It has been receiving increasing attention in computer vision for various applications, such as perceptive attribute learning and urban scene understanding. Most existing methods adopt either (i) a regression model trained using image features and ranked scores converted from pairwise comparisons for perceptual attribute prediction or (ii) a pairwise ranking algorithm to independently learn each perceptual attribute. However, the former fails to directly exploit pairwise comparisons while the latter ignores the relationship among different attributes. To address these limitations, we propose a Multi-Task Deep Relative Attribute Learning Network (MTDRALN) to learn all the relative attributes simultaneously via multi-task Siamese networks, where each Siamese network predicts one relative attribute. Combined with deep relative attribute learning, we utilize structured sparsity to exploit the prior from natural attribute grouping, where all the attributes are divided in advance into different groups based on semantic relatedness. As a result, MTDRALN is capable of learning all the perceptual attributes simultaneously via multi-task learning. Besides the ranking sub-network, MTDRALN further introduces a classification sub-network, and the two types of losses from the two sub-networks jointly constrain the parameters of the deep network so that it learns more discriminative visual features for relative attribute learning. In addition, our network can be trained in an end-to-end way so that deep feature learning and multi-task relative attribute learning reinforce each other. Extensive experiments on the large-scale Place Pulse 2.0 dataset validate the advantage of our proposed network. Our qualitative results, along with visualization of saliency maps, also show that the proposed network is able to learn effective features for perceptual attributes.
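The pairwise ranking idea in the abstract above can be illustrated with a minimal sketch. It assumes a linear scorer standing in for the deep network and a logistic ranking loss; the abstract does not state the exact loss, and the structured sparsity term and the classification sub-network are not modeled. All function names are hypothetical.

```python
import math

def score(weights, features):
    """Shared (Siamese) scorer: the same weights rate either image of a pair."""
    return sum(w * x for w, x in zip(weights, features))

def pairwise_ranking_loss(weights, left, right, label):
    """Logistic ranking loss (an assumed form, not taken from the paper).

    label is +1 if `left` won the crowd-sourced comparison, -1 if `right` won.
    """
    margin = label * (score(weights, left) - score(weights, right))
    return math.log1p(math.exp(-margin))

def multi_task_loss(task_weights, left, right, labels):
    """Sum the ranking loss over attributes.

    task_weights maps attribute name -> weight vector (one scorer per attribute);
    labels maps attribute name -> +1 or -1 for this image pair.
    """
    return sum(
        pairwise_ranking_loss(task_weights[name], left, right, label)
        for name, label in labels.items()
    )
```

The loss is small when the scorer agrees with the human comparison and grows when it disagrees, which is what lets the comparisons be used directly instead of being converted to ranked scores first.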
Title: Multi-Task Deep Relative Attribute Learning for Visual Urban Perception. Journal: IEEE Transactions on Image Processing.
Pub Date: 2019-08-07, DOI: 10.1109/TIP.2019.2932572
Graham Treece
The bitonic filter was recently developed to embody the novel concept of signal bitonicity (one local extremum within a set range) to differentiate from noise, by use of data ranking and linear operators. For processing images, the spatial extent was locally constrained to a fixed circular mask. Since structure in natural images varies, a novel structurally varying bitonic filter is presented, which locally adapts the mask, without following patterns in the noise. This new filter includes novel robust structurally varying morphological operations, with efficient implementations, and a novel formulation of non-iterative directional Gaussian filtering. Data thresholds are also integrated with the morphological operations, increasing noise reduction for low noise, and enabling a multi-resolution framework for high noise levels. The structurally varying bitonic filter is presented without presuming prior knowledge of morphological filtering, and compared to high-performance linear noise-reduction filters, to set this novel concept in context. These are tested over a wide range of noise levels, on a fairly broad set of images. The new filter is a considerable improvement on the fixed-mask bitonic, outperforms anisotropic diffusion and image-guided filtering in all but extremely low noise, non-local means at all noise levels, but not the block-matching 3D filter, though results are promising for very high noise. The structurally varying bitonic tends to have less characteristic residual noise in regions of smooth signal, and very good preservation of signal edges, though with some loss of small scale detail when compared to the block-matching 3D filter. The efficient implementation means that processing time, though slower than the fixed-mask bitonic filter, remains competitive.
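For readers without a background in morphological filtering, the rank-based building blocks the abstract refers to can be sketched in one dimension. This is a plain fixed-window opening and closing, not the bitonic filter or its structurally varying version; the function names and the truncated-window border handling are assumptions made for illustration.

```python
def erode(signal, radius):
    """Minimum over a sliding window (window is truncated at the borders)."""
    n = len(signal)
    return [min(signal[max(0, i - radius): min(n, i + radius + 1)]) for i in range(n)]

def dilate(signal, radius):
    """Maximum over a sliding window (window is truncated at the borders)."""
    n = len(signal)
    return [max(signal[max(0, i - radius): min(n, i + radius + 1)]) for i in range(n)]

def opening(signal, radius):
    """Erosion then dilation: removes positive spikes narrower than the window."""
    return dilate(erode(signal, radius), radius)

def closing(signal, radius):
    """Dilation then erosion: removes negative spikes narrower than the window."""
    return erode(dilate(signal, radius), radius)
```

For example, `opening([0, 0, 5, 0, 0, 0, 0], 1)` flattens the isolated spike, while a step edge such as `[0, 0, 0, 4, 4, 4]` passes through unchanged, which is the edge-preserving behavior morphological noise reduction relies on.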
Title: Morphology-based Noise Reduction: Structural Variation and Thresholding in the Bitonic Filter. Journal: IEEE Transactions on Image Processing.
Pub Date: 2019-07-29, DOI: 10.1109/TIP.2019.2929433
Yun Zhang, Huan Zhang, Mei Yu, Sam Kwong, Yo-Sung Ho
The temporal flicker distortion is one of the most annoying noises in synthesized virtual view videos when they are rendered by compressed multi-view video plus depth in Three Dimensional (3D) video system. To assess the synthesized view video quality and further optimize the compression techniques in 3D video system, objective video quality assessment which can accurately measure the flicker distortion is highly needed. In this paper, we propose a full reference sparse representation based video quality assessment method towards synthesized 3D videos. Firstly, a synthesized video, treated as a 3D volume data with spatial (X-Y) and temporal (T) domains, is reformed and decomposed as a number of spatially neighboring temporal layers, i.e., X-T or Y-T planes. Gradient features in temporal layers of the synthesized video and strong edges of depth maps are used as key features in detecting the location of flicker distortions. Secondly, dictionary learning and sparse representation for the temporal layers are then derived and applied to effectively represent the temporal flicker distortion. Thirdly, a rank pooling method is used to pool all the temporal layer scores and obtain the score for the flicker distortion. Finally, the temporal flicker distortion measurement is combined with the conventional spatial distortion measurement to assess the quality of synthesized 3D videos. Experimental results on synthesized video quality database demonstrate our proposed method is significantly superior to other state-of-the-art methods, especially on the view synthesis distortions induced from depth videos.
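The first step described above, slicing the video volume into temporal layers, can be sketched as follows. The sketch assumes the video is stored as nested lists indexed `video[t][y][x]`; the function names are hypothetical, and the feature extraction, dictionary learning, and rank pooling steps are not shown.

```python
def xt_plane(video, y):
    """X-T temporal layer at a fixed row y: one row of pixels per frame."""
    return [list(frame[y]) for frame in video]

def yt_plane(video, x):
    """Y-T temporal layer at a fixed column x: one column of pixels per frame."""
    return [[row[x] for row in frame] for frame in video]

def temporal_gradient(plane):
    """Frame-to-frame difference inside a temporal layer.

    Large values mark abrupt intensity changes over time, which is where
    flicker would show up in such a layer.
    """
    return [
        [later - earlier for earlier, later in zip(prev_row, next_row)]
        for prev_row, next_row in zip(plane, plane[1:])
    ]
```

Each returned plane has time on its first axis, so a temporally stable region appears as vertical stripes and flicker appears as horizontal discontinuities.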
Title: Sparse Representation based Video Quality Assessment for Synthesized 3D Videos. Journal: IEEE Transactions on Image Processing.
Pub Date: 2019-07-29, DOI: 10.1109/TIP.2019.2930148
Avi Abu, Roee Diamant
The recent boost in undersea operations has led to the development of high-resolution sonar systems mounted on autonomous vehicles. These vehicles are used to scan the seafloor in search of different objects such as sunken ships, archaeological sites, and submerged mines. An important part of the detection operation is the segmentation of sonar images, where the object's highlight and shadow are distinguished from the seabed background. In this work, we focus on the automatic segmentation of sonar images. We present our enhanced fuzzy-based with kernel metric (EnFK) algorithm for the segmentation of sonar images which, in an attempt to improve segmentation accuracy, introduces two new fuzzy terms of local spatial and statistical information. Our algorithm includes a preliminary de-noising algorithm which, together with the original image, feeds into the segmentation procedure to avoid becoming trapped in local minima and to improve convergence. The result is a segmentation procedure that specifically suits the intensity inhomogeneity and the complex seabed texture of sonar images. We tested our approach using simulated images, real sonar images, and sonar images that we created in two different sea experiments, using multibeam sonar and synthetic aperture sonar. The results show accurate segmentation performance that far exceeds state-of-the-art results.
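As background for the fuzzy segmentation described above, here is a minimal sketch of plain fuzzy c-means on scalar intensities, the classical scheme that fuzzy local-information methods commonly extend. It is not the EnFK algorithm: the kernel metric, the local spatial and statistical terms, and the de-noising stage are all omitted, and the function names are hypothetical.

```python
def memberships(value, centers, m=2.0):
    """Fuzzy membership of one intensity in each cluster (fuzzifier m > 1)."""
    dists = [abs(value - c) for c in centers]
    hits = [d == 0.0 for d in dists]
    if any(hits):
        # The value sits exactly on one or more centers: share membership among them.
        share = 1.0 / sum(hits)
        return [share if hit else 0.0 for hit in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]

def update_centers(values, centers, m=2.0):
    """One fuzzy c-means iteration: recompute memberships, then the centers.

    Assumes every cluster receives some non-zero membership.
    """
    u = [memberships(v, centers, m) for v in values]
    new_centers = []
    for i in range(len(centers)):
        weights = [row[i] ** m for row in u]
        new_centers.append(sum(w * v for w, v in zip(weights, values)) / sum(weights))
    return new_centers
```

Segmenting highlight, shadow, and background would use three centers and assign each pixel to its highest-membership cluster; the memberships of any pixel always sum to one.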
Title: Enhanced Fuzzy-based Local Information Algorithm for Sonar Image Segmentation. Journal: IEEE Transactions on Image Processing.
Pub Date: 2019-07-29, DOI: 10.1109/TIP.2019.2929421
Youfa Liu, Weiping Tu, Bo Du, Lefei Zhang, Dacheng Tao
Covariate shift assumption based domain adaptation approaches usually utilize only one common transformation to align marginal distributions and make conditional distributions preserved. However, one common transformation may cause loss of useful information, such as variances and neighborhood relationship in both source and target domain. To address this problem, we propose a novel method called homologous component analysis (HCA) where we try to find two totally different but homologous transformations to align distributions with side information and make conditional distributions preserved. As it is hard to find a closed form solution to the corresponding optimization problem, we solve them by means of the alternating direction minimizing method (ADMM) in the context of Stiefel manifolds. We also provide a generalization error bound for domain adaptation in semi-supervised case and two transformations can help to decrease this upper bound more than only one common transformation does. Extensive experiments on synthetic and real data show the effectiveness of the proposed method by comparing its classification accuracy with the state-of-the-art methods and numerical evidence on chordal distance and Frobenius distance shows that resulting optimal transformations are different.
Title: Homologous Component Analysis for Domain Adaptation. Journal: IEEE Transactions on Image Processing.
Pub Date: 2019-07-26, DOI: 10.1109/TIP.2019.2930176
Xuejian Rong, Chucai Yi, Yingli Tian
Text instances provide valuable information for the understanding and interpretation of natural scenes. The rich, precise high-level semantics embodied in the text could be beneficial for understanding the world around us, and empower a wide range of real-world applications. While most recent visual phrase grounding approaches focus on general objects, this paper explores extracting designated texts and predicting an unambiguous scene text segmentation mask, i.e., scene text segmentation from natural language descriptions (referring expressions) like "orange text on a little boy in black swinging a bat". The solution of this novel problem enables accurate segmentation of scene text instances from the complex background. In our proposed framework, a unified deep network jointly models visual and linguistic information by encoding both region-level and pixel-level visual features of natural scene images into spatial feature maps, and then decodes them into a saliency response map of text instances. To conduct quantitative evaluations, we establish a new scene text referring expression segmentation dataset: COCO-CharRef. Experimental results demonstrate the effectiveness of the proposed framework on the text instance segmentation task. By combining image-based visual features with language-based textual explanations, our framework outperforms baselines that are derived from state-of-the-art text localization and natural language object retrieval methods on the COCO-CharRef dataset.
Title: Unambiguous Scene Text Segmentation with Referring Expression Comprehension. Journal: IEEE Transactions on Image Processing.
Pub Date: 2019-07-19, DOI: 10.1109/TIP.2019.2928627
Baihong Lin, Xiaoming Tao, Jianhua Lu
Deep learning has been successfully introduced for 2D-image denoising, but it is still unsatisfactory for hyperspectral image (HSI) denoising due to the unacceptable computational complexity of the end-to-end training process and the difficulty of building a universal 3D-image training dataset. In this paper, instead of developing an end-to-end deep learning denoising network, we propose a hyperspectral image denoising framework for the removal of mixed Gaussian impulse noise, in which the denoising problem is modeled as a convolutional neural network (CNN) constrained non-negative matrix factorization problem. Using proximal alternating linearized minimization, the optimization can be divided into three steps: the update of the spectral matrix, the update of the abundance matrix, and the estimation of the sparse noise. Then, we design the CNN architecture and propose two training schemes, which allow the CNN to be trained with a 2D-image dataset. Compared with state-of-the-art denoising methods, the proposed method has relatively good performance on the removal of Gaussian and mixed Gaussian impulse noise. More importantly, the proposed model needs to be trained only once on a 2D-image dataset, but can be used to denoise HSIs with different numbers of channel bands.
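The third step above, estimating the sparse (impulse) noise, can be illustrated with a small sketch. It assumes the sparse term is penalized with an L1 norm, in which case the proximal step is element-wise soft-thresholding of the residual; the abstract does not state the penalty, so this is an assumption, and the function names are hypothetical.

```python
def soft_threshold(value, threshold):
    """Proximal operator of threshold * |value|: shrink toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def estimate_sparse_noise(observed, low_rank, threshold):
    """Soft-threshold the residual observed - low_rank, element by element.

    `observed` is the noisy data matrix and `low_rank` the current product of
    the spectral and abundance matrices, both as lists of rows.
    """
    return [
        [soft_threshold(o - l, threshold) for o, l in zip(obs_row, low_row)]
        for obs_row, low_row in zip(observed, low_rank)
    ]
```

Small residuals, which are plausibly Gaussian noise, are set to zero, while large residuals are kept (shrunk by the threshold) and attributed to impulse noise.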
Title: Hyperspectral Image Denoising via Matrix Factorization and Deep Prior Regularization. Journal: IEEE Transactions on Image Processing.
Pub Date: 2019-07-19, DOI: 10.1109/TIP.2019.2928631
Zhonggui Sun, Bo Han, Jie Li, Jin Zhang, Xinbo Gao
Due to its local property, the guided image filter (GIF) generally suffers from halo artifacts near edges. To make up for this deficiency, a weighted guided image filter (WGIF) was proposed recently by incorporating an edge-aware weighting into the filtering process. It combines the advantages of local and global operations, and achieves better edge-preserving performance. However, edge direction, a vital property of the guidance image, is not fully considered in these guided filters. To overcome this drawback, we propose a novel version of GIF, which leverages the edge direction more fully. In particular, we utilize the steering kernel to adaptively learn the direction and incorporate the learning results into the filtering process to improve the filter's behavior. Theoretical analysis shows that the proposed method achieves stronger performance, preserving edges and reducing halo artifacts effectively. Similar conclusions are also reached through thorough experiments including edge-aware smoothing, detail enhancement, denoising, and dehazing.
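The baseline that both WGIF and the proposed filter build on can be sketched in one dimension. This is the plain guided filter with a fixed regularizer `eps`, written with truncated box windows; the edge-aware weighting of WGIF and the steering kernel of the proposed method are not included, and the function names are hypothetical.

```python
def box_mean(values, radius):
    """Mean over a sliding window (window is truncated at the borders)."""
    n = len(values)
    out = []
    for i in range(n):
        window = values[max(0, i - radius): min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out

def guided_filter_1d(guide, source, radius, eps):
    """Plain guided filter: fit source ~ a * guide + b in each window (eps > 0)."""
    mean_i = box_mean(guide, radius)
    mean_p = box_mean(source, radius)
    corr_ip = box_mean([i * p for i, p in zip(guide, source)], radius)
    corr_ii = box_mean([i * i for i in guide], radius)
    # Per-window linear coefficients: a = cov(guide, source) / (var(guide) + eps).
    a = [(cip - mi * mp) / (cii - mi * mi + eps)
         for cip, cii, mi, mp in zip(corr_ip, corr_ii, mean_i, mean_p)]
    b = [mp - ak * mi for mp, ak, mi in zip(mean_p, a, mean_i)]
    # Average the coefficients of all windows covering each sample.
    mean_a = box_mean(a, radius)
    mean_b = box_mean(b, radius)
    return [ma * i + mb for ma, i, mb in zip(mean_a, guide, mean_b)]
```

A single global `eps` trades smoothing against edge preservation everywhere at once, which is the limitation an edge-aware weighting is meant to address: a larger `eps` smooths flat regions more but also blurs edges and produces halos.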
Title: Weighted Guided Image Filtering with Steering Kernel. Journal: IEEE Transactions on Image Processing.
Pub Date : 2019-07-17 DOI: 10.1109/TIP.2019.2928144
Lian Zhou, Yuejie Zhang, Yugang Jiang, Tao Zhang, Weiguo Fan
Visual and semantic saliency are important in image captioning. However, single-phase image captioning benefits little from limited saliency without a saliency predictor. In this paper, a novel saliency-enhanced re-captioning framework based on two-phase learning is proposed to enhance single-phase image captioning. In the framework, visual saliency and semantic saliency are distilled from the first-phase model and fused with the second-phase model for model self-boosting. The visual saliency mechanism can generate a saliency map and a saliency mask for an image without learning a saliency map predictor. The semantic saliency mechanism sheds some light on the properties of the words in a caption whose part of speech is noun. In addition, a third type of saliency, sample saliency, is proposed to explicitly compute the saliency degree of each sample, which makes image captioning more robust. How to combine these three types of saliency for a further performance boost is also examined. Our framework can treat an image captioning model as a saliency extractor, which may benefit other captioning models and related tasks. The experimental results on both the Flickr30k and MSCOCO datasets show that the saliency-enhanced models can obtain promising performance gains.
Re-Caption: Saliency-Enhanced Image Captioning through Two-Phase Learning. IEEE Transactions on Image Processing, 2019-07-17. DOI: 10.1109/TIP.2019.2928144
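The abstract above states that a saliency map and a saliency mask are obtained without a learned saliency predictor, but it does not say how. One plausible reading, shown here purely as an assumption, is to accumulate the spatial attention grids that a first-phase captioning model produces for each generated word, normalize the sum to [0, 1], and threshold it. The function name, the input layout, and the threshold value are all hypothetical.

```python
def saliency_map_and_mask(attention_steps, threshold=0.5):
    """Build a saliency map and a binary mask from per-word attention grids.

    attention_steps: list of 2-D grids (lists of rows), one per generated
    word, all with the same shape. This input format is an assumption of
    the sketch, not something the abstract specifies.
    """
    if not attention_steps:
        raise ValueError("need at least one attention grid")
    rows, cols = len(attention_steps[0]), len(attention_steps[0][0])

    # Sum the attention each location received across all words.
    acc = [[0.0] * cols for _ in range(rows)]
    for grid in attention_steps:
        for i in range(rows):
            for j in range(cols):
                acc[i][j] += grid[i][j]

    # Min-max normalize to [0, 1]; a constant map becomes all zeros.
    flat = [v for row in acc for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        saliency = [[0.0] * cols for _ in range(rows)]
    else:
        saliency = [[(v - lo) / span for v in row] for row in acc]

    # Binary mask: 1 where the normalized saliency reaches the threshold.
    mask = [[1 if v >= threshold else 0 for v in row] for row in saliency]
    return saliency, mask


# Two hypothetical 2x2 attention grids, one per generated word.
steps = [
    [[0.1, 0.9], [0.0, 0.0]],
    [[0.2, 0.8], [0.0, 0.0]],
]
saliency, mask = saliency_map_and_mask(steps)
```

In a second phase, such a mask could be applied to the image features before captioning again; how the paper actually fuses saliency with the second-phase model is not described in the abstract.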
Pub Date : 2019-07-17 DOI: 10.1109/TIP.2019.2928126
Zhanxiang Feng, Jianhuang Lai, Xiaohua Xie
Traditional person re-identification (re-id) methods perform poorly under changing illumination. This situation can be addressed by using dual cameras that capture visible images in a bright environment and infrared images in a dark environment. Yet this scheme needs to solve the visible-infrared matching problem, which is largely under-studied. Matching pedestrians across heterogeneous modalities is extremely challenging because of their different visual characteristics. In this paper, we propose a novel framework that employs modality-specific networks to tackle the heterogeneous matching problem. The proposed framework utilizes the modality-related information and extracts modality-specific representations (MSR) by constructing an individual network for each modality. In addition, a cross-modality Euclidean constraint is introduced to narrow the gap between the different networks. We also integrate modality-shared layers into the modality-specific networks to extract shareable information and use a modality-shared identity loss to facilitate the extraction of modality-invariant features. Then a modality-specific discriminant metric is learned for each domain to strengthen the discriminative power of MSR. Finally, we use a view classifier to learn view information. The experiments demonstrate that MSR effectively improves the performance of deep networks on VI-REID and remarkably outperforms the state-of-the-art methods.
Learning Modality-Specific Representations for Visible-Infrared Person Re-Identification. IEEE Transactions on Image Processing, 2019-07-17. DOI: 10.1109/TIP.2019.2928126
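The abstract above mentions a cross-modality Euclidean constraint that narrows the gap between the visible and infrared networks, without giving its exact form. A minimal sketch, assuming the constraint is the mean squared Euclidean distance between paired feature vectors of the same identity, might look as follows. The function name and the pairing of features are assumptions for illustration, not the authors' formulation.

```python
def cross_modality_euclidean(visible_feats, infrared_feats):
    """Mean squared Euclidean distance between paired feature vectors.

    visible_feats[i] and infrared_feats[i] are assumed to be features of
    the same person extracted by the visible and infrared networks.
    """
    if not visible_feats or len(visible_feats) != len(infrared_feats):
        raise ValueError("need equally sized, non-empty feature lists")
    total = 0.0
    for v, t in zip(visible_feats, infrared_feats):
        if len(v) != len(t):
            raise ValueError("paired features must have the same dimension")
        total += sum((a - b) ** 2 for a, b in zip(v, t))
    return total / len(visible_feats)


# Two hypothetical identity pairs with 2-D features.
visible = [[1.0, 2.0], [0.0, 0.0]]
infrared = [[1.0, 0.0], [3.0, 4.0]]
penalty = cross_modality_euclidean(visible, infrared)
```

In training, a term like this would typically be added to the identity loss with a weighting factor, so that minimizing it pulls the two modality-specific embeddings of the same person together.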