Pub Date: 2024-09-07 | DOI: 10.1016/j.image.2024.117198
Mohammad Mahdi Afrasiabi, Reshad Hosseini, Aliazam Abbasfar
Super-resolution is the process of obtaining a high-resolution (HR) image from one or more low-resolution (LR) images. Single image super-resolution (SISR) deals with one LR image, while multi-frame super-resolution (MFSR) employs several LR images to reach the HR output. The MFSR pipeline consists of alignment, fusion, and reconstruction. We conduct a theoretical analysis using sparse coding (SC) and the iterative shrinkage-thresholding algorithm (ISTA) to fill the gap in mathematical justification for the execution order of the optimal MFSR pipeline. Our analysis recommends executing alignment and fusion before the reconstruction stage (whether reconstruction is performed by deconvolution or by SISR techniques). The suggested order ensures enhanced performance in terms of peak signal-to-noise ratio and structural similarity index. The optimal pipeline also reduces computational complexity compared with intuitive approaches that apply SISR to each input LR image. We also demonstrate the usefulness of SC in the analysis of computer vision tasks such as MFSR, leveraging the sparsity assumption of natural images. Simulation results support the findings of our theoretical analysis, both quantitatively and qualitatively.
{"title":"A novel theoretical analysis on optimal pipeline of multi-frame image super-resolution using sparse coding","authors":"Mohammad Mahdi Afrasiabi, Reshad Hosseini, Aliazam Abbasfar","doi":"10.1016/j.image.2024.117198","DOIUrl":"10.1016/j.image.2024.117198","url":null,"abstract":"<div><p>Super-resolution is the process of obtaining a high-resolution (HR) image from one or more low-resolution (LR) images. Single image super-resolution (SISR) deals with one LR image while multi-frame super-resolution (MFSR) employs several LR ones to reach the HR output. MFSR pipeline consists of alignment, fusion, and reconstruction. We conduct a theoretical analysis using sparse coding (SC) and iterative shrinkage-thresholding algorithm to fill the gap of mathematical justification in the execution order of the optimal MFSR pipeline. Our analysis recommends executing alignment and fusion before the reconstruction stage (whether through deconvolution or SISR techniques). The suggested order ensures enhanced performance in terms of peak signal-to-noise ratio and structural similarity index. The optimal pipeline also reduces computational complexity compared to intuitive approaches that apply SISR to each input LR image. Also, we demonstrate the usefulness of SC in analysis of computer vision tasks such as MFSR, leveraging the sparsity assumption in natural images. Simulation results support the findings of our theoretical analysis, both quantitatively and qualitatively.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117198"},"PeriodicalIF":3.4,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142172531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06 | DOI: 10.1016/j.image.2024.117200
Yuanyuan Li, Zetian Mi, Peng Lin, Xianping Fu
Numerous underwater image enhancement methods have been proposed to correct color and enhance contrast. Although these methods achieve satisfactory enhancement results in some respects, few take into account the effect of the raw image's illumination distribution on the enhancement results, often leading to oversaturation or undersaturation. To solve these problems, an underwater image enhancement network guided by a brightness mask with multi-attention embedding, called BMGMANet, is designed. Specifically, considering that different regions in underwater images have different degrees of degradation, which can be implicitly reflected by a brightness mask characterizing the image illumination distribution, a decoder network guided by a reverse brightness mask is designed to enhance the dark regions while suppressing excessive enhancement of the bright regions. In addition, a triple-attention module is designed to further enhance the contrast of the underwater image and recover more details. Extensive comparative experiments demonstrate that the enhancement results of our network outperform those of existing state-of-the-art methods. Furthermore, additional experiments prove that BMGMANet can effectively enhance non-uniformly illuminated underwater images and improve the performance of salient object detection in underwater images.
{"title":"Underwater image enhancement via brightness mask-guided multi-attention embedding","authors":"Yuanyuan Li, Zetian Mi, Peng Lin, Xianping Fu","doi":"10.1016/j.image.2024.117200","DOIUrl":"10.1016/j.image.2024.117200","url":null,"abstract":"<div><p>Numerous new underwater image enhancement methods have been proposed to correct color and enhance the contrast. Although these methods have achieved satisfactory enhancement results in some respects, few have taken into account the effect of the raw image illumination distribution on the enhancement results, often leading to oversaturation or undersaturation. To solve these problems, an underwater image enhancement network guided by brightness mask with multi-attention embedding, called BMGMANet, is designed. Specifically, considering that different regions in the underwater images have different degradation degrees, which can be implicitly reflected by a brightness mask characterizing the image illumination distribution, a decoder network guided by a reverse brightness mask is designed to enhance the dark regions while suppressing excessive enhancement of the bright regions. In addition, a triple-attention module is designed to further enhance the contrast of the underwater image and recover more details. Extensive comparative experiments demonstrate that the enhancement results of our network outperform those of existing state-of-the-art methods. Furthermore, additional experiments also prove that our BMGMANet can effectively enhance the non-uniform illumination underwater images and improve the performance of saliency object detection in underwater images.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"130 ","pages":"Article 117200"},"PeriodicalIF":3.4,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142167794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-09-06 | DOI: 10.1016/j.image.2024.117187
Alireza Esmaeilzehi, Morteza Mirzaei, Hossein Zaredar, Dimitrios Hatzinakos, M. Omair Ahmad
In recent years, numerous efficient schemes that employ deep neural networks have been developed for the task of image hashing. However, little attention has been paid to enhancing the performance and robustness of these deep hashing networks when the input images do not possess high spatial resolution and visual quality. This is a critical problem, as access to high-quality, high-resolution images is often not guaranteed in real-life applications. In this paper, we propose a novel method for the task of joint image upsampling and hashing that uses a three-stage design. Specifically, in the first two stages of the proposed scheme, we obtain two deep neural networks, individually trained for image super-resolution and image hashing, respectively. We then fine-tune the two networks thus obtained using the ideas of representation learning and an alternating optimization process, in order to produce a set of optimal parameters for the task of joint image upsampling and hashing. The effectiveness of the various ideas used in designing the proposed method is demonstrated through different experiments. It is shown that the proposed scheme outperforms state-of-the-art image super-resolution and hashing methods, even when they are trained simultaneously in a joint end-to-end manner.
{"title":"DJUHNet: A deep representation learning-based scheme for the task of joint image upsampling and hashing","authors":"Alireza Esmaeilzehi , Morteza Mirzaei , Hossein Zaredar , Dimitrios Hatzinakos , M. Omair Ahmad","doi":"10.1016/j.image.2024.117187","DOIUrl":"10.1016/j.image.2024.117187","url":null,"abstract":"<div><p>In recent years, numerous efficient schemes that employ deep neural networks have been developed for the task of image hashing. However, not much attention is paid to enhancing the performance and robustness of these deep hashing networks, when the input images do not possess high spatial resolution and visual quality. This is a critical problem, as often accessing high-quality high-resolution images is not guaranteed in real-life applications. In this paper, we propose a novel method for the task of joint image upsampling and hashing, that uses a three-stage design. Specifically, in the first two stages of the proposed scheme, we obtain two deep neural networks, each of which is individually trained for the task of image super resolution and image hashing, respectively. We then fine-tune the two deep networks thus obtained by using the ideas of representation learning and alternating optimization process, in order to produce a set of optimal parameters for the task of joint image upsampling and hashing. The effectiveness of the various ideas utilized for designing the proposed method is demonstrated by performing different experimentations. It is shown that the proposed scheme is able to outperform the state-of-the-art image super resolution and hashing methods, even when they are trained simultaneously in a joint end-to-end manner.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"129 ","pages":"Article 117187"},"PeriodicalIF":3.4,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142150002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-30 | DOI: 10.1016/j.image.2024.117190
Falah Jabar, João Ascenso, Maria Paula Queluz
To render a spherical (360° or omnidirectional) image on planar displays, a 2D image - called a viewport - must be obtained by projecting a sphere region onto a plane, according to the user's viewing direction and a predefined field of view (FoV). However, any sphere-to-plane projection introduces geometric distortions, such as object stretching and/or bending of straight lines, whose intensity increases with the considered FoV. In this paper, a fully automatic content-aware projection is proposed, aiming to reduce the geometric distortions when high FoVs are used. This new projection is based on the Pannini projection, whose parameters are first globally optimized according to the image content, followed by a local conformality improvement of relevant viewport objects. A crowdsourcing subjective test showed that the proposed projection is the preferred solution among the considered state-of-the-art sphere-to-plane projections, producing viewports with a more pleasant visual quality.
{"title":"Globally and locally optimized Pannini projection for high FoV rendering of 360° images","authors":"Falah Jabar, João Ascenso, Maria Paula Queluz","doi":"10.1016/j.image.2024.117190","DOIUrl":"10.1016/j.image.2024.117190","url":null,"abstract":"<div><p>To render a spherical (360° or omnidirectional) image on planar displays, a 2D image - called as viewport - must be obtained by projecting a sphere region on a plane, according to the user's viewing direction and a predefined field of view (FoV). However, any sphere to plan projection introduces geometric distortions, such as object stretching and/or bending of straight lines, which intensity increases with the considered FoV. In this paper, a fully automatic content-aware projection is proposed, aiming to reduce the geometric distortions when high FoVs are used. This new projection is based on the Pannini projection, whose parameters are firstly globally optimized according to the image content, followed by a local conformality improvement of relevant viewport objects. A crowdsourcing subjective test showed that the proposed projection is the most preferred solution among the considered state-of-the-art sphere to plan projections, producing viewports with a more pleasant visual quality.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"129 ","pages":"Article 117190"},"PeriodicalIF":3.4,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0923596524000912/pdfft?md5=1ff2da4c676f5e3a19cdbe6c4c5f6989&pid=1-s2.0-S0923596524000912-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142136394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-21 | DOI: 10.1016/j.image.2024.117186
Yadang Chen, Xinyu Xu, Chenchen Wei, Chuhan Lu
Few-shot segmentation was proposed to obtain segmentation results for an image with an unseen class by referring to a few labeled samples. However, due to the limited number of samples, many few-shot segmentation models suffer from poor generalization. Prototypical network-based few-shot segmentation still has issues with spatial inconsistency and prototype bias. Since the target class appears differently in each image, some specific features in the prototypes generated from the support image and its mask do not accurately reflect the generalized features of the target class. To address the support prototype consistency issue, we put forward two modules: Data Augmentation Self-knowledge Distillation (DASKD) and Prototype-wise Regularization (PWR). The DASKD module focuses on enhancing spatial consistency by using data augmentation and self-knowledge distillation. Self-knowledge distillation helps the model acquire generalized features of the target class and learn hidden knowledge from the support images. The PWR module focuses on obtaining a more representative support prototype by applying a prototype-level loss that draws support prototypes closer to the category center. Broad evaluation experiments on PASCAL-5^i and COCO-20^i demonstrate that our model outperforms prior works on few-shot segmentation. Our approach surpasses the state of the art by 7.5% on PASCAL-5^i and 4.2% on COCO-20^i.
{"title":"Prototype-wise self-knowledge distillation for few-shot segmentation","authors":"Yadang Chen , Xinyu Xu , Chenchen Wei , Chuhan Lu","doi":"10.1016/j.image.2024.117186","DOIUrl":"10.1016/j.image.2024.117186","url":null,"abstract":"<div><p>Few-shot segmentation was proposed to obtain segmentation results for a image with an unseen class by referring to a few labeled samples. However, due to the limited number of samples, many few-shot segmentation models suffer from poor generalization. Prototypical network-based few-shot segmentation still has issues with spatial inconsistency and prototype bias. Since the target class has different appearance in each image, some specific features in the prototypes generated from the support image and its mask do not accurately reflect the generalized features of the target class. To address the support prototype consistency issue, we put forward two modules: Data Augmentation Self-knowledge Distillation (DASKD) and Prototype-wise Regularization (PWR). The DASKD module focuses on enhancing spatial consistency by using data augmentation and self-knowledge distillation. Self-knowledge distillation helps the model acquire generalized features of the target class and learn hidden knowledge from the support images. The PWR module focuses on obtaining a more representative support prototype by conducting prototype-level loss to obtain support prototypes closer to the category center. Broad evaluation experiments on PASCAL-<span><math><msup><mrow><mn>5</mn></mrow><mrow><mi>i</mi></mrow></msup></math></span> and COCO-<span><math><mrow><mn>2</mn><msup><mrow><mn>0</mn></mrow><mrow><mi>i</mi></mrow></msup></mrow></math></span> demonstrate that our model outperforms the prior works on few-shot segmentation. Our approach surpasses the state of the art by 7.5% in PASCAL-<span><math><msup><mrow><mn>5</mn></mrow><mrow><mi>i</mi></mrow></msup></math></span> and 4.2% in COCO-<span><math><mrow><mn>2</mn><msup><mrow><mn>0</mn></mrow><mrow><mi>i</mi></mrow></msup></mrow></math></span>.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"129 ","pages":"Article 117186"},"PeriodicalIF":3.4,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142077049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-21 | DOI: 10.1016/j.image.2024.117194
Yan-Lin Chen, Chun-Liang Lin, Yu-Chen Lin, Tzu-Chun Chen
Object recognition in computer vision has been a popular research field in recent years. Although the detection success rate for regular objects has reached impressive levels, small object detection (SOD) is still a challenging issue. In the Microsoft Common Objects in Context (MS COCO) public dataset, the detection rate of small objects is typically half that of regular-sized objects. The main reason is that small objects are often affected by multi-layer convolution and pooling, leaving insufficient detail to distinguish them from the background or similar objects, which results in poor recognition rates or even no detections. This paper presents a network architecture, Transformer-CNN, that combines a self-attention-based transformer and a convolutional neural network (CNN) to improve the recognition rate of SOD. It captures global information through the transformer and uses the translation invariance and translation equivariance of the CNN to maximize the retention of global and local features while improving the reliability and robustness of SOD. Our experiments show that the proposed model improves the small object recognition rate by 2-5% compared with general transformer architectures.
{"title":"Transformer-CNN for small image object detection","authors":"Yan-Lin Chen , Chun-Liang Lin , Yu-Chen Lin , Tzu-Chun Chen","doi":"10.1016/j.image.2024.117194","DOIUrl":"10.1016/j.image.2024.117194","url":null,"abstract":"<div><p>Object recognition in computer vision technology has been a popular research field in recent years. Although the detection success rate of regular objects has achieved impressive results, small object detection (SOD) is still a challenging issue. In the Microsoft Common Objects in Context (MS COCO) public dataset, the detection rate of small objects is typically half that of regular-sized objects. The main reason is that small objects are often affected by multi-layer convolution and pooling, leading to insufficient details to distinguish them from the background or similar objects, resulting in poor recognition rates or even no results. This paper presents a network architecture, Transformer-CNN, that combines a self-attention mechanism-based transformer and a convolutional neural network (CNN) to improve the recognition rate of SOD. It captures global information through a transformer and uses the translation invariance and translation equivalence of CNN to maximize the retention of global and local features while improving the reliability and robustness of SOD. Our experiments show that the proposed model improves the small object recognition rate by 2∼5 % than the general transformer architectures.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"129 ","pages":"Article 117194"},"PeriodicalIF":3.4,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142044684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-17 | DOI: 10.1016/j.image.2024.117195
Zhonghao Chang, Xiao Li, Zihao Zhao
The Generalized Category Discovery (GCD) task involves transferring knowledge from labeled known categories to recognize both known and novel categories within an unlabeled dataset. A significant challenge arises from the lack of prior information about novel categories. To address this, we develop a feature extractor that can learn discriminative features for both known and novel categories. Our approach leverages the observation that similar samples often belong to the same class. We construct a similarity matrix and employ a similarity contrastive loss to increase the similarity between similar samples in the feature space. Additionally, we incorporate cluster labels to further refine the feature extractor, utilizing K-means clustering to assign these labels to unlabeled data and thereby provide valuable supervision. Our feature extractor is optimized through instance-level and class-level contrastive learning constraints. These constraints promote similarity maximization in both the instance space and the label space for instances sharing the same pseudo-labels. These three components complement each other, facilitating the learning of discriminative representations for both known and novel categories. Through comprehensive evaluations on generic image recognition datasets and challenging fine-grained datasets, we demonstrate that our proposed method achieves state-of-the-art performance.
{"title":"Feature extractor optimization for discriminative representations in Generalized Category Discovery","authors":"Zhonghao Chang, Xiao Li, Zihao Zhao","doi":"10.1016/j.image.2024.117195","DOIUrl":"10.1016/j.image.2024.117195","url":null,"abstract":"<div><p>Generalized Category Discovery (GCD) task involves transferring knowledge from labeled known categories to recognize both known and novel categories within an unlabeled dataset. A significant challenge arises from the lack of prior information for novel categories. To address this, we develop a feature extractor that can learn discriminative features for both known and novel categories. Our approach leverages the observation that similar samples often belong to the same class. We construct a similarity matrix and employ similarity contrastive loss to increase the similarity between similar samples in the feature space. Additionally, we incorporate cluster labels to further refine the feature extractor, utilizing K-means clustering to assign these labels to unlabeled data, providing valuable supervision. Our feature extractor is optimized through the utilization of instance-level contrastive learning and class-level contrastive learning constraints. These constraints promote similarity maximization in both the instance space and the label space for instances sharing the same pseudo-labels. These three components complement each other, facilitating the learning of discriminative representations for both known and novel categories. Through comprehensive evaluations of generic image recognition datasets and challenging fine-grained datasets, we demonstrate that our proposed method achieves state-of-the-art performance.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"129 ","pages":"Article 117195"},"PeriodicalIF":3.4,"publicationDate":"2024-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142020716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-16 | DOI: 10.1016/j.image.2024.117189
Tasin Islam, Alina Miron, Xiaohui Liu, Yongmin Li
We introduce a novel image-based virtual try-on model designed to replace a candidate's garment with a desired target item. The proposed model comprises three modules: segmentation, garment warping, and candidate-clothing fusion. Previous methods have shown limitations in cases involving significant differences between the original and target clothing, as well as substantial overlapping of body parts. Our model addresses these limitations by employing two key strategies. First, it utilises a candidate representation based on an RGB skeleton image to enhance spatial relationships among body parts, resulting in robust segmentation and improved occlusion handling. Second, a truncated U-Net is employed in both the segmentation and warping modules, improving segmentation performance and accelerating the try-on process. The warping module leverages an efficient affine transform for ease of training. Comparative evaluations against state-of-the-art models demonstrate the competitive performance of our proposed model across various scenarios, particularly in handling occlusions and large differences between garments. This research presents a promising solution for image-based virtual try-on, advancing the field by overcoming key limitations and achieving superior performance.
{"title":"Image-based virtual try-on: Fidelity and simplification","authors":"Tasin Islam, Alina Miron, Xiaohui Liu, Yongmin Li","doi":"10.1016/j.image.2024.117189","DOIUrl":"10.1016/j.image.2024.117189","url":null,"abstract":"<div><p>We introduce a novel image-based virtual try-on model designed to replace a candidate’s garment with a desired target item. The proposed model comprises three modules: segmentation, garment warping, and candidate-clothing fusion. Previous methods have shown limitations in cases involving significant differences between the original and target clothing, as well as substantial overlapping of body parts. Our model addresses these limitations by employing two key strategies. Firstly, it utilises a candidate representation based on an RGB skeleton image to enhance spatial relationships among body parts, resulting in robust segmentation and improved occlusion handling. Secondly, truncated U-Net is employed in both the segmentation and warping modules, enhancing segmentation performance and accelerating the try-on process. The warping module leverages an efficient affine transform for ease of training. Comparative evaluations against state-of-the-art models demonstrate the competitive performance of our proposed model across various scenarios, particularly excelling in handling occlusion cases and significant differences in clothing cases. This research presents a promising solution for image-based virtual try-on, advancing the field by overcoming key limitations and achieving superior performance.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"129 ","pages":"Article 117189"},"PeriodicalIF":3.4,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0923596524000900/pdfft?md5=d7b74bcca8966cd1d3e0e38fa30c8482&pid=1-s2.0-S0923596524000900-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142049928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-10 | DOI: 10.1016/j.image.2024.117192
Jing Liu, Xin Li, Jiaqi Zhang, Guangtao Zhai, Yuting Su, Yuyi Zhang, Bo Wang
Micro-expressions (MEs) are unconscious, instantaneous, and slight facial movements that reveal people's true emotions. Locating MEs is a prerequisite for classifying them, yet only a few studies focus on this task. Among them, sliding window based methods are the most prevalent. Due to differences in individual physiological and psychological mechanisms, as well as some uncontrollable factors, the durations and transition modes of different MEs fluctuate greatly. Limited to a fixed window scale and mode, traditional sliding window based ME spotting methods fail to capture the motion changes of all MEs exactly, resulting in performance degradation. In this paper, an ensemble learning based duration- and mode-aware (DMA) ME spotting framework is proposed. Specifically, we exploit multiple sliding windows of different scales and modes to generate multiple weak detectors, each of which accommodates MEs with a certain duration and transition mode. Additionally, to obtain a more comprehensive strong detector, we integrate the analysis results of the weak detectors using a voting-based aggregation module. Furthermore, a novel interval generation scheme is designed to merge close peaks and their neighboring frames into a complete ME interval. Experimental results on two long-video databases show the promising performance of our proposed DMA framework compared with state-of-the-art methods. The code is available at https://github.com/TJUMMG/DMA-ME-Spotting.
{"title":"Duration-aware and mode-aware micro-expression spotting for long video sequences","authors":"Jing Liu , Xin Li , Jiaqi Zhang , Guangtao Zhai , Yuting Su , Yuyi Zhang , Bo Wang","doi":"10.1016/j.image.2024.117192","DOIUrl":"10.1016/j.image.2024.117192","url":null,"abstract":"<div><p>Micro-expressions (MEs) are unconscious, instant and slight facial movements, revealing people’s true emotions. Locating MEs is a prerequisite of classifying them, while only a few researches focus on this task. Among them, sliding window based methods are the most prevalent. Due to the differences of individual physiological and psychological mechanisms, and some uncontrollable factors, the durations and transition modes of different MEs fluctuate greatly. Limited to fixed window scale and mode, traditional sliding window based ME spotting methods fail to capture the motion changes of all MEs exactly, resulting in performance degradation. In this paper, an ensemble learning based duration & mode-aware (DMA) ME spotting framework is proposed. Specifically, we exploit multiple sliding windows of different scales and modes to generate multiple weak detectors, each of which accommodates to MEs with certain duration and transition mode. Additionally, to get a more comprehensive strong detector, we integrate the analysis results of multiple weak detectors using a voting based aggregation module. Furthermore, a novel interval generation scheme is designed to merge close peaks and their neighbor frames into a complete ME interval. Experimental results on two long video databases show the promising performance of our proposed DMA framework compared with state-of-the-art methods. The codes are available at <span><span>https://github.com/TJUMMG/DMA-ME-Spotting</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"129 ","pages":"Article 117192"},"PeriodicalIF":3.4,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142049929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-10 | DOI: 10.1016/j.image.2024.117193
Jingfei He, Zezhong Yang, Xunan Zheng, Xiaoyue Zhang, Ao Li
Recently, low-rank tensor completion methods based on tensor train (TT) rank have achieved promising performance. Ket augmentation (KA) is commonly used in TT rank-based methods to improve performance by converting low-dimensional tensors into higher-dimensional ones. However, KA also destroys the original structure and image continuity of the low-dimensional tensors, causing block artifacts. To tackle this issue, a low-rank tensor completion method based on TT rank with tensor augmentation by partially overlapped sub-blocks (TAPOS) and total variation (TV) is proposed in this paper. The proposed TAPOS preserves the image continuity of the original tensor and enhances the low-rankness of the generated higher-dimensional tensors, and a weighted de-augmentation method is used to assign different weights to the elements of the sub-blocks, further reducing block artifacts. To further alleviate block artifacts and improve reconstruction accuracy, TV is introduced into the TAPOS-based model to add a piecewise-smooth prior. A parallel matrix decomposition method is introduced to estimate the TT rank and reduce the computational cost. Numerical experiments show that the proposed method outperforms existing state-of-the-art methods.
{"title":"Low-rank tensor completion based on tensor train rank with partially overlapped sub-blocks and total variation","authors":"Jingfei He, Zezhong Yang, Xunan Zheng, Xiaoyue Zhang, Ao Li","doi":"10.1016/j.image.2024.117193","DOIUrl":"10.1016/j.image.2024.117193","url":null,"abstract":"<div><p>Recently, the low-rank tensor completion method based on tensor train (TT) rank has achieved promising performance. Ket augmentation (KA) is commonly used in TT rank-based methods to improve the performance by converting low-dimensional tensors to higher-dimensional tensors. However, block artifacts are caused since KA also destroys the original structure and image continuity of original low-dimensional tensors. To tackle this issue, a low-rank tensor completion method based on TT rank with tensor augmentation by partially overlapped sub-blocks (TAPOS) and total variation (TV) is proposed in this paper. The proposed TAPOS preserves the image continuity of the original tensor and enhances the low-rankness of the generated higher-dimensional tensors, and a weighted de-augmentation method is used to assign different weights to the elements of sub-blocks and further reduce the block artifacts. To further alleviate the block artifacts and improve reconstruction accuracy, TV is introduced in the TAPOS-based model to add the piecewise smooth prior. The parallel matrix decomposition method is introduced to estimate the TT rank to reduce the computational cost. Numerical experiments show that the proposed method outperforms the existing state-of-the-art methods.</p></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"129 ","pages":"Article 117193"},"PeriodicalIF":3.4,"publicationDate":"2024-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142049912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}