X-CDNet: A real-time crosswalk detector based on YOLOX
Pub Date: 2024-06-01 | DOI: 10.1016/j.jvcir.2024.104206
Xingyuan Lu, Yanbing Xue, Zhigang Wang, Haixia Xu, Xianbin Wen
As urban traffic safety becomes increasingly important, real-time crosswalk detection plays a critical role in the transportation field. However, existing crosswalk detection algorithms still need improvement in both accuracy and speed. This study proposes a real-time crosswalk detector called X-CDNet based on YOLOX. Building on the ConvNeXt basic module, we designed a new basic module called Reparameterizable Sparse Large-Kernel (RepSLK) convolution, which expands the model's receptive field without adding extra inference time. In addition, we created a new crosswalk dataset called CD9K, which is based on realistic driving scenes augmented by techniques such as synthetic rain and fog. The experimental results demonstrate that X-CDNet outperforms YOLOX in terms of both detection accuracy and speed, achieving 93.3 AP50 at a real-time detection speed of 123 FPS.
{"title":"X-CDNet: A real-time crosswalk detector based on YOLOX","authors":"Xingyuan Lu, Yanbing Xue, Zhigang Wang, Haixia Xu, Xianbin Wen","doi":"10.1016/j.jvcir.2024.104206","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104206","url":null,"abstract":"<div><p>As urban traffic safety becomes increasingly important, real-time crosswalk detection is playing a critical role in the transportation field. However, existing crosswalk detection algorithms must be improved in terms of accuracy and speed. This study proposes a real-time crosswalk detector called X-CDNet based on YOLOX. Based on the ConvNeXt basic module, we designed a new basic module called <strong>Rep</strong>arameterizable <strong>S</strong>parse <strong>L</strong>arge-<strong>K</strong>ernel (RepSLK) convolution that can be used to expand the model’s receptive field without the addition of extra inference time. In addition, we created a new crosswalk dataset called CD9K, which is based on realistic driving scenes augmented by techniques such as synthetic rain and fog. The experimental results demonstrate that X-CDNet outperforms YOLOX in terms of both detection accuracy and speed. X-CDNet achieves a 93.3 AP50 and a real-time detection speed of 123 FPS.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141483922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shift-insensitive perceptual feature of quadratic sum of gradient magnitude and LoG signals for image quality assessment and image classification
Pub Date: 2024-06-01 | DOI: 10.1016/j.jvcir.2024.104215
Congmin Chen, Xuanqin Mou
Most existing full-reference (FR) image quality assessment (IQA) models work on the premise that the two images are well registered. Shifting an image can lead to an inaccurate evaluation of image quality, because small spatial shifts are far less noticeable to human observers than structural distortion. In this regard, we propose to study an IQA feature that is shift-insensitive with respect to the basic primitive structure of images, i.e., image edges. According to previous studies, the image gradient magnitude (GM) and the Laplacian of Gaussian (LoG) operator, which depict the edge profiles of natural images, are highly efficient structural features in IQA tasks. In this paper, we find that the quadratic sum of the normalized GM and LoG signals (QGL) has an excellent shift-insensitive property in representing image edges, after theoretically solving the selection of a ratio parameter that balances the GM and LoG signals. Based on the proposed QGL feature, two FR-IQA models can be built directly by measuring the similarity map with mean and standard deviation pooling strategies, named mQGL and sQGL, respectively. Experimental results show that the proposed sQGL and mQGL work robustly on four benchmark IQA databases, and QGL-based models remain largely insensitive to spatial translation and image rotation when judging image quality. In addition, we explore the feasibility of combining the QGL feature with deep neural networks and verify that it can help promote image pattern recognition in texture classification tasks.
{"title":"Shift-insensitive perceptual feature of quadratic sum of gradient magnitude and LoG signals for image quality assessment and image classification","authors":"Congmin Chen, Xuanqin Mou","doi":"10.1016/j.jvcir.2024.104215","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104215","url":null,"abstract":"<div><p>Most existing full-reference (FR) Image quality assessment (IQA) models work in the premise of that the two images should be well registered. Shifting an image would lead to an inaccurate evaluation of image quality, because small spatial shifts are far less noticeable than structural distortion for human observers. To this regard, we propose to study an IQA feature that is shift-insensitive to the basic primitive structure of images, i.e., image edge. According to previous studies, the image gradient magnitude (GM) and the Laplacian of Gaussian (LoG) operator that depict the edge profiles of natural images are highly efficient structural features in IQA tasks. In this paper, we find that the Quadratic sum of the normalized GM and the LoG signals (QGL) has excellent shift-insensitive property in representing image edges after theoretically solving the selection problem of a ratio parameter to balance the GM and LoG signals. Based on the proposed QGL feature, two FR-IQA models can be built directly by measuring the similarity map with mean and standard deviation pooling strategies, named mQGL and sQGL, respectively. Experimental results show that the proposed sQGL and mQGL work robustly on four benchmark IQA databases, and QGL-based models show great shift-insensitive property to spatial translation and image rotation while judging the image quality. In addition, we explore the feasibility of combining QGL feature with deep neural networks, and verify that it can help to promote image pattern recognition in texture classification tasks.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141483920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MCT-VHD: Multi-modal contrastive transformer for video highlight detection
Pub Date: 2024-05-01 | DOI: 10.1016/j.jvcir.2024.104162
Yinhui Jiang, Sihui Luo, Lijun Guo, Rong Zhang
Autonomous highlight detection aims to identify the most captivating moments in a video, which is crucial for enhancing the efficiency of video editing and browsing on social media platforms. However, current efforts primarily focus on visual elements and often overlook other modalities, such as text information that could provide valuable semantic signals. To overcome this limitation, we propose a Multi-modal Contrastive Transformer for Video Highlight Detection (MCT-VHD). This transformer-based network mainly utilizes video and audio modalities, along with auxiliary text features when they are available, for video highlight detection. Specifically, we enhance the temporal connections within the video by integrating a convolution-based local enhancement module into the transformer blocks. Furthermore, we explore three multi-modal fusion strategies to improve highlight inference performance and employ a contrastive objective to facilitate interactions between different modalities. Comprehensive experiments conducted on three benchmark datasets validate the effectiveness of MCT-VHD, and our ablation studies provide valuable insights into its essential components.
{"title":"MCT-VHD: Multi-modal contrastive transformer for video highlight detection","authors":"Yinhui Jiang, Sihui Luo, Lijun Guo, Rong Zhang","doi":"10.1016/j.jvcir.2024.104162","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104162","url":null,"abstract":"<div><p>Autonomous highlight detection aims to identify the most captivating moments in a video, which is crucial for enhancing the efficiency of video editing and browsing on social media platforms. However, current efforts primarily focus on visual elements and often overlook other modalities, such as text information that could provide valuable semantic signals. To overcome this limitation, we propose a Multi-modal Contrastive Transformer for Video Highlight Detection (MCT-VHD). This transformer-based network mainly utilizes video and audio modalities, along with auxiliary text features (if exist) for video highlight detection. Specifically, We enhance the temporal connections within the video by integrating a convolution-based local enhancement module into the transformer blocks. Furthermore, we explore three multi-modal fusion strategies to improve highlight inference performance and employ a contrastive objective to facilitate interactions between different modalities. Comprehensive experiments conducted on three benchmark datasets validate the effectiveness of MCT-VHD, and our ablation studies provide valuable insights into its essential components.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140843865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reversible data hiding with automatic contrast enhancement for color images
Pub Date: 2024-05-01 | DOI: 10.1016/j.jvcir.2024.104181
Libo Han, Yanzhao Ren, Sha Tao, Xinfeng Zhang, Wanlin Gao
Automatic contrast enhancement (ACE) is a technique that automatically enhances image contrast. Reversible data hiding (RDH) with ACE (ACERDH) can achieve ACE while hiding data. However, some methods that perform well on color images suffer from insufficient enhancement. Therefore, an ACERDH method based on enhancement of the R, G, B, and V channels is proposed. First, histogram shifting with contrast control is proposed to enhance the R, G, and B channels; it prevents contrast degradation and keeps histogram shifting from stopping prematurely. Then, the V channel is enhanced. Since some non-ACE RDH methods that enhance the V channel well have a low level of automation, histogram shifting with brightness control, which realizes ACE effectively, is proposed; it avoids over-enhancement by controlling the brightness. Experimental results verify that the proposed method improves image quality and embedding capability better than some state-of-the-art methods.
{"title":"Reversible data hiding with automatic contrast enhancement for color images","authors":"Libo Han , Yanzhao Ren , Sha Tao , Xinfeng Zhang , Wanlin Gao","doi":"10.1016/j.jvcir.2024.104181","DOIUrl":"10.1016/j.jvcir.2024.104181","url":null,"abstract":"<div><p>Automatic contrast enhancement (ACE) is a technique that can automatically enhance the image contrast. Reversible data hiding (RDH) with ACE (ACERDH) can achieve ACE while hiding data. However, some methods with good performance for color images suffer from insufficient enhancement. Therefore, an ACERDH method based on the R, G, B, and V channels enhancement is proposed. First, histogram shifting with contrast control is proposed to enhance the R, G, and B channels. It can prevent contrast degradation and histogram shifting from stopping prematurely. Then, the V channel is enhanced. Since some RDH methods with non-ACE that can well enhance the V channel have a low automation level, histogram shifting with brightness control that can realize ACE very well is proposed. It can effectively avoid over-enhancement by controlling the brightness. Experimental results verify that the proposed method improves the image quality and embedding capability better than some state-of-the-art methods.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141055379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A self-supervised image aesthetic assessment combining masked image modeling and contrastive learning
Pub Date: 2024-05-01 | DOI: 10.1016/j.jvcir.2024.104184
Shuai Yang, Zibei Wang, Guangao Wang, Yongzhen Ke, Fan Qin, Jing Guo, Liming Chen
Learning more abundant image features helps improve performance on the image aesthetic assessment task. Masked Image Modeling (MIM), implemented on top of the Vision Transformer (ViT), learns pixel-level features while reconstructing images. Contrastive learning pulls features of the same image together while pushing features of different images apart in the feature space, thereby learning high-level semantic features. Since contrastive learning and MIM capture different levels of image features, combining the two methods can learn richer feature representations and thus improve aesthetic assessment performance. Therefore, we propose a pretext task that combines contrastive learning and MIM to learn richer image features. In this approach, the original image is randomly masked and reconstructed on the online network. The reconstructed image and the original image form the positive pair used to compute the contrastive loss on the target network. In experiments on the AVA dataset, our method obtained better performance than the baseline.
{"title":"A self-supervised image aesthetic assessment combining masked image modeling and contrastive learning","authors":"Shuai Yang , Zibei Wang , Guangao Wang , Yongzhen Ke , Fan Qin , Jing Guo , Liming Chen","doi":"10.1016/j.jvcir.2024.104184","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104184","url":null,"abstract":"<div><p>Learning more abundant image features helps improve the image aesthetic assessment task performance. Masked Image Modeling (MIM) is implemented based on the Vision Transformer (ViT), which learns pixel-level features while reconstructing images. Contrastive learning pulls in the same image features while pushing away different image features in the feature space to learn high-level semantic features. Since contrastive learning and MIM capture different levels of image features, combining these two methods could learn more rich feature representations and thus promote the performance of aesthetic assessment. Therefore, we propose a pretext task combining contrastive learning and MIM with learning richer image features. In this approach, the original image is randomly masked and reconstructed on the online network. The reconstructed and original images composition the positive example to calculate the contrastive loss on the target network. In the experiment on the AVA dataset, our method obtained better performance than the baseline.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141090670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory-guided representation matching for unsupervised video anomaly detection
Pub Date: 2024-05-01 | DOI: 10.1016/j.jvcir.2024.104185
Yiran Tao, Yaosi Hu, Zhenzhong Chen
Recent works on Video Anomaly Detection (VAD) have made advances in the unsupervised setting, known as Unsupervised VAD (UVAD), which brings the task closer to practical applications. Unlike the classic VAD task, which requires a clean training set containing only normal events, UVAD aims to identify abnormal frames without any labeled normal/abnormal training data. Many existing UVAD methods employ handcrafted surrogate tasks, such as frame reconstruction, to address this challenge. However, we argue that these surrogate tasks are sub-optimal solutions that are inconsistent with the essence of anomaly detection. In this paper, we propose a novel approach for UVAD that directly detects anomalies based on similarities between events in videos. Our method generates representations for events while simultaneously capturing prototypical normality patterns, and detects anomalies based on whether an event's representation matches the captured patterns. The proposed model comprises a memory module that captures normality patterns and a representation learning network that produces representations matching the memory module for normal events. A pseudo-label generation module and an anomalous event generation module for negative learning are further designed to help the model work under the strictly unsupervised setting. Experimental results demonstrate that the proposed method outperforms existing UVAD methods and achieves competitive performance compared with classic VAD methods.
{"title":"Memory-guided representation matching for unsupervised video anomaly detection","authors":"Yiran Tao , Yaosi Hu , Zhenzhong Chen","doi":"10.1016/j.jvcir.2024.104185","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104185","url":null,"abstract":"<div><p>Recent works on Video Anomaly Detection (VAD) have made advancements in the unsupervised setting, known as Unsupervised VAD (UVAD), which brings it closer to practical applications. Unlike the classic VAD task that requires a clean training set with only normal events, UVAD aims to identify abnormal frames without any labeled normal/abnormal training data. Many existing UVAD methods employ handcrafted surrogate tasks, such as frame reconstruction, to address this challenge. However, we argue that these surrogate tasks are sub-optimal solutions, inconsistent with the essence of anomaly detection. In this paper, we propose a novel approach for UVAD that directly detects anomalies based on similarities between events in videos. Our method generates representations for events while simultaneously capturing prototypical normality patterns, and detects anomalies based on whether an event’s representation matches the captured patterns. The proposed model comprises a memory module to capture normality patterns, and a representation learning network to obtain representations matching the memory module for normal events. A pseudo-label generation module as well as an anomalous event generation module for negative learning are further designed to assist the model to work under the strictly unsupervised setting. Experimental results demonstrate that the proposed method outperforms existing UVAD methods and achieves competitive performance compared with classic VAD methods.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141095324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Few-shot defect classification via feature aggregation based on graph neural network
Pub Date: 2024-05-01 | DOI: 10.1016/j.jvcir.2024.104172
Pengcheng Zhang, Peixiao Zheng, Xin Guo, Enqing Chen
The effectiveness of deep learning models is greatly dependent on the availability of a vast amount of labeled data. However, in the realm of surface defect classification, acquiring and annotating defect samples proves to be quite challenging. Consequently, accurately predicting defect types with only a limited number of labeled samples has emerged as a prominent research focus in recent years. Few-shot learning, which leverages a restricted sample set in the support set, can effectively predict the categories of unlabeled samples in the query set. This approach is particularly well suited for defect classification scenarios. In this article, we propose a transductive few-shot surface defect classification method that uses both instance-level and distribution-level relations in each few-shot learning task. Furthermore, we calculate class center features in a transductive manner and incorporate them into the feature aggregation operation to rectify the positioning of edge samples in the mapping space. This adjustment minimizes the distance between samples of the same category, thereby mitigating the influence of unlabeled samples at category boundaries on classification accuracy. Experimental results on the public dataset show the outstanding performance of our proposed approach compared to state-of-the-art methods in few-shot learning settings. Our code is available at https://github.com/Harry10459/CIDnet.
{"title":"Few-shot defect classification via feature aggregation based on graph neural network","authors":"Pengcheng Zhang, Peixiao Zheng, Xin Guo, Enqing Chen","doi":"10.1016/j.jvcir.2024.104172","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104172","url":null,"abstract":"<div><p>The effectiveness of deep learning models is greatly dependent on the availability of a vast amount of labeled data. However, in the realm of surface defect classification, acquiring and annotating defect samples proves to be quite challenging. Consequently, accurately predicting defect types with only a limited number of labeled samples has emerged as a prominent research focus in recent years. Few-shot learning, which leverages a restricted sample set in the support set, can effectively predict the categories of unlabeled samples in the query set. This approach is particularly well-suited for defect classification scenarios. In this article, we propose a transductive few-shot surface defect classification method, which using both the instance-level relations and distribution-level relations in each few-shot learning task. Furthermore, we calculate class center features in transductive manner and incorporate them into the feature aggregation operation to rectify the positioning of edge samples in the mapping space. This adjustment aims to minimize the distance between samples of the same category, thereby mitigating the influence of unlabeled samples at category boundary on classification accuracy. Experimental results on the public dataset show the outstanding performance of our proposed approach compared to the state-of-the-art methods in the few-shot learning settings. Our code is available at <span>https://github.com/Harry10459/CIDnet</span><svg><path></path></svg>.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140950969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FSRDiff: A fast diffusion-based super-resolution method using GAN
Pub Date: 2024-05-01 | DOI: 10.1016/j.jvcir.2024.104164
Ni Tang, Dongxiao Zhang, Juhao Gao, Yanyun Qu
Single image super-resolution with diffusion probabilistic models (SRDiff) is a successful diffusion model for image super-resolution that produces high-quality images and is stable during training. However, due to its long sampling time, it is slower in the testing phase than other deep learning-based algorithms. Reducing the total number of diffusion steps can accelerate sampling, but it also causes the reverse diffusion process to deviate from the Gaussian assumption and exhibit a multimodal distribution, which violates the diffusion assumption and degrades the results. To overcome this limitation, we propose a fast SRDiff (FSRDiff) algorithm that integrates a generative adversarial network (GAN) with a diffusion model to speed up SRDiff. FSRDiff employs a conditional GAN to approximate the multimodal distribution in the reverse diffusion process, thus improving sampling efficiency when the total number of diffusion steps is reduced. Experimental results show that FSRDiff is nearly 20 times faster than SRDiff in reconstruction while maintaining comparable performance on the DIV2K test set.
{"title":"FSRDiff: A fast diffusion-based super-resolution method using GAN","authors":"Ni Tang , Dongxiao Zhang , Juhao Gao , Yanyun Qu","doi":"10.1016/j.jvcir.2024.104164","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104164","url":null,"abstract":"<div><p>Single image super-resolution with diffusion probabilistic models (SRDiff) is a successful diffusion model for image super-resolution that produces high-quality images and is stable during training. However, due to the long sampling time, it is slower in the testing phase than other deep learning-based algorithms. Reducing the total number of diffusion steps can accelerate sampling, but it also causes the inverse diffusion process to deviate from the Gaussian distribution and exhibit a multimodal distribution, which violates the diffusion assumption and degrades the results. To overcome this limitation, we propose a fast SRDiff (FSRDiff) algorithm that integrates a generative adversarial network (GAN) with a diffusion model to speed up SRDiff. FSRDiff employs conditional GAN to approximate the multimodal distribution in the inverse diffusion process of the diffusion model, thus enhancing its sampling efficiency when reducing the total number of diffusion steps. The experimental results show that FSRDiff is nearly 20 times faster than SRDiff in reconstruction while maintaining comparable performance on the DIV2K test set.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140824328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive HEVC video steganography based on PU partition modes
Pub Date: 2024-05-01 | DOI: 10.1016/j.jvcir.2024.104176
Shanshan Wang, Dawen Xu, Songhan He
High Efficiency Video Coding (HEVC)-based steganography has gained attention as a prominent research focus. In particular, block-structure-based HEVC video steganography has received increasing attention due to its commendable performance. However, current block-structure-based steganography algorithms face challenges such as reduced coding efficiency and limited capacity. To avoid these problems, an adaptive video steganography algorithm based on the Prediction Unit (PU) partition mode in I-frames is proposed. This is done through analysis of the block division process and of the visual distortion resulting from modifying the PU partition mode in HEVC. The PU block structure is used as the steganographic cover, and the Rate Distortion Optimization (RDO) technique is introduced to establish an adaptive distortion function for Syndrome-Trellis Codes (STC). Further comparison between the proposed method and state-of-the-art steganography algorithms confirms its advantages in embedding capacity, compression efficiency, visual quality, and resistance to video steganalysis.
{"title":"Adaptive HEVC video steganograhpy based on PU partition modes","authors":"Shanshan Wang , Dawen Xu , Songhan He","doi":"10.1016/j.jvcir.2024.104176","DOIUrl":"https://doi.org/10.1016/j.jvcir.2024.104176","url":null,"abstract":"<div><p>High Efficiency<!--> <!-->Video<!--> <!-->Coding (HEVC) −based steganography has gained attention as a prominent research focus. Especially, block structure based HEVC video steganography has received increasing attention due to commendable performance. However, current block structure- based steganography algorithms confront with challenges such as reduced coding efficiency and limited capacity. To avoid these problems, an adaptive video steganography algorithm based on Prediction Unit (PU) partition mode in I-frames is proposed. This is done through the analysis of the block division process and the visual distortion resulting from the modification of the PU partition mode in HEVC. The PU block structure is utilized as steganographic covers, and the Rate Distortion Optimization (RDO) technique is introduced to establish an adaptive distortion function for Syndrome-trellis code (STC). Further comparison is performed between the proposed method and the state-of-the-art steganography algorithms, confirming its advantages in embedding capacity, compression efficiency, visual quality, and resistance to video steganalysis.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140906073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
6-DoF grasp estimation method that fuses RGB-D data based on external attention
Pub Date: 2024-05-01 | DOI: 10.1016/j.jvcir.2024.104173
Haosong Ran, Diansheng Chen, Qinshu Chen, Yifei Li, Yazhe Luo, Xiaoyu Zhang, Jiting Li, Xiaochuan Zhang
6-DoF grasp estimation methods based on point clouds have long been a challenge in robotics due to the limitations of a single data input, which hinder the robot's perception of real-world scenarios and thus reduce its robustness. In this work, we propose a 6-DoF grasp pose estimation method based on RGB-D data, which leverages ResNet to extract color image features, utilizes the PointNet++ network to extract geometric features, and employs an external attention mechanism to fuse both features. Our method is an end-to-end design, and we validate its performance through benchmark tests on a large-scale dataset and evaluations in a simulated robot environment. Our method outperforms previous state-of-the-art methods on public datasets, achieving 47.75 mAP and 40.08 mAP for seen and unseen objects, respectively. We also test our grasp pose estimation method on multiple objects in a simulated robot environment, demonstrating that our approach exhibits higher grasp accuracy and robustness than previous methods.
{"title":"6-DoF grasp estimation method that fuses RGB-D data based on external attention","authors":"Haosong Ran , Diansheng Chen , Qinshu Chen , Yifei Li , Yazhe Luo , Xiaoyu Zhang , Jiting Li , Xiaochuan Zhang","doi":"10.1016/j.jvcir.2024.104173","DOIUrl":"10.1016/j.jvcir.2024.104173","url":null,"abstract":"<div><p>6-DoF grasp estimation methods based on point clouds have long been a challenge in robotics due to the limitations of single data input, which hinder the robot’s perception of real-world scenarios, thus reducing its robustness. In this work, we propose a 6-DoF grasp pose estimation method based on RGB-D data, which leverages ResNet to extract color image features, utilizes the PointNet++ network to extract geometric information features, and employs an external attention mechanism to fuse both features. Our method is an end-to-end design, and we validate its performance through benchmark tests on a large-scale dataset and evaluations in a simulated robot environment. Our method outperforms previous state-of-the-art methods on public datasets, achieving 47.75mAP and 40.08mAP for seen and unseen objects, respectively. We also test our grasp pose estimation method on multiple objects in a simulated robot environment, demonstrating that our approach exhibits higher grasp accuracy and robustness than previous methods.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":null,"pages":null},"PeriodicalIF":2.6,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141042765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}