
Journal of Visual Communication and Image Representation: Latest Publications

X-CDNet: A real-time crosswalk detector based on YOLOX
IF 2.6 · CAS Tier 4 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-06-01 · DOI: 10.1016/j.jvcir.2024.104206
Xingyuan Lu, Yanbing Xue, Zhigang Wang, Haixia Xu, Xianbin Wen

As urban traffic safety becomes increasingly important, real-time crosswalk detection plays a critical role in the transportation field. However, existing crosswalk detection algorithms still need improvement in both accuracy and speed. This study proposes a real-time crosswalk detector called X-CDNet based on YOLOX. Building on the ConvNeXt basic module, we designed a new basic module called Reparameterizable Sparse Large-Kernel (RepSLK) convolution, which expands the model’s receptive field without adding extra inference time. In addition, we created a new crosswalk dataset called CD9K, which is based on realistic driving scenes augmented by techniques such as synthetic rain and fog. The experimental results demonstrate that X-CDNet outperforms YOLOX in both detection accuracy and speed, achieving a 93.3 AP50 and a real-time detection speed of 123 FPS.
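The abstract does not spell out how RepSLK is constructed; the sketch below only illustrates the general structural-reparameterization idea its name suggests (train-time parallel large- and small-kernel branches merged into a single convolution for inference, so no extra latency is added). The module name, kernel sizes, and depthwise layout are illustrative assumptions, not the authors' design.

```python
# Hypothetical sketch of structural reparameterization in the spirit of RepSLK:
# a large-kernel conv trained alongside a small-kernel branch, merged into one
# equivalent conv for inference so no extra inference time is added.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepLargeKernelConv(nn.Module):
    def __init__(self, channels: int, large_k: int = 7, small_k: int = 3):
        super().__init__()
        self.large_k = large_k
        # Train-time branches: large kernel + small kernel (depthwise).
        self.large = nn.Conv2d(channels, channels, large_k, padding=large_k // 2,
                               groups=channels, bias=True)
        self.small = nn.Conv2d(channels, channels, small_k, padding=small_k // 2,
                               groups=channels, bias=True)
        self.merged = None  # single conv used after reparameterization

    def forward(self, x):
        if self.merged is not None:            # inference path: one conv only
            return self.merged(x)
        return self.large(x) + self.small(x)   # training path: parallel branches

    @torch.no_grad()
    def reparameterize(self):
        # Zero-pad the small kernel to the large kernel size and add the weights,
        # so the two branches collapse into one equivalent convolution.
        pad = (self.large_k - self.small.kernel_size[0]) // 2
        w = self.large.weight + F.pad(self.small.weight, [pad] * 4)
        b = self.large.bias + self.small.bias
        self.merged = nn.Conv2d(self.large.in_channels, self.large.out_channels,
                                self.large_k, padding=self.large_k // 2,
                                groups=self.large.groups, bias=True)
        self.merged.weight.copy_(w)
        self.merged.bias.copy_(b)

# Usage: outputs match (up to floating-point error) before and after merging.
m = RepLargeKernelConv(16)
x = torch.randn(1, 16, 32, 32)
y_train = m(x)
m.reparameterize()
y_infer = m(x)
print(torch.allclose(y_train, y_infer, atol=1e-5))
```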

Citations: 0
Shift-insensitive perceptual feature of quadratic sum of gradient magnitude and LoG signals for image quality assessment and image classification
IF 2.6 · CAS Tier 4 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-06-01 · DOI: 10.1016/j.jvcir.2024.104215
Congmin Chen, Xuanqin Mou

Most existing full-reference (FR) image quality assessment (IQA) models work on the premise that the two images are well registered. Shifting an image would lead to an inaccurate evaluation of image quality, because small spatial shifts are far less noticeable to human observers than structural distortion. To this end, we propose to study an IQA feature that is shift-insensitive with respect to the basic primitive structure of images, i.e., image edges. According to previous studies, the image gradient magnitude (GM) and the Laplacian of Gaussian (LoG) operator, which depict the edge profiles of natural images, are highly efficient structural features in IQA tasks. In this paper, we find that the quadratic sum of the normalized GM and LoG signals (QGL) has an excellent shift-insensitive property in representing image edges, after theoretically solving the selection problem of a ratio parameter that balances the GM and LoG signals. Based on the proposed QGL feature, two FR-IQA models, named mQGL and sQGL, can be built directly by measuring the similarity map with mean and standard deviation pooling strategies, respectively. Experimental results show that the proposed sQGL and mQGL work robustly on four benchmark IQA databases, and QGL-based models remain largely insensitive to spatial translation and image rotation when judging image quality. In addition, we explore the feasibility of combining the QGL feature with deep neural networks, and verify that it can help promote image pattern recognition in texture classification tasks.
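As a rough illustration of how such a feature could be computed, the sketch below forms a QGL-style map from Gaussian-derivative gradient magnitude and LoG responses, compares two maps with an SSIM-style similarity, and pools with mean and standard deviation. The normalization, the ratio parameter lam, and the constant c are placeholder choices, not the values derived in the paper.

```python
# Illustrative sketch of a QGL-style feature: quadratic sum of gradient magnitude
# (GM) and Laplacian-of-Gaussian (LoG) responses, pooled into two quality scores.
import numpy as np
from scipy import ndimage

def qgl_map(img: np.ndarray, sigma: float = 1.0, lam: float = 0.5) -> np.ndarray:
    img = img.astype(np.float64)
    gx = ndimage.gaussian_filter(img, sigma, order=(0, 1))
    gy = ndimage.gaussian_filter(img, sigma, order=(1, 0))
    gm = np.hypot(gx, gy)                        # gradient magnitude
    log = ndimage.gaussian_laplace(img, sigma)   # LoG response
    gm /= gm.std() + 1e-8                        # simple normalization (assumed)
    log /= log.std() + 1e-8
    return np.sqrt(gm ** 2 + lam * log ** 2)     # quadratic sum of the two signals

def qgl_scores(ref: np.ndarray, dist: np.ndarray, c: float = 1e-3):
    q_ref, q_dist = qgl_map(ref), qgl_map(dist)
    sim = (2 * q_ref * q_dist + c) / (q_ref ** 2 + q_dist ** 2 + c)  # similarity map
    return sim.mean(), sim.std()   # mean pooling (mQGL-like), std pooling (sQGL-like)

ref = np.random.rand(64, 64)
dist = ref + 0.05 * np.random.randn(64, 64)
print(qgl_scores(ref, dist))
```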

Citations: 0
MCT-VHD: Multi-modal contrastive transformer for video highlight detection
IF 2.6 · CAS Tier 4 (Computer Science) · Q1 Engineering · Pub Date: 2024-05-01 · DOI: 10.1016/j.jvcir.2024.104162
Yinhui Jiang, Sihui Luo, Lijun Guo, Rong Zhang

Autonomous highlight detection aims to identify the most captivating moments in a video, which is crucial for enhancing the efficiency of video editing and browsing on social media platforms. However, current efforts primarily focus on visual elements and often overlook other modalities, such as text information that could provide valuable semantic signals. To overcome this limitation, we propose a Multi-modal Contrastive Transformer for Video Highlight Detection (MCT-VHD). This transformer-based network mainly utilizes video and audio modalities, along with auxiliary text features when available, for video highlight detection. Specifically, we enhance the temporal connections within the video by integrating a convolution-based local enhancement module into the transformer blocks. Furthermore, we explore three multi-modal fusion strategies to improve highlight inference performance and employ a contrastive objective to facilitate interactions between different modalities. Comprehensive experiments conducted on three benchmark datasets validate the effectiveness of MCT-VHD, and our ablation studies provide valuable insights into its essential components.
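The contrastive objective is not specified beyond the sentence above; one common instantiation is a symmetric InfoNCE loss between matched video and audio clip embeddings, sketched below with an arbitrary temperature and feature size.

```python
# Minimal sketch of a cross-modal InfoNCE contrastive objective between video and
# audio clip embeddings; one plausible reading of the "contrastive objective"
# mentioned in the abstract, not the paper's exact loss.
import torch
import torch.nn.functional as F

def cross_modal_infonce(video_emb: torch.Tensor, audio_emb: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    # video_emb, audio_emb: (batch, dim); matched rows are positive pairs.
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature           # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: video-to-audio and audio-to-video retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

video_emb = torch.randn(8, 256)
audio_emb = torch.randn(8, 256)
print(cross_modal_infonce(video_emb, audio_emb).item())
```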

Citations: 0
Reversible data hiding with automatic contrast enhancement for color images
IF 2.6 · CAS Tier 4 (Computer Science) · Q1 Engineering · Pub Date: 2024-05-01 · DOI: 10.1016/j.jvcir.2024.104181
Libo Han, Yanzhao Ren, Sha Tao, Xinfeng Zhang, Wanlin Gao

Automatic contrast enhancement (ACE) is a technique that automatically enhances image contrast. Reversible data hiding (RDH) with ACE (ACERDH) can achieve ACE while hiding data. However, some methods that perform well on color images suffer from insufficient enhancement. Therefore, an ACERDH method based on enhancement of the R, G, B, and V channels is proposed. First, histogram shifting with contrast control is proposed to enhance the R, G, and B channels; it prevents contrast degradation and keeps histogram shifting from stopping prematurely. Then, the V channel is enhanced. Since some non-ACE RDH methods that enhance the V channel well offer only a low level of automation, histogram shifting with brightness control, which realizes ACE effectively, is proposed; it avoids over-enhancement by controlling the brightness. Experimental results verify that the proposed method achieves better image quality and embedding capability than several state-of-the-art methods.
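For readers unfamiliar with the underlying primitive, the sketch below shows textbook peak/zero-bin histogram shifting on a single channel, which the proposed contrast- and brightness-controlled variants build on. The bin selection and stopping rules here are simplified and ignore overflow handling and location-map bookkeeping; they are not the paper's rules.

```python
# Textbook histogram-shifting embedding on one channel (simplified sketch).
import numpy as np

def hs_embed(channel: np.ndarray, bits):
    hist = np.bincount(channel.ravel(), minlength=256)
    peak = int(hist[:255].argmax())                     # most frequent gray level
    zero = int(hist[peak + 1:].argmin()) + peak + 1     # emptiest bin to its right
    out = channel.astype(np.int32).copy()
    # Shift bins strictly between peak and zero right by one to free peak+1.
    out[(out > peak) & (out < zero)] += 1
    bit_iter = iter(bits)
    flat = out.ravel()
    for i, v in enumerate(flat):
        if v == peak:
            try:
                flat[i] = peak + next(bit_iter)         # embed 1 -> peak+1, 0 -> peak
            except StopIteration:
                break
    return flat.reshape(channel.shape).astype(np.uint8), peak, zero

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
stego, peak, zero = hs_embed(cover, bits=[1, 0, 1, 1, 0])
print(peak, zero, stego.dtype)
```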

Citations: 0
A self-supervised image aesthetic assessment combining masked image modeling and contrastive learning
IF 2.6 · CAS Tier 4 (Computer Science) · Q1 Engineering · Pub Date: 2024-05-01 · DOI: 10.1016/j.jvcir.2024.104184
Shuai Yang, Zibei Wang, Guangao Wang, Yongzhen Ke, Fan Qin, Jing Guo, Liming Chen

Learning more abundant image features helps improve performance on the image aesthetic assessment task. Masked Image Modeling (MIM) is implemented on top of the Vision Transformer (ViT) and learns pixel-level features while reconstructing images. Contrastive learning pulls together features of the same image while pushing apart features of different images in the feature space, thereby learning high-level semantic features. Since contrastive learning and MIM capture different levels of image features, combining the two methods can learn richer feature representations and thus improve aesthetic assessment performance. Therefore, we propose a pretext task that combines contrastive learning and MIM to learn richer image features. In this approach, the original image is randomly masked and reconstructed by the online network. The reconstructed and original images form the positive pair used to calculate the contrastive loss on the target network. In experiments on the AVA dataset, our method obtained better performance than the baseline.
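A schematic of such a combined pretext objective, under the assumption that the masked-reconstruction loss is computed on the online branch and the (reconstruction, original) pair forms the positive for the contrastive loss on the target branch, is sketched below; tiny MLPs stand in for the ViT backbones and the masking is per-pixel purely for brevity.

```python
# Schematic of the combined pretext objective: masked-reconstruction loss plus a
# contrastive loss that treats (reconstruction, original) as a positive pair.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, feat = 32 * 32, 128
online = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, dim))   # reconstructs
target = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, feat))  # embeds
for p in target.parameters():            # target network is not updated by this loss
    p.requires_grad_(False)

def pretext_losses(images: torch.Tensor, mask_ratio: float = 0.6, tau: float = 0.1):
    x = images.flatten(1)                                   # (B, dim)
    mask = (torch.rand_like(x) > mask_ratio).float()
    recon = online(x * mask)                                # reconstruct from masked input
    mim_loss = F.mse_loss(recon * (1 - mask), x * (1 - mask))   # loss on masked positions
    z_rec = F.normalize(target(recon), dim=-1)              # embed the reconstruction
    z_img = F.normalize(target(x), dim=-1)                  # embed the original
    logits = z_rec @ z_img.t() / tau                        # (B, B): diagonal = positives
    contrast = F.cross_entropy(logits, torch.arange(x.size(0)))
    return mim_loss, contrast

imgs = torch.rand(8, 1, 32, 32)
mim, con = pretext_losses(imgs)
print(mim.item(), con.item())
```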

Citations: 0
Memory-guided representation matching for unsupervised video anomaly detection
IF 2.6 · CAS Tier 4 (Computer Science) · Q1 Engineering · Pub Date: 2024-05-01 · DOI: 10.1016/j.jvcir.2024.104185
Yiran Tao, Yaosi Hu, Zhenzhong Chen

Recent works on Video Anomaly Detection (VAD) have made advancements in the unsupervised setting, known as Unsupervised VAD (UVAD), which brings it closer to practical applications. Unlike the classic VAD task that requires a clean training set with only normal events, UVAD aims to identify abnormal frames without any labeled normal/abnormal training data. Many existing UVAD methods employ handcrafted surrogate tasks, such as frame reconstruction, to address this challenge. However, we argue that these surrogate tasks are sub-optimal solutions, inconsistent with the essence of anomaly detection. In this paper, we propose a novel approach for UVAD that directly detects anomalies based on similarities between events in videos. Our method generates representations for events while simultaneously capturing prototypical normality patterns, and detects anomalies based on whether an event’s representation matches the captured patterns. The proposed model comprises a memory module to capture normality patterns, and a representation learning network to obtain representations matching the memory module for normal events. A pseudo-label generation module as well as an anomalous event generation module for negative learning are further designed to assist the model to work under the strictly unsupervised setting. Experimental results demonstrate that the proposed method outperforms existing UVAD methods and achieves competitive performance compared with classic VAD methods.
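One simple way to read "representation matching" against a memory of normality patterns is to score an event by how poorly it matches its closest prototype, as in the sketch below; the memory size, feature dimension, and cosine-similarity choice are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of memory-guided matching for anomaly scoring: an event
# representation is compared against normality prototypes, and the anomaly score
# is how poorly it matches its closest prototype.
import torch
import torch.nn.functional as F

num_prototypes, feat_dim = 10, 128
memory = F.normalize(torch.randn(num_prototypes, feat_dim), dim=-1)  # normality patterns

def anomaly_score(event_feat: torch.Tensor) -> torch.Tensor:
    z = F.normalize(event_feat, dim=-1)                # (B, feat_dim)
    sim = z @ memory.t()                               # cosine similarity to each prototype
    return 1.0 - sim.max(dim=-1).values                # poor match -> high anomaly score

events = torch.randn(4, feat_dim)
print(anomaly_score(events))
```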

Citations: 0
Few-shot defect classification via feature aggregation based on graph neural network
IF 2.6 · CAS Tier 4 (Computer Science) · Q1 Engineering · Pub Date: 2024-05-01 · DOI: 10.1016/j.jvcir.2024.104172
Pengcheng Zhang, Peixiao Zheng, Xin Guo, Enqing Chen

The effectiveness of deep learning models is greatly dependent on the availability of a vast amount of labeled data. However, in the realm of surface defect classification, acquiring and annotating defect samples proves to be quite challenging. Consequently, accurately predicting defect types with only a limited number of labeled samples has emerged as a prominent research focus in recent years. Few-shot learning, which leverages a restricted sample set in the support set, can effectively predict the categories of unlabeled samples in the query set. This approach is particularly well suited for defect classification scenarios. In this article, we propose a transductive few-shot surface defect classification method that uses both instance-level relations and distribution-level relations in each few-shot learning task. Furthermore, we calculate class center features in a transductive manner and incorporate them into the feature aggregation operation to rectify the positioning of edge samples in the mapping space. This adjustment aims to minimize the distance between samples of the same category, thereby mitigating the influence of unlabeled samples at category boundaries on classification accuracy. Experimental results on the public dataset show the outstanding performance of our proposed approach compared to state-of-the-art methods in few-shot learning settings. Our code is available at https://github.com/Harry10459/CIDnet.
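A minimal sketch of the transductive center-rectification idea is given below: class centers are computed from the support set, refined with soft-assigned query features, and queries are classified by the nearest center. The graph-neural-network feature aggregation of the paper is not reproduced here; the function names and the mixing weight alpha are hypothetical.

```python
# Sketch of transductive class-center rectification for few-shot classification.
import torch
import torch.nn.functional as F

def transductive_centers(support: torch.Tensor, support_labels: torch.Tensor,
                         query: torch.Tensor, num_classes: int, alpha: float = 0.5):
    centers = torch.stack([support[support_labels == c].mean(dim=0)
                           for c in range(num_classes)])           # (C, D)
    # Soft-assign each unlabeled query sample to the current centers (transductive step).
    weights = F.softmax(-torch.cdist(query, centers), dim=-1)       # (Q, C)
    query_centers = (weights.t() @ query) / (weights.sum(dim=0, keepdim=True).t() + 1e-8)
    return alpha * centers + (1 - alpha) * query_centers            # rectified centers

def classify(query: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
    return torch.cdist(query, centers).argmin(dim=-1)               # nearest-center label

support = torch.randn(5 * 5, 64)                  # 5-way 5-shot embeddings (toy data)
support_labels = torch.arange(5).repeat_interleave(5)
query = torch.randn(15, 64)
centers = transductive_centers(support, support_labels, query, num_classes=5)
print(classify(query, centers))
```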

Citations: 0
FSRDiff: A fast diffusion-based super-resolution method using GAN
IF 2.6 · CAS Tier 4 (Computer Science) · Q1 Engineering · Pub Date: 2024-05-01 · DOI: 10.1016/j.jvcir.2024.104164
Ni Tang, Dongxiao Zhang, Juhao Gao, Yanyun Qu

Single image super-resolution with diffusion probabilistic models (SRDiff) is a successful diffusion model for image super-resolution that produces high-quality images and is stable during training. However, due to its long sampling time, it is slower in the testing phase than other deep learning-based algorithms. Reducing the total number of diffusion steps can accelerate sampling, but it also causes the inverse diffusion process to deviate from the Gaussian distribution and exhibit a multimodal distribution, which violates the diffusion assumption and degrades the results. To overcome this limitation, we propose a fast SRDiff (FSRDiff) algorithm that integrates a generative adversarial network (GAN) with a diffusion model to speed up SRDiff. FSRDiff employs a conditional GAN to approximate the multimodal distribution in the inverse diffusion process of the diffusion model, thus enhancing its sampling efficiency when the total number of diffusion steps is reduced. The experimental results show that FSRDiff is nearly 20 times faster than SRDiff at reconstruction while maintaining comparable performance on the DIV2K test set.
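A structural sketch of few-step sampling in this spirit is shown below: an untrained placeholder generator predicts the clean image from the noisy input, the upsampled LR condition, and the timestep, and the standard DDPM posterior produces the next, less-noisy sample. This follows the generic denoising-diffusion-GAN recipe rather than FSRDiff's exact formulation, and every network and schedule value is a placeholder.

```python
import torch
import torch.nn as nn

T = 4                                              # a few diffusion steps instead of hundreds
betas = torch.linspace(1e-4, 0.4, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class CondGenerator(nn.Module):                    # untrained placeholder generator
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Conv2d(2 * channels + 1, channels, 3, padding=1)
    def forward(self, x_t, lr_up, t):
        t_map = torch.full_like(x_t[:, :1], float(t) / T)
        return self.net(torch.cat([x_t, lr_up, t_map], dim=1))   # predicts x_0

@torch.no_grad()
def sample(generator, lr_up):
    x = torch.randn_like(lr_up)                    # start from pure noise
    for t in reversed(range(T)):
        x0_hat = generator(x, lr_up, t)            # generator replaces the Gaussian step
        if t == 0:
            return x0_hat
        ab_t, ab_prev = alpha_bar[t], alpha_bar[t - 1]
        # Standard DDPM posterior q(x_{t-1} | x_t, x0_hat).
        coef0 = torch.sqrt(ab_prev) * betas[t] / (1 - ab_t)
        coeft = torch.sqrt(alphas[t]) * (1 - ab_prev) / (1 - ab_t)
        var = betas[t] * (1 - ab_prev) / (1 - ab_t)
        x = coef0 * x0_hat + coeft * x + torch.sqrt(var) * torch.randn_like(x)
    return x

lr_up = torch.rand(1, 3, 64, 64)                   # bicubic-upsampled LR image (placeholder)
print(sample(CondGenerator(), lr_up).shape)
```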

Citations: 0
Adaptive HEVC video steganography based on PU partition modes
IF 2.6 · CAS Tier 4 (Computer Science) · Q1 Engineering · Pub Date: 2024-05-01 · DOI: 10.1016/j.jvcir.2024.104176
Shanshan Wang, Dawen Xu, Songhan He

High Efficiency Video Coding (HEVC)-based steganography has gained attention as a prominent research focus. In particular, block-structure-based HEVC video steganography has received increasing attention due to its commendable performance. However, current block-structure-based steganography algorithms face challenges such as reduced coding efficiency and limited capacity. To avoid these problems, an adaptive video steganography algorithm based on the Prediction Unit (PU) partition mode in I-frames is proposed. It is designed through an analysis of the block division process and of the visual distortion resulting from modification of the PU partition mode in HEVC. The PU block structure is utilized as the steganographic cover, and the Rate Distortion Optimization (RDO) technique is introduced to establish an adaptive distortion function for Syndrome-trellis codes (STC). Further comparison with state-of-the-art steganography algorithms confirms its advantages in embedding capacity, compression efficiency, visual quality, and resistance to video steganalysis.

Citations: 0
6-DoF grasp estimation method that fuses RGB-D data based on external attention
IF 2.6 · CAS Tier 4 (Computer Science) · Q1 Engineering · Pub Date: 2024-05-01 · DOI: 10.1016/j.jvcir.2024.104173
Haosong Ran, Diansheng Chen, Qinshu Chen, Yifei Li, Yazhe Luo, Xiaoyu Zhang, Jiting Li, Xiaochuan Zhang

6-DoF grasp estimation from point clouds has long been a challenge in robotics because of the limitations of a single input modality, which hinder the robot’s perception of real-world scenes and thus reduce robustness. In this work, we propose a 6-DoF grasp pose estimation method based on RGB-D data, which leverages ResNet to extract color image features, utilizes the PointNet++ network to extract geometric features, and employs an external attention mechanism to fuse the two. Our method is an end-to-end design, and we validate its performance through benchmark tests on a large-scale dataset and evaluations in a simulated robot environment. Our method outperforms previous state-of-the-art methods on public datasets, achieving 47.75 mAP and 40.08 mAP for seen and unseen objects, respectively. We also test our grasp pose estimation method on multiple objects in a simulated robot environment, demonstrating that our approach exhibits higher grasp accuracy and robustness than previous methods.
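The external attention mechanism referenced here is a generic one (two small learnable linear memories shared across tokens, with double normalization); the sketch below applies it to concatenated per-point color and geometric features. The fusion layout, feature sizes, and point count are assumptions for illustration, not the paper's network.

```python
import torch
import torch.nn as nn

class ExternalAttention(nn.Module):
    """Generic external attention: two learnable linear memory units shared across
    all tokens, with the double normalization of the original formulation."""
    def __init__(self, dim: int, mem_size: int = 64):
        super().__init__()
        self.mk = nn.Linear(dim, mem_size, bias=False)    # memory key unit
        self.mv = nn.Linear(mem_size, dim, bias=False)    # memory value unit

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, N, dim)
        attn = self.mk(x)                                  # (B, N, mem_size)
        attn = attn.softmax(dim=1)                         # normalize over the N tokens
        attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-9)   # l1-norm over memory slots
        return self.mv(attn)                               # (B, N, dim)

# Hypothetical fusion: concatenate per-point color and geometric features, project,
# then let external attention mix them; sizes are arbitrary.
rgb_feat = torch.randn(2, 1024, 128)     # per-point features from the image branch
geo_feat = torch.randn(2, 1024, 128)     # per-point features from the point-cloud branch
fuse = nn.Linear(256, 128)
ea = ExternalAttention(dim=128)
fused = ea(fuse(torch.cat([rgb_feat, geo_feat], dim=-1)))
print(fused.shape)   # torch.Size([2, 1024, 128])
```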

Citations: 0