
Journal of Visual Communication and Image Representation — Latest Publications

Spatio-temporal feature learning for enhancing video quality based on screen content characteristics
IF 2.6 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-28 · DOI: 10.1016/j.jvcir.2024.104270

With the rising demand for remote desktops and online meetings, screen content videos have drawn significant attention. Different from natural videos, screen content videos often exhibit scene switches, where the content abruptly changes from one frame to the next. These scene switches result in obvious distortions in compressed videos. Besides, frame freezing, where the content remains unchanged for a certain duration, is also very common in screen content videos. Existing alignment-based models struggle to effectively enhance scene-switch frames and lack efficiency when dealing with frame-freezing situations. Therefore, we propose a novel alignment-free method that effectively handles both scene switches and frame freezing. In our approach, we develop a spatial and temporal feature extraction module that compresses and extracts spatio-temporal information from three groups of frame inputs, which enables efficient handling of scene switches. In addition, an edge-aware block is proposed for extracting edge information, which guides the model to focus on restoring the high-frequency components in frame-freezing situations. The fusion module is then designed to adaptively fuse the features from the three groups, considering the different positions of video frames, to enhance frames during scene-switch and frame-freezing scenarios. Experimental results demonstrate the significant advancements achieved by the proposed edge-aware spatio-temporal information fusion network (EAST) in enhancing the quality of compressed videos, surpassing current state-of-the-art methods.
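
A minimal sketch of one plausible reading of the edge-aware block, in which fixed Sobel filters produce an edge map that re-weights frame features toward high-frequency regions; the class name, parameters, and fusion choice are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareBlock(nn.Module):
    """Hypothetical edge-aware block: extracts an edge map with fixed Sobel
    kernels and uses it to steer frame features toward high-frequency (edge)
    regions, standing in for the edge guidance idea described above."""
    def __init__(self, channels):
        super().__init__()
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        kernel = torch.stack([sobel_x, sobel_x.t()]).unsqueeze(1)  # (2, 1, 3, 3)
        self.register_buffer("sobel", kernel)
        self.fuse = nn.Conv2d(channels + 1, channels, kernel_size=3, padding=1)

    def forward(self, feat, frame):
        # frame: (B, 3, H, W); take luminance and compute gradient magnitude
        gray = frame.mean(dim=1, keepdim=True)
        grad = F.conv2d(gray, self.sobel, padding=1)            # (B, 2, H, W)
        edge = grad.pow(2).sum(dim=1, keepdim=True).sqrt()      # (B, 1, H, W)
        edge = F.interpolate(edge, size=feat.shape[-2:], mode="bilinear",
                             align_corners=False)
        # concatenate the edge map and let a conv decide how to use it
        return feat + self.fuse(torch.cat([feat, edge], dim=1))  # residual update

# usage: EdgeAwareBlock(64)(torch.randn(1, 64, 64, 64), torch.randn(1, 3, 256, 256))
```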

Citations: 0
Exposing video surveillance object forgery by combining TSF features and attention-based deep neural networks
IF 2.6 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-27 · DOI: 10.1016/j.jvcir.2024.104267

Recently, forensics has encountered a new challenge with video surveillance object forgery. This type of forgery combines the characteristics of popular video copy-move and splicing forgeries, causing most existing video forgery detection schemes to fail. In response to this new challenge, this paper proposes a Video Surveillance Object Forgery Detection (VSOFD) method comprising three components: (i) a specially combined extraction technique that incorporates Temporal-Spatial-Frequent (TSF) perspectives for TSF feature extraction. TSF features effectively represent video information and benefit from feature dimension reduction, improving computational efficiency. (ii) A universal, extensible attention-based Convolutional Neural Network (CNN) baseline for feature processing. This processing architecture is compatible with various series and parallel feed-forward CNN structures, treating them as processing backbones; it therefore benefits from various state-of-the-art structures and addresses each independent TSF feature. (iii) An encoder-attention-decoder RNN framework for feature classification. By incorporating temporal characteristics, the framework further identifies correlations between adjacent frames to classify forgery frames better. Finally, experimental results show that the proposed network achieves the best F1 = 94.69 % score, an improvement of at least 5–12 % over existing State-Of-The-Art (SOTA) VSOFD schemes and other video forensics methods.
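
The TSF representation is only named, not specified; the sketch below shows one hedged interpretation that stacks, per frame, a temporal channel (frame difference), a spatial channel (gradient magnitude), and a frequency channel (normalized log-magnitude FFT). The function and its channel choices are assumptions, not the paper's exact pipeline.

```python
import torch
import torch.nn.functional as F

def tsf_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 1, H, W) grayscale video clip in [0, 1].
    Returns a (T, 3, H, W) tensor stacking temporal, spatial and frequency
    cues -- an illustrative take on a 'TSF' descriptor."""
    # temporal: absolute difference to the previous frame (first frame -> zeros)
    prev = torch.cat([frames[:1], frames[:-1]], dim=0)
    temporal = (frames - prev).abs()

    # spatial: gradient magnitude from simple finite differences
    dx = F.pad(frames[..., :, 1:] - frames[..., :, :-1], (0, 1))
    dy = F.pad(frames[..., 1:, :] - frames[..., :-1, :], (0, 0, 0, 1))
    spatial = (dx.pow(2) + dy.pow(2)).sqrt()

    # frequency: log-magnitude of the 2-D FFT, normalized per frame
    mag = torch.log1p(torch.fft.fft2(frames).abs())
    freq = mag / (mag.amax(dim=(-2, -1), keepdim=True) + 1e-8)

    return torch.cat([temporal, spatial, freq], dim=1)

# usage: clip = torch.rand(8, 1, 128, 128); print(tsf_features(clip).shape)
```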

Citations: 0
GINet: Graph interactive network with semantic-guided spatial refinement for salient object detection in optical remote sensing images
IF 2.6 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-26 · DOI: 10.1016/j.jvcir.2024.104257

There are many challenging scenarios in the task of salient object detection in optical remote sensing images (RSIs), such as the varied scales and irregular shapes of salient objects, cluttered backgrounds, etc. It is therefore difficult to directly apply saliency models targeting natural scene images to optical RSIs. Besides, existing models often do not sufficiently explore the potential relationships among different salient objects or among different parts of a salient object. In this paper, we propose a graph interaction network (i.e. GINet) with semantic-guided spatial refinement to conduct salient object detection in optical RSIs. The key advantages of GINet lie in two points. Firstly, the graph interactive reasoning (GIR) module conducts information exchange among different-level features via the graph interaction operation, and enhances features along the spatial and channel dimensions via the graph reasoning operation. Secondly, we designed the global content-aware refinement (GCR) module, which simultaneously incorporates local information based on foreground and background features and global information based on semantic features. Experimental results on two public optical RSI datasets clearly show the effectiveness and superiority of the proposed GINet compared with state-of-the-art models.
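
The graph reasoning operation is described only conceptually; below is a generic, minimal sketch in that spirit: project the feature map onto a small set of graph nodes, mix information among the nodes, and project back. Layer names and the node count are illustrative assumptions, not GINet's actual design.

```python
import torch
import torch.nn as nn

class GraphReasoning(nn.Module):
    """Project pixels to K graph nodes, mix the nodes with a learned
    adjacency-like step, and project back as a residual refinement."""
    def __init__(self, channels: int, num_nodes: int = 16):
        super().__init__()
        self.assign = nn.Conv2d(channels, num_nodes, kernel_size=1)  # soft assignment
        self.node_mix = nn.Linear(num_nodes, num_nodes, bias=False)  # mixing across nodes
        self.node_update = nn.Linear(channels, channels)

    def forward(self, x):
        b, c, h, w = x.shape
        a = self.assign(x).flatten(2).softmax(dim=-1)               # (B, K, HW)
        feats = x.flatten(2)                                        # (B, C, HW)
        nodes = torch.bmm(a, feats.transpose(1, 2))                 # (B, K, C) node features
        mixed = self.node_mix(nodes.transpose(1, 2)).transpose(1, 2)  # exchange across nodes
        nodes = torch.relu(self.node_update(mixed))                 # update node features
        out = torch.bmm(a.transpose(1, 2), nodes)                   # (B, HW, C) back-projection
        return x + out.transpose(1, 2).reshape(b, c, h, w)          # residual refinement

# usage: x = torch.randn(2, 64, 32, 32); print(GraphReasoning(64)(x).shape)
```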

Citations: 0
Cross-stage feature fusion and efficient self-attention for salient object detection
IF 2.6 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-26 · DOI: 10.1016/j.jvcir.2024.104271

Salient Object Detection (SOD) approaches usually aggregate high-level semantics with object details layer by layer through a pyramid fusion structure. However, this progressive feature fusion mechanism may lead to gradual dilution of valuable semantics and reduced prediction accuracy. In this work, we propose a Cross-stage Feature Fusion Network (CFFNet) for salient object detection. CFFNet consists of a Cross-stage Semantic Fusion Module (CSF), a Feature Filtering and Fusion Module (FFM), and a progressive decoder to tackle the above problems. Specifically, to alleviate the semantics dilution problem, CSF concatenates backbone features from different stages and extracts multi-scale global semantics using transformer blocks. Global semantics are then distributed to the corresponding backbone stages for cross-stage semantic fusion. The FFM implements efficient self-attention-based feature fusion, in contrast to regular self-attention, which has quadratic computational complexity. Finally, a progressive decoder is adopted to refine the saliency maps. Experimental results demonstrate that CFFNet outperforms state-of-the-art methods on six SOD datasets.
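
As a reference point for the contrast with quadratic-cost regular self-attention, the sketch below implements one well-known linear-complexity attention variant (softmax applied separately to queries and keys before forming a small channel-wise context matrix); treating this as the flavor of "efficient self-attention" in FFM is an assumption, not CFFNet's exact module.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Linear-complexity attention: instead of an (HW x HW) affinity map,
    aggregate values into a small (C x C) context matrix, so cost grows
    linearly with the number of pixels."""
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).softmax(dim=1)        # (B, C, HW), softmax over channels
        k = self.k(x).flatten(2).softmax(dim=-1)       # (B, C, HW), softmax over positions
        v = self.v(x).flatten(2)                       # (B, C, HW)
        context = torch.bmm(k, v.transpose(1, 2))      # (B, C, C) global context
        out = torch.bmm(context.transpose(1, 2), q)    # (B, C, HW)
        return x + self.proj(out.reshape(b, c, h, w))  # residual output

# usage: x = torch.randn(1, 64, 48, 48); print(EfficientSelfAttention(64)(x).shape)
```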

Citations: 0
Progressive cross-level fusion network for RGB-D salient object detection
IF 2.6 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-24 · DOI: 10.1016/j.jvcir.2024.104268

Depth maps can provide supplementary information for salient object detection (SOD) and help handle complex scenes better. Most existing RGB-D methods only utilize deep cues at the same level, and few methods focus on the information linkage between cross-level features. In this study, we propose a Progressive Cross-level Fusion Network (PCF-Net). It ensures the cross-flow of cross-level features by gradually exploring deeper features, which promotes the interaction and fusion of information between different-level features. First, we designed a Cross-Level Guide Cross-Modal Fusion Module (CGCF) that utilizes the spatial information of upper-level features to suppress modal feature noise and to guide lower-level features for cross-modal feature fusion. Next, the proposed Semantic Enhancement Module (SEM) and Local Enhancement Module (LEM) are used to further introduce deeper features, enhance the high-level semantic information and low-level structural information of cross-modal features, and apply self-modality attention refinement to improve the enhancement effect. Finally, the multi-scale aggregation decoder mines enhanced feature information in multi-scale spaces and effectively integrates cross-scale features. We conducted numerous experiments to demonstrate that the proposed PCF-Net outperforms 16 of the most advanced methods on six popular RGB-D SOD datasets.
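
How the upper-level feature guides lower-level cross-modal fusion is not spelled out; a hedged sketch of one reasonable reading follows, where a spatial gate derived from the upper-level feature suppresses noise in the RGB and depth features before they are fused. All names are illustrative assumptions, not the CGCF implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedCrossModalFusion(nn.Module):
    """Use the spatial layout of an upper-level feature to suppress noise in
    lower-level RGB and depth features, then fuse the two modalities."""
    def __init__(self, channels: int):
        super().__init__()
        self.guide = nn.Conv2d(channels, 1, kernel_size=1)               # spatial guidance map
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, rgb, depth, upper):
        # the upper-level feature is spatially smaller; upsample its guidance map
        gate = torch.sigmoid(self.guide(upper))
        gate = F.interpolate(gate, size=rgb.shape[-2:], mode="bilinear",
                             align_corners=False)
        rgb_g, depth_g = rgb * gate, depth * gate        # gate both modalities
        return self.fuse(torch.cat([rgb_g, depth_g], dim=1))

# usage:
# rgb = torch.randn(1, 64, 64, 64); depth = torch.randn(1, 64, 64, 64)
# upper = torch.randn(1, 64, 32, 32)
# print(GuidedCrossModalFusion(64)(rgb, depth, upper).shape)
```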

Citations: 0
A visible-infrared person re-identification method based on meta-graph isomerization aggregation module
IF 2.6 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-23 · DOI: 10.1016/j.jvcir.2024.104265

Due to the different imaging principles of visible and infrared cameras, there are modal differences between similar person images. For visible-infrared person re-identification (VI-ReID), existing works focus on extracting and aligning cross-modal global descriptors in the shared feature space, while ignoring local variations and graph-structural correlations in cross-modal image pairs. In order to bridge the modal differences using the graph structure between key parts, a Meta-Graph Isomerization Aggregation Module (MIAM) is proposed, which includes a Meta-Graph Node Isomerization Module (MNI) and a Dual Aggregation Module (DA). To completely describe discriminative in-graph local features, MNI establishes meta-secondary cyclic isomorphism relations among in-graph local features by using a multi-branch embedding generation mechanism. The local features then contain not only the limited information of fixed regions but also benefit from neighboring regions. Meanwhile, the secondary node generation process considers similar and different nodes of the pedestrian graph structure to reduce the interference of identity differences. In addition, the Dual Aggregation (DA) module combines spatial self-attention and channel self-attention to establish the interdependence of modal heterogeneous graph-structure pairs and achieve inter-modal feature aggregation. To match heterogeneous graph structures, a Node Center Joint Mining Loss (NCJM Loss) is proposed to constrain the distance between the node centers of heterogeneous graphs. Experiments performed on the SYSU-MM01, RegDB and LLCM public datasets demonstrate that the proposed method performs excellently in VI-ReID.
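
The Node Center Joint Mining Loss is described only as constraining the distance between node centers of heterogeneous graphs; the snippet below is a deliberately simplified reading that pulls the per-identity centers of the visible and infrared node embeddings together. It is an assumption about the general shape of such a loss, not the authors' formulation.

```python
import torch

def node_center_loss(vis_nodes, ir_nodes, labels):
    """vis_nodes, ir_nodes: (N, K, C) node embeddings of the visible and
    infrared graph structures for N samples with K nodes each.
    labels: (N,) identity labels. Pulls the per-identity centers of the two
    modalities together -- a simplified center-alignment objective."""
    vis_center = vis_nodes.mean(dim=1)   # (N, C) graph-level center per sample
    ir_center = ir_nodes.mean(dim=1)
    loss = vis_center.new_zeros(())
    identities = labels.unique()
    for identity in identities:
        mask = labels == identity
        c_vis = vis_center[mask].mean(dim=0)   # identity center, visible modality
        c_ir = ir_center[mask].mean(dim=0)     # identity center, infrared modality
        loss = loss + torch.norm(c_vis - c_ir, p=2)
    return loss / identities.numel()

# usage:
# vis = torch.randn(8, 6, 256); ir = torch.randn(8, 6, 256)
# labels = torch.randint(0, 4, (8,))
# print(node_center_loss(vis, ir, labels))
```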

Citations: 0
DeepFake detection method based on multi-scale interactive dual-stream network
IF 2.6 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-22 · DOI: 10.1016/j.jvcir.2024.104263

DeepFake face forgery has a serious negative impact on both society and individuals. Therefore, research on DeepFake detection technologies is necessary. At present, DeepFake detection technology based on deep learning has achieved acceptable results on high-quality datasets; however, its detection performance on low-quality datasets and cross-dataset settings remains poor. To address this problem, this paper presents a multi-scale interactive dual-stream network (MSIDSnet). The network is divided into spatial- and frequency-domain streams and uses a multi-scale fusion module to capture both the facial features of images that have been manipulated in the spatial domain under different circumstances and the fine-grained high-frequency noise information of forged images. The network fully integrates the features of the spatial- and frequency-domain streams through an interactive dual-stream module and uses a vision transformer (ViT) to further learn the global information of the forged facial features for classification. Experimental results confirm that the accuracy of this method reached 99.30 % on the high-quality dataset Celeb-DF-v2 and 95.51 % on the low-quality dataset FaceForensics++. Moreover, the results of the cross-dataset experiments were superior to those of the other comparison methods.
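
The abstract does not say how the frequency-domain stream obtains its fine-grained high-frequency noise input; a common choice in forgery detection is a fixed high-pass residual filter, sketched below as one plausible pre-processing step (the specific kernel is an assumption, not the paper's).

```python
import torch
import torch.nn.functional as F

def high_frequency_residual(image: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W). Returns a per-channel high-pass residual that
    suppresses scene content and keeps high-frequency noise patterns,
    a typical input to a frequency/noise stream in forgery detection."""
    # simple 3x3 Laplacian-style high-pass kernel, applied per channel
    kernel = torch.tensor([[-1., -1., -1.],
                           [-1.,  8., -1.],
                           [-1., -1., -1.]], device=image.device) / 8.0
    weight = kernel.expand(image.shape[1], 1, 3, 3).contiguous()
    return F.conv2d(image, weight, padding=1, groups=image.shape[1])

# usage: x = torch.rand(1, 3, 224, 224); print(high_frequency_residual(x).shape)
```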

Citations: 0
DiZNet: An end-to-end text detection and recognition algorithm with detail in text zone
IF 2.6 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-22 · DOI: 10.1016/j.jvcir.2024.104261

This paper proposes an efficient and novel end-to-end text detection and recognition framework called DiZNet. DiZNet is built upon a core representation using text detail maps and employs the classical lightweight ResNet18 as the backbone of the text detection and recognition model. The redesigned Text Attention Head (TAH) takes multiple shallow backbone features as input, effectively extracting pixel-level information of text in images and global text positional features. The extracted text features are integrated into the stackable Feature Pyramid Enhancement Fusion Module (FPEFM). Supervised with text detail map labels, which include the boundary information and texture of important text, the model predicts text detail maps and fuses them into the text detection and recognition heads. Through end-to-end testing on publicly available natural scene text benchmark datasets, our approach demonstrates robust generalization capabilities and real-time detection speeds. Leveraging the advantages of the text detail map representation, DiZNet achieves a good balance between precision and efficiency on challenging datasets. For example, DiZNet achieves 91.2% precision and 85.9% F-measure at 38.4 FPS on Total-Text, and 83.8% F-measure at 30.0 FPS on ICDAR2015. The code is publicly available at: https://github.com/DiZ-gogogo/DiZNet
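
Text detail maps are described as carrying the boundary information and texture of important text; the sketch below shows one hedged way to derive a boundary-style detail label from a binary text mask, using max-pooling as morphological dilation/erosion. It is an illustration only; the repository linked above is the authoritative implementation.

```python
import torch
import torch.nn.functional as F

def text_detail_label(mask: torch.Tensor, width: int = 3) -> torch.Tensor:
    """mask: (B, 1, H, W) binary text mask in {0, 1}.
    Returns a boundary band of roughly `width` pixels around each text
    region -- one simple notion of a 'text detail map' supervision signal."""
    pad = width // 2
    dilated = F.max_pool2d(mask, kernel_size=width, stride=1, padding=pad)
    eroded = 1.0 - F.max_pool2d(1.0 - mask, kernel_size=width, stride=1, padding=pad)
    return (dilated - eroded).clamp(0.0, 1.0)   # 1 on the text boundary band

# usage:
# m = torch.zeros(1, 1, 32, 32); m[..., 10:20, 8:24] = 1.0
# print(text_detail_label(m).sum())
```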

Citations: 0
Efficient SpineUNetX for X-ray: A spine segmentation network based on ConvNeXt and UNet
IF 2.6 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-01 · DOI: 10.1016/j.jvcir.2024.104245

Accurate localization and delineation of vertebrae are crucial for diagnosing and treating spinal disorders. To achieve this, we propose an efficient X-ray full-spine vertebra instance segmentation method based on an enhanced U-Net architecture. Several key improvements have been made: the ConvNeXt encoder is employed to effectively capture complex features, and IFE feature extraction is introduced in the skip connections to focus on texture-rich and edge-clear cues. The CBAM attention mechanism is used in the bottleneck to integrate coarse- and fine-grained semantic information. The decoder employs a residual structure combined with skip connections to achieve multi-scale contextual information and feature fusion. Our method has been validated through experiments on anterior-posterior and lateral spinal segmentation, demonstrating robust feature extraction and precise semantic segmentation capabilities. It effectively handles various spinal disorders, including scoliosis, vertebral wedging, lumbar spondylolisthesis and spondylolysis. This segmentation foundation enables rapid calibration of vertebral parameters and the computation of relevant metrics, providing valuable references and guidance for advancements in medical imaging.
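
CBAM is a published attention module (channel attention followed by spatial attention); the following compact re-implementation sketch is included for reference, with the reduction ratio and spatial kernel size set to typical defaults rather than the values necessarily used in SpineUNetX.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention from pooled
    descriptors through a shared MLP, then spatial attention from pooled
    channel maps through a 7x7 convolution."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))                       # (B, C) from average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))                        # (B, C) from max pooling
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)         # channel attention
        sp = torch.cat([x.mean(dim=1, keepdim=True),
                        x.amax(dim=1, keepdim=True)], dim=1)     # (B, 2, H, W)
        return x * torch.sigmoid(self.spatial(sp))               # spatial attention

# usage: x = torch.randn(1, 64, 32, 32); print(CBAM(64)(x).shape)
```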

Citations: 0
Bilevel learning approach for nonlocal p-Laplacien image deblurring with variable weights parameter w(x)
IF 2.6 · CAS Zone 4, Computer Science · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS · Pub Date: 2024-08-01 · DOI: 10.1016/j.jvcir.2024.104248

This manuscript introduces an innovative bilevel optimization approach designed to improve the deblurring process by incorporating a nonlocal p-Laplacien model with variable weights. The study includes a theoretical analysis to examine the model’s solution, and an effective algorithm is devised for computing the pristine image, incorporating the learning of parameters associated with weights and nonlocal regularization terms. By carefully selecting these parameters, the suggested nonlocal deblurring model demonstrates superior effectiveness and performance when compared to other existing models.
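
In generic bilevel-learning notation, and purely as an assumed reading of the setup (the paper's exact functionals and discretization may differ), the structure is:

```latex
% Upper level: choose the weights w(x) and regularization strength \lambda
% so that the deblurred images match the clean training images.
\min_{w(\cdot),\ \lambda \ge 0} \; \sum_{i} \frac{1}{2}\,
      \bigl\| \hat{u}_i(w,\lambda) - u_i^{\mathrm{clean}} \bigr\|_2^2
\qquad \text{s.t.} \qquad
\hat{u}_i(w,\lambda) \;\in\; \operatorname*{arg\,min}_{u} \;
      \frac{1}{2}\,\| K u - f_i \|_2^2
      \;+\; \lambda \sum_{x} \sum_{y \in \mathcal{N}(x)}
            w(x)\, \bigl| u(x) - u(y) \bigr|^{p}
```

Here K denotes the blur operator, f_i the observed blurred images, u_i^clean the corresponding ground-truth images, and N(x) the nonlocal neighborhood of pixel x; the lower-level problem is the nonlocal p-Laplacian deblurring model with variable weights w(x), and the upper level learns w(x) and the regularization strength from training pairs.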

Citations: 0