Versatile depth estimator based on common relative depth estimation and camera-specific relative-to-metric depth conversion
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104252
A typical monocular depth estimator is trained for a single camera, so its performance drops severely on images taken with different cameras. To address this issue, we propose a versatile depth estimator (VDE), composed of a common relative depth estimator (CRDE) and multiple relative-to-metric converters (R2MCs). The CRDE extracts relative depth information, and each R2MC converts that relative information into metric depth predictions for a specific camera. The proposed VDE can cope with diverse scenes, including both indoor and outdoor scenes, with only a 1.12% parameter increase per camera. Experimental results demonstrate that the VDE supports multiple cameras effectively and efficiently, and that it also achieves state-of-the-art performance in the conventional single-camera scenario.
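The abstract does not describe the R2MC internals, but the overall split — one shared relative-depth network plus a small converter per camera — can be sketched as follows. All module names, layer sizes, and camera identifiers here are illustrative assumptions, not the paper's architecture; the point is only that adding a camera costs a tiny head rather than a full model.

```python
import torch
import torch.nn as nn

class ToyVersatileDepth(nn.Module):
    """Illustrative sketch: a shared relative-depth backbone plus small
    per-camera heads that convert relative depth to metric depth.
    This is NOT the paper's architecture, only the general idea."""

    def __init__(self, camera_ids):
        super().__init__()
        # Shared "CRDE"-like backbone (here just a toy conv stack).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),  # relative depth map
        )
        # One lightweight converter per camera ("R2MC"-like heads).
        self.converters = nn.ModuleDict({
            cam: nn.Conv2d(1, 1, 1)  # per-camera 1x1 conv: scale + shift
            for cam in camera_ids
        })

    def forward(self, image, camera_id):
        relative = self.backbone(image)                # camera-agnostic relative depth
        metric = self.converters[camera_id](relative)  # camera-specific metric depth
        return relative, metric

model = ToyVersatileDepth(["kitti_cam", "nyu_cam"])
x = torch.randn(1, 3, 64, 64)
rel, met = model(x, "nyu_cam")
```

In this toy setup, supporting an extra camera adds only the parameters of one 1x1 convolution, mirroring the paper's claim of a small per-camera parameter increase.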
EM-Gait: Gait recognition using motion excitation and feature embedding self-attention
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104266
Gait recognition, which enables long-distance and contactless identification, is an important biometric technology. Recent gait recognition methods focus on learning the pattern of human movement or appearance during walking and construct corresponding spatio-temporal representations. However, each individual follows their own movement patterns, and simple spatio-temporal features struggle to describe the motion changes of human body parts, especially when confounding variables such as clothing and carried objects are involved, which reduces the distinguishability of the features. To this end, we propose the Embedding and Motion (EM) block and the Fine Feature Extractor (FFE) to capture the motion mode of walking and enhance the differences between local motion patterns. The EM block consists of a Motion Excitation (ME) module, which captures temporal motion changes, and an Embedding Self-attention (ES) module, which enhances the expression of motion patterns. Specifically, without introducing additional parameters, the ME module learns difference information between frames and intervals to obtain a dynamic representation of walking for frame sequences of uncertain length. In contrast, the ES module divides the feature map hierarchically based on element values, blurring the differences between elements to highlight the motion track. Furthermore, we present the FFE, which independently learns spatio-temporal representations of the human body for different horizontal parts of an individual. Benefiting from the EM block and our proposed motion branch, our method combines motion change information in a novel way, significantly improving performance under cross-appearance conditions. On the popular CASIA-B dataset, the proposed EM-Gait outperforms existing single-modal gait recognition methods.
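The ME module is described as learning inter-frame difference information without extra parameters. A rough, parameter-free sketch of that idea — re-weighting per-frame features by the magnitude of their temporal change — might look like the following; this illustrates frame-difference excitation in general, not the paper's exact formulation.

```python
import torch

def motion_excitation(features):
    """Parameter-free sketch of frame-difference motion excitation.
    features: (B, T, C, H, W) per-frame feature maps.
    Returns features re-weighted by the magnitude of temporal change;
    only an illustration of exciting motion-sensitive channels."""
    diff = features[:, 1:] - features[:, :-1]          # (B, T-1, C, H, W) temporal differences
    diff = torch.cat([diff, diff[:, -1:]], dim=1)      # pad so all T frames keep a weight
    attention = torch.sigmoid(diff.mean(dim=(3, 4), keepdim=True))  # channel-wise motion gate
    return features * attention

feats = torch.randn(2, 8, 16, 32, 32)   # 2 sequences, 8 frames, 16 channels
excited = motion_excitation(feats)
print(excited.shape)                    # torch.Size([2, 8, 16, 32, 32])
```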
Fusing structure from motion and simulation-augmented pose regression from optical flow for challenging indoor environments
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104256
The localization of objects is essential in many applications, such as robotics, virtual and augmented reality, and warehouse logistics. Recent advancements in deep learning have enabled localization using monocular cameras. Traditionally, structure from motion (SfM) techniques predict an object’s absolute position from a point cloud, while absolute pose regression (APR) methods use neural networks to understand the environment semantically. However, both approaches face challenges from environmental factors like motion blur, lighting changes, repetitive patterns, and featureless areas. This study addresses these challenges by incorporating additional information and refining absolute pose estimates with relative pose regression (RPR) methods. RPR also struggles with issues like motion blur. To overcome this, we compute the optical flow between consecutive images using the Lucas–Kanade algorithm and use a small recurrent convolutional network to predict relative poses. Combining absolute and relative poses is difficult due to differences between global and local coordinate systems. Current methods use pose graph optimization (PGO) to align these poses. In this work, we propose recurrent fusion networks that better integrate absolute and relative pose predictions, enhancing the accuracy of absolute pose estimates. We evaluate eight different recurrent units and create a simulation environment to pre-train the APR and RPR networks for improved generalization. Additionally, we record a large dataset covering various scenarios in a challenging indoor environment resembling a warehouse with transportation robots. Through hyperparameter searches and experiments, we demonstrate that our recurrent fusion method is more effective than PGO.
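As a concrete illustration of the optical-flow front end the abstract describes, sparse Lucas–Kanade flow between consecutive frames can be computed with OpenCV as sketched below. The paper's exact preprocessing and the input format fed to the recurrent relative-pose network are assumptions here.

```python
import cv2
import numpy as np

def lk_flow(prev_gray, next_gray, max_corners=200):
    """Track corner points between two consecutive grayscale frames
    with pyramidal Lucas-Kanade optical flow (sketch only)."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good = status.ravel() == 1
    return pts[good].reshape(-1, 2), next_pts[good].reshape(-1, 2)

# Example with synthetic frames (in practice, consecutive video frames).
prev = np.random.randint(0, 255, (240, 320), dtype=np.uint8)
nxt = np.roll(prev, 2, axis=1)          # fake horizontal motion
p0, p1 = lk_flow(prev, nxt)
flow_vectors = p1 - p0                  # per-point displacement, e.g. input to an RPR network
```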
Joint multi-scale transformers and pose equivalence constraints for 3D human pose estimation
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104247
Unlike image-based 3D pose estimation, video-based 3D pose estimation gains performance improvements from temporal information. However, these methods still generalize insufficiently across human motion speed, body shape, and camera distance. To address these problems, we propose a novel approach, referred to as joint Spatial–temporal Multi-scale Transformers and Pose Transformation Equivalence Constraints (SMT-PTEC), for 3D human pose estimation from videos. We design a more general spatial–temporal multi-scale feature extraction strategy and introduce optimization constraints that adapt to the diversity of the data to improve the accuracy of pose estimation. Specifically, we first introduce a spatial multi-scale transformer to extract multi-scale pose features and establish a cross-scale information transfer mechanism, which effectively explores the underlying knowledge of human motion. Then, we present a temporal multi-scale transformer to explore multi-scale dependencies between frames, enhance the adaptability of the network to different human motion speeds, and improve estimation accuracy through a context-aware fusion of multi-scale predictions. Moreover, we add pose transformation equivalence constraints by transforming the training samples with horizontal flipping, scaling, and body shape transformations, which effectively overcomes the influence of camera distance and body shape on prediction accuracy. Extensive experimental results demonstrate that our approach achieves superior performance with lower computational complexity than previous state-of-the-art methods. Code is available at https://github.com/JNGao123/SMT-PTEC.
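The abstract mentions equivalence constraints built from horizontal flipping, scaling, and body shape transformations. A minimal sketch of one such constraint — consistency under horizontal flipping — is shown below; the joint index lists and the loss form are hypothetical and only illustrate the equivalence idea, not the paper's constraints.

```python
import torch

# Hypothetical left/right joint index pairs for a 17-joint skeleton (illustrative only).
LEFT = [4, 5, 6, 11, 12, 13]
RIGHT = [1, 2, 3, 14, 15, 16]

def flip_lr(pose):
    """Mirror a (B, J, 2) or (B, J, 3) pose: negate x and swap left/right joints."""
    flipped = pose.clone()
    flipped[..., 0] = -flipped[..., 0]
    flipped[:, LEFT + RIGHT] = flipped[:, RIGHT + LEFT]
    return flipped

def flip_equivalence_loss(model, joints2d):
    """One possible equivalence constraint: the prediction for a flipped input
    should match the flipped prediction for the original input."""
    pred = model(joints2d)                       # (B, J, 3) estimated 3D pose
    pred_from_flipped = model(flip_lr(joints2d))
    return torch.mean((flip_lr(pred) - pred_from_flipped) ** 2)

pose = torch.randn(4, 17, 3)
assert torch.allclose(flip_lr(flip_lr(pose)), pose)   # flipping twice is the identity
```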
Detecting and tracking moving objects in defocus blur scenes
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104259
Object tracking stands as a cornerstone challenge within computer vision, with blurriness analysis representing a burgeoning field of interest. Among the various forms of blur encountered in natural scenes, defocus blur remains significantly underexplored. To bridge this gap, this article introduces the Defocus Blur Video Object Tracking (DBVOT) dataset, specifically crafted to facilitate research in visual object tracking under defocus blur conditions. We conduct a comprehensive performance analysis of 18 state-of-the-art object tracking methods on this unique dataset. Additionally, we propose a selective deblurring framework based on Deblurring Auxiliary Learning Net (DID-Anet), innovatively designed to tackle the complexities of defocus blur. This framework integrates a novel defocus blurriness metric for the smart deblurring of video frames, thereby enhancing the efficacy of tracking methods in defocus blur scenarios. Our extensive experimental evaluations underscore the significant advancements in tracking accuracy achieved by incorporating our proposed framework with leading tracking technologies.
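The abstract does not state which defocus blurriness metric drives the selective deblurring. A common sharpness proxy, the variance of the Laplacian, can serve to sketch the decision logic; the threshold, the metric, and the deblur_fn interface below are assumptions, not DID-Anet's actual components.

```python
import cv2

def blurriness_score(frame_gray):
    """Variance of the Laplacian: a common sharpness proxy
    (the paper's own defocus-blurriness metric is not given here)."""
    return cv2.Laplacian(frame_gray, cv2.CV_64F).var()

def maybe_deblur(frame_gray, deblur_fn, threshold=100.0):
    """Selective deblurring: only run the (expensive) deblurring network
    when the frame looks blurry, then hand the result to the tracker."""
    if blurriness_score(frame_gray) < threshold:   # low variance -> likely blurred
        return deblur_fn(frame_gray)
    return frame_gray
```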
A lightweight target tracking algorithm based on online correction for meta-learning
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104228
Traditional Siamese-network-based object tracking algorithms suffer from high computational complexity, making them difficult to run on embedded devices. Moreover, their success rates decline significantly on long-term tracking tasks. To address these issues, we propose a lightweight long-term object tracking algorithm based on meta-learning, called Meta-Master-based Ghost Fast Tracking (MGTtracker). The algorithm integrates the Ghost mechanism to create a lightweight backbone network, G-ResNet, which extracts target features accurately while operating quickly. We design a tiny adaptive weighted fusion feature pyramid network (TiFPN) to enhance feature information fusion and mitigate interference from similar objects. We introduce a lightweight region regression network, the Ghost Decouple Net (GDNet), for target position prediction. Finally, we propose a meta-learning-based online template correction mechanism, Meta-Master, to overcome error accumulation in long-term tracking and the difficulty of reacquiring targets after they are lost. We evaluate the algorithm on the public datasets OTB100, VOT2020, VOT2018LT, and LaSOT and deploy it on a Jetson Xavier NX for performance testing. Experimental results demonstrate the effectiveness and superiority of the algorithm. Compared with existing classic object tracking algorithms, our approach runs faster, reaching 25 FPS on the NX, and real-time correction enhances its robustness. While comparable in accuracy and EAO, our algorithm outperforms comparable trackers in speed and effectively addresses the problems of accumulated errors and easy target loss during tracking. Code is released at https://github.com/ygh96521/MGTtracker.git.
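The "Ghost mechanism" behind G-ResNet presumably refers to GhostNet-style ghost modules, which generate part of each layer's feature maps with cheap depthwise operations instead of full convolutions. A generic sketch of such a module follows; the channel counts and kernel sizes are illustrative, not the paper's G-ResNet configuration.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Generic GhostNet-style module: a small primary convolution produces
    intrinsic features, and cheap depthwise operations generate the
    remaining ("ghost") features, which are concatenated."""

    def __init__(self, in_ch, out_ch, ratio=2, kernel=1, cheap_kernel=3):
        super().__init__()
        primary_ch = out_ch // ratio
        ghost_ch = out_ch - primary_ch
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, ghost_ch, cheap_kernel, padding=cheap_kernel // 2,
                      groups=primary_ch, bias=False),   # depthwise "cheap" operation
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        primary = self.primary(x)
        ghost = self.cheap(primary)
        return torch.cat([primary, ghost], dim=1)

block = GhostModule(64, 128)
y = block(torch.randn(1, 64, 56, 56))   # -> shape (1, 128, 56, 56)
```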
Reversible data hiding for color images based on prediction-error value ordering and adaptive embedding
Pub Date: 2024-07-22 | DOI: 10.1016/j.jvcir.2024.104239
Prediction-error value ordering (PEVO) is an efficient implementation of reversible data hiding (RDH) that is well suited to color images because it exploits inter-channel and intra-channel correlations simultaneously. However, the existing PEVO method falls slightly short in the mapping selection stage: the candidate mappings are selected in advance under conditions inconsistent with the actual embedding, which is not optimal. Therefore, in this paper, a novel RDH method for color images based on PEVO and adaptive embedding is proposed to implement adaptive two-dimensional (2D) modification for PEVO. First, an improved particle swarm optimization (IPSO) algorithm based on PEVO is designed to alleviate the high time complexity caused by parameter determination and to implement adaptive 2D modification for PEVO. Next, to further optimize the mapping used in embedding, an improved adaptive 2D mapping generation strategy is proposed that incorporates the position information of points. In addition, a dynamic payload partition strategy is proposed to improve the embedding performance. Finally, the experimental results show that the PSNR of the image Lena reaches 62.94 dB and that the average PSNR of the proposed method is 1.46 dB higher than that of state-of-the-art methods for an embedding capacity of 20,000 bits.
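PEVO builds on prediction-error expansion: a pixel is predicted from its (value-ordered) neighbors, and expandable prediction errors are modified to carry payload bits reversibly. A textbook one-pixel sketch of that basic mechanism follows; it is not the paper's adaptive 2D mapping or its IPSO-driven parameter selection.

```python
def pee_embed(pixel, predicted, bit):
    """Minimal prediction-error expansion: errors 0 and -1 are expanded to
    carry one bit; all other errors are shifted to keep the map invertible."""
    e = pixel - predicted
    if e in (0, -1):
        e_marked = 2 * e + bit       # expandable error carries the payload bit
    elif e >= 1:
        e_marked = e + 1             # shift positive errors out of the way
    else:                            # e <= -2
        e_marked = e - 1
    return predicted + e_marked

def pee_extract(marked_pixel, predicted):
    """Inverse step: recover the original pixel and the embedded bit (if any)."""
    e_marked = marked_pixel - predicted
    if e_marked in (0, 1):
        return predicted, e_marked               # original error was 0
    if e_marked in (-2, -1):
        return predicted - 1, e_marked + 2       # original error was -1
    if e_marked >= 2:
        return predicted + e_marked - 1, None    # shifted, no bit
    return predicted + e_marked + 1, None

# Round trip: embed a 1 into a pixel whose predictor says 100.
marked = pee_embed(100, 100, 1)      # -> 101
print(pee_extract(marked, 100))      # -> (100, 1)
```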
EMCFN: Edge-based Multi-scale Cross Fusion Network for video frame interpolation
Pub Date: 2024-07-09 | DOI: 10.1016/j.jvcir.2024.104226
Video frame interpolation (VFI) is used to synthesize one or more intermediate frames between two frames in a video sequence to improve the temporal resolution of the video. However, many methods still face challenges when dealing with complex scenes involving high-speed motion, occlusions, and other factors. To address these challenges, we propose an Edge-based Multi-scale Cross Fusion Network (EMCFN) for VFI. We integrate a feature enhancement module (FEM) based on edge information into the U-Net architecture, resulting in richer and more complete feature maps, while also enhancing the preservation of image structure and details. This contributes to generating more accurate and realistic interpolated frames. At the same time, we use a multi-scale cross fusion frame synthesis model (MCFM) composed of three GridNet branches to generate high-quality interpolation frames. We have conducted a series of experiments and the results show that our model exhibits satisfactory performance on different datasets compared with the state-of-the-art methods.
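The FEM is said to be based on edge information; one plausible source of such information is a simple gradient-magnitude edge map, sketched below with a Sobel operator. The actual edge extractor and how the FEM injects it into the U-Net are not specified in this abstract, so this is only an assumed front end.

```python
import cv2
import numpy as np

def edge_map(frame_gray):
    """Normalized Sobel gradient magnitude, one simple form of 'edge
    information' that could guide a feature enhancement module."""
    gx = cv2.Sobel(frame_gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(frame_gray, cv2.CV_32F, 0, 1, ksize=3)
    mag = cv2.magnitude(gx, gy)
    return mag / (mag.max() + 1e-8)          # edge strength in [0, 1]

frame = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
edges = edge_map(frame)                      # could be stacked with features as extra guidance
```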
FISTA acceleration inspired network design for underwater image enhancement
Pub Date: 2024-07-08 | DOI: 10.1016/j.jvcir.2024.104224
Underwater image enhancement, especially color restoration and detail reconstruction, remains a significant challenge. Current models focus on improving accuracy and learning efficiency through neural network design, often neglecting the benefits of traditional optimization algorithms. We propose FAIN-UIE, a novel approach for color and fine-texture recovery in underwater imagery. It leverages insights from the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) to approximate image degradation, increasing the speed at which the network fits. FAIN-UIE integrates a residual degradation module (RDM) and a momentum calculation module (MC) to simulate gradient descent and momentum, and addresses feature fusion losses with the Feature Merge Block (FMB). By integrating multi-scale information and inter-stage pathways, our method effectively maps multi-stage image features, advancing color and fine-texture restoration. Experimental results validate its robust performance, positioning FAIN-UIE as a competitive solution for practical underwater imaging applications.
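For readers unfamiliar with the optimization scheme the network design draws on, a standard FISTA iteration for the LASSO problem is sketched below: a proximal gradient step followed by Nesterov-style momentum on the iterates. This only illustrates the acceleration idea; it is not the paper's underwater restoration model.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_lasso(A, b, lam, n_iter=200):
    """Classic FISTA for  min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)
        x_next = soft_threshold(y - grad / L, lam / L)     # proximal gradient step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum on the iterates
        x, t = x_next, t_next
    return x

A = np.random.randn(50, 100)
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true
x_hat = fista_lasso(A, b, lam=0.1)          # sparse recovery example
```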