
Journal of Visual Communication and Image Representation: Latest Publications

Crowd counting network based on attention feature fusion and multi-column feature enhancement
IF 2.6 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-10-28 | DOI: 10.1016/j.jvcir.2024.104323
Qian Liu, Yixiong Zhong, Jiongtao Fang
Density map estimation is commonly used for crowd counting. However, using it alone may make some individuals difficult to recognize, due to target occlusion, scale variation, complex backgrounds and heterogeneous crowd distribution. To alleviate these problems, we propose a two-stage crowd counting network based on attention feature fusion and multi-column feature enhancement (AFF-MFE-TNet). In the first stage, AFF-MFE-TNet transforms the input image into a probability map. In the second stage, a multi-column feature enhancement module is constructed to enhance features by expanding the receptive fields, a dual attention feature fusion module is designed to adaptively fuse features of different scales through attention mechanisms, and a triple counting loss is presented for AFF-MFE-TNet, which fits the ground-truth probability maps and density maps better and improves the counting performance. Experimental results show that AFF-MFE-TNet effectively improves the accuracy of crowd counting compared with state-of-the-art methods.
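For illustration only, a minimal PyTorch sketch of how a dual attention feature fusion module of this kind might combine features of two scales (channel attention followed by spatial attention) is given below. It is not the authors' implementation, and all module and variable names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualAttentionFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # channel attention: squeeze spatial dims, re-weight each channel
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // 4), nn.ReLU(),
            nn.Linear(channels // 4, channels), nn.Sigmoid())
        # spatial attention: conv over concatenated avg/max maps
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, shallow, deep):
        # upsample the coarser features to the shallow-branch resolution
        deep = F.interpolate(deep, size=shallow.shape[2:],
                             mode='bilinear', align_corners=False)
        fused = shallow + deep
        # channel attention weights (B, C)
        w = self.channel_fc(fused.mean(dim=(2, 3)))
        fused = fused * w.unsqueeze(-1).unsqueeze(-1)
        # spatial attention weights (B, 1, H, W)
        avg_map = fused.mean(dim=1, keepdim=True)
        max_map = fused.max(dim=1, keepdim=True).values
        s = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return fused * s
```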
Citations: 0
MVP-HOT: A Moderate Visual Prompt for Hyperspectral Object Tracking
IF 2.6 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-10-26 | DOI: 10.1016/j.jvcir.2024.104326
Lin Zhao, Shaoxiong Xie, Jia Li, Ping Tan, Wenjin Hu
The growing attention to hyperspectral object tracking (HOT) can be attributed to the extended spectral information available in hyperspectral images (HSIs), especially in complex scenarios. This potential makes it a promising alternative to traditional RGB-based tracking methods. However, the scarcity of large hyperspectral datasets poses a challenge for training robust hyperspectral trackers with deep learning methods. Prompt learning, a new paradigm emerging from large language models, adapts or fine-tunes a pre-trained model for a specific downstream task by providing task-specific inputs. Inspired by the recent success of prompt learning in language and visual tasks, we propose a novel and efficient prompt learning method for HOT tasks, termed Moderate Visual Prompt for HOT (MVP-HOT). Specifically, MVP-HOT freezes the parameters of the pre-trained model and employs HSIs as visual prompts to leverage the knowledge of the underlying RGB model. Additionally, we develop a moderate and effective strategy to incrementally adapt the HSI prompt information. Our method uses only a few (1.7M) learnable parameters, and extensive experiments demonstrate its effectiveness: MVP-HOT achieves state-of-the-art performance on three hyperspectral datasets.
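The core prompt-learning idea, freezing the pre-trained RGB tracker and training only a tiny module that turns HSI bands into a visual prompt, can be sketched as follows. This is an assumption-laden illustration, not the MVP-HOT code; the number of bands and all names are hypothetical.

```python
import torch
import torch.nn as nn

class HSIPromptAdapter(nn.Module):
    """Small learnable module that projects hyperspectral bands into a
    3-channel prompt added to the RGB input of a frozen tracker."""
    def __init__(self, num_bands=16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(num_bands, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 3, 3, padding=1))

    def forward(self, hsi, rgb):
        # prompt-augmented input for the frozen RGB model
        return rgb + self.proj(hsi)

def freeze_backbone(tracker: nn.Module):
    # only the prompt adapter stays trainable, keeping the learnable
    # parameter count small (the abstract reports roughly 1.7M)
    for p in tracker.parameters():
        p.requires_grad = False
```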
Citations: 0
Exploring the rate-distortion-complexity optimization in neural image compression
IF 2.6 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-10-26 | DOI: 10.1016/j.jvcir.2024.104294
Yixin Gao, Runsen Feng, Zongyu Guo, Zhibo Chen
Despite a short history, neural image codecs have been shown to surpass classical image codecs in terms of rate–distortion performance. However, most of them suffer from significantly longer decoding times, which hinders the practical application of neural image codecs. This issue is especially pronounced when employing an effective yet time-consuming autoregressive context model, since it increases entropy decoding time by orders of magnitude. In this paper, unlike most previous works that pursue optimal RD performance while overlooking the coding complexity, we make a systematic investigation of the rate–distortion–complexity (RDC) optimization in neural image compression. By quantifying the decoding complexity as a factor in the optimization goal, we are able to precisely control the RDC trade-off and then demonstrate how the rate–distortion performance of neural image codecs can adapt to various complexity demands. Going beyond the investigation of RDC optimization, a variable-complexity neural codec is designed to leverage spatial dependencies adaptively according to industrial demands, supporting fine-grained complexity adjustment by balancing the RDC trade-off. By implementing this scheme in a powerful base model, we demonstrate the feasibility and flexibility of RDC optimization for neural image codecs.
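The central idea of folding decoding complexity into the training objective can be written as a single weighted loss, roughly L = R + λ·D + γ·C. The sketch below only illustrates that form with hand-picked weights and an assumed complexity proxy; it is not the paper's exact formulation.

```python
def rdc_loss(rate_bpp, distortion_mse, decode_complexity,
             lam=0.01, gamma=0.001):
    """Rate-distortion-complexity objective (illustrative):
    rate_bpp           - estimated bits per pixel
    distortion_mse     - reconstruction error
    decode_complexity  - a proxy such as predicted decoding FLOPs or the
                         fraction of pixels decoded autoregressively
    """
    return rate_bpp + lam * distortion_mse + gamma * decode_complexity

# raising gamma pushes the codec toward cheaper-to-decode configurations,
# trading a little rate-distortion performance for faster decoding
```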
Citations: 0
3D human model guided pose transfer via progressive flow prediction network
IF 2.6 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-10-26 | DOI: 10.1016/j.jvcir.2024.104327
Furong Ma , Guiyu Xia , Qingshan Liu
Human pose transfer aims to transfer a conditional person image to a new target pose. The difficulty lies in modeling the large-scale spatial deformation from the conditional pose to the target one. However, the commonly used 2D data representations and one-step flow prediction scheme lead to unreliable deformation prediction, because of the lack of 3D information guidance and the great changes involved in the pose transfer. Therefore, to bring the original 3D motion information into human pose transfer, we propose to simulate the generation process of a real person image. We drive the 3D human model reconstructed from the conditional person image with the target pose and project it onto the 2D plane. The 2D projection thereby inherits the 3D information of the poses, which can guide the flow prediction. Furthermore, we propose a progressive flow prediction network consisting of two streams. One stream predicts the flow by decomposing the complex pose transformation into multiple sub-transformations. The other generates the features of the target image according to the predicted flow. Besides, to enhance the reliability of the generated invisible regions, we use the target pose information, which contains structural information from the flow prediction stream, as supplementary information for the feature generation. The synthesized images with accurate depth information and sharp details demonstrate the effectiveness of the proposed method.
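A hedged sketch of the progressive-flow idea, accumulating residual flows over several sub-transformations and warping the source features at each step, might look like the following; the predictor signature, the normalized-coordinate convention, and all names are assumptions, not the paper's network.

```python
import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Warp a feature map with a dense flow field (B, 2, H, W) expressed in
    normalized [-1, 1] offsets, using bilinear sampling."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing='ij')
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(B, -1, -1, -1).to(feat)
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(feat, grid, align_corners=True)

def progressive_transfer(src_feat, flow_predictors, cond):
    """Each predictor outputs a residual flow for one sub-transformation;
    the flows are accumulated and the source features re-warped each step."""
    total_flow = torch.zeros(src_feat.size(0), 2, *src_feat.shape[2:],
                             device=src_feat.device)
    feat = src_feat
    for predictor in flow_predictors:
        total_flow = total_flow + predictor(feat, cond)
        feat = warp(src_feat, total_flow)
    return feat, total_flow
```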
Citations: 0
GLIC: Underwater target detection based on global–local information coupling and multi-scale feature fusion
IF 2.6 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-10-26 | DOI: 10.1016/j.jvcir.2024.104330
Huipu Xu , Meixiang Zhang , Yongzhi Li
With the rapid development of object detection technology, underwater object detection has attracted widespread attention. Most existing underwater target detection methods are built on convolutional neural networks (CNNs), which still have limitations in utilizing global information and cannot fully capture the key information in images. To overcome the challenge of insufficient global–local feature extraction, an underwater target detector (namely GLIC) based on global–local information coupling and multi-scale feature fusion is proposed in this paper. Our GLIC consists of three main components: spatial pyramid pooling, global–local information coupling, and multi-scale feature fusion. Firstly, we embed spatial pyramid pooling, which improves the robustness of the model while retaining more spatial information. Secondly, we design the feature pyramid network with global–local information coupling: the global context of the transformer branch and the local features of the CNN branch interact with each other to enhance the feature representation. Finally, we construct a Multi-scale Feature Fusion (MFF) module that utilizes balanced semantic features integrated at the same depth for multi-scale feature fusion. In this way, each resolution in the pyramid receives equal information from the others, balancing the information flow and making the features more discriminative. As demonstrated in comprehensive experiments, our GLIC achieves 88.46%, 87.51%, and 74.94% mAP on the URPC2019, URPC2020, and UDD datasets, respectively.
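Of the three components, spatial pyramid pooling is the most standard building block; the minimal cascaded max-pooling sketch below is an illustration only, since the paper's exact configuration (kernel sizes, channel widths) is not given here.

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    """Illustrative SPP block: parallel max-pooling at growing receptive
    fields, concatenated along channels and fused with a 1x1 convolution."""
    def __init__(self, channels, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in pool_sizes)
        self.fuse = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, 1)

    def forward(self, x):
        feats = [x] + [pool(x) for pool in self.pools]
        return self.fuse(torch.cat(feats, dim=1))
```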
Citations: 0
Scene-aware classifier and re-detector for thermal infrared tracking
IF 2.6 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-10-24 | DOI: 10.1016/j.jvcir.2024.104319
Qingbo Ji , Pengfei Zhang , Kuicheng Chen , Lei Zhang , Changbo Hou
Compared with common visible-light scenes, targets in infrared scenes lack information such as color and texture. Infrared images also have low contrast, which leads not only to interference between targets but also to interference between the target and the background. In addition, most infrared tracking algorithms lack a re-detection mechanism after the target is lost, resulting in poor tracking performance after occlusion or blurring. To solve these problems, we propose a scene-aware classifier that dynamically adjusts low-, middle-, and high-level features, improving the ability to utilize features in different infrared scenes. Besides, we design an infrared target re-detector based on a multi-domain convolutional network that learns from the tracked target samples and background samples, improving the ability to identify the differences between the target and the background. The experimental results on VOT-TIR2015, VOT-TIR2017 and LSOTB-TIR show that the proposed algorithm achieves the most advanced results on the three infrared object tracking benchmarks.
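For illustration, the scene-aware weighting of low-, middle-, and high-level features could be realized as a small classifier that predicts per-scene blend weights from a global descriptor; the sketch below is an assumption about one plausible realization, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SceneAwareWeighting(nn.Module):
    """Predict scene-dependent weights for low/middle/high level features
    and blend them (illustrative sketch; feature maps assumed to share
    shape (B, C, H, W) after alignment)."""
    def __init__(self, channels):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 3), nn.Softmax(dim=1))

    def forward(self, low, mid, high):
        w = self.classifier(high)                        # (B, 3) weights
        stacked = torch.stack([low, mid, high], dim=1)   # (B, 3, C, H, W)
        return (stacked * w.view(-1, 3, 1, 1, 1)).sum(dim=1)
```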
Citations: 0
Aesthetic image cropping meets VLP: Enhancing good while reducing bad
IF 2.6 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-10-18 | DOI: 10.1016/j.jvcir.2024.104316
Quan Yuan, Leida Li, Pengfei Chen
Aesthetic Image Cropping (AIC) enhances the visual appeal of an image by adjusting its composition and aesthetic elements. People make these adjustments based on these elements, aiming to enhance appealing aspects while minimizing detrimental factors. Motivated by these observations, we propose a novel approach called CLIPCropping, which simulates the human decision-making process in AIC. CLIPCropping leverages Contrastive Language–Image Pre-training (CLIP) to align visual perception with textual description. It consists of three branches: composition embedding, aesthetic embedding, and image cropping. The composition embedding branch learns principles based on Composition Knowledge Embedding (CKE), while the aesthetic embedding branch learns principles based on Aesthetic Knowledge Embedding (AKE). The image cropping branch evaluates the quality of candidate crops by aggregating knowledge from CKE and AKE; an MLP produces the best result. Extensive experiments on three benchmark datasets — GAICD-1236, GAICD-3336, and FCDB — show that CLIPCropping outperforms state-of-the-art methods and provides insightful interpretations.
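The final crop-scoring step, an MLP that aggregates composition and aesthetic embeddings into a single quality score, might be sketched as follows. How those embeddings are obtained from CLIP is abstracted away, and the embedding dimension and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CropScorer(nn.Module):
    """Illustrative final stage: score a candidate crop from its
    composition and aesthetic embeddings."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1))

    def forward(self, comp_emb, aes_emb):
        return self.mlp(torch.cat([comp_emb, aes_emb], dim=-1)).squeeze(-1)

# usage (hypothetical): score every candidate crop and keep the best one
# scores = scorer(comp_embs, aes_embs)          # (num_crops,)
# best_crop = candidate_boxes[scores.argmax()]
```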
Citations: 0
U-TPE: A universal approximate thumbnail-preserving encryption method for lossless recovery
IF 2.6 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-10-01 | DOI: 10.1016/j.jvcir.2024.104318
Haiju Fan , Shaowei Shi , Ming Li
Due to limited local storage space, more and more people are accustomed to uploading images to the cloud, which has raised concerns about privacy leaks. The traditional solution is to encrypt the images directly. However, in this way, users cannot easily browse the images stored in the cloud; the traditional method loses the visual usability of cloud images. To solve this problem, the Thumbnail-Preserving Encryption (TPE) method was proposed. Although approximate TPE is more efficient than ideal TPE, it cannot restore the original image without damage and cannot encrypt some images with texture features. Inspired by the above, we propose a universal approximate thumbnail-preserving encryption method with lossless recovery. This method divides the image into equal-sized blocks, each of which is further divided into an embedding area and an adjustment area. The pixels of the embedding area are recorded by prediction. Then, the auxiliary information necessary to restore the image is encrypted and hidden in the embedding area of the encrypted image. Finally, the pixel values of the adjustment area in each block are adjusted so that the block average is close to that of the original block. Experimental results show that the proposed method can not only restore images losslessly but also process images with different texture features, achieving good generality. On the BOWS2 dataset, all images can be encrypted by adjusting the block size. In addition, it can resist third-party face recognition and comparison, achieving a satisfactory balance between privacy and visual usability.
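The thumbnail-preserving constraint, keeping each block's mean close to that of the original block so the downscaled thumbnail stays recognizable, can be illustrated with a toy NumPy adjustment. Note the simplification: the paper shifts only the adjustment area while the embedding area carries hidden auxiliary data, whereas this sketch shifts the whole block.

```python
import numpy as np

def adjust_block_mean(block, target_mean):
    """Shift a block's pixel values so its mean approximates the original
    block's mean (toy illustration of the thumbnail-preserving idea)."""
    block = block.astype(np.int16)               # avoid uint8 overflow
    delta = int(round(target_mean - block.mean()))
    return np.clip(block + delta, 0, 255).astype(np.uint8)

# usage on an 8x8 block of the encrypted image (sizes are illustrative):
# encrypted_block = adjust_block_mean(encrypted_block, original_block.mean())
```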
Citations: 0
Deep video steganography using temporal-attention-based frame selection and spatial sparse adversarial attack
IF 2.6 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-10-01 | DOI: 10.1016/j.jvcir.2024.104311
Beijing Chen , Yuting Hong , Yuxin Nie
With the development of deep learning-based steganalysis, video steganography is facing great challenges. To address the insufficient security against steganalysis of existing deep video steganography, and given that video has both spatial and temporal dimensions, this paper proposes a deep video steganography method using temporal frame selection and spatial sparse adversarial attack. In the temporal dimension, a stego frame selection module based on temporal attention is designed to calculate the weight of each frame and select frames with high weights for message and sparse perturbation embedding. In the spatial dimension, sparse adversarial perturbations are applied to the selected frames to improve the ability to resist steganalysis. Moreover, to control the sparsity of the adversarial perturbations flexibly, an intra-frame dynamic sparsity threshold mechanism is designed using percentiles. Experimental results demonstrate that the proposed method effectively enhances the visual quality and the security against steganalysis of video steganography, with controllable sparsity of the adversarial perturbations.
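The intra-frame dynamic sparsity threshold based on percentiles can be illustrated in a few lines of NumPy; the keep-percentage and layout below are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def sparsify_perturbation(perturbation, keep_percent=5.0):
    """Keep only the strongest few percent of adversarial perturbation
    values in a frame (by magnitude), zeroing the rest, so the sparsity
    threshold adapts to each frame's own perturbation distribution."""
    threshold = np.percentile(np.abs(perturbation), 100.0 - keep_percent)
    mask = np.abs(perturbation) >= threshold
    return perturbation * mask
```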
Citations: 0
A robust coverless image-synthesized video steganography based on asymmetric structure
IF 2.6 | CAS Tier 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2024-10-01 | DOI: 10.1016/j.jvcir.2024.104303
Yueshuang Jiao , Zhenzhen Zhang , Zhenzhen Li , Zichen Li , Xiaolong Li , Jiaoyun Liu
Due to its ability to hide secret information without modifying image content, coverless image steganography offers a higher level of security and has become a research hot spot. However, existing methods overlook the issue of image order disruption during network transmission. In this paper, an image-synthesized video carrier is proposed for the first time. The selected images that represent the secret information are synthesized into a video in order, so the image order is not disrupted during transmission and the effective capacity is greatly increased. Additionally, an asymmetric structure is designed to improve robustness, in which only the receiver uses a robust image retrieval algorithm to restore the secret information. Specifically, certain images are randomly selected from a public image database to create multiple coverless image datasets (MCIDs), with each image in a CID mapped to a hash sequence. Images are indexed based on secret segments and synthesized into videos. After that, the synthesized videos are sent to the receiver. The receiver decodes the video into frames, identifies the corresponding CID of each frame, retrieves the original image, and restores the secret information with the same mapping rule. Experimental results indicate that the proposed method outperforms existing methods in terms of capacity, robustness, and security.
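The mapping from secret segments to images via hash sequences can be illustrated with a toy Python index; the hashing scheme, segment length, and handling of unmatched segments here are assumptions, not the paper's robust retrieval algorithm (in practice the datasets are built so that every possible segment is covered).

```python
import hashlib

def build_hash_index(image_paths, bits=8):
    """Map each image to a fixed-length bit string derived from its hash and
    index images by that string (toy stand-in for the paper's mapping)."""
    index = {}
    for path in image_paths:
        digest = hashlib.sha256(open(path, 'rb').read()).hexdigest()
        key = bin(int(digest[:bits // 4], 16))[2:].zfill(bits)
        index.setdefault(key, []).append(path)
    return index

def select_images(secret_bits, index, bits=8):
    # split the secret into fixed-length segments and pick one image per
    # segment; segments with no matching image are skipped in this toy version
    segments = [secret_bits[i:i + bits] for i in range(0, len(secret_bits), bits)]
    return [index[s][0] for s in segments if s in index]
```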
Citations: 0