Pub Date: 2024-10-28 | DOI: 10.1016/j.jvcir.2024.104323
Qian Liu, Yixiong Zhong, Jiongtao Fang
Density map estimation is commonly used for crowd counting. However, using it alone may leave some individuals difficult to recognize due to target occlusions, scale variations, complex backgrounds, and heterogeneous crowd distributions. To alleviate these problems, we propose a two-stage crowd counting network based on attention feature fusion and multi-column feature enhancement (AFF-MFE-TNet). In the first stage, AFF-MFE-TNet transforms the input image into a probability map. In the second stage, a multi-column feature enhancement module enhances features by expanding the receptive fields, a dual attention feature fusion module adaptively fuses features of different scales through attention mechanisms, and a triple counting loss is presented for AFF-MFE-TNet, which fits the ground-truth probability maps and density maps more closely and improves counting performance. Experimental results show that AFF-MFE-TNet effectively improves the accuracy of crowd counting compared with state-of-the-art methods.
{"title":"Crowd counting network based on attention feature fusion and multi-column feature enhancement","authors":"Qian Liu, Yixiong Zhong, Jiongtao Fang","doi":"10.1016/j.jvcir.2024.104323","DOIUrl":"10.1016/j.jvcir.2024.104323","url":null,"abstract":"<div><div>Density map estimation is commonly used for crowd counting. However, using it alone may make some individuals difficult to recognize, due to the problems of target occlusions, scale variations, complex background and heterogeneous distribution. To alleviate these problems, we propose a two-stage crowd counting network based on attention feature fusion and multi-column feature enhancement (AFF-MFE-TNet). In the first stage, AFF-MFE-TNet transforms the input image into a probability map. In the second stage, a multi-column feature enhancement module is constructed to enhance features by expanding the receptive fields, a dual attention feature fusion module is designed to adaptively fuse the features of different scales through attention mechanisms, and a triple counting loss is presented for AFF-MFE-TNet, which can fit the ground truth probability maps and density maps better, and improve the counting performance. Experimental results show that AFF-MFE-TNet can effectively improve the accuracy of crowd counting, as compared with the state-of-the-art.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104323"},"PeriodicalIF":2.6,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-26 | DOI: 10.1016/j.jvcir.2024.104326
Lin Zhao, Shaoxiong Xie, Jia Li, Ping Tan, Wenjin Hu
The growing attention to hyperspectral object tracking (HOT) can be attributed to the extended spectral information available in hyperspectral images (HSIs), especially in complex scenarios. This potential makes it a promising alternative to traditional RGB-based tracking methods. However, the scarcity of large hyperspectral datasets poses a challenge for training robust hyperspectral trackers with deep learning methods. Prompt learning, a paradigm that emerged with large language models, adapts or fine-tunes a pre-trained model for a specific downstream task by providing task-specific inputs. Inspired by the recent success of prompt learning in language and visual tasks, we propose a novel and efficient prompt learning method for HOT, termed Moderate Visual Prompt for HOT (MVP-HOT). Specifically, MVP-HOT freezes the parameters of the pre-trained model and employs HSIs as visual prompts to leverage the knowledge of the underlying RGB model. Additionally, we develop a moderate and effective strategy to incrementally adapt the HSI prompt information. The proposed method uses only 1.7M learnable parameters, and extensive experiments demonstrate its effectiveness: MVP-HOT achieves state-of-the-art performance on three hyperspectral datasets.
{"title":"MVP-HOT: A Moderate Visual Prompt for Hyperspectral Object Tracking","authors":"Lin Zhao, Shaoxiong Xie, Jia Li, Ping Tan, Wenjin Hu","doi":"10.1016/j.jvcir.2024.104326","DOIUrl":"10.1016/j.jvcir.2024.104326","url":null,"abstract":"<div><div>The growing attention to hyperspectral object tracking (HOT) can be attributed to the extended spectral information available in hyperspectral images (HSIs), especially in complex scenarios. This potential makes it a promising alternative to traditional RGB-based tracking methods. However, the scarcity of large hyperspectral datasets poses a challenge for training robust hyperspectral trackers using deep learning methods. Prompt learning, a new paradigm emerging in large language models, involves adapting or fine-tuning a pre-trained model for a specific downstream task by providing task-specific inputs. Inspired by the recent success of prompt learning in language and visual tasks, we propose a novel and efficient prompt learning method for HOT tasks, termed Moderate Visual Prompt for HOT (MVP-HOT). Specifically, MVP-HOT freezes the parameters of the pre-trained model and employs HSIs as visual prompts to leverage the knowledge of the underlying RGB model. Additionally, we develop a moderate and effective strategy to incrementally adapt the HSI prompt information. Our proposed method uses only a few (1.7M) learnable parameters and demonstrates its effectiveness through extensive experiments, MVP-HOT can achieve state-of-the-art performance on three hyperspectral datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104326"},"PeriodicalIF":2.6,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-26 | DOI: 10.1016/j.jvcir.2024.104294
Yixin Gao, Runsen Feng, Zongyu Guo, Zhibo Chen
Despite their short history, neural image codecs have been shown to surpass classical image codecs in rate–distortion (RD) performance. However, most of them suffer from significantly longer decoding times, which hinders the practical application of neural image codecs. This issue is especially pronounced when employing an effective yet time-consuming autoregressive context model, since it increases entropy decoding time by orders of magnitude. In this paper, unlike most previous works that pursue optimal RD performance while setting aside coding complexity, we conduct a systematic investigation of rate–distortion–complexity (RDC) optimization in neural image compression. By quantifying decoding complexity as a factor in the optimization goal, we can precisely control the RDC trade-off and demonstrate how the rate–distortion performance of neural image codecs can adapt to various complexity demands. Going beyond this investigation, a variable-complexity neural codec is designed to leverage spatial dependencies adaptively according to industrial demands, supporting fine-grained complexity adjustment by balancing the RDC trade-off. By implementing this scheme in a powerful base model, we demonstrate the feasibility and flexibility of RDC optimization for neural image codecs.
{"title":"Exploring the rate-distortion-complexity optimization in neural image compression","authors":"Yixin Gao, Runsen Feng, Zongyu Guo, Zhibo Chen","doi":"10.1016/j.jvcir.2024.104294","DOIUrl":"10.1016/j.jvcir.2024.104294","url":null,"abstract":"<div><div>Despite a short history, neural image codecs have been shown to surpass classical image codecs in terms of rate–distortion performance. However, most of them suffer from significantly longer decoding times, which hinders the practical applications of neural image codecs. This issue is especially pronounced when employing an effective yet time-consuming autoregressive context model since it would increase entropy decoding time by orders of magnitude. In this paper, unlike most previous works that pursue optimal RD performance while temporally overlooking the coding complexity, we make a systematical investigation on the rate–distortion-complexity (RDC) optimization in neural image compression. By quantifying the decoding complexity as a factor in the optimization goal, we are now able to precisely control the RDC trade-off and then demonstrate how the rate–distortion performance of neural image codecs could adapt to various complexity demands. Going beyond the investigation of RDC optimization, a variable-complexity neural codec is designed to leverage the spatial dependencies adaptively according to industrial demands, which supports fine-grained complexity adjustment by balancing the RDC tradeoff. By implementing this scheme in a powerful base model, we demonstrate the feasibility and flexibility of RDC optimization for neural image codecs.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104294"},"PeriodicalIF":2.6,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142659506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-26 | DOI: 10.1016/j.jvcir.2024.104327
Furong Ma, Guiyu Xia, Qingshan Liu
Human pose transfer aims to transfer a conditional person image to a new target pose. The difficulty lies in modeling the large-scale spatial deformation from the conditional pose to the target one. However, the commonly used 2D data representations and one-step flow prediction scheme lead to unreliable deformation prediction because of the lack of 3D guidance and the large pose changes involved. Therefore, to bring the original 3D motion information into human pose transfer, we propose to simulate the generation process of a real person image. We drive the 3D human model reconstructed from the conditional person image with the target pose and project it onto the 2D plane. The 2D projection thereby inherits the 3D information of the poses, which guides the flow prediction. Furthermore, we propose a progressive flow prediction network consisting of two streams. One stream predicts the flow by decomposing the complex pose transformation into multiple sub-transformations. The other generates the features of the target image according to the predicted flow. In addition, to enhance the reliability of the generated invisible regions, we use the target pose information, which carries structural information from the flow prediction stream, as supplementary input to the feature generation. The synthesized images, with accurate depth information and sharp details, demonstrate the effectiveness of the proposed method.
{"title":"3D human model guided pose transfer via progressive flow prediction network","authors":"Furong Ma , Guiyu Xia , Qingshan Liu","doi":"10.1016/j.jvcir.2024.104327","DOIUrl":"10.1016/j.jvcir.2024.104327","url":null,"abstract":"<div><div>Human pose transfer is to transfer a conditional person image to a new target pose. The difficulty lies in modeling the large-scale spatial deformation from the conditional pose to the target one. However, the commonly used 2D data representations and one-step flow prediction scheme lead to unreliable deformation prediction because of the lack of 3D information guidance and the great changes in the pose transfer. Therefore, to bring the original 3D motion information into human pose transfer, we propose to simulate the generation process of real person image. We drive the 3D human model reconstructed from the conditional person image with the target pose and project it to the 2D plane. The 2D projection thereby inherits the 3D information of the poses which can guide the flow prediction. Furthermore, we propose a progressive flow prediction network consisting of two streams. One stream is to predict the flow by decomposing the complex pose transformation into multiple sub-transformations. The other is to generate the features of the target image according to the predicted flow. Besides, to enhance the reliability of the generated invisible regions, we use the target pose information which contains structural information from the flow prediction stream as the supplementary information to the feature generation. The synthesized images with accurate depth information and sharp details demonstrate the effectiveness of the proposed method.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104327"},"PeriodicalIF":2.6,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-26 | DOI: 10.1016/j.jvcir.2024.104330
Huipu Xu, Meixiang Zhang, Yongzhi Li
With the rapid development of object detection technology, underwater object detection has attracted widespread attention. Most existing underwater target detection methods are built on convolutional neural networks (CNNs), which have limitations in utilizing global information and cannot fully capture the key information in images. To overcome the challenge of insufficient global–local feature extraction, an underwater target detector (namely GLIC) based on global–local information coupling and multi-scale feature fusion is proposed in this paper. Our GLIC consists of three main components: spatial pyramid pooling, global–local information coupling, and multi-scale feature fusion. First, we embed spatial pyramid pooling, which improves the robustness of the model while retaining more spatial information. Second, we design a feature pyramid network with global–local information coupling, in which the global context of the transformer branch and the local features of the CNN branch interact to enhance the feature representation. Finally, we construct a Multi-scale Feature Fusion (MFF) module that utilizes balanced semantic features integrated at the same depth for multi-scale feature fusion. In this way, each resolution in the pyramid receives equal information from the others, balancing the information flow and making the features more discriminative. As demonstrated in comprehensive experiments, our GLIC achieves 88.46%, 87.51%, and 74.94% mAP on the URPC2019, URPC2020, and UDD datasets, respectively.
{"title":"GLIC: Underwater target detection based on global–local information coupling and multi-scale feature fusion","authors":"Huipu Xu , Meixiang Zhang , Yongzhi Li","doi":"10.1016/j.jvcir.2024.104330","DOIUrl":"10.1016/j.jvcir.2024.104330","url":null,"abstract":"<div><div>With the rapid development of object detection technology, underwater object detection has attracted widespread attention. Most of the existing underwater target detection methods are built based on convolutional neural networks (CNNs), which still have some limitations in the utilization of global information and cannot fully capture the key information in the images. To overcome the challenge of insufficient global–local feature extraction, an underwater target detector (namely GLIC) based on global–local information coupling and multi-scale feature fusion is proposed in this paper. Our GLIC consists of three main components: spatial pyramid pooling, global–local information coupling, and multi-scale feature fusion. Firstly, we embed spatial pyramid pooling, which improves the robustness of the model while retaining more spatial information. Secondly, we design the feature pyramid network with global–local information coupling. The global context of the transformer branch and the local features of the CNN branch interact with each other to enhance the feature representation. Finally, we construct a Multi-scale Feature Fusion (MFF) module that utilizes balanced semantic features integrated at the same depth for multi-scale feature fusion. In this way, each resolution in the pyramid receives equal information from others, thus balancing the information flow and making the features more discriminative. As demonstrated in comprehensive experiments, our GLIC, respectively, achieves 88.46%, 87.51%, and 74.94% mAP on the URPC2019, URPC2020, and UDD datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104330"},"PeriodicalIF":2.6,"publicationDate":"2024-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142659505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-24 | DOI: 10.1016/j.jvcir.2024.104319
Qingbo Ji, Pengfei Zhang, Kuicheng Chen, Lei Zhang, Changbo Hou
Compared with common visible-light scenes, targets in infrared scenes lack information such as color and texture. Infrared images also have low contrast, which leads not only to interference between targets but also to interference between the target and the background. In addition, most infrared tracking algorithms lack a re-detection mechanism after the target is lost, resulting in poor tracking performance after occlusion or blurring. To solve these problems, we propose a scene-aware classifier that dynamically adjusts low-, middle-, and high-level features, improving the ability to exploit features in different infrared scenes. Besides, we design an infrared target re-detector based on a multi-domain convolutional network that learns from tracked target samples and background samples, improving the ability to distinguish the target from the background. Experimental results on VOT-TIR2015, VOT-TIR2017, and LSOTB-TIR show that the proposed algorithm achieves state-of-the-art results on the three infrared object tracking benchmarks.
{"title":"Scene-aware classifier and re-detector for thermal infrared tracking","authors":"Qingbo Ji , Pengfei Zhang , Kuicheng Chen , Lei Zhang , Changbo Hou","doi":"10.1016/j.jvcir.2024.104319","DOIUrl":"10.1016/j.jvcir.2024.104319","url":null,"abstract":"<div><div>Compared with common visible light scenes, the target of infrared scenes lacks information such as the color, texture. Infrared images have low contrast, which not only lead to interference between targets, but also interference between the target and the background. In addition, most infrared tracking algorithms lack a redetection mechanism after lost target, resulting in poor tracking effect after occlusion or blurring. To solve these problems, we propose a scene-aware classifier to dynamically adjust low, middle, and high level features, improving the ability to utilize features in different infrared scenes. Besides, we designed an infrared target re-detector based on multi-domain convolutional network to learn from the tracked target samples and background samples, improving the ability to identify the differences between the target and the background. The experimental results on <em>VOT-TIR2015</em>, <em>VOT-TIR2017</em> and <em>LSOTB-TIR</em> show that the proposed algorithm achieves the most advanced results in the three infrared object tracking benchmark.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104319"},"PeriodicalIF":2.6,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-18 | DOI: 10.1016/j.jvcir.2024.104316
Quan Yuan, Leida Li, Pengfei Chen
Aesthetic Image Cropping (AIC) enhances the visual appeal of an image by adjusting its composition and aesthetic elements. People make such adjustments based on these elements, aiming to enhance appealing aspects while minimizing detrimental ones. Motivated by these observations, we propose a novel approach called CLIPCropping, which simulates the human decision-making process in AIC. CLIPCropping leverages Contrastive Language–Image Pre-training (CLIP) to align visual perception with textual description. It consists of three branches: composition embedding, aesthetic embedding, and image cropping. The composition embedding branch learns principles based on Composition Knowledge Embedding (CKE), while the aesthetic embedding branch learns principles based on Aesthetic Knowledge Embedding (AKE). The image cropping branch evaluates the quality of candidate crops by aggregating knowledge from CKE and AKE; an MLP produces the final result. Extensive experiments on three benchmark datasets (GAICD-1236, GAICD-3336, and FCDB) show that CLIPCropping outperforms state-of-the-art methods and provides insightful interpretations.
{"title":"Aesthetic image cropping meets VLP: Enhancing good while reducing bad","authors":"Quan Yuan, Leida Li, Pengfei Chen","doi":"10.1016/j.jvcir.2024.104316","DOIUrl":"10.1016/j.jvcir.2024.104316","url":null,"abstract":"<div><div>Aesthetic Image Cropping (AIC) enhances the visual appeal of an image by adjusting its composition and aesthetic elements. People make these adjustments based on these elements, aiming to enhance appealing aspects while minimizing detrimental factors. Motivated by these observations, we propose a novel approach called CLIPCropping, which simulates the human decision-making process in AIC. CLIPCropping leverages Contrastive Language–Image Pre-training (CLIP) to align visual perception with textual description. It consists of three branches: composition embedding, aesthetic embedding, and image cropping. The composition embedding branch learns principles based on Composition Knowledge Embedding (CKE), while the aesthetic embedding branch learns principles based on Aesthetic Knowledge Embedding (AKE). The image cropping branch evaluates the quality of candidate crops by aggregating knowledge from CKE and AKE; an MLP produces the best result. Extensive experiments on three benchmark datasets — GAICD-1236, GAICD-3336, and FCDB — show that CLIPCropping outperforms state-of-the-art methods and provides insightful interpretations.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"105 ","pages":"Article 104316"},"PeriodicalIF":2.6,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142552675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-01 | DOI: 10.1016/j.jvcir.2024.104318
Haiju Fan, Shaowei Shi, Ming Li
Due to limited local storage space, more and more people are accustomed to uploading images to the cloud, which has raised concerns about privacy leakage. The traditional solution is to encrypt the images directly. However, in this way, users cannot easily browse the images stored in the cloud; the traditional method loses the visual usability of cloud images. To solve this problem, the Thumbnail-Preserving Encryption (TPE) method was proposed. Although approximate TPE is more efficient than ideal TPE, it cannot restore the original image losslessly and cannot encrypt some images with strong texture features. Inspired by the above, we propose a universal approximate thumbnail-preserving encryption method with lossless recovery. This method divides the image into equal-sized blocks, each of which is further divided into an embedding area and an adjustment area. The pixels of the embedding area are recorded by prediction. Then, the auxiliary information necessary to restore the image is encrypted and hidden in the embedding area of the encrypted image. Finally, the pixel values of the adjustment area in each block are adjusted so that the block average is close to that of the original block. Experimental results show that the proposed method can not only restore images losslessly but also process images with different texture features, achieving good generality. On the BOWS2 dataset, all images can be encrypted by adjusting the block size. In addition, it can resist third-party face recognition and comparison, achieving a satisfactory balance between privacy and visual usability.
{"title":"U-TPE: A universal approximate thumbnail-preserving encryption method for lossless recovery","authors":"Haiju Fan , Shaowei Shi , Ming Li","doi":"10.1016/j.jvcir.2024.104318","DOIUrl":"10.1016/j.jvcir.2024.104318","url":null,"abstract":"<div><div>Due to the limited local storage space, more and more people are accustomed to uploading images to the cloud, which has aroused concerns about privacy leaks. The traditional solution is to encrypt the images directly. However, in this way, users cannot easily browse the images stored in the cloud. Obviously, the traditional method has lost the visual usability of cloud images. To solve this problem, the Thumbnail-Preserving Encryption (TPE) method is proposed. Although approximate-TPE is more efficient than ideal TPE, it cannot restore the original image without damage and cannot encrypt some images with texture features. Inspired by the above, we propose a universal approximate thumbnail-preserving encryption method with lossless recovery. This method divides the image into equal-sized chunks, each of which is further divided into an embedding area and an adjustment area. The pixels of the embedding area are recorded by prediction. Then, the auxiliary information necessary to restore the image is encrypted and hidden in the embedding area of the encrypted image. Finally, the pixel values of the adjustment area in each block are adjusted so that the average value is close to the original block. Experimental results show that the proposed method can not only restore images losslessly but also process images with different texture features, achieving good generality. On the BOWS2 dataset, all images can be encrypted by adjusting the block size. In addition, it can resist third-party face recognition and comparison, achieving satisfactory results in balancing privacy and visual usability.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"104 ","pages":"Article 104318"},"PeriodicalIF":2.6,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142534733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-01 | DOI: 10.1016/j.jvcir.2024.104311
Beijing Chen, Yuting Hong, Yuxin Nie
With the development of deep learning-based steganalysis, video steganography is facing great challenges. To address the insufficient security of existing deep video steganography against steganalysis, and given that video has both spatial and temporal dimensions, this paper proposes a deep video steganography method using temporal frame selection and spatial sparse adversarial attack. In the temporal dimension, a stego frame selection module based on temporal attention is designed to calculate the weight of each frame and select frames with high weights for message and sparse perturbation embedding. In the spatial dimension, sparse adversarial perturbations are applied to the selected frames to improve resistance to steganalysis. Moreover, to control the sparsity of the adversarial perturbations flexibly, an intra-frame dynamic sparsity threshold mechanism is designed using percentiles. Experimental results demonstrate that the proposed method effectively enhances the visual quality and the security against steganalysis of video steganography, with controllable sparsity of the adversarial perturbations.
{"title":"Deep video steganography using temporal-attention-based frame selection and spatial sparse adversarial attack","authors":"Beijing Chen , Yuting Hong , Yuxin Nie","doi":"10.1016/j.jvcir.2024.104311","DOIUrl":"10.1016/j.jvcir.2024.104311","url":null,"abstract":"<div><div>With the development of deep learning-based steganalysis, video steganography is facing with great challenges. To address the insufficient security against steganalysis of existing deep video steganography, given that the video has both spatial and temporal dimensions, this paper proposes a deep video steganography method using temporal frame selection and spatial sparse adversarial attack. In temporal dimension, a stego frame selection module based on temporal attention is designed to calculate the weight of each frame and selects frames with high weights for message and sparse perturbation embedding. In spatial dimension, sparse adversarial perturbations are performed in the selected frames to improve the ability of resisting steganalysis. Moreover, to control the adversarial perturbations’ sparsity flexibly, an intra-frame dynamic sparsity threshold mechanism is designed by using percentile. Experimental results demonstrate that the proposed method effectively enhances the visual quality and security against steganalysis of video steganography and has controllable sparsity of adversarial perturbations.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"104 ","pages":"Article 104311"},"PeriodicalIF":2.6,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142424089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-01 | DOI: 10.1016/j.jvcir.2024.104303
Yueshuang Jiao, Zhenzhen Zhang, Zhenzhen Li, Zichen Li, Xiaolong Li, Jiaoyun Liu
Due to its ability to hide secret information without modifying image content, coverless image steganography achieves a higher level of security and has become a research hot spot. However, existing methods overlook the issue of image order disruption during network transmission. In this paper, an image-synthesized video carrier is proposed for the first time. The selected images, which represent the secret information, are synthesized into a video in order, so the image order is not disrupted during transmission and the effective capacity is greatly increased. Additionally, an asymmetric structure is designed to improve robustness, in which only the receiver utilizes a robust image retrieval algorithm to restore the secret information. Specifically, certain images are randomly selected from a public image database to create multiple coverless image datasets (MCIDs), with each image in a CID mapped to a hash sequence. Images are indexed based on secret segments and synthesized into videos. After that, the synthesized videos are sent to the receiver. The receiver decodes the video into frames, identifies the corresponding CID of each frame, retrieves the original image, and restores the secret information with the same mapping rule. Experimental results indicate that the proposed method outperforms existing methods in terms of capacity, robustness, and security.
{"title":"A robust coverless image-synthesized video steganography based on asymmetric structure","authors":"Yueshuang Jiao , Zhenzhen Zhang , Zhenzhen Li , Zichen Li , Xiaolong Li , Jiaoyun Liu","doi":"10.1016/j.jvcir.2024.104303","DOIUrl":"10.1016/j.jvcir.2024.104303","url":null,"abstract":"<div><div>Due to the ability of hiding secret information without modifying image content, coverless image stegonagraphy has gained higher level of security and become a research hot spot. However, in existing methods, the issue of image order disruption during network transmission is overlooked. In this paper, the image-synthesized video carrier is proposed for the first time. The selected images which represent secret information are synthesized to a video in order, thus the image order will not be disrupted during transmission and the effective capacity is greatly increased. Additionally, an asymmetric structure is designed to improve the robustness, in which only the receiver utilizes a robust image retrieval algorithm to restore secret information. Specifically, certain images are randomly selected from a public image database to create multiple coverless image datasets (MCIDs), with each image in a CID mapped to hash sequence. Images are indexed based on secret segments and synthesized into videos. After that, the synthesized videos are sent to the receiver. The receiver decodes the video into frames, identifies the corresponding CID of each frame, retrieves original image, and restores the secret information with the same mapping rule. Experimental results indicate that the proposed method outperforms existing methods in terms of capacity, robustness, and security.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"104 ","pages":"Article 104303"},"PeriodicalIF":2.6,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142433307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}