
Latest articles from the Journal of Visual Communication and Image Representation

Reversible data hiding with automatic contrast enhancement and high embedding capacity based on multi-type histogram modification
IF 2.6 CAS Region 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-04-01 DOI: 10.1016/j.jvcir.2025.104450
Libo Han , Wanlin Gao , Xinfeng Zhang , Sha Tao
Reversible data hiding (RDH) with automatic contrast enhancement (ACE) improves an image's contrast automatically by continuously embedding data. However, some existing methods do not present the detail in the dark regions of a grayscale image well, and they sometimes suffer from low embedding capacity (EC). We therefore propose an RDH method with ACE and high EC based on multi-type histogram modification. A pixel-value histogram modification method improves the contrast automatically: two-sided histogram expansion improves global contrast, and a histogram right-shift then enhances the dark regions. A prediction-error histogram modification method, built on a new prediction scheme, further raises the EC. Experimental results show that the proposed method outperforms several state-of-the-art methods.
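As a reference point for the pixel-value histogram step, the sketch below (Python/NumPy, hypothetical function name) shows the classic single-peak histogram-shifting embedding that methods of this kind build on; it is not the paper's multi-type scheme, which additionally uses two-sided expansion and a dark-region right-shift.

```python
import numpy as np

def embed_histogram_shift(img, bits):
    """Classic single-peak histogram-shifting RDH (illustrative only).
    Assumes img is uint8 and its histogram contains at least one empty bin."""
    hist = np.bincount(img.ravel(), minlength=256)
    peak = int(hist.argmax())                 # most frequent gray level
    zero = int(np.where(hist == 0)[0][0])     # an unused gray level
    step = 1 if peak < zero else -1

    out = img.astype(np.int16).copy()
    between = (out * step > peak * step) & (out * step < zero * step)
    out[between] += step                      # open a gap next to the peak

    flat = out.ravel()
    it = iter(bits)
    for i in np.flatnonzero(flat == peak):    # peak pixels carry the payload
        b = next(it, None)
        if b is None:
            break
        if b:
            flat[i] = peak + step
    return flat.reshape(img.shape).astype(np.uint8)
```

Shifting the histogram between the peak and an empty bin frees the neighboring gray level, so every peak-valued pixel can reversibly carry one bit; the EC of this baseline equals the peak's bin count.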
{"title":"Reversible data hiding with automatic contrast enhancement and high embedding capacity based on multi-type histogram modification","authors":"Libo Han ,&nbsp;Wanlin Gao ,&nbsp;Xinfeng Zhang ,&nbsp;Sha Tao","doi":"10.1016/j.jvcir.2025.104450","DOIUrl":"10.1016/j.jvcir.2025.104450","url":null,"abstract":"<div><div>For an image, we can use reversible data hiding (RDH) with automatic contrast enhancement (ACE) to automatically improve its contrast by continuously embedding data. Some existing methods may make the detailed information in the dark regions of the grayscale image not well presented. Furthermore, these methods sometimes suffer from low embedding capacity (EC). Therefore, we propose an RDH method with ACE and high EC based on multi-type histogram modification. A pixel value histogram modification method is proposed to improve the contrast automatically. In this method, two-sided histogram expansion is used to improve global contrast, and then the histogram right-shift method is used to enhance the dark regions. Then, a prediction error histogram modification method is proposed to improve the EC. In this method, a new prediction method is proposed to better improve the EC. Experiment results show that compared with some advanced methods, the proposed method performs better.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"109 ","pages":"Article 104450"},"PeriodicalIF":2.6,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143768971","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-level cross-modal attention guided DIBR 3D image watermarking
IF 2.6 CAS Region 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-03-30 DOI: 10.1016/j.jvcir.2025.104455
Qingmo Chen , Zhang Wang , Zhouyan He , Ting Luo , Jiangtao Huang
For depth-image-based rendering (DIBR) 3D images, both the center view and the synthesized virtual views are subject to illegal distribution during transmission. To address copyright protection for DIBR 3D images, we propose a multi-level cross-modal attention guided network (MCANet) for 3D image watermarking. To optimize the watermark embedding process, a watermark adjustment module (WAM) extracts cross-modal information at different scales and computes 3D image attention to adjust the watermark distribution. Furthermore, a nested dual-output U-net (NDOU) enhances the compensatory capability of the skip connections, supplying effective global features to the up-sampling process for high image quality. Compared with state-of-the-art (SOTA) 3D image watermarking methods, the proposed watermarking model shows superior robustness and imperceptibility.
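For orientation, here is a minimal cross-modal attention block in PyTorch in which depth features guide a texture (center-view) feature map; the class, the variable names, and the residual fusion are illustrative assumptions, not MCANet's actual WAM/NDOU design.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Depth features guide a texture (center-view) feature map via attention;
    structure and names are illustrative, not MCANet's design."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels, 1)   # from the texture branch
        self.key = nn.Conv2d(channels, channels, 1)     # from the depth branch
        self.value = nn.Conv2d(channels, channels, 1)
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, texture, depth):
        b, c, h, w = texture.shape
        q = self.query(texture).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.key(depth).flatten(2)                        # (B, C, HW)
        v = self.value(depth).flatten(2).transpose(1, 2)      # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)        # cross-modal attention
        fused = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return texture + self.out(fused)                      # residual fusion

fusion = CrossModalAttentionFusion(64)
tex, dep = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
print(fusion(tex, dep).shape)   # torch.Size([1, 64, 32, 32])
```

An attention map of this kind can then rescale where and how strongly the watermark is distributed across the image.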
{"title":"Multi-level cross-modal attention guided DIBR 3D image watermarking","authors":"Qingmo Chen ,&nbsp;Zhang Wang ,&nbsp;Zhouyan He ,&nbsp;Ting Luo ,&nbsp;Jiangtao Huang","doi":"10.1016/j.jvcir.2025.104455","DOIUrl":"10.1016/j.jvcir.2025.104455","url":null,"abstract":"<div><div>For depth-image-based rendering (DIBR) 3D images, both center and synthesized virtual views are subject to illegal distribution during transmission. To address the issue of copyright protection of DIBR 3D images, we propose a multi-level cross-modal attention guided network (MCANet) for 3D image watermarking. To optimize the watermark embedding process, the watermark adjustment module (WAM) is designed to extract cross-modal information at different scales, thereby calculating 3D image attention to adjust the watermark distribution. Furthermore, the nested dual output U-net (NDOU) is devised to enhance the compensatory capability of the skip connections, thus providing an effective global feature to the up-sampling process for high image quality. Compared to state-of-the-art (SOTA) 3D image watermarking methods, the proposed watermarking model shows superior performance in terms of robustness and imperceptibility.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"109 ","pages":"Article 104455"},"PeriodicalIF":2.6,"publicationDate":"2025-03-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143746691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LDINet: Latent decomposition-interpolation for single image fast-moving objects deblatting
IF 2.6 CAS Region 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-03-25 DOI: 10.1016/j.jvcir.2025.104439
Haodong Fan, Dingyi Zhang, Yunlong Yu, Yingming Li
An image of fast-moving objects (FMOs) usually contains a blur stripe in which the blurred object is mixed with the background. In this work, we propose a novel Latent Decomposition-Interpolation Network (LDINet) that recovers the appearances and shapes of such objects from the blur stripe in a single image. In particular, we introduce a Decomposition-Interpolation Module (DIM) that breaks the input feature maps into discrete time-indexed parts and interpolates the target latent frames according to the given time indexes using affine transformations, with the features split into scalar-like and gradient-like parts for warping during interpolation. Finally, a decoder renders the prediction results. In addition, a Refining Conditional Deblatting (RCD) approach further enhances the fidelity. Extensive experiments show that the proposed method outperforms existing competing methods.
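As a rough illustration of the decomposition-interpolation idea, the sketch below warps time-indexed feature parts with affine transformations and blends them into one latent frame (PyTorch); the function names, the blending weights, and the uniform treatment of all channels are simplifying assumptions, whereas LDINet additionally distinguishes scalar-like and gradient-like features when warping.

```python
import torch
import torch.nn.functional as F

def warp_affine(feat, theta):
    """Warp a feature map (B, C, H, W) with per-sample 2x3 affine matrices."""
    grid = F.affine_grid(theta, list(feat.shape), align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

def interpolate_latent(parts, thetas, weights):
    """Blend time-indexed feature parts into one latent frame: each part is
    affinely warped toward the target time index and weighted."""
    warped = [w * warp_affine(p, th) for p, th, w in zip(parts, thetas, weights)]
    return torch.stack(warped).sum(dim=0)

# toy usage: three time-indexed parts, identity warps, unequal blending weights
parts = [torch.randn(1, 8, 16, 16) for _ in range(3)]
thetas = [torch.eye(2, 3).unsqueeze(0) for _ in range(3)]
latent = interpolate_latent(parts, thetas, weights=[0.2, 0.5, 0.3])
print(latent.shape)   # torch.Size([1, 8, 16, 16])
```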
{"title":"LDINet: Latent decomposition-interpolation for single image fast-moving objects deblatting","authors":"Haodong Fan,&nbsp;Dingyi Zhang,&nbsp;Yunlong Yu,&nbsp;Yingming Li","doi":"10.1016/j.jvcir.2025.104439","DOIUrl":"10.1016/j.jvcir.2025.104439","url":null,"abstract":"<div><div>The image of fast-moving objects (FMOs) usually contains a blur stripe indicating the blurred object that is mixed with the background. In this work we propose a novel Latent Decomposition-Interpolation Network (LDINet) to generate the appearances and shapes of the objects from the blurry stripe contained in the single image. In particular, we introduce an Decomposition-Interpolation Module (DIM) to break down the feature maps of the inputs into discrete time indexed parts and interpolate the target latent frames according to the provided time indexes with affine transformations, where the features are categorized into the scalar-like and gradient-like parts when warping in the interpolation. Finally, a decoder renders the prediction results. In addition, based on the results, a Refining Conditional Deblatting (RCD) approach is presented to further enhance the fidelity. Extensive experiments are conducted and have shown that the proposed methods achieve superior performances compared to the existing competing methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"109 ","pages":"Article 104439"},"PeriodicalIF":2.6,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143725251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
VDD: Varied Drone Dataset for semantic segmentation
IF 2.6 CAS Region 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-03-22 DOI: 10.1016/j.jvcir.2025.104429
Wenxiao Cai, Ke Jin, Jinyan Hou, Cong Guo, Letian Wu, Wankou Yang
Semantic segmentation of drone images is critical for various aerial vision tasks, as it provides the semantic detail needed to understand scenes on the ground. Ensuring high accuracy of semantic segmentation models for drones requires diverse, large-scale, high-resolution datasets, which are scarce in aerial image processing. Whereas existing datasets typically focus on urban scenes and are relatively small, our Varied Drone Dataset (VDD) addresses these limitations with a large-scale, densely labeled collection of 400 high-resolution images spanning 7 classes. The dataset covers urban, industrial, rural, and natural scenes captured from different camera angles and under diverse lighting conditions. We also re-annotate UDD (Chen et al., 2018) and UAVid (Lyu et al., 2018) under the VDD annotation standard and integrate them into the Integrated Drone Dataset (IDD). We train seven state-of-the-art models on the drone datasets as baselines. We expect the dataset to generate considerable interest in drone image segmentation and to serve as a foundation for other drone vision tasks. Datasets are publicly available at https://github.com/RussRobin/VDD.
{"title":"VDD: Varied Drone Dataset for semantic segmentation","authors":"Wenxiao Cai,&nbsp;Ke Jin,&nbsp;Jinyan Hou,&nbsp;Cong Guo,&nbsp;Letian Wu,&nbsp;Wankou Yang","doi":"10.1016/j.jvcir.2025.104429","DOIUrl":"10.1016/j.jvcir.2025.104429","url":null,"abstract":"<div><div>Semantic segmentation of drone images is critical for various aerial vision tasks as it provides essential semantic details to understand scenes on the ground. Ensuring high accuracy of semantic segmentation models for drones requires access to diverse, large-scale, and high-resolution datasets, which are often scarce in the field of aerial image processing. While existing datasets typically focus on urban scenes and are relatively small, our Varied Drone Dataset (VDD) addresses these limitations by offering a large-scale, densely labeled collection of 400 high-resolution images spanning 7 classes. This dataset features various scenes in urban, industrial, rural, and natural areas, captured from different camera angles and under diverse lighting conditions. We also make new annotations to UDD (Chen et al., 2018) and UAVid (Lyu et al., 2018), integrating them under VDD annotation standards, to create the Integrated Drone Dataset (IDD). We train seven state-of-the-art models on drone datasets as baselines. It is expected that our dataset will generate considerable interest in drone image segmentation and serve as a foundation for other drone vision tasks. Datasets are publicly available at <span><span>https://github.com/RussRobin/VDD</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"109 ","pages":"Article 104429"},"PeriodicalIF":2.6,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143739111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RBMark: Robust and blind video watermark in DT CWT domain
IF 2.6 CAS Region 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-03-22 DOI: 10.1016/j.jvcir.2025.104438
I.-Chun Huang , Ji-Yan Wu , Wei Tsang Ooi
Video watermark embedding algorithms based on the transform domain are robust against media processing but limited in data embedding capacity. Learning-based watermarking algorithms have recently become increasingly popular because of their strong image feature extraction and data embedding, yet they are not consistently robust against various video processing methods. To trade off embedding capacity and robustness effectively, this paper proposes RBMark, a novel video watermarking method based on the Dual-Tree Complex Wavelet Transform (DT CWT). First, the watermark bit-stream is transformed into a key image in the embedding phase. Second, we extract DT CWT coefficients from both the video frame and the key image and embed the key-image coefficients into the video-frame coefficients. During the extraction phase, the key-image coefficients are recovered, the inverse DT CWT is applied, and the watermark bit-stream is reconstructed. Compared with prior watermarking algorithms, RBMark achieves higher robustness and embedding capacity. We evaluated its performance on multiple representative video datasets against prior transform-domain and learning-based watermarking algorithms. Experimental results demonstrate that RBMark achieves up to 98% and 99% improvement in Bit Error Rate over the transform-domain and learning-based methods, respectively. Furthermore, RBMark can embed at most 2040 bits in each 1080p-resolution video frame (i.e., 9.84×10⁻⁴ bits per pixel). The source code is available at this URL.
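To make the coefficient-domain step concrete, here is a simplified additive embedding sketch using the open-source dtcwt Python package (its Transform2d/Pyramid API is assumed). Unlike RBMark, this toy version is non-blind, embeds only the lowpass band, and assumes the frame and key image share the same resolution.

```python
import numpy as np
import dtcwt  # pip install dtcwt; Transform2d/Pyramid API assumed from its docs

def embed_lowpass(frame, key_img, alpha=0.05):
    """Additively embed the key image's DT CWT lowpass band into the frame's
    lowpass band. Assumes grayscale arrays of equal size."""
    t = dtcwt.Transform2d()
    fp = t.forward(frame.astype(float), nlevels=3)
    kp = t.forward(key_img.astype(float), nlevels=3)
    fp.lowpass = fp.lowpass + alpha * kp.lowpass   # coefficient-domain embedding
    return t.inverse(fp)

def extract_lowpass(watermarked, original, alpha=0.05):
    """Non-blind illustration: recover the key image's lowpass residual by
    differencing coefficients (RBMark itself extracts blindly)."""
    t = dtcwt.Transform2d()
    wp = t.forward(watermarked.astype(float), nlevels=3)
    op = t.forward(original.astype(float), nlevels=3)
    return (wp.lowpass - op.lowpass) / alpha
```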
{"title":"RBMark: Robust and blind video watermark in DT CWT domain","authors":"I.-Chun Huang ,&nbsp;Ji-Yan Wu ,&nbsp;Wei Tsang Ooi","doi":"10.1016/j.jvcir.2025.104438","DOIUrl":"10.1016/j.jvcir.2025.104438","url":null,"abstract":"<div><div>Video watermark embedding algorithms based on the transform domain are robust against media processing methods but are limited in data embedding capacity. Learning-based watermarking algorithms have recently become increasingly popular because of their good performance in image feature extraction and data embedding. However, they are not consistently robust against various video processing methods. To effectively trade off the embedding capacity and robustness, this paper proposes RBMark, a novel video watermarking method based on Dual-Tree Complex Wavelet Transform (DT CWT). First, the watermark bit-stream is transformed into a key image in the embedding phase. Second, we extract the DT CWT domain coefficients from both the video frame and the key image and embed the key image coefficients into the video frame coefficients. During the extraction phase, the key image coefficients are extracted to perform inverse DT CWT and reconstruct the watermark bit-stream. Compared with prior watermarking algorithms, RBMark achieves higher robustness and embedding capacity. We evaluated its performance using multiple representative video datasets to compare with prior transform domain and learning-based watermarking algorithms. Experimental results demonstrate that RBMark achieves up to 98% and 99% improvement in Bit Error Rate over the transform domain and learning-based methods, respectively. Furthermore, RBMark can embed at most 2040 bits in each 1080p-resolution video frame (i.e., <span><math><mrow><mn>9</mn><mo>.</mo><mn>84</mn><mo>×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mo>−</mo><mn>4</mn></mrow></msup></mrow></math></span> bits per pixel). The source code is available in <span><span>this URL</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"109 ","pages":"Article 104438"},"PeriodicalIF":2.6,"publicationDate":"2025-03-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143705134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Improving model generalization by on-manifold adversarial augmentation in the frequency domain
IF 2.6 CAS Region 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-03-21 DOI: 10.1016/j.jvcir.2025.104437
Chang Liu , Wenzhao Xiang , Yuan He , Hui Xue , Shibao Zheng , Hang Su
Deep Neural Networks (DNNs) often suffer from performance drops when training and test data distributions differ. Ensuring model generalization for Out-Of-Distribution (OOD) data is crucial, but current models still struggle with accuracy on such data. Recent studies have shown that regular or off-manifold adversarial examples as data augmentation improve OOD generalization. Building on this, we provide theoretical validation that on-manifold adversarial examples can enhance OOD generalization even more. However, generating these examples is challenging due to the complexity of real manifolds. To address this, we propose AdvWavAug, an on-manifold adversarial data augmentation method using a Wavelet module. This approach, based on the AdvProp training framework, leverages wavelet transformation to project an image into the wavelet domain and modifies it within the estimated data manifold. Experiments on various models and datasets, including ImageNet and its distorted versions, show that our method significantly improves model generalization, especially for OOD data.
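The overall recipe (transform to the wavelet domain, perturb adversarially there, transform back) can be sketched with a hand-rolled one-level Haar DWT in PyTorch. This is a simplified stand-in that perturbs only the low-frequency band; it omits AdvWavAug's manifold estimation and the AdvProp training loop, and the function names are assumptions.

```python
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """One-level 2D Haar DWT on (B, C, H, W) tensors with even H and W."""
    a = x[..., 0::2, 0::2]; b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]; d = x[..., 1::2, 1::2]
    return (a + b + c + d) / 2, (a + b - c - d) / 2, (a - b + c - d) / 2, (a - b - c + d) / 2

def haar_idwt(ll, lh, hl, hh):
    """Exact inverse of haar_dwt, re-interleaving the four subbands."""
    a = (ll + lh + hl + hh) / 2; b = (ll + lh - hl - hh) / 2
    c = (ll - lh + hl - hh) / 2; d = (ll - lh - hl + hh) / 2
    top = torch.stack((a, b), dim=-1).flatten(-2)   # interleave even-row columns
    bot = torch.stack((c, d), dim=-1).flatten(-2)   # interleave odd-row columns
    return torch.stack((top, bot), dim=-2).flatten(-3, -2)

def advwav_augment(model, x, y, eps=0.03, loss_fn=F.cross_entropy):
    """Simplified wavelet-domain adversarial augmentation: one gradient-sign
    step on the low-frequency (LL) band only, then reconstruction."""
    ll, lh, hl, hh = haar_dwt(x.detach())
    ll = ll.clone().requires_grad_(True)
    loss_fn(model(haar_idwt(ll, lh, hl, hh)), y).backward()
    model.zero_grad(set_to_none=True)          # drop gradients left on the model
    with torch.no_grad():
        ll_adv = ll + eps * ll.grad.sign()     # perturb the coarse image content
    return haar_idwt(ll_adv, lh, hl, hh).detach()
```

Restricting the perturbation to coarse wavelet coefficients keeps the augmented sample closer to natural image structure than a pixel-space attack of the same budget.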
{"title":"Improving model generalization by on-manifold adversarial augmentation in the frequency domain","authors":"Chang Liu ,&nbsp;Wenzhao Xiang ,&nbsp;Yuan He ,&nbsp;Hui Xue ,&nbsp;Shibao Zheng ,&nbsp;Hang Su","doi":"10.1016/j.jvcir.2025.104437","DOIUrl":"10.1016/j.jvcir.2025.104437","url":null,"abstract":"<div><div>Deep Neural Networks (DNNs) often suffer from performance drops when training and test data distributions differ. Ensuring model generalization for Out-Of-Distribution (OOD) data is crucial, but current models still struggle with accuracy on such data. Recent studies have shown that regular or off-manifold adversarial examples as data augmentation improve OOD generalization. Building on this, we provide theoretical validation that on-manifold adversarial examples can enhance OOD generalization even more. However, generating these examples is challenging due to the complexity of real manifolds. To address this, we propose AdvWavAug, an on-manifold adversarial data augmentation method using a Wavelet module. This approach, based on the AdvProp training framework, leverages wavelet transformation to project an image into the wavelet domain and modifies it within the estimated data manifold. Experiments on various models and datasets, including ImageNet and its distorted versions, show that our method significantly improves model generalization, especially for OOD data.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"109 ","pages":"Article 104437"},"PeriodicalIF":2.6,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143686193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Multi-TuneV: Fine-tuning the fusion of multiple modules for video action recognition
IF 2.6 CAS Region 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-03-20 DOI: 10.1016/j.jvcir.2025.104441
Xinyuan Liu, Junyong Ye, Jingjing Wang, Guangyi Xu, Youwei Li, Chaoming Zheng
Pre-trained models have achieved remarkable success, but they usually have complex structures and hundreds of millions of parameters, so training or fully fine-tuning one requires enormous computational resources, which limits transfer learning across tasks. To migrate pre-trained models to Video Action Recognition (VAR), recent research has adopted parameter-efficient transfer learning (PETL) approaches, but most studies consider a single fine-tuning module. For a task as complex as VAR, a single fine-tuning method may not achieve optimal results. To address this challenge, we study the effect of jointly fine-tuning multiple modules and propose a method that merges them, namely Multi-TuneV. It combines five fine-tuning methods: ST-Adapter, AdaptFormer, BitFit, VPT, and LoRA. We design a dedicated architecture for Multi-TuneV and integrate it organically into the Video ViT model so that it can coordinate the multiple fine-tuning modules during feature extraction. Because it combines the advantages of five fine-tuning methods, Multi-TuneV lets pre-trained models migrate to video classification tasks with improved accuracy while effectively limiting the number of tunable parameters. We conduct extensive experiments with Multi-TuneV on three common video datasets and show that it surpasses both full fine-tuning and single fine-tuning methods. When only 18.7% (16.09 M) of the full fine-tuning parameters are updated, the accuracy of Multi-TuneV on SSv2 and HMDB51 improves by 23.43% and 16.46% over the full fine-tuning strategy, reaching 67.43% and 75.84%. This demonstrates the effectiveness of joint multi-module fine-tuning. Multi-TuneV offers a new direction for PETL and a new perspective on video understanding tasks. Code is available at https://github.com/hhh123-1/Multi-TuneV.
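To show what merging several PETL modules can look like structurally, here is a hypothetical PyTorch block that freezes a base linear projection and tunes it only through a LoRA update plus a parallel bottleneck adapter. The class names, dimensions, and the omission of the VPT, BitFit, and ST-Adapter components are all assumptions; this is not Multi-TuneV's actual architecture.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer tuned only through a low-rank (LoRA-style) update."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # keep pre-trained weights frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

class Adapter(nn.Module):
    """Bottleneck adapter (AdaptFormer/ST-Adapter flavor) with a residual path."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

class TunedBlock(nn.Module):
    """Hypothetical fusion of two PETL modules around one frozen projection."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = LoRALinear(nn.Linear(dim, dim))
        self.adapter = Adapter(dim)

    def forward(self, x):
        return self.adapter(self.proj(x))

block = TunedBlock()
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
total = sum(p.numel() for p in block.parameters())
print(f"trainable fraction: {trainable / total:.1%}")   # small, as expected for PETL
```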
{"title":"Multi-TuneV: Fine-tuning the fusion of multiple modules for video action recognition","authors":"Xinyuan Liu,&nbsp;Junyong Ye,&nbsp;Jingjing Wang,&nbsp;Guangyi Xu,&nbsp;Youwei Li,&nbsp;Chaoming Zheng","doi":"10.1016/j.jvcir.2025.104441","DOIUrl":"10.1016/j.jvcir.2025.104441","url":null,"abstract":"<div><div>The current pre-trained models have achieved remarkable success, but they usually have complex structures and hundreds of millions of parameters, resulting in a huge computational resource requirement to train or fully fine-tune a pre-trained model, which limits its transfer learning on different tasks. In order to migrate pre-trained models to the field of Video Action Recognition (VAR), recent research uses parametric efficient transfer learning (PETL) approaches, while most of them are studied on a single fine-tuning module. For a complex task like VAR, a single fine-tuning method may not achieve optimal results. To address this challenge, we want to study the effect of joint fine-tuning with multiple modules, so we propose a method that merges multiple fine-tuning modules, namely Multi-TuneV. It combines five fine-tuning methods, including ST-Adapter, AdaptFormer, BitFit, VPT and LoRA. We design a particular architecture for Multi-TuneV and integrate it organically into the Video ViT model so that it can coordinate the multiple fine-tuning modules to extract features. Multi-TuneV enables pre-trained models to migrate to video classification tasks while maintaining improved accuracy and effectively limiting the number of tunable parameters, because it combines the advantages of five fine-tuning methods. We conduct extensive experiments with Multi-TuneV on three common video datasets, and show that it surpasses both full fine-tuning and other single fine-tuning methods. When only 18.7 % (16.09 M) of the full fine-tuning parameters are updated, the accuracy of Multi-TuneV on SSv2 and HMDB51 improve by 23.43 % and 16.46 % compared with the full fine-tuning strategy, and improve to 67.43 % and 75.84 %. This proves the effectiveness of joint multi-module fine-tuning. Multi-TuneV provides a new idea for PETL and a new perspective to address the challenge in video understanding tasks. Code is available at <span><span>https://github.com/hhh123-1/Multi-TuneV</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"109 ","pages":"Article 104441"},"PeriodicalIF":2.6,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143686194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DBFAM: A dual-branch network with efficient feature fusion and attention-enhanced gating for medical image segmentation
IF 2.6 CAS Region 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-03-17 DOI: 10.1016/j.jvcir.2025.104434
Benzhe Ren , Yuhui Zheng , Zhaohui Zheng , Jin Ding , Tao Wang
In the field of medical image segmentation, convolutional neural networks (CNNs) and transformer networks have garnered significant attention due to their unique advantages. However, CNNs have limitations in modeling long-range dependencies, while transformers are constrained by their quadratic computational complexity. Recently, State Space Models (SSMs), exemplified by Mamba, have emerged as a promising approach. These models excel in capturing long-range interactions while maintaining linear computational complexity. This paper proposes a dual-branch parallel network that combines CNNs with Visual State Space Models (VSSMs). The two branches of the encoder separately capture local and global information. To further leverage the intricate relationships between local and global features, a dual-branch local–global feature fusion module is introduced, effectively integrating features from both branches. Additionally, an Attention-Enhanced Gated Module is proposed to replace traditional skip connections, aiming to improve the alignment of information transfer between the encoder and decoder. Extensive experiments on multiple datasets validate the effectiveness of our method.
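As an illustration of the two components named above, the sketch below implements a generic channel-attention fusion of a local and a global feature map and a gated skip connection in PyTorch; both modules are assumptions in the spirit of the description, not DBFAM's exact design.

```python
import torch
import torch.nn as nn

class DualBranchFusion(nn.Module):
    """Fuse a local (CNN) feature map and a global (state-space) feature map
    with channel attention; an assumed design, not the paper's module."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(2 * channels, channels, 1)
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1),
                                  nn.Sigmoid())

    def forward(self, local_feat, global_feat):
        fused = self.proj(torch.cat([local_feat, global_feat], dim=1))
        return fused * self.attn(fused)          # channel-wise reweighting

class GatedSkip(nn.Module):
    """Attention-enhanced gated skip: decoder features gate the encoder features
    before merging, in place of a plain skip connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, enc, dec):
        g = self.gate(torch.cat([enc, dec], dim=1))
        return dec + g * enc

local_f, global_f = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
print(DualBranchFusion(32)(local_f, global_f).shape)   # torch.Size([1, 32, 64, 64])
```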
{"title":"DBFAM: A dual-branch network with efficient feature fusion and attention-enhanced gating for medical image segmentation","authors":"Benzhe Ren ,&nbsp;Yuhui Zheng ,&nbsp;Zhaohui Zheng ,&nbsp;Jin Ding ,&nbsp;Tao Wang","doi":"10.1016/j.jvcir.2025.104434","DOIUrl":"10.1016/j.jvcir.2025.104434","url":null,"abstract":"<div><div>In the field of medical image segmentation, convolutional neural networks (CNNs) and transformer networks have garnered significant attention due to their unique advantages. However, CNNs have limitations in modeling long-range dependencies, while transformers are constrained by their quadratic computational complexity. Recently, State Space Models (SSMs), exemplified by Mamba, have emerged as a promising approach. These models excel in capturing long-range interactions while maintaining linear computational complexity. This paper proposes a dual-branch parallel network that combines CNNs with Visual State Space Models (VSSMs). The two branches of the encoder separately capture local and global information. To further leverage the intricate relationships between local and global features, a dual-branch local–global feature fusion module is introduced, effectively integrating features from both branches. Additionally, an Attention-Enhanced Gated Module is proposed to replace traditional skip connections, aiming to improve the alignment of information transfer between the encoder and decoder. Extensive experiments on multiple datasets validate the effectiveness of our method.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"109 ","pages":"Article 104434"},"PeriodicalIF":2.6,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143686192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Transformer-based weakly supervised 3D human pose estimation
IF 2.6 CAS Region 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-03-14 DOI: 10.1016/j.jvcir.2025.104432
Xiao-guang Wu , Hu-jie Xie , Xiao-chen Niu , Chen Wang , Ze-lei Wang , Shi-wen Zhang , Yu-ze Shan
Deep learning-based 3D human pose estimation methods typically require large amounts of 3D pose annotations. However, owing to limitations in data quality and the scarcity of 3D labeled data, researchers have adopted weak supervision to reduce the demand for annotated data. Transformers have recently achieved remarkable success in 3D human pose estimation: with their strong modeling and generalization capabilities, they capture patterns and features effectively even under limited data, mitigating data scarcity. Nonetheless, the Transformer architecture struggles to capture long-term dependencies and spatio-temporal correlations between joints, which limits comprehensive temporal and spatial modeling. To address these challenges and better exploit limited labeled data under weak supervision, we propose an improved Transformer-based model. Grouping joints by body part strengthens the spatio-temporal correlations between joints, and an integrated LSTM captures long-term dependencies, improving temporal sequence modeling and enabling accurate 3D poses from limited data. These structural improvements, combined with weak supervision strategies, enhance performance while reducing reliance on extensive 3D annotations. Furthermore, a multi-hypothesis strategy and temporal smoothness consistency constraints regulate variations between adjacent time steps. Comparisons on the Human3.6M and HumanEva datasets validate the effectiveness of our approach.
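A minimal PyTorch sketch of the structural ideas (part-wise grouping of joint features followed by an LSTM over time) is given below; the joint grouping, the dimensions, and the single-linear part mixing are illustrative assumptions rather than the authors' model, and the weak-supervision losses are omitted.

```python
import torch
import torch.nn as nn

# Hypothetical Human3.6M-style grouping of 17 joints into five body parts.
PARTS = {"torso": [0, 7, 8, 9, 10], "left_arm": [11, 12, 13], "right_arm": [14, 15, 16],
         "left_leg": [4, 5, 6], "right_leg": [1, 2, 3]}

class PartLSTMPose(nn.Module):
    """Part-grouped joint encoding followed by an LSTM over time (illustrative)."""
    def __init__(self, feat_dim: int = 64, joints: int = 17):
        super().__init__()
        self.embed = nn.Linear(2, feat_dim)                   # 2D keypoint -> feature
        self.part_mix = nn.ModuleDict({k: nn.Linear(len(v) * feat_dim, len(v) * feat_dim)
                                       for k, v in PARTS.items()})
        self.lstm = nn.LSTM(joints * feat_dim, joints * feat_dim, batch_first=True)
        self.head = nn.Linear(joints * feat_dim, joints * 3)  # lift to 3D

    def forward(self, kp2d):                    # kp2d: (B, T, 17, 2)
        b, t, j, _ = kp2d.shape
        f = self.embed(kp2d)                    # (B, T, 17, D)
        mixed = torch.zeros_like(f)
        for name, idx in PARTS.items():         # mix features within each body part
            part = f[:, :, idx].flatten(2)
            mixed[:, :, idx] = self.part_mix[name](part).view(b, t, len(idx), -1)
        out, _ = self.lstm(mixed.flatten(2))    # long-term temporal modeling
        return self.head(out).view(b, t, j, 3)

model = PartLSTMPose()
poses3d = model(torch.randn(2, 16, 17, 2))
print(poses3d.shape)   # torch.Size([2, 16, 17, 3])
```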
{"title":"Transformer-based weakly supervised 3D human pose estimation","authors":"Xiao-guang Wu ,&nbsp;Hu-jie Xie ,&nbsp;Xiao-chen Niu ,&nbsp;Chen Wang ,&nbsp;Ze-lei Wang ,&nbsp;Shi-wen Zhang ,&nbsp;Yu-ze Shan","doi":"10.1016/j.jvcir.2025.104432","DOIUrl":"10.1016/j.jvcir.2025.104432","url":null,"abstract":"<div><div>Deep learning-based 3D human pose estimation methods typically require large amounts of 3D pose annotations. However, due to limitations in data quality and the scarcity of 3D labeled data, researchers have adopted weak supervision methods to reduce the demand for annotated data. Compared to traditional approaches, Transformers have recently achieved remarkable success in 3D human pose estimation. Leveraging their powerful modeling and generalization capabilities, Transformers effectively capture patterns and features in the data, even under limited data conditions, mitigating the issue of data scarcity. Nonetheless, the Transformer architecture struggles to capture long-term dependencies and spatio-temporal correlations between joints when processing spatio-temporal features, which limits its ability to model temporal and spatial relationships comprehensively. To address these challenges and better utilize limited labeled data under weak supervision, we proposed an improved Transformer-based model. By grouping joints according to body parts, we enhanced the spatio-temporal correlations between joints. Additionally, the integration of LSTM captures long-term dependencies, improving temporal sequence modeling and enabling the generation of accurate 3D poses from limited data. These structural improvements, combined with weak supervision strategies, enhance the model’s performance while reducing the reliance on extensive 3D annotations. Furthermore, a multi-hypothesis strategy and temporal smoothness consistency constraints were employed to regulate variations between adjacent time steps. Comparisons on the Human3.6M and HumanEva datasets validate the effectiveness of our approach.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"109 ","pages":"Article 104432"},"PeriodicalIF":2.6,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143654646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Cell tracking-by-detection using elliptical bounding boxes
IF 2.6 CAS Region 4 (Computer Science) Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS Pub Date: 2025-03-14 DOI: 10.1016/j.jvcir.2025.104425
Lucas N. Kirsten, Cláudio R. Jung
Cell detection and tracking are crucial for bio-analysis. Current approaches rely on the tracking-by-model-evolution paradigm, in which end-to-end deep learning models are trained for cell detection and tracking. However, such methods require extensive annotated data, which is time-consuming to produce and often requires specialized annotators. The proposed method approximates cell shapes as oriented ellipses and uses general-purpose oriented-object detectors for cell detection, alleviating the need for annotated data. A global data association algorithm then exploits temporal cell similarity using probability distance metrics, since each ellipse corresponds to a two-dimensional Gaussian distribution. The results of this study suggest that the proposed tracking-by-detection paradigm is a viable alternative for cell tracking. The method achieves competitive results and reduces the dependency on extensive annotated data, a common challenge in current cell detection and tracking approaches. Our code is publicly available at https://github.com/LucasKirsten/Deep-Cell-Tracking-EBB.
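The association step can be made concrete by treating each oriented ellipse as a 2D Gaussian and scoring pairs of detections with a probability distance. The NumPy sketch below uses the Bhattacharyya distance as one such metric; the covariance scaling and the choice of metric are assumptions, not necessarily the paper's exact configuration.

```python
import numpy as np

def ellipse_to_gaussian(cx, cy, a, b, theta):
    """Treat an oriented ellipse (center, semi-axes a >= b, angle theta in radians)
    as a 2D Gaussian: mean at the center, covariance from the rotated axes."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    cov = R @ np.diag([a ** 2, b ** 2]) @ R.T
    return np.array([cx, cy]), cov

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians, usable as an association cost."""
    cov = (cov1 + cov2) / 2
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.inv(cov) @ diff
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

# usage: a lower distance suggests the same cell across consecutive frames
mu_a, cov_a = ellipse_to_gaussian(10, 12, 6, 3, 0.2)
mu_b, cov_b = ellipse_to_gaussian(11, 13, 6, 3, 0.3)
print(bhattacharyya(mu_a, cov_a, mu_b, cov_b))
```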
{"title":"Cell tracking-by-detection using elliptical bounding boxes","authors":"Lucas N. Kirsten,&nbsp;Cláudio R. Jung","doi":"10.1016/j.jvcir.2025.104425","DOIUrl":"10.1016/j.jvcir.2025.104425","url":null,"abstract":"<div><div>Cell detection and tracking are crucial for bio-analysis. Current approaches rely on the tracking-by-model evolution paradigm, where end-to-end deep learning models are trained for cell detection and tracking. However, such methods require extensive amounts of annotated data, which is time-consuming and often requires specialized annotators. The proposed method involves approximating cell shapes as oriented ellipses and utilizing generic-purpose-oriented object detectors for cell detection to alleviate the requirement of annotated data. A global data association algorithm is then employed to explore temporal cell similarity using probability distance metrics, considering that the ellipses relate to two-dimensional Gaussian distributions. The results of this study suggest that the proposed tracking-by-detection paradigm is a viable alternative for cell tracking. The method achieves competitive results and reduces the dependency on extensive annotated data, addressing a common challenge in current cell detection and tracking approaches. Our code is publicly available at <span><span>https://github.com/LucasKirsten/Deep-Cell-Tracking-EBB</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"108 ","pages":"Article 104425"},"PeriodicalIF":2.6,"publicationDate":"2025-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143631697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0