Pub Date: 2025-10-24 | DOI: 10.1016/j.jvcir.2025.104596
Banala Revanth, Manoj Kumar, Sanjay K. Dwivedi
In image dehazing, various vision-transformer-based approaches have been applied with favorable outcomes. Nevertheless, these techniques require a significant amount of training data. We have created a vision transformer that uses mixed attention to extract features at different levels of a given image. The proposed method, called MixViT, is a U-Net-based vision transformer that utilizes mixed attention for image dehazing. The MixViT architecture comprises an encoder and a decoder with integrated skip connections. The model is trained on the I-Haze, O-Haze, NH-Haze, and Dense-Haze datasets, using the mean squared error as the loss function. The proposed MixViT model achieves exceptional performance on the I-Haze and O-Haze datasets but moderate performance on the NH-Haze and Dense-Haze datasets. On average, the proposed method yields more favorable results in terms of complexity, as well as quantitative and visual outcomes, compared with current state-of-the-art image dehazing methods.
{"title":"MixViT: Single image dehazing using Mixed Attention based Vision Transformer","authors":"Banala Revanth, Manoj Kumar, Sanjay K. Dwivedi","doi":"10.1016/j.jvcir.2025.104596","DOIUrl":"10.1016/j.jvcir.2025.104596","url":null,"abstract":"<div><div>In image dehazing, various vision transformer-based approaches have been applied, resulting in favorable outcomes. Nevertheless, these techniques require a significant amount of data for training. We have created a vision transformer that uses mixed attention to extract features at different levels for a given image. The proposed method, called MixViT, is a U-Net-based vision transformer that utilizes mixed attention for image dehazing. The MixViT architecture comprises an encoder and decoder with integrated skip connections. The MixViT model is trained using the I-Haze, O-Haze, NH-Haze and Dense-Haze datasets, employing the mean square error as the loss function. The proposed MixViT model has exceptional performance on I-Haze and O-Haze datasets, but shows moderate performance on NH-Haze and Dense-haze datasets. On average, the proposed method yields more favorable results in terms of complexity, as well as quantitative and visual outcomes, compared to the current state-of-the-art methods for image dehazing.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104596"},"PeriodicalIF":3.1,"publicationDate":"2025-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145365338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-24 | DOI: 10.1016/j.jvcir.2025.104608
Jinhui Wu, Wenkang Zhang, Wankou Yang
Recent advances in Transformer-based light-weight trackers have set new standards on various benchmarks due to their efficiency and effectiveness. Despite these achievements, however, current light-weight trackers discard temporal modeling because of its complexity and inefficiency, which significantly limits their tracking performance in practical applications. To address this limitation, we propose EAPTrack, an efficient tracking model with appearance prompts. The key to EAPTrack lies in generating real-time appearance prompts to guide tracking while maintaining efficient inference, thus overcoming the limitations of static templates in adapting to appearance changes. Unlike existing trackers with complex temporal modeling processes, EAPTrack employs a simple appearance prompt modulation module to generate appearance prompts with minimal computational overhead. Additionally, we design an efficient object encoder equipped with various acceleration mechanisms that enhance efficiency by reducing the sequence length during feature extraction. Extensive experiments demonstrate the efficiency and effectiveness of our model. For example, on the GOT-10k benchmark, EAPTrack achieves 5.9% higher accuracy than leading real-time trackers while maintaining a comparable speed of 156 FPS on a GPU.
{"title":"Exploring efficient appearance prompts for light-weight object tracking","authors":"Jinhui Wu , Wenkang Zhang , Wankou Yang","doi":"10.1016/j.jvcir.2025.104608","DOIUrl":"10.1016/j.jvcir.2025.104608","url":null,"abstract":"<div><div>Recent advances in Transformer-based light-weight trackers have set new standards on various benchmarks due to their efficiency and effectiveness. However, despite these achievements, current light-weight trackers discard temporal modeling due to its complexity and inefficiency, which significantly limits their tracking performance in practical applications. To address this limitation, we propose EAPTrack, an efficient tracking model with appearance prompts. The key of EAPTrack lies in generating real-time appearance prompts to guide tracking while maintaining efficient inference, thus overcoming the limitations of static templates in adapting to changes. Unlike existing trackers with complex temporal modeling processes, EAPTrack employs a simple appearance prompt modulation module to generate appearance prompts with minimal computational overhead. Additionally, we design an efficient object encoder equipped with various acceleration mechanisms, which enhance efficiency by reducing the sequence length during feature extraction. Extensive experiments demonstrate the efficiency and effectiveness of our model. For example, on the GOT-10k benchmark, EAPTrack achieves 5.9% higher accuracy than the leading real-time trackers while maintaining comparable speeds at 156 FPS on GPU.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104608"},"PeriodicalIF":3.1,"publicationDate":"2025-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145418407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-23 | DOI: 10.1016/j.jvcir.2025.104621
Wuyong Tao, Xianghong Hua, Bufan Zhao, Dong Chen, Chong Wu, Danhua Min
3D object recognition remains an active research area in computer vision and graphics. Recognizing objects in cluttered scenes is challenging due to clutter and occlusion. Local feature descriptors (LFDs), known for their robustness to clutter and occlusion, are widely used for 3D object recognition. However, existing LFDs are often affected by noise and varying point density, leading to poor descriptor matching performance. To address this, we propose a new LFD in this paper. First, a novel weighting strategy is introduced, which uses projection distances to calculate weights for neighboring points, thereby constructing a robust local reference frame (LRF). Next, a new feature attribute (i.e., the local voxel center) is proposed to compute voxel values, and these voxel values are concatenated to form the final feature descriptor. This feature attribute is resistant to noise and varying point density, enhancing the overall robustness of the LFD. Additionally, we design a 3D transformation estimation method to generate transformation hypotheses. This method ranks correspondences by distance ratio and traverses the top-ranked ones to compute transformations, reducing iterations and eliminating randomness while allowing the iteration count to be set in advance. Experiments demonstrate that the proposed LRF achieves high repeatability, the LFD exhibits excellent matching performance, and the transformation estimation method is more accurate and computationally efficient than existing alternatives. Overall, our 3D object recognition method achieves a high recognition rate: on three experimental datasets, it reaches recognition rates of 99.07%, 98.31%, and 81.13%, respectively, surpassing the comparative methods. The code is available at: https://github.com/taowuyong?tab=repositories.
{"title":"LoVCS: A local voxel center based descriptor for 3D object recognition","authors":"Wuyong Tao , Xianghong Hua , Bufan Zhao , Dong Chen , Chong Wu , Danhua Min","doi":"10.1016/j.jvcir.2025.104621","DOIUrl":"10.1016/j.jvcir.2025.104621","url":null,"abstract":"<div><div>3D object recognition remains an active research area in computer vision and graphics. Recognizing objects in cluttered scenes is challenging due to clutter and occlusion. Local feature descriptors (LFDs), known for their robustness to clutter and occlusion, are widely used for 3D object recognition. However, existing LFDs are often affected by noise and varying point density, leading to poor descriptor matching performance. To address this, we propose a new LFD in this paper. First, a novel weighting strategy is introduced, utilizing projection distances to calculate weights for neighboring points, thereby constructing a robust local reference frame (LRF). Next, a new feature attribution (i.e., local voxel center) is proposed to compute voxel values. These voxel values are concatenated to form the final feature descriptor. This feature attribution is resistant to noise and varying point density, enhancing the overall robustness of the LFD. Additionally, we design a 3D transformation estimation method to generate transformation hypotheses. This method ranks correspondences by distance ratio and traverses the top-ranked ones to compute transformations, reducing iterations and eliminating randomness while allowing predetermined iteration counts. Experiments demonstrate that the proposed LRF achieves high repeatability and the LFD exhibits excellent matching performance. The transformation estimation method is more accurate and computationally efficient. Overall, our 3D object recognition method achieves a high recognition rate. On three experimental datasets, it gets the recognition rates of 99.07%, 98.31% and 81.13%, respectively, surpassing the comparative methods. The code is available at: <span><span>https://github.com/taowuyong?tab=repositories</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104621"},"PeriodicalIF":3.1,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145365339","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-18 | DOI: 10.1016/j.jvcir.2025.104610
Lewen Fan, Yun Zhang
Video-Based Point Cloud Compression (V-PCC) leverages Versatile Video Coding (VVC) to compress point clouds efficiently, yet it suffers from high computational complexity that challenges real-time applications. To address this critical problem, we propose a texture-aware fast Coding Unit (CU) mode decision algorithm and a complexity allocation strategy for VVC-based V-PCC. By analyzing CU distributions and complexity characteristics, we introduce adaptive early termination thresholds that incorporate spatial, parent–child, and intra/inter CU correlations. Furthermore, we establish a complexity allocation method by formulating and solving an optimization problem to determine relaxation factors for optimal complexity-efficiency trade-offs. Experimental results demonstrate that the proposed fast mode decision achieves average complexity reductions of 33.89% and 44.59% compared to the anchor VVC-based V-PCC, which are better than those of state-of-the-art fast mode decision schemes, while the corresponding average Bjøntegaard Delta Bit Rate (BDBR) losses of 1.04% and 1.64% are negligible.
{"title":"Texture-aware fast mode decision and complexity allocation for VVC based point cloud compression","authors":"Lewen Fan, Yun Zhang","doi":"10.1016/j.jvcir.2025.104610","DOIUrl":"10.1016/j.jvcir.2025.104610","url":null,"abstract":"<div><div>Video-Based Point Cloud Compression (V-PCC) leverages Versatile Video Coding (VVC) to compress point clouds efficiently, yet suffers from high computational complexity that challenges real-time applications. To address this critical problem, we propose a texture-aware fast Coding Unit (CU) mode decision algorithm and a complexity allocation strategy for VVC-based V-PCC. By analyzing CU distributions and complexity characteristics, we introduce adaptive early termination thresholds that incorporate spatial, parent–child, and intra/inter CU correlations. Furthermore, we established a complexity allocation method by formulating and solving an optimization problem to determine relaxation factors for optimal complexity-efficiency trade-offs. Experimental results demonstrate that the proposed fast mode decision achieves an average of 33.89% and 44.59% complexity reductions compared to the anchor VVC-based V-PCC, which are better than those of the state-of-the-art fast mode decision schemes. Meanwhile, the average Bjónteggard Delta Bit Rate (BDBR) loss is 1.04% and 1.64%, which are negligible.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104610"},"PeriodicalIF":3.1,"publicationDate":"2025-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145365333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-17 | DOI: 10.1016/j.jvcir.2025.104611
Qimin Jiang, Xiaoyong Lu, Dong Liang, Songlin Du
Local feature matching is at the core of many computer vision tasks. Current methods consider the extracted keypoints individually, disregarding the connections between keypoints and the scene information, which makes feature matching challenging in scenarios with large changes in viewpoint and illumination. To address this problem, this paper proposes SemMatcher, a novel semantic-aware feature matching framework that exploits the scene information overlooked by keypoints. Specifically, co-visual regions, i.e., regions of the same semantic category appearing in both images, are extracted through semantic segmentation so that the subsequent attention mechanisms can learn in a more focused way. In SemMatcher, we design a semantic-aware attention mechanism that, unlike conventional global learning, pays more attention to co-visual regions, improving both efficiency and performance. Besides, to build connections between keypoints, we introduce a semantic-aware neighborhood consensus that incorporates neighborhood consensus into attentional aggregation and constructs contextualized neighborhood information. Extensive experiments on homography estimation, pose estimation, and image matching demonstrate that the model is superior to other methods and yields substantial performance improvements.
{"title":"SemMatcher: Semantic-aware feature matching with neighborhood consensus","authors":"Qimin Jiang , Xiaoyong Lu , Dong Liang , Songlin Du","doi":"10.1016/j.jvcir.2025.104611","DOIUrl":"10.1016/j.jvcir.2025.104611","url":null,"abstract":"<div><div>Local feature matching is the core of many computer vision tasks. Current methods only consider the extracted points individually, disregarding the connections between keypoints and the scene information, making feature matching challenging in scenarios with rich changes in viewpoint and illumination. To address this problem, this paper proposes SemMatcher, a novel semantic-aware feature matching framework which combines scene information overlooked by keypoints. Specifically, co-visual regions are filtered out through semantic segmentation for more focused learning of subsequent attention mechanisms, which refer to obtaining regions of the same category in two images. In SemMatcher, we design a semantic-aware attention mechanism, which pays more attention to co-visual regions unlike conventional global learning, achieving a win–win situation in terms of efficiency and performance. Besides, to build connections between keypoints, we introduce a semantic-aware neighborhood consensus which incorporates neighborhood consensus into attentional aggregation and constructs contextualized neighborhood information. Extensive experiments on homography estimation, pose estimation and image matching demonstrate that the model is superior to other methods and yields outstanding performance improvements.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104611"},"PeriodicalIF":3.1,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145365337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1016/j.jvcir.2025.104606
Wenyu Wang, Jie Yao, Guosheng Yu, Xianlu Bian, Dandan Ding
To accelerate the quad-tree with nested multi-type tree (QTMT) partition process in H.266/VVC, we propose an edge detection-driven LightGBM method for fast coding unit (CU) partitioning. Unlike previous approaches that use multiple binary classifiers (one for each partition candidate), our method reformulates CU partitioning as a multi-class classification problem and employs LightGBM to solve it. By extracting edge and gradient features that reflect texture complexity and direction, our method organizes a CU’s edge and gradient information into a feature vector, which is then fed into the LightGBM model to directly predict the probabilities of all candidate partitions. In this way, low-probability partitions can be efficiently skipped to reduce encoding time. Extensive experiments under the common test conditions of H.266/VVC demonstrate that our method achieves encoding time reductions of approximately 44% to 57% on VTM-15.0, with only 0.96% to 1.77% Bjøntegaard Delta Bitrate (BDBR) loss, significantly outperforming existing methods.
{"title":"Edge detection-driven LightGBM for fast intra partition of H.266/VVC","authors":"Wenyu Wang , Jie Yao , Guosheng Yu , Xianlu Bian , Dandan Ding","doi":"10.1016/j.jvcir.2025.104606","DOIUrl":"10.1016/j.jvcir.2025.104606","url":null,"abstract":"<div><div>To accelerate the quad-tree with nested multi-type tree (QTMT) partition process in H.266/VVC, we propose an edge detection-driven LightGBM method for fast coding unit (CU) partitioning. Unlike previous approaches using multiple binary classifiers—one for each partition candidate, our method reformulates the CU partition as a multi-class classification problem and employs LightGBM to resolve it. By extracting edge and gradient features that reflect texture complexity and direction, our method organizes a CU’s edges and gradient information as a feature vector, which is then fed into the LightGBM model to directly predict the probabilities of all candidate partitions. In this way, low-probability partitions can be efficiently skipped to reduce encoding time. Extensive experiments under the common test conditions of H.266/VVC demonstrate that our method achieves encoding time reductions of approximately 44% to 57% on VTM-15.0, with only 0.96% to 1.77% Bjøntegaard Delta Bitrate (BDBR) loss, significantly outperforming existing methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104606"},"PeriodicalIF":3.1,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145365334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-16 | DOI: 10.1016/j.jvcir.2025.104609
Zhijie Li, Hong Long, Yunpeng Li, Changhua Li
Scene segmentation remains a challenge in neural view synthesis, particularly when reconstructing specific entities in complex environments. We propose Mask-NeRF, a framework that integrates the Segment Anything Model (SAM) with Neural Radiance Fields (NeRF) for instance-aware reconstruction. By generating entity-specific masks and pruning rays in non-target regions, Mask-NeRF improves rendering quality and efficiency. To address viewpoint and scale inconsistencies among masks, we introduce a Geometric Correction Module combining SIFT-based keypoint detection with FLANN-based feature matching for accurate alignment. A redesigned multi-scale positional encoding (L=10, N=4) further enhances spatial representation. Experiments show that Mask-NeRF achieves an 8.87% improvement in rendering accuracy over NeRF while reducing computational cost, demonstrating strong potential for efficient reconstruction on resource-constrained platforms; nevertheless, real-time applicability requires further validation under broader conditions.
{"title":"NeRF gets personal: Mask-NeRF for targeted scene elements reconstruction","authors":"Zhijie Li, Hong Long, Yunpeng Li, Changhua Li","doi":"10.1016/j.jvcir.2025.104609","DOIUrl":"10.1016/j.jvcir.2025.104609","url":null,"abstract":"<div><div>Scene segmentation remains a challenge in neural view synthesis, particularly when reconstructing specific entities in complex environments. We propose <strong>Mask-NeRF</strong>, a framework that integrates the Segment Anything Model (SAM) with Neural Radiance Fields (NeRF) for instance-aware reconstruction. By generating entity-specific masks and pruning rays in non-target regions, Mask-NeRF improves rendering quality and efficiency. To address viewpoint and scale inconsistencies among masks, we introduce a Geometric Correction Module combining SIFT-based keypoint detection with FLANN-based feature matching for accurate alignment. A redesigned multi-scale positional encoding (<span><math><mrow><mi>L</mi><mo>=</mo><mn>10</mn></mrow></math></span>, <span><math><mrow><mi>N</mi><mo>=</mo><mn>4</mn></mrow></math></span>) further enhances spatial representation. Experiments show Mask-NeRF achieves an 8.87% improvement in rendering accuracy over NeRF while reducing computational cost, demonstrating strong potential for efficient reconstruction on resource-constrained platforms; nevertheless, real-time applicability requires further validation under broader conditions.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104609"},"PeriodicalIF":3.1,"publicationDate":"2025-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145365336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-15 | DOI: 10.1016/j.jvcir.2025.104597
Fei Peng, Shenghui Zhu, Min Long
To address the risk of image piracy caused by screen photography, this paper proposes an end-to-end robust watermarking scheme designed to resist screen-shooting attacks. It comprises an encoder, a noise layer, and a decoder. Specifically, the encoder and decoder are equipped with a Dual-Mode Convolution Block (DMCB) structure, which combines dilated and standard convolutions to effectively enlarge the receptive field and extract richer image features. Moreover, the scheme selects optimal watermark embedding regions to ensure high imperceptibility while maintaining reliable extractability after screen shooting. To further optimize the training process, a dynamic learning rate adjustment strategy is introduced to adaptively modify the learning rate according to a predefined schedule, which accelerates convergence, avoids local minima, and improves both the stability and accuracy of watermark extraction. Experimental results demonstrate strong robustness under various shooting distances and angles while preserving the visual quality of the watermarked images.
{"title":"A screen-shooting resilient watermarking based on Dual-Mode Convolution Block and dynamic learning strategy","authors":"Fei Peng , Shenghui Zhu , Min Long","doi":"10.1016/j.jvcir.2025.104597","DOIUrl":"10.1016/j.jvcir.2025.104597","url":null,"abstract":"<div><div>To address the risk of image piracy caused by screen photography, this paper proposes an end-to-end robust watermarking scheme designed to resist screen-shooting attacks. It comprises an encoder, a noise layer, and a decoder. Specifically, the encoder and decoder are equipped with the DMCB structure, which combines dilated and standard convolutions to effectively enlarge the receptive field and enable the extraction of richer image features. Moreover, it selects optimal watermark embedding regions to ensure high imperceptibility while maintaining reliable extractability after screen-shooting. To further optimize the training process, a dynamic learning rate adjustment strategy is introduced to adaptively modify the learning rate based on a predefined schedule. This accelerates convergence, avoids local minima, and improves both the stability and accuracy of watermark extraction. Experimental results demonstrate its strong robustness under various shooting distances and angles, and the visual quality of images is preserved.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104597"},"PeriodicalIF":3.1,"publicationDate":"2025-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145365335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-15 | DOI: 10.1016/j.jvcir.2025.104601
Yu-Chen Lai, Wei-Ta Chu
Automated analysis of histopathology images has shown significant progress in aiding pathologists. However, developing a robust model based on supervised learning is challenging because of the scarcity of tumor-annotated samples and the presence of unknown diseases. Unsupervised anomaly detection (UAD) methods, previously used mostly in industrial inspection, have therefore been proposed for histopathology images. UAD requires only normal samples for training and largely reduces the labeling burden. This paper introduces a reconstruction-based UAD approach that improves representation learning through adversarial learning and simulated anomalies. On the one hand, we mix up features extracted from normal images to build a smoother feature distribution and employ adversarial learning to enhance an autoencoder for image reconstruction. On the other hand, we simulate anomalous images by deformation and guide the autoencoder to capture the global characteristics of images well. We demonstrate the effectiveness of the approach on histopathology images and other medical image benchmarks, where it achieves state-of-the-art performance.
{"title":"ALSA-UAD: Unsupervised anomaly detection on histopathology images using adversarial learning and simulated anomaly","authors":"Yu-Chen Lai, Wei-Ta Chu","doi":"10.1016/j.jvcir.2025.104601","DOIUrl":"10.1016/j.jvcir.2025.104601","url":null,"abstract":"<div><div>Automatic analyzing computational histopathology images has shown significant progress in aiding pathologists. However, developing a robust model based on supervised learning is challenging because of the scarcity of tumor-marked samples and unknown diseases. Unsupervised anomaly detection (UAD) methods that were mostly used in industrial inspection are thus proposed for histopathology images. UAD only requires normal samples for training and largely reduces the burden of labeling. This paper introduces a reconstruction-based UAD approach to improve representation learning based on adversarial learning and simulated anomalies. On the one hand, we mix up features extracted from normal images to build a smoother feature distribution and employ adversarial learning to enhance an autoencoder for image reconstruction. On the other hand, we simulate anomalous images by deformation and guide the autoencoder to catch the global characteristics of images well. We demonstrate its effectiveness on histopathology images and other medical image benchmarks and show state-of-the-art performance.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104601"},"PeriodicalIF":3.1,"publicationDate":"2025-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145326673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-14 | DOI: 10.1016/j.jvcir.2025.104607
Ping Cao, Shuran Lin, Yanwu Yang, Chunjie Zhang
Deep learning-based image super-resolution methods have made significant progress. However, most of these methods attempt to improve super-resolution performance by using deeper and wider single-branch networks, ignoring the gradient prior knowledge in images. Besides, it is extremely challenging to recover global information and fine local details simultaneously. To address these two issues, we propose in this paper a Dual-branch Interactive Guided Network (DIGN) based on a gradient prior, which focuses not only on restoring global information such as overall brightness and color distribution but also on preserving fine local details. It consists of two parallel branches: an image reconstruction branch, which is responsible for restoring the HR image, and a gradient reconstruction branch, which predicts the gradient map of the HR image. More importantly, we incorporate multiple bidirectional cross-attention modules between the two branches so that they guide each other. Experiments on five benchmark datasets demonstrate DIGN’s effectiveness.
{"title":"Dual-branch interactive guided network based on gradient prior for image super-resolution","authors":"Ping Cao , Shuran Lin , Yanwu Yang , Chunjie Zhang","doi":"10.1016/j.jvcir.2025.104607","DOIUrl":"10.1016/j.jvcir.2025.104607","url":null,"abstract":"<div><div>Deep learning-based image super-resolution methods have made significant progress. However, most of these methods attempt to improve super-resolution performance by using deeper and wider single-branch networks, while ignoring the gradient prior knowledge in images. Besides, it is extremely challenging to recover both of them simultaneously. To address these two issues, in this paper, we propose a <strong>D</strong>ual-branch <strong>I</strong>nteractive <strong>G</strong>uided <strong>N</strong>etwork(DIGN) based on gradient prior, which not only focuses on restoring global information such as overall brightness and color distribution but also on preserving fine local details. It consists of two parallel branches, an image reconstruction branch which is responsible for restoring the HR image, and a gradient reconstruction branch that predicts the gradient map of the HR image. More importantly, we incorporate multiple bidirectional cross-attention modules between the two branches to guide each other. Experiments on five benchmark datasets demonstrate DIGN’s effectiveness.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"113 ","pages":"Article 104607"},"PeriodicalIF":3.1,"publicationDate":"2025-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145326675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}