Regional decay attention for image shadow removal
Pub Date: 2026-01-01 | DOI: 10.1016/j.jvcir.2025.104694
Xiujin Zhu, Chee-Onn Chow, Joon Huang Chuah
Existing Transformer-based shadow removal methods are limited by fixed window sizes, making it difficult to effectively model global information. In addition, they do not fully utilize the distance prior in shadow images. This study argues that shadow removal should model brightness variations between regions from a global perspective. Non-shadow areas near the shadow boundaries are the most important for restoring brightness in shadow regions, and their importance gradually decreases as the distance increases. To achieve this, a regional decay attention mechanism is proposed, which introduces a positional decay bias into the self-attention computation to enable dynamic modeling of contributions from different spatial positions. A local perception module is introduced to improve the model’s ability to capture local details, and a shadow removal model named FW-Former is developed. This model achieves superior performance across multiple datasets, demonstrates stable generalization capability, and maintains a low parameter count.
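As a rough illustration of the idea described above, the sketch below adds a distance-based decay bias to standard scaled dot-product attention over a flattened token grid, so that nearby positions contribute more than distant ones. The bias form, the decay_rate value, and the single-head layout are assumptions for illustration, not the paper's actual FW-Former formulation.

```python
# Minimal sketch: self-attention over a flattened H x W token grid with a bias that
# decays with spatial distance (assumed linear decay; not the paper's exact design).
import torch
import torch.nn.functional as F

def regional_decay_attention(q, k, v, height, width, decay_rate=0.1):
    """q, k, v: (num_tokens, dim) with num_tokens == height * width."""
    ys, xs = torch.meshgrid(torch.arange(height), torch.arange(width), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2) grid positions
    dist = torch.cdist(coords, coords)                                  # pairwise Euclidean distances
    scores = q @ k.t() / q.shape[-1] ** 0.5                             # standard scaled dot-product
    scores = scores - decay_rate * dist                                 # additive decay bias: far tokens are down-weighted
    return F.softmax(scores, dim=-1) @ v

h = w = 8
tokens = torch.randn(h * w, 32)
out = regional_decay_attention(tokens, tokens, tokens, h, w)
print(out.shape)  # torch.Size([64, 32])
```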
{"title":"Regional decay attention for image shadow removal","authors":"Xiujin Zhu , Chee-Onn Chow , Joon Huang Chuah","doi":"10.1016/j.jvcir.2025.104694","DOIUrl":"10.1016/j.jvcir.2025.104694","url":null,"abstract":"<div><div>Existing Transformer-based shadow removal methods are limited by fixed window sizes, making it difficult to effectively model global information. In addition, they do not fully utilize the distance prior in shadow images. This study argues that shadow removal should model brightness variations between regions from a global perspective. Non-shadow areas near the shadow boundaries are the most important for restoring brightness in shadow regions, and their importance gradually decreases as the distance increases. To achieve this, a regional decay attention mechanism is proposed, which introduces a positional decay bias into the self-attention computation to enable dynamic modeling of contributions from different spatial positions. A local perception module is introduced to improve the model’s ability to capture local details, and a shadow removal model named FW-Former is developed. This model achieves superior performance across multiple datasets, demonstrates stable generalization capability, and maintains a low parameter count.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104694"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145884212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DSFusion: A dual-branch step-by-step fusion network for medical image fusion
Pub Date: 2026-01-01 | DOI: 10.1016/j.jvcir.2025.104700
Nili Tian, YuLong Ling, Qing Pan
To address the insufficient consideration of multi-dimensional modal image features in existing medical image fusion methods, this paper proposes a novel dual-branch step-by-step fusion (DSFusion) network. The dual sequence extraction (DSE) block is used for initial feature extraction, followed by the multi-scale lightweight residual (MLRes) block for enhanced efficiency and generalization. Features are then fused through the global pixel-level multi-dimensional fusion (GPMF) module, comprising a multi-dimensional feature extraction block (MFEB) and a pixel-level global fusion branch (PGFB). Finally, the fused features are reconstructed into the final image. Experiments on datasets of different modalities demonstrate that DSFusion achieves competitive or superior performance across multiple evaluation metrics, including Q_MI, PSNR, and Q_P.
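The fusion-quality indicators named above are not defined in this abstract, so the following is only a minimal sketch of two common ones, PSNR and a histogram-based mutual-information score; the exact Q_MI and Q_P definitions used in the paper may differ.

```python
# Minimal sketch of two common image-fusion metrics (illustrative definitions only).
import numpy as np

def psnr(ref, test, peak=255.0):
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def mutual_information(a, b, bins=64):
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = hist / hist.sum()
    px, py = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / (px @ py)[nz])))

src_a = np.random.randint(0, 256, (64, 64))
src_b = np.random.randint(0, 256, (64, 64))
fused = (src_a + src_b) // 2
# A simple MI-based fusion score: information the fused image shares with both sources.
print(psnr(src_a, fused), mutual_information(fused, src_a) + mutual_information(fused, src_b))
```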
{"title":"DSFusion: A dual-branch step-by-step fusion network for medical image fusion","authors":"Nili Tian, YuLong Ling, Qing Pan","doi":"10.1016/j.jvcir.2025.104700","DOIUrl":"10.1016/j.jvcir.2025.104700","url":null,"abstract":"<div><div>To address the limitations of insufficient consideration of multi-dimensional modal image features in existing medical image fusion methods, this paper proposes a novel dual-branch step-by-step fusion (DSFusion) network. The dual sequence extraction block (DSE) is used for initial feature extraction, followed by the multi-scale lightweight residual (MLRes) block for enhanced efficiency and generalization. Features are then fused through the global pixel-level multi-dimensional fusion (GPMF) module, comprising a multi-dimensional feature extraction block (MFEB) and a pixel-level global fusion branch (PGFB). Finally, fused features are reconstructed into the final image. Experiments performed on different modality datasets demonstrate that DSFusion achieves competitive or superior performance across multiple evaluation metrics in the four indicators as <span><math><msub><mrow><mi>Q</mi></mrow><mrow><mi>M</mi><mi>I</mi></mrow></msub></math></span>, <span><math><mrow><mi>P</mi><mi>S</mi><mi>N</mi><mi>R</mi></mrow></math></span>, and <span><math><msub><mrow><mi>Q</mi></mrow><mrow><mi>P</mi></mrow></msub></math></span> metrics.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104700"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PV3M-YOLO: A triple attention-enhanced model for detecting pedestrians and vehicles in UAV-enabled smart transport networks
Pub Date: 2026-01-01 | DOI: 10.1016/j.jvcir.2025.104701
Noor Ul Ain Tahir, Li Kuang, Melikamu Liyih Sinishaw, Muhammad Asim
Pedestrian and vehicle detection in aerial images remains a challenging task due to small object sizes, scale variation, and occlusion, often resulting in missed detections or the need for overly complex models. This study introduces the PV3M-YOLO approach, which incorporates three key focus modules to enhance the detection of small and hard-to-detect objects. The proposed model integrates lightweight Ghost convolution, the Convolutional Block Attention Module (CBAM), and Coordinate Attention (CA) modules with optimized feature aggregation (C2f) into a unified architecture. These modifications improve the model's capacity to capture essential spatial details and contextual dependencies without increasing computational complexity. Furthermore, the Wise-IoUv3 loss function is employed to reduce the influence of low-quality examples, enhancing localization and reducing erroneous identifications by suppressing harmful gradients. Experimental evaluations on the VisDrone2019 dataset demonstrate that PV3M-YOLO achieves a mAP@0.5 of 45.4% and a mAP@0.5:0.95 of 27.9%, surpassing the baseline by 3.9% and 2.7%, respectively. The model maintains efficiency with a compact size of 45.3 MB and a runtime of 8.8 ms. However, detection of extremely small objects remains a limitation due to the high-altitude nature of aerial imagery, indicating the need for future enhancements targeting ultra-small object detection.
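For readers unfamiliar with IoU-based box losses, the sketch below computes a plain IoU loss; Wise-IoUv3's dynamic, non-monotonic focusing weight is not reproduced here, and the focusing_weight placeholder only marks where such a per-sample term would act.

```python
# Minimal sketch of an IoU-style bounding-box loss (Wise-IoUv3's focusing term omitted).
import numpy as np

def iou(box_a, box_b):
    """Boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

pred, gt = (10, 10, 50, 60), (12, 8, 48, 55)
loss = 1.0 - iou(pred, gt)      # base IoU loss
focusing_weight = 1.0            # Wise-IoU would scale this per sample based on box quality
print(loss * focusing_weight)
```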
{"title":"PV3M-YOLO: A triple attention-enhanced model for detecting pedestrians and vehicles in UAV-enabled smart transport networks","authors":"Noor Ul Ain Tahir , Li Kuang , Melikamu Liyih Sinishaw , Muhammad Asim","doi":"10.1016/j.jvcir.2025.104701","DOIUrl":"10.1016/j.jvcir.2025.104701","url":null,"abstract":"<div><div>Pedestrian and vehicle detection in aerial images remains a challenging task due to small object sizes, scale variation, and occlusion, often resulting in missed detections or the need for overly complex models. This study introduces the PV3M-YOLO approach, which incorporates three key focus modules to enhance the detection of small and undetectable objects. The proposed model integrates lightweight Ghost convolution, Convolutional Block Attention Module (CBAM), and Coordination Attention (CA) modules with optimized feature aggregation (C2f) into a unified architecture. These modifications improve the capacity of model to capture essential spatial details and contextual dependencies without increasing computational complexity. Furthermore, the Wise-IoUv3 loss calculus is employed to reduce the influence of subpar cases, enhancing localization and reducing erroneous identifications by reducing negative gradients. Experimental evaluations on VisDrone2019 dataset demonstrate that PV3M-YOLO achieves a [email protected] of 45.4% and [email protected]:95 of 27.9%, surpassing the baseline by 3.9% and 2.7% respectively. The model maintains efficiency with a compact size of 45.3 MB and a runtime of 8.8 ms. However, the detection of extremely small objects remains a limitation due to high-altitude nature of aerial imagery, indicating the need for future model enhancements targeting ultra-small object detection.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104701"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Effective face recognition from video using enhanced social collie optimization-based deep convolutional neural network technique
Pub Date: 2026-01-01 | DOI: 10.1016/j.jvcir.2025.104639
Jitendra Chandrakant Musale, Anuj Kumar Singh, Swati Shirke
A key feature of video surveillance systems is face recognition, which allows the identification and verification of people who appear in scenes frequently collected by a distributed network of cameras. The scientific community is interested in recognizing individuals' faces in videos, partly because of the potential applications and partly because of the difficulty this poses for artificial vision algorithms. A deep convolutional neural network is used to recognize faces from the provided video samples, with features described by the hybrid weighted texture pattern descriptor (HWTP). The deep CNN parameters are tuned by Enhanced Social Collie Optimization (ESCO), which searches for better solutions through various strategies, and the face of an individual is then identified using the optimal parameters. The attained accuracy, precision, recall, and F-measure of the proposed model are 87.92%, 88.01%, 88.01%, and 88.01%, respectively, for a retrieval count of 500.
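A worked example of the reported metrics, assuming the standard count-based definitions of precision, recall, and F-measure; the counts below are hypothetical and the paper's exact evaluation protocol over 500 retrievals is not reproduced.

```python
# Standard binary-count definitions of precision, recall, and F-measure.
def prf(tp, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts, chosen only to illustrate the computation.
print(prf(tp=440, fp=60, fn=60))  # (0.88, 0.88, 0.88)
```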
{"title":"Effective face recognition from video using enhanced social collie optimization-based deep convolutional neural network technique","authors":"Jitendra Chandrakant Musale , Anuj kumar Singh , Swati Shirke","doi":"10.1016/j.jvcir.2025.104639","DOIUrl":"10.1016/j.jvcir.2025.104639","url":null,"abstract":"<div><div>A key feature of video surveillance systems is face recognition, which allows the identification and verification of people who appear in scenes frequently collected by a distributed network of cameras. The scientific community is interested in recognizing the individuals faces in videos, in part due to the potential applications also due to the difficulty in the artificial vision algorithms. The deep convolutional neural network is utilized to recognize the face from the set of provided video samples by the hybrid weighted texture pattern descriptor (HWTP). The deep CNN parameter is tuned by the Enhanced social collie optimization (ESCO), which determines the better solution by the various strategies, similar to this, the face of an individual is identified using optimum parameters. The attained accuracy, precision, recall, and F-measure of the proposed model is 87.92 %, 88.01 %, 88.01 %, and 88.01 % for the number of retrieval 500, respectively.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104639"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145884297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Knowledge distillation meets video foundation models: A video saliency prediction case study
Pub Date: 2026-01-01 | DOI: 10.1016/j.jvcir.2026.104706
Morteza Moradi, Mohammad Moradi, Concetto Spampinato, Ali Borji, Simone Palazzo
Modeling spatio-temporal dynamics remains a major challenge and critical factor for effective video saliency prediction (VSP). The evolution from LSTM and 3D convolutional networks to vision transformers has sparked numerous innovations for tackling this complex video understanding task. However, current technologies still struggle to capture short- and long-term frame dependencies simultaneously. The emergence of large-scale video models has introduced unprecedented opportunities to overcome these limitations but poses significant practical challenges due to their substantial parameter counts and computational costs. To address this, we propose leveraging knowledge distillation—an approach yet to be fully explored in VSP solutions. Specifically, we employ THTD-Net, a leading transformer-based VSP architecture, as the student network, guided by a newly developed large-scale VSP model serving as the teacher. Evaluations on benchmark datasets confirm the efficacy of this novel approach, demonstrating promising performance and substantially reducing the complexity required for real-world applications.
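A minimal sketch of how saliency-map knowledge distillation could combine a teacher-matching term with a ground-truth term; the loss form, the alpha weight, and the map shapes are assumptions for illustration, not THTD-Net's actual training objective.

```python
# Minimal knowledge-distillation loss for saliency maps: KL to the teacher plus KL to fixations.
import torch
import torch.nn.functional as F

def distillation_loss(student_map, teacher_map, ground_truth, alpha=0.5):
    """Maps: (B, H, W) unnormalized saliency logits; ground_truth: (B, H, W) fixation density."""
    b = student_map.shape[0]
    s = F.log_softmax(student_map.view(b, -1), dim=-1)
    t = F.softmax(teacher_map.view(b, -1), dim=-1)
    g = ground_truth.view(b, -1) / ground_truth.view(b, -1).sum(dim=-1, keepdim=True)
    kd = F.kl_div(s, t, reduction="batchmean")     # match the large teacher model
    sup = F.kl_div(s, g, reduction="batchmean")    # match human fixation maps
    return alpha * kd + (1 - alpha) * sup

student = torch.randn(2, 36, 64)
teacher = torch.randn(2, 36, 64)
gt = torch.rand(2, 36, 64)
print(distillation_loss(student, teacher, gt))
```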
{"title":"Knowledge distillation meets video foundation models: A video saliency prediction case study","authors":"Morteza Moradi , Mohammad Moradi , Concetto Spampinato , Ali Borji , Simone Palazzo","doi":"10.1016/j.jvcir.2026.104706","DOIUrl":"10.1016/j.jvcir.2026.104706","url":null,"abstract":"<div><div>Modeling spatio-temporal dynamics remains a major challenge and critical factor for effective video saliency prediction (VSP). The evolution from LSTM and 3D convolutional networks to vision transformers has sparked numerous innovations for tackling this complex video understanding task. However, current technologies still struggle to capture short- and long-term frame dependencies simultaneously. The emergence of large-scale video models has introduced unprecedented opportunities to overcome these limitations but poses significant practical challenges due to their substantial parameter counts and computational costs. To address this, we propose leveraging knowledge distillation—an approach yet to be fully explored in VSP solutions. Specifically, we employ THTD-Net, a leading transformer-based VSP architecture, as the student network, guided by a newly developed large-scale VSP model serving as the teacher. Evaluations on benchmark datasets confirm the efficacy of this novel approach, demonstrating promising performance and substantially reducing the complexity required for real-world applications.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104706"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Iterative mutual voting matching for efficient and accurate Structure-from-Motion
Pub Date: 2026-01-01 | DOI: 10.1016/j.jvcir.2025.104697
Suning Ge, Chunlin Ren, Ya’nan He, Linjie Li, Jiaqi Yang, Kun Sun, Yanning Zhang
As a crucial topic in 3D vision, Structure-from-Motion (SfM) aims to recover camera poses and 3D structures from unconstrained images. Performing pairwise image matching is a critical step. Typically, matching relationships are represented as a view graph, but the initial graph often contains redundant or potentially false edges, affecting both efficiency and accuracy. We propose an efficient incremental SfM method that optimizes the critical image matching step. Specifically, given an image similarity graph, an initialized weighted view graph is constructed. Next, the vertices and edges of the graph are treated as candidates and voters, with iterative mutual voting performed to score image pairs until convergence. Then, the optimal subgraph is extracted using the maximum spanning tree (MST). Finally, incremental reconstruction is carried out based on the selected images. We demonstrate the efficiency and accuracy of our method on general datasets and ambiguous datasets.
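To illustrate the subgraph-selection step, the sketch below extracts a maximum spanning tree from a small weighted view graph with networkx; the pair scores are hypothetical stand-ins for the converged mutual-voting scores, and the incremental reconstruction itself is not shown.

```python
# Minimal sketch: select the optimal subgraph of a weighted view graph via a maximum spanning tree.
import networkx as nx

view_graph = nx.Graph()
# (image_i, image_j, voting score) -- hypothetical scores after the mutual voting converges.
view_graph.add_weighted_edges_from([
    (0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.3), (2, 3, 0.7), (1, 3, 0.2),
])

mst = nx.maximum_spanning_tree(view_graph, weight="weight")
print(sorted(mst.edges(data="weight")))  # image pairs retained for incremental reconstruction
```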
{"title":"Iterative mutual voting matching for efficient and accurate Structure-from-Motion","authors":"Suning Ge , Chunlin Ren , Ya’nan He , Linjie Li , Jiaqi Yang , Kun Sun , Yanning Zhang","doi":"10.1016/j.jvcir.2025.104697","DOIUrl":"10.1016/j.jvcir.2025.104697","url":null,"abstract":"<div><div>As a crucial topic in 3D vision, Structure-from-Motion (SfM) aims to recover camera poses and 3D structures from unconstrained images. Performing pairwise image matching is a critical step. Typically, matching relationships are represented as a view graph, but the initial graph often contains redundant or potentially false edges, affecting both efficiency and accuracy. We propose an efficient incremental SfM method that optimizes the critical image matching step. Specifically, given an image similarity graph, an initialized weighted view graph is constructed. Next, the vertices and edges of the graph are treated as candidates and voters, with iterative mutual voting performed to score image pairs until convergence. Then, the optimal subgraph is extracted using the maximum spanning tree (MST). Finally, incremental reconstruction is carried out based on the selected images. We demonstrate the efficiency and accuracy of our method on general datasets and ambiguous datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104697"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An optical remote sensing ship detection model based on feature diffusion and higher-order relationship modeling
Pub Date: 2026-01-01 | DOI: 10.1016/j.jvcir.2025.104695
Chunman Yan, Ningning Qi
Ship detection plays an increasingly important role in the field of marine monitoring, with Optical Remote Sensing (ORS) technology providing high-resolution spatial and texture information support. However, existing ship detection methods still face significant challenges in accurately detecting small targets, suppressing complex background interference, and modeling cross-scale semantic relationships, limiting their effectiveness in practical applications. Inspired by feature diffusion theory and higher-order spatial interaction mechanisms, this paper proposes a ship detection model for Optical Remote Sensing imagery. Specifically, to address the problem of fine-grained information loss during feature downsampling, the Single-branch and Dual-branch Residual Feature Downsampling (SRFD and DRFD) modules are designed to enhance small target preservation and multi-scale robustness. To capture long-range spatial dependencies and improve robustness against target rotation variations, the Fast Spatial Pyramid Pooling module based on Large Kernel Separable Convolution Attention (SPPF-LSKA) is introduced, enabling efficient large receptive field modeling with rotation-invariant constraints. Furthermore, to dynamically model complex semantic dependencies between different feature scales, the Feature Diffusion Pyramid Network (FDPN) is proposed based on continuous feature diffusion and cross-scale graph reasoning. Experimental results show that the model achieves an AP50 of 86.2% and an AP50-95 of 58.0% on multiple remote sensing ship detection datasets, with the number of parameters reduced to 2.6 M and the model size compressed to 5.5 MB, significantly outperforming several state-of-the-art models in terms of both detection accuracy and lightweight deployment. These results demonstrate the detection capability, robustness, and application potential of the proposed model in Optical Remote Sensing ship monitoring tasks.
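As background for the SPPF-LSKA module, the sketch below shows the commonly used stacked max-pool form of SPPF; the large-kernel separable attention part and the paper's exact configuration are omitted, so this is illustrative only.

```python
# Minimal SPPF-style block: repeated max-pooling enlarges the effective receptive field,
# and the pooled features are concatenated and fused back to the input channel count.
import torch
import torch.nn as nn

class SPPF(nn.Module):
    def __init__(self, channels, pool_size=5):
        super().__init__()
        self.pool = nn.MaxPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.fuse = nn.Conv2d(channels * 4, channels, kernel_size=1)

    def forward(self, x):
        p1 = self.pool(x)
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))

print(SPPF(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```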
{"title":"An optical remote sensing ship detection model based on feature diffusion and higher-order relationship modeling","authors":"Chunman Yan, Ningning Qi","doi":"10.1016/j.jvcir.2025.104695","DOIUrl":"10.1016/j.jvcir.2025.104695","url":null,"abstract":"<div><div>Ship detection plays an increasingly important role in the field of marine monitoring, with Optical Remote Sensing (ORS) technology providing high-resolution spatial and texture information support. However, existing ship detection methods still face significant challenges in accurately detecting small targets, suppressing complex background interference, and modeling cross-scale semantic relationships, limiting their effectiveness in practical applications. Inspired by feature diffusion theory and higher-order spatial interaction mechanisms, this paper proposes a ship detection model for Optical Remote Sensing imagery. Specifically, to address the problem of fine-grained information loss during feature downsampling, the Single-branch and Dual-branch Residual Feature Downsampling (SRFD and DRFD) modules are designed to enhance small target preservation and multi-scale robustness. To capture long-range spatial dependencies and improve robustness against target rotation variations, the Fast Spatial Pyramid Pooling module based on Large Kernel Separable Convolution Attention (SPPF-LSKA) is introduced, enabling efficient large receptive field modeling with rotation-invariant constraints. Furthermore, to dynamically model complex semantic dependencies between different feature scales, the Feature Diffusion Pyramid Network (FDPN) is proposed based on continuous feature diffusion and cross-scale graph reasoning. Experimental results show that model achieves an <em>AP<sub>50</sub></em> of 86.2 % and an <em>AP<sub>50-95</sub></em> of 58.0 % on multiple remote sensing ship detection datasets, with the number of parameters reduced to 2.6 M and the model size compressed to 5.5 MB, significantly outperforming several state-of-the-art models in terms of both detection accuracy and lightweight deployment. These results demonstrate the detection capability, robustness, and application potential of the proposed model in Optical Remote Sensing ship monitoring tasks.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104695"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Corrigendum to “Lightweight macro-pixel quality enhancement network for light field images compressed by versatile video coding” [J. Vis. Commun. Image Represent. 105 (2024) 104329]
Pub Date: 2026-01-01 | DOI: 10.1016/j.jvcir.2025.104673
Hongyue Huang, Chen Cui, Chuanmin Jia, Xinfeng Zhang, Siwei Ma
{"title":"Corrigendum to “Lightweight macro-pixel quality enhancement network for light field images compressed by versatile video coding” [J. Vis. Commun. Image Represent. 105 (2024) 104329]","authors":"Hongyue Huang , Chen Cui , Chuanmin Jia , Xinfeng Zhang , Siwei Ma","doi":"10.1016/j.jvcir.2025.104673","DOIUrl":"10.1016/j.jvcir.2025.104673","url":null,"abstract":"","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104673"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146037533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D human mesh recovery: Comparative review, models, and prospects
Pub Date: 2026-01-01 | DOI: 10.1016/j.jvcir.2025.104699
Wonjun Kim
As the demand for immersive services increases across various fields, the ability to represent objects or scenes in 3D has become essential. In particular, 3D human modeling has gained considerable attention due to its wide range of possibilities for daily life as well as industrial applications. The first step of 3D human modeling is to restore a mesh, commonly defined as a set of connected vertices in 3D space, from images and videos; this is the so-called human mesh recovery (HMR). HMR was long studied with complicated optimization techniques; however, owing to the great success of deep learning in recent years, it has been reformulated as a simple regression problem, and numerous studies are now being actively conducted. This paper aims to provide a comprehensive review with a special focus on deep learning-based methods for HMR. Specifically, it covers a systematic taxonomy along with the questions at the heart of each research period, diverse methodologies, and extensive qualitative and quantitative performance evaluations on benchmark datasets, and it also offers constructive discussion toward the realization of HMR-based commercial services. This review is intended to serve as a concise handbook on HMR rather than a vast collection of existing studies.
{"title":"3D human mesh recovery: Comparative review, models, and prospects","authors":"Wonjun Kim","doi":"10.1016/j.jvcir.2025.104699","DOIUrl":"10.1016/j.jvcir.2025.104699","url":null,"abstract":"<div><div>As a demand for immersive services increases in various fields, the ability to express objects or scenes in 3D has become essential. In particular, 3D human modeling has gained considerable attentions due to its plentiful possibilities for daily life as well as industrial applications. The first step of 3D human modeling is to restore a mesh, which is commonly defined as a set of connected vertices in the 3D space, from images and videos. This is so-called human mesh recovery (HMR). Such HMR has been studied based on complicated optimization techniques, however, owing to the great success of deep learning in recent years, it has been reformulated as a simple regression problem, thus numerous studies are now being actively conducted. This paper aims at providing a comprehensive review with a special focus on deep learning-based methods for HMR. Specifically, this paper covers a systematic taxonomy along with questions at the heart of each research period, diverse methodologies, and abundant performance evaluations on benchmark datasets both qualitatively and quantitatively, and also gives constructive discussions for realization of HMR-based commercialization services. This review is expected to serve as a concise handbook to HMR rather than a vast collection of existing studies.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104699"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145884298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Theft model-based black-box adversarial attack in embedding space
Pub Date: 2026-01-01 | DOI: 10.1016/j.jvcir.2025.104702
Rui Zhang, Shuliang Jiang, Zi Kang, Shuo Xu, Yuanlong Lv, Hui Xia
Existing transfer-based adversarial attacks suffer from poor transferability due to limitations of the proxy dataset or inaccurate imitation of the target model by the substitute model. We therefore propose a theft model-based black-box adversarial attack in embedding space. The substitute model acts as the discriminator of a generative adversarial network, and we introduce a diversity loss to train the generator without relying on a proxy dataset, enabling it to imitate the target model more closely. Furthermore, we design a combined adversarial attack that integrates a gradient-based attack with a natural evolution strategy to construct adversarial examples through a search in the embedding space. This ensures that the adversarial examples are effective against both the substitute and the target models. Experimental results demonstrate that our method has good imitation ability and transferability. When using VGG16, our method outperforms TREMBA by 14.71% in un-targeted attack success rate and shows a 13.49% improvement in targeted attacks.
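A minimal sketch of natural-evolution-strategy (NES) gradient estimation, one ingredient of the combined attack described above; score_fn and the latent setup are hypothetical stand-ins, not the paper's GAN generator or target model.

```python
# Minimal NES gradient estimate in a latent/embedding space, followed by gradient ascent.
import numpy as np

def nes_gradient(score_fn, z, sigma=0.1, samples=20):
    """Estimate the gradient of a black-box score with respect to the latent code z."""
    grad = np.zeros_like(z)
    for _ in range(samples):
        noise = np.random.randn(*z.shape)
        grad += noise * (score_fn(z + sigma * noise) - score_fn(z - sigma * noise))
    return grad / (2 * sigma * samples)

def attack_score(z):                     # stand-in for querying the black-box target model
    return -np.linalg.norm(z - 3.0)      # higher is better for the attacker

z = np.zeros(16)
for _ in range(50):
    z += 0.5 * nes_gradient(attack_score, z)   # ascend the estimated gradient
print(round(attack_score(z), 3))
```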
{"title":"Theft model-based black-box adversarial attack in embedding space","authors":"Rui Zhang , Shuliang Jiang , Zi Kang , Shuo Xu , Yuanlong Lv , Hui Xia","doi":"10.1016/j.jvcir.2025.104702","DOIUrl":"10.1016/j.jvcir.2025.104702","url":null,"abstract":"<div><div>Existing transfer-based adversarial attacks suffer from poor transferability due to limitations of the proxy dataset or inaccurate imitation of the target model by the substitute model. Thus, we propose a theft model-based black-box adversarial attack in embedding space. The substitute model acts as the discriminator of the generative adversarial network, and we introduce a diversity loss to train the generator without relying on a proxy dataset, enabling it to imitate the target model better. Furthermore, we design a combined adversarial attack method that integrates the gradient-based attack and natural evolution strategy to construct adversarial examples in the embedding space search. This ensures that the adversarial examples are compelling on both the target and the substitute models. Experimental results demonstrate that our method has good imitation ability and transferability. When using VGG16, OUR outperforms TREMBA by 14.71% in un-targeted attack success rate and shows a 13.49% improvement in targeted attacks.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104702"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925086","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}