Stability optimization in action imitation for humanoid robot
Pub Date: 2026-03-01. Epub Date: 2026-01-29. DOI: 10.1016/j.jvcir.2026.104738
Yi Lu, Shenghao Ren, Zhiyu Jin, Qiu Shen
Imitation of human actions is crucial for humanoid robots, both to enhance their motion capabilities and to understand human action mechanisms. Current methods focus on capturing human motion and imposing those parameters on robots, but differences in size, structure, and mechanics often result in unstable and distorted robot actions. To address these issues, we propose improving the stability and adaptability of the motion data and conducting motion retargeting across multiple spaces. Specifically, we apply mode adaptive motion smoothing (MAMS) to the lower- and upper-body joints, adapting to different support modes. To balance similarity and stability, we propose a multi-objective motion optimization (MOMO) model under kinematic stability constraints, which takes into account the robot's stable trajectories and the fundamental poses of the human body. Experiments demonstrate that our approach enhances the reliability and stability of robot motion while maintaining a high degree of similarity to human movements, significantly advancing humanoid robot imitation.
{"title":"Stability optimization in action imitation for humanoid robot","authors":"Yi Lu, Shenghao Ren, Zhiyu Jin, Qiu Shen","doi":"10.1016/j.jvcir.2026.104738","DOIUrl":"10.1016/j.jvcir.2026.104738","url":null,"abstract":"<div><div>Imitation of human actions is crucial for humanoid robots to enhance motion capabilities and understand human action mechanisms. Current methods focus on capturing human motion and imposing these parameters to robots, but differences in size, structure, and mechanics often result in unstable and distorted robot actions. To address these issues, we propose improving the stability and adaptability of motion data and conducting motion retargeting across multiple spaces. Specifically, we utilize mode adaptive motion smoothing (MAMS) for lower and upper body joints, adapting to different support modes. To balance similarity and stability, we propose a multi-objective motion optimization (MOMO) model under kinematic stability constraints, which takes into account the robot’s stable trajectories and the fundamental poses of the human body. Experiments demonstrate that our approach enhances the reliability and stability of robot motion while maintaining a high degree of similarity to human movements, significantly advancing the field of humanoid robot imitation.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104738"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pedestrian trajectory prediction using multi-cue transformer
Yanlong Tian, Rui Zhai, Xiaoting Fan, Qi Xue, Zhong Zhang, Xinshan Zhu
Pub Date: 2026-03-01. DOI: 10.1016/j.jvcir.2026.104723
Pedestrian trajectory prediction is challenging because future trajectories are influenced by the surrounding environment and constrained by common-sense rules. Existing trajectory prediction methods typically consider only one kind of cue (social-aware, environment-aware, or goal-conditioned) when modeling interactions with the trajectory information, which leaves the interactions insufficiently modeled. In this article, we propose an innovative Transformer network named Multi-cue Transformer (McTrans) for pedestrian trajectory prediction, in which a Hierarchical Cross-Attention (HCA) module learns the goal–social–environment interactions between pedestrians' trajectory information and all three kinds of cues from the perspectives of temporal and spatial dependencies. Furthermore, to make reasonable use of goal guidance, we propose the Gradual Goal-guided Loss (GGLoss), which gradually increases the weight of the coordinate difference between the predicted goal and the ground-truth goal as the time steps increase. We conduct extensive experiments on three public datasets: SDD, inD, and ETH/UCY. The results demonstrate that McTrans is superior to other state-of-the-art methods.
{"title":"Pedestrian trajectory prediction using multi-cue transformer","authors":"Yanlong Tian , Rui Zhai , Xiaoting Fan , Qi Xue , Zhong Zhang , Xinshan Zhu","doi":"10.1016/j.jvcir.2026.104723","DOIUrl":"10.1016/j.jvcir.2026.104723","url":null,"abstract":"<div><div>Pedestrian trajectory prediction is a challenging issue because the future trajectories are influenced by the surrounding environment and constrained by the common sense rules. The existing trajectory prediction methods typically consider one kind of cues, i.e., social-aware cue, environment-aware cue, and goal-conditioned cue to model the interactions with the trajectory information, which results in insufficient interactions. In this article, we propose an innovative Transformer network named Multi-cue Transformer (McTrans) aimed at pedestrian trajectory prediction, where we design the Hierarchical Cross-Attention (HCA) module to learn the goal–social–environment interactions between the trajectory information of pedestrians and three kinds of cues from the perspectives of temporal and spatial dependencies. Furthermore, in order to reasonably utilize the guidance of the goal information, we propose the Gradual Goal-guided Loss (GGLoss) which gradually increases the weights of the coordinate difference between the predicted goal and the ground-truth goal as the time steps increase. We conduct extensive experiments on three public datasets, i.e., SDD, inD, and ETH/UCY. The experimental results demonstrate that the proposed McTrans is superior to other state-of-the-art methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104723"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LaDeL: Lane detection via multimodal large language model with visual instruction tuning
Yun Zhang, Xin Cheng, Zhou Zhou, Jingmei Zhou, Tong Yang
Pub Date: 2026-03-01. DOI: 10.1016/j.jvcir.2025.104704
Lane detection plays a fundamental role in autonomous driving by providing geometric and semantic guidance for robust localization and planning. Empirical studies have shown that reliable lane perception can reduce vehicle localization error by up to 15% and improve trajectory stability by more than 10%, underscoring its critical importance in safety-critical navigation systems. Visual degradations such as occlusions, worn paint, and illumination shifts result in missing or ambiguous lane boundaries, reducing the reliability of appearance-only methods and motivating scene-aware reasoning. Inspired by the human ability to jointly interpret scene context and road structure, this work presents LaDeL (Lane Detection with Large Language Models), which, to our knowledge, is the first framework to leverage multimodal large language models for lane detection through visual-instruction reasoning. LaDeL reformulates lane perception as a multimodal question-answering task that performs lane localization, lane counting, and scene captioning in a unified manner. We introduce lane-specific tokens to enable precise numerical coordinate prediction and construct a diverse instruction-tuning corpus combining lane queries, lane-count prompts, and scene descriptions. Experiments demonstrate that LaDeL achieves state-of-the-art performance, including an F1-score of 82.35% on CULane and 98.23% on TuSimple, outperforming previous methods. Although LaDeL requires greater computational resources than conventional lane detection networks, it provides new insight into integrating geometric perception with high-level reasoning. Beyond lane detection, this formulation opens opportunities for language-guided perception and reasoning in autonomous driving, including road-scene analysis, interactive driving assistants, and language-aware perception.
{"title":"LaDeL: Lane detection via multimodal large language model with visual instruction tuning","authors":"Yun Zhang , Xin Cheng , Zhou Zhou , Jingmei Zhou , Tong Yang","doi":"10.1016/j.jvcir.2025.104704","DOIUrl":"10.1016/j.jvcir.2025.104704","url":null,"abstract":"<div><div>Lane detection plays a fundamental role in autonomous driving by providing geometric and semantic guidance for robust localization and planning. Empirical studies have shown that reliable lane perception can reduce vehicle localization error by up to 15% and improve trajectory stability by more than 10%, underscoring its critical importance in safety-critical navigation systems. Visual degradations such as occlusions, worn paint, and illumination shifts result in missing or ambiguous lane boundaries, reducing the reliability of appearance-only methods and motivating scene-aware reasoning. Inspired by the human ability to jointly interpret scene context and road structure, this work presents LaDeL (Lane Detection with Large Language Models), which, to our knowledge, is the first framework to leverage multimodal large language models for lane detection through visual-instruction reasoning. LaDeL reformulates lane perception as a multimodal question-answering task that performs lane localization, lane counting, and scene captioning in a unified manner. We introduce lane-specific tokens to enable precise numerical coordinate prediction and construct a diverse instruction-tuning corpus combining lane queries, lane-count prompts, and scene descriptions. Experiments demonstrate that LaDeL achieves state-of-the-art performance, including an F1-score of 82.35% on CULane and 98.23% on TuSimple, outperforming previous methods. Although LaDeL requires greater computational resources than conventional lane detection networks, it provides new insight into integrating geometric perception with high-level reasoning. Beyond lane detection, this formulation opens opportunities for language-guided perception and reasoning in autonomous driving, including road-scene analysis, interactive driving assistants, and language-aware perception.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104704"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic Response GAN (SR-GAN) for embroidery pattern generation
Pub Date: 2026-03-01. Epub Date: 2026-01-08. DOI: 10.1016/j.jvcir.2026.104707
Shaofan Chen
High-resolution, detail-rich image generation models are essential for text-driven embroidery pattern synthesis. In this paper, the Semantic Response Generative Adversarial Network (SR-GAN) is proposed for embroidery image synthesis; it generates higher-quality images and improves text-image alignment. The model integrates word-level text embeddings into the image latent space through a cross-attention mechanism and a confidence-aware fusion scheme, so that word-level semantic features are effectively injected into hidden image features. The Semantic Perception Module is also refined by replacing standard convolutions with depthwise separable convolutions, which reduces the number of model parameters. In addition, the Deep Attention Multimodal Similarity Model directly scores word-pixel correspondences to compute a fine-grained matching loss; it injects embroidery-domain word embeddings into the text encoder for joint training and further tightens the alignment between generated images and text. Experimental results show that the proposed method achieves an FID of 13.84 and an IS of 5.51.
{"title":"Semantic Response GAN (SR-GAN) for embroidery pattern generation","authors":"Shaofan Chen","doi":"10.1016/j.jvcir.2026.104707","DOIUrl":"10.1016/j.jvcir.2026.104707","url":null,"abstract":"<div><div>High-resolution, detail-rich image generation models are essential for text-driven embroidery pattern synthesis. In this paper, the Semantic Response Generative Adversarial Network (SR-GAN) is used for embroidery image synthesis. It generates higher-quality images and improves text-image alignment. The model integrates word-level text embeddings into the image latent space through a cross-attention mechanism and a confidence-aware fusion scheme. In this way, word-level semantic features are effectively injected into hidden image features. The Semantic Perception Module is also refined by replacing standard convolutions with depthwise separable convolutions, which reduces the number of model parameters. In addition, the Deep Attention Multimodal Similarity Model directly scores word-pixel correspondences to compute fine-grained matching loss. It injects embroidery-domain word embeddings into the text encoder for joint training and further tightens the alignment between generated images and text. Experimental results show that the proposed method achieves an FID of 13.84 and an IS of 5.51.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104707"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Global–local co-regularization network for facial action unit detection
Pub Date: 2026-03-01. Epub Date: 2026-01-21. DOI: 10.1016/j.jvcir.2026.104728
Yumei Tan, Haiying Xia, Shuxiang Song
Facial action unit (AU) detection poses challenges in capturing discriminative local features and intricate AU correlations. To address this, we propose an effective Global–local Co-regularization Network (Co-GLN) trained in a collaborative manner. Co-GLN consists of a global branch, which establishes global feature-level interrelationships, and a local branch, which excavates region-level discriminative features. Specifically, in the global branch, a Global Interaction (GI) module is designed to enhance cross-pixel relations for capturing global semantic information. The local branch comprises three components: the Region Localization (RL) module, the Intra-feature Relation Modeling (IRM) module, and the Region Interaction (RI) module. The RL module extracts regional features according to pre-defined facial regions, the IRM module then extracts local features for each region, and the RI module integrates complementary information across regions. Finally, a co-regularization constraint encourages consistency between the global and local branches. Experimental results demonstrate that Co-GLN consistently enhances AU detection performance on the BP4D and DISFA datasets.
{"title":"Global–local co-regularization network for facial action unit detection","authors":"Yumei Tan , Haiying Xia , Shuxiang Song","doi":"10.1016/j.jvcir.2026.104728","DOIUrl":"10.1016/j.jvcir.2026.104728","url":null,"abstract":"<div><div>Facial action unit (AU) detection poses challenges in capturing discriminative local features and intricate AU correlations. To solve this challenge, we propose an effective Global–local Co-regularization Network (Co-GLN) trained in a collaborative manner. Co-GLN consists a global branch and a local branch, aiming to establish global feature-level interrelationships in the global branch while excavating region-level discriminative features in the local branch. Specifically, in the global branch, a Global Interaction (GI) module is designed to enhance cross-pixel relations for capturing global semantic information. The local branch comprises three components: the Region Localization (RL) module, the Intra-feature Relation Modeling (IRM) module, and the Region Interaction (RI) module. The RL module extracts regional features according to the pre-defined facial regions, then IRM module extracts local features for each region. Subsequently, the RI module integrates complementary information across regions. Finally, a co-regularization constraint is used to encourage consistency between the global and local branches. Experimental results demonstrate that Co-GLN consistently enhances AU detection performance on the BP4D and DISFA datasets.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104728"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Infrared small UAV target detection via depthwise separable residual dense attention network
Pub Date: 2026-03-01. Epub Date: 2026-01-06. DOI: 10.1016/j.jvcir.2025.104703
Keyang Cheng, Nan Chen, Chang Liu, Yue Yu, Hao Zhou, Zhe Wang, Changsheng Peng
Unmanned aerial vehicles (UAVs) are extensively utilized in both military and civilian sectors, offering benefits but also posing challenges. Traditional infrared small target detection techniques often suffer from high false-alarm rates and low accuracy. To overcome these issues, we propose the Depthwise Separable Residual Dense Attention Network (DSRDANet), which reformulates the detection task as a residual image prediction problem. The approach features an Adaptive Adjustment Segmentation Module (AASM) that uses depthwise separable residual dense blocks to extract detailed hierarchical features during encoding, and includes multi-scale feature fusion blocks to thoroughly aggregate multi-scale features and enhance residual image reconstruction during decoding. Furthermore, a Channel Attention Modulation Module (CAMM) is designed to model channel interdependencies and spatial encoding, optimizing the outputs of the AASM by adjusting the distribution of feature importance across channels to ensure comprehensive target attention. Experimental results on datasets for infrared small UAV target detection and tracking against various backgrounds validate our approach: compared to state-of-the-art methods, our technique improves the average F1 score by nearly 0.1, the IoU by 0.12, and the CG by 0.66.
{"title":"Infrared small UAV target detection via depthwise separable residual dense attention network","authors":"Keyang Cheng , Nan Chen , Chang Liu , Yue Yu , Hao Zhou , Zhe Wang , Changsheng Peng","doi":"10.1016/j.jvcir.2025.104703","DOIUrl":"10.1016/j.jvcir.2025.104703","url":null,"abstract":"<div><div>Unmanned aerial vehicles (UAVs) are extensively utilized in both military and civilian sectors, offering benefits and posing challenges. Traditional infrared small target detection techniques often suffer from high false alarm rates and low accuracy. To overcome these issues, we propose the Depthwise Separable Residual Dense Attention Network (DSRDANet), which redefines the detection task as a residual image prediction problem. This approach features an Adaptive Adjustment Segmentation Module (AASM) that uses depthwise separable residual dense blocks to extract detailed hierarchical features during encoding. Additionally, multi-scale feature fusion blocks are included to thoroughly aggregate multi-scale features and enhance residual image reconstruction during decoding. Furthermore, the Channel Attention Modulation Module (CAMM) is designed to model channel interdependencies and spatial encoding, optimizing the outputs from AASM by adjusting feature importance distribution across channels, ensuring comprehensive target attention. Experimental results on datasets for infrared small UAV target detection and tracking in various backgrounds validate our approach. Compared to state-of-the-art methods, our technique significantly enhances performance, improving the average F1 score by nearly 0.1, the IOU by 0.12, and the CG by 0.66.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104703"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DFF-Matcher: Robust cross-source registration with density-fused feature and bidirectional consensus matching
Pub Date: 2026-03-01. Epub Date: 2026-02-21. DOI: 10.1016/j.jvcir.2026.104746
Rong Guo, Zhenxuan Zeng, Jiang Wu, Xiyu Zhang, Siwen Quan, Zhongwen Hu, Yu Zhu, Jiaqi Yang
Cross-source point cloud registration plays a pivotal role in enabling seamless 3D perception across heterogeneous sensors. However, this task remains highly challenging due to significant density variations, sensor-specific noise, and partial overlaps between heterogeneous sensors. To address these challenges, we propose DFF-Matcher, a robust framework that integrates density-robust feature learning and bidirectional consensus matching to bridge domain gaps across different sensors. Our approach introduces a density-fused feature module to handle significant point density variations and a self-attention enhanced matching strategy to ensure reliable correspondence estimation. This unified framework establishes a new paradigm for cross-source registration, achieving superior performance across diverse sensor modalities. Extensive experiments demonstrate significant improvements, including 25.4% higher feature matching recall and 22.2% greater registration recall on challenging Kinect-LiDAR datasets, while maintaining robust performance in both indoor and outdoor scenarios.
{"title":"DFF-Matcher: Robust cross-source registration with density-fused feature and bidirectional consensus matching","authors":"Rong Guo , Zhenxuan Zeng , Jiang Wu , Xiyu Zhang , Siwen Quan , Zhongwen Hu , Yu Zhu , Jiaqi Yang","doi":"10.1016/j.jvcir.2026.104746","DOIUrl":"10.1016/j.jvcir.2026.104746","url":null,"abstract":"<div><div>Cross-source point cloud registration plays a pivotal role in enabling seamless 3D perception across heterogeneous sensors. However, this task remains highly challenging due to significant density variations, sensor-specific noise, and partial overlaps between heterogeneous sensors. To address these challenges, we propose DFF-Matcher, a robust framework that integrates density-robust feature learning and bidirectional consensus matching to bridge domain gaps across different sensors. Our approach introduces a density-fused feature module to handle significant point density variations and a self-attention enhanced matching strategy to ensure reliable correspondence estimation. This unified framework establishes a new paradigm for cross-source registration, achieving superior performance across diverse sensor modalities. Extensive experiments demonstrate significant improvements, including 25.4% higher feature matching recall and 22.2% greater registration recall on challenging Kinect-LiDAR datasets, while maintaining robust performance in both indoor and outdoor scenarios.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104746"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147398302","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unified global–local feature modeling via reverse patch scaling for image manipulation localization
Pub Date: 2026-03-01. Epub Date: 2026-01-23. DOI: 10.1016/j.jvcir.2026.104731
Jingying Cai, Hang Cheng, Jiabin Chen, Haichou Wang, Meiqing Wang
Image manipulation localization requires comprehensive extraction and integration of global and local features. However, existing methods often adopt parallel architectures that process semantic context and local details separately, leading to limited interaction and fragmented representations. Moreover, applying uniform patching strategies across all layers ignores the varying semantic roles and spatial properties of deep features. To address these issues, we propose a unified framework that derives local representations directly from hierarchical global features. A reverse patch scaling strategy assigns smaller patch sizes and larger overlaps to deeper layers, enabling dense local modeling aligned with increasing semantic abstraction. An asymmetric cross-attention module improves feature interaction and consistency. Additionally, a dual-strategy decoder fuses multi-scale features via concatenation and addition, while a statistically guided edge awareness module models local variance and entropy from the predicted mask to refine boundary perception. Extensive experiments show that our method outperforms state-of-the-art approaches in both accuracy and robustness.
{"title":"Unified global–local feature modeling via reverse patch scaling for image manipulation localization","authors":"Jingying Cai , Hang Cheng , Jiabin Chen , Haichou Wang , Meiqing Wang","doi":"10.1016/j.jvcir.2026.104731","DOIUrl":"10.1016/j.jvcir.2026.104731","url":null,"abstract":"<div><div>Image manipulation localization requires comprehensive extraction and integration of global and local features. However, existing methods often adopt parallel architectures that process semantic context and local details separately, leading to limited interaction and fragmented representations. Moreover, applying uniform patching strategies across all layers ignores the varying semantic roles and spatial properties of deep features. To address these issues, we propose a unified framework that derives local representations directly from hierarchical global features. A reverse patch scaling strategy assigns smaller patch sizes and larger overlaps to deeper layers, enabling dense local modeling aligned with increasing semantic abstraction. An asymmetric cross-attention module improves feature interaction and consistency. Additionally, a dual-strategy decoder fuses multi-scale features via concatenation and addition, while a statistically guided edge awareness module models local variance and entropy from the predicted mask to refine boundary perception. Extensive experiments show that our method outperforms state-of-the-art approaches in both accuracy and robustness.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104731"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ShoeMatch3D: Attention-enhanced deep learning framework for high-precision 3D shoeprint comparison
Pub Date: 2026-03-01. Epub Date: 2026-02-04. DOI: 10.1016/j.jvcir.2026.104730
Binrui Li, Zhihan Tian, Linyu Huang, Yong Guo
Shoeprint analysis plays a vital role in forensic investigations, especially in linking impressions to suspect footwear. Structured-light 3D scanning enables high-resolution capture of shoeprint point clouds, preserving geometric and depth details. However, traditional geometry-based methods often struggle with limited feature representation and noise sensitivity. To address this, we propose ShoeMatch3D, a deep learning framework for fine-grained 3D shoeprint comparison. The core network, CA-PointShoeNet, enhances PointNet++ with channel attention to better extract discriminative features. A cosine similarity-based triplet loss further optimizes the embedding space for robust matching. Experiments on a self-collected dataset demonstrate strong performance, with accuracies of 95.50%, 93.21%, and 90.90% on training, testing, and validation sets, respectively. These results confirm the method’s effectiveness and its potential for broader 3D forensic identification tasks.
{"title":"ShoeMatch3D: Attention-Enhanced deep learning framework for high-precision 3D shoeprint comparison","authors":"Binrui Li , Zhihan Tian , Linyu Huang , Yong Guo","doi":"10.1016/j.jvcir.2026.104730","DOIUrl":"10.1016/j.jvcir.2026.104730","url":null,"abstract":"<div><div>Shoeprint analysis plays a vital role in forensic investigations, especially in linking impressions to suspect footwear. Structured-light 3D scanning enables high-resolution capture of shoeprint point clouds, preserving geometric and depth details. However, traditional geometry-based methods often struggle with limited feature representation and noise sensitivity. To address this, we propose ShoeMatch3D, a deep learning framework for fine-grained 3D shoeprint comparison. The core network, CA-PointShoeNet, enhances PointNet++ with channel attention to better extract discriminative features. A cosine similarity-based triplet loss further optimizes the embedding space for robust matching. Experiments on a self-collected dataset demonstrate strong performance, with accuracies of 95.50%, 93.21%, and 90.90% on training, testing, and validation sets, respectively. These results confirm the method’s effectiveness and its potential for broader 3D forensic identification tasks.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104730"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146174124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MTPA: A multi-aspects perception assisted AIGV quality assessment model
Pub Date: 2026-03-01. Epub Date: 2026-01-17. DOI: 10.1016/j.jvcir.2026.104721
Yun Liu, Daoxin Fan, Zihan Liu, Sifan Li, Haiyuan Wang
With the development of Artificial Intelligence (AI) generation technology, AI-generated video (AIGV) has attracted much attention. Compared to visual perception of traditional video, AIGV poses unique challenges, such as visual consistency and text-to-video alignment. In this paper, we propose a multi-aspect perception assisted AIGV quality assessment model, which gives a comprehensive quality evaluation of AIGV from three aspects: a text–video alignment score, a visual spatial perceptual score, and a visual temporal perceptual score. Specifically, a pre-trained vision-language module is adopted to assess text-to-video alignment quality, a semantic-aware module is applied to capture visual spatial perceptual features, and an effective visual temporal feature extraction module captures multi-scale temporal features. Finally, the text–video alignment features, the visual spatial and visual temporal perceptual features, and multi-scale visual fusion features are integrated to give a comprehensive quality evaluation. Our model achieves state-of-the-art results on three public AIGV datasets, demonstrating its effectiveness.
{"title":"MTPA: A multi-aspects perception assisted AIGV quality assessment model","authors":"Yun Liu, Daoxin Fan, Zihan Liu, Sifan Li, Haiyuan Wang","doi":"10.1016/j.jvcir.2026.104721","DOIUrl":"10.1016/j.jvcir.2026.104721","url":null,"abstract":"<div><div>With the development of Artificial Intelligence (AI) generated technology, AI generated video (AIGV) has aroused much attention. Compared to the visual perceptual in traditional video, AIGV has its unique challenges, such as visual consistency, text-to-video alignment, etc. In this paper, we propose a multi-aspect perception assisted AIGV quality assessment model, which gives a comprehensive quality evaluation of AIGV from three aspects: text–video alignment score, visual spatial perceptual score, and visual temporal perceptual score. Specifically, a pre-trained vision-language module is adopted to study the text-to-video alignment quality, and the semantic-aware module is applied to capture the visual spatial perceptual features. Besides, an effective visual temporal feature extraction module is used to capture multi-scale temporal features. Finally, text–video alignment features, visual spatial, visual temporal perceptual features, and multi-scale visual fusion features are integrated to give a comprehensive quality evaluation. Our model holds state-of-the-art results on three public AIGV datasets, proving its effectiveness.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104721"},"PeriodicalIF":3.1,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146024842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}