Pedestrian trajectory prediction is a challenging problem because future trajectories are influenced by the surrounding environment and constrained by common-sense rules. Existing trajectory prediction methods typically consider only one kind of cue, i.e., the social-aware, environment-aware, or goal-conditioned cue, when modeling interactions with the trajectory information, which results in insufficient interaction modeling. In this article, we propose an innovative Transformer network named Multi-cue Transformer (McTrans) for pedestrian trajectory prediction, in which we design the Hierarchical Cross-Attention (HCA) module to learn the goal–social–environment interactions between the trajectory information of pedestrians and the three kinds of cues from the perspectives of temporal and spatial dependencies. Furthermore, to reasonably utilize the guidance of the goal information, we propose the Gradual Goal-guided Loss (GGLoss), which gradually increases the weight of the coordinate difference between the predicted goal and the ground-truth goal as the time step increases. We conduct extensive experiments on three public datasets, i.e., SDD, inD, and ETH/UCY. The experimental results demonstrate that the proposed McTrans is superior to other state-of-the-art methods.
{"title":"Pedestrian trajectory prediction using multi-cue transformer","authors":"Yanlong Tian , Rui Zhai , Xiaoting Fan , Qi Xue , Zhong Zhang , Xinshan Zhu","doi":"10.1016/j.jvcir.2026.104723","DOIUrl":"10.1016/j.jvcir.2026.104723","url":null,"abstract":"<div><div>Pedestrian trajectory prediction is a challenging issue because the future trajectories are influenced by the surrounding environment and constrained by the common sense rules. The existing trajectory prediction methods typically consider one kind of cues, i.e., social-aware cue, environment-aware cue, and goal-conditioned cue to model the interactions with the trajectory information, which results in insufficient interactions. In this article, we propose an innovative Transformer network named Multi-cue Transformer (McTrans) aimed at pedestrian trajectory prediction, where we design the Hierarchical Cross-Attention (HCA) module to learn the goal–social–environment interactions between the trajectory information of pedestrians and three kinds of cues from the perspectives of temporal and spatial dependencies. Furthermore, in order to reasonably utilize the guidance of the goal information, we propose the Gradual Goal-guided Loss (GGLoss) which gradually increases the weights of the coordinate difference between the predicted goal and the ground-truth goal as the time steps increase. We conduct extensive experiments on three public datasets, i.e., SDD, inD, and ETH/UCY. The experimental results demonstrate that the proposed McTrans is superior to other state-of-the-art methods.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104723"},"PeriodicalIF":3.1,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-15. DOI: 10.1016/j.jvcir.2026.104722
Jinjin Li, Baiyuan Qing, Kun Zhang, Xinyuan Yang, Xiangui Yin, Yichang Liu
The multi-modal nature of light field imaging produces a refocused image stack, but each image suffers from a limited depth-of-field. All-in-focus (AIF) fusion aims to create a single, sharp image from this stack, a task challenged by irregular depth boundaries and degraded spatial resolution. We propose a novel fusion framework based on the graph wavelet transform (GWT). Unlike traditional methods, our approach adaptively models pixel correlations to better handle irregular boundaries while preserving details. The method decomposes each image using a fast GWT. Low-frequency components are fused via a multi-layer strategy, while high-frequency components are merged using an integrated weighting scheme enhanced by guided filtering. Finally, the AIF image is reconstructed via an inverse GWT. Experimental results on light field datasets demonstrate superior performance over existing methods, achieving average EI, Q_Y, and SSIM scores of 44.939, 0.9941, and 0.8719, respectively, showing its potential for practical applications.
{"title":"All-in-focus image fusion using graph wavelet transform for multi-modal light field","authors":"Jinjin Li , Baiyuan Qing , Kun Zhang , Xinyuan Yang , Xiangui Yin , Yichang Liu","doi":"10.1016/j.jvcir.2026.104722","DOIUrl":"10.1016/j.jvcir.2026.104722","url":null,"abstract":"<div><div>The multi-modal nature of light field imaging produces a refocused image stack, but each image suffers from a limited depth-of-field. All-in-focus (AIF) fusion aims to create a single, sharp image from this stack, a task challenged by irregular depth boundaries and degraded spatial resolution. We propose a novel fusion framework based on the graph wavelet transform (GWT). Unlike traditional methods, our approach adaptively models pixel correlations to better handle irregular boundaries while preserving details. The method decomposes each image using a fast GWT. Low-frequency components are fused via a multi-layer strategy, while high-frequency components are merged using an integrated weighting scheme enhanced by guided filtering. Finally, the AIF image is reconstructed via an inverse GWT. Experimental results on light field datasets demonstrate superior performance over existing methods, achieving average EI, <span><math><msub><mrow><mi>Q</mi></mrow><mrow><mi>Y</mi></mrow></msub></math></span>, and SSIM scores of 44.939, 0.9941, and 0.8719, respectively, showing its potential for practical applications.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104722"},"PeriodicalIF":3.1,"publicationDate":"2026-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-13. DOI: 10.1016/j.jvcir.2026.104711
Fengjuan Wang, Jiayi Liu, Ruonan Zhang, Zhengxue Li, Feng Zhang, Gaoyun An
Knowledge-based Visual Question Answering (KB-VQA) requires models to integrate visual content with external knowledge to answer questions, which is crucial for building intelligent systems capable of real-world understanding. However, effectively incorporating external knowledge into visual reasoning faces three major challenges: the incompleteness of external knowledge bases leads to missing knowledge for many specific visual scenarios; semantic gaps between retrieved textual knowledge and visual content make alignment difficult; and effective mechanisms for fusing heterogeneous knowledge sources are lacking. Although Multimodal Large Language Models (MLLMs) have demonstrated strong performance in visual understanding tasks, they face notable challenges in KB-VQA, particularly in knowledge utilization efficiency and semantic alignment, which seriously limits reasoning depth and robustness. To address these problems, a Context-aware Knowledge Construction and Retrieval (CKCR) method is proposed for knowledge-based VQA, which includes the following three modules. The multi-granularity knowledge retrieval module constructs a joint query vector based on the multi-dimensional embedding representation of images and questions, accurately obtaining explicit knowledge that is highly matched with the context. The vision-to-knowledge generation module supplements fine-grained semantic clues from the perspective of visual content, generating visual knowledge closely related to the image and making up for the expression limitations of general knowledge. To achieve deep alignment of knowledge representations, the knowledge adaptive learning module accurately embeds multi-source knowledge into the semantic space of the MLLM by introducing a learnable knowledge mapping mechanism. Experimental evaluation on the OK-VQA and A-OKVQA datasets shows that CKCR outperforms state-of-the-art methods of the same scale. Ablation experiments and visualization analysis demonstrate the superiority of CKCR in its perception of fine-grained visual information and its ability to align knowledge semantics. Our code will be released on GitHub: https://github.com/fjwang3/CKCR.
{"title":"CKCR: Context-aware knowledge construction and retrieval for knowledge-based visual question answering","authors":"Fengjuan Wang , Jiayi Liu , Ruonan Zhang, Zhengxue Li, Feng Zhang, Gaoyun An","doi":"10.1016/j.jvcir.2026.104711","DOIUrl":"10.1016/j.jvcir.2026.104711","url":null,"abstract":"<div><div>Knowledge-based Visual Question Answering (KB-VQA) requires models to integrate visual content with external knowledge to answer questions, which is crucial for building intelligent systems capable of real-world understanding. However, effectively incorporating external knowledge into visual reasoning faces three major challenges: the incompleteness of external knowledge bases leads to missing knowledge for many specific visual scenarios, semantic gaps exist between retrieved textual knowledge and visual content making alignment difficult, and effective mechanisms for fusing heterogeneous knowledge sources are lacking. While Multimodal Large Language Models(MLLMs) have demonstrated strong performance in visual understanding tasks, but face notable challenges in KB-VQA, particularly in knowledge utilization efficiency and semantic alignment, which seriously limits the reasoning depth and robustness. To address these problems, a Context-aware Knowledge Construction and Retrieval (CKCR) method is proposed for knowledge-based VQA, which includes the following three modules. The multi-granularity knowledge retrieval module constructs joint query vector based on the multi-dimensional embedding representation of images and questions, accurately obtaining explicit knowledge that is highly matched with the context. The vision-to-knowledge generation module supplements fine-grained semantic clues from the perspective of visual content, generating visual knowledge closely related to the image and making up for the expression limitations of general knowledge. To achieve deep alignment of knowledge representation, the knowledge adaptive learning module accurately embeds multi-source knowledge into the semantic space of MLLM by introducing a learnable knowledge mapping mechanism. Experimental evaluation on OK-VQA and A-OKVQA dataset shows the CKCR outperforms state-of-the-art methods of the same-scale. Ablation experiments and visualization analysis demonstrate the superiority of CKCR in its perception of fine-grained visual information and its ability to align knowledge semantics. Our code will be released on GitHub: <span><span>https://github.com/fjwang3/CKCR</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104711"},"PeriodicalIF":3.1,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-08. DOI: 10.1016/j.jvcir.2026.104707
Shaofan Chen
High-resolution, detail-rich image generation models are essential for text-driven embroidery pattern synthesis. In this paper, the Semantic Response Generative Adversarial Network (SR-GAN) is used for embroidery image synthesis; it generates higher-quality images and improves text-image alignment. The model integrates word-level text embeddings into the image latent space through a cross-attention mechanism and a confidence-aware fusion scheme. In this way, word-level semantic features are effectively injected into hidden image features. The Semantic Perception Module is also refined by replacing standard convolutions with depthwise separable convolutions, which reduces the number of model parameters. In addition, the Deep Attention Multimodal Similarity Model directly scores word-pixel correspondences to compute a fine-grained matching loss. It injects embroidery-domain word embeddings into the text encoder for joint training and further tightens the alignment between generated images and text. Experimental results show that the proposed method achieves an FID of 13.84 and an IS of 5.51.
{"title":"Semantic Response GAN (SR-GAN) for embroidery pattern generation","authors":"Shaofan Chen","doi":"10.1016/j.jvcir.2026.104707","DOIUrl":"10.1016/j.jvcir.2026.104707","url":null,"abstract":"<div><div>High-resolution, detail-rich image generation models are essential for text-driven embroidery pattern synthesis. In this paper, the Semantic Response Generative Adversarial Network (SR-GAN) is used for embroidery image synthesis. It generates higher-quality images and improves text-image alignment. The model integrates word-level text embeddings into the image latent space through a cross-attention mechanism and a confidence-aware fusion scheme. In this way, word-level semantic features are effectively injected into hidden image features. The Semantic Perception Module is also refined by replacing standard convolutions with depthwise separable convolutions, which reduces the number of model parameters. In addition, the Deep Attention Multimodal Similarity Model directly scores word-pixel correspondences to compute fine-grained matching loss. It injects embroidery-domain word embeddings into the text encoder for joint training and further tightens the alignment between generated images and text. Experimental results show that the proposed method achieves an FID of 13.84 and an IS of 5.51.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104707"},"PeriodicalIF":3.1,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-08. DOI: 10.1016/j.jvcir.2026.104709
Panpan Niu, Hongxin Wang, Xingqi Wang
Copy-move forgery is one of the most commonly used manipulations for tampering with digital images. In recent years, keypoint-based detection methods have achieved encouraging results, but several shortcomings remain. First, they are unable to generate sufficient keypoints in small or smooth regions, causing detection failure. Second, they lack robust and discriminative descriptors for image keypoints, resulting in false matches. Third, keypoint matching incurs a high computational cost. To tackle these challenges, we present a new keypoint-based image copy-move forgery detection (CMFD) method using three-stage matching with constraints. In keypoint extraction, we extract sufficient SIFT keypoints by adaptively enlarging the image and enhancing its contrast. In feature description, we adopt the combination of the complex and real values of Polar Harmonic Fourier Moments (PHFMs) as the PHFMs-based hybrid feature vector of each keypoint, which substantially enhances the discriminability of the features. In feature matching, we present a fast stratification approach based on SLIC and the locally optimal orientation pattern (LOOP), and utilize the stratification results as constraints on matching, which reduces the search space. A high-precision three-stage matching strategy based on amplitude, phase, and distance information is then executed. In post-processing, the location of the tampered regions is finally determined by one-step filtering and one-step clustering. Extensive experimental results show the superiority of the proposed method over existing representative CMFD techniques.
{"title":"Image copy-move forgery detection using three-stage matching with constraints","authors":"Panpan Niu, Hongxin Wang, Xingqi Wang","doi":"10.1016/j.jvcir.2026.104709","DOIUrl":"10.1016/j.jvcir.2026.104709","url":null,"abstract":"<div><div>Copy-move forgery is one of the most commonly used manipulations for tampering digital images. In recent years, keypoint-based detection methods have achieved encouraging results, but there are still several shortcomings that can be improved. First, unability to generate sufficient keypoints in small or smooth regions, causing detection failure. Second, lack of robust and discriminative descriptors for image keypoints, resulting in false matches. Third, high computational cost of image keypoints matching. To tackle this challenge, we present a new keypoint-based image copy-move forgery detection (CMFD) using three-stage matching with constraints. In keypoint extraction, we extract sufficient SIFT keypoints by adaptively enlarging image and enhancing image contrast. In feature description, we adopt the combination of complex and real values of Polar Harmonic Fourier Moments (PHFMs) as the PHFMs-based hybrid feature vector of each keypoint, which substantially enhances the differentiation of the features. In feature matching, we present a fast stratification approach based on SLIC and locally optimal orientation pattern (LOOP), and utilize the stratification results as the constraints of matching, which can reduce the search space. Then a high-precision three-stage matching strategy based on amplitude information, phase information and distance information is executed. In post-processing, the location of the tampered regions is finally determined by one-step filtering and one-step clustering. Extensive experimental results show the superiority of the proposed method over the existing representative CMFD techniques.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104709"},"PeriodicalIF":3.1,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145950136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lane detection plays a fundamental role in autonomous driving by providing geometric and semantic guidance for robust localization and planning. Empirical studies have shown that reliable lane perception can reduce vehicle localization error by up to 15% and improve trajectory stability by more than 10%, underscoring its critical importance in safety-critical navigation systems. Visual degradations such as occlusions, worn paint, and illumination shifts result in missing or ambiguous lane boundaries, reducing the reliability of appearance-only methods and motivating scene-aware reasoning. Inspired by the human ability to jointly interpret scene context and road structure, this work presents LaDeL (Lane Detection with Large Language Models), which, to our knowledge, is the first framework to leverage multimodal large language models for lane detection through visual-instruction reasoning. LaDeL reformulates lane perception as a multimodal question-answering task that performs lane localization, lane counting, and scene captioning in a unified manner. We introduce lane-specific tokens to enable precise numerical coordinate prediction and construct a diverse instruction-tuning corpus combining lane queries, lane-count prompts, and scene descriptions. Experiments demonstrate that LaDeL achieves state-of-the-art performance, including an F1-score of 82.35% on CULane and 98.23% on TuSimple, outperforming previous methods. Although LaDeL requires greater computational resources than conventional lane detection networks, it provides new insight into integrating geometric perception with high-level reasoning. Beyond lane detection, this formulation opens opportunities for language-guided perception and reasoning in autonomous driving, including road-scene analysis, interactive driving assistants, and language-aware perception.
{"title":"LaDeL: Lane detection via multimodal large language model with visual instruction tuning","authors":"Yun Zhang , Xin Cheng , Zhou Zhou , Jingmei Zhou , Tong Yang","doi":"10.1016/j.jvcir.2025.104704","DOIUrl":"10.1016/j.jvcir.2025.104704","url":null,"abstract":"<div><div>Lane detection plays a fundamental role in autonomous driving by providing geometric and semantic guidance for robust localization and planning. Empirical studies have shown that reliable lane perception can reduce vehicle localization error by up to 15% and improve trajectory stability by more than 10%, underscoring its critical importance in safety-critical navigation systems. Visual degradations such as occlusions, worn paint, and illumination shifts result in missing or ambiguous lane boundaries, reducing the reliability of appearance-only methods and motivating scene-aware reasoning. Inspired by the human ability to jointly interpret scene context and road structure, this work presents LaDeL (Lane Detection with Large Language Models), which, to our knowledge, is the first framework to leverage multimodal large language models for lane detection through visual-instruction reasoning. LaDeL reformulates lane perception as a multimodal question-answering task that performs lane localization, lane counting, and scene captioning in a unified manner. We introduce lane-specific tokens to enable precise numerical coordinate prediction and construct a diverse instruction-tuning corpus combining lane queries, lane-count prompts, and scene descriptions. Experiments demonstrate that LaDeL achieves state-of-the-art performance, including an F1-score of 82.35% on CULane and 98.23% on TuSimple, outperforming previous methods. Although LaDeL requires greater computational resources than conventional lane detection networks, it provides new insight into integrating geometric perception with high-level reasoning. Beyond lane detection, this formulation opens opportunities for language-guided perception and reasoning in autonomous driving, including road-scene analysis, interactive driving assistants, and language-aware perception.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104704"},"PeriodicalIF":3.1,"publicationDate":"2026-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-06. DOI: 10.1016/j.jvcir.2025.104703
Keyang Cheng, Nan Chen, Chang Liu, Yue Yu, Hao Zhou, Zhe Wang, Changsheng Peng
Unmanned aerial vehicles (UAVs) are extensively utilized in both military and civilian sectors, offering benefits and posing challenges. Traditional infrared small target detection techniques often suffer from high false alarm rates and low accuracy. To overcome these issues, we propose the Depthwise Separable Residual Dense Attention Network (DSRDANet), which redefines the detection task as a residual image prediction problem. This approach features an Adaptive Adjustment Segmentation Module (AASM) that uses depthwise separable residual dense blocks to extract detailed hierarchical features during encoding. Additionally, multi-scale feature fusion blocks are included to thoroughly aggregate multi-scale features and enhance residual image reconstruction during decoding. Furthermore, the Channel Attention Modulation Module (CAMM) is designed to model channel interdependencies and spatial encoding, optimizing the outputs from AASM by adjusting feature importance distribution across channels, ensuring comprehensive target attention. Experimental results on datasets for infrared small UAV target detection and tracking in various backgrounds validate our approach. Compared to state-of-the-art methods, our technique significantly enhances performance, improving the average F1 score by nearly 0.1, the IOU by 0.12, and the CG by 0.66.
{"title":"Infrared small UAV target detection via depthwise separable residual dense attention network","authors":"Keyang Cheng , Nan Chen , Chang Liu , Yue Yu , Hao Zhou , Zhe Wang , Changsheng Peng","doi":"10.1016/j.jvcir.2025.104703","DOIUrl":"10.1016/j.jvcir.2025.104703","url":null,"abstract":"<div><div>Unmanned aerial vehicles (UAVs) are extensively utilized in both military and civilian sectors, offering benefits and posing challenges. Traditional infrared small target detection techniques often suffer from high false alarm rates and low accuracy. To overcome these issues, we propose the Depthwise Separable Residual Dense Attention Network (DSRDANet), which redefines the detection task as a residual image prediction problem. This approach features an Adaptive Adjustment Segmentation Module (AASM) that uses depthwise separable residual dense blocks to extract detailed hierarchical features during encoding. Additionally, multi-scale feature fusion blocks are included to thoroughly aggregate multi-scale features and enhance residual image reconstruction during decoding. Furthermore, the Channel Attention Modulation Module (CAMM) is designed to model channel interdependencies and spatial encoding, optimizing the outputs from AASM by adjusting feature importance distribution across channels, ensuring comprehensive target attention. Experimental results on datasets for infrared small UAV target detection and tracking in various backgrounds validate our approach. Compared to state-of-the-art methods, our technique significantly enhances performance, improving the average F1 score by nearly 0.1, the IOU by 0.12, and the CG by 0.66.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"116 ","pages":"Article 104703"},"PeriodicalIF":3.1,"publicationDate":"2026-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145981705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01. DOI: 10.1016/j.jvcir.2025.104693
Yi Wang, Haonan Su, Zhaolin Xiao, Haiyan Jin
Deep low-light enhancement methods typically learn from long-exposure ground truths. While effective for dark scenes, this approach often causes overexposure in HDR scenarios and lacks adaptability to varying illumination levels. Therefore, we develop a deep low-light image enhancement method via Multi-task Learning of Few-Shot Exposure Imaging (MLFSEI), which is formulated as a Bayesian multi-task directed graphical model and predicts the enhanced images by learning few-shot tasks comprising multi-exposure images and their corresponding exposure vectors. The proposed method predicts the enhanced image from the selected exposure vector and the latent variable learned across the few tasks. The exposure vectors are defined as characteristics of the few-shot exposure datasets, containing the mean, variance, and contrast of the images. Moreover, multi-order gradients are developed to constrain the structure and details with respect to the ground truth. Experimental results demonstrate significant improvements, with average gains of 4.64 dB in PSNR and 0.071 in SSIM, along with an average reduction of 1.12 in NIQE across multiple benchmark datasets compared to state-of-the-art methods. Furthermore, the proposed method can be extended to produce multiple outputs with varying exposure levels within one model.
{"title":"Deep low light image enhancement via Multi-Task Learning of Few Shot Exposure Imaging","authors":"Yi Wang, Haonan Su, Zhaolin Xiao, Haiyan Jin","doi":"10.1016/j.jvcir.2025.104693","DOIUrl":"10.1016/j.jvcir.2025.104693","url":null,"abstract":"<div><div>Deep low-light enhancement methods typically learn from long-exposure ground truths. While effective for dark scenes, this approach often causes overexposure in HDR scenarios and lacks adaptability to varying illumination levels. Therefore, we develop a deep low light image enhancement via Multi-task Learning of Few Shot Exposure Imaging (MLFSEI) which is formulated as Bayesian multi-task directed graphical model and predict the enhanced images by learning few-shot tasks comprising multi-exposure images and their corresponding exposure vectors. The proposed method predicts the enhanced image from the selected exposure vector and the learned latent variable among few tasks. The exposure vectors are defined as the characteristics of few shot exposure datasets containing mean, variance and contrast of images. Moreover, the multi order gradients are developed to constrain the structure and details from the ground truth. Experimental results demonstrate significant improvements, with average gains of 4.64 dB in PSNR and 0.071 in SSIM, along with an average reduction of 1.12 in NIQE across multiple benchmark datasets compared to state-of-the-art methods. Furthermore, the proposed method can be extended to accommodate multiple outputs with varying exposure levels among one model.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104693"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01. DOI: 10.1016/j.jvcir.2025.104698
Jian He, Rongqi Cao, Cheng Zhang, Suyu Wang
Traffic police gesture recognition is important in autonomous driving. Most existing methods rely on extracting pixel-level features from RGB images, which lack interpretability due to the absence of explicit skeletal gesture features. Current deep learning approaches often fail to effectively model skeletal gesture information because they ignore the inherent connections between joint coordinate data and gesture semantics. Additionally, many methods fail to integrate multi-modal skeletal information (such as joint positions, rotations, and root orientation), limiting their ability to capture cross-modal correlations. Beyond methodological limitations, existing datasets often lack diversity in commanding directions, hindering fine-grained recognition of gestures intended for different traffic flows. To address these limitations, this paper presents the CTPGesture v2 dataset, containing Chinese traffic police gestures that command vehicles in four directions, and proposes a skeleton-based graph convolution method for continuous gesture recognition. Specifically, a position-rotation graph (PR-Graph) is constructed with joint positions, joint rotations, and root rotations all in the same graph to enrich the graph’s representational power. An elevation partitioning strategy (EPS) is introduced to address the shortcomings of the conventional spatial configuration partitioning strategy (SCPS). Experiments demonstrate that our method achieves a Jaccard score of 0.842 on CTPGesture v2 at 31.9 FPS, improving over previous works. The proposed PR-Graph and EPS establish a more descriptive graph for the GCN and help capture cross-modality correlations during the graph convolution stages. Our code is available at https://github.com/crq0528/RT-VIBT. Our datasets are available at https://github.com/crq0528/traffic-gesture-datasets.
{"title":"Position-rotation graph and elevation partitioning strategy for traffic police gesture recognition","authors":"Jian He , Rongqi Cao , Cheng Zhang , Suyu Wang","doi":"10.1016/j.jvcir.2025.104698","DOIUrl":"10.1016/j.jvcir.2025.104698","url":null,"abstract":"<div><div>Traffic police gesture recognition is important in autonomous driving. Most existing methods rely on extracting pixel-level features from RGB images, which lack interpretability due to the absence of explicit skeletal gesture features. Current deep learning approaches often fail to effectively model skeletal gesture information because they ignore the inherent connections between joint coordinate data and gesture semantics. Additionally, many methods fail to integrate multi-modal skeletal information (such as joint positions, rotations, and root orientation), limiting their ability to capture cross-modal correlations. Beyond methodological limitations, existing datasets often lack diversity in commanding directions, hindering fine-grained recognition of gestures intended for different traffic flows. To address these limitations, this paper presents the CTPGesture v2 dataset with Chinese traffic police gestures that command vehicles in four directions and proposes a skeleton-based graph convolution method for continuous gesture recognition. Specifically, a position-rotation graph (PR-Graph) is constructed with joint positions, rotations, and root rotations all in the same graph to enrich the graph’s representational power. An elevation partitioning strategy (EPS) is introduced to address the shortcutting issue of the conventional spatial configuration partitioning strategy (SCPS). Experiments demonstrate our method achieves 0.842 Jaccard score on CTPGesture v2 at 31.9 FPS, improving over previous works. The proposed PR-Graph and EPS establish a more descriptive graph for GCN and help capture cross-modality correlations during the graph convolution stages. Our code is available at <span><span>https://github.com/crq0528/RT-VIBT</span><svg><path></path></svg></span>. Our datasets are available at <span><span>https://github.com/crq0528/traffic-gesture-datasets</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104698"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145925099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-01. DOI: 10.1016/j.jvcir.2025.104694
Xiujin Zhu, Chee-Onn Chow, Joon Huang Chuah
Existing Transformer-based shadow removal methods are limited by fixed window sizes, making it difficult to effectively model global information. In addition, they do not fully utilize the distance prior in shadow images. This study argues that shadow removal should model brightness variations between regions from a global perspective. Non-shadow areas near the shadow boundaries are the most important for restoring brightness in shadow regions, and their importance gradually decreases as the distance increases. To achieve this, a regional decay attention mechanism is proposed, which introduces a positional decay bias into the self-attention computation to enable dynamic modeling of contributions from different spatial positions. A local perception module is introduced to improve the model’s ability to capture local details, and a shadow removal model named FW-Former is developed. This model achieves superior performance across multiple datasets, demonstrates stable generalization capability, and maintains a low parameter count.
{"title":"Regional decay attention for image shadow removal","authors":"Xiujin Zhu , Chee-Onn Chow , Joon Huang Chuah","doi":"10.1016/j.jvcir.2025.104694","DOIUrl":"10.1016/j.jvcir.2025.104694","url":null,"abstract":"<div><div>Existing Transformer-based shadow removal methods are limited by fixed window sizes, making it difficult to effectively model global information. In addition, they do not fully utilize the distance prior in shadow images. This study argues that shadow removal should model brightness variations between regions from a global perspective. Non-shadow areas near the shadow boundaries are the most important for restoring brightness in shadow regions, and their importance gradually decreases as the distance increases. To achieve this, a regional decay attention mechanism is proposed, which introduces a positional decay bias into the self-attention computation to enable dynamic modeling of contributions from different spatial positions. A local perception module is introduced to improve the model’s ability to capture local details, and a shadow removal model named FW-Former is developed. This model achieves superior performance across multiple datasets, demonstrates stable generalization capability, and maintains a low parameter count.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"115 ","pages":"Article 104694"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145884212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}