
Latest articles from the Journal of Visual Communication and Image Representation

Stability optimization in action imitation for humanoid robot
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-03-01 | Epub Date: 2026-01-29 | DOI: 10.1016/j.jvcir.2026.104738
Yi Lu, Shenghao Ren, Zhiyu Jin, Qiu Shen
Imitation of human actions is crucial for humanoid robots to enhance motion capabilities and understand human action mechanisms. Current methods focus on capturing human motion and imposing these parameters on robots, but differences in size, structure, and mechanics often result in unstable and distorted robot actions. To address these issues, we propose improving the stability and adaptability of motion data and conducting motion retargeting across multiple spaces. Specifically, we utilize mode adaptive motion smoothing (MAMS) for lower and upper body joints, adapting to different support modes. To balance similarity and stability, we propose a multi-objective motion optimization (MOMO) model under kinematic stability constraints, which takes into account the robot's stable trajectories and the fundamental poses of the human body. Experiments demonstrate that our approach enhances the reliability and stability of robot motion while maintaining a high degree of similarity to human movements, significantly advancing the field of humanoid robot imitation.
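The abstract does not spell out the form of the MOMO objective. As a rough, hypothetical sketch only, a multi-objective cost that trades pose similarity against a stability term could be minimized as below; the function names, the centre-of-mass proxy, and the weights are all assumptions, not the paper's method.

```python
import torch

def momo_cost(robot_pose, human_pose, com_xy, support_center_xy,
              w_sim=1.0, w_stab=0.5):
    """Hypothetical multi-objective cost: pose similarity vs. stability.

    robot_pose / human_pose: (J, 3) joint positions after retargeting.
    com_xy: projected centre of mass of the robot, shape (2,).
    support_center_xy: centre of the current support polygon, shape (2,).
    """
    similarity = torch.mean((robot_pose - human_pose) ** 2)   # stay close to the human motion
    stability = torch.sum((com_xy - support_center_xy) ** 2)  # keep the CoM over the support area
    return w_sim * similarity + w_stab * stability

# Toy usage: refine a robot pose initialised from the human pose.
human = torch.randn(17, 3)
robot = human.clone().requires_grad_(True)
optimizer = torch.optim.Adam([robot], lr=1e-2)
for _ in range(100):
    optimizer.zero_grad()
    loss = momo_cost(robot, human, robot[:, :2].mean(dim=0), torch.zeros(2))
    loss.backward()
    optimizer.step()
```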
Citations: 0
Pedestrian trajectory prediction using multi-cue transformer
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-03-01 | Epub Date: 2026-01-15 | DOI: 10.1016/j.jvcir.2026.104723
Yanlong Tian, Rui Zhai, Xiaoting Fan, Qi Xue, Zhong Zhang, Xinshan Zhu
Pedestrian trajectory prediction is a challenging problem because future trajectories are influenced by the surrounding environment and constrained by common-sense rules. Existing trajectory prediction methods typically consider only one kind of cue, i.e., a social-aware cue, an environment-aware cue, or a goal-conditioned cue, to model the interactions with the trajectory information, which results in insufficient interaction modeling. In this article, we propose an innovative Transformer network named Multi-cue Transformer (McTrans) aimed at pedestrian trajectory prediction, where we design the Hierarchical Cross-Attention (HCA) module to learn the goal–social–environment interactions between the trajectory information of pedestrians and three kinds of cues from the perspectives of temporal and spatial dependencies. Furthermore, in order to reasonably utilize the guidance of the goal information, we propose the Gradual Goal-guided Loss (GGLoss), which gradually increases the weight of the coordinate difference between the predicted goal and the ground-truth goal as the time steps increase. We conduct extensive experiments on three public datasets, i.e., SDD, inD, and ETH/UCY. The experimental results demonstrate that the proposed McTrans is superior to other state-of-the-art methods.
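The exact GGLoss formulation is not given in the abstract. Under one plausible reading, the per-step distance to the ground-truth goal is penalized with a weight that grows with the time step; a minimal sketch, assuming a linear weight schedule and squared-error terms (both assumptions), is:

```python
import torch

def gradual_goal_loss(pred_traj, gt_traj, gt_goal):
    """Hypothetical GGLoss sketch.

    pred_traj, gt_traj: (T, 2) predicted / ground-truth coordinates.
    gt_goal: (2,) ground-truth goal, i.e. the final position.
    """
    T = pred_traj.shape[0]
    traj_loss = torch.mean((pred_traj - gt_traj) ** 2)
    # Weights grow linearly from 1/T to 1 as the time step increases (assumed schedule).
    weights = torch.arange(1, T + 1, dtype=pred_traj.dtype) / T
    goal_err = torch.sum((pred_traj - gt_goal) ** 2, dim=-1)  # per-step distance to the goal
    goal_loss = torch.mean(weights * goal_err)
    return traj_loss + goal_loss
```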
Citations: 0
LaDeL: Lane detection via multimodal large language model with visual instruction tuning
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-03-01 | Epub Date: 2026-01-06 | DOI: 10.1016/j.jvcir.2025.104704
Yun Zhang, Xin Cheng, Zhou Zhou, Jingmei Zhou, Tong Yang
Lane detection plays a fundamental role in autonomous driving by providing geometric and semantic guidance for robust localization and planning. Empirical studies have shown that reliable lane perception can reduce vehicle localization error by up to 15% and improve trajectory stability by more than 10%, underscoring its critical importance in safety-critical navigation systems. Visual degradations such as occlusions, worn paint, and illumination shifts result in missing or ambiguous lane boundaries, reducing the reliability of appearance-only methods and motivating scene-aware reasoning. Inspired by the human ability to jointly interpret scene context and road structure, this work presents LaDeL (Lane Detection with Large Language Models), which, to our knowledge, is the first framework to leverage multimodal large language models for lane detection through visual-instruction reasoning. LaDeL reformulates lane perception as a multimodal question-answering task that performs lane localization, lane counting, and scene captioning in a unified manner. We introduce lane-specific tokens to enable precise numerical coordinate prediction and construct a diverse instruction-tuning corpus combining lane queries, lane-count prompts, and scene descriptions. Experiments demonstrate that LaDeL achieves state-of-the-art performance, including an F1-score of 82.35% on CULane and 98.23% on TuSimple, outperforming previous methods. Although LaDeL requires greater computational resources than conventional lane detection networks, it provides new insight into integrating geometric perception with high-level reasoning. Beyond lane detection, this formulation opens opportunities for language-guided perception and reasoning in autonomous driving, including road-scene analysis, interactive driving assistants, and language-aware perception.
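The paper's instruction-tuning format and lane-specific tokens are not reproduced here. Purely as an illustration of the question-answering reformulation, a sample builder with made-up prompt and coordinate-token conventions might look like this:

```python
def build_lane_qa_sample(lane_points, image_id):
    """Hypothetical instruction-tuning sample in the spirit of LaDeL's
    lane-as-question-answering reformulation (the format is assumed, not the paper's)."""
    question = f"<image:{image_id}> How many lanes are visible, and where are they?"
    # Serialise each lane as a sequence of coordinate tokens.
    lane_strs = [
        "<lane> " + " ".join(f"<x={x:.0f}><y={y:.0f}>" for x, y in lane) + " </lane>"
        for lane in lane_points
    ]
    answer = f"There are {len(lane_points)} lanes. " + " ".join(lane_strs)
    return {"question": question, "answer": answer}

sample = build_lane_qa_sample([[(120, 710), (260, 430)], [(540, 710), (430, 430)]], "000001")
print(sample["answer"])
```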
Citations: 0
Semantic Response GAN (SR-GAN) for embroidery pattern generation
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-03-01 | Epub Date: 2026-01-08 | DOI: 10.1016/j.jvcir.2026.104707
Shaofan Chen
High-resolution, detail-rich image generation models are essential for text-driven embroidery pattern synthesis. In this paper, the Semantic Response Generative Adversarial Network (SR-GAN) is used for embroidery image synthesis. It generates higher-quality images and improves text-image alignment. The model integrates word-level text embeddings into the image latent space through a cross-attention mechanism and a confidence-aware fusion scheme. In this way, word-level semantic features are effectively injected into hidden image features. The Semantic Perception Module is also refined by replacing standard convolutions with depthwise separable convolutions, which reduces the number of model parameters. In addition, the Deep Attention Multimodal Similarity Model directly scores word-pixel correspondences to compute fine-grained matching loss. It injects embroidery-domain word embeddings into the text encoder for joint training and further tightens the alignment between generated images and text. Experimental results show that the proposed method achieves an FID of 13.84 and an IS of 5.51.
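The parameter reduction mentioned in the abstract comes from swapping standard convolutions for depthwise separable ones. The block below is a generic PyTorch sketch of that substitution; the layer sizes are illustrative and this is not the paper's Semantic Perception Module.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel spatial convolution
    followed by a 1x1 pointwise convolution, cutting parameters vs. a full conv."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison for an illustrative 256 -> 256, 3x3 layer.
full = nn.Conv2d(256, 256, 3, padding=1)
separable = DepthwiseSeparableConv(256, 256)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(separable))  # the separable version is roughly 8-9x smaller
```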
Citations: 0
Global–local co-regularization network for facial action unit detection
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-03-01 | Epub Date: 2026-01-21 | DOI: 10.1016/j.jvcir.2026.104728
Yumei Tan, Haiying Xia, Shuxiang Song
Facial action unit (AU) detection poses challenges in capturing discriminative local features and intricate AU correlations. To address this challenge, we propose an effective Global–local Co-regularization Network (Co-GLN) trained in a collaborative manner. Co-GLN consists of a global branch and a local branch, aiming to establish global feature-level interrelationships in the global branch while excavating region-level discriminative features in the local branch. Specifically, in the global branch, a Global Interaction (GI) module is designed to enhance cross-pixel relations for capturing global semantic information. The local branch comprises three components: the Region Localization (RL) module, the Intra-feature Relation Modeling (IRM) module, and the Region Interaction (RI) module. The RL module extracts regional features according to the pre-defined facial regions, and the IRM module then extracts local features for each region. Subsequently, the RI module integrates complementary information across regions. Finally, a co-regularization constraint is used to encourage consistency between the global and local branches. Experimental results demonstrate that Co-GLN consistently enhances AU detection performance on the BP4D and DISFA datasets.
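The abstract does not define the co-regularization constraint precisely. A minimal sketch, assuming both branches predict per-AU logits and the constraint is a simple consistency penalty between their outputs (both assumptions), is:

```python
import torch
import torch.nn.functional as F

def co_regularization_loss(global_logits, local_logits, labels, lam=1.0):
    """Hypothetical co-regularization objective: both branches are supervised on
    the AU labels, plus a consistency term that pulls their predictions together."""
    supervised = (F.binary_cross_entropy_with_logits(global_logits, labels)
                  + F.binary_cross_entropy_with_logits(local_logits, labels))
    consistency = F.mse_loss(torch.sigmoid(global_logits), torch.sigmoid(local_logits))
    return supervised + lam * consistency
```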
Citations: 0
Infrared small UAV target detection via depthwise separable residual dense attention network
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-03-01 | Epub Date: 2026-01-06 | DOI: 10.1016/j.jvcir.2025.104703
Keyang Cheng, Nan Chen, Chang Liu, Yue Yu, Hao Zhou, Zhe Wang, Changsheng Peng
Unmanned aerial vehicles (UAVs) are extensively utilized in both military and civilian sectors, offering benefits and posing challenges. Traditional infrared small target detection techniques often suffer from high false alarm rates and low accuracy. To overcome these issues, we propose the Depthwise Separable Residual Dense Attention Network (DSRDANet), which redefines the detection task as a residual image prediction problem. This approach features an Adaptive Adjustment Segmentation Module (AASM) that uses depthwise separable residual dense blocks to extract detailed hierarchical features during encoding. Additionally, multi-scale feature fusion blocks are included to thoroughly aggregate multi-scale features and enhance residual image reconstruction during decoding. Furthermore, the Channel Attention Modulation Module (CAMM) is designed to model channel interdependencies and spatial encoding, optimizing the outputs from AASM by adjusting feature importance distribution across channels, ensuring comprehensive target attention. Experimental results on datasets for infrared small UAV target detection and tracking in various backgrounds validate our approach. Compared to state-of-the-art methods, our technique significantly enhances performance, improving the average F1 score by nearly 0.1, the IOU by 0.12, and the CG by 0.66.
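CAMM's internals are not described beyond channel reweighting with spatial encoding. The generic squeeze-and-excitation style block below illustrates the kind of channel attention involved; it is not the paper's module.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Generic channel attention: global average pooling produces per-channel
    weights that rescale the feature map."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        weights = self.fc(x.mean(dim=(2, 3)))   # squeeze to (B, C) channel weights
        return x * weights.unsqueeze(-1).unsqueeze(-1)
```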
Citations: 0
DFF-Matcher: Robust cross-source registration with density-fused feature and bidirectional consensus matching
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-03-01 | Epub Date: 2026-02-21 | DOI: 10.1016/j.jvcir.2026.104746
Rong Guo, Zhenxuan Zeng, Jiang Wu, Xiyu Zhang, Siwen Quan, Zhongwen Hu, Yu Zhu, Jiaqi Yang
Cross-source point cloud registration plays a pivotal role in enabling seamless 3D perception across heterogeneous sensors. However, this task remains highly challenging due to significant density variations, sensor-specific noise, and partial overlaps between heterogeneous sensors. To address these challenges, we propose DFF-Matcher, a robust framework that integrates density-robust feature learning and bidirectional consensus matching to bridge domain gaps across different sensors. Our approach introduces a density-fused feature module to handle significant point density variations and a self-attention enhanced matching strategy to ensure reliable correspondence estimation. This unified framework establishes a new paradigm for cross-source registration, achieving superior performance across diverse sensor modalities. Extensive experiments demonstrate significant improvements, including 25.4% higher feature matching recall and 22.2% greater registration recall on challenging Kinect-LiDAR datasets, while maintaining robust performance in both indoor and outdoor scenarios.
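DFF-Matcher's exact matching criterion is not given in the abstract. A common form of bidirectional consensus is the mutual nearest-neighbour check sketched below; this is a generic illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def mutual_nearest_matches(feat_src, feat_tgt):
    """Keep only correspondences that are nearest neighbours in both directions
    of the descriptor space (cosine similarity)."""
    sim = F.normalize(feat_src, dim=1) @ F.normalize(feat_tgt, dim=1).T  # (Ns, Nt)
    src_to_tgt = sim.argmax(dim=1)               # best target index for each source point
    tgt_to_src = sim.argmax(dim=0)               # best source index for each target point
    src_idx = torch.arange(feat_src.shape[0])
    mutual = tgt_to_src[src_to_tgt] == src_idx   # consensus in both directions
    return torch.stack([src_idx[mutual], src_to_tgt[mutual]], dim=1)  # (M, 2) index pairs
```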
Citations: 0
Unified global–local feature modeling via reverse patch scaling for image manipulation localization
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-03-01 | Epub Date: 2026-01-23 | DOI: 10.1016/j.jvcir.2026.104731
Jingying Cai, Hang Cheng, Jiabin Chen, Haichou Wang, Meiqing Wang
Image manipulation localization requires comprehensive extraction and integration of global and local features. However, existing methods often adopt parallel architectures that process semantic context and local details separately, leading to limited interaction and fragmented representations. Moreover, applying uniform patching strategies across all layers ignores the varying semantic roles and spatial properties of deep features. To address these issues, we propose a unified framework that derives local representations directly from hierarchical global features. A reverse patch scaling strategy assigns smaller patch sizes and larger overlaps to deeper layers, enabling dense local modeling aligned with increasing semantic abstraction. An asymmetric cross-attention module improves feature interaction and consistency. Additionally, a dual-strategy decoder fuses multi-scale features via concatenation and addition, while a statistically guided edge awareness module models local variance and entropy from the predicted mask to refine boundary perception. Extensive experiments show that our method outperforms state-of-the-art approaches in both accuracy and robustness.
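The statistically guided edge awareness module models local variance and entropy of the predicted mask. A minimal sketch of those two statistics, assuming sliding-window averages over the mask probabilities (the window size and exact formulation are assumptions), is:

```python
import torch
import torch.nn.functional as F

def local_variance_entropy(mask_prob, k=7):
    """Hypothetical edge-awareness statistics for a predicted mask.

    mask_prob: (B, 1, H, W) per-pixel manipulation probabilities.
    Returns local variance and local binary entropy maps over k x k windows.
    """
    p = mask_prob.clamp(1e-6, 1 - 1e-6)
    mean = F.avg_pool2d(p, k, stride=1, padding=k // 2)
    variance = F.avg_pool2d(p ** 2, k, stride=1, padding=k // 2) - mean ** 2
    pixel_entropy = -(p * p.log() + (1 - p) * (1 - p).log())
    entropy = F.avg_pool2d(pixel_entropy, k, stride=1, padding=k // 2)
    return variance, entropy
```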
Citations: 0
ShoeMatch3D: Attention-Enhanced deep learning framework for high-precision 3D shoeprint comparison
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-03-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.jvcir.2026.104730
Binrui Li, Zhihan Tian, Linyu Huang, Yong Guo
Shoeprint analysis plays a vital role in forensic investigations, especially in linking impressions to suspect footwear. Structured-light 3D scanning enables high-resolution capture of shoeprint point clouds, preserving geometric and depth details. However, traditional geometry-based methods often struggle with limited feature representation and noise sensitivity. To address this, we propose ShoeMatch3D, a deep learning framework for fine-grained 3D shoeprint comparison. The core network, CA-PointShoeNet, enhances PointNet++ with channel attention to better extract discriminative features. A cosine similarity-based triplet loss further optimizes the embedding space for robust matching. Experiments on a self-collected dataset demonstrate strong performance, with accuracies of 95.50%, 93.21%, and 90.90% on training, testing, and validation sets, respectively. These results confirm the method’s effectiveness and its potential for broader 3D forensic identification tasks.
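The abstract names a cosine similarity-based triplet loss; a standard form of that loss is sketched below (the margin value is an assumption, and this is not necessarily the paper's exact variant).

```python
import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Push the anchor-positive cosine similarity above the anchor-negative
    similarity by at least the margin."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return F.relu(neg_sim - pos_sim + margin).mean()
```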
Citations: 0
MTPA: A multi-aspects perception assisted AIGV quality assessment model
IF 3.1 | CAS Tier 4, Computer Science | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2026-03-01 | Epub Date: 2026-01-17 | DOI: 10.1016/j.jvcir.2026.104721
Yun Liu, Daoxin Fan, Zihan Liu, Sifan Li, Haiyuan Wang
With the development of Artificial Intelligence (AI) generated content technology, AI-generated video (AIGV) has attracted much attention. Compared to visual perception in traditional video, AIGV poses unique challenges, such as visual consistency and text-to-video alignment. In this paper, we propose a multi-aspect perception assisted AIGV quality assessment model, which gives a comprehensive quality evaluation of AIGV from three aspects: a text–video alignment score, a visual spatial perceptual score, and a visual temporal perceptual score. Specifically, a pre-trained vision–language module is adopted to assess the text-to-video alignment quality, and a semantic-aware module is applied to capture visual spatial perceptual features. Besides, an effective visual temporal feature extraction module is used to capture multi-scale temporal features. Finally, text–video alignment features, visual spatial and visual temporal perceptual features, and multi-scale visual fusion features are integrated to give a comprehensive quality evaluation. Our model achieves state-of-the-art results on three public AIGV datasets, proving its effectiveness.
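How MTPA combines the three aspect features is not detailed in the abstract. As a loose illustration only, a fusion head that concatenates alignment, spatial, and temporal features and regresses one quality score might look like the following; all class names and dimensions are made up.

```python
import torch
import torch.nn as nn

class QualityFusionHead(nn.Module):
    """Hypothetical fusion head: concatenate the three feature groups and
    regress a single video quality score."""
    def __init__(self, d_align=512, d_spatial=512, d_temporal=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_align + d_spatial + d_temporal, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, f_align, f_spatial, f_temporal):
        return self.mlp(torch.cat([f_align, f_spatial, f_temporal], dim=-1))
```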
Citations: 0