Pub Date : 2025-12-29 DOI: 10.1016/j.cviu.2025.104613
Leonardo Plini , Luca Scofano , Edoardo De Matteis , Guido Maria D’Amely di Melendugno , Alessandro Flaborea , Andrea Sanchietti , Giovanni Maria Farinella , Fabio Galasso , Antonino Furnari
Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare, and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, no existing technique can reliably detect open-set procedural mistakes in an online setting. We propose a dual-branch architecture to address this problem in an online fashion: the recognition branch takes input frames from the egocentric video, predicts the current action, and aggregates frame-level results into action tokens, while the anticipation branch leverages the strong pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module.
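As a rough illustration of the mismatch rule described above (a sketch, not the authors' implementation), the snippet below flags a mistake whenever the recognized action disagrees with the action anticipated from the token history; the `anticipate_next` predictor and the toy fixed procedure are hypothetical stand-ins for the LLM-based anticipation branch and the recognition branch.

```python
# Hedged sketch: open-set mistake detection as a mismatch between a recognized
# action token and the action anticipated from the history of previous tokens.
from typing import Callable, List

def detect_mistake(
    history: List[str],                           # action tokens recognized so far
    current_action: str,                          # token from the recognition branch
    anticipate_next: Callable[[List[str]], str],  # e.g. an LLM-based predictor (hypothetical)
) -> bool:
    """Return True when the recognized action disagrees with the anticipated one."""
    predicted = anticipate_next(history)
    return predicted != current_action

# Toy usage with a hard-coded "procedure" standing in for the anticipation branch.
PROCEDURE = ["take bowl", "add flour", "add water", "stir"]

def toy_anticipator(history: List[str]) -> str:
    return PROCEDURE[len(history)] if len(history) < len(PROCEDURE) else "end"

print(detect_mistake(["take bowl"], "add flour", toy_anticipator))  # False: matches the procedure
print(detect_mistake(["take bowl"], "stir", toy_anticipator))       # True: mistake flagged
```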
Extensive experiments on two novel procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach.
{"title":"TI-PREGO: Chain of Thought and In-Context Learning for online mistake detection in PRocedural EGOcentric videos","authors":"Leonardo Plini , Luca Scofano , Edoardo De Matteis , Guido Maria D’Amely di Melendugno , Alessandro Flaborea , Andrea Sanchietti , Giovanni Maria Farinella , Fabio Galasso , Antonino Furnari","doi":"10.1016/j.cviu.2025.104613","DOIUrl":"10.1016/j.cviu.2025.104613","url":null,"abstract":"<div><div>Identifying procedural errors online from egocentric videos is a critical yet challenging task across various domains, including manufacturing, healthcare and skill-based training. The nature of such mistakes is inherently open-set, as unforeseen or novel errors may occur, necessitating robust detection systems that do not rely on prior examples of failure. Currently, no existing technique can reliably detect open-set procedural mistakes in an online setting. We propose a dual-branch architecture to address this problem in an online fashion: the recognition branch takes input frames from egocentric video, predicts the current action and aggregates frame-level results into action tokens while the anticipation branch leverages the solid pattern-matching capabilities of Large Language Models (LLMs) to predict action tokens based on previously predicted ones. Mistakes are detected as mismatches between the currently recognized action and the action predicted by the anticipation module.</div><div>Extensive experiments on two novel procedural datasets demonstrate the challenges and opportunities of leveraging a dual-branch architecture for mistake detection, showcasing the effectiveness of our proposed approach.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104613"},"PeriodicalIF":3.5,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-29 DOI: 10.1016/j.cviu.2025.104626
Harry Hughes , Patrick Lucey , Michael Horton , Harshala Gammulle , Clinton Fookes , Sridha Sridharan
In soccer, tracking data (player and ball locations over time) is central to performance analysis and a major focus of computer vision in sport. Tracking from broadcast or single-view video offers scalable coverage across all professional matches but suffers from frequent occlusions and missing information. Existing academic work typically evaluates short clips under simplified conditions, whereas industrial applications require complete, game-level coverage. We address these challenges with a multimodal transformer–diffusion framework that combines human-in-the-loop event supervision with single-view video. Our approach first leverages long-term multimodal context — tracking and event annotations — to improve coarse agent localization, then reconstructs full trajectories using a diffusion-based generative model that produces realistic, temporally coherent motion. Compared to state-of-the-art methods, our approach substantially improves both coarse and fine-grained accuracy while scaling effectively to industrial settings. By integrating human supervision with multimodal generative modeling, we provide a robust and practical solution for producing accurate and realistic player and ball trajectories under challenging real-world single-view conditions.
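To make the diffusion-based reconstruction step more concrete, here is a hedged sketch of a DDPM-style reverse loop that denoises only the missing parts of an (agents × time × 2) trajectory tensor while keeping observed coordinates fixed; the `denoiser` network, noise schedule, and lack of event conditioning are illustrative placeholders rather than the paper's model.

```python
# Hedged sketch of inpainting missing trajectory segments with a trained
# denoising model; the network and schedule are placeholders, not the paper's.
import torch

@torch.no_grad()
def reconstruct_trajectories(denoiser, traj, observed_mask, betas):
    """traj: (A, T, 2) partial trajectories; observed_mask: (A, T, 1) with 1 = observed."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn_like(traj)                      # start from pure noise
    for t in reversed(range(len(betas))):
        eps = denoiser(x, torch.tensor([t]))        # predict noise (placeholder call)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
        # re-impose the observed coordinates so only the gaps are generated
        x = observed_mask * traj + (1.0 - observed_mask) * x
    return x

# Toy usage with an untrained "denoiser" just to exercise the loop.
net = lambda x, t: torch.zeros_like(x)
traj = torch.zeros(22, 100, 2)
mask = torch.ones(22, 100, 1)
mask[:, 40:60] = 0                                  # simulate an occluded segment
out = reconstruct_trajectories(net, traj, mask, torch.linspace(1e-4, 0.02, 50))
print(out.shape)  # torch.Size([22, 100, 2])
```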
{"title":"Multimodal transformer–diffusion framework for large-scale reconstruction of soccer tracking data","authors":"Harry Hughes , Patrick Lucey , Michael Horton , Harshala Gammulle , Clinton Fookes , Sridha Sridharan","doi":"10.1016/j.cviu.2025.104626","DOIUrl":"10.1016/j.cviu.2025.104626","url":null,"abstract":"<div><div>In soccer, tracking data (player and ball locations over time) is central to performance analysis and a major focus of computer vision in sport. Tracking from broadcast or single-view video offers scalable coverage across all professional matches but suffers from frequent occlusions and missing information. Existing academic work typically evaluates short clips under simplified conditions, whereas industrial applications require complete, game-level coverage. We address these challenges with a multimodal transformer–diffusion framework that combines human-in-the-loop event supervision with single-view video. Our approach first leverages long-term multimodal context — tracking and event annotations — to improve coarse agent localization, then reconstructs full trajectories using a diffusion-based generative model that produces realistic, temporally coherent motion. Compared to state-of-the-art methods, our approach substantially improves both coarse and fine-grained accuracy while scaling effectively to industrial settings. By integrating human supervision with multimodal generative modeling, we provide a robust and practical solution for producing accurate and realistic player and ball trajectories under challenging real-world single-view conditions.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104626"},"PeriodicalIF":3.5,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-27 DOI: 10.1016/j.cviu.2025.104627
Zhibo Wang , Amir Nazemi , Stephie Liu , Sirisha Rambhatla , Yuhao Chen , David Clausi
Accurate registration of ice hockey rinks from broadcast video frames is fundamental to sports analytics, as it aligns the rink template and broadcast frame into a unified coordinate system for consistent player analysis. Existing approaches, including keypoint- and segmentation-based methods, often yield suboptimal homography estimation due to insufficient attention to rink boundaries. To address this, we propose a segmentation-based framework that explicitly introduces the rink boundary as a new segmentation class. To further improve accuracy, we introduce three components that enhance boundary awareness: (i) a boundary-aware loss to strengthen boundary representation, (ii) a dynamic class-weighted mechanism in homography estimation to emphasize informative regions, and (iii) a self-distillation strategy to enrich feature diversity. Experiments on the NHL and SHL datasets demonstrate that our method significantly outperforms both baselines, achieving improvements of +2.84 and +3.48 in IoU_part and IoU_whole on the NHL dataset, and +1.53 and +5.85 on the SHL dataset, respectively. Ablation studies further confirm the contribution of each component, establishing a robust solution for rink registration and a strong foundation for downstream sports vision tasks.
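As a generic illustration of boundary awareness in the segmentation loss (not the paper's exact formulation), the sketch below simply up-weights a dedicated rink-boundary class in a cross-entropy objective; the class index and weight are assumptions made for the example.

```python
# Hedged sketch: cross-entropy with an extra weight on a dedicated rink-boundary
# class, so boundary pixels contribute more to the segmentation loss.
import torch
import torch.nn.functional as F

def boundary_aware_ce(logits, target, boundary_class, boundary_weight=3.0):
    """logits: (B, C, H, W); target: (B, H, W) integer class map."""
    num_classes = logits.shape[1]
    class_weights = torch.ones(num_classes, device=logits.device)
    class_weights[boundary_class] = boundary_weight   # emphasize the boundary class
    return F.cross_entropy(logits, target, weight=class_weights)

# Toy usage: 5 classes where class 4 is the rink boundary.
logits = torch.randn(2, 5, 64, 64)
target = torch.randint(0, 5, (2, 64, 64))
print(boundary_aware_ce(logits, target, boundary_class=4).item())
```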
{"title":"Boundary-aware semantic segmentation for ice hockey rink registration","authors":"Zhibo Wang , Amir Nazemi , Stephie Liu , Sirisha Rambhatla , Yuhao Chen , David Clausi","doi":"10.1016/j.cviu.2025.104627","DOIUrl":"10.1016/j.cviu.2025.104627","url":null,"abstract":"<div><div>Accurate registration of ice hockey rinks from broadcast video frames is fundamental to sports analytics, as it aligns the rink template and broadcast frame into a unified coordinate system for consistent player analysis. Existing approaches, including keypoint- and segmentation-based methods, often yield suboptimal homography estimation due to insufficient attention to rink boundaries. To address this, we propose a segmentation-based framework that explicitly introduces the rink boundary as a new segmentation class. To further improve accuracy, we introduce three components that enhance boundary awareness: (i) a boundary-aware loss to strengthen boundary representation, (ii) a dynamic class-weighted mechanism in homography estimation to emphasize informative regions, and (iii) a self-distillation strategy to enrich feature diversity. Experiments on the NHL and SHL datasets demonstrate that our method significantly outperforms both baselines, achieving improvements of <span><math><mrow><mo>+</mo><mn>2</mn><mo>.</mo><mn>84</mn></mrow></math></span> and <span><math><mrow><mo>+</mo><mn>3</mn><mo>.</mo><mn>48</mn></mrow></math></span> in IoU<sub>part</sub> and IoU<sub>whole</sub> on the NHL dataset, and <span><math><mrow><mo>+</mo><mn>1</mn><mo>.</mo><mn>53</mn></mrow></math></span> and <span><math><mrow><mo>+</mo><mn>5</mn><mo>.</mo><mn>85</mn></mrow></math></span> on the SHL dataset, respectively. Ablation studies further confirm the contribution of each component, establishing a robust solution for rink registration and a strong foundation for downstream sports vision tasks.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104627"},"PeriodicalIF":3.5,"publicationDate":"2025-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145928119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-26 DOI: 10.1016/j.cviu.2025.104608
Xinyu Zhang , Lingling Zhang , Yanrui Wu , Shaowei Wang , Wenjun Wu , Muye Huang , Qianying Wang , Jun Liu
Large language models (LLMs) can effectively generate reasoning processes for simple tasks, but they struggle in complex and novel reasoning scenarios. This problem stems from LLMs often fusing visual and textual information in a single step, failing to capture and represent key information during reasoning, ignoring critical changes across reasoning steps, and thus not reflecting the complex and dynamic nature of human-like reasoning. To address these issues, we propose a new framework called Memory-Enriched Thought-by-Thought (METbT), which incorporates memory and operators. On the one hand, the memory is used to store intermediate representations of the reasoning process, preserving information from the reasoning steps and preventing the language model from generating illogical text. On the other hand, the introduction of operators offers various methods for merging visual and textual representations, significantly enhancing the model’s ability to learn representations. We develop METbT-Bert, METbT-T5, METbT-Qwen, and METbT-InternLM, which use Bert, T5, Qwen, and InternLM, respectively, as the foundational language models within our framework. Experiments on multiple datasets, including Smart-101, ScienceQA, and IconQA, show that in all cases the results surpass those of the corresponding base language models, demonstrating that our METbT framework offers superior scalability and robustness.
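A hedged sketch of the memory-and-operator idea follows: each reasoning step fuses visual and textual features through a simple gated-sum "operator" and writes the result to a memory that conditions later steps; the gating choice and dimensions are illustrative assumptions, not the METbT modules themselves.

```python
# Hedged sketch: a reasoning step writes its fused representation to a memory
# that later steps can read, loosely following the memory/operator description.
import torch
import torch.nn as nn

class StepMemory(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)   # one simple "operator": a gated sum
        self.memory = []                      # stored intermediate representations

    def operator_merge(self, visual, textual):
        g = torch.sigmoid(self.gate(torch.cat([visual, textual], dim=-1)))
        return g * visual + (1.0 - g) * textual

    def step(self, visual, textual):
        fused = self.operator_merge(visual, textual)
        if self.memory:                       # condition on the mean of past steps
            fused = fused + torch.stack(self.memory).mean(dim=0)
        self.memory.append(fused.detach())
        return fused

# Toy usage: two reasoning steps over random visual/textual features.
mem = StepMemory()
mem.step(torch.randn(1, 256), torch.randn(1, 256))
out = mem.step(torch.randn(1, 256), torch.randn(1, 256))
print(out.shape)  # torch.Size([1, 256])
```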
{"title":"Memory-enriched thought-by-thought framework for complex Diagram Question Answering","authors":"Xinyu Zhang , Lingling Zhang , Yanrui Wu , Shaowei Wang , Wenjun Wu , Muye Huang , Qianying Wang , Jun Liu","doi":"10.1016/j.cviu.2025.104608","DOIUrl":"10.1016/j.cviu.2025.104608","url":null,"abstract":"<div><div>Large language models (LLMs) can effectively generate reasoning processes for simple tasks, but they struggle in complex and novel reasoning scenarios. This problem stems from LLMs often fusing visual and textual information in a single step, lacking the capture and representation of key information during the reasoning process, ignoring critical changes in the reasoning process, and failing to reflect the complex and dynamic nature of human-like reasoning. To address these issues, we propose a new framework called <strong>M</strong>emory-<strong>E</strong>nriched <strong>T</strong>hought-by-<strong>T</strong>hought (METbT), which incorporates memory and operators. On the one hand, the memory is used to store intermediate representations of the reasoning process, preserving information from the reasoning steps and preventing the language model from generating illogical text. On the other hand, the introduction of operators offers various methods for merging visual and textual representations, significantly enhancing the model’s ability to learn representations. We develop the METbT-Bert, METbT-T5, METbT-Qwen and METbT-InternLM, leveraging Bert, T5, Qwen and InternLM as the foundational language models with our framework, respectively. Experiments are conducted on multiple datasets including Smart-101, ScienceQA, and IconQA, and in all cases, the results surpassed those of the same language models. The results demonstrate that our METbT framework offers superior scalability and robustness.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104608"},"PeriodicalIF":3.5,"publicationDate":"2025-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-23 DOI: 10.1016/j.cviu.2025.104612
Zhuoya Wang, Gui Chen, Yaxin Li, Yongsheng Dong
Image style transfer techniques have significantly advanced, aiming to create images that adopt the style attributes of one source while maintaining the spatial layout of another. However, the interrelationship between style and content often causes information entanglement in the generated stylized result. To alleviate this issue, in this paper we propose a stepwise attention style transfer network based on diffusion models (SASTD). Specifically, we introduce an attention feature extraction and fusion module, which employs a step-by-step injection method to effectively combine the extracted content and style attention features at different time stages. Additionally, we propose a noise initialization module based on adaptive instance normalization (AdaIN) in the early fusion stage to initialize the latent noise during image generation, preserving certain initial feature statistics. Furthermore, we incorporate edge attention from the content image to enhance the preservation of its structural details. Finally, we propose a LAB space alignment module to further optimize the initially generated stylized image. This method ensures high-quality style transfer while better maintaining the spatial semantics of the content image. Experimental results demonstrate that our proposed SASTD outperforms both image style transfer methods and style-guided text-to-image synthesis methods in qualitative and quantitative comparisons.
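The AdaIN operation underlying the noise initialization module is standard; the sketch below shows a generic version that re-aligns the per-channel statistics of a content latent to those of a style latent (its use for latent initialization here is an illustration of the idea, not the paper's exact module).

```python
# Hedged sketch: adaptive instance normalization (AdaIN) — shift the content
# latent's per-channel mean/std to match the style latent's statistics.
import torch

def adain(content, style, eps=1e-5):
    """content, style: (B, C, H, W) latents; returns the style-aligned content latent."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

# Toy usage on latent-sized tensors (e.g. a 4x64x64 latent, as in many latent diffusion models).
z_content, z_style = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
z_init = adain(z_content, z_style)
print(z_init.shape)  # torch.Size([1, 4, 64, 64])
```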
{"title":"SASTD: Stepwise attention style transfer network based on diffusion models","authors":"Zhuoya Wang, Gui Chen, Yaxin Li, Yongsheng Dong","doi":"10.1016/j.cviu.2025.104612","DOIUrl":"10.1016/j.cviu.2025.104612","url":null,"abstract":"<div><div>Image style transfer techniques have significantly advanced, aiming to create images that adopt the style attributes of one source while maintaining the spatial layout of another. However, the interrelationship between style and content often causes the problem of information entanglement within the generated stylized result. To alleviate this issue, in this paper we propose a stepwise attention style transfer network based on diffusion models (SASTD). Specifically, we introduce an attention feature extraction and fusion module, which employs a step-by-step injection method to effectively combine the extracted content and style attention features at different time stages. Additionally, we propose a noise initialization module based on adaptive instance normalization (AdaIN) in the early fusion stage to initialize the initial latent noise during image generation, preserving certain initial feature statistics. Furthermore, we incorporate edge attention from the content image to enhance the preservation of its structural details. Finally, we propose a LAB space alignment module to further optimize the initially generated stylized image. This method ensures high-quality style transfer while better maintaining the spatial semantics of the content image. Experimental results demonstrate that our proposed SASTD achieves better performance in both qualitative and quantitative comparisons compared to both image style transfer methods and style-guided text-to-image synthesis methods.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104612"},"PeriodicalIF":3.5,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145886184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-23 DOI: 10.1016/j.cviu.2025.104621
Edgar R. Guzman , Letizia Gionfrida , Robert D. Howe
This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of 5.77 cm, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of 1.20 ± 0.49 cm in height and 1.35 ± 0.45 cm in depth for ascending stairs, and 1.28 ± 0.55 cm in height and 1.47 ± 0.65 cm in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.
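As a simplified illustration of how riser height and tread depth can be read off once step edges are localized in 3D (not the paper's point-cloud algorithm), assume one representative 3D point per detected step edge in a gravity-aligned frame where y is vertical and z is the walking direction:

```python
# Hedged sketch: estimate riser height and tread depth from consecutive step-edge
# points, assuming y is the vertical axis and z the walking direction.
import numpy as np

def stair_dimensions(edge_points):
    """edge_points: (N, 3) array with one [x, y, z] point per detected step edge."""
    edges = edge_points[np.argsort(edge_points[:, 1])]    # order step edges by height
    risers = np.abs(np.diff(edges[:, 1]))                 # vertical gaps between edges
    treads = np.abs(np.diff(edges[:, 2]))                 # horizontal gaps between edges
    return risers.mean(), treads.mean()

# Toy usage: three step edges 17 cm apart vertically and 28 cm apart in depth.
pts = np.array([[0.0, 0.00, 1.00],
                [0.0, 0.17, 1.28],
                [0.0, 0.34, 1.56]])
riser, tread = stair_dimensions(pts)
print(f"riser = {riser:.2f} m, tread = {tread:.2f} m")
```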
{"title":"RGB-D and IMU-based staircase quantification for assistive navigation using step estimation for exoskeleton support","authors":"Edgar R. Guzman , Letizia Gionfrida , Robert D. Howe","doi":"10.1016/j.cviu.2025.104621","DOIUrl":"10.1016/j.cviu.2025.104621","url":null,"abstract":"<div><div>This paper introduces a vision-based environment quantification pipeline designed to tailor the assistance provided by lower limb assistive devices during the transition from level walking to stair navigation. The framework consists of three components: staircase detection, transitional step prediction, and staircase dimension estimation. These components utilize an RGB-D camera worn on the chest and an Inertial Measurement Unit (IMU) worn at the hip. To detect ascending stairs, we employed a YOLOv3 model applied to continuous recordings, achieving an average accuracy of 98.1%. For descending stair detection, an edge detection algorithm was used, resulting in a pixel-wise edge localization accuracy of 89.1%. To estimate user locomotion speed and footfall, the IMU was positioned on the participant’s left waist, and the RGB-D camera was mounted at chest level. This setup accurately captured step lengths with an average accuracy of 94.4% across all participants and trials, enabling precise determination of the number of steps leading up to the transitional step on the staircase. As a result, the system accurately predicted the number of steps and localized the final footfall with an average error of <span><math><mrow><mn>5</mn><mo>.</mo><mn>77</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span>, measured as the distance between the predicted and actual placement of the final foot relative to the target destination. Finally, to capture the dimensions of the staircase’s tread depth and riser height, an algorithm analyzing point cloud data was applied when the user was in close proximity to the stairs. This yielded mean absolute errors of <span><math><mrow><mn>1</mn><mo>.</mo><mn>20</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>49</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in height and <span><math><mrow><mn>1</mn><mo>.</mo><mn>35</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>45</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in depth for ascending stairs, and <span><math><mrow><mn>1</mn><mo>.</mo><mn>28</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>55</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in height and <span><math><mrow><mn>1</mn><mo>.</mo><mn>47</mn><mo>±</mo><mn>0</mn><mo>.</mo><mn>65</mn><mspace></mspace><mtext>cm</mtext></mrow></math></span> in depth for descending stairs. Our proposed approach lays the groundwork for optimizing control strategies in exoskeleton technologies by integrating environmental sensing with human locomotion analysis. 
These results demonstrate the feasibility and effectiveness of our system, promising enhanced user experiences and improved functionality in real-world scenarios.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"264 ","pages":"Article 104621"},"PeriodicalIF":3.5,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145847576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-18 DOI: 10.1016/j.cviu.2025.104616
Abdulrhman H. Al-Jebrni , Saba Ghazanfar Ali , Bin Sheng , Huating Li , Xiao Lin , Ping Li , Younhyun Jung , Jinman Kim , Li Xu , Lixin Jiang , Jing Du
Segmenting small, low-contrast anatomical structures and classifying their pathological status in ultrasound (US) images remain challenging tasks in computer vision, especially under the noise and ambiguity inherent in real-world clinical data. Papillary thyroid microcarcinoma (PTMC), characterized by nodules ≤ 1.0 cm, exemplifies these challenges where both precise segmentation and accurate lymph node metastasis (LNM) prediction are essential for informed clinical decisions. We propose SynTaskNet, a synergistic multi-task learning (MTL) architecture that jointly performs PTMC nodule segmentation and LNM classification from US images. Built upon a DenseNet201 backbone, SynTaskNet incorporates several specialized modules: a Coordinated Depth-wise Convolution (CDC) layer for enhancing spatial features, an Adaptive Context Block (ACB) for embedding contextual dependencies, and a Multi-scale Contextual Boundary Attention (MCBA) module to improve boundary localization in low-contrast regions. To strengthen task interaction, we introduce a Selective Enhancement Fusion (SEF) mechanism that hierarchically integrates features across three semantic levels, enabling effective information exchange between segmentation and classification branches. On top of this, we formulate a synergistic learning scheme wherein an Auxiliary Segmentation Map (ASM) generated by the segmentation decoder is injected into SEF’s third class-specific fusion path to guide LNM classification. In parallel, the predicted LNM label is concatenated with the third-path SEF output to refine the Final Segmentation Map (FSM), enabling bidirectional task reinforcement. Extensive evaluations on a dedicated PTMC US dataset demonstrate that SynTaskNet achieves state-of-the-art performance, with a Dice score of 93.0% for segmentation and a classification accuracy of 94.2% for LNM prediction, validating its clinical relevance and technical efficacy.
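A minimal sketch of a joint objective that pairs a soft-Dice term for nodule segmentation with cross-entropy for LNM classification; the weighting and loss forms are common defaults assumed for illustration, not SynTaskNet's exact losses.

```python
# Hedged sketch: multi-task objective = soft Dice (segmentation) + CE (classification).
import torch
import torch.nn.functional as F

def soft_dice_loss(pred_logits, target_mask, eps=1e-6):
    """pred_logits, target_mask: (B, 1, H, W); target is a binary nodule mask."""
    p = torch.sigmoid(pred_logits)
    inter = (p * target_mask).sum(dim=(1, 2, 3))
    union = p.sum(dim=(1, 2, 3)) + target_mask.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def multitask_loss(seg_logits, seg_target, cls_logits, cls_target, alpha=1.0):
    # alpha balances the segmentation and classification terms (assumed default).
    return soft_dice_loss(seg_logits, seg_target) + alpha * F.cross_entropy(cls_logits, cls_target)

# Toy usage: a batch of 2 ultrasound crops with a binary LNM label.
loss = multitask_loss(torch.randn(2, 1, 128, 128),
                      torch.randint(0, 2, (2, 1, 128, 128)).float(),
                      torch.randn(2, 2),
                      torch.randint(0, 2, (2,)))
print(loss.item())
```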
{"title":"SynTaskNet: A synergistic multi-task network for joint segmentation and classification of small anatomical structures in ultrasound imaging","authors":"Abdulrhman H. Al-Jebrni , Saba Ghazanfar Ali , Bin Sheng , Huating Li , Xiao Lin , Ping Li , Younhyun Jung , Jinman Kim , Li Xu , Lixin Jiang , Jing Du","doi":"10.1016/j.cviu.2025.104616","DOIUrl":"10.1016/j.cviu.2025.104616","url":null,"abstract":"<div><div>Segmenting small, low-contrast anatomical structures and classifying their pathological status in ultrasound (US) images remain challenging tasks in computer vision, especially under the noise and ambiguity inherent in real-world clinical data. Papillary thyroid microcarcinoma (PTMC), characterized by nodules <span><math><mrow><mo>≤</mo><mn>1</mn><mo>.</mo><mn>0</mn></mrow></math></span> cm, exemplifies these challenges where both precise segmentation and accurate lymph node metastasis (LNM) prediction are essential for informed clinical decisions. We propose SynTaskNet, a synergistic multi-task learning (MTL) architecture that jointly performs PTMC nodule segmentation and LNM classification from US images. Built upon a DenseNet201 backbone, SynTaskNet incorporates several specialized modules: a Coordinated Depth-wise Convolution (CDC) layer for enhancing spatial features, an Adaptive Context Block (ACB) for embedding contextual dependencies, and a Multi-scale Contextual Boundary Attention (MCBA) module to improve boundary localization in low-contrast regions. To strengthen task interaction, we introduce a Selective Enhancement Fusion (SEF) mechanism that hierarchically integrates features across three semantic levels, enabling effective information exchange between segmentation and classification branches. On top of this, we formulate a synergistic learning scheme wherein an Auxiliary Segmentation Map (ASM) generated by the segmentation decoder is injected into SEF’s third class-specific fusion path to guide LNM classification. In parallel, the predicted LNM label is concatenated with the third-path SEF output to refine the Final Segmentation Map (FSM), enabling bidirectional task reinforcement. Extensive evaluations on a dedicated PTMC US dataset demonstrate that SynTaskNet achieves state-of-the-art performance, with a Dice score of 93.0% for segmentation and a classification accuracy of 94.2% for LNM prediction, validating its clinical relevance and technical efficacy.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104616"},"PeriodicalIF":3.5,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145790196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-18 DOI: 10.1016/j.cviu.2025.104614
Yue Wu , Yunhong Wang , Guodong Wang , Jinjin Zhang , Yingjie Gao , Xiuguo Bao , Di Huang
Prompt tuning has emerged as a pivotal technique for adapting pre-trained vision-language models (VLMs) to a wide range of downstream tasks. Recent developments have introduced multimodal learnable prompts to construct task-specific classifiers. However, these methods often exhibit limited generalization to unseen classes, primarily due to fixed prompt designs that are tightly coupled with seen training data and lack adaptability to novel class distributions. To overcome this limitation, we propose Label-Informed Knowledge Integration (LIKI)—a novel framework that harnesses the robust generalizability of textual label semantics to guide the generation of adaptive visual prompts. Rather than directly mapping textual prompts into the visual domain, LIKI utilizes robust text embeddings as a knowledge source to inform the visual prompt optimization. Central to our method is a simple yet effective Label Semantic Integration (LSI) module, which dynamically incorporates knowledge from both seen and unseen labels into the visual prompts. This label-informed prompting strategy imbues the visual encoder with semantic awareness, thereby enhancing the generalization and discriminative capacity of VLMs across diverse scenarios. Extensive experiments demonstrate that LIKI consistently outperforms state-of-the-art approaches in base-to-novel generalization, cross-dataset transfer, and domain generalization tasks, offering a significant advancement in prompt-based VLM adaptation.
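To make the label-to-prompt flow concrete, here is a hedged sketch in which learnable visual prompt tokens attend over frozen text embeddings of the class labels, so label semantics condition the visual prompts; the cross-attention layout and dimensions are illustrative assumptions rather than the exact LSI module.

```python
# Hedged sketch: visual prompt tokens gather information from label text embeddings
# through cross-attention before being fed to the visual encoder.
import torch
import torch.nn as nn

class LabelSemanticIntegration(nn.Module):
    def __init__(self, dim=512, num_prompts=4, num_heads=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)  # learnable visual prompts
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, label_embeds):
        """label_embeds: (num_classes, dim) frozen text embeddings of class names."""
        q = self.prompts.unsqueeze(0)                 # (1, P, D) queries
        kv = label_embeds.unsqueeze(0)                # (1, C, D) keys/values
        out, _ = self.attn(q, kv, kv)                 # prompts informed by label semantics
        return self.prompts + out.squeeze(0)          # residual: adaptive visual prompts

# Toy usage with 100 class-name embeddings (e.g. from a frozen text encoder).
lsi = LabelSemanticIntegration()
prompts = lsi(torch.randn(100, 512))
print(prompts.shape)  # torch.Size([4, 512])
```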
{"title":"Label-informed knowledge integration: Advancing visual prompt for VLMs adaptation","authors":"Yue Wu , Yunhong Wang , Guodong Wang , Jinjin Zhang , Yingjie Gao , Xiuguo Bao , Di Huang","doi":"10.1016/j.cviu.2025.104614","DOIUrl":"10.1016/j.cviu.2025.104614","url":null,"abstract":"<div><div>Prompt tuning has emerged as a pivotal technique for adapting pre-trained vision-language models (VLMs) to a wide range of downstream tasks. Recent developments have introduced multimodal learnable prompts to construct task-specific classifiers. However, these methods often exhibit limited generalization to unseen classes, primarily due to fixed prompt designs that are tightly coupled with seen training data and lack adaptability to novel class distributions. To overcome this limitation, we propose Label-Informed Knowledge Integration (LIKI)—a novel framework that harnesses the robust generalizability of textual label semantics to guide the generation of adaptive visual prompts. Rather than directly mapping textual prompts into the visual domain, LIKI utilizes robust text embeddings as a knowledge source to inform the visual prompt optimization. Central to our method is a simple yet effective Label Semantic Integration (LSI) module, which dynamically incorporates knowledge from both seen and unseen labels into the visual prompts. This label-informed prompting strategy imbues the visual encoder with semantic awareness, thereby enhancing the generalization and discriminative capacity of VLMs across diverse scenarios. Extensive experiments demonstrate that LIKI consistently outperforms state-of-the-art approaches in base-to-novel generalization, cross-dataset transfer, and domain generalization tasks, offering a significant advancement in prompt-based VLM adaptation.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104614"},"PeriodicalIF":3.5,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145840279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-18 DOI: 10.1016/j.cviu.2025.104611
Yu Zhu , Liqiang Song , Junli Zhao , Guodong Wang , Hui Li , Yi Li
Early diagnosis and intervention are critical in managing acute ischemic stroke to effectively reduce morbidity and mortality. Medical image synthesis generates multimodal images from unimodal inputs, while image fusion integrates complementary information across modalities. However, current approaches typically address these tasks separately, neglecting their inherent synergies and the potential for a richer, more comprehensive diagnostic picture. To overcome this, we propose a two-stage deep learning (DL) framework for improved lesion analysis in ischemic stroke, which combines medical image synthesis and fusion to improve diagnostic informativeness. In the first stage, a Generative Adversarial Network (GAN)-based method, pix2pixHD, efficiently synthesizes high-fidelity multimodal medical images from unimodal inputs, thereby enriching the available diagnostic data for subsequent processing. The second stage introduces a multimodal medical image fusion network, SCAFNet, leveraging self-attention and cross-attention mechanisms. SCAFNet captures intra-modal feature relationships via self-attention to emphasize key information within each modality, and constructs inter-modal feature interactions via cross-attention to fully exploit their complementarity. Additionally, an Information Assistance Module (IAM) is introduced to facilitate the extraction of more meaningful information and improve the visual quality of fused images. Experimental results demonstrate that the proposed framework significantly outperforms existing methods in both generated and fused image quality, highlighting its substantial potential for clinical applications in medical image analysis.
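As a generic illustration of the cross-attention fusion step in the second stage (a sketch, not SCAFNet's block), the code below lets each modality's feature map query the other with multi-head cross-attention and averages the enriched token streams into a fused map; the dimensions and the final averaging are assumptions.

```python
# Hedged sketch: bidirectional cross-attention between two modality feature maps,
# followed by a simple average to produce a fused representation.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim=64, num_heads=4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feat_a, feat_b):
        """feat_a, feat_b: (B, C, H, W) features from two imaging modalities."""
        b, c, h, w = feat_a.shape
        ta = feat_a.flatten(2).transpose(1, 2)        # (B, HW, C) token view of modality A
        tb = feat_b.flatten(2).transpose(1, 2)        # (B, HW, C) token view of modality B
        a_enh, _ = self.a_to_b(ta, tb, tb)            # modality A queries modality B
        b_enh, _ = self.b_to_a(tb, ta, ta)            # and vice versa
        fused = (ta + a_enh + tb + b_enh) / 2.0
        return fused.transpose(1, 2).reshape(b, c, h, w)

# Toy usage on two 64-channel feature maps.
fusion = CrossModalFusion()
out = fusion(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```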
{"title":"SCAFNet: Multimodal stroke medical image synthesis and fusion network based on self attention and cross attention","authors":"Yu Zhu , Liqiang Song , Junli Zhao , Guodong Wang , Hui Li , Yi Li","doi":"10.1016/j.cviu.2025.104611","DOIUrl":"10.1016/j.cviu.2025.104611","url":null,"abstract":"<div><div>Early diagnosis and intervention are critical in managing acute ischemic stroke to effectively reduce morbidity and mortality. Medical image synthesis generates multimodal images from unimodal inputs, while image fusion integrates complementary information across modalities. However, current approaches typically address these tasks separately, neglecting their inherent synergies and the potential for a richer, more comprehensive diagnostic picture. To overcome this, we propose a two-stage deep learning(DL) framework for improved lesion analysis in ischemic stroke, which combines medical image synthesis and fusion to improve diagnostic informativeness. In the first stage, a Generative Adversarial Network (GAN)-based method, pix2pixHD, efficiently synthesizes high-fidelity multimodal medical images from unimodal inputs, thereby enriching the available diagnostic data for subsequent processing. The second stage introduces a multimodal medical image fusion network, SCAFNet, leveraging self-attention and cross-attention mechanisms. SCAFNet captures intra-modal feature relationships via self-attention to emphasize key information within each modality, and constructs inter-modal feature interactions via cross-attention to fully exploit their complementarity. Additionally, an Information Assistance Module (IAM) is introduced to facilitate the extraction of more meaningful information and improve the visual quality of fused images. Experimental results demonstrate that the proposed framework significantly outperforms existing methods in both generated and fused image quality, highlighting its substantial potential for clinical applications in medical image analysis.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104611"},"PeriodicalIF":3.5,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145840280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-12-18 DOI: 10.1016/j.cviu.2025.104617
Lulu Wang, Ruiji Xue, Zhengtao Yu, Ruoyu Zhang, Tongling Pan, Yingna Li
Image captioning (IC) is a pivotal cross-modal task that generates coherent textual descriptions for visual inputs, bridging vision and language domains. Attention-based methods have significantly advanced the field of image captioning. However, empirical observations indicate that attention mechanisms often allocate focus uniformly across the full spectrum of feature sequences, which inadvertently diminishes emphasis on long-range dependencies. Such remote elements, nevertheless, play a critical role in yielding captions of superior quality. Therefore, we pursued strategies that harmonize comprehensive feature representation with targeted prioritization of key signals, and ultimately propose the Dynamic Hybrid Network (DH-Net) to enhance caption quality. Specifically, following the encoder–decoder architecture, we propose a hybrid encoder (HE) that integrates the attention mechanisms with mamba blocks, which complements the attention by leveraging mamba’s superior long-sequence modeling capabilities and enables a synergistic combination of local feature extraction and global context modeling. Additionally, we introduce a Feature Aggregation Module (FAM) into the decoder, which dynamically adapts multi-modal feature fusion to evolving decoding contexts, ensuring context-sensitive integration of heterogeneous features. Extensive evaluations on the MSCOCO and Flickr30k datasets demonstrate that DH-Net achieves state-of-the-art performance, significantly outperforming existing approaches in generating accurate and semantically rich captions. The implementation code is accessible via https://github.com/simple-boy/DH-Net.
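A hedged sketch of a hybrid encoder layer that interleaves self-attention with a sequence-mixing branch; the Mamba block is replaced here by a gated depthwise-convolution stand-in so the example stays self-contained, which is an assumption rather than the paper's actual Mamba module.

```python
# Hedged sketch: a hybrid encoder layer = self-attention followed by a gated
# depthwise-conv stand-in for a Mamba/SSM-style sequence mixer.
import torch
import torch.nn as nn

class HybridEncoderLayer(nn.Module):
    def __init__(self, dim=256, num_heads=8, kernel=7):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.dwconv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.gate = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (B, N, D) visual tokens
        h = self.norm1(x)
        a, _ = self.attn(h, h, h)
        x = x + a                                       # attention branch
        h = self.norm2(x).transpose(1, 2)               # (B, D, N) for the conv mixer
        mixed = self.dwconv(h).transpose(1, 2)
        x = x + torch.sigmoid(self.gate(x)) * mixed     # gated sequence-mixing branch
        return x

# Toy usage on 49 visual tokens (e.g. a 7x7 feature grid).
layer = HybridEncoderLayer()
print(layer(torch.randn(2, 49, 256)).shape)             # torch.Size([2, 49, 256])
```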
{"title":"A dynamic hybrid network with attention and mamba for image captioning","authors":"Lulu Wang, Ruiji Xue, Zhengtao Yu, Ruoyu Zhang, Tongling Pan, Yingna Li","doi":"10.1016/j.cviu.2025.104617","DOIUrl":"10.1016/j.cviu.2025.104617","url":null,"abstract":"<div><div>Image captioning (IC) is a pivotal cross-modal task that generates coherent textual descriptions for visual inputs, bridging vision and language domains. Attention-based methods have significantly advanced the field of image captioning. However, empirical observations indicate that attention mechanisms often allocate focus uniformly across the full spectrum of feature sequences, which inadvertently diminishes emphasis on long-range dependencies. Such remote elements, nevertheless, play a critical role in yielding captions of superior quality. Therefore, we pursued strategies that harmonize comprehensive feature representation with targeted prioritization of key signals, ultimately proposed the Dynamic Hybrid Network (DH-Net) to enhance caption quality. Specifically, following the encoder–decoder architecture, we propose a hybrid encoder (HE) to integrate the attention mechanisms with the mamba blocks. which further complements the attention by leveraging mamba’s superior long-sequence modeling capabilities, and enables a synergistic combination of local feature extraction and global context modeling. Additionally, we introduce a Feature Aggregation Module (FAM) into the decoder, which dynamically adapts multi-modal feature fusion to evolving decoding contexts, ensuring context-sensitive integration of heterogeneous features. Extensive evaluations on the MSCOCO and Flickr30k dataset demonstrate that DH-Net achieves state-of-the-art performance, significantly outperforming existing approaches in generating accurate and semantically rich captions. The implementation code is accessible via <span><span>https://github.com/simple-boy/DH-Net</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"263 ","pages":"Article 104617"},"PeriodicalIF":3.5,"publicationDate":"2025-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145840288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}