Artifact-suppressed 3D retinal microvascular segmentation via multi-scale topology regulation
Pub Date: 2026-02-10 | DOI: 10.1016/j.media.2026.103988
Ting Luo, Jinxian Zhang, Tao Chen, Zhouyan He, Yanda Meng, Mengting Liu, Jiong Zhang, Dan Zhang
Optical coherence tomography angiography (OCTA) enables non-invasive visualization of retinal microvasculature, and accurate 3D vessel segmentation is essential for quantifying biomarkers critical for early diagnosis and monitoring of diabetic retinopathy. However, reliable 3D OCTA segmentation is hindered by capillary invisibility, complex vascular topology, and motion artifacts, which compromise biomarker accuracy. Furthermore, the scarcity of manually annotated 3D OCTA microvascular data constrains methodological development. To address these challenges, we introduce a publicly accessible 3D microvascular dataset and propose MT-Net, a multi-view, topology-aware 3D retinal microvascular segmentation network. First, a novel dimension transformation strategy enhances topological accuracy by effectively encoding spatial dependencies across multiple planes. Second, to mitigate the impact of motion artifacts, we introduce a unidirectional Artifact Suppression Module (ASM) that selectively suppresses noise along the B-scan direction. Third, a Twin-Cross Attention Module (TCAM), guided by vessel centerlines, enhances the continuity and completeness of segmented vessels by reinforcing cross-view contextual information. Experiments on two 3D OCTA datasets show that MT-Net achieves state-of-the-art accuracy and topological consistency, with strong generalizability validated by cross-dataset analysis. We plan to release our manual annotations to facilitate future research in retinal OCTA segmentation.
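As a loose illustration of the multi-view idea described above, the sketch below re-slices a 3D volume along its three orthogonal planes and encodes each stack of 2D slices with a shared encoder. The module name, encoder, and shapes are assumptions for illustration only; the paper's actual dimension transformation strategy, ASM, and TCAM are not reproduced here.

```python
# Hypothetical sketch only: re-slice a volume along three orthogonal planes and
# encode every plane's slices with one shared 2D encoder. Not the paper's code.
import torch
import torch.nn as nn

class MultiViewEncoder(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        # Shared 2D encoder applied to slices from every viewing plane.
        self.encoder2d = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
        )

    def forward(self, volume: torch.Tensor) -> list:
        # volume: (B, 1, D, H, W); permute so each plane becomes the slice axis.
        views = [
            volume,                          # axial: slices along D
            volume.permute(0, 1, 3, 2, 4),   # coronal: slices along H
            volume.permute(0, 1, 4, 2, 3),   # sagittal: slices along W
        ]
        features = []
        for v in views:
            b, c, s, h, w = v.shape
            f = self.encoder2d(v.reshape(b * s, c, h, w))  # fold slices into batch
            features.append(f.reshape(b, s, -1, h, w))
        return features

feats = MultiViewEncoder()(torch.randn(1, 1, 8, 32, 32))
print([f.shape for f in feats])
```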
{"title":"Artifact-suppressed 3D retinal microvascular segmentation via multi-scale topology regulation","authors":"Ting Luo , Jinxian Zhang , Tao Chen , Zhouyan He , Yanda Meng , Mengting Liu , Jiong Zhang , Dan Zhang","doi":"10.1016/j.media.2026.103988","DOIUrl":"10.1016/j.media.2026.103988","url":null,"abstract":"<div><div>Optical coherence tomography angiography (OCTA) enables non-invasive visualization of retinal microvasculature, and accurate 3D vessel segmentation is essential for quantifying biomarkers critical for early diagnosis and monitoring of diabetic retinopathy. However, reliable 3D OCTA segmentation is hindered by capillary invisibility, complex vascular topology, and motion artifacts, which compromise biomarker accuracy. Furthermore, the scarcity of manually annotated 3D OCTA microvascular data constrains methodological development. To address this challenge, we introduce our publicly accessible 3D microvascular dataset and propose MT-Net, a multi-view, topology-aware 3D retinal microvascular segmentation network. First, a novel dimension transformation strategy is employed to enhance topological accuracy by effectively encoding spatial dependencies across multiple planes. Second, to mitigate the impact of motion artifacts, we introduce a unidirectional Artifact Suppression Module (ASM) that selectively suppresses noise along the B-scan direction. Third, a Twin-Cross Attention Module (TCAM), guided by vessel centerlines, is designed to enhance the continuity and completeness of segmented vessels by reinforcing cross-view contextual information. Experiments on two 3D OCTA datasets show that MT-Net achieves state-of-the-art accuracy and topological consistency, with strong generalizability validated by cross-dataset analysis. We plan to release our manual annotations to facilitate future research in retinal OCTA segmentation.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"110 ","pages":"Article 103988"},"PeriodicalIF":11.8,"publicationDate":"2026-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146146686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MICCAI STS 2024 Challenge: Semi-Supervised Instance-Level Tooth Segmentation in Panoramic X-ray and CBCT Images
Pub Date: 2026-02-09 | DOI: 10.1016/j.media.2026.103986
Yaqi Wang, Zhi Li, Chengyu Wu, Jun Liu, Yifan Zhang, Jiaxue Ni, Qian Luo, Jialuo Chen, Hongyuan Zhang, Jin Liu, Can Han, Kaiwen Fu, Changkai Ji, Xinxu Cai, Jing Hao, Zhihao Zheng, Shi Xu, Junqiang Chen, Xiaoyang Yu, Qianni Zhang, Dahong Qian, Shuai Wang, Huiyu Zhou
Orthopantomograms (OPGs) and Cone-Beam Computed Tomography (CBCT) are vital for dentistry, but creating large datasets for automated tooth segmentation is hindered by the labor-intensive process of manual instance-level annotation. This research aimed to benchmark and advance semi-supervised learning (SSL) as a solution to this data scarcity problem. We organized the 2nd Semi-supervised Teeth Segmentation (STS 2024) Challenge at MICCAI 2024. We provided a large-scale dataset comprising over 90,000 2D images and 3D axial slices, including 2,380 OPG images and 330 CBCT scans, with detailed instance-level FDI annotations on part of the data. The challenge attracted 114 (OPG) and 106 (CBCT) registered teams. To ensure algorithmic excellence and full transparency, we rigorously evaluated the valid, open-source submissions from the top 10 (OPG) and top 5 (CBCT) teams, respectively. All successful submissions were deep learning-based SSL methods. The winning semi-supervised models demonstrated impressive performance gains over a fully-supervised nnU-Net baseline trained only on the labeled data. For the 2D OPG track, the top method improved the Instance Affinity (IA) score by over 44 percentage points. For the 3D CBCT track, the winning approach boosted the Instance Dice score by 61 percentage points. This challenge demonstrates the potential benefit of SSL for complex, instance-level medical image segmentation tasks where labeled data is scarce. The most effective approaches consistently leveraged hybrid semi-supervised frameworks that combined knowledge from foundation models like SAM with multi-stage, coarse-to-fine refinement pipelines. Both the challenge dataset and the participants' submitted code have been made publicly available on GitHub (https://github.com/ricoleehduu/STS-Challenge-2024), ensuring transparency and reproducibility.
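The winning pipelines differ in detail; as a generic, minimal sketch of the pseudo-labeling pattern underlying many semi-supervised segmentation methods (not any particular team's entry), one training step might look like this, where the model, threshold, and loss weighting are all assumptions:

```python
# Generic pseudo-labeling step for semi-supervised segmentation (illustrative;
# the confidence threshold and equal loss weighting are assumptions).
import torch
import torch.nn.functional as F

def ssl_step(model, x_labeled, y_labeled, x_unlabeled, conf_thresh=0.9):
    # Supervised loss on the labeled batch.
    loss_sup = F.cross_entropy(model(x_labeled), y_labeled)

    # Pseudo-labels: keep only pixels the current model is confident about.
    with torch.no_grad():
        probs = torch.softmax(model(x_unlabeled), dim=1)
        conf, pseudo = probs.max(dim=1)           # (B, H, W) confidence and labels
        mask = (conf > conf_thresh).float()

    per_pixel = F.cross_entropy(model(x_unlabeled), pseudo, reduction="none")
    loss_unsup = (per_pixel * mask).mean()
    return loss_sup + loss_unsup
```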
{"title":"MICCAI STS 2024 Challenge: Semi-Supervised Instance-Level Tooth Segmentation in Panoramic X-ray and CBCT Images","authors":"Yaqi Wang, Zhi Li, Chengyu Wu, Jun Liu, Yifan Zhang, Jiaxue Ni, Qian Luo, Jialuo Chen, Hongyuan Zhang, Jin Liu, Can Han, Kaiwen Fu, Changkai Ji, Xinxu Cai, Jing Hao, Zhihao Zheng, Shi Xu, Junqiang Chen, Xiaoyang Yu, Qianni Zhang, Dahong Qian, Shuai Wang, Huiyu Zhou","doi":"10.1016/j.media.2026.103986","DOIUrl":"https://doi.org/10.1016/j.media.2026.103986","url":null,"abstract":"Orthopantomogram (OPGs) and Cone-Beam Computed Tomography (CBCT) are vital for dentistry, but creating large datasets for automated tooth segmentation is hindered by the labor-intensive process of manual instance-level annotation. This research aimed to benchmark and advance semi-supervised learning (SSL) as a solution for this data scarcity problem. We organized the 2nd Semi-supervised Teeth Segmentation (STS 2024) Challenge at MICCAI 2024. We provided a large-scale dataset comprising over 90,000 2D images and 3D axial slices, which includes 2,380 OPG images and 330 CBCT scans, all featuring detailed instance-level FDI annotations on part of the data. The challenge attracted 114 (OPG) and 106 (CBCT) registered teams. To ensure algorithmic excellence and full transparency, we rigorously evaluated the valid, open-source submissions from the top 10 (OPG) and top 5 (CBCT) teams, respectively. All successful submissions were deep learning-based SSL methods. The winning semi-supervised models demonstrated impressive performance gains over a fully-supervised nnU-Net baseline trained only on the labeled data. For the 2D OPG track, the top method improved the Instance Affinity (IA) score by over 44 percentage points. For the 3D CBCT track, the winning approach boosted the Instance Dice score by 61 percentage points. This challenge demonstrates the potential benefit benefit of SSL for complex, instance-level medical image segmentation tasks where labeled data is scarce. The most effective approaches consistently leveraged hybrid semi-supervised frameworks that combined knowledge from foundational models like SAM with multi-stage, coarse-to-fine refinement pipelines. Both the challenge dataset and the participants’ submitted code have been made publicly available on GitHub (<ce:inter-ref xlink:href=\"https://github.com/ricoleehduu/STS-Challenge-2024\" xlink:type=\"simple\">https://github.com/ricoleehduu/STS-Challenge-2024</ce:inter-ref>), ensuring transparency and reproducibility.","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"108 1","pages":""},"PeriodicalIF":10.9,"publicationDate":"2026-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146146639","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An efficient, scalable, and adaptable plug-and-play temporal attention module for motion-guided cardiac segmentation with sparse temporal labels
Pub Date: 2026-02-09 | DOI: 10.1016/j.media.2026.103981
Md Kamrul Hasan, Guang Yang, Choon Hwai Yap
Cardiac anatomy segmentation is essential for clinical assessment of cardiac function and disease diagnosis to inform treatment and intervention. Deep learning (DL) has improved cardiac anatomy segmentation accuracy, especially when information on cardiac motion dynamics is integrated into the networks. Several methods for incorporating motion information have been proposed; however, existing methods are not yet optimal: adding the time dimension to input data incurs high computational costs, and incorporating registration into the segmentation network remains computationally costly and can be affected by registration errors, especially with non-DL registration. While attention-based motion modeling is promising, suboptimal design constrains its capacity to learn the complex and coherent temporal interactions inherent in cardiac image sequences. Here, we propose a novel approach to incorporating motion information in DL segmentation networks: a computationally efficient yet robust Temporal Attention Module (TAM), modeled as a small, multi-headed, cross-temporal attention module, which can be inserted plug-and-play into a broad range of segmentation networks (CNN, transformer, or hybrid) without drastic architecture modification. Extensive experiments on multiple cardiac imaging datasets, including 2D echocardiography (CAMUS and EchoNet-Dynamic), 3D echocardiography (MITEA), and 3D cardiac MRI (ACDC), confirm that TAM consistently improves segmentation performance across datasets when added to a range of networks, including UNet, FCN8s, UNetR, SwinUNetR, and the recent I2UNet and DT-VNet. Integrating TAM into SAM yields a temporal SAM that reduces Hausdorff distance (HD) from 3.99 mm to 3.51 mm on the CAMUS dataset, while integrating TAM into a pre-trained MedSAM reduces HD from 3.04 to 2.06 pixels after fine-tuning on the EchoNet-Dynamic dataset. On the ACDC 3D dataset, our TAM-UNet and TAM-DT-VNet achieve substantial reductions in HD, from 7.97 mm to 4.23 mm and from 6.87 mm to 4.74 mm, respectively. Additionally, TAM's training does not require segmentation ground truth for every time frame and can be achieved with sparse temporal annotation. TAM is thus a robust, generalizable, and adaptable solution for motion-awareness enhancement that scales easily from 2D to 3D. The code is available at https://github.com/kamruleee51/TAM.
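As a rough sketch of what a plug-and-play cross-temporal attention block can look like, the code below applies multi-head attention across the time axis of per-frame feature maps. The class name, tensor shapes, and residual/norm placement are illustrative assumptions, not the authors' implementation (which is available at the linked repository):

```python
# Illustrative cross-temporal attention over per-frame features (B, T, C, H, W);
# shapes and design details are assumptions, not the paper's TAM.
import torch
import torch.nn as nn

class TemporalAttentionBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = feats.shape
        # Treat each spatial location as a batch element and attend across time.
        x = feats.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm(x + attn_out)            # residual connection + layer norm
        return x.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)

feats = torch.randn(2, 5, 32, 14, 14)          # (B, T, C, H, W)
print(TemporalAttentionBlock(32)(feats).shape)  # torch.Size([2, 5, 32, 14, 14])
```

Because the block maps a (B, T, C, H, W) feature tensor to one of the same shape, it can in principle be dropped between any two layers of a per-frame segmentation backbone.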
{"title":"An efficient, scalable, and adaptable plug-and-play temporal attention module for motion-guided cardiac segmentation with sparse temporal labels","authors":"Md Kamrul Hasan , Guang Yang , Choon Hwai Yap","doi":"10.1016/j.media.2026.103981","DOIUrl":"10.1016/j.media.2026.103981","url":null,"abstract":"<div><div>Cardiac anatomy segmentation is essential for clinical assessment of cardiac function and disease diagnosis to inform treatment and intervention. Deep learning (DL) has improved cardiac anatomy segmentation accuracy, especially when information on cardiac motion dynamics is integrated into the networks. Several methods for incorporating motion information have been proposed; however, existing methods are not yet optimal: adding the time dimension to input data causes high computational costs, and incorporating registration into the segmentation network remains computationally costly and can be affected by errors of registration, especially with non-DL registration. While attention-based motion modeling is promising, suboptimal design constrains its capacity to learn the complex and coherent temporal interactions inherent in cardiac image sequences. Here, we propose a novel approach to incorporating motion information in the DL segmentation networks: a computationally efficient yet robust Temporal Attention Module (TAM), modeled as a small, multi-headed, cross-temporal attention module, which can be plug-and-play inserted into a broad range of segmentation networks (CNN, transformer, or hybrid) without a drastic architecture modification. Extensive experiments on multiple cardiac imaging datasets, such as 2D echocardiography (CAMUS and EchoNet-Dynamic), 3D echocardiography (MITEA), and 3D cardiac MRI (ACDC), confirm that TAM consistently improves segmentation performance across datasets when added to a range of networks, including UNet, FCN8s, UNetR, SwinUNetR, and the recent I<sup>2</sup>UNet and DT-VNet. Integrating TAM into SAM yields a temporal SAM that reduces Hausdorff distance (HD) from 3.99 mm to 3.51 mm on the CAMUS dataset, while integrating TAM into a pre-trained MedSAM reduces HD from 3.04 to 2.06 pixels after fine-tuning on the EchoNet-Dynamic dataset. On the ACDC 3D dataset, our TAM-UNet and TAM-DT-VNet achieve substantial reductions in HD, from 7.97 mm to 4.23 mm and 6.87 mm to 4.74 mm, respectively. Additionally, TAM’s training does not require segmentation of ground truths from all time frames and can be achieved with sparse temporal annotation. TAM is thus a robust, generalizable, and adaptable solution for motion-awareness enhancement that is easily scaled from 2D to 3D. The code is available at <span><span>https://github.com/kamruleee51/TAM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"110 ","pages":"Article 103981"},"PeriodicalIF":11.8,"publicationDate":"2026-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146146638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Physics-informed graph neural networks for flow field estimation in carotid arteries
Pub Date: 2026-02-07 | DOI: 10.1016/j.media.2026.103974
Julian Suk, Dieuwertje Alblas, Barbara A. Hutten, Albert Wiegman, Christoph Brune, Pim van Ooij, Jelmer M. Wolterink
Hemodynamic quantities are valuable biomedical risk factors for cardiovascular pathology such as atherosclerosis. Non-invasive, in-vivo measurement of these quantities can only be performed using a select number of modalities that are not widely available, such as 4D flow magnetic resonance imaging (MRI). In this work, we create a surrogate model for hemodynamic flow field estimation, powered by machine learning. We train graph neural networks that include priors about the underlying symmetries and physics, limiting the amount of data required for training. This allows us to train the model using moderately sized, in-vivo 4D flow MRI datasets, instead of the large in-silico datasets obtained by computational fluid dynamics (CFD) that are the current standard. We create an efficient, equivariant neural network by combining the popular PointNet++ architecture with group-steerable layers. To incorporate the physics-informed priors, we derive an efficient discretisation scheme for the involved differential operators. We perform extensive experiments and show that our model can accurately estimate low-noise hemodynamic flow fields in carotid arteries. Moreover, we show how the learned relation between geometry and hemodynamic quantities transfers to 3D vascular models obtained using a different imaging modality than the training data. This shows that physics-informed graph neural networks can be trained using 4D flow MRI data to estimate blood flow in unseen carotid artery geometries.
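As a toy illustration of a physics-informed penalty on a graph, the sketch below approximates the divergence of a predicted velocity field from edge-wise directional derivatives and penalizes its magnitude, in the spirit of an incompressible-flow prior. The operator is a simplified stand-in for the paper's discretisation scheme, and all names are hypothetical:

```python
# Toy physics penalty on a mesh/point-cloud graph (all names hypothetical).
import torch

def divergence_penalty(pos, vel, edge_index):
    """pos: (N, 3) node coordinates; vel: (N, 3) predicted velocities;
    edge_index: (2, E) directed edges."""
    src, dst = edge_index
    dx = pos[dst] - pos[src]                          # edge vectors
    du = vel[dst] - vel[src]                          # velocity differences
    # Edge-wise directional derivative of velocity along the edge direction.
    proxy = (du * dx).sum(-1) / (dx.pow(2).sum(-1) + 1e-8)
    # Accumulate per node: a crude discrete divergence estimate.
    div = torch.zeros(pos.shape[0], device=pos.device).index_add_(0, src, proxy)
    return div.pow(2).mean()

pos = torch.rand(100, 3)
vel = torch.randn(100, 3)
edges = torch.randint(0, 100, (2, 400))
print(divergence_penalty(pos, vel, edges))
```

Such a residual term would be added to the data-fitting loss with a tunable weight, steering the network toward physically plausible flow fields even where training measurements are noisy.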
{"title":"Physics-informed graph neural networks for flow field estimation in carotid arteries","authors":"Julian Suk , Dieuwertje Alblas , Barbara A. Hutten , Albert Wiegman , Christoph Brune , Pim van Ooij , Jelmer M. Wolterink","doi":"10.1016/j.media.2026.103974","DOIUrl":"10.1016/j.media.2026.103974","url":null,"abstract":"<div><div>Hemodynamic quantities are valuable biomedical risk factors for cardiovascular pathology such as atherosclerosis. Non-invasive, in-vivo measurement of these quantities can only be performed using a select number of modalities that are not widely available, such as 4D flow magnetic resonance imaging (MRI). In this work, we create a surrogate model for hemodynamic flow field estimation, powered by machine learning. We train graph neural networks that include priors about the underlying symmetries and physics, limiting the amount of data required for training. This allows us to train the model using moderately-sized, in-vivo 4D flow MRI datasets, instead of large in-silico datasets obtained by computational fluid dynamics (CFD), as is the current standard. We create an efficient, equivariant neural network by combining the popular PointNet++ architecture with group-steerable layers. To incorporate the physics-informed priors, we derive an efficient discretisation scheme for the involved differential operators. We perform extensive experiments in carotid arteries and show that our model can accurately estimate low-noise hemodynamic flow fields in the carotid artery. Moreover, we show how the learned relation between geometry and hemodynamic quantities transfers to 3D vascular models obtained using a different imaging modality than the training data. This shows that physics-informed graph neural networks can be trained using 4D flow MRI data to estimate blood flow in unseen carotid artery geometries.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"110 ","pages":"Article 103974"},"PeriodicalIF":11.8,"publicationDate":"2026-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146138282","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diversity-driven MG-MAE: Multi-granularity representation learning for non-salient object segmentation
Pub Date: 2026-02-06 | DOI: 10.1016/j.media.2026.103971
Chengjin Yu, Bin Zhang, Chenchu Xu, Dongsheng Ruan, Rui Wang, Huafeng Liu, Xiaohu Li, Shuo Li
Masked Autoencoders (MAEs) have grown increasingly prominent as a powerful self-supervised learning paradigm. They are capable of effectively leveraging inherent image prior information and are gaining traction in the field of medical image analysis. However, their application to feature representations of non-salient objects, such as microvasculature, accessory organs, and early-stage tumors, is fundamentally limited by the dimensional collapse problem, which diminishes the feature diversity critical for discriminating non-salient structures. To address this, we propose a Multi-Granularity Masked Autoencoder (MG-MAE) framework for feature diversity learning: (1) we extend the conventional MAE into a multi-granularity framework in which a global branch reconstructs raw pixels and a local branch recovers Histogram of Oriented Gradients (HOG) features, enabling hierarchical representation of both coarse-grained and fine-grained patterns; (2) critically, in the local branch, a diversity-enhanced loss function incorporates a Nuclear Norm Maximization (NNM) constraint to explicitly mitigate feature-space collapse through orthogonal embedding regularization; and (3) a Dynamic Weight Adjustment (DWA) strategy dynamically prioritizes hard-to-reconstruct regions via entropy-driven gradient modulation. Comprehensive evaluations across five clinical benchmarks (CCTA139, BTCV, LiTS, ACDC, and MSD Pancreas Tumour) demonstrate that MG-MAE achieves statistically significant improvements in Dice Similarity Coefficient (DSC) scores for non-salient object segmentation, outperforming state-of-the-art methods. The code is available at https://github.com/zhangbbin/mgmae.
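The NNM idea lends itself to a compact illustration: maximizing the nuclear norm (the sum of singular values) of a batch feature matrix pushes embeddings away from a low-rank, collapsed configuration. The sketch below shows one plausible form of such a term; its normalization, weighting, and exact placement within the MG-MAE objective are assumptions:

```python
# Illustrative diversity term: minimizing this loss maximizes the nuclear norm
# of the normalized feature matrix, counteracting dimensional collapse.
import torch
import torch.nn.functional as F

def nnm_diversity_loss(features: torch.Tensor) -> torch.Tensor:
    """features: (N, D) batch of embeddings."""
    z = F.normalize(features, dim=1)            # unit-norm rows for scale invariance
    nuclear_norm = torch.linalg.svdvals(z).sum()
    return -nuclear_norm / z.shape[0]           # negate so minimization increases rank

print(nnm_diversity_loss(torch.randn(256, 128)))
```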
{"title":"Diversity-driven MG-MAE: Multi-granularity representation learning for non-salient object segmentation","authors":"Chengjin Yu , Bin Zhang , Chenchu Xu , Dongsheng Ruan , Rui Wang , Huafeng Liu , Xiaohu Li , Shuo Li","doi":"10.1016/j.media.2026.103971","DOIUrl":"10.1016/j.media.2026.103971","url":null,"abstract":"<div><div>Masked Autoencoders (MAEs) have grown increasingly prominent as a powerful self-supervised learning paradigm. They are capable of effectively leveraging inherent image prior information and are gaining traction in the field of medical image analysis. However, their application to feature representations of the non-salient objects, such as microvasculature, accessory organs, and early-stage tumors–is fundamentally limited by dimensional collapse problem, which diminishes feature diversity critical for non-salient structure discrimination. To address this, we propose a Multi-Granularity Masked Autoencoder (MG-MAE) framework for feature diversity learning: (1) We extend the conventional MAE into a multi-granularity framework, a global branch reconstructs global pixels, with a local branch recovering Histogram of Oriented Gradients (HOG) features, enabling hierarchical representation of both coarse-grained and fine-grained patterns; (2) Critically, in the local branch, a diversity-enhanced loss function incorporating Nuclear Norm Maximization (NNM) constraint to explicitly mitigate feature space collapse through orthogonal embedding regularization; and (3) A Dynamic Weight Adjustment (DWA) strategy that dynamically prioritizes hard-to-reconstruct regions via entropy-driven gradient modulation. Comprehensive evaluations across five clinical benchmarks–CCTA139, BTCV, LiTS, ACDC, and MSD Pancreas Tumour datasets–demonstrate that MG-MAE achieves statistically significant improvements in Dice Similarity Coefficient (DSC) scores for non-salient object segmentation, outperforming state-of-the-art methods. The code is available at <span><span>https://github.com/zhangbbin/mgmae</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"110 ","pages":"Article 103971"},"PeriodicalIF":11.8,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146134049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Slot-BERT: Self-supervised object discovery in surgical video
Pub Date: 2026-02-03 | DOI: 10.1016/j.media.2026.103972
Guiqiu Liao, Matjaž Jogan, Marcel Hussing, Kenta Nakahashi, Kazuhiro Yasufuku, Amin Madani, Eric Eaton, Daniel A. Hashimoto
Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical video. However, current object-centric models either fail to reliably capture object dependencies in seconds-long video episodes that encompass surgical actions and tasks, or are computationally too expensive for practical implementation. We introduce Slot-BERT, a slot attention model with a temporal slot transformer module that overcomes these limitations. Our core innovations are: (1) a bidirectional transformer module that processes object-centric slot representations, enabling longer-range temporal coherence; and (2) a slot-contrastive loss that further improves the representation by enforcing slot dissimilarity. We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures, and on real and synthetic videos with everyday objects. Our method surpasses state-of-the-art object-centric approaches under unsupervised training across these domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.
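A slot-dissimilarity term can be illustrated compactly: penalize pairwise cosine similarity between slots so that different slots are pushed toward encoding different objects. The sketch below is one plausible form; the exact contrastive formulation used in Slot-BERT may differ:

```python
# Illustrative slot-dissimilarity penalty (the paper's exact loss may differ).
import torch
import torch.nn.functional as F

def slot_dissimilarity_loss(slots: torch.Tensor) -> torch.Tensor:
    """slots: (B, K, D) slot vectors; penalizes off-diagonal cosine similarity."""
    s = F.normalize(slots, dim=-1)
    sim = torch.bmm(s, s.transpose(1, 2))                 # (B, K, K) cosine matrix
    k = slots.shape[1]
    eye = torch.eye(k, device=slots.device, dtype=torch.bool)
    off_diag = sim.masked_fill(eye, 0.0)                  # ignore self-similarity
    return off_diag.pow(2).sum() / (slots.shape[0] * k * (k - 1))

print(slot_dissimilarity_loss(torch.randn(2, 7, 64)))
```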
{"title":"Slot-BERT: Self-supervised object discovery in surgical video","authors":"Guiqiu Liao , Matjaž Jogan , Marcel Hussing , Kenta Nakahashi , Kazuhiro Yasufuku , Amin Madani , Eric Eaton , Daniel A. Hashimoto","doi":"10.1016/j.media.2026.103972","DOIUrl":"10.1016/j.media.2026.103972","url":null,"abstract":"<div><div>Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical video. However, current object-centric models either fail to reliably capture object dependencies in seconds-long video episodes that encompass surgical actions and tasks or are computationally too expensive for practical implementation. We introduce Slot-BERT, a slot attention model with a temporal slot transformer module to overcome these limitations. Our core innovations are: 1) A bidirectional transformer module that processes object-centric slot representations, enabling longer-range temporal coherence; 2) A slot-contrastive loss that further improves the representation by enforcing slot dissimilarity; 3) We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures, and on real and synthetic videos with everyday objects. Our method surpasses state-of-the-art object-centric approaches under unsupervised training achieving superior performance across these domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"110 ","pages":"Article 103972"},"PeriodicalIF":11.8,"publicationDate":"2026-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146109926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WSISum: WSI summarization via dual-level semantic reconstruction
Pub Date: 2026-02-02 | DOI: 10.1016/j.media.2026.103970
Baizhi Wang, Kun Zhang, Yuhao Wang, Yunjie Gu, Haijing Luan, Ying Zhou, Taiyuan Hu, Rundong Wang, Zhidong Yang, Zihang Jiang, Rui Yan, S. Kevin Zhou
Each gigapixel whole slide image (WSI) contains tens of thousands of patches, many of which are redundant, leading to significant computational, storage, and transmission overhead. This motivates the need for automatic WSI summarization, which aims to extract a compact subset of patches that can effectively approximate the original WSI. In this paper, we propose WSISum, a unified framework that performs WSI Summarization through dual-level semantic reconstruction. Specifically, WSISum integrates two complementary reconstruction strategies: low-level patch semantic reconstruction via clustering-based sparse sampling; and high-level slide semantic reconstruction through knowledge distillation from multiple WSI-level foundation models. Experimental results show that WSISum achieves satisfactory performance in a variety of downstream tasks, including cancer subtyping, biomarker prediction, and metastasis subtyping, while significantly reducing computational cost. Code and models are available at https://github.com/Badgewho/WSISum.
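The low-level, clustering-based sparse sampling step can be illustrated with a minimal sketch: cluster patch embeddings and keep the patch nearest each centroid as the summary subset. The embedding source and cluster count here are assumptions, and the slide-level distillation component is omitted (see the repository for the full pipeline):

```python
# Illustrative clustering-based sparse patch sampling (embedding source and
# cluster count are assumptions; slide-level distillation is omitted).
import numpy as np
from sklearn.cluster import KMeans

def summarize_patches(embeddings: np.ndarray, n_keep: int = 64) -> np.ndarray:
    """embeddings: (N, D) patch features; returns indices of n_keep patches."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=0).fit(embeddings)
    dists = km.transform(embeddings)        # (N, n_keep) distance to each centroid
    return dists.argmin(axis=0)             # nearest patch per cluster

emb = np.random.randn(10000, 384).astype(np.float32)
print(summarize_patches(emb).shape)         # (64,)
```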
{"title":"WSISum: WSI summarization via dual-level semantic reconstruction","authors":"Baizhi Wang , Kun Zhang , Yuhao Wang , Yunjie Gu , Haijing Luan , Ying Zhou , Taiyuan Hu , Rundong Wang , Zhidong Yang , Zihang Jiang , Rui Yan , S. Kevin Zhou","doi":"10.1016/j.media.2026.103970","DOIUrl":"10.1016/j.media.2026.103970","url":null,"abstract":"<div><div>Each gigapixel whole slide image (WSI) contains tens of thousands of patches, many of which are redundant, leading to significant computational, storage, and transmission overhead. This motivates the need for automatic WSI summarization, which aims to extract a compact subset of patches that can effectively approximate the original WSI. In this paper, we propose <strong>WSISum</strong>, a unified framework that performs <strong>WSI Sum</strong>marization through dual-level semantic reconstruction. Specifically, WSISum integrates two complementary reconstruction strategies: <em>low-level patch semantic reconstruction</em> via clustering-based sparse sampling; and <em>high-level slide semantic reconstruction</em> through knowledge distillation from multiple WSI-level foundation models. Experimental results show that WSISum achieves satisfactory performance in a variety of downstream tasks, including cancer subtyping, biomarker prediction, and metastasis subtyping, while significantly reducing computational cost. Code and models are available at <span><span>https://github.com/Badgewho/WSISum</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":18328,"journal":{"name":"Medical image analysis","volume":"110 ","pages":"Article 103970"},"PeriodicalIF":11.8,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146109933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}