Calum Green, Sharif Ahmed, Shashidhara Marathe, Liam Perera, Alberto Leonardi, Killian Gmyrek, Daniele Dini, James Le Houx
Machine learning techniques are being increasingly applied in medical and physical sciences across a variety of imaging modalities; however, an important issue when developing these tools is the availability of good quality training data. Here we present a unique, multimodal synchrotron dataset of a bespoke zinc-doped Zeolite 13X sample that can be used to develop advanced deep learning and data fusion pipelines. Multi-resolution micro X-ray computed tomography was performed on a zinc-doped Zeolite 13X fragment to characterise its pores and features, before spatially resolved X-ray diffraction computed tomography was carried out to characterise the homogeneous distribution of sodium and zinc phases. Zinc absorption was controlled to create a simple, spatially isolated, two-phase material. Both raw and processed data are available as a series of Zenodo entries. Altogether, we present a spatially resolved, three-dimensional, multimodal, multi-resolution dataset that can be used for the development of machine learning techniques, including super-resolution, multimodal data fusion, and 3D reconstruction algorithms.
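As an illustration of how such a multi-resolution dataset could feed a super-resolution pipeline, the following minimal sketch (not part of the released processing code) pairs high-resolution micro-CT patches with synthetically downsampled counterparts; the random stand-in volume, 4x scale factor, and patch size are assumptions for demonstration only.

```python
# Illustrative sketch: build paired low-/high-resolution patches from a 3D
# micro-CT volume for super-resolution training. The volume here is random.
import numpy as np

def extract_lr_hr_pairs(hr_volume: np.ndarray, scale: int = 4, patch: int = 64, n: int = 8, seed: int = 0):
    """Sample HR patches and synthesise matching LR patches by block-averaging."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n):
        z, y, x = (rng.integers(0, s - patch + 1) for s in hr_volume.shape)
        hr = hr_volume[z:z + patch, y:y + patch, x:x + patch]
        # Block-average to mimic a lower-resolution acquisition of the same region.
        lr = hr.reshape(patch // scale, scale, patch // scale, scale, patch // scale, scale).mean(axis=(1, 3, 5))
        pairs.append((lr.astype(np.float32), hr.astype(np.float32)))
    return pairs

if __name__ == "__main__":
    volume = np.random.rand(128, 128, 128)  # stand-in for a reconstructed micro-CT volume
    pairs = extract_lr_hr_pairs(volume)
    print(pairs[0][0].shape, pairs[0][1].shape)  # (16, 16, 16) (64, 64, 64)
```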
{"title":"Three-Dimensional, Multimodal Synchrotron Data for Machine Learning Applications","authors":"Calum Green, Sharif Ahmed, Shashidhara Marathe, Liam Perera, Alberto Leonardi, Killian Gmyrek, Daniele Dini, James Le Houx","doi":"arxiv-2409.07322","DOIUrl":"https://doi.org/arxiv-2409.07322","url":null,"abstract":"Machine learning techniques are being increasingly applied in medical and\u0000physical sciences across a variety of imaging modalities; however, an important\u0000issue when developing these tools is the availability of good quality training\u0000data. Here we present a unique, multimodal synchrotron dataset of a bespoke\u0000zinc-doped Zeolite 13X sample that can be used to develop advanced deep\u0000learning and data fusion pipelines. Multi-resolution micro X-ray computed\u0000tomography was performed on a zinc-doped Zeolite 13X fragment to characterise\u0000its pores and features, before spatially resolved X-ray diffraction computed\u0000tomography was carried out to characterise the homogeneous distribution of\u0000sodium and zinc phases. Zinc absorption was controlled to create a simple,\u0000spatially isolated, two-phase material. Both raw and processed data is\u0000available as a series of Zenodo entries. Altogether we present a spatially\u0000resolved, three-dimensional, multimodal, multi-resolution dataset that can be\u0000used for the development of machine learning techniques. Such techniques\u0000include development of super-resolution, multimodal data fusion, and 3D\u0000reconstruction algorithm development.","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wangduo Xie, Richard Schoonhoven, Tristan van Leeuwen, Matthew B. Blaschko
Computed tomography (CT) reconstruction plays a crucial role in industrial nondestructive testing and medical diagnosis. Sparse view CT reconstruction aims to reconstruct high-quality CT images while only using a small number of projections, which helps to improve the detection speed of industrial assembly lines and is also meaningful for reducing radiation in medical scenarios. Sparse CT reconstruction methods based on implicit neural representations (INRs) have recently shown promising performance, but still produce artifacts because of the difficulty of obtaining useful prior information. In this work, we incorporate a powerful prior: the total number of material categories of objects. To utilize the prior, we design AC-IND, a self-supervised method based on Attenuation Coefficient Estimation and Implicit Neural Distribution. Specifically, our method first transforms the traditional INR from scalar mapping to probability distribution mapping. Then we design a compact attenuation coefficient estimator initialized with values from a rough reconstruction and fast segmentation. Finally, our algorithm finishes the CT reconstruction by jointly optimizing the estimator and the generated distribution. Through experiments, we find that our method not only outperforms the comparative methods in sparse CT reconstruction but also can automatically generate semantic segmentation maps.
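The distribution-mapping idea described above can be sketched as follows; this is a hedged illustration rather than the authors' AC-IND implementation, and the number of materials, network size, and initial attenuation values are placeholders.

```python
# Sketch: an implicit network maps a 2D coordinate to a categorical distribution
# over K material classes; learnable per-material attenuation coefficients turn
# that distribution into an expected attenuation value at each point.
import torch
import torch.nn as nn

class DistributionINR(nn.Module):
    def __init__(self, num_materials: int = 3, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_materials),
        )
        # One attenuation coefficient per material; in practice these could be
        # initialised from a rough reconstruction plus a fast segmentation.
        self.mu = nn.Parameter(torch.linspace(0.0, 0.5, num_materials))

    def forward(self, coords: torch.Tensor):
        probs = torch.softmax(self.mlp(coords), dim=-1)   # per-point material distribution
        attenuation = (probs * self.mu).sum(dim=-1)       # expected attenuation coefficient
        return attenuation, probs

model = DistributionINR()
xy = torch.rand(1024, 2) * 2 - 1                 # normalised pixel coordinates
mu_pred, probs = model(xy)
print(mu_pred.shape, probs.argmax(dim=-1).shape)  # attenuation values and a per-point material label
```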
{"title":"AC-IND: Sparse CT reconstruction based on attenuation coefficient estimation and implicit neural distribution","authors":"Wangduo Xie, Richard Schoonhoven, Tristan van Leeuwen, Matthew B. Blaschko","doi":"arxiv-2409.07171","DOIUrl":"https://doi.org/arxiv-2409.07171","url":null,"abstract":"Computed tomography (CT) reconstruction plays a crucial role in industrial\u0000nondestructive testing and medical diagnosis. Sparse view CT reconstruction\u0000aims to reconstruct high-quality CT images while only using a small number of\u0000projections, which helps to improve the detection speed of industrial assembly\u0000lines and is also meaningful for reducing radiation in medical scenarios.\u0000Sparse CT reconstruction methods based on implicit neural representations\u0000(INRs) have recently shown promising performance, but still produce artifacts\u0000because of the difficulty of obtaining useful prior information. In this work,\u0000we incorporate a powerful prior: the total number of material categories of\u0000objects. To utilize the prior, we design AC-IND, a self-supervised method based\u0000on Attenuation Coefficient Estimation and Implicit Neural Distribution.\u0000Specifically, our method first transforms the traditional INR from scalar\u0000mapping to probability distribution mapping. Then we design a compact\u0000attenuation coefficient estimator initialized with values from a rough\u0000reconstruction and fast segmentation. Finally, our algorithm finishes the CT\u0000reconstruction by jointly optimizing the estimator and the generated\u0000distribution. Through experiments, we find that our method not only outperforms\u0000the comparative methods in sparse CT reconstruction but also can automatically\u0000generate semantic segmentation maps.","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advancements in predicting pedestrian crossing intentions for Autonomous Vehicles using Computer Vision and Deep Neural Networks are promising. However, the black-box nature of DNNs poses challenges in understanding how the model works and how input features contribute to final predictions. This lack of interpretability limits trust in model performance and hinders informed decisions on feature selection, representation, and model optimisation, thereby affecting the efficacy of future research in the field. To address this, we introduce Context-aware Permutation Feature Importance (CAPFI), a novel approach tailored for pedestrian intention prediction. CAPFI enables more interpretable and reliable assessments of feature importance by leveraging subdivided scenario contexts, mitigating the randomness of feature values through targeted shuffling. This aims to reduce variance and prevent biased estimations in importance scores during permutations. We divide the Pedestrian Intention Estimation (PIE) dataset into 16 comparable context sets, measure the baseline performance of five distinct neural network architectures for intention prediction in each context, and assess input feature importance using CAPFI. We observed nuanced differences among models across various contextual characteristics. The research reveals the critical role of pedestrian bounding boxes and ego-vehicle speed in predicting pedestrian intentions and, through cross-context permutation evaluation, potential prediction biases due to the speed feature. We propose an alternative feature representation that uses the proximity change rate to capture dynamic pedestrian-vehicle locomotion, thereby enhancing the contributions of input features to intention prediction. These findings underscore the importance of contextual features and their diversity for developing accurate and robust intent-predictive models.
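A minimal sketch of context-aware permutation feature importance as described above, assuming a generic scoring function and toy data rather than the PIE dataset or the paper's exact protocol:

```python
# Sketch: split the data into context subsets, then shuffle each feature
# *within* a context before re-scoring, so the importance estimate is not
# confounded by cross-context variation of that feature.
import numpy as np

def capfi(model_score, X, y, contexts, n_repeats: int = 10, seed: int = 0):
    """model_score(X, y) -> scalar metric (higher is better)."""
    rng = np.random.default_rng(seed)
    unique_contexts = np.unique(contexts)
    importances = np.zeros((len(unique_contexts), X.shape[1]))
    for ci, c in enumerate(unique_contexts):
        idx = np.where(contexts == c)[0]
        base = model_score(X[idx], y[idx])
        for f in range(X.shape[1]):
            drops = []
            for _ in range(n_repeats):
                Xp = X[idx].copy()
                Xp[:, f] = rng.permutation(Xp[:, f])   # shuffle one feature inside this context only
                drops.append(base - model_score(Xp, y[idx]))
            importances[ci, f] = np.mean(drops)
    return importances  # shape: (n_contexts, n_features)

# Toy usage: a "model" whose score depends only on feature 0.
score = lambda X, y: -np.mean((X[:, 0] - y) ** 2)
X = np.random.rand(200, 3); y = X[:, 0]; ctx = np.repeat([0, 1], 100)
print(capfi(score, X, y, ctx).round(3))
```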
{"title":"Feature Importance in Pedestrian Intention Prediction: A Context-Aware Review","authors":"Mohsen Azarmi, Mahdi Rezaei, He Wang, Ali Arabian","doi":"arxiv-2409.07645","DOIUrl":"https://doi.org/arxiv-2409.07645","url":null,"abstract":"Recent advancements in predicting pedestrian crossing intentions for\u0000Autonomous Vehicles using Computer Vision and Deep Neural Networks are\u0000promising. However, the black-box nature of DNNs poses challenges in\u0000understanding how the model works and how input features contribute to final\u0000predictions. This lack of interpretability delimits the trust in model\u0000performance and hinders informed decisions on feature selection,\u0000representation, and model optimisation; thereby affecting the efficacy of\u0000future research in the field. To address this, we introduce Context-aware\u0000Permutation Feature Importance (CAPFI), a novel approach tailored for\u0000pedestrian intention prediction. CAPFI enables more interpretability and\u0000reliable assessments of feature importance by leveraging subdivided scenario\u0000contexts, mitigating the randomness of feature values through targeted\u0000shuffling. This aims to reduce variance and prevent biased estimations in\u0000importance scores during permutations. We divide the Pedestrian Intention\u0000Estimation (PIE) dataset into 16 comparable context sets, measure the baseline\u0000performance of five distinct neural network architectures for intention\u0000prediction in each context, and assess input feature importance using CAPFI. We\u0000observed nuanced differences among models across various contextual\u0000characteristics. The research reveals the critical role of pedestrian bounding\u0000boxes and ego-vehicle speed in predicting pedestrian intentions, and potential\u0000prediction biases due to the speed feature through cross-context permutation\u0000evaluation. We propose an alternative feature representation by considering\u0000proximity change rate for rendering dynamic pedestrian-vehicle locomotion,\u0000thereby enhancing the contributions of input features to intention prediction.\u0000These findings underscore the importance of contextual features and their\u0000diversity to develop accurate and robust intent-predictive models.","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ho Man Kwan, Ge Gao, Fan Zhang, Andrew Gower, David Bull
Recent advances in implicit neural representation (INR)-based video coding have demonstrated its potential to compete with both conventional and other learning-based approaches. With INR methods, a neural network is trained to overfit a video sequence, with its parameters compressed to obtain a compact representation of the video content. However, although promising results have been achieved, the best INR-based methods are still outperformed by the latest standard codecs, such as VVC VTM, partially due to the simple model compression techniques employed. In this paper, rather than focusing on representation architectures as in many existing works, we propose a novel INR-based video compression framework, Neural Video Representation Compression (NVRC), targeting compression of the representation. Based on the novel entropy coding and quantization models proposed, NVRC is, for the first time, able to optimize an INR-based video codec in a fully end-to-end manner. To further minimize the additional bitrate overhead introduced by the entropy models, we also propose a new model compression framework for coding all the network, quantization, and entropy model parameters hierarchically. Our experiments show that NVRC outperforms many conventional and learning-based benchmark codecs, with a 24% average coding gain over VVC VTM (Random Access) on the UVG dataset, measured in PSNR. As far as we are aware, this is the first time an INR-based video codec has achieved such performance. The implementation of NVRC will be released at www.github.com.
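The end-to-end rate-distortion idea behind compressing representation parameters can be illustrated with the generic sketch below; it uses straight-through quantisation and a simple factorised Gaussian entropy model as stand-ins, not NVRC's actual entropy coding or hierarchical parameter coding.

```python
# Sketch: quantise a parameter tensor with a learnable step (straight-through
# rounding) and trade off a proxy rate term against a reconstruction loss.
import torch
import torch.nn as nn

def quantise_ste(w: torch.Tensor, step: torch.Tensor) -> torch.Tensor:
    """Round w to multiples of step; straight-through gradients for w and step."""
    scaled = w / step
    return step * (scaled + (torch.round(scaled) - scaled).detach())

def gaussian_rate_bits(q: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Approximate bits to code q under N(0, sigma^2) (continuous proxy)."""
    dist = torch.distributions.Normal(0.0, sigma)
    return -dist.log_prob(q).sum() / torch.log(torch.tensor(2.0))

X, target = torch.randn(256, 64), torch.randn(256)   # toy fitting problem standing in for an INR
w = nn.Parameter(torch.randn(64) * 0.1)              # one "network" weight tensor
log_step = nn.Parameter(torch.tensor(-4.0))          # learnable quantisation step (log domain)
log_sigma = nn.Parameter(torch.tensor(-2.0))         # entropy-model scale parameter
opt = torch.optim.Adam([w, log_step, log_sigma], lr=1e-2)

for _ in range(200):
    q = quantise_ste(w, log_step.exp())
    distortion = ((X @ q - target) ** 2).mean()
    rate = gaussian_rate_bits(q, log_sigma.exp())
    loss = distortion + 1e-3 * rate                   # rate-distortion trade-off (lambda assumed)
    opt.zero_grad(); loss.backward(); opt.step()

print(f"step: {log_step.exp().item():.4f}, estimated rate: {rate.item():.1f} bits")
```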
{"title":"NVRC: Neural Video Representation Compression","authors":"Ho Man Kwan, Ge Gao, Fan Zhang, Andrew Gower, David Bull","doi":"arxiv-2409.07414","DOIUrl":"https://doi.org/arxiv-2409.07414","url":null,"abstract":"Recent advances in implicit neural representation (INR)-based video coding\u0000have demonstrated its potential to compete with both conventional and other\u0000learning-based approaches. With INR methods, a neural network is trained to\u0000overfit a video sequence, with its parameters compressed to obtain a compact\u0000representation of the video content. However, although promising results have\u0000been achieved, the best INR-based methods are still out-performed by the latest\u0000standard codecs, such as VVC VTM, partially due to the simple model compression\u0000techniques employed. In this paper, rather than focusing on representation\u0000architectures as in many existing works, we propose a novel INR-based video\u0000compression framework, Neural Video Representation Compression (NVRC),\u0000targeting compression of the representation. Based on the novel entropy coding\u0000and quantization models proposed, NVRC, for the first time, is able to optimize\u0000an INR-based video codec in a fully end-to-end manner. To further minimize the\u0000additional bitrate overhead introduced by the entropy models, we have also\u0000proposed a new model compression framework for coding all the network,\u0000quantization and entropy model parameters hierarchically. Our experiments show\u0000that NVRC outperforms many conventional and learning-based benchmark codecs,\u0000with a 24% average coding gain over VVC VTM (Random Access) on the UVG dataset,\u0000measured in PSNR. As far as we are aware, this is the first time an INR-based\u0000video codec achieving such performance. The implementation of NVRC will be\u0000released at www.github.com.","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200278","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gaia Romana De Paolis, Dimitrios Lenis, Johannes Novotny, Maria Wimmer, Astrid Berg, Theresa Neubauer, Philip Matthias Winter, David Major, Ariharasudhan Muthusami, Gerald Schröcker, Martin Mienkina, Katja Bühler
Efficient and fast reconstruction of anatomical structures plays a crucial role in clinical practice. Minimizing retrieval and processing times not only potentially enhances swift response and decision-making in critical scenarios but also supports interactive surgical planning and navigation. Recent methods attempt to solve the medical shape reconstruction problem by utilizing implicit neural functions. However, their performance suffers in terms of generalization and computation time, a critical metric for real-time applications. To address these challenges, we propose to leverage meta-learning to improve the initialization of the network parameters, reducing inference time by an order of magnitude while maintaining high accuracy. We evaluate our approach on three public datasets covering different anatomical shapes and modalities, namely CT and MRI. Our experimental results show that our model can handle various input configurations, such as sparse slices with different orientations and spacings. Additionally, we demonstrate that our method exhibits strong transferable capabilities in generalizing to shape domains unobserved at training time.
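A hedged sketch of the general meta-learned-initialisation idea follows; it uses the simple Reptile update and a toy sphere signed-distance task, which are illustrative assumptions rather than the paper's method or data.

```python
# Sketch: meta-learn an initialisation for a coordinate network so that a new
# shape needs only a few gradient steps from the learned starting point.
import copy
import torch
import torch.nn as nn

def make_inr():
    return nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

def sample_task():
    """Toy 'shape': signed distance to a sphere with a random radius."""
    r = 0.3 + 0.4 * torch.rand(1)
    pts = torch.rand(512, 3) * 2 - 1
    return pts, pts.norm(dim=1, keepdim=True) - r

meta_net = make_inr()
meta_lr, inner_lr, inner_steps = 0.05, 1e-2, 5

for _ in range(200):                          # meta-training over many shapes
    net = copy.deepcopy(meta_net)
    opt = torch.optim.SGD(net.parameters(), lr=inner_lr)
    pts, sdf = sample_task()
    for _ in range(inner_steps):              # brief per-shape adaptation
        loss = ((net(pts) - sdf) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                     # Reptile: move the initialisation toward the adapted weights
        for p_meta, p_task in zip(meta_net.parameters(), net.parameters()):
            p_meta += meta_lr * (p_task - p_meta)

pts, sdf = sample_task()                      # unseen shape: a few steps now suffice
net = copy.deepcopy(meta_net)
opt = torch.optim.SGD(net.parameters(), lr=inner_lr)
for _ in range(inner_steps):
    loss = ((net(pts) - sdf) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(f"post-adaptation MSE: {loss.item():.4f}")
```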
{"title":"Fast Medical Shape Reconstruction via Meta-learned Implicit Neural Representations","authors":"Gaia Romana De Paolis, Dimitrios Lenis, Johannes Novotny, Maria Wimmer, Astrid Berg, Theresa Neubauer, Philip Matthias Winter, David Major, Ariharasudhan Muthusami, Gerald Schröcker, Martin Mienkina, Katja Bühler","doi":"arxiv-2409.07100","DOIUrl":"https://doi.org/arxiv-2409.07100","url":null,"abstract":"Efficient and fast reconstruction of anatomical structures plays a crucial\u0000role in clinical practice. Minimizing retrieval and processing times not only\u0000potentially enhances swift response and decision-making in critical scenarios\u0000but also supports interactive surgical planning and navigation. Recent methods\u0000attempt to solve the medical shape reconstruction problem by utilizing implicit\u0000neural functions. However, their performance suffers in terms of generalization\u0000and computation time, a critical metric for real-time applications. To address\u0000these challenges, we propose to leverage meta-learning to improve the network\u0000parameters initialization, reducing inference time by an order of magnitude\u0000while maintaining high accuracy. We evaluate our approach on three public\u0000datasets covering different anatomical shapes and modalities, namely CT and\u0000MRI. Our experimental results show that our model can handle various input\u0000configurations, such as sparse slices with different orientations and spacings.\u0000Additionally, we demonstrate that our method exhibits strong transferable\u0000capabilities in generalizing to shape domains unobserved at training time.","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PET/CT is extensively used in imaging malignant tumors because it highlights areas of increased glucose metabolism, indicative of cancerous activity. Accurate 3D lesion segmentation in PET/CT imaging is essential for effective oncological diagnostics and treatment planning. In this study, we developed an advanced 3D residual U-Net model for the Automated Lesion Segmentation in Whole-Body PET/CT - Multitracer Multicenter Generalization (autoPET III) Challenge, which will be held jointly with the 2024 Medical Image Computing and Computer Assisted Intervention (MICCAI) conference in Marrakesh, Morocco. The proposed model incorporates a novel sample attention boosting technique that enhances segmentation performance by adjusting the contribution of challenging cases during training, improving generalization across FDG and PSMA tracers. The proposed model outperformed the challenge baseline model on the preliminary test set on the Grand Challenge platform, and our team currently ranks 2nd among 497 participants worldwide from 53 countries (accessed 2024/9/4), with a Dice score of 0.8700, a False Negative Volume of 19.3969, and a False Positive Volume of 1.0857.
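The abstract does not spell out the boosting rule, so the following is only a generic sketch of weighting challenging cases during training: per-sample Dice losses are reweighted so that harder samples contribute more to the batch loss; the softmax temperature and toy tensors are assumptions.

```python
# Sketch: compute a per-sample soft Dice loss, then give higher-loss (harder)
# samples a larger weight in the batch average.
import torch

def soft_dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-sample soft Dice loss for binary masks; pred are probabilities."""
    dims = tuple(range(1, pred.dim()))
    inter = (pred * target).sum(dims)
    denom = pred.sum(dims) + target.sum(dims)
    return 1 - (2 * inter + eps) / (denom + eps)

def boosted_batch_loss(pred, target, temperature: float = 1.0):
    per_sample = soft_dice_loss(pred, target)                          # shape: (batch,)
    weights = torch.softmax(per_sample.detach() / temperature, dim=0)  # harder samples get larger weight
    return (weights * per_sample).sum()

logits = torch.randn(4, 1, 32, 32, 32, requires_grad=True)    # toy 3D segmentation outputs
target = (torch.rand(4, 1, 32, 32, 32) > 0.9).float()
loss = boosted_batch_loss(torch.sigmoid(logits), target)
loss.backward()
print(loss.item())
```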
{"title":"Dual channel CW nnU-Net for 3D PET-CT Lesion Segmentation in 2024 autoPET III Challenge","authors":"Ching-Wei Wang, Ting-Sheng Su, Keng-Wei Liu","doi":"arxiv-2409.07144","DOIUrl":"https://doi.org/arxiv-2409.07144","url":null,"abstract":"PET/CT is extensively used in imaging malignant tumors because it highlights\u0000areas of increased glucose metabolism, indicative of cancerous activity.\u0000Accurate 3D lesion segmentation in PET/CT imaging is essential for effective\u0000oncological diagnostics and treatment planning. In this study, we developed an\u0000advanced 3D residual U-Net model for the Automated Lesion Segmentation in\u0000Whole-Body PET/CT - Multitracer Multicenter Generalization (autoPET III)\u0000Challenge, which will be held jointly with 2024 Medical Image Computing and\u0000Computer Assisted Intervention (MICCAI) conference at Marrakesh, Morocco.\u0000Proposed model incorporates a novel sample attention boosting technique to\u0000enhance segmentation performance by adjusting the contribution of challenging\u0000cases during training, improving generalization across FDG and PSMA tracers.\u0000The proposed model outperformed the challenge baseline model in the preliminary\u0000test set on the Grand Challenge platform, and our team is currently ranking in\u0000the 2nd place among 497 participants worldwide from 53 countries (accessed\u0000date: 2024/9/4), with Dice score of 0.8700, False Negative Volume of 19.3969\u0000and False Positive Volume of 1.0857.","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Low-light image enhancement, particularly in cross-domain tasks such as mapping from the raw domain to the sRGB domain, remains a significant challenge. Many deep learning-based methods have been developed to address this issue and have shown promising results in recent years. However, single-stage methods, which attempt to unify the complex mapping across both domains, suffer from limited denoising performance. In contrast, two-stage approaches typically decompose a raw image with a color filter array (CFA) into a four-channel RGGB format before feeding it into a neural network. However, this strategy overlooks the critical role of demosaicing within the Image Signal Processing (ISP) pipeline, leading to color distortions under varying lighting conditions, especially in low-light scenarios. To address these issues, we design a novel Mamba scanning mechanism, called RAWMamba, to effectively handle raw images with different CFAs. Furthermore, we present a Retinex Decomposition Module (RDM) grounded in the Retinex prior, which decouples illumination from reflectance to facilitate more effective denoising and automatic non-linear exposure correction. By bridging demosaicing and denoising, better raw image enhancement is achieved. Experimental evaluations conducted on the public SID and MCR datasets demonstrate that our proposed RAWMamba achieves state-of-the-art performance on cross-domain mapping.
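A generic Retinex-style decomposition can be sketched as below; this is not the paper's RDM, and the tiny convolutional branches, packed four-channel input, and loss terms are illustrative assumptions.

```python
# Sketch: one branch predicts a smooth illumination map, another the
# reflectance, with the constraint that their product reproduces the input.
import torch
import torch.nn as nn

class RetinexDecomposition(nn.Module):
    def __init__(self, channels: int = 4):                  # e.g. packed RGGB raw input
        super().__init__()
        self.illum = nn.Sequential(nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
                                   nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid())
        self.reflect = nn.Sequential(nn.Conv2d(channels, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, channels, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        L = self.illum(x)                                    # single-channel illumination
        R = self.reflect(x)                                  # per-channel reflectance
        return R, L

model = RetinexDecomposition()
x = torch.rand(2, 4, 64, 64)                                 # toy packed raw patches
R, L = model(x)
recon_loss = ((R * L - x) ** 2).mean()                       # reconstruction constraint: input ~ R * L
smooth_loss = (L[..., :, 1:] - L[..., :, :-1]).abs().mean()  # encourage a smooth illumination map
print(recon_loss.item(), smooth_loss.item())
```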
{"title":"Retinex-RAWMamba: Bridging Demosaicing and Denoising for Low-Light RAW Image Enhancement","authors":"Xianmin Chen, Peiliang Huang, Xiaoxu Feng, Dingwen Zhang, Longfei Han, Junwei Han","doi":"arxiv-2409.07040","DOIUrl":"https://doi.org/arxiv-2409.07040","url":null,"abstract":"Low-light image enhancement, particularly in cross-domain tasks such as\u0000mapping from the raw domain to the sRGB domain, remains a significant\u0000challenge. Many deep learning-based methods have been developed to address this\u0000issue and have shown promising results in recent years. However, single-stage\u0000methods, which attempt to unify the complex mapping across both domains,\u0000leading to limited denoising performance. In contrast, two-stage approaches\u0000typically decompose a raw image with color filter arrays (CFA) into a\u0000four-channel RGGB format before feeding it into a neural network. However, this\u0000strategy overlooks the critical role of demosaicing within the Image Signal\u0000Processing (ISP) pipeline, leading to color distortions under varying lighting\u0000conditions, especially in low-light scenarios. To address these issues, we\u0000design a novel Mamba scanning mechanism, called RAWMamba, to effectively handle\u0000raw images with different CFAs. Furthermore, we present a Retinex Decomposition\u0000Module (RDM) grounded in Retinex prior, which decouples illumination from\u0000reflectance to facilitate more effective denoising and automatic non-linear\u0000exposure correction. By bridging demosaicing and denoising, better raw image\u0000enhancement is achieved. Experimental evaluations conducted on public datasets\u0000SID and MCR demonstrate that our proposed RAWMamba achieves state-of-the-art\u0000performance on cross-domain mapping.","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the field of Alzheimer's disease diagnosis, segmentation and classification tasks are inherently interconnected. Sharing knowledge between models for these tasks can significantly improve training efficiency, particularly when training data is scarce. However, traditional knowledge distillation techniques often struggle to bridge the gap between segmentation and classification due to the distinct nature of tasks and different model architectures. To address this challenge, we propose a dual-stream pipeline that facilitates cross-task and cross-architecture knowledge sharing. Our approach introduces a dual-stream embedding module that unifies feature representations from segmentation and classification models, enabling dimensional integration of these features to guide the classification model. We validated our method on multiple 3D datasets for Alzheimer's disease diagnosis, demonstrating significant improvements in classification performance, especially on small datasets. Furthermore, we extended our pipeline with a residual temporal attention mechanism for early diagnosis, utilizing images taken before the atrophy of patients' brain mass. This advancement shows promise in enabling diagnosis approximately six months earlier in mild and asymptomatic stages, offering critical time for intervention.
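The cross-architecture feature-sharing idea can be illustrated with the hedged sketch below, in which pooled features from a segmentation encoder and a classification backbone are projected into a shared embedding space and fused; the backbones, feature dimensions, and fusion head are placeholders, not the DS-ViT modules.

```python
# Sketch: project features from two different models into one embedding space
# and fuse them before the classification head.
import torch
import torch.nn as nn

class DualStreamHead(nn.Module):
    def __init__(self, seg_dim: int, cls_dim: int, embed_dim: int = 256, num_classes: int = 2):
        super().__init__()
        self.proj_seg = nn.Linear(seg_dim, embed_dim)   # unify segmentation features
        self.proj_cls = nn.Linear(cls_dim, embed_dim)   # unify classification features
        self.head = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, seg_feat, cls_feat):
        fused = torch.cat([self.proj_seg(seg_feat), self.proj_cls(cls_feat)], dim=-1)
        return self.head(fused)

# Toy usage with pooled features from two hypothetical backbones.
seg_feat = torch.randn(8, 512)   # e.g. globally pooled features of a segmentation encoder
cls_feat = torch.randn(8, 768)   # e.g. the class token of a vision transformer
logits = DualStreamHead(512, 768)(seg_feat, cls_feat)
print(logits.shape)              # torch.Size([8, 2])
```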
{"title":"DS-ViT: Dual-Stream Vision Transformer for Cross-Task Distillation in Alzheimer's Early Diagnosis","authors":"Ke Chen, Yifeng Wang, Yufei Zhou, Haohan Wang","doi":"arxiv-2409.07584","DOIUrl":"https://doi.org/arxiv-2409.07584","url":null,"abstract":"In the field of Alzheimer's disease diagnosis, segmentation and\u0000classification tasks are inherently interconnected. Sharing knowledge between\u0000models for these tasks can significantly improve training efficiency,\u0000particularly when training data is scarce. However, traditional knowledge\u0000distillation techniques often struggle to bridge the gap between segmentation\u0000and classification due to the distinct nature of tasks and different model\u0000architectures. To address this challenge, we propose a dual-stream pipeline\u0000that facilitates cross-task and cross-architecture knowledge sharing. Our\u0000approach introduces a dual-stream embedding module that unifies feature\u0000representations from segmentation and classification models, enabling\u0000dimensional integration of these features to guide the classification model. We\u0000validated our method on multiple 3D datasets for Alzheimer's disease diagnosis,\u0000demonstrating significant improvements in classification performance,\u0000especially on small datasets. Furthermore, we extended our pipeline with a\u0000residual temporal attention mechanism for early diagnosis, utilizing images\u0000taken before the atrophy of patients' brain mass. This advancement shows\u0000promise in enabling diagnosis approximately six months earlier in mild and\u0000asymptomatic stages, offering critical time for intervention.","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Conventional radiography is a widely used imaging technology for diagnosing, monitoring, and prognosticating musculoskeletal (MSK) diseases because of its easy availability, versatility, and cost-effectiveness. In conventional radiographs, bone overlaps are prevalent and can impede the accurate assessment of bone characteristics by radiologists or algorithms, posing significant challenges to conventional and computer-aided diagnoses. This work initiated the study of a challenging scenario, bone layer separation in conventional radiographs, in which separating overlapped bone regions enables the independent assessment of the bone characteristics of each bone layer and lays the groundwork for MSK disease diagnosis and its automation. This work proposed a Bone Layer Separation GAN (BLS-GAN) framework that can produce high-quality bone layer images with reasonable bone characteristics and texture. The framework introduces a reconstructor based on conventional radiography imaging principles, which achieves efficient reconstruction and mitigates the recurrent calculations and training instability caused by soft tissue in the overlapped regions. Additionally, pre-training with synthetic images was implemented to enhance the stability of both the training process and the results. The generated images passed a visual Turing test and improved performance in downstream tasks. This work affirms the feasibility of extracting bone layer images from conventional radiographs, which holds promise for leveraging bone layer separation technology to facilitate more comprehensive analytical research in MSK diagnosis, monitoring, and prognosis. Code and dataset will be made available.
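The imaging principle such a reconstructor can build on is sketched below under a simplified Beer-Lambert assumption: attenuation adds across overlapping layers, so re-composing separated layers gives a consistency check against the observed radiograph. This is an illustration, not the authors' reconstructor.

```python
# Sketch: compose per-layer attenuation maps into one radiograph and measure
# how well predicted layers explain the observed image.
import numpy as np

def compose_layers(attenuation_maps, i0: float = 1.0) -> np.ndarray:
    """Compose per-layer attenuation line integrals into one radiograph intensity."""
    total = np.sum(attenuation_maps, axis=0)      # attenuation adds across overlapping layers
    return i0 * np.exp(-total)                    # Beer-Lambert: I = I0 * exp(-sum of mu*t)

def reconstruction_consistency(pred_layers, observed_intensity) -> float:
    """MSE between the re-composed radiograph and the observed one."""
    return float(np.mean((compose_layers(pred_layers) - observed_intensity) ** 2))

layer_a = np.random.rand(64, 64) * 0.5            # toy attenuation map of bone layer A
layer_b = np.random.rand(64, 64) * 0.5            # toy attenuation map of bone layer B
observed = compose_layers([layer_a, layer_b])
print(reconstruction_consistency([layer_a, layer_b], observed))  # ~0 for a perfect separation
```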
{"title":"BLS-GAN: A Deep Layer Separation Framework for Eliminating Bone Overlap in Conventional Radiographs","authors":"Haolin Wang, Yafei Ou, Prasoon Ambalathankandy, Gen Ota, Pengyu Dai, Masayuki Ikebe, Kenji Suzuki, Tamotsu Kamishima","doi":"arxiv-2409.07304","DOIUrl":"https://doi.org/arxiv-2409.07304","url":null,"abstract":"Conventional radiography is the widely used imaging technology in diagnosing,\u0000monitoring, and prognosticating musculoskeletal (MSK) diseases because of its\u0000easy availability, versatility, and cost-effectiveness. In conventional\u0000radiographs, bone overlaps are prevalent, and can impede the accurate\u0000assessment of bone characteristics by radiologists or algorithms, posing\u0000significant challenges to conventional and computer-aided diagnoses. This work\u0000initiated the study of a challenging scenario - bone layer separation in\u0000conventional radiographs, in which separate overlapped bone regions enable the\u0000independent assessment of the bone characteristics of each bone layer and lay\u0000the groundwork for MSK disease diagnosis and its automation. This work proposed\u0000a Bone Layer Separation GAN (BLS-GAN) framework that can produce high-quality\u0000bone layer images with reasonable bone characteristics and texture. This\u0000framework introduced a reconstructor based on conventional radiography imaging\u0000principles, which achieved efficient reconstruction and mitigates the recurrent\u0000calculations and training instability issues caused by soft tissue in the\u0000overlapped regions. Additionally, pre-training with synthetic images was\u0000implemented to enhance the stability of both the training process and the\u0000results. The generated images passed the visual Turing test, and improved\u0000performance in downstream tasks. This work affirms the feasibility of\u0000extracting bone layer images from conventional radiographs, which holds promise\u0000for leveraging bone layer separation technology to facilitate more\u0000comprehensive analytical research in MSK diagnosis, monitoring, and prognosis.\u0000Code and dataset will be made available.","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Somayeh Pakdelmoez (Department of Biomedical Engineering, Amirkabir University of Technology, Tehran, Iran), Saba Omidikia (Department of Biomedical Engineering, Amirkabir University of Technology, Tehran, Iran), Seyyed Ali Seyyedsalehi (Department of Biomedical Engineering, Amirkabir University of Technology, Tehran, Iran), Seyyede Zohreh Seyyedsalehi (Department of Biomedical Engineering, Faculty of Health, Tehran Medical Sciences, Islamic Azad University, Tehran, Iran)
Diabetic retinopathy (DR) is a consequence of diabetes mellitus characterized by vascular damage within the retinal tissue. Timely detection is paramount to mitigate the risk of vision loss. However, training robust grading models is hindered by a shortage of annotated data, particularly for severe cases. This paper proposes a framework for controllably generating high-fidelity and diverse DR fundus images, thereby improving classifier performance in DR grading and detection. We achieve comprehensive control over DR severity and visual features (optic disc, vessel structure, lesion areas) within generated images solely through a conditional StyleGAN, eliminating the need for feature masks or auxiliary networks. Specifically, leveraging the SeFa algorithm to identify meaningful semantics within the latent space, we manipulate the DR images generated conditionally on grades, further enhancing the dataset diversity. Additionally, we propose a novel, effective SeFa-based data augmentation strategy, helping the classifier focus on discriminative regions while ignoring redundant features. Using this approach, a ResNet50 model trained for DR detection achieves 98.09% accuracy, 99.44% specificity, 99.45% precision, and an F1-score of 98.09%. Moreover, incorporating synthetic images generated by conditional StyleGAN into ResNet50 training for DR grading yields 83.33% accuracy, a quadratic kappa score of 87.64%, 95.67% specificity, and 72.24% precision. Extensive experiments conducted on the APTOS 2019 dataset demonstrate the exceptional realism of the generated images and the superior performance of our classifier compared to recent studies.
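A hedged sketch of SeFa-style latent manipulation follows: semantic directions are taken as the top eigenvectors of A^T A for the weight A of the layer that first transforms the latent code, and a latent is moved along them; the random stand-in weight and latent dimension are assumptions, not the trained conditional StyleGAN.

```python
# Sketch: factorise a latent-transform weight to obtain semantic directions,
# then sweep a latent code along the strongest direction.
import torch

def sefa_directions(weight: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Top-k semantic directions for a latent-transform weight of shape (out_dim, in_dim)."""
    eigvals, eigvecs = torch.linalg.eigh(weight.t() @ weight)  # eigenvectors of A^T A, ascending order
    return eigvecs[:, -k:].flip(-1).t()                        # strongest first, shape (k, in_dim)

latent_dim = 512
A = torch.randn(1024, latent_dim)          # stand-in for the layer that first maps the latent code
directions = sefa_directions(A, k=3)

z = torch.randn(1, latent_dim)             # latent code of one generated fundus image
for alpha in (-3.0, 0.0, 3.0):             # sweep along the strongest direction
    z_edit = z + alpha * directions[0]
    # generator(z_edit, dr_grade) would be called here to render and inspect the attribute change
    print(alpha, z_edit.norm().item())
```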
{"title":"Controllable retinal image synthesis using conditional StyleGAN and latent space manipulation for improved diagnosis and grading of diabetic retinopathy","authors":"Somayeh PakdelmoezDepartment of Biomedical Engineering, Amirkabir University of Technology, Tehran, Iran, Saba OmidikiaDepartment of Biomedical Engineering, Amirkabir University of Technology, Tehran, Iran, Seyyed Ali SeyyedsalehiDepartment of Biomedical Engineering, Amirkabir University of Technology, Tehran, Iran, Seyyede Zohreh SeyyedsalehiDepartment of Biomedical Engineering, Faculty of Health, Tehran Medical Sciences, Islamic Azad University, Tehran, Iran","doi":"arxiv-2409.07422","DOIUrl":"https://doi.org/arxiv-2409.07422","url":null,"abstract":"Diabetic retinopathy (DR) is a consequence of diabetes mellitus characterized\u0000by vascular damage within the retinal tissue. Timely detection is paramount to\u0000mitigate the risk of vision loss. However, training robust grading models is\u0000hindered by a shortage of annotated data, particularly for severe cases. This\u0000paper proposes a framework for controllably generating high-fidelity and\u0000diverse DR fundus images, thereby improving classifier performance in DR\u0000grading and detection. We achieve comprehensive control over DR severity and\u0000visual features (optic disc, vessel structure, lesion areas) within generated\u0000images solely through a conditional StyleGAN, eliminating the need for feature\u0000masks or auxiliary networks. Specifically, leveraging the SeFa algorithm to\u0000identify meaningful semantics within the latent space, we manipulate the DR\u0000images generated conditionally on grades, further enhancing the dataset\u0000diversity. Additionally, we propose a novel, effective SeFa-based data\u0000augmentation strategy, helping the classifier focus on discriminative regions\u0000while ignoring redundant features. Using this approach, a ResNet50 model\u0000trained for DR detection achieves 98.09% accuracy, 99.44% specificity, 99.45%\u0000precision, and an F1-score of 98.09%. Moreover, incorporating synthetic images\u0000generated by conditional StyleGAN into ResNet50 training for DR grading yields\u000083.33% accuracy, a quadratic kappa score of 87.64%, 95.67% specificity, and\u000072.24% precision. Extensive experiments conducted on the APTOS 2019 dataset\u0000demonstrate the exceptional realism of the generated images and the superior\u0000performance of our classifier compared to recent studies.","PeriodicalId":501289,"journal":{"name":"arXiv - EE - Image and Video Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142200245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}