Title: Multi-scale convolutional attention frequency-enhanced transformer network for medical image segmentation
Authors: Shun Yan, Benquan Yang, Aihua Chen, Xiaoming Zhao, Shiqing Zhang
Journal: Information Fusion, volume 119, Article 103019
Publication date: 2025-02-12
Publication type: Journal Article
DOI: 10.1016/j.inffus.2025.103019
URL: https://www.sciencedirect.com/science/article/pii/S1566253525000922
Impact factor: 14.7
JCR: Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Region: 1 (Computer Science)
Open access: no
Citations: 0
Abstract
Automatic segmentation of medical images plays a crucial role in assisting doctors with diagnosis and treatment planning. Multi-scale vision transformers have become a powerful tool for this task. However, their overly aggressive self-attention design leads to issues such as insufficient local feature extraction and a lack of detailed feature information. To address these problems, this study proposes the Multi-Scale Convolutional Attention Frequency-Enhanced Transformer Network (MCAFT), which includes Multi-Scale Convolutional Attention Frequency-Enhanced Transformer Modules (MCAFTM) and Multi-Scale Progressive Gate-Spatial Attention (MSGA). MCAFTM employs channel and spatial attention mechanisms, which are highly effective in capturing complex spatial relationships while focusing on prominent regions. Additionally, it applies the Discrete Wavelet Transform (DWT) to decompose input feature maps into sub-bands: a low-frequency sub-band (LL), which captures overall structural information, and high-frequency sub-bands (LH, HL, HH), which retain fine-grained details such as edges and textures. Subsequently, an efficient transformer and a reverse attention mechanism are employed to enhance contextual attention and boundary information. The proposed MSGA enhances multi-scale context, adaptively modeling inter-scale dependencies to bridge the semantic gap between encoder and decoder modules. Extensive experiments are conducted on several representative medical image segmentation tasks, including Synapse abdominal multi-organ, cardiac organ, and polyp lesion segmentation. The proposed MCAFTM achieves DICE scores of 83.87 and 92.32 for Synapse abdominal multi-organ and cardiac organ segmentation, respectively. For five polyp datasets (ClinicDB, Kvasir, ColonDB, ETIS, CVC-T), MCAFTM obtains DICE scores of 94.49, 92.62, 81.07, 78.68, and 88.91, respectively. These results demonstrate that both MCAFTM and MSGA are effective architectures.
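The abstract names two concrete computations: a DWT that splits a feature map into one low-frequency sub-band (LL) and three high-frequency sub-bands (LH, HL, HH), and the DICE score used for evaluation. The following is a minimal illustrative sketch of both in plain Python, not the authors' implementation. It assumes a single-level orthonormal Haar wavelet (the abstract does not say which wavelet is used), a single-channel input stored as a nested list, and binary masks for DICE. The function names `haar_dwt2` and `dice_score` are hypothetical, and the LH/HL naming convention varies between references.

```python
def haar_dwt2(x):
    """Single-level 2D Haar DWT of a 2D list with even height and width.

    Returns (LL, LH, HL, HH); each sub-band is half the input size.
    Assumption: orthonormal Haar filters, so each 2x2 block is combined
    with weights of +/- 1/2.
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    LL, LH, HL, HH = [], [], [], []
    for i in range(0, h, 2):
        ll, lh, hl, hh = [], [], [], []
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]          # top row of the 2x2 block
            c, d = x[i + 1][j], x[i + 1][j + 1]  # bottom row of the block
            ll.append((a + b + c + d) / 2)  # block average: coarse structure
            lh.append((a + b - c - d) / 2)  # top minus bottom: horizontal edges
            hl.append((a - b + c - d) / 2)  # left minus right: vertical edges
            hh.append((a - b - c + d) / 2)  # diagonal detail
        LL.append(ll)
        LH.append(lh)
        HL.append(hl)
        HH.append(hh)
    return LL, LH, HL, HH


def dice_score(pred, target):
    """Dice coefficient 2|A n B| / (|A| + |B|) for binary 2D masks (0/1)."""
    inter = sum(p * t for pr, tr in zip(pred, target) for p, t in zip(pr, tr))
    total = sum(map(sum, pred)) + sum(map(sum, target))
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2 * inter / total


# A single 2x2 block: LL = 5.0, LH = -2.0, HL = -1.0, HH = 0.0
LL, LH, HL, HH = haar_dwt2([[1, 2], [3, 4]])

# One overlapping pixel, mask sizes 2 and 1: Dice = 2 * 1 / 3
score = dice_score([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

The reported DICE values (for example 83.87) appear to be this coefficient expressed as a percentage. In practice a library such as PyWavelets would typically replace the hand-written transform.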
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses as well as those demonstrating their application to real-world problems will be welcome.