{"title":"用于语义分割的混合层语义提取和多尺度聚合转换器","authors":"Tianping Li, Xiaolong Yang, Zhenyi Zhang, Zhaotong Cui, Zhou Maoxia","doi":"10.1007/s40747-024-01650-6","DOIUrl":null,"url":null,"abstract":"<p>Recently, a number of vision transformer models for semantic segmentation have been proposed, with the majority of these achieving impressive results. However, they lack the ability to exploit the intrinsic position and channel features of the image and are less capable of multi-scale feature fusion. This paper presents a semantic segmentation method that successfully combines attention and multiscale representation, thereby enhancing performance and efficiency. This represents a significant advancement in the field. Multi-layers semantic extraction and multi-scale aggregation transformer decoder (MEMAFormer) is proposed, which consists of two components: mix-layers dual channel semantic extraction module (MDCE) and semantic aggregation pyramid pooling module (SAPPM). The MDCE incorporates a multi-layers cross attention module (MCAM) and an efficient channel attention module (ECAM). In MCAM, horizontal connections between encoder and decoder stages are employed as feature queries for the attention module. The hierarchical feature maps derived from different encoder and decoder stages are integrated into key and value. To address long-term dependencies, ECAM selectively emphasizes interdependent channel feature maps by integrating relevant features across all channels. The adaptability of the feature maps is reduced by pyramid pooling, which reduces the amount of computation without compromising performance. SAPPM is comprised of several distinct pooled kernels that extract context with a deeper flow of information, forming a multi-scale feature by integrating various feature sizes. The MEMAFormer-B0 model demonstrates superior performance compared to SegFormer-B0, exhibiting gains of 4.8%, 4.0% and 3.5% on the ADE20K, Cityscapes and COCO-stuff datasets, respectively.</p>","PeriodicalId":10524,"journal":{"name":"Complex & Intelligent Systems","volume":null,"pages":null},"PeriodicalIF":5.0000,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Mix-layers semantic extraction and multi-scale aggregation transformer for semantic segmentation\",\"authors\":\"Tianping Li, Xiaolong Yang, Zhenyi Zhang, Zhaotong Cui, Zhou Maoxia\",\"doi\":\"10.1007/s40747-024-01650-6\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Recently, a number of vision transformer models for semantic segmentation have been proposed, with the majority of these achieving impressive results. However, they lack the ability to exploit the intrinsic position and channel features of the image and are less capable of multi-scale feature fusion. This paper presents a semantic segmentation method that successfully combines attention and multiscale representation, thereby enhancing performance and efficiency. This represents a significant advancement in the field. Multi-layers semantic extraction and multi-scale aggregation transformer decoder (MEMAFormer) is proposed, which consists of two components: mix-layers dual channel semantic extraction module (MDCE) and semantic aggregation pyramid pooling module (SAPPM). The MDCE incorporates a multi-layers cross attention module (MCAM) and an efficient channel attention module (ECAM). 
In MCAM, horizontal connections between encoder and decoder stages are employed as feature queries for the attention module. The hierarchical feature maps derived from different encoder and decoder stages are integrated into key and value. To address long-term dependencies, ECAM selectively emphasizes interdependent channel feature maps by integrating relevant features across all channels. The adaptability of the feature maps is reduced by pyramid pooling, which reduces the amount of computation without compromising performance. SAPPM is comprised of several distinct pooled kernels that extract context with a deeper flow of information, forming a multi-scale feature by integrating various feature sizes. The MEMAFormer-B0 model demonstrates superior performance compared to SegFormer-B0, exhibiting gains of 4.8%, 4.0% and 3.5% on the ADE20K, Cityscapes and COCO-stuff datasets, respectively.</p>\",\"PeriodicalId\":10524,\"journal\":{\"name\":\"Complex & Intelligent Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":5.0000,\"publicationDate\":\"2024-11-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Complex & Intelligent Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s40747-024-01650-6\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Complex & Intelligent Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s40747-024-01650-6","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Mix-layers semantic extraction and multi-scale aggregation transformer for semantic segmentation
Abstract:
Recently, numerous vision transformer models for semantic segmentation have been proposed, most achieving impressive results. However, they fail to exploit the intrinsic positional and channel features of the image and offer limited multi-scale feature fusion. This paper presents a semantic segmentation method that combines attention with multi-scale representation, improving both performance and efficiency. A multi-layer semantic extraction and multi-scale aggregation transformer decoder (MEMAFormer) is proposed, consisting of two components: a mix-layers dual-channel semantic extraction module (MDCE) and a semantic aggregation pyramid pooling module (SAPPM). The MDCE incorporates a multi-layers cross-attention module (MCAM) and an efficient channel attention module (ECAM). In MCAM, horizontal connections between encoder and decoder stages serve as feature queries for the attention module, while the hierarchical feature maps derived from different encoder and decoder stages are fused into the keys and values. To capture long-range dependencies, ECAM selectively emphasizes interdependent channel feature maps by integrating relevant features across all channels. Pyramid pooling adaptively downsamples the feature maps, which reduces computation without compromising performance. SAPPM comprises several distinct pooling kernels that extract context with a deeper information flow, forming multi-scale features by integrating various feature sizes. The MEMAFormer-B0 model outperforms SegFormer-B0 by 4.8%, 4.0%, and 3.5% on the ADE20K, Cityscapes, and COCO-Stuff datasets, respectively.
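The abstract names three reusable mechanisms: cross attention whose queries come from encoder-decoder horizontal (skip) connections (MCAM), efficient channel attention (ECAM), and multi-kernel pyramid pooling (SAPPM). The sketch below shows how such modules are commonly built in PyTorch. It is an illustration under assumptions, not the authors' released code: all module names, tensor shapes, and hyper-parameters are invented for the example, and the channel-attention block is modeled on the well-known ECA-Net design rather than the paper's exact ECAM.

```python
# Minimal, illustrative sketch of the three ideas in the abstract.
# NOT the authors' code: names, shapes, and hyper-parameters are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """MCAM-style cross attention: the skip connection supplies the
    queries; hierarchical features supply the keys and values."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, skip_feat, hier_feat):
        # skip_feat, hier_feat: (B, N, C) token sequences (assumed layout)
        out, _ = self.attn(query=skip_feat, key=hier_feat, value=hier_feat)
        return out + skip_feat  # residual connection (assumed)

class EfficientChannelAttention(nn.Module):
    """ECAM-style channel attention, modeled on ECA-Net: a 1-D conv over
    globally pooled channel descriptors re-weights each channel."""
    def __init__(self, k_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size,
                              padding=k_size // 2, bias=False)

    def forward(self, x):
        # x: (B, C, H, W)
        y = F.adaptive_avg_pool2d(x, 1)               # (B, C, 1, 1)
        y = self.conv(y.squeeze(-1).transpose(1, 2))  # conv across channels
        y = torch.sigmoid(y.transpose(1, 2).unsqueeze(-1))
        return x * y                                  # channel re-weighting

class PyramidPooling(nn.Module):
    """SAPPM-style pyramid pooling: several pooling kernels extract
    context at different scales, then everything is fused to one map."""
    def __init__(self, in_ch, out_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1, bias=False))
            for b in bins])
        self.fuse = nn.Conv2d(in_ch + len(bins) * out_ch, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [x] + [F.interpolate(s(x), size=(h, w), mode="bilinear",
                                     align_corners=False)
                       for s in self.stages]
        return self.fuse(torch.cat(feats, dim=1))  # multi-scale fusion
```

As a quick sanity check under these assumptions, `PyramidPooling(512, 128)(torch.randn(1, 512, 32, 32))` returns a `(1, 128, 32, 32)` context map: each pooling branch summarizes the input at a different grid size before fusion.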
Journal introduction:
Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools, and techniques aimed at cross-fertilizing the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research on which the journal focuses will expand the boundaries of our understanding by investigating the principles and processes underlying many of the most profound problems facing society today.