Wei Zhang;Miaoxin Cai;Tong Zhang;Guoqiang Lei;Yin Zhuang;Xuerui Mao
{"title":"大力水手从遥感图像中进行多源船舶探测的统一视觉语言模型","authors":"Wei Zhang;Miaoxin Cai;Tong Zhang;Guoqiang Lei;Yin Zhuang;Xuerui Mao","doi":"10.1109/JSTARS.2024.3488034","DOIUrl":null,"url":null,"abstract":"Ship detection needs to identify ship locations from remote sensing scenes. Due to different imaging payloads, various appearances of ships, and complicated background interference from the bird's eye view, it is difficult to setup a unified paradigm for achieving multisource ship detection. To address this challenge, in this article, leveraging the large language models powerful generalization ability, a unified visual-language model called Popeye is proposed for multisource ship detection from RS imagery. Specifically, to bridge the interpretation gap across the multisource images for ship detection, a novel unified labeling paradigm is designed to integrate different visual modalities and the various ship detection ways, i.e., horizontal bounding box and oriented bounding box. Subsequently, the hybrid experts encoder is designed to refine multiscale visual features, thereby enhancing visual perception. Then, a visual-language alignment method is developed for Popeye to enhance interactive comprehension ability between visual and language content. Furthermore, an instruction adaption mechanism is proposed for transferring the pretrained visual-language knowledge from the nature scene into the RS domain for multisource ship detection. In addition, the segment anything model is also seamlessly integrated into the proposed Popeye to achieve pixel-level ship segmentation without additional training costs. Finally, extensive experiments are conducted on the newly constructed ship instruction dataset named MMShip, and the results indicate that the proposed Popeye outperforms current specialist, open-vocabulary, and other visual-language models in zero-shot multisource various ship detection tasks.","PeriodicalId":13116,"journal":{"name":"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing","volume":"17 ","pages":"20050-20063"},"PeriodicalIF":4.7000,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10738390","citationCount":"0","resultStr":"{\"title\":\"Popeye: A Unified Visual-Language Model for Multisource Ship Detection From Remote Sensing Imagery\",\"authors\":\"Wei Zhang;Miaoxin Cai;Tong Zhang;Guoqiang Lei;Yin Zhuang;Xuerui Mao\",\"doi\":\"10.1109/JSTARS.2024.3488034\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Ship detection needs to identify ship locations from remote sensing scenes. Due to different imaging payloads, various appearances of ships, and complicated background interference from the bird's eye view, it is difficult to setup a unified paradigm for achieving multisource ship detection. To address this challenge, in this article, leveraging the large language models powerful generalization ability, a unified visual-language model called Popeye is proposed for multisource ship detection from RS imagery. Specifically, to bridge the interpretation gap across the multisource images for ship detection, a novel unified labeling paradigm is designed to integrate different visual modalities and the various ship detection ways, i.e., horizontal bounding box and oriented bounding box. Subsequently, the hybrid experts encoder is designed to refine multiscale visual features, thereby enhancing visual perception. Then, a visual-language alignment method is developed for Popeye to enhance interactive comprehension ability between visual and language content. Furthermore, an instruction adaption mechanism is proposed for transferring the pretrained visual-language knowledge from the nature scene into the RS domain for multisource ship detection. In addition, the segment anything model is also seamlessly integrated into the proposed Popeye to achieve pixel-level ship segmentation without additional training costs. Finally, extensive experiments are conducted on the newly constructed ship instruction dataset named MMShip, and the results indicate that the proposed Popeye outperforms current specialist, open-vocabulary, and other visual-language models in zero-shot multisource various ship detection tasks.\",\"PeriodicalId\":13116,\"journal\":{\"name\":\"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing\",\"volume\":\"17 \",\"pages\":\"20050-20063\"},\"PeriodicalIF\":4.7000,\"publicationDate\":\"2024-10-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10738390\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10738390/\",\"RegionNum\":2,\"RegionCategory\":\"地球科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10738390/","RegionNum":2,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Popeye: A Unified Visual-Language Model for Multisource Ship Detection From Remote Sensing Imagery
Ship detection needs to identify ship locations from remote sensing scenes. Due to different imaging payloads, various appearances of ships, and complicated background interference from the bird's eye view, it is difficult to setup a unified paradigm for achieving multisource ship detection. To address this challenge, in this article, leveraging the large language models powerful generalization ability, a unified visual-language model called Popeye is proposed for multisource ship detection from RS imagery. Specifically, to bridge the interpretation gap across the multisource images for ship detection, a novel unified labeling paradigm is designed to integrate different visual modalities and the various ship detection ways, i.e., horizontal bounding box and oriented bounding box. Subsequently, the hybrid experts encoder is designed to refine multiscale visual features, thereby enhancing visual perception. Then, a visual-language alignment method is developed for Popeye to enhance interactive comprehension ability between visual and language content. Furthermore, an instruction adaption mechanism is proposed for transferring the pretrained visual-language knowledge from the nature scene into the RS domain for multisource ship detection. In addition, the segment anything model is also seamlessly integrated into the proposed Popeye to achieve pixel-level ship segmentation without additional training costs. Finally, extensive experiments are conducted on the newly constructed ship instruction dataset named MMShip, and the results indicate that the proposed Popeye outperforms current specialist, open-vocabulary, and other visual-language models in zero-shot multisource various ship detection tasks.
期刊介绍:
The IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing addresses the growing field of applications in Earth observations and remote sensing, and also provides a venue for the rapidly expanding special issues that are being sponsored by the IEEE Geosciences and Remote Sensing Society. The journal draws upon the experience of the highly successful “IEEE Transactions on Geoscience and Remote Sensing” and provide a complementary medium for the wide range of topics in applied earth observations. The ‘Applications’ areas encompasses the societal benefit areas of the Global Earth Observations Systems of Systems (GEOSS) program. Through deliberations over two years, ministers from 50 countries agreed to identify nine areas where Earth observation could positively impact the quality of life and health of their respective countries. Some of these are areas not traditionally addressed in the IEEE context. These include biodiversity, health and climate. Yet it is the skill sets of IEEE members, in areas such as observations, communications, computers, signal processing, standards and ocean engineering, that form the technical underpinnings of GEOSS. Thus, the Journal attracts a broad range of interests that serves both present members in new ways and expands the IEEE visibility into new areas.