Title: A novel key point based ROI segmentation and image captioning using guidance information
Authors: Jothi Lakshmi Selvakani, Bhuvaneshwari Ranganathan, Geetha Palanisamy
Journal: Machine Vision and Applications (Impact Factor 2.4; JCR Q3, Computer Science, Artificial Intelligence)
Publication date: 2024-09-12
DOI: https://doi.org/10.1007/s00138-024-01597-1
Citations: 0
Abstract
Recently, image captioning has become an intriguing task that has attracted many researchers. This paper proposes a novel keypoint-based segmentation algorithm for extracting regions of interest (ROI) and an image captioning model guided by this information to generate more accurate image captions. The Difference of Gaussian (DoG) is used to identify keypoints. A novel ROI segmentation algorithm then utilizes these keypoints to extract the ROI. Features of the ROI are extracted, and the text features of related images are merged into a common semantic space using canonical correlation analysis (CCA) to produce the guiding information. The text features are constructed using a Bag of Words (BoW) model. Based on the guiding information and the entire image features, an LSTM generates a caption for the image. The guiding information helps the LSTM focus on important semantic regions in the image to generate the most significant keywords in the image caption. Experiments on the Flickr8k dataset show that the proposed ROI segmentation algorithm accurately identifies the ROI, and the image captioning model with the guidance information outperforms state-of-the-art methods.
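The keypoint step mentioned in the abstract can be illustrated with a minimal single-scale Difference of Gaussian sketch. This is not the authors' implementation: the function names (`gaussian_blur`, `dog_keypoints`, `bounding_box`), the parameter values (`sigma`, `k`, `threshold`), and the use of a plain bounding box as a stand-in for the ROI are all assumptions made for illustration. The paper's actual ROI segmentation algorithm is not described in the abstract, and a full DoG detector would search for extrema across several scales rather than one.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge-replicating padding (same output shape)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    padded = np.pad(img, radius, mode="edge")
    # Blur along rows first, then along columns.
    rows = np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode="valid"), 0, rows)

def dog_keypoints(img, sigma=1.6, k=1.6, threshold=0.02):
    """Return (row, col) positions of strong local extrema of a single DoG layer.

    Assumption: one scale only; sigma, k and threshold are illustrative values.
    """
    dog = gaussian_blur(img, sigma) - gaussian_blur(img, k * sigma)
    height, width = dog.shape
    padded = np.pad(dog, 1, mode="edge")
    # Stack the 8 neighbours of every pixel to test for strict local extrema.
    neighbours = np.stack([
        padded[dy:dy + height, dx:dx + width]
        for dy in range(3) for dx in range(3) if (dy, dx) != (1, 1)
    ])
    is_max = dog > neighbours.max(axis=0)
    is_min = dog < neighbours.min(axis=0)
    strong = np.abs(dog) > threshold
    ys, xs = np.nonzero((is_max | is_min) & strong)
    return np.column_stack([ys, xs])

def bounding_box(points):
    """Hypothetical ROI stand-in: the tightest box around all keypoints."""
    y0, x0 = points.min(axis=0)
    y1, x1 = points.max(axis=0)
    return int(y0), int(x0), int(y1), int(x1)

# Usage on a synthetic image containing one bright square.
image = np.zeros((64, 64))
image[18:23, 38:43] = 1.0
keypoints = dog_keypoints(image)
print("keypoints found:", len(keypoints))
print("ROI box (y0, x0, y1, x1):", bounding_box(keypoints))
```

On a real photograph the keypoints would cluster on textured or high-contrast structures, and the crude bounding box would normally be replaced by a segmentation step that groups keypoints into one or more regions before feature extraction.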
About the journal:
Machine Vision and Applications publishes high-quality technical contributions in machine vision research and development. Specifically, the editors encourage submissions in all applications and engineering aspects of image-related computing. In particular, original contributions dealing with scientific, commercial, industrial, military, and biomedical applications of machine vision are all within the scope of the journal.
Particular emphasis is placed on engineering and technology aspects of image processing and computer vision.
The following aspects of machine vision applications are of interest: algorithms, architectures, VLSI implementations, AI techniques and expert systems for machine vision, front-end sensing, multidimensional and multisensor machine vision, real-time techniques, image databases, virtual reality and visualization. Papers must include a significant experimental validation component.